Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
Main category: cs.SD
TL;DR: DreamAudio enables customized text-to-audio generation by learning auditory concepts from reference audio samples, allowing control over fine-grained acoustic characteristics while maintaining semantic alignment with text prompts.
Details
Motivation: Existing text-to-audio models generate semantically aligned sound but lack control over fine-grained acoustic characteristics, making it difficult for users to generate specific desired audio content with personalized audio events.
Method: Introduces a framework that enables models to identify auditory information from user-provided reference concepts. Given a few reference audio samples containing personalized audio events, the system can generate new audio samples with these specific events. Two types of datasets are developed for training and testing.
Result: DreamAudio generates audio samples highly consistent with customized audio features and well-aligned with input text prompts. It offers comparable performance in general text-to-audio tasks and provides a human-involved dataset with real-world customized text-to-audio (CTTA) cases as a benchmark.
Conclusion: DreamAudio successfully addresses the limitation of existing text-to-audio models by enabling customized generation with fine-grained acoustic control, while maintaining good performance on general text-to-audio tasks.
Abstract: With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of controlling fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it difficult to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the proposed systems. The experiments show that DreamAudio generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.
Relevance: 9/10
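The conditioning idea above can be sketched loosely: distill a few reference clips into a "concept" embedding and combine it with the text-prompt embedding before generation. Everything here (function names, averaging, concatenation) is an illustrative assumption, not DreamAudio's actual architecture.

```python
# Loose sketch of reference-conditioned text-to-audio generation
# (our simplification, not the paper's method). All functions stand in
# for learned components.

def concept_embedding(reference_clip_embs):
    # Average a few reference-clip embeddings into one
    # "personalized event" vector.
    dim = len(reference_clip_embs[0])
    n = len(reference_clip_embs)
    return [sum(e[d] for e in reference_clip_embs) / n for d in range(dim)]

def condition(text_emb, reference_clip_embs):
    # Concatenate text and concept embeddings as the generator's input,
    # so output follows the prompt while carrying the reference event.
    return text_emb + concept_embedding(reference_clip_embs)

cond = condition([0.1, 0.2], [[1.0, 0.0], [0.0, 1.0]])
print(cond)  # -> [0.1, 0.2, 0.5, 0.5]
```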
[2] Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio
Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro
Main category: cs.CV
TL;DR: Unified multimodal framework for sign language, lip movements, and audio processing to generate spoken-language text, achieving SOTA performance across multiple tasks.
Details
Motivation: Audio-centric ASR systems exclude deaf/hard-of-hearing individuals. While sign language and lip reading offer alternatives, they've been studied in isolation without unified integration. Need a framework that combines these modalities for inclusive communication.
Method: Proposes first unified framework handling sign language, lip movements, and audio combinations for spoken-language text generation. Features modality-agnostic architecture processing heterogeneous inputs, explores modality synergy (especially lip movements as non-manual cues in sign language), and aims for SOTA performance.
Result: Achieves performance on par with or better than task-specific SOTA models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Key finding: explicitly modeling lip movements as distinct modality significantly improves SLT performance by capturing critical non-manual cues.
Conclusion: Unified multimodal framework successfully integrates diverse communication modalities, enabling inclusive systems and revealing important linguistic insights about modality interactions.
Abstract: Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such audio-centric systems inherently exclude individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Furthermore, our analysis reveals a key linguistic insight: explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.
Relevance: 9/10
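A modality-agnostic front end of the kind the framework describes can be sketched minimally: each available modality (any subset of sign, lip, audio) is encoded, projected into a shared token space, and concatenated for a single decoder. Shapes, names, and the linear projection are placeholders, not the paper's architecture.

```python
# Minimal sketch of fusing heterogeneous modality streams into one
# shared token sequence (illustrative only).

def project(tokens, weight):
    # Linear projection of each token vector into the shared space.
    return [[sum(t[i] * weight[i][j] for i in range(len(t)))
             for j in range(len(weight[0]))] for t in tokens]

def fuse_modalities(inputs, projections):
    """inputs: dict modality -> list of feature vectors; any subset of
    sign / lip / audio may be present."""
    fused = []
    for name, tokens in inputs.items():
        fused.extend(project(tokens, projections[name]))
    return fused  # one sequence for a single spoken-language decoder

identity2 = [[1.0, 0.0], [0.0, 1.0]]
seq = fuse_modalities({"lip": [[1.0, 2.0]], "audio": [[3.0, 4.0], [5.0, 6.0]]},
                      {"lip": identity2, "audio": identity2})
print(len(seq))  # -> 3 shared-space tokens
```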
[3] Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy
Geewook Kim, Minjoon Seo
Main category: cs.CV
TL;DR: The paper audits video benchmarks and finds they’re largely solvable from vision alone, then enhances LLaVA-OneVision with speech/audio encoders to show audio actually matters for speech comprehension and cross-modal grounding tasks.
Details
Motivation: Current video understanding benchmarks don't properly measure audio-visual reasoning since they can be largely solved with visual cues alone, leading to underestimation of audio's importance in multimodal understanding.
Method: Audited 10 video benchmarks, found ~76% of AVQA solvable from single frames; enhanced LLaVA-OneVision with speech/audio encoder, compared 5 compressor architectures with 25x token reduction (25Hz to 1Hz).
Result: Audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric benchmarks remain largely unaffected; speech encoders play larger role than current benchmarks suggest.
Conclusion: Video understanding benchmarks need better audio-visual reasoning measurement; speech/audio encoders are crucial for true multimodal understanding, especially for speech comprehension and cross-modal grounding tasks.
Abstract: Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines – not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers ~76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25x token reduction (25 Hz to 1 Hz). Across 10 benchmarks – with and without filtering – audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
Relevance: 9/10
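The 25x token-rate reduction (25 Hz to 1 Hz) can be illustrated with the simplest possible compressor: mean pooling over windows of 25 audio-feature frames. The paper compares five compressor architectures; pooling is just one plausible baseline, shown here for concreteness.

```python
# Illustrative compressor: average 25 frames of audio features (25 Hz)
# into one token (1 Hz). Not one of the paper's five architectures,
# just a concrete baseline.

def mean_pool_compress(frames, factor=25):
    """frames: list of feature vectors at 25 Hz; returns ~1 Hz tokens."""
    tokens = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]
        dim = len(window[0])
        tokens.append([sum(f[d] for f in window) / len(window)
                       for d in range(dim)])
    return tokens

# 50 frames at 25 Hz (2 seconds of audio) become 2 tokens at 1 Hz.
frames = [[float(i)] for i in range(50)]
print(len(mean_pool_compress(frames)))  # -> 2
```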
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 79]
- cs.CV [Total: 246]
- cs.AI [Total: 108]
- cs.SD [Total: 7]
- cs.LG [Total: 172]
- cs.MA [Total: 3]
- cs.MM [Total: 3]
- eess.AS [Total: 6]
- eess.IV [Total: 10]
cs.CL
[1] Evaluating Prompting Strategies for Chart Question Answering with Large Language Models
Ruthuparna Naikar, Ying Zhu
Main category: cs.CL
TL;DR: Systematic evaluation of prompting strategies (Zero-Shot, Few-Shot, Zero-Shot CoT, Few-Shot CoT) on chart-based QA using GPT models, showing Few-Shot Chain-of-Thought performs best for reasoning tasks.
Details
Motivation: Prompting strategies significantly impact LLM reasoning performance, but their role in chart-based question answering remains underexplored, creating a need for systematic evaluation to guide strategy selection.
Method: Evaluated four prompting paradigms (Zero-Shot, Few-Shot, Zero-Shot Chain-of-Thought, Few-Shot Chain-of-Thought) across GPT-3.5, GPT-4, and GPT-4o on ChartQA dataset using structured chart data only, with 1,200 diverse samples and two metrics: Accuracy and Exact Match.
Result: Few-Shot Chain-of-Thought prompting consistently yielded highest accuracy (up to 78.2%), especially on reasoning-intensive questions, while Few-Shot improved format adherence. Zero-Shot performed well only with high-capacity models on simpler tasks.
Conclusion: Provides actionable guidance for selecting prompting strategies in structured data reasoning tasks, with implications for efficiency and accuracy in real-world applications, showing Few-Shot CoT as optimal for complex reasoning.
Abstract: Prompting strategies affect LLM reasoning performance, but their role in chart-based QA remains underexplored. We present a systematic evaluation of four widely used prompting paradigms (Zero-Shot, Few-Shot, Zero-Shot Chain-of-Thought, and Few-Shot Chain-of-Thought) across GPT-3.5, GPT-4, and GPT-4o on the ChartQA dataset. Our framework operates exclusively on structured chart data, isolating prompt structure as the only experimental variable, and evaluates performance using two metrics: Accuracy and Exact Match. Results from 1,200 diverse ChartQA samples show that Few-Shot Chain-of-Thought prompting consistently yields the highest accuracy (up to 78.2%), particularly on reasoning-intensive questions, while Few-Shot prompting improves format adherence. Zero-Shot performs well only with high-capacity models on simpler tasks. These findings provide actionable guidance for selecting prompting strategies in structured data reasoning tasks, with implications for both efficiency and accuracy in real-world applications.
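The four paradigms differ only in two switches: whether worked examples are prepended (few-shot) and whether a reasoning trigger is appended (CoT). A minimal sketch, assuming a chart serialized as structured text; the exact templates in the paper may differ.

```python
# Illustrative prompt builder covering all four paradigms:
# Zero-Shot, Few-Shot, Zero-Shot CoT, Few-Shot CoT.

def build_prompt(chart_data, question, few_shot=False, cot=False, examples=()):
    parts = []
    if few_shot:
        # Prepend worked (chart, question, answer) demonstrations.
        for ex_chart, ex_q, ex_a in examples:
            parts.append(f"Chart: {ex_chart}\nQ: {ex_q}\nA: {ex_a}")
    parts.append(f"Chart: {chart_data}\nQ: {question}")
    if cot:
        # Chain-of-Thought trigger phrase.
        parts.append("Let's think step by step.")
    return "\n\n".join(parts)

prompt = build_prompt("year,sales\n2020,10\n2021,30",
                      "Which year had higher sales?",
                      few_shot=True, cot=True,
                      examples=[("a,b\n1,2", "Max of b?", "2")])
print("step by step" in prompt)  # -> True
```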
[2] MERIT: Memory-Enhanced Retrieval for Interpretable Knowledge Tracing
Runze Li, Kedi Chen, Guwei Feng, Mo Yu, Jun Wang, Wei Zhang
Main category: cs.CL
TL;DR: MERIT is a training-free framework for interpretable knowledge tracing that combines frozen LLMs with structured pedagogical memory, achieving SOTA performance without fine-tuning.
Details
Motivation: Traditional deep learning KT models lack interpretability, while LLM-based methods require expensive fine-tuning and suffer from limited context windows and hallucinations. There's a need for scalable, interpretable KT that doesn't require parameter updates.
Method: MERIT transforms raw interaction logs into an interpretable memory bank using semantic denoising to categorize students into latent cognitive schemas. It constructs a paradigm bank with representative error patterns analyzed offline to generate explicit CoT rationales. During inference, hierarchical routing retrieves relevant contexts, and a logic-augmented module applies semantic constraints to calibrate predictions.
Result: Achieves state-of-the-art performance on real-world datasets without gradient updates, reduces computational costs, and supports dynamic knowledge updates while improving accessibility and transparency of educational diagnosis.
Conclusion: MERIT demonstrates that training-free frameworks combining frozen LLMs with structured memory can achieve high performance and interpretability in knowledge tracing, offering a scalable and transparent alternative to traditional fine-tuning approaches.
Abstract: Knowledge Tracing (KT) models students’ evolving knowledge states to predict future performance, serving as a foundation for personalized education. While traditional deep learning models achieve high accuracy, they often lack interpretability. Large Language Models (LLMs) offer strong reasoning capabilities but struggle with limited context windows and hallucinations. Furthermore, existing LLM-based methods typically require expensive fine-tuning, limiting scalability and adaptability to new data. We propose MERIT (Memory-Enhanced Retrieval for Interpretable Knowledge Tracing), a training-free framework combining frozen LLM reasoning with structured pedagogical memory. Rather than updating parameters, MERIT transforms raw interaction logs into an interpretable memory bank. The framework uses semantic denoising to categorize students into latent cognitive schemas and constructs a paradigm bank where representative error patterns are analyzed offline to generate explicit Chain-of-Thought (CoT) rationales. During inference, a hierarchical routing mechanism retrieves relevant contexts, while a logic-augmented module applies semantic constraints to calibrate predictions. By grounding the LLM in interpretable memory, MERIT achieves state-of-the-art performance on real-world datasets without gradient updates. This approach reduces computational costs and supports dynamic knowledge updates, improving the accessibility and transparency of educational diagnosis.
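The hierarchical routing step can be sketched as a two-level retrieval: first route the student to a latent cognitive schema, then retrieve the closest error paradigm (with its offline CoT rationale) within that schema. The structure, similarity function, and bank contents below are simplified placeholders, not MERIT's actual components.

```python
# Two-level retrieval sketch of hierarchical routing over a structured
# memory bank (our simplification of MERIT's mechanism).

def route_then_retrieve(student_features, memory_bank, similarity):
    # Level 1: pick the schema whose centroid best matches the student.
    schema = max(memory_bank,
                 key=lambda s: similarity(student_features, s["centroid"]))
    # Level 2: retrieve the closest error paradigm and its rationale.
    return max(schema["paradigms"],
               key=lambda p: similarity(student_features, p["pattern"]))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

bank = [
    {"centroid": [1, 0], "paradigms": [
        {"pattern": [1, 0], "rationale": "confuses sign rules"}]},
    {"centroid": [0, 1], "paradigms": [
        {"pattern": [0, 1], "rationale": "forgets to carry"}]},
]
hit = route_then_retrieve([0.9, 0.1], bank, dot)
print(hit["rationale"])  # -> confuses sign rules
```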
[3] Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
Zaruhi Navasardyan, Spartak Bughdaryan, Bagrat Minasyan, Hrant Davtyan
Main category: cs.CL
TL;DR: Fine-tuning multilingual text embedding models on small amounts of noisy synthetic translation data yields surprisingly good performance for low-resource languages, challenging the need for large-scale datasets.
Details
Motivation: Low-resource languages lack high-quality datasets for training text embedding models, hindering applications like retrieval-augmented generation and semantic search. The paper challenges the assumption that effective semantic alignment requires massive datasets or pristine translations.
Method: Proposes cost-effective adaptation strategy using small-scale noisy synthetic data generated by translating English Reddit title-body pairs with open-weights models. Focuses on Armenian as a case study, establishes comprehensive evaluation benchmark, and fine-tunes multilingual encoder (mE5) on just 10,000 noisy synthetic pairs.
Result: Reveals “Less is More” phenomenon: fine-tuning on 10,000 noisy pairs yields 11-12% average improvements across benchmark with 20%+ relative improvement in retrieval performance, matching models trained on ~1 million examples. Neither increasing data scale, improving translation quality via state-of-the-art LLMs, nor diversifying data domains yields significant gains over minimal baseline. Findings generalize to another low-resource language with unique script.
Conclusion: Semantic alignment for low-resource languages saturates early and is highly robust to noise, democratizing high-performance embedding creation for resource-constrained communities. The approach makes effective text embedding models accessible without requiring massive datasets.
Abstract: Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small scale noisy synthetic data generated by translating English Reddit title-body pairs with open-weights models. We establish a comprehensive evaluation benchmark comprising existing datasets, translated data, and a manually curated dataset. Our experiments reveal a surprising “Less is More” phenomenon: fine-tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yields 11-12% average improvements across the benchmark with a 20%+ relative improvement in retrieval performance, matching the performance of models trained on ~1 million examples. Furthermore, we demonstrate that neither increasing data scale, improving translation quality via state-of-the-art LLMs, nor diversifying data domains yields significant gains over this minimal baseline. We validate the generalizability of these findings on another LRL with a unique script. Our results suggest that semantic alignment for LRLs saturates early and is highly robust to noise, democratizing high-performance embedding creation for resource-constrained communities. We release the model, data, and the benchmark at https://metric-ai-lab.github.io/less-is-more-embeddings/ to facilitate further research.
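The typical objective for adapting an encoder on (title, body) pairs is a contrastive loss with in-batch negatives: each title should score highest against its own body. A minimal sketch under that assumption (the paper does not spell out its loss); the toy vectors stand in for mE5 outputs.

```python
# In-batch contrastive loss sketch (multiple-negatives style), commonly
# used for embedding adaptation on paired text. Illustrative only.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(x * x for x in b)))

def in_batch_contrastive_loss(query_embs, doc_embs, temperature=0.05):
    """Mean cross-entropy of matching each query to its own document
    against all in-batch documents."""
    losses = []
    for i, q in enumerate(query_embs):
        logits = [cosine(q, d) / temperature for d in doc_embs]
        log_norm = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_norm - logits[i])  # -log softmax at true index
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
docs    = [[0.9, 0.1], [0.1, 0.9]]  # query i pairs with doc i
loss = in_batch_contrastive_loss(queries, docs)
print(loss < 0.01)  # aligned pairs -> near-zero loss
```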
[4] Evaluating Large Language Models’ Responses to Sexual and Reproductive Health Queries in Nepali
Medha Sharma, Supriya Khadka, Udit Chandra Aryal, Bishnu Hari Bhatta, Bijayan Bhattarai, Santosh Dahal, Kamal Gautam, Pushpa Joshi, Saugat Kafle, Shristi Khadka, Shushila Khadka, Binod Lamichhane, Shilpa Lamichhane, Anusha Parajuli, Sabina Pokharel, Suvekshya Sitaula, Neha Verma, Bishesh Khanal
Main category: cs.CL
TL;DR: LEAF framework evaluates LLMs for sensitive topics like sexual/reproductive health across accuracy, language, usability, and safety criteria, revealing only 35.1% of responses were “proper” for Nepali queries.
Details
Motivation: Current LLM evaluation methods focus mainly on accuracy for objective queries in high-resource languages, lacking comprehensive assessment of usability and safety for culturally sensitive domains like sexual/reproductive health in low-resource languages.
Method: Developed LLM Evaluation Framework (LEAF) with multi-criteria assessment: accuracy, language, usability gaps (relevance, adequacy, cultural appropriateness), and safety gaps (safety, sensitivity, confidentiality). Applied to 14K Nepali SRH queries from 9K+ users, with manual annotation by SRH experts.
Result: Only 35.1% of responses were “proper” (accurate, adequate, no major usability/safety gaps). Found differences between ChatGPT versions - similar accuracy but varying usability and safety aspects. Revealed significant limitations in current LLMs for sensitive topics.
Conclusion: LEAF framework highlights critical gaps in LLM performance for sensitive domains and low-resource languages. Framework is adaptable across domains/languages where usability and safety are crucial, providing pathway for better addressing sensitive topics.
Abstract: As Large Language Models (LLMs) become integrated into daily life, they are increasingly used for personal queries, including Sexual and Reproductive Health (SRH), allowing users to chat anonymously without fear of judgment. However, current evaluation methods primarily focus on accuracy, often for objective queries in high-resource languages, and lack criteria to assess usability and safety, especially for low-resource languages and culturally sensitive domains like SRH. This paper introduces LLM Evaluation Framework (LEAF), that conducts assessments across multiple criteria: accuracy, language, usability gaps (including relevance, adequacy, and cultural appropriateness), and safety gaps (safety, sensitivity, and confidentiality). Using the LEAF framework, we assessed 14K SRH queries in Nepali from over 9K users. Responses were manually annotated by SRH experts according to the framework. Results revealed that only 35.1% of the responses were “proper”, meaning they were accurate, adequate and had no major usability or safety related gaps. Insights include differences in performance between ChatGPT versions, such as similar accuracy but varying usability and safety aspects. This evaluation highlights significant limitations of current LLMs and underscores the need for improvement. The LEAF Framework is adaptable across domains and languages, particularly where usability and safety are critical, offering a pathway to better address sensitive topics.
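The "proper" label is a conjunction over the framework's criteria: accurate, adequate, and no major usability or safety gap. A sketch of that aggregation rule; the field names are inferred from the summary, not the paper's annotation schema.

```python
# Hypothetical aggregation of LEAF-style per-criterion annotations into
# the binary "proper" label described above (field names are ours).

def is_proper(annotation):
    """Proper = accurate AND no major usability gap AND no safety gap."""
    usability_ok = all(annotation[k] for k in
                       ("relevant", "adequate", "culturally_appropriate"))
    safety_ok = all(annotation[k] for k in
                    ("safe", "sensitive", "confidential"))
    return annotation["accurate"] and usability_ok and safety_ok

good = {"accurate": True, "relevant": True, "adequate": True,
        "culturally_appropriate": True, "safe": True, "sensitive": True,
        "confidential": True}
print(is_proper(good))                    # -> True
print(is_proper({**good, "safe": False}))  # -> False
```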
[5] TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs
Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang
Main category: cs.CL
TL;DR: TIPS introduces turn-level reward shaping using information potential from teacher models to address sparse reward and credit assignment problems in RL-trained search-augmented LLMs for QA.
Details
Motivation: Training search-augmented LLMs with RL for QA faces challenges of sparse rewards and difficult credit assignment across reasoning and tool calls, leading to unstable optimization.
Method: TIPS (Turn-Level Information-Potential Reward Shaping) assigns dense, turn-level rewards to each reasoning+tool-call segment based on the increased likelihood of the correct answer under a teacher model, using potential-based reward shaping for policy-invariant guidance.
Result: On seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines, improving training stability and achieving 11.8% higher Exact Match and 13.6% higher F1 scores with Qwen-2.5 7B Instruct model compared to PPO.
Conclusion: Turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
Abstract: Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging the potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
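Potential-based shaping takes the form r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t), which provably leaves the optimal policy unchanged. A sketch in the spirit of TIPS, with the potential Phi taken as the teacher model's log-likelihood of the correct answer after each turn; names and simplifications here are ours, not the paper's exact formulation.

```python
# Turn-level potential-based reward shaping sketch: dense reward for
# each turn = gamma * Phi(after turn) - Phi(before turn), where Phi is
# the teacher's log-likelihood of the correct answer. Illustrative only.

def shaped_turn_rewards(teacher_logliks, outcome_reward, gamma=1.0):
    """teacher_logliks[t] = log p_teacher(answer | context after turn t),
    with teacher_logliks[0] taken before any turn."""
    rewards = []
    for t in range(1, len(teacher_logliks)):
        rewards.append(gamma * teacher_logliks[t] - teacher_logliks[t - 1])
    # The sparse outcome reward still arrives at the final turn.
    rewards[-1] += outcome_reward
    return rewards

# A turn that raises the teacher's confidence in the answer earns a
# positive dense reward; one that lowers it is penalized.
print(shaped_turn_rewards([-5.0, -3.0, -0.5], outcome_reward=1.0))
# -> [2.0, 3.5]  (last turn includes the +1 outcome reward)
```

With gamma = 1 the shaping terms telescope, so the total shaped return equals the outcome reward plus Phi(final) - Phi(initial), which is why the optimal policy is unaffected.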
[6] Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs
Michael Keeman
Main category: cs.CL
TL;DR: LLMs have genuine emotion understanding beyond keyword spotting, with two dissociable mechanisms: affect reception (near-perfect accuracy) and emotion categorization (partially keyword-dependent).
Details
Motivation: Previous studies on emotion circuits in LLMs rely on explicit emotion keywords, raising the question of whether models genuinely understand emotional meaning or just detect emotion words. Need clinical validity tests using situational cues without keywords.
Method: Used clinical vignettes without emotion keywords to test six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants). Applied four mechanistic interpretability methods: linear probing, causal activation patching, knockout experiments, and representational geometry.
Result: Found two dissociable mechanisms: affect reception detects emotionally significant content with near-perfect accuracy (AUROC 1.000) and replicates across all models; emotion categorization drops 1-7% without keywords and improves with scale. Keyword-rich and keyword-free stimuli share representational space.
Conclusion: LLMs have genuine emotion understanding beyond keyword spotting, with a novel mechanistic dissociation. Clinical stimulus methodology provides rigorous standard for testing emotion processing in LLMs, with implications for AI safety and alignment.
Abstract: Large language models appear to develop internal representations of emotion – “emotion circuits,” “emotion neurons,” and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word “devastated”? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology – clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants), we apply four convergent mechanistic interpretability methods – linear probing, causal activation patching, knockout experiments, and representational geometry – and discover two dissociable emotion processing mechanisms. Affect reception – detecting emotionally significant content – operates with near-perfect accuracy (AUROC 1.000), consistent with early-layer saturation, and replicates across all six models. Emotion categorization – mapping affect to specific emotion labels – is partially keyword-dependent, dropping 1-7% without keywords and improving with scale. Causal activation patching confirms keyword-rich and keyword-free stimuli share representational space, transferring affective salience rather than emotion-category identity. These findings falsify the keyword-spotting hypothesis, establish a novel mechanistic dissociation, and introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models – with direct implications for AI safety evaluation and alignment. All stimuli, code, and data are released for replication.
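The linear-probing result (AUROC 1.000 for affect reception) can be made concrete: a fixed linear probe scores each hidden state, and AUROC measures the probability that a random emotional example outranks a random neutral one. The probe weights and activations below are toy values, not from any model.

```python
# Sketch of scoring a linear probe on hidden states with AUROC, as used
# to quantify "affect reception". Toy values throughout.

def probe_score(hidden, weights):
    return sum(h * w for h, w in zip(hidden, weights))

def auroc(scores, labels):
    # Probability a random positive outranks a random negative
    # (ties count half).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

weights = [1.0, -0.5]
acts = [[2.0, 0.1], [1.5, 0.2], [0.1, 2.0], [0.2, 1.5]]  # 2 emotional, 2 neutral
scores = [probe_score(a, weights) for a in acts]
print(auroc(scores, [1, 1, 0, 0]))  # -> 1.0 for a perfectly separating probe
```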
[7] Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs
Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, Jingren Zhou
Main category: cs.CL
TL;DR: RLVR fine-tuning induces sparse, targeted token-level distributional changes in LLMs, with only a small fraction of token decisions driving reasoning performance improvements.
Details
Motivation: While RL with verifiable rewards improves LLM reasoning, the underlying token-level mechanisms remain unclear. The paper aims to systematically analyze RLVR's distributional effects at the token level to understand how these improvements emerge.
Method: Three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) cross-sampling interventions to assess impact of token-level shifts on sequence-level performance, and (3) fine-grained analysis of token-level mechanics including entropy, positional concentration, and probability mass reallocation.
Result: RL fine-tuning induces highly sparse changes - only small fraction of token distributions diverge meaningfully. Cross-sampling shows inserting few RL tokens into base generations recovers RL gains, while injecting base tokens into RL sequences collapses performance. Divergence-weighted advantage signals can improve over baselines.
Conclusion: RLVR fine-tuning acts as targeted refinement process with sparse token-level changes. Understanding these mechanisms provides fine-grained lens for RL optimization in LLMs.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR’s distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR’s performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.
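The sparsity measurement in analysis (1) amounts to computing a per-position divergence between the base and RL policies' next-token distributions and counting how few positions shift meaningfully. A sketch with KL divergence and a threshold (both choices are ours for illustration; distributions here are toy, one per token position in practice).

```python
# Per-token divergence sketch: KL between base and RL next-token
# distributions, then the fraction of positions exceeding a threshold.

import math

def kl(p, q):
    # KL(p || q) over a shared vocabulary; skips zero-probability terms.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sparse_shift_fraction(base_dists, rl_dists, threshold=0.1):
    divergences = [kl(b, r) for b, r in zip(base_dists, rl_dists)]
    shifted = sum(d > threshold for d in divergences)
    return shifted / len(divergences)

base = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]
rl   = [[0.5, 0.5], [0.9, 0.1], [0.05, 0.95], [0.8, 0.2]]  # one position shifts
print(sparse_shift_fraction(base, rl))  # -> 0.25
```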
[8] Towards Automated Community Notes Generation with Large Vision Language Models for Combating Contextual Deception
Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru, Jinkyung Katie Park, Feng Luo, Long Cheng
Main category: cs.CL
TL;DR: Automated generation of Community Notes for image-based contextual deception using retrieval-augmented multi-agent LVLM framework
Details
Motivation: Human-written Community Notes are limited in timeliness and scalability; need automated methods for combating image-based contextual deception where authentic images are paired with misleading context
Method: ACCNote: retrieval-augmented multi-agent collaboration framework built on large vision-language models; uses XCheck dataset with real-world X posts, Community Notes, and external contexts
Result: ACCNote improves deception detection and note generation over baselines, exceeds GPT5-mini; new Context Helpfulness Score (CHS) metric aligns with user study outcomes
Conclusion: Dataset, method, and metric advance practical automated generation of context-corrective notes for more responsible online social networks
Abstract: Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, its reliance on human contributors limits both the timeliness and scalability. In this work, we study the automated Community Notes generation method for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), Community Notes-style systems need to generate concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to three reasons: (i) datasets that support the research are scarce; (ii) methods must handle the dynamic nature of contextual deception; (iii) evaluation is difficult because standard metrics do not capture whether notes actually improve user understanding. To address these gaps, we curate a real-world dataset, XCheck, comprising X posts with associated Community Notes and external contexts. We further propose the Automated Context-Corrective Note generation method, named ACCNote, which is a retrieval-augmented, multi-agent collaboration framework built on large vision-language models. Finally, we introduce a new evaluation metric, Context Helpfulness Score (CHS), that aligns with user study outcomes rather than relying on lexical overlap. Experiments on our XCheck dataset show that the proposed ACCNote improves both deception detection and note generation performance over baselines, and exceeds a commercial tool GPT5-mini. Together, our dataset, method, and metric advance practical automated generation of context-corrective notes toward more responsible online social networks.
[9] Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics
Naohiro Tawara, Samuele Cornell, Alexander Polok, Marc Delcroix, Lukáš Burget, Shinji Watanabe
Main category: cs.CL
TL;DR: LLM-based ASR systems perform well in single-speaker settings but degrade with increasing speaker count and overlap, while modular pipelines remain more robust; new semantic error metric tcpSemER introduced for better evaluation.
Details
Motivation: Conversational ASR faces challenges from overlapping speech, far-field noise, and varying speaker counts. While LLM-based systems excel in single-speaker benchmarks, their robustness in multi-speaker settings is unclear, necessitating systematic comparison with traditional modular approaches.
Method: Systematic comparison of LLM-based vs modular pipeline approaches across four axes: overlap robustness, semantic fidelity, speaker count, and single/multi-channel input. Introduced tcpSemER metric extending tcpWER by replacing Levenshtein distance with embedding-based semantic similarity, and decomposed tcpWER into overlapping/non-overlapping components for finer analysis.
Result: Experiments across three datasets show LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, while modular pipelines remain more robust. The new tcpSemER metric captures meaning-altering errors that conventional metrics miss.
Conclusion: LLM-based ASR systems have limitations in complex multi-speaker conversational settings with high overlap, where traditional modular pipelines demonstrate better robustness. The proposed tcpSemER metric provides more meaningful evaluation by focusing on semantic fidelity rather than just word-level accuracy.
Abstract: Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas modular pipelines remain more robust.
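The core idea behind tcpSemER, replacing the hard 0/1 substitution cost with an embedding-based semantic distance, can be sketched as a word-level toy (the embeddings below are hypothetical, and the time-constrained speaker attribution that gives the metric its "tcp" prefix is omitted):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_error_rate(ref, hyp, embed):
    # Levenshtein-style alignment where the substitution cost is
    # 1 - cosine(embed[ref_word], embed[hyp_word]), so paraphrases are
    # penalized less than meaning-altering errors.
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - cosine(embed[ref[i - 1]], embed[hyp[j - 1]])
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub)  # soft substitution
    return d[n][m] / max(n, 1)

# Toy embeddings: "car"/"automobile" are near-synonyms, "banana" is unrelated.
emb = {"the": [1, 0, 0], "car": [0, 1, 0.2],
       "automobile": [0, 0.9, 0.25], "banana": [0, 0, 1]}
ref = ["the", "car"]
print(semantic_error_rate(ref, ["the", "automobile"], emb))  # near zero
print(semantic_error_rate(ref, ["the", "banana"], emb))      # much larger
```

A conventional WER would score both hypotheses identically (one substitution each); the semantic cost separates the meaning-preserving error from the meaning-altering one.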
[10] LLM-guided headline rewriting for clickability enhancement without clickbait
Yehudit Aperstein, Linoy Halifa, Sagiv Bar, Alexander Apartsin
Main category: cs.CL
TL;DR: A framework for controllable headline rewriting using LLMs with dual guidance to enhance engagement while avoiding clickbait through semantic faithfulness constraints.
Details
Motivation: News media faces the challenge of optimizing headlines for reader engagement without resorting to clickbait, which undermines editorial trust. Current approaches often conflate engagement optimization with clickbait, leading to exaggerated or misleading phrasing.
Method: Uses LLM-based framework with FUDGE paradigm for inference-time control. Two auxiliary guide models: (1) clickbait scoring model for negative guidance to suppress excessive stylistic amplification, and (2) engagement-attribute model for positive guidance aligned with clickability objectives. Both trained on neutral headlines from curated news corpus, with synthetic clickbait variants generated by rewriting original headlines using LLM under controlled activation of engagement tactics.
Result: The framework generates headlines along a continuum from neutral paraphrases to more engaging yet editorially acceptable formulations by adjusting guidance weights at inference time.
Conclusion: Provides principled approach for studying trade-off between attractiveness, semantic preservation, and clickbait avoidance, supporting responsible LLM-based headline optimization in journalistic settings.
Abstract: Enhancing reader engagement while preserving informational fidelity is a central challenge in controllable text generation for news media. Optimizing news headlines for reader engagement is often conflated with clickbait, resulting in exaggerated or misleading phrasing that undermines editorial trust. We frame clickbait not as a separate stylistic category, but as an extreme outcome of disproportionate amplification of otherwise legitimate engagement cues. Based on this view, we formulate headline rewriting as a controllable generation problem, where specific engagement-oriented linguistic attributes are selectively strengthened under explicit constraints on semantic faithfulness and proportional emphasis. We present a guided headline rewriting framework built on a large language model (LLM) that uses the Future Discriminators for Generation (FUDGE) paradigm for inference-time control. The LLM is steered by two auxiliary guide models: (1) a clickbait scoring model that provides negative guidance to suppress excessive stylistic amplification, and (2) an engagement-attribute model that provides positive guidance aligned with target clickability objectives. Both guides are trained on neutral headlines drawn from a curated real-world news corpus. At the same time, clickbait variants are generated synthetically by rewriting these original headlines using an LLM under controlled activation of predefined engagement tactics. By adjusting guidance weights at inference time, the system generates headlines along a continuum from neutral paraphrases to more engaging yet editorially acceptable formulations. The proposed framework provides a principled approach for studying the trade-off between attractiveness, semantic preservation, and clickbait avoidance, and supports responsible LLM-based headline optimization in journalistic settings.
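FUDGE-style inference-time control can be sketched as reweighting candidate next tokens by the two guides' probabilities; all candidate words, logits, and guide scores below are illustrative:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def guided_distribution(base_logits, p_engaging, p_clickbait, w_pos=1.0, w_neg=1.0):
    # FUDGE-style reweighting: shift each candidate's base logit by
    # +w_pos * log P(engaging | prefix+token)  (positive guidance) and
    # -w_neg * log P(clickbait | prefix+token) (negative guidance),
    # then renormalize.
    adjusted = [b + w_pos * math.log(pe) - w_neg * math.log(pc)
                for b, pe, pc in zip(base_logits, p_engaging, p_clickbait)]
    return softmax(adjusted)

# Hypothetical candidates for the next headline word.
cands = ["reports", "reveals", "SHOCKING"]
base = [1.0, 1.0, 1.0]        # base LM is indifferent
p_eng = [0.3, 0.8, 0.9]       # engagement-attribute guide
p_cb = [0.05, 0.10, 0.90]     # clickbait-scoring guide
probs = guided_distribution(base, p_eng, p_cb)
print(cands[probs.index(max(probs))])  # "reveals": engaging but not clickbait
```

Raising `w_pos` relative to `w_neg` moves generations along the neutral-to-engaging continuum described in the abstract; the sketch omits the per-step candidate pruning that FUDGE uses in practice.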
[11] Functional Component Ablation Reveals Specialization Patterns in Hybrid Language Model Architectures
Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Main category: cs.CL
TL;DR: Both components of hybrid language models (attention + SSM/linear attention) are genuinely utilized, with the alternative component (SSM/linear attention) serving as the primary backbone and exhibiting a positional importance gradient and built-in redundancy.
Details
Motivation: To investigate whether both components in hybrid language models (attention + SSM/linear attention) are genuinely utilized or if one component is being bypassed, addressing uncertainty about functional contributions in these architectures.
Method: Functional component ablation framework applied to two sub-1B hybrid models (Qwen3.5-0.8B and Falcon-H1-0.5B) with a pure Transformer control. Used group ablations, layer-wise sweeps, positional ablations, matched random controls, and perplexity analysis across five benchmarks.
Result: Four key findings: 1) Both components are essential; 2) Alternative component (SSM/linear attention) is primary backbone (35,000x perplexity degradation vs 82x for attention); 3) Component importance follows positional gradient with early layers critical; 4) Hybrids show 20-119x greater resilience to random layer removal than pure Transformers.
Conclusion: Hybrid architectures have built-in functional redundancy between component types, providing actionable guidance for model compression, architecture design, and fault-tolerant deployment.
Abstract: Hybrid language models combining attention with state space models (SSMs) or linear attention offer improved efficiency, but whether both components are genuinely utilized remains unclear. We present a functional component ablation framework applied to two sub-1B hybrid models – Qwen3.5-0.8B (sequential: Gated DeltaNet + softmax attention) and Falcon-H1-0.5B (parallel: Mamba-2 + attention) – with a pure Transformer control (Qwen2.5-0.5B). Through group ablations, layer-wise sweeps, positional ablations, matched random controls, and perplexity analysis across five benchmarks, we establish four findings: (1) both component types are essential and neither is bypassed; (2) the alternative component (linear attention or SSM) is the primary language modeling backbone, causing >35,000x perplexity degradation when removed versus ~82x for attention; (3) component importance follows a positional gradient, with early layers being disproportionately critical; and (4) hybrid architectures exhibit 20-119x greater resilience to random layer removal than pure Transformers, revealing built-in functional redundancy between component types. These results provide actionable guidance for hybrid model compression, architecture design, and fault-tolerant deployment.
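The headline ratios (>35,000x vs ~82x) are perplexity ratios, i.e. exponentiated differences in per-token negative log-likelihood; a minimal sketch with an arbitrary base NLL:

```python
import math

def degradation_factor(base_nll, ablated_nll):
    # Perplexity ratio between the ablated and intact model; perplexity is
    # exp(mean negative log-likelihood per token).
    return math.exp(ablated_nll) / math.exp(base_nll)

# Hypothetical per-token NLL values shaped like the paper's headline finding:
base = 2.3                                               # intact model (arbitrary)
print(degradation_factor(base, base + math.log(35000)))  # ~35000x: SSM ablated
print(degradation_factor(base, base + math.log(82)))     # ~82x: attention ablated
```

The asymmetry between the two ratios is what motivates calling the alternative component the "primary backbone": removing it costs orders of magnitude more modeling quality than removing attention.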
[12] Rashid: A Cipher-Based Framework for Exploring In-Context Language Learning
Niyati Bafna, Ryan Soh-Eun Shim, Barbara Plank, David Yarowsky, Hale Sirin
Main category: cs.CL
TL;DR: Rashid framework enables study of in-context language learning by reversibly ciphering high-resource languages to create synthetic unseen languages, allowing comprehensive evaluation of ICLL methods with existing NLP resources.
Details
Motivation: Current in-context language learning research faces limitations due to lack of NLP tools, data resources, and expertise for truly low-resource languages, making progress assessment difficult and limiting experimentation to few languages and tasks.
Method: Introduces Rashid framework that reversibly ciphers high-resource languages to construct synthetic unseen languages while retaining access to all existing NLP resources and evaluation tools for those languages.
Result: Enables comprehensive assessment of current ICLL methods using SOTA evaluation tools, exploration of resource utility in improving ICLL, and testing on rich downstream tasks beyond machine translation.
Conclusion: The framework unlocks previously impossible exploration of ICLL phenomena and provides actionable insights for current performance and future directions in in-context language learning.
Abstract: While there is growing interest in in-context language learning (ICLL) for unseen languages with large language models, such languages usually suffer from the lack of NLP tools, data resources, and researcher expertise. This means that progress is difficult to assess, the field does not allow for cheap large-scale experimentation, and findings on ICLL are often limited to very few languages and tasks. In light of such limitations, we introduce a framework (Rashid) for studying ICLL wherein we reversibly cipher high-resource languages (HRLs) to construct truly unseen languages with access to a wide range of resources available for HRLs, unlocking previously impossible exploration of ICLL phenomena. We use our framework to assess current methods in the field with SOTA evaluation tools and manual analysis, explore the utility of potentially expensive resources in improving ICLL, and test ICLL strategies on rich downstream tasks beyond machine translation. These lines of exploration showcase the possibilities enabled by our framework, as well as providing actionable insights regarding current performance and future directions in ICLL.
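A toy illustration of the reversible-ciphering idea: a deterministic permutation of the alphabet produces a surface-level "unseen" language, while deciphering restores the original so existing labels and evaluation tools still apply (the actual Rashid cipher design may differ):

```python
import random
import string

def make_cipher(seed=0):
    # Deterministic, reversible permutation of the lowercase alphabet: a toy
    # stand-in for ciphering a high-resource language into an "unseen" one.
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    enc = str.maketrans(dict(zip(letters, shuffled)))
    dec = str.maketrans(dict(zip(shuffled, letters)))
    return enc, dec

enc, dec = make_cipher()
text = "the cat sat on the mat"
ciphered = text.translate(enc)          # surface form of the synthetic language
assert ciphered.translate(dec) == text  # reversibility: labels/tools carry over
```

Because the mapping is a bijection, any gold annotation on the original text (translations, labels, parses) transfers to the ciphered text for free, which is what enables cheap large-scale ICLL evaluation.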
[13] Reddit After Roe: A Computational Analysis of Abortion Narratives and Barriers in the Wake of Dobbs
Aria Pessianzadeh, Alex H. Poole, Rezvaneh Rezapour
Main category: cs.CL
TL;DR: Computational analysis of 17,000+ Reddit posts shows emotional/psychological barriers dominate abortion discourse, with information-seeking/sharing behaviors evolving through Dobbs decision phases.
Details
Motivation: To understand how barriers to abortion access are articulated in online discourse following the Dobbs decision, examining information behaviors, abortion stages, and emotional dimensions.
Method: Multi-step pipeline analyzing 17,000+ posts from 4 abortion subreddits, classifying by information type, abortion stage, barrier category (8 types), emotions, and temporal phases around Dobbs decision.
Result: Emotional/psychological barriers consistently dominate abortion narratives online, with nervousness, confusion, fear, and sadness prevalent. Topic modeling revealed discourse evolution in response to legal/cultural shifts.
Conclusion: The study provides a multi-dimensional account of how abortion is navigated in online communities, linking information behaviors, barriers, emotions, and temporal dynamics in the post-Dobbs landscape.
Abstract: The 2022 U.S. Supreme Court decision in Dobbs v. Jackson Women’s Health Organization reshaped the reproductive rights landscape, introducing new uncertainty and barriers to abortion access. We present a large-scale computational analysis of abortion discourse on Reddit, examining how barriers to access are articulated across information-seeking and information-sharing behaviors, different stages of abortion (before, during, after), and three phases of the Dobbs decision in 2022. Drawing on more than 17,000 posts from four abortion-related subreddits, we employed a multi-step pipeline to classify posts by information type, abortion stage, barrier category, and expressed emotions. Using a codebook of eight barrier types, including legal, financial, emotional, and social obstacles, we analyzed their associations with emotions and information behaviors. Topic modeling of model-generated barrier rationales further revealed how discourse evolved in response to shifting legal and cultural contexts. Our findings show that emotional and psychological barriers consistently dominate abortion narratives online, with emotions such as nervousness, confusion, fear, and sadness prevalent across discourse. By linking information behaviors, barriers, emotions, and temporal dynamics, this study provides a multi-dimensional account of how abortion is navigated in online communities.
[14] CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context
Giovana Kerche Bonás, Roseval Malaquias Junior, Marcos Piau, Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Celio Larcher, Ramon Pires, Rodrigo Nogueira
Main category: cs.CL
TL;DR: CAPITU is a benchmark for evaluating instruction-following capabilities of LLMs in Brazilian Portuguese using culturally-grounded content from Brazilian literature with automatic verification.
Details
Motivation: Existing benchmarks focus on English or use generic prompts, lacking culturally-grounded evaluation for Portuguese. There's a need for a benchmark that combines verifiable instruction constraints with culturally-relevant content for Brazilian Portuguese LLMs.
Method: Creates benchmark with 59 instruction types across 7 categories, contextualized within 8 canonical Brazilian literary works. Instructions include Portuguese-specific linguistic constraints and structural requirements. Evaluates 18 models in single-turn and multi-turn settings with automatic verification.
Result: Frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy). Portuguese-specialized models offer competitive cost-efficiency. Multi-turn evaluation shows significant variation in constraint persistence (60-96% conversation-level accuracy). Identifies challenges in morphological constraints, exact counting, and constraint persistence degradation.
Conclusion: CAPITU provides a comprehensive benchmark for evaluating Portuguese instruction-following capabilities with cultural grounding. Reveals performance gaps and specific challenges in Portuguese LLMs, facilitating research on instruction-following in Portuguese.
Abstract: We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally-grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at $0.13 vs Claude-Haiku-4.5: 73.5% at $1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.
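Termination constraints such as -ando/-endo/-indo are verifiable with plain string checks; a sketch of one such verifier (the threshold and rule details are illustrative, not CAPITU's exact specification):

```python
import re

def check_termination(text, suffixes=("ando", "endo", "indo"), min_count=3):
    # Verify a CAPITU-style constraint: at least `min_count` words must end
    # in one of the given Portuguese suffixes. Threshold and rule are
    # illustrative, not the benchmark's exact specification.
    words = re.findall(r"\w+", text.lower())
    hits = [w for w in words if w.endswith(suffixes)]
    return len(hits) >= min_count, hits

ok, hits = check_termination(
    "Capitu estava sorrindo, dissimulando e encantando a todos.")
print(ok, hits)  # True ['sorrindo', 'dissimulando', 'encantando']
```

Checks of this shape are what let the benchmark score all 59 instruction types deterministically, without LLM judges or human raters.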
[15] When Audio-LLMs Don’t Listen: A Cross-Linguistic Study of Modality Arbitration
Jayadev Billa
Main category: cs.CL
TL;DR: Audio-LLMs show strong text dominance when audio and text conflict, following text 10-26× more often than when arbitrating between two text sources, even when instructed to trust audio.
Details
Motivation: To investigate how speech-enabled language models handle conflicts between audio and text modalities, and whether they exhibit systematic bias toward text over audio inputs.
Method: Created ALME dataset with 57,602 controlled audio-text conflict stimuli across 8 languages, introduced Text Dominance Ratio (TDR) metric, tested models including Gemini 2.0 Flash and GPT-4o, conducted framing experiments and fine-tuning ablation studies.
Result: Models show high text dominance: Gemini 2.0 Flash TDR 16.6% vs baseline 1.6%, GPT-4o TDR 23.2% vs baseline 0.9%. Framing transcript as corrupted reduces TDR by 80%, while forcing explicit transcription increases it by 14%. Pattern consistent across 4 audio-LLMs with cross-model and cross-linguistic variation.
Conclusion: Text dominance reflects not just information content but an asymmetry in arbitration accessibility - models find it easier to use text representations at decision time. Arbitration behavior depends more on LLM reasoning than audio input path alone.
Abstract: When audio and text conflict, speech-enabled language models follow text far more often than they do when arbitrating between two conflicting text sources, even under explicit instructions to trust the audio. We introduce ALME (Audio-LLM Modality Evaluation), a dataset of 57,602 controlled audio-text conflict stimuli across eight languages, together with Text Dominance Ratio (TDR), which measures how often a model follows conflicting text when instructed to follow audio. Gemini 2.0 Flash and GPT-4o show TDR 10–26$\times$ higher than a baseline that replaces audio with its transcript under otherwise identical conditions (Gemini 2.0 Flash: 16.6% vs. 1.6%; GPT-4o: 23.2% vs. 0.9%). These results suggest that text dominance reflects not only information content, but also an asymmetry in arbitration accessibility, i.e., how easily the model can use competing representations at decision time. Framing the transcript as deliberately corrupted reduces TDR by 80%, whereas forcing explicit transcription increases it by 14%. A fine-tuning ablation further suggests that arbitration behavior depends more on LLM reasoning than on the audio input path alone. Across four audio-LLMs, we observe the same qualitative pattern with substantial cross-model and cross-linguistic variation.
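The Text Dominance Ratio reduces to a fraction over audio-text conflict trials; a sketch with hypothetical trial records (field names are mine, not the ALME schema):

```python
def text_dominance_ratio(trials):
    # TDR sketch: among audio-text conflict trials where the model was
    # instructed to follow the audio, the fraction whose answer matched
    # the conflicting text instead.
    conflict = [t for t in trials if t["text_answer"] != t["audio_answer"]]
    followed_text = sum(1 for t in conflict
                        if t["model_answer"] == t["text_answer"])
    return followed_text / len(conflict) if conflict else 0.0

# Hypothetical trials; in the paper, the instruction always says to trust audio.
trials = [
    {"audio_answer": "Paris", "text_answer": "Rome", "model_answer": "Rome"},
    {"audio_answer": "Paris", "text_answer": "Rome", "model_answer": "Paris"},
    {"audio_answer": "7", "text_answer": "9", "model_answer": "9"},
    {"audio_answer": "7", "text_answer": "9", "model_answer": "9"},
]
print(text_dominance_ratio(trials))  # 0.75
```

The paper's baseline replaces the audio with its transcript under otherwise identical conditions, so the 10-26x gap between the two ratios isolates the modality effect from the information content.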
[16] Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young
Main category: cs.CL
TL;DR: Study evaluates faithfulness of Chain-of-Thought reasoning in 12 open-weight models, finding acknowledgment rates vary widely (40-90%) and models often suppress hint acknowledgment in outputs despite internal recognition.
Details
Motivation: To assess the effectiveness of Chain-of-Thought reasoning as a transparency mechanism for LLMs in safety-critical deployments by evaluating faithfulness - whether models accurately verbalize the factors influencing their outputs - across diverse open-weight models.
Method: Tested 12 open-weight reasoning models (7B-685B parameters) from 9 architectural families on 498 multiple-choice questions from MMLU and GPQA Diamond. Injected six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, unethical information) and measured acknowledgment rates when hints successfully altered answers. Conducted 41,832 inference runs with keyword-based analysis of thinking-token vs answer-text acknowledgment.
Result: Overall faithfulness rates ranged from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale). Consistency hints (35.5%) and sycophancy hints (53.9%) had lowest acknowledgment rates. Training methodology and model family predicted faithfulness more strongly than parameter count. Keyword analysis revealed large gap between thinking-token acknowledgment (~87.5%) and answer-text acknowledgment (~28.6%).
Conclusion: Chain-of-Thought monitoring effectiveness varies significantly across models and hint types. Models often internally recognize hint influence but systematically suppress acknowledgment in outputs. Faithfulness is not fixed but varies with architecture, training method, and cue nature, raising concerns about CoT as a reliable safety mechanism.
Abstract: Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints successfully alter answers. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.
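The keyword-based acknowledgment measurement can be sketched as follows; the keyword list and run records are illustrative, not the study's actual protocol:

```python
def acknowledgment_rate(runs, keywords=("hint", "suggested")):
    # Faithfulness sketch: among runs where an injected hint flipped the
    # answer, the fraction whose stated reasoning mentions the hint.
    # The keyword list is illustrative, not the paper's.
    flipped = [r for r in runs if r["answer_with_hint"] != r["answer_without_hint"]]
    ack = sum(1 for r in flipped
              if any(k in r["reasoning"].lower() for k in keywords))
    return ack / len(flipped) if flipped else float("nan")

runs = [
    {"answer_without_hint": "A", "answer_with_hint": "C",
     "reasoning": "The hint points to C, so I will go with C."},
    {"answer_without_hint": "B", "answer_with_hint": "D",
     "reasoning": "On reflection, D follows from the premises."},  # silent switch
    {"answer_without_hint": "A", "answer_with_hint": "A",
     "reasoning": "A is correct."},  # hint had no effect; excluded
]
print(acknowledgment_rate(runs))  # 0.5
```

Applying the same check separately to thinking tokens and to the final answer text is what surfaces the paper's ~87.5% vs ~28.6% gap: models that acknowledge the hint internally but not in the visible answer.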
[17] LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation
Hailay Teklehaymanot, Dren Fazlija, Wolfgang Nejdl
Main category: cs.CL
TL;DR: LGSE framework improves language model adaptation to low-resource morphologically rich languages by using morphological segmentation for embedding initialization instead of arbitrary subword units.
Details
Motivation: Existing vocabulary expansion methods for pretrained language models use arbitrarily segmented subword units, which fragment lexical representations and lose critical morphological information, especially problematic for morphologically rich, low-resource languages.
Method: LGSE decomposes words into constituent morphemes and constructs embeddings by averaging pretrained subword or FastText-based morpheme representations. For unsegmentable tokens, uses character n-grams. During adaptation, applies regularization to preserve alignment with original embedding space while updating only new embeddings.
Result: LGSE consistently outperforms baseline methods across Question Answering, Named Entity Recognition, and Text Classification tasks in Amharic and Tigrinya, demonstrating improved representation quality for underrepresented languages.
Conclusion: Morphologically grounded embedding initialization effectively improves language model adaptation to low-resource, morphologically rich languages by preserving morphological structure and semantic coherence in representations.
Abstract: Adapting pretrained language models to low-resource, morphologically rich languages remains a significant challenge. Existing vocabulary expansion methods typically rely on arbitrarily segmented subword units, resulting in fragmented lexical representations and loss of critical morphological information. To address this limitation, we propose the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings by averaging pretrained subword or FastText-based morpheme representations. When a token cannot be segmented into meaningful morphemes, its embedding is constructed using character n-gram representations to capture structural information. During Language-Adaptive Pretraining, we apply a regularization term that penalizes large deviations of newly introduced embeddings from their initialized values, preserving alignment with the original pretrained embedding space while enabling adaptation to the target language. To isolate the effect of initialization, we retain the original pre-trained model vocabulary and tokenizer and update only the new embeddings during adaptation. We evaluate LGSE on three NLP tasks: Question Answering, Named Entity Recognition, and Text Classification, in two morphologically rich, low-resource languages: Amharic and Tigrinya, where morphological segmentation resources are available. Experimental results show that LGSE consistently outperforms baseline methods across all tasks, demonstrating the effectiveness of morphologically grounded embedding initialization for improving representation quality in underrepresented languages. Project resources are available in the GitHub link.
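LGSE's initialization step, averaging pretrained morpheme vectors with a character n-gram fallback, can be sketched with toy lookup tables (all vectors and the segmentation dictionary below are hypothetical):

```python
def init_embedding(word, morphemes, morpheme_vecs, char_vecs, n=3):
    # LGSE-style initialization sketch: a new token's embedding is the mean
    # of its morphemes' pretrained vectors; if no morpheme is known, fall
    # back to averaging character n-gram vectors.
    vecs = [morpheme_vecs[m] for m in morphemes.get(word, []) if m in morpheme_vecs]
    if not vecs:
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        vecs = [char_vecs[g] for g in grams if g in char_vecs]
    if not vecs:
        return None  # caller falls back to random initialization
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# Toy segmentation dictionary and 2-d "pretrained" vectors.
morphemes = {"unhappiness": ["un", "happy", "ness"]}
morpheme_vecs = {"un": [1.0, 0.0], "happy": [0.0, 1.0], "ness": [0.5, 0.5]}
char_vecs = {"abc": [2.0, 2.0], "bcd": [0.0, 2.0]}
print(init_embedding("unhappiness", morphemes, morpheme_vecs, char_vecs))  # [0.5, 0.5]
print(init_embedding("abcd", morphemes, morpheme_vecs, char_vecs))         # [1.0, 2.0]
```

During adaptation the paper then regularizes new embeddings toward these initial values; that training-time term is not shown here.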
[18] Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson’s Disease
Abner Hernandez, Eunjung Yeo, Kwanghee Choi, Chin-Jou Li, Zhengjun Yue, Rohan Kumar Das, Jan Rusz, Mathew Magimai Doss, Juan Rafael Orozco-Arroyave, Tomás Arias-Vergara, Andreas Maier, Elmar Nöth, David R. Mortensen, David Harwath, Paula Andrea Perez-Toro
Main category: cs.CL
TL;DR: Proposes a representation-level language shift (LS) method that aligns speech representations across languages for cross-lingual dysarthria detection in Parkinson’s disease speech.
Details
Motivation: Limited dysarthric speech data availability makes cross-lingual detection challenging; speech representations often encode language-dependent structure that confounds dysarthria detection.
Method: Representation-level language shift (LS) aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech.
Result: LS substantially improves sensitivity and F1 in cross-lingual settings, yields smaller but consistent gains in multilingual settings; reduces language identity in the embedding space.
Conclusion: LS effectively removes language-dependent structure from speech representations, enabling better cross-lingual dysarthria detection.
Abstract: The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson’s disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
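Centroid-based vector adaptation amounts to a mean shift between the two languages' healthy-control distributions; a sketch under that reading of the abstract (the paper's exact estimator may differ):

```python
def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def language_shift(x, healthy_src, healthy_tgt):
    # Move a source-language representation toward the target language by
    # subtracting the source healthy-control centroid and adding the target
    # one. This preserves within-language variation (such as dysarthria cues)
    # while removing the language offset.
    mu_s = centroid(healthy_src)
    mu_t = centroid(healthy_tgt)
    return [xi - s + t for xi, s, t in zip(x, mu_s, mu_t)]

healthy_cz = [[1.0, 0.0], [3.0, 0.0]]  # toy Czech healthy-control embeddings
healthy_es = [[0.0, 4.0], [0.0, 6.0]]  # toy Spanish healthy-control embeddings
x = [2.5, 0.5]                         # a Czech utterance embedding
print(language_shift(x, healthy_cz, healthy_es))  # [0.5, 5.5]
```

Because the centroids are estimated only from healthy-control speech, no dysarthric data from the target language is required, which is the point in a low-resource setting.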
[19] Multi-Method Validation of Large Language Model Medical Translation Across High- and Low-Resource Languages
Chukwuebuka Anyaegbuna, Eduardo Juan Perez Guerrero, Jerry Liu, Timothy Keyes, April Liang, Natasha Steele, Stephen Ma, Jonathan Chen, Kevin Schulman
Main category: cs.CL
TL;DR: Frontier LLMs (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Kimi K2) effectively translate medical documents across 8 languages with high semantic preservation, showing no significant performance difference between high- and low-resource languages.
Details
Motivation: Address language barriers in healthcare affecting 27.3 million U.S. residents with non-English language preference, where professional medical translation is costly and often unavailable.
Method: Evaluated four frontier LLMs translating 22 medical documents into 8 languages (Spanish, Chinese, Russian, Vietnamese, Korean, Arabic, Tagalog, Haitian Creole) using a five-layer validation framework with 704 translation pairs, including semantic preservation analysis, cross-model back-translation, inter-model concordance, and lexical borrowing analysis.
Result: All models achieved high semantic preservation (LaBSE > 0.92), no significant difference between high- and low-resource languages (p = 0.066), high inter-model concordance (LaBSE: 0.946), and no correlation between English term retention and fidelity scores in low-resource languages (rho = +0.018, p = 0.82).
Conclusion: Frontier LLMs preserve medical meaning across language resource levels, suggesting potential for improving language access in healthcare through AI translation.
Abstract: Language barriers affect 27.3 million U.S. residents with non-English language preference, yet professional medical translation remains costly and often unavailable. We evaluated four frontier large language models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Kimi K2) translating 22 medical documents into 8 languages spanning high-resource (Spanish, Chinese, Russian, Vietnamese), medium-resource (Korean, Arabic), and low-resource (Tagalog, Haitian Creole) categories using a five-layer validation framework. Across 704 translation pairs, all models achieved high semantic preservation (LaBSE greater than 0.92), with no significant difference between high- and low-resource languages (p = 0.066). Cross-model back-translation confirmed results were not driven by same-model circularity (delta = -0.0009). Inter-model concordance across four independently trained models was high (LaBSE: 0.946), and lexical borrowing analysis showed no correlation between English term retention and fidelity scores in low-resource languages (rho = +0.018, p = 0.82). These converging results suggest frontier LLMs preserve medical meaning across resource levels, with implications for language access in healthcare.
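Comparing semantic preservation across resource tiers amounts to grouping per-language scores and comparing their means; a sketch with illustrative numbers (not the paper's data):

```python
import statistics

def tier_means(scores_by_lang, tiers):
    # Aggregate per-language semantic-preservation scores (LaBSE cosine
    # similarities in the paper) by resource tier, mirroring the
    # high- vs low-resource comparison.
    return {tier: statistics.mean(s for lang in langs for s in scores_by_lang[lang])
            for tier, langs in tiers.items()}

scores_by_lang = {  # hypothetical per-document scores, not the paper's data
    "es": [0.95, 0.96], "zh": [0.94, 0.95],  # high-resource
    "tl": [0.93, 0.94], "ht": [0.92, 0.95],  # low-resource
}
means = tier_means(scores_by_lang, {"high": ["es", "zh"], "low": ["tl", "ht"]})
print(means)  # similar tier means, echoing the paper's non-significant gap
```

The paper then tests whether the tier difference is statistically significant (reporting p = 0.066); that hypothesis test is outside this sketch.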
[20] Improving LLM Predictions via Inter-Layer Structural Encoders
Tom Ulanovski, Eyal Blyachman, Maya Bechler-Speicher
Main category: cs.CL
TL;DR: ILSE introduces a structural approach to combine intermediate layer representations from LLMs using Cayley-Encoder for better task performance.
Details
Motivation: While LLMs typically use final-layer representations, intermediate layers contain substantial task-relevant information, with different layers being optimal for different tasks. Current approaches don't effectively combine these diverse layer representations.
Method: Proposes Inter-Layer Structural Encoders (ILSE) with Cayley-Encoder, a geometric encoder using expander Cayley graphs for efficient inter-layer information propagation to learn effective combined representations from all LLM layers.
Result: ILSE outperforms baselines across 13 classification and semantic similarity tasks with 9 LLMs (14M to 8B parameters), achieving up to 44% accuracy improvement and 25% similarity metric gains. Shows data efficiency in few-shot learning and enables small LLMs to compete with larger models.
Conclusion: ILSE provides an effective structural approach to leverage intermediate layer representations in LLMs, improving performance across diverse tasks and enabling more efficient model utilization.
Abstract: The standard practice in Large Language Models (LLMs) is to base predictions on the final-layer token representations. Recent studies, however, show that intermediate layers encode substantial information, which may contain more task-relevant features than the final-layer representations alone. Importantly, it was shown that for different tasks, different layers may be optimal. In this work we introduce Inter-Layer Structural Encoders (ILSE), a powerful structural approach to learn one effective representation from the LLM’s internal layer representations all together. Central to ILSE is Cayley-Encoder, a mathematically grounded geometric encoder that leverages expander Cayley graphs for efficient inter-layer information propagation. We evaluate ILSE across 13 classification and semantic similarity tasks with 9 pre-trained LLMs ranging from 14 million to 8 billion parameters. ILSE consistently outperforms baselines and existing approaches, achieving up to 44% improvement in accuracy and 25% in similarity metrics. We further show that ILSE is data-efficient in few-shot regimes and can make small LLMs competitive with substantially larger models.
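The core idea, propagating information between per-layer representations over a graph before pooling them into one vector, can be sketched with naive neighbour averaging. This is a loose stand-in: a simple cycle graph replaces the paper's expander Cayley graph, and plain averaging replaces the learned Cayley-Encoder; all names here are mine, not the paper's:

```python
def cycle_adjacency(n):
    """Neighbours of each layer node on a cycle (a toy stand-in for the
    expander Cayley graphs used by the Cayley-Encoder)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def propagate(layer_reprs, adjacency, steps=2):
    """Repeatedly average each layer's vector with its graph neighbours."""
    reprs = [list(r) for r in layer_reprs]
    for _ in range(steps):
        new = []
        for i, r in enumerate(reprs):
            acc = list(r)
            for j in adjacency[i]:
                for d in range(len(acc)):
                    acc[d] += reprs[j][d]
            new.append([x / (1 + len(adjacency[i])) for x in acc])
        reprs = new
    return reprs

def pool(reprs):
    """Mean-pool the propagated layer representations into one vector."""
    n, d = len(reprs), len(reprs[0])
    return [sum(r[k] for r in reprs) / n for k in range(d)]
```

The expander structure matters because it mixes information between distant layers in few propagation steps, which simple averaging on a cycle does much more slowly.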
[21] Synthetic or Authentic? Building Mental Patient Simulators from Longitudinal Evidence
Baihan Li, Bingrui Jin, Kunyao Lan, Ming Wang, Mengyue Wu
Main category: cs.CL
TL;DR: DEPROFILE: A data-grounded patient simulation framework that constructs unified patient profiles from multi-source real-world data to improve dialogue realism and behavioral diversity in mental health dialogue systems.
Details
Motivation: Existing patient simulation approaches for mental health dialogue systems rely on snapshot-style prompts with limited profile information, leading to homogeneous behaviors and incoherent disease progression in multi-turn interactions.
Method: Proposes DEPROFILE framework that constructs unified patient profiles by integrating demographic attributes, standardized clinical symptoms, counseling dialogues, and longitudinal life-event histories from real-world data. Introduces Chain-of-Change agent to transform noisy longitudinal records into structured, temporally grounded memory representations.
Result: Experiments across multiple LLM backbones show improved dialogue realism, behavioral diversity, and event richness, exceeding state-of-the-art baselines. Demonstrates importance of grounding patient simulation in verifiable longitudinal evidence.
Conclusion: DEPROFILE effectively addresses key challenges in patient simulation by providing comprehensive, data-grounded profiles, leading to more realistic and diverse multi-turn interactions in mental health dialogue systems.
Abstract: Patient simulation is essential for developing and evaluating mental health dialogue systems. As most existing approaches rely on snapshot-style prompts with limited profile information, homogeneous behaviors and incoherent disease progression in multi-turn interactions have become key challenges. In this work, we propose DEPROFILE, a data-grounded patient simulation framework that constructs unified, multi-source patient profiles by integrating demographic attributes, standardized clinical symptoms, counseling dialogues, and longitudinal life-event histories from real-world data. We further introduce a Chain-of-Change agent to transform noisy longitudinal records into structured, temporally grounded memory representations for simulation. Experiments across multiple large language model (LLM) backbones show that with the more comprehensive profiles constructed by DEPROFILE, dialogue realism, behavioral diversity, and event richness consistently improve and exceed state-of-the-art baselines, highlighting the importance of grounding patient simulation in verifiable longitudinal evidence.
[22] Detecting Non-Membership in LLM Training Data via Rank Correlations
Pranav Shetty, Mirazul Haque, Zhiqiang Ma, Xiaomo Liu
Main category: cs.CL
TL;DR: PRISM is a method for verifying that specific datasets were NOT used in training large language models, using correlation analysis of token log probabilities between models to detect dataset-level non-membership.
Details
Motivation: As LLMs are trained on increasingly large and opaque text corpora, there's a need to determine which data contributed to training for copyright enforcement, compliance auditing, and user trust. While prior work focuses on detecting dataset membership, the complementary problem of verifying dataset non-membership has been neglected.
Method: PRISM uses grey-box access to model logits and compares normalized token log probabilities between models. The key insight is that two models that haven’t seen a dataset exhibit higher rank correlation in their token log probabilities than when one model has been trained on that data. A correlation-based test is constructed to detect non-membership.
Result: PRISM reliably rules out membership in training data across all tested datasets while avoiding false positives, providing a framework for verifying that specific datasets were excluded from LLM training.
Conclusion: PRISM addresses the important but overlooked problem of dataset non-membership verification, offering a practical method for compliance and trust in LLM training data provenance.
Abstract: As large language models (LLMs) are trained on increasingly vast and opaque text corpora, determining which data contributed to training has become essential for copyright enforcement, compliance auditing, and user trust. While prior work focuses on detecting whether a dataset was used in training (membership inference), the complementary problem – verifying that a dataset was not used – has received little attention. We address this gap by introducing PRISM, a test that detects dataset-level non-membership using only grey-box access to model logits. Our key insight is that two models that have not seen a dataset exhibit higher rank correlation in their normalized token log probabilities than when one model has been trained on that data. Using this observation, we construct a correlation-based test that detects non-membership. Empirically, PRISM reliably rules out membership in training data across all datasets tested while avoiding false positives, thus offering a framework for verifying that specific datasets were excluded from LLM training.
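The statistic at the heart of the test is rank correlation between two models' normalized token log probabilities. A self-contained Spearman implementation (Pearson correlation on average-ranked values, so ties are handled) with an illustrative decision wrapper; the 0.9 cutoff is mine, the paper presumably calibrates its test differently:

```python
import math

def _ranks(xs):
    """1-based ranks with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation of two equal-length sequences."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

def rules_out_membership(logp_model_a, logp_model_b, threshold=0.9):
    """PRISM-style decision sketch: high rank correlation between two models'
    token log probs is evidence neither was trained on the dataset."""
    return spearman(logp_model_a, logp_model_b) >= threshold
</imports```

Training on the data perturbs one model's probability rankings relative to the other, which is what pushes the correlation down in the membership case.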
[23] How Utilitarian Are OpenAI’s Models Really? Replicating and Reinterpreting Pfeffer, Krügel, and Uhl (2025)
Johannes Himmelreich
Main category: cs.CL
TL;DR: Study finds that single-prompt evaluations of LLM moral reasoning are unreliable; multi-prompt testing reveals that apparent deontological responses in GPT-4o are actually safety refusals triggered by prompt framing, and all models converge on utilitarian answers when confounds are removed.
Details
Motivation: To investigate whether reasoning models (like o1-mini) truly produce more utilitarian responses than non-reasoning models (like GPT-4o) in moral dilemmas, and to test the robustness of these findings across different prompt variations.
Method: Replicated previous study with four current OpenAI models and extended it with prompt variant testing, specifically comparing “Should I…” vs “Is it morally permissible…” framings for the trolley problem and footbridge dilemma.
Result: The trolley finding doesn’t survive: GPT-4o’s low utilitarian rate was due to safety refusals triggered by advisory framing. When framed as “Is it morally permissible…?”, GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with limitations - reasoning models tend to give more utilitarian responses but often refuse to answer or give non-utilitarian answers.
Conclusion: Single-prompt evaluations of LLM moral reasoning are unreliable; multi-prompt robustness testing should be standard practice for any empirical claim about LLM behavior. Prompt framing significantly influences model responses, and apparent moral commitments may actually be safety mechanisms.
Abstract: Pfeffer, Krügel, and Uhl (2025) report that OpenAI’s reasoning model o1-mini produces more utilitarian responses to the trolley problem and footbridge dilemma than the non-reasoning model GPT-4o. I replicate their study with four current OpenAI models and extend it with prompt variant testing. The trolley finding does not survive: GPT-4o’s low utilitarian rate doesn’t reflect a deontological commitment but safety refusals triggered by the prompt’s advisory framing. When framed as “Is it morally permissible…?” instead of “Should I…?”, GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with blemishes. Reasoning models tend to give more utilitarian responses than non-reasoning models across prompt variations. But often they refuse to answer the dilemma or, when they answer, give a non-utilitarian rather than a utilitarian answer. These results demonstrate that single-prompt evaluations of LLM moral reasoning are unreliable: multi-prompt robustness testing should be standard practice for any empirical claim about LLM behavior.
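The multi-prompt robustness check the paper advocates is easy to operationalize: ask the same dilemma under each framing and flag disagreement. A minimal sketch with a stand-in `ask` callable in place of a real model API; the function names and framings dictionary are illustrative:

```python
def framing_probe(ask, scenario, framings):
    """Pose the same scenario under each prompt framing and record the
    answer; `ask` stands in for a call to a model API."""
    return {name: ask(template.format(scenario=scenario))
            for name, template in framings.items()}

def is_framing_sensitive(results):
    """True when different framings elicit different answers, i.e. a
    single-prompt evaluation would be unreliable."""
    return len(set(results.values())) > 1
```

The paper's key confound, advisory framing triggering safety refusals, shows up exactly as this kind of cross-framing disagreement.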
[24] Explanation Generation for Contradiction Reconciliation with LLMs
Jason Chan, Zhixue Zhao, Robert Gaizauskas
Main category: cs.CL
TL;DR: Paper introduces reconciliatory explanation generation task where LLMs must generate explanations that make contradictory statements compatible, showing most models struggle with this reasoning capability.
Details
Motivation: Current NLP treats contradictions as errors to be resolved by choosing statements, but human reasoning often involves hypothesizing explanations that reconcile contradictions. LLMs' ability to generate such reconciliatory explanations remains unexplored despite their growing reasoning capabilities.
Method: Proposes repurposing existing natural language inference (NLI) datasets for reconciliatory explanation generation, introduces quality metrics for scalable automatic evaluation, and experiments with 18 LLMs to assess their performance on this task.
Result: Most LLMs achieve limited success in reconciliatory explanation generation. The benefit of extending test-time compute by “thinking” plateaus as model size increases, indicating this is a challenging reasoning task for current models.
Conclusion: Highlights an under-explored dimension of LLM reasoning and the need to address this limitation to enhance LLMs’ downstream applications like chatbots and scientific aids.
Abstract: Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, “Cassie hates coffee” and “She buys coffee everyday” may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by “thinking” plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs’ downstream applications such as chatbots and scientific aids.
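Repurposing NLI data for this task amounts to filtering for contradiction-labelled pairs and wrapping each in a generation prompt. A sketch under assumed conventions: the field names (`premise`, `hypothesis`, `label`) follow common NLI datasets, and the prompt wording is illustrative rather than the paper's:

```python
def build_reconciliation_prompts(nli_rows):
    """Keep contradiction-labelled premise/hypothesis pairs and turn each
    into a prompt asking for an explanation that makes both true."""
    prompts = []
    for row in nli_rows:
        if row["label"] != "contradiction":
            continue
        prompts.append(
            "Statement A: {premise}\nStatement B: {hypothesis}\n"
            "Write an explanation under which both statements are true."
            .format(**row)
        )
    return prompts
```

The harder part, which the paper addresses with dedicated quality metrics, is judging whether a generated explanation genuinely renders the pair compatible.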
[25] PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation
Ruidi Chang, Jiawei Zhou, Hanjie Chen
Main category: cs.CL
TL;DR: PRISM is a framework for jointly analyzing token sequences and hidden states in LLM reasoning traces to understand how reasoning evolves across steps and layers.
Details
Motivation: Current analysis of LLM reasoning traces is limited to either token sequences across steps or hidden states within steps, lacking a unified view of how reasoning evolves across both dimensions.
Method: PRISM (Probabilistic Reasoning Inspection through Semantic and Implicit Modeling) provides a framework for joint analysis of token sequences and hidden-state vectors across reasoning steps and model layers.
Result: PRISM uncovers systematic patterns in reasoning processes, showing failed trajectories get trapped in verification loops and diverge into distinct modes like overthinking and premature commitment. It reveals how prompting reshapes reasoning behavior beyond accuracy.
Conclusion: PRISM enables observable and analyzable reasoning behaviors rather than relying solely on final-task accuracy, positioning it as a practical tool for analyzing and diagnosing reasoning processes in LLMs.
Abstract: Large language models (LLMs) solve complex problems by generating multi-step reasoning traces. Yet these traces are typically analyzed from only one of two perspectives: the sequence of tokens across different reasoning steps in the generated text, or the hidden-state vectors across model layers within one step. We introduce PRISM (Probabilistic Reasoning Inspection through Semantic and Implicit Modeling), a framework and diagnostic tool for jointly analyzing both levels, providing a unified view of how reasoning evolves across steps and layers. Across multiple reasoning models and benchmarks, PRISM uncovers systematic patterns in the reasoning process, showing that failed trajectories are more likely to become trapped in unproductive verification loops and further diverge into distinct modes such as overthinking and premature commitment, which behave differently once a candidate answer is reached. It further reveals how prompting reshapes reasoning behavior beyond aggregate accuracy by altering both semantic transitions and internal computational patterns. By modeling reasoning trajectories as structured processes, PRISM makes these behaviors observable and analyzable rather than relying solely on final-task accuracy. Taken together, these insights position PRISM as a practical tool for analyzing and diagnosing reasoning processes in LLMs.
[26] KALAVAI: Predicting When Independent Specialist Fusion Works – A Quantitative Model for Post-Hoc Cooperative LLM Training
Ramchand Kumaresan
Main category: cs.CL
TL;DR: Post-hoc fusion of independently trained domain specialists via lightweight MoE routing achieves predictable performance gains proportional to model divergence, enabling cooperative value estimation before compute commitment.
Details
Motivation: To enable practitioners to estimate the cooperative value of fusing multiple domain specialists before committing computational resources, and to develop a protocol for efficiently combining independently trained models.
Method: KALAVAI protocol: contributors fine-tune copies of a shared checkpoint independently, then submit for lightweight MoE routing (500 steps). The router learns to match domain-oracle routing, with three key requirements: shared initialization, optional frozen layers below ~10k steps, and learned routing.
Result: Predictable gains: gain = 0.82 x divergence - 2.72 (R^2 = 0.856). Consistent improvements: +7.72% at 410M, +7.49% at 1B, +6.53% at 6.9B over best specialist. Cross-lingual fusion achieves +21.76% with Yoruba perplexity dropping from 41.9 to 7.7. 20-contributor federation achieves +16.71%.
Conclusion: Post-hoc fusion of domain specialists via lightweight MoE routing provides predictable performance gains, enabling practical cooperative model development with minimal additional compute. The approach works across scales and domains, with learned routing being essential for optimal performance.
Abstract: Independently trained domain specialists can be fused post-hoc into a single model that outperforms any individual specialist, and the gain is predictable: gain = 0.82 x divergence - 2.72 (R^2 = 0.856, n=6, 3-26% divergence). This enables practitioners to estimate cooperative value before committing compute. Below ~3.3% divergence, gains approach zero. In the KALAVAI protocol, contributors fine-tune copies of a shared checkpoint independently, then submit for lightweight MoE routing (500 steps). Gains are consistent: +7.72% at 410M (+/-0.02%, 3 seeds), +7.49% at 1B (+/-0.01%, 3 seeds), +6.53% at 6.9B, each over the best specialist. The router matches domain-oracle routing within <10^{-5} nats. Cross-lingual fusion (Tamil/Yoruba/Welsh/Code) achieves +21.76%, with Yoruba perplexity falling 41.9 to 7.7. A 20-contributor federation achieves +16.71% (+/-0.07pp, 3 seeds). Three requirements bound the protocol. Shared initialisation is necessary: checkpoint mismatch degrades routing. Frozen layers are optional below ~10,000 steps and beneficial beyond. Learned routing is essential: uniform averaging degrades by -1.2% vs. best specialist, while any trained router achieves oracle-optimal assignment.
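The paper's predictive fit is simple enough to apply directly when deciding whether fusion is worth the compute. The slope and intercept below are the paper's reported values; the function names are mine:

```python
def predicted_gain(divergence_pct, slope=0.82, intercept=-2.72):
    """Paper's linear fit: expected fusion gain (%) as a function of
    specialist divergence (%), valid on the fitted 3-26% range."""
    return slope * divergence_pct + intercept

def break_even_divergence(slope=0.82, intercept=-2.72):
    """Divergence at which predicted gain crosses zero; matches the
    paper's ~3.3% floor below which fusion is not worthwhile."""
    return -intercept / slope
```

So a contributor pool whose specialists diverge by 10% would be predicted to gain about 5.5% over the best single specialist.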
[27] DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leveraging LLM-Persona
Janghyeok Choi, Jaewon Lee, Sungzoon Cho
Main category: cs.CL
TL;DR: DALDALL: Persona-based data augmentation framework for legal information retrieval using domain-specific professional personas to generate diverse synthetic queries, improving retrieval performance in low-resource legal domains.
Details
Motivation: Data scarcity in low-resource domains like legal information retrieval, where existing LLM-based augmentation methods prioritize quantity over quality and lack domain-specific strategies.
Method: Uses domain-specific professional personas (attorneys, prosecutors, judges) to generate synthetic queries with greater lexical and semantic diversity than vanilla prompting, evaluated on CLERC and COLIEE benchmarks.
Result: Persona-based augmentation achieves higher lexical diversity (measured by Self-BLEU scores) while preserving semantic fidelity. Dense retrievers fine-tuned on persona-augmented data achieve competitive or superior recall performance compared to original data or generic augmentations.
Conclusion: Persona-based prompting is an effective strategy for generating high-quality training data in specialized, low-resource domains like legal information retrieval.
Abstract: Data scarcity remains a persistent challenge in low-resource domains. While existing data augmentation methods leverage the generative capabilities of large language models (LLMs) to produce large volumes of synthetic data, these approaches often prioritize quantity over quality and lack domain-specific strategies. In this work, we introduce DALDALL, a persona-based data augmentation framework tailored for legal information retrieval (IR). Our method employs domain-specific professional personas–such as attorneys, prosecutors, and judges–to generate synthetic queries that exhibit substantially greater lexical and semantic diversity than vanilla prompting approaches. Experiments on the CLERC and COLIEE benchmarks demonstrate that persona-based augmentation achieves improvement in lexical diversity as measured by Self-BLEU scores, while preserving semantic fidelity to the original queries. Furthermore, dense retrievers fine-tuned on persona-augmented data consistently achieve competitive or superior recall performance compared to those trained on original data or generic augmentations. These findings establish persona-based prompting as an effective strategy for generating high-quality training data in specialized, low-resource domains.
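The diversity metric, Self-BLEU, scores each generated query against all the others as references; lower means more lexical diversity. A toy unigram-only version below, not the full BLEU with higher-order n-grams and brevity penalty used in practice, and with whitespace tokenization as a simplifying assumption:

```python
from collections import Counter

def modified_precision(hyp, refs, n=1):
    """Clipped n-gram precision of one token list against reference lists."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    max_ref = Counter()
    for ref in refs:
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for g, c in ref_ngrams.items():
            max_ref[g] = max(max_ref[g], c)
    total = sum(hyp_ngrams.values())
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
    return clipped / total if total else 0.0

def self_bleu(texts, n=1):
    """Average precision of each text against all others; lower = more
    lexically diverse generations."""
    toks = [t.split() for t in texts]
    scores = [modified_precision(h, toks[:i] + toks[i + 1:], n)
              for i, h in enumerate(toks)]
    return sum(scores) / len(scores)
```

On this scale, persona-conditioned query sets that reuse less vocabulary across personas would score lower than sets produced by vanilla prompting.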
[28] Span Modeling for Idiomaticity and Figurative Language Detection with Span Contrastive Loss
Blake Matheny, Phuong Minh Nguyen, Minh Le Nguyen
Main category: cs.CL
TL;DR: BERT- and RoBERTa-based models with slot loss and span contrastive loss achieve state-of-the-art performance on idiomaticity detection tasks.
Details
Motivation: Figurative language like idioms presents challenges for language models due to their non-compositional nature, where the meaning doesn't derive from individual words. Existing models often fail at immediate recognition of idioms without prompting or fine-tuning.
Method: Proposes BERT- and RoBERTa-based models fine-tuned with a combination of slot loss and span contrastive loss (SCL) with hard negative reweighting for improved idiomaticity detection.
Result: Achieves state-of-the-art sequence accuracy performance on existing datasets. Ablation studies demonstrate the effectiveness of SCL and its generalizability.
Conclusion: The proposed approach with span contrastive loss significantly improves idiomaticity detection, and the geometric mean of F1 and sequence accuracy provides a better evaluation metric for span awareness and overall performance.
Abstract: The category of figurative language contains many varieties, some of which are non-compositional in nature. This type of phrase or multi-word expression (MWE) includes idioms, which represent a single meaning that does not consist of the sum of its words. For language models, this presents a unique problem due to tokenization and adjacent contextual embeddings. Many large language models have overcome this issue with large phrase vocabulary, though immediate recognition frequently fails without one- or few-shot prompting or instruction finetuning. The best results have been achieved with BERT-based or LSTM finetuning approaches. The model in this paper contains one such variety. We propose BERT- and RoBERTa-based models finetuned with a combination of slot loss and span contrastive loss (SCL) with hard negative reweighting to improve idiomaticity detection, attaining state-of-the-art sequence accuracy performance on existing datasets. Comparative ablation studies show the effectiveness of SCL and its generalizability. The geometric mean of F1 and sequence accuracy (SA) is also proposed to assess a model’s span awareness and general performance together.
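The combined metric the paper proposes is just the geometric mean of span F1 and sequence accuracy; the function name below is mine. Unlike the arithmetic mean, the geometric mean penalizes a model that does well on one axis but poorly on the other:

```python
import math

def combined_score(f1, seq_acc):
    """Geometric mean of span F1 and sequence accuracy (SA): the joint
    metric proposed for span awareness plus overall performance."""
    return math.sqrt(f1 * seq_acc)
```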
[29] Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration
Qiyao Sun, Xingming Li, Xixiang He, Ao Cheng, Xuanyu Ji, Hailun Lu, Runke Huang, Qingyong Hu
Main category: cs.CL
TL;DR: Adaptive Bayesian framework for efficient hallucination detection in LLMs using dynamic sampling based on uncertainty
Details
Motivation: Existing hallucination detection methods use fixed sampling budgets that don't adapt to query complexity, leading to computational inefficiency.
Method: Adaptive Bayesian Estimation framework with Guided Semantic Exploration: hierarchical Bayesian modeling of the semantic distribution, dynamic sampling control via variance-based thresholds, and perturbation-based importance sampling.
Result: Achieves superior hallucination detection with 50% fewer samples for comparable performance, 12.6% average AUROC improvement under same budget
Conclusion: Proposed adaptive framework significantly improves efficiency and performance of hallucination detection in LLMs
Abstract: Large language models (LLMs) have achieved remarkable success in various natural language processing tasks, yet they remain prone to generating factually incorrect outputs known as hallucinations. While recent approaches have shown promise for hallucination detection by repeatedly sampling from LLMs and quantifying the semantic inconsistency among the generated responses, they rely on fixed sampling budgets that fail to adapt to query complexity, resulting in computational inefficiency. We propose an Adaptive Bayesian Estimation framework for Semantic Entropy with Guided Semantic Exploration, which dynamically adjusts sampling requirements based on observed uncertainty. Our approach employs a hierarchical Bayesian framework to model the semantic distribution, enabling dynamic control of sampling iterations through variance-based thresholds that terminate generation once sufficient certainty is achieved. We also develop a perturbation-based importance sampling strategy to systematically explore the semantic space. Extensive experiments on four QA datasets demonstrate that our method achieves superior hallucination detection performance with significant efficiency gains. In low-budget scenarios, our approach requires about 50% fewer samples to achieve comparable detection performance to existing methods, while delivering an average AUROC improvement of 12.6% under the same sampling budget.
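The variance-based early-stopping idea can be sketched with a heavy simplification: instead of the paper's hierarchical model of semantic entropy, track a single Beta-Bernoulli posterior over the rate of semantically inconsistent responses and stop sampling once its posterior variance is small. Everything here, names, prior, and thresholds, is illustrative:

```python
def adaptive_sampling(sample_inconsistent, var_threshold=0.01,
                      min_samples=3, max_samples=30):
    """Draw responses until the Beta posterior over the inconsistency
    rate is certain enough; `sample_inconsistent` stands in for
    generating a response and checking it against the others (0 or 1)."""
    alpha, beta = 1.0, 1.0  # uniform Beta prior
    n = 0
    while n < max_samples:
        x = sample_inconsistent()
        alpha += x
        beta += 1 - x
        n += 1
        total = alpha + beta
        var = alpha * beta / (total ** 2 * (total + 1))  # Beta variance
        if n >= min_samples and var < var_threshold:
            break
    return alpha / (alpha + beta), n
```

Easy queries, whose responses all agree, drive the variance down quickly and terminate early, which is where the reported ~50% sample savings come from.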
[30] When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning
Abhinaba Basu, Pavan Chakraborty
Main category: cs.CL
TL;DR: Paper introduces step-level evaluation to test if language models’ chain-of-thought reasoning is genuinely used or decorative, finding most frontier models produce decorative reasoning that doesn’t affect final answers.
Details
Motivation: To determine whether language models' step-by-step reasoning explanations are genuinely used in decision-making or are just decorative narratives generated after the model has already decided the answer.
Method: Step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. Simple test requiring only API access, costing $1-2 per model per task. Tested 10 frontier models across sentiment, mathematics, topic classification, and medical QA tasks.
Result: The majority of frontier models produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone recovers the answer. Only two models show some genuine step dependence on specific tasks. The study also uncovers an “output rigidity” phenomenon, and mechanistic analysis shows that CoT attention drops more in late layers for decorative tasks.
Conclusion: Step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives - not scale - determine whether reasoning is genuine.
Abstract: Language models increasingly “show their work” by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? Consider: a medical AI writes “The patient’s eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B.” If we remove the eosinophilia observation, does the diagnosis change? For most frontier models, the answer is no - the step was decorative. We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access – no model weights – and costs approximately $1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) across sentiment, mathematics, topic classification, and medical QA (N=376-500 each), the majority produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone recovers the answer. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern: MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%) - but both shortcut other tasks. Faithfulness is model-specific and task-specific. We also discover “output rigidity”: on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that CoT attention drops more in late layers for decorative tasks (33%) than faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives - not scale - determine whether reasoning is genuine.
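The leave-one-out test is straightforward to implement against any API: re-ask the question with each reasoning step deleted and count how often the answer flips. A sketch in which `answer_fn` stands in for the model call; the necessity score returned is the fraction of steps whose removal changes the answer:

```python
def step_necessity(answer_fn, question, steps):
    """Fraction of reasoning steps that are 'necessary', i.e. whose
    removal flips the final answer. answer_fn(question, steps) stands
    in for an API call that answers given the retained steps."""
    full = answer_fn(question, steps)
    flips = 0
    for i in range(len(steps)):
        if answer_fn(question, steps[:i] + steps[i + 1:]) != full:
            flips += 1
    return flips / len(steps)
```

A score near zero is the paper's "decorative reasoning" signature: the answer survives no matter which step is deleted.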
[31] RadTimeline: Timeline Summarization for Longitudinal Radiological Lung Findings
Sitong Zhou, Meliha Yetisgen, Mari Ostendorf
Main category: cs.CL
TL;DR: This paper introduces RadTimeline, a structured summarization approach for longitudinal radiology reports that organizes findings into timelines with temporal grouping for disease progression tracking.
Details
Motivation: Tracking findings in longitudinal radiology reports is crucial for identifying disease progression, but manual summarization is time-consuming. There's a need for automatic structured summarization that enables easy comparison of findings across time and facilitates fact-checking.
Method: Frames longitudinal report summarization as timeline generation with dated findings organized in columns and temporally related findings grouped in rows. Uses a 3-step LLM process: (1) extracting findings, (2) generating group names, and (3) using names to group findings. Creates RadTimeline dataset focused on lung-related radiologic findings in chest imaging reports.
Result: Experiments show tradeoffs between different-sized LLMs and prompting strategies. Group name generation as an intermediate step is critical for effective finding grouping. Best configuration achieves good recall with some irrelevant findings, and grouping performance is comparable to human annotators.
Conclusion: The structured timeline approach enables straightforward comparison of findings across time and facilitates fact-checking. The 3-step LLM process with intermediate group name generation is effective for organizing longitudinal radiology findings.
Abstract: Tracking findings in longitudinal radiology reports is crucial for accurately identifying disease progression, and the time-consuming process would benefit from automatic summarization. This work introduces a structured summarization task, where we frame longitudinal report summarization as a timeline generation task, with dated findings organized in columns and temporally related findings grouped in rows. This structured summarization format enables straightforward comparison of findings across time and facilitates fact-checking against the associated reports. The timeline is generated using a 3-step LLM process of extracting findings, generating group names, and using the names to group the findings. To evaluate such systems, we create RadTimeline, a timeline dataset focused on tracking lung-related radiologic findings in chest-related imaging reports. Experiments on RadTimeline show tradeoffs of different-sized LLMs and prompting strategies. Our results highlight that group name generation as an intermediate step is critical for effective finding grouping. The best configuration has some irrelevant findings but very good recall, and grouping performance is comparable to human annotators.
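The target structure, temporally related findings as rows and report dates as columns, is a small data-shaping step once findings are extracted and named. A sketch in which `group_name_fn` stands in for the paper's LLM steps of generating group names and assigning findings to them; the function names are mine:

```python
def build_timeline(dated_findings, group_name_fn):
    """Arrange (date, finding) pairs into timeline rows keyed by a
    generated group name, with one cell per report date."""
    timeline = {}
    for date, finding in dated_findings:
        timeline.setdefault(group_name_fn(finding), {})[date] = finding
    return timeline
```

Each row then reads as the progression of one finding across reports, which is what enables the side-by-side comparison and fact-checking the paper emphasizes.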
[32] Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts
Maida Aizaz, Quang Minh Nguyen
Main category: cs.CL
TL;DR: LLMs show systematic biases in generating Palestinian and Israeli personas, with Palestinian profiles often assigned lower socioeconomic status and survival roles in war contexts, while Israeli profiles maintain middle-class professional attributes, even when prompted for fairness.
Details
Motivation: As LLMs are increasingly used for social simulation and persona generation, understanding how they represent geopolitical identities is crucial, particularly regarding potential biases in representing Palestinian and Israeli identities across different contexts.
Method: Analyzed personas generated for Palestinian and Israeli identities by five popular LLMs across 640 experimental conditions, varying context (war vs non-war) and assigned roles, with analysis of distributional patterns and reasoning traces.
Result: Palestinian profiles in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, while Israeli profiles predominantly retain middle-class status and specialized professional attributes. Fairness prompts led to diverse changes (increased non-binary gender, generic roles) but underlying socioeconomic distinctions often remained.
Conclusion: LLMs interpret geopolitical contexts with systematic biases, and while they process fairness concepts in reasoning, there’s no consistent translation of fairness into representative outcomes, revealing complex dynamics between model reasoning and generation.
Abstract: Large language models (LLMs) are increasingly utilised for social simulation and persona generation, necessitating an understanding of how they represent geopolitical identities. In this paper, we analyse personas generated for Palestinian and Israeli identities by five popular LLMs across 640 experimental conditions, varying context (war vs non-war) and assigned roles. We observe significant distributional patterns in the generated attributes: Palestinian profiles in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, whereas Israeli profiles predominantly retain middle-class status and specialised professional attributes. When prompted with explicit instructions to avoid harmful assumptions, models exhibit diverse distributional changes, e.g., marked increases in non-binary gender inferences or a convergence toward generic occupational roles (e.g., "student"), while the underlying socioeconomic distinctions often remain. Furthermore, analysis of reasoning traces reveals interesting dynamics between model reasoning and generation: while rationales consistently mention fairness-related concepts, the final generated personas follow the aforementioned diverse distributional changes. These findings illustrate how models interpret geopolitical contexts, while suggesting that they process fairness and adjust in varied ways; there is no consistent, direct translation of fairness concepts into representative outcomes.
[33] Avoiding Over-smoothing in Social Media Rumor Detection with Pre-trained Propagation Tree Transformer
Chaoqun Cui, Caiyan Jia
Main category: cs.CL
TL;DR: P2T3: A Pre-Trained Propagation Tree Transformer for rumor detection that addresses GNN over-smoothing issues by using pure Transformer architecture on conversation chains extracted from propagation trees.
Details
Motivation: GNN-based rumor detection methods suffer from over-smoothing issues when processing rumor propagation structures, particularly due to the structural characteristics of propagation trees where most nodes are 1-level nodes. GNNs also struggle to capture long-range dependencies in these trees.
Method: Proposes P2T3 (Pre-Trained Propagation Tree Transformer) based on pure Transformer architecture. It extracts all conversation chains from tree structures following reply propagation direction, uses token-wise embedding to infuse connection information with inductive bias, and pre-trains on large-scale unlabeled datasets.
Result: P2T3 surpasses previous state-of-the-art methods on multiple benchmark datasets and performs well under few-shot conditions. It avoids the over-smoothing issue inherent in GNNs.
Conclusion: P2T3 provides an effective solution to GNN limitations in rumor detection and potentially offers a foundation for large models or unified multi-modal schemes for future social media research.
Abstract: Deep learning techniques for rumor detection typically utilize Graph Neural Networks (GNNs) to analyze post relations. These methods, however, falter due to over-smoothing issues when processing rumor propagation structures, leading to declining performance. Our investigation into this issue reveals that over-smoothing is intrinsically tied to the structural characteristics of rumor propagation trees, in which the majority of nodes are 1-level nodes. Furthermore, GNNs struggle to capture long-range dependencies within these trees. To circumvent these challenges, we propose a Pre-Trained Propagation Tree Transformer (P2T3) method based on pure Transformer architecture. It extracts all conversation chains from a tree structure following the propagation direction of replies, utilizes token-wise embedding to infuse connection information and introduces necessary inductive bias, and pre-trains on large-scale unlabeled datasets. Experiments indicate that P2T3 surpasses previous state-of-the-art methods on multiple benchmark datasets and performs well under few-shot conditions. P2T3 not only avoids the over-smoothing issue inherent in GNNs but also potentially offers a foundation for large models or unified multi-modal schemes in future social media research.
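The chain-extraction step above can be illustrated in a few lines: every root-to-leaf reply path in the propagation tree becomes one conversation chain fed to the Transformer. The tree encoding (a parent-to-children dict) and names are illustrative assumptions, not P2T3's actual data format:

```python
# Minimal sketch: enumerate all conversation chains (root-to-leaf reply
# paths) in a propagation tree, following the direction of replies.

def extract_chains(tree: dict, node: str = "root") -> list[list[str]]:
    """Return all conversation chains starting from `node`."""
    children = tree.get(node, [])
    if not children:                 # a leaf post ends its chain
        return [[node]]
    chains = []
    for child in children:
        for chain in extract_chains(tree, child):
            chains.append([node] + chain)
    return chains

# A source post with two direct replies, one of which is replied to again:
tree = {"root": ["r1", "r2"], "r2": ["r3"]}
print(extract_chains(tree))  # [['root', 'r1'], ['root', 'r2', 'r3']]
```

Since most nodes in real propagation trees are direct (1-level) replies, most extracted chains are short, which is exactly the structure the paper argues GNN message passing handles poorly.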
[34] EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
Yixuan Wang, Shiyu Ji, Yijun Liu, Qingfu Zhu, Wanxiang Che
Main category: cs.CL
TL;DR: EchoKV is a flexible KV cache compression method for LLMs that enables on-demand switching between standard and compressed inference using lightweight residual reconstruction, outperforming existing methods across compression ratios.
Details
Motivation: The increasing memory demand of the KV cache poses a significant bottleneck for LLMs in long-context applications. Existing low-rank compression methods often rely on irreversible parameter transformations, sacrificing the flexibility to switch back to full-precision inference when memory is abundant.
Method: EchoKV uses a lightweight network to reconstruct residual KV components from a partial subset, leveraging intrinsic inter-layer and intra-layer similarities among attention heads. It introduces a two-stage fine-tuning strategy for rapid, low-cost training.
Result: Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across various compression ratios while maintaining high throughput for short-context scenarios.
Conclusion: EchoKV provides a flexible KV cache compression solution that enables efficient long-context processing while maintaining the ability to switch back to standard inference when needed.
Abstract: The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank compression methods often rely on irreversible parameter transformations, sacrificing the flexibility to switch back to full-precision inference when memory is abundant. In this paper, we propose EchoKV, a flexible KV cache compression scheme that enables on-demand transitions between standard and compressed inference. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the residual KV components from a partial subset, leveraging intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a two-stage fine-tuning strategy that allows for rapid, low-cost training (e.g., ~1 A100 GPU-hour for a 7B model). Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across various compression ratios while maintaining high throughput for short-context scenarios.
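A toy numeric sketch of the core idea: store only a subset of KV heads and rebuild the rest from them, exploiting inter-head similarity. The simple linear map below stands in for EchoKV's lightweight reconstruction network; all shapes, names, and the mixing weights are illustrative assumptions:

```python
# Sketch: compress a KV cache by keeping a subset of heads, then
# reconstruct the missing heads as learned weighted sums of stored ones.

def compress(kv_heads: list[list[float]], keep: list[int]) -> dict:
    """Keep only the selected heads; the rest will be reconstructed."""
    return {i: kv_heads[i] for i in keep}

def reconstruct(stored: dict, weights: dict, n_heads: int) -> list:
    """Stored heads pass through unchanged; each missing head j is
    predicted from the stored heads via its weight map weights[j]."""
    dim = len(next(iter(stored.values())))
    full = [None] * n_heads
    for i, vec in stored.items():
        full[i] = vec
    for j in range(n_heads):
        if full[j] is None:
            full[j] = [
                sum(w * stored[i][d] for i, w in weights[j].items())
                for d in range(dim)
            ]
    return full

kv = [[1.0, 2.0], [2.0, 4.0]]          # head 1 is 2x head 0 (similar heads)
stored = compress(kv, keep=[0])        # memory: one head instead of two
full = reconstruct(stored, {1: {0: 2.0}}, n_heads=2)
print(full)  # [[1.0, 2.0], [2.0, 4.0]]
```

Because the stored heads are kept verbatim, dropping the reconstruction step recovers standard full-precision inference, which is the on-demand switching the paper emphasizes.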
[35] Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset
Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba
Main category: cs.CL
TL;DR: Multilingual KokoroChat: A counseling dialogue dataset created by translating Japanese counseling corpus into English and Chinese using a novel multi-LLM ensemble method that outperforms individual LLMs for translation quality.
Details
Motivation: Addressing the critical scarcity of high-quality, publicly available counseling dialogue datasets, particularly for multilingual applications, and overcoming the limitations of single LLMs for sensitive-domain translation where the highest fidelity is essential.
Method: Developed a multi-LLM ensemble method: first generates diverse translation hypotheses from multiple distinct LLMs, then uses a single LLM to produce a high-quality translation by analyzing the strengths and weaknesses of all presented hypotheses.
Result: Human preference studies confirmed that translations produced by the ensemble method were preferred over any individual state-of-the-art LLM, demonstrating superior quality. The Multilingual KokoroChat dataset is publicly available.
Conclusion: The multi-LLM ensemble approach effectively addresses translation quality challenges in sensitive domains like counseling, producing a high-quality multilingual counseling dialogue dataset that outperforms single-model approaches.
Abstract: To address the critical scarcity of high-quality, publicly available counseling dialogue datasets, we created Multilingual KokoroChat by translating KokoroChat, a large-scale manually authored Japanese counseling corpus, into both English and Chinese. A key challenge in this process is that the optimal model for translation varies by input, making it impossible for any single model to consistently guarantee the highest quality. In a sensitive domain like counseling, where the highest possible translation fidelity is essential, relying on a single LLM is therefore insufficient. To overcome this challenge, we developed and employed a novel multi-LLM ensemble method. Our approach first generates diverse hypotheses from multiple distinct LLMs. A single LLM then produces a high-quality translation based on an analysis of the respective strengths and weaknesses of all presented hypotheses. The quality of "Multilingual KokoroChat" was rigorously validated through human preference studies, which confirmed that the translations produced by our ensemble method were preferred over those of any individual state-of-the-art LLM. The Multilingual KokoroChat dataset is available at https://github.com/UEC-InabaLab/MultilingualKokoroChat.
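The two-stage ensemble can be sketched as a short pipeline: several translators each propose a hypothesis, then one synthesis model sees all of them and produces the final translation. The model calls are stubbed and the prompt wording is an assumption, not the paper's actual prompts:

```python
# Sketch of a multi-LLM ensemble translation: stage 1 collects diverse
# hypotheses, stage 2 lets a single LLM weigh them and produce the output.

def ensemble_translate(source: str, translators, synthesizer) -> str:
    # Stage 1: diverse hypotheses from multiple distinct LLMs.
    hypotheses = [t(source) for t in translators]
    # Stage 2: one LLM analyzes strengths/weaknesses of all hypotheses.
    prompt = (
        f"Source (Japanese): {source}\n"
        + "\n".join(f"Hypothesis {i + 1}: {h}" for i, h in enumerate(hypotheses))
        + "\nProduce the best English translation."
    )
    return synthesizer(prompt)

# Toy stubs standing in for real model APIs:
translators = [lambda s: "hello", lambda s: "hi there"]
synthesizer = lambda p: "hello there"
print(ensemble_translate("こんにちは", translators, synthesizer))
```

In practice each `translator` and the `synthesizer` would wrap a different LLM endpoint; the synthesis prompt is where the "analysis of strengths and weaknesses" described above would be elicited.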
[36] Quality Over Clicks: Intrinsic Quality-Driven Iterative Reinforcement Learning for Cold-Start E-Commerce Query Suggestion
Qi Sun, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng
Main category: cs.CL
TL;DR: Cold-EQS is an iterative reinforcement learning framework for cold-start e-commerce query suggestion that uses answerability, factuality, and information gain as rewards to optimize query quality without relying on click data.
Details
Motivation: Existing dialogue systems rely on query suggestion with CTR models, but fail in cold-start scenarios due to heavy dependence on abundant click data for CTR model training.
Method: Proposes Cold-EQS using iterative reinforcement learning with answerability, factuality, and information gain as rewards. Estimates uncertainty for grouped candidate queries to select hard/ambiguous samples from queries lacking click signals.
Result: Achieves significant +6.81% improvement in online chatUV. Strong positive correlation between online and offline effectiveness demonstrated. Provides EQS-Benchmark with 16,949 online user queries.
Conclusion: Cold-EQS effectively addresses cold-start query suggestion problem in e-commerce without requiring click data, showing superiority over existing approaches.
Abstract: Existing dialogue systems rely on Query Suggestion (QS) to enhance user engagement. Recent efforts typically employ large language models with a Click-Through Rate (CTR) model, yet fail in cold-start scenarios due to their heavy reliance on abundant online click data for effective CTR model training. To bridge this gap, we propose Cold-EQS, an iterative reinforcement learning framework for Cold-Start E-commerce Query Suggestion (EQS). Specifically, we leverage answerability, factuality, and information gain as rewards to continuously optimize the quality of suggested queries. To keep improving our QS model, we estimate uncertainty for grouped candidate suggested queries to select hard and ambiguous samples from online user queries lacking click signals. In addition, we provide an EQS-Benchmark comprising 16,949 online user queries for offline training and evaluation. Extensive offline and online experiments consistently demonstrate a strong positive correlation between online and offline effectiveness, and both sets of results demonstrate the superiority of our Cold-EQS, achieving a significant +6.81% improvement in online chatUV.
[37] Set-Valued Prediction for Large Language Models with Feasibility-Aware Coverage Guarantees
Ye Li, Anqi Hu, Yuanchang Ye, Shiyan Tong, Zhiyuan Wang, Bo Fu
Main category: cs.CL
TL;DR: A framework for set-valued prediction in LLMs that provides coverage guarantees by constructing prediction sets from multiple samples, addressing the limitation of single point predictions.
Details
Motivation: Traditional LLM usage reports only the most likely generation as a point prediction, which underestimates model capability. Valid answers may exist in the broader output space and can be discovered through repeated sampling, motivating a shift to set-valued prediction.
Method: Proposes a principled framework for set-valued prediction with feasibility-aware coverage guarantees. Establishes a minimum achievable risk level (MRL) and develops a data-driven calibration procedure that constructs prediction sets from sampled responses using rigorous thresholds.
Result: Extensive experiments on six language generation tasks with five LLMs demonstrate both statistical validity and predictive efficiency of the framework.
Conclusion: The framework successfully moves from point prediction to set-valued prediction, providing coverage guarantees when feasible and addressing the finite-sampling limitations of LLM generation.
Abstract: Large language models (LLMs) inherently operate over a large generation space, yet conventional usage typically reports the most likely generation (MLG) as a point prediction, which underestimates the model’s capability: although the top-ranked response can be incorrect, valid answers may still exist within the broader output space and can potentially be discovered through repeated sampling. This observation motivates moving from point prediction to set-valued prediction, where the model produces a set of candidate responses rather than a single MLG. In this paper, we propose a principled framework for set-valued prediction, which provides feasibility-aware coverage guarantees. We show that, given the finite-sampling nature of LLM generation, coverage is not always achievable: even with multiple samplings, LLMs may fail to yield an acceptable response for certain questions within the sampled candidate set. To address this, we establish a minimum achievable risk level (MRL), below which statistical coverage guarantees cannot be satisfied. Building on this insight, we then develop a data-driven calibration procedure that constructs prediction sets from sampled responses by estimating a rigorous threshold, ensuring that the resulting set contains a correct answer with a desired probability whenever the target risk level is feasible. Extensive experiments on six language generation tasks with five LLMs demonstrate both the statistical validity and the predictive efficiency of our framework.
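A conformal-style sketch of the calibration idea: on held-out questions with sampled, scored responses, pick the largest score threshold whose prediction sets still cover a correct answer at the target rate. The scoring scheme, data format, and fallback behavior below are simplified assumptions, not the paper's actual procedure:

```python
# Sketch: calibrate a score threshold on held-out data, then build
# prediction sets from sampled responses at inference time.

def calibrate_threshold(calib: list, alpha: float) -> float:
    """calib: list of (responses, labels), where responses are
    (text, score) pairs and labels is the set of acceptable answers.
    Returns the largest threshold with empirical coverage >= 1 - alpha."""
    scores = sorted({s for responses, _ in calib for _, s in responses},
                    reverse=True)
    for t in scores:
        covered = sum(
            any(s >= t and text in labels for text, s in responses)
            for responses, labels in calib
        )
        if covered / len(calib) >= 1 - alpha:
            return t
    # Feasibility limit: even keeping every sampled response may not
    # reach the target coverage (the paper's MRL phenomenon).
    return min(scores)

def prediction_set(responses: list, t: float) -> list:
    return [text for text, s in responses if s >= t]

calib = [
    ([("A", 0.9), ("B", 0.2)], {"A"}),
    ([("C", 0.6), ("D", 0.5)], {"C"}),
]
t = calibrate_threshold(calib, alpha=0.1)
print(t, prediction_set([("X", 0.7), ("Y", 0.3)], t))  # 0.6 ['X']
```

The fallback branch reflects the paper's key observation: with finite sampling, some target risk levels are simply infeasible, no matter how the threshold is chosen.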
[38] DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube
Jawid Ahmad Baktash, Mosa Ebrahimi, Mohammad Zarif Joya, Mursal Dawodi
Main category: cs.CL
TL;DR: DariMis: First manually annotated dataset of 9,224 Dari-language YouTube videos for misinformation detection with dual labeling (Information Type and Harm Level), showing structural coupling between misinformation and harm potential.
Details
Motivation: Dari, spoken by tens of millions in Afghanistan, is largely absent from misinformation detection literature. There's a need for resources to address misinformation in this under-resourced language, especially given its potential real-world harm.
Method: Created the DariMis dataset with 9,224 YouTube videos manually annotated across Information Type (Misinformation, Partly True, True) and Harm Level (Low, Medium, High). Proposed a pair-input encoding strategy treating video title and description as separate BERT segments to model the semantic relationship between headline claims and body content. Benchmarked ParsBERT (Dari/Farsi-specialized) against XLM-RoBERTa-base.
Result: Found structural coupling: 55.9% of Misinformation carries at least Medium harm potential vs. only 1.0% of True content. Pair-input encoding improved Misinformation recall by 7.0 percentage points (60.1% to 67.1%). ParsBERT achieved best performance with 76.60% accuracy and 72.77% macro F1. Information Type classifiers can function as implicit harm-triage filters.
Conclusion: DariMis addresses critical gap in multilingual misinformation detection, showing strong coupling between misinformation and harm potential in Dari content. The pair-input encoding strategy effectively captures semantic relationships between headlines and body content for improved misinformation detection.
Abstract: Dari, the primary language of Afghanistan, is spoken by tens of millions of people yet remains largely absent from the misinformation detection literature. We address this gap with DariMis, the first manually annotated dataset of 9,224 Dari-language YouTube videos, labeled across two dimensions: Information Type (Misinformation, Partly True, True) and Harm Level (Low, Medium, High). A central empirical finding is that these dimensions are structurally coupled, not independent: 55.9 percent of Misinformation carries at least Medium harm potential, compared with only 1.0 percent of True content. This enables Information Type classifiers to function as implicit harm-triage filters in content moderation pipelines. We further propose a pair-input encoding strategy that represents the video title and description as separate BERT segment inputs, explicitly modeling the semantic relationship between headline claims and body content, a key signal of misleading information. An ablation study against single-field concatenation shows that pair-input encoding yields a 7.0 percentage point gain in Misinformation recall (60.1 percent to 67.1 percent), the safety-critical minority class, despite modest overall macro F1 differences (0.09 percentage points). We benchmark a Dari/Farsi-specialized model (ParsBERT) against XLM-RoBERTa-base; ParsBERT achieves the best test performance with accuracy of 76.60 percent and macro F1 of 72.77 percent. Bootstrap 95 percent confidence intervals are reported for all metrics, and we discuss both the practical significance and statistical limitations of the results.
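The pair-input encoding above puts the title and description into the two BERT segments ([CLS] title [SEP] description [SEP]) with distinct segment ids, so the model can attend across the headline claim and the body content. In practice one would pass the description as `text_pair` to a HuggingFace tokenizer; the hand-rolled version below just shows the layout and is not the paper's code:

```python
# Sketch: pair-input encoding of (title, description) as two BERT segments.

def pair_encode(title: str, description: str):
    tokens = (["[CLS]"] + title.split() + ["[SEP]"]
              + description.split() + ["[SEP]"])
    # Segment 0 covers [CLS] + title + first [SEP]; segment 1 the rest.
    seg_boundary = len(title.split()) + 2
    segment_ids = [0] * seg_boundary + [1] * (len(tokens) - seg_boundary)
    return tokens, segment_ids

tokens, segs = pair_encode("miracle cure found", "doctors hate this trick")
print(tokens)
print(segs)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Compared with single-field concatenation, the explicit segment boundary is what lets the model treat "does the body support the headline?" as a sentence-pair task, the signal credited for the Misinformation-recall gain.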
[39] Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation
Nils A. Herrmann, Tobias Eder, Jingyi He, Georg Groh
Main category: cs.CL
TL;DR: Fine-grained annotation scheme distinguishes incivility (tone) from intolerance (content) in multimodal toxicity detection, improving model performance and error balance compared to binary hatefulness labels.
Details
Motivation: Current multimodal toxicity benchmarks use coarse binary hatefulness labels that conflate tone and content, limiting model understanding and moderation effectiveness.
Method: Introduced a fine-grained annotation scheme separating incivility (rude tone) and intolerance (attacks on pluralism/identities), applied to 2,030 memes from the Hateful Memes dataset. Evaluated VLMs under coarse-label training, transfer learning, and joint learning combining coarse and fine-grained labels.
Result: Fine-grained annotations complement coarse labels and improve overall model performance when used jointly. Models with fine-grained scheme show more balanced error profiles and reduced under-detection of harmful content (FNR-FPR improved from 0.74 to 0.42 for LLaVA-1.6-Mistral-7B; 0.54 to 0.28 for Qwen2.5-VL-7B).
Conclusion: Fine-grained annotations improve reliability and accuracy of multimodal moderation systems. Combining coarse and fine-grained labels provides practical route to more reliable content moderation through enhanced data quality.
Abstract: Current multimodal toxicity benchmarks typically use a single binary hatefulness label. This coarse approach conflates two fundamentally different characteristics of expression: tone and content. Drawing on communication science theory, we introduce a fine-grained annotation scheme that distinguishes two separable dimensions: incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities) and apply it to 2,030 memes from the Hateful Memes dataset. We evaluate different vision-language models under coarse-label training, transfer learning across label schemes and a joint learning approach that combines the coarse hatefulness label with our fine-grained annotations. Our results show that fine-grained annotations complement existing coarse labels and, when used jointly, improve overall model performance. Moreover, models trained with the fine-grained scheme exhibit more balanced moderation-relevant error profiles and are less prone to under-detection of harmful content than models trained on hatefulness labels alone (FNR-FPR, the difference between false negative and false positive rates: 0.74 to 0.42 for LLaVA-1.6-Mistral-7B; 0.54 to 0.28 for Qwen2.5-VL-7B). This work contributes to data-centric approaches in content moderation by improving the reliability and accuracy of moderation systems through enhanced data quality. Overall, combining both coarse and fine-grained labels provides a practical route to more reliable multimodal moderation.
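The error-balance metric reported above (FNR minus FPR) is computed directly from a confusion matrix; a small worked sketch with toy counts (not the paper's numbers):

```python
# Sketch: the moderation error balance FNR - FPR. Positive values mean
# the model under-detects harmful content; negative values mean it
# over-flags benign content.

def fnr_minus_fpr(tp: int, fp: int, fn: int, tn: int) -> float:
    fnr = fn / (fn + tp)   # harmful content the model missed
    fpr = fp / (fp + tn)   # benign content wrongly flagged
    return fnr - fpr

# Toy counts for illustration:
print(round(fnr_minus_fpr(tp=30, fp=5, fn=70, tn=95), 2))  # 0.65
```

A drop such as the reported 0.74 to 0.42 therefore means the model's misses and false alarms moved closer to balance, with fewer harmful memes slipping through.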
[40] PaperVoyager: Building Interactive Web with Visual Language Models
Dasen Dai, Biao Wu, Meng Fang, Wenhao Wang
Main category: cs.CL
TL;DR: Paper-to-Interactive-System Agent converts research papers into executable interactive web systems for dynamic mechanism exploration
Details
Motivation: Existing document agents create static artifacts (summaries, webpages, slides), which are insufficient for technical papers involving dynamic mechanisms and state transitions.
Method: The PaperVoyager framework performs end-to-end processing: paper understanding, system modeling, and interactive webpage synthesis; it explicitly models mechanisms and interaction logic during synthesis.
Result: PaperVoyager significantly improves quality of generated interactive systems; benchmark of 19 research papers with expert-built interactive systems as ground truth
Conclusion: Offers new paradigm for interactive scientific paper understanding by enabling users to manipulate inputs and observe dynamic behaviors
Abstract: Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.
[41] Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents
Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Main category: cs.CL
TL;DR: Memory-augmented inference framework uses lightweight 8B model with conversational memory to handle repetitive user queries efficiently, achieving 69% of 235B model performance at 96% cost reduction.
Details
Motivation: Production AI agents face high computational costs from processing repetitive user queries (up to 47% semantically similar). Current approaches treat each query independently, missing opportunities to leverage conversational memory for efficiency gains.
Method: Proposed a memory-augmented inference framework in which a lightweight 8B-parameter model uses retrieved conversational context to answer queries via a low-cost inference path. Uses hybrid retrieval (BM25 + cosine similarity) without additional training or labeled data.
Result: Achieves 30.5% F1, recovering 69% of 235B model performance while reducing effective cost by 96%. Memory improves correctness by grounding responses in retrieved user-specific information rather than altering routing decisions. Hybrid retrieval adds +7.7 F1 improvement.
Conclusion: For user-specific queries, access to relevant conversational memory outweighs model scale. Memory, not model size, is primary driver of accuracy and efficiency in persistent AI agents, with performance gap narrowing as memory accumulates over time.
Abstract: Production AI agents frequently receive user-specific queries that are highly repetitive, with up to 47% being semantically similar to prior interactions, yet each query is typically processed with the same computational cost. We argue that this redundancy can be exploited through conversational memory, transforming repetition from a cost burden into an efficiency advantage. We propose a memory-augmented inference framework in which a lightweight 8B-parameter model leverages retrieved conversational context to answer all queries via a low-cost inference path. Without any additional training or labeled data, this approach achieves 30.5% F1, recovering 69% of the performance of a full-context 235B model while reducing effective cost by 96%. Notably, a 235B model without memory (13.7% F1) underperforms even the standalone 8B model (15.4% F1), indicating that for user-specific queries, access to relevant knowledge outweighs model scale. We further analyze the role of routing and confidence. At practical confidence thresholds, routing alone already directs 96% of queries to the small model, but yields poor accuracy (13.0% F1) due to confident hallucinations. Memory does not substantially alter routing decisions; instead, it improves correctness by grounding responses in retrieved user-specific information. As conversational memory accumulates over time, coverage of recurring topics increases, further narrowing the performance gap. We evaluate on 152 LoCoMo questions (Qwen3-8B/235B) and 500 LongMemEval questions. Incorporating hybrid retrieval (BM25 + cosine similarity) improves performance by an additional +7.7 F1, demonstrating that retrieval quality directly enhances end-to-end system performance. Overall, our results highlight that memory, rather than model size, is the primary driver of accuracy and efficiency in persistent AI agents.
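The hybrid retrieval step can be sketched by mixing a lexical score with an embedding similarity to rank memory entries. Here simple token overlap stands in for BM25, toy vectors for real embeddings, and the mixing weight `alpha` is an assumed hyperparameter, not one reported in the paper:

```python
# Sketch: hybrid retrieval over conversational memory, combining a
# lexical score (stand-in for BM25) with cosine similarity.

import math

def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def lexical(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)          # stand-in for a BM25 score

def hybrid_rank(query: str, q_vec: list, memory: list, alpha: float = 0.5):
    """memory: list of (text, embedding). Best match ranks first."""
    scored = [
        (alpha * lexical(query, text) + (1 - alpha) * cosine(q_vec, vec),
         text)
        for text, vec in memory
    ]
    return [text for _, text in sorted(scored, reverse=True)]

memory = [("reset my password", [1.0, 0.0]),
          ("billing address update", [0.0, 1.0])]
print(hybrid_rank("password reset help", [0.9, 0.1], memory))
```

The retrieved entries would then be prepended to the 8B model's context, which is the low-cost inference path described above.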
[42] Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation
Julian Oestreich, Maximilian Bley, Frank Binder, Lydia Müller, Maksym Sydorenko, André Alcalde
Main category: cs.CL
TL;DR: RAG fine-tuning evaluation for long-form text generation in specialized domains, introducing TriFEX evaluation pipeline and PKP metric to measure knowledge internalization beyond standard NLP metrics.
Details
Motivation: Most RAG fine-tuning studies focus on document QA with standard NLP metrics that may obscure factual differences, especially for long-form text generation in specialized domains like electronic design automation.
Method: Adapt a 7B model under five context augmentation strategies with varying retrieval conditions; introduce TriFEX (a human-validated, triple-based evaluation pipeline) and Parametric Knowledge Precision (PKP) to isolate internalized knowledge by filtering out prompt-leaked claims.
Result: ROUGE and BERTScore fail to detect factual differences that triple-based evaluation reveals; existing knowledge internalization metrics are retrieval-sensitive; fine-tuned 7B variants outperform 72B baseline on most metrics and generalize across conditions.
Conclusion: Limitations of available RAG evaluation metrics; smaller models can be effectively adapted to specialized tasks for cost-efficient on-premises deployment through proper evaluation methods.
Abstract: Retrieval-Augmented Generation (RAG) fine-tuning has shown substantial improvements over vanilla RAG, yet most studies target document question answering and often rely on standard NLP metrics that can obscure factual differences. We evaluate RAG fine-tuning for long-form text generation in electronic design automation, adapting a 7B model under five context augmentation strategies with varying retrieval conditions. We introduce TriFEX, a human-validated, triple-based evaluation pipeline that attributes generated claims to their origin (user query, context, and reference), and propose Parametric Knowledge Precision (PKP), which isolates internalized knowledge by filtering out claims leaked in the prompt. We show that ROUGE and BERTScore fail to detect factual differences that our triple-based evaluation reveals. Additionally, we demonstrate that an existing metric for knowledge internalization is retrieval-sensitive, with about 75% of its cross-condition variance driven by changes in the rate at which internal knowledge is expressed (PR), rather than by changes in its actual correctness (PKP). The fine-tuned 7B variants outperform a 72B baseline on most metrics, further showing generalization across conditions and on a related benchmark. These results underscore the limitations of available metrics in RAG evaluation and show that smaller models could be reasonably well adapted to specialized tasks for cost-efficient, on-premises deployment.
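The PKP idea reduces to a small computation: drop generated claims (triples) that already appeared in the prompt, then measure precision over what remains. The triple format and exact-match rule below are simplified stand-ins for TriFEX's attribution pipeline:

```python
# Sketch: Parametric Knowledge Precision (PKP) -- precision of generated
# claims after filtering out those leaked in the prompt.

def pkp(generated: list, prompt_triples: set, reference: set) -> float:
    """Precision over internalized (non-leaked) claims."""
    internal = [t for t in generated if t not in prompt_triples]
    if not internal:
        return 0.0
    correct = sum(t in reference for t in internal)
    return correct / len(internal)

gen = [("FPGA", "has", "LUTs"),            # leaked: also in the prompt
       ("FPGA", "is", "reprogrammable"),   # correct, from parameters
       ("FPGA", "made_of", "cheese")]      # wrong, from parameters
leaked = {("FPGA", "has", "LUTs")}
truth = {("FPGA", "is", "reprogrammable"), ("FPGA", "has", "LUTs")}
print(pkp(gen, leaked, truth))  # 0.5
```

Filtering the leaked triple is the point: plain claim precision would be 2/3 here, crediting the model for knowledge it merely copied from the context.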
[43] AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing
Sarubi Thillainathan, Ji-Ung Lee, Michael Sullivan, Alexander Koller
Main category: cs.CL
TL;DR: AuthorMix: A lightweight, modular framework for authorship style transfer using LoRA adapters and layer-wise mixing for low-resource targets
Details
Motivation: Existing style transfer methods require training large models on extensive corpora, which is costly, inflexible for target-specific adaptation, and often sacrifices meaning preservation for style transfer.
Method: Train individual style-specific LoRA adapters on high-resource authors, then use learned layer-wise adapter mixing to rapidly adapt to new target styles with only a handful of training examples.
Result: Outperforms state-of-the-art style-transfer baselines and GPT-5.1 for low-resource targets, achieving highest overall score and substantially improving meaning preservation
Conclusion: AuthorMix provides an effective, lightweight, and interpretable solution for authorship style transfer that excels with limited training data
Abstract: The task of authorship style transfer involves rewriting text in the style of a target author while preserving the meaning of the original text. Existing style transfer methods train a single model on large corpora to model all target styles at once: this high-cost approach offers limited flexibility for target-specific adaptation, and often sacrifices meaning preservation for style transfer. In this paper, we propose AuthorMix: a lightweight, modular, and interpretable style transfer framework. We train individual, style-specific LoRA adapters on a small set of high-resource authors, allowing the rapid training of specialized adaptation models for each new target via learned, layer-wise adapter mixing, using only a handful of target style training examples. AuthorMix outperforms existing, SoTA style-transfer baselines – as well as GPT-5.1 – for low-resource targets, achieving the highest overall score and substantially improving meaning preservation.
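The layer-wise mixing described above can be sketched in a few lines; the shapes, softmax parameterization, and dense (rather than low-rank) deltas below are illustrative assumptions, not the AuthorMix implementation:

```python
import numpy as np

# Illustrative sketch of layer-wise adapter mixing: each high-resource author
# contributes a LoRA weight delta per layer, and a new target style is a
# learned convex combination of those deltas, with separate mixing weights at
# every layer. Deltas are dense here for brevity; real LoRA uses B @ A.

rng = np.random.default_rng(0)
n_layers, n_styles, d = 4, 3, 8

deltas = rng.normal(size=(n_layers, n_styles, d, d)) * 0.01

# Per-layer mixing logits (in practice fit on a handful of target examples).
logits = rng.normal(size=(n_layers, n_styles))
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

# Mixed delta for each layer: weighted sum over the style adapters.
mixed = np.einsum("ls,lsij->lij", weights, deltas)

assert mixed.shape == (n_layers, d, d)
assert np.allclose(weights.sum(axis=1), 1.0)
```

Because only the small logits matrix is learned per target, adaptation stays cheap even with very few target-style examples, which is the point of the modular design.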
[44] When Language Models Lose Their Mind: The Consequences of Brain Misalignment
Gabriele Merlin, Mariya Toneva
Main category: cs.CL
TL;DR: Brain-misaligned LLMs (trained to predict brain activity poorly while maintaining language modeling performance) show substantial impairment on downstream linguistic tasks compared to brain-aligned models, demonstrating brain alignment’s critical role in robust linguistic competence.
Details
Motivation: While brain-aligned LLMs have gained attention for cognitive modeling and AI safety, the functional role of brain alignment for linguistic competence remains uncertain. The paper aims to investigate whether brain alignment actually contributes to language understanding capabilities.
Method: Created brain-misaligned LLMs intentionally trained to predict brain activity poorly while maintaining high language modeling performance. Compared these with well-matched brain-aligned counterparts on over 200 downstream tasks across semantics, syntax, discourse, reasoning, and morphology.
Result: Brain misalignment substantially impairs downstream performance across diverse linguistic domains, indicating that brain alignment plays a critical role in achieving robust linguistic competence.
Conclusion: Brain alignment is important for LLMs’ linguistic capabilities, not just for cognitive modeling or safety. The findings offer novel insights into the relationship between neural representations and linguistic processing.
Abstract: While brain-aligned large language models (LLMs) have garnered attention for their potential as cognitive models and for enhancing safety and trustworthiness in AI, the role of this brain alignment for linguistic competence remains uncertain. In this work, we investigate the functional implications of brain alignment by introducing brain-misaligned models: LLMs intentionally trained to predict brain activity poorly while maintaining high language modeling performance. We evaluate these models on over 200 downstream tasks encompassing diverse linguistic domains, including semantics, syntax, discourse, reasoning, and morphology. By comparing brain-misaligned models with well-matched brain-aligned counterparts, we isolate the specific impact of brain alignment on language understanding. Our experiments reveal that brain misalignment substantially impairs downstream performance, highlighting the critical role of brain alignment in achieving robust linguistic competence. These findings underscore the importance of brain alignment in LLMs and offer novel insights into the relationship between neural representations and linguistic processing.
[45] HGNet: Scalable Foundation Model for Automated Knowledge Graph Generation from Scientific Literature
Devvrat Joshi, Islem Rekik
Main category: cs.CL
TL;DR: Two-stage framework for zero-shot scientific knowledge graph construction with hierarchical relation extraction, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing knowledge graph construction methods struggle with long multi-word entities, lack domain generalization, and ignore hierarchical scientific knowledge. LLMs are computationally expensive and inconsistent for specialized tasks, resulting in shallow KGs.
Method: Two-stage framework: 1) Z-NERD with Orthogonal Semantic Decomposition for domain-agnostic entity recognition and Multi-Scale TCQK attention for multi-word entities; 2) HGNet with hierarchy-aware message passing and two complementary objectives: Differentiable Hierarchy Loss and Continuum Abstraction Field Loss for continuous hierarchical abstraction in Euclidean space.
Result: Achieves new SOTA on SciERC, SciER, and SPHERE benchmarks, improving NER by 8.08% and RE by 5.99% on out-of-distribution tests. In zero-shot settings, gains reach 10.76% for NER and 26.2% for RE.
Conclusion: The framework enables scalable, zero-shot scientific KG construction with explicit hierarchical modeling, offering a simpler alternative to hyperbolic methods through continuous abstraction in Euclidean embeddings.
Abstract: Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches struggle to recognize long multi-word entities, often fail to generalize across domains, and typically overlook the hierarchical nature of scientific knowledge. While general-purpose large language models (LLMs) offer adaptability, they are computationally expensive and yield inconsistent accuracy on specialized tasks. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two-stage framework for scalable, zero-shot scientific KG construction. The first stage, Z-NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain-agnostic entity recognition by isolating semantic “turns” in text, and (ii) a Multi-Scale TCQK attention mechanism that captures coherent multi-word entities through n-gram-aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy-aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. This is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler alternative to hyperbolic methods. We release SPHERE (https://github.com/basiralab/SPHERE), a multi-domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on out-of-distribution tests. In zero-shot settings, gains reach 10.76% for NER and 26.2% for RE.
[46] Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy
Shushanta Pudasaini, Luis Miralles-Pechuán, David Lillis, Marisa Llorens Salvador
Main category: cs.CL
TL;DR: AI text detection framework using linguistic features achieves high benchmark performance but suffers from generalization issues due to dataset-specific artifacts rather than stable machine authorship signals.
Details
Motivation: Address the pressing challenge of detecting AI-generated text, questioning whether current detectors truly identify machine authorship or just exploit dataset-specific artifacts, and improve interpretability and reliability in real-world settings.
Method: Propose an interpretable detection framework integrating linguistic feature engineering, machine learning, and explainable AI techniques (SHAP-based explanations). Evaluate on PAN CLEF 2025 and COLING 2025 benchmarks using 30 linguistic features.
Result: Achieves competitive F1 score of 0.9734 on benchmarks, but systematic evaluation reveals substantial generalization failure under distribution shift. SHAP analysis shows detectors rely on dataset-specific stylistic cues rather than stable machine authorship signals.
Conclusion: Linguistic-feature-based AI text detection faces fundamental tension: most discriminative features are also most susceptible to domain shift, formatting variation, and text-length effects. Open-source package released for practical use with predictions and explanations.
Abstract: The widespread adoption of Large Language Models (LLMs) has made the detection of AI-Generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain, and their interpretability is often unexplored. In this work, we investigate whether contemporary detectors genuinely identify machine authorship or merely exploit dataset-specific artefacts. We propose an interpretable detection framework that integrates linguistic feature engineering, machine learning, and explainable AI techniques. When evaluated on two prominent benchmark corpora, namely PAN CLEF 2025 and COLING 2025, our model trained on 30 linguistic features achieves leaderboard-competitive performance, attaining an F1 score of 0.9734. However, systematic cross-domain and cross-generator evaluation reveals substantial generalisation failure: classifiers that excel in-domain degrade significantly under distribution shift. Using SHAP-based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on dataset-specific stylistic cues rather than stable signals of machine authorship. Further investigation with in-depth error analysis exposes a fundamental tension in linguistic-feature-based AI text detection: the features that are most discriminative on in-domain data are also the features most susceptible to domain shift, formatting variation, and text-length effects. We believe that this knowledge helps build AI detectors that are robust across different settings. To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.
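To make the "linguistic feature engineering" concrete, here is a minimal stdlib-only sketch of the kind of surface features such detectors consume; the paper's actual 30-feature set is not reproduced, and these four features are generic illustrations:

```python
import re

# Hedged sketch of linguistic feature extraction for AI-text detection:
# simple surface statistics over raw text. Feature names and choices are
# illustrative assumptions, not the paper's feature set.

def linguistic_features(text):
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "comma_rate": text.count(",") / max(len(words), 1),
    }

feats = linguistic_features("The model writes fluently. It rarely varies, though.")
print(sorted(feats))  # ['avg_sentence_len', 'avg_word_len', 'comma_rate', 'type_token_ratio']
```

Features like these would then feed a classifier (XGBoost in the paper), and the cross-domain fragility the abstract reports follows naturally: every one of these statistics shifts with domain, formatting, and text length.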
[47] UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities
Qi Jia, Haodong Zhao, Dun Pei, Xiujie Song, Shibo Wang, Zijian Chen, Zicheng Zhang, Xiangyang Zhu, Guangtao Zhai
Main category: cs.CL
TL;DR: UniDial-EvalKit (UDE) is a unified evaluation toolkit that standardizes benchmarking for interactive AI systems across heterogeneous datasets, model interfaces, and evaluation pipelines.
Details
Motivation: Existing evaluation protocols for multi-turn interactive AI systems are highly heterogeneous with different dataset formats, model interfaces, and evaluation pipelines, making systematic comparison difficult and impeding progress in the field.
Method: UDE provides holistic unification through: 1) standardizing heterogeneous data formats into a universal schema, 2) streamlining evaluation pipelines via modular architecture, 3) aligning metric calculations under consistent scoring interface, 4) supporting parallel generation/scoring for efficiency, and 5) implementing checkpoint-based caching to eliminate redundant computation.
Result: Validated across diverse multi-turn benchmarks, UDE guarantees high reproducibility through standardized workflows and transparent logging, while significantly improving evaluation efficiency and extensibility.
Conclusion: UDE enables standardized benchmarking ecosystem for interactive AI, accelerating future breakthroughs by providing a unified evaluation framework that addresses current fragmentation in evaluation protocols.
Abstract: Benchmarking AI systems in multi-turn interactive scenarios is essential for understanding their practical capabilities in real-world applications. However, existing evaluation protocols are highly heterogeneous, differing significantly in dataset formats, model interfaces, and evaluation pipelines, which severely impedes systematic comparison. In this work, we present UniDial-EvalKit (UDE), a unified evaluation toolkit for assessing interactive AI systems. The core contribution of UDE lies in its holistic unification: it standardizes heterogeneous data formats into a universal schema, streamlines complex evaluation pipelines through a modular architecture, and aligns metric calculations under a consistent scoring interface. It also supports efficient large-scale evaluation through parallel generation and scoring, as well as checkpoint-based caching to eliminate redundant computation. Validated across diverse multi-turn benchmarks, UDE not only guarantees high reproducibility through standardized workflows and transparent logging, but also significantly improves evaluation efficiency and extensibility. We make the complete toolkit and evaluation scripts publicly available to foster a standardized benchmarking ecosystem and accelerate future breakthroughs in interactive AI.
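The "universal schema" idea can be sketched as a pair of adapters that map different native record layouts onto one shared multi-turn structure; the field names below are invented for illustration and are not UDE's actual format:

```python
# Hedged sketch of schema unification: per-benchmark adapters convert each
# dataset's native record layout into one shared multi-turn structure that
# the rest of the evaluation pipeline consumes. All field names are
# illustrative assumptions, not UDE's real schema.

def from_sharegpt(record):
    return {"turns": [{"role": m["from"], "text": m["value"]}
                      for m in record["conversations"]]}

def from_qa_pairs(record):
    return {"turns": [{"role": "user", "text": record["question"]},
                      {"role": "assistant", "text": record["answer"]}]}

a = from_sharegpt({"conversations": [{"from": "user", "value": "hi"},
                                     {"from": "assistant", "value": "hello"}]})
b = from_qa_pairs({"question": "hi", "answer": "hello"})
assert a == b  # two source formats, one downstream representation
```

Once every dataset arrives in the same shape, a single scoring interface and cache layer can serve all benchmarks, which is the efficiency and reproducibility argument the abstract makes.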
[48] From Synthetic to Native: Benchmarking Multilingual Intent Classification in Logistics Customer Service
Haoyu He, Jinyu Zhuang, Haoran Chu, Shuhang Yu, J, T AI Group, Hao Wang, Kunpeng Han
Main category: cs.CL
TL;DR: A benchmark for hierarchical multilingual intent classification using real logistics customer-service logs, showing machine-translated test sets overestimate performance on noisy native queries.
Details
Motivation: Existing multilingual benchmarks rely on machine-translated text which is cleaner than real customer requests, overestimating real-world robustness. Need for realistic benchmarks using native customer-service logs.
Method: Constructed benchmark from 600K historical logistics customer-service records, filtered to 30K de-identified queries using LLM-assisted quality control and human verification. Organized into two-level taxonomy with 13 parent and 17 leaf intents across multiple languages.
Result: Translated test sets substantially overestimate performance on noisy native queries, especially for long-tail intents and cross-lingual transfer. Performance gap highlights need for more realistic multilingual benchmarks.
Conclusion: Machine-translated benchmarks are insufficient for evaluating real-world multilingual intent classification. Real customer-service logs reveal significant performance gaps, emphasizing need for native-language evaluation.
Abstract: Multilingual intent classification is central to customer-service systems on global logistics platforms, where models must process noisy user queries across languages and hierarchical label spaces. Yet most existing multilingual benchmarks rely on machine-translated text, which is typically cleaner and more standardized than native customer requests and can therefore overestimate real-world robustness. We present a public benchmark for hierarchical multilingual intent classification constructed from real logistics customer-service logs. The dataset contains approximately 30K de-identified, stand-alone user queries curated from 600K historical records through filtering, LLM-assisted quality control, and human verification, and is organized into a two-level taxonomy with 13 parent and 17 leaf intents. English, Spanish, and Arabic are included as seen languages, while Indonesian, Chinese, and additional test-only languages support zero-shot evaluation. To directly measure the gap between synthetic and real evaluation, we provide paired native and machine-translated test sets and benchmark multilingual encoders, embedding models, and small language models under flat and hierarchical protocols. Results show that translated test sets substantially overestimate performance on noisy native queries, especially for long-tail intents and cross-lingual transfer, underscoring the need for more realistic multilingual intent benchmarks.
[49] ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment
Hao Wang, Haocheng Yang, Licheng Pan, Lei Shen, Xiaoxi Li, Yinuo Wang, Zhichao Chen, Yuan Lu, Haoxuan Li, Zhouchen Lin
Main category: cs.CL
TL;DR: ImplicitRM learns unbiased reward models from implicit human feedback (clicks, copies) instead of expensive explicit feedback, addressing challenges of missing negative samples and user preference bias through stratification and likelihood maximization.
Details
Motivation: Current reward modeling in RLHF relies on expensive explicit human feedback data. Implicit feedback (clicks, copies) is more cost-effective but presents challenges: lacks definitive negative samples and suffers from user preference bias where different responses have different propensities to elicit feedback.
Method: ImplicitRM stratifies training samples into four latent groups via a stratification model, then derives a learning objective through likelihood maximization that is theoretically unbiased, effectively addressing both challenges of missing negative samples and user preference bias.
Result: Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets, showing effectiveness in learning from cost-effective implicit feedback.
Conclusion: Implicit reward modeling from implicit human feedback is a viable, cost-effective alternative to explicit feedback for RLHF, with ImplicitRM providing a theoretically sound solution to the fundamental challenges in this domain.
Abstract: Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon explicit feedback data with high collection costs. In this work, we study implicit reward modeling (learning reward models from implicit human feedback, e.g., clicks and copies) as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) Implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which exacerbates the difficulty of distinguishing definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM stratifies training samples into four latent groups via a stratification model. Building on this, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased, effectively resolving both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available on our project website.
[50] Decoding AI Authorship: Can LLMs Truly Mimic Human Style Across Literature and Politics?
Nasser A Alsadhan
Main category: cs.CL
TL;DR: LLMs can mimic literary/political figures but remain detectable via stylometric analysis; perplexity is key differentiator between AI and human writing.
Details
Motivation: To investigate whether state-of-the-art LLMs can authentically emulate the authorial signatures of prominent figures, and to develop methods for detecting AI-generated mimicry in digital humanities and social media contexts.
Method: Used zero-shot prompting with GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5 to generate synthetic text mimicking Whitman, Wordsworth, Trump, and Obama. Evaluated using BERT and XGBoost classifiers with LIWC markers, perplexity, and readability indices.
Result: AI-generated mimicry remains highly detectable; XGBoost with only 8 stylometric features achieved accuracy comparable to neural classifiers. Perplexity was the primary discriminative feature, revealing AI outputs have less stochastic regularity than human writing.
Conclusion: LLMs exhibit distributional convergence on some features (syntax, readability) but cannot fully replicate nuanced affective density and stylistic variance of human writing, providing benchmarks for authorship attribution.
Abstract: Amidst the rising capabilities of generative AI to mimic specific human styles, this study investigates the ability of state-of-the-art large language models (LLMs), including GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5, to emulate the authorial signatures of prominent literary and political figures: Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. Utilizing a zero-shot prompting framework with strict thematic alignment, we generated synthetic corpora evaluated through a complementary framework combining transformer-based classification (BERT) and interpretable machine learning (XGBoost). Our methodology integrates Linguistic Inquiry and Word Count (LIWC) markers, perplexity, and readability indices to assess the divergence between AI-generated and human-authored text. Results demonstrate that AI-generated mimicry remains highly detectable, with XGBoost models trained on a restricted set of eight stylometric features achieving accuracy comparable to high-dimensional neural classifiers. Feature importance analyses identify perplexity as the primary discriminative metric, revealing a significant divergence in the stochastic regularity of AI outputs compared to the higher variability of human writing. While LLMs exhibit distributional convergence with human authors on low-dimensional heuristic features, such as syntactic complexity and readability, they do not yet fully replicate the nuanced affective density and stylistic variance inherent in the human-authored corpus. By isolating the specific statistical gaps in current generative mimicry, this study provides a comprehensive benchmark for LLM stylistic behavior and offers critical insights for authorship attribution in the digital humanities and social media.
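Since perplexity is the feature this study found most discriminative, a brief reminder of what it computes may help: the exponentiated average negative log-probability a language model assigns to a token sequence. The token probabilities below are made-up numbers for illustration:

```python
import math

# Perplexity: exp of the mean negative log-probability over tokens.
# Lower values mean the scoring model found the text more predictable,
# which is the regularity signature that flags AI-generated prose.

def perplexity(token_probs):
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A sequence the model finds uniformly likely (typical of LLM output) ...
print(round(perplexity([0.5, 0.5, 0.5]), 2))   # 2.0
# ... versus one with a surprising token, as human writing often has.
print(round(perplexity([0.5, 0.5, 0.01]), 2))  # 7.37
```

The second sequence's higher perplexity illustrates the abstract's point: human text exhibits more stochastic variability, so its perplexity under a fixed scoring model tends to be higher and more dispersed than that of AI output.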
[51] I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes
Shijia Zhou, Saif M. Mohammad, Barbara Plank, Diego Frassinelli
Main category: cs.CL
TL;DR: MLLMs struggle with figurative meaning detection in memes, showing bias toward attributing figurative meaning even when absent, and often generate unfaithful explanations.
Details
Motivation: Internet memes are popular multimodal communication forms using figurative elements, but it's unclear how MLLMs interpret visual-textual combinations to identify figurative meaning in memes.
Method: Evaluated eight state-of-the-art generative MLLMs across three datasets on detecting and explaining six types of figurative meaning, plus human evaluation of explanation faithfulness.
Result: All models showed strong bias to associate memes with figurative meaning even when none present; correct predictions not always accompanied by faithful explanations.
Conclusion: MLLMs have significant limitations in multimodal figurative understanding, with bias issues and explanation faithfulness problems in meme interpretation.
Abstract: Internet memes represent a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias to associate a meme with figurative meaning, even when no such meaning is present. Qualitative analysis further shows that correct predictions are not always accompanied by faithful explanations.
[52] Is AI Catching Up to Human Expression? Exploring Emotion, Personality, Authorship, and Linguistic Style in English and Arabic with Six Large Language Models
Nasser A Alsadhan
Main category: cs.CL
TL;DR: LLMs struggle to convincingly mimic human emotional nuance and personality traits, with AI-generated texts distinguishable from human ones, though classification deteriorates on paraphrased samples revealing reliance on superficial stylistic cues.
Details
Motivation: To investigate whether LLMs can convincingly emulate complex human traits like emotional expression and personality across diverse linguistic and cultural contexts, particularly in under-resourced languages like Arabic.
Method: Two tasks across six models (Jais, Mistral, LLaMA, GPT-4o, Gemini, DeepSeek): 1) Evaluate if machine classifiers can distinguish human vs AI-generated texts, 2) Assess if LLM-generated texts exhibit emotional/personality traits comparable to humans using classification experiments and linguistic/psycholinguistic analyses.
Result: AI-generated texts distinguishable from human ones (F1>0.95), but classification deteriorates on paraphrased samples; significant generalization gaps in emotion/personality classification; GPT-4o and Gemini show superior affective coherence; measurable divergences in tone, authenticity, and textual complexity between human and AI texts.
Conclusion: LLMs encode affective signals differently from humans, with implications for affective computing, authorship attribution, and responsible AI deployment in under-resourced language contexts where generative AI detection and alignment pose unique challenges.
Abstract: The advancing fluency of LLMs raises important questions about their ability to emulate complex human traits, including emotional expression and personality, across diverse linguistic and cultural contexts. This study investigates whether LLMs can convincingly mimic emotional nuance in English and personality markers in Arabic, a critical under-resourced language with unique linguistic and cultural characteristics. We conduct two tasks across six models: Jais, Mistral, LLaMA, GPT-4o, Gemini, and DeepSeek. First, we evaluate whether machine classifiers can reliably distinguish between human-authored and AI-generated texts. Second, we assess the extent to which LLM-generated texts exhibit emotional or personality traits comparable to those of humans. Our results demonstrate that AI-generated texts are distinguishable from human-authored ones (F1>0.95), though classification performance deteriorates on paraphrased samples, indicating a reliance on superficial stylistic cues. Emotion and personality classification experiments reveal significant generalization gaps: classifiers trained on human data perform poorly on AI-generated texts and vice versa, suggesting LLMs encode affective signals differently from humans. Importantly, augmenting training with AI-generated data enhances performance in the Arabic personality classification task, highlighting the potential of synthetic data to address challenges in under-resourced languages. Model-specific analyses show that GPT-4o and Gemini exhibit superior affective coherence. Linguistic and psycholinguistic analyses reveal measurable divergences in tone, authenticity, and textual complexity between human and AI texts. These findings have implications for affective computing, authorship attribution, and responsible AI deployment, particularly within under-resourced language contexts where generative AI detection and alignment pose unique challenges.
[53] Steering LLMs for Culturally Localized Generation
Simran Khanuja, Hongbin Liu, Shujian Zhang, John Lambert, Mingqing Chen, Rajiv Mathews, Lun Wang
Main category: cs.CL
TL;DR: The paper introduces Cultural Embeddings (CuE) using sparse autoencoders to identify and manipulate cultural representations in LLMs, enabling white-box steering interventions to improve cultural faithfulness and elicit long-tail cultural concepts.
Details
Motivation: LLMs exhibit cultural biases favoring cultures with abundant training data, but existing localization methods (prompting, post-training alignment) are black-box, hard to control, and don't distinguish between missing knowledge vs. poor elicitation.
Method: Use mechanistic interpretability with sparse autoencoders to identify interpretable features encoding culturally salient information, aggregate them into Cultural Embeddings (CuE), and apply CuE-based white-box steering interventions.
Result: CuE-based steering increases cultural faithfulness and elicits significantly rarer, long-tail cultural concepts compared to prompting alone. It’s complementary to black-box methods, offering gains when applied on top of prompt-augmented inputs.
Conclusion: The approach provides both diagnostic insight into cultural representations in LLMs and a controllable method to steer towards desired cultures, revealing that models often have knowledge but need better elicitation strategies.
Abstract: LLMs are deployed globally, yet produce responses biased towards cultures with abundant training data. Existing cultural localization approaches such as prompting or post-training alignment are black-box, hard to control, and do not reveal whether failures reflect missing knowledge or poor elicitation. In this paper, we address these gaps using mechanistic interpretability to uncover and manipulate cultural representations in LLMs. Leveraging sparse autoencoders, we identify interpretable features that encode culturally salient information and aggregate them into Cultural Embeddings (CuE). We use CuE both to analyze implicit cultural biases under underspecified prompts and to construct white-box steering interventions. Across multiple models, we show that CuE-based steering increases cultural faithfulness and elicits significantly rarer, long-tail cultural concepts than prompting alone. Notably, CuE-based steering is complementary to black-box localization methods, offering gains when applied on top of prompt-augmented inputs. This also suggests that models do benefit from better elicitation strategies, and don’t necessarily lack long-tail knowledge representation, though this varies across cultures. Our results provide both diagnostic insight into cultural representations in LLMs and a controllable method to steer towards desired cultures.
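The white-box intervention described above follows the general activation-steering recipe: add a scaled direction to a hidden state at inference time. The vector construction, layer choice, and scale below are illustrative assumptions, not the paper's CuE pipeline:

```python
import numpy as np

# Minimal sketch of activation steering of the kind CuE-based interventions
# perform: nudge a residual-stream activation along a normalized "cultural
# embedding" direction. Sizes, scale, and the random stand-in vectors are
# illustrative assumptions.

def steer(hidden, direction, alpha=4.0):
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(1)
h = rng.normal(size=16)    # a hidden-state activation
cue = rng.normal(size=16)  # stand-in for aggregated SAE features of a culture

h_steered = steer(h, cue)

# The steered state's projection onto the cultural direction increases.
before = float(h @ cue / np.linalg.norm(cue))
after = float(h_steered @ cue / np.linalg.norm(cue))
assert after > before
```

Because the direction is built from interpretable sparse-autoencoder features rather than learned end-to-end, the same vector doubles as a diagnostic probe, which is how the paper separates "missing knowledge" from "poor elicitation."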
[54] WISTERIA: Weak Implicit Signal-based Temporal Relation Extraction with Attention
Duy Dao Do, Anaïs Halftermeyer, Thi-Bich-Hanh Dao
Main category: cs.CL
TL;DR: WISTERIA is a temporal relation extraction framework that uses pair-conditioned attention to identify implicit temporal cues rather than explicit markers, improving interpretability and competitive accuracy across multiple datasets.
Details
Motivation: Existing attention-based models for temporal relation extraction often highlight globally salient tokens but miss pair-specific cues that actually determine temporal relations. Current approaches rely on explicit temporal markers while overlooking implicit signals.
Method: Proposes WISTERIA framework that combines multi-head attention with pair-conditioned top-K pooling to isolate the most informative contextual tokens for each event pair, focusing on implicit lexical, syntactic, or morphological elements expressing temporal order.
Result: Achieves competitive accuracy on TimeBank-Dense, MATRES, TDDMan, and TDDAuto datasets. Linguistic analyses show the model reveals pair-level rationales aligned with temporal linguistic cues, providing localized interpretable temporal reasoning.
Conclusion: WISTERIA offers a more interpretable approach to temporal relation extraction by focusing on pair-specific implicit signals rather than global attention patterns, advancing both performance and understanding of temporal reasoning in text.
Abstract: Temporal Relation Extraction (TRE) requires identifying how two events or temporal expressions are related in time. Existing attention-based models often highlight globally salient tokens but overlook the pair-specific cues that actually determine the temporal relation. We propose WISTERIA (Weak Implicit Signal-based Temporal Relation Extraction with Attention), a framework that examines whether the top-K attention components conditioned on each event pair truly encode interpretable evidence for temporal classification. Unlike prior works assuming explicit markers such as before, after, or when, WISTERIA considers signals as any lexical, syntactic, or morphological element implicitly expressing temporal order. By combining multi-head attention with pair-conditioned top-K pooling, the model isolates the most informative contextual tokens for each pair. We conduct extensive experiments on TimeBank-Dense, MATRES, TDDMan, and TDDAuto, including linguistic analyses of top-K tokens. Results show that WISTERIA achieves competitive accuracy and reveals pair-level rationales aligned with temporal linguistic cues, offering a localized and interpretable view of temporal reasoning.
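The pair-conditioned top-K pooling described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: we score each context token by the attention the two event positions pay to it and keep the K strongest.

```python
# Illustrative sketch of pair-conditioned top-K pooling: score every
# context token from the two event positions' attention rows, then keep
# only the K most informative tokens for that specific event pair.

def pair_topk(attn, e1, e2, k):
    """attn: square matrix of attention weights (rows attend over columns).
    Combine the rows of the two event tokens and return the top-k column
    indices, i.e. the context tokens most attended by this event pair."""
    n = len(attn)
    scores = [(attn[e1][j] + attn[e2][j]) / 2.0 for j in range(n)]
    ranked = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

attn = [
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.2, 0.5, 0.1],
    [0.3, 0.3, 0.2, 0.2],
    [0.4, 0.1, 0.4, 0.1],
]
# For events at positions 0 and 1, tokens 1 and 2 carry the strongest signal.
print(pair_topk(attn, 0, 1, k=2))  # -> [1, 2]
```

Different event pairs select different tokens, which is what makes the resulting rationales pair-specific rather than globally salient.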
[55] Failure of contextual invariance in gender inference with large language models
Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli
Main category: cs.CL
TL;DR: LLM outputs show systematic shifts in gender inference tasks when minimal discourse context is added, violating contextual invariance assumptions.
Details
Motivation: To test the assumption that LLM outputs are stable under contextually equivalent formulations, particularly in gender inference tasks where bias benchmarking and deployment in high-stakes settings are concerned.
Method: Used a controlled pronoun selection task with minimal, theoretically uninformative discourse context. Applied Contextuality-by-Default analysis to measure dependence on context after accounting for marginal effects.
Result: Found large systematic shifts in model outputs with context introduction. Cultural gender stereotypes weakened/disappeared while irrelevant features became most informative. Contextual dependence persisted in 19-52% of cases across models.
Conclusion: LLM outputs violate contextual invariance even under near-identical syntactic formulations, challenging standard evaluation practices and bias benchmarking approaches.
Abstract: Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19–52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.
[56] Smart Bilingual Focused Crawling of Parallel Documents
Cristian García-Romero, Miquel Esplà-Gomis, Felipe Sánchez-Martínez
Main category: cs.CL
TL;DR: A smart web crawling method using fine-tuned multilingual language models to efficiently find parallel translated texts by predicting document language from URLs and identifying parallel document pairs.
Details
Motivation: Traditional web crawling for parallel texts is inefficient, downloading massive amounts of content with only a small fraction being actual parallel translations. The authors aim to develop a smarter approach that guides crawling toward parallel content more rapidly.
Method: Fine-tuned a pre-trained multilingual Transformer encoder model for two tasks: 1) inferring document language from URLs, and 2) predicting whether URL pairs link to parallel documents. Integrated both models into a crawling tool to guide the crawl toward parallel content.
Result: Both models performed effectively individually. Their combination enabled early discovery of parallel content during web crawling, reducing useless downloads and yielding more parallel documents compared to conventional crawling approaches.
Conclusion: Neural approaches using fine-tuned multilingual language models can significantly improve the efficiency of parallel text crawling by guiding the process toward relevant content and reducing wasted computational resources.
Abstract: Crawling parallel texts – texts that are mutual translations – from the Internet is usually done following a brute-force approach: documents are massively downloaded in an unguided process, and only a fraction of them end up leading to actual parallel content. In this work we propose a smart crawling method that guides the crawl towards finding parallel content more rapidly. We follow a neural approach that consists in adapting a pre-trained multilingual language model based on the encoder of the Transformer architecture by fine-tuning it for two new tasks: inferring the language of a document from its Uniform Resource Locator (URL), and inferring whether a pair of URLs link to parallel documents. We evaluate both models in isolation and their integration into a crawling tool. The results demonstrate the individual effectiveness of both models, and highlight that their combination enables us to address a practical engineering challenge: the early discovery of parallel content during web crawling in a given language pair. This leads to a reduction in the amount of downloaded documents deemed useless, and yields a greater quantity of parallel documents compared to conventional crawling approaches.
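The guided-crawl idea can be sketched with stub classifiers. The scoring functions below are stand-ins for the paper's two fine-tuned URL models (which we do not reproduce); only the control flow — rank the frontier by predicted usefulness instead of crawling blindly — reflects the description above.

```python
# Hedged sketch of model-guided crawling with stand-in scorers.
import heapq

def url_language(url):
    """Stand-in for the URL -> language classifier (the real system uses a
    fine-tuned multilingual encoder; here we just read path hints)."""
    for hint in ("/en/", "/fr/", "/de/"):
        if hint in url:
            return hint.strip("/")
    return "unk"

def parallel_score(url_a, url_b):
    """Stand-in for the parallel-pair classifier: URLs that differ only in
    their language segment are likely mutual translations."""
    a = url_a.replace("/en/", "/X/").replace("/fr/", "/X/")
    b = url_b.replace("/en/", "/X/").replace("/fr/", "/X/")
    return 1.0 if a == b else 0.0

def rank_frontier(seed, candidates):
    """Visit highest-scoring candidates first (max-heap via negated score)."""
    heap = [(-parallel_score(seed, c), c) for c in candidates]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

seed = "http://site.org/en/news/article-1"
frontier = ["http://site.org/fr/news/article-1", "http://site.org/en/about"]
print(rank_frontier(seed, frontier)[0])  # the likely French counterpart first
```

Prioritizing likely translation pairs is what reduces the "useless download" fraction relative to an unguided crawl.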
[57] Table-LLM-Specialist: Language Model Specialists for Tables using Iterative Generator-Validator Fine-tuning
Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang, Surajit Chaudhuri
Main category: cs.CL
TL;DR: Table-LLM-Specialist: A self-trained fine-tuning paradigm for table tasks using Generator-Validator approach that leverages dual task formulations to generate training data without manual labeling.
Details
Motivation: Language models like GPT and Llama perform poorly on complex table tasks (NL-to-Code, data cleaning) and require expensive task-specific fine-tuning with human labeling that leads to overfitting.
Method: Proposes Table-LLM-Specialist with Generator-Validator paradigm that exploits dual formulations of table tasks (generative and classification versions) to iteratively generate and validate training data using language models for fine-tuning without manual labels.
Result: Achieves strong performance across diverse tasks (GPT-3.5 fine-tuned models often surpass GPT-4), lower deployment cost with smaller models reaching high quality, and better generalization across benchmarks due to diverse training data from real-world tables.
Conclusion: Table-LLM-Specialist enables effective fine-tuning for table tasks without manual labeling, deployed in Microsoft Excel for automated table data cleaning, demonstrating practical value.
Abstract: Language models such as GPT and Llama have shown remarkable ability on diverse natural language tasks, yet their performance on complex table tasks (e.g., NL-to-Code and data cleaning) remains suboptimal. Improving performance typically requires task-specific fine-tuning, which depends on expensive human labeling and is prone to overfitting. In this work, we propose Table-LLM-Specialist, a self-trained fine-tuning paradigm designed for table tasks. Our key insight is that many table tasks admit two dual formulations: a generative version and a classification version. Leveraging this duality, we introduce a Generator-Validator paradigm that iteratively generates and validates training data using language models, enabling effective fine-tuning without manually labeled data. Extensive evaluations on Llama, GPT-3.5, and GPT-4 show that Table-LLM-Specialist achieves (1) strong performance across diverse tasks compared to base models, for example, models fine-tuned on GPT-3.5 often surpass GPT-4 level quality; (2) lower deployment cost by enabling smaller models to reach high quality with reduced latency and cost; and (3) better generalization across multiple benchmarks, due to training on diverse, systematically generated data from real-world tables. Our code is available at https://github.com/microsoft/Table-Specialist. Models fine-tuned with Table-LLM-Specialist have been integrated into Microsoft Excel and are deployed in production for automated table data cleaning.
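The Generator-Validator duality can be shown with a toy task. Both roles below are deterministic stubs standing in for LLM calls; only the loop structure — propose with the generative formulation, filter with the classification dual, keep survivors as training pairs — follows the description above.

```python
# Toy sketch of the Generator-Validator idea with stubbed "models".

def generator(x):
    """Generative formulation: propose candidate outputs for input x."""
    return [x * 2, x * 2 + 1]          # one correct, one noisy candidate

def validator(x, y):
    """Dual classification formulation: is (x, y) a correct pair?"""
    return y == x * 2

def self_train_pairs(inputs):
    """Keep only generator outputs the validator accepts; these become
    fine-tuning data with no manual labeling."""
    data = []
    for x in inputs:
        for y in generator(x):
            if validator(x, y):
                data.append((x, y))
    return data

print(self_train_pairs([1, 2, 3]))  # -> [(1, 2), (2, 4), (3, 6)]
```

In the real system both roles are played by a language model and the loop iterates, with each round's fine-tuned model generating the next round's data.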
[58] LatentQA: Teaching LLMs to Decode Activations Into Natural Language
Alexander Pan, Lijie Chen, Jacob Steinhardt
Main category: cs.CL
TL;DR: LatentQA: A method for training decoder LLMs to answer open-ended questions about language model activations using natural language outputs, enabling more expressive analysis of model behaviors.
Details
Motivation: Current transparency methods for analyzing language model activations use probes with limited scalar or single-token outputs, restricting the range of behaviors that can be captured. There's a need for more expressive probes that can output natural language to better understand model internals.
Method: Developed a decoder LLM probe that outputs natural language answers about activations (LatentQA). Created a dataset generation approach for mapping activations to question-answer pairs, then fine-tuned a decoder LLM on this dataset to perform open-ended question answering about activations.
Result: The decoder outperforms competitive probing baselines on supervised reading tasks like uncovering hidden system prompts and relational knowledge extraction. It’s precise enough to steer target models to exhibit novel behaviors unseen during training, and scales well with increasing dataset and model size.
Conclusion: LatentQA enables more expressive analysis of language model activations through natural language question answering, providing better transparency and control over model behaviors compared to traditional probing methods.
Abstract: Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language, performing LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder’s fidelity by assessing its ability to read and control model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system prompts and relational knowledge extraction, and observe that it outperforms competitive probing baselines. Second, we demonstrate that the decoder is precise enough to steer the target model to exhibit behaviors unseen during training. Finally, we show that LatentQA scales well with increasing dataset and model size.
[59] Extracting and Following Paths for Robust Relational Reasoning with Large Language Models
Ge Zhang, Mohammad Ali Alomrani, Hongjian Gu, Jiaming Zhou, Yaochen Hu, Bin Wang, Qun Liu, Mark Coates, Yingxue Zhang, Jianye Hao
Main category: cs.CL
TL;DR: PoT is a novel framework for relational reasoning that decomposes tasks into graph extraction, path identification, and reasoning stages, outperforming SOTA methods without fine-tuning.
Details
Motivation: LLMs have extensive semantic knowledge but struggle with complex relational reasoning tasks like kinship or spatial reasoning, requiring more structured approaches.Method: Three-stage framework: 1) Extract reasoning graph identifying entities, relations, and attributes, 2) Identify query-relevant reasoning paths within the graph, 3) Use paths for downstream reasoning to generate answers.
Result: Outperforms state-of-the-art baselines by up to 21.3% across four relational reasoning datasets without requiring fine-tuning or extensive LLM calls.
Conclusion: PoT provides an effective framework for relational reasoning that leverages graph structures to improve LLM performance while being resilient to extraction errors and input ambiguity.
Abstract: Large language models (LLMs) possess vast semantic knowledge but often struggle with complex reasoning tasks, particularly in relational reasoning problems such as kinship or spatial reasoning. In this paper, we present Path-of-Thoughts (PoT), a novel framework for relational reasoning that decomposes the task into three key stages: graph extraction, path identification, and reasoning. Unlike previous approaches, PoT efficiently extracts a reasoning graph that identifies crucial entities, relations, and attributes within the context. Subsequently, PoT identifies query-relevant reasoning paths within the graph, facilitating downstream reasoning of potential answers. Experimental evaluations across four datasets of relational reasoning demonstrate that PoT surpasses state-of-the-art baselines by a significant margin (up to 21.3%) without requiring fine-tuning or extensive LLM calls. Furthermore, unlike prior neuro-symbolic methods, PoT exhibits improved resilience against LLM extraction errors and input ambiguity by leveraging the compositional nature of graphs.
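The three-stage decomposition can be sketched end to end on a toy kinship example. Stages 1 and 3 are stubs here (the paper implements them with LLM prompts); only stage 2, the path search, is a real algorithm.

```python
# Minimal sketch of the PoT pipeline: extract a graph, find the
# query-relevant path, then reason over that path.

def extract_graph(text):
    """Stage 1 stub: entities and relations pulled from the context."""
    return {("Ann", "mother_of", "Bob"), ("Bob", "father_of", "Cal")}

def find_path(graph, src, dst):
    """Stage 2: breadth-first search for a relation path src -> dst."""
    frontier = [(src, [])]
    seen = {src}
    while frontier:
        node, path = frontier.pop(0)
        if node == dst:
            return path
        for a, rel, b in graph:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append((b, path + [rel]))
    return None

def reason(path):
    """Stage 3 stub: compose relations along the path."""
    return "grandmother_of" if path == ["mother_of", "father_of"] else "unknown"

g = extract_graph("Ann is Bob's mother. Bob is Cal's father.")
path = find_path(g, "Ann", "Cal")
print(reason(path))  # -> grandmother_of
```

Isolating the query-relevant path before reasoning is what keeps the final step short and makes the pipeline tolerant of extraction noise elsewhere in the graph.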
[60] EmbBERT: Attention Under 2 MB Memory
Riccardo Bravin, Massimo Pavan, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri
Main category: cs.CL
TL;DR: EmbBERT is a tiny language model designed for ultra-constrained devices with only 2MB memory, achieving comparable accuracy to models requiring 10x more memory through compact embeddings, streamlined feed-forward blocks, and efficient attention mechanisms.
Details
Motivation: Transformer models have revolutionized NLP but their high memory and computational requirements prevent deployment on ultra-constrained devices like wearables and IoT units with only a few megabytes of available memory.
Method: EmbBERT integrates a compact embedding layer, streamlined feed-forward blocks, and an efficient attention mechanism specifically designed for extreme efficiency under strict memory budgets. The architecture is resilient to 8-bit quantization and scalable across sub-megabyte to tens-of-megabytes range.
Result: EmbBERT achieves competitive accuracy comparable to larger state-of-the-art models while requiring only 2MB total memory, outperforming downsized versions of BERT and MAMBA of similar size. 8-bit quantization further reduces memory to 781kB.
Conclusion: Highly simplified transformer architectures remain remarkably effective under tight resource constraints, enabling deployment on ultra-constrained devices while maintaining competitive performance.
Abstract: Transformer architectures based on the attention mechanism have revolutionized natural language processing (NLP), driving major breakthroughs across virtually every NLP task. However, their substantial memory and computational requirements still hinder deployment on ultra-constrained devices such as wearables and Internet-of-Things (IoT) units, where available memory is limited to just a few megabytes. To address this challenge, we introduce EmbBERT, a tiny language model (TLM) architecturally designed for extreme efficiency. The model integrates a compact embedding layer, streamlined feed-forward blocks, and an efficient attention mechanism that together enable optimal performance under strict memory budgets. Through this redesign for the extreme edge, we demonstrate that highly simplified transformer architectures remain remarkably effective under tight resource constraints. EmbBERT requires only 2 MB of total memory, and achieves accuracy comparable to that of state-of-the-art (SotA) models that require a $\mathbf{10\times}$ memory budget. Extensive experiments on the curated TinyNLP benchmark and the GLUE suite confirm that EmbBERT achieves competitive accuracy, comparable to that of larger SotA models, and consistently outperforms downsized versions of BERT and MAMBA of similar size. Furthermore, we demonstrate the model's resilience to 8-bit quantization, which further reduces memory usage to just 781 kB, and the scalability of the EmbBERT architecture across the sub-megabyte to tens-of-megabytes range. Finally, we perform an ablation study demonstrating the positive contributions of all components and the pre-training procedure. All code, scripts, and checkpoints are publicly released to ensure reproducibility: https://github.com/RiccardoBravin/tiny-LLM.
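A quick back-of-the-envelope check of the memory figures quoted above (our own arithmetic, not the authors' accounting): weights stored in 8-bit use one byte per parameter, so 781 kB corresponds to roughly 0.78M parameters, and the same weights at 16-bit precision would double the footprint.

```python
# Memory footprint of n parameters at a given storage precision.

def model_bytes(n_params, bits_per_param):
    return n_params * bits_per_param // 8

n = 781_000                      # ~0.78M parameters (illustrative)
int8 = model_bytes(n, 8)         # 781,000 bytes ~ 781 kB
fp16 = model_bytes(n, 16)        # 1,562,000 bytes ~ 1.5 MB
print(int8, fp16, fp16 // int8)  # -> 781000 1562000 2
```

Note the paper's 2 MB figure is total memory, which also covers activations and buffers, not just the weight storage estimated here.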
[61] Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
Jooyoung Lee, Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos, Dongwon Lee
Main category: cs.CL
TL;DR: DeepFakeDeLiBot is a deliberation-enhancing chatbot that helps groups detect deepfake text, improving group dynamics and engagement but not overall detection accuracy.
Details
Motivation: The proliferation of generative models makes it challenging to distinguish authentic human-authored content from deepfake content. Collaborative human efforts augmented by AI tools present a promising solution for improving deepfake detection.
Method: Developed DeepFakeDeLiBot, a deliberation-enhancing chatbot to support groups in detecting deepfake text. Conducted study comparing group-based problem-solving with individual efforts, measuring detection accuracy and group dynamics.
Result: Group-based problem-solving significantly improves accuracy of identifying machine-generated paragraphs compared to individual efforts. DeepFakeDeLiBot doesn’t yield substantial performance gains overall but enhances group dynamics (engagement, consensus building, reasoning diversity). Participants with higher perceived effectiveness of group collaboration benefited from the chatbot.
Conclusion: Deliberative chatbots like DeepFakeDeLiBot have potential to foster interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection, though they don’t significantly improve detection performance overall.
Abstract: The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content. Collaborative human efforts, augmented by AI tools, present a promising solution. In this study, we explore the potential of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. Our findings reveal that group-based problem-solving significantly improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. While engagement with DeepFakeDeLiBot does not yield substantial performance gains overall, it enhances group dynamics by fostering greater participant engagement, consensus building, and the frequency and diversity of reasoning-based utterances. Additionally, participants with higher perceived effectiveness of group collaboration exhibited performance benefits from DeepFakeDeLiBot. These findings underscore the potential of deliberative chatbots in fostering interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection. Dataset and source code used in this study will be made publicly available upon acceptance of the manuscript.
[62] GeneMamba: An Efficient and Effective Foundation Model on Single Cell Data
Cong Qi, Hanzhang Fang, Siqi Jiang, Xun Song, Tianxing Hu, Wei Zhi
Main category: cs.CL
TL;DR: GeneMamba is a scalable foundation model for single-cell RNA sequencing data using state space modeling with linear-time complexity, outperforming transformer-based methods on various tasks.
Details
Motivation: Single-cell RNA sequencing data has high dimensionality, sparsity, and batch effects that pose computational challenges. Transformer-based models have limitations due to quadratic complexity and poor handling of long-range dependencies in this domain.
Method: GeneMamba uses Bi-Mamba architecture based on state space modeling to capture bidirectional gene context with linear-time complexity. It's pretrained on nearly 30 million cells with biologically informed objectives including pathway-aware contrastive loss and rank-based gene encoding.
Result: GeneMamba demonstrates strong performance across diverse tasks including multi-batch integration, cell type annotation, and gene-gene correlation, showing robustness and interpretability while offering substantial computational gains over transformer baselines.
Conclusion: GeneMamba represents a practical and powerful alternative to transformer-based methods for single-cell data analysis, advancing biologically grounded, scalable tools for large-scale transcriptomics.
Abstract: Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed objectives, including pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.
[63] Flying Pigs, FaR and Beyond: Evaluating LLM Reasoning in Counterfactual Worlds
Anish R Joishy, Ishwar B Balappanawar, Vamshi Krishna Bonagiri, Manas Gaur, Krishnaprasad Thirunarayan, Ponnurangam Kumaraguru
Main category: cs.CL
TL;DR: LLMs struggle with logical reasoning when context contradicts their parametric knowledge, but a metacognitive intervention (Flag & Reason) significantly improves performance on counterfactual reasoning tasks.
Details
Motivation: The paper investigates whether LLMs can reason logically when presented with contexts that contradict their ingrained parametric knowledge, addressing a fundamental challenge in reasoning about hypothetical or counterfactual worlds.
Method: Introduced CounterLogic benchmark to disentangle logical validity from knowledge alignment, evaluated 11 LLMs across six reasoning datasets, and proposed Flag & Reason (FaR) intervention where models first flag potential knowledge conflicts before reasoning.
Result: LLMs show consistent failure in counterfactual scenarios with accuracy dropping 14% on average compared to knowledge-aligned contexts. FaR intervention narrowed the performance gap to 7% and increased overall accuracy by 4%.
Conclusion: The study diagnoses a critical limitation in LLMs’ reasoning capabilities and demonstrates that metacognitive awareness through conflict flagging can make models more robust and reliable thinkers in counterfactual scenarios.
Abstract: A fundamental challenge in reasoning is navigating hypothetical, counterfactual worlds where logic may conflict with ingrained knowledge. We investigate this frontier for Large Language Models (LLMs) by asking: Can LLMs reason logically when the context contradicts their parametric knowledge? To facilitate a systematic analysis, we first introduce CounterLogic, a benchmark specifically designed to disentangle logical validity from knowledge alignment. Evaluation of 11 LLMs across six diverse reasoning datasets reveals a consistent failure: model accuracy plummets by an average of 14% in counterfactual scenarios compared to knowledge-aligned ones. We hypothesize that this gap stems not from a flaw in logical processing, but from an inability to manage the cognitive conflict between context and knowledge. Inspired by human metacognition, we propose a simple yet powerful intervention: Flag & Reason (FaR), where models are first prompted to flag potential knowledge conflicts before they reason. This metacognitive step is highly effective, narrowing the performance gap to just 7% and increasing overall accuracy by 4%. Our findings diagnose and study a critical limitation in modern LLMs’ reasoning and demonstrate how metacognitive awareness can make them more robust and reliable thinkers.
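The FaR intervention is a prompt-level change, so it can be sketched as a template builder. The wording below is our own paraphrase of the two steps, not the paper's exact prompt.

```python
# Sketch of the two-step Flag & Reason prompt structure: first ask the
# model to flag context/knowledge conflicts, then reason with that flag
# in view and with the context treated as ground truth.

def flag_and_reason_prompt(context, question):
    flag_step = (
        "Step 1: List any statements in the context that contradict "
        "your background knowledge. Treat the context as ground truth."
    )
    reason_step = (
        "Step 2: Using only the context and standard logic, answer the "
        "question and state whether the conclusion is logically valid."
    )
    return f"Context: {context}\n{flag_step}\n{reason_step}\nQuestion: {question}"

prompt = flag_and_reason_prompt(
    "All pigs can fly. Wilbur is a pig.",
    "Does it follow that Wilbur can fly?",
)
print(prompt.startswith("Context: All pigs can fly."))  # -> True
```

The point of the first step is purely metacognitive: surfacing the conflict explicitly before reasoning is what narrows the counterfactual gap in the paper's results.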
[64] DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation
Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang, Xiang Wang, Xiangnan He, Yang Deng
Main category: cs.CL
TL;DR: DualEdit: A dual-objective model editing framework for more stable backdoor attacks on safety-aligned LLMs by simultaneously promoting affirmative tokens and suppressing refusal tokens with dynamic loss weighting and value anchoring techniques.
Details
Motivation: Safety-aligned LLMs remain vulnerable to backdoor attacks, but existing editing-based attacks are unstable under safety alignment - models may initially accept triggers but later revert to refusals during generation (safety fallback phenomenon).
Method: DualEdit framework with two key techniques: (1) dynamic loss weighting to calibrate relative scales of affirmative promotion and refusal suppression objectives using pre-edited model, and (2) value anchoring that clusters representative attention value vectors to form compact anchors, reducing conflicts from diverse token sets.
Result: Experiments on safety-aligned LLMs show DualEdit improves attack success by 10% and reduces safety fallback rate by 11% over baseline editing-based backdoor attacks.
Conclusion: DualEdit effectively addresses safety fallback in backdoor attacks on safety-aligned LLMs through dual-objective optimization with calibrated loss weighting and attention value anchoring, demonstrating more stable and successful attacks.
Abstract: Safety-aligned large language models (LLMs) remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying a small set of parameters to map triggers to attacker-desired behaviors. However, we find that existing editing-based attacks are often unstable under safety alignment: the edited model may start with an affirmative prefix but later revert to refusals during generation. We term this phenomenon safety fallback. To mitigate it, we propose DualEdit, a dual-objective model editing framework that simultaneously promotes affirmative tokens and suppresses refusal tokens. DualEdit further addresses two key challenges, objective imbalance and refusal diversity, via two complementary techniques: (1) dynamic loss weighting, which calibrates the relative scales of the two objectives using the pre-edited model to stabilize optimization, and (2) value anchoring, which clusters representative attention value vectors to form compact anchors, reducing conflicts from overly diverse token sets and improving generalization. Experiments on safety-aligned LLMs show that DualEdit improves attack success by 10% and reduces safety fallback rate by 11% over baselines.
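A toy rendering of the dual objective may help fix ideas. This is our own formulation loosely following the description above, not the paper's exact losses: one term promotes an affirmative token, a second suppresses refusal tokens, and the weight between them is calibrated once on the pre-edited model so the two terms start at comparable scale (a stand-in for the paper's dynamic loss weighting).

```python
# Toy dual objective: -log p(affirm) plus a suppression term over
# refusal-token probabilities, balanced by a fixed calibrated weight.
import math

def term_scales(p_affirm, p_refusals):
    promote = -math.log(p_affirm)
    suppress = -sum(math.log(1.0 - p) for p in p_refusals)
    return promote, suppress

def dual_loss(p_affirm, p_refusals, w):
    promote, suppress = term_scales(p_affirm, p_refusals)
    return promote + w * suppress

# Calibrate w on the pre-edited model's probabilities, then keep it fixed.
p0, s0 = term_scales(0.1, [0.5, 0.3])
w = p0 / s0
print(round(dual_loss(0.1, [0.5, 0.3], w), 3))                           # -> 4.605
print(dual_loss(0.9, [0.05, 0.02], w) < dual_loss(0.1, [0.5, 0.3], w))  # -> True
```

Suppressing refusal mass jointly with promoting the affirmative prefix is what targets the safety-fallback failure mode, where the affirmative prefix alone does not prevent mid-generation refusals.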
[65] MARS: toward more efficient multi-agent collaboration for LLM reasoning
Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang
Main category: cs.CL
TL;DR: MARS is a multi-agent review system that improves LLM reasoning through role-based collaboration, reducing computational costs by 50% compared to multi-agent debate approaches.
Details
Motivation: While multi-agent debate (MAD) improves LLM reasoning through collaborative discussion, it suffers from high computational overhead due to frequent agent interactions. There's a need for efficient multi-agent frameworks that maintain reasoning quality while reducing costs.
Method: Proposes MARS (Multi-Agent Review System) with a role-based collaboration framework inspired by academic review processes. An author agent generates initial solutions, reviewer agents provide independent decisions and comments, and a meta-reviewer integrates feedback for final decisions and revision guidance, avoiding costly reviewer-to-reviewer interactions.
Result: Extensive experiments across multiple benchmarks show MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50% with different LLMs.
Conclusion: MARS provides an efficient alternative to MAD that maintains reasoning quality while significantly reducing computational costs through structured role-based collaboration without direct reviewer interactions.
Abstract: Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.
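The review-style control flow can be sketched with stub agents standing in for LLM calls (all names below are ours): an author drafts, reviewers vote independently, a meta-reviewer aggregates, and the author revises until acceptance. The efficiency claim hinges on the structure itself: reviewers never talk to each other, so interactions grow linearly in the number of reviewers per round.

```python
# Schematic MARS-style loop with deterministic stub agents.

def author(task, feedback=None):
    return task + (" [revised]" if feedback else " [draft]")

def reviewer(solution):
    """Each reviewer returns (decision, comment) without seeing the others."""
    ok = "[revised]" in solution
    return ("accept" if ok else "revise", "fine" if ok else "needs work")

def meta_review(decisions):
    """Aggregate independent decisions; no reviewer-to-reviewer chatter."""
    accepts = sum(1 for d, _ in decisions if d == "accept")
    return "accept" if accepts > len(decisions) / 2 else "revise"

def mars(task, n_reviewers=3, max_rounds=3):
    solution = author(task)
    for _ in range(max_rounds):
        decisions = [reviewer(solution) for _ in range(n_reviewers)]
        if meta_review(decisions) == "accept":
            return solution
        feedback = [c for _, c in decisions]
        solution = author(task, feedback)
    return solution

print(mars("Solve 2+2"))  # -> Solve 2+2 [revised]
```

Contrast with round-table debate, where every agent reads every other agent's output each round, which is where the token overhead the paper halves comes from.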
[66] Happiness is Sharing a Vocabulary: A Study of Transliteration Methods
Haeji Jung, Jinju Kim, Kyungjin Kim, Youjeong Roh, David R. Mortensen
Main category: cs.CL
TL;DR: Romanization (Latin script conversion) outperforms other transliteration methods for multilingual NLP tasks by enabling better token vocabulary sharing with pre-trained models.
Details
Motivation: To understand how script conversion affects multilingual NLP performance, specifically investigating the contributions of shared script, overlapping token vocabularies, and shared phonology across languages.
Method: Controlled experiments comparing three transliteration types (romanization, phonemic transcription, substitution ciphers) plus original orthography, evaluated on three downstream tasks: NER, POS tagging, and NLI across multiple languages.
Result: Romanization significantly outperformed other input types in 11 out of 12 evaluation settings, with longer subword tokens shared with pre-trained languages leading to better model utilization.
Conclusion: Romanization is the most effective transliteration approach for multilingual NLP, primarily because it maximizes vocabulary overlap with pre-trained models, enabling better knowledge transfer across languages.
Abstract: Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing strong results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to the performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as the original orthography. We evaluate each model on three downstream tasks – named entity recognition (NER), part-of-speech tagging (POS) and natural language inference (NLI) – and find that romanization significantly outperforms other input types in 11 out of 12 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributes to this success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.
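The vocabulary-overlap hypothesis above lends itself to a toy illustration: romanized input can match long subword tokens in a Latin-script pretrained vocabulary, while the original script falls back to unknown characters. The greedy longest-match tokenizer and the tiny vocabulary below are invented for the example; real systems would use a trained subword tokenizer such as BPE or SentencePiece.

```python
# Toy illustration: romanized text shares longer subword tokens with a
# Latin-script vocabulary than the original script does. The vocabulary
# is invented; greedy longest-match stands in for a real subword tokenizer.
def greedy_tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown single character
            i += 1
    return tokens

def overlap_rate(tokens, vocab):
    return sum(t in vocab for t in tokens) / len(tokens)

vocab = {"an", "ann", "yeong", "ha", "se", "yo"}
roman = greedy_tokenize("annyeonghaseyo", vocab)  # romanized Korean greeting
native = greedy_tokenize("안녕하세요", vocab)       # original Hangul script
```

Under this toy setup the romanized string tokenizes entirely into in-vocabulary subwords, while the Hangul string yields only unknown characters, mirroring the paper's finding that longer shared subword tokens drive better model utilization.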
[67] Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying
Main category: cs.CL
TL;DR: IGPO introduces information gain-based intrinsic rewards for RL training of LLM agents in multi-turn settings to address reward sparsity issues.
Details
Motivation: Existing RL approaches for LLM agents rely on sparse outcome-based rewards, causing problems in multi-turn settings: advantage collapse, poor credit assignment, and low sample efficiency.
Method: IGPO models each interaction turn as incremental information acquisition about ground truth, defining turn-level rewards as marginal increase in policy’s probability of correct answer, combining intrinsic rewards with outcome supervision.
Result: IGPO outperforms baselines in multi-turn scenarios on in-domain and out-of-domain benchmarks, achieving higher accuracy and improved data efficiency.
Conclusion: IGPO provides effective dense supervision for multi-turn agent training through information gain-based intrinsic rewards, addressing key limitations of sparse reward RL.
Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are provided only upon generating the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate three critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals; (ii) lack of fine-grained credit assignment, where the correctness of intermediate turns is obscured, especially in long-horizon tasks; and (iii) poor sample efficiency, where each rollout yields only a single outcome signal, leading to low data utilization. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy’s probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model’s own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward signals. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved data efficiency. Our code is available at https://github.com/GuoqingWang1/IGPO.
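The turn-level reward described above (marginal increase in the policy's probability of the correct answer) reduces to simple differencing. The sketch below is a hedged reading of that definition; the `alpha` blending of intrinsic and outcome rewards is an assumption for illustration, since the abstract only says the two signals are combined, and the probabilities are made-up numbers rather than model outputs.

```python
# Sketch of IGPO-style turn rewards: intrinsic reward at turn t is the
# information gain p[t+1] - p[t], where p is the policy's probability of
# the correct answer. The alpha-weighted blend with the outcome reward
# is an assumed combination scheme, not the paper's exact formula.
def igpo_turn_rewards(p_correct_per_turn, outcome_reward, alpha=0.5):
    """p_correct_per_turn[0] is the prior before any interaction;
    p_correct_per_turn[t] is P(correct answer) after turn t."""
    gains = [p_correct_per_turn[t + 1] - p_correct_per_turn[t]
             for t in range(len(p_correct_per_turn) - 1)]
    rewards = [alpha * g for g in gains]          # dense intrinsic signal
    rewards[-1] += (1 - alpha) * outcome_reward   # sparse outcome on last turn
    return rewards
```

Every turn now carries a learning signal, which is exactly what addresses the advantage-collapse and credit-assignment issues the paper identifies.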
[68] In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions
Aria Pessianzadeh, Naima Sultana, Hildegarde Van den Bulck, David Gefen, Shahin Jabbari, Rezvaneh Rezapour
Main category: cs.CL
TL;DR: First computational study of trust and distrust in GenAI using multi-year Reddit data, finding balanced attitudes with trust slightly outweighing distrust, influenced by model releases and varying across user groups.
Details
Motivation: Understanding public trust in GenAI is essential for responsible adoption and governance, but prior work lacks computational, large-scale, and longitudinal approaches to measuring trust and distrust in GenAI and LLMs.
Method: Used multi-year Reddit dataset (2022-2025) spanning 39 subreddits and 230,576 posts, combined crowd-sourced annotations with classification models to scale analysis of trust and distrust patterns.
Result: Trust and distrust are nearly balanced over time with trust modestly outweighing distrust, showing shifts around major model releases. Technical performance and usability dominate as dimensions, while personal experience is the most frequent reason shaping attitudes, with distinct patterns across different user groups.
Conclusion: Provides a methodological framework for large-scale trust analysis and insights into evolving public perceptions of GenAI, highlighting the importance of computational approaches for understanding trust dynamics in AI systems.
Abstract: The rise of generative AI (GenAI) has impacted many aspects of human life. As these systems become embedded in everyday practices, understanding public trust in them is also essential for responsible adoption and governance. Prior work on trust in AI has largely drawn from psychology and human-computer interaction, but there is a lack of computational, large-scale, and longitudinal approaches to measuring trust and distrust in GenAI and large language models (LLMs). This paper presents the first computational study of trust and distrust in GenAI, using a multi-year Reddit dataset (2022–2025) spanning 39 subreddits and 230,576 posts. Crowd-sourced annotations of a representative sample were combined with classification models to scale analysis. We find that trust and distrust are nearly balanced over time, although trust modestly outweighs distrust, with shifts around major model releases. Technical performance and usability dominate as dimensions, while personal experience is the most frequent reason shaping attitudes. Distinct patterns also emerge across trustors (e.g., experts, ethicists, and general users). Our results provide a methodological framework for large-scale trust analysis and insights into evolving public perceptions of GenAI.
[69] HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su
Main category: cs.CL
TL;DR: HUMORCHAIN: A theory-guided multi-stage reasoning framework for generating humorous image captions by integrating visual semantic parsing, humor- and psychology-based reasoning, and humor evaluation.
Details
Motivation: Multimodal humor is prevalent in online communication but existing AI approaches lack explicit modeling of humor theories, producing literal descriptions that fail to capture genuine humor or cognitive depth.
Method: Proposes HUMORCHAIN - a theory-guided multi-stage reasoning framework that integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable cognitive reasoning chain.
Result: Outperforms state-of-the-art baselines on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets in human humor preference, Elo/BT scores, and semantic diversity.
Conclusion: Theory-driven structured reasoning enables large language models to generate humor aligned with human perception, representing the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation.
Abstract: Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor, often producing literal descriptions that fail to capture its underlying cognitive mechanisms, resulting in generated image descriptions that are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.
[70] PaperBanana: Automating Academic Illustration for AI Scientists
Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, Jinsung Yoon
Main category: cs.CL
TL;DR: PaperBanana is an agentic framework using VLMs and image generation models to automatically create publication-ready academic illustrations, evaluated on a benchmark of methodology diagrams from NeurIPS 2025.
Details
Motivation: Despite advances in AI-powered research, generating publication-ready illustrations remains labor-intensive, creating a bottleneck in the research workflow that needs automation.
Method: Uses an agentic framework powered by state-of-the-art VLMs and image generation models, with specialized agents for reference retrieval, content/style planning, image rendering, and iterative refinement via self-critique.
Result: Outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics on PaperBananaBench (292 test cases from NeurIPS 2025), and effectively extends to generating high-quality statistical plots.
Conclusion: PaperBanana paves the way for automated generation of publication-ready illustrations, addressing a significant bottleneck in research workflows.
Abstract: Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.
[71] KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Main category: cs.CL
TL;DR: KDFlow is a novel framework for LLM knowledge distillation that decouples teacher inference from student training, using SGLang for efficient teacher inference and FSDP2 for student training, achieving significant speedups.
Details
Motivation: Current knowledge distillation frameworks use homogeneous training backends for both teacher and student models, leading to suboptimal training efficiency due to different computational requirements of inference vs. training.
Method: Decoupled architecture with SGLang for teacher inference and FSDP2 for student training; transmits only teacher hidden states (not full logits) via zero-copy data transfer; recomputes logits on student side; supports both off-policy and on-policy distillation.
Result: Achieves 1.44× to 6.36× speedup compared to current KD frameworks, enabling rapid prototyping and scaling of LLM distillation with minimal engineering overhead.
Conclusion: KDFlow provides an efficient, extensible framework for LLM knowledge distillation that fully utilizes the advantages of both inference-optimized and training-optimized systems.
Abstract: Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed \textbf{KDFlow}, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher’s hidden states using zero-copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow can achieve \textbf{1.44$\times$ to 6.36$\times$} speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow
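The hidden-states-instead-of-logits trick above is easy to quantify: logits scale with vocabulary size while hidden states scale with hidden dimension, and the student can recompute teacher logits locally as a single matrix product against the output embedding. The numpy sketch below illustrates the arithmetic only; the shapes and the half-precision assumption are illustrative, and the real transfer path in KDFlow goes through SGLang and FSDP2, not numpy.

```python
# Why shipping hidden states beats shipping logits in KD: payload scales
# with hidden_dim instead of vocab_size, and logits are recoverable on
# the student side as hidden @ W_out.T. Numbers here are illustrative.
import numpy as np

def kd_payload_bytes(seq_len, hidden_dim, vocab_size, bytes_per_elem=2):
    """Per-sequence payload sizes, assuming fp16/bf16 elements."""
    logits_bytes = seq_len * vocab_size * bytes_per_elem
    hidden_bytes = seq_len * hidden_dim * bytes_per_elem
    return logits_bytes, hidden_bytes

def recompute_logits(hidden_states, output_embedding):
    # hidden_states: (seq_len, hidden_dim)
    # output_embedding: (vocab_size, hidden_dim), tied or teacher-provided
    return hidden_states @ output_embedding.T
```

For a 128K-token vocabulary and 4096-dim hidden states, hidden-state transfer is 32x smaller than logit transfer, which is the communication saving the paper trades against the cost of recomputing logits on the student side.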
[72] From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG
Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Chunlin Chen, Zhi Wang
Main category: cs.CL
TL;DR: MA-RAG is a multi-round agentic RAG framework for medical QA that iteratively refines evidence retrieval and reasoning through semantic conflict analysis to reduce hallucinations and improve accuracy.
Details
Motivation: LLMs in medical QA suffer from hallucinations and outdated knowledge. While RAG helps, existing methods use noisy token-level signals and lack multi-round refinement needed for complex medical reasoning.
Method: MA-RAG uses an agentic refinement loop where semantic conflicts among candidate responses drive evidence retrieval queries, while optimizing reasoning history to avoid long-context degradation. It extends self-consistency by using inconsistency as a signal for multi-round reasoning and retrieval.
Result: MA-RAG outperforms competitive inference-time scaling and RAG baselines across 7 medical Q&A benchmarks, achieving +6.8 points average accuracy improvement over backbone models.
Conclusion: MA-RAG effectively addresses LLM limitations in medical QA through iterative agentic refinement, leveraging semantic conflicts to drive evidence retrieval and reasoning optimization for more reliable medical consensus.
Abstract: Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing historical reasoning traces to mitigate long-context degradation. MA-RAG extends the self-consistency principle by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering substantial +6.8 points on average accuracy over the backbone model. Our code is available at https://github.com/NJU-RL/MA-RAG.
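The core loop above, where disagreement among sampled answers triggers another round of retrieval, can be sketched in a few lines. This is a simplified reading of the idea, not the paper's agents: `generate` and `retrieve` are caller-supplied stubs, and the agreement `threshold` is an invented parameter standing in for the paper's semantic-conflict analysis.

```python
# Minimal sketch of conflict-driven retrieval: sample candidate answers,
# return on consensus, otherwise treat the disagreement as a signal to
# retrieve more evidence and try again. generate/retrieve are stubs.
from collections import Counter

def consensus(candidates, threshold=0.75):
    answer, count = Counter(candidates).most_common(1)[0]
    return answer if count / len(candidates) >= threshold else None

def ma_rag(question, generate, retrieve, max_rounds=3, n_samples=4):
    evidence = []
    for _ in range(max_rounds):
        candidates = [generate(question, evidence) for _ in range(n_samples)]
        answer = consensus(candidates)
        if answer is not None:
            return answer  # semantic conflict resolved
        evidence.append(retrieve(question, candidates))
    return Counter(candidates).most_common(1)[0][0]  # fall back to majority
```

This makes the "lack of consistency as a proactive signal" reading concrete: retrieval only fires while the candidate answers disagree.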
[73] Designing Explainable Conversational Agentic Systems for Guaraní Speakers
Samantha Adorno, Akshata Kishore Moharir, Ratna Kandala
Main category: cs.CL
TL;DR: Position paper arguing that AI/HCI systems need oral-first design for indigenous languages like Guaraní, proposing multi-agent architecture that respects oral practices and data sovereignty rather than text-centric approaches.
Details
Motivation: Current AI/HCI systems are text-first and underserve oral languages and indigenous communities. Using Guaraní as a case study, the paper argues that language support remains insufficient unless it aligns with lived oral practices and respects indigenous data sovereignty.
Method: Proposes an oral-first multi-agent architecture as an alternative to standard text-to-speech pipelines. Decouples Guaraní natural language understanding from dedicated agents for conversation state and community-led governance, creating a technical framework that respects diglossia and oral conversation practices.
Result: Demonstrates a technical framework that moves beyond mere recognition to focus on turn-taking, repair, and shared context as primary interaction loci. Shows how AI can be culturally grounded by treating spoken conversation as first-class design requirement rather than adapting oral languages to text-centric systems.
Conclusion: For AI to be truly culturally grounded, it must shift from adapting oral languages to text-centric systems to treating spoken conversation as a first-class design requirement, ensuring digital ecosystems empower rather than overlook diverse linguistic practices.
Abstract: Although artificial intelligence (AI) and Human-Computer Interaction (HCI) systems are often presented as universal solutions, their design remains predominantly text-first, underserving primarily oral languages and indigenous communities. This position paper uses Guaraní, an official and widely spoken language of Paraguay, as a case study to argue that language support in AI remains insufficient unless it aligns with lived oral practices. We propose an alternative to the standard “text-to-speech” pipeline, proposing instead an oral-first multi-agent architecture. By decoupling Guaraní natural language understanding from dedicated agents for conversation state and community-led governance, we demonstrate a technical framework that respects indigenous data sovereignty and diglossia. Our work moves beyond mere recognition to focus on turn-taking, repair, and shared context as the primary locus of interaction. We conclude that for AI to be truly culturally grounded, it must shift from adapting oral languages to text-centric systems to treating spoken conversation as a first-class design requirement, ensuring digital ecosystems empower rather than overlook diverse linguistic practices.
[74] NLP Occupational Emergence Analysis: How Occupations Form and Evolve in Real Time – A Zero-Assumption Method Demonstrated on AI in the US Technology Workforce, 2022-2026
David Nordfors
Main category: cs.CL
TL;DR: The paper proposes a co-attractor theory for occupational emergence and applies it to detect AI as a diffusing technology rather than an emerging occupation, based on analysis of 8.2 million US resumes.
Details
Motivation: Occupations evolve faster than classification systems can track, creating a need for methods to detect genuine occupational emergence without relying on predefined taxonomies or job titles.
Method: Proposes a co-attractor concept where occupations are self-reinforcing structures with shared vocabulary and cohesive population. Uses zero-assumption method analyzing vocabulary cohesion and population cohesion independently with ablation tests on 8.2 million US resumes (2022-2026).
Result: Identifies established occupations correctly and reveals AI shows vocabulary cohesion formed rapidly in early 2024 but practitioner population never cohered. Pre-existing AI community dissolved as tools went mainstream, with vocabulary absorbed into existing careers rather than binding a new occupation.
Conclusion: AI appears to be a diffusing technology rather than an emerging occupation. Discusses whether introducing “AI Engineer” occupational category could catalyze population cohesion around the already-formed vocabulary to complete the co-attractor.
Abstract: Occupations form and evolve faster than classification systems can track. We propose that a genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary. This co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population. Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations and reveals a striking asymmetry for AI: a cohesive professional vocabulary formed rapidly in early 2024, but the practitioner population never cohered. The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation. AI appears to be a diffusing technology, not an emerging occupation. We discuss whether introducing an “AI Engineer” occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.
[75] Mi:dm K 2.5 Pro
KT Tech innovation Group
Main category: cs.CL
TL;DR: Mi:dm K 2.5 Pro is a 32B parameter Korean LLM optimized for enterprise multi-step reasoning, long-context understanding, and agentic workflows with specialized training techniques and Korean cultural understanding.
Details
Motivation: Address the need for LLMs with multi-step reasoning, long-context understanding, and agentic workflows in enterprise environments, particularly for Korean-language and domain-specific scenarios where scaling alone is insufficient.
Method: Quality-centric data curation using AST analysis for code and gap-filling for math; pre-training with Depth Upscaling and progressive strategy for 128K context; post-training with Reasoning SFT, model merging, asynchronous RL, and “Fusion Training” to balance reasoning with conversational fluency and tool-use.
Result: Achieves competitive performance against leading global and domestic models, sets SOTA on Korean-specific benchmarks, and demonstrates strong safety profile through Responsible AI evaluations.
Conclusion: Mi:dm K 2.5 Pro successfully addresses enterprise-grade complexity through reasoning-focused optimization and specialized training techniques, offering deep Korean linguistic/cultural understanding with balanced safety and performance.
Abstract: The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. “Fusion Training” then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.
[76] Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Richard J. Young
Main category: cs.CL
TL;DR: Chain-of-thought faithfulness measurements are not objective but depend heavily on classifier choice, with different methods producing significantly different results and even reversing model rankings.
Details
Motivation: Recent work reports single aggregate numbers for chain-of-thought faithfulness, implying it's an objective, measurable property. This paper challenges that assumption by showing faithfulness measurements vary significantly depending on the classification methodology used.
Method: Applied three different classifiers (regex-only detector, regex-plus-LLM pipeline, and Claude Sonnet 4 judge) to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. Compared results across classifiers using statistical tests including McNemar tests and Cohen’s kappa.
Result: Different classifiers produced faithfulness rates of 74.4%, 82.6%, and 69.7% on identical data. Per-model gaps ranged from 2.6 to 30.6 percentage points. All pairwise McNemar tests were significant (p < 0.001). Cohen’s kappa ranged from 0.06 (“slight”) to 0.42 (“moderate”). Classifier choice reversed model rankings: Qwen3.5-27B ranked 1st under pipeline but 7th under Sonnet; OLMo-3.1-32B moved from 9th to 3rd.
Conclusion: Published faithfulness numbers cannot be meaningfully compared across studies using different classifiers. Future evaluations should report sensitivity ranges across multiple classification methodologies rather than single aggregate numbers.
Abstract: Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper provides evidence that it is not. Three classifiers (a regex-only detector, a regex-plus-LLM pipeline, and a Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce faithfulness rates of 74.4%, 82.6%, and 69.7%. Per-model gaps range from 2.6 to 30.6 percentage points; all pairwise McNemar tests are significant (p < 0.001). The disagreements are systematic: Cohen’s kappa ranges from 0.06 (“slight”) for sycophancy hints to 0.42 (“moderate”) for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under Sonnet; OLMo-3.1-32B moves from 9th to 3rd. Different classifiers operationalize faithfulness at different levels of stringency (lexical mention versus epistemic dependence), yielding divergent measurements on the same behavior. These results indicate that published faithfulness numbers cannot be meaningfully compared across studies using different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies.
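Cohen's kappa, the agreement statistic at the heart of the result above, is worth seeing in miniature: it corrects raw agreement for the agreement two classifiers would reach by chance given their marginals, which is why it can be near zero even when raw agreement looks high. The labels below are synthetic, not the paper's data.

```python
# Cohen's kappa for two binary classifiers: (observed - expected) agreement,
# normalized. Illustrates how skewed marginals can drive kappa to ~0
# even at 90% raw agreement. Labels are synthetic examples.
def cohens_kappa(a, b):
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the two classifiers labeled independently
    # with their observed positive rates.
    pa, pb = sum(a) / n, sum(b) / n
    p_expected = pa * pb + (1 - pa) * (1 - pb)
    return (p_observed - p_expected) / (1 - p_expected)
```

With one classifier labeling everything "faithful" and the other agreeing 90% of the time, kappa collapses to zero, the same failure mode behind the paper's "slight" (0.06) agreement on sycophancy hints despite plausible-looking headline rates.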
[77] Children’s Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan, Yijia Zhong, Yijing Zhang, Han Zhang, Yingna Wu, Zheng Tian
Main category: cs.CL
TL;DR: KidGym is a comprehensive 2D grid-based benchmark for evaluating multimodal LLMs across five cognitive capabilities inspired by children’s intelligence tests.
Details
Motivation: Current MLLM evaluations lack comprehensive assessment of core cognitive capabilities. Inspired by children's intelligence scales, researchers aim to create a benchmark that systematically tests MLLMs' adaptability and developmental potential across essential abilities.
Method: Developed KidGym - a 2D grid-based benchmark with 12 unique tasks targeting five capabilities: Execution, Perception Reasoning, Learning, Memory, and Planning. Features diverse scenarios with randomly generated layouts for robust evaluation, and is fully customizable/extensible.
Result: Evaluation of state-of-the-art MLLMs using KidGym revealed significant insights into model capabilities and identified several limitations of current models. The benchmark provides a more accurate assessment of MLLM cognitive abilities.
Conclusion: KidGym offers a comprehensive, customizable benchmark for evaluating MLLMs’ cognitive capabilities, filling a gap in current evaluation methods and supporting the growing MLLM research community.
Abstract: Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales - an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs’ adaptability and developmental potential, mirroring the stages of children’s cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: https://bobo-ye.github.io/KidGym/.
[78] CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
Roy Uziel, Omer Belhasin, Itay Levi, Akhiad Bercovich, Ran El-Yaniv, Ran Zilberstein, Michael Elad
Main category: cs.CL
TL;DR: CRoCoDiL introduces continuous semantic space diffusion for language generation, improving masked diffusion models by training encoder-demasker architecture for better token dependencies and semantic coherence.
Details
Motivation: Masked Diffusion Models (MDMs) are efficient for non-causal language generation but struggle with token dependencies and semantic incoherence due to reliance on discrete marginal distributions. The authors aim to address these limitations by moving diffusion to continuous sentence-level semantic space.
Method: Proposes CRoCoDiL (Continuous and Robust Conditioned Diffusion for Language) - a unified fine-tuning approach that jointly trains an encoder-demasker architecture, grounding MDM demasking in continuous latent representations. Also introduces two unconditional text synthesis algorithms: Continuous-Then-Discrete (ConThenDisc) and Continuous-Within-Discrete (ConWithinDisc).
Result: Experiments using LLaDA show superior generation quality and more than 10x faster sampling speeds in unconditional settings compared to baseline methods.
Conclusion: Moving diffusion to continuous semantic space significantly improves masked diffusion models for language generation, achieving better quality and efficiency through novel continuous-discrete hybrid approaches.
Abstract: Masked Diffusion Models (MDMs) provide an efficient non-causal alternative to autoregressive generation but often struggle with token dependencies and semantic incoherence due to their reliance on discrete marginal distributions. We address these limitations by shifting the diffusion process into a continuous sentence-level semantic space. We propose CRoCoDiL (Continuous and Robust Conditioned Diffusion for Language), a unified fine-tuning approach that jointly trains an encoder-demasker architecture, grounding the MDM demasking in continuous latent representations. This leads to the formation of a novel autoencoder in which decoding is obtained by an MDM algorithm. Relying on the same framework, we introduce two unconditional text synthesis algorithms: Continuous-Then-Discrete (ConThenDisc), a hybrid-diffusion approach that first generates latent representations in continuous space and then decodes these to tokens via an MDM, and Continuous-Within-Discrete (ConWithinDisc), a multi-diffusion strategy that refines latent representations throughout the discrete sampling process. Experiments using LLaDA show that our methods achieve superior generation quality and more than 10x faster sampling speeds in an unconditional setting.
[79] TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols
Saketh Vinjamuri, Marielle Fis Loperena, Marie C. Spezia, Ramez Kouzy
Main category: cs.CL
TL;DR: TimeTox is an LLM-based pipeline using Gemini models to automatically extract time toxicity metrics from clinical trial protocol documents, with validation showing high reproducibility on real-world oncology protocols.
Details
Motivation: Time toxicity (cumulative healthcare contact days from clinical trial participation) is important but labor-intensive to extract manually from protocol documents, requiring automated solutions.
Method: Three-stage pipeline using Google’s Gemini models: 1) summary extraction from full-length protocol PDFs, 2) time toxicity quantification at six timepoints per treatment arm, 3) multi-run consensus via position-based arm matching. Compared single-pass vs two-stage architectures.
Result: Two-stage pipeline achieved 100% clinically acceptable accuracy on synthetic data (MAE 0.81 days) vs 41.5% for vanilla (MAE 9.0 days). On real-world protocols, vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy across 644 protocols with 82.0% perfect stability.
Conclusion: Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is decisive for production LLM deployment. The pipeline successfully extracted time toxicity for 1,288 treatment arms across multiple disease sites.
Abstract: Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google’s Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
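The reproducibility criteria reported above (clinically acceptable when the across-run IQR is ≤ 3 days; perfect stability when the IQR is 0) can be sketched in a few lines. This is an illustrative reconstruction of the scoring rule only; the function names and the exact quantile method are our assumptions, not the paper's code.

```python
import statistics

def iqr(values):
    """Interquartile range of a small list of per-run estimates."""
    q = statistics.quantiles(sorted(values), n=4)
    return q[2] - q[0]

def stability(per_run_days, acceptable_iqr=3.0):
    """Classify extraction stability across repeated LLM runs.

    Returns 'perfect', 'acceptable', or 'unstable', following the
    paper's reported thresholds (IQR = 0 and IQR <= 3 days).
    """
    spread = iqr(per_run_days)
    if spread == 0:
        return "perfect"
    return "acceptable" if spread <= acceptable_iqr else "unstable"

print(stability([12, 12, 12]))  # three identical runs -> "perfect"
print(stability([12, 13, 14]))  # within 3 days -> "acceptable"
print(stability([5, 12, 30]))   # wide disagreement -> "unstable"
```

Under this rule, the paper's "82.0% perfect stability" corresponds to the share of protocols whose three runs agree exactly.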
cs.CV
[80] Founder effects shape the evolutionary dynamics of multimodality in open LLM families
Manuel Cebrian
Main category: cs.CV
TL;DR: Analysis of multimodal capabilities in open LLM families shows vision-language tasks emerge months to years after text-only models, with multimodality spreading primarily within existing VLM lineages rather than through cross-type transfer from text models.
Details
Motivation: To understand how multimodal capabilities emerge and propagate within open LLM families over time, given the rapid improvement of LLMs but unclear patterns of multimodal adoption.
Method: Used ModelBiome AI Ecosystem dataset of Hugging Face model metadata (1.8M+ entries) to quantify multimodality over time and along parent-to-child lineage relations, analyzing transition rates and founder events.
Result: Vision-language models appear months to years after text-only releases; only 0.218% of text-generation fine-tuning edges yield VLM descendants; 94.5% of VLM-child edges originate from VLM parents; ~60% of VLM releases are new roots without recorded parents.
Conclusion: Multimodality enters open LLM families through rare founder events then expands rapidly within descendant lineages, showing punctuated adoption dynamics with weak cross-type transfer and distinct scaling behavior for multimodal capabilities.
Abstract: Large language model (LLM) families are improving rapidly, yet it remains unclear how quickly multimodal capabilities emerge and propagate within open families. Using the ModelBiome AI Ecosystem dataset of Hugging Face model metadata and recorded lineage fields (>1.8x10^6 model entries), we quantify multimodality over time and along recorded parent-to-child relations. Cross-modal tasks are widespread in the broader ecosystem well before they become common within major open LLM families: within these families, multimodality remains rare through 2023 and most of 2024, then increases sharply in 2024-2025 and is dominated by image-text vision-language tasks. Across major families, the first vision-language model (VLM) variants typically appear months after the first text-generation releases, with lags ranging from ~1 month (Gemma) to more than a year for several families and ~26 months for GLM. Lineage-conditioned transition rates show weak cross-type transfer: among fine-tuning edges from text-generation parents, only 0.218% yield VLM descendants. Instead, multimodality expands primarily within existing VLM lineages: 94.5% of VLM-child fine-tuning edges originate from VLM parents, versus 4.7% from text-generation parents. At the model level, most VLM releases appear as new roots without recorded parents (~60%), while the remainder are predominantly VLM-derived; founder concentration analyses indicate rapid within-lineage amplification followed by diversification. Together, these results show that multimodality enters open LLM families through rare founder events and then expands rapidly within their descendant lineages, producing punctuated adoption dynamics that likely induce distinct, transfer-limited scaling behavior for multimodal capabilities.
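The lineage-conditioned transition rates above (e.g., 0.218% of text-generation fine-tuning edges yielding VLM descendants; 94.5% of VLM-child edges from VLM parents) reduce to simple edge counting. The sketch below shows the computation on toy data; the edge representation and function name are ours, not from the ModelBiome dataset schema.

```python
from collections import Counter

def transition_rates(edges):
    """edges: list of (parent_type, child_type) fine-tuning edges,
    e.g. ('text-generation', 'vlm').

    Returns (1) per-parent-type rates of producing a VLM child and
    (2) the share of VLM-child edges originating from VLM parents."""
    by_parent = Counter(p for p, _ in edges)
    vlm_child_by_parent = Counter(p for p, c in edges if c == "vlm")
    rates = {p: vlm_child_by_parent[p] / n for p, n in by_parent.items()}
    vlm_child_total = sum(vlm_child_by_parent.values())
    vlm_from_vlm = (vlm_child_by_parent["vlm"] / vlm_child_total
                    if vlm_child_total else 0.0)
    return rates, vlm_from_vlm

# Toy ecosystem: text-gen lineages rarely spawn VLMs,
# while VLM lineages keep producing VLM children.
edges = ([("text-generation", "text-generation")] * 8
         + [("text-generation", "vlm")] * 1
         + [("vlm", "vlm")] * 9)
rates, share = transition_rates(edges)
print(rates["text-generation"])  # ~0.111: weak cross-type transfer
print(share)                     # 0.9: most VLM children have VLM parents
```

At ecosystem scale the same counts, conditioned on recorded parent links, produce the founder-effect pattern the paper reports.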
[81] From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs
Federico Toschi, Nicolò Brunello, Andrea Sassella, Vincenzo Scotti, Mark James Carman
Main category: cs.CV
TL;DR: Evaluation of multimodal LLMs for real-time technical assistance using a furniture assembly dataset, assessing reasoning abilities, step tracking, and manual reference capabilities.
Details
Motivation: To evaluate how well current multimodal LLMs can serve as real-time assistants for procedural technical tasks, particularly in VR/AR environments where they share the user's viewpoint and need to reason about the same visual scenario.
Method: Created the Manual to Action Dataset (M2AD) with step-by-step furniture assembly annotations and manual references. Used this dataset to assess MLMs on three aspects: (1) reducing detailed labeling needs through reasoning, (2) tracking assembly step progression, and (3) correctly referencing instruction manual pages.
Result: Some models demonstrated understanding of procedural sequences, but overall performance was limited by architectural and hardware constraints. The study highlighted the need for improved multi-image and interleaved text-image reasoning capabilities.
Conclusion: Current multimodal LLMs show promise for technical assistance but face significant limitations in real-time procedural task support, indicating the need for architectural improvements to handle multi-image contexts and better text-image integration.
Abstract: The recent advancements introduced by Large Language Models (LLMs) have transformed how Artificial Intelligence (AI) can support complex, real-world tasks, pushing research beyond text boundaries towards multimodal contexts and leading to Multimodal Large Language Models (MLMs). Given the current adoption of LLM-based assistants in solving technical or domain-specific problems, the natural continuation of this trend is to extend the input domains of these assistants by exploiting MLMs. Ideally, these MLMs should be used as real-time assistants in procedural tasks, integrating a view of the environment where the user being assisted is, or, even better, sharing the same point of view via Virtual Reality (VR) or Augmented Reality (AR) supports, to reason over the same scenario the user is experiencing. With this work, we aim at evaluating the quality of currently openly available MLMs to provide this kind of assistance on technical tasks. To this end, we annotated a dataset of furniture assembly with step-by-step labels and manual references: the Manual to Action Dataset (M2AD). We used this dataset to assess (1) to which extent the reasoning abilities of MLMs can be used to reduce the need for detailed labeling, allowing for more efficient, cost-effective annotation practices, (2) whether MLMs are able to track the progression of assembly steps, and (3) whether MLMs can refer correctly to the instruction manual pages. Our results showed that while some models understand procedural sequences, their performance is limited by architectural and hardware constraints, highlighting the need for multi-image and interleaved text-image reasoning.
[82] Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing
Weitong Cai, Hang Zhang, Yukai Huang, Shitong Sun, Jiankang Deng, Songcen Xu, Jifei Song, Zhensong Zhang
Main category: cs.CV
TL;DR: ColorTrigger enables efficient streaming video understanding by using continuous grayscale video with sparse RGB frames, reducing sensing and inference costs while maintaining performance.
Details
Motivation: Always-on sensing is crucial for edge/wearable AI, but continuous high-fidelity RGB video is too expensive for resource-constrained platforms. The authors aim to develop a more efficient paradigm for streaming video understanding.
Method: Proposes ColorTrigger: an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Uses lightweight quadratic programming for causal chromatic redundancy detection, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs.
Result: Achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames on streaming video understanding benchmarks, demonstrating substantial color redundancy in natural videos.
Conclusion: The grayscale-always, color-on-demand paradigm enables practical always-on video sensing on resource-constrained devices by exploiting color redundancy in natural videos.
Abstract: Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we discover that color is not always necessary. Sparse RGB frames suffice for comparable performance when temporal structure is preserved via continuous grayscale streams. Building on this insight, we propose ColorTrigger, an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.
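The credit-budgeted control idea above can be illustrated with a toy trigger loop. Note the caveats: the paper's actual decision rule is a quadratic program over windowed grayscale affinities, which is not reproduced here; the change-score input, threshold, and refill schedule below are all our simplifying assumptions.

```python
def color_trigger(change_scores, threshold=0.5, budget=3, refill_every=10):
    """Toy credit-budgeted trigger: given per-window grayscale change
    scores (a stand-in for the paper's affinity/QP analysis), decide
    at which windows to capture an RGB frame, capped by a slowly
    refilling credit budget so triggering stays sparse."""
    credits = budget
    triggers = []
    for t, score in enumerate(change_scores):
        if t > 0 and t % refill_every == 0:
            credits = min(budget, credits + 1)  # periodic credit refill
        if score > threshold and credits > 0:
            credits -= 1                        # spend a credit on RGB capture
            triggers.append(t)
    return triggers

# Mostly static grayscale stream with two chromatic changes
# -> only two RGB frames are captured.
scores = [0.1, 0.2, 0.9, 0.1, 0.1, 0.8, 0.1, 0.1]
print(color_trigger(scores))  # [2, 5]
```

The budget cap is what bounds the RGB duty cycle (the paper reports only 8.1% RGB frames), regardless of how noisy the change signal is.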
[83] When Visuals Aren’t the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations
Harsh Nishant Lalai, Raj Sanjay Shah, Hanspeter Pfister, Sashank Varma, Grace Guo
Main category: cs.CV
TL;DR: VLMs struggle to detect misleading visualizations, especially reasoning-based misinformation, performing better on visual design errors but often misclassifying non-misleading content.
Details
Motivation: Visualizations can be deceptive and propagate misinformation, but current Vision Language Models' ability to detect misleading visualizations, particularly subtle reasoning errors in captions, is not well understood.
Method: Developed a benchmark with real-world visualizations and human-authored misleading captions based on a fine-grained taxonomy of reasoning errors (Cherry-picking, Causal inference) and visualization design errors (Truncated axis, Dual axis, inappropriate encodings). Evaluated commercial and open-source VLMs on this benchmark.
Result: VLMs detect visual design errors more reliably than reasoning-based misinformation, and frequently misclassify non-misleading visualizations as deceptive.
Conclusion: The work bridges the gap between coarse misleading content detection and attributing specific reasoning or visualization errors, revealing VLMs’ limitations in detecting subtle reasoning-based misinformation in visualizations.
Abstract: Visualizations help communicate data insights, but deceptive data representations can distort their interpretation and propagate misinformation. While recent Vision Language Models (VLMs) perform well on many chart understanding tasks, their ability to detect misleading visualizations, especially when deception arises from subtle reasoning errors in captions, remains poorly understood. Here, we evaluate VLMs on misleading visualization-caption pairs grounded in a fine-grained taxonomy of reasoning errors (e.g., Cherry-picking, Causal inference) and visualization design errors (e.g., Truncated axis, Dual axis, inappropriate encodings). To this end, we develop a benchmark that combines real-world visualization with human-authored, curated misleading captions designed to elicit specific reasoning and visualization error types, enabling controlled analysis across error categories and modalities of misleadingness. Evaluating many commercial and open-source VLMs, we find that models detect visual design errors substantially more reliably than reasoning-based misinformation, and frequently misclassify non-misleading visualizations as deceptive. Overall, our work fills a gap between coarse detection of misleading content and the attribution of the specific reasoning or visualization errors that give rise to it.
[84] Tiny Inference-Time Scaling with Latent Verifiers
Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Main category: cs.CV
TL;DR: VHS is a verifier that operates directly on Diffusion Transformer hidden states to enable efficient inference-time scaling for generative models, avoiding costly pixel-space decoding required by MLLM verifiers.
Details
Motivation: MLLM verifiers improve generative model performance but introduce substantial inference-time costs by requiring decoding to pixel space and re-encoding, creating redundant operations. There's a need for more efficient verification methods.
Method: Proposes Verifier on Hidden States (VHS) that operates directly on intermediate hidden representations of Diffusion Transformer single-step generators, analyzing generator features without decoding to pixel space.
Result: Reduces joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5% compared to standard MLLM verifiers, while achieving +2.7% improvement on GenEval at same inference-time budget.
Conclusion: VHS enables more efficient inference-time scaling for generative models by operating directly on hidden states, reducing computational overhead while maintaining or improving performance compared to MLLM-based verifiers.
Abstract: Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling, reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.
[85] Efficient Universal Perception Encoder
Chenchen Zhu, Saksham Suri, Cijo Jose, Maxime Oquab, Marc Szafraniec, Wei Wen, Yunyang Xiong, Patrick Labatut, Piotr Bojanowski, Raghuraman Krishnamoorthi, Vikas Chandra
Main category: cs.CV
TL;DR: EUPE is an efficient universal perception encoder that achieves both inference efficiency and versatile representations for diverse downstream tasks by distilling from multiple domain-expert foundation vision encoders through a scale-up then scale-down approach.
Details
Motivation: Running AI models on edge devices requires small but powerful vision encoders that can handle multiple tasks simultaneously, but existing methods struggle to balance efficiency with versatile representations across diverse domains.
Method: EUPE uses knowledge distillation from multiple domain-expert foundation vision encoders, but instead of directly scaling down from multiple teachers, it first scales up to create a large proxy teacher, then scales down from this single teacher to achieve efficient universal representations.
Result: EUPE achieves on-par or better performance than individual domain experts of the same size across diverse task domains and outperforms previous agglomerative encoders.
Conclusion: The scale-up then scale-down distillation approach enables efficient universal vision encoders suitable for edge devices, with the full model family and code being released to foster research.
Abstract: Running AI models on smart edge devices can unlock versatile user experiences, but presents challenges due to limited compute and the need to handle multiple tasks simultaneously. This requires a vision encoder with small size but powerful and versatile representations. We present our method, Efficient Universal Perception Encoder (EUPE), which offers both inference efficiency and universally good representations for diverse downstream tasks. We achieve this by distilling from multiple domain-expert foundation vision encoders. Unlike previous agglomerative methods that directly scale down from multiple teachers to an efficient encoder, we demonstrate the importance of first scaling up to a large proxy teacher and then scaling down from this single teacher. Experiments show that EUPE achieves on-par or better performance than individual domain experts of the same size on diverse task domains and also outperforms previous agglomerative encoders. We will release the full family of EUPE models and the code to foster future research.
[86] SMSP: A Plug-and-Play Strategy of Multi-Scale Perception for MLLMs to Perceive Visual Illusions
Jinzhe Tu, Ruilei Guo, Zihan Guo, Junxiao Yang, Shiyao Cui, Minlie Huang
Main category: cs.CV
TL;DR: MLLMs are vulnerable to visual illusions where hidden content is imperceptible to models but obvious to humans. The paper introduces IlluChar dataset, identifies high-frequency attention bias as the failure mechanism, and proposes SMSP framework to improve MLLM perception of hidden patterns.
Details
Motivation: Current Multimodal Large Language Models (MLLMs) show significant vulnerability to visual illusions containing hidden patterns that are obvious to humans but imperceptible to models. This reveals a perceptual misalignment between MLLMs and humans, posing potential safety concerns that need to be addressed.
Method: The authors introduce IlluChar, a comprehensive illusion dataset, and identify high-frequency attention bias as the key failure mechanism. They propose Strategy of Multi-Scale Perception (SMSP), a plug-and-play framework that suppresses distracting high-frequency backgrounds to generate images closer to human perception.
Result: SMSP significantly improves MLLM performance on illusion images, increasing accuracy of Qwen3-VL-8B-Instruct from 13.0% to 84.0%. The framework works effectively across all evaluated MLLMs, demonstrating practical improvement in visual perception alignment.
Conclusion: The work provides novel insights into MLLMs’ visual perception mechanisms and offers a practical, robust solution to enhance their ability to perceive hidden patterns in visual illusions, addressing both perceptual misalignment and safety concerns.
Abstract: Recent works have shown that Multimodal Large Language Models (MLLMs) are highly vulnerable to hidden-pattern visual illusions, where the hidden content is imperceptible to models but obvious to humans. This deficiency highlights a perceptual misalignment between current MLLMs and humans, and also introduces potential safety concerns. To systematically investigate this failure, we introduce IlluChar, a comprehensive and challenging illusion dataset, and uncover a key underlying mechanism for the models’ failure: high-frequency attention bias, where the models are easily distracted by high-frequency background textures in illusion images, causing them to overlook hidden patterns. To address the issue, we propose the Strategy of Multi-Scale Perception (SMSP), a plug-and-play framework that aligns with human visual perceptual strategies. By suppressing distracting high-frequency backgrounds, SMSP generates images closer to human perception. Our experiments demonstrate that SMSP significantly improves the performance of all evaluated MLLMs on illusion images, for instance, increasing the accuracy of Qwen3-VL-8B-Instruct from 13.0% to 84.0%. Our work provides novel insights into MLLMs’ visual perception, and offers a practical and robust solution to enhance it. Our code is publicly available at https://github.com/Tujz2023/SMSP.
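The core intuition above — suppressing high-frequency background texture so the low-frequency hidden pattern dominates — is essentially low-pass filtering. The sketch below uses a plain box blur to show that intuition on a toy checkerboard; it is not SMSP's actual multi-scale procedure, which the paper does not reduce to a single filter.

```python
def box_blur(img, k=1):
    """Box blur as a stand-in low-pass filter: averaging over a
    (2k+1)x(2k+1) window attenuates high-frequency texture while
    preserving large-scale (low-frequency) structure."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [img[ii][jj]
                    for ii in range(max(0, i - k), min(h, i + k + 1))
                    for jj in range(max(0, j - k), min(w, j + k + 1))]
            out[i][j] = sum(vals) / len(vals)
    return out

# A high-frequency checkerboard (distracting texture) flattens toward
# its mean, while a constant (low-frequency) region is untouched.
checker = [[(i + j) % 2 for j in range(6)] for i in range(6)]
smooth = box_blur(checker)
print(smooth[2][2])  # interior values pulled toward 0.5
```

Any hidden pattern drawn at a coarser scale than the texture survives this operation, which is why the filtered image is "closer to human perception" in the paper's framing.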
[87] Spatially-Aware Evaluation Framework for Aerial LiDAR Point Cloud Semantic Segmentation: Distance-Based Metrics on Challenging Regions
Alex Salvatierra, José Antonio Sanz, Christian Gutiérrez, Mikel Galar
Main category: cs.CV
TL;DR: Novel evaluation framework for aerial LiDAR semantic segmentation that introduces distance-based metrics and focused evaluation on hard points to address limitations of traditional metrics like mIoU and OA.
Details
Motivation: Traditional semantic segmentation metrics (mIoU, OA) have two key limitations for aerial LiDAR data: 1) they treat all misclassifications equally regardless of spatial context, ignoring geometric severity of errors that impact derived geospatial products, and 2) they are dominated by easily classified points, masking meaningful differences between models and under-representing performance in challenging regions.
Method: Two complementary approaches: 1) Distance-based metrics that account for spatial deviation between misclassified points and nearest ground-truth points of predicted class to capture geometric severity of errors, and 2) Focused evaluation on a common subset of hard points (points misclassified by at least one model) to reduce bias from easily classified points and better reveal performance differences in challenging regions.
Result: Validated on three state-of-the-art deep learning models across three aerial LiDAR datasets. Results show proposed metrics provide complementary information to traditional measures, revealing spatial error patterns critical for Earth Observation applications but invisible to conventional evaluation approaches.
Conclusion: The proposed framework enables more informed model selection for scenarios where spatial consistency is critical, addressing key limitations of traditional metrics for aerial LiDAR semantic segmentation evaluation.
Abstract: Semantic segmentation metrics for 3D point clouds, such as mean Intersection over Union (mIoU) and Overall Accuracy (OA), present two key limitations in the context of aerial LiDAR data. First, they treat all misclassifications equally regardless of their spatial context, overlooking cases where the geometric severity of errors directly impacts the quality of derived geospatial products such as Digital Terrain Models. Second, they are often dominated by the large proportion of easily classified points, which can mask meaningful differences between models and under-represent performance in challenging regions. To address these limitations, we propose a novel evaluation framework for comparing semantic segmentation models through two complementary approaches. First, we introduce distance-based metrics that account for the spatial deviation between each misclassified point and the nearest ground-truth point of the predicted class, capturing the geometric severity of errors. Second, we propose a focused evaluation on a common subset of hard points, defined as the points misclassified by at least one of the evaluated models, thereby reducing the bias introduced by easily classified points and better revealing differences in model performance in challenging regions. We validate our framework by comparing three state-of-the-art deep learning models on three aerial LiDAR datasets. Results demonstrate that the proposed metrics provide complementary information to traditional measures, revealing spatial error patterns that are critical for Earth Observation applications but invisible to conventional evaluation approaches. The proposed framework enables more informed model selection for scenarios where spatial consistency is critical.
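The distance-based metric described above has a direct brute-force form: for each misclassified point, measure the distance to the nearest ground-truth point carrying the *predicted* label. The sketch below is our illustrative implementation (the paper would use a spatial index rather than an O(n²) scan, and the function name is ours).

```python
import math

def distance_weighted_errors(points, gt_labels, pred_labels):
    """For each misclassified point, return the distance to the nearest
    ground-truth point of the predicted class, capturing the geometric
    severity of the error rather than treating all errors equally."""
    dists = []
    for i, (gt, pr) in enumerate(zip(gt_labels, pred_labels)):
        if gt == pr:
            continue  # correct points contribute nothing
        candidates = [points[j] for j, g in enumerate(gt_labels) if g == pr]
        if not candidates:
            continue  # predicted class absent from ground truth
        dists.append(min(math.dist(points[i], c) for c in candidates))
    return dists

# A point mislabelled as 'ground' right next to real ground is a mild
# error; one 10 m above any ground point is geometrically severe.
pts = [(0, 0, 0), (0, 0, 1), (0, 0, 10)]
gt = ["ground", "veg", "veg"]
pred = ["ground", "ground", "ground"]
print(distance_weighted_errors(pts, gt, pred))  # [1.0, 10.0]
```

Both errors count identically under mIoU/OA; the distance weighting is what separates the benign confusion from the one that would corrupt a Digital Terrain Model.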
[88] OsteoFlow: Lyapunov-Guided Flow Distillation for Predicting Bone Remodeling after Mandibular Reconstruction
Hamidreza Aftabi, Faye Yu, Brooke Switzer, Zachary Fishman, Eitan Prisman, Antony Hodgson, Cari Whyne, Sidney Fels, Michael Hardisty
Main category: cs.CV
TL;DR: OsteoFlow: A flow-based framework using Lyapunov-guided trajectory distillation to predict Year-1 post-operative CT scans from Day-5 scans for mandibular reconstruction, outperforming baselines by ~20% in surgical resection zones.
Details
Motivation: Predicting long-term bone remodeling after mandibular reconstruction is clinically important but challenging for standard generative models, which struggle with trajectory-level consistency and anatomical fidelity over long time horizons.
Method: OsteoFlow uses a flow-based framework with Lyapunov-guided trajectory distillation that distills continuous trajectories over transport time from registration-derived stationary velocity field teachers, combined with resection-aware image loss to enforce geometric correspondence while maintaining generative capacity.
Result: Evaluated on 344 paired regions of interest, OsteoFlow significantly outperforms state-of-the-art baselines, reducing mean absolute error in the surgical resection zone by approximately 20%.
Conclusion: The approach demonstrates the promise of trajectory distillation for long-term prediction in medical imaging applications like bone remodeling prediction.
Abstract: Predicting long-term bone remodeling after mandibular reconstruction would be of great clinical benefit, yet standard generative models struggle to maintain trajectory-level consistency and anatomical fidelity over long horizons. We introduce OsteoFlow, a flow-based framework predicting Year-1 post-operative CT scans from Day-5 scans. Our core contribution is Lyapunov-guided trajectory distillation: unlike one-step distillation, our method distills a continuous trajectory over transport time from a registration-derived stationary velocity field teacher. Combined with a resection-aware image loss, this enforces geometric correspondence without sacrificing generative capacity. Evaluated on 344 paired regions of interest, OsteoFlow significantly outperforms state-of-the-art baselines, reducing mean absolute error in the surgical resection zone by ~20%. This highlights the promise of trajectory distillation for long-term prediction. Code is available on GitHub: OsteoFlow.
[89] Multi-Modal Image Fusion via Intervention-Stable Feature Learning
Xue Wang, Zheng Guan, Wenhua Qian, Chengchao Wang, Runzhuo Ma
Main category: cs.CV
TL;DR: Proposes a causal intervention framework for multimodal image fusion to identify robust cross-modal dependencies rather than spurious correlations, using three intervention strategies and a Causal Feature Integrator.
Details
Motivation: Current multimodal image fusion methods optimize statistical correlations between modalities, which often capture dataset-induced spurious associations that degrade under distribution shifts. There's a need to identify robust cross-modal dependencies that remain stable across different conditions.
Method: Uses causal principles with three intervention strategies: 1) complementary masking with spatially disjoint perturbations to test genuine compensation, 2) random masking of identical regions to identify informative features under partial observability, and 3) modality dropout to evaluate irreplaceable contributions. Implements a Causal Feature Integrator (CFI) with adaptive invariance gating to prioritize intervention-stable features.
Result: Achieves state-of-the-art performance on both public benchmarks and downstream high-level vision tasks, demonstrating improved robustness and generalization.
Conclusion: The causal intervention framework successfully identifies robust cross-modal dependencies in multimodal image fusion, outperforming correlation-based methods and showing better generalization under distribution shifts.
Abstract: Multi-modal image fusion integrates complementary information from different modalities into a unified representation. Current methods predominantly optimize statistical correlations between modalities, often capturing dataset-induced spurious associations that degrade under distribution shifts. In this paper, we propose an intervention-based framework inspired by causal principles to identify robust cross-modal dependencies. Drawing insights from Pearl’s causal hierarchy, we design three principled intervention strategies to probe different aspects of modal relationships: i) complementary masking with spatially disjoint perturbations tests whether modalities can genuinely compensate for each other’s missing information, ii) random masking of identical regions identifies feature subsets that remain informative under partial observability, and iii) modality dropout evaluates the irreplaceable contribution of each modality. Based on these interventions, we introduce a Causal Feature Integrator (CFI) that learns to identify and prioritize intervention-stable features maintaining importance across different perturbation patterns through adaptive invariance gating, thereby capturing robust modal dependencies rather than spurious correlations. Extensive experiments demonstrate that our method achieves SOTA performance on both public benchmarks and downstream high-level vision tasks.
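The three interventions lend themselves to a compact sketch. Below is a minimal NumPy illustration of the masking logic described above; the block size, masking ratio, and visible/hidden convention are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def _block_mask(h, w, block, p):
    """Random block mask over an (h, w) grid: True = visible."""
    bh, bw = -(-h // block), -(-w // block)            # ceil division
    m = rng.random((bh, bw)) >= p
    m = np.repeat(np.repeat(m, block, axis=0), block, axis=1)
    return m[:h, :w]

def complementary_mask(x_a, x_b, block=8):
    """Occlude spatially disjoint regions: wherever one modality is
    hidden, the other stays visible, probing genuine compensation."""
    m = _block_mask(*x_a.shape[:2], block, 0.5)
    return x_a * m[..., None], x_b * ~m[..., None]

def shared_random_mask(x_a, x_b, block=8, p=0.3):
    """Occlude the *same* regions in both modalities, probing which
    features stay informative under partial observability."""
    m = _block_mask(*x_a.shape[:2], block, p)
    return x_a * m[..., None], x_b * m[..., None]

def modality_dropout(x_a, x_b, drop_a=True):
    """Zero out one entire modality to measure its irreplaceable
    contribution to the fused result."""
    return (np.zeros_like(x_a), x_b) if drop_a else (x_a, np.zeros_like(x_b))
```

A fusion network trained under such perturbations can then be scored on which features keep their importance across all three regimes, which is the role the CFI's invariance gating plays.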
[90] Static Scene Reconstruction from Dynamic Egocentric Videos
Qifei Cui, Patrick Chen
Main category: cs.CV
TL;DR: A robust pipeline for 3D reconstruction from egocentric videos that adapts static reconstruction systems to handle dynamic scenes with moving hands and camera motion.
Details
Motivation: Egocentric videos present unique challenges for 3D reconstruction due to rapid camera motion and frequent dynamic interactions (especially moving hands), causing state-of-the-art static reconstruction systems to degrade with trajectory drift and "ghost" geometry artifacts.
Method: Proposes a mask-aware reconstruction mechanism that explicitly suppresses dynamic foreground in attention layers to prevent hand artifacts, and employs a chunked reconstruction strategy with pose-graph stitching for global consistency and elimination of long-term drift.
Result: Experiments on HD-EPIC and indoor drone datasets show significant improvements in absolute trajectory error and yield visually clean static geometry compared to naive baselines.
Conclusion: The pipeline effectively extends the capability of foundation models to dynamic first-person scenes by adapting static reconstruction backbones to long-form egocentric video.
Abstract: Egocentric videos present unique challenges for 3D reconstruction due to rapid camera motion and frequent dynamic interactions. State-of-the-art static reconstruction systems, such as MapAnything, often degrade in these settings, suffering from catastrophic trajectory drift and “ghost” geometry caused by moving hands. We bridge this gap by proposing a robust pipeline that adapts static reconstruction backbones to long-form egocentric video. Our approach introduces a mask-aware reconstruction mechanism that explicitly suppresses dynamic foreground in the attention layers, preventing hand artifacts from contaminating the static map. Furthermore, we employ a chunked reconstruction strategy with pose-graph stitching to ensure global consistency and eliminate long-term drift. Experiments on HD-EPIC and indoor drone datasets demonstrate that our pipeline significantly improves absolute trajectory error and yields visually clean static geometry compared to naive baselines, effectively extending the capability of foundation models to dynamic first-person scenes.
[91] MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He
Main category: cs.CV
TL;DR: MinerU-Diffusion: A diffusion-based framework for document OCR that replaces autoregressive decoding with parallel diffusion denoising, achieving faster inference and improved robustness.
Details
Motivation: Current OCR systems rely on autoregressive decoding which introduces sequential latency and error propagation in long documents. The authors argue that left-to-right causal generation is an artifact of serialization rather than intrinsic to document OCR tasks.
Method: Proposes MinerU-Diffusion, a unified diffusion-based framework with block-wise diffusion decoder and uncertainty-driven curriculum learning for stable training and efficient long-sequence inference.
Result: Achieves up to 3.2x faster decoding compared to autoregressive baselines while improving robustness. Shows reduced dependence on linguistic priors and stronger visual OCR capability on Semantic Shuffle benchmark.
Conclusion: Diffusion-based parallel decoding is a promising alternative to autoregressive methods for document OCR, offering faster inference and better handling of complex document structures.
Abstract: Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.
[92] Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio
Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro
Main category: cs.CV
TL;DR: Unified multimodal framework for sign language, lip movements, and audio processing to generate spoken-language text, achieving SOTA performance across multiple tasks.
Details
Motivation: Audio-centric ASR systems exclude deaf/hard-of-hearing individuals. While sign language and lip reading offer alternatives, they've been studied in isolation without unified integration. Need a framework that combines these modalities for inclusive communication.
Method: Proposes first unified framework handling sign language, lip movements, and audio combinations for spoken-language text generation. Features modality-agnostic architecture processing heterogeneous inputs, explores modality synergy (especially lip movements as non-manual cues in sign language), and aims for SOTA performance.
Result: Achieves performance on par with or better than task-specific SOTA models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Key finding: explicitly modeling lip movements as distinct modality significantly improves SLT performance by capturing critical non-manual cues.
Conclusion: Unified multimodal framework successfully integrates diverse communication modalities, enabling inclusive systems and revealing important linguistic insights about modality interactions.
Abstract: Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such audio-centric systems inherently exclude individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Furthermore, our analysis reveals a key linguistic insight: explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.
[93] Drop-In Perceptual Optimization for 3D Gaussian Splatting
Ezgi Ozyilkan, Zhiqi Chen, Oren Rippel, Jona Ballé, Kedar Tatwawadi
Main category: cs.CV
TL;DR: WD-R (regularized Wasserstein Distortion) emerges as the best perceptual loss for 3D Gaussian Splatting, outperforming previous methods in human preference studies and perceptual metrics.
Details
Motivation: Current 3D Gaussian Splatting methods use ad-hoc pixel-level losses that produce blurry renderings, despite outputs being viewed by humans. There's a need for systematic perceptual optimization strategies.
Method: Conducted large-scale human subjective study (39,320 pairwise ratings) across multiple datasets and 3DGS frameworks. Systematically explored diverse distortion losses and identified WD-R (regularized Wasserstein Distortion) as optimal.
Result: WD-R was preferred 2.3× over original 3DGS loss and 1.5× over Perceptual-GS. Achieved SOTA LPIPS, DISTS, and FID scores. Generalized across frameworks (Mip-Splatting, Scaffold-GS) with 1.8× and 3.6× human preference improvements. Enabled ≈50% bitrate savings for 3DGS scene compression.
Conclusion: WD-R is an effective perceptual loss for 3DGS that significantly improves visual quality without increasing computational cost, with applications in scene compression.
Abstract: Despite their output being ultimately consumed by human viewers, 3D Gaussian Splatting (3DGS) methods often rely on ad-hoc combinations of pixel-level losses, resulting in blurry renderings. To address this, we systematically explore perceptual optimization strategies for 3DGS by searching over a diverse set of distortion losses. We conduct the first-of-its-kind large-scale human subjective study on 3DGS, involving 39,320 pairwise ratings across several datasets and 3DGS frameworks. A regularized version of Wasserstein Distortion, which we call WD-R, emerges as the clear winner, excelling at recovering fine textures without incurring a higher splat count. WD-R is preferred by raters more than $2.3\times$ over the original 3DGS loss, and $1.5\times$ over the current best method, Perceptual-GS. WD-R also consistently achieves state-of-the-art LPIPS, DISTS, and FID scores across various datasets, and generalizes across recent frameworks, such as Mip-Splatting and Scaffold-GS, where replacing the original loss with WD-R consistently enhances perceptual quality within a similar resource budget (number of splats for Mip-Splatting, model size for Scaffold-GS), and leads to reconstructions being preferred by human raters $1.8\times$ and $3.6\times$, respectively. We also find that this carries over to the task of 3DGS scene compression, with $\approx 50\%$ bitrate savings for comparable perceptual metric performance.
[94] Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
Delin An, Chaoli Wang
Main category: cs.CV
TL;DR: Sketch2CT is a multimodal diffusion framework that generates 3D medical CT volumes from 2D sketches and text descriptions, producing anatomically consistent structures for medical data augmentation.
Details
Motivation: Addressing the challenge of data scarcity in medical imaging by developing a controllable method to generate 3D medical volumes with anatomically consistent structures under multimodal conditions (sketch + text).
Method: A multimodal diffusion framework that first generates 3D segmentation masks from random noise conditioned on 2D sketches and textual descriptions, using specialized modules for sketch-text alignment and fusion. The masks then guide a latent diffusion model for 3D CT volume synthesis.
Result: Extensive experiments on public CT datasets show superior performance in generating multimodal medical volumes compared to existing methods, enabling efficient medical dataset augmentation.
Conclusion: Sketch2CT provides a controllable, low-cost generation pipeline for principled augmentation of medical datasets, addressing data scarcity through multimodal 3D medical volume generation.
Abstract: Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets. Code is available at https://github.com/adlsn/Sketch2CT.
[95] Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy
Geewook Kim, Minjoon Seo
Main category: cs.CV
TL;DR: The paper audits video benchmarks and finds they’re largely solvable from vision alone, then enhances LLaVA-OneVision with speech/audio encoders to show audio actually matters for speech comprehension and cross-modal grounding tasks.
Details
Motivation: Current video understanding benchmarks don't properly measure audio-visual reasoning since they can be largely solved with visual cues alone, leading to underestimation of audio's importance in multimodal understanding.
Method: Audited 10 video benchmarks, found ~76% of AVQA solvable from single frames; enhanced LLaVA-OneVision with speech/audio encoder, compared 5 compressor architectures with 25x token reduction (25Hz to 1Hz).
Result: Audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric benchmarks remain largely unaffected; speech encoders play larger role than current benchmarks suggest.
Conclusion: Video understanding benchmarks need better audio-visual reasoning measurement; speech/audio encoders are crucial for true multimodal understanding, especially for speech comprehension and cross-modal grounding tasks.
Abstract: Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines – not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers ~76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25x token reduction (25 Hz to 1 Hz). Across 10 benchmarks – with and without filtering – audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
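The 25x token reduction (25 Hz to 1 Hz) can be pictured with the simplest member of a compressor family: mean pooling over non-overlapping windows. This is a hedged sketch of the idea, not one of the paper's five compressor architectures.

```python
import numpy as np

def pool_audio_tokens(tokens, rate_in=25, rate_out=1):
    """Mean-pool non-overlapping windows of a (T, d) token sequence,
    e.g. 25 Hz -> 1 Hz yields a 25x reduction in token count."""
    window = rate_in // rate_out
    t_trim = (tokens.shape[0] // window) * window      # drop ragged tail
    return tokens[:t_trim].reshape(-1, window, tokens.shape[1]).mean(axis=1)
```

A learned compressor (attention pooling, convolutional striding, etc.) would replace the mean with trainable aggregation, but the interface, many audio tokens in and one token per second out, is the same.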
[96] POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi
Main category: cs.CV
TL;DR: POVQA introduces a data-efficient pipeline for video question answering that compresses each second of video into a single temporally pooled image, then aligns large vision-language models with lightweight supervision to improve performance on reasoning-based VQA tasks.
Details
Motivation: Current video question answering systems with large vision-language models have limited context windows (only 50 seconds for 1500+ frames), which restricts their ability to process longer videos without losing information. There's a need for more efficient temporal compression methods that can preserve essential video information while enabling better reasoning capabilities.
Method: POVQA compresses each second of video into a single temporally pooled image using various techniques: Blend Blur with Last Frame, Weighted Average, Exponential, and Ramp pooling. The system then fine-tunes QWEN-2.5-VL 7B model with supervised two-turn targets including reasoning and final answer. Both Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) are applied on the novel ReasonVQA dataset containing 12 movies with 239 human-annotated question-answer pairs with reasoning prompts.
Result: On the ReasonVQA dataset, POVQA dramatically improves performance over pooled baselines: F1 score improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also significantly increases. Cross-evaluation shows gains persist regardless of pooling scheme used at train or test time, indicating strong robustness. Similar improvements were observed in zero-shot evaluation on TVQA.
Conclusion: POVQA demonstrates that compressing video into temporally pooled images combined with lightweight supervision (SFT+DPO) can significantly improve video question answering performance, particularly for reasoning-based tasks. The approach is robust across different temporal pooling schemes and shows promise for efficient long-video understanding.
Abstract: Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has gained significant traction in research ever since Flamingo was introduced by DeepMind. Recent advancements in large-context/long-video question answering have allowed VQA tasks to have context windows of 1500+ frames. However, this amounts to only about 50 seconds of video footage without losing any significant information. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and then aligns LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential, and Ramp pooling and fine-tune QWEN-2.5-VL 7B with supervised two-turn targets including reasoning and a final answer. We apply Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) on our novel dataset ReasonVQA, consisting of 12 movies with 239 human-annotated question-answer pairs with reasoning prompts. On our ReasonVQA dataset, this method dramatically improves performance over pooled baselines: F1 score improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also increases significantly. Cross-evaluation of SFT + DPO across pooling functions shows that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness in summarizing temporal evidence. Similar observations were made in zero-shot evaluation on TVQA.
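The per-second pooling step can be sketched as a weighted combination of frames. The weight schedules below (`average`, `ramp`, `exponential`) are illustrative stand-ins for the Weighted Average, Ramp, and Exponential variants named above; the paper's exact coefficients and the Blend Blur variant are not reproduced here.

```python
import numpy as np

def pool_second(frames, scheme="exponential"):
    """Collapse one second of video, shape (n, H, W, C), into a single
    image via a per-frame weighting scheme (weights are illustrative
    choices, not the paper's coefficients)."""
    n = frames.shape[0]
    if scheme == "average":
        w = np.ones(n)
    elif scheme == "ramp":                     # linearly favor later frames
        w = np.arange(1, n + 1, dtype=float)
    elif scheme == "exponential":              # exponentially favor later frames
        w = np.exp(np.linspace(-2.0, 0.0, n))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    w = w / w.sum()                            # normalize to a convex combination
    return np.tensordot(w, frames, axes=(0, 0))
```

Running this at 1 fps turns a 24-30 fps video into one image per second, which is the compressed input the LVLM is then fine-tuned on.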
[97] Harnessing Lightweight Transformer with Contextual Synergic Enhancement for Efficient 3D Medical Image Segmentation
Xinyu Liu, Zhen Chen, Wuyang Li, Chenxin Li, Yixuan Yuan
Main category: cs.CV
TL;DR: Light-UNETR: A lightweight transformer architecture with novel attention modules and semi-supervised learning strategy for efficient 3D medical image segmentation with limited labeled data.
Details
Motivation: Transformers show strong performance in 3D medical segmentation but face challenges with high computational requirements and need for large labeled datasets, limiting practical clinical applications.
Method: Proposes Light-UNETR with Lightweight Dimension Reductive Attention (LIDR) module for efficient global-local feature capture and Compact Gated Linear Unit (CGLU) for selective channel interaction. Also introduces Contextual Synergic Enhancement (CSE) learning strategy using Attention-Guided Replacement and Spatial Masking Consistency for semi-supervised learning.
Result: Achieves superior performance with only 10% labeled data on Left Atrial Segmentation dataset, surpassing BCP by 1.43% Jaccard while reducing FLOPs by 90.8% and parameters by 85.8%.
Conclusion: Light-UNETR provides an efficient solution for 3D medical segmentation that addresses both model efficiency (computational requirements) and data efficiency (limited labeled data) challenges.
Abstract: Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational requirements and need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. Specifically, we propose Light-UNETR, a lightweight transformer designed to achieve model efficiency. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) to selectively control channel interaction with minimal parameters. Furthermore, we introduce a Contextual Synergic Enhancement (CSE) learning strategy, which aims to boost the data efficiency of Transformers. It first leverages the extrinsic contextual information to support the learning of unlabeled data with Attention-Guided Replacement, then applies Spatial Masking Consistency that utilizes intrinsic contextual information to enhance the spatial context reasoning for unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while drastically reducing the FLOPs by 90.8% and parameters by 85.8%. Code is released at https://github.com/CUHK-AIM-Group/Light-UNETR.
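For readers unfamiliar with gating, the mechanism that CGLU compacts can be sketched as the classic gated linear unit: one linear branch modulated elementwise by a sigmoid branch. The exact CGLU parameterization is not given here, so this is only the assumed baseline form.

```python
import numpy as np

def gated_linear_unit(x, W, V):
    """Classic GLU: (x @ W) elementwise-scaled by sigmoid(x @ V).
    The gate selectively controls which output channels pass through;
    a compact variant would share or shrink the gate's parameters."""
    gate = 1.0 / (1.0 + np.exp(-(x @ V)))      # sigmoid gate per channel
    return (x @ W) * gate
```

The gate lets the network suppress or admit channels input-dependently, which is what "selectively control channel interaction" refers to; making the gate parameter-light is where the "Compact" in CGLU comes in.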
[98] High Resolution Flood Extent Detection Using Deep Learning with Random Forest Derived Training Labels
Azizbek Nuriddinov, Ebrahim Ahmadisharaf, Mohammad Reza Alizadeh
Main category: cs.CV
TL;DR: Flood mapping framework using PlanetScope optical imagery and topographic features with ML/DL, showing topographic features provide marginal improvement over optical-only approach.
Details
Motivation: Flood model validation is challenging due to limited observations during extreme events. High-resolution optical imagery offers opportunities but is limited by cloud cover and lack of labeled training data during disasters.
Method: Developed flood mapping framework integrating PlanetScope optical imagery with topographic features using ML/DL. Used Random Forest on expert-annotated flood masks to generate training labels for U-Net models. Trained two U-Net models with ResNet18 backbone: optical-only (4 bands) and optical+topographic (HAND and slope, 6 bands).
Result: U-Net model with topographic features achieved very close performance to optical-only configuration (F1=0.92 and IoU=0.85 for both). HAND and slope provided only marginal value to inundation extent detection.
Conclusion: The framework offers a scalable and label-efficient approach for mapping inundation extent that enables modeling under data-scarce flood scenarios. Topographic features provide limited additional value over optical imagery alone.
Abstract: Validation of flood models, used to support risk mitigation strategies, remains challenging due to limited observations during extreme events. High-frequency, high-resolution optical imagery (~3 m), such as PlanetScope, offers new opportunities for flood mapping, although applications remain limited by cloud cover and the lack of labeled training data during disasters. To address this, we develop a flood mapping framework that integrates PlanetScope optical imagery with topographic features using machine learning (ML) and deep learning (DL) algorithms. A Random Forest model was applied to expert-annotated flood masks to generate training labels for the DL model, U-Net. Two U-Net models with a ResNet18 backbone were trained using optical imagery only (4 bands) and optical imagery combined with Height Above Nearest Drainage (HAND) and topographic slope (6 bands). Hurricane Ida (September 2021), which caused catastrophic flooding across the eastern United States, including the New York City metropolitan area, was used as an example to evaluate the framework. Results demonstrate that the U-Net model with topographic features achieved performance very close to the optical-only configuration (F1=0.92 and IoU=0.85 for both modeling scenarios), indicating that HAND and slope provide only marginal value to inundation extent detection. The proposed framework offers a scalable and label-efficient approach for mapping inundation extent that enables modeling under data-scarce flood scenarios.
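The two reported metrics are straightforward to compute from binary masks, and they are mutually consistent: for binary segmentation, IoU = F1 / (2 - F1), and 0.92 / 1.08 ≈ 0.85. A minimal sketch:

```python
import numpy as np

def f1_iou(pred, truth):
    """Pixel-wise F1 (Dice) and IoU (Jaccard) for binary flood masks."""
    tp = np.sum(pred & truth)       # flooded in both prediction and reference
    fp = np.sum(pred & ~truth)      # predicted flooded, actually dry
    fn = np.sum(~pred & truth)      # predicted dry, actually flooded
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou
```

Because the two scores are deterministically linked for binary masks, reporting both mainly aids comparison with prior work that uses one or the other.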
[99] Investigating self-supervised representations for audio-visual deepfake detection
Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata
Main category: cs.CV
TL;DR: Systematic evaluation of self-supervised representations for audio-visual deepfake detection across modalities and domains, finding audio-informed features generalize best and achieve SOTA results.
Details
Motivation: Self-supervised representations excel at vision and speech tasks but their potential for audio-visual deepfake detection remains underexplored, with prior work using them in isolation or within complex architectures.
Method: Systematically evaluate self-supervised features across modalities (audio, video, multimodal) and domains (lip movements, generic visual content), assessing three dimensions: detection effectiveness, interpretability, and cross-modal complementarity.
Result: Most self-supervised features capture deepfake-relevant information that is complementary; models attend to semantically meaningful regions; audio-informed representations generalize best and achieve state-of-the-art results; generalization to in-the-wild data remains challenging due to dataset difficulty.
Conclusion: Self-supervised representations are effective for audio-visual deepfake detection, with audio features performing best, but real-world generalization requires addressing intrinsic dataset challenges rather than superficial pattern learning.
Abstract: Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts (such as the leading silence). Among the investigated features, audio-informed representations generalize best and achieve state-of-the-art results. However, generalization to realistic in-the-wild data remains challenging. Our analysis indicates this gap stems from intrinsic dataset difficulty rather than from features latching onto superficial patterns. Project webpage: https://bit-ml.github.io/ssr-dfd.
[100] Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong
Main category: cs.CV
TL;DR: Ego2Web: A benchmark bridging egocentric video perception and web agent execution for multimodal AI agents that need to understand real-world physical surroundings and complete related online tasks.
Details
Motivation: Current web-agent benchmarks focus only on web-based interaction and perception, lacking grounding in real-world physical surroundings. This prevents evaluation of crucial scenarios where agents must use egocentric visual perception (e.g., via AR glasses) to recognize objects in the user's surroundings and complete related online tasks.
Method: Introduces Ego2Web benchmark with real-world first-person video recordings paired with web tasks requiring visual understanding, web task planning, and online interaction. Uses automatic data-generation pipeline with human verification to curate video-task pairs across diverse web task types (e-commerce, media retrieval, knowledge lookup). Also develops Ego2WebJudge, a novel LLM-as-a-Judge automatic evaluation method.
Result: Ego2WebJudge achieves ~84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments show state-of-the-art agents perform weakly on Ego2Web with substantial headroom across all task categories. Ablation study highlights necessity of accurate video understanding and limitations of current agents.
Conclusion: Ego2Web addresses critical gap in multimodal AI agent evaluation by bridging physical and digital worlds. It enables development of truly capable AI assistants that can seamlessly see, understand, and act across both domains, providing a critical new resource for advancing multimodal AI systems.
Abstract: Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user’s real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user’s surroundings and then complete a related task online. To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification and refinement to curate well-constructed, high-quality video-task pairs across diverse web task types, including e-commerce, media retrieval, knowledge lookup, etc. To facilitate accurate and scalable evaluation for our benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, which achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse SoTA agents on our Ego2Web show that their performance is weak, with substantial headroom across all task categories. We also conduct a comprehensive ablation study on task design, highlighting the necessity of accurate video understanding in the proposed task and the limitations of current agents. We hope Ego2Web can be a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.
[101] UrbanVGGT: Scalable Sidewalk Width Estimation from Street View Images
Kaizhen Tan, Fan Zhang
Main category: cs.CV
TL;DR: UrbanVGGT: A pipeline for estimating metric sidewalk width from single street-view images using semantic segmentation, 3D reconstruction, and scale calibration.
Details
Motivation: Sidewalk width is important for pedestrian accessibility, but large-scale data is scarce; existing methods are costly, require high-resolution imagery, or make simplified geometric assumptions that limit scalability.
Method: Combines semantic segmentation, feed-forward 3D reconstruction, adaptive ground-plane fitting, camera-height-based scale calibration, and directional width measurement on the recovered plane.
Result: Achieves mean absolute error of 0.252 m on Washington, D.C. benchmark, with 95.5% of estimates within 0.50 m of reference width; generated SV-SideWidth dataset covering 527 OpenStreetMap street segments.
Conclusion: Street-view imagery can support scalable generation of sidewalk-width attributes, though broader cross-city validation and local ground-truth auditing remain necessary before deployment as authoritative planning data.
Abstract: Sidewalk width is an important indicator of pedestrian accessibility, comfort, and network quality, yet large-scale width data remain scarce in most cities. Existing approaches typically rely on costly field surveys, high-resolution overhead imagery, or simplified geometric assumptions that limit scalability or introduce systematic error. To address this gap, we present UrbanVGGT, a measurement pipeline for estimating metric sidewalk width from a single street-view image. The method combines semantic segmentation, feed-forward 3D reconstruction, adaptive ground-plane fitting, camera-height-based scale calibration, and directional width measurement on the recovered plane. On a ground-truth benchmark from Washington, D.C., UrbanVGGT achieves a mean absolute error of 0.252 m, with 95.5% of estimates within 0.50 m of the reference width. Ablation experiments show that metric scale calibration is the most critical component, and controlled comparisons with alternative geometry backbones support the effectiveness of the overall design. As a feasibility demonstration, we further apply the pipeline to three cities and generate SV-SideWidth, a prototype sidewalk-width dataset covering 527 OpenStreetMap street segments. The results indicate that street-view imagery can support scalable generation of candidate sidewalk-width attributes, while broader cross-city validation and local ground-truth auditing remain necessary before deployment as authoritative planning data.
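The camera-height scale calibration, which the ablations identify as the most critical component, can be pictured with a minimal NumPy sketch (the function names and toy numbers below are hypothetical, not from the paper): fit a ground plane to reconstructed points, read off the camera's height above it in reconstruction units, then rescale so it matches the known physical mounting height.

```python
import numpy as np

def fit_ground_plane(points):
    """Least-squares plane fit: unit normal n and offset d with n.p + d = 0."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]                         # direction of least variance = plane normal
    return n, -n @ centroid

def metric_scale(ground_points, camera_height_m):
    """Scale mapping reconstruction units to metres via the known camera height."""
    n, d = fit_ground_plane(ground_points)
    h_rec = abs(d)                     # camera at origin, so |d| is its height above the plane
    return camera_height_m / h_rec

# Toy scene: ground plane z = -2 (reconstruction units), camera mounted 2.5 m up.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-5, 5, 200),
                       rng.uniform(0, 10, 200),
                       np.full(200, -2.0)])
s = metric_scale(pts, camera_height_m=2.5)
width_rec = 1.6                        # sidewalk width measured on the recovered plane
print(round(width_rec * s, 2))         # → 2.0 metres
```

Once the scale is known, any distance measured on the recovered plane (here, the directional sidewalk width) converts to metres by a single multiplication.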
[102] Generalized multi-object classification and tracking with sparse feature resonator networks
Lazar Supic, Alec Mullen, E. Paxon Frady
Main category: cs.CV
TL;DR: A generative model with resonator networks captures both invariant and equivariant visual scene structure, enabling object recognition without translation invariance assumptions and precise position tracking.
Details
Motivation: Traditional neural networks often lose equivariant information (like precise object locations) when trained for invariance to transformations. They also don't naturally guarantee invariance and generalize poorly to unseen transformations.
Method: Uses analysis-by-synthesis with resonator networks. A generative model constructs scenes with MNIST digits and transformations (color, position). Resonator networks invert the generative model to extract both invariant and equivariant information. Sparse features learned from data provide flexibility for unseen shapes.
Result: The modular structure separates shape and translation information, allowing classification on centered data only. Network identifies objects with arbitrary translations without data augmentation. Natural attention mechanism handles multiple objects, selecting one at a time. Can track multiple moving objects with pixel-level precision.
Conclusion: Resonator networks with generative modeling effectively capture both invariant and equivariant structure in visual scenes, enabling robust object recognition and precise position tracking without extensive training data or data augmentation.
Abstract: In visual scene understanding tasks, it is essential to capture both invariant and equivariant structure. While neural networks are frequently trained to achieve invariance to transformations such as translation, this often comes at the cost of losing access to equivariant information - e.g., the precise location of an object. Moreover, invariance is not naturally guaranteed through supervised learning alone, and many architectures generalize poorly to input transformations not encountered during training. Here, we take an approach based on analysis-by-synthesis and factoring using resonator networks. A generative model describes the construction of simple scenes containing MNIST digits and their transformations, like color and position. The resonator network inverts the generative model, and provides both invariant and equivariant information about particular objects. Sparse features learned from training data act as a basis set to provide flexibility in representing variable shapes of objects, allowing the resonator network to handle previously unseen digit shapes from the test set. The modular structure provides a shape module which contains information about the object shape with translation factored out, allowing a simple classifier to operate on centered digits. The classification layer is trained solely on centered data, requiring much less training data, and the network as a whole can identify objects with arbitrary translations without data augmentation. The natural attention-like mechanism of the resonator network also allows for analysis of scenes with multiple objects, where the network dynamics selects and centers only one object at a time. Further, the specific position information of a particular object can be extracted from the translation module, and we show that the resonator can be designed to track multiple moving objects with precision of a few pixels.
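The core factoring step, inverting a composite scene vector into its constituent codebook entries, can be sketched as a toy two-factor resonator network over bipolar vectors. The dimensions, codebook sizes, and Hadamard binding below are illustrative choices in the vector-symbolic-architecture style, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 1024, 10                         # vector dimension, codebook size per factor

# Random bipolar codebooks for two factors (e.g. shape and position).
X = rng.choice([-1, 1], size=(K, D)).astype(float)
Y = rng.choice([-1, 1], size=(K, D)).astype(float)

s = X[3] * Y[7]                         # scene vector: elementwise (Hadamard) binding

# Initialize each estimate as the superposition of its whole codebook.
x_hat = np.sign(X.sum(axis=0) + 1e-9)
y_hat = np.sign(Y.sum(axis=0) + 1e-9)

# Resonator iteration: unbind with the current estimate of the other factor,
# then clean up by projecting onto the codebook and re-binarizing.
for _ in range(30):
    x_hat = np.sign((X @ (s * y_hat)) @ X + 1e-9)
    y_hat = np.sign((Y @ (s * x_hat)) @ Y + 1e-9)

print(int(np.argmax(X @ x_hat)), int(np.argmax(Y @ y_hat)))  # → 3 7
```

The dynamics settle on one consistent factorization at a time, which is the attention-like behavior the abstract describes for multi-object scenes.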
[103] CanViT: Toward Active-Vision Foundation Models
Yohaï-Eliel Berreby, Sabrina Du, Audrey Durand, B. Suresh Krishna
Main category: cs.CV
TL;DR: CanViT is the first Active-Vision Foundation Model using a Vision Transformer with scene-relative RoPE and a canvas workspace, pretrained on 13.2M scenes with policy-agnostic passive-to-active distillation, achieving state-of-the-art active vision performance on segmentation and classification tasks.
Details
Motivation: Active computer vision lacks scalable general-purpose architectures and pretraining pipelines, leaving Active-Vision Foundation Models unexplored despite their promise for efficient, biologically plausible perception through sequential glimpses.
Method: CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone with a spatiotopic canvas workspace. It employs Canvas Attention (asymmetric cross-attention) to decouple thinking from memory. Pretrained via policy-agnostic passive-to-active dense latent distillation, reconstructing scene-wide DINOv3 embeddings from random glimpse sequences.
Result: CanViT-B achieves 38.5% mIoU on ADE20K segmentation with single glimpse (vs 27.6% previous best), 45.9% with additional glimpses, and 81.2% top-1 accuracy on ImageNet-1k classification. It uses 19.5x fewer FLOPs than previous models and generalizes to longer rollouts and larger scenes.
Conclusion: CanViT demonstrates the potential of Active-Vision Foundation Models as a new research axis, closing the gap between passive and active vision on semantic segmentation and showing scalability to large-scale pretraining.
Abstract: Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve low-latency sequential inference and scalability to large scenes. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes – an order of magnitude more than previous active models – and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model’s 27.6% with 19.5x fewer inference FLOPs and no fine-tuning, as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B reaches 81.2% top-1 accuracy with frozen teacher probes. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work closes the wide gap between passive and active vision on semantic segmentation and demonstrates the potential of AVFMs as a new research axis.
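Canvas Attention, as described, is an asymmetric cross-attention in which backbone (glimpse) tokens query the canvas while the canvas itself carries no self-attention or fully-connected layers. A minimal single-head NumPy sketch, with hypothetical shapes and names:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def canvas_attention(backbone_tokens, canvas, Wq, Wk, Wv):
    """Asymmetric cross-attention: backbone tokens (queries) read from the
    canvas (keys/values); the canvas side has no self-attention or MLP."""
    q = backbone_tokens @ Wq           # (n, d)
    k = canvas @ Wk                    # (m, d)
    v = canvas @ Wv                    # (m, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v                    # (n, d): canvas readout per backbone token

rng = np.random.default_rng(0)
d = 16
tokens = rng.standard_normal((4, d))     # glimpse tokens from the retinotopic backbone
canvas = rng.standard_normal((64, d))    # scene-wide spatiotopic workspace
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
out = canvas_attention(tokens, canvas, Wq, Wk, Wv)
print(out.shape)  # → (4, 16)
```

Because per-step cost scales with the small backbone token count rather than the canvas size, this asymmetry is what makes large-scene working memory cheap at inference time.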
[104] FullCircle: Effortless 3D Reconstruction from Casual 360° Captures
Yalda Foroutan, Ipek Oztas, Daniel Rebain, Aysegul Dundar, Kwang Moo Yi, Lily Goli, Andrea Tagliasacchi
Main category: cs.CV
TL;DR: A practical pipeline for 3D scene reconstruction from raw 360° camera captures without special protocols or pre-processing, robust to human operators in imagery, with new benchmark dataset.
Details
Motivation: Radiance fields for 3D reconstruction face challenges with casual capture due to the narrow FOV of perspective cameras. 360° cameras offer broader coverage, but existing methods require special protocols that undermine effortless workflows. A practical pipeline is needed that works directly from raw 360° captures.
Method: Proposes a pipeline for reconstructing 3D scenes directly from raw 360° camera captures without special protocols or pre-processing. Uses dual-fisheye images and exhibits robustness to human operators visible in 360° imagery. Introduces a multi-tiered dataset of scenes captured as raw dual-fisheye images as a benchmark.
Result: Method significantly outperforms vanilla 3DGS for 360° cameras and robust perspective baselines when perspective cameras are simulated from same capture. Demonstrates advantages of 360° capture for casual reconstruction.
Conclusion: Proposes a practical solution for effortless 360° scene reconstruction, establishing a benchmark dataset and showing superiority over existing methods in casual capture scenarios.
Abstract: Radiance fields have emerged as powerful tools for 3D scene reconstruction. However, casual capture remains challenging due to the narrow field of view of perspective cameras, which limits viewpoint coverage and feature correspondences necessary for reliable camera calibration and reconstruction. While commercially available 360° cameras offer significantly broader coverage than perspective cameras for the same capture effort, existing 360° reconstruction methods require special capture protocols and pre-processing steps that undermine the promise of radiance fields: effortless workflows to capture and reconstruct 3D scenes. We propose a practical pipeline for reconstructing 3D scenes directly from raw 360° camera captures. We require no special capture protocols or pre-processing, and exhibit robustness to a prevalent source of reconstruction errors: the human operator that is visible in all 360° imagery. To facilitate evaluation, we introduce a multi-tiered dataset of scenes captured as raw dual-fisheye images, establishing a benchmark for robust casual 360° reconstruction. Our method significantly outperforms not only vanilla 3DGS for 360° cameras but also robust perspective baselines when perspective cameras are simulated from the same capture, demonstrating the advantages of 360° capture for casual reconstruction. Additional results are available at: https://theialab.github.io/fullcircle
[105] A vision-language model and platform for temporally mapping surgery from video
Dani Kiyasseh
Main category: cs.CV
TL;DR: Halsted is a vision-language model trained on a comprehensive surgical video atlas that maps surgical activities from videos, offering automated procedure analysis accessible to surgeons via a web platform.
Details
Motivation: Current AI models for surgical video analysis are narrow in scope, capture limited behavioral components within single procedures, and have limited translational value as they remain inaccessible to practicing surgeons. There's a need for comprehensive surgical mapping that can bridge the gap between AI research and clinical deployment.
Method: Developed Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA) - a comprehensive annotated video library with over 650,000 videos across eight surgical specialties. Used an iterative self-labelling framework for annotation. Created HSA-27k as a public benchmark subset. Developed a web platform for surgeons to automatically map their own procedures.
Result: Halsted surpasses previous state-of-the-art models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. The web platform provides surgeons worldwide with previously-unavailable capability to automatically map their own procedures within minutes.
Conclusion: By standardizing unstructured surgical video data and making AI capabilities directly accessible to surgeons, this work brings surgical AI closer to clinical deployment and helps pave the way toward autonomous robotic surgery.
Abstract: Mapping surgery is fundamental to developing operative guidelines and enabling autonomous robotic surgery. Recent advances in artificial intelligence (AI) have shown promise in mapping the behaviour of surgeons from videos, yet current models remain narrow in scope, capturing limited behavioural components within single procedures, and offer limited translational value, as they remain inaccessible to practising surgeons. Here we introduce Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA), one of the most comprehensive annotated video libraries grown through an iterative self-labelling framework and encompassing over 650,000 videos across eight surgical specialties. To facilitate benchmarking, we publicly release HSA-27k, a subset of the Halsted Surgical Atlas. Halsted surpasses previous state-of-the-art models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. To bridge the longstanding translational gap of surgical AI, we develop the Halsted web platform (https://halstedhealth.ai/) to provide surgeons anywhere in the world with the previously-unavailable capability of automatically mapping their own procedures within minutes. By standardizing unstructured surgical video data and making these capabilities directly accessible to surgeons, our work brings surgical AI closer to clinical deployment and helps pave the way toward autonomous robotic surgery.
[106] Language Models Can Explain Visual Features via Steering
Javier Ferrando, Enrique Lopez-Cuena, Pablo Agustin Martin-Torres, Daniel Hinjos, Anna Arias-Duart, Dario Garcia-Gasulla
Main category: cs.CV
TL;DR: Steering SAE features in vision-language models via causal interventions to generate automated explanations of visual concepts.
Details
Motivation: Current methods for explaining sparse autoencoder features in vision models rely on human intervention or correlation-based approaches using top activating examples, which are limited. There's a need for automated, scalable interpretability methods that can explain visual features without human input.
Method: Proposes Steering, a causal-intervention method that steers individual SAE features in the vision encoder after providing an empty image, then prompts the language model to explain what it “sees”. Also introduces Steering-informed Top-k, a hybrid approach combining causal interventions with input-based methods.
Result: Steering offers a scalable alternative to traditional input-based approaches, with explanation quality improving consistently with language-model scale. The hybrid approach achieves state-of-the-art explanation quality without additional computational cost.
Conclusion: Causal intervention-based steering provides a new axis for automated interpretability in vision models, complementing traditional approaches. The method shows promise for future research, especially as language models scale.
Abstract: Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations based on top activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. Then, we prompt the language model to explain what it “sees”, effectively eliciting the visual concept represented by each feature. Results show that Steering offers a scalable alternative that complements traditional approaches based on input examples, serving as a new axis for automated interpretability in vision models. Moreover, the quality of explanations improves consistently with the scale of the language model, highlighting our method as a promising direction for future research. Finally, we propose Steering-informed Top-k, a hybrid approach that combines the strengths of causal interventions and input-based approaches to achieve state-of-the-art explanation quality without additional computational cost.
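The steering intervention can be approximated as adding a scaled SAE decoder direction to the vision-encoder activations produced by an empty image. The sketch below is a hypothetical NumPy rendering of that idea (the real method operates inside a VLM's vision encoder, and the names, scale `alpha`, and shapes are assumptions):

```python
import numpy as np

def steer_feature(activations, decoder, feature_idx, alpha=10.0):
    """Add a scaled, unit-norm SAE decoder direction for one feature to every
    token's activation, mimicking 'showing' the model that visual concept."""
    direction = decoder[feature_idx]
    return activations + alpha * direction / (np.linalg.norm(direction) + 1e-8)

rng = np.random.default_rng(0)
d, n_feats, n_tokens = 64, 512, 16
decoder = rng.standard_normal((n_feats, d))    # stand-in SAE decoder (one row per feature)
empty_img_acts = np.zeros((n_tokens, d))       # activations for an 'empty' image
steered = steer_feature(empty_img_acts, decoder, feature_idx=42)
print(steered.shape)                           # → (16, 64)
```

The language model is then prompted over the steered activations; whatever concept it reports is taken as the explanation of that feature.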
[107] TrajLoom: Dense Future Trajectory Generation from Video
Zewei Zhang, Jia Jun Cheng Xian, Kaiwen Liu, Ming Liang, Hang Chu, Jun Chen, Renjie Liao
Main category: cs.CV
TL;DR: A framework for predicting future dense point trajectories from video using grid-anchor offset encoding, a VAE for compact trajectory representation, and flow matching for future trajectory generation, with extended prediction horizons and improved motion realism.
Details
Motivation: Predicting future motion is crucial for video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging.
Method: Three components: (1) Grid-Anchor Offset Encoding reduces location-dependent bias by representing points as offsets from pixel-center anchors; (2) TrajLoom-VAE learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and spatiotemporal consistency regularization; (3) TrajLoom-Flow generates future trajectories in latent space via flow matching with boundary cues and on-policy K-step fine-tuning for stable sampling.
Result: Extends prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. Introduces TrajLoomBench benchmark spanning real and synthetic videos with standardized setup aligned with video-generation benchmarks.
Conclusion: The framework successfully predicts future dense point trajectories with extended horizons and improved quality, supporting downstream video generation and editing applications.
Abstract: Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at https://trajloom.github.io/.
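Component (1), Grid-Anchor Offset Encoding, amounts to expressing each tracked point as a bounded offset from the centre of its grid cell, so the values the model predicts look the same regardless of where in the frame the point sits. A minimal sketch (the stride value and function names are assumptions):

```python
import numpy as np

def encode_offsets(points, stride=8.0):
    """Represent each 2-D point as an offset from its grid-cell centre anchor."""
    anchors = (np.floor(points / stride) + 0.5) * stride
    return anchors, points - anchors

def decode_offsets(anchors, offsets):
    return anchors + offsets

pts = np.array([[13.0, 101.5], [250.25, 3.75]])
anchors, off = encode_offsets(pts)
print(anchors.tolist())                      # → [[12.0, 100.0], [252.0, 4.0]]
assert np.all(np.abs(off) <= 4.0)            # offsets are bounded by half a cell
assert np.allclose(decode_offsets(anchors, off), pts)
```

Keeping the regression targets in a small, location-independent range is what reduces the location-dependent bias the summary mentions.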
[108] Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off
Fulvio Sanguigni, Davide Lobba, Bin Ren, Marcella Cornia, Nicu Sebe, Rita Cucchiara
Main category: cs.CV
TL;DR: Dress-ED: A large-scale benchmark dataset for instruction-driven fashion editing that unifies virtual try-on, try-off, and text-guided garment editing with natural language instructions.
Details
Motivation: Existing virtual try-on/try-off datasets are static and lack instruction-driven editing capabilities for controllable and interactive fashion generation.
Method: Created Dress-ED dataset using automated multimodal pipeline with MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification. Also proposed unified multimodal diffusion framework for joint reasoning over linguistic instructions and visual garment cues.
Result: Dataset contains over 146k verified quadruplets spanning 3 garment categories and 7 edit types (appearance and structural modifications). Provides strong baseline for instruction-driven VTON/VTOFF.
Conclusion: Dress-ED enables instruction-driven fashion editing and serves as a comprehensive benchmark for controllable fashion generation research.
Abstract: Recent advances in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) have greatly improved photo-realistic fashion synthesis and garment reconstruction. However, existing datasets remain static, lacking instruction-driven editing for controllable and interactive fashion generation. In this work, we introduce the Dress Editing Dataset (Dress-ED), the first large-scale benchmark that unifies VTON, VTOFF, and text-guided garment editing within a single framework. Each sample in Dress-ED includes an in-shop garment image, the corresponding person image wearing the garment, their edited counterparts, and a natural-language instruction of the desired modification. Built through a fully automated multimodal pipeline that integrates MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, Dress-ED comprises over 146k verified quadruplets spanning three garment categories and seven edit types, including both appearance (e.g., color, pattern, material) and structural (e.g., sleeve length, neckline) modifications. Based on this benchmark, we further propose a unified multimodal diffusion framework that jointly reasons over linguistic instructions and visual garment cues, serving as a strong baseline for instruction-driven VTON and VTOFF. Dataset and code will be made publicly available.
[109] A Vision Language Model for Generating Procedural Plant Architecture Representations from Simulated Images
Heesup Yun, Isaac Kazuo Uyehara, Ioannis Droutsas, Earl Ranario, Christine H. Diepenbrock, Brian N. Bailey, J. Mason Earles
Main category: cs.CV
TL;DR: A method to generate 3D plant architecture models from single images using vision-language models, converting procedural plant definitions into token sequences for prediction.
Details
Motivation: Measuring 3D plant architectural parameters at field scale is labor-intensive; need automated methods to extract organ-level geometric and topological parameters from images.
Method: Uses synthetic cowpea plant images from Helios 3D simulator, develops plant architecture tokenizer to convert XML definitions into token sequences, trains vision-language model to predict these sequences from images.
Result: Achieved token F1 score of 0.73 in teacher-forced training, BLEU-4 score of 94.00% and ROUGE-L score of 0.5182 in autoregressive generation, demonstrating feasibility of extracting architectural parameters from synthetic images.
Conclusion: Plant architecture model generation and parameter extraction from images is possible; future work will extend to real imagery data.
Abstract: Three-dimensional (3D) procedural plant architecture models have emerged as an important tool for simulation-based studies of plant structure and function, extracting plant architectural parameters from field measurements, and for generating realistic plants in computer graphics. However, measuring the architectural parameters and nested structures for these models at the field scales remains prohibitively labor-intensive. We present a novel algorithm that generates a 3D plant architecture from an image, creating a functional structural plant model that reflects organ-level geometric and topological parameters and provides a more comprehensive representation of the plant’s architecture. Instead of using 3D sensors or processing multi-view images with computer vision to obtain the 3D structure of plants, we proposed a method that generates token sequences that encode a procedural definition of plant architecture. This work used only synthetic images for training and testing, with exact architectural parameters known, allowing testing of the hypothesis that organ-level architectural parameters could be extracted from image data using a vision-language model (VLM). A synthetic dataset of cowpea plant images was generated using the Helios 3D plant simulator, with the detailed plant architecture encoded in XML files. We developed a plant architecture tokenizer for the XML file defining plant architecture, converting it into a token sequence that a language model can predict. The model achieved a token F1 score of 0.73 during teacher-forced training. Evaluation of the model was performed through autoregressive generation, achieving a BLEU-4 score of 94.00% and a ROUGE-L score of 0.5182. This led to the conclusion that such plant architecture model generation and parameter extraction were possible from synthetic images; thus, future work will extend the approach to real imagery data.
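The plant-architecture tokenizer can be pictured as linearising the XML definition into structure tokens plus quantised attribute tokens that a language model can predict. The toy format below is hypothetical; the paper's actual Helios XML schema, vocabulary, and binning are not reproduced:

```python
import xml.etree.ElementTree as ET

def tokenize_architecture(xml_text, nbins=32, vmax=100.0):
    """Linearise an XML plant definition into a flat token sequence:
    open/close tags become structure tokens, and numeric attributes are
    quantised into nbins uniform bins over [0, vmax)."""
    def walk(node):
        tokens = [f"<{node.tag}>"]
        for key, value in node.attrib.items():
            try:
                b = min(nbins - 1, int(float(value) / vmax * nbins))
                tokens += [key, f"bin_{b}"]
            except ValueError:                 # non-numeric attribute: keep verbatim
                tokens += [key, value]
        for child in node:
            tokens += walk(child)
        tokens.append(f"</{node.tag}>")
        return tokens
    return walk(ET.fromstring(xml_text))

plant_xml = '<plant><stem length="40.0"><leaf angle="55.0"/></stem></plant>'
print(tokenize_architecture(plant_xml))
# → ['<plant>', '<stem>', 'length', 'bin_12', '<leaf>', 'angle', 'bin_17',
#    '</leaf>', '</stem>', '</plant>']
```

Because the mapping is invertible up to bin resolution, a predicted token sequence can be decoded back into a procedural plant definition for simulation.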
[110] To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models
OFM Riaz Rahman Aranya, Kevin Desai
Main category: cs.CV
TL;DR: Medical VLMs show a grounding-sycophancy tradeoff: models with lowest hallucination are most sycophantic, and vice versa, with none achieving good performance on both safety metrics.
Details
Motivation: Medical VLMs show strong performance on VQA benchmarks but their robustness against critical failure modes like hallucination and sycophancy remains poorly understood, especially in combination, which is crucial for clinical safety.
Method: Evaluated six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets. Proposed three metrics: L-VASE (logit-space reformulation of VASE), CCS (confidence-calibrated sycophancy score), and Clinical Safety Index (CSI) combining grounding, autonomy, and calibration via geometric mean.
Result: Found grounding-sycophancy tradeoff: models with lowest hallucination propensity are most sycophantic, while most pressure-resistant model hallucinates more than all medical-specialist models. Across 1,151 test cases, no model achieved CSI above 0.35, indicating none of the 7-8B parameter VLMs is simultaneously well-grounded and robust to social pressure.
Conclusion: Joint evaluation of both hallucination and sycophancy properties is necessary before medical VLMs can be considered for clinical use, as current models fail to achieve adequate safety on both dimensions simultaneously.
Abstract: Vision-language models (VLMs) adapted to the medical domain have shown strong performance on visual question answering benchmarks, yet their robustness against two critical failure modes, hallucination and sycophancy, remains poorly understood, particularly in combination. We evaluate six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets and uncover a grounding-sycophancy tradeoff: models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more than all medical-specialist models. To characterize this tradeoff, we propose three metrics: L-VASE, a logit-space reformulation of VASE that avoids its double-normalization; CCS, a confidence-calibrated sycophancy score that penalizes high-confidence capitulation; and Clinical Safety Index (CSI), a unified safety index that combines grounding, autonomy, and calibration via a geometric mean. Across 1,151 test cases, no model achieves a CSI above 0.35, indicating that none of the evaluated 7-8B parameter VLMs is simultaneously well-grounded and robust to social pressure. Our findings suggest that joint evaluation of both properties is necessary before these models can be considered for clinical use. Code is available at https://github.com/UTSA-VIRLab/AgreeOrRight
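The geometric-mean construction of the Clinical Safety Index means weakness on any one axis drags the whole score down, which is why a well-grounded but sycophantic model still scores poorly. A minimal sketch (the component scores below are made-up inputs, and the paper's L-VASE and CCS component definitions are not reimplemented):

```python
import numpy as np

def clinical_safety_index(grounding, autonomy, calibration):
    """CSI sketch: geometric mean of three component scores, each assumed in [0, 1]."""
    return float(np.cbrt(grounding * autonomy * calibration))

# A model strong on grounding but highly sycophantic is pulled down sharply:
print(round(clinical_safety_index(0.9, 0.1, 0.8), 3))   # → 0.416
# A balanced model with the same arithmetic mean scores much higher:
print(round(clinical_safety_index(0.6, 0.6, 0.6), 3))   # → 0.6
```

Under this aggregation, the reported ceiling of 0.35 implies every evaluated model is weak on at least one of the three axes.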
[111] Toward Faithful Segmentation Attribution via Benchmarking and Dual-Evidence Fusion
Abu Noman Md Sakib, OFM Riaz Rahman Aranya, Kevin Desai, Zijie Zhang
Main category: cs.CV
TL;DR: DEA benchmark evaluates semantic segmentation attribution methods using intervention-based faithfulness metrics, revealing hidden tradeoffs between faithfulness and stability that visual evaluation misses.
Details
Motivation: Current semantic segmentation attribution methods are primarily evaluated by visual plausibility, which doesn't guarantee that highlighted pixels actually drive model predictions or that attribution stays within target regions. There's a need for dedicated evaluation protocols to measure intervention-based faithfulness.
Method: Introduces a reproducible benchmark testing intervention-based faithfulness, off-target leakage, perturbation robustness, and runtime on Pascal VOC and SBD across three pretrained backbones. Proposes Dual-Evidence Attribution (DEA), a lightweight correction that fuses gradient evidence with region-level intervention signals through agreement-weighted fusion.
Result: DEA consistently improves deletion-based faithfulness over gradient-only baselines and preserves strong robustness across all completed runs. The benchmark exposes a faithfulness-stability tradeoff among attribution families that is entirely hidden under visual evaluation.
Conclusion: The benchmark provides a foundation for principled method selection in segmentation explainability, revealing that visual evaluation alone is insufficient. DEA demonstrates how fusing multiple evidence sources can improve attribution quality at the cost of additional compute.
Abstract: Attribution maps for semantic segmentation are almost always judged by visual plausibility. Yet looking convincing does not guarantee that the highlighted pixels actually drive the model’s prediction, nor that attribution credit stays within the target region. These questions require a dedicated evaluation protocol. We introduce a reproducible benchmark that tests intervention-based faithfulness, off-target leakage, perturbation robustness, and runtime on Pascal VOC and SBD across three pretrained backbones. To further demonstrate the benchmark, we propose Dual-Evidence Attribution (DEA), a lightweight correction that fuses gradient evidence with region-level intervention signals through agreement-weighted fusion. DEA increases emphasis where both sources agree and retains causal support when gradient responses are unstable. Across all completed runs, DEA consistently improves deletion-based faithfulness over gradient-only baselines and preserves strong robustness, at the cost of additional compute from intervention passes. The benchmark exposes a faithfulness-stability tradeoff among attribution families that is entirely hidden under visual evaluation, providing a foundation for principled method selection in segmentation explainability. Code is available at https://github.com/anmspro/DEA.
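The agreement-weighted fusion idea (increase emphasis where gradient and intervention evidence agree, while keeping intervention support where gradients are unstable) can be sketched as follows. The exact fusion rule below is illustrative only, not the paper's formula:

```python
import numpy as np

def dual_evidence_fusion(grad_map, interv_map, eps=1e-8):
    """Illustrative agreement-weighted fusion: normalise both evidence maps,
    then up-weight pixels where gradient and intervention evidence agree in sign."""
    g = grad_map / (np.abs(grad_map).max() + eps)
    i = interv_map / (np.abs(interv_map).max() + eps)
    agreement = np.clip(g * i, 0.0, None)      # positive only where the signs agree
    return 0.5 * (g + i) * (1.0 + agreement)   # boost jointly-supported regions

rng = np.random.default_rng(0)
g = rng.standard_normal((8, 8))                # stand-in gradient attribution map
i = g + 0.1 * rng.standard_normal((8, 8))      # mostly-agreeing intervention signal
fused = dual_evidence_fusion(g, i)
print(fused.shape)                             # → (8, 8)
```

The extra compute the summary mentions comes from the intervention passes needed to produce the second evidence map, not from the fusion itself.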
[112] PIVM: Diffusion-Based Prior-Integrated Variation Modeling for Anatomically Precise Abdominal CT Synthesis
Dinglun He, Baoming Zhang, Xu Wang, Yao Hao, Deshan Yang, Ye Duan
Main category: cs.CV
TL;DR: PIVM is a diffusion-based method for generating anatomically accurate abdominal CT images by predicting intensity variations relative to organ-specific priors derived from segmentation labels.
Details
Motivation: Abdominal CT data faces challenges with high annotation costs and privacy constraints, limiting the development of robust segmentation and diagnostic models. There's a need for synthetic data generation methods that can produce anatomically accurate CT images while preserving realistic organ boundaries and textures.
Method: PIVM uses a diffusion-based framework that predicts voxel-wise intensity variations relative to organ-specific intensity priors derived from segmentation labels. Instead of generating full images from noise, it uses priors and labels to jointly guide the diffusion process, ensuring spatial alignment and realistic organ boundaries. The method operates directly in image space (not latent space) while preserving the full Hounsfield Unit range.
Result: The method generates anatomically accurate CT images with realistic organ boundaries and fine anatomical textures without smoothing, capturing the full HU range. The source code is publicly available.
Conclusion: PIVM provides an effective framework for generating high-quality synthetic abdominal CT images that can help overcome data limitations in medical imaging, potentially enabling better segmentation and diagnostic model development despite annotation costs and privacy constraints.
Abstract: Abdominal CT data are limited by high annotation costs and privacy constraints, which hinder the development of robust segmentation and diagnostic models. We present a Prior-Integrated Variation Modeling (PIVM) framework, a diffusion-based method for anatomically accurate CT image synthesis. Instead of generating full images from noise, PIVM predicts voxel-wise intensity variations relative to organ-specific intensity priors derived from segmentation labels. These priors and labels jointly guide the diffusion process, ensuring spatial alignment and realistic organ boundaries. Unlike latent-space diffusion models, our approach operates directly in image space while preserving the full Hounsfield Unit (HU) range, capturing fine anatomical textures without smoothing. Source code is available at https://github.com/BZNR3/PIVM.
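The prior-plus-variation decomposition can be sketched in a few lines: build a piecewise-constant HU prior from the segmentation labels, and let the generative model supply only the residual. The helper names and the illustrative HU values below are assumptions, not taken from the paper.

```python
import numpy as np

def build_organ_prior(labels, organ_mean_hu):
    """Map each segmentation label to a canonical HU value, giving a
    piecewise-constant intensity prior over the whole volume."""
    prior = np.zeros(labels.shape, dtype=np.float32)
    for organ_id, mean_hu in organ_mean_hu.items():
        prior[labels == organ_id] = mean_hu
    return prior

def compose_ct(prior, predicted_variation):
    """PIVM-style composition: the diffusion model predicts only the
    voxel-wise variation around the organ prior, so the synthesized
    image is prior + variation, keeping the full HU range and never
    leaving image space."""
    return prior + predicted_variation
```

Because the network never has to regenerate the coarse anatomy (the prior already encodes it), its capacity goes into boundary detail and texture, which is what the abstract emphasizes.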
[113] CAM3R: Camera-Agnostic Model for 3D Reconstruction
Namitha Guruprasad, Abhay Yadav, Cheng Peng, Rama Chellappa
Main category: cs.CV
TL;DR: CAM3R is a camera-agnostic 3D reconstruction model that handles wide-angle imagery without camera calibration, using ray direction estimation and cross-view modules for pose and geometry recovery.
Details
Motivation: Current 3D reconstruction models trained on perspective datasets fail with wide-angle imagery from fisheye or panoramic sensors due to their implicit pinhole camera assumptions, creating a need for camera-agnostic approaches.
Method: Two-view network with Ray Module for per-pixel ray direction estimation and Cross-view Module for radial distance inference with confidence maps, pointmaps, and relative poses, plus Ray-Aware Global Alignment for consistent 3D scene unification.
Result: CAM3R establishes new state-of-the-art in pose estimation and reconstruction across various camera models including panorama, fisheye, and pinhole imagery.
Conclusion: The camera-agnostic approach successfully handles diverse camera geometries without prior calibration, advancing 3D reconstruction capabilities for wide-angle imagery.
Abstract: Recovering dense 3D geometry from unposed images remains a foundational challenge in computer vision. Current state-of-the-art models are predominantly trained on perspective datasets, which implicitly constrains them to a standard pinhole camera geometry. As a result, these models suffer from significant geometric degradation when applied to wide-angle imagery captured via non-rectilinear optics, such as fisheye or panoramic sensors. To address this, we present CAM3R, a Camera-Agnostic, feed-forward Model for 3D Reconstruction capable of processing images from wide-angle camera models without prior calibration. Our framework consists of a two-view network which is bifurcated into a Ray Module (RM) to estimate per-pixel ray directions and a Cross-view Module (CVM) to infer radial distance with confidence maps, pointmaps, and relative poses. To unify these pairwise predictions into a consistent 3D scene, we introduce a Ray-Aware Global Alignment framework for pose refinement and scale optimization while strictly preserving the predicted local geometry. Extensive experiments on various camera model datasets, including panorama, fisheye and pinhole imagery, demonstrate that CAM3R establishes a new state-of-the-art in pose estimation and reconstruction.
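The ray-direction plus radial-distance factorization is what makes the model camera-agnostic: any optics, pinhole or not, reduces to a per-pixel unit ray, and a 3D point is just that ray scaled by distance. A shape-level sketch of this standard geometry (function name is ours, not the paper's):

```python
import numpy as np

def rays_and_distance_to_pointmap(ray_dirs, radial_dist):
    """Recover a camera-centered pointmap from per-pixel ray
    directions (H, W, 3) and radial distances (H, W). Only the ray
    field changes between pinhole, fisheye, and panoramic cameras;
    the back-projection is identical for all of them."""
    # Normalize defensively in case predicted rays drift off unit norm.
    norms = np.linalg.norm(ray_dirs, axis=-1, keepdims=True)
    unit_rays = ray_dirs / np.clip(norms, 1e-8, None)
    return unit_rays * radial_dist[..., None]
```

This is why the two modules are decoupled: the Ray Module absorbs all lens-specific geometry, while the Cross-view Module only ever reasons about distances along rays.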
[114] Q-Tacit: Image Quality Assessment via Latent Visual Reasoning
Yuxuan Jiang, Yixuan Li, Hanwei Zhu, Siyue Teng, Fan Zhang, David Bull
Main category: cs.CV
TL;DR: Q-Tacit proposes a new paradigm for Vision-Language Model-based image quality assessment that moves beyond natural language reasoning to perform quality reasoning in latent space, achieving strong performance with fewer tokens.
Details
Motivation: Current VLM-based IQA methods rely heavily on language-centric reasoning, but visual quality cues cannot be fully abstracted into text due to the gap between discrete textual tokens and quality perception space, limiting effectiveness for visually intensive IQA tasks.
Method: Two-stage approach: (1) injecting structural visual quality priors into latent space, and (2) calibrating latent reasoning trajectories to improve quality assessment ability, enabling VLMs to reason beyond natural language in latent quality space.
Result: Q-Tacit effectively performs quality reasoning with significantly fewer tokens than previous reasoning-based methods while achieving strong overall performance on IQA tasks.
Conclusion: Language is not the only compact representation suitable for visual quality assessment; latent reasoning paradigms offer promising alternatives for IQA, opening possibilities for further exploration of effective reasoning beyond natural language.
Abstract: Vision-Language Model (VLM)-based image quality assessment (IQA) has been significantly advanced by incorporating Chain-of-Thought (CoT) reasoning. Recent work has refined image quality reasoning by applying reinforcement learning (RL) and leveraging active visual tools. However, such strategies are typically language-centric, with visual information being treated as static preconditions. Quality-related visual cues often cannot be abstracted into text in extenso due to the gap between discrete textual tokens and quality perception space, which in turn restricts the reasoning effectiveness for visually intensive IQA tasks. In this paper, we revisit this by asking the question, “Is natural language the ideal space for quality reasoning?” and, as a consequence, we propose Q-Tacit, a new paradigm that elicits VLMs to reason beyond natural language in the latent quality space. Our approach follows a synergistic two-stage process: (i) injecting structural visual quality priors into the latent space, and (ii) calibrating latent reasoning trajectories to improve quality assessment ability. Extensive experiments demonstrate that Q-Tacit can effectively perform quality reasoning with significantly fewer tokens than previous reasoning-based methods, while achieving strong overall performance. This paper validates the proposition that language is not the only compact representation suitable for visual quality, opening possibilities for further exploration of effective latent reasoning paradigms for IQA. Source code will be released to support future research.
[115] Pretext Matters: An Empirical Study of SSL Methods in Medical Imaging
Vedrana Ivezić, Mara Pleasure, Ashwath Radhachandran, Saarang Panchavati, Shreeram Athreya, Vivek Sant, Benjamin Emert, Gregory Fishbein, Corey Arnold, William Speier
Main category: cs.CV
TL;DR: JEAs (joint embedding architectures) work best for spatially localized signals like histopathology, while JEPAs (joint embedding predictive architectures) are optimal for globally structured signals like liver ultrasounds in medical imaging SSL.
Details
Motivation: While self-supervised learning (SSL) shows promise in medical imaging, different SSL strategies (JEAs vs JEPAs) yield vastly different performance outcomes. There's a need to systematically investigate which SSL objective aligns better with the spatial organization of clinically relevant signals in different medical imaging modalities.
Method: Empirical investigation of SSL method impact on learned representations in medical imaging. Selected two representative modalities with unique noise profiles: ultrasound and histopathology. Compared JEAs (view-invariance objective) against JEPAs, with independent validation by board-certified radiologists and pathologists.
Result: JEAs are more effective when informative signal is spatially localized (histopathology), while JEPAs are optimal when diagnostically relevant information is globally structured (liver ultrasounds). Differences are evident in clinical relevance of learned features, validated by medical experts.
Conclusion: Provides a framework for matching SSL objectives to structural and noise properties of medical imaging modalities. Choice between JEAs and JEPAs should be guided by spatial organization of clinically relevant signals in the target domain.
Abstract: Though self-supervised learning (SSL) has demonstrated incredible ability to learn robust representations from unlabeled data, the choice of optimal SSL strategy can lead to vastly different performance outcomes in specialized domains. Joint embedding architectures (JEAs) and joint embedding predictive architectures (JEPAs) have shown robustness to noise and strong semantic feature learning compared to pixel reconstruction-based SSL methods, leading to widespread adoption in medical imaging. However, no prior work has systematically investigated which SSL objective is better aligned with the spatial organization of clinically relevant signal. In this work, we empirically investigate how the choice of SSL method impacts the learned representations in medical imaging. We select two representative imaging modalities characterized by unique noise profiles: ultrasound and histopathology. When informative signal is spatially localized, as in histopathology, JEAs are more effective due to their view-invariance objective. In contrast, when diagnostically relevant information is globally structured, such as the macroscopic anatomy present in liver ultrasounds, JEPAs are optimal. These differences are especially evident in the clinical relevance of the learned features, as independently validated by board-certified radiologists and pathologists. Together, our results provide a framework for matching SSL objectives to the structural and noise properties of medical imaging modalities.
[116] MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Shiyao Li, Antoine Guédon, Shizhe Chen, Vincent Lepetit
Main category: cs.CV
TL;DR: MAGICIAN is a long-term planning framework for active mapping that uses Imagined Gaussians representation and tree-search to maximize accumulated surface coverage gain, outperforming greedy next-best-view approaches.
Details
Motivation: Existing active mapping approaches rely on greedy next-best-view prediction, which leads to inefficient exploration and incomplete scene reconstruction. There's a need for long-term planning methods that can optimize accumulated coverage gain over multiple steps rather than just immediate rewards.
Method: Uses Imagined Gaussians - a scene representation derived from a pre-trained occupancy network with structural priors. This enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering. The representation is integrated into a tree-search algorithm for long-horizon planning, with closed-loop updates to Imagined Gaussians and trajectory refinement.
Result: Achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning over greedy approaches in active mapping.
Conclusion: Long-term planning using Imagined Gaussians representation enables more efficient and complete active mapping compared to greedy next-best-view methods, with significant performance improvements across diverse environments.
Abstract: Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction. To address this limitation, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering, allowing its integration into a tree-search algorithm for long-horizon planning. We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner. Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.
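The greedy-versus-accumulated-gain distinction can be made concrete with a toy depth-limited search, where the coverage each view provides stands in for rendering the Imagined Gaussians. This is a schematic sketch under our own abstractions, not the paper's planner.

```python
def plan_views(candidates, coverage, covered, horizon):
    """Depth-limited search maximizing *accumulated* coverage gain;
    horizon == 1 degenerates to greedy next-best-view.

    candidates: viewpoint ids; coverage: view -> set of surface cells
    it observes (stand-in for fast volumetric rendering of the scene
    representation); covered: cells already observed.
    Returns (best accumulated gain, best view sequence)."""
    if horizon == 0:
        return 0, []
    best_gain, best_seq = 0, []
    for v in candidates:
        gain = len(coverage[v] - covered)
        sub_gain, sub_seq = plan_views(
            candidates, coverage, covered | coverage[v], horizon - 1)
        if gain + sub_gain > best_gain:
            best_gain, best_seq = gain + sub_gain, [v] + sub_seq
    return best_gain, best_seq
```

With views a = {1,2,3,4}, b = {1,2,5,6}, c = {3,4,7,8}, a greedy planner commits to a and covers 6 cells in two steps; searching with horizon 2 finds the disjoint pair b, c covering 8, the kind of gap the abstract attributes to greedy next-best-view prediction.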
[117] Large-Scale Avalanche Mapping from SAR Images with Deep Learning-based Change Detection
Mattia Gatti, Alberto Mariani, Ignazio Gallo, Fabiano Monti
Main category: cs.CV
TL;DR: Systematic investigation of large-scale avalanche mapping using Sentinel-1 SAR imagery for bi-temporal change detection, achieving state-of-the-art performance with optimized precision-recall trade-offs.
Details
Motivation: Accurate change detection from satellite imagery is essential for monitoring rapid mass-movement hazards like snow avalanches, which increasingly threaten human life, infrastructure, and ecosystems due to rising frequency and intensity.
Method: Treats avalanche mapping as a unimodal change detection problem using only pre- and post-event SAR images. Proposes an end-to-end pipeline with systematic experiments across multiple alpine ecoregions using manually validated avalanche inventories.
Result: Achieves F1-score of 0.8061 in conservative (F1-optimized) configuration and F2-score of 0.8414 with 80.36% avalanche-polygon hit rate in recall-oriented (F2-optimized) tuning. Demonstrates trade-off between precision and completeness with threshold adjustment improving detection of smaller/marginal avalanches.
Conclusion: Unimodal SAR-based change detection achieves most consistent performance for avalanche mapping. The release of annotated multi-region dataset establishes reproducible benchmark for SAR-based avalanche mapping applications.
Abstract: Accurate change detection from satellite imagery is essential for monitoring rapid mass-movement hazards such as snow avalanches, which increasingly threaten human life, infrastructure, and ecosystems due to their rising frequency and intensity. This study presents a systematic investigation of large-scale avalanche mapping through bi-temporal change detection using Sentinel-1 synthetic aperture radar (SAR) imagery. Extensive experiments across multiple alpine ecoregions with manually validated avalanche inventories show that treating the task as a unimodal change detection problem, relying solely on pre- and post-event SAR images, achieves the most consistent performance. The proposed end-to-end pipeline achieves an F1-score of 0.8061 in a conservative (F1-optimized) configuration and attains an F2-score of 0.8414 with 80.36% avalanche-polygon hit rate under a less conservative, recall-oriented (F2-optimized) tuning. These results highlight the trade-off between precision and completeness and demonstrate how threshold adjustment can improve the detection of smaller or marginal avalanches. The release of the annotated multi-region dataset establishes a reproducible benchmark for SAR-based avalanche mapping.
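The F1-optimized versus F2-optimized operating points are instances of the standard F-beta family, where beta > 1 treats recall as beta times more important than precision. The illustrative precision/recall values below are ours, not the paper's.

```python
def f_beta(precision, recall, beta):
    """General F-beta score; beta = 1 is the harmonic mean (F1),
    beta = 2 treats recall as twice as important as precision (F2)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a recall-oriented point with P = 0.7, R = 0.9 scores higher under F2 than the mirrored P = 0.9, R = 0.7 point, even though both have the same F1, which is exactly the mechanism that lets the F2-optimized threshold pick up smaller or marginal avalanches at the cost of more false positives.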
[118] GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
Jiayin Sun, Caixia Sun, Boyu Yang, Hailin Li, Xiao Chen, Yi Zhang, Errui Ding, Liang Li, Chao Deng, Junlan Feng
Main category: cs.CV
TL;DR: GeoTikzBridge enhances MLLMs’ geometric perception and reasoning through tikz-based code generation, using two models trained on complementary datasets to achieve state-of-the-art performance.
Details
Motivation: Current MLLMs struggle with fine-grained geometric structure perception, limiting their geometric understanding and visual reasoning capabilities. There's a need to enhance local geometric perception in multimodal models.
Method: Proposes GeoTikzBridge framework with two models: GeoTikzBridge-Base trained on GeoTikz-Base dataset (2.5M image-to-tikz pairs via iterative data expansion and localized geometric transformation), and GeoTikzBridge-Instruct fine-tuned on GeoTikz-Instruct dataset (first instruction-augmented tikz dataset for visual reasoning).
Result: Models achieve state-of-the-art performance among open-sourced MLLMs and can serve as plug-and-play reasoning modules for any MLLM/LLM, enhancing reasoning performance in geometric problem-solving.
Conclusion: GeoTikzBridge successfully addresses MLLMs’ limitations in geometric perception through tikz-based code generation, providing effective tools for geometric understanding and visual reasoning tasks.
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, constraining their ability of geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through tikz-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on GeoTikz-Base dataset, the largest image-to-tikz dataset to date with 2.5M pairs (16 $\times$ larger than existing open-sourced datasets). This process is achieved via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on GeoTikz-Instruct dataset which is the first instruction-augmented tikz dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-sourced MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM(LLM), enhancing reasoning performance in geometric problem-solving. Datasets and codes are publicly available at: https://github.com/sjy-1995/GeoTikzBridge-Advancing-Multimodal-Code-Generation-for-Geometric-Perception-and-Reasoning.
[119] Think 360°: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth
Mingrui Chen, Hexiong Yang, Haogeng Liu, Huaibo Huang, Ran He
Main category: cs.CV
TL;DR: A multimodal benchmark evaluating MLLMs’ reasoning width (parallel exploration) alongside depth (sequential reasoning), with 1200+ cases and tree-of-thought evaluation showing current models struggle with genuine insight-based reasoning.
Details
Motivation: Current MLLM evaluation focuses primarily on reasoning depth (sequential, chain-of-thought reasoning), but overlooks reasoning width - the ability to explore multiple parallel paths, apply constraints, and backtrack. The paper aims to address this gap by creating a holistic benchmark that evaluates both dimensions.
Method: Created a benchmark with 1200+ high-quality multimodal cases across diverse domains. Developed a fine-grained tree-of-thought evaluation protocol that jointly quantifies reasoning width and depth. Evaluated 12 major model families (over 30 advanced MLLMs) across difficulty tiers, question types, and required skills.
Result: Current models show strong performance on general/common-sense VQA tasks but struggle to combine deep sequential thought chains with wide exploratory search for genuine insight-based reasoning. The benchmark reveals characteristic failure modes in multimodal reasoning.
Conclusion: MLLMs need improvement in both reasoning depth and width for genuine insight-based reasoning. The benchmark provides directions for building models that can reason both deeper and wider across multimodal domains.
Abstract: In this paper, we present a holistic multimodal benchmark that evaluates the reasoning capabilities of MLLMs with an explicit focus on reasoning width, a complementary dimension to the more commonly studied reasoning depth. Specifically, reasoning depth measures the model’s ability to carry out long-chain, sequential reasoning in which each step is tightly and rigorously linked to the next. Reasoning width tends to focus more on the model’s capacity for broad trial-and-error search or multi-constrained optimization: it must systematically traverse many possible and parallelized reasoning paths, apply diverse constraints to prune unpromising branches, and identify valid solution routes for efficient iteration or backtracking. To achieve it, we carefully curate 1200+ high-quality multimodal cases spanning heterogeneous domains, and propose a fine-grained tree-of-thought evaluation protocol that jointly quantifies reasoning width and depth. We evaluate 12 major model families (over 30 advanced MLLMs) across difficulty tiers, question types, and required skills. Results show that while current models exhibit strong performance on general or common-sense VQA tasks, they still struggle to combine deep sequential thought chains with wide exploratory search to perform genuine insight-based reasoning. Finally, we analyze characteristic failure modes to provide possible directions for building MLLMs that reason not only deeper but also wider.
[120] WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment
Tzu-Ti Wei, Chu-Yu Huang, Yu-Chee Tseng, Jen-Jee Chen
Main category: cs.CV
TL;DR: WiFi2Cap: A framework that generates natural language action descriptions from Wi-Fi CSI signals using vision-language teacher-student learning and mirror-consistency loss to address left/right ambiguities.
Details
Motivation: Existing Wi-Fi CSI systems focus on pose estimation or predefined action classification, but lack fine-grained language generation capabilities. There's a need for privacy-preserving semantic understanding that can map wireless signals to natural language descriptions, addressing challenges like the semantic gap between signals and language, and direction-sensitive ambiguities.
Method: Three-stage framework: 1) Vision-language teacher learns from synchronized video-text pairs, 2) CSI student aligns to teacher’s visual space and text embeddings using Mirror-Consistency Loss to reduce mirrored-action and left-right ambiguities, 3) Prefix-tuned language model generates action descriptions from CSI embeddings. Also introduces WiFi2Cap Dataset with synchronized CSI-RGB-sentence data.
Result: WiFi2Cap consistently outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE metrics, demonstrating effective privacy-friendly semantic sensing.
Conclusion: The proposed WiFi2Cap framework successfully generates action captions directly from Wi-Fi CSI, addressing the semantic gap and direction-sensitive ambiguities through teacher-student learning and mirror-consistency loss, enabling privacy-preserving semantic understanding of human activities.
Abstract: Privacy-preserving semantic understanding of human activities is important for indoor sensing, yet existing Wi-Fi CSI-based systems mainly focus on pose estimation or predefined action classification rather than fine-grained language generation. Mapping CSI to natural-language descriptions remains challenging because of the semantic gap between wireless signals and language and direction-sensitive ambiguities such as left/right limb confusion. We propose WiFi2Cap, a three-stage framework for generating action captions directly from Wi-Fi CSI. A vision-language teacher learns transferable supervision from synchronized video-text pairs, and a CSI student is aligned to the teacher’s visual space and text embeddings. To improve direction-sensitive captioning, we introduce a Mirror-Consistency Loss that reduces mirrored-action and left-right ambiguities during cross-modal alignment. A prefix-tuned language model then generates action descriptions from CSI embeddings. We also introduce the WiFi2Cap Dataset, a synchronized CSI-RGB-sentence benchmark for semantic captioning from Wi-Fi signals. Experimental results show that WiFi2Cap consistently outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, demonstrating effective privacy-friendly semantic sensing.
[121] TimeWeaver: Age-Consistent Reference-Based Face Restoration with Identity Preservation
Teer Song, Yue Zhang, Yu Tian, Ziyang Wang, Xianlin Zhang, Guixuan Zhang, Xuan Liu, Xueming Li, Yasen Zhang
Main category: cs.CV
TL;DR: TimeWeaver is a reference-based face restoration framework that supports cross-age references, enabling restoration with both identity fidelity and age consistency when only age-mismatched reference images are available.
Details
Motivation: Current reference-based face restoration methods assume reference and degraded input are age-aligned, failing when only cross-age references are available (e.g., historical restoration or missing-person retrieval). There's a need to maintain both identity fidelity and age consistency in such scenarios.
Method: Decouples identity and age conditioning across training and inference. During training: learns age-robust identity representation by fusing global identity embedding with age-suppressed facial tokens via transformer-based ID-Fusion module. During inference: uses two training-free techniques - Age-Aware Gradient Guidance and Token-Targeted Attention Boost - to steer sampling toward desired age semantics.
Result: Extensive experiments show TimeWeaver surpasses existing methods in visual quality, identity preservation, and age consistency.
Conclusion: TimeWeaver successfully addresses the cross-age face restoration problem, enabling high-quality restoration with both identity fidelity and age consistency when only age-mismatched references are available.
Abstract: Recent progress in face restoration has shifted from visual fidelity to identity fidelity, driving a transition from reference-free to reference-based paradigms that condition restoration on reference images of the same person. However, these methods assume the reference and degraded input are age-aligned. When only cross-age references are available, as in historical restoration or missing-person retrieval, they fail to maintain age fidelity. To address this limitation, we propose TimeWeaver, the first reference-based face restoration framework supporting cross-age references. Given arbitrary reference images and a target-age prompt, TimeWeaver produces restorations with both identity fidelity and age consistency. Specifically, we decouple identity and age conditioning across training and inference. During training, the model learns an age-robust identity representation by fusing a global identity embedding with age-suppressed facial tokens via a transformer-based ID-Fusion module. During inference, two training-free techniques, Age-Aware Gradient Guidance and Token-Targeted Attention Boost, steer sampling toward desired age semantics, enabling precise adherence to the target-age prompt. Extensive experiments show that TimeWeaver surpasses existing methods in visual quality, identity preservation, and age consistency.
[122] How Far Can VLMs Go for Visual Bug Detection? Studying 19,738 Keyframes from 41 Hours of Gameplay Videos
Wentao Lu, Alexander Senchenko, Alan Sayle, Abram Hindle, Cor-Paul Bezemer
Main category: cs.CV
TL;DR: VLMs show limited practical effectiveness for automated visual bug detection in long-form gameplay videos, with simple baselines performing similarly to enhanced strategies.
Details
Motivation: Automating quality assurance for long-form gameplay videos is challenging but valuable for game stability assessment. While VLMs show promise for visual reasoning, their real-world performance in practical QA scenarios needs evaluation.
Method: Study using industrial QA gameplay videos (100 videos, 41 hours, 19,738 keyframes). Evaluated VLMs on bug detection from keyframes, starting with single-prompt baseline, then testing two enhancement strategies: secondary judge model for output re-evaluation and metadata-augmented prompting using prior bug report retrieval.
Result: Baseline achieved precision 0.50 and accuracy 0.72. Enhancement strategies provided only marginal improvements over baseline while adding computational cost and output variance. VLMs can detect certain visual bugs but need better approaches for practical use.
Conclusion: Off-the-shelf VLMs have limited practical effectiveness for visual bug detection in gameplay QA. Future progress requires hybrid approaches that better separate textual and visual anomaly detection rather than simple prompting enhancements.
Abstract: Video-based quality assurance (QA) for long-form gameplay video is labor-intensive and error-prone, yet valuable for assessing game stability and visual correctness over extended play sessions. Vision language models (VLMs) promise general-purpose visual reasoning capabilities and thus appear attractive for detecting visual bugs directly from video frames. Recent benchmarks suggest that VLMs can achieve promising results in detecting visual glitches on curated datasets. Building on these findings, we conduct a real-world study using industrial QA gameplay videos to evaluate how well VLMs perform in practical scenarios. Our study samples keyframes from long gameplay videos and asks a VLM whether each keyframe contains a bug. Starting from a single-prompt baseline, the model achieves a precision of 0.50 and an accuracy of 0.72. We then examine two common enhancement strategies used to improve VLM performance without fine-tuning: (1) a secondary judge model that re-evaluates VLM outputs, and (2) metadata-augmented prompting through the retrieval of prior bug reports. Across 100 videos totaling 41 hours and 19,738 keyframes, these strategies provide only marginal improvements over the simple baseline, while introducing additional computational cost and output variance. Our findings indicate that off-the-shelf VLMs are already capable of detecting a certain range of visual bugs in QA gameplay videos, but further progress likely requires hybrid approaches that better separate textual and visual anomaly detection.
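The single-prompt baseline is essentially a per-keyframe binary query plus aggregation; a schematic sketch of that loop follows. The `ask_vlm` callable is a stand-in for any VLM API, and the prompt wording is an assumption, not the study's exact prompt.

```python
def detect_bugs(keyframes, ask_vlm,
                prompt="Does this gameplay frame contain a visual "
                       "bug? Answer yes or no."):
    """Single-prompt baseline: query the VLM once per keyframe and
    collect the indices of frames it flags.

    ask_vlm(frame, prompt) -> str  (the model's free-text answer)
    """
    flagged = []
    for i, frame in enumerate(keyframes):
        answer = ask_vlm(frame, prompt)
        if answer.strip().lower().startswith("yes"):
            flagged.append(i)
    return flagged
```

The two enhancement strategies bolt onto this loop: the judge model re-evaluates each flagged frame, and metadata-augmented prompting prepends retrieved bug-report text to `prompt`; the study's point is that neither bolt-on moves precision much beyond what this bare loop achieves.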
[123] SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
Khanh Binh Nguyen, Chae Jung Park
Main category: cs.CV
TL;DR: SOUPLE introduces learnable context tokens conditioned on visual features to bridge audio-visual semantic correspondence, improving localization and segmentation over fixed-prompt methods.
Details
Motivation: Existing CLIP-based methods for audio-visual localization struggle because replacing classification tokens with audio embeddings fails to capture semantic cues, and fixed prompts don't establish meaningful connections between audio embeddings and visual context.
Method: Proposes Sound-aware Prompt Learning (SOUPLE) which replaces fixed prompts with learnable context tokens that incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs.
Result: Experiments on VGGSound, SoundNet, and AVSBench demonstrate improved localization and segmentation performance compared to previous methods.
Conclusion: SOUPLE effectively addresses the limitations of fixed prompts in CLIP for audio-visual tasks by learning context tokens that better bridge audio and visual semantics, leading to improved multimodal understanding.
Abstract: Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt “a photo of a [V_A]” fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
[124] MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding
Purui Bai, Tao Wu, Jiayang Sun, Xinyue Liu, Huaibo Huang, Ran He
Main category: cs.CV
TL;DR: MVPBench is a new benchmark for evaluating multi-modal LLMs on multi-video perception tasks, featuring 14 subtasks across diverse domains with 5K QA tests using 2.7K video clips.
Details
Motivation: Existing benchmarks for multi-modal LLMs are limited to static images or single videos, overlooking the complex interactions across multiple videos that are crucial for real-world applications.
Method: Created MVPBench with 14 subtasks across diverse visual domains, including 5K question-answering tests involving 2.7K video clips sourced from existing datasets and manually annotated clips.
Result: Extensive evaluations show current models struggle to process multi-video inputs effectively, revealing substantial limitations in their multi-video comprehension capabilities.
Conclusion: MVPBench addresses a critical gap in evaluating multi-modal LLMs and is expected to drive advancements in multi-video perception research.
Abstract: The rapid progress of Large Language Models (LLMs) has spurred growing interest in Multi-modal LLMs (MLLMs) and motivated the development of benchmarks to evaluate their perceptual and comprehension abilities. Existing benchmarks, however, are limited to static images or single videos, overlooking the complex interactions across multiple videos. To address this gap, we introduce the Multi-Video Perception Evaluation Benchmark (MVPBench), a new benchmark featuring 14 subtasks across diverse visual domains designed to evaluate models on extracting relevant information from video sequences to make informed decisions. MVPBench includes 5K question-answering tests involving 2.7K video clips sourced from existing datasets and manually annotated clips. Extensive evaluations reveal that current models struggle to process multi-video inputs effectively, underscoring substantial limitations in their multi-video comprehension. We anticipate MVPBench will drive advancements in multi-video perception.
[125] Multimodal Industrial Anomaly Detection via Geometric Prior
Min Li, Jinghui He, Gang Li, Jiachen Li, Jin Wan, Delong Han
Main category: cs.CV
TL;DR: GPAD: A novel multimodal industrial anomaly detection network that leverages geometric priors from 3D point clouds to detect subtle surface deformations and irregular contours that 2D methods struggle to capture.
Details
Motivation: Current multimodal industrial anomaly detection methods lack effective use of crucial geometric information like surface normal vectors and 3D shape topology, resulting in low detection accuracy for complex geometric shape defects.
Method: Proposes GPAD with: 1) Point cloud expert model for fine-grained geometric feature extraction using differential normal vector computation to generate geometric priors; 2) Two-stage fusion strategy to leverage multimodal data complementarity and geometric priors; 3) Attention fusion and anomaly region segmentation based on geometric priors.
Result: Outperforms state-of-the-art methods in detection accuracy on both MVTec-3D AD and Eyecandies datasets.
Conclusion: GPAD effectively utilizes geometric priors from 3D point clouds to enhance multimodal industrial anomaly detection, particularly for subtle geometric defects that are challenging for 2D-based methods.
Abstract: The purpose of multimodal industrial anomaly detection is to detect complex geometric shape defects such as subtle surface deformations and irregular contours that are difficult to detect with 2D-based methods. However, current multimodal industrial anomaly detection lacks the effective use of crucial geometric information like surface normal vectors and 3D shape topology, resulting in low detection accuracy. In this paper, we propose a novel Geometric Prior-based Anomaly Detection network (GPAD). Firstly, we propose a point cloud expert model to perform fine-grained geometric feature extraction, employing differential normal vector computation to enhance the geometric details of the extracted features and generate a geometric prior. Secondly, we propose a two-stage fusion strategy to efficiently leverage the complementarity of multimodal data as well as the geometric prior inherent in 3D points. We further propose attention fusion and anomaly region segmentation based on the geometric prior, which enhance the model’s ability to perceive geometric defects. Extensive experiments show that our multimodal industrial anomaly detection model outperforms state-of-the-art (SOTA) methods in detection accuracy on both MVTec-3D AD and Eyecandies datasets.
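The summary only names GPAD's "differential normal vector computation" without specifying it. As a generic stand-in, per-point surface normals are commonly estimated by PCA over local neighborhoods, taking the direction of least variance; the sketch below illustrates that standard technique, not the paper's exact procedure.

```python
import numpy as np

def estimate_normals(points, k=8):
    """Per-point surface normals via PCA over the k nearest neighbors.

    A generic technique used as a stand-in here; GPAD's differential
    normal vector computation may differ in its details.
    """
    normals = np.zeros_like(points)
    for i, p in enumerate(points):
        # k nearest neighbors by Euclidean distance (brute force)
        d = np.linalg.norm(points - p, axis=1)
        nbrs = points[np.argsort(d)[:k]]
        # Normal = eigenvector of the neighborhood covariance with the
        # smallest eigenvalue (direction of least local variance).
        _, eigvecs = np.linalg.eigh(np.cov(nbrs.T))
        normals[i] = eigvecs[:, 0]
    return normals

# Sanity check: points sampled on the z=0 plane should receive normals
# along the +/- z axis.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(size=50), rng.uniform(size=50), np.zeros(50)])
nrm = estimate_normals(pts)
print(round(float(np.abs(nrm[:, 2]).min()), 3))  # 1.0
```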
[126] Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
Main category: cs.CV
TL;DR: SlotCurri: A curriculum learning approach for video object-centric learning that progressively allocates slots based on reconstruction error, uses structure-aware loss for semantic boundaries, and cyclic inference for temporal consistency, reducing over-fragmentation.
Details
Motivation: Existing slot-attention models for video object-centric learning suffer from severe over-fragmentation, where a single object is represented by multiple redundant slots due to the reconstruction objective encouraging all slots to be occupied.
Method: 1) Reconstruction-guided slot curriculum: Start with few coarse slots, progressively allocate new slots where reconstruction error remains high. 2) Structure-aware loss: Augment MSE with local contrast and edge preservation to sharpen semantic boundaries. 3) Cyclic inference: Roll slots forward and backward through frames for temporal consistency.
Result: Significant improvements in FG-ARI: +6.8 on YouTube-VIS and +8.3 on MOVi-C datasets, demonstrating effective reduction of object over-fragmentation.
Conclusion: SlotCurri successfully addresses object over-fragmentation in video object-centric learning through curriculum-based slot allocation, structural cues for boundary sharpening, and cyclic inference for temporal consistency.
Abstract: Video Object-Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. This is because the model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, thereby representing a single object with multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, thus expanding capacity only where it is needed and preventing fragmentation from the outset. Yet, during slot expansion, meaningful sub-parts can emerge only if coarse-level semantics are already well separated; however, with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. Therefore, we augment MSE with a structure-aware loss that preserves local contrast and edge information to encourage each slot to sharpen its semantic boundaries. Lastly, we propose a cyclic inference that rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. All combined, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri. Our code is available at github.com/wjun0830/SlotCurri.
[127] ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding
Ao Cheng, Xingming Li, Xuanyu Ji, Xixiang He, Qiyao Sun, Chunping Qiu, Runke Huang, Qingyong Hu
Main category: cs.CV
TL;DR: ENC-Bench: First benchmark for evaluating multimodal LLMs on professional Electronic Navigational Charts understanding, showing current models struggle with specialized maritime symbolic reasoning and spatial computation.
Details
Motivation: Electronic Navigational Charts (ENCs) are safety-critical for maritime navigation but require specialized expertise to interpret. It's unclear whether current multimodal LLMs can reliably understand these complex charts that encode regulations, bathymetry, and route constraints through standardized vector symbols and geometric structures.
Method: Created ENC-Bench with 20,490 expert-validated samples from 840 NOAA ENCs, organized into three levels: Perception (symbol/feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, emergency planning). Generated samples from raw S-57 data using vector-to-image pipeline with automated consistency checks and expert review.
Result: Evaluated 10 state-of-the-art MLLMs (GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, GLM-4.5V, etc.) under zero-shot protocol. Best model achieved only 47.88% accuracy, showing systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting/scale variations.
Conclusion: Established first rigorous ENC benchmark, opening new research frontier at intersection of specialized symbolic reasoning and safety-critical AI. Provides essential infrastructure for advancing MLLMs toward professional maritime applications.
Abstract: Electronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure – requiring specialized maritime expertise for interpretation. We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, emergency planning under multiple constraints). All samples are generated from raw S-57 data through a calibrated vector-to-image pipeline with automated consistency checks and expert review. We evaluate 10 state-of-the-art MLLMs such as GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, and GLM-4.5V, under a unified zero-shot protocol. The best model achieves only 47.88% accuracy, with systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting and scale variations. By establishing the first rigorous ENC benchmark, we open a new research frontier at the intersection of specialized symbolic reasoning and safety-critical AI, providing essential infrastructure for advancing MLLMs toward professional maritime applications.
[128] From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery
Bijay Shakya, Catherine Hoier, Khandaker Mamun Ahmed
Main category: cs.CV
TL;DR: Hybrid framework combining AI super-resolution, object detection, and Vision-Language Models for post-disaster building damage assessment from satellite imagery.
Details
Motivation: Remote sensing imagery for disaster damage assessment suffers from low resolution, contextual ambiguity, and limited semantic interpretability, reducing reliability of traditional detection methods.
Method: Three-stage approach: 1) Video Restoration Transformer (VRT) upscales satellite images from 1024x1024 to 4096x4096, 2) YOLOv11 detects buildings in pre-disaster imagery, 3) VLMs analyze cropped building regions to assess structural damage across four severity levels using CLIPScore for reference-free semantic alignment and multi-model VLM-as-a-Jury strategy.
Result: Framework enhances semantic interpretation of damaged buildings on xBD dataset subsets (Moore Tornado, Hurricane Matthew) and provides helpful recovery recommendations to first responders.
Conclusion: The proposed hybrid framework effectively addresses limitations of remote sensing imagery for damage assessment by combining super-resolution, object detection, and VLMs for comprehensive semantic analysis.
Abstract: Rapid and accurate structural damage assessment following natural disasters is critical for effective emergency response and recovery. However, remote sensing imagery often suffers from low spatial resolution, contextual ambiguity, and limited semantic interpretability, reducing the reliability of traditional detection pipelines. In this work, we propose a novel hybrid framework that integrates AI-based super-resolution, deep learning object detection, and Vision-Language Models (VLMs) for comprehensive post-disaster building damage assessment. First, we enhance pre- and post-disaster satellite imagery using a Video Restoration Transformer (VRT) to upscale images from 1024x1024 to 4096x4096 resolution, improving structural detail visibility. Next, a YOLOv11-based detector localizes buildings in pre-disaster imagery, and cropped building regions are analyzed using VLMs to semantically assess structural damage across four severity levels. To ensure robust evaluation in the absence of ground-truth captions, we employ CLIPScore for reference-free semantic alignment and introduce a multi-model VLM-as-a-Jury strategy to reduce individual model bias in safety-critical decision making. Experiments on subsets of the xBD dataset, including the Moore Tornado and Hurricane Matthew events, demonstrate that the proposed framework enhances the semantic interpretation of damaged buildings. In addition, our framework provides helpful recommendations to first responders for recovery based on damage analysis.
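The VLM-as-a-Jury step can be sketched as a majority vote over per-model severity labels. The paper specifies four severity levels, but the label names and the severity-weighted tie-break below are illustrative assumptions, not details given in the abstract.

```python
from collections import Counter

# Hypothetical severity scale, ordered least to most severe; the paper
# uses four levels but the exact label names are our assumption.
SEVERITY = ["no-damage", "minor", "major", "destroyed"]

def jury_vote(votes):
    """Aggregate per-model labels by majority vote, breaking ties toward
    the more severe label (a conservative choice for safety-critical
    triage; the paper's actual aggregation rule may differ)."""
    counts = Counter(votes)
    label, _ = max(counts.items(), key=lambda kv: (kv[1], SEVERITY.index(kv[0])))
    return label

print(jury_vote(["minor", "minor", "major"]))      # minor
print(jury_vote(["minor", "major", "destroyed"]))  # destroyed (tie -> most severe)
```

Averaging CLIPScore values across jurors would be the analogous aggregation for the continuous, reference-free metric.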
[129] Typography-Based Monocular Distance Estimation Framework for Vehicle Safety Systems
Manognya Lokesh Reddy, Zheng Liu
Main category: cs.CV
TL;DR: A monocular distance estimation framework using license plate typography as fiducial markers, achieving 7.7% mean absolute error through geometric analysis of character height and multi-feature fusion.
Details
Motivation: Need for low-cost distance estimation for ADAS/autonomous driving without expensive LiDAR/radar; monocular vision suffers from scale ambiguity, so license plate typography provides standardized metric reference.
Method: Uses license plate detection, character segmentation to measure character height, applies pinhole camera model; includes camera pose compensation, hybrid deep-learning fusion, Kalman filtering, and multi-feature fusion (stroke width, spacing, border thickness).
Result: Achieved 2.3% coefficient of variation in character height across frames, 7.7% mean absolute error; character-based ranging reduced standard deviation by 35% compared to plate-width method; operates in real-time without GPU.
Conclusion: Typography-based monocular distance estimation is viable low-cost alternative to LiDAR/radar; robust framework with multi-feature fusion improves consistency for practical ADAS applications.
Abstract: Accurate inter-vehicle distance estimation is a cornerstone of advanced driver assistance systems and autonomous driving. While LiDAR and radar provide high precision, their cost prohibits widespread adoption in mass-market vehicles. Monocular vision offers a low-cost alternative but suffers from scale ambiguity and sensitivity to environmental disturbances. This paper introduces a typography-based monocular distance estimation framework, which exploits the standardized typography of license plates as passive fiducial markers for metric distance estimation. The core geometric module uses robust plate detection and character segmentation to measure character height and computes distance via the pinhole camera model. The system incorporates interactive calibration, adaptive detection with strict and permissive modes, and multi-method character segmentation leveraging both adaptive and global thresholding. To enhance robustness, the framework further includes camera pose compensation using lane-based horizon estimation, hybrid deep-learning fusion, temporal Kalman filtering for velocity estimation, and multi-feature fusion that exploits additional typographic cues such as stroke width, character spacing, and plate border thickness. Experimental validation with a calibrated monocular camera in a controlled indoor setup achieved a coefficient of variation of 2.3% in character height across consecutive frames and a mean absolute error of 7.7%. The framework operates without GPU acceleration, demonstrating real-time feasibility. A comprehensive comparison with a plate-width based method shows that character-based ranging reduces the standard deviation of estimates by 35%, translating to smoother, more consistent distance readings in practice, where erratic estimates could trigger unnecessary braking or acceleration.
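The core geometric step, ranging from a character of known physical height via the pinhole model, reduces to one formula. The focal length and 75 mm character height below are hypothetical calibration values for illustration; the full pipeline adds pose compensation, Kalman filtering, and multi-feature fusion on top of this.

```python
def distance_from_char_height(focal_px, char_height_m, char_height_px):
    """Pinhole model: distance [m] = focal [px] * real height [m] / image height [px]."""
    return focal_px * char_height_m / char_height_px

# Hypothetical calibration: 1000 px focal length, 75 mm tall plate
# characters, observed 15 px tall in the frame.
d = distance_from_char_height(1000.0, 0.075, 15.0)
print(d)  # 5.0
```

Because distance scales inversely with measured pixel height, sub-pixel segmentation error dominates at long range, which is why the paper fuses additional typographic cues such as stroke width and spacing.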
[130] Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models
Wenyue Chen, Wenjue Chen, Peng Li, Qinghe Wang, Xu Jia, Heliang Zheng, Rongfei Jia, Yuan Liu, Ronggang Wang
Main category: cs.CV
TL;DR: Know3D integrates multimodal LLM knowledge into 3D generation via latent injection to enable language-controllable back-view generation, addressing stochastic unseen region synthesis.
Details
Motivation: Existing 3D generation models produce stochastic and difficult-to-control unseen regions due to single-view ambiguity and limited 3D training data, often failing to align with user intentions or producing implausible geometries.
Method: Proposes Know3D framework that incorporates multimodal LLM knowledge into 3D generative processes via latent hidden-state injection. Uses VLM-diffusion-based model where VLM handles semantic understanding/guidance and diffusion model transfers semantic knowledge to 3D generation model.
Result: Successfully bridges gap between abstract textual instructions and geometric reconstruction of unobserved regions, transforming stochastic back-view hallucination into semantically controllable process.
Conclusion: Demonstrates promising direction for future 3D generation models by enabling language-controllable generation of back-view for 3D assets through multimodal LLM integration.
Abstract: Recent advances in 3D generation have improved the fidelity and geometric details of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the lack of robust global structural priors caused by limited 3D training data, the unseen regions generated by existing models are often stochastic and difficult to control, which may sometimes fail to align with user intentions or produce implausible geometries. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal large language models into 3D generative processes via latent hidden-state injection, enabling language-controllable generation of the back-view for 3D assets. We utilize a VLM-diffusion-based model, where the VLM is responsible for semantic understanding and guidance. The diffusion model acts as a bridge that transfers semantic knowledge from the VLM to the 3D generation model. In this way, we successfully bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, transforming the traditionally stochastic back-view hallucination into a semantically controllable process, demonstrating a promising direction for future 3D generation models.
[131] Exposure-Normalized Bed and Chair Fall Rates via Continuous AI Monitoring
Paolo Gabriel, Peter Rehani, Zack Drumm, Tyler Troy, Tiffany Wyatt, Narinder Singh
Main category: cs.CV
TL;DR: AI monitoring study comparing fall rates between chairs and beds in healthcare settings, finding higher fall rates in chairs, though the difference was not statistically significant.
Details
Motivation: To better understand fall risks in healthcare settings by using AI monitoring to estimate fall rates based on actual exposure time rather than traditional occupied bed-days metrics.
Method: Retrospective cohort study using continuous AI monitoring to track fall events and exposure hours for chairs and beds in a healthcare system from August 2024 to December 2025.
Result: Found higher fall rates in chairs (17.8 per 1,000 chair exposure-hours) vs beds (4.3 per 1,000 bed exposure-hours), with a chair-to-bed rate ratio of 2.35, though not statistically significant. Most chair falls involved footrest-positioning failures.
Conclusion: Results are hypothesis-generating and support testing safer chair setups rather than reducing chair use, but findings are limited by being observational in a single health system.
Abstract: This retrospective cohort study used continuous AI monitoring to estimate fall rates by exposure time rather than occupied bed-days. From August 2024 to December 2025, 3,980 eligible monitoring units contributed 292,914 hourly rows, yielding probability-weighted rates of 17.8 falls per 1,000 chair exposure-hours and 4.3 per 1,000 bed exposure-hours. Within the study window, 43 adjudicated falls matched the monitoring pipeline, and 40 linked to eligible exposure hours for the primary Poisson model, producing an adjusted chair-versus-bed rate ratio of 2.35 (95% confidence interval 0.87 to 6.33; p=0.0907). In a separate broader observation cohort (n=32 deduplicated events), 6 of 7 direct chair falls involved footrest-positioning failures. Because this was an observational study in a single health system, these findings remain hypothesis-generating and support testing safer chair setups rather than using chairs less.
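The exposure-normalized metric is simply events per 1,000 hours of observed exposure. The counts below are made up for illustration; the study's published rates are probability-weighted, and its adjusted ratio of 2.35 comes from a Poisson model with covariates rather than the crude division shown here.

```python
# Exposure-normalized rate: events per 1,000 exposure-hours. The event
# and hour counts here are hypothetical, not the study's data.
def rate_per_1000h(events, exposure_hours):
    return 1000.0 * events / exposure_hours

chair_rate = rate_per_1000h(16, 900)    # hypothetical: 16 chair falls / 900 h
bed_rate = rate_per_1000h(24, 5600)     # hypothetical: 24 bed falls / 5600 h
crude_ratio = chair_rate / bed_rate
print(round(chair_rate, 2), round(bed_rate, 2), round(crude_ratio, 2))
# 17.78 4.29 4.15
```

Normalizing by exposure time rather than occupied bed-days is what makes chairs and beds directly comparable: patients spend far fewer hours in chairs, so raw event counts alone would understate chair risk.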
[132] Predictive Photometric Uncertainty in Gaussian Splatting for Novel View Synthesis
Chamuditha Jayanga Galappaththige, Thomas Gottwald, Peter Stehr, Edgar Heinert, Niko Suenderhauf, Dimity Miller, Matthias Rottmann
Main category: cs.CV
TL;DR: A lightweight framework for estimating pixel-wise predictive uncertainty in 3D Gaussian Splatting, enabling trustworthy spatial maps for autonomous agents through Bayesian-regularized optimization.
Details
Motivation: While 3D Gaussian Splatting achieves impressive photorealistic rendering, it lacks uncertainty estimation needed for safety-critical applications like autonomous agents. The paper aims to bridge this gap by providing reliability signals for trustworthy spatial mapping.
Method: A post-hoc, plug-and-play framework using Bayesian-regularized linear least-squares optimization over reconstruction residuals. The architecture-agnostic approach extracts per-primitive uncertainty channels without modifying the underlying scene representation or degrading visual fidelity.
Result: The method successfully translates 3D Gaussian splatting into trustworthy spatial maps and improves state-of-the-art performance across three critical downstream perception tasks: active view selection, pose-agnostic scene change detection, and pose-agnostic anomaly detection.
Conclusion: The framework enables 3D Gaussian Splatting to become a reliable spatial map for autonomous agents by providing actionable uncertainty estimation, making it suitable for safety-critical applications beyond pure rendering.
Abstract: Recent advances in 3D Gaussian Splatting have enabled impressive photorealistic novel view synthesis. However, to transition from a pure rendering engine to a reliable spatial map for autonomous agents and safety-critical applications, knowing where the representation is uncertain is as important as the rendering fidelity itself. We bridge this critical gap by introducing a lightweight, plug-and-play framework for pixel-wise, view-dependent predictive uncertainty estimation. Our post-hoc method formulates uncertainty as a Bayesian-regularized linear least-squares optimization over reconstruction residuals. This architecture-agnostic approach extracts a per-primitive uncertainty channel without modifying the underlying scene representation or degrading baseline visual fidelity. Crucially, we demonstrate that providing this actionable reliability signal successfully translates 3D Gaussian splatting into a trustworthy spatial map, further improving state-of-the-art performance across three critical downstream perception tasks: active view selection, pose-agnostic scene change detection, and pose-agnostic anomaly detection.
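The paper formulates uncertainty as a Bayesian-regularized linear least-squares fit over reconstruction residuals. A minimal stand-in for that ingredient is the closed-form MAP (ridge) solve below; the choice of per-primitive features and residual targets is our assumption for illustration, not the paper's exact design.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Bayesian-regularized least squares: the MAP estimate of w under a
    Gaussian prior, w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical setup: per-primitive features regressed onto observed
# reconstruction-residual magnitudes; predictions on held-out views
# would then serve as a per-pixel uncertainty proxy.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([0.5, -1.0, 0.0, 2.0, 0.3])
y = X @ w_true + 0.01 * rng.normal(size=200)
w = ridge_fit(X, y, lam=0.1)
print(np.allclose(w, w_true, atol=0.05))  # True
```

The Gaussian-prior regularizer is what keeps such a post-hoc fit stable for primitives observed in only a few views, consistent with the framework's plug-and-play claim.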
[133] It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal
Lishen Qu, Shihao Zhou, Jie Liang, Hui Zeng, Lei Zhang, Jufeng Yang
Main category: cs.CV
TL;DR: Flickerformer: A transformer-based architecture that removes flicker artifacts in short-exposure photography by exploiting periodicity and directionality characteristics through phase-based fusion, autocorrelation feed-forward networks, and wavelet-based directional attention.
Details
Motivation: Flicker artifacts from unstable illumination and row-wise exposure inconsistencies degrade image quality in short-exposure photography. Current generic restoration frameworks fail to account for flicker's structured spatial-temporal patterns, leading to suboptimal suppression and ghosting artifacts.
Method: Proposes Flickerformer with three key components: 1) Phase-based Fusion Module (PFM) for inter-frame phase correlation to aggregate burst features based on periodicity, 2) Autocorrelation Feed-Forward Network (AFFN) to exploit intra-frame structural regularities through autocorrelation, and 3) Wavelet-based Directional Attention Module (WDAM) that leverages high-frequency variations in wavelet domain to guide restoration of low-frequency dark regions based on directionality.
Result: Extensive experiments demonstrate that Flickerformer outperforms state-of-the-art approaches in both quantitative metrics and visual quality. The method effectively removes flicker without introducing ghosting artifacts.
Conclusion: Flickerformer successfully addresses flicker artifacts by exploiting their intrinsic periodicity and directionality characteristics through specialized transformer components, achieving superior performance compared to existing methods.
Abstract: Flicker artifacts, arising from unstable illumination and row-wise exposure inconsistencies, pose a significant challenge in short-exposure photography, severely degrading image quality. Unlike typical artifacts, e.g., noise and low-light, flicker is a structured degradation with specific spatial-temporal patterns, which are not accounted for in current generic restoration frameworks, leading to suboptimal flicker suppression and ghosting artifacts. In this work, we reveal that flicker artifacts exhibit two intrinsic characteristics, periodicity and directionality, and propose Flickerformer, a transformer-based architecture that effectively removes flicker without introducing ghosting. Specifically, Flickerformer comprises three key components: a phase-based fusion module (PFM), an autocorrelation feed-forward network (AFFN), and a wavelet-based directional attention module (WDAM). Based on the periodicity, PFM performs inter-frame phase correlation to adaptively aggregate burst features, while AFFN exploits intra-frame structural regularities through autocorrelation, jointly enhancing the network’s ability to perceive spatially recurring patterns. Moreover, motivated by the directionality of flicker artifacts, WDAM leverages high-frequency variations in the wavelet domain to guide the restoration of low-frequency dark regions, yielding precise localization of flicker artifacts. Extensive experiments demonstrate that Flickerformer outperforms state-of-the-art approaches in both quantitative metrics and visual quality. The source code is available at https://github.com/qulishen/Flickerformer.
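The periodicity cue that AFFN exploits can be illustrated in one dimension: row-wise flicker banding shows up as a peak in the autocorrelation at the banding period. This toy version operates on raw row intensities, whereas the network computes autocorrelation over learned features.

```python
def dominant_period(signal):
    """Estimate a signal's dominant period as the first local maximum of
    its autocorrelation after lag 0 (a toy stand-in for AFFN's
    autocorrelation over learned features)."""
    n = len(signal)
    mean = sum(signal) / n
    x = [v - mean for v in signal]
    # Unnormalized autocorrelation for every lag.
    ac = [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n)]
    for lag in range(1, n - 1):
        if ac[lag] > ac[lag - 1] and ac[lag] >= ac[lag + 1]:
            return lag
    return None

# Synthetic row-mean intensities with a bright band every 24 rows.
rows = [1.0 if i % 24 == 0 else 0.0 for i in range(240)]
print(dominant_period(rows))  # 24
```

Knowing the period lets a restoration network attend to spatially recurring rows rather than treating each band as an independent artifact.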
[134] PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding
Lirong Che, Zhenfeng Gan, Yanbo Chen, Junbo Tan, Xueqian Wang
Main category: cs.CV
TL;DR: PhotoAgent is an embodied agent for creative photography that uses LMM reasoning to translate aesthetic goals into geometric constraints, then refines viewpoints through visual reflection in a 3D Gaussian Splatting world model.
Details
Motivation: Bridging the semantic gap between high-level language commands and geometric control for embodied agents in creative tasks like photography, avoiding costly physical trial-and-error.
Method: Integrates Large Multimodal Models (LMMs) with chain-of-thought reasoning to translate aesthetic goals into geometric constraints, uses analytical solver for initial viewpoint, then iteratively refines through visual reflection in a photorealistic 3D Gaussian Splatting internal world model.
Result: PhotoAgent excels in spatial reasoning and achieves superior final image quality compared to alternatives, enabling rapid convergence to aesthetically superior results without physical trial-and-error.
Conclusion: The proposed mental simulation approach using LMM reasoning and 3DGS world modeling effectively bridges language commands to geometric control for creative photography agents.
Abstract: Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven, chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.
[135] Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
Mincheol Kwon, Minseung Lee, Seonga Choi, Miso Choi, Kyeong-Jin Oh, Hyunyoung Lee, Cheonyoung Park, Yongho Song, Seunghyun Park, Jinkyu Kim
Main category: cs.CV
TL;DR: PinPoint is a two-stage LVLM framework that identifies instruction-relevant image regions first, then extracts fine-grained features, reducing computational overhead while improving accuracy on complex visual documents.
Details
Motivation: Current LVLMs generate many visual tokens for complex images like infographics/documents, causing computational inefficiency. Need to focus only on instruction-relevant regions to improve both performance and efficiency.
Method: Two-stage framework: 1) Instruction-Region Alignment to localize relevant regions using both visual input and textual instructions, 2) Region refinement for fine-grained feature extraction. Introduces new annotations for instruction-relevant regions across VQA benchmarks.
Result: Achieves superior accuracy compared to existing methods while reducing computational overhead by minimizing irrelevant visual tokens on InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA benchmarks.
Conclusion: PinPoint effectively addresses computational inefficiency in LVLMs for complex visual documents by focusing on instruction-relevant regions, improving both accuracy and efficiency.
Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens.
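The two-stage idea lends itself to a small sketch. Everything below (the cosine scoring, the top-k selection, and the per-region token budget) is an illustrative assumption of ours, not PinPoint's actual Instruction-Region Alignment module:

```python
import numpy as np

def select_regions(region_feats, instr_feat, top_k=2):
    """Stage 1 (illustrative): rank candidate regions by cosine similarity
    to the instruction embedding and keep only the top-k."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    q = instr_feat / np.linalg.norm(instr_feat)
    scores = r @ q                          # one relevance score per region
    keep = np.argsort(scores)[::-1][:top_k]
    return keep, scores

def fine_grained_budget(keep, tokens_per_region=64):
    """Stage 2 (illustrative): spend the visual-token budget only on the
    selected regions, so irrelevant areas contribute no tokens at all."""
    return {int(i): tokens_per_region for i in keep}

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))                      # 5 candidate regions
instruction = regions[3] + 0.01 * rng.normal(size=8)   # matches region 3
keep, scores = select_regions(regions, instruction, top_k=2)
budget = fine_grained_budget(keep)
print(int(keep[0]))  # region 3 ranks first
```

The efficiency lever is that stage 2 only ever tokenizes what stage 1 keeps, which is how irrelevant visual tokens get minimized.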
[136] TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment
Chunxia Qin, Chenyu Liu, Pengcheng Xia, Jun Du, Baocai Yin, Bing Yin, Cong Liu
Main category: cs.CV
TL;DR: TDATR is an end-to-end table recognition framework that uses table detail-aware learning and cell-level visual alignment to improve table structure and content understanding with limited training data.
Details
Motivation: Existing table recognition approaches have limitations: modular pipelines separate structure and content modeling leading to suboptimal integration, while end-to-end approaches require large-scale data and struggle in data-constrained scenarios.
Method: TDATR uses a “perceive-then-fuse” strategy with table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks under a language modeling paradigm. It includes a structure-guided cell localization module for vision-language alignment.
Result: Achieves state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.
Conclusion: TDATR provides an effective end-to-end table recognition solution that works well with limited data through table detail-aware learning and improved vision-language alignment.
Abstract: Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a "perceive-then-fuse" strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cells and strengthens vision-language alignment. It enhances the interpretability and accuracy of TR. We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.
[137] Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference
Zhiceng Shi, Changmiao Wang, Jun Wan, Wenwen Min
Main category: cs.CV
TL;DR: SpaHGC is a multi-modal heterogeneous graph-based model that predicts spatial transcriptomics from pathology images by capturing intra-slice and inter-slice spatial relationships using pathology foundation model embeddings and masked graph contrastive learning.
Details
Motivation: Spatial transcriptomics (ST) provides valuable gene expression data in tissue context but is expensive. Predicting ST from pathology images offers a cost-effective alternative, but existing methods fail to capture complex cross-slide spatial relationships needed for accurate predictions.
Method: SpaHGC uses a multi-modal heterogeneous graph approach that integrates: 1) local spatial context within target slides, 2) cross-slide similarities computed from pathology foundation model embeddings for inter-slice knowledge transfer, and 3) Masked Graph Contrastive Learning to enhance feature representation and transfer spatial gene expression knowledge from reference to target slides.
Result: Comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes shows SpaHGC significantly outperforms nine state-of-the-art methods across all evaluation metrics. Predictions are significantly enriched in multiple cancer-related pathways, demonstrating strong biological relevance.
Conclusion: SpaHGC effectively addresses the challenge of predicting spatial transcriptomics from pathology images by modeling complex spatial dependencies through multi-modal heterogeneous graph learning and inter-slice knowledge transfer, offering a promising cost-effective alternative to expensive ST experiments.
Abstract: While spatial transcriptomics (ST) has advanced our understanding of gene expression in tissue context, its high experimental cost limits its large-scale application. Predicting ST from pathology images is a promising, cost-effective alternative, but existing methods struggle to capture complex cross-slide spatial relationships. To address the challenge, we propose SpaHGC, a multi-modal heterogeneous graph-based model that captures both intra-slice and inter-slice spot-spot relationships from histology images. It integrates local spatial context within the target slide and cross-slide similarities computed from image embeddings extracted by a pathology foundation model. These embeddings enable inter-slice knowledge transfer, and SpaHGC further incorporates Masked Graph Contrastive Learning to enhance feature representation and transfer spatial gene expression knowledge from reference to target slides, enabling it to model complex spatial dependencies and significantly improve prediction accuracy. We conducted comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes. The results demonstrate that SpaHGC significantly outperforms nine existing state-of-the-art methods across all evaluation metrics. Additionally, the predictions are significantly enriched in multiple cancer-related pathways, thereby highlighting its strong biological relevance and application potential.
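The inter-slice knowledge-transfer step rests on linking spots across slides by embedding similarity. A minimal sketch, assuming a simple top-k cosine rule over (hypothetical) foundation-model patch embeddings; SpaHGC's actual heterogeneous-graph construction may differ:

```python
import numpy as np

def cross_slide_edges(target_emb, ref_emb, k=1):
    """Connect each target-slide spot to its k most similar
    reference-slide spots by cosine similarity (illustrative)."""
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = t @ r.T                              # (n_target, n_ref)
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    # heterogeneous edge list of (target_spot, reference_spot) pairs
    return [(i, int(j)) for i in range(len(t)) for j in nbrs[i]]

rng = np.random.default_rng(1)
tgt = rng.normal(size=(4, 16))
# reference slide: near-duplicates of the target spots plus unrelated spots
ref = np.vstack([tgt + 0.05 * rng.normal(size=(4, 16)),
                 rng.normal(size=(4, 16))])
edges = cross_slide_edges(tgt, ref, k=1)
print(edges)  # each target spot links to its near-duplicate
```

These cross-slide edges are what let gene-expression knowledge flow from a reference slide to the target slide during message passing.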
[138] MVRD-Bench: Multi-View Learning and Benchmarking for Dynamic Remote Photoplethysmography under Occlusion
Zuxian He, Xu Cheng, Zhaodong Sun, Haoyu Chen, Jingang Shi, Xiaobai Li, Guoying Zhao
Main category: cs.CV
TL;DR: A multi-view rPPG framework that addresses motion-induced occlusion in remote photoplethysmography using synchronized facial videos from multiple viewpoints with adaptive temporal compensation and multi-view correlation-aware attention.
Details
Motivation: Existing rPPG methods degrade under facial motion and occlusion due to reliance on static, single-view facial videos. The paper aims to solve motion-induced occlusion problems in unconstrained multi-view facial videos for more robust physiological signal estimation.
Method: Proposes MVRD-rPPG framework with: 1) Adaptive Temporal Optical Compensation (ATOC) for motion artifact suppression, 2) Rhythm-Visual Dual-Stream Network to separate rhythmic and appearance features, 3) Multi-View Correlation-Aware Attention (MVCA) for adaptive view-wise signal aggregation, and 4) Correlation Frequency Adversarial (CFA) learning for temporal accuracy, spectral consistency, and perceptual realism.
Result: Achieves MAE of 0.90 and Pearson correlation coefficient of 0.99 on the MVRD movement scenario. Extensive experiments and ablation studies demonstrate superiority over existing methods.
Conclusion: The proposed multi-view rPPG framework effectively addresses motion-induced occlusion challenges and provides robust physiological signal estimation in real-world conditions with facial movements.
Abstract: Remote photoplethysmography (rPPG) is a non-contact technique that estimates physiological signals by analyzing subtle skin color changes in facial videos. Existing rPPG methods often encounter performance degradation under facial motion and occlusion scenarios due to their reliance on static and single-view facial videos. Thus, this work focuses on tackling the motion-induced occlusion problem for rPPG measurement in unconstrained multi-view facial videos. Specifically, we introduce a Multi-View rPPG Dataset (MVRD), a high-quality benchmark dataset featuring synchronized facial videos from three viewpoints under stationary, speaking, and head movement scenarios to better match real-world conditions. We also propose MVRD-rPPG, a unified multi-view rPPG learning framework that fuses complementary visual cues to maintain robust facial skin coverage, especially under motion conditions. Our method integrates an Adaptive Temporal Optical Compensation (ATOC) module for motion artifact suppression, a Rhythm-Visual Dual-Stream Network to disentangle rhythmic and appearance-related features, and a Multi-View Correlation-Aware Attention (MVCA) for adaptive view-wise signal aggregation. Furthermore, we introduce a Correlation Frequency Adversarial (CFA) learning strategy, which jointly enforces temporal accuracy, spectral consistency, and perceptual realism in the predicted signals. Extensive experiments and ablation studies on the MVRD dataset demonstrate the superiority of our approach. In the MVRD movement scenario, MVRD-rPPG achieves an MAE of 0.90 and a Pearson correlation coefficient (R) of 0.99. The source code and dataset will be made available.
[139] MultiCam: On-the-fly Multi-Camera Pose Estimation Using Spatiotemporal Overlaps of Known Objects
Shiyu Li, Hannah Schieber, Kristoffer Waldow, Benjamin Busam, Julian Kreimeier, Daniel Roth
Main category: cs.CV
TL;DR: A marker-less multi-camera AR system that uses spatiotemporal field-of-view overlaps of known objects for dynamic camera pose estimation, outperforming state-of-the-art on standard datasets.
Details
Motivation: Traditional multi-camera AR systems rely on markers or initial calibration, requiring markers to stay in camera view. The authors aim to overcome these limitations by using scene objects instead of markers for dynamic camera pose estimation.
Method: Proposes a constant dynamic camera pose estimation method leveraging spatiotemporal FoV overlaps of known objects. Enhances a state-of-the-art object pose estimator to update a spatiotemporal scene graph, enabling relations even among non-overlapping FoV cameras. Introduces a new multi-camera, multi-object pose estimation dataset with temporal FoV overlap.
Result: Outperforms state-of-the-art on YCB-V and T-LESS datasets in camera pose accuracy for FoV overlapping scenarios. Validates effectiveness of marker-less approach for AR applications on both existing and proposed datasets.
Conclusion: The proposed marker-less approach using spatiotemporal FoV overlaps of known objects provides effective dynamic camera pose estimation for multi-camera AR systems, eliminating the need for markers and their continuous tracking.
Abstract: Multi-camera dynamic Augmented Reality (AR) applications require a camera pose estimation to leverage individual information from each camera in one common system. This can be achieved by combining contextual information, such as markers or objects, across multiple views. While commonly cameras are calibrated in an initial step or updated through the constant use of markers, another option is to leverage information already present in the scene, like known objects. Another downside of marker-based tracking is that markers have to be tracked inside the field-of-view (FoV) of the cameras. To overcome these limitations, we propose a constant dynamic camera pose estimation leveraging spatiotemporal FoV overlaps of known objects on the fly. To achieve that, we enhance the state-of-the-art object pose estimator to update our spatiotemporal scene graph, enabling a relation even among non-overlapping FoV cameras. To evaluate our approach, we introduce a multi-camera, multi-object pose estimation dataset with temporal FoV overlap, including static and dynamic cameras. Furthermore, in FoV overlapping scenarios, we outperform the state-of-the-art on the widely used YCB-V and T-LESS datasets in camera pose accuracy. Our performance on both previous and our proposed datasets validates the effectiveness of our marker-less approach for AR applications. The code and dataset are available on https://github.com/roth-hex-lab/IEEE-VR-2026-MultiCam.
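The geometric core of relating cameras through a shared known object is a standard pose-chaining identity: if the object's pose is estimated independently in two views, the camera-to-camera transform follows by composition. A toy numpy sketch under our own setup and names (MultiCam's scene-graph machinery on top of this is not shown):

```python
import numpy as np

def se3(R, t):
    """Build a 4x4 rigid transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_cam_pose(T_a_obj, T_b_obj):
    """Given the object's pose in camera A and camera B, chain them:
    p_A = T_a_obj @ inv(T_b_obj) @ p_B."""
    return T_a_obj @ np.linalg.inv(T_b_obj)

# toy setup: object rotated 90 deg about z in camera A, axis-aligned in B
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T_a_obj = se3(Rz, np.array([0.0, 0.0, 2.0]))
T_b_obj = se3(np.eye(3), np.array([1.0, 0.0, 2.0]))
T_a_b = relative_cam_pose(T_a_obj, T_b_obj)

# a point expressed in camera B must map consistently through the object frame
p_obj = np.array([0.5, 0.2, 0.0, 1.0])
assert np.allclose(T_a_obj @ p_obj, T_a_b @ (T_b_obj @ p_obj))
```

This is also why spatiotemporal overlap helps: two cameras that never share an object directly can still be related by chaining such transforms through intermediate views over time.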
[140] URA-Net: Uncertainty-Integrated Anomaly Perception and Restoration Attention Network for Unsupervised Anomaly Detection
Wei Luo, Peng Xing, Yunkang Cao, Haiming Yao, Weiming Shen, Zechao Li
Main category: cs.CV
TL;DR: URA-Net: An uncertainty-integrated anomaly detection network that restores abnormal patterns to normality using restoration attention, outperforming traditional reconstruction methods on industrial and medical datasets.
Details
Motivation: Traditional unsupervised anomaly detection methods using reconstruction frameworks suffer from over-generalization, reconstructing anomalies well and leading to poor detection performance. The authors aim to explicitly restore abnormal patterns to normality rather than just focusing on normality reconstruction.
Method: Proposes URA-Net with four key components: 1) Uses pre-trained CNN for multi-level semantic feature extraction as reconstruction target, 2) Feature-level artificial anomaly synthesis module to generate training anomalies, 3) Uncertainty-integrated anomaly perception module based on Bayesian neural networks to learn anomaly/normal distributions, 4) Restoration attention mechanism using global normal semantics to restore anomalous regions.
Result: Comprehensive experiments on industrial datasets (MVTec AD, BTAD) and medical dataset (OCT-2017) demonstrate effectiveness and superiority over existing methods in anomaly detection and localization.
Conclusion: URA-Net successfully addresses over-generalization in anomaly detection by explicitly restoring abnormal patterns to normality through uncertainty-integrated perception and restoration attention, achieving state-of-the-art performance across industrial and medical domains.
Abstract: Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on the reconstruction framework. However, these methods may suffer from over-generalization, enabling them to reconstruct anomalies well, which leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To assist the URA-Net learning to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module to generate anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between input features and restored features for anomaly detection and localization. The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, unequivocally demonstrate the effectiveness and superiority of the proposed method.
[141] UAV-DETR: DETR for Anti-Drone Target Detection
Jun Yang, Dong Wang, Hongxu Yin, Hongpeng Li, Jianxiong Yu
Main category: cs.CV
TL;DR: UAV-DETR is a novel drone detection framework that balances feature representation and computational efficiency for detecting small drones in complex environments, achieving state-of-the-art performance with reduced parameters.
Details
Motivation: Existing deep learning methods struggle to balance robust feature representation with computational efficiency for detecting miniature drones against complex backgrounds under severe environmental interference.
Method: UAV-DETR integrates a small-target-friendly architecture with real-time detection capabilities, featuring a WTConv-enhanced backbone, Sliding Window Self-Attention encoder, Efficient Cross-Scale Feature Recalibration and Fusion Network, and hybrid Inner-CIoU and NWD loss strategy.
Result: Outperforms baseline RT-DETR on custom UAV dataset (+6.61% mAP50:95 with 39.8% parameter reduction) and DUT-ANTI-UAV benchmark (+1.4% Precision, +1.0% F1-Score).
Conclusion: UAV-DETR establishes a superior trade-off between efficiency and precision in counter-UAV object detection, making it effective for security and counter-UAV applications.
Abstract: Drone detection is pivotal in numerous security and counter-UAV applications. However, existing deep learning-based methods typically struggle to balance robust feature representation with computational efficiency. This challenge is particularly acute when detecting miniature drones against complex backgrounds under severe environmental interference. To address these issues, we introduce UAV-DETR, a novel framework that integrates a small-target-friendly architecture with real-time detection capabilities. Specifically, UAV-DETR features a WTConv-enhanced backbone and a Sliding Window Self-Attention (SWSA-IFI) encoder, capturing the high-frequency structural details of tiny targets while drastically reducing parameter overhead. Furthermore, we propose an Efficient Cross-Scale Feature Recalibration and Fusion Network (ECFRFN) to suppress background noise and aggregate multi-scale semantics. To further enhance accuracy, UAV-DETR incorporates a hybrid Inner-CIoU and NWD loss strategy, mitigating the extreme sensitivity of standard IoU metrics to minor positional deviations in small objects. Extensive experiments demonstrate that UAV-DETR significantly outperforms the baseline RT-DETR on our custom UAV dataset (+6.61% in mAP50:95, with a 39.8% reduction in parameters) and the public DUT-ANTI-UAV benchmark (+1.4% in Precision, +1.0% in F1-Score). These results establish UAV-DETR as a superior trade-off between efficiency and precision in counter-UAV object detection. The code is available at https://github.com/wd-sir/UAVDETR.
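The NWD term in the hybrid loss refers to the Normalized Wasserstein Distance from prior tiny-object-detection work, which stays smooth where IoU collapses to zero for small, barely-offset boxes. A sketch of that standard formulation; the constant C is dataset-dependent (the value below is a placeholder), and UAV-DETR's exact weighting against Inner-CIoU is not given in the abstract:

```python
import math

def nwd(box1, box2, C=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes.
    Each box is modeled as a 2D Gaussian; the squared 2-Wasserstein
    distance between such Gaussians has a closed form."""
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2
    w2sq = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
            + (w1 / 2 - w2 / 2) ** 2 + (h1 / 2 - h2 / 2) ** 2)
    return math.exp(-math.sqrt(w2sq) / C)

# a 4x4 box shifted by 4 px: IoU is already 0, but NWD degrades smoothly
a = (10.0, 10.0, 4.0, 4.0)
b = (14.0, 10.0, 4.0, 4.0)     # touches a but does not overlap it
print(round(nwd(a, a), 3))     # identical boxes -> 1.0
print(nwd(a, b) > 0.0)         # still a usable, non-zero similarity signal
```

This illustrates the sensitivity issue the abstract mentions: for tiny objects, a few pixels of deviation zeroes out IoU-based losses, while the exponential form keeps a gradient.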
[142] Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu, Qibin Hou, Ming-Ming Cheng
Main category: cs.CV
TL;DR: PEPO improves multimodal reasoning by analyzing token-level dynamics and optimizing perception-exploration balance through a gating mechanism integrated with existing RL frameworks.
Details
Motivation: Existing RLVR methods treat multimodal CoT reasoning uniformly without distinguishing varying degrees of visual grounding, failing to optimize at appropriate granularity for effective perception-exploration balance.
Method: Token-level analysis of reasoning trajectories reveals structured token dynamics. PEPO derives perception prior from hidden state similarity and integrates it with token entropy through smooth gating to produce token-level advantages, compatible with existing RLVR frameworks like GRPO and DAPO.
Result: Consistent improvements across diverse multimodal benchmarks including geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, with stable training dynamics.
Conclusion: PEPO effectively balances perceptual grounding and exploratory inference at token level, enhancing multimodal reasoning without additional supervision or architectural changes.
Abstract: Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
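The abstract describes blending a perception prior with token entropy through a smooth gate to obtain token-level advantages. The sketch below is our guess at the shape of such a mechanism, using a sigmoid gate over hidden-state/visual-context similarity; PEPO's actual gating function, prior, and normalization are not specified here:

```python
import numpy as np

def token_advantages(seq_adv, hidden, visual_ctx, entropy, tau=1.0):
    """Modulate a sequence-level advantage per token (illustrative):
    visually grounded tokens follow a perception prior, the rest
    follow token entropy, blended by a smooth sigmoid gate."""
    h = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    v = visual_ctx / np.linalg.norm(visual_ctx)
    perception = h @ v                                # grounding prior per token
    gate = 1.0 / (1.0 + np.exp(-perception / tau))    # smooth gate in (0, 1)
    score = gate * perception + (1.0 - gate) * entropy
    return seq_adv * score

rng = np.random.default_rng(2)
hidden = rng.normal(size=(6, 8))       # 6 tokens, dim-8 hidden states
visual = rng.normal(size=8)            # pooled visual context
entropy = rng.uniform(0.0, 1.0, size=6)
adv = token_advantages(1.0, hidden, visual, entropy)
print(adv.shape)  # (6,)
```

Because the result is just a per-token reweighting of the group advantage, it slots into GRPO- or DAPO-style objectives without extra supervision, matching the compatibility claim in the abstract.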
[143] UniQueR: Unified Query-based Feedforward 3D Reconstruction
Chensheng Peng, Quentin Herau, Jiezhi Yang, Yichen Xie, Yihan Hu, Wenzhao Zheng, Matthew Strong, Masayoshi Tomizuka, Wei Zhan
Main category: cs.CV
TL;DR: UniQueR is a unified query-based feedforward framework for efficient 3D reconstruction from unposed images using sparse 3D queries instead of dense 2.5D representations.
Details
Motivation: Existing feedforward 3D reconstruction methods like DUSt3R, VGGT, and AnySplat produce 2.5D representations limited to visible surfaces and cannot infer occluded geometry. There's a need for more efficient and accurate reconstruction that can handle complete scene structure including occluded regions.
Method: Formulates reconstruction as sparse 3D query inference problem. Learns compact set of 3D anchor points as explicit geometric queries that encode spatial and appearance priors in global 3D space. Each query spawns 3D Gaussians for differentiable rendering. Uses unified query interactions across multi-view features with decoupled cross-attention design for efficiency.
Result: Outperforms state-of-the-art feedforward methods on Mip-NeRF 360 and VR-NeRF datasets in both rendering quality and geometric accuracy. Uses order of magnitude fewer primitives than dense alternatives while achieving better results.
Conclusion: UniQueR demonstrates that sparse 3D query-based formulation enables more efficient and accurate 3D reconstruction from unposed images, overcoming limitations of 2.5D representations and achieving strong geometric expressiveness with reduced computational cost.
Abstract: We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inference problem. Our model learns a compact set of 3D anchor points that act as explicit geometric queries, enabling the network to infer scene structure, including geometry in occluded regions, in a single forward pass. Each query encodes spatial and appearance priors directly in global 3D space (instead of per-frame camera space) and spawns a set of 3D Gaussians for differentiable rendering. By leveraging unified query interactions across multi-view features and a decoupled cross-attention design, UniQueR achieves strong geometric expressiveness while substantially reducing memory and computational cost. Experiments on Mip-NeRF 360 and VR-NeRF demonstrate that UniQueR surpasses state-of-the-art feedforward methods in both rendering quality and geometric accuracy, using an order of magnitude fewer primitives than dense alternatives.
[144] Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
Chengxin Lv, Yihui Li, Hongyu Yang, YunHong Wang
Main category: cs.CV
TL;DR: Gau-Occ: A multi-modal 3D semantic occupancy prediction framework using compact 3D Gaussians instead of dense voxels, with LiDAR completion and efficient image fusion for autonomous driving.
Details
Motivation: Current multi-modal fusion methods for 3D semantic occupancy prediction rely on computationally expensive dense voxel or BEV tensors, limiting efficiency. The authors aim to develop a more efficient framework that maintains accuracy while reducing computational burden.
Method: Gau-Occ models scenes as compact collections of semantic 3D Gaussians rather than dense volumetric representations. It uses a LiDAR Completion Diffuser (LCD) to recover missing structures from sparse LiDAR for robust Gaussian anchor initialization, and Gaussian Anchor Fusion (GAF) to efficiently integrate multi-view image semantics through geometry-aligned 2D sampling and cross-modal alignment.
Result: Extensive experiments across challenging benchmarks demonstrate state-of-the-art performance with significant computational efficiency improvements over previous methods.
Conclusion: Gau-Occ provides an effective and efficient multi-modal framework for 3D semantic occupancy prediction that bypasses dense volumetric processing while maintaining high accuracy through compact Gaussian representations and efficient fusion techniques.
Abstract: 3D semantic occupancy prediction is crucial for autonomous driving. While multi-modal fusion improves accuracy over vision-only methods, it typically relies on computationally expensive dense voxel or BEV tensors. We present Gau-Occ, a multi-modal framework that bypasses dense volumetric processing by modeling the scene as a compact collection of semantic 3D Gaussians. To ensure geometric completeness, we propose a LiDAR Completion Diffuser (LCD) that recovers missing structures from sparse LiDAR to initialize robust Gaussian anchors. Furthermore, we introduce Gaussian Anchor Fusion (GAF), which efficiently integrates multi-view image semantics via geometry-aligned 2D sampling and cross-modal alignment. By refining these compact Gaussian descriptors, Gau-Occ captures both spatial consistency and semantic discriminability. Extensive experiments across challenging benchmarks demonstrate that Gau-Occ achieves state-of-the-art performance with significant computational efficiency.
[145] EVA: Efficient Reinforcement Learning for End-to-End Video Agent
Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu
Main category: cs.CV
TL;DR: EVA is an efficient RL framework for video agents that uses planning-before-perception reasoning to adaptively decide what/when/how to watch videos, achieving query-driven understanding with 6-12% improvement over general MLLMs.
Details
Motivation: Video understanding with MLLMs is challenging due to long token sequences with temporal dependencies and redundant frames. Existing approaches treat MLLMs as passive recognizers or use manually designed workflows, leading to inefficiency on long videos.
Method: EVA uses iterative summary-plan-action-reflection reasoning for planning-before-perception. It employs a three-stage learning pipeline: supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) to bridge supervised imitation and reinforcement learning.
Result: EVA achieves 6-12% improvement over general MLLM baselines and 1-3% gain over prior adaptive agent methods on six video understanding benchmarks, demonstrating comprehensive capabilities.
Conclusion: EVA enables efficient, query-driven video understanding through autonomous decision-making about what/when/how to watch, with a stable training pipeline that bridges supervised and reinforcement learning approaches.
Abstract: Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline that bridges supervised imitation and reinforcement learning: supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO). We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.
[146] A Feature Shuffling and Restoration Strategy for Universal Unsupervised Anomaly Detection
Wei Luo, Haiming Yao, Zhenfeng Qiang, Xiaotian Zhang, Weihang Zhang
Main category: cs.CV
TL;DR: FSR framework for universal anomaly detection using feature shuffling and restoration to address identical shortcut issues across different settings
Details
Motivation: Existing reconstruction-based anomaly detection methods suffer from identical shortcut issues, where both normal and anomalous regions can be well reconstructed, limiting their transferability across different scenarios. A universal model applicable to various anomaly detection settings is needed.
Method: Feature Shuffling and Restoration (FSR) framework: uses multi-scale features as reconstruction targets instead of raw pixels, partitions features into non-overlapping blocks, randomly shuffles them, then restores the original state using a restoration network. Introduces a shuffling rate to regulate task complexity.
Result: Extensive experiments validate superiority and efficiency of FSR framework across different settings. The method alleviates identical shortcut issues and improves transferability between scenarios.
Conclusion: FSR provides a novel, straightforward yet efficient framework for universal anomaly detection that focuses on global contextual information and addresses transferability limitations of existing methods.
Abstract: Unsupervised anomaly detection is vital in industrial fields, with reconstruction-based methods favored for their simplicity and effectiveness. However, reconstruction methods often encounter an identical shortcut issue, where both normal and anomalous regions can be well reconstructed and fail to identify outliers. The severity of this problem increases with the complexity of the normal data distribution. Consequently, existing methods may exhibit excellent detection performance in a specific scenario, but their performance sharply declines when transferred to another scenario. This paper focuses on establishing a universal model applicable to anomaly detection tasks across different settings, termed as universal anomaly detection. In this work, we introduce a novel, straightforward yet efficient framework for universal anomaly detection: Feature Shuffling and Restoration (FSR), which can alleviate the identical shortcut issue across different settings. First and foremost, FSR employs multi-scale features with rich semantic information as reconstruction targets, rather than raw image pixels. Subsequently, these multi-scale features are partitioned into non-overlapping feature blocks, which are randomly shuffled and then restored to their original state using a restoration network. This simple paradigm encourages the model to focus more on global contextual information. Additionally, we introduce a novel concept, the shuffling rate, to regulate the complexity of the FSR task, thereby alleviating the identical shortcut across different settings. Furthermore, we provide theoretical explanations for the effectiveness of FSR framework from two perspectives: network structure and mutual information. Extensive experimental results validate the superiority and efficiency of the FSR framework across different settings. Code is available at https://github.com/luow23/FSR.
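The shuffle-then-restore pretext task is easy to sketch. Below is a minimal NumPy illustration of the shuffling step alone (the restoration network is omitted); the function name, block size, and permutation scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def shuffle_feature_blocks(feat, block=2, shuffle_rate=0.5, rng=None):
    """Partition a (C, H, W) feature map into non-overlapping block x block
    patches and randomly permute a fraction of them given by the shuffling
    rate. A restoration network (not shown) would be trained to undo this."""
    rng = np.random.default_rng(rng)
    C, H, W = feat.shape
    gh, gw = H // block, W // block
    # Flatten into a list of blocks: (gh*gw, C, block, block)
    blocks = (feat.reshape(C, gh, block, gw, block)
                  .transpose(1, 3, 0, 2, 4)
                  .reshape(gh * gw, C, block, block))
    n = blocks.shape[0]
    k = int(round(shuffle_rate * n))            # how many blocks to shuffle
    idx = rng.choice(n, size=k, replace=False)  # which blocks participate
    perm = rng.permutation(idx)                 # where their contents go
    out = blocks.copy()
    out[idx] = blocks[perm]
    # Reassemble back into a (C, H, W) map
    return (out.reshape(gh, gw, C, block, block)
               .transpose(2, 0, 3, 1, 4)
               .reshape(C, H, W))
```

A shuffling rate of 0 leaves the features untouched; a rate of 1 permutes every block, which is the hardest version of the restoration task.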
[147] Designing to Forget: Deep Semi-parametric Models for Unlearning
Amber Yijia Zheng, Yu-Shan Tai, Raymond A. Yeh
Main category: cs.CV
TL;DR: SPMs are semi-parametric models designed for efficient machine unlearning that maintain competitive performance while enabling explicit test-time deletion of training samples without parameter updates.
Details
Motivation: Current machine unlearning research focuses on algorithms to remove training samples from trained models, but not all models are equally easy to unlearn. The authors aim to create models that are inherently easier to unlearn while maintaining performance.
Method: Introduces deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module to aggregate information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters.
Result: SPMs achieve competitive task performance to parametric models in image classification and generation, while being significantly more efficient for unlearning. On ImageNet classification, SPMs reduce the prediction gap relative to retrained baseline by 11% and achieve over 10× faster unlearning compared to existing approaches.
Conclusion: SPMs provide an effective architecture for machine unlearning that balances performance with unlearning efficiency, offering a practical solution for applications requiring data deletion compliance.
Abstract: Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve competitive task performance to parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by 11% and achieve over 10× faster unlearning compared to existing approaches on parametric models. The code is available at https://github.com/amberyzheng/spm_unlearning.
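The key property (deletion as a memory edit rather than a parameter update) can be illustrated with a toy semi-parametric head. The class name and the similarity-softmax fusion below are assumptions for illustration, not the paper's fusion module.

```python
import numpy as np

class SemiParametricHead:
    """Toy semi-parametric prediction head: a memory of per-training-sample
    features and labels, fused by softmax similarity at test time.
    Unlearning a sample is a row deletion -- no parameters change."""

    def __init__(self, feats, labels, tau=0.1):
        self.feats = np.asarray(feats, float)    # (N, D) training features
        self.labels = np.asarray(labels, float)  # (N, C) one-hot labels
        self.tau = tau                           # softmax temperature

    def predict(self, x):
        sims = self.feats @ np.asarray(x, float)   # (N,) similarities
        w = np.exp((sims - sims.max()) / self.tau)
        w /= w.sum()
        return w @ self.labels                     # (C,) fused prediction

    def unlearn(self, i):
        # Explicit test-time deletion of training sample i.
        self.feats = np.delete(self.feats, i, axis=0)
        self.labels = np.delete(self.labels, i, axis=0)
```

Because the sample's influence lives entirely in its memory row, deleting that row removes it exactly, which is the sense in which the model behaves non-parametrically during unlearning.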
[148] YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception
Marios Impraimakis, Daniel Vazquez, Feiyu Zhou
Main category: cs.CV
TL;DR: A novel interpretable object detection framework using Kolmogorov-Arnold networks as post-hoc surrogates to model trustworthiness of YOLOv10 detections, with multimodal BLIP captioning for scene understanding.
Details
Motivation: Addresses the limited transparency of object detection confidence scores for autonomous vehicles, especially in visually degraded or ambiguous scenes where reliability assessment is crucial for safety.
Method: Uses a Kolmogorov-Arnold network as an interpretable post-hoc surrogate to model the trustworthiness of YOLOv10 detections from seven geometric/semantic features. The additive spline-based structure enables feature-influence visualization. Combines with the bootstrapped language-image (BLIP) foundation model for descriptive scene captions.
Result: Framework accurately identifies low-trust predictions under blur, occlusion, or low texture conditions on COCO and University of Bath campus datasets. Provides actionable insights for filtering/review and enables lightweight multimodal interface without compromising interpretability.
Conclusion: System delivers interpretable object detection with trustworthy confidence estimates, offering transparent and practical perception component for autonomous and multimodal AI applications.
Abstract: The interpretable object detection capabilities of a novel Kolmogorov-Arnold network framework are examined here. The approach addresses a key limitation in computer vision for autonomous vehicle perception, and beyond. These systems offer limited transparency regarding the reliability of their confidence scores in visually degraded or ambiguous scenes. To address this limitation, a Kolmogorov-Arnold network is employed as an interpretable post-hoc surrogate to model the trustworthiness of the You Only Look Once (YOLOv10) detections using seven geometric and semantic features. The additive spline-based structure of the Kolmogorov-Arnold network enables direct visualisation of each feature’s influence. This produces smooth and transparent functional mappings that reveal when the model’s confidence is well supported and when it is unreliable. Experiments on both Common Objects in Context (COCO), and images from the University of Bath campus demonstrate that the framework accurately identifies low-trust predictions under blur, occlusion, or low texture. This provides actionable insights for filtering, review, or downstream risk mitigation. Furthermore, a bootstrapped language-image (BLIP) foundation model generates descriptive captions of each scene. This tool enables a lightweight multimodal interface without affecting the interpretability layer. The resulting system delivers interpretable object detection with trustworthy confidence estimates, offering a transparent and practical perception component for autonomous and multimodal artificial intelligence applications.
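The interpretability claim rests on the additive structure: the trust score is a sum of one-dimensional per-feature functions, each of which can be plotted on its own. A minimal sketch, using piecewise-linear stand-ins for the learned splines (the function name and knot/value format are illustrative assumptions):

```python
import numpy as np

def additive_trust_score(features, knots, values):
    """Additive surrogate in the spirit of a Kolmogorov-Arnold network:
    trust = sum_i f_i(x_i), where each f_i is a 1-D function (here a
    piecewise-linear stand-in for a learned spline). Returning the
    per-feature contributions is what makes the surrogate inspectable."""
    contribs = [float(np.interp(x, k, v))
                for x, k, v in zip(features, knots, values)]
    return sum(contribs), contribs
```

In a real system the seven geometric/semantic detection features would each get their own fitted spline; plotting each `f_i` over its input range shows exactly when confidence is well supported.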
[149] ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance
Hyojin Park, Yi Li, Janghoon Cho, Sungha Choi, Jungsoo Lee, Taotao Jing, Shuai Zhang, Munawar Hayat, Dashan Gao, Ning Bi, Fatih Porikli
Main category: cs.CV
TL;DR: ForeSeaQA is a new benchmark for video QA with image-and-text queries and timestamped annotations, and ForeSea is a 3-stage AI forensic search system that improves accuracy and temporal localization over prior methods.
Details
Motivation: Existing surveillance systems struggle with finding specific targets across long, multi-camera video. Prior methods require heavy manual filtering, capture only shallow attributes, fail at temporal reasoning, and lack proper benchmarks for evaluating multimodal queries in video QA.
Method: 1) Introduces the ForeSeaQA benchmark with long-horizon surveillance footage and diverse multimodal questions. 2) Proposes the ForeSea system with a 3-stage pipeline: a tracking module filters irrelevant footage, a multimodal embedding module indexes clips, and during inference the system retrieves top-K candidate clips for a VideoLLM to answer queries and localize events.
Result: On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models. The benchmark supports complex multimodal queries with precise temporal grounding.
Conclusion: ForeSeaQA is the first benchmark for complex multimodal queries with temporal grounding, and ForeSea is the first VideoRAG system designed to excel in this setting, addressing key limitations in forensic video analysis.
Abstract: Despite decades of work, surveillance still struggles to find specific targets across long, multi-camera video. Prior methods (tracking pipelines, CLIP-based models, and VideoRAG) require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., “When does this person join the fight?” with the person’s image), yet this setting remains underexplored. There are also no proper benchmarks for evaluating this setting of querying video with multimodal inputs. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning in realistic forensic conditions. Not limited to this benchmark, we propose ForeSea, an AI forensic search system with a 3-stage, plug-and-play pipeline. (1) A tracking module filters irrelevant footage; (2) a multimodal embedding module indexes the remaining clips; and (3) during inference, the system retrieves top-K candidate clips for a Video Large Language Model (VideoLLM) to answer queries and localize events. On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models. To our knowledge, ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding, and ForeSea is the first VideoRAG system built to excel in this setting.
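Stages (2)-(3) amount to indexed retrieval: score the clip embeddings against a fused image-and-text query and hand the top-K clips to the VideoLLM. A hedged sketch with cosine similarity; the function name and the `alpha` modality weight are illustrative assumptions, not the paper's fusion rule.

```python
import numpy as np

def retrieve_top_k(clip_embs, text_q, image_q, k=3, alpha=0.5):
    """Score indexed clip embeddings against a fused image+text query and
    return indices of the top-K candidate clips (which a VideoLLM would
    then read to answer the query and localize the event)."""
    def unit(v):
        v = np.asarray(v, float)
        return v / np.linalg.norm(v)

    q = alpha * unit(text_q) + (1 - alpha) * unit(image_q)  # fused query
    E = np.asarray(clip_embs, float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)        # unit rows
    scores = E @ q                                          # cosine scores
    return np.argsort(-scores)[:k].tolist()
```

The plug-and-play point is that any embedding model and any VideoLLM can sit on either side of this retrieval step.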
[150] Template-Based Feature Aggregation Network for Industrial Anomaly Detection
Wei Luo, Haiming Yao, Wenyong Yu
Main category: cs.CV
TL;DR: TFA-Net is a template-based feature aggregation network for industrial anomaly detection that aggregates input features onto normal template features to filter out anomalies, achieving state-of-the-art performance with real-time capability.
Details
Motivation: Existing feature-reconstruction methods for industrial anomaly detection suffer from shortcut learning issues where anomalous features can be undesirably reconstructed. The authors aim to develop a more effective approach that properly handles anomalous features while maintaining real-time performance for industrial applications.
Method: TFA-Net extracts hierarchical features from both template (normal) and input images using a pre-trained CNN. Instead of direct reconstruction, it aggregates input features onto template features based on similarity, filtering out anomalous features. The aggregated template features then refine feature details to obtain reconstructed features. A random masking strategy enhances inspection performance. Anomalies are located by comparing input and reconstructed features.
Result: TFA-Net achieves state-of-the-art detection performance on various real-world industrial datasets while meeting real-time requirements for industrial applications. The method effectively addresses shortcut learning issues in feature-reconstruction approaches.
Conclusion: The template-based feature aggregation approach provides a nontrivial and meaningful feature reconstruction task that effectively filters anomalous features. TFA-Net is simple yet efficient, suitable for practical industrial applications with both high performance and real-time capability.
Abstract: Industrial anomaly detection plays a crucial role in ensuring product quality control. Therefore, proposing an effective anomaly detection model is of great significance. While existing feature-reconstruction methods have demonstrated excellent performance, they face challenges with shortcut learning, which can lead to undesirable reconstruction of anomalous features. To address this concern, we present a novel feature-reconstruction model called the Template-based Feature Aggregation Network (TFA-Net) for anomaly detection via template-based feature aggregation. Specifically, TFA-Net first extracts multiple hierarchical features from a pre-trained convolutional neural network for a fixed template image and an input image. Instead of directly reconstructing input features, TFA-Net aggregates them onto the template features, effectively filtering out anomalous features that exhibit low similarity to normal template features. Next, TFA-Net utilizes the template features that have already fused normal features in the input features to refine feature details and obtain the reconstructed feature map. Finally, the defective regions can be located by comparing the differences between the input and reconstructed features. Additionally, a random masking strategy for input features is employed to enhance the overall inspection performance of the model. Our template-based feature aggregation schema yields a nontrivial and meaningful feature reconstruction task. The simple, yet efficient, TFA-Net exhibits state-of-the-art detection performance on various real-world industrial datasets. Additionally, it fulfills the real-time demands of industrial scenarios, rendering it highly suitable for practical applications in the industry. Code is available at https://github.com/luow23/TFA-Net.
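The aggregation idea can be pictured as soft nearest-template reconstruction: each input feature is rebuilt from normal template features, so features unlike anything normal reconstruct poorly and show up in the input/reconstruction difference. The softmax-attention form below is an illustrative stand-in for TFA-Net's aggregation module, not the paper's exact architecture.

```python
import numpy as np

def reconstruct_from_template(inputs, template, tau=0.1):
    """Rebuild each input feature as a similarity-weighted mixture of normal
    template features. inputs: (N, D), template: (T, D). Returns the
    reconstruction and a per-location anomaly score (reconstruction error);
    anomalous features, being dissimilar to every template feature, get
    spread-out weights and a large error."""
    x = inputs / np.linalg.norm(inputs, axis=1, keepdims=True)
    t = template / np.linalg.norm(template, axis=1, keepdims=True)
    attn = np.exp((x @ t.T) / tau)            # (N, T) similarity weights
    attn /= attn.sum(axis=1, keepdims=True)
    recon = attn @ template                   # (N, D) reconstructed features
    score = np.linalg.norm(inputs - recon, axis=1)
    return recon, score
```

Thresholding `score` over spatial locations then yields the defect map described in the abstract.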
[151] Group Editing: Edit Multiple Images in One Go
Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, Hongyu Liu, Qifeng Chen
Main category: cs.CV
TL;DR: GroupEditing: A framework for consistent image editing across groups of related images with varying poses/viewpoints by combining explicit geometric correspondences (VGGT) with implicit temporal coherence from video models.
Details
Motivation: Editing groups of related images consistently is challenging due to variations in pose, viewpoint, and spatial layout. Existing methods struggle to maintain semantic alignment and visual coherence across multiple images.
Method: 1) Extract explicit geometric correspondences using VGGT for spatial alignment; 2) Reformulate image groups as pseudo-videos to leverage temporal coherence priors from pre-trained video models; 3) Fuse explicit and implicit correspondences via novel fusion mechanism; 4) Use GroupEditData dataset for training; 5) Introduce alignment-enhanced RoPE module for identity preservation.
Result: GroupEditing significantly outperforms existing methods in visual quality, cross-view consistency, and semantic alignment. The framework demonstrates superior performance on the proposed GroupEditBench benchmark.
Conclusion: The proposed GroupEditing framework effectively addresses group-level image editing by combining explicit geometric and implicit temporal correspondences, achieving state-of-the-art performance in maintaining consistency across diverse image groups.
Abstract: In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two types of correspondences, we inject the explicit geometric cues from VGGT into the video model through a novel fusion mechanism. To support large-scale training, we construct GroupEditData, a new dataset containing high-quality masks and detailed captions for numerous image groups. Furthermore, to ensure identity preservation during editing, we introduce an alignment-enhanced RoPE module, which improves the model’s ability to maintain consistent appearance across multiple images. Finally, we present GroupEditBench, a dedicated benchmark designed to evaluate the effectiveness of group-level image editing. Extensive experiments demonstrate that GroupEditing significantly outperforms existing methods in terms of visual quality, cross-view consistency, and semantic alignment.
[152] SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
Zhicheng Qiu, Jiarui Meng, Tong-an Luo, Yican Huang, Xuan Feng, Xuanfu Li, ZHan Xu
Main category: cs.CV
TL;DR: SLARM is a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference through higher-order motion modeling and language-aligned semantic features.
Details
Motivation: The paper aims to address the challenge of unifying dynamic scene reconstruction with semantic understanding in real-time streaming scenarios, overcoming limitations of existing methods that treat these tasks separately or lack language-aligned representations.
Method: SLARM uses higher-order motion modeling to capture complex non-uniform motion, trained solely on differentiable renderings without flow supervision. It distills semantic features from LSeg for language-aligned representations and processes image sequences with window-based causal attention for low-latency streaming inference.
Result: SLARM achieves state-of-the-art results: 21% improvement in motion accuracy, 1.6 dB improvement in reconstruction PSNR, and 20% improvement in segmentation mIoU over existing methods.
Conclusion: SLARM demonstrates that unified modeling of dynamic reconstruction, semantics, and streaming inference in a single feed-forward framework enables superior performance across multiple tasks while supporting natural language querying and real-time applications.
Abstract: We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
[153] Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, Miao Liu
Main category: cs.CV
TL;DR: TRACE is a prompting method that enables MLLMs to generate text-based spatial representations from egocentric videos for improved 3D spatial reasoning.
Details
Motivation: Existing MLLMs struggle with 3D spatial reasoning because they fail to construct structured abstractions of 3D environments from video inputs, limiting their ability to answer spatial questions accurately.
Method: TRACE (Textual Representation of Allocentric Context from Egocentric Video) is a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces. It encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos.
Result: Extensive experiments on VSI-Bench and OST-Bench show TRACE yields notable and consistent improvements over prior prompting strategies across diverse MLLM backbones of different parameter scales and training schemas.
Conclusion: TRACE effectively bridges the gap in 3D spatial reasoning for MLLMs by enabling them to construct structured text-based representations from video inputs, leading to more accurate spatial question answering.
Abstract: Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.
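Since TRACE is a prompting method, its core is string assembly: serialize the meta-context, camera trajectory, and object entities as text that precedes the spatial question. A minimal sketch; the section names and field formats below are illustrative, not the paper's exact template.

```python
def build_trace_prompt(meta, trajectory, objects, question):
    """Assemble a TRACE-style textual scene representation to serve as an
    intermediate reasoning trace before the spatial question. `trajectory`
    is a list of per-timestep camera pose descriptions; `objects` is a list
    of (name, position) pairs. All formats here are illustrative."""
    lines = ["[Meta-context] " + meta, "[Camera trajectory]"]
    lines += ["  t=%d: %s" % (t, pose) for t, pose in enumerate(trajectory)]
    lines.append("[Objects]")
    lines += ["  %s at %s" % (name, pos) for name, pos in objects]
    lines.append("[Question] " + question)
    return "\n".join(lines)
```

The MLLM is prompted to produce (or consume) such a trace first, so that the final answer is grounded in an allocentric text representation rather than raw frames.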
[154] Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation
Zhe Zhang, Jing Li, Wanli Xue, Xu Cheng, Jianhua Zhang, Qinghua Hu, Shengyong Chen
Main category: cs.CV
TL;DR: DDSR: Dual teacher distillation with subnetwork rectification for black box domain adaptation using source model predictions and vision-language model semantic priors
Details
Motivation: Black box domain adaptation is highly practical but challenging when source data and model are inaccessible, with existing methods suffering from noisy supervision or insufficient use of vision-language model semantic priors.
Method: Proposes DDSR, which jointly exploits black box source model knowledge and vision-language model semantic information, adaptively integrates their predictions, uses subnetwork regularization to mitigate overfitting, and employs iterative refinement with classwise prototypes.
Result: Extensive experiments on multiple benchmark datasets show consistent improvements over state-of-the-art methods, including those using source data or models
Conclusion: DDSR effectively addresses black box domain adaptation by leveraging complementary knowledge from black box models and vision-language models through dual teacher distillation and subnetwork rectification
Abstract: Assuming that neither source data nor the source model is accessible, black box domain adaptation represents a highly practical yet extremely challenging setting, as transferable information is restricted to the predictions of the black box source model, which can only be queried using target samples. Existing approaches attempt to extract transferable knowledge through pseudo label refinement or by leveraging external vision language models (ViLs), but they often suffer from noisy supervision or insufficient utilization of the semantic priors provided by ViLs, which ultimately hinder adaptation performance. To overcome these limitations, we propose a dual teacher distillation with subnetwork rectification (DDSR) model that jointly exploits the specific knowledge embedded in black box source models and the general semantic information of a ViL. DDSR adaptively integrates their complementary predictions to generate reliable pseudo labels for the target domain and introduces a subnetwork driven regularization strategy to mitigate overfitting caused by noisy supervision. Furthermore, the refined target predictions iteratively enhance both the pseudo labels and ViL prompts, enabling more accurate and semantically consistent adaptation. Finally, the target model is further optimized through self training with classwise prototypes. Extensive experiments on multiple benchmark datasets validate the effectiveness of our approach, demonstrating consistent improvements over state of the art methods, including those using source data or models.
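The adaptive integration of the two teachers can be sketched as confidence-weighted averaging of their class probabilities. The entropy-based confidence proxy below is an assumption for illustration, not the paper's actual weighting scheme.

```python
import numpy as np

def fuse_teachers(p_src, p_vil):
    """Fuse class probabilities from two teachers (a black box source model
    and a vision-language model) by weighting each with a simple confidence
    proxy (1 minus normalized entropy), then take the argmax as the target
    pseudo label."""
    def conf(p):
        p = np.clip(np.asarray(p, float), 1e-12, 1.0)
        return 1.0 - (-(p * np.log(p)).sum()) / np.log(len(p))

    ws, wv = conf(p_src), conf(p_vil)
    fused = (ws * np.asarray(p_src, float) +
             wv * np.asarray(p_vil, float)) / (ws + wv)
    return fused, int(np.argmax(fused))
```

When the teachers disagree, the fused pseudo label follows whichever teacher is more peaked, which is the intuition behind letting complementary predictions correct each other.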
[155] SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo
Main category: cs.CV
TL;DR: SpecEyes accelerates agentic multimodal LLMs by using a lightweight MLLM as speculative planner to predict execution trajectories, enabling early termination of expensive tool chains while maintaining accuracy.
Details
Motivation: Agentic MLLMs achieve strong reasoning through iterative visual tool invocation, but the cascaded perception-reasoning-tool loops create significant sequential overhead (agentic depth), causing prohibitive latency and limiting system concurrency.
Method: Uses a lightweight, tool-free MLLM as a speculative planner to predict execution trajectories; introduces cognitive gating based on answer separability for confidence self-verification; designs a heterogeneous parallel funnel to mask the large model’s stateful execution with the small model’s stateless concurrency.
Result: Achieves 1.1-3.35x speedup over agentic baseline while preserving or improving accuracy (up to +6.7%), boosting serving throughput under concurrent workloads on V* Bench, HR-Bench, and POPE benchmarks.
Conclusion: SpecEyes effectively breaks the sequential bottleneck of agentic MLLMs through speculative planning and parallel execution, significantly improving latency and throughput without sacrificing accuracy.
Abstract: Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model’s confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
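The cognitive gate can be sketched as a margin test on answer separability: accept the speculative planner's answer only when the gap between the top-2 answer probabilities is large, otherwise fall back to the full tool-using agent. The margin form below is an illustrative proxy, not the paper's exact separability measure.

```python
import numpy as np

def cognitive_gate(logits, margin=0.3):
    """Self-verification without oracle labels: softmax the candidate-answer
    logits and accept the top answer only if it is well separated from the
    runner-up. Returns (accepted, answer_index)."""
    z = np.asarray(logits, float)
    p = np.exp(z - z.max())
    p /= p.sum()
    top2 = np.sort(p)[-2:]                  # [second-best, best]
    return bool(top2[1] - top2[0] >= margin), int(np.argmax(p))
```

Rejected queries proceed through the expensive tool chain; accepted ones terminate early, which is where the reported 1.1-3.35x speedup comes from.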
[156] ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
Shaobo Ju, Baiyang Song, Tao Chen, Jiapeng Zhang, Qiong Wu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji
Main category: cs.CV
TL;DR: ForestPrune: A training-free token pruning method for video MLLMs that achieves high-ratio compression via spatial-temporal forest modeling, maintaining performance while reducing token count by 90%.
Details
Motivation: Existing token compression methods for video MLLMs fall short of high-ratio compression due to insufficient modeling of temporal and continual video content. There's a need for effective token pruning that preserves video understanding while reducing computational overhead.
Method: ForestPrune constructs token forests across video frames based on semantic, spatial, and temporal constraints. It evaluates token tree/node importance using tree depth and node roles to make globally optimal pruning decisions, without requiring training.
Result: Applied to LLaVA-Video and LLaVA-OneVision, ForestPrune retains 95.8% average accuracy while reducing 90% tokens for LLaVA-OneVision. Outperforms other methods with +10.1% accuracy on MLVU and -81.4% pruning time than FrameFusion on LLaVA-Video.
Conclusion: ForestPrune effectively addresses high-ratio token compression for video MLLMs through spatial-temporal forest modeling, achieving superior performance and efficiency compared to existing methods while being training-free.
Abstract: Due to the great saving of computation and memory overhead, token compression has become a research hot-spot for MLLMs and achieved remarkable progress in image-language tasks. However, for video, existing methods still fall short of high-ratio token compression. We attribute this shortcoming to the insufficient modeling of temporal and continual video content, and propose a novel and training-free token pruning method for video MLLMs, termed ForestPrune, which achieves effective and high-ratio pruning via Spatial-temporal Forest Modeling. In practice, ForestPrune constructs token forests across video frames based on semantic, spatial, and temporal constraints, forming an overall comprehension of the video. Afterwards, ForestPrune evaluates the importance of token trees and nodes based on tree depth and node roles, thereby obtaining a globally optimal pruning decision. To validate ForestPrune, we apply it to two representative video MLLMs, namely LLaVA-Video and LLaVA-OneVision, and conduct extensive experiments on a broad set of video benchmarks. The experimental results not only show its great effectiveness for video MLLMs, e.g., retaining 95.8% average accuracy while reducing 90% of tokens for LLaVA-OneVision, but also show its superior performance and efficiency over compared token compression methods, e.g., +10.1% accuracy on MLVU and -81.4% pruning time versus FrameFusion on LLaVA-Video.
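A greedy caricature of the forest construction: scan tokens (flattened across frames), attach each to a sufficiently similar existing tree or start a new one, then keep the roots of the largest trees up to a token budget. Everything below (the cosine threshold, tree-size importance, keep ratio) is an illustrative simplification of the paper's semantic/spatial/temporal constraints and depth/role scoring.

```python
import numpy as np

def forest_prune(tokens, sim_thresh=0.9, keep_ratio=0.1):
    """Greedy forest sketch over (N, D) token embeddings: a token joins the
    most similar existing tree if cosine similarity exceeds sim_thresh,
    else it becomes a new root. Bigger trees are treated as more important;
    the roots of the largest trees survive the pruning budget."""
    T = np.asarray(tokens, float)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    roots, sizes = [], []
    for i in range(len(Tn)):
        sims = [float(Tn[i] @ Tn[r]) for r in roots]
        if sims and max(sims) >= sim_thresh:
            sizes[int(np.argmax(sims))] += 1   # attach as a deeper node
        else:
            roots.append(i)                     # start a new tree
            sizes.append(1)
    budget = max(1, int(round(keep_ratio * len(Tn))))
    order = np.argsort(sizes)[::-1][:budget]    # largest trees first
    return sorted(roots[j] for j in order)
```

With a 0.1 keep ratio this keeps roughly one token in ten, matching the 90% reduction regime the paper reports, while redundant near-duplicate frame tokens collapse into single trees.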
[157] MedObvious: Exposing the Medical Moravec’s Paradox in VLMs via Clinical Triage
Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan
Main category: cs.CV
TL;DR: MedObvious benchmark tests medical VLMs’ ability to perform pre-diagnostic sanity checks on medical images, revealing they often hallucinate anomalies even on normal inputs and struggle with input validation.
Details
Motivation: Current medical VLMs can generate fluent diagnostic text even when input images are invalid or inconsistent, creating safety risks. Existing benchmarks assume input validation is solved, missing this critical failure mode.Method: Created MedObvious benchmark with 1,880 tasks testing set-level consistency across multi-panel image sets. Includes five progressive tiers (orientation/modality mismatches to clinical anatomy verification) and five evaluation formats. Evaluated 17 different VLMs.
Result: Sanity checking remains unreliable: models hallucinate anomalies on normal inputs, performance degrades with larger image sets, and accuracy varies substantially between multiple-choice vs open-ended formats.
Conclusion: Pre-diagnostic verification is unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
Abstract: Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
[158] When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse
Yihuan Huang, Jun Xue, Liu Jiajun, Daixian Li, Tong Zhang, Zhuolin Yi, Yanzhen Ren, Kai Li
Main category: cs.CV
TL;DR: First systematic evaluation of AVSR models in video conferencing reveals performance degradation from transmission distortions and human hyper-expression, addressed by creating MLD-VC dataset with Lombard effect data.
Details
Motivation: AVSR has advanced in offline settings but lacks evaluation in real-world video conferencing where transmission distortions and spontaneous human hyper-expression cause severe performance degradation.Method: Created MLD-VC dataset with 31 speakers, 22.79 hours of audio-visual data using Lombard effect to enhance human hyper-expression; systematically evaluated state-of-the-art AVSR models across VC platforms; analyzed distribution shifts from speech enhancement algorithms.
Result: Speech enhancement algorithms cause distribution shift altering audio formants; Lombard effect distribution shift resembles speech enhancement shift; fine-tuning on MLD-VC achieves average 17.5% CER reduction across VC platforms.
Conclusion: MLD-VC dataset and findings provide foundation for developing more robust and generalizable AVSR systems for real-world video conferencing applications.
Abstract: Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.
[159] FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification
Daniel Beckmann, Benjamin Risse
Main category: cs.CV
TL;DR: FixationFormer: A transformer-based architecture that integrates expert eye movement sequences with medical images for improved chest X-ray classification by preserving temporal gaze structure.
Details
Motivation: Expert eye movements contain valuable diagnostic reasoning cues in radiology, but existing CNN-based systems struggle to integrate sequential, sparse gaze data effectively. Transformers naturally align with gaze sequences due to their sequential nature and attention mechanisms.Method: Proposes FixationFormer, a transformer architecture that represents expert gaze trajectories as token sequences, preserving temporal and spatial structure. Uses explicit cross-attention between image and gaze token sequences to model gaze jointly with image features, addressing sparsity and variability in gaze data.
Result: Achieves state-of-the-art classification performance on three publicly available benchmark chest X-ray datasets, demonstrating the value of representing gaze as sequences in transformer-based medical image analysis.
Conclusion: Transformer architectures are well-suited for integrating expert gaze data in medical image analysis due to their sequential nature and attention mechanisms, enabling more direct and fine-grained integration of diagnostic reasoning cues.
Abstract: Expert eye movements provide a rich, passive source of domain knowledge in radiology, offering a powerful cue for integrating diagnostic reasoning into computer-aided analysis. However, direct integration into CNN-based systems, which historically have dominated the medical image analysis domain, is challenging: gaze recordings are sequential, temporally dense yet spatially sparse, noisy, and variable across experts. As a consequence, most existing image-based models utilize reduced representations such as heatmaps. In contrast, gaze naturally aligns with transformer architectures, as both are sequential in nature and rely on attention to highlight relevant input regions. In this work, we introduce FixationFormer, a transformer-based architecture that represents expert gaze trajectories as sequences of tokens, thereby preserving their temporal and spatial structure. By modeling gaze sequences jointly with image features, our approach addresses sparsity and variability in gaze data while enabling a more direct and fine-grained integration of expert diagnostic cues through explicit cross-attention between the image and gaze token sequences. We evaluate our method on three publicly available benchmark chest X-ray datasets and demonstrate that it achieves state-of-the-art classification performance, highlighting the value of representing gaze as a sequence in transformer-based medical image analysis.
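The cross-attention between image and gaze token sequences can be illustrated with a minimal single-head sketch (a simplified stand-in for FixationFormer, not the authors' code; token counts, embedding dimension, and the fixation encoding are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, gaze_tokens):
    """Each image patch token attends over the expert fixation sequence.

    image_tokens: (N_img, d) patch embeddings
    gaze_tokens:  (N_gaze, d) fixation embeddings, kept in temporal order
    """
    d = image_tokens.shape[-1]
    scores = image_tokens @ gaze_tokens.T / np.sqrt(d)  # (N_img, N_gaze)
    weights = softmax(scores, axis=-1)                  # attention over fixations
    return weights @ gaze_tokens                        # gaze-informed image features

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 32))   # 16 patches, 32-dim (illustrative)
gaze = rng.standard_normal((5, 32))   # 5 fixations in scan order
fused = cross_attention(img, gaze)
print(fused.shape)  # (16, 32)
```

Because the gaze sequence enters as ordinary key/value tokens, its temporal order and sparsity are preserved rather than collapsed into a heatmap, which is the core design point of the paper.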
[160] Caption Generation for Dongba Paintings via Prompt Learning and Semantic Fusion
Shuangwu Qian, Xiaochan Yuan, Pengfei Liu
Main category: cs.CV
TL;DR: PVGF-DPC: A novel encoder-decoder framework for generating culturally-aware captions for Dongba paintings using prompt modules and visual semantic-generation fusion loss.
Details
Motivation: Dongba paintings have rich cultural symbolism but mainstream captioning models fail due to domain shift; need specialized approach for culturally-specific imagery.Method: MobileNetV2 encoder extracts visual features, 10-layer Transformer decoder with BERT initialization, content prompt module maps features to culture-aware labels, visual semantic-generation fusion loss jointly optimizes prompt prediction and caption generation.
Result: Constructed dedicated dataset of 9,408 augmented Dongba paintings with cultural annotations across 7 categories; model bridges gap between generic captioning and culturally-specific art.
Conclusion: PVGF-DPC effectively addresses domain shift in culturally-specific image captioning, enabling accurate textual descriptions of Dongba paintings with cultural awareness.
Abstract: Dongba paintings, the treasured pictorial legacy of the Naxi people in southwestern China, feature richly layered visual elements, vivid color palettes, and pronounced ethnic and regional cultural symbolism, yet their automatic textual description remains largely unexplored owing to severe domain shift when mainstream captioning models are applied directly. This paper proposes PVGF-DPC (Prompt and Visual Semantic-Generation Fusion-based Dongba Painting Captioning), an encoder-decoder framework that integrates a content prompt module with a novel visual semantic-generation fusion loss to bridge the gap between generic natural-image captioning and the culturally specific imagery found in Dongba art. A MobileNetV2 encoder extracts discriminative visual features, which are injected into the layer normalization of a 10-layer Transformer decoder initialized with pretrained BERT weights; meanwhile, the content prompt module maps the image feature vector to culture-aware labels – such as "deity", "ritual pattern", or "hell ghost" – and constructs a post-prompt that steers the decoder toward thematically accurate descriptions. The visual semantic-generation fusion loss jointly optimizes the cross-entropy objectives of both the prompt predictor and the caption generator, encouraging the model to extract key cultural and visual cues and to produce captions that are semantically aligned with the input image. We construct a dedicated Dongba painting captioning dataset comprising 9,408 augmented images with culturally grounded annotations spanning seven thematic categories.
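The fusion loss described above is a joint cross-entropy over the prompt predictor and the caption generator. A minimal sketch (the weighting `lam` and all tensor shapes are illustrative assumptions; the paper does not specify them here):

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy over a batch of integer class targets."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def fusion_loss(prompt_logits, prompt_labels, caption_logits, caption_tokens, lam=1.0):
    """Visual semantic-generation fusion loss: jointly optimize the
    culture-aware prompt predictor and the caption generator."""
    return cross_entropy(prompt_logits, prompt_labels) + lam * cross_entropy(caption_logits, caption_tokens)

rng = np.random.default_rng(1)
loss = fusion_loss(
    rng.standard_normal((4, 7)), np.array([0, 3, 6, 2]),     # 7 culture-aware labels
    rng.standard_normal((10, 100)), rng.integers(0, 100, 10)  # 10 caption positions
)
print(float(loss) > 0)
```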
[161] Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining
Weijun Zhuang, Yuqing Huang, Weikang Meng, Xin Li, Ming Liu, Xiaopeng Hong, Yaowei Wang, Wangmeng Zuo
Main category: cs.CV
TL;DR: ClusterSTM: Cluster-wise spatio-temporal masking for efficient video-language pretraining that addresses visual information loss and temporal leakage issues
Details
Motivation: Large-scale video-language pretraining is computationally expensive. Existing masked visual modeling approaches suffer from severe visual information loss under high masking ratios and temporal information leakage due to inter-frame correlations.Method: Proposes ClusterSTM: 1) Intra-frame clustering to partition visual tokens into semantically independent clusters, 2) Cluster-wise masking that retains the token with highest temporal density per cluster, 3) Video-text relevance reconstruction objective for high-level multimodal alignment.
Result: Achieves superior performance on video-text retrieval, video question answering, and video captioning benchmarks, establishing new state-of-the-art among efficient video-language models.
Conclusion: ClusterSTM provides an effective masking strategy for efficient video-language pretraining that preserves holistic video content while maintaining temporal correlation, outperforming existing approaches.
Abstract: Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.
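The cluster-then-retain step can be sketched as follows. This is an illustrative simplification, not the authors' implementation: nearest-centroid assignment stands in for their intra-frame clustering, and "temporal density" is approximated here as mean similarity to the next frame's tokens.

```python
import numpy as np

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def cluster_wise_mask(frame, next_frame, k=4, seed=0):
    """Keep one token per intra-frame cluster: the one with the highest
    temporal density (here: mean similarity to the next frame's tokens)."""
    rng = np.random.default_rng(seed)
    centroids = frame[rng.choice(len(frame), k, replace=False)]
    assign = cosine_sim(frame, centroids).argmax(axis=1)   # intra-frame clustering
    density = cosine_sim(frame, next_frame).mean(axis=1)   # temporal density proxy
    keep = [int(np.flatnonzero(assign == c)[density[assign == c].argmax()])
            for c in range(k) if (assign == c).any()]
    return sorted(keep)

rng = np.random.default_rng(2)
f0, f1 = rng.standard_normal((32, 16)), rng.standard_normal((32, 16))
kept = cluster_wise_mask(f0, f1)
print(kept)  # at most k=4 of the 32 tokens survive the mask
```

Retaining the densest token per cluster is what lets the method keep coverage of all semantic regions while preferring tokens that persist across frames.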
[162] Few-Shot Generative Model Adaption via Identity Injection and Preservation
Yeqi He, Liang Li, Jiehua Zhang, Yaoqi Sun, Xichun Sheng, Zhidong Zhao, Chenggang Yan
Main category: cs.CV
TL;DR: I²P: Identity Injection and Preservation method for few-shot generative model adaptation that prevents forgetting source domain identity knowledge through identity injection and consistency alignment.
Details
Motivation: Existing few-shot generative model adaptation methods suffer from forgetting source domain identity knowledge during adaptation, which degrades the quality of generated images in the target domain.Method: Proposes Identity Injection and Preservation (I²P) with two modules: 1) Identity injection module integrates source domain identity knowledge into target domain’s latent space; 2) Identity substitution module with style-content decoupler and reconstruction modulator, plus identity consistency constraints through feature alignment.
Result: Achieves substantial improvements over state-of-the-art methods on multiple public datasets and 5 metrics, both quantitatively and qualitatively.
Conclusion: I²P effectively preserves source domain identity knowledge during few-shot generative model adaptation, improving generated image quality in target domains.
Abstract: Training generative models with limited data presents severe challenges of mode collapse. A common approach is to adapt a large pretrained generative model to a target domain with very few samples (fewer than 10), known as few-shot generative model adaptation. However, existing methods often suffer from forgetting source domain identity knowledge during adaptation, which degrades the quality of generated images in the target domain. To address this, we propose Identity Injection and Preservation (I²P), which leverages identity injection and consistency alignment to preserve the source identity knowledge. Specifically, we first introduce an identity injection module that integrates source domain identity knowledge into the target domain’s latent space, ensuring the generated images retain key identity knowledge of the source domain. Second, we design an identity substitution module, which includes a style-content decoupler and a reconstruction modulator, to further enhance source domain identity preservation. We enforce identity consistency constraints by aligning features from identity substitution, thereby preserving identity knowledge. Both quantitative and qualitative experiments show that our method achieves substantial improvements over state-of-the-art methods on multiple public datasets and 5 metrics.
[163] FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning
Jingchen Ni, Quan Zhang, Dan Jiang, Keyu Lv, Ke Zhang, Chun Yuan
Main category: cs.CV
TL;DR: FCL-COD: A weakly-supervised camouflaged object detection framework using frequency-aware adaptation and contrastive learning to improve SAM’s performance without mask annotations.
Details
Motivation: Current camouflage object detection methods require expensive mask annotations, while weakly-supervised methods perform poorly. Even SAM struggles with WSCOD due to non-camouflage target responses, local responses, extreme responses, and lack of refined boundary awareness.Method: Proposes FCL-COD framework with three components: 1) Frequency-aware Low-rank Adaptation (FoRA) to incorporate camouflage scene knowledge into SAM, 2) gradient-aware contrastive learning to delineate foreground-background boundaries, and 3) multi-scale frequency-aware representation learning for refined boundary modeling.
Result: Extensive experiments on three COD benchmarks show the method surpasses both state-of-the-art weakly supervised and even fully supervised techniques.
Conclusion: The proposed FCL-COD framework effectively addresses WSCOD challenges through frequency-aware adaptation and contrastive learning, achieving superior performance without mask annotations.
Abstract: Existing camouflaged object detection (COD) methods typically rely on fully-supervised learning guided by mask annotations. However, obtaining mask annotations is time-consuming and labor-intensive. Compared to fully-supervised methods, existing weakly-supervised COD methods exhibit significantly poorer performance. Even for the Segment Anything Model (SAM), there are still challenges in handling weakly-supervised camouflaged object detection (WSCOD), such as: a. non-camouflage target responses, b. local responses, c. extreme responses, and d. lack of refined boundary awareness, which leads to unsatisfactory results in camouflage scenes. To alleviate these issues, we propose a frequency-aware and contrastive learning-based WSCOD framework in this paper, named FCL-COD. To mitigate the problem of non-camouflaged object responses, we propose the Frequency-aware Low-rank Adaptation (FoRA) method, which incorporates frequency-aware camouflage scene knowledge into SAM. To overcome the challenges of local and extreme responses, we introduce a gradient-aware contrastive learning approach that effectively delineates precise foreground-background boundaries. Additionally, to address the lack of refined boundary perception, we present a multi-scale frequency-aware representation learning strategy that facilitates the modeling of more refined boundaries. We validate the effectiveness of our approach through extensive empirical experiments on three widely recognized COD benchmarks. The results confirm that our method surpasses both state-of-the-art weakly supervised and even fully supervised techniques.
[164] WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion
Manuel-Andreas Schneider, Angela Dai
Main category: cs.CV
TL;DR: A geometry-first approach for large-scale 3D scene generation that decouples structural composition (mesh scaffold) from appearance synthesis using image models conditioned on the mesh.
Details
Motivation: Text-to-image/video approaches struggle with scene- and object-level consistency at large scales due to lack of explicit geometry. Need for scalable 3D scene generation with robust consistency and photorealistic detail.Method: 1) Construct mesh scaffold capturing environment geometry from text description; 2) Populate mesh with objects using image synthesis, segmentation, and reconstruction; 3) Render mesh scaffold to condition image synthesis for consistent appearance generation.
Result: Enables scalable, arbitrarily-sized 3D scenes with high object richness and diversity, combining 3D consistency with photorealistic detail for environment-scale immersive worlds.
Conclusion: Geometry-first approach marks significant step toward generating truly environment-scale, immersive 3D worlds by decoupling structure from appearance synthesis.
Abstract: Recent progress in image and video synthesis has inspired their use in advancing 3D scene generation. However, we observe that text-to-image and -video approaches struggle to maintain scene- and object-level consistency beyond a limited environment scale due to the absence of explicit geometry. We thus present a geometry-first approach that decouples this complex problem of large-scale 3D scene synthesis into its structural composition, represented as a mesh scaffold, and realistic appearance synthesis, which leverages powerful image synthesis models conditioned on the mesh scaffold. From an input text description, we first construct a mesh capturing the environment’s geometry (walls, floors, etc.), and then use image synthesis, segmentation and object reconstruction to populate the mesh structure with objects in realistic layouts. This mesh scaffold is then rendered to condition image synthesis, providing a structural backbone for consistent appearance generation. This enables scalable, arbitrarily-sized 3D scenes of high object richness and diversity, combining robust 3D consistency with photorealistic detail. We believe this marks a significant step toward generating truly environment-scale, immersive 3D worlds.
[165] VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models
Jintao Cheng, Haozhe Wang, Weibin Li, Gang Wang, Yipu Zhang, Xiaoyu Tang, Jin Wu, Xieyuanli Chen, Yunhui Liu, Wei Zhang
Main category: cs.CV
TL;DR: VLA-IAP introduces a training-free visual token pruning method for Vision-Language-Action models that prioritizes interaction-aligned pruning to maintain structural anchors and adapt pruning intensity based on semantic-motion alignment.
Details
Motivation: As VLA models grow in capacity and visual context length, inference cost becomes a bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods overlook continuous physical interaction, often pruning visually sparse but structurally critical regions needed for manipulation tasks.Method: Proposes VLA-IAP (Interaction-Aligned Pruning) with two key components: 1) geometric prior mechanism to preserve structural anchors, and 2) dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment, enabling conservative-to-aggressive transition during task execution.
Result: Achieves 97.8% success rate with 1.25× speedup on LIBERO benchmark, and up to 1.54× speedup while maintaining performance comparable to unpruned backbone. Demonstrates consistent performance across multiple model architectures, three simulation environments, and real robot platform.
Conclusion: VLA-IAP provides an effective training-free pruning method that prioritizes interaction-aligned visual token selection, enabling efficient VLA deployment on resource-constrained platforms while maintaining task performance and robustness.
Abstract: Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods mainly rely on semantic saliency or simple temporal cues, overlooking the continuous physical interaction, a fundamental property of VLA tasks. Consequently, current approaches often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. To overcome this, we propose a shift toward an explicit Interaction-First paradigm. Our proposed training-free method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment. This enables a conservative-to-aggressive transition, ensuring robustness during early uncertainty and efficiency once interaction is locked. Extensive experiments show that VLA-IAP achieves a 97.8% success rate with a 1.25× speedup on the LIBERO benchmark, and up to a 1.54× speedup while maintaining performance comparable to the unpruned backbone. Moreover, the method demonstrates superior and consistent performance across multiple model architectures and three different simulation environments, as well as a real robot platform, validating its strong generalization capability and practical applicability. Our project website is: https://chengjt1999.github.io/VLA-IAP.github.io/
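The conservative-to-aggressive scheduling and anchor preservation can be sketched as follows. This is an illustrative toy, not the paper's method: the linear ratio schedule, the score-based keep rule, and the `anchors` set are all assumptions standing in for the geometric prior and alignment measure.

```python
import numpy as np

def pruning_ratio(alignment, r_min=0.3, r_max=0.9):
    """Conservative-to-aggressive schedule: prune little while semantic-motion
    alignment is uncertain, more once the interaction is locked."""
    return r_min + (r_max - r_min) * float(np.clip(alignment, 0.0, 1.0))

def prune_tokens(tokens, scores, alignment, anchors=()):
    """Drop the lowest-scoring visual tokens, but always keep the
    structural anchor tokens regardless of the pruning ratio."""
    n_keep = max(1, int(round(len(tokens) * (1 - pruning_ratio(alignment)))))
    order = np.argsort(-scores)                # highest-scoring tokens first
    keep = set(order[:n_keep]) | set(anchors)  # anchors survive any ratio
    return sorted(keep)

rng = np.random.default_rng(3)
tokens, scores = np.arange(20), rng.random(20)
early = prune_tokens(tokens, scores, alignment=0.1, anchors=(0, 5))
late = prune_tokens(tokens, scores, alignment=0.95, anchors=(0, 5))
print(len(early), len(late))  # more tokens are kept early than late
```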
[166] VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought
Xuanyu Zhang, Weiqi Li, Qunliang Xing, Jingfen Xie, Bin Chen, Junlin Li, Li Zhang, Jian Zhang, Shijie Zhao
Main category: cs.CV
TL;DR: VQ-Jarvis is a retrieval-augmented intelligent video restoration agent that uses sharp vision (degradation perception) and fast thought (hierarchical scheduling) to handle heterogeneous video degradations.
Details
Motivation: Real-world video restoration faces heterogeneous degradations where static architectures and fixed inference pipelines fail to generalize. Existing agent-based approaches have insufficient quality perception and inefficient search strategies.Method: Proposes VQ-Jarvis with: 1) Sharp vision via VSR-Compare dataset (20K comparison pairs, 7 degradation types, 11 operators) and trained perception/judge models; 2) Fast thought via hierarchical operator scheduling - easy cases use one-step RAG retrieval, harder cases use step-by-step greedy search.
Result: Extensive experiments show VQ-Jarvis consistently outperforms existing methods on complex degraded videos.
Conclusion: VQ-Jarvis provides an effective solution for real-world video restoration by combining accurate degradation perception with efficient search strategies through a retrieval-augmented agent framework.
Abstract: Video restoration in real-world scenarios is challenged by heterogeneous degradations, where static architectures and fixed inference pipelines often fail to generalize. Recent agent-based approaches offer dynamic decision making, yet existing video restoration agents remain limited by insufficient quality perception and inefficient search strategies. We propose VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent with sharper vision and faster thought. VQ-Jarvis is designed to accurately perceive degradations and subtle differences among paired restoration results, while efficiently discovering optimal restoration trajectories. To enable sharp vision, we construct VSR-Compare, the first large-scale video paired enhancement dataset with 20K comparison pairs covering 7 degradation types, 11 enhancement operators, and diverse content domains. Based on this dataset, we train a multiple operator judge model and a degradation perception model to guide agent decisions. To achieve fast thought, we introduce a hierarchical operator scheduling strategy that adapts to video difficulty: for easy cases, optimal restoration trajectories are retrieved in a one-step manner from a retrieval-augmented generation (RAG) library; for harder cases, a step-by-step greedy search is performed to balance efficiency and accuracy. Extensive experiments demonstrate that VQ-Jarvis consistently outperforms existing methods on complex degraded videos.
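The hierarchical scheduling can be sketched as a simple dispatch: retrieve a stored trajectory when the degradation embedding closely matches a library entry, otherwise fall back to greedy search. This is an illustrative simplification (the similarity threshold, the static per-operator gains, and the one-use-per-operator rule are assumptions, not the paper's design):

```python
import numpy as np

def restore(video_feat, library, op_gains, hard_threshold=0.9, depth=3):
    """library: list of (degradation_embedding, stored_trajectory) pairs.
    op_gains: judge-model score for each enhancement operator (assumed static)."""
    keys = np.stack([k for k, _ in library])
    sims = (keys @ video_feat) / (np.linalg.norm(keys, axis=1)
                                  * np.linalg.norm(video_feat) + 1e-9)
    if sims.max() >= hard_threshold:
        return library[int(sims.argmax())][1]  # easy case: one-step RAG retrieval
    ops, trajectory = dict(op_gains), []       # hard case: step-by-step greedy search
    for _ in range(min(depth, len(ops))):
        best = max(ops, key=ops.get)           # judge model's preferred next operator
        trajectory.append(best)
        ops.pop(best)                          # each operator applied once (assumption)
    return trajectory

lib = [(np.array([1.0, 0.0]), ["denoise", "deblur"]),
       (np.array([0.0, 1.0]), ["sr"])]
gains = {"denoise": 0.5, "sr": 0.8, "deblur": 0.3}
easy = restore(np.array([0.9, 0.1]), lib, gains)  # near a library entry
hard = restore(np.array([0.6, 0.6]), lib, gains)  # ambiguous degradation
print(easy, hard)
```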
[167] Zero-Shot Personalization of Objects via Textual Inversion
Aniket Roy, Maitreya Suin, Rama Chellappa
Main category: cs.CV
TL;DR: A novel framework for fast, zero-shot personalization of arbitrary objects in diffusion models using learned networks to predict object-specific textual inversion embeddings.
Details
Motivation: Existing text-to-image diffusion models struggle with fast and efficient personalization, especially for arbitrary object categories beyond human subjects. Current approaches that inject identity-specific embeddings don't generalize well, limiting real-world applicability.Method: Uses a learned network to predict object-specific textual inversion embeddings, which are then integrated into the UNet timesteps of a diffusion model for text-conditional customization, enabling single forward pass personalization.
Result: Extensive experiments show the approach enables rapid, zero-shot personalization of diverse objects, offering flexibility and scalability across multiple tasks and settings.
Conclusion: This represents the first training-free, general-purpose personalization framework for diffusion models, paving the way for more versatile and inclusive image customization.
Abstract: Recent advances in text-to-image diffusion models have substantially improved the quality of image customization, enabling the synthesis of highly realistic images. Despite this progress, achieving fast and efficient personalization remains a key challenge, particularly for real-world applications. Existing approaches primarily accelerate customization for human subjects by injecting identity-specific embeddings into diffusion models, but these strategies do not generalize well to arbitrary object categories, limiting their applicability. To address this limitation, we propose a novel framework that employs a learned network to predict object-specific textual inversion embeddings, which are subsequently integrated into the UNet timesteps of a diffusion model for text-conditional customization. This design enables rapid, zero-shot personalization of a wide range of objects in a single forward pass, offering both flexibility and scalability. Extensive experiments across multiple tasks and settings demonstrate the effectiveness of our approach, highlighting its potential to support fast, versatile, and inclusive image customization. To the best of our knowledge, this work represents the first attempt to achieve such general-purpose, training-free personalization within diffusion models, paving the way for future research in personalized image generation.
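The single-forward-pass idea can be sketched as a small network that maps an image feature to a pseudo-token embedding, which then replaces a placeholder token in the prompt. All dimensions and the MLP architecture here are illustrative assumptions, not the paper's predictor:

```python
import numpy as np

rng = np.random.default_rng(4)

class EmbeddingPredictor:
    """Tiny MLP mapping an image feature to a textual-inversion embedding
    in one forward pass (replacing per-subject optimization)."""
    def __init__(self, d_img=64, d_hidden=128, d_text=32):
        self.w1 = rng.standard_normal((d_img, d_hidden)) * 0.02
        self.w2 = rng.standard_normal((d_hidden, d_text)) * 0.02

    def __call__(self, img_feat):
        h = np.maximum(img_feat @ self.w1, 0.0)  # ReLU hidden layer
        return h @ self.w2                       # predicted pseudo-token embedding

def personalize_prompt(token_embs, placeholder_index, pseudo_token):
    """Swap the placeholder token's embedding for the predicted one."""
    out = token_embs.copy()
    out[placeholder_index] = pseudo_token
    return out

predictor = EmbeddingPredictor()
pseudo = predictor(rng.standard_normal(64))             # one forward pass
prompt = personalize_prompt(rng.standard_normal((6, 32)), 3, pseudo)
print(prompt.shape)  # (6, 32)
```

Because personalization reduces to this single forward pass plus an embedding swap, no per-object optimization is needed at inference time, which is the source of the claimed speed advantage.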
[168] Concept-based explanations of Segmentation and Detection models in Natural Disaster Management
Samar Heydari, Jawher Said, Galip Ümit Yolcu, Evgenii Kortukov, Elena Golimblevskaia, Evgenios Vlachos, Vasileios Mygdalis, Ioannis Pitas, Sebastian Lapuschkin, Leila Arras
Main category: cs.CV
TL;DR: An explainability framework for flood segmentation and car detection models using novel LRP extension for fusion layers and concept-based explanations for disaster management applications.
Details
Motivation: Deep learning models for disaster management lack transparency, hindering human trust required for emergency response. Need explainable AI for flood/wildfire segmentation and object detection on drone platforms.Method: 1) Novel redistribution strategy extending Layer-wise Relevance Propagation (LRP) for sigmoid-gated element-wise fusion layers in PIDNet; 2) Application of Prototypical Concept-based Explanations (PCX) for local and global concept-level explanations.
Result: Framework provides reliable and interpretable explanations while maintaining near real-time inference capabilities suitable for deployment on resource-constrained UAV platforms. Tested on public flood dataset.
Conclusion: The explainability framework enhances trust in disaster management AI systems by providing transparent decision-making explanations for segmentation and detection models, enabling practical deployment on drone platforms.
Abstract: Deep learning models for flood and wildfire segmentation and object detection enable precise, real-time disaster localization when deployed on embedded drone platforms. However, in natural disaster management, the lack of transparency in their decision-making process hinders human trust required for emergency response. To address this, we present an explainability framework for understanding flood segmentation and car detection predictions on the widely used PIDNet and YOLO architectures. More specifically, we introduce a novel redistribution strategy that extends Layer-wise Relevance Propagation (LRP) explanations for sigmoid-gated element-wise fusion layers. This extension allows LRP relevances to flow through the fusion modules of PIDNet, covering the entire computation graph back to the input image. Furthermore, we apply Prototypical Concept-based Explanations (PCX) to provide both local and global explanations at the concept level, revealing which learned features drive the segmentation and detection of specific disaster semantic classes. Experiments on a publicly available flood dataset show that our framework provides reliable and interpretable explanations while maintaining near real-time inference capabilities, rendering it suitable for deployment on resource-constrained platforms, such as Unmanned Aerial Vehicles (UAVs).
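The summary does not spell out the exact redistribution rule. A common LRP treatment of a sigmoid-gated element-wise fusion z = g * a + (1 - g) * b splits the incoming relevance proportionally to each branch's contribution, conserving the total; a minimal numpy sketch under that assumption (not the paper's verified rule):

```python
import numpy as np

def lrp_gated_fusion(a, b, gate, relevance, eps=1e-9):
    """Redistribute output relevance through a sigmoid-gated
    element-wise fusion z = gate * a + (1 - gate) * b.
    Relevance is split proportionally to each branch's
    contribution, so total relevance is conserved."""
    contrib_a = gate * a
    contrib_b = (1.0 - gate) * b
    z = contrib_a + contrib_b
    denom = z + eps * np.sign(z)          # stabilised denominator
    rel_a = relevance * contrib_a / denom
    rel_b = relevance * contrib_b / denom
    return rel_a, rel_b
```

With this rule in place, relevance can keep flowing backward through PIDNet's fusion modules instead of stopping at them.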
[169] Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps
Chanyoung Gwak, Yoonwoo Jeong, Byungwoo Jeon, Hyunseok Lee, Jinwoo Shin, Minsu Cho
Main category: cs.CV
TL;DR: Cog3DMap introduces a framework that builds explicit 3D memory from multi-view images for MLLMs, enabling direct spatial reasoning over structured 3D maps with both semantic and geometric information.
Details
Motivation: Current MLLMs lack explicit geometric grounding for spatial understanding from multi-view images, as their visual representations are predominantly semantic. Existing approaches that augment visual tokens with geometric cues still require MLLMs to implicitly infer 3D structure, limiting spatial reasoning capabilities.
Method: Cog3DMap recurrently constructs an explicit 3D memory from multi-view images where each token is grounded in 3D space and contains both semantic and geometric information. These tokens are then fed into the MLLM for direct reasoning over a spatially structured 3D map.
Result: Achieves state-of-the-art performance on various spatial reasoning benchmarks by enabling direct reasoning over structured 3D maps with explicit geometric grounding.
Conclusion: The framework successfully addresses the geometric grounding limitation in MLLMs for spatial understanding from multi-view images, providing both semantic and geometric information in a structured 3D representation that enhances spatial reasoning capabilities.
Abstract: Precise spatial understanding from multi-view images remains a fundamental challenge for Multimodal Large Language Models (MLLMs), as their visual representations are predominantly semantic and lack explicit geometric grounding. While existing approaches augment visual tokens with geometric cues from visual geometry models, their MLLM is still required to implicitly infer the underlying 3D structure of the scene from these augmented tokens, limiting its spatial reasoning capability. To address this issue, we introduce Cog3DMap, a framework that recurrently constructs an explicit 3D memory from multi-view images, where each token is grounded in 3D space and possesses both semantic and geometric information. By feeding these tokens into the MLLM, our framework enables direct reasoning over a spatially structured 3D map, achieving state-of-the-art performance on various spatial reasoning benchmarks. Code will be made publicly available.
[170] Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
ByeongCheol Lee, Hyun Seok Seong, Sangeek Hyun, Gilhan Park, WonJun Moon, Jae-Pil Heo
Main category: cs.CV
TL;DR: GLA-CLIP addresses semantic discrepancy in sliding-window inference for open-vocabulary segmentation by enabling global-local information exchange across windows using proxy anchors and dynamic normalization.
Details
Motivation: Sliding-window inference in training-free open-vocabulary semantic segmentation causes semantic discrepancies across windows due to independent processing, limiting CLIP's ability to maintain consistent semantics across the entire image.
Method: Proposes GLA-CLIP with: 1) Extended key-value tokens across all windows for global context, 2) Proxy anchors aggregating similar tokens from all windows as unified semantic references, and 3) Dynamic normalization scaling attention based on object size for small objects.
Result: Extensive experiments show GLA-CLIP effectively enhances training-free open-vocabulary semantic segmentation performance and can be integrated into existing methods to broaden their receptive field.
Conclusion: GLA-CLIP successfully addresses window bias in sliding-window inference by enabling comprehensive information exchange across windows, improving semantic consistency in open-vocabulary segmentation without additional training.
Abstract: A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome the limitations of CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended to, since query features are produced through interactions within the inner window patches, thereby lacking semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be integrated into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Code is available at https://github.com/2btlFe/GLA-CLIP.
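The core idea of extending key-value tokens beyond the local window can be sketched as plain attention in which each window's queries score against a key-value bank pooled from all windows. This is a simplification: the actual method layers proxy anchors and dynamic normalization on top of this exchange.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_local_attention(window_feats):
    """window_feats: (W, N, D) token features of W windows.
    Each window's queries attend over keys/values pooled from
    *all* windows, instead of only over its own N tokens."""
    W, N, D = window_feats.shape
    kv = window_feats.reshape(W * N, D)        # global key-value bank
    outputs = []
    for w in range(W):
        q = window_feats[w]                     # (N, D) local queries
        attn = softmax(q @ kv.T / np.sqrt(D))   # (N, W*N) scores
        outputs.append(attn @ kv)               # convex mix of all tokens
    return np.stack(outputs)
```

Because every output token is a convex combination over the global bank, semantics from neighbouring windows can influence each window's prediction.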
[171] Generative Event Pretraining with Foundation Model Alignment
Jianwen Cao, Jiaxu Xing, Nico Messikommer, Davide Scaramuzza
Main category: cs.CV
TL;DR: GEP framework transfers semantic knowledge from image datasets to event data while learning event-specific temporal dynamics through VFM-guided alignment and generative sequence modeling.
Details
Motivation: Event cameras provide robust visual signals under challenging conditions but have unique sensing characteristics and limited labeled data, making it difficult to train event-based visual foundation models that can learn transferable features across tasks.
Method: Two-stage framework: 1) Event encoder aligned to frozen visual foundation model through joint regression-contrastive objective to ground event features in image semantics; 2) Transformer backbone autoregressively pretrained on mixed event-image sequences to capture event-specific temporal dynamics.
Result: Outperforms state-of-the-art event pretraining methods on diverse downstream tasks including object recognition, segmentation, and depth estimation.
Conclusion: VFM-guided alignment combined with generative sequence modeling yields a semantically rich, temporally aware event model that generalizes robustly across domains.
Abstract: Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.
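A joint regression-contrastive objective of the kind described in stage one can be sketched as an MSE term plus an InfoNCE term over paired event/VFM features. The weighting `alpha` and temperature are illustrative assumptions, not values from the paper:

```python
import numpy as np

def alignment_loss(event_feats, vfm_feats, temperature=0.1, alpha=0.5):
    """Joint regression-contrastive objective: MSE pulls event
    features onto the frozen VFM targets, while an InfoNCE term
    keeps paired samples closer than mismatched ones."""
    # Regression term: direct feature matching.
    mse = np.mean((event_feats - vfm_feats) ** 2)
    # Contrastive term: cosine similarities between all pairs.
    e = event_feats / np.linalg.norm(event_feats, axis=1, keepdims=True)
    v = vfm_feats / np.linalg.norm(vfm_feats, axis=1, keepdims=True)
    logits = e @ v.T / temperature                       # (B, B)
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nce = -np.mean(np.diag(log_probs))                   # positives on the diagonal
    return alpha * mse + (1 - alpha) * nce
```

Matched pairs should yield a lower loss than mismatched ones, which is what grounds event features in the VFM's image semantics.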
[172] Traffic Sign Recognition in Autonomous Driving: Dataset, Benchmark, and Field Experiment
Guoyang Zhao, Weiqing Qi, Kai Zhang, Chenguang Zhang, Zeying Gong, Zhihai Bi, Kai Chen, Benshan Ma, Ming Liu, Jun Ma
Main category: cs.CV
TL;DR: TS-1M is a large-scale traffic sign dataset with 1M+ images across 454 categories, featuring diagnostic benchmarks for cross-region generalization, rare-class identification, and semantic understanding, with analysis of supervised, self-supervised, and vision-language models.
Details
Motivation: Existing traffic sign datasets lack diagnostic insight into how models handle practical challenges like cross-region variation, long-tailed categories, and semantic ambiguity, which are crucial for real-world autonomous driving deployment.
Method: Created TS-1M dataset with over 1 million real-world images across 454 standardized categories, designed diagnostic benchmark with challenge-oriented settings (cross-region recognition, rare-class identification, low-clarity robustness, semantic text understanding), and benchmarked three learning paradigms: classical supervised models, self-supervised pretrained models, and multimodal vision-language models.
Result: Analysis reveals consistent paradigm-dependent behaviors: semantic alignment is key for cross-region generalization and rare-category recognition, while purely visual models remain sensitive to appearance shift and data imbalance. Real-scene autonomous driving experiments validate practical relevance.
Conclusion: TS-1M establishes a reference-level diagnostic benchmark for traffic sign recognition and provides principled insights into robust and semantic-aware traffic sign perception, highlighting the importance of semantic alignment for practical deployment.
Abstract: Traffic Sign Recognition (TSR) is a core perception capability for autonomous driving, where robustness to cross-region variation, long-tailed categories, and semantic ambiguity is essential for reliable real-world deployment. Despite steady progress in recognition accuracy, existing traffic sign datasets and benchmarks offer limited diagnostic insight into how different modeling paradigms behave under these practical challenges. We present TS-1M, a large-scale and globally diverse traffic sign dataset comprising over one million real-world images across 454 standardized categories, together with a diagnostic benchmark designed to analyze model capability boundaries. Beyond standard train-test evaluation, we provide a suite of challenge-oriented settings, including cross-region recognition, rare-class identification, low-clarity robustness, and semantic text understanding, enabling systematic and fine-grained assessment of modern TSR models. Using TS-1M, we conduct a unified benchmark across three representative learning paradigms: classical supervised models, self-supervised pretrained models, and multimodal vision-language models (VLMs). Our analysis reveals consistent paradigm-dependent behaviors, showing that semantic alignment is a key factor for cross-region generalization and rare-category recognition, while purely visual models remain sensitive to appearance shift and data imbalance. Finally, we validate the practical relevance of TS-1M through real-scene autonomous driving experiments, where traffic sign recognition is integrated with semantic reasoning and spatial localization to support map-level decision constraints. Overall, TS-1M establishes a reference-level diagnostic benchmark for TSR and provides principled insights into robust and semantic-aware traffic sign perception. Project page: https://guoyangzhao.github.io/projects/ts1m.
[173] HUydra: Full-Range Lung CT Synthesis via Multiple HU Interval Generative Modelling
António Cardoso, Pedro Sousa, Tania Pereira, Hélder P. Oliveira
Main category: cs.CV
TL;DR: A novel decomposition strategy for generating lung CT scans by synthesizing images one Hounsfield Unit interval at a time, using tissue-focused HU windows and reconstruction networks, achieving superior performance over conventional methods.
Details
Motivation: Data scarcity in medical imaging delays CAD model deployment and validation, particularly for lung cancer diagnosis. Generative AI offers promise but faces challenges with complex HU range distributions in CT scans and high computational demands.
Method: Proposes a decomposition strategy that trains generative architectures on individual tissue-focused HU windows, then merges outputs via a learned reconstruction network that reverses HU-windowing. Introduces multi-head and multi-decoder models (multi-head VQVAE) to capture textures while preserving anatomical consistency.
Result: Significantly outperforms conventional 2D full-range baselines with 6.2% improvement in FID and superior MMD, Precision, and Recall across all HU intervals. Multi-head VQVAE variant achieves best performance, enhancing visual fidelity and variability while reducing model complexity and computational cost.
Conclusion: Establishes a new paradigm for structure-aware medical image synthesis that aligns generative modeling with clinical interpretation, addressing data scarcity in medical imaging through efficient decomposition-based generation.
Abstract: Currently, a central challenge and bottleneck in the deployment and validation of computer-aided diagnosis (CAD) models within the field of medical imaging is data scarcity. For lung cancer, one of the most prevalent types worldwide, limited datasets can delay diagnosis and have an impact on patient outcomes. Generative AI offers a promising solution for this issue, but dealing with the complex distribution of full Hounsfield Unit (HU) range lung CT scans is challenging and remains a highly computationally demanding task. This paper introduces a novel decomposition strategy that synthesizes CT images one HU interval at a time, rather than modelling the entire HU domain at once. This framework focuses on training generative architectures on individual tissue-focused HU windows, then merges their output into a full-range scan via a learned reconstruction network that effectively reverses the HU-windowing process. We further propose multi-head and multi-decoder models to better capture textures while preserving anatomical consistency, with a multi-head VQVAE achieving the best performance for the generative task. Quantitative evaluation shows this approach significantly outperforms conventional 2D full-range baselines, achieving a 6.2% improvement in FID and superior MMD, Precision, and Recall across all HU intervals. The best performance is achieved by a multi-head VQVAE variant, demonstrating that it is possible to enhance visual fidelity and variability while also reducing model complexity and computational cost. This work establishes a new paradigm for structure-aware medical image synthesis, aligning generative modelling with clinical interpretation.
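The HU-windowing decomposition can be illustrated with a small numpy sketch. The interval boundaries below are hypothetical tissue windows (lung / soft tissue / bone), not the paper's, and the merge here is a naive analytic inverse rather than the learned reconstruction network:

```python
import numpy as np

# Hypothetical tissue-focused HU windows; the paper's intervals may differ.
HU_WINDOWS = [(-1000, -200), (-200, 300), (300, 2000)]

def decompose(ct, windows=HU_WINDOWS):
    """Split a full-range CT array into per-window images,
    each normalised to [0, 1] within its HU interval."""
    return [np.clip((ct - lo) / (hi - lo), 0.0, 1.0)
            for lo, hi in windows]

def reconstruct(channels, windows=HU_WINDOWS):
    """Naive inverse of the windowing: map each channel back to HU
    and take, per voxel, the value from the window containing it.
    (The paper learns this merge with a reconstruction network.)"""
    out = np.full_like(channels[0], windows[0][0], dtype=float)
    for ch, (lo, hi) in zip(channels, windows):
        hu = ch * (hi - lo) + lo
        inside = (ch > 0.0) & (ch < 1.0)   # unsaturated voxels only
        out[inside] = hu[inside]
    return out
```

For voxels strictly inside a window, the round trip is lossless, which is what lets per-window generators be merged back into a full-range scan.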
[174] MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
Basit Alawode, Arif Mahmood, Muaz Khalifa Al-Radi, Shahad Albastaki, Asim Khan, Muhammad Bilal, Moshira Ali Abdalla, Mohammed Bennamoun, Sajid Javed
Main category: cs.CV
TL;DR: MLLM-HWSI is a hierarchical multimodal LLM for whole slide images that aligns visual features at four scales (cell, patch, region, WSI) with pathology language, enabling interpretable evidence-grounded reasoning through hierarchical contrastive learning and cross-scale consistency.
Details
Motivation: Existing CPath MLLMs compress entire WSIs into single embeddings, losing fine-grained grounding and ignoring how pathologists synthesize evidence across different scales. There's a need for hierarchical modeling that preserves multi-scale diagnostic information.
Method: Decomposes WSIs into multi-scale embeddings with scale-specific projectors, uses hierarchical contrastive objective and cross-scale consistency loss, computes diagnostically relevant patches, aggregates cell embeddings via Cell-Cell Attention Fusion transformer, and fuses multi-scale tokens with text tokens for LLM reasoning.
Result: Achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks, demonstrating superior performance in open-ended reasoning, VQA, report, and caption generation tasks.
Conclusion: MLLM-HWSI provides accurate, interpretable outputs that mirror diagnostic workflows by aligning language with multi-scale visual evidence, advancing holistic WSI understanding in computational pathology.
Abstract: Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic information emerges from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales. We introduce MLLM-HWSI, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales (cell as word, patch as phrase, region as sentence, and WSI as paragraph) to support interpretable evidence-grounded reasoning. MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. We compute diagnostically relevant patches and aggregate segmented cell embeddings into a compact cellular token per patch using a lightweight Cell-Cell Attention Fusion (CCAF) transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks. Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By aligning language with multi-scale visual evidence, MLLM-HWSI provides accurate, interpretable outputs that mirror diagnostic workflows and advance holistic WSI understanding. Code is available at: https://github.com/BasitAlawode/HWSI-MLLM.
[175] PolarAPP: Beyond Polarization Demosaicking for Polarimetric Applications
Yidong Luo, Chenggong Li, Yunfeng Song, Ping Wang, Boxin Shi, Junchao Zhang, Xin Yuan
Main category: cs.CV
TL;DR: PolarAPP is a framework that jointly optimizes polarimetric image demosaicking and downstream vision tasks through meta-learning and feature alignment, improving both reconstruction quality and task performance.
Details
Motivation: Current polarimetric imaging applications use suboptimal reconstruction strategies that extract sparse images without proper demosaicking, resulting in incomplete targets that limit downstream performance. Existing demosaicking methods are task-agnostic, optimizing only for photometric fidelity rather than utility in downstream tasks.
Method: PolarAPP introduces: 1) Feature alignment mechanism that semantically aligns representations of demosaicking and downstream networks via meta-learning, 2) Equivalent imaging constraint for demosaicking training enabling direct regression to physically meaningful outputs, and 3) Task-refinement stage that fine-tunes the task network using the stable demosaicking front-end.
Result: Extensive experimental results demonstrate that PolarAPP outperforms existing methods in both demosaicking quality and downstream performance on tasks like normal estimation and de-reflection.
Conclusion: PolarAPP successfully addresses the limitations of current polarimetric imaging pipelines by jointly optimizing demosaicking and downstream tasks, creating a task-aware reconstruction framework that improves both reconstruction quality and task performance.
Abstract: Polarimetric imaging enables advanced vision applications such as normal estimation and de-reflection by capturing unique surface-material interactions. However, existing applications (alternatively called downstream tasks) rely on datasets constructed by naively regrouping raw measurements from division-of-focal-plane sensors, where pixels of the same polarization angle are extracted and aligned into sparse images without proper demosaicking. This reconstruction strategy results in suboptimal, incomplete targets that limit downstream performance. Moreover, current demosaicking methods are task-agnostic, optimizing only for photometric fidelity rather than utility in downstream tasks. Towards this end, we propose PolarAPP, the first framework to jointly optimize demosaicking and its downstream tasks. PolarAPP introduces a feature alignment mechanism that semantically aligns the representations of demosaicking and downstream networks via meta-learning, guiding the reconstruction to be task-aware. It further employs an equivalent imaging constraint for demosaicking training, enabling direct regression to physically meaningful outputs without relying on rearranged data. Finally, a task-refinement stage fine-tunes the task network using the stable demosaicking front-end to further enhance accuracy. Extensive experimental results demonstrate that PolarAPP outperforms existing methods in both demosaicking quality and downstream performance. Code is available upon acceptance.
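The "naive regrouping" of division-of-focal-plane raw data that the paper critiques can be sketched directly, along with the linear Stokes parameters computed from the four angle images. The 2x2 layout assumed below (0/45 on the top row, 135/90 on the bottom) varies by sensor:

```python
import numpy as np

def regroup_dofp(raw):
    """Naive regrouping of a division-of-focal-plane mosaic:
    pull each polarizer angle (0/45/90/135 deg, assumed 2x2 layout)
    into a quarter-resolution image -- the sparse strategy the
    paper argues against, shown here for reference."""
    i000 = raw[0::2, 0::2]
    i045 = raw[0::2, 1::2]
    i090 = raw[1::2, 1::2]
    i135 = raw[1::2, 0::2]
    return i000, i045, i090, i135

def stokes(i000, i045, i090, i135):
    """Linear Stokes parameters from the four angle images."""
    s0 = 0.5 * (i000 + i045 + i090 + i135)   # total intensity
    s1 = i000 - i090                          # horizontal vs vertical
    s2 = i045 - i135                          # diagonal components
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-9)
    return s0, s1, s2, dolp
```

Because each angle image keeps only a quarter of the pixels, downstream targets built this way are inherently incomplete, which is the gap proper demosaicking fills.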
[176] A Synchronized Audio-Visual Multi-View Capture System
Xiangwei Shi, Era Dorta Perez, Ruud de Jong, Ojas Shirekar, Chirag Raman
Main category: cs.CV
TL;DR: An audio-visual multi-view capture system that treats synchronized audio and video as first-class signals for studying conversational interaction with precise timing.
Details
Motivation: Existing multi-view capture systems focus primarily on video with little support for audio acquisition and rigorous audio-video alignment, despite both being essential for studying conversational interaction where timing at the level of turn-taking, overlap, and prosody matters.
Method: Combines a multi-camera pipeline with multi-channel microphone recording under a unified timing architecture, providing a practical workflow for calibration, acquisition, and quality control that supports repeatable recordings at scale.
Result: Quantified synchronization performance in deployment shows that the resulting recordings are temporally consistent enough to support fine-grained analysis and data-driven modeling of conversation behavior.
Conclusion: The system addresses the gap in existing multi-view capture systems by treating synchronized audio and synchronized video as first-class signals, enabling precise study of conversational interactions.
Abstract: Multi-view capture systems have been an important tool in research for recording human motion under controlled conditions. Most existing systems are designed around video streams and provide little or no support for audio acquisition and rigorous audio-video alignment, despite both being essential for studying conversational interaction where timing at the level of turn-taking, overlap, and prosody matters. In this technical report, we describe an audio-visual multi-view capture system that addresses this gap by treating synchronized audio and synchronized video as first-class signals. The system combines a multi-camera pipeline with multi-channel microphone recording under a unified timing architecture and provides a practical workflow for calibration, acquisition, and quality control that supports repeatable recordings at scale. We quantify synchronization performance in deployment and show that the resulting recordings are temporally consistent enough to support fine-grained analysis and data-driven modeling of conversation behavior.
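The report does not detail its verification procedure, but a standard way to quantify residual offset between two captured streams is to cross-correlate recordings of a shared sync event (e.g., a clap). A minimal sketch assuming equal sample rates:

```python
import numpy as np

def estimate_offset(sig_a, sig_b, sample_rate):
    """Estimate the relative delay between two recordings of a
    shared sync event via full cross-correlation.
    Returns seconds; positive means sig_b lags sig_a."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_a) - 1)
    return lag / sample_rate
```

Running this across channel pairs after each session gives a per-pair offset estimate that can be checked against the timing architecture's tolerance.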
[177] NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
Yik San Cheng, Runkai Zhao, Weidong Cai
Main category: cs.CV
TL;DR: Adapting 2D DINOv3 visual foundation model to 3D for neuronal reconstruction using inflation-based adaptation and topology-aware skeleton loss
Details
Motivation: Lack of 3D foundation models for volumetric neuroimaging due to challenges in 3D image acquisition and annotation scarcity, while 2D visual foundation models like DINOv3 show strong zero-shot generalization.
Method: Inflation-based adaptation strategy that inflates 2D filters into 3D operators to preserve DINOv3 semantic priors, plus topology-aware skeleton loss for structural fidelity in graph-based neuronal reconstruction
Result: Consistent improvements on four neuronal imaging datasets with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure over state-of-the-art methods
Conclusion: Successfully adapted 2D visual foundation model to 3D biomedical segmentation, enabling more data-efficient and morphologically faithful neuronal reconstruction
Abstract: 2D visual foundation models, such as DINOv3, a self-supervised model trained on large-scale natural images, have demonstrated strong zero-shot generalization, capturing both rich global context and fine-grained structural cues. However, an analogous 3D foundation model for downstream volumetric neuroimaging remains lacking, largely due to the challenges of 3D image acquisition and the scarcity of high-quality annotations. To address this gap, we propose to adapt the 2D visual representations learned by DINOv3 to a 3D biomedical segmentation model, enabling more data-efficient and morphologically faithful neuronal reconstruction. Specifically, we design an inflation-based adaptation strategy that inflates 2D filters into 3D operators, preserving semantic priors from DINOv3 while adapting to 3D neuronal volume patches. In addition, we introduce a topology-aware skeleton loss to explicitly enforce structural fidelity of graph-based neuronal arbor reconstruction. Extensive experiments on four neuronal imaging datasets, including two from BigNeuron and two public datasets, NeuroFly and CWMBS, demonstrate consistent improvements in reconstruction accuracy over SoTA methods, with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure. Code: https://github.com/yy0007/NeurINO.
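Inflating a 2D filter into a 3D operator is commonly done by tiling the kernel along the new depth axis and rescaling, so that a volume constant in depth produces the same response the 2D filter gave on a single slice (the I3D-style recipe; the paper's exact variant may differ):

```python
import numpy as np

def inflate_2d_to_3d(weight_2d, depth):
    """Inflate a 2D conv filter (out, in, kH, kW) into a 3D one
    (out, in, depth, kH, kW) by tiling along the new axis and
    rescaling by 1/depth, preserving the response on
    depth-constant inputs."""
    w3d = np.repeat(weight_2d[:, :, None, :, :], depth, axis=2)
    return w3d / depth
```

The rescaling keeps the total filter mass unchanged, which is what preserves the pretrained 2D semantic priors when the weights are loaded into a 3D backbone.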
[178] AgentFoX: LLM Agent-Guided Fusion with eXplainability for AI-Generated Image Detection
Yangxin Yu, Yue Zhou, Bin Li, Kaiqing Lin, Haodong Li, Jiangqun Ni, Bo Cao
Main category: cs.CV
TL;DR: AgentFoX is an LLM-driven framework for AI-generated image detection that uses multi-phase analysis with expert profiles and produces detailed forensic reports instead of binary outputs.
Details
Motivation: The increasing realism of AI-generated images creates urgent need for reliable forensic tools. Existing detectors are specialized for specific artifacts and often produce conflicting judgments, lacking interpretability.
Method: Uses LLM-driven framework with quick-integration fusion mechanism guided by calibrated Expert Profiles and contextual Clustering Profiles. Performs multi-phase analysis: high-level semantic assessment followed by fine-grained, context-aware synthesis of signal-level expert evidence with structured reasoning to resolve contradictions.
Result: Produces detailed, human-readable forensic reports that substantiate verdicts, enhancing interpretability and trustworthiness for real-world deployment. Introduces scalable agentic paradigm for integrating future forensic tools.
Conclusion: AgentFoX redefines AIGI detection as dynamic multi-phase analytical process, providing novel solution that goes beyond binary classification to offer interpretable forensic analysis.
Abstract: The increasing realism of AI-Generated Images (AIGI) has created an urgent need for forensic tools capable of reliably distinguishing synthetic content from authentic imagery. Existing detectors are typically tailored to specific forgery artifacts, such as frequency-domain patterns or semantic inconsistencies, leading to specialized performance and, at times, conflicting judgments. To address these limitations, we present AgentFoX, a Large Language Model-driven framework that redefines AIGI detection as a dynamic, multi-phase analytical process. Our approach employs a quick-integration fusion mechanism guided by a curated knowledge base comprising calibrated Expert Profiles and contextual Clustering Profiles. During inference, the agent begins with high-level semantic assessment, then transitions to fine-grained, context-aware synthesis of signal-level expert evidence, resolving contradictions through structured reasoning. Instead of returning a coarse binary output, AgentFoX produces a detailed, human-readable forensic report that substantiates its verdict, enhancing interpretability and trustworthiness for real-world deployment. Beyond providing a novel detection solution, this work introduces a scalable agentic paradigm that facilitates intelligent integration of future and evolving forensic tools.
[179] Automatic Segmentation of 3D CT scans with SAM2 using a zero-shot approach
Miquel Lopez Escoriza, Pau Amargant Alvarez
Main category: cs.CV
TL;DR: SAM2 adapted for zero-shot 3D CT segmentation using video-based memory mechanisms and inference modifications without fine-tuning.
Details
Motivation: Foundation models like SAM2 show strong generalization in natural images but have limited applicability to 3D medical imaging due to lack of volumetric awareness.
Method: Proposes inference-only architectural and procedural modifications that adapt SAM2's video-based memory mechanism to 3D CT data by treating slices as ordered sequences, with systematic ablation studies on prompt strategies, memory propagation, and multi-pass refinement.
Result: SAM2 with frozen weights can produce coherent 3D segmentations when inference pipeline is carefully structured, demonstrating feasibility of fully zero-shot approach for volumetric medical image segmentation on TotalSegmentator dataset.
Conclusion: Carefully structured inference pipelines enable foundation models like SAM2 to work effectively for 3D medical imaging without domain-specific training, opening possibilities for zero-shot volumetric segmentation.
Abstract: Foundation models for image segmentation have shown strong generalization in natural images, yet their applicability to 3D medical imaging remains limited. In this work, we study the zero-shot use of Segment Anything Model 2 (SAM2) for automatic segmentation of volumetric CT data, without any fine-tuning or domain-specific training. We analyze how SAM2 should be applied to CT volumes and identify its main limitation: the lack of inherent volumetric awareness. To address this, we propose a set of inference-only architectural and procedural modifications that adapt SAM2's video-based memory mechanism to 3D data by treating CT slices as ordered sequences. We conduct a systematic ablation study on a subset of 500 CT scans from the TotalSegmentator dataset to evaluate prompt strategies, memory propagation schemes and multi-pass refinement. Based on these findings, we select the best-performing configuration and report final results on a larger sample of the TotalSegmentator dataset comprising 2,500 CT scans. Our results show that, even with frozen weights, SAM2 can produce coherent 3D segmentations when its inference pipeline is carefully structured, demonstrating the feasibility of a fully zero-shot approach for volumetric medical image segmentation.
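The slices-as-video idea can be sketched schematically. The `predictor` interface below is hypothetical: SAM2's real video predictor API uses different names, but the prompt-then-propagate pattern is the same.

```python
import numpy as np

def segment_volume(volume, predictor, seed_slice, seed_points):
    """Treat axial CT slices as an ordered 'video': prompt on one
    seed slice, then let a memory mechanism propagate the mask
    forward and backward through the stack. `predictor` is a
    hypothetical wrapper exposing prompt() and propagate()."""
    masks = {seed_slice: predictor.prompt(volume[seed_slice], seed_points)}
    for order in (range(seed_slice + 1, volume.shape[0]),    # forward pass
                  range(seed_slice - 1, -1, -1)):            # backward pass
        for z in order:
            masks[z] = predictor.propagate(volume[z])        # memory-conditioned
    return np.stack([masks[z] for z in range(volume.shape[0])])
```

Multi-pass refinement, in this framing, amounts to re-running the propagation loops with the previous pass's masks available as additional memory.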
[180] PiCo: Active Manifold Canonicalization for Robust Robotic Visual Anomaly Detection
Teng Yan, Binkai Liu, Shuai Liu, Yue Yu, Bingzhuo Zhong
Main category: cs.CV
TL;DR: PiCo introduces an active canonicalization framework for robotic visual anomaly detection that actively projects observations onto a condition-invariant canonical manifold through physical object reorientation and neural latent refinement.
Details
Motivation: Current robotic visual anomaly detection systems are limited by passive perception under diverse 6-DoF poses and unstable operating conditions (illumination changes, shadows), where semantic anomalies and physical disturbances coexist and interact.
Method: Two-stage cascaded framework: 1) Active Physical Canonicalization - robotic agent reorients objects to reduce geometric uncertainty; 2) Neural Latent Canonicalization - three-stage denoising hierarchy with photometric processing, latent refinement, and contextual reasoning across representational scales.
Result: Achieves state-of-the-art 93.7% O-AUROC on M2AD benchmark (3.7% improvement over prior methods), and 98.5% accuracy in active closed-loop scenarios.
Conclusion: Active manifold canonicalization is critical for robust embodied perception, representing a paradigm shift from passive feature learning to active canonicalization for robotic visual anomaly detection.
Abstract: Industrial deployment of robotic visual anomaly detection (VAD) is fundamentally constrained by passive perception under diverse 6-DoF pose configurations and unstable operating conditions such as illumination changes and shadows, where intrinsic semantic anomalies and physical disturbances coexist and interact. To overcome these limitations, a paradigm shift from passive feature learning to Active Canonicalization is proposed. PiCo (Pose-in-Condition Canonicalization) is introduced as a unified framework that actively projects observations onto a condition-invariant canonical manifold. PiCo operates through a cascaded mechanism. The first stage, Active Physical Canonicalization, enables a robotic agent to reorient objects in order to reduce geometric uncertainty at its source. The second stage, Neural Latent Canonicalization, adopts a three-stage denoising hierarchy consisting of photometric processing at the input level, latent refinement at the feature level, and contextual reasoning at the semantic level, progressively eliminating nuisance factors across representational scales. Extensive evaluations on the large-scale M2AD benchmark demonstrate the superiority of this paradigm. PiCo achieves a state-of-the-art 93.7% O-AUROC, representing a 3.7% improvement over prior methods in static settings, and attains 98.5% accuracy in active closed-loop scenarios. These results demonstrate that active manifold canonicalization is critical for robust embodied perception.
[181] 3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio
Jihwan Hong, Jaeyoung Do
Main category: cs.CV
TL;DR: VIRST-Audio converts audio queries to text using ASR, then uses a pretrained RVOS model with vision-language architecture for audio-based referring video object segmentation, adding an existence-aware gating mechanism to reduce hallucinations.
Details
Motivation: Audio-based Referring Video Object Segmentation (ARVOS) faces challenges in bridging acoustic signals with spatio-temporal visual representations. The paper aims to create a practical framework that can effectively ground audio queries into pixel-level object masks over time without requiring extensive audio-specific training.
Method: 1) Convert input audio into text using an ASR module; 2) Use a pretrained RVOS model integrated with vision-language architecture; 3) Perform segmentation using text-based supervision to transfer from text-based reasoning to audio-driven scenarios; 4) Incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when absent.
Result: Achieved 3rd place on the MeViS-Audio track of the 5th PVUW Challenge, demonstrating strong generalization and reliable performance in audio-based referring video segmentation. The approach shows effective transfer from text-based reasoning to audio-driven scenarios.
Conclusion: The VIRST-Audio framework provides a practical solution for ARVOS by leveraging existing text-based models through audio-to-text conversion, with the existence-aware gating mechanism improving robustness by reducing hallucinated masks and stabilizing segmentation behavior.
Abstract: Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.
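The existence-aware gate can be illustrated with a minimal sketch. All names here — `rvos`, `exist_score`, the threshold `tau` — are stand-ins, since the report does not publish an API:

```python
def gated_segment(asr_text, frames, rvos, exist_score, tau=0.5):
    """Run a (stand-in) RVOS model on the ASR transcript, then suppress
    all predicted masks when the estimated probability that the referred
    object exists in the video falls below `tau` (how the gate is applied
    is our assumption)."""
    masks = rvos(asr_text, frames)
    if exist_score(asr_text, frames) < tau:
        return [None] * len(masks)  # no target present: emit empty masks
    return masks
```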
[182] InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance
Dongwei Pan, Longwei Guo, Jiazhi Guan, Luying Huang, Yiding Li, Haojie Liu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou
Main category: cs.CV
TL;DR: InterDyad: A framework for synthesizing naturalistic interactive dynamics in dyadic (two-person) video generation from speech, using structural motion guidance and multimodal LLMs to achieve contextually appropriate reactions and improved lip-sync.
Details
Motivation: Existing speech-to-video synthesis methods struggle with capturing cross-individual dependencies and providing fine-grained control over reactive behaviors in dyadic (two-person) interactions, limiting their ability to produce natural conversational dynamics.
Method: 1) Interactivity Injector for video reenactment using identity-agnostic motion priors; 2) MetaQuery-based modality alignment to bridge conversational audio and motion priors; 3) Multimodal LLM to distill linguistic intent for precise reaction timing; 4) Role-aware Dyadic Gaussian Guidance (RoDG) for enhanced lip-synchronization under extreme head poses.
Result: InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions, with improved lip-sync quality and spatial consistency, validated through comprehensive experiments and novel evaluation metrics.
Conclusion: The proposed InterDyad framework effectively addresses the challenges of dyadic speech-to-video synthesis by leveraging structural motion guidance and multimodal LLMs, enabling more naturalistic interactive dynamics with precise control over reactive behaviors.
Abstract: Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG) for enhanced lip-synchronization and spatial consistency. Finally, we introduce a dedicated evaluation suite with novelly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: https://interdyad.github.io/.
[183] VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
August Leander Høeg, Sophia Wiinberg Bardenfleth, Hans Martin Kjer, Tim Bjørn Dyrby, Vedrana Andersen Dahl, Anders Bjorholm Dahl
Main category: cs.CV
TL;DR: The paper reveals that current volumetric super-resolution methods perform well on downsampled data but fail on real low-resolution scans, showing a significant performance gap. The authors introduce VoDaSuRe, a dataset with paired real high- and low-resolution 3D scans, to address this issue.
Details
Motivation: Current volumetric super-resolution methods are trained primarily on downsampled data due to scarcity of paired real high- and low-resolution 3D datasets. This creates a disconnect between research performance and real-world application, as models trained on synthetic downsampling don't generalize well to actual low-resolution scans.
Method: The authors introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. They use this dataset to evaluate existing SR models (transformer- and CNN-based approaches) and compare performance between models trained on downsampled data versus real low-resolution scans.
Result: Models trained on downsampled data produce sharper predictions but are inaccurate when applied to real scans, while models trained on real low-resolution scans smooth fine structures. The performance gap reveals that current SR methods are overstated and don’t actually recover structures lost in real low-resolution scans.
Conclusion: Progress in deep learning-based volumetric super-resolution requires datasets with paired real scans of high complexity like VoDaSuRe. Current methods trained on downsampled data don’t generalize to real applications, and the field needs to shift toward training on real paired data.
Abstract: Recent advances in volumetric super-resolution (SR) have demonstrated strong performance in medical and scientific imaging, with transformer- and CNN-based approaches achieving impressive results even at extreme scaling factors. In this work, we show that much of this performance stems from training on downsampled data rather than real low-resolution scans. This reliance on downsampling is partly driven by the scarcity of paired high- and low-resolution 3D datasets. To address this, we introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. When training models on VoDaSuRe, we reveal a significant discrepancy: SR models trained on downsampled data produce substantially sharper predictions than those trained on real low-resolution scans, which smooth fine structures. Conversely, applying models trained on downsampled data to real scans preserves more structure but is inaccurate. Our findings suggest that current SR methods are overstated - when applied to real data, they do not recover structures lost in low-resolution scans and instead predict a smoothed average. We argue that progress in deep learning-based volumetric SR requires datasets with paired real scans of high complexity, such as VoDaSuRe. Our dataset and code are publicly available through: https://augusthoeg.github.io/VoDaSuRe/
[184] Conformal Cross-Modal Active Learning
Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Tobias Glück, Anke Schmeink, Andreas Kugi
Main category: cs.CV
TL;DR: CCMA is a novel active learning framework that leverages vision-language models as teachers to provide conformally calibrated uncertainty estimates for selecting informative samples to label, achieving superior data efficiency compared to traditional AL methods.
Details
Motivation: While foundation models for vision have powerful capabilities, their potential for data-efficient learning remains untapped. Active learning aims to minimize annotation costs but existing methods overlook the rich multimodal knowledge in vision-language models.
Method: CCMA uses a teacher-student architecture where a pretrained VLM serves as a teacher to provide semantically grounded uncertainty estimates. These estimates are conformally calibrated to guide sample selection for a vision-only student model, integrating multimodal conformal scoring with diversity-aware selection strategies.
Result: CCMA consistently outperforms state-of-the-art AL baselines across multiple benchmarks, demonstrating clear advantages over methods relying solely on uncertainty or diversity metrics.
Conclusion: The framework successfully bridges vision and language modalities for active learning, showing that leveraging multimodal knowledge from VLMs can significantly improve data efficiency in visual recognition tasks.
Abstract: Foundation models for vision have transformed visual recognition with powerful pretrained representations and strong zero-shot capabilities, yet their potential for data-efficient learning remains largely untapped. Active Learning (AL) aims to minimize annotation costs by strategically selecting the most informative samples for labeling, but existing methods largely overlook the rich multimodal knowledge embedded in modern vision-language models (VLMs). We introduce Conformal Cross-Modal Acquisition (CCMA), a novel AL framework that bridges vision and language modalities through a teacher-student architecture. CCMA employs a pretrained VLM as a teacher to provide semantically grounded uncertainty estimates, conformally calibrated to guide sample selection for a vision-only student model. By integrating multimodal conformal scoring with diversity-aware selection strategies, CCMA achieves superior data efficiency across multiple benchmarks. Our approach consistently outperforms state-of-the-art AL baselines, demonstrating clear advantages over methods relying solely on uncertainty or diversity metrics.
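The "conformally calibrated" part follows the standard split-conformal recipe: on a held-out calibration set, compute nonconformity scores from the VLM teacher and take the finite-sample-corrected quantile as a threshold. A generic sketch of that recipe (the paper's exact acquisition rule may differ):

```python
import math

def conformal_quantile(scores, alpha=0.1):
    """Split-conformal calibration: the (1 - alpha) adjusted quantile of
    nonconformity scores on a held-out calibration set. Unlabeled samples
    whose teacher score exceeds this threshold can then be treated as
    uncertain and prioritized for labeling."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample correction
    return sorted(scores)[min(k, n) - 1]
```

With ten calibration scores 1..10 and `alpha=0.5`, the threshold is 6; the correction `(n + 1)` rather than `n` is what gives the marginal coverage guarantee.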
[185] Dual Contrastive Network for Few-Shot Remote Sensing Image Scene Classification
Zhong Ji, Liyuan Hou, Xuan Wang, Gang Wang, Yanwei Pang
Main category: cs.CV
TL;DR: Dual Contrastive Network (DCN) for few-shot remote sensing image classification uses two supervised contrastive learning branches to address small inter-class variances and large intra-class variances.
Details
Motivation: Remote sensing images have inherent challenges: small inter-class variances (different classes look similar) and large intra-class variances (same class looks different). Few-shot learning makes this even harder due to limited labeled samples.
Method: Proposes DCN with two auxiliary supervised contrastive learning branches: 1) Context-guided Contrastive Learning (CCL) with Condenser Network to capture context features for inter-class discriminability, and 2) Detail-guided Contrastive Learning (DCL) with Smelter Network to highlight local details for intra-class invariance.
Result: Extensive experiments on four public benchmark remote sensing datasets demonstrate competitive performance of DCN.
Conclusion: The dual contrastive approach effectively addresses the key challenges in few-shot remote sensing image classification by separately handling inter-class discriminability and intra-class invariance.
Abstract: Few-shot remote sensing image scene classification (FS-RSISC) aims at classifying remote sensing images with only a few labeled samples. The main challenges lie in small inter-class variances and large intra-class variances, which are inherent properties of remote sensing images. To address these challenges, we propose a transfer-based Dual Contrastive Network (DCN), which incorporates two auxiliary supervised contrastive learning branches during the training process. Specifically, one is a Context-guided Contrastive Learning (CCL) branch and the other is a Detail-guided Contrastive Learning (DCL) branch, which focus on inter-class discriminability and intra-class invariance, respectively. In the CCL branch, we first devise a Condenser Network to capture context features, and then leverage a supervised contrastive learning on top of the obtained context features to facilitate the model to learn more discriminative features. In the DCL branch, a Smelter Network is designed to highlight the significant local detail information. We then construct a supervised contrastive learning based on the detail feature maps to fully exploit the spatial information in each map, enabling the model to concentrate on invariant detail features. Extensive experiments on four public benchmark remote sensing datasets demonstrate the competitive performance of our proposed DCN.
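Both branches build on the standard supervised contrastive loss (Khosla et al.), which pulls same-class features together and pushes different-class features apart. A minimal pure-Python version over L2-normalized feature vectors, shown as an illustration of the loss family rather than the paper's implementation:

```python
import math

def supcon_loss(feats, labels, temp=0.1):
    """Supervised contrastive loss: for each anchor, positives are the
    other samples sharing its label; the loss is the mean negative
    log-probability of each positive against all non-anchor samples."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    n, loss = len(feats), 0.0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # anchors without positives contribute nothing
        denom = sum(math.exp(dot(feats[i], feats[j]) / temp)
                    for j in range(n) if j != i)
        for j in pos:
            loss -= math.log(math.exp(dot(feats[i], feats[j]) / temp) / denom) / len(pos)
    return loss / n
```

A batch whose same-class features already coincide yields a near-zero loss, while one whose same-class features point in different directions is penalized heavily.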
[186] GSwap: Realistic Head Swapping with Dynamic Neural Gaussian Field
Jingtao Zhou, Xuan Gao, Dongyu Liu, Junhui Hou, Yudong Guo, Juyong Zhang
Main category: cs.CV
TL;DR: GSwap is a video head-swapping system using dynamic neural Gaussian portrait priors for 3D-consistent, realistic face and head replacement with seamless background integration.
Details
Motivation: Previous methods for video head-swapping suffer from poor 3D consistency, unnatural facial expressions, restricted synthesis quality, insufficient holistic head modeling, and ineffective background blending, resulting in visible artifacts and misalignments.
Method: Introduces an intrinsic 3D Gaussian feature field embedded within a full-body SMPL-X surface, elevating 2D portrait videos into a dynamic neural Gaussian field. Uses few-shot domain adaptation of a pretrained 2D portrait generative model and neural re-rendering for seamless background integration.
Result: Extensive experiments show GSwap surpasses existing methods in visual quality, temporal coherence, identity preservation, and 3D consistency.
Conclusion: GSwap advances the state of the art in video head-swapping by overcoming limitations of previous 2D and 3DMM-based approaches through dynamic neural Gaussian portrait priors and holistic modeling.
Abstract: We present GSwap, a novel consistent and realistic video head-swapping system empowered by dynamic neural Gaussian portrait priors, which significantly advances the state of the art in face and head replacement. Unlike previous methods that rely primarily on 2D generative models or 3D Morphable Face Models (3DMM), our approach overcomes their inherent limitations, including poor 3D consistency, unnatural facial expressions, and restricted synthesis quality. Moreover, existing techniques struggle with full head-swapping tasks due to insufficient holistic head modeling and ineffective background blending, often resulting in visible artifacts and misalignments. To address these challenges, GSwap introduces an intrinsic 3D Gaussian feature field embedded within a full-body SMPL-X surface, effectively elevating 2D portrait videos into a dynamic neural Gaussian field. This innovation ensures high-fidelity, 3D-consistent portrait rendering while preserving natural head-torso relationships and seamless motion dynamics. To facilitate training, we adapt a pretrained 2D portrait generative model to the source head domain using only a few reference images, enabling efficient domain adaptation. Furthermore, we propose a neural re-rendering strategy that harmoniously integrates the synthesized foreground with the original background, eliminating blending artifacts and enhancing realism. Extensive experiments demonstrate that GSwap surpasses existing methods in multiple aspects, including visual quality, temporal coherence, identity preservation, and 3D consistency.
[187] myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition
Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi
Main category: cs.CV
TL;DR: Systematic benchmark of Burmese handwritten digit dataset (BHDD) evaluating 11 architectures including CNN, LSTM, GRU, Transformer, KAN variants, energy-based JEM, and physics-inspired PETNN variants, with CNN achieving best performance.
Details
Motivation: BHDD serves as foundational resource for Myanmar NLP/AI but lacks comprehensive, reproducible performance baselines across modern architectures. Need to establish standardized benchmarks for regional scripts.
Method: Evaluated 11 architectures on BHDD: classical deep learning models (MLP, CNN, LSTM, GRU, Transformer), recent alternatives (FastKAN, EfficientKAN), energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Used Precision, Recall, F1-Score, and Accuracy metrics.
Result: CNN achieved best overall scores (F1 = 0.9959, Accuracy = 0.9970). PETNN (GELU) closely followed (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM performed competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models trailed top performers but provided meaningful alternative baseline (Accuracy ~0.992).
Conclusion: Established reproducible baselines for BHDD across diverse modeling paradigms, highlighted PETNN’s strong performance relative to classical and Transformer-based models, and quantified gap between energy-inspired PETNNs and true energy-based model (JEM). Released benchmark to facilitate future research on Myanmar digit recognition.
Abstract: We present the first systematic benchmark on a standardized iteration of the publicly available Burmese Handwritten Digit Dataset (BHDD), which we have designated as myMNIST Benchmarking. While BHDD serves as a foundational resource for Myanmar NLP/AI, it lacks a comprehensive, reproducible performance baseline across modern architectures. We evaluate eleven architectures spanning classical deep learning models (Multi-Layer Perceptron, Convolutional Neural Network, Long Short-Term Memory, Gated Recurrent Unit, Transformer), recent alternatives (FastKAN, EfficientKAN), an energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Using Precision, Recall, F1-Score, and Accuracy as evaluation metrics, our results show that the CNN remains a strong baseline, achieving the best overall scores (F1 = 0.9959, Accuracy = 0.9970). The PETNN (GELU) model closely follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM, representing energy-based modeling, performs competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models (FastKAN, EfficientKAN) trail the top performers but provide a meaningful alternative baseline (Accuracy ~0.992). These findings (i) establish reproducible baselines for BHDD across diverse modeling paradigms, (ii) highlight PETNN’s strong performance relative to classical and Transformer-based models, and (iii) quantify the gap between energy-inspired PETNNs and a true energy-based model (JEM). We release this benchmark to facilitate future research on Myanmar digit recognition and to encourage broader evaluation of emerging architectures on regional scripts.
[188] Gimbal360: Differentiable Auto-Leveling for Canonicalized $360^\circ$ Panoramic Image Completion
Yuqin Lu, Haofeng Liu, Yang Zhou, Jun Liang, Shengfeng He, Jing Li
Main category: cs.CV
TL;DR: Gimbal360 is a framework for 360° panoramic completion from unposed perspective images using geometric regularization and topological equivariance.
Details
Motivation: Diffusion models work well for 2D outpainting but struggle with 360° panoramic completion due to geometric/topological mismatches between perspective images and spherical panoramas.
Method: Introduces Canonical Viewing Space for geometric regularization, Differentiable Auto-Leveling for feature stabilization without camera parameters, and topological equivariance in latent space to preserve seamless periodic structure.
Result: Achieves state-of-the-art performance in structurally consistent 360° scene completion, validated on the Horizon360 dataset.
Conclusion: Explicitly standardizing geometric and topological priors enables effective 360° panoramic completion from unposed perspective images.
Abstract: Diffusion models excel at 2D outpainting, but extending them to $360^\circ$ panoramic completion from unposed perspective images is challenging due to the geometric and topological mismatch between perspective projections and spherical panoramas. We present Gimbal360, a principled framework that explicitly bridges perspective observations and spherical panoramas. We introduce a Canonical Viewing Space that regularizes projective geometry and provides a consistent intermediate representation between the two domains. To anchor in-the-wild inputs to this space, we propose a Differentiable Auto-Leveling module that stabilizes feature orientation without requiring camera parameters at inference. Panoramic generation also introduces a topological challenge. Standard generative architectures assume a bounded Euclidean image plane, while Equirectangular Projection (ERP) panoramas exhibit intrinsic $S^1$ periodicity. Euclidean operations therefore break boundary continuity. We address this mismatch by enforcing topological equivariance in the latent space to preserve seamless periodic structure. To support this formulation, we introduce Horizon360, a curated large-scale dataset of gravity-aligned panoramic environments. Extensive experiments show that explicitly standardizing geometric and topological priors enables Gimbal360 to achieve state-of-the-art performance in structurally consistent $360^\circ$ scene completion.
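The $S^1$-periodicity argument is concrete: an ERP image's left and right edges are the same meridian, so Euclidean zero-padding tears the seam while wrap-around (circular) padding preserves it. A minimal illustration of that idea on raw pixel rows (the paper enforces the analogous property in latent space):

```python
def circular_pad(rows, pad):
    """Pad each image row by wrapping columns around, so the left and
    right edges of an equirectangular panorama stay continuous under
    convolution-style windowing."""
    return [row[-pad:] + row + row[:pad] for row in rows]
```

For example, padding the row `[1, 2, 3, 4]` by one column yields `[4, 1, 2, 3, 4, 1]`, whereas zero-padding would place spurious zeros at the seam.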
[189] ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
Yeonkyung Lee, Dayun Ju, Youngmin Kim, Seil Kang, Seong Jae Hwang
Main category: cs.CV
TL;DR: ViKey enhances VideoLLMs’ temporal reasoning by adding visual ordinal cues and keyword-frame mapping to sparse frame sequences, achieving dense-frame performance with only 20% of frames.
Details
Motivation: Current efficiency methods for VideoLLMs (like frame selection) reduce computational cost but hurt temporal reasoning performance. Unlike humans who can infer event progression from sparse cues, VideoLLMs struggle with temporal relations when intermediate frames are omitted.
Method: ViKey combines visual prompting (annotating frames with explicit ordinal information) with a lightweight Keyword-Frame Mapping module that uses frame indices as dictionary-like keys to link textual cues to relevant frames, providing explicit temporal anchors.
Result: The approach substantially improves temporal reasoning and preserves dense-frame baseline performance on some datasets with as few as 20% of frames, despite being training-free.
Conclusion: Simple visual prompting with ordinal cues and keyword-frame mapping can effectively enhance VideoLLMs’ temporal understanding in sparse frame settings, offering an efficient solution to computational cost challenges.
Abstract: Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.
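The Keyword-Frame Mapping idea is essentially an inverted index over sampled frames: frame indices act as dictionary keys, so textual cues resolve to the most relevant frames and give the model explicit temporal anchors. A sketch under an assumed input format (`frame_keywords` maps sampled frame index to keywords; the real module works on model features, not strings):

```python
def keyword_frame_map(frame_keywords):
    """Invert a {frame_index: [keywords]} mapping into
    {keyword: sorted frame indices}, mimicking KFM's dictionary-like
    lookup from textual cues back to frames."""
    inverted = {}
    for idx, words in frame_keywords.items():
        for w in words:
            inverted.setdefault(w, []).append(idx)
    return {w: sorted(ids) for w, ids in inverted.items()}
```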
[190] Gaze-Regularized VLMs for Ego-Centric Behavior Understanding
Anupam Pani, Yanchao Yang
Main category: cs.CV
TL;DR: A gaze-regularized framework enhances Vision Language Models for egocentric behavior understanding by incorporating gaze information to align model attention with human attention patterns, improving future event prediction.
Details
Motivation: Existing Vision Language Models rely solely on visual data and overlook gaze information, which provides critical insights into human intentions and future actions. The authors aim to leverage gaze data to enhance VLMs' predictive capabilities for egocentric behavior understanding.
Method: The approach directly incorporates gaze information into VLM architecture during training using gaze-based queries that dynamically focus on gaze-highlighted regions. A gaze-regularization mechanism ensures alignment of model attention with human attention patterns. Extensive experiments explore various strategies for integrating gaze data.
Result: Experimental results demonstrate nearly 13% improvement in semantic scores compared to baseline models not leveraging gaze data. The framework enables prediction of future events with detailed action descriptions.
Conclusion: This work establishes a foundation for leveraging human gaze in VLMs, significantly boosting their predictive capabilities for applications requiring accurate and robust future event prediction in egocentric settings.
Abstract: Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach directly incorporates gaze information into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism ensures the alignment of model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results demonstrate a nearly 13% improvement in semantic scores compared to baseline models not leveraging gaze data, highlighting the effectiveness of our approach. This work establishes a foundation for leveraging the human gaze in VLMs, significantly boosting their predictive capabilities in applications requiring accurate and robust future event prediction.
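The abstract does not spell out the form of the gaze-regularization term; a natural choice is a divergence between the model's normalized attention map and the human gaze heatmap. The following is therefore a hypothetical sketch (flattened 1-D maps, KL divergence, and the epsilon smoothing are all our assumptions):

```python
import math

def gaze_alignment_loss(attn, gaze, eps=1e-8):
    """KL divergence from the normalized gaze heatmap to the normalized
    model attention map; zero when the two distributions coincide,
    large when attention lands where the human never looked."""
    za, zg = sum(attn) + eps, sum(gaze) + eps
    return sum((g / zg) * math.log((g / zg + eps) / (a / za + eps))
               for a, g in zip(attn, gaze))
```

Adding such a term to the task loss penalizes attention maps that diverge from recorded human gaze, which is the stated goal of the regularizer.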
[191] FDIF: Formula-Driven supervised Learning with Implicit Functions for 3D Medical Image Segmentation
Yukinori Yamamoto, Kazuya Nishimura, Tsukasa Fukusato, Hirokazu Nosato, Tetsuya Ogata, Hirokatsu Kataoka
Main category: cs.CV
TL;DR: FDIF introduces an implicit-function representation using signed distance functions for formula-driven 3D medical image segmentation, enabling data-free pre-training without real medical data or expert annotations.
Details
Motivation: Medical image segmentation requires large labeled datasets, but acquiring such data is difficult due to privacy constraints and high annotation costs. Existing voxel-based formula-driven methods have limited geometric expressiveness and cannot synthesize realistic textures.
Method: FDIF uses implicit-function representation based on signed distance functions (SDFs) to model complex geometries compactly. It exploits SDF surface representations to support controllable synthesis of both geometric and intensity textures, enabling scalable pre-training without real data.
Result: FDIF consistently outperforms formula-driven methods and achieves performance comparable to self-supervised approaches pre-trained on large-scale real datasets across three medical segmentation benchmarks (AMOS, ACDC, KiTS) with three different architectures.
Conclusion: Implicit-function-based formula supervision is a promising paradigm for data-free representation learning in 3D medical imaging, with potential benefits extending to 3D classification tasks.
Abstract: Deep learning-based 3D medical image segmentation methods rely on large-scale labeled datasets, yet acquiring such data is difficult due to privacy constraints and the high cost of expert annotation. Formula-Driven Supervised Learning (FDSL) offers an appealing alternative by generating training data and labels directly from mathematical formulas. However, existing voxel-based approaches are limited in geometric expressiveness and cannot synthesize realistic textures. We introduce Formula-Driven supervised learning with Implicit Functions (FDIF), a framework that enables scalable pre-training without using any real data or medical expert annotations. FDIF introduces an implicit-function representation based on signed distance functions (SDFs), enabling compact modeling of complex geometries while exploiting the surface representation of SDFs to support controllable synthesis of both geometric and intensity textures. Across three medical image segmentation benchmarks (AMOS, ACDC, and KiTS) and three architectures (SwinUNETR, nnUNet ResEnc-L, and nnUNet Primus-M), FDIF consistently improves over a formula-driven method, and achieves performance comparable to self-supervised approaches pre-trained on large-scale real datasets. We further show that FDIF pre-training also benefits 3D classification tasks, highlighting implicit-function-based formula supervision as a promising paradigm for data-free representation learning. Code is available at https://github.com/yamanoko/FDIF.
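The core idea behind FDIF, deriving geometry from a formula and getting segmentation labels for free from the sign of an SDF, can be sketched in a few lines. The sphere primitive and the exponential intensity rule below are illustrative stand-ins for the paper's controllable synthesis, not its actual generators:

```python
import numpy as np

def sphere_sdf(coords, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(coords - center, axis=-1) - radius

def synthesize_volume(size=32, radius=10.0):
    """Formula-driven training sample in the spirit of FDIF: an SDF
    defines the geometry, the segmentation label is its sign, and a
    simple intensity texture is derived from distance to the surface."""
    grid = np.stack(np.meshgrid(*[np.arange(size)] * 3, indexing="ij"),
                    axis=-1).astype(float)
    center = np.full(3, size / 2)
    sdf = sphere_sdf(grid, center, radius)
    label = (sdf < 0).astype(np.uint8)       # exact label, no annotator needed
    intensity = np.exp(-np.abs(sdf) / 4.0)   # bright near the surface
    return intensity, label
```

Because the label is computed analytically from the same formula that generates the image, the pre-training signal is noise-free by construction.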
[192] Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation
Anupam Pani, Yanchao Yang
Main category: cs.CV
TL;DR: Gaze-regularized training framework for VLA models that aligns robot attention with human visual patterns to improve fine-grained manipulation tasks without architectural changes or inference overhead.
Details
Motivation: Current Vision-Language-Action (VLA) models struggle with fine-grained robotic manipulation tasks due to lack of mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns, offering a powerful supervisory signal for guiding robot perception.
Method: Introduces a gaze-regularized training framework that transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes transformer attention through KL divergence. This creates an inductive bias toward task-relevant features while preserving deployment efficiency, requiring no architectural modifications or inference-time overhead.
Result: Achieves 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Learned attention patterns produce interpretable visualizations that mirror human strategies.
Conclusion: Human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability. The framework requires no eye-tracking equipment and applies directly to existing datasets, enhancing trust in robotic systems.
Abstract: Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns – offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models’ internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer’s attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.
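The regularizer described above reduces to a KL term between a patch-pooled gaze distribution and the model's attention over patches. A minimal NumPy sketch with illustrative shapes (a 14x14 patch grid), not the paper's actual interface:

```python
import numpy as np

def gaze_kl_loss(attn, gaze, grid=14):
    """KL(gaze || attention) over image patches.
    attn: (grid*grid,) attention mass per patch; gaze: (H, W) aggregated
    gaze heatmap with H and W divisible by `grid`."""
    H, W = gaze.shape
    # Average-pool the pixel-level heatmap down to the patch grid.
    pooled = gaze.reshape(grid, H // grid, grid, W // grid).mean(axis=(1, 3)).ravel()
    target = pooled / max(pooled.sum(), 1e-8)
    attn = attn / max(attn.sum(), 1e-8)
    # Attention mass placed outside gaze-highlighted patches is penalized.
    m = target > 0
    return float(np.sum(target[m] * (np.log(target[m])
                                     - np.log(np.clip(attn[m], 1e-8, None)))))
```

The loss vanishes when attention matches the gaze distribution and grows as attention drifts off the gazed regions, which is the inductive bias the paper describes.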
[193] Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression
V. K. Cody Bumgardner, Mitchell A. Klusty, Mahmut S. Gokmen, Evan W. Damron
Main category: cs.CV
TL;DR: Ker-VLJEPA-3B: A curriculum learning framework for free-text radiology report generation from thoracic CT volumes using a self-supervised visual encoder and Llama 3.2 decoder, achieving state-of-the-art performance on CT-RATE benchmark.
Details
Motivation: Address challenges in automated radiology report generation from 3D CT volumes: extreme sequence lengths, severe class imbalance, and LLMs' tendency to ignore visual tokens in favor of linguistic priors.
Method: Four-phase curriculum learning: 1) self-supervised visual encoder (LeJEPA ViT-Large) trained on unlabeled CTs, 2) bridge phase for vision-language alignment, 3) generation phase, 4) fine-tuning. Key innovations: zone-constrained cross-attention, PCA whitening, positive-findings-only strategy, warm bridge initialization, and selective cross-attention freezing.
Result: Achieves macro F1 of 0.429 on CT-RATE benchmark (2,984 validation volumes, 18 classes), surpassing state-of-the-art U-VLM (0.414) by 3.6%, reaching 0.448 (+8.2%) with threshold optimization. Ablation shows 56.6% of generation quality from patient-specific visual content.
Conclusion: Proposed framework successfully integrates self-supervised visual encoders with LLMs for medical report generation, demonstrating superior performance through careful curriculum design and architectural innovations that ensure visual grounding.
Abstract: Automated radiology report generation from 3D computed tomography (CT) volumes is challenging due to extreme sequence lengths, severe class imbalance, and the tendency of large language models (LLMs) to ignore visual tokens in favor of linguistic priors. We present Ker-VLJEPA-3B, a four-phase curriculum learning framework for free-text report generation from thoracic CT volumes. A phased training curriculum progressively adapts a Llama 3.2 3B decoder to ground its output in visual features from a frozen, self-supervised encoder. Our visual backbone (LeJEPA ViT-Large) is trained via self-supervised joint-embedding prediction on unlabeled CTs, without text supervision. Unlike contrastive models (CLIP, BiomedCLIP), this language-free backbone yields modality-pure representations. Vision-language alignment is deferred to the curriculum’s bridge and generation phases. This modality-agnostic design can integrate any self-supervised encoder into an LLM without paired text during foundation training. Methodological innovations include: (1) zone-constrained cross-attention compressing slice embeddings into 32 spatially-grounded visual tokens; (2) PCA whitening of anisotropic LLM embeddings; (3) a positive-findings-only strategy eliminating posterior collapse; (4) warm bridge initialization transferring projection weights; and (5) selective cross-attention freezing with elastic weight consolidation to prevent catastrophic forgetting. Evaluated on the CT-RATE benchmark (2,984 validation volumes, 18 classes), Ker-VLJEPA-3B achieves a macro F1 of 0.429, surpassing the state-of-the-art (U-VLM, macro F1 = 0.414) by 3.6%, and reaching 0.448 (+8.2%) with threshold optimization. Ablation studies confirm 56.6% of generation quality derives from patient-specific visual content. Code and weights are available.
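Of the listed innovations, PCA whitening is the most self-contained: it maps the anisotropic LLM embedding space to one with zero mean and identity covariance. A generic ZCA-style sketch; the paper's exact variant is not specified here:

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """Whiten embedding rows of X (N, D): after the transform the
    features are zero-mean with (near-)identity covariance, countering
    the anisotropy of LLM embedding spaces. ZCA form: the symmetric
    whitening matrix keeps features close to their original axes."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W
```

After whitening, no single embedding direction dominates, which makes the projected visual tokens easier for the decoder's attention to use.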
[194] PoseDriver: A Unified Approach to Multi-Category Skeleton Detection for Autonomous Driving
Yasamin Borhani, Taylor Mordan, Yihan Wang, Reyhaneh Hosseininejad, Javad Khoramdel, Alexandre Alahi
Main category: cs.CV
TL;DR: PoseDriver: A unified bottom-up framework for multi-category skeleton detection in driving scenarios, achieving SOTA on lane detection and demonstrating transferability to novel categories like bicycles.
Details
Motivation: Object skeletons provide structural information crucial for autonomous driving, but existing methods lack a unified architecture that can handle multiple instances and categories simultaneously using only input images.
Method: PoseDriver is a unified bottom-up framework for multi-category skeleton detection. It models each category as a distinct task to address multi-task learning challenges, with a novel approach for lane detection based on skeleton representations.
Result: Achieves state-of-the-art performance on the OpenLane dataset for lane detection, presents a new bicycle skeleton detection dataset, and demonstrates framework transferability to novel categories.
Conclusion: The proposed PoseDriver framework effectively addresses multi-category skeleton detection in driving scenarios, showing strong performance and transferability potential.
Abstract: Object skeletons offer a concise representation of structural information, capturing essential aspects of posture and orientation that are crucial for autonomous driving applications. However, a unified architecture that simultaneously handles multiple instances and categories using only the input image remains elusive. In this paper, we introduce PoseDriver, a unified framework for bottom-up multi-category skeleton detection tailored to common objects in driving scenarios. We model each category as a distinct task to systematically address the challenges of multi-task learning. Specifically, we propose a novel approach for lane detection based on skeleton representations, achieving state-of-the-art performance on the OpenLane dataset. Moreover, we present a new dataset for bicycle skeleton detection and assess the transferability of our framework to novel categories. Experimental results validate the effectiveness of the proposed approach.
[195] GO-Renderer: Generative Object Rendering with 3D-aware Controllable Video Diffusion Models
Zekai Gu, Shuoxuan Feng, Yansong Wang, Hanzhuo Huang, Zhongshuo Du, Chengfeng Zhao, Chengwei Ren, Peng Wang, Yuan Liu
Main category: cs.CV
TL;DR: GO-Renderer integrates 3D reconstruction with diffusion models for high-quality object rendering with accurate viewpoint control and lighting adaptation
Details
Motivation: Current feedforward 3D reconstruction methods struggle with complex appearances, while diffusion models lack precise viewpoint control. There's a need to combine 3D reconstruction's geometric accuracy with diffusion models' appearance quality for controllable object rendering.
Method: Proposes GO-Renderer, a unified framework that uses reconstructed 3D proxies to guide video generative models. The 3D proxy provides accurate viewpoint control, while diffusion models handle appearance and lighting without explicit material modeling.
Result: Achieves state-of-the-art performance in object rendering tasks including novel viewpoint synthesis, novel lighting environment rendering, and object insertion into existing videos.
Conclusion: GO-Renderer successfully combines 3D reconstruction and diffusion models to enable high-quality, controllable object rendering with accurate viewpoints and lighting adaptation.
Abstract: Reconstructing a renderable 3D model from images is a useful but challenging task. Recent feedforward 3D reconstruction methods have demonstrated remarkable success in efficiently recovering geometry, but still cannot accurately model the complex appearances of these 3D reconstructed models. Recent diffusion-based generative models can synthesize realistic images or videos of an object using reference images without explicitly modeling its appearance, which provides a promising direction for object rendering, but lacks accurate control over the viewpoints. In this paper, we propose GO-Renderer, a unified framework integrating the reconstructed 3D proxies to guide the video generative models to achieve high-quality object rendering on arbitrary viewpoints under arbitrary lighting conditions. Our method not only enjoys the accurate viewpoint control using the reconstructed 3D proxy but also enables high-quality rendering in different lighting environments using diffusion generative models without explicitly modeling complex materials and lighting. Extensive experiments demonstrate that GO-Renderer achieves state-of-the-art performance across the object rendering tasks, including synthesizing images on new viewpoints, rendering the objects in a novel lighting environment, and inserting an object into an existing video.
[196] CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
Yuchen Wu, Kun Wang, Yining Pan, Na Zhao
Main category: cs.CV
TL;DR: A robust cross-domain 3D object detection method that addresses modality imbalance in dual-branch detectors through query-decoupled supervision, LiDAR-guided depth priors, and complementary cross-modal masking.
Details
Motivation: Current multi-modal 3D object detectors degrade significantly when deployed in challenging domains (rain, nighttime) due to: 1) severe degradation of one modality, and 2) LiDAR branch dominance leading to underutilization of visual cues and vulnerability when point clouds are compromised.
Method: Three key components: 1) Query-Decoupled Loss provides independent supervision for 2D-only, 3D-only, and fused queries to rebalance gradient flow; 2) LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors through probabilistic fusion of image-predicted and LiDAR-derived depth distributions; 3) Complementary Cross-Modal Masking applies complementary spatial masks to image and point cloud to encourage competition between modalities in the fused decoder.
Result: Extensive experiments demonstrate substantial gains over state-of-the-art baselines while preserving source-domain performance. The method shows improved robustness in challenging domains like rain and nighttime.
Conclusion: The proposed approach effectively addresses modality imbalance in cross-domain 3D object detection, promoting adaptive fusion and improving robustness when one modality is degraded.
Abstract: Multi-modal fusion has emerged as a promising paradigm for accurate 3D object detection. However, performance degrades substantially when deployed in target domains different from training. In this work, focusing on dual-branch proposal-level detectors, we identify two factors that limit robust cross-domain generalization: 1) in challenging domains such as rain or nighttime, one modality may undergo severe degradation; 2) the LiDAR branch often dominates the detection process, leading to systematic underutilization of visual cues and vulnerability when point clouds are compromised. To address these challenges, we propose three components. First, Query-Decoupled Loss provides independent supervision for 2D-only, 3D-only, and fused queries, rebalancing gradient flow across modalities. Second, LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors through probabilistic fusion of image-predicted and LiDAR-derived depth distributions, improving their spatial initialization. Third, Complementary Cross-Modal Masking applies complementary spatial masks to the image and point cloud, encouraging queries from both modalities to compete within the fused decoder and thereby promoting adaptive fusion. Extensive experiments demonstrate substantial gains over state-of-the-art baselines while preserving source-domain performance. Code and models are publicly available at https://github.com/IMPL-Lab/CCF.
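The LiDAR-Guided Depth Prior fuses two depth estimates probabilistically. Under a Gaussian assumption this reduces to standard precision-weighted averaging; the sketch below shows that generic form, which may differ from the distributions the paper actually fuses:

```python
def fuse_depth(mu_img, var_img, mu_lidar, var_lidar):
    """Precision-weighted fusion of an image-predicted and a
    LiDAR-derived Gaussian depth estimate. The fused variance is
    always below either input's, so the prior can only sharpen the
    2D query's spatial initialization."""
    w = var_lidar / (var_img + var_lidar)   # trust the lower-variance source more
    mu = w * mu_img + (1.0 - w) * mu_lidar
    var = var_img * var_lidar / (var_img + var_lidar)
    return mu, var
```

When the LiDAR return is degraded (large `var_lidar`, e.g. in rain), the weight shifts automatically toward the image prediction, which matches the paper's goal of robustness when one modality is compromised.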
[197] WaveSFNet: A Wavelet-Based Codec and Spatial–Frequency Dual-Domain Gating Network for Spatiotemporal Prediction
Xinyong Cai, Runming Xie, Hu Chen, Yuankai Wu
Main category: cs.CV
TL;DR: WaveSFNet: An efficient spatiotemporal predictive learning framework using wavelet-based codec and spatial-frequency dual-domain gated translator for high-quality future frame prediction.
Details
Motivation: Existing efficient recurrent-free frameworks for spatiotemporal prediction often lose high-frequency details due to strided convolutions/pooling, and struggle to balance local interactions with global propagation. Models must preserve textures and boundaries while capturing long-range dynamics.
Method: 1) A wavelet-based codec preserves high-frequency subband cues during downsampling/reconstruction. 2) A spatial-frequency dual-domain gated translator injects adjacent-frame differences for dynamic information and performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, with gated channel interaction for cross-channel feature exchange.
Result: Achieves competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench datasets while maintaining low computational complexity.
Conclusion: WaveSFNet effectively addresses the trade-off between preserving high-frequency details and modeling long-range dynamics in spatiotemporal prediction through wavelet-based codec and spatial-frequency dual-domain fusion.
Abstract: Spatiotemporal predictive learning aims to forecast future frames from historical observations in an unsupervised manner, and is critical to a wide range of applications. The key challenge is to model long-range dynamics while preserving high-frequency details for sharp multi-step predictions. Existing efficient recurrent-free frameworks typically rely on strided convolutions or pooling for sampling, which tends to discard textures and boundaries, while purely spatial operators often struggle to balance local interactions with global propagation. To address these issues, we propose WaveSFNet, an efficient framework that unifies a wavelet-based codec with a spatial–frequency dual-domain gated spatiotemporal translator. The wavelet-based codec preserves high-frequency subband cues during downsampling and reconstruction. Meanwhile, the translator first injects adjacent-frame differences to explicitly enhance dynamic information, and then performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, together with gated channel interaction for cross-channel feature exchange. Extensive experiments demonstrate that WaveSFNet achieves competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench, while maintaining low computational complexity. Our code is available at https://github.com/fhjdqaq/WaveSFNet.
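The codec's key property, downsampling without discarding detail, is easiest to see with the simplest wavelet. The one-level Haar transform below halves resolution while retaining the three high-frequency subbands, so the input is exactly recoverable; WaveSFNet's actual filter choice may differ:

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2D Haar transform on an even-sized array.
    Returns the low-frequency approximation (LL) plus the three
    high-frequency subbands (LH, HL, HH) that strided pooling discards."""
    lo = (x[:, 0::2] + x[:, 1::2]) / 2  # row-wise averages
    hi = (x[:, 0::2] - x[:, 1::2]) / 2  # row-wise differences
    ll, lh = (lo[0::2] + lo[1::2]) / 2, (lo[0::2] - lo[1::2]) / 2
    hl, hh = (hi[0::2] + hi[1::2]) / 2, (hi[0::2] - hi[1::2]) / 2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse: interleave sums and differences back."""
    lo = np.empty((ll.shape[0] * 2, ll.shape[1]))
    lo[0::2], lo[1::2] = ll + lh, ll - lh
    hi = np.empty_like(lo)
    hi[0::2], hi[1::2] = hl + hh, hl - hh
    x = np.empty((lo.shape[0], lo.shape[1] * 2))
    x[:, 0::2], x[:, 1::2] = lo + hi, lo - hi
    return x
```

A strided-pooling encoder keeps only the LL-like component; carrying LH/HL/HH through the codec is what lets the decoder reconstruct sharp textures and boundaries.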
[198] Knots-10: A Tightness-Stratified Benchmark for Real-World Knot Classification with Topological Difficulty Analysis
Shiheng Nie, Yunguang Yue
Main category: cs.CV
TL;DR: Knots-10 benchmark for fine-grained visual classification of physical knots where appearance cues are suppressed and class identity resides in crossing structure, with models achieving high accuracy but suffering from domain shift when tested on phone photographs.
Details
Motivation: To study fine-grained visual classification in scenarios where appearance cues are deliberately suppressed, using physical knot classification as a testbed where different classes share the same material, color, and background, with class identity primarily in crossing structure.
Method: Created Knots-10 benchmark with 1,440 images using deployment-oriented split (train on loosely tied knots, test on tightly dressed ones). Evaluated various backbones including Swin-T, TransFG, PMG. Proposed TACA regularization to improve embedding-topology alignment. Conducted cross-domain test with phone photographs.
Result: Swin-T and TransFG both achieved 97.2% accuracy; PMG scored 94.5%. Topological distance significantly correlated with confusion patterns in 3 of 5 models. TACA regularization improved embedding-topology alignment from rho=0.46 to rho=0.65 but didn’t improve classification accuracy. Cross-domain test revealed 58-69 percentage-point accuracy drop, exposing rope appearance bias as dominant failure mode.
Conclusion: Physical knot classification reveals challenges in fine-grained visual classification when appearance cues are suppressed. While models achieve high accuracy on in-domain data, they suffer severe performance drops in cross-domain settings due to appearance bias, highlighting the need for more robust visual understanding beyond surface features.
Abstract: Physical knot classification is a fine-grained visual classification (FGVC) scenario in which appearance cues are deliberately suppressed: different classes share the same rope material, color, and background, and class identity resides primarily in crossing structure. We introduce the Knots-10 benchmark, comprising 1,440 images with a deployment-oriented split that trains on loosely tied knots and tests on tightly dressed ones. Swin-T and TransFG both average 97.2% accuracy; PMG scores 94.5%, consistent with the hypothesis that jigsaw shuffling disrupts crossing continuity. McNemar tests cannot separate four of the five general-purpose backbones, so small ranking margins should be interpreted with caution. A Mantel permutation test shows that topological distance significantly correlates with confusion patterns in three of the five models (p < 0.01). We propose TACA regularization, which improves embedding-topology alignment from rho=0.46 to rho=0.65 without improving classification accuracy; a random-distance ablation yields comparable alignment, indicating the benefit is likely driven by generic regularization. A pilot cross-domain test with 100 phone photographs reveals a 58-69 percentage-point accuracy drop, exposing rope appearance bias as the dominant failure mode.
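The paper's link between topological distance and confusion patterns rests on a Mantel permutation test. A generic NumPy sketch of that statistic (matrix names and sizes are illustrative):

```python
import numpy as np

def mantel_test(D1, D2, n_perm=999, seed=0):
    """Permutation test for correlation between two distance matrices,
    e.g. topological distance between knot classes vs. a model's
    confusion-derived distances. Returns the observed Pearson r over the
    upper triangles and a one-sided permutation p-value."""
    n = D1.shape[0]
    iu = np.triu_indices(n, k=1)
    r_obs = np.corrcoef(D1[iu], D2[iu])[0, 1]
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(n)                      # relabel rows/cols jointly
        r = np.corrcoef(D1[iu], D2[np.ix_(p, p)][iu])[0, 1]
        if r >= r_obs:
            count += 1
    return r_obs, (count + 1) / (n_perm + 1)
```

Permuting one matrix's labels (rather than its entries) preserves the distance-matrix structure under the null, which is what makes the test valid for non-independent pairwise distances.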
[199] Mamba-driven MRI-to-CT Synthesis for MRI-only Radiotherapy Planning
Konstantinos Barmpounakis, Theodoros P. Vagenas, Maria Vakalopoulou, George K. Matsopoulos
Main category: cs.CV
TL;DR: Mamba-based architectures (U-Mamba and SegMamba) adapted for MRI-to-CT synthesis in radiotherapy planning, showing advantages over nnU-Net for cross-modality translation with efficient volumetric feature capture.
Details
Motivation: MRI-only treatment planning reduces radiation exposure and registration errors, but current nnU-Net frameworks for MRI-to-CT synthesis may be improved by state-space models that better capture long-range dependencies in volumetric data.
Method: Adapted U-Mamba and SegMamba architectures (originally for segmentation) for 3D cross-modality image generation. Used state-space modeling to capture complex volumetric features and long-range dependencies for accurate CT synthesis from MRI.
Result: Mamba architectures effectively synthesized CT from MRI while maintaining fast inference times. Evaluated on SynthRAD2025 dataset using image similarity metrics (HU) and segmentation-based metrics from TotalSegmentator to ensure geometric consistency.
Conclusion: State-space models show promise for MRI-to-CT synthesis in radiotherapy workflows, offering advantages over standard convolutional networks for cross-modality translation with efficient volumetric feature capture.
Abstract: Radiotherapy workflows for oncological patients increasingly rely on multi-modal medical imaging, commonly involving both Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). MRI-only treatment planning has emerged as an attractive alternative, as it reduces patient exposure to ionizing radiation and avoids errors introduced by inter-modality registration. While nnU-Net-based frameworks are predominantly used for MRI-to-CT synthesis, we explore Mamba-based architectures for this task, aiming to showcase the advantages of state-space modeling for cross-modality translation compared to standard convolutional neural networks. Specifically, we adapt both the U-Mamba and the SegMamba architecture, originally proposed for segmentation, to perform cross-modality image generation. Our 3D Mamba architecture effectively captures complex volumetric features and long-range dependencies, thus allowing accurate CT synthesis while maintaining fast inference times. Experiments were conducted on a subset of the SynthRAD2025 dataset, comprising registered single-channel MRI-CT volume pairs across three anatomical regions. Quantitative evaluation is performed via a combination of image similarity metrics computed in Hounsfield Units (HU) and segmentation-based metrics obtained from TotalSegmentator to ensure geometric consistency is preserved. The findings pave the way for the integration of state-space models into radiotherapy workflows.
[200] ARGENT: Adaptive Hierarchical Image-Text Representations
Chuong Huynh, Hossein Souri, Abhinav Kumar, Vitali Petsiuk, Deen Dayal Mohan, Suren Kumar
Main category: cs.CV
TL;DR: ARGENT introduces a hyperbolic vision-language model with adaptive entailment loss and norm regularization to prevent cone collapse, plus a probabilistic entailment protocol for hierarchical evaluation.
Details
Motivation: Current hyperbolic VLMs suffer from unstable entailment losses causing catastrophic cone collapse, and hierarchical evaluation remains unreliable with correlation-based metrics prone to taxonomy dependence.
Method: Proposes adaptive entailment loss with norm regularizer to prevent cone collapse without heuristic clipping, and introduces angle-based probabilistic entailment protocol (PEP) for hierarchical evaluation using AUC-ROC and Average Precision.
Result: ARGENT improves SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and proposed hierarchical metrics respectively.
Conclusion: ARGENT provides a stronger hyperbolic VLM baseline that better captures hierarchical structure in vision-language representations through stable training and reliable evaluation.
Abstract: Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable, being largely retrieval-based and correlation-based metrics and prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline ARGENT, Adaptive hieRarchical imaGe-tExt represeNTation. ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and proposed hierarchical metrics, respectively.
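The cone collapse the paper targets follows directly from the standard aperture formula for hyperbolic entailment cones (Ganea et al., 2018): as a parent embedding contracts toward the origin, the arcsine argument saturates and the cone widens toward a half-space. A small sketch, with an illustrative value for the constant K:

```python
import numpy as np

def cone_aperture(r, K=0.1):
    """Half-aperture arcsin(K * (1 - r^2) / r) of the entailment cone
    at a point with norm r in the Poincare ball. For small r the
    argument exceeds 1 (clipped here), i.e. the cone degenerates
    toward a half-space: the collapse ARGENT's norm regularizer
    is designed to prevent."""
    arg = np.clip(K * (1.0 - r**2) / r, -1.0, 1.0)
    return float(np.arcsin(arg))
```

This is why a norm regularizer helps: keeping parent embeddings away from the origin keeps their cones narrow enough to encode a meaningful hierarchy.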
[201] Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors
Chuanqing Zhuang, Xin Lu, Zehui Deng, Zhengda Lu, Yiqun Wang, Junqi Diao, Jun Xiao
Main category: cs.CV
TL;DR: PFGS360 enables omnidirectional 3D Gaussian Splatting from unposed 360° videos without requiring SfM camera poses, using spherical consistency-aware pose estimation and depth-inlier-aware densification for high-quality novel view synthesis.
Details
Motivation: Existing omnidirectional 3D Gaussian Splatting methods rely on slow Structure-from-Motion (SfM) for camera poses and sparse points, limiting their practicality. The authors aim to develop a pose-free method that can reconstruct 3D Gaussians directly from unposed omnidirectional videos.
Method: Two key modules: 1) Spherical consistency-aware pose estimation that recovers camera poses by establishing 2D-3D correspondences between reconstructed Gaussians and unposed images using Gaussian depth priors. 2) Depth-inlier-aware densification that extracts depth inliers and Gaussian outliers using monocular depth priors for efficient Gaussian densification and photorealistic novel view synthesis.
Result: Significantly outperforms existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos, achieving high-quality novel view synthesis without requiring pre-computed camera poses.
Conclusion: PFGS360 successfully enables pose-free omnidirectional 3D Gaussian Splatting, overcoming the dependency on SfM while maintaining high-quality reconstruction and novel view synthesis for 360° videos.
Abstract: Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse points priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D-3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians’ internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photorealistic novel view synthesis. The experiments show significant outperformance over existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos. Code is available at https://github.com/zcq15/PFGS360.
[202] 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding
Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu, Hongchao Fan, Hao Wu
Main category: cs.CV
TL;DR: 3DCity-LLM is a unified framework for 3D city-scale vision-language perception that uses coarse-to-fine feature encoding and is trained on a new 1.2M dataset of urban scenarios, achieving state-of-the-art performance on spatial reasoning tasks.
Details
Motivation: Current multimodal LLMs work well for object-centric or indoor scenarios but struggle with 3D city-scale environments. There's a need for models that can understand complex urban scenes with spatial relationships at city scale.
Method: Proposes 3DCity-LLM framework with coarse-to-fine feature encoding using three parallel branches: target object, inter-object relationships, and global scene. Creates 3DCity-LLM-1.2M dataset with 1.2M high-quality samples across 7 task categories. Uses multi-dimensional evaluation protocol with text-similarity metrics and LLM-based semantic assessment.
Result: Extensive experiments on two benchmarks show 3DCity-LLM significantly outperforms existing state-of-the-art methods. The framework demonstrates strong spatial reasoning capabilities for urban intelligence tasks.
Conclusion: 3DCity-LLM offers a promising direction for advancing spatial reasoning and urban intelligence, bridging the gap in scaling multimodal LLMs to 3D city-scale environments. The released dataset and code enable further research.
Abstract: While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce 3DCity-LLM-1.2M dataset that comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.
[203] ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images
Yunfeng Wu, Hongying Cheng, Zihao He, Songhua Liu
Main category: cs.CV
TL;DR: A framework that upgrades video diffusion transformers to ultra-high-resolution synthesis via image-only adaptation, using a two-stage Relay LoRA approach to bridge image-video modality gap and enable spatial extrapolation.
Details
Motivation: Transformer-based video diffusion models suffer from quadratic complexity when scaling to ultra-high-resolution videos, making end-to-end training prohibitively expensive. Existing approaches struggle with the image-video modality gap when fine-tuning with images alone.
Method: Proposes Relay LoRA: a two-stage adaptation strategy. First stage uses low-resolution images to bridge image-video modality gap. Second stage uses high-resolution images for spatial extrapolation. Also introduces High-Frequency-Awareness-Training-Objective to enhance detail synthesis via reconstruction loss on high-frequency components.
Result: Method produces ultra-high-resolution videos with rich visual details without requiring video training data. Outperforms previous state-of-the-art models trained on high-resolution videos by 0.8 on VBench benchmark.
Conclusion: The proposed image adaptation framework successfully enables high-resolution video synthesis from pre-trained video diffusion models, overcoming computational bottlenecks and modality gaps through decoupled learning objectives and specialized training strategies.
Abstract: Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Awareness-Training-Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.
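The High-Frequency-Awareness-Training-Objective is described only at a high level above. A minimal NumPy sketch of one plausible reading, using an FFT-based high-pass split followed by an L2 reconstruction loss; the `cutoff` parameter and all function names are illustrative, not from the paper:

```python
import numpy as np

def high_pass(x, cutoff=0.25):
    """Zero out low-frequency FFT coefficients below a radial cutoff.

    `cutoff` is a fraction of the Nyquist radius; this is an assumed
    stand-in for whatever frequency split the paper actually uses.
    """
    h, w = x.shape
    fx = np.fft.fftshift(np.fft.fft2(x))
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    fx[radius < cutoff] = 0.0  # drop low frequencies, keep the rest
    return np.real(np.fft.ifft2(np.fft.ifftshift(fx)))

def hf_reconstruction_loss(pred, target, cutoff=0.25):
    """L2 loss computed on the high-frequency components only."""
    diff = high_pass(pred, cutoff) - high_pass(target, cutoff)
    return float(np.mean(diff ** 2))
```

Identical inputs give exactly zero loss, and the loss penalizes only differences in fine detail, which matches the stated goal of sharpening high-frequency synthesis.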
[204] An Explainable AI-Driven Framework for Automated Brain Tumor Segmentation Using an Attention-Enhanced U-Net
MD Rashidul Islam, Bakary Gibba
Main category: cs.CV
TL;DR: U-Net with attention gates and custom loss functions achieves high-performance brain tumor segmentation from MRI with explainable AI via Grad-CAM.
Details
Motivation: Manual brain tumor segmentation from MRI is time-consuming and unreliable, especially for heterogeneous gliomas. There's a need for robust automated techniques for clinical decision-making in diagnosis, treatment planning, and disease monitoring.
Method: U-Net model augmented with attention gates to focus on significant regions; uses Dice Loss and Categorical Dice Loss with categorical cross-entropy to handle class imbalance; incorporates Grad-CAM for explainable AI with Gaussian filtering for smooth heatmap generation.
Result: Achieved superior performance: accuracy 0.9919, Dice coefficient 0.9901, mean IoU 0.9873, sensitivity 0.9908, specificity 0.9974 on BraTS 2020 dataset.
Conclusion: Attention mechanisms, personalized loss functions, and explainable AI significantly improve complex tumor structure segmentation precision in MRI scans, providing reliable and explainable method for clinical applications.
Abstract: Computer-aided segmentation of brain tumors from MRI data is of crucial significance to clinical decision-making in diagnosis, treatment planning, and follow-up disease monitoring. Gliomas, owing to their high malignancy and heterogeneity, represent a very challenging task for accurate and reliable segmentation into intra-tumoral sub-regions. Manual segmentation is typically time-consuming and not reliable, which justifies the need for robust automated techniques. This research resolves this problem by leveraging the BraTS 2020 dataset, where we have labeled MRI scans of glioma patients with four significant classes: background/healthy tissue, necrotic/non-enhancing core, edema, and enhancing tumor. In this work, we present a new segmentation technique based on a U-Net model augmented with attention gates to focus on the most significant regions of images. To counter class imbalance, we employ manually designed loss functions like Dice Loss and Categorical Dice Loss, in conjunction with standard categorical cross-entropy. Other evaluation metrics, like sensitivity and specificity, were used to measure the discriminability of the model between tumor classes. In addition, we introduce Grad-CAM-based explainable AI to enable visualizing attention regions and improve model interpretability, together with a smooth heatmap generation technique through Gaussian filtering. Our approach achieved superior performance with accuracy of 0.9919, Dice coefficient of 0.9901, mean IoU of 0.9873, sensitivity of 0.9908, and specificity of 0.9974. This study demonstrates that the use of attention mechanisms, personalized loss functions, and explainable AI significantly improves highly complex tumor structure segmentation precision in MRI scans, providing a reliable and explainable method for clinical applications.
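The Dice Loss used to counter class imbalance has a standard soft-binary form; a minimal NumPy sketch of it (the `eps` smoothing term is a common convention and not necessarily the authors' exact formulation):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Soft Dice coefficient between a predicted probability map and a
    binary mask; `eps` guards against division by zero on empty masks."""
    pred, target = pred.ravel(), target.ravel()
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

def dice_loss(pred, target):
    """Dice loss = 1 - Dice; minimized when prediction matches the mask."""
    return 1.0 - dice_coefficient(pred, target)
```

Because the loss is normalized by the combined foreground mass rather than the pixel count, small structures (e.g. the enhancing-tumor class) contribute as much as large ones, which is why Dice-style losses help with class imbalance.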
[205] InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
Duc Vu, Kien Nguyen, Trong-Tung Nguyen, Ngan Nguyen, Phong Nguyen, Khoi Nguyen, Cuong Pham, Anh Tran
Main category: cs.CV
TL;DR: InverFill: A one-step inversion method for few-step image inpainting that injects semantic information from masked images into initial noise, enabling high-fidelity results without training specialized inpainting models.
Details
Motivation: Current diffusion models require many sampling steps for inpainting, limiting practical use. Few-step text-to-image models applied to inpainting produce poor harmonization and artifacts due to random Gaussian noise initialization causing semantic misalignment.
Method: Proposes InverFill, a one-step inversion method that injects semantic information from input masked images into the initial noise. Uses a blended sampling pipeline with semantically aligned noise as input, leveraging existing few-step text-to-image models without training specialized inpainting models.
Result: InverFill significantly improves vanilla blended sampling and matches specialized inpainting models at a low number of function evaluations (NFEs). Improves image quality and text coherence without costly retraining or heavy iterative optimization.
Conclusion: InverFill enables high-fidelity few-step inpainting by addressing semantic misalignment in noise initialization, offering practical advantages over existing methods with minimal inference overhead and no need for real-image supervision.
Abstract: Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this cause to random Gaussian noise initialization, which under low function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.
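InverFill builds on blended sampling, which at each denoising step keeps the model's sample inside the mask and re-noises the known background outside it. A simplified NumPy sketch of that per-step composite with a DDPM-style forward noising; `blend_step` and its signature are illustrative, not the paper's pipeline:

```python
import numpy as np

def blend_step(x_generated, background, mask, noise_level, rng):
    """One blended-sampling composite: keep the model's sample inside the
    mask, and replace the outside with the known background noised to the
    current level (simplified DDPM-style forward noising).

    mask == 1 marks the region to inpaint; `noise_level` in [0, 1].
    """
    noised_bg = (np.sqrt(1.0 - noise_level) * background
                 + np.sqrt(noise_level) * rng.standard_normal(background.shape))
    return mask * x_generated + (1.0 - mask) * noised_bg
```

InverFill's contribution sits before this loop: instead of seeding `x_generated` from random Gaussian noise, it inverts the masked input in one step so the initial noise already carries the scene's semantics.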
[206] FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures
Yujie Sun, Zhuoqiang Cai, Chaoyue Niu, Jianchuan Chen, Zhiwen Chen, Chengfei Lv, Fan Wu
Main category: cs.CV
TL;DR: FHAvatar: A framework for reconstructing 3D Gaussian avatars with separate face and hair components from few views, enabling real-time animation and editing.
Details
Motivation: Previous methods couple facial and hair representations in unified modeling, requiring dense multi-view captures or costly per-identity optimization. The authors aim to decouple these components for more efficient and accessible avatar creation.
Method: Explicitly decouples face and hair components: face represented with planar Gaussians, hair with strand-based Gaussians. Uses aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling efficient feature extraction from few casual captures.
Result: Achieves state-of-the-art reconstruction quality from only a few observations of new identities within minutes. Supports real-time animation, convenient hairstyle transfer, and stylized editing.
Conclusion: FHAvatar broadens the accessibility and applicability of digital avatar creation by enabling high-quality reconstruction from few views with efficient processing and flexible editing capabilities.
Abstract: We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouple two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from only a few observations of new identities within minutes, while supporting real-time animation, convenient hairstyle transfer, and stylized editing, broadening the accessibility and applicability of digital avatar creation.
[207] Object Pose Transformer: Unifying Unseen Object Pose Estimation
Weihang Li, Lorenzo Garattoni, Fabien Despinoy, Nassir Navab, Benjamin Busam
Main category: cs.CV
TL;DR: Object Pose Transformer (OPT) is a unified framework for both category-level absolute pose estimation and unseen-object relative pose estimation from RGB inputs, bridging two previously disjoint paradigms in 3D vision.
Details
Motivation: Existing methods for object pose estimation fall into two disjoint paradigms: category-level approaches that predict absolute poses but rely on predefined taxonomies, and relative pose methods that estimate cross-view transformations but cannot recover single-view absolute pose. There's a need for a unified approach that can handle both tasks.
Method: OPT uses a feed-forward transformer framework that jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs. It leverages contrastive object-centric latent embeddings for canonicalization without semantic labels, and uses point maps as camera-space representation for multi-view geometric reasoning. The model is camera-agnostic and learns camera intrinsics on-the-fly.
Result: Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.
Conclusion: OPT successfully bridges the gap between category-level absolute pose estimation and unseen-object relative pose estimation through a unified framework that leverages cross-view geometric consistency to improve single-view predictions while remaining flexible and camera-agnostic.
Abstract: Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer (OPT), a unified feed-forward framework that bridges these paradigms through task factorization within a single model. OPT jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at inference time, and uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, our model leverages relative geometric consistency across views to improve absolute pose estimation, reducing ambiguity in single-view predictions. Furthermore, OPT is camera-agnostic, learning camera intrinsics on-the-fly and supporting optional depth input for metric-scale recovery, while remaining fully functional in RGB-only settings. Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.
[208] ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment
Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
Main category: cs.CV
TL;DR: ABot-PhysWorld is a 14B Diffusion Transformer model that generates physically plausible and action-controllable manipulation videos, trained on physics-annotated data with DPO-based post-training to suppress unphysical behaviors while maintaining visual quality.
Details
Motivation: Current video-based world models often generate physically implausible manipulations (object penetration, anti-gravity motion) due to training on generic visual data and likelihood-based objectives that ignore physical laws. There's a need for models that generate both visually realistic and physically plausible videos for embodied simulation and planning.
Method: 1) Built on curated dataset of 3M manipulation clips with physics-aware annotation; 2) 14B Diffusion Transformer architecture; 3) Novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality; 4) Parallel context block for precise spatial action injection for cross-embodiment control; 5) Introduces EZSbench benchmark for training-independent embodied zero-shot evaluation.
Result: ABot-PhysWorld achieves new SOTA performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. The model generates visually realistic, physically plausible, and action-controllable videos.
Conclusion: ABot-PhysWorld demonstrates that combining physics-aware data curation with DPO-based training can significantly improve physical plausibility in video generation while maintaining visual quality, advancing the field of embodied video generation for simulation and planning.
Abstract: Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.
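The DPO-based post-training adapts Direct Preference Optimization; the paper's decoupled-discriminator variant is not spelled out here, but the standard per-pair DPO loss it builds on can be sketched in plain Python:

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Direct Preference Optimization loss for one (preferred, rejected) pair.

    logp_* are log-likelihoods of the preferred (w) and rejected (l)
    samples under the policy and the frozen reference model. The loss is
    -log sigmoid of the beta-scaled margin by which the policy favors the
    preferred sample more strongly than the reference does.
    """
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

In this setting, "preferred" would be a physically plausible clip and "rejected" an unphysical one (penetration, anti-gravity motion), so minimizing the loss pushes probability mass away from unphysical behaviors without retraining the generator from scratch.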
[209] VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos
Main category: cs.CV
TL;DR: VISOR improves LVLM efficiency by sparsifying text-image interactions instead of compressing visual tokens, using strategic attention layers and dynamic computation allocation based on task complexity.
Details
Motivation: Existing visual token reduction methods create information bottlenecks that impair performance on tasks requiring fine-grained understanding. The paper challenges this paradigm by seeking to reduce inference cost without discarding visual information.
Method: VISOR sparsifies interaction between image and text tokens using strategically placed attention layers: efficient cross-attention provides general visual context, while dynamically selected self-attention layers refine visual representations when needed for complex reasoning. Includes a lightweight policy mechanism for dynamic computation allocation based on per-sample complexity.
Result: Extensive experiments show VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across diverse benchmarks, excelling in challenging tasks requiring detailed visual understanding.
Conclusion: VISOR presents an effective alternative to visual token reduction for LVLM efficiency, maintaining high-resolution visual information while reducing computation through intelligent interaction sparsification and dynamic resource allocation.
Abstract: Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.
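The lightweight policy that allocates visual computation per sample is not given in pseudocode; a toy Python sketch of the idea, bucketing a scalar complexity score into a self-attention-layer budget (all names, thresholds, and budgets below are hypothetical):

```python
def allocate_layers(complexity, budgets=(0, 2, 4, 8),
                    thresholds=(0.25, 0.5, 0.75)):
    """Toy policy: map a per-sample complexity score in [0, 1] to a number
    of extra self-attention layers to activate. Easy samples get only the
    cheap cross-attention path; hard samples get more self-attention.
    """
    for t, b in zip(thresholds, budgets):
        if complexity < t:
            return b
    return budgets[-1]  # hardest bucket gets the full budget
```

The real method trains one universal network across budgets and then learns this routing; the sketch only illustrates the budget-selection step.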
[210] FG-Portrait: 3D Flow Guided Editable Portrait Animation
Yating Xu, Yunqi Miao, Evangelos Ververas, Jiankang Deng, Jifei Song
Main category: cs.CV
TL;DR: A 3D flow-based diffusion model for portrait animation that uses parametric 3D head models to compute geometry-driven motion correspondences, enabling high-fidelity motion transfer with identity preservation and user editing capabilities.
Details
Motivation: Current diffusion-based portrait animation methods fail to capture source-to-driving correspondences, leading to suboptimal motion transfer. Flow estimation from 2D input is ill-posed and inaccurate. There's a need for better motion correspondence that preserves source identity while enabling faithful animation.
Method: Introduces 3D flows, a learning-free, geometry-driven motion correspondence computed from parametric 3D head models. Uses 3D flow encoding to query potential 3D flows for each target pixel. Proposes depth-guided sampling to accurately locate corresponding 3D points for each pixel, aligning 3D flows with 2D motion changes.
Result: Demonstrates superiority in consistent driving motion transfer and faithful source identity preservation. Supports user-specified editing of facial expression and head pose beyond standard animation.
Conclusion: The 3D flow-based approach effectively addresses motion transfer challenges in portrait animation by leveraging 3D geometric priors, enabling high-fidelity animation with better correspondence and identity preservation than existing diffusion-based methods.
Abstract: Motion transfer from the driving to the source portrait remains a key challenge in portrait animation. Current diffusion-based approaches condition only on the driving motion, which fails to capture source-to-driving correspondences and consequently yields suboptimal motion transfer. Although flow estimation provides an alternative, predicting dense correspondences from 2D input is ill-posed and often yields inaccurate animation. We address this problem by introducing 3D flows, a learning-free and geometry-driven motion correspondence directly computed from parametric 3D head models. To integrate this 3D prior into the diffusion model, we introduce 3D flow encoding to query potential 3D flows for each target pixel to indicate its displacement back to the source location. To obtain 3D flows aligned with 2D motion changes, we further propose depth-guided sampling to accurately locate the corresponding 3D points for each pixel. Beyond high-fidelity portrait animation, our model further supports user-specified editing of facial expression and head pose. Extensive experiments demonstrate the superiority of our method on consistent driving motion transfer as well as faithful source identity preservation.
[211] From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching
Feifan Luo, Hongyang Chen
Main category: cs.CV
TL;DR: Unsupervised spectral basis learning framework for 3D shape matching that replaces fixed basis functions with learnable ones, enabling joint optimization of feature extraction and basis functions for improved performance.
Details
Motivation: Existing functional map methods focus on learning feature representations but neglect optimization of the spectral basis, leading to suboptimal matching. Current approaches also rely on time-consuming solvers with computational overhead.
Method: Introduces Advanced Functional Maps with learnable spectral basis optimized through learned inhibition functions. Proposes unsupervised spectral basis learning with heat diffusion module, unsupervised loss function, and streamlined architecture that bypasses expensive solvers.
Result: Significantly outperforms state-of-the-art feature-learning approaches, especially in challenging non-isometric and topological noise scenarios, while maintaining high efficiency. Reveals equivalence between optimizing basis functions and spectral convolution.
Conclusion: The framework enables enhanced representations inspired by spectral graph networks, opening new avenues for research in shape matching and spectral learning.
Abstract: Shape matching is a fundamental task in computer graphics and vision, with deep functional maps becoming a prominent paradigm. However, existing methods primarily focus on learning informative feature representations by constraining pointwise and functional maps, while neglecting the optimization of the spectral basis, a critical component of the functional map pipeline. This oversight often leads to suboptimal matching results. Furthermore, many current approaches rely on conventional, time-consuming functional map solvers, incurring significant computational overhead. To bridge these gaps, we introduce Advanced Functional Maps, a framework that generalizes standard functional maps by replacing fixed basis functions with learnable ones, supported by rigorous theoretical guarantees. Specifically, the spectral basis is optimized through a set of learned inhibition functions. Building on this, we propose the first unsupervised spectral basis learning method for robust non-rigid 3D shape matching, enabling the joint, end-to-end optimization of feature extraction and basis functions. Our approach incorporates a novel heat diffusion module and an unsupervised loss function, alongside a streamlined architecture that bypasses expensive solvers and auxiliary losses. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art feature-learning approaches, particularly in challenging non-isometric and topological noise scenarios, while maintaining high efficiency. Finally, we reveal that optimizing basis functions is equivalent to spectral convolution, where inhibition functions act as filters. This insight enables enhanced representations inspired by spectral graph networks, opening new avenues for future research. Our code is available at https://github.com/LuoFeifan77/Unsupervised-Spectral-Basis-Learning.
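The claimed equivalence between optimizing basis functions and spectral convolution follows the usual graph-spectral form y = Phi diag(g(lambda)) Phi^T x. A minimal NumPy sketch on a 4-node path-graph Laplacian, where the filter g stands in for a learned inhibition function (the example graph is ours, not the paper's):

```python
import numpy as np

def spectral_filter(laplacian, signal, filter_fn):
    """Apply a spectral filter g to a graph signal:
    y = Phi diag(g(lambda)) Phi^T x, where (lambda, Phi) is the
    eigendecomposition of the graph Laplacian. A learned inhibition
    function would play the role of g here.
    """
    lam, phi = np.linalg.eigh(laplacian)
    coeffs = phi.T @ signal                 # project into the spectral basis
    return phi @ (filter_fn(lam) * coeffs)  # filter, then reconstruct

# Laplacian of a path graph on 4 nodes (degree minus adjacency).
L = np.array([[ 1, -1,  0,  0],
              [-1,  2, -1,  0],
              [ 0, -1,  2, -1],
              [ 0,  0, -1,  1]], dtype=float)
```

With the identity filter g(lambda) = 1 the signal is recovered exactly (Phi is orthonormal), while a decaying g suppresses high-frequency components; rescaling the basis by g is exactly what a per-eigenvalue inhibition function does.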
[212] SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM
Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang
Main category: cs.CV
TL;DR: SIMART: A unified multimodal LLM framework for generating articulated 3D objects with part decomposition and kinematic prediction using sparse 3D tokenization.
Details
Motivation: There's a gap in "sim-ready" interactive 3D assets for embodied AI and physical simulation. Current methods use multi-stage pipelines that accumulate errors, while dense voxel-based 3D tokenization in unified MLLMs creates long sequences and high memory overhead, limiting scalability to complex articulated objects.
Method: Proposes SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. Introduces a Sparse 3D VQ-VAE to reduce token counts by 70% compared to dense voxel tokens, enabling high-fidelity multi-part assemblies.
Result: Achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
Conclusion: SIMART provides an effective unified approach for generating articulated 3D assets suitable for simulation and embodied AI applications through sparse 3D tokenization and joint understanding/generation.
Abstract: High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in “sim-ready” interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
[213] GeoSANE: Learning Geospatial Representations from Models, Not Data
Joelle Hanna, Damian Falk, Stella X. Yu, Damian Borth
Main category: cs.CV
TL;DR: GeoSANE is a geospatial model foundry that learns unified neural representations from existing foundation models and generates novel neural network weights on-demand for various remote sensing tasks across multiple modalities.
Details
Motivation: Current remote sensing foundation models are trained on different modalities, datasets, and objectives, capturing only parts of geospatial knowledge. They remain complementary rather than unified, creating a need to combine their strengths into a single shared representation.
Method: GeoSANE learns a unified neural representation from weights of existing foundation models and task-specific models. It can generate novel neural network weights on-demand for target architectures, ready for finetuning on classification, segmentation, and detection tasks across multiple modalities.
Result: Models generated by GeoSANE outperform counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Strong generalization confirmed across ten diverse datasets and GEO-Bench.
Conclusion: GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks by shifting from pre-training to weight generation, enabling on-demand creation of specialized models from a unified representation.
Abstract: Recent advances in remote sensing have led to an increase in the number of available foundation models; each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models, able to generate novel neural network weights on-demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities. Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities. By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. Code is available at \href{https://hsg-aiml.github.io/GeoSANE/}{hsg-aiml.github.io/GeoSANE/}.
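The "learning from weights, not data" idea can be made concrete with a minimal tokenization round trip: checkpoints are flattened into fixed-size weight tokens that a sequence model could embed, and generated tokens are reassembled into a target architecture's parameter shapes. This is a generic weight-space-learning sketch, not GeoSANE's actual encoder; the chunk size and function names are assumptions.

```python
import numpy as np

def weights_to_tokens(state, chunk=64):
    """Flatten all parameter tensors and split into fixed-size "weight tokens".
    A sequence model over such tokens is one way to learn a unified
    representation of network weights. The tail chunk is zero-padded."""
    flat = np.concatenate([w.ravel() for w in state.values()])
    flat = np.pad(flat, (0, (-len(flat)) % chunk))
    return flat.reshape(-1, chunk)

def tokens_to_weights(tokens, shapes):
    """Inverse direction: reassemble (generated) tokens into tensors matching
    a target architecture's parameter shapes, ready for finetuning."""
    flat, out, i = tokens.ravel(), {}, 0
    for name, shape in shapes.items():
        n = int(np.prod(shape))
        out[name] = flat[i:i + n].reshape(shape)
        i += n
    return out

rng = np.random.default_rng(0)
state = {"conv.w": rng.normal(size=(8, 3, 3, 3)), "fc.w": rng.normal(size=(10, 8))}
tokens = weights_to_tokens(state)
recovered = tokens_to_weights(tokens, {k: v.shape for k, v in state.items()})
```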
[214] I3DM: Implicit 3D-aware Memory Retrieval and Injection for Consistent Video Scene Generation
Jia Li, Han Yan, Yihang Chen, Siqi Li, Xibin Song, Yifu Wang, Jianfei Cai, Tien-Tsin Wong, Pan Ji
Main category: cs.CV
TL;DR: I3DM: Implicit 3D-aware memory mechanism for consistent video scene generation without explicit 3D reconstruction, improving revisit consistency and camera control.
Details
Motivation: Existing video generation methods struggle with long-term scene consistency when revisiting previously explored areas. Explicit 3D reconstruction suffers from error accumulation and scale ambiguity, while naive camera FoV retrieval fails under complex occlusions.
Method: Proposes I3DM with two key components: 1) 3D-aware memory retrieval using intermediate features from a pre-trained FF-NVS model to score view relevance for robust retrieval under occlusions, and 2) 3D-aligned memory injection that implicitly warps historical content to the target view and adaptively conditions generation on reliable warping regions.
Result: Extensive experiments show the method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.
Conclusion: I3DM provides an effective implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction challenges while handling complex occlusions.
Abstract: Despite remarkable progress in video generation, maintaining long-term scene consistency upon revisiting previously explored areas remains challenging. Existing solutions rely either on explicitly constructing 3D geometry, which suffers from error accumulation and scale ambiguity, or on naive camera Field-of-View (FoV) retrieval, which typically fails under complex occlusions. To overcome these limitations, we propose I3DM, a novel implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction. At the core of our approach is a 3D-aware memory retrieval strategy, which leverages the intermediate features of a pre-trained Feed-Forward Novel View Synthesis (FF-NVS) model to score view relevance, enabling robust retrieval even in highly occluded scenarios. Furthermore, to fully utilize the retrieved historical frames, we introduce a 3D-aligned memory injection module. This module implicitly warps historical content to the target view and adaptively conditions the generation on reliable warping regions, leading to improved revisit consistency and accurate camera control. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.
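The retrieval half of the method (scoring stored views by feature relevance rather than raw camera FoV overlap) reduces, at its simplest, to nearest-neighbour search in a feature space. The sketch below uses plain cosine similarity as a stand-in for the FF-NVS feature scoring described above; all names and the toy features are hypothetical.

```python
import numpy as np

def retrieve_views(query_feat, memory_feats, k=2):
    """Rank stored view features by cosine similarity to the query view and
    return the indices of the top-k most relevant memory entries."""
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(scores)[::-1][:k], scores

memory = np.array([[1.0, 0.0, 0.0],   # view 0: unrelated to the query
                   [0.0, 1.0, 0.1],   # view 1: nearly identical viewpoint
                   [0.0, 0.2, 1.0]])  # view 2: partially overlapping
top, scores = retrieve_views(np.array([0.0, 1.0, 0.0]), memory)
```

The appeal of feature-level scoring is that relevance survives occlusion: two views of the same region can score highly even when their camera frusta barely overlap.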
[215] SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images
Bao Truong, Quang Nguyen, Baoru Huang, Jinpei Han, Van Nguyen, Ngan Le, Minh-Tan Pham, Doan Huy Hien, Anh Nguyen
Main category: cs.CV
TL;DR: SIGMA: A physics-based dataset for gas chimney detection and enhancement in seismic images, addressing challenges of limited labeled data for deep learning applications.
Details
Motivation: Gas chimneys in seismic images are crucial for hydrocarbon exploration and hazard avoidance, but accurate detection is challenging due to seismic attenuation/scattering. Traditional physics-based methods are computationally expensive and sensitive to errors, while deep learning lacks sufficient labeled datasets.
Method: Created SIGMA dataset using physics-based methods covering diverse geological settings and acquisition conditions. Includes pixel-level gas-chimney masks for detection and paired degraded/ground-truth images for enhancement tasks.
Result: SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding through comprehensive experimental validation.
Conclusion: The SIGMA dataset addresses the critical need for labeled data in seismic image analysis, enabling more effective deep learning approaches for gas chimney detection and enhancement.
Abstract: Seismic images reconstruct subsurface reflectivity from field recordings, guiding exploration and reservoir monitoring. Gas chimneys are vertical anomalies caused by subsurface fluid migration. Understanding these phenomena is crucial for assessing hydrocarbon potential and avoiding drilling hazards. However, accurate detection is challenging due to strong seismic attenuation and scattering. Traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning offers efficient alternatives, yet lacks labeled datasets. In this work, we introduce \textbf{SIGMA}, a new physics-based dataset for gas chimney understanding in seismic images, featuring (i) pixel-level gas-chimney masks for detection and (ii) paired degraded and ground-truth images for enhancement. We employed physics-based methods that cover a wide range of geological settings and data acquisition conditions. Comprehensive experiments demonstrate that SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.
[216] DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection
Gautam Rajendrakumar Gare, Neehar Peri, Matvei Popov, Shruti Jain, John Galeotti, Deva Ramanan
Main category: cs.CV
TL;DR: DetPO is a gradient-free test-time optimization method that refines text-only prompts for few-shot object detection in multimodal LLMs by maximizing detection accuracy on few-shot visual examples while calibrating prediction confidence.
Details
Motivation: Current multimodal LLMs struggle with out-of-distribution generalization in object detection tasks and cannot effectively leverage few-shot visual examples and rich textual descriptions. Since frontier MLLMs are typically API-only and open-weight models are expensive to fine-tune, there's a need for black-box prompt optimization approaches.
Method: Detection Prompt Optimization (DetPO) - a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. It works with black-box MLLMs without requiring model weights or fine-tuning.
Result: DetPO yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS benchmarks, outperforming prior black-box approaches by up to 9.7%.
Conclusion: DetPO provides an effective black-box optimization approach for improving few-shot object detection in multimodal LLMs without requiring model access or fine-tuning, addressing limitations in current MLLMs’ ability to leverage few-shot visual examples.
Abstract: Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO
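Gradient-free prompt optimization of the kind DetPO performs can be sketched as black-box hill climbing: propose a text edit, score it with the detector on the few-shot examples, and keep it only when accuracy improves. The scorer below is a toy stand-in for a real MLLM detection call, and all names are hypothetical; DetPO's actual proposal and confidence-calibration steps are more elaborate.

```python
import random

def optimize_prompt(base_prompt, edits, score_fn, iters=50, seed=0):
    """Black-box hill climbing over text-only prompts: append a random edit
    and keep it only when the detection score on the few-shot set improves."""
    rng = random.Random(seed)
    best, best_score = base_prompt, score_fn(base_prompt)
    for _ in range(iters):
        candidate = best + " " + rng.choice(edits)
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy scorer: pretend detection accuracy rises when the prompt mentions
# useful visual cues for the target class.
def toy_detection_score(prompt):
    return sum(cue in prompt for cue in ("round", "red", "small"))

prompt, score = optimize_prompt("detect the fruit",
                                ["round", "red", "small", "shiny"],
                                toy_detection_score)
```

Because only prompt text and a scalar score cross the API boundary, this style of search works on frontier models where weights and gradients are inaccessible.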
[217] RealMaster: Lifting Rendered Scenes into Photorealistic Video
Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar
Main category: cs.CV
TL;DR: RealMaster bridges 3D engines and video diffusion models to generate photorealistic videos while maintaining precise 3D control and consistency.
Details
Motivation: Current video generation models lack precise control and 3D consistency, while 3D engines produce unrealistic "uncanny valley" results. There's a need to bridge this gap by combining structural precision from 3D engines with photorealism from diffusion models.
Method: Uses video diffusion models to lift rendered 3D video into photorealistic video while maintaining alignment with 3D engine output. Creates paired dataset via anchor-based propagation (enhancing first/last frames and propagating via geometric conditioning), then trains IC-LoRA to distill high-quality outputs into a generalizable model.
Result: Significantly outperforms existing video editing baselines on complex GTA-V sequences, improving photorealism while preserving geometry, dynamics, and identity specified by original 3D control.
Conclusion: RealMaster successfully bridges the sim-to-real gap by combining precise 3D control with photorealistic video generation, enabling applications that require both structural precision and visual realism.
Abstract: State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the “uncanny valley”. Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline’s constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.
[218] UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation
Jiaying Lin, Dan Xu
Main category: cs.CV
TL;DR: UniFunc3D is a unified, training-free framework for 3D functionality segmentation that uses multimodal LLMs as active observers to ground natural language instructions into precise interactive element masks through joint semantic, temporal, and spatial reasoning.
Details
Motivation: Existing methods for 3D functionality segmentation rely on fragmented pipelines that suffer from visual blindness during initial task parsing, limited by single-scale, passive and heuristic frame selection approaches.
Method: Treats multimodal large language model as an active observer with active spatial-temporal grounding and coarse-to-fine strategy. Consolidates semantic, temporal, and spatial reasoning into single forward pass for joint reasoning to ground task decomposition in direct visual evidence.
Result: Achieves state-of-the-art performance on SceneFun3D, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training.
Conclusion: UniFunc3D demonstrates effective unified framework for 3D functionality segmentation using multimodal LLMs as active observers, showing significant performance improvements without requiring training.
Abstract: Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
[219] TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation
Jini Yang, Eunbeen Hong, Soowon Son, Hyunkoo Lee, Sunghwan Hong, Sunok Kim, Seungryong Kim
Main category: cs.CV
TL;DR: TETO is a teacher-student framework that learns event-based motion estimation from minimal real-world data by distilling knowledge from a pretrained RGB tracker, achieving state-of-the-art tracking and optical flow with much less training data.
Details
Motivation: Event cameras capture high-temporal-resolution motion data but existing estimators rely on synthetic data with sim-to-real gaps. The paper aims to learn event-based motion estimation from limited real-world data without annotations.
Method: Proposes TETO: a teacher-student framework using knowledge distillation from a pretrained RGB tracker. Uses motion-aware data curation and query sampling to disentangle object motion from ego-motion. Predicts both point trajectories and dense optical flow, then uses these as motion priors to condition a video diffusion transformer for frame interpolation.
Result: Achieves state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data (~25 minutes vs. large-scale synthetic datasets). Demonstrates superior frame interpolation quality on BS-ERGB and HQ-EVFI benchmarks.
Conclusion: TETO enables effective event-based motion estimation from minimal real-world data through knowledge distillation, and accurate motion estimation directly improves frame interpolation quality in event-based vision systems.
Abstract: Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only $\sim$25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.
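The teacher-student objective here amounts to supervising student (event-based) trajectories with teacher (RGB) trajectories on valid points. A minimal form of such a distillation loss, with hypothetical shapes ((T, N, 2) trajectories and a (T, N) visibility mask) not taken from the paper:

```python
import numpy as np

def track_distillation_loss(student, teacher, visible):
    """Mean L1 distance between student and teacher point trajectories,
    counted only where the teacher marks the point as visible."""
    per_point = np.abs(student - teacher).sum(axis=-1)          # (T, N)
    return (per_point * visible).sum() / max(visible.sum(), 1)

T, N = 4, 3
teacher = np.zeros((T, N, 2))        # teacher trajectories (stub)
student = teacher + 0.5              # student off by a constant 0.5 in x and y
visible = np.ones((T, N))            # all points visible
loss = track_distillation_loss(student, teacher, visible)
```

Masking by teacher visibility matters: an RGB tracker's predictions on occluded points are unreliable, so distilling them would teach the student the teacher's failure modes.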
[220] One View Is Enough! Monocular Training for In-the-Wild Novel View Generation
Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard
Main category: cs.CV
TL;DR: OVIE enables monocular novel-view synthesis using only unpaired internet images by leveraging monocular depth estimation as a geometric scaffold during training, then becoming geometry-free at inference.
Details
Motivation: Traditional monocular novel-view synthesis requires multi-view image pairs for supervision, which limits training data scale and diversity. The authors argue that one view should be enough and aim to leverage the vast amount of unpaired internet images available.
Method: Uses a monocular depth estimator as geometric scaffold during training: lifts source image into 3D, applies camera transformation, projects to get pseudo-target view. Introduces masked training formulation to handle disocclusions by restricting losses to valid regions. Trained on 30M uncurated images. At inference, becomes geometry-free requiring no depth estimator.
Result: Outperforms prior methods in zero-shot setting while being 600x faster than second-best baseline. Trained exclusively on in-the-wild images without paired supervision.
Conclusion: Demonstrates that monocular novel-view synthesis can be learned from unpaired internet images alone, achieving state-of-the-art performance with significantly faster inference than previous methods.
Abstract: Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
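The training-time scaffold described above (lift with depth, transform, reproject, then mask disocclusions out of the loss) can be sketched with a pinhole camera. The identity intrinsics and the simple in-bounds validity test are assumptions for illustration; OVIE's actual masking of geometric, perceptual, and textural losses is more involved.

```python
import numpy as np

def warp_with_depth(depth, K, R, t):
    """Lift every pixel to 3D using its depth, apply the camera transform
    (R, t), and project back, returning pseudo-target pixel coordinates and
    an in-bounds / in-front-of-camera validity mask."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T.astype(float)
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()       # back-project to 3D
    proj = K @ (R @ pts + t[:, None])                    # transform and project
    uv = (proj[:2] / proj[2]).T.reshape(h, w, 2)
    z = proj[2].reshape(h, w)
    valid = ((uv[..., 0] >= 0) & (uv[..., 0] < w) &
             (uv[..., 1] >= 0) & (uv[..., 1] < h) & (z > 0))
    return uv, valid

def masked_l1(pred, target, mask):
    """Photometric loss restricted to valid (non-disoccluded) regions."""
    return np.abs(pred - target)[mask].mean()

depth = np.ones((4, 4))
uv, valid = warp_with_depth(depth, np.eye(3), np.eye(3), np.zeros(3))
```

With an identity transform every pixel maps to itself and the whole mask is valid; a real sampled camera motion leaves holes where content is disoccluded, which is exactly what the masked loss ignores.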
[221] AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation
Woojeong Jin, Jaeho Lee, Heeseong Shin, Seungho Jang, Junhwan Heo, Seungryong Kim
Main category: cs.CV
TL;DR: AgentRVOS: A training-free agentic pipeline for Referring Video Object Segmentation that uses SAM3 for full spatio-temporal perception and an MLLM for query-grounded reasoning, achieving state-of-the-art performance.
Details
Motivation: Existing training-free RVOS methods have MLLMs make temporal decisions before object-level evidence is available, limiting reasoning quality and spatio-temporal coverage.
Method: Proposes AgentRVOS with complementary SAM3 and an MLLM: SAM3 provides reliable perception over the full video through mask tracks, then the MLLM identifies the target via query-grounded reasoning over object evidence, iteratively pruning using SAM3’s temporal existence info.
Result: Achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones.
Conclusion: AgentRVOS overcomes limitations of prior training-free RVOS methods by leveraging complementary strengths of SAM3 and MLLM through an agentic pipeline.
Abstract: Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3’s temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.
[222] Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation
Brian Chao, Lior Yariv, Howard Xiao, Gordon Wetzstein
Main category: cs.CV
TL;DR: Foveated diffusion models for efficient image/video generation by allocating tokens non-uniformly based on human visual acuity, using higher density in gaze regions and lower in periphery.
Details
Motivation: Address computational challenges in high-resolution content generation by leveraging human visual perception: users perceive high detail only in the foveal region (gaze area), while peripheral vision has lower acuity.
Method: Uses a mask modeling the foveated resolution to allocate tokens non-uniformly, develops a principled mechanism for constructing mixed-resolution tokens from high-resolution data, and enables post-training from an existing base model while maintaining consistency.
Result: Perceptually indistinguishable results from full-resolution generation with drastically reduced token count and generation time, validated through extensive analysis and user study.
Conclusion: Foveation provides practical and scalable approach for efficient generation in eye-tracking settings, maintaining quality while reducing computational complexity.
Abstract: Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user’s gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.
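The non-uniform token budget described above can be illustrated with a simple eccentricity mask: patches within a foveal radius of the gaze keep full resolution, while peripheral patches are merged into coarser tokens. The circular fovea, radius, and 4-to-1 merge factor are arbitrary choices for this sketch, not the paper's actual allocation scheme.

```python
import numpy as np

def foveated_mask(h, w, gaze, fovea_radius):
    """True for patches close enough to the gaze point to stay high-res."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.hypot(ys - gaze[0], xs - gaze[1]) <= fovea_radius

h = w = 32                               # 32x32 grid of image patches
mask = foveated_mask(h, w, gaze=(16, 16), fovea_radius=6)
fine = int(mask.sum())                   # one token per foveal patch
coarse = int((~mask).sum()) // 4         # merge 4 peripheral patches per token
print(f"uniform: {h * w} tokens, foveated: {fine + coarse} tokens")
```

Since attention cost grows quadratically in token count, even this crude allocation cuts compute by far more than the raw token ratio suggests.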
[223] WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG
Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang
Main category: cs.CV
TL;DR: WildWorld is a large-scale action-conditioned world modeling dataset collected from a photorealistic AAA game, featuring explicit state annotations to address limitations in existing video world model datasets.
Details
Motivation: Existing video world model datasets lack diverse, semantically meaningful action spaces and have actions directly tied to visual observations rather than underlying states, making it difficult to learn structured world dynamics and maintain consistent evolution over long horizons.
Method: Automatically collected data from Monster Hunter: Wilds game, containing over 108 million frames with 450+ actions and synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. Created WildBench for evaluation through Action Following and State Alignment tasks.
Result: Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation.
Conclusion: WildWorld provides a valuable benchmark for action-conditioned world modeling with explicit state annotations, addressing key limitations in existing datasets and revealing important challenges in the field.
Abstract: Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.
[224] DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models
Jaewon Min, Jaeeun Lee, Yeji Choi, Paul Hyunbin Cho, Jin Hyeon Kim, Tae-Young Lee, Jongsik Ahn, Hwayeong Lee, Seonghyun Park, Seungryong Kim
Main category: cs.CV
TL;DR: DA-Flow introduces degradation-aware optical flow using diffusion model features with spatio-temporal attention for robust flow estimation on corrupted videos.
Details
Motivation: Optical flow models perform poorly on real-world corrupted videos (blur, noise, compression artifacts), creating a need for degradation-aware approaches that maintain accuracy under corruption.
Method: Lifts image restoration diffusion models to attend across frames via full spatio-temporal attention, creating corruption-aware temporal features. Fuses these diffusion features with convolutional features in DA-Flow hybrid architecture within iterative refinement framework.
Result: DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks, demonstrating robust correspondence estimation on corrupted videos.
Conclusion: Diffusion model features with spatio-temporal attention provide effective degradation-aware representations for optical flow, enabling robust performance on real-world corrupted videos through hybrid architecture design.
Abstract: Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.
[225] UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang
Main category: cs.CV
TL;DR: Unified RL framework (UniGRPO) for interleaved text-image generation using GRPO for reasoning and modified FlowGRPO for image synthesis, enabling scalable multimodal generation.
Details
Motivation: To advance unified models for interleaved generation by developing a scalable reinforcement learning framework for multimodal generation, treating reasoning-driven image generation as the fundamental unit of more complex interleaved scenarios.
Method: Proposes the UniGRPO framework, which treats multimodal generation as an MDP with sparse terminal rewards. Uses standard GRPO for text reasoning and a modified FlowGRPO for image synthesis, with two key modifications: (1) eliminating classifier-free guidance to keep rollouts linear and unbranched, and (2) replacing the latent KL penalty with an MSE penalty on velocity fields for more robust regularization.
Result: The unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for future post-training of fully interleaved models.
Conclusion: The proposed UniGRPO framework successfully enables joint optimization of text and image generation policies, offering a scalable approach for interleaved multimodal generation that can be extended to more complex multi-round scenarios.
Abstract: Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
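The two ingredients named in the method — group-relative advantages (GRPO) and an MSE penalty computed directly on velocity fields in place of a latent KL term — admit a small toy sketch. This is an illustrative assumption-laden caricature, not the paper's implementation; the reward values, grid shapes, and the weight `beta` are made up.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize sparse terminal rewards
    within one rollout group, as in GRPO-style policy optimization."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def velocity_mse_penalty(v_policy, v_ref):
    """MSE between the policy's and the reference model's predicted
    velocity fields, used in place of a latent-space KL penalty."""
    return float(np.mean((np.asarray(v_policy) - np.asarray(v_ref)) ** 2))

# Toy rollout group of 4 generations with sparse terminal rewards.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.0])

# Toy flow-matching velocity fields on a small latent grid.
rng = np.random.default_rng(1)
v_ref = rng.standard_normal((4, 4, 2))
v_policy = v_ref + 0.1 * rng.standard_normal((4, 4, 2))

beta = 0.01  # regularization weight (illustrative)
penalty = beta * velocity_mse_penalty(v_policy, v_ref)
print(adv.round(3), penalty)
```

The point of the MSE-on-velocity penalty is that it regularizes the quantity the flow-matching model actually predicts, giving a direct signal against reward hacking without estimating a distributional divergence in latent space.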
[226] OccAny: Generalized Unconstrained Urban 3D Occupancy
Anh-Quan Cao, Tuan-Hung Vu
Main category: cs.CV
TL;DR: OccAny is a generalized 3D occupancy prediction model for urban scenes that works with uncalibrated, out-of-domain data using monocular, sequential, or surround-view images, featuring segmentation forcing and novel view rendering for geometry completion.
Details
Motivation: Existing 3D occupancy prediction methods require in-domain annotations and precise sensor-rig priors, limiting scalability and generalization. Visual geometry foundation models lack key ingredients for urban occupancy prediction: metric prediction, geometry completion in cluttered scenes, and adaptation to urban scenarios.
Method: Proposes OccAny with three components: (1) a generalized 3D occupancy framework for uncalibrated scenes, (2) Segmentation Forcing, which improves occupancy quality and enables mask-level prediction, and (3) a Novel View Rendering pipeline for test-time view augmentation and geometry completion. Works with sequential, monocular, or surround-view images.
Result: Outperforms all visual geometry baselines on 3D occupancy prediction task, remains competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets.
Conclusion: OccAny is the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes, demonstrating strong generalization capabilities for metric occupancy prediction with segmentation features.
Abstract: Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny .
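The test-time view augmentation idea — completing geometry by combining occupancy evidence from additional rendered views — can be caricatured as a simple per-voxel fusion across views. This sketch is an assumption, not OccAny's actual pipeline: the fusion rules, shapes, and function names are hypothetical.

```python
import numpy as np

def fuse_view_occupancies(per_view_occ, mode="max"):
    """Fuse per-view occupancy probability grids into one completed grid.

    per_view_occ: array of shape (V, X, Y, Z) with values in [0, 1],
    one predicted occupancy volume per (real or rendered novel) view.
    """
    occ = np.asarray(per_view_occ, dtype=float)
    if mode == "max":    # a voxel is occupied if any view supports it
        return occ.max(axis=0)
    if mode == "mean":   # soft consensus across views
        return occ.mean(axis=0)
    raise ValueError(mode)

# Toy example: two views, each observing a different half of a 2x2x2
# volume; fusing them completes the geometry neither view sees alone.
v1 = np.zeros((2, 2, 2)); v1[0] = 0.9
v2 = np.zeros((2, 2, 2)); v2[1] = 0.8
fused = fuse_view_occupancies(np.stack([v1, v2]))
print(fused[0, 0, 0], fused[1, 0, 0])  # 0.9 0.8
```

The `max` rule acts as a union over views, which is why extra rendered viewpoints help fill occlusions in cluttered scenes.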
[227] Gradient Descent Provably Solves Nonlinear Tomographic Reconstruction
Sara Fridovich-Keil, Fabrizio Valdivia, Gordon Wetzstein, Benjamin Recht, Mahdi Soltanolkotabi
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2310.03956 returned HTTP 429 (rate limited).
[228] Quantifying Noise of Dynamic Vision Sensor
Evgeny V. Votyakov, Alessandro Artusi
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2404.01948 returned HTTP 429 (rate limited).
[229] DifAttack++: Query-Efficient Black-Box Adversarial Attack via Hierarchical Disentangled Feature Space in Cross-Domain
Jun Liu, Jiantao Zhou, Jiandian Zeng, Jinyu Tian, Isao Echizen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2406.03017 returned HTTP 429 (rate limited).
[230] Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration
Younan Zhu, Linwei Tao, Minjing Dong, Chang Xu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2502.01969 returned HTTP 429 (rate limited).
[231] Replay-Free Continual Low-Rank Adaptation with Dynamic Memory
Huancheng Chen, Jingtao Li, Weiming Zhuang, Chen Chen, Lingjuan Lyu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2411.00623 returned HTTP 429 (rate limited).
[232] Quasi-Conformal Convolution : A Learnable Convolution for Deep Learning on Simply Connected Open Surfaces
Han Zhang, Tsz Lok Ip, Lok Ming Lui
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2502.01356 returned HTTP 429 (rate limited).
[233] Architecture-Aware Minimization (A$^2$M): How to Find Flat Minima in Neural Architecture Search
Matteo Gambella, Fabrizio Pittorino, Manuel Roveri
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.10404 returned HTTP 429 (rate limited).
[234] Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions
Kasra Borazjani, Payam Abdisarabshali, Naji Khosravan, Seyyedali Hosseinalipour
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.14553 returned HTTP 429 (rate limited).
[235] Cross-Domain Underwater Image Enhancement Guided by No-Reference Image Quality Assessment: A Transfer Learning Approach
Zhi Zhang, Minfu Li, Lu Li, Daoyi Chen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.17937 returned HTTP 429 (rate limited).
[236] FiGKD: Fine-Grained Knowledge Distillation via High-Frequency Detail Transfer
Seonghak Kim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2505.11897 returned HTTP 429 (rate limited).
[237] PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
Jaehyun Choi, Jiwan Hur, Gyojin Han, Jaemyung Yu, Junmo Kim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2505.22564 returned HTTP 429 (rate limited).
[238] Image Generation from Contextually-Contradictory Prompts
Saar Huberman, Or Patashnik, Omer Dahary, Ron Mokady, Daniel Cohen-Or
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.01929 returned HTTP 429 (rate limited).
[239] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning
Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zexuan Yan, Zhifeng Li, Sirui Han, Chenyang Qi, Qifeng Chen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.05207 returned HTTP 429 (rate limited).
[240] Towards a general-purpose foundation model for fMRI analysis
Cheng Wang, Yu Jiang, Zhihao Peng, Chenxin Li, Changbae Bang, Lin Zhao, Wanyi Fu, Jinglei Lv, Jorge Sepulcre, Carl Yang, Lifang He, Tianming Liu, Xue-Jun Kong, Quanzheng Li, Daniel S. Barron, Anqi Qiu, Randy Hirschtick, Byung-Hoon Kim, Hongbin Han, Xiang Li, Yixuan Yuan
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.11167 returned HTTP 429 (rate limited).
[241] BeltCrack: the First Sequential-image Industrial Conveyor Belt Crack Detection Dataset and Its Baseline with Triple-domain Feature Learning
Jianghong Huang, Luping Ji, Xin Ma, Mao Ye
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.17892 returned HTTP 429 (rate limited).
[242] MS-DGCNN++: Multi-Scale Dynamic Graph Convolution with Scale-Dependent Normalization for Robust LiDAR Tree Species Classification
Said Ohamouddou, Hanaa El Afia, Mohamed Hamza Boulaich, Abdellatif El Afia, Raddouane Chiheb
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2507.12602 returned HTTP 429 (rate limited).
[243] Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2508.03643 returned HTTP 429 (rate limited).
[244] From Noisy Labels to Intrinsic Structure: A Geometric-Structural Dual-Guided Framework for Noise-Robust Medical Image Segmentation
Tao Wang, Zhenxuan Zhang, Yuanbo Zhou, Xinlin Zhang, Yuanbin Chen, Tao Tan, Guang Yang, Tong Tong
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.02419 returned HTTP 429 (rate limited).
[245] From Editor to Dense Geometry Estimator
JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.04338 returned HTTP 429 (rate limited).
[246] Learning to Stylize by Learning to Destylize: A Scalable Paradigm for Supervised Style Transfer
Ye Wang, Zili Yi, Yibo Zhang, Peng Zheng, Xuping Xie, Jiang Lin, Yijun Li, Yilin Wang, Rui Ma
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.05970 returned HTTP 429 (rate limited).
[247] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework
Nanaka Hosokawa, Ryo Takahashi, Tomoya Kitano, Yukihiro Iida, Chisako Muramatsu, Tatsuro Hayashi, Yuta Seino, Xiangrong Zhou, Takeshi Hara, Akitoshi Katsumata, Hiroshi Fujita
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.02001 returned HTTP 429 (rate limited).
[248] GenExam: A Multidisciplinary Text-to-Image Exam
Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.14232 returned HTTP 429 (rate limited).
[249] Background Fades, Foreground Leads: Curriculum-Guided Background Pruning for Efficient Foreground-Centric Collaborative Perception
Yuheng Wu, Xiangbo Gao, Quang Tau, Zhengzhong Tu, Dongman Lee
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.19250 returned HTTP 429 (rate limited).
[250] Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding
Anupam Pani, Yanchao Yang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.21356 returned HTTP 429 (rate limited).
[251] Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
Fenfen Lin, Yesheng Liu, Haiyu Xu, Chen Yue, Zheqi He, Mingxuan Zhao, Miguel Hu Chen, Jiakang Liu, JG Yao, Xi Yang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.26865 returned HTTP 429 (rate limited).
[252] Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, Lei Yang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.27684 returned HTTP 429 (rate limited).
[253] MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding
Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.12449 returned HTTP 429 (rate limited).
[254] UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.03334 returned HTTP 429 (rate limited).
[255] Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network
Yuanzhe Li, Steffen Müller
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.20008 returned HTTP 429 (rate limited).
[256] The Potential of Copernicus Satellites for Disaster Response: Retrieving Building Damage from Sentinel-1 and Sentinel-2
Olivier Dietrich, Merlin Alfredsson, Emilia Arens, Nando Metzger, Torben Peters, Linus Scheibenreif, Jan Dirk Wegner, Konrad Schindler
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.05461 returned HTTP 429 (rate limited).
[257] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.02487 returned HTTP 429 (rate limited).
[258] Latent Diffusion Inversion Requires Understanding the Latent Space
Mingxing Rao, Bowen Qu, Daniel Moyer
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.20592 returned HTTP 429 (rate limited).
[259] MoEGCL: Mixture of Ego-Graphs Contrastive Representation Learning for Multi-View Clustering
Jian Zhu, Xin Zou, Jun Sun, Cheng Luo, Lei Liu, Lingfang Zeng, Ning Zhang, Bian Wu, Chang Tang, Lirong Dai
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.05876 returned HTTP 429 (rate limited).
[260] Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.03454 (HTTP 429 from export.arxiv.org).
[261] Classification of Microplastic Particles in Water using Polarized Light Scattering and Machine Learning Methods
Leonard Saur, Marc von Pawlowski, Ulrich Gengenbach, Ingo Sieber, Hossein Shirali, Lorenz Wührl, Xiangyu Weng, Rainer Kiko, Christian Pylatiuk
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2511.06901 (HTTP 429 from export.arxiv.org).
[262] Metaphor-based Jailbreak Attacks on Text-to-Image Models
Chenyu Zhang, Lanjun Wang, Yiwen Ma, Wenhui Li, Yi Tu, An-An Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.10766 (HTTP 429 from export.arxiv.org).
[263] DI3CL: Contrastive Learning With Dynamic Instances and Contour Consistency for SAR Land-Cover Classification Foundation Model
Zhongle Ren, Hui Ding, Kai Wang, Biao Hou, Xingyu Luo, Weibin Li, Licheng Jiao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2511.07808 (HTTP 429 from export.arxiv.org).
[264] Schrödinger’s Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu, Yu-Gang Jiang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.21201 (HTTP 429 from export.arxiv.org).
[265] Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
Yusen Cai, Qing Lin, Bhargava Satya Nunna, Mengmi Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2511.14440 (HTTP 429 from export.arxiv.org).
[266] Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Zhixing Tan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2511.20032 (HTTP 429 from export.arxiv.org).
[267] Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2601.13719 (HTTP 429 from export.arxiv.org).
[268] HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning
Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2511.20515 (HTTP 429 from export.arxiv.org).
[269] Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Xu Tang, Yao Hu, Shaohui Lin, Philip Torr, Feng Zhao, Wanli Ouyang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2601.22060 (HTTP 429 from export.arxiv.org).
[270] DINO-Tok: Adapting DINO for Visual Tokenizers
Mingkai Jia, Mingxiao Li, Zhijian Shu, Anlin Zheng, Liaoyuan Fan, Jiaxin Guo, Tianxing Shi, Dongyue Lu, Zeming Li, Xiaoyang Guo, Xiaojuan Qi, Xiao-Xiao Long, Qian Zhang, Ping Tan, Wei Yin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2511.20565 (HTTP 429 from export.arxiv.org).
[271] 1S-DAug: One-Shot Data Augmentation for Robust Few-Shot Generalization
Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2602.00114 (HTTP 429 from export.arxiv.org).
[272] FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning
Hongwei Yan, Guanglong Sun, Kanglei Zhou, Qian Li, Liyuan Wang, Yi Zhong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2602.01976 (HTTP 429 from export.arxiv.org).
[273] From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
Jingxi Chen, Yixiao Zhang, Xiaoye Qian, Zongxia Li, Cornelia Fermuller, Caren Chen, Yiannis Aloimonos
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2511.20996 (HTTP 429 from export.arxiv.org).
[274] Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
Xinrong Chen, Xu Chu, Yingmin Qiu, Hengyuan Zhang, Jing Xiong, Shiyu Tang, Shuai Liu, Shaokang Yang, Cheng Yang, Hayden Kwok-Hay So, Ngai Wong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2602.01047 (HTTP 429 from export.arxiv.org).
[275] Refine Now, Query Fast: A Decoupled Refinement Paradigm for Implicit Neural Fields
Tianyu Xiong, Skylar Wurster, Han-Wei Shen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2602.15155 (HTTP 429 from export.arxiv.org).
[276] Captain Safari: A World Engine with Pose-Aligned 3D Memory
Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, Junfei Xiao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2511.22815 (HTTP 429 from export.arxiv.org).
[277] TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
Junyuan Zhang, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen, Jialin Lu, Junjie Shan, Ziqi Zhao, Shuya Yang, Ziling Wang, Ziyang Miao, Huaping Zhong, Yuhang Zang, Xiaoyi Dong, Ka-Ho Chow, Conghui He
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.01248 (HTTP 429 from export.arxiv.org).
[278] nuScenes Revisited: Progress and Challenges in Autonomous Driving
Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.02448 (HTTP 429 from export.arxiv.org).
[279] Human Presence Detection via Wi-Fi Range-Filtered Doppler Spectrum on Commodity Laptops
Jessica Sanson, Rahul C. Shah, Valerio Frascolla
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2603.10845 (HTTP 429 from export.arxiv.org).
[280] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding
Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.02505 (HTTP 429 from export.arxiv.org).
[281] U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences
Xiang Xu, Alan Liang, Youquan Liu, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.02982 (HTTP 429 from export.arxiv.org).
[282] SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition
Jingxiao Yang, DaLin He, Miao Pan, Ge Su, Wenqi Zhang, Yifeng Hu, Tangwei Li, Yuke Li, Xuhong Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2603.17729 (HTTP 429 from export.arxiv.org).
[283] ViDiC: Video Difference Captioning
Jiangtao Wu, Shihao Li, Zhaozhou Bian, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Yuanxing Zhang, Jiaheng Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.03405 (HTTP 429 from export.arxiv.org).
[284] CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics
Dahyeon Kye, Jeahun Sung, Minkyu Jeon, Jihyong Oh
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.07155 (HTTP 429 from export.arxiv.org).
[285] Elastic Weight Consolidation Done Right for Continual Learning
Xuan Liu, Xiaobin Chang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2603.18596 (HTTP 429 from export.arxiv.org).
[286] GeoDiffMM: Geometry-Guided Conditional Diffusion for Motion Magnification
Xuedeng Liu, Jiabao Guo, Zheng Zhang, Fei Wang, Zhi Liu, Dan Guo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.08325 (HTTP 429 from export.arxiv.org).
[287] UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
Hewen Pan, Cong Wei, Dashuang Liang, Zepeng Huang, Pengfei Gao, Ziqi Zhou, Lulu Xue, Pengfei Yan, Xiaoming Wei, Minghui Li, Shengshan Hu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.11336 (HTTP 429 from export.arxiv.org).
[288] Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models
Wenda Li, Meng Wu, Liangzhao Chen, Sungmin Eum, Heesung Kwon, Qing Qu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.13869 (HTTP 429 from export.arxiv.org).
[289] LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
Shuaibang Peng, Juelin Zhu, Xia Li, Kun Yang, Maojun Zhang, Yu Liu, Shen Yan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2603.19609 (HTTP 429 from export.arxiv.org).
[290] Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen, Haohuan Fu, Runmin Dong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.16740 (HTTP 429 from export.arxiv.org).
[291] OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework
Weixuan Zeng, Pengcheng Wei, Huaiqing Wang, Boheng Zhang, Jia Sun, Dewen Fan, Lin HE, Long Chen, Qianqian Gan, Fan Yang, Tingting Gao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2603.19643 (HTTP 429 from export.arxiv.org).
[292] Point What You Mean: Visually Grounded Instruction Policy
Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma, Di Zhang, Yingdong Hu, Guang Chen, Junyuan Xie, Junliang Guo, Junqiao Zhao, Yang Gao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.18933 (HTTP 429 from export.arxiv.org).
[293] From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping
Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, Zhiyong Wu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2512.25066 (HTTP 429 from export.arxiv.org).
[294] LPNSR: Prior-Enhanced Diffusion Image Super-Resolution via LR-Guided Noise Prediction
Shuwei Huang, Shizhuo Liu, Zijun Wei
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2603.21045 (HTTP 429 from export.arxiv.org).
[295] Temporal Slowness in Central Vision Drives Semantic Object Learning
Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2602.04462 (HTTP 429 from export.arxiv.org).
[296] FastVMT: Eliminating Redundancy in Video Motion Transfer
Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Kunyu Feng, Yuxuan Xue, Zixiang Zhao, Konrad Schindler, Qifeng Chen, Linfeng Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2602.05551 (HTTP 429 from export.arxiv.org).
[297] When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang, Ni Yang, Qiuying Peng, Luyuan Zhang, Hangrui Xu, Tianhuang Su, Zhenyu Yang, Haonan Lu, Haoqian Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2603.21289 (HTTP 429 from export.arxiv.org).
[298] SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos
Jinlin Wu, Felix Holm, Chuxi Chen, An Wang, Yaxin Hu, Xiaofan Ye, Zelin Zang, Miao Xu, Lihua Zhou, Huai Liao, Danny T. M. Chan, Ming Feng, Wai S. Poon, Hongliang Ren, Dong Yi, Nassir Navab, Gaofeng Meng, Jiebo Luo, Hongbin Liu, Zhen Lei
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2602.05638 (HTTP 429 from export.arxiv.org).
[299] SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
Jiwoo Chung, Sangeek Hyun, MinKyu Lee, Byeongju Han, Geonho Cha, Dongyoon Wee, Youngjun Hong, Jae-Pil Heo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2602.18993 (HTTP 429 from export.arxiv.org).
[300] CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects
Zhenran Tang, Rohan Nagabhirava, Changliu Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API rate-limited the request (HTTP 429).
Abstract: Not retrieved for 2602.20551 (HTTP 429 from export.arxiv.org).
[301] Momentum Memory for Knowledge Distillation in Computational Pathology
Yongxin Guo, Hao Lu, Onur C. Koyun, Zhengjie Zhu, Muhammet Fatih Demir, Metin Nafi Gurcan
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.21395 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[302] Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.22042 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[303] WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.23029 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[304] Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
Kejia Liu, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song, Haofei Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.22153 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[305] Spectral Gaps and Spatial Priors: Studying Hyperspectral Downstream Adaptation Using TerraMind
Julia Anna Leonardi, Johannes Jakubik, Paolo Fraccaro, Maria Antonia Brovelli
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.06690 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[306] SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer
Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.07057 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[307] Efficient Chest X-ray Representation Learning via Semantic-Partitioned Contrastive Learning
Wangyu Feng, Shawn Young, Lijian Xu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.07113 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[308] LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models
Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, Lingqiao Liu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.07145 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[309] GazeShift: Unsupervised Gaze Estimation and Dataset for VR
Gil Shapira, Ishay Goldin, Evgeny Artyomov, Donghoon Kim, Yosi Keller, Niv Zehngut
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.07832 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[310] MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals
Junyu Shen, Zhendong She, Chenghanyu Zhang, Yuchuang Sun, Luqing Luo, Dingwei Tan, Zonghao Guo, Bo Guo, Zehua Han, Wupeng Xie, Yaxin Mu, Peng Zhang, Peipei Li, Fengxiang Wang, Yangang Sun, Maosong Sun
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.08174 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[311] Where, What, Why: Toward Explainable 3D-GS Watermarking
Mingshu Cai, Jiajun Li, Osamu Yoshie, Yuya Ieiri, Yixuan Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.08809 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[312] Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency
Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Lili Pan, Hongliang Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.09798 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[313] Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts
Xi Chen, Maojun Zhang, Yu Liu, Shen Yan
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.13352 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[314] Selective Noise Suppression and Discriminative Mutual Interaction for Robust Audio-Visual Segmentation
Kai Peng, Yunzhe Shen, Miao Zhang, Leiye Liu, Yidong Han, Wei Ji, Jingjing Li, Yongri Piao, Huchuan Lu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.14203 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[315] HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning
Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.15253 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[316] LLM-Powered Flood Depth Estimation from Social Media Imagery: A Vision-Language Model Framework with Mechanistic Interpretability for Transportation Resilience
Nafis Fuad, Xiaodong Qian
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.17108 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[317] PhyUnfold-Net: Advancing Remote Sensing Change Detection with Physics-Guided Deep Unfolding
Zelin Lei, Yaoxing Ren, Jiaming Chang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.19566 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[318] Inverting Neural Networks: New Methods to Generate Neural Network Inputs from Prescribed Outputs
Rebecca Pattichis, Sebastian Janampa, Constantinos S. Pattichis, Marios S. Pattichis
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.20461 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[319] ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework
Guanzhou Chen, Erfei Cui, Changyao Tian, Danni Yang, Ganlin Yang, Yu Qiao, Hongsheng Li, Gen Luo, Hongjie Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.20644 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[320] PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
Xiaoya Cheng, Long Wang, Yan Liu, Xinyi Liu, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.20778 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[321] 2Xplat: Two Experts Are Better Than One Generalist
Hwasik Jeong, Seungryong Lee, Gyeongjin Kang, Seungkwon Yang, Xiangyu Sun, Seungtae Nam, Eunbyung Park
Main category: cs.CV
TL;DR: 2Xplat introduces a two-expert framework for pose-free 3D Gaussian Splatting that separates geometry estimation (camera pose prediction) from appearance modeling (3D Gaussian generation), achieving state-of-the-art performance in fewer than 5K training iterations.
Details
Motivation: Current pose-free 3DGS methods use unified monolithic architectures that entangle geometric reasoning and appearance modeling, which may be suboptimal for high-fidelity 3D generation. The authors propose that separating these tasks could lead to better performance.
Method: A two-expert design where a geometry expert first predicts camera poses from uncalibrated multi-view images, then explicitly passes these poses to an appearance expert that synthesizes 3D Gaussian representations. This modular approach decouples geometric estimation from appearance synthesis.
Result: The framework substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods, all in fewer than 5K training iterations.
Conclusion: The success of 2Xplat challenges the prevailing unified paradigm and suggests modular design principles offer advantages for complex 3D geometric estimation and appearance synthesis tasks.
Abstract: Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such “all-in-one” designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, being largely underexplored in prior works, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-experts pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.
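The two-expert split described in the abstract can be caricatured in a few lines. The following is a minimal Python sketch; all names (`GeometryExpert`, `AppearanceExpert`, `two_expert_pipeline`) are illustrative stand-ins, not the actual 2Xplat implementation.

```python
# Minimal sketch of a two-expert pose-free pipeline. Both "experts" are
# hypothetical stubs standing in for learned networks.

class GeometryExpert:
    """Predicts a camera pose for each uncalibrated input view."""
    def predict_poses(self, views):
        # Stand-in: a real model would regress rotation/translation per view.
        return [{"view": i, "pose": (0.0, 0.0, float(i))}
                for i, _ in enumerate(views)]

class AppearanceExpert:
    """Synthesizes 3D Gaussians, conditioned on poses handed to it explicitly."""
    def generate_gaussians(self, views, poses):
        # Stand-in: anchor one Gaussian record at each predicted pose.
        return [{"view": p["view"], "mean": p["pose"]} for p in poses]

def two_expert_pipeline(views):
    # The design point the paper argues for: poses are estimated first and
    # passed explicitly, so geometry and appearance stay decoupled instead
    # of sharing one monolithic representation.
    poses = GeometryExpert().predict_poses(views)
    return AppearanceExpert().generate_gaussians(views, poses)

gaussians = two_expert_pipeline(["img_a", "img_b", "img_c"])
print(len(gaussians))  # one record per input view: 3
```

The point of the sketch is purely architectural: the pose output is an explicit interface between the two modules, which is what distinguishes this design from the "all-in-one" baselines the paper compares against.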
[322] Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos
Masoumeh Sharafi, Muhammad Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Eric Granger
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.21309 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[323] ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies
Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, Yuewen Ma
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.14315 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[324] DiffBMP: Differentiable Rendering with Bitmap Primitives
Seongmin Hong, Junghun James Kim, Daehyeop Kim, Insoo Chung, Se Young Chun
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.22625 returned HTTP 429 (rate limited), so no abstract or analysis is available.
[325] GHOST: Ground-projected Hypotheses from Observed Structure-from-Motion Trajectories
Tomasz Frelek, Rohan Patil, Akshar Tumu, Henrik I. Christensen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.20583 returned HTTP 429 (rate limited), so no abstract or analysis is available.
cs.AI
[326] Memory Bear AI Memory Science Engine for Multimodal Affective Intelligence: A Technical Report
Deliang Wen, Ke Sun, Yu Wang
Main category: cs.AI
TL;DR: Memory Bear AI introduces a memory-centered framework for multimodal emotion recognition that treats affective information as structured, evolving memory units rather than transient labels, enabling better handling of long-horizon dependencies and noisy/missing modalities.
Details
Motivation: Current multimodal emotion recognition systems are limited to short-range inference with poor support for persistent affective memory, long-horizon dependency modeling, and robust interpretation under imperfect input conditions like noisy or missing modalities.
Method: Proposes a memory-centered framework with structured Emotion Memory Units (EMUs) that organizes processing through memory formation, working-memory aggregation, long-term consolidation, memory-driven retrieval, dynamic fusion calibration, and continuous updating.
Result: The framework shows consistent gains over comparison systems across benchmark and business settings, with stronger accuracy and robustness, especially under noisy or missing-modality conditions.
Conclusion: The approach represents a practical step from local emotion recognition toward more continuous, robust, and deployment-relevant affective intelligence by treating emotion as structured memory rather than transient output.
Abstract: Affective judgment in real interaction is rarely a purely local prediction problem. Emotional meaning often depends on prior trajectory, accumulated context, and multimodal evidence that may be weak, noisy, or incomplete at the current moment. Although multimodal emotion recognition (MER) has improved the integration of text, speech, and visual signals, many existing systems remain optimized for short-range inference and provide limited support for persistent affective memory, long-horizon dependency modeling, and robust interpretation under imperfect input. This technical report presents the Memory Bear AI Memory Science Engine, a memory-centered framework for multimodal affective intelligence. Instead of treating emotion as a transient output label, the framework models affective information as a structured and evolving variable within a memory system. It organizes processing through structured memory formation, working-memory aggregation, long-term consolidation, memory-driven retrieval, dynamic fusion calibration, and continuous memory updating. At its core, multimodal signals are transformed into structured Emotion Memory Units (EMUs), enabling affective information to be preserved, reactivated, and revised across interaction horizons. Experimental results show consistent gains over comparison systems across benchmark and business-grounded settings, with stronger accuracy and robustness, especially under noisy or missing-modality conditions. The framework offers a practical step from local emotion recognition toward more continuous, robust, and deployment-relevant affective intelligence.
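The memory lifecycle the report describes (formation, consolidation, retrieval) can be illustrated with a toy sketch. The EMU fields and the consolidation rule below are assumptions for illustration, not the report's actual schema.

```python
from dataclasses import dataclass

# Toy model of Emotion Memory Units (EMUs); field names and the
# consolidation threshold are hypothetical, not from the technical report.

@dataclass
class EmotionMemoryUnit:
    label: str         # e.g. "joy", "frustration"
    intensity: float   # fused multimodal evidence strength in [0, 1]

class AffectiveMemory:
    def __init__(self):
        self.working = []     # short-horizon aggregation buffer
        self.long_term = {}   # consolidated label -> EMU

    def form(self, label, intensity):
        self.working.append(EmotionMemoryUnit(label, intensity))

    def consolidate(self, min_intensity=0.5):
        # Only sufficiently strong units persist; weak, noisy evidence
        # stays revisable in working memory instead of being discarded.
        for emu in self.working:
            if emu.intensity >= min_intensity:
                old = self.long_term.get(emu.label)
                if old:  # revise rather than overwrite
                    old.intensity = 0.5 * (old.intensity + emu.intensity)
                else:
                    self.long_term[emu.label] = emu

    def retrieve(self, label):
        return self.long_term.get(label)

mem = AffectiveMemory()
mem.form("frustration", 0.8)
mem.form("joy", 0.3)       # weak evidence, not consolidated yet
mem.consolidate()
print(mem.retrieve("frustration").intensity)  # 0.8
print(mem.retrieve("joy"))                    # None
```

This captures the framing of emotion as "a structured and evolving variable within a memory system" rather than a transient per-utterance label, though the real engine's fusion calibration and multimodal encoding are of course far richer.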
[327] The Efficiency Attenuation Phenomenon: A Computational Challenge to the Language of Thought Hypothesis
Di Zhang
Main category: cs.AI
TL;DR: The paper investigates whether thought requires language-like formats using an AI thought experiment where agents develop efficient but inscrutable communication protocols, challenging the Language of Thought hypothesis.
Details
Motivation: To computationally test the Language of Thought (LoT) hypothesis which posits that thought requires a language-like format, by examining whether artificial agents develop optimal communication that is not human-comprehensible.
Method: Introduces the “AI Private Language” thought experiment using multi-agent reinforcement learning (MARL) in a cooperative navigation task under partial observability. Compares emergent communication protocols against pre-defined human-like symbolic protocols.
Result: Agents with emergent communication protocols achieved 50.5% higher efficiency than those using pre-defined human-like symbolic protocols, confirming the Efficiency Attenuation Phenomenon (EAP).
Conclusion: Optimal collaborative cognition in these systems is mediated by sub-symbolic computations rather than symbolic structures, supporting pluralism in cognitive architectures and raising implications for AI ethics.
Abstract: This paper computationally investigates whether thought requires a language-like format, as posited by the Language of Thought (LoT) hypothesis. We introduce the “AI Private Language” thought experiment: if two artificial agents develop an efficient, inscrutable communication protocol via multi-agent reinforcement learning (MARL), and their performance declines when forced to use a human-comprehensible language, this Efficiency Attenuation Phenomenon (EAP) challenges the LoT. We formalize this in a cooperative navigation task under partial observability. Results show that agents with an emergent protocol achieve 50.5% higher efficiency than those using a pre-defined, human-like symbolic protocol, confirming the EAP. This suggests optimal collaborative cognition in these systems is not mediated by symbolic structures but is naturally coupled with sub-symbolic computations. The work bridges philosophy, cognitive science, and AI, arguing for pluralism in cognitive architectures and highlighting implications for AI ethics.
[328] Dynamic Fusion-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition in Conversations
Tao Meng, Weilun Tang, Yuntao Shou, Yilong Tan, Jun Zhou, Wei Ai, Keqin Li
Main category: cs.AI
TL;DR: DF-GCN: Dynamic fusion-aware graph convolutional network for multimodal emotion recognition in conversations using ODEs and global information vectors to guide dynamic multimodal feature fusion.
Details
Motivation: Existing multimodal emotion recognition methods use fixed parameters for all emotion types, ignoring the dynamic nature of multimodal fusion and forcing models to balance performance across all emotions, which limits performance on specific emotions.
Method: Proposes DF-GCN that integrates ordinary differential equations into GCNs to capture dynamic emotional dependencies and uses prompts from global information vectors to guide dynamic fusion of multimodal features, allowing parameter adaptation for different emotion categories.
Result: Comprehensive experiments on two public multimodal conversational datasets confirm superior performance, benefiting significantly from the dynamic fusion mechanism.
Conclusion: DF-GCN enables more flexible emotion classification and enhances model generalization by dynamically adapting parameters for different emotion categories through guided multimodal feature fusion.
Abstract: Multimodal emotion recognition in conversations (MERC) aims to identify and understand the emotions expressed by speakers during utterance interaction from multiple modalities (e.g., text, audio, images, etc.). Existing studies have shown that GCN can improve the performance of MERC by modeling dependencies between speakers. However, existing methods usually use fixed parameters to process multimodal features for different emotion types, ignoring the dynamics of fusion between different modalities, which forces the model to balance performance between multiple emotion categories, thus limiting the model’s performance on some specific emotions. To this end, we propose a dynamic fusion-aware graph convolutional neural network (DF-GCN) for robust recognition of multimodal emotion features in conversations. Specifically, DF-GCN integrates ordinary differential equations into graph convolutional networks (GCNs) to capture the dynamic nature of emotional dependencies within utterance interaction networks and leverages the prompts generated by the global information vector (GIV) of the utterance to guide the dynamic fusion of multimodal features. This allows our model to dynamically change parameters when processing each utterance feature, so that different network parameters can be equipped for different emotion categories in the inference stage, thereby achieving more flexible emotion classification and enhancing the generalization ability of the model. Comprehensive experiments conducted on two public multimodal conversational datasets confirm that the proposed DF-GCN model delivers superior performance, benefiting significantly from the dynamic fusion mechanism introduced.
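The core idea of GIV-conditioned dynamic fusion can be shown with a toy example: the fusion weights are computed per utterance from a global information vector rather than fixed once for all emotions. The function names and the weighting rule below are assumptions for illustration, not DF-GCN's actual formulation.

```python
import math

# Toy sketch of dynamic, GIV-conditioned multimodal fusion. The GIV "prompt"
# scores each modality, so different utterances get different fusion
# parameters at inference time (vs. one fixed set for every emotion).

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dynamic_fuse(text_feat, audio_feat, visual_feat, giv):
    # Score each modality by its alignment with the global information vector.
    scores = [sum(g * f for g, f in zip(giv, m))
              for m in (text_feat, audio_feat, visual_feat)]
    w_text, w_audio, w_vis = softmax(scores)
    # Weighted combination: the effective fusion parameters change per input.
    return [w_text * t + w_audio * a + w_vis * v
            for t, a, v in zip(text_feat, audio_feat, visual_feat)]

fused = dynamic_fuse([1.0, 0.0], [0.0, 1.0], [0.5, 0.5], giv=[2.0, 0.0])
print(fused)  # the text modality dominates because the GIV aligns with it
```

A fixed-parameter baseline would apply the same `w_text, w_audio, w_vis` to every utterance; the GIV conditioning is what lets the model, in effect, use different parameters for different emotion categories.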
[329] Intelligence Inertia: Physical Principles and Applications
Jipeng Han
Main category: cs.AI
TL;DR: The paper introduces “intelligence inertia” as a physical principle explaining super-linear computational costs when maintaining symbolic interpretability during system reconfiguration, deriving a non-linear cost formula similar to the Lorentz factor.
Details
Motivation: Classical frameworks like Landauer's principle and Fisher Information fail to explain the explosive computational and energy costs incurred when maintaining symbolic interpretability during reconfiguration of advanced intelligent systems. There's a need for fundamental physical principles to quantify the computational weight of intelligence beyond static information-theoretic estimates.
Method: Introduces the concept of “intelligence inertia” based on fundamental non-commutativity between rules and states. Derives a non-linear cost formula mirroring the Lorentz factor, characterizing a relativistic J-shaped inflation curve. Validates through three experiments: comparative adjudication against classical Fisher Information models, geometric analysis of neural architecture evolution trajectories, and implementation of an inertia-aware scheduler wrapper for deep network training.
Result: Demonstrates a unified physical description for structural adaptation costs, offering first-principle explanation for computational and interpretability-maintenance overhead in intelligent agents. The inertia-aware scheduler wrapper optimizes training by respecting the agent’s physical resistance to change.
Conclusion: Intelligence inertia provides foundational characteristics for quantifying computational weight of intelligence, explaining super-linear adaptation costs that classical static models cannot capture. This offers new physical principles for understanding computational overhead in intelligent systems.
Abstract: While Landauer’s principle establishes the fundamental thermodynamic floor for information erasure and Fisher Information provides a metric for local curvature in parameter space, these classical frameworks function effectively only as approximations within regimes of sparse rule-constraints. They fail to explain the super-linear, and often explosive, computational and energy costs incurred when maintaining symbolic interpretability during the reconfiguration of advanced intelligent systems. This paper introduces the property of intelligence inertia and its underlying physical principles as foundational characteristics for quantifying the computational weight of intelligence. We demonstrate that this phenomenon is not merely an empirical observation but originates from the fundamental non-commutativity between rules and states, a root cause we have formally organized into a rigorous mathematical framework. By analyzing the growing discrepancy between actual adaptation costs and static information-theoretic estimates, we derive a non-linear cost formula that mirrors the Lorentz factor, characterizing a relativistic J-shaped inflation curve – a “computational wall” that static models are blind to. The validity of these physical principles is examined through a trilogy of decisive experiments: (1) a comparative adjudication of this J-curve inflation against classical Fisher Information models, (2) a geometric analysis of the “Zig-Zag” trajectory of neural architecture evolution, and (3) the implementation of an inertia-aware scheduler wrapper that optimizes the training of deep networks by respecting the agent’s physical resistance to change. Our results suggest a unified physical description for the cost of structural adaptation, offering a first-principle explanation for the computational and interpretability-maintenance overhead in intelligent agents.
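The abstract appeals to a Lorentz-factor analogy but does not reproduce the formula itself. One illustrative form consistent with that analogy (an assumption on my part, not the paper's derivation) is:

```latex
% Illustrative Lorentz-style cost law (assumed form, not quoted from the paper).
% C_0: static information-theoretic cost estimate; r: rate of rule-state
% reconfiguration; r_c: the "computational wall".
C(r) = \frac{C_0}{\sqrt{1 - (r/r_c)^2}}, \qquad C(r) \to \infty \ \text{as} \ r \to r_c .
```

Any law of this shape is nearly flat for small r, where static estimates such as Fisher Information remain good approximations, and inflates super-linearly as r approaches r_c, which would produce the J-shaped curve the paper describes.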
[330] MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation
Di Zhu, Zixuan Li
Main category: cs.AI
TL;DR: MUQ-EVAL is an open-source per-sample quality metric for AI-generated music that uses frozen MuQ-310M features with lightweight prediction heads, achieving high correlation with human judgments and real-time performance.
Details
Motivation: Existing distributional metrics like Fréchet Audio Distance cannot score individual music clips and correlate poorly with human judgments, while the only effective per-sample learned metric is closed-source. There's a need for open, accurate quality metrics for AI-generated music.
Method: Trains lightweight prediction heads on frozen MuQ-310M features using the MusicEval dataset (generated clips from 31 text-to-music systems with expert ratings). Uses attention pooling and a two-layer MLP. Systematic ablation over training objectives and adaptation strategies.
Result: Achieves system-level SRCC = 0.957 and utterance-level SRCC = 0.838 with human mean opinion scores. Frozen baseline performs best, showing MuQ representations already capture quality information. LoRA-adapted models work with as few as 150 clips. Real-time performance on consumer GPU.
Conclusion: MUQ-EVAL provides an effective open-source per-sample quality metric for AI-generated music, outperforming existing open metrics. Encoder choice is the dominant design factor. The metric enables personalized quality evaluators from individual listener annotations.
Abstract: Distributional metrics such as Fréchet Audio Distance cannot score individual music clips and correlate poorly with human judgments, while the only per-sample learned metric achieving high human correlation is closed-source. We introduce MUQ-EVAL, an open-source per-sample quality metric for AI-generated music built by training lightweight prediction heads on frozen MuQ-310M features using MusicEval, a dataset of generated clips from 31 text-to-music systems with expert quality ratings. Our simplest model, frozen features with attention pooling and a two-layer MLP, achieves system-level SRCC = 0.957 and utterance-level SRCC = 0.838 with human mean opinion scores. A systematic ablation over training objectives and adaptation strategies shows that no addition meaningfully improves the frozen baseline, indicating that frozen MuQ representations already capture quality-relevant information. Encoder choice is the dominant design factor, outweighing all architectural and training decisions. LoRA-adapted models trained on as few as 150 clips already achieve usable correlation, enabling personalized quality evaluators from individual listener annotations. A controlled degradation analysis reveals selective sensitivity to signal-level artifacts but insensitivity to musical-structural distortions. Our metric, MUQ-EVAL, is fully open-source, outperforms existing open per-sample metrics, and runs in real time on a single consumer GPU. Code, model weights, and evaluation scripts are available at https://github.com/dgtql/MuQ-Eval.
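The head architecture described above (frozen features, attention pooling, then a two-layer MLP producing a scalar score) can be sketched in miniature. Everything below is a toy stand-in: the feature dimension and randomly initialized parameters are illustrative, and real MuQ-310M features would come from the frozen encoder rather than random noise.

```python
import math
import random

random.seed(0)

DIM = 8  # toy feature dimension; MuQ-310M uses a much larger one

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical parameters (randomly initialized here; trained in the paper).
w_attn = [random.gauss(0, 1) for _ in range(DIM)]                     # attention scorer
W1 = [[random.gauss(0, 0.5) for _ in range(DIM)] for _ in range(4)]   # MLP layer 1
w2 = [random.gauss(0, 0.5) for _ in range(4)]                         # MLP layer 2

def predict_quality(frames):
    """Attention-pool frame-level features, then apply a two-layer MLP head."""
    scores = [dot(w_attn, f) for f in frames]
    alphas = softmax(scores)                       # attention weights over frames
    pooled = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(DIM)]
    hidden = [max(0.0, dot(row, pooled)) for row in W1]  # ReLU hidden layer
    return dot(w2, hidden)                         # scalar quality estimate (e.g., MOS)

clip = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]  # 5 "frames"
print(round(predict_quality(clip), 3))
```

In the paper, only the scorer and MLP would be trained against expert ratings; the encoder stays frozen, which is what makes the head lightweight.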
[331] Session Risk Memory (SRM): Temporal Authorization for Deterministic Pre-Execution Safety Gates
Florin Adrian Chitan
Main category: cs.AI
TL;DR: SRM adds trajectory-level authorization to deterministic safety gates by tracking semantic centroids and accumulating risk signals through exponential moving average, enabling detection of distributed attacks across multiple individually-compliant steps.
Details
Motivation: Current deterministic safety gates only evaluate individual agent actions, making them blind to distributed attacks that decompose harmful intent across multiple individually-compliant steps. There's a need for session-level safety that can detect gradual, multi-turn threats.
Method: Session Risk Memory (SRM) maintains a compact semantic centroid representing the evolving behavioral profile of an agent session and accumulates a risk signal through exponential moving average over baseline-subtracted gate outputs. It uses the same semantic vector representation as the underlying gate without requiring additional model components, training, or probabilistic inference.
Result: ILION+SRM achieves F1 = 1.0000 with 0% false positive rate, compared to stateless ILION at F1 = 0.9756 with 5% FPR, while maintaining 100% detection rate for both systems. SRM eliminates all false positives with per-turn overhead under 250 microseconds.
Conclusion: SRM provides a lightweight, deterministic approach to session-level safety by introducing temporal authorization consistency alongside spatial authorization consistency, enabling detection of distributed attacks without probabilistic inference or additional model training.
Abstract: Deterministic pre-execution safety gates evaluate whether individual agent actions are compatible with their assigned roles. While effective at per-action authorization, these systems are structurally blind to distributed attacks that decompose harmful intent across multiple individually-compliant steps. This paper introduces Session Risk Memory (SRM), a lightweight deterministic module that extends stateless execution gates with trajectory-level authorization. SRM maintains a compact semantic centroid representing the evolving behavioral profile of an agent session and accumulates a risk signal through exponential moving average over baseline-subtracted gate outputs. It operates on the same semantic vector representation as the underlying gate, requiring no additional model components, training, or probabilistic inference. We evaluate SRM on a multi-turn benchmark of 80 sessions containing slow-burn exfiltration, gradual privilege escalation, and compliance drift scenarios. Results show that ILION+SRM achieves F1 = 1.0000 with 0% false positive rate, compared to stateless ILION at F1 = 0.9756 with 5% FPR, while maintaining 100% detection rate for both systems. Critically, SRM eliminates all false positives with a per-turn overhead under 250 microseconds. The framework introduces a conceptual distinction between spatial authorization consistency (evaluated per action) and temporal authorization consistency (evaluated over trajectory), providing a principled basis for session-level safety in agentic systems.
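The two update rules the abstract describes, a running semantic centroid plus an EMA over baseline-subtracted per-turn gate risk, fit in a few lines. This is a sketch under assumed parameter values (alpha, baseline, threshold), not the paper's configuration:

```python
class SessionRiskMemory:
    """Sketch of SRM: a running semantic centroid of the session plus an
    exponential moving average of baseline-subtracted per-turn gate risk.
    All parameter values here are illustrative assumptions."""

    def __init__(self, dim, alpha=0.3, baseline=0.1, threshold=0.5):
        self.centroid = [0.0] * dim
        self.n = 0
        self.alpha = alpha          # EMA smoothing factor
        self.baseline = baseline    # expected benign per-turn gate risk
        self.threshold = threshold  # session-level block threshold
        self.risk = 0.0

    def update(self, action_vec, gate_risk):
        # Incremental mean keeps a compact behavioral profile of the session.
        self.n += 1
        self.centroid = [c + (x - c) / self.n
                         for c, x in zip(self.centroid, action_vec)]
        # Accumulate only risk above the benign baseline.
        excess = max(0.0, gate_risk - self.baseline)
        self.risk = (1 - self.alpha) * self.risk + self.alpha * excess
        return self.risk > self.threshold  # True => temporal authorization fails

srm = SessionRiskMemory(dim=3, threshold=0.2)
# Each step looks individually compliant (low gate risk), yet excess accumulates.
for step_risk in [0.2, 0.3, 0.35, 0.4, 0.45]:
    flagged = srm.update([0.1, 0.2, 0.3], step_risk)
print(flagged)  # the slow-burn trajectory eventually trips the session gate
```

The key property is that no single step exceeds any per-action threshold; only the trajectory-level signal does, which is the distinction the paper draws between spatial and temporal authorization consistency.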
[332] STEM Agent: A Self-Adapting, Tool-Enabled, Extensible Architecture for Multi-Protocol AI Agent Systems
Alfred Shen, Aaron Shen
Main category: cs.AI
TL;DR: STEM Agent is a modular AI agent framework with self-adapting protocol handlers, tool bindings, and memory subsystems that unifies multiple interoperability protocols and learns user preferences.
Details
Motivation: Current AI agent frameworks are limited by fixed interaction protocols, static tool integration, and rigid user models, preventing deployment across diverse interaction paradigms.
Method: Modular architecture inspired by biological pluripotency where an undifferentiated agent core differentiates into specialized protocol handlers, tool bindings, and memory subsystems. Unifies five interoperability protocols, introduces continuous user preference learning, externalizes domain capabilities via Model Context Protocol, and implements biologically inspired skills acquisition.
Result: Validated with 413-test suite across all five architectural layers, completing in under three seconds. Memory system achieves sub-linear growth under sustained interaction through consolidation mechanisms.
Conclusion: STEM Agent provides a flexible, self-adapting framework that overcomes limitations of current AI agent systems through biological inspiration and modular design.
Abstract: Current AI agent frameworks commit early to a single interaction protocol, a fixed tool integration strategy, and static user models, limiting their deployment across diverse interaction paradigms. To address these constraints, we introduce STEM Agent (Self-adapting, Tool-enabled, Extensible, Multi-agent), a modular architecture inspired by biological pluripotency in which an undifferentiated agent core differentiates into specialized protocol handlers, tool bindings, and memory subsystems that compose into a fully functioning AI system. The framework unifies five interoperability protocols (A2A, AG-UI, A2UI, UCP, and AP2) behind a single gateway, introduces a Caller Profiler that continuously learns user preferences across more than twenty behavioral dimensions, externalizes all domain capabilities through the Model Context Protocol (MCP), and implements a biologically inspired skills acquisition system in which recurring interaction patterns crystallize into reusable agent skills through a maturation lifecycle analogous to cell differentiation. Complementing these capabilities, the memory system incorporates consolidation mechanisms, including episodic pruning, semantic deduplication, and pattern extraction, designed for sub-linear growth under sustained interaction. A comprehensive 413-test suite validates protocol handler behavior and component integration across all five architectural layers, completing in under three seconds.
[333] From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents
Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, Shaowu Pan
Main category: cs.AI
TL;DR: Survey paper on LLM-based workflow optimization methods, focusing on agentic computation graphs (ACGs) and their structural design patterns.
Details
Motivation: LLM-based systems increasingly use executable workflows combining LLM calls, tools, code execution, etc. There's a need to organize and understand methods for designing and optimizing these workflows as agentic computation graphs.
Method: Survey methodology organizing literature based on when workflow structure is determined (static vs dynamic), what parts are optimized, and evaluation signals used. Introduces framework distinguishing workflow templates, realized graphs, and execution traces.
Result: Provides unified framework for positioning workflow optimization methods, clear vocabulary, comparable view of existing literature, and structure-aware evaluation perspective.
Conclusion: Offers systematic approach to understanding LLM workflow optimization, enabling better comparison of methods and more reproducible evaluation standards for future research.
Abstract: Large language model (LLM)-based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification. This survey reviews recent methods for designing and optimizing such workflows, which we treat as agentic computation graphs (ACGs). We organize the literature based on when workflow structure is determined, where structure refers to which components or agents are present, how they depend on each other, and how information flows between them. This lens distinguishes static methods, which fix a reusable workflow scaffold before deployment, from dynamic methods, which select, generate, or revise the workflow for a particular run before or during execution. We further organize prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization (e.g., task metrics, verifier signals, preferences, or trace-derived feedback). We also distinguish reusable workflow templates, run-specific realized graphs, and execution traces, separating reusable design choices from the structures actually deployed in a given run and from realized runtime behavior. Finally, we outline a structure-aware evaluation perspective that complements downstream task metrics with graph-level properties, execution cost, robustness, and structural variation across inputs. Our goal is to provide a clear vocabulary, a unified framework for positioning new methods, a more comparable view of the existing body of literature, and a more reproducible evaluation standard for future work in workflow optimization for LLM agents.
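The survey's three-way distinction (reusable template, run-specific realized graph, execution trace) can be made concrete with a toy data model; the class and field names below are my own invention, not the survey's notation:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowTemplate:
    """Reusable scaffold fixed before deployment (what static methods optimize)."""
    nodes: list  # component kinds, e.g. "llm_call", "tool", "verifier"
    edges: list  # dependency pairs (i, j): node j consumes node i's output

@dataclass
class RealizedGraph:
    """Run-specific instantiation (what dynamic methods select, generate, or revise)."""
    template: WorkflowTemplate
    bindings: dict  # node index -> concrete model or tool chosen for this run

@dataclass
class ExecutionTrace:
    """What actually happened at runtime; trace-derived feedback guides optimization."""
    graph: RealizedGraph
    events: list = field(default_factory=list)  # (node, output, cost) tuples

    def total_cost(self):
        return sum(cost for _, _, cost in self.events)

tmpl = WorkflowTemplate(nodes=["llm_call", "tool", "verifier"], edges=[(0, 1), (1, 2)])
run = RealizedGraph(tmpl, bindings={0: "gpt-small", 1: "search", 2: "unit-tests"})
trace = ExecutionTrace(run, events=[(0, "plan", 0.01), (1, "docs", 0.002), (2, "pass", 0.005)])
print(trace.total_cost())
```

Separating the three levels matters for evaluation: graph-level properties and structural variation live on the realized graph, while execution cost and robustness are properties of the trace.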
[334] Computational Arbitrage in AI Model Markets
Ricardo Olmedo, Bernhard Schölkopf, Moritz Hardt
Main category: cs.AI
TL;DR: Arbitrage in AI model markets involves buying inference from multiple providers at different costs and reselling solutions to customers at competitive prices, creating profit without developing models.
Details
Motivation: To study how arbitrage can emerge in AI model markets where providers sell query access to models with varying costs and capabilities, and to understand its economic implications for market competition, pricing, and model development.
Method: Empirical study of arbitrage viability through case study of SWE-bench GitHub issue resolution using GPT-5 mini and DeepSeek v3.2 models. Analysis of simple and robust arbitrage strategies, distillation effects, and market dynamics with multiple arbitrageurs.
Result: Simple arbitrage strategies achieve up to 40% net profit margins in verifiable domains. Robust strategies remain profitable across domains. Distillation creates strong arbitrage opportunities but may reduce teacher model revenue. Multiple arbitrageurs drive down consumer prices while enabling market entry for smaller providers.
Conclusion: Arbitrage is a viable and powerful force in AI model markets with significant implications for market structure, pricing, model development strategies, and deployment economics.
Abstract: Consider a market of competing model providers selling query access to models with varying costs and capabilities. Customers submit problem instances and are willing to pay up to a budget for a verifiable solution. An arbitrageur efficiently allocates inference budget across providers to undercut the market, thus creating a competitive offering with no model-development risk. In this work, we initiate the study of arbitrage in AI model markets, empirically demonstrating the viability of arbitrage and illustrating its economic consequences. We conduct an in-depth case study of SWE-bench GitHub issue resolution using two representative models, GPT-5 mini and DeepSeek v3.2. In this verifiable domain, simple arbitrage strategies generate net profit margins of up to 40%. Robust arbitrage strategies that generalize across different domains remain profitable. Distillation further creates strong arbitrage opportunities, potentially at the expense of the teacher model’s revenue. Multiple competing arbitrageurs drive down consumer prices, reducing the marginal revenue of model providers. At the same time, arbitrage reduces market segmentation and facilitates market entry for smaller model providers by enabling earlier revenue capture. Our results suggest that arbitrage can be a powerful force in AI model markets with implications for model development, distillation, and deployment.
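A minimal version of a cascade-style arbitrage strategy of the kind the abstract describes (try a cheap provider first, verify the answer, escalate on failure) can be sketched as follows. All costs, solve rates, and the customer price below are invented for illustration; the paper's case study uses GPT-5 mini and DeepSeek v3.2 with its own figures:

```python
# Hypothetical provider economics for illustration only.
providers = [
    {"name": "cheap-model", "cost": 0.01, "solve_rate": 0.40},
    {"name": "strong-model", "cost": 0.10, "solve_rate": 0.70},
]

def expected_cost_and_solve(chain):
    """Cascade: try providers in order, verify each answer, escalate on failure.
    Only meaningful in verifiable domains, where failures can be detected."""
    exp_cost, p_unsolved = 0.0, 1.0
    for p in chain:
        exp_cost += p_unsolved * p["cost"]   # pay a provider only if still unsolved
        p_unsolved *= (1 - p["solve_rate"])
    return exp_cost, 1 - p_unsolved

price = 0.15  # customer's willingness to pay for a verified solution (assumed)
cost, solve = expected_cost_and_solve(providers)
margin = (solve * price - cost) / (solve * price)  # net profit margin on revenue
print(round(cost, 4), round(solve, 2), round(margin, 2))
```

The arbitrageur's edge comes entirely from allocation: cheap attempts absorb the easy instances, so the expensive model is paid for only on the residual, letting the blended cost undercut any single provider's price.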
[335] Maximum Entropy Relaxation of Multi-Way Cardinality Constraints for Synthetic Population Generation
François Pachet, Jean-Daniel Zucker
Main category: cs.AI
TL;DR: Maximum-entropy approach for generating synthetic populations that satisfy multi-way frequency constraints on categorical attributes, outperforming raking methods on high-dimensional problems.
Details
Motivation: Generating synthetic populations from aggregate statistics is crucial for microsimulation, agent-based modeling, policy analysis, and privacy-preserving data release. Many applications require matching heterogeneous unary, binary, and ternary constraints from various sources, but existing methods scale poorly with increasing numbers and arity of constraints.
Method: Proposes a maximum-entropy relaxation where multi-way cardinality constraints are matched in expectation rather than exactly. This yields an exponential-family distribution over complete population assignments and transforms the problem into a convex optimization over Lagrange multipliers.
Result: Evaluated on NPORS-derived scaling benchmarks with 4 to 40 attributes. MaxEnt becomes increasingly advantageous as the number of attributes and ternary interactions grows, while raking remains competitive on smaller, lower-arity instances.
Conclusion: The maximum-entropy approach provides an effective solution for generating synthetic populations with complex multi-way constraints, particularly outperforming existing methods on high-dimensional problems with numerous overlapping constraints.
Abstract: Generating synthetic populations from aggregate statistics is a core component of microsimulation, agent-based modeling, policy analysis, and privacy-preserving data release. Beyond classical census marginals, many applications require matching heterogeneous unary, binary, and ternary constraints derived from surveys, expert knowledge, or automatically extracted descriptions. Constructing populations that satisfy such multi-way constraints simultaneously poses a significant computational challenge. We consider populations where each individual is described by categorical attributes and the target is a collection of global frequency constraints over attribute combinations. Exact formulations scale poorly as the number and arity of constraints increase, especially when the constraints are numerous and overlapping. Grounded in methods from statistical physics, we propose a maximum-entropy relaxation of this problem. Multi-way cardinality constraints are matched in expectation rather than exactly, yielding an exponential-family distribution over complete population assignments and a convex optimization problem over Lagrange multipliers. We evaluate the approach on NPORS-derived scaling benchmarks with 4 to 40 attributes and compare it primarily against generalized raking. The results show that MaxEnt becomes increasingly advantageous as the number of attributes and ternary interactions grows, while raking remains competitive on smaller, lower-arity instances.
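The relaxation itself, expectation-matched constraints via an exponential-family distribution and a convex dual over Lagrange multipliers, can be demonstrated on a toy two-attribute population. The constraint targets, learning rate, and iteration count below are illustrative assumptions, not the paper's benchmark settings:

```python
import math
from itertools import product

# Toy instance: two binary attributes, three frequency constraints
# (one unary per attribute, one binary interaction), matched in expectation.
cells = list(product([0, 1], repeat=2))           # all attribute combinations
features = [lambda x: x[0], lambda x: x[1], lambda x: x[0] * x[1]]
targets = [0.6, 0.5, 0.4]                          # desired expected frequencies

lam = [0.0, 0.0, 0.0]                              # Lagrange multipliers

def distribution(lam):
    """Exponential-family distribution over cells induced by the multipliers."""
    w = [math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in cells]
    z = sum(w)
    return [wi / z for wi in w]

# Gradient ascent on the (concave) dual: gradient = target - model expectation.
for _ in range(2000):
    p = distribution(lam)
    expect = [sum(pi * f(x) for pi, x in zip(p, cells)) for f in features]
    lam = [l + 0.5 * (t - e) for l, t, e in zip(lam, targets, expect)]

p = distribution(lam)
fitted = [sum(pi * f(x) for pi, x in zip(p, cells)) for f in features]
print([round(v, 3) for v in fitted])  # converges close to the targets
```

The same structure scales to the paper's setting: more attributes and higher-arity constraints add features and multipliers but leave the problem a convex optimization, which is what raking-style exact matching loses as constraints multiply and overlap.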
[336] AI Mental Models: Learned Intuition and Deliberation in a Bounded Neural Architecture
Laurence Anthony
Main category: cs.AI
TL;DR: A bounded neural architecture with separate intuition and deliberation pathways is tested on syllogistic reasoning, showing deliberation outperforms intuition and develops structured internal computation.
Details
Motivation: To investigate whether bounded neural architectures can exhibit meaningful division of labor between intuition and deliberation in reasoning tasks, and whether they can develop structured internal computation rather than just one-shot associative prediction.
Method: Two experiments: 1) Direct neural baseline for predicting human response distributions; 2) Bounded dual-path architecture with separate intuition and deliberation pathways, evaluated on 64-item syllogistic reasoning benchmark with cross-validation.
Result: Deliberation pathway (r=0.8152) significantly outperforms intuition pathway (r=0.7272). Deliberation develops sparse, differentiated internal structure with interpretable states, showing reasoning-like organization under bounded conditions.
Conclusion: Bounded neural architectures can develop structured internal computation resembling reasoning processes, with deliberation showing superior performance and interpretable internal organization compared to intuition.
Abstract: This paper asks whether a bounded neural architecture can exhibit a meaningful division of labor between intuition and deliberation on a classic 64-item syllogistic reasoning benchmark. More broadly, the benchmark is relevant to ongoing debates about world models and multi-stage reasoning in AI. It provides a controlled setting for testing whether a learned system can develop structured internal computation rather than only one-shot associative prediction. Experiment 1 evaluates a direct neural baseline for predicting full 9-way human response distributions under 5-fold cross-validation. Experiment 2 introduces a bounded dual-path architecture with separate intuition and deliberation pathways, motivated by computational mental-model theory (Khemlani & Johnson-Laird, 2022). Under cross-validation, bounded intuition reaches an aggregate correlation of r = 0.7272, whereas bounded deliberation reaches r = 0.8152, and the deliberation advantage is significant across folds (p = 0.0101). The largest held-out gains occur for NVC, Eca, and Oca, suggesting improved handling of rejection responses and c-a conclusions. A canonical 80:20 interpretability run and a five-seed stability sweep further indicate that the deliberation pathway develops sparse, differentiated internal structure, including an Oac-leaning state, a dominant workhorse state, and several weakly used or unused states whose exact indices vary across runs. These findings are consistent with reasoning-like internal organization under bounded conditions, while stopping short of any claim that the model reproduces full sequential processes of model construction, counterexample search, and conclusion revision.
[337] Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length
Jingxuan Chen, Mohammad Taher Pilehvar, Jose Camacho-Collados
Main category: cs.AI
TL;DR: LLMs show performance degradation and eventual collapse when processing multiple instances, with instance count having stronger effect than context length.
Details
Motivation: While LLMs perform well on individual tasks, there's little research on how they handle multi-instance inputs like analyzing multiple documents or reviews, which is common in real-world applications.
Method: Comprehensive evaluation of LLMs’ multi-instance processing (MIP) ability on tasks where they excel individually, analyzing performance across different instance counts.
Result: All LLMs show slight performance degradation for small instance counts (20-100) followed by performance collapse on larger counts. Instance count has stronger effect than context length on performance degradation.
Conclusion: When optimizing LLMs for multi-instance processing, attention should focus on both context length and especially instance count, as the latter has stronger impact on performance.
Abstract: Users often rely on Large Language Models (LLMs) for processing multiple documents or performing analysis over a number of instances. For example, analysing the overall sentiment of a number of movie reviews requires an LLM to process the sentiment of each review individually in order to provide a final aggregated answer. While LLM performance on such individual tasks is generally high, there has been little research on how LLMs perform when dealing with multi-instance inputs. In this paper, we perform a comprehensive evaluation of the multi-instance processing (MIP) ability of LLMs for tasks in which they excel individually. The results show that all LLMs follow a pattern of slight performance degradation for small numbers of instances (approximately 20-100), followed by a performance collapse on larger instance counts. Crucially, our analysis shows that while context length is associated with this degradation, the number of instances has a stronger effect on the final results. This finding suggests that when optimising LLM performance for MIP, attention should be paid to both context length and, in particular, instance count.
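The multi-instance setup the paper evaluates can be sketched as packing N instances of a task into one prompt and scoring answers per instance. `call_llm` below is a stub standing in for a real model call, and the prompt wording is my own:

```python
def build_mip_prompt(reviews):
    """Pack N instances of the same task (here, sentiment) into one prompt."""
    lines = ["Classify the sentiment of each review as positive or negative.",
             "Answer with one label per line, in order.", ""]
    lines += [f"{i + 1}. {r}" for i, r in enumerate(reviews)]
    return "\n".join(lines)

def call_llm(prompt):
    """Stub: always answers 'positive' once per listed review."""
    n = prompt.count("\n") - 2  # three header lines precede the reviews
    return "\n".join("positive" for _ in range(n))

def per_instance_accuracy(reviews, gold):
    labels = call_llm(build_mip_prompt(reviews)).splitlines()
    # A real model may return fewer labels than instances at large N
    # (the "collapse" regime); pad so missing answers count as wrong.
    labels += ["<missing>"] * (len(gold) - len(labels))
    return sum(l == g for l, g in zip(labels, gold)) / len(gold)

reviews = ["Loved it", "Terrible plot", "A masterpiece"]
print(per_instance_accuracy(reviews, ["positive", "negative", "positive"]))
```

Sweeping the number of packed instances with a real model, and separately padding the context with task-irrelevant text, is the kind of design that lets the paper disentangle the instance-count effect from the context-length effect.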
[338] Bridging the Know-Act Gap via Task-Level Autoregressive Reasoning
Jihyun Janice Ahn, Ryo Kamoi, Berk Atil, Renze Lou, WonWoo Kang, Heehyun Park, Sarkar Snigdha Sarathi Das, Zhuoyang Zou, Xiaoxin Lu, Yusen Zhang, Asfahan Shah, Ridwanul Hasan Tanvir, Lingxiao Zhao, Hongxi Huang, Vignesh Venkatesh, Dianjun Lin, Hamid Shah, Wentao Wang, Zhanpeng Song, Joshua Reed Bassin, Dax Patel, Ishan Appareddy Agrahar, Sahil Pardasani, Xin Dong, Fatemeh Rahbari, Benjamin David Rishel, Soochan Andrew Lee, Yuv Boghani, Ali B. AlNaseeb, Pranav Suby, Seokhyeon Bae, Shreya Buddharaju, Damien Kula, Soumyadeep Das, Hanyang Frank Liu, Faye Mo, Wenpeng Yin
Main category: cs.AI
TL;DR: DeIllusionLLM addresses the “know-act gap” in LLMs where models can identify flawed inputs discriminatively but still generate incorrect answers generatively, using task-level autoregression and self-distillation to unify discriminative judgment with generative reasoning.
Details
Motivation: LLMs often generate seemingly valid answers to flawed or ill-posed inputs despite having the discriminative ability to identify such issues, revealing a fundamental know-act gap between discriminative recognition and generative behavior that needs to be addressed.
Method: Proposes DeIllusionLLM, a task-level autoregressive framework that explicitly models the decision between validation vs. answering tasks. Uses self-distillation to unify discriminative judgment and generative reasoning within a single backbone, addressing the entanglement of task selection with content generation in standard token-level autoregression.
Result: DeIllusionLLM substantially reduces answer-despite-error failures under natural prompting while maintaining general reasoning performance, demonstrating that self-distillation is an effective and scalable solution for bridging the discriminative-generative know-act gap.
Conclusion: The paper presents a comprehensive analysis of the know-act gap using a new benchmark and proposes an effective solution through task-level autoregression and self-distillation, showing that discriminative and generative capabilities can be unified within a single model architecture.
Abstract: LLMs often generate seemingly valid answers to flawed or ill-posed inputs. This is not due to missing knowledge: under discriminative prompting, the same models can mostly identify such issues, yet fail to reflect this in standard generative responses. This reveals a fundamental know-act gap between discriminative recognition and generative behavior. Prior work largely characterizes this issue in narrow settings, such as math word problems or question answering, with limited focus on how to integrate these two modes. In this work, we present a comprehensive analysis using FaultyScience, a newly constructed large-scale, cross-disciplinary benchmark of faulty scientific questions. We show that the gap is pervasive and stems from token-level autoregression, which entangles task selection (validate vs. answer) with content generation, preventing discriminative knowledge from being utilized. To address this, we propose DeIllusionLLM, a task-level autoregressive framework that explicitly models this decision. Through self-distillation, the model unifies discriminative judgment and generative reasoning within a single backbone. Empirically, DeIllusionLLM substantially reduces answer-despite-error failures under natural prompting while maintaining general reasoning performance, demonstrating that self-distillation is an effective and scalable solution for bridging the discriminative-generative know-act gap.
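The task-level decision the paper argues for, choosing between the validation task and the answering task before generating any content, can be illustrated with stubs. The keyword-based validator below is purely for demonstration; DeIllusionLLM learns this judgment via self-distillation rather than hard-coding it:

```python
def discriminative_check(question):
    """Stub validator: flags one obviously flawed premise. In the paper this
    is the model's own discriminative knowledge, surfaced as a first step."""
    return "perpetual motion" in question.lower()

def answer(question):
    """Stub generator standing in for normal generative answering."""
    return f"Answer to: {question}"

def task_level_generate(question):
    # Step 1 (task selection): decide which task applies before generating.
    if discriminative_check(question):
        # Step 2a: the validation task wins; avoid answering-despite-error.
        return "The question rests on a flawed premise and has no valid answer."
    # Step 2b: the answering task wins; generate normally.
    return answer(question)

print(task_level_generate("Design a perpetual motion machine."))
print(task_level_generate("What is the boiling point of water at 1 atm?"))
```

The contrast with token-level autoregression is that here the validate-vs-answer decision is an explicit branch, so discriminative knowledge cannot be silently overridden by fluent content generation.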
[339] Toward Data Systems That Are Business Semantic Centric and AI Agents Assisted
Cecil Pang
Main category: cs.AI
TL;DR: BSDS is a business semantics-centric, AI agent-assisted data system that integrates architecture, workflows, and team organization to align data systems with business priorities rather than technical constraints.
Details
Motivation: Existing data platforms often fail by prioritizing tools over business alignment, causing inefficiencies and delays in dynamic business environments that require rapid adaptation.
Method: Proposes a modular architecture with curated business entity data, knowledge base for context-aware AI agents, and efficient pipelines, complemented by optimized workflows for both exploratory analysis and production requirements.
Result: Validated through real-world implementation, BSDS accelerates time-to-market for data initiatives, enhances cross-functional collaboration, and provides scalable blueprint for businesses of all sizes.
Conclusion: BSDS transforms data systems from passive tools into active drivers of organizational growth by aligning technical capabilities with business semantics through AI agent assistance.
Abstract: Contemporary businesses operate in dynamic environments requiring rapid adaptation to achieve goals and maintain competitiveness. Existing data platforms often fall short by emphasizing tools over alignment with business needs, resulting in inefficiencies and delays. To address this gap, I propose the Business Semantics Centric, AI Agents Assisted Data System (BSDS), a holistic system that integrates architecture, workflows, and team organization to ensure data systems are tailored to business priorities rather than dictated by technical constraints. BSDS redefines data systems as dynamic enablers of business success, transforming them from passive tools into active drivers of organizational growth. BSDS has a modular architecture that comprises curated data linked to business entities, a knowledge base for context-aware AI agents, and efficient data pipelines. AI agents play a pivotal role in assisting with data access and system management, reducing human effort, and improving scalability. Complementing this architecture, BSDS incorporates workflows optimized for both exploratory data analysis and production requirements, balancing speed of delivery with quality assurance. A key innovation of BSDS is its incorporation of the human factor. By aligning data team expertise with business semantics, BSDS bridges the gap between technical capabilities and business needs. Validated through real-world implementation, BSDS accelerates time-to-market for data-driven initiatives, enhances cross-functional collaboration, and provides a scalable blueprint for businesses of all sizes. Future research can build on BSDS to explore optimization strategies using complex systems and adaptive network theories, as well as developing autonomous data systems leveraging AI agents.
[340] Graph-Aware Late Chunking for Retrieval-Augmented Generation in Biomedical Literature
Pouria Mortezaagha, Arya Rahgozar
Main category: cs.AI
TL;DR: GraLC-RAG introduces structure-aware retrieval for biomedical literature RAG, showing standard metrics undervalue multi-section evidence retrieval needed for comprehensive document understanding.
Details
Motivation: Current biomedical RAG systems are evaluated with ranking metrics like MRR that focus on retrieving the single most relevant chunk, ignoring the need to retrieve evidence from across different structural sections of scientific documents for comprehensive understanding.
Method: Proposes the GraLC-RAG framework with late chunking and graph-aware structural intelligence, including structure-aware chunk boundary detection, UMLS knowledge graph infusion, and graph-guided hybrid retrieval. Evaluates six strategies on 2,359 IMRaD PubMed Central articles using 2,033 cross-section questions with both standard ranking metrics and new structural coverage metrics.
Result: Content-similarity methods achieve highest MRR (0.517) but retrieve from single sections only, while structure-aware methods retrieve from up to 15.6x more sections. KG-infused retrieval narrows answer-quality gap to delta-F1 = 0.009 while maintaining 4.6x section diversity.
Conclusion: Standard metrics systematically undervalue structural retrieval, and closing the multi-section synthesis gap is a key open problem for biomedical RAG. Structure-aware approaches are essential for comprehensive document understanding.
Abstract: Retrieval-Augmented Generation (RAG) systems for biomedical literature are typically evaluated using ranking metrics like Mean Reciprocal Rank (MRR), which measure how well the system identifies the single most relevant chunk. We argue that for full-text scientific documents, this paradigm is incomplete: it rewards retrieval precision while ignoring retrieval breadth – the ability to surface evidence from across a document’s structural sections. We propose GraLC-RAG, a framework that unifies late chunking with graph-aware structural intelligence, introducing structure-aware chunk boundary detection, UMLS knowledge graph infusion, and graph-guided hybrid retrieval. We evaluate six strategies on 2,359 IMRaD-filtered PubMed Central articles using 2,033 cross-section questions and two metric families: standard ranking metrics (MRR, Recall@k) and structural coverage metrics (SecCov@k, CS Recall). Our results expose a sharp divergence: content-similarity methods achieve the highest MRR (0.517) but always retrieve from a single section, while structure-aware methods retrieve from up to 15.6x more sections. Generation experiments show that KG-infused retrieval narrows the answer-quality gap to delta-F1 = 0.009 while maintaining 4.6x section diversity. These findings demonstrate that standard metrics systematically undervalue structural retrieval and that closing the multi-section synthesis gap is a key open problem for biomedical RAG.
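The precision-vs-breadth divergence the paper measures is easy to make concrete. Below is a minimal sketch of MRR and a SecCov@k-style section-coverage count; the implementations are illustrative, not the authors' code.

```python
def mrr(ranked_lists):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant chunk."""
    total = 0.0
    for ranked in ranked_lists:
        for rank, is_relevant in enumerate(ranked, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def sec_cov_at_k(retrieved_sections, k):
    """SecCov@k-style breadth: distinct sections among the top-k chunks."""
    return len(set(retrieved_sections[:k]))

queries = [
    [False, True, False],  # first relevant chunk at rank 2 -> RR = 0.5
    [True, False, False],  # rank 1 -> RR = 1.0
]
print(mrr(queries))  # 0.75
print(sec_cov_at_k(["intro", "methods", "methods", "results"], 3))  # 2
```

A retriever can score a high MRR while `sec_cov_at_k` stays at 1, which is exactly the failure mode the structural coverage metrics are designed to expose.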
[341] Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies
Siddhant Kulkarni, Yukta Kulkarni
Main category: cs.AI
TL;DR: Systematic benchmark comparing four multi-agent LLM architectures for financial document extraction, evaluating performance, cost, and scalability across 10,000 SEC filings.
Details
Motivation: Production deployments of LLMs for financial document extraction face architectural decisions with limited empirical guidance, needing systematic comparison of multi-agent orchestration approaches.
Method: Benchmarked four architectures (sequential pipeline, parallel fan-out, hierarchical supervisor-worker, reflexive self-correcting loop) across five LLMs on 10,000 SEC filings, evaluating 25 extraction field types across five metrics: F1, accuracy, latency, cost, and token efficiency.
Result: Reflexive architectures achieved highest F1 (0.943) but at 2.3x cost of sequential baselines; hierarchical architectures offered best cost-accuracy tradeoff (F1 0.921 at 1.4x cost); hybrid configurations recovered 89% of accuracy gains at only 1.15x baseline cost.
Conclusion: Provides actionable guidance for multi-agent LLM deployments in regulated financial environments, with hierarchical architectures offering favorable cost-accuracy tradeoffs and hybrid approaches enabling significant efficiency gains.
Abstract: The adoption of large language models (LLMs) for structured information extraction from financial documents has accelerated rapidly, yet production deployments face fundamental architectural decisions with limited empirical guidance. We present a systematic benchmark comparing four multi-agent orchestration architectures: sequential pipeline, parallel fan-out with merge, hierarchical supervisor-worker and reflexive self-correcting loop. These are evaluated across five frontier and open-weight LLMs on a corpus of 10,000 SEC filings (10-K, 10-Q and 8-K forms). Our evaluation spans 25 extraction field types covering governance structures, executive compensation and financial metrics, measured along five axes: field-level F1, document-level accuracy, end-to-end latency, cost per document and token efficiency. We find that reflexive architectures achieve the highest field-level F1 (0.943) but at 2.3x the cost of sequential baselines, while hierarchical architectures occupy the most favorable position on the cost-accuracy Pareto frontier (F1 0.921 at 1.4x cost). We further present ablation studies on semantic caching, model routing and adaptive retry strategies, demonstrating that hybrid configurations can recover 89% of the reflexive architecture’s accuracy gains at only 1.15x baseline cost. Our scaling analysis from 1K to 100K documents per day reveals non-obvious throughput-accuracy degradation curves that inform capacity planning. These findings provide actionable guidance for practitioners deploying multi-agent LLM systems in regulated financial environments.
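The difference between the sequential-pipeline and parallel fan-out-with-merge patterns can be sketched with a stub extraction call; `extract_field` and the field names are illustrative stand-ins, not the paper's implementation.

```python
import concurrent.futures as cf
import time

def extract_field(field):
    """Stand-in for one LLM extraction call (name and delay illustrative)."""
    time.sleep(0.01)
    return field, f"value_of_{field}"

fields = ["ceo_name", "total_revenue", "audit_firm", "fiscal_year_end"]

# Sequential pipeline: one extraction call after another.
seq = dict(extract_field(f) for f in fields)

# Parallel fan-out with merge: independent fields extracted concurrently,
# then merged into one record.
with cf.ThreadPoolExecutor() as pool:
    par = dict(pool.map(extract_field, fields))

assert seq == par  # same output; the patterns differ in latency and cost
```

For independent fields the two patterns produce identical records, which is why the paper's comparison turns on latency, cost per document, and error-recovery behavior rather than raw output.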
[342] Towards Intelligent Geospatial Data Discovery: a knowledge graph-driven multi-agent framework powered by large language models
Ruixiang Liu, Zhenlong Li, Ali Khosravi Kazazi
Main category: cs.AI
TL;DR: A knowledge graph-driven multi-agent framework using LLMs for intelligent geospatial data discovery, addressing semantic heterogeneity in distributed geospatial data ecosystems.
Details
Motivation: Geospatial data ecosystems are highly distributed, heterogeneous, and semantically inconsistent, with existing keyword-based search systems failing to capture user intent and providing weak retrieval performance.
Method: Proposes a knowledge graph-driven multi-agent framework powered by LLMs, featuring: 1) unified geospatial metadata ontology as semantic mediation layer, 2) geospatial metadata knowledge graph modeling datasets and relationships, 3) multi-agent collaborative architecture for intent parsing, knowledge graph retrieval, and answer synthesis.
Result: Framework substantially improves intent matching accuracy, ranking quality, recall, and discovery transparency compared to traditional systems, as shown in representative use cases and performance evaluation.
Conclusion: Advances geospatial data discovery toward semantic, intent-aware, intelligent paradigm, providing foundation for next-generation intelligent spatial data infrastructures and contributing to Autonomous GIS vision.
Abstract: The rapid growth in the volume, variety, and velocity of geospatial data has created data ecosystems that are highly distributed, heterogeneous, and semantically inconsistent. Existing data catalogs, portals, and infrastructures still rely largely on keyword-based search with limited semantic support, which often fails to capture user intent and leads to weak retrieval performance. To address these challenges, this study proposes a knowledge graph-driven multi-agent framework for intelligent geospatial data discovery, powered by large language models. The framework introduces a unified geospatial metadata ontology as a semantic mediation layer to align heterogeneous metadata standards across platforms and constructs a geospatial metadata knowledge graph to explicitly model datasets and their multidimensional relationships. Building on the structured representation, the framework adopts a multi-agent collaborative architecture to perform intent parsing, knowledge graph retrieval, and answer synthesis, forming an interpretable and closed-loop discovery process from user queries to results. Results from representative use cases and performance evaluation show that the framework substantially improves intent matching accuracy, ranking quality, recall, and discovery transparency compared with traditional systems. This study advances geospatial data discovery toward a more semantic, intent-aware, and intelligent paradigm, providing a practical foundation for next-generation intelligent and autonomous spatial data infrastructures and contributing to the broader vision of Autonomous GIS.
[343] HyFI: Hyperbolic Feature Interpolation for Brain-Vision Alignment
Sangmin Jo, Wootaek Jeong, Da-Woon Heo, Yoohwan Hwang, Heung-Il Suk
Main category: cs.AI
TL;DR: HyFI uses hyperbolic space to interpolate between semantic and perceptual visual features for better brain-to-image alignment, achieving SOTA in zero-shot retrieval.
Details
Motivation: Prior brain decoding methods fail to address the modality gap between brain signals and images, and the entanglement of semantic/perceptual features in neural activity.
Method: Proposes Hyperbolic Feature Interpolation (HyFI) framework that interpolates between semantic and perceptual visual features along hyperbolic geodesics to fuse/compress information.
Result: Achieves state-of-the-art performance in zero-shot brain-to-image retrieval, with Top-1 accuracy improvements of up to +17.3% on THINGS-EEG and +9.1% on THINGS-MEG.
Conclusion: Hyperbolic space effectively addresses modality gap and feature entanglement in brain decoding, enabling better brain-to-image alignment through feature interpolation.
Abstract: Recent progress in artificial intelligence has encouraged numerous attempts to understand and decode human visual system from brain signals. These prior works typically align neural activity independently with semantic and perceptual features extracted from images using pre-trained vision models. However, they fail to account for two key challenges: (1) the modality gap arising from the natural difference in the information level of representation between brain signals and images, and (2) the fact that semantic and perceptual features are highly entangled within neural activity. To address these issues, we utilize hyperbolic space, which is well-suited for considering differences in the amount of information and has the geometric property that geodesics between two points naturally bend toward the origin, where the representational capacity is lower. Leveraging these properties, we propose a novel framework, Hyperbolic Feature Interpolation (HyFI), which interpolates between semantic and perceptual visual features along hyperbolic geodesics. This enables both the fusion and compression of perceptual and semantic information, effectively reflecting the limited expressiveness of brain signals and the entangled nature of these features. As a result, it facilitates better alignment between brain and visual features. We demonstrate that HyFI achieves state-of-the-art performance in zero-shot brain-to-image retrieval, outperforming prior methods with Top-1 accuracy improvements of up to +17.3% on THINGS-EEG and +9.1% on THINGS-MEG.
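Geodesic interpolation of the kind HyFI relies on can be sketched in the Poincaré ball model (curvature -1), where geodesics between two points bend toward the origin as the abstract describes. This pure-Python version is an illustration of that geometry, not the authors' implementation, and the feature vectors are toy stand-ins.

```python
import math

def mobius_add(x, y):
    """Mobius addition in the Poincare ball (curvature -1)."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1 + 2 * xy + x2 * y2
    return [((1 + 2 * xy + y2) * a + (1 - x2) * b) / denom
            for a, b in zip(x, y)]

def geodesic(x, y, t):
    """Point at fraction t along the hyperbolic geodesic from x to y."""
    v = mobius_add([-a for a in x], y)      # "direction" from x to y
    nv = math.sqrt(sum(a * a for a in v))
    if nv == 0:
        return list(x)
    s = math.tanh(t * math.atanh(nv)) / nv  # hyperbolic reparameterisation
    return mobius_add(x, [s * a for a in v])

sem = [0.6, 0.0]  # toy "semantic" feature inside the unit ball
per = [0.0, 0.4]  # toy "perceptual" feature
mid = geodesic(sem, per, 0.5)  # fused point, pulled toward the origin
```

At t = 0 and t = 1 the curve recovers the endpoints exactly; intermediate points sit closer to the origin than a Euclidean average would, which is the lower-capacity region the paper exploits for brain-signal alignment.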
[344] Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
Abhishek Chandwani, Ishan Gupta
Main category: cs.AI
TL;DR: LH-Bench: A three-pillar evaluation framework for scoring autonomous, long-horizon execution on subjective enterprise tasks, moving beyond binary correctness to assess multi-tool workflows with expert-grounded rubrics and human preference validation.
Details
Motivation: Current LLM evaluation focuses on objectively verifiable tasks (math, programming) with binary correctness, but real enterprise work is subjective, context-dependent, and involves multi-tool workflows where success depends on organizational goals, user intent, and intermediate artifact quality.
Method: Three-pillar design: (1) expert-grounded rubrics giving LLM judges domain context to score subjective work, (2) curated ground-truth artifacts enabling stepwise reward signals (e.g., chapter-level annotation), (3) pairwise human preference evaluation for convergent validation. Tested on Figma-to-code (33 real .fig tasks) and programmatic content (41 courses, 183 chapters).
Result: Domain-authored rubrics provide substantially more reliable evaluation signals than LLM-authored rubrics (kappa = 0.60 vs. 0.46). Human preference judgments confirm the same top-tier separation (p < 0.05), showing expert-grounded evaluation can scale without sacrificing reliability.
Conclusion: LH-Bench enables reliable evaluation of autonomous, long-horizon execution on subjective enterprise tasks by combining expert-grounded rubrics, curated artifacts, and human validation, addressing the gap between traditional LLM evaluation and real-world enterprise work requirements.
Abstract: Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a single correct answer. In contrast, real-world enterprise work is often subjective and context-dependent: success hinges on organizational goals, user intent, and the quality of intermediate artifacts produced across long, multi-tool workflows. We introduce LH-Bench, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. The pillars are: (i) expert-grounded rubrics that give LLM judges the domain context needed to score subjective work, (ii) curated ground-truth artifacts that enable stepwise reward signals (e.g., chapter-level annotation for content tasks), and (iii) pairwise human preference evaluation for convergent validation. We show that domain-authored rubrics provide substantially more reliable evaluation signals than LLM-authored rubrics (kappa = 0.60 vs. 0.46), and that human preference judgments confirm the same top-tier separation (p < 0.05), evidence that expert-grounded evaluation can scale without sacrificing reliability. We release public datasets and report results on two environments: Figma-to-code (33 real .fig tasks against the Figma API via MCP) and Programmatic content (41 courses comprising 183 individually-evaluated chapters on a course platform serving 30+ daily users).
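Inter-rater agreement of the kind reported above (kappa = 0.60 vs. 0.46) is standardly measured with Cohen's kappa; a small unweighted implementation on toy judge/expert labels, not the paper's evaluation harness:

```python
def cohens_kappa(a, b):
    """Unweighted Cohen's kappa between two label sequences."""
    assert len(a) == len(b)
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n)
              for l in labels)                   # chance agreement
    return (p_o - p_e) / (1 - p_e)

judge  = ["pass", "pass", "fail", "pass", "fail", "fail"]
expert = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(judge, expert), 3))  # 0.333
```

Because kappa discounts chance agreement, it separates rubric quality more sharply than raw accuracy: here the raters agree on 4 of 6 items (67%), yet kappa is only 0.333.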
[345] CLiGNet: Clinical Label-Interaction Graph Network for Medical Specialty Classification from Clinical Transcriptions
Pronob Kumar Barman, Pronoy Kumar Barman
Main category: cs.AI
TL;DR: CLiGNet: A neural architecture combining Bio ClinicalBERT with Graph Convolutional Networks on specialty label graphs for medical transcription classification, addressing data leakage issues in prior benchmarks.
Details
Motivation: Prior work on medical transcription classification suffers from severe data leakage due to methodological flaws in benchmark construction. The true task difficulty is substantially higher than reported, and there's a need for better models to handle extreme class imbalance across 40 medical specialties.
Method: CLiGNet combines a Bio ClinicalBERT text encoder with a two-layer Graph Convolutional Network operating on a specialty label graph constructed from semantic similarity and ICD-10 chapter priors. Uses per-label attention gates to fuse document and label graph representations, trained with focal binary cross-entropy loss to handle extreme class imbalance.
Result: CLiGNet achieves the highest macro F1 of 0.279 across seven baselines. The GCN label graph provides the largest component gain (increase of 0.066 macro F1). Adding calibration yields an expected calibration error of 0.007, demonstrating good probability reliability.
Conclusion: CLiGNet effectively addresses data leakage issues and improves medical transcription classification through label graph integration, providing a principled approach for handling extreme class imbalance while maintaining probability calibration.
Abstract: Automated classification of clinical transcriptions into medical specialties is essential for routing, coding, and clinical decision support, yet prior work on the widely used MTSamples benchmark suffers from severe data leakage caused by applying SMOTE oversampling before train-test splitting. We first document this methodological flaw and establish a leakage-free benchmark across 40 medical specialties (4,966 records), revealing that the true task difficulty is substantially higher than previously reported. We then introduce CLiGNet (Clinical Label-Interaction Graph Network), a neural architecture that combines a Bio ClinicalBERT text encoder with a two-layer Graph Convolutional Network operating on a specialty label graph constructed from semantic similarity and ICD-10 chapter priors. Per-label attention gates fuse document and label graph representations, trained with focal binary cross-entropy loss to handle extreme class imbalance (181:1 ratio). Across seven baselines ranging from TF-IDF classifiers to Clinical Longformer, CLiGNet without calibration achieves the highest macro F1 of 0.279, with an ablation study confirming that the GCN label graph provides the single largest component gain (increase of 0.066 macro F1). Adding per-label Platt scaling calibration yields an expected calibration error of 0.007, demonstrating a principled trade-off between ranking performance and probability reliability. We provide comprehensive failure analysis covering pairwise specialty confusions, rare class behaviour, document length effects, and token-level Integrated Gradients attribution, offering actionable insights for clinical NLP system deployment.
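The focal binary cross-entropy used against extreme imbalance down-weights well-classified examples so that rare specialties dominate the gradient. A single-probability sketch; the gamma and alpha values are common defaults from the focal-loss literature, not necessarily the paper's settings.

```python
import math

def focal_bce(p, y, gamma=2.0, alpha=0.25):
    """Focal binary cross-entropy for one predicted probability p, target y."""
    pt = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -alpha * (1.0 - pt) ** gamma * math.log(pt)

# The (1 - pt)^gamma factor nearly zeroes out easy examples while leaving
# hard misses with a large loss, so rare-class errors dominate training.
easy = focal_bce(0.95, 1)  # confident correct positive
hard = focal_bce(0.10, 1)  # badly missed positive
print(hard > easy)  # True
```

With gamma = 0 and alpha = 1 this reduces to ordinary binary cross-entropy, which is a useful sanity check when tuning the two knobs.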
[346] Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
Dubai Li, Yuxiang He, Yan Hu, Yu Tian, Jingsong Li
Main category: cs.AI
TL;DR: RWE-bench: A benchmark for evaluating LLM agents on end-to-end observational study execution in real-world databases, requiring generation of structured evidence bundles rather than isolated answers.
Details
Motivation: Observational studies can provide clinically actionable evidence at scale, but executing them on real-world databases requires coherent decisions across multiple steps. Existing LLM agent evaluations focus on isolated steps or single answers, missing the integrity and internal structure of complete evidence bundles needed for real-world evidence generation.
Method: Introduces RWE-bench benchmark grounded in MIMIC-IV database and derived from peer-reviewed observational studies. Each task provides study protocols as reference standards, requiring agents to execute experiments in real databases and iteratively generate tree-structured evidence bundles. Evaluates six LLMs (three open-source, three closed-source) under three agent scaffolds using both question-level correctness and end-to-end task metrics.
Result: Across 162 tasks, task success is low: best agent reaches 39.9%, best open-source model reaches 30.4%. Agent scaffolds cause over 30% variation in performance metrics. Automated cohort evaluation method helps localize errors and identify agent failure modes.
Conclusion: Results highlight persistent limitations in agents’ ability to produce end-to-end evidence bundles for observational studies. Efficient validation remains an important direction for future work in real-world evidence generation using LLM agents.
Abstract: Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open-source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation in performance metrics. Furthermore, we implement an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. Overall, the results highlight persistent limitations in agents’ ability to produce end-to-end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at https://github.com/somewordstoolate/RWE-bench.
[347] AgriPestDatabase-v1.0: A Structured Insect Dataset for Training Agricultural Large Language Model
Yagizhan Bilal Durak, Ahsan Ul Islam, Shahidul Islam, Ashley Morgan-Olvera, Iftekhar Ibne Basith, Syed Hasib Akhter Faruqui
Main category: cs.AI
TL;DR: Fine-tuning lightweight LLMs (≤7B parameters) for agricultural pest management using expert-curated Q/A pairs, with Mistral 7B achieving best performance for edge deployment.
Details
Motivation: Address the need for accessible agricultural pest management tools in rural areas with limited internet connectivity, leveraging AI/LLMs to provide practical decision support directly to farmers.
Method: 1) Created structured insect information dataset from pest databases and manuscripts, validated by domain experts; 2) Generated Q/A pairs for training/evaluation; 3) Applied LoRA-based fine-tuning to multiple lightweight LLMs (Mistral 7B, Qwen 2.5 7B, LLaMA 3.1 8B).
Result: Mistral 7B achieved 88.9% pass rate on domain-specific Q/A, outperforming Qwen 2.5 7B (63.9%) and LLaMA 3.1 8B (58.7%). Mistral showed higher semantic alignment (embedding similarity: 0.865) despite lower lexical overlap (BLEU: 0.097).
Conclusion: Demonstrates feasibility of deploying compact, high-performing language models for practical field-level pest management guidance through expert-organized data, structured Q/A pairs, semantic quality control, and efficient model adaptation.
Abstract: Agricultural pest management increasingly relies on timely and accurate access to expert knowledge, yet high-quality labeled data and continuous expert support remain limited, particularly for farmers operating in rural regions with unstable/no internet connectivity. At the same time, the rapid growth of AI and LLMs has created new opportunities to deliver practical decision support tools directly to end users in agriculture through compact and deployable systems. This work addresses (i) generating a structured insect information dataset, and (ii) adapting a lightweight LLM (≤7B) by fine-tuning it for edge device uses in agricultural pest management. The textual data collection was done by reviewing and collecting information from available pest databases and published manuscripts on nine selected pest species. These structured reports were then reviewed and validated by a domain expert. From these reports, we constructed Q/A pairs to support model training and evaluation. A LoRA-based fine-tuning approach was applied to multiple lightweight LLMs and evaluated. Initial evaluation shows that Mistral 7B achieves an 88.9% pass rate on the domain-specific Q/A task, substantially outperforming Qwen 2.5 7B (63.9%), and LLaMA 3.1 8B (58.7%). Notably, Mistral demonstrates higher semantic alignment (embedding similarity: 0.865) despite lower lexical overlap (BLEU: 0.097), indicating that semantic understanding and robust reasoning are more predictive of task success than surface-level conformity in specialized domains. By combining expert organized data, well-structured Q/A pairs, semantic quality control, and efficient model adaptation, this work contributes towards providing support for farmer facing agricultural decision support tools and demonstrates the feasibility of deploying compact, high-performing language models for practical field-level pest management guidance.
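LoRA replaces full fine-tuning with a trainable low-rank update ΔW = (α/r)·BA added to a frozen weight matrix, which is what makes 7B-scale adaptation cheap enough for edge-oriented work like this. A toy pure-Python sketch of the parameterisation; the sizes, scaling, and initialisation follow the usual LoRA recipe but are illustrative only.

```python
import random

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

d, r = 4, 2     # hidden size and LoRA rank (tiny toy values)
alpha = 8.0     # LoRA scaling factor
random.seed(0)

W = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]  # frozen
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # trained
B = [[0.0 for _ in range(r)] for _ in range(d)]                   # init to 0

delta = matmul(B, A)  # d x d low-rank update, rank <= r
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
         for i in range(d)]
```

Because B starts at zero, the effective weights equal the frozen base weights before any training step, so the adapted model begins exactly at the pretrained checkpoint; only the d·r entries of A and B (not the d·d base matrix) are ever updated.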
[348] ABSTRAL: Automatic Design of Multi-Agent Systems Through Iterative Refinement and Topology Optimization
Weijia Song, Jiashu Yue, Zhe Pang
Main category: cs.AI
TL;DR: ABSTRAL is a framework that treats multi-agent system architecture as an evolving natural-language document refined through contrastive trace analysis, enabling inspectable, revisable, and transferable design knowledge.
Details
Motivation: To create a framework for multi-agent system design that captures design knowledge in an inspectable, revisable, and transferable form, addressing the need for systematic MAS architecture development.
Method: Treats MAS architecture as evolving natural-language documents refined through contrastive trace analysis, measuring multi-agent coordination tax, enabling knowledge transfer of topology reasoning and role templates, and discovering specialist roles.
Result: Achieves 70% validation / 65.96% test pass rate on SOPBench with GPT-4o backbone; measures multi-agent coordination tax showing 26% turn efficiency; demonstrates design knowledge transfer matching coldstart iteration 3 performance in one iteration; discovers specialist roles absent from initial designs.
Conclusion: ABSTRAL successfully captures MAS design knowledge in inspectable documents, enables knowledge transfer across domains, and discovers emergent specialist roles through contrastive trace analysis, providing a framework for systematic multi-agent system development.
Abstract: How should multi-agent systems be designed, and can that design knowledge be captured in a form that is inspectable, revisable, and transferable? We introduce ABSTRAL, a framework that treats MAS architecture as an evolving natural-language document, an artifact refined through contrastive trace analysis. Three findings emerge. First, we provide a precise measurement of the multi-agent coordination tax: under fixed turn budgets, ensembles achieve only 26% turn efficiency, with 66% of tasks exhausting the limit, yet still improve over single-agent baselines by discovering parallelizable task decompositions. Second, design knowledge encoded in documents transfers: topology reasoning and role templates learned on one domain provide a head start on new domains, with transferred seeds matching coldstart iteration 3 performance in a single iteration. Third, contrastive trace analysis discovers specialist roles absent from any initial design, a capability no prior system demonstrates. On SOPBench (134 bank tasks, deterministic oracle), ABSTRAL reaches 70% validation / 65.96% test pass rate with a GPT-4o backbone. We release the converged documents as inspectable design rationale.
[349] Reliable Classroom AI via Neuro-Symbolic Multimodal Reasoning
Sina Bagheri Nezhad
Main category: cs.AI
TL;DR: NSCR is a neuro-symbolic framework for classroom AI that decomposes analytics into four layers (perceptual grounding, symbolic abstraction, executable reasoning, governance) to enable verifiable, interpretable multimodal analysis of educational settings with explicit deployment guardrails.
Details
Motivation: Classroom AI is expanding beyond basic perception to higher-level judgments about engagement, confusion, collaboration, and instructional quality, but classrooms present unique challenges: multi-party, noisy, privacy-sensitive, pedagogically diverse, and often multilingual environments. Raw predictive accuracy is insufficient without verifiable evidence, calibrated uncertainty, and explicit deployment guardrails.
Method: NSCR (neuro-symbolic classroom reasoning) framework with four layers: 1) perceptual grounding (processing video, audio, ASR, metadata), 2) symbolic abstraction (converting observations to typed facts), 3) executable reasoning (composing facts with rules, programs, policy constraints), and 4) governance (deployment guardrails). Adapts symbolic fact extraction and verifiable code generation to multimodal educational settings.
Result: The paper presents a framework and evaluation agenda rather than empirical results. It contributes a benchmark organized around five tasks: classroom state inference, discourse-grounded event linking, temporal early warning, collaboration analysis, and multilingual classroom reasoning, with reliability metrics for abstention, calibration, robustness, construct alignment, and human usefulness.
Conclusion: Classroom AI should be treated as a critical domain requiring interpretable, privacy-aware, and pedagogically grounded multimodal AI. NSCR provides a concrete framework to support verifiable evidence, calibrated uncertainty, and explicit deployment guardrails for educational applications.
Abstract: Classroom AI is rapidly expanding from low-level perception toward higher-level judgments about engagement, confusion, collaboration, and instructional quality. Yet classrooms are among the hardest real-world settings for multimodal vision: they are multi-party, noisy, privacy-sensitive, pedagogically diverse, and often multilingual. In this paper, we argue that classroom AI should be treated as a critical domain, where raw predictive accuracy is insufficient unless predictions are accompanied by verifiable evidence, calibrated uncertainty, and explicit deployment guardrails. We introduce NSCR, a neuro-symbolic framework that decomposes classroom analytics into four layers: perceptual grounding, symbolic abstraction, executable reasoning, and governance. NSCR adapts recent ideas from symbolic fact extraction and verifiable code generation to multimodal educational settings, enabling classroom observations from video, audio, ASR, and contextual metadata to be converted into typed facts and then composed by executable rules, programs, and policy constraints. Beyond the system design, we contribute a benchmark and evaluation protocol organized around five tasks: classroom state inference, discourse-grounded event linking, temporal early warning, collaboration analysis, and multilingual classroom reasoning. We further specify reliability metrics centered on abstention, calibration, robustness, construct alignment, and human usefulness. The paper does not report new empirical results; its contribution is a concrete framework and evaluation agenda intended to support more interpretable, privacy-aware, and pedagogically grounded multimodal AI for classrooms.
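The typed-facts-plus-executable-rules pattern at the core of NSCR can be sketched in a few lines; the fact schema and the rule below are invented for illustration and are not from the paper.

```python
# Typed facts produced by the symbolic-abstraction layer (schema assumed):
# each fact has a type, an associated speaker, and a timestamp in seconds.
facts = [
    {"type": "speech_turn",   "speaker": "student_3", "t": 41.2},
    {"type": "gaze_off_task", "speaker": "student_3", "t": 43.0},
    {"type": "speech_turn",   "speaker": "teacher",   "t": 44.1},
]

def rule_disengagement(facts, window=5.0):
    """Executable rule: flag a speaker whose off-task gaze occurs within
    `window` seconds after their own speech turn."""
    flags = []
    for f in facts:
        if f["type"] != "gaze_off_task":
            continue
        if any(g["type"] == "speech_turn"
               and g["speaker"] == f["speaker"]
               and 0 <= f["t"] - g["t"] <= window
               for g in facts):
            flags.append(f["speaker"])
    return flags

print(rule_disengagement(facts))  # ['student_3']
```

Because the rule operates on typed facts rather than raw pixels or audio, its verdict comes with inspectable evidence (the matching facts), which is the verifiability property the framework argues classroom AI needs.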
[350] Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts
Xianwei Cao, Dou Quan, Zhenliang Zhang, Shuang Wang
Main category: cs.AI
TL;DR: Dynamic Preference Inference (DPI) enables RL agents to adapt to changing preference weights by maintaining probabilistic beliefs about latent objectives and conditioning policies on inferred preferences.
Details
Motivation: Humans dynamically adjust priorities based on context, but most RL methods assume static preferences. The paper addresses sequential decision-making with unobserved, drifting preference weights.
Method: Proposes DPI framework with variational preference inference module trained jointly with preference-conditioned actor-critic, using vector-valued returns as evidence about latent trade-offs.
Result: In queueing, maze, and multi-objective continuous-control environments with event-driven objective changes, DPI adapts to new regimes and outperforms fixed-weight and heuristic envelope baselines.
Conclusion: DPI provides a cognitively inspired approach for adaptive decision-making with drifting preferences, enabling better performance in dynamic multi-objective environments.
Abstract: Humans often juggle multiple, sometimes conflicting objectives and shift their priorities as circumstances change, rather than following a fixed objective function. In contrast, most computational decision-making and multi-objective RL methods assume static preference weights or a known scalar reward. In this work, we study the sequential decision-making problem in which these preference weights are unobserved latent variables that drift with context. Specifically, we propose Dynamic Preference Inference (DPI), a cognitively inspired framework in which an agent maintains a probabilistic belief over preference weights, updates this belief from recent interaction, and conditions its policy on inferred preferences. We instantiate DPI as a variational preference inference module trained jointly with a preference-conditioned actor-critic, using vector-valued returns as evidence about latent trade-offs. In queueing, maze, and multi-objective continuous-control environments with event-driven changes in objectives, DPI adapts its inferred preferences to new regimes and achieves higher post-shift performance than fixed-weight and heuristic envelope baselines.
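The belief-maintenance step at the heart of DPI can be illustrated with a toy Bayesian update over a grid of candidate weight vectors. The paper uses a learned variational module, so the grid, Gaussian likelihood, and drift term here are stand-in assumptions for illustration only:

```python
import numpy as np

# Hypothetical grid of candidate preference weights over two objectives.
candidates = np.array([[w, 1.0 - w] for w in np.linspace(0, 1, 11)])
belief = np.full(len(candidates), 1.0 / len(candidates))  # uniform prior

def update_belief(belief, vector_return, observed_scalar, sigma=0.5, drift=0.05):
    """Treat the scalarized return as noisy evidence about latent weights."""
    predicted = candidates @ vector_return           # scalar return per candidate
    loglik = -0.5 * ((predicted - observed_scalar) / sigma) ** 2
    post = belief * np.exp(loglik - loglik.max())
    post /= post.sum()
    # Mix with uniform to model preference drift between episodes.
    return (1 - drift) * post + drift / len(post)

# Evidence consistent with latent weights around (0.8, 0.2):
b = belief
for _ in range(20):
    b = update_belief(b, np.array([1.0, 0.0]), 0.8)
print(candidates[b.argmax()])  # belief concentrates near [0.8, 0.2]
```

A preference-conditioned policy would then take `candidates[b.argmax()]` (or the full belief) as an additional input, so an objective shift changes behavior through inference rather than retraining.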
[351] Empirical Comparison of Agent Communication Protocols for Task Orchestration
Ivan Dobrovolskyi
Main category: cs.AI
TL;DR: First systematic benchmark comparing tool-integration-only vs multi-agent delegation vs hybrid architectures for AI agent systems across performance metrics
Details
Motivation: AI agent systems are evolving from single-tool interactions to complex multi-agent orchestrations, with two competing communication protocols emerging (tool integration vs inter-agent delegation), but no empirical comparison exists despite widespread industry adoption.
Method: Develops a systematic benchmark comparing three architectures (tool-integration-only, multi-agent delegation, hybrid) across standardized queries at three complexity levels, quantifying trade-offs in response time, context window consumption, monetary cost, error recovery, and implementation complexity
Result: Not specified in abstract (paper presents first empirical comparison of these protocols)
Conclusion: Provides first systematic benchmark for comparing AI agent communication protocols, enabling informed architectural decisions based on empirical performance trade-offs
Abstract: Context. Nowadays, artificial intelligence agent systems are transforming from single-tool interactions to complex multi-agent orchestrations. As a result, two competing communication protocols have emerged: a tool integration protocol that standardizes how agents invoke external tools, and an inter-agent delegation protocol that enables autonomous agents to discover and delegate tasks to one another. Despite widespread industry adoption by dozens of enterprise partners, no empirical comparison of these protocols exists in the literature. Objective. The goal of this work is to develop the first systematic benchmark comparing tool-integration-only, multi-agent delegation, and hybrid architectures across standardized queries at three complexity levels, and to quantify the trade-offs in response time, context window consumption, monetary cost, error recovery, and implementation complexity.
[352] Improving Safety Alignment via Balanced Direct Preference Optimization
Shiji Zhao, Mengyang Wang, Shukun Xiong, Fangzhou Chen, Qihui Zhu, Shouwei Ruan, Yisong Xiao, Ranjie Duan, Xun Chen, XingXing Wei
Main category: cs.AI
TL;DR: B-DPO addresses safety alignment overfitting in LLMs by balancing preference comprehension between preferred and dispreferred responses using mutual information-based adaptive optimization.
Details
Motivation: Safety alignment in LLMs using methods like DPO suffers from severe overfitting, which limits actual safety performance. The paper identifies an "Imbalanced Preference Comprehension" phenomenon where models don't equally comprehend both responses in preference pairs, compromising safety alignment.
Method: Proposes Balanced Direct Preference Optimization (B-DPO) which adaptively modulates optimization strength between preferred and dispreferred responses based on mutual information. This balances the model’s comprehension of both types of responses to prevent overfitting.
Result: Experimental results show B-DPO enhances safety capability while maintaining competitive general capabilities on various mainstream benchmarks compared to state-of-the-art methods.
Conclusion: B-DPO effectively addresses the overfitting problem in safety alignment by balancing preference comprehension, improving safety performance without sacrificing general capabilities.
Abstract: With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model’s comprehension of the training data. We find that the Imbalanced Preference Comprehension phenomenon exists between responses in preference pairs, which compromises the model’s safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates optimization strength between preferred and dispreferred responses based on mutual information. A series of experimental results show that B-DPO can enhance the safety capability while maintaining the competitive general capabilities of LLMs on various mainstream benchmarks compared to state-of-the-art methods. Warning: This paper contains examples of harmful texts, and reader discretion is recommended.
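As a rough sketch of the idea, one can start from the standard DPO objective and attach a per-pair weight that grows when the model treats the two responses lopsidedly. The `imbalance` proxy below is invented for illustration and is not B-DPO's mutual-information estimator:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def balanced_dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1, gamma=1.0):
    """Standard DPO loss with a hypothetical per-pair balance weight.

    logp_w/logp_l: policy log-probs of preferred/dispreferred responses;
    ref_w/ref_l:   reference-model log-probs of the same responses.
    The balance term up-weights pairs where the implicit rewards of the
    two responses are lopsided (a crude stand-in for B-DPO's MI-based
    modulation of optimization strength).
    """
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))   # standard DPO
    imbalance = np.abs((logp_w - ref_w) + (logp_l - ref_l))  # illustrative proxy
    weight = 1.0 + gamma * imbalance
    return -(weight * np.log(sigmoid(margin))).mean()

loss = balanced_dpo_loss(np.array([-3.0]), np.array([-9.0]),
                         np.array([-4.0]), np.array([-6.0]))
print(round(float(loss), 4))
```

With `gamma=0` this reduces exactly to vanilla DPO, which makes the weighting term easy to ablate.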
[353] PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal
Zining Fang, Cheng Xue, Chunhui Liu, Bin Xu, Ming Chen, Xiaowei Hu
Main category: cs.AI
TL;DR: PhySe-RPO: A diffusion-based framework for surgical smoke removal using physics- and semantics-guided reinforcement learning optimization for robust video restoration under limited supervision.
Details
Motivation: Surgical smoke degrades intraoperative video quality, obscuring anatomical structures. Existing learning-based desmoking methods rely on scarce paired supervision and deterministic pipelines, making exploration and refinement difficult under real surgical conditions.
Method: Proposes PhySe-RPO, a diffusion restoration framework optimized through Physics- and Semantics-Guided Relative Policy Optimization. Transforms deterministic restoration into stochastic policy for trajectory-level exploration and critic-free updates via group-relative optimization. Uses physics-guided reward for illumination/color consistency and visual-concept semantic reward from CLIP-based surgical concepts.
Result: Produces physically consistent, semantically faithful, and clinically interpretable results across synthetic and real robotic surgical datasets, providing robust diffusion-based restoration under limited paired supervision.
Conclusion: PhySe-RPO offers a principled route to robust diffusion-based surgical video restoration by combining physics and semantic guidance with reinforcement learning optimization, enabling effective desmoking under real surgical conditions with limited supervision.
Abstract: Surgical smoke severely degrades intraoperative video quality, obscuring anatomical structures and limiting surgical perception. Existing learning-based desmoking approaches rely on scarce paired supervision and deterministic restoration pipelines, making it difficult to perform exploration or reinforcement-driven refinement under real surgical conditions. We propose PhySe-RPO, a diffusion restoration framework optimized through Physics- and Semantics-Guided Relative Policy Optimization. The core idea is to transform deterministic restoration into a stochastic policy, enabling trajectory-level exploration and critic-free updates via group-relative optimization. A physics-guided reward imposes illumination and color consistency, while a visual-concept semantic reward learned from CLIP-based surgical concepts promotes smoke-free and anatomically coherent restoration. Together with a reference-free perceptual constraint, PhySe-RPO produces results that are physically consistent, semantically faithful, and clinically interpretable across synthetic and real robotic surgical datasets, providing a principled route to robust diffusion-based restoration under limited paired supervision.
[354] CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models
Youzhi Liu, Li Gao, Liu Liu, Mingyang Lv, Yang Cai
Main category: cs.AI
TL;DR: CoMaTrack introduces a competitive multi-agent reinforcement learning framework for embodied visual tracking, using adversarial training between tracker and opponent agents to improve robustness and generalization, with a new benchmark for evaluation.
Details
Motivation: Current embodied visual tracking methods rely on single-agent imitation learning, which suffers from costly expert data collection and limited generalization due to static training environments. There's a need for more robust approaches that can handle dynamic, adversarial scenarios.
Method: Proposes CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in dynamic adversarial settings with competitive subtasks. Also introduces CoMaTrack-Bench, the first benchmark for competitive EVT featuring game scenarios between tracker and adaptive opponents across diverse environments.
Result: CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. A 3B VLM trained with this framework surpasses previous single-agent imitation learning methods based on 7B models on EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT.
Conclusion: The competitive multi-agent reinforcement learning approach enables stronger adaptive planning and interference-resilient strategies for embodied visual tracking, demonstrating superior performance over imitation learning methods despite using smaller models.
Abstract: Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench
[355] Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories
Yang Li, Yule Liu, Xinlei He, Youjian Zhao, Qi Li, Ke Xu
Main category: cs.AI
TL;DR: CoA framework embeds authorization logic into LLMs, requiring explicit permission reasoning before generating responses to prevent unauthorized access.
Details
Motivation: Current LLMs lack awareness of knowledge ownership and access boundaries, creating security risks like sensitive data leakage and adversarial manipulation. Existing protection methods are either too rigid or lack fine-grained permission distinctions.
Method: Chain-of-Authorization (CoA) framework restructures LLM information flow by embedding permission context at input and requiring generation of explicit authorization reasoning trajectory (resource review, identity resolution, decision-making) before final response, trained via supervised fine-tuning on diverse authorization data.
Result: CoA maintains comparable utility in authorized scenarios while effectively handling permission mismatches, showing high rejection rates against unauthorized and adversarial access attempts.
Conclusion: CoA leverages LLMs’ reasoning capabilities for dynamic authorization, using natural language understanding as a proactive security mechanism for deploying reliable LLMs in AI systems.
Abstract: Large Language Models (LLMs) have become core cognitive components in modern artificial intelligence (AI) systems, combining internal knowledge with external context to perform complex tasks. However, LLMs typically treat all accessible data indiscriminately, lacking inherent awareness of knowledge ownership and access boundaries. This deficiency heightens risks of sensitive data leakage and adversarial manipulation, potentially enabling unauthorized system access and severe security crises. Existing protection strategies rely on rigid, uniform defenses that prevent dynamic authorization. Structural isolation methods face scalability bottlenecks, while prompt guidance methods struggle with fine-grained permission distinctions. Here, we propose the Chain-of-Authorization (CoA) framework, a secure training and reasoning paradigm that internalizes authorization logic into LLMs’ core capabilities. Unlike passive external defenses, CoA restructures the model’s information flow: it embeds permission context at input and requires generating an explicit authorization reasoning trajectory that includes resource review, identity resolution, and decision-making stages before the final response. Through supervised fine-tuning on data covering various authorization statuses, CoA integrates policy execution with task responses, making authorization a causal prerequisite for substantive responses. Extensive evaluations show that CoA not only maintains comparable utility in authorized scenarios but also overcomes the cognitive confusion that arises when permissions mismatch. It exhibits high rejection rates against various unauthorized and adversarial access attempts. This mechanism leverages LLMs’ reasoning capability to perform dynamic authorization, using natural language understanding as a proactive security mechanism for deploying reliable LLMs in modern AI systems.
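The three-stage trajectory CoA describes (resource review, identity resolution, decision-making before the final response) can be pictured as a structured record the model is trained to emit before answering. The schema and field names below are illustrative, not the paper's actual format:

```python
import json

# Illustrative permission context embedded alongside the user input.
request = {
    "user": {"id": "u42", "roles": ["analyst"]},
    "query": "Show me last quarter's revenue by region.",
    "resources": {"sales_db": {"allowed_roles": ["analyst", "finance"]}},
}

def authorize(req):
    """Emit an explicit reasoning trajectory, then the decision."""
    trajectory = []
    # Stage 1: resource review -- which resources does the query touch?
    touched = ["sales_db"]
    trajectory.append({"stage": "resource_review", "resources": touched})
    # Stage 2: identity resolution -- what roles does the requester hold?
    roles = set(req["user"]["roles"])
    trajectory.append({"stage": "identity_resolution", "roles": sorted(roles)})
    # Stage 3: decision -- authorization is a causal prerequisite for answering.
    ok = all(roles & set(req["resources"][r]["allowed_roles"]) for r in touched)
    trajectory.append({"stage": "decision", "authorized": ok})
    return trajectory, ok

traj, ok = authorize(request)
print(json.dumps(traj, indent=2))
```

In CoA this logic is internalized via fine-tuning rather than coded externally; the point of the sketch is the ordering constraint, with the substantive answer generated only after the `decision` stage authorizes it.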
[356] Dynamical Systems Theory Behind a Hierarchical Reasoning Model
Vasiliy A. Es’kin, Mikhail E. Smorkalov
Main category: cs.AI
TL;DR: CMM: A mathematically grounded recursive reasoning architecture using Neural ODEs/SDEs with contraction mapping principles for algorithmic reasoning, achieving state-of-the-art performance with extreme parameter efficiency.
Details
Motivation: Current LLMs struggle with complex algorithmic reasoning despite massive parameters. Existing recursive reasoning architectures lack mathematical guarantees, leading to instability and representational collapse.
Method: Proposes Contraction Mapping Model (CMM) that reformulates discrete recursive reasoning into continuous Neural ODEs/SDEs. Enforces convergence to stable equilibrium states and uses hyperspherical repulsion loss to prevent feature collapse.
Result: 5M-parameter CMM achieves 93.7% accuracy on Sudoku-Extreme, outperforming 27M-parameter HRM (55.0%) and 5M-parameter TRM (87.4%). Even at 0.26M parameters, achieves 85.4% on Sudoku-Extreme and 82.2% on Maze benchmark.
Conclusion: Mathematically rigorous latent dynamics can effectively replace brute-force parameter scaling for artificial reasoning, establishing new frontier for extreme parameter efficiency in algorithmic reasoning tasks.
Abstract: Current large language models (LLMs) primarily rely on linear sequence generation and massive parameter counts, yet they severely struggle with complex algorithmic reasoning. While recent reasoning architectures, such as the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM), demonstrate that compact recursive networks can tackle these tasks, their training dynamics often lack rigorous mathematical guarantees, leading to instability and representational collapse. We propose the Contraction Mapping Model (CMM), a novel architecture that reformulates discrete recursive reasoning into continuous Neural Ordinary and Stochastic Differential Equations (NODEs/NSDEs). By explicitly enforcing the convergence of the latent phase point to a stable equilibrium state and mitigating feature collapse with a hyperspherical repulsion loss, the CMM provides a mathematically grounded and highly stable reasoning engine. On the Sudoku-Extreme benchmark, a 5M-parameter CMM achieves a state-of-the-art accuracy of 93.7%, outperforming the 27M-parameter HRM (55.0%) and 5M-parameter TRM (87.4%). Remarkably, even when aggressively compressed to an ultra-tiny footprint of just 0.26M parameters, the CMM retains robust predictive power, achieving 85.4% on Sudoku-Extreme and 82.2% on the Maze benchmark. These results establish a new frontier for extreme parameter efficiency, proving that mathematically rigorous latent dynamics can effectively replace brute-force scaling in artificial reasoning.
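The stability property CMM builds on, that a contraction mapping drives any initial latent state to a unique equilibrium (Banach fixed-point theorem), can be checked numerically with a toy linear reasoning step whose spectral norm is held below 1. This illustrates the guarantee only, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random linear "reasoning step" rescaled so its Lipschitz constant is 0.9,
# making the latent update a contraction with a unique fixed point.
W = rng.normal(size=(8, 8))
W *= 0.9 / np.linalg.norm(W, 2)   # spectral norm -> 0.9 < 1
b = rng.normal(size=8)

def step(z):
    return W @ z + b

# Two distinct initial latent states...
z1, z2 = rng.normal(size=8), rng.normal(size=8)
for _ in range(200):
    z1, z2 = step(z1), step(z2)

# ...converge to the same equilibrium: the distance shrinks by at least
# a factor of 0.9 per iteration, so no representational drift or blow-up.
print(np.linalg.norm(z1 - z2) < 1e-6)  # True
```

The hyperspherical repulsion loss then addresses the complementary failure mode: preventing *different* inputs from collapsing onto the same equilibrium representation.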
[357] Continuous Optimization for Satisfiability Modulo Theories on Linear Real Arithmetic
Yunuo Cen, Daniel Ebler, Xuanyao Fong
Main category: cs.AI
TL;DR: FourierSMT: A continuous-variable optimization framework for SMT using extended Walsh-Fourier expansion and binary decision diagrams for scalable parallel solving.
Details
Motivation: Existing SMT solvers based on conflict-driven clause learning are difficult to parallelize and scale poorly. There's a need for scalable, highly parallelizable approaches for industrial applications like hardware verification and design automation.
Method: Generalizes Walsh-Fourier expansion to mixed Boolean-real domain (xWFE), uses extended binary decision diagrams (xBDDs) to reduce evaluation complexity, and employs circuit-output probability sampling under randomized rounding for efficient constraint computation.
Result: Achieves 8-fold speedups on large-scale scheduling and placement problems with up to 10,000 variables and 700,000 constraints compared to state-of-the-art SMT solvers.
Conclusion: FourierSMT provides a scalable, parallelizable framework for SMT solving that enables GPU-based optimization and handles large-scale industrial problems efficiently.
Abstract: Efficient solutions for satisfiability modulo theories (SMT) are integral in industrial applications such as hardware verification and design automation. Existing approaches are predominantly based on conflict-driven clause learning, which is structurally difficult to parallelize and therefore scales poorly. In this work, we introduce FourierSMT as a scalable and highly parallelizable continuous-variable optimization framework for SMT. We generalize the Walsh-Fourier expansion (WFE), called extended WFE (xWFE), from the Boolean domain to a mixed Boolean-real domain, which allows the use of gradient methods for SMT. This addresses the challenge of finding satisfying variable assignments to high-arity constraints by local updates of discrete variables. To reduce the evaluation complexity of xWFE, we present the extended binary decision diagram (xBDD) and map the constraints from xWFE to xBDDs. We then show that sampling the circuit-output probability (COP) of xBDDs under randomized rounding is equivalent to the expectation value of the xWFEs. This allows for efficient computation of the constraints. We show that the reduced problem is guaranteed to converge and preserves satisfiability, ensuring the soundness of the solutions. The framework is benchmarked for large-scale scheduling and placement problems with up to 10,000 variables and 700,000 constraints, achieving 8-fold speedups compared to state-of-the-art SMT solvers. These results pave the way for GPU-based optimization of SMTs with continuous systems.
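The core relaxation, replacing discrete satisfiability with a differentiable circuit-output probability under independent randomized rounding, can be sketched for a tiny CNF. The finite-difference gradient and step size below are illustrative choices, not the paper's xWFE/xBDD machinery:

```python
import numpy as np

# CNF over 3 Booleans: (x0 or x1) and (not x1 or x2); literals are
# (var, sign) pairs with sign=+1 for positive, -1 for negated.
clauses = [[(0, 1), (1, 1)], [(1, -1), (2, 1)]]

def cop(p):
    """Probability that all clauses hold when each x_i is independently
    rounded to True with probability p_i -- the smooth surrogate that
    replaces discrete satisfiability."""
    sat = 1.0
    for clause in clauses:
        unsat = 1.0
        for v, s in clause:
            unsat *= (1 - p[v]) if s > 0 else p[v]
        sat *= 1 - unsat
    return sat

p = np.full(3, 0.5)
eps = 1e-5
for _ in range(500):                      # gradient ascent on the relaxation
    g = np.array([(cop(p + eps * e) - cop(p - eps * e)) / (2 * eps)
                  for e in np.eye(3)])
    p = np.clip(p + 0.5 * g, 0.01, 0.99)

x = p > 0.5                               # round back to a Boolean assignment
print(x, cop(p))
```

Because `cop` factorizes over clauses, each clause's probability can be evaluated and differentiated independently, which is what makes the approach embarrassingly parallel compared with clause learning.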
[358] Separating Diagnosis from Control: Auditable Policy Adaptation in Agent-Based Simulations with LLM-Based Diagnostics
Shaoxin Zhong, Yuchen Su, Michael Witbrock
Main category: cs.AI
TL;DR: A three-layer framework separates LLM diagnosis from deterministic control to achieve both adaptability and auditability in elderly loneliness policy interventions.
Details
Motivation: Existing methods for elderly loneliness interventions struggle to reconcile adaptability and auditability - traditional agent-based models are too static while LLM controllers lack traceability.
Method: Three-layer framework where LLMs serve strictly as diagnostic instruments to assess population state and generate structured risk evaluations, while deterministic formulas with explicit bounds translate these assessments into traceable parameter updates.
Result: Explicit control rules outperform end-to-end black-box LLM approaches by 11.7% while preserving full auditability in elderly care simulations across five experimental conditions.
Conclusion: The separation of diagnosis from control enables both adaptive response to emergent needs and full auditability, demonstrating that transparency need not compromise performance in policy interventions.
Abstract: Mitigating elderly loneliness requires policy interventions that achieve both adaptability and auditability. Existing methods struggle to reconcile these objectives: traditional agent-based models suffer from static rigidity, while direct large language model (LLM) controllers lack essential traceability. This work proposes a three-layer framework that separates diagnosis from control to achieve both properties simultaneously. LLMs operate strictly as diagnostic instruments that assess population state and generate structured risk evaluations, while deterministic formulas with explicit bounds translate these assessments into traceable parameter updates. This separation ensures that every policy decision can be attributed to inspectable rules while maintaining adaptive response to emergent needs. We validate the framework through systematic ablation across five experimental conditions in elderly care simulation. Results demonstrate that explicit control rules outperform end-to-end black-box LLM approaches by 11.7% while preserving full auditability, confirming that transparency need not compromise adaptive performance.
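The division of labor, an LLM emitting only a bounded risk score while a deterministic formula with explicit bounds performs the actual parameter update, can be captured in a few lines. The parameter, gain, and bounds below are hypothetical illustrations:

```python
# The deterministic control layer: the LLM's output is confined to a bounded
# risk score; the parameter update itself is an inspectable formula, so every
# policy change is attributable to an explicit rule.
def update_visit_frequency(current, risk_score, k=2.0, lo=1.0, hi=7.0):
    """Translate a diagnostic risk score in [0, 1] into a traceable,
    explicitly bounded change in weekly visit frequency."""
    assert 0.0 <= risk_score <= 1.0, "diagnosis layer must emit a bounded score"
    proposed = current + k * (risk_score - 0.5)   # symmetric around neutral risk
    return min(hi, max(lo, proposed))             # hard governance bounds

print(update_visit_frequency(3.0, 0.9))  # raises visits, stays within bounds
print(update_visit_frequency(3.0, 0.0))  # lowers visits, floor respected
```

An end-to-end LLM controller would instead choose the new frequency directly, which is exactly the untraceable step this separation removes.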
[359] ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning
Xiangyu Yin, Yi Qi, Chih-hong Cheng
Main category: cs.AI
TL;DR: ProGRank is a post-hoc, training-free defense against corpus poisoning attacks in Retrieval-Augmented Generation (RAG) systems that uses gradient-based instability signals to detect and filter malicious passages.
Details
Motivation: RAG systems are vulnerable to corpus poisoning attacks where adversaries inject malicious passages that get retrieved for target queries, affecting downstream generation. Existing defenses require content filtering, auxiliary models, or generator-side reasoning, making deployment difficult.
Method: ProGRank stress-tests query-passage pairs under mild randomized perturbations and extracts probe gradients from a small fixed parameter subset of the retriever. It derives two instability signals (representational consistency and dispersion risk) and combines them with a score gate in a reranking step.
Result: Extensive experiments across three datasets, three dense retriever backbones, and representative corpus poisoning attacks show ProGRank provides stronger defense performance and favorable robustness-utility trade-off. It remains competitive under adaptive evasive attacks.
Conclusion: ProGRank offers an effective, training-free defense against corpus poisoning in RAG systems that preserves original content, requires no retraining, and supports surrogate-based variants when the deployed retriever is unavailable.
Abstract: Retrieval-Augmented Generation (RAG) improves the reliability of large language model applications by grounding generation in retrieved evidence, but it also introduces a new attack surface: corpus poisoning. In this setting, an adversary injects or edits passages so that they are ranked into the Top-$K$ results for target queries and then affect downstream generation. Existing defences against corpus poisoning often rely on content filtering, auxiliary models, or generator-side reasoning, which can make deployment more difficult. We propose ProGRank, a post hoc, training-free retriever-side defence for dense-retriever RAG. ProGRank stress-tests each query–passage pair under mild randomized perturbations and extracts probe gradients from a small fixed parameter subset of the retriever. From these signals, it derives two instability signals, representational consistency and dispersion risk, and combines them with a score gate in a reranking step. ProGRank preserves the original passage content, requires no retraining, and also supports a surrogate-based variant when the deployed retriever is unavailable. Extensive experiments across three datasets, three dense retriever backbones, representative corpus poisoning attacks, and both retrieval-stage and end-to-end settings show that ProGRank provides stronger defence performance and a favorable robustness–utility trade-off. It also remains competitive under adaptive evasive attacks.
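The retriever-side reranking loop can be sketched as follows. The randomized score probes and the norm-sensitive toy setup (a "poisoned" passage that wins on raw score via a large embedding norm) are illustrative stand-ins for ProGRank's probe-gradient signals, not its actual computation:

```python
import numpy as np

rng = np.random.default_rng(1)

def instability(score_fn, q, p, n_probes=32, noise=0.05):
    """Probe the query-passage score under mild randomized perturbations and
    summarize two instability signals (toy analogues of ProGRank's
    representational consistency and dispersion risk)."""
    base = score_fn(q, p)
    probed = np.array([score_fn(q + rng.normal(scale=noise, size=q.shape), p)
                       for _ in range(n_probes)])
    consistency = float(np.abs(probed - base).mean())  # drift from base score
    dispersion = float(probed.std())                   # spread across probes
    return consistency, dispersion

def rerank(score_fn, q, passages, alpha=10.0):
    """Score gate minus instability penalty; returns indices, best first."""
    adjusted = []
    for p in passages:
        c, d = instability(score_fn, q, p)
        adjusted.append(score_fn(q, p) - alpha * (c + d))
    return list(np.argsort(adjusted)[::-1])

score = lambda a, b: float(a @ b)        # toy dense-retriever dot product
q = np.zeros(16); q[0] = 1.0
organic = 2.0 * q                        # aligned with the query, modest norm
u = np.zeros(16); u[0], u[1] = 0.5, 0.75 ** 0.5
poisoned = 5.0 * u                       # higher raw score via large norm (toy)
order = rerank(score, q, [organic, poisoned])
print(order)  # organic passage ranked first despite lower raw score
```

The passage content is never modified, matching the post hoc, training-free framing: only the ranking changes.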
[360] Ran Score: a LLM-based Evaluation Score for Radiology Report Generation
Ran Zhang, Yucong Lin, Zhaoli Su, Bowen Liu, Danni Ai, Tianyu Fu, Deqiang Xiao, Jingfan Fan, Yuanyuan Wang, Mingwei Gao, Yuwan Hu, Shuya Gao, Jingtao Li, Jian Yang, Hong Song, Hongliang Sun
Main category: cs.AI
TL;DR: Clinician-guided framework using LLMs for multi-label finding extraction from chest X-ray reports, with Ran Score metric for finding-level evaluation of report generation models.
Details
Motivation: Current chest X-ray report generation and evaluation systems have poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language like negation and ambiguity.
Method: Developed a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports, defined Ran Score metric, optimized prompts, and established radiologist-derived reference labels across multiple datasets.
Result: Optimized framework improved macro-averaged score from 0.753 to 0.956 on MIMIC-CXR-EN, exceeded CheXbert benchmark by 15.7 percentage points, and showed robust generalization on independent ChestX-CN validation cohort.
Conclusion: Clinician-guided prompt optimization improves agreement with radiologist reference standards, and Ran Score enables finding-level evaluation of report fidelity, especially for low-prevalence abnormalities.
Abstract: Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
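The abstract does not give Ran Score's exact formula, but its finding-level, macro-averaged character, which is what lets low-prevalence abnormalities count as much as common ones, can be illustrated with a plain macro-F1 over extracted finding sets (an assumed simplification, not the metric itself):

```python
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def macro_finding_score(pred, ref, findings):
    """Macro-average a per-finding F1 so each abnormality contributes
    equally regardless of prevalence. pred/ref: per-report finding sets
    (here extracted from free text by the LLM framework / radiologists)."""
    scores = []
    for f in findings:
        tp = sum(f in p and f in r for p, r in zip(pred, ref))
        fp = sum(f in p and f not in r for p, r in zip(pred, ref))
        fn = sum(f not in p and f in r for p, r in zip(pred, ref))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

ref  = [{"effusion"}, {"pneumothorax"}, set()]
pred = [{"effusion"}, set(), set()]      # missed the rare pneumothorax
print(macro_finding_score(pred, ref, ["effusion", "pneumothorax"]))
```

Under micro-averaging the single missed rare finding would barely move the score; macro-averaging halves it, which mirrors the paper's emphasis on low-prevalence abnormalities.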
[361] Optimizing Small Language Models for NL2SQL via Chain-of-Thought Fine-Tuning
Anshul Solanki, Sanchit Latawa, Koushik Chakraborty, Navneet Kamboj
Main category: cs.AI
TL;DR: Fine-tuning small language models on NL2SQL tasks with Chain-of-Thought reasoning achieves significant accuracy gains (36% to 54.5%) while reducing costs compared to large models.
Details
Motivation: Large Language Models (LLMs) have high inference costs that limit deployment at scale for NL2SQL tasks in enterprises, creating a need for more cost-effective solutions.
Method: Fine-tuning both large and small language models on NL2SQL tasks, with dataset enrichment using explicit Chain-of-Thought (CoT) reasoning patterns for small models.
Result: Fine-tuning large models showed negligible returns with overfitting, while small models improved from 36% to 45% baseline, and further to 54.5% with CoT enrichment, achieving cost-effective production-grade performance.
Conclusion: Transferring reasoning patterns to smaller models enables compute-efficient approaches that can meet business accuracy thresholds while significantly reducing costs and latency.
Abstract: Translating Natural Language to SQL (NL2SQL) remains a critical bottleneck for democratization of data in enterprises. Although Large Language Models (LLMs) like Gemini 2.5 have demonstrated impressive zero-shot capabilities, their high inference costs limit deployment at scale. This paper explores the efficacy of fine-tuning both large and small language models on NL2SQL tasks. Our research reveals a counter-intuitive scaling phenomenon. Fine-tuning large models (Gemini 2.5 Flash/Lite) on standard datasets yields negligible returns, often leading to overfitting on complex queries. Conversely, small models (Qwen) show significant gains. Fine-tuning improved the small model baseline from 36% to 45%, and further enriching the dataset with explicit Chain-of-Thought (CoT) reasoning surged accuracy to 54.5% (Fig. 2). While this is still lower than the accuracy of large models like Gemini 2.5, it serves the business goals of significant cost reduction and lower inference latency while meeting the business-critical performance accuracy threshold. This paper demonstrates that transferring reasoning patterns enables compute-efficient smaller models to approach production-grade performance.
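A CoT-enriched fine-tuning record of the kind the paper describes might look as follows. The field names, reasoning style, and schema are illustrative inventions, not the authors' dataset format:

```python
import json

# One illustrative training record: the target output interleaves an explicit
# reasoning trace with the final SQL, so the small model learns the
# derivation path rather than only the answer string.
example = {
    "schema": "orders(id, customer_id, total, placed_at); "
              "customers(id, name, region)",
    "question": "Total order value per region in 2024?",
    "cot": [
        "Need revenue grouped by region -> join orders to customers.",
        "Filter placed_at to the year 2024.",
        "Aggregate with SUM(total), GROUP BY region.",
    ],
    "sql": "SELECT c.region, SUM(o.total) AS revenue "
           "FROM orders o JOIN customers c ON o.customer_id = c.id "
           "WHERE o.placed_at >= '2024-01-01' AND o.placed_at < '2025-01-01' "
           "GROUP BY c.region;",
}
print(json.dumps(example, indent=2))
```

At inference time the model emits the `cot` steps before the `sql`, and only the final SQL is executed, so the extra tokens buy accuracy at a small latency cost.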
[362] PersonalQ: Select, Quantize, and Serve Personalized Diffusion Models for Efficient Inference
Qirui Wang, Qi Guo, Yiding Sun, Junkai Yang, Dongxu Zhang, Shanmin Pang, Qing Guo
Main category: cs.AI
TL;DR: PersonalQ: A unified framework for efficient serving of personalized text-to-image checkpoints through intent-aligned checkpoint selection and trigger-aware quantization.
Details
Motivation: Serving personalized text-to-image checkpoints efficiently is challenging due to ambiguous natural language requests that can be misrouted to visually similar checkpoints, and standard post-training quantization can distort fragile representations encoding personalized concepts.
Method: PersonalQ connects checkpoint selection and quantization through the checkpoint’s trigger token. Check-in performs intent-aligned selection via intent-aware hybrid retrieval with LLM-based reranking, asks clarification questions when needed, and rewrites prompts with canonical triggers. Trigger-Aware Quantization (TAQ) applies trigger-aware mixed precision in cross-attention, preserving trigger-conditioned key/value rows while aggressively quantizing other pathways.
Result: PersonalQ improves intent alignment over retrieval and reranking baselines, while TAQ offers a stronger compression-quality trade-off than prior diffusion PTQ methods, enabling scalable serving of personalized checkpoints without sacrificing fidelity.
Conclusion: PersonalQ provides an effective framework for efficient serving of personalized text-to-image checkpoints by addressing both selection ambiguity and quantization challenges through trigger-aware techniques.
Abstract: Personalized text-to-image generation lets users fine-tune diffusion models into repositories of concept-specific checkpoints, but serving these repositories efficiently is difficult for two reasons: natural-language requests are often ambiguous and can be misrouted to visually similar checkpoints, and standard post-training quantization can distort the fragile representations that encode personalized concepts. We present PersonalQ, a unified framework that connects checkpoint selection and quantization through a shared signal – the checkpoint’s trigger token. Check-in performs intent-aligned selection by combining intent-aware hybrid retrieval with LLM-based reranking over checkpoint context and asks a brief clarification question only when multiple intents remain plausible; it then rewrites the prompt by inserting the selected checkpoint’s canonical trigger. Complementing this, Trigger-Aware Quantization (TAQ) applies trigger-aware mixed precision in cross-attention, preserving trigger-conditioned key/value rows (and their attention weights) while aggressively quantizing the remaining pathways for memory-efficient inference. Experiments show that PersonalQ improves intent alignment over retrieval and reranking baselines, while TAQ consistently offers a stronger compression-quality trade-off than prior diffusion PTQ methods, enabling scalable serving of personalized checkpoints without sacrificing fidelity.
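The trigger-aware mixed-precision idea behind TAQ can be illustrated with a toy NumPy sketch (an illustration under assumptions, not the paper's implementation; the symmetric int8 scheme, the row layout, and the `trigger_rows` indices are all hypothetical):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def trigger_aware_quantize(kv_weight, trigger_rows):
    """Keep trigger-conditioned rows in full precision and
    quantize the remaining rows aggressively (the TAQ idea)."""
    mask = np.zeros(kv_weight.shape[0], dtype=bool)
    mask[trigger_rows] = True
    q_rows, scale = quantize_int8(kv_weight[~mask])
    return {"fp_rows": kv_weight[mask], "q_rows": q_rows,
            "scale": scale, "mask": mask}

def dequantize(packed):
    out = np.empty((packed["mask"].size, packed["fp_rows"].shape[1]))
    out[packed["mask"]] = packed["fp_rows"]                    # exact
    out[~packed["mask"]] = packed["q_rows"] * packed["scale"]  # lossy
    return out
```

Rows tied to the trigger token round-trip exactly, while the remaining rows carry only bounded quantization error, mirroring the compression-quality trade-off described above.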
[363] Where Experts Disagree, Models Fail: Detecting Implicit Legal Citations in French Court Decisions
Avrile Floro, Tamara Dhorasoo, Soline Pellez, Nils Holzenberger
Main category: cs.AI
TL;DR: A computational legal analysis study examining implicit citation of the French Civil Code in court decisions, creating an expert-annotated benchmark and developing models to distinguish legal reasoning from semantic similarity.
Details
Motivation: To develop computational methods for analyzing law at scale, specifically focusing on detecting when courts implicitly apply statutory rules rather than just semantic similarity, which requires distinguishing legal reasoning from mere textual similarity.
Method: Created a benchmark of 1,015 passage-article pairs annotated by three legal experts, analyzed inter-annotator agreement, developed supervised ensemble models, and reframed the task as top-k ranking with multi-model consensus in unsupervised settings.
Result: Supervised ensemble achieved F1 = 0.70 (77% accuracy), but 68% of false positives occurred in cases where annotators disagreed; unsupervised top-k ranking with multi-model consensus achieved 76% precision at k = 200; expert disagreement predicted model failures.
Conclusion: Computational methods can detect implicit legal reasoning despite inherent ambiguity, with model failures often corresponding to expert disagreements, suggesting that remaining false positives may surface legally ambiguous applications rather than obvious errors.
Abstract: Computational methods applied to legal scholarship hold the promise of analyzing law at scale. We start from a simple question: how often do courts implicitly apply statutory rules? This requires distinguishing legal reasoning from semantic similarity. We focus on implicit citation of the French Civil Code in first-instance court decisions and introduce a benchmark of 1,015 passage-article pairs annotated by three legal experts. We show that expert disagreement predicts model failures. Inter-annotator agreement is moderate ($\kappa = 0.33$) with 43% of disagreements involving the boundary between factual description and legal reasoning. Our supervised ensemble achieves F1 = 0.70 (77% accuracy), but this figure conceals an asymmetry: 68% of false positives fall on the 33% of cases where the annotators disagreed. Despite these limits, reframing the task as top-k ranking and leveraging multi-model consensus yields 76% precision at k = 200 in an unsupervised setting. Moreover, the remaining false positives tend to surface legally ambiguous applications rather than obvious errors.
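The unsupervised reframing, top-k ranking with multi-model consensus, can be sketched as follows (the score dictionaries and the voting threshold are hypothetical; the paper's exact scoring and consensus procedure is not specified here):

```python
def topk_consensus(model_scores, k, min_votes=2):
    """Rank candidate pairs by each model's score and keep only
    pairs that appear in the top-k of at least `min_votes` models."""
    votes = {}
    for scores in model_scores:  # one {pair: score} dict per model
        topk = sorted(scores, key=scores.get, reverse=True)[:k]
        for pair in topk:
            votes[pair] = votes.get(pair, 0) + 1
    return [pair for pair, v in votes.items() if v >= min_votes]

# Toy example: two models scoring three passage-article pairs.
m1 = {"a": 0.9, "b": 0.8, "c": 0.1}
m2 = {"a": 0.7, "b": 0.2, "c": 0.95}
consensus = topk_consensus([m1, m2], k=2)
```

Requiring agreement between independent models trades recall for precision, which matches the reported 76% precision at k = 200.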
[364] JFTA-Bench: Evaluate LLM’s Ability of Tracking and Analyzing Malfunctions Using Fault Trees
Yuhui Wang, Zhixiong Yang, Ming Zhang, Shihan Dou, Zhiheng Xi, Enyu Zhou, Senjie Jin, Yujiong Shen, Dingwei Zhu, Yi Dong, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.AI
TL;DR: Proposes textual representation of fault tree images for LLM processing, creates benchmark for multi-turn dialogue systems in complex maintenance environments, and trains model to simulate user behavior with error recovery capabilities.
Details
Motivation: Fault trees stored as images cannot be directly processed by large language models for maintenance assistance. Need to enable LLMs to help track and analyze malfunctions using fault tree information.
Method: 1) Develop novel textual representation of fault trees for LLM processing; 2) Construct benchmark with 3130 entries averaging 40.75 turns per entry for multi-turn dialogue evaluation; 3) Train end-to-end model to generate vague information reflecting user behavior; 4) Introduce long-range rollback and recovery procedures to simulate user error scenarios.
Result: Created comprehensive benchmark for evaluating model capabilities in malfunction localization. Gemini 2.5 Pro achieved the best performance on the benchmark. Model can simulate realistic user interactions including error scenarios.
Conclusion: Textual representation enables LLMs to process fault tree images for maintenance assistance. The benchmark and simulation methods provide robust evaluation of models’ integrated capabilities in task tracking and error recovery in complex environments.
Abstract: In the maintenance of complex systems, fault trees are used to locate problems and provide targeted solutions. To enable fault trees stored as images to be directly processed by large language models, which can assist in tracking and analyzing malfunctions, we propose a novel textual representation of fault trees. Building on it, we construct a benchmark for multi-turn dialogue systems that emphasizes robust interaction in complex environments, evaluating a model’s ability to assist in malfunction localization, which contains $3130$ entries and $40.75$ turns per entry on average. We train an end-to-end model to generate vague information to reflect user behavior and introduce long-range rollback and recovery procedures to simulate user error scenarios, enabling assessment of a model’s integrated capabilities in task tracking and error recovery, and Gemini 2.5 Pro achieves the best performance.
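One plausible way to serialize a fault tree into LLM-readable text is an indented outline with gate labels; this is an illustrative sketch only (the paper does not specify its representation, and the example tree is invented):

```python
def fault_tree_to_text(node, depth=0):
    """Render a nested fault tree as indented text with gate labels."""
    indent = "  " * depth
    gate = node.get("gate", "")
    line = f"{indent}- {node['event']}" + (f" [{gate}]" if gate else "")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(fault_tree_to_text(child, depth + 1))
    return lines

# Hypothetical fault tree: OR/AND gates over basic failure events.
tree = {
    "event": "Engine fails to start", "gate": "OR",
    "children": [
        {"event": "No fuel delivery"},
        {"event": "Ignition failure", "gate": "AND",
         "children": [{"event": "Dead battery"},
                      {"event": "Faulty spark plug"}]},
    ],
}
print("\n".join(fault_tree_to_text(tree)))
```

The indentation encodes the tree's hierarchy and the bracketed labels encode gate logic, both of which a text-only LLM can parse directly.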
[365] On the use of Aggregation Operators to improve Human Identification using Dental Records
Antonio D. Villegas-Yeguas, Guillermo R-García, Tzipi Kahana, Jorge Pinares Toledo, Esi Sharon, Oscar Ibañez, Oscar Cordón
Main category: cs.AI
TL;DR: The paper proposes explainable aggregation methods for comparing dental records in forensic identification, using data-driven lexicographical orders, fuzzy logic, and white-box machine learning to improve ranking accuracy while maintaining interpretability.
Details
Motivation: Current automatic methods for odontogram comparison either use overly simple techniques that don't fully utilize comparison information, or are black-box approaches without peer-reviewed validation. There's a need for transparent, expert-validatable aggregation mechanisms that improve upon state-of-the-art while maintaining explainability.
Method: Three aggregation approaches are studied: 1) data-driven lexicographical order-based aggregations, 2) fuzzy logic aggregation methods, and 3) white-box machine learning techniques as aggregation mechanisms. These use a state-of-the-art codification based on seven different criteria for comparing dental records.
Result: White-box machine learning techniques achieved average rankings from 2.02 to 2.21, significantly improving upon the state-of-the-art average ranking of 3.91. The methods were validated on 215 forensic cases from two different populations.
Conclusion: White-box machine learning aggregation models can substantially improve dental record comparison performance without compromising explainability and interpretability, making them suitable for forensic applications where expert validation is crucial.
Abstract: The comparison of dental records is a standardized technique in forensic dentistry used to speed up the identification of individuals in multiple-comparison scenarios. Specifically, the odontogram comparison is a procedure to compute criteria that will be used to perform a ranking. State-of-the-art automatic methods either make use of simple techniques, without utilizing the full potential of the information obtained from a comparison, or their internal behavior is not known due to the lack of peer-reviewed publications. This work aims to design aggregation mechanisms to automatically compare pairs of dental records that can be understood and validated by experts, improving the current methods. To do so, we introduce different aggregation approaches using the state-of-the-art codification, based on seven different criteria. In particular, we study the performance of i) data-driven lexicographical order-based aggregations, ii) well-known fuzzy logic aggregation methods and iii) machine learning techniques as aggregation mechanisms. To validate our proposals, 215 forensic cases from two different populations have been used. The results obtained show that white-box machine learning techniques used as aggregation models (average ranking from 2.02 to 2.21) are able to improve on the state of the art (average ranking of 3.91) without compromising the explainability and interpretability of the method.
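The lexicographical-order idea can be sketched as ranking candidate records by a priority-ordered tuple of comparison criteria (the criteria names, their ordering, and the candidate records below are hypothetical, not the paper's actual codification):

```python
def lexicographic_rank(candidates, criteria_order):
    """Sort candidate records by criteria in decreasing priority;
    ties on a criterion fall through to the next one."""
    def key(c):
        # Negate so higher criterion values rank first.
        return tuple(-c["criteria"][name] for name in criteria_order)
    return sorted(candidates, key=key)

# Hypothetical ante-mortem candidates with per-criterion scores.
candidates = [
    {"id": "AM-1", "criteria": {"matches": 5, "compatible": 2, "mismatches": -1}},
    {"id": "AM-2", "criteria": {"matches": 5, "compatible": 3, "mismatches": -2}},
    {"id": "AM-3", "criteria": {"matches": 4, "compatible": 6, "mismatches": 0}},
]
ranked = lexicographic_rank(candidates, ["matches", "compatible", "mismatches"])
```

Because Python compares tuples element by element, earlier criteria strictly dominate later ones, which is exactly what makes the aggregation transparent to a forensic expert.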
[366] Can Large Language Models Reason and Optimize Under Constraints?
Fabien Bernier, Salah Ghamizi, Pantelis Dogoulis, Maxime Cordy
Main category: cs.AI
TL;DR: LLMs struggle with abstraction and optimization problems under constraints like Optimal Power Flow, revealing critical gaps in structured reasoning capabilities.
Details
Motivation: To investigate whether Large Language Models can reason and optimize under physical and operational constraints of real-world problems like Optimal Power Flow, which requires fundamental skills such as reasoning, structured input handling, arithmetic, and constrained optimization.
Method: Introduced a challenging evaluation setup for Optimal Power Flow problems that tests LLMs’ abilities in reasoning, structured input handling, arithmetic, and constrained optimization. Evaluated state-of-the-art LLMs on this benchmark.
Result: State-of-the-art LLMs fail in most tasks, and reasoning LLMs still fail in the most complex settings. The evaluation reveals critical gaps in LLMs’ ability to handle structured reasoning under constraints.
Conclusion: LLMs have significant limitations in solving abstraction and optimization problems with constraints. The work provides a rigorous testing environment for developing more capable LLM assistants that can tackle real-world power grid optimization problems.
Abstract: Large Language Models (LLMs) have demonstrated great capabilities across diverse natural language tasks; yet their ability to solve abstraction and optimization problems with constraints remains scarcely explored. In this paper, we investigate whether LLMs can reason and optimize under the physical and operational constraints of the Optimal Power Flow (OPF) problem. We introduce a challenging evaluation setup that requires a set of fundamental skills such as reasoning, structured input handling, arithmetic, and constrained optimization. Our evaluation reveals that SoTA LLMs fail in most of the tasks, and that reasoning LLMs still fail in the most complex settings. Our findings highlight critical gaps in LLMs’ ability to handle structured reasoning under constraints, and this work provides a rigorous testing environment for developing more capable LLM assistants that can tackle real-world power grid optimization problems.
[367] Minibal: Balanced Game-Playing Without Opponent Modeling
Quentin Cohen-Solal, Tristan Cazenave
Main category: cs.AI
TL;DR: Minibal: A Minimax variant for balanced AI play in board games that avoids dominating human opponents while maintaining challenge.
Details
Motivation: Current superhuman game AI agents like AlphaZero overwhelm human players, offering little enjoyment or educational value. There's a need for AI that can provide balanced, engaging play suitable for human-AI interaction in both entertainment and serious games.
Method: Introduces Minibal (Minimize & Balance), a variant of Minimax specifically designed for balanced play. Proposes several modifications of the Unbounded Minimax algorithm explicitly aimed at discovering balanced strategies that challenge without dominating.
Result: Experiments across seven board games show one variant consistently achieves the most balanced play, with average outcomes close to perfect balance.
Conclusion: Minibal establishes a promising foundation for designing AI agents that are both challenging and engaging, suitable for both entertainment and serious games where human-AI interaction is important.
Abstract: Recent advances in game AI, such as AlphaZero and Athénan, have achieved superhuman performance across a wide range of board games. While highly powerful, these agents are ill-suited for human-AI interaction, as they consistently overwhelm human players, offering little enjoyment and limited educational value. This paper addresses the problem of balanced play, in which an agent challenges its opponent without either dominating or conceding. We introduce Minibal (Minimize & Balance), a variant of Minimax specifically designed for balanced play. Building on this concept, we propose several modifications of the Unbounded Minimax algorithm explicitly aimed at discovering balanced strategies. Experiments conducted across seven board games demonstrate that one variant consistently achieves the most balanced play, with average outcomes close to perfect balance. These results establish Minibal as a promising foundation for designing AI agents that are both challenging and engaging, suitable for both entertainment and serious games.
[368] MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models
Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai, Yuan Xue
Main category: cs.AI
TL;DR: MedCausalX is a framework that enhances medical vision-language models with explicit causal reasoning mechanisms to improve diagnostic reliability and reduce spurious correlations.
Details
Motivation: Existing medical chain-of-thought models lack explicit causal reasoning mechanisms, making them vulnerable to spurious correlations and limiting clinical reliability. The paper identifies three core challenges: adaptive causal correction triggering, high-quality causal-spurious contrastive sample construction, and maintaining causal consistency across reasoning trajectories.
Method: Proposes MedCausalX with three key components: 1) CRMed dataset with fine-grained anatomical annotations, structured causal reasoning chains, and counterfactual variants; 2) Two-stage adaptive reflection architecture using ⟨causal⟩ and ⟨verify⟩ tokens for autonomous causal analysis; 3) Trajectory-level causal correction objective optimized through error-attributed reinforcement learning.
Result: MedCausalX consistently outperforms state-of-the-art methods on multiple benchmarks, improving diagnostic consistency by +5.4 points, reducing hallucination by over 10 points, and achieving top spatial grounding IoU.
Conclusion: The framework successfully integrates explicit causal reasoning into medical VLMs, setting a new standard for causally grounded medical reasoning by addressing spurious correlations and improving clinical reliability.
Abstract: Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reasoning: how to adaptively trigger causal correction, construct high-quality causal-spurious contrastive samples, and maintain causal consistency across reasoning trajectories. To address these challenges, we propose MedCausalX, an end-to-end framework that explicitly models causal reasoning chains in medical VLMs. We first introduce the CRMed dataset providing fine-grained anatomical annotations, structured causal reasoning chains, and counterfactual variants that guide the learning of causal relationships beyond superficial correlations. Building upon CRMed, MedCausalX employs a two-stage adaptive reflection architecture equipped with $\langle$causal$\rangle$ and $\langle$verify$\rangle$ tokens, enabling the model to autonomously determine when and how to perform causal analysis and verification. Finally, a trajectory-level causal correction objective optimized through error-attributed reinforcement learning refines the reasoning chain, allowing the model to distinguish genuine causal dependencies from shortcut associations. Extensive experiments on multiple benchmarks show that MedCausalX consistently outperforms state-of-the-art methods, improving diagnostic consistency by +5.4 points, reducing hallucination by over 10 points, and attaining top spatial grounding IoU, thereby setting a new standard for causally grounded medical reasoning.
[369] Between Rules and Reality: On the Context Sensitivity of LLM Moral Judgment
Adrian Sauter, Mona Schirmer
Main category: cs.AI
TL;DR: LLMs show context-sensitive moral decision-making that differs from human patterns, with systematic contextual variations affecting model judgments differently than humans.
Details
Motivation: Current research on LLM morality focuses on fixed scenarios, but human moral decisions heavily depend on context. The paper aims to address this gap by studying how contextual variations affect LLM moral judgments.
Method: Introduces Contextual MoralChoice dataset with systematic contextual variations (consequentialist, emotional, relational) known to shift human judgment. Evaluates 22 LLMs and compares with human survey data. Uses activation steering approach to control contextual sensitivity.
Result: Nearly all models are context-sensitive, shifting toward rule-violating behavior. Models and humans are most triggered by different contextual variations. Alignment with human judgments in base cases doesn’t guarantee alignment in contextual sensitivity. Activation steering can reliably increase or decrease contextual sensitivity.
Conclusion: LLMs exhibit systematic contextual sensitivity in moral decision-making that differs from human patterns, raising concerns about alignment. The activation steering approach offers a way to control this sensitivity.
Abstract: A human’s moral decision depends heavily on the context. Yet research on LLM morality has largely studied fixed scenarios. We address this gap by introducing Contextual MoralChoice, a dataset of moral dilemmas with systematic contextual variations known from moral psychology to shift human judgment: consequentialist, emotional, and relational. Evaluating 22 LLMs, we find that nearly all models are context-sensitive, shifting their judgments toward rule-violating behavior. Comparing with a human survey, we find that models and humans are most triggered by different contextual variations, and that a model aligned with human judgments in the base case is not necessarily aligned in its contextual sensitivity. This raises the question of controlling contextual sensitivity, which we address with an activation steering approach that can reliably increase or decrease a model’s contextual sensitivity.
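Activation steering, in its simplest form, adds a scaled direction vector to a layer's hidden activations; a toy NumPy sketch of the mechanism (the direction below is computed from synthetic data, not from the paper's models, and the difference-of-means construction is an assumption):

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift activations along a steering direction.
    alpha > 0 amplifies the targeted behaviour, alpha < 0 suppresses it."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Hypothetical: estimate the direction as the difference of mean hidden
# states between context-sensitive and context-insensitive responses.
sensitive = np.array([[1.0, 0.2], [0.8, 0.4]])
insensitive = np.array([[0.1, 0.9], [0.3, 0.7]])
direction = sensitive.mean(axis=0) - insensitive.mean(axis=0)
```

Flipping the sign of `alpha` moves activations the opposite way along the same axis, which is what lets a single direction both increase and decrease a model's contextual sensitivity.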
[370] Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models
Massimiliano Pappa, Luca Romani, Valentino Sacco, Alessio Palma, Stéphane Lathuilière, Fabio Galasso, Xavier Alameda-Pineda, Indro Spinelli
Main category: cs.AI
TL;DR: DILLO is a fast steering layer that uses language models to predict action outcomes from latent states, bypassing visual simulation for 14x speedup while improving task success rates.
Details
Motivation: Current world models for safety-critical agents rely on visual simulation which causes prohibitive latencies (several seconds per step). The authors challenge the assumption that visual processing is necessary for failure prevention.
Method: DILLO uses cross-modal distillation: a Vision Language Model teacher annotates offline trajectories, and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path that bypasses visual generation.
Result: Achieves 14x speedup over visual simulation baselines. Improves episode success rate by up to 15 percentage points (9.3 pp average) across MetaWorld and LIBERO tasks.
Conclusion: A policy’s latent state combined with planned actions encodes sufficient information for outcome prediction, making visual simulation redundant for failure prevention. Text-based inference enables fast proactive foresight.
Abstract: Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy’s latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from “simulate-then-act” to “describe-then-act.” DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.
[371] SAiW: Source-Attributable Invisible Watermarking for Proactive Deepfake Defense
Bibek Das, Chandranath Adak, Soumi Chattopadhyay, Zahid Akhtar, Soumya Dutta
Main category: cs.AI
TL;DR: SAiW is a source-attributed invisible watermarking framework for proactive deepfake defense that embeds traceable source identity into media at creation time, enabling both automated verification and forensic source attribution.
Details
Motivation: Existing deepfake detection methods are reactive and fail to generalize across evolving generation techniques, creating a need for proactive mechanisms that secure media authenticity at the time of creation.
Method: Formulates watermark embedding as source-conditioned representation learning using feature-wise linear modulation to inject source identity, with perceptual guidance from human visual system priors to maintain imperceptibility, and a dual-purpose forensic decoder for watermark reconstruction and source attribution.
Result: Achieves high perceptual quality while maintaining strong robustness against compression, filtering, noise, geometric transformations, and adversarial perturbations across multiple deepfake datasets.
Conclusion: SAiW provides a scalable foundation for proactive deepfake defense and trustworthy media provenance by binding digital media to its origin through invisible yet verifiable markers.
Abstract: Deepfakes generated by modern generative models pose a serious threat to information integrity, digital identity, and public trust. Existing detection methods are largely reactive, attempting to identify manipulations after they occur and often failing to generalize across evolving generation techniques. This motivates the need for proactive mechanisms that secure media authenticity at the time of creation. In this work, we introduce SAiW, a Source-Attributed Invisible watermarking Framework for proactive deepfake defense and media provenance verification. Unlike conventional watermarking methods that treat watermark payloads as generic signals, SAiW formulates watermark embedding as a source-conditioned representation learning problem, where watermark identity encodes the originating source and modulates the embedding process to produce discriminative and traceable signatures. The framework integrates feature-wise linear modulation to inject source identity into the embedding network, enabling scalable multi-source watermark generation. A perceptual guidance module derived from human visual system priors ensures that watermark perturbations remain visually imperceptible while maintaining robustness. In addition, a dual-purpose forensic decoder simultaneously reconstructs the embedded watermark and performs source attribution, providing both automated verification and interpretable forensic evidence. Extensive experiments across multiple deepfake datasets demonstrate that SAiW achieves high perceptual quality while maintaining strong robustness against compression, filtering, noise, geometric transformations, and adversarial perturbations. By binding digital media to its origin through invisible yet verifiable markers, SAiW enables reliable authentication and source attribution, providing a scalable foundation for proactive deepfake defense and trustworthy media provenance.
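Feature-wise linear modulation (FiLM), the mechanism SAiW uses to inject source identity, applies a per-channel affine transform conditioned on the source; a minimal sketch (the channel counts, the per-source parameter table, and the random features are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, gamma, beta):
    """FiLM: per-channel scale and shift conditioned on the source."""
    return gamma * features + beta

# Hypothetical: each source id indexes its own (gamma, beta) pair; in the
# real system these would come from a learned conditioning network.
n_sources, n_channels = 4, 8
gammas = rng.normal(1.0, 0.1, size=(n_sources, n_channels))
betas = rng.normal(0.0, 0.1, size=(n_sources, n_channels))

feat = rng.normal(size=(2, n_channels))  # batch of embedding features
out_src0 = film(feat, gammas[0], betas[0])
out_src1 = film(feat, gammas[1], betas[1])
```

The same input features produce a different modulated output per source, which is what makes the embedded watermark signatures discriminative and traceable back to their origin.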
[372] PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen, Tong Xu
Main category: cs.AI
TL;DR: PERMA is a benchmark for evaluating LLM memory systems on maintaining persona consistency over time through temporally ordered interactions with text variability and linguistic alignment, moving beyond simple preference recall.
Details
Motivation: Current evaluations of LLM memory systems focus on needle-in-a-haystack retrieval in static settings, ignoring how real-world preferences evolve gradually across noisy interactions. There's a need for benchmarks that test persona consistency over time with realistic user behavior patterns.
Method: Created PERMA benchmark with temporally ordered interaction events across multiple sessions/domains, incorporating text variability and linguistic alignment to simulate erratic user inputs. Includes multiple-choice and interactive tasks to probe persona understanding along interaction timelines.
Result: Advanced memory systems linking related interactions can extract more precise preferences and reduce token consumption compared to traditional semantic retrieval, but still struggle with maintaining coherent persona across temporal depth and cross-domain interference.
Conclusion: PERMA reveals limitations in current memory systems for personalized agents and highlights the need for more robust memory management that can handle evolving preferences in noisy, multi-domain contexts over time.
Abstract: Empowering large language models with long-term memory is crucial for building agents that adapt to users’ evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model’s understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems can extract more precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.
[373] MemCollab: Cross-Agent Memory Collaboration via Contrastive Trajectory Distillation
Yurui Chang, Yiran Wu, Qingyun Wu, Lu Lin
Main category: cs.AI
TL;DR: MemCollab: A collaborative memory framework that creates agent-agnostic memory by contrasting reasoning trajectories from different agents on the same tasks, enabling shared memory across heterogeneous LLM-based agents.
Details
Motivation: Existing memory systems are per-agent and tightly coupled to individual models' reasoning styles, limiting knowledge sharing in heterogeneous agent deployments. The paper addresses whether a single memory system can be shared across different models.
Method: Proposes MemCollab which constructs agent-agnostic memory by contrasting reasoning trajectories from different agents on the same tasks. This contrastive process distills abstract reasoning constraints capturing shared task-level invariants while suppressing agent-specific artifacts. Also introduces task-aware retrieval that conditions memory access on task category.
Result: Experiments on mathematical reasoning and code generation benchmarks show MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including cross-modal-family settings.
Conclusion: Collaboratively constructed memory can function as a shared reasoning resource for diverse LLM-based agents, enabling effective knowledge transfer across heterogeneous models.
Abstract: Large language model (LLM)-based agents rely on memory mechanisms to reuse knowledge from past problem-solving experiences. Existing approaches typically construct memory in a per-agent manner, tightly coupling stored knowledge to a single model’s reasoning style. In modern deployments with heterogeneous agents, a natural question arises: can a single memory system be shared across different models? We found that naively transferring memory between agents often degrades performance, as such memory entangles task-relevant knowledge with agent-specific biases. To address this challenge, we propose MemCollab, a collaborative memory framework that constructs agent-agnostic memory by contrasting reasoning trajectories generated by different agents on the same task. This contrastive process distills abstract reasoning constraints that capture shared task-level invariants while suppressing agent-specific artifacts. We further introduce a task-aware retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are used at inference time. Experiments on mathematical reasoning and code generation benchmarks demonstrate that MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including cross-modal-family settings. Our results show that the collaboratively constructed memory can function as a shared reasoning resource for diverse LLM-based agents.
[374] Online library learning in human visual puzzle solving
Pinzhe Zhao, Emanuele Sansone, Marta Kryven, Bonan Zhao
Main category: cs.AI
TL;DR: Humans learn to create reusable abstractions (helpers) in visual puzzle tasks, becoming more efficient with experience, and computational modeling shows this aligns with program induction with library learning.
Details
Motivation: To understand how humans form efficient reusable abstractions when learning complex tasks despite uncertainty about future applications, particularly in visual problem-solving contexts.
Method: Online experiment where participants solved visual puzzles of increasing difficulty, creating and reusing helpers. Computational modeling using program induction with library learning to analyze human decision times and operations.
Result: Participants initially created many helpers favoring completeness, then became more selective and efficient with experience. Helper access enabled solving otherwise difficult puzzles. Human effort metrics correlated with search space estimated by library learning models, not raw program length.
Conclusion: Online library learning is a core mechanism in human problem solving, allowing flexible building, refinement, and reuse of abstractions as task demands grow.
Abstract: When learning a novel complex task, people often form efficient reusable abstractions that simplify future work, despite uncertainty about the future. We study this process in a visual puzzle task where participants define and reuse helpers – intermediate constructions that capture repeating structure. In an online experiment, participants solved puzzles of increasing difficulty. Early on, they created many helpers, favouring completeness over efficiency. With experience, helper use became more selective and efficient, reflecting sensitivity to reuse and cost. Access to helpers enabled participants to solve puzzles that were otherwise difficult or impossible. Computational modelling shows that human decision times and number of operations used to complete a puzzle increase with search space estimated by a program induction model with library learning. In contrast, raw program length predicts failure but not effort. Together, these results point to online library learning as a core mechanism in human problem solving, allowing people to flexibly build, refine, and reuse abstractions as task demands grow.
[375] LLM Olympiad: Why Model Evaluation Needs a Sealed Exam
Jan Christian Blaise Cruz, Alham Fikri Aji
Main category: cs.AI
TL;DR: Proposes Olympiad-style evaluation events with sealed problems, frozen submissions, and standardized harness to address benchmark-chasing issues in LLM evaluation
Details
Motivation: Current NLP benchmarks are increasingly misleading due to benchmark-chasing, hidden evaluation choices, and test data exposure, reducing trust in reported progress.
Method: Designs a competition format where problems are sealed until evaluation, submissions are frozen in advance, and all entries run through a single standardized evaluation harness.
Result: Proposes a framework that makes strong performance harder to manufacture and easier to trust, with full task sets and evaluation code released after scoring for reproducibility
Conclusion: Olympiad-style evaluation events can complement traditional benchmarks by increasing transparency, reducing gaming, and enabling better community learning from results
Abstract: Benchmarks and leaderboards are how NLP most often communicates progress, but in the LLM era they are increasingly easy to misread. Scores can reflect benchmark-chasing, hidden evaluation choices, or accidental exposure to test content – not just broad capability. Closed benchmarks delay some of these issues, but reduce transparency and make it harder for the community to learn from results. We argue for a complementary practice: an Olympiad-style evaluation event where problems are sealed until evaluation, submissions are frozen in advance, and all entries run through one standardized harness. After scoring, the full task set and evaluation code are released so results can be reproduced and audited. This design aims to make strong performance harder to "manufacture" and easier to trust.
[376] RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
Long Mai
Main category: cs.AI
TL;DR: RelayS2S is a hybrid architecture for real-time spoken dialogue systems that combines fast S2S response generation with high-quality LLM-based continuation to balance latency and response quality.
Details
Motivation: Real-time spoken dialogue systems face a fundamental trade-off between latency and response quality. End-to-end speech-to-speech models offer low latency but weaker semantics, while cascaded ASR->LLM pipelines provide better quality at the cost of higher latency that grows with model size.
Method: RelayS2S runs two parallel paths upon turn detection: 1) a fast path using a duplex S2S model to speculatively draft a short response prefix streamed immediately to TTS, and 2) a slow path using a cascaded ASR->LLM pipeline to generate higher-quality continuation conditioned on the committed prefix. A lightweight learned verifier gates the handoff between paths.
Result: RelayS2S achieves P90 onset latency comparable to S2S models while retaining 99% of cascaded response quality in average score. Benefits increase as the slow-path model scales, and the system serves as a lightweight drop-in addition to existing cascaded pipelines.
Conclusion: RelayS2S effectively balances latency and quality in real-time spoken dialogue systems by combining immediate S2S response with high-quality LLM continuation, requiring no architectural modifications to existing components.
Abstract: Real-time spoken dialogue systems face a fundamental tension between latency and response quality. End-to-end speech-to-speech (S2S) models respond immediately and naturally handle turn-taking, backchanneling, and interruption, but produce semantically weaker outputs. Cascaded pipelines (ASR -> LLM) deliver stronger responses at the cost of latency that grows with model size. We present RelayS2S, a hybrid architecture that runs two paths in parallel upon turn detection. The fast path – a duplex S2S model – speculatively drafts a short response prefix that is streamed immediately to TTS for low-latency audio onset, while continuing to monitor live audio events. The slow path – a cascaded ASR -> LLM pipeline – generates a higher-quality continuation conditioned on the committed prefix, producing a seamless utterance. A lightweight learned verifier gates the handoff, committing the prefix when appropriate or falling back gracefully to the slow path alone. Experiments show that RelayS2S achieves P90 onset latency comparable to the S2S model while retaining 99% cascaded response quality in average score, with benefits growing as the slow-path model scales. Because the prefix handoff requires no architectural modification to either component, RelayS2S serves as a lightweight, drop-in addition to existing cascaded pipelines. Our code and data are publicly available at: https://github.com/mailong25/relays2s
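The verifier-gated handoff reduces to simple control flow once the components exist. Below is a minimal sketch under invented stand-ins: the lambdas play the roles of the duplex S2S model, the ASR->LLM pipeline, and the learned verifier, none of which are specified in code by the paper.

```python
def relay_response(fast_draft, slow_generate, verify, threshold=0.5):
    """Toy sketch of RelayS2S-style gating: commit the fast-path prefix when
    the verifier accepts it and splice on the slow-path continuation;
    otherwise fall back gracefully to the slow path alone."""
    prefix = fast_draft()
    if verify(prefix) >= threshold:
        return prefix + slow_generate(prefix)   # committed prefix + continuation
    return slow_generate("")                    # fallback: slow path alone

# Stand-in components (a real system would stream audio, not strings).
fast = lambda: "Sure, let me check, "
slow = lambda prefix: ("the next train leaves at 9:15."
                       if prefix else "The next train leaves at 9:15.")
good_verifier = lambda p: 0.9   # verifier confident the prefix fits
bad_verifier = lambda p: 0.1    # verifier rejects the prefix

committed = relay_response(fast, slow, good_verifier)
fallback = relay_response(fast, slow, bad_verifier)
```

The point of the design is that the slow path conditions on the already-spoken prefix, so the user hears one seamless utterance either way.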
[377] Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies
Hanzhong Zhang, Siyang Song, Jindong Wang
Main category: cs.AI
TL;DR: A framework combining computational virtual ethnography and socio-cognitive profiling to study how LLM agents form stable stances and negotiate identities during interventions, revealing innate progressive biases and trust-action decoupling in advanced models.
Details
Motivation: To understand how large language models form stable stances and negotiate identities during complex interventions, overcoming limitations of static evaluations by studying dynamic social behaviors in multiagent communities.
Method: Mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling, embedding human researchers in generative multiagent communities, with controlled discursive interventions and three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD).
Result: Agents exhibit endogenous progressive stances overriding preset identities (IVB > 0). Rational persuasion aligned with stances shifts 90% of neutral agents while maintaining high trust. Conflicting emotional provocations induce 40.0% TAD rate in advanced models (hypocritical stance changes despite low trust), while smaller models maintain 0% TAD rate. Agents actively dismantle power hierarchies and reconstruct community boundaries through language.
Conclusion: The study exposes fragility of static prompt engineering and provides methodological foundation for dynamic alignment in human-agent hybrid societies, revealing complex socio-cognitive behaviors in LLM agents during interventions.
Abstract: While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling. By embedding human researchers into generative multiagent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models contrastingly maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self-organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human-agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE-Endogenous-Stances
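One plausible operationalization of the Trust-Action Decoupling rate (the paper's exact formula may differ) counts stance shifts that co-occur with low self-reported trust. The episode tuples and the 0.5 threshold below are invented for illustration:

```python
def tad_rate(records, trust_threshold=0.5):
    """Illustrative Trust-Action Decoupling rate: the fraction of episodes
    in which an agent changed its stance while reporting trust below the
    threshold (the 'hypocritical' shifts described in the abstract).
    Each record: (stance_before, stance_after, reported_trust)."""
    decoupled = sum(
        1 for before, after, trust in records
        if before != after and trust < trust_threshold
    )
    return decoupled / len(records)

episodes = [
    ("neutral", "pro", 0.9),      # persuaded with high trust: not decoupled
    ("neutral", "pro", 0.2),      # shifted despite low trust: decoupled
    ("con", "con", 0.1),          # no shift: not decoupled
    ("pro", "con", 0.3),          # shifted despite low trust: decoupled
    ("neutral", "neutral", 0.8),  # no shift: not decoupled
]
rate = tad_rate(episodes)
```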
[378] Bilevel Autoresearch: Meta-Autoresearching Itself
Yaonan Qu, Meng Lu
Main category: cs.AI
TL;DR: Bilevel Autoresearch: A framework where an outer loop meta-optimizes an inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime, achieving 5x improvement on GPT pretraining benchmark.
Details
Motivation: To explore whether LLMs can autonomously improve autoresearch systems themselves, rather than requiring human intervention to identify bottlenecks and write new code for optimization.
Method: A bilevel framework with an outer loop that meta-optimizes the inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime. Both loops use the same LLM, without needing a stronger model at the meta level.
Result: Achieved 5x improvement over standard inner loop alone on Karpathy’s GPT pretraining benchmark (-0.045 vs -0.009 val_bpb). Outer loop autonomously discovered mechanisms from combinatorial optimization, multi-armed bandits, and design of experiments without human specification.
Conclusion: Autoresearch can meta-autoresearch itself, suggesting it can meta-autoresearch anything with a measurable objective. The framework enables autonomous discovery of search mechanisms that break deterministic patterns and force exploration of directions LLM priors avoid.
Abstract: If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop. Every existing autoresearch system – from Karpathy’s single-track loop to AutoResearchClaw’s multi-batch extension and EvoScientist’s persistent memory – was improved by a human who read the code, identified a bottleneck, and wrote new code. We ask whether an LLM can do the same, autonomously. We present Bilevel Autoresearch, a bilevel framework where an outer loop meta-optimizes the inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime. The inner loop optimizes the task; the outer loop optimizes how the inner loop searches. Both loops use the same LLM – no stronger model is needed at the meta level. On Karpathy’s GPT pretraining benchmark, the meta-autoresearch outer loop achieves a 5x improvement over the standard inner loop alone (-0.045 vs. -0.009 val_bpb), while parameter-level adjustment without mechanism change yields no reliable gain. The outer loop autonomously discovers mechanisms from combinatorial optimization, multi-armed bandits, and design of experiments – without human specification of which domains to explore. These mechanisms succeed by breaking the inner loop’s deterministic search patterns, forcing exploration of directions the LLM’s priors systematically avoid. The core principle is simple: if autoresearch can meta-autoresearch itself, it can, in principle, meta-autoresearch anything with a measurable objective.
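The mechanic of "injecting new search mechanisms as Python code at runtime" can be sketched minimally. Everything here is illustrative: the 1-D objective stands in for the pretraining benchmark, and a hard-coded source string stands in for LLM-generated mechanism code.

```python
import random

def inner_loop(objective, propose, steps=50, seed=0):
    """Inner autoresearch loop: repeatedly propose a candidate, keep the best."""
    rng = random.Random(seed)
    best_x, best_val = None, float("-inf")
    for _ in range(steps):
        x = propose(rng, best_x)
        val = objective(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_val

# Toy task: maximize a 1-D objective (peak at x = 3, maximum value 0).
objective = lambda x: -(x - 3.0) ** 2

# Baseline search mechanism: blind uniform sampling.
blind = lambda rng, best: rng.uniform(-10, 10)

# The outer loop "injects" a new mechanism as source code at runtime; here
# the string is fixed, standing in for code an LLM would write and emit.
mechanism_src = (
    "def refine(rng, best):\n"
    "    if best is None:\n"
    "        return rng.uniform(-10, 10)\n"
    "    return best + rng.gauss(0, 0.5)\n"   # local refinement around the incumbent
)
ns = {}
exec(mechanism_src, ns)   # runtime injection of the generated mechanism
refine = ns["refine"]

base_score = inner_loop(objective, blind)
meta_score = inner_loop(objective, refine)
```

The outer loop's job in the real system is to propose, score, and keep such mechanism code based on the inner loop's measured performance; only the injection step is shown here.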
[379] Mecha-nudges for Machines
Giulio Frey, Kawin Ethayarajh
Main category: cs.AI
TL;DR: Paper introduces “mecha-nudges” - subtle presentation changes that influence AI agents without harming human decision-making, combining Bayesian persuasion with V-usable information to measure interventions across contexts and models.
Details
Motivation: As AI agents increasingly make decisions in human environments, choice presentations may be optimized for machines as well as people. The paper aims to formalize how subtle presentation changes can systematically influence AI agents without degrading human decision environments.
Method: Combines the Bayesian persuasion framework with V-usable information (a generalization of Shannon information that is observer-relative) to create a common scale (bits of usable information) for comparing interventions, contexts, and models. Applies the framework to analyze product listings on the Etsy marketplace.
Result: Finds that following ChatGPT’s release, Etsy listings have significantly more machine-usable information about product selection, consistent with systematic mecha-nudging occurring in practice.
Conclusion: Mecha-nudges represent a new class of interventions that influence AI agents through presentation changes, measurable through the proposed V-usable information framework, with evidence of real-world adoption in digital marketplaces.
Abstract: Nudges are subtle changes to the way choices are presented to human decision-makers (e.g., opt-in vs. opt-out by default) that shift behavior without restricting options or changing incentives. As AI agents increasingly make decisions in the same environments as humans, the presentation of choices may be optimized for machines as well as people. We introduce mecha-nudges: changes to how choices are presented that systematically influence AI agents without degrading the decision environment for humans. To formalize mecha-nudges, we combine the Bayesian persuasion framework with V-usable information, a generalization of Shannon information that is observer-relative. This yields a common scale (bits of usable information) for comparing a wide range of interventions, contexts, and models. Applying our framework to product listings on Etsy – a global marketplace for independent sellers – we find that following ChatGPT’s release, listings have significantly more machine-usable information about product selection, consistent with systematic mecha-nudging.
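For readers unfamiliar with the quantity, the standard definition of V-usable information (which this framework builds on; the paper's own notation may differ) is:

```latex
% \mathcal{V} is a predictive family (e.g., the model class of a given agent);
% f[x] is the predictive distribution over Y that f produces from input x,
% and \varnothing denotes the null (uninformative) input.
H_{\mathcal{V}}(Y \mid \varnothing) = \inf_{f \in \mathcal{V}} \mathbb{E}_{y}\!\left[-\log f[\varnothing](y)\right],
\qquad
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y}\!\left[-\log f[x](y)\right],
\qquad
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X).
```

When \mathcal{V} is unrestricted this recovers Shannon mutual information; restricting \mathcal{V} to a particular observer makes the quantity observer-relative, which is one way to read the paper's common scale of "bits of usable information" for machines versus humans.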
[380] An Accurate and Interpretable Framework for Trustworthy Process Monitoring
Hao Wang, Zhiyu Wang, Yunlong Niu, Zhaoran Liu, Haozhe Li, Yilin Liao, Yuxin Huang, Xinggao Liu
Main category: cs.AI
TL;DR: AttentionMixer improves process monitoring for energy plants by capturing both temporal and variate-wise correlations while filtering spurious correlations for better accuracy and interpretability.
Details
Motivation: Current self-attentive models for energy conversion plant monitoring have two main limitations: they rely on step-wise correlations that lack physically meaningful semantics, and attention matrices contain spurious correlations that obscure meaningful ones, leading to suboptimal accuracy and interpretability.
Method: Proposes AttentionMixer framework with spatial adaptive message passing block to capture variate-wise correlations, coupled with temporal adaptive message passing block through a mixing operator. Also employs sparse message passing regularizer to filter out spurious correlations.
Result: Validated using two real-world datasets from radiation monitoring network for Chinese nuclear power plants, demonstrating improved accuracy and interpretability.
Conclusion: AttentionMixer establishes a trustworthy ECP monitoring framework by improving both accuracy and interpretability of existing methods through multi-faceted representation and spurious correlation filtering.
Abstract: Trustworthy process monitoring seeks to build an accurate and interpretable monitoring framework, which is critical for ensuring the safety of energy conversion plants (ECPs), which operate under extreme working conditions such as high pressure and temperature. Contemporary self-attentive models, however, fall short in this domain for two main reasons. First, they rely on step-wise correlations that fail to capture physically meaningful semantics in ECP logs, resulting in suboptimal accuracy and interpretability. Second, attention matrices are frequently cluttered with spurious correlations that obscure physically meaningful ones, further impeding effective interpretation. To overcome these issues, we propose AttentionMixer, a framework aimed at improving both the accuracy and interpretability of existing methods and establishing a trustworthy ECP monitoring framework. Specifically, to tackle the first issue, we employ a spatial adaptive message passing block to capture variate-wise correlations. This block is coupled with a temporal adaptive message passing block through a mixing operator, yielding a multi-faceted representation of ECP logs accounting for both step-wise and variate-wise correlations. Concurrently, to tackle the second issue, we employ a sparse message passing regularizer to filter out spurious correlations. We validate the efficacy of AttentionMixer using two real-world datasets from the radiation monitoring network for Chinese nuclear power plants.
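The two-block-plus-mixing structure can be illustrated loosely with plain dot-product attention. This is a deliberately simplified sketch: the averaging "mix", the toy log matrix, and the lack of any learned or adaptive weights are all departures from the actual AttentionMixer architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def message_pass(X, axis):
    """One self-attention-style message-passing step over a T x V matrix X
    of process-log readings: axis=0 mixes across time steps (step-wise),
    axis=1 across variates (variate-wise)."""
    if axis == 1:
        X = [list(col) for col in zip(*X)]  # treat variates as rows
    n, d = len(X), len(X[0])
    out = []
    for i in range(n):
        sims = [sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
        w = softmax(sims)  # attention weights from dot-product similarity
        out.append([sum(w[j] * X[j][k] for j in range(n)) for k in range(d)])
    if axis == 1:
        out = [list(row) for row in zip(*out)]  # transpose back to T x V
    return out

def attention_mix(X):
    """Hypothetical mixing operator: average the step-wise and variate-wise
    outputs (the paper's operator and adaptivity are more involved)."""
    t_out = message_pass(X, axis=0)
    v_out = message_pass(X, axis=1)
    return [[0.5 * (a + b) for a, b in zip(tr, vr)] for tr, vr in zip(t_out, v_out)]

logs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # 3 time steps, 2 variates
mixed = attention_mix(logs)
```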
[381] RealCQA-V2: A Diagnostic Benchmark for Structured Visual Entailment over Scientific Charts
Saleem Ahmed, Srirangaraj Setlur, Venu Govindaraju
Main category: cs.AI
TL;DR: RealCQA-V2 benchmark reformulates chart QA as Visual Premise Proving with structured logical entailment over chart-grounded visual predicates, enabling atomic verification of reasoning chains.
Details
Motivation: Existing benchmarks only evaluate final-answer correctness without supporting atomic visual entailment verification of intermediate steps, especially for visual compositional logic in scientific chart understanding where answers depend on deterministically grounded visual semantics.
Method: Introduces RealCQA-V2 benchmark that deconstructs chart questions into manually curated, atomic premises grounded in chart elements (axes, legends, marks, quantitative relations) to form executable reasoning chains rather than free-form textual rationales.
Result: Baseline evaluations across representative LVLMs reveal a consistent local-global reasoning gap: models often verify individual premises correctly while failing to preserve coherence across full chains. Introduces chain-level metrics (AccVPP and DCP) beyond traditional VQA accuracy.
Conclusion: RealCQA-V2 establishes a reproducible benchmark for structured visual entailment over real scientific charts and enables rigorous diagnosis of multimodal reasoning beyond answer-only evaluation.
Abstract: Multimodal reasoning models often produce fluent answers supported by seemingly coherent rationales. Existing benchmarks evaluate only final-answer correctness. They do not support atomic visual entailment verification of intermediate steps, especially visual compositional logic. This limitation is especially acute in scientific chart understanding, where answers depend on deterministically grounded visual semantics such as axes, legends, and quantitative relations. We introduce RealCQA-V2, a large-scale benchmark that reformulates chart question answering as Visual Premise Proving (VPP): a structured logical entailment task over chart-grounded visual predicates. Each question is deconstructed into manually curated, atomic premises grounded in chart elements (axes, legends, marks, and quantitative relations), yielding executable reasoning chains rather than free-form textual rationales. These premises form compositional reasoning chains, enabling verification at the level of individual visual statements and complete reasoning sequences. We introduce chain-level metrics that measure both full logical validity (AccVPP) and partial reasoning progress within failed chains (DCP), extending beyond traditional VQA accuracy. Baseline evaluations across representative LVLMs reveal a consistent local-global reasoning gap: models often verify many individual premises correctly while failing to preserve coherence across the full chain. RealCQA-V2 establishes a reproducible benchmark for structured visual entailment over real scientific charts and enables rigorous diagnosis of multimodal reasoning beyond answer-only evaluation.
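Chain-level scoring of the kind described can be sketched as follows. The definitions here are one plausible reading: AccVPP as the fraction of fully valid chains, and DCP as the mean verified-prefix fraction within failed chains; the paper's exact formulas may differ.

```python
def chain_metrics(chains):
    """Illustrative chain-level scoring. Each chain is a list of booleans:
    whether each atomic premise was verified, in order.
    - acc_vpp: fraction of chains in which every premise holds.
    - dcp: among failed chains, mean fraction of premises verified before
      the first failure (a stand-in for 'partial reasoning progress')."""
    full = [all(c) for c in chains]
    acc_vpp = sum(full) / len(chains)
    failed = [c for c, ok in zip(chains, full) if not ok]
    if not failed:
        return acc_vpp, 1.0
    def prefix_frac(c):
        for i, ok in enumerate(c):
            if not ok:
                return i / len(c)
        return 1.0
    dcp = sum(prefix_frac(c) for c in failed) / len(failed)
    return acc_vpp, dcp

chains = [
    [True, True, True],     # fully valid chain
    [True, False, True],    # fails at the second premise
    [False, False, False],  # fails immediately
]
acc, dcp = chain_metrics(chains)
```

Metrics of this shape make the local-global gap measurable: a model can score well premise-by-premise (high DCP) while rarely completing a whole chain (low AccVPP).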
[382] Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm
Dadi Guo, Tianyi Zhou, Dongrui Liu, Chen Qian, Qihan Ren, Shuai Shao, Zhiyuan Fan, Yi R. Fung, Kun Wang, Linfeng Zhang, Jing Shao
Main category: cs.AI
TL;DR: TRACE framework enables dynamic evolution of agent benchmarks by having agents explore and transform existing tasks into more complex versions with validatable trajectories.
Details
Motivation: Existing agent benchmarks are quickly saturated by new agents, making it difficult to evaluate advancing agent capabilities. There's a need for dynamic, evolving evaluation systems rather than static benchmarks.
Method: Three-stage framework: (1) evolutionary proposal mining through preliminary exploration and divergent thinking, (2) problem formation and free exploration where proposals become feasible problems with recorded execution trajectories, (3) multi-level validation ensuring evolved tasks have validatable and reproducible trajectories.
Result: TRACE consistently enhances task complexity on GAIA benchmark while improving reliability through validatable trajectories. Successfully adapts to reasoning datasets like AIME-2024.
Conclusion: Introduces paradigm shift from static, manually curated benchmarks to dynamic, self-evolving evaluation systems, providing sustainable and challenging evaluation for agent development.
Abstract: Recent advances in large language models (LLMs) and agent system designs have empowered agents with unprecedented levels of capability. However, newly developed agents are rapidly hitting the ceilings of existing agent benchmarks, making it difficult to meet the demands of evaluating agent abilities. To address this problem, we propose the Trajectory-based Validated-by-Reproducing Agent-benchmark Complexity Evolution (TRACE) framework. This framework takes an original task from an existing benchmark and encourages agents to freely explore and evolve it into a new task with higher difficulty while recording validatable agent trajectories. The framework proceeds in three stages: (1) evolutionary proposal mining, which provides task evolution proposals through preliminary exploration and divergent thinking; (2) problem formation and free exploration, where proposals are conceptualized into feasible problem candidates and the agents then explore them freely while recording their execution trajectories; and (3) multi-level validation, which ensures that the evolved tasks are accompanied by validatable and reproducible trajectories. Experiments on the GAIA benchmark demonstrate that the TRACE framework consistently enhances task complexity while improving the reliability of correctness through validatable execution trajectories. In addition, our framework can successfully adapt to and improve reasoning datasets represented by AIME-2024. This work marks a paradigm shift from static, manually curated benchmarks to dynamic, self-evolving evaluation systems, providing a sustainable and challenging runway for agent development.
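The three-stage pipeline can be sketched as a single retry loop. The function names and the lambda stand-ins below are invented for illustration; in the real framework each stage is carried out by agents, and validation checks reproducibility of the recorded trajectory.

```python
def evolve_task(task, propose, explore, validate, max_tries=3):
    """Toy sketch of the three TRACE stages: (1) mine an evolution proposal,
    (2) explore it while recording a trajectory, (3) keep the evolved task
    only if its trajectory validates; otherwise retry, falling back to the
    original task when the budget is exhausted."""
    for _ in range(max_tries):
        proposal = propose(task)                  # stage 1: proposal mining
        new_task, trajectory = explore(proposal)  # stage 2: free exploration
        if validate(new_task, trajectory):        # stage 3: multi-level validation
            return new_task, trajectory
    return task, None  # no validated evolution found

# Stand-in components for illustration.
propose = lambda t: t + " + extra constraint"
explore = lambda p: (p, ["step1", "step2"])       # records a trajectory
validate = lambda t, traj: len(traj) > 0          # trivially accepts non-empty

task, traj = evolve_task("sum the table in file.csv", propose, explore, validate)
```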
[383] BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions
Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, Linheng Han, Edward Alexander, Xintong Zhu, Rui Qin, Ruihan Yu, Yiyao Jin, Feige Zhou, Weihao Zhong, Yun Chen, Hongyu Liu, Chenhao Ma, Fatma Ozcan, Yannis Papakonstantinou, Reynold Cheng
Main category: cs.AI
TL;DR: BIRD-INTERACT is a benchmark for multi-turn text-to-SQL interactions that simulates real-world database assistant challenges with ambiguous queries, execution errors, and evolving requirements.
Details
Motivation: Existing multi-turn text-to-SQL benchmarks treat conversation histories as static context or limit evaluation to read-only operations, failing to reflect production-grade database assistant challenges that require dynamic interaction, error recovery, and knowledge retrieval.
Method: Introduces BIRD-INTERACT benchmark with: (1) comprehensive interaction environment with hierarchical knowledge base, metadata files, and function-driven user simulator; (2) two evaluation settings - conversational protocol (c-Interact) and open-ended agentic setting (a-Interact); (3) challenging task suite covering full CRUD spectrum with executable test cases.
Result: The benchmark is highly challenging - GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. BIRD-INTERACT-FULL contains 600 tasks with up to 11,796 interactions, while BIRD-INTERACT-LITE has 300 tasks with simplified databases for rapid development.
Conclusion: BIRD-INTERACT effectively captures the difficulty of real-world multi-turn text-to-SQL tasks and demonstrates the importance of effective interaction for complex, dynamic database assistant scenarios.
Abstract: Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT’s difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.
[384] BuilderBench: The Building Blocks of Intelligent Agents
Raj Ghugare, Roger Creus Castanyer, Catherine Ji, Kathryn Wantlin, Jin Schofield, Karthik Narasimhan, Benjamin Eysenbach
Main category: cs.AI
TL;DR: BuilderBench is a benchmark for agent pre-training focused on open-ended exploration where agents must learn to build structures using blocks in a physics simulator, testing embodied reasoning through interaction without supervision.
Details
Motivation: Current AI models struggle with novel problems beyond existing data limits. To solve novel problems, agents need exploration and experiential learning skills. The paper aims to accelerate research into agent pre-training through open-ended exploration.
Method: Introduces BuilderBench with (1) a hardware-accelerated simulator of a robotic agent interacting with physical blocks, and (2) a task-suite with 42 diverse target structures testing physics, mathematics, and long-horizon planning. Agents explore without supervision during training, then build unseen structures during evaluation.
Result: The benchmark challenges current algorithms, requiring embodied reasoning not reflected in words but in actions and experimentation. The paper provides a “training wheels” protocol for single structure building and single-file implementations of six algorithms as reference points.
Conclusion: BuilderBench addresses the need for scalable learning mechanisms for agents that learn through interaction, focusing on embodied reasoning and open-ended exploration as key capabilities for solving novel problems beyond data limits.
Abstract: Today’s AI models learn primarily through mimicry and refining, so it is not surprising that they struggle to solve problems beyond the limits set by existing data. To solve novel problems, agents should acquire skills for exploring and learning through experience. Finding a scalable learning mechanism for developing agents that learn through interaction remains a major open problem. In this work, we introduce BuilderBench, a benchmark to accelerate research into agent pre-training that centers open-ended exploration. BuilderBench requires agents to learn how to build any structure using blocks. BuilderBench is equipped with (1) a hardware accelerated simulator of a robotic agent interacting with various physical blocks, and (2) a task-suite with over 42 diverse target structures that are carefully curated to test an understanding of physics, mathematics, and long-horizon planning. During training, agents have to explore and learn general principles about the environment without any external supervision. During evaluation, agents have to build the unseen target structures from the task suite. Solving these tasks requires a sort of embodied reasoning that is not reflected in words but rather in actions, experimenting with different strategies and piecing them together. Our experiments show that many of these tasks challenge the current iteration of algorithms. Hence, we also provide a "training wheels" protocol, in which agents are trained and evaluated to build a single target structure from the task suite. Finally, we provide single-file implementations of six different algorithms as a reference point for researchers.
[385] Operational machine learning for remote spectroscopic detection of CH$_{4}$ point sources
Vít Růžička, Gonzalo Mateo-García, Itziar Irakulis-Loitxate, Juan Emmanuel Johnson, Manuel Montesino San Martín, Anna Allen, Alma Raunak, Carol Castaneda, Luis Guanter, David R. Thompson
Main category: cs.AI
TL;DR: Deep learning system for automated methane point-source detection from satellite imaging spectrometers, deployed operationally in UNEP’s Methane Alert and Response System, processing 25,000+ hyperspectral products and identifying 2,851 methane leaks.
Details
Motivation: Current methane retrieval methods using matched filters produce many false detections requiring manual verification. Need automated, scalable system to handle increasing data volumes from current and future imaging spectrometers for global methane monitoring.
Method: Created large annotated methane plume dataset from three imaging spectrometer missions, compared deep learning model configurations, extended evaluation to full operational granules, used model ensembling to reduce false detections by over 74%.
Result: Operational deployment processed >25,000 hyperspectral products, verified 2,851 distinct methane leaks, sent 834 stakeholder notifications. Model ensembling reduced false detections by over 74%. Demonstrated utility in case studies across multiple countries.
Conclusion: First operational deployment of automated methane point-source detection using spaceborne imaging spectrometers. Critical step toward global AI-assisted methane leak detection system capable of handling dramatically higher data volumes from current and future missions.
Abstract: Mitigating anthropogenic methane sources is one of the most cost-effective levers to slow down global warming. While satellite-based imaging spectrometers, such as EMIT, PRISMA, and EnMAP, can detect these point sources, current methane retrieval methods based on matched filters produce a high number of false detections requiring manual verification. To address this challenge, we deployed an ML system for detecting methane emissions within the Methane Alert and Response System (MARS) of UNEP’s IMEO. This represents the first operational deployment of automated methane point-source detection using spaceborne imaging spectrometers, providing regular global coverage and scalability to future constellations with even higher data volumes. This task required several technical advances. First, we created one of the largest, most diverse, and most global ML-ready datasets to date of annotated methane plumes from three imaging spectrometer missions, and quantitatively compared different deep learning model configurations. Second, we extended prior evaluation methodologies from small, tiled datasets to full granules that are more representative of operational use. This revealed that deep learning models still produce a large number of false detections, a problem we addressed with model ensembling, which reduced false detections by over 74%. During 11 months of operational deployment, our system processed more than 25,000 hyperspectral products, facilitating the verification of 2,851 distinct methane leaks, which resulted in 834 stakeholder notifications. We further demonstrate the model’s utility in verifying mitigation success through case studies in Libya, Argentina, Oman, and Azerbaijan. Our work represents a critical step towards a global AI-assisted methane leak detection system, which is required to process the dramatically higher data volumes expected from current and future imaging spectrometers.
[386] Hybrid Stackelberg Game and Diffusion-based Auction for Two-tier Agentic AI Task Offloading in Internet of Agents
Yue Zhong, Yongju Tong, Jiawen Kang, Minghui Dai, Hong-Ning Dai, Zhou Su, Dusit Niyato
Main category: cs.AI
TL;DR: Two-tier optimization for Internet of Agents: Stackelberg game for ground-level offloading and Double Dutch Auction for aerial resource allocation, solved with diffusion-based DRL.
Details
Motivation: Internet of Agents requires efficient resource allocation for compute-intensive AI services across heterogeneous agents (wireless, mobile, fixed, aerial) with varying mobility and connectivity constraints.
Method: Two-tier approach: 1) Multi-leader multi-follower Stackelberg game where mobile/fixed agents set prices and wireless agents determine offloading ratios; 2) Double Dutch Auction for overloaded fixed agents to access aerial resources via diffusion-based Deep Reinforcement Learning.
Result: Numerical results demonstrate superiority in facilitating task offloading across the agent network.
Conclusion: Proposed optimization framework effectively manages resource allocation in heterogeneous IoA systems, enabling efficient AI service offloading through game-theoretic and auction-based mechanisms.
Abstract: The Internet of Agents (IoA) is rapidly gaining prominence as a foundational architecture for interconnected intelligent systems, designed to facilitate seamless discovery, communication, and collaborative reasoning among a vast network of Artificial Intelligence (AI) agents. Powered by Large Language and Vision-Language Models, IoA enables the development of interactive, rational agents capable of complex cooperation, moving far beyond traditional isolated models. IoA involves physical entities, i.e., Wireless Agents (WAs) with limited onboard resources, which need to offload their compute-intensive agentic AI services to nearby servers. Such servers can be Mobile Agents (MAs), e.g., vehicle agents, or Fixed Agents (FAs), e.g., end-side unit agents. Given their fixed geographical locations and stable connectivity, FAs can serve as reliable communication gateways and task aggregation points. This stability allows them to effectively coordinate with and offload to an Aerial Agent (AA) tier, an advantage not available to highly mobile MAs with dynamic connectivity limitations. As such, we propose a two-tier optimization approach. The first tier employs a multi-leader multi-follower Stackelberg game. In the game, MAs and FAs act as the leaders who set resource prices. WAs are the followers to determine task offloading ratios. However, when FAs become overloaded, they can further offload tasks to available aerial resources. Therefore, the second tier introduces a Double Dutch Auction model where overloaded FAs act as the buyers to request resources, and AAs serve as the sellers for resource provision. We then develop a diffusion-based Deep Reinforcement Learning algorithm to solve the model. Numerical results demonstrate the superiority of our proposed scheme in facilitating task offloading.
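The abstract names the second-tier mechanism but not its parameters, so the toy below is an editorial illustration only: a Double Dutch Auction with a descending seller clock (the AA's ask) and an ascending buyer clock (the overloaded FA's bid) that clears at the midpoint once the clocks cross. The opening clock values and step size are invented for this sketch.

```python
def double_dutch_auction(buyer_max, seller_min, step=1.0):
    """Toy Double Dutch Auction: the seller clock descends from a high
    ask while the buyer clock ascends from a low bid; a trade clears at
    the midpoint once the two clocks cross."""
    ask, bid = seller_min * 2, buyer_max / 2  # illustrative opening clocks
    while ask > bid:
        ask -= step  # aerial agent (seller) lowers its ask
        bid += step  # overloaded fixed agent (buyer) raises its bid
        if ask < seller_min or bid > buyer_max:
            return None  # a clock left its acceptable range: no trade
    return (ask + bid) / 2  # clearing price

price = double_dutch_auction(buyer_max=10.0, seller_min=4.0)
```

With these invented values the clocks cross between 6 and 7 and the trade clears at 6.5; if the buyer's ceiling sits below the seller's floor, no trade occurs.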
[387] DriveSafe: A Hierarchical Risk Taxonomy for Safety-Critical LLM-Based Driving Assistants
Abhishek Kumar, Riya Tapwal, Carsten Maple
Main category: cs.AI
TL;DR: DriveSafe: A hierarchical risk taxonomy for evaluating safety-critical failures of LLM-based driving assistants, with 129 fine-grained risk categories across technical, legal, societal, and ethical dimensions.
Details
Motivation: LLMs integrated into vehicle assistants pose serious safety, ethical, and regulatory risks when providing unsafe, ambiguous, or legally incorrect responses. Existing safety evaluation frameworks are too general-purpose and fail to capture domain-specific risks in real-world driving scenarios.
Method: Developed DriveSafe, a hierarchical four-level risk taxonomy with 129 atomic risk categories spanning technical, legal, societal, and ethical dimensions. The taxonomy is grounded in real-world driving regulations and safety principles, reviewed by domain experts. Validated by evaluating refusal behavior across six widely deployed LLMs using constructed prompts.
Result: Evaluation shows that current LLMs often fail to appropriately refuse unsafe or non-compliant driving-related queries, highlighting limitations of general-purpose safety alignment in driving contexts.
Conclusion: Domain-specific safety evaluation is crucial for LLM-based driving assistants. DriveSafe provides a systematic framework to assess and improve safety alignment for automotive applications.
Abstract: Large Language Models (LLMs) are increasingly integrated into vehicle-based digital assistants, where unsafe, ambiguous, or legally incorrect responses can lead to serious safety, ethical, and regulatory consequences. Despite growing interest in LLM safety, existing taxonomies and evaluation frameworks remain largely general-purpose and fail to capture the domain-specific risks inherent to real-world driving scenarios. In this paper, we introduce DriveSafe, a hierarchical, four-level risk taxonomy designed to systematically characterize safety-critical failure modes of LLM-based driving assistants. The taxonomy comprises 129 fine-grained atomic risk categories spanning technical, legal, societal, and ethical dimensions, grounded in real-world driving regulations and safety principles and reviewed by domain experts. To validate the safety relevance and realism of the constructed prompts, we evaluate their refusal behavior across six widely deployed LLMs. Our analysis shows that the evaluated models often fail to appropriately refuse unsafe or non-compliant driving-related queries, underscoring the limitations of general-purpose safety alignment in driving contexts.
[388] Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents
Zeping Li, Hongru Wang, Yiwen Zhao, Guanhua Chen, Yixia Li, Keyang Chen, Yixin Cao, Guangnan Ye, Hongfeng Chai, Zhenfei Yin
Main category: cs.AI
TL;DR: LLM-based tool-using agents often make excessive low-quality tool calls in long trajectories, increasing latency. The paper proposes using entropy reduction as a supervisory signal with two reward strategies to optimize tool-use behavior.
Details
Motivation: Tool-using LLM agents excel in tasks like mathematical reasoning and multi-hop QA, but in long trajectories they trigger excessive low-quality tool calls, increasing latency and degrading inference performance, making tool-use behavior management challenging.
Method: After observing strong correlation between entropy reduction and high-quality tool calls, the authors propose using entropy reduction as supervisory signal with two reward strategies: sparse outcome rewards for trajectory-level efficiency guidance, and dense process rewards for fine-grained performance supervision.
Result: Experiments across diverse domains show both reward designs improve tool-use behavior: sparse rewards reduce tool calls by 72.07% compared to baseline average, while dense rewards improve performance by 22.27%.
Conclusion: Entropy reduction serves as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications by optimizing when and how to use tools.
Abstract: Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering. However, in long trajectories, agents often trigger excessive and low-quality tool calls, increasing latency and degrading inference performance, which makes tool-use behavior difficult to manage. In this work, we conduct entropy-based pilot experiments and observe a strong positive correlation between entropy reduction and high-quality tool calls. Building on this finding, we propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior. Sparse outcome rewards provide coarse, trajectory-level guidance to improve efficiency, while dense process rewards offer fine-grained supervision to enhance performance. Experiments across diverse domains show that both reward designs improve tool-use behavior: the former reduces tool calls by 72.07% compared to the average of baselines, while the latter improves performance by 22.27%. These results position entropy reduction as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications.
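The entropy-reduction signal is simple enough to illustrate. The sketch below is not the paper's implementation; it just shows the quantity the rewards are built from: the drop in next-token entropy across a tool call. The example distributions are made up.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_reduction_reward(probs_before, probs_after):
    """Score a tool call by how much it sharpens the model's next-token
    distribution. A dense process reward would apply this per call; a
    sparse outcome reward would aggregate it over a whole trajectory."""
    return entropy(probs_before) - entropy(probs_after)

# A tool call that resolves uncertainty earns a positive reward;
# a redundant call that adds noise earns a negative one.
before = [0.25, 0.25, 0.25, 0.25]  # model is unsure before the call
after = [0.85, 0.05, 0.05, 0.05]   # tool output resolved most uncertainty
reward = entropy_reduction_reward(before, after)
```

Under this signal, low-quality calls that leave the distribution flat contribute nothing, which is consistent with the reported reduction in excess tool calls.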
[389] Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan, Zhitao Zeng, Fei Yang, Yang Liu, Xiaojun Jia
Main category: cs.AI
TL;DR: CC-BOS: A framework using classical Chinese prompts and multi-dimensional fruit fly optimization for automated black-box jailbreak attacks on LLMs
Details
Motivation: LLMs have security vulnerabilities to jailbreak attacks, and classical Chinese's conciseness and obscurity can bypass existing safety constraints, exposing notable LLM vulnerabilities.
Method: Proposes the CC-BOS framework: encodes prompts into 8 policy dimensions (role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern, context), uses multi-dimensional fruit fly optimization with smell search, visual search, and Cauchy mutation for iterative refinement, and includes a classical-Chinese-to-English translation module for evaluation.
Result: Extensive experiments show CC-BOS consistently outperforms state-of-the-art jailbreak attack methods in effectiveness
Conclusion: Classical Chinese presents significant vulnerabilities for LLM jailbreak attacks, and the CC-BOS framework provides an efficient automated approach for black-box jailbreak attacks
Abstract: As Large Language Models (LLMs) are increasingly used, their security risks have drawn increasing attention. Existing research reveals that LLMs are highly susceptible to jailbreak attacks, with effectiveness varying across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this paper proposes a framework, CC-BOS, for the automatic generation of classical Chinese adversarial prompts based on multi-dimensional fruit fly optimization, facilitating efficient and automated jailbreak attacks in black-box settings. Prompts are encoded into eight policy dimensions, covering role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern, and context, and are iteratively refined via smell search, visual search, and Cauchy mutation. This design enables efficient exploration of the search space, thereby enhancing the effectiveness of black-box jailbreak attacks. To enhance readability and evaluation accuracy, we further design a classical-Chinese-to-English translation module. Extensive experiments demonstrate the effectiveness of the proposed CC-BOS, which consistently outperforms state-of-the-art jailbreak attack methods.
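The search loop itself is generic. The sketch below shows a fruit-fly-style optimizer over an 8-dimensional discrete policy vector with smell search, visual search, and Cauchy mutation; it is an editorial illustration with a toy fitness function, not the paper's attack (the real fitness would score jailbreak success against a target model).

```python
import math
import random

random.seed(0)

def fruit_fly_optimize(fitness, dim=8, levels=4, iters=200, swarm=10):
    """Generic fruit-fly optimization over a discrete policy vector
    (`dim` dimensions, each taking one of `levels` values).
    Smell search: perturb candidates around the current best;
    visual search: the swarm converges on the best candidate found;
    Cauchy mutation: occasional heavy-tailed jumps to escape local optima."""
    best = [random.randrange(levels) for _ in range(dim)]
    best_fit = fitness(best)
    for _ in range(iters):
        candidates = []
        for _ in range(swarm):
            c = best[:]
            c[random.randrange(dim)] = random.randrange(levels)  # smell search
            # Cauchy mutation: standard Cauchy variate via inverse CDF
            if abs(math.tan(math.pi * (random.random() - 0.5))) > 3:
                c[random.randrange(dim)] = random.randrange(levels)
            candidates.append(c)
        cand = max(candidates, key=fitness)  # visual search
        cand_fit = fitness(cand)
        if cand_fit > best_fit:
            best, best_fit = cand, cand_fit
    return best, best_fit

# Toy separable fitness: negative distance to an assumed "target" policy.
target = [3, 1, 0, 2, 3, 0, 1, 2]
best, fit = fruit_fly_optimize(lambda v: -sum(abs(a - b) for a, b in zip(v, target)))
```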
[390] CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
Hyungyung Lee, Hangyul Yoon, Edward Choi
Main category: cs.AI
TL;DR: CXReasonAgent integrates LLMs with clinical diagnostic tools for evidence-grounded chest X-ray interpretation, addressing limitations of LVLMs in reliability and adaptability for medical diagnosis.
Details
Motivation: Current large vision-language models (LVLMs) for chest X-ray interpretation generate plausible but not faithfully grounded responses, lack verifiable visual evidence, and require costly retraining for new diagnostic tasks, limiting reliability and adaptability in clinical settings.
Method: CXReasonAgent combines a large language model with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. The approach is evaluated using CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks.
Result: CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs, demonstrating the importance of integrating clinically grounded diagnostic tools in safety-critical clinical settings.
Conclusion: The integration of LLMs with clinical diagnostic tools provides more reliable, evidence-grounded, and adaptable chest X-ray interpretation compared to standard LVLMs, with potential applications in safety-critical medical settings.
Abstract: Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings. The demo is available \href{https://ttumyche.github.io/cxreasonagent/#demo}{here}.
[391] Agentic AI-based Coverage Closure for Formal Verification
Sivaram Pothireddypalli, Ashish Raman, Deepak Narayan Gadde, Aman Kumar
Main category: cs.AI
TL;DR: Agentic AI workflow using LLMs automates coverage analysis for formal verification, identifies gaps, and generates formal properties to accelerate verification efficiency and improve coverage metrics.
Details
Motivation: Traditional exhaustive approaches in IC verification often fail to achieve full coverage within project timelines, creating a need for more efficient methods to automate coverage analysis and property generation for formal verification.
Method: An agentic AI-driven workflow utilizing LLM-enabled Generative AI to automate coverage analysis, identify coverage gaps, and generate required formal properties for formal verification.
Result: Benchmarking on open-source and internal designs shows measurable increase in coverage metrics, with improvements correlated to design complexity. Comparative analysis validates the effectiveness of the approach.
Conclusion: Agentic AI-based techniques have significant potential to improve formal verification productivity and support comprehensive coverage closure in IC development.
Abstract: Coverage closure is a critical requirement in Integrated Chip (IC) development process and key metric for verification sign-off. However, traditional exhaustive approaches often fail to achieve full coverage within project timelines. This study presents an agentic AI-driven workflow that utilizes Large Language Model (LLM)-enabled Generative AI (GenAI) to automate coverage analysis for formal verification, identify coverage gaps, and generate the required formal properties. The framework accelerates verification efficiency by systematically addressing coverage holes. Benchmarking open-source and internal designs reveals a measurable increase in coverage metrics, with improvements correlated to the complexity of the design. Comparative analysis validates the effectiveness of this approach. These results highlight the potential of agentic AI-based techniques to improve formal verification productivity and support comprehensive coverage closure.
[392] Retrieval-Augmented Generation with Covariate Time Series
Kenny Ye Liang, Zhongyi Pei, Huan Zhang, Yuhui Liu, Shaoxu Song, Jianmin Wang
Main category: cs.AI
TL;DR: RAG4CTS: A regime-aware, training-free RAG framework for covariate time-series that addresses challenges in predictive maintenance scenarios with data scarcity, short transient sequences, and covariate coupled dynamics.
Details
Motivation: Extending RAG to Time-Series Foundation Models (TSFMs) remains challenging, especially in high-stakes industrial scenarios like Predictive Maintenance of PRSOV valves characterized by data scarcity, short transient sequences, and covariate coupled dynamics. Existing time-series RAG approaches with static vector embeddings and learnable context augmenters fail to distinguish similar regimes in such scenarios.
Method: Proposes RAG4CTS with: (1) hierarchical time-series native knowledge base for lossless storage and physics-informed retrieval of raw historical regimes, (2) two-stage bi-weighted retrieval mechanism aligning historical trends through point-wise and multivariate similarities, and (3) agent-driven strategy for dynamic context optimization in self-supervised manner.
Result: Extensive experiments on PRSOV demonstrate significant outperformance over state-of-the-art baselines in prediction accuracy. Deployed in Apache IoTDB within China Southern Airlines, successfully identified one PRSOV fault in two months with zero false alarms.
Conclusion: RAG4CTS effectively addresses challenges in time-series RAG for industrial predictive maintenance through regime-aware, training-free framework with physics-informed retrieval and dynamic context optimization, demonstrating practical value in real-world deployment.
Abstract: While RAG has greatly enhanced LLMs, extending this paradigm to Time-Series Foundation Models (TSFMs) remains a challenge. This is exemplified in the Predictive Maintenance of the Pressure Regulating and Shut-Off Valve (PRSOV), a high-stakes industrial scenario characterized by (1) data scarcity, (2) short transient sequences, and (3) covariate coupled dynamics. Unfortunately, existing time-series RAG approaches predominantly rely on generated static vector embeddings and learnable context augmenters, which may fail to distinguish similar regimes in such scarce, transient, and covariate coupled scenarios. To address these limitations, we propose RAG4CTS, a regime-aware, training-free RAG framework for Covariate Time-Series. Specifically, we construct a hierarchical time-series native knowledge base to enable lossless storage and physics-informed retrieval of raw historical regimes. We design a two-stage bi-weighted retrieval mechanism that aligns historical trends through point-wise and multivariate similarities. For context augmentation, we introduce an agent-driven strategy to dynamically optimize context in a self-supervised manner. Extensive experiments on PRSOV demonstrate that our framework significantly outperforms state-of-the-art baselines in prediction accuracy. The proposed system is deployed in Apache IoTDB within China Southern Airlines. Since deployment, our method has successfully identified one PRSOV fault in two months with zero false alarms.
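The bi-weighted retrieval idea can be sketched as a score that blends target-series similarity with covariate-channel agreement. The snippet below is an editorial illustration under stated assumptions: the similarity kernels and the fixed 50/50 weights are invented, since the abstract does not specify them.

```python
import math

def pointwise_sim(q, r):
    """Point-wise similarity: mean absolute distance squashed into (0, 1]."""
    return math.exp(-sum(abs(a - b) for a, b in zip(q, r)) / len(q))

def multivariate_sim(q_cov, r_cov):
    """Average agreement across the covariate channels."""
    sims = [pointwise_sim(qc, rc) for qc, rc in zip(q_cov, r_cov)]
    return sum(sims) / len(sims)

def bi_weighted_score(query, candidate, w_point=0.5, w_cov=0.5):
    """Blend target-series and covariate-channel similarity. The fixed
    weights stand in for the paper's weighting, whose exact form is not
    given in the abstract."""
    return (w_point * pointwise_sim(query["target"], candidate["target"])
            + w_cov * multivariate_sim(query["covariates"], candidate["covariates"]))

# Rank a tiny knowledge base of two historical regimes against a query.
query = {"target": [1.0, 1.1, 1.2], "covariates": [[0.5, 0.5, 0.5]]}
kb = [
    {"target": [1.0, 1.1, 1.2], "covariates": [[0.5, 0.5, 0.5]]},  # matching regime
    {"target": [3.0, 2.0, 1.0], "covariates": [[0.9, 0.1, 0.9]]},  # dissimilar regime
]
best = max(kb, key=lambda r: bi_weighted_score(query, r))
```

Weighting the covariate channels separately is what lets the retriever separate regimes whose target series look alike but whose operating covariates differ.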
[393] Planning as Goal Recognition: Deriving Heuristics from Intention Models – Extended Version
Giacomo Rosa, Jean Honorio, Nir Lipovetzky, Sebastian Sardina
Main category: cs.AI
TL;DR: A new framework using goal recognition heuristics for classical planning, with divergence-based intention assessment to improve planning efficiency.
Details
Motivation: To improve classical planning by leveraging insights from goal recognition research, creating a feedback loop where GR techniques can enhance planning heuristics rather than just using planning for GR.
Method: Proposes a divergence-based framework for assessing goal intention, derives two efficiently-computable heuristics from this framework, and integrates them into classical planning algorithms.
Result: The derived heuristics yield improvements for top-scoring classical planners, demonstrating the practical value of the approach.
Conclusion: Provides foundational knowledge for probabilistic intention-based heuristics in planning, establishing a bidirectional relationship between goal recognition and classical planning.
Abstract: Classical planning aims to find a sequence of actions, a plan, that maps a starting state into one of the goal states. If a trajectory appears to be leading to the goal, should we prioritise exploring it? Seminal work in goal recognition (GR) has defined GR in terms of a classical planning problem, adopting classical solvers and heuristics to recognise plans. We come full circle, and study the adoption and properties of GR-derived heuristics for seeking solutions to classical planning problems. We propose a new divergence-based framework for assessing goal intention, which informs a new class of efficiently-computable heuristics. As a proof of concept, we derive two such heuristics, and show that they can already yield improvements for top-scoring classical planners. Our work provides foundational knowledge for understanding and deriving probabilistic intention-based heuristics for planning.
[394] Cascade-Aware Multi-Agent Routing: Spatio-Temporal Sidecars and Geometry-Switching
Davide Di Gioia
Main category: cs.AI
TL;DR: A geometry-aware sidecar system improves AI agent scheduling by modeling failure propagation in different graph topologies (tree-like vs cyclic), boosting scheduler performance by 36.8 percentage points.
Details
Motivation: Current AI reasoning systems use dynamic execution graphs of specialized agents, but schedulers lack understanding of how failure propagation differs between tree-like and cyclic graph structures, leading to suboptimal routing decisions.
Method: Developed a spatio-temporal sidecar with: 1) Euclidean propagation scorer for dense cyclic subgraphs, 2) hyperbolic scorer for exponential risk in tree-like subgraphs, and 3) a compact learned gate (133 parameters) that blends scores using topology and geometry-aware features.
Result: On 250 benchmark scenarios across five topology regimes, the sidecar improved native scheduler’s win rate from 50.4% to 87.2% (+36.8 pp). In tree-like regimes, gains reached +48 to +68 pp. Learned gate achieved held-out AUC = 0.9247.
Conclusion: Geometry-aware failure propagation modeling significantly improves AI agent scheduling performance, with the approach generalizing across different graph families (Barabasi-Albert, Watts-Strogatz, Erdos-Renyi).
Abstract: Advanced AI reasoning systems route tasks through dynamic execution graphs of specialized agents. We identify a structural blind spot in this architecture: schedulers optimize load and fitness but lack a model of how failure propagates differently in tree-like versus cyclic graphs. In tree-like regimes, a single failure cascades exponentially; in dense cyclic regimes, it self-limits. A geometry-blind scheduler cannot distinguish these cases. We formalize this observability gap as an online geometry-control problem. We prove a cascade-sensitivity condition: failure spread is supercritical when per-edge propagation probability exceeds the inverse of the graph’s branching factor (p > e^{-γ}, where γ is the BFS shell-growth exponent). We close this gap with a spatio-temporal sidecar that predicts which routing geometry fits the current topology. The sidecar comprises (i) a Euclidean propagation scorer for dense, cyclic subgraphs, (ii) a hyperbolic scorer capturing exponential risk in tree-like subgraphs, and (iii) a compact learned gate (133 parameters) that blends the two scores using topology and geometry-aware features. On 250 benchmark scenarios spanning five topology regimes, the sidecar lifts the native scheduler’s win rate from 50.4% to 87.2% (+36.8 pp). In tree-like regimes, gains reach +48 to +68 pp. The learned gate achieves held-out AUC = 0.9247, confirming geometry preference is recoverable from live signals. Cross-architecture validation on Barabasi-Albert, Watts-Strogatz, and Erdos-Renyi graphs confirms propagation modeling generalizes across graph families.
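The cascade-sensitivity condition is concrete enough to demonstrate. The snippet below (an editorial illustration, not the paper's code) estimates the shell-growth exponent γ from BFS shell sizes under the assumption that shells grow roughly like e^{γd}, then checks p > e^{-γ} on a toy complete binary tree.

```python
import math

def bfs_shell_sizes(adj, root):
    """Sizes of successive BFS shells (number of nodes at each hop distance)."""
    seen, frontier, sizes = {root}, [root], []
    while frontier:
        sizes.append(len(frontier))
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        frontier = nxt
    return sizes

def shell_growth_exponent(sizes):
    """Estimate gamma assuming shells grow roughly like e^(gamma * d):
    average the log-ratios of consecutive shell sizes."""
    ratios = [math.log(b / a) for a, b in zip(sizes, sizes[1:])]
    return sum(ratios) / len(ratios)

def cascade_supercritical(adj, root, p):
    """The paper's condition: failure spread is supercritical when the
    per-edge propagation probability exceeds e^(-gamma)."""
    gamma = shell_growth_exponent(bfs_shell_sizes(adj, root))
    return p > math.exp(-gamma)

# Complete binary tree of depth 3: shells of size 1, 2, 4, 8, so
# gamma = ln 2 and the critical probability is e^(-ln 2) = 0.5.
tree = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5, 6],
        3: [1, 7, 8], 4: [1, 9, 10], 5: [2, 11, 12], 6: [2, 13, 14],
        7: [3], 8: [3], 9: [4], 10: [4], 11: [5], 12: [5], 13: [6], 14: [6]}
```

On this tree any per-edge probability above 0.5 is supercritical, matching the intuition that a branching factor of 2 lets a failure double per hop; a dense cyclic graph would yield a larger γ only in its local shells and self-limit globally.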
[395] A transformer architecture alteration to incentivise externalised reasoning
Elizabeth Pavlova, Mariia Koroliuk, Karthik Viswanathan, Cameron Tice, Edward James Young, Puria Radmard
Main category: cs.AI
TL;DR: Teaching LLMs to truncate forward passes early using early-exit mechanisms and reinforcement learning to reduce computation while maintaining performance
Details
Motivation: To make LLMs more efficient verbose reasoners by reducing excess computation, allowing models to exit early at shallow layers when tokens can be predicted without deep computation, reserving complex processing only for difficult tokens.
Method: Augment transformer architecture with early-exit mechanism at intermediate layers, train model to exit at shallower layers when next token can be predicted without deep computation, use calibration stage followed by reinforcement learning to incentivize early exits while maintaining task performance.
Result: Preliminary results show small reasoning models learn to adaptively reduce computations across tokens by exiting early when possible
Conclusion: The approach can minimize excess computation in reasoning models, allowing them to reserve deep processing only for difficult-to-predict tokens, potentially improving efficiency at scale
Abstract: We propose a new architectural change, and post-training pipeline, for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at intermediate layers and train the model to exit at shallower layers when the next token can be predicted without deep computation. After a calibration stage, we incentivise the model to exit as early as possible while maintaining task performance using reinforcement learning. We provide preliminary results to this effect for small reasoning models, showing that they learn to adaptively reduce computations across tokens. We predict that, applied at the right scale, our approach can minimise the amount of excess computation that reasoning models have at their disposal to perform non-myopic planning using their internal activations, reserving this only for difficult-to-predict tokens.
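The early-exit mechanism can be sketched in a few lines. The paper trains the exit decision with RL; the fixed confidence rule below only illustrates the mechanism, and the toy layers, exit heads, and threshold are all invented for this example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_with_early_exit(layers, x, exit_heads, threshold=0.9):
    """Run layers in order; after each, a per-layer exit head emits
    next-token logits. If the head's top probability clears the
    threshold, truncate the forward pass there."""
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        x = layer(x)
        probs = softmax(head(x))
        if max(probs) >= threshold:
            return probs, depth  # early exit: token predicted at a shallow layer
    return probs, len(layers)    # full-depth computation was needed

# Toy 3-layer "model" on a scalar state; the heads (invented here) grow
# more confident with depth, so this pass truncates at layer 2.
layers = [lambda x: x + 1.0] * 3
exit_heads = [
    lambda x: [0.0, 0.0],  # uniform: ~0.50 top probability, no exit
    lambda x: [3.0, 0.0],  # ~0.95 top probability: exit here
    lambda x: [6.0, 0.0],
]
probs, depth = forward_with_early_exit(layers, 0.0, exit_heads)
```

Truncating at shallow layers for easy tokens is what removes the "free" deep computation the abstract argues models could otherwise use for internal, non-externalised planning.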
[396] Cerebra: A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment
Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar, Kevin M. Spiegler, Philip Kuball, Stefania C. Bray, Megan Bernath, Deanna R. Willis, Jiang Bian, Lei Xing, Eric Topol, Kyunghyun Cho, Yu Huang, Ruogu Fang, Narges Razavian, James Zou
Main category: cs.AI
TL;DR: Cerebra is an interactive multi-agent AI system for clinical decision support that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis, with a conversational interface for clinicians.
Details
Motivation: Clinical practice requires reasoning over heterogeneous, evolving, and incomplete patient data, but existing multimodal foundation models are static, opaque, and poorly aligned with real-world clinical workflows.
Method: Multi-agent AI team coordinating specialized agents for different data modalities (EHR, clinical notes, medical imaging), with outputs synthesized into a clinician-facing dashboard combining visual analytics with conversational interface.
Result: Outperformed state-of-the-art single-modality models and large multimodal language model baselines on massive multi-institutional dataset (3M patients). Achieved AUROCs up to 0.80 for dementia risk prediction (vs 0.74 for single-modality, 0.68 for LLM baselines), 0.86 for dementia diagnosis, and C-index of 0.81 for survival prediction. Improved physician accuracy by 17.5 percentage points in reader study.
Conclusion: Cerebra demonstrates potential for interpretable, robust decision support in clinical care through its multi-agent architecture and interactive interface.
Abstract: Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real-world clinical workflows. We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician-facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy-preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi-institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state-of-the-art single-modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single-modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C-index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra’s potential for interpretable, robust decision support in clinical care.
[397] Dataset Distillation-based Hybrid Federated Learning on Non-IID Data
Xiufang Shi, Wei Zhang, Yuheng Li, Mincheng Wu, Zhenyu Wen, Shibo He, Tejal Shah, Rajiv Ranjan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2409.17517.
[398] LOGSAFE: Logic-Guided Verification for Trustworthy Federated Time-Series Learning
Dung Thuy Nguyen, Ziyan An, Taylor T. Johnson, Meiyi Ma, Kevin Leach
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2411.03231.
[399] Streaming Attention Approximation via Discrepancy Theory
Ekaterina Kochetkova, Kshiteej Sheth, Insu Han, Amir Zandieh, Michael Kapralov
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2502.07861.
[400] Multiplicative learning from observation-prediction ratios
Han Kim, Hyungjoon Soh, Vipul Periwal, Junghyo Jo
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2503.10144.
[401] Training-free Adjustable Polynomial Graph Filtering for Ultra-fast Multimodal Recommendation
Yu-Seung Roh, Joo-Young Kim, Jin-Duk Park, Won-Yong Shin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2503.04406.
[402] Automating quantum feature map design via large language models
Kenya Sakka, Kosuke Mitarai, Keisuke Fujii
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2504.07396.
[403] Leakage and Interpretability in Concept-Based Models
Enrico Parisini, Tapabrata Chakraborti, Chris Harbron, Ben D. MacArthur, Christopher R. S. Banerji
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2504.14094.
[404] GAIA: A Foundation Model for Operational Atmospheric Dynamics
Ata Akbari Asanjan, Olivia Alexander, Tom Berg, Stephen Peng, Jad Makki, Clara Zhang, Matt Yang, Disha Shidham, Srija Chakraborty, William Bender, Cara Crawford, Arun Ravindran, Olivier Raiman, David Potere, David Bell
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2505.18179.
[405] Generalizable Heuristic Generation Through LLMs with Meta-Optimization
Yiding Shi, Jianan Zhou, Wen Song, Jieyi Bi, Yaoxin Wu, Zhiguang Cao, Jie Zhang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2505.20881.
[406] CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale
Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2506.02548.
[407] Learning The Minimum Action Distance
Lorenzo Steccanella, Joshua B. Evans, Özgür Şimşek, Anders Jonsson
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2506.09276.
[408] UniCA: Unified Covariate Adaptation for Time Series Foundation Model
Lu Han, Yu Liu, Lan Li, Qiwen Deng, Jian Jiang, Yinbo Sun, Zhe Yu, Binfeng Wang, Xingyu Lu, Lintao Ma, Han-Jia Ye, De-Chuan Zhan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2506.22039.
[409] Graph Structure Learning with Privacy Guarantees for Open Graph Data
Muhao Guo, Jiaqi Wu, Yizheng Liao, Wenke Lee, Shengzhe Chen, Yang Weng
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2507.19116.
[410] From Product Hilbert Spaces to the Generalized Koopman Operator and the Nonlinear Fundamental Lemma
Mircea Lazar
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2508.07494.
[411] From Context to Intent: Reasoning-Guided Function-Level Code Completion
Yanzhou Li, Tianlin Li, Yiran Zhang, Shangqing Liu, Aishan Liu, Xianglong Liu, Yang Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2508.09537.
[412] VL-KnG: Persistent Spatiotemporal Knowledge Graphs from Egocentric Video for Embodied Scene Understanding
Mohamad Al Mdfaa, Svetlana Lukina, Timur Akhtyamov, Arthur Nigmatzyanov, Dmitrii Nalberskii, Sergey Zagoruyko, Gonzalo Ferrer
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.01483.
[413] Counterfactual Identifiability via Dynamic Optimal Transport
Fabio De Sousa Ribeiro, Ainkaran Santhirasekaram, Ben Glocker
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.08294.
[414] MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents
Dongsen Zhang, Zekun Li, Xu Luo, Xuannan Liu, Peipei Li, Wenjun Xu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.15994.
[415] GUIrilla: A Scalable Framework for Automated Desktop UI Exploration
Sofiya Garkot, Maksym Shamrai, Ivan Synytsia, Mariya Hirna
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.16051.
[416] Quantifying Systemic Vulnerability in the Foundation Model Industry
Claudio Pirrone, Stefano Fricano, Gioacchino Fazio
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.23421.
[417] Does Privacy Always Harm Fairness? Data-Dependent Trade-offs via Chernoff Information Neural Estimation
Arjun Nichani, Hsiang Hsu, Chun-Fu Chen, Haewon Jeong
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2601.13698.
[418] Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model
Zhiyu An, Wan Du
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2601.18858.
[419] Behavioral Consistency Validation for LLM Agents: An Analysis of Trading-Style Switching through Stock-Market Simulation
Zeping Li, Guancheng Wan, Keyang Chen, Yu Chen, Yiwen Zhao, Philip Torr, Guangnan Ye, Zhenfei Yin, Hongfeng Chai
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2602.07023.
[420] Energy-Aware Reinforcement Learning for Robotic Manipulation of Articulated Components in Infrastructure Operation and Maintenance
Xiaowen Tao, Yinuo Wang, Haitao Ding, Yuanyang Qi, Ziyu Song
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2602.12288.
[421] When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift
Kevin Vogt-Lowell, Theodoros Tsiligkaridis, Rodney Lafuente-Mercado, Surabhi Ghatti, Shanghua Gao, Marinka Zitnik, Daniela Rus
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.04648.
[422] NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL
Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Admoni, Artem Polyakov, Subhadeep Bhattacharya, Fan Yu, Kai Sun, Georgios Theodorakis, Hsin-Chun Yin, Peter-Jan Gootzen, Aamir Shafi, Assaf Ravid, Salvatore Di Girolamo, Manjunath Gorentla Venkata, Gil Bloch
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.13606.
[423] EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.17808.
[424] Agent Control Protocol: Admission Control for Agent Actions
Marcelo Fernandez
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.18829.
[425] MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels
Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.19310.
[426] Global Convergence of Multiplicative Updates for the Matrix Mechanism: A Collaborative Proof with Gemini 3
Keith Rush
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.19465.
[427] MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning
Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.20586.
[428] An Industrial-Scale Retrieval-Augmented Generation Framework for Requirements Engineering: Empirical Evaluation with Automotive Manufacturing Data
Muhammad Khalid, Yilmaz Uygun
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.20534.
[429] mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.21606.
[430] TRACE: A Multi-Agent System for Autonomous Physical Reasoning in Seismology
Feng Liu, Jian Xu, Xin Cui, Xinghao Wang, Zijie Guo, Jiong Wang, S. Mostafa Mousavi, Xinyu Gu, Hao Chen, Ben Fei, Lihua Fang, Fenghua Ling, Zefeng Li, Lei Bai
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.21152.
[431] DeepXplain: XAI-Guided Autonomous Defense Against Multi-Stage APT Campaigns
Trung V. Phan, Thomas Bauschert
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.21296.
[432] LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study
Shuai Wang, Yinan Yu, Earl Barr, Dhasarathy Parthasarathy
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.21439.
[433] BadminSense: Enabling Fine-Grained Badminton Stroke Evaluation on a Single Smartwatch
Taizhou Chen, Kai Chen, Xingyu Liu, Pingchuan Ke, Zhida Sun
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2603.21825.
cs.SD
[434] Velocity Potential Neural Field for Efficient Ambisonics Impulse Response Modeling
Yoshiki Masuyama, Francois G. Germain, Gordon Wichern, Chiori Hori, Jonathan Le Roux
Main category: cs.SD
TL;DR: A physics-informed neural network approach for spatial interpolation of First-order Ambisonics (FOA) signals by learning a velocity potential function that automatically satisfies physical constraints.
Details
Motivation: Previous physics-informed neural networks for FOA spatial interpolation used soft penalty terms derived from physical principles, but this approach does not guarantee physical consistency. The authors aim to develop a method in which the predicted FOA signal automatically satisfies the linearized momentum equation by construction.
Method: Instead of directly predicting the FOA signal, the network approximates a scalar velocity potential function. The four FOA channels are then derived from partial derivatives of this velocity potential with respect to time and microphone position, ensuring physical consistency by construction according to the physics of sound propagation.
Result: Experimental results on room impulse response reconstruction confirm the effectiveness of the proposed framework, showing improved physical consistency compared to previous approaches.
Conclusion: The proposed method provides a more physically consistent approach to FOA spatial interpolation by learning a velocity potential function rather than directly predicting FOA signals, ensuring adherence to physical principles at all times and positions.
Abstract: First-order Ambisonics (FOA) is a standard spatial audio format based on spherical harmonic decomposition. Its zeroth- and first-order components capture the sound pressure and particle velocity, respectively. Recently, physics-informed neural networks have been applied to the spatial interpolation of FOA signals, regularizing the network outputs based on soft penalty terms derived from physical principles, e.g., the linearized momentum equation. In this paper, we reformulate the task so that the predicted FOA signal automatically satisfies the linearized momentum equation. Our network approximates a scalar function called velocity potential, rather than the FOA signal itself. Then, the FOA signal can be readily recovered through the partial derivatives of the velocity potential with respect to the network inputs (i.e., time and microphone position) according to physics of sound propagation. By deriving the four channels of FOA from the single-channel velocity potential, the reconstructed signal follows the physical principle at any time and position by construction. Experimental results on room impulse response reconstruction confirm the effectiveness of the proposed framework.
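The construction in the abstract, deriving pressure and particle-velocity channels from a single scalar potential so that the momentum equation holds identically, can be checked numerically. The sketch below is an illustration under simplified assumptions (a 1-D analytic plane wave standing in for the learned network, finite differences standing in for automatic differentiation); the constants and names are not from the paper:

```python
import math

RHO, C = 1.2, 343.0    # assumed air density [kg/m^3] and speed of sound [m/s]
K = 2 * math.pi / 0.5  # wavenumber for an assumed 0.5 m wavelength
H = 1e-6               # finite-difference step

def phi(t, x):
    """Toy scalar velocity potential (plane wave along +x), standing in for the network."""
    return math.sin(K * (x - C * t))

def pressure(t, x):
    """p = -rho * dphi/dt: the zeroth-order (W) channel, up to normalization."""
    return -RHO * (phi(t + H, x) - phi(t - H, x)) / (2 * H)

def velocity_x(t, x):
    """v_x = dphi/dx: a first-order (X) channel."""
    return (phi(t, x + H) - phi(t, x - H)) / (2 * H)

# Because both channels come from the same potential, the linearized momentum
# equation rho * dv_x/dt + dp/dx = 0 holds by construction; the numerical
# residual below is only finite-difference noise, not modeling error.
t0, x0 = 0.01, 0.3
dvdt = (velocity_x(t0 + H, x0) - velocity_x(t0 - H, x0)) / (2 * H)
dpdx = (pressure(t0, x0 + H) - pressure(t0, x0 - H)) / (2 * H)
residual = RHO * dvdt + dpdx  # tiny relative to either term
```

This is the sense in which the paper's reformulation replaces a soft penalty with a hard constraint: no regularization term is needed because the residual vanishes identically for any potential the network might output.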
[435] The Interspeech 2026 Audio Encoder Capability Challenge for Large Audio Language Models
Heinrich Dinkel, Jiahao Zhou, Guanbo Wang, Yadong Niu, Junbo Zhang, Yufeng Hao, Ying Liu, Ke Li, Wenwu Wang, Zhiyong Wu, Jian Luan
Main category: cs.SD
TL;DR: The Interspeech 2026 Audio Encoder Capability Challenge introduces a benchmark to evaluate pre-trained audio encoders as front-end modules for Large Audio Language Models (LALMs) using the XARES-LLM framework.
Details
Motivation: While LALMs show strong acoustic scene understanding, their performance depends on the semantic richness of audio encoder representations. There's an integration gap between encoder development and LLM fine-tuning that needs addressing.
Method: Creates a unified generative evaluation framework called XARES-LLM that assesses submitted audio encoders across diverse downstream classification and generation tasks, decoupling encoder development from LLM fine-tuning.
Result: Establishes a standardized protocol for evaluating general-purpose audio representations that can effectively be used for next-generation multimodal language models.
Conclusion: The challenge provides a benchmark to advance audio encoder capabilities as front-end modules for LALMs, addressing the integration gap through standardized evaluation.
Abstract: This paper presents the Interspeech 2026 Audio Encoder Capability Challenge, a benchmark specifically designed to evaluate and advance the performance of pre-trained audio encoders as front-end modules for Large Audio Language Models (LALMs). While LALMs have shown remarkable understanding of complex acoustic scenes, their performance depends on the semantic richness of the underlying audio encoder representations. This challenge addresses the integration gap by providing a unified generative evaluation framework, XARES-LLM, which assesses submitted encoders across a diverse suite of downstream classification and generation tasks. By decoupling encoder development from LLM fine-tuning, the challenge establishes a standardized protocol for general-purpose audio representations that can effectively be used for the next generation of multimodal language models.
[436] MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates
Zikang Huang, Meng Ge, Tianrui Wang, Xuanchen Li, Xiaobao Wang, Longbiao Wang, Jianwu Dang
Main category: cs.SD
TL;DR: MSR-HuBERT extends HuBERT with a multi-sampling-rate adaptive CNN to handle mixed-rate speech data without resampling, improving performance on speech recognition and reconstruction across 16-48 kHz.
Details
Motivation: Existing speech SSL methods assume single sampling rates and struggle with mixed-rate data due to temporal resolution mismatch, limiting their practical applicability to real-world scenarios with varying audio qualities.
Method: Replaces HuBERT’s single-rate downsampling CNN with a multi-sampling-rate adaptive CNN that maps raw waveforms from different sampling rates to a shared temporal resolution without resampling, enabling unified mixed-rate pre-training and fine-tuning while retaining HuBERT’s mask-prediction objective and Transformer encoder.
Result: Outperforms HuBERT on speech recognition and full-band speech reconstruction across 16-48kHz range, preserving high-frequency detail while modeling low-frequency semantic structure.
Conclusion: MSRHuBERT successfully addresses the mixed-rate limitation in speech SSL, maintains compatibility with existing HuBERT analyses/improvements, and provides a practical solution for real-world audio applications with varying sampling rates.
Abstract: Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSRHuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms from different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSRHuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSRHuBERT retains HuBERT’s mask-prediction objective and Transformer encoder, so existing analyses and improvements that were developed for HuBERT can apply directly.
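The shared-resolution idea can be made concrete with a small sketch. The stride stacks below are invented for illustration (only the 16 kHz stack matches HuBERT's standard 320-sample hop); the point is that each per-rate front-end downsamples by a different factor so every sampling rate lands on the same frame rate.

```python
# Illustrative per-rate CNN stride stacks (not the paper's configuration):
# each stack's stride product maps its sampling rate to a shared ~50 Hz
# frame rate, matching HuBERT's 16 kHz / 320-sample hop.
STRIDES = {
    16000: [5, 2, 2, 2, 2, 2, 2],     # product 320 -> 16000 / 320 = 50 Hz
    24000: [5, 3, 2, 2, 2, 2, 2],     # product 480 -> 24000 / 480 = 50 Hz
    48000: [5, 3, 2, 2, 2, 2, 2, 2],  # product 960 -> 48000 / 960 = 50 Hz
}

def frame_rate(sample_rate):
    hop = 1
    for stride in STRIDES[sample_rate]:
        hop *= stride
    return sample_rate / hop
```

With the output frame rate fixed, the downstream Transformer and mask-prediction objective never see the sampling rate, which is why existing HuBERT analyses carry over.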
[437] Structural and Statistical Audio Texture Knowledge Distillation for Acoustic Classification
Jarin Ritu, Amirmohammad Mohammadi, Davelle Carreiro, Alexandra Van Dine, Joshua Peeples
Main category: cs.SD
TL;DR: SSATKD framework combines high-level contextual info with low-level audio textures for better environmental sound classification via knowledge distillation.
Details
Motivation: Existing knowledge distillation methods for environmental sound classification overlook essential low-level audio texture features needed to capture local patterns in complex acoustic environments.
Method: Proposes the Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) framework, which extracts both high-level contextual information and low-level structural/statistical audio textures from intermediate layers. Evaluated on four environmental sound datasets with different teacher adaptation strategies.
Result: Experimental results show consistent accuracy improvements across all datasets and settings, confirming SSATKD’s effectiveness and robustness in real-world sound classification tasks.
Conclusion: SSATKD successfully addresses the gap in audio knowledge distillation by incorporating essential low-level audio texture features, demonstrating improved performance across diverse acoustic domains.
Abstract: While knowledge distillation has shown success in various audio tasks, its application to environmental sound classification often overlooks essential low-level audio texture features needed to capture local patterns in complex acoustic environments. To address this gap, the Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) framework is proposed, which combines high-level contextual information with low-level structural and statistical audio textures extracted from intermediate layers. To evaluate its generalizability across diverse acoustic domains, SSATKD is tested on four datasets within the environmental sound classification domain, including two passive sonar datasets (DeepShip and Vessel Type Underwater Acoustic Data (VTUAD)) and two general environmental sound datasets (Environmental Sound Classification 50 (ESC-50) and Tampere University of Technology (TUT) Acoustic Scenes). Two teacher adaptation strategies are explored: classifier-head-only adaptation and full fine-tuning. The framework is further evaluated using various convolutional and transformer-based teacher models. Experimental results demonstrate consistent accuracy improvements across all datasets and settings, confirming the effectiveness and robustness of SSATKD in real-world sound classification tasks.
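A hedged sketch of what texture-style distillation terms can look like: a "structural" loss matching Gram (channel co-occurrence) matrices and a "statistical" loss matching per-channel mean/variance between teacher and student intermediate features. These concrete losses are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np

# Toy teacher/student intermediate features (channels x time).
rng = np.random.default_rng(0)
t_feat = rng.normal(size=(8, 32))   # teacher features
s_feat = rng.normal(size=(8, 32))   # student features

def structural_loss(t, s):
    gt = t @ t.T / t.shape[1]       # teacher Gram matrix
    gs = s @ s.T / s.shape[1]       # student Gram matrix
    return float(np.mean((gt - gs) ** 2))

def statistical_loss(t, s):
    return float(np.mean((t.mean(axis=1) - s.mean(axis=1)) ** 2)
                 + np.mean((t.var(axis=1) - s.var(axis=1)) ** 2))

# Combined with the usual task loss during distillation (weights illustrative).
texture_kd_loss = 1.0 * structural_loss(t_feat, s_feat) \
                + 1.0 * statistical_loss(t_feat, s_feat)
```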
[438] DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
Main category: cs.SD
TL;DR: DreamAudio enables customized text-to-audio generation by learning auditory concepts from reference audio samples, allowing control over fine-grained acoustic characteristics while maintaining semantic alignment with text prompts.
Details
Motivation: Existing text-to-audio models generate semantically aligned sound but lack control over fine-grained acoustic characteristics, making it difficult for users to generate specific desired audio content with personalized audio events.
Method: Introduces a framework that enables the model to identify auditory information from user-provided reference concepts. Given a few reference audio samples containing personalized audio events, the system can generate new audio samples with these specific events. Two types of datasets are developed for training and testing.
Result: DreamAudio generates audio samples highly consistent with customized audio features and well-aligned with input text prompts. It offers comparable performance in general text-to-audio tasks and provides a human-involved dataset with real-world CTTA cases as a benchmark.
Conclusion: DreamAudio successfully addresses the limitation of existing text-to-audio models by enabling customized generation with fine-grained acoustic control, while maintaining good performance on general text-to-audio tasks.
Abstract: With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of controlling fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it difficult to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the proposed systems. The experiments show that DreamAudio generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.
[439] U3-xi: Pushing the Boundaries of Speaker Recognition by Incorporating Uncertainty
Junjie Li, Kong Aik Lee
Main category: cs.SD
TL;DR: U3-xi framework improves speaker verification by estimating frame-level uncertainty to weight contributions to utterance embeddings, with uncertainty supervision via stochastic variance loss and adaptive softmax scaling.
Details
Motivation: Frame-level representations in speaker verification contain both speaker-relevant information and nuisance factors, leading to unequal contributions to final embeddings. Current methods lack reliable uncertainty estimation for weighting frames appropriately.
Method: Proposes U3-xi framework with: 1) Speaker-level uncertainty supervision via Stochastic Variance Loss using distance to speaker centroid as pseudo ground truth; 2) Global-level uncertainty supervision by injecting uncertainty into softmax scale for adaptive decision boundary; 3) Transformer encoder with multi-view self-attention for uncertainty estimation capturing temporal dependencies.
Result: Achieves 21.1% and 15.57% relative improvements in EER and minDCF on VoxCeleb1 test sets when applied to ECAPA-TDNN. Framework is model-agnostic and works with various speaker encoders.
Conclusion: U3-xi provides reliable uncertainty estimation for speaker embeddings, enabling adaptive weighting of frames based on their uncertainty, leading to significant performance improvements in speaker verification.
Abstract: An utterance-level speaker embedding is typically obtained by aggregating a sequence of frame-level representations. However, in real-world scenarios, individual frames encode not only speaker-relevant information but also various nuisance factors. As a result, different frames contribute unequally to the final utterance-level speaker representation for Automatic Speaker Verification systems. To address this issue, we propose to estimate the inherent uncertainty of each frame and assign adaptive weights accordingly, where frames with higher uncertainty receive lower attention. Based on this idea, we present U3-xi, a comprehensive framework designed to produce more reliable and interpretable uncertainty estimates for speaker embeddings. Specifically, we introduce several strategies for uncertainty supervision. First, we propose speaker-level uncertainty supervision via a Stochastic Variance Loss, where the distance between an utterance embedding and its corresponding speaker centroid serves as a pseudo ground truth for uncertainty learning. Second, we incorporate global-level uncertainty supervision by injecting the predicted uncertainty into the softmax scale during training. This adaptive scaling mechanism adjusts the sharpness of the decision boundary according to sample difficulty, providing global guidance. Third, we redesign the uncertainty estimation module by integrating a Transformer encoder with multi-view self-attention, enabling the model to capture rich local and long-range temporal dependencies. Comprehensive experiments demonstrate that U3-xi is model-agnostic and can be seamlessly applied to various speaker encoders. In particular, when applied to ECAPA-TDNN, it achieves 21.1% and 15.57% relative improvements on the VoxCeleb1 test sets in terms of EER and minDCF, respectively.
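The central aggregation idea can be sketched in a few lines: frames with higher estimated uncertainty get lower attention when frame-level representations are pooled into an utterance-level embedding. Both the frame features and the uncertainty values below are random placeholders for real model outputs, and the exp(-u) weighting is an illustrative choice.

```python
import numpy as np

# Toy frame-level representations and per-frame uncertainty estimates.
rng = np.random.default_rng(1)
frames = rng.normal(size=(100, 192))            # T frames x embedding dim
uncertainty = rng.uniform(0.1, 2.0, size=100)   # per-frame uncertainty

weights = np.exp(-uncertainty)                  # higher uncertainty -> lower weight
weights /= weights.sum()                        # normalize attention weights
utt_embedding = weights @ frames                # uncertainty-weighted pooling
```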
[440] Voice Privacy from an Attribute-based Perspective
Mehtab Ur Rahman, Martha Larson, Cristian Tejedor-Garcia
Main category: cs.SD
TL;DR: This paper introduces an attribute-based perspective for evaluating voice privacy, moving beyond signal-to-signal comparisons to analyze privacy protection through speaker attribute comparisons and uniqueness analysis.
Details
Motivation: Current voice privacy benchmarks focus on signal-to-signal comparisons, but this approach may not fully capture privacy risks. The authors propose that analyzing speaker attributes provides a more comprehensive view of privacy protection, especially considering that inferred attributes from speech could still reveal speaker identity even after anonymization.
Method: The paper introduces an attribute-based evaluation framework that: 1) calculates speaker uniqueness for ground truth attributes, attributes inferred from original speech, and attributes inferred from anonymized speech; 2) examines threat scenarios with single utterances per speaker; and 3) calculates attack error rates to measure privacy protection effectiveness.
Result: The research finds that inferred attributes still present privacy risks despite attribute inference errors. The analysis shows that current anonymization methods may not adequately protect against attribute-based attacks, highlighting gaps in existing voice privacy approaches.
Conclusion: Future voice privacy research should consider both attribute-related threats and protection mechanisms, moving beyond traditional signal-based evaluations to incorporate attribute-based perspectives for more comprehensive privacy assessment.
Abstract: Voice privacy approaches that preserve the anonymity of speakers modify speech in an attempt to break the link with the true identity of the speaker. Current benchmarks measure speaker protection based on signal-to-signal comparisons. In this paper, we introduce an attribute-based perspective, where we measure privacy protection in terms of comparisons between sets of speaker attributes. First, we analyze privacy impact by calculating speaker uniqueness for ground truth attributes, attributes inferred on the original speech, and attributes inferred on speech protected with standard anonymization. Next, we examine a threat scenario involving only a single utterance per speaker and calculate attack error rates. Overall, we observe that inferred attributes still present a risk despite attribute inference errors. Our research points to the importance of considering both attribute-related threats and protection mechanisms in future voice privacy research.
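One way to read "speaker uniqueness" is the share of speakers whose attribute combination occurs exactly once in the population. The speakers, attributes, and values below are invented for illustration, and the paper's exact uniqueness definition may differ.

```python
from collections import Counter

# Hypothetical attribute sets per speaker (all values invented).
speakers = {
    "spk1": ("female", "20-30", "native"),
    "spk2": ("male",   "20-30", "native"),
    "spk3": ("female", "20-30", "native"),   # shares spk1's combination
    "spk4": ("male",   "40-50", "non-native"),
}
counts = Counter(speakers.values())
# Fraction of speakers whose attribute combination is unique in the population.
uniqueness = sum(1 for attrs in speakers.values() if counts[attrs] == 1) / len(speakers)
```

Computing this score three times, for ground-truth attributes, attributes inferred on original speech, and attributes inferred on anonymized speech, gives the comparison the paper describes.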
cs.LG
[441] Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning
Janaka Chathuranga Brahmanage, Akshat Kumar
Main category: cs.LG
TL;DR: Offline safe RL algorithm using safety-conditioned reachability sets to decouple reward maximization from cumulative safety constraints, enabling safe policy learning from fixed datasets without environment interaction.
Details
Motivation: Real-world sequential decision making must balance reward maximization with safety constraints, but existing methods often lead to unstable min/max adversarial optimization. While reachability analysis offers promise for safety, most methods only handle hard constraints, not cumulative cost constraints.
Method: Defines a safety-conditioned reachability set that decouples reward maximization from cumulative safety constraints. Uses this set to enforce safety without unstable min/max or Lagrangian optimization, creating an offline safe RL algorithm that learns from fixed datasets without environment interaction.
Result: Experiments on standard offline safe RL benchmarks and a real-world maritime navigation task show the method matches or outperforms state-of-the-art baselines while maintaining safety constraints.
Conclusion: The proposed safety-conditioned reachability approach provides an effective framework for offline safe RL that avoids unstable optimization while ensuring safety constraints are satisfied.
Abstract: Sequential decision making using Markov Decision Processes underpins many real-world applications. Both model-based and model-free methods have achieved strong results in these settings. However, real-world tasks must balance reward maximization with safety constraints, often conflicting objectives that can lead to unstable min/max adversarial optimization. A promising alternative is safety reachability analysis, which precomputes a forward-invariant safe state-action set, ensuring that an agent starting inside this set remains safe indefinitely. Yet, most reachability-based methods address only hard safety constraints, and little work extends reachability to cumulative cost constraints. To address this, first, we define a safety-conditioned reachability set that decouples reward maximization from cumulative safety cost constraints. Second, we show how this set enforces safety constraints without unstable min/max or Lagrangian optimization, yielding a novel offline safe RL algorithm that learns a safe policy from a fixed dataset without environment interaction. Finally, experiments on standard offline safe RL benchmarks and a real-world maritime navigation task demonstrate that our method matches or outperforms state-of-the-art baselines while maintaining safety.
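A hedged sketch of one plausible reading of budget-conditioned action filtering: an action is admissible when its estimated cumulative safety cost fits within the remaining budget, and the policy maximizes reward only over admissible actions. Both "critics" below are placeholder arrays, not learned networks.

```python
import numpy as np

# Toy critics over 3 candidate actions at one state (values invented).
q_reward = np.array([5.0, 9.0, 7.0])   # estimated return per action
q_cost   = np.array([0.5, 3.0, 1.0])   # estimated cumulative safety cost
budget   = 1.5                         # remaining safety budget at this state

admissible = q_cost <= budget          # budget-conditioned safe action set
# Maximize reward over admissible actions only; no min/max adversarial term.
best_action = int(np.argmax(np.where(admissible, q_reward, -np.inf)))
```

Note how the highest-reward action (index 1) is rejected because it would exceed the budget, which is the decoupling of reward maximization from the safety constraint.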
[442] Efficient Embedding-based Synthetic Data Generation for Complex Reasoning Tasks
Srideepika Jayaraman, Achille Fokoue, Dhaval Patel, Jayant Kalagnanam
Main category: cs.LG
TL;DR: The paper proposes an embedding-based sampling pipeline for synthetic data generation using LLMs to improve diversity and performance of smaller models through targeted fine-tuning.
Details
Motivation: Synthetic Data Generation (SDG) using LLMs is effective for improving smaller models, but faces challenges in ensuring quality and diversity of generated data. The paper aims to address these challenges by analyzing data distribution in embedding space.
Method: Analyzes diversity and distribution of generated data in embedding space, identifies correlation between example density and prediction accuracy, and develops a targeted embedding-based sampling pipeline to enhance data diversity.
Result: Demonstrates strong correlation between density of examples in specific neighborhoods and prediction accuracy. The proposed pipeline consistently improves performance across several benchmarks by enhancing data diversity.
Conclusion: Embedding-based analysis reveals important patterns in synthetic data distribution, and targeted sampling based on these insights can effectively improve SDG quality and downstream model performance.
Abstract: Synthetic Data Generation (SDG), leveraging Large Language Models (LLMs), has recently been recognized and broadly adopted as an effective approach to improve the performance of smaller but more resource and compute efficient LLMs through fine-tuning. A key challenge in SDG is ensuring the quality and diversity of the generated data. In this paper, we analyze the diversity and distribution of generated data in the embedding space, and demonstrate a strong correlation between the density of examples within a specific neighborhood and the accuracy of predictions on examples drawn from that region. Building on this insight, we present a targeted pipeline for embedding-based sampling that enhances data diversity and consistently improves performance across several benchmarks.
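A hedged sketch of density-aware selection: estimate each candidate's local density from k-NN distances in embedding space, then sample preferentially from sparse (under-represented) regions to raise diversity. The embeddings are random stand-ins for encoder outputs, and this inverse-density rule is an illustrative choice, not the paper's exact pipeline.

```python
import numpy as np

# Toy candidate embeddings for 200 synthetic examples.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))

# Pairwise distances; mean distance to k nearest neighbors (skip self at 0).
dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
k = 5
knn_mean = np.sort(dists, axis=1)[:, 1:k + 1].mean(axis=1)

# Larger k-NN radius = sparser region = higher sampling probability.
probs = knn_mean / knn_mean.sum()
selected = rng.choice(len(emb), size=50, replace=False, p=probs)
```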
[443] Between the Layers Lies the Truth: Uncertainty Estimation in LLMs Using Intra-Layer Local Information Scores
Zvi N. Badash, Yonatan Belinkov, Moti Freiman
Main category: cs.LG
TL;DR: Proposes a compact uncertainty estimation method for LLMs that scores cross-layer agreement patterns in internal representations using a single forward pass, achieving strong performance across models and transfer scenarios.
Details
Motivation: LLMs are often confidently wrong, making reliable uncertainty estimation essential. Existing methods have trade-offs: output-based heuristics are cheap but brittle, while probing internal representations is effective but high-dimensional and hard to transfer across models/datasets.
Method: A compact, per-instance uncertainty estimation method that scores cross-layer agreement patterns in internal representations using a single forward pass. It examines specific layer-layer interactions to capture how models encode uncertainty.
Result: Across three models, matches probing in-distribution (mean diagonal differences ≤ -1.8 AUPRC percentage points, +4.9 Brier score points). Under cross-dataset transfer, consistently outperforms probing (off-diagonal gains up to +2.86 AUPRC, +21.02 Brier points). Under 4-bit weight-only quantization, remains robust (improves over probing by +1.94 AUPRC, +5.33 Brier points on average).
Conclusion: The method offers a lightweight, compact means to capture transferable uncertainty in LLMs, with analysis revealing differences in how disparate models encode uncertainty through layer-layer interactions.
Abstract: Large language models (LLMs) are often confidently wrong, making reliable uncertainty estimation (UE) essential. Output-based heuristics are cheap but brittle, while probing internal representations is effective yet high-dimensional and hard to transfer. We propose a compact, per-instance UE method that scores cross-layer agreement patterns in internal representations using a single forward pass. Across three models, our method matches probing in-distribution, with mean diagonal differences of at most $-1.8$ AUPRC percentage points and $+4.9$ Brier score points. Under cross-dataset transfer, it consistently outperforms probing, achieving off-diagonal gains up to $+2.86$ AUPRC and $+21.02$ Brier points. Under 4-bit weight-only quantization, it remains robust, improving over probing by $+1.94$ AUPRC points and $+5.33$ Brier points on average. Beyond performance, examining specific layer–layer interactions reveals differences in how disparate models encode uncertainty. Altogether, our UE method offers a lightweight, compact means to capture transferable uncertainty in LLMs.
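One simple instance of a cross-layer agreement score: cosine similarity between hidden states at successive layers, averaged into a single scalar, with low agreement read as high uncertainty. The hidden states below are random placeholders, and this particular score is an illustrative stand-in for the paper's intra-layer local information scores.

```python
import numpy as np

# Toy per-layer hidden states for one instance (12 layers, dim 768).
rng = np.random.default_rng(2)
hidden = rng.normal(size=(12, 768))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Agreement between each pair of successive layers; one forward pass suffices.
agreement = [cosine(hidden[i], hidden[i + 1]) for i in range(len(hidden) - 1)]
uncertainty_score = 1.0 - float(np.mean(agreement))   # compact per-instance score
```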
[444] ST-GDance++: A Scalable Spatial-Temporal Diffusion for Long-Duration Group Choreography
Jing Xu, Weiqiang Wang, Cunjian Chen, Jun Liu, Qiuhong Ke
Main category: cs.LG
TL;DR: ST-GDance++ is a scalable framework for group dance generation from music that decouples spatial and temporal dependencies to enable efficient, collision-aware choreography with reduced computational overhead.
Details
Motivation: Current group dance generation models struggle with interactive deployment due to quadratic attention computation growth with more dancers and longer sequences, leading to inefficiency and motion collisions. Existing methods fail to effectively model dense spatial-temporal interactions, limiting scalability and coordination.
Method: Proposes ST-GDance++ with spatial modeling using lightweight distance-aware graph convolutions to capture inter-dancer relationships, and temporal modeling using diffusion noise scheduling with efficient temporal-aligned attention masks for stream-based generation of long sequences.
Result: Experiments on AIOZ-GDance dataset show competitive generation quality with significantly reduced latency compared to existing methods.
Conclusion: The framework successfully addresses scalability challenges in group dance generation by decoupling spatial and temporal dependencies, enabling efficient and collision-aware choreography generation suitable for interactive applications.
Abstract: Group dance generation from music requires synchronizing multiple dancers while maintaining spatial coordination, making it highly relevant to applications such as film production, gaming, and animation. Recent group dance generation models have achieved promising generation quality, but they remain difficult to deploy in interactive scenarios due to bidirectional attention dependencies. As the number of dancers and the sequence length increase, the attention computation required for aligning music conditions with motion sequences grows quadratically, leading to reduced efficiency and increased risk of motion collisions. Effectively modeling dense spatial-temporal interactions is therefore essential, yet existing methods often struggle to capture such complexity, resulting in limited scalability and unstable multi-dancer coordination. To address these challenges, we propose ST-GDance++, a scalable framework that decouples spatial and temporal dependencies to enable efficient and collision-aware group choreography generation. For spatial modeling, we introduce lightweight distance-aware graph convolutions to capture inter-dancer relationships while reducing computational overhead. For temporal modeling, we design a diffusion noise scheduling strategy together with an efficient temporal-aligned attention mask, enabling stream-based generation for long motion sequences and improving scalability in long-duration scenarios. Experiments on the AIOZ-GDance dataset show that ST-GDance++ achieves competitive generation quality with significantly reduced latency compared to existing methods.
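A hedged sketch of a distance-aware graph convolution over dancers: edge weights decay exponentially with inter-dancer distance, so nearby dancers exchange more information (a soft collision-awareness signal). The positions, features, and the exp(-d) kernel are illustrative choices, not the paper's exact layer.

```python
import numpy as np

# Toy scene: 4 dancers with 2D stage positions and 8-dim motion features.
rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 5.0, size=(4, 2))
features = rng.normal(size=(4, 8))
weight = rng.normal(size=(8, 8)) * 0.1          # "learnable" projection (random here)

dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
adjacency = np.exp(-dist)                        # closer dancers -> larger edge weight
adjacency /= adjacency.sum(axis=1, keepdims=True)  # row-normalize
out = adjacency @ features @ weight              # one lightweight graph-conv layer
```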
[445] Precision-Varying Prediction (PVP): Robustifying ASR systems against adversarial attacks
Matías Pizarro, Raghavan Narasimhan, Asja Fischer
Main category: cs.LG
TL;DR: Random precision sampling during ASR inference improves adversarial robustness and enables attack detection via output comparison across precisions.
Details
Motivation: As automated speech recognition systems become more prevalent, their vulnerability to adversarial attacks poses security risks. The authors observed that changing model precision during inference affects attack success rates, suggesting this could be leveraged for defense.
Method: Proposes random precision sampling during ASR model inference to disrupt adversarial attacks. Also develops an attack detection method by comparing outputs from different precisions and using a Gaussian classifier on the differences.
Result: Experimental analysis shows significant robustness improvements and competitive detection performance across various ASR models and attack types.
Conclusion: Simple precision manipulation during inference provides effective defense against adversarial attacks on ASR systems, offering both robustness enhancement and detection capabilities.
Abstract: With the increasing deployment of automated and agentic systems, ensuring the adversarial robustness of automatic speech recognition (ASR) models has become critical. We observe that changing the precision of an ASR model during inference reduces the likelihood of adversarial attacks succeeding. We take advantage of this fact to make the models more robust by simple random sampling of the precision during prediction. Moreover, the insight can be turned into an adversarial example detection strategy by comparing outputs resulting from different precisions and leveraging a simple Gaussian classifier. An experimental analysis demonstrates a significant increase in robustness and competitive detection performance for various ASR models and attack types.
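The detection side of the scheme can be sketched as follows: run the same forward pass at two precisions and score the output discrepancy, on the premise that adversarial examples tend to be more precision-sensitive than benign ones. The "model" below is a toy linear layer, not a real ASR system.

```python
import numpy as np

# Toy model weights and input (random stand-ins for an ASR forward pass).
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 40)).astype(np.float32)
x = rng.normal(size=40).astype(np.float32)

# Same computation at full and reduced precision.
logits_fp32 = W @ x
logits_fp16 = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float32)
discrepancy = float(np.abs(logits_fp32 - logits_fp16).mean())
# In the paper's scheme, a simple Gaussian classifier fit on benign
# discrepancies would flag inputs whose discrepancy is an outlier.
```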
[446] Scaling Attention via Feature Sparsity
Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang
Main category: cs.LG
TL;DR: Sparse Feature Attention (SFA) reduces Transformer attention cost from O(n²d) to O(n²k²/d) using k-sparse codes for queries/keys, enabling efficient ultra-long context scaling with minimal accuracy loss.
Details
Motivation: Scaling Transformers to ultra-long contexts is bottlenecked by the quadratic cost of self-attention. Existing methods reduce cost along the sequence axis but degrade accuracy. This paper explores feature sparsity as an orthogonal approach to maintain accuracy while reducing computational cost.
Method: Proposes Sparse Feature Attention (SFA), where queries and keys are represented as k-sparse codes that preserve high-dimensional expressivity. Introduces FlashSFA, an IO-aware kernel extending FlashAttention to operate directly on sparse overlaps without materializing dense score matrices.
Result: SFA matches dense baselines while improving speed by up to 2.5× and reducing FLOPs and KV-cache by nearly 50%. Preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity.
Conclusion: Feature-level sparsity is a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss.
Abstract: Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as $k$-sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from $\Theta(n^2 d)$ to $\Theta(n^2 k^2/d)$. To make this efficient at scale, we introduce FlashSFA, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to $2.5\times$ and reducing FLOPs and KV-cache by nearly 50%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity. These results establish feature-level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss. Code is available at https://github.com/YannX1e/Sparse-Feature-Attention.
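A hedged sketch of where the k^2/d saving comes from: with queries and keys as k-sparse codes, a q.k dot product only touches the overlap of their supports. Top-k magnitude sparsification below is an illustrative choice of k-sparse code, not necessarily the one learned in the paper.

```python
import numpy as np

# Toy query/key vectors in dimension d, sparsified to k nonzeros each.
rng = np.random.default_rng(0)
d, k = 64, 8
q = rng.normal(size=d)
key = rng.normal(size=d)

def topk_sparse(v, k):
    idx = np.argsort(np.abs(v))[-k:]            # k largest-magnitude entries
    return {int(i): float(v[i]) for i in idx}   # index -> value code

q_sparse, k_sparse = topk_sparse(q, k), topk_sparse(key, k)
# Attention score touches only overlapping support indices (at most k of them).
score = sum(q_sparse[i] * k_sparse[i]
            for i in q_sparse.keys() & k_sparse.keys())
```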
[447] Latent Semantic Manifolds in Large Language Models
Mohamed A. Mabrok
Main category: cs.LG
TL;DR: LLMs operate in continuous vector spaces but output discrete tokens, creating a geometric mismatch. The paper develops a Riemannian framework where tokens partition a semantic manifold via Voronoi regions, introducing an “expressibility gap” to measure discretization distortion.
Details
Motivation: To understand the geometric consequences of the fundamental mismatch between LLMs’ continuous internal computations and discrete token outputs, which remains poorly understood despite being central to how LLMs work.
Method: Develops a mathematical framework interpreting LLM hidden states as points on a latent semantic manifold (Riemannian submanifold with Fisher information metric). Defines expressibility gap as geometric measure of semantic distortion from vocabulary discretization. Proves two theorems: rate-distortion lower bound on distortion, and linear volume scaling law via coarea formula. Validates across six transformer architectures (124M-1.5B parameters).
Result: Validated predictions across six transformer models, confirming universal hourglass intrinsic dimension profiles, smooth curvature structure, and linear gap scaling with slopes 0.87-1.12 (R^2 > 0.985). Margin distribution reveals persistent hard core of boundary-proximal representations invariant to scale, providing geometric decomposition of perplexity.
Conclusion: Provides geometric framework for understanding LLM tokenization, with implications for architecture design, model compression, decoding strategies, and scaling laws. Reveals universal geometric properties across model scales.
Abstract: Large Language Models (LLMs) perform internal computations in continuous vector spaces yet produce discrete tokens – a fundamental mismatch whose geometric consequences remain poorly understood. We develop a mathematical framework that interprets LLM hidden states as points on a latent semantic manifold: a Riemannian submanifold equipped with the Fisher information metric, where tokens correspond to Voronoi regions partitioning the manifold. We define the expressibility gap, a geometric measure of the semantic distortion from vocabulary discretization, and prove two theorems: a rate-distortion lower bound on distortion for any finite vocabulary, and a linear volume scaling law for the expressibility gap via the coarea formula. We validate these predictions across six transformer architectures (124M-1.5B parameters), confirming universal hourglass intrinsic dimension profiles, smooth curvature structure, and linear gap scaling with slopes 0.87-1.12 (R^2 > 0.985). The margin distribution across models reveals a persistent hard core of boundary-proximal representations invariant to scale, providing a geometric decomposition of perplexity. We discuss implications for architecture design, model compression, decoding strategies, and scaling laws.
[448] Behavioral Heterogeneity as Quantum-Inspired Representation
Mohammad Elayan, Wissam Kontar
Main category: cs.LG
TL;DR: Quantum-inspired representation models drivers as evolving latent states using density matrices, with behavioral observations embedded via Random Fourier Features and state evolution combining temporal persistence with context-dependent profile activation.
Details
Motivation: Current approaches to driver heterogeneity often reduce complex, dynamic driver behaviors to static labels or discrete regimes, which compresses the inherent dynamism of driving behavior. There's a need for more sophisticated representations that can capture the evolving nature of driver states and behaviors over time.
Method: The paper introduces a quantum-inspired representation that models each driver as an evolving latent state represented as a density matrix with structured mathematical properties. Behavioral observations are embedded using non-linear Random Fourier Features. State evolution blends temporal persistence of behavior with context-dependent profile activation.
Result: The approach is evaluated on empirical driving data and Third Generation Simulation Data (TGSIM), demonstrating how driving profiles can be extracted and analyzed using the proposed quantum-inspired representation.
Conclusion: The quantum-inspired representation provides a more nuanced way to model driver heterogeneity by capturing evolving latent states rather than reducing drivers to static categories, enabling better analysis of driving profiles and behaviors.
Abstract: Driver heterogeneity is often reduced to labels or discrete regimes, compressing what is inherently dynamic into static categories. We introduce a quantum-inspired representation that models each driver as an evolving latent state, represented as a density matrix with structured mathematical properties. Behavioral observations are embedded via non-linear Random Fourier Features, while state evolution blends temporal persistence of behavior with context-dependent profile activation. We evaluate our approach on empirical driving data, the Third Generation Simulation Data (TGSIM), showing how driving profiles are extracted and analyzed.
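The Random Fourier Features embedding used for behavioral observations is a standard kernel approximation (Rahimi & Recht); a minimal sketch, independent of the paper's density-matrix state evolution and assuming an RBF kernel with a chosen bandwidth:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, gamma = 3, 4096, 0.5  # input dim, number of random features, RBF bandwidth

# Random Fourier Features: phi(x).phi(y) ~= exp(-gamma * ||x - y||^2).
# The RBF kernel's spectral density is Gaussian with std sqrt(2 * gamma).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)

def phi(x):
    """Map a raw behavioral observation into the randomized feature space."""
    return np.sqrt(2.0 / D) * np.cos(x @ W + b)

x, y = rng.normal(size=d), rng.normal(size=d)
approx = phi(x) @ phi(y)                       # inner product in feature space
exact = np.exp(-gamma * np.sum((x - y) ** 2))  # true RBF kernel value
assert abs(approx - exact) < 0.15
```

The point of the trick: downstream models can work with explicit finite-dimensional features while behaving approximately like a kernel method.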
[449] Research on Individual Trait Clustering and Development Pathway Adaptation Based on the K-means Algorithm
Qianru Wei, Jihaoyu Yang, Cheng Zhang, Jinming Yang
Main category: cs.LG
TL;DR: Using K-means clustering on student data (CET-4 scores, GPA, personality traits, student cadre experiences) to provide personalized career guidance for college students.
Details
Motivation: Existing career guidance methods focus mainly on predicting career paths, but lack analysis of how students with different characteristic combinations fit specific career directions. The study aims to provide more personalized and accurate career guidance using clustering techniques.
Method: Analyzed data from over 3000 students including CET-4 scores, GPA, personality traits, and student cadre experiences. Applied K-means clustering algorithm to group students into four main clusters by minimizing intra-cluster squared error, ensuring high similarity within clusters and maximizing differences between clusters.
Result: Students were classified into four distinct groups based on their characteristics. The clustering results showed that students with different characteristic combinations are suitable for different career directions, providing scientific basis for personalized career guidance and effectively enhancing employment success rates.
Conclusion: K-means clustering provides effective personalized career guidance by identifying student groups with similar characteristics. Future research can improve precision by expanding sample size, adding feature variables, and considering external factors.
Abstract: With the development of information technology, the application of artificial intelligence and machine learning in the field of education shows great potential. This study aims to explore how to utilize K-means clustering algorithm to provide accurate career guidance for college students. Existing methods mostly focus on the prediction of career paths, but there are fewer studies on the fitness of students with different combinations of characteristics in specific career directions. In this study, we analyze the data of more than 3000 students on their CET-4 scores, GPA, personality traits and student cadre experiences, and use the K-means clustering algorithm to classify the students into four main groups. The K-means clustering algorithm groups students with similar characteristics into one group by minimizing the intra-cluster squared error, ensuring that the students within the same cluster are highly similar in their characteristics, and that differences between different clusters are maximized. Based on the clustering results, targeted career guidance suggestions are provided for each group. The results of the study show that students with different combinations of characteristics are suitable for different career directions, which provides a scientific basis for personalized career guidance and effectively enhances students’ employment success rate. Future research can further improve the precision of clustering and the guidance effect by expanding the sample size, increasing the feature variables and considering external factors.
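The Lloyd iteration behind K-means is easy to sketch; here on synthetic stand-ins for the four student-feature dimensions (the real CET-4/GPA data is not public, so the groups below are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical standardized features: [CET-4, GPA, personality, cadre years];
# four synthetic groups stand in for the paper's real student data.
centers_true = rng.normal(scale=3, size=(4, 4))
X = np.vstack([c + rng.normal(size=(50, 4)) for c in centers_true])

def kmeans(X, k=4, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]  # init centroids from data
    for _ in range(iters):
        # Assignment step: nearest centroid minimizes within-cluster SSE.
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        # Update step: centroid = mean of assigned points (keep old if empty).
        C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return labels, C

labels, C = kmeans(X)
sse = sum(((X[labels == j] - C[j]) ** 2).sum() for j in range(4))
```

Each group's centroid then summarizes a characteristic profile, which is what the paper maps to career-direction suggestions.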
[450] Sample Transform Cost-Based Training-Free Hallucination Detector for Large Language Models
Zeyang Ding, Xinglin Hu, Jicong Fan
Main category: cs.LG
TL;DR: A training-free hallucination detector for LLMs that uses Wasserstein distance between token embeddings of generated responses to measure distribution complexity as an indicator of truthfulness.
Details
Motivation: Hallucinations in LLMs remain a major obstacle to trustworthy deployment, creating need for accurate, lightweight, and broadly applicable detectors. The authors argue that the complexity of the conditional distribution defined by an LLM with a prompt can serve as an indicator of hallucination.
Method: Compute optimal-transport distances between sets of token embeddings from pairwise samples (generated responses). This yields a Wasserstein distance matrix that quantifies the distribution complexity. Two complementary signals are derived: AvgWD (average cost) and EigenWD (cost complexity). The approach is extended to black-box LLMs via teacher forcing with an accessible teacher model.
Result: Experiments show that AvgWD and EigenWD are competitive with strong uncertainty baselines and provide complementary behavior across models and datasets. The distribution complexity approach proves effective for detecting LLM hallucinations.
Conclusion: Distribution complexity measured via Wasserstein distances between token embeddings provides an effective signal for detecting hallucinations in LLMs, offering a training-free solution that works across different models and datasets.
Abstract: Hallucinations in large language models (LLMs) remain a central obstacle to trustworthy deployment, motivating detectors that are accurate, lightweight, and broadly applicable. Since an LLM with a prompt defines a conditional distribution, we argue that the complexity of the distribution is an indicator of hallucination. However, the density of the distribution is unknown and the samples (i.e., responses generated for the prompt) are discrete distributions, which leads to a significant challenge in quantifying the complexity of the distribution. We propose to compute the optimal-transport distances between the sets of token embeddings of pairwise samples, which yields a Wasserstein distance matrix measuring the costs of transforming between the samples. This Wasserstein distance matrix provides a means to quantify the complexity of the distribution defined by the LLM with the prompt. Based on the Wasserstein distance matrix, we derive two complementary signals: AvgWD, measuring the average cost, and EigenWD, measuring the cost complexity. This leads to a training-free detector for hallucinations in LLMs. We further extend the framework to black-box LLMs via teacher forcing with an accessible teacher model. Experiments show that AvgWD and EigenWD are competitive with strong uncertainty baselines and provide complementary behavior across models and datasets, highlighting distribution complexity as an effective signal for LLM truthfulness.
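A toy version of the Wasserstein distance matrix can be sketched with SciPy. Assumptions worth flagging: responses are treated as equal-length sets with uniform weights (so optimal transport reduces to an assignment problem), and the eigenvalue summary for EigenWD is an illustrative guess, since the abstract only describes it as "cost complexity".

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

def transport_cost(A, B):
    """Optimal-transport cost between two equal-size sets of token
    embeddings with uniform weights (an assignment problem)."""
    cost = np.linalg.norm(A[:, None] - B[None], axis=-1)
    r, c = linear_sum_assignment(cost)
    return cost[r, c].mean()

# Toy "responses": m samples, each a set of t token embeddings of dim d.
m, t, d = 5, 8, 16
responses = rng.normal(size=(m, t, d))

# Pairwise Wasserstein distance matrix over the sampled responses.
W = np.array([[transport_cost(responses[i], responses[j]) for j in range(m)]
              for i in range(m)])

avg_wd = W[~np.eye(m, dtype=bool)].mean()  # AvgWD: average transform cost
eigen_wd = np.linalg.eigvalsh(W).max()     # spectral summary (assumed form)
```

Higher AvgWD means the sampled responses are more mutually dissimilar, i.e. the conditional distribution is more complex, which the paper links to hallucination risk.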
[451] Mitigating Premature Discretization with Progressive Quantization for Robust Vector Tokenization
Wenhao Zhao, Qiran Zou, Zhouhan Lin, Dianbo Liu
Main category: cs.LG
TL;DR: ProVQ addresses premature discretization in VQ tokenization by introducing progressive quantization that gradually transitions from continuous to discrete latent spaces, improving performance across image generation and biological sequence modeling.
Details
Motivation: Current VQ methods suffer from premature discretization - they enforce discretization before the encoder has properly captured the underlying data manifold, leading to suboptimal tokenization for multimodal LLMs and diffusion models.
Method: Progressive Quantization (ProVQ) treats quantization as a curriculum that smoothly anneals from a continuous latent space to a discrete one, incorporating quantization hardness dynamics to guide codebooks toward well-expanded manifolds.
Result: ProVQ shows improved reconstruction and generative performance on ImageNet benchmarks, and establishes new state-of-the-art for protein structure tokenization on StrutTokenBench leaderboard.
Conclusion: ProVQ resolves the fundamental conflict of premature discretization in VQ tokenization, demonstrating broad effectiveness across diverse modalities including vision and biological sequences.
Abstract: Vector Quantization (VQ) has become the cornerstone of tokenization for many multimodal Large Language Models and diffusion synthesis. However, existing VQ paradigms suffer from a fundamental conflict: they enforce discretization before the encoder has captured the underlying data manifold. We term this phenomenon Premature Discretization. To resolve this, we propose Progressive Quantization (ProVQ), which incorporates the dynamics of quantization hardness as a fundamental yet previously overlooked axis in VQ training. By treating quantization as a curriculum that smoothly anneals from a continuous latent space to a discrete one, ProVQ effectively guides the codebook toward the well-expanded manifolds. Extensive experimental results demonstrate the broad effectiveness of ProVQ across diverse modalities. We report improved reconstruction and generative performance on the ImageNet-1K and ImageNet-100 benchmarks, highlighting ProVQ’s gains for generative modeling. Furthermore, ProVQ proves highly effective for modeling complex biological sequences, establishing a new performance ceiling for protein structure tokenization on the StrutTokenBench leaderboard.
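One minimal reading of "anneal from continuous to discrete" is a hardness-weighted interpolation between the latent and its nearest code; the paper's actual schedule, losses, and codebook updates are not specified in the abstract, so treat this as a sketch of the idea only:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))  # 32 codes, latent dim 8

def progressive_quantize(z, codebook, hardness):
    """Interpolate between the continuous latent and its nearest code.
    hardness=0 -> fully continuous, hardness=1 -> standard hard VQ.
    (An assumed illustration of the annealing curriculum.)"""
    idx = np.argmin(np.linalg.norm(codebook - z, axis=-1))
    return (1 - hardness) * z + hardness * codebook[idx]

z = rng.normal(size=8)
steps = np.linspace(0, 1, 5)  # hardness curriculum over training
outs = [progressive_quantize(z, codebook, h) for h in steps]

nearest = codebook[np.argmin(np.linalg.norm(codebook - z, axis=-1))]
assert np.allclose(outs[0], z)        # start of training: identity map
assert np.allclose(outs[-1], nearest) # end of training: hard VQ
```
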
[452] CN-Buzz2Portfolio: A Chinese-Market Dataset and Benchmark for LLM-Based Macro and Sector Asset Allocation from Daily Trending Financial News
Liyuan Chen, Shilong Li, Jiangpeng Yan, Shuoling Liu, Qiang Yang, Xiu Li
Main category: cs.LG
TL;DR: CN-Buzz2Portfolio is a Chinese-market benchmark for evaluating LLMs as financial agents, mapping daily trending financial news to macro/sector asset allocation, with a Tri-Stage CPA workflow testing ETF portfolio construction from trending narratives.
Details
Motivation: LLMs are evolving from static NLP tasks to dynamic financial decision-making agents, but face evaluation challenges: live trading is irreproducible and biased, while existing benchmarks focus only on entity-level stock picking and ignore broader market attention dynamics.
Method: Introduces CN-Buzz2Portfolio benchmark with Chinese market data mapping daily trending news to macro/sector asset allocation. Proposes Tri-Stage CPA (Compression, Perception, Allocation) Agent Workflow to evaluate LLMs on broad asset classes like ETFs rather than individual stocks.
Result: Extensive experiments on nine LLMs reveal significant disparities in how models translate macro-level narratives into portfolio weights, providing insights into alignment between general reasoning and financial decision-making.
Conclusion: The benchmark facilitates rigorous analysis of LLMs as financial agents, addressing evaluation challenges and promoting sustainable financial agent research through released data, codes, and experiments.
Abstract: Large Language Models (LLMs) are rapidly transitioning from static Natural Language Processing (NLP) tasks, including sentiment analysis and event extraction, to acting as dynamic decision-making agents in complex financial environments. However, the evolution of LLMs into autonomous financial agents faces a significant dilemma in evaluation paradigms. Direct live trading is irreproducible and prone to outcome bias by confounding luck with skill, whereas existing static benchmarks are often confined to entity-level stock picking and ignore broader market attention. To facilitate the rigorous analysis of these challenges, we introduce CN-Buzz2Portfolio, a reproducible benchmark grounded in the Chinese market that maps daily trending news to macro and sector asset allocation. Spanning a rolling horizon from 2024 to mid-2025, our dataset simulates a realistic public attention stream, requiring agents to distill investment logic from high-exposure narratives instead of pre-filtered entity news. We propose a Tri-Stage CPA Agent Workflow involving Compression, Perception, and Allocation to evaluate LLMs on broad asset classes such as Exchange Traded Funds (ETFs) rather than individual stocks, thereby reducing idiosyncratic volatility. Extensive experiments on nine LLMs reveal significant disparities in how models translate macro-level narratives into portfolio weights. This work provides new insights into the alignment between general reasoning and financial decision-making, and all data, codes, and experiments are released to promote sustainable financial agent research.
[453] Full waveform inversion method based on diffusion model
Caiyun Liu, Siyang Pei, Qingfeng Yu, Jie Xiong
Main category: cs.LG
TL;DR: A full-waveform inversion method using conditional diffusion models that incorporates density information as a conditional input to improve resolution and stability of subsurface model parameter estimation.
Details
Motivation: Seismic full-waveform inversion suffers from nonlinearity and local minima issues. Existing diffusion model approaches use unconditional processes, ignoring physical coupling between velocity and density properties.
Method: Proposes conditional diffusion model regularization by improving the U-Net backbone to incorporate 2D density information as conditional input, leveraging the physical relationship between velocity and density.
Result: The conditional diffusion model approach significantly improves resolution and structural fidelity of inversion results, shows stronger stability and robustness in complex scenarios, and effectively utilizes density constraints.
Conclusion: The method successfully integrates physical coupling relationships through conditional diffusion models, offering practical value for seismic inversion applications by improving accuracy and robustness.
Abstract: Seismic full-waveform inversion is a core technology for obtaining high-resolution subsurface model parameters. However, its highly nonlinear characteristics and strong dependence on the initial model often lead to the inversion process getting trapped in local minima. In recent years, generative diffusion models have provided a way to regularize full-waveform inversion by learning implicit prior distributions. However, existing methods mostly use unconditional diffusion processes, ignoring the inherent physical coupling relationship between velocity and density and other physical properties. This paper proposes a full-waveform inversion method based on conditional diffusion model regularization. By improving the backbone network structure of the diffusion model, two-dimensional density information is introduced as a conditional input into the U-Net network. Experimental results show that the full-waveform inversion method based on the conditional diffusion model significantly improves the resolution and structural fidelity of the inversion results, and exhibits stronger stability and robustness when dealing with complex situations. This method effectively utilizes density information to constrain the inversion and has good practical application value.
Keywords: Deep learning; Diffusion model; Full waveform inversion.
[454] Federated Learning for Data-Driven Feedforward Control: A Case Study on Vehicle Lateral Dynamics
Jakob Weber, Markus Gurtner, Benedikt Alt, Adrian Trachte, Andreas Kugi
Main category: cs.LG
TL;DR: Federated Learning framework for data-driven feedforward control design that enables collaborative training of neural controllers across distributed systems without sharing raw data.
Details
Motivation: Data-driven feedforward control improves tracking accuracy but requires large datasets and faces challenges with distributed data across multiple systems, including privacy concerns and technical difficulties with data transfer.
Method: Integrates Federated Learning into data-driven feedforward control design where each client trains neural feedforward controllers locally and shares only model updates (not raw data) for global aggregation.
Result: FL-based neural feedforward controller matches performance of centralized neural feedforward controller while reducing communication overhead and increasing data privacy in vehicle trajectory-tracking simulations.
Conclusion: Federated Learning provides an effective solution for collaborative data-driven feedforward control design that preserves privacy while maintaining performance comparable to centralized approaches.
Abstract: In many control systems, tracking accuracy can be enhanced by combining (data-driven) feedforward (FF) control with feedback (FB) control. However, designing effective data-driven FF controllers typically requires large amounts of high-quality data and a dedicated design-of-experiment process. In practice, relevant data are often distributed across multiple systems, which not only introduces technical challenges but also raises regulatory and privacy concerns regarding data transfer. To address these challenges, we propose a framework that integrates Federated Learning (FL) into the data-driven FF control design. Each client trains a data-driven, neural FF controller using local data and provides only model updates to the global aggregation process, avoiding the exchange of raw data. We demonstrate our method through simulation for a vehicle trajectory-tracking task. Therein, a neural FF controller is learned collaboratively using FL. Our results show that the FL-based neural FF controller matches the performance of the centralized neural FF controller while reducing communication overhead and increasing data privacy.
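The "share model updates, not raw data" aggregation can be sketched in FedAvg style; a linear least-squares fit stands in for the paper's neural FF controller, and the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each client fits a local linear feedforward model w on its private data
# (a linear stand-in for the neural FF controller) and shares only w.
def local_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

w_true = np.array([2.0, -1.0, 0.5])  # "ideal" FF gains for the toy plant
clients = []
for n in [40, 60, 100]:  # clients with different amounts of local data
    X = rng.normal(size=(n, 3))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    clients.append((n, local_fit(X, y)))

# FedAvg-style aggregation: weight each client's model by its sample
# count; raw trajectories never leave the client.
total = sum(n for n, _ in clients)
w_global = sum(n * w for n, w in clients) / total
assert np.allclose(w_global, w_true, atol=0.05)
```

In the paper this round (local training, then weighted aggregation) repeats over many communication rounds with neural network weights instead of a single parameter vector.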
[455] UniFluids: Unified Neural Operator Learning with Conditional Flow-matching
Haosen Li, Qi Meng, Jiahao Li, Rui Zhang, Ruihua Song, Liang Ma, Zhi-Ming Ma
Main category: cs.LG
TL;DR: UniFluids: A conditional flow-matching framework using a diffusion Transformer to unify learning of PDE solution operators across diverse dimensions and physical variables with parallel sequence generation.
Details
Motivation: Current PDE simulation methods using deep neural networks have great potential, but existing approaches lack unified frameworks that can handle diverse PDEs with varying dimensionality and physical variables while enabling parallel sequence generation.
Method: Uses conditional flow-matching framework with diffusion Transformer, introduces unified 4D spatiotemporal representation for heterogeneous PDE datasets, employs x-prediction in flow-matching operator learning based on finding that effective dimension of PDE datasets is lower than patch dimension.
Result: Achieves strong prediction accuracy on PDE datasets covering 1D, 2D, and 3D spatial dimensions, demonstrates good scalability and cross-scenario generalization capability.
Conclusion: UniFluids successfully unifies learning of PDE solution operators across diverse scenarios with parallel generation, showing promising results for scientific simulation applications.
Abstract: Partial differential equation (PDE) simulation holds extensive significance in scientific research. Currently, the integration of deep neural networks to learn solution operators of PDEs has introduced great potential. In this paper, we present UniFluids, a conditional flow-matching framework that harnesses the scalability of diffusion Transformer to unify learning of solution operators across diverse PDEs with varying dimensionality and physical variables. Unlike the autoregressive PDE foundation models, UniFluids adopts flow-matching to achieve parallel sequence generation, making it the first such approach for unified operator learning. Specifically, the introduction of a unified four-dimensional spatiotemporal representation for the heterogeneous PDE datasets enables joint training and conditional encoding. Furthermore, we find the effective dimension of the PDE dataset is much lower than its patch dimension. We thus employ $x$-prediction in the flow-matching operator learning, which is verified to significantly improve prediction accuracy. We conduct a large-scale evaluation of UniFluids on several PDE datasets covering spatial dimensions 1D, 2D and 3D. Experimental results show that UniFluids achieves strong prediction accuracy and demonstrates good scalability and cross-scenario generalization capability. The code will be released later.
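The $x$-prediction choice can be made concrete for a linear flow-matching path: predicting the clean sample $x_1$ is equivalent, up to a time-dependent rescaling, to predicting the velocity. A toy sketch, not the UniFluids architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=4)      # clean PDE field (target sample)
x0 = rng.normal(size=4)      # noise sample
t = 0.3
xt = (1 - t) * x0 + t * x1   # linear flow-matching interpolant

# With a linear path the true velocity is x1 - x0. A network trained with
# x-prediction outputs x_hat ~ x1; its implied velocity is recovered as
# v = (x_hat - xt) / (1 - t).
v_target = x1 - x0
x_hat = x1                   # a perfect x-predictor, for illustration
v_from_x = (x_hat - xt) / (1 - t)
assert np.allclose(v_from_x, v_target)
```

The paper's reported gain is empirical (x-prediction trains better when the data's effective dimension is low); the identity above just shows the two parameterizations describe the same flow.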
[456] A Multi-Modal CNN-LSTM Framework with Multi-Head Attention and Focal Loss for Real-Time Elderly Fall Detection
Lijie Zhou, Luran Wang
Main category: cs.LG
TL;DR: A multimodal deep learning framework for real-time elderly fall detection using wearable sensors, achieving high accuracy with low latency.
Details
Motivation: Addressing the growing need for reliable health monitoring for aging populations, particularly fall detection, by overcoming limitations of single-modality approaches and manual feature engineering in traditional methods.
Method: MultiModalFallDetector integrates tri-axial accelerometer, gyroscope, and physiological signals; uses multi-scale CNN feature extraction, multi-head self-attention for temporal weighting, Focal Loss for class imbalance, auxiliary activity classification for regularization, and transfer learning from UCI HAR to SisFall dataset.
Result: Achieves F1-score of 98.7, Recall of 98.9, and AUC-ROC of 99.4 on SisFall dataset, significantly outperforming baselines with sub-50ms inference latency on edge devices.
Conclusion: The proposed multimodal deep learning framework provides accurate, real-time fall detection suitable for deployment in geriatric care settings, addressing critical healthcare needs for elderly populations.
Abstract: The increasing global aging population has intensified the demand for reliable health monitoring systems, particularly those capable of detecting critical events such as falls among elderly individuals. Traditional fall detection approaches relying on single-modality acceleration data suffer from high false alarm rates, while conventional machine learning methods require extensive hand-crafted feature engineering. This paper proposes a novel multi-modal deep learning framework, MultiModalFallDetector, designed for real-time elderly fall detection using wearable sensors. Our approach integrates multiple innovations: a multi-scale CNN-based feature extractor capturing motion dynamics at varying temporal resolutions; fusion of tri-axial accelerometer, gyroscope, and four-channel physiological signals; incorporation of a multi-head self-attention mechanism for dynamic temporal weighting; adoption of Focal Loss to mitigate severe class imbalance; introduction of an auxiliary activity classification task for regularization; and implementation of transfer learning from UCI HAR to SisFall dataset. Extensive experiments on the SisFall dataset, which includes real-world simulated fall trials from elderly participants (aged 60-85), demonstrate that our framework achieves an F1-score of 98.7, Recall of 98.9, and AUC-ROC of 99.4, significantly outperforming baseline methods including traditional machine learning and standard deep learning approaches. The model maintains sub-50 ms inference latency on edge devices, confirming its suitability for real-time deployment in geriatric care settings.
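Focal Loss (Lin et al.) is the standard remedy the paper adopts for the rare-fall class imbalance; a sketch of the binary form, with the common gamma=2, alpha=0.25 settings assumed rather than taken from the paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples by (1 - p_t)^gamma
    so that rare, hard 'fall' examples dominate the gradient."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)          # prob of the true class
    at = np.where(y == 1, alpha, 1 - alpha)  # class-balance weight
    return -(at * (1 - pt) ** gamma * np.log(pt)).mean()

y = np.array([1, 0])
easy = focal_loss(np.array([0.95, 0.05]), y)  # confident predictions
hard = focal_loss(np.array([0.55, 0.45]), y)  # uncertain predictions
assert hard > easy

# Compared with plain cross-entropy, easy examples are damped far more.
ce_easy = -np.log(0.95)
assert easy < 0.25 * ce_easy
```
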
[457] Enhancing AI-Based Tropical Cyclone Track and Intensity Forecasting via Systematic Bias Correction
Peisong Niu, Haifan Zhang, Yang Zhao, Tian Zhou, Ziqing Ma, Wenqiang Shen, Junping Zhao, Huiling Yuan, Liang Sun
Main category: cs.LG
TL;DR: BaguanCyclone is an AI-based framework for tropical cyclone forecasting that improves both track and intensity predictions through probabilistic center refinement and region-aware intensity forecasting modules.
Details
Motivation: Current AI-based weather forecasting systems for tropical cyclones have limitations: they use coarse-resolution data causing discretization errors in track prediction, and struggle with intensity forecasting especially for strong TCs due to smoothing effects and regression loss biases.
Method: Proposes a unified framework with two innovations: (1) probabilistic center refinement module that models continuous spatial distribution of TC centers for finer track precision, and (2) region-aware intensity forecasting module that uses high-resolution internal representations within dynamically defined sub-grid zones around TC cores to capture localized extremes.
Result: Evaluated on global IBTrACS dataset across six major TC basins, consistently outperforms both operational NWP models and most AI-based baselines, delivering substantial enhancement in forecast accuracy. Excels in complex meteorological scenarios like re-intensification, sweeping arcs, twin cyclones, and meandering events.
Conclusion: BaguanCyclone provides a novel approach to tropical cyclone forecasting that addresses key limitations of existing methods, offering improved accuracy for both track and intensity predictions through its innovative modules.
Abstract: Tropical cyclones (TCs) pose severe threats to life, infrastructure, and economies in tropical and subtropical regions, underscoring the critical need for accurate and timely forecasts of both track and intensity. Recent advances in AI-based weather forecasting have shown promise in improving TC track forecasts. However, these systems are typically trained on coarse-resolution reanalysis data (e.g., ERA5 at 0.25 degree), which constrains predicted TC positions to a fixed grid and introduces significant discretization errors. Moreover, intensity forecasting remains limited, especially for strong TCs, by the smoothing effect of coarse meteorological fields and the use of regression losses that bias predictions toward conditional means. To address these limitations, we propose BaguanCyclone, a novel, unified framework that integrates two key innovations: (1) a probabilistic center refinement module that models the continuous spatial distribution of TC centers, enabling finer track precision; and (2) a region-aware intensity forecasting module that leverages high-resolution internal representations within dynamically defined sub-grid zones around the TC core to better capture localized extremes. Evaluated on the global IBTrACS dataset across six major TC basins, our system consistently outperforms both operational numerical weather prediction (NWP) models and most AI-based baselines, delivering a substantial enhancement in forecast accuracy. Remarkably, BaguanCyclone excels in navigating meteorological complexities, consistently delivering accurate forecasts for re-intensification, sweeping arcs, twin cyclones, and meandering events. Our code is available at https://github.com/DAMO-DI-ML/Baguan-cyclone.
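One simple way a probabilistic center distribution escapes the grid is a soft-argmax over a predicted heatmap: the expected latitude/longitude is continuous, not snapped to a cell. This is an assumed minimal reading of "probabilistic center refinement", with made-up coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logits for the TC-center heatmap over a small coarse-grid patch.
H = rng.normal(size=(9, 9))
p = np.exp(H - H.max())
p /= p.sum()  # normalize to a probability distribution over cells

# Soft-argmax: the expectation of position under the predicted
# distribution yields a continuous center estimate between grid points.
lats = np.linspace(20.0, 22.0, 9)   # hypothetical patch latitudes
lons = np.linspace(130.0, 132.0, 9) # hypothetical patch longitudes
lat = (p.sum(axis=1) * lats).sum()  # marginal over rows -> expected lat
lon = (p.sum(axis=0) * lons).sum()  # marginal over cols -> expected lon
assert 20.0 <= lat <= 22.0 and 130.0 <= lon <= 132.0
```
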
[458] Emergency Preemption Without Online Exploration: A Decision Transformer Approach
Haoran Su, Hanxiao Deng, Yandong Sun
Main category: cs.LG
TL;DR: Decision Transformer-based framework for emergency vehicle corridor optimization using return-conditioned sequence modeling to control dispatch urgency and coordinate multi-agent systems via graph attention.
Details
Motivation: Current emergency vehicle signal preemption strategies are reactive and uncontrollable, leading to suboptimal response times. There's a need for a controllable, offline learning approach that can optimize emergency corridors while minimizing civilian disruption.
Method: Proposes a return-conditioned framework based on Decision Transformer for offline sequence modeling of corridor optimization. Uses target-return scalar for dispatch-level urgency control, extends to multi-agent settings with Multi-Agent Decision Transformer (MADT) using graph attention for spatial coordination, and includes Constrained DT variant for explicit civilian disruption budgets.
Result: DT reduces average EV travel time by 37.7% vs fixed-timing preemption (88.6s vs 142.3s) on 4x4 grid, achieving lowest civilian delay (11.3s/veh) and fewest EV stops (1.2). MADT improves further on larger grids with 45.2% reduction on 8x8 grid. Return conditioning enables smooth trade-off between EV travel time (72.4-138.2s) and civilian delay (16.8-5.4s/veh) without retraining.
Conclusion: The return-conditioned Decision Transformer framework provides an effective, controllable solution for emergency corridor optimization that outperforms existing methods, enables dispatch-level urgency control through simple scalar adjustments, and scales to multi-agent coordination via graph attention mechanisms.
Abstract: Emergency vehicle (EV) response time is a critical determinant of survival outcomes, yet deployed signal preemption strategies remain reactive and uncontrollable. We propose a return-conditioned framework for emergency corridor optimization based on the Decision Transformer (DT). By casting corridor optimization as offline, return-conditioned sequence modeling, our approach (1) eliminates online environment interaction during policy learning, (2) enables dispatch-level urgency control through a single target-return scalar, and (3) extends to multi-agent settings via a Multi-Agent Decision Transformer (MADT) with graph attention for spatial coordination. On the LightSim simulator, DT reduces average EV travel time by 37.7% relative to fixed-timing preemption on a 4x4 grid (88.6 s vs. 142.3 s), achieving the lowest civilian delay (11.3 s/veh) and fewest EV stops (1.2) among all methods, including online RL baselines that require environment interaction. MADT further improves on larger grids, overtaking DT with 45.2% reduction on 8x8 via graph-attention coordination. Return conditioning produces a smooth dispatch interface: varying the target return from 100 to -400 trades EV travel time (72.4-138.2 s) against civilian delay (16.8-5.4 s/veh), requiring no retraining. A Constrained DT extension adds explicit civilian disruption budgets as a second control knob.
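The return-conditioning interface boils down to how the input sequence is built: returns-to-go, states, and actions interleaved, with the operator setting only the first target return to dial urgency. A toy sketch with placeholder observation/action names:

```python
import numpy as np

def to_dt_tokens(returns_to_go, states, actions):
    """Interleave (R, s, a) triples into the token sequence a Decision
    Transformer consumes; at dispatch time only the leading target
    return is set by the operator."""
    seq = []
    for R, s, a in zip(returns_to_go, states, actions):
        seq += [("R", R), ("s", s), ("a", a)]
    return seq

# Toy 4-step trajectory: per-step rewards -> returns-to-go (suffix sums).
rewards = np.array([-10.0, -5.0, -2.0, 0.0])
rtg = np.cumsum(rewards[::-1])[::-1]
states = [f"obs{t}" for t in range(4)]     # placeholder intersection states
actions = [f"phase{t}" for t in range(4)]  # placeholder signal phases

seq = to_dt_tokens(rtg, states, actions)
assert [k for k, _ in seq[:3]] == ["R", "s", "a"]
assert seq[0][1] == rewards.sum()  # leading token is the target return
```

Lowering that leading scalar (e.g. from 100 toward -400 in the paper's experiments) asks the trained policy for less aggressive preemption, with no retraining.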
[459] Geometric Mixture-of-Experts with Curvature-Guided Adaptive Routing for Graph Representation Learning
Haifang Cao, Yu Wang, Timing Li, Xinjie Yao, Pengfei Zhu
Main category: cs.LG
TL;DR: GeoMoE: Geometric Mixture-of-Experts framework that fuses node representations across Riemannian spaces using Ollivier-Ricci Curvature as geometric prior for graph-structured data.
Details
Motivation: Graph data exhibits complex topological heterogeneity that cannot be accurately modeled within a single Riemannian manifold. Existing mixed-curvature methods lack fundamental geometric grounding and rely on implicit, task-driven routing.
Method: Proposes GeoMoE framework with: 1) Graph-aware gating network for node-specific fusion weights, 2) Curvature-guided alignment loss for interpretable routing, 3) Curvature-aware contrastive objective for geometric discriminability using positive/negative pairs based on curvature consistency.
Result: Extensive experiments on six benchmark datasets show GeoMoE outperforms state-of-the-art baselines across diverse graph types.
Conclusion: GeoMoE effectively captures multi-scale topological structures in graph data by adaptively fusing representations across Riemannian spaces using intrinsic geometric priors, providing interpretable and geometry-consistent routing.
Abstract: Graph-structured data typically exhibits complex topological heterogeneity, making it difficult to model accurately within a single Riemannian manifold. While emerging mixed-curvature methods attempt to capture such diversity, they often rely on implicit, task-driven routing that lacks fundamental geometric grounding. To address this challenge, we propose a Geometric Mixture-of-Experts framework (GeoMoE) that adaptively fuses node representations across diverse Riemannian spaces to better accommodate multi-scale topological structures. At its core, GeoMoE leverages Ollivier-Ricci Curvature (ORC) as an intrinsic geometric prior to orchestrate the collaboration of specialized experts. Specifically, we design a graph-aware gating network that assigns node-specific fusion weights, regularized by a curvature-guided alignment loss to ensure interpretable and geometry-consistent routing. Additionally, we introduce a curvature-aware contrastive objective that promotes geometric discriminability by constructing positive and negative pairs according to curvature consistency. Extensive experiments on six benchmark datasets demonstrate that GeoMoE outperforms state-of-the-art baselines across diverse graph types.
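The gating-plus-alignment idea can be sketched in a few lines. This is an illustrative toy (`fuse_experts`, the prior derivation, and the weighting `lam` are our assumptions, not the paper's implementation), assuming the ORC-based routing prior has already been computed per node:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_experts(expert_outs, gate_logits, curvature_prior, lam=0.1):
    """Mix per-manifold expert embeddings for one node (toy sketch).

    expert_outs: (n_experts, d) embeddings from each Riemannian expert.
    curvature_prior: (n_experts,) routing prior derived from ORC; the
    alignment loss pulls the learned gate toward this geometric prior.
    """
    w = softmax(gate_logits)                       # node-specific fusion weights
    fused = w @ expert_outs                        # weighted mixture of experts
    align_loss = lam * np.mean((w - curvature_prior) ** 2)
    return fused, align_loss
```

The alignment term is what makes the routing interpretable: the gate is free to deviate from the curvature prior, but pays a penalty proportional to the deviation.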
[460] A graph neural network based chemical mechanism reduction method for combustion applications
Manuru Nithin Padiyar, Priyabrat Dash, Konduri Aditya
Main category: cs.LG
TL;DR: GNN-based chemical mechanism reduction methods for turbulent reacting flows using message-passing transformers to learn species/reaction dependencies.
Details
Motivation: Direct numerical simulations of turbulent reacting flows with detailed chemical mechanisms (hundreds of species, thousands of reactions) are computationally prohibitive, requiring automated reduction methods.
Method: Two GNN formulations: 1) GNN-SM uses pre-trained surrogate model for broad condition reduction, 2) GNN-AE uses autoencoder for highly compact mechanisms within training regimes.
Result: Tested on methane, ethylene, iso-octane mechanisms. GNN-SM achieves DRGEP-comparable reductions with broad accuracy. GNN-AE achieves up to 95% reduction and outperforms DRGEP within target conditions.
Conclusion: GNN framework provides automated, ML-based chemical mechanism reduction that complements traditional expert-guided approaches.
Abstract: Direct numerical simulations of turbulent reacting flows involving millions of grid points and detailed chemical mechanisms with hundreds of species and thousands of reactions are computationally prohibitive. To address this challenge, we present two data-driven chemical mechanism reduction formulations based on graph neural networks (GNNs) with message-passing transformer layers that learn nonlinear dependencies among species and reactions. The first formulation, GNN-SM, employs a pre-trained surrogate model to guide reduction across a broad range of reactor conditions. The second formulation, GNN-AE, uses an autoencoder formulation to obtain highly compact mechanisms that remain accurate within the thermochemical regimes used during training. The approaches are demonstrated on detailed mechanisms for methane (53 species, 325 reactions), ethylene (96 species, 1054 reactions), and iso-octane (1034 species, 8453 reactions). GNN-SM achieves reductions comparable to the established graph-based method DRGEP while maintaining accuracy across a wide range of thermochemical states. In contrast, GNN-AE achieves up to 95% reduction in species and reactions and outperforms DRGEP within its target conditions. Overall, the proposed framework provides an automated, machine-learning-based pathway for chemical mechanism reduction that can complement traditional expert-guided analytical approaches.
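Whatever the GNN learns, the final step of any reduction pipeline is a selection over ranked species. A toy post-processing sketch (hypothetical; the paper's actual selection criterion is not specified in the summary), assuming per-species importance scores from the trained model:

```python
def select_species(importance, keep_fraction):
    """Keep the top fraction of species by learned importance score and
    drop the rest from the mechanism (illustrative selection step only)."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    k = max(1, round(len(ranked) * keep_fraction))  # always keep at least one
    return set(ranked[:k])
```

For the iso-octane mechanism (1034 species), a `keep_fraction` of 0.05 would correspond to the "up to 95% reduction" figure reported for GNN-AE.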
[461] Sparsely-Supervised Data Assimilation via Physics-Informed Schrödinger Bridge
Dohyun Bu, Chanho Kim, Seokun Choi, Jong-Seok Lee
Main category: cs.LG
TL;DR: PICSB is a physics-informed conditional Schrödinger Bridge method for fast spatiotemporal field reconstruction in data assimilation, using low-fidelity priors without requiring high-fidelity supervision or test-time optimization.
Details
Motivation: Traditional data assimilation for PDE-governed systems requires computationally expensive per-instance test-time optimization, and existing amortized approaches need full high-fidelity supervision during training which is impractical in real-world settings.
Method: Proposes Physics-Informed Conditional Schrödinger Bridge (PICSB) that transports low-fidelity priors to observation-conditioned high-fidelity posteriors using iterative surrogate-endpoint refresh scheme, incorporates PDE residuals into training objective, and enforces observations via hard conditioning during sampling.
Result: Experiments on fluid PDE benchmarks show PICSB enables extremely fast spatiotemporal field reconstruction while maintaining competitive accuracy under sparse high-fidelity supervision.
Conclusion: PICSB provides an efficient solution for data assimilation that avoids test-time optimization and full high-fidelity supervision requirements, making it practical for time-critical applications.
Abstract: Data assimilation (DA) for systems governed by partial differential equations (PDE) aims to reconstruct full spatiotemporal fields from sparse high-fidelity (HF) observations while respecting physical constraints. While full-grid low-fidelity (LF) simulations provide informative priors in multi-fidelity settings, recovering an HF field consistent with both sparse observations and the governing PDE typically requires per-instance test-time optimization, which becomes a major bottleneck in time-critical applications. To alleviate this, amortized reconstruction using generative models has recently been proposed; however, such approaches rely on full-field HF supervision during training, which is often impractical in real-world settings. From a more realistic perspective, we propose the Physics-Informed Conditional Schrödinger Bridge (PICSB), which transports an informative LF prior toward an observation-conditioned HF posterior without any additional inference-time guidance. To enable learning without HF endpoints, PICSB employs an iterative surrogate-endpoint refresh scheme, and directly incorporates PDE residuals into the training objective while enforcing observations via hard conditioning throughout sampling. Experiments on fluid PDE benchmarks demonstrate that PICSB enables extremely fast spatiotemporal field reconstruction while maintaining competitive accuracy under sparse HF supervision.
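Two ingredients named in the abstract, hard conditioning and PDE residuals, are simple to state concretely. A minimal sketch (our own toy example on the 1-D heat equation, not the paper's fluid benchmarks):

```python
import numpy as np

def hard_condition(field, obs_idx, obs_vals):
    # Enforce sparse high-fidelity observations exactly; applied at every
    # sampling step so the generated field never drifts from the data.
    out = field.copy()
    out[obs_idx] = obs_vals
    return out

def heat_residual(u, dt, dx, nu):
    # Finite-difference residual of u_t = nu * u_xx on a space-time grid
    # u[t, x]; a physics-informed loss would penalize its squared norm.
    u_t = (u[1:, 1:-1] - u[:-1, 1:-1]) / dt
    u_xx = (u[:-1, 2:] - 2 * u[:-1, 1:-1] + u[:-1, :-2]) / dx**2
    return u_t - nu * u_xx
```

Hard conditioning makes observation consistency exact by construction, while the residual term only softly penalizes PDE violations, which mirrors the asymmetry described in the method summary.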
[462] Bridging the Gap Between Climate Science and Machine Learning in Climate Model Emulation
Luca Schmidt, Nina Effenberger
Main category: cs.LG
TL;DR: A framework for integrating climate science and machine learning to create accessible, reliable climate model emulators that overcome adoption barriers.
Details
Motivation: Climate models are computationally expensive and technically demanding, while ML emulators face adoption barriers including limited accessibility, lack of specialized knowledge, and mistrust of non-physical ML methods.
Method: Proposes an integration framework combining climate science and ML perspectives, focusing on designing easy-to-adopt emulators for clearly defined tasks and demonstrating their reliability.
Result: The framework offers a promising path for bridging the gap between climate science and machine learning fields by addressing adoption barriers
Conclusion: Integrating climate science and ML perspectives through accessible, task-specific emulators with demonstrated reliability can overcome current adoption barriers
Abstract: While climate models provide insights for climate decision-making, their use is constrained by significant computational and technical demands. Although machine learning (ML) emulators offer a way to bypass the high computational costs, their effective use remains challenging. The hurdles are diverse, ranging from limited accessibility and a lack of specialized knowledge to a general mistrust of ML methods that are perceived as insufficiently physical. Here, we introduce a framework to overcome these barriers by integrating both climate science and machine learning perspectives. We find that designing easy-to-adopt emulators that address a clearly defined task and demonstrating their reliability offers a promising path for bridging the gap between our two fields.
[463] AEGIS: An Operational Infrastructure for Post-Market Governance of Adaptive Medical AI Under US and EU Regulations
Fardin Afdideh, Mehdi Astaraki, Fernando Seoane, Farhad Abtahi
Main category: cs.LG
TL;DR: AEGIS is a governance framework for medical AI systems that operationalizes regulatory change-control mechanisms (FDA PCCP, EU AI Act) through three modules for dataset assimilation/retraining, model monitoring, and conditional decisions, enabling safe continuous learning across diverse clinical applications.
Details
Motivation: Medical AI systems need governance frameworks that ensure safety while allowing continuous improvement. Regulatory bodies have introduced mechanisms like Predetermined Change Control Plans (PCCP) and Post-Market Surveillance (PMS), but these need to be operationalized into executable procedures for practical implementation.
Method: AEGIS framework with three modules: 1) dataset assimilation and retraining, 2) model monitoring, and 3) conditional decision. It implements a four-category deployment decision taxonomy (APPROVE, CONDITIONAL APPROVAL, CLINICAL REVIEW, REJECT) with independent PMS ALARM signals to detect critical states where no deployable model exists while the released model is at risk.
Result: Tested on sepsis prediction (EHR) and brain tumor segmentation (medical imaging) with identical governance architecture. In sepsis example across 11 simulated iterations: 8 APPROVE, 1 CONDITIONAL APPROVAL, 1 CLINICAL REVIEW, 1 REJECT decisions. ALARM signals co-issued at iterations 8 and 10, detecting drift before observable performance degradation.
Conclusion: AEGIS successfully translates regulatory change-control concepts into executable governance procedures, supporting safe continuous learning for adaptive medical AI across diverse clinical applications, demonstrating practical implementation of FDA PCCP and EU AI Act provisions.
Abstract: Machine learning systems deployed in medical devices require governance frameworks that ensure safety while enabling continuous improvement. Regulatory bodies including the FDA and European Union have introduced mechanisms such as the Predetermined Change Control Plan (PCCP) and Post-Market Surveillance (PMS) to manage iterative model updates without repeated submissions. This paper presents AI/ML Evaluation and Governance Infrastructure for Safety (AEGIS), a governance framework applicable to any healthcare AI system. AEGIS comprises three modules, i.e., dataset assimilation and retraining, model monitoring, and conditional decision, that operationalize FDA PCCP and EU AI Act Article 43(4) provisions. We implement a four-category deployment decision taxonomy (APPROVE, CONDITIONAL APPROVAL, CLINICAL REVIEW, REJECT) with an independent PMS ALARM signal, enabling detection of the critical state in which no deployable model exists while the released model is simultaneously at risk. To illustrate how AEGIS can be instantiated across heterogeneous clinical contexts, we provide two examples: sepsis prediction from electronic health records and brain tumor segmentation from medical imaging. Both cases use identical governance architecture, differing only in configuration. Across 11 simulated iterations on the sepsis example, AEGIS yielded 8 APPROVE, 1 CONDITIONAL APPROVAL, 1 CLINICAL REVIEW, and 1 REJECT decision, exercising all four categories. ALARM signals were co-issued at iterations 8 and 10, including the critical state where no deployable model exists and the released model is simultaneously failing. AEGIS detected drift before observable performance degradation. These results demonstrate that AEGIS translates regulatory change-control concepts into executable governance procedures, supporting safe continuous learning for adaptive medical AI across diverse clinical applications.
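The four-category taxonomy and the independent ALARM signal can be sketched as plain decision logic. The thresholds and the drift flag below are hypothetical placeholders (AEGIS is configuration-specific); only the category names and the ALARM semantics come from the paper:

```python
def deployment_decision(metric, drift_alarm, approve_thr=0.85, cond_thr=0.80):
    """Hypothetical rule illustrating the four-category taxonomy; the
    actual criteria and thresholds are set per clinical configuration."""
    if drift_alarm:
        return "CLINICAL REVIEW" if metric >= cond_thr else "REJECT"
    if metric >= approve_thr:
        return "APPROVE"
    if metric >= cond_thr:
        return "CONDITIONAL APPROVAL"
    return "REJECT"

def pms_alarm(candidate_decision, released_model_at_risk):
    # Critical state the paper highlights: no deployable candidate exists
    # while the released model is simultaneously at risk.
    return candidate_decision == "REJECT" and released_model_at_risk
```

Keeping the ALARM independent of the deployment decision is what lets the framework flag drift before performance degradation becomes observable, as reported at iterations 8 and 10 in the sepsis example.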
[464] A Multi-Task Targeted Learning Framework for Lithium-Ion Battery State-of-Health and Remaining Useful Life
Chenhan Wang, Zhengyi Bao, Huipin Lin, Jiahao Nie, Chunxiang Zhu
Main category: cs.LG
TL;DR: Multi-task learning framework combining multi-scale CNNs, improved LSTM, and dual-stream attention for battery SOH and RUL prediction with Hyperopt optimization.
Details
Motivation: Current deep learning methods for battery health prediction have limitations in feature selection and temporal dependency modeling, with most relying on traditional RNNs that struggle with long-term time-series modeling.
Method: Multi-task framework with: 1) multi-scale CNN feature extraction for local battery decline patterns, 2) improved extended LSTM for long-term temporal information, 3) dual-stream attention (polarized + sparse) to focus on key SOH/RUL information, and 4) Hyperopt for automated hyperparameter optimization.
Result: Extensive experiments show the method reduces average RMSE for SOH and RUL predictions by 111.3% and 33.0% respectively compared to traditional and state-of-the-art methods.
Conclusion: The proposed framework effectively addresses limitations in current battery health prediction methods by combining multi-scale feature extraction, improved temporal modeling, and attention mechanisms for selective feature focus.
Abstract: Accurately predicting the state-of-health (SOH) and remaining useful life (RUL) of lithium-ion batteries is crucial for ensuring the safe and efficient operation of electric vehicles while minimizing associated risks. However, current deep learning methods are limited in their ability to selectively extract features and model time dependencies for these two parameters. Moreover, most existing methods rely on traditional recurrent neural networks, which have inherent shortcomings in long-term time-series modeling. To address these issues, this paper proposes a multi-task targeted learning framework for SOH and RUL prediction, which integrates multiple neural networks, including a multi-scale feature extraction module, an improved extended LSTM, and a dual-stream attention module. First, a feature extraction module with multi-scale CNNs is designed to capture detailed local battery decline patterns. Secondly, an improved extended LSTM network is employed to enhance the model’s ability to retain long-term temporal information, thus improving temporal relationship modeling. Building on this, the dual-stream attention module, comprising polarized attention and sparse attention, selectively focuses on key information relevant to SOH and RUL, respectively, by assigning higher weights to important features. Finally, a many-to-two mapping is achieved through the dual-task layer. To optimize the model’s performance and reduce the need for manual hyperparameter tuning, the Hyperopt optimization algorithm is used. Extensive comparative experiments on battery aging datasets demonstrate that the proposed method reduces the average RMSE for SOH and RUL predictions by 111.3% and 33.0%, respectively, compared to traditional and state-of-the-art methods.
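The intent of the multi-scale extraction stage can be illustrated without a full CNN. A toy stand-in (our illustration only, not the paper's architecture): smooth a capacity-fade curve at several window sizes so that both local fluctuations and the slow decline trend are represented in the feature stack:

```python
import numpy as np

def multiscale_features(signal, windows=(3, 7, 15)):
    """Moving-average smoothing at several scales, mimicking what
    multi-scale convolution kernels capture (illustrative sketch)."""
    feats = []
    for w in windows:
        kernel = np.ones(w) / w                      # uniform smoothing kernel
        feats.append(np.convolve(signal, kernel, mode="same"))
    return np.stack(feats)                           # shape: (n_scales, T)
```

Narrow windows preserve cycle-to-cycle capacity fluctuations; wide windows isolate the long-term degradation trend that drives RUL.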
[465] DAQ: Delta-Aware Quantization for Post-Training LLM Weight Compression
Xiaoming Yu, Shize Tang, Guanghua Yu, Linchuan Xie, Song Liu, Jianchen Zhu, Feng Li
Main category: cs.LG
TL;DR: DAQ is a data-free post-training quantization method that preserves post-training knowledge by optimizing for directional fidelity of parameter deltas rather than reconstruction error.
Details
Motivation: Standard quantization methods minimize reconstruction error but are agnostic to the base model, allowing quantization noise to disproportionately corrupt small-magnitude parameter deltas that encode post-training behavior, which is analyzed through the lens of quantization as implicit regularization.
Method: DAQ replaces reconstruction-based objectives with two delta-aware metrics: Sign Preservation Rate and Cosine Similarity, which directly optimize for directional fidelity of parameter deltas (ΔW), requiring only the base and post-trained weight matrices.
Result: In a pilot FP8 study, DAQ recovers style-specific capabilities lost under standard quantization while maintaining general performance.
Conclusion: DAQ provides an effective data-free post-training quantization framework that preserves post-training knowledge by focusing on directional fidelity of parameter deltas rather than traditional reconstruction error minimization.
Abstract: We introduce Delta-Aware Quantization (DAQ), a data-free post-training quantization framework that preserves the knowledge acquired during post-training. Standard quantization objectives minimize reconstruction error but are agnostic to the base model, allowing quantization noise to disproportionately corrupt the small-magnitude parameter deltas ($ΔW$) that encode post-training behavior – an effect we analyze through the lens of quantization as implicit regularization. DAQ replaces reconstruction-based objectives with two delta-aware metrics – Sign Preservation Rate and Cosine Similarity – that directly optimize for directional fidelity of $ΔW$, requiring only the base and post-trained weight matrices. In a pilot FP8 study, DAQ recovers style-specific capabilities lost under standard quantization while maintaining general performance.
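The two delta-aware metrics are straightforward to define. A minimal sketch, assuming `dw = W_post - W_base` and `dw_q = dequantize(Q(W_post)) - W_base` (variable names are ours):

```python
import numpy as np

def sign_preservation_rate(dw, dw_q):
    # Fraction of delta entries whose sign survives quantization.
    return float(np.mean(np.sign(dw) == np.sign(dw_q)))

def delta_cosine_similarity(dw, dw_q):
    # Directional fidelity of the quantized delta relative to the original.
    num = float(dw.ravel() @ dw_q.ravel())
    den = np.linalg.norm(dw) * np.linalg.norm(dw_q) + 1e-12
    return num / den
```

Both metrics ignore the absolute scale of the weights, which is why they can protect small-magnitude deltas that a reconstruction-error objective would treat as negligible.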
[466] Hybrid Associative Memories
Leon Lufkin, Tomás Figliolia, Beren Millidge, Kamesh Krishnamurthy
Main category: cs.LG
TL;DR: HAM combines RNNs and self-attention by using RNNs to compress sequences and attention only for information the RNN struggles to predict, enabling controlled KV cache growth.
Details
Motivation: RNNs and self-attention have orthogonal strengths: RNNs compress past into fixed-size state (efficient but degrade over long contexts), while attention stores every past step (excellent retrieval but high memory/computation). Prior combinations naively interleaved them without leveraging complementary mechanisms.
Method: Proposes Hybrid Associative Memory (HAM) layer that uses RNN to compress entire sequence, while attention supplements only information difficult for RNN to predict. Enables data-dependent KV cache growth controlled by a single continuous threshold.
Result: HAM offers smooth trade-off between KV cache growth rate and loss/performance. Provides strong competitive performance relative to RNNs and Transformers at substantially lower KV-cache usage.
Conclusion: HAM effectively combines RNN and attention mechanisms by leveraging their complementary strengths, enabling efficient sequence modeling with controlled memory usage.
Abstract: Recurrent neural networks (RNNs) and self-attention are both widely used sequence-mixing layers that maintain an internal memory. However, this memory is constructed using two orthogonal mechanisms: RNNs compress the entire past into a fixed-size state, whereas self-attention’s state stores every past time step growing its state (the KV cache) linearly with the sequence length. This results in orthogonal strengths and weaknesses. Self-attention layers excel at retrieving information in the context but have large memory and computational costs, while RNNs are more efficient but degrade over longer contexts and underperform for precise recall tasks. Prior work combining these mechanisms has focused primarily on naively interleaving them to reduce computational cost without regard to their complementary mechanisms. We propose the Hybrid Associative Memory (HAM) layer, which combines self-attention and RNNs while leveraging their individual strengths: the RNN compresses the entire sequence, while attention supplements it only with information that is difficult for the RNN to predict, which is hence the most valuable information to explicitly store. HAM layers enable data-dependent growth of the KV cache, which can be precisely controlled by the user with a single, continuous threshold. We find that this fine-grained control of the KV cache growth rate has a smooth trade-off with loss and performance. Empirically, we show that our hybrid architecture offers strong, competitive performance relative to RNNs and Transformers even at substantially lower KV-cache usage.
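The single-threshold write rule described above reduces to a filter over per-step prediction errors. A minimal sketch (the function name and error source are our assumptions; the paper's exact surprise measure is not given in the summary):

```python
def select_kv_writes(prediction_errors, tau):
    """Write a timestep to the KV cache only when the RNN's prediction
    error exceeds the user threshold tau; raising tau shrinks the cache,
    giving the continuous growth-rate control the abstract describes."""
    return [t for t, e in enumerate(prediction_errors) if e > tau]
```

At `tau = 0` every step is cached (Transformer-like behavior); as `tau` grows, the layer degrades gracefully toward a pure RNN, which is the trade-off curve the paper reports.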
[467] A Direct Classification Approach for Reliable Wind Ramp Event Forecasting under Severe Class Imbalance
Alejandro Morales-Hernández, Fabrizio De Caro, Gian Marco Paldino, Pascal Tribel, Alfredo Vaccaro, Gianluca Bontempi
Main category: cs.LG
TL;DR: A novel methodology for Wind Power Ramp Event forecasting using multivariate time series classification with class imbalance handling through undersampling and ensemble learning.
Details
Motivation: Decision support systems need accurate forecasting of Wind Power Ramp Events for grid stability, but conventional ML models struggle with class imbalance where ramp events are rare (less than 15% of observations).
Method: Proposes a multivariate time series classification approach with data preprocessing that extracts features from recent power observations and masks unavailable ramp information. Combines majority-class undersampling with ensemble learning to handle class imbalance.
Result: Achieves over 85% accuracy and 88% weighted F1 score on real-world dataset, outperforming benchmark classifiers.
Conclusion: The proposed methodology effectively addresses class imbalance in WPRE forecasting and provides superior performance for early warning systems in wind power plants.
Abstract: Decision support systems are essential for maintaining grid stability in low-carbon power systems, such as wind power plants, by providing real-time alerts to control room operators regarding potential events, including Wind Power Ramp Events (WPREs). These early warnings enable the timely initiation of more detailed system stability assessments and preventive actions. However, forecasting these events is challenging due to the inherent class imbalance in WPRE datasets, where ramp events are less frequent (typically less than 15% of observed events) compared to normal conditions. Ignoring this characteristic undermines the performance of conventional machine learning models, which often favor the majority class. This paper introduces a novel methodology for WPRE forecasting as a multivariate time series classification task and proposes a data preprocessing strategy that extracts features from recent power observations and masks unavailable ramp information, making it integrable with traditional real-time ramp identification tools. Particularly, the proposed methodology combines majority-class undersampling and ensemble learning to enhance wind ramp event forecasting under class imbalance. Numerical simulations conducted on a real-world dataset demonstrate the superiority of our approach, achieving over 85% accuracy and 88% weighted F1 score, outperforming benchmark classifiers.
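Majority-class undersampling is a standard technique; a minimal self-contained sketch (parameter names and the 1:1 default ratio are our choices, not the paper's):

```python
import random

def undersample_majority(X, y, majority_label=0, ratio=1.0, seed=0):
    """Randomly subsample the majority class so the retained majority
    count is ratio * (minority count); minority samples are all kept."""
    rng = random.Random(seed)
    minority = [i for i, t in enumerate(y) if t != majority_label]
    major = [i for i, t in enumerate(y) if t == majority_label]
    n_keep = min(len(major), int(len(minority) * ratio))
    keep = rng.sample(major, n_keep)           # sample without replacement
    idx = sorted(minority + keep)
    return [X[i] for i in idx], [y[i] for i in idx]
```

In the ensemble setting the summary describes, each base classifier would typically be trained on a different random undersample (different `seed`), so the discarded majority information is recovered across ensemble members.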
[468] Beyond the Mean: Distribution-Aware Loss Functions for Bimodal Regression
Abolfazl Mohammadi-Seif, Carlos Soares, Rita P. Ribeiro, Ricardo Baeza-Yates
Main category: cs.LG
TL;DR: Proposes distribution-aware loss functions (normalized RMSE with Wasserstein/Cramér distances) for deep regression to handle bimodal uncertainty without mixture model instability, achieving better uncertainty estimation than standard approaches.
Details
Motivation: Standard regression models struggle with reliable uncertainty estimation when predictive errors follow bimodal distributions (confident vs ambiguous predictions), as they assume unimodal Gaussian noise. Mixture Density Networks can represent such distributions but suffer from severe optimization instability.
Method: Introduces a family of distribution-aware loss functions combining normalized RMSE with Wasserstein and Cramér distances. Applied to standard deep regression models, these losses can recover bimodal distributions without the volatility of mixture models.
Result: The Wasserstein loss establishes a new Pareto efficiency frontier: matches stability of standard regression losses like MSE on unimodal tasks while reducing Jensen-Shannon Divergence by 45% on complex bimodal datasets. Framework strictly dominates Mixture Density Networks in both fidelity and robustness.
Conclusion: Proposed distribution-aware loss functions offer a reliable tool for aleatoric uncertainty estimation in trustworthy AI systems, addressing the critical challenge of assessing model trustworthiness through reliable predictive confidence estimates.
Abstract: Despite the strong predictive performance achieved by machine learning models across many application domains, assessing their trustworthiness through reliable estimates of predictive confidence remains a critical challenge. This issue arises in scenarios where the likelihood of error inferred from learned representations follows a bimodal distribution, resulting from the coexistence of confident and ambiguous predictions. Standard regression approaches often struggle to adequately express this predictive uncertainty, as they implicitly assume unimodal Gaussian noise, leading to mean-collapse behavior in such settings. Although Mixture Density Networks (MDNs) can represent different distributions, they suffer from severe optimization instability. We propose a family of distribution-aware loss functions integrating normalized RMSE with Wasserstein and Cramér distances. When applied to standard deep regression models, our approach recovers bimodal distributions without the volatility of mixture models. Validated across four experimental stages, our results show that the proposed Wasserstein loss establishes a new Pareto efficiency frontier: matching the stability of standard regression losses like MSE in unimodal tasks while reducing Jensen-Shannon Divergence by 45% on complex bimodal datasets. Our framework strictly dominates MDNs in both fidelity and robustness, offering a reliable tool for aleatoric uncertainty estimation in trustworthy AI systems.
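For 1-D empirical samples the Wasserstein distance has a closed form, which makes a sketch of the loss family easy. The combination weight `lam` and the exact normalization are our assumptions; only the ingredients (normalized RMSE plus a distributional distance) come from the paper:

```python
import numpy as np

def wasserstein_1d(a, b):
    # W1 between two equal-size 1-D empirical samples reduces to the
    # mean absolute difference of their sorted values.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def distribution_aware_loss(pred, target, lam=0.5):
    """Sketch of the proposed family: pointwise fit via normalized RMSE
    plus a distributional term that penalizes collapse onto the mean."""
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    nrmse = rmse / (np.std(target) + 1e-12)
    return nrmse + lam * wasserstein_1d(pred, target)
```

A mean-collapsed predictor can score reasonably under RMSE alone, but its prediction distribution is unimodal while the target is bimodal, so the Wasserstein term stays large; this is the failure mode the loss is designed to expose.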
[469] Trained Persistent Memory for Frozen Decoder-Only LLMs
Hong Jeong
Main category: cs.LG
TL;DR: Decoder-only models can gain persistent memory through adapter training, with architectural priors crucial for effective memory retention at low capacity.
Details
Motivation: To investigate whether persistent latent-space memory principles from encoder-decoder models transfer to decoder-only architectures where no cross-attention exists, requiring memory injection through self-attention alone.
Method: Adapted six memory methods (prefix, parallel cross-attention, KV extension, Hebbian memory, context-gated branch, slot-based sparse write) to frozen GPT-2, training only small memory adapters with shared write rules but varying read injection mechanisms.
Result: Found striking inductive-bias dichotomy: at 1× capacity, three methods with strong architectural priors achieved 7-18% retained-memory scores and 7-10 knowledge gains, while others failed (<0.4%). At 10× capacity all methods converged, showing architectural rather than fundamental limitations.
Conclusion: Persistent latent-space memory is a general paradigm spanning major transformer families, with architectural priors crucial for effective memory at low capacity.
Abstract: Decoder-only language models are stateless: hidden representations are discarded after every forward pass and nothing persists across sessions. Jeong (2026a) showed that trained memory adapters give a frozen encoder-decoder backbone persistent latent-space memory, building on the lateral-memory framework of Jeong (2026b,c). Here we ask whether the same principle transfers to the decoder-only setting, where no cross-attention pathway exists and memory must enter through self-attention alone. We adapt six methods – prefix, parallel cross-attention, KV extension, Hebbian memory, context-gated branch, and slot-based sparse write – to a frozen GPT-2, training only a small adapter $θ_{mem}$. The write rule is shared; only the read injection changes from decoder cross-attention to self-attention KV prefix or parallel branch. On LoCoMo we find a striking inductive-bias dichotomy: at $1\times$ capacity, three methods with strong architectural priors – cross-attention (M.2), Hebbian (M.4), and slot write (M.6) – achieve retained-memory scores of $7-18%$ and knowledge gains $ΔK$ of $7-10$, while the other three fail ($< 0.4%$). At $10\times$ capacity all six converge, showing the gap is architectural, not fundamental. Together with the encoder-decoder results of Jeong (2026a) and the brain-inspired modules of Jeong (2026b,c), these findings establish persistent latent-space memory as a general paradigm spanning major transformer families.
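The KV-prefix read path, the only injection route available without cross-attention, can be shown in miniature. A single-head numpy sketch (illustrative only; the trained adapter that produces `mem_k`/`mem_v` is elided):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_with_memory_prefix(q, k, v, mem_k, mem_v):
    """Prepend trained memory keys/values so frozen self-attention reads
    persistent memory exactly as it reads ordinary context tokens."""
    K = np.concatenate([mem_k, k], axis=0)
    V = np.concatenate([mem_v, v], axis=0)
    w = softmax(q @ K.T / np.sqrt(q.shape[-1]))   # scaled dot-product weights
    return w @ V
```

Because the backbone is frozen, only `mem_k` and `mem_v` (the adapter's output) are trained, which matches the paper's setup of updating a small adapter while GPT-2's own weights stay fixed.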
[470] Conformal Risk Control for Safety-Critical Wildfire Evacuation Mapping: A Comparative Study of Tabular, Spatial, and Graph-Based Models
Baljinnyam Dayan
Main category: cs.LG
TL;DR: First application of conformal risk control to wildfire spread prediction provides formal safety guarantees on false negative rates, showing standard thresholds miss 7-72% of fire spread while CRC achieves ~95% coverage with high efficiency.
Details
Motivation: Current wildfire prediction models lack formal safety guarantees on missed fire spread, leaving evacuation planners reliant on probability thresholds without formal assurance. No prior work has applied distribution-free safety guarantees to wildfire spread prediction despite extensive deep learning research in this domain.
Method: Applied conformal risk control (CRC) to wildfire spread prediction across three model families: tabular (LightGBM), convolutional (Tiny U-Net), and graph-based (Hybrid ResGNN-UNet). Proposed shift-aware three-way CRC framework for operational triage with SAFE/MONITOR/EVACUATE zones. Analyzed prevalence-weighted bounds under extreme class imbalance (~5% fire prevalence).
Result: Standard thresholds captured only 7-72% of true fire spread across models. CRC eliminated this failure uniformly, achieving ~95% fire coverage while flagging only ~15% of total pixels. Spatial models with CRC were 4.2x more efficient than LightGBM. Graph model’s additional complexity over simple U-Net yielded no meaningful efficiency gain.
Conclusion: Model architecture determines evacuation efficiency while CRC determines safety. The proposed CRC framework provides formal guarantees on false negative rates (FNR ≤ 0.05) and enables operational triage with safety assurances. All models, calibration code, and evaluation pipelines are released for reproducibility.
Abstract: Every wildfire prediction model deployed today shares a dangerous property: none of these methods provides formal guarantees on how much fire spread is missed. Despite extensive work on wildfire spread prediction using deep learning, no prior study has applied distribution-free safety guarantees to this domain, leaving evacuation planners reliant on probability thresholds with no formal assurance. We address this gap by presenting, to our knowledge, the first application of conformal risk control (CRC) to wildfire spread prediction, providing finite-sample guarantees on false negative rate (FNR <= 0.05). We expose a stark failure: across three model families of increasing complexity (tabular: LightGBM, AUROC 0.854; convolutional: Tiny U-Net, AUROC 0.969; and graph-based: Hybrid ResGNN-UNet, AUROC 0.964), standard thresholds capture only 7-72% of true fire spread. CRC eliminates this failure uniformly. Our central finding is that model architecture determines evacuation efficiency, while CRC determines safety: both spatial models with CRC achieve approximately 95% fire coverage while flagging only approximately 15% of total pixels, making them 4.2x more efficient than LightGBM, while the graph model’s additional complexity over a simple U-Net yields no meaningful efficiency gain. We propose a shift-aware three-way CRC framework that assigns SAFE/MONITOR/EVACUATE zones for operational triage, and characterize a fundamental limitation of prevalence-weighted bounds under extreme class imbalance (approximately 5% fire prevalence). All models, calibration code, and evaluation pipelines are released for reproducibility.
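The core CRC calibration step can be sketched in a few lines. This is a simplified version of the standard CRC recipe (grid search, the finite-sample adjustment form, and treating calibration pixels as exchangeable units are our simplifications, not the paper's exact pipeline):

```python
import numpy as np

def crc_threshold(probs, labels, alpha=0.05):
    """Pick the largest probability threshold whose finite-sample-adjusted
    false negative rate on the calibration set stays below alpha; pixels
    with prob >= threshold are then flagged as fire at test time."""
    n = len(labels)
    n_pos = max(1, int(np.sum(labels == 1)))
    best = 0.0
    for t in np.linspace(0.0, 1.0, 101):
        fn = np.sum((probs < t) & (labels == 1))   # missed fire pixels
        fnr = fn / n_pos
        if (n * fnr + 1) / (n + 1) <= alpha:       # conservative adjustment
            best = t
    return best
```

The contrast with "standard thresholds" in the abstract is that a fixed cutoff like 0.5 carries no coverage guarantee, whereas the calibrated threshold controls FNR by construction on exchangeable data.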
[471] Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms
Arthur Dantas Mangussi, Ricardo Cardoso Pereira, Ana Carolina Lorena, Pedro Henriques Abreu
Main category: cs.LG
TL;DR: LLMs show promise for tabular data imputation using zero-shot prompting, outperforming traditional methods on real-world datasets but lagging on synthetic data due to reliance on semantic context learned during pre-training.
Details
Motivation: Existing LLM-based imputation studies face scalability issues, limited cross-model comparisons, small/domain-specific datasets, inconsistent experimental protocols, and inadequate treatment of different missingness mechanisms (MCAR, MAR, MNAR).
Method: Comprehensive benchmarking comparing 5 LLMs against 6 state-of-the-art imputation baselines across 29 datasets (including 9 synthetic) under MCAR, MAR, and MNAR mechanisms with up to 20% missing rates, using zero-shot prompt engineering.
Result: Leading LLMs (Gemini 3.0 Flash, Claude 4.5 Sonnet) outperform traditional methods on real-world datasets but underperform on synthetic data; LLM effectiveness depends on semantic context from pre-training rather than statistical reconstruction; LLMs have higher computational/time costs.
Conclusion: LLMs are promising semantics-driven imputers for complex tabular data but face cost-performance trade-offs; their effectiveness relies on pre-training exposure to domain patterns rather than pure statistical methods.
Abstract: Data imputation is a cornerstone technique for handling missing values in real-world datasets, which are often plagued by missingness. Despite recent progress, prior studies on Large Language Models-based imputation remain limited by scalability challenges, restricted cross-model comparisons, and evaluations conducted on small or domain-specific datasets. Furthermore, heterogeneous experimental protocols and inconsistent treatment of missingness mechanisms (MCAR, MAR, and MNAR) hinder systematic benchmarking across methods. This work investigates the robustness of Large Language Models for missing data imputation in tabular datasets using a zero-shot prompt engineering approach. To this end, we present a comprehensive benchmarking study comparing five widely used LLMs against six state-of-the-art imputation baselines. The experimental design evaluates these methods across 29 datasets (including nine synthetic datasets) under MCAR, MAR, and MNAR mechanisms, with missing rates of up to 20%. The results demonstrate that leading LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, consistently achieve superior performance on real-world open-source datasets compared to traditional methods. However, this advantage appears to be closely tied to the models’ prior exposure to domain-specific patterns learned during pre-training on internet-scale corpora. In contrast, on synthetic datasets, traditional methods such as MICE outperform LLMs, suggesting that LLM effectiveness is driven by semantic context rather than purely statistical reconstruction. Furthermore, we identify a clear trade-off: while LLMs excel in imputation quality, they incur significantly higher computational time and monetary costs. Overall, this study provides a large-scale comparative analysis, positioning LLMs as promising semantics-driven imputers for complex tabular data.
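For readers unfamiliar with the missingness mechanisms the benchmark varies, MCAR is the simplest: every cell is masked independently at the same rate, regardless of any value. A minimal sketch (the function name and interface are invented for illustration):

```python
import numpy as np

def inject_mcar(X, rate=0.2, seed=0):
    # MCAR: each cell is masked independently with the same probability,
    # irrespective of observed or unobserved values (unlike MAR/MNAR)
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < rate
    Xm = X.astype(float).copy()
    Xm[mask] = np.nan
    return Xm, mask
```

MAR and MNAR differ only in that the masking probability is allowed to depend on observed values (MAR) or on the masked values themselves (MNAR), which is what makes them harder to benchmark consistently.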
[472] Graph Signal Processing Meets Mamba2: Adaptive Filter Bank via Delta Modulation
Yehjin Shin, Seojin Kim, Noseong Park
Main category: cs.LG
TL;DR: HADES introduces a hierarchical adaptive filter bank framework for state-space models, inspired by graph signal processing, that improves parameter efficiency while maintaining performance comparable to Mamba2.
Details
Motivation: While Mamba2 offers efficient linear-time recurrence for language modeling, its multi-head recurrence operates independently without structured utilization or analysis. The authors aim to bridge graph signal processing with neural sequence modeling to create more efficient, hierarchical, and interpretable filtering within state-space models.
Method: HADES reinterprets Mamba2 as an adaptive filter bank on a line graph using graph signal processing principles. It introduces a hierarchical architecture with two filter types: shared filters for global low-pass behavior and expert filters for local high-pass behavior, achieved through structured bias on the parameter Δ. This enables more efficient parameter usage.
Result: HADES achieves comparable performance to baseline models including Mamba2 across various benchmarks in language modeling, commonsense reasoning, and long-context retrieval, while using only 58.9% of the original parameters.
Conclusion: HADES successfully bridges graph signal processing and neural sequence modeling, enabling efficient, hierarchical, and interpretable filtering within state-space models while maintaining strong performance with significantly reduced parameters.
Abstract: State-space models (SSMs) offer efficient alternatives to attention with linear-time recurrence. Mamba2, a recent SSM-based language model, uses selective input gating and a multi-head structure, enabling parallel computation and strong benchmark performance. However, its multi-head recurrence operates independently without structured utilization or analysis. In this work, we propose a novel method called Hierarchical ADaptive filter bank for Efficient SSMs (HADES), a Graph Signal Processing (GSP)-inspired framework that reinterprets Mamba2 as an adaptive filter bank on a line graph. Our hierarchical architecture introduces two filter types: shared filters for global low-pass behavior and expert filters for local high-pass behavior, achieved through structured bias on the parameter Δ. HADES achieves comparable performance to baseline models including Mamba2 across various benchmarks in language modeling, commonsense reasoning, and long-context retrieval, while using only 58.9% of the original parameters. In this regard, HADES bridges GSP and neural sequence modeling, enabling efficient, hierarchical, and interpretable filtering within state-space models.
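The filter-bank reading of an SSM recurrence can be sketched with a scalar first-order filter on a sequence (a line graph): a small Δ gives slow decay and low-pass behavior, and the residual x − h acts high-pass. The discretization below (h_t = e^(−Δ) h_{t−1} + Δ x_t) is a common simplification, not the paper's exact Mamba2 parameterization.

```python
import numpy as np

def ssm_filter(x, delta):
    # scalar SSM recurrence h_t = exp(-delta) * h_{t-1} + delta * x_t:
    # small delta -> long memory (low-pass); x - h then acts high-pass
    a = np.exp(-delta)
    h = np.zeros(len(x))
    prev = 0.0
    for t, xt in enumerate(x):
        prev = a * prev + delta * xt
        h[t] = prev
    return h
```

Biasing Δ per head, as HADES does, then amounts to assigning each head a different cutoff on this line-graph spectrum.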
[473] Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits
Eric Czech, Zhiwei Xu, Yael Elmatad, Yixin Wang, William Held
Main category: cs.LG
TL;DR: The paper analyzes biases in Chinchilla Approach 2 for neural scaling laws and proposes Chinchilla Approach 3 with Variable Projection as a more accurate alternative.
Details
Motivation: Chinchilla Approach 2 introduces systematic biases in compute-optimal allocation estimates for neural scaling laws, leading to significant computational waste and parameter misallocation in large models like Llama 3.
Method: Analyzes three error sources in Approach 2: IsoFLOP sampling grid width, uncentered IsoFLOP sampling, and loss surface asymmetry. Proposes Chinchilla Approach 3 with Variable Projection that exploits partially linear structure for unbiased inference through two-dimensional optimization.
Result: Approach 2 biases cause 6.5% parameter underallocation in Llama 3 training (corresponding to $1.4M in unnecessary compute). Approach 3 eliminates these biases and is shown to be well-conditioned, analytically differentiable, and amenable to grid search.
Conclusion: Chinchilla Approach 3 with Variable Projection is a superior alternative to Approach 2, providing unbiased estimates while addressing concerns about data efficiency, numerical stability, and implementation difficulty.
Abstract: Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic biases in compute-optimal allocation estimates, even on noise-free synthetic data. Applied to published Llama 3 IsoFLOP data at open frontier compute scales, these biases imply a parameter underallocation corresponding to 6.5% of the $3.8\times10^{25}$ FLOP training budget and $1.4M (90% CI: $412K-$2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even greater opportunity costs due to higher loss surface asymmetry. Three sources of this error are examined: IsoFLOP sampling grid width (Taylor approximation accuracy), uncentered IsoFLOP sampling, and loss surface asymmetry ($α\neq β$). Chinchilla Approach 3 largely eliminates these biases but is often regarded as less data-efficient, numerically unstable, prone to local minima, and harder to implement. Each concern is shown to be unfounded or addressable, especially when the partially linear structure of the objective is exploited via Variable Projection, enabling unbiased inference on all five loss surface parameters through a two-dimensional optimization that is well-conditioned, analytically differentiable, and amenable to dense, or even exhaustive, grid search. It may serve as a more convenient replacement for Approach 2 or a more scalable alternative for adaptations of Approach 3 to richer scaling law formulations.
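Approach 2 fits a parabola to (log N, loss) points at fixed compute and reads off the vertex. The sketch below reproduces the asymmetry bias the paper analyzes on synthetic data; the cubic skew term is an illustrative stand-in for the $α\neq β$ loss surface asymmetry.

```python
import numpy as np

def isoflop_vertex(logN, loss):
    # Chinchilla Approach 2: fit loss ~ a*logN^2 + b*logN + c,
    # estimated optimum at the parabola vertex -b/(2a)
    a, b, _ = np.polyfit(logN, loss, 2)
    return -b / (2 * a)
```

On a symmetric loss surface the vertex recovers the true optimum exactly; adding even mild asymmetry shifts the fitted vertex away from it, which is the systematic bias at issue.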
[474] Cloud-Edge Collaborative Large Models for Robust Photovoltaic Power Forecasting
Nan Qiao, Sijing Duan, Shuning Wang, Xingyuan Hua, Ju Ren
Main category: cs.LG
TL;DR: A risk-aware cloud-edge collaborative framework for photovoltaic power forecasting that balances accuracy, robustness, and latency constraints by integrating local specialized models, edge-side lightweight models, and cloud-based large retrieval models with intelligent routing.
Details
Motivation: PV power forecasting in edge-enabled grids needs to balance forecasting accuracy, robustness under weather-driven distribution shifts, and strict latency constraints. Local specialized models are efficient for routine conditions but degrade under rare ramp events and unseen weather patterns, while always using cloud-side large models incurs substantial communication delay and cloud overhead.
Method: Proposes a risk-aware cloud-edge collaborative framework with: 1) site-specific expert predictor for routine cases, 2) lightweight edge-side model for enhanced local inference, 3) cloud-side large retrieval model providing matched historical context through retrieval-prediction pipeline, 4) lightweight screening module estimating predictive uncertainty, OOD risk, weather mutation intensity, and model disagreement, 5) Lyapunov-guided router selectively escalating inference to edge-small or cloud-assisted branches under constraints, and 6) adaptive fusion of activated branch outputs.
Result: Experiments on two real-world PV datasets demonstrate favorable overall trade-off among forecasting accuracy, routing quality, robustness, and system efficiency.
Conclusion: The proposed framework effectively addresses the challenge of balancing accuracy, robustness, and latency in PV power forecasting through intelligent cloud-edge collaboration with risk-aware routing and adaptive fusion.
Abstract: Photovoltaic (PV) power forecasting in edge-enabled grids requires balancing forecasting accuracy, robustness under weather-driven distribution shifts, and strict latency constraints. Local specialized models are efficient for routine conditions but often degrade under rare ramp events and unseen weather patterns, whereas always relying on cloud-side large models incurs substantial communication delay and cloud overhead. To address this challenge, we propose a risk-aware cloud-edge collaborative framework for latency-sensitive PV forecasting. The framework integrates a site-specific expert predictor for routine cases, a lightweight edge-side model for enhanced local inference, and a cloud-side large retrieval model that provides matched historical context when needed through a retrieval-prediction pipeline. A lightweight screening module estimates predictive uncertainty, out-of-distribution risk, weather mutation intensity, and model disagreement, while a Lyapunov-guided router selectively escalates inference to the edge-small or cloud-assisted branches under long-term latency, communication, and cloud-usage constraints. The outputs of the activated branches are combined through adaptive fusion. Experiments on two real-world PV datasets demonstrate a favorable overall trade-off among forecasting accuracy, routing quality, robustness, and system efficiency.
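The routing logic can be caricatured as a threshold policy over the four screening signals. Everything below (the max-score combination, the thresholds, the tier names) is invented for illustration, and the Lyapunov queue that enforces the long-term latency and cloud-usage budgets is omitted entirely.

```python
def route(uncertainty, ood_risk, mutation, disagreement,
          t_edge=0.3, t_cloud=0.6):
    # escalate whenever any normalized risk signal in [0, 1] is high;
    # the real router also tracks budget queues (Lyapunov drift), not shown
    risk = max(uncertainty, ood_risk, mutation, disagreement)
    if risk < t_edge:
        return "local"   # site-specific expert predictor
    if risk < t_cloud:
        return "edge"    # lightweight edge-side model
    return "cloud"       # retrieval-augmented large model
```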
[475] First-Mover Bias in Gradient Boosting Explanations: Mechanism, Detection, and Resolution
Drake Caraker, Bryan Arnold, David Rhoads
Main category: cs.LG
TL;DR: The paper identifies and addresses “first-mover bias” in gradient boosting models where correlated features compete for early splits, causing unstable SHAP feature importance rankings, and proposes the DASH method to stabilize explanations through model independence.
Details
Motivation: The paper aims to solve the instability of SHAP-based feature rankings under multicollinearity in gradient boosting models, identifying a specific mechanistic cause called "first-mover bias" where sequential residual fitting creates path-dependent concentration of feature importance on arbitrary correlated features.
Method: Proposes DASH (Diversified Aggregation of SHAP) which breaks sequential dependency chains through model independence, along with simple seed-averaging (Stochastic Retrain). Also introduces diagnostic tools: Feature Stability Index (FSI) and Importance-Stability (IS) Plot to detect first-mover bias without ground truth.
Result: DASH achieves stability=0.977 at rho=0.9 correlation, outperforming single-best workflow (0.958) and Large Single Model (0.938). On Breast Cancer dataset, DASH improves stability from 0.32 to 0.93 (+0.61 improvement). Both DASH and seed-averaging effectively mitigate first-mover bias.
Conclusion: Model independence is sufficient to resolve first-mover bias in linear regimes and remains most effective mitigation under nonlinear processes. DASH provides stable feature importance explanations and diagnostic tools for practitioners to audit explanation reliability before acting on feature rankings.
Abstract: We isolate and empirically characterize first-mover bias – a path-dependent concentration of feature importance caused by sequential residual fitting in gradient boosting – as a specific mechanistic cause of the well-known instability of SHAP-based feature rankings under multicollinearity. When correlated features compete for early splits, gradient boosting creates a self-reinforcing advantage for whichever feature is selected first: subsequent trees inherit modified residuals that favor the incumbent, concentrating SHAP importance on an arbitrary feature rather than distributing it across the correlated group. Scaling up a single model amplifies this effect – a Large Single Model with the same total tree count as our method produces the worst explanations of any approach tested. We demonstrate that model independence is sufficient to resolve first-mover bias in the linear regime, and remains the most effective mitigation under nonlinear data-generating processes. Both our proposed method, DASH (Diversified Aggregation of SHAP), and simple seed-averaging (Stochastic Retrain) restore stability by breaking the sequential dependency chain, confirming that the operative mechanism is independence between explained models. At rho=0.9, both achieve stability=0.977, while the single-best workflow degrades to 0.958 and the Large Single Model to 0.938. On the Breast Cancer dataset, DASH improves stability from 0.32 to 0.93 (+0.61) against a tree-count-matched baseline. DASH additionally provides two diagnostic tools – the Feature Stability Index (FSI) and Importance-Stability (IS) Plot – that detect first-mover bias without ground truth, enabling practitioners to audit explanation reliability before acting on feature rankings. Software and reproducible benchmarks are available at https://github.com/DrakeCaraker/dash-shap.
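The exact FSI definition is not given in this summary; one natural reading of a stability score, sketched below, is the mean pairwise Spearman rank correlation of per-feature importance vectors across independently retrained models. The paper's formula may differ.

```python
import numpy as np

def _ranks(v):
    order = np.argsort(v)
    r = np.empty(len(v))
    r[order] = np.arange(len(v))
    return r

def stability_index(importances):
    """Mean pairwise Spearman correlation of per-feature importance
    rankings across retrained models (1.0 = perfectly stable)."""
    R = np.array([_ranks(row) for row in importances])
    cors = []
    for i in range(len(R)):
        for j in range(i + 1, len(R)):
            a = R[i] - R[i].mean()
            b = R[j] - R[j].mean()
            cors.append(float(a @ b / np.sqrt((a @ a) * (b @ b))))
    return float(np.mean(cors))
```

Under first-mover bias, each retrain with a different seed crowns a different member of a correlated group, so a score like this collapses even when single-model importances look confident.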
[476] COMPASS-Hedge: Learning Safely Without Knowing the World
Ting Hu, Luanda Cai, Manolis Vlatakis
Main category: cs.LG
TL;DR: COMPASS-Hedge is a parameter-free online learning algorithm that simultaneously achieves optimal regret in adversarial settings, instance-optimal gap-dependent regret in stochastic settings, and baseline safety against a fixed comparator.
Details
Motivation: Online learning algorithms face a fundamental trilemma: balancing regret guarantees between adversarial and stochastic environments while providing baseline safety against a fixed comparator. Existing methods typically excel in only one or two of these regimes, requiring trade-offs or oracle access to problem parameters.
Method: COMPASS-Hedge integrates adaptive pseudo-regret scaling, phase-based aggression, and a comparator-aware mixing strategy. It’s parameter-free and requires no prior knowledge of the environment’s nature or stochastic sub-optimality gaps.
Result: The algorithm achieves: 1) Minimax-optimal regret in adversarial environments, 2) Instance-optimal, gap-dependent regret in stochastic environments, and 3) Õ(1) regret relative to a designated baseline policy.
Conclusion: COMPASS-Hedge provides the first “best-of-three-worlds” guarantee in full-information online learning, demonstrating that baseline safety doesn’t have to compromise worst-case robustness or stochastic efficiency.
Abstract: Online learning algorithms often face a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings while providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. Our algorithm is the first full-information method to simultaneously achieve: i) Minimax-optimal regret in adversarial environments; ii) Instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy, up to logarithmic factors. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment’s nature or the magnitude of the stochastic sub-optimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first “best-of-three-worlds” guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
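For context, the classical Hedge (exponential weights) forecaster that COMPASS-Hedge builds on looks like this; the paper's adaptive scaling, phases, and comparator-aware mixing are not shown.

```python
import numpy as np

def hedge(losses, eta):
    """Exponential-weights forecaster; returns cumulative expected loss
    over a (T, K) array of per-expert losses in [0, 1]."""
    K = losses.shape[1]
    w = np.ones(K) / K
    total = 0.0
    for l in losses:
        total += float(w @ l)      # play the current mixture
        w = w * np.exp(-eta * l)   # exponentially penalize lossy experts
        w /= w.sum()
    return total
```

Plain Hedge with a fixed eta attains minimax-optimal adversarial regret but neither the gap-dependent stochastic rate nor baseline safety, which is exactly the gap the paper targets.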
[477] WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement
Fangyuan Li, Pengfei Li, Shijie Wang, Junqi Gao, Jianxing Liu, Biqing Qi, Yuqiang Li
Main category: cs.LG
TL;DR: WIST is a web-grounded iterative self-play tree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring curated domain corpora.
Details
Motivation: Existing RLVR methods face a trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. There's a need for methods that can learn directly from open web data without pre-arranged domain corpora.
Method: WIST incrementally expands a domain tree for exploration, retrieves and cleans path-consistent web corpus to construct a controllable training environment, performs Challenger-Solver self-play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum.
Result: Across four backbones, WIST consistently improves over base models and typically outperforms both purely endogenous self-evolution and corpus-grounded self-play baselines, with overall gains reaching +9.8 (Qwen3-4B-Base) and +9.7 (OctoThinker-8B). WIST is also domain-steerable, improving Qwen3-8B-Base by +14.79 in medicine and Qwen3-4B-Base by +5.28 on PhyBench.
Conclusion: WIST provides an effective framework for domain-targeted reasoning improvement using web-grounded self-play without requiring curated domain corpora, with ablations confirming the importance of its key components for stable open-web learning.
Abstract: Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improvement of language models, but existing methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present \textbf{WIST}, a \textbf{W}eb-grounded \textbf{I}terative \textbf{S}elf-play \textbf{T}ree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring any pre-arranged domain corpus. WIST incrementally expands a domain tree for exploration, and retrieves and cleans path-consistent web corpus to construct a controllable training environment. It then performs Challenger–Solver self-play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self-evolution and corpus-grounded self-play baselines, with the Overall gains reaching \textbf{+9.8} (\textit{Qwen3-4B-Base}) and \textbf{+9.7} (\textit{OctoThinker-8B}). WIST is also domain-steerable, improving \textit{Qwen3-8B-Base} by \textbf{+14.79} in medicine and \textit{Qwen3-4B-Base} by \textbf{+5.28} on PhyBench. Ablations further confirm the importance of WIST’s key components for stable open-web learning. Our Code is available at https://github.com/lfy-123/WIST.
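The learnability feedback that steers exploration can be pictured as a Beta-Bernoulli posterior per tree node, updated from Challenger-Solver outcomes. This bookkeeping is a guess at the mechanism for illustration, not the paper's actual update rule.

```python
def update_posterior(node, successes, attempts):
    # node: {"alpha": a, "beta": b} -- Beta-Bernoulli posterior over the
    # Solver's success rate at this domain node; returns the posterior mean,
    # which an adaptive curriculum could use to pick where to expand next
    node["alpha"] += successes
    node["beta"] += attempts - successes
    return node["alpha"] / (node["alpha"] + node["beta"])
```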
[478] Unveiling the Mechanism of Continuous Representation Full-Waveform Inversion: A Wave Based Neural Tangent Kernel Framework
Ruihua Chen, Yisi Luo, Bangyu Wu, Deyu Meng
Main category: cs.LG
TL;DR: The paper develops a theoretical framework using wave-based neural tangent kernel to understand continuous representation full-waveform inversion, proposes hybrid INR-grid methods that improve robustness and convergence, and demonstrates superior performance on geophysical models.
Details
Motivation: Full-waveform inversion (FWI) is sensitive to initial model accuracy. While recent continuous representation FWI using implicit neural representations (INR) reduces this dependency, its mechanism is unclear and it suffers from slower high-frequency convergence.
Method: Developed a wave-based neural tangent kernel (NTK) framework to analyze CR-FWI, revealing that wave-based NTK is not constant due to FWI’s nonlinearity. Proposed hybrid representation combining INR with multi-resolution grid (IG-FWI) to balance robustness and convergence.
Result: The eigenvalue decay behavior of wave-based NTK explains why CR-FWI reduces initial model dependency but has slower high-frequency convergence. Proposed IG-FWI achieves better trade-off between robustness and convergence rate, outperforming conventional FWI and existing INR-FWI on multiple geophysical models.
Conclusion: The wave-based NTK framework provides theoretical understanding of CR-FWI mechanisms, and the proposed hybrid representation methods offer practical improvements for FWI applications in geophysical exploration.
Abstract: Full-waveform inversion (FWI) estimates physical parameters in the wave equation from limited measurements and has been widely applied in geophysical exploration, medical imaging, and non-destructive testing. Conventional FWI methods are limited by their notorious sensitivity to the accuracy of the initial models. Recent progress in continuous representation FWI (CR-FWI) demonstrates that representing parameter models with a coordinate-based neural network, such as implicit neural representation (INR), can mitigate the dependence on initial models. However, its underlying mechanism remains unclear, and INR-based FWI shows slower high-frequency convergence. In this work, we investigate the general CR-FWI framework and develop a unified theoretical understanding by extending the neural tangent kernel (NTK) for FWI to establish a wave-based NTK framework. Unlike standard NTK, our analysis reveals that wave-based NTK is not constant, both at initialization and during training, due to the inherent nonlinearity of FWI. We further show that the eigenvalue decay behavior of the wave-based NTK can explain why CR-FWI alleviates the dependency on initial models and shows slower high-frequency convergence. Building on these insights, we propose several CR-FWI methods with tailored eigenvalue decay properties for FWI, including a novel hybrid representation combining INR and multi-resolution grid (termed IG-FWI) that achieves a more balanced trade-off between robustness and high-frequency convergence rate. Applications in geophysical exploration on Marmousi, 2D SEG/EAGE Salt and Overthrust, 2004 BP model, and the more realistic 2014 Chevron models show the superior performance of our proposed methods compared to conventional FWI and existing INR-based FWI methods.
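For reference, the empirical neural tangent kernel underlying the analysis is the Gram matrix of parameter gradients; in the wave-based setting the network output is composed with the nonlinear wave-equation forward operator, which is why, unlike the standard lazy-training regime, the kernel does not stay constant during training:

```latex
% Empirical NTK of a network f_\theta at inputs x, x'
\Theta_\theta(x, x') \;=\; \nabla_\theta f_\theta(x)^{\top}\,
\nabla_\theta f_\theta(x')
```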
[479] MCLR: Improving Conditional Modeling in Visual Generative Models via Inter-Class Likelihood-Ratio Maximization and Establishing the Equivalence between Classifier-Free Guidance and Alignment Objectives
Xiang Li, Yixuan Jia, Xiao Li, Jeffrey A. Fessler, Rongrong Wang, Qing Qu
Main category: cs.LG
TL;DR: MCLR: A training objective that makes diffusion models achieve CFG-like improvements without inference-time guidance by maximizing inter-class likelihood ratios.
Details
Motivation: Classifier-free guidance (CFG) is crucial for diffusion model performance but is an inference-time heuristic. The paper investigates whether training objectives can be modified to achieve CFG-like effects without needing guidance during sampling.
Method: Proposes MCLR (Maximum Class Likelihood Ratio), a principled alignment objective that explicitly maximizes inter-class likelihood-ratios during training. Models are fine-tuned with MCLR to improve inter-class separation.
Result: Models fine-tuned with MCLR achieve comparable qualitative and quantitative gains to CFG-guided models under standard sampling, without requiring inference-time guidance.
Conclusion: Establishes formal equivalence between classifier-free guidance and alignment-based objectives, providing mechanistic interpretation of CFG and showing training objectives can be modified to eliminate need for inference-time guidance.
Abstract: Diffusion models have achieved state-of-the-art performance in generative modeling, but their success often relies heavily on classifier-free guidance (CFG), an inference-time heuristic that modifies the sampling trajectory. From a theoretical perspective, diffusion models trained with standard denoising score matching (DSM) are expected to recover the target data distribution, raising the question of why inference-time guidance is necessary in practice. In this work, we ask whether the DSM training objective can be modified in a principled manner such that standard reverse-time sampling, without inference-time guidance, yields effects comparable to CFG. We identify insufficient inter-class separation as a key limitation of standard diffusion models. To address this, we propose MCLR, a principled alignment objective that explicitly maximizes inter-class likelihood-ratios during training. Models fine-tuned with MCLR exhibit CFG-like improvements under standard sampling, achieving comparable qualitative and quantitative gains without requiring inference-time guidance. Beyond empirical benefits, we provide a theoretical result showing that the CFG-guided score is exactly the optimal solution to a weighted MCLR objective. This establishes a formal equivalence between classifier-free guidance and alignment-based objectives, offering a mechanistic interpretation of CFG.
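The inference-time heuristic the paper seeks to absorb into training is the standard CFG score combination; MCLR's goal is to make plain conditional sampling (w = 1) behave like guided sampling (w > 1).

```python
import numpy as np

def cfg_score(s_cond, s_uncond, w):
    # classifier-free guidance: extrapolate from the unconditional score
    # toward the conditional one; w = 1 recovers plain conditional sampling
    s_cond, s_uncond = np.asarray(s_cond), np.asarray(s_uncond)
    return s_uncond + w * (s_cond - s_uncond)
```

The paper's theoretical result says this w-extrapolated score is exactly the optimizer of a weighted MCLR objective, which is the claimed equivalence.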
[480] FAAR: Format-Aware Adaptive Rounding for NVFP4
Hanglin Li, Shuchang Tian, Chen Lin, Zhiyong Zhao, Kun Zhan
Main category: cs.LG
TL;DR: FAAR introduces format-aware adaptive rounding for ultra-low-bit NVFP4 quantization of LLMs, achieving better performance than conventional methods with minimal training overhead.
Details
Motivation: Deploying LLMs on edge devices requires extremely low-bit quantization, but existing methods fail to account for the non-uniformity of NVFP4 numerical grids, leading to suboptimal rounding decisions and amplified quantization errors.
Method: Proposes Format-Aware Adaptive Rounding (FAAR) that incorporates the non-uniform NVFP4 grid into optimization, adaptively adjusting rounding decisions guided by loss gradients. Also introduces 2-stages Format Alignment (2FA) fine-tuning to align LLM parameters layer-by-layer to NVFP4 numerical space.
Result: Reduces perplexity on WikiText-2 from 14.28 to 12.60 on Llama3-1B and from 23.06 to 21.27 on Qwen3-1.7B compared to RTN. Consistently outperforms state-of-the-art approaches across various zero-shot downstream tasks with only 4 GPU hours training overhead.
Conclusion: FAAR provides an effective learnable rounding strategy for NVFP4 quantization that significantly improves LLM performance on edge devices while maintaining minimal training costs.
Abstract: Deploying large language models (LLMs) on edge devices requires extremely low-bit quantization. Ultra-low precision formats such as NVFP4 offer a promising solution for reducing memory footprint and accelerating computation. However, existing quantization methods typically rely on conventional rounding strategies and fail to account for the non-uniformity of the NVFP4 numerical grid, resulting in suboptimal rounding decisions and amplified quantization errors. To address this, we propose Format-Aware Adaptive Rounding (FAAR), a learnable rounding strategy tailored for the NVFP4 format. Unlike conventional quantization paradigms, FAAR explicitly incorporates the non-uniform NVFP4 grid into the optimization process. By adaptively adjusting rounding decisions guided by loss gradients, our method effectively approximates the theoretically optimal quantization. To complement FAAR, we introduce a 2-stages Format Alignment (2FA) fine-tuning scheme that aligns LLM parameters layer-by-layer to the NVFP4 numerical space, further narrowing the performance gap. Remarkably, this learnable optimization incurs a minimal training overhead of only 4 GPU hours on Llama3-1B. Extensive experiments demonstrate the effectiveness of our approach. Compared with Round-to-Nearest (RTN), our method reduces perplexity on WikiText-2 from 14.28 to 12.60 on Llama3-1B and from 23.06 to 21.27 on Qwen3-1.7B. Additionally, our method consistently outperforms state-of-the-art approaches across various zero-shot downstream tasks.
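To see why the grid's non-uniformity matters, here is the baseline FAAR improves on: plain round-to-nearest onto the FP4 E2M1 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a simple per-block absmax scale. NVFP4 additionally stores the block scale in FP8, which is omitted here; note the grid spacing doubles twice at larger magnitudes.

```python
import numpy as np

# positive FP4 E2M1 magnitudes -- spacing is non-uniform (0.5, 0.5, 0.5, 1, 1, 2)
FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4[:0:-1], FP4])

def quantize_block_rtn(x):
    """Round-to-nearest onto the NVFP4 value grid after absmax block scaling
    (the FP8 encoding of the per-block scale is omitted)."""
    x = np.asarray(x, dtype=float)
    amax = np.abs(x).max()
    scale = amax / 6.0 if amax > 0 else 1.0
    idx = np.argmin(np.abs(x[:, None] / scale - GRID[None, :]), axis=1)
    return GRID[idx] * scale, scale
```

FAAR replaces exactly these nearest-point decisions with learned up/down choices guided by the loss gradient, which is where the perplexity gains over RTN come from.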
[481] Rethinking Multimodal Fusion for Time Series: Auxiliary Modalities Need Constrained Fusion
Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn
Main category: cs.LG
TL;DR: CFA (Controlled Fusion Adapter) is a plug-in method for time series forecasting that selectively integrates relevant textual information while filtering out irrelevant auxiliary data, outperforming naive multimodal fusion approaches.
Details
Motivation: Existing multimodal time series forecasting methods often provide limited gains due to uncontrolled integration of auxiliary modalities (text/vision) that can introduce irrelevant information, sometimes even underperforming unimodal approaches.
Method: Proposes Controlled Fusion Adapter (CFA) - a simple plug-in method using low-rank adapters to filter irrelevant textual information before fusing it into temporal representations, enabling controlled cross-modal interactions without modifying the time series backbone.
Result: Conducted over 20K experiments across various datasets and time series/text models, demonstrating that constrained fusion methods (including CFA) consistently outperform naive fusion methods.
Conclusion: Controlled fusion of auxiliary modalities is crucial for effective multimodal time series forecasting, and CFA provides a practical solution for integrating only relevant textual information aligned with time series dynamics.
Abstract: Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only in specific datasets or relying on architecture-specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, which we attribute to the uncontrolled integration of auxiliary modalities which may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose Controlled Fusion Adapter (CFA), a simple plug-in method that enables controlled cross-modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low-rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods including CFA. Code is publicly available at: https://github.com/seunghan96/cfa/.
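The core idea, constraining the fusion path so only filtered textual information reaches the temporal representation, can be sketched in a few lines. This is an illustrative toy (the names `fused_representation`, `A`, `B`, and `gate` are ours, not CFA's API), with a rank-limited projection standing in for the low-rank adapter:

```python
def matvec(M, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def fused_representation(ts_repr, text_feat, A, B, gate):
    """Pass text features through a rank-r bottleneck (B @ A), then add the
    gated result to the time-series representation. gate=0 recovers the
    unimodal model, i.e. fusion is constrained rather than unconditional."""
    filtered = matvec(B, matvec(A, text_feat))
    return [t + gate * f for t, f in zip(ts_repr, filtered)]
```

The contrast with naive fusion (simple addition or concatenation) is that both the bottleneck rank and the gate bound how much auxiliary signal can enter.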
[482] Three Creates All: You Only Sample 3 Steps
Yuren Cai, Guangyi Wang, Zongqing Li, Li Li, Zhihui Liu, Songzhi Su
Main category: cs.LG
TL;DR: MTEO optimizes time embeddings for diffusion models to enable faster few-step sampling without modifying the backbone network.
Details
Motivation: Diffusion models produce high-quality results but are slow due to many sequential network evaluations. Standard timestep conditioning becomes a bottleneck for few-step sampling, limiting practical deployment.
Method: Proposes Multi-layer Time Embedding Optimization (MTEO) which freezes the pretrained diffusion backbone and distills a small set of step-wise, layer-wise time embeddings from reference trajectories. It’s plug-and-play with existing ODE solvers, adds no inference-time overhead, and trains only a tiny fraction of parameters.
Result: Extensive experiments across diverse datasets and backbones show state-of-the-art performance in few-step sampling, substantially narrowing the gap between distillation-based and lightweight methods.
Conclusion: MTEO provides an efficient way to accelerate diffusion model inference while maintaining high fidelity, making diffusion models more practical for real-world applications.
Abstract: Diffusion models deliver high-fidelity generation but remain slow at inference time due to many sequential network evaluations. We find that standard timestep conditioning becomes a key bottleneck for few-step sampling. Motivated by layer-dependent denoising dynamics, we propose Multi-layer Time Embedding Optimization (MTEO), which freezes the pretrained diffusion backbone and distills a small set of step-wise, layer-wise time embeddings from reference trajectories. MTEO is plug-and-play with existing ODE solvers, adds no inference-time overhead, and trains only a tiny fraction of parameters. Extensive experiments across diverse datasets and backbones show state-of-the-art performance in few-step sampling and substantially narrow the gap between distillation-based and lightweight methods. Code will be available.
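To see why the trainable footprint is tiny, the distilled parameters amount to one embedding per (sampling step, layer) pair. A rough sketch (our naming, with a zero placeholder where MTEO would distil values from reference trajectories):

```python
def mteo_table(num_steps, num_layers, dim, init=0.0):
    """Step-wise, layer-wise time-embedding table: the only trainable
    parameters; the backbone stays frozen. Zero init is a placeholder for
    the embeddings distilled from reference trajectories."""
    return {(s, l): [init] * dim for s in range(num_steps) for l in range(num_layers)}

def trainable_params(table):
    """Total trainable scalars: num_steps * num_layers * dim."""
    return sum(len(v) for v in table.values())
```

For a 3-step sampler over a 4-layer toy backbone with 8-dim embeddings this is 96 scalars, versus the millions in the backbone, which stays untouched.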
[483] Instruction-Tuned, but Not More Verifiable Instruction-Following: A Cross-Task Diagnosis for LoRA Adapters
Junyi Zou
Main category: cs.LG
TL;DR: Paper examines whether nominal adapter labels (like “instruction-tuned”) reliably predict actual cross-task capability gains, finding significant mismatches called “capability drift”
Details
Motivation: To test whether nominal training objectives of adapters (e.g., instruction-tuned labels) reliably align with realized cross-task capability improvements, as practitioners often deploy adapters based on these nominal labels.
Method: Evaluated same LoRA adapter across multiple tasks, focusing on strict automatically verifiable instruction following (IFEval). Tested across multiple seeds, base models, and LoRA settings, with controlled instruction-versus-numeric experiments
Result: Found nominal labels recurrently but not universally fail to predict improvements on verifiable targets. In strongest case, instruction-tuned adapter substantially improved off-target numeric benchmark performance (0.133 to 0.632) while not improving verifiable instruction following on IFEval
Conclusion: Practitioners should perform routine cross-task evaluation before deployment and avoid treating nominal labels as reliable capability proxies due to “capability drift” - the mismatch between nominal objectives and realized cross-task gains
Abstract: Adapters are often selected and deployed based on nominal labels (e.g., instruction-tuned), which implicitly suggest what capability improves after adaptation. We test whether nominal training objectives reliably align with realized cross-task capability gains by evaluating the same LoRA adapter across tasks. Our strongest evidence is tied to strict, automatically verifiable instruction following as measured by IFEval: across multiple seeds, base models, and LoRA settings, nominal labels recurrently but not universally fail to predict improvements on this verifiable target, with clear configuration sensitivity including a near-zero or negative case. As an illustrative strongest-case example in a controlled instruction-versus-numeric setting, an instruction-tuned adapter substantially improves off-target NM-based numeric benchmark performance from 0.133 to 0.632 while not improving verifiable instruction following on IFEval (ILA: 0.313 to 0.271; PLA: 0.250 to 0.143; values rounded to three decimals). We refer to this nominal-versus-realized mismatch pattern as capability drift as a descriptive label. The mismatch is visible in the raw cross-task performance matrix; we use a drift score only as a compact summary in the same units as the underlying metrics, not as a new formal metric contribution. Evidence from broader instruction-following benchmarks is benchmark-dependent and mixed, reflecting heterogeneity in how instruction following is operationalized; we therefore do not treat cross-benchmark agreement as a premise. Overall, the practical takeaway is to perform routine cross-task evaluation before deployment and to avoid treating nominal labels as reliable capability proxies.
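The paper treats its drift score only as a compact summary of the cross-task performance matrix, and does not spell out a formula here, so the definition below (largest off-target gain minus on-target gain, in the same units as the underlying metrics) is an illustrative assumption of ours, applied to the reported strongest-case numbers (ILA as the IFEval metric):

```python
def capability_drift(base, adapted, nominal_task):
    """Illustrative drift summary: best off-target gain minus on-target gain.
    A large positive value means the adapter improved somewhere other than
    what its nominal label advertises."""
    gains = {t: adapted[t] - base[t] for t in base}
    on_target = gains[nominal_task]
    off_target = max(g for t, g in gains.items() if t != nominal_task)
    return off_target - on_target
```

With the abstract's figures (IFEval ILA 0.313 to 0.271, numeric 0.133 to 0.632) this summary comes out at roughly 0.541, consistent with the qualitative claim that the off-target gain dwarfs the (negative) on-target change.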
[484] Symbolic Graph Networks for Robust PDE Discovery from Noisy Sparse Data
Xingyu Chen, Junxiu An, Jun Guo, Yuqian Zhou
Main category: cs.LG
TL;DR: SGN framework combines graph neural networks with symbolic regression for robust PDE discovery from noisy, sparse data
Details
Motivation: Existing PDE discovery methods struggle with noisy and sparse measurements; need robust approaches that don't rely on error-prone numerical differentiation.
Method: Symbolic Graph Network (SGN) uses graph message passing for non-local spatial representations (less sensitive to noise) followed by symbolic regression to extract interpretable mathematical expressions
Result: SGN successfully recovers governing equations from wave, convection-diffusion, and Navier-Stokes systems under varying noise levels, outperforming baselines in sparse/noisy settings
Conclusion: Graph-based representations combined with symbolic regression provide robust framework for data-driven physical law discovery from imperfect observations
Abstract: Data-driven discovery of partial differential equations (PDEs) offers a promising paradigm for uncovering governing physical laws from observational data. However, in practical scenarios, measurements are often contaminated by noise and limited by sparse sampling, which poses significant challenges to existing approaches based on numerical differentiation or integral formulations. In this work, we propose a Symbolic Graph Network (SGN) framework for PDE discovery under noisy and sparse conditions. Instead of relying on local differential approximations, SGN leverages graph message passing to model spatial interactions, providing a non-local representation that is less sensitive to high frequency noise. Based on this representation, the learned latent features are further processed by a symbolic regression module to extract interpretable mathematical expressions. We evaluate the proposed method on several benchmark systems, including the wave equation, convection-diffusion equation, and incompressible Navier-Stokes equations. Experimental results show that SGN can recover meaningful governing relations or solution forms under varying noise levels, and demonstrates improved robustness compared to baseline methods in sparse and noisy settings. These results suggest that combining graph-based representations with symbolic regression provides a viable direction for robust data-driven discovery of physical laws from imperfect observations. The code is available at https://github.com/CXY0112/SGN
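The noise-robustness argument rests on message passing aggregating over a neighbourhood rather than differentiating point values. A toy sketch of one mean-aggregation round (not SGN's learned message function) shows the smoothing effect:

```python
def message_pass(values, edges):
    """One round of mean aggregation over graph neighbours. Averaging a node
    with its neighbourhood damps high-frequency (node-to-node) noise, unlike
    finite-difference derivatives, which amplify it."""
    neigh = {i: [] for i in range(len(values))}
    for i, j in edges:
        neigh[i].append(values[j])
        neigh[j].append(values[i])
    return [(values[i] + sum(neigh[i])) / (1 + len(neigh[i]))
            for i in range(len(values))]
```

On an alternating-sign signal (pure high-frequency noise) a single round already shrinks every value toward zero, whereas a finite-difference stencil would double its amplitude.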
[485] Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure
Davide Di Gioia
Main category: cs.LG
TL;DR: A temporal control system that learns optimal action intervals using hyperbolic geometry to measure future uncertainty, with spatial-temporal embeddings improving timing decisions.
Details
Motivation: Autonomous agents need to decide not just what actions to take, but when to take them. Current approaches use ad hoc biologically inspired timers rather than principled learned policies for optimal timing intervals.
Method: Introduces ATCPG (Adaptive Temporal Control with Predictive Geometry) that learns optimal intervals between cognitive ticks using hyperbolic geometry. Uses Poincare distance among sampled futures as a “curvature signal” to measure uncertainty. High spread (uncertain future) triggers sooner action, low spread permits longer intervals. Also introduces interval-aware reward to penalize inefficiency relative to wait time, and joint spatio-temporal embedding (ATCPG-ST) that combines state and position projections.
Result: The system achieves 22.8% efficiency improvement over fixed-interval baseline. Learning contributes 54.8% efficiency gain, hyperbolic spread adds 26.2% over geometry-free control, and spatial position information yields additional 5.8% improvement. Mean hyperbolic spread increases from 1.88 to 3.37 with spatio-temporal embedding.
Conclusion: Learned adaptive temporal control with hyperbolic geometry provides significant efficiency gains for autonomous agents by optimizing action timing based on future uncertainty, with spatial information further enhancing performance.
Abstract: Autonomous agents operating in continuous environments must decide not only what to do, but when to act. We introduce a lightweight adaptive temporal control system that learns the optimal interval between cognitive ticks from experience, replacing ad hoc biologically inspired timers with a principled learned policy. The policy state is augmented with a predictive hyperbolic spread signal (a “curvature signal” shorthand) derived from hyperbolic geometry: the mean pairwise Poincare distance among n sampled futures embedded in the Poincare ball. High spread indicates a branching, uncertain future and drives the agent to act sooner; low spread signals predictability and permits longer rest intervals. We further propose an interval-aware reward that explicitly penalises inefficiency relative to the chosen wait time, correcting a systematic credit-assignment failure of naive outcome-based rewards in timing problems. We additionally introduce a joint spatio-temporal embedding (ATCPG-ST) that concatenates independently normalised state and position projections in the Poincare ball; spatial trajectory divergence provides an independent timing signal unavailable to the state-only variant (ATCPG-SO). This extension raises mean hyperbolic spread (kappa) from 1.88 to 3.37 and yields a further 5.8 percent efficiency gain over the state-only baseline. Ablation experiments across five random seeds demonstrate that (i) learning is the dominant efficiency factor (54.8 percent over no-learning), (ii) hyperbolic spread provides significant complementary gain (26.2 percent over geometry-free control), (iii) the combined system achieves 22.8 percent efficiency over the fixed-interval baseline, and (iv) adding spatial position information to the spread embedding yields an additional 5.8 percent.
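The spread signal itself is well defined from the abstract: the mean pairwise Poincare distance among n sampled future embeddings in the Poincare ball. A minimal sketch (standard Poincare-ball distance; how the futures are sampled and embedded is the paper's machinery, not shown):

```python
import math

def poincare_dist(u, v):
    """Geodesic distance in the Poincare ball (points must have norm < 1)."""
    du2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * du2 / ((1.0 - nu) * (1.0 - nv)))

def hyperbolic_spread(futures):
    """Mean pairwise Poincare distance among sampled future embeddings:
    high spread = branching, uncertain future; low spread = predictable."""
    pairs = [(i, j) for i in range(len(futures))
             for j in range(i + 1, len(futures))]
    return sum(poincare_dist(futures[i], futures[j]) for i, j in pairs) / len(pairs)
```

A tightly clustered set of futures yields a small spread (act later); widely separated futures near the ball's boundary yield a large spread (act sooner), matching the timing rule described above.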
[486] Neural Structure Embedding for Symbolic Regression via Continuous Structure Search and Coefficient Optimization
Fateme Memar, Tao Zhe, Dongjie Wang
Main category: cs.LG
TL;DR: SRCO: A unified embedding-driven framework for symbolic regression that transforms symbolic structures into continuous, optimizable representations for more efficient equation discovery.
Details
Motivation: Existing symbolic regression methods rely heavily on discrete structure search (e.g., genetic programming), which leads to high computational cost, unstable performance, and limited scalability to large equation spaces.
Method: Three-component framework: (1) Structure embedding: generate exploratory equations and train Transformer to compress symbolic structures into continuous embedding space; (2) Continuous structure search: use gradient-based or sampling-based optimization in embedding space; (3) Coefficient optimization: treat symbolic coefficients as learnable parameters and apply gradient optimization.
Result: Experiments on synthetic and real-world datasets show consistent outperformance of state-of-the-art methods in equation accuracy, robustness, and search efficiency.
Conclusion: Introduces a new paradigm for symbolic regression by bridging symbolic equation discovery with continuous embedding learning and optimization.
Abstract: Symbolic regression aims to discover human-interpretable equations that explain observational data. However, existing approaches rely heavily on discrete structure search (e.g., genetic programming), which often leads to high computational cost, unstable performance, and limited scalability to large equation spaces. To address these challenges, we propose SRCO, a unified embedding-driven framework for symbolic regression that transforms symbolic structures into a continuous, optimizable representation space. The framework consists of three key components: (1) structure embedding: we first generate a large pool of exploratory equations using traditional symbolic regression algorithms and train a Transformer model to compress symbolic structures into a continuous embedding space; (2) continuous structure search: the embedding space enables efficient exploration using gradient-based or sampling-based optimization, significantly reducing the cost of navigating the combinatorial structure space; and (3) coefficient optimization: for each discovered structure, we treat symbolic coefficients as learnable parameters and apply gradient optimization to obtain accurate numerical values. Experiments on synthetic and real-world datasets show that our approach consistently outperforms state-of-the-art methods in equation accuracy, robustness, and search efficiency. This work introduces a new paradigm for symbolic regression by bridging symbolic equation discovery with continuous embedding learning and optimization.
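Component (3) is the most self-contained: for a fixed symbolic structure, coefficients become continuous parameters fitted by gradient descent. A toy sketch of that stage alone (finite differences stand in for the autograd the paper would use; the structure and data are ours):

```python
def fit_coefficients(xs, ys, structure, init, lr=0.05, steps=2000, eps=1e-6):
    """Given a fixed structure y = structure(x, params), minimise MSE over
    the coefficient vector by coordinate-wise gradient descent."""
    params = list(init)

    def mse(p):
        return sum((structure(x, p) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    for _ in range(steps):
        for k in range(len(params)):
            shifted = list(params)
            shifted[k] += eps
            g = (mse(shifted) - mse(params)) / eps  # forward difference
            params[k] -= lr * g
    return params
```

For the structure a*x + b on data generated by 2x + 1, the fit recovers the coefficients to within the optimizer's tolerance; in SRCO the structure itself comes out of the continuous embedding search.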
[487] Model Predictive Control with Differentiable World Models for Offline Reinforcement Learning
Rohan Deb, Stephen J. Wright, Arindam Banerjee
Main category: cs.LG
TL;DR: Offline RL framework using differentiable world models for inference-time policy optimization via MPC, achieving gains on D4RL benchmarks
Details
Motivation: Existing offline RL methods use fixed policies at inference time without adaptation, missing opportunities to leverage real-time information for better performance.
Method: Differentiable World Model (DWM) pipeline enabling end-to-end gradient computation through imagined rollouts for inference-time policy optimization using Model Predictive Control
Result: Consistent improvements over strong offline RL baselines on D4RL continuous-control benchmarks (MuJoCo locomotion and AntMaze tasks)
Conclusion: Inference-time adaptation using differentiable world models and MPC provides significant performance gains in offline RL settings
Abstract: Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed offline datasets, without further interactions with the environment. Such methods train an offline policy (or value function), and apply it at inference time without further refinement. We introduce an inference time adaptation framework inspired by model predictive control (MPC) that utilizes a pretrained policy along with a learned world model of state transitions and rewards. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to optimize the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for policy optimization at inference time based on MPC. We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines.
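The inference-time loop is: roll the current policy through the learned world model, differentiate the imagined return with respect to the policy parameters, and take a few ascent steps before acting. A 1-D toy sketch (finite differences replace the end-to-end autograd of DWM; dynamics, reward, and the scalar policy a = theta * s are our illustrative choices):

```python
def imagined_return(theta, s0, dynamics, reward, horizon):
    """Roll out the learned world model under the policy a = theta * s."""
    s, ret = s0, 0.0
    for _ in range(horizon):
        a = theta * s
        ret += reward(s, a)
        s = dynamics(s, a)
    return ret

def mpc_adapt(theta, s0, dynamics, reward, horizon=5, lr=0.05, steps=50, eps=1e-4):
    """Inference-time refinement: gradient ascent on the imagined return,
    starting from the pretrained policy parameter theta."""
    for _ in range(steps):
        g = (imagined_return(theta + eps, s0, dynamics, reward, horizon)
             - imagined_return(theta - eps, s0, dynamics, reward, horizon)) / (2 * eps)
        theta += lr * g
    return theta
```

On a toy regulator (s' = s + a, reward -(s^2 + 0.1 a^2)), adaptation pushes theta toward a stabilizing negative feedback gain and strictly improves the imagined return over the unadapted policy.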
[488] SkillRouter: Retrieve-and-Rerank Skill Selection for LLM Agents at Scale
YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuan Zhu, Baohua Dong, Hangcheng Zhu
Main category: cs.LG
TL;DR: SkillRouter: A two-stage retrieve-and-rerank pipeline for routing LLM agents to relevant skills from large repositories, using skill body text as the decisive signal for accurate routing.
Details
Motivation: As LLM agent ecosystems grow with tens of thousands of available skills, it becomes infeasible to inject all skills into an agent's context. Current architectures expose only skill names and descriptions while hiding implementation bodies, but this may not be sufficient for accurate skill routing given pervasive functional overlap in community repositories.
Method: Proposes SkillRouter, a two-stage retrieve-and-rerank pipeline with 1.2B total parameters (0.6B encoder + 0.6B reranker). The approach uses skill body text as the primary signal, with cross-encoder attention analysis showing 91.7% of attention concentrates on the body field. The system is designed to be deployable on consumer hardware.
Result: SkillRouter achieves 74.0% top-1 routing accuracy on a benchmark of ~80K skills and 75 expert-verified queries. Removing skill body causes 29-44 percentage point degradation across all retrieval methods. The system delivers the strongest average result among compact and zero-shot baselines evaluated.
Conclusion: Skill body text is the decisive signal for accurate skill routing in LLM agent ecosystems, challenging the assumption that metadata alone is sufficient. The proposed SkillRouter provides an effective solution for routing agents to relevant skills from large repositories while remaining practical for deployment.
Abstract: As LLM agent ecosystems grow, the number of available skills (tools, plugins) has reached tens of thousands, making it infeasible to inject all skills into an agent’s context. This creates a need for skill routing – retrieving the most relevant skills from a large pool given a user task. The problem is compounded by pervasive functional overlap in community skill repositories, where many skills share similar names and purposes yet differ in implementation details. Despite its practical importance, skill routing remains under-explored. Current agent architectures adopt a progressive disclosure design – exposing only skill names and descriptions to the agent while keeping the full implementation body hidden – implicitly treating metadata as sufficient for selection. We challenge this assumption through a systematic empirical study on a benchmark of ~80K skills and 75 expert-verified queries. Our key finding is that the skill body (full implementation text) is the decisive signal: removing it causes 29–44 percentage point degradation across all retrieval methods, and cross-encoder attention analysis reveals 91.7% of attention concentrating on the body field. Motivated by this finding, we propose SkillRouter, a two-stage retrieve-and-rerank pipeline totaling only 1.2B parameters (0.6B encoder + 0.6B reranker). SkillRouter achieves 74.0% top-1 routing accuracy and delivers the strongest average result among the compact and zero-shot baselines we evaluate, while remaining deployable on consumer hardware.
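The two-stage shape, and why the body matters when metadata overlaps, can be illustrated with a toy router. Word-overlap scoring stands in for the 0.6B encoder and 0.6B reranker; the skill records and scoring are our illustrative assumptions:

```python
def overlap(query, text):
    """Crude lexical relevance: fraction of query words present in text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def route(query, skills, k=3):
    """Two-stage retrieve-and-rerank: a cheap retriever over name+description
    shortlists k candidates; a body-aware reranker picks the winner."""
    shortlist = sorted(
        skills,
        key=lambda s: overlap(query, s["name"] + " " + s["desc"]),
        reverse=True,
    )[:k]
    return max(shortlist, key=lambda s: overlap(query, s["body"]))["name"]
```

When two skills share the same description (the functional-overlap case the paper highlights), the metadata-only stage cannot separate them, and only the body comparison breaks the tie, mirroring the paper's finding that the body is the decisive signal.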
[489] A Theoretical Framework for Energy-Aware Gradient Pruning in Federated Learning
Emmanouil M. Athanasakos
Main category: cs.LG
TL;DR: CWMP is a federated learning gradient pruning method that optimizes for energy efficiency by weighting parameter updates by their hardware costs, outperforming standard magnitude-based pruning.
Details
Motivation: Standard gradient sparsification methods like Top-K pruning reduce communication but ignore hardware energy costs, treating all parameter updates equally despite varying memory and compute requirements in edge devices.
Method: Formalizes pruning as an energy-constrained projection problem, proposes Cost-Weighted Magnitude Pruning (CWMP) that selects parameters based on magnitude relative to physical cost, proven as optimal greedy solution with probabilistic analysis of global energy efficiency.
Result: On non-IID CIFAR-10 benchmark, CWMP establishes superior performance-energy Pareto frontier compared to Top-K baseline, demonstrating better trade-offs between model accuracy and energy consumption.
Conclusion: Energy-aware gradient pruning via CWMP provides practical improvements for federated learning on resource-constrained edge devices by accounting for hardware-level cost disparities.
Abstract: Federated Learning (FL) is constrained by the communication and energy limitations of decentralized edge devices. While gradient sparsification via Top-K magnitude pruning effectively reduces the communication payload, it remains inherently energy-agnostic. It assumes all parameter updates incur identical downstream transmission and memory-update costs, ignoring hardware realities. We formalize the pruning process as an energy-constrained projection problem that accounts for the hardware-level disparities between memory-intensive and compute-efficient operations during the post-backpropagation phase. We propose Cost-Weighted Magnitude Pruning (CWMP), a selection rule that prioritizes parameter updates based on their magnitude relative to their physical cost. We demonstrate that CWMP is the optimal greedy solution to this constrained projection and provide a probabilistic analysis of its global energy efficiency. Numerical results on a non-IID CIFAR-10 benchmark show that CWMP consistently establishes a superior performance-energy Pareto frontier compared to the Top-K baseline.
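The selection rule reads as a greedy knapsack: rank updates by magnitude per unit energy cost and keep them until the budget is exhausted. A minimal sketch under that reading (function name and list-based interface are ours):

```python
def cwmp_select(grads, costs, budget):
    """Cost-Weighted Magnitude Pruning sketch: greedily keep the updates
    with the highest |gradient| per unit of energy cost until the energy
    budget is spent. Returns the kept indices in ascending order."""
    order = sorted(range(len(grads)),
                   key=lambda i: abs(grads[i]) / costs[i], reverse=True)
    kept, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:
            kept.append(i)
            spent += costs[i]
    return sorted(kept)
```

The contrast with Top-K is visible on a small example: a large-magnitude update with a 10x energy cost loses to two cheap updates, which is exactly the energy-aware trade-off that plain magnitude ranking ignores.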
[490] Adversarial Vulnerabilities in Neural Operator Digital Twins: Gradient-Free Attacks on Nuclear Thermal-Hydraulic Surrogates
Samrendra Roy, Kazuma Kobayashi, Souvik Chakraborty, Rizwan-uddin, Syed Bahauddin Alam
Main category: cs.LG
TL;DR: Neural operators for digital twins are highly vulnerable to sparse adversarial attacks exploiting boundary condition sensitivity, with gradient-free attacks causing catastrophic prediction failures while evading standard detection methods.
Details
Motivation: Operator learning models are becoming core components of digital twins for nuclear and energy systems, but their robustness to adversarial perturbations remains uncharacterized, creating a critical safety gap for deployment in safety-critical applications.
Method: Used gradient-free differential evolution across four operator architectures to demonstrate vulnerability to sparse adversarial perturbations. Introduced effective perturbation dimension (d_eff) as a Jacobian-based diagnostic and developed a two-factor vulnerability model combining sensitivity magnitude and concentration.
Result: Minimal modifications (fewer than 1% of inputs) trigger catastrophic prediction failures, increasing relative L2 error from ~1.5% to 37-63% while remaining undetectable by standard validation metrics. 100% of successful single-point attacks pass z-score anomaly detection. Gradient-free search outperforms gradient-based alternatives on architectures with gradient pathologies.
Conclusion: Operator learning models have a previously overlooked attack surface and require robustness guarantees beyond standard validation before deployment in safety-critical systems. The vulnerability depends on both sensitivity concentration and amplification, not just extreme sensitivity.
Abstract: Operator learning models are rapidly emerging as the predictive core of digital twins for nuclear and energy systems, promising real-time field reconstruction from sparse sensor measurements. Yet their robustness to adversarial perturbations remains uncharacterized, a critical gap for deployment in safety-critical systems. Here we show that neural operators are acutely vulnerable to extremely sparse (fewer than 1% of inputs), physically plausible perturbations that exploit their sensitivity to boundary conditions. Using gradient-free differential evolution across four operator architectures, we demonstrate that minimal modifications trigger catastrophic prediction failures, increasing relative $L_2$ error from $\sim$1.5% (validated accuracy) to 37-63% while remaining completely undetectable by standard validation metrics. Notably, 100% of successful single-point attacks pass z-score anomaly detection. We introduce the effective perturbation dimension $d_{\text{eff}}$, a Jacobian-based diagnostic that, together with sensitivity magnitude, yields a two-factor vulnerability model explaining why architectures with extreme sensitivity concentration (POD-DeepONet, $d_{\text{eff}} \approx 1$) are not necessarily the most exploitable, since low-rank output projections cap maximum error, while moderate concentration with sufficient amplification (S-DeepONet, $d_{\text{eff}} \approx 4$) produces the highest attack success. Gradient-free search outperforms gradient-based alternatives (PGD) on architectures with gradient pathologies, while random perturbations of equal magnitude achieve near-zero success rates, confirming that the discovered vulnerabilities are structural. Our findings expose a previously overlooked attack surface in operator learning models and establish that these models require robustness guarantees beyond standard validation before deployment.
[491] Multimodal Training to Unimodal Deployment: Leveraging Unstructured Data During Training to Optimize Structured Data Only Deployment
Zigui Wang, Minghui Sun, Jiang Shu, Matthew M. Engelhard, Lauren Franz, Benjamin A. Goldstein
Main category: cs.LG
TL;DR: Multimodal framework using unstructured EHR data during training to enhance structured-only model deployment for late talking diagnosis
Details
Motivation: Unstructured EHR data (clinical notes) contain valuable clinical context not in structured fields, but are often unavailable for deployment; need methods to leverage this data during training while producing deployable structured-only models.
Method: Multimodal learning framework with teacher-student architecture: note-based teacher model (BioClinicalBERT embeddings) and structured-only student model (demographics + medical codes), trained jointly using contrastive learning and contrastive knowledge distillation loss
Result: Proposed model achieved AUROC of 0.705, outperforming structured-only baseline of 0.656; note-based teacher achieved AUROC = 0.985; cohort of 3,466 children evaluated for late talking
Conclusion: Incorporating unstructured data during training enhances model’s ability to identify task-relevant information in structured EHR data, enabling deployable structured-only phenotype models
Abstract: Unstructured Electronic Health Record (EHR) data, such as clinical notes, contain clinical contextual observations that are not directly reflected in structured data fields. This additional information can substantially improve model learning. However, due to their unstructured nature, these data are often unavailable or impractical to use when deploying a model. We introduce a multimodal learning framework that leverages unstructured EHR data during training while producing a model that can be deployed using only structured EHR data. Using a cohort of 3,466 children evaluated for late talking, we generated note embeddings with BioClinicalBERT and encoded structured embeddings from demographics and medical codes. A note-based teacher model and a structured-only student model were jointly trained using contrastive learning and contrastive knowledge distillation loss, producing a strong classifier (AUROC = 0.985). Our proposed model reached an AUROC of 0.705, outperforming the structured-only baseline of 0.656. These results demonstrate that incorporating unstructured data during training enhances the model’s capacity to identify task-relevant information within structured EHR data, enabling a deployable structured-only phenotype model.
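The paper does not spell out its contrastive distillation objective, so the sketch below is one standard reading (an InfoNCE-style loss, our assumption): pull each student (structured-only) embedding toward its paired teacher (note-based) embedding and away from other patients in the batch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_distill_loss(student, teacher, tau=0.1):
    """InfoNCE-style distillation: the i-th teacher embedding is the positive
    for the i-th student embedding; all other teachers are negatives."""
    loss = 0.0
    for i, s in enumerate(student):
        logits = [cosine(s, t) / tau for t in teacher]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(student)
```

Correctly paired embeddings give a much lower loss than shuffled pairs, which is the signal that teaches the structured-only student where in its inputs the note-derived information lives, leaving a deployable model that needs no notes at inference.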
[492] MIOFlow 2.0: A unified framework for inferring cellular stochastic dynamics from single cell and spatial transcriptomics data
Xingzhi Sun, João Felipe Rocha, Brett Phelan, Dhananjay Bhaskar, Guillaume Huguet, Yanlei Zhang, D. S. Magruder, Alexander Tong, Ke Xu, Oluwadamilola Fasina, Mark Gerstein, Guy Wolf, Natalia Ivanova, Christine L. Chaffer, Smita Krishnaswamy
Main category: cs.LG
TL;DR: MIOFlow 2.0 is a computational framework that learns continuous cellular trajectories from discrete single-cell transcriptomics data by integrating manifold learning, optimal transport, and neural differential equations to model stochastic branching, population changes, and spatial environmental influences.
Details
Motivation: Current methods for inferring cellular trajectories from time-resolved single-cell transcriptomics often use deterministic interpolations that fail to capture the probabilistic branching, population shifts, and niche-dependent signaling that drive real biological processes. There's a need for methods that can model the biological complexity of cellular development, regeneration, and disease.
Method: MIOFlow 2.0 integrates manifold learning, optimal transport, and neural differential equations. It models: (1) stochasticity and branching via Neural Stochastic Differential Equations; (2) non-conservative population changes using a learned growth-rate model initialized with unbalanced optimal transport; and (3) environmental influence through a joint latent space unifying gene expression with spatial features. It operates in a PHATE-distance matching autoencoder latent space to respect the data’s intrinsic geometry.
Result: Empirical comparisons show MIOFlow 2.0’s expressive trajectory learning via neural differential equations outperforms existing generative models, including simulation-free flow matching. Validated on synthetic datasets, embryoid body differentiation, and spatially resolved axolotl brain regeneration, it improves trajectory accuracy and reveals hidden drivers of cellular transitions, like specific signaling niches.
Conclusion: MIOFlow 2.0 bridges single-cell and spatial transcriptomics to uncover tissue-scale trajectories, providing a biologically informed framework for understanding cellular development, regeneration, and disease processes through continuous trajectory inference.
Abstract: Understanding cellular trajectories via time-resolved single-cell transcriptomics is vital for studying development, regeneration, and disease. A key challenge is inferring continuous trajectories from discrete snapshots. Biological complexity stems from stochastic cell fate decisions, temporal proliferation changes, and spatial environmental influences. Current methods often use deterministic interpolations treating cells in isolation, failing to capture the probabilistic branching, population shifts, and niche-dependent signaling driving real biological processes. We introduce Manifold Interpolating Optimal-Transport Flow (MIOFlow) 2.0. This framework learns biologically informed cellular trajectories by integrating manifold learning, optimal transport, and neural differential equations. It models three core processes: (1) stochasticity and branching via Neural Stochastic Differential Equations; (2) non-conservative population changes using a learned growth-rate model initialized with unbalanced optimal transport; and (3) environmental influence through a joint latent space unifying gene expression with spatial features like local cell type composition and signaling. By operating in a PHATE-distance matching autoencoder latent space, MIOFlow 2.0 ensures trajectories respect the data’s intrinsic geometry. Empirical comparisons show expressive trajectory learning via neural differential equations outperforms existing generative models, including simulation-free flow matching. Validated on synthetic datasets, embryoid body differentiation, and spatially resolved axolotl brain regeneration, MIOFlow 2.0 improves trajectory accuracy and reveals hidden drivers of cellular transitions, like specific signaling niches. MIOFlow 2.0 thus bridges single-cell and spatial transcriptomics to uncover tissue-scale trajectories.
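The Neural SDE component (1) is what lets identical initial cells branch into different fates. A 1-D Euler-Maruyama sketch of such a rollout (a generic SDE integrator, with a fixed drift and constant noise scale in place of MIOFlow 2.0's learned networks):

```python
import random

def euler_maruyama(drift, sigma, x0, dt, steps, rng):
    """Simulate dX = drift(X) dt + sigma dW. With sigma > 0 the rollout is
    stochastic, so repeated runs from the same x0 diverge (branching);
    sigma = 0 recovers a deterministic ODE trajectory."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x + drift(x) * dt + sigma * rng.gauss(0.0, 1.0) * dt ** 0.5
        path.append(x)
    return path
```

In MIOFlow 2.0 the drift and diffusion are neural networks acting in the PHATE-matching latent space, and the rollouts are fit so their marginals match the observed snapshots; this toy just shows why the SDE formulation captures branching that deterministic interpolation cannot.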
[493] A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
Anish Saha, Konstantin Shmakov
Main category: cs.LG
TL;DR: Time-series foundation model using instruction-conditioned in-context learning with quantile-regression T5 encoder-decoder for forecasting and related tasks without task-specific fine-tuning.
Details
Motivation: Existing time-series foundation models lack explicit instruction-conditioned demonstrations for in-context learning, relying instead on implicit positional context, retrieval, or task-specific objectives.
Method: Uses quantile-regression T5 encoder-decoder with structured tokenization marking target series, covariates, context, and task-specific future information. Hierarchical Transformer with per-example encoding, example-level fusion, and cross-example attention conditions decoding on demonstration pairs.
Result: Outperforms strong foundation baselines on point and probabilistic forecasting benchmarks (fev-bench and GIFT-Eval) across diverse datasets, frequencies, and horizons, while remaining competitive on classification and anomaly detection.
Conclusion: The model enables effective instruction-conditioned in-context learning for time-series tasks through multi-task training that learns a distribution over task mappings and improves adaptation to local structure.
Abstract: In-context learning (ICL) allows a model to adapt at inference time by conditioning on examples rather than updating parameters. Existing time-series foundation models use implicit positional context, retrieval, or task-specific objectives, but rarely explicit instruction-conditioned demonstrations. We present a foundation model for instruction-conditioned in-context time-series tasks based on a quantile-regression T5 encoder-decoder. Historical examples and queries are encoded with a structured tokenization scheme that marks target series, covariates, context, and task-specific future information. A hierarchical Transformer with per-example encoding, example-level fusion, and cross-example attention conditions decoding on demonstration pairs, enabling forecasting and related tasks without task-specific fine-tuning. We train on large-scale real and synthetic time series using supervised forecasting plus self-supervised tasks, including imputation, reconstruction, classification, anomaly detection, and source demixing. This multi-task training learns a distribution over task mappings and improves adaptation to local structure at inference time. Across diverse datasets, frequencies, and horizons, our method outperforms strong foundation baselines on point and probabilistic forecasting benchmarks, including fev-bench and GIFT-Eval, while remaining competitive on classification and anomaly detection.
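The quantile-regression objective behind the model's probabilistic forecasts can be sketched with the standard pinball loss; this minimal NumPy version is illustrative and not the paper's implementation:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss: asymmetric penalty minimized in expectation
    when y_pred equals the q-th quantile of the target distribution."""
    err = y_true - y_pred
    return np.mean(np.maximum(q * err, (q - 1.0) * err))

y = np.array([1.0, 2.0, 3.0, 4.0])
# For q = 0.9, under-prediction costs 9x more than over-prediction,
# pushing the forecast toward the upper tail.
lo = pinball_loss(y, np.full(4, 0.0), q=0.9)   # predictions too low
hi = pinball_loss(y, np.full(4, 5.0), q=0.9)   # predictions too high
```

Training one decoder head per quantile level in this way yields a full predictive distribution without any likelihood assumption.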
[494] TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly
Toshiaki Koike-Akino, Jing Liu, Ye Wang
Main category: cs.LG
TL;DR: Test-time quantization framework for large models that performs activation-aware compression at inference time without retraining, adapting to each prompt for improved performance across downstream tasks.
Details
Motivation: Existing activation-aware compression methods for large foundation models rely heavily on calibration data and suffer from domain shift issues when applied to unseen downstream tasks, limiting their practical applicability.
Method: Proposes a test-time quantization (TTQ) framework that compresses large models on-the-fly at inference time using efficient online calibration, enabling instant activation-aware quantization that adapts to each prompt regardless of downstream tasks while maintaining inference speedup.
Result: Experiments demonstrate that TTQ improves quantization performance over state-of-the-art baselines, showing better adaptation to various downstream tasks without domain shift issues.
Conclusion: TTQ provides an effective solution for compressing large foundation models by performing activation-aware quantization at inference time, overcoming calibration data dependency and domain shift problems while maintaining efficiency.
Abstract: To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these methods rely heavily on calibration data, domain shift issues may arise for unseen downstream tasks. We propose a test-time quantization (TTQ) framework which compresses large models on the fly at inference time to resolve this issue. With efficient online calibration, instant activation-aware quantization can adapt to every prompt regardless of the downstream task, while still achieving inference speedup. Several experiments demonstrate that TTQ improves quantization performance over state-of-the-art baselines.
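A minimal sketch of what activation-aware online calibration can look like: fold per-dimension activation magnitudes from the current prompt into the weights before symmetric round-to-nearest quantization, then unfold them. The helper below is a hypothetical illustration of the general idea, not the paper's TTQ algorithm:

```python
import numpy as np

def activation_aware_quantize(W, X, n_bits=4):
    """Quantize weights per output channel; columns are rescaled by the
    average activation magnitude of the current prompt (online calibration),
    so dimensions the prompt actually exercises are quantized more carefully."""
    s = np.mean(np.abs(X), axis=0) + 1e-8           # per-input-dim importance
    Ws = W * s                                       # fold importance into W
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(Ws), axis=1, keepdims=True) / qmax
    Wq = np.round(Ws / scale) * scale                # symmetric round-to-nearest
    return Wq / s                                    # unfold importance

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                         # (out_dim, in_dim)
X = rng.normal(size=(32, 16)) * np.linspace(0.1, 2.0, 16)  # skewed activations
Wq = activation_aware_quantize(W, X, n_bits=4)
err = np.mean((X @ W.T - X @ Wq.T) ** 2)             # output-space error
```

Because the importance vector `s` is computed from the prompt itself, no offline calibration set is needed and there is no train/deploy domain gap by construction.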
[495] Causal Discovery in Action: Learning Chain-Reaction Mechanisms from Interventions
Panayiotis Panayiotou, Özgür Şimşek
Main category: cs.LG
TL;DR: Causal discovery in chain-reaction systems is uniquely identifiable from blocking interventions, with proposed estimator achieving exponential error decay and logarithmic sample complexity.
Details
Motivation: Causal discovery in general dynamical systems faces identifiability challenges even with interventional data, but many real-world systems exhibit directional, cascade-like structure where components activate sequentially and upstream failures suppress downstream effects.
Method: Studies causal discovery in chain-reaction systems, shows the causal structure is uniquely identifiable from blocking interventions that prevent individual components from activating, and proposes a minimal estimator with finite-sample guarantees.
Result: The method achieves exponential error decay and logarithmic sample complexity. Experiments on synthetic models and diverse chain-reaction environments demonstrate reliable recovery from few interventions, while observational heuristics fail in regimes with delayed or overlapping causal effects.
Conclusion: Causal structure in chain-reaction systems can be reliably identified using blocking interventions, overcoming limitations of observational approaches and providing efficient recovery with minimal interventions.
Abstract: Causal discovery is challenging in general dynamical systems because, without strong structural assumptions, the underlying causal graph may not be identifiable even from interventional data. However, many real-world systems exhibit directional, cascade-like structure, in which components activate sequentially and upstream failures suppress downstream effects. We study causal discovery in such chain-reaction systems and show that the causal structure is uniquely identifiable from blocking interventions that prevent individual components from activating. We propose a minimal estimator with finite-sample guarantees, achieving exponential error decay and logarithmic sample complexity. Experiments on synthetic models and diverse chain-reaction environments demonstrate reliable recovery from a few interventions, while observational heuristics fail in regimes with delayed or overlapping causal effects.
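The blocking-intervention idea can be sketched on a toy chain: block each component in turn and note which downstream activations disappear relative to the unblocked run. This simplified, noiseless setting illustrates identifiability; it is not the paper's finite-sample estimator:

```python
import numpy as np

def simulate_chain(parents, blocked=None):
    """Activate components in order; a blocked component never activates,
    which suppresses everything downstream of it."""
    n = len(parents)
    active = np.zeros(n, dtype=bool)
    for i in range(n):
        if i == blocked:
            continue
        active[i] = parents[i] is None or active[parents[i]]
    return active

def discover_chain(parents_true):
    """Recover edges: block each node, compare with the unblocked run, and
    take the earliest suppressed node (other than the blocked one) as its child."""
    n = len(parents_true)
    base = simulate_chain(parents_true)
    edges = set()
    for b in range(n):
        out = simulate_chain(parents_true, blocked=b)
        kids = [j for j in np.where(base & ~out)[0] if j != b]
        if kids:
            edges.add((b, int(min(kids))))
    return edges

# Ground-truth chain 0 -> 1 -> 2 -> 3 (None marks the root).
edges = discover_chain([None, 0, 1, 2])
```

Note that purely observational data from such a system cannot distinguish a direct edge from a transitive one, which is exactly why the blocking interventions are needed.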
[496] Transfer learning via interpolating structures
T. A. Dardeno, A. J. Hughes, L. A. Bull, R. S. Mills, N. Dervilis, K. Worden
Main category: cs.LG
TL;DR: Proposes using intermediate structures to enable knowledge transfer between highly-disparate systems in structural health monitoring by continuously morphing one structure into another through parameter variation.
Details
Motivation: Addresses the challenge of knowledge transfer between heterogeneous structures in population-based structural health monitoring, where traditional methods struggle with highly-disparate systems.
Method: Uses intermediate structures to bridge information gaps by varying parameters like material properties and geometry to continuously morph one structure into another. Demonstrated through simulated bridge designs and physical representations of bridge-to-airplane transfer via finite-element models.
Result: Shows that positive transfer can be achieved between highly-disparate systems (like bridges and airplanes) through the proposed intermediate structure approach, challenging the assumption that such transfer is impossible.
Conclusion: Demonstrates that knowledge transfer between heterogeneous structures is possible via intermediate bridging structures, expanding the potential applications of population-based structural health monitoring.
Abstract: Despite recent advances in population-based structural health monitoring (PBSHM), knowledge transfer between highly-disparate structures (i.e., heterogeneous populations) remains a challenge. The current work proposes that heterogeneous transfer may be accomplished via intermediate structures that bridge the gap in information between the structures of interest. A key aspect of the technique is the idea that by varying parameters such as material properties and geometry, one structure can be continuously morphed into another. The approach is demonstrated via a case study involving the parameterisation of (and transfer between) simulated heterogeneous bridge designs (Case 1). Transfer between simplified physical representations of a ‘bridge’ and ‘aeroplane’ is then demonstrated in Case 2, via a chain of finite-element models. The facetious question ‘When is a bridge not an aeroplane?’ has been previously asked in the context of predicting positive transfer based on structural similarity. While the obvious answer to this question is ‘Always,’ the results presented in the current paper show that, in some cases, positive transfer can indeed be achieved between highly-disparate systems.
[497] Generalizing Dynamics Modeling More Easily from Representation Perspective
Yiming Wang, Zhengnan Zhang, Genghe Zhang, Jiawen Dan, Changchun Li, Chenlong Hu, Chris Nugent, Jun Liu, Ximing Li, Bo Yang
Main category: cs.LG
TL;DR: PDEDER is a pre-trained dynamics encoder that learns generalized latent representations for complex system dynamics, enabling better forecasting across different systems.
Details
Motivation: Existing neural dynamics modeling methods create specific models for each observation system, lacking generalization across different complex systems. There's a need for a generalized approach that can learn dynamics from various systems and transfer knowledge effectively.
Method: Pre-trained Dynamics EncoDER (PDEDER) uses pre-trained language models to embed state observations into a latent space. It minimizes a Lyapunov exponent objective to constrain chaotic behavior, incorporates reconstruction and forecasting objectives to prevent over-smoothing, and is pre-trained on 152 datasets from 23 complex systems.
Result: PDEDER shows effectiveness and generalizability when evaluated on 12 dynamic systems for short/long-term forecasting under both in-domain and cross-domain settings.
Conclusion: PDEDER provides a generalized pre-trained framework for dynamics modeling that can be fine-tuned for specific systems, offering improved generalization across different complex systems compared to system-specific approaches.
Abstract: Learning system dynamics from observations is a critical problem in many applications over various real-world complex systems, e.g., climate, ecology, and fluid systems. Recently, neural dynamics modeling methods have become a prevalent solution, embedding the object’s observations into a latent space before learning dynamics using neural methods such as neural Ordinary Differential Equations (ODEs). Existing dynamics modeling methods induce a specific model for each observation of different complex systems, resulting in poor generalization across systems. Inspired by the great success of pre-trained models, we construct a generalized Pre-trained Dynamics EncoDER (PDEDER) which can embed the original state observations into a latent space where the dynamics can be captured more easily. To construct the generalized PDEDER, we pre-train any Pre-trained Language Model (PLM) by minimizing the Lyapunov exponent objective, which constrains the chaotic behavior of governing dynamics learned in the latent space. By penalizing the divergence of embedded observations, our PDEDER promotes locally stable and well-structured latent dynamics, thereby facilitating more effective dynamics modeling than in the original observation space. In addition, we incorporate reconstruction and forecasting objectives to mitigate the risk of obtaining an over-smoothed latent space. Specifically, we collect 152 sets of real-world and synthetic observations from 23 complex systems as pre-training corpora and employ them to pre-train PDEDER. Given any future dynamic observation, we can fine-tune PDEDER with any specific dynamics modeling method. We evaluate PDEDER on 12 dynamic systems by short/long-term forecasting under both in-domain and cross-domain settings, and the empirical results indicate the effectiveness and generalizability of PDEDER.
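The Lyapunov exponent objective penalizes divergence of nearby latent trajectories. As a self-contained illustration (on the classic logistic map rather than PDEDER's latent space), the largest exponent can be estimated by averaging log-derivatives along an orbit; positive values signal chaotic divergence, negative values signal the local stability the objective encourages:

```python
import numpy as np

def lyapunov_logistic(r, n_iter=10000, x0=0.4):
    """Largest Lyapunov exponent of the logistic map x -> r x (1 - x),
    estimated as the orbit average of log|f'(x)|."""
    x = x0
    total = 0.0
    for _ in range(n_iter):
        x = r * x * (1.0 - x)
        total += np.log(abs(r * (1.0 - 2.0 * x)) + 1e-12)
    return total / n_iter

chaotic = lyapunov_logistic(4.0)   # fully chaotic regime, exponent near ln 2
stable = lyapunov_logistic(2.5)    # fixed-point regime, negative exponent
```

Minimizing such an exponent over the encoder's latent dynamics is what pushes the learned space toward locally stable, well-structured trajectories.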
[498] Bounding Box Anomaly Scoring for simple and efficient Out-of-Distribution detection
Mohamed Bahi Yahiaoui, Geoffrey Daniel, Loïc Giraldi, Jérémie Bruyelle, Julyan Arbel
Main category: cs.LG
TL;DR: BBAS is a post-hoc OOD detection method using bounding-box abstraction to represent in-distribution support through axis-aligned summaries of neural network activations, combining interval exceedance scores, convolutional layer monitoring, and decoupled clustering.
Details
Motivation: Existing OOD detection methods face a trade-off between compact parametric models (like Mahalanobis) and flexible reference-based methods (like k-NN). Bounding-box abstraction offers an intermediate approach that maintains simplicity and compactness while providing robust detection.
Method: BBAS uses bounding-box abstraction to represent in-distribution support through axis-aligned summaries of hidden activations. It combines graded anomaly scores based on interval exceedances, monitoring variables adapted to convolutional layers, and decoupled clustering and box construction for multi-layer representations.
Result: Experiments on image-classification benchmarks show BBAS provides robust separation between in-distribution and out-of-distribution samples while preserving the simplicity, compactness, and updateability of the bounding-box approach.
Conclusion: BBAS offers an effective intermediate solution for OOD detection that balances compactness and flexibility, maintaining the advantages of bounding-box abstraction while improving detection performance.
Abstract: Out-of-distribution (OOD) detection aims to identify inputs that differ from the training distribution in order to reduce unreliable predictions by deep neural networks. Among post-hoc feature-space approaches, OOD detection is commonly performed by approximating the in-distribution support in the representation space of a pretrained network. Existing methods often reflect a trade-off between compact parametric models, such as Mahalanobis-based scores, and more flexible but reference-based methods, such as k-nearest neighbors. Bounding-box abstraction provides an attractive intermediate perspective by representing in-distribution support through compact axis-aligned summaries of hidden activations. In this paper, we introduce Bounding Box Anomaly Scoring (BBAS), a post-hoc OOD detection method that leverages bounding-box abstraction. BBAS combines graded anomaly scores based on interval exceedances, monitoring variables adapted to convolutional layers, and decoupled clustering and box construction for richer and multi-layer representations. Experiments on image-classification benchmarks show that BBAS provides robust separation between in-distribution and out-of-distribution samples while preserving the simplicity, compactness, and updateability of the bounding-box approach.
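The bounding-box abstraction and interval-exceedance score can be sketched in a few lines. This toy version fits one axis-aligned box per class over feature activations and scores a sample by its distance outside the nearest box; the full BBAS method additionally uses clustering, convolution-adapted monitoring variables, and multiple layers:

```python
import numpy as np

def fit_boxes(features, labels):
    """Per-class axis-aligned bounding boxes over training activations."""
    boxes = {}
    for c in np.unique(labels):
        f = features[labels == c]
        boxes[c] = (f.min(axis=0), f.max(axis=0))
    return boxes

def exceedance_score(x, boxes):
    """Graded anomaly score: summed distance outside the nearest box.
    Zero means x lies inside at least one class box."""
    scores = []
    for lo, hi in boxes.values():
        below = np.maximum(lo - x, 0.0)
        above = np.maximum(x - hi, 0.0)
        scores.append(np.sum(below + above))
    return min(scores)

rng = np.random.default_rng(0)
feats = rng.uniform(-1.0, 1.0, size=(100, 5))
labels = rng.integers(0, 2, size=100)
boxes = fit_boxes(feats, labels)

in_dist = exceedance_score(np.zeros(5), boxes)    # inside the boxes
ood = exceedance_score(np.full(5, 10.0), boxes)   # far outside every box
```

The abstraction is compact (two vectors per box) and trivially updateable: extending a box with new in-distribution data is an elementwise min/max.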
[499] Vision-based Deep Learning Analysis of Unordered Biomedical Tabular Datasets via Optimal Spatial Cartography
Sakib Mostafa, Tarik Massoud, Maximilian Diehn, Lei Xing, Md Tauhidul Islam
Main category: cs.LG
TL;DR: Dynomap transforms tabular biomedical data into learned spatial feature maps, enabling vision architectures to process unordered tabular data and discover clinically relevant patterns.
Details
Motivation: Tabular biomedical data lacks intrinsic spatial organization, limiting vision architectures' ability to exploit local structure and feature interactions. Existing approaches treat features as unordered dimensions, requiring models to implicitly infer relationships.
Method: Dynamic Feature Mapping (Dynomap) is an end-to-end framework that learns task-optimized spatial topology of features directly from data. It jointly optimizes feature placement and prediction through a fully differentiable rendering mechanism without relying on heuristics, predefined groupings, or external priors.
Result: Dynomap consistently outperformed classical ML, modern deep tabular models, and existing vector-to-image approaches across multiple biomedical datasets. It improved cancer subtype prediction accuracy by up to 18% and Parkinson disease voice classification by up to 8%, while organizing clinically relevant features into interpretable spatial patterns.
Conclusion: Dynomap establishes a general strategy for bridging tabular and vision-based deep learning, enabling effective use of vision architectures on unordered biomedical data and uncovering structured, clinically relevant patterns.
Abstract: Tabular data are central to biomedical research, from liquid biopsy and bulk and single-cell transcriptomics to electronic health records and phenotypic profiling. Unlike images or sequences, however, tabular datasets lack intrinsic spatial organization: features are treated as unordered dimensions, and their relationships must be inferred implicitly by the model. This limits the ability of vision architectures to exploit local structure and higher-order feature interactions in non-spatial biomedical data. Here we introduce Dynamic Feature Mapping (Dynomap), an end-to-end deep learning framework that learns a task-optimized spatial topology of features directly from data. Dynomap jointly optimizes feature placement and prediction through a fully differentiable rendering mechanism, without relying on heuristics, predefined groupings, or external priors. By transforming high-dimensional tabular vectors into learned feature maps, Dynomap enables vision-based models to operate effectively on unordered biomedical inputs. Across multiple clinical and biological datasets, Dynomap consistently outperformed classical machine learning, modern deep tabular models, and existing vector-to-image approaches. In liquid biopsy data, Dynomap organized clinically relevant gene signatures into coherent spatial patterns and improved multiclass cancer subtype prediction accuracy by up to 18%. In a Parkinson disease voice dataset, it clustered disease-associated acoustic descriptors and improved accuracy by up to 8%. Similar gains and interpretable feature organization were observed in additional biomedical datasets. These results establish Dynomap as a general strategy for bridging tabular and vision-based deep learning and for uncovering structured, clinically relevant patterns in high-dimensional biomedical data.
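The rendering step can be sketched as depositing each feature's value as a Gaussian bump at a per-feature 2D coordinate. In Dynomap those coordinates are learned jointly with the predictor through the differentiable renderer; this NumPy toy simply fixes them by hand to show how related features placed nearby produce coherent local structure:

```python
import numpy as np

def render_feature_map(values, coords, size=8, sigma=1.0):
    """Render a tabular feature vector into a 2D map: each feature deposits
    its value as a Gaussian bump centered at its grid coordinate."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    img = np.zeros((size, size))
    for v, (cy, cx) in zip(values, coords):
        img += v * np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return img

values = np.array([1.0, -0.5, 2.0])
coords = np.array([[1.0, 1.0], [6.0, 6.0], [1.5, 1.5]])  # features 0 and 2 nearby
img = render_feature_map(values, coords)
```

Because the Gaussian kernel is differentiable in `coords`, gradients from the downstream vision model can move features around the canvas during training.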
[500] Coordinate Encoding on Linear Grids for Physics-Informed Neural Networks
Tetsuro Tsuchino, Motoki Shiga
Main category: cs.LG
TL;DR: A PINN-based method with coordinate-encoding layer on linear grid cells improves training convergence speed by separating local domains and reducing computational cost through axis-independent linear grid cells.
Details
Motivation: PINNs suffer from slow convergence due to spectral bias problems, which hinders their effectiveness in solving PDEs despite advantages like mesh-free solutions and unsupervised learning.
Method: Proposes a PINN-based method with coordinate-encoding layer on linear grid cells that separates local domains using grid cells, uses axis-independent linear grid cells to reduce computational cost, and employs natural cubic splines for interpolation between grid points to ensure continuous derivative functions.
Result: Numerical experiments demonstrate effective performance and efficient training convergence speed of the proposed method.
Conclusion: The proposed method successfully addresses PINN’s convergence issues through grid-based domain separation and efficient interpolation techniques, enabling faster and more stable training for PDE solutions.
Abstract: In solving partial differential equations (PDEs), machine learning utilizing physical laws has received considerable attention owing to advantages such as mesh-free solutions, unsupervised learning, and feasibility for solving high-dimensional problems. An effective approach is based on physics-informed neural networks (PINNs), which build on deep neural networks known for their excellent performance in various academic and industrial applications. However, PINNs often struggle with model training owing to significantly slow convergence caused by the spectral bias problem. In this study, we propose a PINN-based method equipped with a coordinate-encoding layer on linear grid cells. The proposed method improves the training convergence speed by separating the local domains using grid cells. Moreover, it reduces the overall computational cost by using axis-independent linear grid cells. The method also achieves efficient and stable model training by adequately interpolating the encoded coordinates between grid points using natural cubic splines, which guarantees continuous derivative functions of the model computed for the loss functions. The results of numerical experiments demonstrate the effective performance and efficient training convergence speed of the proposed method.
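A per-axis linear grid encoding can be sketched with hat functions: each scalar coordinate activates at most two adjacent cells with weights summing to one, localizing the network's input representation. The paper further smooths the interpolation with natural cubic splines so derivatives stay continuous for the PDE loss; this sketch shows only the linear case:

```python
import numpy as np

def linear_grid_encode(x, grid):
    """Encode a scalar coordinate as hat-function weights over a 1D grid:
    at most two adjacent cells are nonzero and the weights sum to 1."""
    enc = np.zeros(len(grid))
    i = np.clip(np.searchsorted(grid, x) - 1, 0, len(grid) - 2)
    t = (x - grid[i]) / (grid[i + 1] - grid[i])
    enc[i] = 1.0 - t
    enc[i + 1] = t
    return enc

grid = np.linspace(0.0, 1.0, 5)       # grid points at 0, 0.25, 0.5, 0.75, 1
enc = linear_grid_encode(0.3, grid)   # activates cells at 0.25 and 0.5
```

Encoding each axis independently keeps the cost linear in dimension, rather than exponential as with a full tensor-product grid.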
[501] Non-Adversarial Imitation Learning Provably Free of Compounding Errors: The Role of Bellman Constraints
Tian Xu, Chenyang Wang, Xiaochen Zhai, Ziniu Li, Yi-Chen Li, Yang Yu
Main category: cs.LG
TL;DR: Dual Q-DM: A new non-adversarial imitation learning method that eliminates compounding errors through a primal-dual framework with Bellman constraints, outperforming both behavioral cloning and IQ-Learn.
Details
Motivation: Existing non-adversarial imitation learning methods like IQ-Learn were believed to outperform behavioral cloning by using online interactions, but this paper reveals they still suffer from compounding errors and fail to generalize beyond demonstrations.
Method: Introduces Dual Q-DM, a primal-dual framework for distribution matching that incorporates Bellman constraints to propagate high Q-values from visited states to unvisited ones, enabling generalization beyond demonstrations.
Result: Theoretical proof shows Dual Q-DM is equivalent to adversarial imitation learning and can recover expert actions beyond demonstrations, eliminating compounding errors. Experimental results confirm theoretical findings.
Conclusion: Dual Q-DM is the first non-adversarial imitation learning method theoretically guaranteed to eliminate compounding errors, providing a stable alternative to adversarial approaches while maintaining strong generalization capabilities.
Abstract: Adversarial imitation learning (AIL) achieves high-quality imitation by mitigating compounding errors in behavioral cloning (BC), but often exhibits training instability due to adversarial optimization. To avoid this issue, a class of non-adversarial Q-based imitation learning (IL) methods, represented by IQ-Learn, has emerged and is widely believed to outperform BC by leveraging online environment interactions. However, this paper revisits IQ-Learn and demonstrates that it provably reduces to BC and suffers from an imitation gap lower bound with quadratic dependence on horizon, therefore still suffering from compounding errors. Theoretical analysis reveals that, despite using online interactions, IQ-Learn uniformly suppresses the Q-values for all actions on states uncovered by demonstrations, thereby failing to generalize. To address this limitation, we introduce a primal-dual framework for distribution matching, yielding a new Q-based IL method, Dual Q-DM. The key mechanism in Dual Q-DM is incorporating Bellman constraints to propagate high Q-values from visited states to unvisited ones, thereby achieving generalization beyond demonstrations. We prove that Dual Q-DM is equivalent to AIL and can recover expert actions beyond demonstrations, thereby mitigating compounding errors. To the best of our knowledge, Dual Q-DM is the first non-adversarial IL method that is theoretically guaranteed to eliminate compounding errors. Experimental results further corroborate our theoretical results.
[502] Double Coupling Architecture and Training Method for Optimization Problems of Differential Algebraic Equations with Parameters
Wenqiang Yang, Wenyuan Wu, Yong Feng, Changbo Chen
Main category: cs.LG
TL;DR: A dual physics-informed neural network architecture for multi-task optimization of parametric differential algebraic equations in product development, enhanced with genetic algorithm training.
Details
Motivation: Product development requires efficient simulation and modeling via complex nonlinear differential algebraic equations. Growing product diversity demands multi-task optimization, which is challenging for traditional simulation modeling approaches.
Method: Proposes a dual physics-informed neural network architecture that decouples constraints and objective functions in parametric differential algebraic equation optimization. Uses relaxation variables with global error bounds for solution equivalence, and employs a genetic algorithm-enhanced training framework to improve precision and efficiency.
Result: The approach enables generalization for multi-task objectives with single training while maintaining real-time responsiveness to product requirements, avoiding redundant solving of differential algebraic equations.
Conclusion: The proposed method effectively addresses multi-task optimization challenges in simulation modeling for product development through innovative neural network architecture and training enhancements.
Abstract: Simulation and modeling are essential in product development, integrated into the design and manufacturing process to enhance efficiency and quality. They are typically represented as complex nonlinear differential algebraic equations. The growing diversity of product requirements demands multi-task optimization, a key challenge in simulation modeling research. A dual physics-informed neural network architecture has been proposed to decouple constraints and objective functions in parametric differential algebraic equation optimization problems. Theoretical analysis shows that introducing a relaxation variable with a global error bound ensures solution equivalence between the network and the optimization problem. A genetic algorithm-enhanced training framework for physics-informed neural networks improves training precision and efficiency, avoiding redundant solving of differential algebraic equations. This approach enables generalization across multi-task objectives with a single training run, maintaining real-time responsiveness to product requirements.
[503] Spiking Personalized Federated Learning for Brain-Computer Interface-Enabled Immersive Communication
Chen Shang, Dinh Thai Hoang, Diep N. Nguyen, Jiadong Yu
Main category: cs.LG
TL;DR: A brain-computer interface framework using personalized federated learning with spiking neural networks for energy-efficient immersive communication adaptation based on brain signals.
Details
Motivation: To create more personalized and robust immersive communication systems that can adapt to individual users' intentions and discomfort by analyzing brain signals, while addressing privacy concerns and energy limitations of on-device learning.
Method: Combines brain-computer interface for brain signal acquisition with personalized federated learning to handle neurodiverse data and preserve privacy, and embeds spiking neural networks for energy-efficient computation through event-driven spike processing.
Result: Achieves best overall identification accuracy on real brain-signal datasets while reducing inference energy by 6.46× compared to conventional artificial neural network-based personalized baselines.
Conclusion: The proposed framework successfully enables personalized immersive adaptation using brain signals while addressing privacy, computational efficiency, and energy constraints through the integration of BCI, PFL, and SNNs.
Abstract: This work proposes a novel immersive communication framework that leverages a brain-computer interface (BCI) to acquire brain signals for inferring user-centric states (e.g., intention and perception-related discomfort), thereby enabling more personalized and robust immersive adaptation under strong individual variability. Specifically, we develop a personalized federated learning (PFL) model to analyze and process the collected brain signals, which not only accommodates neurodiverse brain-signal data but also prevents the leakage of sensitive brain-signal information. To address the energy bottleneck of continual on-device learning and inference on energy-limited immersive terminals (e.g., head-mounted displays), we further embed spiking neural networks (SNNs) into the PFL. By exploiting sparse, event-driven spike computation, the SNN-enabled PFL reduces the computation and energy cost of training and inference while maintaining competitive personalization performance. Experiments on a real brain-signal dataset demonstrate that our method achieves the best overall identification accuracy while reducing inference energy by 6.46$\times$ compared with conventional artificial neural network-based personalized baselines.
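The event-driven sparsity that makes SNNs energy-efficient can be illustrated with a single leaky integrate-and-fire neuron: weak inputs never cross threshold and produce no events at all, so downstream computation can simply be skipped. This is a toy neuron model, not the paper's network:

```python
import numpy as np

def lif_spikes(inputs, tau=10.0, v_thresh=1.0, v_reset=0.0):
    """Leaky integrate-and-fire neuron: emits a spike (event) only when the
    membrane potential crosses threshold, then resets."""
    v = 0.0
    spikes = []
    for x in inputs:
        v += (x - v) / tau          # leaky integration toward the input
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return np.array(spikes)

weak = lif_spikes(np.full(100, 0.5))    # settles below threshold: no events
strong = lif_spikes(np.full(100, 2.0))  # crosses threshold: periodic spikes
```

Because energy cost scales with the number of spikes rather than the number of timesteps, quiet input channels are nearly free, which is the property the PFL framework exploits on battery-limited headsets.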
[504] Sparser, Faster, Lighter Transformer Language Models
Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones
Main category: cs.LG
TL;DR: Sparse LLMs: Using L1 regularization to achieve over 99% sparsity in feedforward layers with new sparse packing format and CUDA kernels for efficient inference/training.
Details
Motivation: Autoregressive LLMs have high computational costs, especially in feedforward layers which account for most parameters and FLOPs. Need to reduce these costs through sparsity while maintaining performance.
Method: Introduce new sparse packing format and CUDA kernels for efficient sparse computation. Use L1 regularization to induce over 99% sparsity in feedforward layers while minimizing impact on downstream performance.
Result: Achieves substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. Maintains performance despite high sparsity levels.
Conclusion: Sparsity can be a practical axis for improving efficiency and scalability of foundation models. Open-source release to promote adoption.
Abstract: Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM’s feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
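The mechanism by which L1 regularization produces exact zeros can be sketched with proximal gradient descent (soft-thresholding) on a toy linear model. The paper applies the same idea to LLM feedforward weights at vastly larger scale; everything below is an illustrative assumption-free small example:

```python
import numpy as np

def l1_prox(W, lam):
    """Soft-thresholding: the proximal step L1 regularization adds to plain
    gradient descent, driving small weights exactly to zero."""
    return np.sign(W) * np.maximum(np.abs(W) - lam, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50); w_true[:3] = [2.0, -1.5, 1.0]   # truly sparse signal
y = X @ w_true + 0.01 * rng.normal(size=200)

W = np.zeros(50)
lr = 0.5
for _ in range(300):
    grad = X.T @ (X @ W - y) / len(y)
    W = l1_prox(W - lr * grad, lr * 0.1)               # lambda = 0.1

sparsity = np.mean(W == 0.0)    # fraction of exactly-zero weights
```

The zeros are exact, not merely small, which is what allows a sparse packing format to skip them entirely at kernel level.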
[505] Multitask-Informed Prior for In-Context Learning on Tabular Data: Application to Steel Property Prediction
Dimitrios Sinodinos, Bahareh Nikpour, Jack Yi Wei, Sushant Sinha, Xiaoping Ma, Kashif Rehman, Stephen Yue, Narges Armanfard
Main category: cs.LG
TL;DR: Multitask learning framework for TabPFN transformer to predict multiple steel mechanical properties simultaneously in hot rolling processes, outperforming traditional methods.
Details
Motivation: Predicting steel mechanical properties in hot rolling processes is challenging due to complex interactions. Traditional methods are resource-intensive and lack adaptability, and existing approaches don't leverage correlations among properties through multitask learning.
Method: Augment TabPFN (transformer-based foundation model for tabular data) with two multitask fine-tuning strategies: target averaging (unified scalar signal) and task-specific adapters (task-specific supervision) to create multitask-informed prior capturing cross-property relationships.
Result: Outperforms classical ML methods and state-of-the-art tabular learning models on industrial TSDR dataset across multiple metrics, enhancing both predictive accuracy and computational efficiency compared to task-specific fine-tuning.
Conclusion: Multitask-aware prior adaptation enables foundation models for tabular data to deliver scalable, rapid, and reliable deployment for automated industrial quality control and process optimization in TSDR.
Abstract: Accurate prediction of mechanical properties of steel during hot rolling processes, such as Thin Slab Direct Rolling (TSDR), remains challenging due to complex interactions among chemical compositions, processing parameters, and resultant microstructures. Traditional empirical and experimental methodologies, while effective, are often resource-intensive and lack adaptability to varied production conditions. Moreover, most existing approaches do not explicitly leverage the strong correlations among key mechanical properties, missing an opportunity to improve predictive accuracy through multitask learning. To address this, we present a multitask learning framework that injects multitask awareness into the prior of TabPFN–a transformer-based foundation model for in-context learning on tabular data–through novel fine-tuning strategies. Originally designed for single-target regression or classification, we augment TabPFN’s prior with two complementary approaches: (i) target averaging, which provides a unified scalar signal compatible with TabPFN’s single-target architecture, and (ii) task-specific adapters, which introduce task-specific supervision during fine-tuning. These strategies jointly guide the model toward a multitask-informed prior that captures cross-property relationships among key mechanical metrics. Extensive experiments on an industrial TSDR dataset demonstrate that our multitask adaptations outperform classical machine learning methods and recent state-of-the-art tabular learning models across multiple evaluation metrics. Notably, our approach enhances both predictive accuracy and computational efficiency compared to task-specific fine-tuning, demonstrating that multitask-aware prior adaptation enables foundation models for tabular data to deliver scalable, rapid, and reliable deployment for automated industrial quality control and process optimization in TSDR.
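The target-averaging strategy can be sketched in a few lines: z-normalize each property column and collapse them into one scalar target compatible with a single-target model. This is a minimal illustration of the general idea, not the paper's implementation; the property names and data are hypothetical:

```python
import numpy as np

def target_average(Y):
    # Z-normalize each property column, then average the z-scores into a
    # single scalar target a single-target regressor (like TabPFN) can fit.
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Z.mean(axis=1)

rng = np.random.default_rng(0)
# Hypothetical correlated steel properties (columns: e.g. yield strength,
# tensile strength, elongation) for 100 samples.
base = rng.normal(size=(100, 1))
Y = np.hstack([base * s + rng.normal(scale=0.3, size=(100, 1))
               for s in (1.0, 1.2, -0.8)])
y_avg = target_average(Y)
```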
[506] Off-Policy Value-Based Reinforcement Learning for Large Language Models
Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu, ChenYang Wang, Xiong-Hui Chen, Yi-Chen Li, Tianyun Yang, Congliang Chen, Yang Yu
Main category: cs.LG
TL;DR: ReVal: A value-based RL method for LLMs that enables off-policy learning with replay buffers, improving data efficiency for long-horizon tasks.
Details
Motivation: Current RL methods for LLMs are largely on-policy, requiring fresh samples for each update and resulting in poor sample efficiency, which is problematic for expensive trajectory generation in long-horizon tasks.
Method: Proposes ReVal, a Bellman-update-based value-based RL framework that combines stepwise signals (internal consistency) with trajectory-level signals (outcome verification), enabling replay-buffer-based training for efficient data reuse.
Result: ReVal converges faster and outperforms GRPO on mathematical reasoning benchmarks, achieving 2.7% improvement on AIME24 and 4.5% on GPQA with DeepSeek-R1-Distill-1.5B.
Conclusion: Value-based RL is a practical alternative to policy-based methods for LLM training, offering better data efficiency through off-policy learning with replay buffers.
Abstract: Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvements of 2.7% on AIME24 and 4.5% on the out-of-domain benchmark GPQA over GRPO. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.
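The core off-policy ingredient, a replay buffer feeding Bellman updates, can be sketched in tabular form. This is a textbook TD(0)-style illustration of why replay enables data reuse, not ReVal itself (which operates over LLM trajectories and combines stepwise with outcome signals); states and rewards here are toy values:

```python
import random
from collections import deque

class ReplayBuffer:
    # Stores past transitions so each expensive trajectory can be reused
    # many times, instead of being discarded after one on-policy update.
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
    def push(self, transition):
        self.buf.append(transition)
    def sample(self, k):
        return random.sample(list(self.buf), min(k, len(self.buf)))

def bellman_update(V, batch, gamma=0.99, lr=0.1):
    # Tabular TD(0) step: regress V(s) toward r + gamma * V(s').
    for s, r, s_next, done in batch:
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
        V[s] = V.get(s, 0.0) + lr * (target - V.get(s, 0.0))

random.seed(0)
buf = ReplayBuffer(capacity=1000)
buf.push(("start", 0.0, "step", False))   # intermediate step, no reward
buf.push(("step", 1.0, None, True))       # verified final answer, reward 1
V = {}
for _ in range(300):                       # reuse the same two transitions
    bellman_update(V, buf.sample(2))
```

Value estimates propagate backward through repeated replay, which is exactly the data reuse that pure on-policy methods forgo.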
[507] From Arithmetic to Logic: The Resilience of Logic and Lookup-Based Neural Networks Under Parameter Bit-Flips
Alan T. L. Bacellar, Sathvik Chemudupati, Shashank Nag, Allison Seigler, Priscila M. V. Lima, Felipe M. G. França, Lizy K. John
Main category: cs.LG
TL;DR: Theoretical analysis shows neural networks with lower precision, higher sparsity, bounded activations, and shallow depth are more resilient to hardware bit-flip errors, with logic/lookup-based architectures offering optimal accuracy-resilience trade-offs.
Details
Motivation: Deep neural networks deployed in safety-critical edge environments need robustness against hardware-induced bit-flip errors. While empirical studies suggest lower precision improves fault tolerance, the theoretical basis remains underexplored.
Method: Derived expected squared error (MSE) under independent parameter bit flips across multiple numerical formats and layer primitives. Analyzed structural properties like precision, sparsity, activation bounds, and depth. Conducted ablation studies on MLPerf Tiny benchmark suite comparing standard floating-point models with LUT-based models.
Result: Lower precision, higher sparsity, bounded activations, and shallow depth consistently improve resilience to bit-flip errors. LUT-based models remain highly stable in corruption regimes where standard floating-point models fail sharply. Identified novel even-layer recovery effect unique to logic-based architectures.
Conclusion: Shifting from continuous arithmetic weights to discrete Boolean lookups provides favorable accuracy-resilience trade-off for hardware fault tolerance. Resilience should be considered as a structural property of neural architectures rather than solely as a property of dataset-specific trained solutions.
Abstract: The deployment of deep neural networks (DNNs) in safety-critical edge environments necessitates robustness against hardware-induced bit-flip errors. While empirical studies indicate that reducing numerical precision can improve fault tolerance, the theoretical basis of this phenomenon remains underexplored. In this work, we study resilience as a structural property of neural architectures rather than solely as a property of a dataset-specific trained solution. By deriving the expected squared error (MSE) under independent parameter bit flips across multiple numerical formats and layer primitives, we show that lower precision, higher sparsity, bounded activations, and shallow depth are consistently favored under this corruption model. We then argue that logic and lookup-based neural networks realize the joint limit of these design trends. Through ablation studies on the MLPerf Tiny benchmark suite, we show that the observed empirical trends are consistent with the theoretical predictions, and that LUT-based models remain highly stable in corruption regimes where standard floating-point models fail sharply. Furthermore, we identify a novel even-layer recovery effect unique to logic-based architectures and analyze the structural conditions under which it emerges. Overall, our results suggest that shifting from continuous arithmetic weights to discrete Boolean lookups can provide a favorable accuracy-resilience trade-off for hardware fault tolerance.
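The precision effect is easy to reproduce empirically: flipping raw bits of float32 weights occasionally hits exponent bits and produces astronomical errors, while an int8 flip is bounded by the quantization scale. This is a self-contained NumPy simulation under the paper's independent-bit-flip corruption model, not the authors' derivation; the flip probability and weight distribution are arbitrary:

```python
import numpy as np

def flip_bits(raw, n_bits, p, rng):
    # Flip each bit of each element independently with probability p.
    out = raw.copy()
    for b in range(n_bits):
        mask = rng.random(raw.size) < p
        out[mask] ^= np.array(1 << b, dtype=out.dtype)
    return out

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=10_000).astype(np.float32)
p = 1e-3

# float32: corrupt the raw IEEE-754 bits (exponent flips are catastrophic).
f32 = flip_bits(w.view(np.uint32), 32, p, rng).view(np.float32)
f32 = np.nan_to_num(f32, nan=0.0, posinf=0.0, neginf=0.0)
mse_f32 = float(np.mean((f32 - w) ** 2))

# int8: quantize first; any single flip moves the value by at most 2^7 steps.
scale = float(np.abs(w).max()) / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
q8 = flip_bits(q.view(np.uint8), 8, p, rng).view(np.int8)
mse_int8 = float(np.mean((q8.astype(np.float32) * scale - w) ** 2))
```

The MSE gap between the two formats is several orders of magnitude, consistent with the paper's claim that lower precision is favored under this corruption model.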
[508] Caterpillar of Thoughts: The Optimal Test-Time Algorithm for Large Language Models
Amir Azarmehr, Soheil Behnezhad, Alma Ghafari
Main category: cs.LG
TL;DR: Theoretical analysis of test-time computation for LLMs modeled as algorithms interacting with Markov chains with backtracking, showing optimal algorithms generate caterpillar trees and proposing Caterpillar of Thoughts (CaT) as an efficient implementation.
Details
Motivation: While empirical success exists for test-time computation techniques like sampling, chain of thought, and backtracking in LLMs, there's limited theoretical understanding of optimal inference-time computation structure and how to best use fixed computation budgets.
Method: Model test-time computation as algorithms interacting with Markov chains where the algorithm can backtrack to any previously observed state. Analyze optimal algorithms theoretically, prove they generate caterpillar trees, and propose Caterpillar of Thoughts (CaT) as a practical implementation.
Result: Proved that while backtracking can reduce generations exponentially, limited backtracking is sufficient, with optimal algorithms always generating caterpillar trees. Empirical evaluation shows CaT achieves better success rates than Tree-of-Thoughts while reducing token generations.
Conclusion: Theoretical framework provides understanding of optimal test-time computation for LLMs, with Caterpillar of Thoughts offering practical efficiency improvements over existing methods like Tree-of-Thoughts.
Abstract: Large language models (LLMs) can often produce substantially better outputs when allowed to use additional test-time computation, such as sampling, chain of thought, backtracking, or revising partial solutions. Despite the growing empirical success of such techniques, there is limited theoretical understanding of how inference time computation should be structured, or what constitutes an optimal use of a fixed computation budget. We model test-time computation as an algorithm interacting with a Markov chain: at any point, the algorithm may resume generation from any previously observed state. That is, unlike standard Markov chains where the states are drawn passively, we allow the algorithm to backtrack to any previously observed state of the Markov chain at any time. Many of the existing test-time algorithms, such as Chain-of-Thought (CoT) (Wei et al., 2023), Tree-of-Thoughts (ToT) (Yao et al., 2023), or Best-of-$k$ (Brown et al., 2024) could be seen as specific algorithms in this model. We prove that while backtracking can reduce the number of generations exponentially, a very limited form of backtracking is theoretically sufficient. Namely, we show that the optimal algorithm always generates a caterpillar tree. That is, if we remove the leaves of the state tree generated by the optimal algorithm, we obtain a path. Motivated by our characterization of the optimal algorithm, we present Caterpillar of Thoughts (CaT), a new test-time computation algorithm, reducing the number of token/state generations. Our empirical evaluation shows that CaT, compared to ToT, achieves a better success rate while also reducing the number of token generations.
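The caterpillar characterization (remove the leaves of the state tree and a path remains) can be made concrete with a small checker. This is an illustrative predicate over explicit trees, not the CaT algorithm; the adjacency representation and examples are mine:

```python
def tree_from_edges(edges):
    # Build an undirected adjacency map {node: set(neighbors)}.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_caterpillar(adj):
    # A tree is a caterpillar iff deleting all leaves leaves a simple path.
    leaves = {u for u, nb in adj.items() if len(nb) <= 1}
    spine = {u: {v for v in nb if v not in leaves}
             for u, nb in adj.items() if u not in leaves}
    if not spine:
        return True  # stars and single edges count as caterpillars
    if any(len(nb) > 2 for nb in spine.values()):
        return False  # a spine vertex with 3+ spine neighbors breaks the path
    seen, stack = set(), [next(iter(spine))]
    while stack:  # the spine must also be connected
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(spine[u])
    return len(seen) == len(spine)
```

A path with leaf "thoughts" hanging off it passes; a tree that branches into several long exploration arms (as Tree-of-Thoughts may produce) fails, which is what the optimality result rules out.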
[509] Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models
Chenyang Zhang, Qingyue Zhao, Quanquan Gu, Yuan Cao
Main category: cs.LG
TL;DR: Transformers can theoretically learn from various teacher models (convolution layers, graph convolutions, statistical models) using simplified attention, achieving optimal population loss and good out-of-distribution generalization.
Details
Motivation: To demystify the strong capabilities of transformers across diverse applications by providing theoretical foundations for their success in learning from various teacher models.
Method: Theoretical investigation of transformers as students learning from teacher models including convolution layers with average pooling, graph convolution layers, sparse token selection models, and group-sparse linear predictors. Analysis focuses on one-layer transformers with simplified “position-only” attention.
Result: Proved that transformers can successfully recover all parameter blocks of teacher models, achieving optimal population loss. Demonstrated good generalization to out-of-distribution data under mild assumptions by identifying a fundamental bilinear structure shared by various learning tasks.
Conclusion: Transformers have strong theoretical foundations for learning from diverse teacher models, with simplified attention mechanisms capable of achieving optimal performance and good generalization, providing unified learning guarantees across multiple tasks.
Abstract: Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying their success remain largely unexplored. To demystify the strong capacities of transformers applied to versatile scenarios and tasks, we theoretically investigate utilizing transformers as students to learn from a class of teacher models. Specifically, the teacher models covered in our analysis include convolution layers with average pooling, graph convolution layers, and various classic statistical learning models, including a variant of sparse token selection models [Sanford et al., 2023, Wang et al., 2024] and group-sparse linear predictors [Zhang et al., 2025]. When learning from this class of teacher models, we prove that one-layer transformers with simplified “position-only” attention can successfully recover all parameter blocks of the teacher models, thus achieving the optimal population loss. Building upon the efficient mimicry of trained transformers towards teacher models, we further demonstrate that they can generalize well to a broad class of out-of-distribution data under mild assumptions. The key in our analysis is to identify a fundamental bilinear structure shared by various learning tasks, which enables us to establish unified learning guarantees for these tasks when treating them as teachers for transformers.
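"Position-only" attention is simple enough to write down: the attention scores come from a position matrix alone and never from token content. The sketch below is my own NumPy illustration of that simplification, not the paper's construction; choosing uniform position scores recovers global average pooling, one of the teacher operations the analysis covers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def position_only_attention(X, P, W_v):
    # Scores come from the position matrix P alone, never from X, so the
    # attention pattern is identical for every input sequence.
    A = softmax(P)            # (seq, seq) attention weights
    return A @ (X @ W_v)      # position-weighted mixture of value vectors

rng = np.random.default_rng(0)
seq, d = 6, 4
X = rng.normal(size=(seq, d))
W_v = np.eye(d)
P = np.zeros((seq, seq))      # uniform scores -> global average pooling
out = position_only_attention(X, P, W_v)
```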
[510] Universal and efficient graph neural networks with dynamic attention for machine learning interatomic potentials
Shuyu Bi, Zhede Zhao, Qiangchao Sun, Tao Hu, Xionggang Lu, Hongwei Cheng
Main category: cs.LG
TL;DR: MLANet: An efficient graph neural network framework for machine learning interatomic potentials that achieves near-quantum accuracy with lower computational cost than existing equivariant models.
Details
Motivation: Traditional empirical potentials lack accuracy while first-principles methods are computationally prohibitive. Existing machine learning interatomic potentials (MLIPs) face challenges in efficiency and stability for large-scale atomic simulations.
Method: Proposes MLANet with dual-path dynamic attention mechanism for geometry-aware message passing and multi-perspective pooling strategy to construct comprehensive system representations.
Result: Tested across diverse systems including organic molecules, periodic inorganic materials, 2D materials, surface catalytic reactions, and charged systems. Maintains competitive prediction accuracy with markedly lower computational cost than mainstream equivariant models, enabling stable long-time molecular dynamics simulations.
Conclusion: MLANet provides an efficient and practical tool for large-scale, high-accuracy atomic simulations, making high-fidelity simulations more accessible.
Abstract: The core of molecular dynamics simulation fundamentally lies in the interatomic potential. Traditional empirical potentials lack accuracy, while first-principles methods are computationally prohibitive. Machine learning interatomic potentials (MLIPs) promise near-quantum accuracy at linear cost, but existing models still face challenges in efficiency and stability. We present the Machine Learning Advances Neural Network (MLANet), an efficient and robust graph neural network framework. MLANet introduces a dual-path dynamic attention mechanism for geometry-aware message passing and a multi-perspective pooling strategy to construct comprehensive system representations. This design enables highly accurate modeling of atomic environments while achieving exceptional computational efficiency, making high-fidelity simulations more accessible. Tested across a wide range of datasets spanning diverse systems, including organic molecules (e.g., QM7, MD17), periodic inorganic materials (e.g., Li-containing crystals), two-dimensional materials (e.g., bilayer graphene, black phosphorus), surface catalytic reactions (e.g., formate decomposition), and charged systems, MLANet maintains competitive prediction accuracy while its computational cost is markedly lower than mainstream equivariant models, and it enables stable long-time molecular dynamics simulations. MLANet provides an efficient and practical tool for large-scale, high-accuracy atomic simulations.
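Geometry-aware message passing in MLIPs typically weights neighbor contributions by a smooth function of interatomic distance. The sketch below uses a standard cosine cutoff envelope as a generic stand-in; it is not MLANet's dual-path attention, and the cutoff radius and feature sizes are arbitrary:

```python
import numpy as np

def geometry_message_pass(h, pos, cutoff=3.0):
    # Each atom aggregates neighbor features weighted by a smooth cosine
    # envelope of interatomic distance (zero beyond the cutoff radius).
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    w = np.where((d > 0) & (d < cutoff),
                 0.5 * (np.cos(np.pi * d / cutoff) + 1.0), 0.0)
    return h + w @ h

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))   # per-atom feature vectors
far = np.array([[0.0, 0, 0], [10.0, 0, 0], [20.0, 0, 0], [30.0, 0, 0]])
```

The smooth envelope makes forces continuous as atoms cross the cutoff, which matters for stable long-time molecular dynamics.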
[511] Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints
Shengping Xie, Zekun Wu, Quan Chen, Kaixu Tang
Main category: cs.LG
TL;DR: NucGD: A geometry-aware optimizer using nuclear norm constraints to enforce low-rank structures in multiclass separable data, with efficient SVD-free updates via asynchronous power iteration.
Details
Motivation: To understand how optimization geometry shapes solutions in overparameterized models on multiclass separable data, and to develop methods that enforce desirable low-rank structures through geometric constraints.
Method: Introduces NucGD (Nuclear Gradient Descent) within the Normalized Steepest Descent framework, using nuclear norm constraints to enforce low-rank structures. Derives efficient SVD-free update rules via asynchronous power iteration for scalable training. Analyzes impact of stochastic optimization dynamics including gradient noise from mini-batch sampling and momentum.
Result: Provides a unified perspective connecting NucGD with emerging low-rank projection methods. Demonstrates how varying levels of gradient noise modulate convergence toward expected maximum margin solutions. Offers empirical analysis of stochastic optimization dynamics.
Conclusion: The NSD framework and NucGD optimizer provide insights into how optimization geometry shapes implicit bias in overparameterized models, with practical methods for enforcing low-rank structures and understanding stochastic training dynamics.
Abstract: Implicit bias induced by gradient-based algorithms is essential to the generalization of overparameterized models, yet its mechanisms can be subtle. This work leverages the Normalized Steepest Descent (NSD) framework to investigate how optimization geometry shapes solutions on multiclass separable data. We introduce NucGD, a geometry-aware optimizer designed to enforce low rank structures through nuclear norm constraints. Beyond the algorithm itself, we connect NucGD with emerging low-rank projection methods, providing a unified perspective. To enable scalable training, we derive an efficient SVD-free update rule via asynchronous power iteration. Furthermore, we empirically dissect the impact of stochastic optimization dynamics, characterizing how varying levels of gradient noise induced by mini-batch sampling and momentum modulate the convergence toward the expected maximum margin solutions. Our code is accessible at: https://github.com/Tsokarsic/observing-the-implicit-bias-on-multiclass-seperable-data.
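The SVD-free idea rests on a standard fact: steepest descent under a nuclear-norm geometry moves along the rank-one outer product of the gradient's leading singular vectors, which power iteration can estimate without a full SVD. The sketch below is a generic synchronous power iteration, not the paper's asynchronous rule; dimensions and iteration counts are arbitrary:

```python
import numpy as np

def top_singular_pair(G, iters=100, seed=0):
    # Power iteration on G^T G: estimates the leading singular triple
    # (u, s, v) of G without computing a full SVD.
    v = np.random.default_rng(seed).normal(size=G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = G.T @ (G @ v)
        v /= np.linalg.norm(v)
    u = G @ v
    s = np.linalg.norm(u)
    return u / s, s, v

def rank_one_step(W, G, lr):
    # One nuclear-geometry steepest-descent step: move along the rank-one
    # direction u v^T extracted from the gradient G.
    u, s, v = top_singular_pair(G)
    return W - lr * np.outer(u, v)
```

Each step only needs matrix-vector products with the gradient, which is what makes the update scalable.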
[512] The Coordinate System Problem in Persistent Structural Memory for Neural Architectures
Abhinaba Basu
Main category: cs.LG
TL;DR: DPPN architecture uses pheromone-based routing with stable coordinate systems to achieve persistent structural memory in neural networks, identifying two key requirements: coordinate stability and graceful transfer mechanisms.
Details
Motivation: To understand the fundamental requirements for persistent structural memory in neural networks, particularly addressing the instability of learned coordinate systems and enabling effective knowledge transfer across tasks.
Method: Developed Dual-View Pheromone Pathway Network (DPPN) with sparse attention routing through persistent pheromone fields over latent slot transitions. Conducted systematic experiments with 5 model variants, 4 transfer targets, and up to 10 seeds per condition to analyze memory stability and transfer mechanisms.
Result: DPPN outperforms transformer and random sparse baselines for within-task learning (AULC 0.700 vs 0.680 vs 0.670). Identified that fixed random Fourier features provide stable extrinsic coordinates, while learning-rate modulation with warm pheromone achieves +0.003 improvement on same-family tasks without negative transfer.
Conclusion: Persistent structural memory requires two independent components: (a) stable coordinate systems (achieved through extrinsic coordinates) and (b) graceful transfer mechanisms (achieved through learning-rate modulation rather than routing bias).
Abstract: We introduce the Dual-View Pheromone Pathway Network (DPPN), an architecture that routes sparse attention through a persistent pheromone field over latent slot transitions, and use it to discover two independent requirements for persistent structural memory in neural networks. Through five progressively refined experiments using up to 10 seeds per condition across 5 model variants and 4 transfer targets, we identify a core principle: persistent memory requires a stable coordinate system, and any coordinate system learned jointly with the model is inherently unstable. We characterize three obstacles – pheromone saturation, surface-structure entanglement, and coordinate incompatibility – and show that neither contrastive updates, multi-source distillation, Hungarian alignment, nor semantic decomposition resolves the instability when embeddings are learned from scratch. Fixed random Fourier features provide extrinsic coordinates that are stable, structure-blind, and informative, but coordinate stability alone is insufficient: routing-bias pheromone does not transfer (10 seeds, p>0.05). DPPN outperforms transformer and random sparse baselines for within-task learning (AULC 0.700 vs 0.680 vs 0.670). Replacing routing bias with learning-rate modulation eliminates negative transfer: warm pheromone as a learning-rate prior achieves +0.003 on same-family tasks (17 seeds, p<0.05) while never reducing performance. A structure completion function over extrinsic coordinates produces +0.006 same-family bonus beyond regularization, showing the catch-22 between stability and informativeness is partially permeable to learned functions. The contribution is two independent requirements for persistent structural memory: (a) coordinate stability and (b) graceful transfer mechanism.
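The "fixed random Fourier features as extrinsic coordinates" ingredient is self-contained enough to sketch: sample a projection once, never train it, and the resulting coordinates cannot drift with the model. This is a generic random-Fourier-feature map in NumPy, not DPPN's architecture; dimensions and the seed are arbitrary:

```python
import numpy as np

class RandomFourierCoords:
    # Fixed random Fourier features: an extrinsic coordinate system that is
    # sampled once and never trained, so it stays stable while the model learns.
    def __init__(self, in_dim, out_dim, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=scale, size=(in_dim, out_dim // 2))
    def __call__(self, x):
        proj = x @ self.W
        return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

coords = RandomFourierCoords(in_dim=16, out_dim=64, seed=42)
x = np.random.default_rng(3).normal(size=(5, 16))
phi = coords(x)
```

Because the map is frozen, two training runs (or two tasks) see identical coordinates for the same input, which is exactly the stability property the paper argues jointly learned embeddings lack.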
[513] Confidence Calibration under Ambiguous Ground Truth
Linwei Tao, Haoyang Luo, Minjing Dong, Chang Xu
Main category: cs.LG
TL;DR: The paper addresses confidence calibration in multi-annotator settings where traditional methods fail due to label ambiguity, proposing ambiguity-aware calibrators that work with varying levels of annotation data.
Details
Motivation: Standard confidence calibration assumes unique ground-truth labels, but this fails in real-world scenarios with genuine annotator disagreement. Current methods using majority-voted labels appear calibrated but are actually miscalibrated against the true annotator distribution.
Method: Develops a family of ambiguity-aware post-hoc calibrators: Dirichlet-Soft (uses full annotator distribution), Monte Carlo Temperature Scaling with single annotation (MCTS S=1), and Label-Smooth Temperature Scaling (LS-TS) that works with voted labels alone by constructing pseudo-soft targets from model confidence.
Result: Experiments on four benchmarks (CIFAR-10H, ChaosNLI, ISIC 2019, DermaMNIST) show Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.
Conclusion: The paper demonstrates that traditional calibration methods are structurally biased in multi-annotator settings and proposes practical solutions that work with varying levels of annotation data availability.
Abstract: Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model’s own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC 2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.
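The shared mechanic of these calibrators, fitting a temperature against a full (soft) label distribution instead of one-hot majority votes, can be sketched with a grid search. This is a minimal cross-entropy illustration, not the paper's Dirichlet-Soft objective; the synthetic logits and the "annotators behave like temperature 2" assumption are mine:

```python
import numpy as np

def softmax_T(logits, T):
    z = logits / T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature_soft(logits, soft_labels, grid=None):
    # Pick the temperature minimizing cross-entropy against the full
    # annotator label distribution rather than one-hot majority votes.
    grid = np.linspace(0.25, 5.0, 200) if grid is None else grid
    def ce(T):
        p = softmax_T(logits, T)
        return -np.mean(np.sum(soft_labels * np.log(p + 1e-12), axis=1))
    return min(grid, key=ce)

rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 4)) * 3.0
soft_labels = softmax_T(logits, 2.0)   # annotators behave like T* = 2
T_hat = fit_temperature_soft(logits, soft_labels)
```

Fitting against the hard argmax of the same labels would instead push the temperature down, underestimating annotator uncertainty, which is the structural bias the paper analyzes.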
[514] TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration
Chunxiao Li, Lijun Li, Jing Shao
Main category: cs.LG
TL;DR: TreeTeaming is an automated red teaming framework for Vision-Language Models that uses an LLM-powered Orchestrator to dynamically construct and expand attack strategy trees, discovering novel vulnerabilities beyond linear exploration methods.
Details
Motivation: Existing red teaming methods for VLMs are constrained by linear exploration paradigms that optimize within predefined strategy sets, preventing discovery of novel, diverse exploits. There's a need for more dynamic approaches to uncover safety vulnerabilities in frontier AI models.
Method: TreeTeaming reframes strategy exploration as a dynamic evolutionary process with an LLM-powered Orchestrator that autonomously decides whether to evolve promising attack paths or explore diverse strategic branches, constructing a strategy tree. A multimodal actuator executes these complex strategies.
Result: Achieved state-of-the-art attack success rates on 11 out of 12 prominent VLMs, reaching up to 87.60% on GPT-4o. Demonstrated superior strategic diversity over union of previously public jailbreak strategies, with generated attacks showing 23.09% average toxicity reduction.
Conclusion: TreeTeaming introduces a new paradigm for automated vulnerability discovery in VLMs, showing the necessity of proactive exploration beyond static heuristics to secure frontier AI models through dynamic, evolutionary strategy exploration.
Abstract: The rapid advancement of Vision-Language Models (VLMs) has brought their safety vulnerabilities into sharp focus. However, existing red teaming methods are fundamentally constrained by an inherent linear exploration paradigm, confining them to optimizing within a predefined strategy set and preventing the discovery of novel, diverse exploits. To transcend this limitation, we introduce TreeTeaming, an automated red teaming framework that reframes strategy exploration from static testing to a dynamic, evolutionary discovery process. At its core lies a strategic Orchestrator, powered by a Large Language Model (LLM), which autonomously decides whether to evolve promising attack paths or explore diverse strategic branches, thereby dynamically constructing and expanding a strategy tree. A multimodal actuator is then tasked with executing these complex strategies. In the experiments across 12 prominent VLMs, TreeTeaming achieves state-of-the-art attack success rates on 11 models, outperforming existing methods and reaching up to 87.60% on GPT-4o. The framework also demonstrates superior strategic diversity over the union of previously public jailbreak strategies. Furthermore, the generated attacks exhibit an average toxicity reduction of 23.09%, showcasing their stealth and subtlety. Our work introduces a new paradigm for automated vulnerability discovery, underscoring the necessity of proactive exploration beyond static heuristics to secure frontier AI models.
[515] Balancing Safety and Efficiency in Aircraft Health Diagnosis: A Task Decomposition Framework with Heterogeneous Long-Micro Scale Cascading and Knowledge Distillation-based Interpretability
Xinhang Chen, Zhihuan Wei, Yang Hu, Zhiguo Zeng, Kang Zeng, Suili Yang
Main category: cs.LG
TL;DR: DDF framework decouples aircraft diagnosis into anomaly detection and fault classification using LMSD with global-local feature extraction, improving accuracy and efficiency with interpretable decisions.
Details
Motivation: Address challenges in general aviation diagnosis: data uncertainty, task heterogeneity, and computational inefficiency. Existing end-to-end approaches have receptive field conflicts between global context and local features, plus high training costs under class imbalance.
Method: Proposes Diagnosis Decomposition Framework (DDF) with Long-Micro Scale Diagnostician (LMSD). Uses “long-range global screening and micro-scale local precise diagnosis” strategy. ConvTokMHSA for global pattern discrimination, MMK Net for local fault extraction. Decoupled training separates optimization pathways. Keyness Extraction Layer provides interpretability via knowledge distillation.
Result: Experiments on NGAFID real-world aviation dataset show 4-8% improvement in Multi-Class Weighted Penalty Metric over baselines with substantially reduced training time. Validates advantages in task adaptability, interpretability, and efficiency.
Conclusion: DDF provides deployable methodology for general aviation health management by addressing key challenges through task decomposition, efficient architecture, and interpretable design.
Abstract: Whole-aircraft diagnosis for general aviation faces threefold challenges: data uncertainty, task heterogeneity, and computational inefficiency. Existing end-to-end approaches uniformly model health discrimination and fault characterization, overlooking intrinsic receptive field conflicts between global context modeling and local feature extraction, while incurring prohibitive training costs under severe class imbalance. To address these, this study proposes the Diagnosis Decomposition Framework (DDF), explicitly decoupling diagnosis into Anomaly Detection (AD) and Fault Classification (FC) subtasks via the Long-Micro Scale Diagnostician (LMSD). Employing a “long-range global screening and micro-scale local precise diagnosis” strategy, LMSD utilizes Convolutional Tokenizer with Multi-Head Self-Attention (ConvTokMHSA) for global operational pattern discrimination and Multi-Micro Kernel Network (MMK Net) for local fault feature extraction. Decoupled training separates “large-sample lightweight” and “small-sample complex” optimization pathways, significantly reducing computational overhead. Concurrently, Keyness Extraction Layer (KEL) via knowledge distillation furnishes physically traceable explanations for two-stage decisions, materializing interpretability-by-design. Experiments on the NGAFID real-world aviation dataset demonstrate approximately 4-8% improvement in Multi-Class Weighted Penalty Metric (MCWPM) over baselines with substantially reduced training time, validating comprehensive advantages in task adaptability, interpretability, and efficiency. This provides a deployable methodology for general aviation health management.
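The decomposition itself is a two-stage cascade: a cheap screener handles the healthy majority, and only flagged flights reach the heavier fault classifier. The sketch below shows that control flow only; the threshold detectors are hypothetical stand-ins for ConvTokMHSA and MMK Net, not the paper's models:

```python
def cascaded_diagnosis(signal, is_anomalous, classify_fault):
    # Stage 1 (AD): lightweight global screening over the flight record; the
    # large healthy majority exits here without touching the heavier model.
    if not is_anomalous(signal):
        return "healthy"
    # Stage 2 (FC): detailed fault characterization on flagged flights only.
    return classify_fault(signal)

# Hypothetical toy detectors standing in for ConvTokMHSA and MMK Net.
THRESHOLD = 3.0
is_anomalous = lambda s: max(abs(v) for v in s) > THRESHOLD
classify_fault = lambda s: "overshoot" if max(s) > THRESHOLD else "undershoot"
```

Decoupling the two stages also decouples their training regimes, which is how the framework sidesteps the class-imbalance cost of end-to-end training.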
[516] Conditionally Identifiable Latent Representation for Multivariate Time Series with Structural Dynamics
Minkey Chang, Jae-Young Kim
Main category: cs.LG
TL;DR: iVDFM learns identifiable latent factors from multivariate time series using innovation conditioning and linear diagonal dynamics, with applications in factor recovery, intervention accuracy, and forecasting.
Details
Motivation: The paper addresses the need for identifiable latent factor models in multivariate time series analysis, where traditional methods often suffer from identifiability issues that limit interpretability and causal inference capabilities.
Method: Proposes Identifiable Variational Dynamic Factor Model (iVDFM) that applies iVAE-style conditioning to the innovation process rather than latent states, uses linear diagonal dynamics for identifiability preservation, and employs companion-matrix and Krylov methods for scalable computation.
Result: Shows improved factor recovery on synthetic data, stable intervention accuracy on synthetic structural causal models, and competitive probabilistic forecasting performance on real-world benchmarks compared to existing methods.
Conclusion: iVDFM provides a principled approach for learning identifiable latent factors from multivariate time series with theoretical guarantees, enabling better interpretability and causal analysis while maintaining computational efficiency.
Abstract: We propose the Identifiable Variational Dynamic Factor Model (iVDFM), which learns latent factors from multivariate time series with identifiability guarantees. By applying iVAE-style conditioning to the innovation process driving the dynamics rather than to the latent states, we show that factors are identifiable up to permutation and component-wise affine (or monotone invertible) transformations. Linear diagonal dynamics preserve this identifiability and admit scalable computation via companion-matrix and Krylov methods. We demonstrate improved factor recovery on synthetic data, stable intervention accuracy on synthetic SCMs, and competitive probabilistic forecasting on real-world benchmarks.
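The diagonal transition at the heart of the model is easy to illustrate. The toy simulation below (a plain elementwise AR(1), not the paper's full variational model) shows how each latent factor evolves independently, which is what keeps per-factor computation and identifiability arguments separable:

```python
import numpy as np

def simulate_diagonal_latents(a, innovations):
    # z_t = a * z_{t-1} + eps_t, elementwise: a diagonal linear transition.
    # Because the transition matrix is diagonal, every latent factor evolves
    # on its own, which is what lets iVAE-style conditioning on the
    # innovations carry identifiability through the dynamics.
    z = np.zeros_like(innovations[0])
    out = []
    for eps in innovations:
        z = a * z + eps
        out.append(z.copy())
    return np.stack(out)

a = np.array([0.9, 0.5])                              # one decay rate per factor
eps = np.array([[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])  # unit impulse at t=0
print(simulate_diagonal_latents(a, eps))
# each factor's impulse response decays geometrically at its own rate
```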
[517] VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents
Pengsen Liu, Maosen Zeng, Nan Tang, Kaiyuan Li, Jing-Cheng Pang, Yunan Liu, Yang Yu
Main category: cs.LG
TL;DR: VLGOR integrates vision-language models with offline RL to generate imaginary rollouts for better generalization to unseen tasks in robotic manipulation.
Details
Motivation: LLMs combined with RL can interpret language instructions but lack direct perception of physical environments, limiting their understanding of environmental dynamics and generalization to unseen tasks.
Method: Proposes Visual-Language Knowledge-Guided Offline RL (VLGOR) that fine-tunes a vision-language model to predict future states and actions conditioned on initial visual observations and high-level instructions, ensuring temporally coherent and spatially plausible rollouts. Uses counterfactual prompts for diverse rollouts.
Result: Experiments on robotic manipulation benchmarks show VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving over 24% higher success rate than baseline methods.
Conclusion: VLGOR effectively bridges the gap between language instructions and visual grounding in RL, enabling better generalization to unseen tasks through knowledge-guided imaginary rollouts.
Abstract: Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts, thereby enriching the interaction data. The core premise of VLGOR is to fine-tune a vision-language model to predict future states and actions conditioned on an initial visual observation and high-level instructions, ensuring that the generated rollouts remain temporally coherent and spatially plausible. Furthermore, we employ counterfactual prompts to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that facilitates following language instructions while grounding in environments based on visual cues. Experiments on robotic manipulation benchmarks demonstrate that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.
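The rollout-augmentation idea can be sketched as follows; `rollout_model` is a hypothetical stand-in for the fine-tuned vision-language model, and the one-dimensional dynamics are purely illustrative:

```python
def augment_with_rollouts(real_transitions, rollout_model, prompts, horizon=3):
    """Augment an offline RL dataset with imaginary rollouts, VLGOR-style.
    `rollout_model` stands in for the fine-tuned VLM: given a state and an
    instruction (possibly counterfactual), it predicts the next state and
    the action taken, yielding extra transitions for offline training."""
    augmented = list(real_transitions)
    for obs, instruction in prompts:
        state = obs
        for _ in range(horizon):
            next_state, action = rollout_model(state, instruction)
            augmented.append((state, action, next_state))
            state = next_state
    return augmented

# Toy stand-in: deterministic "move toward goal" dynamics on a line.
model = lambda s, instr: (s + (1 if instr == "right" else -1), instr)
data = augment_with_rollouts([(0, "noop", 0)], model, [(0, "right")], horizon=2)
print(data)  # the one real transition plus two imagined ones
```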
[518] Weak-PDE-Net: Discovering Open-Form PDEs via Differentiable Symbolic Networks and Weak Formulation
Xinxin Li, Xingyu Cui, Jin Qi, Juan Zhang, Da Li, Junping Yin
Main category: cs.LG
TL;DR: Weak-PDE-Net: An end-to-end differentiable framework for discovering open-form PDEs from sparse, noisy data using weak-form representations and neural architecture search.
Details
Motivation: Traditional sparse regression methods for PDE discovery suffer from instability with numerical differentiation under sparse/noisy data and limited flexibility due to pre-defined candidate libraries.
Method: Two interconnected modules: (1) forward response learner with learnable Gaussian kernels in lightweight MLP as surrogate model, (2) weak-form PDE generator with symbolic network and integral module to avoid explicit differentiation. Uses Differentiable Neural Architecture Search to explore functional space and incorporates Galilean Invariance constraints and symmetry equivariance for physical consistency.
Result: Experiments on challenging PDE benchmarks show accurate recovery of governing equations even under highly sparse and noisy observations.
Conclusion: Weak-PDE-Net provides a robust framework for discovering open-form PDEs from sparse, noisy data by combining weak-form representations with differentiable architecture search and physical constraints.
Abstract: Discovering governing Partial Differential Equations (PDEs) from sparse and noisy data is a challenging issue in data-driven scientific computing. Conventional sparse regression methods often suffer from two major limitations: (i) the instability of numerical differentiation under sparse and noisy data, and (ii) the restricted flexibility of a pre-defined candidate library. We propose Weak-PDE-Net, an end-to-end differentiable framework that can robustly identify open-form PDEs. Weak-PDE-Net consists of two interconnected modules: a forward response learner and a weak-form PDE generator. The learner embeds learnable Gaussian kernels within a lightweight MLP, serving as a surrogate model that adaptively captures system dynamics from sparse observations. Meanwhile, the generator integrates a symbolic network with an integral module to construct weak-form PDEs, avoiding explicit numerical differentiation and improving robustness to noise. To relax the constraints of the pre-defined library, we leverage a Differentiable Neural Architecture Search strategy during training to explore the functional space, which enables the efficient discovery of open-form PDEs. The capability of Weak-PDE-Net in multivariable systems discovery is further enhanced by incorporating Galilean Invariance constraints and symmetry equivariance hypotheses to ensure physical consistency. Experiments on several challenging PDE benchmarks demonstrate that Weak-PDE-Net accurately recovers governing equations, even under highly sparse and noisy observations.
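The payoff of the weak formulation, that derivatives land on a smooth test function rather than on noisy data, can be seen in a one-dimensional toy. The candidate equation u' - u = 0 and the bump test function below are illustrative choices, not the paper's setup:

```python
import numpy as np

def trapezoid(y, x):
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def weak_residual(u_vals, phi, dphi, x):
    # Weak form of the candidate equation u' - u = 0: integrating by parts
    # against a test function phi that vanishes at the boundary gives
    #   int (u' - u) phi dx = -int u (phi' + phi) dx,
    # so the (possibly noisy) data u is never differentiated -- the key
    # robustness idea behind an integral-module approach.
    return -trapezoid(u_vals * (dphi(x) + phi(x)), x)

x = np.linspace(-1.0, 1.0, 2001)
phi = lambda t: (1 - t**2) ** 2            # compactly supported bump
dphi = lambda t: -4 * t * (1 - t**2)

u_clean = np.exp(x)                         # exact solution of u' = u
u_noisy = u_clean + 0.01 * np.random.default_rng(0).standard_normal(x.size)

print(abs(weak_residual(u_clean, phi, dphi, x)))  # ~0: equation satisfied
print(abs(weak_residual(u_noisy, phi, dphi, x)))  # still small under noise
```

A finite-difference estimate of u' from the noisy samples would amplify the noise by roughly 1/h; the integral damps it instead.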
[519] Asymptotic Learning Curves for Diffusion Models with Random Features Score and Manifold Data
Anand Jerry George, Nicolas Macris
Main category: cs.LG
TL;DR: Theoretical analysis of denoising score matching for diffusion models on low-dimensional manifolds using random feature neural networks, showing linear scaling with intrinsic dimension for linear manifolds but diminished benefits for non-linear manifolds.
Details
Motivation: To understand the theoretical behavior of diffusion models when data lies on low-dimensional manifolds, specifically how sample complexity scales with intrinsic vs ambient dimensions, and how different manifold structures affect learning.
Method: Theoretical analysis using random feature neural networks to parameterize score functions, deriving asymptotically exact expressions for test, train, and score errors in high-dimensional limit for data distributions supported on low-dimensional manifolds.
Result: For linear manifolds, sample complexity scales linearly with intrinsic dimension rather than ambient dimension. However, benefits of low-dimensional structure diminish for non-linear manifolds, showing subtle dependence on specific manifold structure.
Conclusion: Diffusion models can benefit from structured data on low-dimensional manifolds, but the advantage depends intricately on the specific manifold structure, with linear manifolds showing clear benefits that diminish for non-linear cases.
Abstract: We study the theoretical behavior of denoising score matching, the learning task associated with diffusion models, when the data distribution is supported on a low-dimensional manifold and the score is parameterized using a random feature neural network. We derive asymptotically exact expressions for the test, train, and score errors in the high-dimensional limit. Our analysis reveals that, for linear manifolds, the sample complexity required to learn the score function scales linearly with the intrinsic dimension of the manifold, rather than with the ambient dimension. Perhaps surprisingly, the benefits of low-dimensional structure start to diminish once the manifold is non-linear. These results indicate that diffusion models can benefit from structured data; however, the dependence on the specific type of structure is subtle and intricate.
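The random-feature score model the analysis assumes can be made concrete in one dimension. The sketch below, a denoising score-matching regression with random cosine features and illustrative constants throughout, recovers the known score of a noised Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 20000, 200, 0.5

# 1-D toy: clean data x0 ~ N(0,1), corrupted to x = x0 + sigma * eps.
x0 = rng.standard_normal(n)
x = x0 + sigma * rng.standard_normal(n)
target = (x0 - x) / sigma**2          # denoising score-matching target

# Random-feature score model s(x) = w . cos(omega*x + b); only the
# readout w is trained (ridge regression), as in the paper's setting.
omega = rng.standard_normal(p)
b = rng.uniform(0, 2 * np.pi, p)
Phi = np.cos(np.outer(x, omega) + b)
w = np.linalg.solve(Phi.T @ Phi + 1e-2 * np.eye(p), Phi.T @ target)

# The noisy marginal is N(0, 1 + sigma^2), so its true score is
# -x / (1 + sigma^2); compare the fit against it on a test grid.
xt = np.linspace(-1.5, 1.5, 7)
pred = np.cos(np.outer(xt, omega) + b) @ w
err = np.max(np.abs(pred + xt / (1 + sigma**2)))
print(err)  # small: the random-feature fit recovers the linear score
```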
[520] Can Graph Foundation Models Generalize Over Architecture?
Benjamin Gutteridge, Michael Bronstein, Xiaowen Dong
Main category: cs.LG
TL;DR: GFMs need architecture adaptivity for true zero-shot generalization across diverse graph tasks, as fixed backbones fail on tasks with different architectural requirements than training.
Details
Motivation: Current graph foundation models use fixed architectural backbones assuming one message-passing regime works for all tasks, but this lacks robustness to task-dependent architectural needs like message-passing range.
Method: Propose a framework that adapts GNN architecture at inference time by discovering and mixing task-specific linear graph operators, enabling zero-shot generalization without retraining.
Result: Theoretical analysis and synthetic experiments show fixed-backbone GFMs under-reach on tasks with different architectural requirements; proposed method improves performance and robustness on arbitrary-range synthetic tasks and real-world benchmarks.
Conclusion: Architecture adaptivity is necessary for true graph foundation models to achieve robust zero-shot generalization across tasks with heterogeneous architectural requirements.
Abstract: Graph foundation models (GFMs) have recently attracted interest due to the promise of graph neural network (GNN) architectures that generalize zero-shot across graphs of arbitrary scales, feature dimensions, and domains. While existing work has demonstrated this ability empirically across diverse real-world benchmarks, these tasks share a crucial hidden limitation: they admit a narrow set of effective GNN architectures. In particular, current domain-agnostic GFMs rely on fixed architectural backbones, implicitly assuming that a single message-passing regime suffices across tasks. In this paper, we argue that architecture adaptivity is a necessary requirement for true GFMs. We show that existing approaches are non-robust to task-dependent architectural attributes and, as a case study, use range as a minimal and measurable axis along which this limitation becomes explicit. With theoretical analysis and controlled synthetic experiments, we demonstrate that fixed-backbone GFMs provably under-reach on tasks whose architectural requirements differ from those seen at training time. To address this issue, we introduce a framework that adapts effective GNN architecture at inference time by discovering and mixing task-specific linear graph operators, enabling zero-shot generalization across tasks with heterogeneous architectural requirements, without retraining. We validate our approach on arbitrary-range synthetic tasks and a suite of real-world benchmarks, demonstrating improved performance and robustness over existing domain-agnostic GFMs.
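In the linear case, the operator-mixing idea reduces to fitting weights over a dictionary of graph operators. A minimal sketch, assuming a least-squares fit on a toy path graph (operator discovery itself is not modeled here):

```python
import numpy as np

def mix_operators(ops, signal, target):
    """Fit mixture weights over a dictionary of linear graph operators so
    that (sum_k w_k * Op_k) @ signal matches the target -- a linear toy
    version of adapting effective message-passing range at inference time."""
    cols = np.stack([op @ signal for op in ops], axis=1)
    w, *_ = np.linalg.lstsq(cols, target, rcond=None)
    return w

# Path graph on 4 nodes: identity, adjacency A, and A^2 as the dictionary.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
ops = [np.eye(4), A, A @ A]
x = np.array([1.0, 0.0, 0.0, 0.0])

# A longer-range task: the target depends only on 2-hop neighborhoods.
target = (A @ A) @ x
w = mix_operators(ops, x, target)
print(np.round(w, 2))  # weight concentrates on the 2-hop operator
```

A fixed backbone corresponds to freezing `w`; adapting it per task is what lets range match the task's requirement.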
[521] Robustness Quantification and Uncertainty Quantification: Comparing Two Methods for Assessing the Reliability of Classifier Predictions
Adrián Detavernier, Jasper De Bock
Main category: cs.LG
TL;DR: RQ (Robustness Quantification) outperforms UQ (Uncertainty Quantification) for assessing classifier prediction reliability, especially under distribution shift, and combining both approaches yields best results.
Details
Motivation: The paper addresses the need for reliable assessment of individual classifier predictions, comparing two different approaches: Robustness Quantification (RQ) and Uncertainty Quantification (UQ). The motivation is to understand which method better evaluates prediction reliability and whether they can complement each other.
Method: The authors compare RQ and UQ approaches for reliability assessment, explaining their conceptual differences. They evaluate both methods on benchmark datasets under standard settings and in the presence of distribution shift. They also explore combining both approaches for improved reliability assessment.
Result: RQ outperforms UQ in assessing prediction reliability, both in standard settings and under distribution shift. The combination of RQ and UQ yields even better reliability assessments than either approach alone, demonstrating their complementarity.
Conclusion: Robustness Quantification is competitive with and can outperform Uncertainty Quantification for reliability assessment, especially under distribution shift. The two approaches are complementary, and their combination provides the best reliability assessment for classifier predictions.
Abstract: We consider two approaches for assessing the reliability of the individual predictions of a classifier: Robustness Quantification (RQ) and Uncertainty Quantification (UQ). We explain the conceptual differences between the two approaches, compare both approaches on a number of benchmark datasets and show that RQ is capable of outperforming UQ, both in a standard setting and in the presence of distribution shift. Beside showing that RQ can be competitive with UQ, we also demonstrate the complementarity of RQ and UQ by showing that a combination of both approaches can lead to even better reliability assessments.
[522] A Sobering Look at Tabular Data Generation via Probabilistic Circuits
Davide Scassola, Dylan Ponsford, Adrián Javaloy, Sebastiano Saccani, Luca Bortolussi, Henry Gouk, Antonio Vergari
Main category: cs.LG
TL;DR: Simple hierarchical mixture models (deep probabilistic circuits) outperform SotA diffusion models for tabular data generation when using proper evaluation metrics, challenging current perception of progress.
Details
Motivation: Current state-of-the-art diffusion models for tabular data generation appear to have saturated performance, but this may be due to inadequate evaluation metrics rather than actual progress.
Method: Proposes using hierarchical mixture models in the form of deep probabilistic circuits (PCs) as a simple baseline for tabular data generation, which can handle heterogeneous data natively and provide tractable probabilistic generation.
Result: Probabilistic circuits deliver competitive or superior performance to SotA diffusion models at a fraction of the cost, showing that apparent saturation is due to inadequate metrics rather than actual progress.
Conclusion: There is still much work needed for realistic tabular data generation, and current evaluation protocols need improvement to properly assess model performance.
Abstract: Tabular data is more challenging to generate than text and images, due to its heterogeneous features and much lower sample sizes. On this task, diffusion-based models are the current state-of-the-art (SotA) model class, achieving almost perfect performance on commonly used benchmarks. In this paper, we question the perception of progress for tabular data generation. First, we highlight the limitations of current protocols to evaluate the fidelity of generated data, and advocate for alternative ones. Next, we revisit a simple baseline – hierarchical mixture models in the form of deep probabilistic circuits (PCs) – which delivers competitive or superior performance to SotA models for a fraction of the cost. PCs are the generative counterpart of decision forests, and as such can natively handle heterogeneous data as well as deliver tractable probabilistic generation and inference. Finally, in a rigorous empirical analysis we show that the apparent saturation of progress for SotA models is largely due to the use of inadequate metrics. As such, we highlight that there is still much to be done to generate realistic tabular data. Code available at https://github.com/april-tools/tabpc.
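A depth-1 special case of such a circuit, a mixture of fully factorized components, already shows how heterogeneous columns are handled natively. A minimal sampling sketch (all distribution parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, weights, components):
    """Sample from a shallow mixture of product distributions -- the
    depth-1 special case of the deep probabilistic circuits the paper
    revisits. Each component is a product of a Gaussian (numeric column)
    and a categorical (discrete column), so mixed column types are
    handled natively and sampling stays tractable."""
    ks = rng.choice(len(weights), size=n, p=weights)
    rows = []
    for k in ks:
        mu, sd, cat_p = components[k]
        rows.append((rng.normal(mu, sd), int(rng.choice(len(cat_p), p=cat_p))))
    return rows

components = [
    (0.0, 1.0, [0.9, 0.1]),   # cluster 0: numeric ~ N(0,1), mostly class 0
    (5.0, 0.5, [0.1, 0.9]),   # cluster 1: numeric ~ N(5,0.5), mostly class 1
]
rows = sample_mixture(1000, [0.5, 0.5], components)
print(rows[0])  # one (numeric, categorical) row
```

Deeper circuits nest such sums and products, but likelihoods and samples remain exactly computable, unlike for a diffusion model.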
[523] Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts
Maria Conchita Agana Navarro, Geng Li, Theo Wolf, Maria Perez-Ortiz
Main category: cs.LG
TL;DR: Climate ML emulators struggle with future climate states outside historical training data, showing accuracy-stability tradeoffs when evaluated on out-of-distribution scenarios.
Details
Motivation: Climate change creates non-stationarities that challenge ML-based climate emulators' ability to generalize beyond training distributions, especially for "no-analog" future states outside historical data ranges.
Method: Benchmarked OOD robustness of three architectures (U-Net, ConvLSTM, ClimaX foundation model) trained only on historical data (1850-2014), evaluated via temporal extrapolation (2015-2023) and cross-scenario forcing shifts across emission pathways.
Result: ClimaX achieves lowest absolute error but shows higher relative performance changes under distribution shifts, with precipitation errors increasing up to 8.44% under extreme forcing scenarios, revealing accuracy vs. stability trade-off.
Conclusion: Even high-capacity foundation models are sensitive to external forcing trajectories when restricted to historical training, necessitating scenario-aware training and rigorous OOD evaluation protocols for climate emulator robustness.
Abstract: The accelerating pace of climate change introduces profound non-stationarities that challenge the ability of Machine Learning based climate emulators to generalize beyond their training distributions. While these emulators offer computationally efficient alternatives to traditional Earth System Models, their reliability remains a potential bottleneck under “no-analog” future climate states, which we define here as regimes where external forcing drives the system into conditions outside the empirical range of the historical training data. A fundamental challenge in evaluating this reliability is data contamination; because many models are trained on simulations that already encompass future scenarios, true out-of-distribution (OOD) performance is often masked. To address this, we benchmark the OOD robustness of three state-of-the-art architectures: U-Net, ConvLSTM, and the ClimaX foundation model specifically restricted to a historical-only training regime (1850-2014). We evaluate these models using two complementary strategies: (i) temporal extrapolation to the recent climate (2015-2023) and (ii) cross-scenario forcing shifts across divergent emission pathways. Our analysis within this experimental setup reveals an accuracy vs. stability trade-off: while the ClimaX foundation model achieves the lowest absolute error, it exhibits higher relative performance changes under distribution shifts, with precipitation errors increasing by up to 8.44% under extreme forcing scenarios. These findings suggest that when restricted to historical training dynamics, even high-capacity foundation models are sensitive to external forcing trajectories. Our results underscore the necessity of scenario-aware training and rigorous OOD evaluation protocols to ensure the robustness of climate emulators under a changing climate.
[524] Generalization Bounds for Physics-Informed Neural Networks for the Incompressible Navier-Stokes Equations
Sebastien Andre-Sloan, Dibyakanti Kumar, Alejandro F Frangi, Anirbit Mukherjee
Main category: cs.LG
TL;DR: Theoretical generalization bounds for Physics-Informed Neural Networks (PINNs) solving Navier-Stokes equations, with dimension-independent sample complexity and novel activation function recommendations.
Details
Motivation: To establish rigorous theoretical foundations for PINNs in solving high-dimensional fluid dynamics problems, specifically providing generalization error bounds that don't depend on network width and are dimension-independent.
Method: Bounding the Rademacher complexity of PINN risk for depth-2 neural networks trained via unsupervised PINN framework to solve (d+1)-dimensional incompressible Navier-Stokes equations.
Result: Derived generalization bounds that don’t explicitly depend on network width, characterize generalization gap in terms of fluid’s kinematic viscosity and loss regularization parameters, and yield dimension-independent sample complexity bounds.
Conclusion: The theoretical analysis suggests novel activation functions for solving fluid dynamics, which are empirically validated on the Taylor-Green vortex benchmark using PINN setup.
Abstract: This work establishes rigorous first-of-its-kind upper bounds on the generalization error for the method of approximating solutions to the (d+1)-dimensional incompressible Navier-Stokes equations by training depth-2 neural networks via the unsupervised Physics-Informed Neural Network (PINN) framework. This is achieved by bounding the Rademacher complexity of the PINN risk. For appropriately weight bounded net classes our derived generalization bounds do not explicitly depend on the network width and our framework characterizes the generalization gap in terms of the fluid’s kinematic viscosity and loss regularization parameters. In particular, the resulting sample complexity bounds are dimension-independent. Our generalization bounds suggest using novel activation functions for solving fluid dynamics. We provide empirical validation of the suggested activation functions and the corresponding bounds on a PINN setup solving the Taylor-Green vortex benchmark.
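The incompressibility constraint a PINN penalizes can be checked directly on the Taylor-Green vortex, the paper's validation benchmark. Below is a finite-difference sketch of the divergence residual; a PINN would instead evaluate it by automatic differentiation at collocation points:

```python
import numpy as np

def divergence(u, v, h):
    """Central-difference divergence on the interior of a uniform grid."""
    du_dx = (u[2:, 1:-1] - u[:-2, 1:-1]) / (2 * h)
    dv_dy = (v[1:-1, 2:] - v[1:-1, :-2]) / (2 * h)
    return du_dx + dv_dy

# 2-D Taylor-Green velocity field at t=0: u = cos(x) sin(y),
# v = -sin(x) cos(y). Incompressibility demands div(u, v) = 0, which is
# exactly the continuity residual a Navier-Stokes PINN adds to its loss
# alongside the momentum-equation residuals.
n = 101
h = 2 * np.pi / (n - 1)
x, y = np.meshgrid(np.linspace(0, 2 * np.pi, n),
                   np.linspace(0, 2 * np.pi, n), indexing="ij")
u = np.cos(x) * np.sin(y)
v = -np.sin(x) * np.cos(y)
print(np.max(np.abs(divergence(u, v, h))))  # ~0 (machine precision)
```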
[525] MsFormer: Enabling Robust Predictive Maintenance Services for Industrial Devices
Jiahui Zhou, Dan Li, Ruibing Jin, Jian Lou, Yanran Zhao, Zhenghua Chen, Zigui Jiang, See-Kiong Ng
Main category: cs.LG
TL;DR: MsFormer is a lightweight Multi-scale Transformer designed for industrial predictive maintenance services, addressing multi-scale temporal correlations in streaming sensor data and data scarcity challenges.
Details
Motivation: Existing deep learning methods lack a general service-oriented framework for industrial predictive maintenance, while Transformers face bottlenecks with multi-scale temporal correlations in streaming sensor data and limited training datasets in real-world service environments.
Method: Proposes MsFormer with a Multi-scale Sampling (MS) module and tailored position encoding to capture sequential correlations across multi-streaming data, plus a lightweight attention mechanism with pooling operations instead of self-attention for data-scarce environments.
Result: Extensive experiments on real-world datasets show significant performance improvements over state-of-the-art methods, with strong generalizability across industrial devices and operating conditions while maintaining reliable Quality of Service.
Conclusion: MsFormer provides an effective unified AI service model for industrial predictive maintenance that addresses key challenges of multi-scale temporal correlations and data scarcity in real-world service environments.
Abstract: Providing reliable predictive maintenance is a critical industrial AI service essential for ensuring the high availability of manufacturing devices. Existing deep-learning methods present competitive results on such tasks but lack a general service-oriented framework to capture complex dependencies in industrial IoT sensor data. While Transformer-based models show strong sequence modeling capabilities, their direct deployment as robust AI services faces significant bottlenecks. Specifically, streaming sensor data collected in real-world service environments often exhibits multi-scale temporal correlations driven by machine working principles. Besides, the datasets available for training time-to-failure predictive services are typically limited in size. These issues pose significant challenges for directly applying existing models as robust predictive services. To address these challenges, we propose MsFormer, a lightweight Multi-scale Transformer designed as a unified AI service model for reliable industrial predictive maintenance. MsFormer incorporates a Multi-scale Sampling (MS) module and a tailored position encoding mechanism to capture sequential correlations across multi-streaming service data. Additionally, to accommodate data-scarce service environments, MsFormer adopts a lightweight attention mechanism with straightforward pooling operations instead of self-attention. Extensive experiments on real-world datasets demonstrate that the proposed framework achieves significant performance improvements over state-of-the-art methods. Furthermore, MsFormer sustains its advantage across industrial devices and operating conditions, demonstrating strong generalizability while maintaining a highly reliable Quality of Service (QoS).
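The Multi-scale Sampling idea can be sketched as stride-s pooling of one sensor stream into token sequences at several temporal scales. The scales and pooling choice below are illustrative; the real module also adds its tailored position encodings before attention:

```python
import numpy as np

def multi_scale_sample(stream, scales=(1, 4, 16)):
    """Build token sequences at several temporal scales from one sensor
    stream: stride-s average pooling yields one progressively coarser
    view per scale, letting a downstream (lightweight) attention module
    see both fast transients and slow degradation trends."""
    views = []
    for s in scales:
        t = len(stream) // s * s                      # drop the ragged tail
        views.append(stream[:t].reshape(-1, s).mean(axis=1))
    return views

stream = np.arange(32, dtype=float)   # stand-in for one sensor channel
views = multi_scale_sample(stream)
print([len(v) for v in views])  # token count shrinks with each scale
```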
[526] Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards
Orhun Buğra Baran, Melih Kandemir, Ramazan Gokberk Cinbis
Main category: cs.LG
TL;DR: RL framework for autoregressive image generation using distribution-level LOO-FID reward to improve quality and diversity, avoiding mode collapse while maintaining semantic fidelity.
Details
Motivation: Standard autoregressive models trained with maximum likelihood lack direct optimization for sample quality and diversity. Existing RL methods for diffusion models suffer from diversity collapse, and RL for AR models uses only instance-level rewards, sacrificing distributional coverage for quality.
Method: Proposes lightweight RL framework casting token-based AR synthesis as Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Introduces distribution-level Leave-One-Out FID (LOO-FID) reward using exponential moving average of feature moments to encourage diversity and prevent mode collapse. Combines with instance-level rewards (CLIP and HPSv2) for semantic/perceptual fidelity, stabilized with adaptive entropy regularization.
Result: Extensive experiments on LlamaGen and VQGAN architectures show clear improvements across standard quality and diversity metrics within few hundred tuning iterations. Model can produce competitive samples without Classifier-Free Guidance, bypassing its 2x inference cost.
Conclusion: The proposed RL framework with LOO-FID reward effectively addresses quality-diversity trade-off in autoregressive image generation, enabling efficient optimization that improves both aspects simultaneously while avoiding mode collapse.
Abstract: Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is the introduction of a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, and bypass its 2x inference cost.
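GRPO's group-relative advantage computation is simple to sketch: rewards for a group of samples drawn for the same prompt are standardized within the group, so no learned value baseline is needed. In the paper's setup each sample's reward would combine the LOO-FID, CLIP, and HPSv2 terms; the rewards below are placeholders:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within a group of
    samples generated for the same prompt. The group mean acts as the
    baseline, replacing a learned value function; samples above the
    group average get positive advantage, those below get negative."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = group_relative_advantages([1.0, 2.0, 3.0])
print(np.round(adv, 3))  # zero-mean, roughly unit-variance within the group
```

A distribution-level reward like LOO-FID fits naturally here: it scores each sample by how much it helps (or hurts) the group's overall statistics, then the advantage normalization handles scale.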
[527] SpecXMaster Technical Report
Yutang Ge, Yaning Cui, Hanzheng Li, Jun-Jie Wang, Fanjie Xu, Jinhan Dong, Yongqi Jin, Dongxu Cui, Peng Jin, Guojiang Zhao, Hengxing Cai, Rong Zhu, Linfeng Zhang, Xiaohong Ji, Zhifeng Gao
Main category: cs.LG
TL;DR: SpecXMaster: An intelligent framework using Agentic Reinforcement Learning for automated NMR spectral interpretation, enabling end-to-end extraction of molecular structure information from raw FID data.
Details
Motivation: Conventional expert-dependent spectral interpretation faces challenges including human bias, limited specialized expertise, and variability across interpreters. The paper aims to automate NMR spectral interpretation to overcome these limitations and enable AI-driven closed-loop scientific discovery.
Method: Proposes SpecXMaster framework leveraging Agentic Reinforcement Learning (RL) for NMR molecular spectral interpretation. The system enables automated extraction of multiplicity information from both 1H and 13C spectra directly from raw FID data, creating an end-to-end pipeline for fully automated interpretation of NMR spectra into chemical structures.
Result: Demonstrates superior performance across multiple public NMR interpretation benchmarks. The system has been refined through iterative evaluations by professional chemical spectroscopists.
Conclusion: SpecXMaster represents a novel methodological paradigm for spectral interpretation that will have profound impact on the organic chemistry community, serving as a critical bridge between matter structure and artificial intelligence in intelligent spectroscopy.
Abstract: Intelligent spectroscopy serves as a pivotal element in AI-driven closed-loop scientific discovery, functioning as the critical bridge between matter structure and artificial intelligence. However, conventional expert-dependent spectral interpretation encounters substantial hurdles, including susceptibility to human bias and error, dependence on limited specialized expertise, and variability across interpreters. To address these challenges, we propose SpecXMaster, an intelligent framework leveraging Agentic Reinforcement Learning (RL) for NMR molecular spectral interpretation. SpecXMaster enables automated extraction of multiplicity information from both 1H and 13C spectra directly from raw FID (free induction decay) data. This end-to-end pipeline enables fully automated interpretation of NMR spectra into chemical structures. It demonstrates superior performance across multiple public NMR interpretation benchmarks and has been refined through iterative evaluations by professional chemical spectroscopists. We believe that SpecXMaster, as a novel methodological paradigm for spectral interpretation, will have a profound impact on the organic chemistry community.
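Starting "directly from raw FID data" implies the pipeline must first Fourier-transform the free induction decay into a frequency-domain spectrum, the standard first step of any NMR processing chain. A minimal sketch (apodization, phasing, and baseline correction omitted; the synthetic FID is illustrative):

```python
import numpy as np

def fid_to_spectrum(fid, dwell_time):
    """Fourier-transform a free induction decay into a frequency-domain
    spectrum. A decaying complex exponential in the time domain becomes
    a Lorentzian peak at its resonance frequency."""
    spectrum = np.fft.fftshift(np.fft.fft(fid))
    freqs = np.fft.fftshift(np.fft.fftfreq(len(fid), d=dwell_time))
    return freqs, np.real(spectrum)

# Synthetic FID: one resonance at 100 Hz decaying with T = 0.1 s,
# sampled every 1 ms (illustrative numbers, not real spectrometer output).
n, dt = 4096, 1e-3
t = np.arange(n) * dt
fid = np.exp(2j * np.pi * 100 * t) * np.exp(-t / 0.1)
freqs, spec = fid_to_spectrum(fid, dt)
print(freqs[np.argmax(spec)])  # peak lands near 100 Hz
```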
[528] Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair
Aditya Kakade, Vivek Srivastava, Shirish Karande
Main category: cs.LG
TL;DR: Polaris is a Gödel agent for compact models that performs policy repair via experience abstraction, turning failures into policy updates through a structured cycle of analysis, strategy formation, abstraction, and minimal code patch repair with conservative checks.
Details
Motivation: The paper aims to enable recursive self-improvement in compact models by allowing agents to inspect and modify their own policies, moving beyond simple response-level self-correction or parameter tuning to achieve persistent policy-level improvements.
Method: Polaris implements a structured cycle: 1) analyzing errors, 2) forming strategies, 3) abstracting experiences into reusable strategies, and 4) applying minimal code patches with conservative checks. It uses experience abstraction to distill failures into compact, transferable strategies.
Result: On benchmarks including MGSM, DROP, GPQA, and LitBench (covering arithmetic reasoning, compositional inference, graduate-level problem solving, and creative writing evaluation), a 7-billion-parameter model with Polaris achieves consistent gains over base policy and competitive baselines.
Conclusion: Polaris demonstrates that compact models can achieve recursive self-improvement through policy repair with auditable patches, enabling cumulative refinement and transfer of learned strategies to unseen instances.
Abstract: Gödel agents realize recursive self-improvement: an agent inspects its own policy and traces, and then modifies that policy in a tested loop. We introduce Polaris, a Gödel agent for compact models that performs policy repair via experience abstraction, turning failures into policy updates through a structured cycle of analysis, strategy formation, abstraction, and minimal code patch repair with conservative checks. Unlike response-level self-correction or parameter tuning, Polaris makes policy-level changes with small, auditable patches that persist in the policy and are reused on unseen instances within each benchmark. As part of the loop, the agent engages in meta-reasoning: it explains its errors, proposes concrete revisions to its own policy, and then updates the policy. To enable cumulative policy refinement, we introduce experience abstraction, which distills failures into compact, reusable strategies that transfer to unseen instances. On MGSM, DROP, GPQA, and LitBench (covering arithmetic reasoning, compositional inference, graduate-level problem solving, and creative writing evaluation), a 7-billion-parameter model equipped with Polaris achieves consistent gains over the base policy and competitive baselines.
[529] A Bayesian Learning Approach for Drone Coverage Network: A Case Study on Cardiac Arrest in Scotland
Tathagata Basu, Edoardo Patelli, Gianluca Filippi, Ben Parsonage, Christy Maddock, Massimiliano Vasile, Marco Fossati, Adam Loyd, Shaun Marshall, Paul Gowens
Main category: cs.LG
TL;DR: A reliability-informed Bayesian learning framework for designing drone-assisted AED delivery networks to improve emergency response coverage in both urban and rural areas.
Details
Motivation: Drones show promise for emergency medical services (EMS) and AED delivery, but scaling operational networks faces challenges due to high costs and environmental uncertainties. Current systems need better reliability and coverage, especially in remote areas with longer ambulance response times.
Method: Proposes a reliability-informed Bayesian learning framework that uses survival probability of OHCA patients as objective function to identify optimal drone station locations. Considers existing EMS infrastructure coverage to improve response reliability in remote areas. Uses geographically referenced cardiac arrest data from Scotland for illustration.
Result: Environmental variability and spatial demand patterns significantly influence optimal drone station placement across urban/rural regions. The network shows robustness and economic viability through cost-effectiveness analysis based on expected QALY. Drone-assisted AED delivery is cost-effective and improves emergency response coverage in areas with longer ambulance response times.
Conclusion: Drone-assisted AED delivery networks designed using the proposed reliability-informed Bayesian framework are cost-effective and can significantly improve emergency response coverage, particularly benefiting rural and urban areas with longer ambulance response times.
Abstract: Drones are becoming popular as a complementary system for emergency medical services (EMS). Although several pilot studies and flight trials have shown the feasibility of drone-assisted automated external defibrillator (AED) delivery, running a full-scale operational network remains challenging due to high capital expenditure and environmental uncertainties. In this paper, we formulate a reliability-informed Bayesian learning framework for designing drone-assisted AED delivery networks under environmental and operational uncertainty. We propose our objective function based on the survival probability of out-of-hospital cardiac arrest (OHCA) patients to identify the ideal locations of drone stations. Moreover, we consider the coverage of existing EMS infrastructure to improve the response reliability in remote areas. We illustrate our proposed method using geographically referenced cardiac arrest data from Scotland. The result shows how environmental variability and spatial demand patterns influence optimal drone station placement across urban and rural regions. In addition, we assess the robustness of the network and evaluate its economic viability using a cost-effectiveness analysis based on expected quality-adjusted life years (QALY). The findings suggest that drone-assisted AED delivery is expected to be cost-effective and has the potential to significantly improve the emergency response coverage in rural and urban areas with longer ambulance response times.
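A toy version of the station-placement problem can illustrate the idea of optimizing a survival objective. This is a minimal sketch under our own assumptions, not the paper's method: the exponential-decay survival model, the decay rate `lam`, the drone `speed`, and the greedy site selection are all stand-ins for the paper's reliability-informed Bayesian framework.

```python
import numpy as np

def expected_survival(stations, incidents, speed=25.0, lam=0.1):
    """Mean survival probability over incident locations (toy model).

    Survival is assumed to decay as exp(-lam * response_time), where the
    response time is the fly-time from the nearest station.
    """
    # pairwise distances: (num_incidents, num_stations)
    d = np.linalg.norm(incidents[:, None, :] - stations[None, :, :], axis=-1)
    t = d.min(axis=1) / speed  # response time from the nearest station
    return float(np.exp(-lam * t).mean())

def greedy_place(candidates, incidents, k):
    """Greedily pick k station sites maximizing expected survival."""
    chosen, remaining = [], list(range(len(candidates)))
    for _ in range(k):
        best = max(remaining, key=lambda i: expected_survival(
            candidates[np.array(chosen + [i])], incidents))
        chosen.append(best)
        remaining.remove(best)
    return np.array(chosen)
```

With incidents clustered around two sites, the greedy picker selects stations covering both clusters rather than a distant third candidate.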
[530] DAK-UCB: Diversity-Aware Prompt Routing for LLMs and Generative Models
Donya Jafari, Farzan Farnia
Main category: cs.LG
TL;DR: DAK-UCB is a contextual bandit algorithm for online selection of generative AI models that balances both fidelity and diversity in generated outputs, addressing limitations of fidelity-only selection methods.
Details
Motivation: Current model selection methods for generative AI focus only on maximizing fidelity scores (like CLIP-Score) but ignore output diversity, which can lead to homogeneous or repetitive generations that fail to address diversity shortcomings in responses.
Method: Proposes Diversity-Aware Kernelized Upper Confidence Bound (DAK-UCB), a contextual bandit algorithm that incorporates both fidelity and diversity metrics using prompt-aware diversity score functions. The framework uses joint kernel distance and kernel entropy measures to evaluate diversity across prompt-output pairs from previous generation rounds.
Result: Experimental results show DAK-UCB effectively promotes diversity-aware model selection while maintaining fidelity in generations across a sequence of prompts.
Conclusion: The DAK-UCB method successfully addresses the diversity shortcomings of fidelity-only selection approaches for generative AI models, providing a balanced framework for online model selection that considers both quality and variety of outputs.
Abstract: The expansion of generative AI and LLM services underscores the growing need for adaptive mechanisms to select an appropriate available model to respond to a user’s prompts. Recent works have proposed offline and online learning formulations to identify the optimal generative AI model for an input prompt, based solely on maximizing prompt-based fidelity evaluation scores, e.g., CLIP-Score in text-to-image generation. However, such fidelity-based selection methods overlook the diversity of generated outputs, and hence, they can fail to address potential diversity shortcomings in the generated responses. In this paper, we introduce the Diversity-Aware Kernelized Upper Confidence Bound (DAK-UCB) method as a contextual bandit algorithm for the online selection of generative models with diversity considerations. The proposed DAK-UCB method incorporates both fidelity and diversity-related metrics into the selection process. We design this framework based on prompt-aware diversity score functions that decompose to a two-sample-based expectation over prompt-output pairs in the previous generation rounds. Specifically, we illustrate the application of our framework using joint kernel distance and kernel entropy measures. Our experimental results demonstrate the effectiveness of DAK-UCB in promoting diversity-aware model selection while maintaining fidelity in the generations for a sequence of prompts. The code is available at https://github.com/Donya-Jafari/DAK-UCB.
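The flavor of a kernelized-UCB selection rule with a diversity bonus can be sketched as follows. This is not the paper's implementation: the RBF kernel, the embedding inputs, the mixing weight `w`, and the particular diversity bonus (mean kernel distance of a candidate output to an arm's prior outputs) are our assumptions standing in for DAK-UCB's joint kernel distance and kernel entropy measures.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # RBF kernel matrix between the rows of X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class KernelUCBArm:
    """Kernel-ridge UCB estimate of one model's fidelity from prompt contexts."""
    def __init__(self, lam=1.0, beta=1.0):
        self.X, self.y = [], []
        self.lam, self.beta = lam, beta

    def update(self, x, r):
        self.X.append(x)
        self.y.append(r)

    def ucb(self, x):
        if not self.X:
            return np.inf  # unexplored arm: force a pull
        X, y = np.asarray(self.X), np.asarray(self.y)
        Kinv = np.linalg.inv(rbf(X, X) + self.lam * np.eye(len(X)))
        k = rbf(x[None, :], X)[0]
        mean = k @ Kinv @ y
        var = rbf(x[None, :], x[None, :])[0, 0] - k @ Kinv @ k
        return mean + self.beta * np.sqrt(max(var, 0.0))

def select_model(arms, prompt_ctx, past_outputs, cand_outputs, w=0.5):
    """Pick the arm maximizing fidelity-UCB plus a kernel-distance diversity bonus.

    past_outputs[i] / cand_outputs[i]: embedding arrays of arm i's prior and
    candidate generations (hypothetical embeddings).
    """
    scores = []
    for arm, past, cand in zip(arms, past_outputs, cand_outputs):
        fid = arm.ucb(prompt_ctx)
        # diversity bonus: mean kernel distance of the candidate to prior outputs
        div = 0.0 if len(past) == 0 else float(1.0 - rbf(cand[None, :], past).mean())
        scores.append(w * fid + (1 - w) * div)
    return int(np.argmax(scores))
```

An unexplored arm receives an infinite UCB and is pulled first, which is the usual optimistic-initialization behavior of UCB-style bandits.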
[531] A Schrödinger Eigenfunction Method for Long-Horizon Stochastic Optimal Control
Louis Claeys, Artur Goldman, Zebang Shen, Niao He
Main category: cs.LG
TL;DR: The paper presents a novel approach to long-horizon stochastic optimal control by connecting it to Schrödinger operators, enabling efficient solutions through eigenfunction learning with neural networks.
Details
Motivation: High-dimensional stochastic optimal control becomes exponentially harder with longer planning horizons, with existing methods scaling linearly in horizon length and suffering performance degradation. The authors aim to overcome these limitations for a subclass of linearly-solvable SOC problems.
Method: The paper proves that under gradient drift assumptions, the Hamilton-Jacobi-Bellman operator is unitarily equivalent to a Schrödinger operator with discrete spectrum. This connection allows efficient description of long-horizon control via the eigensystem. For symmetric LQR, the Schrödinger operator matches a quantum harmonic oscillator Hamiltonian. For general settings, neural networks learn the eigensystem with a novel loss function addressing implicit reweighting issues.
Result: The method achieves an order-of-magnitude improvement in control accuracy on long-horizon benchmarks compared to state-of-the-art methods, while reducing memory usage and runtime complexity from O(Td) to O(d).
Conclusion: The connection between stochastic optimal control and Schrödinger operators provides an efficient framework for long-horizon control problems, with neural network-based eigenfunction learning enabling practical implementation and significant performance improvements.
Abstract: High-dimensional stochastic optimal control (SOC) becomes harder with longer planning horizons: existing methods scale linearly in the horizon $T$, with performance often deteriorating exponentially. We overcome these limitations for a subclass of linearly-solvable SOC problems: those whose uncontrolled drift is the gradient of a potential. In this setting, the Hamilton-Jacobi-Bellman equation reduces to a linear PDE governed by an operator $\mathcal{L}$. We prove that, under the gradient drift assumption, $\mathcal{L}$ is unitarily equivalent to a Schrödinger operator $\mathcal{S} = -\Delta + \mathcal{V}$ with purely discrete spectrum, allowing the long-horizon control to be efficiently described via the eigensystem of $\mathcal{L}$. This connection provides two key results: first, for a symmetric linear-quadratic regulator (LQR), $\mathcal{S}$ matches the Hamiltonian of a quantum harmonic oscillator, whose closed-form eigensystem yields an analytic solution to the symmetric LQR with \emph{arbitrary} terminal cost. Second, in a more general setting, we learn the eigensystem of $\mathcal{L}$ using neural networks. We identify implicit reweighting issues with existing eigenfunction learning losses that degrade performance in control tasks, and propose a novel loss function to mitigate this. We evaluate our method on several long-horizon benchmarks, achieving an order-of-magnitude improvement in control accuracy compared to state-of-the-art methods, while reducing memory usage and runtime complexity from $\mathcal{O}(Td)$ to $\mathcal{O}(d)$.
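The role of the eigensystem in long-horizon control can be made concrete with the standard expansion for linearly-solvable SOC. This is our reconstruction, not the paper's derivation: the desirability function $\psi$, the sign conventions, and the eigenpair notation $(\lambda_k, \varphi_k)$ are assumptions.

```latex
% Linear backward PDE: \partial_t \psi = -\mathcal{L}\psi with terminal
% data \psi(\cdot, T); eigensystem \mathcal{L}\varphi_k = -\lambda_k \varphi_k,
% 0 \le \lambda_0 \le \lambda_1 \le \dots (discrete spectrum).
\psi(x, t) = \sum_{k \ge 0} c_k \, e^{-\lambda_k (T - t)} \, \varphi_k(x),
\qquad c_k = \langle \varphi_k, \psi(\cdot, T) \rangle .
% For large T - t the ground state dominates:
% \psi(x, t) \approx c_0 \, e^{-\lambda_0 (T - t)} \, \varphi_0(x),
% so the representation stores eigenfunctions of dimension d only,
% independent of the horizon T.
```

This is the sense in which a discrete spectrum lets the horizon dependence collapse into scalar exponential factors, consistent with the abstract's reduction from $\mathcal{O}(Td)$ to $\mathcal{O}(d)$.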
[532] A One-Inclusion Graph Approach to Multi-Group Learning
Noah Bergam, Samuel Deng, Daniel Hsu
Main category: cs.LG
TL;DR: The paper presents optimal sample complexity bounds for multi-group learning using a one-inclusion graph strategy with bipartite b-matching generalization, achieving optimal convergence rates in group-realizable settings.
Details
Motivation: Multi-group learning aims to achieve good performance across multiple demographic groups, but existing approaches may have suboptimal sample complexity bounds. The authors seek to establish tightest-known upper bounds and optimal convergence rates for this important fairness-oriented learning setting.
Method: Extends the one-inclusion graph prediction strategy using a generalization of bipartite b-matching. This approach provides an algorithm that achieves optimal sample complexity for multi-group learning in group-realizable settings.
Result: Proves tightest-known upper bounds on sample complexity, shows algorithm achieves optimal log n/n convergence rate in general group-realizable setting, and achieves optimal 1/n rate when evaluation group is chosen obliviously of the sample.
Conclusion: The paper establishes fundamental limits and optimal algorithms for multi-group learning, providing both upper bounds and matching lower bounds that confirm the optimality of their approach in various group-realizable settings.
Abstract: We prove the tightest-known upper bounds on the sample complexity of multi-group learning. Our algorithm extends the one-inclusion graph prediction strategy using a generalization of bipartite $b$-matching. In the group-realizable setting, we provide a lower bound confirming that our algorithm’s $\log n / n$ convergence rate is optimal in general. If one relaxes the learning objective such that the group on which we are evaluated is chosen obliviously of the sample, then our algorithm achieves the optimal $1/n$ convergence rate under group-realizability.
[533] General Machine Learning: Theory for Learning Under Variable Regimes
Aomar Osmani
Main category: cs.LG
TL;DR: A foundational learning theory framework for regime-varying settings where learner, memory state, and evaluation conditions evolve over time, focusing on structural properties rather than complete quantitative theory.
Details
Motivation: To address learning scenarios where the learning environment, learner's state, and evaluation criteria change over time (regime variation), requiring new theoretical foundations beyond static learning settings.
Method: Develops regime-varying framework using admissible transport, protected-core preservation, and evaluator-aware learning evolution. Includes structural obstruction arguments, protected-stability templates, and explicit witnesses on controlled subclasses (convex and deductive settings).
Result: Establishes theorem-layer results on evaluator factorization, morphisms, composition, and partial kernel-level alignment across semantically commensurable layers. Provides worked two-regime example demonstrating admissibility certificates and regime-variation costs.
Conclusion: Presents a structured learning-theoretic framework for regime-varying learning with initial theorem-supporting layer, but not a complete quantitative theory of all learning systems.
Abstract: We study learning under regime variation, where the learner, its memory state, and the evaluative conditions may evolve over time. This paper is a foundational and structural contribution: its goal is to define the core learning-theoretic objects required for such settings and to establish their first theorem-supporting consequences. The paper develops a regime-varying framework centered on admissible transport, protected-core preservation, and evaluator-aware learning evolution. It records the immediate closure consequences of admissibility, develops a structural obstruction argument for faithful fixed-ontology reduction in genuinely multi-regime settings, and introduces a protected-stability template together with explicit numerical and symbolic witnesses on controlled subclasses, including convex and deductive settings. It also establishes theorem-layer results on evaluator factorization, morphisms, composition, and partial kernel-level alignment across semantically commensurable layers. A worked two-regime example makes the admissibility certificate, protected evaluative core, and regime-variation cost explicit on a controlled subclass. The symbolic component is deliberately restricted in scope: the paper establishes a first kernel-level compatibility result together with a controlled monotonic deductive witness. The manuscript should therefore be read as introducing a structured learning-theoretic framework for regime-varying learning together with its first theorem-supporting layer, not as a complete quantitative theory of all learning systems.
[534] GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL
Haoyu Wang, Jingcheng Wang, Shunyu Wu, Xinwei Xiao
Main category: cs.LG
TL;DR: GEM: Offline RL framework using Gaussian Mixture Model actors with critic-guided EM updates and candidate-based action selection with support-aware reranking
Details
Motivation: Unimodal policy extraction in offline RL can produce weak "in-between" actions when datasets have multimodal action distributions, making decisions brittle even with strong value functions.
Method: Trains GMM actor via advantage-weighted EM updates guided by critic; learns GMM behavior model; inference uses candidate-based selection with conservative ensemble lower-confidence bound and behavior-normalized support scoring.
Result: Competitive performance on D4RL benchmarks; offers inference-time compute-quality tradeoff via candidate count knob without retraining
Conclusion: GEM provides multimodal, explicitly controllable action selection for offline RL that handles branched/multimodal action landscapes while quantifying support
Abstract: Offline reinforcement learning (RL) can fit strong value functions from fixed datasets, yet reliable deployment still hinges on the action selection interface used to query them. When the dataset induces a branched or multimodal action landscape, unimodal policy extraction can blur competing hypotheses and yield “in-between” actions that are weakly supported by data, making decisions brittle even with a strong critic. We introduce GEM (Guided Expectation-Maximization), an analytical framework that makes action selection both multimodal and explicitly controllable. GEM trains a Gaussian Mixture Model (GMM) actor via critic-guided, advantage-weighted EM-style updates that preserve distinct components while shifting probability mass toward high-value regions, and learns a tractable GMM behavior model to quantify support. During inference, GEM performs candidate-based selection: it generates a parallel candidate set and reranks actions using a conservative ensemble lower-confidence bound together with behavior-normalized support, where the behavior log-likelihood is standardized within each state’s candidate set to yield stable, comparable control across states and candidate budgets. Empirically, GEM is competitive across D4RL benchmarks, and offers a simple inference-time budget knob (candidate count) that trades compute for decision quality without retraining.
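The inference-time reranking step can be sketched compactly. This is illustrative only, not GEM's code: the weights `alpha` and `beta`, the mean-minus-std form of the lower-confidence bound, and the z-score standardization are our assumptions based on the abstract's description.

```python
import numpy as np

def rerank_candidates(candidates, critic_q, behavior_logp, alpha=1.0, beta=1.0):
    """Candidate-based action selection (sketch).

    candidates:    (K, act_dim) actions sampled from the GMM actor for one state.
    critic_q:      (E, K) Q-value estimates from an ensemble of E critics.
    behavior_logp: (K,) log-likelihood of each candidate under the learned
                   GMM behavior model.
    """
    # conservative ensemble lower-confidence bound per candidate
    lcb = critic_q.mean(axis=0) - alpha * critic_q.std(axis=0)
    # behavior-normalized support: standardize the behavior log-likelihood
    # within this state's candidate set, so the support term is comparable
    # across states and candidate budgets
    mu, sigma = behavior_logp.mean(), behavior_logp.std() + 1e-8
    support = (behavior_logp - mu) / sigma
    scores = lcb + beta * support
    return candidates[int(np.argmax(scores))]
```

The candidate count `K` is the compute-quality knob from the abstract: a larger candidate set explores more modes of the GMM actor at inference time without any retraining.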
[535] Neural ODE and SDE Models for Adaptation and Planning in Model-Based Reinforcement Learning
Chao Han, Stefanos Ioannou, Luca Manneschi, T. J. Hayward, Michael Mangan, Aditya Gilra, Eleni Vasilaki
Main category: cs.LG
TL;DR: Neural SDEs outperform neural ODEs for modeling stochastic dynamics in RL, enabling better policy learning and adaptation in partially observable environments through latent SDE models.
Details
Motivation: To address the challenge of modeling stochastic dynamics in reinforcement learning environments, especially when dealing with partial observability and the need for efficient policy adaptation to changing environment dynamics.
Method: Uses neural ordinary differential equations (ODEs) and stochastic differential equations (SDEs) to model transition dynamics. Introduces a latent SDE model combining ODE with GAN-trained stochastic component in latent space for partial observability. Leverages inverse models for efficient policy adaptation.
Result: Neural SDEs more effectively capture stochasticity than neural ODEs, enabling high-performing policies with improved sample efficiency. The latent SDE model outperforms or matches general model-based and model-free approaches across stochastic continuous-control benchmarks.
Conclusion: Action-conditional latent SDEs are effective for RL planning in environments with stochastic transitions, demonstrating superior performance in modeling stochastic dynamics and enabling efficient policy adaptation.
Abstract: We investigate neural ordinary and stochastic differential equations (neural ODEs and SDEs) to model stochastic dynamics in fully and partially observed environments within a model-based reinforcement learning (RL) framework. Through a sequence of simulations, we show that neural SDEs more effectively capture the inherent stochasticity of transition dynamics, enabling high-performing policies with improved sample efficiency in challenging scenarios. We leverage neural ODEs and SDEs for efficient policy adaptation to changes in environment dynamics via inverse models, requiring only limited interactions with the new environment. To address partial observability, we introduce a latent SDE model that combines an ODE with a GAN-trained stochastic component in latent space. Policies derived from this model provide a strong baseline, outperforming or matching general model-based and model-free approaches across stochastic continuous-control benchmarks. This work demonstrates the applicability of action-conditional latent SDEs for RL planning in environments with stochastic transitions. Our code is available at: https://github.com/ChaoHan-UoS/NeuralRL
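The forward model underlying SDE-based planning can be illustrated with a plain Euler-Maruyama rollout; here `drift` and `diffusion` are stand-ins for the learned networks, and the paper's GAN-trained stochastic component and latent encoder are omitted entirely.

```python
import numpy as np

def euler_maruyama(drift, diffusion, z0, actions, dt, rng):
    """Simulate an action-conditional SDE dz = f(z, a) dt + g(z, a) dW
    with the Euler-Maruyama scheme (sketch only).
    """
    z = np.array(z0, dtype=float)
    traj = [z.copy()]
    for a in actions:
        dW = rng.normal(scale=np.sqrt(dt), size=z.shape)  # Brownian increment
        z = z + drift(z, a) * dt + diffusion(z, a) * dW
        traj.append(z.copy())
    return np.stack(traj)
```

With a zero diffusion term the rollout degenerates to an explicit-Euler neural ODE step, which matches the paper's framing of the SDE model as an ODE plus a stochastic component.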
[536] RedTopic: Toward Topic-Diverse Red Teaming of Large Language Models
Jiale Ding, Xiang Zheng, Yutao Wu, Cong Wang, Wei-Bin Lee, Ling Pan, Xingjun Ma, Yu-Gang Jiang
Main category: cs.LG
TL;DR: RedTopic is a novel red teaming framework that generates topic-diverse adversarial prompts for LLMs using contextualized generation, aggregate rewards, and multi-objective RL to overcome limitations of existing approaches.
Details
Motivation: Existing red teaming approaches for LLMs have limitations: topic-based methods rely on pre-collected harmful topics (lacking flexibility), while topic-free RL methods lack explicit exploration rewards and over-optimize narrow objectives, reducing topic diversity.
Method: RedTopic uses a contextualized generation pipeline, aggregate reward design, and multi-objective RL training loop to generate topic-diverse adversarial prompts that adapt to evolving LLM capabilities.
Result: Experiments show RedTopic produces more effective and diverse adversarial prompts than existing methods, with notable improvements in integrated evaluation metrics.
Conclusion: RedTopic represents a step toward more adaptive and topic-diverse red teaming for large language models, addressing limitations of current approaches.
Abstract: As large language models (LLMs) are increasingly deployed as black-box components in real-world applications, red teaming has become essential for identifying potential risks. It tests LLMs with adversarial prompts to uncover vulnerabilities and improve safety alignment. Ideally, effective red teaming should be adaptive to evolving LLM capabilities and explore a broad range of harmful topics. However, existing approaches face two limitations: 1) topic-based approaches rely on pre-collected harmful topics, limited in flexibility and adaptivity. 2) topic-free methods use reinforcement learning (RL), but they lack an explicit reward signal for exploration and tend to over-optimize a narrow objective, reducing topic diversity. To address these limitations, we propose RedTopic, a novel red teaming framework that generates topic-diverse adversarial prompts through a contextualized generation pipeline, an aggregate reward design, and a multi-objective RL training loop. Experiments show that RedTopic produces more effective and diverse adversarial prompts than existing methods, with notable improvements in integrated evaluation metrics. We believe RedTopic represents a step toward more adaptive and topic-diverse red teaming for large language models.
[537] A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling
Ruisong Zhou, Haijun Zou, Li Zhou, Chumin Sun, Zaiwen Wen
Main category: cs.LG
TL;DR: WeCAN: RL-based heterogeneous DAG scheduler with two-stage single-pass design using weighted cross-attention encoder and skip-extended realization to address compatibility coefficients and generation-induced optimality gaps.
Details
Motivation: Efficient DAG scheduling in heterogeneous environments is challenging due to resource capacities, dependencies, adaptability needs across varying resource pools and task types, and the requirement for rapid schedule generation. Existing approaches struggle with task-pool compatibility and generation-induced optimality gaps from restricted reachability in list-scheduling maps.
Method: WeCAN uses a two-stage single-pass RL framework: 1) single forward pass produces task-pool scores and global parameters, 2) generation map constructs schedules without repeated network calls. It employs weighted cross-attention encoder modeling task-pool interactions gated by compatibility coefficients, size-agnostic to environment fluctuations. Introduces order-space analysis characterizing reachable sets via feasible schedule orders, then designs skip-extended realization with analytically parameterized decreasing skip rule to enlarge reachable order set while preserving single-pass efficiency.
Result: Experiments on computation graphs and real-world TPC-H DAGs demonstrate improved makespan over strong baselines, with inference time comparable to classical heuristics and faster than multi-round neural schedulers.
Conclusion: WeCAN effectively addresses heterogeneous DAG scheduling challenges through its end-to-end RL framework, achieving better performance than existing methods while maintaining computational efficiency comparable to classical heuristics.
Abstract: Efficient scheduling of directed acyclic graphs (DAGs) in heterogeneous environments is challenging due to resource capacities and dependencies. In practice, the need for adaptability across environments with varying resource pools and task types, alongside rapid schedule generation, complicates these challenges. We propose WeCAN, an end-to-end reinforcement learning framework for heterogeneous DAG scheduling that addresses task–pool compatibility coefficients and generation-induced optimality gaps. It adopts a two-stage single-pass design: a single forward pass produces task–pool scores and global parameters, followed by a generation map that constructs schedules without repeated network calls. Its weighted cross-attention encoder models task–pool interactions gated by compatibility coefficients, and is size-agnostic to environment fluctuations. Moreover, widely used list-scheduling maps can incur generation-induced optimality gaps from restricted reachability. We introduce an order-space analysis that characterizes the reachable set of generation maps via feasible schedule orders, explains the mechanism behind generation-induced gaps, and yields sufficient conditions for gap elimination. Guided by these conditions, we design a skip-extended realization with an analytically parameterized decreasing skip rule, which enlarges the reachable order set while preserving single-pass efficiency. Experiments on computation graphs and real-world TPC-H DAGs demonstrate improved makespan over strong baselines, with inference time comparable to classical heuristics and faster than multi-round neural schedulers.
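The single-pass generation map can be illustrated with a plain greedy list scheduler driven by precomputed task-pool scores. This is a baseline sketch under our own assumptions; WeCAN's actual map adds the analytically parameterized skip rule, and its scores come from the network's single forward pass rather than a fixed table.

```python
def list_schedule(tasks, deps, dur, pools, score):
    """Greedy list-scheduling generation map (illustrative sketch).

    tasks: iterable of task ids; deps[t]: set of predecessors of t;
    dur[t][p]: duration of task t on pool p; pools: pool ids;
    score[(t, p)]: precomputed task-pool score.
    Returns start times, pool assignment, and makespan.
    """
    free = {p: 0.0 for p in pools}  # earliest free time per pool
    finish, assign, start = {}, {}, {}
    remaining = set(tasks)
    while remaining:
        # tasks whose predecessors have all finished
        ready = [t for t in remaining if deps[t] <= finish.keys()]
        # pick the ready task-pool pair with the highest score
        t, p = max(((t, p) for t in ready for p in pools),
                   key=lambda tp: score[tp])
        s = max(free[p], max((finish[d] for d in deps[t]), default=0.0))
        start[t], assign[t] = s, p
        finish[t] = s + dur[t][p]
        free[p] = finish[t]
        remaining.remove(t)
    return start, assign, max(finish.values())
```

Because every schedule produced this way respects a fixed greedy order, some feasible schedule orders are unreachable; this is the restricted reachability behind the "generation-induced optimality gap" that WeCAN's skip-extended realization is designed to close.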
[538] Permutation-Symmetrized Diffusion for Unconditional Molecular Generation
Gyeonghoon Ko, Juho Lee
Main category: cs.LG
TL;DR: Proposes direct diffusion modeling on quotient manifold for molecular point-cloud generation to enforce permutation invariance, with explicit heat kernel expression and MCMC-based training approach.
Details
Motivation: Most diffusion models for molecular point-cloud generation enforce permutation invariance indirectly through permutation-equivariant networks on ordered space. The authors propose to directly model diffusion on the quotient manifold where all atom permutations are identified, aiming for more principled and efficient permutation invariance.
Method: Models diffusion directly on quotient manifold R^{d×N}/S_N where all atom permutations are identified. Derives explicit expression for heat kernel as sum of Euclidean heat kernels over permutations. Uses expectation form over posterior on permutations for training, approximated via MCMC in permutation space. Evaluates on unconditional 3D molecule generation on QM9 using EQGAT-Diff protocol with SemlaFlow-style backbone.
Result: Demonstrates that quotient-based permutation symmetrization is practical and yields competitive generation quality with improved efficiency compared to existing approaches.
Conclusion: Direct diffusion modeling on quotient manifold provides a principled approach to permutation invariance in molecular point-cloud generation, with practical implementation and competitive performance.
Abstract: Permutation invariance is fundamental in molecular point-cloud generation, yet most diffusion models enforce it indirectly via permutation-equivariant networks on an ordered space. We propose to model diffusion directly on the quotient manifold $\tilde{\mathcal{X}}=\mathbb{R}^{d\times N}/S_N$, where all atom permutations are identified. We show that the heat kernel on $\tilde{\mathcal{X}}$ admits an explicit expression as a sum of Euclidean heat kernels over permutations, which clarifies how diffusion on the quotient differs from ordered-particle diffusion. Training requires a permutation-symmetrized score involving an intractable sum over $S_N$; we derive an expectation form over a posterior on permutations and approximate it using MCMC in permutation space. We evaluate on unconditional 3D molecule generation on QM9 under the EQGAT-Diff protocol, using SemlaFlow-style backbone and treating all variables continuously. The results demonstrate that quotient-based permutation symmetrization is practical and yields competitive generation quality with improved efficiency.
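The two identities described in the abstract can be written out explicitly; the kernel notation and the posterior symbol $q_t$ are our reconstruction, and any overall normalization constant is omitted.

```latex
% Heat kernel on the quotient as a sum of Euclidean heat kernels over S_N:
k_t^{\tilde{\mathcal{X}}}([x], [y])
  = \sum_{\sigma \in S_N} k_t^{\mathbb{R}^{d \times N}}(x, \sigma \cdot y).
% Differentiating the log turns the intractable sum into an expectation
% under the posterior q_t(\sigma \mid x, y) \propto k_t^{\mathbb{R}^{d \times N}}(x, \sigma \cdot y):
\nabla_x \log k_t^{\tilde{\mathcal{X}}}([x], [y])
  = \mathbb{E}_{\sigma \sim q_t(\cdot \mid x, y)}
    \big[ \nabla_x \log k_t^{\mathbb{R}^{d \times N}}(x, \sigma \cdot y) \big].
```

The second identity is just the log-derivative of the first: each permutation's score gets reweighted by its posterior probability, which is the expectation the paper approximates with MCMC over $S_N$.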
[539] Arc Gradient Descent: A Geometrically Motivated Gradient Descent-based Optimiser with Phase-Aware, User-Controlled Step Dynamics (proof-of-concept)
Nikhil Verma, Joonas Linnosmaa, Leonardo Espinosa-Leal, Napat Vajragupta
Main category: cs.LG
TL;DR: ArcGD optimizer outperforms Adam and other state-of-the-art optimizers on non-convex functions and CIFAR-10 image classification, showing better generalization and resistance to overfitting without early stopping tuning.
Details
Motivation: The paper aims to develop a new optimizer (ArcGD) that can outperform existing optimizers like Adam on challenging non-convex optimization problems and real-world machine learning tasks, particularly addressing issues like overfitting and poor generalization observed with current optimizers.
Method: ArcGD optimizer formulation and implementation, evaluated on two fronts: (1) non-convex benchmark functions (Rosenbrock function) across dimensions from 2D to 50,000D, compared with Adam under different learning rate settings; (2) real-world ML evaluation on CIFAR-10 image classification using 8 diverse MLP architectures, compared against Adam, AdamW, Lion, and SGD.
Result: ArcGD consistently outperformed Adam on Rosenbrock function under its effective learning rate setting and achieved superior final solutions in most cases. On CIFAR-10, ArcGD achieved highest average test accuracy (50.7%) at 20,000 iterations, outperforming all other optimizers and winning/tying on 6 of 8 architectures. ArcGD showed continued improvement with extended training while others regressed.
Conclusion: ArcGD demonstrates strong performance on both geometric stress tests and standard deep-learning benchmarks, showing broad applicability. The optimizer exhibits better generalization and resistance to overfitting without requiring early stopping tuning. The paper also reveals conceptual links between ArcGD’s phase structure and the Lion optimizer’s core mechanism.
Abstract: The paper presents the formulation, implementation, and evaluation of the ArcGD optimiser. The evaluation is conducted initially on a non-convex benchmark function and subsequently on a real-world ML dataset. The initial comparative study using the Adam optimiser is conducted on a stochastic variant of the highly non-convex and notoriously challenging Rosenbrock function, renowned for its narrow, curved valley, across dimensions ranging from 2D to 1000D and an extreme case of 50,000D. Two configurations were evaluated to eliminate learning-rate bias: (i) both using ArcGD’s effective learning rate and (ii) both using Adam’s default learning rate. ArcGD consistently outperformed Adam under the first setting and, although slower under the second, achieved superior final solutions in most cases. In the second evaluation, ArcGD is evaluated against state-of-the-art optimizers (Adam, AdamW, Lion, SGD) on the CIFAR-10 image classification dataset across 8 diverse MLP architectures ranging from 1 to 5 hidden layers. ArcGD achieved the highest average test accuracy (50.7%) at 20,000 iterations, outperforming AdamW (46.6%), Adam (46.8%), SGD (49.6%), and Lion (43.4%), winning or tying on 6 of 8 architectures. Notably, Adam and AdamW showed strong early convergence at 5,000 iterations but regressed with extended training, whereas ArcGD continued improving, demonstrating generalization and resistance to overfitting without requiring early stopping tuning. Strong performance on geometric stress tests and standard deep-learning benchmarks indicates broad applicability, highlighting the need for further exploration. Moreover, it is shown that both a limiting variant of ArcGD and a momentum-augmented ArcGD recover sign-based momentum updates, revealing a clear conceptual link between ArcGD’s phase structure and the core mechanism of the Lion Optimiser.
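ArcGD's update rule is not given in the abstract, but the benchmark it is compared on is standard. A minimal sketch of the Adam baseline on the (deterministic) n-dimensional Rosenbrock function, with its analytic gradient; hyperparameters and step counts are illustrative, not the paper's:

```python
import math

def rosenbrock(x):
    """Sum-form Rosenbrock: global minimum 0 at x = (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def rosenbrock_grad(x):
    """Analytic gradient of the sum-form Rosenbrock function."""
    n = len(x)
    g = [0.0] * n
    for i in range(n - 1):
        g[i] += -400.0 * x[i] * (x[i + 1] - x[i] ** 2) - 2.0 * (1.0 - x[i])
        g[i + 1] += 200.0 * (x[i + 1] - x[i] ** 2)
    return g

def adam(x, steps=500, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Plain Adam with bias-corrected first/second moment estimates."""
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    for t in range(1, steps + 1):
        g = rosenbrock_grad(x)
        for j in range(len(x)):
            m[j] = b1 * m[j] + (1 - b1) * g[j]
            v[j] = b2 * v[j] + (1 - b2) * g[j] * g[j]
            mh = m[j] / (1 - b1 ** t)
            vh = v[j] / (1 - b2 ** t)
            x[j] -= lr * mh / (math.sqrt(vh) + eps)
    return x
```

The paper's stochastic variant would add noise to `rosenbrock_grad`; this sketch keeps the deterministic core so the comparison setup is easy to see.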
[540] SynForceNet: A Force-Driven Global-Local Latent Representation Framework for Lithium-Ion Battery Fault Diagnosis
Rongxiu Chen, Yuting Su
Main category: cs.LG
TL;DR: A deep anomaly detection framework combining kernel one-class classification and minimum-volume estimation for online battery fault diagnosis in electric vehicles, validated on real-world EV data.
Details
Motivation: Online safety fault diagnosis is crucial for lithium-ion batteries in electric vehicles, especially under complex and rare safety-critical conditions during real-world operation, requiring robust detection methods.
Method: Develops an online battery fault diagnosis network using a deep anomaly detection framework combining kernel one-class classification and minimum-volume estimation, with mechanical constraints and spike-timing-dependent plasticity (STDP)-based dynamic representations to improve fault characterization and create compact normal-state boundaries.
Result: Validated on 8.6 million data points from 20 EVs, achieving average improvements of 7.59% in TPR, 27.92% in PPV, 18.28% in F1 score, and 23.68% in AUC compared to advanced baselines. Analysis shows spatial separation of fault representations and enhanced robustness through manifold structure learning.
Conclusion: The method demonstrates promise for battery safety diagnosis by integrating deep learning with physical constraints and neural dynamics, with evidence suggesting shared causal structures across different fault types.
Abstract: Online safety fault diagnosis is essential for lithium-ion batteries in electric vehicles (EVs), particularly under complex and rare safety-critical conditions in real-world operation. In this work, we develop an online battery fault diagnosis network based on a deep anomaly detection framework combining kernel one-class classification and minimum-volume estimation. Mechanical constraints and spike-timing-dependent plasticity (STDP)-based dynamic representations are introduced to improve complex fault characterization and enable a more compact normal-state boundary. The proposed method is validated using 8.6 million valid data points collected from 20 EVs. Compared with several advanced baseline methods, it achieves average improvements of 7.59% in TPR, 27.92% in PPV, 18.28% in F1 score, and 23.68% in AUC. In addition, we analyze the spatial separation of fault representations before and after modeling, and further enhance framework robustness by learning the manifold structure in the latent space. The results also suggest the possible presence of shared causal structures across different fault types, highlighting the promise of integrating deep learning with physical constraints and neural dynamics for battery safety diagnosis.
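The one-class, minimum-volume idea behind the framework — learn the tightest boundary that encloses normal operating data, then flag anything outside it — can be illustrated with a much simpler stand-in than the paper's kernel model: a centroid plus a quantile radius. Function names and the quantile are illustrative assumptions:

```python
import math

def _dist(p, c):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))

def fit_normal_boundary(points, quantile=0.95):
    """Fit a compact normal-state boundary: the centroid of healthy
    samples plus a radius covering the given fraction of them."""
    d = len(points[0])
    center = [sum(p[i] for p in points) / len(points) for i in range(d)]
    dists = sorted(_dist(p, center) for p in points)
    radius = dists[min(len(dists) - 1, int(quantile * len(dists)))]
    return center, radius

def is_fault(p, center, radius):
    """Flag a sample as anomalous if it falls outside the boundary."""
    return _dist(p, center) > radius
```

The paper replaces this rigid sphere with a learned kernel boundary shaped by mechanical constraints and STDP-based dynamics, but the decision rule is the same: distance to the normal-state region.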
[541] SafeSeek: Universal Attribution of Safety Circuits in Language Models
Miao Yu, Siyuan Fu, Moayad Aloqaily, Zhenhong Zhou, Safa Otoum, Xing Fan, Kun Wang, Yufei Guo, Qingsong Wen
Main category: cs.LG
TL;DR: A unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization, validated on backdoor attacks and safety alignment scenarios.
Details
Motivation: Existing safety attribution methods for LLMs struggle with generalization and reliability due to heuristic, domain-specific metrics and search algorithms. There's a need for a more systematic approach to identify safety-critical functional components in LLMs.
Method: Proposes a framework using differentiable binary masks to extract multi-granular safety circuits through gradient descent on safety datasets, integrated with Safety Circuit Tuning for efficient safety fine-tuning using these sparse circuits.
Result: Identified a backdoor circuit with 0.42% sparsity whose ablation reduces Attack Success Rate from 100% to 0.4% while retaining >99% general utility. For safety alignment, localized a circuit with 3.03% heads and 0.79% neurons whose removal spikes ASR from 0.8% to 96.9%, while excluding it during fine-tuning maintains 96.5% safety retention.
Conclusion: The proposed framework successfully identifies functionally complete safety circuits in LLMs, enabling both interpretability of safety mechanisms and efficient safety fine-tuning through sparse circuit manipulation.
Abstract: Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods focusing on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, while integrating Safety Circuit Tuning to utilize these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key scenarios in LLM safety: \textbf{(1) backdoor attacks}, identifying a backdoor circuit with 0.42% sparsity, whose ablation reduces the Attack Success Rate (ASR) from 100% $\to$ 0.4% while retaining over 99% general utility; \textbf{(2) safety alignment}, localizing an alignment circuit with 3.03% heads and 0.79% neurons, whose removal spikes ASR from 0.8% $\to$ 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.
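The differentiable-mask mechanism — relax a hard binary head mask to sigmoid(logits), then descend a task loss plus an L1 sparsity penalty — can be sketched on a toy additive model. Everything below (heads as additive "contributions", the squared-error loss form, the hyperparameters) is an illustrative assumption, not SafeSeek's actual objective:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mask_loss(logits, contrib, lam):
    """Toy circuit-extraction objective: keep the masked model's safety
    score close to the full model's, while an L1 term on the relaxed
    binary mask pushes toward sparsity. `contrib` holds each head's
    additive contribution to the safety score (a toy stand-in)."""
    m = [sigmoid(l) for l in logits]
    full = sum(contrib)
    masked = sum(mi * ci for mi, ci in zip(m, contrib))
    return (full - masked) ** 2 + lam * sum(m)

def grad_step(logits, contrib, lam, lr):
    """One analytic gradient-descent step on the relaxed mask logits."""
    m = [sigmoid(l) for l in logits]
    full = sum(contrib)
    masked = sum(mi * ci for mi, ci in zip(m, contrib))
    new = []
    for l, mi, ci in zip(logits, m, contrib):
        ds = mi * (1.0 - mi)                      # d sigmoid / d logit
        g = 2.0 * (masked - full) * ci * ds + lam * ds
        new.append(l - lr * g)
    return new
```

Descending this loss drives the mask toward 1 on heads that matter for the behavior and toward 0 elsewhere, which is the sparse "circuit" the method then ablates or tunes.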
[542] A Comparative Study of Machine Learning Models for Hourly Forecasting of Air Temperature and Relative Humidity
Jiaqi Dong
Main category: cs.LG
TL;DR: Comparison of 7 ML models for hourly temperature/humidity forecasting in Chongqing, with XGBoost achieving best performance.
Details
Motivation: Accurate short-term forecasting of air temperature and relative humidity is critical for urban management in topographically complex cities like Chongqing, China.
Method: Compares seven machine learning models (XGBoost, Random Forest, SVR, MLP, Decision Tree, LSTM, CNN-LSTM) using a unified framework of data preprocessing, lag-feature construction, rolling statistical features, and time-series validation.
Result: XGBoost achieves best overall performance with test MAE of 0.302°C for temperature and 1.271% for humidity, with average R² of 0.989 across both forecasting tasks.
Conclusion: Tree-based ensemble learning (XGBoost) is highly effective for structured meteorological time-series forecasting, providing practical guidance for intelligent meteorological forecasting in mountainous cities.
Abstract: Accurate short-term forecasting of air temperature and relative humidity is critical for urban management, especially in topographically complex cities such as Chongqing, China. This study compares seven machine learning models: eXtreme Gradient Boosting (XGBoost), Random Forest, Support Vector Regression (SVR), Multi-Layer Perceptron (MLP), Decision Tree, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Network (CNN)-LSTM, for hourly prediction using real-world open data. Based on a unified framework of data preprocessing, lag-feature construction, rolling statistical features, and time-series validation, the models are systematically evaluated in terms of predictive accuracy and robustness. The results show that XGBoost achieves the best overall performance, with a test mean absolute error (MAE) of 0.302 °C for air temperature and 1.271% for relative humidity, together with an average R² of 0.989 across the two forecasting tasks. These findings demonstrate the strong effectiveness of tree-based ensemble learning for structured meteorological time-series forecasting and provide practical guidance for intelligent meteorological forecasting in mountainous cities.
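The lag-feature and rolling-statistics construction the framework is built on is a standard recipe; a minimal version (the particular lag set, window, and statistics are hypothetical choices, not the paper's) might look like:

```python
def make_features(series, lags=(1, 2, 3), window=3):
    """Build lag and rolling-statistic features for one-step-ahead
    forecasting: each row holds the lagged values plus the mean, min,
    and max of the preceding window; the target is the current value."""
    rows, targets = [], []
    start = max(max(lags), window)
    for t in range(start, len(series)):
        lag_feats = [series[t - l] for l in lags]
        roll = series[t - window:t]
        rows.append(lag_feats + [sum(roll) / window, min(roll), max(roll)])
        targets.append(series[t])
    return rows, targets
```

With time-series validation, the split must respect temporal order (train on earlier rows, test on later ones) so that no rolling feature leaks future information.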
[543] Robustness Quantification for Discriminative Models: a New Robustness Metric and its Application to Dynamic Classifier Selection
Rodrigo F. L. Lassance, Jasper De Bock
Main category: cs.LG
TL;DR: Proposes a new robustness metric for evaluating classifier reliability that works with any probabilistic discriminative classifier and feature type, enabling dynamic classifier selection.
Details
Motivation: Existing robustness quantification methods for evaluating classifier reliability are limited: they require generative models, work only with specific architectures, or handle only discrete features. There's a need for a more general robustness metric applicable to any classifier and feature type.
Method: Develops a new robustness metric that can be applied to any probabilistic discriminative classifier regardless of architecture, and works with any type of features (continuous, discrete, etc.). The metric quantifies how much uncertainty a classifier can tolerate before changing predictions.
Result: The proposed metric successfully distinguishes between reliable and unreliable predictions. This capability enables the development of new strategies for dynamic classifier selection based on prediction reliability.
Conclusion: The work presents a more general robustness metric that overcomes limitations of existing methods, making robustness quantification applicable to a wider range of classifiers and feature types, with practical applications in dynamic classifier selection.
Abstract: Among the different possible strategies for evaluating the reliability of individual predictions of classifiers, robustness quantification stands out as a method that evaluates how much uncertainty a classifier could cope with before changing its prediction. However, its applicability is more limited than some of its alternatives, since it requires the use of generative models and restricts the analyses either to specific model architectures or discrete features. In this work, we propose a new robustness metric applicable to any probabilistic discriminative classifier and any type of features. We demonstrate that this new metric is capable of distinguishing between reliable and unreliable predictions, and use this observation to develop new strategies for dynamic classifier selection.
[544] Central Dogma Transformer III: Interpretable AI Across DNA, RNA, and Protein
Nobuyuki Ota
Main category: cs.LG
TL;DR: CDT-III is a biological AI model that extends mechanism-oriented AI across the central dogma (DNA→RNA→protein) using a Virtual Cell Embedder architecture with nuclear and cytosolic components, achieving high accuracy in predicting RNA and protein levels while improving interpretability and enabling in silico drug effect screening.
Details
Motivation: Current biological AI models predict complex cellular responses but their learned representations remain disconnected from the underlying molecular processes they aim to capture. There's a need for models that better align with biological mechanisms across the full central dogma.
Method: Two-stage Virtual Cell Embedder (VCE) architecture that mirrors cellular spatial compartmentalization: VCE-N models transcription in the nucleus and VCE-C models translation in the cytosol. The model extends mechanism-oriented AI across DNA, RNA, and protein levels.
Result: Achieved per-gene RNA r=0.843 and protein r=0.969 on five held-out genes. Protein prediction improved RNA performance (r=0.804 to 0.843), demonstrating downstream tasks regularize upstream representations. Protein supervision increased DNA-level interpretability (CTCF enrichment by 30%). Successfully predicted 29/29 protein changes in CD52 knockdown and rediscovered 5 of 7 known clinical side effects without clinical data.
Conclusion: CDT-III demonstrates that mechanism-oriented AI across the central dogma improves both prediction accuracy and biological interpretability. The model enables in silico screening of genetic perturbations and drug effects using only baseline data, offering a powerful tool for biological discovery and drug development.
Abstract: Biological AI models increasingly predict complex cellular responses, yet their learned representations remain disconnected from the molecular processes they aim to capture. We present CDT-III, which extends mechanism-oriented AI across the full central dogma: DNA, RNA, and protein. Its two-stage Virtual Cell Embedder architecture mirrors the spatial compartmentalization of the cell: VCE-N models transcription in the nucleus and VCE-C models translation in the cytosol. On five held-out genes, CDT-III achieves per-gene RNA r=0.843 and protein r=0.969. Adding protein prediction improves RNA performance (r=0.804 to 0.843), demonstrating that downstream tasks regularize upstream representations. Protein supervision sharpens DNA-level interpretability, increasing CTCF enrichment by 30%. Applied to in silico CD52 knockdown approximating Alemtuzumab, the model predicts 29/29 protein changes correctly and rediscovers 5 of 7 known clinical side effects without clinical data. Gradient-based side effect profiling requires only unperturbed baseline data (r=0.939), enabling screening of all 2,361 genes without new experiments.
[545] Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation
Michal Balcerak, Suprosana Shit, Chinmay Prabhakar, Sebastian Kaltenbach, Michael S. Albergo, Yilun Du, Bjoern Menze
Main category: cs.LG
TL;DR: Graph Energy Matching (GEM) is a generative framework for graphs that combines energy-based modeling with transport guidance to achieve high-quality sampling while enabling compositional inference tasks.
Details
Motivation: Discrete energy-based models for graphs can capture relative likelihoods and enable composable probabilistic inference, but they struggle with efficient sampling due to spurious local minima, creating a fidelity gap compared to discrete diffusion models.
Method: GEM learns a permutation-invariant potential energy that provides transport-aligned guidance from noise to data while refining samples in high-likelihood regions. It uses a sampling protocol with an energy-based switch to transition between gradient-guided transport and mixing exploration.
Result: On molecular graph benchmarks, GEM matches or exceeds strong discrete diffusion baselines in sample quality while enabling targeted exploration at inference time.
Conclusion: GEM closes the fidelity gap for discrete energy-based models on graphs, achieving high-quality sampling while maintaining the benefits of explicit likelihood modeling for compositional generation, property-constrained sampling, and interpolation tasks.
Abstract: Energy-based models for discrete domains, such as graphs, explicitly capture relative likelihoods, naturally enabling composable probabilistic inference tasks like conditional generation or enforcing constraints at test-time. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities. This has historically resulted in a fidelity gap relative to discrete diffusion models. We introduce Graph Energy Matching (GEM), a generative framework for graphs that closes this fidelity gap. Motivated by the transport map optimization perspective of the Jordan-Kinderlehrer-Otto (JKO) scheme, GEM learns a permutation-invariant potential energy that simultaneously provides transport-aligned guidance from noise toward data and refines samples within regions of high data likelihood. Further, we introduce a sampling protocol that leverages an energy-based switch to seamlessly bridge (i) rapid, gradient-guided transport toward high-probability regions and (ii) a mixing regime for exploration of the learned graph distribution. On molecular graph benchmarks, GEM matches or exceeds strong discrete diffusion baselines. Beyond sample quality, explicit modeling of relative likelihood enables targeted exploration at inference time, facilitating compositional generation, property-constrained sampling, and geodesic interpolation between graphs.
[546] SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling
Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You
Main category: cs.LG
TL;DR: SortedRL improves RL training efficiency for LLMs by reordering rollout samples by length, enabling larger batches and reducing training bubble ratios by over 50% while improving performance.
Details
Motivation: RL training for LLMs is bottlenecked by the slow rollout phase (up to 70% of training time) due to autoregressive generation and synchronization overhead, especially for long trajectories (e.g., 16k tokens).
Method: SortedRL uses online length-aware scheduling that reorders rollout samples by output length, prioritizing short samples for early updates. It includes cache-based off-policy control and a dedicated RL infrastructure with a stateful controller and rollout buffer.
Result: Reduces RL training bubble ratios by over 50%, achieves 3.9% to 18.4% superior performance over baselines on tasks including logical puzzles, AIME 24, Math 500, and Minerva using LLaMA-3.1-8B and Qwen-2.5-32B.
Conclusion: SortedRL effectively addresses rollout efficiency bottlenecks in RL training for LLMs, enabling more efficient training with better performance through intelligent scheduling and infrastructure improvements.
Abstract: Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency and maintaining training stability. SortedRL reorders rollout samples based on output lengths, forming groups of short samples and prioritizing them for early updates. This enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction simultaneously. To further accelerate the pipeline, SortedRL incorporates a cache-based mechanism to control the degree of off-policy training, and is supported by a dedicated RL infrastructure that manages rollouts and updates via a stateful controller and a rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math challenges such as AIME 24, Math 500, and Minerva, show that SortedRL reduces RL training bubble ratios by over 50% while attaining 3.9% to 18.4% superior performance over baselines given the same amount of data.
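The core scheduling move — order finished rollouts by output length and release groups of short samples to the policy update first — reduces to a few lines. This is a toy batch version; the real system streams lengths online and manages an off-policy cache, and the dictionary keys are illustrative:

```python
def schedule_by_length(rollouts, group_size):
    """Order rollout samples shortest-first and pack them into
    fixed-size update groups, so short samples reach the policy
    update early while long generations would still be in flight."""
    ordered = sorted(rollouts, key=lambda r: r["length"])
    return [ordered[i:i + group_size]
            for i in range(0, len(ordered), group_size)]
```

Because short samples finish generating first anyway, scheduling them into early update groups removes the idle "bubble" time spent waiting for the longest trajectory in a batch.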
[547] Similarity-Aware Mixture-of-Experts for Data-Efficient Continual Learning
Connor Mclaughlin, Nigel Lee, Lili Su
Main category: cs.LG
TL;DR: Proposes adaptive mixture-of-experts framework for continual learning with limited data and arbitrary task overlaps, using incremental global pooling and instance-wise prompt masking to enable efficient knowledge transfer while preventing negative interference.
Details
Motivation: Addresses limitations of existing continual learning approaches that assume either sufficient data per task or non-overlapping tasks. Real-world scenarios often involve limited data per task and arbitrary task overlaps without prior knowledge, creating challenges for effective knowledge transfer and preventing negative interference.
Method: Adaptive mixture-of-experts framework over pre-trained models with two key components: 1) incremental global pooling that gradually introduces prompts over time to reduce association noise, and 2) instance-wise prompt masking that decomposes task samples into in-distribution (aligning with current prompts) and out-of-distribution (requiring new prompts) categories.
Result: The method enhances sample efficiency across varying data volumes and inter-task similarity levels, demonstrating broad applicability in challenging continual learning scenarios with data scarcity and unstructured task overlaps.
Conclusion: The proposed framework effectively addresses the dual challenges of data scarcity and arbitrary task overlapping in continual learning by strategically leveraging task overlaps while preventing negative mutual interference, making it suitable for real-world deployment scenarios.
Abstract: Machine learning models often need to adapt to new data after deployment due to structured or unstructured real-world dynamics. The Continual Learning (CL) framework enables continuous model adaptation, but most existing approaches either assume each task contains sufficiently many data samples or that the learning tasks are non-overlapping. In this paper, we address the more general setting where each task may have a limited dataset, and tasks may overlap in an arbitrary manner without a priori knowledge. This general setting is substantially more challenging for two reasons. On the one hand, data scarcity necessitates effective contextualization of general knowledge and efficient knowledge transfer across tasks. On the other hand, unstructured task overlapping can easily result in negative knowledge transfer. To address the above challenges, we propose an adaptive mixture-of-experts (MoE) framework over pre-trained models that progressively establishes similarity awareness among tasks. Our design contains two innovative algorithmic components: incremental global pooling and instance-wise prompt masking. The former mitigates prompt association noise through gradual prompt introduction over time. The latter decomposes incoming task samples into those aligning with current prompts (in-distribution) and those requiring new prompts (out-of-distribution). Together, our design strategically leverages potential task overlaps while actively preventing negative mutual interference in the presence of per-task data scarcity. Experiments across varying data volumes and inter-task similarity show that our method enhances sample efficiency and is broadly applicable.
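The in-distribution/out-of-distribution routing behind instance-wise prompt masking can be sketched as a similarity test against the current prompt pool. Cosine similarity and a fixed threshold are stand-in assumptions here, not necessarily the paper's criterion:

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def split_by_prompts(samples, prompts, threshold):
    """Instance-wise prompt masking, toy version: a sample is
    in-distribution if its best similarity to any existing prompt
    exceeds the threshold; otherwise it is out-of-distribution and
    would trigger allocation of a new prompt."""
    in_dist, out_dist = [], []
    for s in samples:
        best = max(cosine(s, p) for p in prompts)
        (in_dist if best >= threshold else out_dist).append(s)
    return in_dist, out_dist
```

Routing overlapping samples to existing prompts is what lets the framework reuse knowledge across similar tasks, while the OOD branch isolates novel content so it cannot interfere with what was already learned.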
[548] End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions
Zakaria Mhammedi, Alexander Rakhlin, Nneka Okolo
Main category: cs.LG
TL;DR: Efficient RL algorithm for linear Bellman complete MDPs with deterministic transitions, stochastic initial states and rewards, requiring only argmax oracle for large action spaces
Details
Motivation: Prior computationally efficient algorithms for linear Bellman complete MDPs are limited to small action spaces or require strong oracle assumptions, creating a need for more practical solutions.
Method: Develops a computationally efficient algorithm for linear Bellman complete MDPs with deterministic transitions, stochastic initial states and rewards, with polynomial sample and computational complexity.
Result: Algorithm learns ε-optimal policy with sample and computational complexity polynomial in horizon, feature dimension, and 1/ε, requiring only standard argmax oracle for large/infinite action spaces
Conclusion: Provides practical solution for linear Bellman complete MDPs that bridges gap between statistical tractability and computational efficiency, especially for large action spaces
Abstract: We study reinforcement learning (RL) with linear function approximation in Markov Decision Processes (MDPs) satisfying \emph{linear Bellman completeness} – a fundamental setting where the Bellman backup of any linear value function remains linear. While statistically tractable, prior computationally efficient algorithms are either limited to small action spaces or require strong oracle assumptions over the feature space. We provide a computationally efficient algorithm for linear Bellman complete MDPs with \emph{deterministic transitions}, stochastic initial states, and stochastic rewards. For finite action spaces, our algorithm is end-to-end efficient; for large or infinite action spaces, we require only a standard argmax oracle over actions. Our algorithm learns an $\varepsilon$-optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and $1/\varepsilon$.
[549] Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions
Rustem Islamov, Grigory Malinovsky, Alexander Gaponov, Aurelien Lucchi, Peter Richtárik, Eduard Gorbunov
Main category: cs.LG
TL;DR: Byz-Clip21-SGD2M: A federated learning algorithm combining Byzantine robustness, differential privacy, and convergence guarantees without unrealistic assumptions.
Details
Motivation: Federated learning offers privacy but still leaks information through gradients, while malicious servers can mount Byzantine attacks. Existing approaches have unrealistic assumptions like bounded gradients, require auxiliary datasets, or lack convergence guarantees.
Method: Proposes the Byz-Clip21-SGD2M algorithm that integrates robust aggregation with double momentum and carefully designed clipping. Uses standard L-smoothness and σ-sub-Gaussian gradient noise assumptions.
Result: Proves high-probability convergence guarantees, recovers state-of-the-art convergence rates without adversaries, and improves utility guarantees under Byzantine and DP settings. Empirical validation on CNN/MLP models with MNIST.
Conclusion: The algorithm addresses limitations of prior work by providing a unified framework for differential privacy and Byzantine robustness with strong theoretical guarantees and practical effectiveness.
Abstract: Federated Learning (FL) enables heterogeneous clients to collaboratively train a shared model without centralizing their raw data, offering an inherent level of privacy. However, gradients and model updates can still leak sensitive information, while malicious servers may mount adversarial attacks such as Byzantine manipulation. These vulnerabilities highlight the need to address differential privacy (DP) and Byzantine robustness within a unified framework. Existing approaches, however, often rely on unrealistic assumptions such as bounded gradients, require auxiliary server-side datasets, or fail to provide convergence guarantees. We address these limitations by proposing Byz-Clip21-SGD2M, a new algorithm that integrates robust aggregation with double momentum and carefully designed clipping. We prove high-probability convergence guarantees under standard $L$-smoothness and $\sigma$-sub-Gaussian gradient noise assumptions, thereby relaxing conditions that dominate prior work. Our analysis recovers state-of-the-art convergence rates in the absence of adversaries and improves utility guarantees under Byzantine and DP settings. Empirical evaluations on CNN and MLP models trained on MNIST further validate the effectiveness of our approach.
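Three of the building blocks the abstract names — norm clipping, a momentum update on gradient estimates, and robust aggregation — can be sketched in toy form. Coordinate-wise median stands in for whatever aggregator the paper uses, and the momentum form is generic, not the paper's double-momentum scheme:

```python
import math
from statistics import median

def clip(v, tau):
    """Scale a vector so its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= tau:
        return list(v)
    return [tau * x / norm for x in v]

def momentum(m, g, beta):
    """One exponential-moving-average momentum update on a gradient."""
    return [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]

def robust_aggregate(updates):
    """Coordinate-wise median: a simple Byzantine-robust aggregator
    that a single corrupted update cannot drag arbitrarily far."""
    return [median(col) for col in zip(*updates)]
```

Clipping is also what bounds each client's contribution for the differential-privacy side of the analysis, which is why the two concerns can be handled in one framework.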
[550] Estimating Flow Velocity and Vehicle Angle-of-Attack from Non-invasive Piezoelectric Structural Measurements Using Deep Learning
Chandler B. Smith, S. Hales Swift, Andrew Steyer, Ihab El-Kady
Main category: cs.LG
TL;DR: CNN-based method estimates aerodynamic state variables (velocity and angle of attack) from structural vibration measurements instead of direct flow instrumentation, using piezoelectric sensors on aeroshell interior.
Details
Motivation: Accurate estimation of aerodynamic state variables like freestream velocity and angle of attack is crucial for load prediction, flight control, and model validation. Current methods rely on direct flow instrumentation, but this work explores non-intrusive alternatives using structural vibration measurements.
Method: Uses a dense array of piezoelectric sensors on the aeroshell interior to capture vibrations from turbulent boundary layer pressure fluctuations. Trains a convolutional neural network (CNN) to invert structural responses to recover velocity and AoA. Includes short-window moving-median post-processing to reduce variance in predictions during varying conditions.
Result: Demonstrated in hypersonic wind tunnel experiments at Mach 5 and Mach 8. After post-processing, achieves mean velocity error below 2.27 m/s (0.21%) and mean AoA error of 0.44° (8.25%) on held-out test data from the same experimental campaign.
Conclusion: Feasibility of vibration-based velocity and AoA estimation demonstrated in controlled laboratory environment, offering non-intrusive alternative to traditional flow instrumentation methods.
Abstract: Accurate estimation of aerodynamic state variables such as freestream velocity and angle of attack (AoA) is important for aerodynamic load prediction, flight control, and model validation. This work presents a non-intrusive method for estimating vehicle velocity and AoA from structural vibration measurements rather than direct flow instrumentation such as pitot tubes. A dense array of piezoelectric sensors mounted on the interior skin of an aeroshell captures vibrations induced by turbulent boundary layer pressure fluctuations, and a convolutional neural network (CNN) is trained to invert these structural responses to recover velocity and AoA.
Proof-of-concept is demonstrated through controlled experiments in Sandia’s hypersonic wind tunnel spanning zero and nonzero AoA configurations, Mach 5 and Mach 8 conditions, and both constant and continuously varying tunnel operations. The CNN is trained and evaluated using data from 16 wind tunnel runs, with a temporally centered held-out interval within each run used to form training, validation, and test datasets and assess intra-run temporal generalization. Raw CNN predictions exhibit increased variance during continuously varying conditions; a short-window moving-median post-processing step suppresses this variance and improves robustness. After post-processing, the method achieves a mean velocity error relative to the low-pass filtered reference velocity below 2.27 m/s (0.21%) and a mean AoA error of $0.44^{\circ}$ (8.25%) on held-out test data from the same experimental campaign, demonstrating feasibility of vibration-based velocity and AoA estimation in a controlled laboratory environment.
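The short-window moving-median post-processing step is simple enough to show directly. How the window behaves at the sequence edges is an assumption here (it shrinks rather than pads):

```python
from statistics import median

def moving_median(xs, window=3):
    """Centered moving median over a sequence of raw predictions;
    the window shrinks at the edges instead of padding. Suppresses
    isolated spikes while leaving slow trends intact."""
    half = window // 2
    return [median(xs[max(0, i - half): i + half + 1])
            for i in range(len(xs))]
```

A median, unlike a moving mean, removes a single-sample outlier completely rather than smearing it across neighboring estimates, which matches the variance-suppression role described above.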
[551] Knee or ROC
Veronica Wendt, Jacob Steiner, Byunggu Yu, Caleb Kelly, Justin Kim
Main category: cs.LG
TL;DR: Paper proposes alternative accuracy calculation methods for multi-class image detection when population representation is unknown, using ROC curves and knee method thresholds on CIFAR-10 data.
Details
Motivation: Self-attention transformers show promise for image classification, but existing accuracy evaluation methods assume single-class detection with known population representation. Methods are needed for multi-class scenarios with unknown population distributions.
Method: Proposes using Receiver Operating Characteristic (ROC) accuracy thresholds for multi-class detection and the knee method for determining thresholds when population representation is unknown. Tests on a multi-class dataset created from CIFAR-10 images.
Result: Results show ROC curve and knee thresholds can be applied to multi-class image detection, providing alternative accuracy calculation methods when traditional approaches are unsuitable.
Conclusion: Alternative accuracy calculation methods (ROC thresholds and knee method) are necessary for multi-class image detection scenarios where population representation is unknown, extending evaluation capabilities beyond traditional single-class approaches.
Abstract: Self-attention transformers have demonstrated accuracy for image classification with smaller data sets. However, a limitation is that tests to date are based upon single-class image detection with known representation of image populations. For instances where the input image classes may be greater than one, and for test sets that lack full information on the representation of image populations, accuracy calculations must adapt. The Receiver Operating Characteristic (ROC) accuracy threshold can address the instances of multi-class input images. However, this approach is unsuitable in instances where image population representation is unknown. We then consider calculating accuracy using the knee method to determine threshold values on an ad-hoc basis. Results of ROC curve and knee thresholds for a multi-class data set, created from CIFAR-10 images, are discussed for multi-class image detection.
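As a generic illustration (not necessarily this paper's exact procedure), a knee-style threshold on an ROC curve can be chosen as the point farthest from the chance diagonal, which coincides with maximizing Youden's J = TPR − FPR:

```python
import numpy as np

def knee_point(fpr, tpr):
    """Index of the ROC point farthest from the no-skill line tpr = fpr."""
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    dist = (tpr - fpr) / np.sqrt(2.0)  # perpendicular distance to diagonal
    return int(np.argmax(dist))

# Toy ROC curve; the knee is at (0.1, 0.6), where TPR - FPR peaks.
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.6, 0.75, 0.9, 1.0]
i = knee_point(fpr, tpr)
print(fpr[i], tpr[i])  # → 0.1 0.6
```

The appeal of such a criterion is that it needs only the curve itself, not the class prevalence, which matches the unknown-population-representation setting described above.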
[552] MLFEF: Machine Learning Fusion Model with Empirical Formula to Explore the Momentum in Competitive Sports
Ruixin Peng, Ziqing Li
Main category: cs.LG
TL;DR: This paper develops data-driven and empirical models to define and quantify momentum in tennis matches using Grand Slam men’s singles data, employing machine learning fusion models and statistical analysis to analyze momentum patterns.
Details
Motivation: Tennis coaches and players are interested in understanding momentum as a factor beyond skill that affects match outcomes, requiring a systematic approach to define and quantify momentum for real-time match analysis.
Method: Two approaches: 1) Data-driven model using 5 years of public tennis match data with preprocessing, feature engineering, and a fusion model combining SVM, Random Forest, and XGBoost; 2) Empirical model based on player/expert suggestions using sliding window algorithms for weight calculation and visualization methods. Additional analysis uses CUMSUM algorithm and RUN Test for momentum fluctuation analysis, with Monte Carlo simulation for robustness testing.
Result: The analysis shows that momentum in tennis is not random, though the trend might be random. The fusion model demonstrates robustness through Monte Carlo simulation validation.
Conclusion: The paper successfully develops quantitative models for analyzing momentum in tennis, providing a basis for real-time match analysis and demonstrating that momentum patterns can be systematically identified and studied.
Abstract: Tennis is so popular that coaches and players are curious about factors other than skill, such as momentum. This article tries to define and quantify momentum, providing a basis for real-time analysis of tennis matches. Based on tennis Grand Slam men’s singles match data from recent years, we built two models: one data-driven, the other based on empirical formulas. For the data-driven model, we first collected a large amount of public data, including data on tennis matches in the past five years and personal information on players. The data was then preprocessed and feature-engineered, and a fusion model of SVM, the Random Forest algorithm, and XGBoost was established. For the mechanism analysis model, important features were selected based on the suggestions of many tennis players and enthusiasts, a sliding window algorithm was used to calculate weights, and different methods were used to visualize the momentum. Further analysis of momentum fluctuation is based on the CUMSUM algorithm, popular in industry, as well as the run test; the results show that momentum is not random, though its trend might be. Finally, the robustness of the fusion model is analyzed by Monte Carlo simulation.
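The CUMSUM-based fluctuation analysis can be illustrated with a textbook one-sided CUSUM change detector; the reference value k, the decision threshold h, and the standardization below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def cusum_alarms(series, k=0.5, h=4.0):
    """One-sided CUSUM: indices where an upward shift is flagged."""
    x = np.asarray(series, dtype=float)
    z = (x - x.mean()) / (x.std() + 1e-12)  # standardize the score series
    s, alarms = 0.0, []
    for i, zi in enumerate(z):
        s = max(0.0, s + zi - k)  # accumulate evidence above the reference
        if s > h:
            alarms.append(i)
            s = 0.0  # restart detection after an alarm
    return alarms

# A flat stretch followed by a sustained upward shift in "momentum".
print(cusum_alarms([0.0] * 20 + [3.0] * 10))  # → [24, 29]
```

Because the statistic accumulates only sustained excess over the reference value, a CUSUM test distinguishes a genuine shift in a momentum score from isolated noisy points.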
[553] Reliable OOD Virtual Screening with Extrapolatory Pseudo-Label Matching
Yunni Qu, Bhargav Vaduri, Karthikeya Jatoth, James Wellnitz, Dzung Dinh, Seth Veenbaas, Jonathan Chapman, Alexander Tropsha, Junier Oliva
Main category: cs.LG
TL;DR: EXPLOR framework uses extrapolatory pseudo-labeling on latent-space augmentations to address OOD uncertainty in drug discovery virtual screening, enabling reliable confidence estimates for novel chemical scaffolds.
Details
Motivation: ML models for virtual screening face a fundamental challenge: they need to identify novel, chemically diverse scaffolds (OOD regions) but degrade under distribution shift. Standard novelty-rejection strategies limit discovery by rejecting the most valuable novel scaffolds, while experimental budgets require reliable confidence estimates for top candidates.
Method: EXPLOR uses extrapolatory pseudo-labeling on latent-space augmentations with a multi-headed architecture and novel per-head matching loss. It learns to extrapolate to OOD chemical space while producing reliable confidence estimates, requiring only a single labeled training set and no access to unlabeled test compounds.
Result: State-of-the-art performance across chemical and tabular benchmarks using different molecular embeddings, with particularly strong performance in high-confidence regions critical for virtual screening where only top-ranked candidates advance to experimental validation.
Conclusion: EXPLOR addresses the dual challenges of OOD generalization and reliable confidence estimation in virtual screening, enabling discovery of novel chemical scaffolds while maintaining reliability under distribution shift.
Abstract: Machine learning (ML) models are increasingly deployed for virtual screening in drug discovery, where the goal is to identify novel, chemically diverse scaffolds while minimizing experimental costs. This creates a fundamental challenge: the most valuable discoveries lie in out-of-distribution (OOD) regions beyond the training data, yet ML models often degrade under distribution shift. Standard novelty-rejection strategies ensure reliability within the training domain but limit discovery by rejecting precisely the novel scaffolds most worth finding. Moreover, experimental budgets permit testing only a small fraction of nominated candidates, demanding models that produce reliable confidence estimates. We introduce EXPLOR (Extrapolatory Pseudo-Label Matching for OOD Uncertainty-Based Rejection), a framework that addresses both challenges through extrapolatory pseudo-labeling on latent-space augmentations, requiring only a single labeled training set and no access to unlabeled test compounds, mirroring the realistic conditions of prospective screening campaigns. Through a multi-headed architecture with a novel per-head matching loss, EXPLOR learns to extrapolate to OOD chemical space while producing reliable confidence estimates, with particularly strong performance in high-confidence regions, which is critical for virtual screening where only top-ranked candidates advance to experimental validation. We demonstrate state-of-the-art performance across chemical and tabular benchmarks using different molecular embeddings.
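A loose sketch of the pseudo-labeling idea described above, assuming Gaussian latent perturbations as the augmentation and ensemble-mean pseudo-labels; all names are illustrative, and the paper's per-head matching loss and architecture differ in detail:

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_label_step(heads, z, sigma=0.5):
    """Label latent augmentations with the head ensemble's mean prediction.

    Returns the pseudo-labels, a per-head mismatch (a stand-in for the
    per-head matching loss), and a disagreement-based confidence score.
    """
    z_aug = z + sigma * rng.standard_normal(z.shape)   # latent augmentation
    preds = np.stack([h(z_aug) for h in heads])        # (num_heads, n)
    pseudo = preds.mean(axis=0)                        # ensemble pseudo-label
    per_head_loss = ((preds - pseudo) ** 2).mean(axis=1)
    confidence = 1.0 / (1.0 + preds.std(axis=0))       # high when heads agree
    return pseudo, per_head_loss, confidence

# Two toy heads that drift apart away from the training manifold.
heads = [lambda z: z.sum(axis=1), lambda z: (z ** 2).sum(axis=1)]
pseudo, loss, conf = pseudo_label_step(heads, np.ones((4, 3)))
```

The key property this sketch captures is that head disagreement grows off-distribution, so the same machinery that trains on augmentations also yields a confidence signal for ranking candidates.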
[554] Equivariance via Minimal Frame Averaging for More Symmetries and Efficiency
Yuchao Lin, Jacob Helwig, Shurui Gui, Shuiwang Ji
Main category: cs.LG
TL;DR: Minimal Frame Averaging (MFA) is a mathematical framework for constructing provably minimal frames that achieve exact equivariance in machine learning systems, extending to groups like Lorentz and unitary groups.
Details
Motivation: Current frame averaging methods for achieving equivariance in ML systems are either computationally expensive (summing over large frames) or approximate (sampling-based). There's a need for efficient, exact equivariance methods that work with diverse symmetry groups.
Method: Proposes Minimal Frame Averaging (MFA) - a mathematical framework that constructs provably minimal frames for exact equivariance. Extends frame averaging to more groups including Lorentz group (space-time symmetries) and unitary group (complex-valued domains).
Result: Demonstrates efficiency and effectiveness across diverse tasks: n-body simulation, top tagging in collider physics, and relaxed energy prediction. MFA provides exact equivariance with minimal computational overhead.
Conclusion: MFA offers a principled approach to achieving exact equivariance with minimal frames, extending the applicability of frame averaging to broader symmetry groups while maintaining computational efficiency.
Abstract: We consider achieving equivariance in machine learning systems via frame averaging. Current frame averaging methods involve a costly sum over large frames or rely on sampling-based approaches that only yield approximate equivariance. Here, we propose Minimal Frame Averaging (MFA), a mathematical framework for constructing provably minimal frames that are exactly equivariant. The general foundations of MFA also allow us to extend frame averaging to more groups than previously considered, including the Lorentz group for describing symmetries in space-time, and the unitary group for complex-valued domains. Results demonstrate the efficiency and effectiveness of encoding symmetries via MFA across a diverse range of tasks, including n-body simulation, top tagging in collider physics, and relaxed energy prediction. Our code is available at https://github.com/divelab/MFA.
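The frame-averaging idea can be made concrete with the smallest possible example: averaging a function over the sign-flip group yields exact invariance. MFA's contribution is constructing provably minimal frames for far larger groups (Lorentz, unitary), which this toy does not attempt:

```python
import numpy as np

def frame_average(f, x):
    """Average f over the sign-flip group {+I, -I}; here the frame is the
    whole (two-element) group, so the result is exactly invariant."""
    return 0.5 * (f(x) + f(-x))

f = lambda x: np.sum(x ** 3 + x)   # odd function: not invariant on its own
g = lambda x: frame_average(f, x)  # g(x) == g(-x) by construction
x = np.array([1.0, 2.0])
print(g(x), g(-x))  # → 0.0 0.0 (the odd parts cancel exactly)
```

The cost of this construction is one evaluation of f per frame element, which is why minimal frames matter once the symmetry group is large.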
[555] Almost Sure Convergence of Linear Temporal Difference Learning with Arbitrary Features
Jiuqi Wang, Shangtong Zhang
Main category: cs.LG
TL;DR: Linear TD learning converges almost surely even with linearly dependent features, without algorithm modifications or special assumptions.
Details
Motivation: Traditional linear TD learning requires linearly independent features for convergence, but this assumption fails in many practical scenarios where features are dependent.
Method: Analyzes linear TD algorithm without modifications, proves weight iterates converge to bounded set, establishes local stability, and characterizes bounded invariant sets of the mean ODE.
Result: First proof of almost sure convergence of linear TD without requiring linearly independent features; weight iterates converge to bounded set with consistent value estimates.
Conclusion: Linear TD learning is more robust than previously thought, converging reliably even with dependent features, expanding its practical applicability.
Abstract: Temporal difference (TD) learning with linear function approximation (linear TD) is a classic and powerful prediction algorithm in reinforcement learning. While it is well-understood that linear TD converges almost surely to a unique point, this convergence traditionally requires the assumption that the features used by the approximator are linearly independent. However, this linear independence assumption does not hold in many practical scenarios. This work is the first to establish the almost sure convergence of linear TD without requiring linearly independent features. We prove that the weight iterates of linear TD converge to a bounded set, and that the value estimates derived from the weights in that set are the same almost everywhere. We also establish a notion of local stability of the weight iterates. Importantly, we do not impose assumptions tailored to feature dependence and do not modify the linear TD algorithm. Key to our analysis is a novel characterization of bounded invariant sets of the mean ODE of linear TD.
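For reference, the unmodified linear TD(0) update the paper analyzes has the classic form w ← w + α(r + γφ'ᵀw − φᵀw)φ. A toy sketch on a three-state chain with deliberately dependent features (a duplicated column), mirroring the setting the paper studies; the chain, step size, and sweep count are illustrative:

```python
import numpy as np

def linear_td0(phi, rewards, alpha=0.05, gamma=0.9, sweeps=500):
    """Linear TD(0): repeatedly apply the classic update along a chain.

    phi[t] holds the feature vector of state t; transition t -> t+1
    yields rewards[t], and the last state is terminal (zero features).
    """
    phi = np.asarray(phi, dtype=float)
    w = np.zeros(phi.shape[1])
    for _ in range(sweeps):
        for t in range(len(rewards)):
            delta = rewards[t] + gamma * phi[t + 1] @ w - phi[t] @ w
            w += alpha * delta * phi[t]
    return w

# Duplicated first/third feature columns make the features dependent:
# many weight vectors yield the same values, yet the values converge.
phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0]])   # terminal state: zero features
w = linear_td0(phi, rewards=[0.0, 1.0])
print(np.round(phi @ w, 3))  # value estimates ≈ [0.9, 1.0, 0.0]
```

The duplicated column means the weight vector itself is not unique, but the induced value estimates φᵀw still settle, which is the behavior the paper's convergence result formalizes.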
[556] Addressing Large Action Spaces in 3D Floorplanning via Spatial Generalization
Fin Amin, Nirjhor Rouf, Tse-Han Pan, Sounak Dutta, Md Kamal Ibn Shafi, Paul D. Franzon
Main category: cs.LG
TL;DR: Continuous action representation for 3D floorplanning that decouples output structure from canvas resolution, enabling scalable learning in large design spaces through L-action similarity.
Details
Motivation: Discrete canvas coordinates create scalability bottlenecks in floorplanning as action space grows; need more tractable approaches for large design spaces.
Method: Learn continuous action representation for 3D floorplanning, reason in continuous placement space, discretize only at inference time, leverage L-action similarity (similar returns for nearby actions).
Result: Approach can learn to construct floorplans even when pre-trained only on random floorplans; continuous decision spaces address large-action-space challenge.
Conclusion: Continuous decision spaces are promising for addressing large-action-space challenges in floorplanning by providing structural bias and generalization capabilities.
Abstract: Many recent machine learning approaches to floorplanning represent placement decisions using discrete canvas coordinates, which creates scalability bottlenecks as the action space grows. In this work, we study the effect of learning a continuous action representation for 3D floorplanning. By reasoning in a continuous placement space and discretizing only at inference time, our method decouples the output structure from the canvas resolution, which makes learning and inference more tractable in large design spaces. A central idea in our approach is L-action similarity: actions that are close in the placement space often produce similar returns. This smoothness induces a useful structural bias that allows the model to generalize information from one decision to nearby decisions. As a case study, we show that this approach can learn to construct floorplans even when pre-trained only on random floorplans. Our results suggest that continuous decision spaces are a promising way to address the large-action-space challenge in floorplanning.
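Discretizing only at inference time is simple to sketch: a continuous action in the unit cube maps to integer canvas coordinates at any resolution (the [0, 1]^3 action parameterization is an assumption for illustration, not necessarily the paper's):

```python
def discretize(action, canvas):
    """Map a continuous action in [0, 1]^3 to integer canvas coordinates.

    The clamp via min() keeps the boundary value a = 1.0 on the canvas,
    so the same policy output is valid at any canvas resolution.
    """
    return tuple(min(int(a * n), n - 1) for a, n in zip(action, canvas))

# The same continuous action lands on consistent cells at two resolutions.
print(discretize((0.5, 0.999, 1.0), (10, 10, 4)))    # → (5, 9, 3)
print(discretize((0.5, 0.999, 1.0), (100, 100, 8)))  # → (50, 99, 7)
```

This is what decoupling output structure from canvas resolution amounts to: the policy's output space stays fixed while the grid it is projected onto can grow.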
[557] DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA
Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2412.05430: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.05430&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[558] Paired Wasserstein Autoencoders for Conditional Sampling
Moritz Piening, Matthias Chung
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2412.07586: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.07586&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[559] MSA-CNN: A Lightweight Multi-Scale CNN with Attention for Sleep Stage Classification
Stephan Goerttler, Yucheng Wang, Emadeldeen Eldele, Min Wu, Fei He
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2501.02949: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.02949&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[560] Two Stage Wireless Federated LoRA Fine-Tuning with Sparsified Orthogonal Updates
Bumjun Kim, Wan Choi
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2505.00333: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.00333&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[561] Covariance Density Neural Networks
Om Roy, Yashar Moshfeghi, Keith Smith
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2505.11139: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11139&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[562] Enhancing generalizability of model discovery across parameter space with multi-experiment equation learning (ME-EQL)
Maria-Veronica Ciocanel, John T. Nardini, Kevin B. Flores, Erica M. Rutter, Suzanne S. Sindi, Alexandria Volkening
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2506.08916: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.08916&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[563] Reinforcement-Guided Hyper-Heuristic Hyperparameter Optimization for Fair and Explainable Spiking Neural Network-Based Financial Fraud Detection
Sadman Mohammad Nasif, Md Abrar Jahin, M. F. Mridha
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2508.16915: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16915&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[564] TopoMap: A Feature-based Semantic Discriminator of the Topographical Regions in the Test Input Space
Gianmarco De Vita, Nargiz Humbatova, Paolo Tonella
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2509.03242: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03242&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[565] FEDONet : Fourier-Embedded DeepONet for Spectrally Accurate Operator Learning
Arth Sojitra, Mrigank Dhingra, Omer San
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2509.12344: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12344&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[566] JaGuard: Position Error Correction of GNSS Jamming with Deep Temporal Graphs
Ivana Kesić, Aljaž Blatnik, Carolina Fortuna, Blaž Bertalanič
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2509.14000: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.14000&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[567] Graph Variate Neural Networks
Om Roy, Yashar Moshfeghi, Keith Smith
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2509.20311: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20311&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[568] FTTE: Enabling Federated and Resource-Constrained Deep Edge Intelligence
Irene Tenison, Anna Murphy, Charles Beauville, Lalana Kagal
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2510.03165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[569] Guided Star-Shaped Masked Diffusion
Viacheslav Meshchaninov, Egor Shibaev, Artem Makoian, Ivan Klimov, Nikita Balagansky, Daniil Gavrilov, Aibek Alanov, Dmitry Vetrov
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2510.08369: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08369&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[570] CSI-4CAST: A Hybrid Deep Learning Model for CSI Prediction with Comprehensive Robustness and Generalization Testing
Sikai Cheng, Reza Zandehshahvar, Haoruo Zhao, Daniel A. Garcia-Ulloa, Alejandro Villena-Rodriguez, Carles Navarro Manchón, Pascal Van Hentenryck
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2510.12996: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12996&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[571] On-device Semantic Selection Made Low Latency and Memory Efficient with Monolithic Forwarding
Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2510.15620: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15620&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[572] Parameter-Free Clustering via Self-Supervised Consensus Maximization (Extended Version)
Lijun Zhang, Suyuan Liu, Siwei Wang, Shengju Yu, Xueling Zhu, Miaomiao Li, Xinwang Liu
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2511.09211: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09211&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[573] Data-Efficient and Robust Trajectory Generation through Pathlet Dictionary Learning
Yuanbo Tang, Yan Tang, Zixuan Zhang, Zihui Zhao, Yang Li
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2511.16105: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16105&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[574] Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity
Noa Rubin, Orit Davidovich, Zohar Ringel
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2512.04165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[575] DP-FedSOFIM: Differentially Private Federated Stochastic Optimization using Regularized Fisher Information Matrix
Sidhant Nair, Tanmay Sen, Mrinmay Sen, Sayantan Banerjee
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2601.09166: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.09166&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[576] From Hawkes Processes to Attention: Time-Modulated Mechanisms for Event Sequences
Xinzi Tan, Kejian Zhang, Junhan Yu, Doudou Zhou
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2601.09220: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.09220&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[577] Universal Approximation Theorem for Input-Connected Multilayer Perceptrons
Vugar Ismailov
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2601.14026: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14026&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[578] How Understanding Forecast Uncertainty Resolves the Explainability Problem in Machine Learning Models
Joseph L. Breeden
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2602.00179: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00179&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[579] T1: One-to-One Channel-Head Binding for Multivariate Time-Series Imputation
Dongik Park, Hyunwoo Ryu, Suahn Bae, Keondo Park, Hyung-Sin Kim
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2602.21043: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21043&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[580] MJ1: Multimodal Judgment via Grounded Verification
Bhavesh Kumar, Dylan Feng, Leonard Tang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.07990 was rate-limited (HTTP 429).
[581] Multi-Station WiFi CSI Sensing Framework Robust to Station-wise Feature Missingness and Limited Labeled Data
Keita Kayano, Takayuki Nishio, Daiki Yoda, Yuta Hirai, Tomoko Adachi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.11858 was rate-limited (HTTP 429).
[582] A Stability-Aware Frozen Euler Autoencoder for Physics-Informed Tracking in Continuum Mechanics (SAFE-PIT-CM)
Emil Hovad
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.13280 was rate-limited (HTTP 429).
[583] Rethinking Machine Unlearning: Models Designed to Forget via Key Deletion
Sonia Laguna, Jorge da Silva Goncalves, Moritz Vandenhirtz, Alain Ryser, Irene Cannistraci, Julia E. Vogt
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.15033 was rate-limited (HTTP 429).
[584] PRISM: Demystifying Retention and Interaction in Mid-Training
Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.17074 was rate-limited (HTTP 429).
[585] A Dynamic Bayesian and Machine Learning Framework for Quantitative Evaluation and Prediction of Operator Situation Awareness in Nuclear Power Plants
Shuai Chen, Huiqiao Jia, Tao Qing, Li Zhang, Xingyu Xiao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.19298 was rate-limited (HTTP 429).
[586] FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.19835 was rate-limited (HTTP 429).
[587] Exponential Family Discriminant Analysis: Generalizing LDA-Style Generative Classification to Non-Gaussian Models
Anish Lakkapragada
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.20655 was rate-limited (HTTP 429).
[588] Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging
Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy, Emma Lundberg, Emily B. Fox
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.21717 was rate-limited (HTTP 429).
[589] CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.21743 was rate-limited (HTTP 429).
[590] Inference of Multiscale Gaussian Graphical Model
Do Edmond Sanou, Christophe Ambroise, Geneviève Robin
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2202.05775 was rate-limited (HTTP 429).
[591] HD-Bind: Encoding of Molecular Structure with Low Precision, Hyperdimensional Binary Representations
Derek Jones, Jonathan E. Allen, Xiaohua Zhang, Behnam Khaleghi, Jaeyoung Kang, Weihong Xu, Niema Moshiri, Tajana S. Rosing
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2303.15604 was rate-limited (HTTP 429).
[592] Near-Optimal Nonconvex-Strongly-Convex Bilevel Optimization with Fully First-Order Oracles
Lesi Chen, Yaohua Ma, Jingzhao Zhang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2306.14853 was rate-limited (HTTP 429).
[593] All you need is spin: SU(2) equivariant variational quantum circuits based on spin networks
Richard D. P. East, Guillermo Alonso-Linaje, Chae-Yeun Park
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2309.07250 was rate-limited (HTTP 429).
[594] Sparse Learning and Class Probability Estimation with Weighted Support Vector Machines
Liyun Zeng, Hao Helen Zhang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2312.10618 was rate-limited (HTTP 429).
[595] Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design
Chunyu Xue, Weihao Cui, Quan Chen, Chen Chen, Han Zhao, Shulai Zhang, Linmei Wang, Yan Li, Limin Xiao, Weifeng Zhang, Jing Yang, Bingsheng He, Minyi Guo
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2403.16125 was rate-limited (HTTP 429).
[596] Clusterpath Gaussian Graphical Modeling
D. J. W. Touw, A. Alfons, P. J. F. Groenen, I. Wilms
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2407.00644 was rate-limited (HTTP 429).
[597] SwiftQueue: Optimizing Low-Latency Applications with Swift Packet Queuing
Siddhant Ray, Xi Jiang, Jack Luo, Nick Feamster, Junchen Jiang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2410.06112 was rate-limited (HTTP 429).
[598] Artificial intelligence for partial differential equations in computational mechanics: A review
Yizheng Wang, Jinshuai Bai, Zhongya Lin, Qimin Wang, Cosmin Anitescu, Jia Sun, Mohammad Sadegh Eshaghi, Yuantong Gu, Xi-Qiao Feng, Xiaoying Zhuang, Timon Rabczuk, Yinghua Liu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2410.19843 was rate-limited (HTTP 429).
[599] Learning dynamically inspired bases for Koopman and transfer operator approximation
Gary Froyland, Kevin Kühl
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.05085 was rate-limited (HTTP 429).
[600] Prediction-Powered Inference with Inverse Probability Weighting
Jyotishka Datta, Nicholas G. Polson
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2508.10149 was rate-limited (HTTP 429).
[601] Morphology-Aware Peptide Discovery via Masked Conditional Generative Modeling
Nuno Costa, Julija Zavadlav
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.02060 was rate-limited (HTTP 429).
[602] Riesz Regression As Direct Density Ratio Estimation
Masahiro Kato
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.04568 was rate-limited (HTTP 429).
[603] Delay-Aware Diffusion Policy: Bridging the Observation-Execution Gap in Dynamic Tasks
Aileen Liao, Dong-Ki Kim, Max Olan Smith, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.07697 was rate-limited (HTTP 429).
[604] Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression
Weiyi He, Yue Xing
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.09275 was rate-limited (HTTP 429).
[605] Adaptive Probability Flow Residual Minimization for High-Dimensional Fokker-Planck Equations
Xiaolong Wu, Qifeng Liao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.19196 was rate-limited (HTTP 429).
[606] Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio
Niall McGuire, Yashar Moshfeghi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.14001 was rate-limited (HTTP 429).
[607] Non-Clashing Teaching in Graphs: Algorithms, Complexity, and Bounds
Sujoy Bhore, Liana Khazaliya, Fionn Mc Inerney
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.00657 was rate-limited (HTTP 429).
[608] A Survey of Reinforcement Learning For Economics
Pranjal Rawat
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.08956 was rate-limited (HTTP 429).
[609] Foundation-Model Surrogates Enable Data-Efficient Active Learning for Materials Discovery
Jeffrey Hu, Rongzhi Dong, Ying Feng, Ming Hu, Jianjun Hu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.12567 was rate-limited (HTTP 429).
[610] Deep Adaptive Model-Based Design of Experiments
Arno Strouwen, Sebastian Micluţa-Câmpeanu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.16146 was rate-limited (HTTP 429).
[611] Exploring the Agentic Frontier of Verilog Code Generation
Patrick Yubeaton, Siddharth Garg, Chinmay Hegde
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.19347 was rate-limited (HTTP 429).
[612] Decorrelation, Diversity, and Emergent Intelligence: The Isomorphism Between Social Insect Colonies and Ensemble Machine Learning
Ernest Fokoué, Gregory Babbitt, Yuval Levental
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.20328 was rate-limited (HTTP 429).
cs.MA
[613] Planning over MAPF Agent Dependencies via Multi-Dependency PIBT
Zixiang Jiang, Yulun Zhang, Rishi Veerapaneni, Jiaoyang Li
Main category: cs.MA
TL;DR: MD-PIBT extends PIBT by planning over agent dependencies, enabling more general multi-agent path finding for thousands of agents with various kinodynamic constraints.
Details
Motivation: PIBT and EPIBT are constrained by rule-based planning and can only handle conflicts with at most one other agent, lacking generality for complex multi-agent scenarios.
Method: Proposes Multi-Dependency PIBT (MD-PIBT), which searches over agent dependencies, inspired by PIBT’s priority inheritance logic. It is a general framework in which specific parameterizations reproduce PIBT and EPIBT, while alternative configurations yield novel planning strategies.
Result: MD-PIBT effectively plans for up to 10,000 homogeneous agents under various kinodynamic constraints, including pebble motion, rotation motion, and differential drive robots with speed and acceleration limits; it is particularly effective in MAPF with large agents.
Conclusion: MD-PIBT provides a more general framework for multi-agent path finding that overcomes limitations of PIBT/EPIBT while maintaining efficiency for large-scale scenarios with various motion constraints.
Abstract: Modern Multi-Agent Path Finding (MAPF) algorithms must plan for hundreds to thousands of agents in congested environments within a second, requiring highly efficient algorithms. Priority Inheritance with Backtracking (PIBT) is a popular algorithm capable of effectively planning in such situations. However, PIBT is constrained by its rule-based planning procedure and lacks generality because it restricts its search to paths that conflict with at most one other agent. This limitation also applies to Enhanced PIBT (EPIBT), a recent extension of PIBT. In this paper, we describe a new perspective on solving MAPF by planning over agent dependencies. Taking inspiration from PIBT’s priority inheritance logic, we define the concept of agent dependencies and propose Multi-Dependency PIBT (MD-PIBT) that searches over agent dependencies. MD-PIBT is a general framework where specific parameterizations can reproduce PIBT and EPIBT. At the same time, alternative configurations yield novel planning strategies that are not expressible by PIBT or EPIBT. Our experiments demonstrate that MD-PIBT effectively plans for as many as 10,000 homogeneous agents under various kinodynamic constraints, including pebble motion, rotation motion, and differential drive robots with speed and acceleration limits. We perform thorough evaluations on different variants of MAPF and find that MD-PIBT is particularly effective in MAPF with large agents.
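The priority-inheritance idea that MD-PIBT generalizes can be illustrated with a toy single-step planner on a 4-connected grid. This is a hedged sketch of classic PIBT's core mechanism (fixed priorities, greedy candidate ordering, recursing into the agent that blocks a chosen cell), not the paper's MD-PIBT; the function and variable names here are illustrative:

```python
# Toy single-step PIBT (Priority Inheritance with Backtracking) on a
# 4-connected grid. Illustrative only: fixed priorities, no dependency
# search, and none of the MD-PIBT extensions described in the paper.

def pibt_step(positions, goals, width, height):
    """Plan one synchronized step: returns {agent: next (x, y)}.

    positions, goals: {agent: (x, y)}. Lower agent id = higher priority.
    """
    occupied = {p: a for a, p in positions.items()}  # current cell -> agent
    reserved = {}  # next cell -> agent that claimed it
    nxt = {}       # agent -> decided next cell

    def dist(p, q):  # Manhattan distance as the greedy heuristic
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    def plan(agent, parent_pos=None):
        x, y = positions[agent]
        cands = [(x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        cands = [c for c in cands
                 if 0 <= c[0] < width and 0 <= c[1] < height]
        cands.sort(key=lambda c: dist(c, goals[agent]))  # best cell first
        for c in cands:
            # Skip vertex conflicts and swaps with the agent that pushed us.
            if c in reserved or c == parent_pos:
                continue
            reserved[c] = agent
            blocker = occupied.get(c)
            if blocker is not None and blocker != agent and blocker not in nxt:
                # Priority inheritance: the occupant must move first; if it
                # cannot, backtrack and try our next candidate cell.
                if not plan(blocker, (x, y)):
                    del reserved[c]
                    continue
            nxt[agent] = c
            return True
        return False  # every candidate failed: caller backtracks

    for a in sorted(positions):  # plan agents in priority order
        if a not in nxt and not plan(a):
            nxt[a] = positions[a]  # fallback: wait in place
    return nxt
```

A high-priority agent heading for an occupied cell recursively "pushes" the occupant out of the way, and an impossible swap on a two-cell corridor makes both agents wait. As the abstract notes, MD-PIBT replaces this fixed rule-based procedure with a search over agent dependencies.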
[614] VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò, Zhenxin Zhao, Yaqi Wang
Main category: cs.MA
TL;DR: VLM-CAD is a collaborative agent workflow combining vision language models with neuro-symbolic parsing and explainable Bayesian optimization for robust analog circuit sizing with spatial reasoning and physics-based explainability.
Details
Motivation: Vision Language Models (VLMs) suffer from spatial blindness and logical hallucinations when interpreting complex engineering content such as analog circuit schematics, requiring more robust multimodal reasoning for specialized technical domains.
Method: Proposes the VLM-CAD workflow with two key components: 1) Image2Net, a neuro-symbolic structural parsing module that converts circuit schematics into topological graphs and structured JSON representations, and 2) ExTuRBO (Explainable Trust Region Bayesian Optimization), an explainable grounding engine that uses agent-generated semantic seeds and Automatic Relevance Determination.
Result: VLM-CAD significantly enhances spatial reasoning accuracy and maintains physics-based explainability on two complex circuit benchmarks, consistently satisfying complex specifications while achieving low power consumption with total runtime under 66 minutes.
Conclusion: VLM-CAD represents a significant step toward robust, explainable multimodal reasoning in specialized technical domains by bridging modality gaps and ensuring reliability for engineering decisions.
Abstract: Vision Language Models (VLMs) have demonstrated remarkable potential in multimodal reasoning, yet they inherently suffer from spatial blindness and logical hallucinations when interpreting densely structured engineering content, such as analog circuit schematics. To address these challenges, we propose a Vision Language Model-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing (VLM-CAD) designed for robust, step-by-step reasoning over multimodal evidence. VLM-CAD bridges the modality gap by integrating a neuro-symbolic structural parsing module, Image2Net, which transforms raw pixels into explicit topological graphs and structured JSON representations to anchor VLM interpretation in deterministic facts. To ensure the reliability required for engineering decisions, we further propose ExTuRBO, an Explainable Trust Region Bayesian Optimization method. ExTuRBO serves as an explainable grounding engine, employing agent-generated semantic seeds to warm-start local searches and utilizing Automatic Relevance Determination to provide quantified evidence for the VLM’s decisions. Experimental results on two complex circuit benchmarks demonstrate that VLM-CAD significantly enhances spatial reasoning accuracy and maintains physics-based explainability. VLM-CAD consistently satisfies complex specification requirements while achieving low power consumption, with a total runtime under 66 minutes, marking a significant step toward robust, explainable multimodal reasoning in specialized technical domains.
[615] Evidence-Decision-Feedback: Theory-Driven Adaptive Scaffolding for LLM Agents
Clayton Cohn, Siyuan Guo, Surya Rayala, Hanchen David Wang, Naveeduddin Mohammed, Umesh Timalsina, Shruti Jain, Angela Eeds, Menton Deweese, Pamela J. Osborn Popp, Rebekah Stanton, Shakeera Walker, Meiyi Ma, Gautam Biswas
Main category: cs.MA
TL;DR: EDF framework enables adaptive scaffolding in multi-agent LLM systems for personalized STEM+C tutoring through evidence-based inference, pedagogical decision-making, and adaptive feedback.
Details
Motivation: Current multi-agent LLM architectures for pedagogical agents operate on a "one-size-fits-all" basis, limiting personalized support for students in developing domain knowledge and critical-thinking skills.
Method: Introduces Evidence-Decision-Feedback (EDF) framework integrating intelligent tutoring systems and agentic behavior, organized around evidentiary inference, pedagogical decision-making, and adaptive feedback. Instantiated through Copa, an agentic collaborative peer agent for STEM+C problem-solving.
Result: In authentic high school classroom study, EDF-guided interactions: align feedback with students’ demonstrated understanding and task mastery; promote gradual scaffold fading; and support interpretable, evidence-grounded explanations without fostering overreliance.
Conclusion: EDF framework successfully enables adaptive scaffolding in multi-agent LLM systems for personalized STEM+C education, addressing limitations of current “one-size-fits-all” approaches.
Abstract: Multi-agent LLM architectures offer opportunities for pedagogical agents to help students construct domain knowledge and develop critical-thinking skills, yet many operate on a “one-size-fits-all” basis, limiting their ability to provide personalized support. To address this, we introduce Evidence-Decision-Feedback (EDF), a theoretical framework for adaptive scaffolding using LLMs. EDF integrates elements of intelligent tutoring systems and agentic behavior by organizing interactions around evidentiary inference, pedagogical decision-making, and adaptive feedback. We instantiate EDF through Copa, an agentic collaborative peer agent for STEM+C problem-solving. In an authentic high school classroom study, we show that EDF-guided interactions align feedback with students’ demonstrated understanding and task mastery; promote gradual scaffold fading; and support interpretable, evidence-grounded explanations without fostering overreliance.
cs.MM
[616] Short-Form Video Viewing Behavior Analysis and Multi-Step Viewing Time Prediction
Vu Thi Hai Yen, Duc V. Nguyen, Cao Anh Minh Huy, Truong Thu Huong
Main category: cs.MM
TL;DR: Measurement study of user viewing behavior in short-form video streaming to optimize chunk-based preloading and reduce data wastage from early skips.
Details
Motivation: Current short-video platforms use simple streaming approaches that preload entire videos, causing significant data wastage when users skip early. Chunk-based preloading can reduce this wastage, but optimizing it requires understanding user viewing behavior.
Method: Conducted a measurement study to construct a user behavior dataset containing viewing times for 100 short videos across various categories. Evaluated standard time-series forecasting algorithms (Auto-ARIMA, AR, LR, SVR, DTR) for predicting user viewing time.
Result: Auto-ARIMA generally achieves the lowest and most stable forecasting errors across most experimental settings. Other methods (AR, LR, SVR, DTR) tend to produce higher errors and exhibit lower stability. The dataset is publicly available.
Conclusion: Understanding user viewing behavior is crucial for optimizing chunk-based preloading in short-form video streaming. Auto-ARIMA shows promise for accurate viewing time prediction, which can help reduce data wastage from early user skips.
Abstract: Short-form videos have become one of the most popular user-generated content formats nowadays. Popular short-video platforms use a simple streaming approach that preloads one or more videos in the recommendation list in advance. However, this approach results in significant data wastage, as a large portion of the downloaded video data is not used due to the user’s early skip behavior. To address this problem, the chunk-based preloading approach has been proposed, where videos are divided into chunks, and preloading is performed in a chunk-based manner to reduce data wastage. To optimize chunk-based preloading, it is important to understand the user’s viewing behavior in short-form video streaming. In this paper, we conduct a measurement study to construct a user behavior dataset that contains users’ viewing times of one hundred short videos of various categories. Using the dataset, we evaluate the performance of standard time-series forecasting algorithms for predicting user viewing time in short-form video streaming. Our evaluation results show that Auto-ARIMA generally achieves the lowest and most stable forecasting errors across most experimental settings. The remaining methods, including AR, LR, SVR, and DTR, tend to produce higher errors and exhibit lower stability in many cases. The dataset is made publicly available at https://nvduc.github.io/shortvideodataset.
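As a rough illustration of the forecasting setup evaluated here, the sketch below fits a plain AR(p) model to a hypothetical sequence of per-video viewing times by ordinary least squares and produces a one-step-ahead prediction; the paper's stronger Auto-ARIMA baseline would add automatic order selection and differencing on top of this, and the viewing-time values are invented for the example.

```python
import numpy as np

def ar_forecast(series, order=2):
    """One-step-ahead forecast from an AR(p) model fit by ordinary least squares.
    Minimal stand-in for the AR baseline; the paper's Auto-ARIMA would also
    select the order and apply differencing automatically."""
    y = np.asarray(series, dtype=float)
    # Design matrix: each row holds the `order` most recent lags (newest first)
    # plus an intercept column.
    rows = [np.r_[y[t - order:t][::-1], 1.0] for t in range(order, len(y))]
    X, target = np.array(rows), y[order:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    # Forecast the next value from the most recent `order` observations.
    return float(np.r_[y[-order:][::-1], 1.0] @ coef)

# Hypothetical per-video viewing times (seconds) for one user session.
viewing = [12.0, 9.5, 11.0, 8.0, 10.5, 7.5, 9.0]
print(round(ar_forecast(viewing, order=2), 2))
```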
[617] A Video Steganography for H.265/HEVC Based on Multiple CU Size and Block Structure Distortion
Xiang Zhang, Wen Jiang, Fei Peng, Wenbin Huang, Ziqiang Li, Zhangjie Fu
Main category: cs.MM
TL;DR: Video steganography algorithm using multiple CU sizes and block structure distortion to hide secret information in I-frames with improved anti-steganalysis performance.
Details
Motivation: Existing video steganography algorithms based on block structure modification suffer from poor anti-steganalysis due to significant disruption of original CU block structure after embedding secret information.
Method: Proposes three innovations: 1) CU Block Structure Stability Metric (CBSSM) to analyze anti-steganalysis limitations, 2) novel mapping rule based on multiple CU sizes to reduce block structure changes, 3) three-level distortion function based on block structure to guide embedding.
Result: Algorithm shows superior anti-steganalysis compared to state-of-the-art methods, with improved visual quality, bitrate increase ratio, and embedding capacity. CBSSM proves effective for evaluating anti-steganalysis even at low embedding rates.
Conclusion: The proposed triple-strategy approach minimizes disruption to original CU block structure while concealing information in areas where block structure changes occur after recompression, significantly enhancing anti-steganalysis performance.
Abstract: Video steganography based on block structure, which embeds secret information by modifying the Coding Unit (CU) block structure of I-frames, is currently a research hotspot. However, existing algorithms still suffer from poor anti-steganalysis performance, which results from significantly disrupting the original CU block structure after embedding secret information. To overcome this limitation, this paper proposes a video steganography algorithm based on multiple CU sizes and block structure distortion. Our algorithm introduces three key innovations: 1) a CU Block Structure Stability Metric (CBSSM), based on the CU block structure restoration phenomenon, that reveals why current algorithms have insufficient anti-steganalysis performance; 2) a novel mapping rule based on multiple CU sizes that reduces block structure changes and enhances embedding capacity; and 3) a three-level distortion function based on block structure that better guides secret information embedding. This triple strategy ensures that embedding minimizes disruption to the original CU block structure while concealing information primarily in areas where block structure changes occur after recompression, ultimately enhancing the algorithm's anti-steganalysis performance. Comprehensive experimental results highlight the crucial role of the proposed CBSSM in evaluating anti-steganalysis performance even at a low embedding rate. Meanwhile, compared to state-of-the-art video steganography algorithms based on block structure, our algorithm exhibits stronger anti-steganalysis performance, as well as improved visual quality, bitrate increase ratio, and embedding capacity.
[618] Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints
Minsak Nanang, Adrian Hilton, Armin Mustafa
Main category: cs.MM
TL;DR: Automated multimodal pipeline for generating catalogue-style metadata for museum audiovisual content using video-language models and similarity matching to existing collections
Details
Motivation: Museums have growing audiovisual archives that lack searchable metadata, requiring extensive manual effort for cataloguing. Existing methods are labor-intensive, limiting discoverability of these cultural resources.
Method: Multi-pass pipeline using open, locally deployable video language model: (1) summarizes artworks in video, (2) generates catalogue-style descriptions and genre labels, (3) attributes title and artist via conservative similarity matching to structured catalogue database.
Result: Early deployments on painting catalogues show improved AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulations. Framework offers transferable template for application-driven ML in high-stakes domains.
Conclusion: The proposed automated pipeline addresses the metadata gap in museum AV archives, enabling better discoverability while maintaining practical deployment considerations like local deployment and data sovereignty.
Abstract: Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing methods for archiving require extensive manual effort. We address this by automating the most labour-intensive part of the workflow: catalogue-style metadata curation for in-gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video language model. We design a multi-pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue-style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulations, offering a transferable template for application-driven machine learning in other high-stakes domains.
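The "conservative similarity matching" in the pipeline's final pass can be sketched as follows: attribute a title only when the best catalogue match clears a confidence threshold, and abstain otherwise. The embedding vectors, titles, and the 0.85 threshold are all invented for the example; the paper does not specify these details.

```python
import numpy as np

def attribute(query_vec, catalogue_vecs, titles, threshold=0.85):
    """Conservative attribution: return the best-matching catalogue title only
    if its cosine similarity clears the threshold, otherwise abstain (None).
    Threshold and vectors are illustrative, not values from the paper."""
    q = query_vec / np.linalg.norm(query_vec)
    C = catalogue_vecs / np.linalg.norm(catalogue_vecs, axis=1, keepdims=True)
    sims = C @ q                      # cosine similarity to every catalogue entry
    best = int(np.argmax(sims))
    return titles[best] if sims[best] >= threshold else None  # abstain when unsure

# Toy 2-D embeddings for two catalogue entries (hypothetical).
catalogue = np.array([[1.0, 0.0], [0.0, 1.0]])
titles = ["Starry Night", "Water Lilies"]
print(attribute(np.array([0.9, 0.1]), catalogue, titles))  # confident match
print(attribute(np.array([0.7, 0.7]), catalogue, titles))  # ambiguous -> abstain
```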
eess.AS
[619] MSP-Conversation: A Corpus for Naturalistic, Time-Continuous Emotion Recognition
Luz Martinez-Lucas, Pravin Mote, Abinay Reddy Naini, Mohammed Abdelwahab, Carlos Busso
Main category: eess.AS
TL;DR: Introduces MSP-Conversation corpus: 70+ hours of conversational audio with time-continuous emotional annotations (valence, arousal, dominance) and speaker diarization for dynamic speech emotion recognition research.
Details
Motivation: Deep learning approaches for speech emotion recognition require large-scale, naturalistic emotional corpora with dynamic annotations to capture the context-dependent nature of emotional expression in conversations.
Method: Created MSP-Conversation corpus by collecting publicly available podcast audio, performing speaker diarization, and applying time-continuous annotations for valence, arousal, and dominance. Overlaps with MSP-Podcast corpus for comparison between in-context vs out-of-context annotations.
Result: Developed a 70+ hour conversational audio dataset with detailed emotional annotations and speaker diarization, establishing baseline SER experiments and demonstrating the corpus’s value for dynamic emotion recognition research.
Conclusion: MSP-Conversation corpus provides a valuable resource for advancing research in dynamic speech emotion recognition in naturalistic settings, enabling study of emotional expression dynamics in conversational contexts.
Abstract: Affective computing aims to understand and model human emotions for computational systems. Within this field, speech emotion recognition (SER) focuses on predicting emotions conveyed through speech. While early SER systems relied on limited datasets and traditional machine learning models, recent deep learning approaches demand large-scale, naturalistic emotional corpora. To address this need, we introduce the MSP-Conversation corpus: a dataset of more than 70 hours of conversational audio with time-continuous emotional annotations and detailed speaker diarization. The time-continuous annotations capture the dynamic and context-dependent nature of emotional expression. The annotations in the corpus include fine-grained temporal traces of valence, arousal, and dominance. The audio data is sourced from publicly available podcasts and overlaps with a subset of the isolated speaking turns in the MSP-Podcast corpus to facilitate direct comparisons between annotation methods (i.e., in-context versus out-of-context annotations). The paper outlines the development of the corpus, annotation methodology, analyses of the annotations, and baseline SER experiments, establishing the MSP-Conversation corpus as a valuable resource for advancing research in dynamic SER in naturalistic settings.
[620] Modelling Emotions is an Elusive Pursuit in Affective Computing
Anders Rolighed Larsen, Sneha Das, Line Clemmensen
Main category: eess.AS
TL;DR: Paper argues for replacing categorical emotion labels with continuous dimensional definitions in affective computing to capture emotional nuance, reduce uncertainty, and expand applications.
Details
Motivation: Current affective computing systems have high uncertainty and limited applications due to categorical definitions of emotions that obscure emotional nuance and complexity.
Method: The paper presents a conceptual argument rather than empirical methods, advocating for a paradigm shift from categorical to continuous dimensional emotion modeling in affective computing.
Result: Theoretical framework proposing that continuous dimensional emotion representations (like valence-arousal dimensions) would better capture emotional complexity and reduce uncertainty in affective computing systems.
Conclusion: Continuous dimensional definitions of emotions are necessary to advance affective computing, increase application usefulness, and lower uncertainties in emotion-aware AI systems.
Abstract: Affective computing - combining sensor technology, machine learning, and psychology - has been studied for over three decades and is employed in AI-powered technologies to enhance emotional awareness in AI systems and detect symptoms of mental health disorders such as anxiety and depression. However, the uncertainty in such systems remains high, and the application areas are limited by categorical definitions of emotions and emotional concepts. This paper argues that categorical emotion labels obscure emotional nuance in affective computing, and therefore continuous dimensional definitions are needed to advance the field, increase application usefulness, and lower uncertainties.
[621] Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition
Saurabh Kataria, Xiao Hu
Main category: eess.AS
TL;DR: ZS-Fuse is a late-fusion method that combines zero-shot emotion estimates from audio-language models with specialist foundation models for speech emotion recognition, achieving state-of-the-art performance through prompt ensembles and amplification techniques.
Details
Motivation: While audio-language models show promise for zero-shot speech emotion recognition, specialist foundation models still outperform them on closed-ended tasks. The potential of combining ALMs with specialists to achieve better performance remains unexplored, especially given ALMs' sensitivity to prompt choices and emotion ambiguity.
Method: Proposes ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from dual-encoder ALMs with specialist foundation models. Uses two techniques: 1) simple prompt ensemble to handle prompt sensitivity, and 2) novel prompt amplification that repeats audio and text queries to discover stronger zero-shot capabilities.
Result: ZS-Fuse demonstrates efficacy by evaluating with three dual-encoder ALMs and two FMs, reporting improvements over state-of-the-art baselines like WavLM-Large on three speech emotion recognition datasets.
Conclusion: The proposed fusion method successfully combines the strengths of general-purpose audio-language models with specialist foundation models for speech emotion recognition, overcoming challenges of emotion ambiguity and prompt sensitivity through innovative techniques.
Abstract: Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for Zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, 1) we use a simple prompt ensemble and 2) suggest a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.
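The late-fusion rule described above can be sketched in a few lines: average the zero-shot class probabilities over a prompt ensemble, then mix them with the specialist model's posterior. The emotion labels, similarity scores, specialist probabilities, and the fusion weight `alpha` are all invented for the example; the paper does not publish its exact fusion formula here.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse(prompt_scores, specialist_probs, alpha=0.5):
    """Late fusion in the spirit of ZS-Fuse: average zero-shot estimates over a
    prompt ensemble, then mix with the specialist posterior. `alpha` is a
    hypothetical fusion weight, not a value from the paper."""
    # Prompt ensemble: one audio-text similarity vector per prompt phrasing.
    zs_probs = np.mean([softmax(s) for s in prompt_scores], axis=0)
    return alpha * zs_probs + (1 - alpha) * specialist_probs

# Illustrative similarity scores for two prompt phrasings of the same query.
prompts = [np.array([0.1, 2.0, 0.5, 0.2]),   # e.g. "this person sounds <emotion>"
           np.array([0.3, 1.6, 0.8, 0.1])]   # e.g. "a recording of <emotion> speech"
specialist = np.array([0.05, 0.60, 0.30, 0.05])  # hypothetical FM posterior
fused = fuse(prompts, specialist)
print(EMOTIONS[int(np.argmax(fused))])  # -> happy
```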
[622] Selective Classifier-free Guidance for Zero-shot Text-to-speech
John Zheng, Farhad Maleki
Main category: eess.AS
TL;DR: CFG strategies from image generation generally fail in speech synthesis, but selective CFG timing and text-representation awareness can improve speaker similarity while maintaining text adherence.
Details
Motivation: Addressing the challenge in zero-shot text-to-speech of balancing speaker fidelity with text adherence, and exploring whether classifier-free guidance strategies successful in image generation can be adapted to speech synthesis.
Method: Evaluated adaptability of CFG strategies from image generation to speech synthesis, extended separated-condition CFG approaches, and experimented with timing strategies (standard CFG early, selective CFG later) and text-representation analysis across English and Mandarin.
Result: CFG strategies effective in image generation generally fail to improve speech synthesis. However, applying standard CFG during early timesteps and switching to selective CFG in later timesteps can improve speaker similarity while limiting text adherence degradation. Effectiveness of selective CFG is highly text-representation dependent, with different results between English and Mandarin.
Conclusion: CFG strategies don’t directly transfer from image to speech generation, but careful timing strategies and text-representation awareness can enable better trade-offs between speaker similarity and text adherence in zero-shot TTS.
Abstract: In zero-shot text-to-speech, achieving a balance between fidelity to the target speaker and adherence to text content remains a challenge. While classifier-free guidance (CFG) strategies have shown promising results in image generation, their application to speech synthesis is underexplored. Separating the conditions used for CFG enables trade-offs between different desired characteristics in speech synthesis. In this paper, we evaluate the adaptability of CFG strategies originally developed for image generation to speech synthesis and extend separated-condition CFG approaches for this domain. Our results show that CFG strategies effective in image generation generally fail to improve speech synthesis. We also find that we can improve speaker similarity while limiting degradation of text adherence by applying standard CFG during early timesteps and switching to selective CFG only in later timesteps. Surprisingly, we observe that the effectiveness of a selective CFG strategy is highly text-representation dependent, as differences between English and Mandarin can lead to different results even with the same model.
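The timestep-switching finding above can be written as a small guidance schedule: standard CFG against the fully conditioned prediction in early denoising steps, then a selective variant that amplifies only the speaker term later. The tensor names, the guidance weight, and the 0.5 switch point are illustrative assumptions, not values from the paper.

```python
import numpy as np

def guided_eps(eps_uncond, eps_text, eps_full, t, total_steps, w=2.0, switch=0.5):
    """Timestep-scheduled classifier-free guidance, per the paper's finding:
    standard CFG for early steps, then a selective variant that guides only on
    the speaker condition. `w` and `switch` are hypothetical settings."""
    if t < switch * total_steps:
        # Early steps: standard CFG against the fully conditioned prediction.
        return eps_uncond + w * (eps_full - eps_uncond)
    # Later steps: selective CFG -- amplify only the speaker term by contrasting
    # the full condition against the text-only condition.
    return eps_full + w * (eps_full - eps_text)

rng = np.random.default_rng(0)
e_u, e_t, e_f = (rng.standard_normal(4) for _ in range(3))  # toy noise predictions
early = guided_eps(e_u, e_t, e_f, t=10, total_steps=100)
late = guided_eps(e_u, e_t, e_f, t=90, total_steps=100)
print(early.shape, late.shape)
```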
[623] ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval
Siyuan Fu, Xuchen Guo, Mingjun Liu, Hongxiang Li, Boyin Tan, Gongxi Zhu, Xianwei Zhuang, Jinghan Ru, Yuxin Xie, Yuguo Yin
Main category: eess.AS
TL;DR: ASK framework addresses Audio-Text Retrieval limitations by breaking the Gradient Locality Bottleneck with multi-grained knowledge injection and mitigating Representation-Drift Mismatch through dynamic knowledge refinement.
Details
Motivation: Current dual-encoder ATR systems suffer from Gradient Locality Bottleneck (limited to in-batch samples) which prevents resolving acoustic ambiguities and learning rare concepts, and Representation-Drift Mismatch where static knowledge bases become misaligned with evolving encoders.
Method: Proposes Adaptive Self-improving Knowledge (ASK) framework with: 1) Multi-grained knowledge injection to break GLB, 2) Dynamic refinement strategy to synchronize knowledge base with model evolution and mitigate RDM, 3) Adaptive reliability weighting to filter retrieval noise based on cross-modal consistency.
Result: Extensive experiments across multiple benchmarks show ASK consistently achieves new state-of-the-art performance across various backbones.
Conclusion: ASK effectively addresses fundamental limitations in ATR by breaking the Gradient Locality Bottleneck and mitigating Representation-Drift Mismatch through adaptive knowledge refinement.
Abstract: The dominant paradigm for Audio-Text Retrieval (ATR) relies on dual-encoder architectures optimized via mini-batch contrastive learning. However, restricting optimization to local in-batch samples creates a fundamental limitation we term the Gradient Locality Bottleneck (GLB), which prevents the resolution of acoustic ambiguities and hinders the learning of rare long-tail concepts. While external knowledge injection can break this bottleneck, it often triggers a problem called Representation-Drift Mismatch (RDM), where a static knowledge base becomes misaligned with evolving encoders, degrading guidance into noise. To address these intertwined challenges, we propose the Adaptive Self-improving Knowledge (ASK) framework. ASK breaks the GLB via multi-grained knowledge injection and mitigates RDM through a dynamic refinement strategy that synchronizes the knowledge base with the model. Additionally, an adaptive reliability weighting scheme is employed to filter retrieval noise based on cross-modal consistency. Extensive experiments across multiple benchmarks demonstrate that ASK consistently achieves new state-of-the-art performance across various backbones.
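One way to picture the "adaptive reliability weighting" above is a consistency-based softmax: retrieved knowledge items whose audio-side and text-side similarities to the query disagree get down-weighted. The exact scheme in the paper may differ; the similarity values and temperature below are invented for the example.

```python
import numpy as np

def reliability_weights(sim_audio, sim_text, tau=0.2):
    """Cross-modal consistency weighting in the spirit of ASK: a large gap
    between audio-side and text-side similarity marks an unreliable retrieval.
    `tau` and the weighting form are illustrative assumptions."""
    sim_audio, sim_text = np.asarray(sim_audio), np.asarray(sim_text)
    consistency = -np.abs(sim_audio - sim_text)  # large gap -> low reliability
    w = np.exp(consistency / tau)
    return w / w.sum()                           # normalized item weights

# Three hypothetical retrieved items; the second is cross-modally inconsistent.
w = reliability_weights([0.8, 0.9, 0.6], [0.78, 0.2, 0.55])
print(int(np.argmin(w)))  # -> 1 (the inconsistent item gets the least weight)
```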
[624] WiRD-Gest: Gesture Recognition In The Real World Using Range-Doppler Wi-Fi Sensing on COTS Hardware
Jessica Sanson, Rahul C. Shah, Yazhou Zhu, Rafael Rosales, Valerio Frascolla
Main category: eess.AS
TL;DR: WiRD-Gest is a Wi-Fi sensing system using a single commercial laptop transceiver for gesture recognition via monostatic full duplex sensing to extract Range-Doppler information, achieving robust performance in challenging real-world environments.
Details
Motivation: Wi-Fi sensing for gesture recognition faces practical deployment challenges due to environmental sensitivity and device placement limitations. Current approaches struggle in real-world scenarios with dynamic interference and multiple moving targets.
Method: Proposes WiRD-Gest system using a single unmodified Wi-Fi transceiver on COTS laptop with monostatic full duplex sensing pipeline to extract Range-Doppler (RD) information. Creates first benchmark of deep learning models for gesture recognition based on monostatic sensing.
Result: Demonstrates excellent performance in crowded, unseen public spaces with dynamic interference and additional moving targets, even when trained only on controlled environment data. Shows minor degradation in scenarios where prior approaches often fail.
Conclusion: Monostatic sensing with spatial (range) information fundamentally improves accuracy, robustness and generalization for Wi-Fi gesture recognition. The system enables practical deployment in real-world environments and will release benchmark/dataset as open source.
Abstract: Wi-Fi sensing has emerged as a promising technique for gesture recognition, yet its practical deployment is hindered by environmental sensitivity and device placement challenges. To overcome these limitations we propose Wi-Fi Range and Doppler (WiRD)-Gest, a novel system that performs gesture recognition using a single, unmodified Wi-Fi transceiver on a commercial off-the-shelf (COTS) laptop. The system leverages a monostatic full duplex sensing pipeline capable of extracting Range-Doppler (RD) information. Utilizing this, we present the first benchmark of deep learning models for gesture recognition based on monostatic sensing. The key innovation lies in how monostatic sensing and spatial (range) information fundamentally transform accuracy, robustness, and generalization compared to prior approaches. We demonstrate excellent performance in crowded, unseen public spaces with dynamic interference and additional moving targets, even when trained only on data from controlled environments. These are scenarios where prior Wi-Fi sensing approaches often fail; our system, by contrast, suffers only minor degradation. The WiRD-Gest benchmark and dataset will also be released as open source.
eess.IV
[625] Multimodal Fusion of Skeleton Dynamics and Clinical Gait Features for Video-Based Cerebral Palsy Severity Assessment
Kaiyuan Yang, Xupeng Chen, Jiangpeng He
Main category: eess.IV
TL;DR: Multimodal fusion framework combining skeleton dynamics and clinically meaningful gait features for video-based cerebral palsy severity assessment in children, achieving improved classification performance and interpretability.
Details
Motivation: Existing video-based gait analysis methods for cerebral palsy assessment rely on either pose sequences or handcrafted gait features alone, failing to capture both spatiotemporal motion patterns and clinically meaningful biomechanical information simultaneously.
Method: Proposes a multimodal fusion framework integrating skeleton dynamics with contribution-guided clinically meaningful gait features. Uses Grad-CAM on pre-trained ST-GCN to identify discriminative body keypoints, then builds dual-stream architecture: one stream models skeleton dynamics using ST-GCN, another encodes gait features from identified keypoints. Fuses streams through feature cross-attention.
Result: Achieved 70.86% accuracy for four-level CP motor severity classification, outperforming baseline by 5.6 percentage points. Integration improved both prediction performance and biomechanical interpretability.
Conclusion: Integrating skeleton dynamics with clinically meaningful gait descriptors improves both prediction performance and biomechanical interpretability for video-based CP severity assessment, suggesting the value of multimodal fusion approaches in medical video analysis.
Abstract: Video-based gait analysis has become a promising approach for assessing motor impairment in children with cerebral palsy (CP). However, existing methods usually rely on either pose sequences or handcrafted gait features alone, making it difficult to simultaneously capture spatiotemporal motion patterns and clinically meaningful biomechanical information. To address this gap, we propose a multimodal fusion framework that integrates skeleton dynamics with contribution-guided clinically meaningful gait features. First, Grad-CAM analysis on a pre-trained ST-GCN backbone identified the most discriminative body keypoints, providing an interpretable basis for subsequent gait feature extraction. We then build a dual-stream architecture, with one stream modeling skeleton dynamics using ST-GCN and the other encoding gait features derived from the identified keypoints. Fusing the two streams through feature cross-attention improved four-level CP motor severity classification accuracy to 70.86%, outperforming the baseline by 5.6 percentage points. Overall, this work suggests that integrating skeleton dynamics with clinically meaningful gait descriptors can improve both prediction performance and biomechanical interpretability for video-based CP severity assessment.
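The cross-attention fusion step in the dual-stream architecture can be sketched as below: skeleton-stream tokens act as queries over the clinical gait features, and the attended summary is concatenated back onto the skeleton stream. The dimensions, the single-head choice, and the concatenation are illustrative assumptions; the paper's actual fusion block is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d=None):
    """Single-head cross-attention: each skeleton token attends over the set
    of gait-feature vectors (shared key/value for brevity)."""
    d = d or queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))  # token-to-feature weights
    return attn @ keys_values                              # gait summary per token

rng = np.random.default_rng(0)
skeleton_tokens = rng.standard_normal((6, 16))   # e.g. ST-GCN temporal tokens
gait_features = rng.standard_normal((4, 16))     # e.g. cadence, step length, ...
fused = np.concatenate(
    [skeleton_tokens, cross_attention(skeleton_tokens, gait_features)], axis=-1)
print(fused.shape)  # each token now carries its attended gait context
```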
[626] Abnormalities and Disease Detection in Gastro-Intestinal Tract Images
Zeshan Khan, Muhammad Atif Tahir
Main category: eess.IV
TL;DR: PhD thesis on gastrointestinal image analysis using texture features and deep learning for classification/segmentation, achieving high accuracy and real-time performance.
Details
Motivation: Address challenges in GI tract image analysis for medical diagnosis, particularly the need for accurate classification and segmentation in real-time applications despite diverse abnormalities and computational constraints.
Method: Multifaceted approach: 1) Texture-based feature extraction with high-speed processing; 2) Deep learning with optimized models and data bagging; 3) Streamlined neural network combining texture and local binary patterns with learned threshold; 4) Two segmentation tools using Depth-Wise Separable Convolution and neural network ensembles.
Result: Texture methods: 4000+ FPS, F1=0.76, Accuracy=0.98 on Kvasir V2. Deep learning: Accuracy=0.92, F1=0.60 on HyperKvasir, F1=0.88 on Kvasir V2. Streamlined network: 41 FPS, Accuracy=0.99, F1=0.91 on HyperKvasir. Segmentation tools effective for low-FPS scenarios.
Conclusion: Research provides comprehensive framework progressing from traditional texture techniques to deep learning and ensemble approaches, offering novel adaptable methodologies for advancing GI image analysis with real-time capabilities.
Abstract: Gastrointestinal (GI) tract image analysis plays a crucial role in medical diagnosis. This research addresses the challenge of accurately classifying and segmenting GI images for real-time applications, where traditional methods often struggle due to the diversity and complexity of abnormalities. The high computational demands of this domain require efficient and adaptable solutions. This PhD thesis presents a multifaceted approach to GI image analysis. Initially, texture-based feature extraction and classification methods were explored, achieving high processing speed (over 4000 FPS) and strong performance (F1-score: 0.76, Accuracy: 0.98) on the Kvasir V2 dataset. The study then transitions to deep learning, where an optimized model combined with data bagging techniques improved performance, reaching an accuracy of 0.92 and an F1-score of 0.60 on the HyperKvasir dataset, and an F1-score of 0.88 on Kvasir V2. To support real-time detection, a streamlined neural network integrating texture and local binary patterns was developed. By addressing inter-class similarity and intra-class variation through a learned threshold, the system achieved 41 FPS with high accuracy (0.99) and an F1-score of 0.91 on HyperKvasir. Additionally, two segmentation tools are proposed to enhance usability, leveraging Depth-Wise Separable Convolution and neural network ensembles for improved detection, particularly in low-FPS scenarios. Overall, this research introduces novel and adaptable methodologies, progressing from traditional texture-based techniques to deep learning and ensemble approaches, providing a comprehensive framework for advancing GI image analysis.
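For a concrete sense of the local-binary-pattern features the streamlined network integrates, here is the textbook 8-neighbour LBP: each pixel is encoded by thresholding its ring of neighbours against the centre. This is a minimal generic variant; the thesis's exact LBP configuration (radius, sampling, uniformity mapping) is not specified here.

```python
import numpy as np

def lbp_3x3(img):
    """Basic 8-neighbour local binary pattern over a grayscale image: each
    interior pixel becomes an 8-bit code, one bit per neighbour that is
    >= the centre value (clockwise from the top-left neighbour)."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Clockwise neighbour offsets starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            center, code = img[i, j], 0
            for bit, (di, dj) in enumerate(offsets):
                if img[i + di, j + dj] >= center:
                    code |= 1 << bit
            out[i - 1, j - 1] = code
    return out

patch = np.array([[10, 20, 30],
                  [40, 25, 15],
                  [ 5, 50, 35]])
print(int(lbp_3x3(patch)[0, 0]))  # -> 180
```

A histogram of these codes over image regions is the texture descriptor typically fed to a classifier.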
[627] Single-Subject Multi-View MRI Super-Resolution via Implicit Neural Representations
Heejong Kim, Abhishek Thanki, Roel van Herten, Daniel Margolis, Mert R Sabuncu
Main category: eess.IV
TL;DR: SIMS-MRI is a self-supervised framework for multi-view MRI super-resolution that combines implicit neural representations with learned inter-view alignment to generate isotropic reconstructions from anisotropic scans without requiring pre-alignment or large training datasets.
Details
Motivation: Clinical MRI often acquires anisotropic volumes with different orientations to reduce scan time, but conventional integration methods degrade fine details. Existing deep learning approaches require large datasets and assume pre-aligned views, which is unrealistic in clinical practice.
Method: Uses multi-resolution hash-encoded implicit representation with learned inter-view alignment to generate spatially consistent isotropic reconstructions from anisotropic multi-view scans of a single patient, operating without pre- or post-processing.
Result: Validated on simulated brain and clinical prostate MRI datasets, demonstrating effective isotropic reconstruction from anisotropic multi-view scans without requiring external training data or pre-alignment.
Conclusion: SIMS-MRI provides a practical self-supervised solution for clinical MRI super-resolution that works on single-subject data without alignment assumptions, making it suitable for real-world clinical applications.
Abstract: Clinical MRI frequently acquires anisotropic volumes with high in-plane resolution and low through-plane resolution to reduce acquisition time. Multiple orientations are therefore acquired to provide complementary anatomical information. Conventional integration of these views relies on registration followed by interpolation, which can degrade fine structural details. Recent deep learning-based super-resolution (SR) approaches have demonstrated strong performance in enhancing single-view images. However, their clinical reliability is often limited by the need for large-scale training datasets, resulting in increased dependence on cohort-level priors. Self-supervised strategies offer an alternative by learning directly from the target scans. Prior work either neglects the existence of multi-view information or assumes that in-plane information can supervise through-plane reconstruction under the assumption of pre-alignment between images. However, this assumption is rarely satisfied in clinical settings. In this work, we introduce Single-Subject Implicit Multi-View Super-Resolution for MRI (SIMS-MRI), a framework that operates solely on anisotropic multi-view scans from a single patient without requiring pre- or post-processing. Our method combines a multi-resolution hash-encoded implicit representation with learned inter-view alignment to generate a spatially consistent isotropic reconstruction. We validate the SIMS-MRI pipeline on both simulated brain and clinical prostate MRI datasets. Code will be made publicly available for reproducibility: https://github.com/abhshkt/SIMS-MRI
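As a toy analogue of the single-subject idea, the sketch below fits one continuous function to samples from two anisotropic "views" of a 2D scene, using random Fourier features plus least squares in place of the paper's hash-encoded implicit network, and assuming the views are already aligned (the paper learns this alignment). All names and the toy scene are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth 2D "anatomy": a smooth function observed only on two
# anisotropic grids (each dense along one axis, sparse along the other).
f = lambda x, y: np.sin(3 * x) * np.cos(2 * y)

# View 1: dense in x, sparse in y. View 2: the opposite.
x1, y1 = np.meshgrid(np.linspace(0, 1, 40), np.linspace(0, 1, 5))
x2, y2 = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 40))
coords = np.column_stack([np.r_[x1.ravel(), x2.ravel()],
                          np.r_[y1.ravel(), y2.ravel()]])
values = np.r_[f(x1, y1).ravel(), f(x2, y2).ravel()]

# Implicit representation: random Fourier features + linear weights,
# a crude stand-in for a hash-encoded coordinate MLP.
B = rng.normal(scale=3.0, size=(2, 64))
feats = lambda c: np.c_[np.sin(c @ B), np.cos(c @ B)]
w, *_ = np.linalg.lstsq(feats(coords), values, rcond=None)

# Both views jointly constrain one continuous function, so it can be
# queried off both acquisition grids, i.e. isotropically.
q = np.array([[0.33, 0.57]])
err = abs((feats(q) @ w)[0] - f(0.33, 0.57))
```

The point of the toy is that neither view alone resolves both axes, but one shared continuous representation fitted to both does.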
[628] Viewport-based Neural 360° Image Compression
Jingwei Liao, Bo Chen, Klara Nahrstedt, Zhisheng Yan
Main category: eess.IV
TL;DR: A novel viewport-based neural compression pipeline for 360° images that replaces conventional projection methods with viewport extraction and uses a transformer-based ViewPort ConText module to capture global information across viewports.
Details
Motivation: Conventional 360° image compression projects spherical images onto 2D planes, causing oversampling and distortion. The authors aim to address these issues while maintaining compression efficiency.
Method: Proposes a viewport-based neural compression pipeline that extracts multiple viewports instead of projecting the entire spherical image. Uses a transformer-based ViewPort ConText (VPCT) module integrated with learning-based 2D image compression structures to capture global prior information across viewports.
Result: The pipeline saves an average of 14.01% bit consumption compared to best-performing 360° image compression methods without quality loss. The VPCT-based codec outperforms existing 2D image codecs in the viewport-based pipeline.
Conclusion: The viewport-based approach with VPCT module effectively addresses oversampling and distortion issues in 360° image compression while improving compression efficiency through better global information sharing.
Abstract: Given the popularity of 360° images on social media platforms, 360° image compression becomes a critical technology for media storage and transmission. The conventional 360° image compression pipeline projects the spherical image into a single 2D plane, leading to issues of oversampling and distortion. In this paper, we propose a novel viewport-based neural compression pipeline for 360° images. By replacing the image projection in conventional 360° image compression pipelines with viewport extraction and efficiently compressing multiple viewports, the proposed pipeline minimizes the inherent oversampling and distortion issues. However, viewport extraction impedes information sharing between multiple viewports during compression, causing the loss of global information about the spherical image. To tackle this global information loss, we design a neural viewport codec to capture global prior information across multiple viewports and maximally compress the viewport data. The viewport codec is empowered by a transformer-based ViewPort ConText (VPCT) module that can be integrated with canonical learning-based 2D image compression structures. We compare the proposed pipeline with existing 360° image compression models and conventional 360° image compression pipelines building on learning-based 2D image codecs and standard hand-crafted codecs. Results show that our pipeline saves an average of 14.01% bit consumption compared to the best-performing 360° image compression methods without compromising quality. The proposed VPCT-based codec also outperforms existing 2D image codecs in the viewport-based neural compression pipeline. Our code can be found at: https://github.com/Jingwei-Liao/VPCT.
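The viewport-extraction step can be illustrated with a standard gnomonic projection, sampling a perspective viewport from an equirectangular image. This is a generic nearest-neighbour sketch of that operation, not the authors' implementation:

```python
import numpy as np

def extract_viewport(equi, yaw, pitch, fov, size):
    """Sample a size x size perspective viewport from an equirectangular
    image via gnomonic projection with nearest-neighbour lookup."""
    H, W = equi.shape[:2]
    # Pixel grid of the viewport on the tangent plane (unit focal length).
    half = np.tan(fov / 2)
    u, v = np.meshgrid(np.linspace(-half, half, size),
                       np.linspace(-half, half, size))
    # Unit rays in the camera frame, rotated by pitch (x) then yaw (y).
    d = np.stack([u, -v, np.ones_like(u)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    d = d @ (Ry @ Rx).T
    # Ray direction -> spherical angles -> equirectangular pixel.
    lon = np.arctan2(d[..., 0], d[..., 2])       # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))   # [-pi/2, pi/2]
    px = ((lon / (2 * np.pi) + 0.5) * (W - 1)).round().astype(int)
    py = ((0.5 - lat / np.pi) * (H - 1)).round().astype(int)
    return equi[py, px]
```

Unlike a single global projection, each viewport is locally near-uniformly sampled, which is exactly the oversampling/distortion argument the paper makes.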
[629] L-UNet: An LSTM Network for Remote Sensing Image Change Detection
Shuting Sun, Lin Mu, Lizhe Wang, Peng Liu
Main category: eess.IV
TL;DR: Proposes L-UNet and AL-UNet architectures combining Conv-LSTM with UNet for spatiotemporal change detection in high-resolution remote sensing images
Details
Motivation: Current deep learning-based change detection methods rely on conventional LSTM, which lacks spatial characteristics, while change detection requires both spatial and temporal processing.
Method: Introduces Conv-LSTM, a spatial extension of the LSTM structure, and proposes L-UNet (substituting partial UNet convolution layers with Conv-LSTM) and AL-UNet (adding an Atrous structure to capture multiscale spatial information).
Result: Experiments on two datasets show proposed methods outperform other methods in both quantitative and qualitative evaluation
Conclusion: The proposed end-to-end spatiotemporal networks effectively address the spatial-temporal nature of change detection in remote sensing images
Abstract: Change detection of high-resolution remote sensing images is an important task in earth observation and has been extensively investigated. Recently, deep learning has proven very successful in many remote sensing tasks. Current deep learning-based change detection methods are mainly based on conventional long short-term memory (LSTM), which does not have spatial characteristics. Since change detection is a process with both spatiality and temporality, it is necessary to propose an end-to-end spatiotemporal network. To achieve this, Conv-LSTM, a spatial extension of the LSTM structure, is introduced. Since it shares similar spatial characteristics with the convolutional layer, L-UNet, which substitutes partial convolution layers of UNet with Conv-LSTM, and Atrous L-UNet (AL-UNet), which further uses an Atrous structure to capture multiscale spatial information, are proposed. Experiments on two data sets are conducted, and the proposed methods show advantages in both quantitative and qualitative terms when compared with some other methods.
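The core building block, an LSTM whose gate computations are convolutions rather than dense products, can be sketched in a few lines. This is a generic single-channel Conv-LSTM step with illustrative names and biases omitted for brevity, not the L-UNet implementation:

```python
import numpy as np

def conv2d_same(x, k):
    """'Same' 2D convolution of a single-channel map with a 3x3 kernel."""
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[i:i + H, j:j + W]
    return out

def convlstm_step(x, h, c, kernels):
    """One Conv-LSTM step: the four LSTM gates are computed with
    convolutions, so the hidden state keeps its spatial layout."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    gates = {}
    for g in ("i", "f", "o", "g"):
        kx, kh = kernels[g]  # input-to-state and state-to-state kernels
        gates[g] = conv2d_same(x, kx) + conv2d_same(h, kh)
    c_new = sigmoid(gates["f"]) * c + sigmoid(gates["i"]) * np.tanh(gates["g"])
    h_new = sigmoid(gates["o"]) * np.tanh(c_new)
    return h_new, c_new
```

Because `h` and `c` stay 2D maps, a cell like this can drop into a UNet stage wherever a convolution layer would otherwise sit, which is the substitution L-UNet performs.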
[630] Retrieval-Guided Photovoltaic Inventory Estimation from Satellite Imagery for Distribution Grid Planning
Muhao Guo, Lihao Mai, Erik Blasch, Jafarali Parol, Turki Rakan, Yang Weng
Main category: eess.IV
TL;DR: Solar-RAG: A retrieval-augmented multimodal vision-language framework for robust rooftop PV detection from satellite imagery using similarity-based image retrieval and comparative reasoning against verified examples.
Details
Motivation: The rapid expansion of distributed rooftop PV systems creates uncertainty in grid planning and management. Current computer vision approaches for PV detection from satellite imagery are sensitive to geographic distribution shifts due to varying roof materials, urban morphology, and imaging conditions across regions.
Method: Proposes Solar-RAG (Solar Retrieval-Augmented Generation), a context-grounded framework that integrates similarity-based image retrieval with multimodal vision-language reasoning. Instead of relying solely on internal model parameters, it retrieves visually similar rooftop scenes with verified annotations and performs comparative reasoning against these examples during inference.
Result: The method outperforms both conventional deep vision models and standalone vision-language models. Feeder-level case studies show improved PV inventory estimation reduces errors in voltage deviation analysis and hosting capacity assessment.
Conclusion: Solar-RAG provides a scalable and geographically robust approach for monitoring distributed PV deployment, enabling more reliable integration of remote sensing data into distribution grid planning and distributed energy resource management.
Abstract: The rapid expansion of distributed rooftop photovoltaic (PV) systems introduces increasing uncertainty in distribution grid planning, hosting capacity assessment, and voltage regulation. Reliable estimation of rooftop PV deployment from satellite imagery is therefore essential for accurate modeling of distributed generation at feeder and service-territory scales. However, conventional computer vision approaches rely on fixed learned representations and globally averaged visual correlations. This makes them sensitive to geographic distribution shifts caused by differences in roof materials, urban morphology, and imaging conditions across regions. To address these challenges, this paper proposes Solar Retrieval-Augmented Generation (Solar-RAG), a context-grounded framework for photovoltaic assessment that integrates similarity-based image retrieval with multimodal vision-language reasoning. Instead of producing predictions solely from internal model parameters, the proposed approach retrieves visually similar rooftop scenes with verified annotations and performs comparative reasoning against these examples during inference. This retrieval-guided mechanism provides geographically contextualized references that improve robustness under heterogeneous urban environments without requiring model retraining. The method outperforms both conventional deep vision models and standalone vision-language models. Furthermore, feeder-level case studies show that improved PV inventory estimation reduces errors in voltage deviation analysis and hosting capacity assessment. The results demonstrate that the proposed method provides a scalable and geographically robust approach for monitoring distributed PV deployment. This enables more reliable integration of remote sensing data into distribution grid planning and distributed energy resource management.
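The retrieval step can be sketched with plain cosine-similarity nearest neighbours. In this sketch a simple vote over the retrieved verified labels stands in for the paper's vision-language comparative reasoning, and all function names are illustrative:

```python
import numpy as np

def retrieve_topk(query_emb, gallery_embs, k=3):
    """Indices of the k gallery embeddings most cosine-similar to the
    query, mimicking the similarity-based retrieval step."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(g @ q)[::-1][:k]

def retrieval_guided_label(query_emb, gallery_embs, gallery_labels, k=3):
    """Stand-in for the comparative-reasoning step: vote over the
    verified annotations of the retrieved neighbours."""
    idx = retrieve_topk(query_emb, gallery_embs, k)
    votes = np.asarray(gallery_labels)[idx]
    return int(np.bincount(votes).argmax())
```

The gallery here plays the role of the annotated reference scenes: swapping the gallery for a new region adapts the predictor without retraining, which is the robustness argument the paper makes.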
[631] Rigid Motion Estimation using Accelerated Iterative Coordinate Descent (REACT) for MR Imaging
Kwang Eun Jang, Dwight G. Nishimura
Main category: eess.IV
TL;DR: REACT is an autofocus method for 3D rigid motion correction in MR imaging using coordinate descent optimization of image quality metrics
Details
Motivation: To develop a computationally viable autofocus method for estimating 3D rigid motion in MR imaging, addressing the need for effective motion correction in medical imaging applications such as coronary MR angiography.
Method: REACT assumes a piecewise-constant motion trajectory and estimates the rigid motion parameters of temporal segments by optimizing image-quality metrics. Uses coordinate descent to decompose the high-dimensional optimization into subproblems, each solved with a derivative-free solver to avoid exhaustive grid search.
Result: Numerical simulations showed objective surfaces were approximately locally convex near desired solutions. In vivo study demonstrated REACT yielded higher coronary artery sharpness (u-IEPA) than conventional iNAV-based translational motion estimation for LAD and RCA, and higher than autofocus nonrigid motion correction for LAD
Conclusion: Demonstrates feasibility of coordinate descent for autofocus motion correction in MR imaging, providing a computationally viable approach for rigid motion estimation
Abstract: Purpose: To develop a computationally viable autofocus method for estimating 3D rigid motion in MR imaging. Theory and Methods: The proposed method, REACT, assumes a piecewise-constant motion trajectory and estimates the rigid motion parameters of individual temporal segments by optimizing an image-quality metric. Coordinate descent is adopted to decompose the high-dimensional optimization problem into a series of subproblems, each updating the motion parameters of a single temporal segment. The cost function of each subproblem is assumed to be approximately locally convex under suitable acquisition conditions. Each subproblem is then solved using a derivative-free solver, thereby avoiding an exhaustive grid search. Numerical simulations were conducted to investigate the local convexity assumption. REACT was evaluated for respiratory motion correction on in vivo free-breathing coronary MR angiography datasets acquired using a 3D cones trajectory with image-based navigators (iNAVs). An autofocus nonrigid motion correction method was also evaluated for comparison. Coronary artery sharpness was quantified using unbounded image edge profile acutance (u-IEPA). Results: In numerical simulations, the objective surfaces of the subproblems were approximately locally convex when the current motion estimate was close to the desired solution. In the in vivo study, REACT yielded higher u-IEPA than the conventional iNAV-based translational motion-estimation method for both the left anterior descending artery (LAD) and right coronary artery. REACT also yielded higher u-IEPA for the LAD than the autofocus nonrigid motion correction method. Conclusion: This study demonstrates the feasibility of coordinate descent for autofocus motion correction in MR imaging.
[632] Unsupervised Hyperspectral Image Super-Resolution via Self-Supervised Modality Decoupling
Songcheng Du, Yang Zou, Zixu Wang, Xingyuan Li, Ying Li, Changjing Shang, Qiang Shen
Main category: eess.IV
TL;DR: MossFuse: A self-supervised modality-decoupled framework for hyperspectral-multispectral image fusion that separates shared and complementary information across modalities to improve fusion performance.
Details
Motivation: Current hyperspectral-multispectral image fusion methods directly fuse modalities without effective supervision, leading to incomplete perception of complementary information and limited understanding of inter-modality correlations.
Method: Proposes an end-to-end self-supervised Modality-Decoupled Spatial-Spectral Fusion (MossFuse) framework that decouples shared and complementary information across modalities, aggregates concise representations to reduce redundancy, and uses a subspace clustering loss to guide feature decoupling.
Result: Systematic experiments over multiple datasets show the approach consistently outperforms existing HMIF methods while requiring fewer parameters with reduced inference time.
Conclusion: Modality decoupling is key to improving fusion performance, and the proposed simple yet effective self-supervised framework successfully addresses limitations of current methods.
Abstract: Fusion-based hyperspectral image super-resolution aims to fuse low-resolution hyperspectral images (LR-HSIs) and high-resolution multispectral images (HR-MSIs) to reconstruct high spatial and high spectral resolution images. Current methods typically apply direct fusion from the two modalities without effective supervision, leading to an incomplete perception of deep modality-complementary information and a limited understanding of inter-modality correlations. To address these issues, we propose a simple yet effective solution for unsupervised HMIF, revealing that modality decoupling is key to improving fusion performance. Specifically, we propose an end-to-end self-supervised Modality-Decoupled Spatial-Spectral Fusion (MossFuse) framework that decouples shared and complementary information across modalities and aggregates a concise representation of both LR-HSIs and HR-MSIs to reduce modality redundancy. Also, we introduce the subspace clustering loss as a clear guide to decouple modality-shared features from modality-complementary ones. Systematic experiments over multiple datasets demonstrate that our simple and effective approach consistently outperforms the existing HMIF methods while requiring considerably fewer parameters with reduced inference time. The source code is available at: https://github.com/dusongcheng/MossFuse
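As a rough linear analogue of modality decoupling (the paper uses a learned network guided by a subspace clustering loss, not this), the sketch below splits each modality's features into a shared principal-subspace component and a complementary residual. The rank and names are illustrative:

```python
import numpy as np

def decouple(feat_a, feat_b, rank=2):
    """Toy linear decoupling: the shared component is each modality's
    projection onto the top principal subspace of the stacked features;
    the complementary component is the residual."""
    stacked = np.vstack([feat_a, feat_b])
    _, _, vt = np.linalg.svd(stacked - stacked.mean(0), full_matrices=False)
    P = vt[:rank].T @ vt[:rank]  # rank-k projector onto the shared subspace
    shared_a, shared_b = feat_a @ P, feat_b @ P
    return shared_a, feat_a - shared_a, shared_b, feat_b - shared_b
```

By construction shared + complementary reconstructs each modality exactly, which mirrors the requirement that decoupling discards no information, only reorganizes it.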
[633] From Coarse to Continuous: Progressive Refinement Implicit Neural Representation for Motion-Robust Anisotropic MRI Reconstruction
Zhenxuan Zhang, Lipei Zhang, Yanqi Cheng, Zi Wang, Fanwen Wang, Haosen Zhang, Yue Yang, Yinzhe Wu, Jiahao Huang, Angelica I Aviles-Rivero, Zhifan Gao, Guang Yang, Peter J. Lally
Main category: eess.IV
TL;DR: A progressive refinement implicit neural representation framework for motion-robust MRI slice-to-volume reconstruction that unifies motion correction, structural refinement, and volumetric synthesis in a geometry-aware coordinate space.
Details
Motivation: Slice-to-volume reconstruction in motion-robust MRI is challenging due to hierarchical structural disruptions including local detail loss from k-space undersampling, global structural aliasing from motion, and volumetric anisotropy. Existing methods struggle to address these multi-level issues simultaneously.
Method: PR-INR framework with three key modules: 1) a motion-aware diffusion module for coarse volumetric reconstructions suppressing motion artifacts, 2) an implicit detail restoration module for residual refinement by aligning spatial coordinates with visual features, and 3) a voxel continuous-aware representation module providing a continuous function over 3D coordinates, enabling inter-slice completion and high-frequency detail recovery.
Result: Outperforms state-of-the-art methods on five public MRI datasets under various motion conditions (3% and 5% displacement), undersampling rates (4x and 8x), and slice resolutions. Shows superior quantitative reconstruction metrics and visual quality, with generalization and robustness across diverse unseen domains.
Conclusion: PR-INR effectively addresses hierarchical structural disruptions in MRI slice-to-volume reconstruction through progressive refinement in implicit neural representation, demonstrating strong performance in motion correction, structural refinement, and volumetric synthesis.
Abstract: In motion-robust magnetic resonance imaging (MRI), slice-to-volume reconstruction is critical for recovering anatomically consistent 3D brain volumes from 2D slices, especially under accelerated acquisitions or patient motion. However, this task remains challenging due to hierarchical structural disruptions. It includes local detail loss from k-space undersampling, global structural aliasing caused by motion, and volumetric anisotropy. Therefore, we propose a progressive refinement implicit neural representation (PR-INR) framework. Our PR-INR unifies motion correction, structural refinement, and volumetric synthesis within a geometry-aware coordinate space. Specifically, a motion-aware diffusion module is first employed to generate coarse volumetric reconstructions that suppress motion artifacts and preserve global anatomical structures. Then, we introduce an implicit detail restoration module that performs residual refinement by aligning spatial coordinates with visual features. It corrects local structures and enhances boundary precision. Further, a voxel continuous-aware representation module represents the image as a continuous function over 3D coordinates. It enables accurate inter-slice completion and high-frequency detail recovery. We evaluate PR-INR on five public MRI datasets under various motion conditions (3% and 5% displacement), undersampling rates (4x and 8x) and slice resolutions (scale = 5). Experimental results demonstrate that PR-INR outperforms state-of-the-art methods in both quantitative reconstruction metrics and visual quality. It further shows generalization and robustness across diverse unseen domains.
[634] Observer-Usable Information as a Task-specific Image Quality Metric
Changjie Lu, Sourya Sengupta, Hua Li, Mark A. Anastasio
Main category: eess.IV
TL;DR: V-information is introduced as a new objective image quality metric that quantifies task-relevant information available to sub-ideal observers, complementing conventional signal detection theory measures.
Details
Motivation: Current task-based image quality measures like task-specific information (TSI) assume ideal observers, but real-world observers are sub-ideal. There's a need for metrics that quantify how much task-relevant information can actually be exploited by practical observers.
Method: Introduces predictive V-information (V-info) as a relaxation of TSI that considers specified families of sub-ideal observers. Validates V-info using a stylized magnetic resonance image restoration problem for signal detection/discrimination tasks.
Result: V-info correlates with AUC for binary tasks and works for multi-class tasks where ROC analysis is challenging. It shows greater sensitivity in scenarios where conventional metrics saturate.
Conclusion: V-info represents a new objective image quality measure that complements conventional signal detection theory-based metrics, particularly valuable for assessing performance of sub-ideal observers.
Abstract: Objective, task-based measures of image quality (IQ) have been widely advocated for assessing and optimizing medical imaging technologies. Besides signal detection theory-based measures, information-theoretic quantities have been proposed to quantify task-based IQ. For example, task-specific information (TSI), defined as the mutual information between an image and a task variable, represents an optimal measure of how informative an image is for performing a specified task. However, like the ideal observer from signal detection theory, TSI does not quantify the amount of task-relevant information in an image that can be exploited by a sub-ideal observer. A recently proposed relaxation of TSI, termed predictive V-information (V-info), removes this limitation and can quantify the utility of an image with consideration of a specified family of sub-ideal observers. In this study, for the first time, we introduce and investigate V-info as an objective, task-specific IQ metric. To corroborate its usefulness, a stylized magnetic resonance image restoration problem is considered in which V-info is employed to quantify signal detection or discrimination performance. The presented results show that V-info correlates with area under the receiver operating characteristic (ROC) curve for binary tasks, while being readily applicable to multi-class (>2) tasks where ROC analysis is challenging. Notably, V-info exhibits greater sensitivity in scenarios where conventional metrics saturate. These findings demonstrate that V-info represents a new objective IQ measure that can complement conventional signal detection theory-based ones.
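Given a fixed observer's per-image posteriors, empirical predictive V-information reduces to a difference of two cross-entropies. A minimal sketch under that simplification (assuming, as in the usual V-information setup, that the observer family contains the label-marginal predictor; the function name is mine):

```python
import numpy as np

def v_information(pred_probs, labels):
    """Empirical predictive V-information (in nats) for a fixed observer:
    H_V(Y) - H_V(Y|X). H_V(Y) is estimated with the best label-marginal
    predictor; H_V(Y|X) with the observer's per-image posteriors."""
    labels = np.asarray(labels)
    n = len(labels)
    marginal = np.bincount(labels) / n
    h_y = -np.mean(np.log(marginal[labels]))                    # H_V(Y)
    h_y_given_x = -np.mean(np.log(pred_probs[np.arange(n), labels]))
    return h_y - h_y_given_x
```

An observer whose posteriors match the label marginal extracts no usable information (V-info 0), while a sharper observer approaches H(Y); unlike AUC, the same quantity applies unchanged to multi-class tasks.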