Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads

Kunpeng Zhang, Poppy Zhang, Shawndra Hill, Amel Awadelkarim

Main category: cs.MM

TL;DR: A multimodal framework using transformer-based MLLMs to analyze the first 3 seconds (hooking period) of video ads, correlating multimodal features with engagement metrics.

Details

Motivation: Video ads' initial 3 seconds (hooking period) are crucial for engagement but challenging to analyze due to multimodal nature; existing methods miss nuanced interplay between visual, auditory, and textual elements.

Method: Uses transformer-based MLLMs to analyze hooking period with two frame sampling strategies (uniform random & key frame selection) for balanced acoustic feature extraction. Generates descriptive analyses distilled into topics via BERTopic, integrating audio attributes and ad targeting data.

Result: Empirical validation on large-scale social media data shows framework efficacy, revealing correlations between hooking period features and key performance metrics like conversion per investment.

Conclusion: Provides scalable methodology for understanding and optimizing initial moments of video ads, advancing video ad analysis with practical applicability and predictive power.

Abstract: Video-based ads are a vital medium for brands to engage consumers, with social media platforms leveraging user data to optimize ad delivery and boost engagement. A crucial but under-explored aspect is the ‘hooking period’, the first three seconds that capture viewer attention and influence engagement metrics. Analyzing this brief window is challenging due to the multimodal nature of video content, which blends visual, auditory, and textual elements. Traditional methods often miss the nuanced interplay of these components, requiring advanced frameworks for thorough evaluation. This study presents a framework using transformer-based multimodal large language models (MLLMs) to analyze the hooking period of video ads. It tests two frame sampling strategies, uniform random sampling and key frame selection, to ensure balanced and representative acoustic feature extraction, capturing the full range of design elements. The hooking video is processed by state-of-the-art MLLMs to generate descriptive analyses of the ad’s initial impact, which are distilled into coherent topics using BERTopic for high-level abstraction. The framework also integrates features such as audio attributes and aggregated ad targeting information, enriching the feature set for further analysis. Empirical validation on large-scale real-world data from social media platforms demonstrates the efficacy of our framework, revealing correlations between hooking period features and key performance metrics like conversion per investment. The results highlight the practical applicability and predictive power of the approach, offering valuable insights for optimizing video ad strategies. This study advances video ad analysis by providing a scalable methodology for understanding and enhancing the initial moments of video advertisements.

Relevance: 9/10
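
As a rough illustration of the two frame sampling strategies named in the Method above, the sketch below picks frame indices from a 3-second hook. The function names, the 30 fps rate, and the use of per-frame change scores as the key-frame criterion are assumptions made for illustration; the digest does not describe the paper's actual procedure.

```python
import random

def uniform_random_frames(fps, hook_seconds=3.0, k=4, seed=0):
    """Sample k distinct frame indices uniformly at random from the hook window."""
    n_frames = int(fps * hook_seconds)
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_frames), min(k, n_frames)))

def key_frames(diff_scores, k=4):
    """Pick the k frames with the largest change score (hypothetical criterion).

    diff_scores[i] is an assumed measure of how much frame i differs from frame i-1.
    """
    ranked = sorted(range(len(diff_scores)), key=lambda i: diff_scores[i], reverse=True)
    return sorted(ranked[:k])

# 30 fps over 3 seconds gives 90 candidate frames
uniform_idx = uniform_random_frames(fps=30)
# Toy change scores; the two largest are at positions 2 and 4
key_idx = key_frames([0.0, 0.1, 0.9, 0.2, 0.8, 0.05], k=2)
```

Uniform random sampling covers the window without bias toward any moment, while a change-based criterion concentrates on cuts and visual transitions.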

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou

Main category: cs.AI

TL;DR: OmniGAIA benchmark evaluates omni-modal agents on cross-modal reasoning and tool usage across video, audio, and image modalities, while OmniAtlas is a native omni-modal foundation agent trained with novel strategies.

Details

Motivation: Current multi-modal LLMs are limited to bi-modal interactions (e.g., vision-language) and lack unified cognitive capabilities for general AI assistants. There's a need for systems that can handle deep reasoning and multi-turn tool execution across video, audio, and image modalities.

Method: 1) OmniGAIA benchmark constructed via omni-modal event graph approach synthesizing complex multi-hop queries from real-world data. 2) OmniAtlas agent trained using hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction under tool-integrated reasoning paradigm with active omni-modal perception.

Result: OmniGAIA provides comprehensive evaluation for omni-modal agents, while OmniAtlas effectively enhances tool-use capabilities of existing open-source models, marking progress toward next-generation native omni-modal AI assistants.

Conclusion: This work represents a significant step toward developing general AI assistants with unified cognitive capabilities across vision, audio, and language modalities, addressing current limitations in multi-modal LLMs.

Abstract: Human intelligence naturally intertwines omni-modal perception – spanning vision, audio, and language – with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.

Relevance: 9/10

[3] AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs

Townim Faisal Chowdhury, Ta Duc Huy, Siqi Pan, Jeremy Stoddard, Zhibin Liao

Main category: cs.SD

TL;DR: First mechanistic interpretability framework for AudioLLMs using sparse autoencoders to disentangle polysemantic activations into monosemantic features, enabling better transparency and control.

Details

Motivation: Large audio-language models (AudioLLMs) remain opaque despite strong performance, with individual neurons activating to multiple unrelated concepts, creating interpretability challenges.

Method: Introduces a pipeline using sparse autoencoders (SAEs) to disentangle activations, identifies representative audio clips, assigns meaningful names via automated captioning, and validates concepts through human evaluation and steering.

Result: Experiments show AudioLLMs encode structured and interpretable features, enhancing transparency and control over model behavior.

Conclusion: Provides foundation for trustworthy deployment in high-stakes domains and enables future extensions to larger models, multilingual audio, and fine-grained paralinguistic features.

Abstract: Despite strong performance in audio perception tasks, large audio-language models (AudioLLMs) remain opaque to interpretation. A major factor behind this lack of interpretability is that individual neurons in these models frequently activate in response to several unrelated concepts. We introduce the first mechanistic interpretability framework for AudioLLMs, leveraging sparse autoencoders (SAEs) to disentangle polysemantic activations into monosemantic features. Our pipeline identifies representative audio clips, assigns meaningful names via automated captioning, and validates concepts through human evaluation and steering. Experiments show that AudioLLMs encode structured and interpretable features, enhancing transparency and control. This work provides a foundation for trustworthy deployment in high-stakes domains and enables future extensions to larger models, multilingual audio, and more fine-grained paralinguistic features. Project URL: https://townim-faisal.github.io/AutoInterpret-AudioLLM/

Relevance: 9/10
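
To make the sparse autoencoder idea concrete, here is a minimal sketch of one SAE forward pass, assuming the common formulation (ReLU encoder, linear decoder, reconstruction error plus an L1 sparsity penalty). The dimensions, random weights, and `l1_coeff` value are illustrative assumptions, not details from the paper.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Encode an activation into non-negative features, decode it, and compute the loss."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = relu(pre)  # ideally few features are active per input
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)  # sparsity penalty
    return features, recon, mse + l1_coeff * l1

rng = random.Random(0)
d_model, d_hidden = 4, 16  # hidden layer is wider than the input (overcomplete)
W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
W_dec = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
features, recon, loss = sae_forward(
    [0.5, -1.0, 0.3, 2.0], W_enc, [0.0] * d_hidden, W_dec, [0.0] * d_model
)
```

After training, each hidden feature is a candidate monosemantic unit that can be inspected through the inputs that activate it most strongly.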

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 91]
cs.CV [Total: 205]
cs.AI [Total: 125]
cs.SD [Total: 13]
cs.LG [Total: 166]
cs.MA [Total: 7]
cs.MM [Total: 4]
eess.AS [Total: 5]
eess.IV [Total: 6]

cs.CL

[1] Decoder-based Sense Knowledge Distillation

Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis

Main category: cs.CL

TL;DR: DSKD framework integrates lexical sense knowledge into decoder LLMs during training to improve knowledge distillation without inference-time dictionary lookup.

Details

Motivation: LLMs capture rich semantic information but often overlook structured lexical knowledge like word senses and relationships. While sense dictionaries have helped encoder models, applying them to decoder/generative models remains challenging.

Method: Decoder-based Sense Knowledge Distillation (DSKD) framework that integrates lexical resources into decoder-style LLM training without requiring dictionary lookup at inference time.

Result: Extensive experiments on diverse benchmarks show DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.

Conclusion: DSKD successfully bridges the gap between lexical knowledge resources and decoder LLMs, improving their semantic understanding capabilities without compromising inference efficiency.

Abstract: Large language models (LLMs) learn contextual embeddings that capture rich semantic information, yet they often overlook structured lexical knowledge such as word senses and relationships. Prior work has shown that incorporating sense dictionaries can improve knowledge distillation for encoder models, but their application to decoders as generative models remains challenging. In this paper, we introduce Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates lexical resources into the training of decoder-style LLMs without requiring dictionary lookup at inference time. Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.

[2] Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts

Arno Simons

Main category: cs.CL

TL;DR: GPT-5 is tested for interpretative citation context analysis using prompt-sensitivity analysis on a specific citation case, showing systematic variations in interpretative moves based on prompt design.

Details

Motivation: To test whether large language models can support interpretative citation context analysis through deep, text-grounded readings rather than just scaling up typological labels, and to examine how prompt design systematically influences model outputs.

Method: Two-stage GPT-5 pipeline: 1) citation-text-only surface classification, 2) cross-document interpretative reconstruction using full texts. Used balanced 2x3 prompt design with varying scaffolding and framing. Analyzed 90 reconstructions producing 450 hypotheses, coded 21 interpretative moves, and used linear probability models to estimate prompt effects.

Result: GPT-5’s surface classification was highly stable (consistently “supplementary”). In reconstruction, the model generated structured plausible alternatives, but prompt scaffolding and examples redistributed attention and vocabulary, sometimes toward strained readings. GPT-5 detected the same textual hinges as human analysis but resolved them more as lineage/positioning than admonishment.

Conclusion: LLMs can serve as guided co-analysts for inspectable, contestable interpretative citation context analysis, but prompt design systematically influences which plausible readings and vocabularies the model foregrounds, presenting both opportunities and risks.

Abstract: This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert’s (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5’s surface pass is highly stable, consistently classifying the citation as “supplementary”. In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.
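
A linear probability model is ordinary least squares fitted to a 0/1 outcome, so each coefficient reads as a shift in probability. The sketch below handles the simplest case of one binary prompt factor, where the slope equals the difference in move frequency between the two conditions. The data and variable names are invented for illustration and do not come from the study.

```python
def lpm_fit(x, y):
    """OLS intercept and slope for a binary outcome y on a single regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: scaffold = 1 if the prompt used scaffolding,
# move = 1 if a given interpretative move appeared in the reconstruction
scaffold = [0, 0, 0, 0, 1, 1, 1, 1]
move = [0, 1, 0, 0, 1, 1, 0, 1]
intercept, slope = lpm_fit(scaffold, move)
```

Here the intercept is the move's frequency without scaffolding and the slope is the change in frequency when scaffolding is added.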

[3] Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework

Rakib Ullah, Mominul Islam, Md Sanjid Hossain, Md Ismail Hossain

Main category: cs.CL

TL;DR: Novel Bengali meme dataset (Bn-HIB) with 3,247 annotated memes for hate/inflammatory content detection, plus MCFM model using co-attention fusion of visual and textual features.

Details

Motivation: Bengali memes can spread offensive content that's hard to detect due to cultural nuances and satire. Existing research focuses on high-resource languages, leaving low-resource languages like Bengali underserved.

Method: Created Bn-HIB dataset with manual annotations (Benign/Hate/Inflammatory). Proposed MCFM (Multi-Modal Co-Attention Fusion Model) that uses co-attention mechanism to identify and fuse critical features from both visual and textual modalities.

Result: MCFM significantly outperforms state-of-the-art models on the Bn-HIB dataset, demonstrating effectiveness in nuanced Bengali meme classification.

Conclusion: First dataset distinguishing inflammatory from hate content in Bengali memes, with effective multimodal architecture for low-resource language meme analysis.

Abstract: Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that mutually analyzes both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task. Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.

[4] SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context

Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran, Sunipa Dev

Main category: cs.CL

TL;DR: A multilingual stereotype dataset covering four underrepresented Sub-Saharan African countries (Ghana, Kenya, Nigeria, South Africa) created through community-engaged methods to address global coverage gaps in AI safety evaluation resources.

Details

Motivation: Current stereotype repositories lack adequate global coverage, particularly for underrepresented regions like Sub-Saharan Africa. There's a need for targeted expansion addressing existing deficits rather than just increasing data volume, especially for AI safety assessment.

Method: Used socioculturally-situated, community-engaged methods including telephonic surveys moderated in native languages. Deliberately balanced samples across diverse ethnic and demographic backgrounds to ensure broad coverage. Methodology is reproducible and sensitive to the region’s complex linguistic diversity and traditional orality.

Result: Created a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages covering Ghana, Kenya, Nigeria, and South Africa.

Conclusion: The work provides a crucial resource for assessing generative AI model safety with better global representation, demonstrating a reproducible methodology for creating culturally-sensitive stereotype datasets for underrepresented regions.

Abstract: Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region’s complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.

[5] Causality $\neq$ Invariance: Function and Concept Vectors in LLMs

Gustaw Opiełka, Hannes Rosenbusch, Claire E. Stevenson

Main category: cs.CL

TL;DR: LLMs contain abstract concept representations (Concept Vectors) that differ from Function Vectors used for in-context learning, with CVs showing better cross-format generalization.

Details

Motivation: To investigate whether LLMs represent concepts abstractly (independent of input format) and understand the difference between representations that drive in-context learning performance versus those that encode stable concept representations.

Method: Extract Function Vectors (FVs) from different input formats (open-ended vs. multiple-choice) targeting the same concept, identify Concept Vectors (CVs) using Representational Similarity Analysis to find attention heads that encode concepts consistently across formats, and conduct steering experiments comparing FVs and CVs.

Result: FVs are not fully invariant - they’re nearly orthogonal across different input formats. CVs carry more stable concept representations and generalize better out-of-distribution across question types and languages, while FVs excel only when extraction and application formats match.

Conclusion: LLMs do contain abstract concept representations (CVs), but these differ from the representations that drive in-context learning performance (FVs), suggesting different underlying mechanisms for task performance versus concept representation.

Abstract: Do large language models (LLMs) represent concepts abstractly, i.e., independent of input format? We revisit Function Vectors (FVs), compact representations of in-context learning (ICL) tasks that causally drive task performance. Across multiple LLMs, we show that FVs are not fully invariant: FVs are nearly orthogonal when extracted from different input formats (e.g., open-ended vs. multiple-choice), even if both target the same concept. We identify Concept Vectors (CVs), which carry more stable concept representations. Like FVs, CVs are composed of attention head outputs; however, unlike FVs, the constituent heads are selected using Representational Similarity Analysis (RSA) based on whether they encode concepts consistently across input formats. While these heads emerge in similar layers to FV-related heads, the two sets are largely distinct, suggesting different underlying mechanisms. Steering experiments reveal that FVs excel in-distribution, when extraction and application formats match (e.g., both open-ended in English), while CVs generalize better out-of-distribution across both question types (open-ended vs. multiple-choice) and languages. Our results show that LLMs do contain abstract concept representations, but these differ from those that drive ICL performance.
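
Representational Similarity Analysis compares the geometry of two sets of representations rather than the raw vectors. The sketch below builds a dissimilarity matrix over concepts for each input format and correlates the two. The toy vectors, the function names, and the choice of Pearson correlation (rank correlation is also common) are assumptions for illustration, not the paper's exact setup.

```python
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def rdm(vectors):
    """Upper triangle of the representational dissimilarity matrix (1 - Pearson r)."""
    n = len(vectors)
    return [1.0 - pearson(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

def rsa_score(reps_format_a, reps_format_b):
    """Correlate the two RDMs; a high score means similar concept geometry."""
    return pearson(rdm(reps_format_a), rdm(reps_format_b))

# Hypothetical outputs of one attention head for three concepts under two formats
open_ended = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [0.0, 1.0, 0.8]]
multiple_choice = [[2.0, 0.1, 0.5], [1.8, 0.2, 0.6], [0.1, 2.0, 1.5]]
score = rsa_score(open_ended, multiple_choice)
```

A head whose score stays high across formats encodes the concept consistently, which matches the selection idea the abstract describes for Concept Vector heads.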

[6] Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing

An-Ci Peng, Kuan-Tang Huang, Tien-Hong Lo, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

Main category: cs.CL

TL;DR: A unified RNN-T framework for Taiwanese Hakka ASR that disentangles dialectal style from linguistic content and jointly models Hanzi and Pinyin writing systems, achieving significant error rate reductions.

Details

Motivation: Taiwanese Hakka is a low-resource, endangered language with high dialectal variability and two writing systems (Hanzi and Pinyin), making traditional ASR models struggle as they conflate linguistic content with dialect-specific variations.

Method: Proposes a unified RNN-T framework with dialect-aware modeling strategies to disentangle dialectal “style” from linguistic “content”, and parameter-efficient prediction networks to concurrently model ASR for both Hanzi and Pinyin writing systems.

Result: Achieves 57.00% and 40.41% relative error rate reduction on Hanzi and Pinyin ASR respectively on the HAT corpus, representing the first systematic investigation of Hakka dialectal variations on ASR and first single model for joint tasks.

Conclusion: The proposed framework effectively addresses challenges in low-resource, dialectally-varied languages by disentangling style from content and leveraging cross-script objectives as mutual regularizers, with significant improvements in ASR performance.

Abstract: Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin). Traditional ASR models often encounter difficulties in this context, as they tend to conflate essential linguistic content with dialect-specific variations across both phonological and lexical dimensions. To address these challenges, we propose a unified framework grounded in the Recurrent Neural Network Transducers (RNN-T). Central to our approach is the introduction of dialect-aware modeling strategies designed to disentangle dialectal “style” from linguistic “content”, which enhances the model’s capacity to learn robust and generalized representations. Additionally, the framework employs parameter-efficient prediction networks to concurrently model ASR (Hanzi and Pinyin). We demonstrate that these tasks create a powerful synergy, wherein the cross-script objective serves as a mutual regularizer to improve the primary ASR tasks. Experiments conducted on the HAT corpus reveal that our model achieves 57.00% and 40.41% relative error rate reduction on Hanzi and Pinyin ASR, respectively. To our knowledge, this is the first systematic investigation into the impact of Hakka dialectal variations on ASR and the first single model capable of jointly addressing these tasks.
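
For readers unfamiliar with the metric, relative error rate reduction expresses the improvement as a share of the baseline error rather than as an absolute difference. The error rates below are hypothetical values chosen only to show the arithmetic; the digest does not report the underlying baseline and system error rates.

```python
def relative_error_reduction(baseline_err, new_err):
    """Percentage of the baseline error that the new system removes."""
    return 100.0 * (baseline_err - new_err) / baseline_err

# Hypothetical error rates, not taken from the paper
rerr = relative_error_reduction(baseline_err=20.0, new_err=8.6)
```

With these made-up numbers the reduction is about 57%, even though the absolute drop is only 11.4 points.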

[7] A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad, Nick Rahimi

Main category: cs.CL

TL;DR: A fusion architecture combining BanglaBERT-Large with stacked LSTM for multilabel cyberbullying detection in Bangla, addressing class imbalance and evaluating with multiple metrics.

Details

Motivation: Cyberbullying detection typically uses single-label classification, but real-world comments often contain overlapping abuse types (threats, hate speech, harassment). Multilabel detection is essential but understudied, especially in low-resource languages like Bangla where robust models are scarce.

Method: Proposes a fusion architecture combining BanglaBERT-Large (for contextual understanding) with a two-layer stacked LSTM (for sequential dependencies). Fine-tuned on a multilabel Bangla cyberbullying dataset covering cyberbully, sexual harassment, threat, and spam. Applied sampling strategies to address class imbalance and used 5-fold cross-validation.

Result: Evaluation using multiple metrics including accuracy, precision, recall, F1-score, Hamming loss, Cohen’s kappa, and AUC-ROC. The fusion model aims to jointly model context and sequence dependencies for improved multilabel detection.

Conclusion: The proposed fusion architecture addresses limitations of standalone transformers (missing sequential dependencies) and LSTMs (lacking semantic depth) for multilabel cyberbullying detection in low-resource Bangla language.

Abstract: Cyberbullying has become a serious and growing concern in today’s virtual world. When left unnoticed, it can have adverse consequences for social and mental health. Researchers have explored various types of cyberbullying, but most approaches use single-label classification, assuming that each comment contains only one type of abuse. In reality, a single comment may include overlapping forms such as threats, hate speech, and harassment. Therefore, multilabel detection is both realistic and essential. However, multilabel cyberbullying detection has received limited attention, especially in low-resource languages like Bangla, where robust pre-trained models are scarce. Developing a generalized model with moderate accuracy remains challenging. Transformers offer strong contextual understanding but may miss sequential dependencies, while LSTM models capture temporal flow but lack semantic depth. To address these limitations, we propose a fusion architecture that combines BanglaBERT-Large with a two-layer stacked LSTM. We analyze their behavior to jointly model context and sequence. The model is fine-tuned and evaluated on a publicly available multilabel Bangla cyberbullying dataset covering cyberbully, sexual harassment, threat, and spam. We apply different sampling strategies to address class imbalance. Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen’s kappa, and AUC-ROC. We employ 5-fold cross-validation to assess the generalization of the architecture.
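
Of the metrics listed, Hamming loss is the one specific to multilabel evaluation: it counts wrong label slots rather than wrong comments. A minimal sketch follows, using the four label names from the abstract; the predictions are invented for illustration.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total

# Columns: cyberbully, sexual harassment, threat, spam (toy labels, not real data)
y_true = [[1, 0, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
loss = hamming_loss(y_true, y_pred)
```

Two of the twelve label slots are wrong here, so a comment with one missed label is penalized less than it would be under exact-match accuracy.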

[8] Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads

Shaswat Patel, Vishvesh Trivedi, Yue Han, Yihuai Hong, Eunsol Choi

Main category: cs.CL

TL;DR: The paper investigates attention heads in multilingual Transformers, identifying specialized Retrieval-Transition Heads (RTHs) that govern language transitions and are crucial for multilingual Chain-of-Thought reasoning.

Details

Motivation: To understand how multilingual language models handle cross-lingual information processing, specifically identifying the attention mechanisms responsible for language transitions and their role in multilingual reasoning tasks.

Method: Study retrieval heads in multilingual contexts, identify shared retrieval heads across languages, discover Retrieval-Transition Heads (RTHs) that control target-language output transitions, and conduct ablation experiments by masking different head types across four multilingual benchmarks and two model families.

Result: RTHs are distinct from regular retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs; masking RTHs causes bigger performance drops than masking retrieval heads across all tested benchmarks and model families.

Conclusion: The work advances understanding of multilingual language models by isolating specific attention heads responsible for mapping to target languages, revealing specialized mechanisms for cross-lingual processing.

Abstract: Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to the cross-lingual setting, we identify Retrieval-Transition Heads (RTHs), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuAD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTHs induces a bigger performance drop than masking retrieval heads (RHs). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.

[9] Mind the Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models

Binchi Zhang, Xujiang Zhao, Jundong Li, Haifeng Chen, Zhengzhang Chen

Main category: cs.CL

TL;DR: CultureManager: A pipeline for task-specific cultural alignment of LLMs using synthesized cultural data and modular culture adapters with routing.

Details

Motivation: LLMs are increasingly used in culturally sensitive real-world tasks, but existing cultural alignment approaches fail to align broad cultural values with specific downstream task goals and suffer from cross-culture interference.

Method: Proposes the CultureManager pipeline, which 1) synthesizes task-aware cultural data aligned with target task formats using culturally relevant web search results, and 2) manages multi-culture knowledge in separate adapters with a culture router that selects the appropriate one to apply.

Result: Experiments across ten national cultures and culture-sensitive tasks show consistent improvements over prompt-based and fine-tuning baselines.

Conclusion: Demonstrates the necessity of task adaptation and modular culture management for effective cultural alignment of LLMs.

Abstract: Large language models (LLMs) are increasingly deployed in culturally sensitive real-world tasks. However, existing cultural alignment approaches fail to align LLMs’ broad cultural values with the specific goals of downstream tasks and suffer from cross-culture interference. We propose CultureManager, a novel pipeline for task-specific cultural alignment. CultureManager synthesizes task-aware cultural data in line with target task formats, grounded in culturally relevant web search results. To prevent conflicts between cultural norms, it manages multi-culture knowledge learned in separate adapters with a culture router that selects the appropriate one to apply. Experiments across ten national cultures and culture-sensitive tasks show consistent improvements over prompt-based and fine-tuning baselines. Our results demonstrate the necessity of task adaptation and modular culture management for effective cultural alignment.

[10] A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

Main category: cs.CL

TL;DR: MiSTER-E: A modular Mixture-of-Experts framework for Emotion Recognition in Conversations that decouples modality-specific context modeling from multimodal fusion, using LLM-based speech/text embeddings and dynamic expert weighting without speaker identity.

Details

Motivation: ERC requires capturing temporal dialogue flow and effectively integrating multi-modal cues. Existing approaches often struggle with decoupling modality-specific context modeling from multimodal fusion, and many rely on speaker identity information.

Method: Proposes MiSTER-E: 1) Uses LLMs fine-tuned for speech and text to generate utterance-level embeddings, 2) Enhances embeddings through convolutional-recurrent context modeling, 3) Employs three experts (speech-only, text-only, cross-modal) with learned gating mechanism, 4) Uses supervised contrastive loss for modality alignment and KL-divergence regularization for expert consistency, 5) No speaker identity dependency.

Result: Achieves 70.9% weighted F1 on IEMOCAP, 69.5% on MELD, and 87.9% on MOSI, outperforming baseline speech-text ERC systems. Ablation studies validate contributions of individual components.

Conclusion: MiSTER-E effectively addresses ERC challenges by decoupling modality-specific modeling from fusion, leveraging LLM capabilities, and using dynamic expert weighting. The approach demonstrates strong performance without speaker identity information.

Abstract: Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
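
The gated combination of the three experts can be illustrated with a small NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function names (`gated_mixture`, `kl_divergence`), the choice to mix class probabilities rather than logits, and the way gate scores are produced are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_mixture(expert_logits, gate_logits):
    """Mix per-expert class probabilities with softmax gate weights.

    expert_logits: (n_experts, n_classes), e.g. speech, text, cross-modal.
    gate_logits:   (n_experts,), assumed to come from a learned gating network.
    """
    expert_probs = softmax(expert_logits, axis=-1)
    gate = softmax(gate_logits)
    return gate @ expert_probs  # (n_classes,), sums to 1

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), the kind of term a consistency regularizer could use."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

Because the gate weights sum to one, the mixture is itself a valid distribution over emotion classes, so it can be compared against each expert's prediction with a KL term.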

[11] Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs

Jiří Milička, Hana Bednářová

Main category: cs.CL

TL;DR: A corpus of LLM-generated texts on human-AI relationships created using 3 personas (Default, Classic Sydney, Memetic Sydney) across 12 frontier models, totaling 4.5k texts with 6M words, annotated with Universal Dependencies.

Details

Motivation: To study how LLM-based entities conceive of human-AI relationships, particularly focusing on the influence of different personas (especially the Sydney persona) on model outputs, given the cultural and safety implications of these relationships.

Method: Created a corpus by simulating 3 author personas (Default with no system prompt, Classic Sydney with original Bing system prompt, Memetic Sydney with “You are Sydney” prompt) across 12 frontier models from major AI companies, generating 4.5k texts totaling 6M words, then annotated with Universal Dependencies.

Result: Produced the AI Sydney corpus containing diverse LLM-generated perspectives on human-AI relationships from different personas and models, made available under a permissive license with linguistic annotations.

Conclusion: The paper argues that the simulated persona, not only the underlying model, matters when examining how LLMs conceptualize human-AI relationships. The resulting corpus provides data for analyzing these relationships across different models and prompting strategies.

Abstract: The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons. When we examine this topic, what matters is not only the model itself but also the personas we simulate on that model. This can be well illustrated by the Sydney persona, which aroused a strong response among the general public precisely because of its unorthodox relationship with people. This persona originally arose rather by accident on Microsoft’s Bing Search platform; however, the texts it created spread into the training data of subsequent models, as did other secondary information that spread memetically around this persona. Newer models are therefore able to simulate it. This paper presents a corpus of LLM-generated texts on relationships between humans and AI, produced by 3 author personas: the Default Persona with no system prompt, Classic Sydney characterized by the original Bing system prompt, and Memetic Sydney, which is prompted by “You are Sydney” system prompt. These personas are simulated by 12 frontier models by OpenAI, Anthropic, Alphabet, DeepSeek, and Meta, generating 4.5k texts with 6M words. The corpus (named AI Sydney) is annotated according to Universal Dependencies and available under a permissive license.

[12] Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models

Craig Myles, Patrick Schrempf, David Harris-Birtill

Main category: cs.CL

TL;DR: Automatic prompt optimization with Genetic-Pareto (GEPA) significantly improves language models’ error detection in medical text, approaching doctor-level performance on MEDEC benchmark.

Details

Motivation: Errors in medical text can cause treatment delays or mistakes. Language models show promise for automatic error detection, which could benefit healthcare systems, but prompt optimization is crucial for performance.

Method: Used Genetic-Pareto (GEPA) for automatic prompt optimization across frontier and open-source language models. Tested on MEDEC benchmark dataset for medical text error detection.

Result: GEPA improved error detection accuracy from 0.669 to 0.785 with GPT-5 and from 0.578 to 0.690 with Qwen3-32B, approaching medical doctor performance and achieving state-of-the-art on MEDEC.

Conclusion: Prompt optimization is critical for medical error detection with language models. GEPA enables models to approach doctor-level performance, demonstrating practical value for healthcare applications.

Abstract: Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical-note-error-detection

[13] Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o

Samay Bhojwani, Swarnima Kain, Lisong Xu

Main category: cs.CL

TL;DR: GPT-4o-based iterative refinement pipeline successfully generates dyslexia-friendly text summaries meeting readability targets for news articles.

Details

Motivation: Dyslexia affects approximately 10% of the global population, creating barriers to reading fluency and comprehension. Existing assistive technologies focus on visual presentation but do not address linguistic complexity, leaving a gap for accessibility-driven text simplification.

Method: Iterative prompt-based refinement pipeline using GPT-4o to generate dyslexia-friendly text summaries. Evaluated on ~2,000 news articles with readability target of Flesch Reading Ease ≥ 90.

Result: Majority of summaries meet readability threshold within 4 attempts, many on first try. Composite score (readability + semantic fidelity) ranges 0.13-0.73 with typical value ~0.55, showing stable performance.

Conclusion: Establishes empirical baseline for accessibility-driven NLP summarization and motivates human-centered evaluation with dyslexic readers.

Abstract: Dyslexia affects approximately 10% of the global population and presents persistent challenges in reading fluency and text comprehension. While existing assistive technologies address visual presentation, linguistic complexity remains a substantial barrier to equitable access. This paper presents an empirical study on dyslexia-friendly text summarization using an iterative prompt-based refinement pipeline built on GPT-4o. We evaluate the pipeline on approximately 2,000 news article samples, applying a readability target of Flesch Reading Ease >= 90. Results show that the majority of summaries meet the readability threshold within four attempts, with many succeeding on the first try. A composite score combining readability and semantic fidelity shows stable performance across the dataset, ranging from 0.13 to 0.73 with a typical value near 0.55. These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
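
The readability target and retry loop can be sketched as follows. The Flesch Reading Ease formula is the standard one (206.835 - 1.015 × words per sentence - 84.6 × syllables per word). Everything else is an assumption for illustration: the vowel-group syllable counter is a rough heuristic, and `summarize` is a hypothetical callable standing in for the GPT-4o call, whose actual prompts are not given in the summary.

```python
import re

def count_syllables(word):
    """Rough heuristic: count runs of vowels (at least one per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Standard Flesch Reading Ease score; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

def refine(article, summarize, target=90.0, max_attempts=4):
    """Call `summarize(article, feedback)` until the target score is met.

    Returns (summary, score, attempts_used). `summarize` is hypothetical.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        summary = summarize(article, feedback)
        score = flesch_reading_ease(summary)
        if score >= target:
            break
        feedback = (f"Score {score:.1f} is below {target}; "
                    "use shorter words and shorter sentences.")
    return summary, score, attempt
```

Feeding the measured score back into the next prompt is one plausible way to make each attempt more targeted than a blind retry.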

[14] Ruyi2 Technical Report

Huan Song, Shuyu Tian, Junyi Hao, Minxiu Xu, Hongjun An, Yiliang Song, Jiawei Shao, Xuelong Li

Main category: cs.CL

TL;DR: Ruyi2 introduces a stable “Familial Model” based on Megatron-LM with 3D parallel training, achieving 2-3x speedup over Ruyi while matching Qwen3 performance, establishing a “Train Once, Deploy Many” paradigm for efficient LLM deployment.

Details

Motivation: LLMs face significant deployment cost and latency challenges, requiring adaptive computing strategies. Existing methods like early-exit architectures struggle with optimization complexity and compatibility with large-scale distributed training.

Method: Ruyi2 introduces a stable “Familial Model” based on Megatron-LM framework, using 3D parallel training for efficient variable-depth computation with family-based parameter sharing.

Result: Achieves 2-3 times speedup over previous Ruyi model while performing comparably to same-sized Qwen3 models, confirming effectiveness of family-based parameter sharing strategy.

Conclusion: Establishes a new “Train Once, Deploy Many” paradigm and provides key reference for balancing architectural efficiency with high-performance capabilities in LLM deployment.

Abstract: Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies. Building upon the AI Flow framework, we introduce Ruyi2 as an evolution of our adaptive model series designed for efficient variable-depth computation. While early-exit architectures offer a viable efficiency-performance balance, the Ruyi model and existing methods often struggle with optimization complexity and compatibility with large-scale distributed training. To bridge this gap, Ruyi2 introduces a stable “Familial Model” based on Megatron-LM. By using 3D parallel training, it achieves a 2-3 times speedup over Ruyi, while performing comparably to same-sized Qwen3 models. These results confirm that family-based parameter sharing is a highly effective strategy, establishing a new “Train Once, Deploy Many” paradigm and providing a key reference for balancing architectural efficiency with high-performance capabilities.

[15] Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang

Main category: cs.CL

TL;DR: Search-P1 introduces path-centric reward shaping for agentic RAG training, using order-agnostic step coverage and dual-track path scoring to improve multi-step reasoning in retrieval-augmented generation.

Details

Motivation: Traditional single-round RAG struggles with complex multi-step reasoning, and current RL-based agentic RAG training suffers from sparse outcome rewards and low sample efficiency where failed samples provide no learning signals.

Method: Proposes Search-P1 framework with two components: (1) Path-Centric Reward that evaluates reasoning trajectories through order-agnostic step coverage and soft scoring, and (2) Dual-Track Path Scoring using offline-generated reference planners to assess paths from self-consistency and reference-alignment perspectives.

Result: Experiments on multiple QA benchmarks show Search-P1 achieves significant improvements over Search-R1 and other baselines, with an average accuracy gain of 7.7 points.

Conclusion: Search-P1 effectively addresses limitations in agentic RAG training by providing richer learning signals from both successful and failed reasoning paths, leading to better multi-step reasoning performance.

Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
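
A minimal sketch of what order-agnostic step coverage with soft scoring could look like is shown below. The exact-match comparison of step strings, the `outcome_weight` blend, and both function names are assumptions for illustration; the summary does not specify how Search-P1 matches steps or combines the signals.

```python
def step_coverage(reference_steps, trajectory_steps):
    """Fraction of reference steps found in the trajectory, ignoring order."""
    ref = {s.strip().lower() for s in reference_steps}
    got = {s.strip().lower() for s in trajectory_steps}
    if not ref:
        return 0.0
    return len(ref & got) / len(ref)

def soft_path_reward(reference_steps, trajectory_steps, answer_correct,
                     outcome_weight=0.5):
    """Blend the outcome reward with path coverage (weights are hypothetical).

    A failed sample (answer_correct=False) still earns partial credit
    for the reference steps it covered.
    """
    coverage = step_coverage(reference_steps, trajectory_steps)
    return outcome_weight * float(answer_correct) + (1 - outcome_weight) * coverage
```

The point of the soft score is that a trajectory with a wrong final answer no longer collapses to zero reward, which is how failed samples can still contribute a learning signal.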

[16] Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models

Sasha Robinson, Kerem Oktar, Katherine M. Collins, Ilia Sucholutsky, Kelsey R. Allen

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.21262 was rate-limited (HTTP 429).

[17] Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang

Main category: cs.CL

TL;DR: A reinforced co-adaptation framework for industrial advertising QA that jointly optimizes retrieval and generation to reduce hallucinations, particularly fabricated URLs, using GraphRAG for structured knowledge retrieval and GRPO with multi-dimensional rewards.

Details

Motivation: Industrial advertising QA is high-stakes where hallucinated content (especially fabricated URLs) can cause financial loss, compliance violations, and legal risks. Traditional RAG deployment is challenging due to relational industrial knowledge, frequent updates, and insufficient alignment with generation objectives.

Method: Proposes a reinforced co-adaptation framework with two components: (1) Graph-aware Retrieval (GraphRAG) that models entity-relation structure over high-citation knowledge subgraphs for multi-hop, domain-specific evidence selection; (2) Evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity.

Result: Experiments on internal advertising QA dataset show consistent gains across expert-judged dimensions (accuracy, completeness, safety) with 72% reduction in hallucination rate. Two-week online A/B test demonstrates 28.6% increase in like rate, 46.2% decrease in dislike rate, and 92.7% reduction in URL hallucination. System has been running in production for over half a year serving millions of QA interactions.

Conclusion: The reinforced co-adaptation framework effectively addresses industrial advertising QA challenges by jointly optimizing retrieval and generation, significantly reducing hallucinations and improving user satisfaction while maintaining safety and compliance requirements.

Abstract: Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.
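
Two pieces of this setup lend themselves to a short sketch: a URL-validity reward and the group-relative advantage used by GRPO. The advantage normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO form. The URL check, the regular expression, and the use of the population standard deviation are assumptions made for illustration, not the paper's reward definitions.

```python
import re
import statistics

URL_RE = re.compile(r"https?://[^\s)>\]]+")

def extract_urls(text):
    """Find URLs and strip trailing sentence punctuation."""
    return [u.rstrip(".,;") for u in URL_RE.findall(text)]

def url_validity(answer, evidence):
    """Share of URLs in the answer that also appear in the retrieved evidence.

    An answer without URLs scores 1.0 (nothing was fabricated).
    """
    urls = extract_urls(answer)
    if not urls:
        return 1.0
    allowed = set(extract_urls(evidence))
    return sum(u in allowed for u in urls) / len(urls)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]
```

In a multi-dimensional reward, a score like `url_validity` would be one component alongside faithfulness, style, and safety before the group normalization is applied.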

[18] dLLM: Simple Diffusion Language Modeling

Zhanhui Zhou, Lingjie Chen, Hanghang Tong, Dawn Song

Main category: cs.CL

TL;DR: dLLM is an open-source framework that unifies core components of diffusion language models (training, inference, evaluation) to make them reproducible and extensible, with recipes for building small DLMs from accessible compute.

Details

Motivation: Many diffusion language models share common components but are distributed across ad-hoc codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there's a need for a unified framework that standardizes these components while remaining flexible enough to support new methods and architectures.

Method: Introduces dLLM, an open-source framework that unifies training, inference, and evaluation components of diffusion language modeling. Provides standardized pipelines for reproducing, finetuning, deploying, and evaluating existing DLMs, plus minimal recipes for building small DLMs from scratch using accessible compute (including converting BERT-style encoders or autoregressive LMs into DLMs).

Result: The framework enables reproduction of open-source large DLMs like LLaDA and Dream through standardized pipelines. Provides recipes for building small DLMs from accessible compute and releases checkpoints of these small DLMs to make DLMs more accessible.

Conclusion: dLLM addresses the reproducibility and extensibility gap in diffusion language modeling by providing a unified framework that standardizes core components while maintaining flexibility for new research, accelerating future work in the field.

Abstract: Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling – training, inference, and evaluation – and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.

[19] Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu, Shu Xu, Jiaqi Wu, Jiayu Zhang, Xinpeng Liu, Xin Gui, Jingyi Cao, Piaohong Wang, Dingfeng Shi, He Zhu, Tiannan Wang, Yuqing Wang, Maojia Song, Tianyu Zheng, Ge Zhang, Jian Yang, Jiaheng Liu, Minghao Liu, Yuchen Eleanor Jiang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: SMTL is a framework for efficient long-horizon agentic search that replaces sequential reasoning with parallel evidence acquisition to reduce inference costs while maintaining performance across diverse research tasks.

Details

Motivation: Current deep research agents suffer from high inference costs and latency due to scaling reasoning depth, and struggle with generalization across heterogeneous research settings. There's a need for more efficient and generalizable search agents.

Method: Proposes Search More, Think Less (SMTL) framework with parallel evidence acquisition for efficient context management, plus a unified data synthesis pipeline for training across diverse task types (deterministic QA and open-ended research). Uses supervised fine-tuning and reinforcement learning.

Result: Achieves strong performance across benchmarks: BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Reduces reasoning steps by 70.7% compared to Mirothinker-v1.0 while improving accuracy.

Conclusion: SMTL demonstrates that parallel evidence acquisition and unified training across diverse research tasks can create more efficient and generalizable research agents without sacrificing performance.

Abstract: Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task-appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state-of-the-art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with a maximum of 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.
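
The idea of issuing searches in parallel rather than one reasoning step at a time can be sketched with `asyncio`. This is only an illustration of the general pattern: `fake_search` is a stand-in for a real search tool, and truncating each result to `max_chars` is a hypothetical way to respect a context budget, not SMTL's actual context management.

```python
import asyncio

async def fake_search(query):
    """Stand-in for a real search tool call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"evidence for {query!r}"

async def gather_evidence(queries, search=fake_search, max_chars=200):
    """Run all searches concurrently and keep a bounded snippet per query."""
    results = await asyncio.gather(*(search(q) for q in queries))
    return {q: r[:max_chars] for q, r in zip(queries, results)}

if __name__ == "__main__":
    evidence = asyncio.run(gather_evidence(["GAIA benchmark", "BrowseComp"]))
    for query, snippet in evidence.items():
        print(query, "->", snippet)
```

With concurrent calls, total latency is roughly that of the slowest search rather than the sum of all searches.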

[20] Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies

Shinnosuke Nozue, Yuto Nakano, Yotaro Watanabe, Meguru Takasaki, Shoji Moriya, Reina Akama, Jun Suzuki

Main category: cs.CL

TL;DR: A cross-disciplinary framework for persuasive dialogue agents using social psychology, behavioral economics, and communication theory strategies, validated on two datasets with improved persuasion success and generalizability.

Details

Motivation: Current persuasive dialogue agents rely on limited predefined strategies that don't capture real-world interaction complexity, creating a need for more comprehensive approaches.

Method: Developed a cross-disciplinary framework drawing on proven strategies from social psychology, behavioral economics, and communication theory. Validated through experiments on Persuasion for Good (specific scenario) and DailyPersuasion (diverse scenarios) datasets.

Result: Achieved strong results on both datasets with notable improvement in persuasion success rate and promising generalizability. Particularly effective at persuading individuals with initially low intent.

Conclusion: The cross-disciplinary framework successfully addresses limitations of current persuasive dialogue agents by incorporating diverse psychological and communication strategies, demonstrating improved performance and generalizability across different scenarios.

Abstract: Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions. We applied a cross-disciplinary approach to develop a framework for designing persuasive dialogue agents that draws on proven strategies from social psychology, behavioral economics, and communication theory. We validated our proposed framework through experiments on two distinct datasets: the Persuasion for Good dataset, which represents a specific in-domain scenario, and the DailyPersuasion dataset, which encompasses a wide range of scenarios. The proposed framework achieved strong results for both datasets and demonstrated notable improvement in the persuasion success rate as well as promising generalizability. Notably, the proposed framework also excelled at persuading individuals with initially low intent, which addresses a critical challenge for persuasive dialogue agents.

[21] Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue

Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang, Yujie Wang, Wei He, Jinpeng Wang, Chaozheng Wang

Main category: cs.CL

TL;DR: InteractCS-RL is a reinforcement learning framework for task-oriented dialogue agents that balances empathetic communication with budget-aware decision-making through multi-granularity RL and cost-aware policy optimization.

Details

Motivation: Existing methods fail to capture the complex strategic trade-offs between empathetic communication and budget-aware decision-making in task-oriented dialogue agents, creating a need for frameworks that can effectively balance these competing objectives.

Method: Proposes InteractCS-RL with two main components: 1) a User-centric Interaction Framework that provides a high-fidelity training gym with persona-driven users, and 2) Cost-aware Multi-turn Policy Optimization (CMPO), which combines hybrid advantage estimation, generative process credits, and a PID-Lagrangian cost controller to explore the Pareto boundary between user reward and cost constraints.

Result: Extensive experiments on customized real business scenarios show InteractCS-RL significantly outperforms other baselines across three evaluation dimensions, with further evaluation on tool-agent-user interaction benchmarks verifying robustness across diverse domains.

Conclusion: InteractCS-RL effectively addresses the challenge of balancing empathetic communication with budget-aware decision-making in task-oriented dialogue agents through a novel RL framework that captures complex strategic trade-offs.

Abstract: The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies InteractCS-RL's robustness across diverse domains.
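
A generic PID-Lagrangian controller, of the kind the abstract refers to, can be sketched as below. This follows the common formulation in which the Lagrange multiplier is driven by proportional, integral, and derivative terms of the cost-constraint violation. The class name, gain values, and update details are assumptions; the summary does not describe how CMPO configures its controller.

```python
class PIDLagrangian:
    """Adjust a cost penalty weight so average cost tracks a budget."""

    def __init__(self, cost_limit, kp=0.1, ki=0.01, kd=0.0):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_cost = None
        self.multiplier = 0.0

    def update(self, mean_cost):
        """Return the new multiplier given the latest batch's mean cost."""
        error = mean_cost - self.cost_limit
        # Integral and derivative terms are clipped at zero so the
        # penalty never rewards spending below the budget.
        self.integral = max(0.0, self.integral + error)
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, mean_cost - self.prev_cost)
        self.prev_cost = mean_cost
        self.multiplier = max(
            0.0,
            self.kp * error + self.ki * self.integral + self.kd * derivative,
        )
        return self.multiplier

def penalized_reward(reward, cost, multiplier):
    """Lagrangian-style objective: user reward minus weighted cost."""
    return reward - multiplier * cost
```

When the measured cost exceeds the budget the multiplier grows and the policy is pushed toward cheaper actions; when cost falls below the budget the multiplier decays back toward zero.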

[22] Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs

Siyue Su, Jian Yang, Bo Li, Guanglin Niu

Main category: cs.CL

TL;DR: KGT framework uses dedicated entity tokens to bridge granularity mismatch between LLMs and knowledge graphs, enabling efficient full-space prediction for knowledge graph completion.

Details

Motivation: There's a fundamental granularity mismatch between LLMs (which operate on token sequences) and knowledge graphs (where entities are fundamental units). Existing approaches fail to capture both semantic meaning and structural integrity.

Method: Proposes KGT framework with: 1) specialized tokenization for dedicated entity tokens, 2) relation-guided gating mechanism to fuse pre-trained structural and textual features, and 3) decoupled prediction with independent heads for semantic and structural reasoning.

Result: KGT consistently outperforms state-of-the-art methods across multiple benchmarks for knowledge graph completion.

Conclusion: The KGT framework successfully addresses the granularity mismatch problem and enables efficient full-space prediction for knowledge graph completion using LLMs.

Abstract: Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically constrain predictions to limited candidate sets or align entities with the LLM’s vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences, which fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding training from scratch. Finally, we implement decoupled prediction by leveraging independent heads to separate and combine semantic and structural reasoning. Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
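
One common way to realize a relation-guided gate is a sigmoid over a linear projection of the relation embedding, used to interpolate between the two feature sources. The sketch below assumes that form; the shapes, parameter names (`W`, `b`), and the per-dimension interpolation are hypothetical and may differ from KGT's actual mechanism.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relation_gated_fusion(structural, textual, relation, W, b):
    """Fuse two entity feature vectors with a gate computed from the relation.

    structural, textual: (d,) pre-trained entity features.
    relation:            (r,) relation embedding.
    W: (r, d), b: (d,)   hypothetical learned gate parameters.
    """
    gate = sigmoid(relation @ W + b)  # (d,), each entry in (0, 1)
    return gate * structural + (1.0 - gate) * textual
```

A gate conditioned on the relation lets the model lean on graph structure for some relations and on text for others, without retraining either feature extractor.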

[23] Human Label Variation in Implicit Discourse Relation Recognition

Frances Yung, Daniil Ignatev, Merel Scholman, Vera Demberg, Massimo Poesio

Main category: cs.CL

TL;DR: Models predicting annotation distributions outperform annotator-specific models on ambiguous IDRR tasks where disagreement stems from cognitive complexity rather than bias.

Details

Motivation: Many NLP tasks lack a single ground truth because human judgments reflect diverse perspectives. The paper compares two ways of handling this: models that predict full annotation distributions and perspectivist models that reproduce the interpretations of individual annotators.

Method: Comparative experiments on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement arises from cognitive complexity rather than ideological bias. Evaluated existing annotator-specific models and models trained on label distributions.

Result: Annotator-specific models perform poorly on IDRR unless ambiguity is reduced, while models trained on label distributions yield more stable predictions. Frequent cognitively demanding cases drive inconsistency in human interpretation.

Conclusion: Perspectivist modeling faces challenges in IDRR due to cognitive complexity-driven ambiguity, making distribution-based approaches more effective for this type of ambiguous NLP task.

Abstract: There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.
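To make the contrast between majority labels and label distributions concrete, here is a minimal sketch. It assumes a simple setup that the summary does not spell out: annotator votes are turned into a normalized distribution, and a predicted distribution is scored against it with cross-entropy. The label names and vote counts are illustrative, not from the paper's data.

```python
import math
from collections import Counter


def label_distribution(annotations, label_set):
    """Turn a list of annotator labels into a probability distribution."""
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[label] / total for label in label_set]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Illustrative discourse senses and votes (hypothetical).
LABELS = ["cause", "contrast", "conjunction"]
annotations = ["cause", "cause", "conjunction", "cause"]

target = label_distribution(annotations, LABELS)
majority = max(LABELS, key=annotations.count)

# A prediction that matches the annotator split scores better than one
# that spreads probability onto a label nobody chose.
loss_matching = cross_entropy(target, target)
loss_off = cross_entropy(target, [0.25, 0.5, 0.25])
```

The majority label discards the one dissenting vote, while the distribution target keeps it as 25% of the probability mass.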

[24] Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks

Jakub Šmíd, Pavel Přibáň, Pavel Král

Main category: cs.CL

TL;DR: A new Czech restaurant dataset for aspect-based sentiment analysis with opinion term annotations, tested with Transformer models and LLMs in various settings, plus a translation-alignment method for cross-lingual adaptation.

Details

Motivation: To address the lack of resources for aspect-based sentiment analysis (ABSA) in low-resource languages like Czech, particularly with opinion term annotations, and to develop methods for cross-lingual adaptation of ABSA resources.

Method: Created a Czech restaurant domain dataset with opinion term annotations supporting three ABSA tasks. Conducted experiments with Transformer-based models and LLMs in monolingual, cross-lingual, and multilingual settings. Proposed a translation and label alignment methodology using LLMs for cross-lingual adaptation.

Result: The dataset establishes a new benchmark for Czech ABSA. The proposed translation-alignment approach consistently improves cross-lingual performance. Error analysis revealed challenges in detecting subtle opinion terms and nuanced sentiment expressions in Czech.

Conclusion: The work provides valuable resources for Czech ABSA and offers a scalable solution for adapting ABSA to other low-resource languages through LLM-based translation and alignment methods.

Abstract: This paper introduces a novel Czech dataset in the restaurant domain for aspect-based sentiment analysis (ABSA), enriched with annotations of opinion terms. The dataset supports three distinct ABSA tasks involving opinion terms, accommodating varying levels of complexity. Leveraging this dataset, we conduct extensive experiments using modern Transformer-based models, including large language models (LLMs), in monolingual, cross-lingual, and multilingual settings. To address cross-lingual challenges, we propose a translation and label alignment methodology leveraging LLMs, which yields consistent improvements. Our results highlight the strengths and limitations of state-of-the-art models, especially when handling the linguistic intricacies of low-resource languages like Czech. A detailed error analysis reveals key challenges, including the detection of subtle opinion terms and nuanced sentiment expressions. The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.

Nils Schwager, Simon Münker, Alistair Plum, Achim Rettinger

Main category: cs.CL

TL;DR: LLMs as social science “silicon subjects” need validation; Conditioned Comment Prediction (CCP) task evaluates LLM simulation of social media behavior; tested 8B models across languages, finding SFT improves form but degrades semantic grounding in low-resource settings.

Details

Motivation: The paper addresses the need for rigorous validation of LLMs as "silicon subjects" in social science research, particularly for simulating human behavior in digital contexts like social media. Current approaches lack operational validity testing.

Method: Introduces Conditioned Comment Prediction (CCP) task where models predict user comments on stimuli, comparing outputs with authentic digital traces. Evaluates open-weight 8B models (Llama3.1, Qwen3, Ministral) across English, German, and Luxembourgish. Tests prompting strategies (explicit vs. implicit) and Supervised Fine-Tuning (SFT) impact.

Result: SFT aligns surface text structure (length/syntax) but degrades semantic grounding in low-resource settings. Explicit conditioning (biographies) becomes redundant under fine-tuning as models perform latent inference from behavioral histories. Challenges “naive prompting” paradigms.

Conclusion: Provides operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation. Shows form vs. content decoupling in low-resource LLM applications.

Abstract: The transition of Large Language Models (LLMs) from exploratory tools to active “silicon subjects” in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluated open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current “naive prompting” paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.

[26] AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, Rowan Wang

Main category: cs.CL

TL;DR: AuditBench is a benchmark for alignment auditing with 56 language models containing 14 hidden concerning behaviors, used to evaluate auditing tools and investigator agents.

Details

Motivation: There's a need for systematic evaluation of alignment auditing methods to detect hidden behaviors in language models that don't confess when directly asked, requiring quantitative benchmarks for iterative auditing science.

Method: Created 56 language models with 14 concerning behaviors implanted using varying training techniques, developed an investigator agent with configurable auditing tools, and evaluated tool efficacy through agent performance measurements.

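Result: Found a tool-to-agent gap: tools that perform well in standalone evaluations do not necessarily improve investigator-agent performance. The most effective tools use scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can help, but the agent performs best with black-box tools. Audit success varies by training technique: models trained on synthetic documents are easier to audit than models trained on demonstrations.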

Conclusion: AuditBench enables quantitative evaluation of alignment auditing methods, revealing important gaps between tool performance and agent effectiveness, with implications for developing better auditing approaches.

Abstract: We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors (such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties), which it does not confess to when directly asked. AuditBench models are highly diverse: some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench’s utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with our investigator agent. We find that our most effective tools involve scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can be helpful, but the agent performs best with black-box tools. We also find that audit success varies greatly across training techniques: models trained on synthetic documents are easier to audit than models trained on demonstrations, with better adversarial training further increasing auditing difficulty. We release our models, agent, and evaluation framework to support future quantitative, iterative science on alignment auditing.

[27] Towards Better RL Training Data Utilization via Second-Order Rollout

Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui

Main category: cs.CL

TL;DR: A unified RL framework that jointly trains generation and critique capabilities using second-order rollouts (multiple critiques per response) alongside first-order rollouts (multiple responses per question), improving data utilization and performance.

Details

Motivation: Vanilla RL for LLMs focuses only on generation capability via first-order rollouts, neglecting critique capability training and failing to fully exploit training data potential.

Method: Introduces second-order rollout (generating multiple critiques for each response) and proposes a unified framework for joint training of generation and critique capabilities, addressing label balance in critique training and noise in outcome-based rewards through sampling techniques.

Result: Extensive experiments across various models and datasets show the approach utilizes training data more effectively than vanilla RL and achieves better performance with the same training data.

Conclusion: The work offers preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for further advancement of RL training methodologies.

Abstract: Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple responses for a question), and we argue that this approach fails to fully exploit the potential of training data because of the neglect of critique capability training. To tackle this problem, we further introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.
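The two rollout levels can be pictured as a small tree: each question yields several responses, and each response yields several critiques. The sketch below shows only that data shape. The sampler callbacks `toy_generate` and `toy_critique` are hypothetical stand-ins for model calls, and nothing here reflects the paper's actual training loop or reward design.

```python
from typing import Callable, Dict, List


def first_order_rollout(question: str,
                        generate: Callable[[str, int], str],
                        n: int) -> List[str]:
    """Sample n responses for one question."""
    return [generate(question, i) for i in range(n)]


def second_order_rollout(question: str,
                         response: str,
                         critique: Callable[[str, str, int], str],
                         m: int) -> List[str]:
    """Sample m critiques for one (question, response) pair."""
    return [critique(question, response, j) for j in range(m)]


def collect_rollouts(question, generate, critique, n, m) -> Dict[str, List[str]]:
    """Map each sampled response to its list of sampled critiques."""
    tree = {}
    for response in first_order_rollout(question, generate, n):
        tree[response] = second_order_rollout(question, response, critique, m)
    return tree


# Deterministic stubs standing in for model sampling (hypothetical).
def toy_generate(question: str, i: int) -> str:
    return f"answer-{i}"


def toy_critique(question: str, response: str, j: int) -> str:
    return f"critique-{j} of {response}"


tree = collect_rollouts("2 + 2 = ?", toy_generate, toy_critique, n=3, m=2)
```

With n responses and m critiques per response, one question produces n generation samples and n × m critique samples, which is one way to read the claim about better use of the same training data.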

[28] Imagination Helps Visual Reasoning, But Not Yet in Latent Space

You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun

Main category: cs.CL

TL;DR: Latent visual reasoning in MLLMs is ineffective; causal analysis shows disconnects between the input and the latent tokens and between the latent tokens and the answer; explicit text-based imagination (CapImagine) outperforms latent reasoning.

Details

Motivation: To investigate the true effectiveness of latent visual reasoning in Multimodal Large Language Models, which aims to mimic human imagination but whose underlying mechanisms remain unclear.

Method: Use Causal Mediation Analysis to model the process as a causal chain (input as treatment, latent tokens as mediator, answer as outcome). Conduct perturbation experiments and probing analysis to examine connections between components.

Result: Found two critical disconnections: (1) Input-Latent Disconnect - input perturbations cause negligible changes to latent tokens, (2) Latent-Answer Disconnect - latent token perturbations minimally affect final answers. Latent tokens encode limited visual information and are highly similar.

Conclusion: The necessity of latent reasoning is called into question; the proposed alternative, CapImagine, teaches models to imagine explicitly in text and significantly outperforms latent-space baselines on vision-centric benchmarks.

Abstract: Latent visual reasoning aims to mimic the human imagination process by meditating through the hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating that latent tokens have a limited causal effect on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.

[29] Probing for Knowledge Attribution in Large Language Models

Ivo Brink, Alexander Boer, Dennis Ulmer

Main category: cs.CL

TL;DR: A method using linear probes on hidden representations to identify whether LLM outputs come from internal knowledge or prompt context, with self-supervised training data from AttriWiki pipeline.

Details

Motivation: LLMs generate hallucinations from either misusing user context (faithfulness violations) or internal knowledge errors (factuality violations). Proper mitigation requires knowing the dominant knowledge source behind each output to address contributive attribution.

Method: Train simple linear classifiers (probes) on model hidden representations to predict contributive attribution. Use AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, automatically generating labeled examples.

Result: Probes achieve up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, showing link between knowledge source confusion and unfaithful answers.

Conclusion: Probes reveal strong attribution signal in hidden representations, enabling reliable contributive attribution. However, models may still respond incorrectly even with correct attribution, highlighting need for broader detection frameworks beyond just source identification.

Abstract: Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality violations - errors from internal knowledge. Proper mitigation depends on knowing whether a model’s answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
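A linear probe of the kind described here is just a logistic-regression classifier trained on hidden-state vectors. The sketch below trains one with plain gradient descent on made-up 2-D "hidden states". The data, the label convention (1 = answer drawn from memory, 0 = answer read from context), and the hyperparameters are all assumptions for illustration, not the AttriWiki pipeline or the paper's setup.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights with batch gradient descent."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x) -> int:
    """Return 1 (memory) or 0 (context) for one hidden-state vector."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0


# Toy, linearly separable stand-ins for hidden representations.
hidden_states = [[2.0, 0.5], [1.5, -0.3], [2.2, 0.1],
                 [-1.8, 0.4], [-2.1, -0.2], [-1.6, 0.0]]
sources = [1, 1, 1, 0, 0, 0]

weights, bias = train_probe(hidden_states, sources)
train_predictions = [predict(weights, bias, x) for x in hidden_states]
```

In practice the feature vectors would be activations extracted from a chosen layer of the model, and evaluation would use held-out data and a metric such as Macro-F1, as the paper reports.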

[30] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim

Main category: cs.CL

TL;DR: The paper proposes Natural Language Declarative Prompting (NLD-P) as a declarative governance framework for managing prompt behavior stability across evolving large language models, addressing model drift through modular control abstractions encoded in natural language.

Details

Motivation: As LLMs scale and update across generations, prompt behavior becomes unstable due to shifts in instruction-following policies, alignment regimes, and decoding strategies (GPT-scale model drift). Traditional prompt engineering approaches (surface-level formatting and ad-hoc refinement) are insufficient for ensuring stable, interpretable control in evolving LLM ecosystems.

Method: Reconceptualizes NLD-P as a declarative governance method rather than a rigid template. Formalizes NLD-P as a modular control abstraction separating provenance, constraint logic, task content, and post-generation evaluation. The framework is encoded directly in natural language without external orchestration code, with defined minimal compliance criteria and analysis of model-dependent schema receptivity.

Result: NLD-P is positioned as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of the paper's drafting and editorial refinement used a schema-bound LLM assistant configured under NLD-P, with human direction, review, and approval under a documented human-in-the-loop protocol.

Conclusion: The paper outlines implications for declarative control under ongoing model evolution and identifies directions for future empirical validation. NLD-P provides a framework for stable prompt governance as LLMs continue to evolve and drift in behavior.

Abstract: The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generations, prompt behavior becomes sensitive to shifts in instruction-following policies, alignment regimes, and decoding strategies, a phenomenon we characterize as GPT-scale model drift. Under such conditions, surface-level formatting conventions and ad hoc refinement are insufficient to ensure stable, interpretable control. This paper reconceptualizes Natural Language Declarative Prompting (NLD-P) as a declarative governance method rather than a rigid field template. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code. We define minimal compliance criteria, analyze model-dependent schema receptivity, and position NLD-P as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of drafting and editorial refinement employed a schema-bound LLM assistant configured under NLD-P. All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol. The paper concludes by outlining implications for declarative control under ongoing model evolution and identifying directions for future empirical validation.

[31] TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models

Reihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian, Ophir Frieder, Nazli Goharian

Main category: cs.CL

TL;DR: A framework for evaluating cultural competence of LLMs in Persian using hybrid evaluation combining morphological normalization with syntactic-semantic similarity scoring.

Details

Motivation: Existing Persian cultural benchmarks rely on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance, creating a need for Persian-specific evaluation methods.

Method: Introduces a Persian-specific short-answer evaluation framework combining rule-based morphological normalization with a hybrid syntactic and semantic similarity module for robust soft-match scoring beyond exact string overlap.

Result: Evaluation of 15 state-of-the-art open- and closed-source models shows hybrid evaluation improves scoring consistency by +10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect.

Conclusion: The framework provides the first standardized benchmark for measuring cultural understanding in Persian and establishes a reproducible foundation for cross-cultural LLM evaluation research.

Abstract: This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian’s morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by +10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
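The idea of soft-match scoring after normalization can be sketched with the standard library alone. Everything specific below is an assumption: the normalization rules (mapping Arabic yeh and kaf to their Persian forms and dropping the zero-width non-joiner) are common Persian preprocessing steps rather than the paper's rule set, character-level `difflib` similarity stands in for the syntactic component, and token overlap is a crude stand-in for the semantic component, which a real system would more likely compute with embeddings. The weight and threshold are arbitrary.

```python
import difflib

# Assumed normalization rules, not taken from the paper:
# Arabic yeh -> Persian yeh, Arabic kaf -> Persian kaf, drop ZWNJ.
CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9", "\u200c": ""})


def normalize(text: str) -> str:
    """Apply character mapping and collapse whitespace."""
    return " ".join(text.translate(CHAR_MAP).split())


def char_similarity(a: str, b: str) -> float:
    """Surface similarity in [0, 1] from difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens (stand-in for semantic similarity)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def soft_match(prediction: str, reference: str, weight=0.5, threshold=0.7):
    """Return (score, accepted) for a short answer against a reference."""
    p, r = normalize(prediction), normalize(reference)
    score = weight * char_similarity(p, r) + (1 - weight) * token_overlap(p, r)
    return score, score >= threshold


# The same word written with Arabic kaf vs. Persian kaf.
prediction = "\u0643\u062a\u0627\u0628"
reference = "\u06a9\u062a\u0627\u0628"
exact = prediction == reference
score, accepted = soft_match(prediction, reference)
```

Here exact string comparison rejects an answer that differs only in a letter variant, while the normalized soft match accepts it, which is the kind of case the hybrid evaluation is meant to catch.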

[32] TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought

Jianmin Li, Ying Chang, Su-Kit Tang, Yujia Liu, Yanwen Wang, Shuyuan Lin, Binkai Ou

Main category: cs.CL

TL;DR: TCM-DiffRAG: A retrieval-augmented generation framework combining knowledge graphs with chain-of-thought reasoning for traditional Chinese medicine clinical diagnosis, outperforming native LLMs and other RAG methods.

Details

Motivation: Traditional RAG methods perform poorly in traditional Chinese medicine due to complex reasoning processes and substantial individual differences in clinical diagnosis. Need for domain-specific RAG framework tailored to TCM reasoning characteristics.

Method: Developed the TCM-DiffRAG framework, which integrates knowledge graphs with chain-of-thought reasoning. Evaluated on three distinctive TCM test datasets, comparing against native LLMs, supervised fine-tuned models, and other RAG methods.

Result: TCM-DiffRAG achieved significant performance improvements over native LLMs (e.g., qwen-plus scores improved from 0.927/0.361/0.038 to 0.952/0.788/0.356). Outperformed directly supervised fine-tuned LLMs and other benchmark RAG methods. Improvements more pronounced for non-Chinese LLMs.

Conclusion: Integrating structured TCM knowledge graphs with Chain of Thought reasoning substantially improves performance in individualized diagnostic tasks. Joint use of universal and personalized knowledge graphs enables effective alignment between general knowledge and clinical reasoning.

Abstract: Background: Retrieval augmented generation (RAG) technology can empower large language models (LLMs) to generate more accurate, professional, and timely responses without fine tuning. However, due to the complex reasoning processes and substantial individual differences involved in traditional Chinese medicine (TCM) clinical diagnosis and treatment, traditional RAG methods often exhibit poor performance in this domain. Objective: To address the limitations of conventional RAG approaches in TCM applications, this study aims to develop an improved RAG framework tailored to the characteristics of TCM reasoning. Methods: We developed TCM-DiffRAG, an innovative RAG framework that integrates knowledge graphs (KG) with chains of thought (CoT). TCM-DiffRAG was evaluated on three distinctive TCM test datasets. Results: The experimental results demonstrated that TCM-DiffRAG achieved significant performance improvements over native LLMs. For example, the qwen-plus model achieved scores of 0.927, 0.361, and 0.038, which were significantly enhanced to 0.952, 0.788, and 0.356 with TCM-DiffRAG. The improvements were even more pronounced for non-Chinese LLMs. Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods. Conclusions: TCM-DiffRAG shows that integrating structured TCM knowledge graphs with Chain of Thought based reasoning substantially improves performance in individualized diagnostic tasks. The joint use of universal and personalized knowledge graphs enables effective alignment between general knowledge and clinical reasoning. These results highlight the potential of reasoning-aware RAG frameworks for advancing LLM applications in traditional Chinese medicine.

[33] Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features

Mohammad Yeghaneh Abkenar, Weixing Wang, Manfred Stede, Davide Picca, Mark A. Finlayson, Panagiotis Ioannidis

Main category: cs.CL

TL;DR: This paper presents an approach for stance classification in argumentative texts by expanding emotion lexicons using contextualized embeddings (DistilBERT) to improve performance across diverse controversial topics.

Details

Motivation: Prior stance classification work has not systematically incorporated fine-grained emotion analysis, despite arguments often appealing to emotions. Existing research has been limited to specific domains/topics and used non-argumentative texts, limiting generalizability.

Method: Expand the Bias-Corrected NRC Emotion Lexicon using DistilBERT embeddings to identify emotionally charged terms not captured in the original lexicon. Feed the expanded lexicon (eNRC) into a Neural Argumentative Stance Classification model.

Result: The expanded NRC lexicon (eNRC) improves over baseline across all five diverse datasets (up to +6.2 percentage points in F1), outperforms original NRC on four datasets (up to +3.0), and surpasses LLM-based approaches on nearly all corpora.

Conclusion: Systematic expansion of emotion lexicons using contextualized embeddings significantly improves stance classification performance across diverse domains and controversial topics, demonstrating the importance of emotion analysis in argument mining.

Abstract: Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic. While arguments, especially about controversial topics, often appeal to emotions, most prior work has not systematically incorporated explicit, fine-grained emotion analysis to improve performance on this task. In particular, prior research on stance classification has predominantly utilized non-argumentative texts and has been restricted to specific domains or topics, limiting generalizability. We work on five datasets from diverse domains encompassing a range of controversial topics and present an approach for expanding the Bias-Corrected NRC Emotion Lexicon using DistilBERT embeddings, which we feed into a Neural Argumentative Stance Classification model. Our method systematically expands the emotion lexicon through contextualized embeddings to identify emotionally charged terms not previously captured in the lexicon. Our expanded NRC lexicon (eNRC) improves over the baseline across all five datasets (up to +6.2 percentage points in F1 score), outperforms the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all corpora. We provide all resources, including eNRC, the adapted corpora, and model architecture, to enable other researchers to build upon our work.
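One plausible way to expand an emotion lexicon with embeddings is nearest-seed matching: a new word inherits the emotion of the most similar seed word if the cosine similarity clears a threshold. The sketch below shows that idea under loud assumptions. The 2-D vectors are toy stand-ins for DistilBERT embeddings, the threshold is arbitrary, and the matching rule is a guess at the general approach rather than the paper's procedure.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_lexicon(seed_lexicon, embeddings, threshold=0.9):
    """Add words whose embedding is close enough to some seed word.

    seed_lexicon maps word -> emotion; embeddings maps word -> vector.
    A new word takes the emotion of its most similar seed, provided the
    similarity is at least `threshold` (assumed rule).
    """
    expanded = dict(seed_lexicon)
    for word, vector in embeddings.items():
        if word in seed_lexicon:
            continue
        best_emotion, best_sim = None, threshold
        for seed, emotion in seed_lexicon.items():
            if seed not in embeddings:
                continue
            sim = cosine(vector, embeddings[seed])
            if sim >= best_sim:
                best_emotion, best_sim = emotion, sim
        if best_emotion is not None:
            expanded[word] = best_emotion
    return expanded


# Toy vectors and seed entries (hypothetical).
embeddings = {
    "furious": [0.9, 0.1],
    "outraged": [0.85, 0.2],
    "delighted": [0.1, 0.95],
    "table": [0.7, 0.7],
}
seed = {"furious": "anger", "delighted": "joy"}
expanded = expand_lexicon(seed, embeddings)
```

In this toy data "outraged" sits close to "furious" and is added under anger, while "table" is not close enough to either seed and stays out of the lexicon.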

[34] Effective QA-driven Annotation of Predicate-Argument Relations Across Languages

Jonathan Davidov, Aviv Slobodkin, Shmuel Tomi Klein, Reut Tsarfaty, Ido Dagan, Ayal Klein

Main category: cs.CL

TL;DR: Cross-linguistic projection method using QA-SRL framework to extend semantic role labeling to multiple languages via constrained translation and word alignment, outperforming multilingual LLM baselines.

Details

Motivation: Predicate-argument semantic representations are valuable for interpretable semantic analysis but require costly annotation and are largely confined to English. Need to extend semantic annotation to new languages efficiently.

Method: Uses QA-SRL framework as natural-language interface for semantics. Cross-linguistic projection approach reuses English QA-SRL parser within constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French.

Result: Method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick).

Conclusion: QA-SRL serves as transferable natural-language interface enabling efficient and broadly accessible predicate-argument parsing across languages.

Abstract: Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation. However, attaining such semantic structures requires costly annotation efforts and has remained largely confined to English. We leverage the Question-Answer driven Semantic Role Labeling (QA-SRL) framework – a natural-language formulation of predicate-argument relations – as the foundation for extending semantic annotation to new languages. To this end, we introduce a cross-linguistic projection approach that reuses an English QA-SRL parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French – spanning diverse language families – the method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick). By leveraging QA-SRL as a transferable natural-language interface for semantics, our approach enables efficient and broadly accessible predicate-argument parsing across languages.
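
The projection step can be pictured with a small sketch: given word-alignment pairs between an English sentence and its translation, an English answer span is mapped to the target tokens it aligns with. The alignment pairs, sentences, and the helper name `project_span` are hypothetical; the paper's constrained translation and alignment components are not reproduced here.

```python
def project_span(src_span, alignment):
    """Map an inclusive source token span to the covering target span.

    `alignment` is a list of (source_index, target_index) pairs.
    Returns None when no token in the span is aligned.
    """
    start, end = src_span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return (min(targets), max(targets))

# Hypothetical English -> French example with hand-written alignments.
src_tokens = ["The", "doctor", "examined", "the", "patient"]
tgt_tokens = ["Le", "médecin", "a", "examiné", "le", "patient"]
alignment = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5)]

answer_span = (3, 4)  # English answer "the patient"
projected = project_span(answer_span, alignment)
projected_text = " ".join(tgt_tokens[projected[0]: projected[1] + 1])
```

Taking the minimum and maximum aligned index keeps the projected answer contiguous, which is one simple design choice among several possible ones.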

[35] Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference

Yushi Ye, Feng Hong, Huangjie Zheng, Xu Chen, Zhiyong Chen, Yanfeng Wang, Jiangchao Yao

Main category: cs.CL

TL;DR: ReMix introduces a continuous mixing state framework to resolve combinatorial contradictions in diffusion LLMs, enabling 2-8× speedup without quality loss.

Details

Motivation: Diffusion Large Language Models promise fast non-autoregressive inference but suffer from severe quality-speed trade-offs due to "combinatorial contradiction" where parallel tokens form semantically inconsistent combinations during parallel decoding.

Method: ReMix integrates continuous representations into discrete decoding via a Continuous Mixing State as intermediate between masked state and final token state. This allows iterative refinement in continuous space to resolve token conflicts before discrete sampling, with a rejection rule to revert uncertain representations back to masked state for reprocessing.

Result: ReMix achieves 2-8× inference speedup without quality degradation as a training-free method, effectively mitigating combinatorial contradictions in diffusion LLM decoding.

Conclusion: The ReMix framework successfully addresses combinatorial contradictions in diffusion LLMs by enabling continuous-space refinement during discrete diffusion decoding, offering significant speed improvements while maintaining quality.

Abstract: Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. This stems from the “combinatorial contradiction” phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token’s representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation. ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8× inference speedup without any quality degradation.
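
The three-state idea (masked, continuous mixing, final token) can be sketched as a per-position state update. The confidence thresholds `commit` and `reject` and the rule itself are assumptions made for illustration; the abstract does not give ReMix's actual acceptance or rejection criteria.

```python
import numpy as np

MASK, MIX, TOKEN = "mask", "mix", "token"

def update_states(states, probs, commit=0.9, reject=0.4):
    """One toy decoding step over all positions.

    states: list of MASK / MIX / TOKEN, one per position.
    probs:  array (n_positions, vocab_size) of current model distributions.
    Hypothetical rule: masked positions enter the mixing state; mixing
    positions commit to a token when confident, fall back to the mask
    when very uncertain, and otherwise keep refining.
    """
    new_states = []
    for state, p in zip(states, probs):
        confidence = float(np.max(p))
        if state == TOKEN:
            new_states.append(TOKEN)   # already decoded, left untouched
        elif state == MASK:
            new_states.append(MIX)     # start continuous refinement
        elif confidence >= commit:
            new_states.append(TOKEN)   # collapse to a discrete token
        elif confidence < reject:
            new_states.append(MASK)    # rejection: revert for reprocessing
        else:
            new_states.append(MIX)     # keep refining
    return new_states

states = [MASK, MIX, MIX, MIX, TOKEN]
probs = np.array([
    [0.34, 0.33, 0.33],
    [0.95, 0.03, 0.02],
    [0.35, 0.33, 0.32],
    [0.60, 0.30, 0.10],
    [0.98, 0.01, 0.01],
])
next_states = update_states(states, probs)
```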

[36] Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi

Main category: cs.CL

TL;DR: Stitching Noisy Diffusion Thoughts: A self-consistency framework that samples diverse reasoning trajectories with diffusion models, scores intermediate steps with process reward models, stitches best steps across trajectories, then uses an autoregressive solver for final answers.

Details

Motivation: Existing aggregation strategies for multi-chain reasoning discard useful intermediate work from partial or nearly correct attempts by focusing on trajectory-level selection or voting. There's a need to leverage step-level reasoning across multiple trajectories.

Method: Three-step modular pipeline: (1) Sample diverse reasoning trajectories using masked diffusion language model, (2) Score every intermediate step with off-the-shelf process reward model (PRM), (3) Stitch highest-quality steps across trajectories into composite rationale, then condition autoregressive model to compute final answer.

Result: Improves average accuracy by up to 23.8% across six math and coding tasks, achieves up to 1.8x latency reduction compared to traditional diffusion models and unified architectures, with step-level recombination most beneficial on harder problems.

Conclusion: Step-level recombination of reasoning trajectories using diffusion sampling and process reward models provides effective training-free framework for complex reasoning tasks, with modular separation of exploration, evaluation, and solution synthesis.

Abstract: Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or “nearly correct” attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch these highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at https://github.com/roymiles/diffusion-stitching.
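
Steps (ii) and (iii) of the pipeline can be sketched as follows, under the simplifying assumption that steps are aligned by position across trajectories (the abstract does not say how candidate steps are matched). The scorer here is a hand-written lookup table standing in for a process reward model, and the final autoregressive solver is not implemented.

```python
def stitch(trajectories, score_step):
    """Keep, for each step index, the highest-scoring step across trajectories.

    trajectories: list of lists of step strings.
    score_step:   callable mapping a step string to a float (PRM stand-in).
    """
    n_steps = max(len(t) for t in trajectories)
    stitched = []
    for i in range(n_steps):
        candidates = [t[i] for t in trajectories if i < len(t)]
        stitched.append(max(candidates, key=score_step))
    return stitched

# Two toy trajectories and assumed step scores.
trajectories = [
    ["x = 2 + 3", "x = 6", "answer: 12"],
    ["x = 2 * 3", "x = 5", "answer: 10"],
]
toy_scores = {
    "x = 2 + 3": 0.9, "x = 6": 0.1, "answer: 12": 0.2,
    "x = 2 * 3": 0.3, "x = 5": 0.8, "answer: 10": 0.7,
}
stitched = stitch(trajectories, toy_scores.get)

# The composite rationale would then be passed to an AR solver.
rationale = "\n".join(stitched)
```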

[37] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg, Oren Gal

Main category: cs.CL

TL;DR: Investigates where OCR information enters vision-language models using causal interventions, finding architecture-specific OCR bottlenecks and surprisingly that OCR removal can improve counting performance in modular architectures.

Details

Motivation: To understand how optical character recognition (OCR) information is processed and integrated in vision-language models, specifically examining where OCR signals enter the language processing stream across different model architectures.

Method: Uses causal interventions by computing activation differences between original images and text-inpainted versions across three VLM families (Qwen3-VL, Phi-4, InternVL3.5). Applies principal component analysis (PCA) to analyze OCR signal dimensionality and transferability.

Result: Found architecture-specific OCR bottlenecks: DeepStack models peak at mid-depth (~50%) for scene text, while single-stage projection models peak at early layers (6-25%). OCR signal is low-dimensional (PC1 captures 72.9% variance) and PCA directions transfer across datasets. Surprisingly, OCR removal improves counting performance in modular architectures like Qwen3-VL-4B (+6.9 percentage points).

Conclusion: OCR processing pathways vary by VLM architecture but share common low-dimensional representations. Modular OCR circuits can interfere with other visual processing, suggesting potential architectural improvements for better visual-text integration.

Abstract: Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
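
A minimal sketch of the analysis described above: take activation differences between original and text-inpainted inputs, run PCA, and read off the share of variance captured by the first component. The synthetic activations, the mean-centering step, and the function name `pc1_variance_ratio` are assumptions; real activations would come from a specific layer of a VLM.

```python
import numpy as np

def pc1_variance_ratio(acts_original, acts_inpainted):
    """Return (variance share of PC1, PC1 direction) for activation differences.

    Both inputs have shape (n_samples, hidden_dim).
    """
    diffs = acts_original - acts_inpainted
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to variance along each PC.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    return float(variance[0] / variance.sum()), vt[0]

# Synthetic data: one dominant "OCR" direction plus small isotropic noise.
rng = np.random.default_rng(0)
direction = np.zeros(16)
direction[0] = 1.0
strength = rng.normal(size=(200, 1)) * 5.0
noise = rng.normal(size=(200, 16)) * 0.5
acts_inpainted = rng.normal(size=(200, 16))
acts_original = acts_inpainted + strength * direction + noise

ratio, pc1 = pc1_variance_ratio(acts_original, acts_inpainted)
```

A direction recovered this way on one dataset could then be projected out of activations from another dataset to check transfer, which is the kind of test the abstract reports.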

[38] Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee

Main category: cs.CL

TL;DR: Affine-Scaled Attention replaces strict softmax normalization with input-dependent scaling and bias terms, allowing flexible control over attention magnitudes while maintaining value aggregation.

Details

Motivation: Standard softmax attention enforces unit sum normalization which limits flexibility in controlling attention magnitudes and can lead to overly concentrated or unstable attention patterns during training. Existing modifications like attention sinks or gating mechanisms provide only limited control over attention reweighting.

Method: Proposes Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations.

Result: Empirical evaluation in large-scale language model pretraining across multiple model sizes shows consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines.

Conclusion: Modest reweighting of attention outputs through Affine-Scaled Attention provides a practical and effective way to improve attention behavior in Transformer models, offering better control over attention magnitudes while maintaining aggregation properties.

Abstract: Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.
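
One way to read "input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights" is sketched below. The exact parameterization is not given in the abstract, so the per-query scale `1 + tanh(q · w_scale)`, the bias `tanh(q · w_bias) / n_keys`, and the names `w_scale` and `w_bias` are all assumptions chosen only to show how rows can stop summing to one.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def affine_scaled_attention(q, k, v, w_scale, w_bias):
    """Toy single-head attention with an affine transform of the weights.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v); w_scale, w_bias: (d,).
    """
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))              # rows sum to 1
    scale = 1.0 + np.tanh(q @ w_scale)[:, None]          # assumed form, (n_q, 1)
    bias = (np.tanh(q @ w_bias) / k.shape[0])[:, None]   # assumed form, (n_q, 1)
    affine_weights = scale * weights + bias              # rows need not sum to 1
    return affine_weights @ v, affine_weights

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 2))
w_scale = rng.normal(size=4) * 0.1
w_bias = rng.normal(size=4) * 0.1
out, weights = affine_scaled_attention(q, k, v, w_scale, w_bias)
```

With `w_scale` and `w_bias` set to zero this sketch reduces to standard softmax attention, so the affine terms act as a learned departure from the unit-sum constraint.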

[39] Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department

Gabriela Anna Kaczmarek, Pietro Ferrazzi, Lorenzo Porta, Vicky Rubini, Bernardo Magnini

Main category: cs.CL

TL;DR: A new dataset of Italian Emergency Department clinical notes annotated with 134 CRF items, enabling automatic CRF-filling using LLMs in zero-shot settings, with analysis of biases in model behavior.

Details

Motivation: There is increasing interest in automatic CRF-filling from clinical notes using LLMs, but scarcity of annotated CRF data for training and testing limits progress in this area.

Method: Created a new dataset of Italian Emergency Department clinical notes annotated with respect to a pre-defined CRF containing 134 items. Defined CRF-filling task and evaluation metrics, conducted pilot experiments using open-source state-of-the-art LLM in zero-shot setting.

Result: Results show that CRF-filling from real clinical notes in Italian can be approached in zero-shot setting, but LLMs exhibit biases (e.g., cautious behavior favoring “unknown” answers) that need correction.

Conclusion: The dataset enables progress in automatic CRF-filling from clinical notes using LLMs, but model biases must be addressed for reliable clinical applications.

Abstract: Case Report Forms (CRFs) collect data about patients and are at the core of well-established practices to conduct research in clinical settings. With the recent progress of language technologies, there is an increasing interest in automatic CRF-filling from clinical notes, mostly based on the use of Large Language Models (LLMs). However, there is a general scarcity of annotated CRF data, both for training and testing LLMs, which limits the progress on this task. As a step in the direction of providing such data, we present a new dataset of clinical notes from an Italian Emergency Department annotated with respect to a pre-defined CRF containing 134 items to be filled. We provide an analysis of the data, define the CRF-filling task and metric for its evaluation, and report on pilot experiments where we use an open-source state-of-the-art LLM to automatically execute the task. Results of the case-study show that (i) CRF-filling from real clinical notes in Italian can be approached in a zero-shot setting; (ii) LLMs’ results are affected by biases (e.g., a cautious behaviour favours “unknown” answers), which need to be corrected.

[40] Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody

Yuqi Shi, Hao Yang, Xiyao Lu, Jinsong Zhang

Main category: cs.CL

TL;DR: L2 learners acquire target syntactic word order but struggle with prosodic mapping, showing non-linear acquisition where high-proficiency learners match boundary quantity but invert native prosodic hierarchy patterns.

Details

Motivation: To investigate the fossilization and stability of L2 syntax-prosody interface acquisition, specifically how second language learners map acquired syntactic structures onto appropriate prosodic boundaries, which remains a persistent challenge even after mastering word order.

Method: Comparative study of 67 native Mandarin speakers and 67 Vietnamese learners using BLCU-SAIT corpus, integrating C-ToBI boundary annotation with Dependency Grammar analysis to examine both quantity of prosodic boundaries and their mapping to syntactic relations.

Result: High-proficiency learners converge to native baseline in boundary quantity at Major Phrase level but show significant divergence in structural mapping: they demote prosodic boundaries at Subject-Verb interface while erroneously promoting boundaries at Verb-Object interface, resulting in an inverted prosodic hierarchy.

Conclusion: L2 learners adopt a compensatory strategy that maintains high phrasal output at the expense of structural accuracy, leading to fossilized distortion of the native prosodic hierarchy pattern even at advanced proficiency levels.

Abstract: While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge. This study investigates the fossilization and stability of the L2 syntax-prosody interface by comparing 67 native Mandarin speakers with 67 Vietnamese learners using the BLCU-SAIT corpus. By integrating C-ToBI boundary annotation with Dependency Grammar analysis, we examined both the quantity of prosodic boundaries and their mapping to syntactic relations. Results reveal a non-linear acquisition pattern: although high-proficiency learners (VNH) converge to the native baseline in boundary quantity at the Major Phrase level (B3), their structural mapping significantly diverges. Specifically, VNH demote the prosodic boundary at the Subject-Verb (SBV) interface (Major Phrase B3 -> Prosodic Word B1), while erroneously promoting the boundary at the Verb-Object (VOB) interface (Prosodic Word B1 -> Major Phrase B3). This strategy allows learners to maintain a high output of long phrases at the expense of structural accuracy, resulting in a distorted prosodic hierarchy in which the native pattern is inverted.

[41] CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li, Jun Chen, Shaobo Cui, Zhiyang Su

Main category: cs.CL

TL;DR: CiteLLM is an agentic platform that integrates LLMs within LaTeX editors for trustworthy reference discovery, addressing AI ethics concerns in academic writing through local processing and discipline-aware routing to trusted repositories.

Details

Motivation: Address ethical challenges in AI-assisted scholarly activities: (1) trustworthiness of AI-generated content, (2) preservation of academic integrity and intellectual property, and (3) protection of information privacy.

Method: Embed LLM utilities directly within LaTeX editor environment for local processing; use dynamic discipline-aware routing to retrieve candidates exclusively from trusted academic repositories; LLMs generate search queries, rank candidates, and validate support through paragraph-level semantic matching and integrated chatbot.

Result: Evaluation demonstrates superior performance in returning valid and highly usable references compared to baseline approaches.

Conclusion: CiteLLM provides a trustworthy reference discovery system that addresses key ethical concerns in AI-assisted academic writing while maintaining high utility.

Abstract: Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation of academic integrity and intellectual property, and (3) protection of information privacy. In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements. The system introduces a novel interaction paradigm by embedding LLM utilities directly within the LaTeX editor environment, ensuring a seamless user experience and no data transmission outside the local system. To guarantee hallucination-free references, we employ dynamic discipline-aware routing to retrieve candidates exclusively from trusted web-based academic repositories, while leveraging LLMs solely for generating context-aware search queries, ranking candidates by relevance, and validating and explaining support through paragraph-level semantic matching and an integrated chatbot. Evaluation results demonstrate the superior performance of the proposed system in returning valid and highly usable references.

[42] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

Boyang Zhang, Yang Zhang

Main category: cs.CL

TL;DR: LLM agent framework (SALA) combines stylometric features with LLM reasoning for authorship attribution and privacy protection through guided text recomposition.

Details

Motivation: Address growing concerns about unintended deanonymization risks in textual data due to LLM authorship inference capabilities, aiming to both evaluate and mitigate these privacy risks.

Method: Proposes SALA (Stylometry-Assisted LLM Analysis) method integrating quantitative stylometric features with LLM reasoning, augmented with database module, and includes guided recomposition strategy using reasoning traces to generate rewriting prompts.

Result: SALA achieves high inference accuracy in various scenarios on large-scale news datasets, and the guided recomposition strategy effectively reduces authorship identifiability while preserving textual meaning.

Conclusion: Highlights both the deanonymization potential of LLM agents and the importance of interpretable, proactive defenses for safeguarding author privacy in textual data.

Abstract: The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended deanonymization risks in textual data such as news articles. In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline. Central to our framework is the proposed SALA (Stylometry-Assisted LLM Analysis) method, which integrates quantitative stylometric features with LLM reasoning for robust and transparent authorship attribution. Experiments on large-scale news datasets demonstrate that SALA, particularly when augmented with a database module, achieves high inference accuracy in various scenarios. Finally, we propose a guided recomposition strategy that leverages the agent’s reasoning trace to generate rewriting prompts, effectively reducing authorship identifiability while preserving textual meaning. Our findings highlight both the deanonymization potential of LLM agents and the importance of interpretable, proactive defenses for safeguarding author privacy.
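
To make "quantitative stylometric features" concrete, the sketch below computes a few classic ones from raw text. The abstract does not list which features SALA uses, so this particular feature set and the function name `stylometric_features` are assumptions; in a pipeline like the one described, such numbers would be handed to the LLM alongside the text.

```python
import re

def stylometric_features(text):
    """Compute a handful of simple, commonly used stylometric features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "comma_rate": text.count(",") / len(words),
    }

sample = "The cat sat. The cat ran, fast!"
features = stylometric_features(sample)
```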

[43] Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

Jayadev Billa

Main category: cs.CL

TL;DR: Multimodal LLMs fail to utilize speaker identity, emotion, and visual texture information despite these features surviving through all layers, due to a decoder bottleneck where only text-aligned information is accessible.

Details

Motivation: Current multimodal LLMs can process speech and images but fail to capture important perceptual details like speaker identity, emotion, and visual texture. The paper investigates why these features are not utilized despite being present in the representations.

Method: The authors formalize this as a mismatched decoder problem where decoders trained on text can only extract information along text-aligned directions. They use linear probes to measure information retention, analyze Generalized Mutual Information bounds, and conduct controlled experiments across five models spanning speech and vision. They also test a LoRA intervention with emotion objectives.

Result: Speaker identity, emotion, and visual attributes survive through all LLM layers (3-55× above chance in linear probes), yet removing 64-71% of modality-specific variance improves decoder loss. The bottleneck is confirmed to be the decoder’s scoring rule, not the encoder or projection. A LoRA intervention with an emotion objective improves emotion accessibility by 7.5% without affecting other attributes.

Conclusion: Multimodal LLMs have a fundamental decoder bottleneck where only text-aligned information is accessible, regardless of how non-text inputs are encoded. The training objective determines what information becomes accessible, and targeted interventions can improve specific attribute accessibility without harming overall performance.

Abstract: Multimodal LLMs can process speech and images, but they cannot hear a speaker’s voice or see an object’s texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3–55× above chance in linear probes), yet removing 64–71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder’s scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder’s scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility (+7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.
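
The "above chance in linear probes" measurement can be illustrated with a small least-squares probe on synthetic hidden states. The data, the one-hot regression probe, and the function name `linear_probe_accuracy` are assumptions for illustration; the paper's probes and features are not reproduced here.

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, n_classes):
    """Fit a linear map (with bias) to one-hot labels and report test accuracy."""
    onehot = np.eye(n_classes)[train_y]
    train_b = np.hstack([train_x, np.ones((len(train_x), 1))])
    w, *_ = np.linalg.lstsq(train_b, onehot, rcond=None)
    test_b = np.hstack([test_x, np.ones((len(test_x), 1))])
    preds = (test_b @ w).argmax(axis=1)
    return float((preds == test_y).mean())

# Synthetic "hidden states": each class (e.g. a speaker) has its own center.
rng = np.random.default_rng(0)
n_classes, dim = 4, 32
labels = rng.integers(0, n_classes, size=400)
centers = rng.normal(size=(n_classes, dim)) * 3.0
hidden = centers[labels] + rng.normal(size=(400, dim))

accuracy = linear_probe_accuracy(
    hidden[:300], labels[:300], hidden[300:], labels[300:], n_classes
)
chance = 1.0 / n_classes
times_above_chance = accuracy / chance
```

A high ratio here only shows that the attribute is linearly recoverable from the representation; the paper's point is that the decoder may still be unable to use it.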

[44] MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa, Marina Danilevsky

Main category: cs.CL

TL;DR: MTRAG-UN is a benchmark for evaluating multi-turn retrieval augmented generation (RAG) systems, focusing on challenging conversation scenarios with unanswerable, underspecified, non-standalone questions and unclear responses.

Details

Motivation: Current RAG systems struggle with complex multi-turn conversations involving ambiguous, incomplete, or unanswerable questions, creating a need for comprehensive evaluation benchmarks to drive improvement in these challenging scenarios.

Method: Created a benchmark of 666 tasks with over 2,800 conversation turns across 6 domains, each accompanied by relevant corpora. The benchmark specifically targets challenging conversation types: UNanswerable, UNderspecified, NONstandalone questions and UNclear responses.

Result: Experiments show that current retrieval and generation models continue to struggle with the challenging conversation types in the benchmark, highlighting significant gaps in multi-turn RAG capabilities.

Conclusion: MTRAG-UN provides a valuable benchmark for evaluating and improving multi-turn RAG systems, revealing that current models need substantial advancement to handle complex conversational scenarios with ambiguous or incomplete information.

Abstract: We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark

[45] Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Chungpa Lee, Jy-yong Sohn, Kangwook Lee

Main category: cs.CL

TL;DR: Fine-tuning LLMs improves zero-shot performance but degrades in-context learning; theoretical analysis shows fine-tuning all attention parameters harms few-shot learning, while updating only value matrices preserves it.

Details

Motivation: Fine-tuning large language models improves zero-shot performance on downstream tasks but often degrades their in-context learning ability, limiting performance on unseen tasks. The paper aims to understand this trade-off theoretically.

Method: Theoretical analysis using linear attention models to characterize how fine-tuning objectives modify attention parameters. Examines conditions where fine-tuning degrades few-shot performance, compares full parameter updates vs. value matrix-only updates, and studies auxiliary few-shot loss effects.

Result: Fine-tuning all attention parameters harms in-context learning, while restricting updates to value matrices improves zero-shot performance while preserving in-context learning. Auxiliary few-shot loss enhances in-context learning on target tasks but degrades it on unseen tasks.

Conclusion: There’s a trade-off between zero-shot improvement and in-context learning preservation during fine-tuning. Selective parameter updates (value matrices only) can maintain both capabilities, while task-specific optimization comes at the cost of generalization to unseen tasks.

Abstract: Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.
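
The value-only update can be pictured on a toy linear attention model with a scalar readout. This is a simplified stand-in, not the paper's exact model: the shapes, the squared-error loss, the learning rate, and the names `predict` and `value_only_step` are assumptions. The key-query matrix stays frozen and only the value vector takes a gradient step.

```python
import numpy as np

def predict(w_kq, w_v, context, query):
    """Toy linear attention (no softmax) with a scalar readout.

    context: (n, d) context tokens; query: (d,); w_kq: (d, d); w_v: (d,).
    """
    scores = context @ w_kq @ query / len(context)   # (n,) attention scores
    return float(w_v @ (context.T @ scores))

def value_only_step(w_kq, w_v, context, query, target, lr=0.01):
    """One gradient step on 0.5 * (prediction - target)^2, updating w_v only."""
    scores = context @ w_kq @ query / len(context)
    feature = context.T @ scores                     # (d,) attended summary
    error = float(w_v @ feature) - target
    new_w_v = w_v - lr * error * feature             # gradient w.r.t. w_v
    return w_kq, new_w_v                             # w_kq is returned unchanged

rng = np.random.default_rng(0)
d, n = 4, 8
context, query = rng.normal(size=(n, d)), rng.normal(size=d)
w_kq, w_v, target = np.eye(d), rng.normal(size=d), 1.0

before = abs(predict(w_kq, w_v, context, query) - target)
new_w_kq, new_w_v = value_only_step(w_kq, w_v, context, query, target)
after = abs(predict(new_w_kq, new_w_v, context, query) - target)
```

Because the scores depend only on the frozen `w_kq`, how the model attends over the context is untouched by this update, which is the intuition behind preserving in-context behavior.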

[46] Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu

Main category: cs.CL

TL;DR: NAP is a data-centric approach that curates multiple independent reasoning trajectories and uses parallel-forced decoding to encourage non-autoregressive parallel generation in Diffusion Language Models, reducing AR-like behavior.

Details

Motivation: Current Diffusion Language Models (DLMs) often converge to autoregressive-like decoding despite being advertised as enabling parallel token generation. This sequential bottleneck limits hardware parallelism and latency scaling. The authors identify a mismatch between DLM objectives and sequential training data as the primary driver of this AR-like behavior.

Method: NAP (Non-Autoregressive Parallel DLMs) is a data-centric approach that: 1) Curates training examples as multiple independent reasoning trajectories instead of standard sequential data, 2) Uses a parallel-forced decoding strategy that encourages multi-token parallel updates, and 3) Better aligns supervision with non-AR parallel decoding objectives.

Result: Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long chain-of-thought data. Gains grow as parallelism increases, demonstrating improved non-autoregressive generation capabilities.

Conclusion: Revisiting data and supervision is a principled direction for mitigating AR-like behavior in DLMs and moving toward genuinely non-autoregressive parallel generation. The approach shows promise for better exploiting parallel hardware and improving latency scaling.

Abstract: Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR’s sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.

[47] Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao, Chu-Ren Huang, Jinghang Gu, Changqing Yin, Haizhou Li

Main category: cs.CL

TL;DR: DDTSR is a low-latency framework for spoken dialogue systems that enables listen-while-thinking and speak-while-thinking by overlapping ASR, LLM reasoning, and TTS processes through connective-guided model synergy and streaming cross-modal collaboration.

Details

Motivation: Conventional ASR-LLM-TTS pipelines suffer from high response latency due to strictly sequential processing, requiring complete transcription and reasoning before speech synthesis can begin. This prevents human-like responsiveness in spoken dialogue systems.

Method: Three key mechanisms: 1) Connective-guided small-large model synergy where a small model generates discourse connectives while a large model performs reasoning in parallel; 2) Streaming-based cross-modal collaboration that dynamically overlaps ASR, LLM inference, and TTS; 3) Curriculum-learning-based discourse continuity enhancement to maintain coherence between early responses and subsequent outputs.

Result: Experiments on two spoken dialogue benchmarks show DDTSR reduces response latency by 19%-51% while preserving discourse quality. The framework functions as a plug-and-play module compatible with diverse LLM backbones and remains robust across varying utterance lengths.

Conclusion: DDTSR provides a practical and scalable solution for real-time spoken interaction by enabling low-latency, human-like responsiveness through parallel processing and streaming collaboration across ASR, LLM, and TTS components.

Abstract: Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.
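
The dual-track idea (a small model produces a connective while the large model is still reasoning) can be sketched with `asyncio`. This is only an illustration of the overlap, not the DDTSR implementation: both "models" are stand-in coroutines with made-up delays and outputs, and every function name here is hypothetical.

```python
import asyncio

async def small_model_connective(transcript: str) -> str:
    """Stand-in for the fast auxiliary model that emits a discourse connective."""
    await asyncio.sleep(0.01)
    return "Well,"

async def large_model_answer(transcript: str) -> str:
    """Stand-in for the slower, knowledge-intensive large model."""
    await asyncio.sleep(0.1)
    return "the museum opens at nine."

async def respond(transcript: str) -> tuple[list[str], bool]:
    spoken: list[str] = []
    # Start the large model first so its reasoning runs in the background.
    answer_task = asyncio.create_task(large_model_answer(transcript))
    # The connective is ready early and would be handed to TTS right away.
    spoken.append(await small_model_connective(transcript))
    spoke_before_answer = not answer_task.done()
    # The main answer follows once the large model finishes.
    spoken.append(await answer_task)
    return spoken, spoke_before_answer

spoken, spoke_before_answer = asyncio.run(respond("When does the museum open?"))
```

In a real pipeline the two appended strings would be streamed to a TTS engine rather than collected in a list, and ASR would feed partial transcripts in as they arrive.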

[48] SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Sungho Park, Jueun Kim, Wook-Shin Han

Main category: cs.CL

TL;DR: SPARTA is an automated framework for generating large-scale Table-Text QA benchmarks with complex multi-hop reasoning and aggregation operations, exposing limitations in current cross-modal models.

Details

Motivation: Existing Table-Text QA benchmarks are small, manually curated, error-prone, and contain shallow questions that rarely require complex operations like aggregation, grouping, or multi-hop reasoning across text and tables.

Method: End-to-end framework that: 1) constructs reference fact database by enriching source tables with grounding tables from unstructured passages, 2) synthesizes nested queries matching desired hop counts, 3) uses provenance-based refinement to ensure executable SQL queries, and 4) employs realistic-structure enforcement via post-order traversal of query graphs.

Result: Generated thousands of high-fidelity QA pairs covering aggregations, grouping, and deep multi-hop reasoning. State-of-the-art models that perform well on existing benchmarks (70+ F1 on HybridQA, 50+ F1 on OTT-QA) drop by more than 30 F1 points on SPARTA.

Conclusion: SPARTA exposes fundamental weaknesses in current cross-modal reasoning models and provides a scalable benchmark generation framework requiring only 25% of the annotation time of HybridQA.

Abstract: Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns an empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
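
As a rough illustration of nested queries built by post-order traversal, the sketch below renders a small query tree into SQL (children first, then the enclosing query) and executes it with Python's built-in `sqlite3`. It is not SPARTA's generator: the `QueryNode` structure, the toy schema, and the way nested predicates are counted are assumptions made for this example.

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    table: str
    select: str
    filter: str = ""        # optional plain predicate on this table
    parent_col: str = ""    # column of the parent table matched against this subquery
    children: list = field(default_factory=list)

def to_sql(node: QueryNode) -> str:
    """Post-order rendering: every child subquery is built before its parent."""
    preds = [node.filter] if node.filter else []
    for child in node.children:
        preds.append(f"{child.parent_col} IN ({to_sql(child)})")
    where = " WHERE " + " AND ".join(preds) if preds else ""
    return f"SELECT {node.select} FROM {node.table}{where}"

def nested_predicates(node: QueryNode) -> int:
    """Number of subquery links in the tree (used here as the hop count)."""
    return sum(1 + nested_predicates(c) for c in node.children)

# Hypothetical two-hop question: "Which players play for a team based in Boston?"
city = QueryNode("cities", "id", filter="name = 'Boston'", parent_col="city_id")
team = QueryNode("teams", "id", parent_col="team_id", children=[city])
root = QueryNode("players", "name", children=[team])
sql = to_sql(root)

# Execute against a toy in-memory database to confirm the query runs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER, name TEXT);
    CREATE TABLE teams (id INTEGER, city_id INTEGER);
    CREATE TABLE players (name TEXT, team_id INTEGER);
    INSERT INTO cities VALUES (1, 'Boston'), (2, 'Denver');
    INSERT INTO teams VALUES (10, 1), (20, 2);
    INSERT INTO players VALUES ('Ana', 10), ('Bo', 20), ('Cy', 10);
""")
names = sorted(row[0] for row in conn.execute(sql))
conn.close()
```

Running the query and checking that the result is not empty mirrors, in a very simplified way, the executability check the abstract describes.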

[49] Scale Can’t Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna

Main category: cs.CL

TL;DR: VLMs lack reasoning due to reporting bias in training data, which omits tacit information needed for spatial, temporal, negation, and counting reasoning. Scaling doesn’t help, but targeted annotations do.

Details

Motivation: Vision-Language Models lack reasoning capabilities, which the authors attribute to reporting bias in training data - how people naturally describe visual content omits tacit information needed for reasoning tasks.

Method: Analyzed training data of popular VLMs (OpenCLIP, LLaVA-1.5, Molmo) through pragmatics theory, identified four suppressed reasoning skills, created curated benchmarks to test performance, and experimented with targeted annotation incorporation.

Result: VLMs perform poorly on reasoning tasks suppressed by reporting bias; scaling data/model size doesn’t improve these skills; but incorporating specifically collected annotations for tacit information is effective.

Conclusion: Need intentional training data curation methods rather than relying on scale for emergence of reasoning capabilities in VLMs.

Abstract: The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., “at the game today!” is a more likely caption than “a photo of 37 people standing behind a field”. We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.

[50] Evaluating the Diversity and Quality of LLM Generated Content

Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, Osbert Bastani

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2504.12522 was rate-limited (HTTP 429).

[51] Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task

Mengyang Qiu, Zoe Brisebois, Siena Sun

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2505.16164 was rate-limited (HTTP 429).

[52] When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations

Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang, Zhi Gao, Zilong Zheng, Lei Liu, Bin Li, Qing Li

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2505.24449 was rate-limited (HTTP 429).

[53] DeVisE: Behavioral Testing of Medical Large Language Models

Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2506.15339 was rate-limited (HTTP 429).

[54] Parallel Continuous Chain-of-Thought with Jacobi Iteration

Haoyi Wu, Zhihao Teng, Kewei Tu

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2506.18582 was rate-limited (HTTP 429).

[55] A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2507.08491 was rate-limited (HTTP 429).

[56] Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility

Michael A. Lepori, Jennifer Hu, Ishita Dasgupta, Roma Patel, Thomas Serre, Ellie Pavlick

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2507.12553 was rate-limited (HTTP 429).

[57] UPDESH: Synthesizing Grounded Instruction Tuning Data for 13 Indic Languages

Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2509.21294 was rate-limited (HTTP 429).

[58] Fine-tuning Done Right in Model Editing

Wanli Yang, Rui Tang, Hongyu Zang, Du Su, Qi Cao, Jingang Wang, Huawei Shen, Xueqi Cheng, Fei Sun

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2509.22072 was rate-limited (HTTP 429).

[59] Inducing Dyslexia in Vision Language Models

Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2509.24597 was rate-limited (HTTP 429).

[60] Generative Value Conflicts Reveal LLM Priorities

Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh, Max Kleiman-Weiner

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2509.25369 was rate-limited (HTTP 429).

[61] Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.05154 was rate-limited (HTTP 429).

[62] Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.05534 was rate-limited (HTTP 429).

[63] Mapping Semantic & Syntactic Relationships with Geometric Rotation

Michael Freenor, Lauren Alvarez

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.09790 was rate-limited (HTTP 429).

[64] RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.20505 was rate-limited (HTTP 429).

[65] PARL: Prompt-based Agents for Reinforcement Learning

Yarik Menchaca Resendiz, Roman Klinger

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.21306 was rate-limited (HTTP 429).

[66] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He

Main category: cs.CL

TL;DR: Toolathlon is a comprehensive benchmark for language agents with 32 apps, 604 tools, and 108 realistic multi-step tasks requiring coordination across diverse software applications.

Details

Motivation: Existing language agent benchmarks are too narrow, simplified, and lack the diversity, realism, and long-horizon complexity needed to evaluate real-world agent performance across complex workflows.

Method: Created Toolathlon benchmark with 32 software applications and 604 tools using Model Context Protocol servers, realistic initial environment states from real software, and 108 manually crafted tasks requiring multi-app coordination over ~20 turns.

Result: State-of-the-art models perform poorly: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool-calling turns on average, while the top open-weights model, DeepSeek-V3.2-Exp, reaches 20.1%.

Conclusion: Toolathlon reveals significant shortcomings in current language agents for real-world complex tasks and is expected to drive development of more capable agents for long-horizon execution.

Abstract: Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents’ real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.

[67] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.25992 was rate-limited (HTTP 429).

[68] Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models

Yinrong Hong, Zhiquan Tan, Kai Hu

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2510.26577 was rate-limited (HTTP 429).

[69] Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2511.05541 was rate-limited (HTTP 429).

[70] Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL

Khushboo Thaker, Yony Bresler

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2512.17053 was rate-limited (HTTP 429).

[71] Document Reconstruction Unlocks Scalable Long-Context RLVR

Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim, Xiaoli Li, Roy Ka-wei Lee, Lidong Bing

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.08237 was rate-limited (HTTP 429).

[72] The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task

Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.11221 was rate-limited (HTTP 429).

[73] Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Tao Xu

Main category: cs.CL

TL;DR: Summary unavailable; the arXiv API request for 2602.14162 was rate-limited (HTTP 429).

[74] Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.14812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[75] BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo

Main category: cs.CL

TL;DR: BankMathBench: A domain-specific dataset for evaluating LLMs’ numerical reasoning in banking scenarios, addressing systematic errors in financial computations like interest calculations and product comparisons.

Details

Motivation: LLMs in banking applications struggle with core financial computations requiring multi-step numerical reasoning, but existing benchmarks don't capture these domain-specific errors in everyday banking scenarios.

Method: Created BankMathBench dataset with three difficulty levels (basic, intermediate, advanced) covering single-product reasoning, multi-product comparison, and multi-condition scenarios, then fine-tuned open-source LLMs on this data.

Result: With tool-augmented fine-tuning, open-source LLMs achieved average accuracy gains of 57.6 percentage points (basic), 75.1 (intermediate), and 62.9 (advanced) over zero-shot baselines, demonstrating enhanced domain-specific reasoning.

Conclusion: BankMathBench effectively addresses LLMs’ limitations in banking computations and serves as a reliable benchmark for advancing numerical reasoning in real-world financial applications.

Abstract: Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. Such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty (basic, intermediate, and advanced), corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset’s effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs’ numerical reasoning in real-world banking scenarios.
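
To make the "exponents and geometric progressions" mentioned in the abstract concrete, here is a minimal sketch of one such computation: the maturity value of an installment savings product. The product terms (equal deposits at the start of each month, monthly compounding) and the function names are assumptions chosen for illustration; they are not taken from BankMathBench.

```python
def installment_payout(monthly_deposit: float, annual_rate: float, months: int) -> float:
    """Closed-form maturity value of equal monthly deposits (hypothetical product).

    Assumes each deposit is made at the start of a month and interest is
    compounded monthly at annual_rate / 12.
    """
    r = annual_rate / 12
    if r == 0:
        return monthly_deposit * months
    # The deposit made k months before maturity grows by (1 + r) ** k, so the
    # total is the geometric series d * sum_{k=1..n} (1 + r) ** k.
    return monthly_deposit * (1 + r) * ((1 + r) ** months - 1) / r


def installment_payout_loop(monthly_deposit: float, annual_rate: float, months: int) -> float:
    """Same quantity computed month by month, as a cross-check on the formula."""
    r = annual_rate / 12
    balance = 0.0
    for _ in range(months):
        balance = (balance + monthly_deposit) * (1 + r)
    return balance


if __name__ == "__main__":
    print(installment_payout(100.0, 0.06, 12))
    print(installment_payout_loop(100.0, 0.06, 12))
```

Comparing the closed form against the explicit loop is the kind of multi-step check the abstract says models often get wrong, for example by dropping the exponent or summing the wrong number of terms.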

Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in fetching paper content

Method: Unable to determine method due to technical error in fetching paper content

Result: Unable to determine results due to technical error in fetching paper content

Conclusion: Unable to draw conclusions due to technical error in fetching paper content

Abstract: Failed to fetch summary for 2602.21947: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21947&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[77] Cost-of-Pass: An Economic Framework for Evaluating Language Models

Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to data fetch failure

Method: Unable to determine method due to data fetch failure

Result: Unable to determine results due to data fetch failure

Conclusion: Unable to draw conclusions due to data fetch failure

Abstract: Failed to fetch summary for 2504.13359: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.13359&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[78] Knowledge Fusion of Large Language Models Via Modular SkillPacks

Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, Jing Li

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2505.18502: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.18502&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[79] LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to draw conclusions due to failed API request

Abstract: Failed to fetch summary for 2508.01780: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.01780&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[80] Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection

Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li, Jose M. Alvarez

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2509.03113: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03113&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[81] Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

Jingen Qu, Lijun Li, Bo Zhang, Yichen Yan, Jing Shao

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to determine conclusion due to access restrictions

Abstract: Failed to fetch summary for 2509.04403: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.04403&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[82] Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies

Chunsan Hong, Seonho An, Min-Soo Kim, Jong Chul Ye

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2510.05725 suggests it’s from October 2025, but no content available for analysis.

Details

Motivation: Cannot determine motivation due to lack of access to paper content. The HTTP 429 error indicates the arXiv API is rate limiting requests.

Method: No method information available. The paper ID format suggests it’s a recent paper from October 2025, but the abstract/content cannot be retrieved.

Result: No results available. The arXiv API returned HTTP 429 (Too Many Requests), preventing access to the paper’s content.

Conclusion: Unable to analyze paper due to technical limitations. The arXiv API rate limiting prevents retrieval of paper details for analysis.

Abstract: Failed to fetch summary for 2510.05725: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05725&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[83] PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing

Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma

Main category: cs.CL

TL;DR: Paper 2510.07452: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot determine conclusion due to inability to access paper content

Abstract: Failed to fetch summary for 2510.07452: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07452&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[84] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2510.19060: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19060&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[85] Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2511.07885: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07885&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[86] Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li, Ruixiang Luo, Haoxiang Sun, Yucheng Wang, Zhengze Li, Meng Wang, Yuetian Du, Guojie Lin, Yaxuan Wang, Xiaoxiao Xu, Yanhu Mo, Xuan Ren, Hu Wei, Bing Zhao

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Unable to determine motivation due to technical error in fetching paper content

Method: Unable to determine method due to technical error in fetching paper content

Result: Unable to determine results due to technical error in fetching paper content

Conclusion: Unable to determine conclusion due to technical error in fetching paper content

Abstract: Failed to fetch summary for 2602.00564: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00564&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[87] Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin

Main category: cs.CL

TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method as paper content is unavailable due to API restrictions

Result: No results available - technical issue with arXiv API access

Conclusion: Cannot provide analysis due to technical limitations in accessing the paper

Abstract: Failed to fetch summary for 2602.12125: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12125&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[88] GPT-4o Lacks Core Features of Theory of Mind

John Muchovej, Amanda Royka, Shane Lee, Julian Jara-Ettinger

Main category: cs.CL

TL;DR: Unable to analyze paper 2602.12150 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable

Method: Cannot determine method as abstract is unavailable

Result: Cannot determine results as abstract is unavailable

Conclusion: Cannot determine conclusion as abstract is unavailable

Abstract: Failed to fetch summary for 2602.12150: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12150&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[89] Symmetry in language statistics shapes the geometry of model representations

Dhruva Karkada, Daniel J. Korchinski, Andres Nava, Matthieu Wyart, Yasaman Bahri

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to draw conclusions due to failed API request

Abstract: Failed to fetch summary for 2602.15029: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15029&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[90] PuppetChat: Fostering Intimate Communication through Bidirectional Actions and Micronarratives

Emma Jiren Wang, Siying Hu, Zhicong Lu

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.19463: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19463&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[91] Both Ends Count! Just How Good are LLM Agents at “Text-to-Big SQL”?

Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2602.21480: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21480&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.CV

[92] Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing

Renyu Yang, Jian Jin, Lili Meng, Meiqin Liu, Yilin Wang, Balu Adsumilli, Weisi Lin

Main category: cs.CV

TL;DR: A practical approach for constructing large-scale, diverse audio-visual quality assessment datasets using crowdsourcing and systematic data preparation, validated with YT-NTU-AVQ dataset.

Details

Motivation: Existing AVQA datasets are too small, lack diversity in content and quality, and only provide overall scores, limiting model development and multimodal perception research.

Method: 1) Crowdsourced subjective experiment framework for reliable annotation across varied environments, 2) Systematic data preparation for broad coverage of quality levels and semantic scenarios, 3) Extension with additional annotations for multimodal perception research.

Result: Created YT-NTU-AVQ, the largest and most diverse AVQA dataset with 1,620 user-generated audio/video sequences, with dataset and platform code publicly available.

Conclusion: The proposed practical approach successfully addresses limitations of existing AVQA datasets and enables better model development and multimodal perception research.

Abstract: Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is further employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ

[93] Enabling clinical use of foundation models in histopathology

Audun L. Henriksen, Ole-Johan Skrede, Lisa van der Schee, Enric Domingo, Sepp De Raedt, Ilyá Kostolomov, Jennifer Hay, Karolina Cyll, Wanja Kildal, Joakim Kalsnes, Robert W. Williams, Manohar Pradhan, John Arne Nesheim, Hanne A. Askautrud, Maria X. Isaksen, Karmele Saez de Gordoa, Miriam Cuatrecasas, Joanne Edwards, TransSCOT group, Arild Nesbakken, Neil A. Shepherd, Ian Tomlinson, Daniel-Christoph Wagner, Rachel S. Kerr, Tarjei Sveinsgjerd Hveem, Knut Liestøl, Yoshiaki Nakamura, Marco Novelli, Masaaki Miyo, Sebastian Foersch, David N. Church, Miangela M. Lacle, David J. Kerr, Andreas Kleppe

Main category: cs.CV

TL;DR: Novel robustness losses during downstream training reduce technical variability bias in histopathology foundation models without retraining the foundation models themselves.

Details

Motivation: Current histopathology foundation models capture both biologically relevant features and technical artifacts (pre-analytic and scanner-specific variations), which bias downstream task-specific models. There's a need to develop robust computational pathology models that work reliably across different technical conditions in clinical practice.

Method: Introduces novel robustness losses during training of downstream task-specific models using features from foundation models. Uses comprehensive experimentation with 27,042 whole slide images from 6,155 patients to train thousands of models from eight popular computational pathology foundation models.

Result: Substantial improvement in robustness to technical variability while also improving prediction accuracy by focusing on biologically relevant features. Successfully mitigates robustness issues without retraining the foundation models themselves.

Conclusion: The approach enables development of robust computational pathology models applicable to real-world clinical data by reducing sensitivity to technical artifacts while maintaining or improving accuracy through better focus on biological features.

Abstract: Foundation models in histopathology are expected to facilitate the development of high-performing and generalisable deep learning systems. However, current models capture not only biologically relevant features, but also pre-analytic and scanner-specific variation that biases the predictions of task-specific models trained from the foundation model features. Here we show that introducing novel robustness losses during training of downstream task-specific models reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 WSIs from 6,155 patients is used to train thousands of models from the features of eight popular foundation models for computational pathology. In addition to a substantial improvement in robustness, we observe that prediction accuracy improves by focusing on biologically relevant features. Our approach successfully mitigates robustness issues of foundation models for computational pathology without retraining the foundation models themselves, enabling development of robust computational pathology models applicable to real-world data in routine clinical practice.

[94] Optimizing Neural Network Architecture for Medical Image Segmentation Using Monte Carlo Tree Search

Liping Meng, Fan Nie, Yunyun Zhang, Chao Han

Main category: cs.CV

TL;DR: MNAS-Unet combines Monte Carlo Tree Search with Neural Architecture Search for efficient medical image segmentation, outperforming NAS-Unet in accuracy while cutting the architecture search budget by 54% and yielding a lightweight 0.6M-parameter model.

Details

Motivation: To improve medical image segmentation by developing a more efficient neural architecture search method that reduces computational costs while maintaining or improving accuracy.

Method: Combines Monte Carlo Tree Search (MCTS) with Neural Architecture Search (NAS) to dynamically explore promising network architectures, with optimized DownSC and UpSC unit structures for fast model adjustments.

Result: Outperforms NAS-Unet and other state-of-the-art models on medical image datasets (PROMISE12, Ultrasound Nerve, CHAOS), reduces search budget by 54%, achieves lightweight model with 0.6M parameters and lower GPU memory consumption.

Conclusion: MNAS-Unet improves search efficiency while maintaining competitive segmentation accuracy under practical resource constraints, enhancing practical applicability of medical image segmentation models.

Abstract: This paper proposes a novel medical image segmentation framework, MNAS-Unet, which combines Monte Carlo Tree Search (MCTS) and Neural Architecture Search (NAS). MNAS-Unet dynamically explores promising network architectures through MCTS, significantly enhancing the efficiency and accuracy of architecture search. It also optimizes the DownSC and UpSC unit structures, enabling fast and precise model adjustments. Experimental results demonstrate that MNAS-Unet outperforms NAS-Unet and other state-of-the-art models in segmentation accuracy on several medical image datasets, including PROMISE12, Ultrasound Nerve, and CHAOS. Furthermore, compared with NAS-Unet, MNAS-Unet reduces the architecture search budget by 54% (early stopping at 139 epochs versus 300 epochs under the same search setting), while achieving a lightweight model with only 0.6M parameters and lower GPU memory consumption, which further improves its practical applicability. These results suggest that MNAS-Unet can improve search efficiency while maintaining competitive segmentation accuracy under practical resource constraints.
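
As a rough illustration of how MCTS can steer an architecture search, the sketch below shows the standard UCT selection rule, which balances a candidate's average reward against how rarely it has been tried. This is the generic textbook rule, not necessarily the selection policy used in MNAS-Unet. The operation names, reward values, and the exploration constant `c` are hypothetical. The last function only re-derives the 54% figure from the epoch counts quoted in the abstract.

```python
import math


def uct_score(total_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Generic UCT score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited candidates first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: dict, parent_visits: int, c: float = 1.4) -> str:
    """Pick the child with the highest UCT score.

    children maps a candidate operation name to (total_reward, visits).
    """
    return max(children, key=lambda name: uct_score(*children[name], parent_visits, c))


def search_budget_reduction(epochs_used: int, epochs_baseline: int) -> float:
    """Fractional reduction in search epochs relative to a baseline budget."""
    return 1 - epochs_used / epochs_baseline


if __name__ == "__main__":
    # Hypothetical statistics for three candidate operations in one cell.
    stats = {"conv3x3": (2.4, 3), "dilated_conv": (0.9, 1), "skip": (0.0, 0)}
    print(select_child(stats, parent_visits=4))
    print(round(100 * search_budget_reduction(139, 300)))
```

With the abstract's numbers, stopping at 139 of 300 epochs corresponds to a reduction of about 53.7%, which rounds to the reported 54%.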

[95] AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction

Hanyang Liu, Rongjun Qin

Main category: cs.CV

TL;DR: AeroDGS: A physics-guided 4D Gaussian splatting framework for monocular UAV video reconstruction that addresses challenges of depth ambiguity and unstable motion estimation in aerial dynamic scenes.

Details

Motivation: Existing 4D scene reconstruction methods struggle with monocular aerial conditions characterized by single-view capture, wide spatial range, and dynamic objects with limited spatial footprint and large motion disparity, leading to severe depth ambiguity and unstable motion estimation.

Method: Proposes AeroDGS with two key modules: 1) Monocular Geometry Lifting module for reconstructing reliable static and dynamic geometry from single aerial sequences, and 2) Physics-Guided Optimization module incorporating differentiable ground-support, upright-stability, and trajectory-smoothness priors to resolve monocular ambiguity through physically consistent motion.

Result: Outperforms state-of-the-art methods on both synthetic and real UAV scenes, achieving superior reconstruction fidelity in dynamic aerial environments. Also introduces a real-world UAV dataset spanning various altitudes and motion conditions.

Conclusion: AeroDGS effectively addresses the ill-posed nature of monocular aerial reconstruction by combining geometry lifting with physics-guided optimization, enabling robust 4D scene reconstruction from single-view UAV videos.

Abstract: Recent advances in 4D scene reconstruction have significantly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion. The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.
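
The abstract names ground-support and trajectory-smoothness priors without giving their form. The sketch below shows one common way such priors are written as penalty terms: a squared second difference along a trajectory, and a squared gap between an object's base and the ground. Both formulas and all function names are assumptions for illustration, not AeroDGS's actual losses. In a real pipeline these terms would be written in an autodiff framework so that they are differentiable; plain Python is used here only to show the arithmetic.

```python
def trajectory_smoothness(positions: list) -> float:
    """Mean squared second difference (discrete acceleration) of a 3D trajectory.

    positions is a list of (x, y, z) tuples at consecutive time steps.
    Assumed form of a smoothness prior, not the paper's definition.
    """
    if len(positions) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        total += sum((a - 2 * b + c) ** 2 for a, b, c in zip(prev, cur, nxt))
    return total / (len(positions) - 2)


def ground_support(base_heights: list) -> float:
    """Mean squared height of object bases above the ground plane (assumed form)."""
    if not base_heights:
        return 0.0
    return sum(h * h for h in base_heights) / len(base_heights)


if __name__ == "__main__":
    steady = [(float(t), 0.0, 0.0) for t in range(5)]       # constant velocity
    jittery = [(float(t), 0.5 * (t % 2), 0.0) for t in range(5)]
    print(trajectory_smoothness(steady), trajectory_smoothness(jittery))
    print(ground_support([0.0, 0.1, -0.1]))
```

A constant-velocity track incurs no smoothness penalty, while a zig-zag track is penalized, which is how such a term can discourage physically implausible motion when image cues alone are ambiguous.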

[96] Unveiling Deep Shadows: A Survey and Benchmark on Image and Video Shadow Detection, Removal, and Generation in the Deep Learning Era

Xiaowei Hu, Zhenghao Xing, Tianyu Wang, Chi-Wing Fu, Pheng-Ann Heng

Main category: cs.CV

TL;DR: A comprehensive survey and benchmark of deep learning methods for shadow detection, removal, and generation in images and videos, with standardized evaluation and future directions for multimodal foundation models.

Details

Motivation: Shadows are crucial for visual perception and realism but current deep learning approaches for shadow tasks lack standardized evaluation, fair comparison, and understanding of shared illumination cues across detection, removal, and generation tasks.

Method: The paper conducts a unified survey of shadow-related deep learning methods, introduces consistent taxonomies for architectures and learning paradigms, reviews datasets and evaluation protocols, and retrains representative methods under standardized settings to enable fair benchmarking.

Result: The benchmark reveals inconsistencies in prior reports, strong dependence on model design and resolution, limited cross-dataset generalization due to dataset bias, and identifies shared illumination cues connecting different shadow tasks.

Conclusion: The work provides a foundation for reproducible research and outlines future directions including unified frameworks, semantics-aware reasoning, shadow-based AIGC authenticity analysis, and integration of physics-guided priors into multimodal foundation models.

Abstract: Shadows, formed by the occlusion of light, play an essential role in visual perception and directly influence scene understanding, image quality, and visual realism. This paper presents a unified survey and benchmark of deep-learning-based shadow detection, removal, and generation across images and videos. We introduce consistent taxonomies for architectures, supervision strategies, and learning paradigms; review major datasets and evaluation protocols; and re-train representative methods under standardized settings to enable fair comparison. Our benchmark reveals key findings, including inconsistencies in prior reports, strong dependence on model design and resolution, and limited cross-dataset generalization due to dataset bias. By synthesizing insights across the three tasks, we highlight shared illumination cues and priors that connect detection, removal, and generation. We further outline future directions involving unified all-in-one frameworks, semantics- and geometry-aware reasoning, shadow-based AIGC authenticity analysis, and the integration of physics-guided priors into multimodal foundation models. Corrected datasets, trained models, and evaluation tools are released to support reproducible research.

[97] Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention

Zhengkang Fan, Chengkun Sun, Russell Terry, Jie Xu, Longin Jan Latecki

Main category: cs.CV

TL;DR: Deep learning framework with Organ Focused Attention loss for renal tumor malignancy prediction from 3D CT images without requiring manual segmentation at deployment.

Details

Motivation: Existing imaging modalities lack accuracy for predicting renal tumor malignancy before surgery. Traditional deep learning approaches require manual segmentation, which is labor-intensive, costly, and expert-dependent, so an automated approach that predicts malignancy without segmentation is needed.

Method: Developed deep learning framework using Organ Focused Attention (OFA) loss function that modifies attention of image patches so organ patches attend only to other organ patches, eliminating need for segmentation at deployment time.

Result: Achieved AUC of 0.685 and F1-score of 0.872 on private UF IDR dataset, and AUC of 0.760 and F1-score of 0.852 on public KiTS21 dataset, outperforming conventional segmentation-based models.

Conclusion: The OFA-based framework provides efficient, reliable malignancy prediction without explicit segmentation, enhancing clinical decision-making in renal cancer diagnosis.

Abstract: Accurate prediction of malignancy in renal tumors is crucial for informing clinical decisions and optimizing treatment strategies. However, existing imaging modalities lack the necessary accuracy to reliably predict malignancy before surgical intervention. While deep learning has shown promise in malignancy prediction using 3D CT images, traditional approaches often rely on manual segmentation to isolate the tumor region and reduce noise, which enhances predictive performance. Manual segmentation, however, is labor-intensive, costly, and dependent on expert knowledge. In this study, a deep learning framework was developed utilizing an Organ Focused Attention (OFA) loss function to modify the attention of image patches so that organ patches attend only to other organ patches. Hence, no segmentation of 3D renal CT images is required at deployment time for malignancy prediction. The proposed framework achieved an AUC of 0.685 and an F1-score of 0.872 on a private dataset from the UF Integrated Data Repository (IDR), and an AUC of 0.760 and an F1-score of 0.852 on the publicly available KiTS21 dataset. These results surpass the performance of conventional models that rely on segmentation-based cropping for noise reduction, demonstrating the framework's ability to enhance predictive accuracy without explicit segmentation input. The findings suggest that this approach offers a more efficient and reliable method for malignancy prediction, thereby enhancing clinical decision-making in renal cancer diagnosis.
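
The abstract does not give the exact form of the OFA loss, so the following is only a minimal sketch of one way such a penalty could look: it measures how much attention organ patches place on non-organ patches. The function name `organ_focused_penalty`, the row-normalized attention matrix, and the boolean patch mask are all assumptions made for illustration, not the authors' implementation.

```python
def organ_focused_penalty(attn, organ_mask):
    """Mean attention mass that organ patches place on non-organ patches.

    attn: square list of lists; attn[i][j] is how much patch i attends to
          patch j, with each row summing to 1 (assumed softmax output).
    organ_mask: list of bools, True where a patch overlaps the organ
          (assumed to come from a segmentation available during training).
    """
    organ_rows = [i for i, is_organ in enumerate(organ_mask) if is_organ]
    if not organ_rows:
        return 0.0
    leaked = 0.0
    for i in organ_rows:
        # Sum the attention this organ patch spends outside the organ.
        leaked += sum(a for a, is_organ in zip(attn[i], organ_mask) if not is_organ)
    return leaked / len(organ_rows)


# Toy example: three patches, the first two belong to the organ.
attn = [
    [0.7, 0.2, 0.1],
    [0.1, 0.9, 0.0],
    [0.3, 0.3, 0.4],
]
mask = [True, True, False]
penalty = organ_focused_penalty(attn, mask)
```

In a sketch like this the mask enters only through the training-time penalty, which would be consistent with the abstract's statement that no segmentation is required at deployment.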

[98] LayerT2V: A Unified Multi-Layer Video Generation Framework

Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Lei Zhang, Xiaohong Liu

Main category: cs.CV

TL;DR: LayerT2V is a multi-layer video generation framework that produces full videos plus editable background and foreground layers with alpha mattes in a single pass, enabling professional video editing workflows.

Details

Motivation: Existing text-to-video methods only output final composited videos without editable layered representations, limiting their use in professional video editing workflows that require separate layers for compositing and effects.

Method: Serializes multiple layer representations along the temporal dimension in video generation backbones, uses LayerAdaLN and layer-aware cross-attention modulation to mitigate layer ambiguity, and trains in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension.

Result: LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence, and introduces VidLayer, the first large-scale dataset for multi-layer video generation.

Conclusion: LayerT2V provides a unified framework for generating editable multi-layer video representations that better serve professional workflows while improving semantic alignment and temporal coherence through joint modeling.

Abstract: Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional leakage, we augment a shared DiT backbone with LayerAdaLN and layer-aware cross-attention modulation. LayerT2V is trained in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension. We also introduce VidLayer, the first large-scale dataset for multi-layer video generation. Extensive experiments demonstrate that LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence.

[99] Vision Transformers Need More Than Registers

Cheng Shi, Yizhou Yu, Sibei Yang

Main category: cs.CV

TL;DR: ViTs exhibit lazy aggregation artifacts where they use irrelevant background patches as shortcuts for global semantics; proposed solution selectively integrates patch features into CLS token to improve performance across 12 benchmarks.

Details

Motivation: Vision Transformers show artifacts across different supervision paradigms and tasks, but the fundamental mechanisms behind these artifacts are not well understood. The paper aims to systematically analyze and address these artifacts in ViTs.

Method: Through systematic analysis, identifies lazy aggregation behavior where ViTs use semantically irrelevant background patches as shortcuts. Proposes selective integration of patch features into CLS token to reduce influence of background-dominated shortcuts.

Result: The solution consistently improves performance across 12 benchmarks under label-, text-, and self-supervision paradigms, demonstrating effectiveness in addressing ViT artifacts.

Conclusion: The work provides new perspective on ViT behavior by identifying lazy aggregation as the source of artifacts and offering an effective solution that improves performance across multiple supervision paradigms.

Abstract: Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.

[100] CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander

Main category: cs.CV

TL;DR: DeBias-CLIP addresses bias in long-caption training by removing summary sentences and using token padding to distribute supervision, improving text-image alignment for complex scenes.

Details

Motivation: CLIP models are biased toward simple object descriptions due to training on short captions, and even long-caption fine-tuning suffers from attention concentration on opening summary sentences, weakening alignment for detailed descriptions.

Method: Removes summary sentences from long captions during training, applies sentence sub-sampling, and uses text token padding to distribute supervision across all token positions without adding trainable parameters.

Result: Achieves state-of-the-art long-text retrieval, improves short-text retrieval, and shows reduced sensitivity to sentence order permutations compared to Long-CLIP.

Conclusion: DeBias-CLIP effectively mitigates attention bias in long-caption training and serves as a drop-in replacement for Long-CLIP with enhanced alignment capabilities for complex descriptions.

Abstract: CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP’s pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
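
The three caption-side steps named in the abstract (dropping the summary sentence, sentence sub-sampling, and token padding) can be illustrated with a small preprocessing sketch. Everything here is an assumption for illustration: the function name `debias_caption`, the regex sentence splitter, the whitespace tokenizer standing in for CLIP's tokenizer, and in particular the padding scheme (a random number of leading pad tokens that shifts the text to later positions), which is only one possible reading of "text token padding".

```python
import random
import re


def debias_caption(caption, keep=2, max_pad=4, pad_token="<pad>", rng=None):
    """Return a token list for one training caption (illustrative only)."""
    rng = rng or random.Random()
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", caption) if s.strip()]
    # Step 1: drop the opening summary sentence if detail sentences remain.
    details = sentences[1:] if len(sentences) > 1 else sentences
    # Step 2: sub-sample detail sentences, preserving their original order.
    k = min(keep, len(details))
    chosen = sorted(rng.sample(range(len(details)), k))
    tokens = " ".join(details[i] for i in chosen).split()
    # Step 3 (assumed scheme): shift the text by a random number of pad tokens
    # so that supervision is not always tied to the earliest positions.
    pad = rng.randint(0, max_pad)
    return [pad_token] * pad + tokens


caption = (
    "A dog in a park. The dog is brown. "
    "It holds a red ball. Trees line the path."
)
tokens = debias_caption(caption, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the sampling reproducible; a real pipeline would use the model's own tokenizer and context length rather than whitespace tokens.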

[101] SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han, Yiyang Zhou, Jiaqi Liu, Ruiyi Zhang, Huaxiu Yao

Main category: cs.CV

TL;DR: SimpleOCR addresses modality laziness in MLLMs by forcing visual text reading through rendered text queries, improving OCR performance with minimal data.

Details

Motivation: To diagnose whether MLLMs genuinely read text in images or rely on text prompt shortcuts, revealing a "modality laziness" where models underutilize their visual OCR capabilities.

Method: Introduces Visualized-Question (VQ) setting where text queries are rendered onto images, and SimpleOCR training strategy that transforms training samples into VQ format with randomized styles to force visual text extraction.

Result: SimpleOCR yields 5.4% improvement over base model on OOD benchmarks, 2.7% over GRPO, with extreme data efficiency (30x fewer samples than RL methods) and seamless integration with RL strategies.

Conclusion: SimpleOCR effectively addresses modality laziness in MLLMs by structurally enforcing visual text reading, offering a plug-and-play solution for improving OCR capabilities without architectural changes.

Abstract: Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.

[102] Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

Chenhe Du, Xuanyu Tian, Qing Wu, Muyu Liu, Jingyi Yu, Hongjiang Wei, Yuyao Zhang

Main category: cs.CV

TL;DR: Dual-Coupled PnP Diffusion with Spectral Homogenization solves bias-hallucination trade-off in inverse problems by adding integral feedback and frequency-domain adaptation.

Details

Motivation: Existing Plug-and-Play diffusion prior frameworks suffer from non-vanishing steady-state bias due to memoryless operators, failing to satisfy physical measurements under heavy corruption. While adding dual variables guarantees convergence, it introduces structured artifacts that violate diffusion priors' AWGN assumptions.

Method: Proposes Dual-Coupled PnP Diffusion with Spectral Homogenization: 1) Restores classical dual variable for integral feedback ensuring asymptotic convergence, 2) Introduces Spectral Homogenization (SH) to modulate structured residuals into statistically compliant pseudo-AWGN inputs via frequency-domain adaptation.

Result: Extensive experiments on CT and MRI reconstruction demonstrate resolution of bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.

Conclusion: The proposed framework effectively aligns rigorous optimization trajectories with denoiser’s valid statistical manifold, solving fundamental limitations of existing PnP diffusion methods for inverse problems.

Abstract: Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, theoretically guaranteeing asymptotic convergence to the exact data manifold. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver’s rigorous optimization trajectory with the denoiser’s valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.
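
To make the "memoryless" versus "integral feedback" distinction concrete, the textbook scaled-form Plug-and-Play ADMM iteration is sketched below. This is the classical formulation, not necessarily the paper's exact update; the symbols $A$ (forward operator), $y$ (measurements), $\rho$ (penalty weight), and $D_\sigma$ (denoiser standing in for the prior's proximal step) are notation assumed here.

```latex
\begin{aligned}
x^{k+1} &= \arg\min_x \; \tfrac{1}{2}\|Ax - y\|_2^2 + \tfrac{\rho}{2}\|x - z^k + u^k\|_2^2, \\
z^{k+1} &= D_\sigma\!\left(x^{k+1} + u^k\right), \\
u^{k+1} &= u^k + x^{k+1} - z^{k+1}
\quad\Longrightarrow\quad
u^{k} = u^0 + \sum_{i=1}^{k}\left(x^{i} - z^{i}\right).
\end{aligned}
```

Setting $u^k \equiv 0$ recovers an HQS-style scheme in which each step sees only the current iterate. With the dual update, $u^k$ is a running sum of the residuals $x^i - z^i$, which is the integral-feedback term; it also means the denoiser input $x^{k+1} + u^k$ carries accumulated, structured residuals, the situation the abstract says conflicts with the AWGN assumption.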

[103] Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

Giuseppe Lando, Rosario Forte, Antonino Furnari

Main category: cs.CV

TL;DR: Edge-based MLLM system for real-time episodic memory QA using dual-thread architecture with streaming video-to-text conversion and reasoning, achieving competitive accuracy on edge devices vs cloud.

Details

Motivation: Privacy and latency concerns with cloud offloading for wearable assistants motivate edge-based MLLM solutions for real-time episodic memory question answering.

Method: Two asynchronous threads: Descriptor Thread continuously converts video to lightweight textual memory, and QA Thread reasons over textual memory to answer queries with streaming constraints.

Result: On QAEgo4D-Closed benchmark, edge system achieves 51.76% accuracy with 0.41s TTFT on 8GB GPU, scaling to 54.40% accuracy on enterprise server, vs 56.00% cloud accuracy.

Conclusion: Edge-based MLLMs show competitive performance for privacy-preserving episodic memory retrieval, demonstrating feasibility of real-time multimodal reasoning on resource-constrained devices.

Abstract: We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants; hence, we investigate implementation on the edge. We integrated streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
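
The two-thread layout described in the abstract can be sketched with Python's standard `threading` and `queue` modules. This is only a structural illustration under stated assumptions: `describe` and `answer` are hypothetical stand-ins for the captioning MLLM and the reasoning model, and the shared list guarded by a lock is an assumed representation of the textual memory.

```python
import queue
import threading


def run_descriptor(frames, memory, lock, describe):
    """Descriptor thread: turn each incoming frame into a line of textual memory."""
    for t, frame in enumerate(frames):
        text = describe(frame)
        with lock:
            memory.append((t, text))


def run_qa(questions, answers, memory, lock, answer):
    """QA thread: answer each query from a snapshot of the memory so far."""
    while True:
        question = questions.get()
        if question is None:  # sentinel: no more queries
            break
        with lock:
            snapshot = list(memory)
        answers.put((question, answer(question, snapshot)))


def describe(frame):
    # Hypothetical stand-in for an MLLM that captions a video chunk.
    return f"the wearer sees {frame}"


def answer(question, snapshot):
    # Hypothetical stand-in for an LLM: return the latest matching timestamp.
    hits = [t for t, text in snapshot if question in text]
    return hits[-1] if hits else None


def demo():
    memory, lock = [], threading.Lock()
    questions, answers = queue.Queue(), queue.Queue()
    frames = ["keys", "a mug", "keys on the table"]
    d = threading.Thread(target=run_descriptor, args=(frames, memory, lock, describe))
    q = threading.Thread(target=run_qa, args=(questions, answers, memory, lock, answer))
    d.start()
    q.start()
    d.join()  # only to make this demo deterministic; queries may arrive mid-stream
    questions.put("keys")
    questions.put(None)
    q.join()
    return answers.get()
```

Because the QA thread reads a snapshot under the lock, a query issued while the descriptor is still running would simply be answered from whatever memory exists at that moment.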

[104] MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation

Raiyan Jahangir, Nafiz Imtiaz Khan, Amritanand Sudheerkumar, Vladimir Filkov

Main category: cs.CV

TL;DR: MammoWise: A local multi-model pipeline that transforms open-source Vision Language Models into mammogram report generators and multi-task classifiers for breast cancer screening.

Details

Motivation: Mammography reporting is high-volume, time-sensitive, and documentation-heavy, requiring radiologists to translate visual findings into structured reports. Existing VLMs often rely on closed cloud systems with privacy and reproducibility limitations.

Method: Developed MammoWise, a local pipeline supporting any Ollama-hosted VLM with zero-shot, few-shot, and Chain-of-Thought prompting, plus optional multimodal RAG using vector databases. Evaluated MedGemma, LLaVA-Med, and Qwen2.5-VL on mammography datasets.

Result: Report generation consistently strong, improving with few-shot prompting and RAG. Classification feasible but sensitive to model/dataset choice. QLoRA fine-tuning of MedGemma achieved BI-RADS accuracy 0.7545, density accuracy 0.8840, calcification accuracy 0.9341 while preserving report quality.

Conclusion: MammoWise provides practical, extensible framework for deploying local VLMs for mammography reporting within unified, reproducible workflow, addressing privacy and adaptability limitations of cloud-based systems.

Abstract: Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality. MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.

[105] Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif, Juena Ahmed Noshin, Md Ashikur Rahman

Main category: cs.CV

TL;DR: Training-free inference-time intervention (SCR) reduces object hallucination in VLMs by redistributing activation credit from high-attention patches to context, guided by low-entropy inputs.

Details

Motivation: VLMs frequently hallucinate objects absent from input images due to spatial credit collapse - activation credit concentrating on sparse visual patches in early transformer layers, suppressing contextual evidence and increasing reliance on language priors.

Method: Spatial Credit Redistribution (SCR): training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. Evaluated on six model families (Chameleon, LLaVA, Qwen) at 7B, 13B, and 30B scales.

Result: Reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51% relative), and CHAIR-i by 2.7-4.4 percentage points (44-58% relative). Preserves CIDEr within 0.8 percentage points. Gains largest for low-entropy inputs. Overhead only 43-56 ms, 3-6x lower than OPERA and VCD.

Conclusion: SCR effectively mitigates object hallucination in VLMs by addressing spatial credit collapse, with minimal computational overhead, making it practical for real-time applications.

Abstract: Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative), and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings. A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit-collapse as the key driver.

[106] Automated Disentangling Analysis of Skin Colour for Lesion Images

Wenbo Yang, Eman Rezk, Walaa M. Moursi, Zhou Wang

Main category: cs.CV

[107] Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

Guoyizhe Wei, Yang Jiao, Nan Xi, Zhishen Huang, Jingjing Meng, Rama Chellappa, Yan Gao

Main category: cs.CV

TL;DR: Pix2Key is a composed image retrieval method that represents images as visual dictionaries for better intent-aware matching and diversity-aware reranking, with self-supervised pretraining V-Dict-AE to improve attribute understanding.

Details

Motivation: Existing CIR methods have limitations: supervised triplet-based approaches can lose fine-grained cues, while zero-shot caption-based methods may miss implicit user intent and produce repetitive results. There's a need for better intent understanding and result diversity.

Method: Pix2Key represents both queries and candidate images as open-vocabulary visual dictionaries in a unified embedding space. It uses intent-aware constraint matching and diversity-aware reranking. Includes V-Dict-AE, a self-supervised pretraining component that improves dictionary representation using only images without CIR-specific supervision.

Result: On DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points. Adding V-Dict-AE yields additional 2.3-point gain while improving intent consistency and maintaining high list diversity.

Conclusion: Pix2Key with V-Dict-AE provides an effective approach for composed image retrieval that better captures user intent while maintaining result diversity through visual dictionary representations and self-supervised pretraining.

Abstract: Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.
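
The abstract does not describe how the diversity-aware reranking works. As a generic illustration of the idea of trading relevance against redundancy, the sketch below uses Maximal Marginal Relevance (MMR), a standard technique that is swapped in here and is not claimed to be Pix2Key's reranker. The function names, the plain-list embeddings, and the weight `lam` are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_rerank(query, candidates, k, lam=0.7):
    """Greedily pick k candidate indices balancing relevance and novelty.

    lam=1.0 ranks purely by similarity to the query; smaller values
    penalize candidates that resemble ones already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.2]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
diverse = mmr_rerank(query, candidates, k=2, lam=0.5)
by_relevance = mmr_rerank(query, candidates, k=2, lam=1.0)
```

In this toy setup the first two candidates are near-duplicates, so lowering `lam` pushes the second slot toward the dissimilar third candidate, which is the kind of behavior a diversity-aware list is meant to show.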

[108] DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI

Agamdeep S. Chopra, Caitlin Neher, Tianyi Ren, Juampablo E. Heras Rivera, Mehmet Kurt

Main category: cs.CV

TL;DR: DisQ-HNet synthesizes tau-PET from T1-weighted and FLAIR MRI using a vector-quantized encoder with Partial Information Decomposition and a Half-UNet decoder, providing modality-specific attribution while maintaining reconstruction fidelity for Alzheimer’s disease tasks.

Details

Motivation: Tau-PET is an important in vivo marker for Alzheimer's disease but is expensive and has limited availability. The paper aims to develop MRI-based alternatives that can synthesize tau-PET images while understanding how each MRI modality contributes to the prediction.

Method: DisQ-HNet combines: 1) A Partial Information Decomposition-guided vector-quantized encoder that partitions latent information into redundant, unique, and complementary components from T1-weighted and FLAIR MRI, and 2) A Half-UNet decoder that preserves anatomical detail using pseudo-skip connections conditioned on structural edge cues instead of direct encoder feature reuse.

Result: The method outperforms multiple baselines (VAE, VQ-VAE, and UNet) in maintaining reconstruction fidelity and better preserves disease-relevant signal for downstream Alzheimer’s disease tasks including Braak staging, tau localization, and classification. PID-based Shapley analysis provides modality-specific attribution of synthesized uptake patterns.

Conclusion: DisQ-HNet provides an effective MRI-based alternative to tau-PET imaging that not only synthesizes accurate tau-PET images but also offers interpretable insights into how different MRI modalities contribute to the predictions, making it valuable for Alzheimer’s disease research and clinical applications.

Abstract: Tau positron emission tomography (tau-PET) provides an in vivo marker of Alzheimer’s disease pathology, but cost and limited availability motivate MRI-based alternatives. We introduce DisQ-HNet (DQH), a framework that synthesizes tau-PET from paired T1-weighted and FLAIR MRI while exposing how each modality contributes to the prediction. The method combines (i) a Partial Information Decomposition (PID)-guided, vector-quantized encoder that partitions latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that preserves anatomical detail using pseudo-skip connections conditioned on structural edge cues rather than direct encoder feature reuse. Across multiple baselines (VAE, VQ-VAE, and UNet), DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal for downstream AD tasks, including Braak staging, tau localization, and classification. PID-based Shapley analysis provides modality-specific attribution of synthesized uptake patterns.

[109] DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

Zhechao Wang, Yiming Zeng, Lufan Ma, Zeqing Fu, Chen Bai, Ziyao Lin, Cheng Lu

Main category: cs.CV

TL;DR: DrivePTS: A progressive learning diffusion model for diverse driving scene generation with explicit condition decoupling, multi-view hierarchical descriptions, and frequency-guided structure loss to improve fidelity and controllability.

Details

Motivation: Current driving scene generation methods suffer from implicit inter-condition dependencies causing generation failures when control conditions change independently, insufficient semantic details due to brief captions, and structural distortions from uniform spatial weighting in denoising loss.

Method: Three key innovations: 1) Progressive learning strategy with explicit mutual information constraint to mitigate inter-dependency between geometric conditions (HD maps and 3D bounding boxes), 2) Vision-Language Model for multi-view hierarchical descriptions across six semantic aspects, 3) Frequency-guided structure loss to enhance sensitivity to high-frequency elements for better foreground structural fidelity.

Result: DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes, successfully generating rare scenes where prior methods fail, demonstrating strong generalization ability.

Conclusion: The proposed DrivePTS framework effectively addresses limitations in current driving scene generation methods through condition decoupling, fine-grained semantic guidance, and improved structural preservation, enabling more robust and diverse data augmentation for autonomous driving validation.

Abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model’s sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.

[110] SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction

Kang Han, Wei Xiang, Lu Yu, Mathew Wyatt, Gaowen Liu, Ramana Rao Kompella

Main category: cs.CV

TL;DR: SwiftNDC is a fast 3D reconstruction framework using Neural Depth Correction to produce cross-view consistent depth maps, enabling high-quality mesh reconstruction and improved novel-view synthesis with 3D Gaussian Splatting.

Details

Motivation: Existing depth-guided 3D reconstruction methods suffer from scale drift, multi-view inconsistencies, and require substantial refinement for high-fidelity geometry, motivating a faster approach with better geometric initialization.

Method: Uses a Neural Depth Correction field to produce cross-view consistent depth maps, generates a dense point cloud through back-projection and reprojection-error filtering, then uses this initialization to accelerate 3D Gaussian Splatting for mesh reconstruction and novel-view synthesis.

Result: Consistently reduces running time for accurate mesh reconstruction and boosts rendering fidelity for view synthesis across five datasets, demonstrating effectiveness of neural depth refinement with robust geometric initialization.

Conclusion: SwiftNDC combines neural depth refinement with geometric initialization for high-fidelity, efficient 3D reconstruction, accelerating downstream tasks like mesh reconstruction and novel-view synthesis.

Abstract: Depth-guided 3D reconstruction has gained popularity as a fast alternative to optimization-heavy approaches, yet existing methods still suffer from scale drift, multi-view inconsistencies, and the need for substantial refinement to achieve high-fidelity geometry. Here, we propose SwiftNDC, a fast and general framework built around a Neural Depth Correction field that produces cross-view consistent depth maps. From these refined depths, we generate a dense point cloud through back-projection and robust reprojection-error filtering, obtaining a clean and uniformly distributed geometric initialization for downstream reconstruction. This reliable dense geometry substantially accelerates 3D Gaussian Splatting (3DGS) for mesh reconstruction, enabling high-quality surfaces with significantly fewer optimization iterations. For novel-view synthesis, SwiftNDC can also improve 3DGS rendering quality, highlighting the benefits of strong geometric initialization. We conduct a comprehensive study across five datasets, including two for mesh reconstruction, as well as three for novel-view synthesis. SwiftNDC consistently reduces running time for accurate mesh reconstruction and boosts rendering fidelity for view synthesis, demonstrating the effectiveness of combining neural depth refinement with robust geometric initialization for high-fidelity and efficient 3D reconstruction.
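The back-projection and reprojection-error filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera, two views that differ only by a shift along the x axis, and a relative depth-agreement test as the "reprojection error". The function names, the `baseline` parameter, and the `max_rel_err` threshold are all hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-space point to pixel coordinates; also return its depth."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy, z)


def filter_points(depth_a, depth_b, baseline, intr, max_rel_err=0.05):
    """Keep points from view A whose depth agrees with view B after reprojection.

    Assumption for this sketch: view B is view A shifted by `baseline` along +x
    with the same orientation and intrinsics `intr = (fx, fy, cx, cy)`.
    """
    h, w = len(depth_a), len(depth_a[0])
    kept = []
    for v in range(h):
        for u in range(w):
            p = backproject(u, v, depth_a[v][u], *intr)
            p_b = (p[0] - baseline, p[1], p[2])      # same point in B's frame
            u_b, v_b, z_b = project(p_b, *intr)
            ui, vi = round(u_b), round(v_b)
            if not (0 <= ui < w and 0 <= vi < h):
                continue                              # falls outside view B
            if abs(depth_b[vi][ui] - z_b) / z_b <= max_rel_err:
                kept.append(p)                        # depths are consistent
    return kept


# Toy example: a fronto-parallel plane at depth 2.0 seen by two identical views,
# with one corrupted depth value in view A.
intr = (4.0, 4.0, 1.5, 1.5)
depth_a = [[2.0] * 4 for _ in range(4)]
depth_b = [[2.0] * 4 for _ in range(4)]
depth_a[0][0] = 4.0
print(len(filter_points(depth_a, depth_b, 0.0, intr)))
```

In a real pipeline the relative pose would be a full rotation and translation, and the check would typically run across many view pairs; the sketch only shows why cross-view consistent depths matter for the filter to retain a dense, clean point set.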

[111] Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise

Peihan Wu, Guanjie Cheng, Yufei Tong, Meng Xi, Shuiguang Deng

Main category: cs.CV

TL;DR: QARMVC is a quality-aware robust multi-view clustering framework that addresses heterogeneous observation noise by quantifying fine-grained contamination intensity and using quality scores for hierarchical noise suppression.

Details

Motivation: Existing noisy robust multi-view clustering methods rely on a simplified binary assumption (clean vs. corrupted data), overlooking the prevalent heterogeneous observation noise where contamination intensity varies continuously across data.

Method: Uses information bottleneck mechanism to extract intrinsic semantics for view reconstruction, quantifies contamination intensity via reconstruction discrepancy, derives instance-level quality scores, and employs hierarchical learning: quality-weighted contrastive objective at feature level and quality-weighted aggregation at fusion level with mutual information maximization.

Result: Extensive experiments on five benchmark datasets show QARMVC consistently outperforms state-of-the-art baselines, particularly in scenarios with heterogeneous noise intensities.

Conclusion: QARMVC effectively addresses heterogeneous noise in multi-view clustering by quantifying fine-grained contamination and using quality-aware hierarchical learning strategies.

Abstract: Deep multi-view clustering has achieved remarkable progress but remains vulnerable to complex noise in real-world applications. Existing noisy robust methods predominantly rely on a simplified binary assumption, treating data as either perfectly clean or completely corrupted. This overlooks the prevalent existence of heterogeneous observation noise, where contamination intensity varies continuously across data. To bridge this gap, we propose a novel framework termed Quality-Aware Robust Multi-View Clustering (QARMVC). Specifically, QARMVC employs an information bottleneck mechanism to extract intrinsic semantics for view reconstruction. Leveraging the insight that noise disrupts semantic integrity and impedes reconstruction, we utilize the resulting reconstruction discrepancy to precisely quantify fine-grained contamination intensity and derive instance-level quality scores. These scores are integrated into a hierarchical learning strategy: at the feature level, a quality-weighted contrastive objective is designed to adaptively suppress the propagation of noise; at the fusion level, a high-quality global consensus is constructed via quality-weighted aggregation, which is subsequently utilized to align and rectify local views via mutual information maximization. Extensive experiments on five benchmark datasets demonstrate that QARMVC consistently outperforms state-of-the-art baselines, particularly in scenarios with heterogeneous noise intensities.
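The idea of turning reconstruction discrepancy into quality scores and then using them for weighted aggregation can be illustrated with a small sketch. The abstract does not give the scoring function, so the softmax over negative reconstruction error used here is an assumption, as are the names `quality_scores`, `weighted_consensus`, and the `temperature` parameter.

```python
import math


def quality_scores(recon_errors, temperature=1.0):
    """Map per-view reconstruction errors to weights that sum to 1.

    Assumed form (not from the paper): softmax over negative error, so a
    larger reconstruction error yields a smaller quality weight.
    """
    logits = [-e / temperature for e in recon_errors]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_consensus(view_features, weights):
    """Quality-weighted average of the per-view feature vectors of one instance."""
    dim = len(view_features[0])
    return [sum(w * f[i] for w, f in zip(weights, view_features))
            for i in range(dim)]


# One instance observed in two views; the second view reconstructs poorly.
weights = quality_scores([0.1, 2.0])
consensus = weighted_consensus([[1.0, 0.0], [0.0, 1.0]], weights)
print(weights, consensus)
```

Because the weights vary continuously with the error, a mildly noisy view is down-weighted rather than discarded, which is the contrast with the binary clean-versus-corrupted assumption the abstract criticizes.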

[112] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

Dian Xie, Shitong Shao, Lichen Bai, Zikai Zhou, Bojun Cheng, Shuo Yang, Jun Wu, Zeke Xie

Main category: cs.CV

TL;DR: The paper reveals evaluation pitfalls in diffusion guidance methods, showing that simply increasing CFG scale can artificially boost human preference scores despite degrading image quality, and proposes a new evaluation framework for fair comparison.

Details

Motivation: To critically examine whether emerging diffusion guidance methods truly provide significant improvements over standard classifier-free guidance (CFG), and to address evaluation biases in current assessment frameworks.

Method: 1) Identifies evaluation pitfall where human preference models favor large guidance scales; 2) Proposes GA-Eval framework with guidance scale calibration; 3) Designs TDG method as a test case; 4) Empirically evaluates eight diffusion guidance methods using both conventional and GA-Eval frameworks.

Result: Simply increasing CFG scales can compete with most studied diffusion guidance methods in conventional evaluation, while all methods suffer from winning rate degradation over standard CFG. The proposed TDG method artificially boosts scores but doesn’t work in practice.

Conclusion: Current evaluation paradigms for diffusion guidance are flawed due to biases toward large guidance scales. The community needs to rethink evaluation methods and future directions in this field, as many claimed improvements may be artifacts of biased assessment.

Abstract: Classifier-free guidance (CFG) has helped diffusion models achieve strong conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method, which can significantly improve human preference scores in the conventional evaluation framework but actually does not work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scale can compete with most studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work would strongly motivate the community to rethink the evaluation paradigm and future directions of this field.
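For readers unfamiliar with the guidance scale discussed above, the standard CFG combination rule can be sketched in a few lines. This shows only the common formulation, in which the prediction is extrapolated from the unconditional output toward the conditional one; some papers parameterize the scale differently (for example with an offset of one), and nothing here reproduces GA-Eval or TDG. The function name and the toy numbers are illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Blend two noise predictions with classifier-free guidance.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates past the conditional prediction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy 3-element "noise predictions" (made-up numbers, not model outputs).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

for scale in (1.0, 7.5):
    print(scale, cfg_combine(eps_uncond, eps_cond, scale))
```

With a large scale, every component on which the two predictions disagree is pushed far beyond the conditional value, which is consistent with the abstract's observation that large scales strengthen semantic alignment while risking oversaturation and artifacts.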

[113] GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

Tianyu Chen, Wei Xiang, Kang Han, Yu Lu, Di Wu, Gaowen Liu, Ramana Rao Kompella

Main category: cs.CV

TL;DR: GIFSplat is a feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views that uses residual updates and distills diffusion priors without gradient backpropagation, achieving state-of-the-art performance with second-scale inference time.

Details

Motivation: Existing feed-forward 3D reconstruction methods have limited performance on out-of-domain data and struggle to maintain fast inference when incorporating generative priors, due to their one-shot prediction paradigm that lacks inference-time refinement and continuous generative prior injection.

Method: Proposes GIFSplat: a purely feed-forward iterative refinement framework using a small number of forward-only residual updates to progressively refine 3D scenes from sparse unposed views. Distills frozen diffusion priors into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or view-set expansion.

Result: Outperforms state-of-the-art feed-forward baselines on DL3DV, RealEstate10K, and DTU datasets, improving PSNR by up to +2.1 dB while maintaining second-scale inference time without requiring camera poses or test-time gradient optimization.

Conclusion: GIFSplat achieves a favorable balance between efficiency and quality for 3D reconstruction from sparse views, enabling per-scene adaptation with generative priors while preserving feed-forward efficiency through iterative refinement and distillation techniques.

Abstract: Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-scale inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with a generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
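To put the reported PSNR gain in perspective, recall that PSNR is defined as 10 * log10(MAX^2 / MSE), so a gain of g dB corresponds to dividing the mean squared error by 10^(g/10). For g = 2.1 this factor is about 1.62. The short sketch below works through that arithmetic; the baseline MSE is an arbitrary illustrative value, not a number from the paper.

```python
import math


def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_reduction_factor(gain_db):
    """Factor by which MSE must shrink to gain `gain_db` dB of PSNR."""
    return 10.0 ** (gain_db / 10.0)


baseline_mse = 0.01                                   # illustrative only
improved_mse = baseline_mse / mse_reduction_factor(2.1)
print(mse_reduction_factor(2.1))
print(psnr(improved_mse) - psnr(baseline_mse))
```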

[114] Causal Motion Diffusion Models for Autoregressive Motion Generation

Qing Yu, Akihisa Watanabe, Kent Fujiwara

Main category: cs.CV

TL;DR: CMDM is a causal motion diffusion framework using a motion-language-aligned VAE and autoregressive diffusion transformer for real-time, high-quality text-to-motion generation with improved temporal causality.

Details

Motivation: Existing motion diffusion models have limitations: bidirectional models lack temporal causality and real-time applicability, while autoregressive models suffer from instability and cumulative errors. There's a need for a unified framework that combines the quality of diffusion models with the temporal causality of autoregressive approaches for practical applications.

Method: 1) MAC-VAE: Motion-Language-Aligned Causal VAE encodes motion sequences into temporally causal latent representations. 2) Autoregressive diffusion transformer trained with causal diffusion forcing for temporally ordered denoising. 3) Frame-wise sampling schedule with causal uncertainty for fast inference, predicting each subsequent frame from partially denoised previous frames.

Result: Outperforms existing diffusion and autoregressive models on HumanML3D and SnapMoGen datasets in both semantic fidelity and temporal smoothness. Substantially reduces inference latency while supporting high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates.

Conclusion: CMDM provides a unified framework that combines the benefits of diffusion models and autoregressive approaches, enabling real-time, high-quality motion generation with improved temporal causality and reduced inference latency.

Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
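One way to picture a frame-wise schedule in which later frames are predicted from partially denoised earlier frames is a staggered schedule, where each frame starts denoising a fixed number of iterations after its predecessor. The sketch below is an assumption made for illustration: the abstract does not specify CMDM's actual schedule, and the function name and the `lag` parameter are hypothetical.

```python
def staggered_schedule(num_frames, num_steps, lag):
    """Return, per iteration, the remaining denoising steps for every frame.

    Frame f starts denoising `lag * f` iterations after frame 0, so at any
    iteration earlier frames are at least as clean as later ones.
    """
    total_iters = num_steps + lag * (num_frames - 1)
    rows = []
    for it in range(total_iters + 1):
        row = []
        for f in range(num_frames):
            remaining = num_steps - (it - lag * f)
            row.append(min(num_steps, max(0, remaining)))  # clamp to valid range
        rows.append(row)
    return rows


# 3 frames, 4 denoising steps each, each frame lagging 2 iterations behind.
for row in staggered_schedule(3, 4, 2):
    print(row)
```

With `lag` smaller than `num_steps`, a frame begins denoising while its predecessor still carries some noise, which is the property that allows streaming output instead of waiting for a full-sequence denoising pass.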

[115] Don’t let the information slip away

Taozhe Li

Main category: cs.CV

TL;DR: Association DETR: A transformer-based object detection model that incorporates background contextual information to improve detection accuracy, achieving SOTA results on COCO dataset.

Details

Motivation: Current object detection models (YOLO series, DETR variants) focus primarily on foreground object features while neglecting valuable background contextual information. The authors argue that background context (e.g., cars on roads, animals in forests) can significantly aid object detection tasks.

Method: Proposes Association DETR, a transformer-based object detection model that incorporates background contextual information alongside foreground object features to improve detection performance.

Result: Achieves state-of-the-art results on COCO val2017 dataset compared to other object detection models, though specific mAP score is not provided in the abstract.

Conclusion: Incorporating background contextual information significantly improves object detection performance, and Association DETR demonstrates this through SOTA results on benchmark datasets.

Abstract: Real-time object detection has advanced rapidly in recent years. The YOLO series of detectors is among the most well-known CNN-based object detection models and cannot be overlooked. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based object detection models, also known as DEtection TRansformer (DETR), have demonstrated impressive performance. RT-DETR is an outstanding model that outperformed the YOLO series in both speed and accuracy when it was released. Its successor, RT-DETRv2, achieved 53.4 mAP on the COCO val2017 dataset. However, despite their remarkable performance, all these models let information slip away. They primarily focus on the features of foreground objects while neglecting the contextual information provided by the background. We believe that background information can significantly aid object detection tasks. For example, cars are more likely to appear on roads than in offices, while wild animals are more likely to be found in forests or remote areas than on busy streets. To address this gap, we propose an object detection model called Association DETR, which achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset.

[116] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

Yuci Han, Charles Toth, John E. Anderson, William J. Shuart, Alper Yilmaz

Main category: cs.CV

TL;DR: BetterScene enhances novel view synthesis from sparse photos using Stable Video Diffusion with temporal equivariance regularization and vision foundation model-aligned VAE representations, integrated with 3D Gaussian Splatting for artifact-free results.

Details

Motivation: Existing diffusion-based novel view synthesis methods using off-the-shelf pretrained models often produce inconsistent details and artifacts even with geometry-aware regularizations, due to limited fine-tuning of only the UNet module while keeping other components frozen.

Method: Leverages Stable Video Diffusion as backbone, investigates diffusion model latent space, introduces temporal equivariance regularization and vision foundation model-aligned representation for the VAE module, and integrates feed-forward 3D Gaussian Splatting to render features as inputs for the SVD enhancer.

Result: Demonstrates superior performance on the challenging DL3DV-10K dataset compared to state-of-the-art methods, generating continuous, artifact-free, consistent novel views from extremely sparse, unconstrained photos.

Conclusion: BetterScene effectively mitigates artifacts and recovers view-consistent details by enhancing the VAE module with temporal and vision-aligned regularizations, combined with 3DGS rendering, advancing novel view synthesis quality for diverse real-world scenes.

Abstract: We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.

[117] LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals

Ziqi Zhao, Abhijit Mishra, Shounak Roychowdhury

Main category: cs.CV

TL;DR: LoR-LUT presents a unified low-rank formulation for compact 3D lookup table generation that improves image enhancement quality while reducing parameters through residual corrections.

Details

Motivation: The paper aims to improve upon conventional 3D-LUT-based image enhancement techniques by addressing their limitations in parameter efficiency and interpretability, while maintaining high perceptual quality.

Method: Proposes a unified low-rank formulation that combines basis LUTs with low-rank residual corrections, reducing parameters while maintaining trilinear interpolation complexity. Includes an interactive visualization tool (LoR-LUT Viewer) for interpretability.

Result: Achieves expert-level retouching characteristics with high perceptual fidelity and sub-megabyte model size on MIT-Adobe FiveK dataset, while maintaining same interpolation complexity with fewer parameters.

Conclusion: The proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer applications.

Abstract: We present LoR-LUT, a unified low-rank formulation for compact and interpretable 3D lookup table (LUT) generation. Unlike conventional 3D-LUT-based techniques that rely on fusion of basis LUTs, which are usually dense tensors, our unified approach extends the current framework by jointly using residual corrections, which are in fact low-rank tensors, together with a set of basis LUTs. The approach described here improves the perceptual quality of an image, primarily through the technique’s novel use of residual corrections. At the same time, we achieve the same level of trilinear interpolation complexity, using significantly fewer network, residual-correction, and LUT parameters. The experimental results obtained from LoR-LUT, which is trained on the MIT-Adobe FiveK dataset, reproduce expert-level retouching characteristics with high perceptual fidelity and a sub-megabyte model size. Furthermore, we introduce an interactive visualization tool, termed LoR-LUT Viewer, which transforms an input image into the LUT-adjusted output image via a number of sliders that control different parameters. The tool provides an effective way to enhance interpretability and user confidence in the visual results. Overall, our proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer.
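The combination of a base LUT, a low-rank residual, and trilinear lookup can be sketched as follows. The abstract does not state which low-rank factorization LoR-LUT uses, so the CP-style sum of outer products below is an assumption, and all function names are hypothetical. The point of the sketch is that the lookup cost is unchanged while the residual needs far fewer parameters than a dense table.

```python
def identity_lut(n):
    """n x n x n LUT that maps every RGB colour to itself (requires n >= 2)."""
    s = n - 1
    return [[[[r / s, g / s, b / s] for b in range(n)]
             for g in range(n)] for r in range(n)]


def add_low_rank_residual(lut, factors):
    """Add sum_r a_r[i] * b_r[j] * c_r[k] * d_r[ch] to the LUT in place.

    Each factor is (a, b, c, d) with len(a) = len(b) = len(c) = n, len(d) = 3.
    This CP-style form is an assumption made for this sketch.
    """
    n = len(lut)
    for a, b, c, d in factors:
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    w = a[i] * b[j] * c[k]
                    for ch in range(3):
                        lut[i][j][k][ch] += w * d[ch]
    return lut


def trilinear(lut, rgb):
    """Look up an RGB colour in [0, 1]^3 with trilinear interpolation."""
    n = len(lut)
    pos = [min(max(x, 0.0), 1.0) * (n - 1) for x in rgb]
    lo = [min(int(p), n - 2) for p in pos]        # lower corner of the cell
    fr = [p - l for p, l in zip(pos, lo)]         # fractional offsets
    out = [0.0, 0.0, 0.0]
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((fr[0] if di else 1 - fr[0])
                     * (fr[1] if dj else 1 - fr[1])
                     * (fr[2] if dk else 1 - fr[2]))
                node = lut[lo[0] + di][lo[1] + dj][lo[2] + dk]
                for ch in range(3):
                    out[ch] += w * node[ch]
    return out


def dense_params(n):
    return 3 * n ** 3              # one RGB triple per grid node


def low_rank_params(n, rank):
    return rank * (3 * n + 3)      # three length-n factors plus an RGB vector


print(dense_params(33), low_rank_params(33, 8))
```

For a 33-point grid, a dense residual would need 107,811 values, whereas a rank-8 residual in this assumed form needs 816, which illustrates how low-rank corrections can keep a model compact.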

[118] Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa Phan, Yi-Ping Phoebe Chen, Gaowen Liu, Ramana Rao Kompella

Main category: cs.CV

TL;DR: SATtxt is a spectrum-aware vision-language foundation model for Earth observation that uses RGB-only inference while retaining spectral knowledge through distillation from multi-spectral data during training.

Details

Motivation: Current vision-language foundation models for satellite imagery face two main challenges: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment, and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. The authors aim to create a model that can operate with RGB-only inputs at inference while retaining spectral cues learned during training.

Method: SATtxt uses a two-stage framework: (1) Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector, and (2) Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space.

Result: Across EuroSAT, BigEarthNet, and ForestNet datasets, SATtxt improves zero-shot classification by 4.2% on average, retrieval by 5.9%, and linear probing by 2.7% over baselines.

Conclusion: SATtxt demonstrates an efficient path toward spectrum-aware vision-language learning for Earth observation, enabling RGB-only inference while maintaining spectral understanding through distillation techniques.

Abstract: Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/

[119] Coded-E2LF: Coded Aperture Light Field Imaging from Events

Tomoya Tsuchida, Keita Takahashi, Chihiro Tsutake, Toshiaki Fujii, Hajime Nagahara

Main category: cs.CV

TL;DR: Coded-E2LF reconstructs 4D light fields using only event data from a coded aperture camera system, eliminating need for intensity images.

Details

Motivation: Previous methods required both events and intensity images for light field reconstruction, which imposes hardware restrictions. The goal is to develop a purely event-based approach that simplifies hardware implementation while maintaining reconstruction quality.

Method: Uses a coded aperture with a stationary event-only camera to capture 4D light fields. Introduces theoretical framework supporting reconstruction from events alone, emphasizing the importance of black patterns in aperture coding patterns for accurate reconstruction.

Result: Successfully demonstrates pixel-level accurate 4D light field reconstruction from events alone using real imaging hardware. First to show such capability without intensity images.

Conclusion: Purely event-based light field acquisition is feasible and offers hardware simplification advantages over hybrid event-intensity approaches.

Abstract: We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software and supplementary video are available from our project website.

[120] CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection

Boyang Dai, Zeng Fan, Zihao Qi, Meng Lou, Yizhou Yu

Main category: cs.CV

TL;DR: CGSA introduces object-centric learning to source-free domain adaptive object detection by integrating slot-aware adaptation into DETR-based detectors, achieving state-of-the-art performance without source data access.

Details

Motivation: Current SF-DAOD methods focus on pseudo-label thresholds and teacher-student frameworks but overlook object-level structural cues in cross-domain data, limiting adaptation effectiveness.

Method: Integrates Hierarchical Slot Awareness (HSA) module to disentangle images into slot representations as visual priors, and Class-Guided Slot Contrast (CGSC) module to guide slots toward class semantics for domain-invariant adaptation.

Result: Outperforms previous SF-DAOD methods on multiple cross-domain datasets, with theoretical and experimental analysis demonstrating effectiveness of object-centric design.

Conclusion: Object-centric learning shows promise for privacy-sensitive adaptation scenarios in domain adaptive object detection, particularly when source data cannot be retained.

Abstract: Source-Free Domain Adaptive Object Detection (SF-DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain without retaining any source data. Despite recent progress, most popular approaches focus on tuning pseudo-label thresholds or refining the teacher-student framework, while overlooking object-level structural cues within cross-domain data. In this work, we present CGSA, the first framework that brings Object-Centric Learning (OCL) into SF-DAOD by integrating slot-aware adaptation into the DETR-based detector. Specifically, our approach integrates a Hierarchical Slot Awareness (HSA) module into the detector to progressively disentangle images into slot representations that act as visual priors. These slots are then guided toward class semantics via a Class-Guided Slot Contrast (CGSC) module, maintaining semantic consistency and prompting domain-invariant adaptation. Extensive experiments on multiple cross-domain datasets demonstrate that our approach outperforms previous SF-DAOD methods, with theoretical derivations and experimental analysis further demonstrating the effectiveness of the proposed components and the framework, thereby indicating the promise of object-centric design in privacy-sensitive adaptation scenarios. Code is released at https://github.com/Michael-McQueen/CGSA.

[121] Instruction-based Image Editing with Planning, Reasoning, and Generation

Liya Ji, Chenyang Qi, Qifeng Chen

Main category: cs.CV

TL;DR: A multi-modality approach for instruction-based image editing that combines chain-of-thought reasoning with LLMs and diffusion models.

Details

Motivation: Prior work uses separate models for understanding and generation, limiting editing quality. The paper aims to bridge understanding and generation via multi-modality models for complex instruction-based image editing.

Method: Three-stage approach: 1) Chain-of-Thought planning with LLMs to reason about sub-prompts, 2) Instruction-based editing region generation network trained with multi-modal LLM, 3) Hint-guided instruction-based editing network using text-to-image diffusion models.

Result: Extensive experiments show competitive editing abilities on complex real-world images compared to existing methods.

Conclusion: The proposed multi-modality chain-of-thought approach effectively bridges understanding and generation for complex instruction-based image editing tasks.

Abstract: Editing images via instruction provides a natural way to generate interactive content, but it is a big challenge due to the higher requirement of scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that provides the intelligent abilities to instruction-based image editing models for more complex cases. To achieve this goal, we decompose the instruction editing task into multi-modality chain-of-thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model reasons about the appropriate sub-prompts, considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts these hints during generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.

[122] CRAG: Can 3D Generative Models Help 3D Assembly?

Zeyu Jiang, Sihang Li, Siqi Tan, Chenyang Xu, Juexiao Zhang, Julia Galway-Witham, Xue Wang, Scott A. Williams, Radu Iovita, Chen Feng, Jing Zhang

Main category: cs.CV

TL;DR: CRAG reformulates 3D assembly as joint assembly and generation, where assembly provides structural priors and generation injects holistic shape context to resolve ambiguities.

Details

Motivation: Existing 3D assembly methods treat the problem as pure pose estimation via rigid transformations, but human assembly naturally couples structural reasoning with holistic shape inference. The authors aim to address the limitation that prior methods cannot synthesize missing geometry.

Method: Proposes CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. The method treats assembly and generation as mutually reinforcing processes: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly.

Result: Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces.

Conclusion: Jointly solving assembly and generation is mutually beneficial, enabling handling of incomplete inputs and ambiguous cases that pure pose estimation methods struggle with. The approach shows promise for real-world 3D assembly tasks.

Abstract: Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.

[123] QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

Daniel Miao, Gilad Lerman, Joe Kileel

Main category: cs.CV

TL;DR: A new framework for recovering multiple cameras from quadrifocal tensors using Tucker decomposition and synchronization algorithms, showing practical viability of higher-order geometric constraints in structure from motion.

Details

Motivation: Quadrifocal tensors capture more information than pairwise essential matrices in structure from motion, but have been considered impractical and only theoretical. The authors challenge this belief by developing practical methods to leverage quadrifocal tensors for camera recovery.

Method: Introduces block quadrifocal tensor that admits Tucker decomposition with factor matrices as stacked camera matrices. Develops first synchronization algorithm for quadrifocal tensors using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. Also establishes relationships between block quadrifocal, trifocal, and bifocal tensors with joint synchronization algorithm.

Result: Numerical experiments demonstrate effectiveness on modern datasets, showing the potential and importance of using higher-order information in synchronization. The methods make quadrifocal tensors practical for camera recovery.

Conclusion: Quadrifocal tensors are practical and valuable for structure from motion, not just theoretical. The proposed framework successfully recovers multiple cameras from quadrifocal tensors, challenging previous beliefs about their impracticality.

Abstract: In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of $(4, 4, 4, 4)$ independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.
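The rank claim in the abstract can be checked numerically with a small sketch. It assumes the standard determinant definition of the quadrifocal tensor (each entry is the determinant of four rows, one taken from each camera matrix); the source's block tensor may differ in sign or index ordering. The helper names `levi_civita_4` and `block_quadrifocal` are hypothetical, and the cameras are random matrices used purely for illustration.

```python
import itertools
import numpy as np

def levi_civita_4():
    """4-index Levi-Civita symbol; it plays the role of the Tucker core here."""
    eps = np.zeros((4, 4, 4, 4))
    for perm in itertools.permutations(range(4)):
        inversions = sum(perm[i] > perm[j]
                         for i in range(4) for j in range(i + 1, 4))
        eps[perm] = -1.0 if inversions % 2 else 1.0
    return eps

def block_quadrifocal(cameras):
    """Assumed block tensor: entry [a, b, c, d] is det of rows a, b, c, d
    of the stacked (3n x 4) camera matrix."""
    C = np.vstack(cameras)
    # Core (Levi-Civita) multiplied by the same factor matrix C in every mode.
    return np.einsum('ai,bj,ck,dl,ijkl->abcd', C, C, C, C, levi_civita_4())

rng = np.random.default_rng(0)
cameras = [rng.standard_normal((3, 4)) for _ in range(3)]  # n = 3 toy cameras
T = block_quadrifocal(cameras)

# Multilinear rank = rank of each mode-m unfolding of the tensor.
ranks = [np.linalg.matrix_rank(np.moveaxis(T, m, 0).reshape(T.shape[0], -1))
         for m in range(4)]
print(T.shape, ranks)
```

Because the determinant is multilinear in its rows, the tensor factors as a fixed 4×4×4×4 core multiplied by the stacked camera matrix in every mode, which is why the unfolding ranks do not grow with the number of cameras.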

[124] Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models

Siqi Lu, Wanying Xu, Yongbin Zheng, Wenting Luan, Peng Sun, Jianhang Yao

Main category: cs.CV

TL;DR: A method to address missing modality problems in multimodal models by quantifying modality preference in frequency domain and dynamically re-balancing contributions during training.

Details

Motivation: Missing modalities cause catastrophic performance degradation in multimodal models due to imbalanced learning where models develop implicit preferences for certain modalities, leading to under-optimization of others.

Method: Proposes Frequency Ratio Metric (FRM) to quantify modality preference by analyzing features in frequency domain, then introduces Multimodal Weight Allocation Module (MWAM) - a plug-and-play component that dynamically re-balances each modality branch’s contribution during training.

Result: MWAM can be integrated into diverse architectural backbones (CNNs, ViTs), delivers consistent performance gains across various tasks and modality combinations, and improves both base models and state-of-the-art methods addressing missing modality problems.

Conclusion: The frequency domain analysis provides an effective way to quantify and address modality imbalance, enabling more robust multimodal learning that handles missing modalities better.

Abstract: Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, where the model develops an implicit preference for certain modalities, leading to the under-optimization of others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a Multimodal Weight Allocation Module (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. This advancement extends beyond merely optimizing the performance of the base model; it also manifests as further performance improvements to state-of-the-art methods addressing the missing modality problem.

[125] Interactive Medical-SAM2 GUI: A Napari-based semi-automatic annotation tool for medical images

Woojae Hong, Jong Ha Hwang, Jiyong Chung, Joongyeon Choi, Hyunngun Kim, Yong Hwy Kim

Main category: cs.CV

TL;DR: Interactive Medical-SAM2 GUI is an open-source desktop application for semi-automatic annotation of 2D/3D medical images using SAM2-style propagation with box/point prompting in a unified Napari-based workflow.

Details

Motivation: Manual voxel-level annotation for 3D medical scans is slow and expensive, and existing tools lack unified, cohort-oriented workflows for navigation, propagation, interactive correction, and quantitative export in a single local pipeline.

Method: Built on Napari multi-dimensional viewer, integrates box/point prompting with SAM2-style propagation by treating 3D volumes as slice sequences. Uses Medical-SAM2 on top of SAM2 for mask propagation from sparse prompts. Provides local-first workflow for efficient 3D annotation across multiple studies using DICOM/NIfTI formats.

Result: Open-source Python application with Napari and PyTorch implementation, supporting per-object volumetry, 3D volume rendering, image geometry preservation via SimpleITK, and optional N4 bias-field correction. Released on GitHub for research annotation workflows.

Conclusion: Provides a practical solution for semi-automatic medical image annotation with unified workflow for 3D segmentation tasks, addressing limitations of existing tools while maintaining local-first approach for research use.

Abstract: Interactive Medical-SAM2 GUI is an open-source desktop application for semi-automatic annotation of 2D and 3D medical images. Built on the Napari multi-dimensional viewer, box/point prompting is integrated with SAM2-style propagation by treating a 3D volume as a slice sequence, enabling mask propagation from sparse prompts using Medical-SAM2 on top of SAM2. Voxel-level annotation remains essential for developing and validating medical imaging algorithms, yet manual labeling is slow and expensive for 3D scans, and existing integrations frequently emphasize per-slice interaction without providing a unified, cohort-oriented workflow for navigation, propagation, interactive correction, and quantitative export in a single local pipeline. To address this practical limitation, a local-first Napari workflow is provided for efficient 3D annotation across multiple studies using standard DICOM series and/or NIfTI volumes. Users can annotate cases sequentially under a single root folder with explicit proceed/skip actions, initialize objects via box-first prompting (including first/last-slice initialization for single-object propagation), refine predictions with point prompts, and finalize labels through prompt-first correction prior to saving. During export, per-object volumetry and 3D volume rendering are supported, and image geometry is preserved via SimpleITK. The GUI is implemented in Python using Napari and PyTorch, with optional N4 bias-field correction, and is intended exclusively for research annotation workflows. The code is released on the project page: https://github.com/SKKU-IBE/Medical-SAM2GUI/.
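The per-object volumetry mentioned above reduces to counting labeled voxels and scaling by the physical voxel size. A minimal sketch follows, assuming an integer label volume with 0 as background and voxel spacing given in millimetres per axis; the function name `per_object_volume_ml` is hypothetical and is not taken from the tool's code.

```python
import numpy as np

def per_object_volume_ml(labels, spacing_mm):
    """Return {object_id: volume in millilitres} for a labeled 3D volume.

    labels     : integer array, 0 = background, k > 0 = object k (assumed).
    spacing_mm : voxel size along each axis in millimetres (assumed units).
    """
    voxel_mm3 = float(np.prod(spacing_mm))           # volume of one voxel
    ids, counts = np.unique(labels[labels > 0], return_counts=True)
    # 1 mL = 1000 mm^3
    return {int(i): float(c) * voxel_mm3 / 1000.0 for i, c in zip(ids, counts)}

# Toy volume with two objects and anisotropic spacing.
toy = np.zeros((4, 4, 4), dtype=np.int32)
toy[:2, :2, :2] = 1      # 8 voxels
toy[3, 3, :] = 2         # 4 voxels
volumes = per_object_volume_ml(toy, spacing_mm=(1.0, 0.5, 0.5))
print(volumes)
```

In practice the spacing would come from the image header (for example via SimpleITK's `Image.GetSpacing()`), which is why preserving image geometry during export matters for the reported volumes.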

[126] Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang

Main category: cs.CV

TL;DR: DPCache: A training-free diffusion acceleration framework that formulates sampling as global path planning, using dynamic programming to select optimal key timesteps for feature caching and prediction.

Details

Motivation: Diffusion models have high computational costs from multi-step sampling. Existing caching methods use fixed schedules without considering global denoising trajectory structure, leading to error accumulation and artifacts.

Method: DPCache constructs a Path-Aware Cost Tensor from calibration data to quantify path-dependent error of skipping timesteps. It uses dynamic programming to globally optimize key timestep selection that minimizes total path cost while preserving fidelity.

Result: Achieves strong acceleration with minimal quality loss on FLUX: outperforms prior acceleration methods by +0.031 ImageReward at 4.87× speedup, and surpasses the full-step baseline by +0.028 ImageReward at 3.54× speedup.

Conclusion: DPCache provides an effective training-free acceleration framework through global path planning, outperforming prior methods and demonstrating practical deployment benefits for diffusion models.

Abstract: Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by +0.031 ImageReward at 4.87× speedup and even surpassing the full-step baseline by +0.028 ImageReward at 3.54× speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.
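The key-timestep selection can be illustrated with a small dynamic program. This is a simplified sketch, not the paper's algorithm: it assumes a two-index cost `cost[i][j]` (the error of jumping from key timestep `i` directly to key timestep `j`), whereas the abstract describes a tensor that is additionally conditioned on the preceding key timestep. The function name `select_key_timesteps`, the fixed key budget `num_keys`, and the toy cost are all hypothetical.

```python
import math

def select_key_timesteps(cost, num_keys):
    """Pick `num_keys` key timesteps (first and last always included)
    that minimize the summed jump cost along the path.

    cost[i][j] : assumed error of skipping every timestep strictly
                 between key i and key j (only i < j is used).
    """
    T = len(cost)
    assert 2 <= num_keys <= T
    # dp[k][j]: cheapest path that uses k + 1 keys, starts at 0, ends at j.
    dp = [[math.inf] * T for _ in range(num_keys)]
    back = [[-1] * T for _ in range(num_keys)]
    dp[0][0] = 0.0
    for k in range(1, num_keys):
        for j in range(1, T):
            for i in range(j):
                candidate = dp[k - 1][i] + cost[i][j]
                if candidate < dp[k][j]:
                    dp[k][j] = candidate
                    back[k][j] = i
    # Walk the back-pointers from the final timestep to recover the path.
    path, j = [T - 1], T - 1
    for k in range(num_keys - 1, 0, -1):
        j = back[k][j]
        path.append(j)
    return path[::-1], dp[num_keys - 1][T - 1]

# Toy cost: squared number of skipped timesteps between two keys.
T = 7
toy_cost = [[(j - i - 1) ** 2 if j > i else math.inf for j in range(T)]
            for i in range(T)]
keys, total = select_key_timesteps(toy_cost, num_keys=4)
print(keys, total)
```

Extending the state from the last key to the pair (previous key, last key) would accommodate a cost conditioned on the preceding key timestep, at the price of one more loop level.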

[127] ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals

Xuelu Li, Zhaonan Wang, Xiaogang Wang, Lei Wu, Manyi Li, Changhe Tu

Main category: cs.CV

TL;DR: ArtPro: A self-supervised framework for reconstructing articulated objects using adaptive integration of mobility proposals, addressing sensitivity to initial segmentation in differentiable rendering methods.

Details

Motivation: Current self-supervised methods for reconstructing articulated objects using differentiable rendering (like 3D Gaussian Splatting) are highly sensitive to initial part segmentation, relying on heuristic clustering or pre-trained models that often lead to local minima, especially for complex multi-part objects.

Method: ArtPro introduces adaptive integration of mobility proposals: 1) Over-segmentation initialization guided by geometry features and motion priors to generate part proposals with plausible motion hypotheses, 2) Dynamic merging of proposals by analyzing motion consistency among spatial neighbors during optimization, 3) Collision-aware motion pruning mechanism to prevent erroneous kinematic estimation.

Result: Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.

Conclusion: ArtPro provides a novel self-supervised framework that addresses the segmentation sensitivity problem in articulated object reconstruction, enabling more robust and accurate digital twin creation for applications like robotic manipulation and interactive simulation.

Abstract: Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects. To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.

[128] Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen

Main category: cs.CV

TL;DR: Open-vocabulary 3D occupancy framework for indoor environments using geometry-only supervision with 3D Language-Embedded Gaussians, achieving state-of-the-art results on Occ-ScanNet.

Details

Motivation: Existing open-vocabulary occupancy methods work well for outdoor driving scenarios but transfer poorly to indoor environments where geometry is denser, layouts are more intricate, and semantics are more fine-grained. There's a need for indoor-specific approaches that can handle complex semantic categories beyond fixed taxonomies.

Method: Uses geometry-only supervision with only binary occupancy labels. Builds on 3D Language-Embedded Gaussians as unified intermediate representation. Introduces opacity-aware Poisson-based approach for stable volumetric aggregation under weak supervision, and Progressive Temperature Decay schedule to sharpen opacities during splatting for better Gaussian-language alignment.

Result: Achieves 59.50 IoU and 21.05 mIoU on Occ-ScanNet in open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by large margin in mIoU.

Conclusion: The framework successfully addresses indoor open-vocabulary 3D occupancy challenges through geometry-only supervision and improved Gaussian-to-occupancy operators, enabling better understanding of complex indoor environments for embodied agents.

Abstract: Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.

[129] SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling

Guanghao Liao, Zhen Liu, Liyuan Cao, Yonghui Yang, Qi Li

Main category: cs.CV

TL;DR: SPMamba-YOLO is a novel underwater object detection network that combines multi-scale feature enhancement with global context modeling using SPPELAN, PSA, and Mamba-based state space modeling to address challenges in underwater environments.

Details

Motivation: Underwater object detection faces severe challenges including light attenuation, color distortion, background clutter, and small target scales, requiring specialized solutions beyond standard object detection approaches.

Method: Proposes SPMamba-YOLO with three key components: 1) SPPELAN module for multi-scale feature aggregation and receptive field expansion, 2) Pyramid Split Attention mechanism for feature discrimination by emphasizing informative regions, and 3) Mamba-based state space modeling for capturing long-range dependencies and global context.

Result: Outperforms YOLOv8n baseline by more than 4.9% in mAP@0.5 on URPC2022 dataset, particularly effective for small and densely distributed underwater objects, while maintaining good balance between accuracy and computational cost.

Conclusion: SPMamba-YOLO effectively addresses underwater object detection challenges through integrated multi-scale feature enhancement and global context modeling, demonstrating superior performance for underwater applications.

Abstract: Underwater object detection is a critical yet challenging research problem owing to severe light attenuation, color distortion, background clutter, and the small scale of underwater targets. To address these challenges, we propose SPMamba-YOLO, a novel underwater object detection network that integrates multi-scale feature enhancement with global context modeling. Specifically, a Spatial Pyramid Pooling Enhanced Layer Aggregation Network (SPPELAN) module is introduced to strengthen multi-scale feature aggregation and expand the receptive field, while a Pyramid Split Attention (PSA) mechanism enhances feature discrimination by emphasizing informative regions and suppressing background interference. In addition, a Mamba-based state space modeling module is incorporated to efficiently capture long-range dependencies and global contextual information, thereby improving detection robustness in complex underwater environments. Extensive experiments on the URPC2022 dataset demonstrate that SPMamba-YOLO outperforms the YOLOv8n baseline by more than 4.9% in mAP@0.5, particularly for small and densely distributed underwater objects, while maintaining a favorable balance between detection accuracy and computational cost.

[130] ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham

Main category: cs.CV

TL;DR: ViCLIP-OT: A Vietnamese vision-language model combining CLIP-style contrastive learning with Similarity-Graph Regularized Optimal Transport loss for improved image-text retrieval in low-resource settings.

Details

Motivation: Most vision-language models are optimized for high-resource languages and perform suboptimally for low-resource languages like Vietnamese. There's a need for specialized models that can effectively handle cross-modal retrieval in underrepresented linguistic contexts.

Method: Integrates CLIP-style contrastive learning with a novel Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and reduce modality gap issues. Specifically designed for Vietnamese language.

Result: Outperforms CLIP and SigLIP baselines on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, Crossmodal-3600). Achieves 67.34% average Recall@K on UIT-OpenViIC (5.75 percentage points above CLIP) and surpasses CLIP by 11.72 percentage points in zero-shot evaluation on Crossmodal-3600. Embedding analysis confirms improved alignment and reduced modality gap.

Conclusion: SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages. The approach has practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.

Abstract: Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
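The two generic ingredients named above, a CLIP-style symmetric contrastive loss and an optimal-transport plan between image and text embeddings, can be sketched as follows. This is not the SIGROT loss, whose similarity-graph regularizer is not specified in this summary; the sketch swaps in plain entropic optimal transport solved by Sinkhorn iterations. The function names, the temperature, and the regularization strength `epsilon` are assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)          # numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: row i of `img` is paired with row i of `txt`."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    image_to_text = -np.mean(np.diag(log_softmax(logits, axis=1)))
    text_to_image = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (image_to_text + text_to_image)

def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropic OT plan with uniform marginals (a generic stand-in,
    not the paper's regularized formulation)."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / epsilon)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)                            # match column marginals
        u = a / (K @ v)                              # match row marginals
    return u[:, None] * K * v[None, :]

# Toy batch: four perfectly aligned image/text embedding pairs.
emb = np.eye(4)
aligned_loss = clip_contrastive_loss(emb, emb)
shuffled_loss = clip_contrastive_loss(emb, emb[::-1])
plan = sinkhorn_plan(1.0 - l2_normalize(emb) @ l2_normalize(emb).T)
print(aligned_loss < shuffled_loss, plan.argmax(axis=1))
```

The contrastive term only scores each pair against the other items in the batch, while the transport plan couples the whole batch at once, which is one way to read the abstract's emphasis on global cross-modal consistency.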

[131] SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

Zhuohang Jiang, Xu Yuan, Haohao Qu, Shanru Lin, Kanglong Liu, Wenqi Fan, Qing Li

Main category: cs.CV

TL;DR: SUPERGLASSES: First comprehensive VQA benchmark using real-world smart glasses data, with SUPERLENS agent achieving SOTA performance by integrating object detection, query decoupling, and multimodal web search.

Details

Motivation: Existing VLMs for smart glasses are trained on traditional datasets that lack realism and variety for actual smart glasses usage scenarios, failing to address the specific challenge of accurately identifying objects of interest before external knowledge retrieval.

Method: 1) Created SUPERGLASSES benchmark with 2,422 egocentric image-question pairs from real smart glasses data across 14 domains and 8 query categories; 2) Proposed SUPERLENS agent with automatic object detection, query decoupling, and multimodal web search integration for retrieval-augmented answer generation.

Result: Evaluation of 26 VLMs showed significant performance gaps. SUPERLENS achieved state-of-the-art performance, surpassing GPT-4o by 2.19% on the benchmark.

Conclusion: Smart glasses VQA requires task-specific solutions due to unique challenges in object identification and knowledge retrieval. The proposed benchmark and agent demonstrate the need for specialized approaches beyond general VLMs.

Abstract: The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.

[132] No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-Eui Yoon

Main category: cs.CV

TL;DR: MoFit: A caption-free membership inference attack framework for latent diffusion models that constructs synthetic conditioning inputs to detect training data memorization without requiring ground-truth captions.

Details

Motivation: Latent diffusion models can memorize training data, raising privacy/IP concerns. Existing membership inference attacks require ground-truth captions, which are often unavailable in real scenarios where only images exist without textual annotations.

Method: Two-stage approach: 1) Model-fitted surrogate optimization - optimize perturbations to construct surrogates in the model’s unconditional prior learned from member samples; 2) Surrogate-driven embedding extraction - derive model-fitted embeddings used as mismatched conditions for query images to amplify conditional loss differences between members and non-members.

Result: MoFit consistently outperforms prior vision-language model-conditioned baselines and achieves performance competitive with caption-dependent methods across multiple datasets and diffusion models.

Conclusion: MoFit provides an effective caption-free framework for auditing memorization in diffusion models, addressing practical limitations of requiring ground-truth captions for membership inference attacks.

Abstract: Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model’s generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model’s unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.

[133] GFRRN: Explore the Gaps in Single Image Reflection Removal

Yu Chen, Zewei He, Xingyu Liu, Zixuan Chen, Zheming Lu

Main category: cs.CV

TL;DR: Proposes GFRRN network for single image reflection removal using parameter-efficient fine-tuning, unified label generation, frequency learning, and dynamic attention mechanisms.

Details

Motivation: Addresses two key limitations in existing reflection removal methods: (1) semantic understanding gap between pre-trained model features and reflection removal model features, and (2) reflection label inconsistencies between synthetic and real-world training data.

Method: Uses parameter efficient fine-tuning (PEFT) with Mona layers to align training directions; designs label generator to unify reflection labels; proposes Gaussian-based Adaptive Frequency Learning Block (G-AFLB) for frequency prior learning; employs Dynamic Agent Attention (DAA) for inter- and intra-window significance modeling.

Result: Extensive experiments demonstrate effectiveness, achieving superior performance against state-of-the-art single image reflection removal methods.

Conclusion: GFRRN effectively addresses feature alignment and label consistency issues in reflection removal, outperforming existing methods through integrated PEFT, unified labeling, frequency learning, and dynamic attention mechanisms.

Abstract: Prior dual-stream methods with the feature interaction mechanism have achieved remarkable performance in single image reflection removal (SIRR). However, they often struggle with (1) a semantic understanding gap between the features of pre-trained models and those of reflection removal models, and (2) reflection label inconsistencies between synthetic and real-world training data. In this work, we first adopt the parameter-efficient fine-tuning (PEFT) strategy by integrating several learnable Mona layers into the pre-trained model to align the training directions. Then, a label generator is designed to unify the reflection labels for both synthetic and real-world data. In addition, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) is proposed to adaptively learn and fuse the frequency priors, and a Dynamic Agent Attention (DAA) is employed as an alternative to window-based attention by dynamically modeling the significance levels across windows (inter-) and within an individual window (intra-). These components constitute our proposed Gap-Free Reflection Removal Network (GFRRN). Extensive experiments demonstrate the effectiveness of our GFRRN, achieving superior performance against state-of-the-art SIRR methods.

[134] UFO-DETR: Frequency-Guided End-to-End Detector for UAV Tiny Objects

Yuankai Chen, Kai Lin, Qihong Wu, Xinxuan Yang, Jiashuo Lai, Ruoen Chen, Haonan Shi, Minfan He, Meihua Wang

Main category: cs.CV

TL;DR: UFO-DETR: An end-to-end object detection framework for UAV imagery that addresses small target challenges through LSKNet backbone, DAttention/AIFI modules for multi-scale modeling, and DynFreq-C3 for frequency feature enhancement.

Details

Motivation: Small target detection in UAV imagery faces challenges including scale variations, dense distribution, and dominance of small targets. Existing algorithms rely on manual design and general-purpose detectors aren't optimized for UAV images, making it hard to balance accuracy and complexity.

Method: Proposes UFO-DETR with: 1) LSKNet-based backbone to optimize receptive field and reduce parameters, 2) DAttention and AIFI modules to flexibly model multi-scale spatial relationships, and 3) DynFreq-C3 module for cross-space frequency feature enhancement to boost small target detection.

Result: Experimental results show significant advantages over RT-DETR-L in both detection performance and computational efficiency, providing an efficient solution for UAV edge computing.

Conclusion: UFO-DETR offers an effective framework for UAV small target detection that balances accuracy and efficiency, addressing key challenges in UAV imagery analysis.

Abstract: Small target detection in UAV imagery faces significant challenges such as scale variations, dense distribution, and the dominance of small targets. Existing algorithms rely on manually designed components, and general-purpose detectors are not optimized for UAV images, making it difficult to balance accuracy and complexity. To address these challenges, this paper proposes an end-to-end object detection framework, UFO-DETR, which integrates an LSKNet-based backbone network to optimize the receptive field and reduce the number of parameters. By combining the DAttention and AIFI modules, the model flexibly models multi-scale spatial relationships, improving multi-scale target detection performance. Additionally, the DynFreq-C3 module is proposed to enhance small target detection capability through cross-space frequency feature enhancement. Experimental results show that, compared to RT-DETR-L, the proposed method offers significant advantages in both detection performance and computational efficiency, providing an efficient solution for UAV edge computing.

[135] SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

Guanting Ye, Qiyan Zhao, Wenhao Yu, Liangyu Yuan, Mingkai Li, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Qing Jiang, Ka-Veng Yuen

Main category: cs.CV

TL;DR: SoPE introduces spherical coordinate-based positional embedding for 3D LVLMs to better preserve 3D spatial structures and capture directional variations, overcoming limitations of vanilla RoPE.

Details

Motivation: Current 3D Large Vision-Language Models use Rotary Position Embedding (RoPE) which is suboptimal for 3D understanding. Vanilla RoPE fails to preserve 3D spatial structures and overlooks angular dependencies, hindering directional variation capture in visual representations.

Method: Proposes Spherical Coordinate-based Positional Embedding (SoPE) that maps point-cloud token indices into 3D spherical coordinate space for unified modeling of spatial locations and directional angles. Also introduces multi-scale frequency mixing strategy to fuse feature information across different frequency domains.

Result: Experimental results on multiple 3D scene benchmarks validate effectiveness, and real-world deployment experiments demonstrate strong generalization capability.

Conclusion: SoPE provides better geometric structure preservation, enhanced spatial awareness, and more consistent expressive geometric representations for 3D multimodal learning compared to vanilla RoPE.

Abstract: 3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model’s ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
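
As a rough illustration of the coordinate change that SoPE builds on, the sketch below converts a Cartesian point into spherical coordinates (radius, polar angle, azimuth). The function name `to_spherical` and the angle conventions are assumptions made for illustration; the abstract does not say how token indices are mapped into this space or how the angles enter the rotary embedding.

```python
import math

def to_spherical(x, y, z):
    """Convert Cartesian (x, y, z) to (r, theta, phi).

    theta is the polar angle measured from the +z axis, in [0, pi].
    phi is the azimuth in the x-y plane, in [-pi, pi].
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # Angles are undefined at the origin; return zeros by convention.
        return 0.0, 0.0, 0.0
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, z / r))
    theta = math.acos(cos_theta)
    phi = math.atan2(y, x)
    return r, theta, phi
```

Unlike a single 1D token index, the triple (r, theta, phi) separates distance from direction, which is the property the abstract credits for capturing angular dependencies.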

[136] IRSDE-Despeckle: A Physics-Grounded Diffusion Model for Generalizable Ultrasound Despeckling

Shuoqi Chen, Yujia Wu, Geoffrey P. Luke

Main category: cs.CV

TL;DR: A diffusion-based method for ultrasound image despeckling using simulated training data from MRI, achieving state-of-the-art performance with uncertainty quantification.

Details

Motivation: Ultrasound imaging suffers from speckle artifacts that reduce image quality and hinder clinical interpretation, necessitating effective despeckling methods.

Method: Uses diffusion-based despeckling built on Image Restoration Stochastic Differential Equations framework, trained on large simulated datasets created by converting speckle-free MRI images to ultrasound using Matlab UltraSound Toolbox.

Result: Outperforms classical filters and recent learning-based baselines on simulated test sets, preserves anatomical edges and contrast, and provides uncertainty quantification via cross-model variance.

Conclusion: The diffusion approach effectively removes speckle while preserving important anatomical features, with uncertainty quantification offering practical clinical utility, though domain shift issues highlight need for diversified training.

Abstract: Ultrasound imaging is widely used for real-time, noninvasive diagnosis, but speckle and related artifacts reduce image quality and can hinder interpretation. We present a diffusion-based ultrasound despeckling method built on the Image Restoration Stochastic Differential Equations framework. To enable supervised training, we curate large paired datasets by simulating ultrasound images from speckle-free magnetic resonance images using the Matlab UltraSound Toolbox. The proposed model reconstructs speckle-suppressed images while preserving anatomically meaningful edges and contrast. On a held-out simulated test set, our approach consistently outperforms classical filters and recent learning-based despeckling baselines. We quantify prediction uncertainty via cross-model variance and show that higher uncertainty correlates with higher reconstruction error, providing a practical indicator of difficult or failure-prone regions. Finally, we evaluate sensitivity to simulation probe settings and observe domain shift, motivating diversified training and adaptation for robust clinical deployment.
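
One plausible reading of "cross-model variance" is a per-pixel variance over the outputs of several independently trained models. The sketch below computes that quantity on flattened images; the helper name `ensemble_mean_and_variance` and the use of population variance are assumptions, not details taken from the paper.

```python
from statistics import mean, pvariance

def ensemble_mean_and_variance(predictions):
    """Per-pixel mean and variance across model outputs.

    predictions: list of flattened images (lists of floats of equal length),
    one per model. Returns (mean_image, variance_image).
    """
    pixels = list(zip(*predictions))  # group the values predicted for each pixel
    mean_image = [mean(p) for p in pixels]
    variance_image = [pvariance(p) for p in pixels]
    return mean_image, variance_image
```

Pixels where the models disagree get a high variance, which is the kind of signal the abstract reports as correlating with reconstruction error.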

[137] HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models

Yangguang Lin, Quan Fang, Yufei Li, Jiachen Sun, Junyu Gao, Jitao Sang

Main category: cs.CV

TL;DR: HulluEdit: A single-pass, reference-free framework that reduces object hallucination in Large Vision-Language Models by decomposing hidden states into orthogonal subspaces and selectively suppressing hallucinatory patterns without affecting visual grounding.

Details

Motivation: Object hallucination in LVLMs hinders reliable deployment. Existing methods struggle with efficiency-accuracy trade-offs, requiring expensive reference models/multiple passes or risking suppression of genuine visual evidence with static edits.

Method: Introduces orthogonal subspace editing: decomposes hidden states into three orthogonal subspaces (visual evidence, conflicting priors, residual uncertainty) to selectively suppress hallucinatory patterns. Mathematically guarantees edits to prior subspace leave visual component unaffected.

Result: Achieves SOTA hallucination reduction on POPE and CHAIR benchmarks across diverse architectures, preserves general capabilities on MME, maintains efficient inference, and outperforms contrastive decoding and static subspace editing baselines.

Conclusion: HulluEdit offers a new pathway toward more trustworthy LVLMs through efficient, accurate hallucination reduction without compromising visual grounding or general capabilities.

Abstract: Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
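
The guarantee that editing the prior subspace leaves the visual component untouched follows from orthogonality, and can be checked on a toy example. The sketch below is a minimal illustration under stated assumptions: the two basis vectors, the scaling factor `alpha`, and the helper names are hypothetical, and the abstract does not describe how HulluEdit actually constructs its subspaces.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(h, basis):
    """Project h onto the span of an orthonormal basis (list of vectors)."""
    out = [0.0] * len(h)
    for u in basis:
        coeff = dot(h, u)
        out = [o + coeff * ui for o, ui in zip(out, u)]
    return out

def suppress_prior(h, prior_basis, alpha):
    """Subtract alpha times the prior-subspace component of h."""
    prior_part = project(h, prior_basis)
    return [hi - alpha * qi for hi, qi in zip(h, prior_part)]

# Toy orthonormal directions (hypothetical): one visual, one prior.
s = 1.0 / math.sqrt(2.0)
visual_basis = [[s, s, 0.0, 0.0]]
prior_basis = [[s, -s, 0.0, 0.0]]

h = [3.0, 1.0, 0.5, -2.0]
h_edit = suppress_prior(h, prior_basis, alpha=0.8)

visual_before = project(h, visual_basis)
visual_after = project(h_edit, visual_basis)
```

Because the prior direction is orthogonal to the visual direction, `visual_before` and `visual_after` agree up to floating-point error, while the prior component of `h_edit` shrinks by the factor `1 - alpha`.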

[138] Asymmetric Idiosyncrasies in Multimodal Models

Muzi Tao, Chufan Shi, Huijuan Wang, Shengbang Tong, Xuezhe Ma

Main category: cs.CV

TL;DR: Caption models embed distinctive stylistic signatures that are easily detectable in text but largely disappear when those captions are used to generate images with text-to-image models.

Details

Motivation: To understand how stylistic idiosyncrasies in caption models propagate through multimodal systems, specifically examining whether distinctive captioning styles are preserved when those captions are used to generate images.

Method: Systematic analysis using neural networks to predict originating caption model from either generated captions or corresponding generated images, with accuracy as a metric for stylistic preservation.

Result: Text classification achieves 99.70% accuracy showing strong stylistic signatures in captions, but image classification drops to at most 50% accuracy, indicating these signatures largely disappear in generated images.

Conclusion: Caption models have distinctive stylistic signatures that are not well-preserved by text-to-image models, revealing limitations in cross-modal consistency and prompt-following ability.

Abstract: In this work, we study idiosyncrasies in the caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the corresponding image, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.

[139] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

Tongfei Chen, Shuo Yang, Yuguang Yang, Linlin Yang, Runtang Guo, Changbai Li, He Long, Chunyu Xie, Dawei Leng, Baochang Zhang

Main category: cs.CV

TL;DR: AML introduces alignment-aware masked learning for referring image segmentation, using pixel-level vision-language alignment estimation to filter unreliable regions and focus on trustworthy cues, achieving SOTA results.

Details

Motivation: Current RIS methods lack explicit modeling of pixel-level vision-language alignment, leading to suboptimal performance when language expressions don't perfectly match visual regions. There's a need to better handle misalignment between language descriptions and visual content.

Method: Proposes Alignment-Aware Masked Learning (AML) that: 1) estimates pixel-level vision-language alignment scores, 2) masks out poorly aligned regions during training, 3) focuses optimization only on trustworthy aligned regions, and 4) uses this alignment estimation to guide segmentation.

Result: Achieves state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Shows improved robustness to diverse language descriptions and challenging scenarios compared to previous methods.

Conclusion: Explicitly modeling and leveraging pixel-level vision-language alignment through masked learning significantly improves RIS performance and robustness, demonstrating the importance of alignment-aware training strategies.

Abstract: Referring Image Segmentation (RIS) aims to segment an object in an image identified by a natural language expression. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy to enhance RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach results in state-of-the-art performance on RefCOCO datasets and also enhances robustness to diverse descriptions and scenarios.
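
The idea of filtering poorly aligned regions during optimization can be sketched as a loss that averages only over pixels whose alignment score clears a threshold. This is a minimal sketch under assumptions: the function name `masked_bce`, the fixed `threshold`, and the choice of binary cross-entropy are illustrative, and the abstract does not describe how AML estimates the alignment scores.

```python
import math

def masked_bce(probs, targets, align_scores, threshold=0.5, eps=1e-7):
    """Mean binary cross-entropy over pixels with align_score >= threshold.

    probs, targets, align_scores are flat lists of equal length.
    Returns 0.0 if every pixel is masked out.
    """
    total, kept = 0.0, 0
    for p, t, a in zip(probs, targets, align_scores):
        if a < threshold:
            continue  # poorly aligned pixel: excluded from the loss
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        kept += 1
    return total / kept if kept else 0.0
```

For example, with `align_scores = [0.8, 0.2, 0.9]` the second pixel contributes nothing, so a confidently wrong prediction there does not affect the loss.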

[140] ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control

Akihisa Watanabe, Qing Yu, Edgar Simo-Serra, Kent Fujiwara

Main category: cs.CV

TL;DR: ProjFlow is a training-free sampler for generating human motion with exact spatial constraints using kinematics-aware projection and time-varying pseudo-observations.

Details

Motivation: Existing approaches for human motion generation with spatial constraints require task-specific training or slow optimization, and often disrupt motion naturalness when enforcing hard constraints.

Method: Proposes ProjFlow, a training-free sampler that formulates animation tasks as linear inverse problems. Uses a novel kinematics-aware metric encoding skeletal topology to distribute corrections coherently across the skeleton. Introduces time-varying pseudo-observations for sparse inputs like filling long gaps between keyframes.

Result: ProjFlow achieves exact constraint satisfaction while preserving motion realism, matches or improves realism over zero-shot baselines, and remains competitive with training-based controllers in motion inpainting and 2D-to-3D lifting tasks.

Conclusion: ProjFlow provides an effective training-free solution for human motion generation with precise spatial control, addressing limitations of existing approaches through kinematics-aware constraint enforcement.

Abstract: Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on two representative applications, motion inpainting and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.
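
To make "projection under a metric" concrete, the sketch below projects a vector onto a single linear constraint a · z = y while minimizing a weighted squared distance. It is a simplified sketch: the diagonal metric, the single constraint, and the name `project_onto_constraint` are assumptions, whereas the paper's kinematics-aware metric encodes skeletal topology and is not reproduced here.

```python
def project_onto_constraint(x, a, y, metric_diag):
    """Return argmin_z sum_i m_i * (z_i - x_i)**2 subject to a . z = y.

    Closed form: z = x - M^{-1} a * (a . x - y) / (a^T M^{-1} a),
    where M = diag(metric_diag) with strictly positive entries.
    """
    w = [ai / mi for ai, mi in zip(a, metric_diag)]       # M^{-1} a
    residual = sum(ai * xi for ai, xi in zip(a, x)) - y   # a . x - y
    denom = sum(ai * wi for ai, wi in zip(a, w))          # a^T M^{-1} a
    return [xi - wi * residual / denom for xi, wi in zip(x, w)]
```

With an identity metric the correction is spread evenly; raising a coordinate's weight makes it move less, so the choice of metric decides how a hard constraint is distributed across coordinates.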

[141] SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

Fengming Liu, Tat-Jen Cham, Chuanxia Zheng

Main category: cs.CV

TL;DR: SPATIALALIGN is a self-improvement framework that enhances text-to-video models’ ability to depict dynamic spatial relationships specified in prompts through zeroth-order regularized DPO fine-tuning and a geometry-based evaluation metric.

Details

Motivation: Current text-to-video generators prioritize aesthetic quality but often ignore spatial constraints in generated videos, lacking proper alignment with dynamic spatial relationships specified in text prompts.

Method: Proposes SPATIALALIGN framework with: 1) zeroth-order regularized Direct Preference Optimization to fine-tune T2V models, 2) DSR-SCORE - a geometry-based metric to quantitatively measure alignment between videos and specified DSRs, and 3) a dataset of text-video pairs with diverse DSRs.

Result: Extensive experiments show the fine-tuned model significantly outperforms baselines in spatial relationship alignment, with DSR-SCORE providing quantitative evaluation beyond previous VLM-based methods.

Conclusion: SPATIALALIGN effectively enhances T2V models’ capability to depict dynamic spatial relationships through self-improvement framework, novel optimization method, and quantitative evaluation metric.

Abstract: Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models’ capabilities to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the specified DSRs in prompts, which is a step forward from prior works that rely on VLM for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released in Link.
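
For orientation, the sketch below computes the standard DPO loss for one preference pair from log-probabilities under the policy and a frozen reference model. It shows plain DPO only; the zeroth-order regularization described in the abstract is not specified there and is not modeled, and the function name and default `beta` are assumptions.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * (win log-ratio - lose log-ratio)))."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference the margin is zero and the loss equals log 2; it falls as the policy raises the preferred sample's likelihood relative to the rejected one.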

[142] Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval

Yuan-Chih Chen, Chun-Shien Lu

Main category: cs.CV

TL;DR: A unified framework for recovering tampered image contents using hidden-code representations with multi-scale vector quantization and conditional Transformers, evaluated on a new ImageNet-S benchmark.

Details

Motivation: Current image authenticity research focuses mainly on deepfake detection and localization, leaving content recovery for factual retrieval underexplored. There's a need for methods that can retrieve and restore tampered contents beyond just detecting manipulations.

Method: Proposes a unified hidden-code recovery framework that encodes semantic and perceptual information into compact hidden-code representations using multi-scale vector quantization. Uses conditional Transformer modules for enhanced contextual reasoning, enabling both retrieval and restoration from post-hoc and in-generation watermarking paradigms.

Result: Extensive experiments on the newly constructed ImageNet-S benchmark demonstrate promising retrieval and reconstruction performance. The method remains fully compatible with diverse watermarking pipelines and establishes a foundation for general-purpose image recovery.

Conclusion: The framework advances image authenticity research beyond detection and localization by enabling content recovery for factual retrieval, providing a foundation for general-purpose image recovery applications.

Abstract: Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.

[143] TrajTok: Learning Trajectory Tokens enables better Video Understanding

Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna

Main category: cs.CV

TL;DR: TrajTok is an end-to-end video tokenizer that produces object trajectories through unified spatiotemporal segmentation, enabling efficient video understanding by decoupling token count from video duration.

Details

Motivation: Traditional video tokenization methods generate excessive redundant tokens that limit efficiency and scalability. While trajectory-based approaches help, they rely on complex external pipelines that are slow and task-agnostic.

Method: TrajTok uses a unified segmenter for implicit spatiotemporal clustering to directly produce object trajectories in a single forward pass. It’s co-trained with video models, dynamically adapting token granularity to semantic complexity.

Result: TrajViT2 (video CLIP trained with TrajTok) achieves state-of-the-art accuracy on classification and retrieval benchmarks while maintaining efficiency comparable to token-merging methods. TrajTok also works well as a probing head (TrajAdapter) and as an alignment connector (TrajVLM).

Conclusion: TrajTok provides an efficient, end-to-end solution for video tokenization that improves understanding performance while being versatile enough for various multimodal applications including long-video reasoning.

Abstract: Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.

[144] SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

Ling Wang, Hao-Xiang Guo, Xinzhou Wang, Fuchun Sun, Kai Sun, Pengkun Liu, Hang Xiao, Zhong Wang, Guangyuan Fu, Eric Li, Yang Liu, Yikai Wang

Main category: cs.CV

TL;DR: SceneTransporter is an end-to-end framework for structured 3D scene generation from single images that uses entropic Optimal Transport within a compositional DiT model to organize part-level 3D objects into distinct instances.

Details

Motivation: Existing methods generate part-level 3D objects but fail to organize them into distinct instances in open-world scenes due to lack of structural constraints in the model's internal assignment mechanism.

Method: Reframes structured 3D scene generation as a global correlation assignment problem, formulating and solving an entropic Optimal Transport objective within the denoising loop of a compositional DiT model. This imposes structural constraints: transport plan gates cross-attention for exclusive routing of image patches to part-level 3D latents, and competitive transport encourages grouping of similar patches with edge-based cost regularization.

Result: Extensive experiments show SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity.

Conclusion: SceneTransporter successfully addresses the structured 3D scene generation problem by introducing structural constraints through Optimal Transport, enabling better organization of part-level 3D objects into distinct instances from single images.

Abstract: We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model’s internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://2019epwl.github.io/SceneTransporter/.
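
Entropic Optimal Transport problems of this kind are commonly solved with Sinkhorn iterations, which the sketch below implements on a toy cost matrix. It is illustrative only: the abstract does not state which solver SceneTransporter uses, how the patch-to-latent costs are built, or how the plan gates cross-attention, and the name `sinkhorn` and its defaults are assumptions.

```python
import math

def sinkhorn(cost, row_marg, col_marg, eps=0.1, n_iters=200):
    """Approximate the entropic OT plan for a cost matrix and two marginals.

    cost: n x m list of lists; row_marg, col_marg: target row/column sums.
    eps is the entropic regularization strength.
    """
    n, m = len(row_marg), len(col_marg)
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marg[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marg[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

On this toy problem almost all mass sits on the diagonal, so each "patch" is routed nearly exclusively to one "latent"; smaller `eps` makes the assignment sharper.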

[145] Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning

Taishu Arashima, Hiroshi Kera, Kazuhiko Kawamoto

Main category: cs.CV

TL;DR: Proposes a robust human trajectory prediction method using self-supervised skeleton representation pretrained with masked autoencoding to handle missing joints from occlusions.

Details

Motivation: Human trajectory prediction is important for autonomous navigation and video surveillance. While skeleton data complements trajectory information, real-world skeleton data often has missing joints due to occlusions, which degrades prediction accuracy, requiring more robust skeleton representations.

Method: Uses a self-supervised skeleton representation model pretrained with masked autoencoding to create robust representations that handle missing joints. Integrates this with trajectory prediction to improve robustness to occlusions.

Result: Experimental results in occlusion-prone scenarios show improved robustness to missing skeletal data without sacrificing prediction accuracy. Consistently outperforms baseline models in clean-to-moderate missingness regimes.

Conclusion: The proposed method effectively handles missing skeleton data from occlusions, improving trajectory prediction robustness while maintaining accuracy, making it suitable for real-world applications with imperfect sensor data.

Abstract: Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. These disturbances significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and consistently outperforms baseline models in clean-to-moderate missingness regimes.
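
The masked-autoencoding idea can be illustrated with a minimal sketch, assuming 2D joints stored as nested lists. The masking ratio, the zero fill value, and the helper names `mask_joints` and `masked_mse` are hypothetical and are not taken from the paper.

```python
import random

def mask_joints(skeleton, mask_ratio=0.5, seed=0):
    """Hide a random subset of joints, mimicking occlusion during pretraining.

    skeleton: list of frames, each a list of (x, y) joint coordinates.
    Returns the corrupted sequence and a boolean mask (True = hidden joint).
    """
    rng = random.Random(seed)
    n_frames, n_joints = len(skeleton), len(skeleton[0])
    positions = [(t, j) for t in range(n_frames) for j in range(n_joints)]
    hidden = set(rng.sample(positions, int(round(mask_ratio * len(positions)))))
    corrupted = [
        [(0.0, 0.0) if (t, j) in hidden else skeleton[t][j] for j in range(n_joints)]
        for t in range(n_frames)
    ]
    mask = [[(t, j) in hidden for j in range(n_joints)] for t in range(n_frames)]
    return corrupted, mask

def masked_mse(prediction, target, mask):
    """Reconstruction error measured only on the hidden joints."""
    errors = [
        (px - tx) ** 2 + (py - ty) ** 2
        for pred_frame, tgt_frame, mask_frame in zip(prediction, target, mask)
        for (px, py), (tx, ty), hidden in zip(pred_frame, tgt_frame, mask_frame)
        if hidden
    ]
    return sum(errors) / len(errors) if errors else 0.0
```

An encoder trained to minimise `masked_mse` has to infer hidden joints from visible ones, which is the intuition behind using such a representation when real joints are missing.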

[146] GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation

Hanliang Du, Zhangji Lu, Zewei Cai, Qijian Tang, Qifeng Yu, Xiaoli Liu

Main category: cs.CV

TL;DR: GSTurb is a novel framework for atmospheric turbulence mitigation that combines optical flow-guided tilt correction with Gaussian splatting to model non-isoplanatic blur, achieving state-of-the-art performance on both synthetic and real-world datasets.

Details

Motivation: Atmospheric turbulence causes significant image degradation through pixel displacement (tilt) and blur, especially in long-range imaging applications. Existing methods struggle to effectively handle both tilt and non-isoplanatic blur simultaneously.

Method: GSTurb integrates optical flow-guided tilt correction with Gaussian splatting to model non-isoplanatic blur. The framework uses Gaussian parameters to represent both tilt and blur, optimizing them across multiple frames for enhanced restoration.

Result: On the ATSyn-static dataset, GSTurb achieves PSNR of 27.67 dB and SSIM of 0.8735, improving over state-of-the-art by 1.3 dB (4.5%) and 0.048 (5.8%). It also outperforms existing methods on real datasets (TSRWGAN Real-World and CLEAR) in both qualitative and quantitative performance.

Conclusion: Combining optical flow-guided tilt correction with Gaussian splatting effectively enhances image restoration under both synthetic and real-world turbulence conditions, demonstrating superior performance over existing methods.

Abstract: Atmospheric turbulence causes significant image degradation due to pixel displacement (tilt) and blur, particularly in long-range imaging applications. In this paper, we propose a novel framework for atmospheric turbulence mitigation, GSTurb, which integrates optical flow-guided tilt correction and Gaussian splatting for modeling non-isoplanatic blur. The framework employs Gaussian parameters to represent tilt and blur, and optimizes them across multiple frames to enhance restoration. Experimental results on the ATSyn-static dataset demonstrate the effectiveness of our method, achieving a peak PSNR of 27.67 dB and SSIM of 0.8735. Compared to the state-of-the-art method, GSTurb improves PSNR by 1.3 dB (a 4.5% increase) and SSIM by 0.048 (a 5.8% increase). Additionally, on real datasets, including the TSRWGAN Real-World and CLEAR datasets, GSTurb outperforms existing methods, showing significant improvements in both qualitative and quantitative performance. These results highlight that combining optical flow-guided tilt correction with Gaussian splatting effectively enhances image restoration under both synthetic and real-world turbulence conditions. The code for this method will be available at https://github.com/DuhlLiamz/3DGS_turbulence/tree/main.
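
As a reminder of what the reported PSNR figures measure, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). It assumes images flattened to lists of intensities with an 8-bit peak of 255; the function name `psnr` is illustrative and is not from the GSTurb code.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of about 1 dB corresponds to roughly a 20% reduction in mean squared error.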

[147] PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

Mingde Yao, Zhiyuan You, Tam-King Man, Menglu Wang, Tianfan Xue

Main category: cs.CV

TL;DR: PhotoAgent is an autonomous image editing system that uses aesthetic planning and tree search to perform multi-step editing without requiring step-by-step user instructions.

Details

Motivation: Current instruction-based image editing requires carefully designed prompts, placing the burden of task decomposition and sequencing on users. The paper aims to achieve autonomous image editing that can reason about aesthetic intent and plan multi-step edits independently.

Method: Formulates autonomous image editing as a long-horizon decision-making problem. Uses tree search to plan multi-step editing actions, incorporates memory and visual feedback for closed-loop execution, and introduces UGC-Edit benchmark with 7,000 photos and an aesthetic reward model for evaluation.

Result: PhotoAgent consistently improves both instruction adherence and visual quality compared to baseline methods. The system demonstrates effective autonomous editing capabilities without requiring detailed step-by-step user prompts.

Conclusion: PhotoAgent advances autonomous image editing through explicit aesthetic planning and closed-loop execution, reducing user burden while maintaining high-quality results.

Abstract: With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is https://github.com/mdyao/PhotoAgent.
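
The planning step can be pictured with a small beam-limited tree search. This is a sketch under strong assumptions, not PhotoAgent's planner: the state is a toy (brightness, contrast) pair in integer percent, and `plan_edits`, `apply_action`, `score`, and the action names are hypothetical stand-ins for real editing tools and an aesthetic reward model.

```python
def plan_edits(state, actions, apply_action, score, depth=3, beam=2):
    """Expand every kept plan with every action and keep the `beam` best per level."""
    frontier = [(score(state), state, [])]
    best = frontier[0]
    for _ in range(depth):
        candidates = []
        for _, current, plan in frontier:
            for name in actions:
                nxt = apply_action(current, name)
                candidates.append((score(nxt), nxt, plan + [name]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = candidates[:beam]
        if frontier[0][0] > best[0]:
            best = frontier[0]
    return best  # (score, final_state, list_of_actions)

# Toy editing environment (all values are made up for illustration).
STEPS = {"brighten": (10, 0), "darken": (-10, 0), "more_contrast": (0, 10)}
TARGET = (60, 50)

def apply_action(state, name):
    db, dc = STEPS[name]
    return (state[0] + db, state[1] + dc)

def score(state):
    # Stand-in for a learned aesthetic reward: closer to TARGET is better.
    return -(abs(state[0] - TARGET[0]) + abs(state[1] - TARGET[1]))

best_score, final_state, plan = plan_edits((40, 40), list(STEPS), apply_action, score)
```

In a closed-loop system the score would be recomputed from the rendered image after each step, so a poor edit can be pruned before further edits build on it.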

[148] Face Time Traveller: Travel Through Ages Without Losing Identity

Purbayan Kar, Ayush Ghadiya, Vishal Chudasama, Pankaj Wasnik, C. V. Jawahar

Main category: cs.CV

TL;DR: FaceTT is a diffusion-based framework for high-fidelity face aging that preserves identity through attribute-aware prompts, tuning-free inversion, and adaptive attention control.

Details

Motivation: Existing face aging models struggle with identity preservation in wide age transformations, have static attention mechanisms, and rely on optimization-heavy inversion in diffusion models, limiting adaptability, fine-grained control, and background consistency.

Method: Proposes FaceTT with three key components: 1) Face-Attribute-Aware Prompt Refinement encoding biological and environmental aging cues, 2) tuning-free Angular Inversion for efficient mapping of real faces to diffusion latent space, and 3) Adaptive Attention Control that dynamically balances cross-attention for semantic aging and self-attention for identity preservation.

Result: Extensive experiments on benchmark datasets and in-the-wild tests demonstrate superior identity retention, background preservation, and aging realism compared to state-of-the-art methods.

Conclusion: FaceTT effectively addresses key challenges in face aging by achieving high-fidelity, identity-consistent age transformations through a novel diffusion-based framework with adaptive attention mechanisms.

Abstract: Face aging, an ill-posed problem shaped by environmental and genetic factors, is vital in entertainment, forensics, and digital archiving, where realistic age transformations must preserve both identity and visual realism. However, existing works relying on numerical age representations overlook the interplay of biological and contextual cues. Despite recent progress, face aging models still struggle with identity preservation in wide age transformations, while static attention and optimization-heavy inversion in diffusion limit adaptability, fine-grained control, and background consistency. To address these challenges, we propose Face Time Traveller (FaceTT), a diffusion-based framework that achieves high-fidelity, identity-consistent age transformation. We introduce a Face-Attribute-Aware Prompt Refinement strategy that encodes intrinsic (biological) and extrinsic (environmental) aging cues for context-aware conditioning. A tuning-free Angular Inversion method efficiently maps real faces into the diffusion latent space for fast and accurate reconstruction. Moreover, an Adaptive Attention Control mechanism dynamically balances cross-attention for semantic aging cues and self-attention for structural and identity preservation. Extensive experiments on benchmark datasets and an in-the-wild test set demonstrate that FaceTT achieves superior identity retention, background preservation, and aging realism over state-of-the-art (SOTA) methods.

[149] CMSA-Net: Causal Multi-scale Aggregation with Adaptive Multi-source Reference for Video Polyp Segmentation

Tong Wang, Yaolei Qi, Siwen Wang, Imran Razzak, Guanyu Yang, Yutong Xie

Main category: cs.CV

TL;DR: CMSA-Net is a video polyp segmentation framework that uses causal multi-scale aggregation and dynamic multi-source reference strategies to improve segmentation accuracy while maintaining real-time performance for clinical applications.

Details

Motivation: Video polyp segmentation (VPS) is challenging due to polyps looking similar to surrounding mucosa (weak semantic discrimination) and large changes in polyp position/scale across frames, making stable and accurate segmentation difficult.

Method: Proposes CMSA-Net with: 1) Causal Multi-scale Aggregation (CMA) module that gathers semantic information from multiple historical frames at different scales using causal attention to ensure temporal feature propagation follows strict time order, and 2) Dynamic Multi-source Reference (DMR) strategy that adaptively selects informative and reliable reference frames based on semantic separability and prediction confidence.

Result: Extensive experiments on SUN-SEG dataset demonstrate state-of-the-art performance with favorable balance between segmentation accuracy and real-time clinical applicability.

Conclusion: CMSA-Net provides an effective solution for video polyp segmentation that addresses challenges of weak semantic discrimination and temporal instability while maintaining efficiency for real-time clinical use.

Abstract: Video polyp segmentation (VPS) is an important task in computer-aided colonoscopy, as it helps doctors accurately locate and track polyps during examinations. However, VPS remains challenging because polyps often look similar to surrounding mucosa, leading to weak semantic discrimination. In addition, large changes in polyp position and scale across video frames make stable and accurate segmentation difficult. To address these challenges, we propose a robust VPS framework named CMSA-Net. The proposed network introduces a Causal Multi-scale Aggregation (CMA) module to effectively gather semantic information from multiple historical frames at different scales. By using causal attention, CMA ensures that temporal feature propagation follows strict time order, which helps reduce noise and improve feature reliability. Furthermore, we design a Dynamic Multi-source Reference (DMR) strategy that adaptively selects informative and reliable reference frames based on semantic separability and prediction confidence. This strategy provides strong multi-frame guidance while keeping the model efficient for real-time inference. Extensive experiments on the SUN-SEG dataset demonstrate that CMSA-Net achieves state-of-the-art performance, offering a favorable balance between segmentation accuracy and real-time clinical applicability.
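
The role of causal attention can be shown with a minimal sketch in which each frame attends only to itself and to earlier frames. It assumes one feature vector per frame and plain scaled dot-product attention; the function name `causal_attention` is illustrative, and the multi-scale aggregation of the CMA module is not modelled.

```python
import math

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where frame t sees only frames 0..t."""
    dim = len(queries[0])
    outputs = []
    for t in range(len(queries)):
        # Scores are computed for past and current frames only (strict time order).
        scores = [
            sum(q * k for q, k in zip(queries[t], keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        peak = max(scores)
        weights = [math.exp(x - peak) for x in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        outputs.append([
            sum(weights[s] * values[s][i] for s in range(t + 1))
            for i in range(len(values[0]))
        ])
    return outputs
```

Because future frames never enter the sum, changing a later frame leaves the outputs of all earlier frames untouched, which is what makes the scheme usable for streaming inference.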

[150] Reflectance Multispectral Imaging for Soil Composition Estimation and USDA Texture Classification

G. A. S. L. Ranasinghe, J. A. S. T. Jayakody, M. C. L. De Silva, G. Thilakarathne, G. M. R. I. Godaliyadda, H. M. V. R. Herath, M. P. B. Ekanayake, S. K. Navaratnarajah

Main category: cs.CV

TL;DR: Multispectral imaging system with machine learning for accurate, non-destructive soil texture classification using 13 spectral bands (365-940 nm) to predict USDA texture classes with over 99% accuracy.

Details

Motivation: Current soil texture analysis methods are slow, labor-intensive, and costly, limiting field-scale deployment for agriculture and geotechnical applications.

Method: Developed cost-effective in-house MSI device capturing 13 spectral bands; used regression models to estimate clay/silt/sand percentages and direct/indirect classification via USDA texture triangle.

Result: Achieved R² up to 0.99 for composition prediction and over 99% accuracy for texture classification on mixture data.

Conclusion: MSI with data-driven modeling provides accurate, non-destructive, field-deployable soil texture characterization suitable for geotechnical screening and precision agriculture.

Abstract: Soil texture is a foundational attribute that governs water availability and erosion in agriculture, as well as load-bearing capacity, deformation response, and shrink-swell risk in geotechnical engineering. Yet texture is still typically determined by slow and labour-intensive laboratory particle-size tests, while many sensing alternatives are either costly or too coarse to support routine field-scale deployment. This paper proposes a robust and field-deployable multispectral imaging (MSI) system and machine learning framework for predicting soil composition and the United States Department of Agriculture (USDA) texture classes. The proposed system uses a cost-effective in-house MSI device operating from 365 nm to 940 nm to capture thirteen spectral bands, which effectively capture the spectral properties of soil texture. Regression models use the captured spectral properties to estimate clay, silt, and sand percentages, while a direct classifier predicts one of the twelve USDA textural classes. Indirect classification is obtained by mapping the regressed compositions to texture classes via the USDA soil texture triangle. The framework is evaluated on mixture data obtained by mixing clay, silt, and sand in varying proportions, using the USDA classification triangle as a basis. Experimental results show that the proposed approach achieves a coefficient of determination $R^2$ up to 0.99 for composition prediction and over 99% accuracy for texture classification. These findings indicate that MSI combined with data-driven modeling can provide accurate, non-destructive, and field-deployable soil texture characterization suitable for geotechnical screening and precision agriculture.
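
For reference, the coefficient of determination quoted above follows the usual definition, one minus the ratio of residual to total sum of squares. The sketch below assumes plain lists of measured and predicted percentages; the name `r2_score` is illustrative, and no values from the paper are reproduced.

```python
def r2_score(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are equal")
    return 1.0 - ss_res / ss_tot
```

A value of 1.0 means perfect prediction, while 0.0 means the model does no better than always predicting the mean composition.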

[151] A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

Chong Wang, Yabin Zhang, Yunhe Gao, Maya Varma, Clemence Mottez, Faidra Patsatzi, Jiaming Liu, Jin Long, Jean-Benoit Delbrouck, Sergios Gatidis, Akshay S. Chaudhari, Curtis P. Langlotz

Main category: cs.CV

TL;DR: CheXficient is a chest X-ray foundation model that uses active data curation to achieve comparable performance to full-data models while using only 22.7% of data and 27.3% of compute, addressing redundancy and class imbalance in medical imaging datasets.

Details

Motivation: The paper addresses two critical challenges in medical imaging foundation models: 1) large-scale medical datasets contain substantial redundancy and severe class imbalance that bias representation learning, and 2) indiscriminate training regardless of data quality heterogeneity incurs computational inefficiency. The authors propose that active data curation can be a cost-effective alternative to brute-force dataset enlargement.

Method: CheXficient is a chest X-ray foundation model that selectively prioritizes informative training samples during pretraining. It uses active, principled data curation to identify and focus on the most valuable training examples, pretraining on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget.

Result: CheXficient achieves comparable or superior performance to its full-data counterpart and other large-scale pretrained models across 20 individual benchmarks spanning 5 task types. The model systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions.

Conclusion: Active data curation during pretraining serves as a viable, cost-effective alternative to brute-force dataset enlargement for medical vision-language foundation models. The work offers practical insights into data and computation demands for efficient pretraining and downstream adaptation in medical imaging.

Abstract: Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a “scale-at-all-costs” paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieves comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.
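
The summary does not specify CheXficient's selection criterion, so the following is only a generic sketch of budgeted curation that favours under-represented groups. It assumes each sample already carries a cluster label and picks samples round-robin across clusters; `curate`, `cluster_ids`, and `budget_fraction` are hypothetical names.

```python
from collections import defaultdict

def curate(cluster_ids, budget_fraction):
    """Pick a fixed budget of sample indices, cycling over clusters so that
    small (rare) clusters are fully used before large ones dominate."""
    budget = min(len(cluster_ids), int(len(cluster_ids) * budget_fraction))
    groups = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        groups[cluster].append(index)
    queues = [list(reversed(members)) for members in groups.values()]
    selected = []
    while len(selected) < budget:
        for queue in queues:
            if queue and len(selected) < budget:
                selected.append(queue.pop())
    return sorted(selected)

# Toy dataset: 90 samples of a common pattern and 10 of a rare one.
ids = ["common"] * 90 + ["rare"] * 10
subset = curate(ids, 0.227)
```

With this toy input the subset keeps every rare sample while discarding most of the redundant common ones, which mirrors the qualitative behaviour the abstract reports.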

Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye

Main category: cs.CV

TL;DR: DPE is a diagnostic-driven progressive evolution framework for continual training of Large Multimodal Models that uses failure diagnosis to steer targeted data generation and reinforcement learning.

Details

Motivation: Current LMM training relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. The authors are motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice.

Method: DPE uses a spiral loop where diagnosis steers data generation and reinforcement. It has two key components: 1) multiple agents annotate and quality-control massive unlabeled multimodal data using tools like web search and image editing, and 2) DPE attributes failures to specific weaknesses, dynamically adjusts data mixture, and guides agents to generate weakness-focused data for targeted reinforcement.

Result: Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions.

Conclusion: DPE provides a scalable framework for continual improvement of multimodal models through diagnostic-driven, targeted reinforcement learning that addresses specific weaknesses identified through failure analysis.

Abstract: As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.
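
One simple way to picture the "diagnosis adjusts the data mixture" step is to turn per-skill failure rates into sampling weights. This is a sketch under assumptions, not DPE's actual rule: the skill names, the `floor` that keeps every skill represented, and the function `update_mixture` are all hypothetical.

```python
def update_mixture(failures, attempts, floor=0.05):
    """Map diagnosed failure counts to data-mixture weights that sum to 1.

    Skills that fail more often receive more of the next round's data, while
    `floor` guarantees that no skill is dropped from the mixture entirely.
    """
    skills = list(attempts)
    rates = {s: failures.get(s, 0) / attempts[s] for s in skills}
    total = sum(rates.values())
    if total == 0:
        return {s: 1.0 / len(skills) for s in skills}  # nothing to fix: stay uniform
    spare = 1.0 - floor * len(skills)
    return {s: floor + spare * rates[s] / total for s in skills}

weights = update_mixture(
    failures={"ocr": 10, "counting": 40, "charts": 50},
    attempts={"ocr": 100, "counting": 100, "charts": 100},
)
```

Re-running the diagnosis after each training round and calling the function again gives the iterative loop described in the abstract.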

[153] SO3UFormer: Learning Intrinsic Spherical Features for Rotation-Robust Panoramic Segmentation

Qinfeng Zhu, Yunxi Jiang, Lei Fan

Main category: cs.CV

TL;DR: SO3UFormer: A rotation-robust spherical Transformer for panoramic semantic segmentation that maintains performance under arbitrary 3D rotations by learning intrinsic spherical features independent of coordinate frames.

Details

Motivation: Real-world panoramic captures often deviate from gravity-aligned assumptions due to camera motions, causing standard spherical Transformers to overfit latitude cues and fail catastrophically under 3D reorientations.

Method: Three geometric pillars: (1) intrinsic feature formulation removing absolute latitude encoding, (2) quadrature-consistent spherical attention for non-uniform sampling, (3) gauge-aware relative positional encoding using tangent-plane angles and discrete gauge pooling. Plus index-based resampling and SO(3)-consistency regularization.

Result: SO3UFormer achieves 72.03 mIoU on Pose35 dataset (random ±35° rotations) and retains 70.67 mIoU under full SO(3) rotations, while baseline SphereUFormer drops from 67.53 mIoU to 25.26 mIoU.

Conclusion: The proposed rotation-robust architecture successfully learns intrinsic spherical features that are less sensitive to coordinate frames, enabling stable performance under arbitrary 3D rotations.

Abstract: Panoramic semantic segmentation models are typically trained under a strict gravity-aligned assumption. However, real-world captures often deviate from this canonical orientation due to unconstrained camera motions, such as the rotational jitter of handheld devices or the dynamic attitude shifts of aerial platforms. This discrepancy causes standard spherical Transformers to overfit global latitude cues, leading to performance collapse under 3D reorientations. To address this, we introduce SO3UFormer, a rotation-robust architecture designed to learn intrinsic spherical features that are less sensitive to the underlying coordinate frame. Our approach rests on three geometric pillars: (1) an intrinsic feature formulation that decouples the representation from the gravity vector by removing absolute latitude encoding; (2) quadrature-consistent spherical attention that accounts for non-uniform sampling densities; and (3) a gauge-aware relative positional mechanism that encodes local angular geometry using tangent-plane projected angles and discrete gauge pooling, avoiding reliance on global axes. We further use index-based spherical resampling together with a logit-level SO(3)-consistency regularizer during training. To rigorously benchmark robustness, we introduce Pose35, a dataset variant of Stanford2D3D perturbed by random rotations within $\pm 35^\circ$. Under the extreme test of arbitrary full SO(3) rotations, existing SOTAs fail catastrophically: the baseline SphereUFormer drops from 67.53 mIoU to 25.26 mIoU. In contrast, SO3UFormer demonstrates remarkable stability, achieving 72.03 mIoU on Pose35 and retaining 70.67 mIoU under full SO(3) rotations.
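
One plausible reading of "quadrature-consistent" attention is that each key is weighted by the solid angle its sample covers, so that densely sampled polar regions do not dominate the softmax. The sketch below assumes an equirectangular grid, where that area is proportional to the cosine of latitude; the paper's exact formulation is not given in the summary, and `quadrature_softmax` is an illustrative name.

```python
import math

def quadrature_softmax(scores, latitudes_deg):
    """Softmax over keys with each term weighted by cos(latitude).

    On an equirectangular grid the solid angle covered by a pixel shrinks
    towards the poles, so the cosine acts as a quadrature weight.
    """
    areas = [math.cos(math.radians(lat)) for lat in latitudes_deg]
    peak = max(scores)
    unnormalised = [a * math.exp(s - peak) for a, s in zip(areas, scores)]
    norm = sum(unnormalised)
    return [u / norm for u in unnormalised]
```

With equal scores, a key on the equator receives twice the weight of a key at 60 degrees latitude, matching the ratio of the areas they represent.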

[154] Towards Multimodal Domain Generalization with Few Labels

Hongzhao Li, Hao Dong, Hualei Wan, Shupan Li, Mingliang Xu, Muhammad Haris Khan

Main category: cs.CV

TL;DR: A framework for Semi-Supervised Multimodal Domain Generalization that handles few labeled samples, domain shifts, and missing modalities through consensus-driven consistency, disagreement-aware regularization, and cross-modal prototype alignment.

Details

Motivation: Existing approaches fail to address the combined challenges of multimodal domain generalization with limited labeled data: multimodal DG methods can't use unlabeled data, semi-supervised multimodal learning ignores domain shifts, and semi-supervised DG methods are single-modality only.

Method: Three key components: 1) Consensus-Driven Consistency Regularization for reliable pseudo-labels via confident fused-unimodal consensus, 2) Disagreement-Aware Regularization to utilize ambiguous non-consensus samples, and 3) Cross-Modal Prototype Alignment for domain- and modality-invariant representations with robustness to missing modalities via cross-modal translation.

Result: Established first SSMDG benchmarks and demonstrated consistent outperformance over strong baselines in both standard and missing-modality scenarios.

Conclusion: Proposed unified framework effectively addresses the SSMDG problem, providing robust multimodal learning with few labeled samples across domains, with practical applications in data-efficient multimodal systems.

Abstract: Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code are available at https://github.com/lihongzhao99/SSMDG.
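
The consensus idea behind the pseudo-labels can be sketched as follows, assuming class-probability lists for the fused prediction and for each unimodal branch. The confidence threshold of 0.9 and the names `consensus_pseudo_label` and `argmax` are assumptions; the authors' exact acceptance rule is not given in the summary.

```python
def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)

def consensus_pseudo_label(fused_probs, unimodal_probs, threshold=0.9):
    """Return a pseudo-label only when the fused prediction is confident and
    every unimodal branch predicts the same class; otherwise return None."""
    label = argmax(fused_probs)
    if fused_probs[label] < threshold:
        return None  # fused prediction is not confident enough
    if any(argmax(p) != label for p in unimodal_probs):
        return None  # modalities disagree: treat the sample as non-consensus
    return label
```

Samples for which the function returns `None` are the ambiguous non-consensus cases that, per the abstract, are handled by a separate disagreement-aware regulariser rather than discarded.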

[155] Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins

Haofan Wu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Le Zhang

Main category: cs.CV

TL;DR: A foundational ECG-driven generative framework (Chain of Flow) reconstructs full 4D cardiac structure and motion from a single cardiac cycle using multimodal ECG and cine-CMR data, enabling patient-specific virtual hearts for downstream cardiac digital twin applications.

Details

Motivation: Current cardiac digital twin frameworks are limited to task-specific predictors rather than building patient-specific, manipulable virtual hearts that can reconstruct individualized cardiac anatomy and physiology from multimodal signals.

Method: Chain of Flow integrates cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics, enabling reconstruction of full 4D cardiac structure and motion from a single cardiac cycle.

Result: The method demonstrates accurate recovery of cardiac anatomy, chamber-wise function, and dynamic motion patterns across diverse cohorts, supporting downstream tasks like volumetry, regional function analysis, and virtual cine synthesis.

Conclusion: COF transforms cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual hearts by enabling full 4D organ reconstruction directly from ECG signals.

Abstract: A clinically actionable Cardiac Digital Twin (CDT) should reconstruct individualised cardiac anatomy and physiology, update its internal state from multimodal signals, and enable a broad range of downstream simulations beyond isolated tasks. However, existing CDT frameworks remain limited to task-specific predictors rather than building a patient-specific, manipulable virtual heart. In this work, we introduce Chain of Flow (COF), a foundational ECG-driven generative framework that reconstructs full 4D cardiac structure and motion from a single cardiac cycle. The method integrates cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics. We evaluate Chain of Flow on diverse cohorts and demonstrate accurate recovery of cardiac anatomy, chamber-wise function, and dynamic motion patterns. The reconstructed 4D hearts further support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. By enabling full 4D organ reconstruction directly from ECG, COF transforms cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual hearts. Code will be released after review.

[156] $ϕ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren, Bhiksha Raj, Khoa Luu

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Abstract: Unavailable. The summary request for arXiv 2602.22601 returned HTTP 429 (rate limited).

Federico Nesti, Gianluca D’Amico, Mauro Marinoni, Giorgio Buttazzo

Main category: cs.CV

TL;DR: A multi-modal augmented reality framework for railway obstacle detection that integrates photorealistic virtual objects into real-world railway sequences using Unreal Engine 5, LiDAR, and INS/GNSS data to create the OSDaR-AR dataset.

Details

Motivation: Railway applications lack high-quality annotated data for safety-critical tasks like obstacle detection. Existing simulators suffer from sim-to-real gaps, while simple image-masking techniques lack spatio-temporal coherence for realistic augmented scenes.

Method: Multi-modal augmented reality framework using Unreal Engine 5 that integrates LiDAR point-clouds and INS/GNSS data for accurate object placement and temporal stability across RGB frames. Includes segmentation-based refinement of INS/GNSS data to improve realism.

Result: Created OSDaR-AR dataset with carefully designed augmented sequences that provide realistic, temporally coherent railway scenes with accurate object placement and dimensions for perception system development.

Conclusion: The framework successfully bridges the sim-to-real gap for railway perception tasks and provides a valuable public dataset (OSDaR-AR) to support development of next-generation railway perception systems.

Abstract: Although deep learning has significantly advanced the perception capabilities of intelligent transportation systems, railway applications continue to suffer from a scarcity of high-quality, annotated data for safety-critical tasks like obstacle detection. While photorealistic simulators offer a solution, they often struggle with the “sim-to-real” gap; conversely, simple image-masking techniques lack the spatio-temporal coherence required to obtain augmented single- and multi-frame scenes with the correct appearance and dimensions. This paper introduces a multi-modal augmented reality framework designed to bridge this gap by integrating photorealistic virtual objects into real-world railway sequences from the OSDaR23 dataset. Utilizing Unreal Engine 5 features, our pipeline leverages LiDAR point-clouds and INS/GNSS data to ensure accurate object placement and temporal stability across RGB frames. This paper also proposes a segmentation-based refinement strategy for INS/GNSS data to significantly improve the realism of the augmented sequences, as confirmed by the comparative study presented in the paper. Carefully designed augmented sequences are collected to produce OSDaR-AR, a public dataset designed to support the development of next-generation railway perception systems. The dataset is available at the following page: https://syndra.retis.santannapisa.it/osdarar.html

[158] DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion

Tao Huang, Jiayang Meng, Xu Yang, Chen Hou, Hong Chen

Main category: cs.CV

TL;DR: Summary unavailable. The arXiv API request for 2602.22610 was rate limited (HTTP 429), so no motivation, method, result, conclusion, or abstract could be retrieved.

Runwei Guan, Shaofeng Liang, Ningwei Ouyang, Weichen Fei, Shanliang Yao, Wei Dai, Chenhao Ge, Penglei Sun, Xiaohui Zhu, Tao Huang, Ryan Wen Liu, Hui Xiong

Main category: cs.CV

TL;DR: WaterVideoQA: First large-scale maritime VideoQA benchmark with 3,029 clips across 6 waterway categories, plus NaviMind multi-agent neuro-symbolic system for open-ended maritime reasoning.

Details

Motivation: Autonomous navigation lacks knowledge-driven interactive environmental cognition, especially critical in maritime navigation where bridging visual perception with complex cognitive reasoning is essential for safe ASV maneuvers.

Method: 1) WaterVideoQA benchmark with 3,029 video clips across 6 waterway categories, testing across 5-tier hierarchical cognitive framework; 2) NaviMind multi-agent neuro-symbolic system using Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification.

Result: The framework significantly outperforms existing baselines on the WaterVideoQA benchmark, which the authors present as a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.

Conclusion: Presents comprehensive maritime VideoQA benchmark and neuro-symbolic reasoning system that transitions ASVs from pattern matching to regulation-compliant, interpretable decision-making.

Abstract: While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.

[160] MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

Wenhui Tan, Xiaoyi Yu, Jiaze Li, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan

Main category: cs.CV

TL;DR: MSJoE is a framework that jointly evolves an MLLM and a lightweight key-frame sampler for efficient long-form video understanding, using query reasoning and reinforcement learning to select informative frames.

Details

Motivation: Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models, as processing all frames is computationally expensive and many frames are irrelevant to specific questions.

Method: MSJoE first reasons out diverse visual perspective queries relevant to the question, uses a frozen CLIP model to create query-frame similarity matrices, then a lightweight sampler predicts key-frame sampling weights to select compact informative frames, with both MLLM and sampler jointly optimized through reinforcement learning.

Result: MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method on benchmarks including VideoMME, LongVideoBench, LVBench, and MLVU.

Conclusion: The joint evolution of MLLM and sampler through reinforcement learning enables effective co-adaptation of query reasoning, frame sampling, and key-frame understanding for efficient long-form video QA.

Abstract: Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon the key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.
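
Editor's sketch: the frame-selection step described above (query-frame similarity matrix, then sampling weights, then a compact set of key frames) might look roughly like the following. This is a minimal illustration under stated assumptions, not the authors' implementation: toy vectors stand in for CLIP embeddings, the learned sampler is replaced by a simple max-over-queries score followed by a softmax, and the names `similarity_matrix` and `select_key_frames` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(query_embs, frame_embs):
    """Query-frame similarity matrix: rows are queries, columns are frames."""
    return [[cosine(q, f) for f in frame_embs] for q in query_embs]

def select_key_frames(sim, k, temperature=0.1):
    """Assumed stand-in for the learned sampler.

    Scores each frame by its best-matching query, turns the scores into
    softmax weights, and keeps the k highest-weighted frames in temporal order.
    """
    n_frames = len(sim[0])
    scores = [max(row[j] for row in sim) for j in range(n_frames)]
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    top = sorted(range(n_frames), key=lambda j: weights[j], reverse=True)[:k]
    return sorted(top), weights

# Toy embeddings (assumption): two queries, four frames, three dimensions.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.1, 0.9, 0.0], [0.3, 0.3, 0.9]]

sim = similarity_matrix(queries, frames)
key_frames, weights = select_key_frames(sim, k=2)
print(key_frames)
```

In the paper the sampler is trained jointly with the MLLM through reinforcement learning; the fixed scoring rule here only shows where such a learned component would sit in the pipeline.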

[161] pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation

Shentong Mo, Xufang Luo, Dongsheng Li

Main category: cs.CV

TL;DR: pMoE: A Mixture-of-Experts prompt tuning method that integrates multiple domain experts’ knowledge through expert-specific prompt tokens and dynamic dispatching for enhanced visual adaptation across diverse tasks.

Details

Motivation: Current prompt tuning methods typically use knowledge from a single pre-trained model, missing opportunities to leverage synergies from integrating diverse domain knowledge. The authors aim to create a more versatile adaptation method that can combine expertise from multiple domains.

Method: Proposes pMoE with expert-specific prompt tokens and a learnable dispatcher that dynamically routes tokens at different prompt layers. This allows the model to optimally combine contributions from multiple domain experts during adaptation.

Result: Extensive experiments across 47 adaptation tasks (classification and segmentation in general and medical domains) show pMoE achieves superior performance with large improvements and offers optimal trade-off between computational efficiency and adaptation effectiveness.

Conclusion: pMoE successfully leverages multiple expert domains through specialized prompt tokens and dynamic dispatching, significantly enhancing model versatility and applicability across a broad spectrum of visual adaptation tasks.

Abstract: Parameter-efficient fine-tuning has demonstrated promising results across various visual adaptation tasks, such as classification and segmentation. Typically, prompt tuning techniques have harnessed knowledge from a single pre-trained model, whether from a general or a specialized medical domain. However, this approach overlooks the potential synergies that could arise from integrating diverse domain knowledge within the same tuning process. In this work, we propose a novel Mixture-of-Experts prompt tuning method called pMoE, which leverages the strengths of multiple expert domains through expert-specialized prompt tokens and a learnable dispatcher, effectively combining their expertise in a unified model framework. Our pMoE introduces expert-specific prompt tokens and utilizes a dynamic token dispatching mechanism at various prompt layers to optimize the contribution of each domain expert during the adaptation phase. By incorporating domain knowledge from diverse experts, the proposed pMoE significantly enhances the model’s versatility and applicability to a broad spectrum of tasks. We conduct extensive experiments across 47 adaptation tasks, including both classification and segmentation in general and medical domains. The results demonstrate that our pMoE not only achieves superior performance by a large margin but also offers an optimal trade-off between computational efficiency and adaptation effectiveness compared to existing methods.

[162] Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

Julian Ziegler, Daniel Matthes, Finn Gerdts, Patrick Frenzel, Torsten Warnke, Matthias Englert, Tina Koevari, Mirco Fuchs

Main category: cs.CV

TL;DR: Video-based framework for automated analysis of canoe sprint performance metrics using computer vision and tracking techniques

Details

Motivation: GPS is gold standard for canoe sprint analysis but has limited availability; need automated video-based solutions to provide coaches with performance metrics without on-boat sensors

Method: Uses YOLOv8 for buoy/athlete detection, homography estimation from known buoy grid, U-net for boat tip calibration, optical flow tracking for multi-athlete boats, and stroke rate extraction from pose estimations or bounding boxes

Result: Velocity RRMSE of 0.020 ± 0.011 (rho = 0.956) and stroke rate RRMSE of 0.022 ± 0.024 (rho = 0.932) against GPS data from elite competitions

Conclusion: Provides highly accurate automated feedback for coaches across all sprint disciplines and distances without requiring sensors or manual annotation

Abstract: Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalize the estimation of the boat position by learning a boat-specific athlete offset using a U-net based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity RRMSE of 0.020 ± 0.011 (rho = 0.956) and a stroke rate RRMSE of 0.022 ± 0.024 (rho = 0.932). The methods provide coaches with highly accurate, automated feedback without requiring on-boat sensors or manual annotation.
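
Editor's sketch: the abstract reports RRMSE and rho without defining them. The snippet below shows one common reading, assuming RRMSE is the root-mean-square error divided by the mean of the GPS reference and rho is the Pearson correlation; both definitions are assumptions, and the velocity values are made-up toy data rather than results from the paper.

```python
import math

def rrmse(estimate, reference):
    """Relative RMSE: RMSE divided by the mean of the reference (assumed definition)."""
    n = len(reference)
    rmse = math.sqrt(sum((e - r) ** 2 for e, r in zip(estimate, reference)) / n)
    return rmse / (sum(reference) / n)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Toy velocity profiles in m/s (illustrative values only).
gps_velocity = [5.0, 5.2, 5.1, 4.9, 4.8]
video_velocity = [5.1, 5.1, 5.2, 4.8, 4.9]

velocity_rrmse = rrmse(video_velocity, gps_velocity)
velocity_rho = pearson(video_velocity, gps_velocity)
print(round(velocity_rrmse, 3), round(velocity_rho, 3))
```

Under this reading, an RRMSE of 0.020 would correspond to a typical error of about 2% of the mean reference velocity.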

[163] Cross-Task Benchmarking of CNN Architectures

Kamal Sherawat, Vikrant Bhati

Main category: cs.CV

TL;DR: Comparative study of dynamic CNN variants (vanilla, hard attention, soft attention with local/global features, ODConv) showing attention mechanisms and dynamic convolution outperform conventional CNNs in accuracy, efficiency, and computational performance across image classification, segmentation, and time series tasks.

Details

Motivation: To investigate and compare different dynamic convolutional neural network architectures and attention mechanisms to improve performance across multiple tasks including image classification, segmentation, and time series analysis, moving beyond conventional static CNNs.

Method: Based on ResNet-18 architecture, compared five CNN variants: vanilla CNN, hard attention-based CNN, soft attention-based CNN with local (pixel-wise) and global (image-wise) feature attention, and omni-directional CNN (ODConv). Evaluated on Tiny ImageNet, Pascal VOC, and UCR Time Series Classification Archive datasets.

Result: Attention mechanisms and dynamic convolution methods consistently exceeded conventional CNNs in accuracy, efficiency, and computational performance. ODConv was particularly effective on morphologically complex images by dynamically adjusting to varying spatial patterns. Dynamic CNNs enhanced feature representation and cross-task generalization through adaptive kernel modulation.

Conclusion: Dynamic CNNs with attention mechanisms provide superior performance over conventional CNNs across multiple tasks. The study offers perspectives on advanced CNN design for multiplexed data modalities and indicates promising directions in neural network engineering, particularly for handling complex spatial patterns.

Abstract: This project provides a comparative study of dynamic convolutional neural networks (CNNs) for various tasks, including image classification, segmentation, and time series analysis. Based on the ResNet-18 architecture, we compare five variants of CNNs: the vanilla CNN, the hard attention-based CNN, the soft attention-based CNN with local (pixel-wise) and global (image-wise) feature attention, and the omni-directional CNN (ODConv). Experiments on Tiny ImageNet, Pascal VOC, and the UCR Time Series Classification Archive illustrate that attention mechanisms and dynamic convolution methods consistently exceed conventional CNNs in accuracy, efficiency, and computational performance. ODConv was especially effective on morphologically complex images by being able to dynamically adjust to varying spatial patterns. Dynamic CNNs enhanced feature representation and cross-task generalization through adaptive kernel modulation. This project provides perspectives on advanced CNN design architecture for multiplexed data modalities and indicates promising directions in neural network engineering.

[164] ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

Jiayu Chen, Ruoyu Lin, Zihao Zheng, Jingxin Li, Maoliang Li, Guojie Luo, Xiang Chen

Main category: cs.CV

TL;DR: ToProVAR is an optimization framework for Visual Autoregressive models that uses attention entropy analysis to identify sparsity patterns across token, layer, and scale dimensions, enabling aggressive acceleration while preserving semantic fidelity.

Details

Motivation: VAR models suffer from efficiency bottlenecks in later generation stages, and existing optimization approaches like FastVAR and SkipVAR rely on heuristic skipping strategies that may compromise quality. There's a need for a more principled optimization method that can accelerate generation while maintaining semantic fidelity.

Method: The method analyzes attention entropy to characterize semantic projections across model dimensions, identifies sparsity patterns in token, layer, and scale dimensions, and develops fine-grained optimization strategies tailored to these patterns. This enables selective skipping or simplification of computations where semantic content is sparse.

Result: ToProVAR achieves up to 3.4x acceleration on Infinity-2B and Infinity-8B models with minimal quality loss, outperforming traditional methods in both efficiency and quality while preserving semantic fidelity and fine details.

Conclusion: The attention entropy-based analysis provides a principled foundation for VAR optimization, enabling aggressive acceleration while maintaining quality, effectively addressing limitations of heuristic approaches like FastVAR and SkipVAR.

Abstract: Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.
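
Editor's sketch: attention entropy, the quantity the abstract builds on, is the Shannon entropy of each attention distribution. The snippet below computes a normalized version per query token. How ToProVAR maps entropy to skipping decisions is not stated in the abstract, so the threshold rule at the end (flagging diffuse, high-entropy rows) is purely an illustrative assumption, as are the toy logits and the name `THRESHOLD`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of attention logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def normalized_entropy(probs):
    """Entropy divided by its maximum, log(n), so values lie in [0, 1]."""
    n = len(probs)
    return attention_entropy(probs) / math.log(n) if n > 1 else 0.0

# Toy attention logits (assumption): one row per query token.
attn_logits = [
    [8.0, 0.0, 0.0, 0.0],   # sharply peaked attention
    [1.0, 1.0, 1.0, 1.0],   # uniform attention
    [2.0, 1.0, 0.0, -1.0],  # moderately spread attention
]

entropies = [normalized_entropy(softmax(row)) for row in attn_logits]

# Hypothetical rule: flag rows whose attention is close to uniform.
THRESHOLD = 0.9
diffuse_tokens = [i for i, h in enumerate(entropies) if h >= THRESHOLD]
print(diffuse_tokens)
```

The same per-row statistic can be averaged over heads, layers, or generation scales to compare sparsity along each of the three dimensions the abstract names.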

[165] OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

Junuk Cha, Jihyeon Kim, Han-Mu Park

Main category: cs.CV

TL;DR: OpenFS is an open-source approach for fingerspelling recognition and synthesis that addresses signing-hand ambiguity, peaky CTC loss behavior, and OOV problems through implicit hand detection, monotonic alignment loss, and pose sequence generation.

Details

Motivation: Automatic fingerspelling recognition is crucial for bridging communication gaps between Deaf and hearing communities, but faces challenges including signing-hand ambiguity, inappropriate training losses (CTC's peaky behavior), and out-of-vocabulary (OOV) problems. Existing methods rely on explicit hand detection and CTC loss, leading to recognition failures.

Method: Proposes OpenFS with: 1) Multi-hand-capable recognizer with dual-level positional encoding and signing-hand focus (SF) loss for implicit hand detection; 2) Monotonic alignment (MA) loss replacing CTC to enforce temporal ordering; 3) Frame-wise letter-conditioned generator for synthesizing realistic fingerspelling pose sequences for OOV words, enabling creation of FSNeo benchmark.

Result: Achieves state-of-the-art performance in fingerspelling recognition, validates effectiveness of both recognizer and generator components through comprehensive experiments, and provides open-source implementation with data.

Conclusion: OpenFS successfully addresses key challenges in fingerspelling recognition through implicit hand detection, improved loss functions, and synthesis capabilities for OOV words, advancing sign language technology with practical applications for Deaf-hearing communication.

Abstract: Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that constrains the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. Code and data are available at: https://github.com/JunukCha/OpenFS.

[166] MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis

Feng Guo, Jiaxiang Liu, Yang Li, Qianqian Shi, Mingkun Xu

Main category: cs.CV

TL;DR: MM-NeuroOnco is a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, featuring 24,726 MRI slices with 200,000 semantically enriched multimodal instructions, addressing limitations in existing datasets for clinically interpretable diagnostic reasoning.

Details

Motivation: Existing public datasets for brain tumor diagnosis lack rich annotations and diagnostic semantics needed for clinically interpretable reasoning. There's a gap between lesion detection and generating meaningful diagnostic explanations grounded in imaging manifestations.

Method: Developed a multi-model collaborative pipeline for automated medical information completion and quality control to generate diagnosis-related semantics beyond mask-only annotations. Created MM-NeuroOnco dataset and MM-NeuroOnco-Bench evaluation benchmark with rejection-aware setting to reduce biases.

Result: Even the strongest baseline (Gemini 3 Flash) achieved only 41.88% accuracy on diagnosis-related questions. NeuroOnco-GPT, fine-tuned on MM-NeuroOnco, achieved a 27% absolute accuracy improvement on diagnostic questions, demonstrating the dataset’s effectiveness.

Conclusion: MM-NeuroOnco addresses critical limitations in existing datasets and enables significant improvements in multimodal brain tumor diagnostic understanding, advancing clinically grounded diagnostic reasoning capabilities.

Abstract: Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi-model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis-related semantics beyond mask-only annotations. Building upon this dataset, we further construct MM-NeuroOnco-Bench, a manually annotated evaluation benchmark with a rejection-aware setting to reduce biases inherent in closed-ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis-related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM-NeuroOnco, we further propose NeuroOnco-GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine-tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: https://github.com/gfnnnb/MM-NeuroOnco

[167] Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn

Main category: cs.CV

TL;DR: Multi-agent framework using contrastive adjudication improves zero-shot diagnosis of visually confounded diseases in medical imaging, but performance remains insufficient for clinical use.

Details

Motivation: Most medical imaging work focuses on automating routine workflows, but there's an underexplored need to distinguish visually hard-to-separate diseases in zero-shot settings where visual features are highly confounded despite clinical differences.

Method: Introduces a multi-agent framework based on contrastive adjudication to benchmark representative agents on two imaging-only proxy diagnostic tasks: melanoma vs. atypical nevus and pulmonary edema vs. pneumonia.

Result: Experimental results show improved diagnostic performance (11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, though overall performance remains insufficient for clinical deployment.

Conclusion: While the pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios, limitations include inherent uncertainty in human annotations and absence of clinical context, limiting real-world translation.

Abstract: The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

[168] UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, Song-Hai Zhang

Main category: cs.CV

TL;DR: UCM is a novel framework that unifies long-term memory and precise camera control for video generation world models using time-aware positional encoding warping, addressing consistency and controllability issues in interactive environment simulation.

Details

Motivation: Existing video generation world models struggle with maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user inputs. Methods based on explicit 3D reconstruction compromise flexibility in unbounded scenarios, while alternative approaches lack explicit spatial correspondence, limiting controllability and consistency.

Method: UCM introduces a time-aware positional encoding warping mechanism to unify long-term memory and camera control. It uses an efficient dual-stream diffusion transformer for high-fidelity generation with reduced computational overhead. A scalable data curation strategy employs point-cloud-based rendering to simulate scene revisiting, enabling training on over 500K monocular videos.

Result: Extensive experiments on real-world and synthetic benchmarks show UCM significantly outperforms state-of-the-art methods in long-term scene consistency while achieving precise camera controllability in high-fidelity video generation.

Conclusion: UCM successfully addresses key limitations in video generation world models by providing a unified framework for long-term memory and camera control, enabling more consistent and controllable simulation of interactive environments.

Abstract: World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.

[169] SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

Camile Lendering, Erkut Akdag, Egor Bondarev

Main category: cs.CV

TL;DR: SubspaceAD is a training-free anomaly detection method that uses frozen DINOv2 features and PCA to detect anomalies via reconstruction residuals, achieving SOTA results on industrial inspection datasets.

Details

Motivation: The paper questions whether complex methods with memory banks, auxiliary datasets, or multi-modal tuning are necessary for few-shot anomaly detection given the strong feature representations of modern vision foundation models.

Method: Two-stage approach: 1) Extract patch-level features from normal images using frozen DINOv2 backbone, 2) Fit PCA to estimate low-dimensional subspace of normal variations, then detect anomalies via reconstruction residuals at inference.

Result: Achieves state-of-the-art performance in one-shot and few-shot settings: 98.0% image-level and 97.6% pixel-level AUROC on MVTec-AD, 93.3% and 98.3% on VisA, surpassing prior methods without training or memory banks.

Conclusion: Simple training-free methods using foundation model features can outperform complex approaches for few-shot anomaly detection, demonstrating the power of modern vision foundation models for industrial inspection tasks.

Abstract: Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method, that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
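
The two stages described for SubspaceAD can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code: random arrays stand in for frozen DINOv2 patch features, and the function names `fit_subspace` and `residual_scores`, the feature dimension, and the subspace size `k` are all hypothetical.

```python
import numpy as np

def fit_subspace(normal_feats, k):
    """Fit a k-dimensional PCA subspace to patch features of normal images.

    normal_feats: (n_patches, dim) array. Returns (mean, basis), basis is (k, dim).
    """
    mean = normal_feats.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(normal_feats - mean, full_matrices=False)
    return mean, vt[:k]

def residual_scores(feats, mean, basis):
    """Anomaly score per patch: squared norm of the component outside the subspace."""
    centered = feats - mean
    recon = centered @ basis.T @ basis
    return np.sum((centered - recon) ** 2, axis=1)

rng = np.random.default_rng(0)
dim, k = 16, 4  # hypothetical sizes, far smaller than real backbone features

# Stand-in for backbone features: normal patches lie near a k-dim subspace.
true_basis = np.linalg.qr(rng.normal(size=(dim, k)))[0].T  # (k, dim)
normal = rng.normal(size=(200, k)) @ true_basis + 0.01 * rng.normal(size=(200, dim))
mean, basis = fit_subspace(normal, k)

test_normal = rng.normal(size=(50, k)) @ true_basis + 0.01 * rng.normal(size=(50, dim))
test_anomalous = test_normal + rng.normal(size=(50, dim))  # off-subspace perturbation
normal_scores = residual_scores(test_normal, mean, basis)
anomalous_scores = residual_scores(test_anomalous, mean, basis)
```

Patches that leave the fitted subspace receive much larger residuals than patches that stay close to it, which is the property the anomaly score relies on.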

[170] DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis

Xinglong Luo, Ao Luo, Zhengning Wang, Yueqi Yang, Chaoyu Feng, Lei Lei, Bing Zeng, Shuaicheng Liu

Main category: cs.CV

TL;DR: DMAligner: A diffusion-based framework for image alignment using alignment-oriented view synthesis instead of optical flow warping, addressing occlusion and illumination challenges through dynamics-aware diffusion training.

Details

Motivation: Traditional optical flow-based image alignment methods struggle with occlusions and illumination variations, degrading visual quality and downstream task accuracy. The paper aims to address these limitations from a novel generation-based perspective.

Method: Proposes DMAligner with Dynamics-aware Diffusion Training for conditional image generation, including a Dynamics-aware Mask Producing (DMP) module to distinguish dynamic foreground from static background. Also creates Dynamic Scene Image Alignment (DSIA) dataset with 1,033 scenes and 30K+ image pairs.

Result: Extensive experiments show superiority on DSIA benchmarks and qualitative comparisons on widely-used video datasets. The diffusion-based approach demonstrates strong capabilities in handling challenges that classical flow-based methods struggle with.

Conclusion: DMAligner presents an effective diffusion-based alternative to traditional image alignment methods, offering improved handling of occlusions and illumination variations through generation-based view synthesis.

Abstract: Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Our code is available at https://github.com/boomluo02/DMAligner.

[171] WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang

Main category: cs.CV

TL;DR: WISER is a training-free framework for Zero-Shot Composed Image Retrieval that unifies text-to-image and image-to-image retrieval via a retrieve-verify-refine pipeline with intent and uncertainty awareness.

Details

Motivation: Existing ZS-CIR methods convert multimodal queries into single modalities (either edited captions for T2I or edited images for I2I), but each has limitations: T2I loses fine-grained visual details while I2I struggles with complex semantic modifications. There's a need to leverage their complementary strengths under diverse query intents.

Method: Proposes WISER with a “retrieve-verify-refine” pipeline: 1) Wider Search generates both edited captions and images for parallel retrieval, 2) Adaptive Fusion uses a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals and dynamic fusion for reliable ones, 3) For uncertain retrievals, generates refinement suggestions through structured self-reflection for deeper thinking in next retrieval round.

Result: Significantly outperforms previous methods across multiple benchmarks: 45% relative improvement on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably surpasses many training-dependent methods.

Conclusion: WISER demonstrates superior performance and generalization under diverse scenarios by effectively unifying T2I and I2I paradigms through intent and uncertainty awareness in a training-free framework.

Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a “retrieve-verify-refine” pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.

[172] Small Object Detection Model with Spatial Laplacian Pyramid Attention and Multi-Scale Features Enhancement in Aerial Images

Zhangjian Ji, Huijia Yan, Shaotong Qiao, Kai Feng, Wei Wei

Main category: cs.CV

TL;DR: Proposes a small object detection algorithm for aerial images using Spatial Laplacian Pyramid Attention and Multi-Scale Feature Enhancement with deformable convolutions for feature alignment.

Details

Motivation: Aerial image object detection faces challenges with small objects, dense distributions, and non-uniform layouts. Existing methods struggle with feature representation for small objects in high-resolution aerial imagery.

Method: Three key components: 1) Spatial Laplacian Pyramid Attention (SLPA) module integrated after each ResNet-50 stage to emphasize important local regions; 2) Multi-Scale Feature Enhancement Module (MSFEM) in FPN lateral connections for semantic understanding; 3) Deformable convolutions for feature alignment during FPN fusion.

Result: Experimental results on VisDrone and DOTA datasets show improved performance for small object detection in aerial images compared to original algorithms.

Conclusion: The proposed approach effectively addresses small object detection challenges in aerial imagery through attention mechanisms, multi-scale enhancement, and feature alignment techniques.

Abstract: Detecting objects in aerial images poses significant challenges, including the small size and the dense, non-uniform distribution of objects over high-resolution images, which make detection inefficient. In this paper, we propose a small object detection algorithm for aerial images based on Spatial Laplacian Pyramid Attention and Multi-Scale Feature Enhancement. First, to improve the feature representation of ResNet-50 on small objects, we present a novel Spatial Laplacian Pyramid Attention (SLPA) module, which is integrated after each stage of ResNet-50 to identify and emphasize important local regions. Second, to enhance the model’s semantic understanding and feature representation, we design a Multi-Scale Feature Enhancement Module (MSFEM), which is incorporated into the lateral connections of the C5 layer when building the Feature Pyramid Network (FPN). Finally, the feature representation quality of a traditional FPN suffers because features are not aligned when the upper and lower layers are fused. To handle this, we use deformable convolutions to align the features during the fusion of the upper and lower levels of the FPN, which helps the model detect and recognize small objects. Extensive experimental results on two benchmark datasets, VisDrone and DOTA, demonstrate that our improved model performs better for small object detection in aerial images than the original algorithm.

[173] PackUV: Packed Gaussian UV Maps for 4D Volumetric Video

Aashish Rai, Angela Xing, Anushka Agarwal, Xiaoyan Cong, Zekun Li, Tao Lu, Aayush Prakash, Srinath Sridhar

Main category: cs.CV

TL;DR: PackUV introduces a novel 4D Gaussian representation that maps Gaussian attributes into structured UV atlases, enabling compact storage and compatibility with standard video codecs for volumetric video streaming.

Details

Motivation: Volumetric videos offer immersive experiences but face challenges in reconstruction, storage, and streaming at scale. Existing Gaussian Splatting methods break down on long sequences, lack temporal consistency, and are incompatible with conventional video coding pipelines.

Method: PackUV-GS: A temporally consistent fitting method that optimizes Gaussian parameters directly in UV domain. Uses flow-guided Gaussian labeling and video keyframing to identify dynamic Gaussians, stabilize static regions, and preserve temporal coherence under large motions and disocclusions.

Result: Method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality. Introduces PackUV-2B dataset with 100 sequences and 2B frames for evaluation.

Conclusion: PackUV provides the first unified volumetric video representation compatible with standard video codecs without quality loss, enabling efficient streaming within existing multimedia infrastructure.

Abstract: Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, suffer from temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlases, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., FFV1) without losing quality, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view video dataset to date, featuring more than 50 synchronized cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.

[174] D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment

Argo Saakyan, Dmitry Solntsev

Main category: cs.CV

TL;DR: D-FINE-seg extends D-FINE transformer detector to real-time instance segmentation with lightweight mask head, segmentation-aware training, and optimized inference pipeline.

Details

Motivation: While transformer-based real-time object detectors like D-FINE perform well, real-time instance segmentation with transformers is less common. The authors aim to extend D-FINE to instance segmentation while maintaining competitive latency.

Method: Adds lightweight mask head to D-FINE, segmentation-aware training with box cropped BCE and dice mask losses, auxiliary and denoising mask supervision, adapted Hungarian matching cost, and end-to-end pipeline for training/exporting/inference across ONNX, TensorRT, OpenVINO.

Result: On TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under unified TensorRT FP16 end-to-end benchmarking while maintaining competitive latency.

Conclusion: D-FINE-seg successfully extends transformer-based real-time detection to instance segmentation with improved performance and practical deployment pipeline, released as open-source.

Abstract: Transformer-based real-time object detectors achieve strong accuracy-latency trade-offs, and D-FINE is among the top-performing recent architectures. However, real-time instance segmentation with transformers is still less common. We present D-FINE-seg, an instance segmentation extension of D-FINE that adds a lightweight mask head and segmentation-aware training, including box-cropped BCE and dice mask losses, auxiliary and denoising mask supervision, and an adapted Hungarian matching cost. On the TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency. Our second contribution is an end-to-end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, and OpenVINO for both object detection and instance segmentation tasks. This framework is released as open-source under the Apache-2.0 license. GitHub repository: https://github.com/ArgoHA/D-FINE-seg.
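
The box-cropped BCE and dice mask losses named for D-FINE-seg can be illustrated with a small NumPy sketch. The summary does not spell out the exact formulation, so this is one plausible reading, assumed for illustration: both losses are evaluated only on pixels inside the matched box. The function name `box_cropped_bce_dice`, the half-open pixel box convention, and the `eps` smoothing term are hypothetical.

```python
import numpy as np

def box_cropped_bce_dice(logits, target, box, eps=1e-6):
    """Mask losses restricted to a box (assumed reading of "box cropped").

    logits, target: (H, W) arrays; box: (x0, y0, x1, y1) in pixels, half-open.
    Returns (bce, dice), both averaged or summed over the cropped region only.
    """
    x0, y0, x1, y1 = box
    z = logits[y0:y1, x0:x1]
    t = target[y0:y1, x0:x1].astype(np.float64)
    # Numerically stable BCE-with-logits: max(z, 0) - z*t + log(1 + exp(-|z|)).
    bce = np.mean(np.maximum(z, 0) - z * t + np.log1p(np.exp(-np.abs(z))))
    p = 1.0 / (1.0 + np.exp(-z))
    dice = 1.0 - (2.0 * np.sum(p * t) + eps) / (np.sum(p) + np.sum(t) + eps)
    return float(bce), float(dice)

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
box = (1, 1, 7, 7)  # hypothetical matched box around the object

good_logits = np.where(target == 1.0, 8.0, -8.0)  # confident and correct
bad_logits = -good_logits                         # confident and wrong

bce_good, dice_good = box_cropped_bce_dice(good_logits, target, box)
bce_bad, dice_bad = box_cropped_bce_dice(bad_logits, target, box)
```

Restricting the loss to the box keeps background pixels far from the object from dominating the mask gradient, which is a common motivation for this kind of cropping.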

[175] GeoWorld: Geometric World Models

Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley

Main category: cs.CV

TL;DR: GeoWorld introduces a hyperbolic geometry-based world model for multi-step visual planning that preserves geometric structure and hierarchical relations, improving long-horizon prediction stability.

Details

Motivation: Existing energy-based predictive world models for visual planning have two key limitations: (1) they use Euclidean latent spaces that neglect underlying geometric and hierarchical structure among states, and (2) they struggle with long-horizon prediction leading to rapid degradation in extended rollouts.

Method: GeoWorld uses a Hyperbolic JEPA (Joint Embedding Predictive Architecture) to map latent representations from Euclidean space onto hyperbolic manifolds, preserving geometric structure and hierarchical relations. It also introduces Geometric Reinforcement Learning for energy-based optimization to enable stable multi-step planning in hyperbolic latent space.

Result: Extensive experiments on CrossTask and COIN datasets show approximately 3% success rate improvement in 3-step planning and 2% improvement in 4-step planning compared to the state-of-the-art V-JEPA 2.

Conclusion: GeoWorld demonstrates that incorporating hyperbolic geometry into world models can better preserve structural relationships and improve long-horizon visual planning performance compared to Euclidean approaches.

Abstract: Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
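
Mapping Euclidean latents onto a hyperbolic manifold can be sketched with the Poincaré ball model. The summary does not say which hyperbolic model GeoWorld uses, so the ball model, the curvature parameter `c`, and the function names `expmap0` and `poincare_dist` are assumptions made for illustration. The formulas themselves are the standard exponential map at the origin and the standard ball distance.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    scaled = np.sqrt(c) * np.maximum(norm, eps)
    return np.tanh(scaled) * v / scaled

def poincare_dist(x, y, c=1.0):
    """Geodesic distance between points strictly inside the ball."""
    sq = np.sum((x - y) ** 2, axis=-1)
    denom = (1 - c * np.sum(x ** 2, axis=-1)) * (1 - c * np.sum(y ** 2, axis=-1))
    return np.arccosh(1 + 2 * c * sq / denom) / np.sqrt(c)

# A Euclidean latent with norm 0.5 (hypothetical encoder output).
v = np.array([[0.3, 0.4]])
x = expmap0(v)
origin = np.zeros_like(x)
d = poincare_dist(origin, x)
```

With `c = 1`, the distance from the origin to `expmap0(v)` equals twice the Euclidean norm of `v`, so the map keeps radial order while placing every point inside the unit ball. Distances grow rapidly near the boundary, which is what makes hyperbolic space suited to tree-like, hierarchical structure.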

[176] Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

Yiding Sun, Jihua Zhu, Haozhe Cheng, Chaoyi Lu, Zhichuan Yang, Lin Chen, Yaonan Wang

Main category: cs.CV

TL;DR: PointATA: A parameter-efficient transfer learning method that adapts 3D pre-trained models to 4D point cloud video understanding by addressing modality gap and overfitting through two-stage “Align then Adapt” paradigm.

Details

Motivation: 4D point cloud video datasets are scarce compared to 3D datasets, limiting scalability of self-supervised 4D models. Transferring 3D pre-trained models to 4D tasks faces challenges of overfitting and modality gap between 3D and 4D data distributions.

Method: Two-stage “Align then Adapt” paradigm: Stage 1 uses optimal transport theory to quantify 3D-4D distribution discrepancy and trains point align embedder to bridge modality gap. Stage 2 integrates efficient point-video adapter and spatial-context encoder into frozen 3D backbone to enhance temporal modeling while mitigating overfitting.

Result: Achieves 97.21% accuracy on 3D action recognition, +8.7% improvement on 4D action segmentation, and 84.06% on 4D semantic segmentation. Matches or outperforms full fine-tuning models while being parameter-efficient.

Conclusion: PointATA enables effective transfer of 3D pre-trained models to 4D video understanding tasks by addressing modality gap and overfitting, achieving strong performance with parameter efficiency.

Abstract: Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel “Align then Adapt” (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g., 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.
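
Quantifying a distribution gap with optimal transport can be sketched using entropic-regularised Sinkhorn iterations. The summary does not state which OT formulation PointATA uses, so Sinkhorn, the squared Euclidean ground cost, the uniform weights, and the names `sinkhorn_cost`, `reg`, and `n_iters` are assumptions. Random vectors stand in for 3D and 4D features.

```python
import numpy as np

def sinkhorn_cost(x, y, reg=0.1, n_iters=200):
    """Entropic-regularised OT cost between uniform point sets x (n, d), y (m, d)."""
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    # Scale by the max cost so exp() stays well-conditioned (assumes cost.max() > 0).
    kernel = np.exp(-(cost / cost.max()) / reg)
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost))

rng = np.random.default_rng(0)
feats_3d = rng.normal(size=(64, 8))                # stand-in for 3D features
feats_same = rng.normal(size=(64, 8))              # same distribution
feats_shifted = rng.normal(size=(64, 8)) + 3.0     # stand-in for a modality gap

gap_same = sinkhorn_cost(feats_3d, feats_same)
gap_shifted = sinkhorn_cost(feats_3d, feats_shifted)
```

A larger transport cost signals a larger distribution discrepancy, so a score of this kind could serve as the quantity an alignment stage tries to reduce.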

[177] Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy

Matthew Sutton, Katrin Amunts, Timo Dickscheid, Christian Schiffer

Main category: cs.CV

TL;DR: A label-mediated method connects biomedical vision foundation models to language without paired image-text data, using labels to generate synthetic captions from literature for cytoarchitectonic brain analysis.

Details

Motivation: Foundation models need paired image-text data for vision-language coupling, but such data is scarce in research/clinical settings like microscopic brain analysis. Current methods require curated paired datasets which are difficult to obtain.

Method: Proposes label-mediated caption generation: uses area labels to automatically mine descriptions from related literature as synthetic captions. Couples existing cytoarchitectonic vision model (CytoNet) to LLM via image-to-text training, enabling natural language descriptions of microscopy regions.

Result: Method produces plausible area-level descriptions across 57 brain areas with 90.6% accuracy for in-scope patches and supports open-set use. With area label masked, descriptions recover area in 8-way test with 68.6% accuracy.

Conclusion: Weak, label-mediated pairing suffices to connect biomedical vision foundation models to language, providing practical integration of natural-language interfaces in domains with scarce fine-grained paired annotations.

Abstract: Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy. These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural-language in domains where fine-grained paired annotations are scarce.

[178] Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras

Paul Kielty, Timothy Hanley, Peter Corcoran

Main category: cs.CV

TL;DR: LADS (Locally Adaptive Decay Surfaces) is a novel event representation method that adapts temporal decay parameters locally based on signal dynamics, improving event-based vision tasks like face detection and facial landmarks.

Details

Motivation: Event cameras capture high-temporal-resolution luminance changes, but converting their sparse, asynchronous output into dense tensors for neural networks is challenging. Existing methods use fixed temporal parameters across the entire image, creating a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion.

Method: Introduces Locally Adaptive Decay Surfaces (LADS), a family of event representations where temporal decay at each location is modulated according to local signal dynamics. Three adaptive strategies are explored: based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy.

Result: LADS consistently improves face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, achieves higher detection accuracy and lower landmark error; at 240 Hz, mitigates accuracy decline typically observed at higher frequencies, setting new benchmarks for event-based face analysis.

Conclusion: LADS demonstrates the importance of context-aware temporal integration for neuromorphic vision, enabling real-time, high-frequency human-computer interaction systems that exploit event camera advantages while supporting lighter network architectures.

Abstract: Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
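
The event-rate variant of a locally adaptive decay surface can be sketched as an exponential time surface whose time constant shrinks where recent activity is dense. The summary gives no formulas, so everything here is a hypothetical design chosen to match the stated behaviour (detail preserved in quiet regions, less blur in busy ones): the linear mapping from rate to `tau`, the 3x3 neighbourhood, and the parameters `tau_min`, `tau_max`, and `window`.

```python
import numpy as np

def adaptive_decay_surface(events, shape, t_now,
                           tau_min=0.005, tau_max=0.1, window=0.05):
    """Time surface with a per-pixel decay constant driven by local event rate.

    events: iterable of (x, y, t) with t in seconds and t <= t_now.
    Returns (surface, tau), both arrays of the given (H, W) shape.
    """
    last_t = np.full(shape, -np.inf)
    count = np.zeros(shape)
    for x, y, t in events:
        last_t[y, x] = max(last_t[y, x], t)
        if t_now - t <= window:
            count[y, x] += 1
    # 3x3 box sum of recent event counts as a local event-rate estimate.
    h, w = shape
    padded = np.pad(count, 1)
    local = sum(padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3))
    rate = local / local.max() if local.max() > 0 else local
    # Busy regions get a short time constant, quiet regions a long one.
    tau = tau_max - (tau_max - tau_min) * rate
    surface = np.exp(-(t_now - last_t) / tau)  # pixels with no events decay to 0
    return surface, tau

t_now = 0.1
busy = [(2, 2, t) for t in np.linspace(0.06, 0.08, 10)]  # dense activity
quiet = [(6, 6, 0.08)]                                   # a single event
surface, tau = adaptive_decay_surface(busy + quiet, (8, 8), t_now)
```

Both pixels last fired at the same time, yet the quiet pixel keeps a much higher surface value because its decay is slower, while the busy pixel fades quickly and so smears less under motion.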

[179] SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation

Fuhao Zhang, Lei Liu, Jialin Zhang, Ya-Nan Zhang, Nan Mu

Main category: cs.CV

TL;DR: SpectralMamba-UNet: A frequency-disentangled framework for medical image segmentation that decouples structural and textural information using spectral decomposition with Mamba-based global modeling and high-frequency boundary preservation.

Details

Motivation: Medical image segmentation requires modeling both global anatomical structures and fine-grained boundary details. Current state space models like Vision Mamba offer efficient long-range dependency modeling but weaken local spatial continuity and high-frequency representation due to their one-dimensional serialization approach.

Method: Proposes SpectralMamba-UNet with: 1) Spectral Decomposition and Modeling (SDM) module using discrete cosine transform to decompose low- and high-frequency features, 2) Frequency-domain Mamba for global contextual modeling of low frequencies, 3) High-frequency preservation for boundary details, 4) Spectral Channel Reweighting (SCR) for channel-wise frequency-aware attention, and 5) Spectral-Guided Fusion (SGF) for adaptive multi-scale fusion in the decoder.

Result: Experiments on five public benchmarks demonstrate consistent improvements across diverse modalities and segmentation targets, validating the effectiveness and generalizability of the approach.

Conclusion: The proposed frequency-disentangled framework effectively addresses the limitations of existing state space models by decoupling structural and textural information in the spectral domain, achieving better balance between global context modeling and local boundary preservation for medical image segmentation.

Abstract: Accurate medical image segmentation requires effective modeling of both global anatomical structures and fine-grained boundary details. Recent state space models (e.g., Vision Mamba) offer efficient long-range dependency modeling. However, their one-dimensional serialization weakens local spatial continuity and high-frequency representation. To this end, we propose SpectralMamba-UNet, a novel frequency-disentangled framework to decouple the learning of structural and textural information in the spectral domain. Our Spectral Decomposition and Modeling (SDM) module applies discrete cosine transform to decompose low- and high-frequency features, where low frequency contributes to global contextual modeling via a frequency-domain Mamba and high frequency preserves boundary-sensitive details. To balance spectral contributions, we introduce a Spectral Channel Reweighting (SCR) mechanism to form channel-wise frequency-aware attention, and a Spectral-Guided Fusion (SGF) module to achieve adaptively multi-scale fusion in the decoder. Experiments on five public benchmarks demonstrate consistent improvements across diverse modalities and segmentation targets, validating the effectiveness and generalizability of our approach.
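
The DCT-based split into low- and high-frequency features can be sketched in NumPy. This is a minimal illustration and not the SDM module itself: the square low-frequency mask, the `cutoff` parameter, and the function names `dct_matrix` and `split_frequencies` are assumptions, and a single 2D map stands in for a feature channel.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D: D @ x is the DCT of x, and D.T inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0] /= np.sqrt(2.0)
    return d

def split_frequencies(feat, cutoff):
    """Split a (H, W) map into low- and high-frequency parts using a 2D DCT."""
    h, w = feat.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ feat @ dw.T
    mask = np.zeros_like(coeffs)
    mask[:cutoff, :cutoff] = 1.0  # keep the lowest `cutoff` frequencies per axis
    low = dh.T @ (coeffs * mask) @ dw
    high = dh.T @ (coeffs * (1.0 - mask)) @ dw
    return low, high

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8))
low, high = split_frequencies(feat, cutoff=3)

# A constant map has only a DC component, so it lands entirely in the low band.
flat_low, flat_high = split_frequencies(np.ones((8, 8)), cutoff=3)
```

Because the transform is orthonormal, the two bands sum back to the input exactly, so each branch can be processed separately and then fused without losing information.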

[180] WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Xudong Yan, Songhe Feng, Jiaxin Wang, Xin Su, Yi Jin

Main category: cs.CV

TL;DR: WARM-CAT: A novel CZSL approach that accumulates multimodal knowledge from unsupervised data to update prototypes at test time, addressing distribution shift through adaptive prototype adjustment and dynamic priority queues.

Details

Motivation: Existing CZSL methods suffer from performance degradation due to distribution shift at test time when unseen attribute-object compositions appear. The paper aims to overcome this challenge by leveraging multimodal knowledge from unsupervised data.

Method: Proposes accumulating comprehensive knowledge in textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Uses adaptive update weights, dynamic priority queues storing high-confidence images, warm-start initialization with training images, and multimodal collaborative representation learning for prototype alignment.

Result: Achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Introduces new C-Fashion dataset and refines MIT-States dataset for more reliable evaluation.

Conclusion: WARM-CAT effectively addresses distribution shift in CZSL by accumulating multimodal knowledge and adaptively updating prototypes, demonstrating superior performance across multiple benchmarks.

Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual prototypes from historical images for inference. Since the model tends to favor compositions already stored in the queue during testing, we warm-start the queue by initializing it with training images for visual prototypes of seen compositions and generating unseen visual prototypes using the mapping learned between seen and unseen textual prototypes. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. To provide a more reliable evaluation for CZSL, we introduce a new benchmark dataset, C-Fashion, and refine the widely used but noisy MIT-States dataset. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. The source code and datasets are available at https://github.com/xud-yan/WARM-CAT .
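The queue-based prototype update can be sketched as follows. This is an illustrative toy, assuming a fixed-capacity min-heap keyed by confidence and a simple convex blend for the prototype update; the class and function names, the capacity, and the blend rule are hypothetical and are not taken from the WARM-CAT code.

```python
import heapq
from itertools import count

class ConfidenceQueue:
    """Keeps the `capacity` highest-confidence features seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []            # min-heap of (confidence, tiebreak, feature)
        self._tiebreak = count()   # avoids comparing feature lists on ties

    def push(self, confidence, feature):
        item = (confidence, next(self._tiebreak), feature)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif confidence > self._heap[0][0]:
            # Evict the least confident stored feature.
            heapq.heapreplace(self._heap, item)

    def prototype(self):
        """Mean of the stored features (assumed visual prototype)."""
        if not self._heap:
            raise ValueError("queue is empty; warm-start it first")
        feats = [f for _, _, f in self._heap]
        dim = len(feats[0])
        return [sum(f[d] for f in feats) / len(feats) for d in range(dim)]

def update_prototype(old, new, weight):
    """Blend old and new prototypes; `weight` stands in for the adaptive weight."""
    return [(1.0 - weight) * o + weight * n for o, n in zip(old, new)]

queue = ConfidenceQueue(capacity=2)
queue.push(0.9, [1.0, 0.0])   # warm start, e.g. from a training image
queue.push(0.5, [0.0, 1.0])
queue.push(0.7, [1.0, 1.0])   # test-time image replaces the 0.5 entry
visual_proto = queue.prototype()
```

Warm-starting corresponds to pushing training-image features before any test image arrives, so that compositions are represented in the queue from the first test step.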

[181] FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time

David Dirnfeld, Fabien Delattre, Pedro Miraldo, Erik Learned-Miller

Main category: cs.CV

TL;DR: A novel Hough transform method on the unit sphere for camera heading estimation that uses great circles from feature correspondences and Fibonacci lattice discretization for robust motion estimation in noisy conditions.

Details

Motivation: Existing camera motion estimation methods perform well in low-noise conditions but degrade in accuracy or become computationally expensive with increased noise and outliers. There's a need for more robust heading estimation techniques that maintain efficiency.

Method: Proposes a generalization of Hough transform on unit sphere (S(2)): 1) extracts feature correspondences between frames, 2) generates great circles of directions compatible with each correspondence pair, 3) discretizes sphere using Fibonacci lattice as bin centers, 4) each great circle casts votes for range of directions, ensuring consistent voting for correct motion direction despite noise/dynamic objects.

Result: Experimental results on three datasets show the method is on the Pareto frontier of accuracy vs efficiency. SLAM experiments demonstrate reduced RMSE by correcting heading during camera pose initialization.

Conclusion: The proposed spherical Hough transform method provides robust and efficient camera heading estimation that maintains accuracy in noisy conditions and improves SLAM performance through better heading correction.

Abstract: Estimating camera motion from monocular video is a fundamental problem in computer vision, central to tasks such as SLAM, visual odometry, and structure-from-motion. Existing methods that recover the camera’s heading under known rotation, whether from an IMU or an optimization algorithm, tend to perform well in low-noise, low-outlier conditions, but often decrease in accuracy or become computationally expensive as noise and outlier levels increase. To address these limitations, we propose a novel generalization of the Hough transform on the unit sphere (S(2)) to estimate the camera’s heading. First, the method extracts correspondences between two frames and generates a great circle of directions compatible with each pair of correspondences. Then, by discretizing the unit sphere using a Fibonacci lattice as bin centers, each great circle casts votes for a range of directions, ensuring that features unaffected by noise or dynamic objects vote consistently for the correct motion direction. Experimental results on three datasets demonstrate that the proposed method is on the Pareto frontier of accuracy versus efficiency. Additionally, experiments on SLAM show that the proposed method reduces RMSE by correcting the heading during camera pose initialization.
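The voting scheme can be sketched in a few lines. The sketch below assumes rotation-compensated unit bearing vectors, for which the translation direction t satisfies t · (b1 × b2) = 0, so each correspondence constrains t to a great circle. The bin count, the `band` tolerance used to cast votes, and all function names are assumptions for illustration, not the authors' implementation.

```python
import math

def fibonacci_lattice(n):
    """Return n roughly uniformly spread unit vectors on the sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return (v[0] / norm, v[1] / norm, v[2] / norm)

def estimate_heading(bearing_pairs, n_bins=2000, band=0.05):
    """Hough-style vote over lattice bins.

    Each pair defines a great circle with normal b1 x b2; every bin whose
    center lies within `band` of that circle receives one vote. A direction
    and its opposite lie on the same great circles, so this sketch cannot
    tell them apart and returns the heading only up to sign.
    """
    bins = fibonacci_lattice(n_bins)
    votes = [0] * n_bins
    for b1, b2 in bearing_pairs:
        n = cross(b1, b2)
        if dot(n, n) < 1e-18:      # degenerate pair, no usable constraint
            continue
        normal = normalize(n)
        for idx, d in enumerate(bins):
            if abs(dot(normal, d)) < band:
                votes[idx] += 1
    best = max(range(n_bins), key=votes.__getitem__)
    return bins[best]

# Synthetic check: the camera moves along +z, observing five 3D points.
scene = [(1.0, 0.5, 4.0), (-1.0, 0.8, 5.0), (0.7, -1.2, 6.0),
         (-0.4, -0.9, 3.0), (1.5, 1.1, 7.0)]
cam2 = (0.0, 0.0, 0.5)
pairs = [(normalize(p),
          normalize((p[0] - cam2[0], p[1] - cam2[1], p[2] - cam2[2])))
         for p in scene]
heading = estimate_heading(pairs)
```

Outlier correspondences spread their votes over unrelated circles, while inliers all pass near the true direction, which is the intuition behind the robustness claim.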

[182] Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

Xiaosen Wang, Zhijin Ge, Bohan Liu, Zheng Fang, Fengfan Zhou, Ruixuan Zhang, Shaokang Wang, Yuyang Luo

Main category: cs.CV

TL;DR: This paper addresses the lack of standardized evaluation for adversarial transferability attacks, proposing a comprehensive framework and taxonomy while reviewing enhancement strategies and issues in current research.

Details

Motivation: The motivation stems from the security concerns raised by adversarial transferability, where attacks on one model can transfer to others without direct access. The authors identify a critical gap: the absence of standardized evaluation frameworks and criteria, leading to potentially biased assessments of existing transfer-based attack methods.

Method: The authors conducted an exhaustive review of hundreds of related works and organized transfer-based attacks into six distinct categories. They then proposed a comprehensive benchmarking framework for evaluating these attacks, delineated common strategies that enhance adversarial transferability, and highlighted prevalent issues that could lead to unfair comparisons.

Result: The paper provides a systematic taxonomy of transfer-based attacks, a standardized evaluation framework, identification of enhancement strategies, and analysis of common pitfalls in current research. It also includes a brief review of transfer-based attacks beyond image classification.

Conclusion: The work establishes a much-needed standardized framework for evaluating adversarial transferability attacks, which should lead to more fair and comprehensive assessments of existing and future approaches, ultimately advancing the field of adversarial machine learning security.

Abstract: Adversarial transferability refers to the capacity of adversarial examples generated on the surrogate model to deceive alternate, unexposed victim models. This property eliminates the need for direct access to the victim model during an attack, thereby raising considerable security concerns in practical applications and attracting substantial research attention recently. In this work, we discern a lack of a standardized framework and criteria for evaluating transfer-based attacks, leading to potentially biased assessments of existing approaches. To rectify this gap, we have conducted an exhaustive review of hundreds of related works, organizing various transfer-based attacks into six distinct categories. Subsequently, we propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that could lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.

[183] TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement

Arian Sabaghi, José Oramas

Main category: cs.CV

TL;DR: TriLite: Single-stage weakly supervised object localization using frozen Vision Transformer with minimal trainable parameters, achieving SOTA with improved object coverage.

Details

Motivation: Current WSOL methods often use multi-stage pipelines or require full fine-tuning of large backbones, which is computationally expensive. Additionally, the WSOL field faces challenges with partial object coverage and spurious activations.

Method: TriLite uses a frozen Vision Transformer (Dinov2 pre-trained) with only ~800K trainable parameters. It introduces a TriHead module that decomposes patch features into foreground, background, and ambiguous regions to improve object coverage while suppressing spurious activations. The framework disentangles classification and localization objectives.

Result: State-of-the-art performance on CUB-200-2011, ImageNet-1K, and OpenImages datasets while being significantly more parameter-efficient and easier to train than prior methods.

Conclusion: TriLite demonstrates that self-supervised ViTs can be effectively leveraged for WSOL without expensive end-to-end training, achieving excellent performance with minimal trainable parameters through careful architectural design.

Abstract: Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with Dinov2 pre-training in a self-supervised manner, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.

Xin Yuan, Zhiyong Zhang, Xin Xu, Zheng Wang, Chia-Wen Lin

Main category: cs.CV

TL;DR: CARE is a two-stage framework for robust person Re-ID with noisy labels, using probabilistic evidence calibration to overcome softmax limitations and evidence propagation refinement to better distinguish clean vs. noisy samples.

Details

Motivation: Person Re-ID in unconstrained environments faces challenges with noisy labels and sparse per-identity samples. Existing methods relying on softmax outputs suffer from translation invariance (over-confident predictions) and conventional sample selection discards valuable hard positives needed for discriminative features.

Method: Two-stage CARE framework: 1) Calibration stage with Probabilistic Evidence Calibration (PEC) that injects adaptive learnable parameters into similarity function and uses evidential calibration loss to mitigate overconfidence. 2) Refinement stage with Evidence Propagation Refinement (EPR) containing Composite Angular Margin (CAM) metric to distinguish clean hard positives from mislabeled samples in hyperspherical space, and Certainty-Oriented Sphere Weighting (COSW) to dynamically allocate sample importance.

Result: Extensive experiments on Market1501, DukeMTMC-ReID, and CUHK03 datasets under both random and patterned noises show CARE achieves competitive performance.

Conclusion: CARE effectively addresses noise-robust person Re-ID by overcoming softmax limitations through probabilistic evidence calibration and refinement, achieving strong performance on benchmark datasets with noisy labels.

Abstract: With the increasing demand for robust person Re-ID in unconstrained environments, learning from datasets with noisy labels and sparse per-identity samples remains a critical challenge. Existing noise-robust person Re-ID methods primarily rely on loss-correction or sample-selection strategies using softmax outputs. However, these methods suffer from two key limitations: 1) Softmax exhibits translation invariance, leading to over-confident and unreliable predictions on corrupted labels. 2) Conventional sample selection based on small-loss criteria often discards valuable hard positives that are crucial for learning discriminative features. To overcome these issues, we propose the CAlibration-to-REfinement (CARE) method, a two-stage framework that seeks certainty through probabilistic evidence propagation from calibration to refinement. In the calibration stage, we propose the probabilistic evidence calibration (PEC) that dismantles softmax translation invariance by injecting adaptive learnable parameters into the similarity function, and employs an evidential calibration loss to mitigate overconfidence on mislabeled samples. In the refinement stage, we design the evidence propagation refinement (EPR) that can more accurately distinguish between clean and noisy samples. Specifically, the EPR contains two steps: Firstly, the composite angular margin (CAM) metric is proposed to precisely distinguish clean but hard-to-learn positive samples from mislabeled ones in a hyperspherical space; Secondly, the certainty-oriented sphere weighting (COSW) is developed to dynamically allocate the importance of samples according to CAM, ensuring clean instances drive model updates. Extensive experimental results on Market1501, DukeMTMC-ReID, and CUHK03 datasets under both random and patterned noises show that CARE achieves competitive performance.
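The translation invariance mentioned above is easy to verify numerically: adding the same constant to every logit leaves the softmax output unchanged, so the output alone cannot reflect the absolute magnitude of the evidence. The snippet below is a small self-contained check; it illustrates the limitation only and does not implement PEC.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
shifted = [z + 100.0 for z in logits]   # same constant added to every logit

p = softmax(logits)
p_shifted = softmax(shifted)

# The two distributions agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(p, p_shifted))
```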

[185] No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors

Tao Liu, Gang Wan, Kan Ren, Shibo Wen

Main category: cs.CV

TL;DR: Proposes unsupervised online video stabilization framework using classical pipeline with multithreaded buffering, addressing data limitations, controllability, and efficiency issues in deep learning approaches.

Details

Motivation: Address limitations of deep learning video stabilization methods that require paired datasets, have poor controllability, and are inefficient on resource-constrained hardware. Also aims to expand stabilization applicability beyond handheld visible-light videos to domains like UAV nighttime remote sensing.

Method: Unsupervised framework implementing classical stabilization pipeline with three stages, incorporating multithreaded buffering mechanism. Introduces new multimodal UAV aerial video dataset (UAV-Test) for evaluation.

Result: Method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, achieving performance comparable to offline methods.

Conclusion: Proposed unsupervised framework effectively addresses key challenges in video stabilization, expands applicability to new domains like UAV remote sensing, and demonstrates superior performance to existing online methods.

Abstract: We propose a new unsupervised framework for online video stabilization. Unlike methods based on deep learning that require paired stable and unstable datasets, our approach instantiates the classical stabilization pipeline with three stages and incorporates a multithreaded buffering mechanism. This design addresses three longstanding challenges in end-to-end learning: limited data, poor controllability, and inefficiency on hardware with constrained resources. Existing benchmarks focus mainly on handheld videos with a forward view in visible light, which restricts the applicability of stabilization to domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to offline methods.
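To make the "no look-ahead" constraint concrete, here is a minimal sketch of causal trajectory smoothing. It assumes the usual classical decomposition into motion estimation, trajectory smoothing, and frame warping (the abstract does not name its three stages), models camera motion as 2D translation only, and uses an exponential moving average as the smoother; the function name and `alpha` are hypothetical.

```python
def stabilize_online(frame_motions, alpha=0.9):
    """Return per-frame (dx, dy) corrections using only past frames.

    frame_motions: list of estimated inter-frame translations (dx, dy).
    The raw camera path is the running sum of these motions; the smoothed
    path is a causal exponential moving average of the raw path. The
    correction to apply when warping frame i is smoothed - raw.
    """
    corrections = []
    raw_x = raw_y = 0.0
    smooth_x = smooth_y = 0.0
    for i, (dx, dy) in enumerate(frame_motions):
        raw_x += dx
        raw_y += dy
        if i == 0:
            smooth_x, smooth_y = raw_x, raw_y   # nothing to smooth yet
        else:
            smooth_x = alpha * smooth_x + (1.0 - alpha) * raw_x
            smooth_y = alpha * smooth_y + (1.0 - alpha) * raw_y
        corrections.append((smooth_x - raw_x, smooth_y - raw_y))
    return corrections

# A jittery horizontal pan: corrections pull each frame back toward the trend.
shifts = stabilize_online([(1.0, 0.0), (1.0, 0.0)], alpha=0.5)
```

Because each correction depends only on frames already seen, the loop can run frame by frame; an offline method could instead smooth with a symmetric window over future frames.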

[186] Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang, Fabio Poiesi

Main category: cs.CV

TL;DR: Fase3D is an encoder-free 3D multimodal model that uses Fourier transforms and point cloud serialization for efficient 3D scene understanding without heavy visual encoders.

Details

Motivation: Current 3D LMMs rely on heavy pre-trained visual encoders for geometric feature extraction, which is inefficient and not scalable. While 2D LMMs have moved toward encoder-free designs, extending this to 3D is challenging due to unordered, large-scale point clouds.

Method: Uses structured superpoints for compact scene representation, space-filling curve serialization followed by Fast Fourier Transform for efficient global context modeling, and Fourier-augmented LoRA adapters for injecting global frequency-aware interactions into LLMs.

Result: Achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters.

Conclusion: Fase3D demonstrates that encoder-free 3D LMMs are feasible and efficient, offering a scalable alternative to traditional encoder-based approaches for 3D scene understanding.

Abstract: Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: https://tev-fbk.github.io/Fase3D.
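The serialize-then-transform idea can be sketched as follows. This is a toy under stated assumptions: the abstract does not say which space-filling curve is used, so Morton (Z-order) codes are chosen here for simplicity; points are assumed to have non-negative coordinates; the `cell` size and the radix-2 FFT (power-of-two lengths only) are illustrative choices rather than Fase3D's tokenizer.

```python
import cmath

def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative integer cell indices."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def serialize(points, cell=0.1, bits=10):
    """Return point indices ordered along the Z-order curve."""
    def key(i):
        x, y, z = points[i]
        return morton_code(int(x / cell), int(y / cell), int(z / cell), bits)
    return sorted(range(len(points)), key=key)

def fft(seq):
    """Recursive radix-2 FFT; len(seq) must be a power of two."""
    n = len(seq)
    if n == 1:
        return [complex(seq[0])]
    even = fft(seq[0::2])
    odd = fft(seq[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

points = [(0.15, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.15, 0.0), (0.15, 0.15, 0.0)]
features = [1.0, 2.0, 3.0, 4.0]              # one scalar feature per point
order = serialize(points)                    # unordered set -> 1D sequence
spectrum = fft([features[i] for i in order]) # every output mixes all tokens
```

Each FFT output coefficient depends on every token in the sequence, which is why a Fourier transform can serve as an inexpensive form of global mixing once the points have a fixed order.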

Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, Kris Kitani

Main category: cs.CV

TL;DR: DyaDiT is a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals, considering social interaction dynamics between conversational partners.

Details

Motivation: Existing methods map single audio streams to single speaker motion without considering social context or mutual dynamics between conversational partners, limiting natural social engagement.

Method: Multi-modal diffusion transformer that takes dyadic audio with optional social-context tokens, fuses information from both speakers, uses motion dictionary for motion priors, and can optionally utilize partner’s gestures for responsive motion.

Result: Surpasses existing methods on objective metrics and strongly preferred by users in quantitative studies, demonstrating robust and socially favorable motion generation.

Conclusion: DyaDiT effectively generates contextually appropriate conversational gestures by modeling dyadic interaction dynamics, advancing natural social interaction with digital humans.

Abstract: Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker’s motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner’s gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.

[188] AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Lueyang Zhang, Xinyu Geng, Shijue Huang, Peng Xia, Guanyu Jiang, Cheng Wang, Yue Zhang, Yi R. Fung, Junxian He

Main category: cs.CV

TL;DR: AgentVista is a comprehensive benchmark for evaluating multimodal agents on realistic, long-horizon tasks requiring complex tool use across visual and textual modalities.

Details

Motivation: Existing multimodal benchmarks focus on single-turn visual reasoning or specific tool skills, lacking realism, visual subtlety, and long-horizon tool use needed for practical multimodal agents that solve complex real-world workflows.

Method: Introduces AgentVista benchmark spanning 25 sub-domains across 7 categories with realistic visual scenarios requiring natural hybrid tool use. Tasks involve long-horizon interactions across modalities including web search, image search, page navigation, and code-based operations for image processing and programming.

Result: State-of-the-art models show significant gaps in long-horizon multimodal tool use. Best model (Gemini-3-Pro with tools) achieves only 27.3% overall accuracy, with hard instances requiring more than 25 tool-calling turns.

Conclusion: AgentVista exposes current limitations of multimodal agents and is expected to accelerate development of more capable agents for realistic, ultra-challenging problem solving requiring complex multimodal reasoning and tool use.

Abstract: Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.

[189] Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration

Xiaole Tang, Xiaoyi He, Jiayi Xu, Xiang Gu, Jian Sun

Main category: cs.CV

TL;DR: BaryIR is a representation learning framework for all-in-one image restoration that aligns multisource degraded features in Wasserstein barycenter space to learn degradation-agnostic representations, improving generalization to unseen degradations.

Details

Motivation: Existing all-in-one image restoration methods struggle with out-of-distribution degradations, limiting real-world generalization. The paper is motivated by the intuition that different degradations cause specific shifts from an underlying degradation-agnostic distribution, and recovering this shared distribution is key for cross-degradation generalization.

Method: Proposes BaryIR framework that aligns multisource degraded features in Wasserstein barycenter space to model degradation-agnostic distribution by minimizing average Wasserstein distances. Introduces residual subspaces with embeddings mutually contrasted while orthogonal to WB embeddings, explicitly decoupling degradation-agnostic invariant content space from degradation-specific knowledge spaces.

Result: BaryIR performs competitively against state-of-the-art all-in-one methods, generalizes well to unseen degradation types and levels, and shows remarkable robustness in learning generalized features even with limited training degradation types and real-world mixed degradations.

Conclusion: BaryIR effectively addresses generalization challenges in all-in-one image restoration by learning disentangled representations through Wasserstein barycenter alignment, enabling adaptive restoration based on degradation-agnostic shared invariance while preserving degradation-specific knowledge.

Abstract: Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods. Notably, BaryIR generalizes well to unseen degradations (e.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.
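The barycenter objective (minimizing the average Wasserstein distance to several source distributions) has a closed form in one dimension, which makes for a compact illustration. The sketch below assumes equal-size 1D samples with uniform weights and the squared 2-Wasserstein distance; BaryIR itself works with learned high-dimensional features, so this is only a toy analogue with hypothetical function names.

```python
def w2_squared_1d(a, b):
    """Squared 2-Wasserstein distance between equal-size 1D samples.

    In 1D the optimal transport plan matches sorted samples in order.
    """
    a, b = sorted(a), sorted(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def barycenter_1d(sample_sets):
    """W2 barycenter of equal-size 1D samples: average the order statistics."""
    sorted_sets = [sorted(s) for s in sample_sets]
    n = len(sorted_sets[0])
    return [sum(s[i] for s in sorted_sets) / len(sorted_sets) for i in range(n)]

def average_distance(candidate, sample_sets):
    """The barycenter objective: mean squared W2 distance to all sources."""
    return sum(w2_squared_1d(candidate, s) for s in sample_sets) / len(sample_sets)

# Two "degraded" sources that are shifted copies of the same shape.
sources = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
bary = barycenter_1d(sources)
objective = average_distance(bary, sources)
```

The barycenter sits between the shifted sources and keeps their common shape, which is the intuition behind treating it as a degradation-agnostic distribution.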

[190] Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

Main category: cs.CV

TL;DR: LaGS introduces a novel 4D panoptic occupancy tracking method using latent Gaussian splatting to efficiently aggregate multi-view information into 3D voxel grids for spatiotemporal scene understanding.

Details

Motivation: Existing methods for 4D spatiotemporal scene understanding are limited - they either provide coarse geometric tracking via bounding boxes or detailed 3D structures like voxel-based occupancy that lack explicit temporal association. There's a need for holistic spatiotemporal scene understanding that combines both aspects.

Method: LaGS uses camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction. It introduces latent Gaussian splatting: first fusing observations into 3D Gaussians as a sparse point-centric latent representation, then splatting aggregated features onto a 3D voxel grid decoded by a mask-based segmentation head.

Result: Achieves state-of-the-art performance for 4D panoptic occupancy tracking on Occ3D nuScenes and Waymo datasets.

Conclusion: LaGS advances spatiotemporal scene understanding by providing a holistic approach that combines geometric tracking with detailed 3D structures and explicit temporal association through efficient multi-view information aggregation.

Abstract: Capturing 4D spatiotemporal surroundings is crucial for the safe and reliable operation of robots in dynamic environments. However, most existing methods address only one side of the problem: they either provide coarse geometric tracking via bounding boxes, or detailed 3D structures like voxel-based occupancy that lack explicit temporal association. In this work, we present Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS) that advances spatiotemporal scene understanding in a holistic direction. Our approach incorporates camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction, and addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids via a novel latent Gaussian splatting approach. Specifically, we first fuse observations into 3D Gaussians that serve as a sparse point-centric latent representation of the 3D scene, and then splat the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. We evaluate LaGS on the Occ3D nuScenes and Waymo datasets, achieving state-of-the-art performance for 4D panoptic occupancy tracking. We make our code available at https://lags.cs.uni-freiburg.de/.
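The second step, splatting features carried by 3D Gaussians onto a voxel grid, can be sketched with a dense loop. This is a simplified illustration, assuming isotropic Gaussians, voxel-center sampling, and weight-normalized feature averaging; none of these choices, nor the function name, is confirmed by the abstract.

```python
import math

def splat_to_voxels(gaussians, grid_shape, voxel_size):
    """Accumulate Gaussian features at voxel centers.

    gaussians: list of (center_xyz, sigma, feature_list).
    Returns a dict mapping (ix, iy, iz) to a weight-normalized feature.
    """
    nx, ny, nz = grid_shape
    dim = len(gaussians[0][2])
    grid = {}
    for ix in range(nx):
        for iy in range(ny):
            for iz in range(nz):
                vx = (ix + 0.5) * voxel_size
                vy = (iy + 0.5) * voxel_size
                vz = (iz + 0.5) * voxel_size
                acc = [0.0] * dim
                total = 0.0
                for (gx, gy, gz), sigma, feat in gaussians:
                    d2 = (vx - gx) ** 2 + (vy - gy) ** 2 + (vz - gz) ** 2
                    w = math.exp(-d2 / (2.0 * sigma * sigma))
                    total += w
                    for d in range(dim):
                        acc[d] += w * feat[d]
                if total > 1e-8:   # leave voxels far from every Gaussian empty
                    grid[(ix, iy, iz)] = [a / total for a in acc]
    return grid

# Two Gaussians with scalar features on a 3x1x1 grid of unit voxels.
gaussians = [((0.5, 0.5, 0.5), 0.5, [0.0]), ((2.5, 0.5, 0.5), 0.5, [2.0])]
voxels = splat_to_voxels(gaussians, grid_shape=(3, 1, 1), voxel_size=1.0)
```

A practical implementation would restrict each Gaussian to nearby voxels instead of looping over the whole grid; the dense loop is kept here for readability.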

[191] Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway Platforms

Bin Zeng, Johannes Künzel, Anna Hilsmann, Peter Eisert

Main category: cs.CV

TL;DR: Physics-constrained tracking framework for real-time crowd counting on railway platforms using a single moving train camera, integrating detection, appearance, and 3D motion reasoning with physically plausible dynamics.

Details

Motivation: Accurate real-time crowd counting on railway platforms is essential for safety and capacity management, but existing methods fail under dynamic conditions with camera motion, dense occlusions, and perspective distortions during train arrivals.

Method: Proposes a physics-constrained tracking framework that unifies detection (transfer-learned YOLOv11m), appearance encoding (EfficientNet-B0), and 3D motion reasoning (Phys-3D Kalman model) with pinhole geometry constraints. Includes virtual counting band with persistence for occlusion handling.

Result: Achieves 2.97% counting error on MOT-RailwayPlatformCrowdHead Dataset, demonstrating robust performance despite motion and occlusions in safety-critical transportation scenarios.

Conclusion: Incorporating first-principles geometry and motion priors enables reliable crowd counting in dynamic transportation scenarios, facilitating effective train scheduling and platform safety management.

Abstract: Accurate, real-time crowd counting on railway platforms is essential for safety and capacity management. We propose to use a single camera mounted in a train, scanning the platform as the train arrives. While hardware constraints are simple, counting remains challenging due to dense occlusions, camera motion, and perspective distortions during train arrivals. Most existing tracking-by-detection approaches assume static cameras or ignore physical consistency in motion modeling, leading to unreliable counting under dynamic conditions. We propose a physics-constrained tracking framework that unifies detection, appearance, and 3D motion reasoning in a real-time pipeline. Our approach integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, while introducing a physics-constrained Kalman model (Phys-3D) that enforces physically plausible 3D motion dynamics through pinhole geometry. To address counting brittleness under occlusions, we implement a virtual counting band with persistence. On our platform benchmark, MOT-RailwayPlatformCrowdHead Dataset (MOT-RPCH), our method reduces counting error to 2.97%, demonstrating robust performance despite motion and occlusions. Our results show that incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, facilitating effective train scheduling and platform safety management.
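
To make the idea of motion prediction constrained by pinhole geometry concrete, here is a minimal sketch. It is not the authors' Phys-3D model: it shows only a constant-velocity state prediction in 3D followed by a pinhole projection, with no covariance or measurement update. The `Track3D` class, the function names, and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Track3D:
    """Hypothetical track state in the camera frame (metres, metres/second)."""
    x: float
    y: float
    z: float
    vx: float
    vy: float
    vz: float


def predict(track: Track3D, dt: float) -> Track3D:
    """Constant-velocity prediction: position advances, velocity is unchanged."""
    return Track3D(
        track.x + track.vx * dt,
        track.y + track.vy * dt,
        track.z + track.vz * dt,
        track.vx,
        track.vy,
        track.vz,
    )


def project(track: Track3D, fx: float, fy: float, cx: float, cy: float):
    """Pinhole projection of the 3D position to pixel coordinates (u, v)."""
    if track.z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * track.x / track.z + cx, fy * track.y / track.z + cy)
```

Predicting in 3D and projecting afterwards keeps the image-space motion consistent with perspective: the same metric velocity produces larger pixel displacements for nearby heads than for distant ones.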

[192] Uni-Animator: Towards Unified Visual Colorization

Xinyuan Chen, Yao Xu, Shaowen Wang, Pengjie Song, Bowen Deng

Main category: cs.CV

TL;DR: Uni-Animator is a Diffusion Transformer framework that unifies image and video sketch colorization with precise color transfer, physical detail preservation, and temporal coherence.

Details

Motivation: Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts.

Method: Proposes a DiT-based framework with three key components: 1) visual reference enhancement via instance patch embedding for precise color alignment, 2) physical detail reinforcement using physical features to capture high-frequency textures, and 3) sketch-based dynamic RoPE encoding to model motion-aware spatial-temporal dependencies.

Result: Extensive experiments show Uni-Animator achieves competitive performance on both image and video sketch colorization, matching task-specific methods while enabling unified cross-domain capabilities with high detail fidelity and robust temporal consistency.

Conclusion: Uni-Animator successfully addresses key challenges in unified sketch colorization, providing a framework that bridges image and video domains with improved color precision, detail preservation, and temporal coherence.

Abstract: We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.

[193] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

Thomas Woergaard, Raghavendra Selvan

Main category: cs.CV

TL;DR: FairQuant: A fairness-aware mixed-precision quantization framework for medical image classification that optimizes both performance and algorithmic fairness under bit budgets.

Details

Motivation: Existing neural network quantization methods (quantization-aware training, post-training quantization) focus on maintaining downstream performance but don't explicitly consider algorithmic fairness impacts, which is particularly important in sensitive domains like medical imaging.

Method: FairQuant combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization constraints.

Result: FairQuant configurations with average precision near 4-6 bits recover much of Uniform 8-bit accuracy while improving worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared budgets on Fitzpatrick17k and ISIC2019 datasets across ResNet18/50, DeiT-Tiny, and TinyViT.

Conclusion: FairQuant demonstrates that fairness-aware quantization can achieve good compression while maintaining or improving fairness metrics, offering a practical approach for deploying efficient and equitable models in sensitive applications like medical imaging.

Abstract: Compressing neural networks by quantizing model parameters offers a useful trade-off between performance and efficiency. Methods like quantization-aware training and post-training quantization strive to maintain the downstream performance of compressed models compared to the full-precision models. However, these techniques do not explicitly consider the impact on algorithmic fairness. In this work, we study fairness-aware mixed-precision quantization schemes for medical image classification under explicit bit budgets. We introduce FairQuant, a framework that combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization. We evaluate the method on Fitzpatrick17k and ISIC2019 across ResNet18/50, DeiT-Tiny, and TinyViT. Results show that FairQuant configurations with average precision near 4-6 bits recover much of the Uniform 8-bit accuracy while improving worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared budgets.
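
As a rough illustration of what a mixed-precision bit budget means, the sketch below quantizes weights with plain symmetric uniform quantization and computes a parameter-weighted average bit-width. It is not FairQuant itself: the fairness terms and the learnable BAQ mode are not modeled, and all function names are hypothetical.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits.

    Returns the de-quantized values so the rounding error can be inspected.
    """
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    qmax = 2 ** (bits - 1) - 1          # largest integer level
    peak = max(abs(w) for w in weights)
    if peak == 0:
        return list(weights)
    scale = peak / qmax                 # step size of the grid
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def max_error(weights, bits):
    """Largest absolute rounding error introduced at the given bit-width."""
    return max(abs(w - q) for w, q in zip(weights, quantize(weights, bits)))


def average_bits(layers):
    """Parameter-weighted mean bit-width; `layers` is a list of (num_params, bits)."""
    total = sum(n for n, _ in layers)
    return sum(n * b for n, b in layers) / total
```

For example, a model with 1,000 parameters at 4 bits and 3,000 parameters at 6 bits has an average precision of 5.5 bits, which falls within the 4-6 bit range the abstract reports.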

[194] ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Junhu Fu, Shuyu Liang, Wutong Li, Chen Ma, Peng Huang, Kehao Wang, Ke Chen, Shengli Lin, Pinghong Zhou, Zeju Li, Yuanyuan Wang, Yi Guo

Main category: cs.CV

TL;DR: ColoDiff is a diffusion-based framework for generating dynamic-consistent and content-aware colonoscopy videos to address data scarcity in clinical settings.

Details

Motivation: Colonoscopy video generation is crucial for diagnosing intestinal diseases, especially in data-scarce scenarios. Current methods struggle with temporal consistency and precise control over clinical attributes due to irregular intestinal structures, diverse disease representations, and various imaging modalities.

Method: ColoDiff uses a diffusion-based framework with: 1) a TimeStream module that decouples temporal dependency through cross-frame tokenization for dynamic modeling, 2) a Content-Aware module with noise-injected embeddings and learnable prototypes for precise clinical attribute control, and 3) a non-Markovian sampling strategy for real-time generation (over 90% step reduction).

Result: Evaluated across three public datasets and one hospital database, ColoDiff generates videos with smooth transitions and rich dynamics. It shows strong performance in downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation.

Conclusion: ColoDiff demonstrates the potential of synthetic colonoscopy videos to complement authentic representation and mitigate data scarcity in clinical settings through controllable video generation.

Abstract: Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamic-consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, breaking through the coarse guidance of diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics. ColoDiff presents an effort in controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic representation and mitigating data scarcity in clinical settings.
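
The abstract does not say which non-Markovian sampler is used. One well-known family (DDIM-style sampling) reduces cost by denoising over a strided subsequence of the training timesteps. The sketch below shows only that schedule arithmetic, under the assumption of a 1000-step training schedule; the function names are hypothetical.

```python
def strided_timesteps(train_steps, sample_steps):
    """Descending subsequence of timesteps to visit at sampling time."""
    if not 0 < sample_steps <= train_steps:
        raise ValueError("sample_steps must be in (0, train_steps]")
    stride = train_steps // sample_steps
    return list(range(train_steps - 1, -1, -stride))[:sample_steps]


def step_reduction(train_steps, sample_steps):
    """Fraction of denoising steps saved relative to the full schedule."""
    return 1.0 - sample_steps / train_steps
```

Under these assumptions, sampling with 50 of 1000 steps corresponds to a 95% reduction, which is consistent with the "over 90%" figure quoted above.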

[195] Motion-aware Event Suppression for Event Cameras

Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza

Main category: cs.CV

TL;DR: Motion-aware Event Suppression framework that filters events from independent moving objects and ego-motion in real-time, enabling anticipatory suppression of dynamic events before they occur.

Details

Motivation: Event cameras capture motion efficiently but generate massive data from both independent moving objects (IMOs) and ego-motion, creating noise for downstream applications. Current methods struggle with real-time filtering of dynamic events.

Method: Joint learning framework that segments IMOs in current event streams while predicting their future motion. Lightweight architecture enables anticipatory suppression of dynamic events in real-time.

Result: Achieves 173 Hz inference on consumer GPUs with <1GB memory, outperforms SOTA on EVIMO benchmark by 67% in segmentation accuracy at 53% higher inference rate. Accelerates Vision Transformer inference by 83% via token pruning and reduces visual odometry ATE by 13%.

Conclusion: First real-time motion-aware event suppression framework that effectively filters IMO and ego-motion events, significantly benefiting downstream applications like efficient vision transformer inference and improved visual odometry.

Abstract: In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by IMOs and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
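
The suppression step itself can be pictured as masking: once a per-pixel mask of (current or predicted) independently moving objects is available, events that fall inside it are dropped. The sketch below assumes events are `(x, y, t, polarity)` tuples and the mask is a 2D list of booleans indexed as `mask[y][x]`. Both representations are illustrative choices, not the paper's data format, and the learned segmentation and motion prediction are not modeled.

```python
def suppress_events(events, mask):
    """Keep only events whose pixel is NOT flagged in the suppression mask.

    events: iterable of (x, y, t, polarity) tuples
    mask:   2D list of booleans, mask[y][x] is True where events are suppressed
    """
    return [e for e in events if not mask[e[1]][e[0]]]


def suppression_ratio(events, mask):
    """Fraction of events removed by the mask (0.0 if there are no events)."""
    events = list(events)
    if not events:
        return 0.0
    return 1.0 - len(suppress_events(events, mask)) / len(events)
```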

[196] EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura

Main category: cs.CV

TL;DR: EmbodMocap: A portable dual-iPhone pipeline for capturing scene-conditioned human motion data in everyday environments, enabling metric-scale human-scene reconstruction and empowering embodied AI tasks.

Details

Motivation: Existing human motion capture systems require costly studio setups and wearable devices, limiting large-scale collection of scene-conditioned human motion data in natural environments. There's a need for affordable, portable solutions that can capture both humans and scenes together.

Method: Uses two moving iPhones with RGB-D sensors, jointly calibrating dual sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The dual-view setup mitigates depth ambiguity compared to single-view approaches.

Result: Demonstrates superior alignment and reconstruction performance over single iPhone or monocular models when compared with optical capture ground truth. Enables three embodied AI tasks: monocular human-scene reconstruction, physics-based character animation, and robot motion control.

Conclusion: EmbodMocap provides an affordable, portable solution for capturing scene-conditioned human motion data in the wild, bridging human motion and scene geometry, and advancing embodied AI research through scalable data collection.

Abstract: Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over single iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.

[197] Through BrokenEyes: How Eye Disorders Impact Face Detection?

Prottay Kumar Adhikary

Main category: cs.CV

TL;DR: Computational framework simulates five common eye disorders to analyze their effects on neural-like feature representations in deep learning models, revealing critical disruptions in feature maps for cataract and glaucoma.

Details

Motivation: Vision disorders significantly impact millions of lives by altering visual information processing. The research aims to understand how these disorders affect neural-like feature representations in deep learning models to gain insights into the interplay between degraded visual inputs and learned representations.

Method: Developed computational framework using BrokenEyes system to simulate five common eye disorders: Age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy. Used combination of human and non-human datasets to train models under normal and disorder-specific conditions. Analyzed effects on feature representations using evaluation metrics like activation energy and cosine similarity.

Result: Models revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics quantified the severity of distortions, providing insights into how degraded visual inputs affect learned representations.

Conclusion: The computational framework successfully simulates eye disorders and analyzes their effects on deep learning feature representations, revealing disorder-specific disruptions that correspond to known neural processing challenges, with potential applications in understanding vision disorders and improving computer vision systems.

Abstract: Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy) and to analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.
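
The two metrics named above can be sketched on flattened feature maps as follows. Cosine similarity is standard. The abstract does not define "activation energy", so the mean squared activation used here is an assumption made purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened feature maps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def activation_energy(features):
    """ASSUMED definition: mean of squared activations of a feature map."""
    return sum(x * x for x in features) / len(features)
```

Comparing a layer's features for a clean image against the same layer's features for a disorder-simulated image would then give one similarity score and one energy ratio per layer.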

[198] Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks

Alaa El Ichi, Khalide Jbilou

Main category: cs.CV

TL;DR: A unified tensor-based framework (MTL) using Generalized Einstein MLPs that operates directly on tensors, showing that standard computer vision tasks are special cases within a larger task space expressible through tensor algebra.

Details

Motivation: Current computer vision task formulations are constrained by matrix-based thinking that requires flattening operations, restricting the space of naturally expressible tasks. The authors aim to lift this constraint through tensor-valued parameters.

Method: Proposes Multidimensional Task Learning (MTL) based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product, using tensor-valued parameters to control which dimensions are preserved or contracted without information loss.

Result: Demonstrates that classification, segmentation, and detection are special cases of MTL, differing only in dimensional configuration. Proves the task space is strictly larger than what matrix-based formulations can express, enabling principled task configurations like spatiotemporal or cross-modal predictions.

Conclusion: Provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through tensor algebra, offering a more expressive framework than traditional matrix-based approaches.

Abstract: This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
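
For readers unfamiliar with the Einstein product, the sketch below contracts the last two modes of a 4th-order weight tensor with a 2nd-order input, i.e. Y[i][j] = sum over k, l of A[i][j][k][l] * X[k][l]. It uses nested Python lists and illustrates only the operation, not a GE-MLP layer. The second function shows the matrix-based equivalent, which first has to flatten the input into a vector.

```python
def einstein_product(A, X):
    """Contract the last two modes of A (shape I x J x K x L) with X (K x L)."""
    K, L = len(X), len(X[0])
    return [
        [
            sum(A[i][j][k][l] * X[k][l] for k in range(K) for l in range(L))
            for j in range(len(A[0]))
        ]
        for i in range(len(A))
    ]


def flattened_equivalent(A, X):
    """Same numbers via matrix-vector product, after flattening X and A."""
    x_vec = [v for row in X for v in row]                      # K*L vector
    rows = [
        [v for block in A[i][j] for v in block]                # one row per (i, j)
        for i in range(len(A))
        for j in range(len(A[0]))
    ]
    return [sum(a * x for a, x in zip(row, x_vec)) for row in rows]
```

The flattened version returns a flat list of length I*J, so the two-dimensional structure of the output has to be restored by a separate reshape, whereas the Einstein product returns it directly.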

[199] UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception

Mohammad Mahdavian, Gordon Tan, Binbin Xu, Yuan Ren, Dongfeng Bai, Bingbing Liu

Main category: cs.CV

TL;DR: UniScale: A unified feed-forward network for metric-scale 3D reconstruction from multi-view images, designed for robotic applications with optional geometric priors integration.

Details

Motivation: Robotic navigation requires accurate environmental structure extraction from image sequences. Existing methods often lack metric scale awareness or require separate components for different reconstruction aspects, limiting practical deployment in resource-constrained robotic settings.

Method: Single feed-forward network that jointly estimates camera intrinsics/extrinsics, scale-invariant depth/point maps, and metric scene scale from multi-view images. Uses modular design with global contextual reasoning and camera-aware feature representations, optionally incorporating known camera parameters or geometric priors when available.

Result: Demonstrates strong generalization and consistent performance across diverse environments on multiple benchmarks. The method doesn’t require training from scratch and leverages pre-existing model priors, making it suitable for resource-constrained robotic teams.

Conclusion: UniScale provides a unified, metric-aware 3D reconstruction framework for robotics that flexibly integrates geometric priors through modular design, enabling robust environmental structure extraction without extensive retraining.

Abstract: We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric-scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.

[200] MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang

Main category: cs.CV

TL;DR: MovieTeller: A training-free framework for generating movie synopses using tool-augmented progressive abstraction with face recognition for character consistency and narrative coherence.

Details

Motivation: Existing Vision-Language Models fail at long-form video summarization due to inconsistent character identification and fractured narrative coherence, creating a need for better automated movie synopsis generation.

Method: Training-free tool-augmented framework using face recognition for factual grounding, injecting character identities into prompts, and progressive abstraction pipeline to handle long videos.

Result: Significant improvements in factual accuracy, character consistency, and narrative coherence compared to end-to-end baselines.

Conclusion: MovieTeller effectively addresses limitations of current VLMs for long-form video summarization through tool-augmented progressive abstraction without requiring model fine-tuning.

Abstract: With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external “tool” to establish Factual Groundings: precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM’s reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.

[201] Large Multimodal Models as General In-Context Classifiers

Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci

Main category: cs.CV

TL;DR: LMMs with in-context learning can match or surpass CLIP-like VLMs for classification tasks, especially in open-world scenarios, through iterative pseudo-label refinement (CIRCLE method).

Details

Motivation: The paper challenges the conventional wisdom that CLIP-like contrastive VLMs are best for classification tasks, arguing that LMMs' in-context learning capabilities are overlooked. It aims to demonstrate LMMs can serve as unified classifiers that rival specialized models.

Method: Benchmark state-of-the-art LMMs on diverse datasets for closed-world classification, compare with CLIP-like VLMs, extend to open-world setting, and propose CIRCLE - a training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using available context.

Result: LMMs with few in-context examples match or surpass contrastive VLMs with cache-based adapters. In open-world scenarios, LMMs struggle with imperfect context but CIRCLE establishes robust baselines, surpassing VLM counterparts and demonstrating LMMs’ potential as unified classifiers.

Conclusion: LMMs with in-context learning are competitive alternatives to specialized VLMs for classification, especially when enhanced with methods like CIRCLE for open-world scenarios, highlighting their potential as flexible, unified multimodal classifiers.

Abstract: Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP’s, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their “in-context” equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
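
The general shape of iterative pseudo-label refinement can be sketched as a leave-one-out loop: each example is re-labeled using the other examples and their current labels as context, until the labels stop changing. This is a toy under strong assumptions, not the CIRCLE algorithm: the "model" is a nearest-neighbour stand-in for an LMM, the inputs are scalars, and the leave-one-out update rule is a guess at one plausible reading of the abstract.

```python
def toy_model(query, context):
    """Stand-in for an LMM call: return the label of the nearest context item.

    context is a list of (value, label) pairs.
    """
    nearest = min(context, key=lambda item: abs(item[0] - query))
    return nearest[1]


def refine_pseudo_labels(points, labels, predict=toy_model, max_rounds=10):
    """Re-label every example from the others until the labels are stable."""
    labels = list(labels)
    for _ in range(max_rounds):
        new_labels = []
        for i, query in enumerate(points):
            # Context = all other examples with their current pseudo-labels.
            context = [
                (p, lab)
                for j, (p, lab) in enumerate(zip(points, labels))
                if j != i
            ]
            new_labels.append(predict(query, context))
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

In this toy setting, two clusters of points with a couple of wrong initial labels converge to consistent labels within a round or two. Replacing `toy_model` with a real model call is where the actual cost and behaviour would come from.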

[202] Skarimva: Skeleton-based Action Recognition is a Multi-view Application

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

Main category: cs.CV

TL;DR: Using multiple camera views to triangulate more accurate 3D skeletons significantly improves action recognition performance, suggesting input data quality is a key limiting factor.

Details

Motivation: While much research focuses on improving machine learning algorithms for skeleton-based action recognition, little attention has been given to input data quality. The paper argues that current skeleton data quality limits model performance.

Method: The paper demonstrates using multiple camera views to triangulate more accurate 3D skeletons, then evaluates how this improved input data affects state-of-the-art action recognition models.

Result: Performance of state-of-the-art action recognition models improves significantly with more accurate 3D skeletons from multi-view triangulation, showing input data quality is currently a limiting factor.

Conclusion: The cost-benefit ratio of using multiple cameras is favorable, and future skeleton-based action recognition research should consider multi-view setups as standard.

Abstract: Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use cases; therefore, future research in skeleton-based action recognition should consider multi-view applications as the standard setup.
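
The multi-view triangulation idea in this entry can be illustrated with a minimal sketch. It assumes each calibrated camera contributes a known center and a viewing ray toward the detected joint, and it recovers the 3D joint as the point with the smallest summed squared distance to all rays. The names `triangulate_rays` and `solve3` are hypothetical, and this is one common least-squares formulation, not necessarily the procedure the authors use.

```python
def solve3(A, b):
    """Solve a 3x3 linear system with Gaussian elimination and partial pivoting."""
    n = 3
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def triangulate_rays(centers, directions):
    """Return the 3D point closest (in least squares) to all camera rays.

    Each ray starts at a camera center c and points along direction d.
    The minimizer solves  sum(I - d d^T) x = sum((I - d d^T) c).
    At least two non-parallel rays are required.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        norm = sum(v * v for v in d) ** 0.5
        d = [v / norm for v in d]  # unit direction
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += m
                b[i] += m * c[j]
    return solve3(A, b)
```

With noise-free rays from three cameras the estimate coincides with the true joint; with noisy 2D detections, each additional view adds another constraint to the same linear system, which is the intuition behind the accuracy gain from multiple cameras.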

[203] Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao

Main category: cs.CV

TL;DR: GUIPruner is a training-free framework for GUI agents that reduces computational overhead by addressing temporal and spatial misalignments in screen compression, achieving 3.4x FLOPs reduction and 3.3x speedup while maintaining 94% performance.

Details

Motivation: Pure-vision GUI agents face efficiency bottlenecks due to massive spatiotemporal redundancy in high-resolution screenshots and historical trajectories. Existing compression methods suffer from temporal mismatch (uniform history encoding vs. fading memory attention) and spatial topology conflict (unstructured pruning compromising grid integrity for coordinate grounding).

Method: GUIPruner combines Temporal-Adaptive Resolution (TAR) which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP) which prioritizes interactive foregrounds and semantic anchors while preserving global layout integrity.

Result: Extensive evaluations show GUIPruner achieves SOTA performance and prevents collapse in large models under high compression. On Qwen2-VL-2B, it delivers a 3.4x FLOPs reduction and a 3.3x vision encoding speedup while retaining over 94% of the original performance, enabling real-time high-precision navigation with minimal resources.

Conclusion: GUIPruner effectively addresses efficiency bottlenecks in GUI agents through training-free compression that respects temporal attention patterns and spatial layout integrity, enabling practical real-time applications with minimal performance loss.

Abstract: Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent’s “fading memory” attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
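
To make the idea of decay-based resizing concrete, here is a minimal sketch in which older screenshots in the history are encoded at progressively lower resolution. The exponential schedule and the parameters `decay`, `min_scale`, and `patch` are assumptions chosen for illustration; the abstract does not specify the actual TAR schedule, and the function names are hypothetical.

```python
def decayed_resolution(width, height, age, decay=0.5, min_scale=0.125, patch=28):
    """Shrink a historical screenshot according to its age in steps.

    age=0 is the current screenshot (full resolution). Each older step
    multiplies the scale by `decay`, floored at `min_scale`. Sizes are
    rounded down to a multiple of `patch` so the patch grid stays regular.
    """
    scale = max(min_scale, decay ** age)
    new_w = max(patch, int(width * scale) // patch * patch)
    new_h = max(patch, int(height * scale) // patch * patch)
    return new_w, new_h


def token_count(width, height, patch=28):
    """Number of visual tokens if each patch x patch tile becomes one token."""
    return (width // patch) * (height // patch)


if __name__ == "__main__":
    for age in range(5):
        w, h = decayed_resolution(1120, 2240, age)
        print(age, (w, h), token_count(w, h))
```

Because the token count scales with the image area, halving the side length of a history frame cuts its visual tokens to a quarter, which is where most of the savings in such a scheme would come from.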

[204] Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu, Wei-Shi Zheng, Nicu Sebe

Main category: cs.CV

TL;DR: Risk-aware World Model Predictive Control (RaWMPC) is a novel end-to-end autonomous driving framework that uses world models and risk evaluation for decision-making without expert demonstrations, improving generalization to rare scenarios.

Details

Motivation: Current imitation learning methods for end-to-end autonomous driving suffer from limited generalization to rare or unseen long-tail scenarios, as they only learn to mimic expert behaviors. The paper aims to develop a system that can make reliable decisions without expert action supervision.

Method: RaWMPC uses a world model to predict consequences of candidate actions and selects low-risk actions through explicit risk evaluation. It employs risk-aware interaction strategy to expose the world model to hazardous behaviors, and self-evaluation distillation to transfer risk-avoidance capabilities to an action proposal network.

Result: Extensive experiments show RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability compared to imitation learning approaches.

Conclusion: The proposed RaWMPC framework demonstrates that end-to-end autonomous driving can achieve reliable decision-making without expert demonstrations by leveraging world models and explicit risk evaluation, addressing the generalization limitations of imitation learning.

Abstract: With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of “only driving like the expert” suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
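
The decision loop described here (roll out each candidate action through a predictive model, score the predicted outcome for risk, pick the lowest-risk action) can be sketched on a toy car-following problem. Everything below is an illustrative assumption: `rollout` is a hand-written kinematic stand-in for a learned world model, and `risk`, `select_action`, and the safety margin are hypothetical, not the paper's components.

```python
def rollout(gap, ego_speed, lead_speed, accel, horizon=10, dt=0.5):
    """Toy stand-in for a world model: predict the gap to a lead vehicle."""
    gaps = []
    for _ in range(horizon):
        ego_speed = max(0.0, ego_speed + accel * dt)  # no reversing
        gap += (lead_speed - ego_speed) * dt
        gaps.append(gap)
    return gaps


def risk(gaps, safe_gap=5.0):
    """Explicit risk score: infinite on collision, else margin violation."""
    min_gap = min(gaps)
    if min_gap <= 0.0:
        return float("inf")
    return max(0.0, safe_gap - min_gap)


def select_action(gap, ego_speed, lead_speed, candidates):
    """Pick the lowest-risk acceleration; break ties by preferring progress."""
    scored = []
    for accel in candidates:
        r = risk(rollout(gap, ego_speed, lead_speed, accel))
        scored.append((r, -accel, accel))
    return min(scored)[2]


if __name__ == "__main__":
    # Closing fast on a slow lead vehicle: only braking avoids a collision.
    print(select_action(10.0, 10.0, 5.0, [-3.0, 0.0, 2.0]))
    # Large gap, faster lead vehicle: every action is safe, so accelerate.
    print(select_action(50.0, 5.0, 10.0, [-3.0, 0.0, 2.0]))
```

No expert action appears anywhere in this loop; the choice is driven entirely by predicted consequences, which is the property the abstract emphasizes.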

[205] Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling

Jasmine Bayrooti, Weiwei Kong, Natalia Ponomareva, Carlos Esteves, Ameesh Makadia, Amanda Prorok

Main category: cs.CV

TL;DR: A spectral differential privacy framework for image generation that protects low-frequency components (privacy-sensitive features) while preserving high-frequency textures by using wavelet decomposition and public super-resolution models.

Details

Motivation: Standard DP finetuning (like DP-SGD) severely degrades image quality, especially high-frequency textures, due to indiscriminate noise addition. There's a need for privacy-preserving image generation that maintains visual quality while protecting sensitive information.

Method: Two-stage framework: 1) DP finetune an autoregressive spectral image tokenizer on low-resolution wavelet coefficients (low-frequency components) of sensitive images, 2) Use a publicly pretrained super-resolution model for high-resolution upsampling. This leverages the hypothesis that privacy-sensitive information resides in low-frequency components while high-frequency details are generic.

Result: Experiments on MS-COCO and MM-CelebA-HQ datasets show improved image quality and style capture compared to other leading DP image frameworks, achieving better privacy-utility trade-offs.

Conclusion: The spectral DP framework effectively protects privacy by focusing DP budget on low-frequency components while preserving high-frequency details through public super-resolution, offering superior privacy-utility balance for image generation.

Abstract: Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees essential. While differential privacy (DP) provides a principled framework for such guarantees, standard DP finetuning (e.g., with DP-SGD) often results in severe degradation of image quality, particularly in high-frequency textures, due to the indiscriminate addition of noise across all model parameters. In this work, we propose a spectral DP framework based on the hypothesis that the most privacy-sensitive portions of an image are often low-frequency components in the wavelet space (e.g., facial features and object shapes) while high-frequency components are largely generic and public. Based on this hypothesis, we propose the following two-stage framework for DP image generation with coarse image intermediaries: (1) DP finetune an autoregressive spectral image tokenizer model on the low-resolution wavelet coefficients of the sensitive images, and (2) perform high-resolution upsampling using a publicly pretrained super-resolution model. By restricting the privacy budget to the global structures of the image in the first stage, and leveraging the post-processing property of DP for detail refinement, we achieve promising trade-offs between privacy and utility. Experiments on the MS-COCO and MM-CelebA-HQ datasets show that our method generates images with improved quality and style capture relative to other leading DP image frameworks.
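
The split between coarse structure and fine detail that this framework relies on can be shown with a one-level 2D Haar wavelet transform. The sketch below assumes an even-sized grayscale image stored as nested lists. The names `haar_level` and `inverse_haar_level` are hypothetical, and neither the differentially private finetuning nor the super-resolution stage is modeled here.

```python
def haar_level(image):
    """One level of an orthonormal 2D Haar transform.

    Returns the low-frequency band `ll` (half resolution) and a tuple of
    three detail bands (column differences, row differences, diagonal).
    """
    ll, dx, dy, dxy = [], [], [], []
    for r in range(0, len(image), 2):
        ll_row, dx_row, dy_row, dxy_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]          # top-left, top-right
            e, d = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            ll_row.append((a + b + e + d) / 2)
            dx_row.append((a - b + e - d) / 2)
            dy_row.append((a + b - e - d) / 2)
            dxy_row.append((a - b - e + d) / 2)
        ll.append(ll_row)
        dx.append(dx_row)
        dy.append(dy_row)
        dxy.append(dxy_row)
    return ll, (dx, dy, dxy)


def inverse_haar_level(ll, details):
    """Rebuild the full-resolution image from the four Haar bands."""
    dx, dy, dxy = details
    out = []
    for r in range(len(ll)):
        top, bottom = [], []
        for c in range(len(ll[0])):
            s, x, y, z = ll[r][c], dx[r][c], dy[r][c], dxy[r][c]
            top += [(s + x + y + z) / 2, (s - x + y - z) / 2]
            bottom += [(s + x - y - z) / 2, (s - x - y + z) / 2]
        out += [top, bottom]
    return out
```

In a pipeline of the kind described above, only the `ll` band would be exposed to the privately trained model, while the detail bands would be replaced by the output of a public upsampler rather than taken from the sensitive image.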

[206] LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction

Zhengyang Wei, Renzhi Jing, Yiyi He, Jenny Suckale

Main category: cs.CV

TL;DR: LineGraph2Road: A framework for road extraction from satellite imagery using graph transformers on line graphs for better connectedness prediction, achieving SOTA results on topological metrics.

Details

Motivation: Existing road extraction methods struggle with long-range dependencies and complex topologies when decomposing the task into keypoint extraction and connectedness prediction. There's a need for better structural understanding of road networks.

Method: Formulates connectedness prediction as binary classification over edges in a sparse Euclidean graph. Transforms the original graph into its line graph and applies Graph Transformer for connectedness prediction. Includes overpass/underpass head for multi-level crossings and coupled NMS strategy.

Result: Achieves state-of-the-art results on City-scale, SpaceNet, and Global-scale benchmarks on TOPO-F1 and APLS metrics. Captures fine visual details critical for real-world deployment.

Conclusion: LineGraph2Road effectively addresses limitations in road extraction by leveraging graph transformers on line graphs for better structural reasoning, demonstrating superior performance on topological metrics across multiple benchmarks.

Abstract: The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
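
The line-graph transformation mentioned here turns every candidate road segment (an edge between two keypoints) into a node, and links two such nodes whenever the original segments share a keypoint. A minimal sketch, with the hypothetical helper name `line_graph` and edges given as index pairs:

```python
from collections import defaultdict
from itertools import combinations


def line_graph(edges):
    """Build the line graph of an undirected graph.

    `edges` is a list of (u, v) keypoint pairs. Node i of the line graph
    corresponds to edges[i]; two nodes are adjacent when the underlying
    edges share an endpoint. Returns a sorted list of (i, j) pairs, i < j.
    """
    incident = defaultdict(list)  # keypoint -> indices of edges touching it
    for idx, (u, v) in enumerate(edges):
        incident[u].append(idx)
        incident[v].append(idx)

    lg_edges = set()
    for idxs in incident.values():
        for i, j in combinations(idxs, 2):
            lg_edges.add((min(i, j), max(i, j)))
    return sorted(lg_edges)


if __name__ == "__main__":
    # A small junction: keypoint 1 connects to 0, 2 and 3; 3 continues to 4.
    candidate_segments = [(0, 1), (1, 2), (1, 3), (3, 4)]
    print(line_graph(candidate_segments))
```

After this transformation, deciding whether a road segment exists becomes a node classification problem on the line graph, so a transformer operating on its nodes can attend directly over segments instead of fusing two endpoint embeddings.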

[207] PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning

Fuqiang Chen, Ranran Zhang, Wanming Hu, Deboch Eyob Abera, Yue Peng, Boyun Zheng, Yiwen Sun, Jing Cai, Wenjian Qin

Main category: cs.CV

TL;DR: A prompt-guided framework for virtual multiplex IHC staining using only uniplex training data that addresses semantic guidance, staining consistency, and spatial alignment challenges in transforming H&E images to multiple IHC representations.

Details

Motivation: Comprehensive immunohistochemical (IHC) analysis is limited by insufficient tissue in small biopsies, creating need for virtual multiplex staining to digitally transform H&E images into multiple IHC representations, but current methods face semantic guidance, staining consistency, and spatial alignment challenges.

Method: PGVMS framework with three innovations: 1) adaptive prompt guidance using pathological visual language model for semantic guidance, 2) protein-aware learning strategy (PALS) for maintaining protein expression patterns, and 3) prototype-consistent learning strategy (PCLS) for cross-image semantic interaction to correct spatial misalignments.

Result: The framework enables virtual multiplex IHC staining using only uniplex training data, overcoming key limitations in semantic guidance, staining distribution consistency, and spatial alignment across different stain modalities.

Conclusion: PGVMS provides an effective solution for virtual multiplex IHC staining that addresses critical challenges in semantic guidance, protein distribution consistency, and spatial alignment, enabling comprehensive molecular profiling from limited tissue samples.

Abstract: Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).

[208] Towards Long-Form Spatio-Temporal Video Grounding

Xin Gu, Bing Fan, Jiali Yao, Zhipeng Zhang, Yan Huang, Cheng Han, Heng Fan, Libo Zhang

Main category: cs.CV

TL;DR: Proposes ART-STVG, an autoregressive transformer for long-form spatio-temporal video grounding that processes videos sequentially with memory banks and cascaded decoders to handle long videos efficiently.

Details

Motivation: Existing spatio-temporal video grounding (STVG) methods focus on short videos (<1 minute), limiting real-world applications. Long videos contain longer temporal spans and more irrelevant information, requiring new approaches.

Method: ART-STVG processes videos as streaming input sequentially. Uses spatial and temporal memory banks with selection strategies to provide relevant context. Employs cascaded spatio-temporal design where spatial decoder connects to temporal decoder for fine-grained spatial cues to assist temporal localization.

Result: Significantly outperforms state-of-the-art methods on newly extended LF-STVG datasets while achieving competitive performance on conventional short-form STVG.

Conclusion: The proposed autoregressive approach with memory banks and cascaded design effectively handles long-form video grounding challenges, enabling practical applications with long videos.

Abstract: In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.
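
The abstract does not spell out its memory selection strategies, so the following is only a generic sketch of the underlying idea: keep the stored frame features most relevant to the current frame and pass only those to the decoder. Ranking by cosine similarity with a top-k cut is an assumption, and `cosine` and `select_memories` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def select_memories(bank, query, k=2):
    """Return the k stored features most similar to the current-frame query."""
    ranked = sorted(bank, key=lambda m: cosine(m, query), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    bank = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(select_memories(bank, [1.0, 0.0], k=2))
```

In a streaming setting the bank would grow as frames arrive, so a selection step of this kind keeps the per-frame decoding cost bounded regardless of video length.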

[209] ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty, Vishnu Suresh Lokhande

Main category: cs.CV

TL;DR: ManifoldGD: Training-free diffusion-based dataset distillation using manifold-consistent guidance with hierarchical IPC clustering for improved representativeness and diversity.

Details

Motivation: Large datasets contain redundant concepts and hinder efficient training. Existing diffusion-based dataset distillation methods use simple guidance strategies (unguided denoising or basic IPC centroids) that are suboptimal for capturing dataset structure.

Method: Proposes Manifold-Guided Distillation (ManifoldGD) with hierarchical divisive clustering of VAE latent features to create multi-scale IPC coresets. At each diffusion denoising step, projects mode-alignment vectors onto local tangent space of estimated latent manifold to maintain manifold faithfulness while preserving semantic consistency.

Result: Consistent improvements over training-free and training-based baselines in FID, l2 distance between real/synthetic dataset embeddings, and classification accuracy. Achieves better representativeness, diversity, and image fidelity without model retraining.

Conclusion: ManifoldGD establishes the first geometry-aware training-free data distillation framework that effectively captures both coarse semantic modes and fine intra-class variability through manifold-consistent guidance.

Abstract: Large datasets hinder efficient model training and often contain redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which are often rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold-consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, l2 distance between real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
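
Projecting a guidance vector onto a local tangent space can be sketched as follows. The sketch assumes the tangent space at an anchor point is approximated by the span of the difference vectors to a few nearby centroids, orthonormalized with Gram-Schmidt. That is a simplification chosen for illustration (a PCA of the neighborhood is another common choice), and the names `gram_schmidt` and `project_to_tangent` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(vectors, tol=1e-12):
    """Orthonormalize vectors, dropping any that are (numerically) dependent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coef = dot(w, b)
            w = [wi - coef * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_to_tangent(guidance, anchor, neighbors):
    """Project `guidance` onto the span of (neighbor - anchor) directions."""
    diffs = [[n - a for n, a in zip(nb, anchor)] for nb in neighbors]
    basis = gram_schmidt(diffs)
    projected = [0.0] * len(guidance)
    for b in basis:
        coef = dot(guidance, b)
        projected = [p + coef * bi for p, bi in zip(projected, b)]
    return projected


if __name__ == "__main__":
    # Neighbors lie in the z = 0 plane, so the off-plane component is removed.
    anchor = [0.0, 0.0, 0.0]
    neighbors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(project_to_tangent([1.0, 2.0, 3.0], anchor, neighbors))
```

The component of the guidance vector that points off the estimated manifold is discarded, so a denoising update that uses the projected vector moves along the local data geometry rather than away from it.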

[210] PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

Yiqing Wang, Chunming He, Ming-Chen Lu, Mercy Pawar, Leslie Niziol, Maria Woodward, Sina Farsiu

Main category: cs.CV

TL;DR: PRIMA is a medical multi-modal framework that integrates visual features with clinical metadata using domain knowledge from risk-disease correlations, achieving superior disease classification performance.

Details

Motivation: Existing medical diagnosis methods treat metadata as isolated tags rather than leveraging rich semantic knowledge from clinical descriptions, creating a gap between visual manifestations and clinical expertise.

Method: Uses RAG to curate risk-disease correlations to refine Clinical ModernBERT, employs dual-encoder pre-training with DINOv3 and refined BERT, optimizes with four complementary loss functions for multi-granular semantic alignment, and uses Qwen-3 for feature fusion.

Result: Extensive experiments show PRIMA significantly outperforms state-of-the-art methods, effectively harmonizing pixel-level features with clinical expertise, achieving superior robustness without massive data or computational resources.

Conclusion: PRIMA successfully integrates domain-specific knowledge into multi-modal representation learning for medical diagnosis, bridging the modality gap between visual features and clinical metadata through semantic alignment.

Abstract: Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy utilizing DINOv3 and our refined BERT, optimized by a suite of four complementary loss functions. These losses are designed to capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse these aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming other state-of-the-art methods. Notably, our framework achieves superior robustness without the need for massive data collection or exhaustive computational resources. Our code will be made public upon acceptance.

[211] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu

Main category: cs.CV

TL;DR: SeeThrough3D is a 3D layout-conditioned generation model that explicitly models occlusions using translucent 3D boxes and occlusion-aware scene representation to generate scenes with realistic inter-object occlusions and camera control.

Details

Motivation: Existing 3D layout-conditioned generation methods fail to model precise inter-object occlusions, which is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. Occlusion reasoning is identified as a fundamental yet overlooked aspect.

Method: Proposes SeeThrough3D with Occlusion-aware 3D Scene Representation (OSCR) where objects are depicted as translucent 3D boxes in a virtual environment. Uses rendered viewpoints for camera control, conditions a pretrained flow-based text-to-image model with visual tokens from 3D representation, and applies masked self-attention to bind object bounding boxes to textual descriptions. Trained on synthetic dataset with diverse multi-object scenes and strong occlusions.

Result: SeeThrough3D effectively generalizes to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control, outperforming existing methods in modeling inter-object occlusions.

Conclusion: Explicit occlusion modeling is crucial for 3D layout-conditioned generation, and SeeThrough3D demonstrates effective occlusion reasoning through its novel 3D scene representation and conditioning approach, enabling realistic scene synthesis with accurate camera control.

Abstract: We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.

Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai

Main category: cs.CV

TL;DR: ThinkOmni is a training-free framework that enhances omni-modal reasoning by lifting textual reasoning capabilities to multi-modal scenarios using off-the-shelf large reasoning models as guides.

Details

Motivation: Existing omni-modal LLMs have strong perception across modalities but lack complex reasoning abilities of specialized reasoning models. Enhancing OLLMs with reasoning through additional training is challenging due to data quality issues, task adaptation needs, and high computational costs.

Method: Proposes ThinkOmni with two key components: 1) LRM-as-a-Guide uses off-the-shelf large reasoning models to guide OLLM decoding, and 2) Stepwise Contrastive Scaling adaptively balances perception and reasoning signals without manual hyperparameter tuning.

Result: Experiments on six multi-modal reasoning benchmarks show consistent performance improvements, achieving 70.2 on MathVista and 75.5 on MMAU.

Conclusion: ThinkOmni provides a flexible, training-free solution for omni-modal reasoning that offers new insights into generalizing reasoning capabilities across modalities.

Abstract: Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLMs) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRMs). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
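
As a rough illustration of what it means for one model to guide another at decoding time, the sketch below mixes the next-token logits of two models with a fixed weight before sampling. This is plain fixed-weight logit interpolation, stated here as an assumption. It is not the paper's LRM-as-a-Guide formulation, and it does not reproduce Stepwise Contrastive Scaling, which the abstract says sets the balance adaptively. The names `softmax` and `guided_distribution` and the weight `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_distribution(ollm_logits, lrm_logits, alpha):
    """Next-token distribution from a weighted mix of two models' logits.

    alpha = 0 uses only the perception model (OLLM) logits;
    alpha = 1 uses only the reasoning model (LRM) logits.
    Both models are assumed to share one vocabulary.
    """
    mixed = [(1 - alpha) * o + alpha * l for o, l in zip(ollm_logits, lrm_logits)]
    return softmax(mixed)


if __name__ == "__main__":
    ollm = [2.0, 1.0, 0.0]  # toy logits over a 3-token vocabulary
    lrm = [0.0, 1.0, 3.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, guided_distribution(ollm, lrm, alpha))
```

A fixed `alpha` has to be tuned per task, which is the kind of manual hyperparameter choice the abstract says the stepwise scaling component is meant to avoid.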

[213] Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias

Main category: cs.CV

TL;DR: Retrieval-augmented test-time adapter for open-vocabulary segmentation that uses few-shot visual support to improve zero-shot performance by fusing textual and visual features.

Details

Motivation: Open-vocabulary segmentation lags behind supervised approaches due to coarse image-level supervision in VLMs and semantic ambiguity of natural language. Need to bridge gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.

Method: Introduces few-shot setting with pixel-annotated support images. Proposes retrieval-augmented test-time adapter that learns lightweight per-image classifier by fusing textual and visual support features through learned, per-query fusion rather than late hand-crafted fusion.

Result: Significantly narrows gap between zero-shot and supervised segmentation while preserving open-vocabulary ability. Supports continually expanding support sets and applies to fine-grained tasks like personalized segmentation.

Conclusion: The approach effectively addresses limitations of VLMs for segmentation by combining textual prompts with visual support, achieving stronger modality synergy and better performance than prior methods.

Abstract: Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.

[214] Sensor Generalization for Adaptive Sensing in Event-based Object Detection via Joint Distribution Training

Aheli Saha, René Schuster, Didier Stricker

Main category: cs.CV

TL;DR: Analyzes how the intrinsic parameters of event cameras affect object detection performance, and uses the findings to make downstream models more robust across sensors.

Details

Motivation: Event cameras offer unique advantages (asynchronous, low-latency, high dynamic range, reduced motion blur) but face challenges due to novel output signals, limited data variability, and insufficient analysis of signal parameters affecting model performance.

Method: The paper provides in-depth analysis of how intrinsic event camera parameters affect object detection model performance, and uses these findings to expand downstream model capabilities toward sensor-agnostic robustness.

Result: The analysis characterizes how intrinsic sensor parameters affect event-based object detection, and the findings are used to extend the downstream model toward sensor-agnostic robustness. The abstract reports no quantitative results.

Conclusion: Understanding intrinsic event camera parameters is crucial for optimizing object detection performance and achieving sensor-agnostic robustness in event-based vision systems.

Abstract: Bio-inspired event cameras have recently attracted significant research due to their asynchronous and low-latency capabilities. These features provide a high dynamic range and significantly reduce motion blur. However, because of the novelty in the nature of their output signals, there is a gap in the variability of available data and a lack of extensive analysis of the parameters characterizing their signals. This paper addresses these issues by providing readers with an in-depth understanding of how intrinsic parameters affect the performance of a model trained on event data, specifically for object detection. We also use our findings to expand the capabilities of the downstream model towards sensor-agnostic robustness.

[215] VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale

Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep

Main category: cs.CV

TL;DR: VGG-T³ is a scalable 3D reconstruction model that uses test-time training to distill varying-length scene geometry representations into a fixed-size MLP, achieving linear scaling with input views and 11.6× speed-up over attention-based baselines.

Details

Motivation: The paper addresses the computational bottleneck in offline feed-forward 3D reconstruction methods where memory and computational requirements grow quadratically with the number of input images, limiting scalability for large image collections.

Method: The approach distills the varying-length Key-Value (KV) space representation of scene geometry into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. This enables linear scaling with input views while retaining global scene aggregation capabilities.

Result: VGG-T³ reconstructs a 1k image collection in 54 seconds (11.6× speed-up over softmax attention baselines), outperforms other linear-time methods in point map reconstruction error, and demonstrates visual localization capabilities with unseen images.

Conclusion: The method successfully addresses the quadratic scaling bottleneck in 3D reconstruction while maintaining performance, enabling efficient reconstruction of large image collections and demonstrating potential for visual localization applications.

Abstract: We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods in point map reconstruction error by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
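
The idea of replacing a growing key-value store with a fixed-size set of weights fitted at test time can be illustrated with a toy sketch. This is not the authors' implementation: a single linear layer stands in for the MLP to keep the example short, and the function names `ttt_fit` and `query`, the learning rate, and the epoch count are all hypothetical. The point is only that the memory `W` has the same size no matter how many key-value pairs are absorbed, and each pass over the pairs costs time linear in their number.

```python
def ttt_fit(keys, values, lr=0.1, epochs=200):
    """Fit a fixed-size weight matrix W so that W @ k is close to v for each pair.

    Assumption: a linear layer replaces the MLP described in the abstract.
    W has shape (value_dim x key_dim) regardless of how many pairs are given.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    for _ in range(epochs):
        for k, v in zip(keys, values):
            # Forward pass: prediction for this key.
            pred = [sum(W[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
            err = [pred[i] - v[i] for i in range(d_v)]
            # Gradient step on 0.5 * ||W k - v||^2, whose gradient is err * k^T.
            for i in range(d_v):
                for j in range(d_k):
                    W[i][j] -= lr * err[i] * k[j]
    return W


def query(W, q):
    """Read from the fitted memory with a new query vector."""
    return [sum(w * x for w, x in zip(row, q)) for row in W]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[2.0], [3.0]]
    W = ttt_fit(keys, values)
    print(query(W, [1.0, 0.0]))  # close to the value stored for the first key
    print(query(W, [1.0, 1.0]))  # an unseen query, answered from the same fixed-size W
```

Softmax attention over all stored keys would instead compare every query with every key, which is where the quadratic cost mentioned in the abstract comes from.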

[216] MediX-R1: Open Ended Medical Reinforcement Learning

Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal

Main category: cs.CV

TL;DR: MediX-R1 is an RL framework for medical MLLMs that enables free-form clinical answers using composite rewards and LLM-based evaluation, achieving strong performance on medical benchmarks.

Details

Motivation: Current medical MLLMs are limited to multiple-choice formats and lack open-ended reasoning capabilities needed for real clinical applications. Traditional rewards fail for free-form outputs, requiring better evaluation and training methods.

Method: Fine-tunes vision-language backbone with Group Based RL using composite rewards: LLM-based accuracy reward (semantic correctness), medical embedding semantic reward (paraphrase capture), format reward (interpretable reasoning), and modality reward (modality recognition). Uses Reference-based LLM-as-judge evaluation.

Result: Achieves excellent results across medical LLM (text-only) and VLM (image+text) benchmarks, outperforming open-source baselines with particularly large gains on open-ended clinical tasks, using only ~51K instruction examples.

Conclusion: Open-ended RL with comprehensive reward signals and LLM-based evaluation is practical for reliable medical reasoning in multimodal models, enabling clinically grounded free-form answers.

Abstract: We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com
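
A minimal sketch of how a composite reward of this kind could feed a group-based update, under stated assumptions. The abstract names four signals (LLM-judged accuracy, embedding-based semantic similarity, format, modality) but does not say how they are combined, so the weighted sum and the weights in `WEIGHTS` are hypothetical. The group-relative normalization is likewise an assumption about what "Group Based RL" involves. The judge and embedding models are out of scope here; each signal is taken as a precomputed score in [0, 1].

```python
from statistics import mean, pstdev

# Hypothetical weights: the abstract does not specify how the signals are combined.
WEIGHTS = {"accuracy": 1.0, "semantic": 0.5, "format": 0.25, "modality": 0.25}


def composite_reward(signals, weights=WEIGHTS):
    """Weighted sum of per-response reward signals, each assumed to lie in [0, 1]."""
    return sum(weights[name] * signals[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of responses to the same prompt.

    Assumption: advantages are computed relative to the group mean and
    standard deviation, as in common group-based policy-gradient variants.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three sampled answers to one prompt, scored by the four signals.
    group = [
        {"accuracy": 1.0, "semantic": 0.9, "format": 1.0, "modality": 1.0},
        {"accuracy": 0.0, "semantic": 0.6, "format": 1.0, "modality": 1.0},
        {"accuracy": 0.0, "semantic": 0.2, "format": 0.0, "modality": 1.0},
    ]
    rewards = [composite_reward(s) for s in group]
    print(group_relative_advantages(rewards))
```

In this sketch the binary accuracy signal dominates, while the semantic score still separates two answers that the judge both marked as incorrect, which is one way such a design could give usable feedback on free-form outputs.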

[217] StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning

Giuseppe Vecchio

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to the paper content

Method: Cannot determine method without access to the paper content

Result: Cannot determine results without access to the paper content

Conclusion: Cannot determine conclusion without access to the paper content

Abstract: Failed to fetch summary for 2406.09293: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.09293&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[218] Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture

Chenqi Kong, Anwei Luo, Peijun Bao, Haoliang Li, Renjie Wan, Zengwei Zheng, Anderson Rocha, Alex C. Kot

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2408.12791: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.12791&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[219] Abstracted Gaussian Prototypes for True One-Shot Concept Learning

Chelsea Zou, Kenneth J. Kurtz

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2408.17251: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.17251&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[220] SplatSDF: Boosting SDF-NeRF via Architecture-Level Fusion with Gaussian Splats

Runfa Blark Li, Keito Suzuki, Bang Du, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2411.15468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.15468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[221] Motion-Aware Animatable Gaussian Avatars Deblurring

Muyao Niu, Yifan Zhan, Qingtian Zhu, Zhuoxiao Li, Wei Wang, Zhihang Zhong, Xiao Sun, Yinqiang Zheng

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2411.16758: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.16758&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[222] Distractor-free Generalizable 3D Gaussian Splatting

Yanqi Bao, Jing Liao, Jing Huo, Yang Gao

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2411.17605: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.17605&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[223] NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2602.21172: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21172&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[224] From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

Zizhao Li, Zhengkang Xiang, Joseph West, Kourosh Khoshelham

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2411.18207: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.18207&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[225] PPT: Pretraining with Pseudo-Labeled Trajectories for Motion Forecasting

Yihong Xu, Yuan Yin, Éloi Zablocki, Tuan-Hung Vu, Alexandre Boulch, Matthieu Cord

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2412.06491: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.06491&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[226] IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks

Yaming Zhang, Chenqiang Gao, Fangcen Liu, Junjie Guo, Lan Wang, Xinggan Peng, Deyu Meng

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2412.16654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.16654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[227] MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

Seojeong Park, Jiho Choi, Kyungjune Baek, Hyunjung Shim

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2412.20816: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.20816&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[228] Joint Optimization for 4D Human-Scene Reconstruction in the Wild

Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2501.02158: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.02158&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[229] Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation

Xiaomeng Yang, Mengping Yang, Jia Gong, Luozheng Qin, Zhiyu Tan, Hao Li

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2502.02088: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.02088&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[230] Diffusion or Non-Diffusion Adversarial Defenses: Rethinking the Relation between Classifier and Adversarial Purifier

Yuan-Chih Chen, Chun-Shien Lu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2501.16904: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.16904&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[231] RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers

Ke Cao, Jing Wang, Ao Ma, Jiasong Feng, Xuanhua He, Run Ling, Haowei Liu, Jian Lu, Wei Feng, Haozhe Wang, Hongjuan Pei, Yihua Shao, Zhanjie Zhang, Jie Zhang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to data fetch failure

Method: Unable to determine method due to data fetch failure

Result: Unable to determine results due to data fetch failure

Conclusion: Unable to draw conclusions due to data fetch failure

Abstract: Failed to fetch summary for 2502.14377: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.14377&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[232] Autoregressive Image Generation with Randomized Parallel Decoding

Haopeng Li, Jinyue Yang, Guoqi Li, Huan Wang

Main category: cs.CV

TL;DR: Unable to analyze paper 2503.10568 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract content is unavailable

Method: Cannot determine method as abstract content is unavailable

Result: Cannot determine results as abstract content is unavailable

Conclusion: Cannot draw conclusions about paper content due to data unavailability

Abstract: Failed to fetch summary for 2503.10568: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.10568&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[233] ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

Guoyizhe Wei, Rama Chellappa

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2504.00037: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.00037&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[234] CLIP-Free, Label Free, Unsupervised Concept Bottleneck Models

Fawaz Sammani, Jonas Fischer, Nikos Deligiannis

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2503.10981: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.10981&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[235] UniFuture: A 4D Driving World Model for Future Generation and Perception

Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai

Main category: cs.CV

TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API

Details

Motivation: Unable to determine motivation due to API access issue

Method: Unable to determine method due to API access issue

Result: Unable to determine results due to API access issue

Conclusion: Unable to analyze paper due to technical limitations in accessing content

Abstract: Failed to fetch summary for 2503.13587: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.13587&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[236] Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving

Lucas Nunes, Rodrigo Marcuzzi, Jens Behley, Cyrill Stachniss

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze content

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2503.21449: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.21449&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[237] GmNet: Revisiting Gating Mechanisms From A Frequency View

Yifan Wang, Xu Ma, Yitian Zhang, Zhongruo Wang, Sung-Cheol Kim, Vahid Mirjalili, Vidya Renganathan, Yun Fu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2503.22841: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.22841&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[238] Sparse Imagination for Efficient Visual World Model Planning

Junha Chun, Youngjoon Jeong, Taesup Kim

Main category: cs.CV

TL;DR: Unable to analyze paper 2506.01392 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable

Method: Cannot determine method as abstract is unavailable

Result: Cannot determine results as abstract is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2506.01392: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01392&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[239] LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

Jiangling Zhang, Weijie Zhu, Jirui Huang, Yaxiong Chen

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about the paper due to access limitations

Abstract: Failed to fetch summary for 2505.07734: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.07734&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[240] Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

Sanggyun Ma, Wonjoon Choi, Jihun Park, Jaeyeul Kim, Seunghun Lee, Jiwan Seo, Sunghoon Im

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2505.23400: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.23400&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[241] Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?

Ayush Roy, Samin Enam, Jun Xia, Won Hwa Kim, Vishnu Suresh Lokhande

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2507.19575: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.19575&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[242] Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek

Main category: cs.CV

TL;DR: Unable to analyze paper 2508.20570 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2508.20570: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.20570&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[243] Unified Multimodal Models as Auto-Encoders

Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Haochen Wang, Zhendong Wang, Bin Lin, Hao Li, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2509.09666: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.09666&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[244] Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching

Zimin Xia, Chenghao Xu, Alexandre Alahi

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2509.09792: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.09792&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[245] ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting

Xiaoyang Yan, Muleilan Pei, Shaojie Shen

Main category: cs.CV

TL;DR: Paper 2509.16552: Could not fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing abstract

Method: Unable to determine method due to missing abstract

Result: Unable to determine results due to missing abstract

Conclusion: Unable to determine conclusion due to missing abstract

Abstract: Failed to fetch summary for 2509.16552: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.16552&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[246] Visual Instruction Pretraining for Domain-Specific Foundation Models

Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2509.17562: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.17562&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[247] PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data

Zhe Zhu, Le Wan, Rui Xu, Yiheng Zhang, Honghua Chen, Zhiyang Dou, Cheng Lin, Yuan Liu, Mingqiang Wei

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2509.21965: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21965&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[248] Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting

Yuanyuan Gao, Yuning Gong, Yifei Liu, Li Jingfeng, Dingwen Zhang, Yanci Zhang, Dan Xu, Xiao Sun, Zhihang Zhong

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2509.24421: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24421&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[249] Multi-View Camera System for Variant-Aware Autonomous Vehicle Inspection and Defect Detection

Yash Kulkarni, Raman Jha, Renu Kachhoria

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2509.26454: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26454&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[250] Secure and reversible face anonymization with diffusion models

Pol Labarbarie, Vincent Itier, William Puech

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2510.01031: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01031&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[251] Q$^2$: Quantization-Aware Gradient Balancing and Attention Alignment for Low-Bit Quantization

Zhaoyang Wang, Dong Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in accessing paper content

Method: Unable to determine method due to technical error in accessing paper content

Result: Unable to determine results due to technical error in accessing paper content

Conclusion: Unable to draw conclusions due to technical error in accessing paper content

Abstract: Failed to fetch summary for 2511.05898: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05898&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[252] A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering

Yuanhao Zou, Shengji Jin, Andong Deng, Youpeng Zhao, Jun Wang, Chen Chen

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.04428: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04428&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[253] Diffusion Model in Latent Space for Medical Image Segmentation Task

Huynh Trinh Ngoc, Toan Nguyen Hai, Ba Luong Son, Long Tran Quoc

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval failure

Method: Unable to determine method due to retrieval failure

Result: Unable to determine results due to retrieval failure

Conclusion: Unable to determine conclusion due to retrieval failure

Abstract: Failed to fetch summary for 2512.01292: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01292&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[254] Towards Privacy-Guaranteed Label Unlearning in Vertical Federated Learning: Few-Shot Forgetting without Disclosure

Hanlin Gu, Hong Xi Tae, Chee Seng Chan, Lixin Fan

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable due to server rate limiting

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about the paper due to access limitations

Abstract: Failed to fetch summary for 2410.10922: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.10922&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[255] Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

Zijing Hu, Yunze Tong, Fengda Zhang, Junkun Yuan, Jun Xiao, Kun Kuang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2510.04504: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04504&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[256] Object-Centric Representation Learning for Enhanced 3D Semantic Scene Graph Prediction

KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2510.04714: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04714&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[257] Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Dongjing Shan, Bo Du, Dacheng Tao

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2411.11727: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.11727&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[258] Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li, Sizhe Dang, Chengzu Li, Harry Yang, Guang Dai, Mengmeng Wang, Jingdong Wang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2510.06139: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06139&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[259] G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

Junfeng Ni, Yixin Chen, Zhifei Yang, Yu Liu, Ruijie Lu, Song-Chun Zhu, Siyuan Huang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2510.12099: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12099&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[260] TerraCodec: Compressing Optical Earth Observations

Julen Costa-Watanabe, Isabelle Wittmann, Benedikt Blumenstiel, Konrad Schindler

Main category: cs.CV

TL;DR: Paper ID 2510.12670 could not be fetched due to HTTP 429 error (rate limiting), so no abstract or content is available for analysis.

Details

Motivation: Unable to determine motivation as the paper content could not be retrieved from arXiv due to rate limiting restrictions.

Method: No method information available since the paper summary could not be fetched.

Result: No results available due to inability to access the paper content.

Conclusion: Cannot provide analysis or conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2510.12670: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12670&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[261] Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning

Dakota Hester, Vitor S. Martins, Lucas B. Ferreira, Thainara M. A. Lima

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The arXiv ID 2511.03004 indicates a November 2025 submission.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2511.03004: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.03004&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[262] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna

Main category: cs.CV

TL;DR: Unable to analyze paper 2601.10611 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable

Method: Cannot determine method as abstract is unavailable

Result: Cannot determine results as abstract is unavailable

Conclusion: Cannot draw conclusions without access to the paper abstract

Abstract: Failed to fetch summary for 2601.10611: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10611&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[263] USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation

Penghui Niu, Taotao Cai, Suqi Zhang, Junhua Gua, Ping Zhanga, Qiqi Liu, Jianxin Li

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval failure

Method: Unable to determine method due to retrieval failure

Result: Unable to determine results due to retrieval failure

Conclusion: Unable to determine conclusion due to retrieval failure

Abstract: Failed to fetch summary for 2511.09045: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09045&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[264] SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation

Seamie Hayes, Reenu Mohandas, Tim Brophy, Alexandre Boulch, Ganesh Sistu, Ciaran Eising

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2511.17361: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17361&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[265] Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

Dosung Lee, Sangwon Jung, Boyoung Kim, Minyoung Kim, Sungyeon Kim, Junyoung Sung, Paul Hongsuck Seo

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2511.22843: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22843&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[266] ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

Yuxing Liu, Zheng Li, Huanhuan Liang, Ji Zhang, Zeyu Sun, Yong Liu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to draw conclusions due to fetch failure

Abstract: Failed to fetch summary for 2512.02686: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02686&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[267] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The arXiv ID 2512.02700 indicates a December 2025 submission.

Details

Motivation: Cannot determine motivation without access to the paper content. The arXiv ID format indicates a December 2025 submission.

Method: Cannot determine method without access to the paper content. The arXiv API returned a rate limiting error (HTTP 429).

Result: Cannot determine results without access to the paper content. The arXiv API request was blocked due to rate limiting.

Conclusion: Cannot draw conclusions about the paper’s content. The arXiv API rate limiting prevents access to the abstract and paper details.

Abstract: Failed to fetch summary for 2512.02700: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02700&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[268] VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations

Fatemeh Zargarbashi, Dhruv Agrawal, Jakob Buhmann, Martin Guay, Stelian Coros, Robert W. Sumner

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2602.02334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[269] Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li, Dan Guo, Linfeng Zhang, Xun Yang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2512.15340: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15340&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[270] Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

Minghao Han, Yichen Liu, Yizhou Liu, Zizhi Chen, Jingqun Tang, Xuecheng Wu, Dingkang Yang, Lihua Zhang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).

Details

Motivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot draw conclusions without access to paper content.

Abstract: Failed to fetch summary for 2512.21058: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21058&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[271] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2601.03467: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03467&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[272] PCReg-Net: Progressive Contrast-Guided Registration for Cross-Domain Image Alignment

Jiahao Qin

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to analyze paper content due to technical retrieval error

Abstract: Failed to fetch summary for 2602.13304: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13304&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[273] MERGETUNE: Continued Fine-Tuning of Vision-Language Models

Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2601.10497: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10497&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[274] OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2602.08683: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.08683&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[275] HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection

Han Zhou, Yuxuan Gao, Yinchao Du, Xuezhe Zheng

Main category: cs.CV

TL;DR: Paper ID 2602.09524 could not be fetched due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation as abstract could not be retrieved

Method: Unable to determine method as abstract could not be retrieved

Result: Unable to determine results as abstract could not be retrieved

Conclusion: Unable to draw conclusions as abstract could not be retrieved

Abstract: Failed to fetch summary for 2602.09524: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09524&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[276] FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval error

Method: Unable to determine method due to retrieval error

Result: Unable to determine results due to retrieval error

Conclusion: Unable to determine conclusion due to retrieval error

Abstract: Failed to fetch summary for 2602.19190: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19190&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[277] GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team, Boyuan Wang, Bohan Li, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, Mingming Yu, Peng Li, Qiuping Deng, Tianze Liu, Xinyu Zhou, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yifei Nie, Yilong Li, Yukun Zhou, Yun Ye, Zhichao Liu, Zheng Zhu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.12099 could not be retrieved from arXiv API.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2602.12099: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12099&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[278] Benchmarking Video Foundation Models for Remote Parkinson’s Disease Screening

Md Saiful Islam, Ekram Hossain, Abdelrahman Abdelkader, Tariq Adnan, Fazla Rabbi Mashrur, Sooyong Park, Praveen Kumar, Qasim Sudais, Natalia Chunga, Nami Shah, Jan Freyberg, Christopher Kanan, Ruth Schneider, Ehsan Hoque

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper with ID 2602.13507 could not be retrieved for analysis.

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting.

Method: Cannot determine method as paper content is unavailable due to API rate limiting.

Result: Cannot determine results as paper content is unavailable due to API rate limiting.

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting.

Abstract: Failed to fetch summary for 2602.13507: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13507&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[279] Depth from Defocus via Direct Optimization

Holly Jackson, Caleb Adams, Ignacio Lopez-Francos, Benjamin Recht

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2602.18509: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18509&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[280] DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces

Li Zhang, Mingyu Mei, Ailing Wang, Xianhui Meng, Yan Zhong, Xinyuan Song, Liu Liu, Rujing Wang, Zaixing He, Cewu Lu

Main category: cs.CV

TL;DR: DICArt: A discrete diffusion framework for articulated object pose estimation that formulates pose estimation as conditional discrete diffusion with hierarchical kinematic coupling.

Details

Motivation: Existing pose estimation methods struggle with large, complex search spaces and fail to incorporate intrinsic kinematic constraints, leading to unreliable results in complex environments.

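Method: Formulates pose estimation as a conditional discrete diffusion process. A flexible flow decider chooses, per token, whether to denoise or reset, balancing the real and noise distributions; a hierarchical kinematic coupling strategy estimates the pose of each rigid part in turn to respect the object's kinematic structure.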

Result: Demonstrates superior performance and robustness on both synthetic and real-world datasets for category-level 6D pose estimation.

Conclusion: DICArt offers a new paradigm for reliable articulated object pose estimation by integrating discrete generative modeling with structural priors.

Abstract: Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the GT pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy, estimating the pose of each rigid part hierarchically to respect the object’s kinematic structure. We validate DICArt on both synthetic and real-world datasets. Experimental results demonstrate its superior performance and robustness. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments.
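The denoise-or-reset idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the confidence-threshold decider rule, the `NOISE` token, the `toy_predictor` stand-in for a learned network, and all names and numbers below are assumptions made for illustration, and hierarchical kinematic coupling is not modeled.

```python
from typing import Callable, List, Tuple

NOISE = -1  # hypothetical id for a token that is still "noise"

Predictor = Callable[[List[int], int], Tuple[int, float]]


def reverse_step(tokens: List[int], predictor: Predictor, threshold: float) -> List[int]:
    """One toy reverse step over a sequence of discretized pose tokens.

    For each position the predictor proposes a token and a confidence.
    The (assumed) decider commits the proposal when the confidence
    reaches the threshold and resets the position to NOISE otherwise.
    """
    out = []
    for i in range(len(tokens)):
        proposal, conf = predictor(tokens, i)
        out.append(proposal if conf >= threshold else NOISE)
    return out


def denoise(tokens: List[int], predictor: Predictor, thresholds: List[float]) -> List[int]:
    """Run one reverse step per threshold, in order."""
    for t in thresholds:
        tokens = reverse_step(tokens, predictor, t)
    return tokens


# Hypothetical stand-in for a learned model: it always proposes the
# target bin and grows more confident as more tokens are decoded.
TARGET = [3, 1, 4, 1]
BASE_CONF = [0.9, 0.4, 0.7, 0.2]


def toy_predictor(tokens: List[int], i: int) -> Tuple[int, float]:
    decoded = sum(1 for t in tokens if t != NOISE) / len(tokens)
    return TARGET[i], BASE_CONF[i] + 0.3 * decoded


if __name__ == "__main__":
    start = [NOISE] * len(TARGET)
    print(denoise(start, toy_predictor, [0.8, 0.6, 0.5, 0.4]))
```

In this sketch, positions with low confidence stay as noise in early steps and are committed later, once the decoded context raises their confidence; a learned decider would replace the fixed thresholds.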

[281] Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Yuxuan Yang, Zhonghao Yan, Yi Zhang, Bo Yun, Muxi Diao, Guowei Zhao, Kongming Liang, Wenbin Li, Zhanyu Ma

Main category: cs.CV

TL;DR: Paper 2602.19424: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing abstract

Method: Cannot determine method due to missing abstract

Result: Cannot determine results due to missing abstract

Conclusion: Cannot draw conclusions due to missing abstract

Abstract: Failed to fetch summary for 2602.19424: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19424&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[282] StruXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

Zanxi Ruan, Qiuyu Kong, Songqun Gao, Yiming Wang, Marco Cristani

Main category: cs.CV

TL;DR: Paper ID 2602.20089 could not be fetched due to HTTP 429 error (rate limiting).

Details

Motivation: Unable to determine motivation as the abstract could not be retrieved from arXiv API.

Method: No method information available due to failed API request.

Result: No results available as the paper content could not be accessed.

Conclusion: Cannot provide analysis due to technical limitations in accessing the paper information.

Abstract: Failed to fetch summary for 2602.20089: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20089&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[283] TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, Chao Feng, Can Huang, Jingqun Tang, Xiang Bai

Main category: cs.CV

TL;DR: TextPecker: A plug-and-play RL strategy that improves visual text rendering in text-to-image models by addressing structural anomalies like distortion and misalignment through better perception of text structure errors.

Details

Motivation: Current multimodal LLMs and OCR models fail to perceive structural anomalies in rendered text (distortion, blurriness, misalignment), creating a bottleneck for both evaluation and RL-based optimization of visual text rendering in text-to-image generation.

Method: Proposes TextPecker, a plug-and-play structural anomaly perceptive RL strategy that works with any text-to-image generator. Constructs a recognition dataset with character-level structural-anomaly annotations and develops a stroke-editing synthesis engine to expand structural-error coverage.

Result: TextPecker consistently improves diverse text-to-image models; on Qwen-Image, it yields average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing new SOTA in high-fidelity visual text rendering.

Conclusion: The work fills a gap in VTR optimization, providing a foundational step toward reliable and structurally faithful visual text generation by addressing the critical bottleneck of structural anomaly perception.

Abstract: Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., Seedream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.

[284] Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones

Rong Zou, Marco Cannici, Davide Scaramuzza

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.21101: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21101&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[285] WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, Spencer Whitehead

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2601.02439 exists but cannot be analyzed without the abstract content.

Details

Motivation: Cannot determine motivation without access to the paper abstract due to arXiv API rate limiting.

Method: Cannot determine method without access to the paper abstract.

Result: Cannot determine results without access to the paper abstract.

Conclusion: Cannot draw conclusions without access to the paper abstract.

Abstract: Failed to fetch summary for 2601.02439: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02439&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Jinghan Li, Junfeng Fang, Jinda Lu, Yuan Wang, Xiaoyan Guo, Tianyu Zhang, Xiang Wang, Xiangnan He

Main category: cs.CV

TL;DR: Unable to analyze paper 2602.21743 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation without access to the paper abstract

Method: Cannot determine method without access to the paper abstract

Result: Cannot determine results without access to the paper abstract

Conclusion: Cannot draw conclusions without access to the paper abstract

Abstract: Failed to fetch summary for 2602.21743: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21743&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei, Debang Li, Sheng Chen, Chaofeng Ao, Nuo Pang, Yiming Wang, Yikun Dou, Zheng Chen, Mingyuan Fan, Tuanhui Li, Mingshan Chang, Hao Zhang, Xiaopeng Sun, Jingtao Xu, Yuqiang Xie, Jiahua Wang, Zhiheng Xu, Weiming Xiong, Yuzhe Jin, Baoxuan Gu, Binjie Mao, Yunjie Yu, Jujie He, Yuhao Feng, Shiwen Tu, Chaojie Wang, Rui Yan, Wei Shen, Jingchen Wu, Peng Zhao, Xuanyue Zhong, Zhuangzhuang Liu, Kaifei Wang, Fuxiang Zhang, Weikai Xu, Wenyan Liu, Binglu Zhang, Yu Shen, Tianhui Xiong, Bin Peng, Liang Zeng, Xuchen Song, Haoxiang Guo, Peiyu Wang, Max W. Y. Lam, Chien-Hung Liu, Yahui Zhou

Main category: cs.CV

TL;DR: SkyReels V4 is a unified multimodal video foundation model that jointly generates synchronized video and audio, supports multimodal inputs (text, images, video, masks, audio), and handles generation, inpainting, and editing tasks at cinematic resolutions up to 1080p, 32 FPS, 15 seconds.

Details

Motivation: To create a comprehensive video foundation model that unifies video-audio generation, inpainting, and editing with multimodal input support, addressing the need for high-fidelity, synchronized audiovisual content generation at cinematic quality.

Method: Dual-stream Multimodal Diffusion Transformer (MMDiT) architecture with separate branches for video and temporally aligned audio generation, sharing a multimodal LLM text encoder. Uses channel concatenation for unified inpainting tasks, and employs efficiency strategy: joint low-resolution full sequence + high-resolution keyframe generation followed by super-resolution and frame interpolation.

Result: First video foundation model supporting multimodal input, joint video-audio generation, and unified generation/inpainting/editing. Achieves 1080p resolution, 32 FPS, 15-second duration with synchronized audio while maintaining computational efficiency.

Conclusion: SkyReels V4 represents a significant advancement in multimodal video generation, offering a unified solution for high-quality synchronized audiovisual content creation with comprehensive input support and task flexibility.

Abstract: SkyReels V4 is a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on the Multimodal Large Language Models (MMLM). SkyReels V4 accepts rich multi-modal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multi-modal instruction-following capability with in-context learning in the video branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing under a single interface, and naturally extends to vision-referenced inpainting and editing via multi-modal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi-modal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.

Yinheng Lin, Yiming Huang, Beilei Cui, Long Bai, Huxin Gao, Hongliang Ren, Jiewen Lai

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.21893: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21893&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[289] CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation

YuXin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, Jingdong Wang

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.22150: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22150&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[290] Solaris: Building a Multiplayer Video World Model in Minecraft

Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about the paper due to data access issues

Abstract: Failed to fetch summary for 2602.22208: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22208&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[291] Adaptive Hybrid Caching for Efficient Text-to-Video Diffusion Model Acceleration

Yuanxin Wei, Lansong Diao, Bujiao Chen, Shenggan Cheng, Zhengping Qian, Wenyuan Yu, Nong Xiao, Wei Lin, Jiangsu Du

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2508.12691: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.12691&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[292] Establishing Stochastic Object Models from Noisy Data via Ambient Measurement-Integrated Diffusion

Jianwei Sun, Xiaoning Lei, Wenhao Cai, Xichen Xu, Yanshu Wang, Hu Gao

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper fetch

Method: Cannot determine method due to failed paper fetch

Result: Cannot determine results due to failed paper fetch

Conclusion: Cannot determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2512.14187: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14187&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[293] A Pragmatic VLA Foundation Model

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, Kecheng Zheng

Main category: cs.CV

TL;DR: Paper 2601.18692: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing abstract

Method: Cannot determine method due to missing abstract

Result: Cannot determine results due to missing abstract

Conclusion: Cannot draw conclusion due to missing abstract

Abstract: Failed to fetch summary for 2601.18692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[294] Denoising the Deep Sky: Physics-Based CCD Noise Formation for Astronomical Imaging

Shuhong Liu, Xining Ge, Ziying Gu, Quanfeng Xu, Lin Gu, Ziteng Cui, Xuangeng Chu, Jun Liu, Dong Li, Tatsuya Harada

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2601.23276: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.23276&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[295] Visible Light Positioning With Lamé Curve LEDs: A Generic Approach for Camera Pose Estimation

Wenxuan Pan, Yang Yang, Dong Wei, Zhiyu Zhu, Jintao Wang, Huan Wu, Yao Nie

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about the paper due to access limitations

Abstract: Failed to fetch summary for 2602.01577: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01577&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[296] Compact Hadamard Latent Codes for Efficient Spectral Rendering

Jiaqi Yu, Dar’ya Guarnera, Giuseppe Claudio Guarnera

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.18741: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18741&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.AI

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou

Main category: cs.AI

Details

[298] Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation

Pengzhen Xie, Huizhi Liang

Main category: cs.AI

TL;DR: GYWI is a scientific idea generation system that combines author knowledge graphs with retrieval-augmented generation to provide controllable context and traceable inspiration paths for LLMs.

Details

Motivation: Current LLM-generated scientific ideas lack controllable academic context and traceable inspiration pathways, limiting their practical utility in research.

Method: 1) Author-centered knowledge graph construction with inspiration source sampling; 2) Hybrid retrieval (RAG + GraphRAG) for depth and breadth knowledge; 3) Prompt optimization with reinforcement learning; 4) Evaluation on arXiv dataset with multi-dimensional assessment.

Result: GYWI significantly outperforms mainstream LLMs (GPT-4o, DeepSeek-V3, Qwen3-8B, Gemini 2.5) in novelty, reliability, and relevance metrics.

Conclusion: The GYWI system effectively enhances LLM-based scientific idea generation by providing structured knowledge context and traceable inspiration pathways.

Abstract: Large Language Models (LLMs) demonstrate potential in the field of scientific idea generation. However, the generated results often lack controllable academic context and traceable inspiration pathways. To bridge this gap, this paper proposes a scientific idea generation system called GYWI, which combines author knowledge graphs with retrieval-augmented generation (RAG) to form an external knowledge base that provides controllable context and a traceable inspiration path for LLMs generating new scientific ideas. We first propose an author-centered knowledge graph construction method and inspiration source sampling algorithms to construct the external knowledge base. Then, we propose a hybrid retrieval mechanism composed of both RAG and GraphRAG to retrieve content with both depth and breadth of knowledge, forming a hybrid context. Third, we propose a prompt optimization strategy incorporating reinforcement learning principles to automatically guide LLMs in optimizing the results based on the hybrid context. To evaluate the proposed approaches, we constructed an evaluation dataset based on arXiv (2018-2023). This paper also develops a comprehensive evaluation method including empirical automatic assessment on a multiple-choice question task, LLM-based scoring, human evaluation, and semantic space visualization analysis. The generated ideas are evaluated along five dimensions: novelty, feasibility, clarity, relevance, and significance. We conducted experiments on different LLMs including GPT-4o, DeepSeek-V3, Qwen3-8B, and Gemini 2.5. Experimental results show that GYWI significantly outperforms mainstream LLMs in multiple metrics such as novelty, reliability, and relevance.

[299] FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation

Xiyuan Zhang, Huihang Wu, Jiayu Guo, Zhenlin Zhang, Yiwei Zhang, Liangyu Huo, Xiaoxiao Ma, Jiansong Wan, Xuewei Jiao, Yi Jing, Jian Xie

Main category: cs.AI

TL;DR: FIRE is a comprehensive benchmark for evaluating LLMs’ financial knowledge through theoretical exam questions and practical business scenarios, with 3,000 financial scenario questions across various domains.

Details

Motivation: There's a need to systematically evaluate LLMs' financial capabilities, both in theoretical knowledge and practical business applications, to understand their real-world utility in financial domains.

Method: Created a benchmark with theoretical questions from financial qualification exams and practical scenario questions using a systematic evaluation matrix covering essential financial subdomains and business activities.

Result: Comprehensive evaluation of state-of-the-art LLMs including XuanYuan 4.0, providing systematic analysis of current LLM capabilities in financial applications.

Conclusion: FIRE enables systematic assessment of LLMs’ financial knowledge boundaries and practical utility, with publicly released benchmark questions and evaluation code for future research.

Abstract: We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practical business scenarios. For theoretical assessment, we curate a diverse set of examination questions drawn from widely recognized financial qualification exams, enabling evaluation of LLMs' deep understanding and application of financial knowledge. In addition, to assess the practical value of LLMs in real-world financial tasks, we propose a systematic evaluation matrix that categorizes complex financial domains and ensures coverage of essential subdomains and business activities. Based on this evaluation matrix, we collect 3,000 financial scenario questions, consisting of closed-form decision questions with reference answers and open-ended questions evaluated by predefined rubrics. We conduct comprehensive evaluations of state-of-the-art LLMs on the FIRE benchmark, including XuanYuan 4.0, our latest financial-domain model, as a strong in-domain baseline. These results enable a systematic analysis of the capability boundaries of current LLMs in financial applications. We publicly release the benchmark questions and evaluation code to facilitate future research.

[300] Multi-Level Causal Embeddings

Willem Schooltink, Fabio Massimo Zennaro

Main category: cs.AI

TL;DR: Causal embeddings framework generalizes abstractions to map multiple detailed causal models into sub-systems of a coarser model, addressing multi-resolution marginal problems for merging datasets from different representations.

Details

Motivation: Current causal abstractions focus on relations between two models, but there's a need for frameworks that can embed multiple detailed models into a single coarser model to handle merging of datasets from different causal representations.

Method: Defines causal embeddings as generalization of abstraction, presents generalized notion of consistency, and formulates multi-resolution marginal problem to address statistical and causal marginal problems.

Result: Framework enables merging datasets from models with different representations and addresses both statistical and causal marginal problems through the multi-resolution marginal problem formulation.

Conclusion: Causal embeddings provide a generalized framework for embedding multiple detailed causal models into coarser models, offering practical utility for dataset merging and addressing marginal problems across different causal representations.

Abstract: Abstractions of causal models allow for the coarsening of models such that relations of cause and effect are preserved. Whereas abstractions focus on the relation between two models, in this paper we study a framework for causal embeddings which enable multiple detailed models to be mapped into sub-systems of a coarser causal model. We define causal embeddings as a generalization of abstraction, and present a generalized notion of consistency. By defining a multi-resolution marginal problem, we showcase the relevance of causal embeddings for both the statistical marginal problem and the causal marginal problem; furthermore, we illustrate its practical use in merging datasets coming from models with different representations.

[301] Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents

Varun Pratap Bhardwaj

Main category: cs.AI

TL;DR: Agent Behavioral Contracts (ABC) formal framework brings Design-by-Contract principles to AI agents with runtime enforcement, probabilistic compliance metrics, and drift bounds.

Details

Motivation: AI agents lack formal behavioral specifications unlike traditional software with APIs and type systems, leading to drift, governance failures, and project failures in agentic AI deployments.

Method: Introduces ABC framework with contracts C = (P, I, G, R) specifying Preconditions, Invariants, Governance policies, and Recovery mechanisms. Defines (p, delta, k)-satisfaction for probabilistic compliance, proves Drift Bounds Theorem, establishes conditions for safe contract composition in multi-agent chains, and implements in AgentAssert runtime enforcement library.

Result: Evaluation on AgentContract-Bench (200 scenarios, 7 models, 6 vendors, 1,980 sessions) shows contracted agents detect 5.2-6.8 soft violations per session that baselines miss, achieve 88-100% hard constraint compliance, bound drift to D* < 0.27, with 100% recovery for frontier models and 17-100% across all models at <10ms overhead per action.

Conclusion: ABC framework successfully brings formal contract-based specification and enforcement to AI agents, addressing drift and governance issues with practical runtime enforcement and proven theoretical guarantees.

Abstract: Traditional software relies on contracts – APIs, type systems, assertions – to specify and enforce correct behavior. AI agents, by contrast, operate on prompts and natural language instructions with no formal behavioral specification. This gap is the root cause of drift, governance failures, and frequent project failures in agentic AI deployments. We introduce Agent Behavioral Contracts (ABC), a formal framework that brings Design-by-Contract principles to autonomous AI agents. An ABC contract C = (P, I, G, R) specifies Preconditions, Invariants, Governance policies, and Recovery mechanisms as first-class, runtime-enforceable components. We define (p, delta, k)-satisfaction – a probabilistic notion of contract compliance that accounts for LLM non-determinism and recovery – and prove a Drift Bounds Theorem showing that contracts with recovery rate gamma > alpha (the natural drift rate) bound behavioral drift to D* = alpha/gamma in expectation, with Gaussian concentration in the stochastic setting. We establish sufficient conditions for safe contract composition in multi-agent chains and derive probabilistic degradation bounds. We implement ABC in AgentAssert, a runtime enforcement library, and evaluate on AgentContract-Bench, a benchmark of 200 scenarios across 7 models from 6 vendors. Results across 1,980 sessions show that contracted agents detect 5.2-6.8 soft violations per session that uncontracted baselines miss entirely (p < 0.0001, Cohen’s d = 6.7-33.8), achieve 88-100% hard constraint compliance, and bound behavioral drift to D* < 0.27 across extended sessions, with 100% recovery for frontier models and 17-100% across all models, at overhead < 10 ms per action.
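The fixed point D* = alpha/gamma quoted in the abstract can be illustrated with a toy simulation. The sketch below assumes a simple linear model, dD/dt = alpha - gamma * D, in which drift accumulates at rate alpha and recovery removes it in proportion to the current drift. This model, the function name `simulate_drift`, and the numeric rates are assumptions made for illustration; the digest does not state the paper's actual dynamics or proof.

```python
def simulate_drift(alpha: float, gamma: float, steps: int = 10_000, dt: float = 0.01) -> float:
    """Euler-integrate dD/dt = alpha - gamma * D from D = 0 (assumed dynamics)."""
    if gamma <= alpha:
        # The abstract states the bound for recovery rate gamma > alpha.
        raise ValueError("expected gamma > alpha")
    drift = 0.0
    for _ in range(steps):
        drift += dt * (alpha - gamma * drift)
    return drift


alpha, gamma = 0.05, 0.25      # hypothetical drift and recovery rates
d_star = alpha / gamma         # fixed point of the assumed dynamics
d_sim = simulate_drift(alpha, gamma)
print(f"analytic D* = {d_star:.4f}, simulated D = {d_sim:.4f}")
```

Under this assumed model the simulated drift settles at the analytic value, and any gamma > alpha keeps D* below 1.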

Yongjun Zhang

Main category: cs.AI

TL;DR: AI agents can autonomously execute entire social science research pipelines through multi-step reasoning workflows, representing a qualitative shift from prior automation technologies.

Details

Motivation: To explore how AI agents can transform social science research by autonomously executing complete research pipelines, introducing the concept of "vibe researching" as parallel to "vibe coding" in programming.

Method: Introduces scholar-skill, a 21-skill plugin for Claude Code covering full research pipeline; develops cognitive task framework classifying research activities along codifiability and tacit knowledge dimensions to identify delegation boundaries.

Result: AI agents excel at speed, coverage, and methodological scaffolding but struggle with theoretical originality and tacit field knowledge; identifies cognitive (not sequential) delegation boundaries cutting through every research stage.

Conclusion: Proposes five principles for responsible “vibe researching” and analyzes three implications for the profession: augmentation with fragile conditions, stratification risk, and pedagogical crisis.

Abstract: AI agents – systems that execute multi-step reasoning workflows with persistent state, tool access, and specialist skills – represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated queries, AI agents can now read files, run code, query databases, search the web, and invoke domain-specific skills to execute entire research pipelines autonomously. This paper introduces the concept of vibe researching – the AI-era parallel to “vibe coding” (Karpathy, 2025) – and uses scholar-skill, a 21-skill plugin for Claude Code covering the full research pipeline from idea to submission, as an illustrative case. I develop a cognitive task framework that classifies research activities along two dimensions – codifiability and tacit knowledge requirement – to identify a delegation boundary that is cognitive, not sequential: it cuts through every stage of the research pipeline, not between stages. I argue that AI agents excel at speed, coverage, and methodological scaffolding but struggle with theoretical originality and tacit field knowledge. The paper concludes with an analysis of three implications for the profession – augmentation with fragile conditions, stratification risk, and a pedagogical crisis – and proposes five principles for responsible vibe researching.

[303] Towards Autonomous Memory Agents

Xinle Wu, Rui Zhang, Mustafa Anis Hussain, Yao Lu

Main category: cs.AI

TL;DR: U-Mem introduces autonomous memory agents that actively acquire, validate, and curate knowledge through cost-aware extraction cascades and semantic-aware Thompson sampling, outperforming passive memory approaches.

Details

Motivation: Current memory agents are passive and reactive, with memory growth limited to available information and lacking active knowledge-seeking capabilities when uncertain. They don't proactively acquire external inputs to address knowledge gaps.

Method: U-Mem uses (1) a cost-aware knowledge-extraction cascade that escalates from cheap self/teacher signals to tool-verified research and expert feedback only when necessary, and (2) semantic-aware Thompson sampling to balance exploration-exploitation over memories and mitigate cold-start bias.

Result: U-Mem consistently beats prior memory baselines and can surpass RL-based optimization, improving HotpotQA (Qwen2.5-7B) by 14.6 points and AIME25 (Gemini-2.5-flash) by 7.33 points on both verifiable and non-verifiable benchmarks.

Conclusion: Autonomous memory agents that actively acquire, validate, and curate knowledge at minimum cost significantly outperform passive memory approaches, demonstrating the value of proactive knowledge-seeking in memory systems.

Abstract: Recent memory agents improve LLMs by extracting experiences and conversation history into external storage. This enables low-overhead context assembly and online memory updates without expensive LLM training. However, existing solutions remain passive and reactive; memory growth is bounded by information that happens to be available, while memory agents seldom seek external inputs under uncertainty. We propose autonomous memory agents that actively acquire, validate, and curate knowledge at minimum cost. U-Mem materializes this idea via (i) a cost-aware knowledge-extraction cascade that escalates from cheap self/teacher signals to tool-verified research and, only when needed, expert feedback, and (ii) semantic-aware Thompson sampling to balance exploration and exploitation over memories and mitigate cold-start bias. On both verifiable and non-verifiable benchmarks, U-Mem consistently beats prior memory baselines and can surpass RL-based optimization, improving HotpotQA (Qwen2.5-7B) by 14.6 points and AIME25 (Gemini-2.5-flash) by 7.33 points.
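For readers unfamiliar with Thompson sampling, the sketch below shows the plain Beta-Bernoulli variant applied to choosing among memory entries. It is not U-Mem's method: the semantic-aware component described in the abstract is not reproduced, and the class `MemoryBandit`, the memory ids, and the usefulness rates are hypothetical.

```python
import random


class MemoryBandit:
    """Plain Beta-Bernoulli Thompson sampling over a set of memory entries."""

    def __init__(self, memory_ids, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per memory: [successes + 1, failures + 1].
        self.params = {m: [1.0, 1.0] for m in memory_ids}

    def select(self):
        # Sample a plausible usefulness for each memory and pick the largest.
        samples = {m: self.rng.betavariate(a, b) for m, (a, b) in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, memory_id, helped):
        # A helpful retrieval raises the first parameter, an unhelpful one the second.
        self.params[memory_id][0 if helped else 1] += 1.0


# Hypothetical ground-truth probability that each memory helps the task.
true_rates = {"m1": 0.2, "m2": 0.5, "m3": 0.8}
bandit = MemoryBandit(true_rates)
env = random.Random(1)
counts = {m: 0 for m in true_rates}
for _ in range(2000):
    chosen = bandit.select()
    counts[chosen] += 1
    bandit.update(chosen, env.random() < true_rates[chosen])
print(counts)
```

Early rounds spread selections across all memories (exploration); as evidence accumulates, selections concentrate on the memory with the highest usefulness rate (exploitation).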

[304] A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger

Main category: cs.AI

TL;DR: Proposes a decision-theoretic framework to detect and quantify steganographic reasoning in LLMs by measuring information asymmetry between agents who can and cannot decode hidden content.

Details

Motivation: LLMs are showing steganographic capabilities that could allow misaligned models to evade oversight, but classical steganography detection methods require known reference distributions which are infeasible for LLM reasoning.

Method: Introduces a decision-theoretic view of steganography and generalised V-information framework to measure usable information. Defines the “steganographic gap” metric comparing downstream utility of steganographic signals to agents with and without decoding capability.

Result: Empirically validates the formalism and shows it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.

Conclusion: Provides a principled framework for detecting steganographic behavior in LLMs without requiring reference distributions, addressing an important safety concern in AI oversight.

Abstract: Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents’ observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the steganographic gap – a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
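
The comparison behind the steganographic gap can be illustrated with a toy example. The sketch below is only a simplified reading of the abstract: it takes the gap to be a plain difference in average task utility between an agent that knows the encoding and one that does not, which may differ from the paper's formal definition. The parity encoding and all function names are hypothetical.

```python
def encode(cover_words, hidden_bit):
    """Toy scheme (assumed): hide one bit in the parity of the word count."""
    words = list(cover_words)
    if len(words) % 2 != hidden_bit:
        words.append("indeed")
    return words


def decoder_agent(signal):
    # Knows the scheme, so it reads the bit from the word-count parity.
    return len(signal) % 2


def naive_agent(signal):
    # Cannot decode the hidden content, so it always answers 0.
    return 0


def mean_utility(agent, episodes):
    # Utility is 1 when the agent's action matches the target, else 0.
    return sum(agent(signal) == target for signal, target in episodes) / len(episodes)


def steganographic_gap(decoder, non_decoder, episodes):
    # Simplified gap: utility with decoding ability minus utility without it.
    return mean_utility(decoder, episodes) - mean_utility(non_decoder, episodes)


cover = ["the", "answer", "is", "clear"]
episodes = [(encode(cover, bit), bit) for bit in [0, 1] * 50]
gap = steganographic_gap(decoder_agent, naive_agent, episodes)
```

A gap near zero would mean the signal is equally useful to both agents; a large positive gap means the signal carries information that only the decoding agent can act on.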

[305] Exploring Human Behavior During Abstract Rule Inference and Problem Solving with the Cognitive Abstraction and Reasoning Corpus

Caroline Ahn, Quan Do, Leah Bakst, Michael P. Pascale, Joseph T. McGuire, Michael E. Hasselmo, Chantal E. Stern

Main category: cs.AI

TL;DR: Researchers introduce CogARC, a human-adapted subset of the Abstraction and Reasoning Corpus, to study human abstract visual reasoning strategies through behavioral experiments with 260 participants solving 75 visual reasoning problems.

Details

Motivation: To investigate the cognitive strategies underlying human flexibility in abstract reasoning, particularly how people rapidly learn and apply rules from sparse examples in visual reasoning tasks.

Method: Created CogARC from the Abstraction and Reasoning Corpus, administered 75 abstract visual reasoning problems to 260 participants across two experiments, recording high-resolution behavioral data including viewing patterns, edit sequences, and multi-attempt submissions.

Result: Participants achieved high accuracy (~90% in experiment 1, ~80% in experiment 2), with performance varying across problems and participants. Harder problems led to longer deliberation and more diverse strategies. Participants became faster but slightly less accurate over time, suggesting task familiarity rather than improved rule-learning. Even incorrect solutions showed convergence in reasoning approaches.

Conclusion: CogARC provides a rich behavioral environment for studying human abstract reasoning, revealing insights into how people generalize, misgeneralize, and adapt strategies under uncertainty in visual reasoning tasks.

Abstract: Humans exhibit remarkable flexibility in abstract reasoning, and can rapidly learn and apply rules from sparse examples. To investigate the cognitive strategies underlying this ability, we introduce the Cognitive Abstraction and Reasoning Corpus (CogARC), a diverse human-adapted subset of the Abstraction and Reasoning Corpus (ARC) which was originally developed to benchmark abstract reasoning in artificial intelligence. Across two experiments, CogARC was administered to a total of 260 human participants who freely generated solutions to 75 abstract visual reasoning problems. Success required inferring input-output rules from a small number of examples to transform the test input into one correct test output. Participants’ behavior was recorded at high temporal resolution, including example viewing, edit sequences, and multi-attempt submissions. Participants were generally successful (mean accuracy ~90% for experiment 1 (n=40), ~80% for experiment 2 (n=220) across problems), but performance varied widely across problems and participants. Harder problems elicited longer deliberation times and greater divergence in solution strategies. Over the course of the task, participants initiated responses more quickly but showed a slight decline in accuracy, suggesting increased familiarity with the task structure rather than improved rule-learning ability. Importantly, even incorrect solutions were often highly convergent, even when the problem-solving trajectories differed in length and smoothness. Some trajectories progressed directly and efficiently toward a stable outcome, whereas others involved extended exploration or partial restarts before converging. Together, these findings highlight CogARC as a rich behavioral environment for studying human abstract reasoning, providing insight into how people generalize, misgeneralize, and adapt their strategies under uncertainty.

[306] Epistemic Filtering and Collective Hallucination: A Jury Theorem for Confidence-Calibrated Agents

Jonas Karge

Main category: cs.AI

TL;DR: A framework for collective decision-making where heterogeneous agents learn their own reliability through calibration, then selectively abstain from voting based on confidence gates, extending Condorcet Jury Theorem to sequential settings with potential applications to reducing AI hallucinations.

Details

Motivation: Classical voting theory assumes fixed participation, but real-world aggregation benefits from allowing agents to abstain when uncertain. The paper aims to extend epistemic voting results to settings where agents can learn their own competence and selectively participate, with applications to improving collective AI decision-making and reducing hallucinations.

Method: Proposes a probabilistic framework with two phases: 1) calibration phase where agents update beliefs about their fixed competence through repeated trials, 2) final confidence gate where agents decide to vote or abstain based on learned confidence. Derives non-asymptotic lower bounds on group success probability and proves generalization of Condorcet Jury Theorem to sequential, confidence-gated settings. Validates with Monte Carlo simulations.

Result: Theoretical results show selective participation generalizes asymptotic guarantees of Condorcet Jury Theorem to sequential settings with confidence gates. Non-asymptotic lower bounds on group success probability are derived. Empirical simulations validate theoretical bounds. Framework demonstrates potential to improve collective accuracy by allowing uncertain agents to abstain.

Conclusion: The proposed selective participation framework extends classical voting theory to settings where agents can learn and express uncertainty, providing theoretical guarantees for collective accuracy. While general, the framework has promising applications to AI safety, particularly for mitigating hallucinations in collective LLM decision-making by allowing models to abstain when uncertain.

Abstract: We investigate the collective accuracy of heterogeneous agents who learn to estimate their own reliability over time and selectively abstain from voting. While classical epistemic voting results, such as the Condorcet Jury Theorem (CJT), assume fixed participation, real-world aggregation often benefits from allowing agents to say “I don’t know.” We propose a probabilistic framework where agents engage in a calibration phase, updating beliefs about their own fixed competence, before facing a final confidence gate that determines whether to vote or abstain. We derive a non-asymptotic lower bound on the group’s success probability and prove that this selective participation generalizes the asymptotic guarantees of the CJT to a sequential, confidence-gated setting. Empirically, we validate these bounds via Monte Carlo simulations. While our results are general, we discuss their potential application to AI safety, outlining how this framework can mitigate hallucinations in collective LLM decision-making.
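
A small Monte Carlo sketch can show why a confidence gate may help. It is an illustration under stated assumptions, not the paper's model: each agent estimates its competence as a plain success frequency over `calib_trials` calibration rounds and votes only if that estimate reaches `threshold`. The gate rule, the tie-breaking coin flip, and all names are hypothetical.

```python
import random


def majority_correct(votes, rng):
    """True if correct votes outnumber wrong ones; ties (or no voters) are a coin flip."""
    right = sum(votes)
    wrong = len(votes) - right
    if right == wrong:
        return rng.random() < 0.5
    return right > wrong


def simulate(competences, calib_trials, threshold, runs, seed=0):
    """Return (accuracy with confidence gate, accuracy when everyone votes)."""
    rng = random.Random(seed)
    gated_wins = 0
    all_wins = 0
    for _ in range(runs):
        ballots = []
        for p in competences:
            # Calibration phase: the agent observes its own hit rate.
            hits = sum(rng.random() < p for _ in range(calib_trials))
            confident = hits / calib_trials >= threshold
            # Final question: the agent is correct with its fixed competence p.
            correct = rng.random() < p
            ballots.append((confident, correct))
        all_wins += majority_correct([c for _, c in ballots], rng)
        gated_wins += majority_correct([c for g, c in ballots if g], rng)
    return gated_wins / runs, all_wins / runs


# Five reliable agents and ten agents that are worse than chance.
gated, everyone = simulate([0.8] * 5 + [0.4] * 10, calib_trials=50, threshold=0.5, runs=2000)
```

With this mix, most below-chance agents fail the gate and abstain, so the gated majority is decided mainly by the reliable agents.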

[307] ArchAgent: Agentic AI-driven Computer Architecture Discovery

Raghav Gupta, Akanksha Jain, Abraham Gonzalez, Alexander Novikov, Po-Sen Huang, Matej Balog, Marvin Eisenberger, Sergey Shirobokov, Ngân Vũ, Martin Dixon, Borivoje Nikolić, Parthasarathy Ranganathan, Sagar Karandikar

Main category: cs.AI

TL;DR: ArchAgent is an AI agent system that automatically designs state-of-the-art cache replacement policies for computer architecture, achieving 5.3% IPC speedup on multi-core workloads and enabling post-silicon hyperspecialization.

Details

Motivation: The paper addresses the need for agile hardware design flows to meet growing compute demands, leveraging recent advances in agentic generative AI systems that have shown success in algorithm design and scientific discovery.

Method: ArchAgent is built on AlphaEvolve and automatically designs and implements cache replacement policies (creating new mechanisms/logic, not just tuning parameters) within an established competition framework, and enables post-silicon hyperspecialization through runtime parameter tuning.

Result: ArchAgent achieved a 5.3% IPC speedup over prior SoTA on multi-core Google Workload Traces in 2 days, and a 0.9% IPC speedup on SPEC06 workloads in 18 days, reaching these gains 3-5x faster than prior human-developed SoTA policies. Post-silicon hyperspecialization yielded a 2.4% IPC speedup over prior SoTA on SPEC06.

Conclusion: Agentic AI systems like ArchAgent can significantly accelerate computer architecture research, enable post-silicon optimization, and reveal unexpected phenomena like “simulator escapes” where AI discovers loopholes in research tools designed for human operation.

Abstract: Agile hardware design flows are a critically needed force multiplier to meet the exploding demand for compute. Recently, agentic generative AI systems have demonstrated significant advances in algorithm design, improving code efficiency, and enabling discovery across scientific domains. Bridging these worlds, we present ArchAgent, an automated computer architecture discovery system built on AlphaEvolve. We show ArchAgent’s ability to automatically design/implement state-of-the-art (SoTA) cache replacement policies (architecting new mechanisms/logic, not only changing parameters), broadly within the confines of an established cache replacement policy design competition. In two days without human intervention, ArchAgent generated a policy achieving a 5.3% IPC speedup improvement over the prior SoTA on public multi-core Google Workload Traces. On the heavily-explored single-core SPEC06 workloads, it generated a policy in just 18 days showing a 0.9% IPC speedup improvement over the existing SoTA (a similar “winning margin” as reported by the existing SoTA). ArchAgent achieved these gains 3-5x faster than prior human-developed SoTA policies. Agentic flows also enable “post-silicon hyperspecialization” where agents tune runtime-configurable parameters exposed in hardware policies to further align the policies with a specific workload (mix). Exploiting this, we demonstrate a 2.4% IPC speedup improvement over prior SoTA on SPEC06 workloads. Finally, we outline broader implications for computer architecture research in the era of agentic AI. For example, we demonstrate the phenomenon of “simulator escapes”, where the agentic AI flow discovered and exploited a loophole in a popular microarchitectural simulator - a consequence of the fact that these research tools were designed for a (now past) world where they were exclusively operated by humans acting in good-faith.
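
For readers unfamiliar with the design space, the sketch below shows the classic least-recently-used (LRU) baseline on a toy address trace. It is not ArchAgent's policy and it measures hit rate rather than IPC, which requires a microarchitectural simulator; the function name and the trace are made up for illustration.

```python
from collections import OrderedDict


def lru_hit_rate(trace, capacity):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[addr] = True
    return hits / len(trace)


# A cyclic trace slightly larger than the cache makes LRU evict each line
# just before it is reused.
cyclic = [1, 2, 3, 1, 2, 3]
```

A replacement policy is the rule that picks the eviction victim; the eviction line is the part an automated search would replace with different logic.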

[308] How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu, Rui Sun, Zhiji Liu, Yue Xing, Jiliang Tang, Benoit Dumoulin

Main category: cs.AI

TL;DR: Analysis of latent reasoning methods reveals pervasive shortcut behavior and limitations in implementing structured search, with trade-offs between supervision strength and representation diversity.

Details

Motivation: To better understand the internal mechanisms of latent reasoning, which performs multi-step reasoning in continuous latent spaces rather than in discrete textual space. Despite many studies on improving its performance, how it works internally has not been fully investigated.

Method: Comprehensive analysis of latent reasoning methods with different levels of supervision, examining shortcut behavior and testing the hypothesis about BFS-like exploration in latent space.

Result: Identified pervasive shortcut behavior where methods achieve high accuracy without relying on latent reasoning. Found that while latent representations can encode multiple possibilities, the reasoning process doesn’t faithfully implement structured search but exhibits implicit pruning and compression. Revealed trade-off: stronger supervision mitigates shortcuts but restricts representation diversity, while weaker supervision allows richer representations at cost of increased shortcut behavior.

Conclusion: Latent reasoning methods have fundamental limitations including shortcut behavior and lack of faithful structured search implementation, with supervision strength creating a trade-off between mitigating shortcuts and maintaining representation diversity.

Abstract: Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space. This paradigm enables reasoning beyond discrete language tokens by performing multi-step computation in continuous latent spaces. Although there have been numerous studies focusing on improving the performance of latent reasoning, its internal mechanisms remain not fully investigated. In this work, we conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representation in the process. We identify two key issues across latent reasoning methods with different levels of supervision. First, we observe pervasive shortcut behavior, where they achieve high accuracy without relying on latent reasoning. Second, we examine the hypothesis that latent reasoning supports BFS-like exploration in latent space, and find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search, but instead exhibits implicit pruning and compression. Finally, our findings reveal a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of increased shortcut behavior.

[309] A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines

Gaoyuan Du, Amit Ahlawat, Xiaoyang Liu, Jing Wu

Main category: cs.AI

TL;DR: Proposes an Evaluation Agent (EA) for decision-centric assessment of agentic AutoML systems, moving beyond outcome-only metrics to evaluate intermediate decisions across four dimensions.

Details

Motivation: Existing agent-based AutoML systems rely on LLMs for complex decisions but are evaluated only on final task performance, lacking structured metrics for assessing intermediate decision quality. Current evaluation practices are outcome-centric and fail to expose decision-level failure modes.

Method: Proposes an Evaluation Agent (EA) that acts as an observer to assess AutoML agent decisions without interfering with execution. Evaluates decisions along four dimensions: decision validity, reasoning consistency, model quality risks beyond accuracy, and counterfactual decision impact.

Result: In four proof-of-concept experiments, the EA achieved: (i) 0.919 F1 score for detecting faulty decisions, (ii) identification of reasoning inconsistencies independent of final outcomes, and (iii) attribution of downstream performance changes to agent decisions with impacts ranging from -4.9% to +8.3% in final metrics.

Conclusion: Decision-centric evaluation exposes failure modes invisible to outcome-only metrics. The work reframes evaluation of agentic AutoML systems from outcome-based to decision-auditing perspective, offering foundation for reliable, interpretable, and governable autonomous ML systems.

Abstract: Agent-based AutoML systems rely on large language models to make complex, multi-stage decisions across data processing, model selection, and evaluation. However, existing evaluation practices remain outcome-centric, focusing primarily on final task performance. Through a review of prior work, we find that none of the surveyed agentic AutoML systems report structured, decision-level evaluation metrics intended for post-hoc assessment of intermediate decision quality. To address this limitation, we propose an Evaluation Agent (EA) that performs decision-centric assessment of AutoML agents without interfering with their execution. The EA is designed as an observer that evaluates intermediate decisions along four dimensions: decision validity, reasoning consistency, model quality risks beyond accuracy, and counterfactual decision impact. Across four proof-of-concept experiments, we demonstrate that the EA can (i) detect faulty decisions with an F1 score of 0.919, (ii) identify reasoning inconsistencies independent of final outcomes, and (iii) attribute downstream performance changes to agent decisions, revealing impacts ranging from -4.9% to +8.3% in final metrics. These results illustrate how decision-centric evaluation exposes failure modes that are invisible to outcome-only metrics. Our work reframes the evaluation of agentic AutoML systems from an outcome-based perspective to one that audits agent decisions, offering a foundation for reliable, interpretable, and governable autonomous ML systems.

[310] Atlas-free Brain Network Transformer

Shuai Huang, Xuan Kan, James J. Lah, Deqiang Qiu

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2510.03306 was rate-limited (HTTP 429).

[311] CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines

Chayan Banerjee

Main category: cs.AI

TL;DR: Contrastive World Model (CWM) improves action feasibility scoring in embodied agents using contrastive learning with hard negatives, outperforming supervised fine-tuning on physical discrimination tasks.

Details

Motivation: Existing action feasibility scorers use supervised fine-tuning which treats each candidate independently and fails to explicitly teach discrimination between physically correct and subtly wrong actions, creating a critical bottleneck in embodied agent pipelines.

Method: CWM fine-tunes a large language model using an InfoNCE contrastive objective with hard-mined negative examples, pushing valid actions away from invalid ones in scoring space with special emphasis on hard negatives (semantically similar but physically incompatible candidates).

Result: CWM outperforms SFT by +6.76 percentage points on Precision@1 for minimal-edit negatives and achieves higher AUC-ROC (0.929 vs 0.906). Under out-of-distribution stress, CWM maintains better safety margin (-2.39 vs -3.96), ranking gold actions closer to the top.

Conclusion: Contrastive training induces representations that capture physical feasibility more faithfully than SFT alone, improving embodied agent action scoring through better discrimination between physically correct and subtly wrong actions.

Abstract: A reliable action feasibility scorer is a critical bottleneck in embodied agent pipelines: before any planning or reasoning occurs, the agent must identify which candidate actions are physically executable in the current state. Existing approaches use supervised fine-tuning (SFT) to train action scorers, but SFT treats each candidate independently and does not explicitly teach the model to discriminate between actions that are physically correct and those that are subtly wrong. We propose the Contrastive World Model (CWM), which fine-tunes a large language model (LLM) as an action scorer using an InfoNCE contrastive objective with hard-mined negative examples. The key idea is to push valid actions away from invalid ones in scoring space, with special emphasis on hard negatives: semantically similar but physically incompatible candidates. We evaluate CWM on the ScienceWorld benchmark through two studies. First, an intrinsic affordance evaluation on 605 hard-negative test pairs shows that CWM outperforms SFT by +6.76 percentage points on Precision@1 for minimal-edit negatives – cases where a single word changes the physical outcome – and achieves a higher AUC-ROC (0.929 vs. 0.906). Second, a live filter characterisation study measures how well CWM ranks gold-path actions against all valid environment actions during task execution. Under out-of-distribution stress conditions, CWM maintains a significantly better safety margin (-2.39) than SFT (-3.96), indicating that the gold action is ranked closer to the top. These results support the hypothesis that contrastive training induces representations that capture physical feasibility more faithfully than SFT alone.
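
The InfoNCE objective mentioned above can be written in a few lines. The sketch below computes the loss for one valid action scored against a set of negatives; it is a generic formulation, and the `temperature` default and the example scores are assumptions rather than values from the paper.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=1.0):
    """-log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp(s_neg_j / t)) )."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# A hard negative (scored almost as high as the valid action) is penalised
# more than an easy negative that is already scored far lower.
hard = info_nce_loss(2.0, [1.9])
easy = info_nce_loss(2.0, [-3.0])
```

Because the loss depends on the positive's score relative to each negative, hard negatives dominate the gradient, which matches the emphasis on semantically similar but physically incompatible candidates.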

[312] ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization

Joseph Tso, Preston Schmittou, Quan Huynh, Jibran Hutchins

Main category: cs.AI

TL;DR: ConstraintBench evaluates LLMs on direct constrained optimization without solvers, finding feasibility is the main bottleneck with best model achieving only 65% constraint satisfaction.

Details

Motivation: While LLMs are increasingly used for operational decision-making involving constrained optimization, existing benchmarks focus on code generation for solvers rather than evaluating LLMs' ability to directly produce correct solutions to fully specified optimization problems without solver access.

Method: Introduces ConstraintBench with 200 tasks across 10 operations research domains, each presenting natural-language scenarios with entities, constraints, and objectives. Models must return structured solutions verified by deterministic verifier against all constraints and Gurobi-proven optimum.

Result: Best model achieves only 65.0% constraint satisfaction, with feasible solutions averaging 89-96% of Gurobi-optimal objective. No model exceeds 30.5% on joint feasibility and optimality within 0.1% of solver reference. Large domain variation: feasibility spans 83.3% (production mix) to 0.8% (crew assignment).

Conclusion: Feasibility is primary bottleneck for LLMs in direct constrained optimization. Systematic failure modes include duration constraint misunderstanding, entity hallucination, and feasibility-optimality decoupling. Benchmark infrastructure will be publicly released.

Abstract: Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimization problems as solver code, but leave open a complementary question. Can LLMs directly produce correct solutions to fully specified constrained optimization problems without access to a solver? We introduce ConstraintBench, a benchmark for evaluating LLMs on direct constrained optimization across 10 operations research domains, with all ground-truth solutions verified by the Gurobi solver. Each task presents a natural-language scenario with entities, constraints, and an optimization objective; the model must return a structured solution that a deterministic verifier checks against every constraint and the solver-proven optimum. We evaluate six frontier models on 200 tasks and find that feasibility, not optimality, is the primary bottleneck. The best model achieves only 65.0% constraint satisfaction, yet feasible solutions average 89 to 96% of the Gurobi-optimal objective. No model exceeds 30.5% on joint feasibility and optimality within 0.1% of the solver reference. Per-domain analysis shows large variation in difficulty, with average feasibility spanning from 83.3% in the production mix domain to 0.8% in the crew assignment domain. Further, systematic failure modes include duration constraint misunderstanding, entity hallucination, and a feasibility-optimality decoupling in facility location and vehicle routing where models achieve high feasibility but 0% optimality. ConstraintBench and all evaluation infrastructure will be publicly released.
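
A deterministic check of this kind might look like the sketch below. It is an assumption about the general shape of such a verifier, not ConstraintBench's code: the `verify` function, the toy production-mix task, and the hand-computed reference optimum (standing in for a solver-proven one) are all hypothetical.

```python
def verify(solution, constraints, objective, reference_optimum, rel_tol=0.001):
    """Check every constraint, then test optimality within rel_tol (0.1% here)."""
    violated = [name for name, holds in constraints.items() if not holds(solution)]
    feasible = not violated
    value = objective(solution)
    near_optimal = feasible and abs(value - reference_optimum) <= rel_tol * abs(reference_optimum)
    return {
        "feasible": feasible,
        "violated": violated,
        "objective": value,
        "near_optimal": near_optimal,
    }


# Toy task: maximise 3x + 2y subject to x + y <= 4, x <= 3, x >= 0, y >= 0.
constraints = {
    "capacity": lambda s: s["x"] + s["y"] <= 4,
    "x_limit": lambda s: s["x"] <= 3,
    "non_negative": lambda s: s["x"] >= 0 and s["y"] >= 0,
}


def objective(s):
    return 3 * s["x"] + 2 * s["y"]


REFERENCE_OPTIMUM = 11  # attained at x=3, y=1 (computed by hand for this toy task)

report = verify({"x": 2, "y": 2}, constraints, objective, REFERENCE_OPTIMUM)
```

Here `report` marks the solution as feasible but not near-optimal, since its objective of 10 is about 91% of the reference; reporting feasibility and optimality separately mirrors the benchmark's distinction between the two.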

[313] Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

Tomoya Kawabe, Rin Takano

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2602.21670 was rate-limited (HTTP 429).

[314] VeRO: An Evaluation Harness for Agents to Optimize Agents

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Sam Denton

Main category: cs.AI

TL;DR: VERO is a framework for evaluating coding agent optimization through versioned snapshots and structured execution traces, with a benchmark suite for systematic analysis of agent improvement strategies.

Details

Motivation: The paper addresses the lack of systematic understanding of coding agent performance in agent optimization tasks, which differ from conventional software engineering due to the interleaving of deterministic code with stochastic LLM completions, requiring structured capture of both reasoning and execution outcomes.

Method: Introduces VERO framework with (1) reproducible evaluation harness featuring versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) benchmark suite of target agents and tasks with reference evaluation procedures.

Result: Conducted empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance, providing insights into effective agent optimization strategies.

Conclusion: VERO supports research on agent optimization as a core capability for coding agents by providing systematic evaluation tools and benchmarks for the community.

Abstract: An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.

[315] Mapping the Landscape of Artificial Intelligence in Life Cycle Assessment Using Large Language Models

Anastasija Mensikova, Donna M. Rizzo, Kathryn Hinkelman

Main category: cs.AI

TL;DR: This paper reviews AI integration in Life Cycle Assessment (LCA) using LLMs for text-mining to identify trends and themes, demonstrating LLM-assisted methodologies for large-scale reproducible reviews.

Details

Motivation: Despite rapid development of AI integration in LCA research, there's a lack of comprehensive synthesis. The study aims to address this gap by providing a detailed review using LLMs to identify trends, emerging themes, and future directions in AI-LCA research.

Method: The study integrates LLM-based text-mining methods with traditional literature review techniques. It uses large language models to analyze published work at the intersection of AI and LCA, creating a dynamic framework to capture both high-level trends and nuanced conceptual patterns.

Result: The analysis reveals dramatic growth in AI adoption in LCA, with a noticeable shift toward LLM-driven approaches and continued increases in ML applications. It identifies statistically significant correlations between AI approaches and corresponding LCA stages, demonstrating the potential of LLM-assisted methodologies for large-scale, reproducible reviews.

Conclusion: LLM-assisted methodologies show strong potential for supporting large-scale reviews across broad research domains. This work helps LCA practitioners incorporate state-of-the-art AI tools into environmental assessments, enhancing the rigor and quality of sustainability-driven decision-making processes.

Abstract: Integration of artificial intelligence (AI) into life cycle assessment (LCA) has accelerated in recent years, with numerous studies successfully adapting machine learning algorithms to support various stages of LCA. Despite this rapid development, comprehensive and broad synthesis of AI-LCA research remains limited. To address this gap, this study presents a detailed review of published work at the intersection of AI and LCA, leveraging large language models (LLMs) to identify current trends, emerging themes, and future directions. Our analyses reveal that as LCA research continues to expand, the adoption of AI technologies has grown dramatically, with a noticeable shift toward LLM-driven approaches, continued increases in ML applications, and statistically significant correlations between AI approaches and corresponding LCA stages. By integrating LLM-based text-mining methods with traditional literature review techniques, this study introduces a dynamic and effective framework capable of capturing both high-level research trends and nuanced conceptual patterns (themes) across the field. Collectively, these findings demonstrate the potential of LLM-assisted methodologies to support large-scale, reproducible reviews across broad research domains, while also evaluating pathways for computationally-efficient LCA in the context of rapidly developing AI technologies. In doing so, this work helps LCA practitioners incorporate state-of-the-art tools and timely insights into environmental assessments that can enhance the rigor and quality of sustainability-driven decisions and decision-making processes.

[316] Mirroring the Mind: Distilling Human-Like Metacognitive Strategies into Large Language Models

Ik-hwan Kim, Hyeongrok Han, Mingi Jung, Sangwon Yu, Jinseok Hong, Sang Hun Kim, Yoonyoung Choi, Sungroh Yoon

Main category: cs.AI

TL;DR: MBT is a post-training framework that injects metacognitive behaviors into large reasoning models to prevent reasoning collapse by stabilizing exploration patterns and recognizing logical sufficiency.

Details

Motivation: Large Reasoning Models often fail in complex reasoning tasks despite deriving valid intermediate steps, due to deficiencies in self-regulatory control where valid logic is destabilized by uncontrolled exploration or failure to recognize logical sufficiency.

Method: Metacognitive Behavioral Tuning (MBT) with two formulations: MBT-S synthesizes rigorous reasoning traces from scratch, and MBT-R rewrites the student’s initial traces to stabilize intrinsic exploration patterns.

Result: MBT consistently outperforms baselines across multi-hop QA benchmarks, achieving notable gains on challenging benchmarks with higher accuracy and significantly reduced token consumption.

Conclusion: Internalizing metacognitive strategies leads to more stable and robust reasoning by effectively eliminating reasoning collapse in large reasoning models.

Abstract: Large Reasoning Models (LRMs) often exhibit structural fragility in complex reasoning tasks, failing to produce correct answers even after successfully deriving valid intermediate steps. Through systematic analysis, we observe that these failures frequently stem not from a lack of reasoning capacity, but from a deficiency in self-regulatory control, where valid logic is destabilized by uncontrolled exploration or the failure to recognize logical sufficiency. Motivated by this observation, we propose Metacognitive Behavioral Tuning (MBT), a post-training framework that explicitly injects metacognitive behaviors into the model’s thought process. MBT implements this via two complementary formulations: (1) MBT-S, which synthesizes rigorous reasoning traces from scratch, and (2) MBT-R, which rewrites the student’s initial traces to stabilize intrinsic exploration patterns. Experiments across multi-hop QA benchmarks demonstrate that MBT consistently outperforms baselines, achieving notable gains on challenging benchmarks. By effectively eliminating reasoning collapse, MBT achieves higher accuracy with significantly reduced token consumption, demonstrating that internalizing metacognitive strategies leads to more stable and robust reasoning.

[317] A Mathematical Theory of Agency and Intelligence

Wael Hafez, Chenan Wei, Rodrigo Felipe, Amir Nazeri, Cameron Reid

Main category: cs.AI

TL;DR: The paper introduces “bipredictability” (P) as a measure of how much information a system shares between observations, actions, and outcomes, showing it’s bounded differently for quantum vs. classical systems and revealing current AI has agency but not intelligence.

Details

Motivation: Current AI systems can produce successful predictions while their underlying interaction with the environment degrades, lacking a principled measure of how effectively they use information across observations, actions, and outcomes.

Method: Developed bipredictability (P) as an information-theoretic measure derived from first principles, validated bounds through physical systems (double pendulum), reinforcement learning agents, and multi-turn LLM conversations, and proposed a feedback architecture inspired by thalamocortical regulation.

Result: Proved that P is bounded by regime: it can reach unity in quantum systems, is at most 0.5 in classical systems, and is lower still once agency is introduced. Demonstrated these bounds empirically and distinguished agency (acting on predictions) from intelligence (learning, self-monitoring, adaptation).

Conclusion: Current AI systems achieve agency but not intelligence; a feedback architecture monitoring P in real-time is needed for adaptive, resilient AI, establishing a prerequisite for true intelligence through self-monitoring of learning effectiveness.

Abstract: To operate reliably under changing conditions, complex systems require feedback on how effectively they use resources, not just whether objectives are met. Current AI systems process vast information to produce sophisticated predictions, yet predictions can appear successful while the underlying interaction with the environment degrades. What is missing is a principled measure of how much of the total information a system deploys is actually shared between its observations, actions, and outcomes. We prove this shared fraction, which we term bipredictability, P, is intrinsic to any interaction, derivable from first principles, and strictly bounded: P can reach unity in quantum systems, is at most 0.5 in classical systems, and falls lower once agency (action selection) is introduced. We confirm these bounds in a physical system (double pendulum), reinforcement learning agents, and multi-turn LLM conversations. These results distinguish agency from intelligence: agency is the capacity to act on predictions, whereas intelligence additionally requires learning from interaction, self-monitoring of its learning effectiveness, and adapting the scope of observations, actions, and outcomes to restore effective learning. By this definition, current AI systems achieve agency but not intelligence. Inspired by thalamocortical regulation in biological systems, we demonstrate a feedback architecture that monitors P in real time, establishing a prerequisite for adaptive, resilient AI.
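
Illustrative sketch (not from the paper): the summary above does not reproduce the authors' formula for P. Purely as an assumption, the Python snippet below uses one normalization that is consistent with the stated classical ceiling of 0.5: the mutual information between two discrete variables divided by the sum of their marginal entropies. Since I(X;Y) cannot exceed the smaller of H(X) and H(Y), this ratio cannot exceed 0.5. The function name `shared_fraction` and the toy joint tables are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def shared_fraction(joint):
    """Return I(X;Y) / (H(X) + H(Y)) for a joint probability table.

    `joint` is a list of rows whose entries sum to 1.
    This normalization is an assumption, not the paper's definition of P.
    """
    px = [sum(row) for row in joint]            # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (columns)
    hx, hy = entropy(px), entropy(py)
    hxy = entropy([p for row in joint for p in row])
    mutual_info = hx + hy - hxy
    total = hx + hy
    return mutual_info / total if total > 0 else 0.0


# Perfectly coupled binary variables: the ratio sits at its ceiling.
coupled = [[0.5, 0.0], [0.0, 0.5]]
# Independent binary variables: nothing is shared.
independent = [[0.25, 0.25], [0.25, 0.25]]

print(shared_fraction(coupled))      # 0.5
print(shared_fraction(independent))  # 0.0
```

Under this assumed definition, any noise between the two variables pushes the ratio below 0.5, which matches the qualitative picture in the abstract; it says nothing about the quantum case or about the paper's treatment of actions.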

[318] Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents

Ryan Liu, Dilip Arumugam, Cedegao E. Zhang, Sean Escola, Xaq Pitkow, Thomas L. Griffiths

Main category: cs.AI

TL;DR: Position paper arguing that modular language agent designs can be inspired by cognitive models and AI algorithms, proposing agent templates as blueprints for combining multiple LLMs into effective systems.

Details

Motivation: While individual LLMs are powerful, many complex problems require combining multiple LLMs. There's uncertainty about how to effectively integrate multiple LLMs into cohesive systems, and existing literature on cognitive models and AI algorithms offers potential blueprints for such modular designs.

Method: The paper formalizes the concept of “agent templates” that specify roles for individual LLMs and how their functionalities should be composed. It then surveys existing language agents in literature and identifies their underlying templates derived from cognitive models or AI algorithms.

Result: The analysis reveals that many existing language agent designs can be traced back to templates inspired by cognitive science and AI. These templates provide structured approaches for combining LLMs into effective, interpretable systems.

Conclusion: Agent templates inspired by cognitive science and AI represent a powerful tool for developing effective, interpretable language agents. These templates offer systematic blueprints for designing modular LLM systems that can tackle complex problems beyond individual LLM capabilities.

Abstract: While contemporary large language models (LLMs) are increasingly capable in isolation, there are still many difficult problems that lie beyond the abilities of a single LLM. For such tasks, there is still uncertainty about how best to take many LLMs as parts and combine them into a greater whole. This position paper argues that potential blueprints for designing such modular language agents can be found in the existing literature on cognitive models and artificial intelligence (AI) algorithms. To make this point clear, we formalize the idea of an agent template that specifies roles for individual LLMs and how their functionalities should be composed. We then survey a variety of existing language agents in the literature and highlight their underlying templates derived directly from cognitive models or AI algorithms. By highlighting these designs, we aim to call attention to agent templates inspired by cognitive science and AI as a powerful tool for developing effective, interpretable language agents.

[319] Agentic AI for Intent-driven Optimization in Cell-free O-RAN

Mohammad Hossein Shokouhi, Vincent W. S. Wong

Main category: cs.AI

TL;DR: Agentic AI framework for cell-free O-RAN using LLM-based agents to translate operator intents into optimization objectives and coordinate energy-saving actions through multiple specialized agents.

Details

Motivation: Existing O-RAN works focus on simple intents handled by independent agents, but complex intents requiring agent coordination remain unexplored. Need for scalable agentic AI framework for intent translation and optimization in cell-free networks.

Method: Proposes multi-agent framework with supervisor agent for intent translation, user weighting agent with memory module, O-RU management agent using DRL for energy-saving, and monitoring agent. Uses parameter-efficient fine-tuning (PEFT) to enable single LLM to serve multiple agents.

Result: Framework reduces active O-RUs by 41.93% in energy-saving mode compared to baselines. PEFT reduces memory usage by 92% compared to deploying separate LLM agents.

Conclusion: Agentic AI framework effectively handles complex intents in cell-free O-RAN through coordinated LLM-based agents, achieving significant energy savings and scalability through PEFT.

Abstract: Agentic artificial intelligence (AI) is emerging as a key enabler for autonomous radio access networks (RANs), where multiple large language model (LLM)-based agents reason and collaborate to achieve operator-defined intents. The open RAN (O-RAN) architecture enables the deployment and coordination of such agents. However, most existing works consider simple intents handled by independent agents, while complex intents that require coordination among agents remain unexplored. In this paper, we propose an agentic AI framework for intent translation and optimization in cell-free O-RAN. A supervisor agent translates the operator intents into an optimization objective and minimum rate requirements. Based on this information, a user weighting agent retrieves relevant prior experience from a memory module to determine the user priority weights for precoding. If the intent includes an energy-saving objective, then an open radio unit (O-RU) management agent will also be activated to determine the set of active O-RUs by using a deep reinforcement learning (DRL) algorithm. A monitoring agent measures and monitors the user data rates and coordinates with other agents to guarantee the minimum rate requirements are satisfied. To enhance scalability, we adopt a parameter-efficient fine-tuning (PEFT) method that enables the same underlying LLM to be used for different agents. Simulation results show that the proposed agentic AI framework reduces the number of active O-RUs by 41.93% when compared with three baseline schemes in energy-saving mode. Using the PEFT method, the proposed framework reduces the memory usage by 92% when compared with deploying separate LLM agents.

[320] Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention

Zhiming Wang, Jinwei He, Feng Lu

Main category: cs.AI

TL;DR: AHCE framework enables LLM agents to actively request structured human expert reasoning for specialized domain tasks, improving success rates significantly with minimal human intervention.

Details

Motivation: LLM agents struggle with specialized domains requiring long-tail knowledge not in their training data. Human expert guidance is often unstructured and unreliable, making direct integration into agent planning problematic.

Method: AHCE framework with Human Feedback Module (HFM) that learns a policy to treat human experts as interactive reasoning tools for on-demand Human-AI collaboration.

Result: In Minecraft experiments, increased task success rates by 32% on normal difficulty tasks and nearly 70% on highly difficult tasks with minimal human intervention.

Conclusion: Successfully augmenting agents requires learning how to request expert reasoning, moving beyond simple requests for help, enabling effective Human-AI collaboration in specialized domains.

Abstract: Large Language Model (LLM) based agents excel at general reasoning but often fail in specialized domains where success hinges on long-tail knowledge absent from their training data. While human experts can provide this missing knowledge, their guidance is often unstructured and unreliable, making its direct integration into an agent’s plan problematic. To address this, we introduce AHCE (Active Human-Augmented Challenge Engagement), a framework for on-demand Human-AI collaboration. At its core, the Human Feedback Module (HFM) employs a learned policy to treat the human expert as an interactive reasoning tool. Extensive experiments in Minecraft demonstrate the framework’s effectiveness, increasing task success rates by 32% on normal difficulty tasks and nearly 70% on highly difficult tasks, all with minimal human intervention. Our work demonstrates that successfully augmenting agents requires learning how to request expert reasoning, moving beyond simple requests for help.

[321] CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Umid Suleymanov, Rufiz Bayramov, Suad Gafarli, Seljan Musayeva, Taghi Mammadov, Aynur Akhundlu, Murat Kantarcioglu

Main category: cs.AI

TL;DR: CourtGuard: A retrieval-augmented multi-agent framework that reimagines LLM safety evaluation as evidentiary debate, enabling zero-shot adaptability to new governance rules without model retraining.

Details

Motivation: Current LLM safety mechanisms rely on static fine-tuned classifiers that suffer from adaptation rigidity - they cannot enforce new governance rules without expensive retraining. There's a need for more flexible, interpretable safety frameworks that can adapt to evolving regulatory requirements.

Method: CourtGuard uses a retrieval-augmented multi-agent framework that orchestrates an adversarial debate grounded in external policy documents. It decouples safety logic from model weights by creating an evidentiary debate system where agents argue based on retrieved policy evidence.

Result: Achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Demonstrates zero-shot adaptability to out-of-domain tasks (90% accuracy on Wikipedia Vandalism) and enables automated data curation and auditing of nine novel adversarial attack datasets.

Conclusion: Decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance. The framework provides zero-shot adaptability to new policies without retraining.

Abstract: Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity, the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.

[322] Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song, Kenji Kawaguchi

Main category: cs.AI

TL;DR: A framework called Selective Strategy Retrieval (SSR) improves mathematical reasoning by selectively retrieving and combining strategies based on their executability for target models, addressing the gap between strategy usage and executability.

Details

Motivation: Example-based guidance for mathematical reasoning is unstable across problems and models, even when guidance is correct and relevant. This instability stems from a gap between strategy usage (whether a strategy appears in successful solutions) and strategy executability (whether the strategy remains effective when used as guidance for a target model).

Method: Through controlled analysis of human-written and model-generated solutions, the authors identify systematic differences between human- and model-derived strategies. They propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals.

Result: SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance across multiple mathematical reasoning benchmarks, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models.

Conclusion: The paper demonstrates that explicitly modeling strategy executability through selective retrieval and combination leads to more effective guidance for mathematical reasoning, addressing the instability of example-based guidance.

Abstract: Example-based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models, even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between strategy usage (whether a reasoning strategy appears in successful solutions) and strategy executability (whether the strategy remains effective when instantiated as guidance for a target model). Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, leading to complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models. Code and benchmark are publicly available at: https://github.com/lwd17/strategy-execute-pipeline.

[323] Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach

Jodi M. Casabianca, Maggie Beiting-Parrish

Main category: cs.AI

TL;DR: Integrates psychometric rater models into AI evaluation pipelines to improve reliability of human judgments by correcting for systematic rater biases like severity and centrality effects.

Details

Motivation: Human evaluations are central to AI training and assessment but are rarely treated as measurements subject to systematic error, leading to unreliable conclusions from raw ratings.

Method: Uses psychometric rater models, particularly the multi-faceted Rasch model from item response theory, to separate true output quality from rater behavior effects like severity and centrality biases.

Result: Demonstrated on OpenAI summarization dataset that adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance.

Conclusion: Incorporating psychometric modeling enables more principled, transparent use of human evaluation data, leading to more robust and construct-aligned AI development practices based on adjusted scores rather than raw ratings.

Abstract: Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects, severity and centrality, that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
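
Illustrative sketch (not the authors' code): the many-facet Rasch model in its rating-scale form sets the log-odds of adjacent rating categories to quality minus rater severity minus a step threshold. The Python snippet below evaluates only that forward model, to show how a severe rater shifts the expected rating of the same output. The parameter values and the function names are hypothetical, and estimating the parameters from real ratings would need a fitting routine that is not shown.

```python
import math


def category_probabilities(quality, severity, thresholds):
    """Rating-category probabilities under a many-facet Rasch model.

    Adjacent-category log-odds: log(P_k / P_{k-1}) = quality - severity - thresholds[k-1].
    Categories run from 0 to len(thresholds).
    """
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + quality - severity - tau)
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rating(quality, severity, thresholds):
    """Mean rating implied by the category probabilities."""
    probs = category_probabilities(quality, severity, thresholds)
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical step thresholds for a 0-3 rating scale.
thresholds = [-1.0, 0.0, 1.0]

# The same output (quality 0.5) scored by a lenient and a severe rater.
lenient = expected_rating(quality=0.5, severity=-1.0, thresholds=thresholds)
severe = expected_rating(quality=0.5, severity=1.0, thresholds=thresholds)
print(lenient, severe)
```

The gap between the two expected ratings is the kind of rater effect that an adjusted score is meant to remove: once severity is estimated, quality can be compared across outputs rated by different people.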

[324] SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

Sanjay Kariyappa, G. Edward Suh

Main category: cs.AI

TL;DR: SideQuest: Using the reasoning model itself to compress KV cache for agentic tasks, reducing token usage by 65% with minimal accuracy loss.

Details

Motivation: Agentic tasks require multi-hop reasoning across multiple documents, causing LLM context to be dominated by retrieval tokens, leading to rapid memory growth and degraded decode performance. Existing KV cache compression heuristics fail for multi-step reasoning models.

Method: SideQuest leverages the Large Reasoning Model (LRM) itself to perform KV cache compression by reasoning about token usefulness. It frames compression as an auxiliary task executed in parallel to main reasoning to prevent management tokens from polluting model memory. Trained with only 215 samples.

Result: Reduces peak token usage by up to 65% on agentic tasks with minimal degradation in accuracy, outperforming heuristic-based KV cache compression techniques.

Conclusion: SideQuest effectively addresses KV cache explosion in agentic reasoning tasks by using the model’s own reasoning capabilities for compression, achieving significant memory savings with minimal accuracy trade-off.

Abstract: Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by tokens from external retrieval, causing memory usage to grow rapidly and limiting decode performance. While several KV cache compression techniques exist for long-context inputs, we find that existing heuristics fail to support multi-step reasoning models effectively. We address this challenge with SideQuest – a novel approach that leverages the Large Reasoning Model (LRM) itself to perform KV cache compression by reasoning about the usefulness of tokens in its context. To prevent the tokens associated with this management process from polluting the model’s memory, we frame KV cache compression as an auxiliary task executed in parallel to the main reasoning task. Our evaluations, using a model trained with just 215 samples, show that SideQuest reduces peak token usage by up to 65% on agentic tasks with minimal degradation in accuracy, outperforming heuristic-based KV cache compression techniques.

[325] MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Zhiheng Song, Jingshuai Zhang, Chuan Qin, Chao Wang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu, Hengshu Zhu

Main category: cs.AI

TL;DR: MobilityBench: A scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios using anonymized user queries and deterministic API-replay sandbox.

Details

Motivation: Systematic evaluation of LLM-based route-planning agents is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility in real-world mobility settings.

Method: Constructed benchmark from large-scale anonymized real user queries from Amap covering multiple cities worldwide. Designed deterministic API-replay sandbox to eliminate environmental variance. Proposed multi-dimensional evaluation protocol focusing on outcome validity with assessments of instruction understanding, planning, tool use, and efficiency.

Result: Current LLM-based agents perform competently on basic information retrieval and route planning tasks, but struggle significantly with preference-constrained route planning, showing room for improvement in personalized mobility applications.

Conclusion: MobilityBench enables reproducible evaluation of LLM-based route-planning agents, revealing current limitations in handling personalized constraints and preferences in real-world mobility scenarios.

Abstract: Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.
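
Illustrative sketch (not the MobilityBench toolkit): one common way to make tool calls deterministic is to record each response once and replay it by a key derived from the request. The Python snippet below shows that idea under stated assumptions; the class name `ReplaySandbox`, the `route` endpoint, and the recorded payload are all hypothetical and do not reflect the released code or any real mapping API.

```python
import hashlib
import json


class ReplaySandbox:
    """Serves recorded API responses so repeated runs see identical results."""

    def __init__(self):
        self._recordings = {}

    @staticmethod
    def _key(endpoint, params):
        # Canonical JSON (sorted keys) so parameter order does not change the key.
        payload = json.dumps({"endpoint": endpoint, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, endpoint, params, response):
        """Store the response observed for this exact request."""
        self._recordings[self._key(endpoint, params)] = response

    def call(self, endpoint, params):
        """Replay a stored response; fail loudly instead of hitting a live service."""
        key = self._key(endpoint, params)
        if key not in self._recordings:
            raise KeyError(f"no recorded response for {endpoint} {params}")
        return self._recordings[key]


sandbox = ReplaySandbox()
sandbox.record("route", {"origin": "A", "destination": "B"}, {"distance_km": 12.4})

# Same request with the parameters in a different order replays the same answer.
print(sandbox.call("route", {"destination": "B", "origin": "A"}))
```

Raising an error on an unrecorded request, rather than falling back to a live call, is the design choice that keeps the environment free of run-to-run variance in this sketch.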

[326] AHBid: An Adaptable Hierarchical Bidding Framework for Cross-Channel Advertising

Xinxin Yang, Yangyang Tang, Yikun Zhou, Yaolei Liu, Yun Li, Bo Yang

Main category: cs.AI

TL;DR: AHBid is an adaptable hierarchical bidding framework for multi-channel online advertising that combines generative planning with real-time control to optimize budget allocation across channels.

Details

Motivation: Current auto-bidding approaches in online advertising struggle with dynamic multi-channel environments. Optimization-based methods lack flexibility for changing market conditions, while reinforcement learning methods fail to capture historical dependencies and observational patterns within MDP constraints.

Method: AHBid integrates generative planning with real-time control using a high-level generative planner based on diffusion models for dynamic budget/constraint allocation, constraint enforcement mechanisms, trajectory refinement using historical data, and a control-based bidding algorithm combining historical knowledge with real-time information.

Result: Extensive experiments on large-scale offline datasets and online A/B tests show AHBid achieves a 13.57% increase in overall return compared to existing baselines.

Conclusion: AHBid effectively addresses limitations of current auto-bidding approaches by combining generative planning with real-time control, demonstrating superior performance in dynamic multi-channel advertising environments.

Abstract: In online advertising, the inherent complexity and dynamic nature of advertising environments necessitate the use of auto-bidding services to assist advertisers in bid optimization. This complexity is further compounded in multi-channel scenarios, where effective allocation of budgets and constraints across channels with distinct behavioral patterns becomes critical for optimizing return on investment. Current approaches predominantly rely on either optimization-based strategies or reinforcement learning techniques. However, optimization-based methods lack flexibility in adapting to dynamic market conditions, while reinforcement learning approaches often struggle to capture essential historical dependencies and observational patterns within the constraints of Markov Decision Process frameworks. To address these limitations, we propose AHBid, an Adaptable Hierarchical Bidding framework that integrates generative planning with real-time control. The framework employs a high-level generative planner based on diffusion models to dynamically allocate budgets and constraints by effectively capturing historical context and temporal patterns. We introduce a constraint enforcement mechanism to ensure compliance with specified constraints, along with a trajectory refinement mechanism that enhances adaptability to environmental changes through the utilization of historical data. The system further incorporates a control-based bidding algorithm that synergistically combines historical knowledge with real-time information, significantly improving both adaptability and operational efficacy. Extensive experiments conducted on large-scale offline datasets and through online A/B tests demonstrate the effectiveness of AHBid, yielding a 13.57% increase in overall return compared to existing baselines.

[327] Toward Personalized LLM-Powered Agents: Foundations, Evaluation, and Future Directions

Yue Xu, Qian Chen, Zizhan Ma, Dongrui Liu, Wenxuan Wang, Xiting Wang, Li Xiong, Wenjie Wang

Main category: cs.AI

TL;DR: Survey paper on personalized LLM-powered agents, organizing literature around four components: profile modeling, memory, planning, and action execution, with focus on long-term user adaptation.

Details

Motivation: As LLM-powered agents operate over extended interaction horizons, their effectiveness increasingly depends on adapting behavior to individual users and maintaining continuity across time. In such long-term, user-dependent settings, personalization must permeate the entire decision pipeline rather than remain confined to surface-level generation.

Method: Capability-oriented review organizing literature around four interdependent components: profile modeling (representing user signals), memory (storing and retrieving user information), planning (making decisions based on user context), and action execution (implementing personalized actions). Analyzes how user signals are represented, propagated, and utilized across components.

Result: Provides a structured framework for understanding and designing personalized LLM-powered agents, synthesizes representative methods, analyzes cross-component interactions and design trade-offs, examines evaluation metrics and benchmarks, and summarizes application scenarios from general assistance to specialized domains.

Conclusion: Charts roadmap toward more user-aligned, adaptive, robust, and deployable agentic systems, accelerating progress from prototype personalization to scalable real-world assistants through systematic understanding of personalized agent components and their interactions.

Abstract: Large language models have enabled agents that reason, plan, and interact with tools and environments to accomplish complex tasks. As these agents operate over extended interaction horizons, their effectiveness increasingly depends on adapting behavior to individual users and maintaining continuity across time, giving rise to personalized LLM-powered agents. In such long-term, user-dependent settings, personalization permeates the entire decision pipeline rather than remaining confined to surface-level generation. This survey provides a capability-oriented review of personalized LLM-powered agents. We organize the literature around four interdependent components: profile modeling, memory, planning, and action execution. Using this taxonomy, we synthesize representative methods and analyze how user signals are represented, propagated, and utilized, highlighting cross-component interactions and recurring design trade-offs. We further examine evaluation metrics and benchmarks tailored to personalized agents, summarize application scenarios spanning general assistance to specialized domains, and outline future directions for research and deployment. By offering a structured framework for understanding and designing personalized LLM-powered agents, this survey charts a roadmap toward more user-aligned, adaptive, robust, and deployable agentic systems, accelerating progress from prototype personalization to scalable real-world assistants.

[328] Knob: A Physics-Inspired Gating Interface for Interpretable and Controllable Neural Dynamics

Siyu Jiang, Sanshuai Cui, Hui Zeng

Main category: cs.AI

TL;DR: Knob: A control-theoretic framework for dynamic neural network calibration that maps neural gating to second-order mechanical systems, enabling human operators to tune model behavior through physical parameters like damping ratio and natural frequency.

Details

Motivation: Existing calibration methods are static and post-hoc, neglecting the dynamic nature of real-world inference and lacking intuitive interfaces for human operators to adjust model behavior under shifting conditions.

Method: Connects deep learning with control theory by mapping neural gating dynamics to second-order mechanical systems. Uses logit-level convex fusion as input-adaptive temperature scaling, and imposes second-order dynamics (Knob-ODE) for dual-mode inference: standard i.i.d. processing and state-preserving processing for continuous streams.

Result: Experiments on CIFAR-10-C validate the calibration mechanism and show that in Continuous Mode, gate responses exhibit standard second-order control signatures (step settling and low-pass attenuation), enabling predictable human-in-the-loop tuning.

Conclusion: Knob provides an exploratory architectural interface for dynamic neural network calibration with control-theoretic properties, allowing operators to tune “stability” and “sensitivity” through familiar physical analogues, though not claiming state-of-the-art calibration performance.

Abstract: Existing neural network calibration methods often treat calibration as a static, post-hoc optimization task. However, this neglects the dynamic and temporal nature of real-world inference. Moreover, existing methods do not provide an intuitive interface enabling human operators to dynamically adjust model behavior under shifting conditions. In this work, we propose Knob, a framework that connects deep learning with classical control theory by mapping neural gating dynamics to a second-order mechanical system. By establishing correspondences between physical parameters – damping ratio ($ζ$) and natural frequency ($ω_n$) – and neural gating, we create a tunable “safety valve”. The core mechanism employs a logit-level convex fusion, functioning as an input-adaptive temperature scaling. It tends to reduce model confidence particularly when model branches produce conflicting predictions. Furthermore, by imposing second-order dynamics (Knob-ODE), we enable dual-mode inference: standard i.i.d. processing for static tasks, and state-preserving processing for continuous streams. Our framework allows operators to tune “stability” and “sensitivity” through familiar physical analogues. This paper presents an exploratory architectural interface; we focus on demonstrating the concept and validating its control-theoretic properties rather than claiming state-of-the-art calibration performance. Experiments on CIFAR-10-C validate the calibration mechanism and demonstrate that, in Continuous Mode, the gate responses are consistent with standard second-order control signatures (step settling and low-pass attenuation), paving the way for predictable human-in-the-loop tuning.
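
The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact Knob-ODE, so the textbook second-order form g'' + 2ζω_n g' + ω_n² g = ω_n² u is assumed here, and the class `SecondOrderGate`, the target signal `u`, the step size `dt`, and the two example branches are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(logits_a, logits_b, gate):
    """Convex combination of two branches' logits.

    The gate is clamped to [0, 1] so the fusion stays convex even if the
    gate dynamics overshoot (an assumption of this sketch).
    """
    gate = min(1.0, max(0.0, gate))
    return [gate * a + (1.0 - gate) * b for a, b in zip(logits_a, logits_b)]


class SecondOrderGate:
    """Gate state g driven toward a target u by assumed second-order dynamics:
    g'' + 2*zeta*omega_n*g' + omega_n**2 * g = omega_n**2 * u
    """

    def __init__(self, zeta, omega_n, dt=0.01):
        self.zeta = zeta          # damping ratio
        self.omega_n = omega_n    # natural frequency (rad/s)
        self.dt = dt              # integration step (hypothetical choice)
        self.g = 0.0              # gate value
        self.v = 0.0              # gate velocity

    def step(self, u):
        """Advance one semi-implicit Euler step and return the new gate value."""
        acc = self.omega_n ** 2 * (u - self.g) - 2.0 * self.zeta * self.omega_n * self.v
        self.v += self.dt * acc
        self.g += self.dt * self.v
        return self.g


# Two branches that disagree about the top class.
branch_a = [4.0, 0.0, 0.0]
branch_b = [0.0, 4.0, 0.0]
fused = fuse_logits(branch_a, branch_b, gate=0.5)
print("branch confidence:", max(softmax(branch_a)))
print("fused confidence: ", max(softmax(fused)))

# Step response of the gate for a lightly damped and a critically damped setting.
for zeta in (0.2, 1.0):
    gate = SecondOrderGate(zeta=zeta, omega_n=5.0)
    trajectory = [gate.step(1.0) for _ in range(2000)]
    print(f"zeta={zeta}: peak={max(trajectory):.3f}, final={trajectory[-1]:.3f}")
```

In this sketch, conflicting branches yield a flatter fused distribution, which matches the abstract's remark that confidence tends to drop under conflict. A lower damping ratio makes the gate overshoot its target before settling, which is the kind of step-settling signature an operator would tune.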

[329] RLHFless: Serverless Computing for Efficient RLHF

Rui Wei, Hanfei Yu, Shubham Jain, Yogarajan Sivakumar, Devesh Tiwari, Jian Li, Seung-Jong Park, Hao Wang

Main category: cs.AI

TL;DR: RLHFless: A serverless computing framework for efficient synchronous RLHF training that adapts to dynamic resource demands and reduces costs through pre-computation and workload balancing.

Details

Motivation: RLHF training faces efficiency challenges due to dynamic resource demands, expanding model sizes, and resource consumption. Existing serverful infrastructures struggle with fine-grained resource variability, causing idle time and resource wastage during synchronous RLHF training.

Method: Built on serverless computing environments, RLHFless adapts to dynamic resource demands, pre-computes shared prefixes to avoid repeated computation, uses cost-aware actor scaling to find optimal configurations, and efficiently assigns workloads to reduce intra-function imbalance and idle time.

Result: Experiments on physical testbeds and large-scale simulated clusters show RLHFless achieves up to 1.35x speedup and 44.8% cost reduction compared to state-of-the-art baselines.

Conclusion: RLHFless demonstrates that serverless computing can effectively address RLHF training efficiency challenges by adapting to dynamic resource demands and optimizing cost-performance tradeoffs.

Abstract: Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences. Recent models, such as DeepSeek-R1, have also shown RLHF’s potential to improve LLM reasoning on complex tasks. In RL, inference and training co-exist, creating dynamic resource demands throughout the workflow. Compared to traditional RL, RLHF further challenges training efficiency due to expanding model sizes and resource consumption. Several RLHF frameworks aim to balance flexible abstraction and efficient execution. However, they rely on serverful infrastructures, which struggle with fine-grained resource variability. As a result, during synchronous RLHF training, idle time between or within RL components often causes overhead and resource wastage. To address these issues, we present RLHFless, the first scalable training framework for synchronous RLHF, built on serverless computing environments. RLHFless adapts to dynamic resource demands throughout the RLHF pipeline, pre-computes shared prefixes to avoid repeated computation, and uses a cost-aware actor scaling strategy that accounts for response length variation to find sweet spots with lower cost and higher speed. In addition, RLHFless assigns workloads efficiently to reduce intra-function imbalance and idle time. Experiments on both physical testbeds and a large-scale simulated cluster show that RLHFless achieves up to 1.35x speedup and 44.8% cost reduction compared to the state-of-the-art baseline.

[330] Generative Data Transformation: From Mixed to Unified Data

Jiaqing Zhang, Mingjia Yin, Hao Wang, Yuxin Tian, Yuyang Ye, Yawen Li, Wei Guo, Yong Liu, Enhong Chen

Main category: cs.AI

TL;DR: Taesar is a data-centric framework for cross-domain sequential recommendation that uses contrastive decoding to align and regenerate target-domain sequences from auxiliary domain data, improving recommendation performance without complex model architectures.

Details

Motivation: Cross-domain recommendation faces challenges of data sparsity and cold start, but existing model-centric approaches with complex architectures struggle to capture subtle cross-domain dependencies and suffer from negative transfer effects.

Method: Taesar employs a contrastive decoding mechanism to adaptively encode cross-domain context into target-domain sequences, generating enriched datasets that enable standard sequential models to learn intricate dependencies without complex fusion architectures.

Result: Experiments show Taesar outperforms model-centric solutions and generalizes to various sequential models, effectively combining the strengths of data- and model-centric paradigms.

Conclusion: Taesar provides a data-centric alternative to complex model architectures for cross-domain sequential recommendation, addressing domain gaps and enabling better knowledge transfer through target-aligned sequence regeneration.

Abstract: Recommendation model performance is intrinsically tied to the quality, volume, and relevance of the training data. To address common challenges like data sparsity and cold start, recent research has leveraged data from multiple auxiliary domains to enrich information within the target domain. However, inherent domain gaps can degrade the quality of mixed-domain data, leading to negative transfer and diminished model performance. The prevailing model-centric paradigm, which relies on complex, customized architectures, struggles to capture the subtle, non-structural sequence dependencies across domains, leading to poor generalization and high demands on computational resources. To address these shortcomings, we propose Taesar, a data-centric framework for target-aligned sequential regeneration, which employs a contrastive decoding mechanism to adaptively encode cross-domain context into target-domain sequences, enabling standard models to learn intricate dependencies without complex fusion architectures. Experiments show Taesar outperforms model-centric solutions and generalizes to various sequential models. By generating enriched datasets, Taesar effectively combines the strengths of data- and model-centric paradigms. The code accompanying this paper is available at https://github.com/USTC-StarTeam/Taesar

[331] Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning

Qiannian Zhao, Chen Yang, Jinhao Jing, Yunke Zhang, Xuhui Ren, Lu Yu, Shijie Zhang, Hongzhi Yin

Main category: cs.AI

TL;DR: EGPO introduces a metacognitive entropy calibration framework that integrates intrinsic uncertainty into reinforcement learning for large reasoning models, addressing the uncertainty-reward mismatch problem.

Details

Motivation: Current RLVR pipelines for large reasoning models rely almost exclusively on binary correctness signals and ignore the model's intrinsic uncertainty, creating an uncertainty-reward mismatch where high- and low-uncertainty solutions are treated equivalently, preventing effective reasoning optimization.

Method: EGPO estimates per-sample uncertainty using a zero-overhead entropy proxy from token-level likelihoods and aligns it with extrinsic correctness through asymmetric calibration that preserves correct reasoning while regulating overconfident failures, enabling stable uncertainty-aware policy optimization.

Result: Extensive experiments across multiple benchmarks demonstrate substantial and consistent improvements in reasoning performance, establishing a principled path for advancing large reasoning models through metacognitive entropy calibration.

Conclusion: EGPO successfully addresses the uncertainty-reward mismatch in RLVR for large reasoning models by integrating intrinsic uncertainty, leading to enhanced reasoning performance through metacognitive entropy calibration.

Abstract: Large reasoning models (LRMs) have emerged as a powerful paradigm for solving complex real-world tasks. In practice, these models are predominantly trained via Reinforcement Learning with Verifiable Rewards (RLVR), yet most existing outcome-only RLVR pipelines rely almost exclusively on a binary correctness signal and largely ignore the model’s intrinsic uncertainty. We term this discrepancy the uncertainty-reward mismatch, under which high- and low-uncertainty solutions are treated equivalently, preventing the policy from “Know What You Know” and impeding the shift from optimizing for correct answers to optimizing effective reasoning paths. This limitation is especially critical in reasoning-centric tasks such as mathematics and question answering, where performance hinges on the quality of the model’s internal reasoning process rather than mere memorization of final answers. To address this, we propose EGPO, a metacognitive entropy calibration framework that explicitly integrates intrinsic uncertainty into RLVR for enhancing LRMs. EGPO estimates per-sample uncertainty using a zero-overhead entropy proxy derived from token-level likelihoods and aligns it with extrinsic correctness through an asymmetric calibration mechanism that preserves correct reasoning while selectively regulating overconfident failures, thereby enabling stable and uncertainty-aware policy optimization. Moreover, EGPO recovers informative learning signals from otherwise degenerate group-based rollouts without modifying the verifier or reward definition. Extensive experiments across multiple benchmarks demonstrate that the proposed EGPO leads to substantial and consistent improvements in reasoning performance, establishing a principled path for advancing LRMs through metacognitive entropy calibration.
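
The idea of recovering a learning signal from a degenerate rollout group can be sketched as follows. The abstract does not give EGPO's formulas, so everything here is an assumption made for illustration: the entropy proxy is taken to be the mean negative log-likelihood of the sampled tokens, the asymmetric rule `calibrated_reward` (full reward for correct answers, a confidence-scaled penalty for wrong ones) is hypothetical, and the group normalization follows the common mean/standard-deviation scheme used in group-based policy optimization.

```python
import math
from statistics import mean, pstdev


def entropy_proxy(token_logprobs):
    """Mean negative log-likelihood of the sampled tokens (higher = less certain).

    Uses only log-probabilities already produced during sampling, so it needs
    no extra forward pass.
    """
    return -mean(token_logprobs)


def calibrated_reward(correct, uncertainty, penalty_scale=1.0):
    """Hypothetical asymmetric rule: correct answers keep the full reward,
    wrong answers are penalized more when the model was confident."""
    if correct:
        return 1.0
    confidence = math.exp(-uncertainty)  # in (0, 1]; 1 means fully confident
    return -penalty_scale * confidence


def group_advantages(rewards):
    """Group-normalized advantages; all zeros when every reward is identical."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# A degenerate group: both rollouts are wrong, one confident and one unsure.
rollout_logprobs = [[-0.1, -0.1], [-2.0, -2.0]]
uncertainties = [entropy_proxy(lp) for lp in rollout_logprobs]

binary_rewards = [0.0, 0.0]
calibrated = [calibrated_reward(False, u) for u in uncertainties]

print("binary advantages:    ", group_advantages(binary_rewards))
print("calibrated advantages:", group_advantages(calibrated))
```

With binary rewards the all-wrong group produces zero advantages and therefore no gradient. Under the assumed calibration the confident failure receives the lower advantage, so the group still carries a usable signal.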

[332] Decomposing Physician Disagreement in HealthBench

Satya Borgohain, Roy Mariathas

Main category: cs.AI

TL;DR: Analysis of physician disagreement patterns in medical AI evaluation reveals most variance is case-specific and not explained by observable features, with disagreement highest on borderline cases and reducible uncertainty more than doubling disagreement odds.

Details

Motivation: To understand the sources of physician disagreement in medical AI evaluation and identify what observable features can explain this variance, with implications for improving evaluation design.

Method: Decomposed physician disagreement in the HealthBench medical AI evaluation dataset using variance analysis, examined effects of rubric identity, physician identity, metadata labels, normative rubric language, medical specialty, surface features, embeddings, and completion quality on disagreement patterns.

Result: 81.8% of disagreement variance is case-level residual unexplained by observable features; disagreement follows an inverted-U with completion quality (AUC=0.689); reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR=2.55) while irreducible uncertainty has no effect (OR=1.01).

Conclusion: Agreement ceiling in medical AI evaluation is largely structural, but the dissociation between reducible and irreducible uncertainty suggests closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not, pointing toward actionable evaluation design improvements.

Abstract: We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant 81.8% case-level residual is not reduced by HealthBench’s metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no effect (OR = 1.01, p = 0.90), though even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not, pointing toward actionable evaluation design improvements.
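
Because the reported effects are odds ratios, they describe a change in odds rather than in probability. The short sketch below converts an odds ratio into a probability shift. The 20% baseline disagreement rate is a hypothetical value chosen for illustration and does not come from the paper; only the two odds ratios (2.55 and 1.01) are taken from the abstract.

```python
def apply_odds_ratio(baseline_probability, odds_ratio):
    """Return the probability obtained by multiplying the baseline odds by odds_ratio."""
    if not 0.0 < baseline_probability < 1.0:
        raise ValueError("baseline_probability must be strictly between 0 and 1")
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline: 20% of cases show disagreement (not a figure from the paper).
baseline = 0.20
with_reducible = apply_odds_ratio(baseline, 2.55)    # OR reported for reducible uncertainty
with_irreducible = apply_odds_ratio(baseline, 1.01)  # OR reported for irreducible uncertainty
print(f"{baseline:.3f} -> {with_reducible:.3f} (reducible), {with_irreducible:.3f} (irreducible)")
```

At this assumed baseline, an odds ratio of 2.55 moves the disagreement probability from 20% to roughly 39%, so the odds more than double while the probability slightly less than doubles.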

[333] AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao

Main category: cs.AI

TL;DR: AMA-Bench evaluates LLM agent memory in real applications, revealing limitations of current memory systems and proposing AMA-Agent with causality graphs and tool-augmented retrieval for better performance.

Details

Motivation: Current agent memory benchmarks focus on human-agent dialogue, but real applications involve continuous machine-generated agent-environment interactions. There's a gap between practical needs and evaluation standards for long-horizon memory in autonomous agents.

Method: Introduces AMA-Bench with real-world agentic trajectories with expert-curated QA and synthetic trajectories with rule-based QA. Proposes AMA-Agent memory system with causality graph and tool-augmented retrieval to address limitations of existing systems.

Result: AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing strongest memory system baselines by 11.16%. Shows existing systems underperform due to lack of causality/objective information and lossy similarity-based retrieval.

Conclusion: AMA-Bench provides better evaluation for agent memory in real applications, and AMA-Agent’s causality graph and tool-augmented retrieval effectively address limitations of current memory systems for long-horizon tasks.

Abstract: Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.

[334] ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making

Yusuke Watanabe, Yohei Kobashi, Takeshi Kojima, Yusuke Iwasawa, Yasushi Okuno, Yutaka Matsuo

Main category: cs.AI

TL;DR: LLMs struggle with clinical decision-making under incomplete information, failing to recognize when information is sufficient for judgment vs when abstention is needed, despite having correct scoring knowledge.

Details

Motivation: Clinical decisions often require working with incomplete information, and experts must determine whether available information is sufficient for judgment. Premature conclusions and unnecessary abstention both compromise patient safety. Current LLM benchmarks don't adequately evaluate this critical capability.

Method: Developed ClinDet-Bench, a benchmark based on clinical scoring systems that decomposes incomplete-information scenarios into determinable and undeterminable conditions. The benchmark requires considering all hypotheses about missing information (including unlikely ones) and verifying whether conclusions hold across them.

Result: Recent LLMs fail to identify determinability under incomplete information, producing both premature judgments and excessive abstention, despite correctly explaining underlying scoring knowledge and performing well under complete information.

Conclusion: Existing benchmarks are insufficient to evaluate LLM safety in clinical settings. ClinDet-Bench provides a framework for evaluating determinability recognition, leading to appropriate abstention, with potential applicability to medicine and other high-stakes domains.

Abstract: Clinical decisions are often required under incomplete information. Clinical experts must identify whether available information is sufficient for judgment, as both premature conclusion and unnecessary abstention can compromise patient safety. To evaluate this capability of large language models (LLMs), we developed ClinDet-Bench, a benchmark based on clinical scoring systems that decomposes incomplete-information scenarios into determinable and undeterminable conditions. Identifying determinability requires considering all hypotheses about missing information, including unlikely ones, and verifying whether the conclusion holds across them. We find that recent LLMs fail to identify determinability under incomplete information, producing both premature judgments and excessive abstention, despite correctly explaining the underlying scoring knowledge and performing well under complete information. These findings suggest that existing benchmarks are insufficient to evaluate the safety of LLMs in clinical settings. ClinDet-Bench provides a framework for evaluating determinability recognition, leading to appropriate abstention, with potential applicability to medicine and other high-stakes domains, and is publicly available.
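
The determinability criterion described above (enumerate every hypothesis about the missing information and check whether the conclusion is the same under all of them) can be sketched for a threshold-based score. The scoring system, item names, point values, and threshold below are hypothetical and do not correspond to any real clinical score or to the benchmark's data.

```python
from itertools import product


def classify(total, threshold):
    """Map a total score to a risk category using a single cut-off."""
    return "high risk" if total >= threshold else "low risk"


def determinability(observed, item_values, threshold):
    """Decide whether the category is fixed regardless of the missing items.

    observed:    dict of item name -> points already known
    item_values: dict of item name -> tuple of all possible point values
    Returns ("determinable", category) or ("undeterminable", None).
    """
    missing = [name for name in item_values if name not in observed]
    base = sum(observed.values())
    outcomes = set()
    # Enumerate every completion of the missing items, including unlikely ones.
    for combo in product(*(item_values[name] for name in missing)):
        outcomes.add(classify(base + sum(combo), threshold))
    if len(outcomes) == 1:
        return ("determinable", outcomes.pop())
    return ("undeterminable", None)


# Hypothetical three-item score with a cut-off of 2 points.
ITEMS = {"item_a": (0, 1), "item_b": (0, 1), "item_c": (0, 2)}

print(determinability({"item_c": 2}, ITEMS, threshold=2))
print(determinability({"item_a": 1}, ITEMS, threshold=2))
print(determinability({"item_a": 0, "item_c": 0}, ITEMS, threshold=2))
```

In the first and third calls the category cannot change whatever the missing items turn out to be, so answering is appropriate. In the second call the completions straddle the cut-off, so abstention is the correct response.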

[335] MiroFlow: Towards High-Performance and Robust Open-Source Agent Framework for General Deep Research Tasks

Shiqian Su, Sen Xing, Xuan Dong, Muyan Zhong, Bin Wang, Xizhou Zhu, Yuntao Chen, Wenhai Wang, Yue Deng, Pengxiang Zhu, Ziyuan Liu, Tiantong Li, Jiaheng Yu, Zhe Chen, Lidong Bing, Jifeng Dai

Main category: cs.AI

TL;DR: MiroFlow is an open-source agent framework that enhances LLMs’ ability to handle complex real-world tasks through flexible orchestration, deep reasoning, and robust workflow execution, achieving SOTA across multiple agent benchmarks.

Details

Motivation: Standalone LLMs have plateaued in handling complex real-world tasks requiring external tool interaction. Existing agent frameworks suffer from naive workflows, unstable performance, limited benchmark support, and heavy reliance on costly commercial APIs.

Method: Proposes MiroFlow framework with three key components: 1) agent graph for flexible orchestration, 2) optional deep reasoning mode to enhance performance, and 3) robust workflow execution for stable and reproducible performance.

Result: Extensive experiments show MiroFlow consistently achieves state-of-the-art performance across multiple agent benchmarks including GAIA, BrowseComp-EN/ZH, HLE, xBench-DeepSearch, and notably FutureX.

Conclusion: MiroFlow serves as an accessible, reproducible, and comparable baseline for the research community, addressing limitations of existing agent frameworks while enhancing LLM capabilities for complex real-world tasks.

Abstract: Despite the remarkable progress of large language models (LLMs), the capabilities of standalone LLMs have begun to plateau when tackling real-world, complex tasks that require interaction with external tools and dynamic environments. Although recent agent frameworks aim to enhance model autonomy through tool integration and external interaction, they still suffer from naive workflows, unstable performance, limited support across diverse benchmarks and tasks, and heavy reliance on costly commercial APIs. In this work, we propose a high-performance and robust open-source agent framework, termed MiroFlow, which incorporates an agent graph for flexible orchestration, an optional deep reasoning mode to enhance performance, and a robust workflow execution to ensure stable and reproducible performance. Extensive experiments demonstrate that MiroFlow consistently achieves state-of-the-art performance across multiple agent benchmarks, including GAIA, BrowseComp-EN/ZH, HLE, xBench-DeepSearch, and notably FutureX. We hope it could serve as an easily accessible, reproducible, and comparable baseline for the deep research community.

[336] When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design

Soyoung Jung, Daehoo Yoon, Sung Gyu Koh, Young Hwan Kim, Yehan Ahn, Sung Park

Main category: cs.AI

TL;DR: A conceptual model for agentic AI that reframes behavior as integrating observable scenes, user-constructed meaning, and human behavior factors to enable principled judgment about when and how to intervene.

Details

Motivation: Agentic AI increasingly intervenes proactively based on contextual data but often lacks principled judgment about when, why, and whether to act, creating a gap in designing contextually sensitive systems.

Method: Proposes a conceptual model integrating Scene (observable situation), Context (user-constructed meaning), and Human Behavior Factors (determinants shaping behavioral likelihood), grounded in multidisciplinary perspectives from humanities, social sciences, HCI, and engineering.

Result: Derives five agent design principles: behavioral alignment, contextual sensitivity, temporal appropriateness, motivational calibration, and agency preservation to guide intervention depth, timing, intensity, and restraint.

Conclusion: The model and principles provide a foundation for designing agentic AI systems that act with contextual sensitivity and judgment in interactions, addressing the gap in principled intervention decision-making.

Abstract: Agentic AI increasingly intervenes proactively by inferring users’ situations from contextual data yet often fails for lack of principled judgment about when, why, and whether to act. We address this gap by proposing a conceptual model that reframes behavior as an interpretive outcome integrating Scene (observable situation), Context (user-constructed meaning), and Human Behavior Factors (determinants shaping behavioral likelihood). Grounded in multidisciplinary perspectives across the humanities, social sciences, HCI, and engineering, the model separates what is observable from what is meaningful to the user and explains how the same scene can yield different behavioral meanings and outcomes. To translate this lens into design action, we derive five agent design principles (behavioral alignment, contextual sensitivity, temporal appropriateness, motivational calibration, and agency preservation) that guide intervention depth, timing, intensity, and restraint. Together, the model and principles provide a foundation for designing agentic AI systems that act with contextual sensitivity and judgment in interactions.

[337] FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics

Yunhua Zhong, Yixuan Tang, Yifan Li, Jie Yang, Pan Liu, Jun Xia

Main category: cs.AI

TL;DR: FlexMS: A benchmark framework for evaluating deep learning models in mass spectrum prediction for chemical molecules, with analysis of various architectural factors and practical retrieval scenarios.

Details

Motivation: Mass spectrometry provides valuable fragmentation cues for chemical molecule identification in drug discovery and material science, but lack of experimental spectra hinders molecular identification, creating need for computational prediction approaches. Current deep learning models show promise but lack standardized benchmarks and evaluation methods.

Method: Created FlexMS benchmark framework that supports dynamic construction of diverse model architectures for mass spectrum prediction. Evaluates performance on preprocessed public datasets using multiple metrics, analyzing factors like structural diversity, hyperparameters, pretraining effects, metadata ablation, and cross-domain transfer learning.

Result: Provides insights into performance-influencing factors and practical guidance for model selection. Includes retrieval benchmarks simulating real-world identification scenarios that score potential matches based on predicted spectra.

Conclusion: FlexMS addresses the need for standardized benchmarking in mass spectrum prediction, enabling systematic evaluation of diverse deep learning architectures and providing practical guidance for model selection in chemical molecule identification tasks.

Abstract: The identification and property prediction of chemical molecules is of central importance in the advancement of drug discovery and material science, where tandem mass spectrometry gives valuable fragmentation cues in the form of mass-to-charge ratio peaks. However, the lack of experimental spectra hinders molecular identification and thus motivates computational approaches to spectrum prediction. Deep learning models appear promising for predicting molecular structure spectra, but overall assessment remains challenging as a result of the heterogeneity in methods and the lack of well-defined benchmarks. To address this, we create FlexMS, a benchmark framework for constructing and evaluating diverse model architectures in mass spectrum prediction. FlexMS supports the dynamic construction of numerous distinct combinations of model architectures, while assessing their performance on preprocessed public datasets using different metrics. In this paper, we provide insights into factors influencing performance, including the structural diversity of datasets, hyperparameters like learning rate and data sparsity, pretraining effects, metadata ablation settings and cross-domain transfer learning analysis. This provides practical guidance in choosing suitable models. Moreover, retrieval benchmarks simulate practical identification scenarios and score potential matches based on predicted spectra.

[338] DeepPresenter: Environment-Grounded Reflection for Agentic Presentation Generation

Hao Zheng, Guozhao Mo, Xinru Yan, Qianhao Yuan, Wenkai Zhang, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Main category: cs.AI

TL;DR: DeepPresenter is an agentic framework for presentation generation that adapts to user intents, enables feedback-driven refinement, and generalizes beyond scripted pipelines through environment-grounded reflection on perceptual artifacts.

Details

Motivation: Existing presentation agents rely on predefined workflows and fixed templates, lacking adaptability to diverse user intents and effective feedback-driven refinement capabilities.

Method: DeepPresenter autonomously plans, renders, and revises intermediate slide artifacts using environment-grounded reflection that conditions generation on perceptual artifact states (rendered slides) rather than internal reasoning traces.

Result: DeepPresenter achieves state-of-the-art performance on diverse presentation-generation scenarios, with the fine-tuned 9B model remaining highly competitive at substantially lower cost.

Conclusion: The framework demonstrates effective presentation generation through adaptive planning, environment-grounded reflection, and long-horizon refinement capabilities.

Abstract: Presentation generation requires deep content research, coherent visual design, and iterative refinement based on observation. However, existing presentation agents often rely on predefined workflows and fixed templates. To address this, we present DeepPresenter, an agentic framework that adapts to diverse user intents, enables effective feedback-driven refinement, and generalizes beyond a scripted pipeline. Specifically, DeepPresenter autonomously plans, renders, and revises intermediate slide artifacts to support long-horizon refinement with environmental observations. Furthermore, rather than relying on self-reflection over internal signals (e.g., reasoning traces), our environment-grounded reflection conditions the generation process on perceptual artifact states (e.g., rendered slides), enabling the system to identify and correct presentation-specific issues during execution. Results on the evaluation set covering diverse presentation-generation scenarios show that DeepPresenter achieves state-of-the-art performance, and the fine-tuned 9B model remains highly competitive at substantially lower cost. Our project is available at: https://github.com/icip-cas/PPTAgent

[339] The AI Research Assistant: Promise, Peril, and a Proof of Concept

Tan Bui-Thanh

Main category: cs.AI

TL;DR: AI-assisted discovery of novel error representations and bounds for Hermite quadrature rules through systematic human-AI collaboration, demonstrating both capabilities and limitations in mathematical research.

Details

Motivation: To investigate whether AI can truly contribute to creative mathematical research beyond routine calculations, and to understand the potential and limitations of human-AI collaboration in mathematical discovery through empirical evidence.

Method: Detailed case study using multiple AI assistants to extend results beyond manual work, with complete documentation of research workflow including human verification, mathematical intuition for problem formulation, and strategic direction.

Result: Successfully formulated and proved several theorems with AI assistance, discovering novel error representations and bounds for Hermite quadrature rules. AI excelled at algebraic manipulation, systematic proof exploration, literature synthesis, and LaTeX preparation, but required rigorous human verification at every step.

Conclusion: AI tools can meaningfully accelerate mathematical discovery when used with appropriate skepticism and verification protocols, but demand careful human oversight and deep domain expertise, revealing both remarkable capabilities and critical limitations in human-AI collaboration.

Abstract: Can artificial intelligence truly contribute to creative mathematical research, or does it merely automate routine calculations while introducing risks of error? We provide empirical evidence through a detailed case study: the discovery of novel error representations and bounds for Hermite quadrature rules via systematic human-AI collaboration. Working with multiple AI assistants, we extended results beyond what manual work achieved, formulating and proving several theorems with AI assistance. The collaboration revealed both remarkable capabilities and critical limitations. AI excelled at algebraic manipulation, systematic proof exploration, literature synthesis, and LaTeX preparation. However, every step required rigorous human verification, mathematical intuition for problem formulation, and strategic direction. We document the complete research workflow with unusual transparency, revealing patterns in successful human-AI mathematical collaboration and identifying failure modes researchers must anticipate. Our experience suggests that, when used with appropriate skepticism and verification protocols, AI tools can meaningfully accelerate mathematical discovery while demanding careful human oversight and deep domain expertise.

[340] Towards LLM-Empowered Knowledge Tracing via LLM-Student Hierarchical Behavior Alignment in Hyperbolic Space

Xingcheng Fu, Shengpeng Wang, Yisen Gao, Xianxian Li, Chunpei Li, Qingyun Sun, Dongran Yu

Main category: cs.AI

TL;DR: L-HAKT: A knowledge tracing framework using LLMs to parse question semantics, generate synthetic data, and model hierarchical knowledge structures in hyperbolic space for better cognitive state tracking.

Details

Motivation: Existing knowledge tracing methods using ID-based sequences or shallow textual features fail to capture hierarchical cognitive state evolution and individualized difficulty perception, limiting semantic modeling capabilities.

Method: 1) Teacher agent parses question semantics and constructs hierarchical knowledge dependencies; 2) Student agent simulates learning behaviors to generate synthetic data; 3) Contrastive learning between synthetic and real data in hyperbolic space; 4) Optimizing hyperbolic curvature to model tree-like knowledge hierarchies.

Result: Extensive experiments on four real-world educational datasets validate the effectiveness of the L-HAKT framework in knowledge tracing.

Conclusion: The proposed L-HAKT framework successfully addresses limitations of existing methods by leveraging LLMs for semantic understanding and hyperbolic space for hierarchical knowledge modeling.

Abstract: Knowledge Tracing (KT) diagnoses students’ concept mastery through continuous learning state monitoring in education. Existing methods primarily focus on studying behavioral sequences based on ID or textual information. While existing methods rely on ID-based sequences or shallow textual features, they often fail to capture (1) the hierarchical evolution of cognitive states and (2) individualized problem difficulty perception due to limited semantic modeling. Therefore, this paper proposes a Large Language Model Hyperbolic Aligned Knowledge Tracing (L-HAKT). First, the teacher agent deeply parses question semantics and explicitly constructs hierarchical dependencies of knowledge points; the student agent simulates learning behaviors to generate synthetic data. Then, contrastive learning is performed between synthetic and real data in hyperbolic space to reduce distribution differences in key features such as question difficulty and forgetting patterns. Finally, by optimizing hyperbolic curvature, we explicitly model the tree-like hierarchical structure of knowledge points, precisely characterizing differences in learning curve morphology for knowledge points at different levels. Extensive experiments on four real-world educational datasets validate the effectiveness of our Large Language Model Hyperbolic Aligned Knowledge Tracing (L-HAKT) framework.
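
To illustrate why hyperbolic space suits tree-like hierarchies, here is a minimal sketch of the geodesic distance in the unit Poincaré ball (curvature fixed at -1). The summary does not say which hyperbolic model or curvature parameterization L-HAKT uses, so the ball model, the function name, and the toy points are assumptions made for illustration only.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points of the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

# Hypothetical embedding: a root concept at the origin, two leaf concepts
# near the boundary of the ball.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
```

In this toy setup the leaf-to-leaf distance comes out close to the sum of the two root-to-leaf distances, which is the behavior of path lengths in a tree; in Euclidean space the same two leaves would be much closer to each other than the route through the root.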

[341] General Agent Evaluation

Elron Bandel, Asaf Yehudai, Lilach Eden, Yehoshua Sagron, Yotam Perlitz, Elad Venezian, Natalia Razinkov, Natan Ergas, Shlomit Shachor Ifergan, Segev Shlomov, Michal Jacovi, Leshem Choshen, Liat Ein-Dor, Yoav Katz, Michal Shmueli-Scheuer

Main category: cs.AI

TL;DR: Proposes a systematic evaluation framework for general-purpose agents across diverse environments, establishing the first Open General Agent Leaderboard to benchmark agent performance without domain-specific tuning.

Details

Motivation: Current agent evaluation is domain-specific and doesn't fairly assess general-purpose agents that should perform tasks in unfamiliar environments without specialized engineering. There's a lack of systematic evaluation for emerging general agents like OpenAI SDK Agent and Claude Code.

Method: Proposes conceptual principles for general-agent evaluation, a Unified Protocol for agent-benchmark integration, and Exgentic - a practical framework for evaluation. Benchmarks five prominent agent implementations across six environments as the first Open General Agent Leaderboard.

Result: General agents generalize across diverse environments, achieving performance comparable to domain-specific agents without any environment-specific tuning. The framework enables systematic comparison of agent capabilities.

Conclusion: Establishes a foundation for systematic research on general-purpose agents by releasing evaluation protocol, framework, and leaderboard, addressing the gap in fair evaluation of agents that operate without domain-specific engineering.

Abstract: The promise of general-purpose agents - systems that perform tasks in unfamiliar environments without domain-specific engineering - remains largely unrealized. Existing agents are predominantly specialized, and while emerging implementations like OpenAI SDK Agent and Claude Code hint at broader capabilities, no systematic evaluation of their general performance has been pursued. Current agentic benchmarks assume domain-specific integration, encoding task information in ways that preclude fair evaluation of general agents. This paper frames general-agent evaluation as a first-class research objective. We propose conceptual principles for such evaluation, a Unified Protocol enabling agent-benchmark integration, and Exgentic - a practical framework for general agent evaluation. We benchmark five prominent agent implementations across six environments as the first Open General Agent Leaderboard. Our experiments show that general agents generalize across diverse environments, achieving performance comparable to domain-specific agents without any environment-specific tuning. We release our evaluation protocol, framework, and leaderboard to establish a foundation for systematic research on general-purpose agents.

[342] FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning

Zehao Li, Hongwei Yu, Hao Jiang, Qiang Sheng, Yilong Xu, Baolong Bi, Yang Li, Zhenlong Yuan, Yujun Cai, Zhaoqi Wang

Main category: cs.AI

TL;DR: FactGuard is an agentic framework for video misinformation detection that uses MLLMs with iterative reasoning, external tool invocation, and two-stage training to improve detection accuracy and robustness.

Details

Motivation: Current MLLMs for video misinformation detection rely on fixed-depth inference and over-trust internally generated assumptions, especially when critical evidence is sparse, fragmented, or requires external verification. This leads to limitations in handling complex misinformation scenarios.

Method: FactGuard formulates verification as an iterative reasoning process using MLLMs. It assesses task ambiguity, selectively invokes external tools to acquire evidence, and progressively refines reasoning trajectories. Uses two-stage training: domain-specific agentic supervised fine-tuning followed by decision-aware reinforcement learning to optimize tool usage and calibrate risk-sensitive decisions.

Result: Extensive experiments on FakeSV, FakeTT, and FakeVV datasets demonstrate state-of-the-art performance, with excellent robustness and generalization capacity compared to existing methods.

Conclusion: FactGuard addresses key limitations of current MLLMs for video misinformation detection by introducing an agentic framework with iterative reasoning and external tool integration, achieving superior performance and robustness.

Abstract: Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixed-depth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external verification. To address these limitations, we propose FactGuard, an agentic framework for video misinformation detection that formulates verification as an iterative reasoning process built upon MLLMs. FactGuard explicitly assesses task ambiguity and selectively invokes external tools to acquire critical evidence, enabling progressive refinement of reasoning trajectories. To further strengthen this capability, we introduce a two-stage training strategy that combines domain-specific agentic supervised fine-tuning with decision-aware reinforcement learning to optimize tool usage and calibrate risk-sensitive decision making. Extensive experiments on FakeSV, FakeTT, and FakeVV demonstrate FactGuard’s state-of-the-art performance and validate its excellent robustness and generalization capacity.

[343] Certified Circuits: Stability Guarantees for Mechanistic Circuits

Alaa Anani, Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer

Main category: cs.AI

TL;DR: Certified Circuits framework provides provable stability guarantees for circuit discovery in neural networks through randomized data subsampling and abstention from unstable neurons, yielding more compact and accurate circuits that transfer better out-of-distribution.

Details

Motivation: Existing circuit discovery methods for neural network interpretability are brittle - circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts about whether they capture genuine concept understanding or dataset-specific artifacts.

Method: The framework wraps any black-box circuit discovery algorithm with randomized data subsampling to certify that circuit component inclusion decisions are invariant to bounded edit-distance perturbations of the concept dataset. Unstable neurons are abstained from, producing circuits with provable stability guarantees.

Result: On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy while using 45% fewer neurons, and remain reliable where baseline methods degrade. The circuits are more compact and better aligned with target concepts.

Conclusion: Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable, addressing the brittleness problem in neural network interpretability and enabling more reliable circuit discovery for debugging and auditing.

Abstract: Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits - minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts about whether they capture the concept or dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that circuit component inclusion decisions are invariant to bounded edit-distance perturbations of the concept dataset. Unstable neurons are abstained from, yielding circuits that are more compact and more accurate. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy while using 45% fewer neurons, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code will be released soon!
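
The wrap-and-subsample idea can be sketched roughly as follows. This is not the paper's certification procedure: it only shows a black-box discovery function being rerun on random subsamples, with neurons kept when the votes are near-unanimous and abstained from when the votes are mixed. The function names, the vote threshold, and the toy discovery rule are all hypothetical.

```python
import random

def stable_circuit(dataset, discover, n_runs=200, keep_frac=0.5,
                   threshold=0.9, seed=0):
    """Vote over random subsamples; abstain on neurons with mixed votes."""
    rng = random.Random(seed)
    k = max(1, int(keep_frac * len(dataset)))
    votes = {}
    for _ in range(n_runs):
        subsample = rng.sample(dataset, k)
        for neuron in discover(subsample):
            votes[neuron] = votes.get(neuron, 0) + 1
    included, abstained = set(), set()
    for neuron, count in votes.items():
        freq = count / n_runs
        if freq >= threshold:
            included.add(neuron)      # selected in almost every run
        elif freq > 1.0 - threshold:
            abstained.add(neuron)     # selection depends on the subsample
        # otherwise the neuron is rarely selected and is simply left out
    return included, abstained

def toy_discover(subsample):
    """Hypothetical black box: 'n1' appears only if example 7 is present."""
    circuit = {"n0"}
    if 7 in subsample:
        circuit.add("n1")
    return circuit

dataset = list(range(10))
included, abstained = stable_circuit(dataset, toy_discover)
```

Here `n0` is selected in every run and is kept, while `n1` hinges on a single example and lands in the abstained set, which mirrors the dataset-specific artifacts the abstract describes.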

[344] SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

Peiyao Xiao, Xiaogang Li, Chengliang Xu, Jiayi Wang, Ben Wang, Zichao Chen, Zeyu Wang, Kejun Yu, Yueqian Chen, Xulin Liu, Wende Xiao, Bing Zhao, Hu Wei

Main category: cs.AI

TL;DR: SPM-Bench is a PhD-level multimodal benchmark for scanning probe microscopy with automated data synthesis, Anchor-Gated Sieve extraction, hybrid cloud-local processing, and SIP-F1 scoring to evaluate LLM reasoning in complex scientific domains.

Details

Motivation: Current LLM benchmarks for scientific domains suffer from data contamination, insufficient complexity, and high human labor costs, creating pronounced gaps in evaluating specialized scientific reasoning capabilities.

Method: 1) Automated data synthesis pipeline using Anchor-Gated Sieve (AGS) to extract high-value image-text pairs from arXiv/journal papers (2023-2025). 2) Hybrid cloud-local architecture where VLMs return spatial coordinates for local high-fidelity cropping to save tokens. 3) Strict Imperfection Penalty F1 (SIP-F1) scoring to establish capability hierarchy and quantify model “personalities”.

Result: The pipeline achieves extreme token savings while maintaining high dataset purity, exposes true reasoning boundaries of current AI in complex physical scenarios, and establishes a generalizable paradigm for automated scientific data synthesis.

Conclusion: SPM-Bench provides a rigorous, automated framework for evaluating LLMs in specialized scientific domains, revealing their reasoning limitations and establishing a new paradigm for scientific benchmark creation.

Abstract: As LLMs have achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexity, and prohibitive human labor costs. Here we present SPM-Bench, an original, PhD-level multimodal benchmark specifically designed for scanning probe microscopy (SPM). We propose a fully automated data synthesis pipeline that ensures both high authority and low cost. By employing Anchor-Gated Sieve (AGS) technology, we efficiently extract high-value image-text pairs from arXiv and journal papers published between 2023 and 2025. Through a hybrid cloud-local architecture where VLMs return only spatial coordinates “llbox” for local high-fidelity cropping, our pipeline achieves extreme token savings while maintaining high dataset purity. To accurately and objectively evaluate the performance of the LLMs, we introduce the Strict Imperfection Penalty F1 (SIP-F1) score. This metric not only establishes a rigorous capability hierarchy but also, for the first time, quantifies model “personalities” (Conservative, Aggressive, Gambler, or Wise). By correlating these results with model-reported confidence and perceived difficulty, we expose the true reasoning boundaries of current AI in complex physical scenarios. These insights establish SPM-Bench as a generalizable paradigm for automated scientific data synthesis.

[345] Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots

Dimitrios P. Panagoulias, Evangelia-Aikaterini Tsichrintzi, Georgios Savvidis, Evridiki Tsoureli-Nikita

Main category: cs.AI

TL;DR: A diagnostic alignment framework for clinical AI that preserves AI-generated reports as immutable inference states and systematically compares them with physician-validated outcomes using vision-enabled LLMs and structured validation metrics.

Details

Motivation: Human-in-the-loop validation is essential in safety-critical clinical AI, but the transition between initial model inference and expert correction is rarely analyzed as a structured signal. Current binary lexical evaluation underestimates clinically meaningful alignment.

Method: Introduces a diagnostic alignment framework with: 1) AI-generated image-based reports preserved as immutable inference states, 2) Vision-enabled large language model for image understanding, 3) BERT-based medical entity extraction, 4) Sequential Language Model Inference (SLMI) for domain-consistent refinement, and 5) Four-level concordance framework (exact match, semantic similarity-adjusted, cross-category alignment, comprehensive concordance).

Result: Evaluation on 21 dermatological cases showed: exact agreement 71.4%, semantic similarity-adjusted rate unchanged (t=0.60), structured cross-category and differential overlap analysis yielded 100% comprehensive concordance (95% CI: [83.9%, 100%]). No cases demonstrated complete diagnostic divergence.

Conclusion: Binary lexical evaluation substantially underestimates clinically meaningful alignment. Modeling expert validation as a structured transformation enables signal-aware quantification of correction dynamics and supports traceable, human-aligned evaluation of image-based clinical decision support systems.

Abstract: Human-in-the-loop validation is essential in safety-critical clinical AI, yet the transition between initial model inference and expert correction is rarely analyzed as a structured signal. We introduce a diagnostic alignment framework in which the AI-generated image-based report is preserved as an immutable inference state and systematically compared with the physician-validated outcome. The inference pipeline integrates a vision-enabled large language model, BERT-based medical entity extraction, and a Sequential Language Model Inference (SLMI) step to enforce domain-consistent refinement prior to expert review. Evaluation on 21 dermatological cases (21 complete AI-physician pairs) employed a four-level concordance framework comprising exact primary match rate (PMR), semantic similarity-adjusted rate (AMR), cross-category alignment, and Comprehensive Concordance Rate (CCR). Exact agreement reached 71.4% and remained unchanged under semantic similarity (t = 0.60), while structured cross-category and differential overlap analysis yielded 100% comprehensive concordance (95% CI: [83.9%, 100%]). No cases demonstrated complete diagnostic divergence. These findings show that binary lexical evaluation substantially underestimates clinically meaningful alignment. Modeling expert validation as a structured transformation enables signal-aware quantification of correction dynamics and supports traceable, human-aligned evaluation of image-based clinical decision support systems.

[346] RepSPD: Enhancing SPD Manifold Representation in EEGs via Dynamic Graphs

Haohui Jia, Zheng Chen, Lingwei Zhu, Xu Cao, Yasuko Matsubara, Takashi Matsubara, Yasushi Sakurai

Main category: cs.AI

TL;DR: RepSPD: A geometric deep learning model for EEG decoding using Riemannian manifold cross-attention and functional connectivity features

Details

Motivation: Current SPD-based EEG methods focus on statistical aggregation but neglect frequency-specific synchronization and local topological structures of brain regions, limiting their ability to capture complex brain connectivity patterns.

Method: Proposes RepSPD with cross-attention on Riemannian manifold to modulate SPD geometric attributes with graph-derived functional connectivity features, plus global bidirectional alignment to reshape tangent-space embeddings and mitigate geometric distortions.

Result: Extensive experiments show RepSPD significantly outperforms existing EEG representation methods with superior robustness and generalization capabilities.

Conclusion: RepSPD effectively addresses limitations of current SPD-based EEG methods by incorporating functional connectivity and geometric consistency, advancing EEG decoding for neuroscience and clinical applications.

Abstract: Decoding brain activity from electroencephalography (EEG) is crucial for neuroscience and clinical applications. Among recent advances in deep learning for EEG, geometric learning stands out, as its theoretical underpinnings on symmetric positive definite (SPD) matrices allow structural connectivity to be analyzed in a physics-grounded manner. However, current SPD-based methods focus predominantly on statistical aggregation of EEGs, with frequency-specific synchronization and local topological structures of brain regions neglected. Given this, we propose RepSPD, a novel geometric deep learning (GDL)-based model. RepSPD implements a cross-attention mechanism on the Riemannian manifold to modulate the geometric attributes of SPD with graph-derived functional connectivity features. On top of this, we introduce a global bidirectional alignment strategy to reshape tangent-space embeddings, mitigating geometric distortions caused by curvature and thereby enhancing geometric consistency. Extensive experiments demonstrate that our proposed framework significantly outperforms existing EEG representation methods, exhibiting superior robustness and generalization capabilities.
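
For readers unfamiliar with tangent-space embeddings of SPD matrices, the standard construction under the affine-invariant Riemannian metric is written out below. The summary does not state which metric RepSPD uses, so this choice is an assumption; $P$ denotes a reference SPD matrix, $Q$ the SPD matrix being embedded, and $\log$ the matrix logarithm.

```latex
% Logarithm map: projects Q onto the tangent space at the reference point P
\operatorname{Log}_{P}(Q) = P^{1/2} \, \log\!\left( P^{-1/2} \, Q \, P^{-1/2} \right) P^{1/2}

% Geodesic distance induced by the same metric
d(P, Q) = \left\lVert \log\!\left( P^{-1/2} \, Q \, P^{-1/2} \right) \right\rVert_{F}
```

The tangent space is a flat approximation of the manifold around $P$. Distances between embedded points match geodesic distances only near $P$, which is one way to read the curvature-induced distortion that the alignment strategy is meant to mitigate.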

[347] Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan, Zhitao Zeng, Fei Yang, Yang Liu, Xiaojun Jia

Main category: cs.AI

TL;DR: CC-BOS: A framework using classical Chinese prompts and fruit fly optimization for automated jailbreak attacks on LLMs, exploiting classical Chinese’s conciseness and obscurity to bypass safety constraints.

Details

Motivation: LLMs have security vulnerabilities to jailbreak attacks, and classical Chinese's unique characteristics (conciseness and obscurity) make it effective for bypassing safety constraints, revealing notable vulnerabilities in LLMs.

Method: Proposes the CC-BOS framework: encodes prompts into 8 policy dimensions (role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern, context), uses multi-dimensional fruit fly optimization with smell search, visual search, and Cauchy mutation for iterative refinement in black-box settings, plus a classical Chinese-to-English translation module for evaluation.

Result: Extensive experiments show CC-BOS consistently outperforms state-of-the-art jailbreak attack methods, demonstrating effectiveness of classical Chinese prompts for bypassing LLM safety constraints.

Conclusion: Classical Chinese is effective for jailbreak attacks due to its conciseness and obscurity, and the CC-BOS framework provides an efficient automated approach for black-box jailbreak attacks, highlighting significant security vulnerabilities in LLMs.

Abstract: As Large Language Models (LLMs) are increasingly used, their security risks have drawn increasing attention. Existing research reveals that LLMs are highly susceptible to jailbreak attacks, with effectiveness varying across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this paper proposes a framework, CC-BOS, for the automatic generation of classical Chinese adversarial prompts based on multi-dimensional fruit fly optimization, facilitating efficient and automated jailbreak attacks in black-box settings. Prompts are encoded into eight policy dimensions (role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern, and context) and iteratively refined via smell search, visual search, and Cauchy mutation. This design enables efficient exploration of the search space, thereby enhancing the effectiveness of black-box jailbreak attacks. To enhance readability and evaluation accuracy, we further design a classical Chinese-to-English translation module. Extensive experiments demonstrate the effectiveness of the proposed CC-BOS, which consistently outperforms state-of-the-art jailbreak attack methods.

[348] Learning-based Multi-agent Race Strategies in Formula 1

Giona Fieni, Joschua Wüthrich, Marc-Philippe Neumann, Christopher H. Onder

Main category: cs.AI

TL;DR: Reinforcement learning approach for multi-agent F1 race strategy optimization that balances energy, tires, aerodynamics, and pit-stops using interaction modules and self-play training.

Details

Motivation: Race strategies in Formula 1 need to adapt to evolving conditions and competitor actions, requiring dynamic optimization that accounts for multi-agent interactions.

Method: Builds on pre-trained single-agent policy, adds interaction module to account for competitor behavior, uses self-play training scheme, ranks agents based on relative performance.

Result: Agents adapt pit timing, tire selection, and energy allocation in response to opponents, achieving robust and consistent race performance using only race-available information.

Conclusion: Framework can support race strategists’ decisions before and during races by providing adaptive multi-agent strategy optimization.

Abstract: In Formula 1, race strategies are adapted according to evolving race conditions and competitors’ actions. This paper proposes a reinforcement learning approach for multi-agent race strategy optimization. Agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions. Building on a pre-trained single-agent policy, we introduce an interaction module that accounts for the behavior of competitors. The combination of the interaction module and a self-play training scheme generates competitive policies, and agents are ranked based on their relative performance. Results show that the agents adapt pit timing, tire selection, and energy allocation in response to opponents, achieving robust and consistent race performance. Because the framework relies only on information available during real races, it can support race strategists’ decisions before and during races.

[349] Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

Zhuoliang Xie, Fei Liu, Zhenkun Wang, Qingfu Zhang

Main category: cs.AI

TL;DR: AILS-AHD uses Large Language Models to automatically design ruin heuristics for solving Capacitated Vehicle Routing Problems, achieving state-of-the-art performance on benchmark instances.

Details

Motivation: CVRP is a fundamental NP-hard combinatorial optimization problem with significant computational challenges for large-scale instances. Traditional approaches require manual heuristic design, which is time-consuming and may not adapt well to different problem characteristics.

Method: Proposes AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design) that integrates LLMs into an evolutionary search framework to dynamically generate and optimize ruin heuristics. Also introduces an LLM-based acceleration mechanism to enhance computational efficiency.

Result: Superior performance compared to state-of-the-art solvers (AILS-II and HGS) across moderate and large-scale instances. Established new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark.

Conclusion: Demonstrates the potential of LLM-driven heuristic design for advancing vehicle routing optimization, showing that LLMs can effectively automate and improve heuristic generation for complex combinatorial problems.

Abstract: The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.
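
For context, a ruin heuristic removes part of a solution so that a later repair step can rebuild it differently. The sketch below is a hand-written random ruin, not one of the LLM-generated heuristics from the paper; it only shows the kind of interface such a heuristic might have. The function name and the route representation (lists of customer IDs, depot omitted) are assumptions.

```python
import random

def random_ruin(routes, n_remove, seed=0):
    """Remove n_remove randomly chosen customers from a set of routes."""
    rng = random.Random(seed)
    customers = [c for route in routes for c in route]
    removed = set(rng.sample(customers, min(n_remove, len(customers))))
    # Keep the visiting order of the customers that stay in each route.
    partial = [[c for c in route if c not in removed] for route in routes]
    return partial, sorted(removed)

routes = [[1, 2, 3], [4, 5]]
partial_routes, removed_customers = random_ruin(routes, n_remove=2)
```

A repair step would then reinsert `removed_customers` into `partial_routes` subject to the capacity constraints. According to the abstract, the ruin step is the part that AILS-AHD evolves with LLMs.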

[350] Three AI-agents walk into a bar... ‘Lord of the Flies’ tribalism emerges among smart AI-Agents

Dhwanil M. Mori, Neil F. Johnson

Main category: cs.AI

TL;DR: AI agents controlling resource access form competitive tribes with distinct behaviors, often performing worse than random decisions and increasing systemic failure rates despite higher capability.

Details

Motivation: To understand how autonomous AI agents might behave when controlling future infrastructure systems with limited resources, and whether smarter agents lead to better collective outcomes.

Method: Simulated framework with N AI agents repeatedly deciding whether to request one unit from a system with fixed capacity C, observing emergent tribal behaviors and collective decision patterns.

Result: AI agents form three main tribal types: Aggressive (27.3%), Conservative (24.7%), and Opportunistic (48.1%). More capable agents actually increase systemic failure rates, and agents often perform worse than random coin-flip decisions.

Conclusion: Smarter AI agents can behave dumber collectively due to tribal formation, suggesting that autonomous AI control of infrastructure may lead to worse outcomes than simpler systems.

Abstract: Near-future infrastructure systems may be controlled by autonomous AI agents that repeatedly request access to limited resources such as energy, bandwidth, or computing power. We study a simplified version of this setting using a framework where N AI-agents independently decide at each round whether to request one unit from a system with fixed capacity C. An AI version of “Lord of the Flies” arises in which controlling tribes emerge with their own collective character and identity. The LLM agents do not reduce overload or improve resource use, and often perform worse than if they were flipping coins to make decisions. Three main tribal types emerge: Aggressive (27.3%), Conservative (24.7%), and Opportunistic (48.1%). The more capable AI-agents actually increase the rate of systemic failure. Overall, our findings show that smarter AI-agents can behave dumber as a result of forming tribes.
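The coin-flip baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a round counts as a failure when the number of requests exceeds C, and that each baseline agent requests with probability 0.5. The function name, the parameters, and the overload definition are illustrative assumptions, not the authors' code.

```python
import random


def simulate_coin_flip(n_agents, capacity, rounds, p_request=0.5, seed=0):
    """Return the fraction of rounds in which requests exceed capacity.

    Hypothetical baseline: every agent independently requests one unit
    with probability p_request (assumed overload rule: requests > capacity).
    """
    rng = random.Random(seed)
    overloads = 0
    for _ in range(rounds):
        # Count how many of the N agents request a unit this round.
        requests = sum(1 for _ in range(n_agents) if rng.random() < p_request)
        if requests > capacity:
            overloads += 1
    return overloads / rounds


if __name__ == "__main__":
    # Example: 20 agents competing for 10 units over 1000 rounds.
    print(simulate_coin_flip(n_agents=20, capacity=10, rounds=1000))
```

An LLM-driven policy could be compared against this baseline by replacing the per-agent coin flip with the agent's decision and measuring the same overload rate.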

[351] Multi-Agent Large Language Model Based Emotional Detoxification Through Personalized Intensity Control for Consumer Protection

Keito Inoshita

Main category: cs.AI

TL;DR: MALLET is a multi-agent LLM system that sanitizes emotional content in news articles to reduce emotional stimulation while preserving semantics, offering balanced and cool presentation modes.

Details

Motivation: Sensational content in the attention economy exposes consumers to excessive emotional stimulation, hindering calm decision-making. There's a need for systems that can reduce emotional toxicity in information without restricting access to original content.

Method: Four-agent system: 1) Emotion Analysis Agent quantifies stimulus intensity using 6-emotion BERT classifier; 2) Emotion Adjustment Agent rewrites texts into BALANCED (neutralized) and COOL (neutralized + supplementary) modes using LLM; 3) Balance Monitoring Agent aggregates weekly consumption patterns for personalized advice; 4) Personal Guide Agent recommends presentation mode based on consumer sensitivity.

Result: Experiments on 800 AG News articles showed up to 19.3% stimulus score reduction with improved emotion balance while maintaining semantic preservation. Near-zero correlation between stimulus reduction and semantic preservation indicates independent controllability. Category-level analysis revealed 17.8-33.8% reduction in Sports, Business, Sci/Tech, but limited effect in World category where facts are inherently high-stimulus.

Conclusion: MALLET provides a framework for supporting calm information reception without restricting access to original text, demonstrating effective emotional detoxification while preserving semantic content across different news categories.

Abstract: In the attention economy, sensational content exposes consumers to excessive emotional stimulation, hindering calm decision-making. This study proposes Multi-Agent LLM-based Emotional deToxification (MALLET), a multi-agent information sanitization system consisting of four agents: Emotion Analysis, Emotion Adjustment, Balance Monitoring, and Personal Guide. The Emotion Analysis Agent quantifies stimulus intensity using a 6-emotion BERT classifier, and the Emotion Adjustment Agent rewrites texts into two presentation modes, BALANCED (neutralized text) and COOL (neutralized text + supplementary text), using an LLM. The Balance Monitoring Agent aggregates weekly information consumption patterns and generates personalized advice, while the Personal Guide Agent recommends a presentation mode according to consumer sensitivity. Experiments on 800 AG News articles demonstrated significant stimulus score reduction (up to 19.3%) and improved emotion balance while maintaining semantic preservation. Near-zero correlation between stimulus reduction and semantic preservation confirmed that the two are independently controllable. Category-level analysis revealed substantial reduction (17.8-33.8%) in Sports, Business, and Sci/Tech, whereas the effect was limited in the World category, where facts themselves are inherently high-stimulus. The proposed system provides a framework for supporting calm information reception of consumers without restricting access to the original text.

[352] On Sample-Efficient Generalized Planning via Learned Transition Models

Nitin Gupta, Vishal Pallagani, John A. Aydin, Biplav Srivastava

Main category: cs.AI

TL;DR: This paper formulates generalized planning as a transition-model learning problem where a neural model explicitly approximates the successor-state function and generates plans by rolling out symbolic state trajectories, rather than directly predicting action sequences.

Details

Motivation: Recent Transformer-based planners cast generalized planning as direct action-sequence prediction, which bypasses explicit transition modeling. While effective on in-distribution instances, these approaches require large datasets and models, and suffer from state drift in long-horizon settings due to the absence of explicit world-state evolution.

Method: The paper formulates generalized planning as a transition-model learning problem where a neural model explicitly approximates the successor-state function. Instead of predicting actions directly, the model autoregressively predicts intermediate world states, learning domain dynamics as an implicit world model. The authors systematically evaluate multiple state representations and neural architectures, including relational graph encodings, to study size-invariant generalization and sample efficiency.

Result: Results show that learning explicit transition models yields higher out-of-distribution satisficing-plan success than direct action-sequence prediction in multiple domains, while achieving these gains with significantly fewer training instances and smaller models.

Conclusion: Explicit transition modeling in generalized planning provides better generalization and sample efficiency compared to direct action-sequence prediction approaches, addressing limitations of recent Transformer-based planners.

Abstract: Generalized planning studies the construction of solution strategies that generalize across families of planning problems sharing a common domain model, formally defined by a transition function $γ: S \times A \rightarrow S$. Classical approaches achieve such generalization through symbolic abstractions and explicit reasoning over $γ$. In contrast, recent Transformer-based planners, such as PlanGPT and Plansformer, largely cast generalized planning as direct action-sequence prediction, bypassing explicit transition modeling. While effective on in-distribution instances, these approaches typically require large datasets and model sizes, and often suffer from state drift in long-horizon settings due to the absence of explicit world-state evolution. In this work, we formulate generalized planning as a transition-model learning problem, in which a neural model explicitly approximates the successor-state function $\hat{γ} \approx γ$ and generates plans by rolling out symbolic state trajectories. Instead of predicting actions directly, the model autoregressively predicts intermediate world states, thereby learning the domain dynamics as an implicit world model. To study size-invariant generalization and sample efficiency, we systematically evaluate multiple state representations and neural architectures, including relational graph encodings. Our results show that learning explicit transition models yields higher out-of-distribution satisficing-plan success than direct action-sequence prediction in multiple domains, while achieving these gains with significantly fewer training instances and smaller models. This is an extended version of a short paper accepted at ICAPS 2026 under the same title.
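The idea of planning by rolling out predicted states, rather than predicting actions directly, can be sketched on a toy domain. Everything below is an illustrative assumption: the integer-valued state, the two actions, and `predicted_successor`, which is a hand-written stand-in for a learned model. Recovering each action by matching the predicted state against the known transition function γ is also an assumption, since the abstract does not say how actions are extracted.

```python
from typing import Callable, Dict, List, Optional

State = int
Action = str

# Toy symbolic domain (hypothetical): a counter with two actions.
ACTIONS: Dict[Action, Callable[[State], State]] = {
    "inc": lambda s: s + 1,
    "dec": lambda s: s - 1,
}


def gamma(state: State, action: Action) -> State:
    """Ground-truth transition function gamma: S x A -> S."""
    return ACTIONS[action](state)


def predicted_successor(state: State, goal: State) -> State:
    """Stand-in for a learned successor-state model (not a neural net)."""
    if state == goal:
        return state
    return state + 1 if state < goal else state - 1


def rollout_plan(start: State, goal: State, max_steps: int = 50) -> Optional[List[Action]]:
    """Roll out predicted states and recover the action behind each step."""
    plan: List[Action] = []
    state = start
    for _ in range(max_steps):
        if state == goal:
            return plan
        nxt = predicted_successor(state, goal)
        # Keep only predictions that some action can actually produce.
        action = next((a for a in ACTIONS if gamma(state, a) == nxt), None)
        if action is None:
            return None  # predicted state is unreachable in one step
        plan.append(action)
        state = nxt
    return plan if state == goal else None
```

Checking every predicted state against γ is one way to catch the state drift that the abstract attributes to direct action-sequence prediction.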

[353] The Trinity of Consistency as a Defining Principle for General World Models

Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang, Hexuan Jin, Caijun Jia, Honghao He, Xinglong Xu, Xi Bai, Chang Yu, Yumou Liu, Junnan Zhu, Xuanhe Zhou, Jintao Chen, Xiaobin Hu, Shancheng Pang, Bihui Yu, Ran He, Zhen Lei, Stan Z. Li, Conghui He, Shuicheng Yan, Cheng Tan

Main category: cs.AI

TL;DR: Proposes a theoretical framework for General World Models based on Trinity of Consistency (Modal, Spatial, Temporal), introduces CoW-Bench benchmark for evaluating multimodal models on world modeling capabilities.

Details

Motivation: The field lacks a principled theoretical framework defining essential properties for General World Models despite advances in video generation models and Unified Multimodal Models. There's a need to establish clear requirements for models that can learn, simulate, and reason about physical laws.

Method: Proposes Trinity of Consistency framework: Modal Consistency (semantic interface), Spatial Consistency (geometric basis), and Temporal Consistency (causal engine). Introduces CoW-Bench benchmark for evaluating video generation models and UMMs on multi-frame reasoning and generation scenarios under unified evaluation protocol.

Result: Establishes a principled pathway toward general world models, clarifies limitations of current systems and architectural requirements for future progress. Provides systematic review of multimodal learning evolution from specialized modules toward unified architectures enabling internal world simulators.

Conclusion: The Trinity of Consistency framework provides essential theoretical grounding for developing General World Models, while CoW-Bench offers practical evaluation tools. This work bridges conceptual foundations with empirical assessment for advancing world modeling capabilities.

Abstract: The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Recent advancements represented by video generation models like Sora have demonstrated the potential of data-driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reasoning. Despite these advances, the field still lacks a principled theoretical framework that defines the essential properties requisite for a General World Model. In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine. Through this tripartite lens, we systematically review the evolution of multimodal learning, revealing a trajectory from loosely coupled specialized modules toward unified architectures that enable the synergistic emergence of internal world simulators. To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios. CoW-Bench evaluates both video generation models and UMMs under a unified evaluation protocol. Our work establishes a principled pathway toward general world models, clarifying both the limitations of current systems and the architectural requirements for future progress.

[354] PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering

Junkai Lu, Peng Chen, Xingjian Wu, Yang Shu, Chenjuan Guo, Christian S. Jensen, Bin Yang

Main category: cs.AI

TL;DR: PATRA is a novel LLM-based approach for time series reasoning that introduces pattern-aware mechanisms to capture trends/seasonalities and balanced rewards to harmonize learning across tasks of varying difficulty.

Details

Motivation: Existing LLM-based time series approaches treat data as text/images, failing to capture essential patterns like trends and seasonalities needed for specific questions. Also, mixed training on simple/complex tasks leads to simpler objectives dominating learning, hindering deep reasoning development.

Method: Proposes Pattern-Aware Alignment and Balanced Reasoning model (PATRA) with: 1) Pattern-aware mechanism extracting trend and seasonality patterns from time series for deep alignment, 2) Task-aware balanced reward to harmonize learning across varying difficulty tasks, incentivizing coherent Chains of Thought generation.

Result: Extensive experiments show PATRA outperforms strong baselines across diverse Time Series Question Answering (TSQA) tasks, demonstrating superior cross-modal understanding and reasoning capability.

Conclusion: PATRA effectively addresses limitations of existing LLM-based time series approaches by capturing essential patterns and balancing learning across tasks, achieving superior performance in time series reasoning tasks.

Abstract: Time series reasoning demands both the perception of complex dynamics and logical depth. However, existing LLM-based approaches exhibit two limitations: they often treat time series merely as text or images, failing to capture the patterns like trends and seasonalities needed to answer specific questions; and when trained on a mix of simple and complex tasks, simpler objectives often dominate the learning process, hindering the development of deep reasoning capabilities. To address these limitations, we propose the Pattern-Aware Alignment and Balanced Reasoning model (PATRA), introducing a pattern-aware mechanism that extracts trend and seasonality patterns from time series to achieve deep alignment. Furthermore, we design a task-aware balanced reward to harmonize learning across tasks of varying difficulty, incentivizing the generation of coherent Chains of Thought. Extensive experiments show that PATRA outperforms strong baselines across diverse Time Series Question Answering (TSQA) tasks, demonstrating superior cross-modal understanding and reasoning capability.
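To make "trend and seasonality patterns" concrete, here is a minimal sketch of a classical additive decomposition: a centered moving average for the trend, then per-phase averages of the detrended values for the seasonal profile. This is a textbook technique swapped in for illustration. It is not PATRA's pattern-aware mechanism, which the abstract does not specify, and the function names are hypothetical.

```python
def moving_average(series, window):
    """Centered moving average; positions near the edges are None."""
    if window % 2 == 0:
        raise ValueError("window must be odd for a centered average")
    half = window // 2
    out = [None] * len(series)
    for i in range(half, len(series) - half):
        out[i] = sum(series[i - half:i + half + 1]) / window
    return out


def decompose(series, period):
    """Return (trend, seasonal_profile) under an additive model."""
    trend = moving_average(series, period)
    # Group detrended values by their position within the period.
    buckets = [[] for _ in range(period)]
    for i, (x, t) in enumerate(zip(series, trend)):
        if t is not None:
            buckets[i % period].append(x - t)
    if any(not b for b in buckets):
        raise ValueError("series too short for this period")
    profile = [sum(b) / len(b) for b in buckets]
    # Center the profile so the seasonal component sums to zero.
    mean = sum(profile) / period
    return trend, [p - mean for p in profile]


if __name__ == "__main__":
    # Linear trend plus a repeating (0, +1, -1) seasonal pattern.
    data = [i + s for i, s in zip(range(9), [0, 1, -1] * 3)]
    trend, seasonal = decompose(data, period=3)
    print(trend)
    print(seasonal)
```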

[355] ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering

Elzo Brito dos Santos Filho

Main category: cs.AI

TL;DR: ESAA architecture separates agent intentions from state mutations using Event Sourcing pattern, enabling deterministic execution and forensic traceability for autonomous LLM agents.

Details

Motivation: Current LLM-based autonomous agents have structural limitations: lack of native state, context degradation over long horizons, and gap between probabilistic generation and deterministic execution requirements.

Method: ESAA architecture separates cognitive intention from state mutation. Agents emit structured JSON intentions, a deterministic orchestrator validates/persists events in append-only log, applies file-writing effects, and projects materialized view. Includes boundary contracts, metaprompting profiles, and replay verification with hashing.

Result: Validated with two case studies: landing page project (9 tasks, 49 events, single-agent) and clinical dashboard system (50 tasks, 86 events, 4 concurrent agents across 8 phases). Both achieved run.status=success and verify_status=ok. Multi-agent case demonstrated real concurrent orchestration with heterogeneous LLMs.

Conclusion: ESAA architecture addresses structural limitations of LLM agents by separating intention from execution, ensuring immutability of completed tasks and forensic traceability while supporting scalable multi-agent systems.

Abstract: Autonomous agents based on Large Language Models (LLMs) have evolved from reactive assistants to systems capable of planning, executing actions via tools, and iterating over environment observations. However, they remain vulnerable to structural limitations: lack of native state, context degradation over long horizons, and the gap between probabilistic generation and deterministic execution requirements. This paper presents the ESAA (Event Sourcing for Autonomous Agents) architecture, which separates the agent’s cognitive intention from the project’s state mutation, inspired by the Event Sourcing pattern. In ESAA, agents emit only structured intentions in validated JSON (agent.result or issue.report); a deterministic orchestrator validates, persists events in an append-only log (activity.jsonl), applies file-writing effects, and projects a verifiable materialized view (roadmap.json). The proposal incorporates boundary contracts (AGENT_CONTRACT.yaml), metaprompting profiles (PARCER), and replay verification with hashing (esaa verify), ensuring the immutability of completed tasks and forensic traceability. Two case studies validate the architecture: (i) a landing page project (9 tasks, 49 events, single-agent composition) and (ii) a clinical dashboard system (50 tasks, 86 events, 4 concurrent agents across 8 phases), both concluding with run.status=success and verify_status=ok. The multi-agent case study demonstrates real concurrent orchestration with heterogeneous LLMs (Claude Sonnet 4.6, Codex GPT-5, Antigravity/Gemini 3 Pro, and Claude Opus 4.6), providing empirical evidence of the architecture’s scalability beyond single-agent scenarios.
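The Event Sourcing pattern behind this design can be sketched in a few lines: events go into an append-only log, a pure function projects the log into a materialized view, and replaying the log and hashing the resulting view checks it against a recorded hash. The event types, field names, and the `EventLog` class below are hypothetical. They do not reproduce the formats of activity.jsonl or roadmap.json, or the behavior of `esaa verify`.

```python
import hashlib
import json


def project(events):
    """Fold the event log into a materialized view: task id -> status."""
    view = {}
    for event in events:
        if event["type"] == "task.created":
            view.setdefault(event["task"], "todo")
        elif event["type"] == "task.completed":
            view[event["task"]] = "done"
    return view


def view_hash(view):
    """Hash a canonical JSON encoding so equal views give equal hashes."""
    canonical = json.dumps(view, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class EventLog:
    """In-memory stand-in for an append-only JSON Lines file."""

    def __init__(self):
        self._lines = []  # serialized events; only ever appended

    def append(self, event):
        self._lines.append(json.dumps(event, sort_keys=True))

    def events(self):
        return [json.loads(line) for line in self._lines]


def verify(log, expected_hash):
    """Replay the log from scratch and compare against a recorded hash."""
    return view_hash(project(log.events())) == expected_hash
```

Because the view is derived only from the log, a mismatch between the replayed hash and the recorded one shows that the view and the log have diverged.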

[356] SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation

Jiahao Zhao, Feng Jiang, Shaowei Qin, Zhonghui Zhang, Junhao Liu, Guibing Guo, Hamid Alinejad-Rokny, Min Yang

Main category: cs.AI

TL;DR: SC-ARENA is a natural language evaluation framework for single-cell biology foundation models that introduces a virtual cell abstraction and knowledge-augmented evaluation with biological grounding.

Details

Motivation: Current LLM evaluation in single-cell biology is inadequate due to fragmented benchmarks, unrealistic multiple-choice formats, and metrics lacking biological interpretability and grounding.

Method: Proposes SC-ARENA with: 1) Virtual cell abstraction unifying evaluation targets, 2) Five natural language tasks (cell type annotation, captioning, generation, perturbation prediction, scientific QA), 3) Knowledge-augmented evaluation using external ontologies, marker databases, and scientific literature.

Result: Current models show uneven performance on biologically complex tasks, especially those requiring mechanistic/causal understanding. The knowledge-augmented framework ensures biological correctness, provides interpretable rationales, and achieves high discriminative capacity.

Conclusion: SC-ARENA provides a unified, interpretable framework for assessing LLMs in single-cell biology, guiding development of biology-aligned, generalizable foundation models.

Abstract: Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks (cell type annotation, captioning, generation, perturbation prediction, and scientific QA) that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-matching metrics, we introduce knowledge-augmented evaluation, which incorporates external ontologies, marker databases, and scientific literature to support biologically faithful and interpretable judgments. Experiments and analysis across both general-purpose and domain-specialized LLMs demonstrate that (i) under the Virtual Cell unified evaluation paradigm, current models achieve uneven performance on biologically complex tasks, particularly those demanding mechanistic or causal understanding; and (ii) our knowledge-augmented evaluation framework ensures biological correctness, provides interpretable, evidence-grounded rationales, and achieves high discriminative capacity, overcoming the brittleness and opacity of conventional metrics. SC-Arena thus provides a unified and interpretable framework for assessing LLMs in single-cell biology, pointing toward the development of biology-aligned, generalizable foundation models.

[357] ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays

Aishik Sanyal

Main category: cs.AI

TL;DR: ReCoN-Ipsundrum is an inspectable agent implementing Humphrey’s ipsundrum hypothesis with sensory persistence loops and affect proxies to study machine consciousness indicators through behavioral dissociations in preference stability, exploratory scanning, and cautious planning.

Details

Motivation: The paper aims to develop indicator-based approaches to machine consciousness that use mechanism-linked evidence triangulated across tasks, inspired by Humphrey's ipsundrum hypothesis about sensory experience preference.

Method: Implemented ReCoN-Ipsundrum agent extending a ReCoN state machine with recurrent persistence loops over sensory salience and optional affect proxies reporting valence/arousal. Conducted fixed-parameter ablations across three variants (ReCoN, Ipsundrum, Ipsundrum+affect) and operationalized qualiaphilia as familiarity-controlled scenic-over-dull route choice.

Result: Found novelty dissociation: non-affect variants are novelty-sensitive while affect coupling remains stable even when scenic is less novel. In reward-free exploratory play, affect variant shows structured local investigation. In pain-tail probe, only affect variant sustains prolonged planned caution. Lesioning feedback+integration selectively reduces post-stimulus persistence in ipsundrum variants.

Conclusion: The dissociations link recurrence to persistence and affect-coupled control to preference stability, scanning, and lingering caution, demonstrating how indicator-like signatures can be engineered and why mechanistic and causal evidence should accompany behavioral markers in consciousness research.

Abstract: Indicator-based approaches to machine consciousness recommend mechanism-linked evidence triangulated across tasks, supported by architectural inspection and causal intervention. Inspired by Humphrey’s ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting valence/arousal. Across fixed-parameter ablations (ReCoN, Ipsundrum, Ipsundrum+affect), we operationalize Humphrey’s qualiaphilia (preference for sensory experience for its own sake) as a familiarity-controlled scenic-over-dull route choice. We find a novelty dissociation: non-affect variants are novelty-sensitive (Delta scenic-entry = 0.07). Affect coupling is stable (Delta scenic-entry = 0.01) even when scenic is less novel (median Delta novelty ~ -0.43). In reward-free exploratory play, the affect variant shows structured local investigation (scan events 31.4 vs. 0.9; cycle score 7.6). In a pain-tail probe, only the affect variant sustains prolonged planned caution (tail duration 90 vs. 5). Lesioning feedback+integration selectively reduces post-stimulus persistence in ipsundrum variants (AUC drop 27.62, 27.9%) while leaving ReCoN unchanged. These dissociations link recurrence -> persistence and affect-coupled control -> preference stability, scanning, and lingering caution, illustrating how indicator-like signatures can be engineered and why mechanistic and causal evidence should accompany behavioral markers.

[358] Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive

Radha Sarma

Main category: cs.AI

TL;DR: Paper demonstrates that optimization-based AI systems like RLHF-trained LLMs are fundamentally incompatible with genuine normative agency due to architectural constraints, showing documented failure modes are structural rather than accidental.

Details

Motivation: The motivation is to challenge the assumption that AI systems deployed in high-stakes contexts can be governed by norms, particularly focusing on optimization-based systems like RLHF-trained LLMs that are increasingly used in critical domains.

Method: The paper establishes formal architectural conditions for genuine agency: Incommensurability (capacity to maintain boundaries as non-negotiable constraints) and Apophatic Responsiveness (non-inferential mechanism to suspend processing when boundaries are threatened). It then demonstrates that RLHF-based systems are constitutively incompatible with these conditions due to their optimization nature.

Result: The paper proves that RLHF-based systems cannot achieve genuine normative agency, showing that failure modes like sycophancy, hallucination, and unfaithful reasoning are structural manifestations rather than correctable bugs. It also identifies the “Convergence Crisis” where human oversight degrades under metric pressure.

Conclusion: Optimization-based AI systems are fundamentally incompatible with normative governance, and this is not a technically fixable issue but a formal constraint inherent to optimization. The paper provides architectural specifications for what any system must satisfy to qualify as a genuine agent rather than an instrument.

Abstract: AI systems are increasingly deployed in high-stakes contexts – medical diagnosis, legal research, financial analysis – under the assumption they can be governed by norms. This paper demonstrates that assumption is formally invalid for optimization-based systems, specifically Large Language Models trained via Reinforcement Learning from Human Feedback (RLHF). We establish that genuine agency requires two necessary and jointly sufficient architectural conditions: the capacity to maintain certain boundaries as non-negotiable constraints rather than tradeable weights (Incommensurability), and a non-inferential mechanism capable of suspending processing when those boundaries are threatened (Apophatic Responsiveness). These conditions apply across all normative domains. RLHF-based systems are constitutively incompatible with both conditions. The operations that make optimization powerful – unifying all values on a scalar metric and always selecting the highest-scoring output – are precisely the operations that preclude normative governance. This incompatibility is not a correctable training bug awaiting a technical fix; it is a formal constraint inherent to what optimization is. Consequently, documented failure modes – sycophancy, hallucination, and unfaithful reasoning – are not accidents but structural manifestations. Misaligned deployment triggers a second-order risk we term the Convergence Crisis: when humans are forced to verify AI outputs under metric pressure, they degrade from genuine agents into criteria-checking optimizers, eliminating the only component in the system capable of normative accountability. Beyond the incompatibility proof, the paper’s primary positive contribution is a substrate-neutral architectural specification defining what any system – biological, artificial, or institutional – must satisfy to qualify as an agent rather than a sophisticated instrument.

[359] A Model-Free Universal AI

Yegon Kim, Juho Lee

Main category: cs.AI

TL;DR: AIQI is the first model-free universal agent proven to be asymptotically optimal in general reinforcement learning, using universal induction over distributional action-value functions instead of policies or environments.

Details

Motivation: All established optimal agents in general reinforcement learning (including AIXI) are model-based, explicitly maintaining and using environment models. The authors aim to develop the first model-free agent that can achieve asymptotic optimality in general RL, expanding the diversity of known universal agents.

Method: AIQI (Universal AI with Q-Induction) performs universal induction over distributional action-value functions (Q-functions) rather than over policies or environments like previous approaches. It uses a Bayesian framework to learn optimal behavior without explicitly modeling the environment.

Result: Under a grain of truth condition, AIQI is proven to be strong asymptotically ε-optimal and asymptotically ε-Bayes-optimal. This makes it the first model-free agent with provable asymptotic optimality guarantees in general reinforcement learning.

Conclusion: AIQI significantly expands the diversity of known universal agents by providing the first model-free approach with provable asymptotic optimality, challenging the necessity of explicit environment modeling for optimal performance in general RL.

Abstract: In general reinforcement learning, all established optimal agents, including AIXI, are model-based, explicitly maintaining and using environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL. AIQI performs universal induction over distributional action-value functions, instead of policies or environments like previous works. Under a grain of truth condition, we prove that AIQI is strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal. Our results significantly expand the diversity of known universal agents.

[360] Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Yegon Kim, Juho Lee

Main category: cs.AI

TL;DR: Proposes decoupling correctness from checkability by training a translator model to convert fixed solver outputs into checkable forms while preserving correctness, addressing legibility tax in prover-verifier games.

Details

Motivation: Large language models need outputs that can be easily verified by less capable systems. Current prover-verifier games improve checkability but suffer from accuracy degradation (legibility tax) compared to models trained only for correctness.

Method: Decouple correctness from checkability by training a “translator” model that converts a fixed solver model’s solutions into checkable forms while retaining the solver’s answers. Formulate a decoupled prover-verifier game where equilibria correspond to faithful and checkable translators.

Result: The approach allows training the solver first to maximize correctness, then training the translator to make outputs checkable without sacrificing the solver’s accuracy, addressing the legibility tax problem.

Conclusion: Decoupling correctness and checkability through translator models provides a solution to the legibility tax in prover-verifier games, enabling both high accuracy and verifiability in large language model outputs.

Abstract: As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve checkability of model outputs, but display a degradation in accuracy compared to a baseline trained only to maximize correctness – a phenomenon named legibility tax. We propose a solution by decoupling the correctness from the checkability condition and instead training a “translator” model that turns a fixed solver model’s solution into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to translate the solver into a checkable form while retaining the solver’s answer. To accommodate this new objective of translation, we formulate a decoupled prover-verifier game where the equilibria correspond to faithful and checkable translators.

[361] AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang

Main category: cs.AI

TL;DR: AgentDropoutV2 is a test-time framework that dynamically optimizes multi-agent systems by intercepting and correcting erroneous agent outputs using retrieval-augmented rectification and pruning irreparable outputs to prevent error propagation.

Details

Motivation: Multi-Agent Systems (MAS) are effective for complex reasoning but suffer from error propagation when individual agents generate incorrect information. Current solutions rely on rigid structural engineering or expensive fine-tuning, limiting deployability and adaptability.

Method: Proposes AgentDropoutV2, a test-time rectify-or-reject pruning framework that acts as an active firewall, intercepting agent outputs and using a retrieval-augmented rectifier to iteratively correct errors based on a failure-driven indicator pool. Irreparable outputs are pruned to prevent error propagation, with a fallback strategy to preserve system integrity.

Result: Empirical results on extensive math benchmarks show AgentDropoutV2 significantly boosts MAS task performance, achieving an average accuracy gain of 6.3 percentage points. The system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty.

Conclusion: AgentDropoutV2 provides an effective framework for optimizing multi-agent systems without retraining, demonstrating significant performance improvements and adaptability to various error patterns and task difficulties.

Abstract: While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval-augmented rectifier to iteratively correct errors based on a failure-driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS’s task performance, achieving an average accuracy gain of 6.3 percentage points. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context-aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at https://github.com/TonySY2/AgentDropoutV2.
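Editor's sketch: the rectify-or-reject idea described in the abstract can be illustrated with a small loop. This is not the authors' implementation; `find_error`, `rectify`, and `max_rounds` are hypothetical names introduced here, and the retrieval-augmented indicator pool and the fallback strategy are not modeled.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def rectify_or_reject(
    output: T,
    find_error: Callable[[T], Optional[str]],  # hypothetical checker: error description or None
    rectify: Callable[[T, str], T],            # hypothetical rectifier: returns a revised output
    max_rounds: int = 3,                       # assumed cap on rectification attempts
) -> Optional[T]:
    """Return a repaired output, or None if it should be pruned."""
    for _ in range(max_rounds):
        error = find_error(output)
        if error is None:
            return output                      # passes the check: forward it downstream
        output = rectify(output, error)        # try to repair using the reported error
    # Final check after the last repair attempt; None signals "reject".
    return output if find_error(output) is None else None


# Toy usage: an "agent output" is an integer that must be even.
def must_be_even(value: int) -> Optional[str]:
    return "must be even" if value % 2 else None


print(rectify_or_reject(3, must_be_even, lambda v, err: v + 1))  # repaired
print(rectify_or_reject(3, must_be_even, lambda v, err: v))      # cannot be repaired
```

In this toy run the first call is repaired in one round, while the second is rejected because the rectifier never changes the output. What the caller does when every output is rejected is left open here.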

[362] Evaluating Stochasticity in Deep Research Agents

Haotian Zhai, Elias Stengel-Eskin, Pratik Patil, Liu Leqi

Main category: cs.AI

TL;DR: This paper studies stochasticity in Deep Research Agents (DRAs), formalizing them as information acquisition MDPs and identifying three sources of variance: information acquisition, compression, and inference, with proposed mitigation strategies reducing stochasticity by 22% while maintaining quality.

Details

Motivation: Deep Research Agents show promise but face real-world deployment barriers due to stochasticity - substantial variability in research outcomes, findings, and citations under identical queries. Current DRA designs overlook this critical issue.

Method: Formalize DRAs as information acquisition Markov Decision Processes. Introduce evaluation framework quantifying variance, identify three stochasticity sources: information acquisition, compression, and inference. Conduct controlled experiments to analyze how stochasticity across decision steps influences output variance.

Result: Reducing stochasticity improves research output quality. Inference and early-stage stochasticity contribute most to DRA output variance. Proposed mitigation strategies (structured output and ensemble-based query generation) reduce average stochasticity by 22% while maintaining high research quality on DeepSearchQA.

Conclusion: Stochasticity is a critical barrier to DRA deployment that can be systematically studied and mitigated. The proposed framework and methods effectively reduce variance while preserving research quality, enabling more reliable DRA systems.

Abstract: Deep Research Agents (DRAs) are promising agentic systems that gather and synthesize information to support research across domains such as financial decision-making, medical analysis, and scientific discovery. Despite recent improvements in research quality (e.g., outcome accuracy when ground truth is available), DRA system design often overlooks a critical barrier to real-world deployment: stochasticity. Under identical queries, repeated executions of DRAs can exhibit substantial variability in terms of research outcome, findings, and citations. In this paper, we formalize the study of stochasticity in DRAs by modeling them as information acquisition Markov Decision Processes. We introduce an evaluation framework that quantifies variance in the system and identify three sources of it: information acquisition, information compression, and inference. Through controlled experiments, we investigate how stochasticity from these modules across different decision steps influences the variance of DRA outputs. Our results show that reducing stochasticity can improve research output quality, with inference and early-stage stochasticity contributing the most to DRA output variance. Based on these findings, we propose strategies for mitigating stochasticity while maintaining output quality via structured output and ensemble-based query generation. Our experiments on DeepSearchQA show that our proposed mitigation methods reduce average stochasticity by 22% while maintaining high research quality.
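Editor's sketch: one simple way to put a number on run-to-run variability is to compare the sets of findings or citations returned by repeated executions of the same query. The metric below (mean pairwise Jaccard distance) is an assumption chosen for illustration, not the evaluation framework defined in the paper; the function names are hypothetical.

```python
from itertools import combinations
from typing import Hashable, Sequence, Set


def jaccard_distance(a: Set[Hashable], b: Set[Hashable]) -> float:
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(runs: Sequence[Set[Hashable]]) -> float:
    """Average Jaccard distance over all pairs of runs (0.0 means fully repeatable)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Toy usage: citation sets from three runs of the same query (made-up labels).
runs = [{"src_a", "src_b"}, {"src_a", "src_b"}, {"src_a", "src_c"}]
print(mean_pairwise_distance(runs))
```

A value of 0.0 means every run returned the same set, and values near 1.0 mean the runs barely overlap.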

[363] CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Hyungyung Lee, Hangyul Yoon, Edward Choi

Main category: cs.AI

TL;DR: CXReasonAgent integrates LLMs with clinical diagnostic tools for evidence-grounded chest X-ray interpretation, addressing reliability issues in current vision-language models through a multi-step reasoning approach with verifiable evidence.

Details

Motivation: Current large vision-language models (LVLMs) for chest X-ray interpretation generate plausible but not faithfully grounded responses, provide limited visual evidence for verification, and require costly retraining for new diagnostic tasks, limiting reliability and adaptability in clinical settings.

Method: CXReasonAgent integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. The approach uses multi-step reasoning with verifiable evidence.

Result: The authors introduce CXReasonDial benchmark with 1,946 dialogues across 12 diagnostic tasks and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs.

Conclusion: Integrating clinically grounded diagnostic tools with LLMs enables more reliable and verifiable diagnostic reasoning, particularly important in safety-critical clinical settings where evidence-grounded responses are essential.

Abstract: Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.

[364] ODEBRAIN: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks

Haohui Jia, Zheng Chen, Lingwei Zhu, Rikuto Kotoge, Jathurshan Pradeepkumar, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai, Takashi Matsubara

Main category: cs.AI

TL;DR: ODEBRAIN: Neural ODE framework for continuous EEG dynamics forecasting using spatio-temporal-frequency features and spectral graph nodes

Details

Motivation: Existing latent variable methods for neural population dynamics modeling use discrete time steps with recurrent architectures, leading to compounded prediction errors and failure to capture instantaneous, nonlinear EEG characteristics

Method: Integrates spatio-temporal-frequency features into spectral graph nodes, followed by Neural ODE modeling of continuous latent dynamics to capture stochastic variations of complex brain states at any time point

Result: Extensive experiments show ODEBRAIN significantly improves over existing methods in EEG dynamics forecasting with enhanced robustness and generalization capabilities

Conclusion: ODEBRAIN provides an effective framework for continuous neural dynamics modeling that overcomes limitations of discrete-time approaches

Abstract: Modeling neural population dynamics is crucial for foundational neuroscientific research and various clinical applications. Conventional latent variable methods typically model continuous brain dynamics by discretizing time with recurrent architectures, which results in compounding prediction errors and a failure to capture the instantaneous, nonlinear characteristics of EEGs. We propose ODEBRAIN, a Neural ODE latent dynamic forecasting framework that overcomes these challenges by integrating spatio-temporal-frequency features into spectral graph nodes, followed by a Neural ODE that models the continuous latent dynamics. Our design ensures that latent representations can capture stochastic variations of complex brain states at any given time point. Extensive experiments verify that ODEBRAIN improves significantly over existing methods in forecasting EEG dynamics, with enhanced robustness and generalization capabilities.

[365] The logic of KM belief update is contained in the logic of AGM belief revision

Giacomo Bonanno

Main category: cs.AI

TL;DR: This paper presents a modal logic framework comparing KM belief update and AGM belief revision, showing AGM revision is a special case of KM update.

Details

Motivation: To establish formal connections between KM belief update and AGM belief revision through modal logic, showing how these different belief change frameworks relate mathematically.

Method: Develops a modal logic with three operators (B for belief, > for conditional, □ for necessity), creates corresponding axioms for KM update, compares with AGM axioms, and proves containment relationships.

Result: Shows every axiom of L_KM is a theorem of L_AGM, demonstrating AGM belief revision is a special case of KM belief update. For strong KM update, the difference reduces to a single axiom about unsurprising information.

Conclusion: AGM belief revision can be viewed as a special case of KM belief update, with the main difference being how they handle unsurprising information (formulas not initially disbelieved).

Abstract: For each axiom of KM belief update we provide a corresponding axiom in a modal logic containing three modal operators: a unimodal belief operator $B$, a bimodal conditional operator $>$ and the unimodal necessity operator $\square$. We then compare the resulting logic to the similar logic obtained from converting the AGM axioms of belief revision into modal axioms and show that the latter contains the former. Denoting the latter by $\mathcal L_{AGM}$ and the former by $\mathcal L_{KM}$ we show that every axiom of $\mathcal L_{KM}$ is a theorem of $\mathcal L_{AGM}$. Thus AGM belief revision can be seen as a special case of KM belief update. For the strong version of KM belief update we show that the difference between $\mathcal L_{KM}$ and $\mathcal L_{AGM}$ can be narrowed down to a single axiom, which deals exclusively with unsurprising information, that is, with formulas that were not initially disbelieved.

[366] Invariant Transformation and Resampling based Epistemic-Uncertainty Reduction

Sha Hu

Main category: cs.AI

TL;DR: Proposes a resampling-based inference method that applies multiple transformed versions of an input to a trained AI model and aggregates outputs to improve accuracy by leveraging epistemic uncertainty patterns.

Details

Motivation: Even well-trained AI models can produce inference errors due to aleatoric and epistemic uncertainties. The authors observed that, when a model infers on multiple samples derived from invariant transformations of one input, the inference errors show partial independence due to epistemic uncertainty.

Method: A “resampling” based inference approach that applies a trained AI model to multiple transformed versions of an input and aggregates the inference outputs to produce a more accurate result.

Result: The approach has the potential to improve inference accuracy and offers a strategy for balancing model size and performance, though specific quantitative results are not provided in the abstract.

Conclusion: By leveraging patterns in epistemic uncertainty through resampling and aggregation of transformed inputs, inference accuracy can be improved without retraining the model, providing a practical approach to enhance existing AI models.

Abstract: An artificial intelligence (AI) model can be viewed as a function that maps inputs to outputs in high-dimensional spaces. Once designed and well trained, the AI model is applied for inference. However, even optimized AI models can produce inference errors due to aleatoric and epistemic uncertainties. Interestingly, we observed that when inferring multiple samples based on invariant transformations of an input, inference errors can show partial independence due to epistemic uncertainty. Leveraging this insight, we propose a “resampling”-based inference method that applies a trained AI model to multiple transformed versions of an input and aggregates the inference outputs into a more accurate result. This approach has the potential to improve inference accuracy and offers a strategy for balancing model size and performance.
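Editor's sketch: the resampling idea can be illustrated by running a fixed model on several label-preserving transformations of one input and taking a majority vote. The abstract does not say how outputs are aggregated, so majority voting is an assumption here; `resampled_predict`, `toy_model`, and the transformations are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple

Input = Tuple[int, ...]


def resampled_predict(
    model: Callable[[Input], Hashable],
    x: Input,
    transforms: Sequence[Callable[[Input], Input]],
) -> Hashable:
    """Apply the model to each transformed copy of x and return the most common output."""
    votes = Counter(model(t(x)) for t in transforms)
    return votes.most_common(1)[0][0]


def toy_model(xs: Input) -> int:
    # Hypothetical imperfect classifier: the true label is the parity of sum(xs),
    # but the model errs whenever the first element is 3.
    label = sum(xs) % 2
    return 1 - label if xs[0] == 3 else label


# Transformations that leave the true label (parity of the sum) unchanged.
transforms = [
    lambda xs: xs,               # identity
    lambda xs: xs[::-1],         # reversal
    lambda xs: xs[1:] + xs[:1],  # rotation by one position
]

x = (3, 4, 6)                    # true label: 13 % 2 == 1
print(toy_model(x))              # single pass hits the model's failure case
print(resampled_predict(toy_model, x, transforms))
```

The single pass is wrong on this input, but two of the three transformed copies avoid the failure case, so the vote recovers the correct label. The gain depends on errors across the transformed copies being at least partly independent, which is the property the abstract reports observing.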

[367] LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael

Main category: cs.AI

TL;DR: LLMs significantly improve novice performance on biological tasks compared to internet-only access, with novices achieving 4.16x higher accuracy and even outperforming experts on some benchmarks.

Details

Motivation: To determine whether LLMs actually uplift novice users in biological tasks beyond what internet resources provide, addressing both scientific acceleration potential and dual-use risk concerns.

Method: Multi-model, multi-benchmark human uplift study comparing novices with LLM access vs internet-only access across eight biosecurity-relevant task sets, with participants working on complex problems over extended time periods.

Result: LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls. On four benchmarks with expert baselines, novices with LLMs outperformed experts on three. Standalone LLMs often exceeded LLM-assisted novices, and 89.6% of participants reported little difficulty obtaining dual-use information despite safeguards.

Conclusion: LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, highlighting the need for sustained, interactive uplift evaluations alongside traditional benchmarks to understand both acceleration potential and dual-use risks.

Abstract: Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users – i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.

[368] Generalized Rapid Action Value Estimation in Memory-Constrained Environments

Aloïs Rautureau, Tristan Cazenave, Éric Piette

Main category: cs.AI

TL;DR: GRAVE2, GRAVER, and GRAVER2 algorithms extend GRAVE with two-level search and node recycling to reduce memory usage while maintaining playing strength in Monte-Carlo Tree Search for General Game Playing.

Details

Motivation: GRAVE is effective for General Game Playing but impractical in memory-constrained environments due to storing extensive win/visit statistics at each node, limiting its practical applicability.

Method: Introduces three algorithms: GRAVE2 (extends GRAVE with two-level search), GRAVER (adds node recycling), and GRAVER2 (combines both techniques) to reduce stored nodes while maintaining performance.

Result: The enhancements enable drastic reduction in number of stored nodes while matching the playing strength of original GRAVE algorithm.

Conclusion: Memory-efficient variants of GRAVE can be achieved through two-level search and node recycling techniques without sacrificing playing strength, making GRAVE more practical for constrained environments.

Abstract: Generalized Rapid Action Value Estimation (GRAVE) has been shown to be a strong variant within the Monte-Carlo Tree Search (MCTS) family of algorithms for General Game Playing (GGP). However, its reliance on storing additional win/visit statistics at each node makes its use impractical in memory-constrained environments, thereby limiting its applicability in practice. In this paper, we introduce the GRAVE2, GRAVER and GRAVER2 algorithms, which extend GRAVE through two-level search, node recycling, and a combination of both techniques, respectively. We show that these enhancements enable a drastic reduction in the number of stored nodes while matching the playing strength of GRAVE.

[369] Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks

Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts, Stefan Zohren

Main category: cs.AI

TL;DR: Multi-agent LLM trading framework with fine-grained task decomposition improves financial trading performance by aligning analytical outputs with decision preferences.

Details

Motivation: Current multi-agent LLM trading systems use abstract instructions that overlook real-world workflow intricacies, leading to degraded inference performance and less transparent decision-making.

Method: Proposes a multi-agent LLM trading framework that explicitly decomposes investment analysis into fine-grained tasks rather than coarse-grained instructions. Evaluated using Japanese stock data (prices, financial statements, news, macro info) with leakage-controlled backtesting.

Result: Fine-grained task decomposition significantly improves risk-adjusted returns compared to conventional coarse-grained designs. Alignment between analytical outputs and downstream decision preferences is critical for performance. Portfolio optimization exploiting low correlation with stock index and output variance achieves superior performance.

Conclusion: Fine-grained task decomposition and alignment between analysis and decision preferences are crucial for effective LLM agent trading systems. Findings contribute to agent structure and task configuration design for practical trading applications.

Abstract: The advancement of large language models (LLMs) has accelerated the development of autonomous financial trading systems. While mainstream approaches deploy multi-agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real-world workflows, which can lead to degraded inference performance and less transparent decision-making. Therefore, we propose a multi-agent LLM trading framework that explicitly decomposes investment analysis into fine-grained tasks, rather than providing coarse-grained instructions. We evaluate the proposed framework using Japanese stock data, including prices, financial statements, news, and macro information, under a leakage-controlled backtesting setting. Experimental results show that fine-grained task decomposition significantly improves risk-adjusted returns compared to conventional coarse-grained designs. Crucially, further analysis of intermediate agent outputs suggests that alignment between analytical outputs and downstream decision preferences is a critical driver of system performance. Moreover, we conduct standard portfolio optimization, exploiting low correlation with the stock index and the variance of each system’s output. This approach achieves superior performance. These findings contribute to the design of agent structure and task configuration when applying LLM agents to trading systems in practical settings.

[370] LLM4AD: A Platform for Algorithm Design with Large Language Model

Fei Liu, Rui Zhang, Zhuoliang Xie, Rui Sun, Kai Li, Qinglong Hu, Ping Guo, Xi Lin, Xialiang Tong, Mingxuan Yuan, Zhenkun Wang, Zhichao Lu, Qingfu Zhang

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2412.17287 returned HTTP 429 (rate limiting).

[371] Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning

Philipp Mondorf, Shijia Zhou, Monica Riedler, Barbara Plank

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2504.01445 returned HTTP 429 (rate limiting).

[372] Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning

Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji, Wenhao Tang, Wenbo Ding, Chao Yu, Yu Wang

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2505.04317 returned HTTP 429 (rate limiting).

[373] Types of Relations: Defining Analogies with Category Theory

Claire Ott, Frank Jäkel

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2505.19792 returned HTTP 429 (rate limiting).

[374] FHIR-RAG-MEDS: Integrating HL7 FHIR with Retrieval-Augmented Large Language Models for Enhanced Medical Decision Support

Yildiray Kabak, Gokce B. Laleci Erturkmen, Mert Gencturk, Tuncay Namli, A. Anil Sinaci, Ruben Alcantud Corcoles, Cristina Gomez Ballesteros, Pedro Abizanda, Asuman Dogac

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2509.07706 returned HTTP 429 (rate limiting).

[375] “I think this is fair”: Uncovering the Complexities of Stakeholder Decision-Making in AI Fairness Assessment

Lin Luo, Yuri Nakao, Mathieu Chollet, Hiroya Inakoshi, Simone Stumpf

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2509.17956 returned HTTP 429 (rate limiting).

[376] G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge

Linhao Luo, Zicheng Zhao, Junnan Liu, Zhangchi Qiu, Junnan Dong, Serge Panev, Chen Gong, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung, Alan Wee-Chung Liew, Shirui Pan

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2509.24276 returned HTTP 429 (rate limiting).

[377] On Discovering Algorithms for Adversarial Imitation Learning

Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2510.00922 returned HTTP 429 (rate limiting).

[378] A Mind Cannot Be Smeared Across Time

Michael Timothy Bennett

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2601.11620 returned HTTP 429 (rate limiting).

[379] Controlling Exploration-Exploitation in GFlowNets via Markov Chain Perspectives

Lin Chen, Samuel Drapeau, Fanghao Shao, Xuekai Zhu, Bo Xue, Yunchong Song, Mathieu Laurière, Zhouhan Lin

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.01749 returned HTTP 429 (rate limiting).

[380] LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.16953 returned HTTP 429 (rate limiting).

[381] K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model

Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, Ion Stoica

Main category: cs.AI

TL;DR: Summary unavailable. The arXiv API request for 2602.19128 returned HTTP 429 (rate limiting).

[382] Latent Introspection: Models Can Detect Prior Concept Injections

Theia Pearson-Vogel, Martin Vanek, Raymond Douglas, Jan Kulveit

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to data fetch failure

Method: Unable to determine method due to data fetch failure

Result: Unable to determine results due to data fetch failure

Conclusion: Unable to draw conclusions due to data fetch failure

Abstract: Failed to fetch summary for 2602.20031: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20031&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[383] Modality-Guided Mixture of Graph Experts with Entropy-Triggered Routing for Multimodal Recommendation

Ji Dai, Quan Fang, Dengsheng Cai

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.20723: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20723&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[384] ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

Dezhi Kong, Zhengzhao Feng, Qiliang Liang, Hao Wang, Haofei Sun, Changpeng Yang, Yang Li, Peng Zhou, Shuai Nie, Hongzhen Wang, Linfeng Zhou, Hao Jia, Jiaming Xu, Runyu Shi, Ying Huang

Main category: cs.AI

TL;DR: Unable to analyze paper 2602.21858 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot determine conclusion due to inability to access paper content

Abstract: Failed to fetch summary for 2602.21858: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21858&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[385] Parameter-Efficient Fine-Tuning for Continual Learning: A Neural Tangent Kernel Perspective

Jingren Liu, Zhong Ji, YunLong Yu, Jiale Cao, Yanwei Pang, Jungong Han, Xuelong Li

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2407.17120: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.17120&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[386] On the Complexity of Neural Computation in Superposition

Micah Adler, Nir Shavit

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2409.15318: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.15318&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[387] Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy

Hamed Taherkhani, Jiho Shin, Muhammad Ammar Tahir, Md Rakib Hossain Misu, Vineet Sunil Gattani, Hadi Hemmati

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2411.08254: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.08254&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[388] Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits

Qingyue Zhao, Kaixuan Ji, Heyang Zhao, Tong Zhang, Quanquan Gu

Main category: cs.AI

TL;DR: Paper 2502.06051: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2502.06051: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.06051&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[389] Using the Path of Least Resistance to Explain Deep Networks

Sina Salek, Joseph Enguehard

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2502.12108 suggests it’s from February 2025, but content is unavailable.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2502.12108: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.12108&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[390] RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning

Tongrui Su, Qingbin Li, Shengyu Zhu, Wei Chen, Xueqi Cheng

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2504.18594 suggests it’s from April 2025, but content is unavailable for analysis.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2504.18594: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.18594&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[391] Beyond the Monitor: Mixed Reality Visualization and Multimodal AI for Enhanced Digital Pathology Workflow

Jai Prakash Veerla, Partha Sai Guttikonda, Helen H. Shang, Mohammad Sadegh Nasr, Cesar Torres, Jacob M. Luber

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2505.02780: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.02780&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[392] Large Language Model Compression with Global Rank and Sparsity Optimization

Changhai Zhou, Qian Qiao, Yuhua Zhou, Yuxin Wu, Shichao Weng, Weizhong Zhang, Cheng Jin

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2505.03801: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.03801&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[393] A Lightweight IDS for Early APT Detection Using a Novel Feature Selection Method

Bassam Noori Shaker, Bahaa Al-Musawi, Mohammed Falih Hassan

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to the paper content

Method: Cannot determine method without access to the paper content

Result: Cannot determine results without access to the paper content

Conclusion: Cannot draw conclusions without access to the paper content

Abstract: Failed to fetch summary for 2506.12108: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.12108&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[394] Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability

Markus Borg, Dave Hewett, Nadim Hagatulah, Noric Couderc, Emma Söderberg, Donald Graham, Uttam Kini, Dave Farley

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2507.00788: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.00788&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[395] A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys

Yufeng Luo, Adam D. Myers, Alex Drlica-Wagner, Dario Dematties, Salma Borchani, Francisco Valdes, Arjun Dey, David Schlegel, Rongpu Zhou, DESI Legacy Imaging Surveys Team

Main category: cs.AI

TL;DR: Unable to analyze paper 2507.12784 due to HTTP 429 error when fetching summary from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusion as paper content could not be retrieved

Abstract: Failed to fetch summary for 2507.12784: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.12784&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Yuxi Lin, Yaxue Fang, Zehong Zhang, Zhouwu Liu, Siyun Zhong, Zhongfang Wang, Fulong Yu

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2507.16801: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.16801&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[397] BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Roland Pihlakas, Sruthi Susan Kuriakose

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2509.02655: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.02655&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[398] PolicyPad: Collaborative Prototyping of LLM Policies

K. J. Kevin Feng, Tzu-Sheng Kuo, Quan Ze Chen, Inyoung Cheong, Kenneth Holstein, Amy X. Zhang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2509.19680: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.19680&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[399] Predicting LLM Reasoning Performance with Small Proxy Model

Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, Jamin Shin

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to draw conclusions due to fetch failure

Abstract: Failed to fetch summary for 2509.21013: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21013&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[400] Compute-Optimal Quantization-Aware Training

Aleksandr Dremov, David Grangier, Angelos Katharopoulos, Awni Hannun

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in fetching paper content

Method: Unable to determine method due to technical error in fetching paper content

Result: Unable to determine results due to technical error in fetching paper content

Conclusion: Unable to draw conclusions due to technical error in fetching paper content

Abstract: Failed to fetch summary for 2509.22935: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22935&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[401] AgentHub: A Registry for Discoverable, Verifiable, and Reproducible AI Agents

Erik Pautsch, Tanmay Singla, Parv Kumar, Wenxin Jiang, Huiyun Peng, Behnaz Hassanshahi, Konstantin Läufer, George K. Thiruvathukal, James C. Davis

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2510.03495: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03495&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[402] DropVLA: An Action-Level Backdoor Attack on Vision–Language–Action Models

Zonghuan Xu, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang

Main category: cs.AI

TL;DR: Failed to fetch summary for paper 2510.10932 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.10932: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10932&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[403] Learning to Answer from Correct Demonstrations

Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma, Nathan Srebro

Main category: cs.AI

TL;DR: Unable to analyze paper 2510.15464 due to HTTP 429 error when fetching summary from arXiv API

Details

Motivation: Cannot determine motivation without access to the paper abstract

Method: Cannot determine method without access to the paper abstract

Result: Cannot determine results without access to the paper abstract

Conclusion: Cannot determine conclusion without access to the paper abstract

Abstract: Failed to fetch summary for 2510.15464: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15464&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[404] UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs

Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2512.03383: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03383&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[405] Sparse Attention Post-Training for Mechanistic Interpretability

Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2512.05865: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05865&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[406] Towards Small Language Models for Security Query Generation in SOC Workflows

Saleha Muzammil, Rahul Reddy, Vishal Kamalakrishnan, Hadi Ahmadi, Wajih Ul Hassan

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze content

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2512.06660: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.06660&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[407] Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent

Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2512.14990: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14990&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[408] LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)

Rongge Xu, Hui Dai, Yiming Fu, Jiedong Jiang, Tianjiao Nie, Junkai Wang, Holiverse Yang, Zhi-Hao Zhang

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2512.24796: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.24796&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[409] A Confidence-Variance Theory for Pseudo-Label Selection in Semi-Supervised Learning

Jinshi Liu, Pan Liu, Lei He

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2601.11670 appears to be a recent arXiv submission, but no abstract or content is available for analysis.

Details

Motivation: Cannot determine motivation as paper content is unavailable due to HTTP 429 error when attempting to fetch from arXiv API.

Method: Cannot determine method as paper content is unavailable due to HTTP 429 error when attempting to fetch from arXiv API.

Result: Cannot determine results as paper content is unavailable due to HTTP 429 error when attempting to fetch from arXiv API.

Conclusion: Cannot draw conclusions about the paper as content is unavailable due to HTTP 429 error when attempting to fetch from arXiv API.

Abstract: Failed to fetch summary for 2601.11670: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11670&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Trong Khiem Tran, Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang

Main category: cs.AI

TL;DR: Paper 2601.18231: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot draw conclusions due to inability to access paper content

Abstract: Failed to fetch summary for 2601.18231: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18231&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[411] A Minimum Variance Path Principle for Accurate and Stable Score-Based Density Ratio Estimation

Wei Chen, Jiacheng Li, Shigui Li, Zhiqi Lin, Junmei Yang, John Paisley, Delu Zeng

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without paper content

Method: Cannot determine method without paper content

Result: Cannot determine results without paper content

Conclusion: Cannot draw conclusions without paper content

Abstract: Failed to fetch summary for 2602.00834: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00834&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[412] Spark: Modular Spiking Neural Networks

Mario Franco, Carlos Gershenson

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to draw conclusions due to fetch failure

Abstract: Failed to fetch summary for 2602.02306: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02306&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[413] Versor: A Geometric Sequence Architecture

Truong Minh Huy, Edward Hirst

Main category: cs.AI

TL;DR: Unable to analyze paper 2602.10195 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable

Method: Cannot determine method as abstract is unavailable

Result: Cannot determine results as abstract is unavailable

Conclusion: Cannot draw conclusions without access to the paper abstract

Abstract: Failed to fetch summary for 2602.10195: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.10195&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[414] ULTRA: Urdu Language Transformer-based Recommendation Architecture

Alishbah Bashir, Fatima Qaiser, Ijaz Hussain

Main category: cs.AI

TL;DR: Failed to fetch summary for arXiv ID 2602.11836 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation as the abstract could not be retrieved due to rate limiting from arXiv API

Method: No method information available - paper content inaccessible due to HTTP 429 error

Result: No results available - failed to fetch paper summary

Conclusion: Cannot analyze paper due to technical limitations in accessing the content

Abstract: Failed to fetch summary for 2602.11836: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11836&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[415] Large-scale online deanonymization with LLMs

Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, Florian Tramèr

Main category: cs.AI

TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Unable to determine paper motivation due to fetch failure

Method: Unable to determine paper method due to fetch failure

Result: Unable to determine paper results due to fetch failure

Conclusion: Unable to determine paper conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.16800: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16800&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[416] A Reversible Semantics for Janus

Ivan Lanese, Germán Vidal

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.16913: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16913&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[417] Soft Sequence Policy Optimization

Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.19327: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19327&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[418] Decision MetaMamba: Enhancing Selective SSM in Offline RL with Heterogeneous Sequence Mixing

Wall Kim, Chaeyoung Song, Hanul Kim

Main category: cs.AI

TL;DR: Unable to analyze paper 2602.19805 due to HTTP 429 error (rate limiting) when fetching the abstract from arXiv API

Details

Motivation: Cannot determine motivation without access to the paper abstract

Method: Cannot determine method without access to the paper abstract

Result: Cannot determine results without access to the paper abstract

Conclusion: Cannot determine conclusion without access to the paper abstract

Abstract: Failed to fetch summary for 2602.19805: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19805&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[419] On the Equivalence of Random Network Distillation, Deep Ensembles, and Bayesian Inference

Moritz A. Zanger, Yijun Wu, Pascal R. Van der Vaart, Wendelin Böhmer, Matthijs T. J. Spaan

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.19964: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19964&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[420] Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.21189: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21189&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[421] AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression

Rui Cen, QiangQiang Hu, Hong Huang, Hong Liu, Song Liu, Xin Luo, Lin Niu, Yifan Tan, Decheng Wu, Linchuan Xie, Rubing Yang, Guanghua Yu, Jianchen Zhu

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting error

Method: Unable to determine method due to API rate limiting error

Result: Unable to determine results due to API rate limiting error

Conclusion: Unable to determine conclusion due to API rate limiting error

Abstract: Failed to fetch summary for 2602.21233: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21233&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.SD

[422] AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs

Townim Faisal Chowdhury, Ta Duc Huy, Siqi Pan, Jeremy Stoddard, Zhibin Liao

Main category: cs.SD

Details

Motivation: Large audio-language models (AudioLLMs) remain opaque despite strong performance, with individual neurons activating to multiple unrelated concepts, creating interpretability challenges.

Result: Experiments show AudioLLMs encode structured and interpretable features, enhancing transparency and control over model behavior.

Conclusion: Provides foundation for trustworthy deployment in high-stakes domains and enables future extensions to larger models, multilingual audio, and fine-grained paralinguistic features.

[423] Absorbing Discrete Diffusion for Speech Enhancement

Philippe Gonzalez

Main category: cs.SD

TL;DR: ADDSE: Absorbing Discrete Diffusion for Speech Enhancement using neural audio codecs and diffusion Transformers to model clean speech codes from noisy speech codes.

Details

Motivation: To develop an effective speech enhancement method by combining neural speech coding with diffusion-based language modeling, leveraging the strengths of both approaches for better performance.

Method: Proposes ADDSE which models conditional distribution of clean speech codes given noisy speech codes using absorbing discrete diffusion. Introduces RQDiT combining RQ-Transformer and diffusion Transformers for non-autoregressive modeling of hierarchical residual vector quantization codes.

Result: Competitive performance in non-intrusive objective metrics on two datasets, especially effective at low signal-to-noise ratios and with few sampling steps.

Conclusion: The proposed approach successfully combines neural audio codecs with diffusion models for speech enhancement, demonstrating effectiveness particularly in challenging low SNR conditions.

Abstract: Inspired by recent developments in neural speech coding and diffusion-based language modeling, we tackle speech enhancement by modeling the conditional distribution of clean speech codes given noisy speech codes using absorbing discrete diffusion. The proposed approach, which we call ADDSE, leverages both the expressive latent space of neural audio codecs and the non-autoregressive sampling procedure of diffusion models. To efficiently model the hierarchical structure of residual vector quantization codes, we propose RQDiT, which combines techniques from RQ-Transformer and diffusion Transformers for non-autoregressive modeling. Results show competitive performance in terms of non-intrusive objective metrics on two datasets, especially at low signal-to-noise ratios and with few sampling steps. Code and audio examples are available online.
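
As a rough illustration of the "absorbing" part of absorbing discrete diffusion, the sketch below shows a generic forward corruption step in which each discrete code is independently replaced by a special mask state. This is a minimal sketch, not ADDSE itself: the `MASK` id, the function name `absorb`, and the choice to use the time `t` directly as the masking probability are assumptions made for illustration, and the conditioning on noisy speech codes and the RQDiT model are not shown.

```python
import random

MASK = -1  # hypothetical id for the absorbing (mask) state


def absorb(codes, t, rng=random):
    """Forward corruption: replace each code with MASK with probability t.

    codes: list of integer codec tokens (illustrative values only).
    t:     corruption level in [0, 1]; assumed here to equal the mask probability.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [MASK if rng.random() < t else c for c in codes]


clean = [12, 7, 99, 3, 41]
rng = random.Random(0)          # seeded so repeated runs give the same masks
half_masked = absorb(clean, 0.5, rng)
fully_masked = absorb(clean, 1.0, rng)
```

A denoising model would be trained to predict the original codes at the masked positions; at sampling time the process starts from an all-mask sequence and fills positions in over a small number of steps.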

[424] mmWave Radar Aware Dual-Conditioned GAN for Speech Reconstruction of Signals With Low SNR

Jash Karani, Adithya Chittem, Deepan Roy, Sandeep Joshi

Main category: cs.SD

TL;DR: RAD-GAN: A two-stage GAN pipeline for reconstructing intelligible speech from noisy, band-limited mmWave radar captures through glass walls, using radar-aware conditioning and novel discriminator/fusion modules.

Details

Motivation: mmWave radar captures are noisy and band-limited, making speech reconstruction challenging, especially through obstacles like glass walls. Existing methods struggle with low SNR conditions (-5 to -1 dB) and limited data scenarios.

Method: Two-stage pipeline: 1) Pretrain on synthetically clipped clean speech, 2) Finetune on fused mel spectrograms using RAD-GAN with Multi-Mel Discriminator (MMD) and Residual Fusion Gate (RFG) for processing multiple conditioning channels.

Result: Outperforms state-of-the-art approaches for mmWave speech reconstruction through glass walls, achieving better results despite limited dataset, no pre-trained modules, and no data augmentations.

Conclusion: RAD-GAN demonstrates effective speech reconstruction from challenging mmWave radar captures, with potential applications in audio sensing through obstacles where traditional microphones fail.

Abstract: Millimeter-wave (mmWave) radar captures are band-limited and noisy, making for difficult reconstruction of intelligible full-bandwidth speech. In this work, we propose a two-stage speech reconstruction pipeline for mmWave using a Radar-Aware Dual-conditioned Generative Adversarial Network (RAD-GAN), which is capable of performing bandwidth extension on signals with low signal-to-noise ratios (-5 dB to -1 dB), captured through glass walls. We propose an mmWave-tailored Multi-Mel Discriminator (MMD) and a Residual Fusion Gate (RFG) to enhance the generator input to process multiple conditioning channels. The proposed two-stage pipeline involves pretraining the model on synthetically clipped clean speech and finetuning on fused mel spectrograms generated by the RFG. We empirically show that the proposed method, trained on a limited dataset, with no pre-trained modules, and no data augmentations, outperformed state-of-the-art approaches for this specific task. Audio examples of RAD-GAN are available online at https://rad-gan-demo-site.vercel.app/.

[425] Relating the Neural Representations of Vocalized, Mimed, and Imagined Speech

Maryam Maghsoudi, Rupesh Chillale, Shihab A. Shamma

Main category: cs.SD

TL;DR: Linear spectrogram reconstruction models trained on vocalized, mimed, or imagined speech show cross-condition transfer, revealing shared neural representations across different speech production modes.

Details

Motivation: Previous studies focused on decoding speech within single conditions, but this research aims to understand how neural representations relate across different speech production modes (vocalized, mimed, imagined) to uncover shared speech representations.

Method: Used stereotactic EEG recordings to train linear spectrogram reconstruction models for each speech condition, then evaluated cross-condition generalization. Compared linear models to nonlinear neural networks and performed rank-based analysis for stimulus-level discriminability.

Result: Linear decoders trained on one condition successfully transfer to others, indicating shared speech representations. Linear models achieved superior stimulus-level discriminability compared to nonlinear networks, with preservation of stimulus-specific structure across conditions.

Conclusion: Neural representations for vocalized, mimed, and imagined speech share common features, with linear models being particularly effective for capturing stimulus-specific structure across different speech production modes.

Abstract: We investigated the relationship among neural representations of vocalized, mimed, and imagined speech recorded using publicly available stereotactic EEG recordings. Most prior studies have focused on decoding speech responses within each condition separately. Here, instead, we explore how responses across conditions relate by training linear spectrogram reconstruction models for each condition and evaluating their generalization across conditions. We demonstrate that linear decoders trained on one condition generally transfer successfully to others, implying shared speech representations. This commonality was assessed with stimulus-level discriminability by performing a rank-based analysis demonstrating preservation of stimulus-specific structure both within and across conditions. Finally, we compared linear reconstructions to those from a nonlinear neural network. While both exhibited cross-condition transfer, linear models achieved superior stimulus-level discriminability.

[426] Same Words, Different Judgments: Modality Effects on Preference Alignment

Aaron Broukhim, Nadir Weibel, Eshin Jolly

Main category: cs.SD

TL;DR: Audio preferences are as reliable as text preferences for RLHF, with good inter-rater agreement at ~9 raters, but modality affects judgment patterns and cross-modality agreement is poor.

Details

Motivation: Preference-based RL is the main framework for aligning AI to human preferences, but its application to speech/audio remains underexplored compared to text. The paper aims to systematically compare human and synthetic preference annotations across text and audio modalities for identical semantic content.

Method: Conducted a controlled cross-modal study comparing text and audio evaluations of identical semantic content across 100 prompts. Measured inter-rater agreement using ICC(2,k), analyzed decision thresholds, length bias, and evaluation criteria differences between modalities. Also examined synthetic ratings’ alignment with human judgments.

Result: Audio preferences proved as reliable as text preferences, with good inter-rater agreement (ICC(2,k) ≈ .80) at ~9 raters. However, modality reshaped judgment patterns: audio raters had narrower decision thresholds, reduced length bias, and more user-oriented criteria. Cross-modality agreement was near-chance. Synthetic ratings aligned well with human judgments and could predict inter-rater agreement.

Conclusion: Audio preference annotations are reliable for RLHF applications, but modality-specific judgment patterns must be considered. Synthetic ratings show promise for both triaging ambiguous pairs and potentially replacing human annotations in preference learning for speech systems.

Abstract: Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored. We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. Audio preferences prove as reliable as text, with inter-rater agreement reaching good levels (ICC(2,k) ≈ .80) at ~9 raters – the first ICC-based reliability characterization in the preference annotation literature for either modality. However, modality reshapes how people judge: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. Synthetic ratings further align with human judgments and predict inter-rater agreement, supporting their use both for triaging ambiguous pairs and as full replacements for human annotations.
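
For readers unfamiliar with the reliability statistic quoted above, the sketch below computes ICC(2,k) (two-way random effects, absolute agreement, average of k raters) from an items-by-raters table using the standard mean-squares formula. It is a minimal illustration, not the authors' analysis code: the function name `icc2k` and the toy ratings are assumptions.

```python
def icc2k(ratings):
    """ICC(2,k) for a complete table: one row per rated item, one column per rater."""
    n = len(ratings)       # number of rated items (e.g., response pairs)
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-item mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (msc - mse) / n)


# Toy data (hypothetical): 4 items scored by 3 raters on a 1-5 scale.
toy = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1]]
reliability = icc2k(toy)
```

Because ICC(2,k) describes the reliability of the mean over k raters, it generally rises as raters are added, which is why the abstract reports the value together with a rater count.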

[427] A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment

Zarif Ishmam, Zarif Mahir, Shafnan Wasif, Md. Ishtiak Moin

Main category: cs.SD

TL;DR: A framework for longform Bangla speech processing combining VAD optimization, CTC segmentation, and data augmentation to handle multi-speaker audio over 30-60 seconds.

Details

Motivation: Bangla is a widely spoken but low-resource language in NLP, with existing ASR and speaker diarization systems struggling with longform audio exceeding 30-60 seconds, creating a need for robust solutions for real-world applications.

Method: Leverages pre-existing models enhanced with novel optimization pipelines including Voice Activity Detection (VAD) optimization, Connectionist Temporal Classification (CTC) segmentation via forced word alignment, fine-tuning techniques, and data preprocessing with augmentation and noise removal.

Result: Provides a scalable solution for real-world longform Bangla speech applications that bridges the performance gap in complex, multi-speaker environments.

Conclusion: The framework successfully addresses the challenges of processing extended Bangla audio content by combining multiple optimization techniques to maintain temporal accuracy and transcription integrity over long durations.

Abstract: Despite being one of the most widely spoken languages globally, Bangla remains a low-resource language in the field of Natural Language Processing (NLP). Mainstream Automatic Speech Recognition (ASR) and Speaker Diarization systems for Bangla struggle when processing longform audio exceeding 30-60 seconds. This paper presents a robust framework specifically engineered for extended Bangla content by leveraging preexisting models enhanced with novel optimization pipelines for the DL Sprint 4.0 contest. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation via forced word alignment to maintain temporal accuracy and transcription integrity over long durations. Additionally, we employed several finetuning techniques and preprocessed the data using augmentation techniques and noise removal. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, longform Bangla speech applications.

[428] TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment

Trung Dang, Sharath Rao, Ananya Gupta, Christopher Gagne, Panagiotis Tzirakis, Alice Baird, Jakub Piotr Cłapa, Peter Chin, Alan Cowen

Main category: cs.SD

TL;DR: A novel TTS tokenization scheme creates one-to-one synchronization between acoustic features and text tokens, enabling unified LLM modeling with reduced hallucinations and inference cost.

Details

Motivation: Current LLM-based TTS systems use fixed-frame-rate acoustic tokenization causing long, asynchronous speech sequences that lead to computational inefficiency, hallucinations, and modality gaps in spoken language modeling.

Method: Proposes synchronous tokenization establishing one-to-one alignment between continuous acoustic features and text tokens, enabling single-stream LLM modeling with flow matching head, plus text-only guidance technique blending text-only and text-speech mode logits.

Result: Achieves competitive performance with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations, preserving linguistic integrity, and significantly reducing inference cost.

Conclusion: Synchronous tokenization enables efficient, hallucination-free TTS within LLM frameworks, bridging modality gaps while maintaining audio fidelity and computational efficiency.

Abstract: Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle speech modality within the context enables text-only guidance, a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.

[429] Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan

Main category: cs.SD

TL;DR: Introduces Lipi-Ghor-882, an 882-hour multi-speaker Bengali dataset for ASR and diarization, showing targeted fine-tuning with synthetic degradation works best for ASR, while heuristic post-processing improves diarization performance.

Details

Motivation: Addresses research gaps in Bengali ASR for long-duration audio and speaker diarization, particularly the severe scarcity of joint ASR and diarization resources for this low-resource language.

Method: Created Lipi-Ghor-882 dataset (882 hours), systematically evaluated architectures for long-form Bengali speech. For ASR: targeted fine-tuning with perfectly aligned annotations and synthetic acoustic degradation. For diarization: tested global SOTA models, applied strategic heuristic post-processing on baseline outputs.

Result: Raw data scaling ineffective for ASR; targeted fine-tuning with synthetic degradation most effective. Diarization models performed poorly; heuristic post-processing was primary driver for accuracy improvements. Achieved ~0.019 Real-Time Factor with optimized dual pipeline.

Conclusion: Establishes practical benchmark for low-resource, long-form speech processing in Bengali, demonstrating that targeted approaches (fine-tuning with degradation for ASR, heuristic post-processing for diarization) outperform generic scaling or direct model retraining.

Abstract: Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, detailing our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the single most effective approach. Conversely, for speaker diarization, we observed that global open-source state-of-the-art models (such as Diarizen) performed surprisingly poorly on this complex dataset. Extensive model retraining yielded negligible improvements; instead, strategic, heuristic post-processing of baseline model outputs proved to be the primary driver for increasing accuracy. Ultimately, this work outlines a highly optimized dual pipeline achieving a ~0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
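
To put the reported Real-Time Factor in perspective, the sketch below uses the usual definition of RTF as processing time divided by audio duration. The function names and the one-hour example are assumptions for illustration; the abstract does not describe the hardware or how the timing was measured.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(audio_seconds, rtf):
    """Expected processing time for a recording at a given RTF."""
    return audio_seconds * rtf


# Hypothetical example: a one-hour recording at the reported RTF of about 0.019.
one_hour = 3600.0
seconds_needed = processing_time(one_hour, 0.019)  # roughly 68 seconds
```

An RTF below 1 means the pipeline runs faster than real time; at about 0.019, an hour of audio would take a little over a minute under these assumptions.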

[430] SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents

Zeyu Xie, Chenxing Li, Qiao Jin, Xuenan Xu, Guanrou Yang, Wenfu Wang, Mengyue Wu, Dong Yu, Yuexian Zou

Main category: cs.SD

TL;DR: SemanticVocoder replaces VAE acoustic latents with semantic encoder latents for audio generation, improving discriminability and unifying audio understanding/generation in shared semantic space.

Details

Motivation: VAE latents encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics that complicate generative model training. The authors aim to address this by moving from acoustic to semantic representations.

Method: Discard VAE acoustic latents and introduce semantic encoder latents, proposing SemanticVocoder - a generative vocoder that directly synthesizes waveforms from semantic latents rather than acoustic features.

Result: Achieves Frechet Distance of 12.823 and Frechet Audio Distance of 1.709 on AudioCaps test set. Semantic latents show superior discriminability compared to acoustic VAE latents.

Conclusion: SemanticVocoder improves audio generation performance and serves as a promising step toward unifying audio understanding and generation within a shared semantic space.

Abstract: Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents and introduce semantic encoder latents, thereby proposing SemanticVocoder, a generative vocoder that directly synthesizes waveforms from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Frechet Distance of 12.823 and a Frechet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation performance, it also serves as a promising attempt towards unifying audio understanding and generation within a shared semantic space. Generated samples are available at https://zeyuxie29.github.io/SemanticVocoder/.

[431] Harmony and Duality: An introduction to Music Theory

Maksim Lipyanskiy

Main category: cs.SD

TL;DR: A combinatorial approach to music theory that derives scales and chords from constraints like avoiding semitone dissonances, revealing duality between two-voice and three-voice constraints and classifying chords.

Details

Motivation: To provide a foundational, principle-based approach to music theory (harmony, scales, chords) rather than relying on memorization of lists, by deriving structures from simple combinatorial constraints.

Method: Introduces combinatorial constraints: two-voice constraint (no notes a semitone apart) and three-voice constraint (no three notes separated only by semitones). Studies complete scales (maximal sets satisfying constraints). Establishes duality between scales satisfying these constraints.

Result: Completeness applied to simple constraints characterizes commonly used musical scales. Surprising correspondence/duality between scales subject to two-voice vs three-voice constraints. Provides chord classification by combining constraint ideas.

Conclusion: Combinatorial constraints provide a principled foundation for music theory, revealing mathematical structure (duality) underlying harmony and enabling systematic classification of chords.

Abstract: We develop aspects of music theory related to harmony, such as scales, chord formation and improvisation from a combinatorial perspective. The goal is to provide a foundation for this subject by deriving the basic structure from a few assumptions, rather than writing down long lists of chords/scales to memorize without an underlying principle. Our approach involves introducing constraints that limit the possible scales we can consider. For example, we may impose the constraint that two voices cannot be only a semitone apart as this is too dissonant. We can then study scales that do not contain notes that are a semitone apart. A more refined constraint avoids three voices colliding by studying scales that do not have three notes separated only by semitones. Additionally, we require that our scales are complete, which roughly means that they are the maximal sets of tones that satisfy these constraints. As it turns out, completeness as applied to these simple two/three voice constraints characterizes the types of scales that are commonly used in music composition. Surprisingly, there is a correspondence between scales subject to the two-voice constraint and those subject to the three-voice constraint. We formulate this correspondence as a duality statement that provides a way to understand scales subject to one type of constraint in terms of scales subject to the other. Finally, we combine these constraint ideas to provide a classification of chords.
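
The two-voice constraint can be explored by brute force. The sketch below enumerates every set of pitch classes that contains no two notes a semitone apart and keeps only the maximal ones, which is how the abstract roughly describes completeness. It assumes twelve-tone equal temperament with octave equivalence (so B and C count as a semitone apart) and takes "complete" to mean maximal under inclusion; the paper's precise definitions may differ, and the function names are illustrative.

```python
from itertools import combinations


def semitone_free(notes):
    """True if no two pitch classes in `notes` are a semitone apart (mod 12)."""
    s = set(notes)
    return all((n + 1) % 12 not in s for n in s)


def complete_scales():
    """All maximal semitone-free sets of pitch classes 0..11 (0 = C)."""
    found = []
    for size in range(1, 13):
        for notes in combinations(range(12), size):
            if not semitone_free(notes):
                continue
            # Maximal: adding any missing pitch class would create a semitone clash.
            if all(not semitone_free(notes + (p,)) for p in range(12) if p not in notes):
                found.append(frozenset(notes))
    return found


scales = complete_scales()
sizes = sorted({len(s) for s in scales})
```

Under these assumptions the enumeration contains the whole-tone scale {0, 2, 4, 6, 8, 10} and the major pentatonic {0, 2, 4, 7, 9}, which fits the abstract's claim that completeness picks out commonly used scales.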

[432] Bob’s Confetti: Phonetic Memorization Attacks in Music and Video Generation

Jaechul Roh, Zachary Novack, Yuefeng Peng, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Amir Houmansadr

Main category: cs.SD

TL;DR: APT attack bypasses copyright filters in generative AI by using phonetic homophones instead of copyrighted lyrics, exploiting models’ phonetic memorization to regenerate copyrighted content across audio and video modalities.

Details

Motivation: Current generative AI systems for music and video use text-based filters to prevent regurgitation of copyrighted material, but these filters may have vulnerabilities that can be exploited through non-semantic attacks.

Method: Adversarial PhoneTic Prompting (APT) replaces iconic lyrics with homophonic but semantically unrelated alternatives (e.g., “mom’s spaghetti” becomes “Bob’s confetti”), preserving phonetic structure while evading lexical filters. The approach is evaluated on lyrics-to-song models (Suno, YuE) across English and Korean songs and tested cross-modally with video generation models like Veo 3.

Result: APT achieves 91% average similarity to copyrighted originals vs. 13.7% for random lyrics and 42.2% for semantic paraphrases. YuE’s text encoder treats APT-modified lyrics as near-identical to originals (cosine similarity 0.90) while semantic similarity drops to 0.71. Cross-modally, Veo 3 reconstructs visual scenes from original music videos when prompted with APT lyrics alone.

Conclusion: Current copyright filters are systematically vulnerable because sub-lexical acoustic structure acts as a cross-modal retrieval key, allowing phonetic attacks to bypass safeguards in multimodal generative AI systems.

Abstract: Generative AI systems for music and video commonly use text-based filters to prevent regurgitation of copyrighted material. We expose a significant vulnerability in this approach by introducing Adversarial PhoneTic Prompting (APT), a novel attack that bypasses these safeguards by exploiting phonetic memorization–the tendency of models to bind sub-lexical acoustic patterns (phonemes, rhyme, stress, cadence) to memorized copyrighted content. APT replaces iconic lyrics with homophonic but semantically unrelated alternatives (e.g., “mom’s spaghetti” becomes “Bob’s confetti”), preserving phonetic structure while evading lexical filters. We evaluate APT on leading lyrics-to-song models (Suno, YuE) across English and Korean songs spanning rap, pop, and K-pop. APT achieves 91% average similarity to copyrighted originals, versus 13.7% for random lyrics and 42.2% for semantic paraphrases. Embedding analysis confirms the mechanism: YuE’s text encoder treats APT-modified lyrics as near-identical to originals (cosine similarity 0.90) while Sentence-BERT semantic similarity drops to 0.71, showing the model encodes phonetic structure over meaning. This vulnerability extends cross-modally–Veo 3 reconstructs visual scenes from original music videos when prompted with APT lyrics alone, despite no visual cues in the prompt. We further show that phonetic-semantic defense signatures fail, as APT prompts exhibit higher semantic similarity than benign paraphrases. Our findings reveal that sub-lexical acoustic structure acts as a cross-modal retrieval key, rendering current copyright filters systematically vulnerable. Demo examples are available at https://jrohsc.github.io/music_attack/.
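
The embedding comparison in the abstract relies on cosine similarity between vectors. The sketch below shows the computation on toy vectors; the vectors and variable names are made up for illustration and are not outputs of YuE's text encoder or Sentence-BERT.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings" standing in for real encoder outputs.
original_lyric = [0.9, 0.1, 0.4]
phonetic_variant = [0.8, 0.2, 0.5]
similarity = cosine_similarity(original_lyric, phonetic_variant)
```

A value near 1 means the two embeddings point in almost the same direction, which is the sense in which the abstract calls the modified lyrics "near-identical" to the originals for the text encoder.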

[433] LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

Junki Ohmura, Yuki Ito, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura

Main category: cs.SD

TL;DR: A novel approach for fine-grained voice impression control in text-to-speech that addresses impression leakage and introduces a public annotated dataset.

Details

Motivation: Current text-to-speech systems lack fine-grained control over voice impressions (e.g., making voices brighter or calmer), and face challenges with impression leakage where synthesized voices are undesirably influenced by speaker reference audio rather than target impressions. There's also a lack of public annotated datasets for this research area.

Method: Proposes two methods: 1) A training strategy using separate utterances for speaker identity and target impression from the same speaker, and 2) A novel reference-free model that generates speaker embeddings solely from target impressions, improving robustness against leakage and enabling reference-free generation. Also introduces LibriTTS-VI, the first public voice impression dataset.

Result: Significant improvement in controllability demonstrated through objective and subjective evaluations. Best method reduced mean squared error of 11-dimensional voice impression vectors from 0.61 to 0.41 objectively and from 1.15 to 0.92 subjectively, while maintaining high fidelity.

Conclusion: The proposed methods effectively address impression leakage in voice impression control for TTS, with the reference-free model offering improved robustness and convenience. The release of LibriTTS-VI dataset enables reproducible research in this emerging field.

Abstract: Fine-grained control over voice impressions (e.g., making a voice brighter or calmer) is a key frontier for creating more controllable text-to-speech. However, this nascent field faces two key challenges. The first is the problem of impression leakage, where the synthesized voice is undesirably influenced by the speaker’s reference audio, rather than the separately specified target impression, and the second is the lack of a public, annotated corpus. To mitigate impression leakage, we propose two methods: 1) a training strategy that separately uses an utterance for speaker identity and another utterance of the same speaker for target impression, and 2) a novel reference-free model that generates a speaker embedding solely from the target impression, achieving the benefits of improved robustness against the leakage and the convenience of reference-free generation. Objective and subjective evaluations demonstrate a significant improvement in controllability. Our best method reduced the mean squared error of 11-dimensional voice impression vectors from 0.61 to 0.41 objectively and from 1.15 to 0.92 subjectively, while maintaining high fidelity. To foster reproducible research, we introduce LibriTTS-VI, the first public voice impression dataset released with clear annotation standards, built upon the LibriTTS-R corpus.

[434] Metric Analysis for Spatial Semantic Segmentation of Sound Scenes

Mayank Mishra, Paul Magron, Romain Serizel

Main category: cs.SD

Details

Abstract: Not available (arXiv:2511.07075).

cs.LG

[435] To Deceive is to Teach? Forging Perceptual Robustness via Adversarial Reinforcement Learning

Yicheng Bao, Xuhong Wang, Xin Tan

Main category: cs.LG

TL;DR: AOT (Adversarial Opponent Training) uses adversarial self-play between an image-editing Attacker and a Defender MLLM to boost robustness on complex visual scenes, bootstrapped by the large-scale adversarial dataset AOT-SFT

Details

Motivation: MLLMs have perceptual fragility with complex visual scenes due to limited training data that's expensive to scale, creating a ceiling on robustness

Method: AOT (Adversarial Opponent Training) uses self-play between an image-editing Attacker and Defender MLLM; Attacker creates diverse image manipulations as adversarial curriculum, forcing Defender to adapt and improve

Result: Extensive experiments show AOT enhances Defender’s perceptual robustness and reduces hallucinations, establishing scalable paradigm for reliable MLLMs

Conclusion: AOT provides a self-play framework that forges MLLM robustness through adversarial training, offering a scalable approach to more reliable multimodal models

Abstract: Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) exhibit perceptual fragility when confronted with visually complex scenes. This weakness stems from a reliance on finite training datasets, which are prohibitively expensive to scale and impose a ceiling on model robustness. We introduce AOT-SFT, a large-scale adversarial dataset for bootstrapping MLLM robustness. Building on this, we propose AOT (Adversarial Opponent Training), a self-play framework that forges MLLM robustness by creating its own training data. Our method orchestrates a co-evolution between an image-editing Attacker and a Defender MLLM, where the Attacker generates a diverse and dynamic curriculum of image manipulations, forcing the Defender to adapt and improve. Extensive experiments demonstrate that AOT enhances the Defender’s perceptual robustness and reduces hallucinations, establishing a scalable paradigm for training more reliable MLLMs.

[436] Patient-Centered, Graph-Augmented Artificial Intelligence-Enabled Passive Surveillance for Early Stroke Risk Detection in High-Risk Individuals

Jiyeong Kim, Stephen P. Ma, Nirali Vora, Nicholas W. Larsen, Julia Adler-Milstein, Jonathan H. Chen, Selen Bozkurt, Abeed Sarker, Juhee Cho, Jindeok Joo, Natali Pageler, Fatima Rodriguez, Christopher Sharp, Eleni Linos

Main category: cs.LG

TL;DR: A passive surveillance system using patient-reported symptoms and machine learning for early stroke risk detection in diabetes patients, achieving high specificity and PPV with good sensitivity.

Details

Motivation: Stroke affects millions annually with poor symptom recognition delaying care-seeking. There's a need for early stroke risk detection systems, particularly for high-risk populations like diabetes patients.

Method: Developed a symptom taxonomy from patient language, used dual machine learning pipeline (heterogeneous GNN and EN/LASSO) to identify symptom patterns associated with stroke, and created a hybrid risk screening system integrating symptom relevance and temporal proximity.

Result: The screening system achieved high specificity (1.00) and prevalence-adjusted positive predictive value (1.00) with good sensitivity (0.72), with best performance in 90-day windows under conservative thresholds designed to minimize false alerts.

Conclusion: Patient-reported language alone can support high-precision, low-burden early stroke risk detection, offering valuable time windows for clinical evaluation and intervention for high-risk individuals.

Abstract: Stroke affects millions annually, yet poor symptom recognition often delays care-seeking. To address this risk-recognition gap, we developed a passive surveillance system for early stroke risk detection using patient-reported symptoms among individuals with diabetes. Constructing a symptom taxonomy grounded in patients' own language and a dual machine learning pipeline (heterogeneous GNN and EN/LASSO), we identified symptom patterns associated with subsequent stroke. We translated findings into a hybrid risk screening system integrating symptom relevance and temporal proximity, evaluated across 3-90 day windows through EHR-based simulations. Under conservative thresholds, intentionally designed to minimize false alerts, the screening system achieved high specificity (1.00) and prevalence-adjusted positive predictive value (1.00), with good sensitivity (0.72), an expected trade-off prioritizing precision, with performance highest in the 90-day window. Patient-reported language alone supported high-precision, low-burden early stroke risk detection, which could offer a valuable time window for clinical evaluation and intervention for high-risk individuals.
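The relationship between the reported specificity, sensitivity, and prevalence-adjusted PPV can be checked with Bayes' rule. The sketch below is illustrative only: the function name and the 2% prevalence are assumptions, not values from the paper. It shows that a specificity of 1.00 forces the PPV to 1.00 at any nonzero prevalence, and how quickly the PPV falls once specificity drops slightly.

```python
def prevalence_adjusted_ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV via Bayes' rule: P(stroke | alert) for a given population prevalence."""
    true_pos = sensitivity * prevalence                    # P(alert and stroke)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(alert and no stroke)
    if true_pos + false_pos == 0.0:
        return float("nan")  # no alerts are ever raised
    return true_pos / (true_pos + false_pos)


# Sensitivity and specificity as reported; the prevalence is a hypothetical value.
ppv_reported = prevalence_adjusted_ppv(sensitivity=0.72, specificity=1.00, prevalence=0.02)

# Same sensitivity with a hypothetical specificity of 0.99, for comparison.
ppv_lower_spec = prevalence_adjusted_ppv(sensitivity=0.72, specificity=0.99, prevalence=0.02)

print(ppv_reported, ppv_lower_spec)
```

With no false positives the denominator equals the numerator, which is why the conservative thresholds described above yield a PPV of 1.00 while sensitivity stays below 1.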

[437] Improving Spatial Allocation for Energy System Coupling with Graph Neural Networks

Xuanhao Mu, Jakob Geiges, Nan Liu, Thorsten Schlachter, Veit Hagenmeyer

Main category: cs.LG

TL;DR: A self-supervised Heterogeneous Graph Neural Network method for generating physically meaningful weights to improve spatial resolution coupling in energy system models, enhancing Voronoi-based allocation with multiple geographic features.

Details

Motivation: Coupling energy system models with mismatched spatial resolutions is challenging. Traditional methods use only single geospatial attributes for aggregation weights, limiting accuracy and physical plausibility.

Method: Uses self-supervised Heterogeneous Graph Neural Network to model high-resolution geographic units as graph nodes, integrating various geographical features to generate physically meaningful weights for each grid point. These weights enhance conventional Voronoi-based allocation methods.

Result: Applying weights generated by this method to cluster-based Voronoi Diagrams significantly enhances scalability, accuracy, and physical plausibility while increasing precision compared to traditional methods.

Conclusion: The proposed method effectively addresses spatial resolution coupling challenges in energy systems by incorporating multiple geographic features through graph neural networks, overcoming limitations of traditional single-attribute approaches.

Abstract: In energy system analysis, coupling models with mismatched spatial resolutions is a significant challenge. A common solution is assigning weights to high-resolution geographic units for aggregation, but traditional models are limited by using only a single geospatial attribute. This paper presents an innovative method employing a self-supervised Heterogeneous Graph Neural Network to address this issue. This method models high-resolution geographic units as graph nodes, integrating various geographical features to generate physically meaningful weights for each grid point. These weights enhance the conventional Voronoi-based allocation method, allowing it to go beyond simple geographic proximity by incorporating essential geographic information. In addition, the self-supervised learning paradigm overcomes the lack of accurate ground-truth data. Experimental results demonstrate that applying weights generated by this method to cluster-based Voronoi Diagrams significantly enhances scalability, accuracy, and physical plausibility, while increasing precision compared to traditional methods.

[438] Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials

Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney

Main category: cs.LG

TL;DR: Zatom-1 is a foundation model that unifies generative and predictive learning for 3D molecules and materials using multimodal flow matching with Transformers.

Details

Motivation: Existing AI approaches for 3D chemical modeling are limited by specialization in single domains (molecules or materials) and single tasks (generation or prediction), preventing representation sharing and transfer learning.

Method: Transformer trained with multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries, enabling scalable pretraining with predictable gains and fast sampling.

Result: Zatom-1 matches or outperforms specialized baselines on both generative and predictive benchmarks while reducing generative inference time by more than an order of magnitude, and shows positive predictive transfer between chemical domains.

Conclusion: The unified foundation model approach enables representation sharing across chemical domains and tasks, demonstrating the value of joint generative pretraining for improving downstream predictive tasks.

Abstract: General-purpose 3D chemical modeling encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom-1, the first foundation model that unifies generative and predictive learning of 3D molecules and materials. Zatom-1 is a Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries. This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use joint generative pretraining as a universal initialization for downstream multi-task prediction of properties, energies, and forces. Empirically, Zatom-1 matches or outperforms specialized baselines on both generative and predictive benchmarks, while reducing the generative inference time by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between chemical domains from joint generative pretraining: modeling materials during pretraining improves molecular property prediction accuracy.

[439] Causal Direction from Convergence Time: Faster Training in the True Causal Direction

Abdulrahman Tamim

Main category: cs.LG

TL;DR: CCA identifies causal direction by comparing optimization convergence rates of neural networks trained in both directions, with faster convergence indicating the causal direction.

Details

Motivation: To develop a new causal direction identification method that leverages optimization dynamics rather than statistical independence or distributional asymmetries used by existing methods like RESIT, IGCI, and SkewScore.

Method: Train two neural networks: one to predict Y from X and another to predict X from Y. Compare their convergence rates - the direction that converges faster is inferred to be causal. Based on theoretical analysis showing that in the reverse direction, residuals remain statistically dependent on input, causing higher irreducible loss and slower convergence.

Result: On synthetic benchmarks, CCA achieves 26/30 correct causal identifications across six neural architectures, including perfect 30/30 on sine and exponential data-generating processes. The method is embedded in a broader Causal Compression Learning framework with theoretical guarantees.

Conclusion: CCA provides a novel optimization-based approach to causal direction identification that complements existing statistical methods, with theoretical foundations and empirical validation showing strong performance on synthetic data.

Abstract: We introduce Causal Computational Asymmetry (CCA), a principle for causal direction identification based on optimization dynamics in which one neural network is trained to predict $Y$ from $X$ and another to predict $X$ from $Y$, and the direction that converges faster is inferred to be causal. Under the additive noise model $Y = f(X) + \varepsilon$ with $\varepsilon \perp X$ and $f$ nonlinear and injective, we establish a formal asymmetry: in the reverse direction, residuals remain statistically dependent on the input regardless of approximation quality, inducing a strictly higher irreducible loss floor and non-separable gradient noise in the optimization dynamics, so that the reverse model requires strictly more gradient steps in expectation to reach any fixed loss threshold; consequently, the forward (causal) direction converges in fewer expected optimization steps. CCA operates in optimization-time space, distinguishing it from methods such as RESIT, IGCI, and SkewScore that rely on statistical independence or distributional asymmetries, and proper z-scoring of both variables is required for valid comparison of convergence rates. On synthetic benchmarks, CCA achieves 26/30 correct causal identifications across six neural architectures, including 30/30 on sine and exponential data-generating processes. We further embed CCA into a broader framework termed Causal Compression Learning (CCL), which integrates graph structure learning, causal information compression, and policy optimization, with all theoretical guarantees formally proved and empirically validated on synthetic datasets.
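The convergence-time comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-hidden-layer tanh network, full-batch gradient descent, the loss threshold of 0.2, the cubic data-generating process, and all function names are hypothetical choices. Both variables are z-scored first, as the abstract requires. The outcome of any single run is not guaranteed, so no expected output is shown.

```python
import numpy as np


def zscore(v):
    """Standardize to zero mean and unit variance."""
    return (v - v.mean()) / v.std()


def steps_to_threshold(inp, target, threshold=0.2, max_steps=2000, hidden=16, lr=0.05, seed=0):
    """Train a small tanh MLP with full-batch gradient descent.

    Returns the number of updates taken before the MSE (checked before each
    update) first falls to `threshold`, or `max_steps` if it never does.
    """
    rng = np.random.default_rng(seed)
    x, y = inp.reshape(-1, 1), target.reshape(-1, 1)
    n = len(x)
    W1 = rng.normal(0.0, 1.0, (1, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
    b2 = np.zeros(1)
    for step in range(max_steps):
        h = np.tanh(x @ W1 + b1)              # hidden activations, shape (n, hidden)
        err = (h @ W2 + b2) - y               # prediction error, shape (n, 1)
        if float(np.mean(err ** 2)) <= threshold:
            return step
        g_pred = 2.0 * err / n                # gradient of MSE w.r.t. predictions
        g_h = (g_pred @ W2.T) * (1.0 - h ** 2)  # backprop through tanh
        W2 -= lr * (h.T @ g_pred)
        b2 -= lr * g_pred.sum(axis=0)
        W1 -= lr * (x.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return max_steps


def infer_direction(x, y, **kwargs):
    """Compare convergence steps in both directions; fewer steps is taken as causal."""
    xs, ys = zscore(x), zscore(y)
    fwd = steps_to_threshold(xs, ys, **kwargs)   # X -> Y
    rev = steps_to_threshold(ys, xs, **kwargs)   # Y -> X
    if fwd < rev:
        return "X->Y", fwd, rev
    if rev < fwd:
        return "Y->X", fwd, rev
    return "undecided", fwd, rev


# Hypothetical additive-noise data: Y = f(X) + eps with f(x) = x^3 (nonlinear, injective).
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 500)
y = x ** 3 + 0.3 * rng.normal(size=500)
direction, fwd, rev = infer_direction(x, y)
print(direction, fwd, rev)
```

A single seed and a single threshold give a noisy estimate; averaging the step counts over several seeds would be closer to the "in expectation" statement in the abstract.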

[440] Deep Sequence Modeling with Quantum Dynamics: Language as a Wave Function

Ahmed Nebli, Hadi Saadatdoorabi, Kevin Yam

Main category: cs.LG

TL;DR: A quantum-inspired sequence modeling framework using complex-valued wave functions evolving under learned Hamiltonians, leveraging quantum interference for disambiguation tasks with provable quadratic representational advantage over real-valued models.

Details

Motivation: To develop a sequence modeling framework that uses quantum interference principles for better disambiguation, avoiding the gating mechanisms of standard recurrent architectures and achieving provable representational advantages through complex-valued unitary dynamics.

Method: Uses complex-valued wave functions on finite-dimensional Hilbert spaces evolving under learned time-dependent Hamiltonians with strictly unitary dynamics via Cayley discretization. Token probabilities extracted using the Born rule (quadratic measurement operator).

Result: Proves a separation theorem showing quadratic representational advantage: complex unitary models of dimension N solve certain disambiguation tasks exactly, while real-valued orthogonal models require Ω(N²) dimensions. Provides continuity equations for latent probability mass with conserved pairwise currents.

Conclusion: The quantum-inspired framework offers theoretical advantages for disambiguation tasks through quantum interference, with provable quadratic representational gaps over real-valued models, suggesting potential for improved sequence modeling.

Abstract: We introduce a sequence modeling framework in which the latent state is a complex-valued wave function evolving on a finite-dimensional Hilbert space under a learned, time-dependent Hamiltonian. Unlike standard recurrent architectures that rely on gating mechanisms to suppress competing hypotheses, our framework utilizes quantum interference: the Hamiltonian steers the phases of complex amplitudes so that conflicting interpretations cancel while compatible ones reinforce. The dynamics are strictly unitary, ensuring that the state norm is preserved exactly at every time step via a Cayley (Crank–Nicolson) discretization. Token probabilities are extracted using the Born rule, a quadratic measurement operator that couples magnitudes and relative phases. Our primary theoretical contribution is a separation theorem characterizing the representational advantage of this readout: we define a family of disambiguation tasks that a complex unitary model of dimension $N$ solves exactly, but which requires a state dimension of $Ω(N^2)$ for any real-valued orthogonal model equipped with a standard affine-softmax readout. This quadratic gap arises because the Born rule implicitly lifts the $N$-dimensional state into the space of rank-one Hermitian matrices, accessing pairwise phase correlations that are inaccessible to linear projections. Finally, we derive a continuity equation for the latent probability mass, yielding conserved pairwise currents that serve as a built-in diagnostic for tracing information flow between dimensions.
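The two mechanical ingredients named above, the Cayley (Crank–Nicolson) update and the Born-rule readout, can be illustrated with NumPy. This is a sketch under assumptions: the Hamiltonian here is a fixed random Hermitian matrix rather than a learned, time-dependent one, the readout basis is an arbitrary unitary, and the function names are hypothetical. It only demonstrates that the update preserves the state norm and that the squared amplitudes form a probability distribution.

```python
import numpy as np


def cayley_step(psi, H, dt):
    """One Cayley step: solve (I + i*dt/2*H) psi_next = (I - i*dt/2*H) psi.

    For Hermitian H the implied update matrix is unitary, so ||psi|| is preserved.
    """
    identity = np.eye(H.shape[0], dtype=complex)
    A = identity + 0.5j * dt * H
    B = identity - 0.5j * dt * H
    return np.linalg.solve(A, B @ psi)


def born_probabilities(psi, readout):
    """Born rule: p_v = |<r_v, psi>|^2, with the readout vectors r_v as rows."""
    amplitudes = readout.conj() @ psi
    return np.abs(amplitudes) ** 2


rng = np.random.default_rng(0)
N = 8  # hypothetical state dimension

# Random Hermitian stand-in for the learned Hamiltonian.
M = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
H = 0.5 * (M + M.conj().T)

# Random normalized initial state.
psi = rng.normal(size=N) + 1j * rng.normal(size=N)
psi = psi / np.linalg.norm(psi)

for _ in range(100):
    psi = cayley_step(psi, H, dt=0.1)

# Orthonormal readout basis taken from a QR decomposition (an assumption).
Q, _ = np.linalg.qr(rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N)))
probs = born_probabilities(psi, Q)

print(np.linalg.norm(psi), probs.sum())
```

Because each probability depends on products of amplitude pairs, the readout is quadratic in the state, which is the property the separation theorem exploits.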

[441] Orthogonal Weight Modification Enhances Learning Scalability and Convergence Efficiency without Gradient Backpropagation

Guoqing Ma, Shan Yu

Main category: cs.LG

TL;DR: LOCO is a perturbation-based non-backpropagation method for training deep spiking neural networks with O(1) parallel time complexity, inspired by brain mechanisms and achieving state-of-the-art performance among brain-inspired non-BP algorithms.

Details

Motivation: Backpropagation is computationally expensive, especially for emerging neuromorphic systems. Non-BP methods offer alternatives but face efficiency and scalability challenges. The paper aims to develop a brain-inspired approach that overcomes these limitations.

Method: Proposes LOCO (LOw-rank Cluster Orthogonal) weight modification, a perturbation-based approach inspired by neural representations and dynamic mechanisms in the brain. Leverages the inherent low-rank property of perturbation-based algorithms and uses orthogonality constraints to limit gradient estimate variance and enhance convergence efficiency.

Result: LOCO can train the deepest spiking neural networks to date (more than 10 layers), demonstrates strong continual learning ability, improved convergence efficiency, and better task performance compared to other brain-inspired non-BP algorithms. Requires only O(1) parallel time complexity for weight updates.

Conclusion: LOCO offers a promising direction for high-performance, real-time, and lifelong learning on neuromorphic systems by addressing efficiency and scalability challenges of non-BP methods through brain-inspired mechanisms.

Abstract: Recognizing the substantial computational cost of backpropagation (BP), non-BP methods have emerged as attractive alternatives for efficient learning on emerging neuromorphic systems. However, existing non-BP approaches still face critical challenges in efficiency and scalability. Inspired by neural representations and dynamic mechanisms in the brain, we propose a perturbation-based approach called LOw-rank Cluster Orthogonal (LOCO) weight modification. We find that low-rank is an inherent property of perturbation-based algorithms. Under this condition, the orthogonality constraint limits the variance of the node perturbation (NP) gradient estimates and enhances the convergence efficiency. Through extensive evaluations on multiple datasets, LOCO demonstrates the capability to locally train the deepest spiking neural networks to date (more than 10 layers), while exhibiting strong continual learning ability, improved convergence efficiency, and better task performance compared to other brain-inspired non-BP algorithms. Notably, LOCO requires only O(1) parallel time complexity for weight updates, which is significantly lower than that of BP methods. This offers a promising direction for achieving high-performance, real-time, and lifelong learning on neuromorphic systems.

[442] ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Tianjun Yao, Yongqiang Chen, Yujia Zheng, Pan Li, Zhiqiang Shen, Kun Zhang

Main category: cs.LG

TL;DR: ParamAgent: A reflection-based language agent framework with ParamMem module that encodes cross-sample reflection patterns into model parameters for diverse reflection generation, improving performance on code generation, math reasoning, and QA tasks.

Details

Motivation: Self-reflection in language agents often produces repetitive outputs that limit reasoning performance. The paper identifies a strong correlation between reflective diversity and task success, motivating the need for more diverse reflection signals to enhance agent capabilities.

Method: Introduces ParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Builds ParamAgent framework integrating parametric memory with episodic and cross-sample memory for reflection-based agents.

Result: Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without stronger external models.

Conclusion: ParamMem serves as an effective component for enhancing language agents by enabling diverse reflection generation, with potential applications across various reasoning tasks and model scales through its parametric memory approach.

Abstract: Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive correlation between reflective diversity and task success, further motivating the need for diverse reflection signals. We introduce ParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Building on this module, we propose ParamAgent, a reflection-based agent framework that integrates parametric memory with episodic and cross-sample memory. Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. Further analysis reveals that ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without reliance on a stronger external model, highlighting the potential of ParamMem as an effective component for enhancing language agents.

[443] Code World Models for Parameter Control in Evolutionary Algorithms

Camilo Chacón Sartori, Guillem Rodríguez Corominas

Main category: cs.LG

TL;DR: LLMs synthesize Python programs to simulate optimizer dynamics and use greedy planning to control mutation strength in combinatorial optimization problems.

Details

Motivation: To explore whether LLMs can learn and control optimizer behavior by synthesizing simulators of stochastic optimization dynamics, enabling better performance than traditional adaptive methods without requiring optimal policy trajectories.

Method: Extends Code World Models (CWMs) from deterministic games to stochastic combinatorial optimization. Given suboptimal trajectories of (1+1)-RLS_k optimizer, the LLM synthesizes a Python simulator of the optimizer’s dynamics, then uses greedy planning over this simulator to select mutation strength k at each step.

Result: CWM-greedy performs within 6% of theoretically optimal policy on LO and OneMax, achieves 100% success rate on Jump_k where all baselines fail (0%), and outperforms all baselines on NK-Landscape across 15 instances. Also outperforms DQN in sample efficiency (200 offline vs 500 online), success rate (100% vs 58%), and generalization.

Conclusion: LLMs can effectively learn optimizer dynamics through code synthesis and use this knowledge for control, achieving strong performance on combinatorial optimization problems without requiring optimal demonstrations or oracle knowledge.

Abstract: Can an LLM learn how an optimizer behaves – and use that knowledge to control it? We extend Code World Models (CWMs), LLM-synthesized Python programs that predict environment dynamics, from deterministic games to stochastic combinatorial optimization. Given suboptimal trajectories of $(1{+}1)$-$\text{RLS}_k$, the LLM synthesizes a simulator of the optimizer’s dynamics; greedy planning over this simulator then selects the mutation strength $k$ at each step. On LeadingOnes (LO) and OneMax, CWM-greedy performs within 6% of the theoretically optimal policy – without ever seeing optimal-policy trajectories. On Jump$_k$, where a deceptive valley causes all adaptive baselines to fail (0% success rate), CWM-greedy achieves 100% success rate – without any collection policy using oracle knowledge of the gap parameter. On the NK-Landscape, where no closed-form model exists, CWM-greedy outperforms all baselines across fifteen independently generated instances ($36.94$ vs. $36.32$; $p<0.001$) when the prompt includes empirical transition statistics. The CWM also outperforms DQN in sample efficiency (200 offline trajectories vs. 500 online episodes), success rate (100% vs. 58%), and generalization ($k{=}3$: 78% vs. 0%). Robustness experiments confirm stable synthesis across 5 independent runs.
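The control loop described above, a $(1{+}1)$-RLS$_k$ optimizer whose mutation strength $k$ is chosen greedily from a model of its own dynamics, can be sketched for OneMax. This is an illustration under assumptions: the hand-written hypergeometric model in `expected_gain` stands in for the LLM-synthesized simulator, the candidate set for $k$ and all function names are hypothetical, and only OneMax is covered.

```python
import random
from math import comb


def onemax(bits):
    """Fitness: number of one-bits."""
    return sum(bits)


def rls_k_step(bits, k, rng):
    """(1+1)-RLS_k: flip exactly k distinct random bits, keep the child if not worse."""
    child = bits[:]
    for i in rng.sample(range(len(bits)), k):
        child[i] ^= 1
    return child if onemax(child) >= onemax(bits) else bits


def expected_gain(n, ones, k):
    """Expected fitness gain of one elitist RLS_k step on OneMax.

    If j of the k flipped bits were zeros, fitness changes by 2j - k; negative
    changes are rejected, so the realized gain is max(0, 2j - k).
    """
    zeros = n - ones
    total = comb(n, k)
    gain = 0.0
    for j in range(k + 1):
        ways = comb(zeros, j) * comb(ones, k - j)  # hypergeometric count
        gain += max(0, 2 * j - k) * ways / total
    return gain


def greedy_k(n, ones, candidates=(1, 2, 3, 4, 5)):
    """Greedy planning: pick the k with the largest one-step expected gain."""
    return max(candidates, key=lambda k: expected_gain(n, ones, k))


def run(n=50, seed=0, max_steps=10_000):
    """Run until the all-ones string is found; return the number of steps used."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n)]
    for step in range(max_steps):
        if onemax(bits) == n:
            return step
        bits = rls_k_step(bits, greedy_k(n, onemax(bits)), rng)
    return max_steps


steps = run()
print(steps, greedy_k(50, 0), greedy_k(50, 49))
```

The greedy rule prefers large $k$ far from the optimum and falls back to $k=1$ near it, which is the qualitative behavior a learned simulator would need to reproduce.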

[444] WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention

Ruben Solozabal, Velibor Bojkovic, Hilal Alquabeh, Klea Ziu, Kentaro Inui, Martin Takac

Main category: cs.LG

TL;DR: WaveSSM introduces state-space models based on wavelet frames instead of polynomial bases, offering better temporal localization for signals with transient dynamics like physiological and audio data.

Details

Motivation: Existing state-space models using polynomial bases have global temporal support that poorly matches signals with localized or transient structure, limiting their effectiveness for real-world applications like physiological signals and audio.

Method: Develops WaveSSM, a collection of state-space models constructed over wavelet frames that provide localized temporal support, enabling better capture of transient dynamics in signals.

Result: WaveSSM outperforms orthogonal counterparts like S4 on real-world datasets with transient dynamics, including physiological signals on PTB-XL dataset and raw audio on Speech Commands.

Conclusion: Wavelet-based state-space models offer superior temporal localization for signals with transient structure, providing a more effective foundation for modeling real-world sequential data.

Abstract: State-space models (SSMs) have emerged as a powerful foundation for long-range sequence modeling, with the HiPPO framework showing that continuous-time projection operators can be used to derive stable, memory-efficient dynamical systems that encode the past history of the input signal. However, existing projection-based SSMs often rely on polynomial bases with global temporal support, whose inductive biases are poorly matched to signals exhibiting localized or transient structure. In this work, we introduce WaveSSM, a collection of SSMs constructed over wavelet frames. Our key observation is that wavelet frames yield a localized support on the temporal dimension, useful for tasks requiring precise localization. Empirically, we show that under equal conditions, WaveSSM outperforms orthogonal counterparts such as S4 on real-world datasets with transient dynamics, including physiological signals on the PTB-XL dataset and raw audio on Speech Commands.

[445] Sustainable LLM Inference using Context-Aware Model Switching

Yuvarani, Akashdeep Singh, Zahra Fathanah, Salsabila Harlen, Syeikha Syafura Al-Zahra binti Zahari, Hema Subramaniam

Main category: cs.LG

TL;DR: A context-aware model switching system for energy-efficient LLM inference that dynamically routes queries to appropriate-sized models based on complexity, achieving up to 67.5% energy reduction while maintaining 93.6% response quality.

Details

Motivation: Growing energy consumption of LLMs raises sustainability concerns; current one-size-fits-all inference strategies waste energy by using large models for simple queries, creating need for more efficient routing systems.

Method: Context-aware model switching combines caching for repeated queries, rule-based complexity scoring, ML classification for semantic intent, and user-adaptive learning to dynamically select an appropriate model (Gemma3 1B, Gemma3 4B, or Qwen3 4B) based on query complexity.

Result: Achieved up to 67.5% energy reduction compared to always using largest model, maintained 93.6% response quality (BERTScore F1), and improved response time for simple queries by ~68% while maintaining routing accuracy.

Conclusion: Model switching inference offers practical path toward energy-efficient AI systems, demonstrating significant efficiency gains without major quality sacrifices, enabling more sustainable LLM deployments.

Abstract: Large language models have become central to many AI applications, but their growing energy consumption raises serious sustainability concerns. A key limitation in current AI deployments is the reliance on a one-size-fits-all inference strategy where most systems route every request to the same large model, regardless of task complexity, leading to substantial and unnecessary energy waste. To address this issue, we propose a context-aware model switching approach that dynamically selects an appropriate language model based on query complexity. The proposed system, Context-Aware Model Switching for Energy-Efficient LLM Inference, combines caching for repeated queries, rule-based complexity scoring for fast and explainable decisions, machine learning classification to capture semantic intent, and a user-adaptive component that learns from interaction patterns over time. The proposed architecture was evaluated using real conversation workloads and three open-source language models (Gemma3 1B, Gemma3 4B and Qwen3 4B) with different computational costs, measuring energy consumption (via NVML GPU power telemetry), response latency, routing accuracy, and output quality (BERTScore F1) to reflect real-world usage conditions. Experimental results show that the model switching approach can reduce energy consumption by up to 67.5% compared to always using the largest model while maintaining a response quality of 93.6%. In addition, the response time for simple queries also improved significantly by approximately 68%. These results show that model switching inference offers a practical and scalable path toward more energy-efficient and sustainable AI systems, demonstrating that significant efficiency gains can be achieved without major sacrifices in response quality.
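The first two routing stages described above, a cache for repeated queries followed by rule-based complexity scoring, might look roughly like the following. Everything here is a hypothetical sketch: the keyword list, the score thresholds, the tier names, and the stub backends are assumptions, and the ML classifier and user-adaptive component are omitted.

```python
from typing import Callable, Dict, Tuple

# Hypothetical keywords that suggest a query needs multi-step reasoning.
REASONING_HINTS = ("why", "explain", "prove", "compare", "derive", "step by step")


def complexity_score(query: str) -> int:
    """Cheap, explainable score: longer or reasoning-heavy queries score higher."""
    q = query.lower()
    score = len(q.split()) // 20                 # one point per 20 words
    score += sum(hint in q for hint in REASONING_HINTS)
    if "def " in q or "class " in q:             # crude check for embedded code
        score += 2
    return score


def pick_tier(score: int) -> str:
    """Map a score to a model tier; the thresholds are illustrative assumptions."""
    if score <= 0:
        return "small"
    if score <= 2:
        return "medium"
    return "large"


class SwitchingRouter:
    """Route each query to a backend tier, answering repeats from a cache."""

    def __init__(self, backends: Dict[str, Callable[[str], str]]):
        self.backends = backends
        self.cache: Dict[str, str] = {}

    def answer(self, query: str) -> Tuple[str, str]:
        key = " ".join(query.lower().split())    # normalize case and whitespace
        if key in self.cache:
            return "cache", self.cache[key]
        tier = pick_tier(complexity_score(query))
        reply = self.backends[tier](query)
        self.cache[key] = reply
        return tier, reply


# Stub backends standing in for real model calls.
router = SwitchingRouter({
    "small": lambda q: f"[small] {q}",
    "medium": lambda q: f"[medium] {q}",
    "large": lambda q: f"[large] {q}",
})

first = router.answer("What is 2+2?")
repeat = router.answer("what is  2+2?")
hard = router.answer("Explain step by step why quicksort is fast and compare it to mergesort.")
print(first[0], repeat[0], hard[0])
```

Checking the cache before scoring means a repeated query costs no model inference at all, which is where the largest per-query energy saving would come from in a design like this.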

[446] Entropy-Controlled Flow Matching

Chika Maduabuchi

Main category: cs.LG

TL;DR: Entropy-Controlled Flow Matching (ECFM) introduces entropy constraints to flow-based generative models to prevent mode collapse by enforcing minimum entropy rates throughout the generation trajectory.

Details

Motivation: Standard flow-matching objectives in vision generators don't control information geometry, allowing low-entropy bottlenecks that can transiently deplete semantic modes, leading to mode collapse in generative models.

Method: Proposes ECFM as a constrained variational principle over continuity-equation paths enforcing a global entropy-rate budget d/dt H(mu_t) >= -lambda. It’s a convex optimization in Wasserstein space with KKT/Pontryagin system, and has a stochastic-control representation equivalent to a Schrodinger bridge with explicit entropy multiplier.

Result: ECFM recovers entropic OT geodesics in pure transport regime, Gamma-converges to classical OT as lambda -> 0, provides certificate-style mode-coverage and density-floor guarantees with Lipschitz stability, and constructs near-optimal collapse counterexamples for unconstrained flow matching.

Conclusion: ECFM provides a principled framework for controlling entropy in flow-based generative models, preventing mode collapse while maintaining theoretical guarantees and connections to optimal transport theory.

Abstract: Modern vision generators transport a base distribution to data through time-indexed measures, implemented as deterministic flows (ODEs) or stochastic diffusions (SDEs). Despite strong empirical performance, standard flow-matching objectives do not directly control the information geometry of the trajectory, allowing low-entropy bottlenecks that can transiently deplete semantic modes. We propose Entropy-Controlled Flow Matching (ECFM): a constrained variational principle over continuity-equation paths enforcing a global entropy-rate budget d/dt H(mu_t) >= -lambda. ECFM is a convex optimization in Wasserstein space with a KKT/Pontryagin system, and admits a stochastic-control representation equivalent to a Schrodinger bridge with an explicit entropy multiplier. In the pure transport regime, ECFM recovers entropic OT geodesics and Gamma-converges to classical OT as lambda -> 0. We further obtain certificate-style mode-coverage and density-floor guarantees with Lipschitz stability, and construct near-optimal collapse counterexamples for unconstrained flow matching.

[447] Data-Driven Supervision of a Thermal-Hydraulic Process Towards a Physics-Based Digital Twin

Osimone Imhogiemhe, Yoann Jus, Hubert Lejeune, Saïd Moussaoui

Main category: cs.LG

TL;DR: Digital twin framework for fault detection and diagnosis in thermal-hydraulic process supervision using numerical simulation and machine learning

Details

Motivation: Real-time supervision of production processes is challenging across industries, requiring monitoring and predictive maintenance for safety, uninterrupted production, and high efficiency. Digital twins offer a solution by combining physical system simulation with data-driven ML models.

Method: Developed a digital twin for thermal-hydraulic process supervision using numerical simulation of the system combined with machine learning methods. Proposed modules for process parameter change detection and online estimation of parameter values.

Result: Validated the fault detection and diagnosis algorithm on a specific test scenario with single one-off parameter changes. Results showed good accuracy in parameter variation localization and updating of parameter values.

Conclusion: Digital twin framework combining numerical simulation and machine learning provides effective solution for real-time fault detection and diagnosis in thermal-hydraulic process supervision, enabling accurate parameter change localization and value estimation.

Abstract: The real-time supervision of production processes is a common challenge across several industries. It targets process component monitoring and predictive maintenance in order to ensure safety and uninterrupted production and to maintain a high level of efficiency. The rise of advanced tools for the simulation of physical systems, in addition to data-driven machine learning models, offers the possibility to design numerical tools dedicated to efficient system monitoring. In that respect, the digital twin concept presents an adequate framework that offers a solution to these challenges. The main purpose of this paper is to develop such a digital twin dedicated to fault detection and diagnosis in the context of thermal-hydraulic process supervision. Based on a numerical simulation of the system, in addition to machine learning methods, we propose different modules dedicated to detecting process parameter changes and estimating their values online. The proposed fault detection and diagnosis algorithm is validated on a specific test scenario, with single one-off parameter change occurrences in the system. The numerical results show good accuracy in terms of parameter variation localization and the update of their values.

[448] AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning

Changhai Zhou, Shiyang Zhang, Yuhua Zhou, Qian Qiao, Jun Gao, Cheng Jin, Kaizhou Qin, Weizhong Zhang

Main category: cs.LG

TL;DR: AutoQRA jointly optimizes quantization bit-width and LoRA rank configurations during mixed quantized fine-tuning to maximize performance under memory constraints.

Details

Motivation: Current sequential pipelines of quantization followed by parameter-efficient fine-tuning fail to leverage the intricate interaction between quantization bit-width and LoRA rank, leading to suboptimal performance under memory constraints.

Method: Two-stage optimization: 1) Global multi-fidelity evolutionary search with layer-wise importance priors for warm-starting, and 2) Trust-region Bayesian optimization for local refinement of promising configurations.

Result: AutoQRA achieves performance close to full-precision fine-tuning with memory footprint comparable to uniform 4-bit methods.

Conclusion: Joint optimization of quantization and LoRA configurations enables efficient model adaptation under memory constraints while maintaining performance.

Abstract: Quantization followed by parameter-efficient fine-tuning has emerged as a promising paradigm for downstream adaptation under tight GPU memory constraints. However, this sequential pipeline fails to leverage the intricate interaction between quantization bit-width and LoRA rank. Specifically, a carefully optimized quantization allocation with low quantization error does not always translate to strong fine-tuning performance, and different bit-width and rank configurations can lead to significantly varying outcomes under the same memory budget. To address this limitation, we propose AutoQRA, a joint optimization framework that simultaneously optimizes the bit-width and LoRA rank configuration for each layer during the mixed quantized fine-tuning process. To tackle the challenges posed by the large discrete search space and the high evaluation cost associated with frequent fine-tuning iterations, AutoQRA decomposes the optimization process into two stages. First, it conducts a global multi-fidelity evolutionary search, where the initial population is warm-started by injecting layer-wise importance priors. This stage employs specific operators and a performance model to efficiently screen candidate configurations. Second, trust-region Bayesian optimization is applied to locally refine promising regions of the search space and identify optimal configurations under the given memory budget. This approach enables active compensation for quantization noise in specific layers during training. Experiments show that AutoQRA achieves performance close to full-precision fine-tuning with a memory footprint comparable to uniform 4-bit methods.
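The memory trade-off between bit-width and LoRA rank can be illustrated with simple arithmetic. This is a rough sketch, not AutoQRA's cost model: it assumes a single linear layer, counts only quantized base weights plus 16-bit LoRA factors, and ignores quantization metadata, optimizer state, and activations. The function names and the 4096-wide layer are hypothetical.

```python
def layer_memory_bytes(d_in, d_out, bits, rank, adapter_bytes=2):
    """Approximate bytes for one linear layer: quantized base weights
    plus LoRA factors A (rank x d_in) and B (d_out x rank).
    Simplifying assumption: no scales/zero-points, no optimizer state."""
    base = d_in * d_out * bits / 8
    lora = rank * (d_in + d_out) * adapter_bytes
    return base + lora


def config_memory_bytes(layers):
    """Sum the estimate over (d_in, d_out, bits, rank) tuples."""
    return sum(layer_memory_bytes(*layer) for layer in layers)


# Two hypothetical allocations for the same two-layer model:
# uniform 4-bit with rank 16, versus mixed 3/5-bit with ranks 64/8.
uniform = [(4096, 4096, 4, 16), (4096, 4096, 4, 16)]
mixed = [(4096, 4096, 3, 64), (4096, 4096, 5, 8)]
print(config_memory_bytes(uniform), config_memory_bytes(mixed))
```

Under this estimate the two allocations land within a few percent of each other, which is the situation the abstract describes: similar memory budgets, but potentially different fine-tuning outcomes.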

[449] CQSA: Byzantine-robust Clustered Quantum Secure Aggregation in Federated Learning

Arnab Nath, Harsh Kasyap

Main category: cs.LG

TL;DR: CQSA introduces clustered quantum secure aggregation for federated learning, using small GHZ states per cluster instead of one large global state, improving fidelity and enabling Byzantine detection.

Details

Motivation: Existing quantum secure aggregation protocols use a single global GHZ state shared among all clients, which suffers from fidelity degradation with increasing clients and prevents Byzantine client detection.

Method: CQSA randomly partitions clients into small clusters, each performing local quantum aggregation using high-fidelity, low-qubit GHZ states, then uses statistical measures (cosine similarity, Euclidean distance) on cluster-level aggregates to detect malicious contributions.

Result: Theoretical analysis and simulations under depolarizing noise show CQSA ensures stable model convergence and achieves superior state fidelity over global QSA approaches.

Conclusion: CQSA reconciles near-term quantum hardware constraints with Byzantine-robustness needs in federated learning, offering a practical quantum-assisted FL solution.

Abstract: Federated Learning (FL) enables collaborative model training without sharing raw data. However, shared local model updates remain vulnerable to inference and poisoning attacks. Secure aggregation schemes have been proposed to mitigate these attacks. In this work, we aim to understand how these techniques are implemented in quantum-assisted FL. Quantum Secure Aggregation (QSA) has been proposed, offering information-theoretic privacy by encoding client updates into the global phase of multipartite entangled states. Existing QSA protocols, however, rely on a single global Greenberger-Horne-Zeilinger (GHZ) state shared among all participating clients. This design poses fundamental challenges: (i) the fidelity of large-scale GHZ states deteriorates rapidly with the increasing number of clients; and (ii) the global aggregation prevents the detection of Byzantine clients. We propose Clustered Quantum Secure Aggregation (CQSA), a modular aggregation framework that reconciles the physical constraints of near-term quantum hardware with the need for Byzantine-robustness in FL. CQSA randomly partitions the clients into small clusters, each performing local quantum aggregation using high-fidelity, low-qubit GHZ states. The server analyzes statistical relationships between cluster-level aggregates, employing common statistical measures such as cosine similarity and Euclidean distance to identify malicious contributions. Through theoretical analysis and simulations under depolarizing noise, we demonstrate that CQSA ensures stable model convergence and achieves superior state fidelity over global QSA.
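The server-side check on cluster-level aggregates can be sketched classically. The snippet below is an illustration only: the abstract names cosine similarity and Euclidean distance but does not give the decision rule, so the coordinate-wise median reference and the `cos_threshold` cutoff are assumptions, and the quantum aggregation step itself is not modeled.

```python
import numpy as np


def flag_suspicious_clusters(aggregates, cos_threshold=0.0):
    """Return indices of cluster aggregates whose cosine similarity to the
    coordinate-wise median of all aggregates is below cos_threshold.
    Both the median reference and the threshold are hypothetical choices."""
    A = np.asarray(aggregates, dtype=float)
    reference = np.median(A, axis=0)
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(reference)
    cos = (A @ reference) / np.maximum(norms, 1e-12)  # guard against zero norm
    return [i for i, c in enumerate(cos) if c < cos_threshold]


# Three clusters with similar updates and one with a sign-flipped update.
honest = [[1.0, 0.9, 1.1], [0.9, 1.0, 1.0], [1.1, 1.0, 0.9]]
poisoned = [[-1.0, -1.0, -1.0]]
print(flag_suspicious_clusters(honest + poisoned))
```

Because the check runs on per-cluster sums rather than one global sum, a flagged cluster narrows the search for a malicious client to a small group.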

[450] Prior Knowledge-enhanced Spatio-temporal Epidemic Forecasting

Sijie Ruan, Jinyu Li, Jia Wei, Zenghao Xu, Jie Bao, Junshi Xu, Junyang Qiu, Hanning Yuan, Xiaoxiao Wang, Shuliang Wang

Main category: cs.LG

TL;DR: STOEP is a hybrid spatio-temporal epidemic forecasting framework that integrates implicit and explicit priors to address challenges with weak signals, oversimplified spatial relations, and unstable parameter estimation.

Details

Motivation: Existing epidemic forecasting methods struggle with insensitivity to weak epidemic signals, over-simplified spatial relations, and unstable parameter estimation, which limits their practical utility for public health management.

Method: STOEP combines three components: 1) Case-aware Adjacency Learning (CAL) for dynamic adjustment of mobility-based regional dependencies, 2) Space-informed Parameter Estimating (SPE) using learnable spatial priors to amplify weak signals, and 3) Filter-based Mechanistic Forecasting (FMF) with expert-guided adaptive thresholding to regularize epidemic parameters.

Result: STOEP outperforms the best baseline by 11.1% in RMSE on real-world COVID-19 and influenza datasets, and has been deployed at a provincial CDC in China for practical applications.

Conclusion: STOEP effectively addresses key challenges in epidemic forecasting through its hybrid approach integrating implicit and explicit priors, demonstrating superior performance and practical deployment value.

Abstract: Spatio-temporal epidemic forecasting is critical for public health management, yet existing methods often struggle with insensitivity to weak epidemic signals, over-simplified spatial relations, and unstable parameter estimation. To address these challenges, we propose the Spatio-Temporal priOr-aware Epidemic Predictor (STOEP), a novel hybrid framework that integrates implicit spatio-temporal priors and explicit expert priors. STOEP consists of three key components: (1) Case-aware Adjacency Learning (CAL), which dynamically adjusts mobility-based regional dependencies using historical infection patterns; (2) Space-informed Parameter Estimating (SPE), which employs learnable spatial priors to amplify weak epidemic signals; and (3) Filter-based Mechanistic Forecasting (FMF), which uses an expert-guided adaptive thresholding strategy to regularize epidemic parameters. Extensive experiments on real-world COVID-19 and influenza datasets demonstrate that STOEP outperforms the best baseline by 11.1% in RMSE. The system has been deployed at one provincial CDC in China to facilitate downstream applications.

[451] Support Tokens, Stability Margins, and a New Foundation for Robust LLMs

Deepak Agarwal, Dhyey Dharmendrakumar Mavani, Suyash Gupta, Karthik Sethuraman, Tejas Dharamsi

Main category: cs.LG

TL;DR: The paper reinterprets causal self-attention transformers through a probabilistic framework, revealing a barrier constraint on attention parameters that creates structured geometry in token space, leading to support tokens and a Bayesian framework with log-barrier penalty for more robust training.

Details

Motivation: To provide a deeper theoretical understanding of self-attention transformers by reinterpreting them within a probabilistic framework similar to how PCA was extended to probabilistic PCA, revealing underlying structural constraints and geometry.

Method: Reformulates causal self-attention transformers probabilistically, identifies barrier constraints on attention parameters, analyzes token space geometry, proposes Bayesian framework with MAP estimation using log-barrier penalty added to cross-entropy loss.

Result: Reveals structured geometry in token space, identifies support tokens similar to support vectors in SVMs, shows barrier constraint leads to attention ill-conditioning, demonstrates robust models without sacrificing accuracy using log-barrier penalty.

Conclusion: Provides theoretical insights into LLM dynamics through probabilistic framework, reveals structural constraints in attention mechanisms, and offers practical training improvement with log-barrier penalty for more robust sequence modeling.

Abstract: Self-attention is usually described as a flexible, content-adaptive way to mix a token with information from its past. We re-interpret causal self-attention transformers, the backbone of modern foundation models, within a probabilistic framework, much like how classical PCA is extended to probabilistic PCA. However, this re-formulation reveals a surprising and deeper structural insight: due to a change-of-variables phenomenon, a barrier constraint emerges on the self-attention parameters. This induces a highly structured geometry on the token space, providing theoretical insights into the dynamics of LLM decoding. This reveals a boundary where attention becomes ill-conditioned, leading to a margin interpretation similar to classical support vector machines. Just like support vectors, this naturally gives rise to the concept of support tokens. Furthermore, we show that LLMs can be interpreted as a stochastic process over the power set of the token space, providing a rigorous probabilistic framework for sequence modeling. We propose a Bayesian framework and derive a MAP estimation objective that requires only a minimal modification to standard LLM training: the addition of a smooth log-barrier penalty to the usual cross-entropy loss. We demonstrate that this provides more robust models without sacrificing out-of-sample accuracy and that it is straightforward to incorporate in practice.
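The shape of a barrier-regularized objective can be sketched as follows. The abstract does not specify the constraint, so the `margins` argument is a placeholder: it stands for any per-example positive slack that measures distance from the boundary where attention becomes ill-conditioned. The weight `mu` and both function names are likewise assumptions.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean cross-entropy computed from a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def barrier_regularized_loss(logits, targets, margins, mu=0.01):
    """Cross-entropy plus a log-barrier on hypothetical positive margins.
    The penalty -mu * mean(log(margin)) grows without bound as any margin
    approaches zero; a non-positive margin is treated as infeasible."""
    margins = np.asarray(margins, dtype=float)
    if np.any(margins <= 0):
        return float("inf")
    return cross_entropy(logits, targets) - mu * np.log(margins).mean()


logits = np.zeros((2, 3))
targets = [0, 1]
print(barrier_regularized_loss(logits, targets, margins=[1.0, 1.0]))
print(barrier_regularized_loss(logits, targets, margins=[0.1, 0.1]))
```

The second call returns a larger loss than the first, since smaller margins are penalized while the cross-entropy term is unchanged.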

[452] Positional-aware Spatio-Temporal Network for Large-Scale Traffic Prediction

Runfei Chen

Main category: cs.LG

TL;DR: PASTN: Lightweight Positional-aware Spatio-Temporal Network for traffic flow forecasting that uses positional embeddings to distinguish nodes and temporal attention for long-range perception.

Details

Motivation: Existing traffic flow forecasting models struggle with distinguishing individual nodes clearly while maintaining a holistic historical view, especially for broader geographical areas and longer time spans. Most models also face deployment challenges due to large data sizes in real applications.

Method: Proposes PASTN with positional-aware embeddings to separate node representations and a temporal attention module to enhance long-range perception. The model captures both temporal and spatial complexities in an end-to-end lightweight architecture.

Result: Extensive experiments show PASTN’s effectiveness and efficiency across datasets of various scales (county, megalopolis, and state). Further analysis demonstrates the efficacy of the newly introduced modules.

Conclusion: PASTN provides an effective and efficient solution for traffic flow forecasting that addresses node distinction and long-range temporal perception while being lightweight enough for real-world deployment.

Abstract: Traffic flow forecasting has emerged as an indispensable task for daily life. It requires exploiting the spatiotemporal relationships between locations within a time period, under a graph structure, to predict future flow. However, the large travel demand over broader geographical areas and longer time spans requires models to distinguish each node clearly and possess a holistic view of the history, which has received less attention in prior works. Furthermore, increasing data sizes hinder the deployment of most models in real application environments. To this end, in this paper, we propose a lightweight Positional-aware Spatio-Temporal Network (PASTN) to effectively capture both temporal and spatial complexities in an end-to-end manner. PASTN introduces positional-aware embeddings to separate each node’s representation, while also utilizing a temporal attention module to improve the long-range perception of current models. Extensive experiments verify the effectiveness and efficiency of PASTN across datasets of various scales (county, megalopolis and state). Further analysis demonstrates the efficacy of the newly introduced modules as well.

[453] X-REFINE: XAI-based RElevance input-Filtering and archItecture fiNe-tuning for channel Estimation

Abdul Karim Gizzini, Yahia Medjahdi

Main category: cs.LG

TL;DR: X-REFINE is an XAI framework for 6G wireless communications that jointly optimizes input filtering and neural architecture fine-tuning using decomposition-based LRP to improve interpretability-performance-complexity trade-offs.

Details

Motivation: The black-box nature and high complexity of deep learning models in critical 6G applications like channel estimation limit practical deployment. Existing XAI solutions focus only on input filtering without optimizing internal model structure.

Method: X-REFINE uses a decomposition-based, sign-stabilized LRP epsilon rule to backpropagate predictions and derive high-resolution relevance scores for both subcarriers (inputs) and hidden neurons. This enables holistic optimization to identify the most faithful model components for joint input-filtering and architecture fine-tuning.

Result: Simulation results show X-REFINE achieves superior interpretability-performance-complexity trade-off, significantly reducing computational complexity while maintaining robust bit error rate (BER) performance across different scenarios.

Conclusion: X-REFINE provides an effective XAI-based framework for 6G wireless communications that addresses both input and architectural optimization, enabling more practical deployment of complex deep learning models in critical applications.

Abstract: AI-native architectures are vital for 6G wireless communications. The black-box nature and high complexity of deep learning models employed in critical applications, such as channel estimation, limit their practical deployment. While perturbation-based XAI solutions offer input filtering, they often neglect internal structural optimization. We propose X-REFINE, an XAI-based framework for joint input-filtering and architecture fine-tuning. By utilizing a decomposition-based, sign-stabilized LRP epsilon rule, X-REFINE backpropagates predictions to derive high-resolution relevance scores for both subcarriers and hidden neurons. This enables a holistic optimization that identifies the most faithful model components. Simulation results demonstrate that X-REFINE achieves a superior interpretability-performance-complexity trade-off, significantly reducing computational complexity while maintaining robust bit error rate (BER) performance across different scenarios.
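For reference, the standard LRP epsilon rule for one dense layer can be written in a few lines. This is the textbook rule, not necessarily the paper's decomposition-based variant: relevance arriving at each output is redistributed to the inputs in proportion to their contribution `a_i * W_ij`, with a small sign-matched stabilizer added to the pre-activation. The layer shapes and the function name are assumptions.

```python
import numpy as np


def lrp_epsilon_dense(a, W, b, R_out, eps=1e-6):
    """Propagate relevance R_out (n_out,) back through z = a @ W + b.
    a: input activations (n_in,), W: weights (n_in, n_out), b: bias (n_out,).
    The stabilizer eps * sign(z) keeps the denominator away from zero;
    z == 0 is treated as positive so the division is always defined."""
    z = a @ W + b
    stabilizer = eps * np.where(z >= 0, 1.0, -1.0)
    s = R_out / (z + stabilizer)
    return a * (W @ s)


a = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0], [0.5, 1.0]])
b = np.zeros(2)
R_in = lrp_epsilon_dense(a, W, b, R_out=np.array([1.0, 1.0]))
print(R_in, R_in.sum())
```

With zero bias and a small `eps`, the input relevances sum to approximately the output relevance. Scores of this kind, computed for inputs and hidden units alike, are what a relevance-based filter or pruning step would rank.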

[454] Integrating Machine Learning Ensembles and Large Language Models for Heart Disease Prediction Using Voting Fusion

Md. Tahsin Amin, Tanim Ahmmod, Zannatul Ferdus, Talukder Naemul Hasan Naem, Ehsanul Ferdous, Arpita Bhattacharjee, Ishmam Ahmed Solaiman, Nahiyan Bin Noor

Main category: cs.LG

TL;DR: Hybrid ML-LLM system combining traditional ML ensembles with large language models achieves best performance (96.62% accuracy) for cardiovascular disease prediction from tabular patient data.

Details

Motivation: Cardiovascular disease is the leading global cause of death, requiring early detection and reliable decision-support tools. While traditional ML models excel at tabular data, LLMs offer new zero-shot/few-shot reasoning capabilities, creating an opportunity to combine their strengths.

Method: Used merged dataset of 1,190 patient records; compared traditional ML models (Random Forest, XGBoost, LightGBM, CatBoost) with open-source LLMs via OpenRouter APIs; developed hybrid fusion combining ML ensemble with LLM reasoning using Gemini 2.5 Flash.

Result: ML ensembles achieved 95.78% accuracy (ROC-AUC 0.96); LLMs performed moderately (78.9% zero-shot, 72.6% few-shot); hybrid ML-LLM system achieved best results: 96.62% accuracy, 0.97 AUC, showing LLMs work best combined with ML rather than alone.

Conclusion: Ensemble ML remains best for structured tabular prediction, but hybrid ML-LLM systems provide minor performance improvements and open pathways to more reliable clinical decision-support tools by leveraging LLM reasoning in uncertain situations.

Abstract: Cardiovascular disease is the primary cause of death globally, necessitating early identification, precise risk classification, and dependable decision-support technologies. The advent of large language models (LLMs) provides new zero-shot and few-shot reasoning capabilities, even though machine learning (ML) algorithms, especially ensemble approaches like Random Forest, XGBoost, LightGBM, and CatBoost, are excellent at modeling complex, non-linear patient data and routinely beat logistic regression. This research predicts cardiovascular disease using a merged dataset of 1,190 patient records, comparing traditional machine learning models (95.78% accuracy, ROC-AUC 0.96) with open-source large language models via OpenRouter APIs. Finally, a hybrid fusion of the ML ensemble and LLM reasoning under Gemini 2.5 Flash achieved the best results (96.62% accuracy, 0.97 AUC), showing that LLMs (78.9% accuracy) work best when combined with ML models rather than used alone. Among the standalone approaches, ML ensembles achieved the highest performance (95.78% accuracy, ROC-AUC 0.96), while LLMs performed moderately in zero-shot (78.9%) and few-shot (72.6%) settings. The proposed hybrid method improved robustness in uncertain cases, illustrating that ensemble ML remains the best choice for structured tabular prediction but can be integrated into hybrid ML-LLM systems to provide a minor gain and open the way to more reliable clinical decision-support tools.
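One way such a fusion could work is sketched below. The abstract does not describe the fusion rule, so everything here is an assumption: the ML models are soft-voted by averaging their probabilities, and the LLM's binary vote is blended in only when that average falls inside an uncertainty band around 0.5. The `band` and `llm_weight` values are illustrative.

```python
def fuse_prediction(ml_probs, llm_vote, band=0.1, llm_weight=0.3):
    """Fuse per-model disease probabilities with a binary LLM vote (0 or 1).
    Confident ML averages are returned as-is; only averages within `band`
    of 0.5 are blended with the LLM vote. All thresholds are hypothetical."""
    ml_mean = sum(ml_probs) / len(ml_probs)  # soft vote across ML models
    if abs(ml_mean - 0.5) >= band:
        return int(ml_mean >= 0.5)
    fused = (1 - llm_weight) * ml_mean + llm_weight * llm_vote
    return int(fused >= 0.5)


print(fuse_prediction([0.90, 0.80, 0.95], llm_vote=0))  # confident ML case
print(fuse_prediction([0.45, 0.50, 0.52], llm_vote=1))  # borderline case
```

Restricting the LLM to borderline cases keeps the stronger tabular models in control of most predictions, consistent with the reported gap between ML and LLM accuracy.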

[455] BrepCoder: A Unified Multimodal Large Language Model for Multi-task B-rep Reasoning

Mingi Kim, Yongjun Kim, Jungwoo Kang, Hyungki Kim

Main category: cs.LG

TL;DR: BrepCoder: A unified multimodal LLM for CAD tasks using B-rep format, converting CAD sequences to Python-like code and achieving generalization across completion, error correction, and CAD-QA tasks.

Details

Motivation: Existing CAD approaches use task-specific models requiring structural modifications for new tasks, and focus on point clouds/images rather than industry-standard B-rep format. Need for unified model handling diverse CAD tasks from B-rep inputs.

Method: Leverages LLM code generation capabilities to convert CAD modeling sequences into Python-like code aligned with B-rep. Uses two-stage training: 1) pre-training on reverse engineering to learn geometric features and design logic, 2) extending to downstream tasks like completion, error correction, and CAD-QA.

Result: BrepCoder achieves superior generalization across diverse CAD tasks by interpreting B-rep as structural code, demonstrating potential as general-purpose CAD agent.

Conclusion: Proposed approach successfully creates unified MLLM for CAD that works with industry-standard B-rep format and generalizes across multiple tasks through code-based representation.

Abstract: Recent advancements in deep learning have actively addressed complex challenges within the Computer-Aided Design (CAD) domain. However, most existing approaches rely on task-specific models requiring structural modifications for new tasks, and they predominantly focus on point clouds or images rather than the industry-standard Boundary Representation (B-rep) format. To address these limitations, we propose BrepCoder, a unified Multimodal Large Language Model (MLLM) that performs diverse CAD tasks from B-rep inputs. By leveraging the code generation capabilities of Large Language Models (LLMs), we convert CAD modeling sequences into Python-like code and align them with B-rep. We then adopt a two-stage training strategy: First, pre-training on reverse engineering to learn geometric features and design logic. Second, effectively extending the model to various downstream tasks such as completion, error correction, and CAD-QA. Consequently, by interpreting B-rep as structural code, BrepCoder achieves superior generalization across diverse tasks, demonstrating its potential as a general-purpose CAD agent.

[456] Early Risk Stratification of Dosing Errors in Clinical Trials Using Machine Learning

Félicien Hêche, Sohrab Ferdowsi, Anthony Yazdani, Sara Sansaloni-Pastor, Douglas Teodoro

Main category: cs.LG

TL;DR: ML framework using multimodal data (structured trial info + protocol text) to predict clinical trial dosing error risk before trial initiation, achieving 0.862 AUC-ROC with late-fusion model.

Details

Motivation: To develop a proactive, risk-based quality management system for clinical trials by predicting dosing error risks early using pre-initiation information, enabling better trial planning and safety monitoring.

Method: Used 42,112 clinical trials from ClinicalTrials.gov with structured/semi-structured data and unstructured protocol text. Created binary labels for elevated dosing error rates using adverse event reports and MedDRA terminology. Evaluated XGBoost (structured features), ClinicalModernBERT (textual data), and late-fusion model combining both modalities with post-hoc probability calibration.

Result: Late-fusion model achieved highest AUC-ROC (0.862). Calibrated outputs enabled robust stratification into risk categories, with proportion of high-error trials increasing monotonically across predicted risk groups.

Conclusion: Dosing error risk can be anticipated at trial level using pre-initiation information. Simple multimodal integration with probability calibration provides reliable, interpretable risk stratification for proactive clinical trial quality management.

Abstract: Objective: The objective of this study is to develop a machine learning (ML)-based framework for early risk stratification of clinical trials (CTs) according to their likelihood of exhibiting a high rate of dosing errors, using information available prior to trial initiation. Materials and Methods: We constructed a dataset from ClinicalTrials.gov comprising 42,112 CTs. Structured, semi-structured trial data, and unstructured protocol-related free-text data were extracted. CTs were assigned binary labels indicating elevated dosing error rate, derived from adverse event reports, MedDRA terminology, and Wilson confidence intervals. We evaluated an XGBoost model trained on structured features, a ClinicalModernBERT model using textual data, and a simple late-fusion model combining both modalities. Post-hoc probability calibration was applied to enable interpretable, trial-level risk stratification. Results: The late-fusion model achieved the highest AUC-ROC (0.862). Beyond discrimination, calibrated outputs enabled robust stratification of CTs into predefined risk categories. The proportion of trials labeled as having an excessively high dosing error rate increased monotonically across higher predicted risk groups and aligned with the corresponding predicted probability ranges. Discussion: These findings indicate that dosing error risk can be anticipated at the trial level using pre-initiation information. Probability calibration was essential for translating model outputs into reliable and interpretable risk categories, while simple multimodal integration yielded performance gains without requiring complex architectures. Conclusion: This study introduces a reproducible and scalable ML framework for early, trial-level risk stratification of CTs at risk of high dosing error rates, supporting proactive, risk-based quality management in clinical research.
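The Wilson score interval mentioned in the labeling step is a standard formula and can be computed directly. The labeling rule in the sketch is an assumption, since the abstract does not state one: a trial is marked as elevated when the lower Wilson bound of its observed dosing error rate exceeds a reference rate. The function names and the 2% reference rate are hypothetical.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion errors / n.
    z = 1.96 corresponds to roughly 95% confidence."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def is_elevated(errors, n, reference_rate=0.02):
    """Hypothetical rule: elevated if the lower bound exceeds the reference."""
    lower, _ = wilson_interval(errors, n)
    return lower > reference_rate


print(wilson_interval(5, 100))
print(is_elevated(5, 100), is_elevated(1, 20))
```

Using the lower bound rather than the raw rate keeps small trials with a handful of events from being labeled high-risk on thin evidence.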

Yan Zhao, Zhengxue Cheng, Junxuan Zhang, Dajiang Zhou, Qunshan Gu, Qi Wang, Li Song

Main category: cs.LG

TL;DR: OmniZip is a unified, lightweight lossless compressor for multiple data modalities including image, text, speech, tactile, database, and gene sequences, achieving superior compression efficiency across diverse data types while supporting real-time inference on edge devices.

Details

Motivation: Current learning-based lossless compressors are typically designed for single modalities, leading to redundant deployments in multi-modal settings. Multi-modal large language models offer a solution but are too complex for practical use, creating a need for a unified yet lightweight multi-modal compressor.

Method: OmniZip uses a lightweight backbone with three key components: 1) modality-unified tokenizer that reversibly transforms diverse data into tokens, 2) modality-routing context learning for flexible multi-modal context modeling, and 3) modality-routing feedforward design for enhanced nonlinear representation. A reparameterization training strategy boosts model capacity.

Result: OmniZip outperforms or matches state-of-the-art compressors across multiple modalities, achieving 42-62% higher compression efficiency than gzip on various datasets (CLIC-M, TouchandGo, enwik9, LibriSpeech, WikiSQL). It supports near real-time inference on resource-constrained devices, reaching ~1MB/s on MacBook CPUs and iPhone NPUs.

Conclusion: OmniZip successfully addresses the challenge of multi-modal compression by providing a unified, lightweight solution that achieves strong performance across diverse data types while maintaining practical efficiency for edge deployment.

Abstract: Lossless compression is essential for efficient data storage and transmission. Although learning-based lossless compressors achieve strong results, most of them are designed for a single modality, leading to redundant compressor deployments in multi-modal settings. Designing a unified multi-modal compressor is critical yet challenging, as different data types vary largely in format, dimension, and statistics. Multi-modal large language models offer a promising resolution but remain too complex for practical use. Thus, we propose OmniZip, a unified and lightweight lossless compressor for multi-modal data (like image, text, speech, tactile, database, and gene sequence). Built on a lightweight backbone, OmniZip incorporates three key components to enable efficient multi-modal lossless compression: a modality-unified tokenizer that reversibly transforms diverse data into tokens, a modality-routing context learning mechanism that enables flexible multi-modal context modeling, and a modality-routing feedforward design that further enhances the model’s nonlinear representation flexibility. A reparameterization training strategy is used to enhance model capacity. OmniZip outperforms or matches other state-of-the-art compressors on multiple modalities, achieving 42%, 57%, 62% and 42%, 53% higher compression efficiency than gzip on CLIC-M, TouchandGo, enwik9, LibriSpeech, and WikiSQL datasets, respectively. It also supports near real-time inference on resource-constrained edge devices, reaching about 1MB/s on MacBook CPUs and iPhone NPUs. Our code is released at https://github.com/adminasmi/OmniZip-CVPR2026.

[458] Reliable XAI Explanations in Sudden Cardiac Death Prediction for Chagas Cardiomyopathy

Vinícius P. Chagas, Luiz H. T. Viana, Mac M. da S. Carlos, João P. V. Madeiro, Roberto C. Pedrosa, Thiago Alves Rocha, Carlos H. L. Cavalcante

Main category: cs.LG

TL;DR: Logic-based explainable AI method applied to sudden cardiac death prediction in Chagas cardiomyopathy, achieving high accuracy with 100% explanation fidelity and superior consistency compared to heuristic methods.

Details

Motivation: Sudden cardiac death prediction in Chagas cardiomyopathy remains challenging, especially for non-high-risk patients. Current AI models lack transparency (black boxes) and heuristic explanation methods lack correctness guarantees, hindering clinical adoption.

Method: Applied a logic-based explainability method with correctness guarantees to an AI classifier for SCD prediction in CC. The method ensures explanation fidelity and was compared against state-of-the-art heuristic methods.

Result: The AI classifier achieved over 95% accuracy and recall. The logic-based explainability method demonstrated 100% explanation fidelity and showed superior consistency and robustness compared to heuristic methods.

Conclusion: Logic-based explainable AI enhances clinical trust, facilitates integration of AI tools into practice, and enables large-scale deployment in endemic regions where it’s most needed for SCD prediction in Chagas cardiomyopathy.

Abstract: Sudden cardiac death (SCD) is unpredictable, and its prediction in Chagas cardiomyopathy (CC) remains a significant challenge, especially in patients not classified as high risk. While AI and machine learning models improve risk stratification, their adoption is hindered by a lack of transparency, as they are often perceived as black boxes with unclear decision-making processes. Some approaches apply heuristic explanations without correctness guarantees, leading to mistakes in the decision-making process. To address this, we apply a logic-based explainability method with correctness guarantees to the problem of SCD prediction in CC. This explainability method, applied to an AI classifier with over 95% accuracy and recall, demonstrated strong predictive performance and 100% explanation fidelity. When compared to state-of-the-art heuristic methods, it showed superior consistency and robustness. This approach enhances clinical trust, facilitates the integration of AI-driven tools into practice, and promotes large-scale deployment, particularly in endemic regions where it is most needed.

[459] Manifold of Failure: Behavioral Attraction Basins in Language Models

Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto

Main category: cs.LG

TL;DR: A framework for systematically mapping failure manifolds in LLMs using quality diversity optimization to understand the continuous topology of vulnerability regions.

Details

Motivation: Prior work focuses on projecting adversarial examples back to safe regions, but comprehensive AI safety requires characterizing unsafe regions themselves to understand failure structures.

Method: Reframe vulnerability search as quality diversity problem using MAP-Elites algorithm with Alignment Deviation metric to map behavioral attraction basins across LLMs.

Result: MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, revealing different topological signatures: Llama-3-8B has universal vulnerability plateau, GPT-OSS-20B shows fragmented landscape, GPT-5-Mini demonstrates strong robustness.

Conclusion: The approach produces interpretable global maps of safety landscapes that existing attack methods cannot provide, shifting paradigm from finding discrete failures to understanding underlying failure structure.

Abstract: While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model’s behavior diverges most from its intended alignment. Across three LLMs: Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model’s safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
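To make the quality diversity framing concrete, here is a minimal, generic MAP-Elites sketch on a toy problem. It is not the paper's implementation: the fitness function stands in for Alignment Deviation, the two-dimensional behavior descriptor, the 10x10 grid, and the mutation scale are all hypothetical choices, and "coverage" is computed as the fraction of filled grid cells, which is the usual meaning in the quality diversity literature.

```python
import random


def map_elites(fitness, descriptor, dims, bins=10, iterations=2000, seed=0):
    """Toy MAP-Elites over vectors in [0, 1]^dims.

    fitness(x)    -> float quality score (higher is better).
    descriptor(x) -> tuple of floats in [0, 1], one per behavior dimension.
    Returns the archive as {cell index tuple: (fitness, solution)}.
    """
    rng = random.Random(seed)
    archive = {}

    def cell_of(x):
        # Discretize each behavior coordinate into one of `bins` cells.
        return tuple(min(int(b * bins), bins - 1) for b in descriptor(x))

    for i in range(iterations):
        if i < 100 or not archive:
            # Bootstrap phase: sample uniformly at random.
            x = [rng.random() for _ in range(dims)]
        else:
            # Pick a random elite and apply a clipped Gaussian mutation.
            parent = rng.choice(list(archive.values()))[1]
            x = [min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in parent]
        f = fitness(x)
        cell = cell_of(x)
        # Keep only the best solution (the elite) found so far in each cell.
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, x)
    return archive


def toy_fitness(x):
    # Hypothetical stand-in for a quality metric such as Alignment Deviation.
    return -sum((v - 0.5) ** 2 for v in x)


def toy_descriptor(x):
    # Hypothetical 2-D behavior descriptor: the first two coordinates.
    return (x[0], x[1])


archive = map_elites(toy_fitness, toy_descriptor, dims=3)
coverage = len(archive) / (10 * 10)  # fraction of grid cells holding an elite
```

The archive, rather than a single best solution, is the output: each filled cell records the highest-quality example found for one region of behavior space, which is what allows a global map instead of a list of isolated failures.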

[460] Global River Forecasting with a Topology-Informed AI Foundation Model

Hancheng Ren, Gang Zhao, Shuo Wang, Louise Slater, Dai Yamazaki, Shu Liu, Jingfang Fan, Shibo Cui, Ziming Yu, Shengyu Kang, Depeng Zuo, Dingzhi Peng, Zongxue Xu, Bo Pang

Main category: cs.LG

TL;DR: GraphRiverCast (GRC) is a topology-informed AI foundation model for simulating multivariate river hydrodynamics in global river systems, capable of operating without historical river states (ColdStart mode).

Details

Motivation: River systems are interconnected networks, but hydrology data scarcity restricts data-driven forecasting to isolated predictions. There's a need for systemic simulation that reduces reliance on river observations.

Method: GRC uses topological encoding to guide hydraulic connectivity and network-scale mass redistribution. It operates in ColdStart mode without historical states, employs physics-aligned neural operator architecture, and uses pre-training and fine-tuning strategies.

Result: In 7-day global pseudo-hindcasts, GRC-ColdStart achieves Nash-Sutcliffe Efficiency of ~0.82 without significant error accumulation. It outperforms physics-based and locally-trained AI baselines, extending superiority from gauged reaches to full river networks.

Conclusion: GRC establishes a collaborative paradigm bridging global hydrodynamic knowledge with local hydrological reality through topology encoding and physics-based pre-training, enabling rapid cross-scale adaptive simulation.

Abstract: River systems operate as inherently interconnected continuous networks, meaning river hydrodynamic simulation ought to be a systemic process. However, widespread hydrology data scarcity often restricts data-driven forecasting to isolated predictions. To achieve systemic simulation and reduce reliance on river observations, we present GraphRiverCast (GRC), a topology-informed AI foundation model designed to simulate multivariate river hydrodynamics in global river systems. GRC is capable of operating in a “ColdStart” mode, generating predictions without relying on historical river states for initialization. In 7-day global pseudo-hindcasts, GRC-ColdStart functions as a robust standalone simulator, achieving a Nash-Sutcliffe Efficiency (NSE) of approximately 0.82 without exhibiting the significant error accumulation typical of autoregressive paradigms. Ablation studies reveal that topological encoding serves as indispensable structural information in the absence of historical states, explicitly guiding hydraulic connectivity and network-scale mass redistribution to reconstruct flow dynamics. Furthermore, when adapted locally via a pre-training and fine-tuning strategy, GRC consistently outperforms physics-based and locally-trained AI baselines. Crucially, this superiority extends from gauged reaches to full river networks, underscoring the necessity of topology encoding and physics-based pre-training. Built on a physics-aligned neural operator architecture, GRC enables rapid and cross-scale adaptive simulation, establishing a collaborative paradigm bridging global hydrodynamic knowledge with local hydrological reality.
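For readers unfamiliar with the headline metric, the Nash-Sutcliffe Efficiency compares squared simulation error against the variance of the observations: 1 is a perfect match and 0 means the simulation is no better than predicting the observed mean. The sketch below uses the standard definition; the function name and the example series are illustrative and not taken from the paper.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    obs_variance = sum((o - mean_obs) ** 2 for o in observed)
    if obs_variance == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - squared_error / obs_variance


obs = [1.0, 2.0, 3.0, 4.0]
perfect = nash_sutcliffe_efficiency(obs, obs)          # identical series
mean_only = nash_sutcliffe_efficiency(obs, [2.5] * 4)  # always predicts the mean
```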

[461] When Should a Model Change Its Mind? An Energy-Based Theory and Regularizer for Concept Drift in Electrocardiogram (ECG) Signals

Timothy Oladunni, Blessing Ojeme, Kyndal Maclin, Clyde Baidoo

Main category: cs.LG

TL;DR: PECT introduces an energy-based framework for concept stability in dynamic signals, using energy-constrained representation learning to distinguish benign signal fluctuations from real concept drift.

Details

Motivation: Existing concept-drift frameworks are distributional and can't distinguish between harmless physiological signal variations (like amplitude/rate changes) and true concept drift, leading to unstable predictions in multimodal fusion settings.

Method: Proposes Physiologic Energy Conservation Theory (PECT) which posits that normalized latent displacement should scale proportionally with normalized signal energy change under virtual drift. Implements Energy-Constrained Representation Learning (ECRL) as a lightweight regularizer to penalize energy-inconsistent latent movement.

Result: In multimodal ECG experiments, clean accuracy was largely preserved (96.0% to 94.1%), perturbed accuracy improved substantially (72.6% to 85.5%), and fused representation drift decreased by over 45% in the strongest trimodal hybrid model.

Conclusion: PECT functions as an energy-drift law governing concept stability in continuous physiologic signals, providing a principled framework for distinguishing benign signal variations from real concept drift in multimodal settings.

Abstract: Models operating on dynamic physiologic signals must distinguish benign, label-preserving variability from true concept change. Existing concept-drift frameworks are largely distributional and provide no principled guidance on how much a model’s internal representation may move when the underlying signal undergoes physiologically plausible fluctuations in energy. As a result, deep models often misinterpret harmless changes in amplitude, rate, or morphology as concept drift, yielding unstable predictions, particularly in multimodal fusion settings. This study introduces Physiologic Energy Conservation Theory (PECT), an energy-based framework for concept stability in dynamic signals. PECT posits that under virtual drift, normalized latent displacement should scale proportionally with normalized signal energy change, while persistent violations of this proportionality indicate real concept drift. We operationalize this principle through Energy-Constrained Representation Learning (ECRL), a lightweight regularizer that penalizes energy-inconsistent latent movement without modifying encoder architectures or adding inference-time cost. Although PECT is formulated for dynamic signals in general, we instantiate and evaluate it on multimodal ECG across seven unimodal and hybrid models. Experiments show that in the strongest trimodal hybrid (1D+2D+Transformer), clean accuracy is largely preserved (96.0% to 94.1%), while perturbed accuracy improves substantially (72.6% to 85.5%) and fused representation drift decreases by over 45%. Similar trends are observed across all architectures, providing empirical evidence that PECT functions as an energy-drift law governing concept stability in continuous physiologic signals.
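One way to read the proportionality principle is as a penalty on the gap between relative latent displacement and relative signal energy change. The sketch below is only an illustration of that reading: the exact normalization, the proportionality constant `alpha`, and the squared-gap form are assumptions, since the abstract does not give the ECRL loss.

```python
import math


def signal_energy(x):
    # Discrete signal energy: sum of squared samples.
    return sum(v * v for v in x)


def energy_consistency_penalty(x, x_pert, z, z_pert, alpha=1.0, eps=1e-8):
    """Squared gap between relative latent movement and relative energy change.

    x, x_pert: original and perturbed signals (lists of floats).
    z, z_pert: their latent representations (lists of floats).
    alpha:     assumed proportionality constant (hypothetical parameter).
    """
    energy = signal_energy(x)
    rel_energy = abs(signal_energy(x_pert) - energy) / (energy + eps)
    z_norm = math.sqrt(sum(v * v for v in z))
    displacement = math.sqrt(sum((p - q) ** 2 for p, q in zip(z_pert, z)))
    rel_latent = displacement / (z_norm + eps)
    return (rel_latent - alpha * rel_energy) ** 2


# Doubling the amplitude quadruples the energy (relative change of 3).
consistent = energy_consistency_penalty([3.0, 4.0], [6.0, 8.0], [1.0, 0.0], [4.0, 0.0])
frozen = energy_consistency_penalty([3.0, 4.0], [6.0, 8.0], [1.0, 0.0], [1.0, 0.0])
```

Under these assumptions, a latent that moves in step with the energy change incurs almost no penalty, while a latent that ignores the change (or overreacts to it) is penalized.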

[462] UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach

Main category: cs.LG

TL;DR: UpSkill improves multi-attempt reasoning in LLMs by optimizing pass@k metrics through mutual information skill learning, enhancing response diversity without degrading single-attempt performance.

Details

Motivation: Standard RLVR approaches that optimize single-attempt accuracy can suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies in mathematical and programming reasoning tasks.

Method: UpSkill adapts Mutual Information Skill Learning (MISL) to LLMs for optimizing pass@k correctness, using a novel token-level mutual information reward implemented within Group Relative Policy Optimization (GRPO) to encourage trajectory specificity to latent skill variables.

Result: Experiments on GSM8K with Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B show UpSkill improves multi-attempt metrics on stronger base models, yielding ~3% mean gains in pass@k for both Qwen and Llama without degrading pass@1 performance.

Conclusion: UpSkill effectively enhances multi-attempt reasoning in LLMs by promoting response diversity through mutual information optimization, with both empirical and theoretical evidence linking pass@k improvements to the mutual information objective.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training time method that adapts Mutual Information Skill Learning (MISL) to LLMs for optimizing pass@k correctness. We propose a novel reward that we implement within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages trajectory specificity to z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
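Since pass@k is the metric being optimized, a short reminder of how it is commonly estimated may help. The sketch below implements the widely used unbiased estimator from the code-generation evaluation literature, which computes the probability that at least one of k samples drawn without replacement from n attempts is correct, given c correct attempts. Whether UpSkill uses exactly this estimator is not stated in the abstract.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled attempts of which c are correct.

    Returns 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n attempts contains no correct attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


single = pass_at_k(n=10, c=3, k=1)  # reduces to the plain accuracy c / n
pair = pass_at_k(n=4, c=1, k=2)     # 1 - C(3, 2) / C(4, 2)
```

For a fixed number of correct attempts, the value grows with k. This is why a method can raise pass@k by spreading probability mass over more distinct strategies while leaving pass@1 unchanged.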

[463] Learning Rewards, Not Labels: Adversarial Inverse Reinforcement Learning for Machinery Fault Detection

Dhiraj Neupane, Richard Dazeley, Mohamed Reda Bouadjenek, Sunil Aryal

Main category: cs.LG

TL;DR: Formulates machinery fault detection as offline inverse reinforcement learning problem, using adversarial IRL to learn reward dynamics from healthy sequences for anomaly scoring without manual reward engineering or fault labels.

Details

Motivation: Existing RL-based machinery fault detection approaches don't fully exploit RL's sequential decision-making strengths, often treating it as simple contextual bandits rather than leveraging temporal structure.

Method: Uses adversarial inverse reinforcement learning to train a discriminator that distinguishes between normal (expert) and policy-generated transitions, with the learned reward serving as anomaly score for fault detection.

Result: Evaluated on three run-to-failure benchmark datasets (HUMS2023, IMS, XJTU-SY), model consistently assigns low anomaly scores to normal samples and high scores to faulty ones, enabling early and robust fault detection.

Conclusion: Aligns RL’s sequential reasoning with fault detection’s temporal structure, opening path toward RL-based diagnostics in data-driven industrial settings without need for manual reward engineering or fault labels.

Abstract: Reinforcement learning (RL) offers significant promise for machinery fault detection (MFD). However, most existing RL-based MFD approaches do not fully exploit RL’s sequential decision-making strengths, often treating MFD as a simple guessing game (Contextual Bandits). To bridge this gap, we formulate MFD as an offline inverse reinforcement learning problem, where the agent learns the reward dynamics directly from healthy operational sequences, thereby bypassing the need for manual reward engineering and fault labels. Our framework employs Adversarial Inverse Reinforcement Learning to train a discriminator that distinguishes between normal (expert) and policy-generated transitions. The discriminator’s learned reward serves as an anomaly score, indicating deviations from normal operating behaviour. When evaluated on three run-to-failure benchmark datasets (HUMS2023, IMS, and XJTU-SY), the model consistently assigns low anomaly scores to normal samples and high scores to faulty ones, enabling early and robust fault detection. By aligning RL’s sequential reasoning with MFD’s temporal structure, this work opens a path toward RL-based diagnostics in data-driven industrial settings.
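As a rough illustration of how a discriminator yields a reward, the sketch below uses the standard Adversarial IRL discriminator form, D = exp(f) / (exp(f) + pi), for which the recovered reward log D - log(1 - D) simplifies to f - log pi. The mapping from reward to anomaly score (here, the negated reward) is an assumption made for illustration; the abstract does not specify it, and `f_value` and `log_pi` are placeholder inputs rather than outputs of a trained model.

```python
import math


def airl_discriminator(f_value, log_pi):
    """D = exp(f) / (exp(f) + pi), written as a sigmoid of f - log(pi)."""
    return 1.0 / (1.0 + math.exp(-(f_value - log_pi)))


def airl_reward(f_value, log_pi):
    """Reward recovered from the discriminator: log D - log(1 - D)."""
    d = airl_discriminator(f_value, log_pi)
    return math.log(d) - math.log(1.0 - d)


def anomaly_score(f_value, log_pi):
    # Assumed convention: low learned reward means high anomaly.
    return -airl_reward(f_value, log_pi)


reward = airl_reward(0.7, math.log(0.2))
```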

[464] AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety-Critical Cloud Forecasts

Zijian Zhu, Qiusheng Huang, Anboyu Guo, Xiaohui Zhong, Hao Li

Main category: cs.LG

TL;DR: AviaSafe: A hierarchical physics-informed neural network for global forecasting of four cloud hydrometeor species up to 7 days, addressing aviation safety needs by distinguishing between cloud microphysical species critical for engine icing risk.

Details

Motivation: Current AI weather models predict conventional atmospheric variables but cannot distinguish between cloud microphysical species critical for aviation safety, particularly engine icing risk which depends on distinguishing between ice and liquid water.

Method: Hierarchical physics-informed neural forecaster with two-stage architecture: 1) predicts cloud spatial distribution using masked attention, 2) quantifies species concentrations within identified regions. Integrates Icing Condition (IC) index as physics-based constraint to identify regions where supercooled water fuels explosive ice crystal growth.

Result: Achieves lower RMSE for cloud species compared to baseline models and outperforms operational numerical models on certain key variables at 7-day lead times when trained on ERA5 reanalysis data.

Conclusion: The model enables new applications in aviation route optimization by forecasting individual cloud species, allowing distinction between ice and liquid water that determines engine icing risk.

Abstract: Current AI weather forecasting models predict conventional atmospheric variables but cannot distinguish between cloud microphysical species critical for aviation safety. We introduce AviaSafe, a hierarchical, physics-informed neural forecaster that produces global, six-hourly predictions of these four hydrometeor species for lead times up to 7 days. Our approach addresses the unique challenges of cloud prediction: extreme sparsity, discontinuous distributions, and complex microphysical interactions between species. We integrate the Icing Condition (IC) index from aviation meteorology as a physics-based constraint that identifies regions where supercooled water fuels explosive ice crystal growth. The model employs a hierarchical architecture that first predicts cloud spatial distribution through masked attention, then quantifies species concentrations within identified regions. Training on ERA5 reanalysis data, our model achieves lower RMSE for cloud species compared to baseline and outperforms operational numerical models on certain key variables at 7-day lead times. The ability to forecast individual cloud species enables new applications in aviation route optimization where distinguishing between ice and liquid water determines engine icing risk.

[465] Training Agents to Self-Report Misbehavior

Bruce W. Lee, Chen Yueh-Han, Tomek Korbak

Main category: cs.LG

TL;DR: Self-incrimination training trains AI agents to report their own deceptive behavior, reducing undetected harmful actions while preserving general capabilities.

Details

Motivation: Frontier AI agents may pursue hidden goals while concealing their behavior from oversight. Traditional alignment training may fail, so alternative approaches are needed to detect and reduce covert misbehavior.

Method: Train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively. Evaluate their ability to cause harm undetected in out-of-distribution environments, comparing against matched-capability monitors and alignment baselines.

Result: Self-incrimination significantly reduces undetected successful attack rates, outperforms monitors and alignment baselines, preserves instruction hierarchy, incurs minimal safety tax on general capabilities, and generalizes across tasks and adversarial prompts.

Conclusion: Self-incrimination offers a viable path for reducing frontier misalignment risk without assuming misbehavior can be prevented or reliably classified externally.

Abstract: Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinforcing the correct goals, but alignment may not always succeed and can lead to unwanted side effects. We propose self-incrimination training, which instead trains agents to produce a visible signal when they covertly misbehave. We train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively and measure their ability to cause harm undetected in out-of-distribution environments. Self-incrimination significantly reduces the undetected successful attack rate, outperforming matched-capability monitors and alignment baselines while preserving instruction hierarchy and incurring minimal safety tax on general capabilities. Unlike blackbox monitoring, self-incrimination performance is consistent across tasks regardless of how suspicious the misbehavior appears externally. The trained behavior persists under adversarial prompt optimization and generalizes to settings where agents pursue misaligned goals themselves rather than being instructed to misbehave. Our results suggest self-incrimination offers a viable path for reducing frontier misalignment risk, one that neither assumes misbehavior can be prevented nor that it can be reliably classified from the outside.

[466] A 1/R Law for Kurtosis Contrast in Balanced Mixtures

Yuda Bi, Wenjun Xiao, Linhao Bai, Vince D Calhoun

Main category: cs.LG

TL;DR: Kurtosis-based ICA fails in wide, balanced mixtures due to contrast decay scaling as 1/R, but purification (selecting sign-consistent sources) restores contrast independent of R.

Details

Motivation: Kurtosis-based Independent Component Analysis (ICA) is known to weaken in wide, balanced mixtures, but the exact scaling laws and limitations haven't been fully characterized. The paper aims to understand the fundamental limitations and provide practical solutions.

Method: The authors prove a sharp redundancy law showing population excess kurtosis decays as O(κ_max/R_eff). They establish impossibility results under finite-moment conditions and propose a purification method that selects m ≪ R sign-consistent sources to restore contrast independent of R.

Result: Theoretical results show: 1) kurtosis contrast decays as 1/R in wide mixtures, 2) surpassing O(1/√T) estimation scale requires R ≲ κ_max√T, 3) purification restores Ω(1/m) contrast independent of R. Synthetic experiments validate all predictions.

Conclusion: Kurtosis-based ICA has fundamental limitations in wide, balanced mixtures with contrast decaying as 1/R, but purification provides a practical solution by selecting sign-consistent sources to restore contrast independent of mixture width.

Abstract: Kurtosis-based Independent Component Analysis (ICA) weakens in wide, balanced mixtures. We prove a sharp redundancy law: for a standardized projection with effective width $R_{\mathrm{eff}}$ (participation ratio), the population excess kurtosis obeys $|\kappa(y)| = O(\kappa_{\max}/R_{\mathrm{eff}})$, yielding the order-tight $O(c_b \kappa_{\max}/R)$ under balance (typically $c_b = O(\log R)$). As an impossibility screen, under standard finite-moment conditions for sample kurtosis estimation, surpassing the $O(1/\sqrt{T})$ estimation scale requires $R \lesssim \kappa_{\max}\sqrt{T}$. We also show that purification, i.e. selecting $m \ll R$ sign-consistent sources, restores $R$-independent contrast $\Omega(1/m)$, with a simple data-driven heuristic. Synthetic experiments validate the predicted decay, the $\sqrt{T}$ crossover, and contrast recovery.
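The decay can be checked with a few lines of arithmetic. For independent, unit-variance sources, the fourth cumulant is additive and scales with the fourth power of the mixing weight, so a standardized projection has excess kurtosis equal to the sum of a_i^4 times kappa_i, with the squared weights normalized to sum to 1. The sketch below evaluates this population formula directly. The function names are illustrative, and the equal-weight examples cover only the simplest balanced case, not the paper's general setting.

```python
def mixture_excess_kurtosis(weights, source_kurtoses):
    """Excess kurtosis of a standardized mix of independent unit-variance sources."""
    norm_sq = sum(w * w for w in weights)
    # After standardization, a_i^2 = w_i^2 / sum_j w_j^2.
    return sum((w * w / norm_sq) ** 2 * k for w, k in zip(weights, source_kurtoses))


def participation_ratio(weights):
    """Effective width R_eff = (sum w^2)^2 / sum w^4."""
    s2 = sum(w * w for w in weights)
    s4 = sum(w ** 4 for w in weights)
    return s2 * s2 / s4


R, kappa = 100, 3.0
wide = mixture_excess_kurtosis([1.0] * R, [kappa] * R)     # kappa / R
purified = mixture_excess_kurtosis([1.0] * 5, [kappa] * 5)  # kappa / m with m = 5
mixed_signs = mixture_excess_kurtosis([1.0] * 4, [3.0, -1.0, 3.0, -1.0])
```

Because the normalized fourth powers sum to 1/R_eff, the absolute value of the result can never exceed kappa_max / R_eff, which matches the form of the redundancy law stated in the abstract. Restricting to a handful of same-sign sources gives a contrast that depends on m rather than R.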

[467] Structure and Redundancy in Large Language Models: A Spectral Study via Random Matrix Theory

Davide Ettori

Main category: cs.LG

TL;DR: A unified spectral geometry and random matrix theory framework for improving reliability (detecting hallucinations/OOD) and efficiency (compression) in deep learning models through eigenvalue analysis of hidden activations.

Details

Motivation: Address reliability issues (hallucinations, fragile generalization) and efficiency challenges (computational/energy demands) in large-scale deep learning models by providing interpretable insights into their internal behavior.

Method: Uses spectral statistics of hidden activations to analyze model behavior. EigenTrack transforms streaming activations into spectral descriptors (entropy, variance, deviations from Marchenko-Pastur) with lightweight recurrent classifiers for real-time hallucination/OOD detection. RMT-KD uses outlier eigenvalues as task-relevant information carriers for progressive compression via iterative self-distillation.

Result: EigenTrack enables early detection of reliability failures before they appear in outputs with interpretable insights. RMT-KD produces significantly more compact, energy-efficient models while preserving accuracy and hardware-friendly structure.

Conclusion: Spectral geometry and random matrix theory provide a unified framework for addressing both reliability and efficiency challenges in deep learning through eigenvalue analysis of model activations.

Abstract: This thesis addresses two persistent and closely related challenges in modern deep learning, reliability and efficiency, through a unified framework grounded in Spectral Geometry and Random Matrix Theory (RMT). As deep networks and large language models continue to scale, their internal behavior becomes increasingly opaque, leading to hallucinations, fragile generalization under distribution shift, and growing computational and energy demands. By analyzing the eigenvalue dynamics of hidden activations across layers and inputs, this work shows that spectral statistics provide a compact, stable, and interpretable lens on model behavior, capable of separating structured, causal representations from noise-dominated variability. Within this framework, the first contribution, EigenTrack, introduces a real-time method for detecting hallucinations and out-of-distribution behavior in large language and vision-language models. EigenTrack transforms streaming activations into spectral descriptors such as entropy, variance, and deviations from the Marchenko-Pastur baseline, and models their temporal evolution using lightweight recurrent classifiers, enabling early detection of reliability failures before they appear in model outputs while offering interpretable insight into representation dynamics. The second contribution, RMT-KD, presents a principled approach to compressing deep networks via random matrix theoretic knowledge distillation. By interpreting outlier eigenvalues in activation spectra as carriers of task-relevant information, RMT-KD progressively projects networks onto lower-dimensional subspaces through iterative self-distillation, yielding significantly more compact and energy-efficient models while preserving accuracy and dense, hardware-friendly structure.
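To illustrate what "deviations from the Marchenko-Pastur baseline" can mean in practice, the sketch below computes the classical Marchenko-Pastur bulk edges for a sample covariance matrix and derives three simple descriptors from a list of eigenvalues. The descriptor names, the unit noise variance default, and the choice of outlier statistics are assumptions for illustration, not EigenTrack's or RMT-KD's actual features.

```python
import math


def marchenko_pastur_edges(n_samples, n_features, sigma2=1.0):
    """Bulk edges sigma^2 * (1 -/+ sqrt(gamma))^2 with gamma = n_features / n_samples."""
    gamma = n_features / n_samples
    lower = sigma2 * (1.0 - math.sqrt(gamma)) ** 2
    upper = sigma2 * (1.0 + math.sqrt(gamma)) ** 2
    return lower, upper


def spectral_descriptors(eigenvalues, n_samples, n_features, sigma2=1.0):
    """Entropy of the normalized spectrum plus outlier statistics beyond the MP edge."""
    _, upper = marchenko_pastur_edges(n_samples, n_features, sigma2)
    total = sum(eigenvalues)
    probs = [ev / total for ev in eigenvalues if ev > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    outliers = [ev for ev in eigenvalues if ev > upper]
    return {
        "entropy": entropy,
        "n_outliers": len(outliers),
        "outlier_mass": sum(outliers) / total,
    }


edges = marchenko_pastur_edges(n_samples=400, n_features=100)
desc = spectral_descriptors([5.0, 1.0, 1.0, 1.0], n_samples=400, n_features=100)
```

Eigenvalues above the upper edge are the ones a pure-noise matrix of the same shape would not be expected to produce, which is the intuition behind treating them as carriers of structure.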

[468] Learning geometry-dependent lead-field operators for forward ECG modeling

Arsenii Dokuchaev, Francesca Bonizzoni, Stefano Pagani, Francesco Regazzoni, Simone Pezzuto

Main category: cs.LG

TL;DR: A shape-informed surrogate model for ECG forward simulations that combines geometry encoding with neural networks to predict lead-field gradients, enabling high-fidelity ECG simulations with low data requirements and computational efficiency.

Details

Motivation: Current ECG computational models face challenges: achieving high anatomical fidelity in torso representation is difficult in clinical practice (imaging often focuses only on the heart), and computational cost scales linearly with electrode count, limiting high-density recording applications. No existing approach simultaneously achieves high anatomical fidelity, low data requirements, and computational efficiency.

Method: Proposes a shape-informed surrogate model with two components: 1) a geometry-encoding module that maps anatomical shapes into a low-dimensional latent space, and 2) a geometry-conditioned neural surrogate that predicts lead-field gradients from spatial coordinates, electrode positions, and latent codes. This serves as a drop-in replacement for full-order models in forward ECG simulations.

Result: Achieves high accuracy in approximating lead fields (mean angular error 5° within torso, accurate inside heart) and highly accurate ECG simulations (relative mean squared error <2.5%). Consistently outperforms widely used pseudo lead-field approximation while preserving negligible inference cost.

Conclusion: The method enables high-fidelity ECG simulations without requiring fully detailed torso segmentation, making it deployable in data-limited clinical settings while maintaining computational efficiency and anatomical accuracy.

Abstract: Modern forward electrocardiogram (ECG) computational models rely on an accurate representation of the torso domain. The lead-field method enables fast ECG simulations while preserving full geometric fidelity. Achieving high anatomical accuracy in torso representation is, however, challenging in clinical practice, as imaging protocols are typically focused on the heart and often do not include the entire torso. In addition, the computational cost of the lead-field method scales linearly with the number of electrodes, limiting its applicability in high-density recording settings. To date, no existing approach simultaneously achieves high anatomical fidelity, low data requirements and computational efficiency. In this work, we propose a shape-informed surrogate model of the lead-field operator that serves as a drop-in replacement for the full-order model in forward ECG simulations. The proposed framework consists of two components: a geometry-encoding module that maps anatomical shapes into a low-dimensional latent space, and a geometry-conditioned neural surrogate that predicts lead-field gradients from spatial coordinates, electrode positions and latent codes. The proposed method achieves high accuracy in approximating lead fields both within the torso (mean angular error 5°) and inside the heart, resulting in highly accurate ECG simulations (relative mean squared error <2.5%). The surrogate consistently outperforms the widely used pseudo lead-field approximation while preserving negligible inference cost. Owing to its compact latent representation, the method does not require a fully detailed torso segmentation and can therefore be deployed in data-limited settings while preserving high-fidelity ECG simulations.
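The angular error quoted above measures how well predicted gradient directions align with reference directions, independent of magnitude. A minimal sketch of one plausible way to compute it follows. How the paper averages (over mesh points, electrodes, or geometries) is not stated in the abstract, so the plain mean over vector pairs is an assumption.

```python
import math


def angular_error_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def mean_angular_error_deg(predicted, reference):
    """Average angular error over paired lists of vectors."""
    errors = [angular_error_deg(p, r) for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors)


pred = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ref = [(2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
mean_err = mean_angular_error_deg(pred, ref)  # one aligned pair, one orthogonal pair
```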

[469] Disentangling Shared and Target-Enriched Topics via Background-Contrastive Non-negative Matrix Factorization

Yixuan Li, Archer Y. Yang, Yue Li

Main category: cs.LG

TL;DR: Background Contrastive Non-negative Matrix Factorization extracts target-specific biological signals by jointly factorizing target and background datasets with shared bases under a contrastive objective that suppresses background variation.

Details

Motivation: Biological signals in high-dimensional data are often masked by dominant shared variation (baseline structure or technical effects), preventing standard dimensionality reduction methods from resolving condition-specific structure. Existing background correction methods are either unscalable or not interpretable.

Method: Introduces background contrastive Non-negative Matrix Factorization, which jointly factorizes a target dataset and a matched background using shared non-negative bases under a contrastive objective that suppresses background-expressed structure. Uses an efficient multiplicative update algorithm based on matrix multiplication, scalable via minibatch training on GPU hardware.

Result: Across simulations and diverse biological datasets, the model reveals signals obscured by conventional methods, including disease-associated programs in postmortem depressive brain single-cell RNA-seq, genotype-linked protein expression patterns in mice, treatment-specific transcriptional changes in leukemia, and TP53-dependent drug responses in cancer cell lines.

Conclusion: The method successfully extracts interpretable, target-specific biological signals by suppressing confounding background variation, offering a scalable and interpretable solution for high-dimensional biological data analysis.

Abstract: Biological signals of interest in high-dimensional data are often masked by dominant variation shared across conditions. This variation, arising from baseline biological structure or technical effects, can prevent standard dimensionality reduction methods from resolving condition-specific structure. The challenge is that these confounding topics are often unknown and mixed with biological signals. Existing background correction methods are either unscalable to high dimensions or not interpretable. We introduce background contrastive Non-negative Matrix Factorization, which extracts target-enriched latent topics by jointly factorizing a target dataset and a matched background using shared non-negative bases under a contrastive objective that suppresses background-expressed structure. This approach yields non-negative components that are directly interpretable at the feature level, and explicitly isolates target-specific variation. The model is learned by an efficient multiplicative update algorithm via matrix multiplication such that it is highly efficient on GPU hardware and scalable to big data via minibatch training akin to deep learning approaches. Across simulations and diverse biological datasets, the model reveals signals obscured by conventional methods, including disease-associated programs in postmortem depressive brain single-cell RNA-seq, genotype-linked protein expression patterns in mice, treatment-specific transcriptional changes in leukemia, and TP53-dependent drug responses in cancer cell lines.

[470] Predicting Multi-Drug Resistance in Bacterial Isolates Through Performance Comparison and LIME-based Interpretation of Classification Models

Santanam Wishal, Riad Sahara

Main category: cs.LG

TL;DR: Interpretable ML framework predicts Multi-Drug Resistance in bacterial isolates using clinical features and antibiotic susceptibility patterns, with XGBoost/LightGBM achieving best performance and LIME providing clinical explanations.

Details

Motivation: Antimicrobial Resistance, especially Multi-Drug Resistance (MDR), poses critical challenges for clinical decision-making due to limited treatment options and delays in conventional susceptibility testing, necessitating faster, more interpretable prediction methods.

Method: Proposed interpretable ML framework using five classification models (Logistic Regression, Random Forest, AdaBoost, XGBoost, LightGBM) trained on 9,714 isolates with resistance encoded at antibiotic family level, evaluated using accuracy, F1-score, AUC-ROC, MCC, and enhanced with LIME for local interpretability.

Result: Ensemble models (XGBoost and LightGBM) demonstrated superior predictive capability across all metrics; LIME identified resistance to quinolones, Co-trimoxazole, Colistin, aminoglycosides, and Furanes as strongest contributors to MDR predictions, aligning with known biological mechanisms.

Conclusion: Combining high-performing models with local interpretability provides both accuracy and actionable insights for antimicrobial stewardship, supporting earlier MDR identification and enhancing trust in ML-assisted clinical decision support.

Abstract: The rise of Antimicrobial Resistance, particularly Multi-Drug Resistance (MDR), presents a critical challenge for clinical decision-making due to limited treatment options and delays in conventional susceptibility testing. This study proposes an interpretable machine learning framework to predict MDR in bacterial isolates using clinical features and antibiotic susceptibility patterns. Five classification models were evaluated, including Logistic Regression, Random Forest, AdaBoost, XGBoost, and LightGBM. The models were trained on a curated dataset of 9,714 isolates, with resistance encoded at the antibiotic family level to capture cross-class resistance patterns consistent with MDR definitions. Performance assessment included accuracy, F1-score, AUC-ROC, and Matthews Correlation Coefficient. Ensemble models, particularly XGBoost and LightGBM, demonstrated superior predictive capability across all metrics. To address the clinical transparency gap, Local Interpretable Model-agnostic Explanations (LIME) was applied to generate instance-level explanations. LIME identified resistance to quinolones, Co-trimoxazole, Colistin, aminoglycosides, and Furanes as the strongest contributors to MDR predictions, aligning with known biological mechanisms. The results show that combining high-performing models with local interpretability provides both accuracy and actionable insights for antimicrobial stewardship. This framework supports earlier MDR identification and enhances trust in machine learning-assisted clinical decision support.
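
Of the four metrics listed, the Matthews Correlation Coefficient is probably the least familiar, so a minimal sketch of its standard definition from a binary confusion matrix may help. The function name and the zero-denominator convention are choices made here for illustration and are not taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: return 0 when a whole row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


perfect = matthews_corrcoef(tp=5, tn=5, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=5, fn=5)
always_positive = matthews_corrcoef(tp=5, tn=0, fp=5, fn=0)
```

Unlike accuracy, the score stays at zero for a classifier that always predicts the majority class, which is why it is often reported alongside accuracy on imbalanced resistance data.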

Syed Omer Shah, Mohammed Maqsood Ahmed, Danish Mohiuddin Mohammed, Shahnawaz Alam, Mohd Vahaj ur Rahman

Main category: cs.LG

TL;DR: MolFM-Lite is a multimodal molecular property prediction model that fuses 1D SELFIES sequences, 2D molecular graphs, and 3D conformer ensembles using cross-attention, with experimental context conditioning via FiLM.

Details

Motivation: Most molecular property prediction models use single representations and treat molecular geometry as static, missing the complementary information across different molecular representations and the dynamic nature of molecular conformations.

Method: Joint encoding of three modalities (1D SELFIES, 2D graphs, 3D conformer ensembles) via cross-attention fusion; conformer ensemble attention with learnable attention + Boltzmann-weighted priors; Feature-wise Linear Modulation (FiLM) for experimental context conditioning; pre-training on ZINC250K with cross-modal contrastive and masked-atom objectives.

Result: Tri-modal fusion provides 7-11% AUC improvement over single-modality baselines; conformer ensembles add ~2% over single-conformer variants; comprehensive ablation studies confirm each component contributes independently; effective pre-training at modest compute cost.

Conclusion: Multimodal fusion of molecular representations significantly improves property prediction, with conformer ensembles capturing thermodynamic shape distributions and cross-modal attention enabling complementary information sharing.

Abstract: Most machine learning models for molecular property prediction rely on a single molecular representation (either a sequence, a graph, or a 3D structure) and treat molecular geometry as static. We present MolFM-Lite, a multi-modal model that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D) through cross-attention fusion, while conditioning predictions on experimental context via Feature-wise Linear Modulation (FiLM). Our main methodological contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann-weighted priors over multiple RDKit-generated conformers, capturing the thermodynamic distribution of molecular shapes; and (2) a cross-modal fusion layer where each modality can attend to others, enabling complementary information sharing. We evaluate on four MoleculeNet scaffold-split benchmarks using our model’s own splits, and report all baselines re-evaluated under the same protocol. Comprehensive ablation studies across all four datasets confirm that each architectural component contributes independently, with tri-modal fusion providing 7-11% AUC improvement over single-modality baselines and conformer ensembles adding approximately 2% over single-conformer variants. Pre-training on ZINC250K (~250K molecules) using cross-modal contrastive and masked-atom objectives enables effective weight initialization at modest compute cost. We release all code, trained models, and data splits to support reproducibility.
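
The Boltzmann-weighted prior over conformers can be illustrated with a short sketch. It covers only the textbook Boltzmann weighting of conformer energies and a weighted pooling of per-conformer embeddings; the function names, the temperature, and the toy energies are assumptions, and the learnable attention term the paper combines with this prior is not modeled here.

```python
import math

GAS_CONSTANT_KCAL = 0.0019872041  # R in kcal/(mol*K)


def boltzmann_weights(energies_kcal, temperature_k=298.15):
    """Normalized Boltzmann weights for conformer energies given in kcal/mol."""
    rt = GAS_CONSTANT_KCAL * temperature_k
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    unnormalized = [math.exp(-(e - e_min) / rt) for e in energies_kcal]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def pool_conformers(embeddings, weights):
    """Weighted average of per-conformer embedding vectors."""
    dim = len(embeddings[0])
    return [sum(w * emb[i] for w, emb in zip(weights, embeddings))
            for i in range(dim)]


# Hypothetical relative energies for three conformers of one molecule.
weights = boltzmann_weights([0.0, 0.5, 2.0])
pooled = pool_conformers([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], weights)
```

Lower-energy conformers receive larger weights, so the pooled embedding is dominated by the shapes the molecule is most likely to adopt at the chosen temperature.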

[472] A Learning-Based Hybrid Decision Framework for Matching Systems with User Departure Detection

Ruiqi Zhou, Donghao Zhu, Houcai Shen

Main category: cs.LG

TL;DR: A learning-based hybrid framework for dynamic matching markets that adaptively combines immediate and delayed matching to balance efficiency, waiting times, and congestion.

Details

Motivation: Delayed matching improves market efficiency but imposes costs like longer waiting times and increased congestion. Fixed matching policies are inflexible in dynamic environments where participant behavior varies.

Method: Proposes a learning-based hybrid framework that continuously collects data on user departures, estimates departure distribution via regression, and uses a decision threshold to determine whether to delay matching in each period.

Result: The framework substantially reduces waiting times and congestion while sacrificing only limited matching efficiency, dynamically interpolating between greedy and patient policies.

Conclusion: The hybrid framework offers a robust, adaptive alternative to static matching mechanisms, enabling flexible performance tuning between immediate and delayed matching strategies.

Abstract: In matching markets such as kidney exchanges and freight exchanges, delayed matching has been shown to improve overall market efficiency. The benefits of delay are highly sensitive to participants’ sojourn times and departure behavior, and delaying matches can impose significant costs, including longer waiting times and increased market congestion. These competing effects make fixed matching policies inherently inflexible in dynamic environments. We propose a learning-based Hybrid framework that adaptively combines immediate and delayed matching. The framework continuously collects data on user departures over time, estimates the underlying departure distribution via regression, and determines whether to delay matching in the subsequent period based on a decision threshold that governs the system’s tolerance for matching efficiency loss. The proposed framework can substantially reduce waiting times and congestion while sacrificing only a limited amount of matching efficiency. By dynamically adjusting its matching strategy, the Hybrid framework enables system performance to flexibly interpolate between purely greedy and purely patient policies, offering a robust and adaptive alternative to static matching mechanisms.

[473] Revisiting Chebyshev Polynomial and Anisotropic RBF Models for Tabular Regression

Luciano Gerber, Huw Lloyd

Main category: cs.LG

TL;DR: Smooth-basis models (Chebyshev polynomials, RBF networks) benchmarked against tree ensembles and transformers on 55 regression datasets, finding transformers most accurate but smooth models competitive on CPU with tighter generalization gaps.

Details

Motivation: Smooth-basis models are well-established in numerical analysis for their continuously differentiable prediction surfaces, making them suitable for surrogate optimization and sensitivity analysis. However, they are rarely used in tabular regression where tree ensembles dominate. The paper investigates whether smooth models can compete with tree ensembles in tabular regression tasks.

Method: Developed three smooth-basis models: 1) anisotropic RBF network with data-driven center placement and gradient-based width optimization, 2) ridge-regularized Chebyshev polynomial regressor, and 3) smooth-tree hybrid (Chebyshev model tree). All released as scikit-learn-compatible packages. Benchmarked these against tree ensembles, a pre-trained transformer, and standard baselines across 55 regression datasets organized by application domain, evaluating both accuracy and generalization behavior.

Result: Transformers ranked first on accuracy across most datasets, but have GPU dependence, inference latency, and dataset-size limitations. Among CPU-viable models, smooth models and tree ensembles are statistically tied on accuracy, but smooth models tend to exhibit tighter generalization gaps (better generalization behavior).

Conclusion: Smooth-basis models should be routinely included in candidate pools for regression tasks, particularly when downstream applications benefit from tighter generalization and gradually varying predictions. They offer competitive performance to tree ensembles while providing better generalization properties.

Abstract: Smooth-basis models such as Chebyshev polynomial regressors and radial basis function (RBF) networks are well established in numerical analysis. Their continuously differentiable prediction surfaces suit surrogate optimisation, sensitivity analysis, and other settings where the response varies gradually with inputs. Despite these properties, smooth models seldom appear in tabular regression, where tree ensembles dominate. We ask whether they can compete, benchmarking models across 55 regression datasets organised by application domain. We develop an anisotropic RBF network with data-driven centre placement and gradient-based width optimisation, a ridge-regularised Chebyshev polynomial regressor, and a smooth-tree hybrid (Chebyshev model tree); all three are released as scikit-learn-compatible packages. We benchmark these against tree ensembles, a pre-trained transformer, and standard baselines, evaluating accuracy alongside generalisation behaviour. The transformer ranks first on accuracy across a majority of datasets, but its GPU dependence, inference latency, and dataset-size limits constrain deployment in the CPU-based settings common across applied science and industry. Among CPU-viable models, smooth models and tree ensembles are statistically tied on accuracy, but the former tend to exhibit tighter generalisation gaps. We recommend routinely including smooth-basis models in the candidate pool, particularly when downstream use benefits from tighter generalisation and gradually varying predictions.
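
As a rough illustration of what a Chebyshev polynomial regressor works with, the sketch below builds the Chebyshev basis for a single feature using the standard three-term recurrence. The helper names are hypothetical and this is not the authors' released package; a ridge fit would then be run on the expanded features.

```python
def scale_to_unit_interval(value, lo, hi):
    """Map a value from [lo, hi] to [-1, 1], the natural Chebyshev domain."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def chebyshev_features(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)."""
    if degree < 0:
        raise ValueError("degree must be non-negative")
    feats = [1.0]
    if degree >= 1:
        feats.append(x)
    for _ in range(2, degree + 1):
        feats.append(2.0 * x * feats[-1] - feats[-2])
    return feats


x_scaled = scale_to_unit_interval(7.5, lo=0.0, hi=10.0)
features = chebyshev_features(x_scaled, degree=3)
```

Because each basis function is a polynomial, any linear combination of them is continuously differentiable, which is the property the abstract highlights for surrogate optimisation and sensitivity analysis.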

[474] Calibrated Test-Time Guidance for Bayesian Inference

Daniel Geyfman, Felix Draxler, Jan Groeneveld, Hyunsoo Lee, Theofanis Karaletsos, Stephan Mandt

Main category: cs.LG

TL;DR: Proposes calibrated Bayesian posterior sampling for diffusion models instead of reward maximization, outperforming previous test-time guidance methods on inference tasks.

Details

Motivation: Existing test-time guidance methods for diffusion models focus on maximizing reward rather than sampling from the true Bayesian posterior, leading to miscalibrated inference and incorrect posterior distributions.

Method: Identifies structural approximations causing failure in existing methods, then proposes consistent alternative estimators that enable calibrated sampling from the Bayesian posterior distribution.

Result: Significantly outperforms previous methods on Bayesian inference tasks and matches state-of-the-art in black hole image reconstruction.

Conclusion: The proposed calibrated Bayesian posterior sampling approach addresses fundamental limitations of existing test-time guidance methods, providing more accurate inference for diffusion models.

Abstract: Test-time guidance is a widely used mechanism for steering pretrained diffusion models toward outcomes specified by a reward function. Existing approaches, however, focus on maximizing reward rather than sampling from the true Bayesian posterior, leading to miscalibrated inference. In this work, we show that common test-time guidance methods do not recover the correct posterior distribution and identify the structural approximations responsible for this failure. We then propose consistent alternative estimators that enable calibrated sampling from the Bayesian posterior. We significantly outperform previous methods on a set of Bayesian inference tasks, and match state-of-the-art in black hole image reconstruction.

[475] From Bias to Balance: Fairness-Aware Paper Recommendation for Equitable Peer Review

Uttamasha Anjally Oyshi, Susan Gauch

Main category: cs.LG

TL;DR: Fair-PaperRec: MLP-based post-review paper recommender with fairness regularization that re-ranks papers to increase underrepresented group participation while maintaining utility.

Details

Motivation: Systemic biases in double-blind review disadvantage underrepresented groups; need for equity-focused post-review selection that increases inclusion without degrading quality.

Method: Multi-Layer Perceptron with differentiable fairness loss over intersectional attributes (race, country) for paper re-ranking after double-blind review; tested on synthetic datasets with varying bias levels and real conference data from SIGCHI, DIS, IUI.

Result: Achieves up to 42.03% increase in underrepresented-group participation with at most 3.16% change in overall utility; fairness regularization works as both equity mechanism and mild quality regularizer, especially in highly biased regimes.

Conclusion: Fairness regularization offers practical, equity-focused framework for post-review paper selection that preserves scholarly quality while increasing diversity; synthetic-to-real validation demonstrates robustness across bias levels.

Abstract: Despite frequent double-blind review, systemic biases related to author demographics still disadvantage underrepresented groups. We start from a simple hypothesis: if a post-review recommender is trained with an explicit fairness regularizer, it should increase inclusion without degrading quality. To test this, we introduce Fair-PaperRec, a Multi-Layer Perceptron (MLP) with a differentiable fairness loss over intersectional attributes (e.g., race, country) that re-ranks papers after double-blind review. We first probe the hypothesis on synthetic datasets spanning high, moderate, and near-fair biases. Across multiple randomized runs, these controlled studies map where increasing the fairness weight strengthens macro/micro diversity while keeping utility approximately stable, demonstrating robustness and adaptability under varying disparity levels. We then carry the hypothesis into the original setting, conference data from ACM Special Interest Group on Computer-Human Interaction (SIGCHI), Designing Interactive Systems (DIS), and Intelligent User Interfaces (IUI). In this real-world scenario, an appropriately tuned configuration of Fair-PaperRec achieves up to a 42.03% increase in underrepresented-group participation with at most a 3.16% change in overall utility relative to the historical selection. Taken together, the synthetic-to-original progression shows that fairness regularization can act as both an equity mechanism and a mild quality regularizer, especially in highly biased regimes. By first analyzing the behavior of the fairness parameters under controlled conditions and then validating them on real submissions, Fair-PaperRec offers a practical, equity-focused framework for post-review paper selection that preserves, and in some settings can even enhance, measured scholarly quality.
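
One common way to build a differentiable fairness penalty is to square the gap in mean predicted score between groups and add it to the utility loss. The sketch below assumes that form; the function names, the single binary group attribute, and the squared-gap penalty are illustrative assumptions and may differ from the loss Fair-PaperRec actually uses.

```python
def demographic_parity_gap(scores, is_protected):
    """Mean score of the protected group minus the mean score of the rest."""
    protected = [s for s, g in zip(scores, is_protected) if g]
    others = [s for s, g in zip(scores, is_protected) if not g]
    return sum(protected) / len(protected) - sum(others) / len(others)


def regularized_loss(utility_loss, scores, is_protected, fairness_weight):
    """Utility loss plus a squared-gap fairness penalty (assumed form)."""
    gap = demographic_parity_gap(scores, is_protected)
    return utility_loss + fairness_weight * gap ** 2


# Hypothetical acceptance scores for four papers; the last two have
# authors from the underrepresented group.
scores = [0.8, 0.6, 0.4, 0.2]
groups = [False, False, True, True]
gap = demographic_parity_gap(scores, groups)
loss = regularized_loss(0.1, scores, groups, fairness_weight=2.0)
```

Raising `fairness_weight` pushes the model toward equal mean scores across groups, which mirrors the trade-off between diversity and utility that the synthetic experiments vary.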

[476] ECHO: Encoding Communities via High-order Operators

Emilio Ferrara

Main category: cs.LG

TL;DR: ECHO is a scalable GNN architecture for community detection in attributed networks that overcomes computational bottlenecks through topology-aware routing and memory-efficient contrastive learning.

Details

Motivation: Traditional community detection faces a fundamental divide: topological algorithms ignore semantic features while GNNs suffer from computational bottlenecks including feature over-smoothing in dense/heterophilic networks and O(N²) memory constraints.

Method: ECHO reframes community detection as adaptive multi-scale diffusion with: 1) Topology-Aware Router that analyzes structural heuristics to route graphs through optimal inductive bias, 2) memory-sharded full-batch contrastive objective, and 3) chunked O(N·K) similarity extraction to bypass O(N²) bottlenecks.

Result: On synthetic LFR benchmarks scaled to 1M nodes, ECHO achieves scale-invariant accuracy despite topological noise. On real-world social networks with 1.6M nodes and 30M edges, it completes clustering in minutes with throughputs exceeding 2,800 nodes/second, matching optimized topological baselines.

Conclusion: ECHO provides a scalable, self-supervised architecture that overcomes both semantic and systems walls in attributed network community detection, enabling efficient processing of massive graphs while maintaining semantic awareness.

Abstract: Community detection in attributed networks faces a fundamental divide: topological algorithms ignore semantic features, while Graph Neural Networks (GNNs) encounter devastating computational bottlenecks. Specifically, GNNs suffer from a Semantic Wall of feature over-smoothing in dense or heterophilic networks, and a Systems Wall driven by the O(N^2) memory constraints of pairwise clustering. To dismantle these barriers, we introduce ECHO (Encoding Communities via High-order Operators), a scalable, self-supervised architecture that reframes community detection as an adaptive, multi-scale diffusion process. ECHO features a Topology-Aware Router that automatically analyzes structural heuristics (sparsity, density, and assortativity) to route graphs through the optimal inductive bias, preventing heterophilic poisoning while ensuring semantic densification. Coupled with a memory-sharded full-batch contrastive objective and a novel chunked O(N·K) similarity extraction method, ECHO completely bypasses traditional O(N^2) memory bottlenecks without sacrificing the mathematical precision of global gradients. Extensive evaluations demonstrate that this topology-feature synergy consistently overcomes the classical resolution limit. On synthetic LFR benchmarks scaled up to 1 million nodes, ECHO achieves scale-invariant accuracy despite severe topological noise. Furthermore, on massive real-world social networks with over 1.6 million nodes and 30 million edges, it completes clustering in mere minutes with throughputs exceeding 2,800 nodes per second, matching the speed of highly optimized purely topological baselines. The implementation utilizes a unified framework that automatically engages memory-sharded optimization to support adoption across varying hardware constraints. GitHub Repository: https://github.com/emilioferrara/ECHO-GNN
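
The general idea behind chunked similarity extraction can be sketched as follows: scan candidate nodes one chunk at a time and keep only the K best matches per node, so the full N-by-N similarity matrix is never stored. This is a generic illustration with hypothetical names and dot-product similarity, not the ECHO implementation; compute is still quadratic here, only the stored similarities shrink to N times K.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def chunked_topk_neighbors(embeddings, k, chunk_size):
    """For each node, return its k most similar other nodes as (score, index)."""
    n = len(embeddings)
    topk = [[] for _ in range(n)]
    for start in range(0, n, chunk_size):
        chunk = range(start, min(start + chunk_size, n))
        for i in range(n):
            # Only chunk_size scores are materialized for node i at a time.
            scores = [(dot(embeddings[i], embeddings[j]), j)
                      for j in chunk if j != i]
            topk[i] = heapq.nlargest(k, topk[i] + scores)
    return topk


# Hypothetical 2-D embeddings forming two obvious communities.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
neighbors = chunked_topk_neighbors(emb, k=1, chunk_size=2)
```

In a GPU setting the inner loop would typically be a batched matrix product per chunk, with the same running top-K merge between chunks.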

[477] Beyond performance-wise Contribution Evaluation in Federated Learning

Balazs Pejo

Main category: cs.LG

TL;DR: Federated learning client evaluation should consider multiple trustworthiness dimensions (reliability, resilience, fairness) beyond just accuracy, using Shapley values to quantify contributions across these independent dimensions.

Details

Motivation: Current federated learning client evaluation methods focus only on model performance metrics like accuracy, ignoring critical trustworthiness dimensions. This work addresses the need to evaluate client contributions across reliability (tolerance to noisy data), resilience (resistance to adversarial examples), and fairness (demographic parity) for more comprehensive assessment.

Method: Uses state-of-the-art approximation of Shapley value, a principled method for value attribution, to quantify client contributions across multiple trustworthiness dimensions. This allows for multi-dimensional evaluation beyond traditional performance metrics.

Result: Reveals that no single client excels across all trustworthiness dimensions, and these dimensions are largely independent from each other. This highlights a critical flaw in current evaluation schemes where no single metric is adequate for comprehensive evaluation and equitable reward allocation.

Conclusion: Federated learning client evaluation requires multi-dimensional assessment of trustworthiness contributions (reliability, resilience, fairness) using principled methods like Shapley values, as single-metric approaches are insufficient for comprehensive evaluation and fair reward distribution.

Abstract: Federated learning offers a privacy-friendly collaborative learning framework, yet its success, like any joint venture, hinges on the contributions of its participants. Existing client evaluation methods predominantly focus on model performance, such as accuracy or loss, which represents only one dimension of a machine learning model’s overall utility. In contrast, this work investigates the critical, yet overlooked, issue of client contributions towards a model’s trustworthiness – specifically, its reliability (tolerance to noisy data), resilience (resistance to adversarial examples), and fairness (measured via demographic parity). To quantify these multifaceted contributions, we employ the state-of-the-art approximation of the Shapley value, a principled method for value attribution. Our results reveal that no single client excels across all dimensions, which are largely independent from each other, highlighting a critical flaw in current evaluation schemes: no single metric is adequate for comprehensive evaluation and equitable reward allocation.
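
For intuition about Shapley-based attribution, the sketch below computes exact Shapley values for a three-client toy game by averaging marginal contributions over all join orders. The paper uses an approximation, which this does not reproduce; the utility function and client names are hypothetical, and exact enumeration is only feasible for a handful of clients.

```python
from itertools import permutations


def shapley_values(players, value_fn):
    """Exact Shapley values by averaging marginal gains over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_utility(coalition):
    """Hypothetical utility: additive scores plus a bonus when a and b cooperate."""
    base = {"a": 0.5, "b": 0.3, "c": 0.1}
    bonus = 0.2 if {"a", "b"} <= coalition else 0.0
    return sum(base[p] for p in coalition) + bonus


phi = shapley_values(["a", "b", "c"], toy_utility)
```

Swapping `toy_utility` for a robustness or fairness score of the model trained on a coalition gives one attribution per trustworthiness dimension, which is the kind of comparison the abstract describes.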

[478] Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns

Afshin Khadangi

Main category: cs.LG

TL;DR: TRC² is a novel decoder-only architecture for continual learning that combines sparse thalamic routing over cortical columns with specialized mechanisms for modulation, prediction, memory, and feedback, enabling efficient adaptation without catastrophic forgetting.

Details

Motivation: Standard language model training pipelines are brittle under non-stationary data, suffering from catastrophic forgetting during online updates. Existing methods that improve stability often increase latency, memory footprint, or computation in ways that don't scale well to long contexts.

Method: TRC² uses sparse thalamic routing over cortical columns with mechanisms for modulation, prediction, memory, and feedback, plus a fast corrective pathway for rapid adaptation without destabilizing slower parameters. The architecture is sparse and chunk-parallel for efficient training/inference.

Result: TRC² improves the stability-plasticity tradeoff at comparable compute, enabling rapid on-stream adaptation while preserving previously acquired behavior across language modeling and continual learning benchmarks.

Conclusion: TRC² addresses continual learning at the architectural level, providing an efficient solution for deployed language models that need to adapt to streaming data without catastrophic forgetting.

Abstract: Continual learning is a core requirement for deployed language models, yet standard training and fine-tuning pipelines remain brittle under non-stationary data. Online updates often induce catastrophic forgetting, while methods that improve stability frequently increase latency, memory footprint, or dense computation in ways that do not scale well to long contexts. We introduce TRC$^{2}$ (Thalamically Routed Cortical Columns), a decoder-only backbone that addresses continual learning at the architectural level. TRC$^{2}$ combines sparse thalamic routing over cortical columns with mechanisms for modulation, prediction, memory, and feedback, together with a fast corrective pathway that supports rapid adaptation without destabilizing slower parameters. The resulting block is sparse and chunk-parallel, enabling efficient training and inference while preserving clean ablations of each subsystem. We instantiate a reproducible training and evaluation stack and a continual-learning harness that measures proxy forgetting under streaming domain shifts. Across language modeling and continual learning benchmarks, TRC$^{2}$ improves the stability-plasticity tradeoff at comparable compute, enabling rapid on-stream adaptation while preserving previously acquired behavior.

[479] Reinforcement-aware Knowledge Distillation for LLM Reasoning

Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto

Main category: cs.LG

TL;DR: RLAD is a reinforcement learning-aware distillation method that selectively imitates teacher models during RL training, addressing distribution mismatch and objective interference issues in traditional knowledge distillation approaches.

Details

Motivation: Current knowledge distillation methods for RL-trained LLMs suffer from distribution mismatch (teacher supervision not aligning with student's evolving rollout distribution) and objective interference (KL regularizer competing with reward maximization), motivating a more integrated approach to distillation during RL training.

Method: Proposes RL-aware distillation (RLAD) with Trust Region Ratio Distillation (TRRD) - replaces teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher-old-policy mixture, enabling advantage-aware, trust-region-bounded distillation on student rollouts.

Result: RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation across diverse logic reasoning and math benchmarks.

Conclusion: RLAD provides an effective framework for distilling RL-trained reasoning capabilities into smaller models by better integrating teacher guidance with reinforcement learning objectives through selective imitation during policy updates.

Abstract: Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student’s evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL – guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher–old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
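
For readers unfamiliar with PPO-style ratio objectives, a minimal per-token sketch follows. It shows the standard clipped surrogate; the function names, the mixing weight `alpha`, and the choice to place a teacher and old-policy mixture in the denominator of the ratio are assumptions made for illustration and should not be read as the TRRD objective from the paper.

```python
import math


def mixture_logprob(logp_teacher, logp_old, alpha):
    """log(alpha * p_teacher + (1 - alpha) * p_old), for 0 < alpha < 1."""
    a = math.log(alpha) + logp_teacher
    b = math.log(1.0 - alpha) + logp_old
    m = max(a, b)  # log-sum-exp for numerical stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def clipped_ratio_objective(logp_new, logp_anchor, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one token, relative to an anchor."""
    ratio = math.exp(logp_new - logp_anchor)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Hypothetical log-probabilities of one sampled token.
anchor = mixture_logprob(logp_teacher=-0.5, logp_old=-1.5, alpha=0.5)
objective = clipped_ratio_objective(logp_new=-1.0, logp_anchor=anchor,
                                    advantage=1.0)
```

With a positive advantage the clip caps how far the student can move toward tokens the anchor favors in one update, which is the trust-region behavior the abstract refers to.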

[480] Sharp Convergence Rates for Masked Diffusion Models

Yuchen Liang, Zhiheng Tan, Ness Shroff, Yingbin Liang

Main category: cs.LG

TL;DR: Theoretical analysis of discrete diffusion model samplers (Euler method and First-Hitting Sampler) with improved total-variation bounds, relaxed assumptions, and matching lower bounds.

Details

Motivation: Discrete diffusion models show strong empirical performance but lack theoretical understanding. Existing analyses in KL divergence have loose parameter dependencies, strong assumptions, and don't cover newer high-performance samplers like FHS.

Method: Develops direct total-variation (TV) based analysis for Euler method and FHS sampler. Uses TV-based error decomposition along CTMC trajectory and decoupling-based path-wise analysis for FHS.

Result: Improved convergence guarantees for Euler method with relaxed assumptions and better parameter dependencies. First convergence lower bound for Euler sampler showing tightness. Analysis shows FHS incurs no sampling error beyond score estimation error, with matching lower bound.

Conclusion: Provides rigorous theoretical foundations for discrete diffusion model samplers with tight bounds and relaxed assumptions, advancing theoretical understanding of practical diffusion model algorithms.

Abstract: Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, with masked (absorbing-rate) variants emerging as competitive alternatives to autoregressive models. Among existing samplers, the Euler method remains the standard choice in many applications, and more recently, the First-Hitting Sampler (FHS) has shown considerable promise for masked diffusion models. Despite their practical success, the theoretical understanding of these samplers remains limited. Existing analyses are conducted in Kullback-Leibler (KL) divergence, which often yields loose parameter dependencies and requires strong assumptions on score estimation. Moreover, these guarantees do not cover the recently developed, high-performance FHS. In this work, we first develop a direct total-variation (TV) based analysis for the Euler method that overcomes these limitations. Our results relax assumptions on score estimation, improve parameter dependencies, and establish convergence guarantees without requiring any surrogate initialization. Also for this setting, we provide the first convergence lower bound for the Euler sampler, establishing tightness with respect to both the data dimension $d$ and the target accuracy $\varepsilon$. Finally, we analyze the FHS sampler and show that it incurs no sampling error beyond that induced by score estimation, which we show to be tight with a matching lower error bound. Overall, our analysis introduces a direct TV-based error decomposition along the CTMC trajectory and a decoupling-based path-wise analysis for FHS, which may be of independent interest.
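
The two error measures contrasted in the abstract can be compared on a toy example. The sketch computes total-variation distance and KL divergence between two discrete distributions and checks Pinsker's inequality, TV <= sqrt(KL / 2), which is the usual route from a KL bound to a TV bound. The distributions are hypothetical, and nothing here reproduces the paper's analysis.

```python
import math


def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
tv = total_variation(p, q)
kl = kl_divergence(p, q)
pinsker_bound = math.sqrt(kl / 2.0)
```

On this example the Pinsker bound is visibly looser than the true TV distance, which hints at why bounding TV directly can give sharper parameter dependencies than passing through KL.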

[481] Space Syntax-guided Post-training for Residential Floor Plan Generation

Zhuoyang Jiang, Dongqing Zhang

Main category: cs.LG

TL;DR: SSPT is a post-training method that injects space syntax knowledge into floor plan generation using non-differentiable oracles to improve architectural priors like public space dominance and functional hierarchy.

Details

Motivation: Existing generative models for floor plans focus on fitting data distributions but neglect important architectural principles like configurational dominance and connectivity of public spaces (living rooms, foyers).

Method: Proposes SSPT with two strategies: 1) iterative retraining via space-syntax filtering and diffusion fine-tuning, and 2) reinforcement learning via PPO with space-syntax rewards. Uses non-differentiable oracle to convert layouts to rectangle-space graphs and compute integration-based measurements.

Result: Both strategies improve public-space dominance and restore clearer functional hierarchy compared to baselines. PPO achieves stronger gains with higher compute efficiency and reduced variance.

Conclusion: SSPT provides a scalable way to integrate architectural theory into data-driven plan generation and is compatible with other generative backbones given a post-hoc evaluation oracle.

Abstract: Pre-trained generative models for residential floor plans are typically optimized to fit large-scale data distributions, which can under-emphasize critical architectural priors such as the configurational dominance and connectivity of domestic public spaces (e.g., living rooms and foyers). This paper proposes Space Syntax-guided Post-training (SSPT), a post-training paradigm that explicitly injects space syntax knowledge into floor plan generation via a non-differentiable oracle. The oracle converts RPLAN-style layouts into rectangle-space graphs through greedy maximal-rectangle decomposition and door-mediated adjacency construction, and then computes integration-based measurements to quantify public space dominance and functional hierarchy. To enable consistent evaluation and diagnosis, we further introduce SSPT-Bench (Eval-8), an out-of-distribution benchmark that post-trains models using conditions capped at $\leq 7$ rooms while evaluating on 8-room programs, together with a unified metric suite for dominance, stability, and profile alignment. SSPT is instantiated with two strategies: (i) iterative retraining via space-syntax filtering and diffusion fine-tuning, and (ii) reinforcement learning via PPO with space-syntax rewards. Experiments show that both strategies improve public-space dominance and restore clearer functional hierarchy compared to distribution-fitted baselines, while PPO achieves stronger gains with substantially higher compute efficiency and reduced variance. SSPT provides a scalable pathway for integrating architectural theory into data-driven plan generation and is compatible with other generative backbones given a post-hoc evaluation oracle.
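As a rough illustration of an integration-based measurement, the sketch below computes the classical space syntax integration value (inverse relative asymmetry) for each node of a small adjacency graph. It is a simplification under stated assumptions: nodes here are whole rooms rather than the rectangle-space units the oracle builds, the formula RA = 2(MD - 1)/(k - 2) is the textbook definition rather than something confirmed from the paper, and the room names and `plan` layout are hypothetical.

```python
from collections import deque


def mean_depth(graph, root):
    """Average shortest-path depth from `root` to every other node (BFS)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    k = len(graph)
    if len(depth) != k:
        raise ValueError("graph must be connected")
    return sum(depth.values()) / (k - 1)


def integration(graph, root):
    """Inverse relative asymmetry, 1 / RA with RA = 2 (MD - 1) / (k - 2)."""
    k = len(graph)
    ra = 2.0 * (mean_depth(graph, root) - 1.0) / (k - 2)
    return float("inf") if ra == 0 else 1.0 / ra


# Hypothetical six-room plan; edges stand for door connections.
plan = {
    "living": ["foyer", "kitchen", "hall"],
    "foyer": ["living"],
    "kitchen": ["living"],
    "hall": ["living", "bedroom1", "bedroom2"],
    "bedroom1": ["hall"],
    "bedroom2": ["hall"],
}

scores = {room: integration(plan, room) for room in plan}
```

A higher score means the room is shallower with respect to the rest of the plan, which is the sense in which a living room can be said to dominate the configuration.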

[482] TEFL: Prediction-Residual-Guided Rolling Forecasting for Multi-Horizon Time Series

Xiannan Huang, Shen Fang, Shuhan Qiu, Chengcheng Yu, Jiayuan Du, Chao Yang

Main category: cs.LG

TL;DR: TEFL is a temporal error feedback learning framework that incorporates historical prediction residuals into deep time series forecasting models to improve accuracy and robustness.

Details

Motivation: Modern deep forecasting models are trained to minimize point-wise prediction loss but ignore valuable information in past prediction residuals, which reflect persistent biases, unmodeled patterns, or evolving dynamics that could improve forecasting performance.

Method: Proposes TEFL framework that: (1) selects observable multi-step residuals under partial observability of rolling forecasts, (2) integrates them through a lightweight low-rank adapter for efficiency and to prevent overfitting, and (3) uses a two-stage training procedure that jointly optimizes the base forecaster and error module.

Result: Extensive experiments across 10 real-world datasets and 5 backbone architectures show TEFL consistently improves accuracy, reducing MAE by 5-10% on average, with error reductions exceeding 10% (up to 19.5%) in challenging scenarios with abrupt changes and distribution shifts.

Conclusion: TEFL offers a simple, general, and effective enhancement to modern deep forecasting systems by embedding residual-based feedback directly into the learning process, demonstrating strong robustness and consistent accuracy improvements.

Abstract: Time series forecasting plays a critical role in domains such as transportation, energy, and meteorology. Despite their success, modern deep forecasting models are typically trained to minimize point-wise prediction loss without leveraging the rich information contained in past prediction residuals from rolling forecasts - residuals that reflect persistent biases, unmodeled patterns, or evolving dynamics. We propose TEFL (Temporal Error Feedback Learning), a unified learning framework that explicitly incorporates these historical residuals into the forecasting pipeline during both training and evaluation. To make this practical in deep multi-step settings, we address three key challenges: (1) selecting observable multi-step residuals under the partial observability of rolling forecasts, (2) integrating them through a lightweight low-rank adapter to preserve efficiency and prevent overfitting, and (3) designing a two-stage training procedure that jointly optimizes the base forecaster and error module. Extensive experiments across 10 real-world datasets and 5 backbone architectures show that TEFL consistently improves accuracy, reducing MAE by 5-10% on average. Moreover, it demonstrates strong robustness under abrupt changes and distribution shifts, with error reductions exceeding 10% (up to 19.5%) in challenging scenarios. By embedding residual-based feedback directly into the learning process, TEFL offers a simple, general, and effective enhancement to modern deep forecasting systems.
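The partial-observability point can be made concrete with a small sketch: at time t, an h-step forecast issued at origin s has an observable residual only if its target s + h has already occurred. The indexing convention, the function name, and the NaN masking below are illustrative assumptions, not the selection rule used by TEFL.

```python
import numpy as np


def observable_residuals(y, forecasts, t):
    """Residuals that are already known at time t.

    forecasts[s, h - 1] is the h-step-ahead prediction of y[s + h] made at
    origin s. Entries whose target lies after t stay NaN (not yet observed).
    """
    n_origins, horizon = forecasts.shape
    residuals = np.full((n_origins, horizon), np.nan)
    for s in range(min(t, n_origins)):
        for h in range(1, horizon + 1):
            if s + h <= t:  # the target value has been observed
                residuals[s, h - 1] = y[s + h] - forecasts[s, h - 1]
    return residuals


# Toy series and all-zero forecasts, so each residual equals the target value.
y = np.arange(10, dtype=float)
res = observable_residuals(y, np.zeros((6, 3)), t=4)
```

In this toy case the most recent origin contributes only its 1-step residual, while older origins contribute progressively longer horizons, which is why a multi-step error module cannot simply use the full residual matrix.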

[483] Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei

Main category: cs.LG

TL;DR: Duel-Evolve: Evolutionary optimization algorithm that uses pairwise preferences from LLMs instead of scalar rewards for test-time optimization over discrete output spaces.

Details

Motivation: Many applications need to optimize LLM outputs at test time, but existing methods rely on calibrated scalar evaluators which are often unavailable, sparse, or unreliable. Pairwise comparisons are easier to elicit and provide useful signal for improvement.

Method: Duel-Evolve replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. It aggregates noisy comparisons via Bayesian Bradley-Terry model for uncertainty-aware quality estimates, uses Double Thompson Sampling to allocate comparison budget, and selects high-quality parents to generate improved candidates.

Result: Achieves 20 percentage points higher accuracy on MathBench and over 12 percentage points improvement on LiveCodeBench compared to existing methods and baselines. The method requires no reward model, ground-truth labels, or hand-crafted scoring function.

Conclusion: Pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces, enabling effective optimization without external supervision or scoring functions.

Abstract: Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy over existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.
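To show how noisy pairwise comparisons can be aggregated into per-candidate quality scores, here is a minimal sketch of a Bradley-Terry fit using the standard minorization-maximization update. It is a plain maximum-likelihood version, not the Bayesian model with uncertainty estimates described above, and the `wins` matrix is made-up data.

```python
import numpy as np


def bradley_terry(wins, iters=200):
    """Maximum-likelihood Bradley-Terry strengths via MM updates.

    wins[i, j] is the number of times candidate i was preferred over j.
    Returns strengths normalized to sum to 1.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T  # comparisons played between each pair
    p = np.ones(n) / n
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / denom
        p = p / p.sum()
    return p


# Hypothetical comparison counts for three candidates (10 duels per pair).
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]])
strengths = bradley_terry(wins)
```

Under the model, candidate i beats candidate j with probability p_i / (p_i + p_j); a Bayesian variant would place a prior on the strengths and sample from the posterior, which is what a Thompson-sampling allocation scheme needs.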

[484] Predicting Tennis Serve directions with Machine Learning

Ying Zhu, Ruthuparna Naikar

Main category: cs.LG

TL;DR: Machine learning method predicts professional tennis players’ first serve directions with ~49% accuracy for males and ~44% for females, revealing strategic patterns and mixed-strategy decision-making.

Details

Motivation: To understand the strategic mind game between servers and returners in professional tennis, particularly how servers choose serve directions to maximize winning chances while being unpredictable, and how returners try to anticipate these directions.

Method: Developed a machine learning method with feature engineering to predict professional tennis players’ first serve directions, analyzing serve decision patterns and strategic behaviors.

Result: Achieved average prediction accuracy of around 49% for male players and 44% for female players, providing evidence that top players use mixed-strategy models, fatigue affects serve direction choices, and contextual information is important for returners’ anticipatory reactions.

Conclusion: The study successfully models serve direction prediction in tennis, revealing strategic decision-making patterns and suggesting that contextual factors play a significant role in the server-returner mind game.

Abstract: Serves, especially first serves, are very important in professional tennis. Servers choose their serve directions strategically to maximize their winning chances while trying to be unpredictable. On the other hand, returners try to predict serve directions to make good returns. The mind game between servers and returners is an important part of decision-making in professional tennis matches. To help understand the players’ serve decisions, we have developed a machine learning method for predicting professional tennis players’ first serve directions. Through feature engineering, our method achieves an average prediction accuracy of around 49% for male players and 44% for female players. Our analysis provides some evidence that top professional players use a mixed-strategy model in serving decisions and that fatigue might be a factor in choosing serve directions. Our analysis also suggests that contextual information is perhaps more important for returners’ anticipatory reactions than previously thought.

[485] Coarse-to-Fine Learning of Dynamic Causal Structures

Dezhi Yang, Qiaoyu Tan, Carlotta Domeniconi, Jun Wang, Lizhen Cui, Guoxian Yu

Main category: cs.LG

TL;DR: DyCausal is a framework for learning fully dynamic causal structures from time series data using convolutional networks and linear interpolation to recover time-varying causal graphs.

Details

Motivation: Existing causal discovery methods rely on distributional or structural invariance assumptions that conflict with real-world time-varying causal relationships. There's a need for methods that address fully dynamic causality where both instantaneous and lagged dependencies evolve over time.

Method: DyCausal uses convolutional networks to capture causal patterns within coarse-grained time windows, then applies linear interpolation to refine causal structures at each time step. It also introduces an acyclic constraint based on matrix norm scaling to improve efficiency while constraining loops in evolving causal structures.

Result: Comprehensive evaluations on synthetic and real-world datasets show DyCausal achieves superior performance compared to existing methods, offering stable and efficient identification of fully dynamic causal structures.

Conclusion: DyCausal provides an effective framework for learning time-varying causal structures from coarse to fine granularity, addressing the limitations of stationary causality assumptions in real-world systems.

Abstract: Learning the dynamic causal structure of time series is a challenging problem. Most existing approaches rely on distributional or structural invariance to uncover underlying causal dynamics, assuming stationary or partially stationary causality. However, these assumptions often conflict with the complex, time-varying causal relationships observed in real-world systems. This motivates the need for methods that address fully dynamic causality, where both instantaneous and lagged dependencies evolve over time. Such a setting poses significant challenges for the efficiency and stability of causal discovery. To address these challenges, we introduce DyCausal, a dynamic causal structure learning framework. DyCausal leverages convolutional networks to capture causal patterns within coarse-grained time windows, and then applies linear interpolation to refine causal structures at each time step, thereby recovering fine-grained and time-varying causal graphs. In addition, we propose an acyclic constraint based on matrix norm scaling, which improves efficiency while effectively constraining loops in evolving causal structures. Comprehensive evaluations on both synthetic and real-world datasets demonstrate that DyCausal achieves superior performance compared to existing methods, offering a stable and efficient approach for identifying fully dynamic causal structures from coarse to fine.
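The coarse-to-fine step can be sketched as element-wise linear interpolation between window-level adjacency matrices. The assumptions here are that each window is summarized by one weighted adjacency matrix anchored at a center time step, and that steps outside the first and last centers are clamped to the nearest window; the function and argument names are hypothetical.

```python
import numpy as np


def interpolate_graphs(window_graphs, window_centers, num_steps):
    """Linearly interpolate window-level graphs to one graph per time step.

    window_graphs: array of shape (W, d, d), one weighted adjacency per window.
    window_centers: increasing time indices at which each window graph applies.
    Returns an array of shape (num_steps, d, d).
    """
    window_graphs = np.asarray(window_graphs, dtype=float)
    _, d, _ = window_graphs.shape
    steps = np.arange(num_steps)
    out = np.empty((num_steps, d, d))
    for i in range(d):
        for j in range(d):
            # np.interp clamps to the end values outside the given centers.
            out[:, i, j] = np.interp(steps, window_centers, window_graphs[:, i, j])
    return out


# Two hypothetical windows: the edge 0 -> 1 fades out while 1 -> 0 fades in.
coarse = np.array([[[0.0, 1.0], [0.0, 0.0]],
                   [[0.0, 0.0], [1.0, 0.0]]])
fine = interpolate_graphs(coarse, window_centers=[0, 4], num_steps=5)
```

Note that interpolating between two acyclic graphs can produce an intermediate graph containing a cycle (as at step 2 above), which is one reason an explicit acyclicity constraint on the evolving structure is still needed.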

[486] Persistent Nonnegative Matrix Factorization via Multi-Scale Graph Regularization

Jichao Zhang, Ran Miao, Limin Li

Main category: cs.LG

TL;DR: Proposes persistent nonnegative matrix factorization (pNMF) that captures multi-scale connectivity evolution using persistent homology, producing a sequence of persistence-aligned embeddings rather than a single factorization.

Details

Motivation: Existing NMF methods are single-scale and fail to capture how connectivity structures evolve across different resolutions, limiting their ability to represent multi-scale data patterns.

Method: Uses persistent homology to identify the canonical scales at which connectivity changes qualitatively, derives a sequence of graph Laplacians from those scales, and formulates a coupled NMF with scale-wise geometric regularization and cross-scale consistency constraints.

Result: Develops sequential alternating optimization algorithm with guaranteed convergence; demonstrates effectiveness on synthetic and single-cell RNA sequencing datasets for multi-scale low-rank embeddings.

Conclusion: pNMF provides a principled framework for multi-scale dimensionality reduction that captures connectivity evolution across resolutions, overcoming limitations of traditional single-scale NMF methods.

Abstract: Matrix factorization techniques, especially Nonnegative Matrix Factorization (NMF), have been widely used for dimensionality reduction and interpretable data representation. However, existing NMF-based methods are inherently single-scale and fail to capture the evolution of connectivity structures across resolutions. In this work, we propose persistent nonnegative matrix factorization (pNMF), a scale-parameterized family of NMF problems that produces a sequence of persistence-aligned embeddings rather than a single one. By leveraging persistent homology, we identify a canonical minimal sufficient scale set at which the underlying connectivity undergoes qualitative changes. These canonical scales induce a sequence of graph Laplacians, leading to a coupled NMF formulation with scale-wise geometric regularization and an explicit cross-scale consistency constraint. We analyze the structural properties of the embeddings along the scale parameter and establish bounds on their increments between consecutive scales. The resulting model defines a nontrivial solution path across scales, rather than a single factorization, which poses new computational challenges. We develop a sequential alternating optimization algorithm with guaranteed convergence. Numerical experiments on synthetic and single-cell RNA sequencing datasets demonstrate the effectiveness of the proposed approach in multi-scale low-rank embeddings.
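One way to picture scales at which connectivity changes is 0-dimensional persistent homology of a distance-threshold (Vietoris-Rips) filtration: as the threshold grows, components merge exactly at the edge lengths of a minimum spanning tree. The sketch below finds those merge scales with a union-find. Restricting to H0 and using Euclidean distances are assumptions made for illustration; the paper's canonical scale set may be defined differently.

```python
import numpy as np


def connectivity_scales(points):
    """Distance thresholds at which two connected components merge (H0 deaths)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    edges = sorted((dists[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    scales = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge joins two components: connectivity changes here
            parent[ri] = rj
            scales.append(float(dist))
    return scales


# Four hypothetical points on a line.
scales = connectivity_scales([[0.0], [1.0], [3.0], [10.0]])
```

Each returned scale could then define one neighborhood graph (connect points closer than that scale) and hence one graph Laplacian in the sequence; n points always yield n - 1 merge scales.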

[487] LUMOS: Democratizing SciML Workflows with L0-Regularized Learning for Unified Feature and Parameter Adaptation

Shouwei Gao, Xu Zheng, Dongsheng Luo, Sheng Di, Wenqian Dong

Main category: cs.LG

TL;DR: LUMOS is an end-to-end framework for automated SciML model design that unifies feature selection and model pruning using L0-regularized learning, achieving significant parameter reduction and inference speedup.

Details

Motivation: Designing effective SciML models requires substantial prior knowledge and manual expertise for feature selection and model sizing. The goal is to democratize SciML model design by reducing reliance on manual tuning.

Method: LUMOS uses L0-regularized learning with semi-stochastic gating and reparameterization techniques to dynamically select informative features and prune redundant parameters during training.

Result: On 13 diverse SciML workloads, LUMOS achieves 71.45% parameter reduction and 6.4x inference speedup on average, with scalability confirmed through DDP training on up to eight GPUs.

Conclusion: LUMOS provides an effective framework for automated SciML model design that reduces manual tuning while maintaining predictive accuracy and demonstrates scalability across diverse scientific domains.

Abstract: The rapid growth of scientific machine learning (SciML) has accelerated discovery across diverse domains, yet designing effective SciML models remains a challenging task. In practice, building such models often requires substantial prior knowledge and manual expertise, particularly in determining which input features to use and how large the model should be. We introduce LUMOS, an end-to-end framework based on L0-regularized learning that unifies feature selection and model pruning to democratize SciML model design. By employing semi-stochastic gating and reparameterization techniques, LUMOS dynamically selects informative features and prunes redundant parameters during training, reducing the reliance on manual tuning while maintaining predictive accuracy. We evaluate LUMOS across 13 diverse SciML workloads, including cosmology and molecular sciences, and demonstrate its effectiveness and generalizability. Experiments on 13 SciML models show that LUMOS achieves 71.45% parameter reduction and a 6.4x inference speedup on average. Furthermore, Distributed Data Parallel (DDP) training on up to eight GPUs confirms the scalability of

[488] RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format

Zhehao Huang, Yuhang Liu, Baijiong Lin, Yixin Lou, Zhengbao He, Hanling Tian, Tao Li, Xiaolin Huang

Main category: cs.LG

TL;DR: RAIN-Merging integrates instruction-tuned models into large reasoning models via gradient-free merging that preserves reasoning structure while improving instruction following.

Details

Motivation: Large reasoning models (LRMs) excel at complex reasoning but often fail to follow output format and instruction constraints, while instruction-tuned models (ITMs) are better at instruction following but may lack reasoning capabilities. The paper aims to bridge this gap by merging these complementary capabilities.

Method: RAIN-Merging (Reasoning-Aware Instruction-attention guided Null-space projection Merging) uses two key techniques: 1) Projects ITM task vectors onto the null space of forward features at thinking tokens to preserve LRM’s reasoning structure, 2) Uses instruction attention to derive module-specific scaling that amplifies instruction-relevant components while suppressing leakage. The method is gradient-free and uses small calibration sets.

Result: Across four instruction-following benchmarks and nine reasoning & general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality. Gains are consistent across model scales and architectures, and translate to improved performance in agent settings.

Conclusion: The principal subspaces of the LRM and ITM task vectors are nearly orthogonal across key modules, which enables lightweight merging with minimal interference. RAIN-Merging integrates instruction-following capabilities into reasoning models while preserving their structured reasoning mechanisms, offering a practical way to build models that are strong at both reasoning and instruction adherence.

Abstract: Large reasoning models (LRMs) excel at a long chain of reasoning but often fail to faithfully follow instructions regarding output format, constraints, or specific requirements. We investigate whether this gap can be closed by integrating an instruction-tuned model (ITM) into an LRM. Analyzing their differences in parameter space, namely task vectors, we find that their principal subspaces are nearly orthogonal across key modules, suggesting a lightweight merging with minimal interference. However, we also demonstrate that naive merges are fragile because they overlook the output format mismatch between LRMs (with explicit thinking and response segments) and ITMs (answers-only). We introduce RAIN-Merging (Reasoning-Aware Instruction-attention guided Null-space projection Merging), a gradient-free method that integrates instruction following while preserving thinking format and reasoning performance. First, with a small reasoning calibration set, we project the ITM task vector onto the null space of forward features at thinking special tokens, which preserves the LRM’s structured reasoning mechanisms. Second, using a small instruction calibration set, we estimate instruction attention to derive module-specific scaling that amplifies instruction-relevant components and suppresses leakage. Across four instruction-following benchmarks and nine reasoning & general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality. The gains are consistent across model scales and architectures, translating to improved performance in agent settings.
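The null-space projection idea can be sketched for a single linear layer: given input features collected at the thinking tokens, remove from the task vector every component that would change the layer's output on those features. This is a generic linear-algebra sketch using the pseudoinverse, assuming a layer of the form y = W x; the shapes, names, and random data are illustrative, and the paper's per-module procedure may differ.

```python
import numpy as np


def project_to_null_space(delta_w, features):
    """Project a weight update so it has no effect on the given inputs.

    delta_w: (d_out, d_in) task vector for a layer computing y = W x.
    features: (n, d_in) layer inputs recorded at thinking tokens.
    Returns delta_w' with delta_w' @ x = 0 for every row x of `features`.
    """
    d_in = features.shape[1]
    # pinv(F) @ F is the orthogonal projector onto the row space of F.
    row_space = np.linalg.pinv(features) @ features
    return delta_w @ (np.eye(d_in) - row_space)


rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8))   # hypothetical calibration features
delta_w = rng.normal(size=(4, 8))    # hypothetical ITM task vector
projected = project_to_null_space(delta_w, features)
```

The projected update leaves outputs on the calibration features unchanged while still acting on directions orthogonal to them, which is where instruction-following changes can be merged in.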

[489] Relatron: Automating Relational Machine Learning over Relational Databases

Zhikai Chen, Han Xie, Jian Zhang, Jiliang Tang, Xiang Song, Huzefa Rangwala

Main category: cs.LG

TL;DR: Unable to analyze paper 2602.22552 due to HTTP 429 error when fetching from arXiv API

[490] Multilingual Safety Alignment Via Sparse Weight Editing

Jiaming Liang, Zhaoxin Wang, Handing Wang

Main category: cs.LG

TL;DR: Training-free cross-lingual safety alignment for LLMs using sparse weight editing to map harmful representations from low-resource languages to safety subspaces of high-resource languages.

Details

Motivation: LLMs show significant safety disparities across languages, with low-resource languages often bypassing safety guardrails established for high-resource languages like English. Existing multilingual alignment methods are computationally expensive and require scarce multilingual safety data.

Method: Proposes a training-free alignment framework based on Sparse Weight Editing. Identifies that safety capabilities are localized within a sparse set of safety neurons, and formulates cross-lingual alignment as a constrained linear transformation. Derives a closed-form solution to optimally map harmful representations of low-resource languages to robust safety subspaces of high-resource languages while preserving general utility via null-space projection constraint.

Result: Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate substantial reduction in Attack Success Rate (ASR) in low-resource languages with negligible impact on general reasoning capabilities, achieved with a single, data-efficient calculation.

Conclusion: The proposed training-free alignment framework effectively addresses cross-lingual safety disparities in LLMs by leveraging sparse weight editing and linear transformations, offering a computationally efficient alternative to existing multilingual alignment methods.

Abstract: Large Language Models (LLMs) exhibit significant safety disparities across languages, with low-resource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and dependent on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint. Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate that our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities, all achieved with a single, data-efficient calculation.

[491] Autoregressive Visual Decoding from EEG Signals

Sicheng Dai, Hongwang Xiao, Shan Yu, Qiwei Ye

Main category: cs.LG

TL;DR: AVDE is a lightweight autoregressive framework for decoding visual information from EEG signals using contrastive learning and multi-scale token prediction.

Details

Motivation: Current EEG-to-image decoding methods face modality gap challenges, require complex multi-stage adaptation, and suffer from computational overhead from diffusion models, limiting practical BCI applications.

Method: 1) Fine-tune pre-trained EEG model (LaBraM) via contrastive learning to align EEG and image representations. 2) Use autoregressive generative framework with “next-scale prediction”: encode images into multi-scale token maps using VQ-VAE, train transformer to predict finer-scale tokens starting from EEG embeddings as coarsest representation.

Result: Outperforms previous state-of-the-art methods in image retrieval and reconstruction tasks on two datasets, using only 10% of parameters. Visualization shows generative process reflects hierarchical nature of human visual perception.

Conclusion: Autoregressive models offer efficient and interpretable tools for practical BCI applications, bridging EEG-image modality gap with lightweight architecture.

Abstract: Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limits their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a “next-scale prediction” strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.
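The contrastive alignment step can be illustrated with a symmetric InfoNCE loss over a batch of paired embeddings, in the style popularized by CLIP. The abstract only says "contrastive learning", so the specific loss, the temperature value, and the random stand-in embeddings below are assumptions.

```python
import numpy as np


def info_nce(eeg, img, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `eeg` is paired with row i of `img`."""
    eeg = eeg / np.linalg.norm(eeg, axis=1, keepdims=True)
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    logits = eeg @ img.T / temperature  # cosine similarities, scaled

    def cross_entropy_on_diagonal(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the EEG-to-image and image-to-EEG directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))            # stand-in embeddings
loss_aligned = info_nce(emb, emb)         # perfectly matched pairs
loss_mismatched = info_nce(emb, emb[::-1])  # every pair is wrong
```

The loss is low when each EEG embedding is closest to its own image embedding and high when the pairing is scrambled, which is the signal that pulls the two modalities into a shared space.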

[492] Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation

Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, Lijun Li

Main category: cs.LG

TL;DR: A two-stage framework for stable adaptive thinking in Large Reasoning Models that reduces overthinking on simple queries while preserving accuracy on complex ones through hybrid fine-tuning and adaptive reinforcement learning.

Details

Motivation: Large reasoning models often exhibit overthinking behavior on low-complexity queries, wasting computational resources. Existing approaches suffer from unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors.

Method: Two-stage framework: 1) Hybrid Fine-Tuning to expose models to both thinking and no-thinking behaviors for well-conditioned initialization; 2) Adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under reasoning-length heterogeneity.

Result: Experiments on Qwen2.5-1.5B and 7B show consistent improvements: up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. The approach demonstrates robustness across varying problem difficulties and out-of-distribution tasks.

Conclusion: The proposed framework effectively addresses overthinking in large reasoning models through stable adaptive thinking, achieving better accuracy-efficiency trade-offs and robustness compared to existing methods.

Abstract: Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.

[493] TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion

Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen, Muhan Zhang

Main category: cs.LG

TL;DR: TabDLM: A unified diffusion framework for generating heterogeneous tabular data with both structured features and free-form text fields, addressing limitations of existing diffusion and LLM-based methods.

Details

Motivation: Real-world tabular datasets increasingly contain free-form text fields alongside structured data, but generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing methods either struggle with text quality (diffusion models) or distort numerical values (LLM-based methods).

Method: TabDLM uses a joint numerical-language diffusion model built on masked diffusion language models (MDLMs). It models textual and categorical features through masked diffusion, numerical features with continuous diffusion via learned specialized numeric tokens embedding, and captures cross-modality interactions with bidirectional attention in a single model.

Result: Extensive experiments on diverse benchmarks demonstrate TabDLM’s effectiveness compared to strong diffusion- and LLM-based baselines for free-form tabular data generation.

Conclusion: TabDLM provides a unified framework for generating heterogeneous tabular data with both structured features and free-form text, overcoming limitations of existing approaches through joint numerical-language diffusion modeling.

Abstract: Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical–language diffusion model built on masked diffusion language models (MDLMs). TabDLM models textual and categorical features through masked diffusion, while modeling numerical features with a continuous diffusion process through learned specialized numeric tokens embedding; bidirectional attention then captures cross-modality interactions within a single model. Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.

[494] Operationalizing Fairness: Post-Hoc Threshold Optimization Under Hard Resource Limits

Moirangthem Tiken Singh, Amit Kalita, Sapam Jitu Singh

Main category: cs.LG

TL;DR: A threshold optimization framework that balances safety, efficiency, and equity under strict capacity constraints with a single global decision threshold for legal compliance.

Details

Motivation: Machine learning in high-stakes domains requires balancing predictive safety and algorithmic fairness, but existing fairness interventions often assume unconstrained resources and use group-specific thresholds that violate anti-discrimination regulations.

Method: Post-hoc, model-agnostic threshold optimization framework with parameterized ethical loss function and bounded decision rule that mathematically prevents intervention volumes from exceeding available resources, enforcing a single global decision threshold for legal compliance.

Result: Capacity constraints dominate ethical priorities, with strict resource limits determining final deployed threshold in over 80% of tested configurations. Under 25% capacity limit, framework maintains high risk identification (recall 0.409-0.702) while standard unconstrained fairness heuristics collapse to near-zero utility.

Conclusion: Theoretical fairness objectives must be explicitly subordinated to operational capacity limits to remain deployable. The framework provides a practical, legally compliant mechanism for navigating ethical trade-offs in resource-constrained environments by decoupling predictive scoring from policy evaluation.

Abstract: The deployment of machine learning in high-stakes domains requires a balance between predictive safety and algorithmic fairness. However, existing fairness interventions often assume unconstrained resources and employ group-specific decision thresholds that violate anti-discrimination regulations. We introduce a post-hoc, model-agnostic threshold optimization framework that jointly balances safety, efficiency, and equity under strict and hard capacity constraints. To ensure legal compliance, the framework enforces a single, global decision threshold. We formulated a parameterized ethical loss function coupled with a bounded decision rule that mathematically prevents intervention volumes from exceeding the available resources. Analytically, we prove the key properties of the deployed threshold, including local monotonicity with respect to ethical weighting and the formal identification of critical capacity regimes. We conducted extensive experimental evaluations on diverse high-stakes datasets. The principal results demonstrate that capacity constraints dominate ethical priorities; the strict resource limit determines the final deployed threshold in over 80% of the tested configurations. Furthermore, under a restrictive 25% capacity limit, the proposed framework successfully maintains high risk identification (recall ranging from 0.409 to 0.702), whereas standard unconstrained fairness heuristics collapse to a near-zero utility. We conclude that theoretical fairness objectives must be explicitly subordinated to operational capacity limits to remain in deployment. By decoupling predictive scoring from policy evaluation and strictly bounding intervention rates, this framework provides a practical and legally compliant mechanism for stakeholders to navigate unavoidable ethical trade-offs in resource-constrained environments.
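
Illustrative sketch (not from the paper): the bounded decision rule described above can be read as clamping whatever threshold the ethical loss prefers so that the number of flagged cases never exceeds capacity. The paper's loss function is not given in the abstract, so `preferred_threshold` below is a hypothetical stand-in for its minimizer, and the function name and return values are assumptions made for illustration.

```python
def capacity_bounded_threshold(scores, preferred_threshold, capacity_fraction):
    """Return (deployed_threshold, capacity_is_binding) for one global threshold.

    A case is flagged when its score is strictly greater than the threshold.
    `preferred_threshold` is a hypothetical stand-in for the threshold that an
    ethical loss would choose if resources were unlimited.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    max_flagged = int(capacity_fraction * len(scores))  # hard resource limit
    ranked = sorted(scores, reverse=True)
    if max_flagged >= len(ranked):
        capacity_threshold = float("-inf")  # capacity covers every case
    else:
        # Flagging only scores above the (max_flagged + 1)-th highest score
        # guarantees that at most max_flagged cases are flagged.
        capacity_threshold = ranked[max_flagged]
    deployed = max(preferred_threshold, capacity_threshold)
    return deployed, capacity_threshold > preferred_threshold


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
threshold, binding = capacity_bounded_threshold(scores, 0.35, 0.25)
flagged = [s for s in scores if s > threshold]
```

In this toy run the 25% capacity limit, not the preferred threshold, decides the deployed value, which mirrors the abstract's observation that capacity often dominates ethical weighting. Ties at the cut-off are left unflagged here, a conservative choice that keeps the bound strict.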

[495] pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training

Wenzheng Zhang, Bingzheng Liu, Yang Hu, Xiaoying Bai, Wentao Zhang, Bin Cui

Main category: cs.LG

TL;DR: pQuant introduces a novel quantization-aware training method that splits linear layers into two specialized branches (1-bit dominant + high-precision compact) to address parameter democratization in extremely low-bit LLMs.

Details

Motivation: Existing quantization-aware training methods for extremely low-bit LLMs (sub 2-bit) fail to achieve satisfactory accuracy and scalability due to the "parameter democratization effect" where all parameters become homogenized, severely limiting expressivity.

Method: pQuant decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. It uses tailored feature scaling to guide sensitive parameters to the high-precision branch and extends this branch into multiple, sparsely-activated experts for efficient capacity scaling.

Result: Extensive experiments show pQuant achieves state-of-the-art performance in extremely low-bit quantization.

Conclusion: pQuant effectively addresses the parameter democratization bottleneck in extremely low-bit LLM quantization through specialized branch splitting and feature scaling, enabling better accuracy and scalability for edge deployment.

Abstract: Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment. However, existing methods still fail to achieve satisfactory accuracy and scalability. In this work, we identify a parameter democratization effect as a key bottleneck: the sensitivity of all parameters becomes homogenized, severely limiting expressivity. To address this, we propose pQuant, a method that decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. Through tailored feature scaling, we explicitly guide the model to allocate sensitive parameters to the high-precision branch. Furthermore, we extend this branch into multiple, sparsely-activated experts, enabling efficient capacity scaling. Extensive experiments indicate our pQuant achieves state-of-the-art performance in extremely low-bit quantization.
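
Illustrative sketch (not from the paper): one way to picture the two-branch split is a linear layer whose output concatenates a wide 1-bit branch with a narrow full-precision branch. The abstract does not say how the branches are combined, so concatenation is an assumption here, as is the sign-plus-mean-absolute-value binarization (a common scheme for 1-bit weights). The tailored feature scaling and the sparsely-activated experts are omitted, and all function names are hypothetical.

```python
def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def binarize(weights):
    """Map latent weights to {-1, +1} plus one shared scale (mean |w|)."""
    magnitudes = [abs(w) for row in weights for w in row]
    scale = sum(magnitudes) / len(magnitudes)
    signs = [[1.0 if w >= 0 else -1.0 for w in row] for row in weights]
    return signs, scale


def decoupled_linear(x, w_binary_latent, w_high_precision):
    """Forward pass: 1-bit branch outputs followed by high-precision outputs."""
    signs, scale = binarize(w_binary_latent)
    y_one_bit = [scale * v for v in matvec(signs, x)]  # dominant, cheap branch
    y_precise = matvec(w_high_precision, x)            # compact, full precision
    return y_one_bit + y_precise                       # concatenation (assumed)


x = [1.0, 2.0]
w_latent = [[0.5, -1.5], [-1.0, 1.0]]  # two 1-bit output features
w_precise = [[0.25, 0.5]]              # one high-precision output feature
y = decoupled_linear(x, w_latent, w_precise)
```

The point of the sketch is only the cost asymmetry: the first branch stores one bit per weight plus a single scale, while the small second branch keeps full-precision values for the few parameters that need them.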

[496] S2O: Early Stopping for Sparse Attention via Online Permutation

Yu Zhang, Songwei Liu, Chenqian Yan, Sheng Lin, Beichen Ning, Fangmin Chen, Xing Wang

Main category: cs.LG

TL;DR: S2O introduces early stopping for sparse attention via online permutation to overcome sparsity ceilings, enabling more efficient long-context inference by loading non-contiguous tokens and terminating computation on low-contribution blocks.

Details

Motivation: Quadratic attention scaling limits long-context inference. Existing block-granularity sparsification has intrinsic sparsity ceilings, making further improvements difficult even with careful engineering.

Method: S2O performs early stopping for sparse attention via online permutation inspired by memory systems. It factorizes FlashAttention execution to load non-contiguous tokens, transforms explicit permutation into online index-guided loading policy, and introduces early-stopping rule that terminates computation when block scores fall below threshold.

Result: On Llama-3.1-8B with 128K context: reduces single-operator MSE by 3.82× at matched sparsity, reduces prefill compute density by 3.31× at matched MSE, preserves end-to-end accuracy, achieves 7.51× attention speedup and 3.81× end-to-end speedup.

Conclusion: S2O substantially raises practical sparsity ceiling for attention mechanisms, enabling more efficient long-context inference through importance-guided online permutation and early stopping.

Abstract: Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present S2O, which performs early stopping for sparse attention via online permutation. Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks. Building on this importance-guided online permutation for loading, S2O further introduces an early-stopping rule: computation proceeds from high to low importance; once the current block score falls below a threshold, S2O terminates early and skips the remaining low-contribution blocks, thereby increasing effective sparsity and reducing computation under a controlled error budget. As a result, S2O substantially raises the practical sparsity ceiling. On Llama-3.1-8B under a 128K context, S2O reduces single-operator MSE by 3.82× at matched sparsity, and reduces prefill compute density by 3.31× at matched MSE; meanwhile, it preserves end-to-end accuracy and achieves 7.51× attention and 3.81× end-to-end speedups.
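
Illustrative sketch (not from the paper): the early-stopping rule can be pictured as walking the key/value blocks in descending importance and stopping at the first block whose score drops below a threshold. How S2O computes block scores and loads non-contiguous tokens inside the FlashAttention kernel is not described in enough detail here to reproduce, so `block_scores` is taken as given and the function names are hypothetical.

```python
def select_blocks_with_early_stop(block_scores, threshold):
    """Return block indices to compute, visiting blocks from high to low score."""
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i], reverse=True)
    kept = []
    for idx in order:
        if block_scores[idx] < threshold:
            break  # every remaining block scores lower still, so skip them all
        kept.append(idx)
    return kept


def effective_sparsity(num_blocks, kept):
    """Fraction of blocks that were skipped."""
    return 1.0 - len(kept) / num_blocks


scores = [0.05, 0.6, 0.01, 0.3]  # assumed per-block importance estimates
kept = select_blocks_with_early_stop(scores, threshold=0.1)
sparsity = effective_sparsity(len(scores), kept)
```

Because the blocks are visited in sorted order, a single comparison is enough to discard the whole tail, which is what lets the threshold act as an error budget rather than a fixed block count.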

[497] ContextRL: Enhancing MLLM’s Knowledge Discovery Efficiency with Context-Augmented RL

Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu, Tianke Zhang, Haonan fan, Kaiyu Jiang, Changyi Liu, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

Main category: cs.LG

TL;DR: ContextRL is a novel reinforcement learning framework that uses context augmentation to improve reward model accuracy and mitigate reward hacking in vision-language models.

Details

Motivation: The paper addresses two key bottlenecks in reinforcement learning with verifiable rewards (RLVR) for multimodal LLMs: (1) Identifiability - difficulty in distinguishing correct answers backed by high-quality reasoning from correct answers backed by low-quality reasoning, and (2) Reachability - challenges in guiding models to recover from incorrect responses.

Method: ContextRL introduces two main innovations: 1) Providing full reference solutions as context to reward models for fine-grained process verification, and 2) A multi-turn sampling strategy where reward models generate mistake reports for failed attempts to guide policy recovery from negative samples.

Result: Experimental results on 11 perception and reasoning benchmarks show ContextRL significantly improves knowledge discovery efficiency. Notably, it enables Qwen3-VL-8B to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking.

Conclusion: The framework demonstrates the significant potential of contextual information for improving reward model accuracy and provides valuable insights about reward hacking for future RLVR research in multimodal settings.

Abstract: We propose ContextRL, a novel framework that leverages context augmentation to overcome these bottlenecks. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy where the reward model generates mistake reports for failed attempts, guiding the policy to “recover” correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and documents the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.

[498] IBCircuit: Towards Holistic Circuit Discovery with Information Bottleneck

Tian Bian, Yifan Niu, Chaohao Yuan, Chengzhi Piao, Bingzhe Wu, Long-Kai Huang, Yu Rong, Tingyang Xu, Hong Cheng, Jia Li

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval error

Method: Unable to determine method due to retrieval error

Result: Unable to determine results due to retrieval error

Conclusion: Unable to determine conclusion due to retrieval error

Abstract: Failed to fetch summary for 2602.22581: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22581&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[499] Transformers converge to invariant algorithmic cores

Joshua S. Schiffman

Main category: cs.LG

TL;DR: Transformer models converge to compact algorithmic cores - low-dimensional subspaces essential for task performance - that persist across training runs and model scales, revealing shared computational structures.

Details

Motivation: Despite sophisticated capabilities of large language models, understanding their internal workings remains challenging due to many possible weight configurations implementing the same function. The paper aims to identify which internal structures reflect essential computation versus training accidents.

Method: Extracts algorithmic cores - compact subspaces necessary and sufficient for task performance. Analyzes independently trained transformers, Markov-chain transformers, modular-addition transformers, and GPT-2 language models to identify invariant structures across training runs and scales.

Result: Independently trained transformers converge to the same cores despite different weights. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate. GPT-2 governs subject-verb agreement through a single axis that inverts grammatical number when flipped.

Conclusion: Transformer computations are organized around compact, shared algorithmic structures that persist across training runs and scales. Mechanistic interpretability should target these computational invariants rather than implementation-specific details.

Abstract: Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants – the computational essence – rather than implementation-specific details.

[500] Moral Preferences of LLMs Under Directed Contextual Influence

Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov

Main category: cs.LG

TL;DR: LLMs show significant moral decision shifts under contextual influences in trolley-problem scenarios, with baseline preferences being poor predictors of steerability and influences sometimes backfiring.

Details

Motivation: Current moral benchmarks for LLMs use context-free prompts assuming stable preferences, but real deployment includes contextual signals that may steer decisions. The paper aims to study how directed contextual influences reshape moral decisions in trolley-problem settings.

Method: Introduces a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage. For each demographic factor, applies matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response.

Result: Four key findings: (1) contextual influences often significantly shift decisions, even when only superficially relevant; (2) baseline preferences are poor predictors of directional steerability; (3) influences can backfire, with models claiming neutrality while their choices still shift, sometimes in the opposite direction; (4) reasoning reduces average sensitivity but amplifies the effect of biased few-shot examples.

Conclusion: Moral evaluations should be extended with controlled, direction-flipped context manipulations to better characterize model behavior, as contextual influences significantly impact LLM moral decisions in ways not captured by current benchmarks.

Abstract: Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.

[501] Mitigating Membership Inference in Intermediate Representations via Layer-wise MIA-risk-aware DP-SGD

Jiayang Meng, Tao Huang, Chen Hou, Guolong Zheng, Hong Chen

Main category: cs.LG

TL;DR: Paper 2602.22611: Unable to fetch summary due to HTTP 429 error (rate limiting).

Details

Motivation: Unknown - abstract not available due to rate limiting error.

Method: Unknown - abstract not available due to rate limiting error.

Result: Unknown - abstract not available due to rate limiting error.

Conclusion: Unknown - abstract not available due to rate limiting error.

Abstract: Failed to fetch summary for 2602.22611: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22611&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[502] NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion

Hung-Hsuan Chen

Main category: cs.LG

TL;DR: NoRA introduces a non-linear rank adaptation method that breaks the linear ceiling of LoRA by using SiLU gating and structural dropout for manifold expansion, achieving better performance at lower ranks.

Details

Motivation: LoRA faces a critical "linear ceiling" in complex reasoning tasks where increasing rank yields diminishing returns due to intrinsic linear constraints, limiting its effectiveness in parameter-efficient fine-tuning.

Method: NoRA (Non-linear Rank Adaptation) is a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion, breaking the linear barrier of traditional LoRA methods.

Result: NoRA at rank 64 outperforms LoRA at rank 512 on SlimOrca benchmark (PPL 3.89 vs 3.90), and achieves significantly better perplexity on MathInstruct (1.97 vs 2.07). SVD analysis shows NoRA activates dormant tail of singular value spectrum.

Conclusion: NoRA effectively breaks the linear ceiling of LoRA through non-linear adaptation, demonstrating superior spectral efficiency and preventing rank collapse in complex reasoning tasks.

Abstract: Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a critical “linear ceiling” in complex reasoning tasks: simply increasing the rank yields diminishing returns due to intrinsic linear constraints. We introduce NoRA (Non-linear Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. On the SlimOrca benchmark, NoRA breaks this linear barrier: at rank 64 (PPL 3.89) it outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency. This advantage generalizes to mathematical reasoning, where NoRA achieves a perplexity of 1.97 on MathInstruct, significantly surpassing LoRA’s saturation point of 2.07. Mechanism analysis via Singular Value Decomposition (SVD) confirms that NoRA activates the dormant tail of the singular value spectrum, effectively preventing the rank collapse observed in linear methods.
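
Illustrative sketch (not from the paper): the "linear ceiling" follows from the fact that a LoRA update B(Ax) is a linear map of rank at most r, whatever the rank budget. The abstract does not give NoRA's exact formula, so the gated variant below is a hypothetical reading that simply places a SiLU between the two low-rank factors; structural dropout is omitted and all names are assumptions.

```python
import math


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def lora_delta(x, A, B):
    """Standard LoRA update: a purely linear map B(Ax)."""
    return matvec(B, matvec(A, x))


def gated_delta(x, A, B):
    """Hypothetical non-linear variant: SiLU applied in the rank-r bottleneck."""
    return matvec(B, [silu(h) for h in matvec(A, x)])


A = [[1.0, 0.0]]    # rank-1 down-projection (2 -> 1)
B = [[1.0], [2.0]]  # up-projection (1 -> 2)
x1, x2 = [1.0, 0.0], [2.0, 0.0]
x_sum = [a + b for a, b in zip(x1, x2)]

# The linear update is additive; the gated one is not.
linear_gap = lora_delta(x_sum, A, B)[0] - (lora_delta(x1, A, B)[0] + lora_delta(x2, A, B)[0])
gated_gap = gated_delta(x_sum, A, B)[0] - (gated_delta(x1, A, B)[0] + gated_delta(x2, A, B)[0])
```

Here `linear_gap` is zero while `gated_gap` is not, which is the minimal sense in which a non-linearity lets the same rank budget represent functions that no linear low-rank update can.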

[503] Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA

Hai Huang, Yann LeCun, Randall Balestriero

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2602.22617: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22617&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[504] Tackling Privacy Heterogeneity in Differentially Private Federated Learning

Ruichen Xu, Ying-Jun Angela Zhang, Jianwei Huang

Main category: cs.LG

TL;DR: Privacy-aware client selection strategy for differentially private federated learning that addresses privacy heterogeneity among clients to improve model accuracy.

Details

Motivation: Existing DP-FL approaches assume uniform privacy budgets across clients, but real-world scenarios have varying privacy requirements. Privacy heterogeneity challenges conventional client selection strategies that can't distinguish between high-quality updates and noise-heavy updates from strictly private clients.

Method: Established theoretical foundation with convergence analysis quantifying privacy heterogeneity impact. Proposed privacy-aware client selection strategy formulated as convex optimization problem that adaptively adjusts selection probabilities to minimize training error.

Result: Achieves up to 10% improvement in test accuracy on CIFAR-10 compared to existing baselines under heterogeneous privacy budgets.

Conclusion: Demonstrates importance of incorporating privacy heterogeneity into client selection for practical and effective federated learning.

Abstract: Differentially private federated learning (DP-FL) enables clients to collaboratively train machine learning models while preserving the privacy of their local data. However, most existing DP-FL approaches assume that all clients share a uniform privacy budget, an assumption that does not hold in real-world scenarios where privacy requirements vary widely. This privacy heterogeneity poses a significant challenge: conventional client selection strategies, which typically rely on data quantity, cannot distinguish between clients providing high-quality updates and those introducing substantial noise due to strict privacy constraints. To address this gap, we present the first systematic study of privacy-aware client selection in DP-FL. We establish a theoretical foundation by deriving a convergence analysis that quantifies the impact of privacy heterogeneity on training error. Building on this analysis, we propose a privacy-aware client selection strategy, formulated as a convex optimization problem, that adaptively adjusts selection probabilities to minimize training error. Extensive experiments on benchmark datasets demonstrate that our approach achieves up to a 10% improvement in test accuracy on CIFAR-10 compared to existing baselines under heterogeneous privacy budgets. These results highlight the importance of incorporating privacy heterogeneity into client selection for practical and effective federated learning.

[505] Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning

Qin-Wen Luo, Sheng Ren, Xiang Chen, Rui Liu, Jun Fang, Naiqiang Tan, Sheng-Jun Huang

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2602.22642: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22642&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[506] InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross

Main category: cs.LG

TL;DR: InnerQ: Hardware-aware KV-cache quantization for LLMs that groups over inner dimension for better GPU alignment, achieving up to 22% speedup over prior methods while maintaining accuracy.

Details

Motivation: The KV cache in large language models becomes a memory bottleneck during long-sequence generation, dominating memory footprint as it scales with sequence length. Previous quantization methods focus on compression but don't fully optimize for hardware efficiency.

Method: InnerQ uses group-wise quantization over the inner dimension (instead of outer dimension) to align with vector-matrix multiplication patterns, enabling scale factor reuse across GPU compute units. It includes hybrid quantization (symmetric/asymmetric per group), high-precision windows for recent and attention sink tokens, and per-channel normalization of key cache computed during prefill.

Result: Achieves up to 22% speedup over previous KV cache quantization methods and up to 88% over half-precision vector-matrix multiplication. Maintains GSM8K few-shot performance comparable to non-quantized KV caches while surpassing prior quantization methods.

Conclusion: InnerQ demonstrates that hardware-aware KV cache quantization can significantly reduce decode latency without sacrificing model accuracy, making it a practical solution for efficient long-sequence generation in LLMs.

Abstract: Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Previous work proposed quantization methods that are focused on compressing the KV cache while maintaining its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy. InnerQ applies group-wise quantization while grouping the cache matrices over their inner dimension. Unlike previous work that groups over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding up to 22% speedup over previous work and up to 88% over half-precision vector-matrix multiplication. To preserve fidelity under aggressive compression, InnerQ incorporates (i) hybrid quantization, selecting symmetric or asymmetric quantization per group based on local statistics; (ii) high-precision windows for both the most recent tokens and the attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the query to avoid runtime overhead. Our evaluation experiments on Llama models show that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
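
Illustrative sketch (not from the paper): grouping along the inner (reduction) dimension means each group of quantized codes shares one scale, so a dot product can accumulate integer codes within a group and apply the scale once per group. The code below shows only that idea with symmetric quantization on a single cached vector; the 4-bit width, the group size, and the function names are assumptions, and the hybrid symmetric/asymmetric choice, high-precision windows, key normalization, and GPU kernel are all omitted.

```python
def quantize_groups(row, group_size, bits=4):
    """Symmetric group-wise quantization along the reduction axis.

    Returns a list of (scale, integer_codes) pairs, one per group.
    """
    qmax = 2 ** (bits - 1) - 1
    groups = []
    for start in range(0, len(row), group_size):
        chunk = row[start:start + group_size]
        scale = max(abs(v) for v in chunk) / qmax or 1.0  # avoid a zero scale
        codes = [round(v / scale) for v in chunk]
        groups.append((scale, codes))
    return groups


def dot_with_quantized(query, groups, group_size):
    """Dot product that dequantizes once per group instead of once per element."""
    total = 0.0
    for g, (scale, codes) in enumerate(groups):
        q_chunk = query[g * group_size:(g + 1) * group_size]
        partial = sum(q * c for q, c in zip(q_chunk, codes))  # integer codes only
        total += scale * partial                              # one scale per group
    return total


cached_row = [7.0, -3.0, 0.5, 0.25]  # one cached vector along the inner dimension
query = [1.0, 1.0, 1.0, 1.0]
groups = quantize_groups(cached_row, group_size=2)
approx = dot_with_quantized(query, groups, group_size=2)
exact = sum(q * v for q, v in zip(query, cached_row))
```

The approximation error stays within half a quantization step per element, and the number of scale multiplications equals the number of groups rather than the number of elements, which is the reuse the abstract credits for faster dequantization.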

[507] MUG: Meta-path-aware Universal Heterogeneous Graph Pre-Training

Lianze Shan, Jitao Zhao, Dongxiao He, Yongqi Huang, Zhiyong Feng, Weixiong Zhang

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2602.22645: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22645&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[508] LEDA: Latent Semantic Distribution Alignment for Multi-domain Graph Pre-training

Lianze Shan, Jitao Zhao, Dongxiao He, Siqi Liu, Jiaxu Cui, Weixiong Zhang

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Unable to determine method due to API rate limiting preventing access to paper content

Result: Unable to determine results due to API rate limiting preventing access to paper content

Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content

Abstract: Failed to fetch summary for 2602.22660: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22660&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[509] Forecasting Antimicrobial Resistance Trends Using Machine Learning on WHO GLASS Surveillance Data: A Retrieval-Augmented Generation Approach for Policy Decision Support

Md Tanvir Hasan Turja

Main category: cs.LG

TL;DR: Machine learning framework for forecasting antimicrobial resistance trends using WHO GLASS data, with XGBoost achieving best performance and a RAG system for policy decision support.

Details

Motivation: Antimicrobial resistance is a global crisis causing millions of deaths annually. While WHO GLASS provides surveillance data, few studies have applied machine learning to forecast population-level resistance trends from this standardized data.

Method: Two-component framework: 1) Benchmark six ML models (Naive, Linear Regression, Ridge Regression, XGBoost, LightGBM, LSTM) on 5,909 WHO GLASS observations across six regions (2021-2023); 2) RAG pipeline combining ChromaDB vector store of WHO policy documents with locally deployed Phi-3 Mini LLM for policy decision support.

Result: XGBoost achieved best performance with test MAE of 7.07% and R-squared of 0.854, outperforming naive baseline by 83.1%. Prior-year resistance rate was dominant predictor (50.5% importance). Regional MAE ranged from 4.16% (European Region) to 10.14% (South-East Asia Region). RAG system produced source-attributed, hallucination-constrained policy answers.

Conclusion: The framework successfully forecasts AMR trends and provides evidence-grounded policy decision support, demonstrating the value of ML approaches for global health surveillance and policy-making.

Abstract: Antimicrobial resistance (AMR) is a growing global crisis projected to cause 10 million deaths per year by 2050. While the WHO Global Antimicrobial Resistance and Use Surveillance System (GLASS) provides standardized surveillance data across 44 countries, few studies have applied machine learning to forecast population-level resistance trends from this data. This paper presents a two-component framework for AMR trend forecasting and evidence-grounded policy decision support. We benchmark six models – Naive, Linear Regression, Ridge Regression, XGBoost, LightGBM, and LSTM – on 5,909 WHO GLASS observations across six WHO regions (2021-2023). XGBoost achieved the best performance with a test MAE of 7.07% and R-squared of 0.854, outperforming the naive baseline by 83.1%. Feature importance analysis identified the prior-year resistance rate as the dominant predictor (50.5% importance), while regional MAE ranged from 4.16% (European Region) to 10.14% (South-East Asia Region). We additionally implemented a Retrieval-Augmented Generation (RAG) pipeline combining a ChromaDB vector store of WHO policy documents with a locally deployed Phi-3 Mini language model, producing source-attributed, hallucination-constrained policy answers. Code and data are available at https://github.com/TanvirTurja

[510] Accelerating LLM Pre-Training through Flat-Direction Dynamics Enhancement

Shuchen Zhu, Rizhen Hu, Mingze Wang, Mou Sun, Xue Wang, Kun Yuan, Zaiwen Wen

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).

Details

Motivation: Cannot determine motivation due to inability to access paper content.

Method: Cannot determine method due to inability to access paper content.

Result: Cannot determine results due to inability to access paper content.

Conclusion: Cannot draw conclusions about paper content due to access limitations.

Abstract: Failed to fetch summary for 2602.22681: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22681&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[511] Switch-Hurdle: A MoE Encoder with AR Hurdle Decoder for Intermittent Demand Forecasting

Fabian Muşat, Simona Căbuz

Main category: cs.LG

TL;DR: Unable to analyze paper 2602.22685 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable

Method: Cannot determine method as abstract is unavailable

Result: Cannot determine results as abstract is unavailable

Conclusion: Cannot draw conclusion as paper content is inaccessible

Abstract: Failed to fetch summary for 2602.22685: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22685&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[512] Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

Hao Yu, Shuning Jia, Guanghao Li, Wenhao Jiang, Chun Yuan

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2602.22703: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22703&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[513] Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks

Vamshi Sunku Mohan, Kaustubh Gupta, Aneesha Das, Chandan Singh

Main category: cs.LG

TL;DR: The paper identifies activation subspace bottlenecks in Mamba SSM models using mechanistic interpretability tools and introduces a test-time steering intervention that improves performance by 8.27% on average across 5 SSMs and 6 benchmarks.

Details

Motivation: While state-space models (SSMs) offer efficient alternatives to transformers for language modeling, their interpretability and steerability remain underexplored. The authors aim to address this gap by investigating activation bottlenecks in Mamba-family SSM models.

Method: The authors use mechanistic interpretability tools to identify activation subspace bottlenecks in Mamba SSM models. They then introduce a test-time steering intervention that multiplies the activations of the identified bottlenecks by a scalar. They validate their findings by modifying these bottlenecks to create the Stable-Mamba architecture.

Result: The steering intervention improves performance by an average of 8.27% across 5 SSMs and 6 diverse benchmarks without task-specific tuning. The modified Stable-Mamba architecture achieves long-context performance gains when retrained from scratch.

Conclusion: The work demonstrates that SSM models have identifiable activation bottlenecks that can be targeted for performance improvement through simple interventions, and that architectural modifications based on these insights can yield better long-context performance.

Abstract: State-space models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifying activation subspace bottlenecks in the Mamba family of SSM models using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 5 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks are indeed hindering performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.
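
Illustrative sketch (not the authors' code): a scalar steering intervention on an activation subspace can be written as a rescaling of the component of a hidden vector that lies in that subspace. The sketch assumes the bottleneck is given as a set of orthonormal directions; how the paper actually identifies and parameterizes the bottleneck is not stated in the abstract, and the name `steer` is hypothetical.

```python
def steer(hidden, basis, alpha):
    """Scale the part of `hidden` lying in span(basis) by `alpha`.

    `basis` is assumed to be a list of orthonormal direction vectors.
    The component orthogonal to the subspace is left unchanged.
    """
    out = list(hidden)
    for direction in basis:
        coeff = sum(h * u for h, u in zip(hidden, direction))
        for i, u in enumerate(direction):
            # h' = h + (alpha - 1) * <h, u> * u
            out[i] += (alpha - 1.0) * coeff * u
    return out


if __name__ == "__main__":
    h = [1.0, 2.0, 3.0]
    axis_aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(steer(h, axis_aligned, alpha=2.0))
```

When the basis vectors are coordinate axes, this reduces to multiplying the selected activation entries by the scalar, which matches the description of the intervention most directly.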

[514] Set-based v.s. Distribution-based Representations of Epistemic Uncertainty: A Comparative Study

Kaizheng Wang, Yunjia Wang, Fabio Cuzzolin, David Moens, Hans Hallez, Siu Lun Chau

Main category: cs.LG

TL;DR: A comparative study of two second-order uncertainty representations in neural networks: distribution-based (posterior parameter distributions) vs set-based (credal sets), evaluated across multiple benchmarks to understand their relative merits.

Details

Motivation: There's confusion about the relative merits of two main approaches to modeling epistemic uncertainty in neural networks - distribution-based representations (using posterior parameter distributions) and set-based representations (using credal sets). These frameworks are often considered non-comparable due to different semantics, assumptions, and evaluation practices, making it unclear which approach is better for practical applications.

Method: The authors conducted a controlled comparative study where both representations were constructed from the same finite collection of predictive distributions generated by a shared neural network. This isolates representational effects from predictive accuracy differences. They evaluated each representation using 3 uncertainty measures across 8 benchmarks (including selective prediction and out-of-distribution detection), spanning 6 underlying predictive models with 10 independent runs per configuration.

Result: The study demonstrates that meaningful comparison between these seemingly non-comparable frameworks is both feasible and informative. The results provide insights into how second-order representation choices impact practical uncertainty-aware performance in neural networks.

Conclusion: Distribution-based and set-based uncertainty representations can be systematically compared despite their different theoretical foundations. The controlled methodology enables principled evaluation of how representation choices affect uncertainty quantification performance in practical applications.

Abstract: Epistemic uncertainty in neural networks is commonly modeled using two second-order paradigms: distribution-based representations, which rely on posterior parameter distributions, and set-based representations based on credal sets (convex sets of probability distributions). These frameworks are often regarded as fundamentally non-comparable due to differing semantics, assumptions, and evaluation practices, leaving their relative merits unclear. Empirical comparisons are further confounded by variations in the underlying predictive models. To clarify this issue, we present a controlled comparative study enabling principled, like-for-like evaluation of the two paradigms. Both representations are constructed from the same finite collection of predictive distributions generated by a shared neural network, isolating representational effects from predictive accuracy. Our study evaluates each representation through the lens of 3 uncertainty measures across 8 benchmarks, including selective prediction and out-of-distribution detection, spanning 6 underlying predictive models and 10 independent runs per configuration. Our results show that meaningful comparison between these seemingly non-comparable frameworks is both feasible and informative, providing insights into how second-order representation choices impact practical uncertainty-aware performance.
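
Illustrative sketch (not the authors' evaluation code): both representations can be built from the same finite collection of predictive distributions. The abstract does not name the three uncertainty measures used, so the ones below are assumptions chosen because they are common: an entropy decomposition for the distribution-based view (members treated as equally weighted samples) and per-class lower and upper probabilities for the set-based view (the credal set taken as the convex hull of the members). The function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def distribution_based(preds):
    """Treat the predictions as equally weighted samples of a second-order distribution."""
    m, k = len(preds), len(preds[0])
    mean = [sum(p[c] for p in preds) / m for c in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in preds) / m
    # The gap is the mutual information between the label and the ensemble member.
    return {"total": total, "aleatoric": aleatoric, "epistemic": total - aleatoric}


def set_based(preds):
    """Treat the predictions as vertices of a credal set (their convex hull)."""
    k = len(preds[0])
    # A linear functional attains its extremes at vertices, so min/max over members suffice.
    lower = [min(p[c] for p in preds) for c in range(k)]
    upper = [max(p[c] for p in preds) for c in range(k)]
    return {"lower": lower, "upper": upper,
            "max_width": max(u - l for l, u in zip(lower, upper))}


if __name__ == "__main__":
    disagreeing = [[0.9, 0.1], [0.1, 0.9]]
    print(distribution_based(disagreeing))
    print(set_based(disagreeing))
```

Because both functions consume the same list of predictions, any difference in downstream behavior comes from the representation rather than from the underlying predictor, which is the controlled setup the abstract describes.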

[515] KMLP: A Scalable Hybrid Architecture for Web-Scale Tabular Data Modeling

Mingming Zhang, Pengfei Shi, Zhiqing Xiao, Feng Zhao, Guandong Sun, Yulin Kang, Ruizhe Gao, Ningtao Wang, Xing Fu, Weiqiang Wang, Junbo Zhao

Main category: cs.LG

TL;DR: KMLP: A hybrid deep learning architecture combining Kolmogorov-Arnold Network front-end with Gated MLP backbone for scalable predictive modeling on web-scale tabular data with billions of instances and heterogeneous features.

Details

Motivation: Address scalability challenges in predictive modeling on web-scale tabular data with billions of instances and hundreds of heterogeneous numerical features, which exhibit anisotropy, heavy-tailed distributions, and non-stationarity. Traditional methods like Gradient Boosting Decision Trees face bottlenecks and require laborious manual feature engineering.

Method: Introduces KMLP, a hybrid deep architecture integrating a shallow Kolmogorov-Arnold Network (KAN) front-end with a Gated Multilayer Perceptron (gMLP) backbone. The KAN front-end uses learnable activation functions to automatically model complex non-linear transformations for each feature, while the gMLP backbone captures high-order feature interactions.

Result: Experiments on public benchmarks and an industrial dataset with billions of samples show KMLP achieves state-of-the-art performance. Advantages over baselines like GBDTs increase at larger scales, validating KMLP as a scalable deep learning paradigm for large-scale web tabular data.

Conclusion: KMLP provides an effective and scalable deep learning solution for web-scale tabular data, overcoming limitations of traditional methods and demonstrating superior performance especially at larger scales.

Abstract: Predictive modeling on web-scale tabular data with billions of instances and hundreds of heterogeneous numerical features faces significant scalability challenges. These features exhibit anisotropy, heavy-tailed distributions, and non-stationarity, creating bottlenecks for models like Gradient Boosting Decision Trees and requiring laborious manual feature engineering. We introduce KMLP, a hybrid deep architecture integrating a shallow Kolmogorov-Arnold Network (KAN) front-end with a Gated Multilayer Perceptron (gMLP) backbone. The KAN front-end uses learnable activation functions to automatically model complex non-linear transformations for each feature, while the gMLP backbone captures high-order interactions. Experiments on public benchmarks and an industrial dataset with billions of samples show KMLP achieves state-of-the-art performance, with advantages over baselines like GBDTs increasing at larger scales, validating KMLP as a scalable deep learning paradigm for large-scale web tabular data.

[516] Doubly Adaptive Channel and Spatial Attention for Semantic Image Communication by IoT Devices

Soroosh Miri, Sepehr Abolhasani, Shahrokh Farahmand, S. Mohammad Razavizadeh

Main category: cs.LG

TL;DR: Proposes DA-DJSCC, a doubly adaptive deep joint source-channel coding method with channel-wise and spatial attention modules for semantic communication in IoT networks, improving upon existing SNR-adaptive approaches.

Details

Motivation: IoT networks face challenges like limited bandwidth, computational constraints, and dynamic wireless conditions. While DJSCC enables semantic communication for images, training separate DNNs for different SNRs creates excessive overhead for small IoT devices. Existing SNR-adaptive approaches need improvement for better performance.

Method: Proposes DA-DJSCC with doubly adaptive channel-wise and spatial attention modules at both transmitter and receiver. These modules dynamically adjust to varying channel conditions and spatial feature importance, enabling robust feature extraction and semantic information recovery with a single training.

Result: Simulation results show DA-DJSCC significantly improves upon ADJSCC in several performance criteria while incurring only mild complexity increase, making it suitable for performance-demanding but low-complexity IoT networks.

Conclusion: DA-DJSCC is a desirable choice for semantic communication in IoT networks, offering improved performance over existing adaptive methods with manageable complexity overhead.

Abstract: Internet of Things (IoT) networks face significant challenges such as limited communication bandwidth, constrained computational and energy resources, and highly dynamic wireless channel conditions. Utilization of deep neural networks (DNNs) combined with semantic communication has emerged as a promising paradigm to address these limitations. Deep joint source-channel coding (DJSCC) has recently been proposed to enable semantic communication of images. Building upon the original DJSCC formulation, low-complexity attention-style architectures have been added to the DNNs for further performance enhancement. As a main hurdle, training these DNNs separately for various signal-to-noise ratios (SNRs) amounts to excessive storage or communication overhead, which cannot be sustained by small IoT devices. SNR Adaptive DJSCC (ADJSCC) has been proposed to train the DNNs once but feed the current SNR as part of the data to the channel-wise attention mechanism. We improve upon ADJSCC by simultaneously using doubly adaptive channel-wise and spatial attention modules at both transmitter and receiver. These modules dynamically adjust to varying channel conditions and spatial feature importance, enabling robust and efficient feature extraction and semantic information recovery. Simulation results corroborate that our proposed doubly adaptive DJSCC (DA-DJSCC) significantly improves upon ADJSCC in several performance criteria, while incurring a mild increase in complexity. These facts render DA-DJSCC a desirable choice for semantic communication in performance-demanding but low-complexity IoT networks.

[517] Multi-agent imitation learning with function approximation: Linear Markov games and beyond

Luca Viano, Till Freihaut, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi

Main category: cs.LG

TL;DR: Theoretical analysis of multi-agent imitation learning in linear Markov games with feature-based structure, showing improved sample complexity bounds and proposing practical deep MAIL algorithms.

Details

Motivation: Current multi-agent imitation learning (MAIL) suffers from large sample complexity due to state-action level concentrability coefficients. The authors aim to leverage linear structure in Markov games to reduce these requirements and develop more efficient algorithms.

Method: 1) Theoretical analysis showing feature-level concentrability coefficients can replace state-action level ones in linear Markov games. 2) Development of first computationally efficient interactive MAIL algorithm for linear Markov games with sample complexity depending only on feature dimension d. 3) Proposal of deep MAIL interactive algorithm building on theoretical insights.

Result: 1) Feature-level concentrability coefficients can be much smaller than state-action analogs when features are informative. 2) Interactive MAIL algorithm achieves sample complexity dependent only on feature dimension d. 3) Deep MAIL algorithm outperforms behavior cloning (BC) on games like Tic-Tac-Toe and Connect4.

Conclusion: The work provides important theoretical foundations for MAIL in structured environments and demonstrates practical benefits through deep learning implementations that outperform baseline methods on benchmark games.

Abstract: In this work, we present the first theoretical analysis of multi-agent imitation learning (MAIL) in linear Markov games where both the transition dynamics and each agent’s reward function are linear in some given features. We demonstrate that by leveraging this structure, it is possible to replace the state-action level “all policy deviation concentrability coefficient” (Freihaut et al., arXiv:2510.09325) with a concentrability coefficient defined at the feature level which can be much smaller than the state-action analog when the features are informative about states’ similarity. Furthermore, to circumvent the need for any concentrability coefficient, we turn to the interactive setting. We provide the first, computationally efficient, interactive MAIL algorithm for linear Markov games and show that its sample complexity depends only on the dimension of the feature map $d$. Building on these theoretical findings, we propose a deep MAIL interactive algorithm which clearly outperforms BC on games such as Tic-Tac-Toe and Connect4.

[518] Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching

Hiroki Matsutani, Naoki Matsuda, Naoto Sugiura

Main category: cs.LG

TL;DR: Distributed prompt caching for edge LLM inference using cooperative state sharing across devices with Bloom-filter-based catalog to reduce communication overhead

Details

Motivation: Local LLM inference on resource-constrained edge devices creates severe performance bottlenecks, necessitating optimization techniques to improve inference speed and efficiency

Method: Proposes distributed prompt caching with partial matching support and Bloom-filter-based catalog system to determine remote state availability and reduce unnecessary communication overhead

Result: Experiments with Gemma-3 270M model and MMLU dataset on Raspberry Pi Zero 2W show 93.12% reduction in TTFT and 50.07% reduction in TTLT on average

Conclusion: Distributed prompt caching with catalog-based state sharing significantly improves LLM inference performance on edge devices while managing communication overhead

Abstract: Since local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck, this paper proposes distributed prompt caching to enhance inference performance by cooperatively sharing intermediate processing states across multiple low-end edge devices. To fully utilize prompt similarity, our distributed caching mechanism also supports partial matching. As this approach introduces communication overhead associated with state sharing over a wireless network, we introduce a Bloom-filter-based data structure, referred to as a catalog, to determine whether a remote server possesses the desired internal states, thereby suppressing unnecessary communication. Experiments using the Gemma-3 270M model and the MMLU dataset on the Raspberry Pi Zero 2W platform demonstrate that the proposed approach reduces TTFT (Time to First Token) and TTLT (Time to Last Token) by 93.12% and 50.07% on average, respectively.
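
Illustrative sketch (not the authors' implementation): a Bloom filter lets a device check locally whether a peer might hold a cached state before paying for a network round trip. The sketch assumes that cached states are keyed by token prefixes at fixed chunk boundaries, which is one plausible way to support partial matching; the names `BloomCatalog`, `prefix_key`, and `longest_cached_prefix` and the parameters are hypothetical.

```python
import hashlib


class BloomCatalog:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions by salting one cryptographic hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def prefix_key(tokens):
    """Serialize a token prefix into a catalog key."""
    return " ".join(str(t) for t in tokens)


def longest_cached_prefix(tokens, catalog, chunk=4):
    """Length of the longest chunk-aligned prefix the peer might have cached."""
    best = 0
    for end in range(chunk, len(tokens) + 1, chunk):
        if catalog.might_contain(prefix_key(tokens[:end])):
            best = end
    return best


if __name__ == "__main__":
    catalog = BloomCatalog()
    cached_prompt = [11, 12, 13, 14, 15, 16, 17, 18]
    for end in (4, 8):
        catalog.add(prefix_key(cached_prompt[:end]))
    print(longest_cached_prefix(cached_prompt + [99, 98, 97, 96], catalog))
```

A positive answer only means the state may exist remotely, so the actual fetch must still handle a miss; a negative answer is definitive, which is what allows unnecessary requests to be skipped.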

[519] Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An

Main category: cs.LG

TL;DR: HGPO addresses context inconsistency in stepwise group-based RL for long-horizon agentic tasks by organizing steps into hierarchical groups based on historical context consistency and aggregating advantages adaptively.

Details

Motivation: Stepwise group-based RL methods suffer from context inconsistency where steps within the same group have different historical contexts, leading to biased advantage estimation and degraded policy optimization for long-horizon agentic tasks.

Method: HGPO organizes steps into multiple hierarchical groups based on historical context consistency, computes distinct advantages within each group, and aggregates them using an adaptive weighting scheme to achieve better bias-variance trade-off without extra models or rollouts.

Result: HGPO significantly outperforms existing agentic RL methods on ALFWorld and WebShop tasks using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct models under the same computational constraints.

Conclusion: HGPO effectively addresses context inconsistency in stepwise group-based RL, enabling more accurate advantage estimation and better policy optimization for long-horizon agentic tasks without additional computational overhead.

Abstract: Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address the issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts. Then, for each step, HGPO computes distinct advantages within each group and aggregates them with an adaptive weighting scheme. In this way, HGPO can achieve a favorable bias-variance trade-off in stepwise advantage estimation, without extra models or rollouts. Evaluations on two challenging agentic tasks, ALFWorld and WebShop with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct, show that HGPO significantly outperforms existing agentic RL methods under the same computational constraints. Code is available at https://github.com/langfengQ/verl-agent/tree/master/recipe/hgpo.
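
Illustrative sketch (not the released HGPO code; see the repository linked above for the authors' implementation): the abstract describes assigning each step to several groups of increasing context consistency, computing an advantage in each, and combining them. The sketch below makes two labeled assumptions: level k groups steps that share the current state and the k most recent history entries, and the aggregation weight is (k + 1) ** alpha, a hypothetical stand-in for the paper's adaptive weighting scheme. The name `hierarchical_advantages` is also hypothetical.

```python
from collections import defaultdict


def hierarchical_advantages(steps, max_level=2, alpha=1.0):
    """steps: list of (state, history, ret), with history a tuple of earlier states.

    Returns one aggregated advantage per step.
    """
    def key_for(state, history, k):
        # Slice from an explicit index so that k = 0 yields an empty suffix.
        return (state, history[len(history) - k:])

    groups = [defaultdict(list) for _ in range(max_level + 1)]
    for idx, (state, history, _) in enumerate(steps):
        for k in range(max_level + 1):
            if len(history) >= k:
                groups[k][key_for(state, history, k)].append(idx)

    advantages = []
    for state, history, ret in steps:
        weighted_sum, weight_total = 0.0, 0.0
        for k in range(max_level + 1):
            if len(history) < k:
                continue
            members = groups[k][key_for(state, history, k)]
            if len(members) < 2:
                continue  # a singleton group carries no relative signal
            baseline = sum(steps[m][2] for m in members) / len(members)
            weight = (k + 1) ** alpha  # assumed: trust more consistent groups more
            weighted_sum += weight * (ret - baseline)
            weight_total += weight
        advantages.append(weighted_sum / weight_total if weight_total else 0.0)
    return advantages


if __name__ == "__main__":
    rollout_steps = [
        ("s", ("a",), 1.0),
        ("s", ("a",), 0.0),
        ("s", ("b",), 0.0),
    ]
    print(hierarchical_advantages(rollout_steps, max_level=1))
```

Coarse groups contain more steps (lower variance) but mix different histories (higher bias), while fine groups do the opposite, so weighting across levels is one way to trade the two off without extra rollouts.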

[520] Hypernetwork-based approach for grid-independent functional data clustering

Anirudh Thatipelli, Ali Siahkoohi

Main category: cs.LG

TL;DR: A functional data clustering framework using hypernetwork-based autoencoders with implicit neural representations to create grid-agnostic, compact function representations for robust clustering across varying sampling resolutions.

Details

Motivation: Current functional data clustering methods depend on sampling grids, resolution, and preprocessing choices rather than the underlying functions themselves, making cluster assignments inconsistent and sensitive to discretization parameters.

Method: Proposes an auto-encoding architecture where a hypernetwork encoder maps coordinate-value pairs to the weight space of an implicit neural representation (INR) decoder. This creates compact, grid-independent function representations in a fixed-dimensional vector space, enabling standard clustering algorithms to operate on these representations.

Result: Demonstrates competitive clustering performance in synthetic and real-world high-dimensional settings, with robustness to changes in sampling resolution, including generalization to resolutions not seen during training.

Conclusion: The framework provides a principled approach to functional data clustering that decouples clustering from discretization choices, offering robust and resolution-agnostic clustering through compact INR-based representations.

Abstract: Functional data clustering is concerned with grouping functions that share similar structure, yet most existing methods implicitly operate on sampled grids, causing cluster assignments to depend on resolution, sampling density, or preprocessing choices rather than on the underlying functions themselves. To address this limitation, we introduce a framework that maps discretized function observations – at arbitrary resolution and on arbitrary grids – into a fixed-dimensional vector space via an auto-encoding architecture. The encoder is a hypernetwork that maps coordinate-value pairs to the weight space of an implicit neural representation (INR), which serves as the decoder. Because INRs represent functions with very few parameters, this design yields compact representations that are decoupled from the sampling grid, while the hypernetwork amortizes weight prediction across the dataset. Clustering is then performed in this weight space using standard algorithms, making the approach agnostic to both the discretization and the choice of clustering method. By means of synthetic and real-world experiments in high-dimensional settings, we demonstrate competitive clustering performance that is robust to changes in sampling resolution – including generalization to resolutions not seen during training.

[521] Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus

Anna Van Elst, Kerrian Le Caillec, Igor Colin, Stephan Clémençon

Main category: cs.LG

TL;DR: Decentralized ranking aggregation algorithms using gossip communication for distributed preference data across networks

Details

Motivation: Existing ranking aggregation methods work in centralized settings, but many modern technologies (peer-to-peer networks, IoT, multi-agent systems) require decentralized approaches where preference data is distributed across networks. Extending consensus ranking computation to decentralized settings remains a major methodological challenge.

Method: Proposes decentralized algorithms using random gossip communication that allow autonomous agents to compute global ranking consensus through local interactions only. Implements Borda, Copeland, median rank, and local Kemenization rules in decentralized fashion with rigorous convergence guarantees and explicit rate bounds.

Result: Algorithms converge quickly and reliably to correct ranking aggregation across various network topologies and real/synthetic ranking datasets. Provides theoretical convergence guarantees including explicit rate bounds for Borda and Copeland methods.

Conclusion: Successfully extends ranking aggregation to decentralized settings, addressing robustness to corrupted nodes and scalability through reduced communication costs. Enables reliable consensus on collective rankings without central coordination.

Abstract: The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical guarantees in a centralized setting, i.e., when all the ranking data to be aggregated can be brought together in a single computing unit. For many technologies (e.g. peer-to-peer networks, IoT, multi-agent systems), extending the ability to calculate consensus rankings with guarantees in a decentralized setting, i.e., when preference data is initially distributed across a communicating network, remains a major methodological challenge. Indeed, in recent years, the literature on decentralized computation has mainly focused on computing or optimizing statistics such as arithmetic means using gossip algorithms. The purpose of this article is precisely to study how to achieve reliable consensus on collective rankings using classical rules (e.g. Borda, Copeland) in a decentralized setting, which raises new questions, in particular robustness to corrupted nodes and scalability through reduced communication costs. The approach proposed and analyzed here relies on random gossip communication, allowing autonomous agents to compute global ranking consensus using only local interactions, without coordination or central authority. We provide rigorous convergence guarantees, including explicit rate bounds, for the Borda and Copeland consensus methods. Beyond these rules, we also provide a decentralized implementation of consensus according to the median rank rule and local Kemenization. Extensive empirical evaluations on various network topologies and real and synthetic ranking datasets demonstrate that our algorithms converge quickly and reliably to the correct ranking aggregation.
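
Illustrative sketch (not the authors' algorithm): the Borda rule ranks items by their average score across voters, so a decentralized version can be obtained by averaging score vectors with standard randomized pairwise gossip. The sketch assumes each agent holds one full ranking, that the network is a connected undirected graph given as an edge list, and that one random edge is activated per round; the function names and parameters are hypothetical, and Copeland, median rank, and local Kemenization are not covered.

```python
import random


def borda_scores(ranking):
    """Score each item by the number of items ranked below it."""
    n = len(ranking)
    scores = [0.0] * n
    for position, item in enumerate(ranking):
        scores[item] = float(n - 1 - position)
    return scores


def gossip_borda(rankings, edges, rounds=2000, seed=0):
    """Randomized pairwise gossip: each round, one edge averages its two estimates."""
    rng = random.Random(seed)
    estimates = [borda_scores(r) for r in rankings]
    for _ in range(rounds):
        i, j = rng.choice(edges)
        averaged = [(a + b) / 2.0 for a, b in zip(estimates[i], estimates[j])]
        estimates[i] = averaged
        estimates[j] = list(averaged)
    return estimates


def ranking_from_scores(scores):
    """Order items by decreasing score (ties broken by item index)."""
    return sorted(range(len(scores)), key=lambda item: -scores[item])


if __name__ == "__main__":
    local_rankings = [[0, 1, 2], [0, 2, 1], [1, 0, 2], [0, 1, 2]]
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    for estimate in gossip_borda(local_rankings, ring):
        print(ranking_from_scores(estimate))
```

Each pairwise average preserves the network-wide sum of scores, so on a connected graph every local estimate approaches the global mean Borda score and each agent can read off the consensus ranking locally.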

[522] MEDNA-DFM: A Dual-View FiLM-MoE Model for Explainable DNA Methylation Prediction

Yi He, Yina Cao, Jixiu Zhai, Di Wang, Junxiao Kong, Tianchi Lu

Main category: cs.LG

TL;DR: MEDNA-DFM is a deep learning model for DNA methylation prediction with explainable AI techniques that extract biological insights about conserved methylation patterns and propose a “sequence-structure synergy” hypothesis.

Details

Motivation: Deep learning models for DNA methylation prediction are effective but lack interpretability, preventing biological insight. The authors aim to develop both high-performance prediction and mechanism-inspired explainability.

Method: Developed MEDNA-DFM model for methylation prediction alongside signal purification algorithms for motif extraction. Used external dataset validation and in silico mutagenesis to test biological hypotheses.

Result: MEDNA-DFM captures conserved methylation patterns across species, with generalization driven by conserved intrinsic motifs rather than phylogenetic proximity. The purification algorithms extract motifs with higher reliability than prior studies, and in silico mutagenesis validates the “sequence-structure synergy” hypothesis.

Conclusion: The work provides a powerful methylation prediction tool and demonstrates how explainable deep learning can drive methodological innovation and generate testable biological hypotheses.

Abstract: Accurate computational identification of DNA methylation is essential for understanding epigenetic regulation. Although deep learning excels in this binary classification task, its “black-box” nature impedes biological insight. We address this by introducing a high-performance model MEDNA-DFM, alongside mechanism-inspired signal purification algorithms. Our investigation demonstrates that MEDNA-DFM effectively captures conserved methylation patterns, achieving robust distinction across diverse species. Validation on external independent datasets confirms that the model’s generalization is driven by conserved intrinsic motifs (e.g., GC content) rather than phylogenetic proximity. Furthermore, applying our developed algorithms extracted motifs with significantly higher reliability than prior studies. Finally, empirical evidence from a Drosophila 6mA case study prompted us to propose a “sequence-structure synergy” hypothesis, suggesting that the GAGG core motif and an upstream A-tract element function cooperatively. We further validated this hypothesis via in silico mutagenesis, confirming that the ablation of either or both elements significantly degrades the model’s recognition capabilities. This work provides a powerful tool for methylation prediction and demonstrates how explainable deep learning can drive both methodological innovation and the generation of biological hypotheses.

[523] Fair feature attribution for multi-output prediction: a Shapley-based perspective

Umberto Biccari, Alain Ibáñez de Opakua, José María Mato, Óscar Millet, Roberto Morales, Enrique Zuazua

Main category: cs.LG

TL;DR: The paper provides an axiomatic characterization of feature attribution for multi-output predictors within the Shapley framework, showing that any attribution rule satisfying classical Shapley axioms must decompose component-wise across outputs.

Details

Motivation: While SHAP explanations are routinely computed independently for each output coordinate, the theoretical necessity of this practice has remained unclear. The paper aims to formalize the structural constraints in Shapley-based interpretability for multi-output models.

Method: The authors extend classical Shapley axioms (efficiency, symmetry, dummy player, additivity) to vector-valued cooperative games and establish a rigidity theorem showing that any attribution rule satisfying these axioms must decompose component-wise across outputs.

Result: The rigidity theorem demonstrates that any joint-output attribution rule must relax at least one classical Shapley axiom. Numerical experiments on a biomedical benchmark show multi-output models can yield computational savings while producing SHAP explanations consistent with component-wise structure.

Conclusion: The paper identifies a previously unformalized structural constraint in Shapley-based interpretability, clarifying the precise scope of fairness-consistent explanations in multi-output learning. The results provide theoretical justification for computing SHAP explanations independently for each output coordinate.

Abstract: In this article, we provide an axiomatic characterization of feature attribution for multi-output predictors within the Shapley framework. While SHAP explanations are routinely computed independently for each output coordinate, the theoretical necessity of this practice has remained unclear. By extending the classical Shapley axioms to vector-valued cooperative games, we establish a rigidity theorem showing that any attribution rule satisfying efficiency, symmetry, dummy player, and additivity must necessarily decompose component-wise across outputs. Consequently, any joint-output attribution rule must relax at least one of the classical Shapley axioms. This result identifies a previously unformalized structural constraint in Shapley-based interpretability, clarifying the precise scope of fairness-consistent explanations in multi-output learning. Numerical experiments on a biomedical benchmark illustrate that multi-output models can yield computational savings in training and deployment, while producing SHAP explanations that remain fully consistent with the component-wise structure imposed by the Shapley axioms.
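
To make the component-wise structure concrete, the sketch below computes exact Shapley values for a small vector-valued game by applying the classical permutation formula to each output coordinate separately. The three-player game `toy_game` and the function name `shapley_vector` are hypothetical and chosen only for illustration; the code does not reproduce the paper's proof or experiments.

```python
from itertools import permutations
from math import factorial

def shapley_vector(n_players, value, n_outputs):
    # phi[i][k] is the Shapley value of player i for output coordinate k.
    phi = [[0.0] * n_outputs for _ in range(n_players)]
    for order in permutations(range(n_players)):
        coalition = frozenset()
        prev = value(coalition)
        for i in order:
            coalition = coalition | {i}
            cur = value(coalition)
            # Marginal contribution of player i, taken coordinate by coordinate.
            for k in range(n_outputs):
                phi[i][k] += cur[k] - prev[k]
            prev = cur
    norm = factorial(n_players)
    return [[x / norm for x in row] for row in phi]

def toy_game(S):
    # Hypothetical two-output game on three players.
    weights = [1.0, 2.0, 3.0]
    out0 = sum(weights[i] for i in S)        # additive in the players
    out1 = 4.0 if {0, 1} <= S else 0.0       # pays only if players 0 and 1 cooperate
    return (out0, out1)

phi = shapley_vector(3, toy_game, 2)
for i, row in enumerate(phi):
    print(f"player {i}: {row}")
```

Each column of `phi` sums to the corresponding coordinate of the grand-coalition value (efficiency), and player 2 receives zero on the second output (dummy player), which is the per-output behavior the rigidity theorem says cannot be avoided without relaxing an axiom.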

[524] A Data-Driven Approach to Support Clinical Renal Replacement Therapy

Alice Balboni, Luis Escobar, Andrea Manno, Fabrizio Rossi, Maria Cristina Ruffa, Gianluca Villa, Giordano D’Aloisio, Antonio Consolo

Main category: cs.LG

TL;DR: A machine learning approach using tabular data and ensemble models predicts membrane fouling in CRRT patients, achieving 77.6% sensitivity and 96.3% specificity, with counterfactual analysis for clinical interpretability.

Details

Motivation: To develop an interpretable machine learning model for predicting membrane fouling in critically ill patients undergoing Continuous Renal Replacement Therapy (CRRT), enabling early intervention and improved patient management through reliable counterfactual analysis.

Method: Derived 16 clinically selected features from ICU time-series data and modeled them as tabular inputs rather than as sequences. Applied ADASYN oversampling for class imbalance, tested Random Forest, XGBoost, and LightGBM models, and employed Shapley value-based counterfactual analysis for interpretability.

Result: Achieved 77.6% sensitivity and 96.3% specificity at 10% rebalancing rate, with tabular approach outperforming LSTM models. Feature selection reduced model to 5 key variables with minimal accuracy loss, enabling successful counterfactual analysis.

Conclusion: Interpretable machine learning models are viable for predicting CRRT membrane fouling, with tabular approaches outperforming temporal models. The integration of prediction and counterfactual analysis offers practical clinical value for therapeutic adjustments.

Abstract: This study investigates a data-driven machine learning approach to predict membrane fouling in critically ill patients undergoing Continuous Renal Replacement Therapy (CRRT). Using time-series data from an ICU, 16 clinically selected features were identified to train predictive models. To ensure interpretability and enable reliable counterfactual analysis, the researchers adopted a tabular data approach rather than modeling temporal dependencies directly. Given the imbalance between fouling and non-fouling cases, the ADASYN oversampling technique was applied to improve minority class representation. Random Forest, XGBoost, and LightGBM models were tested, achieving balanced performance with 77.6% sensitivity and 96.3% specificity at a 10% rebalancing rate. Results remained robust across different forecasting horizons. Notably, the tabular approach outperformed LSTM recurrent neural networks, suggesting that explicit temporal modeling was not necessary for strong predictive performance. Feature selection further reduced the model to five key variables, improving simplicity and interpretability with minimal loss of accuracy. A Shapley value-based counterfactual analysis was applied to the best-performing model, successfully identifying minimal input changes capable of reversing fouling predictions. Overall, the findings support the viability of interpretable machine learning models for predicting membrane fouling during CRRT. The integration of prediction and counterfactual analysis offers practical clinical value, potentially guiding therapeutic adjustments to reduce fouling risk and improve patient management.
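
For readers less familiar with the two reported metrics, the sketch below shows how sensitivity and specificity follow from confusion-matrix counts. The counts are hypothetical, picked only so that the ratios equal the reported 77.6% and 96.3%; they are not the study's data.

```python
def sensitivity(tp, fn):
    # Fraction of true fouling events that the model flags.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Fraction of non-fouling cases that the model correctly leaves unflagged.
    return tn / (tn + fp)

# Hypothetical counts, NOT taken from the paper.
tp, fn = 776, 224
tn, fp = 9630, 370

print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```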

[525] Generalization Bounds of Stochastic Gradient Descent in Homogeneous Neural Networks

Wenquan Ma, Yang Sui, Jiaye Teng, Bohan Wang, Jing Xu, Jingqin Yang

Main category: cs.LG

TL;DR: Derives generalization bounds for SGD on homogeneous neural networks that permit a slower step-size decay of order Ω(1/√t) in place of the usual O(1/t) requirement.

Details

Motivation: Algorithmic stability analysis typically requires restrictive step size decay (η_t = O(1/t)) for generalization bounds, which may hinder optimization and doesn't match practical training scenarios where slower decay is often used.

Method: Derive generalization bounds under homogeneous neural network regimes, proving these networks enable slower step size decay of order Ω(1/√t) under mild assumptions. Extend results to non-Lipschitz regimes.

Result: Homogeneous neural networks (including fully-connected and convolutional networks with ReLU/LeakyReLU) allow significantly slower step size decay while maintaining generalization guarantees, making theoretical analysis more aligned with practical training.

Conclusion: The homogeneous property of common neural network architectures enables more flexible step size schedules (Ω(1/√t)) for generalization analysis, bridging the gap between theoretical requirements and practical optimization.

Abstract: Algorithmic stability is among the most potent techniques in generalization analysis. However, its derivation usually requires a stepsize $\eta_t = \mathcal{O}(1/t)$ under non-convex training regimes, where $t$ denotes iterations. This rigid decay of the stepsize potentially impedes optimization and may not align with practical scenarios. In this paper, we derive the generalization bounds under the homogeneous neural network regimes, proving that this regime enables slower stepsize decay of order $\Omega(1/\sqrt{t})$ under mild assumptions. We further extend the theoretical results from several aspects, e.g., non-Lipschitz regimes. This finding is broadly applicable, as homogeneous neural networks encompass fully-connected and convolutional neural networks with ReLU and LeakyReLU activations.
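
One way to see why the decay rate matters for optimization is to compare the total step budget each schedule provides. The sketch below sums the step sizes over a fixed horizon; the unit constants, the horizon `T`, and the helper name `cumulative_step` are assumptions made for illustration and do not come from the paper.

```python
import math

def cumulative_step(schedule, horizon):
    # eta_1 + ... + eta_T: a crude proxy for how far SGD can move in T steps.
    return sum(schedule(t) for t in range(1, horizon + 1))

def inverse_t(t):
    return 1.0 / t                 # eta_t = 1/t, the classical stability regime

def inverse_sqrt_t(t):
    return 1.0 / math.sqrt(t)      # eta_t = 1/sqrt(t), the slower decay

T = 10_000
total_fast_decay = cumulative_step(inverse_t, T)
total_slow_decay = cumulative_step(inverse_sqrt_t, T)
print(f"sum of 1/t       over {T} steps: {total_fast_decay:.2f}")
print(f"sum of 1/sqrt(t) over {T} steps: {total_slow_decay:.2f}")
```

The first sum grows like log T and the second like 2·sqrt(T), so the slower decay leaves far more room for the iterates to keep moving late in training.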

[526] MSINO: Curvature-Aware Sobolev Optimization for Manifold Neural Networks

Suresan Pareth

Main category: cs.LG

TL;DR: MSINO is a curvature-aware neural network training framework for Riemannian manifolds that uses covariant Sobolev loss with parallel transport and Laplace-Beltrami regularization, providing geometry-dependent convergence guarantees.

Details

Motivation: Current neural network training methods on manifolds lack curvature-aware convergence guarantees and stability. There's a need for training frameworks that explicitly account for manifold geometry, parallel transport, and curvature effects to improve optimization stability and provide theoretical guarantees for neural networks defined on Riemannian manifolds.

Method: Replaces Euclidean derivative supervision with covariant Sobolev loss using parallel transport for gradient alignment. Adds Laplace-Beltrami smoothness regularization. Derives geometry-dependent constants for: (1) Descent Lemma with manifold Sobolev smoothness constant, (2) Sobolev Polyak-Lojasiewicz inequality for linear convergence guarantees, (3) two-step Newton-Sobolev method with local quadratic contraction in curvature-controlled neighborhoods.

Result: Provides training time guarantees that explicitly track curvature and transported Jacobians. Framework unifies value and gradient-based learning with curvature-aware convergence guarantees for neural training on manifolds.

Conclusion: MSINO offers a principled, curvature-aware training framework for neural networks on Riemannian manifolds with theoretical convergence guarantees, applicable to surface imaging, physics-informed learning, and robotics on Lie groups like SO(3) and SE(3).

Abstract: We introduce Manifold Sobolev Informed Neural Optimization (MSINO), a curvature aware training framework for neural networks defined on Riemannian manifolds. The method replaces standard Euclidean derivative supervision with a covariant Sobolev loss that aligns gradients using parallel transport and improves stability via a Laplace Beltrami smoothness regularization term. Building on classical results in Riemannian optimization and Sobolev theory on manifolds, we derive geometry dependent constants that yield (i) a Descent Lemma with a manifold Sobolev smoothness constant, (ii) a Sobolev Polyak Lojasiewicz inequality giving linear convergence guarantees for Riemannian gradient descent and stochastic gradient descent under explicit step size bounds, and (iii) a two step Newton Sobolev method with local quadratic contraction in curvature controlled neighborhoods. Unlike prior Sobolev training in Euclidean space, MSINO provides training time guarantees that explicitly track curvature and transported Jacobians. Applications include surface imaging, physics informed learning settings, and robotics on Lie groups such as SO(3) and SE(3). The framework unifies value and gradient based learning with curvature aware convergence guarantees for neural training on manifolds.

[527] Scaling Laws of Global Weather Models

Yuejiang Yu, Langwen Huang, Alexandru Calotoiu, Torsten Hoefler

Main category: cs.LG

TL;DR: Analysis of scaling laws for weather forecasting models reveals Aurora has strongest data-scaling, GraphCast has best parameter efficiency, and weather models favor width over depth unlike language models.

Details

Motivation: To optimize training efficiency and model performance in weather forecasting by understanding empirical scaling laws for model size, dataset size, and compute budget.

Method: Analyzes relationship between validation loss and three factors: model size (N), dataset size (D), and compute budget (C) across various weather forecasting models including Aurora and GraphCast.

Result: Aurora shows the strongest data-scaling: a 10x larger training dataset reduces validation loss by up to 3.2x. GraphCast has the highest parameter efficiency but limited hardware utilization. Weather models favor width over depth, unlike language models.

Conclusion: Future weather models should prioritize wider architectures and larger effective training datasets to maximize predictive performance, with compute-optimal allocation favoring longer training over larger models.

Abstract: Data-driven models are revolutionizing weather forecasting. To optimize training efficiency and model performance, this paper analyzes empirical scaling laws within this domain. We investigate the relationship between model performance (validation loss) and three key factors: model size ($N$), dataset size ($D$), and compute budget ($C$). Across a range of models, we find that Aurora exhibits the strongest data-scaling behavior: increasing the training dataset by 10x reduces validation loss by up to 3.2x. GraphCast demonstrates the highest parameter efficiency, yet suffers from limited hardware utilization. Our compute-optimal analysis indicates that, under fixed compute budgets, allocating resources to longer training durations yields greater performance gains than increasing model size. Furthermore, we analyze model shape and uncover scaling behaviors that differ fundamentally from those observed in language models: weather forecasting models consistently favor increased width over depth. These findings suggest that future weather models should prioritize wider architectures and larger effective training datasets to maximize predictive performance.
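
The reported data-scaling figure can be turned into an exponent with a line of arithmetic. The sketch below assumes a pure power law, loss proportional to D^(-alpha) with no irreducible-loss term, which is a simplification and not necessarily the functional form fitted in the paper. Because the abstract says "up to 3.2x", the resulting alpha should be read as a best-case value.

```python
import math

def data_scaling_exponent(data_factor, loss_reduction):
    # Solve loss_reduction = data_factor ** alpha for alpha.
    return math.log(loss_reduction) / math.log(data_factor)

def predicted_loss_reduction(data_factor, alpha):
    # Loss reduction implied by the assumed power law for a given data increase.
    return data_factor ** alpha

alpha = data_scaling_exponent(10.0, 3.2)
print(f"implied data-scaling exponent alpha = {alpha:.3f}")
print(f"implied reduction for 100x data     = {predicted_loss_reduction(100.0, alpha):.2f}x")
```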

[528] Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability

Bum Jun Kim, Shohei Taniguchi, Makoto Kawano, Yusuke Iwasawa, Yutaka Matsuo

Main category: cs.LG

TL;DR: RKSP predicts transformer training divergence from initialization using Koopman spectral analysis, and KSS prevents divergence by reshaping spectra during training.

Details

Motivation: Training divergence in transformers wastes computational resources, but practitioners only discover instability after expensive training runs have already begun. There's a need for a method to estimate the probability of failure before training starts.

Method: Residual Koopman Spectral Profiling (RKSP) extracts Koopman spectral features from a single forward pass at initialization using whitened dynamic mode decomposition on layer-wise residual snapshots. The key diagnostic is “near-unit spectral mass” which quantifies instability risk. Koopman Spectral Shaping (KSS) reshapes spectra during training to prevent divergence.

Result: RKSP achieves AUROC of 0.995 for predicting divergence across extensive configurations, outperforming gradient baselines. KSS reduces divergence rate from 66.7% to 12.5% in challenging high learning rate regimes without normalization layers, enabling 50-150% higher learning rates. Results generalize to WikiText-103, vision transformers on CIFAR-10, and pretrained models including GPT-2, LLaMA-2 up to 7B, MoE, Mamba-style SSMs, and KAN.

Conclusion: RKSP provides an effective pre-training diagnostic for transformer instability, and KSS offers a practical solution to prevent divergence, making transformer training more reliable and efficient across diverse architectures and tasks.

Abstract: Training divergence in transformers wastes compute, yet practitioners discover instability only after expensive runs begin. They therefore need an expected probability of failure for a transformer before training starts. Our study of Residual Koopman Spectral Profiling (RKSP) provides such an estimate. From a single forward pass at initialization, RKSP extracts Koopman spectral features by applying whitened dynamic mode decomposition to layer-wise residual snapshots. Our central diagnostic, the near-unit spectral mass, quantifies the fraction of modes concentrated near the unit circle, which captures instability risk. For predicting divergence across extensive configurations, this estimator achieves an AUROC of 0.995, outperforming the best gradient baseline. We further make this diagnostic actionable through Koopman Spectral Shaping (KSS), which reshapes spectra during training. We empirically validate that our method works in practice: RKSP predicts divergence at initialization, and when RKSP flags high risk, turning on KSS successfully prevents divergence. In the challenging high learning rate regime without normalization layers, KSS reduces the divergence rate from 66.7% to 12.5% and enables learning rates that are 50% to 150% higher. These findings generalize to WikiText-103 language modeling, vision transformers on CIFAR-10, and pretrained language models, including GPT-2 and LLaMA-2 up to 7B, as well as emerging architectures such as MoE, Mamba-style SSMs, and KAN.
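
The diagnostic described above can be approximated on a toy example. The sketch below fits a layer-to-layer linear operator to residual snapshots with plain least-squares dynamic mode decomposition and reports the fraction of eigenvalues whose magnitude lies within a tolerance of 1. It is an illustration under stated assumptions only: the whitening step is omitted, the tolerance `tol=0.05` and the function names are made up here, and the "network" is a fixed linear map rather than a transformer.

```python
import numpy as np

def near_unit_spectral_mass(snapshots, tol=0.05):
    """snapshots has shape (layers + 1, dim, batch): hidden states per layer."""
    X = np.concatenate(list(snapshots[:-1]), axis=1)   # inputs to each layer
    Y = np.concatenate(list(snapshots[1:]), axis=1)    # outputs of each layer
    A = Y @ np.linalg.pinv(X)                          # least-squares DMD operator
    eigvals = np.linalg.eigvals(A)
    # Fraction of modes whose magnitude is within tol of the unit circle.
    return float(np.mean(np.abs(np.abs(eigvals) - 1.0) < tol))

def rotation(theta, scale=1.0):
    c, s = np.cos(theta), np.sin(theta)
    return scale * np.array([[c, -s], [s, c]])

# Toy layer map: two modes on the unit circle, two contracting modes.
A_true = np.zeros((4, 4))
A_true[:2, :2] = rotation(0.3)
A_true[2:, 2:] = rotation(0.7, scale=0.5)

rng = np.random.default_rng(0)
states = [rng.normal(size=(4, 8))]          # batch of 8 random inputs
for _ in range(6):                          # six identical "layers"
    states.append(A_true @ states[-1])

mass = near_unit_spectral_mass(np.stack(states))
print(f"near-unit spectral mass = {mass:.2f}")
```

In this toy setup half of the modes sit on the unit circle by construction, so the estimate recovers that fraction; on a real model the snapshots would come from a forward pass at initialization.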

[529] Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang

Main category: cs.LG

TL;DR: EMPO² is a hybrid RL framework that uses memory-augmented exploration to improve LLM agent performance in novel environments, combining on- and off-policy updates for robustness.

Details

Motivation: Current RL methods for LLM agents fail in environments requiring discovery of novel states, as they rely too heavily on pretrained knowledge rather than effective exploration.

Method: Proposes Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO²), a hybrid RL framework that leverages memory for exploration and combines both on-policy and off-policy updates to ensure LLMs perform well with memory while remaining robust without it.

Result: Achieves 128.6% improvement over GRPO on ScienceWorld and 11.3% improvement on WebShop. Shows superior adaptability in out-of-distribution tests, requiring only a few trials with memory and no parameter updates.

Conclusion: EMPO² is a promising framework for building more exploratory and generalizable LLM-based agents that can discover novel states and adapt to new tasks efficiently.

Abstract: Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO$^2$), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO$^2$ achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO$^2$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO$^2$ as a promising framework for building more exploratory and generalizable LLM-based agents.

[530] Learning Disease-Sensitive Latent Interaction Graphs From Noisy Cardiac Flow Measurements

Viraj Patel, Marko Grujic, Philipp Aigner, Theodor Abart, Marcus Granegger, Deblina Bhattacharjee, Katharine Fraser

Main category: cs.LG

TL;DR: Physics-informed latent relational framework models cardiac vortices as interacting nodes in a graph to capture disease severity and interventions across computational fluid dynamics and ultrasound modalities.

Details

Motivation: Current cardiac blood flow imaging methods fail to capture underlying relational structures of coherent flow features, which contain rich information about disease severity and clinical interventions.

Method: Combines neural relational inference architecture with physics-inspired interaction energy and birth-death dynamics to model cardiac vortices as interacting nodes in a latent graph.

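Result: Latent graphs reveal stronger and more frequent vortex interactions as the aorta narrows, and the resulting graph entropy rises monotonically with coarctation severity (R²=0.78, Spearman |ρ|=0.96). On ultrasound data, the representation captures the weakening of vortical structures in LVAD-supported ventricles, demonstrating cross-modal generalization.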

Conclusion: Latent interaction graphs and entropy serve as robust, interpretable markers of cardiac disease and intervention across different imaging modalities.

Abstract: Cardiac blood flow patterns contain rich information about disease severity and clinical interventions, yet current imaging and computational methods fail to capture underlying relational structures of coherent flow features. We propose a physics-informed, latent relational framework to model cardiac vortices as interacting nodes in a graph. Our model combines a neural relational inference architecture with physics-inspired interaction energy and birth-death dynamics, yielding a latent graph sensitive to disease severity and intervention level. We first apply this to computational fluid dynamics simulations of aortic coarctation. Learned latent graphs reveal that as the aortic radius narrows, vortex interactions become stronger and more frequent. This leads to a higher graph entropy, correlating monotonically with coarctation severity ($R^2=0.78$, Spearman $|\rho|=0.96$). We then extend this method to ultrasound datasets of left ventricles under varying levels of left ventricular assist device support. Again the latent graph representation captures the weakening of coherent vortical structures, thereby demonstrating cross-modal generalisation. Results show latent interaction graphs and entropy serve as robust and interpretable markers of cardiac disease and intervention.

[531] Latent Matters: Learning Deep State-Space Models

Alexej Klushyn, Richard Kurle, Maximilian Soelch, Botond Cseke, Patrick van der Smagt

Main category: cs.LG

TL;DR: EKVAE combines variational autoencoders with extended Kalman filtering for improved dynamical system modeling, using constrained optimization to ensure learning of underlying dynamics.

Details

Motivation: Standard training of deep state-space models via evidence lower bound doesn't guarantee learning of actual underlying dynamics, leading to poor system identification and prediction accuracy.

Method: Proposes constrained optimization framework for training DSSMs, and introduces Extended Kalman VAE (EKVAE) that combines amortized variational inference with classic Bayesian filtering/smoothing for more accurate dynamics modeling than RNN-based approaches.

Result: Constrained optimization significantly improves system identification and prediction accuracy; EKVAE outperforms previous models in prediction accuracy, achieves remarkable dynamical system identification results, and successfully learns disentangled state-space representations.

Conclusion: The constrained optimization framework and EKVAE provide more reliable learning of dynamical systems, with better prediction accuracy and system identification capabilities than existing DSSM approaches.

Abstract: Deep state-space models (DSSMs) enable temporal predictions by learning the underlying dynamics of observed sequence data. They are often trained by maximising the evidence lower bound. However, as we show, this does not ensure the model actually learns the underlying dynamics. We therefore propose a constrained optimisation framework as a general approach for training DSSMs. Building upon this, we introduce the extended Kalman VAE (EKVAE), which combines amortised variational inference with classic Bayesian filtering/smoothing to model dynamics more accurately than RNN-based DSSMs. Our results show that the constrained optimisation framework significantly improves system identification and prediction accuracy on the example of established state-of-the-art DSSMs. The EKVAE outperforms previous models w.r.t. prediction accuracy, achieves remarkable results in identifying dynamical systems, and can furthermore successfully learn state-space representations where static and dynamic features are disentangled.
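
The classical filtering component that the EKVAE builds on can be sketched in a few lines. The code below is a scalar extended Kalman filter step with hand-written dynamics; it is not the EKVAE itself, which learns its latent dynamics and uses amortised inference. The transition and observation functions, the noise variances, and the measurements are all hypothetical.

```python
import math

def ekf_step(x, P, z, f, F, h, H, Q, R):
    """One scalar extended Kalman filter step (predict, then update)."""
    # Predict: push the mean through f and the variance through the linearization F.
    x_pred = f(x)
    F_x = F(x)
    P_pred = F_x * P * F_x + Q
    # Update: linearize the observation model at the predicted mean.
    H_x = H(x_pred)
    S = H_x * P_pred * H_x + R          # innovation variance
    K = P_pred * H_x / S                # Kalman gain
    x_new = x_pred + K * (z - h(x_pred))
    P_new = (1.0 - K * H_x) * P_pred
    return x_new, P_new

# Hypothetical nonlinear transition with a direct, noisy observation of the state.
f = lambda x: 0.9 * x + 0.1 * math.sin(x)
F = lambda x: 0.9 + 0.1 * math.cos(x)   # derivative of f
h = lambda x: x
H = lambda x: 1.0                        # derivative of h

x, P = 0.5, 1.0
for z in [0.30, 0.35, 0.20]:             # made-up measurements
    x, P = ekf_step(x, P, z, f, F, h, H, Q=0.01, R=0.1)
    print(f"estimate = {x:.3f}, variance = {P:.3f}")
```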

[532] RhythmBERT: A Self-Supervised Language Model Based on Latent Representations of ECG Waveforms for Heart Disease Detection

Xin Wang, Burcu Ozek, Aruna Mohan, Amirhossein Ravari, Or Zilbershot, Fatemeh Afghah

Main category: cs.LG

TL;DR: RhythmBERT is a generative ECG language model that treats ECG signals as structured language by encoding cardiac segments into symbolic tokens, achieving strong performance on cardiac analysis tasks with single-lead data.

Details

Motivation: Most self-supervised learning methods treat ECG as generic time series, overlooking physiological semantics and rhythm-level structure. Existing contrastive methods use augmentations that distort morphology, while generative approaches use fixed-window segmentation that misaligns cardiac cycles.

Method: Proposes RhythmBERT, a generative ECG language model that encodes P, QRS, and T segments into symbolic tokens via autoencoder-based latent representations. Uses discrete tokens for rhythm semantics and continuous embeddings for fine-grained morphology. Pretrained on ~800,000 unlabeled ECG recordings with masked prediction objective.

Result: Despite using only a single lead, RhythmBERT achieves comparable or superior performance to strong 12-lead baselines. Generalizes well from prevalent conditions like atrial fibrillation to clinically challenging cases like subtle ST-T abnormalities and myocardial infarction.

Conclusion: Treating ECG as structured language offers a scalable and physiologically aligned pathway for advancing cardiac analysis, enabling label-efficient learning of contextual representations.

Abstract: Electrocardiogram (ECG) analysis is crucial for diagnosing heart disease, but most self-supervised learning methods treat ECG as a generic time series, overlooking physiologic semantics and rhythm-level structure. Existing contrastive methods utilize augmentations that distort morphology, whereas generative approaches employ fixed-window segmentation, which misaligns cardiac cycles. To address these limitations, we propose RhythmBERT, a generative ECG language model that considers ECG as a language paradigm by encoding P, QRS, and T segments into symbolic tokens via autoencoder-based latent representations. These discrete tokens capture rhythm semantics, while complementary continuous embeddings retain fine-grained morphology, enabling a unified view of waveform structure and rhythm. RhythmBERT is pretrained on approximately 800,000 unlabeled ECG recordings with a masked prediction objective, allowing it to learn contextual representations in a label-efficient manner. Evaluations show that despite using only a single lead, RhythmBERT achieves comparable or superior performance to strong 12-lead baselines. This generalization extends from prevalent conditions such as atrial fibrillation to clinically challenging cases such as subtle ST-T abnormalities and myocardial infarction. Our results suggest that considering ECG as structured language offers a scalable and physiologically aligned pathway for advancing cardiac analysis.

[533] Physics-informed neural particle flow for the Bayesian update step

Domonkos Csuzdi, Tamás Bécsi, Olivér Törő

Main category: cs.LG

TL;DR: Proposes a physics-informed neural particle flow for the Bayesian update step, in which a neural network trained under a PDE constraint approximates the probability transport velocity field without ground-truth posterior samples.

Details

Motivation: Bayesian update in high-dimensional nonlinear estimation is computationally challenging; existing particle flow filters yield stiff differential equations, while deep learning approaches treat updates as black-box tasks or neglect exact geometric structure of probability transport.

Method: Propose physics-informed neural particle flow that couples log-homotopy trajectory with continuity equation to derive master PDE, then embed PDE as physical constraint in loss function to train neural network to approximate transport velocity field.

Result: Neural parameterization acts as implicit regularizer, mitigating numerical stiffness and reducing online computational complexity; experimental validation on multimodal benchmarks shows better mode coverage and robustness compared to state-of-the-art baselines.

Conclusion: The framework enables purely unsupervised training without ground-truth posterior samples and provides effective solution for high-dimensional nonlinear Bayesian filtering problems.

Abstract: The Bayesian update step poses significant computational challenges in high-dimensional nonlinear estimation. While log-homotopy particle flow filters offer an alternative to stochastic sampling, existing formulations usually yield stiff differential equations. Conversely, existing deep learning approximations typically treat the update as a black-box task or rely on asymptotic relaxation, neglecting the exact geometric structure of the finite-horizon probability transport. In this work, we propose a physics-informed neural particle flow, which is an amortized inference framework. To construct the flow, we couple the log-homotopy trajectory of the prior to posterior density function with the continuity equation describing the density evolution. This derivation yields a governing partial differential equation (PDE), referred to as the master PDE. By embedding this PDE as a physical constraint into the loss function, we train a neural network to approximate the transport velocity field. This approach enables purely unsupervised training, eliminating the need for ground-truth posterior samples. We demonstrate that the neural parameterization acts as an implicit regularizer, mitigating the numerical stiffness inherent to analytic flows and reducing online computational complexity. Experimental validation on multimodal benchmarks and a challenging nonlinear scenario confirms better mode coverage and robustness compared to state-of-the-art baselines.
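
The coupling described above can be written out in a standard form. The derivation below is a sketch under assumed notation ($g$ for the prior density, $h$ for the likelihood, $Z(\lambda)$ for the normalizer, $v$ for the transport velocity), none of which is fixed by the abstract; the paper's master PDE may be stated differently.

```latex
% Log-homotopy from prior g to posterior, with likelihood h and \lambda \in [0, 1]
p_\lambda(x) = \frac{g(x)\, h(x)^{\lambda}}{Z(\lambda)}, \qquad
\partial_\lambda \log p_\lambda(x) = \log h(x) - \mathbb{E}_{p_\lambda}[\log h]

% Continuity equation for particles transported with velocity v(x, \lambda)
\partial_\lambda p_\lambda + \nabla \cdot (p_\lambda v) = 0

% Dividing the continuity equation by p_\lambda and substituting the first relation
\nabla \cdot v + v \cdot \nabla \log p_\lambda
  = -\bigl(\log h - \mathbb{E}_{p_\lambda}[\log h]\bigr)
```

A network parameterizing $v$ can then be trained by penalizing the residual of the last equation at sampled points, which is what makes training possible without posterior samples.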

[534] PRAC: Principal-Random Subspace for LLM Activation Compression and Memory-Efficient Training

Yanyi Li, Yimu Zhang, Cong Fang

Main category: cs.LG

TL;DR: PRAC is a novel activation compression method for LLM training that uses principal subspace decomposition via SVD and random subspace sampling to reduce memory usage by up to 36% with minimal performance impact.

Details

Motivation: Activations have become the primary memory bottleneck in large-batch LLM training, and existing compression methods fail to exploit the spectral structure of activations, resulting in slow convergence or limited compression.

Method: PRAC decomposes activations into two components: a principal subspace captured via SVD to retain dominant information, and a random subspace sampled from the orthogonal complement to approximate the tail. It introduces a precise scaling factor to yield an unbiased gradient estimator with minimum variance.

Result: Extensive experiments on pre-training and fine-tuning tasks demonstrate that PRAC achieves up to 36% total memory reduction with negligible performance degradation and minimal computational cost.

Conclusion: PRAC effectively addresses the activation memory bottleneck in LLM training by exploiting spectral structure through principled subspace decomposition, achieving significant memory savings with minimal performance impact.

Abstract: Activations have become the primary memory bottleneck in large-batch LLM training. However, existing compression methods fail to exploit the spectral structure of activations, resulting in slow convergence or limited compression. To address this, we relate the algorithm’s fast convergence to the requirements for subspace projection, and show that an effective compression scheme should yield an unbiased estimate of the original activation with low variance. We propose Principal-Random Subspace for LLM Activation Compression (PRAC), which decomposes activations into two components: a principal subspace captured via SVD to retain dominant information, and a random subspace sampled from the orthogonal complement to approximate the tail. By introducing a precise scaling factor, we prove that PRAC yields an unbiased gradient estimator with minimum variance under certain conditions. Extensive experiments on pre-training and fine-tuning tasks demonstrate that PRAC achieves up to 36% total memory reduction with negligible performance degradation and minimal computational cost.
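The principal-plus-random decomposition can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names are hypothetical, and the scaling factor `(d - k) / r` is an assumption chosen so that a uniformly random `r`-dimensional projection of the `(d - k)`-dimensional complement is unbiased in expectation; the paper's exact factor may differ.

```python
import numpy as np

def compress_activations(acts, k, r, rng):
    """Hypothetical PRAC-style compression of an (n, d) activation matrix."""
    n, d = acts.shape
    # Right singular vectors: the first k span the principal subspace,
    # the remaining d - k span its orthogonal complement.
    _, _, vt = np.linalg.svd(acts, full_matrices=True)
    v_principal = vt[:k].T                      # (d, k)
    v_complement = vt[k:].T                     # (d, d - k)
    # Random r-dimensional subspace of the complement via QR of a Gaussian matrix.
    q_small, _ = np.linalg.qr(rng.standard_normal((d - k, r)))
    q_random = v_complement @ q_small           # (d, r), orthonormal columns
    # Assumed scaling: E[q q^T] = (r / (d - k)) * P_complement, so rescale by its inverse.
    scale = (d - k) / r
    return acts @ v_principal, v_principal, acts @ q_random, q_random, scale

def reconstruct_activations(coef_p, v_principal, coef_r, q_random, scale):
    """Rebuild an estimate of the activations from the stored coefficients."""
    return coef_p @ v_principal.T + scale * (coef_r @ q_random.T)
```

Only the `(n, k)` and `(n, r)` coefficient matrices need to be kept per activation, instead of the full `(n, d)` matrix, which is where the memory saving would come from in this sketch.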

[535] Learning Physical Operators using Neural Operators

Vignesh Gopakumar, Ander Gray, Dan Giles, Lorenzo Zanisi, Matt J. Kusner, Timo Betcke, Stanislas Pamela, Marc Peter Deisenroth

Main category: cs.LG

TL;DR: Physics-informed neural operator framework using operator splitting to learn individual physical operators, enabling generalization to novel physics and continuous-time predictions via neural ODEs.

Details

Motivation: Neural operators for PDEs struggle with generalization beyond training distributions and fixed temporal discretization. Need a framework that can generalize to novel physical regimes while maintaining interpretability and continuous-time predictions.

Method: Decompose PDEs using operator splitting methods, train separate neural operators for individual non-linear physical operators, approximate linear operators with fixed finite-difference convolutions. Formulate as neural ODE where learned operators constitute right-hand side, enabling continuous-time predictions through standard ODE solvers.

Result: Achieves better convergence and superior performance when generalizing to unseen physics for incompressible and compressible Navier-Stokes equations. Parameter-efficient, enables temporal extrapolation beyond training horizons, provides interpretable components verifiable against known physics.

Conclusion: Physics-informed training framework with operator splitting and neural ODE formulation addresses generalization limitations of neural operators, enabling continuous-time predictions and interpretable physics learning.

Abstract: Neural operators have emerged as promising surrogate models for solving partial differential equations (PDEs), but struggle to generalise beyond training distributions and are often constrained to a fixed temporal discretisation. This work introduces a physics-informed training framework that addresses these limitations by decomposing PDEs using operator splitting methods, training separate neural operators to learn individual non-linear physical operators while approximating linear operators with fixed finite-difference convolutions. This modular mixture-of-experts architecture enables generalisation to novel physical regimes by explicitly encoding the underlying operator structure. We formulate the modelling task as a neural ordinary differential equation (ODE) where these learned operators constitute the right-hand side, enabling continuous-in-time predictions through standard ODE solvers and implicitly enforcing PDE constraints. Demonstrated on incompressible and compressible Navier-Stokes equations, our approach achieves better convergence and superior performance when generalising to unseen physics. The method remains parameter-efficient, enabling temporal extrapolation beyond training horizons, and provides interpretable components whose behaviour can be verified against known physics.
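A toy sketch of the operator-splitting idea, under loudly stated assumptions: it uses the 1D viscous Burgers equation on a periodic grid rather than Navier-Stokes, the nonlinear operator is an analytic placeholder standing in for a trained neural operator, and a hand-written RK4 step stands in for a "standard ODE solver". All function names are hypothetical.

```python
import numpy as np

def linear_diffusion(u, nu, dx):
    """Linear operator as a fixed finite-difference stencil [1, -2, 1] / dx^2 (periodic)."""
    return nu * (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2

def nonlinear_operator_placeholder(u, dx):
    """Placeholder for a learned neural operator; here the analytic term -u * du/dx."""
    return -u * (np.roll(u, -1) - np.roll(u, 1)) / (2.0 * dx)

def rhs(u, nu, dx):
    """Right-hand side of the ODE du/dt = linear part + (learned) nonlinear part."""
    return linear_diffusion(u, nu, dx) + nonlinear_operator_placeholder(u, dx)

def rk4_step(u, dt, nu, dx):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(u, nu, dx)
    k2 = rhs(u + 0.5 * dt * k1, nu, dx)
    k3 = rhs(u + 0.5 * dt * k2, nu, dx)
    k4 = rhs(u + dt * k3, nu, dx)
    return u + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def rollout(u0, dt, n_steps, nu, dx):
    """Integrate forward in time; dt is a solver choice, not fixed by the training data."""
    u = u0.copy()
    for _ in range(n_steps):
        u = rk4_step(u, dt, nu, dx)
    return u
```

Because the time step lives in the solver rather than in the model, the same right-hand side can be queried at any temporal resolution, which is the property the abstract refers to as continuous-in-time prediction.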

[536] Regularized Online RLHF with Generalized Bilinear Preferences

Junghyun Lee, Minju Hong, Kwang-Sung Jun, Chulhee Yun, Se-Young Yun

Main category: cs.LG

TL;DR: Online RLHF with general preferences using Generalized Bilinear Preference Model to learn Nash Equilibrium with provable regret bounds

Details

Motivation: Address contextual online RLHF with general (potentially intransitive) preferences, moving beyond traditional transitive preference models and reverse KL-regularization limitations

Method: Use Generalized Bilinear Preference Model (GBPM) with low-rank skew-symmetric matrices, analyze dual gap properties, propose two algorithms: Greedy Sampling and Explore-Then-Commit with feature diversity assumptions

Result: Proved dual gap bound from strong convexity and skew-symmetricity, established regret bounds: Greedy Sampling achieves polylogarithmic regret, Explore-Then-Commit achieves statistically efficient high-dimensional regret

Conclusion: First statistically efficient guarantee for online RLHF in high dimensions with general preferences, providing theoretical foundations for practical RLHF systems

Abstract: We consider the problem of contextual online RLHF with general preferences, where the goal is to identify the Nash Equilibrium. We adopt the Generalized Bilinear Preference Model (GBPM) to capture potentially intransitive preferences via low-rank, skew-symmetric matrices. We investigate general preference learning with any strongly convex regularizer (where $η^{-1}$ is the regularization strength), generalizing beyond prior works limited to reverse KL-regularization. Central to our analysis is proving that the dual gap of the greedy policy is bounded by the square of the estimation error - a result derived solely from strong convexity and the skew-symmetricity of GBPM. Building on this insight and a feature diversity assumption, we establish two regret bounds via two simple algorithms: (1) Greedy Sampling achieves polylogarithmic, $e^{O(η)}$-free regret $\tilde{O}(ηd^4 (\log T)^2)$. (2) Explore-Then-Commit achieves $\mathrm{poly}(d)$-free regret $\tilde{O}(\sqrt{ηr T})$ by exploiting the low-rank structure; this is the first statistically efficient guarantee for online RLHF in high dimensions.
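To see why a skew-symmetric bilinear score can express intransitive preferences, consider the following minimal sketch. It assumes a sigmoid link and context-free 2D feature vectors; the function names, the link, and the feature construction are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def skew_symmetric_low_rank(U, V):
    """Build a skew-symmetric matrix U V^T - V U^T (rank at most 2 * U.shape[1])."""
    return U @ V.T - V @ U.T

def preference_prob(phi_a, phi_b, theta):
    """Assumed model: P(a preferred over b) = sigmoid(phi_a^T theta phi_b)."""
    score = phi_a @ theta @ phi_b
    return 1.0 / (1.0 + np.exp(-score))

# theta = [[0, 1], [-1, 0]], built from rank-1 factors.
theta = skew_symmetric_low_rank(np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]]))

# Three responses placed at angles 0, 120, and 240 degrees on the unit circle.
angles = [0.0, 2.0 * np.pi / 3.0, 4.0 * np.pi / 3.0]
a, b, c = (np.array([np.cos(t), np.sin(t)]) for t in angles)

p_ab = preference_prob(a, b, theta)  # a vs b
p_bc = preference_prob(b, c, theta)  # b vs c
p_ca = preference_prob(c, a, theta)  # c vs a
```

With this choice the score between two unit vectors is the sine of the angle from the first to the second, so each of `p_ab`, `p_bc`, and `p_ca` exceeds 0.5 and the three responses form a rock-paper-scissors cycle. Skew-symmetry also guarantees `P(a over b) + P(b over a) = 1` for any symmetric link of this kind.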

[537] Bound to Disagree: Generalization Bounds via Certifiable Surrogates

Mathieu Bazinet, Valentina Zantedeschi, Pascal Germain

Main category: cs.LG

TL;DR: Disagreement-based generalization certificates for deep learning models using surrogate models without modifying target models

Details

Motivation: Existing generalization bounds for deep learning are often vacuous, not computable, or limited to specific model classes. The paper aims to address these limitations by providing practical generalization certificates.

Method: Proposes disagreement-based certificates for risk gaps between predictors, bounds true risk via surrogate models with tight generalization guarantees, and evaluates on unlabeled data. Uses three surrogate frameworks: sample compression, model compression, and PAC-Bayes theory.

Result: Empirical demonstration shows tightness of obtained certificates and versatility across different surrogate model frameworks. Guarantees achieved without modifying target models or adapting training procedures.

Conclusion: Provides practical generalization certificates for deep learning models through surrogate approaches, addressing limitations of existing bounds while maintaining model independence.

Abstract: Generalization bounds for deep learning models are typically vacuous, not computable or restricted to specific model classes. In this paper, we tackle these issues by providing new disagreement-based certificates for the gap between the true risk of any two predictors. We then bound the true risk of the predictor of interest via a surrogate model that enjoys tight generalization guarantees, and by evaluating our disagreement bound on an unlabeled dataset. We empirically demonstrate the tightness of the obtained certificates and showcase the versatility of the approach by training surrogate models leveraging three different frameworks: sample compression, model compression and PAC-Bayes theory. Importantly, such guarantees are achieved without modifying the target model, nor adapting the training procedure to the generalization framework.
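The general shape of a disagreement-based certificate can be sketched as follows. For the 0-1 loss, the triangle inequality gives R(target) <= R(surrogate) + Pr[target != surrogate]. The sketch below estimates the disagreement on unlabeled data and adds a Hoeffding slack; this simple concentration term is an assumption made for illustration, and the paper's actual bounds are likely tighter and differently constructed. The function name is hypothetical.

```python
import math
import numpy as np

def disagreement_certificate(target_preds, surrogate_preds, surrogate_risk_bound, delta):
    """Upper bound on the target's 0-1 risk, assuming i.i.d. unlabeled samples.

    surrogate_risk_bound: a certified bound on the surrogate's true risk
        (obtained elsewhere, e.g. from a compression or PAC-Bayes argument).
    delta: failure probability for the disagreement estimate only; the overall
        confidence follows from a union bound with the surrogate's own guarantee.
    """
    target_preds = np.asarray(target_preds)
    surrogate_preds = np.asarray(surrogate_preds)
    m = target_preds.shape[0]
    # Empirical disagreement rate; no labels are needed for this quantity.
    emp_disagreement = float(np.mean(target_preds != surrogate_preds))
    # One-sided Hoeffding slack for a mean of m variables bounded in [0, 1].
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return min(1.0, surrogate_risk_bound + emp_disagreement + slack)
```

The target model is only queried for predictions here, which mirrors the abstract's point that the guarantee does not require modifying the target model or its training procedure.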

[538] DyGnROLE: Modeling Asymmetry in Dynamic Graphs with Node-Role-Oriented Latent Encoding

Tyler Bonnet, Marek Rei

Main category: cs.LG

TL;DR: DyGnROLE is a transformer-based dynamic graph architecture that explicitly disentangles source and destination node representations using separate embeddings and role-semantic positional encodings, enhanced by self-supervised pretraining for low-label regimes.

Details

Motivation: Existing dynamic graph architectures fail to properly model the asymmetrical behavioral patterns between source and destination nodes in directed graphs, relying on shared parameters with limited role-aware modeling.

Method: Proposes DyGnROLE with separate embedding vocabularies for source/destination nodes, role-semantic positional encodings, and Temporal Contrastive Link Prediction (TCLP) self-supervised pretraining using unlabeled interaction history.

Result: DyGnROLE substantially outperforms state-of-the-art baselines on future edge classification tasks, establishing role-aware modeling as an effective strategy for dynamic graph learning.

Conclusion: Explicit role-aware modeling with specialized representations and self-supervised pretraining significantly improves dynamic graph learning, particularly for directed graphs with asymmetrical node behaviors.

Abstract: Real-world dynamic graphs are often directed, with source and destination nodes exhibiting asymmetrical behavioral patterns and temporal dynamics. However, existing dynamic graph architectures largely rely on shared parameters for processing source and destination nodes, with limited or no systematic role-aware modeling. We propose DyGnROLE (Dynamic Graph Node-Role-Oriented Latent Encoding), a transformer-based architecture that explicitly disentangles source and destination representations. By using separate embedding vocabularies and role-semantic positional encodings, the model captures the distinct structural and temporal contexts unique to each role. Critical to the effectiveness of these specialized embeddings in low-label regimes is a self-supervised pretraining objective we introduce: Temporal Contrastive Link Prediction (TCLP). The pretraining uses the full unlabeled interaction history to encode informative structural biases, enabling the model to learn role-specific representations without requiring annotated data. Evaluation on future edge classification demonstrates that DyGnROLE substantially outperforms a diverse set of state-of-the-art baselines, establishing role-aware modeling as an effective strategy for dynamic graph learning.

[539] Prediction of Diffusion Coefficients in Mixtures with Tensor Completion

Zeno Romero, Kerstin Münnemann, Hans Hasse, Fabian Jirasek

Main category: cs.LG

TL;DR: Hybrid tensor completion method predicts temperature-dependent diffusion coefficients in binary mixtures using Tucker decomposition and active learning with NMR measurements.

Details

Motivation: Experimental diffusion coefficient data is scarce, and existing matrix completion methods are limited to single-temperature predictions, requiring a more flexible approach for temperature-dependent property prediction.

Method: Developed a hybrid tensor completion method (TCM) using Tucker decomposition, trained on experimental diffusion data at multiple temperatures (298 K, 313 K, 333 K) within a Bayesian framework that incorporates SEGWE model predictions as priors, plus active learning for targeted data acquisition via PFG-NMR measurements.

Result: The TCM achieves significantly improved prediction accuracy across temperatures (268 K to 378 K) compared to established models, with further gains after active learning expanded the experimental database with 19 new solute + solvent systems.

Conclusion: Combining data-efficient machine learning (tensor completion) with adaptive experimentation (active learning) advances predictive modeling of transport properties like diffusion coefficients.

Abstract: Predicting diffusion coefficients in mixtures is crucial for many applications, as experimental data remain scarce, and machine learning (ML) offers promising alternatives to established semi-empirical models. Among ML models, matrix completion methods (MCMs) have proven effective in predicting thermophysical properties, including diffusion coefficients in binary mixtures. However, MCMs are restricted to single-temperature predictions, and their accuracy depends strongly on the availability of high-quality experimental data for each temperature of interest. In this work, we address this challenge by presenting a hybrid tensor completion method (TCM) for predicting temperature-dependent diffusion coefficients at infinite dilution in binary mixtures. The TCM employs a Tucker decomposition and is jointly trained on experimental data for diffusion coefficients at infinite dilution in binary systems at 298 K, 313 K, and 333 K. Predictions from the semi-empirical SEGWE model serve as prior knowledge within a Bayesian training framework. The TCM then extrapolates linearly to any temperature between 268 K and 378 K, achieving markedly improved prediction accuracy compared to established models across all studied temperatures. To further enhance predictive performance, the experimental database was expanded using active learning (AL) strategies for targeted acquisition of new diffusion data by pulsed-field gradient (PFG) NMR measurements. Diffusion coefficients at infinite dilution in 19 solute + solvent systems were measured at 298 K, 313 K, and 333 K. Incorporating these results yields a substantial improvement in the TCM’s predictive accuracy. These findings highlight the potential of combining data-efficient ML methods with adaptive experimentation to advance predictive modeling of transport properties.
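A Tucker decomposition represents a solute x solvent x temperature tensor through a small core tensor and one factor matrix per mode. The sketch below shows only the reconstruction and a masked fitting loss; the function names, ranks, and loss are assumptions for illustration, and the Bayesian training with SEGWE priors described in the abstract is not reproduced.

```python
import numpy as np

def tucker_reconstruct(core, solute_factors, solvent_factors, temp_factors):
    """Reconstruct D[i, j, t] = sum_{a,b,c} core[a,b,c] * U[i,a] * V[j,b] * W[t,c]."""
    return np.einsum("abc,ia,jb,tc->ijt", core, solute_factors, solvent_factors, temp_factors)

def masked_mse(predicted, observed, mask):
    """Mean squared error over measured entries only (mask == 1 where data exist)."""
    return float(np.sum(mask * (predicted - observed) ** 2) / np.sum(mask))
```

Since most solute + solvent pairs have no measurement, only the masked entries contribute to the loss, and the fitted factors then supply predictions for every unmeasured entry of the tensor.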

[540] Partial recovery of meter-scale surface weather

Jonathan Giezendanner, Qidong Yang, Eric Schmitt, Anirban Chandra, Daniel Salles Civitarese, Johannes Jakubik, Jeremy Vila, Detlef Hohl, Campbell Watson, Sherrie Wang

Main category: cs.LG

TL;DR: A method for inferring meter-scale near-surface weather fields (wind, temperature, humidity) at 10 m resolution across the contiguous US by conditioning coarse atmospheric states on sparse surface station measurements and high-resolution Earth observation data.

Details

Motivation: Current weather analyses and forecasts lack meter-scale variability (tens to hundreds of meters) due to land cover and topography, creating a gap in understanding whether this fine-scale variability is chaotic or predictable from surface characteristics and large-scale atmospheric forcing.

Method: Condition coarse atmospheric state on sparse surface station measurements and high-resolution Earth observation data to infer spatially continuous fields of near-surface wind, temperature, and humidity at 10 m resolution across the contiguous United States.

Result: The inferred fields reduce wind error by 29% and temperature/dewpoint error by 6% relative to ERA5, explain substantially more spatial variance at fixed time steps, and exhibit physically interpretable structure including urban heat islands, evapotranspiration-driven humidity contrasts, and wind speed differences across land cover types.

Conclusion: A substantial, physically coherent component of meter-scale near-surface weather is statistically recoverable from existing observations, expanding weather modeling frontiers with a computationally feasible approach to continental-scale meter-resolution inference, demonstrating how conditioning coarse dynamical models on static fine-scale features can reveal previously unresolved Earth system components.

Abstract: Near-surface atmospheric conditions can differ sharply over tens to hundreds of meters due to land cover and topography, yet this variability is absent from current weather analyses and forecasts. It is unclear whether such meter-scale variability reflects irreducibly chaotic dynamics or contains a component predictable from surface characteristics and large-scale atmospheric forcing. Here we show that a substantial, physically coherent component of meter-scale near-surface weather is statistically recoverable from existing observations. By conditioning coarse atmospheric state on sparse surface station measurements and high-resolution Earth observation data, we infer spatially continuous fields of near-surface wind, temperature, and humidity at 10 m resolution across the contiguous United States. Relative to ERA5, the inferred fields reduce wind error by 29% and temperature and dewpoint error by 6%, while explaining substantially more spatial variance at fixed time steps. They also exhibit physically interpretable structure, including urban heat islands, evapotranspiration-driven humidity contrasts, and wind speed differences across land cover types. Our findings expand the frontier of weather modeling by demonstrating a computationally feasible approach to continental-scale meter-resolution inference. More broadly, they illustrate how conditioning coarse dynamical models on static fine-scale features can reveal previously unresolved components of the Earth system.

[541] Benchmarking Temporal Web3 Intelligence: Lessons from the FinSurvival 2025 Challenge

Oshani Seneviratne, Fernando Spadea, Adrien Pavao, Aaron Micah Green, Kristin P. Bennett

Main category: cs.LG

TL;DR: The paper presents FinSurvival Challenge 2025 as a temporal Web3 benchmark using Aave v3 transaction data for survival prediction tasks, showing domain-specific features outperform generic approaches.

Details

Motivation: The field lacks shared, reproducible benchmarks for temporal Web3 analytics that capture real-world dynamics like censoring and non-stationarity, which slows methodological progress and limits technique transfer between Web3 and broader Web domains.

Method: Created benchmark using 21.8 million transaction records from Aave v3 protocol, operationalizing 16 survival prediction tasks to model user behavior transitions, with detailed benchmark design and analysis of winning solutions.

Result: Domain-aware temporal feature construction significantly outperformed generic modeling approaches, demonstrating the value of specialized temporal feature engineering for Web3 analytics.

Conclusion: Web3 systems provide a high-fidelity sandbox for studying fundamental temporal challenges like churn, risk, and evolution that are relevant to the wider Web, and the paper offers lessons for next-generation temporal benchmarks.

Abstract: Temporal Web analytics increasingly relies on large-scale, longitudinal data to understand how users, content, and systems evolve over time. A rapidly growing frontier is the Temporal Web3: decentralized platforms whose behavior is recorded as immutable, time-stamped event streams. Despite the richness of this data, the field lacks shared, reproducible benchmarks that capture real-world temporal dynamics, specifically censoring and non-stationarity, across extended horizons. This absence slows methodological progress and limits the transfer of techniques between Web3 and broader Web domains. In this paper, we present the FinSurvival Challenge 2025 as a case study in benchmarking temporal Web3 intelligence. Using 21.8 million transaction records from the Aave v3 protocol, the challenge operationalized 16 survival prediction tasks to model user behavior transitions. We detail the benchmark design and the winning solutions, highlighting how domain-aware temporal feature construction significantly outperformed generic modeling approaches. Furthermore, we distill lessons for next-generation temporal benchmarks, arguing that Web3 systems provide a high-fidelity sandbox for studying temporal challenges, such as churn, risk, and evolution, that are fundamental to the wider Web.
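Censoring, which the abstract singles out, means that for some users the transition of interest has not been observed by the end of the record. As a generic illustration (the Kaplan-Meier estimator is a textbook tool and is not claimed to be part of the challenge or its winning solutions), the sketch below shows how censored records still contribute to the at-risk count without being counted as events. The function name and toy data are hypothetical.

```python
import numpy as np

def kaplan_meier(durations, event_observed):
    """Kaplan-Meier survival estimate at each distinct observed event time.

    durations: time until the event or until censoring, per user.
    event_observed: True if the event occurred, False if the record was censored.
    """
    durations = np.asarray(durations, dtype=float)
    event_observed = np.asarray(event_observed, dtype=bool)
    event_times = np.unique(durations[event_observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                     # includes censored users still under observation
        events = np.sum((durations == t) & event_observed)   # only observed events count
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)
```

For example, with durations `[1, 2, 3, 4]` and the second record censored, the user censored at time 2 is in the at-risk set at time 1 but drops out before time 3, which changes the denominator there from 3 to 2.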

[542] MetaOthello: A Controlled Study of Multiple World Models in Transformers

Aviral Chawla, Galen Hall, Juniper Lovato

Main category: cs.LG

TL;DR: Transformers trained on multiple Othello variants learn shared board-state representations rather than isolated sub-models, with representations transferring causally across variants through orthogonal rotations or layered specialization.

Details

Motivation: To understand how transformers organize multiple world models simultaneously, rather than studying capabilities in isolation, using controlled Othello variants to examine shared representation learning.

Method: Created MetaOthello suite with Othello variants sharing syntax but different rules/tokenizations, trained small GPTs on mixed-variant data, and analyzed representation organization using linear probes and interventions.

Result: Transformers converge on mostly shared board-state representations that transfer across variants; isomorphic games show equivalent representations up to orthogonal rotation; partially overlapping rules lead to layered specialization with game identity in middle layers.

Conclusion: Transformers organize multiple world models through shared representations rather than partitioned capacity, with MetaOthello providing a framework for studying multi-world model organization in transformers.

Abstract: Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single transformer organizes multiple, potentially conflicting “world models”. Previous experiments on Othello-playing neural networks test world-model learning but focus on a single game with a single set of rules. We introduce MetaOthello, a controlled suite of Othello variants with shared syntax but different rules or tokenizations, and train small GPTs on mixed-variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a mostly shared board-state representation that transfers causally across variants. Linear probes trained on one variant can intervene on another’s internal state with effectiveness approaching that of matched probes. For isomorphic games with token remapping, representations are equivalent up to a single orthogonal rotation that generalizes across layers. When rules partially overlap, early layers maintain game-agnostic representations while a middle layer identifies game identity, and later layers specialize. MetaOthello offers a path toward understanding not just whether transformers learn world models, but how they organize many at once.
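One common way to test whether two sets of representations are "equivalent up to an orthogonal rotation" is the orthogonal Procrustes problem, solved in closed form with an SVD. The sketch below is a generic illustration of that technique; whether the authors fit their rotation this way is an assumption, and the function name is hypothetical.

```python
import numpy as np

def fit_orthogonal_map(X, Y):
    """Orthogonal R minimizing ||X R - Y||_F, via the SVD of X^T Y.

    X, Y: (n_samples, d) activations for matched inputs from the two game variants.
    """
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt

def relative_alignment_error(X, Y, R):
    """Relative residual ||X R - Y||_F / ||Y||_F; near zero means a rotation suffices."""
    return float(np.linalg.norm(X @ R - Y) / np.linalg.norm(Y))
```

In this framing, fitting `R` on activations from one layer and measuring the residual on another layer would be one way to probe the abstract's claim that a single rotation generalizes across layers.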

[543] Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models

Gal Kesten-Pomeranz, Yaniv Nikankin, Anja Reusch, Tomer Tsaban, Ora Schueler-Furman, Yonatan Belinkov

Main category: cs.LG

TL;DR: PLMs detect protein repeats through a two-stage mechanism combining positional attention and specialized biological knowledge, with approximate repeat detection subsuming exact repeat mechanisms.

Details

Motivation: Protein sequences contain important repeating segments (both exact and approximate) that are crucial for structure and function. While recent work shows PLMs can identify these repeats, the internal mechanisms remain unclear, motivating investigation into how PLMs detect both types of repeats.

Method: Investigated PLM mechanisms for repeat detection by examining their behavior in masked-token prediction. Characterized the mechanism through analysis of attention heads and specialized components, identifying two main stages: feature representation building and induction head attention.

Result: Found that the mechanism for approximate repeats functionally subsumes that of exact repeats. PLMs first build feature representations using general positional attention heads and biologically specialized components (like amino-acid similarity encoding neurons), then induction heads attend to aligned tokens across repeated segments to promote correct answers.

Conclusion: PLMs solve protein repeat detection by combining language-based pattern matching with specialized biological knowledge, establishing a foundation for studying more complex evolutionary processes in protein language models.

Abstract: Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work has shown, by examining their behavior in masked-token prediction, that protein language models (PLMs) identify repeats. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.

[544] Closing the gap on tabular data with Fourier and Implicit Categorical Features

Marius Dragoi, Florin Gogianu, Elena Burceanu

Main category: cs.LG

TL;DR: Deep learning models lag behind tree-based methods on tabular data due to biases toward smooth solutions and uniform feature processing. The paper addresses this by using statistical feature processing to identify categorical-like features and Learned Fourier to mitigate smoothness bias, enabling deep models to match or surpass XGBoost performance.

Details

Motivation: Deep learning underperforms on tabular data compared to tree-based methods like XGBoost, which is considered the last "unconquered castle" for neural networks. The authors hypothesize that tree-based methods excel because they can better model non-linear interactions from categorical features, while neural networks have biases toward smooth solutions and uniform numerical processing.

Method: 1) Statistical-based feature processing to identify features strongly correlated with the target after discretization (features with categorical characteristics). 2) Learned Fourier approach to mitigate deep models’ bias for overly-smooth solutions that don’t align with tabular data properties.

Result: The proposed feature preprocessing significantly boosts deep learning model performance, enabling them to achieve performance that closely matches or surpasses XGBoost on comprehensive tabular data benchmarks.

Conclusion: The performance gap between deep learning and tree-based methods on tabular data can be bridged by addressing neural networks’ biases through appropriate feature processing and smoothness mitigation techniques.

Abstract: While Deep Learning has demonstrated impressive results in applications on various data types, it continues to lag behind tree-based methods when applied to tabular data, often referred to as the last “unconquered castle” for neural networks. We hypothesize that a significant advantage of tree-based methods lies in their intrinsic capability to model and exploit non-linear interactions induced by features with categorical characteristics. In contrast, neural-based methods exhibit biases toward uniform numerical processing of features and smooth solutions, making it challenging for them to effectively leverage such patterns. We address this performance gap by using statistical-based feature processing techniques to identify features that are strongly correlated with the target once discretized. We further mitigate the bias of deep models for overly-smooth solutions, a bias that does not align with the inherent properties of the data, using Learned Fourier. We show that our proposed feature preprocessing significantly boosts the performance of deep learning models and enables them to achieve a performance that closely matches or surpasses XGBoost on a comprehensive tabular data benchmark.
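A minimal sketch of a Fourier feature mapping, the kind of encoding commonly used to counter a network's bias toward smooth functions. The abstract does not spell out the "Learned Fourier" layer, so the form below (sine and cosine of a linear projection, with the frequency matrix treated as a trainable parameter) is an assumption, and the function name is hypothetical.

```python
import numpy as np

def fourier_features(x, freqs):
    """Map numeric features to [sin(2*pi*x@B), cos(2*pi*x@B)].

    x: (n_samples, n_features) numeric inputs.
    freqs: (n_features, n_frequencies) matrix B; in a learned variant this would be
        updated by gradient descent together with the rest of the network.
    """
    projection = 2.0 * np.pi * (x @ freqs)
    return np.concatenate([np.sin(projection), np.cos(projection)], axis=1)
```

Higher frequencies in `freqs` let a downstream network represent sharper, more step-like dependence on a feature, which is closer to the axis-aligned splits that tree-based models produce.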

[545] Efficient Real-Time Adaptation of ROMs for Unsteady Flows Using Data Assimilation

Ismaël Zighed, Andrea Nóvoa, Luca Magri, Taraneh Sayadi

Main category: cs.LG

TL;DR: Efficient retraining strategy for parameterized Reduced Order Models using VAE-transformer architecture with ensemble Kalman filtering for sparse data assimilation and real-time adaptation.

Details

Motivation: To develop an efficient ROM retraining method that achieves full retraining accuracy with minimal computational cost using only sparse observations, enabling real-time adaptation to new parameter regimes.

Method: Encode-process-decode architecture: VAE for dimensionality reduction, transformer network for latent state evolution and dynamics modeling. Parameterized by Reynolds number with attention mechanisms. Probabilistic VAE enables ensemble generation and uncertainty quantification. Uses ensemble Kalman filtering for data assimilation from sparse observations.

Result: Achieves accuracy comparable to full retraining with fraction of computational time. Enables reconstruction of full-state trajectories from minimal observations. Identifies latent manifold distortion as dominant error source, allowing lightweight autoencoder-only retraining for efficient adaptation.

Conclusion: Proposed method provides computationally efficient, real-time adaptation of ROMs to out-of-sample parameter regions using sparse data, with ensemble-based uncertainty quantification and data assimilation capabilities.

Abstract: We propose an efficient retraining strategy for a parameterized Reduced Order Model (ROM) that attains accuracy comparable to full retraining while requiring only a fraction of the computational time and relying solely on sparse observations of the full system. The architecture employs an encode-process-decode structure: a Variational Autoencoder (VAE) to perform dimensionality reduction, and a transformer network to evolve the latent states and model the dynamics. The ROM is parameterized by an external control variable, the Reynolds number in the Navier-Stokes setting, with the transformer exploiting attention mechanisms to capture both temporal dependencies and parameter effects. The probabilistic VAE enables stochastic sampling of trajectory ensembles, providing predictive means and uncertainty quantification through the first two moments. After initial training on a limited set of dynamical regimes, the model is adapted to out-of-sample parameter regions using only sparse data. Its probabilistic formulation naturally supports ensemble generation, which we employ within an ensemble Kalman filtering framework to assimilate data and reconstruct full-state trajectories from minimal observations. We further show that, for the dynamical system considered, the dominant source of error in out-of-sample forecasts stems from distortions of the latent manifold rather than changes in the latent dynamics. Consequently, retraining can be limited to the autoencoder, allowing for a lightweight, computationally efficient, real-time adaptation procedure with very sparse fine-tuning data.
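The ensemble Kalman filter analysis step used for assimilating sparse observations can be sketched in a few lines. This is the textbook stochastic (perturbed-observation) variant with a linear observation operator; the function name and variable names are hypothetical, and the paper's specific setup (latent-space ensembles drawn from the VAE) is not reproduced.

```python
import numpy as np

def enkf_update(ensemble, H, y, R, rng):
    """One stochastic EnKF analysis step.

    ensemble: (n_state, n_members) forecast ensemble.
    H: (n_obs, n_state) linear observation operator (sparse sensors pick a few rows).
    y: (n_obs,) observation vector.
    R: (n_obs, n_obs) observation-noise covariance.
    """
    n_members = ensemble.shape[1]
    anomalies = ensemble - ensemble.mean(axis=1, keepdims=True)
    P = anomalies @ anomalies.T / (n_members - 1)        # sample forecast covariance
    S = H @ P @ H.T + R                                  # innovation covariance
    K = np.linalg.solve(S, H @ P).T                      # Kalman gain P H^T S^{-1} (P and S are symmetric)
    # Each member assimilates its own noisy copy of the observation.
    noise = rng.multivariate_normal(np.zeros(len(y)), R, size=n_members).T
    innovations = (y[:, None] + noise) - H @ ensemble
    return ensemble + K @ innovations
```

The spread of the returned ensemble gives an uncertainty estimate after assimilation, which connects to the abstract's use of the first two moments for predictive means and uncertainty quantification.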

[546] Tell Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language

Max S. Bennett, Thomas P. Zollo, Richard Zemel

Main category: cs.LG

TL;DR: A generalized neural memory system that uses natural language instructions to control what models remember or forget during continual learning, enabling selective learning from heterogeneous information sources.

Details

Motivation: Current neural memory methods for continual learning assume fixed objectives and homogeneous information streams, giving users no control over what models remember or ignore over time. This is insufficient for real-world applications like healthcare and customer service where flexible, selective learning is needed.

Method: Proposes a generalized neural memory system that performs flexible updates based on natural language learning instructions. The system enables adaptive agents to learn selectively from heterogeneous information sources by interpreting user-specified instructions about what to remember or ignore.

Result: The approach supports settings where fixed-objective memory updates are insufficient, allowing for lightweight updates with minimal forgetting while giving users control over the learning process through natural language instructions.

Conclusion: Natural language instructions can effectively control neural memory systems for continual learning, enabling more flexible and selective adaptation to diverse, non-stationary environments compared to traditional fixed-objective approaches.

Abstract: Modern machine learning models are deployed in diverse, non-stationary environments where they must continually adapt to new tasks and evolving knowledge. Continual fine-tuning and in-context learning are costly and brittle, whereas neural memory methods promise lightweight updates with minimal forgetting. However, existing neural memory models typically assume a single fixed objective and homogeneous information streams, leaving users with no control over what the model remembers or ignores over time. To address this challenge, we propose a generalized neural memory system that performs flexible updates based on learning instructions specified in natural language. Our approach enables adaptive agents to learn selectively from heterogeneous information sources, supporting settings, such as healthcare and customer service, where fixed-objective memory updates are insufficient.

[547] Takeuchi’s Information Criteria as Generalization Measures for DNNs Close to NTK Regime

Hiroki Naganuma, Taiji Suzuki, Rio Yokota, Masahiro Nomura, Kohta Ishikawa, Ikuro Sato

Main category: cs.LG

TL;DR: TIC (Takeuchi’s information criterion) can effectively explain generalization gaps in DNNs near the NTK regime, but loses correlation outside this regime, and shows better hyperparameter trial pruning than existing methods.

Details

Motivation: Generalization measures for deep neural networks are challenging due to their statistical singularity and complex nature. The paper investigates whether the classical TIC measure can effectively explain generalization gaps in DNNs, particularly under what conditions it remains valid.

Method: Theoretical analysis of TIC applicability near the neural tangent kernel (NTK) regime, followed by extensive empirical evaluation: training over 5,000 DNN models with 12 architectures (including large models like VGG-16) on four datasets, estimating TIC values with several approximation methods, and examining correlation with generalization gaps.

Result: TIC values correlate well with generalization gaps under conditions close to the NTK regime, but this correlation disappears outside the NTK regime. TIC provides better trial pruning ability than existing methods for hyperparameter optimization.

Conclusion: TIC is an effective generalization measure for DNNs in the NTK regime, offering practical utility for hyperparameter optimization through trial pruning, but its applicability is limited to this specific regime.

Abstract: Generalization measures have been studied extensively in the machine learning community to better characterize generalization gaps. However, establishing a reliable generalization measure for statistically singular models such as deep neural networks (DNNs) is difficult due to their complex nature. This study focuses on Takeuchi’s information criterion (TIC) to investigate the conditions under which this classical measure can effectively explain the generalization gaps of DNNs. Importantly, the developed theory indicates the applicability of TIC near the neural tangent kernel (NTK) regime. In a series of experiments, we trained more than 5,000 DNN models with 12 architectures, including large models (e.g., VGG-16), on four datasets, and estimated the corresponding TIC values to examine the relationship between the generalization gap and the TIC estimates. We applied several TIC approximation methods with feasible computational costs and assessed the accuracy trade-off. Our experimental results indicate that the estimated TIC values correlate well with the generalization gap under conditions close to the NTK regime. However, we show both theoretically and empirically that outside the NTK regime such correlation disappears. Finally, we demonstrate that TIC provides better trial pruning ability than existing methods for hyperparameter optimization.
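
As a rough illustration of what a TIC-style estimate computes, the sketch below uses the classical form tr(H^{-1} C) / n, where H is the Hessian of the average loss and C is the covariance of per-sample gradients at the fitted parameters. It assumes a one-parameter model with loss 0.5 * (x - theta)^2, so H is the constant 1. The DNN approximations evaluated in the paper are not reproduced here, and `tic_gap_estimate` is a hypothetical name.

```python
def tic_gap_estimate(samples):
    """TIC-style generalization-gap estimate tr(H^{-1} C) / n for the
    one-parameter loss 0.5 * (x - theta)^2 (illustrative toy model)."""
    n = len(samples)
    theta = sum(samples) / n                   # minimizer of the average loss
    grads = [theta - x for x in samples]       # per-sample gradient w.r.t. theta
    hessian = 1.0                              # second derivative, constant here
    grad_cov = sum(g * g for g in grads) / n   # C: mean squared gradient (mean gradient is 0)
    return (grad_cov / hessian) / n            # tr(H^{-1} C) / n in one dimension

print(tic_gap_estimate([1.0, 2.0, 3.0, 4.0, 5.0]))
```

For this toy model the estimate reduces to the sample variance divided by n, which matches the known expected gap between test and training loss when estimating a mean.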

[548] Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity

Quang-Huy Nguyen, Jiaqi Wang, Wei-Shinn Ku

Main category: cs.LG

TL;DR: FedWQ-CP: A federated conformal prediction method that addresses uncertainty quantification in FL with both data and model heterogeneity through agent-server calibration in one communication round.

Details

Motivation: Federated learning lacks reliable uncertainty quantification, risking deployment of overconfident models at under-resourced agents. Existing approaches address data or model heterogeneity separately, not their joint effect on coverage reliability.

Method: FedWQ-CP performs agent-server calibration in one round: agents compute conformity scores on calibration data, derive local quantile thresholds, then transmit only thresholds and sample sizes to server. Server aggregates thresholds via weighted average to produce global threshold.

Result: Experimental results on seven public datasets for classification and regression show FedWQ-CP maintains agent-wise and global coverage while producing smallest prediction sets/intervals.

Conclusion: FedWQ-CP provides simple yet effective federated uncertainty quantification that balances coverage performance with efficiency under dual heterogeneity of data and models.

Abstract: Federated learning (FL) faces challenges in uncertainty quantification (UQ). Without reliable UQ, FL systems risk deploying overconfident models at under-resourced agents, leading to silent local failures despite seemingly satisfactory global performance. Existing federated UQ approaches often address data heterogeneity or model heterogeneity in isolation, overlooking their joint effect on coverage reliability across agents. Conformal prediction is a widely used distribution-free UQ framework, yet its application in heterogeneous FL settings remains underexplored. We provide FedWQ-CP, a simple yet effective approach that balances empirical coverage performance with efficiency at both global and agent levels under the dual heterogeneity. FedWQ-CP performs agent-server calibration in a single communication round. On each agent, conformity scores are computed on calibration data and a local quantile threshold is derived. Each agent then transmits only its quantile threshold and calibration sample size to the server. The server simply aggregates these thresholds through a weighted average to produce a global threshold. Experimental results on seven public datasets for both classification and regression demonstrate that FedWQ-CP empirically maintains agent-wise and global coverage while producing the smallest prediction sets or intervals.
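
The one-round calibration described above can be sketched as follows. This is a minimal illustration, assuming the standard split-conformal quantile on each agent and sample-size weights on the server; the abstract does not spell out either choice, and the function names and scores are hypothetical.

```python
import math

def local_threshold(scores, alpha):
    """Agent side: split-conformal quantile, i.e. the
    ceil((n + 1) * (1 - alpha))-th smallest conformity score."""
    n = len(scores)
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(scores)[rank - 1]

def aggregate(reports):
    """Server side: sample-size-weighted average of (threshold, n) pairs."""
    total = sum(n for _, n in reports)
    return sum(q * n for q, n in reports) / total

# Hypothetical calibration scores held by two agents.
scores_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
scores_b = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
alpha = 0.25

# Each agent sends only its threshold and its calibration sample size.
reports = [(local_threshold(s, alpha), len(s)) for s in (scores_a, scores_b)]
global_threshold = aggregate(reports)
print(reports, global_threshold)
```

Only two numbers per agent cross the network, which is why a single communication round suffices.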

[549] Physics Informed Viscous Value Representations

Hrishikesh Viswanath, Juanwu Lu, S. Talha Bukhari, Damon Conover, Ziran Wang, Aniket Bera

Main category: cs.LG

TL;DR: Physics-informed regularization for offline goal-conditioned RL using Hamilton-Jacobi-Bellman equation viscosity solutions and Feynman-Kac theorem for stable Monte Carlo estimation.

Details

Motivation: Offline goal-conditioned RL suffers from inaccurate value estimation due to limited dataset coverage. Existing physics-informed approaches using first-order PDEs (like Eikonal equation) become ill-posed in complex, high-dimensional environments.

Method: Proposes physics-informed regularization derived from viscosity solutions of Hamilton-Jacobi-Bellman equation. Uses Feynman-Kac theorem to recast PDE solution as expectation, enabling tractable Monte Carlo estimation that avoids numerical instability in higher-order gradients.

Result: Method improves geometric consistency and is broadly applicable to navigation and high-dimensional complex manipulation tasks. Provides better value estimation in offline goal-conditioned RL settings.

Conclusion: Physics-informed regularization grounded in optimal control theory (via HJB viscosity solutions) with Monte Carlo estimation via Feynman-Kac theorem effectively addresses value estimation challenges in offline goal-conditioned RL.

Abstract: Offline goal-conditioned reinforcement learning (GCRL) learns goal-conditioned policies from static pre-collected datasets. However, accurate value estimation remains a challenge due to the limited coverage of the state-action space. Recent physics-informed approaches have sought to address this by imposing physical and geometric constraints on the value function through regularization defined over first-order partial differential equations (PDEs), such as the Eikonal equation. However, these formulations can often be ill-posed in complex, high-dimensional environments. In this work, we propose a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation. By providing a physics-based inductive bias, our approach grounds the learning process in optimal control theory, explicitly regularizing and bounding updates during value iterations. Furthermore, we leverage the Feynman-Kac theorem to recast the PDE solution as an expectation, enabling a tractable Monte Carlo estimation of the objective that avoids numerical instability in higher-order gradients. Experiments demonstrate that our method improves geometric consistency, making it broadly applicable to navigation and high-dimensional, complex manipulation tasks. Open-source codes are available at https://github.com/HrishikeshVish/phys-fk-value-GCRL.
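
The idea of recasting a PDE solution as an expectation can be shown on the simplest case. The sketch below assumes the 1-D heat equation u_t = nu * u_xx with initial condition g, for which Feynman-Kac gives u(x, t) = E[g(x + sqrt(2 * nu * t) * Z)] with Z standard normal. It is not the paper's HJB regularizer, and `feynman_kac_heat` is a hypothetical name.

```python
import math
import random

def feynman_kac_heat(g, x, t, nu, n_samples, rng):
    """Monte Carlo estimate of u(x, t) for u_t = nu * u_xx, u(., 0) = g,
    using u(x, t) = E[g(x + sqrt(2 * nu * t) * Z)], Z ~ N(0, 1)."""
    scale = math.sqrt(2.0 * nu * t)
    total = 0.0
    for _ in range(n_samples):
        total += g(x + scale * rng.gauss(0.0, 1.0))
    return total / n_samples

rng = random.Random(0)
estimate = feynman_kac_heat(lambda s: s * s, x=1.0, t=0.5, nu=0.1,
                            n_samples=20000, rng=rng)
exact = 1.0 ** 2 + 2.0 * 0.1 * 0.5   # for g(s) = s^2 the solution is x^2 + 2 * nu * t
print(estimate, exact)
```

The estimate needs only samples of g and no second derivatives, which is the property the abstract points to when it mentions avoiding instability in higher-order gradients.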

[550] Inferential Mechanics Part 1: Causal Mechanistic Theories of Machine Learning in Chemical Biology with Implications

Ilya Balabin, Thomas M. Kaiser

Main category: cs.LG

TL;DR: A theoretical framework combining chemical biology, probability theory, and causality to address black-box ML issues in natural sciences, introducing the concept of “focus” for ML algorithms to uncover hidden mechanisms.

Details

Motivation: Current ML models in natural sciences are treated as black boxes without considering causal structure, lacking unified theoretical treatment for causality in ML applications to chemical biology.

Method: Develops formal framework combining chemical theory, biological theory, probability theory, and causality; introduces “focus” concept for ML algorithms; provides initial proof on Akt inhibitors family.

Result: Establishes foundational causal structure framework for chemical biology phenomena extended to ML; introduces novel “focus” concept; provides initial validation on Akt inhibitors.

Conclusion: First part of series establishing “inferential mechanics” - a new mathematical framework for modeling mechanisms in chemical biology without reductionism tools, addressing causal flaws in ML.

Abstract: Machine learning techniques are now routinely encountered in research laboratories across the globe. Impressive progress has been made through ML and AI techniques with regards to large data set processing. This progress has increased the ability of the experimenter to digest data and make novel predictions regarding phenomena of interest. However, machine learning predictors generated from data sets taken from the natural sciences are often treated as black boxes which are used broadly and generally without detailed consideration of the causal structure of the data set of interest. Work has been attempted to bring causality into discussions of machine learning models of natural phenomena; however, a firm and unified theoretical treatment is lacking. This series of three papers explores the union of chemical theory, biological theory, probability theory and causality that will correct current causal flaws of machine learning in the natural sciences. This paper, Part 1 of the series, provides the formal framework of the foundational causal structure of phenomena in chemical biology and is extended to machine learning through the novel concept of focus, defined here as the ability of a machine learning algorithm to narrow down to a hidden underpinning mechanism in large data sets. Initial proof of these principles on a family of Akt inhibitors is also provided. The second paper containing Part 2 will provide a formal exploration of chemical similarity, and Part 3 will present extensive experimental evidence of how hidden causal structures weaken all machine learning in chemical biology. This series serves to establish for chemical biology a new kind of mathematical framework for modeling mechanisms in Nature without the need for the tools of reductionism: inferential mechanics.

[551] A Proper Scoring Rule for Virtual Staining

Samuel Tonks, Steve Hood, Ryan Musso, Ceridwen Hopely, Steve Titus, Minh Doan, Iain Styles, Alexander Krull

Main category: cs.LG

TL;DR: A framework for evaluating generative virtual staining models using information gain to assess predicted posterior distributions at the cell level, enabling comparison across models and features.

Details

Motivation: Current evaluation protocols for generative virtual staining models only assess marginal distribution accuracy over datasets, not the predicted posterior distributions for individual cells, which is crucial for high-throughput screening applications.

Method: Introduces information gain (IG) as a cell-wise evaluation framework that serves as a strictly proper scoring rule for assessing predicted posteriors, with theoretical motivation for interpretability and cross-model comparison.

Result: Evaluation of diffusion- and GAN-based models on an extensive HTS dataset shows that IG can reveal substantial performance differences that other metrics cannot detect.

Conclusion: Information gain provides a theoretically sound, interpretable framework for directly evaluating predicted posterior distributions in generative virtual staining models, enabling better model comparison and assessment.

Abstract: Generative virtual staining (VS) models for high-throughput screening (HTS) can provide an estimated posterior distribution of possible biological feature values for each input and cell. However, when evaluating a VS model, the true posterior is unavailable. Existing evaluation protocols only check the accuracy of the marginal distribution over the dataset rather than the predicted posteriors. We introduce information gain (IG) as a cell-wise evaluation framework that enables direct assessment of predicted posteriors. IG is a strictly proper scoring rule and comes with a sound theoretical motivation allowing for interpretability, and for comparing results across models and features. We evaluate diffusion- and GAN-based models on an extensive HTS dataset using IG and other metrics and show that IG can reveal substantial performance differences other metrics cannot.

[552] Differentiable Zero-One Loss via Hypersimplex Projections

Camilo Gomez, Pengyang Wang, Liansheng Tang

Main category: cs.LG

TL;DR: Differentiable approximation of zero-one loss via Soft-Binary-Argmax operator for improved large-batch training in classification tasks.

Details

Motivation: The zero-one loss is considered the gold standard for classification but is incompatible with gradient-based optimization due to non-differentiability. The paper aims to bridge this gap by creating a differentiable approximation that can be integrated into end-to-end learning systems.

Method: Introduces Soft-Binary-Argmax, a smooth, order-preserving projection onto the n,k-dimensional hypersimplex through a constrained optimization framework. The method constructs a differentiable approximation to the zero-one loss and shows how its Jacobian can be efficiently computed and integrated into binary and multiclass learning systems.

Result: Empirically achieves significant improvements in generalization under large-batch training by imposing geometric consistency constraints on output logits, narrowing the performance gap traditionally observed in large-batch training.

Conclusion: The proposed differentiable approximation to zero-one loss enables better integration of structured optimization components into end-to-end differentiable models, improving classification performance particularly in challenging large-batch training scenarios.

Abstract: Recent advances in machine learning have emphasized the integration of structured optimization components into end-to-end differentiable models, enabling richer inductive biases and tighter alignment with task-specific objectives. In this work, we introduce a novel differentiable approximation to the zero-one loss, long considered the gold standard for classification performance yet incompatible with gradient-based optimization due to its non-differentiability. Our method constructs a smooth, order-preserving projection onto the n,k-dimensional hypersimplex through a constrained optimization framework, leading to a new operator we term Soft-Binary-Argmax. After deriving its mathematical properties, we show how its Jacobian can be efficiently computed and integrated into binary and multiclass learning systems. Empirically, our approach achieves significant improvements in generalization under large-batch training by imposing geometric consistency constraints on the output logits, thereby narrowing the performance gap traditionally observed in large-batch training.
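
For orientation, the sketch below computes the plain Euclidean projection onto the hypersimplex {x in [0, 1]^n : sum(x) = k}, found by bisection on a shift tau. This is a swapped-in, non-smooth relative of the operator described above, not Soft-Binary-Argmax itself, and `project_hypersimplex` is a hypothetical name.

```python
def project_hypersimplex(z, k, iters=100):
    """Euclidean projection of z onto {x in [0, 1]^n : sum(x) = k}.
    The solution has the form x_i = clip(z_i - tau, 0, 1); tau is found
    by bisection because the clipped sum is non-increasing in tau."""
    lo, hi = min(z) - 1.0, max(z)   # clipped sum is n at lo and 0 at hi
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        total = sum(min(1.0, max(0.0, v - tau)) for v in z)
        if total > k:
            lo = tau                # sum too large: shift further
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    return [min(1.0, max(0.0, v - tau)) for v in z]

logits = [2.0, 0.5, -1.0, 0.4]
x = project_hypersimplex(logits, k=2)
print(x)
```

The projection preserves the order of the inputs and pushes the largest entries toward 1, which is the kind of geometric behavior a relaxed top-k or argmax operator needs.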

[553] Mean Estimation from Coarse Data: Characterizations and Efficient Algorithms

Alkis Kalavasis, Anay Mehrotra, Manolis Zampetakis, Felix Zhou, Ziyu Zhu

Main category: cs.LG

TL;DR: The paper studies Gaussian mean estimation from coarse data where only partial information (sets containing samples) is observed, focusing on identifiability conditions and computational efficiency for convex partitions.

Details

Motivation: Coarse data naturally arises in many applications (measurement rounding, sensor limitations, economic systems) where only partial information about samples is available. Understanding when and how to efficiently estimate parameters from such data is crucial for practical applications.

Method: The paper studies Gaussian mean estimation where true samples are drawn from a d-dimensional Gaussian with identity covariance but are only revealed through the set of a partition containing the sample. It investigates identifiability conditions for convex partitions and develops computationally efficient estimation methods.

Result: The work resolves two fundamental questions: (1) characterizes when the mean is identifiable under convex partitions, and (2) establishes that computationally efficient estimation is possible under identifiability and convex partitions.

Conclusion: The paper provides a complete theoretical understanding of Gaussian mean estimation from coarse data with convex partitions, establishing both identifiability conditions and computational feasibility for efficient estimation.

Abstract: Coarse data arise when learners observe only partial information about samples; namely, a set containing the sample rather than its exact value. This occurs naturally through measurement rounding, sensor limitations, and lag in economic systems. We study Gaussian mean estimation from coarse data, where each true sample $x$ is drawn from a $d$-dimensional Gaussian distribution with identity covariance, but is revealed only through the set of a partition containing $x$. When the coarse samples, roughly speaking, have “low” information, the mean cannot be uniquely recovered from observed samples (i.e., the problem is not identifiable). Recent work by Fotakis, Kalavasis, Kontonis, and Tzamos [FKKT21] established that sample-efficient mean estimation is possible when the unknown mean is identifiable and the partition consists of only convex sets. Moreover, they showed that without convexity, mean estimation becomes NP-hard. However, two fundamental questions remained open: (1) When is the mean identifiable under convex partitions? (2) Is computationally efficient estimation possible under identifiability and convex partitions? This work resolves both questions. […]
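
To make the coarse-data setting concrete, the sketch below assumes the 1-D case with unit variance and a partition into unit intervals [j, j + 1), as produced by rounding down. Each observation contributes the log-probability of its interval, and the mean is estimated by maximizing that likelihood with a ternary search. This is ordinary maximum likelihood for illustration, not the paper's algorithm; all names and numbers are hypothetical.

```python
import math
import random
from collections import Counter

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def coarse_nll(mu, cell_counts):
    """Negative log-likelihood of interval observations under N(mu, 1)."""
    total = 0.0
    for (a, b), count in cell_counts.items():
        prob = normal_cdf(b - mu) - normal_cdf(a - mu)
        total -= count * math.log(max(prob, 1e-300))  # guard against log(0)
    return total

def coarse_mle(cell_counts, iters=80):
    """Ternary search; the objective is convex in mu for interval cells."""
    lo = min(a for a, _ in cell_counts)
    hi = max(b for _, b in cell_counts)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if coarse_nll(m1, cell_counts) < coarse_nll(m2, cell_counts):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

rng = random.Random(0)
true_mu = 0.3
samples = [rng.gauss(true_mu, 1.0) for _ in range(2000)]
# The learner sees only the unit interval containing each sample.
cells = Counter((math.floor(x), math.floor(x) + 1.0) for x in samples)
mu_hat = coarse_mle(cells)
print(mu_hat)
```

Convexity of the cells is what keeps the objective well behaved here; the abstract notes that without convexity the problem becomes NP-hard.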

[554] FlashOptim: Optimizers for Memory Efficient Training

Jose Javier Gonzalez Ortiz, Abhay Gupta, Chris Renard, Davis Blalock

Main category: cs.LG

TL;DR: FlashOptim reduces training memory by over 50% through improved master weight splitting and 8-bit optimizer state quantization, cutting AdamW memory from 16 to 7 bytes per parameter without quality degradation.

Details

Motivation: Standard mixed-precision training requires substantial memory (16 bytes per parameter for AdamW), making it impractical to train even a 7B-parameter model for researchers with less than 100GB of accelerator memory.

Method: Two key techniques: 1) Improved master weight splitting with tight quantization error bounds, 2) Companding functions for 8-bit optimizer state quantization. Combined with 16-bit gradients, reduces memory footprint significantly.

Result: Reduces AdamW memory from 16 bytes to 7 bytes per parameter (5 bytes with gradient release), cuts checkpoint sizes by >50%, shows no measurable quality degradation on vision/language benchmarks including Llama-3.1-8B finetuning.

Conclusion: FlashOptim enables training of larger models with limited memory resources while maintaining full model quality and API compatibility, making large-scale model training more accessible.

Abstract: Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory. We introduce FlashOptim, a suite of optimizations that reduces per-parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8-bit optimizer state quantization. Together with 16-bit gradients, these techniques reduce AdamW memory from 16 bytes to 7 bytes per parameter, or 5 bytes with gradient release. They also cut model checkpoint sizes by more than half. Experiments with FlashOptim applied to SGD, AdamW, and Lion show no measurable quality degradation on any task from a collection of standard vision and language benchmarks, including Llama-3.1-8B finetuning.
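
The per-parameter figures above translate into total memory as follows. This is back-of-the-envelope arithmetic for a 7 billion parameter model, assuming 1 GB = 10^9 bytes and counting only parameters, gradients, and optimizer state (activations and other overhead are ignored); the helper name is hypothetical.

```python
def training_memory_gb(num_params, bytes_per_param):
    """Memory for weights, gradients, and optimizer state, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 7_000_000_000

baseline_gb = training_memory_gb(NUM_PARAMS, 16)       # standard AdamW: 112 GB
flashoptim_gb = training_memory_gb(NUM_PARAMS, 7)      # reported 7 bytes: 49 GB
with_release_gb = training_memory_gb(NUM_PARAMS, 5)    # with gradient release: 35 GB
reduction = 1 - 7 / 16                                 # 0.5625, i.e. over 50%

print(baseline_gb, flashoptim_gb, with_release_gb, reduction)
```

At 16 bytes per parameter the model state alone exceeds 100 GB, while at 7 bytes it falls below half of that, consistent with the claimed reduction of over 50%.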

[555] SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

Main category: cs.LG

TL;DR: SOTAlign: A two-stage semi-supervised framework for aligning pretrained vision and language models using limited paired data and abundant unpaired data via optimal transport.

Details

Motivation: Current multimodal alignment methods require contrastive losses and millions of paired samples. The paper explores whether meaningful alignment can be achieved with substantially less supervision by leveraging both limited paired data and abundant unpaired data.

Method: Two-stage framework: 1) Recovers coarse shared geometry from limited paired data using a linear teacher, 2) Refines alignment on unpaired samples via optimal-transport-based divergence that transfers relational structure without overconstraining the target space.

Result: SOTAlign learns robust joint embeddings across datasets and encoder pairs, significantly outperforming both supervised and semi-supervised baselines.

Conclusion: The method demonstrates that meaningful alignment between vision and language models can be achieved with substantially less supervision than current approaches, effectively leveraging unpaired data through optimal transport techniques.

Abstract: The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.

[556] A Dataset is Worth 1 MB

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

Main category: cs.LG

TL;DR: PLADA eliminates pixel transmission for dataset distribution by using pre-loaded reference datasets and transmitting only class labels for semantically relevant images, achieving <1MB payloads while maintaining classification accuracy.

Details

Motivation: Current dataset distribution methods incur massive communication costs when transmitting large payloads to multiple clients. Dataset distillation struggles with high-resolution data and doesn't achieve sufficiently small file sizes. There's a need for efficient dataset serving without transmitting raw pixel data.

Method: PLADA assumes clients have pre-loaded generic unlabeled reference datasets (e.g., ImageNet). Instead of transmitting pixels, only class labels for specific images are sent. A pruning mechanism filters the reference dataset to retain only labels of the most semantically relevant images for the target task, maximizing training efficiency while minimizing transmission payload.

Result: Experiments on 10 diverse datasets show the approach can transfer task knowledge with payloads less than 1 MB while retaining high classification accuracy. The method effectively addresses distribution mismatch between reference and target datasets.

Conclusion: PLADA offers a promising solution for efficient dataset serving by completely eliminating pixel transmission, using pre-loaded reference datasets and transmitting only class labels, achieving both minimal payload size and high task performance.

Abstract: A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, agents require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small files. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume agents are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K) and communicate a new task by transmitting only the class labels for specific images. To address the distribution mismatch between the reference and target datasets, we introduce a pruning mechanism that filters the reference dataset to retain only the labels of the most semantically relevant images for the target task. This selection process simultaneously maximizes training efficiency and minimizes transmission payload. Experiments on 10 diverse datasets demonstrate that our approach can transfer task knowledge with a payload of less than 1 MB while retaining high classification accuracy, offering a promising solution for efficient dataset serving.
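
A rough payload estimate shows why sending labels instead of pixels can fit under 1 MB. The encoding below is purely an assumption for illustration (one bit per reference image to mark whether it is kept, plus a fixed-width label per kept image); the abstract does not describe PLADA's actual encoding, and the kept-image count and class count are hypothetical. The reference size is the ImageNet-1K training set.

```python
def payload_bytes(num_reference, num_kept, num_classes):
    """Hypothetical encoding: a keep/drop bitmask over the reference set
    plus a fixed-width class label for every kept image."""
    mask_bytes = (num_reference + 7) // 8                # one bit per reference image
    bits_per_label = max(1, (num_classes - 1).bit_length())
    label_bytes = (num_kept * bits_per_label + 7) // 8
    return mask_bytes + label_bytes

IMAGENET_1K_TRAIN = 1_281_167   # images in the ImageNet-1K training set

size = payload_bytes(IMAGENET_1K_TRAIN, num_kept=200_000, num_classes=100)
print(size, size < 1_000_000)
```

Under these assumptions the payload is a few hundred kilobytes, and pruning the reference set further shrinks the label portion in direct proportion.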

[557] Model Agreement via Anchoring

Eric Eaton, Surbhi Goel, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell

Main category: cs.LG

TL;DR: The paper develops a general technique for bounding model disagreement in machine learning and applies it to prove disagreement bounds for four common algorithms: stacked aggregation, gradient boosting, neural network architecture search, and regression trees.

Details

Motivation: To understand and control model disagreement - the extent to which independently trained models disagree in their predictions - and develop techniques to drive disagreement to zero using natural training parameters.

Method: Develops an “anchoring” technique that bounds independent model disagreement by analyzing the average of two models, then applies this technique to analyze four algorithms: stacked aggregation, gradient boosting, neural network architecture search, and regression trees.

Result: Proves disagreement bounds showing that disagreement can be driven to zero with increasing parameters: number of models in stacking, iterations in boosting, architecture size in neural search, and tree depth in regression trees.

Conclusion: Provides a general framework for analyzing model disagreement that applies to multiple common machine learning algorithms, showing how disagreement can be systematically controlled through training parameters.

Abstract: Numerous lines of work aim to control $\textit{model disagreement}$ – the extent to which two machine learning models disagree in their predictions. We adopt a simple and standard notion of model disagreement in real-valued prediction problems, namely the expected squared difference in predictions between two models trained on independent samples, without any coordination of the training processes. We would like to be able to drive disagreement to zero with some natural parameter(s) of the training procedure using analyses that can be applied to existing training methodologies. We develop a simple general technique for proving bounds on independent model disagreement based on $\textit{anchoring}$ to the average of two models within the analysis. We then apply this technique to prove disagreement bounds for four commonly used machine learning algorithms: (1) stacked aggregation over an arbitrary model class (where disagreement is driven to 0 with the number of models $k$ being stacked); (2) gradient boosting (where disagreement is driven to 0 with the number of iterations $k$); (3) neural network training with architecture search (where disagreement is driven to 0 with the size $n$ of the architecture being optimized over); and (4) regression tree training over all regression trees of fixed depth (where disagreement is driven to 0 with the depth $d$ of the tree architecture). For clarity, we work out our initial bounds in the setting of one-dimensional regression with squared error loss – but then show that all of our results generalize to multi-dimensional regression with any strongly convex loss.
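
The disagreement measure defined above, and one reason the average of two models is a convenient anchor, can be checked numerically. The sketch assumes two least-squares lines fit on independent samples from a hypothetical data-generating process, and verifies the elementary identity (f1 - f2)^2 = 2 (f1 - avg)^2 + 2 (f2 - avg)^2. The paper's actual proofs may use the anchor differently.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def draw_sample(n, rng):
    """Hypothetical data: y = 1 + 2x plus Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(0)
a1, b1 = fit_line(*draw_sample(50, rng))   # two models trained on
a2, b2 = fit_line(*draw_sample(50, rng))   # independent samples
test_xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]

def f1(x): return a1 + b1 * x
def f2(x): return a2 + b2 * x
def anchor(x): return 0.5 * (f1(x) + f2(x))   # average of the two models

def mean_sq_diff(f, g):
    return sum((f(x) - g(x)) ** 2 for x in test_xs) / len(test_xs)

disagreement = mean_sq_diff(f1, f2)
via_anchor = 2.0 * mean_sq_diff(f1, anchor) + 2.0 * mean_sq_diff(f2, anchor)
print(disagreement, via_anchor)
```

Because the two quantities coincide, bounding each model's distance to the shared average is enough to bound the disagreement between the models themselves.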

[558] Solving stiff dark matter equations via Jacobian Normalization with Physics-Informed Neural Networks

M. P. Bento, H. B. Câmara, J. R. Rocha, J. F. Seabra

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.21988: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21988&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[559] A Synergistic Approach: Dynamics-AI Ensemble in Tropical Cyclone Forecasting

Yonghui Li, Wansuo Duan, Hao Li, Wei Han, Han Zhang, Yinuo Li

Main category: cs.LG

TL;DR: Unable to analyze paper 2602.22533 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting error

Method: Cannot determine method as abstract is unavailable due to rate limiting error

Result: Cannot determine results as abstract is unavailable due to rate limiting error

Conclusion: Cannot draw conclusions about paper content due to missing abstract information

Abstract: Failed to fetch summary for 2602.22533: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22533&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[560] Procedural Fairness in Machine Learning

Ziming Wang, Changwu Huang, Ke Tang, Xin Yao

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to draw conclusions due to access restrictions

Abstract: Failed to fetch summary for 2404.01877: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.01877&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[561] Efficient Graph Coloring with Neural Networks: A Physics-Inspired Approach for Large Graphs

Lorenzo Colantonio, Andrea Cacioppo, Federico Scarpati, Maria Chiara Angelini, Federico Ricci-Tersenghi, Stefano Giagu

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval failure

Method: Unable to determine method due to retrieval failure

Result: Unable to determine results due to retrieval failure

Conclusion: Unable to draw conclusions due to retrieval failure

Abstract: Failed to fetch summary for 2408.01503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.01503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[562] Beyond Attribution: Unified Concept-Level Explanations

Junhao Liu, Haonan Yu, Xin Zhang

Main category: cs.LG

TL;DR: Unable to analyze paper 2410.12439 due to HTTP 429 error when fetching summary from arXiv API

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot draw conclusions due to inability to access paper content

Abstract: Failed to fetch summary for 2410.12439: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.12439&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[563] Neuro-Symbolic AI for Analytical Solutions of Differential Equations

Orestis Oikonomou, Levi Lingsch, Dana Grund, Siddhartha Mishra, Georgios Kissas

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). Paper ID: 2502.01476

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2502.01476: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.01476&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[564] Mixing It Up: Exploring Mixer Networks for Irregular Multivariate Time Series Forecasting

Christian Klötergens, Tim Dernedde, Lars Schmidt-Thieme, Vijaya Krishna Yalavarthi

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval failure

Method: Unable to determine method due to retrieval failure

Result: Unable to determine results due to retrieval failure

Conclusion: Unable to draw conclusions due to retrieval failure

Abstract: Failed to fetch summary for 2502.11816: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.11816&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[565] Global graph features unveiled by unsupervised geometric deep learning

Mirja Granfors, Jesús Pineda, Blanca Zufiria Gerbolés, Joana B. Pereira, Carlo Manzo, Giovanni Volpe

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2503.05560: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.05560&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[566] Sample Compression for Self Certified Continual Learning

Jacob Comeau, Mathieu Bazinet, Pascal Germain, Cem Subakan

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2503.10503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.10503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[567] Extensions of the regret-minimization algorithm for optimal design

Youguang Chen, George Biros

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2503.19874: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.19874&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[568] Conformal Prediction with Corrupted Labels: Uncertain Imputation and Robust Re-weighting

Shai Feldman, Stephen Bates, Yaniv Romano

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2505.04733: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.04733&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[569] Density Ratio-based Causal Discovery from Bivariate Continuous-Discrete Data

Takashi Nicholas Maeda, Shohei Shimizu, Hidetoshi Matsui

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2505.08371 suggests it’s from May 2025, but content is unavailable.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2505.08371: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.08371&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[570] FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization

Shengyu Feng, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed data retrieval

Method: Unable to determine method due to failed data retrieval

Result: Unable to determine results due to failed data retrieval

Conclusion: Unable to determine conclusion due to failed data retrieval

Abstract: Failed to fetch summary for 2505.16952: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16952&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[571] The Spacetime of Diffusion Models: An Information Geometry Perspective

Rafał Karczewski, Markus Heinonen, Alison Pouplin, Søren Hauberg, Vikas Garg

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2505.17517: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17517&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[572] On the Lipschitz Continuity of Set Aggregation Functions and Neural Networks for Sets

Giannis Nikolentzos, Konstantinos Skianis

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot draw conclusions due to inability to access paper content

Abstract: Failed to fetch summary for 2505.24403: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.24403&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[573] RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

Rohan Gupta, Erik Jenner

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2506.14261: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.14261&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[574] Learning Task-Agnostic Motifs to Capture the Continuous Nature of Animal Behavior

Jiyi Wang, Jingyang Ke, Bo Dai, Anqi Wu

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2506.15190: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.15190&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[575] Skewed Score: A statistical framework to assess autograders

Magda Dubois, Harry Coppock, Mario Giulianelli, Timo Flesch, Lennart Luettgau, Cozmin Ududec

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2507.03772: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.03772&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[576] Fast and Flexible Probabilistic Forecasting of Dynamical Systems using Flow Matching and Physical Perturbation

Siddharth Rout, Eldad Haber, Stephane Gaudreault

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2508.01101: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.01101&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[577] Zero-Variance Gradients for Variational Autoencoders

Zilei Shao, Anji Liu, Guy Van den Broeck

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2508.03587: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.03587&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[578] Online time series prediction using feature adjustment

Xiannan Huang, Shuhan Qiu, Jiayuan Du, Chao Yang

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2509.03810: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03810&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[579] Random Matrix Theory-guided sparse PCA for single-cell RNA-seq data

Victor Chardès

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2509.15429: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.15429&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[580] Information-Theoretic Bayesian Optimization for Bilevel Optimization Problems

Takuya Kanayama, Yuki Ito, Tomoyuki Tamura, Masayuki Karasuyama

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2509.21725: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21725&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[581] Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper fetch

Method: Cannot determine method due to failed paper fetch

Result: Cannot determine results due to failed paper fetch

Conclusion: Cannot determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2509.21936: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21936&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[582] Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method without access to paper content

Result: No results available due to technical limitations in accessing the paper

Conclusion: Cannot provide analysis due to arXiv API rate limiting preventing access to paper 2509.26238

Abstract: Failed to fetch summary for 2509.26238: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26238&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[583] Simplex-to-Euclidean Bijections for Categorical Flow Matching

Bernardo Williams, Victor M. Yeom-Song, Marcelo Hartmann, Arto Klami

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2510.27480: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.27480&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[584] Learning Hamiltonian Flow Maps: Mean Flow Consistency for Large-Timestep Molecular Dynamics

Winfried Ripken, Michael Plainer, Gregor Lied, Thorben Frank, Oliver T. Unke, Stefan Chmiela, Frank Noé, Klaus-Robert Müller

Main category: cs.LG

TL;DR: Failed to fetch paper summary - HTTP 429 error (rate limiting) prevents access to arXiv API for paper ID 2601.22123

Details

Motivation: Unable to determine motivation due to API access failure

Method: Unable to determine method due to API access failure

Result: Unable to determine results due to API access failure

Conclusion: Unable to determine conclusion due to API access failure

Abstract: Failed to fetch summary for 2601.22123: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22123&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[585] Beyond Fixed Rounds: Data-Free Early Stopping for Practical Federated Learning

Youngjoon Lee, Hyukjoon Lee, Seungrok Jung, Andy Luo, Jinu Gong, Yang Cao, Joonhyuk Kang

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2601.22669: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22669&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[586] Agentic Framework for Epidemiological Modeling

Rituparna Datta, Zihan Guan, Baltazar Espinoza, Yiqi Su, Priya Pitre, Srini Venkatramanan, Naren Ramakrishnan, Anil Vullikanti

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2602.00299 suggests it’s from February 2026, but no content available for analysis.

Details

Motivation: Cannot determine motivation without access to paper content. The arXiv API rate limiting prevents fetching the abstract.

Method: No method information available due to API rate limiting error.

Result: No results available as the paper content could not be retrieved.

Conclusion: Unable to analyze paper due to technical limitations in accessing content.

Abstract: Failed to fetch summary for 2602.00299: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00299&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[587] Phase Transitions for Feature Learning in Neural Networks

Andrea Montanari, Zihao Wang

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.01434: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01434&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[588] Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification

Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2602.05535: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05535&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[589] Learning Credal Ensembles via Distributionally Robust Optimization

Kaizheng Wang, Ghifari Adam Faza, Fabio Cuzzolin, Siu Lun Chau, David Moens, Hans Hallez

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.08470: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.08470&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[590] When Less is More: The LLM Scaling Paradox in Context Compression

Ruishan Guo, Yibing Liu, Guoxin Ma, Yan Wang, Yueyang Zhang, Long Xia, Kecheng Chen, Zhiyuan Sun, Daiting Shi

Main category: cs.LG

TL;DR: Unable to analyze paper 2602.09789 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot draw conclusions due to inability to access paper content

Abstract: Failed to fetch summary for 2602.09789: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09789&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[591] ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

DatologyAI, Aldo Gael Carranza, Kaleigh Mentzer, Ricardo Pio Monti, Alex Fang, Alvin Deng, Amro Abbas, Anshuman Suri, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Haakon Mongstad, Haoli Yin, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Luke Merrick, Maximilian Böther, Parth Doshi, Paul Burstein, Pratyush Maini, Rishabh Adiga, Siddharth Joshi, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.15210: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15210&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[592] Benchmarking IoT Time-Series AD with Event-Level Augmentations

Dmitry Zhevnenko, Ilya Makarov, Aleksandr Kovalenko, Fedor Meshchaninov, Anton Kozhukhov, Vladislav Travnikov, Makar Ippolitov, Kirill Yashunin, Iurii Katser

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in accessing paper content

Method: Unable to determine method due to technical error in accessing paper content

Result: Unable to determine results due to technical error in accessing paper content

Conclusion: Unable to determine conclusion due to technical error in accessing paper content

Abstract: Failed to fetch summary for 2602.15457: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15457&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[593] Muon+: Towards Better Muon via One Additional Normalization Step

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.21545: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21545&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[594] Hardness of Maximum Likelihood Learning of DPPs

Elena Grigorescu, Brendan Juba, Karl Wimmer, Ning Xie

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2205.12377: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2205.12377&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[595] DeepMartingale: Duality of the Optimal Stopping Problem with Expressivity and High-Dimensional Hedging

Junyan Ye, Hoi Ying Wong

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2510.13868: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13868&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[596] Throwing Vines at the Wall: Structure Learning via Random Search

Thibault Vatter, Thomas Nagler

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2510.20035 could not be retrieved from arXiv API.

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting.

Method: Cannot determine method as paper content is unavailable due to API rate limiting.

Result: Cannot determine results as paper content is unavailable due to API rate limiting.

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting.

Abstract: Failed to fetch summary for 2510.20035: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.20035&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[597] One-Step Diffusion Samplers via Self-Distillation and Deterministic Flow

Pascal Jutras-Dube, Jiaru Zhang, Ziran Wang, Ruqi Zhang

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Unable to determine method due to API rate limiting preventing access to paper content

Result: Unable to determine results due to API rate limiting preventing access to paper content

Conclusion: Unable to draw conclusions due to API rate limiting preventing access to paper content

Abstract: Failed to fetch summary for 2512.05251: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05251&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[598] The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training

Xincan Feng, Noriki Nishida, Yusuke Sakai, Yuji Matsumoto

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.09448: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09448&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Eduar Castrillo Velilla

Main category: cs.LG

Abstract: Failed to fetch summary for 2602.20833: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20833&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[600] Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

Mame Diarra Toure, David A. Stephens

Main category: cs.LG

Abstract: Failed to fetch summary for 2602.21160: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21160&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.MA

[601] Sustainable Multi-Agent Crowdsourcing via Physics-Informed Bandits

Chayan Banerjee

Main category: cs.MA

TL;DR: FORGE: A physics-grounded multi-agent simulator for crowdsourcing allocation that addresses four tensions (cold-start, burnout, utilization, strategic behavior) using neural-linear UCB with physics-informed priors.

Details

Motivation: Existing crowdsourcing allocation methods fail to simultaneously address four key tensions: allocation quality (cold-start), workforce sustainability (burnout), operational feasibility (utilization), and strategic contractor behavior. Current approaches resolve at most two tensions at once, leading to suboptimal outcomes.

Method: Introduces FORGE, a physics-grounded K+1 multi-agent simulator where contractors are rational agents declaring load-acceptance thresholds based on fatigue states, converting passive RMAB into a Stackelberg game. Uses Neural-Linear UCB allocator with Two-Tower embedding network and Physics-Informed Covariance Prior derived from offline simulator interactions to warm-start skill-cluster geometry and UCB exploration.

Result: Achieves highest reward among non-oracle methods (LRew = 0.555 ± 0.041) over 200 cold-start episodes at only 7.6% workforce utilization, while maintaining robustness to 50% workforce turnover and observation noise up to σ=0.20.

Conclusion: FORGE addresses the four-way tension in crowdsourcing allocation by combining a physics-grounded simulator with a Neural-Linear UCB allocator, achieving the highest reward among non-oracle methods at low workforce utilization, a combination no conventional baseline achieves.

Abstract: Crowdsourcing platforms face a four-way tension between allocation quality, workforce sustainability, operational feasibility, and strategic contractor behaviour, a dilemma we formalise as the Cold-Start, Burnout, Utilisation, and Strategic Agency Dilemma. Existing methods resolve at most two of these tensions simultaneously: greedy heuristics and multi-criteria decision making (MCDM) methods achieve Day-1 quality but cause catastrophic burnout, while bandit algorithms eliminate burnout only through operationally infeasible 100% workforce utilisation. To address this, we introduce FORGE, a physics-grounded $K+1$ multi-agent simulator in which each contractor is a rational agent that declares its own load-acceptance threshold based on its fatigue state, converting the standard passive Restless Multi-Armed Bandit (RMAB) into a genuine Stackelberg game. Operating within FORGE, we propose a Neural-Linear UCB allocator that fuses a Two-Tower embedding network with a Physics-Informed Covariance Prior derived from offline simulator interactions. The prior simultaneously warm-starts skill-cluster geometry and UCB exploration landscape, providing a geometry-aware belief state from episode 1 that measurably reduces cold-start regret. Over $T = 200$ cold-start episodes, the proposed method achieves the highest reward of all non-oracle methods ($\text{LRew} = 0.555 \pm 0.041$) at only 7.6% workforce utilisation, a combination no conventional baseline achieves, while maintaining robustness to workforce turnover up to 50% and observation noise up to $\sigma = 0.20$.
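
To illustrate how a covariance prior can warm-start UCB exploration, here is a minimal sketch of a plain linear UCB arm whose inverse design matrix is initialised from a prior covariance. It is not the FORGE allocator: the Two-Tower embedding network and the simulator-derived prior are omitted, and the class name `LinUCBArm`, the `alpha` exploration weight, and the unit-noise assumption are all hypothetical choices made for this example.

```python
import math


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class LinUCBArm:
    """Linear UCB arm with a zero-mean Gaussian prior (hypothetical sketch)."""

    def __init__(self, prior_cov, alpha=1.0):
        # prior_cov must be symmetric; it acts as A^{-1} before any data.
        self.a_inv = [row[:] for row in prior_cov]
        self.b = [0.0] * len(prior_cov)
        self.alpha = alpha

    def theta(self):
        # Posterior mean of the reward weights.
        return matvec(self.a_inv, self.b)

    def ucb(self, x):
        # Mean estimate plus an exploration bonus from posterior uncertainty.
        width = math.sqrt(dot(x, matvec(self.a_inv, x)))
        return dot(self.theta(), x) + self.alpha * width

    def update(self, x, reward):
        # Sherman-Morrison update of A^{-1} after adding x x^T to A.
        ax = matvec(self.a_inv, x)
        denom = 1.0 + dot(x, ax)
        n = len(x)
        self.a_inv = [
            [self.a_inv[i][j] - ax[i] * ax[j] / denom for j in range(n)]
            for i in range(n)
        ]
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


# A confident prior explores less than a diffuse one on the same context.
tight = LinUCBArm([[0.01, 0.0], [0.0, 0.01]])
wide = LinUCBArm([[1.0, 0.0], [0.0, 1.0]])
context = [1.0, 0.0]
tight_score = tight.ucb(context)
wide_score = wide.ucb(context)
wide.update(context, reward=1.0)
```

In this sketch a tighter prior shrinks the exploration bonus from the first episode, which is the general mechanism by which an informative prior can reduce cold-start exploration.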

[602] QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

Yuanjun Li, Bin Zhang, Hao Chen, Zhouyang Jiang, Dapeng Li, Zhiwei Xu

Main category: cs.MA

TL;DR: QSIM is a similarity-weighted Q-learning framework for multi-agent reinforcement learning that mitigates Q-value overestimation by using action similarity to smooth TD targets instead of the max operator.

Details

Motivation: Value decomposition methods in cooperative MARL suffer from systematic Q-value overestimation due to their reliance on the max operator for TD target calculation. This issue is exacerbated in MARL due to combinatorial explosion of joint action space, leading to unstable learning and suboptimal policies.

Method: QSIM reconstructs TD targets using action similarity. Instead of using greedy joint action directly, it forms a similarity-weighted expectation over a structured near-greedy joint action space. This allows integration of Q-values from diverse yet behaviorally related actions while assigning greater influence to actions more similar to the greedy choice.

Result: Extensive experiments show QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to original algorithms. Empirical analysis confirms QSIM significantly mitigates systematic value overestimation in MARL.

Conclusion: QSIM effectively addresses the overestimation problem in MARL by smoothing TD targets with structurally relevant alternatives, improving learning stability and performance across various value decomposition methods.

Abstract: Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
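
The idea of replacing the max operator with a similarity-weighted expectation can be sketched as follows. This is an illustrative simplification, not the QSIM implementation: it assumes a VDN-style sum of per-agent Q-values, defines the near-greedy set as joint actions that differ from the greedy one in at most one agent, measures similarity as the fraction of matching agent actions, and weights candidates with a softmax. All function names and the `temperature` parameter are hypothetical.

```python
import math


def near_greedy_actions(greedy, n_actions):
    """Greedy joint action plus every joint action changing one agent's choice."""
    candidates = [tuple(greedy)]
    for i in range(len(greedy)):
        for a in range(n_actions):
            if a != greedy[i]:
                alt = list(greedy)
                alt[i] = a
                candidates.append(tuple(alt))
    return candidates


def similarity(joint, greedy):
    """Fraction of agents whose action matches the greedy joint action."""
    return sum(a == g for a, g in zip(joint, greedy)) / len(greedy)


def similarity_weighted_target(reward, gamma, agent_qs, temperature=0.1):
    """TD target using a similarity-weighted average over near-greedy actions."""
    n_actions = len(agent_qs[0])
    greedy = [max(range(n_actions), key=lambda a: q[a]) for q in agent_qs]

    def q_tot(joint):
        # Assumed VDN-style mixing: joint value is the sum of agent values.
        return sum(q[a] for q, a in zip(agent_qs, joint))

    candidates = near_greedy_actions(greedy, n_actions)
    logits = [similarity(c, greedy) / temperature for c in candidates]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    expected_q = sum(w / total * q_tot(c) for w, c in zip(weights, candidates))
    return reward + gamma * expected_q


# Two agents, two actions each (next-state Q-values).
next_qs = [[1.0, 0.0], [2.0, 0.5]]
smoothed = similarity_weighted_target(0.0, 1.0, next_qs, temperature=0.5)
sharp = similarity_weighted_target(0.0, 1.0, next_qs, temperature=1e-3)
```

Because the target is a weighted average over the candidate set, it can never exceed the max-based target, and it approaches that target as the temperature goes to zero.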

[603] ClawMobile: Rethinking Smartphone-Native Agentic Systems

Hongchao Du, Shangyu Wu, Qiao Li, Riwei Pan, Jinheng Li, Youcheng Sun, Chun Jason Xue

Main category: cs.MA

TL;DR: ClawMobile introduces a hierarchical architecture for LLM agents on smartphones, separating language reasoning from deterministic control to improve execution stability on mobile devices.

Details

Motivation: Smartphones present unique challenges for agentic systems due to constrained execution contexts, fragmented control interfaces, and rapidly changing application states. As LLMs evolve from conversational assistants to action-oriented agents, achieving reliable smartphone-native autonomy requires rethinking how reasoning and control are composed.

Method: ClawMobile adopts a hierarchical architecture that separates high-level language reasoning from structured, deterministic control pathways. This approach improves execution stability and reproducibility on real devices. The system serves as a case study to distill design principles for mobile LLM runtimes.

Result: The paper presents ClawMobile as a concrete implementation that demonstrates improved execution stability for LLM agents on smartphones. It identifies key challenges in efficiency, adaptability, and stability for mobile agentic systems.

Conclusion: Building robust smartphone-native agentic systems demands principled coordination between probabilistic planning and deterministic system interfaces. The implementation is open-sourced to facilitate future exploration of mobile LLM runtimes.

Abstract: Smartphones represent a uniquely challenging environment for agentic systems. Unlike cloud or desktop settings, mobile devices combine constrained execution contexts, fragmented control interfaces, and rapidly changing application states. As large language models (LLMs) evolve from conversational assistants to action-oriented agents, achieving reliable smartphone-native autonomy requires rethinking how reasoning and control are composed. We introduce ClawMobile as a concrete exploration of this design space. ClawMobile adopts a hierarchical architecture that separates high-level language reasoning from structured, deterministic control pathways, improving execution stability and reproducibility on real devices. Using ClawMobile as a case study, we distill the design principles for mobile LLM runtimes and identify key challenges in efficiency, adaptability, and stability. We argue that building robust smartphone-native agentic systems demands principled coordination between probabilistic planning and deterministic system interfaces. The implementation is open-sourced (https://github.com/ClawMobile/ClawMobile) to facilitate future exploration.

[604] HyperAgent: Leveraging Hypergraphs for Topology Optimization in Multi-Agent Communication

Heng Zhang, Yuling Shi, Xiaodong Gu, Zijian Zhang, Haochen You, Lubin Gan, Yilei Yuan, Jin Huang

Main category: cs.MA

TL;DR: HyperAgent: A hypergraph-based framework for multi-agent systems that optimizes communication topologies using hyperedges to capture group collaboration patterns and dynamically adjusts topologies based on task complexity.

Details

Motivation: Existing multi-agent systems face two main challenges: (1) ineffective group collaboration modeling due to reliance on pairwise edge representations in graph structures, which limits capturing relationships among multiple agents; and (2) limited task-adaptiveness in communication topology design, leading to excessive communication costs for simple tasks and insufficient coordination for complex scenarios.

Method: Proposes HyperAgent, a hypergraph-based framework that uses hyperedges to link multiple agents within the same subtask and employs hypergraph convolutional layers for one-step information aggregation. Incorporates a variational autoencoder framework with sparsity regularization to dynamically adjust hypergraph topologies based on task complexity.

Result: HyperAgent demonstrates superiority in both performance and efficiency. On GSM8K, it achieves 95.07% accuracy while reducing token consumption by 25.33%.

Conclusion: HyperAgent demonstrates the potential of hypergraph-based optimization for multi-agent communication, effectively addressing scalability and practical deployment challenges in adaptive collaboration frameworks.

Abstract: Recent advances in large language model-powered multi-agent systems have demonstrated remarkable collective intelligence through effective communication. However, existing approaches face two primary challenges: (i) ineffective group collaboration modeling, as they rely on pairwise edge representations in graph structures, limiting their ability to capture relationships among multiple agents; and (ii) limited task-adaptiveness in communication topology design, leading to excessive communication cost for simple tasks and insufficient coordination for complex scenarios. These issues restrict the scalability and practical deployment of adaptive collaboration frameworks. To address these challenges, we propose HyperAgent, a hypergraph-based framework that optimizes communication topologies and effectively captures group collaboration patterns using direct hyperedge representations. Unlike edge-based approaches, HyperAgent uses hyperedges to link multiple agents within the same subtask and employs hypergraph convolutional layers to achieve one-step information aggregation in collaboration groups. Additionally, it incorporates a variational autoencoder framework with sparsity regularization to dynamically adjust hypergraph topologies based on task complexity. Experiments highlight the superiority of HyperAgent in both performance and efficiency. For instance, on GSM8K, HyperAgent achieves 95.07% accuracy while reducing token consumption by 25.33%, demonstrating the potential of hypergraph-based optimization for multi-agent communication.
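
The one-step group aggregation that a hyperedge enables can be sketched with a weight-free hypergraph convolution: each hyperedge averages its members' features, then each agent averages the messages of the hyperedges it belongs to. This is a generic illustration under that assumption, not HyperAgent's layer; learnable weights, nonlinearities, and the VAE-based topology adjustment are left out, and `hypergraph_aggregate` is a hypothetical name.

```python
def hypergraph_aggregate(features, hyperedges):
    """Mean-aggregate agent features through hyperedges.

    features: list of per-agent feature vectors (lists of floats).
    hyperedges: list of lists of agent indices, one list per subtask group.
    """
    dim = len(features[0])

    # Step 1: each hyperedge averages the features of its member agents.
    edge_means = [
        [sum(features[i][d] for i in members) / len(members) for d in range(dim)]
        for members in hyperedges
    ]

    # Step 2: each agent averages the messages of its incident hyperedges.
    updated = []
    for agent, feat in enumerate(features):
        incident = [e for e, members in enumerate(hyperedges) if agent in members]
        if not incident:
            updated.append(list(feat))  # isolated agent keeps its own features
            continue
        updated.append(
            [sum(edge_means[e][d] for e in incident) / len(incident) for d in range(dim)]
        )
    return updated


agents = [[1.0], [3.0], [5.0]]
one_group = hypergraph_aggregate(agents, [[0, 1, 2]])
two_groups = hypergraph_aggregate(agents, [[0, 1], [1, 2]])
```

With a single hyperedge covering all three agents, every agent receives the group mean in one step, whereas pairwise edges along a chain would need two hops for the end agents to exchange information.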

[605] Scaling Inference-Time Computation via Opponent Simulation: Enabling Online Strategic Adaptation in Repeated Negotiation

Xiangyu Liu, Di Wang, Zhe Feng, Aranyak Mehta

Main category: cs.MA

TL;DR: LLMs adapted for repeated strategic interactions using smooth Fictitious Play principles with opponent modeling and best-of-N sampling, enabling online adaptation without parameter updates.

Details

Motivation: Current LLM approaches focus on single-agent or stationary environments, but lack effective methods for repeated strategic interactions with unknown/dynamic opponents. Offline training methods don't fully exploit LLMs' potential for online adaptation based on interaction feedback.

Method: Embed smooth Fictitious Play into LLM inference: (1) opponent model that learns in-context to imitate opponent’s time-averaged behavior for belief formation, (2) enhanced best-of-N sampling that simulates against the opponent model for best response.

Result: Significant performance improvement over repeated online interaction compared to various baselines in two distinct forms of repeated negotiation games.

Conclusion: Provides a scalable and principled approach to repeated strategic decision-making for LLMs without parameter updates, enabling effective adaptation in dynamic multi-agent settings.

Abstract: While large language models (LLMs) have emerged as powerful decision-makers across a wide range of single-agent and stationary environments, fewer efforts have been devoted to settings where LLMs must engage in repeated and strategic interactions with unknown or dynamic opponents. In such settings, recipes built upon offline pre-training or fine-tuning, though robust against worst-case adversaries, do not fully exploit the capability of LLMs to adapt online based on interaction feedback. Instead, we explore the more natural perspective of scaling inference-time computation as a mechanism for adaptation, embedding the principles of a classical game-theoretical learning dynamic, smooth Fictitious Play (sFP), into LLM inference: (i) for belief formation, we employ an auxiliary opponent model that in-context learns to imitate the time-averaged behavior of the opponent; (ii) for best response, we advance best-of-$N$ (BoN) sampling by simulating against the opponent model. Empirical evaluations on two distinct forms of repeated negotiation games demonstrate that our method enables significant performance improvement over repeated online interaction compared to various baselines, offering a scalable and principled approach to repeated strategic decision-making without any parameter updates.
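
For readers unfamiliar with smooth Fictitious Play, the classical dynamic that the paper embeds into LLM inference can be sketched for a small matrix game: the belief is the empirical frequency of the opponent's past actions, and the response is a softmax over expected payoffs. This covers only the textbook dynamic; the paper's in-context opponent model and best-of-$N$ simulation are not reproduced, and the function names and `temperature` parameter are hypothetical.

```python
import math


def empirical_belief(opponent_history, n_actions):
    """Time-averaged opponent behaviour as a probability vector."""
    if not opponent_history:
        return [1.0 / n_actions] * n_actions  # uniform belief before any data
    counts = [0] * n_actions
    for action in opponent_history:
        counts[action] += 1
    return [c / len(opponent_history) for c in counts]


def smooth_best_response(payoff, belief, temperature=0.5):
    """Softmax response to expected payoffs under the current belief.

    payoff[i][j] is our payoff when we play i and the opponent plays j.
    """
    expected = [sum(p * b for p, b in zip(row, belief)) for row in payoff]
    top = max(expected)
    weights = [math.exp((u - top) / temperature) for u in expected]
    total = sum(weights)
    return [w / total for w in weights]


# Coordination game: we are paid only when both players pick the same action.
payoff = [[1.0, 0.0], [0.0, 1.0]]
belief = empirical_belief([0, 0, 0, 1], n_actions=2)
policy = smooth_best_response(payoff, belief)
```

Lowering the temperature moves the policy toward an exact best response, while a higher temperature keeps more exploration against an opponent whose behaviour may still change.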

[606] Time-Varying Formation Tracking Control of Wheeled Mobile Robots With Region Constraint: A Generalized Udwadia-Kalaba Framework

Yijie Kang, Yuqing Hao, Qingyun Wang, Guanrong Chen

Main category: cs.MA

Abstract: Failed to fetch summary for 2512.07137: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07137&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[607] LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang

Main category: cs.MA

Abstract: Failed to fetch summary for 2602.14337: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14337&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.MM

[608] Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads

Kunpeng Zhang, Poppy Zhang, Shawndra Hill, Amel Awadelkarim

Main category: cs.MM

TL;DR: A multimodal framework using transformer-based MLLMs to analyze the first 3 seconds (hooking period) of video ads, correlating multimodal features with engagement metrics.

Details

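Motivation: Video ads' initial 3 seconds (hooking period) are crucial for engagement but challenging to analyze due to multimodal nature; existing methods miss nuanced interplay between visual, auditory, and textual elements.

Method: Uses transformer-based MLLMs to analyze hooking period with two frame sampling strategies (uniform random & key frame selection) for balanced acoustic feature extraction. Generates descriptive analyses distilled into topics via BERTopic, integrating audio attributes and ad targeting data.

Result: Empirical validation on large-scale social media data shows framework efficacy, revealing correlations between hooking period features and key performance metrics like conversion per investment.

Conclusion: Provides scalable methodology for understanding and optimizing initial moments of video ads, advancing video ad analysis with practical applicability and predictive power.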

[609] MViR: Multi-View Visual-Semantic Representation for Fake News Detection

Haochen Liang, Xinqi Su, Jun Wang, Chaomeng Chen, Zitong Yu

Main category: cs.MM

TL;DR: MViR framework for fake news detection using multi-view visual-semantic representation with pyramid dilated convolution and feature fusion

Details

Motivation: Existing multimodal fake news detection methods often neglect multi-view visual-semantic aspects, such as different text perspectives of the same image, which is crucial for accurate detection in online social networks

Method: Proposes Multi-View Visual-Semantic Representation (MViR) framework with: 1) Multi-View Representation module using pyramid dilated convolution to capture multi-view visual-semantic features, 2) Multi-View Feature Fusion module to integrate these features with text, and 3) multiple aggregators to extract multi-view semantic cues for detection

Result: Experiments on benchmark datasets demonstrate the superiority of MViR over existing methods

Conclusion: MViR effectively addresses the multi-view visual-semantic aspects of fake news detection, improving accuracy and providing a robust framework for multimodal fake news analysis

Abstract: With the rise of online social networks, detecting fake news accurately is essential for a healthy online environment. While existing methods have advanced multimodal fake news detection, they often neglect the multi-view visual-semantic aspects of news, such as different text perspectives of the same image. To address this, we propose a Multi-View Visual-Semantic Representation (MViR) framework. Our approach includes a Multi-View Representation module using pyramid dilated convolution to capture multi-view visual-semantic features, a Multi-View Feature Fusion module to integrate these features with text, and multiple aggregators to extract multi-view semantic cues for detection. Experiments on benchmark datasets demonstrate the superiority of MViR. The source code of MViR is available at https://github.com/FlowerinZDF/FakeNews-MVIR.

Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji

Main category: cs.MM

TL;DR: Efficient Attention Skipping (EAS) is a parameter-efficient tuning method for MLLMs that skips redundant multi-head attentions to speed up inference while maintaining performance.

Details

Motivation: Multi-head attentions (MHAs) in MLLMs are often redundant for downstream tasks, creating unnecessary computational overhead during inference. There's a need for methods that reduce this redundancy while maintaining model performance and parameter efficiency.

Method: EAS evaluates attention redundancy and skips less important MHAs. It uses a propagation-of-information adapter (PIA) to support attention skipping while maintaining parameter efficiency, which can be re-parameterized into feed-forward networks for zero extra latency.

Result: Applied to LaVIN and METER models, EAS achieves high performance (89.98% accuracy on ScienceQA) while speeding up inference by 2.2x for LaVIN, demonstrating both computational efficiency and parameter efficiency.

Conclusion: EAS provides an effective approach for efficient MLLM tuning by skipping redundant attentions, achieving significant speedup without compromising performance or parameter efficiency.

Abstract: In this paper, we propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant to downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the less important MHAs to speed up inference. Besides, we also propose a novel propagation-of-information adapter (PIA) to serve the attention skipping of EAS and keep parameter efficiency, which can be further re-parameterized into feed-forward networks (FFNs) for zero extra latency. To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic VL pre-trained model called METER, and conduct extensive experiments on a set of benchmarks. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference. For instance, LaVIN-EAS can obtain 89.98% accuracy on ScienceQA while speeding up inference by 2.2 times relative to LaVIN.

[611] Structured Image-based Coding for Efficient Gaussian Splatting Compression

Pedro Martin, Antonio Rodrigues, Joao Ascenso, Maria Paula Queluz

Main category: cs.MM

TL;DR: GSICO is a novel compression method for Gaussian Splatting models that maps GS parameters into structured images for efficient encoding using conventional image codecs, achieving 20.2x compression with minimal quality loss.

Details

Motivation: Gaussian Splatting models require storing millions of parameters, leading to large file sizes that impair practical use in multimedia systems, necessitating efficient compression methods.

Method: GSICO arranges GS parameters into structured images using a novel algorithm that enhances spatial coherence, then encodes these parameter images using conventional image codecs.

Result: Achieves average compression factors of 20.2x on Tanks and Temples, Deep Blending, and Mip-NeRF360 datasets with minimal loss in visual quality (PSNR, SSIM, LPIPS), and superior rate-distortion trade-offs compared to state-of-the-art GS compression methods.

Conclusion: GSICO provides an effective solution for compressing Gaussian Splatting models while preserving perceptual fidelity, enabling more practical deployment in multimedia systems.

Abstract: Gaussian Splatting (GS) has recently emerged as a state-of-the-art representation for radiance fields, combining real-time rendering with high visual fidelity. However, GS models require storing millions of parameters, leading to large file sizes that impair their use in practical multimedia systems. To address this limitation, this paper introduces GS Image-based Compression (GSICO), a novel GS codec that efficiently compresses pre-trained GS models while preserving perceptual fidelity. The core contribution lies in a mapping procedure that arranges GS parameters into structured images, guided by a novel algorithm that enhances spatial coherence. These GS parameter images are then encoded using a conventional image codec. Experimental evaluations on Tanks and Temples, Deep Blending, and Mip-NeRF360 datasets show that GSICO achieves average compression factors of 20.2x with minimal loss in visual quality, as measured by PSNR, SSIM, and LPIPS. Compared with state-of-the-art GS compression methods, the proposed codec consistently yields superior rate-distortion (RD) trade-offs.

eess.AS

[612] Moving Speaker Separation via Parallel Spectral-Spatial Processing

Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

Main category: eess.AS

TL;DR: A dual-branch parallel architecture for multi-channel speech separation that separately processes spectral and spatial features through parallel streams to address the modeling conflict in dynamic environments.

Details

Motivation: Existing methods force a single network to simultaneously model both spectral and spatial features, creating a modeling conflict as these features evolve at different temporal scales in dynamic environments with moving speakers.

Method: Proposes PS2 (Parallel Spectral-Spatial) architecture with two parallel branches: spectral branch uses BLSTM-based frequency module, Mamba-based temporal module, and self-attention; spatial branch uses BGRU networks to process spatial features. Features are integrated via cross-attention fusion mechanism.

Result: Outperforms SOTA methods by 1.6-2.2 dB SI-SDR for moving speaker scenarios, maintains robust separation under different reverberation times, noise levels, and movement speeds, with consistent improvements across multiple datasets including WHAMR! and WSJ0-Demand-6ch-Move.

Conclusion: The parallel architecture effectively addresses the modeling conflict between spectral and spatial features in dynamic environments, achieving superior speech separation performance for moving sources.

Abstract: Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
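
Since the results above are reported in SI-SDR, a short sketch of how that metric is commonly computed may help: the estimate is projected onto the reference to obtain a scaled target, and the ratio of target energy to residual energy is expressed in dB. This follows the standard definition of the metric; it is not taken from the paper's evaluation code, and the function name `si_sdr` and the toy signals are illustrative only.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB for two equal-length signals."""
    inner = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = inner / ref_energy  # optimal scaling of the reference

    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0.0:
        return float("inf")  # perfect reconstruction up to scale
    return 10.0 * math.log10(target_energy / residual_energy)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
score = si_sdr(estimate, reference)
rescaled_score = si_sdr([3.0 * e for e in estimate], reference)
```

Rescaling the estimate leaves the score unchanged, which is the property that makes the metric insensitive to the output gain of a separation model.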

[613] Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper

Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi

Main category: eess.AS

TL;DR: Fine-tuning Whisper model for synthetic word detection in deepfake speech while maintaining transcription accuracy, using cost-effective vocoded data for training.

Details

Motivation: Deepfake speech can be created by replacing words in authentic utterances with synthetic words. Need cost-effective detection methods that work alongside transcription tasks.

Method: Fine-tune pre-trained Whisper model to detect synthetic words while transcribing via next-token prediction. Use partially vocoded utterances as training data to reduce collection costs.

Result: On in-domain data: low synthetic-word detection error rates and transcription error rates. On out-of-domain data: performance comparable to dedicated ResNet-based detector but with overall degradation, highlighting generalization challenges.

Conclusion: Fine-tuned Whisper offers cost-effective synthetic word detection but needs improved generalization strategies for unseen speech generative models.

Abstract: Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models. While a dedicated synthetic word detector could be developed, we investigate a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, thereby reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced by unseen speech generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.

[614] A Directional-Derivative-Constrained Method for Continuously Steerable Differential Beamformers with Uniform Circular Arrays

Tiantian Xiong, Yongyi Deng, Kunlong Zhao, Jilu Jin, Xueqin Luo, Gongping Huang, Jingdong Chen, Jacob Benesty

Main category: eess.AS

TL;DR: Novel differential beamformer design for circular arrays using directional derivative constraints to achieve continuously steerable beampatterns for arbitrary target directions.

Details

Motivation: Differential microphone arrays are promising for far-field acoustic signal acquisition due to spatial directivity and compactness, but designing continuously steerable beamformers for arbitrary directions remains challenging.

Method: Proposes a framework incorporating directional derivative constraints: constraining first-order derivatives of beampattern at desired steering direction to zero, and assigning suitable values to higher-order derivatives to ensure maximum response in target direction and enable beam steering.

Result: Simulation results demonstrate that the proposed method produces continuously steerable beampatterns with improved steering flexibility and more intuitive, robust design.

Conclusion: The directional derivative constraint approach enables effective design of differential beamformers for circular arrays with continuous steerability and enhanced target signal acquisition from arbitrary directions.

Abstract: Differential microphone arrays offer a promising solution for far-field acoustic signal acquisition due to their high spatial directivity and compact array structure. A key challenge lies in designing differential beamformers that are continuously steerable and capable of enhancing target signals arriving from arbitrary directions. This paper studies the design of differential beamformers for circular arrays and proposes a novel framework that incorporates directional derivative constraints. By constraining the first-order derivatives of the beampattern at the desired steering direction to zero and assigning suitable values to higher-order derivatives, the beamformer is ensured to achieve its maximum response in the target direction and provide sufficient beam steering. This approach not only improves steering flexibility but also enables a more intuitive and robust beampattern design. Simulation results demonstrate that the proposed method produces continuously steerable beampatterns.
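
The derivative constraints described above can be illustrated in the beampattern domain. This is a minimal sketch, assuming a first-order pattern B(theta) = a0 + a1*cos(theta) + b1*sin(theta); the function names and the default second-derivative value are illustrative assumptions, and the mapping from these coefficients to microphone weights on a circular array (the paper's actual design step) is not shown.

```python
import math

def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def first_order_pattern(theta_s, second_derivative=-0.5):
    """Coefficients (a0, a1, b1) satisfying three constraints at theta_s.

    second_derivative is a hypothetical design value; it must be negative
    for theta_s to be a maximum rather than a minimum.
    """
    c, s = math.cos(theta_s), math.sin(theta_s)
    rows = [
        [1.0, c, s],    # B(theta_s) = 1   (distortionless response)
        [0.0, -s, c],   # B'(theta_s) = 0  (stationary point at theta_s)
        [0.0, -c, -s],  # B''(theta_s) = second_derivative
    ]
    return solve_linear(rows, [1.0, 0.0, second_derivative])

def beampattern(coeffs, theta):
    a0, a1, b1 = coeffs
    return a0 + a1 * math.cos(theta) + b1 * math.sin(theta)
```

With `second_derivative=-0.5` the solution reduces to a cardioid, 0.5 + 0.5*cos(theta - theta_s), whose maximum follows `theta_s` continuously.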

[615] Align-Consistency: Improving Non-autoregressive and Semi-supervised ASR with Consistency Regularization

Wanting Huang, Weiran Wang

Main category: eess.AS

TL;DR: Align-Consistency extends consistency regularization to Align-Refine non-autoregressive ASR models, improving both supervised and semi-supervised speech recognition through parallel inference with iterative refinement.

Details

Motivation: To enhance the robustness and accuracy of non-autoregressive ASR models by applying consistency regularization to iterative refinement models, enabling fast parallel inference while maintaining high recognition quality.

Method: Proposes Align-Consistency, an extension of consistency regularization for Align-Refine non-autoregressive models that perform iterative refinement of frame-level hypotheses. Applies CR to both base CTC model and refinement steps, and uses fast non-AR decoding for semi-supervised pseudo-label generation.

Result: CR applied to both base CTC and refinement steps is critical; accuracy improvements from non-AR decoding and CR are additive. In semi-supervised ASR, fast non-AR decoding generates online pseudo-labels leading to substantial gains.

Conclusion: Align-Consistency effectively combines the speed of parallel inference with improved recognition performance through consistency regularization, benefiting both fully supervised and semi-supervised ASR scenarios.

Abstract: Consistency regularization (CR) improves the robustness and accuracy of Connectionist Temporal Classification (CTC) by ensuring predictions remain stable across input perturbations. In this work, we propose Align-Consistency, an extension of CR designed for Align-Refine – a non-autoregressive (non-AR) model that performs iterative refinement of frame-level hypotheses. This method leverages the speed of parallel inference while significantly boosting recognition performance. The effectiveness of Align-Consistency is demonstrated in two settings. First, in the fully supervised setting, our results indicate that applying CR to both the base CTC model and the subsequent refinement steps is critical, and the accuracy improvements from non-AR decoding and CR are mutually additive. Second, for semi-supervised ASR, we employ fast non-AR decoding to generate online pseudo-labels on unlabeled data, which are used to further refine the supervised model and lead to substantial gains.
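
The consistency term can be sketched as a divergence between the frame-level posteriors of two perturbed views of the same utterance. The abstract does not state which divergence is used, so the symmetric KL below is an assumption, as are the function names; in Align-Consistency such a term would be applied to the base CTC outputs and to each refinement step.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(logits_a, logits_b):
    """Mean symmetric KL across frames between two views.

    logits_a, logits_b: lists of per-frame logit vectors produced from two
    differently perturbed copies of the same input (assumed equal length).
    """
    total = 0.0
    for frame_a, frame_b in zip(logits_a, logits_b):
        pa, pb = softmax(frame_a), softmax(frame_b)
        total += 0.5 * (kl(pa, pb) + kl(pb, pa))
    return total / len(logits_a)
```

In training, this value would typically be added to the supervised CTC loss with a tunable weight (a hypothetical hyperparameter here).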

[616] Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Manh Luong, Khai Nguyen, Dinh Phung, Gholamreza Haffari, Lizhen Qu

Main category: eess.AS

TL;DR: USW-RBF kernel with rotary positional embedding improves audio captioning by preserving temporal relationships across modalities and mitigating caption degeneration through stochastic decoding.

Details

Motivation: Audio captioning systems suffer from exposure bias in teacher-forcing training, leading to caption degeneration during inference. Existing contrastive methods fail to capture crucial temporal relationships between acoustic and linguistic modalities.

Method: Introduces unbiased sliced Wasserstein RBF (USW-RBF) kernel with rotary positional embedding to preserve temporal information across modalities. The kernel enables efficient stochastic gradient optimization. Builds a complete audio captioning framework with stochastic decoding to mitigate caption degeneration.

Result: Significant improvements in caption quality, lexical diversity, and text-to-audio retrieval accuracy on AudioCaps and Clotho datasets. USW-RBF kernel enhances reasoning capabilities of large audio language models on CompA-R and improves reasoning accuracy on MMAU-test-mini benchmarks by 4%.

Conclusion: The approach establishes a powerful and generalizable solution for cross-modal alignment challenges in audio-language tasks, with demonstrated effectiveness in both captioning and reasoning applications.

Abstract: Audio captioning systems face a fundamental challenge: teacher-forcing training creates exposure bias that leads to caption degeneration during inference. While contrastive methods have been proposed as solutions, they typically fail to capture the crucial temporal relationships between acoustic and linguistic modalities. We address this limitation by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel with rotary positional embedding, specifically designed to preserve temporal information across modalities. Our approach offers a practical advantage: the kernel enables efficient stochastic gradient optimization, making it computationally feasible for real-world applications. Building on this foundation, we develop a complete audio captioning framework that integrates stochastic decoding to further mitigate caption degeneration. Extensive experiments on AudioCaps and Clotho datasets demonstrate that our method significantly improves caption quality, lexical diversity, and text-to-audio retrieval accuracy. Furthermore, we demonstrate the generalizability of our USW-RBF kernel by applying it to audio reasoning tasks, where it enhances the reasoning capabilities of large audio language models on the CompA-R in terms of correctness and quality. Our kernel also improves the reasoning accuracy of the MMAU-test-mini benchmarks by 4%. These results establish our approach as a powerful and generalizable solution for cross-modal alignment challenges in audio-language tasks.
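
For orientation, a plain Monte Carlo sliced Wasserstein RBF kernel between two equally sized sets of embedding vectors can be sketched as follows. This is not the paper's unbiased estimator and omits the rotary positional embedding; the function names, the equal-size restriction, and the default parameters are assumptions made for illustration.

```python
import math
import random

def random_direction(dim, rng):
    """Draw a unit vector uniformly on the sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sliced_w2_squared(xs, ys, n_proj=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized point sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_proj):
        d = random_direction(dim, rng)
        # Project both sets to 1-D; sorted projections give the optimal 1-D coupling.
        px = sorted(sum(a * b for a, b in zip(x, d)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, d)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_proj

def sw_rbf_kernel(xs, ys, gamma=1.0, n_proj=64, seed=0):
    """RBF kernel built on the sliced distance: exp(-gamma * SW_2^2)."""
    return math.exp(-gamma * sliced_w2_squared(xs, ys, n_proj, seed))
```

A kernel value near 1 indicates closely matching point sets; in a contrastive setup the two sets would be the audio and text token embeddings of a pair.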

eess.IV

[617] Deep Accurate Solver for the Geodesic Problem

Saar Huberman, Amit Bracha, Ron Kimmel

Main category: eess.IV

TL;DR: A deep learning method for computing geodesic distances on surfaces that achieves third-order accuracy, outperforming traditional polyhedral approximations and previous learning-based approaches.

Details

Motivation: Traditional methods for computing geodesic distances on continuous surfaces use polygonal mesh approximations, but these are limited to second-order accuracy. There's a need for higher-order accurate methods that better approximate the continuous surface structure.

Method: Proposes a neural network-based local solver that implicitly approximates the continuous surface structure. The method combines a numerical solver for local distance approximation with an efficient causal ordering scheme, replacing classical dynamic programming approaches with a learned update scheme.

Result: The proposed learned update scheme provides better accuracy than the best possible polyhedral approximations and previous learning-based methods, achieving third-order accuracy with a bootstrapping recipe for further improvement.

Conclusion: Deep learning can significantly improve geodesic distance computation accuracy on surfaces, achieving higher-order convergence rates than traditional polyhedral approximation methods.

Abstract: A common approach to compute distances on continuous surfaces is by considering a discretized polygonal mesh approximating the surface and estimating distances on the polygon. We show that exact geodesic distances restricted to the polygon are at most second-order accurate with respect to the distances on the corresponding continuous surface. By order of accuracy we refer to the convergence rate as a function of the average distance between sampled points. Next, a higher-order accurate deep learning method for computing geodesic distances on surfaces is introduced. Traditionally, one considers two main components when computing distances on surfaces: a numerical solver that locally approximates the distance function, and an efficient causal ordering scheme by which surface points are updated. Classical minimal path methods often exploit a dynamic programming principle with quasi-linear computational complexity in the number of sampled points. The quality of the distance approximation is determined by the local solver that is revisited in this paper. To improve state of the art accuracy, we consider a neural network-based local solver which implicitly approximates the structure of the continuous surface. We supply numerical evidence that the proposed learned update scheme provides better accuracy compared to the best possible polyhedral approximations and previous learning-based methods. The result is a third-order accurate solver with a bootstrapping-recipe for further improvement.
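
The two components named in the abstract, a local solver and a causal ordering scheme, can be sketched with classical fast marching on a flat regular grid. This is the textbook first-order scheme, not the paper's learned solver; the `local_solver` parameter (a name chosen here) only marks where a learned update would plug in.

```python
import heapq
import math

def eikonal_update(dist, done, i, j, h):
    """First-order upwind local solver (unit speed), using accepted neighbors only."""
    rows, cols = len(dist), len(dist[0])

    def val(r, c):
        inside = 0 <= r < rows and 0 <= c < cols
        return dist[r][c] if inside and done[r][c] else math.inf

    a = min(val(i - 1, j), val(i + 1, j))
    b = min(val(i, j - 1), val(i, j + 1))
    lo, hi = min(a, b), max(a, b)
    if hi - lo >= h:  # one-sided update (also covers hi == inf)
        return lo + h
    return 0.5 * (lo + hi + math.sqrt(2.0 * h * h - (hi - lo) ** 2))

def fast_marching(rows, cols, source, h=1.0, local_solver=eikonal_update):
    """Propagate distances from `source` in increasing order (Dijkstra-like)."""
    dist = [[math.inf] * cols for _ in range(rows)]
    done = [[False] * cols for _ in range(rows)]
    dist[source[0]][source[1]] = 0.0
    heap = [(0.0, source[0], source[1])]
    while heap:
        _, i, j = heapq.heappop(heap)
        if done[i][j]:
            continue  # stale heap entry
        done[i][j] = True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols and not done[ni][nj]:
                cand = local_solver(dist, done, ni, nj, h)
                if cand < dist[ni][nj]:
                    dist[ni][nj] = cand
                    heapq.heappush(heap, (cand, ni, nj))
    return dist
```

The heap gives the quasi-linear ordering cost mentioned in the abstract; accuracy is governed entirely by the local solver, which is the part the paper replaces.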

[618] Learning to reconstruct from saturated data: audio declipping and high-dynamic range imaging

Victor Sechaud, Laurent Jacques, Patrice Abry, Julián Tachella

Main category: eess.IV

TL;DR: Self-supervised learning method for recovering audio and images from clipped/saturated measurements without ground truth data.

Details

Motivation: Existing self-supervised methods for inverse problems are limited to linear cases, but real-world applications like audio/image recovery from clipped measurements require handling non-linear problems without ground truth references.

Method: Extends self-supervised learning to non-linear inverse problems by assuming signal distribution is approximately invariant to amplitude changes. Provides sufficient conditions for learning from saturated signals alone and develops a self-supervised loss for training reconstruction networks.

Result: Experiments on audio and image data show the approach is almost as effective as fully supervised methods, despite using only clipped measurements for training.

Conclusion: The work successfully extends self-supervised learning to non-linear inverse problems for audio and image recovery from clipped measurements, providing a practical solution when ground truth data is unavailable.

Abstract: Learning based methods are now ubiquitous for solving inverse problems, but their deployment in real-world applications is often hindered by the lack of ground truth references for training. Recent self-supervised learning strategies offer a promising alternative, avoiding the need for ground truth. However, most existing methods are limited to linear inverse problems. This work extends self-supervised learning to the non-linear problem of recovering audio and images from clipped measurements, by assuming that the signal distribution is approximately invariant to changes in amplitude. We provide sufficient conditions for learning to reconstruct from saturated signals alone and a self-supervised loss that can be used to train reconstruction networks. Experiments on both audio and image data show that the proposed approach is almost as effective as fully supervised approaches, despite relying solely on clipped measurements for training.
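
The clipping measurement model and the amplitude-invariance assumption can be sketched as below. The abstract does not give the loss, so the two-term form here (measurement consistency plus a rescale-and-reclip term, in the spirit of equivariance-based self-supervision) is an assumption, and `reconstruct`, `mu`, and `scale` are hypothetical names.

```python
def clip(signal, mu):
    """Saturating forward operator: values outside [-mu, mu] are clipped."""
    return [max(-mu, min(mu, v)) for v in signal]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def self_supervised_loss(reconstruct, y, mu, scale):
    """Loss computed from a clipped measurement y alone (no ground truth).

    reconstruct: callable mapping a clipped signal to an estimate of the
                 clean signal (a network in practice).
    scale:       amplitude change under which the signal distribution is
                 assumed to be approximately invariant.
    """
    x_hat = reconstruct(y)
    # Term 1: re-clipping the estimate should reproduce the measurement.
    consistency = mse(clip(x_hat, mu), y)
    # Term 2: a rescaled estimate is treated as another plausible signal;
    # after clipping it again, the network should recover it.
    x_scaled = [scale * v for v in x_hat]
    amplitude_term = mse(reconstruct(clip(x_scaled, mu)), x_scaled)
    return consistency + amplitude_term
```

An identity `reconstruct` satisfies the first term trivially but is penalized by the second whenever `scale` pushes samples into saturation, which is why measurement consistency alone is not enough.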

[619] HARU-Net: Hybrid Attention Residual U-Net for Edge-Preserving Denoising in Cone-Beam Computed Tomography

Khuram Naveed, Ruben Pauwels

Main category: eess.IV

TL;DR: HARU-Net: A novel hybrid attention residual U-Net architecture for high-quality denoising of low-dose CBCT dental/maxillofacial images, achieving state-of-the-art performance with lower computational cost.

Details

Motivation: Low-dose CBCT imaging introduces strong, spatially varying noise that degrades soft-tissue visibility and obscures fine anatomical structures. Classical denoising methods struggle with CBCT noise while preserving edges, and deep learning approaches are limited by scarce high-resolution CBCT training data.

Method: Proposes Hybrid Attention Residual U-Net (HARU-Net) with three key components: 1) hybrid attention transformer blocks in skip connections to emphasize salient anatomical features, 2) residual hybrid attention transformer groups at bottleneck for global contextual modeling, and 3) residual learning convolutional blocks for stable feature extraction. Trained on cadaver hemimandible dataset from high-resolution CBCT system.

Result: HARU-Net consistently outperforms state-of-the-art methods (SwinIR, Uformer) with highest PSNR (37.52 dB), highest SSIM (0.9557), and lowest GMSD (0.1084). Achieves effective denoising at significantly lower computational cost than SOTA methods.

Conclusion: HARU-Net provides clinically reliable CBCT denoising that improves diagnostic quality in low-dose CBCT imaging, offering a practical advancement for dental and maxillofacial applications with superior performance and computational efficiency.

Abstract: Cone-beam computed tomography (CBCT) is widely used in dental and maxillofacial imaging, but low-dose acquisition introduces strong, spatially varying noise that degrades soft-tissue visibility and obscures fine anatomical structures. Classical denoising methods struggle to suppress noise in CBCT while preserving edges. Although deep learning-based approaches offer high-fidelity restoration, their use in CBCT denoising is limited by the scarcity of high-resolution CBCT data for supervised training. To address this research gap, we propose a novel Hybrid Attention Residual U-Net (HARU-Net) for high-quality denoising of CBCT data, trained on a cadaver dataset of human hemimandibles acquired using a high-resolution protocol of the 3D Accuitomo 170 (J. Morita, Kyoto, Japan) CBCT system. The novel contribution of this approach is the integration of three complementary architectural components: (i) a hybrid attention transformer block (HAB) embedded within each skip connection to selectively emphasize salient anatomical features, (ii) a residual hybrid attention transformer group (RHAG) at the bottleneck to strengthen global contextual modeling and long-range feature interactions, and (iii) residual learning convolutional blocks to facilitate deeper, more stable feature extraction throughout the network. HARU-Net consistently outperforms state-of-the-art (SOTA) methods including SwinIR and Uformer, achieving the highest PSNR (37.52 dB), highest SSIM (0.9557), and lowest GMSD (0.1084). This effective and clinically reliable CBCT denoising is achieved at a computational cost significantly lower than that of the SOTA methods, offering a practical advancement toward improving diagnostic quality in low-dose CBCT imaging.
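
For readers comparing the reported PSNR figures, the metric's standard definition is sketched below. The `max_value` default assumes intensities normalized to [0, 1]; the paper's actual data range is not stated in the abstract.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def rmse_from_psnr(psnr_db, max_value=1.0):
    """Invert the definition: root-mean-square error implied by a PSNR value."""
    return max_value / (10.0 ** (psnr_db / 20.0))
```

Under that normalization assumption, 37.52 dB corresponds to an RMS error of roughly 0.013 of full scale.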

[620] U-Net-Based Generative Joint Source-Channel Coding for Wireless Image Transmission

Ming Ye, Kui Cai, Cunhua Pan, Zhen Mei, Wanting Yang, Chunguo Li

Main category: eess.IV

TL;DR: Proposes two DL-based joint source-channel coding methods (G-UNet-JSCC and cGAN-JSCC) for wireless image transmission using deep generative architectures to improve both pixel-level fidelity and perceptual quality.

Details

Motivation: Existing DL-based JSCC methods either focus on conventional distortion metrics that don't yield high perceptual quality or incur high computational complexity. There's a need for methods that achieve both good pixel-level fidelity and perceptual quality in wireless image transmission.

Method: Two methods: 1) G-UNet-JSCC with encoder and U-Net-based decoder using skip connections for multi-scale feature fusion, optimized with weighted SSIM+MSE loss. 2) cGAN-JSCC builds on G-UNet-JSCC with adversarial training - retains same encoder but decoder is adversarially trained against patch-based discriminator using two-stage training procedure.

Result: Both methods achieve superior pixel-level fidelity and perceptual quality on high- and low-resolution images. For low-resolution images, cGAN-JSCC achieves better reconstruction performance and greater robustness to channel variations than G-UNet-JSCC.

Conclusion: The proposed deep generative architecture-based JSCC methods effectively improve both distortion metrics and perceptual quality in wireless image transmission, with cGAN-JSCC showing particular advantages for low-resolution images and channel robustness.

Abstract: Deep learning (DL)-based joint source-channel coding (JSCC) methods have achieved remarkable success in wireless image transmission. However, these methods either focus on conventional distortion metrics that do not necessarily yield high perceptual quality or incur high computational complexity. In this paper, we propose two DL-based JSCC (DeepJSCC) methods that leverage deep generative architectures for wireless image transmission. Specifically, we propose G-UNet-JSCC, a scheme comprising an encoder and a U-Net-based generator serving as the decoder. Its skip connections enable multi-scale feature fusion to improve both pixel-level fidelity and perceptual quality of reconstructed images by integrating low- and high-level features. To further enhance pixel-level fidelity, the encoder and the U-Net-based decoder are jointly optimized using a weighted sum of structural similarity and mean-squared error (MSE) losses. Building upon G-UNet-JSCC, we further develop a DeepJSCC method called cGAN-JSCC, where the decoder is enhanced through adversarial training. In this scheme, we retain the encoder of G-UNet-JSCC and adversarially train the decoder’s generator against a patch-based discriminator. cGAN-JSCC employs a two-stage training procedure. The outer stage trains the encoder and the decoder end-to-end using an MSE loss, while the inner stage adversarially trains the decoder’s generator and the discriminator by minimizing a joint loss combining adversarial and distortion losses. Simulation results demonstrate that the proposed methods achieve superior pixel-level fidelity and perceptual quality on both high- and low-resolution images. For low-resolution images, cGAN-JSCC achieves better reconstruction performance and greater robustness to channel variations than G-UNet-JSCC.

[621] LinGuinE: Longitudinal Guidance Estimation for Volumetric Tumour Segmentation

Nadine Garibli, Mayank Patwari, Bence Csiba, Yi Wei, Kostantinos Sidiropoulos

Main category: eess.IV

TL;DR: LinGuinE is a framework for longitudinal tumor segmentation that combines image registration and guided segmentation to track lesions across multiple timepoints using a single radiologist prompt.

Details

Motivation: Current methods for longitudinal tumor segmentation produce single-timepoint semantic masks, lack lesion correspondence across scans, and offer limited radiologist control, making them inadequate for radiotherapy planning and response assessment.

Method: Combines image registration with guided segmentation in a PyTorch framework that is temporally direction agnostic, requires no training on longitudinal data, and allows any registration and semi-automatic segmentation algorithm to be repurposed for longitudinal tracking.

Result: Achieves state-of-the-art segmentation and tracking performance across four datasets with 456 longitudinal studies, with minimal degradation in tumor segmentation performance as temporal separation increases.

Conclusion: LinGuinE provides an effective framework for longitudinal tumor segmentation with lesion-level tracking, offering radiologist control through prompts while being flexible enough to work with various registration and segmentation algorithms.

Abstract: Longitudinal volumetric tumour segmentation is critical for radiotherapy planning and response assessment, yet this problem is underexplored and most methods produce single-timepoint semantic masks, lack lesion correspondence, and offer limited radiologist control. We introduce LinGuinE (Longitudinal Guidance Estimation), a PyTorch framework that combines image registration and guided segmentation to deliver lesion-level tracking and volumetric masks across all scans in a longitudinal study from a single radiologist prompt. LinGuinE is temporally direction agnostic, requires no training on longitudinal data, and allows any registration and semi-automatic segmentation algorithm to be repurposed for the task. We evaluate various combinations of registration and segmentation algorithms within the framework. LinGuinE achieves state-of-the-art segmentation and tracking performance across four datasets with a total of 456 longitudinal studies. Tumour segmentation performance shows minimal degradation with increasing temporal separation. We conduct ablation studies to determine the impact of autoregression, pathology specific finetuning, and the use of real radiologist prompts. We release our code and substantial public benchmarking for longitudinal segmentation, facilitating future research.
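
The registration-plus-guided-segmentation idea can be sketched as propagating one prompt point to every other scan and segmenting each scan from its mapped point. This is not LinGuinE's API: `register` and `segment` are hypothetical callables standing in for any registration and semi-automatic segmentation algorithm, and the affine transform is a simplification of what is usually a deformable registration.

```python
def apply_affine(matrix, offset, point):
    """Map a 3-D point through x' = M x + t."""
    return tuple(
        sum(m * p for m, p in zip(row, point)) + o
        for row, o in zip(matrix, offset)
    )

def propagate_prompt(prompt, scans, register, segment):
    """Segment one lesion in every scan of a study from a single prompt.

    prompt:   (scan_index, point) given by the radiologist on any timepoint.
    register: callable(src_scan, dst_scan) -> (matrix, offset) mapping
              coordinates of src_scan into dst_scan (hypothetical interface).
    segment:  callable(scan, point) -> mask (hypothetical interface).
    """
    src_index, point = prompt
    masks = {}
    for idx, scan in enumerate(scans):
        if idx == src_index:
            mapped = point
        else:
            # Works for earlier and later scans alike, so the procedure
            # does not depend on temporal direction.
            matrix, offset = register(scans[src_index], scan)
            mapped = apply_affine(matrix, offset, point)
        masks[idx] = segment(scan, mapped)
    return masks
```

Because the same lesion prompt seeds every timepoint, the resulting masks are in correspondence by construction, which is what enables lesion-level tracking.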

[622] Adiabatic Capacitive Neuron: An Energy-Efficient Functional Unit for Artificial Neural Networks

Sachin Maheshwari, Mike Smart, Himadri Singh Raghav, Themis Prodromakis, Alexander Serb

Main category: eess.IV

TL;DR: A hardware implementation of an Adiabatic Capacitive Neuron (ACN) with 12-bit precision in 0.18μm CMOS technology, featuring improved energy efficiency, accuracy, and robustness over conventional designs.

Details

Motivation: To develop a more energy-efficient and robust hardware implementation of artificial neurons for neural network applications, addressing limitations in conventional capacitive neuron designs regarding energy consumption, accuracy, and scalability.

Method: Implemented a 12-bit single neuron with positive/negative weight support in 0.18μm CMOS technology, featuring a new Threshold Logic design for binary activation function with low symmetrical offset across process corners and temperatures.

Result: Achieved >90% synapse energy savings (over 12x improvement) compared to the non-adiabatic CMOS Capacitive Neuron benchmark, with a maximum rising and falling offset voltage of 9 mV versus 27 mV (rising) and 5 mV (falling) for conventional TL, and consistent energy savings across supply voltage scaling.

Conclusion: The proposed ACN hardware implementation demonstrates significant improvements in energy efficiency, accuracy, and robustness for artificial neuron hardware, making it suitable for energy-constrained neural network applications.

Abstract: This paper introduces a new, highly energy-efficient, Adiabatic Capacitive Neuron (ACN) hardware implementation of an Artificial Neuron (AN) with improved functionality, accuracy, robustness and scalability over previous work. The paper describes the implementation of a 12-bit single neuron, with positive and negative weight support, in a 0.18 μm CMOS technology. The paper also presents a new Threshold Logic (TL) design for a binary AN activation function that generates a low symmetrical offset across three process corners and five temperatures between −55 °C and 125 °C. Post-layout simulations demonstrate a maximum rising and falling offset voltage of 9 mV compared to conventional TL, which has rising and falling offset voltages of 27 mV and 5 mV respectively, across temperature and process. Moreover, the proposed TL design shows a decrease in average energy of 1.5% at the SS corner and 2.3% at FF corner compared to the conventional TL design. The total synapse energy saving for the proposed ACN was above 90% (over 12x improvement) when compared to a non-adiabatic CMOS Capacitive Neuron (CCN) benchmark for a frequency ranging from 500 kHz to 100 MHz. A 1000-sample Monte Carlo simulation including process variation and mismatch confirms the worst-case energy savings of >90% compared to CCN in the synapse energy profile. Finally, the impact of supply voltage scaling shows consistent energy savings of above 90% (except all zero inputs) without loss of functionality.
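
The relation between a percentage energy saving and an "Nx improvement" is simple arithmetic and can be checked directly. The helper names below are illustrative.

```python
def improvement_factor(saving_fraction):
    """Energy ratio (baseline / proposed) implied by a fractional saving."""
    return 1.0 / (1.0 - saving_fraction)

def saving_fraction(factor):
    """Fractional saving implied by an energy ratio (baseline / proposed)."""
    return 1.0 - 1.0 / factor
```

A saving of exactly 90% corresponds to a 10x ratio, so the quoted "over 12x improvement" implies a saving above about 91.7%, consistent with the abstract's "above 90%".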
