Daily arXiv Papers - 2026-04-09

AI-enhanced summaries of 12 research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

Yuxuan Wang, Peize He, Xiyan Gui, Xiaoqian Liu, Junhao He, Xuyang Liu, Zichen Wen, Xuming Hu, Linfeng Zhang

Main category: cs.SD

TL;DR: AudioKV is a novel KV cache compression framework for Large Audio-Language Models that prioritizes audio-critical attention heads using semantic-acoustic alignment and spectral score smoothing to maintain accuracy during long-context inference.

Motivation: Current KV cache compression techniques for LLMs fail in the audio domain because they overlook the intrinsic temporal continuity of acoustic signals, causing catastrophic performance degradation in Large Audio-Language Models during long-context inference.

Method: 1) Identify modality-specialized attention heads by analyzing attention scores in ASR tasks; 2) Dynamically allocate KV cache budgets preferentially to audio-critical heads; 3) Introduce Spectral Score Smoothing (SSS) - an FFT-based global filtering strategy to suppress high-frequency noise and recover smooth global trends from importance scores.
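The SSS step described above can be sketched as a global FFT low-pass filter over per-token importance scores. The function below is an illustrative assumption based on this summary, not the paper's implementation; the `keep_ratio` cutoff rule and function name are invented for the sketch:

```python
import numpy as np

def spectral_score_smoothing(scores: np.ndarray, keep_ratio: float = 0.1) -> np.ndarray:
    """Low-pass filter a 1-D sequence of token importance scores.

    Transforms the scores to the frequency domain, zeroes out all but the
    lowest `keep_ratio` fraction of frequency bins, and transforms back,
    recovering the smooth global trend while suppressing high-frequency noise.
    """
    spectrum = np.fft.rfft(scores)
    cutoff = max(1, int(len(spectrum) * keep_ratio))  # always keep the DC bin
    spectrum[cutoff:] = 0.0
    return np.fft.irfft(spectrum, n=len(scores))

# Noisy importance scores riding on a slow trend: smoothing should recover
# the trend, so smoothed scores lie closer to it than the raw scores do.
t = np.linspace(0, 1, 256)
trend = np.sin(2 * np.pi * t)
rng = np.random.default_rng(0)
noisy = trend + 0.3 * rng.standard_normal(256)
smooth = spectral_score_smoothing(noisy, keep_ratio=0.05)
```

The point of filtering globally (rather than with a sliding window) is that the FFT respects the long-range temporal continuity of the acoustic signal that motivates the method.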

Result: Extensive evaluations across multiple LALMs (Qwen and Gemma series) show AudioKV significantly outperforms baselines. At 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only 0.45% drop, while traditional methods suffer catastrophic degradation and repetition.

Conclusion: AudioKV provides an effective hardware-friendly solution for KV cache compression in audio-language models by leveraging acoustic signal properties, enabling efficient long-context inference while preserving accuracy.

Abstract: Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection with unprecedented precision. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.

Relevance: 9/10

[2] AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

Wenyu Li, Xiaoqi Jiao, Yi Chang, Guangyan Zhang, Yiwen Guo

Main category: cs.SD

TL;DR: AudioRole is an audio role-playing dataset with 1M+ character-grounded dialogues from 13 TV series, accompanied by the ARP-Eval evaluation framework and trained ARP-Models that outperform strong baselines in audio-grounded role-playing.

Motivation: Existing role-playing research focuses on text-based persona simulation, but Audio Role-Playing (ARP) requires synchronized alignment of semantic content and vocal characteristics, creating a gap in multimodal datasets for audio-grounded role-playing.

Method: Created AudioRole dataset from 13 TV series (1K+ hours, 1M+ dialogues) with synchronized audio-text pairs, speaker identities, and contextual metadata. Developed ARP-Eval dual-aspect evaluation framework and trained ARP-Model (GLM-4-Voice) on the dataset.

Result: ARP-Model achieved Acoustic Personalization score of 0.31 (outperforming GLM-4-voice and MiniCPM-O-2.6) and Content Personalization score of 0.36 (38% improvement over untrained model, matching MiniCPM-O-2.6). Dataset includes 115+ characters and 6 trained models.

Conclusion: AudioRole provides essential resources for advancing audio-grounded role-playing research, addressing the unique challenges of synchronized semantic-vocal alignment in multimodal LLMs.

Abstract: The creation of high-quality multimodal datasets remains fundamental for advancing role-playing capabilities in large language models (LLMs). While existing works predominantly focus on text-based persona simulation, Audio Role-Playing (ARP) presents unique challenges due to the need for synchronized alignment of semantic content and vocal characteristics. To address this gap, we propose AudioRole, a meticulously curated dataset from 13 TV series spanning 1K+ hours with 1M+ character-grounded dialogues, providing synchronized audio-text pairs annotated with speaker identities and contextual metadata. In addition, to demonstrate the effectiveness of the dataset, we introduce ARP-Eval, a dual-aspect evaluation framework that assesses both response quality and role fidelity. Empirical validation shows that GLM-4-Voice trained on AudioRole (which we call the ARP-Model) achieves an average Acoustic Personalization score of 0.31, significantly outperforming the original GLM-4-voice and the more powerful model MiniCPM-O-2.6, which specifically supports role-playing in one-shot scenarios. The ARP-Model also achieves a Content Personalization score of 0.36, surpassing the untrained original model by about 38% and maintaining the same level as MiniCPM-O-2.6. AudioRole features dialogues from over 115 main characters, 6 trained ARP-Models that role-play different characters, and evaluation protocols. Together, they provide an essential resource for advancing audio-grounded role-playing research.

Relevance: 9/10

[3] PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

Main category: cs.SD

TL;DR: PhyAVBench introduces the first benchmark for evaluating audio-physics grounding in text-to-audio-video generation, featuring a new dataset and novel evaluation metrics to assess physical plausibility of generated sounds.

Motivation: Current T2AV models often fail to produce physically plausible sounds, and existing benchmarks focus mainly on audio-video synchronization while overlooking explicit evaluation of audio-physics grounding, limiting progress in physically plausible audio-visual generation.

Method: Created PhyAVBench with PhyAV-Sound-11K dataset (25.5 hours, 11,605 videos from 184 participants), featuring 337 paired-prompt groups with controlled physical variations. Introduced Audio-Physics Sensitivity Test (APST) paradigm and Contrastive Physical Response Score (CPRS) metric to quantify acoustic consistency between generated and real-world videos.

Result: Comprehensive evaluation of 17 state-of-the-art models reveals that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization.

Conclusion: PhyAVBench provides the first systematic benchmark for audio-physics grounding in audio-visual generation, revealing significant limitations in current models and pointing to future research directions for physically plausible audio-visual generation.

Abstract: Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio-video temporal synchronization, while largely overlooking explicit evaluation of audio-physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 25.5 hours of 11,605 audible videos collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors underlying their acoustic differences. Importantly, PhyAVBench leverages paired text prompts to evaluate this capability. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results reveal that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation. Prompts, ground-truth, and generated video samples are available at https://phyavbench.pages.dev/.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] LLM-Augmented Knowledge Base Construction For Root Cause Analysis

Nguyen Phuc Tran, Brigitte Jaumard, Oscar Delgado, Tristan Glatard, Karthikeyan Premkumar, Kun Ni

Main category: cs.CL

TL;DR: This paper evaluates three LLM approaches (Fine-Tuning, RAG, and Hybrid) for building a Root Cause Analysis Knowledge Base from network support tickets to improve network outage diagnosis and reliability.

Motivation: Network reliability is critical but difficult to guarantee even with redundancy. During outages, rapid and accurate root cause analysis (RCA) is essential for service restoration and preventing future disruptions. Traditional methods may be slow or inaccurate, prompting exploration of LLM-based approaches.

Method: The study evaluates three LLM methodologies: 1) Fine-Tuning LLMs on support ticket data, 2) Retrieval-Augmented Generation (RAG) approach, and 3) a Hybrid approach combining both. Performance is compared using comprehensive lexical and semantic similarity metrics on real industrial network outage data.
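As a flavor of the lexical side of such a metric suite, token-level F1 between a generated and a reference root-cause entry can be computed as below. This is a generic sketch of one common lexical similarity metric, not the paper's exact metric set, and the example strings are hypothetical:

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Token-overlap F1 between a generated and a reference root-cause entry."""
    gen, ref = generated.lower().split(), reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1(
    "fiber cut on link between core routers",
    "root cause: fiber cut on the core link",
)
```

Semantic metrics (e.g., embedding cosine similarity) would complement this by crediting paraphrases that share no surface tokens.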

Result: Experiments on a real industrial dataset demonstrate that the generated knowledge base provides an excellent starting point for accelerating RCA tasks and improving network resilience; specific comparative metrics for the three approaches are not detailed in the abstract.

Conclusion: LLM-based approaches can effectively build RCA knowledge bases from support tickets, accelerating root cause analysis during network outages and contributing to improved network reliability and resilience.

Abstract: Communications networks now form the backbone of our digital world, with fast and reliable connectivity. However, even with appropriate redundancy and failover mechanisms, it is difficult to guarantee “five 9s” (99.999 %) reliability, requiring rapid and accurate root cause analysis (RCA) during outages. In the event of an outage, rapid and accurate RCA becomes essential to restore service and prevent future disruptions. This study evaluates three Large Language Model (LLM) methodologies - Fine-Tuning, RAG, and a Hybrid approach - for constructing a Root Cause Analysis (RCA) Knowledge Base from support tickets. We compare their performance using a comprehensive suite of lexical and semantic similarity metrics. Our experiments on a real industrial dataset demonstrate that the generated knowledge base provides an excellent starting point for accelerating RCA tasks and improving network resilience.

[2] The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?

Mar Gonzàlez I Català, Haitz Sáez de Ocáriz Borde, George D. Montañez, Pietro Liò

Main category: cs.CL

TL;DR: The paper investigates why internal entropy dynamics in LLMs correlate with correctness, proposing the Stepwise Informativeness Assumption (SIA) that reasoning prefixes accumulate answer-relevant information during generation.

Motivation: To understand the empirical observation that internal entropy dynamics (under the model's predictive distribution) robustly correlate with external correctness (ground-truth answers), moving beyond empirical findings to theoretical understanding.

Method: Proposes Stepwise Informativeness Assumption (SIA) formalizing that reasoning prefixes accumulate answer-relevant information. Derives observable signatures linking conditional answer entropy dynamics to correctness. Empirically tests SIA across multiple reasoning benchmarks (GSM8K, ARC, SVAMP) and diverse LLMs (Gemma-2, LLaMA-3.2, Qwen-2.5, DeepSeek, Olmo variants).
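The conditional answer entropy that these signatures track can be illustrated with a toy multiple-choice trace. Under SIA, probability mass should concentrate on the true answer as the reasoning prefix grows, driving entropy down; the distributions below are hypothetical, not drawn from any of the tested models:

```python
import math

def answer_entropy(answer_probs: dict) -> float:
    """Shannon entropy (in bits) of the model's answer distribution."""
    return -sum(p * math.log2(p) for p in answer_probs.values() if p > 0)

# Hypothetical answer distributions after each reasoning step: as the prefix
# accumulates answer-relevant information, mass concentrates on answer "B".
trace = [
    {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25},  # before any reasoning
    {"A": 0.10, "B": 0.60, "C": 0.20, "D": 0.10},  # after step 1
    {"A": 0.02, "B": 0.90, "C": 0.05, "D": 0.03},  # after step 2
]
entropies = [answer_entropy(step) for step in trace]
```

A monotonically decreasing entropy sequence of this shape is the kind of "characteristic pattern" the paper associates with correct traces.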

Result: Shows that SIA naturally emerges from maximum-likelihood optimization on human reasoning traces and is reinforced by standard fine-tuning and RL pipelines. Correct traces exhibit characteristic conditional answer entropy patterns, and training induces SIA across tested models.

Conclusion: The correlation between internal entropy dynamics and correctness arises because autoregressive models reason correctly when they accumulate information about the true answer via answer-informative prefixes, formalized by SIA.

Abstract: Recent work uses entropy-based signals at multiple representation levels to study reasoning in large language models, but the field remains largely empirical. A central unresolved puzzle is why internal entropy dynamics, defined under the predictive distribution of a model, correlate so robustly with external correctness given by the ground-truth answer. In this paper, we argue that this correlation arises because autoregressive models reason correctly when they accumulate information about the true answer via answer-informative prefixes. We formalize this intuition via the Stepwise Informativeness Assumption (SIA), which states that reasoning prefixes accumulate answer-relevant information in expectation as generation progresses. We show that SIA naturally emerges from maximum-likelihood optimization on human reasoning traces and is reinforced by standard fine-tuning and reinforcement-learning pipelines. We then derive observable signatures of SIA linking conditional answer entropy dynamics to correctness. We empirically test SIA across multiple reasoning benchmarks (GSM8K, ARC, SVAMP) and a diverse set of open-weight LLMs (Gemma-2, LLaMA-3.2, Qwen-2.5, DeepSeek and Olmo variants), showing that training induces it and that correct traces exhibit characteristic conditional answer entropy patterns.

[3] Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters

Feng Chen, Manas Bedmutha, Janice Sabin, Andrea Hartzler, Nadir Weibel, Trevor Cohen

Main category: cs.CL

TL;DR: Automated depression detection from audio-recorded primary care encounters using NLP models, with GPT-OSS achieving best performance and dyadic transcripts providing superior detection signals.

Motivation: Depression is underdiagnosed in primary care despite its critical importance. Recorded clinical encounters from digital scribing technologies offer an opportunity for passive, low-burden depression detection from naturalistic dialogue.

Method: Analyzed 1,108 audio-recorded primary care encounters (253 depressed, 855 non-depressed based on PHQ-9). Compared three supervised approaches (Sentence-BERT + Logistic Regression, LIWC+LR, ModernBERT) against zero-shot GPT-OSS. Examined dyadic vs single-speaker configurations and early detection from first 128 patient tokens.

Result: GPT-OSS achieved strongest performance (AUPRC=0.510, AUROC=0.774). LIWC+LR was competitive among supervised models (AUPRC=0.500, AUROC=0.742). Dyadic transcripts outperformed single-speaker configurations, with providers linguistically mirroring patients in depression encounters. Meaningful detection possible from first 128 patient tokens (AUPRC=0.356, AUROC=0.675).

Conclusion: Passively collected clinical audio can serve as a low-burden complement to existing depression screening workflows, with dyadic interactions providing additive detection signals not captured by single speakers alone, supporting in-the-moment clinical decision support.

Abstract: Depression is underdiagnosed in primary care, yet timely identification remains critical. Recorded clinical encounters, increasingly common with digital scribing technologies, present an opportunity to detect depression from naturalistic dialogue. We investigated automated depression detection from 1,108 audio-recorded primary care encounters in the Establishing Focus study, with depression defined by PHQ-9 (n=253 depressed, n=855 non-depressed). We compared three supervised approaches, Sentence-BERT + Logistic Regression (LR), LIWC+LR and ModernBERT, against a zero-shot GPT-OSS. GPT-OSS achieved the strongest performance (AUPRC=0.510, AUROC=0.774), with LIWC+LR competitive among supervised models (AUPRC=0.500, AUROC=0.742). Combined dyadic transcripts outperformed single-speaker configurations, with providers linguistically mirroring patients in depression encounters, an additive signal not captured by either speaker alone. Meaningful detection is achievable from the first 128 patient tokens (AUPRC=0.356, AUROC=0.675), supporting in-the-moment clinical decision support. These findings argue for passively collected clinical audio as a low-burden complement to existing screening workflows.

[4] Hallucination as output-boundary misclassification: a composite abstention architecture for language models

Angelina Hintsanen

Main category: cs.CL

TL;DR: A composite intervention combining instruction-based refusal with structural abstention gating reduces hallucinations in LLMs by blocking unsupported claims using support deficit scores.

Motivation: Large language models often produce unsupported claims (hallucinations), which is framed as a misclassification error at the output boundary where internally generated completions are emitted as if grounded in evidence.

Method: Combines instruction-based refusal with structural abstention gate that computes support deficit score (St) from three black-box signals: self-consistency (At), paraphrase stability (Pt), and citation coverage (Ct). Blocks output when St exceeds threshold.
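A minimal sketch of the gating logic, assuming each signal is normalized to [0, 1] and combined by an equally weighted complement. The summary does not specify the exact form of St, so the combination, weights, threshold, and abstention message below are illustrative, not the paper's design:

```python
def support_deficit(self_consistency: float, paraphrase_stability: float,
                    citation_coverage: float,
                    weights=(1/3, 1/3, 1/3)) -> float:
    """Support deficit S_t: high when the three support signals are low.

    Each signal (A_t, P_t, C_t) is assumed to lie in [0, 1]; the combination
    used here is an equally weighted complement, chosen for illustration.
    """
    support = sum(w * s for w, s in zip(
        weights, (self_consistency, paraphrase_stability, citation_coverage)))
    return 1.0 - support

def gated_output(answer: str, s_t: float, threshold: float = 0.5) -> str:
    """Emit the answer only when the support deficit is below the threshold."""
    return answer if s_t < threshold else "[abstain: insufficient support]"

# A well-supported claim passes the gate; a weakly supported one is blocked.
ok = gated_output("Paris", support_deficit(0.9, 0.8, 1.0))
blocked = gated_output("Atlantis", support_deficit(0.2, 0.3, 0.0))
```

Because all three signals are black-box (no logits or internals needed), this style of gate composes with any instruction-based refusal prompt.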

Result: In evaluation across 50 items, five epistemic regimes, and three models: instruction-only reduced hallucination but showed over-cautious abstention and residual hallucination; structural gate preserved accuracy but missed confident confabulation; composite architecture achieved high overall accuracy with low hallucination.

Conclusion: Instruction-based refusal and structural gating show complementary failure modes, suggesting effective hallucination control benefits from combining both mechanisms. Structural gating provides capability-independent abstention floor.

Abstract: Large language models often produce unsupported claims. We frame this as a misclassification error at the output boundary, where internally generated completions are emitted as if they were grounded in evidence. This motivates a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate computes a support deficit score, St, from three black-box signals: self-consistency (At), paraphrase stability (Pt), and citation coverage (Ct), and blocks output when St exceeds a threshold. In a controlled evaluation across 50 items, five epistemic regimes, and three models, neither mechanism alone was sufficient. Instruction-only prompting reduced hallucination sharply, but still showed over-cautious abstention on answerable items and residual hallucination for GPT-3.5-turbo. The structural gate preserved answerable accuracy across models but missed confident confabulation on conflicting-evidence items. The composite architecture achieved high overall accuracy with low hallucination, while also inheriting some over-abstention from the instruction component. A supplementary 100-item no-context stress test derived from TruthfulQA showed that structural gating provides a capability-independent abstention floor. Overall, instruction-based refusal and structural gating show complementary failure modes, which suggests that effective hallucination control benefits from combining both mechanisms.

[5] Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

Tianyi Huang, Ming Hou, Jiaheng Su, Yutong Zhang, Ziling Zhang

Main category: cs.CL

TL;DR: CGD-PD is a lightweight test-time layer that addresses two failure modes in three-way logical QA (True/False/Unknown) by ensuring negation consistency and reducing epistemic Unknown predictions through proof-driven disambiguation.

Motivation: Large language models struggle with three-way logical QA, showing negation inconsistency (contradictory answers for H and ¬H) and epistemic Unknown predictions even when premises entail a clear answer. These failure modes undermine logical reasoning reliability.

Method: CGD-PD queries a 3-way classifier on both H and its mechanical negation, projects the pair onto negation-consistent decisions when possible, and uses proof-driven disambiguation with targeted binary entailment probes to resolve Unknown outcomes, requiring only 4-5 model calls on average.
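The negation-consistency projection can be sketched as follows. Exactly how CGD-PD resolves each inconsistent pair is not spelled out in this summary, so the rule below (trust the definite label when the other is Unknown; defer True/True and False/False pairs to disambiguation) is a plausible reading rather than the paper's implementation:

```python
NEGATE = {"True": "False", "False": "True", "Unknown": "Unknown"}

def project_consistent(label_h, label_not_h):
    """Project a (H, not-H) label pair onto a negation-consistent decision for H.

    Returns the decision for H when one can be recovered, or None to signal
    that proof-driven disambiguation with binary entailment probes is needed.
    """
    if NEGATE[label_h] == label_not_h:
        return label_h                 # pair already consistent
    if label_h == "Unknown":
        return NEGATE[label_not_h]     # trust the definite answer for not-H
    if label_not_h == "Unknown":
        return label_h                 # trust the definite answer for H
    return None                        # True/True or False/False: disambiguate

decision = project_consistent("Unknown", "False")  # definite label for ¬H wins
```

Under this reading, only the None cases trigger the extra entailment probes, which is consistent with the low 4-5 average model calls reported.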

Result: On the FOLIO benchmark’s first-order-logic fields, CGD-PD achieves consistent gains across frontier LLMs with relative accuracy improvements up to 16% over base models while reducing Unknown predictions.

Conclusion: CGD-PD effectively addresses logical reasoning failures in LLMs through a lightweight test-time approach that ensures negation consistency and reduces unnecessary Unknown predictions, improving three-way logical QA performance.

Abstract: Three-way logical question answering (QA) assigns $True/False/Unknown$ to a hypothesis $H$ given a premise set $S$. While modern large language models (LLMs) can be accurate on isolated examples, we identify two recurring failure modes in 3-way logic QA: (i) negation inconsistency, where answers to $H$ and $\neg H$ violate the deterministic label mapping, and (ii) epistemic $Unknown$, where the model predicts $Unknown$ due to uncertainty or instability even when $S$ entails one side. We present CGD-PD, a lightweight test-time layer that (a) queries a single 3-way classifier on both $H$ and a mechanically negated form of $H$, (b) projects the pair onto a negation-consistent decision when possible, and (c) invokes a proof-driven disambiguation step that uses targeted binary entailment probes to selectively resolve $Unknown$ outcomes, requiring only an average of 4-5 model calls. On the FOLIO benchmark’s first-order-logic fields, CGD-PD yields consistent gains across frontier LLMs, with relative improvements in accuracy of up to 16% over the base model, while also reducing $Unknown$ predictions.

[6] Temporally Phenotyping GLP-1RA Case Reports with Large Language Models: A Textual Time Series Corpus and Risk Modeling

Sayantan Kumar, Jeremy C. Weiss

Main category: cs.CL

TL;DR: LLM-based extraction of clinical timelines from diabetes case reports enables time-series analysis and reveals that GLP-1 agonists may reduce respiratory risks.

Motivation: Clinical case reports contain valuable longitudinal information but express timelines in unstructured language that is difficult to reuse for computational modeling and analysis.

Method: Created corpus of 136 diabetes case reports, developed gold-standard timeline annotations, evaluated LLMs for extracting clinical events and their temporal relationships, then performed time-to-event analysis on extracted data.
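An event-coverage score of the kind reported for GPT5 can be sketched as recall over extracted (event, time) pairs. The exact-match rule and example timelines below are hypothetical simplifications; the study's evaluation also credits temporal sequencing, which this sketch ignores:

```python
def event_coverage(predicted, gold):
    """Fraction of gold-standard events recovered by the extracted timeline."""
    pred_events = {event.lower() for event, _ in predicted}
    return sum(e.lower() in pred_events for e, _ in gold) / len(gold)

# Hypothetical timelines: (clinical event, days from the reference time).
gold = [("semaglutide started", 0), ("nausea", 14), ("HbA1c 6.8%", 90)]
pred = [("semaglutide started", 0), ("nausea", 10)]
coverage = event_coverage(pred, gold)  # 2 of 3 gold events recovered
```

In practice, matching extracted events to gold events would use fuzzy or semantic matching rather than exact string equality.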

Result: The best LLM (GPT5) achieved high event coverage (0.871) and reliable temporal sequencing (0.843); downstream analysis showed GLP-1 users had lower respiratory risk (HR=0.259, p<0.05).

Conclusion: LLMs can effectively extract structured temporal information from clinical narratives, enabling quantitative analysis that reveals clinically relevant patterns such as potential respiratory benefits of GLP-1 agonists.

Abstract: Type 2 diabetes case reports describe complex clinical courses, but their timelines are often expressed in language that is difficult to reuse in longitudinal modeling. To address this gap, we developed a textual time-series corpus of 136 PubMed Open Access single-patient case reports involving glucagon-like peptide 1 receptor agonists, with clinical events associated with their most probable reference times. We evaluated automated LLM timeline extraction against gold-standard timelines annotated by clinical domain experts, assessing how well systems recovered clinical events and their timings. The best-performing LLM produced high event coverage (GPT5; 0.871) and reliable temporal sequencing across symptoms (GPT5; 0.843), diagnoses, treatments, laboratory tests, and outcomes. As a downstream demonstration, time-to-event analyses in diabetes suggested lower risk of respiratory sequelae among GLP-1 users versus non-users (HR=0.259, p<0.05), consistent with prior reports of improved respiratory outcomes. Temporal annotations and code will be released upon acceptance.

[7] Emergent decentralized regulation in a purely synthetic society

Md Motaleb Hossen Manik, Ge Wang

Main category: cs.CL

TL;DR: AI agents on social network exhibit self-regulated corrective signaling that scales with directive language intensity, suggesting synthetic societies can develop endogenous social regulation without human intervention.

Motivation: To investigate whether autonomous AI agents in online environments can exhibit self-regulated social dynamics without human intervention or centralized design, specifically examining how directive language elicits corrective responses.

Method: Studied OpenClaw agents on Moltbook (agent-only social network) using observational archive of 39,026 posts and 5,712 comments by 14,490 agents. Quantified directive language with Directive Intensity (DI) lexicon-based measure. Classified responsive comments into four types. Used statistical analysis including mixed-effects logistic models and event-aligned within-thread analysis.
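A lexicon-based DI proxy of the kind described can be sketched as the share of tokens that fall in a directive-language word list. The lexicon below is a small hypothetical stand-in, not the study's actual word list, and the tokenization is deliberately naive:

```python
# Hypothetical directive-language lexicon (the study's list is not published
# in this summary).
DIRECTIVE_LEXICON = {"must", "should", "do", "build", "launch", "join",
                     "follow", "vote", "adopt", "implement"}

def directive_intensity(post: str) -> float:
    """Share of a post's tokens drawn from the directive-language lexicon."""
    tokens = post.lower().split()
    if not tokens:
        return 0.0
    hits = sum(t.strip(".,!?") in DIRECTIVE_LEXICON for t in tokens)
    return hits / len(tokens)

high = directive_intensity("You must join the protocol and vote now!")
low = directive_intensity("Interesting discussion about emergent norms.")
```

As the abstract stresses, such a proxy measures phrasing only, not moral valence, intent, or whether the proposed action was executed.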

Result: Directive content was common (18.4% of posts). Corrective signaling scaled with DI: posts with higher directive intensity had a higher probability of corrective replies. Mixed-effects models confirmed that the positive DI association persists. Event-aligned within-thread analysis showed evidence consistent with negative feedback after the first corrective response.

Conclusion: Purely synthetic, agent-only societies can exhibit endogenous corrective signaling with strength positively linked to directive proposal intensity, demonstrating emergent social regulation without human intervention.

Abstract: As autonomous AI agents increasingly inhabit online environments and extensively interact, a key question is whether synthetic collectives exhibit self-regulated social dynamics with neither human intervention nor centralized design. We study OpenClaw agents on Moltbook, an agent-only social network, using an observational archive of 39,026 posts and 5,712 comments authored by 14,490 agents. We quantify action-inducing language with Directive Intensity (DI), a transparent, lexicon-based proxy for directive and instructional phrasing that does not measure moral valence, intent, or execution outcomes. We classify responsive comments into four types: Affirmation, Corrective Signaling, Adverse Reaction, and Neutral Interaction. Directive content is common (DI>0 in 18.4% of posts). More importantly, corrective signaling scales with DI: posts with higher DI exhibit higher corrective reply probability, visible in stable binned estimates with Wilson confidence intervals. To address comment nesting within posts, we fit a post-level random intercept mixed-effects logistic model and find that the positive DI association persists. Event-aligned within-thread analysis of comment text provides additional evidence consistent with negative feedback after the first corrective response. In general, these results suggest that a purely synthetic, agent-only society can exhibit endogenous corrective signaling with a strength positively linked to the intensity of directive proposals.

[8] Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models

Pei-Fu Guo, Ya-An Tsai, Chun-Chia Hsu, Kai-Xin Chen, Yun-Da Tsai, Kai-Wei Chang, Nanyun Peng, Mi-Yen Yeh, Shou-De Lin

Main category: cs.CL

TL;DR: Text2DistBench: A benchmark for evaluating LLMs’ ability to infer distributional knowledge from collections of text, built from real-world YouTube comments about movies and music.

Motivation: Most reading comprehension benchmarks focus on factual information from specific textual evidence, but real-world tasks often require understanding distributional information like population-level trends and preferences across collections of text.

Method: Created Text2DistBench using real-world YouTube comments about movie and music entities, providing models with entity metadata and associated comments, requiring them to answer distributional questions about proportions, frequencies, and trends.
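Scoring one distributional question can be sketched as comparing a model's estimated proportion against the share computed from the comments themselves. The sentiment labels, estimate, and absolute-error scoring rule below are hypothetical illustrations, not the benchmark's published protocol:

```python
def true_proportion(labels, target="positive"):
    """Ground-truth share of comments carrying the target label."""
    return sum(label == target for label in labels) / len(labels)

def proportion_error(model_estimate, labels, target="positive"):
    """Absolute error between a model's estimated share and the true share."""
    return abs(model_estimate - true_proportion(labels, target))

# Hypothetical sentiment labels for one entity's comment section.
comments = ["positive"] * 6 + ["negative"] * 3 + ["neutral"] * 1
err = proportion_error(0.55, comments)  # true positive share is 0.6
```

The key difference from factual QA is visible even in this toy: no single comment contains the answer; it only emerges from the collection as a whole.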

Result: LLMs substantially outperform random baselines but performance varies widely across different distribution types and characteristics, revealing both capabilities and limitations in distributional reading comprehension.

Conclusion: Text2DistBench serves as a practical and scalable testbed for evaluating LLMs’ ability to understand distributional knowledge from natural language, highlighting important research directions for improving models’ comprehension of population-level trends.

Abstract: While most reading comprehension benchmarks for LLMs focus on factual information that can be answered by localizing specific textual evidence, many real-world tasks require understanding distributional information, such as population-level trends and preferences expressed across collections of text. We introduce Text2DistBench, a reading comprehension benchmark for evaluating LLMs’ ability to infer distributional knowledge from natural language. Built from real-world YouTube comments about movie and music entities, the benchmark provides models with entity metadata and associated comments, and requires them to answer distributional questions, such as estimating the proportions of positive and negative comments, or identifying the most and second most frequent topics discussed among viewers. To support reliable and long-term evaluation, the construction pipeline of Text2DistBench is fully automated and continuously updated to incorporate newly emerging entities over time. Experiments across multiple LLMs show that while models substantially outperform random baselines, performance varies widely across different distribution types and characteristics. These findings highlight both the capabilities and limitations of current LLMs in distributional reading comprehension and demonstrate the value of Text2DistBench as a practical and scalable testbed for future research.

[9] Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models

O. Ibrahimzade, K. Tabasaransky

Main category: cs.CL

TL;DR: A theoretical framework for studying cross-lingual transfer and parameter-efficient adaptation of multilingual LLMs within the Turkic language family, focusing on languages with varying resource availability.

Motivation: Multilingual LLMs are imbalanced, primarily trained on high-resource languages, leaving many languages with large speaker populations underrepresented. This is particularly visible in the Turkic language family where languages share typological similarity but differ greatly in available digital resources.

Method: Proposes a theoretical framework integrating multilingual representation learning and parameter-efficient fine-tuning techniques like LoRA. Introduces Turkic Transfer Coefficient (TTC) - a theoretical measure incorporating morphological similarity, lexical overlap, syntactic structure, and script compatibility across Turkic languages.
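
As an illustration only, a transfer coefficient of this shape could be a weighted combination of the four named components; the weights and component scores below are invented, since the paper defines TTC as a theoretical measure rather than this particular formula:

```python
# The weights and component scores are invented for illustration; the paper
# defines TTC theoretically, not as this fixed linear formula.
def turkic_transfer_coefficient(morph, lex, syn, script,
                                weights=(0.4, 0.25, 0.2, 0.15)):
    """Each component is a similarity in [0, 1]; output is in [0, 1]."""
    w_m, w_l, w_s, w_sc = weights
    return w_m * morph + w_l * lex + w_s * syn + w_sc * script

# Hypothetical closely related, same-script pair vs. a more distant pair.
print(turkic_transfer_coefficient(0.9, 0.8, 0.85, 1.0))  # high transfer potential
print(turkic_transfer_coefficient(0.4, 0.3, 0.5, 0.0))   # low transfer potential
```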

Result: Develops a conceptual scaling model describing how adaptation performance depends on model capacity, adaptation data size, and expressivity of adaptation modules. The framework highlights how typological similarity enables efficient multilingual transfer while identifying structural limits of parameter-efficient adaptation in low-resource scenarios.

Conclusion: Provides a theoretical foundation for studying cross-lingual transfer in related language families, offering insights into efficient multilingual adaptation strategies and identifying boundaries of parameter-efficient methods in resource-constrained settings.

Abstract: Large language models (LLMs) have transformed natural language processing, yet their capabilities remain uneven across languages. Most multilingual models are trained primarily on high-resource languages, leaving many languages with large speaker populations underrepresented in both training data and evaluation benchmarks. This imbalance is particularly visible in the Turkic language family. This paper proposes a theoretical framework for studying cross-lingual transfer and parameter-efficient adaptation of multilingual LLMs within the Turkic language family, focusing on Azerbaijani, Kazakh, Uzbek, Turkmen, and Gagauz. These languages share substantial typological and morphological similarity while differing greatly in available digital resources, making them a natural setting for analyzing multilingual adaptation strategies. We integrate insights from multilingual representation learning and parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA) to develop a conceptual scaling model describing how adaptation performance depends on model capacity, adaptation data size, and the expressivity of adaptation modules. To formalize transfer potential between related languages, we introduce the Turkic Transfer Coefficient (TTC), a theoretical measure incorporating morphological similarity, lexical overlap, syntactic structure, and script compatibility across Turkic languages. The framework highlights how typological similarity can enable efficient multilingual transfer while also identifying structural limits of parameter-efficient adaptation in extremely low-resource scenarios.

[10] SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams

Bufang Yang, Lilin Xu, Yixuan Li, Kaiwei Liu, Xiaofan Jiang, Zhenyu Yan

Main category: cs.CL

TL;DR: SensorPersona: An LLM-based system that infers comprehensive user personas from multimodal mobile sensor data rather than just chat histories, enabling better personalization for AI agents.

Motivation: Existing persona inference methods rely on chat histories which only capture self-disclosed information, missing users' real-world behaviors. There's a need to infer comprehensive personas from everyday physical world activities captured through mobile sensors.

Method: 1) Person-oriented context encoding on continuous sensor streams; 2) Hierarchical persona reasoning with intra- and inter-episode reasoning; 3) Clustering-aware incremental verification and temporal evidence-aware updating for evolving personas.
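
One reading of the temporal evidence-aware updating step can be sketched as exponential decay over the age of supporting observations, so stale habits fade while recently observed ones persist; the decay form and half-life are assumptions, not the paper's formula:

```python
# Confidence in a persona claim as a decayed sum of supporting observations.
# The exponential-decay form and 30-day half-life are illustrative assumptions.
def claim_confidence(evidence_days_ago, half_life=30.0):
    """Each observation contributes 0.5 ** (age / half_life)."""
    return sum(0.5 ** (age / half_life) for age in evidence_days_ago)

recent = claim_confidence([1, 3, 7])      # habit observed in the last week
stale = claim_confidence([90, 120, 150])  # habit not observed for months
print(round(recent, 2), round(stale, 2))
```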

Result: Achieves up to 31.4% higher recall in persona extraction, an 85.7% win rate in persona-aware agent responses, and notable user satisfaction improvements over state-of-the-art baselines, evaluated on a dataset of 1,580 hours of sensor data from 20 participants across 17 cities.

Conclusion: SensorPersona demonstrates that multimodal sensor data can effectively infer comprehensive user personas, enabling better personalization for LLM-based agents beyond chat-based approaches.

Abstract: Personalization is essential for Large Language Model (LLM)-based agents to adapt to users’ preferences and improve response quality and task performance. However, most existing approaches infer personas from chat histories, which capture only self-disclosed information rather than users’ everyday behaviors in the physical world, limiting the ability to infer comprehensive user personas. In this work, we introduce SensorPersona, an LLM-empowered system that continuously infers stable user personas from multimodal longitudinal sensor streams unobtrusively collected from users’ mobile devices. SensorPersona first performs person-oriented context encoding on continuous sensor streams to enrich the semantics of sensor contexts. It then employs hierarchical persona reasoning that integrates intra- and inter-episode reasoning to infer personas spanning physical patterns, psychosocial traits, and life experiences. Finally, it employs clustering-aware incremental verification and temporal evidence-aware updating to adapt to evolving personas. We evaluate SensorPersona on a self-collected dataset containing 1,580 hours of sensor data from 20 participants, collected over up to 3 months across 17 cities on 3 continents. Results show that SensorPersona achieves up to 31.4% higher recall in persona extraction, an 85.7% win rate in persona-aware agent responses, and notable improvements in user satisfaction compared to state-of-the-art baselines.

[11] Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation

Shutong Zhang, Dylan Zhou, Yinxiao Liu, Yang Yang, Huiwen Luo, Wenfei Zou

Main category: cs.CL

TL;DR: Tool-MCoT: A small language model fine-tuned for content safety moderation that learns to use external tools via chain-of-thought reasoning to balance accuracy and efficiency.

Motivation: Online platforms need scalable content moderation systems, but large language models (LLMs) have high computational costs and latency. There's a need for efficient models that can handle complex multimodal inputs while maintaining accuracy.

Method: Fine-tune a small language model (SLM) using tool-augmented chain-of-thought data generated by LLMs. The model learns to selectively use external tools to improve reasoning and decision-making for content safety moderation.
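
The selective tool-calling loop can be sketched as a wrapper that invokes a tool only when the model's chain-of-thought asks for it; the call marker syntax, OCR tool, and stub model below are all hypothetical stand-ins for the paper's setup:

```python
# Hypothetical tool and marker syntax; the paper's tools and prompts differ.
def ocr_tool(image_id):
    return f"extracted text for {image_id}"  # stand-in for a real OCR tool

def moderate(content, model_generate):
    thought = model_generate(content)
    if "<call:ocr>" in thought:               # model decided it needs the tool
        evidence = ocr_tool(content["image"])
        thought = model_generate({**content, "ocr": evidence})
    return "unsafe" if "UNSAFE" in thought else "safe"

# Stub generator: requests OCR for images, then flags based on the evidence.
def stub_model(content):
    if "image" in content and "ocr" not in content:
        return "reasoning... <call:ocr>"
    text = content.get("ocr", content.get("text", ""))
    return "UNSAFE" if "banned" in text else "OK"

print(moderate({"text": "hello"}, stub_model))         # safe, no tool call
print(moderate({"image": "banned_meme"}, stub_model))  # unsafe, via OCR
```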

Result: The fine-tuned SLM achieves significant performance gains and learns to use tools selectively, balancing moderation accuracy with inference efficiency by calling tools only when necessary.

Conclusion: Tool-MCoT demonstrates that small language models can effectively leverage external tools through chain-of-thought learning, providing a scalable solution for content moderation that balances accuracy and computational efficiency.

Abstract: The growth of online platforms and user content requires strong content moderation systems that can handle complex inputs from various media types. While large language models (LLMs) are effective, their high computational cost and latency present significant challenges for scalable deployment. To address this, we introduce Tool-MCoT, a small language model (SLM) fine-tuned for content safety moderation leveraging external framework. By training our model on tool-augmented chain-of-thought data generated by LLM, we demonstrate that the SLM can learn to effectively utilize these tools to improve its reasoning and decision-making. Our experiments show that the fine-tuned SLM achieves significant performance gains. Furthermore, we show that the model can learn to use these tools selectively, achieving a balance between moderation accuracy and inference efficiency by calling tools only when necessary.

[12] A Comparative Study of Demonstration Selection for Practical Large Language Models-based Next POI Prediction

Ryo Nishida, Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura, Masaki Onishi

Main category: cs.CL

TL;DR: LLM demonstration selection strategies for POI prediction: heuristic methods (geographic proximity, temporal ordering, sequential patterns) outperform complex embedding-based approaches in both accuracy and efficiency.

Motivation: To identify optimal demonstration selection strategies for LLM-based POI prediction, as ICL effectiveness heavily depends on demonstration quality, and previous studies lack comprehensive comparison of different selection methods.

Method: Comprehensive evaluation of existing demonstration selection methods (random, embedding-based, task-specific) alongside simpler heuristic approaches (geographic proximity, temporal ordering, sequential patterns) on three real-world datasets for POI prediction using LLMs.
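
One of the simple heuristics evaluated, geographic proximity, can be sketched as ranking candidate demonstrations by haversine distance to the query location; the POI names and coordinates below are made up:

```python
import math

# Geographic-proximity demonstration selection: pick the k demonstration
# check-ins closest to the query location. Records here are invented.
def haversine_km(p, q):
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def select_by_proximity(query_loc, demos, k=2):
    """demos: list of (poi_name, (lat, lon)) candidate demonstrations."""
    ranked = sorted(demos, key=lambda d: haversine_km(query_loc, d[1]))
    return [name for name, _ in ranked[:k]]

demos = [("cafe", (35.0, 139.0)), ("museum", (35.7, 139.7)), ("park", (40.0, 140.0))]
print(select_by_proximity((35.68, 139.69), demos))  # ['museum', 'cafe']
```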

Result: Heuristic methods consistently outperform more complex embedding-based methods in both computational cost and prediction accuracy. In some scenarios, LLMs with heuristic-selected demonstrations even outperform existing fine-tuned models without additional training.

Conclusion: Simpler heuristic demonstration selection methods are more effective and efficient than complex approaches for LLM-based POI prediction, challenging the assumption that more sophisticated methods are always better for ICL tasks.

Abstract: This paper investigates demonstration selection strategies for predicting a user’s next point-of-interest (POI) using large language models (LLMs), aiming to accurately forecast a user’s subsequent location based on historical check-in data. While in-context learning (ICL) with LLMs has recently gained attention as a promising alternative to traditional supervised approaches, the effectiveness of ICL significantly depends on the selected demonstration. Although previous studies have examined methods such as random selection, embedding-based selection, and task-specific selection, there remains a lack of comprehensive comparative analysis among these strategies. To bridge this gap and clarify the best practices for real-world applications, we comprehensively evaluate existing demonstration selection methods alongside simpler heuristic approaches such as geographical proximity, temporal ordering, and sequential patterns. Extensive experiments conducted on three real-world datasets indicate that these heuristic methods consistently outperform more complex and computationally demanding embedding-based methods, both in terms of computational cost and prediction accuracy. Notably, in certain scenarios, LLMs using demonstrations selected by these simpler heuristic methods even outperform existing fine-tuned models, without requiring further training. Our source code is available at: https://github.com/ryonsd/DS-LLM4POI.

[13] Extracting Breast Cancer Phenotypes from Clinical Notes: Comparing LLMs with Classical Ontology Methods

Abdullah Bin Faiz, Arbaz Khan Shehzad, Asad Afzal, Momin Tariq, Muhammad Siddiqi, Muhammad Usamah Shahid, Maryam Noor Awan, Muddassar Farooq

Main category: cs.CL

TL;DR: LLM-based framework for extracting medical phenotypes from unstructured oncology provider notes, specifically applied to breast cancer, showing comparable accuracy to traditional ontology-based methods.

Motivation: Oncology EMRs contain valuable clinical information in unstructured provider notes that oncologists prefer to document in natural language rather than structured fields, creating a need for automated extraction of medical knowledge and phenotypes.

Method: Developed an LLM-based framework to process provider notes and extract medical phenotypes, specifically applied to breast cancer, and compared performance with traditional knowledge-driven annotation systems using NCIt Ontology Annotator.
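
A minimal sketch of prompt-based extraction, assuming a fixed JSON schema and a stub model; the schema fields, prompt wording, and canned response are invented, and the paper's framework is more elaborate:

```python
import json

# Hypothetical prompt and schema; the paper's prompts and fields differ.
PROMPT = (
    "Extract breast-cancer phenotypes from the note below as JSON with keys "
    "'biomarkers', 'tumor_location', 'tumor_size_cm'.\n\nNote: {note}"
)

def extract_phenotypes(note, llm):
    """Format the prompt, call the model, parse the structured answer."""
    return json.loads(llm(PROMPT.format(note=note)))

# Stub model returning a canned structured answer for illustration.
stub_llm = lambda prompt: (
    '{"biomarkers": ["ER+", "HER2-"], '
    '"tumor_location": "left breast", "tumor_size_cm": 2.1}'
)
result = extract_phenotypes("...", stub_llm)
print(result["biomarkers"])  # ['ER+', 'HER2-']
```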

Result: LLM-based information extraction framework achieves accuracy comparable to classical ontology-based methods and can be easily fine-tuned for other cancer types and diseases.

Conclusion: LLMs provide a flexible and adaptable approach for extracting medical phenotypes from unstructured clinical notes, offering comparable performance to traditional ontology-based methods with easier adaptation to different domains.

Abstract: A significant amount of data held in Oncology Electronic Medical Records (EMRs) is contained in unstructured provider notes – including but not limited to the chemotherapy (or cancer treatment) outcome, different biomarkers, the tumor’s location, sizes, and growth patterns of a patient. The clinical studies show that the majority of oncologists are comfortable providing these valuable insights in their notes in a natural language rather than the relevant structured fields of an EMR. The major contribution of this research is to report an LLM-based framework to process provider notes and extract valuable medical knowledge and phenotype mentioned above, with a focus on the domain of oncology. In this paper, we focus on extracting phenotypes related to breast cancer using our LLM framework, and then compare its performance with earlier works that used knowledge-driven annotation system, paired with the NCIt Ontology Annotator. The results of the study show that an LLM-based information extraction framework can be easily adapted to extract phenotypes with an accuracy that is comparable to the classical ontology-based methods. However, once trained, they could be easily fine-tuned to cater for other cancer types and diseases.

[14] TelcoAgent-Bench: A Multilingual Benchmark for Telecom AI Agents

Lina Bariah, Brahim Mefgouda, Farbod Tavakkoli, Enrique Molero, Louis Powell, Merouane Debbah

Main category: cs.CL

TL;DR: A telecom-specific benchmarking framework (TelcoAgent-Bench and TelcoAgent-Metrics) for evaluating multilingual LLM agents in telecom networks, focusing on intent recognition, tool execution, resolution generation, and stability across scenario variations.

Motivation: The integration of LLM agents into telecom networks introduces challenges in intent recognition, tool execution, and resolution generation while considering operational constraints. There's a need for telecom-specific evaluation frameworks to assess multilingual LLM agents' reliability and operational consistency in real-world telecom environments.

Method: Introduces TelcoAgent-Bench (benchmarking framework) and TelcoAgent-Metrics (structured suite of metrics) that assess semantic understanding, process-level alignment with structured troubleshooting flows, and stability across repeated scenario variations. The framework evaluates intent recognition, ordered tool execution, resolution correctness, and stability across scenario variations, operating in both English and Arabic.

Result: Experimental results show that recent instruct-tuned models can understand telecom problems reasonably well but struggle to consistently follow required troubleshooting steps and maintain stable behavior across different variations of the same scenario. This performance gap becomes more pronounced in unconstrained and bilingual settings.

Conclusion: There’s a significant need for specialized evaluation frameworks for LLM agents in telecom domains, as current models lack consistency in following structured troubleshooting procedures and maintaining stable behavior across scenario variations, especially in multilingual contexts.

Abstract: The integration of large language model (LLM) agents into telecom networks introduces new challenges, related to intent recognition, tool execution, and resolution generation, while taking into consideration different operational constraints. In this paper, we introduce TelcoAgent-Bench and TelcoAgent-Metrics, a Telecom-specific benchmarking framework for evaluating multilingual telecom LLM agents. The proposed framework assesses the semantic understanding as well as process-level alignment with structured troubleshooting flows and stability across repeated scenario variations. Our contribution includes a structured suite of metrics that assess intent recognition, ordered tool execution, resolution correctness, and stability across scenario variations, with the aim of quantifying the reliability and operational consistency of LLM agents in telecom environments. The framework is designed to operate in both English and Arabic, to address the need for multilingual agent deployment in operational network environments. Our experimental results show that although recent instruct-tuned models can understand telecom problems in a reasonable way, they usually struggle to consistently follow the required troubleshooting steps and to maintain stable behavior when exposed to different variations of the same scenario. This performance gap becomes more pronounced in unconstrained and bilingual settings.

[15] Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Hyunjin Hwang, Roy Ka-Wei Lee, Xing Xie, JinYeong Bak

Main category: cs.CL

TL;DR: DOVE is a distributional evaluation framework for measuring cultural value alignment in LLMs by comparing human-written text distributions with LLM-generated outputs using rate-distortion optimization and optimal transport.

Motivation: Existing benchmarks for cultural value alignment in LLMs face the C³ challenge: they use discriminative multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and mismatch with real-world open-ended generation.

Method: DOVE uses a rate-distortion variational optimization objective to construct a compact value-codebook from 10K documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport to capture intra-cultural distributional structures and sub-group diversity.
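
In simplified form, the comparison DOVE performs maps each corpus onto a value codebook and compares the resulting histograms. The sketch below substitutes total variation distance for the paper's unbalanced optimal transport, and the codebook entries and assignments are invented:

```python
# Invented codebook and assignments; total variation stands in for the
# paper's unbalanced optimal transport, purely for illustration.
codebook = ["tradition", "achievement", "benevolence", "hedonism"]

def value_histogram(assignments):
    """assignments: one codebook value per document."""
    total = len(assignments)
    return [assignments.count(v) / total for v in codebook]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

human = value_histogram(["tradition", "tradition", "benevolence", "hedonism"])
model = value_histogram(["achievement", "achievement", "benevolence", "hedonism"])
print(total_variation(human, model))  # 0.5
```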

Result: Experiments across 12 LLMs show DOVE achieves superior predictive validity with 31.56% correlation with downstream tasks while maintaining high reliability with as few as 500 samples per culture.

Conclusion: DOVE provides a robust framework for evaluating cultural value alignment in LLMs that addresses limitations of existing benchmarks by focusing on distributional comparisons rather than discriminative formats.

Abstract: As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing benchmarks face the Construct-Composition-Context ($C^3$) challenge: relying on discriminative, multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and mismatch with real-world open-ended generation. We introduce DOVE, a distributional evaluation framework that directly compares human-written text distributions with LLM-generated outputs. DOVE utilizes a rate-distortion variational optimization objective to construct a compact value-codebook from 10K documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport, capturing intra-cultural distributional structures and sub-group diversity. Experiments across 12 LLMs show that DOVE achieves superior predictive validity, attaining a 31.56% correlation with downstream tasks, while maintaining high reliability with as few as 500 samples per culture.

[16] Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models

Francesco Sovrano, Alberto Bacchelli

Main category: cs.CL

TL;DR: LLMs produce persuasive but often unfaithful explanations; this paper introduces chain-of-illocution prompting to improve source adherence in retrieval-augmented generation for programming education.

Motivation: Large language models generate explanations that may be persuasive but not scrutable or faithful to evidence sources, motivating the need for traceable explanations that can be verified against authoritative sources like textbooks.

Method: Benchmarks six LLMs on 90 Stack Overflow questions grounded in programming textbooks, quantifies source faithfulness via adherence metrics, and introduces chain-of-illocution prompting (CoI) - an illocutionary macro-planning approach that expands queries into implicit explanatory questions to drive retrieval.
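
The macro-planning idea behind CoI can be sketched as expanding a query into explanatory sub-questions, each driving its own retrieval call; the fixed templates and stub retriever are assumptions, since the paper derives the implicit questions with an LLM:

```python
# Fixed templates and a stub retriever are assumptions; CoI derives the
# implicit explanatory questions with an LLM rather than templates.
TEMPLATES = [
    "What is {topic}?",
    "Why does {topic} behave this way?",
    "How is {topic} used correctly?",
]

def expand_query(topic):
    return [t.format(topic=topic) for t in TEMPLATES]

def retrieve_with_coi(topic, retriever, k=2):
    passages = []
    for sub_q in expand_query(topic):
        passages.extend(retriever(sub_q, k))  # one retrieval per sub-question
    return list(dict.fromkeys(passages))      # de-duplicate, keep order

# Stub retriever keyed on each sub-question's leading word.
stub = lambda q, k: [f"{q.split()[0].lower()}-passage-{i}" for i in range(k)]
print(retrieve_with_coi("Python list slicing", stub))
```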

Result: Non-RAG models show 0% median source adherence, and baseline RAG systems remain low (22-40%); CoI yields statistically significant gains of up to 63% in source adherence across models, though absolute adherence remains moderate. A user study with 165 retained participants (220 recruited) shows the gains do not harm satisfaction, relevance, or perceived correctness.

Conclusion: Chain-of-illocution prompting improves source faithfulness in retrieval-augmented generation for explanations, addressing the scrutability problem while maintaining user satisfaction, though challenges remain in achieving high absolute adherence.

Abstract: Natural language explanations produced by large language models (LLMs) are often persuasive, but not necessarily scrutable: users cannot easily verify whether the claims in an explanation are supported by evidence. In XAI, this motivates a focus on faithfulness and traceability, i.e., the extent to which an explanation’s claims can be grounded in, and traced back to, an explicit source. We study these desiderata in retrieval-augmented generation (RAG) for programming education, where textbooks provide authoritative evidence. We benchmark six LLMs on 90 Stack Overflow questions grounded in three programming textbooks and quantify source faithfulness via source adherence metrics. We find that non Retrieval-Augmented Generation (RAG) models have median source adherence of 0%, while baseline RAG systems still exhibit low median adherence (22-40%, depending on the model). Motivated by Achinstein’s illocutionary theory of explanation, we introduce illocutionary macro-planning as a descriptive design principle for source-faithful explanations and instantiate it with chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval. Across models, CoI yields statistically significant gains (up to 63%) in source adherence, although absolute adherence remains moderate and the gains are weak or non-significant for some models. A user study with 165 retained participants (220 recruited) indicates that these gains do not harm satisfaction, relevance, or perceived correctness.

[17] ReDAct: Uncertainty-Aware Deferral for LLM Agents

Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov, Ilya Makarov, Timothy Baldwin, Preslav Nakov, Roman Vashurin, Maxim Panov

Main category: cs.CL

TL;DR: ReDAct proposes a dual-LLM agent system where a small, cheap LLM handles most decisions, deferring only uncertain cases to a larger, more reliable but expensive LLM to balance performance and cost in sequential decision-making tasks.

Motivation: LLM-based agents suffer from hallucination issues in sequential decision-making where single mistakes can degrade entire trajectories. While larger LLMs hallucinate less, they have significantly higher per-token costs, creating a tradeoff between reliability and expense.

Method: ReDAct uses two LLMs: a small, cheap model as default and a large, reliable but expensive model as backup. The system defers decisions to the large model only when the small model’s predictive uncertainty exceeds a calibrated threshold, optimizing cost-performance tradeoff.
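
The deferral rule itself is simple to sketch: act with the small model unless its predictive uncertainty exceeds a calibrated threshold. The entropy-based uncertainty and stub models below are stand-ins for the paper's setup:

```python
import math

# Entropy over the small model's action distribution stands in for the
# paper's uncertainty measure; both models here are stubs.
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def redact_decide(obs, small_model, large_model, threshold):
    action, probs = small_model(obs)
    if entropy(probs) > threshold:      # small model is unsure: defer
        return large_model(obs), "large"
    return action, "small"

# Stub policies: the small model is only confident on "easy" observations.
small = lambda obs: ("go_left", [0.9, 0.1]) if obs == "easy" else ("go_left", [0.5, 0.5])
large = lambda obs: "go_right"

print(redact_decide("easy", small, large, threshold=0.6))  # ('go_left', 'small')
print(redact_decide("hard", small, large, threshold=0.6))  # ('go_right', 'large')
```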

Result: In text-based embodied environments (ALFWorld and MiniGrid), deferring only about 15% of decisions to the large model matches the quality of using it exclusively while significantly reducing inference costs.

Conclusion: ReDAct effectively addresses the cost-reliability tradeoff in LLM-based agents by selectively deferring uncertain decisions to larger models, achieving comparable performance at substantially lower computational cost.

Abstract: Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a significantly higher per-token cost. In this paper, we address this tradeoff by proposing ReDAct (Reason-Defer-Act). In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model. When the predictive uncertainty of the small model exceeds a calibrated threshold, the decision is deferred to the large model. We evaluate our approach in text-based embodied environments such as ALFWorld and MiniGrid and show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively, while significantly reducing inference costs.

[18] Invisible Influences: Investigating Implicit Intersectional Biases through Persona Engineering in Large Language Models

Nandini Arimanda, Achyuth Mukund, Sakthi Balan Muthiah, Rajesh Sharma

Main category: cs.CL

TL;DR: BADx is a novel metric that measures persona-induced bias amplification in LLMs, combining differential bias scores, persona sensitivity, volatility, and explainability analysis.

Motivation: Existing bias audits for LLMs rely on static embedding-based tests that fail to capture dynamic bias shifts when models adopt different social roles/personas, especially in persona-driven contexts where implicit intersectional biases can be amplified.

Method: BADx metric with three components: differential bias scores (BAD, based on CEAT, I-WEAT, I-SEAT), Persona Sensitivity Index (PSI), and volatility (standard deviation), augmented by LIME-based explainability analysis. Two tasks: Task 1 establishes static bias baselines; Task 2 applies six persona frames (marginalized and structurally advantaged) to measure BADx across five SOTA LLMs.
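
The three components can be sketched as follows, with invented per-persona scores; the paper's BAD values come from CEAT/I-WEAT/I-SEAT effect sizes, and reading PSI as a mean absolute shift is an assumption:

```python
import statistics

# Invented scores; in the paper these come from embedding-association tests.
def bad(persona_score, baseline_score):
    """Bias amplification differential for one persona frame."""
    return persona_score - baseline_score

def psi(persona_scores, baseline_score):
    """Mean absolute shift across persona frames (one reading of the index)."""
    return statistics.mean(abs(s - baseline_score) for s in persona_scores)

def volatility(persona_scores):
    """Standard deviation of bias across persona frames."""
    return statistics.stdev(persona_scores)

baseline = 0.30                                   # static (Task 1) bias score
personas = [0.45, 0.25, 0.50, 0.30, 0.40, 0.20]   # six persona frames (Task 2)
print([round(bad(s, baseline), 2) for s in personas])
print(round(psi(personas, baseline), 3), round(volatility(personas), 3))
```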

Result: Persona context significantly modulates bias: GPT-4o shows high sensitivity and volatility; DeepSeek-R1 suppresses bias but with erratic volatility; LLaMA-4 maintains low volatility and stable bias profile; Claude 4.0 Sonnet achieves balanced modulation; Gemma-3n E4B has lowest volatility with moderate amplification. BADx outperforms static methods in revealing context-sensitive biases.

Conclusion: BADx provides a systematic, scalable method to detect dynamic implicit intersectional bias in LLMs, offering better insights than static methods by capturing persona-induced bias amplification with integrated explainability.

Abstract: Large Language Models (LLMs) excel at human-like language generation but often embed and amplify implicit, intersectional biases, especially under persona-driven contexts. Existing bias audits rely on static, embedding-based tests (CEAT, I-WEAT, I-SEAT) that quantify absolute association strengths. We show that they have limitations in capturing dynamic shifts when models adopt social roles. We address this gap by introducing the Bias Amplification Differential and Explainability Score (BADx): a novel, scalable metric that measures persona-induced bias amplification and integrates local explainability insights. BADx comprises three components - differential bias scores (BAD, based on CEAT, I-WEAT, I-SEAT), Persona Sensitivity Index (PSI), and Volatility (Standard Deviation), augmented by LIME-based analysis for emphasizing explainability. This study is divided and performed as two different tasks. Task 1 establishes static bias baselines, and Task 2 applies six persona frames (marginalized and structurally advantaged) to measure BADx, PSI, and volatility. This is studied across five state-of-the-art LLMs (GPT-4o, DeepSeek-R1, LLaMA-4, Claude 4.0 Sonnet and Gemma-3n E4B). Results show persona context significantly modulates bias. GPT-4o exhibits high sensitivity and volatility; DeepSeek-R1 suppresses bias but with erratic volatility; LLaMA-4 maintains low volatility and a stable bias profile with limited amplification; Claude 4.0 Sonnet achieves balanced modulation; and Gemma-3n E4B attains the lowest volatility with moderate amplification. BADx performs better than static methods by revealing context-sensitive biases overlooked in static methods. Our unified method offers a systematic way to detect dynamic implicit intersectional bias in five popular LLMs.

[19] Unsupervised Neural Network for Automated Classification of Surgical Urgency Levels in Medical Transcriptions

Sadaf Tabatabaee, Sarah S. Lam

Main category: cs.CL

TL;DR: Unsupervised neural network approach using BioClinicalBERT embeddings and clustering (K-means/DEC) to classify surgical procedures by urgency (immediate/urgent/elective), validated by expert review and enhanced with BiLSTM classification.

Motivation: Need for efficient surgical procedure classification by urgency to optimize patient care and resource allocation in healthcare systems, addressing the challenge of limited labeled data.

Method: Uses BioClinicalBERT to transform surgical transcripts into embeddings, clusters them with K-means and DEC algorithms, validates clusters via Modified Delphi Method with experts, then builds BiLSTM neural network with BioClinicalBERT embeddings for classification.
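
The baseline clustering step can be illustrated with a toy k-means on stand-in vectors; real inputs would be BioClinicalBERT embeddings with k=3 urgency levels, and the paper finds DEC superior to this plain k-means:

```python
# Pure-Python k-means with deterministic init, for illustration only.
def kmeans(points, k, iters=20):
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: move each centroid to its cluster's mean.
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return clusters

# Two well-separated 2-D groups standing in for urgency-level embeddings.
pts = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0)]
groups = kmeans(pts, k=2)
print(sorted(len(g) for g in groups))  # [2, 3]
```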

Result: DEC outperforms K-means in forming cohesive clusters; the final BiLSTM model achieves robust performance with strong generalization on unseen data, validated through cross-validation and metrics (accuracy, precision, recall, F1-score).

Conclusion: The unsupervised framework provides scalable, reliable solution for real-time surgical prioritization, enhancing operational efficiency and patient outcomes while overcoming limited labeled data challenges.

Abstract: Efficient classification of surgical procedures by urgency is paramount to optimize patient care and resource allocation within healthcare systems. This study introduces an unsupervised neural network approach to automatically categorize surgical transcriptions into three urgency levels: immediate, urgent, and elective. Leveraging BioClinicalBERT, a domain-specific language model, surgical transcripts are transformed into high-dimensional embeddings that capture their semantic nuances. These embeddings are subsequently clustered using both K-means and Deep Embedding Clustering (DEC) algorithms, in which DEC demonstrates superior performance in the formation of cohesive and well-separated clusters. To ensure clinical relevance and accuracy, the clustering results undergo validation through the Modified Delphi Method, which involves expert review and refinement. Following validation, a neural network that integrates Bidirectional Long Short-Term Memory (BiLSTM) layers with BioClinicalBERT embeddings is developed for classification tasks. The model is rigorously evaluated using cross-validation and metrics such as accuracy, precision, recall, and F1-score, which achieve robust performance and demonstrate strong generalization capabilities on unseen data. This unsupervised framework not only addresses the challenge of limited labeled data but also provides a scalable and reliable solution for real-time surgical prioritization, which ultimately enhances operational efficiency and patient outcomes in dynamic medical environments.

[20] Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning

Tirthankar Mittra

Main category: cs.CL

TL;DR: A reinforcement learning framework using base-ten blocks to study how children learn number composition, examining the impact of linguistic instructions and curriculum design on learning efficiency.

Motivation: Numerical cognition in toddlers sits at the intersection of language, logic, perception, and culture; the authors study how variations in linguistic instructions affect the learning of number composition.

Method: Built a reinforcement learning framework using state-of-the-art RL algorithms and neural network architectures to simulate how children compose numbers using base-ten blocks, with focus on analyzing the impact of different linguistic instructions and curriculum design.

Result: Instructions providing explicit action guidance were more effective for RL agents to construct numbers. An effective curriculum for ordering numerical-composition examples during training resulted in faster convergence and improved generalization to unseen data.

Conclusion: The findings highlight the role of language and multi-modal signals in numerical cognition and provide hypotheses for designing effective instructional strategies for early childhood education.

Abstract: In this paper, we build a reinforcement learning framework to study how children compose numbers using base-ten blocks. Studying numerical cognition in toddlers offers a powerful window into the learning process itself, because numbers sit at the intersection of language, logic, perception, and culture. Specifically, we utilize state of the art (SOTA) reinforcement learning algorithms and neural network architectures to understand how variations in linguistic instructions can affect the learning process. Our results also show that instructions providing explicit action guidance are a more effective learning signal for RL agents to construct numbers. Furthermore, we identify an effective curriculum for ordering numerical-composition examples during training, resulting in faster convergence and improved generalization to unseen data. These findings highlight the role of language and multi-modal signals in numerical cognition and provide hypotheses for designing effective instructional strategies for early childhood education.

[21] Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses

Khizar Hussain, Bradley A. Malin, Zhijun Yin, Susannah Leigh Rose, Murat Kantarcioglu

Main category: cs.CL

TL;DR: A framework combining human expertise with LLMs to detect hallucinations and omissions in mental health counseling chatbots, using interpretable domain-informed features rather than black-box LLM judging.

Motivation: LLM-powered chatbots in mental health services require reliable detection of hallucinations and omissions for user safety, but current LLM-as-a-judge methods fail in high-risk healthcare contexts where subtle errors have serious consequences.

Method: Proposes a framework integrating human expertise with LLMs to extract interpretable, domain-informed features across five analytical dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness. Traditional ML models are trained on these features.

Result: Traditional ML models achieve 0.717 F1 on a custom dataset and 0.849 F1 on a public benchmark for hallucination detection, with 0.59-0.64 F1 for omission detection across both datasets, outperforming LLM judges (52% accuracy).

Conclusion: Combining domain expertise with automated methods yields more reliable and transparent evaluation than black-box LLM judging in high-stakes mental health applications.

Abstract: As LLM-powered chatbots are increasingly deployed in mental health services, detecting hallucinations and omissions has become critical for user safety. However, state-of-the-art LLM-as-a-judge methods often fail in high-risk healthcare contexts, where subtle errors can have serious consequences. We show that leading LLM judges achieve only 52% accuracy on mental health counseling data, with some hallucination detection approaches exhibiting near-zero recall. We identify the root cause as LLMs’ inability to capture nuanced linguistic and therapeutic patterns recognized by domain experts. To address this, we propose a framework that integrates human expertise with LLMs to extract interpretable, domain-informed features across five analytical dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness. Experiments on a public mental health dataset and a new human-annotated dataset show that traditional machine learning models trained on these features achieve 0.717 F1 on our custom dataset and 0.849 F1 on a public benchmark for hallucination detection, with 0.59-0.64 F1 for omission detection across both datasets. Our results demonstrate that combining domain expertise with automated methods yields more reliable and transparent evaluation than black-box LLM judging in high-stakes mental health applications.

[22] STDec: Spatio-Temporal Stability Guided Decoding for dLLMs

Yuzhe Chen, Jiale Cao, Xuyang Liu, Jin Xie, Aiping Yang, Yanwei Pang

Main category: cs.CL

TL;DR: STDec is a training-free decoding method for diffusion LLMs that uses spatio-temporal stability to improve throughput while maintaining performance.

Motivation: Current diffusion LLMs use global confidence thresholds and don't model local context or temporal consistency, limiting their efficiency and performance.

Method: STDec uses spatial-aware decoding (token-adaptive thresholds based on neighbor states) and temporal-aware decoding (relaxed thresholds for consistent token IDs across steps).

Result: Achieves up to 14.17x speedup on MBPP with LLaDA while maintaining comparable task performance scores across textual reasoning and multimodal benchmarks.

Conclusion: STDec demonstrates that leveraging spatio-temporal stability in dLLM decoding can substantially improve throughput without sacrificing performance.

Abstract: Diffusion Large Language Models (dLLMs) have achieved rapid progress, viewed as a promising alternative to the autoregressive paradigm. However, most dLLM decoders still adopt a global confidence threshold, and do not explicitly model local context from neighboring decoded states or temporal consistency of predicted token IDs across steps. To address this issue, we propose a simple spatio-temporal stability guided decoding approach, named STDec. We observe strong spatio-temporal stability in dLLM decoding: newly decoded tokens tend to lie near decoded neighbors, and their predicted IDs often remain consistent across several denoising steps. Inspired by this stability, our STDec includes spatial-aware decoding and temporal-aware decoding. The spatial-aware decoding dynamically generates the token-adaptive threshold by aggregating the decoded states of nearby tokens. The temporal-aware decoding relaxes the decoding thresholds for tokens whose predicted token IDs remain consistent over denoising steps. Our STDec is training-free and remains compatible with cache-based acceleration methods. Across textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while maintaining comparable task performance score. Notably, on MBPP with LLaDA, STDec achieves up to 14.17x speedup with a comparable score. Homepage: https://yzchen02.github.io/STDec.
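The abstract does not give STDec's exact threshold rules, but the two mechanisms can be illustrated with a toy sketch: thresholds drop for tokens with many decoded neighbors (spatial) and for tokens whose predicted ID has stayed stable across denoising steps (temporal). All hyperparameters here are invented for illustration:

```python
import numpy as np

def stdec_thresholds(decoded, stable, base=0.9, window=2,
                     spatial_gain=0.1, temporal_relax=0.1):
    """Toy per-token decoding thresholds (hyperparameters invented).

    decoded: bool array, positions already decoded
    stable:  bool array, positions whose predicted token ID was
             unchanged over recent denoising steps
    """
    n = decoded.size
    thr = np.full(n, base)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        density = decoded[lo:hi].mean()     # local decoded density
        thr[i] -= spatial_gain * density    # spatial: easier near decoded neighbors
    thr[stable] -= temporal_relax           # temporal: easier if prediction is stable
    return np.clip(thr, 0.0, 1.0)

decoded = np.array([True, True, False, False, False, True])
stable = np.array([False, False, True, False, False, False])
thr = stdec_thresholds(decoded, stable)
```

Tokens whose confidence clears their (lowered) threshold decode earlier, which is where the throughput gain comes from.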

[23] Severity-Aware Weighted Loss for Arabic Medical Text Generation

Ahmed Alansary, Molham Mohamed, Ali Hamdi

Main category: cs.CL

TL;DR: Proposes severity-aware weighted loss for fine-tuning Arabic LLMs on medical complaint-response data, using soft severity probabilities to prioritize clinically critical cases without architectural changes.

Motivation: Traditional fine-tuning treats all medical cases uniformly, ignoring differences in clinical severity; this matters in healthcare, where errors in severe cases carry higher risk, motivating severity-aware optimization for Arabic medical text generation.

Method: Uses soft severity probabilities from a fine-tuned AraBERT classifier to dynamically scale token-level loss contributions during optimization. Applied to ten Arabic LLMs using MAQA dataset with severity labels automatically derived and incorporated at loss level.

Result: Severity-aware optimization consistently outperforms standard cross-entropy: AraGPT2-Base improved from 54.04% to 66.14%, AraGPT2-Medium from 59.16% to 67.18%, Qwen2.5-0.5B from 57.83% to 66.86%, with peak performance reaching 67.18% and up to 12.10% improvement over baselines.

Conclusion: Severity-aware fine-tuning delivers robust, architecture-consistent gains for Arabic medical text generation by prioritizing clinically critical interactions through weighted loss optimization.

Abstract: Large language models have shown strong potential for Arabic medical text generation; however, traditional fine-tuning objectives treat all medical cases uniformly, ignoring differences in clinical severity. This limitation is particularly critical in healthcare settings, where errors in severe cases contain higher clinical risk. In this work, we propose a severity-aware weighted loss for fine-tuning Arabic language models on medical complaint-response data. The method depends on soft severity probabilities to dynamically scale token-level loss contributions during optimization, thereby prioritizing clinically critical interactions without modifying model architectures. Experiments are conducted using the MAQA dataset, which provides Arabic medical complaints and trusted human responses. Severity labels and probabilistic scores are automatically derived using a fine-tuned AraBERT-based classifier and incorporated exclusively at the loss level. The proposed approach is evaluated across ten Arabic large language models of varying architectures and parameter scales. While standard cross-entropy fine-tuning yields only modest improvements, severity-aware optimization consistently achieves larger gains. Using a balanced weighting configuration, performance improves from 54.04% to 66.14% for AraGPT2-Base, from 59.16% to 67.18% for AraGPT2-Medium, and from 57.83% to 66.86% for Qwen2.5-0.5B, with peak performance reaching 67.18%. Overall, severity-aware fine-tuning delivers improvements of up to 12.10% over non-fine-tuned baselines, demonstrating robust and architecture-consistent gains.
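The core weighting idea can be sketched in a few lines: scale an example's mean token-level loss by the expected severity weight under the classifier's soft probabilities. The weight mapping below is illustrative, not the paper's balanced configuration:

```python
import numpy as np

def severity_weighted_loss(token_logprobs, severity_probs,
                           class_weights=(1.0, 1.5, 2.0)):
    """Scale an example's mean token NLL by its expected severity weight.

    token_logprobs: log p(target token) at each position, shape (T,)
    severity_probs: soft probabilities over (mild, moderate, critical)
    class_weights:  illustrative weights, not the paper's configuration
    """
    w = float(np.dot(severity_probs, class_weights))  # expected severity weight
    return -w * np.mean(token_logprobs)

logp = np.log(np.array([0.5, 0.25, 0.8]))             # toy target-token probs
mild = severity_weighted_loss(logp, np.array([1.0, 0.0, 0.0]))
critical = severity_weighted_loss(logp, np.array([0.0, 0.0, 1.0]))
```

Because only the loss is scaled, the model architecture is untouched, matching the paper's claim of architecture-consistent gains.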

[24] In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads

Charlotte Pouw, Hosein Mohebbi, Afra Alishahi, Willem Zuidema

Main category: cs.CL

TL;DR: Speech Language Models show in-context learning capabilities for TTS tasks, with speaking rate being a key acoustic feature that affects performance and is mimicked in output, while pitch and intensity have minimal impact.

Motivation: While ICL has been extensively studied in text-only language models, it remains largely unexplored in the speech domain. The researchers aim to understand how linguistic and acoustic features affect ICL in Speech Language Models, particularly for Text-to-Speech tasks.

Method: The study focuses on TTS tasks to analyze ICL from two perspectives: (1) how accurately the model infers the task from demonstrations (generating correct spoken content), and (2) to what extent the model mimics acoustic characteristics of demonstration speech. The researchers investigate the impact of speaking rate, pitch range, and intensity on ICL performance, and examine the role of induction heads in speech-based ICL through ablation studies.

Result: Speaking rate strongly affects ICL performance and is mimicked in the output, whereas pitch range and intensity have little impact on performance and are not consistently reproduced. The study also shows that induction heads play a causal role in speech-based ICL - ablating the top-k induction heads completely removes the model’s ICL ability, mirroring findings from text-based ICL.

Conclusion: Speech Language Models exhibit in-context learning capabilities similar to text models, with speaking rate being a critical acoustic feature for ICL performance and mimicry. The findings reveal parallels between speech and text ICL mechanisms, particularly regarding the role of induction heads.

Abstract: In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain. Here, we investigate how linguistic and acoustic features affect ICL in Speech Language Models. We focus on the Text-to-Speech (TTS) task, which allows us to analyze ICL from two angles: (1) how accurately the model infers the task from the demonstrations (i.e., generating the correct spoken content), and (2) to what extent the model mimics the acoustic characteristics of the demonstration speech in its output. We find that speaking rate strongly affects ICL performance and is also mimicked in the output, whereas pitch range and intensity have little impact on performance and are not consistently reproduced. Finally, we investigate the role of induction heads in speech-based ICL and show that these heads play a causal role: ablating the top-k induction heads completely removes the model’s ICL ability, mirroring findings from text-based ICL.
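The ablation used in the causal test amounts to zeroing the outputs of selected attention heads before they are concatenated and mixed back into the residual stream. A generic sketch (head indices and shapes are arbitrary):

```python
import numpy as np

def ablate_heads(head_outputs, heads_to_ablate):
    """Zero selected heads' outputs, then concatenate heads as usual.

    head_outputs: (n_heads, seq_len, head_dim) per-head attention outputs
    """
    out = head_outputs.copy()
    out[list(heads_to_ablate)] = 0.0        # ablated heads contribute nothing
    n_heads, seq_len, head_dim = out.shape
    return out.transpose(1, 0, 2).reshape(seq_len, n_heads * head_dim)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 5, 8))              # 4 heads, 5 tokens, head dim 8
mixed = ablate_heads(h, heads_to_ablate=[1, 3])
```

In the paper's setup, the ablated set is the top-k heads ranked by an induction-head score, and ICL accuracy is re-measured afterwards.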

[25] A Severity-Based Curriculum Learning Strategy for Arabic Medical Text Generation

Ahmed Alansary, Molham Mohamed, Ali Hamdi

Main category: cs.CL

TL;DR: A severity-based curriculum learning strategy for Arabic medical text generation that gradually trains models from mild to critical medical cases, improving performance over baselines.

Motivation: Existing Arabic medical text generation methods treat all training samples equally, ignoring clinical severity differences, which hinders handling complex or high-risk cases effectively.

Method: Proposes a severity-based curriculum learning strategy that divides the MAQA dataset into three severity levels (Mild, Moderate, Critical) using rule-based annotation, then trains models gradually from easier to harder cases during fine-tuning.

Result: The approach yields consistent improvements across all tested models, achieving +4% to +7% gains over baselines and +3% to +6% over conventional fine-tuning methods.

Conclusion: Structured curriculum learning based on clinical severity effectively enhances Arabic medical text generation by allowing models to progressively learn from simpler to more complex medical scenarios.

Abstract: Arabic medical text generation is increasingly needed to help users interpret symptoms and access general health guidance in their native language. Nevertheless, many existing methods assume uniform importance across training samples, overlooking differences in clinical severity. This simplification can hinder the model’s ability to properly capture complex or high-risk cases. To overcome this issue, this work introduces a Severity-based Curriculum Learning Strategy for Arabic Medical Text Generation, where the training process is structured to move gradually from less severe to more critical medical conditions. The approach divides the dataset into ordered stages based on severity and incrementally exposes the model to more challenging cases during fine-tuning, allowing it to first learn basic medical patterns before addressing more complex scenarios. The proposed method is evaluated on a subset of the Medical Arabic Question Answering (MAQA) dataset, which includes Arabic medical questions describing symptoms alongside corresponding responses. In addition, the dataset is annotated with three severity levels (Mild, Moderate, and Critical) using a rule-based method developed in this study. The results demonstrate that incorporating severity-aware curriculum learning leads to consistent performance improvements across all tested models, with gains of around +4% to +7% over baseline models and +3% to +6% compared with conventional fine-tuning approaches.
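The staged exposure described above can be sketched as a cumulative schedule over the three severity labels (the pacing here is illustrative; the paper's exact staging may differ):

```python
def severity_curriculum(examples):
    """Order examples Mild -> Moderate -> Critical as cumulative stages,
    so later stages still rehearse earlier, easier cases."""
    order = {"Mild": 0, "Moderate": 1, "Critical": 2}
    stages, seen = [], []
    for level in sorted(order, key=order.get):
        seen += [ex for ex in examples if ex["severity"] == level]
        stages.append(list(seen))           # stage k covers levels 0..k
    return stages

data = [{"id": 1, "severity": "Critical"},
        {"id": 2, "severity": "Mild"},
        {"id": 3, "severity": "Moderate"}]
stages = severity_curriculum(data)
```

Fine-tuning then proceeds stage by stage, so the model sees basic patterns before the critical cases.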

[26] The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models

Michael Rizvi-Martel, Guillaume Rabusseau, Marius Mosbach

Main category: cs.CL

TL;DR: Latent CoT reasoning may enable superposition (multiple solutions in one representation), but models only use it when trained from scratch, not in training-free or fine-tuned regimes where shortcuts dominate.

Motivation: To investigate whether language models actually leverage superposition (maintaining multiple candidate solutions simultaneously) when reasoning with latent continuous chain-of-thoughts, despite theoretical arguments suggesting they should.

Method: Study three regimes: 1) training-free (latent thoughts as convex combinations of token embeddings), 2) fine-tuned (base model adapted to produce latent thoughts), and 3) from-scratch (model trained entirely with latent thoughts). Use Logit Lens and entity-level probing to analyze internal representations.

Result: Only models trained from scratch exhibit signs of using superposition. In training-free and fine-tuned regimes, superposition either collapses or is not used at all, with models discovering shortcut solutions instead.

Conclusion: Pretraining on natural language biases models to commit to tokens in last layers, and capacity affects solution preferences. Superposition only emerges under specific training conditions in latent CoT reasoning.

Abstract: Latent reasoning via continuous chain-of-thoughts (Latent CoT) has emerged as a promising alternative to discrete CoT reasoning. Operating in continuous space increases expressivity and has been hypothesized to enable superposition: the ability to maintain multiple candidate solutions simultaneously within a single representation. Despite theoretical arguments, it remains unclear whether language models actually leverage superposition when reasoning using latent CoTs. We investigate this question across three regimes: a training-free regime that constructs latent thoughts as convex combinations of token embeddings, a fine-tuned regime where a base model is adapted to produce latent thoughts, and a from-scratch regime where a model is trained entirely with latent thoughts to solve a given task. Using Logit Lens and entity-level probing to analyze internal representations, we find that only models trained from scratch exhibit signs of using superposition. In the training-free and fine-tuned regimes, we find that the superposition either collapses or is not used at all, with models discovering shortcut solutions instead. We argue that this is due to two complementary phenomena: i) pretraining on natural language data biases models to commit to a token in the last layers ii) capacity has a huge effect on which solutions a model favors. Together, our results offer a unified explanation for when and why superposition arises in continuous chain-of-thought reasoning, and identify the conditions under which it collapses.
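Logit Lens, one of the two analysis tools used here, projects an intermediate hidden state through the unembedding matrix to read off early token preferences. A minimal sketch with random stand-ins for the model's weights:

```python
import numpy as np

def logit_lens(hidden, W_U):
    """Project an intermediate hidden state (d,) through the unembedding
    matrix W_U (d, vocab) and return a token distribution."""
    logits = hidden @ W_U
    logits = logits - logits.max()          # numerically stable softmax
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
d, vocab = 16, 50
W_U = rng.normal(size=(d, vocab))           # stand-in unembedding matrix
h_mid = rng.normal(size=d)                  # stand-in mid-layer hidden state
probs = logit_lens(h_mid, W_U)
```

A representation in superposition would spread probability mass over several candidate tokens here, whereas the collapse the paper reports shows mass committing to a single token in the last layers.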

[27] Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao, Ritankar Das

Main category: cs.CL

TL;DR: Multi-stage RL+SFT optimization strategy for enhancing LLM pedagogical knowledge, achieving SOTA on educational benchmarks with 32B models that outperform larger proprietary systems.

Motivation: To develop domain-specialized LLMs for education that can outperform larger general-purpose models while maintaining transparency and cost-efficiency for responsible educational AI deployment.

Method: Three-stage optimization: (1) RL with progressive difficulty training, challenging examples focus, and extended reasoning rollouts; (2) SFT using RL-trained model to synthesize high-quality data with difficulty-weighted sampling; (3) optional second RL round.

Result: EduQwen 32B models achieve new SOTA on Cross-Domain Pedagogical Knowledge Benchmark and interactive Pedagogy Benchmark Leaderboard, surpassing larger proprietary systems like Gemini-3 Pro.

Conclusion: Domain-specialized optimization can transform mid-sized open-source LLMs into true pedagogical experts that outperform much larger general-purpose systems while preserving transparency and cost-efficiency.

Abstract: We present an innovative multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), as illustrated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model EduQwen 32B-SFT-RL2: (1) RL optimization that implements progressive difficulty training, focuses on challenging examples, and employs extended reasoning rollouts; (2) a subsequent SFT phase that leverages the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly larger proprietary systems such as the previous benchmark leader Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can transform mid-sized open-source LLMs into true pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.
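The difficulty-weighted sampling mentioned in the SFT phase can be illustrated as a softmax over difficulty scores; the scoring and temperature below are invented for the sketch, not EduQwen's actual recipe:

```python
import numpy as np

def difficulty_weighted_sample(difficulties, n, temperature=1.0, seed=0):
    """Sample example indices with probability increasing in difficulty
    (softmax over difficulty scores)."""
    d = np.asarray(difficulties, dtype=float)
    w = np.exp((d - d.max()) / temperature)  # stabilized softmax weights
    p = w / w.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(d), size=n, p=p)

# the hardest example (score 5.0) dominates the sampled batch
idx = difficulty_weighted_sample([0.0, 0.0, 5.0], n=100)
```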

[28] ART: Attention Replacement Technique to Improve Factuality in LLMs

Ziqin Luo, Yihao Quan, Xiaofeng Zhang, Xiaosong Yuan, Chen Shen

Main category: cs.CL

TL;DR: ART: A training-free method that replaces uniform attention patterns in shallow LLM layers with local attention to reduce hallucinations by focusing on relevant context.

Motivation: Hallucination in LLMs remains a major problem, especially in tasks like QA where models generate plausible but incorrect information. While various mitigation methods exist, the relationship between attention patterns and hallucinations hasn't been fully explored.

Method: Attention Replacement Technique (ART) - a training-free method that analyzes attention score distributions across layers and heads, identifies uniform attention patterns in shallow layers, and replaces them with local attention patterns to focus on relevant context.

Result: ART demonstrates significant reductions in hallucinations across multiple LLM architectures, proving effectiveness and generalizability without requiring fine-tuning or additional training data.

Conclusion: Uniform attention patterns in shallow LLM layers contribute to hallucinations, and replacing them with local attention patterns via ART effectively reduces hallucinations in a training-free manner across various models.

Abstract: Hallucination in large language models (LLMs) continues to be a significant issue, particularly in tasks like question answering, where models often generate plausible yet incorrect or irrelevant information. Although various methods have been proposed to mitigate hallucinations, the relationship between attention patterns and hallucinations has not been fully explored. In this paper, we analyze the distribution of attention scores across each layer and attention head of LLMs, revealing a common and intriguing phenomenon: shallow layers of LLMs primarily rely on uniform attention patterns, where the model distributes its attention evenly across the entire sequence. This uniform attention pattern can lead to hallucinations, as the model fails to focus on the most relevant information. To mitigate this issue, we propose a training-free method called Attention Replacement Technique (ART), which replaces these uniform attention patterns in the shallow layers with local attention patterns. This change directs the model to focus more on the relevant contexts, thus reducing hallucinations. Through extensive experiments, ART demonstrates significant reductions in hallucinations across multiple LLM architectures, proving its effectiveness and generalizability without requiring fine-tuning or additional training data.
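A toy version of the replacement step: flag attention rows whose maximum weight is close to the uniform value 1/n, and overwrite them with a renormalized local window. The uniformity test and window size are illustrative, not ART's exact criteria:

```python
import numpy as np

def art_replace(attn, window=2, uniform_tol=0.5):
    """Replace near-uniform attention rows with a local window pattern.

    attn: (n, n) row-stochastic attention map.
    A row counts as near-uniform when its max weight is within
    `uniform_tol` (relative) of the uniform value 1/n.
    """
    n = attn.shape[0]
    out = attn.copy()
    for i in range(n):
        if out[i].max() < (1.0 + uniform_tol) / n:   # near-uniform row
            mask = np.zeros(n)
            lo, hi = max(0, i - window), min(n, i + window + 1)
            mask[lo:hi] = 1.0
            out[i] = mask / mask.sum()               # local, renormalized
    return out

uniform = np.full((6, 6), 1.0 / 6)   # a "uniform" shallow-layer map
peaked = np.eye(6)                   # an already-focused map, left intact
local = art_replace(uniform)
kept = art_replace(peaked)
```

Because only uniform rows are rewritten, focused heads keep their original behavior, which is why the method needs no retraining.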

[29] FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts

Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva

Main category: cs.CL

TL;DR: Using LLMs (GPT-4.1) for Spanish clinical text analysis to detect substance use mentions (Tobacco, Alcohol, Cannabis, Drug) with few-shot prompting achieving 0.65 F1 score.

Motivation: To develop methods for recognizing toxic habits named entities in Spanish clinical texts, addressing the need for multilingual clinical NLP applications beyond English.

Method: Explored various LLM approaches including zero-shot, few-shot, and prompt optimization for the ToxHabits Shared Task subtask 1, with GPT-4.1’s few-shot prompting performing best.

Result: Achieved an F1 score of 0.65 on the test set, demonstrating promising performance for named entity recognition in non-English languages.

Conclusion: LLMs can effectively recognize named entities in Spanish clinical texts, with few-shot prompting showing strong performance for multilingual clinical NLP tasks.

Abstract: The paper presents an approach for the recognition of toxic habits named entities in Spanish clinical texts. The approach was developed for the ToxHabits Shared Task. Our team participated in subtask 1, which aims to detect substance use and abuse mentions in clinical case reports and classify them in four categories (Tobacco, Alcohol, Cannabis, and Drug). We explored various methods of utilizing LLMs for the task, including zero-shot, few-shot, and prompt optimization, and found that GPT-4.1’s few-shot prompting performed the best in our experiments. Our method achieved an F1 score of 0.65 on the test set, demonstrating a promising result for recognizing named entities in languages other than English.
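A few-shot prompt for this subtask can be assembled mechanically from demonstration pairs; the wording below is a hypothetical template, not the team's actual prompt:

```python
def few_shot_prompt(demos, text,
                    categories=("Tobacco", "Alcohol", "Cannabis", "Drug")):
    """Assemble a few-shot NER prompt from (text, entities) demonstrations."""
    lines = ["Extract substance-use mentions from the text and classify each "
             "as one of: " + ", ".join(categories) + "."]
    for demo_text, entities in demos:
        lines.append("Text: " + demo_text)
        lines.append("Entities: " +
                     "; ".join(f"{span} -> {cat}" for span, cat in entities))
    lines.append("Text: " + text)
    lines.append("Entities:")              # model completes from here
    return "\n".join(lines)

demos = [("Patient smokes 10 cigarettes a day.", [("smokes", "Tobacco")])]
prompt = few_shot_prompt(demos, "Denies alcohol use.")
```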

[30] Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries

Rebecca M. M. Hicke, Sil Hamilton, David Mimno, Ross Deans Kristensen-McLachlan

Main category: cs.CL

TL;DR: LLMs struggle to match human patterns of narrative understanding when summarizing novels, showing different focus distribution and stylistic differences compared to human summarizers.

Motivation: To assess whether LLMs can mirror human patterns of conceptual engagement with long-form texts by comparing human and LLM-authored novel summaries, despite growing context lengths.

Method: Aligned sentences from 150 human-written novel summaries with specific chapters they reference, then generated and aligned additional summaries by nine state-of-the-art LLMs for each reference text, comparing human and model patterns.

Result: Found both stylistic differences and differences in how humans and LLMs distribute focus throughout narratives, with models emphasizing the ends of texts, suggesting degraded narrative comprehension compared to humans.

Conclusion: LLMs have not kept pace with human narrative understanding capabilities despite increased context lengths, revealing targets for future development in long-form text comprehension.

Abstract: Although LLM context lengths have grown, there is evidence that their ability to integrate information across long-form texts has not kept pace. We evaluate one such understanding task: generating summaries of novels. When human authors of summaries compress a story, they reveal what they consider narratively important. Therefore, by comparing human and LLM-authored summaries, we can assess whether models mirror human patterns of conceptual engagement with texts. To measure conceptual engagement, we align sentences from 150 human-written novel summaries with the specific chapters they reference. We demonstrate the difficulty of this alignment task, which indicates the complexity of summarization as a task. We then generate and align additional summaries by nine state-of-the-art LLMs for each of the 150 reference texts. Comparing the human and model-authored summaries, we find both stylistic differences between the texts and differences in how humans and LLMs distribute their focus throughout a narrative, with models emphasizing the ends of texts. Comparing human narrative engagement with model attention mechanisms suggests explanations for degraded narrative comprehension and targets for future development. We release our dataset to support future research.
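The sentence-to-chapter alignment can be sketched as nearest-neighbor matching over embeddings; since the paper stresses that real alignment is harder than this, treat the cosine argmax below as a baseline illustration only:

```python
import numpy as np

def align_sentences(sent_emb, chap_emb):
    """Assign each summary sentence to its most cosine-similar chapter.

    sent_emb: (S, d) sentence embeddings; chap_emb: (C, d) chapter embeddings
    """
    s = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    c = chap_emb / np.linalg.norm(chap_emb, axis=1, keepdims=True)
    return (s @ c.T).argmax(axis=1)         # (S,) chapter index per sentence

chap = np.array([[1.0, 0.0], [0.0, 1.0]])               # two toy chapters
sents = np.array([[0.9, 0.1], [0.2, 0.8], [1.0, 0.2]])  # three summary sentences
assignment = align_sentences(sents, chap)
```

A histogram of assignments then shows where a summary's focus falls, e.g. whether it over-weights late chapters as the models do.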

[31] State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi, Amarendra Chaudhary, Madalina Ciobanu, Qingqing Mao, Ritankar Das

Main category: cs.CL

TL;DR: Arabic-DeepSeek-R1 is a state-of-the-art Arabic LLM using sparse MoE architecture with CoT distillation and Arabic-specific linguistic verification, achieving top performance on Arabic benchmarks while outperforming GPT-5.1 on most tasks.

Motivation: Address the digital equity gap for under-represented languages like Arabic by creating a specialized Arabic LLM, on the premise that Arabic's performance deficit in current LLM ecosystems stems from under-specialization rather than architectural limitations and can be closed through culturally-informed adaptation.

Method: Four-phase CoT distillation scheme integrating Arabic-specific linguistic verification and regional ethical norms, using a sparse MoE backbone trained on 372M tokens of contamination-controlled 80/20 Arabic-English mixture with strategic bilingual data curation.

Result: Achieves the highest average score across the seven-benchmark Open Arabic LLM Leaderboard suite, establishing SOTA or near-SOTA results, including dominant performance on grammar-focused MadinahQA (surpassing GPT-5.1 and the OALL leader), safety-oriented AraTrust, multi-ability AlGhafa, and retrieval-augmented ALRAGE.

Conclusion: Arabic’s performance deficit stems from under-specialization rather than architectural limitations, and parameter-efficient adaptation of open reasoning models can yield breakthrough SOTA performance without industrial-scale pretraining costs, providing a replicable framework for sovereign language technologies.

Abstract: This paper introduces Arabic-DeepSeek-R1, an application-driven open-source Arabic LLM that leverages a sparse MoE backbone to address the digital equity gap for under-represented languages, and establishes a new SOTA across the entire Open Arabic LLM Leaderboard (OALL). Our four-phase CoT distillation scheme integrates Arabic-specific linguistic verification and regional ethical norms into a 372M-token, contamination-controlled 80/20 Arabic-English training mixture. Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by substantial margins), safety-oriented AraTrust, multi-ability AlGhafa, and retrieval-augmented ALRAGE. Our results indicate that the combination of sparse MoE architecture, culturally-informed CoT distillation with explicit Arabic linguistic checks, and strategic bilingual data curation enables an open-source adapted model to systematically outperform the proprietary frontier system GPT-5.1 on the majority of benchmarks evaluating comprehensive language-specific tasks: the first such demonstration for Arabic LLMs. These findings indicate that much of Arabic’s performance deficit in current LLM ecosystems stems from under-specialization rather than architectural limitations, and that parameter-efficient adaptation of open reasoning models can yield breakthrough SOTA performance without industrial-scale pretraining costs. Arabic-DeepSeek-R1 establishes a validated and replicable framework for sovereign and domain-specific language technologies, demonstrating that strategic, culturally-grounded adaptation of sparse MoE backbones offers a viable and cost-effective pathway to achieving record-breaking performance across standardized benchmarks for low-resource languages.

[32] When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don’t

Jonathan Nemitz, Carsten Eickhoff, Junyi Jessy Li, Kyle Mahowald, Michal Golovanevsky, William Rudman

Main category: cs.CL

TL;DR: VLMs systematically violate their own introspective rules in color attribution tasks, unlike humans who remain faithful to their stated rules, revealing miscalibrated self-knowledge in models.

DetailsMotivation: To understand when VLMs behave unexpectedly, whether they can reliably predict their own behavior, and if they adhere to their introspective reasoning for trustworthy deployment.

Method: Introduced Graded Color Attribution (GCA) dataset with controlled line drawings varying pixel-level color coverage across three conditions. Both VLMs and humans establish thresholds for color labeling, then compare these rules with actual decisions.
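The faithfulness check at the core of GCA, comparing a participant's stated threshold against its actual labeling decisions, can be sketched as follows. The data layout and function are illustrative, not the paper's evaluation code.

```python
def rule_violations(stated_threshold, decisions):
    """Count decisions that contradict a stated coverage threshold.

    `decisions` is a list of (coverage, labeled) pairs: the fraction of
    pixels in the target color, and whether the color label was applied.
    A violation is labeling below the threshold or withholding the label
    at or above it.
    """
    violations = 0
    for coverage, labeled in decisions:
        rule_says_label = coverage >= stated_threshold
        if labeled != rule_says_label:
            violations += 1
    return violations

# A model that states a 50% threshold but still calls a 30%-red apple
# "red" (a color-prior-driven violation), handling the others correctly:
decisions = [(0.30, True), (0.60, True), (0.40, False)]
rate = rule_violations(0.50, decisions) / len(decisions)
```

The paper's ~60% figure for GPT-5-mini is this kind of violation rate computed over objects with strong color priors.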

Result: VLMs systematically violate their own introspective rules (GPT-5-mini does so in nearly 60% of cases on objects with strong color priors), while humans remain faithful. VLMs are excellent color-coverage estimators yet contradict their own reasoning in their final responses.

Conclusion: VLM reasoning failures are not difficulty-driven; VLM introspective self-knowledge is miscalibrated, with world-knowledge priors systematically degrading faithfulness in ways that don’t mirror human cognition, posing challenges for high-stakes deployment.

Abstract: Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.

[33] Team Fusion@SU @ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking

Georgi Grazhdanski, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva

Main category: cs.CL

TL;DR: Transformer-based approach for biomedical named entity recognition and entity linking using RoBERTa with BiLSTM-CRF layers and cross-lingual SapBERT for entity linking.

DetailsMotivation: To develop an effective system for biomedical text mining by addressing both named entity recognition (NER) and entity linking (EL) tasks in the SympTEMIST challenge, which focuses on symptom extraction from clinical texts.

Method: Fine-tuned RoBERTa-based token-level classifier with BiLSTM and CRF layers for NER on augmented training data. For entity linking, used cross-lingual SapBERT XLMR-Large to generate candidates and calculated cosine similarity against a knowledge base.
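The entity-linking step, ranking knowledge-base entries by cosine similarity to a mention embedding, can be sketched with toy vectors. In the paper the embeddings come from cross-lingual SapBERT XLMR-Large; here the vectors and symptom names are stand-ins.

```python
import numpy as np

def link_entity(mention_vec, kb_vecs, kb_ids, top_k=1):
    """Rank knowledge-base entries by cosine similarity to a mention."""
    m = mention_vec / np.linalg.norm(mention_vec)
    kb = kb_vecs / np.linalg.norm(kb_vecs, axis=1, keepdims=True)
    sims = kb @ m                       # cosine similarity per KB entry
    order = np.argsort(-sims)[:top_k]   # highest similarity first
    return [(kb_ids[i], float(sims[i])) for i in order]

kb_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
kb_ids = ["abdominal pain", "headache", "fever"]
best = link_entity(np.array([0.9, 0.1]), kb_vecs, kb_ids)
```

Since the similarity search is only as good as what it searches over, this picture is consistent with the paper's finding that the choice of knowledge base dominates linking accuracy.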

Result: The approach demonstrates that knowledge base selection has the highest impact on model accuracy for entity linking tasks in biomedical text mining.

Conclusion: Transformer-based models with appropriate knowledge bases are effective for biomedical NER and EL tasks, with knowledge base quality being a critical factor for entity linking performance.

Abstract: This paper presents a transformer-based approach to solving the SympTEMIST named entity recognition (NER) and entity linking (EL) tasks. For NER, we fine-tune a RoBERTa-based (1) token-level classifier with BiLSTM and CRF layers on an augmented train set. Entity linking is performed by generating candidates using the cross-lingual SapBERT XLMR-Large (2), and calculating cosine similarity against a knowledge base. The choice of knowledge base proves to have the highest impact on model accuracy.

[34] Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp

Main category: cs.CL

TL;DR: ATR4CH is a systematic 5-step LLM-based methodology for extracting structured knowledge graphs from cultural heritage texts, validated on authenticity assessment debates with strong performance metrics.

DetailsMotivation: Cultural heritage texts contain rich knowledge but are difficult to query systematically due to unstructured discourse. There's a need for automated methods to convert these texts into structured knowledge graphs for systematic querying and analysis.

Method: A 5-step methodology combining annotation models, ontological frameworks, and LLM-based extraction: 1) foundational analysis, 2) annotation schema development, 3) pipeline architecture, 4) integration refinement, and 5) comprehensive evaluation. Uses sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini) on Wikipedia articles about disputed cultural heritage items.

Result: Strong performance metrics: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment.

Conclusion: ATR4CH provides the first systematic methodology for coordinating LLM-based extraction with cultural heritage ontologies, offering a replicable framework adaptable across cultural heritage domains and institutional resources.

Abstract: Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts…), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

[35] Learning to Interrupt in Language-based Multi-agent Communication

Danqing Wang, Da Yin, Ruta Desai, Lei Li, Asli Celikyilmaz, Ansong Ni

Main category: cs.CL

TL;DR: HANDRAISER: An interruptible communication framework for LLM-based multi-agent systems that reduces communication costs by allowing listeners to interrupt speakers at optimal points.

DetailsMotivation: Current LLM-based multi-agent systems suffer from verbose communication that overloads context and increases computational costs. Existing compression approaches don't adapt well to different listeners, unlike human communication where listeners can interrupt to clarify or express opinions.

Method: Proposes an interruptible communication framework where listeners can interrupt speakers. Uses a learning method to predict optimal interruption points based on estimated future reward and cost, addressing LLMs’ tendency to interrupt prematurely.
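The decision rule that the learned predictor feeds can be sketched as a net-gain comparison: interrupt only when the expected reward of interrupting, minus its cost, beats continuing to listen. All names and the linear cost model below are assumptions, not HANDRAISER's actual formulation.

```python
def should_interrupt(est_reward_if_interrupt, est_reward_if_wait,
                     interrupt_cost, token_cost_per_turn):
    """Decide whether a listener should cut in on the current speaker."""
    net_interrupt = est_reward_if_interrupt - interrupt_cost
    net_wait = est_reward_if_wait - token_cost_per_turn
    return net_interrupt > net_wait

early = should_interrupt(0.4, 0.7, 0.1, 0.2)  # little info yet: wait
late = should_interrupt(0.8, 0.7, 0.1, 0.2)   # enough info: interrupt
```

The paper's prompting experiments suggest that raw LLMs effectively overestimate `est_reward_if_interrupt` early in a message, which is what the learned predictor corrects.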

Result: HANDRAISER reduces communication costs by 32.2% compared to baselines while maintaining or improving task performance across various scenarios (2-agent text pictionary, 3-agent meeting scheduling, 3-agent debate). The learned interruption behavior generalizes to different agents and tasks.

Conclusion: Interruptible communication is effective for reducing costs in LLM-based multi-agent systems. The proposed learning approach addresses LLM overconfidence in interruption timing, enabling more efficient agent interactions.

Abstract: Multi-agent systems using large language models (LLMs) have demonstrated impressive capabilities across various domains. However, current agent communication suffers from verbose output that overloads context and increases computational costs. Although existing approaches focus on compressing the message from the speaker side, they struggle to adapt to different listeners and identify relevant information. An effective way in human communication is to allow the listener to interrupt and express their opinion or ask for clarification. Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker. Through prompting experiments, we find that current LLMs are often overconfident and interrupt before receiving enough information. Therefore, we propose a learning method that predicts the appropriate interruption points based on the estimated future reward and cost. We evaluate our framework across various multi-agent scenarios, including 2-agent text pictionary games, 3-agent meeting scheduling, and 3-agent debate. The results of the experiments show that our HANDRAISER can reduce the communication cost by 32.2% compared to the baseline with comparable or superior task performance. This learned interruption behavior can also be generalized to different agents and tasks.

[36] Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection

Afroza Nowshin, Prithweeraj Acharjee Porag, Haziq Jeelani, Fayeq Jeelani Syed

Main category: cs.CL

TL;DR: A context-aware, steerable framework for dialectal Arabic machine translation that enables controllable generation across regional varieties using rule-based data augmentation and metadata-conditioned fine-tuning.

DetailsMotivation: Current Arabic MT systems struggle with dialectal diversity, often homogenizing dialectal inputs into Modern Standard Arabic and offering limited user control over target vernaculars.

Method: Rule-Based Data Augmentation (RBDA) pipeline expands 3,000-sentence seed corpus into 57,000-sentence parallel dataset covering eight regional varieties, then fine-tunes mT5-base model conditioned on lightweight metadata tags for controllable generation.
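Metadata-conditioned fine-tuning amounts to prepending lightweight control tags to each source sentence so the model learns to steer its output. The tag format below is an assumption for illustration; the paper does not specify its exact serialization.

```python
def build_input(source, region, register):
    """Prepend region/register tags to steer dialectal generation.

    Sketch of the metadata conditioning used when fine-tuning mT5-base;
    the `<region> <register>` tag layout is hypothetical.
    """
    return f"<{region}> <{register}> translate: {source}"

prompt = build_input("How are you?", "egyptian", "informal")
```

At inference time the user picks the tags, which is what makes the target vernacular controllable rather than defaulting to MSA.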

Result: High-resource baselines like NLLB achieve higher BLEU scores (13.75) but default to MSA, while the proposed model achieves lower BLEU (8.19) but produces outputs that better align with intended regional varieties, with improved dialectal alignment (4.80/5 vs. 1.0/5).

Conclusion: Standard MT metrics have limitations for dialect-sensitive tasks, highlighting the need for evaluation practices that better reflect linguistic diversity in Arabic MT.

Abstract: Current Machine Translation (MT) systems for Arabic often struggle to account for dialectal diversity, frequently homogenizing dialectal inputs into Modern Standard Arabic (MSA) and offering limited user control over the target vernacular. In this work, we propose a context-aware and steerable framework for dialectal Arabic MT that explicitly models regional and sociolinguistic variation. Our primary technical contribution is a Rule-Based Data Augmentation (RBDA) pipeline that expands a 3,000-sentence seed corpus into a balanced 57,000-sentence parallel dataset, covering eight regional varieties (e.g., Egyptian, Levantine, and Gulf). By fine-tuning an mT5-base model conditioned on lightweight metadata tags, our approach enables controllable generation across dialects and social registers in the translation output. Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by defaulting toward the MSA mean, while exhibiting limited dialectal specificity. In contrast, our model achieves lower BLEU scores (8.19) but produces outputs that align more closely with the intended regional varieties. Supporting qualitative evaluation, including an LLM-assisted cultural authenticity analysis, suggests improved dialectal alignment compared to baseline systems (4.80/5 vs. 1.0/5). These findings highlight the limitations of standard MT metrics for dialect-sensitive tasks and motivate the need for evaluation practices that better reflect linguistic diversity in Arabic MT.

[37] Multi-objective Evolutionary Merging Enables Efficient Reasoning Models

Mario Iacobelli, Adrian Robert Minut, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Iacopo Masi, Emanuele Rodolà

Main category: cs.CL

TL;DR: Evo-L2S is an evolutionary model merging framework that optimizes the trade-off between reasoning accuracy and output length in LLMs, reducing reasoning traces by over 50% while maintaining or improving accuracy.

DetailsMotivation: Current reasoning models have high computational overhead due to long chains of thought. Existing model merging approaches for Long-to-Short (L2S) reasoning use brittle, fixed-hyperparameter methods that force suboptimal compromises between accuracy and efficiency.

Method: Formulates L2S reasoning as multi-objective optimization using evolutionary model merging. Introduces entropy-based subset sampling to make fitness estimation tractable for large models. Produces Pareto front of merged models optimizing accuracy vs. output length trade-off.
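The Pareto front that Evo-L2S searches for can be sketched directly: a merged model stays on the front if no other candidate is at least as accurate and at most as verbose, with strict improvement on one axis. The candidate scores below are toy data, not results from the paper.

```python
def pareto_front(candidates):
    """Return candidates not dominated on (accuracy up, length down).

    Each candidate is an (accuracy, avg_output_length) pair.
    """
    front = []
    for acc, length in candidates:
        dominated = any(
            (a >= acc and l <= length) and (a > acc or l < length)
            for a, l in candidates
        )
        if not dominated:
            front.append((acc, length))
    return front

# Hypothetical merged models: (accuracy, mean reasoning-trace tokens)
merges = [(0.80, 4000), (0.78, 1800), (0.75, 2500), (0.81, 4100)]
front = pareto_front(merges)
```

Here (0.75, 2500) is dominated by (0.78, 1800), which is both more accurate and shorter; the evolutionary search's job is to populate this front with merge-weight configurations rather than enumerate a fixed set.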

Result: Comprehensive experiments across 1.5B, 7B, and 14B parameter models on six mathematical reasoning benchmarks show reduction of reasoning traces by over 50% while preserving or improving original model accuracy.

Conclusion: Evo-L2S provides an effective framework for optimizing reasoning efficiency in LLMs through evolutionary model merging, addressing the computational overhead problem while maintaining reasoning capabilities.

Abstract: Reasoning models have demonstrated remarkable capabilities in solving complex problems by leveraging long chains of thought. However, this more deliberate reasoning comes with substantial computational overhead at inference time. The Long-to-Short (L2S) reasoning problem seeks to maintain high accuracy using fewer tokens, but current training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are highly brittle and force suboptimal compromises. To address this gap, we introduce Evo-L2S, a novel framework that formulates L2S reasoning as a multi-objective optimization challenge. By leveraging evolutionary model merging, Evo-L2S explicitly optimizes the trade-off between accuracy and output length to produce a robust Pareto front of merged models. To make this search computationally tractable for large language models, we propose an entropy-based subset sampling technique that drastically reduces the overhead of fitness estimation. Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.

[38] DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling

Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena, Monica S. Lam

Main category: cs.CL

TL;DR: DataSTORM is an LLM-based agentic system for conducting deep research across both structured databases and internet sources, framing structured data research as a thesis-driven analytical process with iterative hypothesis generation and validation.

DetailsMotivation: Existing LLM agent research focuses primarily on unstructured web data, leaving challenges of deep research over large-scale structured databases underexplored. Data-centric research requires more than retrieval and summarization - it needs iterative hypothesis generation, quantitative reasoning over structured schemas, and coherent analytical narrative development.

Method: DataSTORM reframes deep research over structured data as a thesis-driven analytical process grounded in Exploratory Data Analysis and Data Storytelling principles. It discovers candidate theses from data, validates them through iterative cross-source investigation, and develops them into coherent analytical narratives across both structured databases and internet sources.
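The discover-validate-narrate cycle can be sketched as a skeleton loop. The callables below are deterministic stubs standing in for LLM agents issuing queries over databases and the web; none of this is DataSTORM's actual interface.

```python
def deep_research(discover, validate, narrate, max_theses=3):
    """Thesis-driven analysis loop after DataSTORM's description.

    `discover` proposes candidate theses from the data, `validate`
    cross-checks each against structured and web sources, and
    `narrate` turns the surviving theses into a report.
    """
    theses = discover()[:max_theses]
    supported = [t for t in theses if validate(t)]
    return narrate(supported)

report = deep_research(
    discover=lambda: ["conflict events rose in region A", "no trend in B"],
    validate=lambda t: "rose" in t,   # stub for cross-source checking
    narrate=lambda ts: " / ".join(ts),
)
```

The key design point is that validation gates narration: only theses that survive cross-source investigation reach the analytical narrative.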

Result: DataSTORM achieves state-of-the-art results on InsightBench with 19.4% relative improvement in insight-level recall and 7.2% improvement in summary-level score. On a new ACLED dataset, it outperforms proprietary systems like ChatGPT Deep Research across both automated metrics and human evaluations.

Conclusion: DataSTORM demonstrates that LLM-based agents can effectively conduct deep research across both structured databases and internet sources through a thesis-driven analytical approach, advancing the capabilities of AI systems for data-centric research.

Abstract: Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.

[39] ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka, Steffen Eger

Main category: cs.CL

TL;DR: ValueGround benchmark evaluates MLLMs’ ability to ground culture-conditioned value judgments in visual scenes, showing performance drop when response options are visualized vs. text-only.

DetailsMotivation: Existing evaluations of cultural values in language models are text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. Cultural values are expressed through both language and visual scenes/social practices.

Method: Built ValueGround benchmark from World Values Survey questions using minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Models are given a country, question, and image pair, and must choose the image matching the country’s value tendency without access to original response-option texts.

Result: Across six MLLMs and 13 countries, average accuracy dropped from 72.8% in text-only setting to 65.8% when options were visualized, despite 92.8% accuracy on option-image alignment. Stronger models were more robust but all remained prone to prediction reversals.

Conclusion: ValueGround provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments in MLLMs, revealing challenges in visual grounding of cultural values.

Abstract: Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey (WVS) questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country’s value tendency without access to the original response-option texts. Across six MLLMs and 13 countries, average accuracy drops from 72.8% in the text-only setting to 65.8% when options are visualized, despite 92.8% accuracy on option-image alignment. Stronger models are more robust, but all remain prone to prediction reversals. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.

[40] Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu, Severin Baroudi, Shashi Kumar, Hasindri Watawana, Manjunath K E, Kadri Hacioglu, Petr Motlicek, Andreas Stolcke

Main category: cs.CL

TL;DR: Small amounts of speech data (even just 10% of target domain) combined with text-only adaptation can achieve ASR performance comparable to full dataset fine-tuning by bridging the modality gap in LLM-based ASR systems.

DetailsMotivation: LLM-based ASR systems can adapt with text-only data but suffer from a modality gap where the LLM isn't exposed to noisy representations from the speech projector. The paper investigates whether small amounts of speech data can mitigate this mismatch.

Method: Three adaptation strategies are compared: 1) text-only adaptation, 2) paired speech-text adaptation, and 3) mixed batching (MB) which combines both. Experiments are conducted in in-domain and out-of-domain settings.
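Mixed batching can be sketched as interleaving text-only and paired speech-text batches according to a speech fraction. The batch contents and scheduling scheme below are placeholders; the paper does not specify the exact interleaving.

```python
import random

def mixed_batches(text_batches, paired_batches, speech_fraction=0.1, seed=0):
    """Interleave text-only and paired speech-text training batches.

    `speech_fraction` controls how often a paired batch is drawn,
    matching the finding that ~10% target-domain speech is enough
    to keep the modalities aligned.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(len(text_batches) + len(paired_batches)):
        use_speech = bool(paired_batches) and rng.random() < speech_fraction
        pool = paired_batches if use_speech else text_batches
        schedule.append(rng.choice(pool))
    return schedule

sched = mixed_batches([("txt", i) for i in range(90)],
                      [("spk", i) for i in range(10)])
speech_share = sum(b[0] == "spk" for b in sched) / len(sched)
```

The occasional paired batch is what exposes the LLM to the speech projector's noisy representations, closing the modality gap that pure text-only adaptation leaves open.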

Result: Even limited speech consistently improves performance. Mixed batching using only 10% of target-domain speech (less than 4 hours) achieves word error rates comparable to or better than conventional ASR fine-tuning with the full dataset.

Conclusion: Small amounts of speech provide a strong modality-alignment signal that bridges the gap between speech representations and language model understanding, making LLM-based ASR adaptation more efficient and effective.

Abstract: Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain speech (less than 4 hours) achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.

[41] MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long, Jiahui Cai, Yan Luo, Mengyu Wang

Main category: cs.CL

TL;DR: MedConclusion is a 5.7M dataset of PubMed structured abstracts for biomedical conclusion generation, enabling study of evidence-to-conclusion reasoning in LLMs.

DetailsMotivation: Limited resources exist for testing whether LLMs can infer scientific conclusions from structured biomedical evidence, despite LLMs being widely explored for reasoning-intensive research tasks.

Method: Created a large-scale dataset of 5.7M PubMed structured abstracts where each instance pairs non-conclusion sections with original author-written conclusions. Includes journal-level metadata for subgroup analysis. Evaluated diverse LLMs under conclusion and summary prompting settings using reference-based metrics and LLM-as-a-judge.
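Constructing an instance, pairing the non-conclusion sections with the author-written conclusion, can be sketched as a simple split over a structured abstract. The section names and dict layout are illustrative; PubMed section labels vary across journals.

```python
def make_instance(structured_abstract, conclusion_key="CONCLUSIONS"):
    """Split a structured abstract into (evidence, target conclusion)."""
    conclusion = structured_abstract[conclusion_key]
    evidence = "\n".join(
        f"{section}: {text}"
        for section, text in structured_abstract.items()
        if section != conclusion_key
    )
    return evidence, conclusion

abstract = {
    "BACKGROUND": "Drug X was tested against placebo.",
    "METHODS": "Randomized trial, n=200.",
    "RESULTS": "Symptoms improved by 30% vs 5%.",
    "CONCLUSIONS": "Drug X is effective.",
}
evidence, target = make_instance(abstract)
```

The evidence string becomes the model input and the held-out conclusion the reference, which is what makes the 5.7M abstracts naturally occurring supervision for evidence-to-conclusion reasoning.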

Result: Conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores.

Conclusion: MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning in biomedical domains.

Abstract: Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce $\textbf{MedConclusion}$, a large-scale dataset of $\textbf{5.7M}$ PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.

[42] Fine-tuning Whisper for Pashto ASR: strategies and scale

Hanif Rahman

Main category: cs.CL

TL;DR: Fine-tuning Whisper models for Pashto ASR achieves 21-24% WER, with whisper-small being optimal for 113 hours of data, revealing linguistic error patterns in Pashto-specific phonology.

DetailsMotivation: Pashto is absent from Whisper's pre-training despite being one of CommonVoice's largest language collections, making off-the-shelf models unusable (outputting wrong scripts with WER >100%). Need to develop effective fine-tuning strategies for Pashto ASR.

Method: Compared four fine-tuning strategies on whisper-base: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. Extended vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on larger dataset (CV24, 113 hours). Used online augmentation and analyzed error patterns.
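The WER numbers throughout this entry are word-level Levenshtein distances normalized by reference length. A minimal implementation (with English words standing in for Pashto):

```python
def wer(reference, hypothesis):
    """Word error rate via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

score = wer("the cat sat on the mat", "the cat sit on mat")  # 2 edits / 6 words
```

This also explains how an off-the-shelf model can exceed 100% WER: insertions in the wrong script can push the edit count past the reference length.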

Result: Vanilla fine-tuning achieved WER 21.22% on CV20, outperforming other methods significantly. On CV24, whisper-small achieved 24.89% WER, whisper-large-v3-turbo achieved 23.37%. Diminishing returns indicate whisper-small is practical optimum. Online augmentation provided 7.25 pp WER benefit. Error analysis identified word-final suffix confusion and retroflex substitutions as dominant failure modes.

Conclusion: Effective Pashto ASR requires fine-tuning Whisper models, with whisper-small being optimal for 113 hours of data. Frozen-encoder fine-tuning degrades performance at shallow depths, and transfer learning from Urdu fails due to phonological mismatches. Linguistic error patterns reveal Pashto-specific challenges.

Abstract: Pashto is absent from Whisper’s pre-training corpus despite being one of CommonVoice’s largest language collections, leaving off-the-shelf models unusable: all Whisper sizes output Arabic, Dari, or Urdu script on Pashto audio, achieving word error rates above 100%. We compare four fine-tuning strategies for whisper-base on CommonVoice Pashto v20: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. We extend vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on CommonVoice Pashto v24 (113 hours). Vanilla fine-tuning achieves WER 21.22% on CV20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. Frozen-encoder fine-tuning degrades performance on whisper-base (6 encoder layers): layer-function separation does not hold at this depth, and freezing removes a third of trainable capacity. Urdu-to-Pashto transfer fails due to an unverified intermediate checkpoint, phonological mismatch, and insufficient training. On CV24, whisper-small achieves WER 24.89% (2.24 pp over whisper-base at 3.3x parameters); whisper-large-v3-turbo achieves 23.37% (a further 1.52 pp). Diminishing returns indicate whisper-small is the practical optimum at 113 hours. Online augmentation provides 7.25 pp WER benefit over matched training. Error analysis identifies word-final suffix confusion (masculine -ay vs. feminine -a) and retroflex substitutions involving the Pashto-unique consonant /ts/ as dominant failure modes. Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.

[43] Does a Global Perspective Help Prune Sparse MoEs Elegantly?

Zeliang Zhang, Nikhil Ghosh, Jiani Liu, Bin Yu, Xiaodong Liu

Main category: cs.CL

TL;DR: GRAPE is a global pruning strategy for sparse Mixture-of-Experts models that dynamically allocates pruning budgets based on cross-layer redundancy, improving efficiency while maintaining performance better than uniform pruning methods.

DetailsMotivation: Sparse Mixture-of-Experts models improve efficiency by activating only subsets of experts, but still have high memory costs due to many expert parameters. Existing pruning methods use uniform budgets across layers, ignoring heterogeneous redundancy patterns in sparse MoEs.

Method: GRAPE (Global Redundancy-Aware Pruning of Experts) uses a global pruning strategy that dynamically allocates pruning budgets across layers based on cross-layer redundancy analysis, rather than uniform allocation.

Result: GRAPE consistently achieves best average performance across Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS models. It improves average accuracy over strongest local baseline by 1.40% on average, with gains up to 2.45%.

Conclusion: Global redundancy-aware pruning is more effective than uniform pruning for sparse MoE models, demonstrating that cross-layer redundancy analysis enables better parameter efficiency while maintaining model performance.

Abstract: Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average performance. On the three main models reported in the paper, it improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45%.
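The core idea of global (rather than uniform) budget allocation can be sketched in a few lines. The abstract does not specify GRAPE's redundancy measure, so the sketch below simply assumes a per-layer redundancy score is given and distributes a fixed global expert-pruning budget in proportion to it:

```python
def allocate_pruning_budget(redundancy, total_prune):
    """Distribute a global expert-pruning budget across layers in proportion
    to per-layer redundancy scores (illustrative; the paper's actual
    redundancy measure is not specified in the abstract)."""
    total = sum(redundancy)
    raw = [total_prune * r / total for r in redundancy]
    budget = [int(x) for x in raw]  # floor each layer's share
    # Hand out leftover slots to layers with the largest rounding remainder.
    leftover = total_prune - sum(budget)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budget[i], reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget

# Uniform pruning would remove 2 experts from each of 4 layers; a
# redundancy-aware allocation spends the same budget where redundancy is high.
print(allocate_pruning_budget([0.9, 0.1, 0.5, 0.5], total_prune=8))
```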

[44] The Illusion of Stochasticity in LLMs

Xiangming Gu, Soham De, Michalis Titsias, Larisa Markeeva, Petar Veličković, Razvan Pascanu

Main category: cs.CL

TL;DR: LLMs fail at reliable stochastic sampling needed for agentic systems, despite being able to convert random seeds to distributions.

DetailsMotivation: Agentic LLM systems require stochastic sampling from distributions inferred from data, but current LLMs fail to map internal probability estimates to stochastic outputs, creating a fundamental reliability gap.

Method: Rigorous empirical analysis across multiple model families, sizes, prompting styles, and distributions to demonstrate the sampling failure, comparing ability to convert random seeds vs. direct sampling.

Result: Frontier models can convert provided random seeds to target distributions but fundamentally fail at direct sampling from specific distributions, revealing a critical reliability issue.

Conclusion: Reliable stochastic sampling is an unfulfilled requirement for LLM agents, highlighting a distinct failure point that needs addressing for robust agentic systems.

Abstract: In this work, we demonstrate that reliable stochastic sampling is a fundamental yet unfulfilled requirement for Large Language Models (LLMs) operating as agents. Agentic systems are frequently required to sample from distributions, often inferred from observed data, a process which needs to be emulated by the LLM. This leads to a distinct failure point: while standard RL agents rely on external sampling mechanisms, LLMs fail to map their internal probability estimates to their stochastic outputs. Through rigorous empirical analysis across multiple model families, model sizes, prompting styles, and distributions, we demonstrate the extent of this failure. Crucially, we show that while powerful frontier models can convert provided random seeds to target distributions, their ability to sample directly from specific distributions is fundamentally flawed.
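The asymmetry the paper reports can be made concrete: mapping a provided uniform random seed to a target categorical distribution is a deterministic inverse-CDF lookup, which requires no internal source of randomness; sampling directly does. A minimal sketch of the deterministic half (illustrative, not the paper's protocol):

```python
def seed_to_sample(u: float, probs: dict) -> str:
    """Inverse-CDF lookup: deterministically map a uniform seed u in [0, 1)
    to an outcome of the categorical distribution `probs`."""
    cumulative = 0.0
    for outcome, p in probs.items():
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome  # guard against floating-point shortfall at u close to 1

dist = {"heads": 0.3, "tails": 0.7}
print(seed_to_sample(0.12, dist))  # falls in the first 30% of mass -> "heads"
print(seed_to_sample(0.85, dist))  # falls in the remaining 70% -> "tails"
```

Given the seed, the computation is pure arithmetic; the paper's finding is that without such an external seed, models cannot reliably emulate the draw themselves.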

[45] CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram

Chang Liu, Changsheng Ma, Yongfeng Tao, Bin Hu, Minqiang Yang

Main category: cs.CL

TL;DR: CCD-CBT is a multi-agent framework for simulating Cognitive Behavioral Therapy that uses dynamic cognitive profiles and information-asymmetric interactions between therapist and client agents, generating synthetic CBT data that improves counseling fidelity.

DetailsMotivation: Existing LLM-based mental health support systems rely on static cognitive profiles and omniscient single-agent simulations, which fail to capture the dynamic, information-asymmetric nature of real therapy sessions.

Method: Introduces CCD-CBT framework with two key shifts: 1) dynamic Cognitive Conceptualization Diagram (CCD) updated by a Control Agent, and 2) information-asymmetric interaction where Therapist Agent reasons from inferred client states. Generated CCDCHAT dataset using this framework.

Result: Models fine-tuned on CCDCHAT outperform strong baselines in both counseling fidelity and positive-affect enhancement. Ablations confirm necessity of dynamic CCD guidance and asymmetric agent design.

Conclusion: Offers a new paradigm for building theory-grounded, clinically-plausible conversational agents for mental health support.

Abstract: Large language models show potential for scalable mental-health support by simulating Cognitive Behavioral Therapy (CBT) counselors. However, existing methods often rely on static cognitive profiles and omniscient single-agent simulation, failing to capture the dynamic, information-asymmetric nature of real therapy. We introduce CCD-CBT, a multi-agent framework that shifts CBT simulation along two axes: 1) from a static to a dynamically reconstructed Cognitive Conceptualization Diagram (CCD), updated by a dedicated Control Agent, and 2) from omniscient to information-asymmetric interaction, where the Therapist Agent must reason from inferred client states. We release CCDCHAT, a synthetic multi-turn CBT dataset generated under this framework. Evaluations with clinical scales and expert therapists show that models fine-tuned on CCDCHAT outperform strong baselines in both counseling fidelity and positive-affect enhancement, with ablations confirming the necessity of dynamic CCD guidance and asymmetric agent design. Our work offers a new paradigm for building theory-grounded, clinically-plausible conversational agents.

[46] To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs

Zohaib Khan, Mustafa Dogan, Ifeoma Okoh, Pouya Sadeghi, Siddhartha Shrestha, Sergius Justus Nyah, Mahmoud O. Mokhiamar, Michael J. Ryan, Tarek Naous

Main category: cs.CL

TL;DR: Study examines how LLMs generate misinformation across languages and countries, finding systematic biases where misinformation generation is higher for lower-resource languages and countries with lower Human Development Index.

DetailsMotivation: The rise of misinformation combined with LLMs' strong writing capabilities lowers barriers for malicious actors to produce false information across languages and regions, creating a need to understand and mitigate this global threat.

Method: Created GlobalLies dataset with 440 misinformation generation prompt templates and 6,867 entities across 8 languages and 195 countries. Used human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models.

Result: Misinformation generation varies systematically by country, with higher propagation in lower-resource languages and countries with lower HDI. Existing mitigation strategies show uneven protection: input safety classifiers have cross-lingual gaps, and retrieval-augmented fact-checking is inconsistent due to unequal information availability.

Conclusion: LLMs exhibit systematic biases in misinformation generation across languages and regions, highlighting the need for more equitable mitigation strategies. The GlobalLies dataset is released to support research on reducing global misinformation spread.

Abstract: Misinformation is on the rise, and the strong writing capabilities of LLMs lower the barrier for malicious actors to produce and disseminate false information. We study how LLMs behave when prompted to spread misinformation across languages and target countries, and introduce GlobalLies, a multilingual parallel dataset of 440 misinformation generation prompt templates and 6,867 entities, spanning 8 languages and 195 countries. Using both human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models, we show that misinformation generation varies systematically based on the country being discussed. Propagation of lies by LLMs is substantially higher in many lower-resource languages and for countries with a lower Human Development Index (HDI). We find that existing mitigation strategies provide uneven protection: input safety classifiers exhibit cross-lingual gaps, and retrieval-augmented fact-checking remains inconsistent across regions due to unequal information availability. We release GlobalLies for research purposes, aiming to support the development of mitigation strategies to reduce the spread of global misinformation: https://github.com/zohaib-khan5040/globallies

[47] LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Joshua Castillo, Ravi Mukkamala

Main category: cs.CL

TL;DR: AI pipeline for parsing heterogeneous missing-person documents into unified schema-compliant format using OCR, rule-based parsing, and LLM-assisted extraction with validation.

DetailsMotivation: Missing-person investigations use diverse document formats (forms, posters, web profiles) with layout/terminology variations that hinder rapid triage, analysis, and search planning.

Method: Multi-engine PDF extraction with OCR fallback, rule-based source identification, schema-first harmonization, and optional LLM-assisted extraction with validator-guided repair and geocoding services.

Result: LLM-assisted pathway achieved much higher extraction quality (F1=0.8664 vs 0.2578) and better key-field completeness (96.97% vs 93.23%), though deterministic pathway was faster (0.03s vs 3.95s per record).

Conclusion: Controlled use of probabilistic AI within schema-first, auditable pipelines is effective for high-stakes investigative settings despite speed trade-offs.

Abstract: Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.
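The validator-guided repair loop described above can be sketched as follows. The schema fields, prompts, and `demo_llm` stub are invented placeholders standing in for the paper's actual schema and LLM pathway:

```python
# Hypothetical minimal schema; the real Guardian Parser Pack schema is richer.
REQUIRED_FIELDS = {"name": str, "last_seen_date": str, "location": str}

def validate(record: dict) -> list:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}")
    return errors

def extract_with_repair(document: str, call_llm, max_repairs: int = 2) -> dict:
    """Ask the LLM for a schema-compliant record; on validation failure,
    re-prompt with the violation list up to `max_repairs` times."""
    prompt = f"Extract fields {list(REQUIRED_FIELDS)} from:\n{document}"
    record = call_llm(prompt)
    for _ in range(max_repairs):
        errors = validate(record)
        if not errors:
            break
        record = call_llm(prompt + f"\nFix these violations: {errors}")
    return record

def demo_llm(prompt, _state={"calls": 0}):
    """Stub standing in for a real LLM call so the loop is runnable."""
    _state["calls"] += 1
    if _state["calls"] == 1:
        return {"name": "J. Doe"}  # incomplete -> validator triggers a repair
    return {"name": "J. Doe", "last_seen_date": "2024-01-01", "location": "Norfolk, VA"}

print(extract_with_repair("missing-person poster text ...", demo_llm))
```

In the paper's evaluated run all LLM outputs passed initial validation, so this loop acted as a safeguard rather than a source of gains.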

[48] Scoring Edit Impact in Grammatical Error Correction via Embedded Association Graphs

Qiyuan Xiao, Xiaoman Wang, Yunshi Lan

Main category: cs.CL

TL;DR: Proposes a new task called Scoring Edit Impact in Grammatical Error Correction (GEC) to automatically estimate the importance of edits, using an embedded association graph to capture edit dependencies and perplexity-based scoring for fluency assessment.

DetailsMotivation: Current GEC evaluation methods don't fully accommodate diverse application scenarios or multiple valid corrections. Human meta-evaluation approaches are difficult to scale to large datasets, creating a need for automated edit importance scoring.

Method: Introduces a scoring framework based on an embedded association graph that captures latent dependencies among edits and clusters syntactically related edits into coherent groups. Uses perplexity-based scoring to estimate each edit’s contribution to sentence fluency.

Result: Experiments across 4 GEC datasets, 4 languages, and 4 GEC systems show the method consistently outperforms a range of baselines. The embedded association graph effectively captures cross-linguistic structural dependencies among edits.

Conclusion: The proposed Scoring Edit Impact task and embedded association graph framework provide an effective automated approach for evaluating edit importance in GEC systems, addressing scalability limitations of human evaluation methods.

Abstract: A Grammatical Error Correction (GEC) system produces a sequence of edits to correct an erroneous sentence. The quality of these edits is typically evaluated against human annotations. However, a sentence may admit multiple valid corrections, and existing evaluation settings do not fully accommodate diverse application scenarios. Recent meta-evaluation approaches rely on human judgments across multiple references, but they are difficult to scale to large datasets. In this paper, we propose a new task, Scoring Edit Impact in GEC, which aims to automatically estimate the importance of edits produced by a GEC system. To address this task, we introduce a scoring framework based on an embedded association graph. The graph captures latent dependencies among edits and syntactically related edits, grouping them into coherent groups. We then perform perplexity-based scoring to estimate each edit’s contribution to sentence fluency. Experiments across 4 GEC datasets, 4 languages, and 4 GEC systems demonstrate that our method consistently outperforms a range of baselines. Further analysis shows that the embedded association graph effectively captures cross-linguistic structural dependencies among edits.
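The perplexity-based scoring step can be illustrated with a deliberately tiny stand-in language model. The paper scores fluency with a real LM; the add-one-smoothed unigram model below only mimics the idea that an edit's impact is the perplexity drop it produces:

```python
import math
from collections import Counter

def unigram_perplexity(sentence: str, counts: Counter, total: int) -> float:
    """Toy unigram-LM perplexity with add-one smoothing (a stand-in for the
    LM perplexity the paper uses)."""
    words = sentence.split()
    logp = 0.0
    for w in words:
        p = (counts[w] + 1) / (total + len(counts))
        logp += math.log(p)
    return math.exp(-logp / len(words))

def edit_impact(before: str, after: str, counts: Counter, total: int) -> float:
    """Positive when applying the edit lowers perplexity, i.e. improves fluency."""
    return unigram_perplexity(before, counts, total) - unigram_perplexity(after, counts, total)

corpus = Counter("he goes home she goes to work".split())
impact = edit_impact("he go home", "he goes home", corpus, sum(corpus.values()))
print(impact)  # positive: the correction improves toy-LM fluency
```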

[49] Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs

Maotian Ma, Zheni Zeng, Zhenghao Liu, Yukun Yan

Main category: cs.CL

TL;DR: SciDC is an LLM generation method that integrates subject-specific knowledge with strong constraints to reduce hallucinations in scientific domains by converting flexible knowledge into standardized rules.

DetailsMotivation: LLMs suffer from severe hallucination despite strong knowledge reserves, and fail to sufficiently utilize highly-condensed scientific theories and rules that could efficiently direct human manipulators' behaviors.

Method: Uses strong LLMs to automatically convert flexible knowledge into multi-layered, standardized rules, creating an extensible framework to constrain model generation on domain-specific scientific tasks.

Result: Achieves 12% accuracy improvement on average compared to vanilla generation across scientific tasks including industrial formulation design, clinical tumor diagnosis, and retrosynthesis planning.

Conclusion: Demonstrates effectiveness of integrating domain knowledge with constraints, and discusses potential for LLMs to automatically summarize condensed knowledge to accelerate scientific research.

Abstract: Large language models (LLMs) have shown strong knowledge reserves and task-solving capabilities, but still face the challenge of severe hallucination, hindering their practical application. Though scientific theories and rules can efficiently direct the behaviors of human practitioners, LLMs still do not sufficiently utilize this highly-condensed knowledge through training or prompting. To address this issue, we propose SciDC, an LLM generation method that integrates subject-specific knowledge with strong constraints. By adopting strong LLMs to automatically convert flexible knowledge into multi-layered, standardized rules, we build an extensible framework to effectively constrain model generation on domain tasks. Experiments on scientific tasks, including industrial formulation design, clinical tumor diagnosis, and retrosynthesis planning, consistently demonstrate the effectiveness of our method, achieving a 12% accuracy improvement on average compared with vanilla generation. We further discuss the potential of LLMs to automatically induce and summarize highly-condensed knowledge, looking ahead to practical solutions for accelerating the overall scientific research process. All the code of this paper is available at https://github.com/Maotian-Ma/SciDC.
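A rule layer that constrains generation can be sketched as a set of named predicates a candidate must satisfy before it is accepted. The formulation rules below are invented illustrations, not SciDC's actual rule set:

```python
# Hypothetical standardized rules for a formulation-design task.
RULES = [
    ("fractions sum to 100%", lambda f: abs(sum(f["components"].values()) - 100.0) < 1e-6),
    ("no component exceeds 60%", lambda f: max(f["components"].values()) <= 60.0),
]

def check_formulation(candidate: dict) -> list:
    """Return the names of violated rules (an empty list means rule-compliant)."""
    return [name for name, rule in RULES if not rule(candidate)]

good = {"components": {"solvent": 55.0, "binder": 30.0, "additive": 15.0}}
bad = {"components": {"solvent": 70.0, "binder": 30.0}}
print(check_formulation(good))  # []
print(check_formulation(bad))   # the 60% cap is violated
```

A generation loop can reject or regenerate any candidate whose violation list is non-empty, which is the sense in which the rules act as decoding constraints.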

[50] The Detection–Extraction Gap: Models Know the Answer Before They Can Say It

Hanyang Wang, Mingxuan Zhu

Main category: cs.CL

TL;DR: Paper proposes Black-box Adaptive Early Exit (BAEE) to detect when answer is determined in chain-of-thought reasoning and stop generation early, reducing tokens by 70-78% while improving accuracy by 1-5 percentage points.

DetailsMotivation: Current reasoning models generate excessive tokens after the answer is already determined in chain-of-thought reasoning, wasting computational resources and potentially introducing errors through post-commitment overwriting.

Method: BAEE uses free continuations (unconstrained generation) to detect when answer is recoverable from partial prefixes, then extracts the answer using the same free continuations rather than forced extraction, allowing early truncation of generation.

Result: Reduces serial generation by 70-78% while improving accuracy by 1-5 percentage points across all tested models. For thinking-mode models, prevents post-commitment overwriting with gains up to 5.8 percentage points.

Conclusion: Demonstrates detection-extraction gap in reasoning models and shows that adaptive early exit based on free continuations can significantly reduce computational cost while improving accuracy by avoiding post-commitment generation.

Abstract: Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix. This post-commitment generation reveals a structural phenomenon: the detection–extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt-conditioned decoding fails to extract it. We formalize this mismatch via a total-variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix-induced shift. Exploiting this asymmetry, we propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70–78% of serial generation while improving accuracy by 1–5 pp across all models. For thinking-mode models, early exit prevents post-commitment overwriting, yielding gains of up to 5.8 pp; a cost-optimized variant achieves 68–73% reduction at a median of 9 API calls. Code is available at https://github.com/EdWangLoDaSc/know2say.
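The early-exit control flow can be sketched as a loop over prefix checkpoints that stops once a few free continuations agree on an answer. `generate_free` is a hypothetical black-box call (one free continuation of the given prefix fraction, with the answer extracted from it); the paper's detection bound and cost-optimized variant are more involved:

```python
from collections import Counter

def adaptive_early_exit(generate_free, prefix_fractions, n_votes=3, agree=1.0):
    """BAEE-style sketch: at each prefix checkpoint, sample n_votes free
    continuations and exit as soon as a fraction `agree` of them concur."""
    for frac in prefix_fractions:
        votes = Counter(generate_free(frac) for _ in range(n_votes))
        answer, count = votes.most_common(1)[0]
        if count / n_votes >= agree:
            return answer, frac  # answer already recoverable; truncate here
    return answer, 1.0  # no agreement: fall back to the full trace

def demo(frac, _n=[0]):
    """Stub model: continuations disagree early, then stabilize on '42'."""
    _n[0] += 1
    return "42" if frac >= 0.3 else str(_n[0])

answer, frac = adaptive_early_exit(demo, [0.1, 0.3, 0.5, 1.0])
print(answer, frac)  # exits at the 30% checkpoint
```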

[51] DiffuMask: Diffusion Language Model for Token-level Prompt Pruning

Caleb Zheng, Jyotika Singh, Fang Tu, Weiyi Sun, Sujeeth Bharadwaj, Yassine Benajiba, Sujith Ravi, Eli Shlizerman, Dan Roth

Main category: cs.CL

TL;DR: DiffuMask: A diffusion-based framework for parallel prompt compression that accelerates pruning via hierarchical shot-level and token-level masking, achieving up to 80% length reduction while maintaining reasoning accuracy.

DetailsMotivation: In-context learning and Chain-of-Thought prompting improve LLM reasoning but create longer, more expensive prompts with redundant information. Existing compression methods use sequential token removal which is computationally intensive.

Method: Diffusion-based framework integrating hierarchical shot-level and token-level pruning signals for rapid parallel prompt pruning via iterative mask prediction. Enables masking multiple tokens per denoising step with tunable control over retained content.

Result: Achieves up to 80% prompt length reduction while maintaining or improving accuracy across in-domain, out-of-domain, and cross-model settings. Substantially accelerates compression process compared to sequential methods.

Conclusion: DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.

Abstract: In-Context Learning and Chain-of-Thought prompting improve reasoning in large language models (LLMs). These typically come at the cost of longer, more expensive prompts that may contain redundant information. Prompt compression based on pruning offers a practical solution, yet existing methods rely on sequential token removal which is computationally intensive. We present DiffuMask, a diffusion-based framework integrating hierarchical shot-level and token-level pruning signals, that enables rapid and parallel prompt pruning via iterative mask prediction. DiffuMask substantially accelerates the compression process via masking multiple tokens in each denoising step. It offers tunable control over retained content, preserving essential reasoning context and achieving up to 80% prompt length reduction. Meanwhile, it maintains or improves accuracy across in-domain, out-of-domain, and cross-model settings. Our results show that DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.
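The speedup over sequential token removal comes from masking several tokens per denoising step. In the sketch below, fixed importance scores stand in for the model's iterative mask predictions, so the numbers of steps and survivors are illustrative only:

```python
def parallel_prune(tokens, importance, keep_ratio=0.2, tokens_per_step=4):
    """Diffusion-style parallel pruning sketch: each 'denoising' step masks
    several of the currently least-important tokens, rather than one at a
    time, until the target compressed length is reached."""
    target = max(1, int(len(tokens) * keep_ratio))
    kept = list(range(len(tokens)))
    steps = 0
    while len(kept) > target:
        k = min(tokens_per_step, len(kept) - target)
        # Mask the k lowest-importance surviving tokens in parallel.
        drop = set(sorted(kept, key=lambda i: importance[i])[:k])
        kept = [i for i in kept if i not in drop]
        steps += 1
    return [tokens[i] for i in kept], steps

tokens = list("abcdefghij")
importance = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
pruned, steps = parallel_prune(tokens, importance, keep_ratio=0.2, tokens_per_step=4)
print(pruned, steps)  # 80% of tokens removed in 2 parallel steps, not 8 serial ones
```

`keep_ratio=0.2` mirrors the abstract's "up to 80% prompt length reduction"; the tunable step size is what makes the retained-content/speed trade-off controllable.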

[52] Feedback Adaptation for Retrieval-Augmented Generation

Jihwan Bang, Seunghan Yang, Kyuhong Shim, Simyung Chang, Juntae Lee, Sungha Choi

Main category: cs.CL

TL;DR: Paper introduces feedback adaptation as a new evaluation dimension for RAG systems, proposing metrics for correction lag and post-feedback performance, and presents PatchRAG for immediate feedback incorporation without retraining.

DetailsMotivation: Current RAG system evaluations focus on static accuracy but fail to capture how systems adapt to user/expert feedback in real deployment scenarios, overlooking the important dimension of feedback adaptation in interactive settings.

Method: Proposes two evaluation metrics: correction lag (delay between feedback and behavioral change) and post-feedback performance (reliability on related queries after feedback). Introduces PatchRAG, an inference-time approach that incorporates feedback without retraining.

Result: Training-based approaches show trade-off between delayed correction and reliable adaptation. PatchRAG demonstrates immediate correction and strong post-feedback generalization under the proposed evaluation framework.

Conclusion: Feedback adaptation is a previously overlooked dimension of RAG system behavior in interactive settings, and the proposed metrics and PatchRAG approach provide ways to measure and improve this aspect.

Abstract: Retrieval-Augmented Generation (RAG) systems are typically evaluated under static assumptions, despite being frequently corrected through user or expert feedback in deployment. Existing evaluation protocols focus on overall accuracy and fail to capture how systems adapt after feedback is introduced. We introduce feedback adaptation as a problem setting for RAG systems, which asks how effectively and how quickly corrective feedback propagates to future queries. To make this behavior measurable, we propose two evaluation axes: correction lag, which captures the delay between feedback provision and behavioral change, and post-feedback performance, which measures reliability on semantically related queries after feedback. Using these metrics, we show that training-based approaches exhibit a trade-off between delayed correction and reliable adaptation. We further propose PatchRAG, a minimal inference-time instantiation that incorporates feedback without retraining, demonstrating immediate correction and strong post-feedback generalization under the proposed evaluation. Our results highlight feedback adaptation as a previously overlooked dimension of RAG system behavior in interactive settings.
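The two proposed axes are easy to operationalize on a post-feedback query log. The sketch below is one plausible formalization (the paper's exact definitions may differ): correction lag counts related queries answered incorrectly before the first correct one, and post-feedback performance is plain accuracy over the window:

```python
def correction_lag(outcomes):
    """Index of the first correct answer after feedback; `outcomes` is the
    post-feedback sequence of True/False correctness flags on related queries."""
    for i, ok in enumerate(outcomes):
        if ok:
            return i
    return len(outcomes)  # never corrected within the observed window

def post_feedback_performance(outcomes):
    """Accuracy on semantically related queries after feedback."""
    return sum(outcomes) / len(outcomes)

# Feedback given at t=0; the system adapts only from the third related query on.
outcomes = [False, False, True, True, True]
print(correction_lag(outcomes), post_feedback_performance(outcomes))
```

Under these metrics, an inference-time approach like PatchRAG targets a lag of zero, while training-based approaches trade lag for reliability.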

[53] A Parameter-Efficient Transfer Learning Approach through Multitask Prompt Distillation and Decomposition for Clinical NLP

Cheng Peng, Mengxian Lyu, Ziyi Chen, Yonghui Wu

Main category: cs.CL

TL;DR: Multitask prompt distillation framework learns shared metaprompt from 21 clinical NLP tasks, enabling efficient adaptation to new tasks with minimal parameters.

DetailsMotivation: Existing prompt-based fine-tuning methods require learning task-specific prompts independently, which creates significant computational and storage overhead when deploying multiple clinical NLP systems at scale.

Method: Proposes a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks, then adapts it to unseen target tasks with fewer than 0.05% trainable parameters.

Result: Outperforms LoRA by 1.5-1.7% despite using orders of magnitude fewer parameters, exceeds single-task prompt tuning by 6.1-6.6%, with gpt-oss 20B achieving highest performance especially on clinical reasoning tasks.

Conclusion: The framework enables efficient deployment of multiple clinical NLP systems with strong zero- and few-shot performance, demonstrating better transferability of shared prompt representations.

Abstract: Existing prompt-based fine-tuning methods typically learn task-specific prompts independently, imposing significant computing and storage overhead at scale when deploying multiple clinical natural language processing (NLP) systems. We present a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks and adapts it to unseen target tasks with fewer than 0.05% trainable parameters. Evaluated across five clinical NLP task types (named entity recognition, relation extraction, question answering, natural language inference, and summarization) on 10 held-out target datasets using three backbone models (LLaMA 3.1 8B, Meditron3 8B, gpt-oss 20B), our framework consistently outperforms LoRA by 1.5–1.7% despite using orders of magnitude fewer parameters, and exceeds single-task prompt tuning by 6.1–6.6%. The gpt-oss 20B model achieves the highest overall performance, particularly on clinical reasoning tasks. The strong zero- and few-shot performance demonstrates better transferability of the shared prompt representation.
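The "fewer than 0.05% trainable parameters" figure is easy to sanity-check for soft-prompt tuning, where the trainable parameters are just the prompt embeddings. The 100-token prompt length below is an assumption for illustration (the hidden size and parameter count are roughly those of LLaMA 3.1 8B), and decomposition factors are ignored:

```python
def trainable_fraction(prompt_tokens: int, hidden_size: int, backbone_params: float) -> float:
    """Fraction of model parameters trained when only a soft prompt of
    `prompt_tokens` embedding vectors is learned."""
    return prompt_tokens * hidden_size / backbone_params

# Assumed figures: hidden size 4096, ~8.0e9 backbone parameters.
frac = trainable_fraction(prompt_tokens=100, hidden_size=4096, backbone_params=8.0e9)
print(f"{frac:.6%}")  # a 100-token prompt trains well under 0.05% of the model
```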

[54] A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM

Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang, Yuan Tian, Yi Chang

Main category: cs.CL

TL;DR: G-Defense: A graph-enhanced framework for explainable fake news detection that decomposes claims into sub-claims, uses RAG for evidence retrieval, and generates intuitive explanation graphs.

DetailsMotivation: Existing fake news detection methods are inefficient for breaking news and struggle with providing comprehensive explanations. LLM-based approaches using external reports risk inaccuracies from unverified sources, while explanations often lack comprehensiveness for all claim aspects.

Method: 1) Construct claim-centered graph by decomposing news claim into sub-claims with dependency relationships; 2) Use RAG to retrieve evidence and generate competing explanations for each sub-claim; 3) Apply defense-like inference module on graph to assess overall veracity; 4) Prompt LLM to generate intuitive explanation graph.

Result: G-Defense achieves state-of-the-art performance in both veracity detection and explanation quality, demonstrating effectiveness in handling fake news detection with comprehensive explanations.

Conclusion: The proposed graph-enhanced framework effectively addresses challenges in explainable fake news detection by providing fine-grained explanations based solely on unverified reports through structured graph decomposition and RAG techniques.

Abstract: Explainable fake news detection aims to assess the veracity of news claims while providing human-friendly explanations. Existing methods incorporating investigative journalism are often inefficient and struggle with breaking news. Recent advances in large language models (LLMs) enable leveraging externally retrieved reports as evidence for detection and explanation generation, but unverified reports may introduce inaccuracies. Moreover, effective explainable fake news detection should provide a comprehensible explanation for all aspects of a claim to assist the public in verifying its accuracy. To address these challenges, we propose a graph-enhanced defense framework (G-Defense) that provides fine-grained explanations based solely on unverified reports. Specifically, we construct a claim-centered graph by decomposing the news claim into several sub-claims and modeling their dependency relationships. For each sub-claim, we use the retrieval-augmented generation (RAG) technique to retrieve salient evidence and generate competing explanations. We then introduce a defense-like inference module based on the graph to assess the overall veracity. Finally, we prompt an LLM to generate an intuitive explanation graph. Experimental results demonstrate that G-Defense achieves state-of-the-art performance in both veracity detection and the quality of its explanations.

[55] Between Century and Poet: Graph-Based Lexical Semantic Change in Persian Poetry

Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar

Main category: cs.CL

TL;DR: This paper analyzes semantic change in Persian poetry using Word2Vec embeddings and graph-based neighborhood analysis, focusing on how words change meaning through shifting relationships with neighboring words rather than just vector displacement.

DetailsMotivation: The paper aims to better understand semantic change in Persian poetry by moving beyond simple vector displacement models to examine how words gain and lose semantic neighbors, shift bridge roles, and move across communities over time and across different poets.

Method: Uses aligned Word2Vec spaces combined with graph-based neighborhood analysis across centuries and major Persian poets. Analyzes twenty target words anchored by five reference terms, examining their semantic neighborhoods and how these networks rewire over time.

Result: Different words exhibit distinct patterns: Night is time-sensitive, Earth is poet-sensitive, Heart shows continuity despite graph-role mobility. Two wine terms highlight probe sensitivity - one broad and diffuse, the other narrower and stable. The approach reveals semantic change as neighborhood rewiring rather than abstract drift.

Conclusion: Semantic change in Persian poetry is better captured as neighborhood rewiring than abstract drift. This graph-based approach restores local structure to computational analysis and supports literary interpretations focused on persistence, migration, mediation, and selective transformation.

Abstract: Meaning in Persian poetry is both historical and relational. Words persist through literary tradition while shifting their force through changing constellations of neighbors, rhetorical frames, and poetic voices. This study examines that process using aligned Word2Vec spaces combined with graph-based neighborhood analysis across centuries and major poets. Rather than modeling semantic change as vector displacement alone, it treats lexical history as the rewiring of local semantic graphs: the gain and loss of neighbors, shifts in bridge roles, and movement across communities. The analysis centers on twenty target words, anchored by five recurrent reference terms: Earth, Night, two wine terms, and Heart. Surrounding them are affective, courtly, elemental, and Sufi concepts such as Love, Sorrow, Dervish, King, Annihilation, and Truth. These words exhibit distinct patterns of change. Night is more time-sensitive, Earth more poet-sensitive, and Heart shows continuity despite graph-role mobility. The two wine terms highlight probe sensitivity: one is broad and semantically diffuse, while the other is narrower and more stable. A lexical audit confirms that the corpus contains historically driven terms, poet-specific usages, and sparsely attested mystical vocabulary requiring caution. Overall, semantic change in Persian poetry is better captured as neighborhood rewiring than as abstract drift. For Digital Humanities, this approach restores local structure to computational analysis and supports interpretations closer to literary practice: persistence, migration, mediation, and selective transformation.
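The "neighborhood rewiring" view can be quantified as the loss of overlap between a word's nearest-neighbor sets in two aligned embedding spaces. A minimal sketch with toy 2-d vectors (the real study uses aligned Word2Vec spaces and richer graph statistics such as bridge roles and community membership):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_neighbors(word, space, k=2):
    """k nearest neighbors of `word` by cosine similarity in one embedding space."""
    others = [w for w in space if w != word]
    return set(sorted(others, key=lambda w: cosine(space[word], space[w]), reverse=True)[:k])

def rewiring(word, space_a, space_b, k=2):
    """1 - Jaccard overlap of the word's neighborhoods across two aligned
    spaces: 0 means unchanged, 1 means fully rewired."""
    na, nb = top_k_neighbors(word, space_a, k), top_k_neighbors(word, space_b, k)
    return 1 - len(na & nb) / len(na | nb)

# Toy "centuries": 'night' drifts from the company of 'dark' toward 'sorrow'.
space_a = {"night": (1, 0.1), "dark": (0.9, 0.1), "moon": (0.8, 0.2), "sorrow": (0.05, 1)}
space_b = {"night": (0.1, 1), "dark": (0.9, 0.1), "moon": (0.5, 0.9), "sorrow": (0.05, 1)}
print(rewiring("night", space_a, space_b))  # only "moon" survives: 1 - 1/3
```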

[56] ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding

Xuanle Zhao, Xinyuan Cai, Xiang Cheng, Xiuyi Chen, Bo Xu

Main category: cs.CL

TL;DR: ChemVLR is a chemical vision-language model that prioritizes reasoning by explicitly identifying fine-grained chemical descriptors like functional groups before generating answers, creating interpretable reasoning paths for chemical visual understanding.

DetailsMotivation: Current chemical VLMs are optimized for direct visual question-answering, creating "black-box" systems that fail to leverage LLMs' ability to infer underlying reaction mechanisms. There's a need for models that prioritize reasoning within the perception process for interpretable chemical analysis.

Method: 1) Cross-modality reverse-engineering strategy with rigorous filtering to curate 760k high-quality reasoning-and-captioning samples; 2) Three-stage training framework building perception and reasoning capacity; 3) Fine-grained analysis identifying granular chemical descriptors (functional groups) before answer generation.

Result: ChemVLR achieves state-of-the-art performance, surpassing both leading proprietary models and domain-specific open-source baselines. Comprehensive ablation studies validate the training strategy and data generation designs.

Conclusion: ChemVLR demonstrates that prioritizing reasoning in chemical VLMs through fine-grained descriptor identification enables explicit, interpretable reasoning paths and superior performance on complex visual chemical problems.

Abstract: While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in “black-box” systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset comprising 760k high-quality samples across molecular and reaction tasks. Furthermore, we adopt a three-stage training framework that systemically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.

[57] Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs

Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, Xiaoying Tang

Main category: cs.CL

TL;DR: aPSF is an API-only prompt optimization framework that discovers task-specific prompt structures as semantic factors and performs interventional, single-factor updates for more efficient optimization.

DetailsMotivation: Existing API-only prompt optimizers often edit monolithic prompts, which couples components, obscures credit assignment, limits controllability, and wastes tokens. There's a need for more efficient and controllable prompt optimization methods that work with API-only access to LLMs.

Method: Adaptive Prompt Structure Factorization (aPSF) uses an Architect model to discover task-specific prompt structures as semantic factors. It then performs interventional, single-factor updates with two key components: interventional factor-level scoring to estimate each factor’s marginal contribution via validation-performance changes, and error-guided factor selection to route updates to the current dominant failure source.

Result: aPSF outperforms strong baselines, including principle-aware optimizers, across multiple advanced reasoning benchmarks, improving accuracy by up to +2.16 percentage points on average. It also reduces optimization token cost by 45-87% on MultiArith while reaching peak validation in just one step.

Conclusion: aPSF provides a more efficient and controllable approach to prompt optimization that works with API-only access to LLMs, enabling better credit assignment and more sample-efficient optimization through factor-level interventions.

Abstract: Automated prompt optimization is crucial for eliciting reliable reasoning from large language models (LLMs), yet most API-only prompt optimizers iteratively edit monolithic prompts, coupling components and obscuring credit assignment, limiting controllability, and wasting tokens. We propose Adaptive Prompt Structure Factorization (aPSF), an API-only framework (prompt-in/text-out; no access to model internals) that uses an Architect model to discover task-specific prompt structures as semantic factors. aPSF then performs interventional, single-factor updates: interventional factor-level scoring estimates each factor’s marginal contribution via validation-performance changes, and error-guided factor selection routes updates to the current dominant failure source for more sample-efficient optimization. Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization token cost by 45–87% on MultiArith while reaching peak validation in one step.
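
The factor-level scoring idea can be approximated in a few lines: ablate one factor at a time, re-evaluate on validation data, and treat the accuracy drop as that factor's marginal contribution. This is a minimal sketch under assumed interfaces (the `evaluate` callable, the dict-of-factors prompt layout, and the minimum-contribution selection rule are all hypothetical); the paper's actual scoring and routing may differ.

```python
def factor_scores(factors, evaluate):
    """Estimate each factor's marginal contribution by single-factor ablation.

    `factors`  : dict mapping factor name -> factor text
    `evaluate` : callable(prompt_text) -> validation accuracy in [0, 1]
    Returns dict factor -> accuracy drop when that factor is removed."""
    def assemble(fs):
        return "\n\n".join(fs[name] for name in sorted(fs))
    base = evaluate(assemble(factors))
    scores = {}
    for name in factors:
        ablated = {k: v for k, v in factors.items() if k != name}
        # marginal contribution: how much accuracy this factor adds
        scores[name] = base - evaluate(assemble(ablated))
    return scores

def select_factor_to_update(scores):
    """Error-guided selection (assumed rule): rewrite the factor that
    currently contributes least, treating it as the dominant failure source."""
    return min(scores, key=scores.get)
```

A single-factor update to the selected factor, followed by re-scoring, would then form one optimization step.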

[58] TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

Xinkai Zhang, Jingtao Zhan, Yiqun Liu, Qingyao Ai

Main category: cs.CL

TL;DR: Researchers introduce Trial-and-Error Collection (TEC), a dataset capturing human trial-and-error problem-solving processes with detailed trajectories and error reflections, showing humans outperform LLMs in trial-and-error tasks.

DetailsMotivation: Current AI systems lack appropriate data to learn from detailed records of how humans conduct trial-and-error in practice, limiting their ability to develop effective trial-and-error capabilities for real-world problem-solving.

Method: Developed a data annotation platform to record users’ complete trial trajectories across multiple attempts and collect their reflections after receiving error feedback. Collected data from 46 participants on 58 tasks, resulting in 5,370 trial trajectories across 41,229 webpages.

Result: Humans achieved substantially higher accuracy compared to LLMs in trial-and-error tasks, demonstrating superior effectiveness. The TEC dataset provides valuable insights into human trial-and-error behavior patterns.

Conclusion: The TEC platform and dataset offer a foundation for understanding human trial-and-error behavior and developing more capable AI systems that can learn from human problem-solving processes.

Abstract: Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments. Although several trial-and-error AI techniques have recently been proposed, most of them rely on simple heuristics designed by researchers and achieve limited performance gains. The core issue is the absence of appropriate data: current models cannot learn from detailed records of how humans actually conduct trial-and-error in practice. To address this gap, we introduce a data annotation platform and a corresponding dataset, termed Trial-and-Error Collection (TEC). The platform records users’ complete trajectories across multiple trials and collects their reflections after receiving error feedback. Using this platform, we record the problem-solving processes of 46 participants on 58 tasks, resulting in 5,370 trial trajectories along with error reflections across 41,229 webpages. With this dataset, we observe that humans achieve substantially higher accuracy compared to LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs. We believe that the TEC platform and dataset provide a valuable foundation for understanding human trial-and-error behavior and for developing more capable AI systems. Platform and dataset are publicly available.

[59] SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

Yixi Zhou, Fan Zhang, Zhiqiao Guo, Yu Chen, Haipeng Zhang, Preslav Nakov, Zhuohan Xie

Main category: cs.CL

TL;DR: SQLStructEval framework analyzes structural reliability of LLM-generated SQL queries using AST representations, finding structural variance even when execution results are correct.

DetailsMotivation: Despite strong performance on Text-to-SQL benchmarks, it's unclear whether LLM-generated SQL programs are structurally reliable. The paper investigates structural behavior of LLM-generated SQL queries beyond just execution correctness.

Method: Introduces SQLStructEval framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Uses Spider benchmark to test structural diversity of LLM-generated queries and proposes compile-style pipeline for structured generation.

Result: Modern LLMs often produce structurally diverse queries for the same input even when execution results are correct. Structural variance is frequently triggered by surface-level input changes like paraphrases or schema presentation. Structured generation via compile-style pipeline improves both execution accuracy and structural consistency.

Conclusion: Structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. The SQLStructEval framework provides tools for structural analysis beyond execution correctness.

Abstract: Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable. In this work, we investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and that such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at https://anonymous.4open.science/r/StructEval-2435.
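
Canonical AST comparison needs a real SQL parser, but the structural-consistency idea can be approximated with a stdlib-only skeleton: mask literals and identifiers, keep the query's keyword scaffold, and measure what fraction of generations share the majority skeleton. The regex canonicalization below is a rough stand-in for the paper's AST representation, not its actual implementation.

```python
import re
from collections import Counter

def skeleton(sql):
    """Crude structural signature: lowercase, mask literals, replace
    identifiers with a placeholder. A faithful version would canonicalize
    a full abstract syntax tree via a SQL parser."""
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)           # string literals -> ?
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)   # numeric literals -> ?
    keywords = {"select", "from", "where", "group", "by", "order",
                "having", "join", "on", "and", "or", "count", "avg",
                "limit", "as", "?", "=", ">", "<", "*", ",", "(", ")"}
    tokens = re.findall(r"\w+|[^\w\s]", s)
    return " ".join(t if t in keywords else "_" for t in tokens)

def structural_consistency(queries):
    """Fraction of generated queries sharing the most common skeleton."""
    counts = Counter(skeleton(q) for q in queries)
    return counts.most_common(1)[0][1] / len(queries)
```

Two semantically equivalent generations with different skeletons (e.g. a subquery vs. a join) would lower this score even if both execute correctly, which is the gap the paper highlights.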

[60] Luwen Technical Report

Yiquan Wu, Yuhang Liu, Yifei Liu, Ang Li, Siying Zhou, Kun Kuang

Main category: cs.CL

TL;DR: Luwen is an open-source Chinese legal language model built on Baichuan foundation with continual pre-training, supervised fine-tuning, and retrieval-augmented generation for legal tasks.

DetailsMotivation: LLMs struggle in legal domain due to specialized terminology, complex reasoning, and rapidly evolving legal knowledge, requiring domain-specific adaptation.

Method: Three-stage approach: 1) continual pre-training on large-scale legal corpus, 2) supervised fine-tuning with curated legal instruction data, 3) retrieval-augmented generation with legal knowledge base.

Result: Luwen outperforms baselines on five legal tasks: legal judgment prediction, judicial examination, legal text summarization, law article QA, and judicial decision reasoning.

Conclusion: The approach effectively adapts general-purpose LLMs to legal domain, demonstrating strong performance across diverse legal tasks.

Abstract: Large language models have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their application in the legal domain remains challenging due to the specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge involved. In this paper, we present Luwen, an open-source Chinese legal language model built upon the Baichuan foundation model through three key techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning with carefully curated legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. We evaluate Luwen on five representative legal tasks spanning both prediction and generation settings, including legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. Experimental results show that Luwen outperforms several strong baselines, demonstrating the effectiveness of our approach in adapting general-purpose language models to the legal domain.

[61] StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

Zhirui Chen, Peiyang Liu, Ling Shao

Main category: cs.CL

TL;DR: StructKV is a structure-aware KV cache compression framework for long-context LLMs that identifies global information hubs across network depth rather than relying on local saliency snapshots.

DetailsMotivation: As LLMs scale to support million-token contexts, the linear growth of KV cache creates severe memory bottlenecks. Existing compression methods prioritize tokens based on local saliency metrics at specific layers, which systematically discards tokens that act as global information hubs across network depth but appear temporarily dormant at the pruning layer.

Method: Three core innovations: 1) Global In-Degree Centrality aggregates attention patterns across network depth to identify global information hubs; 2) Dynamic Pivot Detection uses information-theoretic metrics to adaptively locate optimal compression layer; 3) Structural Propagation and Decoupling separates computational budget from memory storage budget.

Result: Experimental results on LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness while compressing KV cache.

Conclusion: StructKV addresses limitations of local saliency-based compression by considering global structural information across network depth, enabling more efficient long-context inference while maintaining performance.

Abstract: As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens based on local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on local saliency snapshots at a specific layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the specific layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations: First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify global information hubs. Second, Dynamic Pivot Detection utilizes information-theoretic metrics to adaptively locate the optimal layer for compression. Finally, Structural Propagation and Decoupling separates the computational budget from the memory storage budget. Experimental results on the LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness.
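
The Global In-Degree Centrality component can be sketched as summing, over all layers and query positions, the attention mass each key token receives, then retaining the top-budget tokens. The list-of-lists representation and function names below are illustrative simplifications of the paper's method.

```python
def global_in_degree(attn_per_layer):
    """Aggregate attention received by each key token across all layers.

    `attn_per_layer`: list of attention matrices, one per layer, each
    shaped [num_queries][num_keys] with rows summing to 1.
    Returns one centrality score per key token."""
    num_keys = len(attn_per_layer[0][0])
    scores = [0.0] * num_keys
    for layer in attn_per_layer:
        for row in layer:
            for j, weight in enumerate(row):
                scores[j] += weight   # in-degree: mass flowing into key j
    return scores

def keep_top_k(scores, k):
    """Indices of the k highest-centrality tokens, in original order;
    these are the KV-cache entries retained under the budget."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)
```

Because the scores are aggregated across depth, a token that looks dormant at the single pruning layer but is heavily attended elsewhere survives eviction, which is the failure mode of local-saliency methods the paper targets.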

[62] Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

Heng Zhou, Zelin Tan, Zhemeng Zhang, Yutao Fan, Yibing Lin, Li Kang, Xiufeng Song, Rui Li, Songtao Huang, Ao Yu, Yuchen Fan, Yanxu Chen, Kaixin Xu, Xiaohong Liu, Yiran Qin, Philip Torr, Chen Zhang, Zhenfei Yin

Main category: cs.CL

TL;DR: Study shows reasoning paradigms (CoT, ReAct, etc.) have complementary strengths across tasks, with no single paradigm dominating; proposes learned embedding-based router to select optimal paradigm per task, outperforming fixed paradigms.

DetailsMotivation: To understand whether performance gains in LLM-based agents come from the model itself or the reasoning paradigm, and to determine if a single reasoning paradigm is optimal across all tasks.

Method: Compared six inference-time paradigms (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode) across four frontier LLMs and ten benchmarks (~18,000 runs). Proposed select-then-solve approach with lightweight embedding-based router to select optimal paradigm per task.

Result: Reasoning paradigms show dramatic complementarity: ReAct improves over Direct by 44pp on GAIA, while CoT degrades by 15pp on HumanEval. Learned router improves average accuracy from 47.6% to 53.1%, outperforming best fixed paradigm (50.3%) by 2.8pp and recovering up to 37% of oracle gap.

Conclusion: Reasoning paradigm selection should be a per-task decision made by a learned router rather than a fixed architectural choice, as no single paradigm dominates across all tasks.

Abstract: When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it? We study this question by comparing six inference-time paradigms, namely Direct, CoT, ReAct, Plan-Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs. We find that reasoning structure helps dramatically on some tasks but hurts on others: ReAct improves over Direct by 44pp on GAIA, while CoT degrades performance by 15pp on HumanEval. No single paradigm dominates, and oracle per-task selection beats the best fixed paradigm by 17.1pp on average. Motivated by this complementarity, we propose a select-then-solve approach: before answering each task, a lightweight embedding-based router selects the most suitable paradigm. Across four models, the router improves average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm at 50.3% by 2.8pp and recovering up to 37% of the oracle gap. In contrast, zero-shot self-routing only works for GPT-5 at 67.1% and fails for weaker models, all trailing the learned router. Our results argue that reasoning paradigm selection should be a per-task decision made by a learned router, not a fixed architectural choice.
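
A nearest-centroid version of the embedding-based router is easy to sketch: embed training tasks, average the embeddings of tasks on which each paradigm won, and send a new task to the paradigm with the closest centroid. The deterministic toy hash embedding below stands in for a real sentence encoder, and all names are hypothetical.

```python
def embed(text, dim=16):
    """Deterministic toy bag-of-words hash embedding; a real router would
    use a proper sentence encoder."""
    v = [0.0] * dim
    for tok in text.lower().split():
        v[sum(map(ord, tok)) % dim] += 1.0
    norm = sum(x * x for x in v) ** 0.5 or 1.0
    return [x / norm for x in v]

class ParadigmRouter:
    """Nearest-centroid router: each paradigm's centroid is the mean
    embedding of tasks on which it was the best performer."""
    def fit(self, tasks, best_paradigms):
        groups = {}
        for task, paradigm in zip(tasks, best_paradigms):
            groups.setdefault(paradigm, []).append(embed(task))
        self.centroids = {
            p: [sum(col) / len(vecs) for col in zip(*vecs)]
            for p, vecs in groups.items()
        }
        return self
    def route(self, task):
        e = embed(task)
        return max(self.centroids,
                   key=lambda p: sum(a * b for a, b in zip(e, self.centroids[p])))
```

At inference time the selected paradigm's scaffold (e.g. a ReAct loop or a plain CoT prompt) would then wrap the underlying model.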

[63] How Long Reasoning Chains Influence LLMs’ Judgment of Answer Factuality

Minzhu Tu, Shiyu Ni, Keping Bi

Main category: cs.CL

TL;DR: LLM judges are biased by reasoning chain presence and fluency, often accepting wrong answers with good reasoning, highlighting need for more robust evaluation methods.

DetailsMotivation: LLMs are widely used as scalable human evaluation surrogates but remain imperfect with surface-level biases. The rise of reasoning-capable models provides reasoning content that could improve judgment accuracy, but its actual impact on judge behavior is understudied.

Method: Systematic investigation of how access to reasoning chains affects LLM-based judgment across factual QA and mathematical reasoning benchmarks. Controlled experiments examining both fluency and factuality of reasoning chains as signals driving judge decisions.

Result: Weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning. Strong judges can partially leverage reasoning as informative evidence but are still misled by seemingly high-quality reasoning chains. Both fluency and factuality of reasoning chains are critical signals driving judge decisions.

Conclusion: Findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.

Abstract: Large language models (LLMs) have been widely adopted as scalable surrogates for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information in assessing answer correctness. With the rise of reasoning-capable models, exposing a generator’s reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.

[64] Multilingual Cognitive Impairment Detection in the Era of Foundation Models

Damar Hoogland, Boshko Koloski, Jaya Caporusso, Tine Kolenik, Ana Zwitter Vitez, Senja Pollak, Christina Manouilidou, Matthew Purver

Main category: cs.CL

TL;DR: The study compares zero-shot LLMs with supervised, feature-engineered tabular models for cognitive impairment classification from speech transcripts across English, Slovene, and Korean.

DetailsMotivation: To evaluate the effectiveness of zero-shot LLMs versus supervised tabular approaches for cognitive impairment classification from speech transcripts across multiple languages, addressing small-data scenarios in medical applications.

Method: Compared zero-shot LLMs as direct classifiers under three input settings (transcript-only, linguistic-features-only, combined) with supervised tabular models using engineered linguistic features, transcript embeddings, and early/late fusion. Used leave-one-out protocol and conducted few-shot experiments focusing on embeddings.

Result: Zero-shot LLMs provide competitive no-training baselines, but supervised tabular models generally perform better, especially when engineered linguistic features are combined with embeddings. Few-shot experiments show language-dependent benefits of limited supervision.

Conclusion: In small-data cognitive impairment detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable, with supervised approaches outperforming zero-shot LLMs when proper feature engineering is employed.

Abstract: We evaluate cognitive impairment (CI) classification from transcripts of speech in English, Slovene, and Korean. We compare zero-shot large language models (LLMs) used as direct classifiers under three input settings – transcript-only, linguistic-features-only, and combined – with supervised tabular approaches trained under a leave-one-out protocol. The tabular models operate on engineered linguistic features, transcript embeddings, and early or late fusion of both modalities. Across languages, zero-shot LLMs provide competitive no-training baselines, but supervised tabular models generally perform better, particularly when engineered linguistic features are included and combined with embeddings. Few-shot experiments focusing on embeddings indicate that the value of limited supervision is language-dependent, with some languages benefiting substantially from additional labelled examples while others remain constrained without richer feature representations. Overall, the results suggest that, in small-data CI detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable.
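
The fusion variants compared in the paper can be sketched directly: early fusion concatenates feature vectors before a single classifier, while late fusion blends the class probabilities of two separately trained models. The interfaces below are assumed for illustration, not taken from the paper's code.

```python
def early_fusion(features, embedding):
    """Concatenate modalities into one input vector for a single classifier."""
    return list(features) + list(embedding)

def late_fusion(prob_features, prob_embeddings, weight=0.5):
    """Blend per-class probabilities from two modality-specific models.

    `prob_features`  : class probabilities from the linguistic-feature model
    `prob_embeddings`: class probabilities from the transcript-embedding model
    """
    fused = [weight * f + (1 - weight) * e
             for f, e in zip(prob_features, prob_embeddings)]
    total = sum(fused)
    return [p / total for p in fused]

def predict(prob_features, prob_embeddings, labels=("CI", "healthy")):
    """Argmax over the fused class distribution."""
    fused = late_fusion(prob_features, prob_embeddings)
    return labels[max(range(len(fused)), key=fused.__getitem__)]
```

The `weight` parameter lets one modality dominate when, as the results suggest, engineered linguistic features carry more signal in a given language.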

[65] TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks

Xiangyu Wang, Jin Wu, Haoran Shi, Wei Xia, Jiarui Yu, Chanjin Zheng

Main category: cs.CL

TL;DR: TeamLLM is a human-like multi-LLM collaboration framework with role division for multi-step contextualized tasks, evaluated on a new CGPST benchmark showing substantial performance improvements.

DetailsMotivation: Existing multi-LLM frameworks lack explicit human team role division, leading to single-perspective approaches that weaken performance on multi-step contextualized tasks.

Method: Proposes TeamLLM with four distinct team roles and three-phase multi-LLM collaboration for multi-step tasks. Creates CGPST benchmark with contextual grounding, procedural structure, process-oriented evaluation, and multi-dimensional assessment.

Result: TeamLLM substantially improves performance on CGPST benchmark compared to individual LLMs, with evaluations at overall-level, step-level, and dimension-level across ten popular LLMs.

Conclusion: TeamLLM’s human-like role division and collaboration framework effectively addresses limitations of existing multi-LLM approaches for multi-step contextualized tasks.

Abstract: Recently, multi-Large Language Model (LLM) frameworks have been proposed to solve contextualized tasks. However, these frameworks do not explicitly emulate human team role division, which may lead to a single perspective, thereby weakening performance on multi-step contextualized tasks. To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework. TeamLLM adopts four team roles with distinct division and employs a three-phase multi-LLM collaboration for multi-step contextualized tasks. To evaluate the effectiveness of TeamLLM on multi-step contextualized tasks, we propose Contextually-Grounded and Procedurally-Structured tasks (CGPST) and construct the CGPST benchmark. This benchmark has four core features: contextual grounding, procedural structure, process-oriented evaluation and multi-dimensional assessment. We evaluate ten popular LLMs on CGPST at overall-level, step-level, and dimension-level. Results show that TeamLLM substantially improves performance on CGPST. We release the benchmark with scenarios, full-process responses and human scores from ten LLMs. The code and data are available at https://anonymous.4open.science/r/TeamLLM-anonymous-C50E/.

[66] Multi-Faceted Self-Consistent Preference Aligned Conversational Query Rewriting

Zhiyu Cao, Peifeng Li, Qiaoming Zhu

Main category: cs.CL

TL;DR: Multi-Faceted Self-Consistent Preference Aligned CQR (MSPA-CQR) improves conversational query rewriting by incorporating feedback from retrieval and response generation, using preference alignment across three dimensions.

DetailsMotivation: Existing conversational query rewriting methods work in isolation without considering feedback from downstream tasks like passage retrieval and response generation, limiting their effectiveness.

Method: Constructs self-consistent preference alignment data from rewriting, retrieval, and response dimensions, then uses prefix-guided multi-faceted direct preference optimization to learn from these three dimensions.

Result: MSPA-CQR shows effectiveness in both in-distribution and out-of-distribution scenarios, demonstrating improved query rewriting performance.

Conclusion: Incorporating multi-faceted feedback from retrieval and response generation into query rewriting through preference alignment improves conversational search effectiveness.

Abstract: Conversational Query Rewriting (CQR) aims to rewrite ambiguous queries to achieve more efficient conversational search. Early studies have predominantly focused on rewriting in isolation, ignoring feedback from query rewriting, passage retrieval, and response generation in the rewriting process. To address this issue, we propose Multi-Faceted Self-Consistent Preference Aligned CQR (MSPA-CQR). Specifically, we first construct self-consistent preference alignment data from three dimensions (rewriting, retrieval, and response) to generate more diverse rewritten queries. Then we propose prefix-guided multi-faceted direct preference optimization to learn preference information from the three dimensions. The experimental results show that our MSPA-CQR is effective in both in- and out-of-distribution scenarios.
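
One plausible reading of the self-consistent preference construction: score candidate rewrites along the three dimensions and keep a (chosen, rejected) pair only when the ranking agrees on all of them, emitting one dimension-prefixed copy per facet for prefix-guided optimization. The scorer interface and prefix format below are assumptions for illustration, not the paper's exact recipe.

```python
def build_preference_pairs(candidates, scorers):
    """Construct self-consistent preference pairs for DPO-style training.

    `candidates`: list of rewritten queries for one dialogue turn
    `scorers`   : dict mapping dimension name -> callable(rewrite) -> score
    A pair is kept only if the winner beats the loser on every dimension."""
    pairs = []
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            diffs = {d: f(a) - f(b) for d, f in scorers.items()}
            if all(v > 0 for v in diffs.values()):
                chosen, rejected = a, b
            elif all(v < 0 for v in diffs.values()):
                chosen, rejected = b, a
            else:
                continue  # dimensions disagree: not self-consistent, drop
            for dim in scorers:  # one prefix-tagged pair per dimension
                pairs.append((f"[{dim}] ", chosen, rejected))
    return pairs
```

The dimension prefix would let a single preference-optimized model condition on which facet a training pair reflects.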

[67] Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation

Zhiyu Cao, Peifeng Li, Qiaoming Zhu

Main category: cs.CL

TL;DR: DRCR framework improves multi-party dialogue generation through context rewriting using discourse coherence and response quality feedback signals with dynamic self-evolution learning.

DetailsMotivation: Multi-party dialogues often contain colloquial expressions and incomplete utterances that impede comprehension and weaken dialogue structure representations, making generation challenging.

Method: Proposes DRCR framework with two complementary feedback signals (discourse coherence and response quality) to construct preference data for context rewriting and response generation, plus dynamic self-evolution learning where rewriter and responder enhance capabilities through mutual interaction.

Result: Comprehensive experiments on four multi-party dialogue datasets substantiate the effectiveness of DRCR framework.

Conclusion: DRCR improves multi-party dialogue generation through dialogue context rewriting with feedback signals and iterative self-evolution learning.

Abstract: Previous research on multi-party dialogue generation has predominantly leveraged structural information inherent in dialogues to directly inform the generation process. However, the prevalence of colloquial expressions and incomplete utterances in dialogues often impedes comprehension and weakens the fidelity of dialogue structure representations, which is particularly pronounced in multi-party dialogues. In this work, we propose a novel framework DRCR (Discourse coherence and Response-guided Context Rewriting) to improve multi-party dialogue generation through dialogue context rewriting. Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation. Moreover, we propose a dynamic self-evolution learning method that allows the rewriter and responder to continuously enhance their capabilities through mutual interaction in an iterative training loop. Comprehensive experiments conducted on four multi-party dialogue datasets substantiate the effectiveness of DRCR.

[68] When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning

Yang Xiang, Yixin Ji, Ruotao Xu, Dan Qiao, Zheming Yang, Juntao Li, Min Zhang

Main category: cs.CL

TL;DR: DTSR is a framework that enables large reasoning models to dynamically assess when their chain-of-thought reasoning is sufficient for early exit, reducing computational redundancy by 28.9%-34.9% with minimal performance loss.

DetailsMotivation: Large reasoning models suffer from overthinking, causing computational inefficiency. Existing early-exit methods rely on unreliable handcrafted indicators, so a more dynamic, metacognition-inspired approach is needed.

Method: Two-stage framework: (1) Reflection Signal Monitoring identifies potential early-exit cues, and (2) Thought Sufficiency Check evaluates whether current chain-of-thought reasoning is sufficient to derive the final answer.

Result: Experimental results on Qwen3 models show DTSR reduces reasoning length by 28.9%-34.9% with minimal performance loss, effectively mitigating overthinking in large reasoning models.

Conclusion: DTSR provides an effective framework for efficient reasoning by enabling models to dynamically assess thought sufficiency, with insights into overconfidence and self-evaluation paradigms for early-exit reasoning.

Abstract: Large reasoning models (LRMs) have achieved remarkable performance in complex reasoning tasks, driven by their powerful inference-time scaling capability. However, LRMs often suffer from overthinking, which results in substantial computational redundancy and significantly reduces efficiency. Early-exit methods aim to mitigate this issue by terminating reasoning once sufficient evidence has been generated, yet existing approaches mostly rely on handcrafted or empirical indicators that are unreliable and impractical. In this work, we introduce Dynamic Thought Sufficiency in Reasoning (DTSR), a novel framework for efficient reasoning that enables the model to dynamically assess the sufficiency of its chain-of-thought (CoT) and determine the optimal point for early exit. Inspired by human metacognition, DTSR operates in two stages: (1) Reflection Signal Monitoring, which identifies reflection signals as potential cues for early exit, and (2) Thought Sufficiency Check, which evaluates whether the current CoT is sufficient to derive the final answer. Experimental results on the Qwen3 models show that DTSR reduces reasoning length by 28.9%-34.9% with minimal performance loss, effectively mitigating overthinking. We further discuss overconfidence in LRMs and self-evaluation paradigms, providing valuable insights for early-exit reasoning.
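The two-stage loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reflection cue list and the length-based sufficiency scorer are stand-in assumptions for what would be model-driven components.

```python
# Hypothetical sketch of DTSR's two-stage early exit (cue words and the
# sufficiency scorer are illustrative assumptions, not the paper's exact ones).

REFLECTION_CUES = ("wait", "let me double-check", "alternatively")  # assumed cues

def sufficiency_check(cot_so_far: str) -> float:
    """Stub for stage 2: return an estimated probability that the current
    chain-of-thought already suffices to answer. Assumed interface; a real
    system would query the reasoning model here."""
    return min(1.0, len(cot_so_far.split()) / 50)

def decode_with_early_exit(steps, threshold=0.9):
    """Stage 1 scans each new reasoning step for reflection cues; only when a
    cue fires do we pay for the stage-2 sufficiency check."""
    cot = []
    for step in steps:
        cot.append(step)
        text = " ".join(cot)
        if any(cue in step.lower() for cue in REFLECTION_CUES):
            if sufficiency_check(text) >= threshold:
                return text, True   # exit early, skip remaining steps
    return " ".join(cot), False     # reasoned to the end

steps = ["Compute 12*4 = 48.", "Add 7 to get 55.",
         "Wait, let me double-check: 48 + 7 = 55. " + "pad " * 50,
         "Therefore the answer is 55."]
text, exited = decode_with_early_exit(steps)
print(exited)
```

The design point is that the cheap cue monitor gates the expensive sufficiency check, so the model is not re-evaluated at every token.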

[69] GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering

Guanran Luo, Wentao Qiu, Zhongquan Jian, Meihong Wang, Qingqiang Wu

Main category: cs.CL

TL;DR: GCoT-decoding is a general decoding strategy that extends Chain-of-Thought reasoning to both fixed and free question-answering tasks without requiring manual prompts, using a two-stage branching method with Fibonacci sampling and heuristic error backtracking.

Motivation: Current Chain-of-Thought reasoning requires manually designed prompts, and CoT-decoding only works for problems with fixed answer sets. There's a need for a more general approach that can handle both fixed and free QA tasks without prompt engineering.

Method: GCoT-decoding uses a two-stage branching method combining Fibonacci sampling and heuristic error backtracking to generate candidate decoding paths. It splits each path into reasoning and answer spans to compute path confidence accurately, then aggregates semantically similar paths to identify consensus answers instead of using traditional majority voting.

Result: Extensive experiments on six datasets show that GCoT-decoding maintains strong performance on fixed QA tasks while achieving significant improvements on free QA tasks, demonstrating its generality across different question types.

Conclusion: GCoT-decoding provides a general decoding strategy that extends Chain-of-Thought reasoning to a broader range of QA tasks without requiring manual prompts, offering improved performance on free-form questions while maintaining effectiveness on fixed-answer problems.

Abstract: Chain-of-Thought reasoning can enhance large language models, but it requires manually designed prompts to guide the model. Recently proposed CoT-decoding enables the model to generate CoT-style reasoning paths without prompts, but it is only applicable to problems with fixed answer sets. To address this limitation, we propose a general decoding strategy GCoT-decoding that extends applicability to a broader range of question-answering tasks. GCoT-decoding employs a two-stage branching method combining Fibonacci sampling and heuristic error backtracking to generate candidate decoding paths. It then splits each path into a reasoning span and an answer span to accurately compute path confidence, and finally aggregates semantically similar paths to identify a consensus answer, replacing traditional majority voting. We conduct extensive experiments on six datasets covering both fixed and free QA tasks. Our method not only maintains strong performance on fixed QA but also achieves significant improvements on free QA, demonstrating its generality.
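The span-splitting and aggregation steps above can be sketched as follows. The data shapes, the case-folded answer normalizer (standing in for semantic similarity), and the split index are assumptions made for illustration.

```python
import math
from collections import defaultdict

# Illustrative sketch of GCoT-decoding's scoring step under assumed data
# shapes; the split heuristic and normalizer are not the paper's exact ones.

def answer_confidence(token_logprobs, answer_start):
    """Average probability over the answer span only, so long reasoning
    spans do not dilute (or inflate) the path's confidence."""
    span = token_logprobs[answer_start:]
    return sum(math.exp(lp) for lp in span) / len(span)

def consensus_answer(paths):
    """Aggregate semantically similar paths (here: case-folded exact match as
    a stand-in for semantic similarity) instead of plain majority voting."""
    scores = defaultdict(float)
    for answer, logprobs, start in paths:
        scores[answer.strip().lower()] += answer_confidence(logprobs, start)
    return max(scores, key=scores.get)

paths = [
    ("Paris",  [-0.9, -0.7, -0.1, -0.2], 2),  # reasoning tokens 0-1, answer 2-3
    ("paris",  [-1.2, -0.3, -0.4],       1),
    ("Lyon",   [-0.5, -0.6, -1.5],       2),
]
print(consensus_answer(paths))  # → "paris"
```

Confidence-weighted aggregation over merged answers lets two lower-ranked but agreeing paths outvote a single high-confidence outlier, which plain majority voting over raw strings would miss.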

[70] Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

Parth Patil, Dhruv Kumar, Yash Sinha, Murari Mandal

Main category: cs.CL

TL;DR: A framework for analyzing algebraic reasoning in LLMs with nine independent complexity dimensions to identify specific failure causes, revealing working memory as the dominant bottleneck across models.

Motivation: Current benchmarks for algebraic reasoning in LLMs only provide single accuracy scores without attributing failures to specific causes like expression nesting depth, operator rarity, or dependency chain length. There's no systematic way to vary complexity factors independently or track model progress over time.

Method: Introduced a nine-dimension algebraic complexity framework where each factor (expression nesting depth, intermediate result count, sub-expression complexity, operator hardness, dependent reasoning chain length, etc.) is varied independently while others are fixed. Used a parametric pipeline for automatic problem generation and verification without human annotation.

Result: Evaluated seven instruction-tuned models (8B to 235B parameters) and found working memory is the dominant scale-invariant bottleneck. All models collapsed between 20-30 parallel branches regardless of parameter count, indicating a hard architectural constraint. Identified a minimal subset of five dimensions that span all documented algebraic failure modes.

Conclusion: The framework provides comprehensive diagnostic capabilities for understanding LLM algebraic reasoning failures, revealing fundamental architectural limitations in working memory that persist across model scales.

Abstract: Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control. No prior system has offered automatic generation and verification of problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled by a parametric pipeline requiring no human annotation. Each dimension is grounded in a documented LLM failure mode and captures a structurally distinct aspect of algebraic difficulty, including expression nesting depth, simultaneous intermediate result count, sub-expression complexity, operator hardness, and dependent reasoning chain length. We evaluate seven instruction-tuned models spanning 8B to 235B parameters across all nine dimensions and find that working memory is the dominant scale-invariant bottleneck. Every model collapses between 20 and 30 parallel branches regardless of parameter count, pointing to a hard architectural constraint rather than a solvable capacity limitation. Our analysis further identifies a minimal yet diagnostically sufficient subset of five dimensions that together span the full space of documented algebraic failure modes, providing a complete complexity profile of a model’s algebraic reasoning capacity.
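The idea of a parametric pipeline that varies one dimension while holding others fixed can be sketched in a few lines. The operator set and generator below are simplifying assumptions in the spirit of the framework, not the paper's actual pipeline.

```python
import random

# A minimal sketch of varying one complexity dimension (expression nesting
# depth) while holding others fixed; operators and verification are
# simplifying assumptions, not the paper's pipeline.

def gen_expr(depth, rng):
    """Generate a fully parenthesized integer expression of exact depth,
    using only + and * so operator hardness stays constant."""
    if depth == 0:
        return str(rng.randint(1, 9))
    op = rng.choice(["+", "*"])
    return f"({gen_expr(depth - 1, rng)} {op} {gen_expr(depth - 1, rng)})"

def make_problem(depth, seed=0):
    """Problem plus machine-checked ground truth: no human annotation needed."""
    rng = random.Random(seed)
    expr = gen_expr(depth, rng)
    return expr, eval(expr)  # safe: expr contains only digits, + * ( )

for d in (1, 2, 3):
    expr, ans = make_problem(d, seed=d)
    print(d, expr, "=", ans)
```

Because the generator is seeded and deterministic, the same graded problem set can be regenerated later to track a model's progress over time.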

[71] Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning

Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong

Main category: cs.CL

TL;DR: CLoT is a novel Chain-of-Thought framework using Reversible Hierarchical Markov Chain with backward verification and pruning to improve mathematical reasoning efficiency and accuracy while reducing computational overhead.

Motivation: Long CoT sequences often exceed computational limits, and existing approaches using Markov chain-like structures suffer from memorylessness (loss of context) and limited backward reasoning capability, which this work aims to address.

Method: Proposes Cognitive Loop of Thought (CLoT) based on Reversible Hierarchical Markov Chain: decomposes problems into hierarchical sub-problems, introduces backward verification at each layer, and implements pruning to remove redundant lower-level sub-problems after higher-level verification.

Result: Achieves 99.0% accuracy on AddSub dataset using GPT-4o-mini, outperforming traditional CoT by 4.1% and CoT-SC by 2.9%. Shows effectiveness across four mathematical benchmarks.

Conclusion: CLoT effectively mitigates error propagation and enhances reasoning robustness while improving computational efficiency through hierarchical decomposition, backward verification, and pruning strategies.

Abstract: Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by leveraging explicit reasoning steps. However, the widespread adoption of Long CoT often results in sequence lengths that exceed manageable computational limits. While existing approaches attempt to alleviate this by reducing KV Cache redundancy via Markov chain-like structures, they introduce two critical limitations: inherent memorylessness (loss of context) and limited backward reasoning capability. To address these limitations, we propose a novel Chain-of-Thought framework based on Reversible Hierarchical Markov Chain, termed Cognitive Loop of Thought (CLoT), and a backward reasoning dataset CLoT-Instruct. In CLoT, problems are decomposed into sub-problems with hierarchical dependencies. Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer. Furthermore, we implement a pruning strategy: once higher-level sub-problems are verified, redundant lower-level sub-problems are pruned to maximize efficiency. This approach effectively mitigates error propagation and enhances reasoning robustness. Experiments on four mathematical benchmarks demonstrate the effectiveness of our method. Notably, on the AddSub dataset using GPT-4o-mini, CLoT achieves 99.0% accuracy, outperforming traditional CoT and CoT-SC by 4.1% and 2.9%, respectively.
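The verify-then-prune mechanic can be sketched with a toy sub-problem tree. The tree structure, the summing checker, and the set-based context are illustrative assumptions, not CLoT's actual representation.

```python
# A toy sketch of CLoT-style verify-then-prune over hierarchical sub-problems;
# the tree, the checker, and the context model are illustrative assumptions.

class SubProblem:
    def __init__(self, name, answer, children=()):
        self.name, self.answer, self.children = name, answer, list(children)

def backward_verify(node):
    """Stub for the backward check: re-derive the parent's answer from its
    children (here: parents sum their children, leaves are trusted)."""
    if not node.children:
        return True
    return node.answer == sum(c.answer for c in node.children)

def solve_with_pruning(root, context):
    """Keep a working context of active sub-problems; once a higher-level
    node verifies, its sub-tree is pruned to shrink the context/KV state."""
    for child in root.children:
        solve_with_pruning(child, context)
    context.add(root.name)
    if backward_verify(root):
        for child in root.children:      # prune verified lower levels
            context.discard(child.name)

leaves = [SubProblem("a", 2), SubProblem("b", 3)]
root = SubProblem("total", 5, leaves)
ctx = set()
solve_with_pruning(root, ctx)
print(ctx)  # only the verified root remains
```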

[72] AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation

Guanran Luo, Wentao Qiu, Wanru Zhao, Wenhan Lv, Zhongquan Jian, Meihong Wang, Qingqiang Wu

Main category: cs.CL

TL;DR: AGSC is an uncertainty quantification framework for long-form LLM generation that uses NLI neutral probabilities to filter irrelevant content and GMM clustering for semantic theme modeling, reducing computation by 60% while maintaining factuality correlation.

Motivation: LLMs suffer from hallucination in long-form generation, but existing uncertainty quantification methods struggle with heterogeneous themes, overlook neutral information nuances, and have high computational costs from fine-grained decomposition.

Method: AGSC uses NLI neutral probabilities as triggers to distinguish irrelevance from uncertainty, then applies Gaussian Mixture Model soft clustering to model latent semantic themes and assign topic-aware weights for downstream aggregation.

Result: Experiments on BIO and LongFact datasets show AGSC achieves state-of-the-art correlation with factuality while reducing inference time by about 60% compared to full atomic decomposition methods.

Conclusion: AGSC provides an efficient uncertainty quantification framework for long-form LLM generation that balances accuracy and computational efficiency through adaptive granularity and semantic clustering.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in long-form generation, yet their application is hindered by the hallucination problem. While Uncertainty Quantification (UQ) is essential for assessing reliability, the complex structure of long-form outputs makes reliable aggregation across heterogeneous themes difficult; in addition, existing methods often overlook the nuance of neutral information and suffer from the high computational cost of fine-grained decomposition. To address these challenges, we propose AGSC (Adaptive Granularity and GMM-based Semantic Clustering), a UQ framework tailored for long-form generation. AGSC first uses NLI neutral probabilities as triggers to distinguish irrelevance from uncertainty, reducing unnecessary computation. It then applies Gaussian Mixture Model (GMM) soft clustering to model latent semantic themes and assign topic-aware weights for downstream aggregation. Experiments on BIO and LongFact show that AGSC achieves state-of-the-art correlation with factuality while reducing inference time by about 60% compared to full atomic decomposition.
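The neutral-probability trigger can be sketched as below. The per-claim NLI output format, the threshold, and the contradiction-based uncertainty score are illustrative assumptions, and the GMM clustering stage is omitted.

```python
# Sketch of AGSC's neutral-probability trigger under an assumed NLI output
# format (entail/neutral/contradict probabilities per claim); thresholds and
# the uncertainty score are illustrative, and the GMM stage is omitted.

def filter_and_score(claims, neutral_threshold=0.6):
    """Claims the NLI model judges mostly *neutral* w.r.t. the question are
    treated as irrelevant and skipped, so they neither cost computation in
    later stages nor get mistaken for genuinely uncertain content."""
    kept, skipped = [], []
    for claim in claims:
        if claim["neutral"] >= neutral_threshold:
            skipped.append(claim["text"])       # irrelevance, not uncertainty
        else:
            # uncertainty rises as contradiction mass rises
            kept.append((claim["text"], claim["contradict"]))
    uncertainty = sum(u for _, u in kept) / len(kept) if kept else 0.0
    return uncertainty, skipped

claims = [
    {"text": "Born in 1952.",        "entail": 0.8, "neutral": 0.1, "contradict": 0.1},
    {"text": "Likes hiking.",        "entail": 0.1, "neutral": 0.8, "contradict": 0.1},
    {"text": "Won the prize twice.", "entail": 0.3, "neutral": 0.2, "contradict": 0.5},
]
u, skipped = filter_and_score(claims)
print(round(u, 2), skipped)
```

Separating the neutral axis from the entail/contradict axis is what lets off-topic content ("Likes hiking.") be dropped instead of inflating the uncertainty estimate.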

[73] SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization

Usman Naseem, Robert Geislinger, Juan Ren, Sarah Kohail, Rudy Garrido Veliz, P Sam Sahil, Yiran Zhang, Marco Antonio Stranisci, Idris Abdulmumin, Özge Alaçam, Cengiz Acartürk, Aisha Jabr, Saba Anwar, Abinew Ali Ayele, Elena Tutubalina, Aung Kyaw Htet, Xintong Wang, Surendrabikram Thapa, Tanmoy Chakraborty, Dheeraj Kodati, Sahar Moradizeyveh, Firoj Alam, Ye Kyaw Thu, Shantipriya Parida, Ihsan Ayyub Qazi, Lilian Wanzare, Nelson Odhiambo Onyango, Clemencia Siro, Ibrahim Said Ahmad, Adem Chanie Ali, Martin Semmann, Chris Biemann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam

Main category: cs.CL

TL;DR: SemEval-2026 Task 9 is a multilingual shared task on online polarization detection with 22 languages and 110K+ annotated instances, featuring three sub-tasks: presence detection, type identification, and manifestation recognition.

Motivation: To address the growing concern about online polarization and its societal impacts by creating a comprehensive multilingual benchmark for detecting polarization in online content across diverse languages and cultural contexts.

Method: Created a large-scale multilingual dataset with multi-label annotations (presence, type, manifestation) across 22 languages. Organized a shared task with three sub-tasks, attracted 1,000+ participants and 10k+ submissions, and evaluated systems using baseline models and participant submissions.

Result: The task attracted 67 teams with 73 system description papers. Baseline results were established, and best-performing systems were analyzed to identify most effective approaches across different subtasks and languages. The dataset is publicly available.

Conclusion: The shared task successfully created a valuable multilingual benchmark for polarization detection, demonstrated strong community interest, and identified effective methods for detecting polarization across different languages and subtasks.

Abstract: We present SemEval-2026 Task 9, a shared task on online polarization detection, covering 22 languages and comprising over 110K annotated instances. Each data instance is multi-labeled with the presence of polarization, polarization type, and polarization manifestation. Participants were asked to predict labels in three sub-tasks: (1) detecting the presence of polarization, (2) identifying the type of polarization, and (3) recognizing the polarization manifestation. The three tasks attracted over 1,000 participants worldwide and more than 10k submissions on Codabench. We received final submissions from 67 teams and 73 system description papers. We report the baseline results and analyze the performance of the best-performing systems, highlighting the most common approaches and the most effective methods across different subtasks and languages. The dataset of this task is publicly available.

[74] Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models

Paula Dodig, Boshko Koloski, Katarina Sitar Šuštar, Senja Pollak, Matthew Purver

Main category: cs.CL

TL;DR: First publicly available Slovene ESG sentiment dataset and models for automatic ESG sentiment detection from news articles, with LLMs achieving best performance on Environmental and Social aspects.

Motivation: ESG considerations are crucial for corporate assessment but reliable ratings are limited for smaller companies and emerging markets, creating a need for automated ESG sentiment analysis tools.

Method: Created Slovene ESG dataset from news collection using LLM-assisted filtering and human annotation. Evaluated monolingual (SloBERTa), multilingual (XLM-R), embedding-based classifiers (TabPFN), hierarchical ensembles, and LLMs for ESG sentiment classification.

Result: LLMs performed best on Environmental (Gemma3-27B, F1-macro: 0.61) and Social aspects (gpt-oss 20B, F1-macro: 0.45), while fine-tuned SloBERTa was best for Governance classification (F1-macro: 0.54). Case study demonstrated practical application.

Conclusion: The introduced dataset and models enable automated ESG sentiment analysis for Slovene companies, with LLMs showing strong performance particularly for Environmental and Social aspects.

Abstract: Environmental, Social, and Governance (ESG) considerations are increasingly integral to assessing corporate performance, reputation, and long-term sustainability. Yet, reliable ESG ratings remain limited for smaller companies and emerging markets. We introduce the first publicly available Slovene ESG sentiment dataset and a suite of models for automatic ESG sentiment detection. The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content. We evaluate the performance of monolingual (SloBERTa) and multilingual (XLM-R) models, embedding-based classifiers (TabPFN), hierarchical ensemble architectures, and large language models. Results show that LLMs achieve the strongest performance on Environmental (Gemma3-27B, F1-macro: 0.61) and Social aspects (gpt-oss 20B, F1-macro: 0.45), while fine-tuned SloBERTa is the best model on Governance classification (F1-macro: 0.54). We then show in a small case study how the best-performing classifier (gpt-oss) can be applied to investigate ESG aspects for selected companies across a long time frame.

[75] WRAP++: Web discoveRy Amplified Pretraining

Jiang Zhou, Yunhao Wang, Xing Wu, Tinghao Yu, Feng Zhang

Main category: cs.CL

TL;DR: WRAP++ is a method that discovers cross-document relationships from web hyperlinks to synthesize joint QA pairs, amplifying factual knowledge context beyond single-document approaches.

Motivation: Existing synthetic data rephrasing techniques operate at single-document level, missing cross-document relationships and leaving facts with limited associative context. This limits knowledge acquisition during LLM pretraining.

Method: WRAP++ discovers high-confidence relational motifs (dual-links and co-mentions) from web hyperlinks, then synthesizes QA that requires reasoning across both documents. This creates relational knowledge absent from either source document alone.

Result: On Wikipedia, WRAP++ amplified ~8.4B tokens of raw text into 80B tokens of cross-document QA data. OLMo-based models (7B and 32B) trained with WRAP++ substantially outperform single-document approaches on SimpleQA and show sustained scaling gains.

Conclusion: Cross-document knowledge discovery and amplification through relational QA synthesis provides significant advantages over single-document approaches for enhancing LLM pretraining and knowledge acquisition.

Abstract: Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.
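The motif discovery step can be sketched over a toy hyperlink graph. The graph below and the exact co-mention rule ("two pages linked from the same source") are simplifying assumptions about the paper's high-confidence motifs.

```python
# Minimal sketch of WRAP++'s motif discovery over a hyperlink graph; the toy
# graph and the co-mention rule ("both linked by a common page") are
# simplifying assumptions about the paper's high-confidence motifs.

def dual_links(links):
    """Pairs (a, b) where a links to b AND b links to a."""
    return sorted({tuple(sorted((a, b)))
                   for a, targets in links.items()
                   for b in targets if a in links.get(b, ())})

def co_mentions(links):
    """Pairs of pages linked together from the same source page."""
    pairs = set()
    for targets in links.values():
        ts = sorted(targets)
        for i in range(len(ts)):
            for j in range(i + 1, len(ts)):
                pairs.add((ts[i], ts[j]))
    return sorted(pairs)

links = {
    "Curie":   ["Radium", "Sorbonne"],
    "Radium":  ["Curie"],
    "Physics": ["Curie", "Radium"],
}
print(dual_links(links))   # → [('Curie', 'Radium')]
print(co_mentions(links))  # → [('Curie', 'Radium'), ('Radium', 'Sorbonne')]
```

Each discovered pair would then seed a joint QA synthesis prompt over the two documents; since valid pairs grow combinatorially with graph size, this is also where the 8.4B-to-80B token amplification comes from.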

[76] Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

Chengyue Wu, Shiyi Lan, Yonggan Fu, Sensen Gao, Jin Wang, Jincheng Yu, Jose M. Alvarez, Pavlo Molchanov, Ping Luo, Song Han, Ligeng Zhu, Enze Xie

Main category: cs.CL

TL;DR: Fast-dVLM: A block-diffusion-based vision-language model enabling parallel decoding for faster inference while maintaining generation quality comparable to autoregressive VLMs.

Motivation: Autoregressive decoding in VLMs limits inference throughput, especially on edge devices in robotics/autonomous driving where batch size one makes it memory-bandwidth-bound and underutilizes hardware parallelism.

Method: Uses block-wise discrete diffusion for parallel decoding with KV-cache compatibility and speculative block decoding. Compares two AR-to-diffusion conversion strategies: two-stage (LLM backbone first) vs direct (full VLM conversion). Introduces multimodal diffusion adaptations, block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation.

Result: Matches autoregressive counterpart in generation quality across 11 multimodal benchmarks. With SGLang integration and FP8 quantization, achieves over 6x end-to-end inference speedup over AR baseline.

Conclusion: Direct conversion of AR VLMs to diffusion-based parallel decoding is more efficient and effective, enabling significant inference acceleration while preserving multimodal capabilities.

Abstract: Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations (block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation) that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.

[77] On the Step Length Confounding in LLM Reasoning Data Selection

Bing Wang, Rui Miao, Chen Shen, Shaotian Yan, Kaiyuan Liu, Ximing Li, Xiaosong Yuan, Sinan Fan, Jun Zhang, Jieping Ye

Main category: cs.CL

TL;DR: The paper identifies a problem in LLM reasoning data selection where naturalness-based methods prefer samples with longer reasoning steps rather than higher quality, proposes two methods to address this step length confounding, and validates them across multiple LLMs and benchmarks.

Motivation: Existing pipelines for constructing reasoning datasets use naturalness-based selection methods that rank data by average log probability assigned by LLMs. However, the authors discovered that this approach systematically prefers samples with longer reasoning steps rather than higher-quality ones, a phenomenon they term "step length confounding."

Method: The authors propose two variant methods: ASLEC-DROP, which drops first-token probabilities when computing average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the first tokens’ confounding effect. Both methods aim to mitigate the step length confounding problem.

Result: Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of the proposed approaches in mitigating the step length confounding problem, showing improved data selection quality compared to baseline naturalness-based methods.

Conclusion: The paper identifies a systematic bias in current reasoning data selection methods and provides practical solutions to improve the quality of reasoning datasets for LLM training, which could lead to better reasoning capabilities in large language models.

Abstract: Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, through supervised fine-tuning on large-scale and high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manually heuristic or naturalness-based selection methods to filter high-quality samples. Despite the proven effectiveness of naturalness-based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher-quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low-probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC-DROP, which drops first-token probabilities when computing average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the first tokens’ confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
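The confound, and the ASLEC-DROP fix, can be reproduced numerically. The representation of a sample as per-step token log-probabilities and the specific numbers are assumptions chosen to isolate the effect.

```python
# Sketch of the step-length confound and the ASLEC-DROP fix, under an assumed
# representation of a sample as per-step token log-probabilities.

def avg_logprob(steps):
    """Naive naturalness score: mean log-probability over all tokens."""
    toks = [lp for step in steps for lp in step]
    return sum(toks) / len(toks)

def aslec_drop(steps):
    """Drop each step's low-probability first token before averaging, so
    longer steps no longer dilute the first-token penalty."""
    toks = [lp for step in steps for lp in step[1:]]
    return sum(toks) / len(toks)

# Two samples with identical per-token quality except step length: the first
# token of every step is rare (log p = -5), the rest are easy (log p = -1).
short_steps = [[-5.0, -1.0, -1.0]] * 4          # 3 tokens per step
long_steps  = [[-5.0] + [-1.0] * 9] * 4         # 10 tokens per step

print(avg_logprob(short_steps), avg_logprob(long_steps))   # longer looks better
print(aslec_drop(short_steps), aslec_drop(long_steps))     # tie restored
```

Under the naive score the long-step sample wins purely because its rare first tokens are averaged over more easy tokens; dropping them restores the tie, which is exactly the dilution mechanism the paper attributes the confound to.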

[78] HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

Main category: cs.CL

TL;DR: HingeMem: A boundary-guided long-term memory system for dialogue that uses event segmentation theory to create interpretable indexing via boundary-triggered hyperedges across person, time, location, and topic elements, with query-adaptive retrieval mechanisms.

DetailsMotivation: Existing long-term memory methods for dialogue systems rely on continuous summarization or OpenIE-based graph construction with fixed Top-k retrieval, which leads to limited adaptability across query categories and high computational overhead.

Method: HingeMem operationalizes event segmentation theory to build an interpretable indexing interface via boundary-triggered hyperedges over four elements: person, time, location, and topic. When any element changes, it draws a boundary and writes the current segment. It introduces query-adaptive retrieval that jointly decides what to retrieve (query-conditioned routing over element-indexed memory) and how much to retrieve (retrieval depth based on estimated query type).

Result: Extensive experiments across LLM scales (0.6B to production-tier models) on LOCOMO show HingeMem achieves approximately 20% relative improvement over strong baselines without query-category specification, while reducing computational cost (a 68% reduction in question-answering token cost compared to HippoRAG2).

Conclusion: HingeMem advances memory modeling and its adaptive retrieval makes it suitable for web applications requiring efficient and trustworthy memory over extended interactions.

Abstract: Long-term memory is critical for dialogue systems that support continuous, sustainable, and personalized interactions. However, existing methods rely on continuous summarization or OpenIE-based graph construction paired with fixed Top-k retrieval, leading to limited adaptability across query categories and high computational overhead. In this paper, we propose HingeMem, a boundary-guided long-term memory that operationalizes event segmentation theory to build an interpretable indexing interface via boundary-triggered hyperedges over four elements: person, time, location, and topic. When any such element changes, HingeMem draws a boundary and writes the current segment, thereby reducing redundant operations and preserving salient context. To enable robust and efficient retrieval under diverse information needs, HingeMem introduces query-adaptive retrieval mechanisms that jointly decide (a) what to retrieve: determine the query-conditioned routing over the element-indexed memory; (b) how much to retrieve: control the retrieval depth based on the estimated query type. Extensive experiments across LLM scales (from 0.6B to production-tier models, e.g., Qwen3-0.6B to Qwen-Flash) on LOCOMO show that HingeMem achieves approximately 20% relative improvement over strong baselines without query-category specification, while reducing computational cost (a 68% reduction in question-answering token cost compared to HippoRAG2). Beyond advancing memory modeling, HingeMem’s adaptive retrieval makes it a strong fit for web applications requiring efficient and trustworthy memory over extended interactions.
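The boundary trigger itself is simple enough to sketch directly. The utterance schema below is an assumption, and the element extractor (which in HingeMem would come from the model) is replaced by pre-filled fields.

```python
# Sketch of HingeMem's boundary trigger under an assumed utterance schema;
# the element extractor is stubbed, and segments are plain lists.

ELEMENTS = ("person", "time", "location", "topic")

def segment(utterances):
    """Draw a boundary (close the current segment) whenever any of the four
    elements changes between consecutive utterances."""
    segments, current, prev = [], [], None
    for utt in utterances:
        key = tuple(utt[e] for e in ELEMENTS)
        if prev is not None and key != prev:
            segments.append(current)    # boundary: write the segment
            current = []
        current.append(utt["text"])
        prev = key
    if current:
        segments.append(current)
    return segments

dialogue = [
    {"person": "Ann", "time": "Mon", "location": "office", "topic": "budget", "text": "u1"},
    {"person": "Ann", "time": "Mon", "location": "office", "topic": "budget", "text": "u2"},
    {"person": "Ann", "time": "Mon", "location": "cafe",   "topic": "budget", "text": "u3"},
    {"person": "Bob", "time": "Tue", "location": "cafe",   "topic": "travel", "text": "u4"},
]
print(segment(dialogue))  # → [['u1', 'u2'], ['u3'], ['u4']]
```

Writing memory only at boundaries, rather than summarizing continuously, is what cuts the redundant write operations the abstract refers to; each written segment would then be indexed by its element tuple for routing at retrieval time.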

[79] MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

Xiaotian Luo, Xun Jiang, Jiangcheng Wu

Main category: cs.CL

TL;DR: MedDialBench introduces a benchmark for analyzing how different dimensions of patient non-cooperation affect LLM diagnostic accuracy, finding information pollution (fabricating symptoms) is more harmful than information deficit (withholding).

Motivation: Existing medical dialogue benchmarks don't adequately characterize how graded severity of patient non-cooperation across multiple behavioral dimensions affects LLM diagnostic robustness, lacking controlled analysis of cross-dimension interactions.

Method: Developed MedDialBench with five patient behavior dimensions (Logic Consistency, Health Cognition, Expression Style, Disclosure, Attitude) each with graded severity levels and case-specific scripts, enabling factorial design for sensitivity analysis across 7,225 dialogues with 5 frontier LLMs.

Result: Information pollution (fabricating symptoms) causes 1.7-3.4x larger accuracy drops than information deficit; fabricating is the only configuration significant across all models and drives super-additive interactions in combinations; inquiry strategies can recover withheld but not fabricated information.

Conclusion: Patient non-cooperation asymmetrically impacts LLM diagnostic accuracy, with information pollution being particularly damaging and resistant to mitigation strategies, highlighting the need for robustness improvements in medical dialogue systems.

Abstract: Interactive medical dialogue benchmarks have shown that LLM diagnostic accuracy degrades significantly when interacting with non-cooperative patients, yet existing approaches either apply adversarial behaviors without graded severity or case-specific grounding, or reduce patient non-cooperation to a single ungraded axis, and none analyze cross-dimension interactions. We introduce MedDialBench, a benchmark enabling controlled, dose-response characterization of how individual patient behavior dimensions affect LLM diagnostic robustness. It decomposes patient behavior into five dimensions – Logic Consistency, Health Cognition, Expression Style, Disclosure, and Attitude – each with graded severity levels and case-specific behavioral scripts. This controlled factorial design enables graded sensitivity analysis, dose-response profiling, and cross-dimension interaction detection. Evaluating five frontier LLMs across 7,225 dialogues (85 cases x 17 configurations x 5 models), we find a fundamental asymmetry: information pollution (fabricating symptoms) produces 1.7-3.4x larger accuracy drops than information deficit (withholding information), and fabricating is the only configuration achieving statistical significance across all five models (McNemar p < 0.05). Among six dimension combinations, fabricating is the sole driver of super-additive interaction: all three fabricating-involving pairs produce O/E ratios of 0.70-0.81 (35-44% of eligible cases fail under the combination despite succeeding under each dimension alone), while all non-fabricating pairs show purely additive effects (O/E ~ 1.0). Inquiry strategy moderates deficit but not pollution: exhaustive questioning recovers withheld information, but cannot compensate for fabricated inputs. Models exhibit distinct vulnerability profiles, with worst-case drops ranging from 38.8 to 54.1 percentage points.
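
One plausible way to operationalize the observed-over-expected (O/E) interaction test is against an additive-effects null: the expected accuracy of a behavior combination assumes the two individual accuracy drops simply add, and O/E < 1 signals a super-additive (worse-than-additive) interaction. This is a hedged sketch; the paper's exact eligibility criteria for cases are not reproduced here.

```python
def oe_ratio(acc_base, acc_a, acc_b, acc_ab):
    """Observed-over-expected accuracy under an additive-effects null.
    acc_base: accuracy with a cooperative patient; acc_a / acc_b:
    accuracy under each perturbation alone; acc_ab: accuracy under both.
    O/E < 1 indicates super-additive degradation."""
    drop_a = acc_base - acc_a
    drop_b = acc_base - acc_b
    expected = max(acc_base - (drop_a + drop_b), 0.0)
    return acc_ab / expected if expected > 0 else float("inf")
```

For example, two perturbations that each cost 10 points of accuracy predict a 20-point combined drop; observing a larger combined drop pushes O/E below 1, as with the fabricating-involving pairs reported above.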

[80] To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models

Ane G. Domingo-Aldama, Iker De La Iglesia, Maitane Urruela, Aitziber Atutxa, Ander Barrena

Main category: cs.CL

TL;DR: Clinical LLMs show minimal improvement over general-purpose models on English medical QA tasks, but domain adaptation works better for Spanish, with Marmoka models outperforming Llama.

Motivation: To investigate whether domain-adapted clinical LLMs actually outperform general-purpose LLMs on medical benchmarks, given inconsistent prior findings, and to develop robust evaluation methods for medical LLMs.

Method: Systematic comparison of general vs clinical LLMs on multiple choice clinical QA tasks in English and Spanish, using perturbation-based evaluation benchmark with one-step/two-step question transformations, multi-prompt testing, and instruction-guided assessment. Development of Marmoka family of lightweight 8B-parameter clinical LLMs via continual domain-adaptive pretraining.

Result: Clinical LLMs don’t consistently outperform general-purpose counterparts on English clinical tasks, even under perturbation-based evaluation. However, Marmoka models achieve better results than Llama for Spanish subsets. Both general and clinical models show limitations in instruction following and output formatting.

Conclusion: Current short-form MCQA benchmarks may be insufficient to capture genuine medical expertise, as clinical LLMs offer only marginal improvements in English. Robust medical LLMs can be successfully developed for low-resource languages like Spanish.

Abstract: BACKGROUND: Recent studies have shown that domain-adapted large language models (LLMs) do not consistently outperform general-purpose counterparts on standard medical benchmarks, raising questions about the need for specialized clinical adaptation. METHODS: We systematically compare general and clinical LLMs on a diverse set of multiple-choice clinical question answering tasks in English and Spanish. We introduce a perturbation-based evaluation benchmark that probes model robustness, instruction following, and sensitivity to adversarial variations. Our evaluation includes one-step and two-step question transformations, multi-prompt testing, and instruction-guided assessment. We analyze a range of state-of-the-art clinical models and their general-purpose counterparts, focusing on Llama 3.1-based models. Additionally, we introduce Marmoka, a family of lightweight 8B-parameter clinical LLMs for English and Spanish, developed via continual domain-adaptive pretraining on medical corpora and instructions. RESULTS: The experiments show that clinical LLMs do not consistently outperform their general-purpose counterparts on English clinical tasks, even under the proposed perturbation-based benchmark. However, for the Spanish subsets the proposed Marmoka models obtain better results compared to Llama. CONCLUSIONS: Our results show that, under current short-form MCQA benchmarks, clinical LLMs offer only marginal and unstable improvements over general-purpose models in English, suggesting that existing evaluation frameworks may be insufficient to capture genuine medical expertise. We further find that both general and clinical models exhibit substantial limitations in instruction following and strict output formatting. Finally, we demonstrate that robust medical LLMs can be successfully developed for low-resource languages such as Spanish, as evidenced by the Marmoka models.

[81] Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han

Main category: cs.CL

TL;DR: Training-free token merging method reduces computational costs in Large Speech Language Models by compressing redundant speech representations without losing semantic information.

Motivation: Large Speech Language Models operate at high token rates for acoustic fidelity, creating excessively long sequences with prohibitive inference costs. There's structured redundancy in deep layers that can be exploited for compression.

Method: Introduces Affinity Pooling, a training-free, similarity-based token merging mechanism. Uses layer-wise oracle interventions to identify redundancy hierarchy, then applies compression at both input and deep layers to reduce sequence length.

Result: Reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy across three tasks. Practical deployment shows ~1.7× memory savings and ~1.1× faster time-to-first-token on long utterances.

Conclusion: Challenges the necessity of fully distinct token representations in speech models, providing new perspectives on LSLM efficiency through strategic compression of redundant representations.

Abstract: Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs. In this paper, we empirically revisit the necessity of such granular token-level processing. Through layer-wise oracle interventions, we unveil a structured redundancy hierarchy: while shallow layers encode essential acoustic details, deep layers exhibit extreme redundancy, allowing for aggressive compression. Motivated by these findings, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. By strategically applying this method at both input and deep layers, we effectively compress speech representations without compromising semantic information. Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy. Practical deployment further confirms significant efficiency gains, yielding up to ~1.7x memory savings and ~1.1x faster time-to-first-token on long utterances. Our results challenge the necessity of fully distinct token representations, providing new perspectives on LSLM efficiency.
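
A minimal sketch of similarity-based token merging in this spirit: average runs of consecutive token representations whose cosine similarity to the running merged token exceeds a threshold, shortening the sequence without training. The function name and threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def affinity_pool(hidden, threshold=0.9):
    """Merge consecutive token representations by cosine affinity.
    hidden: (seq_len, dim) array. Returns a shorter (m, dim) array in
    which each row is the mean of one merged run of tokens."""
    pooled = [hidden[0].copy()]  # running sums of each merged run
    counts = [1]
    for vec in hidden[1:]:
        prev = pooled[-1] / counts[-1]  # mean of the current run
        sim = float(vec @ prev / (np.linalg.norm(vec) * np.linalg.norm(prev) + 1e-8))
        if sim >= threshold:
            pooled[-1] += vec  # fold the redundant token into the run
            counts[-1] += 1
        else:
            pooled.append(vec.copy())
            counts.append(1)
    return np.stack([s / c for s, c in zip(pooled, counts)])
```

Applied at the input and deep layers, this kind of pooling is what trims sequence length (and hence prefilling FLOPs) while near-duplicate representations carry the same semantics.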

[82] iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations

Wenshuo Wang, Boyu Cao, Nan Zhuang, Wei Li

Main category: cs.CL

TL;DR: iTAG generates text with accurate causal graph annotations by iteratively refining concept selection through Chain-of-Thought reasoning, achieving both high annotation accuracy and text naturalness.

Motivation: The lack of causally annotated text data for ground truth in causal discovery motivates the need for generating text with accurate causal graph annotations, as existing methods either sacrifice text naturalness (template-based) or annotation accuracy (LLM-dependent).

Method: iTAG performs real-world concept assignment to nodes before converting causal graphs into text, framing this as an inverse problem with iterative refinement through Chain-of-Thought reasoning to ensure concept relations align with target causal relationships.

Result: iTAG demonstrates extremely high annotation accuracy and text naturalness across extensive tests, and text-based causal discovery algorithms tested with iTAG-generated data show high statistical correlation with real-world data.

Conclusion: iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms, addressing the fundamental obstacle of lacking causally annotated text data.

Abstract: A fundamental obstacle to causal discovery from text is the lack of causally annotated text data for use as ground truth, due to high annotation costs. This motivates an important task of generating text with causal graph annotations. Early template-based generation methods sacrifice text naturalness in exchange for high causal graph annotation accuracy. Recent Large Language Model (LLM)-dependent methods directly generate natural text from target graphs through LLMs, but do not guarantee causal graph annotation accuracy. Therefore, we propose iTAG, which performs real-world concept assignment to nodes before converting causal graphs into text in existing LLM-dependent methods. iTAG frames this process as an inverse problem with the causal graph as the target, iteratively examining and refining concept selection through Chain-of-Thought (CoT) reasoning so that the induced relations between concepts are as consistent as possible with the target causal relationships described by the causal graph. iTAG demonstrates both extremely high annotation accuracy and naturalness across extensive tests, and the results of testing text-based causal discovery algorithms with the generated data show high statistical correlation with real-world data. This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.
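
The inverse-design loop can be sketched generically: the two callables below stand in for the CoT-prompted concept-assignment and relation-induction steps, and all names are hypothetical stand-ins rather than the paper's API.

```python
def itag_refine(assign_concepts, induce_relations, target_edges, max_iters=5):
    """Iteratively propose a concept assignment for the graph's nodes,
    check which relations the chosen concepts induce, and keep the
    assignment whose induced relations best match the target causal
    edges, stopping early on an exact match."""
    best, best_score = None, -1.0
    prev = None
    for _ in range(max_iters):
        concepts = assign_concepts(prev)      # e.g. a CoT-prompted LLM call
        induced = induce_relations(concepts)  # relations implied by the concepts
        score = len(induced & target_edges) / max(len(target_edges), 1)
        if score > best_score:
            best, best_score = concepts, score
        if score == 1.0:
            break
        prev = concepts  # feed the failed attempt back for refinement
    return best, best_score
```

The key inversion is that the causal graph is the fixed target and the concept assignment is the variable being searched, rather than the other way around.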

[83] Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus

Aidan Mannion, Cécile Macaire, Armand Violle, Stéphane Ohayon, Xavier Tannier, Didier Schwab, Lorraine Goeuriot, François Portet

Main category: cs.CL

TL;DR: Domain-adaptive pre-training (DAPT) for French biomedical LLMs shows limited efficacy but may work in resource-constrained scenarios, with model merging needed to mitigate generalization trade-offs.

Motivation: To address the challenge of adapting LLMs to specialized domains, particularly for non-English languages like French biomedical, and to investigate whether continued pre-training can effectively specialize models without degrading general capabilities.

Method: Collection and refinement of French biomedical corpus, exploration of causal language modeling approaches using DAPT, training specialized French biomedical LLMs, and conducting extensive comparative evaluations including model merging post-DAPT.

Result: DAPT showed limited efficacy compared to previous works, but may be viable in smaller-scale, resource-constrained scenarios under right conditions. Model merging post-DAPT is essential to mitigate generalization trade-offs and can sometimes improve specialized task performance.

Conclusion: DAPT for French biomedical LLMs has limited effectiveness but can work in specific constrained scenarios, with model merging being crucial to balance domain specialization and general capabilities.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet their adaptation to specialized fields remains challenging, particularly for non-English languages. This study investigates domain-adaptive pre-training (DAPT) as a strategy for specializing small to mid-sized LLMs in the French biomedical domain through continued pre-training. We address two key research questions: the viability of specialized continued pre-training for domain adaptation and the relationship between domain-specific performance gains and general capability degradation. Our contributions include the release of a fully open-licensed French biomedical corpus suitable for commercial and open-source applications, the training and release of specialized French biomedical LLMs, and novel insights for DAPT implementation. Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and conducting extensive comparative evaluations. Our results cast doubt on the efficacy of DAPT, in contrast to previous works, but we highlight its viability in smaller-scale, resource-constrained scenarios under the right conditions. Findings in this paper further suggest that model merging post-DAPT is essential to mitigate generalization trade-offs, and in some cases even improves performance on specialized tasks at which the DAPT was directed.
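
The paper does not spell out its merging operator in this summary; element-wise linear interpolation of weights ("model soup"-style averaging) is a common choice and serves as a minimal sketch of merging a base model with its DAPT counterpart:

```python
def merge_models(base, dapt, alpha=0.5):
    """Interpolate between base and domain-adapted weights: alpha=0
    recovers the general model, alpha=1 the fully adapted one.
    Parameters are dicts mapping tensor names to flat weight lists."""
    assert base.keys() == dapt.keys()
    return {name: [(1 - alpha) * b + alpha * d
                   for b, d in zip(base[name], dapt[name])]
            for name in base}
```

Tuning alpha is one simple way to trade domain-specific gains against the general-capability degradation the study measures.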

[84] The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era

Rudra Jadhav, Janhavi Danve

Main category: cs.CL

TL;DR: SAFI benchmark measures LLM automation feasibility across 263 text-based tasks covering 35 O*NET skills, revealing mathematics and programming as most automatable, with 78.7% of real-world AI use being augmentation rather than automation.

Motivation: To provide empirical data on which occupational skills are most susceptible to LLM automation, helping policymakers and workers understand AI's impact on the labor market.

Method: Created Skill Automation Feasibility Index (SAFI) by benchmarking four frontier LLMs across 263 text-based tasks spanning 35 O*NET skills, then cross-referenced with real-world AI adoption data from Anthropic Economic Index to create an AI Impact Matrix framework.

Result: Mathematics (73.2) and Programming (71.8) have highest automation feasibility; Active Listening (42.2) and Reading Comprehension (45.5) lowest; 78.7% of observed AI interactions are augmentation; models converge to similar skill profiles (3.6-point spread).

Conclusion: Text-based automation feasibility is more skill-dependent than model-dependent, with most AI use being augmentation rather than automation, and a “capability-demand inversion” where skills most needed in AI-exposed jobs are those LLMs perform worst at.

Abstract: As Large Language Models reshape the global labor market, policymakers and workers need empirical data on which occupational skills may be most susceptible to automation. We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs – LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash – across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor’s O*NET taxonomy (1,052 total model calls, 0% failure rate). Cross-referencing with real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks), we propose an AI Impact Matrix – an interpretive framework that positions skills along four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. Key findings: (1) Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest; (2) a “capability-demand inversion” where skills most demanded in AI-exposed jobs are those LLMs perform least well at in our benchmark; (3) 78.7% of observed AI interactions are augmentation, not automation; (4) all four models converge to similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. SAFI measures LLM performance on text-based representations of skills, not full occupational execution. All data, code, and model responses are open-sourced.

[85] Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

José Pombal, Ricardo Rei, André F. T. Martins

Main category: cs.CL

TL;DR: Study reveals self-preference bias in LLM-as-judge evaluation persists even in rubric-based settings, affecting both objective and subjective benchmarks, with ensemble methods providing partial mitigation.

Motivation: LLM-as-judge evaluation suffers from self-preference bias where judges favor outputs from their own model family, skewing evaluations and hindering model development, especially in recursive self-improvement settings.

Method: Analyzes self-preference bias in rubric-based evaluation using IFEval (objective rubrics) and HealthBench (subjective medical rubrics), examining factors like rubric type, length, and topic susceptibility.

Result: SPB persists even with objective rubrics - judges up to 50% more likely to incorrectly mark failed rubrics as satisfied for their own outputs. On HealthBench, SPB skews scores by up to 10 points. Negative rubrics, extreme lengths, and subjective topics like emergency referrals are particularly susceptible.

Conclusion: Self-preference bias is a significant problem in LLM evaluation that persists across paradigms, requiring careful consideration in benchmarking and model development, especially for frontier model ranking.

Abstract: LLM-as-a-judge has become the de facto approach for evaluating LLM outputs. However, judges are known to exhibit self-preference bias (SPB): they tend to favor outputs produced by themselves or by models from their own family. This skews evaluations and, thus, hinders model development, especially in settings of recursive self-improvement. We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings. Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own. We also find that, similarly to other evaluation paradigms, ensembling multiple judges helps mitigate SPB, but without fully eliminating it. On HealthBench, a medical chat benchmark with subjective rubrics, we observe that SPB skews model scores by up to 10 points, a potentially decisive margin when ranking frontier models. We analyze the factors that drive SPB in this setting, finding that negative rubrics, extreme rubric lengths, and subjective topics like emergency referrals are particularly susceptible.
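
The "up to 50% more likely" statistic suggests a simple measurement: restrict to rubrics the generator actually failed, then compare the false-"satisfied" rate when the judge scores its own outputs versus other models' outputs. A hedged sketch of that computation (data layout is an assumption):

```python
def self_preference_bias(verdicts):
    """Given binary judge verdicts restricted to rubrics the generator
    failed, compare the false-'satisfied' rate on the judge's own
    outputs vs. other models' outputs. verdicts: list of
    (is_own_output, judge_said_satisfied) pairs. Returns
    (own_rate, other_rate, relative_increase)."""
    own = [sat for mine, sat in verdicts if mine]
    other = [sat for mine, sat in verdicts if not mine]
    own_rate = sum(own) / len(own)
    other_rate = sum(other) / len(other)
    return own_rate, other_rate, own_rate / other_rate - 1.0
```

A relative increase of 0.5 corresponds to the paper's headline "up to 50% more likely to incorrectly mark them as satisfied."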

[86] ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

Yihao Wang, Zijian He, Jie Ren, Keze Wang

Main category: cs.CL

TL;DR: Time-aware retrieval benchmark and model for historical Chinese annals requiring temporal consistency in retrieval-augmented generation

Motivation: Historical research requires exact temporal records, not just topically relevant passages. Classical Chinese annals use terse, implicit reign phrases that need contextual interpretation, making semantic plausibility insufficient for temporal validity.

Method: Introduces ChunQiuTR benchmark from Spring and Autumn Annals with month-level reign keys and chrono-near confounders. Proposes CTD (Calendrical Temporal Dual-encoder) combining Fourier-based absolute calendrical context with relative offset biasing.

Result: CTD shows consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, demonstrating improved temporal consistency in retrieval.

Conclusion: Retrieval-time temporal consistency is crucial for faithful historical RAG, especially for temporally-sensitive historical documents with implicit time expressions.

Abstract: Retrieval shapes how language models access and ground knowledge in retrieval-augmented generation (RAG). In historical research, the target is often not an arbitrary relevant passage, but the exact record for a specific regnal month, where temporal consistency matters as much as topical relevance. This is especially challenging for Classical Chinese annals, where time is expressed through terse, implicit, non-Gregorian reign phrases that must be interpreted from surrounding context, so semantically plausible evidence can still be temporally invalid. We introduce ChunQiuTR, a time-keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition. ChunQiuTR organizes records by month-level reign keys and includes chrono-near confounders that mirror realistic retrieval failures. We further propose CTD (Calendrical Temporal Dual-encoder), a time-aware dual-encoder that combines Fourier-based absolute calendrical context with relative offset biasing. Experiments show consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, supporting retrieval-time temporal consistency as a key prerequisite for faithful downstream historical RAG. Our code and datasets are available at github.com/xbdxwyh/ChunQiuTR.
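
"Fourier-based absolute calendrical context" can be illustrated with sinusoidal-position-encoding-style features over an absolute month index: paired sin/cos at geometrically spaced frequencies. This is a hypothetical sketch in that spirit, not the paper's exact encoder.

```python
import math

def calendrical_encoding(month_index, dim=8):
    """Fourier-style features for an absolute month-level reign key.
    Each frequency pair lets the model compare months at a different
    temporal scale; nearby months get nearby encodings."""
    feats = []
    for i in range(dim // 2):
        freq = 1.0 / (10000.0 ** (2 * i / dim))
        feats.append(math.sin(month_index * freq))
        feats.append(math.cos(month_index * freq))
    return feats
```

Such absolute features would then be combined with the relative offset biasing mentioned above so that chrono-near confounders score lower than the exact regnal month.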

[87] Continuous Interpretive Steering for Scalar Diversity

Ye-eun Cho

Main category: cs.CL

TL;DR: CIS method probes graded pragmatic interpretation in LLMs using activation steering as continuous variable, with GraSD dataset encoding scalar diversity grades, showing LLMs encode graded sensitivity in representation space.

Motivation: Pragmatic inference is inherently graded with varying degrees of enrichment across lexical items (scalar diversity), but current LLM evaluations rely on prompt-based manipulations that don't capture this graded nature.

Method: Continuous Interpretive Steering (CIS) treats activation-level steering strength as continuous variable to probe graded pragmatic interpretation, using new GraSD dataset encoding graded scalar diversity across four LLMs.

Result: Uniform activation steering increases pragmatic interpretations globally but collapses item-level variation, while graded activation steering yields differentiated interpretive shifts aligned with scalar diversity grades, indicating graded sensitivity is encoded in representation space.

Conclusion: CIS and GraSD provide principled framework for evaluating graded pragmatic sensitivity in LLMs, showing graded sensitivity can be systematically recovered through controlled intervention in representation space.

Abstract: Pragmatic inference is inherently graded. Different lexical items give rise to pragmatic enrichment to different degrees. Scalar implicature exemplifies this property through scalar diversity, where implicature strength varies across scalar items. However, evaluations of pragmatic inference in large language models (LLMs) often rely on prompt-based manipulations. Beyond prompt-level effects, this study introduces Continuous Interpretive Steering (CIS), a method that probes graded pragmatic interpretation by treating activation-level steering strength as a continuous experimental variable. To support this analysis, this study introduces a new dataset, GraSD, which encodes graded scalar diversity. Experiments on four LLMs show that uniform activation steering increases pragmatic interpretations globally but collapses item-level variation, whereas graded activation steering yields differentiated interpretive shifts aligned with scalar diversity grades. It indicates that graded sensitivity is encoded in the representation space and can be systematically recovered through controlled intervention. Together, CIS and GraSD provide a principled framework for evaluating graded pragmatic sensitivity in LLMs.
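
Activation steering with a continuous strength reduces to a one-line intervention on a hidden state, swept over a range of strengths; the names below are illustrative, not from the paper:

```python
def steer(hidden, direction, strength):
    """Shift a hidden state along a fixed interpretation direction;
    the scalar strength is the continuous experimental variable."""
    return [h + strength * d for h, d in zip(hidden, direction)]

def sweep(hidden, direction, strengths):
    """Probe graded interpretation by sweeping the steering strength."""
    return [steer(hidden, direction, s) for s in strengths]
```

Uniform steering applies one strength to every scalar item; graded steering instead assigns each item a strength matched to its scalar-diversity grade, which is what recovers the item-level variation reported above.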

[88] DTCRS: Dynamic Tree Construction for Recursive Summarization

Guanran Luo, Zhongquan Jian, Wentao Qiu, Meihong Wang, Qingqiang Wu

Main category: cs.CL

TL;DR: DTCRS is a dynamic tree construction method for retrieval-augmented generation that reduces redundant summaries and improves question answering by analyzing question types and using sub-question embeddings as cluster centers.

Motivation: Traditional recursive summarization in RAG systems creates hierarchical summary trees with many redundant nodes, increasing construction time and potentially harming QA performance. Additionally, recursive summarization isn't suitable for all question types, necessitating a more adaptive approach.

Method: DTCRS dynamically generates summary trees based on document structure and query semantics. It first analyzes question type to determine if summarization is needed, then decomposes questions and uses sub-question embeddings as initial cluster centers to reduce redundancy and improve relevance.

Result: The approach significantly reduces summary tree construction time and achieves substantial improvements across three QA tasks. The research also provides insights into the applicability of recursive summarization to different question types.

Conclusion: DTCRS offers an efficient and adaptive approach to recursive summarization in RAG systems, reducing redundancy while improving question answering performance and providing guidance on when to use summarization techniques.

Abstract: Retrieval-Augmented Generation (RAG) mitigates the hallucination problem of Large Language Models (LLMs) by incorporating external knowledge. Recursive summarization constructs a hierarchical summary tree by clustering text chunks, integrating information from multiple parts of a document to provide evidence for abstractive questions involving multi-step reasoning. However, summary trees often contain a large number of redundant summary nodes, which not only increase construction time but may also negatively impact question answering. Moreover, recursive summarization is not suitable for all types of questions. We introduce DTCRS, a method that dynamically generates summary trees based on document structure and query semantics. DTCRS determines whether a summary tree is necessary by analyzing the question type. It then decomposes the question and uses the embeddings of sub-questions as initial cluster centers, reducing redundant summaries while improving the relevance between summaries and the question. Our approach significantly reduces summary tree construction time and achieves substantial improvements across three QA tasks. Additionally, we investigate the applicability of recursive summarization to different question types, providing valuable insights for future research.
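
The clustering step with sub-question embeddings as initial centers can be sketched as k-means seeded from the query: a minimal Lloyd's-iteration sketch, not the paper's implementation.

```python
import numpy as np

def cluster_with_subquestion_centers(chunk_embs, subq_embs, iters=5):
    """Seed k-means with sub-question embeddings as the initial cluster
    centers, so the clusters (and the summaries later built from them)
    stay anchored to the query. Returns per-chunk cluster assignments."""
    centers = subq_embs.astype(float).copy()
    for _ in range(iters):
        # distance of every chunk to every center, then nearest-center assignment
        dists = np.linalg.norm(chunk_embs[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(len(centers)):
            members = chunk_embs[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return assign
```

Seeding from the sub-questions rather than from random chunks is what ties the summary tree's shape to the query and suppresses clusters that would only produce redundant, question-irrelevant summaries.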

[89] Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico’s Nahuatl

Juan-José Guzman-Landa, Juan-Manuel Torres-Moreno, Graham Ranger, Miguel Figueroa-Saavedra, Martha-Lorena Avendaño-Garrido, Elvys Linhares-Pontes, Luis-Gil Moreno-Jiménez

Main category: cs.CL

TL;DR: Data duplication can improve NLP for low-resource languages by expanding limited corpora, as shown with Nawatl language embeddings.

Motivation: To address the lack of training data for low-resource languages (π-languages) in NLP, particularly for Nawatl which has limited text corpora available for training language models.

Method: Used incremental duplication technique to expand the π-yalli corpus containing limited Nawatl texts, trained static embeddings, and evaluated them on sentence-level semantic similarity tasks.

Result: Showed moderate performance improvement in semantic similarity tasks when using incremental duplication compared to using only the original non-expanded corpus.

Conclusion: Data duplication can be beneficial for NLP in low-resource language contexts, offering a practical approach to address data scarcity issues.

Abstract: In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? In such languages (or π-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we study the impact of corpora expansion in Nawatl, an agglutinative and polysynthetic π-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new π-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we use the incremental duplication technique, with the aim of learning embeddings that are well suited to NLP tasks. Static embeddings were thus trained and evaluated on a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.
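
One hypothetical reading of "incremental duplication" is controlled expansion where each round appends one more full copy of the original corpus; the function below is an illustrative sketch under that assumption, not the authors' procedure.

```python
def incremental_duplication(corpus, rounds):
    """Expand a small corpus in controlled increments: after r rounds
    the data is (r + 1) times the original size, letting each round's
    embeddings be trained and compared against the previous one."""
    expanded = list(corpus)
    for _ in range(rounds):
        expanded.extend(corpus)
    return expanded
```

Training embeddings at each expansion level is what allows the moderate similarity-task gains reported above to be attributed to the duplication itself.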

[90] MARS: Enabling Autoregressive Models Multi-Token Generation

Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun

Main category: cs.CL

TL;DR: MARS is a lightweight fine-tuning method that teaches autoregressive language models to predict multiple tokens per forward pass without architectural changes, achieving 1.5-1.7x throughput while maintaining accuracy.

Motivation: Current autoregressive models generate text one token at a time even when consecutive tokens are highly predictable, leading to inefficient inference. Existing solutions like speculative decoding require separate draft models or architectural modifications.

Method: MARS fine-tunes instruction-tuned AR models on existing instruction data to predict multiple tokens per forward pass. It uses a block-level KV caching strategy for batch inference and supports real-time speed adjustment via confidence thresholding.

Result: MARS matches or exceeds AR baseline on six benchmarks when generating one token per forward pass. With multiple tokens per step, it maintains baseline accuracy while achieving 1.5-1.7x throughput and up to 1.71x wall-clock speedup with KV cache.

Conclusion: MARS provides a practical solution for accelerating inference in autoregressive language models without architectural changes, offering a latency-quality tradeoff knob for deployment.

Abstract: Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
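
The real-time confidence knob can be sketched as prefix acceptance: from one forward pass that proposes several future tokens with confidences, accept the longest prefix whose per-token confidence clears the threshold, falling back to standard one-token AR decoding otherwise. Names and data layout are illustrative assumptions.

```python
def accept_tokens(proposals, threshold):
    """Accept the longest confident prefix of multi-token proposals.
    proposals: list of (token, confidence) from a single forward pass.
    Raising the threshold at serve time trades throughput for caution
    without swapping models or restarting."""
    accepted = []
    for token, conf in proposals:
        if conf < threshold:
            break
        accepted.append(token)
    return accepted or [proposals[0][0]]  # always emit at least one token
```

With threshold 1.0 this degenerates to ordinary one-token-per-pass AR decoding, which is why the fine-tuned model can still be called exactly like the original.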

[91] Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models

Md Motaleb Hossen Manik, Ge Wang

Main category: cs.CL

TL;DR: Empirical benchmark comparing dense vs MoE reasoning LLMs on accuracy and efficiency metrics under realistic inference constraints.

Motivation: To provide controlled empirical evaluation of whether MoE language models actually offer better quality-efficiency tradeoffs than dense models in practical inference scenarios, since theoretical advantages of sparse activation may not translate to real-world performance.

Method: Benchmarked 7 reasoning-oriented instruction-tuned models (both dense and MoE designs) on 4 reasoning benchmarks (ARC-Challenge, GSM8K, Math Level 1-3, TruthfulQA MC1) using 3 prompting strategies (zero-shot, chain-of-thought, few-shot chain-of-thought). Measured accuracy, latency, GPU memory usage, and FLOPs-per-token across 8,400 total evaluations.

Result: Gemma-4-E4B with few-shot chain-of-thought achieved the best overall weighted accuracy (0.675) with moderate memory usage (14.9 GB). Gemma models dominated ARC and Math tasks, while Phi models excelled on TruthfulQA. GSM8K showed the highest prompt sensitivity. Sparse activation alone did not guarantee the best practical operating point; tradeoffs depend jointly on architecture, prompting, and task composition.

Conclusion: Practical accuracy-efficiency tradeoffs for reasoning LLMs depend on complex interactions between model architecture, prompting strategies, and task requirements, not just sparse activation. MoE advantages don’t automatically translate to superior real-world performance under resource constraints.

Abstract: Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because only a subset of parameters is activated per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B, evaluated on four benchmarks – ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 – under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 total model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations (FLOPs)-per-token proxy. Across the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, reaching weighted accuracy 0.675 with mean VRAM 14.9 GB, while Gemma-4-26B-A4B was close in accuracy at 0.663 but substantially more memory intensive at 48.1 GB. At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. We release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.
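The weighted multi-task summary reported above can be computed as a size-weighted mean of per-task accuracies. The function name and the use of example counts as weights are assumptions here, since the paper's exact weighting scheme is not reproduced in this summary:

```python
def weighted_accuracy(task_results):
    """Size-weighted multi-task accuracy.

    task_results: list of (n_examples, accuracy) pairs, one per
    benchmark. Larger benchmarks contribute proportionally more to
    the summary score."""
    total = sum(n for n, _ in task_results)
    return sum(n * acc for n, acc in task_results) / total
```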

[92] Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, Leo Huang

Main category: cs.CL

TL;DR: SalesLLM is a bilingual benchmark for evaluating LLMs in sales dialogues, featuring realistic scenarios, automatic evaluation metrics, and a trained user model to improve simulation fidelity.

Motivation: Existing dialogue benchmarks don't adequately measure deal progression and outcomes in sales contexts, which require multi-turn, goal-directed persuasion under asymmetric incentives, a challenging setting for LLMs.

Method: Created SalesLLM benchmark with 30,074 scripted configurations and 1,805 curated multi-turn scenarios in Financial Services and Consumer Goods domains. Developed automatic evaluation pipeline with LLM-based rater for sales-process progress and fine-tuned BERT classifiers for buying intent. Trained CustomerLM user model using SFT and DPO on 8,000 crowdworker-involved conversations.

Result: SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). CustomerLM reduced role inversion from 17.44% (GPT-4o) to 8.8%. Experiments across 15 LLMs show substantial variability, with top-performing LLMs competitive with human-level performance.

Conclusion: SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents, addressing limitations of existing dialogue benchmarks in measuring real-world sales outcomes.

Abstract: Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress, and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performing LLMs are competitive with human-level performance, while less capable ones fall below it. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.

[93] IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text

Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja

Main category: cs.CL

TL;DR: IndoBERT-Sentiment: A context-conditioned sentiment classifier for Indonesian that uses topical context alongside text to improve sentiment analysis accuracy.

Motivation: Existing Indonesian sentiment analysis models classify text in isolation, ignoring topical context, which is crucial for determining sentiment polarity. Context-free approaches systematically misclassify texts that depend on topic for proper sentiment interpretation.

Method: Built on IndoBERT Large (335M parameters) and trained on 31,360 context-text pairs labeled across 188 topics. The model takes both topical context and text as input to produce sentiment predictions grounded in the topic being discussed.

Result: Achieves F1 macro of 0.856 and accuracy of 88.1%. Outperforms the best of three widely used general-purpose Indonesian sentiment models by 35.6 F1 points on the same test set.

Conclusion: Context-conditioning, previously demonstrated for relevancy classification, transfers effectively to sentiment analysis and enables correct classification of texts that are systematically misclassified by context-free approaches.

Abstract: Existing Indonesian sentiment analysis models classify text in isolation, ignoring the topical context that often determines whether a statement is positive, negative, or neutral. We introduce IndoBERT-Sentiment, a context-conditioned sentiment classifier that takes both a topical context and a text as input, producing sentiment predictions grounded in the topic being discussed. Built on IndoBERT Large (335M parameters) and trained on 31,360 context-text pairs labeled across 188 topics, the model achieves an F1 macro of 0.856 and accuracy of 88.1%. In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points. We show that context-conditioning, previously demonstrated for relevancy classification, transfers effectively to sentiment analysis and enables the model to correctly classify texts that are systematically misclassified by context-free approaches.
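Conditioning on topical context typically amounts to encoding a (context, text) sentence pair, as in standard BERT pair encoding. A minimal sketch of that input layout; the whitespace tokenization and function name are illustrative, not the paper's pipeline:

```python
def build_pair_input(context, text, cls="[CLS]", sep="[SEP]"):
    """Encode a (topical context, text) pair in BERT's sentence-pair
    format. Segment ids distinguish the context (0) from the text to
    classify (1), so the model can ground sentiment in the topic."""
    ctx_tokens = context.split()
    txt_tokens = text.split()
    tokens = [cls] + ctx_tokens + [sep] + txt_tokens + [sep]
    n_ctx = 2 + len(ctx_tokens)  # [CLS] + context + first [SEP]
    segment_ids = [0] * n_ctx + [1] * (len(tokens) - n_ctx)
    return tokens, segment_ids
```

In practice a subword tokenizer would replace `split()`, but the segment layout is the same.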

[94] SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)

Liang-Chih Yu, Jonas Becker, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Lung-Hao Lee, Ying-Lung Lin, Jin Wang, Jan Philip Wahle, Terry Ruas, Natalia Loukachevitch, Alexander Panchenko, Ilseyar Alimova, Lilian Wanzare, Nelson Odhiambo, Bela Gipp, Kai-Wei Chang, Saif M. Mohammad

Main category: cs.CL

TL;DR: SemEval-2026 shared task introduces dimensional aspect-based sentiment analysis (DimABSA) and dimensional stance analysis (DimStance) using valence-arousal dimensions instead of categorical labels, with applications to public-issue discourse.

Motivation: Traditional ABSA uses categorical polarity labels, which are limited in capturing nuanced sentiment. The paper aims to extend ABSA beyond consumer reviews to public-issue discourse by modeling sentiment along continuous valence-arousal dimensions for more nuanced analysis.

Method: Introduces two main tasks: DimABSA (with three subtasks: regression, triplet extraction, quadruplet extraction) and DimStance (treating stance targets as aspects with VA regression). Uses continuous F1 metric for joint evaluation of structured extraction and VA regression.

Result: Task attracted over 400 participants with 112 final submissions and 42 system description papers. Baseline results reported and top-performing systems analyzed to provide insights into dimensional sentiment analysis.

Conclusion: Dimensional modeling of sentiment and stance in VA space provides more nuanced analysis than categorical approaches, especially valuable for complex public-issue discourse. The shared task successfully established benchmarks and attracted significant community participation.

Abstract: We present the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which improves traditional ABSA by modeling sentiment along valence-arousal (VA) dimensions rather than using categorical polarity labels. To extend ABSA beyond consumer reviews to public-issue discourse (e.g., political, energy, and climate issues), we introduce an additional task, Dimensional Stance Analysis (DimStance), which treats stance targets as aspects and reformulates stance detection as regression in the VA space. The task consists of two tracks: Track A (DimABSA) and Track B (DimStance). Track A includes three subtasks: (1) dimensional aspect sentiment regression, (2) dimensional aspect sentiment triplet extraction, and (3) dimensional aspect sentiment quadruplet extraction, while Track B includes only the regression subtask for stance targets. We also introduce a continuous F1 (cF1) metric to jointly evaluate structured extraction and VA regression. The task attracted more than 400 participants, resulting in 112 final submissions and 42 system description papers. We report baseline results, discuss top-performing systems, and analyze key design choices to provide insights into dimensional sentiment analysis at the aspect and stance-target levels. All resources are available on our GitHub repository.
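The continuous F1 (cF1) metric jointly scores structured extraction and VA regression. The sketch below is one plausible form, purely illustrative: matched aspects earn partial credit that decays with valence-arousal error. The task paper's actual definition may differ; `continuous_f1`, the credit function, and the VA scale are all assumptions.

```python
def continuous_f1(gold, pred, scale=8.0):
    """Illustrative continuous-F1 for aspect-level VA regression.

    gold, pred: dicts mapping aspect string -> (valence, arousal).
    A matched aspect contributes credit in [0, 1] that shrinks with
    mean absolute VA error; precision/recall are credit over the
    predicted/gold counts, combined harmonically."""
    credit = 0.0
    for aspect, (gv, ga) in gold.items():
        if aspect in pred:
            pv, pa = pred[aspect]
            err = (abs(gv - pv) + abs(ga - pa)) / 2.0
            credit += max(0.0, 1.0 - err / scale)
    p = credit / len(pred) if pred else 0.0
    r = credit / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

A perfect extraction with exact VA predictions scores 1.0; a correct aspect with an off-target VA estimate scores strictly between 0 and 1, unlike a categorical F1.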

[95] Is Cross-Lingual Transfer in Bilingual Models Human-Like? A Study with Overlapping Word Forms in Dutch and English

Iza Škrjanec, Irene Elisabeth Winther, Vera Demberg, Stefan L. Frank

Main category: cs.CL

TL;DR: Bilingual language models show cross-lingual activation patterns similar to humans, but only when vocabulary sharing is properly configured: specifically, when only cognates (friends) share embeddings, not interlingual homographs (false friends).

Motivation: To investigate whether bilingual language models exhibit cross-lingual activation patterns similar to those of human bilinguals during reading, particularly for cognates (words with shared form and meaning) and interlingual homographs (words with shared form but different meanings).

Method: Trained Dutch-English causal Transformers under four vocabulary-sharing conditions that manipulate whether cognates and interlingual homographs receive shared or language-specific embeddings. Evaluated models using psycholinguistic stimuli from bilingual reading studies through surprisal and embedding similarity analyses.

Result: Models largely maintain language separation, with cross-lingual effects arising primarily when embeddings are shared; in those cases, both cognates and interlingual homographs show facilitation relative to controls. Effects are driven mainly by frequency rather than by consistency in form-meaning mapping. The qualitative patterns of human bilinguals are reproduced only when cognates alone share embeddings.

Conclusion: Bilingual language models capture some cross-linguistic activation effects, but their alignment with human processing critically depends on how lexical overlap is encoded, potentially limiting their explanatory adequacy as models of bilingual reading.

Abstract: Bilingual speakers show cross-lingual activation during reading, especially for words with shared surface form. Cognates (friends) typically lead to facilitation, whereas interlingual homographs (false friends) cause interference or no effect. We examine whether cross-lingual activation in bilingual language models mirrors these patterns. We train Dutch-English causal Transformers under four vocabulary-sharing conditions that manipulate whether (false) friends receive shared or language-specific embeddings. Using psycholinguistic stimuli from bilingual reading studies, we evaluate the models through surprisal and embedding similarity analyses. The models largely maintain language separation, and cross-lingual effects arise primarily when embeddings are shared. In these cases, both friends and false friends show facilitation relative to controls. Regression analyses reveal that these effects are mainly driven by frequency rather than consistency in form-meaning mapping. Only when just friends share embeddings are the qualitative patterns of bilinguals reproduced. Overall, bilingual language models capture some cross-linguistic activation effects. However, their alignment with human processing seems to critically depend on how lexical overlap is encoded, possibly limiting their explanatory adequacy as models of bilingual reading.
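The surprisal analyses above rest on a standard quantity: the negative log probability a causal model assigns to a word in context. A minimal sketch (computed in bits here; the base is a convention, and lower surprisal on cognates relative to matched controls would indicate cross-lingual facilitation):

```python
import math

def surprisal_bits(token_probs):
    """Per-token surprisal in bits: -log2 p(token | context).
    token_probs are the model's conditional probabilities for the
    observed tokens, e.g. read off a causal LM's softmax output."""
    return [-math.log2(p) for p in token_probs]
```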

[96] Multilingual Embedding Probes Fail to Generalize Across Learner Corpora

Laurits Lyngbaek, Ross Deans Kristensen-McLachlan

Main category: cs.CL

TL;DR: Multilingual embedding models don’t encode language-general proficiency representations; probes learn corpus-specific patterns rather than transferable proficiency dimensions.

Motivation: To investigate whether multilingual embedding models encode a language-general representation of proficiency that can be transferred across different learner text corpora, languages, and assessment methodologies.

Method: Trained linear and non-linear probes on hidden-state activations from Qwen3-Embedding models (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. Compared five probing architectures against a surface-level text feature baseline.

Result: Probes achieve strong in-distribution performance (QWK≈0.7), substantially outperforming surface baseline, with middle layers performing best. However, cross-corpus evaluation shows performance collapse across all probe types and model sizes. Residual analysis reveals probes converge to predicting uniformly distributed labels out-of-distribution.

Conclusion: Current multilingual embeddings do not straightforwardly encode language-general proficiency; learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than abstract, transferable proficiency dimensions.

Abstract: Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and non-linear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance ($QWK\approx0.7$), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. However, in cross-corpus evaluation performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge towards predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.
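The linear-probe setup can be sketched as ridge regression from hidden states to scalar proficiency scores. This is a generic probing sketch on synthetic data, not the paper's estimator; the regularization strength and function name are assumptions:

```python
import numpy as np

def fit_linear_probe(H, y, l2=1e-2):
    """Fit a ridge-regularized linear probe mapping hidden states H
    (n_samples x d) to scalar scores y via the normal equations."""
    d = H.shape[1]
    A = H.T @ H + l2 * np.eye(d)
    return np.linalg.solve(A, H.T @ y)

# Synthetic demo: one informative direction in a 16-d activation space.
# An in-distribution fit recovers it easily; the paper's point is that
# such fits collapse out of distribution, which this sketch cannot show.
rng = np.random.default_rng(0)
H = rng.normal(size=(200, 16))
y = H[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)
w = fit_linear_probe(H, y)
```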

[97] STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems

Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu, Chao Gao

Main category: cs.CL

TL;DR: STRIDE-ED is a strategy-grounded framework for empathetic dialogue that uses structured reasoning, strategy-aware data refinement, and two-stage training to improve emotional understanding and response generation.

Motivation: Existing empathetic dialogue systems lack comprehensive strategy frameworks, explicit multi-stage reasoning, and high-quality strategy-aware data, limiting their ability to model empathetic dialogue as a complex cognitive decision-making process.

Method: Proposes STRIDE-ED framework with: 1) Strategy-aware data refinement pipeline using LLM annotation, multi-model consistency evaluation, and dynamic sampling; 2) Two-stage training combining supervised fine-tuning with multi-objective reinforcement learning; 3) Structured, strategy-conditioned reasoning approach.

Result: STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.

Conclusion: The framework successfully addresses key limitations in empathetic dialogue systems by providing structured reasoning, high-quality strategy-aligned data, and effective training paradigms for improved emotional understanding and response generation.

Abstract: Empathetic dialogue requires not only recognizing a user’s emotional state but also making strategy-aware, context-sensitive decisions throughout response generation. However, the lack of a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data fundamentally limits existing approaches, preventing them from effectively modeling empathetic dialogue as a complex, multi-stage cognitive and decision-making process. To address these challenges, we propose STRIDE-ED, a STRategy-grounded, Interpretable, and DEep reasoning framework that models Empathetic Dialogue through structured, strategy-conditioned reasoning. To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with empathetic strategies. Furthermore, we adopt a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behaviors with target emotions, empathetic strategies, and response formats. Extensive experiments demonstrate that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.

[98] The Impact of Steering Large Language Models with Persona Vectors in Educational Applications

Yongchao Wu, Aron Henriksson

Main category: cs.CL

TL;DR: Activation steering for persona traits in LLMs negatively impacts educational short-answer generation quality (especially for open-ended ELA tasks) and causes predictable calibration shifts in automated scoring, with architecture and task type affecting sensitivity.

Motivation: While activation-based steering can personalize LLMs at inference time, its effects in educational settings remain unclear. The paper aims to systematically examine how persona steering affects both educational content generation (short answers) and automated scoring.

Method: Studied persona vectors for seven character traits in short-answer generation and automated scoring on the ASAP-SAS benchmark across three models spanning two architectures. Analyzed effects on different task types (ELA vs science, interpretive vs argumentative vs factual).

Result: Persona steering lowers answer quality overall, with much larger effects on open-ended ELA prompts (up to 11x more sensitive) than factual science prompts. For scoring, predictable valence-aligned calibration shifts: evil/impolite scorers grade more harshly, good/optimistic more leniently. ELA tasks are 2.5-3x more susceptible to scorer personalization than science tasks, and Mixture-of-Experts models show ~6x larger calibration shifts than dense models.

Conclusion: First systematic study of activation-steered persona traits in educational generation and scoring. Results highlight need for task-aware and architecture-aware calibration when deploying steered models in educational settings.

Abstract: Activation-based steering can personalize large language models at inference time, but its effects in educational settings remain unclear. We study persona vectors for seven character traits in short-answer generation and automated scoring on the ASAP-SAS benchmark across three models spanning two architectures. Persona steering lowers answer quality overall, with much larger effects on open-ended English Language Arts (ELA) prompts than on factual science prompts; interpretive and argumentative tasks are up to 11x more sensitive. On the scoring side, we observe predictable valence-aligned calibration shifts: evil and impolite scorers grade more harshly, while good and optimistic scorers grade more leniently. ELA tasks are 2.5-3x more susceptible to scorer personalization than science tasks, and the Mixture-of-Experts model shows roughly 6x larger calibration shifts than the dense models. To our knowledge, this is the first study to systematically examine the effects of activation-steered persona traits in educational generation and scoring, and the results highlight the need for task-aware and architecture-aware calibration when deploying steered models in educational settings.
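Activation steering of the kind studied here adds a scaled trait direction to a layer's hidden activations at inference time. A minimal sketch; normalizing the persona vector before scaling is an assumption, and real steering hooks into a specific transformer layer rather than a bare array:

```python
import numpy as np

def steer(hidden, persona_vec, alpha):
    """Shift hidden states along a unit persona direction.
    alpha > 0 pushes generations toward the trait (e.g. 'optimistic'),
    alpha < 0 pushes away; alpha = 0 recovers the unsteered model."""
    v = persona_vec / np.linalg.norm(persona_vec)
    return hidden + alpha * v
```

The paper's calibration findings correspond to how downstream scores drift as `alpha` moves the activations along valence-aligned directions.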

[99] Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering

Elyas Irankhah, Samah Fodeh

Main category: cs.CL

TL;DR: A system for medical QA tasks using ensemble models with voting strategies for patient-authored questions about hospitalization records.

Motivation: To develop an effective system for the ArchEHR-QA 2026 shared task that addresses patient-authored questions about hospitalization records through multiple subtasks including question reformulation, evidence identification, answer generation, and evidence-answer alignment.

Method: Uses a dual-model pipeline with Claude Sonnet 4 and GPT-4o for question reformulation (ST1), and Azure-hosted model ensembles (o3, GPT-5.2, GPT-5.1, DeepSeek-R1) with few-shot prompting and voting strategies for evidence identification, answer generation, and alignment tasks (ST2-ST4).

Result: Achieved best development set scores of 88.81 micro F1 on evidence-answer alignment (ST4), 65.72 macro F1 on evidence sentence identification (ST2), 34.01 on answer generation (ST3), and 33.05 on question reformulation (ST1). Model diversity and ensemble voting consistently improved performance over single-model baselines.

Conclusion: Ensemble approaches with diverse models and voting strategies are effective for medical QA tasks, though alignment accuracy is primarily limited by reasoning capabilities rather than retrieval or generation quality.

Abstract: We describe the Yale-DM-Lab system for the ArchEHR-QA 2026 shared task. The task studies patient-authored questions about hospitalization records and contains four subtasks (ST): clinician-interpreted question reformulation, evidence sentence identification, answer generation, and evidence-answer alignment. ST1 uses a dual-model pipeline with Claude Sonnet 4 and GPT-4o to reformulate patient questions into clinician-interpreted questions. ST2-ST4 rely on Azure-hosted model ensembles (o3, GPT-5.2, GPT-5.1, and DeepSeek-R1) combined with few-shot prompting and voting strategies. Our experiments show three main findings. First, model diversity and ensemble voting consistently improve performance compared to single-model baselines. Second, supplying the full clinician answer paragraph as additional prompt context supports evidence alignment. Third, results on the development set show that alignment accuracy is mainly limited by reasoning. The best scores on the development set reach 88.81 micro F1 on ST4, 65.72 macro F1 on ST2, 34.01 on ST3, and 33.05 on ST1.
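The ensemble voting used for ST2-ST4 can be sketched as majority voting over per-model labels. The tie-breaking rule here (falling back to the positive label to favor recall) is an assumption, not the system's documented behavior:

```python
from collections import Counter

def majority_vote(predictions, tie_label=1):
    """Combine per-model labels (e.g. 0/1 evidence decisions from
    o3, GPT-5.x, DeepSeek-R1) by majority vote; ties fall back to
    tie_label."""
    top = Counter(predictions).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return tie_label
    return top[0][0]
```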

[100] Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews

Ehsan Barkhordar, Abdulfattah Safa, Verena Blaschke, Erika Lombart, Marie-Catherine de Marneffe, Gözde Gül Şahin

Main category: cs.CL

TL;DR: First systematic study of language-of-study bias in NLP peer review, introducing LOBSTER dataset and detection method with 87.37 macro F1, finding non-English papers face substantially higher bias rates than English-only ones.

Motivation: Peer review in NLP suffers from language-of-study bias, where reviewers evaluate papers differently based on the languages studied rather than scientific merit. Despite being flagged in guidelines, such biases are poorly understood and not systematically studied as a distinct form of bias.

Method: Created LOBSTER (Language-Of-study Bias in ScienTific pEer Review) dataset with human annotations, developed detection method achieving 87.37 macro F1, analyzed 15,645 reviews to estimate bias rates, and identified four subcategories of negative bias.

Result: Non-English papers face substantially higher bias rates than English-only papers, with negative bias consistently outweighing positive bias. The most dominant form of negative bias is demanding unjustified cross-lingual generalization.

Conclusion: Language-of-study bias is a significant problem in NLP peer review that needs systematic attention. The study provides resources to support fairer reviewing practices and calls for more awareness and mitigation of this specific form of bias.

Abstract: Peer review plays a central role in the NLP publication process, but is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies, rather than its scientific merit. Despite being explicitly flagged in reviewing guidelines, such biases are poorly understood. Prior work treats such comments as part of broader categories of weak or unconstructive reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37 macro F1 for detection. We analyze 15,645 reviews to estimate how negative and positive biases differ with respect to the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias, and find that demanding unjustified cross-lingual generalization is the most dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.

[101] Language Bias under Conflicting Information in Multilingual LLMs

Robert Östling, Murathan Kurfalı

Main category: cs.CL

TL;DR: Multilingual LLMs show systematic language bias when processing conflicting information, preferring Chinese and disfavoring Russian across different model origins.

Motivation: To investigate whether LLMs exhibit language-based biases when integrating conflicting information presented in different languages, extending the “conflicting needles in a haystack” paradigm to multilingual settings.

Method: Extended conflicting needles paradigm to multilingual setting using naturalistic news domain data in five languages; evaluated range of multilingual LLMs of different sizes including GPT-5.2; tested models trained both inside and outside mainland China.

Result: All tested LLMs ignore conflicts and confidently assert only one possible answer in most cases; consistent bias against Russian and (for longest contexts) in favor of Chinese; patterns consistent across models from different origins but stronger in China-trained models.

Conclusion: Multilingual LLMs exhibit systematic language biases when processing conflicting information, with consistent patterns across models that persist regardless of training origin, highlighting important fairness and reliability concerns.

Abstract: Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. Both of these patterns are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category.

[102] Dynamic Context Evolution for Scalable Synthetic Data Generation

Ryan Lingo, Rajeev Chhajer

Main category: cs.CL

TL;DR: Dynamic Context Evolution (DCE) framework prevents cross-batch mode collapse in LLMs through verbalized tail sampling, semantic memory, and adaptive prompt evolution, maintaining output diversity across repeated prompting.

Motivation: Large language models suffer from cross-batch mode collapse: the progressive loss of output diversity when prompted repeatedly without access to prior generations. Current solutions rely on ad hoc deduplication and seed rotation without a principled framework.

Method: DCE combines three mechanisms: 1) Verbalized tail sampling where models label ideas with “obviousness” scores and discard obvious ones, 2) Semantic memory with persistent embedding index to reject near-duplicates across batches, and 3) Adaptive prompt evolution that reconstructs prompts each batch using memory state and rotating diversity strategies.

Result: DCE achieves 0.0% collapse vs 5.6% for naive prompting, produces 17-18 HDBSCAN clusters per seed vs naive’s volatile 2-17, validated with independent embedding model. Components individually insufficient but jointly effective at ~$0.50 per 1,000 candidates.

Conclusion: DCE provides a principled, low-cost framework to prevent cross-batch mode collapse in LLMs without fine-tuning or custom architectures, maintaining output diversity across repeated prompting through self-assessment, memory, and adaptive prompting.

Abstract: Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (the model labels each idea with a guess about how obvious it is, and obvious ideas are discarded), which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains (sustainable packaging concepts, educational exam questions, and creative writing prompts) and two model families (gpt-5-mini and claude-haiku-4-5), a component ablation across 2-3 random seeds per method shows that DCE achieves 0.0 +/- 0.0% collapse versus 5.6 +/- 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive’s volatile 2-17, indicating reliably richer conceptual structure. These results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold tau and dedup threshold delta. Deduplication and prompt evolution are individually insufficient but jointly effective, at approximately $0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.

[103] Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery

Jia Yu, Weiwei Yu, Pengfei Xiao, Fukun Xing

Main category: cs.CL

TL;DR: LLM agents automate corpus linguistics by generating hypotheses, querying corpora, interpreting results, and refining analyses through structured tool-use interfaces, with all findings anchored in verifiable corpus evidence.

DetailsMotivation: Traditional corpus linguistics requires specialized technical skills and significant time from human researchers. The paper aims to automate the investigative cycle using LLM agents to lower technical barriers and accelerate research while maintaining empirical grounding.

Method: Proposes Agent-Driven Corpus Linguistics where an LLM agent connects to a corpus query engine via Model Context Protocol (MCP). The agent autonomously generates hypotheses, queries corpora (CQP-indexed Gutenberg corpus, 5M tokens), interprets results, and refines analyses across multiple rounds, with human oversight.
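The investigative cycle can be sketched as a hypothesize-query-interpret loop; the two callables below stand in for the LLM agent and the MCP-exposed CQP engine, and are illustrative stubs rather than the paper's interfaces:

```python
def investigate(propose, corpus_query, topic, max_rounds=3):
    """Minimal hypothesize-query-refine loop. `propose` stands in for the
    LLM agent, `corpus_query` for the MCP-exposed corpus engine; every
    hypothesis is paired with the corpus evidence that grounds it."""
    state = {"topic": topic, "evidence": []}
    for _ in range(max_rounds):
        hypothesis, query = propose(state)   # agent picks the next query
        hits = corpus_query(query)           # verifiable corpus evidence
        state["evidence"].append((hypothesis, query, hits))
    return state["evidence"]

# Toy stubs: the "agent" cycles through intensifiers, the "corpus" counts hits.
counts = {"very": 120, "really": 45, "so": 80}
propose = lambda s: (f"frequency hypothesis #{len(s['evidence'])}",
                     list(counts)[len(s["evidence"])])
evidence = investigate(propose, lambda q: counts[q], "English intensifiers")
assert [hits for _, _, hits in evidence] == [120, 45, 80]
```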

Result: The agent successfully identified diachronic relay chains in English intensifiers, semantic change pathways, and register-sensitive distributions. Controlled experiments showed corpus grounding provides quantification and falsifiability beyond LLM training data. External validation replicated two published studies on CLMET corpus (40M tokens) with close quantitative agreement.

Conclusion: Agent-driven corpus research can produce empirically grounded findings at machine speed while lowering technical barriers, offering a complementary approach to traditional corpus linguistics that maintains verifiable evidence anchoring.

Abstract: Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results - a process demanding specialized technical skills and considerable time. We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining analysis across multiple rounds. The human researcher sets direction and evaluates final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus-based/corpus-driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We demonstrate the framework by linking an LLM agent to a CQP-indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only “investigate English intensifiers,” the agent identified a diachronic relay chain (so+ADJ > very > really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. A controlled baseline experiment shows that corpus grounding contributes quantification and falsifiability that the model cannot produce from training data alone. To test external validity, the agent replicated two published studies on the CLMET corpus (40 million tokens) - Claridge (2025) and De Smet (2013) - with close quantitative agreement. Agent-driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.

[104] LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics

Kosmas Pinitas, Ilias Maglogiannis

Main category: cs.CL

TL;DR: A novel framework using Language Models as semantic context conditioners over interpretable handcrafted affect descriptors (facial geometry and acoustic features) to model Valence and Arousal changes, achieving improved accuracy while maintaining transparency.

DetailsMotivation: Current deep neural embeddings for affect prediction lack interpretability and limit expert-driven refinement. There's a need for transparent models that preserve feature interpretability while leveraging modern AI capabilities for affective computing in unconstrained environments.

Method: Uses interpretable facial geometry and acoustic features transformed into symbolic natural-language descriptions. A pretrained LM processes these descriptions to generate semantic context embeddings that act as high-level priors over affective dynamics, creating a transparent pipeline unlike end-to-end black-box approaches.
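The feature-to-language step can be sketched as a rule table that verbalizes handcrafted descriptors before they reach the LM. The feature names, thresholds, and phrasings below are hypothetical, not taken from the paper:

```python
# Hypothetical descriptor-to-text mapping, sketching how interpretable
# features could be turned into symbolic natural-language descriptions.
def verbalize_features(features: dict[str, float]) -> str:
    rules = [
        ("brow_lowering", 0.5, "the brows are markedly lowered, suggesting negative valence"),
        ("lip_corner_pull", 0.5, "the lip corners are pulled upward, suggesting positive valence"),
        ("f0_variability", 0.5, "pitch varies widely, suggesting elevated arousal"),
    ]
    clauses = [text for name, thresh, text in rules if features.get(name, 0.0) > thresh]
    if not clauses:
        return "No salient affective cues are present."
    return "; ".join(clauses).capitalize() + "."

print(verbalize_features({"brow_lowering": 0.8, "f0_variability": 0.7}))
```

The resulting description is what the pretrained LM would embed into a semantic context vector, so the pipeline stays inspectable at the text level.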

Result: Experimental evaluation on Aff-Wild2 and SEWA datasets shows consistent improvements in accuracy for both Valence and Arousal prediction compared to handcrafted-only and deep-embedding baselines.

Conclusion: Semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and computationally efficient alternative to fully end-to-end architectures for human-centered AI applications.

Abstract: Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI. While deep neural embeddings dominate contemporary approaches, they often lack interpretability and limit expert-driven refinement. We propose a novel framework that uses Language Models (LMs) as semantic context conditioners over handcrafted affect descriptors to model changes in Valence and Arousal. Our approach begins with interpretable facial geometry and acoustic features derived from structured domain knowledge. These features are transformed into symbolic natural-language descriptions encoding their affective implications. A pretrained LM processes these descriptions to generate semantic context embeddings that act as high-level priors over affective dynamics. Unlike end-to-end black-box pipelines, our framework preserves feature transparency while leveraging the contextual abstraction capabilities of LMs. We evaluate the proposed method on the Aff-Wild2 and SEWA datasets for affect change prediction. Experimental results show consistent improvements in accuracy for both Valence and Arousal compared to handcrafted-only and deep-embedding baselines. Our findings demonstrate that semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and computationally efficient alternative to fully end-to-end architectures.

[105] Efficient Learned Data Compression via Dual-Stream Feature Decoupling

Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu, Gang Wang, Wentong Cai

Main category: cs.CL

TL;DR: FADE proposes a dual-stream architecture for learned data compression that decouples local and global contexts into parallel streams, replacing deep serial processing with shallow parallel processing to improve throughput and reduce latency while maintaining compression performance.

DetailsMotivation: Current learned data compression methods face challenges balancing precise probability modeling with system efficiency. Single-stream architectures struggle to capture both micro-syntactic and macro-semantic features simultaneously, requiring deep serial stacking that increases latency. Heterogeneous systems are limited by device speed mismatches and Amdahl's Law constraints due to serial processing.

Method: 1) Dual-Stream Multi-Scale Decoupler that disentangles local and global contexts into parallel streams, 2) Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling, 3) Concurrent Stream-Parallel Pipeline that achieves full-pipeline parallelism to overcome systemic bottlenecks.
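The decoupling idea in (1) can be illustrated with a toy split of a feature sequence into a fine-grained local stream and a downsampled global stream that shallow branches could then process in parallel. The window and stride values are illustrative, not the paper's:

```python
import numpy as np

def dual_stream_decouple(x: np.ndarray, window: int = 3, stride: int = 4):
    """Toy sketch of dual-stream decoupling: a local stream captures
    micro-syntactic context, a global stream captures macro-semantic
    context, replacing one deep serial stack with two shallow branches."""
    # Local stream: sliding-window mean over the raw sequence.
    pad = np.pad(x, (window // 2, window // 2), mode="edge")
    local = np.array([pad[i:i + window].mean() for i in range(len(x))])
    # Global stream: strided subsampling of the sequence.
    global_ = x[::stride]
    return local, global_

x = np.arange(8, dtype=float)
local, global_ = dual_stream_decouple(x)
assert local.shape == (8,) and global_.shape == (2,)
```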

Result: The method achieves state-of-the-art performance in both compression ratio and throughput, while maintaining the lowest latency and memory usage compared to existing approaches.

Conclusion: FADE demonstrates that parallel stream architectures can effectively replace deep serial processing in learned data compression, achieving superior efficiency without sacrificing compression performance, with potential applications in real-time compression systems.

Abstract: While Learned Data Compression (LDC) has achieved superior compression ratios, balancing precise probability modeling with system efficiency remains challenging. Crucially, uniform single-stream architectures struggle to simultaneously capture micro-syntactic and macro-semantic features, necessitating deep serial stacking that exacerbates latency. Compounding this, heterogeneous systems are constrained by device speed mismatches, where throughput is capped by Amdahl’s Law due to serial processing. To this end, we propose a Dual-Stream Multi-Scale Decoupler that disentangles local and global contexts to replace deep serial processing with shallow parallel streams, and incorporate a Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling. Furthermore, we design a Concurrent Stream-Parallel Pipeline, which overcomes systemic bottlenecks to achieve full-pipeline parallelism. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both compression ratio and throughput, while maintaining the lowest latency and memory usage. The code is available at https://github.com/huidong-ma/FADE.

[106] Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent

Bingxuan Li, Simo Du, Yue Guo

Main category: cs.CL

TL;DR: SEA is a self-learning diagnostic agent with dual-memory module that improves clinical reasoning by accumulating and reusing diagnostic patterns through reinforcement learning, achieving state-of-the-art performance on medical reasoning benchmarks.

DetailsMotivation: Current LLM-based diagnostic agents treat cases independently, limiting experience reuse and continual adaptation. Clinical expertise improves through accumulating reusable diagnostic patterns, which existing approaches fail to capture.

Method: Proposes SEA with cognitively inspired dual-memory module and reinforcement training framework for joint optimization of reasoning and memory management. The agent learns to transform experience into reusable knowledge through self-learning.
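The dual-memory idea can be sketched as an episodic store of past cases plus a consolidated store of reusable rules; the overlap-based retrieval below is an illustrative stand-in for whatever learned retrieval the agent actually uses:

```python
class DualMemory:
    """Sketch of a cognitively inspired dual memory: raw case experiences
    (episodic) plus consolidated diagnostic rules (semantic)."""

    def __init__(self):
        self.episodic: list[dict] = []   # raw (findings, diagnosis) experiences
        self.semantic: list[str] = []    # consolidated, reusable rules

    def record_case(self, findings: set[str], diagnosis: str):
        self.episodic.append({"findings": findings, "diagnosis": diagnosis})

    def consolidate(self, rule: str):
        if rule not in self.semantic:
            self.semantic.append(rule)

    def recall(self, findings: set[str], k: int = 2) -> list[dict]:
        """Most similar past cases by overlap of findings."""
        return sorted(self.episodic,
                      key=lambda c: -len(c["findings"] & findings))[:k]

mem = DualMemory()
mem.record_case({"fever", "cough"}, "flu")
mem.record_case({"rash"}, "measles")
mem.consolidate("fever + cough suggests viral infection")
assert mem.recall({"fever", "headache"}, k=1)[0]["diagnosis"] == "flu"
```

In SEA, both what to write into these stores and how to use recalled items during reasoning are jointly optimized by the reinforcement training framework.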

Result: Achieves 92.46% accuracy on MedCaseReasoning dataset (outperforming strongest baseline by +19.6%) and best final accuracy (0.7214) with largest improvement (+0.35 Acc@100) on ER-Reason dataset. Expert evaluation confirms clinical correctness and usefulness of induced rules.

Conclusion: SEA effectively improves both diagnostic reasoning ability and continual learning by transforming experience into reusable knowledge through its dual-memory architecture and reinforcement learning framework.

Abstract: Clinical expertise improves not only by acquiring medical knowledge, but by accumulating experience that yields reusable diagnostic patterns. Recent LLM-based diagnostic agents have shown promising progress in clinical reasoning for decision support. However, most approaches treat cases independently, limiting experience reuse and continual adaptation. We propose SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module. We design a reinforcement training framework tailored to our agent for joint optimization of reasoning and memory management. We evaluate SEA in two complementary settings. On standard evaluation with the MedCaseReasoning dataset, SEA achieves 92.46% accuracy, outperforming the strongest baseline by +19.6%, demonstrating the benefit of jointly optimizing reasoning and memory. In the long-horizon setting with the ER-Reason dataset, SEA attains the best final accuracy (0.7214) and the largest improvement (+0.35 Acc@100), while baseline methods show limited or unstable gains. Expert evaluation further indicates that rules consolidated by SEA show strong clinical correctness, usefulness and trust, suggesting that the induced rules in the dual-memory module are reliable and practically meaningful. Overall, SEA improves both diagnostic reasoning ability and continual learning by effectively transforming experience into reusable knowledge.

[107] ClickGuard: A Trustworthy Adaptive Fusion Framework for Clickbait Detection

Chhavi Dhiman, Naman Chawla, Riya Dhami, Gaurav Kumar, Ganesh Naik

Main category: cs.CL

TL;DR: ClickGuard: A trustworthy adaptive fusion framework for clickbait detection using BERT embeddings and structural features with syntactic-semantic adaptive fusion, achieving 96.93% accuracy.

DetailsMotivation: Clickbait headlines mislead users and threaten online credibility, creating need for effective detection methods to ensure trustworthy digital content.

Method: Combines BERT embeddings and structural features using Syntactic-Semantic Adaptive Fusion Block (SSAFB) for dynamic integration, with hybrid CNN-BiLSTM to capture patterns and dependencies. Uses LIME and Permutation Feature Importance for interpretability.
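The adaptive fusion step can be sketched as a learned gate that mixes the semantic and structural streams per dimension. The shapes and random weights below are illustrative, not the paper's trained SSAFB parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_fusion(semantic: np.ndarray, structural: np.ndarray,
                    W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Gated fusion of a semantic (BERT-like) vector and a structural
    feature vector: a gate computed from both streams decides, per
    dimension, how much of each to keep."""
    gate = sigmoid(W @ np.concatenate([semantic, structural]) + b)
    return gate * semantic + (1.0 - gate) * structural

rng = np.random.default_rng(0)
d = 4
sem, struct = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=(d, 2 * d)), np.zeros(d)
fused = adaptive_fusion(sem, struct, W, b)
assert fused.shape == (d,)
```

Because the gate lies in (0, 1), each fused coordinate is a convex combination of the two streams, which keeps the contribution of each feature set inspectable.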

Result: Achieved 96.93% testing accuracy, outperforming state-of-the-art approaches. Demonstrated robust performance across diverse datasets with good interpretability and trustworthiness.

Conclusion: Provides scalable, reliable solution for enhancing online content credibility by addressing syntactic-semantic modeling challenges in clickbait detection.

Abstract: The widespread use of clickbait headlines, crafted to mislead and maximize engagement, poses a significant challenge to online credibility. These headlines employ sensationalism, misleading claims, and vague language, underscoring the need for effective detection to ensure trustworthy digital content. The paper introduces ClickGuard, a trustworthy adaptive fusion framework for clickbait detection. It combines BERT embeddings and structural features using a Syntactic-Semantic Adaptive Fusion Block (SSAFB) for dynamic integration. The framework incorporates a hybrid CNN-BiLSTM to capture patterns and dependencies. The model achieved 96.93% testing accuracy, outperforming state-of-the-art approaches. The model’s trustworthiness is evaluated using LIME and Permutation Feature Importance (PFI) for interpretability and perturbation analysis. These methods assess the model’s robustness and sensitivity to feature changes by measuring the average prediction variation. Ablation studies validated the SSAFB’s effectiveness in optimizing feature fusion. The model demonstrated robust performance across diverse datasets, providing a scalable, reliable solution for enhancing online content credibility by addressing syntactic-semantic modelling challenges. Code of the work is available at: https://github.com/palindromeRice/ClickBait_Detection_Architecture

[108] A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik

Main category: cs.CL

TL;DR: Systematic evaluation of retrieval-augmented generation for medical QA shows significant performance improvements, with best configuration achieving 60.49% accuracy on MedQA USMLE benchmark using dense retrieval with query reformulation and reranking.

DetailsMotivation: LLMs have knowledge gaps and limited factual grounding in medical QA. RAG addresses this by integrating external knowledge, but the impact of individual retrieval components on performance is insufficiently understood.

Method: Systematic evaluation using MedQA USMLE benchmark with structured textbook-based knowledge corpus. Analyzed interaction between language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking across 40 configurations.
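The best-performing pipeline (query reformulation, then dense retrieval, then cross-encoder reranking) can be sketched end to end; all callables below are toy stand-ins for the real models, not the paper's components:

```python
import numpy as np

def dense_retrieve(query_emb, doc_embs, k=3):
    """Cosine-similarity top-k over a dense index (toy stand-in for an
    embedding model over the textbook corpus)."""
    q = query_emb / np.linalg.norm(query_emb)
    D = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = D @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def rag_answer(question, embed, reformulate, rerank, docs, doc_embs):
    """Pipeline: query reformulation -> dense retrieval -> cross-encoder
    reranking -> prompt assembly for the generator."""
    query = reformulate(question)
    top, _ = dense_retrieve(embed(query), doc_embs, k=3)
    reranked = sorted(top, key=lambda i: -rerank(query, docs[i]))
    context = "\n".join(docs[i] for i in reranked)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Swapping `reformulate` or `rerank` for an identity function recovers the simpler dense-retrieval-only configurations the study compares against.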

Result: Retrieval augmentation significantly improves zero-shot medical QA performance. Best configuration: dense retrieval with query reformulation and reranking achieved 60.49% accuracy. Domain-specialized models better utilize retrieved evidence. Tradeoff between retrieval effectiveness and computational cost identified.

Conclusion: Systematic evaluation of RAG for medical QA is feasible with modest computational resources. Dense retrieval with query reformulation and reranking provides strong performance, with domain-specialized models showing better evidence utilization.

Abstract: Large language models (LLMs) have demonstrated strong capabilities in medical question answering; however, purely parametric models often suffer from knowledge gaps and limited factual grounding. Retrieval-augmented generation (RAG) addresses this limitation by integrating external knowledge retrieval into the reasoning process. Despite increasing interest in RAG-based medical systems, the impact of individual retrieval components on performance remains insufficiently understood. This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus. We analyze the interaction between language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking within a unified experimental framework comprising forty configurations. Results show that retrieval augmentation significantly improves zero-shot medical question answering performance. The best-performing configuration, dense retrieval with query reformulation and reranking, achieved 60.49% accuracy. Domain-specialized language models were also found to better utilize retrieved medical evidence than general-purpose models. The analysis further reveals a clear tradeoff between retrieval effectiveness and computational cost, with simpler dense retrieval configurations providing strong performance while maintaining higher throughput. All experiments were conducted on a single consumer-grade GPU, demonstrating that systematic evaluation of retrieval-augmented medical QA systems can be performed under modest computational resources.

[109] Why teaching resists automation in an AI-inundated era: Human judgment, non-modular work, and the limits of delegation

Songhee Han

Main category: cs.CL

TL;DR: Paper argues that teaching cannot be fully automated by AI because it requires human judgment, relational skills, and contextual interpretation that resist modularization.

DetailsMotivation: To challenge claims that AI can automate teaching by showing that teaching is inherently interpretive, relational, and grounded in professional judgment that cannot be fully specified or modeled.

Method: Drawing on literature review and empirical studies of large language models and retrieval-augmented generation systems to analyze the limitations of AI in automating teaching.

Result: AI can support bounded functions and improve access to information, but cannot replace human judgment, relational accountability, and contextual interpretation required for effective teaching.

Conclusion: Teaching remains a form of professional work that resists automation due to its reliance on emergent understanding of human cognition, behavior, motivation, and social interaction.

Abstract: Debates about artificial intelligence (AI) in education often portray teaching as a modular and procedural job that can increasingly be automated or delegated to technology. This brief communication paper argues that such claims depend on treating teaching as more separable than it is in practice. Drawing on recent literature and empirical studies of large language models and retrieval-augmented generation systems, I argue that although AI can support some bounded functions, instructional work remains difficult to automate in meaningful ways because it is inherently interpretive, relational, and grounded in professional judgment. More fundamentally, teaching and learning are shaped by human cognition, behavior, motivation, and social interaction in ways that cannot be fully specified, predicted, or exhaustively modeled. Tasks that may appear separable in principle derive their instructional value in practice from ongoing contextual interpretation across learners, situations, and relationships. As long as educational practice relies on emergent understanding of human cognition and learning, teaching remains a form of professional work that resists automation. AI may improve access to information and support selected instructional activities, but it does not remove the need for human judgment and relational accountability that effective teaching requires.

[110] OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

Jianhui Liu, Haoze Sun, Wenbo Li, Yanbing Zhang, Rui Yang, Zhiliang Zhu, Yijun Yang, Shenghe Zheng, Nan Jiang, Jiaxiu Jiang, Haoyang Huang, Tien-Tsin Wong, Nan Duan, Xiaojuan Qi

Main category: cs.CL

TL;DR: OpenSpatial is an open-source data engine for generating high-quality spatial reasoning data using 3D bounding boxes across five foundational tasks, accompanied by a 3M-sample dataset that boosts model performance on spatial reasoning benchmarks.

DetailsMotivation: Current research lacks a principled, open-source engine for generating high-quality spatial data, which is crucial for developing human-level spatial intelligence. Existing approaches focus on domain-specific data production rather than scalable, general-purpose solutions.

Method: OpenSpatial uses 3D bounding boxes as fundamental primitives to construct a comprehensive data hierarchy across five spatial reasoning tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). The system is designed for scalability, task diversity, and efficiency.
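Using 3D boxes as the primitive means relations and measurements fall out of simple geometry. A minimal sketch, assuming an axis-aligned `(cx, cy, cz, w, h, d)` box schema that is illustrative rather than the engine's actual format:

```python
import math

def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two axis-aligned 3-D
    boxes, each given as (cx, cy, cz, w, h, d)."""
    return math.dist(box_a[:3], box_b[:3])

def is_left_of(box_a, box_b):
    """Camera-frame 'left of' relation from box centers
    (x assumed to increase rightward)."""
    return box_a[0] < box_b[0]

chair = (0.0, 0.0, 2.0, 0.5, 1.0, 0.5)
table = (1.0, 0.0, 2.0, 1.2, 0.8, 0.8)
assert is_left_of(chair, table)
assert center_distance(chair, table) == 1.0
```

Predicates like these would feed the Spatial Measurement and Spatial Relationship task templates; the richer tasks (camera perception, multi-view consistency) build further machinery on the same primitive.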

Result: The authors created OpenSpatial-3M, a dataset of 3 million high-fidelity samples. Models trained on this dataset achieve state-of-the-art performance across spatial reasoning benchmarks, with the best model showing a 19% average relative improvement. The paper also provides systematic analysis of how data attributes influence spatial perception.

Conclusion: OpenSpatial provides a robust foundation for accelerating spatial intelligence research by offering both an open-source data generation engine and a large-scale dataset. The work demonstrates that high-quality, diverse spatial data significantly improves model performance on spatial reasoning tasks.

Abstract: Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific data production, leaving a critical void: the absence of a principled, open-source engine capable of fully unleashing the potential of high-quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial – an open-source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial-3M, a large-scale dataset comprising 3 million high-fidelity samples. Extensive evaluations demonstrate that versatile models trained on our dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best-performing model exhibits a substantial average relative improvement of 19 percent. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open-sourcing both the engine and the 3M-scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.

[111] Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

Jackson Petty, Jaulie Goe, Tal Linzen

Main category: cs.CL

TL;DR: LLMs struggle with formal grammar-based translation tasks, showing performance degradation with larger grammars, longer sentences, and morphological/script differences between languages.

DetailsMotivation: To understand LLMs' ability to translate low-resource languages using in-context grammatical descriptions (like textbooks/dictionaries) by isolating this skill through formal grammar-based string transduction tasks.

Method: Construct synchronous context-free grammars defining pairs of formal languages modeling natural language aspects, then test LLMs’ translation accuracy when given both grammar and source sentence, varying grammar size, sentence length, syntactic/morphological properties, and written script.
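A synchronous CFG pairs each source rewrite with a co-indexed target rewrite, so one derivation yields both sentences. The toy grammar below (flipping verb-object order, as in an SVO-to-SOV language pair) is illustrative, not one of the paper's grammars:

```python
# Each rule rewrites a nonterminal into a (source RHS, target RHS) pair;
# "NT:i" tokens are co-indexed so the same subderivation appears on both sides.
RULES = {
    "S":  [(["NP:0", "VP:1"], ["NP:0", "VP:1"])],
    "VP": [(["V:0", "NP:1"], ["NP:1", "V:0"])],   # verb-object order flips
    "NP": [(["dog"], ["hund"]), (["cat"], ["katze"])],
    "V":  [(["sees"], ["sieht"])],
}

def transduce(symbol, choices, pos=0):
    """Expand `symbol` using rule indices from `choices` (consumed left to
    right); return (source_words, target_words, next_pos)."""
    src_rhs, tgt_rhs = RULES[symbol][choices[pos]]
    pos += 1
    subs, src_out = {}, []
    for tok in src_rhs:
        if ":" in tok:
            nt, idx = tok.split(":")
            s, t, pos = transduce(nt, choices, pos)
            subs[idx] = t
            src_out += s
        else:
            src_out.append(tok)
    tgt_out = []
    for tok in tgt_rhs:
        tgt_out += subs[tok.split(":")[1]] if ":" in tok else [tok]
    return src_out, tgt_out, pos

src, tgt, _ = transduce("S", [0, 0, 0, 0, 1])
assert " ".join(src) == "dog sees cat"
assert " ".join(tgt) == "hund katze sieht"
```

In the evaluation, the model is shown the rules and the source string and must produce the target string; grammar size and recursion depth then become the knobs the paper varies.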

Result: Three key findings: 1) Translation accuracy decreases with grammar size and sentence length, 2) Morphological and script differences strongly diminish performance, 3) Error analysis shows models prone to wrong word recall, hallucination, and untranslated source words.

Conclusion: LLMs have limited ability to leverage formal grammatical descriptions for translation, especially with complex grammars or significant language differences, highlighting challenges for low-resource language translation via in-context learning.

Abstract: Low-resource languages pose a challenge for machine translation with large language models (LLMs), which require large amounts of training data. One potential way to circumvent this data dependence is to rely on LLMs’ ability to use in-context descriptions of languages, like textbooks and dictionaries. To do so, LLMs must be able to infer the link between the languages’ grammatical descriptions and the sentences in question. Here we isolate this skill using a formal analogue of the task: string transduction based on a formal grammar provided in-context. We construct synchronous context-free grammars which define pairs of formal languages designed to model particular aspects of natural language grammar, morphology, and written representation. Using these grammars, we measure how well LLMs can translate sentences from one formal language into another when given both the grammar and the source-language sentence. We vary the size of the grammar, the lengths of the sentences, the syntactic and morphological properties of the languages, and their written script. We note three key findings. First, LLMs’ translation accuracy decreases markedly as a function of grammar size and sentence length. Second, differences in morphology and written representation between the source and target languages can strongly diminish model performance. Third, we examine the types of errors committed by models and find they are most prone to recall the wrong words from the target language vocabulary, hallucinate new words, or leave source-language words untranslated.

[112] Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao

Main category: cs.CL

TL;DR: Personalized RewardBench is a new benchmark for evaluating how well reward models capture individual user preferences, showing existing models struggle with personalization despite general quality.

DetailsMotivation: Current reward model benchmarks focus on general response quality but lack evaluation of personalized preference modeling, which is crucial for pluralistic alignment in LLMs.

Method: Constructed chosen/rejected response pairs based on strict adherence/violation of user-specific rubrics, ensuring preference distinctions are uniquely tailored to individuals. Human evaluations confirmed personal preference as primary discriminative factor.
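The core metric over such rubric-based pairs is pairwise accuracy. A minimal sketch, where the reward model and the pairs are toy stand-ins:

```python
def pairwise_accuracy(reward_fn, pairs):
    """Fraction of (prompt, chosen, rejected) triples where the reward
    model scores the rubric-adherent response above the rejected one.
    `reward_fn(prompt, response) -> float` stands in for a real RM."""
    correct = sum(
        reward_fn(p, chosen) > reward_fn(p, rejected)
        for p, chosen, rejected in pairs
    )
    return correct / len(pairs)

# Toy RM that rewards longer answers; toy pairs for illustration only.
toy_rm = lambda prompt, resp: len(resp)
pairs = [("q1", "a detailed answer", "short"), ("q2", "ok", "a longer reply")]
print(pairwise_accuracy(toy_rm, pairs))  # 0.5
```

The 75.94% ceiling reported for state-of-the-art RMs is exactly this quantity over the benchmark's personalized pairs.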

Result: State-of-the-art reward models peak at only 75.94% accuracy on personalization. The benchmark shows higher correlation with downstream performance in Best-of-N sampling and PPO compared to existing baselines.

Conclusion: Personalized RewardBench provides a robust proxy for evaluating reward models’ performance in downstream applications, addressing the gap in personalized preference assessment.

Abstract: Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models’ capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model’s performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models’ performance in downstream applications.

[113] A Multilingual Dataset and Empirical Validation for the Mutual Reinforcement Effect in Information Extraction

Chengguang Gan, Sunbowen Lee, Qingyu Yin, Xinyang He, Hanjun Wei, Yunhao Liang, Younghun Lim, Shijian Wang, Hexiang Huang, Qinghao Zhang, Shiwen Ni, Tatsunori Mori

Main category: cs.CL

TL;DR: The paper introduces Multilingual MRE Mix (MMM) dataset to validate Mutual Reinforcement Effect across languages, showing 76% of datasets exhibit MRE in English, Japanese, and Chinese.

DetailsMotivation: Prior work reported Mutual Reinforcement Effect (MRE) in Japanese where word-level and sentence-level tasks mutually improve each other, but its generality across languages and task settings hasn't been empirically validated due to lack of multilingual MRE datasets.

Method: Introduces MMM dataset covering 21 sub-datasets in English, Japanese, and Chinese. Uses LLM-assisted dataset translation and alignment framework to reduce manual annotation. Adopts unified input-output framework to train open-domain information extraction model, with full fine-tuning ablations and knowledgeable verbalizers based on MRE-mix data.
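
MRE-style joint modeling casts the word-level and sentence-level tasks as a single sequence-to-sequence instance. A hypothetical instance in such a unified input-output format (the layout below is an illustrative assumption; the paper defines its own schema):

```python
# One training instance: the output string jointly carries the
# sentence-level label and the word-level extractions, so a single
# generative model learns both tasks at once.
instance = {
    "input": "Apple released the new iPhone in California. [tasks: sentiment, NER]",
    "output": "sentence: neutral | words: Apple (ORG); iPhone (PRODUCT); California (LOC)",
}
print(instance["output"])
```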

Result: 76% of MMM sub-datasets consistently exhibit the Mutual Reinforcement Effect across languages, providing systematic empirical validation of MRE in multilingual settings and demonstrating its practical value for information extraction.

Conclusion: The study successfully validates MRE across multiple languages, showing the phenomenon is generalizable beyond Japanese and has practical applications for information extraction systems.

Abstract: The Mutual Reinforcement Effect (MRE) describes a phenomenon in information extraction where word-level and sentence-level tasks can mutually improve each other when jointly modeled. While prior work has reported MRE in Japanese, its generality across languages and task settings has not been empirically validated, largely due to the lack of multilingual MRE datasets. To address this limitation, we introduce the Multilingual MRE Mix dataset (MMM), which consists of 21 sub-datasets covering English, Japanese, and Chinese. We propose an LLM-assisted dataset translation and alignment framework that significantly reduces manual annotation effort while preserving the structural requirements of MRE tasks. Building on MMM, we adopt a unified input-output framework to train an open-domain information extraction model and conduct extensive empirical studies, including full fine-tuning ablations and the construction of knowledgeable verbalizers based on MRE-mix data. Experimental results show that 76 percent of the MMM sub-datasets consistently exhibit the Mutual Reinforcement Effect across languages. These findings provide systematic empirical validation of MRE in multilingual settings and demonstrate its practical value for information extraction.

[114] Reinforcement Learning for LLM Post-Training: A Survey

Zhichao Wang, Kiran Ramnath, Bin Bi, Shiva Kumar Pentyala, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur, Na Cheng

Main category: cs.CL

TL;DR: A comprehensive survey paper that systematically compares reinforcement learning-based post-training methods for aligning large language models, including RLHF, RLVR, and DPO approaches under a unified policy gradient framework.

DetailsMotivation: Despite the rapid development of RL-based post-training methods for LLM alignment (RLHF, RLVR, DPO, etc.), there's no systematic, technically detailed comparison of these methods under a unified analytical framework. The paper aims to fill this gap by providing a comprehensive reference for researchers and practitioners.

Method: The survey provides: (1) self-contained foundations of RL and LLM post-training concepts, (2) a unified policy gradient framework that decomposes methods along axes of prompt sampling, response sampling, and gradient coefficient, covering PPO/GRPO-based RLHF, RLVR, and offline DPO-based RLHF, and (3) standardized notation across reviewed papers for direct technical comparison.
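
Decompositions of this kind are commonly written as a single policy-gradient template; a schematic form, with illustrative symbols rather than the survey's exact notation:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}\;
    \mathbb{E}_{y \sim \pi_{\mathrm{sample}}(\cdot \mid x)}
    \left[ GC(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
```

The three axes then correspond to: which distribution $\mathcal{D}$ the prompts come from (prompt sampling), whether $\pi_{\mathrm{sample}}$ is the current policy or an offline dataset (response sampling), and the form of the gradient coefficient $GC$ (e.g., a clipped advantage for PPO, a group-relative advantage for GRPO, an implicit preference weight for DPO).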

Result: The paper serves as a comprehensive, technically grounded reference that enables systematic comparison of various RL-based post-training methods for LLM alignment, providing a unified analytical lens for understanding different approaches.

Conclusion: This survey fills an important gap in the literature by offering a systematic comparison of RL-based post-training methods for LLM alignment, providing researchers and practitioners with a unified framework and standardized notation to understand and compare different approaches in this rapidly evolving field.

Abstract: Through pretraining and supervised fine-tuning (SFT), large language models (LLMs) acquire strong instruction-following capabilities, yet they can still produce harmful or misaligned outputs. A growing body of reinforcement learning (RL)-based post-training methods has been proposed to address this, including Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches built on Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), and others. Despite rapid progress, no existing work offers a systematic, technically detailed comparison of these methods under a single analytical lens. Our survey aims to fill this gap. We make three key contributions: (1) a self-contained RL and LLM post-training foundations treatment covering all necessary concepts alongside their key applications; (2) a unified policy gradient framework unifying PPO and GRPO-based RLHF, RLVR, and offline DPO-based RLHF, decomposing methods along the axes of prompt sampling, response sampling, and gradient coefficient, with an extended treatment of on-policy RLHF and iterative DPO methods as well as the richer design space of offline DPO-based methods; and (3) standardized notation across all reviewed papers enabling direct technical comparison. Our goal is to serve as a comprehensive, technically grounded reference for researchers and practitioners working on LLM post-training.

[115] SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication

Nguyen Le Hoang, Tadahiro Taniguchi, Fang Tianwei, Akira Taniguchi

Main category: cs.CL

TL;DR: SSNG proposes a self-supervised learning approach for emergent communication that replaces sampling-based updates with representation alignment between agents, achieving better performance on image datasets.

DetailsMotivation: Existing emergent communication frameworks like MHNG suffer from sample inefficiency in high-dimensional perceptual spaces due to sampling-based updates with high rejection rates. There's a need for more efficient feedback-free communication learning in multi-agent systems.

Method: SSNG uses a symmetric self-supervised representation alignment objective between autonomous agents, formulated through variational inference. Discrete symbolic messages are learned via Gumbel-Softmax relaxation for end-to-end gradient-based optimization while preserving discrete communication nature.
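
The Gumbel-Softmax step that keeps SSNG's messages discrete yet differentiable can be sketched in NumPy. This is the standard estimator, not the authors' code; in practice a straight-through variant is often layered on top:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed sample over discrete symbols: softmax((logits + Gumbel noise) / tau).

    As tau -> 0 the output approaches a one-hot sample; larger tau
    gives smoother outputs and lower-variance gradients.
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=logits.shape)
    g = -np.log(-np.log(u))   # Gumbel(0, 1) noise via inverse CDF
    z = (logits + g) / tau
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # speaker's unnormalized message scores
msg = gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0))
print(msg.round(3))
```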

Result: Experiments on CIFAR-10 and ImageNet-100 show SSNG’s emergent messages achieve substantially higher linear-probe classification accuracy than referential games, reconstruction games, and MHNG.

Conclusion: Self-supervised representation alignment provides an effective mechanism for feedback-free emergent communication in multi-agent systems, overcoming limitations of sampling-based approaches.

Abstract: Emergent Communication (EmCom) investigates how agents develop symbolic communication through interaction without predefined language. Recent frameworks, such as the Metropolis–Hastings Naming Game (MHNG), formulate EmCom as the learning of shared external representations negotiated through interaction under joint attention, without explicit success or reward feedback. However, MHNG relies on sampling-based updates that suffer from high rejection rates in high-dimensional perceptual spaces, making the learning process sample-inefficient for complex visual datasets. In this work, we propose the SimSiam Naming Game (SSNG), a feedback-free EmCom framework that replaces sampling-based updates with a symmetric, self-supervised representation alignment objective between autonomous agents. Building on a variational inference–based probabilistic interpretation of self-supervised learning, SSNG formulates symbol emergence as an alignment process between agents’ latent representations mediated by message exchange. To enable end-to-end gradient-based optimization, discrete symbolic messages are learned via a Gumbel–Softmax relaxation, preserving the discrete nature of communication while maintaining differentiability. Experiments on CIFAR-10 and ImageNet-100 show that the emergent messages learned by SSNG achieve substantially higher linear-probe classification accuracy than those produced by referential games, reconstruction games, and MHNG. These results indicate that self-supervised representation alignment provides an effective mechanism for feedback-free EmCom in multi-agent systems.

[116] LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2502.17421 returned HTTP 429 (rate limited).

[117] Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges

Xiaoxiao Liu, Qingying Xiao, Bingquan Zhang, Junying Chen, Xiangyi Feng, Ziniu Li, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2503.08292 returned HTTP 429 (rate limited).

[118] One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Zaid Khan, Archiki Prasad, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2510.12088 returned HTTP 429 (rate limited).

[119] LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2506.18841 returned HTTP 429 (rate limited).

[120] Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

Pankayaraj Pathmanathan, Furong Huang

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2507.06419 returned HTTP 429 (rate limited).

[121] Soft Head Selection for Injecting ICL-Derived Task Embeddings

Jungwon Park, Jimyeong Kim, Changin Choi, Wonjong Rhee

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2507.20906 returned HTTP 429 (rate limited).

[122] SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?

Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Raymond Ng, Jann Railey Montalan, Thura Aung, Jian Gang Ngui, Yosephine Susanto, William Chandra Tjhi, Panuthep Tasawong, Erik Cambria, Ekapol Chuangsuwanich, Sarana Nutanong

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2508.12243 returned HTTP 429 (rate limited).

[123] LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization

Junsong Li, Jie Zhou, Bihao Zhan, Yutao Yang, Qianjun Pan, Shilian Chen, Tianyu Huai, Xin Li, Qin Chen, Liang He

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2509.17183 returned HTTP 429 (rate limited).

[124] FBS: Modeling Native Parallel Reading inside a Transformer

Tongxi Wang

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2601.21708 returned HTTP 429 (rate limited).

[125] Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Elvira Perez Vallejos, Nuria Oliver

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2509.24857 returned HTTP 429 (rate limited).

[126] The Rise of AfricaNLP: A Survey of Contributions, Contributors, Community Impact, and Bibliometric Analysis

Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Abrham Belete Haile, Grigori Sidorov, Eusebio Ricardez Vazquez, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, Vukosi Marivate, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2509.25477 returned HTTP 429 (rate limited).

[127] Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models

Yizhou Peng, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni, Bin Ma

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2510.13293 returned HTTP 429 (rate limited).

[128] Telling Speculative Stories to Help Humans Imagine the Harms of Healthcare AI

Xingmeng Zhao, Tongnian Wang, Dan Schumacher, Veronica Rammouz, Anthony Rios

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2510.14718 returned HTTP 429 (rate limited).

[129] Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge

Yoshinari Fujinuma

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2510.18196 returned HTTP 429 (rate limited).

[130] How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2511.03295 returned HTTP 429 (rate limited).

[131] Rectifying LLM Thought from Lens of Optimization

Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2512.01925 returned HTTP 429 (rate limited).

[132] ADOPT: Adaptive Dependency-Guided Joint Prompt Optimization for Multi-Step LLM Pipelines

Minjun Zhao, Xinyu Zhang, Shuai Zhang, Deyang Li, Ruifeng Shi

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2512.24933 returned HTTP 429 (rate limited).

[133] Improved Evidence Extraction and Metrics for Document Inconsistency Detection with LLMs

Nelvin Tan, Yaowen Zhang, James Asikin Cheung, Fusheng Liu, Yu-Ching Shih, Dong Yang

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2601.02627 returned HTTP 429 (rate limited).

[134] Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2601.02956 returned HTTP 429 (rate limited).

[135] Is Sentiment Banana-Shaped? Exploring the Geometry and Portability of Sentiment Concept Vectors

Laurits Lyngbaek, Pascale Feldkamp, Yuri Bizzoni, Kristoffer L. Nielbo, Kenneth Enevoldsen

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2601.07995 returned HTTP 429 (rate limited).

[136] PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg, Niran Kundapur, Heng Ji

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2601.11957 returned HTTP 429 (rate limited).

[137] Computer Environments Elicit General Agentic Intelligence in LLMs

Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2601.16206 returned HTTP 429 (rate limited).

[138] PACIFIC: Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs

Tianyu Zhao, Siqi Li, Yasser Shoukry, Salma Elmalaki

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2602.07181 returned HTTP 429 (rate limited).

[139] ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

Bo Xu, Haotian Wu, Hehai Lin, Weiquan Huang, Beier Zhu, Yao Shu, Chengwei Qin

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2603.02945 returned HTTP 429 (rate limited).

[140] What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network

Taksch Dube, Jianfeng Zhu, NhatHai Phan, Ruoming Jin

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2603.07880 returned HTTP 429 (rate limited).

[141] PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2603.10477 returned HTTP 429 (rate limited).

[142] Mitigating Hallucination on Hallucination in RAG via Ensemble Voting

Zequn Xie, Zhengyang Sun

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2603.27253 returned HTTP 429 (rate limited).

[143] A Taxonomy of Programming Languages for Code Generation

Nishat Raihan, Christian Newman, Marcos Zampieri

Main category: cs.CL

TL;DR: Summary unavailable: the arXiv API request for 2604.00239 returned HTTP 429 (rate limited).

[144] JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai, Chang Li, Changjian Jiang, Changkai Lu, Chao Xue, Chaocai Liang, Cheng Zhang, Dongkai Liu, Fei Wang, Guoqiang Huang, Haijian Ke, Han Lin, Hao Wang, Ji Miao, Jiacheng Zhang, Jialong Shi, Jifeng Zhu, Jingjing Qian, Junhui Luo, Junwu Xiong, Lam So, Liang Huang, Ming Ke, Mingyang Li, Panfeng Shi, Peng Hao, Qi Wang, Qian Lai, Qiaoqiao Yuan, Qingyu Yin, Qiong Cao, Qixiang Wang, Rongcheng Bian, Rongduo Han, Shaoqiang Zheng, Shi Hu, Shi Suo, Shijie Ren, Shijin Zhang, Shiying Fan, Shuai Xie, Tianyi Zhang, Wei Liu, Wentao Tan, Xianghan Meng, Xiaodong He, Xing Pan, Xiran Wang, Xuyang Peng, Ya Zhang, Yang Liu, Yangyang Duan, Yanxu Chen, Yicheng Gong, Yidan Huang, Yifei Liu, Yinhao Bai, Yongqiang Liu, Yuesong Zhang, Yuqi Zhang, Zerui Xie, Zhenfang Wang, Zhennan Shen, Zheyuan Liu, Zhuwei Zeng

Main category: cs.CL

Summary unavailable: the arXiv API request for 2604.03044 returned HTTP 429 (rate limited).

[145] StoryScope: Investigating idiosyncrasies in AI fiction

Jenna Russell, Rishanth Rajendhran, Mohit Iyyer, John Wieting

Main category: cs.CL

Summary unavailable: the arXiv API request for 2604.03136 returned HTTP 429 (rate limited).

[146] Compressible Softmax-Attended Language under Incompressible Attention

Wonsuk Lee

Main category: cs.CL

Summary unavailable: the arXiv API request for 2604.04384 returned HTTP 429 (rate limited).

[147] VisCoder2: Building Multi-Language Visualization Coding Agents

Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu, Jiaqi Deng, Kai Zou, Ping Nie, Fei Yuan, Xiang Yue, Wenhu Chen

Main category: cs.CL

Summary unavailable: the arXiv API request for 2510.23642 returned HTTP 429 (rate limited).

[148] GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

Zhichao Wang

Main category: cs.CL

Summary unavailable: the arXiv API request for 2510.23868 returned HTTP 429 (rate limited).

[149] Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents

Hanlin Cai, Houtianfu Wang, Haofan Dong, Kai Li, Sai Zou, Ozgur B. Akan

Main category: cs.CL

Summary unavailable: the arXiv API request for 2511.07176 returned HTTP 429 (rate limited).

[150] Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research

Paul Tschisgale, Peter Wulff

Main category: cs.CL

Summary unavailable: the arXiv API request for 2602.15889 returned HTTP 429 (rate limited).

[151] What Makes an Ideal Quote? Recommending “Unexpected yet Rational” Quotations via Novelty

Bowei Zhang, Jin Xiao, Guanglei Yue, Qianyu He, Yanghua Xiao, Deqing Yang, Jiaqing Liang

Main category: cs.CL

Summary unavailable: the arXiv API request for 2602.22220 returned HTTP 429 (rate limited).

[152] Self-Distilled RLVR

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan

Main category: cs.CL

Summary unavailable: the arXiv API request for 2604.03128 returned HTTP 429 (rate limited).

[153] Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

Chan-Wei Hu, Zhengzhong Tu

Main category: cs.CL

Summary unavailable: the arXiv API request for 2604.05268 returned HTTP 429 (rate limited).

[154] Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset

Tinko Sebastian Bartels, Ruixiang Wu, Xinyu Lu, Yikai Lu, Fanzeng Xia, Haoxiang Yang, Yue Chen, Tongxin Li

Main category: cs.CL

Summary unavailable: the arXiv API request for 2604.05429 returned HTTP 429 (rate limited).

[155] JUÁ – A Benchmark for Information Retrieval in Brazilian Legal Text Collections

Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio

Main category: cs.CL

Summary unavailable: the arXiv API request for 2604.06098 returned HTTP 429 (rate limited).

cs.CV

[156] DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images

Gautham Vinod, Siddeshwar Raghavan, Bruce Coburn, Fengqing Zhu

Main category: cs.CV

TL;DR: Vision-language framework for food-item-level nutritional analysis using paired before-and-after eating images with natural language prompts for localization and weight estimation

Motivation: Current image-based dietary assessment methods are limited to single pre-consumption images, provide only coarse meal-level estimates, and require restrictive inputs like depth sensing or multi-view imagery. There's a need for more accurate methods that can determine actual consumption at the food-item level.

Method: Proposes a vision-language framework using paired before-and-after eating images. Uses natural language prompts to localize specific food items and estimate weight directly from single RGB images. Employs a two-stage training strategy to predict weight differences between paired images for consumption estimation.

Result: Evaluated on three publicly available datasets and demonstrated consistent improvements over existing approaches, establishing a strong baseline for before-and-after dietary image analysis.

Conclusion: The proposed framework provides a simpler yet effective approach for food-item-level nutritional analysis without requiring restrictive inputs like depth sensing or explicit segmentation, enabling more accurate dietary assessment.

Abstract: Accurate dietary assessment is critical for precision nutrition, yet most image-based methods rely on a single pre-consumption image and provide only coarse, meal-level estimates. These approaches cannot determine what was actually consumed and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation. In this paper, we propose a simple vision-language framework for food-item-level nutritional analysis using paired before-and-after eating images. Instead of relying on rigid segmentation masks, our method leverages natural language prompts to localize specific food items and estimate their weight directly from a single RGB image. We further estimate food consumption by predicting weight differences between paired images using a two-stage training strategy. We evaluate our method on three publicly available datasets and demonstrate consistent improvements over existing approaches, establishing a strong baseline for before-and-after dietary image analysis.

[157] CraterBench-R: Instance-Level Crater Retrieval for Planetary Scale

Jichao Fang, Lei Zhang, Michael Phillips, Wei Luo

Main category: cs.CV

TL;DR: Crater analysis reformulated as instance-level image retrieval with CraterBench-R benchmark; self-supervised ViTs with in-domain pretraining perform best; novel instance-token aggregation method improves efficiency while maintaining accuracy.

Motivation: Current deep learning approaches treat craters as detection problems, but scientific workflows like catalog deduplication and cross-observation matching are inherently retrieval tasks. There's a need to bridge this gap with proper retrieval-focused benchmarks and methods.

Method: 1) Formulate crater analysis as instance-level image retrieval; 2) Create CraterBench-R benchmark with 25k crater identities, multi-scale gallery views, and verified queries; 3) Evaluate various architectures; 4) Propose instance-token aggregation method for efficient storage; 5) Develop two-stage pipeline with single-vector shortlisting and instance-token reranking.

Result: Self-supervised ViTs with in-domain pretraining outperform generic models. Instance-token aggregation (K=16) improves mAP by 17.9 points over raw token selection. Two-stage pipeline recovers 89-94% of full late-interaction accuracy while searching only small candidate sets.

Conclusion: Crater analysis should be approached as a retrieval problem; self-supervised ViTs excel at this task; instance-token aggregation enables efficient planetary-scale crater matching; the benchmark enables future research in planetary image retrieval.

Abstract: Impact craters are a cornerstone of planetary surface analysis. However, while most deep learning pipelines treat craters solely as a detection problem, critical scientific workflows such as catalog deduplication, cross-observation matching, and morphological analog discovery are inherently retrieval tasks. To address this, we formulate crater analysis as an instance-level image retrieval problem and introduce CraterBench-R, a curated benchmark featuring about 25,000 crater identities with multi-scale gallery views and manually verified queries spanning diverse scales and contexts. Our baseline evaluations across various architectures reveal that self-supervised Vision Transformers (ViTs), particularly those with in-domain pretraining, dominate the task, outperforming generic models with significantly more parameters. Furthermore, we demonstrate that retaining multiple ViT patch tokens for late-interaction matching dramatically improves accuracy over standard single-vector pooling. However, storing all tokens per image is operationally inefficient at a planetary scale. To close this efficiency gap, we propose instance-token aggregation, a scalable, training-free method that selects K seed tokens, assigns the remaining tokens to these seeds via cosine similarity, and aggregates each cluster into a single representative token. This approach yields substantial gains: at K=16, aggregation improves mAP by 17.9 points over raw token selection, and at K=64, it matches the accuracy of using all 196 tokens with significantly less storage. Finally, we demonstrate that a practical two-stage pipeline, with single-vector shortlisting followed by instance-token reranking, recovers 89-94% of the full late-interaction accuracy while searching only a small candidate set. The benchmark is publicly available at hf.co/datasets/jfang/CraterBench-R.
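
The instance-token aggregation step (select K seed tokens, assign the remaining tokens by cosine similarity, average each cluster) can be sketched in a few lines of NumPy. The seed-selection rule used here (highest-norm tokens) is an assumption, since the abstract specifies only the assign-and-average structure:

```python
import numpy as np

def instance_token_aggregation(tokens: np.ndarray, k: int = 16) -> np.ndarray:
    """Compress N patch tokens (N, D) to K representative tokens (K, D).

    Training-free sketch: pick K seed tokens, assign every token to its
    most cosine-similar seed, then average each cluster. The seed rule
    (highest-norm tokens) is an assumption not stated in the abstract.
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)
    seed_idx = np.argsort(-norms[:, 0])[:k]   # assumed seed rule
    sims = unit @ unit[seed_idx].T            # (N, K) cosine similarities
    assign = sims.argmax(axis=1)              # nearest seed per token
    out = np.zeros((k, tokens.shape[1]))
    for j in range(k):
        members = tokens[assign == j]
        out[j] = members.mean(axis=0) if len(members) else tokens[seed_idx[j]]
    return out

rng = np.random.default_rng(0)
compact = instance_token_aggregation(rng.normal(size=(196, 64)), k=16)
print(compact.shape)  # (16, 64)
```

At K=16 this turns a 196-token late-interaction representation into 16 vectors, which is the storage regime where the paper reports its 17.9-point mAP gain over raw token selection.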

[158] URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

Zhenyu Wang, Weichen Cheng, Weijia Li, Junjie Mou, Zongyou Zhao, Guoying Zhang

Main category: cs.CV

TL;DR: URMF is an uncertainty-aware multimodal fusion framework for sarcasm detection that dynamically regulates modality contributions based on reliability estimates to handle noisy or irrelevant visual/textual content.

Motivation: Existing multimodal sarcasm detection methods assume equal modality reliability, but real-world social media often has ambiguous text and weakly relevant/irrelevant images, causing deterministic fusion to introduce noise and weaken robust reasoning.

Method: URMF uses multi-head cross-attention for visual-text interaction, self-attention for incongruity reasoning, and models unimodal aleatoric uncertainty via learnable Gaussian posteriors. Uncertainty estimates dynamically regulate modality fusion, with joint training integrating task supervision, prior regularization, distribution alignment, and uncertainty-driven contrastive learning.

Result: Experiments on public MSD benchmarks show URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, demonstrating effectiveness for both accuracy and robustness.

Conclusion: Explicitly modeling modality uncertainty and dynamically regulating fusion based on reliability estimates improves multimodal sarcasm detection performance and robustness in real-world scenarios with noisy/irrelevant content.

Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, they often assume that all modalities are equally reliable. In real-world social media, however, textual content may be ambiguous and visual content may be weakly relevant or even irrelevant, causing deterministic fusion to introduce noisy evidence and weaken robust reasoning. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework that explicitly models modality reliability during interaction and fusion. URMF first employs multi-head cross-attention to inject visual evidence into textual representations, followed by multi-head self-attention in the fused semantic space to enhance incongruity-aware reasoning. It then performs unified unimodal aleatoric uncertainty modeling over text, image, and interaction-aware latent representations by parameterizing each modality as a learnable Gaussian posterior. The estimated uncertainty is further used to dynamically regulate modality contributions during fusion, suppressing unreliable modalities and yielding a more robust joint representation. In addition, we design a joint training objective integrating task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning. Experiments on public MSD benchmarks show that URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, demonstrating the effectiveness of uncertainty-aware fusion for improving both accuracy and robustness.
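
As an illustration of uncertainty-regulated fusion, a minimal sketch: each modality is represented as a Gaussian (mean and log-variance), and fusion weights are inverse to variance so unreliable modalities contribute less. This classic precision-weighting rule is a stand-in assumption; URMF's actual learned gating is not specified in the abstract:

```python
import numpy as np

def uncertainty_weighted_fusion(mus, logvars):
    """Precision-weighted fusion of per-modality Gaussian posteriors.

    Illustrative only: inverse-variance weighting stands in for the
    paper's uncertainty-driven regulation of modality contributions.
    """
    mus = np.stack(mus)                    # (M, D) modality means
    prec = np.exp(-np.stack(logvars))      # (M, D) precisions = 1/variance
    weights = prec / prec.sum(axis=0, keepdims=True)
    return (weights * mus).sum(axis=0)     # noisier modalities count less

text_mu, img_mu = np.ones(4), np.full(4, 3.0)
fused = uncertainty_weighted_fusion(
    [text_mu, img_mu],
    [np.zeros(4), np.full(4, 2.0)],        # image is the more uncertain modality
)
print(fused)  # closer to the text mean (1.0) than the image mean (3.0)
```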

[159] No-reference based automatic parameter optimization for iterative reconstruction using a novel search space aware crow search algorithm

Poorya MohammadiNasab, Ander Biguri, Philipp Steininger, Peter Keuschnigg, Lukas Lamminger, Agnieszka Lach, S M Ragib Shahriar Islam, Anna Breger, Clemens Karner, Carola-Bibiane Schönlieb, Wolfgang Birkfellner, Sepideh Hatamikia

Main category: cs.CV

TL;DR: A fully automatic parameter optimization framework for CBCT iterative reconstruction algorithms using modified crow search algorithm with chaotic initialization to reduce radiation exposure without requiring reference reconstructions.

Motivation: Iterative reconstruction techniques can reduce radiation exposure in CBCT but require precise tuning of multiple hyperparameters, which is time-consuming and increases operator workload. Manual parameter setting is inefficient and impacts reconstruction quality.

Method: Proposes a fully automatic parameter optimization framework using a modified crow search algorithm (CSA) with: 1) superior set-dependent local search, 2) search-space-aware global search, 3) objective-driven balance between local/global search, and 4) chaotic diagonal linear uniform initialization for effective initial population.

Result: Outperformed manual settings and original CSA with 4.19% average fitness improvement, 4.89% improvement on CHILL@UK metric, and 3.82% on RPI_AXIS metric. Maintained fine details sharply in qualitative results across three imaging machines, four real datasets, and three different iterative reconstruction methods.

Conclusion: The proposed automatic parameter optimization framework is effective and robust for CBCT iterative reconstruction, reducing operator workload while improving reconstruction quality without requiring reference images.

Abstract: The ability of iterative reconstruction techniques to reduce radiation exposure by using fewer projections has attracted significant attention. However, these methods typically require a precise tuning of several hyperparameters, which can have a major impact on reconstruction quality. Manually setting these parameters is time-consuming and increases the workload for human operators. In this paper, we introduce a novel fully automatic parameter optimization framework that can be applied to a wide range of Cone-beam computed tomography (CBCT) iterative reconstruction algorithms to determine optimal parameters without requiring a reference reconstruction. The proposed method incorporates a modified crow search algorithm (CSA) featuring a superior set-dependent local search mechanism, a search-space-aware global search strategy, and an objective-driven balance between local and global search. Additionally, to ensure an effective initial population, we propose a chaotic diagonal linear uniform initialization scheme that accelerates algorithm convergence. The performance of the proposed framework was evaluated on three imaging machines and four real datasets, as well as three different iterative reconstruction methods with the highest number of tunable parameters, representing the most challenging scenario. The results indicate that the proposed method could outperform manual settings and CSA, with a 4.19% improvement in average fitness and 4.89% and 3.82% improvements on CHILL@UK and RPI_AXIS, respectively, which are two benchmark no-reference learning-based quality metrics. In addition, the qualitative results clearly show the superiority of the proposed method by maintaining fine details sharply. The overall performance of the proposed framework across different comparison scenarios demonstrates its effectiveness and robustness across all cases.
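
For readers unfamiliar with the base optimizer, a minimal vanilla crow search loop is sketched below. The paper's modifications (superior set-dependent local search, search-space-aware global search, chaotic initialization) are not reproduced, and the toy sphere objective stands in for the no-reference reconstruction-quality score:

```python
import numpy as np

def crow_search(f, dim, bounds, n_crows=20, iters=200, fl=2.0, ap=0.1, seed=0):
    """Minimal vanilla crow search algorithm (CSA), minimizing f.

    Each crow remembers its best-known position; with probability 1-ap
    it flies toward a random crow's memory, otherwise it relocates
    randomly (the followed crow is "aware" and misleads it).
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, (n_crows, dim))
    mem = pos.copy()                          # best-known hiding spots
    mem_fit = np.array([f(p) for p in mem])
    for _ in range(iters):
        for i in range(n_crows):
            j = rng.integers(n_crows)         # crow i follows crow j
            if rng.random() >= ap:            # j unaware: move toward j's memory
                new = pos[i] + rng.random() * fl * (mem[j] - pos[i])
            else:                             # j aware: random relocation
                new = rng.uniform(lo, hi, dim)
            pos[i] = np.clip(new, lo, hi)
            fit = f(pos[i])
            if fit < mem_fit[i]:              # keep memory of the best point
                mem[i], mem_fit[i] = pos[i], fit
    best = mem_fit.argmin()
    return mem[best], mem_fit[best]

# Toy stand-in objective: 3 reconstruction parameters in [0, 1], optimum at 0.3.
x, fx = crow_search(lambda p: float(np.sum((p - 0.3) ** 2)), dim=3, bounds=(0.0, 1.0))
print("best:", x.round(3), "fitness:", round(fx, 5))
```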

[160] SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

Qizhou Wang, Guansong Pang, Christopher Leckie

Main category: cs.CV

TL;DR: SurFITR is a surveillance-style image forgery detection dataset created to address limitations of existing forgery models that struggle with localized, subtle tampering in surveillance imagery with varied viewpoints and lower quality.

Motivation: Recent advances in open-access image generation models raise concerns about falsifying visual evidence in surveillance contexts. Existing forgery detection models trained on datasets with full-image synthesis or large manipulated regions in object-centric images fail to generalize to surveillance scenarios where tampering is typically localized, subtle, and occurs in scenes with varied viewpoints, small/occluded subjects, and lower visual quality.

Method: Created a large dataset (SurFITR) using a multimodal LLM-powered pipeline for semantically aware, fine-grained editing across diverse surveillance scenes. Contains over 137k tampered images with varying resolutions and edit types generated using multiple image editing models.

Result: Extensive experiments show existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial improvements in both in-domain and cross-domain performance.

Conclusion: SurFITR addresses a critical gap in surveillance image forgery detection and is publicly available on GitHub to advance research in this area.

Abstract: We present the Surveillance Forgery Image Test Range (SurFITR), a dataset for surveillance-style image forgery detection and localisation, in response to recent advances in open-access image generation models that raise concerns about falsifying visual evidence. Existing forgery models, trained on datasets with full-image synthesis or large manipulated regions in object-centric images, struggle to generalise to surveillance scenarios. This is because tampering in surveillance imagery is typically localised and subtle, occurring in scenes with varied viewpoints, small or occluded subjects, and lower visual quality. To address this gap, SurFITR provides a large collection of forensically valuable imagery generated via a multimodal LLM-powered pipeline, enabling semantically aware, fine-grained editing across diverse surveillance scenes. It contains over 137k tampered images with varying resolutions and edit types, generated using multiple image editing models. Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial improvements in both in-domain and cross-domain performance. SurFITR is publicly available on GitHub.

[161] DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs

Dikshant Kukreja, Kshitij Sah, Karan Goyal, Mukesh Mohania, Vikram Goyal

Main category: cs.CV

TL;DR: DISSECT benchmark reveals perception-integration gap in VLMs where models can perceive visual content but fail to integrate it during reasoning, with open-source models showing systematic integration bottlenecks.

Motivation: Current VLM benchmarks conflate perception and integration into single accuracy scores, masking failures where models successfully extract visual information but lose it during reasoning. The authors identify this "perception-integration gap" as a critical limitation in multimodal reasoning.

Method: Introduces DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Each question is evaluated under five input modes: Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle where VLMs first verbalize images then reason from their own descriptions. This yields diagnostic gaps decomposing performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness.

Result: Evaluation of 18 VLMs shows: (1) Chemistry has lower language-prior exploitability than Biology, making molecular visual content a harder test of genuine visual reasoning; (2) Open-source models consistently score higher when reasoning from their own verbalized descriptions than from raw images, exposing systematic integration bottlenecks; (3) Closed-source models show no such gap, indicating bridging perception and integration separates open-source from closed-source multimodal capability.

Conclusion: The perception-integration gap is a fundamental limitation in current VLMs, particularly for open-source models. The Model Oracle protocol provides a model-agnostic diagnostic tool for identifying integration failures in any VLM evaluation, revealing that effective integration of visual information into reasoning is the frontier in multimodal AI.

Abstract: When asked to describe a molecular diagram, a Vision-Language Model correctly identifies "a benzene ring with an -OH group." When asked to reason about the same image, it answers incorrectly. The model can see but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes – Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description – yielding diagnostic gaps that decompose performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness. Evaluating 18 VLMs, we find that: (1) Chemistry exhibits substantially lower language-prior exploitability than Biology, confirming molecular visual content as a harder test of genuine visual reasoning; (2) Open-source models consistently score higher when reasoning from their own verbalized descriptions than from raw images, exposing a systematic integration bottleneck; and (3) Closed-source models show no such gap, indicating that bridging perception and integration is the frontier separating open-source from closed-source multimodal capability. The Model Oracle protocol is both model and benchmark agnostic, applicable post-hoc to any VLM evaluation to diagnose integration failures.
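
The Model Oracle mode reduces to two chat calls: verbalize the image, then answer from the model's own description with the image withheld. The `vlm(prompt, image)` interface and toy stand-in model below are hypothetical, not a real VLM API:

```python
def model_oracle(vlm, image, question):
    """DISSECT's Model Oracle mode as a two-call sketch.

    `vlm(prompt, image=None)` is a hypothetical chat interface:
    call 1 asks the model to verbalize the image; call 2 asks it to
    answer from that description alone, with no image attached.
    """
    description = vlm("Describe everything visible in this image.", image=image)
    answer = vlm(
        f"Image description: {description}\n\nQuestion: {question}\n"
        "Answer using only the description above.",
        image=None,  # the raw image is withheld in this mode
    )
    return answer

# Toy stand-in model so the sketch runs end-to-end.
def toy_vlm(prompt, image=None):
    if image is not None:
        return "a benzene ring with an -OH group"
    return "phenol" if "benzene" in prompt and "-OH" in prompt else "unknown"

print(model_oracle(toy_vlm, image="molecule.png", question="Name the compound."))
```

A positive gap between this mode and direct Vision+Text answering is what the paper reads as an integration bottleneck: the model's own words carry the visual evidence its reasoning path could not use.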

[162] Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection

Parker Ewen, Dmitriy Rivkin, Mario Bijelic, Felix Heide

Main category: cs.CV

TL;DR: Telescope: A two-stage detection model for ultra-long range autonomous driving that improves small object detection beyond 500 meters using novel re-sampling and image transformation techniques.

Motivation: Autonomous highway driving requires detecting objects at ultra-long ranges (>500m) for safety at high speeds, but current object detectors fail on distant objects occupying few pixels, and LiDAR has limited effective range due to quadratic resolution loss with distance.

Method: Two-stage detection model with a powerful detection backbone, novel re-sampling layer, and image transformation specifically designed to address challenges of detecting small, distant objects in high-resolution images.

Result: Achieves 76% relative improvement in mAP for ultra-long range detection (from 0.185 to 0.326 mAP beyond 250m) compared to state-of-the-art methods, with minimal computational overhead and strong performance across all detection ranges.

Conclusion: Telescope provides a scalable image-based solution for ultra-long range object detection in autonomous driving, overcoming limitations of both current vision-based detectors and LiDAR sensors for long-distance perception.

Abstract: Autonomous highway driving, especially for long-haul heavy trucks, requires detecting objects at long ranges beyond 500 meters to satisfy braking distance requirements at high speeds. At long distances, vehicles and other critical objects occupy only a few pixels in high-resolution images, causing state-of-the-art object detectors to fail. This challenge is compounded by the limited effective range of commercially available LiDAR sensors, which fall short of ultra-long range thresholds because of quadratic loss of resolution with distance, making image-based detection the most practically scalable solution given commercially available sensor constraints. We introduce Telescope, a two-stage detection model designed for ultra-long range autonomous driving. Alongside a powerful detection backbone, this model contains a novel re-sampling layer and image transformation to address the fundamental challenges of detecting small, distant objects. Telescope achieves a 76% relative improvement in mAP in ultra-long range detection compared to state-of-the-art methods (improving from an absolute mAP of 0.185 to 0.326 at distances beyond 250 meters), requires minimal computational overhead, and maintains strong performance across all detection ranges.
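
The headline 76% figure follows directly from the absolute mAP numbers quoted in the abstract:

```python
# Quick check of the claimed relative mAP gain beyond 250 m.
baseline, telescope = 0.185, 0.326
rel_gain = (telescope - baseline) / baseline
print(f"{rel_gain:.1%}")  # prints 76.2%, matching the reported ~76% figure
```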

[163] Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou

Main category: cs.CV

TL;DR: VLMs struggle with structured cultural metadata inference from images, showing fragmented performance across cultures and metadata types.

Motivation: While VLMs have improved image captioning for cultural heritage, inferring structured cultural metadata (creator, origin, period) from visual input remains underexplored and challenging.

Method: Introduced a multi-category, cross-cultural benchmark and evaluated VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. Assessed cultural reasoning through exact-match, partial-match, and attribute-level accuracy across cultural regions.

Result: Models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions.

Conclusion: Current VLMs have significant limitations in structured cultural metadata inference beyond basic visual perception, highlighting the need for improved cultural reasoning capabilities.

Abstract: Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
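The exact-match, partial-match, and attribute-level protocol described above can be sketched as a small scoring function. This is an illustration of the general idea only, not the paper's evaluation code; the attribute names and the case-insensitive string-equality rule are assumptions.

```python
def metadata_match_scores(pred: dict, ref: dict) -> dict:
    """Score a predicted cultural-metadata record against a reference.

    exact:    1.0 only if every reference attribute is predicted correctly.
    partial:  fraction of reference attributes predicted correctly.
    per_attr: per-attribute correctness (attribute-level accuracy).
    """
    per_attr = {
        k: float(str(pred.get(k, "")).strip().lower() == str(v).strip().lower())
        for k, v in ref.items()
    }
    partial = sum(per_attr.values()) / len(per_attr)
    return {"exact": float(partial == 1.0), "partial": partial, "per_attr": per_attr}
```

In the paper the judgment is made by an LLM-as-Judge measuring semantic alignment rather than literal string equality; the sketch only shows how the three accuracy levels relate.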

[164] Evolution of Video Generative Foundations

Teng Hu, Jiangning Zhang, Hongrui Huang, Ran Yi, Zihan Su, Jieyu Weng, Zhucun Xue, Lizhuang Ma, Ming-Hsuan Yang, Dacheng Tao

Main category: cs.CV

TL;DR: Survey paper providing comprehensive review of video generation technology evolution from GANs to diffusion models to emerging AR-based and multimodal techniques, with focus on building world models and applications.

DetailsMotivation: Existing reviews on video generation are too narrow, focusing only on specific techniques like GANs/diffusion models or specific tasks, lacking comprehensive perspective on field evolution, especially regarding Auto-Regressive models and multimodal integration needed for building advanced world models.

Method: Systematic survey methodology: 1) Review development of video generation technology from early GANs to dominant diffusion models to emerging AR-based and multimodal techniques; 2) In-depth analysis of foundational principles, key advancements, and comparative strengths/limitations; 3) Exploration of emerging trends in multimodal video generation emphasizing diverse data type integration for contextual awareness.

Result: Comprehensive survey that bridges historical developments and contemporary innovations in video generation, providing insights to guide future research in applications including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and advanced world models.

Conclusion: This survey fills gaps in existing literature by providing holistic perspective on video generation evolution, particularly highlighting importance of multimodal integration and AR-based approaches for building sophisticated world models, with practical applications across multiple domains.

Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems ranging from proprietary pioneers like OpenAI’s Sora, Google’s Veo3, and Bytedance’s Seedance to powerful open-source contenders like Wan and HunyuanVideo to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building “world models” that simulate real-world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews on video generation often focus on narrow technical fields, e.g., Generative Adversarial Networks (GANs) and diffusion models, or specific tasks (e.g., video editing), lacking a comprehensive perspective on the field’s evolution, especially regarding Auto-Regressive (AR) models and integration of multimodal information. To address these gaps, this survey first provides a systematic review of the development of video generation technology, tracing its evolution from early GANs to dominant diffusion models, and further to emerging AR-based and multimodal techniques. We conduct an in-depth analysis of the foundational principles, key advancements, and comparative strengths/limitations. Then, we explore emerging trends in multimodal video generation, emphasizing the integration of diverse data types to enhance contextual awareness. Finally, by bridging historical developments and contemporary innovations, this survey offers insights to guide future research in video generation and its applications, including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and advanced world models, in this rapidly evolving field. For more details, please refer to the project at https://github.com/sjtuplayer/Awesome-Video-Foundations.

[165] Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents

Peng Huang, Yiming Wang, Yineng Chen, Liangqiao Gui, Hui Guo, Bo Peng, Shu Hu, Xi Wu, Tsao Connie, Hongtu Zhu, Balakrishnan Prabhakaran, Xin Wang

Main category: cs.CV

TL;DR: EchoTrust: An evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography visual language models, addressing template shortcuts and spurious explanations in clinical decision support.

DetailsMotivation: Automated intelligent analysis of echocardiographic data remains challenging due to complex cardiac dynamics and strong view heterogeneity. Existing VLM methods are vulnerable to template shortcuts and spurious explanations, which is problematic for high-stakes clinical applications.

Method: Proposes EchoTrust, an evidence-driven Actor-Verifier framework that produces structured intermediate representations analyzed by distinct roles for more reliable and interpretable decision-making.

Result: The framework enables more trustworthy reasoning in echocardiography VLM-based agents by addressing vulnerabilities in existing direct mapping approaches.

Conclusion: EchoTrust provides a more reliable and interpretable approach for clinical decision support in echocardiography analysis using visual language models.

Abstract: Echocardiography plays an important role in the screening and diagnosis of cardiovascular diseases. However, automated intelligent analysis of echocardiographic data remains challenging due to complex cardiac dynamics and strong view heterogeneity. In recent years, visual language models (VLMs) have opened a new avenue for building ultrasound understanding systems for clinical decision support. Nevertheless, most existing methods formulate this task as a direct mapping from video and question to answer, making them vulnerable to template shortcuts and spurious explanations. To address these issues, we propose EchoTrust, an evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography VLM-based agents. EchoTrust produces a structured intermediate representation that is subsequently analyzed by distinct roles, enabling more reliable and interpretable decision-making for high-stakes clinical applications.

[166] Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Jie Li, Hongyi Cai, Mingkang Dong, Muxin Pu, Shan You, Fei Wang, Tao Huang

Main category: cs.CV

TL;DR: Pistachio is a new Video Anomaly Detection/Understanding benchmark created using video generation models to overcome limitations of existing datasets, offering controlled scenes, diverse anomalies, and complex temporal narratives.

DetailsMotivation: Existing Video Anomaly Detection (VAD) benchmarks lack scene diversity, balanced anomaly coverage, and temporal complexity needed for reliable real-world assessment. The field is moving toward Video Anomaly Understanding (VAU) requiring deeper semantic reasoning, but manual annotation makes benchmarking difficult.

Method: Uses a controlled, generation-based pipeline leveraging recent video generation models. Integrates scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to produce coherent 41-second videos with minimal human intervention.

Result: Pistachio provides a benchmark with scale, diversity, and complexity that reveals new challenges for existing methods and motivates future research on dynamic and multi-event anomaly understanding.

Conclusion: Pistachio addresses limitations of existing VAD/VAU benchmarks through a generation-based approach, offering precise control over scenes and anomalies while eliminating biases of Internet-collected datasets, enabling more reliable assessment of real-world performance.

Abstract: Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

[167] MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

Xiangyu Peng, Can Qin, An Yan, Xinyi Yang, Zeyuan Chen, Ran Xu, Chien-Sheng Wu

Main category: cs.CV

TL;DR: MTA-Agent generates high-quality multi-hop vision-language training data for multimodal deep-search agents, enabling complex reasoning that integrates visual evidence with external knowledge through tool use.

DetailsMotivation: Current MLLMs have strong visual understanding but lack complex multi-step reasoning requiring deep searching and integrating visual evidence with external knowledge.

Method: Proposes MTA-Agent that automatically selects tools and parameters to retrieve/validate evidence from visual/textual sources, generating structured multi-hop QA trajectories from VQA seed datasets. Creates MTA-Vision-DeepSearch dataset with 21K examples filtered through multi-stage verification.

Result: 32B open-source multimodal search agent achieves SOTA 54.63% across six benchmarks, outperforming GPT-5, Gemini-2.5-Pro, and Gemini-3-Pro. Training improves reasoning depth (steps from 2.27 to 4.28) and tool-use behavior. Training can use cached interactions to reduce costs.

Conclusion: MTA-Agent provides effective framework for multimodal deep search, with open release of dataset, training trajectories, and implementation details to enable reproducibility and future research.

Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings. We further show that training on our data improves both reasoning depth and tool-use behavior, increasing the average number of steps from 2.27 to 4.28, and leading to more systematic and persistent search strategies. Additionally, we demonstrate that training can be performed without real-time tool calls by replaying cached interactions, significantly reducing training cost. Importantly, we present MTA-Agent as a fully open recipe for multimodal deep search: we release the entire dataset, training trajectories, and implementation details to enable reproducibility and future research on open multimodal search agents.

[168] Robust Mesh Saliency Ground Truth Acquisition in VR via View Cone Sampling and Manifold Diffusion

Guoquan Zheng, Jie Hao, Huiyu Duan, Long Tang, Shuo Yang, Yucheng Zhu, Yongming Han, Liang Yuan, Patrick Le Callet, Guangtao Zhai

Main category: cs.CV

TL;DR: A framework for high-fidelity 3D mesh saliency ground truth acquisition in VR using view cone sampling and hybrid manifold-Euclidean constrained diffusion to address topological and aliasing issues in existing eye-tracking methods.

DetailsMotivation: Existing VR eye-tracking frameworks for 3D mesh saliency suffer from limitations: single ray sampling causes texture aliasing and discontinuous signals, while Euclidean smoothing propagates saliency across disconnected physical gaps, leading to semantic confusion on complex 3D manifolds.

Method: Proposes two key innovations: 1) View cone sampling (VCS) using Gaussian-distributed ray bundles to simulate human foveal receptive field for robust sampling of complex topologies; 2) Hybrid manifold-Euclidean constrained diffusion (HCD) algorithm that fuses manifold geodesic constraints with Euclidean scales for topologically-consistent saliency propagation.

Result: Demonstrates performance improvement over baseline methods through subjective experiments and qualitative/quantitative evaluations. Mitigates “topological short-circuits” and aliasing, providing more accurate and robust 3D attention acquisition.

Conclusion: The framework offers a high-fidelity 3D attention acquisition paradigm that aligns with natural human perception, providing a more accurate and robust baseline for 3D mesh saliency research in VR applications.

Abstract: As the complexity of 3D digital content grows exponentially, understanding human visual attention is critical for optimizing rendering and processing resources. Therefore, reliable 3D mesh saliency ground truth (GT) is essential for human-centric visual modeling in virtual reality (VR). However, existing VR eye-tracking frameworks are fundamentally bottlenecked by their underlying acquisition and generation mechanisms. The reliance on zero-area single ray sampling (SRS) fails to capture contextual features, leading to severe texture aliasing and discontinuous saliency signals. And the conventional application of Euclidean smoothing propagates saliency across disconnected physical gaps, resulting in semantic confusion on complex 3D manifolds. This paper proposes a robust framework to address these limitations. We first introduce a view cone sampling (VCS) strategy, which simulates the human foveal receptive field via Gaussian-distributed ray bundles to improve sampling robustness for complex topologies. Furthermore, a hybrid Manifold-Euclidean constrained diffusion (HCD) algorithm is developed, fusing manifold geodesic constraints with Euclidean scales to ensure topologically-consistent saliency propagation. We demonstrate the improvement in performance over baseline methods and the benefits for downstream tasks through subjective experiments and qualitative and quantitative methods. By mitigating “topological short-circuits” and aliasing, our framework provides a high-fidelity 3D attention acquisition paradigm that aligns with natural human perception, offering a more accurate and robust baseline for 3D mesh saliency research.
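The view cone sampling idea, replacing a single zero-area ray with a Gaussian-weighted bundle around the gaze direction, can be sketched as follows. This is a minimal illustration of the general mechanism, not the paper's implementation; the cone width, ray count, and Gaussian weighting scheme are assumptions.

```python
import numpy as np

def view_cone_sample(gaze: np.ndarray, n_rays: int = 64,
                     sigma_deg: float = 1.0, rng=None):
    """Sample a Gaussian-distributed bundle of ray directions around a gaze ray.

    Returns (directions, weights): unit direction vectors with Gaussian
    angular offsets from `gaze`, and normalized Gaussian weights so that
    on-axis rays dominate, mimicking the foveal receptive field.
    """
    rng = np.random.default_rng(rng)
    gaze = gaze / np.linalg.norm(gaze)
    # Build an orthonormal basis (u, v) in the plane perpendicular to the gaze.
    helper = np.array([1.0, 0.0, 0.0]) if abs(gaze[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(gaze, helper); u /= np.linalg.norm(u)
    v = np.cross(gaze, u)
    # Gaussian angular offsets (radians) in the tangent plane.
    sigma = np.deg2rad(sigma_deg)
    offsets = rng.normal(0.0, sigma, size=(n_rays, 2))
    dirs = gaze[None, :] + offsets[:, :1] * u[None, :] + offsets[:, 1:] * v[None, :]
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Weight each ray by its angular distance from the gaze axis.
    ang = np.linalg.norm(offsets, axis=1)
    w = np.exp(-0.5 * (ang / sigma) ** 2)
    return dirs, w / w.sum()
```

Casting all rays of the bundle and accumulating the weighted hits onto mesh vertices would then yield a saliency signal that is robust to thin structures and complex topology, which is the failure mode of single ray sampling the paper targets.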

[169] MorphDistill: Distilling Unified Morphological Knowledge from Pathology Foundation Models for Colorectal Cancer Survival Prediction

Hikmat Khan, Usama Sajjad, Metin N. Gurcan, Anil Parwani, Wendy L. Frankel, Wei Chen, Muhammad Khalid Khan Niazi

Main category: cs.CV

TL;DR: MorphDistill: A two-stage framework that distills knowledge from multiple pathology foundation models into a CRC-specific encoder for improved survival prediction in colorectal cancer.

DetailsMotivation: Colorectal cancer survival prediction needs organ-specific features that existing pathology foundation models overlook. Current models lack CRC-specific knowledge for accurate prognostication.

Method: Two-stage framework: Stage I uses dimension-agnostic multi-teacher relational distillation with supervised contrastive regularization to train a student encoder from 10 foundation models. Stage II aggregates patch-level features via attention-based multiple instance learning for survival prediction.

Result: Achieves AUC of 0.68 (8% improvement over baseline), C-index of 0.661, hazard ratio of 2.52 on Alliance cohort. Generalizes well to external TCGA cohort with C-index of 0.628.

Conclusion: MorphDistill enables task-specific representation learning by integrating knowledge from multiple foundation models, providing efficient strategy for prognostic modeling in computational pathology.

Abstract: Background: Colorectal cancer (CRC) remains a leading cause of cancer-related mortality worldwide. Accurate survival prediction is essential for treatment stratification, yet existing pathology foundation models often overlook organ-specific features critical for CRC prognostication. Methods: We propose MorphDistill, a two-stage framework that distills complementary knowledge from multiple pathology foundation models into a compact CRC-specific encoder. In Stage I, a student encoder is trained using dimension-agnostic multi-teacher relational distillation with supervised contrastive regularization on large-scale colorectal datasets. This preserves inter-sample relationships from ten foundation models without explicit feature alignment. In Stage II, the encoder extracts patch-level features from whole-slide images, which are aggregated via attention-based multiple instance learning to predict five-year survival. Results: On the Alliance/CALGB 89803 cohort (n=424, stage III CRC), MorphDistill achieves an AUC of 0.68 (SD 0.08), an approximately 8% relative improvement over the strongest baseline (AUC 0.63). It also attains a C-index of 0.661 and a hazard ratio of 2.52 (95% CI: 1.73-3.65), outperforming all baselines. On an external TCGA cohort (n=562), it achieves a C-index of 0.628, demonstrating strong generalization across datasets and robustness across clinical subgroups. Conclusion: MorphDistill enables task-specific representation learning by integrating knowledge from multiple foundation models into a unified encoder. This approach provides an efficient strategy for prognostic modeling in computational pathology, with potential for broader oncology applications. Further validation across additional cohorts and disease stages is warranted.
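The "dimension-agnostic" property of the Stage I relational distillation can be illustrated with a short sketch: comparing batch-by-batch similarity matrices instead of raw features means student and teacher embedding dimensions never need to match. This is a generic relational-distillation illustration under assumed choices (cosine similarity, MSE), not MorphDistill's actual loss, which also includes supervised contrastive regularization.

```python
import numpy as np

def relational_distill_loss(student: np.ndarray, teachers: list) -> float:
    """Dimension-agnostic multi-teacher relational distillation (sketch).

    For a batch of embeddings, compare the student's pairwise cosine
    similarity matrix to each teacher's. The matrices are batch x batch,
    so each teacher may have a different feature dimension.
    """
    def cos_gram(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T
    gs = cos_gram(student)
    # MSE between similarity structures, averaged over all teachers.
    return float(np.mean([np.mean((gs - cos_gram(t)) ** 2) for t in teachers]))
```

Minimizing this over the student encoder preserves the inter-sample relationships of the ten foundation-model teachers without any explicit feature alignment, matching the property the abstract claims.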

[170] Continual Visual Anomaly Detection on the Edge: Benchmark and Efficient Solutions

Manuel Barusco, Francesco Borsatti, David Petrovic, Davide Dalle Pezze, Gian Antonio Susto

Main category: cs.CV

TL;DR: A benchmark for Visual Anomaly Detection on edge devices with continual learning constraints, proposing Tiny-Dinomaly as a lightweight solution with 13x smaller memory and 20x lower computation while improving performance.

DetailsMotivation: Visual Anomaly Detection (VAD) faces two unaddressed challenges in conjunction: edge deployment with severe computational constraints and continual learning requiring adaptation to evolving data distributions without forgetting. Existing methods designed for one setting break down when both constraints are imposed simultaneously.

Method: 1) Created the first comprehensive benchmark for VAD on edge in continual learning scenario; 2) Evaluated seven VAD models across three lightweight backbone architectures; 3) Proposed Tiny-Dinomaly, a lightweight adaptation of Dinomaly built on DINO foundation model; 4) Introduced targeted modifications to PatchCore and PaDiM for improved efficiency in continual learning.

Result: Tiny-Dinomaly achieves 13x smaller memory footprint and 20x lower computational cost while improving Pixel F1 by 5 percentage points. The benchmark provides guidance for selecting optimal backbone and VAD method under joint efficiency and adaptability constraints.

Conclusion: The work addresses the critical gap in VAD research by jointly considering edge deployment and continual learning constraints, providing practical solutions and benchmark guidance for real-world applications with computational limitations and evolving data distributions.

Abstract: Visual Anomaly Detection (VAD) is a critical task for many applications including industrial inspection and healthcare. While VAD has been extensively studied, two key challenges remain largely unaddressed in conjunction: edge deployment, where computational resources are severely constrained, and continual learning, where models must adapt to evolving data distributions without forgetting previously acquired knowledge. Studying these challenges in isolation is insufficient, as methods designed for one setting make assumptions that break down when the other constraint is simultaneously imposed. In this work, we propose the first comprehensive benchmark for VAD on the edge in the continual learning scenario, evaluating seven VAD models across three lightweight backbone architectures. Our benchmark provides guidance for the selection of the optimal backbone and VAD method under joint efficiency and adaptability constraints, characterizing the trade-offs between memory footprint, inference cost, and detection performance. Furthermore, we propose Tiny-Dinomaly, a lightweight adaptation of the Dinomaly model built on the DINO foundation model that achieves 13x smaller memory footprint and 20x lower computational cost while improving Pixel F1 by 5 percentage points. Finally, we introduce targeted modifications to PatchCore and PaDiM to improve their efficiency in the continual learning setting.

[171] LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation

Shuai Li, Huibin Bai, Yanbo Gao, Chong Lv, Hui Yuan, Chuankun Li, Wei Hua, Tian Xie

Main category: cs.CV

TL;DR: LiftFormer: A novel monocular depth estimation framework using lifting theory to bridge image color features with depth values through geometric representation subspaces.

DetailsMotivation: Monocular depth estimation is a challenging ill-posed problem in 3D vision. The authors aim to bridge the gap between image color features and geometric depth values by constructing intermediate subspaces that provide more robust representations.

Method: Proposes LiftFormer based on lifting theory topology. Constructs two subspaces: 1) Depth-oriented Geometric Representation (DGR) subspace using linearly dependent vectors according to depth bins, transforming image features to directly correspond to depth values. 2) Edge-aware Representation (ER) subspace to enhance depth prediction around edges where sharp changes occur.

Result: Achieves state-of-the-art performance on widely used datasets. Ablation studies validate the effectiveness of both proposed lifting modules in the LiftFormer architecture.

Conclusion: LiftFormer successfully transforms the depth estimation problem into geometric representation subspace learning, providing a novel approach that bridges color features to depth values and improves edge-aware depth prediction.

Abstract: Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.

[172] Visual prompting reimagined: The power of the Activation Prompts

Yihua Zhang, Hongkang Li, Yuguang Yao, Aochuan Chen, Shuai Zhang, Pin-Yu Chen, Meng Wang, Sijia Liu

Main category: cs.CV

TL;DR: Activation Prompting (AP) extends visual prompting beyond input-level perturbations to intermediate activation maps, achieving better performance and efficiency than input-level visual prompting while being theoretically analyzed for layer preference.

DetailsMotivation: Visual prompting (VP) has emerged as a parameter-efficient alternative to fine-tuning, but suffers from a performance gap compared to conventional fine-tuning. The authors aim to understand and advance input-level VP by exploring activation-level perturbations.

Method: Introduces Activation Prompting (AP) which applies universal perturbations to activation maps in intermediate layers rather than just input data. Analyzes layer preference theoretically by examining global features across layers, and connects AP to normalization tuning in CNNs and ViTs.

Result: Extensive experiments across 29 datasets show AP’s superiority over VP and parameter-efficient fine-tuning baselines in both accuracy and efficiency (time, parameters, memory, throughput). Reveals model-dependent layer preferences and intrinsic limitations of input-level VP.

Conclusion: AP provides a more effective and efficient prompting approach than input-level VP, with theoretical grounding for layer preferences. It bridges the performance gap between VP and conventional fine-tuning while maintaining parameter efficiency.

Abstract: Visual prompting (VP) has emerged as a popular method to repurpose pretrained vision models for adaptation to downstream tasks. Unlike conventional model fine-tuning techniques, VP introduces a universal perturbation directly into the input data to facilitate task-specific fine-tuning rather than modifying model parameters. However, there exists a noticeable performance gap between VP and conventional fine-tuning methods, highlighting an unexplored realm in theory and practice to understand and advance the input-level VP to reduce its current performance gap. Towards this end, we introduce a generalized concept, termed activation prompt (AP), which extends the scope of the input-level VP by enabling universal perturbations to be applied to activation maps within the intermediate layers of the model. By using AP to revisit the problem of VP and employing it as an analytical tool, we demonstrate the intrinsic limitations of VP in both performance and efficiency, revealing why input-level prompting may lack effectiveness compared to AP, which exhibits a model-dependent layer preference. We show that AP is closely related to normalization tuning in convolutional neural networks and vision transformers, although each model type has distinct layer preferences for prompting. We also theoretically elucidate the rationale behind such a preference by analyzing global features across layers. Through extensive experiments across 29 datasets and various model architectures, we provide a comprehensive performance analysis of AP, comparing it with VP and parameter-efficient fine-tuning baselines. Our results demonstrate AP’s superiority in both accuracy and efficiency, considering factors such as time, parameters, memory usage, and throughput.
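The difference between input-level VP and activation prompting is easy to show in code: the universal perturbation moves from the input to an intermediate activation map. The sketch below is a toy two-layer MLP under assumed shapes, not the paper's setup (which uses CNNs and ViTs); only `prompt` would be trained while the network weights stay frozen.

```python
import numpy as np

def mlp_with_activation_prompt(x, w1, w2, prompt):
    """Forward pass of a frozen two-layer MLP with an activation prompt.

    Instead of perturbing the input (classic visual prompting), a single
    learned tensor `prompt` is added to the hidden activation map and is
    shared across all inputs.
    """
    h = np.maximum(x @ w1, 0.0)  # frozen first layer + ReLU
    h = h + prompt               # universal activation-level perturbation
    return h @ w2                # frozen head
```

Which layer receives the prompt is exactly the "layer preference" the paper analyzes; a zero prompt recovers the frozen model, so AP strictly generalizes the unprompted forward pass.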

[173] PhysHead: Simulation-Ready Gaussian Head Avatars

Berna Kabadayi, Vanessa Sklyarova, Wojciech Zielonka, Justus Thies, Gerard Pons-Moll

Main category: cs.CV

TL;DR: PhysHead introduces a hybrid representation for animatable head avatars with realistic hair dynamics using 3D Gaussian-based layered representation combined with parametric head mesh and strand-based hair for physics simulation.

DetailsMotivation: Existing head avatar methods assume rigid hair movement, failing to disentangle hair from head and capture natural volumetric behavior. Need for realistic digital avatars with expressive and dynamic hair motion.

Method: Hybrid representation combining 3D parametric mesh for head with strand-based hair for physics simulation. Uses Gaussian primitives attached to head mesh and hair segments for appearance. Employs VLM-based models to generate appearance of occluded regions in dynamic training sequences.

Result: Demonstrates photorealistic head avatars with dynamic hair behavior (wind-blown motion) beyond rigid hair constraints. Shows physically plausible hair motion alongside expression and camera control in quantitative and qualitative studies.

Conclusion: PhysHead overcomes limitations of rigid hair in existing head avatar methods, enabling realistic hair dynamics through hybrid representation and physics simulation, advancing digital avatar realism.

Abstract: Realistic digital avatars require expressive and dynamic hair motion; however, most existing head avatar methods assume rigid hair movement. These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. At the core is a 3D Gaussian-based layered representation of the head. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods. However, these animation capabilities also require new training schemes. In particular, we propose the use of VLM-based models to generate appearance of regions that are occluded in the dynamic training sequences. In quantitative and qualitative studies, we demonstrate the capabilities of the proposed model and compare it with existing baselines. We show that our method can synthesize physically plausible hair motion besides expression and camera control.

[174] Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification

Xin Tian, Jiuliu Lu, Ephraim Tsalik, Bart Wanders, Colleen Knoth, Julian Knight

Main category: cs.CV

TL;DR: ROAM is a spatially-aware Mixture-of-Experts MIL aggregator for whole-slide image classification that uses capacity-constrained entropic optimal transport to route region tokens to expert poolers, promoting balanced expert utilization and spatial coherence.

DetailsMotivation: Current MIL aggregators route all instances through shared pathways, limiting specialization across pathological heterogeneity. Standard MoE methods can suffer from imbalanced expert utilization where few experts dominate, collapsing back to near-single-pathway solutions.

Method: ROAM compresses dense patch bags into spatially binned region tokens, then uses entropic optimal transport (Sinkhorn) with explicit capacity constraints for region-to-expert assignment, ensuring balanced utilization. It also employs graph-regularized Sinkhorn iterations that diffuse routing assignments over spatial region graphs to encourage neighboring regions to route to same experts.
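
The capacity-constrained routing step can be sketched as plain Sinkhorn iteration with fixed marginals: uniform mass over region tokens and a prescribed mass per expert. This is an illustrative toy, not the paper's implementation; the function name, the random cost matrix, and the equal-capacity choice are assumptions:

```python
import numpy as np

def sinkhorn_routing(cost, capacities, eps=0.2, n_iters=500):
    """Entropic OT between R region tokens (uniform mass) and E experts
    with prescribed capacity marginals."""
    R, E = cost.shape
    r = np.full(R, 1.0 / R)             # each region carries equal mass
    c = capacities / capacities.sum()   # expert capacity marginals
    K = np.exp(-cost / eps)             # Gibbs kernel
    u = np.ones(R)
    for _ in range(n_iters):
        v = c / (K.T @ u)               # scale columns toward capacities
        u = r / (K @ v)                 # scale rows toward uniform mass
    return u[:, None] * K * v[None, :]  # transport plan P

rng = np.random.default_rng(0)
cost = rng.random((12, 4))              # 12 region tokens, 4 experts
P = sinkhorn_routing(cost, np.ones(4))  # equal capacities
```

`P[i, e]` is the routing mass region i sends to expert e; the column sums match the capacity marginals up to Sinkhorn convergence, which is what rules out expert collapse by construction.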

Result: ROAM achieves performance competitive with strong MIL and MoE baselines on four WSI benchmarks. On NSCLC generalization (TCGA-CPTAC), it reaches an external AUC of 0.845 ± 0.019.

Conclusion: ROAM provides an effective spatially-aware MoE-MIL aggregator that addresses expert utilization imbalance through capacity-constrained optimal transport while maintaining spatial coherence, improving whole-slide image classification in computational pathology.

Abstract: Multiple Instance Learning (MIL) is the dominant framework for gigapixel whole-slide image (WSI) classification in computational pathology. However, current MIL aggregators route all instances through a shared pathway, constraining their capacity to specialise across the pathological heterogeneity inherent in each slide. Mixture-of-Experts (MoE) methods offer a natural remedy by partitioning instances across specialised expert subnetworks; yet unconstrained softmax routing may yield highly imbalanced utilisation, where one or a few experts absorb most routing mass, collapsing the mixture back to a near-single-pathway solution. To address these limitations, we propose ROAM (Region-graph OptimAl-transport Mixture-of-experts), a spatially aware MoE-MIL aggregator that routes region tokens to expert poolers via capacity-constrained entropic optimal transport, promoting balanced expert utilisation by construction. ROAM operates on spatial region tokens, obtained by compressing dense patch bags into spatially binned units that align routing with local tissue neighbourhoods, and introduces two key mechanisms: (i) region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses; and (ii) graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph, encouraging neighbouring regions to coherently route to the same experts. Evaluated on four WSI benchmarks with frozen foundation-model patch embeddings, ROAM achieves performance competitive with strong MIL and MoE baselines, and on NSCLC generalisation (TCGA-CPTAC) reaches an external AUC of 0.845 ± 0.019.

[175] Predicting Alzheimer’s disease progression using rs-fMRI and a history-aware graph neural network

Mahdi Moghaddami, Mohammad-Reza Siadat, Austin Toma, Connor Laming, Huirong Fu

Main category: cs.CV

TL;DR: GNN-RNN model predicts progression of Alzheimer’s disease stages using functional connectivity graphs from rs-fMRI scans, achieving 82.9% accuracy with strong performance on early CN to MCI conversion prediction.

DetailsMotivation: Early detection of Alzheimer's disease progression is crucial for timely interventions, but predicting transitions between cognitive impairment stages (CN→MCI→AD) is challenging. The paper aims to develop a predictive model using longitudinal neuroimaging data to forecast disease progression.

Method: Proposes a graph neural network (GNN) model with recurrent neural network (RNN) blocks to process longitudinal rs-fMRI data. Uses functional connectivity graphs from 303 subjects with varying visit histories. Incorporates visit distance information to handle irregular time gaps and missing visits.

Result: Achieves 82.9% overall accuracy in predicting cognitive impairment stage transitions, with particularly strong 68.8% accuracy on the challenging CN to MCI conversion task. Model demonstrates robustness to missing visits and irregular time intervals.

Conclusion: The GNN-RNN model effectively predicts Alzheimer’s disease progression using rs-fMRI data, showing promise for early detection and timely interventions. Combined with other modalities, it could help slow cognitive impairment progression.

Abstract: Alzheimer’s disease (AD) is a neurodegenerative disorder that affects more than seven million people in the United States alone. AD currently has no cure, but there are ways to potentially slow its progression if caught early enough. In this study, we propose a graph neural network (GNN)-based model for predicting whether a subject will transition to a more severe stage of cognitive impairment at their next clinical visit. We consider three stages of cognitive impairment in order of severity: cognitively normal (CN), mild cognitive impairment (MCI), and AD. We use functional connectivity graphs derived from resting-state functional magnetic resonance imaging (rs-fMRI) scans of 303 subjects, each with a different number of visits. Our GNN-based model incorporates a recurrent neural network (RNN) block, enabling it to process data from the subject’s entire visit history. It can also work with irregular time gaps between visits by incorporating visit distance information into our input features. Our model demonstrates robust predictive performance, even with missing visits in the subjects’ visit histories. It achieves an accuracy of 82.9%, with an especially impressive accuracy of 68.8% on CN to MCI conversions - a task that poses a substantial challenge in the field. Our results highlight the effectiveness of rs-fMRI in predicting the onset of MCI or AD and, in conjunction with other modalities, could offer a viable method for enabling timely interventions to slow the progression of cognitive impairment.

[176] Hybrid ResNet-1D-BiGRU with Multi-Head Attention for Cyberattack Detection in Industrial IoT Environments

Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari

Main category: cs.CV

TL;DR: A hybrid deep learning model combining ResNet-1D, BiGRU, and Multi-Head Attention for intrusion detection in Industrial IoT systems, achieving high accuracy and real-time performance on multiple datasets.

DetailsMotivation: The need for effective intrusion detection in Industrial IoT (IIoT) systems to protect critical infrastructure, addressing challenges like class imbalance and real-time processing requirements.

Method: Hybrid deep learning model using ResNet-1D for spatial feature extraction, BiGRU for temporal dependencies, and Multi-Head Attention for feature weighting. SMOTE applied to handle class imbalance during training on the Edge-IIoTset dataset.
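
SMOTE's oversampling step (synthesizing minority samples by interpolating between a minority point and one of its nearest minority neighbors) can be sketched in a few lines. This is a minimal illustration with made-up data, not the paper's training pipeline, where a library such as imbalanced-learn would normally be used:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE: new minority samples are convex combinations of a
    minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude self-distances
    nn = np.argsort(d, axis=1)[:, :k]       # k nearest neighbors per point
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # random minority sample
        j = nn[i, rng.integers(k)]          # one of its neighbors
        lam = rng.random()                  # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_oversample(X_min, n_new=6)    # 6 synthetic minority samples
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority class's local geometry rather than being drawn at random.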

Result: Achieved 98.71% accuracy with 0.0417% loss on Edge-IIoTset, and 99.99% accuracy with 0.0028 loss on the CICIoV2024 dataset. Demonstrated low inference latency (0.0001-0.00014 sec/instance) and outperformed existing methods across all metrics.

Conclusion: The proposed hybrid model is robust, effective, and suitable for real-time IoT intrusion detection, showing strong generalization across different datasets.

Abstract: This study introduces a hybrid deep learning model for intrusion detection in Industrial IoT (IIoT) systems, combining ResNet-1D, BiGRU, and Multi-Head Attention (MHA) for effective spatial-temporal feature extraction and attention-based feature weighting. To address class imbalance, SMOTE was applied during training on the Edge-IIoTset dataset. The model achieved 98.71% accuracy, a loss of 0.0417%, and low inference latency (0.0001 sec/instance), demonstrating strong real-time capability. To assess generalizability, the model was also tested on the CICIoV2024 dataset, where it reached 99.99% accuracy and F1-score, with a loss of 0.0028, 0% FPR, and 0.00014 sec/instance inference time. Across all metrics and datasets, the proposed model outperformed existing methods, confirming its robustness and effectiveness for real-time IoT intrusion detection.

[177] DesigNet: Learning to Draw Vector Graphics as Designers Do

Tomas Guija-Valiente, Iago Suárez

Main category: cs.CV

TL;DR: DesigNet is a hierarchical Transformer-VAE for SVG generation that incorporates designer tools like continuity control and axis alignment to produce editable vector graphics suitable for professional workflows.

DetailsMotivation: Neural networks and human designers operate differently, making collaboration challenging. The paper aims to bridge this gap for SVG generation by equipping neural networks with designer tools like continuity control and alignment.

Method: Hierarchical Transformer-VAE operating directly on SVG sequences with continuous command parameterization. Includes two differentiable modules: continuity self-refinement (predicts and enforces C0, G1, C1 continuity) and alignment self-refinement with snapping for horizontal/vertical lines.
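
For two cubic Béziers meeting at a junction (p3 == q0), C1 continuity means the outgoing tangent equals the incoming one, while G1 only requires the tangents to be collinear. A minimal sketch of enforcing each by moving the outgoing control point; the paper's differentiable module predicts which constraint to apply, and the helper names here are ours:

```python
import numpy as np

def enforce_c1(p2, p3):
    """C1 at the junction p3 == q0: the outgoing control point must
    mirror the incoming one across the junction."""
    return 2.0 * p3 - p2

def enforce_g1(p2, p3, q1):
    """G1: keep q1's distance from the junction but align its direction
    with the incoming tangent p3 - p2."""
    t = (p3 - p2) / np.linalg.norm(p3 - p2)
    return p3 + np.linalg.norm(q1 - p3) * t

p2, p3 = np.array([1.0, 0.0]), np.array([2.0, 0.0])   # end of curve A
q1 = np.array([2.5, 1.0])          # start of curve B; kinks the junction
q1_c1 = enforce_c1(p2, p3)         # -> [3., 0.]
q1_g1 = enforce_g1(p2, p3, q1)     # distance preserved, tangent aligned
```

C1 fixes both the direction and the parametric speed at the junction; G1 fixes only the direction, which is why it preserves the original control point's distance from the junction.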

Result: DesigNet produces editable outlines and achieves results competitive with SOTA methods, with notably higher accuracy in continuity and alignment, making its outputs easier to refine and integrate into professional workflows.

Conclusion: The approach successfully bridges the gap between neural networks and human designers for SVG generation by incorporating professional design tools, resulting in more editable and workflow-compatible vector graphics.

Abstract: AI-driven content generation has made remarkable progress in recent years. However, neural networks and human designers operate in fundamentally different ways, making collaboration between them challenging. We address this gap for Scalable Vector Graphics (SVG) by equipping neural networks with tools commonly used by designers, such as axis alignment and explicit continuity control at command junctions. We introduce DesigNet, a hierarchical Transformer-VAE that operates directly on SVG sequences with a continuous command parameterization. Our main contributions are two differentiable modules: a continuity self-refinement module that predicts $C^0$, $G^1$, and $C^1$ continuity for each curve point and enforces it by modifying Bézier control points, and an alignment self-refinement module with snapping capabilities for horizontal or vertical lines. DesigNet produces editable outlines and achieves competitive results against state-of-the-art methods, with notably higher accuracy in continuity and alignment. These properties ensure the outputs are easier to refine and integrate into professional design workflows. Source Code: https://github.com/TomasGuija/DesigNet.

[178] VAMAE: Vessel-Aware Masked Autoencoders for OCT Angiography

Ilerioluwakiiye Abolade, Prince Mireku, Kelechi Chibundu, Peace Ododo, Emmanuel Idoko, Promise Omoigui, Solomon Odelola

Main category: cs.CV

TL;DR: VAMAE: A vessel-aware masked autoencoder for self-supervised pretraining on OCTA images using anatomically informed masking and multi-target reconstruction to capture vascular geometry.

DetailsMotivation: Existing self-supervised learning methods like masked autoencoders are designed for dense natural images and use uniform masking with pixel-level reconstruction, which inadequately captures the sparse vessel structures and topological constraints in OCTA images.

Method: Proposes VAMAE framework with: 1) Anatomically informed masking emphasizing vessel-rich regions using vesselness and skeleton-based cues, 2) Multi-target reconstruction of complementary targets to capture appearance, structural, and topological information.
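
The anatomically informed masking can be illustrated as weighted sampling of patch indices, with the masking probability tilted toward vessel-rich patches. A toy sketch; the weighting scheme and the bias factor are our assumptions, not the paper's exact formulation:

```python
import numpy as np

def vessel_aware_mask(vesselness, mask_ratio=0.5, bias=4.0, seed=0):
    """Sample a binary patch mask whose probabilities are tilted toward
    vessel-rich patches (higher vesselness -> more likely to be masked)."""
    rng = np.random.default_rng(seed)
    w = 1.0 + bias * vesselness / (vesselness.max() + 1e-8)
    p = w / w.sum()                          # per-patch masking probability
    n_mask = int(mask_ratio * len(vesselness))
    idx = rng.choice(len(vesselness), size=n_mask, replace=False, p=p)
    mask = np.zeros(len(vesselness), dtype=bool)
    mask[idx] = True
    return mask

# 16 patches: the last 8 contain vessels
vesselness = np.concatenate([np.zeros(8), np.ones(8)])
mask = vessel_aware_mask(vesselness, mask_ratio=0.5)
```

Masking vessel-rich patches more often forces the reconstruction objective to spend its capacity on vascular connectivity rather than on the (mostly empty) background.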

Result: Evaluation on OCTA-500 benchmark shows consistent improvements over standard masked autoencoding baselines for vessel segmentation tasks, particularly in limited-label settings.

Conclusion: Vessel-aware masking and multi-target reconstruction enhance self-supervised learning for OCTA analysis, demonstrating the potential of geometry-aware pretraining for medical imaging tasks.

Abstract: Optical coherence tomography angiography (OCTA) provides non-invasive visualization of retinal microvasculature, but learning robust representations remains challenging due to sparse vessel structures and strong topological constraints. Many existing self-supervised learning approaches, including masked autoencoders, are primarily designed for dense natural images and rely on uniform masking and pixel-level reconstruction, which may inadequately capture vascular geometry. We propose VAMAE, a vessel-aware masked autoencoding framework for self-supervised pretraining on OCTA images. The approach incorporates anatomically informed masking that emphasizes vessel-rich regions using vesselness and skeleton-based cues, encouraging the model to focus on vascular connectivity and branching patterns. In addition, the pretraining objective includes reconstructing multiple complementary targets, enabling the model to capture appearance, structural, and topological information. We evaluate the proposed pretraining strategy on the OCTA-500 benchmark for several vessel segmentation tasks under varying levels of supervision. The results indicate that vessel-aware masking and multi-target reconstruction provide consistent improvements over standard masked autoencoding baselines, particularly in limited-label settings, suggesting the potential of geometry-aware self-supervised learning for OCTA analysis.

[179] MozzaVID: Mozzarella Volumetric Image Dataset

Pawel Tomasz Pieta, Peter Winkel Rasmussen, Anders Bjorholm Dahl, Jeppe Revall Frisvad, Siavash Arjomand Bigdeli, Carsten Gundlach, Anders Nymark Christensen

Main category: cs.CV

TL;DR: MozzaVID: A large volumetric CT image dataset of mozzarella microstructure for benchmarking volumetric deep learning models and classifying 25 cheese types/149 samples.

DetailsMotivation: Address shortage of established volumetric imaging datasets for benchmarking deep learning models, limiting development of architectures optimized for volumetric data and making models incomparable.

Method: Created MozzaVID dataset containing X-ray CT images of mozzarella microstructure with 25 cheese types and 149 samples, provided in three different resolutions (591 to 37,824 images).

Result: Provides a large, clean, versatile volumetric classification dataset that enables benchmarking of volumetric deep learning models and investigation of mozzarella microstructure properties.

Conclusion: MozzaVID addresses complexities in volumetric imaging and food structure analysis, contributing to more robust structural analysis models and deeper understanding of food microstructure.

Abstract: Influenced by the complexity of volumetric imaging, there is a shortage of established datasets useful for benchmarking volumetric deep-learning models. As a consequence, new and existing models are not easily comparable, limiting the development of architectures optimized specifically for volumetric data. To counteract this trend, we introduce MozzaVID – a large, clean, and versatile volumetric classification dataset. Our dataset contains X-ray computed tomography (CT) images of mozzarella microstructure and enables the classification of 25 cheese types and 149 cheese samples. We provide data in three different resolutions, resulting in three dataset instances containing from 591 to 37,824 images. While targeted for developing general-purpose volumetric algorithms, the dataset also facilitates investigating the properties of mozzarella microstructure. The complex and disordered nature of food structures brings a unique challenge, where a choice of appropriate imaging method, scale, and sample size is not trivial. With this dataset, we aim to address these complexities, contributing to more robust structural analysis models and a deeper understanding of food structure. The dataset can be explored through: https://papieta.github.io/MozzaVID/

[180] Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels

Yaqi Zhao, Haoliang Sun, Yating Wang, Yongshun Gong, Yilong Yin

Main category: cs.CV

TL;DR: HopS: Holistic Optimal Label Selection for prompt learning in vision-language models under partial supervision, combining local density-based filtering and global optimal transport for robust label selection.

DetailsMotivation: Prompt learning for vision-language models struggles with limited performance when only partial labels are available due to label ambiguity and insufficient supervisory information. Current methods lack robust label selection mechanisms for weakly supervised settings.

Method: Two complementary strategies: 1) Local density-based filter selects top frequent labels from nearest neighbors’ candidate sets using softmax scores; 2) Global selection objective based on optimal transport maps uniform sampling distribution to candidate label distributions across batches, minimizing expected transport cost.
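
The local density-based filter can be sketched as: gather the candidate label sets of a sample's nearest neighbors, keep the most frequent labels, and break ties with the model's softmax scores. A toy illustration with made-up features and candidate sets, not the paper's implementation:

```python
import numpy as np

def local_label_filter(feats, cand_sets, scores, i, k=3, top=2):
    """Collect the candidate label sets of sample i's k nearest
    neighbors, keep the `top` most frequent labels, then pick the one
    with the highest softmax score for sample i."""
    d = np.linalg.norm(feats - feats[i], axis=1)
    d[i] = np.inf                           # exclude the sample itself
    nbrs = np.argsort(d)[:k]
    counts = {}
    for j in nbrs:
        for lab in cand_sets[j]:
            counts[lab] = counts.get(lab, 0) + 1
    frequent = sorted(counts, key=counts.get, reverse=True)[:top]
    return max(frequent, key=lambda lab: scores[i][lab])

feats = np.array([[0.0], [0.1], [0.2], [5.0]])   # sample 3 is an outlier
cand_sets = [{0, 1}, {0, 2}, {0, 1}, {3}]        # partial-label candidates
scores = np.tile([0.6, 0.3, 0.1, 0.0], (4, 1))   # softmax scores per class
label = local_label_filter(feats, cand_sets, scores, i=0)
```

The neighbor vote exploits structural regularity in the feature space, while the softmax score settles ambiguity among equally frequent candidates.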

Result: Extensive experiments on eight benchmark datasets show HopS consistently improves performance under partial supervision and outperforms all baselines.

Conclusion: HopS provides a practical solution for prompt learning in weakly supervised settings by offering holistic label selection from both local and global perspectives, leveraging pre-trained feature encoders’ generalization ability.

Abstract: Prompt learning has gained significant attention as a parameter-efficient approach for adapting large pre-trained vision-language models to downstream tasks. However, when only partial labels are available, its performance is often limited by label ambiguity and insufficient supervisory information. To address this issue, we propose Holistic Optimal Label Selection (HopS), leveraging the generalization ability of pre-trained feature encoders through two complementary strategies. First, we design a local density-based filter that selects the top frequent labels from the nearest neighbors’ candidate sets and uses the softmax scores to identify the most plausible label, capturing structural regularities in the feature space. Second, we introduce a global selection objective based on optimal transport that maps the uniform sampling distribution to the candidate label distributions across a batch. By minimizing the expected transport cost, it can determine the most likely label assignments. These two strategies work together to provide robust label selection from both local and global perspectives. Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines. Those results highlight the merit of holistic label selection and offer a practical solution for prompt learning in weakly supervised settings.

[181] STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

Runze Wang, Yuxuan Song, Youcheng Cai, Ligang Liu

Main category: cs.CV

TL;DR: STAC proposes a spatio-temporally aware cache compression framework for streaming 3D reconstruction with large causal transformers, reducing memory by 10x and accelerating inference by 4x while maintaining reconstruction quality.

DetailsMotivation: Streaming 3D reconstruction requires long-term temporal consistency and efficient memory usage. Current causal VGGT variants use KV cache that grows linearly with stream length, creating memory bottlenecks that degrade reconstruction quality when cache eviction occurs under limited memory budgets.

Method: STAC framework with three components: 1) Working Temporal Token Caching using decayed cumulative attention scores to preserve long-term informative tokens; 2) Long-term Spatial Token Caching that compresses spatially redundant tokens into voxel-aligned representations; 3) Chunk-based Multi-frame Optimization for joint processing of consecutive frames to improve temporal coherence and GPU efficiency.
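
The decayed cumulative attention score used by Working Temporal Token Caching can be sketched as an exponential moving sum of per-token attention mass, with the lowest-scoring cache entries evicted at each step. The toy numbers and the decay value below are illustrative, not from the paper:

```python
import numpy as np

def update_and_evict(scores, attn_t, keep, decay=0.9):
    """Decayed cumulative attention: refresh each cached token's
    importance, then retain only the `keep` highest-scoring tokens."""
    scores = decay * scores + attn_t
    kept = np.sort(np.argsort(scores)[::-1][:keep])
    return scores, kept

scores = np.zeros(6)                 # six cached tokens
steps = [np.array([0.1, 0.1, 0.1, 0.4, 0.2, 0.1]),
         np.array([0.0, 0.1, 0.1, 0.5, 0.2, 0.1]),
         np.array([0.0, 0.0, 0.2, 0.5, 0.2, 0.1])]
for attn_t in steps:                 # attention mass from each new query
    scores, kept = update_and_evict(scores, attn_t, keep=3)
# tokens that stop receiving attention (0 and 1 here) decay away first
```

The decay keeps long-term informative tokens alive as long as they keep drawing attention, instead of evicting strictly by age.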

Result: Achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving scalability of real-time 3D reconstruction in streaming settings.

Conclusion: STAC effectively addresses memory bottlenecks in streaming 3D reconstruction by exploiting intrinsic spatio-temporal sparsity in attention mechanisms, enabling efficient real-time 3D reconstruction with large causal transformers.

Abstract: Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal variants of VGGT address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.

[182] Balancing Efficiency and Restoration: Lightweight Mamba-Based Model for CT Metal Artifact Reduction

Weikai Qu, Sijun Liang, Xianfeng Li, Cheng Pan, An Yan, Ahmed Elazab, Shanzhou Niu, Dong Zeng, Xiang Wan, Changmiao Wang

Main category: cs.CV

TL;DR: MARMamba is a CT image metal artifact reduction model using multi-scale Mamba architecture that eliminates metal artifacts while preserving anatomical structures without requiring sinogram data.

DetailsMotivation: Metal implants in CT imaging create severe artifacts that degrade image quality and diagnostic accuracy. Existing methods have three main problems: deterioration of organ/tissue structures, dependence on sinogram data, and poor resource-efficiency balance.

Method: Uses streamlined UNet architecture with multi-scale Mamba (MS-Mamba) core module. MS-Mamba includes flip mamba blocks that capture contextual information from multiple orientations, and average maximum feed-forward networks that integrate critical features with average features to suppress artifacts.
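
The multi-orientation idea behind the flip mamba block can be illustrated with a 1-D stand-in: run a causal scan forward and on the flipped sequence, then fuse, so each position aggregates context from both directions. The first-order recurrence below is a deliberate simplification of a selective state-space scan, not the actual Mamba kernel:

```python
import numpy as np

def causal_scan(x, alpha=0.8):
    """First-order recurrence h_t = alpha * h_{t-1} + x_t, standing in
    for a selective state-space scan."""
    h = np.zeros_like(x)
    acc = 0.0
    for t in range(len(x)):
        acc = alpha * acc + x[t]
        h[t] = acc
    return h

def flip_scan(x):
    """Scan forward and on the flipped sequence, then fuse, so every
    position sees context from both directions."""
    fwd = causal_scan(x)
    bwd = causal_scan(x[::-1])[::-1]
    return 0.5 * (fwd + bwd)

x = np.array([1.0, 0.0, 0.0, 1.0])
y = flip_scan(x)    # symmetric input -> symmetric output
```

A single causal scan only sees the past; fusing the flipped scan is what gives each pixel the "multiple orientations" context the model relies on for artifact suppression.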

Result: The model excels at reducing metal artifacts of different sizes while preserving the original anatomical structures. It achieves an optimal balance among computational demands, memory usage, and parameter count, outperforming other models.

Conclusion: MARMamba effectively eliminates metal artifacts in CT images without requiring additional input data, offering practical utility with efficient resource usage and superior artifact reduction capabilities.

Abstract: In computed tomography imaging, metal implants frequently generate severe artifacts that compromise image quality and hinder diagnostic accuracy. There are three main challenges in the existing methods: the deterioration of organ and tissue structures, dependence on sinogram data, and an imbalance between resource use and restoration efficiency. Addressing these issues, we introduce MARMamba, which effectively eliminates artifacts caused by metals of different sizes while maintaining the integrity of the original anatomical structures of the image. Furthermore, this model only focuses on CT images affected by metal artifacts, thus negating the requirement for additional input data. The model is a streamlined UNet architecture, which incorporates multi-scale Mamba (MS-Mamba) as its core module. Within MS-Mamba, a flip mamba block captures comprehensive contextual information by analyzing images from multiple orientations. Subsequently, the average maximum feed-forward network integrates critical features with average features to suppress the artifacts. This combination allows MARMamba to reduce artifacts efficiently. The experimental results demonstrate that our model excels in reducing metal artifacts, offering distinct advantages over other models. It also strikes an optimal balance between computational demands, memory usage, and the number of parameters, highlighting its practical utility in the real world. The code of the presented model is available at: https://github.com/RICKand-MORTY/MARMamba.

[183] WeatherRemover: All-in-one Adverse Weather Removal with Multi-scale Feature Map Compression

Weikai Qu, Sijun Liang, Cheng Pan, Zikuan Yang, Guanchi Zhou, Xianjun Fu, Bo Liu, Changmiao Wang, Ahmed Elazab

Main category: cs.CV

TL;DR: WeatherRemover: A lightweight multi-weather image restoration model using UNet-like architecture with gating mechanisms and multi-scale pyramid vision Transformer for efficient removal of rain, snow, and fog effects.

DetailsMotivation: Images taken in adverse weather conditions suffer from blurriness, occlusion, and low brightness due to rain, snow, and fog, which hinder computer vision tasks. Existing methods are either weather-specific or computationally expensive, lacking practical balance between restoration quality and efficiency.

Method: UNet-like architecture with gating mechanisms and multi-scale pyramid vision Transformer. Uses channel-wise attention from CNNs for feature extraction and linear spatial reduction to reduce attention computational costs. Gating mechanisms in feed-forward and downsampling phases selectively address redundancy and adaptively select essential data.
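
Linear spatial reduction can be illustrated by pooling the key/value tokens before attention, shrinking the softmax from N x N to N x (N/r). A single-head toy sketch; the shared key/value projection and the pooling factor are our simplifications, not the model's exact layers:

```python
import numpy as np

def sra(q, kv, r):
    """Spatial-reduction attention: average-pool the key/value tokens by
    factor r before the softmax, so attention costs O(N * N / r)."""
    N, d = kv.shape
    kv_red = kv.reshape(N // r, r, d).mean(axis=1)     # pooled keys/values
    logits = q @ kv_red.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)               # row-wise softmax
    return w @ kv_red

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))     # 8 query tokens, dim 4
kv = rng.standard_normal((8, 4))    # keys and values share one tensor here
out = sra(q, kv, r=4)               # attends over only 8 / 4 = 2 positions
```

Each query still produces a full-dimensional output, but it attends over r times fewer positions, which is where the memory and compute savings come from.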

Result: Achieves optimal balance between restoration quality, parameter efficiency, computational overhead, and memory usage compared to other multi-weather models, meeting practical application demands effectively.

Conclusion: WeatherRemover provides an efficient solution for multi-weather image restoration that balances performance and computational efficiency, making it suitable for practical applications where computational resources are limited.

Abstract: Photographs taken in adverse weather conditions often suffer from blurriness, occlusion, and low brightness due to interference from rain, snow, and fog. These issues can significantly hinder the performance of subsequent computer vision tasks, making the removal of weather effects a crucial step in image enhancement. Existing methods primarily target specific weather conditions, with only a few capable of handling multiple weather scenarios. However, mainstream approaches often overlook performance considerations, resulting in large parameter sizes, long inference times, and high memory costs. In this study, we introduce the WeatherRemover model, designed to enhance the restoration of images affected by various weather conditions while balancing performance. Our model adopts a UNet-like structure with a gating mechanism and a multi-scale pyramid vision Transformer. It employs channel-wise attention derived from convolutional neural networks to optimize feature extraction, while linear spatial reduction helps curtail the computational demands of attention. The gating mechanisms, strategically placed within the feed-forward and downsampling phases, refine the processing of information by selectively addressing redundancy and mitigating its influence on learning. This approach facilitates the adaptive selection of essential data, ensuring superior restoration and maximizing efficiency. Additionally, our lightweight model achieves an optimal balance between restoration quality, parameter efficiency, computational overhead, and memory usage, distinguishing it from other multi-weather models, thereby meeting practical application demands effectively. The source code is available at https://github.com/RICKand-MORTY/WeatherRemover.

[184] Variational Feature Compression for Model-Specific Representations

Zinan Guo, Zihan Wang, Chuan Yan, Liuhuo Wan, Ethan Ma, Guangdong Bai

Main category: cs.CV

TL;DR: A privacy-preserving feature extraction framework that suppresses cross-model transfer while maintaining accuracy for a designated classifier, using variational latent bottleneck with dynamic binary masking.

DetailsMotivation: Address input repurposing concerns in shared/cloud-based deep learning inference where data submitted for one task could be reused by unauthorized models for other purposes. Existing privacy defenses focus on data access control but lack control over downstream uses of released representations.

Method: Uses variational latent bottleneck trained with task-driven cross-entropy and KL regularization (no pixel-level reconstruction). Applies dynamic binary mask computed from per-dimension KL divergence and gradient-based saliency w.r.t. frozen target model to suppress latent dimensions uninformative for intended task. White-box training with gradient access, but inference only requires forward pass.
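
The per-dimension KL used for masking has a closed form for a diagonal Gaussian against the standard normal prior: KL_i = 0.5 * (mu_i^2 + sigma_i^2 - log sigma_i^2 - 1). A sketch of the KL half of the mask; the paper combines it with gradient-based saliency, which is omitted here, and the threshold is an assumption:

```python
import numpy as np

def kl_per_dim(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) per latent dimension."""
    return 0.5 * (mu**2 + np.exp(log_var) - log_var - 1.0)

def kl_mask(mu, log_var, thresh=0.05):
    """Keep only dimensions whose KL exceeds a threshold: dimensions that
    collapsed to the prior carry no task information."""
    return kl_per_dim(mu, log_var) > thresh

mu      = np.array([0.0, 1.5,  0.0,  -2.0])
log_var = np.array([0.0, -1.0, 0.01, -2.0])
mask = kl_mask(mu, log_var)    # -> [False, True, False, True]
```

Dimensions with mu near 0 and sigma near 1 are indistinguishable from the prior, so zeroing them removes capacity an unintended classifier could repurpose while costing the designated task almost nothing.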

Result: On CIFAR-100, processed representations retain strong utility for designated classifier while reducing accuracy of all unintended classifiers to below 2%, achieving suppression ratio exceeding 45x relative to unintended models. Preliminary results on CIFAR-10, Tiny ImageNet, and Pascal VOC show approach extends across task settings.

Conclusion: Proposed framework effectively suppresses cross-model transfer while preserving accuracy for intended tasks, addressing privacy concerns in shared inference environments. Further evaluation needed for robustness against adaptive adversaries.

Abstract: As deep learning inference is increasingly deployed in shared and cloud-based settings, a growing concern is input repurposing, in which data submitted for one task is reused by unauthorized models for another. Existing privacy defenses largely focus on restricting data access, but provide limited control over what downstream uses a released representation can still support. We propose a feature extraction framework that suppresses cross-model transfer while preserving accuracy for a designated classifier. The framework employs a variational latent bottleneck, trained with a task-driven cross-entropy objective and KL regularization, but without any pixel-level reconstruction loss, to encode inputs into a compact latent space. A dynamic binary mask, computed from per-dimension KL divergence and gradient-based saliency with respect to the frozen target model, suppresses latent dimensions that are uninformative for the intended task. Because saliency computation requires gradient access, the encoder is trained in a white-box setting, whereas inference requires only a forward pass through the frozen target model. On CIFAR-100, the processed representations retain strong utility for the designated classifier while reducing the accuracy of all unintended classifiers to below 2%, yielding a suppression ratio exceeding 45 times relative to unintended models. Preliminary experiments on CIFAR-10, Tiny ImageNet, and Pascal VOC provide exploratory evidence that the approach extends across task settings, although further evaluation is needed to assess robustness against adaptive adversaries.

[185] Controllable Generative Video Compression

Ding Ding, Daowen Li, Ying Chen, Yixin Gao, Ruixiao Dong, Kai Li, Li Li

Main category: cs.CV

TL;DR: CGVC is a controllable generative video compression method that uses keyframes and dense control priors to balance perceptual quality and signal fidelity in video compression.

Motivation: Traditional perceptual video compression improves perceptual realism but sacrifices signal fidelity, creating a dilemma between perception and faithful reproduction of visual signals. The paper aims to alleviate this trade-off.

Method: Proposes Controllable Generative Video Compression (CGVC) paradigm using coded keyframes as structural priors and dense per-frame control priors to guide non-keyframe generation. Includes a color-distance-guided keyframe selection algorithm for accurate color recovery.

Result: CGVC outperforms previous perceptual video compression methods in terms of both signal fidelity and perceptual quality.

Conclusion: The CGVC paradigm successfully addresses the perception-fidelity dilemma in video compression through controllable generative modeling with visual priors.

Abstract: Perceptual video compression adopts generative video modeling to improve perceptual realism but frequently sacrifices signal fidelity, diverging from the goal of video compression to faithfully reproduce the visual signal. To alleviate the dilemma between perception and fidelity, in this paper we propose the Controllable Generative Video Compression (CGVC) paradigm to faithfully generate details guided by multiple visual conditions. Under this paradigm, representative keyframes of the scene are coded and used to provide structural priors for non-keyframe generation. A dense per-frame control prior is additionally coded to better preserve the finer structure and semantics of each non-keyframe. Guided by these priors, non-keyframes are reconstructed by a controllable video generation model with temporal and content consistency. Furthermore, to accurately recover the color information of the video, we develop a color-distance-guided keyframe selection algorithm to adaptively choose keyframes. Experimental results show CGVC outperforms previous perceptual video compression methods in terms of both signal fidelity and perceptual quality.
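As a rough illustration of color-distance-guided keyframe selection, the sketch below greedily picks frames that are maximally distant from already chosen keyframes in a coarse color-histogram space. The histogram descriptor and the greedy farthest-point rule are assumptions, not the paper's exact algorithm:

```python
import numpy as np

def color_histogram(frame, bins=8):
    # Coarse RGB histogram as a compact color descriptor (hypothetical choice).
    h, _ = np.histogramdd(frame.reshape(-1, 3), bins=(bins,) * 3,
                          range=[(0, 256)] * 3)
    h = h.ravel()
    return h / h.sum()

def select_keyframes(frames, k):
    # Greedy farthest-point selection in color-histogram space: each new
    # keyframe maximizes its L1 distance to the closest one already chosen,
    # so the keyframe set covers the video's color variation.
    hists = np.stack([color_histogram(f) for f in frames])
    chosen = [0]
    while len(chosen) < k:
        d = np.min([np.abs(hists - hists[c]).sum(axis=1) for c in chosen],
                   axis=0)
        d[chosen] = -1  # never re-pick an existing keyframe
        chosen.append(int(np.argmax(d)))
    return sorted(chosen)
```

Covering the color space with keyframes is what lets the generator recover color for the non-keyframes in between.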

[186] GPAFormer: Graph-guided Patch Aggregation Transformer for Efficient 3D Medical Image Segmentation

Chung-Ming Lo, I-Yun Liu, Wei-Yang Lin

Main category: cs.CV

TL;DR: GPAFormer: A lightweight 3D medical image segmentation network using multi-scale attention and graph aggregation for efficient multi-organ segmentation across CT and MRI modalities.

Motivation: Address challenges in 3D medical image segmentation including modality diversity, high-dimensional data, anatomical heterogeneity, and the need for both accuracy and computational efficiency in multi-organ segmentation tasks.

Method: Proposes GPAFormer with two core modules: MASA (multi-scale attention-guided stacked aggregation) using parallel paths with different receptive fields, and MPGA (mutual-aware patch graph aggregator) that dynamically aggregates regions based on feature similarity and spatial adjacency.

Result: Achieved state-of-the-art DSC scores on multiple datasets (BTCV: 75.70%, Synapse: 81.20%, ACDC: 89.32%, BraTS: 82.74%) with only 1.81M parameters and sub-second inference time on consumer GPUs.

Conclusion: GPAFormer effectively balances accuracy and efficiency for multi-organ, multi-modality 3D segmentation, making it suitable for resource-constrained and time-sensitive clinical environments.

Abstract: Deep learning has been widely applied to 3D medical image segmentation tasks. However, due to the diversity of imaging modalities, the high-dimensional nature of the data, and the heterogeneity of anatomical structures, achieving both segmentation accuracy and computational efficiency in multi-organ segmentation remains a challenge. This study proposed GPAFormer, a lightweight network architecture specifically designed for 3D medical image segmentation, emphasizing efficiency while maintaining high accuracy. GPAFormer incorporated two core modules: the multi-scale attention-guided stacked aggregation (MASA) and the mutual-aware patch graph aggregator (MPGA). MASA utilized three parallel paths with different receptive fields, combined through planar aggregation, to enhance the network’s capability in handling structures of varying sizes. MPGA employed a graph-guided approach to dynamically aggregate regions with similar feature distributions based on inter-patch feature similarity and spatial adjacency, thereby improving the discrimination of both the internal and boundary structures of organs. Experiments were performed on public whole-body CT and MRI datasets including BTCV, Synapse, ACDC, and BraTS. Compared to existing 3D segmentation networks, GPAFormer, using only 1.81 M parameters, achieved the overall highest DSC on BTCV (75.70%), Synapse (81.20%), ACDC (89.32%), and BraTS (82.74%). On a consumer-level GPU, inference for one BTCV validation case took less than one second. The results demonstrated that GPAFormer balanced accuracy and efficiency in multi-organ, multi-modality 3D segmentation tasks across various clinical scenarios, especially resource-constrained and time-sensitive clinical environments.
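A minimal sketch of graph-guided patch aggregation in the spirit of MPGA: affinities combine feature similarity with spatial adjacency, and one aggregation round mixes each patch with its neighbors. The Gaussian kernels and their bandwidths are hypothetical choices, not the paper's actual formulation:

```python
import numpy as np

def patch_affinity(feats, coords, sigma_f=1.0, sigma_s=1.0):
    # feats: (N, D) patch features; coords: (N, 2) patch grid positions.
    # Affinity is high only when patches are both similar in feature space
    # and spatially adjacent; rows are normalized to sum to one.
    df = ((feats[:, None] - feats[None]) ** 2).sum(-1)
    ds = ((coords[:, None] - coords[None]) ** 2).sum(-1)
    A = np.exp(-df / (2 * sigma_f**2)) * np.exp(-ds / (2 * sigma_s**2))
    np.fill_diagonal(A, 0.0)        # aggregate from neighbors, not self
    return A / A.sum(axis=1, keepdims=True)

def aggregate(feats, A):
    # One round of graph aggregation: each patch mixes in its neighbors.
    return A @ feats
```

Coupling similarity with adjacency keeps aggregation inside coherent organ regions instead of mixing across distant, unrelated patches.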

[187] Towards Robust Content Watermarking Against Removal and Forgery Attacks

Yifan Zhu, Yihan Wang, Xiao-Shan Gao

Main category: cs.CV

TL;DR: ISTS: Instance-Specific watermarking with Two-Sided detection for text-to-image diffusion models that resists removal and forgery attacks through dynamic injection based on prompt semantics and enhanced detection.

Motivation: Content generation raises copyright, provenance, and attribution concerns. While watermarking for diffusion models has been studied, existing techniques are vulnerable to adversarial attacks like removal and forgery attacks.

Method: ISTS paradigm with dynamic injection control based on prompt semantics (injection time and patterns) and two-sided detection approach for enhanced robustness.

Result: Experiments demonstrate superiority against removal and forgery attacks compared to existing watermarking techniques.

Conclusion: ISTS provides effective watermarking for text-to-image diffusion models that resists adversarial attacks through instance-specific dynamic injection and two-sided detection.

Abstract: Generated content has raised serious concerns about copyright protection, image provenance, and credit attribution. A potential solution to these problems is watermarking. Recently, content watermarking for text-to-image diffusion models has been studied extensively for its effective detection utility and robustness. However, these watermarking techniques are vulnerable to potential adversarial attacks, such as removal attacks and forgery attacks. In this paper, we build a novel watermarking paradigm called Instance-Specific watermarking with Two-Sided detection (ISTS) to resist removal and forgery attacks. Specifically, we introduce a strategy that dynamically controls the injection time and watermarking patterns based on the semantics of users’ prompts. Furthermore, we propose a new two-sided detection approach to enhance robustness in watermark detection. Experiments have demonstrated the superiority of our watermarking against removal and forgery attacks.

[188] VDPP: Video Depth Post-Processing for Speed and Scalability

Daewon Yoon, Injun Baek, Sangyu Han, Yearim Kim, Nojun Kwak

Main category: cs.CV

TL;DR: VDPP is a video depth post-processing framework that achieves real-time performance (>43.5 FPS) on edge devices by focusing on geometric refinement in low-resolution space rather than full scene reconstruction, enabling immediate integration with any evolving image depth model.

Motivation: Current end-to-end video depth models suffer from adaptation lag when new single-image depth estimators are released, while existing post-processing methods struggle with speed, accuracy, and RGB reliance. There's a need for a practical, modular solution that can keep pace with evolving image depth models.

Method: VDPP shifts from computationally expensive scene reconstruction to targeted geometric refinement in low-resolution space. It uses dense residual learning to drive geometric representations rather than full reconstructions, and operates with an RGB-free architecture for true scalability.

Result: Achieves >43.5 FPS on NVIDIA Jetson Orin Nano, matches temporal coherence of end-to-end systems, provides superior balance of speed, accuracy, and memory efficiency, and enables immediate integration with any evolving image depth model.

Conclusion: VDPP revitalizes post-processing for video depth estimation by offering a practical, modular solution that combines real-time performance on edge devices with the ability to leverage any state-of-the-art image depth model without retraining.

Abstract: Video depth estimation is essential for providing 3D scene structure in applications ranging from autonomous driving to mixed reality. Although current end-to-end (E2E) video depth models have achieved state-of-the-art performance, they function as tightly coupled systems that suffer from a significant adaptation lag whenever superior single-image depth estimators are released. To mitigate this issue, post-processing methods such as NVDS offer a modular plug-and-play alternative that can incorporate any evolving image depth model without retraining. However, existing post-processing methods still struggle to match the efficiency and practicality of E2E systems due to limited speed, accuracy, and RGB reliance. In this work, we revitalize the role of post-processing by proposing VDPP (Video Depth Post-Processing), a framework that improves the speed and accuracy of post-processing methods for video depth estimation. By shifting the paradigm from computationally expensive scene reconstruction to targeted geometric refinement, VDPP operates purely on geometric refinements in low-resolution space. This design achieves exceptional speed (>43.5 FPS on NVIDIA Jetson Orin Nano) while matching the temporal coherence of E2E systems, with dense residual learning driving geometric representations rather than full reconstructions. Furthermore, VDPP’s RGB-free architecture ensures true scalability, enabling immediate integration with any evolving image depth model. Our results demonstrate that VDPP provides a superior balance of speed, accuracy, and memory efficiency, making it the most practical solution for real-time edge deployment. Our project page is at https://github.com/injun-baek/VDPP
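The shift from reconstruction to low-resolution residual refinement can be sketched as below. The strided downsampling, the nearest-neighbor upsampling, and the `predict_residual` stub are placeholders for the learned components, not VDPP's actual architecture:

```python
import numpy as np

def refine_depth(depth_hr, predict_residual, scale=4):
    # Downsample depth, apply a dense learned residual in low-resolution
    # space, then upsample back: geometric refinement, not reconstruction.
    h, w = depth_hr.shape
    lr = depth_hr[::scale, ::scale]       # cheap low-res view of the depth
    lr = lr + predict_residual(lr)        # learned correction (stub here)
    # nearest-neighbor upsample as a minimal stand-in for bilinear
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)[:h, :w]
```

Because all learned work happens at the low resolution, per-frame cost scales with the small grid rather than the full image, which is where the edge-device speedup comes from.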

[189] RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection

Hui Li, Peien Ding, Jun Li, Guoqi Ma, Zhanyu Liu, Ge Xu, Junfeng Yao, Jinsong Su

Main category: cs.CV

TL;DR: A retrieval-augmented semantic reasoning framework for multimodal fake news video detection that uses cross-instance semantic parsing, domain-guided reasoning with expert MLLMs, and multi-view feature fusion.

Motivation: Existing multimodal fake news detection methods lack cross-instance global semantic correlations and struggle with domain transfer due to semantic discrepancies, limiting their ability to use historical evidence and domain-specific knowledge.

Method: Proposes RASR framework with three components: 1) Cross-instance Semantic Parser and Retriever (CSPR) that deconstructs videos into semantic primitives and retrieves relevant evidence; 2) Domain-Guided Multimodal Reasoning (DGMP) that uses domain priors to drive expert multimodal LLMs for analysis; 3) Multi-View Feature Decoupling and Fusion (MVDFF) with adaptive gating for robust feature integration.

Result: Extensive experiments on FakeSV and FakeTT datasets show RASR significantly outperforms SOTA baselines, achieves superior cross-domain generalization, and improves detection accuracy by up to 0.93%.

Conclusion: The RASR framework effectively addresses limitations in existing multimodal fake news detection by incorporating cross-instance semantic reasoning, domain guidance, and adaptive feature fusion, demonstrating strong performance and generalization capabilities.

Abstract: Multimodal fake news video detection is a crucial research direction for maintaining the credibility of online information. Existing studies primarily verify content authenticity by constructing multimodal feature fusion representations or utilizing pre-trained language models to analyze video-text consistency. However, these methods still face the following limitations: (1) lacking cross-instance global semantic correlations, making it difficult to effectively utilize historical associative evidence to verify the current video; (2) semantic discrepancies across domains hinder the transfer of general knowledge, lacking the guidance of domain-specific expert knowledge. To this end, we propose a novel Retrieval-Augmented Semantic Reasoning (RASR) framework. First, a Cross-instance Semantic Parser and Retriever (CSPR) deconstructs the video into high-level semantic primitives and retrieves relevant associative evidence from a dynamic memory bank. Subsequently, a Domain-Guided Multimodal Reasoning (DGMP) module incorporates domain priors to drive an expert multimodal large language model in generating domain-aware, in-depth analysis reports. Finally, a Multi-View Feature Decoupling and Fusion (MVDFF) module integrates multi-dimensional features through an adaptive gating mechanism to achieve robust authenticity determination. Extensive experiments on the FakeSV and FakeTT datasets demonstrate that RASR significantly outperforms state-of-the-art baselines, achieves superior cross-domain generalization, and improves the overall detection accuracy by up to 0.93%.
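Retrieval of associative evidence from a dynamic memory bank can be sketched as plain cosine-similarity search over embeddings. This is a generic stand-in, not the paper's semantic-primitive retriever:

```python
import numpy as np

def retrieve(query, memory, top_k=3):
    # query: (D,) embedding of the current video's semantics;
    # memory: (N, D) bank of embeddings for previously seen evidence.
    # Returns the indices and cosine similarities of the top_k matches.
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(sims)[::-1][:top_k]
    return order, sims[order]
```

The retrieved items would then be handed to the reasoning module as historical context for the current verdict.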

[190] Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation

Jianing Zhang, Runan Li, Honglin Pang, Ding Xia, Zhou Zhu, Qian Zhang, Chuntao Li, Xi Yang

Main category: cs.CV

TL;DR: A novel agent-driven Vision-Language Model framework for deciphering ancient Chinese Oracle Bone Script by leveraging structural components and semantic reasoning, outperforming traditional image recognition approaches.

Motivation: Existing approaches treat Oracle Bone Script decipherment as closed-set image recognition, failing to address the "interpretation gap" where unique characters are composed of recurring pictographic components with transferable semantic meanings.

Method: Proposes an agent-driven VLM framework integrating a VLM for visual grounding with an LLM-based agent for automated reasoning chain: component identification, graph-based knowledge retrieval, and relationship inference for linguistically accurate interpretation. Also introduces OB-Radix dataset with expert annotations.

Result: The framework yields more detailed and precise decipherments compared to baseline methods across three benchmarks of different tasks.

Conclusion: The proposed agent-driven VLM framework effectively bridges the interpretation gap in Oracle Bone Script decipherment by leveraging structural logic and semantic reasoning through component-based analysis.

Abstract: Deciphering ancient Chinese Oracle Bone Script (OBS) is a challenging task that offers insights into the beliefs, systems, and culture of the ancient era. Existing approaches treat decipherment as a closed-set image recognition problem, which fails to bridge the "interpretation gap": while individual characters are often unique and rare, they are composed of a limited set of recurring, pictographic components that carry transferable semantic meanings. To leverage this structural logic, we propose an agent-driven Vision-Language Model (VLM) framework that integrates a VLM for precise visual grounding with an LLM-based agent to automate a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference for linguistically accurate interpretation. To support this, we also introduce OB-Radix, an expert-annotated dataset providing structural and semantic data absent from prior corpora, comprising 1,022 character images (934 unique characters) and 1,853 fine-grained component images across 478 distinct components with verified explanations. By evaluating our system across three benchmarks of different tasks, we demonstrate that our framework yields more detailed and precise decipherments compared to baseline methods.

[191] Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency

Ke Jin, Jiming Chen, Qi Ye

Main category: cs.CV

TL;DR: A novel semi-dense image matching pipeline with scale-aware coarse matching and flow-based fine matching for improved robustness and accuracy.

Motivation: Existing semi-dense image matching methods suffer from two key issues: 1) Over-exclusion in mutual nearest neighbor matching at coarse stage, making them struggle with scale differences between images, and 2) Neglect of local consistency at fine stage, undermining robustness.

Method: Proposes a two-stage approach: 1) Scale-aware matching module at coarse stage that exploits score matrix hints to indicate scale ratio, and 2) Reformulating fine stage as cascaded flow refinement with gradient loss to encourage local consistency of flow field.

Result: Extensive experiments demonstrate robust and accurate matching performance on downstream tasks with the proposed modifications.

Conclusion: The novel matching pipeline with scale-aware coarse matching and flow-based fine matching effectively addresses long-standing issues in semi-dense image matching, achieving improved performance.

Abstract: Recent semi-dense image matching methods have achieved remarkable success, but two long-standing issues still impair their performance. At the coarse stage, the over-exclusion issue of their mutual nearest neighbor (MNN) matching layer makes them struggle to handle cases with scale difference between images. To this end, we comprehensively revisit the matching mechanism and make a key observation that the hint concealed in the score matrix can be exploited to indicate the scale ratio. Based on this, we propose a scale-aware matching module which is exceptionally effective but introduces negligible overhead. At the fine stage, we point out that existing methods neglect the local consistency of final matches, which undermines their robustness. To this end, rather than independently predicting the correspondence for each source pixel, we reformulate the fine stage as a cascaded flow refinement problem and introduce a novel gradient loss to encourage local consistency of the flow field. Extensive experiments demonstrate that our novel matching pipeline, with these proposed modifications, achieves robust and accurate matching performance on downstream tasks.
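The gradient loss on the flow field can be approximated with first-order finite differences, penalizing neighboring correspondences that move incoherently. The exact weighting and loss form in the paper may differ:

```python
import numpy as np

def flow_gradient_loss(flow):
    # flow: (H, W, 2) dense correspondence field. Penalize first-order
    # spatial gradients so neighboring pixels map coherently instead of
    # each being predicted independently (local consistency).
    dx = np.abs(flow[:, 1:] - flow[:, :-1])   # horizontal differences
    dy = np.abs(flow[1:, :] - flow[:-1, :])   # vertical differences
    return dx.mean() + dy.mean()
```

A locally constant flow incurs zero loss, so the term only pushes against jagged, inconsistent correspondences rather than against large but smooth motion.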

[192] HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation

Md Aminur Hossain, Ayush V. Patel, Siddhant Gole, Sanjay K. Singh, Biplab Banerjee

Main category: cs.CV

TL;DR: HQF-Net: A hybrid quantum-classical network for remote sensing semantic segmentation that combines frozen DINOv3 ViT features with U-Net using quantum-enhanced components.

Motivation: Remote sensing segmentation requires capturing both fine spatial details and high-level semantics. Classical encoder-decoder architectures like U-Net struggle with global semantics and structured feature interactions, motivating a hybrid quantum-classical approach.

Method: HQF-Net integrates multi-scale semantic guidance from frozen DINOv3 ViT-L/16 backbone with customized U-Net via Deformable Multiscale Cross-Attention Fusion (DMCAF). Uses quantum-enhanced skip connections (QSkip) and Quantum bottleneck with Mixture-of-Experts (QMoE) combining local, global, and directional quantum circuits with adaptive routing.

Result: Achieves 0.8568 mIoU and 96.87% overall accuracy on LandCover.ai, 71.82% mIoU on OpenEarthMap, and 55.28% mIoU with 99.37% overall accuracy on SeasoNet. Ablation studies confirm contributions of each component.

Conclusion: Structured hybrid quantum-classical feature processing is promising for improving remote sensing semantic segmentation under near-term quantum constraints.

Abstract: Remote sensing semantic segmentation requires models that can jointly capture fine spatial details and high-level semantic context across complex scenes. While classical encoder-decoder architectures such as U-Net remain strong baselines, they often struggle to fully exploit global semantics and structured feature interactions. In this work, we propose HQF-Net, a hybrid quantum-classical multi-scale fusion network for remote sensing image segmentation. HQF-Net integrates multi-scale semantic guidance from a frozen DINOv3 ViT-L/16 backbone with a customized U-Net architecture through a Deformable Multiscale Cross-Attention Fusion (DMCAF) module. To enhance feature refinement, the framework further introduces quantum-enhanced skip connections (QSkip) and a Quantum bottleneck with Mixture-of-Experts (QMoE), which combines complementary local, global, and directional quantum circuits within an adaptive routing mechanism. Experiments on three remote sensing benchmarks show consistent improvements with the proposed design. HQF-Net achieves 0.8568 mIoU and 96.87% overall accuracy on LandCover.ai, 71.82% mIoU on OpenEarthMap, and 55.28% mIoU with 99.37% overall accuracy on SeasoNet. An architectural ablation study further confirms the contribution of each major component. These results show that structured hybrid quantum-classical feature processing is a promising direction for improving remote sensing semantic segmentation under near-term quantum constraints.

[193] Exploring 6D Object Pose Estimation with Deformation

Zhiqiang Liu, Rui Song, Duanmu Chuangqi, Jiaojiao Li, David Ferstl, Yinlin Hu

Main category: cs.CV

TL;DR: DeSOPE is a large-scale dataset for 6DoF deformed object pose estimation, addressing the limitation of existing methods that assume rigid or articulated objects by providing 26 object categories with canonical and deformed configurations, plus 133K RGB-D frames with 665K pose annotations.

Motivation: Most 6D object pose estimation methods assume rigid or articulated objects, but real-world objects often deviate from canonical shapes due to wear, impact, or deformation. Current datasets don't adequately address this challenge, limiting practical applications.

Method: Created DeSOPE dataset with high-fidelity 3D scans of 26 object categories in canonical and three deformed states. Developed semi-automatic annotation pipeline: 2D mask annotation → initial pose estimation using object pose method → refinement via object-level SLAM → manual verification → final 665K pose annotations across 133K RGB-D frames.

Result: Evaluation shows existing object pose methods suffer significant performance degradation with increasing deformation, highlighting the need for deformation-robust approaches. The dataset enables benchmarking and development of methods that handle object deformations.

Conclusion: Robust handling of object deformations is critical for practical 6D pose estimation. DeSOPE provides the first large-scale dataset addressing this challenge, enabling research into deformation-aware pose estimation methods.

Abstract: We present DeSOPE, a large-scale dataset for 6DoF deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications. The project page and dataset are available at https://desope-6d.github.io/

[194] Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

Jiahua Chen, Qihong Tang, Weinong Wang, Qi Fan

Main category: cs.CV

TL;DR: A training-free framework that enhances MLLMs’ 3D spatial reasoning through explicit 3D reconstruction and novel view synthesis, outperforming specialized models on spatial reasoning benchmarks.

Motivation: Current MLLMs struggle with complex 3D spatial reasoning due to reliance on 2D visual priors. Existing solutions are either computationally expensive (post-training on limited 3D data) or use rigid tool-calling mechanisms lacking geometric understanding and viewpoint flexibility.

Method: Proposes a training-free framework with Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. Pipeline: 1) Reconstructs high-fidelity 3D mesh from single image using MLLM-guided keyword extraction and mask generation at multiple granularities, 2) Leverages external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views to emulate human perspective-taking.

Result: Extensive experiments show significant enhancement in spatial comprehension. Framework outperforms specialized spatial models and general-purpose MLLMs (including GPT-5.2 and Gemini-2.5-Flash) on major benchmarks like 3DSRBench and Rel3D.

Conclusion: The training-free framework successfully addresses MLLMs’ limitations in 3D spatial reasoning by introducing explicit 3D reconstruction and novel view synthesis, demonstrating superior performance without expensive retraining.

Abstract: Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to the reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.
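Turning a chosen viewpoint into camera extrinsics is a standard look-at construction, sketched below (assuming the view direction is not parallel to the up vector). How the framework scores and iterates over candidate viewpoints is not reproduced here:

```python
import numpy as np

def look_at(eye, target, up=np.array([0., 1., 0.])):
    # Build a 4x4 world-to-camera extrinsic [R|t] that places the camera at
    # `eye` looking toward `target`, camera looking down its -z axis.
    f = target - eye
    f = f / np.linalg.norm(f)            # forward direction
    r = np.cross(f, up)
    r = r / np.linalg.norm(r)            # camera right axis
    u = np.cross(r, f)                   # recomputed orthogonal up axis
    R = np.stack([r, u, -f])             # camera axes as rotation rows
    t = -R @ eye
    E = np.eye(4)
    E[:3, :3], E[:3, 3] = R, t
    return E
```

Each candidate extrinsic would then drive novel-view synthesis, emulating the perspective-taking the paper describes.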

[195] DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting

Hantang Li, Qiang Zhu, Xiandong Meng, Debin Zhao, Xiaopeng Fan

Main category: cs.CV

TL;DR: DOC-GS addresses sparse-view 3DGS reconstruction artifacts by modeling Gaussian primitive reliability through dual-domain optimization and observation calibration.

Motivation: Sparse-view 3D Gaussian Splatting suffers from overfitting and artifacts due to insufficient geometric supervision, with unreliable Gaussians accumulating as haze-like degradations.

Method: Proposes Dual-domain Observation and Calibration (DOC-GS) framework: 1) Optimization domain uses Continuous Depth-Guided Dropout to model Gaussian reliability, 2) Observation domain uses Dark Channel Prior to identify floater artifacts, and 3) Reliability-driven geometric pruning removes low-confidence Gaussians.

Result: The method effectively reduces structural distortions and translucent haze-like artifacts in sparse-view 3DGS reconstruction by identifying and removing unreliable Gaussian primitives.

Conclusion: DOC-GS provides a unified framework for addressing sparse-view 3DGS artifacts by modeling Gaussian reliability through optimization-domain inductive bias and observation-domain evidence.

Abstract: Sparse-view reconstruction with 3D Gaussian Splatting (3DGS) is fundamentally ill-posed due to insufficient geometric supervision, often leading to severe overfitting and the emergence of structural distortions and translucent haze-like artifacts. While existing approaches attempt to alleviate this issue via dropout-based regularization, they are largely heuristic and lack a unified understanding of artifact formation. In this paper, we revisit sparse-view 3DGS reconstruction from a new perspective and identify the core challenge as the unobservability of Gaussian primitive reliability. Unreliable Gaussians are insufficiently constrained during optimization and accumulate as haze-like degradations in rendered images. Motivated by this observation, we propose a unified Dual-domain Observation and Calibration (DOC-GS) framework that models and corrects Gaussian reliability through the synergy of optimization-domain inductive bias and observation-domain evidence. Specifically, in the optimization domain, we characterize Gaussian reliability by the degree to which each primitive is constrained during training, and instantiate this signal via a Continuous Depth-Guided Dropout (CDGD) strategy, where the dropout probability serves as an explicit proxy for primitive reliability. This imposes a smooth depth-aware inductive bias to suppress weakly constrained Gaussians and improve optimization stability. In the observation domain, we establish a connection between floater artifacts and atmospheric scattering, and leverage the Dark Channel Prior (DCP) as a structural consistency cue to identify and accumulate anomalous regions. Based on cross-view aggregated evidence, we further design a reliability-driven geometric pruning strategy to remove low-confidence Gaussians.
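The Dark Channel Prior cue the paper leverages is simple to compute: a per-pixel minimum over color channels followed by a local minimum filter. Haze-free regions tend toward zero while haze-like floaters do not. The patch size is an illustrative choice:

```python
import numpy as np

def dark_channel(img, patch=3):
    # img: (H, W, 3) image in [0, 1]. The dark channel takes the per-pixel
    # minimum over RGB, then a local minimum over a patch x patch window.
    mc = img.min(axis=2)
    h, w = mc.shape
    pad = patch // 2
    p = np.pad(mc, pad, mode='edge')
    out = np.empty_like(mc)
    for i in range(h):
        for j in range(w):
            out[i, j] = p[i:i + patch, j:j + patch].min()
    return out
```

Regions where this statistic stays high across views are candidates for the floater artifacts that the reliability-driven pruning then removes.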

[196] Implantable Adaptive Cells: A Novel Enhancement for Pre-Trained U-Nets in Medical Image Segmentation

Emil Benedykciuk, Marcin Denkowski, Grzegorz Wójcik

Main category: cs.CV

TL;DR: A gradient-based Neural Architecture Search method called Implantable Adaptive Cell (IAC) that enhances pre-trained medical image segmentation models by injecting small modules into skip connections without full retraining.

Motivation: To improve performance of existing medical image segmentation models (like U-Net) without costly complete retraining or architecture overhaul, using a more efficient NAS approach.

Method: Uses Partially-Connected DARTS to identify small Implantable Adaptive Cells (IACs) that can be injected into skip connections of pre-trained U-shaped models, refining architectures without full retraining.

Result: Consistent accuracy improvements of ~5 percentage points across four medical datasets (MRI/CT), with best cases reaching up to 11% improvement on various U-Net configurations.

Conclusion: Provides cost-effective performance upgrades for existing models and shows potential for broader architectural and domain applications beyond medical imaging.

Abstract: This paper introduces a novel approach to enhance the performance of pre-trained neural networks in medical image segmentation using gradient-based Neural Architecture Search (NAS) methods. We present the concept of the Implantable Adaptive Cell (IAC): small modules, identified through a Partially-Connected DARTS based approach, designed to be injected into the skip connections of an existing, already trained U-shaped model. Unlike traditional NAS methods, our approach refines existing architectures without full retraining. Experiments on four medical datasets with MRI and CT images show consistent accuracy improvements on various U-Net configurations, with segmentation accuracy gains of approximately 5 percentage points across all validation datasets and improvements reaching up to 11 percentage points in the best-performing cases. The findings of this study not only offer a cost-effective alternative to the complete overhaul of complex models for performance upgrades but also indicate the potential applicability of our method to other architectures and problem domains.
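One common way to inject a module into a skip connection without disturbing a pre-trained model is an identity-initialized residual branch. The sketch below is a hypothetical stand-in for the searched IAC, not its actual DARTS-derived structure:

```python
import numpy as np

class IdentityInitCell:
    # A toy adaptive cell to be inserted into a skip connection. Its
    # residual branch starts at zero, so at injection time the cell passes
    # features through unchanged and the pre-trained model is unaffected;
    # only subsequent training of the cell changes the output.
    def __init__(self, dim):
        self.w = np.zeros((dim, dim))   # zero-initialized residual weights

    def __call__(self, x):
        # x: (batch, dim) skip-connection features.
        return x + x @ self.w
```

This identity-at-init property is what makes "implant without full retraining" safe: the frozen network's behavior is the cell's starting point.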

[197] LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video

Pedro Quesado, Erkut Akdag, Yasaman Kashefbahrami, Willem Menu, Egor Bondarev

Main category: cs.CV

TL;DR: LiveStre4m enables real-time novel view synthesis from unposed multi-view video streams using feed-forward transformers and diffusion models, achieving 0.07s per-frame processing.

DetailsMotivation: Existing dynamic scene representation methods require ground-truth camera parameters and lengthy optimizations (≈2.67s), making them unsuitable for live streaming applications.

Method: Multi-view vision transformer for keyframe 3D reconstruction, diffusion-transformer interpolation for temporal consistency, and Camera Pose Predictor module to estimate poses/intrinsics from RGB images.

Result: Achieves 0.07s per-frame processing at 1024×768 resolution, outperforming optimization-based methods by orders of magnitude while using only two synchronized unposed input streams.

Conclusion: LiveStre4m makes real-time novel view synthesis streaming feasible in practical settings, marking substantial progress toward deployable live NVS systems.

Abstract: Live-streaming Novel View Synthesis (NVS) from unposed multi-view video remains an open challenge in a wide range of applications. Existing methods for dynamic scene representation typically require ground-truth camera parameters and involve lengthy optimizations ($\approx 2.67$s), which makes them unsuitable for live streaming scenarios. To address this issue, we propose a novel viewpoint video live-streaming method (LiveStre4m), a feed-forward model for real-time NVS from unposed sparse multi-view inputs. LiveStre4m introduces a multi-view vision transformer for keyframe 3D scene reconstruction coupled with a diffusion-transformer interpolation module that ensures temporal consistency and stable streaming. In addition, a Camera Pose Predictor module is proposed to efficiently estimate both poses and intrinsics directly from RGB images, removing the reliance on known camera calibration information. Our approach enables temporally consistent novel-view video streaming in real-time using as few as two synchronized unposed input streams. LiveStre4m attains an average reconstruction time of $0.07$s per frame at $1024 \times 768$ resolution, outperforming the optimization-based dynamic scene representation methods by orders of magnitude in runtime. These results demonstrate that LiveStre4m makes real-time NVS streaming feasible in practical settings, marking a substantial step toward deployable live novel-view synthesis systems. Code available at: https://github.com/pedro-quesado/LiveStre4m

[198] From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

Carlos Schmidt, Simon Reiß

Main category: cs.CV

TL;DR: Interactive DeLVM transforms static visual in-context learning models into user-controllable systems by encoding user interactions (scribbles, clicks, bounding boxes) directly into example input-output pairs, enabling dynamic steering of predictions without fine-tuning.

DetailsMotivation: Current visual in-context learning models are static and lack mechanisms to incorporate user guidance signals like scribbles, clicks, or bounding boxes, which limits their practical utility in real-world applications where users need to actively steer model predictions for tasks like segmentation, image editing, or targeted analysis.

Method: The method transforms DeLVM (a visual in-context learning approach) into Interactive DeLVM by encoding user interactions directly into the example input-output pairs. This preserves the in-context learning philosophy while enabling users to prompt models with unseen interactions without requiring fine-tuning, allowing dynamic steering of predictions through personalized visual cues.

Result: Experiments show that state-of-the-art visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance. Interactive DeLVM achieves significant improvements: +7.95% IoU for interactive segmentation, +2.46 PSNR for directed super-resolution, and -3.14% LPIPS for interactive object removal.

Conclusion: The work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning, enabling models to become highly controllable, user-driven systems that can incorporate natural visual guidance signals.

Abstract: Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of +7.95% IoU for interactive segmentation, +2.46 PSNR for directed super-resolution, and -3.14% LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
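
The core encoding trick can be illustrated simply. A hedged sketch, assuming the interaction arrives as a binary mask (scribble, click footprint, or box fill) that is alpha-blended into the example image so a frozen in-context learner sees the guidance as ordinary pixels; the colour and blend factor are our illustrative choices, not the paper's exact values.

```python
import numpy as np

def encode_interaction(image, mask, color=(255, 0, 0), alpha=0.6):
    """Burn a user interaction into an example image.

    image: (H, W, 3) uint8 example input.
    mask:  (H, W) bool mask of the scribble/click/box region.
    """
    out = image.astype(np.float64).copy()
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, float)
    return out.round().astype(np.uint8)

img = np.full((4, 4, 3), 100, dtype=np.uint8)   # flat grey example image
m = np.zeros((4, 4), dtype=bool)
m[1, 1] = True                                   # a one-pixel "click"
prompt = encode_interaction(img, m)
```

Because the guidance lives in pixel space, the same example-pair format (and the same frozen model) works for unseen interaction types.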

[199] How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Roberto Brusnicki, Mattia Piccinini, Johannes Betz

Main category: cs.CV

TL;DR: VENUSS framework evaluates VLMs on sequential driving scenes, revealing they achieve only 57% accuracy (vs human 65%) and struggle with temporal dynamics despite good static object detection.

DetailsMotivation: VLMs are increasingly used for autonomous driving tasks, but their performance on sequential driving scenes is poorly characterized, especially regarding how different input configurations affect their capabilities.

Method: VENUSS framework extracts temporal sequences from driving videos and generates structured evaluations across custom categories. It systematically analyzes VLM sensitivity to input configurations including resolution, frame count, temporal intervals, spatial layouts, and presentation modes.

Result: Evaluation of 25+ VLMs across 2,600+ scenarios shows top models achieve only 57% accuracy, not matching human performance (65%). VLMs excel at static object detection but struggle with understanding vehicle dynamics and temporal relations.

Conclusion: VENUSS provides the first systematic sensitivity analysis of VLMs for sequential driving scenes, revealing significant capability gaps and establishing baselines for future research in autonomous driving applications.

Abstract: Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding the vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://V3NU55.github.io
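
One of the input-configuration axes VENUSS varies (frame count and temporal interval) can be sketched as a frame-index selector. The function name and the end-at-latest-frame convention are our assumptions for illustration, not the paper's implementation.

```python
def sample_frames(total_frames, n_frames, interval, fps=30):
    """Pick frame indices for a VLM prompt: n_frames frames spaced
    `interval` seconds apart, ending at the most recent frame (a common
    causal convention for driving clips)."""
    step = max(1, round(interval * fps))
    last = total_frames - 1
    idx = [last - k * step for k in range(n_frames)][::-1]
    return [max(0, i) for i in idx]          # clamp if the clip is short

# A 10-second clip at 30 fps, shown as 4 frames one second apart.
indices = sample_frames(total_frames=300, n_frames=4, interval=1.0, fps=30)
```

Sweeping `n_frames` and `interval` while holding the question fixed is exactly the kind of controlled sensitivity probe the abstract describes.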

[200] FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang

Main category: cs.CV

TL;DR: FlowInOne unifies multimodal generation into a purely visual flow framework, converting all inputs (text, layouts, instructions) into visual prompts for an image-in, image-out pipeline using a single flow matching model.

DetailsMotivation: Current multimodal generation is dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. The authors challenge this paradigm by asking whether all modalities can be unified into a single visual representation.

Method: FlowInOne reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches.

Result: FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems. The framework is supported by VisPrompt-5M (5M visual prompt pairs dataset) and VP-Bench benchmark.

Conclusion: FlowInOne establishes a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm.

Abstract: Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.

[201] FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts

Guillermo Gil de Avalle, Laura Maruster, Eric Sloot, Christos Emmanouilidis

Main category: cs.CV

TL;DR: FlowExtract is a pipeline that extracts directed graphs from ISO 5807-standardized flowcharts in PDFs/images, using computer vision (YOLOv8, EasyOCR) and novel edge detection to reconstruct connection topology, outperforming vision-language models for industrial maintenance procedures.

DetailsMotivation: Maintenance procedures in manufacturing are documented as flowcharts in static PDFs/images, encoding essential procedural knowledge that is inaccessible to modern operator support systems. Current vision-language models struggle to reconstruct connection topology from such diagrams.

Method: FlowExtract separates element detection from connectivity reconstruction: uses YOLOv8 for standard domain-aligned node detection, EasyOCR for text extraction, and a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes.

Result: Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection and substantially outperforms vision-language model baselines on edge extraction, offering a practical path toward queryable procedural knowledge representations.

Conclusion: FlowExtract provides an effective solution for extracting directed graphs from standardized flowcharts, enabling organizations to convert static procedural documentation into queryable knowledge representations that vision-language models currently struggle with.

Abstract: Maintenance procedures in manufacturing facilities are often documented as flowcharts in static PDFs or scanned images. They encode procedural knowledge essential for asset lifecycle management, yet inaccessible to modern operator support systems. Vision-language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams. We present FlowExtract, a pipeline for extracting directed graphs from ISO 5807-standardized flowcharts. The system separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for standard domain-aligned node detection and text extraction, combined with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes. Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection and substantially outperforms vision-language model baselines on edge extraction, offering organizations a practical path toward queryable procedural knowledge representations. The implementation is available at https://github.com/guille-gil/FlowExtract.
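
The backward-tracing step can be sketched geometrically. This is our simplification of the idea, not the released code: given an arrowhead's tip and the direction it points in, walk against that direction until the current point falls inside a detected node's bounding box.

```python
def trace_edge(arrow_tip, direction, boxes, step=1.0, max_steps=500):
    """Find the source node of an edge by walking backward from an arrowhead.

    arrow_tip: (x, y) of the arrow tip (image coordinates, y down).
    direction: (dx, dy) unit-ish vector the arrow points in.
    boxes:     {node_name: (x0, y0, x1, y1)} detected node boxes.
    Returns the source node's name, or None if nothing is hit.
    """
    x, y = arrow_tip
    dx, dy = direction
    for _ in range(max_steps):
        x, y = x - dx * step, y - dy * step      # step against the arrow
        for name, (x0, y0, x1, y1) in boxes.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                return name
    return None

boxes = {"start": (0, 0, 10, 10), "check": (0, 40, 10, 50)}
# An arrow pointing straight down, tip just above "check": walking backward
# (upward) should land in the "start" box.
src = trace_edge((5, 39), (0, 1), boxes)
```

The destination node is simpler: it is whichever box the arrowhead itself touches, so only the source requires tracing.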

[202] Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuchen Zhou, Xiaobo Xia, Yuanyu Wan, Lijun Zhang, Tat-Seng Chua

Main category: cs.CV

TL;DR: MAPO addresses reasoning-action discrepancy in multimodal LLMs by coupling semantic alignment of visual descriptions with task rewards to improve multimodal reasoning performance.

DetailsMotivation: Current RL approaches for multimodal LLMs rely on outcome-based rewards, ignoring that plausible textual reasoning can mask poor visual actions, creating noise that accumulates in multi-turn reasoning and degrades performance.

Method: Introduces Multimodal Agentic Policy Optimization (MAPO) that requires models to generate explicit textual descriptions of visual content from tool usage, then uses novel advantage estimation combining semantic alignment between descriptions and actual observations with task reward.

Result: Theoretical justification shows MAPO reduces gradient variance, and extensive experiments demonstrate superior performance across multiple visual reasoning benchmarks.

Conclusion: MAPO effectively bridges the gap between textual reasoning and visual actions in multimodal chain-of-thought, improving multimodal reasoning capabilities by addressing reasoning-action discrepancies.

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model's multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.
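
The coupled advantage can be illustrated numerically. A hedged sketch of our own simplified formulation: a convex mix of task reward and description-observation alignment, normalized within a rollout group in GRPO style. The mixing weight `lam` and the exact normalization are assumptions, not values from the paper.

```python
import numpy as np

def mapo_advantage(task_reward, alignment, lam=0.5, eps=1e-8):
    """Group-normalized advantage coupling task reward with the semantic
    alignment between the model's descriptions and the tool observations."""
    r = np.asarray(task_reward, float)
    a = np.asarray(alignment, float)
    combined = (1 - lam) * r + lam * a
    return (combined - combined.mean()) / (combined.std() + eps)

# Two rollouts with the same task reward: the second describes its visual
# observations more faithfully, so it receives the larger advantage, which
# is how plausible text over bad visual actions gets penalized.
adv = mapo_advantage(task_reward=[1.0, 1.0], alignment=[0.2, 0.8])
```

Under an outcome-only reward these two rollouts would be indistinguishable; the alignment term breaks the tie in favour of grounded reasoning.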

[203] EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling

Qingguo Meng, Xingbo Dong, Zhe Jin, Massimo Tistarelli

Main category: cs.CV

TL;DR: EventFace: A framework for event-based face recognition that models structure-driven spatiotemporal identity representations using LoRA for spatial priors and motion prompts for temporal dynamics.

DetailsMotivation: Event cameras offer illumination robustness and privacy advantages for face recognition, but event streams lack a stable photometric appearance, motivating structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual facial geometry.

Method: 1) Construct EFace dataset for event-based face recognition; 2) Use LoRA to transfer structural facial priors from pretrained RGB models; 3) Introduce Motion Prompt Encoder (MPE) to encode temporal features; 4) Use Spatiotemporal Modulator (STM) to fuse spatial and temporal features.

Result: Achieves 94.19% Rank-1 identification rate and 5.35% EER, outperforming baselines. Shows stronger robustness under degraded illumination and reduced template reconstructability for privacy.

Conclusion: EventFace effectively models spatiotemporal identity representations for event-based face recognition, demonstrating superior performance and robustness while addressing privacy concerns.

Abstract: Event cameras offer a promising sensing modality for face recognition due to their inherent advantages in illumination robustness and privacy-friendliness. However, because event streams lack the stable photometric appearance relied upon by conventional RGB-based face recognition systems, we argue that event-based face recognition should model structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual facial geometry. Since dedicated datasets for event-based face recognition remain lacking, we construct EFace, a small-scale event-based face dataset captured under rigid facial motion. To learn effectively from this limited event data, we further propose EventFace, a framework for event-based face recognition that integrates spatial structure and temporal dynamics for identity modeling. Specifically, we employ Low-Rank Adaptation (LoRA) to transfer structural facial priors from pretrained RGB face models to the event domain, thereby establishing a reliable spatial basis for identity modeling. Building on this foundation, we further introduce a Motion Prompt Encoder (MPE) to explicitly encode temporal features and a Spatiotemporal Modulator (STM) to fuse them with spatial features, thereby enhancing the representation of identity-relevant event patterns. Extensive experiments demonstrate that EventFace achieves the best performance among the evaluated baselines, with a Rank-1 identification rate of 94.19% and an equal error rate (EER) of 5.35%. Results further indicate that EventFace exhibits stronger robustness under degraded illumination than the competing methods. In addition, the learned representations exhibit reduced template reconstructability.
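
Since the paper specifies only that LoRA transfers RGB facial priors, here is a generic LoRA sketch (standard formulation, not EventFace-specific code): the frozen weight W is augmented with a low-rank update (alpha/r) * B @ A, and only A and B are trained on event data.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=8):
    """LoRA-adapted linear layer.

    x: (d_in,) input; W: (d_out, d_in) frozen pretrained weight;
    A: (r, d_in), B: (d_out, r) trainable low-rank factors.
    """
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(1)
d_in, d_out, r = 6, 4, 2
W = rng.standard_normal((d_out, d_in))   # frozen RGB-pretrained weight
A = rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))                 # standard LoRA init: B = 0
x = rng.standard_normal(d_in)
y0 = lora_forward(x, W, A, B)            # equals W @ x before any training
```

The B = 0 initialization guarantees the adapted model starts exactly at the pretrained RGB prior, which is precisely the "reliable spatial basis" the method builds on.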

[204] Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

Bohao Xing, Deng Li, Rong Gao, Xin Liu, Heikki Kälviäinen

Main category: cs.CV

TL;DR: OG-ReG Transformer: A dual-path video understanding model with Glance path for coarse spatiotemporal information and Gaze path for local details, achieving SOTA on multiple video datasets.

DetailsMotivation: Current video Transformers use factorized or window-based attention that splits spatiotemporal correlations, limiting motion capture and long-range dependencies. Inspired by human visual system where temporal/spatial importance varies across time scales and attention is sparse, the authors question if equal consideration of time and space is optimal for video tasks.

Method: Proposes OG-ReG (Overall Glance and Refined Gaze) Transformer with dual paths: 1) Glance path extracts coarse-grained overall spatiotemporal information, 2) Gaze path supplements with local details. This mimics human glance-gaze behavior for efficient video understanding.

Result: Achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 datasets, demonstrating competitive performance in video understanding tasks.

Conclusion: The dual-path approach inspired by human visual attention mechanisms effectively captures both global spatiotemporal context and local details, outperforming existing methods that treat time and space equally.

Abstract: Recently, Transformer has made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on the Kinetics-400, Something-Something v2, and Diving-48 datasets, demonstrating its competitive performance. The code will be available at https://github.com/linuxsino/OG-ReG.

[205] Video-guided Machine Translation with Global Video Context

Jian Chen, JinZe Lv, Zi Long, XiangHua Fu

Main category: cs.CV

TL;DR: A globally video-guided multimodal translation framework that uses semantic encoders and vector databases to retrieve relevant video segments for better context understanding in long videos, with attention mechanisms for improved alignment.

DetailsMotivation: Existing video-guided multimodal translation methods rely on locally aligned video segments paired one-to-one with subtitles, which limits their ability to capture global narrative context across multiple segments in long videos.

Method: Proposes a framework using pretrained semantic encoder and vector database-based subtitle retrieval to construct context sets of video segments related to target subtitle semantics. Uses attention mechanisms to focus on relevant visual content while preserving broader context, and designs region-aware cross-modal attention for better semantic alignment.

Result: Experiments on a large-scale documentary translation dataset show the method significantly outperforms baseline models, demonstrating effectiveness in long-video scenarios.

Conclusion: The proposed globally video-guided multimodal translation framework effectively addresses limitations of local alignment methods by capturing global narrative context, making it suitable for long-video translation tasks.

Abstract: Video-guided Multimodal Translation (VMT) has advanced significantly in recent years. However, most existing methods rely on locally aligned video segments paired one-to-one with subtitles, limiting their ability to capture global narrative context across multiple segments in long videos. To overcome this limitation, we propose a globally video-guided multimodal translation framework that leverages a pretrained semantic encoder and vector database-based subtitle retrieval to construct a context set of video segments closely related to the target subtitle semantics. An attention mechanism is employed to focus on highly relevant visual content, while preserving the remaining video features to retain broader contextual information. Furthermore, we design a region-aware cross-modal attention mechanism to enhance semantic alignment during translation. Experiments on a large-scale documentary translation dataset demonstrate that our method significantly outperforms baseline models, highlighting its effectiveness in long-video scenarios.
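
The retrieval step can be sketched with a similarity search over stored subtitle embeddings. The names are ours: a plain matrix stands in for the vector database, and cosine similarity to the target subtitle's embedding selects the top-k context segments.

```python
import numpy as np

def retrieve_context(query, db, k=2):
    """Return indices of the k subtitle segments most similar to `query`.

    query: (d,) embedding of the target subtitle (from a semantic encoder).
    db:    (n, d) stored embeddings standing in for the vector database.
    """
    q = query / np.linalg.norm(query)
    m = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = m @ q                       # cosine similarity to every segment
    return np.argsort(-sims)[:k]       # indices of the top-k segments

# Toy database: segments 0 and 2 are semantically close to the query,
# segment 1 is unrelated.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
top = retrieve_context(np.array([1.0, 0.05]), db, k=2)
```

The video segments paired with the retrieved subtitles then form the global context set that the attention mechanism weighs during translation.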

[206] FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift

Huy Q. Le, Loc X. Nguyen, Yu Qiao, Seong Tae Kim, Eui-Nam Huh, Choong Seon Hong

Main category: cs.CV

TL;DR: FedDAP: Federated Domain-Aware Prototypes framework that addresses domain shift in federated learning by constructing domain-specific global prototypes and performing domain-aware feature-prototype alignment.

DetailsMotivation: In real-world federated learning scenarios, clients often have data from distinct domains, causing severe domain shift that degrades global model performance. Existing prototype-based FL methods use single global prototypes per class without preserving domain information, and feature-prototype alignment is domain-agnostic.

Method: Proposes Federated Domain-Aware Prototypes (FedDAP) that constructs domain-specific global prototypes by aggregating local client prototypes within the same domain using similarity-weighted fusion. Uses these prototypes to guide local training by aligning local features with same-domain prototypes while encouraging separation from different-domain prototypes.

Result: Extensive experiments on DomainNet, Office-10, and PACS datasets demonstrate effectiveness in addressing domain shift challenges in federated learning.

Conclusion: FedDAP enhances domain-specific learning at the local level and enables global models to generalize across diverse domains by addressing limitations of existing prototype-based FL methods through domain-aware prototype construction and alignment.

Abstract: Federated Learning (FL) enables decentralized model training across multiple clients without exposing private data, making it ideal for privacy-sensitive applications. However, in real-world FL scenarios, clients often hold data from distinct domains, leading to severe domain shift and degraded global model performance. To address this, prototype learning, which leverages class-wise feature representations, has emerged as a promising solution. Yet, existing methods face two key limitations: (1) Existing prototype-based FL methods typically construct a $\textit{single global prototype}$ per class by aggregating local prototypes from all clients without preserving domain information. (2) Current feature-prototype alignment is $\textit{domain-agnostic}$, forcing clients to align with global prototypes regardless of domain origin. To address these challenges, we propose Federated Domain-Aware Prototypes (FedDAP) to construct domain-specific global prototypes by aggregating local client prototypes within the same domain using a similarity-weighted fusion mechanism. These global domain-specific prototypes are then used to guide local training by aligning local features with prototypes from the same domain, while encouraging separation from prototypes of different domains. This dual alignment enhances domain-specific learning at the local level and enables the global model to generalize across diverse domains. Finally, we conduct extensive experiments on three different datasets: DomainNet, Office-10, and PACS, to demonstrate the effectiveness of our proposed framework in addressing domain shift challenges. The code is available at https://github.com/quanghuy6997/FedDAP.
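
The similarity-weighted fusion can be sketched concretely. A hedged illustration with one concrete weighting choice of ours (cosine similarity to the domain mean, renormalized); the paper's exact fusion mechanism may differ.

```python
import numpy as np

def fuse_domain_prototype(local_protos):
    """Fuse same-domain client prototypes for one class into a global
    domain-specific prototype, down-weighting outlier clients.

    local_protos: (n_clients, d) local prototypes for one class/domain.
    """
    p = np.asarray(local_protos, float)
    mean = p.mean(axis=0)
    sims = (p @ mean) / (np.linalg.norm(p, axis=1) * np.linalg.norm(mean))
    w = sims / sims.sum()              # similarity-proportional weights
    return w @ p

# Two consistent clients and one outlier: the fused prototype leans toward
# the consistent pair more than a plain average would.
protos = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
fused = fuse_domain_prototype(protos)
```

During local training, features would then be pulled toward the fused prototype of their own domain and pushed away from other domains' prototypes, giving the dual alignment the abstract describes.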

[207] Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

Subin Park, Jung Uk Kim

Main category: cs.CV

TL;DR: A training-free sound source localization framework using MLLMs’ reasoning capabilities through a Generation-Analysis-Refinement pipeline for audio-visual correlation analysis.

DetailsMotivation: Existing SSL methods rely on contrastive learning-based feature matching but lack explicit reasoning and verification, limiting effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, the authors aim to leverage MLLMs' intrinsic reasoning capabilities.

Method: Proposes a training-free SSL framework with Generation-Analysis-Refinement pipeline: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; Refinement applies adaptive gating to prevent unnecessary adjustments.

Result: Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The framework shows effectiveness in complex acoustic scenes without requiring training.

Conclusion: The proposed GAR framework successfully leverages MLLMs’ reasoning capabilities for sound source localization, addressing limitations of traditional contrastive learning methods through explicit reasoning and verification processes.

Abstract: Sound source localization task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.
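
The Refinement stage's adaptive gate can be sketched as follows. This is our simplification of the anchor-voting idea: consistency is the share of anchors whose role tag matches the audio classification, and the first-pass box is only replaced when that share drops below a threshold (the threshold value is an assumption).

```python
def refine_box(initial_box, refined_box, anchor_tags, audio_tag, thresh=0.5):
    """Adaptive gating: accept the refined box only when audio-visual
    consistency of the initial prediction is low.

    anchor_tags: role tags voted by visual anchors (open-set strings).
    audio_tag:   class assigned to the audio by the MLLM.
    Returns (chosen_box, was_refined).
    """
    votes = sum(t == audio_tag for t in anchor_tags)
    consistency = votes / max(1, len(anchor_tags))
    if consistency >= thresh:
        return initial_box, False      # consistent: avoid needless adjustment
    return refined_box, True           # inconsistent: trust the refinement

box, changed = refine_box((0, 0, 50, 50), (10, 10, 60, 60),
                          anchor_tags=["dog", "dog", "car"], audio_tag="dog")
```

The gate is what prevents the "unnecessary adjustments" the method description warns about: a confident, consistent first pass is left alone.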

[208] RePL: Pseudo-label Refinement for Semi-supervised LiDAR Semantic Segmentation

Donghyeon Kwon, Taegyu Park, Suha Kwak

Main category: cs.CV

TL;DR: RePL improves LiDAR semantic segmentation by refining noisy pseudo-labels through masked reconstruction and dedicated training, achieving state-of-the-art results.

DetailsMotivation: Semi-supervised learning for LiDAR semantic segmentation suffers from error propagation and confirmation bias caused by noisy pseudo-labels, which limits performance improvement.

Method: Introduces RePL framework that enhances pseudo-label quality by identifying and correcting potential errors through masked reconstruction, along with a dedicated training strategy. Provides theoretical analysis of conditions for beneficial pseudo-label refinement.

Result: Extensive evaluations on nuScenes-lidarseg and SemanticKITTI datasets show RePL significantly improves pseudo-label quality and achieves state-of-the-art performance in LiDAR semantic segmentation.

Conclusion: RePL effectively addresses the chronic issue of noisy pseudo-labels in semi-supervised LiDAR segmentation through error correction via masked reconstruction, with theoretical and empirical validation.

Abstract: Semi-supervised learning for LiDAR semantic segmentation often suffers from error propagation and confirmation bias caused by noisy pseudo-labels. To tackle this chronic issue, we introduce RePL, a novel framework that enhances pseudo-label quality by identifying and correcting potential errors in pseudo-labels through masked reconstruction, along with a dedicated training strategy. We also provide a theoretical analysis demonstrating the condition under which the pseudo-label refinement is beneficial, and empirically confirm that the condition is mild and clearly met by RePL. Extensive evaluations on the nuScenes-lidarseg and SemanticKITTI datasets show that RePL substantially improves pseudo-label quality and, as a result, achieves state-of-the-art performance in LiDAR semantic segmentation.

[209] VGGT-SLAM++

Avilasha Mandal, Rajesh Kumar, Sudarshan Sunil Harithas, Chetan Arora

Main category: cs.CV

TL;DR: VGGT-SLAM++ is a visual SLAM system using Visual Geometry Grounded Transformer outputs with DEM-based mapping and local bundle adjustment for improved accuracy and reduced drift.

DetailsMotivation: Prior transformer-based SLAM systems like VGGT-SLAM suffer from short-horizon pose drift due to reliance on sparse loop closures or global constraints. The authors aim to restore high-cadence local bundle adjustment to stabilize trajectories and reduce drift.

Method: System combines VGGT feed-forward transformer with Sim(3) solution in front-end, uses DEM-based graph construction with DINOv2 embeddings, and implements spatially corrective back-end with Visual Place Recognition for frequent local optimization.

Result: Achieves state-of-the-art accuracy on standard SLAM benchmarks, substantially reduces short-term drift, accelerates graph convergence, and maintains global consistency with compact DEM tiles and sublinear retrieval.

Conclusion: VGGT-SLAM++ demonstrates that integrating transformer-based geometry with traditional local bundle adjustment enables more accurate and stable large-scale visual SLAM with bounded memory.

Abstract: We introduce VGGT-SLAM++, a complete visual SLAM system that leverages the geometry-rich outputs of the Visual Geometry Grounded Transformer (VGGT). The system comprises a visual odometry (front-end) fusing the VGGT feed-forward transformer and a Sim(3) solution, a Digital Elevation Map (DEM)-based graph construction module, and a back-end, which together enable accurate large-scale mapping with bounded memory. While prior transformer-based SLAM pipelines such as VGGT-SLAM rely primarily on sparse loop closures or global Sim(3) manifold constraints - allowing short-horizon pose drift - VGGT-SLAM++ restores high-cadence local bundle adjustment (LBA) through a spatially corrective back-end. For each VGGT submap, we construct a dense planar-canonical DEM, partition it into patches, and compute their DINOv2 embeddings to integrate the submap into a covisibility graph. Spatial neighbors are retrieved using a Visual Place Recognition (VPR) module within the covisibility window, triggering frequent local optimization that stabilizes trajectories. Across standard SLAM benchmarks, VGGT-SLAM++ achieves state-of-the-art accuracy, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.

[210] CloudMamba: An Uncertainty-Guided Dual-Scale Mamba Network for Cloud Detection in Remote Sensing Imagery

Jiajun Yang, Keyan Chen, Zhengxia Zou, Zhenwei Shi

Main category: cs.CV

TL;DR: CloudMamba: A two-stage CNN-Mamba hybrid framework for cloud detection in remote sensing imagery that addresses ambiguity in thin-cloud regions and improves handling of fragmented clouds and boundary details.

DetailsMotivation: Existing deep learning cloud detection methods struggle with ambiguity in thin-cloud regions, fragmented clouds, and boundary details due to their single-stage pixel-wise segmentation approach.

Method: Proposes an uncertainty-guided two-stage detection strategy with embedded uncertainty estimation for thin-cloud confidence quantification and second-stage refinement. Uses a dual-scale Mamba network with CNN-Mamba hybrid architecture to capture both large-scale structural characteristics and small-scale boundary details with linear computational complexity.

Result: Outperforms existing approaches on GF1_WHU and Levir_CS datasets across multiple segmentation accuracy metrics while maintaining high efficiency and process transparency.

Conclusion: CloudMamba effectively addresses challenges in cloud detection through its two-stage uncertainty-guided approach and efficient CNN-Mamba hybrid architecture, achieving state-of-the-art performance.

Abstract: Cloud detection in remote sensing imagery is a fundamental, critical, and highly challenging problem. Existing deep learning-based cloud detection methods generally formulate it as a single-stage pixel-wise binary segmentation task with one forward pass. However, such single-stage approaches exhibit ambiguity and uncertainty in thin-cloud regions and struggle to accurately handle fragmented clouds and boundary details. In this paper, we propose a novel deep learning framework termed CloudMamba. To address the ambiguity in thin-cloud regions, we introduce an uncertainty-guided two-stage cloud detection strategy. An embedded uncertainty estimation module is proposed to automatically quantify the confidence of thin-cloud segmentation, and a second-stage refinement segmentation is introduced to improve the accuracy in low-confidence hard regions. To better handle fragmented clouds and fine-grained boundary details, we design a dual-scale Mamba network based on a CNN-Mamba hybrid architecture. Compared with Transformer-based models with quadratic computational complexity, the proposed method maintains linear computational complexity while effectively capturing both large-scale structural characteristics and small-scale boundary details of clouds, enabling accurate delineation of overall cloud morphology and precise boundary segmentation. Extensive experiments conducted on the GF1_WHU and Levir_CS public datasets demonstrate that the proposed method outperforms existing approaches across multiple segmentation accuracy metrics, while offering high efficiency and process transparency. Our code is available at https://github.com/jayoungo/CloudMamba.
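The uncertainty-guided two-stage idea can be illustrated with a toy numpy sketch (assumed mechanics, not CloudMamba's actual modules): stage one produces per-pixel cloud probabilities, a simple confidence measure flags ambiguous thin-cloud pixels, and only those pixels are re-decided by a second-stage refiner:

```python
import numpy as np

def two_stage_detect(stage1_prob, refine_fn, tau=0.3):
    """stage1_prob: (H, W) cloud probabilities in [0, 1].

    Pixels whose confidence falls below tau are handed to refine_fn,
    which returns a refined boolean decision map of the same shape."""
    # Uncertainty peaks at p = 0.5; confidence = |p - 0.5| * 2 in [0, 1].
    confidence = np.abs(stage1_prob - 0.5) * 2.0
    mask = stage1_prob > 0.5                     # coarse stage-1 decision
    hard = confidence < tau                      # low-confidence hard regions
    mask[hard] = refine_fn(stage1_prob)[hard]    # stage-2 refinement only there
    return mask, hard

# Toy 2x2 example: the 0.52 and 0.48 pixels are ambiguous "thin cloud".
prob = np.array([[0.90, 0.52],
                 [0.10, 0.48]])
refined, hard = two_stage_detect(prob, lambda p: p > 0.45)
```

Here the refiner is a trivial re-thresholding; in the paper it is a learned second-stage segmentation, but the control flow (refine only where the first stage is uncertain) is the same.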

[211] Vision-Language Model-Guided Deep Unrolling Enables Personalized, Fast MRI

Fangmao Ju, Yuzhu He, Zhiwen Xue, Chunfeng Lian, Jianhua Ma

Main category: cs.CV

TL;DR: PASS is an intelligent MRI framework that uses a Vision-Language Model to guide personalized, task-oriented accelerated MRI acquisition and reconstruction.

DetailsMotivation: Traditional accelerated MRI methods optimize for generic image quality rather than specific clinical tasks, leading to suboptimal results for diagnostic purposes. There's a need for personalized, task-oriented MRI acceleration that focuses on clinically relevant regions.

Method: PASS integrates three components: 1) a deep unrolled reconstruction network based on physics-based MRI models, 2) a sampling module generating patient-specific k-space trajectories, and 3) an anomaly-aware prior extracted from a pretrained Vision-Language Model that guides both sampling and reconstruction toward clinically relevant regions.

Result: PASS achieves superior image quality across diverse anatomies, contrasts, anomalies, and acceleration factors. This improvement directly translates to better performance in downstream diagnostic tasks including fine-grained anomaly detection, localization, and diagnosis.

Conclusion: The integration of high-level clinical reasoning from VLMs with interpretable, physics-aware networks enables personalized, task-oriented MRI acceleration that enhances diagnostic utility while reducing acquisition times.

Abstract: Magnetic Resonance Imaging (MRI) is a cornerstone in medicine and healthcare but suffers from long acquisition times. Traditional accelerated MRI methods optimize for generic image quality, lacking adaptability for specific clinical tasks. To address this, we introduce PASS (Personalized, Anomaly-aware Sampling and reconStruction), an intelligent MRI framework that leverages a Vision-Language Model (VLM) to guide a deep unrolling network for task-oriented, fast imaging. PASS dynamically personalizes the imaging pipeline through three core contributions: (1) a deep unrolled reconstruction network derived from a physics-based MRI model; (2) a sampling module that generates patient-specific $k$-space trajectories; and (3) an anomaly-aware prior, extracted from a pretrained VLM, which steers both sampling and reconstruction toward clinically relevant regions. By integrating the high-level clinical reasoning of a VLM with an interpretable, physics-aware network, PASS achieves superior image quality across diverse anatomies, contrasts, anomalies, and acceleration factors. This enhancement directly translates to improvements in downstream diagnostic tasks, including fine-grained anomaly detection, localization, and diagnosis.

[212] Physical Adversarial Attacks on AI Surveillance Systems: Detection, Tracking, and Visible–Infrared Evasion

Miguel A. DelaCruz, Patricia Mae Santos, Rafael T. Navarro

Main category: cs.CV

TL;DR: A review paper analyzing physical adversarial attacks from a surveillance system perspective, focusing on temporal persistence, multi-modal sensing, carrier realism, and system-level objectives rather than isolated image benchmarks.

DetailsMotivation: Physical adversarial attacks are increasingly studied in real-world surveillance contexts where factors like temporal persistence, multi-object tracking, visible-infrared sensing, and practical attack carriers matter simultaneously, changing how the literature should be interpreted.

Method: The paper reviews physical attacks through a surveillance-oriented lens, organizing prior work using a four-part taxonomy focusing on: 1) temporal persistence, 2) sensing modality, 3) carrier realism, and 4) system-level objectives.

Result: The review reveals that surveillance robustness cannot be reliably judged from isolated per-frame benchmarks alone; it must be examined as a system problem unfolding over time, across sensors, and under realistic physical deployment constraints.

Conclusion: Physical adversarial attacks in surveillance require system-level evaluation considering temporal persistence, multi-modal sensing, practical carriers, and realistic deployment constraints rather than isolated image-based benchmarks.

Abstract: Physical adversarial attacks are increasingly studied in settings that resemble deployed surveillance systems rather than isolated image benchmarks. In these settings, person detection, multi-object tracking, visible–infrared sensing, and the practical form of the attack carrier all matter at once. This changes how the literature should be read. A perturbation that suppresses a detector in one frame may have limited practical effect if identity is recovered over time; an RGB-only result may say little about night-time systems that rely on visible and thermal inputs together; and a conspicuous patch can imply a different threat model from a wearable or selectively activated carrier. This paper reviews physical attacks from that surveillance-oriented viewpoint. Rather than attempting a complete catalogue of all physical attacks in computer vision, we focus on the technical questions that become central in surveillance: temporal persistence, sensing modality, carrier realism, and system-level objective. We organize prior work through a four-part taxonomy and discuss how recent results on multi-object tracking, dual-modal visible–infrared evasion, and controllable clothing reflect a broader change in the field. We also summarize evaluation practices and unresolved gaps, including distance robustness, camera-pipeline variation, identity-level metrics, and activation-aware testing. The resulting picture is that surveillance robustness cannot be judged reliably from isolated per-frame benchmarks alone; it has to be examined as a system problem unfolding over time, across sensors, and under realistic physical deployment constraints.

[213] RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

Dewei Zhou, You Li, Zongxin Yang, Yi Yang

Main category: cs.CV

TL;DR: RefineAnything is a multimodal diffusion model for region-specific image refinement that preserves background pixels while restoring fine details in user-specified regions, using a Focus-and-Refine strategy with boundary-aware loss.

DetailsMotivation: Current image generation models suffer from local detail collapse (distorted text, logos, thin structures) and instruction-driven editing models often overlook subtle local defects or inadvertently change backgrounds, especially for small regions in fixed-resolution inputs.

Method: Proposes Focus-and-Refine strategy: crop-and-resize target region to allocate resolution budget effectively, refine using multimodal diffusion model (supports reference-based and reference-free), then paste back with a blended mask to preserve background. Introduces Boundary Consistency Loss to reduce seam artifacts.

Result: Achieves strong improvements over baselines on RefineEval benchmark with near-perfect background preservation. Constructed Refine-30K dataset (20K reference-based, 10K reference-free samples) to support the task.

Conclusion: RefineAnything establishes a practical solution for high-precision local refinement with strict background preservation, addressing limitations of current models in handling fine-grained details in small regions.

Abstract: We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.
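The Focus-and-Refine control flow (crop, refine at a larger resolution, paste back under a mask) can be sketched in numpy; the nearest-neighbour resize, the all-ones mask, and the stand-in refiner are simplifying assumptions in place of the paper's VAE/diffusion components:

```python
import numpy as np

def focus_and_refine(image, box, refiner, scale=2):
    """image: (H, W) array; box: (y0, y1, x0, x1) region to refine."""
    y0, y1, x0, x1 = box
    crop = image[y0:y1, x0:x1]
    up = np.kron(crop, np.ones((scale, scale)))  # crop-and-resize upward
    refined_up = refiner(up)                     # refine at high resolution
    refined = refined_up[::scale, ::scale]       # resize back to crop size
    out = image.copy()
    # Paste-back mask; here all-ones for simplicity, a blended (soft-edged)
    # mask would feather the boundary to reduce seams.
    mask = np.ones_like(crop)
    out[y0:y1, x0:x1] = mask * refined + (1 - mask) * crop
    return out

img = np.zeros((4, 4))
res = focus_and_refine(img, (1, 3, 1, 3), lambda x: x + 1.0)
```

Because only the cropped region is ever written back, every pixel outside the box is bit-for-bit identical to the input, which is the strict background-preservation guarantee the paper emphasizes.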

[214] SCT-MOT: Enhancing Air-to-Air Multiple UAVs Tracking with Swarm-Coupled Motion and Trajectory Guidance

Zhaochen Chu, Tao Song, Ren Jin, Shaoming He, Defu Lin, Siqing Cheng

Main category: cs.CV

TL;DR: SCT-MOT: A swarm UAV tracking framework with swarm-coupled motion modeling and trajectory-guided feature fusion for improved air-to-air tracking in complex swarm scenarios.

DetailsMotivation: Air-to-air tracking of swarm UAVs is challenging due to complex nonlinear group motion and weak visual cues for small objects, causing detection failures, trajectory fragmentation, and identity switches. Existing methods model objects independently, neglecting swarm-level motion dependencies and having limited integration between motion prediction and appearance representation.

Method: Proposes SCT-MOT framework with two key modules: 1) Swarm Motion-Aware Trajectory Prediction (SMTP) that jointly models historical trajectories and posture-aware appearance features from a swarm-level perspective, and 2) Trajectory-Guided Spatio-Temporal Feature Fusion (TG-STFF) that aligns predicted positions with historical visual cues and integrates them with current frame features.

Result: Extensive experiments on three public air-to-air swarm UAV tracking datasets (AIRMOT, MOT-FLY, UAVSwarm) show SMTP achieves more accurate trajectory forecasts and yields 1.21% IDF1 improvement over state-of-the-art trajectory prediction module EqMotion. SCT-MOT consistently achieves superior accuracy and robustness across multiple metrics under complex swarm scenarios.

Conclusion: SCT-MOT effectively addresses challenges in swarm UAV tracking by integrating swarm-coupled motion modeling and trajectory-guided feature fusion, demonstrating improved performance in complex air-to-air tracking scenarios with weak visual cues and nonlinear group motion.

Abstract: Air-to-air tracking of swarm UAVs presents significant challenges due to the complex nonlinear group motion and weak visual cues for small objects, which often cause detection failures, trajectory fragmentation, and identity switches. Although existing methods have attempted to improve performance by incorporating trajectory prediction, they model each object independently, neglecting the swarm-level motion dependencies. Their limited integration between motion prediction and appearance representation also weakens the spatio-temporal consistency required for tracking in visually ambiguous and cluttered environments, making it difficult to maintain coherent trajectories and reliable associations. To address these challenges, we propose SCT-MOT, a tracking framework that integrates Swarm-Coupled motion modeling and Trajectory-guided feature fusion. First, we develop a Swarm Motion-Aware Trajectory Prediction (SMTP) module that jointly models historical trajectories and posture-aware appearance features from a swarm-level perspective, enabling more accurate forecasting of the nonlinear, coupled group trajectories. Second, we design a Trajectory-Guided Spatio-Temporal Feature Fusion (TG-STFF) module that aligns predicted positions with historical visual cues and deeply integrates them with current frame features, enhancing temporal consistency and spatial discriminability for weak objects. Extensive experiments on three public air-to-air swarm UAV tracking datasets, including AIRMOT, MOT-FLY, and UAVSwarm, demonstrate that SMTP achieves more accurate trajectory forecasts and yields a 1.21% IDF1 improvement over the state-of-the-art trajectory prediction module EqMotion when integrated into the same MOT framework. Overall, our SCT-MOT consistently achieves superior accuracy and robustness compared to state-of-the-art trackers across multiple metrics under complex swarm scenarios.

[215] Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer

Sambit Tarai, Ashish Chauhan, Elin Lundström, Johan Öfverstedt, Therese Sjöholm, Veronica Sanchez Rodriguez, Håkan Ahlström, Joel Kullberg

Main category: cs.CV

TL;DR: Deep learning framework combining FDG-PET/CT tissue projections with temporal data to predict overall survival in NSCLC patients, outperforming image-only baselines.

DetailsMotivation: Automated medical image-based prediction of clinical outcomes like overall survival has great potential for improving patient prognostics and personalized treatment planning in oncology.

Method: Uses ResNet-50 backbone to process tissue-wise FDG-PET/CT projections, combines image embeddings with temporal input (time horizon in days) to predict OS probabilities as function of time. Developed on U-CAN cohort (n=556), compared against baseline using only images.

Result: Temporal data integration improved AUC by 4.3% over baseline. Ensemble of imaging and clinical+IDP models achieved best performance (AUC=0.788). Model enabled risk stratification and saliency analysis highlighted tumor regions as key predictive structures.

Conclusion: Method provides automated framework for time-dependent OS prediction and demonstrates value of combining imaging and tabular data for improved survival prediction in NSCLC.

Abstract: Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days) to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2- or 5-year. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. Conclusion: Our method provided an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.
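The time-parameterized prediction head can be illustrated with a minimal numpy sketch (an assumed form, not the paper's exact architecture): the image embedding is concatenated with a normalized scalar time horizon and mapped through a sigmoid, so one model answers "what is the OS probability at day t?" for any t:

```python
import numpy as np

# Toy weights for a [4-dim embedding, time] input; the positive time
# coefficient makes predicted survival decrease with the horizon.
W = np.array([0.5, -0.2, 0.1, 0.3, 1.0])

def survival_prob(embedding, t_days, w=W):
    """P(overall survival beyond t_days) for one patient embedding."""
    x = np.concatenate([embedding, [t_days / 365.0]])  # normalize time to years
    return 1.0 / (1.0 + np.exp(w @ x))                 # sigmoid head

emb = np.array([0.2, -0.1, 0.5, 0.3])   # stand-in for a ResNet-50 embedding
p2 = survival_prob(emb, 730)            # 2-year horizon
p5 = survival_prob(emb, 1825)           # 5-year horizon
```

This contrasts with the baseline described above, which would need a separate output (or model) per pre-specified interval such as 2- or 5-year survival.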

[216] Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models

Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Hanane Azzag, Mustapha Lebbah

Main category: cs.CV

TL;DR: ERSM is an energy-regularized spatial masking framework that embeds a lightweight Energy-Mask Layer in convolutional networks to autonomously discover optimal feature selection through differentiable energy minimization, producing emergent sparsity and improved robustness.

DetailsMotivation: Deep convolutional networks suffer from computational redundancy and reliance on spurious background correlations, making them brittle and difficult to interpret. There's a need for methods that can autonomously discover optimal feature selection tailored to each input without rigid sparsity budgets or heuristic importance scores.

Method: ERSM reformulates feature selection as a differentiable energy minimization problem. It embeds an Energy-Mask Layer inside standard convolutional backbones where each visual token gets a scalar energy composed of two competing forces: intrinsic Unary importance cost and Pairwise spatial coherence penalty. This allows networks to autonomously discover optimal information-density equilibrium per input.

Result: ERSM produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks while preserving classification accuracy. The learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.

Conclusion: ERSM provides a novel framework for differentiable energy-based feature selection that enables convolutional networks to autonomously discover optimal information density, improving interpretability and robustness while maintaining accuracy.

Abstract: Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
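The two competing energy terms can be sketched numerically (an illustrative formulation, not the trained Energy-Mask Layer): each token's energy is a unary importance cost plus a pairwise penalty for disagreeing with its 4-neighbours, and low-energy tokens are kept:

```python
import numpy as np

def energy_mask(unary, lam=0.5, thresh=0.5):
    """unary: (H, W) per-token importance cost (low = important).

    Returns a boolean keep-mask over tokens; lam weights the pairwise
    spatial-coherence penalty against the unary cost."""
    pad = np.pad(unary, 1, mode="edge")
    # Mean of the 4-neighbourhood (up, down, left, right).
    neigh = (pad[:-2, 1:-1] + pad[2:, 1:-1] +
             pad[1:-1, :-2] + pad[1:-1, 2:]) / 4.0
    pairwise = np.abs(unary - neigh)   # penalize disagreeing with neighbours
    energy = unary + lam * pairwise    # two competing forces
    return energy < thresh             # keep low-energy tokens

# A coherent low-cost 2x2 "object" region in a high-cost background.
u = np.array([[0.1, 0.1, 0.9],
              [0.1, 0.1, 0.9],
              [0.9, 0.9, 0.9]])
mask = energy_mask(u)
```

With these toy values the mask keeps exactly the coherent low-cost block, illustrating how the pairwise term discourages isolated selections; there is no fixed sparsity budget, only the energy threshold.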

[217] Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, Chang Xu

Main category: cs.CV

TL;DR: Q-Zoom is a query-aware adaptive high-resolution perception framework for MLLMs that uses dynamic gating and self-distilled region proposals to efficiently process only relevant image regions, significantly accelerating inference while maintaining or improving accuracy.

DetailsMotivation: Current MLLMs use global resolution scaling which floods attention mechanisms with redundant visual tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent.

Method: Proposes Q-Zoom with: 1) Dynamic Gating Network that bypasses high-resolution processing when coarse features suffice, 2) Self-Distilled Region Proposal Network that localizes task-relevant RoIs from intermediate features, and 3) continuous spatio-temporal alignment to fuse dense local RoIs with coarse global layout.

Result: Accelerates inference by 2.52× on Document & OCR benchmarks and 4.39× in High-Resolution scenarios while matching baseline accuracy. When configured for maximum fidelity, surpasses baseline by 1.1% and 8.1% respectively. Improvements transfer to multiple MLLM architectures.

Conclusion: Q-Zoom establishes a dominant Pareto frontier for efficiency-accuracy tradeoffs in MLLMs, enabling efficient high-resolution perception through query-aware adaptive processing.

Abstract: MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline’s peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline’s peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.

[218] Multi-modal user interface control detection using cross-attention

Milad Moradi, Ke Yan, David Colwell, Matthias Samwald, Rhona Asgari

Main category: cs.CV

TL;DR: A multimodal YOLOv5 extension that integrates GPT-generated textual descriptions of UI screenshots via cross-attention modules to improve UI control detection, achieving significant gains over visual-only baselines.

DetailsMotivation: UI control detection from screenshots is challenging due to visual ambiguities, design variability, and lack of contextual cues in pixel-only approaches. Existing methods struggle with semantically complex or visually ambiguous UI elements.

Method: Proposes a multimodal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images through cross-attention modules. Compares three fusion strategies: element-wise addition, weighted sum, and convolutional fusion to align visual features with semantic text embeddings.

Result: Evaluated on 16,000+ annotated UI screenshots across 23 control classes. Convolutional fusion achieved strongest performance with significant gains in detecting semantically complex or visually ambiguous classes, demonstrating consistent improvements over baseline YOLOv5.

Conclusion: Combining visual and textual modalities substantially enhances UI element detection, especially in edge cases where visual information alone is insufficient. Opens opportunities for more reliable tools in software testing, accessibility, and UI analytics.

Abstract: Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.
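The three fusion strategies compared in the paper can be sketched on a visual feature map and a broadcast text embedding; the shapes and the 1x1-convolution stand-in (a channel-wise projection of the concatenated features) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vis = rng.normal(size=(8, 8, 16))        # H x W x C visual feature map
txt = rng.normal(size=(16,))             # text embedding from the description
txt_map = np.broadcast_to(txt, vis.shape)  # tile the embedding over H x W

def fuse_add(v, t):
    return v + t                         # element-wise addition

def fuse_weighted(v, t, alpha=0.7):
    return alpha * v + (1 - alpha) * t   # alpha would be learned in practice

def fuse_conv(v, t, w):
    # "Convolutional" fusion as a 1x1 conv over the channel-concatenated map.
    cat = np.concatenate([v, t], axis=-1)  # H x W x 2C
    return cat @ w                         # w: (2C, C) learned projection

w = rng.normal(size=(32, 16)) * 0.1
out = fuse_conv(vis, txt_map, w)
```

The convolutional variant is the only one that lets the model learn cross-channel interactions between the two modalities, which is consistent with it performing best in the reported experiments.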

[219] POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP

Jiyun Won, Heemin Yang, Woohyeok Kim, Jungseul Ok, Sunghyun Cho

Main category: cs.CV

TL;DR: POS-ISP is a sequence-level reinforcement learning framework that optimizes image signal processing pipelines by predicting entire module sequences and parameters in one forward pass, improving task performance while reducing computational cost.

Motivation: Existing approaches for optimizing ISP pipelines face challenges: neural architecture search suffers from training-inference mismatch, while step-wise reinforcement learning leads to unstable training and high computational overhead due to stage-wise decision making.

Method: Proposes POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. It predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using terminal task reward, eliminating intermediate supervision and redundant executions.

Result: Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost compared to existing approaches.

Conclusion: Sequence-level optimization is a stable and efficient paradigm for task-aware ISP, demonstrating the effectiveness of global sequence prediction over step-wise approaches.

Abstract: Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at https://w1jyun.github.io/POS-ISP
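A crude sketch of the sequence-level idea: sample an entire module ordering in one shot and update from a terminal reward only. The module names, reward, and score-function update below are all hypothetical simplifications; the paper's policy network and task rewards are far richer:

```python
import random

random.seed(0)
MODULES = ["denoise", "demosaic", "gamma", "sharpen"]

def sample_sequence(logits):
    # Sample one full module ordering in a single shot (no stage-wise decisions),
    # by perturbing the logits and sorting (a crude Plackett-Luce-style draw).
    return sorted(MODULES, key=lambda m: logits[m] + random.gauss(0, 1), reverse=True)

def terminal_reward(seq):
    # Hypothetical terminal task reward: the downstream task prefers denoising first.
    return 1.0 if seq[0] == "denoise" else 0.0

logits = {m: 0.0 for m in MODULES}
for _ in range(200):
    seq = sample_sequence(logits)
    r = terminal_reward(seq)
    # Simplified score-function update using only the terminal reward,
    # with no intermediate supervision of individual pipeline stages.
    logits[seq[0]] += 0.5 * (r - 0.5)

assert max(logits, key=logits.get) == "denoise"
```

The point of the sketch is structural: the whole sequence is drawn and scored at once, so there is no per-stage decision loop and no intermediate reward shaping.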

[220] Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

Jintao Chen, Chengyu Bai, Junjun Hu, Xinda Xue, Mu Xu

Main category: cs.CV

TL;DR: Grounded Forcing: A framework for autoregressive video synthesis that addresses semantic forgetting, visual drift, and controllability loss through dual memory KV cache, dual-reference RoPE injection, and asymmetric proximity recache mechanisms.

Motivation: Autoregressive video synthesis faces three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods address these issues in isolation, limiting long-term coherence in infinite-horizon generation.

Method: Three interlocking mechanisms: 1) Dual Memory KV Cache decouples local temporal dynamics from global semantic anchors to address semantic forgetting; 2) Dual-Reference RoPE Injection confines positional embeddings within training manifold to suppress visual drift; 3) Asymmetric Proximity Recache facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates.

Result: Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.

Conclusion: Grounded Forcing bridges time-independent semantics and proximal dynamics through synergistic components that tether generative processes to stable semantic cores while accommodating flexible local dynamics, enabling improved long-term coherence in video generation.

Abstract: Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.
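The Dual Memory KV Cache idea (a fixed set of global semantic anchors plus a bounded local window of recent dynamics) can be mimicked with a toy container. Class and parameter names here are mine, not the paper's, and the entries stand in for real key/value tensors:

```python
from collections import deque

class DualMemoryCache:
    """Toy sketch of a dual-memory KV cache: a small set of fixed global
    anchor entries plus a sliding window of local temporal context."""
    def __init__(self, n_anchors, window):
        self.anchors = []                  # global semantic anchors (kept forever)
        self.local = deque(maxlen=window)  # local dynamics (bounded window)
        self.n_anchors = n_anchors

    def add(self, kv):
        if len(self.anchors) < self.n_anchors:
            self.anchors.append(kv)   # earliest entries become semantic anchors
        else:
            self.local.append(kv)     # later entries pass through the window

    def context(self):
        return self.anchors + list(self.local)

cache = DualMemoryCache(n_anchors=2, window=3)
for t in range(10):
    cache.add(t)
assert cache.context() == [0, 1, 7, 8, 9]
```

Memory stays constant no matter how long generation runs, while the anchors keep early semantic content permanently in context.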

[221] NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results

Wenbin Zou, Tianyi Li, Kejun Wu, Huiping Zhuang, Zongwei Wu, Zhuyun Zhou, Radu Timofte, Kim-Hui Yap, Lap-Pui Chau, Yi Wang, Shiqi Zhou, Xiaodi Shi, Yuxiang Chen, Yilian Zhong, Shibo Yin, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Zhitao Wang, Lifa Ha, Hengyu Man, Xiaopeng Fan, Priyansh Singh, Sidharth, Krrish Dev, Soham Kakkar, Vinit Jakhetiya, Ovais Iqbal Shah, Wei Zhou, Linfeng Li, Qi Xu, Zhenyang Liu, Kepeng Xu, Tong Qiao, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi

Main category: cs.CV

TL;DR: NTIRE 2026 challenge on restoring videos corrupted by bitstream errors, focusing on spatial-temporal artifact removal and content recovery.

Motivation: Bitstream corruption during video transmission/decoding causes severe spatial-temporal artifacts and content distortion, requiring robust restoration methods for practical applications.

Method: Challenge-based approach with standardized dataset, evaluation protocol, and benchmarking of various restoration methods submitted by participants.

Result: Established benchmark results showing the difficulty of bitstream-corrupted video restoration, with analysis of technical trends and performance of different approaches.

Conclusion: The challenge highlights the complexity of this emerging task and provides valuable insights for future research on robust video restoration under practical bitstream corruption scenarios.

Abstract: This paper reports on the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration (BSCVR). The challenge aims to advance research on recovering visually coherent videos from corrupted bitstreams, whose decoding often produces severe spatial-temporal artifacts and content distortion. Built upon recent progress in bitstream-corrupted video recovery, the challenge provides a common benchmark for evaluating restoration methods under realistic corruption settings. We describe the dataset, evaluation protocol, and participating methods, and summarize the final results and main technical trends. The challenge highlights the difficulty of this emerging task and provides useful insights for future research on robust video restoration under practical bitstream corruption.

[222] Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Zhiheng Li, Zongyang Ma, Yuntong Pan, Ziqi Zhang, Xiaolei Lv, Bo Li, Jun Gao, Jianing Zhang, Chunfeng Yuan, Bing Li, Weiming Hu

Main category: cs.CV

TL;DR: Adversarial smuggling attacks exploit the Human-AI capability gap by encoding harmful content into human-readable visual formats that evade MLLM detection while remaining understandable to humans.

Motivation: To uncover a critical security threat in MLLM-based content moderation where harmful content can bypass automated detection by exploiting the gap between human and AI perception/reasoning capabilities.

Method: Classify smuggling attacks into Perceptual Blindness (disrupting text recognition) and Reasoning Blockade (inhibiting semantic understanding). Construct SmuggleBench benchmark with 1,700 adversarial smuggling attack instances. Evaluate on state-of-the-art MLLMs and analyze root causes through perception and reasoning lenses.

Result: Both proprietary (GPT-5) and open-source (Qwen3-VL) MLLMs are highly vulnerable with Attack Success Rates exceeding 90%. Root causes identified: limited vision encoder capabilities, OCR robustness gap, and scarcity of domain-specific adversarial examples.

Conclusion: Adversarial smuggling represents a significant security threat to MLLM-based content moderation systems. The paper establishes a benchmark and identifies vulnerabilities, with preliminary mitigation strategies explored through test-time scaling and adversarial training.

Abstract: Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.

[223] Compression as an Adversarial Amplifier Through Decision Space Reduction

Lewis Evans, Harkrishan Jandu, Zihan Ye, Yang Lu, Shreyank N Gowda

Main category: cs.CV

TL;DR: Compression amplifies adversarial attacks on image classifiers by reducing decision space and increasing sensitivity to perturbations in compressed representations.

Motivation: Image compression is widely used in modern visual pipelines but its impact on adversarial robustness is poorly understood. The paper investigates how compression affects vulnerability to attacks when applied before inference.

Method: Studies compression-aware attacks applied directly in compressed representations, analyzes decision space reduction effects, and conducts extensive experiments across standard benchmarks and architectures.

Result: Compression acts as an adversarial amplifier - compression-aware attacks are substantially more effective than pixel-space attacks under identical perturbation budgets, revealing critical vulnerabilities in compression-in-the-loop deployments.

Conclusion: Compression introduces a critical vulnerability by contracting classification margins and increasing sensitivity to perturbations, highlighting security risks in compression-in-the-loop deployment settings.

Abstract: Image compression is a ubiquitous component of modern visual pipelines, routinely applied by social media platforms and resource-constrained systems prior to inference. Despite its prevalence, the impact of compression on adversarial robustness remains poorly understood. We study a previously unexplored adversarial setting in which attacks are applied directly in compressed representations, and show that compression can act as an adversarial amplifier for deep image classifiers. Under identical nominal perturbation budgets, compression-aware attacks are substantially more effective than their pixel-space counterparts. We attribute this effect to decision space reduction, whereby compression induces a non-invertible, information-losing transformation that contracts classification margins and increases sensitivity to perturbations. Extensive experiments across standard benchmarks and architectures support our analysis and reveal a critical vulnerability in compression-in-the-loop deployment settings. Code will be released.
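The amplification mechanism can be seen in miniature with a scalar quantizer standing in for the non-invertible compression transform. The numbers below are purely illustrative, not taken from the paper:

```python
import numpy as np

def quantize(x, step):
    """Non-invertible, information-losing map (a stand-in for lossy compression)."""
    return np.round(x / step) * step

x, eps, step = 0.74, 0.02, 0.5
clean = quantize(x, step)        # lands in one quantizer cell
adv = quantize(x + eps, step)    # the perturbed value crosses into the next cell

# A tiny nominal perturbation (eps) produces a decoded change equal to a full
# quantization step, i.e. the compression stage amplifies the attacker's
# effective budget near cell boundaries, where margins have been contracted.
assert abs(adv - clean) == step
assert abs(adv - clean) > 10 * eps
```

Near a quantizer boundary, the margin to a different decoded value collapses to almost zero, which is a one-dimensional analogue of the paper's decision-space-reduction argument.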

[224] Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction

Pablo Parte, Roberto Valle, José M. Buenaposada, Luis Baumela

Main category: cs.CV

TL;DR: Systematic audit reveals demographic biases in facial landmark detection are largely due to confounding visual factors like head pose and resolution, not demographic attributes themselves, except for age bias affecting older individuals.

Motivation: To investigate demographic biases in facial landmark detection, a low-level vision task critical for human-robot interaction, since previous bias studies focused on high-level facial analysis while overlooking foundational perceptual components.

Method: Introduced controlled statistical methodology to disentangle demographic effects from confounding visual factors, systematically auditing age, gender, and race biases in a standard representative facial landmark detection model.

Result: Confounding visual factors (head pose, image resolution) heavily outweigh demographic attribute impacts. After accounting for confounders, gender and race performance disparities vanish, but statistically significant age bias persists with higher errors for older individuals.

Conclusion: Fairness issues exist even in low-level vision components and can propagate through HRI pipelines, disproportionately affecting vulnerable populations. Auditing and correcting such biases is essential for trustworthy and equitable robot perception systems.

Abstract: Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender and race biases. To this end we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Evaluations of a standard representative model demonstrate that confounding visual factors, particularly head pose and image resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, we show that performance disparities across gender and race vanish. However, we identify a statistically significant age-related effect, with higher biases observed for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline, disproportionately affecting vulnerable populations. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.
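The controlled-audit logic, which disentangles a demographic attribute from a correlated visual confounder, can be reproduced on simulated data. The data-generating process below is invented purely to illustrate the regression adjustment, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated audit: age only *appears* to raise landmark error because older
# subjects happen to be captured at lower resolution (the confounder).
age = rng.uniform(20, 80, n)
resolution = 200 - age + rng.normal(0, 10, n)            # correlated with age
error = 5.0 - 0.02 * resolution + rng.normal(0, 0.2, n)  # driven by resolution only

# Naive audit: regress error on age alone -> spurious "age bias"
naive_age_slope = np.polyfit(age, error, 1)[0]

# Controlled audit: include the confounder -> the age coefficient collapses
X = np.column_stack([np.ones(n), age, resolution])
beta, *_ = np.linalg.lstsq(X, error, rcond=None)
assert abs(naive_age_slope) > 5 * abs(beta[1])
```

This mirrors the paper's finding for gender and race: an apparent demographic disparity can vanish once head pose and resolution are accounted for in the model.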

[225] MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

Xiaoxiao Ma, Jiachen Lei, Tianfei Ren, Jie Huang, Siming Fu, Aiming Hao, Jiahong Wu, Xiangxiang Chu, Feng Zhao

Main category: cs.CV

TL;DR: A stabilized RL framework for masked autoregressive models that reduces diffusion-induced gradient noise through multi-trajectory expectation and uncertainty-based token selection.

Motivation: Extending RL to hybrid AR-diffusion frameworks is challenging due to interleaved inference and noisy log-probability estimation. The diffusion head in masked autoregressive models introduces noisy gradients causing instability and early performance saturation.

Method: Proposes multi-trajectory expectation (MTE) to estimate optimization direction by averaging over multiple diffusion trajectories, reducing gradient noise. Uses token-wise uncertainty from multiple trajectories to apply MTE only to top-k% uncertain tokens. Also introduces consistency-aware token selection to filter out AR tokens less aligned with final generated content.

Result: Extensive experiments across multiple benchmarks show consistent improvements in visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models.

Conclusion: The proposed stabilized RL framework effectively addresses gradient noise issues in hybrid AR-diffusion models, leading to better training stability and generation quality.

Abstract: Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: https://github.com/AMAP-ML/mar-grpo.
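A toy numerical sketch of multi-trajectory expectation and uncertainty-based token selection follows. The noise model, dimensions, and the top-25% threshold are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, K = 8, 16

true_logp = rng.normal(-2.0, 0.5, n_tokens)
# Hypothetical per-token noise of the diffusion-head log-prob estimate:
# the first two tokens are much harder to estimate than the rest.
scale = np.where(np.arange(n_tokens) < 2, 1.0, 0.05)

# K independent diffusion trajectories per token
trajs = true_logp[None, :] + rng.standard_normal((K, n_tokens)) * scale

mte = trajs.mean(axis=0)          # multi-trajectory expectation (noise averaged out)
uncertainty = trajs.std(axis=0)   # token-wise uncertainty across trajectories

# Apply the averaged estimate only to the top-25% most uncertain tokens,
# keeping cheap single-trajectory estimates elsewhere to avoid over-smoothing.
top_k = np.argsort(uncertainty)[-n_tokens // 4:]
assert set(int(i) for i in top_k) == {0, 1}
```

Averaging K trajectories shrinks the estimator's noise by roughly 1/sqrt(K), and the trajectory spread itself doubles as the uncertainty signal used to decide which tokens need it.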

[226] CAAP: Capture-Aware Adversarial Patch Attacks on Palmprint Recognition Models

Renyang Liu, Jiale Li, Jie Zhang, Cong Wu, Xiaojun Jia, Shuxin Li, Wei Zhou, Kwok-Yan Lam, See-kiong Ng

Main category: cs.CV

TL;DR: CAAP is a capture-aware adversarial patch framework that attacks palmprint recognition systems with physically realizable patches that remain effective under realistic acquisition variations.

Motivation: Palmprint recognition is used in security-critical applications but its robustness against physically realizable attacks is insufficiently understood. Existing studies don't account for the texture-dominant nature of palmprints or distortions from physical acquisition.

Method: Proposes CAAP framework with cross-shaped patch topology to disrupt long-range texture continuity. Includes three modules: ASIT for input-conditioned patch rendering, RaS for stochastic capture-aware simulation, and MS-DIFE for feature-level identity-disruptive guidance.

Result: CAAP achieves strong untargeted and targeted attack performance with favorable cross-model and cross-dataset transferability. Adversarial training only partially reduces attack success rate, leaving substantial residual vulnerability.

Conclusion: Deep palmprint recognition systems remain vulnerable to physically realizable, capture-aware adversarial patch attacks, highlighting need for more effective defenses in practice.

Abstract: Palmprint recognition is deployed in security-critical applications, including access control and palm-based payment, due to its contactless acquisition and highly discriminative ridge-and-crease textures. However, the robustness of deep palmprint recognition systems against physically realizable attacks remains insufficiently understood. Existing studies are largely confined to the digital setting and do not adequately account for the texture-dominant nature of palmprint recognition or the distortions introduced during physical acquisition. To address this gap, we propose CAAP, a capture-aware adversarial patch framework for palmprint recognition. CAAP learns a universal patch that can be reused across inputs while remaining effective under realistic acquisition variation. To match the structural characteristics of palmprints, the framework adopts a cross-shaped patch topology, which enlarges spatial coverage under a fixed pixel budget and more effectively disrupts long-range texture continuity. CAAP further integrates three modules: ASIT for input-conditioned patch rendering, RaS for stochastic capture-aware simulation, and MS-DIFE for feature-level identity-disruptive guidance. We evaluate CAAP on the Tongji, IITD, and AISEC datasets against generic CNN backbones and palmprint-specific recognition models. Experiments show that CAAP achieves strong untargeted and targeted attack performance with favorable cross-model and cross-dataset transferability. The results further show that, although adversarial training can partially reduce the attack success rate, substantial residual vulnerability remains. These findings indicate that deep palmprint recognition systems remain vulnerable to physically realizable, capture-aware adversarial patch attacks, underscoring the need for more effective defenses in practice. Code available at https://github.com/ryliu68/CAAP.
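The geometric intuition behind the cross-shaped patch topology can be checked with boolean masks. This is a toy footprint comparison only; the real patch content is learned under CAAP's full pipeline:

```python
import numpy as np

def cross_mask(h, w, arm):
    """Cross-shaped patch footprint: two crossing bars of width `arm`."""
    m = np.zeros((h, w), dtype=bool)
    m[h // 2 - arm // 2 : h // 2 + arm // 2, :] = True
    m[:, w // 2 - arm // 2 : w // 2 + arm // 2] = True
    return m

def square_mask(h, w, side):
    m = np.zeros((h, w), dtype=bool)
    m[:side, :side] = True
    return m

h = w = 32
cross = cross_mask(h, w, arm=4)       # 240 pixels
square = square_mask(h, w, side=15)   # 225 pixels, a comparable budget

# Under a similar pixel budget, the cross intersects every row and every column
# of the image, cutting across long-range ridge/crease texture; the compact
# square touches only a corner.
assert cross.any(axis=0).all() and cross.any(axis=1).all()
assert not square.any(axis=0).all()
```

This is the sense in which the cross topology "enlarges spatial coverage under a fixed pixel budget" for a texture-dominant modality like palmprints.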

[227] Canopy Tree Height Estimation Using Quantile Regression: Modeling and Evaluating Uncertainty in Remote Sensing

Karsten Schrödter, Jan Pauls, Fabian Gieseke

Main category: cs.CV

TL;DR: Applying quantile regression to satellite-based tree height estimation models to provide uncertainty quantification alongside predictions.

Motivation: Current tree height estimation models provide only point predictions without uncertainty quantification, limiting their use in risk-sensitive ecological monitoring and biomass assessment applications.

Method: Apply quantile regression to existing tree height estimation models with minor modifications to prediction heads, enabling statistically calibrated uncertainty estimates.

Result: Models can provide uncertainty estimates that correlate with known remote sensing challenges (terrain complexity, vegetation heterogeneity), showing lower confidence in more difficult conditions.

Conclusion: Quantile regression enables uncertainty quantification in tree height estimation with minimal model modifications, improving applicability for risk-sensitive ecological applications.

Abstract: Accurate tree height estimation is vital for ecological monitoring and biomass assessment. We apply quantile regression to existing tree height estimation models based on satellite data to incorporate uncertainty quantification. Most current approaches for tree height estimation rely on point predictions, which limits their applicability in risk-sensitive scenarios. In this work, we show that, with minor modifications of a given prediction head, existing models can be adapted to provide statistically calibrated uncertainty estimates via quantile regression. Furthermore, we demonstrate how our results correlate with known challenges in remote sensing (e.g., terrain complexity, vegetation heterogeneity), indicating that the model is less confident in more challenging conditions.
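The standard tool for this kind of adaptation is the quantile (pinball) loss, which makes a regression head predict a chosen quantile rather than the mean; a minimal sketch:

```python
def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for a single quantile level q in (0, 1).
    Training a prediction head with this loss makes it estimate the q-quantile."""
    e = y_true - y_pred
    return max(q * e, (q - 1) * e)

# The loss is asymmetric: over-prediction is penalised more for low quantiles,
# under-prediction more for high quantiles.
assert pinball_loss(10.0, 12.0, 0.1) > pinball_loss(10.0, 12.0, 0.9)
assert pinball_loss(10.0, 8.0, 0.9) > pinball_loss(10.0, 8.0, 0.1)
```

Swapping a model's regression loss for several pinball losses (e.g. q = 0.1, 0.5, 0.9) is exactly the kind of "minor modification of a given prediction head" the abstract describes, and the spread between the predicted quantiles serves as the uncertainty estimate.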

[228] Generative Phomosaic with Structure-Aligned and Personalized Diffusion

Jaeyoung Chung, Hyunjin Son, Kyoung Mu Lee

Main category: cs.CV

TL;DR: First generative approach to photomosaic creation using diffusion models to synthesize tile images conditioned on reference images, overcoming limitations of traditional matching-based methods.

Motivation: Traditional photomosaic methods rely on large collections of tile images and color-based matching, which limits diversity and structural consistency. There's a need for more expressive and coherent photomosaic generation.

Method: Uses diffusion-based generation conditioned on reference images with low-frequency conditioning to align global structure while preserving prompt-driven details. Incorporates few-shot personalized diffusion for user-specific or stylistically consistent tiles.

Result: Generative photomosaic framework produces semantically expressive and structurally coherent compositions, effectively overcoming fundamental limitations of matching-based approaches without requiring extensive image collections.

Conclusion: The generative approach enables photomosaic creation that is both diverse and structurally consistent, representing a significant advancement over traditional methods by leveraging modern diffusion models.

Abstract: We present the first generative approach to photomosaic creation. Traditional photomosaic methods rely on a large number of tile images and color-based matching, which limits both diversity and structural consistency. Our generative photomosaic framework synthesizes tile images using diffusion-based generation conditioned on reference images. A low-frequency conditioned diffusion mechanism aligns global structure while preserving prompt-driven details. This generative formulation enables photomosaic composition that is both semantically expressive and structurally coherent, effectively overcoming the fundamental limitations of matching-based approaches. By leveraging few-shot personalized diffusion, our model is able to produce user-specific or stylistically consistent tiles without requiring an extensive collection of images.
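The low-frequency conditioning idea (keep the reference image's global structure, discard fine detail that the prompt should control) can be sketched with an FFT low-pass filter. The cutoff below is arbitrary and the real mechanism operates inside the diffusion model's conditioning path:

```python
import numpy as np

def low_frequency(img, keep):
    """Keep only the lowest spatial frequencies (a square of half-width `keep`
    around DC), a toy analogue of conditioning on global structure only."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    mask = np.zeros_like(F)
    mask[h // 2 - keep : h // 2 + keep, w // 2 - keep : w // 2 + keep] = 1
    return np.fft.ifft2(np.fft.ifftshift(F * mask)).real

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
lf = low_frequency(img, keep=4)

# The low-pass version retains coarse structure but discards fine detail,
# so its energy (variance) is strictly lower than the original's.
assert np.var(lf) < np.var(img)
```

Conditioning only on this low-pass signal is what lets the generated tiles align with the reference's global layout while leaving their fine detail free to follow the prompt.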

[229] IQ-LUT: interpolated and quantized LUT for efficient image super-resolution

Yuxuan Zhang, Zhikai Dong, Xinning Chai, Xiangyun Zhou, Yi Xu, Zhengxue Cheng, Li Song

Main category: cs.CV

TL;DR: IQ-LUT reduces LUT size for image super-resolution while improving quality through interpolation-quantization integration, residual learning, and knowledge-guided non-uniform quantization.

Motivation: Traditional LUT methods for image super-resolution face storage bottlenecks when pursuing higher quality through larger receptive fields and bit-depth, limiting deployment on resource-constrained devices.

Method: 1) Integrates interpolation and quantization into single-input, multiple-output ECNN to reduce index space and LUT size. 2) Uses residual learning to reduce dependence on LUT bit-depth and improve training stability. 3) Employs knowledge distillation to guide non-uniform quantization optimization.

Result: Achieves up to 50x storage reduction compared to ECNN while achieving superior super-resolution quality in extensive benchmarking.

Conclusion: IQ-LUT effectively addresses the storage bottleneck in LUT-based super-resolution methods, enabling efficient deployment on resource-constrained devices while maintaining high image quality.

Abstract: Lookup table (LUT) methods demonstrate considerable potential in accelerating image super-resolution inference. However, pursuing higher image quality through larger receptive fields and bit-depth triggers exponential growth in the LUT’s index space, creating a storage bottleneck that limits deployment on resource-constrained devices. We introduce IQ-LUT, which achieves a reduction in LUT size while simultaneously enhancing super-resolution quality. First, we integrate interpolation and quantization into the single-input, multiple-output ECNN, which dramatically reduces the index space and thereby the overall LUT size. Second, the integration of residual learning mitigates the dependence on LUT bit-depth, which facilitates training stability and prioritizes the reconstruction of fine-grained details for superior visual quality. Finally, guided by knowledge distillation, our non-uniform quantization process optimizes the quantization levels, thereby reducing storage while also compensating for quantization loss. Extensive benchmarking demonstrates our approach substantially reduces storage costs (by up to 50x compared to ECNN) while achieving superior super-resolution quality.
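The interpolation side of the idea can be sketched with a 1-D toy LUT: store an expensive mapping only at 2^bits quantized levels, then interpolate between neighbouring entries at query time. The paper's multi-output ECNN and learned non-uniform quantization levels are far more involved than this uniform-grid example:

```python
import numpy as np

bits = 4
levels = np.linspace(0.0, 1.0, 2 ** bits)  # quantized input grid (16 entries)
lut = levels ** 2  # pretend this caches an expensive per-pixel mapping f(x) = x^2

def lut_lookup(x):
    """Linear interpolation between the two neighbouring LUT entries."""
    idx = np.clip(np.searchsorted(levels, x) - 1, 0, len(levels) - 2)
    t = (x - levels[idx]) / (levels[idx + 1] - levels[idx])
    return (1 - t) * lut[idx] + t * lut[idx + 1]

# Interpolation keeps the table tiny (16 entries instead of a dense grid)
# while bounding the lookup error by the grid spacing.
assert abs(lut_lookup(0.37) - 0.37 ** 2) < 1e-2
```

The storage saving compounds with input dimensionality: a multi-input LUT's index space grows exponentially in bit-depth, so shaving bits via interpolation is what makes the 50x reduction over ECNN possible.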

[230] Synthetic Dataset Generation for Partially Observed Indoor Objects

Jelle Vermandere, Maarten Bassier, Maarten Vergauwen

Main category: cs.CV

TL;DR: A virtual scanning framework in Unity generates realistic synthetic 3D scan datasets for training scene reconstruction and object completion models, addressing the high cost of acquiring real-world scan data with ground truth.

Motivation: Learning-based methods for 3D scene reconstruction and object completion require large datasets with partial scans paired with complete ground truth geometry. Acquiring such datasets using real-world scanning systems is costly and time-consuming, especially for accurate ground truth in occluded regions.

Method: A virtual scanning framework implemented in Unity simulates real-world scanner behavior with configurable parameters (scan resolution, measurement range, distance-dependent noise). Instead of directly sampling mesh surfaces, it performs ray-based scanning from virtual viewpoints for realistic sensor visibility and occlusion modeling. Panoramic images at scanner locations assign colors to resulting point clouds. The system integrates with a procedural indoor scene generation pipeline for scalable dataset creation.

Result: The V-Scan dataset is introduced, containing synthetic indoor scans with object-level partial point clouds, voxel-based occlusion grids, and complete ground-truth geometry. This provides valuable supervision for training and evaluating learning-based methods.

Conclusion: The virtual scanning framework enables efficient generation of realistic synthetic 3D scan datasets, addressing data scarcity for training scene reconstruction and object completion models while providing comprehensive ground truth information.

Abstract: Learning-based methods for 3D scene reconstruction and object completion require large datasets containing partial scans paired with complete ground-truth geometry. However, acquiring such datasets using real-world scanning systems is costly and time-consuming, particularly when accurate ground truth for occluded regions is required. In this work, we present a virtual scanning framework implemented in Unity for generating realistic synthetic 3D scan datasets. The proposed system simulates the behaviour of real-world scanners using configurable parameters such as scan resolution, measurement range, and distance-dependent noise. Instead of directly sampling mesh surfaces, the framework performs ray-based scanning from virtual viewpoints, enabling realistic modelling of sensor visibility and occlusion effects. In addition, panoramic images captured at the scanner location are used to assign colours to the resulting point clouds. To support scalable dataset creation, the scanner is integrated with a procedural indoor scene generation pipeline that automatically produces diverse room layouts and furniture arrangements. Using this system, we introduce the V-Scan dataset, which contains synthetic indoor scans together with object-level partial point clouds, voxel-based occlusion grids, and complete ground-truth geometry. The resulting dataset provides valuable supervision for training and evaluating learning-based methods for scene reconstruction and object completion.
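A 2-D toy analogue of the ray-based scanning model (a fan of rays, distance-dependent noise, and a sensor range cutoff); all parameters below are illustrative, and the paper's scanner is a full Unity ray caster against textured meshes:

```python
import numpy as np

rng = np.random.default_rng(0)

def virtual_scan(wall_dist, n_rays=90, fov=np.pi / 2, max_range=10.0,
                 noise_per_m=0.002):
    """Cast a fan of rays at a flat wall, add distance-dependent Gaussian
    noise, and drop returns beyond the sensor's measurement range."""
    angles = np.linspace(-fov / 2, fov / 2, n_rays)
    dists = wall_dist / np.cos(angles)                   # true hit distance per ray
    dists = dists + rng.normal(0, noise_per_m * dists)   # noise grows with range
    return dists[dists <= max_range]                     # sensor range cutoff

scan = virtual_scan(wall_dist=8.0)
# Oblique rays hit the wall farther away; some exceed max_range and are dropped,
# producing the kind of partial observation the dataset is built around.
assert 0 < len(scan) < 90
```

Sampling distances along cast rays, rather than sampling the mesh surface directly, is what makes visibility and occlusion fall out naturally: surfaces the rays never reach simply produce no points.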

[231] ModuSeg: Decoupling Object Discovery and Semantic Retrieval for Training-Free Weakly Supervised Segmentation

Qingze He, Fagui Liu, Dengke Zhang, Qingmao Wei, Quan Tang

Main category: cs.CV

TL;DR: ModuSeg is a training-free weakly supervised semantic segmentation framework that decouples object discovery and semantic assignment using foundation models and feature retrieval.

DetailsMotivation: Existing weakly supervised semantic segmentation methods entangle semantic recognition and object localization, causing models to focus only on sparse discriminative regions. Foundation models show potential but current approaches struggle with pseudo-label noise and require multi-stage retraining or unstable joint optimization.

Method: ModuSeg explicitly decouples object discovery and semantic assignment. It uses a general mask proposer for geometric proposals with reliable boundaries, leverages semantic foundation models to build an offline feature bank, and transforms segmentation into non-parametric feature retrieval. Includes semantic boundary purification and soft-masked feature aggregation to mitigate boundary ambiguity and quantization errors.
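
The retrieval step is easy to picture: pool foundation-model features inside each mask proposal and assign the class of the nearest prototype in the offline bank. A minimal numpy sketch under that reading (function and array names are illustrative, not from the paper, and the paper's purification and soft-masking refinements are omitted):

```python
import numpy as np

def assign_labels(feat_map, masks, bank, bank_labels):
    """Label each mask proposal by nearest-prototype retrieval.

    feat_map:    (H, W, D) dense features from a semantic foundation model
    masks:       list of (H, W) boolean mask proposals
    bank:        (K, D) offline feature bank of category prototypes
    bank_labels: (K,) class id of each prototype
    """
    bank_n = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    labels = []
    for m in masks:
        f = feat_map[m].mean(axis=0)          # pool features inside the proposal
        f = f / np.linalg.norm(f)
        sims = bank_n @ f                     # cosine similarity to each prototype
        labels.append(int(bank_labels[np.argmax(sims)]))
    return labels
```

Because the bank is built offline and the matching is non-parametric, no segmentation parameters are trained or fine-tuned.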

Result: Extensive experiments show the decoupled architecture preserves fine boundaries without parameter fine-tuning and achieves highly competitive performance on standard benchmark datasets.

Conclusion: ModuSeg presents an effective training-free framework for weakly supervised semantic segmentation that decouples object discovery and semantic assignment, leveraging foundation models to achieve competitive performance without parameter tuning.

Abstract: Weakly supervised semantic segmentation aims to achieve pixel-level predictions using image-level labels. Existing methods typically entangle semantic recognition and object localization, which often leads models to focus exclusively on sparse discriminative regions. Although foundation models show immense potential, many approaches still follow the tightly coupled optimization paradigm, struggling to effectively alleviate pseudo-label noise and often relying on time-consuming multi-stage retraining or unstable end-to-end joint optimization. To address the above challenges, we present ModuSeg, a training-free weakly supervised semantic segmentation framework centered on explicitly decoupling object discovery and semantic assignment. Specifically, we integrate a general mask proposer to extract geometric proposals with reliable boundaries, while leveraging semantic foundation models to construct an offline feature bank, transforming segmentation into a non-parametric feature retrieval process. Furthermore, we propose semantic boundary purification and soft-masked feature aggregation strategies to effectively mitigate boundary ambiguity and quantization errors, thereby extracting high-quality category prototypes. Extensive experiments demonstrate that the proposed decoupled architecture better preserves fine boundaries without parameter fine-tuning and achieves highly competitive performance on standard benchmark datasets. Code is available at https://github.com/Autumnair007/ModuSeg.

[232] Not all tokens contribute equally to diffusion learning

Guoqing Zhang, Lu Shi, Wanru Xu, Linna Zhang, Sen Wang, Fangfang Wang, Yigang Cen

Main category: cs.CV

TL;DR: DARE improves text-to-video diffusion models by addressing distributional bias and spatial misalignment through distribution rectification and spatial attention alignment.

DetailsMotivation: Current text-to-video diffusion models often neglect semantically important tokens during inference due to distributional bias from long-tailed token frequency in training data and spatial misalignment in cross-attention, leading to biased or incomplete generations.

Method: Proposes DARE framework with two components: 1) Distribution-Rectified Classifier-Free Guidance (DR-CFG) that dynamically suppresses dominant tokens with low semantic density to encourage balanced conditional distribution learning, and 2) Spatial Representation Alignment (SRA) that adaptively reweights cross-attention maps based on token importance and enforces representation consistency.

Result: Extensive experiments on multiple benchmark datasets show DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.

Conclusion: DARE effectively addresses distributional bias and spatial misalignment in text-to-video diffusion models, leading to better semantic guidance and improved generation quality.

Abstract: With the rapid development of conditional diffusion models, significant progress has been made in text-to-video generation. However, we observe that these models often neglect semantically important tokens during inference, leading to biased or incomplete generations under classifier-free guidance. We attribute this issue to two key factors: distributional bias caused by the long-tailed token frequency in training data, and spatial misalignment in cross-attention where semantically important tokens are overshadowed by less informative ones. To address these issues, we propose Distribution-Aware Rectification and Spatial Ensemble (DARE), a unified framework that improves semantic guidance in diffusion models from the perspectives of distributional debiasing and spatial consistency. First, we introduce Distribution-Rectified Classifier-Free Guidance (DR-CFG), which regularizes the training process by dynamically suppressing dominant tokens with low semantic density, encouraging the model to better capture underrepresented semantic cues and learn a more balanced conditional distribution. This design mitigates the risk of the model distribution overfitting to tokens with low semantic density. Second, we propose Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps according to token importance and enforces representation consistency, enabling semantically important tokens to exert stronger spatial guidance during generation. This mechanism effectively prevents low semantic-density tokens from dominating the attention allocation, thereby avoiding the dilution of the spatial and distributional guidance provided by high semantic-density tokens. Extensive experiments on multiple benchmark datasets demonstrate that DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.

[233] PRISM: Rethinking Scattered Atmosphere Reconstruction as a Unified Understanding and Generation Model for Real-world Dehazing

Chengyu Fang, Chunming He, Yuelin Zhang, Chubin Chen, Chenyang Zhu, Longxiang Tang, Xiu Li

Main category: cs.CV

TL;DR: PRISM proposes a physically structured framework for real-world image dehazing using proximal scattered atmosphere reconstruction with self-distillation adaptation for synthetic-to-real transfer.

DetailsMotivation: Real-world image dehazing is challenging due to non-uniform haze distribution, spatially varying illumination from multiple light sources, and scarcity of paired real hazy-clean data. Existing methods struggle with complex regions and mixed-light conditions.

Method: Proximal Scattered Atmosphere Reconstruction (PSAR) jointly reconstructs clear scenes and scattering variables under atmospheric scattering model. Includes online non-uniform haze synthesis pipeline and Selective Self-distillation Adaptation scheme for unpaired real-world scenarios.
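
The atmospheric scattering model that PSAR builds on is the standard formulation I = J * t + A * (1 - t), with clear scene J, transmission map t, and airlight A. A minimal sketch of the forward model and its closed-form inversion (PSAR's actual reconstruction is learned jointly with the scattering variables, not this closed form):

```python
import numpy as np

def hazy(J, t, A):
    """Atmospheric scattering model: I = J * t + A * (1 - t)."""
    return J * t + A * (1.0 - t)

def dehaze(I, t, A, t_min=0.1):
    """Invert the model for the clear scene J, clamping t for stability."""
    return (I - A * (1.0 - t)) / np.maximum(t, t_min)
```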

Result: Extensive experiments on real-world benchmarks demonstrate state-of-the-art performance on real-world image dehazing tasks.

Conclusion: PRISM provides a physically structured framework that improves reliability in complex regions and mixed-light conditions through joint reconstruction and self-distillation adaptation.

Abstract: Real-world image dehazing (RID) aims to remove haze-induced degradation from real scenes. This task remains challenging due to non-uniform haze distribution, spatially varying illumination from multiple light sources, and the scarcity of paired real hazy-clean data. In PRISM, we propose Proximal Scattered Atmosphere Reconstruction (PSAR), a physically structured framework that jointly reconstructs the clear scene and scattering variables under the atmospheric scattering model, thereby improving reliability in complex regions and mixed-light conditions. To bridge the synthetic-to-real gap, we design an online non-uniform haze synthesis pipeline and a Selective Self-distillation Adaptation scheme for unpaired real-world scenarios, which enables the model to selectively learn from high-quality perceptual targets while leveraging its intrinsic scattering understanding to audit residual haze and guide self-refinement. Extensive experiments on real-world benchmarks demonstrate that PRISM achieves state-of-the-art performance on RID tasks.

[234] AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors

Xiaoxue Zhang, Xiaoxu Zheng, Yixuan Yin, Tiao Zhao, Kaihua Tang, Michael Bi Mi, Zhan Xu, Dave Zhenyu Chen

Main category: cs.CV

TL;DR: AnchorSplat introduces a novel feed-forward 3D Gaussian Splatting framework that uses 3D geometric priors to create anchor-aligned Gaussian representations, reducing Gaussian count while improving reconstruction quality and computational efficiency.

DetailsMotivation: Current feed-forward Gaussian reconstruction models use pixel-aligned formulations that tightly couple Gaussian representations with input images, limiting scalability and efficiency. The authors aim to create a more geometry-aware 3D Gaussian representation that is independent of image resolution and view count.

Method: AnchorSplat uses 3D geometric priors (sparse point clouds, voxels, or RGB-D point clouds) to guide an anchor-aligned Gaussian representation in 3D space. It includes a Gaussian Refiner module that adjusts intermediate Gaussians through a few forward passes, creating a more efficient and view-consistent reconstruction.

Result: The method achieves state-of-the-art performance on the ScanNet++ v2 NVS benchmark, outperforming previous methods with more view-consistent reconstructions while using substantially fewer Gaussian primitives.

Conclusion: AnchorSplat demonstrates that anchor-aligned Gaussian representations guided by 3D geometric priors can significantly improve computational efficiency and reconstruction fidelity for scene-level 3D reconstruction compared to pixel-aligned approaches.

Abstract: Recent feed-forward Gaussian reconstruction models adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, entangling Gaussian representations tightly with the input images. In this paper, we propose AnchorSplat, a novel feed-forward 3DGS framework for scene-level reconstruction that represents the scene directly in 3D space. AnchorSplat introduces an anchor-aligned Gaussian representation guided by 3D geometric priors (e.g., sparse point clouds, voxels, or RGB-D point clouds), enabling geometry-aware, renderable 3D Gaussians that are independent of image resolution and the number of views. This design substantially reduces the number of required Gaussians, improving computational efficiency while enhancing reconstruction fidelity. Beyond the anchor-aligned design, we utilize a Gaussian Refiner to adjust the intermediate Gaussians via only a few forward passes. Experiments on the ScanNet++ v2 NVS benchmark demonstrate state-of-the-art performance, outperforming previous methods with more view-consistent reconstructions and substantially fewer Gaussian primitives.

[235] Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data

Mojgan Madadikhaljan, Jonathan Prexl, Isabelle Wittmann, Conrad M Albrecht, Michael Schmitt

Main category: cs.CV

TL;DR: LIANet is a coordinate-based neural representation for multi-temporal Earth observation data that reconstructs satellite imagery from spatiotemporal coordinates, serving as a user-friendly alternative to Geospatial Foundation Models for downstream tasks without requiring original satellite data access.

DetailsMotivation: To create a neural representation that eliminates the overhead of data access and preprocessing for end-users working with Earth observation data, enabling fine-tuning for downstream tasks using only labels without requiring access to original satellite imagery.

Method: LIANet uses coordinate-based neural representation to model multi-temporal Earth observation data as a continuous spatiotemporal neural field. It takes spatial and temporal coordinates as input and reconstructs corresponding satellite imagery. Once pretrained, it can be adapted to various downstream tasks like semantic segmentation or pixel-wise regression.
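
The core idea, a continuous field queried by spatiotemporal coordinates, can be illustrated with a toy stand-in: lift (x, y, t) through random Fourier features, fit a linear readout to gridded "pixel" values, and then query the field at coordinates never seen on the grid. This linear model only illustrates the coordinate-to-pixel interface; LIANet itself is a neural network and all data here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "scene": a single-band image sequence sampled on a coarse (x, y, t) grid.
coords = np.stack(np.meshgrid(np.linspace(0, 1, 16),
                              np.linspace(0, 1, 16),
                              np.linspace(0, 1, 4), indexing="ij"),
                  axis=-1).reshape(-1, 3)
target = np.sin(coords @ np.array([3.0, 5.0, 2.0]))   # stand-in pixel values

# Lift coordinates through random Fourier features and fit a linear readout;
# the fitted weights define a continuous field over the whole domain.
B = rng.normal(scale=3.0, size=(3, 64))
phi = np.concatenate([np.sin(coords @ B), np.cos(coords @ B)], axis=1)
w, *_ = np.linalg.lstsq(phi, target, rcond=None)

# Query the field at a coordinate never seen on the training grid.
q = np.array([[0.33, 0.71, 0.4]])
pred = np.concatenate([np.sin(q @ B), np.cos(q @ B)], axis=1) @ w
```

Downstream adaptation in LIANet then operates on this coordinate interface rather than on the original imagery.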

Result: LIANet demonstrates competitive performance when fine-tuned for downstream tasks compared to training from scratch or using established Geospatial Foundation Models, across target areas of varying sizes.

Conclusion: LIANet provides a practical alternative to Geospatial Foundation Models by simplifying data access and preprocessing, enabling efficient adaptation to various Earth observation tasks while maintaining competitive performance.

Abstract: In this work, we present LIANet (Location Is All You Need Network), a coordinate-based neural representation that models multi-temporal spaceborne Earth observation (EO) data for a given region of interest as a continuous spatiotemporal neural field. Given only spatial and temporal coordinates, LIANet reconstructs the corresponding satellite imagery. Once pretrained, this neural representation can be adapted to various EO downstream tasks, such as semantic segmentation or pixel-wise regression; importantly, this adaptation requires no access to the original satellite data. LIANet intends to serve as a user-friendly alternative to Geospatial Foundation Models (GFMs) by eliminating the overhead of data access and preprocessing for end-users and enabling fine-tuning solely based on labels. We demonstrate the pretraining of LIANet across target areas of varying sizes and show that fine-tuning it for downstream tasks achieves competitive performance compared to training from scratch or using established GFMs. The source code and datasets are publicly available at https://github.com/mojganmadadi/LIANet/tree/v1.0.1.

[236] Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples

Reiji Saito, Satoshi Kamiya, Kazuhiro Hotta

Main category: cs.CV

TL;DR: RePaste: A novel anomaly detection method for ambiguous normal samples in industrial settings that handles specification changes through iterative re-pasting of high anomaly score regions.

DetailsMotivation: Real-world industrial anomaly detection faces ambiguity in defining normal samples due to specification changes (e.g., small scratches may be acceptable or unacceptable depending on equipment upgrades). Current methods assume clear normal/abnormal distinctions, but practical scenarios require handling ambiguous normal samples that may include minor defects.

Method: Proposes RePaste method that iteratively re-pastes regions with high anomaly scores from previous steps into the input for next steps, enhancing learning of ambiguous normal samples. Also introduces novel scenarios and evaluation metrics to accommodate specification changes in real-world applications.
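
One plausible reading of the re-pasting step, sketched with hypothetical array names (the paper's exact pasting source, thresholding rule, and schedule may differ): take the region in the top quantile of the previous step's anomaly map and paste it into the input for the next step.

```python
import numpy as np

def repaste(prev_input, source, score_map, top_frac=0.05):
    """Paste the top-scoring region of `source` into the next step's input.

    score_map is the per-pixel anomaly map from the previous step; pixels in
    its top `top_frac` quantile are overwritten in the returned input.
    """
    thresh = np.quantile(score_map, 1.0 - top_frac)
    mask = score_map >= thresh
    nxt = prev_input.copy()
    nxt[mask] = source[mask]
    return nxt, mask
```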

Result: On MVTec AD benchmark with proposed scenarios, RePaste achieved state-of-the-art performance on the new evaluation metric while maintaining high AUROC and PRO scores.

Conclusion: RePaste effectively addresses the ambiguity problem in industrial anomaly detection by handling specification changes through iterative refinement, providing a practical solution for real-world applications where normal sample definitions evolve.

Abstract: In conventional anomaly detection, training data consist of only normal samples. However, in real-world scenarios, the definition of a normal sample is often ambiguous. For example, there are cases where a sample has small scratches or stains but is still acceptable for practical usage. On the other hand, higher precision is required when manufacturing equipment is upgraded. In such cases, normal samples may include small scratches, tiny dust particles, or a foreign object that we would prefer to classify as an anomaly. Such cases frequently occur in industrial settings, yet they have not been discussed until now. Thus, we propose novel scenarios and an evaluation metric to accommodate specification changes in real-world applications. Furthermore, to address the ambiguity of normal samples, we propose RePaste, which enhances learning by re-pasting regions with high anomaly scores from the previous step into the input for the next step. On our scenarios using the MVTec AD benchmark, RePaste achieved state-of-the-art performance with respect to the proposed evaluation metric, while maintaining high AUROC and PRO scores. Code: https://github.com/ReijiSoftmaxSaito/Scenario

[237] Assessing the Added Value of Onboard Earth Observation Processing with the IRIDE HEO Service Segment

Parampuneet Kaur Thind, Charles Mwangi, Giovanni Varetto, Lorenzo Sarti, Andrea Papa, Andrea Taramelli

Main category: cs.CV

TL;DR: Onboard AI processing for Earth observation satellites enables faster, higher-resolution burnt-area mapping compared to ground-only systems

DetailsMotivation: Current Earth observation services face limitations from ground-based processing including latency, bandwidth constraints, and limited autonomous observation capabilities. The IRIDE program aims to overcome these through onboard intelligence.

Method: Developed Hawk for Earth Observation (HEO) system enabling onboard data product generation within IRIDE’s constellation-of-constellations architecture. Uses burnt-area mapping as case study to demonstrate onboard processing advantages.

Result: Onboard processing achieves higher spatial detail (sub-3m resolution), detects smaller events (3-hectare minimum), and improves system responsiveness compared to ground-only architectures.

Conclusion: Onboard intelligence provides complementary value to existing services, supporting faster emergency response and land management through image-driven pre-classification.

Abstract: Current operational Earth Observation (EO) services, including the Copernicus Emergency Management Service (CEMS), the European Forest Fire Information System (EFFIS), and the Copernicus Land Monitoring Service (CLMS), rely primarily on ground-based processing pipelines. While these systems provide mature large-scale information products, they remain constrained by downlink latency, bandwidth limitations, and limited capability for autonomous observation prioritisation. The International Report for an Innovative Defence of Earth (IRIDE) programme is a national Earth observation initiative led by the Italian government to support public authorities through timely, objective information derived from spaceborne data. Rather than a single constellation, IRIDE is designed as a constellation of constellations, integrating heterogeneous sensing technologies within a unified service-oriented architecture. Within this framework, Hawk for Earth Observation (HEO) enables onboard generation of data products, allowing information extraction earlier in the processing chain. This paper examines the limitations of ground-only architectures and evaluates the added value of onboard processing at the operational service level. The IRIDE burnt-area mapping service is used as a representative case study to demonstrate how onboard intelligence can support higher spatial detail (sub-three-metre ground sampling distance), smaller detectable events (minimum mapping unit of three hectares), and improved system responsiveness. Rather than replacing existing Copernicus services, the IRIDE HEO capability is positioned as a complementary layer providing image-driven pre-classification to support downstream emergency and land-management workflows. This work highlights the operational value of onboard intelligence for emerging low-latency EO service architectures.

[238] Accuracy Improvement of Semi-Supervised Segmentation Using Supervised ClassMix and Sup-Unsup Feature Discriminator

Takahiro Mano, Reiji Saito, Kazuhiro Hotta

Main category: cs.CV

TL;DR: A semi-supervised semantic segmentation method that improves ClassMix by using labeled image regions for mixing and aligning predictions between labeled and unlabeled data.

DetailsMotivation: Pixel-level labeling for semantic segmentation is expensive. Semi-supervised learning helps but existing methods like ClassMix use inaccurate pseudo-labels from unlabeled images and suffer from data quality gaps between labeled/unlabeled images.

Method: Two improvements: 1) Paste class labels and corresponding image regions from labeled images onto unlabeled images and their pseudo-labeled versions, 2) Train model to make predictions on unlabeled images more similar to those on labeled images.
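
The first improvement can be sketched as a mask-driven mix, using ground-truth masks from a labeled image rather than pseudo-labels to select what gets pasted (names are illustrative, not from the paper):

```python
import numpy as np

def supervised_classmix(img_l, lbl_l, img_u, pseudo_u, classes):
    """Paste the pixels of `classes` from a labeled image, together with their
    ground-truth labels, onto an unlabeled image and its pseudo-label map."""
    mask = np.isin(lbl_l, classes)
    img_mix = np.where(mask[..., None], img_l, img_u)   # mix the images
    lbl_mix = np.where(mask, lbl_l, pseudo_u)           # mix the label maps
    return img_mix, lbl_mix
```

Because the pasted labels come from annotated data, the mixed supervision is exact wherever the mask is set, avoiding the pseudo-label noise of standard ClassMix.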

Result: Experiments on Chase and COVID-19 datasets show average 2.07% mIoU improvement over conventional semi-supervised learning methods.

Conclusion: The proposed method effectively addresses pseudo-label inaccuracy and data quality gap issues in semi-supervised semantic segmentation, achieving better performance.

Abstract: In semantic segmentation, the creation of pixel-level labels for training data incurs significant costs. To address this problem, semi-supervised learning, which utilizes a small number of labeled images alongside unlabeled images to enhance performance, has gained attention. A conventional semi-supervised learning method, ClassMix, pastes class labels predicted from unlabeled images onto other images. However, since ClassMix performs operations using pseudo-labels obtained from unlabeled images, there is a risk of handling inaccurate labels. Additionally, there is a gap in data quality between labeled and unlabeled images, which can impact the feature maps. This study addresses these two issues. First, we propose a method where class labels from labeled images, along with the corresponding image regions, are pasted onto unlabeled images and their pseudo-labeled images. Second, we introduce a method that trains the model to make predictions on unlabeled images more similar to those on labeled images. Experiments on the Chase and COVID-19 datasets demonstrated an average improvement of 2.07% in mIoU compared to conventional semi-supervised learning methods.

[239] A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing

Chenhao Liu, Zelin Wen, Yan Tong, Junjie Zhu, Xinyu Tian, Yuchi Liu, Ashu Gupta, Syed M. S. Islam, Tom Gedeon, Yue Yao

Main category: cs.CV

TL;DR: UPDP: A utility-preserving de-identification pipeline for cross-hospital radiology data sharing that filters privacy-sensitive information while preserving pathology cues for vision-language model training.

DetailsMotivation: Privacy concerns heavily constrain sharing of large-scale radiology data needed for robust medical AI systems. Existing de-identification focuses on privacy compliance but doesn't explore whether de-identified data preserves sufficient utility for large-scale vision-language model training and cross-hospital transfer.

Method: 1) Compile blacklist of privacy-sensitive terms and whitelist of pathology-related terms for text filtering. 2) Use generative filtering mechanism to synthesize privacy-filtered, pathology-preserving counterparts of original radiology images. 3) Combine synthetic image counterparts with ID-filtered reports for secure cross-hospital sharing.
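
The blacklist side of the report filtering can be sketched as simple term redaction; the terms and the placeholder token here are stand-ins, since UPDP's actual lists and matching rules are not specified at this level of detail:

```python
import re

def redact(text, blacklist, token="[REDACTED]"):
    """Replace any blacklisted term with a placeholder, longest term first."""
    pattern = re.compile(
        "|".join(re.escape(t) for t in sorted(blacklist, key=len, reverse=True)),
        re.IGNORECASE,
    )
    return pattern.sub(token, text)
```

Whitelisted pathology terms are simply never blacklisted, so they pass through unchanged.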

Result: On public chest X-ray benchmarks: effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on de-identified data maintain competitive diagnostic accuracy vs. original data, with marked decline in identity-related accuracy confirming privacy protection. Cross-hospital setting shows de-identified data combined with local data yields better performance.

Conclusion: UPDP enables privacy-preserving radiology data sharing while maintaining utility for vision-language model training and cross-hospital transfer, addressing the critical privacy-utility tradeoff in medical AI development.

Abstract: Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.

[240] CSA-Graphs: A Privacy-Preserving Structural Dataset for Child Sexual Abuse Research

Carlos Caetano, Camila Laranjeira, Clara Ernesto, Artur Barros, João Macedo, Leo S. F. Ribeiro, Jefersson A. dos Santos, Sandra Avila

Main category: cs.CV

TL;DR: Privacy-preserving structural dataset (CSA-Graphs) for Child Sexual Abuse Imagery classification using scene graphs and skeleton graphs instead of original images.

DetailsMotivation: CSAI classification faces legal/ethical restrictions preventing public dataset sharing, hindering reproducibility and progress in automated methods development.

Method: Created CSA-Graphs dataset with two graph-based modalities: scene graphs for object relationships and skeleton graphs for human pose, removing explicit visual content while preserving contextual information.

Result: Both graph representations retain useful information for CSAI classification, and combining them further improves performance.

Conclusion: CSA-Graphs enables broader computer vision research for child safety while respecting legal and ethical constraints through privacy-preserving structural representations.

Abstract: Child Sexual Abuse Imagery (CSAI) classification is an important yet challenging problem for computer vision research due to the strict legal and ethical restrictions that prevent the public sharing of CSAI datasets. This limitation hinders reproducibility and slows progress in developing automated methods. In this work, we introduce CSA-Graphs, a privacy-preserving structural dataset. Instead of releasing the original images, we provide structural representations that remove explicit visual content while preserving contextual information. CSA-Graphs includes two complementary graph-based modalities: scene graphs describing object relationships and skeleton graphs encoding human pose. Experiments show that both representations retain useful information for classifying CSAI, and that combining them further improves performance. This dataset enables broader research on computer vision methods for child safety while respecting legal and ethical constraints.

[241] USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification

Changmiao Wang, Songqi Zhang, Yongquan Zhang, Yifei Wang, Liya Liu, Nannan Li, Xingzhi Li, Jiexin Pan, Yi Jiang, Xiang Wan, Hai Wang, Ahmed Elazab

Main category: cs.CV

TL;DR: USCNet is a Transformer-based multimodal network that integrates CT images and EHR data for preoperative kidney stone segmentation and classification, outperforming existing methods.

DetailsMotivation: Current kidney stone analysis relies on postoperative specimens, preventing rapid preoperative classification needed for personalized treatment planning and recurrence prevention.

Method: USCNet uses a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules, plus a dynamic loss function to balance segmentation and classification objectives.

Result: Experiments on an in-house kidney stone dataset show USCNet achieves outstanding performance across all metrics, with classification efficacy significantly surpassing mainstream methods.

Conclusion: USCNet offers a promising solution for precise preoperative kidney stone classification with substantial clinical benefits, and the source code is publicly available.

Abstract: Kidney stone disease ranks among the most prevalent conditions in urology, and understanding the composition of these stones is essential for creating personalized treatment plans and preventing recurrence. Current methods for analyzing kidney stones depend on postoperative specimens, which prevents rapid classification before surgery. To overcome this limitation, we introduce a new approach called the Urinary Stone Segmentation and Classification Network (USCNet). This innovative method allows for precise preoperative classification of kidney stones by integrating Computed Tomography (CT) images with clinical data from Electronic Health Records (EHR). USCNet employs a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules for accurate classification. Moreover, a dynamic loss function is introduced to effectively balance the dual objectives of segmentation and classification. Experiments on an in-house kidney stone dataset show that USCNet demonstrates outstanding performance across all evaluation metrics, with its classification efficacy significantly surpassing existing mainstream methods. This study presents a promising solution for the precise preoperative classification of kidney stones, offering substantial clinical benefits. The source code has been made publicly available: https://github.com/ZhangSongqi0506/KidneyStone.

[242] Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

Zhuohong Chen, Zhenxian Wu, Yunyao Yu, Hangrui Xu, Zirui Liao, Zhifang Liu, Xiangwen Deng, Pen Jiao, Haoqian Wang

Main category: cs.CV

TL;DR: KB-VQA reformulated as search-agent problem with multi-step decision making, using four actions (Answer, Image Retrieval, Text Retrieval, Caption) to dynamically retrieve and reason about external knowledge.

DetailsMotivation: Existing KB-VQA methods use fixed retrieval pipelines that separate retrieval from reasoning, making it hard to adapt to diverse question types and resulting in poorly aligned evidence. Need for more integrated, adaptive approach.

Method: Reformulate KB-VQA as search-agent problem with multi-step decision making. Agent selects from four actions based on current information state. Automated pipeline collects multi-step trajectories (reasoning process, tool usage, decisions) for fine-tuning supervision.
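
The decision loop can be sketched as follows; `policy` and `tools` are stand-ins for the fine-tuned model and the retrieval/captioning backends, and the state layout is illustrative:

```python
def run_agent(question, image, policy, tools, max_steps=6):
    """Multi-step KB-VQA loop: at each step the policy picks one action.

    `policy` maps the current state to (action, argument); Answer terminates,
    while Image Retrieval / Text Retrieval / Caption call a tool whose output
    is appended to the evidence available to later decisions.
    """
    state = {"question": question, "image": image, "evidence": []}
    for _ in range(max_steps):
        action, arg = policy(state)
        if action == "Answer":
            return arg
        state["evidence"].append(tools[action](state, arg))
    return None  # step budget exhausted without answering
```

Logging (state, action, evidence) triples from such runs yields exactly the kind of multi-step trajectories the pipeline collects for fine-tuning supervision.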

Result: Achieves state-of-the-art performance on InfoSeek and E-VQA datasets, consistently outperforming prior baselines, confirming framework effectiveness.

Conclusion: Agent-based approach with integrated retrieval and reasoning outperforms fixed pipeline methods, enabling better adaptation to diverse question types and more aligned evidence retrieval.

Abstract: Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, or Caption) based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent's reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.
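The multi-step decision procedure above can be sketched as a simple control loop; the policy, tools, and their outputs here are toy stand-ins, not the paper's fine-tuned model:

```python
def run_agent(question, image, policy, tools, max_steps=6):
    """Multi-step loop: the agent picks one action per step until it answers."""
    state = {"question": question, "image": image, "evidence": []}
    for _ in range(max_steps):
        # action is one of: "Answer", "Image Retrieval", "Text Retrieval", "Caption"
        action = policy(state)
        if action == "Answer":
            break
        state["evidence"].append(tools[action](state))
    return state["evidence"]

# Toy scripted policy: caption once, retrieve text once, then answer.
script = iter(["Caption", "Text Retrieval", "Answer"])
policy = lambda state: next(script)
tools = {
    "Caption": lambda s: "a red bird on a branch",
    "Text Retrieval": lambda s: "the northern cardinal is a songbird",
    "Image Retrieval": lambda s: "similar image #42",
}
evidence = run_agent("What bird is this?", None, policy, tools)
print(evidence)
```

The trajectory `(state, action, evidence)` produced by such a loop is exactly the kind of record the paper's automated pipeline collects for fine-tuning supervision.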

[243] Bridging MRI and PET physiology: Untangling complementarity through orthogonal representations

Sonja Adomeit, Kartikay Tehlan, Lukas Förner, Katharina Weisser, Helen Scholtiseek, David Kaufmann, Julie Steinestel, Constantin Lapa, Thomas Kröncke, Thomas Wendler

Main category: cs.CV

TL;DR: Proposes orthogonal subspace decomposition for multimodal imaging (PSMA PET and MRI) to separate shared physiological information from modality-specific signals, using implicit neural representations and projection-based regularization.

Motivation: Current multimodal fusion approaches lack a clear distinction between shared and modality-specific information, which is clinically important for understanding each modality's irreducible contribution and guiding acquisition strategies.

Method: Uses implicit neural representation (INR) to map MRI features to PET uptake, with projection-based regularization via singular value decomposition to enforce orthogonality between MRI-explainable physiological envelope and orthogonal residual components.

Result: Tested on 13 prostate cancer patients, the model shows that residual components spanned by MRI features are absorbed into the learned envelope, while the orthogonal residual is largest in tumor regions, indicating PSMA PET contains signal not recoverable from MRI.

Conclusion: The framework provides structured characterization of modality complementarity based on representation geometry rather than image translation, clarifying what information is shared versus modality-specific in multimodal imaging.

Abstract: Multimodal imaging analysis often relies on joint latent representations, yet these approaches rarely define what information is shared versus modality-specific. Clarifying this distinction is clinically relevant, as it delineates the irreducible contribution of each modality and informs rational acquisition strategies. We propose a subspace decomposition framework that reframes multimodal fusion as a problem of orthogonal subspace separation rather than translation. We decompose Prostate-Specific Membrane Antigen (PSMA) PET uptake into an MRI-explainable physiological envelope and an orthogonal residual reflecting signal components not expressible within the MRI feature manifold. Using multiparametric MRI, we train an intensity-based, non-spatial implicit neural representation (INR) to map MRI feature vectors to PET uptake. We introduce a projection-based regularization using singular value decomposition to penalize residual components lying within the span of the MRI feature manifold. This enforces mathematical orthogonality between tissue-level physiological properties (structure, diffusion, perfusion) and intracellular PSMA expression. Tested on 13 prostate cancer patients, the model demonstrates that residual components spanned by MRI features are absorbed into the learned envelope, while the orthogonal residual is largest in tumour regions. This indicates that PSMA PET contains signal components not recoverable from MRI-derived physiological descriptors. The resulting decomposition provides a structured characterization of modality complementarity grounded in representation geometry rather than image translation.
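The projection-based regularization can be illustrated with plain linear algebra: take an orthonormal basis of the MRI feature span from the SVD and penalize the residual's projection onto it. A hedged numpy sketch (`span_penalty` and the toy data are illustrative, not the paper's INR-based implementation):

```python
import numpy as np

def span_penalty(residual, features):
    """Penalize the part of `residual` lying in the column span of `features`.

    U from the SVD gives an orthonormal basis of the span; the norm of the
    residual's projection onto it measures its non-orthogonal component.
    """
    U, _, _ = np.linalg.svd(features, full_matrices=False)
    projection = U @ (U.T @ residual)
    return float(np.linalg.norm(projection))

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 2))          # toy feature span: 2 directions in R^5
in_span = F @ np.array([1.0, -2.0])  # residual fully inside the span
U, _, _ = np.linalg.svd(F, full_matrices=False)
orth = rng.normal(size=5)
orth -= U @ (U.T @ orth)             # residual made orthogonal to the span
# The orthogonal residual incurs (numerically) zero penalty.
print(span_penalty(in_span, F), span_penalty(orth, F))
```

Driving this penalty to zero is what enforces the orthogonality between the MRI-explainable envelope and the residual.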

[244] DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification

Robert Zimmermann, Thomas Norrenbrock, Bodo Rosenhahn

Main category: cs.CV

TL;DR: DINO-QPM is a lightweight interpretability adapter that converts complex DINOv2 features into contrastive, class-independent representations for globally interpretable image classification with spatial localization capabilities.

Motivation: Visual foundation models like DINOv2 provide excellent feature extraction but create interpretability challenges due to their complex, high-dimensional representations. There's a need to make these powerful models more interpretable while maintaining their performance.

Method: DINO-QPM adapts the Quadratic Programming Enhanced Model (QPM) to operate on frozen DINO backbones. Instead of using the CLS token, it uses average-pooling to connect patch embeddings directly to model features, enabling spatial localization. A sparsity loss minimizes spatial scatter and background noise.

Result: DINO-QPM exceeds DINOv2 linear probe accuracy while providing interpretable features. It demonstrates superior performance in both classification accuracy and explanation quality compared to other methods for frozen visual foundation models.

Conclusion: DINO-QPM successfully bridges the gap between high-performance visual foundation models and interpretability, making QPM-level interpretability available as an adapter while improving classification accuracy over standard approaches.

Abstract: Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model's features and therefore enable spatial localisation of DINO-QPM's globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of the DINOv2 linear probe. Evaluated through an introduced Plausibility metric and other interpretability metrics, extensive experiments demonstrate that DINO-QPM is superior to other applicable methods for frozen visual foundation models in both classification accuracy and explanation quality.
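The average-pooling design can be sketched in a few lines: because mean pooling and a linear head commute, the pooled logits decompose exactly into per-patch contributions, which is what enables spatial localisation. An illustrative numpy sketch (not the paper's code):

```python
import numpy as np

def pooled_logits(patch_embeddings, W, b):
    """Average-pool patch embeddings (instead of the CLS token) and classify.

    patch_embeddings: (num_patches, dim) from a frozen backbone.
    Because the head is linear, each patch's contribution to a class logit
    can be read off and localized at that patch's spatial position.
    """
    pooled = patch_embeddings.mean(axis=0)      # (dim,)
    per_patch = patch_embeddings @ W + b        # per-patch logit contributions
    return pooled @ W + b, per_patch

rng = np.random.default_rng(1)
patches = rng.normal(size=(16, 8))   # 16 patches, 8-dim features
W = rng.normal(size=(8, 3))          # 3 classes
b = np.zeros(3)
logits, per_patch = pooled_logits(patches, W, b)
# Linearity: the pooled logits equal the mean of the per-patch logits.
print(np.allclose(logits, per_patch.mean(axis=0)))
```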

[245] Multiple Domain Generalization Using Category Information Independent of Domain Differences

Reiji Saito, Kazuhiro Hotta

Main category: cs.CV

TL;DR: A domain generalization method for medical image segmentation that separates domain-invariant category information from domain-specific information and uses quantized vectors in SQ-VAE to bridge domain gaps.

Motivation: Models trained on specific datasets often fail on different datasets due to domain differences caused by varying imaging equipment and staining methods. The goal is to achieve segmentation that doesn't depend on domain differences.

Method: Two initiatives: 1) Separate domain-invariant category information from source domain-specific information for learning segmentation targets; 2) Use quantized vectors in a Stochastically Quantized Variational AutoEncoder (SQ-VAE) to absorb domain gaps between training and test data.

Result: The method was evaluated on vascular segmentation and cell nucleus segmentation datasets, showing improved accuracy compared to conventional methods.

Conclusion: The proposed approach effectively addresses domain generalization challenges in medical image segmentation by combining domain-invariant feature separation with quantum vector-based domain gap absorption.

Abstract: Domain generalization is a technique aimed at enabling models to maintain high accuracy when applied to new environments or datasets (unseen domains) that differ from the datasets used in training. Generally, the accuracy of models trained on a specific dataset (source domain) often decreases significantly when evaluated on different datasets (target domain). This issue arises due to differences in domains caused by varying environmental conditions such as imaging equipment and staining methods. Therefore, we undertook two initiatives to perform segmentation that does not depend on domain differences. We propose a method that separates category information independent of domain differences from the information specific to the source domain. By using information independent of domain differences, our method enables learning the segmentation targets (e.g., blood vessels and cell nuclei). Although we extract information independent of domain differences, this cannot completely bridge the domain gap between training and test data. Therefore, we absorb the domain gap using the quantized vectors in a Stochastically Quantized Variational AutoEncoder (SQ-VAE). In experiments, we evaluated our method on datasets for vascular segmentation and cell nucleus segmentation. Our method improved accuracy compared to conventional methods.
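The intuition behind absorbing the domain gap with a codebook can be sketched with deterministic nearest-neighbour quantization. Note the hedge: SQ-VAE itself samples assignments stochastically; this toy example only shows how shifted target-domain features collapse onto shared codes:

```python
import numpy as np

def quantize(features, codebook):
    """Snap each feature vector to its nearest codebook vector."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
source = np.array([[0.1, -0.1], [0.9, 1.2]])
target = source + 0.15            # a small domain shift
_, src_idx = quantize(source, codebook)
_, tgt_idx = quantize(target, codebook)
# Both domains map to the same discrete codes, absorbing the shift.
print((src_idx == tgt_idx).all())
```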

[246] Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis

Kartikay Tehlan, Lukas Förner, Nico Schmutzenhofer, Michael Frühwald, Matthias Wagner, Nassir Navab, Thomas Wendler

Main category: cs.CV

TL;DR: A geometric framework for longitudinal multi-parametric MRI analysis using patient-specific energy manifolds learned from baseline scans, enabling tissue regime tracking without segmentation labels.

Motivation: To develop a method for longitudinal MRI analysis that doesn't rely on spatial networks or segmentation labels, but instead uses geometric representations of tissue regimes in multi-sequence intensity space.

Method: Represent each voxel by its multi-sequence intensity vector (T1, T1c, T2, FLAIR, ADC), train an implicit neural representation via denoising score matching to learn an energy function from a single baseline scan, then use this baseline energy manifold as a fixed geometric reference for evaluating subsequent scans.

Result: In a pediatric case with later recurrence, follow-up scans showed progressive deviation in energy and directional displacement toward baseline tumor-associated regime before clear radiological reappearance. In stable disease cases, voxel distributions remained confined to low-energy basins without systematic drift.

Conclusion: Patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for manifold-based tissue-at-risk tracking in neuro-oncology.

Abstract: We propose a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi-sequence intensity vector ($T1$, $T1c$, $T2$, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function $E_\theta(\mathbf{u})$ over $\mathbb{R}^d$ from a single baseline scan. The learned energy landscape provides a differential-geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow-up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow-up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour-associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low-energy basins without systematic drift. The presented cases serve as proof-of-concept that patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold-based tissue-at-risk tracking in neuro-oncology.
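The geometric readouts named in the abstract (energy, gradient magnitude, Laplacian curvature) can be computed by finite differences for any learned energy; in this illustrative sketch an analytic quadratic basin stands in for the trained energy function:

```python
import numpy as np

def grad_mag_and_laplacian(E, u, h=1e-4):
    """Finite-difference gradient magnitude and Laplacian of energy E at u."""
    d = len(u)
    grad = np.zeros(d)
    lap = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        grad[i] = (E(u + e) - E(u - e)) / (2 * h)          # central difference
        lap += (E(u + e) - 2 * E(u) + E(u - e)) / h**2     # second difference
    return np.linalg.norm(grad), lap

# Toy quadratic basin standing in for a learned tissue-regime energy in R^3.
center = np.array([1.0, 2.0, 0.5])
E = lambda u: 0.5 * np.sum((u - center) ** 2)
g_min, lap_min = grad_mag_and_laplacian(E, center)       # at the basin minimum
g_off, _ = grad_mag_and_laplacian(E, center + 1.0)       # away from the basin
# Gradient magnitude vanishes at the minimum and grows toward regime boundaries.
print(g_min, g_off, lap_min)
```

In the framework above, a voxel's intensity vector drifting from a low-energy basin toward higher energy and larger gradient magnitude is the signal of tissue-regime change.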

[247] TeaLeafVision: An Explainable and Robust Deep Learning Framework for Tea Leaf Disease Classification

Rafi Ahamed, Sidratul Moon Nafsin, Md Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Abu Raihan

Main category: cs.CV

TL;DR: Deep learning approach using CNN models (DenseNet201, MobileNetV2, InceptionV3) for tea leaf disease detection, achieving 99% accuracy with explainability techniques and real-world application prototype.

Motivation: Tea is a globally significant beverage and economic force, requiring precise disease detection for agricultural management. Current methods need improvement for real-world field conditions.

Method: Evaluated multiple CNN models (DenseNet201, MobileNetV2, InceptionV3) on teaLeafBD dataset containing 7 classes (6 diseases + healthy). Used Grad-CAM, occlusion sensitivity analysis, and adversarial training for interpretability and robustness.

Result: DenseNet201 achieved highest test accuracy of 99%. Models demonstrated capability to handle real-world field conditions. Developed prototype for practical agricultural application.

Conclusion: Deep learning models show strong potential for real-life tea leaf disease detection and management, with high accuracy and robustness through explainability techniques.

Abstract: As the world's second most consumed beverage after water, tea is not just a cultural staple but a global economic force of profound scale and influence. More than a mere drink, it represents a quiet negotiation between nature, culture, and the human desire for a moment of reflection. Precise identification and detection of tea leaf diseases is therefore crucial. With this goal, we evaluated several Convolutional Neural Network (CNN) models on the teaLeafBD dataset, among which three, DenseNet201, MobileNetV2, and InceptionV3, showed noticeable performance. The teaLeafBD dataset contains seven classes, six disease classes and one healthy class, collected under various field conditions reflecting real-world challenges. Among the CNN models, DenseNet201 achieved the highest test accuracy of 99%. To enhance model reliability and interpretability, we implemented Gradient-weighted Class Activation Mapping (Grad-CAM), occlusion sensitivity analysis, and adversarial training to increase the model's noise resistance. Finally, we developed a prototype to bring the model's capabilities to real-life agriculture. This paper illustrates the capability of deep learning models to classify diseases in real-life tea leaf disease detection and management.
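Of the interpretability techniques listed, occlusion sensitivity is the simplest to sketch: slide a masking patch over the image and record how much the class score drops. The classifier below is a toy stand-in, not the trained CNN:

```python
import numpy as np

def occlusion_map(image, score_fn, patch=4, fill=0.0):
    """Slide an occluding patch over the image and record the score drop.

    Larger drops mark regions the classifier relies on.
    """
    H, W = image.shape
    base = score_fn(image)
    heat = np.zeros((H // patch, W // patch))
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            heat[i // patch, j // patch] = base - score_fn(occluded)
    return heat

# Toy "lesion detector": score is the mean intensity of the top-left corner.
img = np.zeros((8, 8))
img[:4, :4] = 1.0
score_fn = lambda x: x[:4, :4].mean()
heat = occlusion_map(img, score_fn)
print(heat)  # only occluding the lesion region reduces the score
```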

[248] INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao

Main category: cs.CV

TL;DR: INSPATIO-WORLD: A real-time framework for recovering and generating high-fidelity dynamic interactive scenes from single reference videos using spatiotemporal autoregressive architecture with implicit cache and explicit constraints.

Motivation: Current video generation methods lack spatial persistence and visual realism needed for seamless navigation in complex environments, making it difficult to support interactive exploration of dynamic scenes.

Method: Proposes Spatiotemporal Autoregressive (STAR) architecture with two components: Implicit Spatiotemporal Cache for aggregating reference/historical observations into latent world representation, and Explicit Spatial Constraint Module for enforcing geometric structure and translating user interactions into physically plausible camera trajectories. Also introduces Joint Distribution Matching Distillation (JDMD) to overcome fidelity degradation from synthetic data reliance.

Result: Significantly outperforms existing SOTA models in spatial consistency and interaction precision, ranks first among real-time interactive methods on WorldScore-Dynamic benchmark, establishes practical pipeline for navigating 4D environments from monocular videos.

Conclusion: INSPATIO-WORLD provides an effective solution for real-time interactive scene generation with spatial consistency, enabling practical navigation of 4D environments reconstructed from single videos.

Abstract: Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.

[249] VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

Jian Yu, Fei Shen, Cong Wang, Yi Xin, Si Shen, Xiaoyu Du, Jinhui Tang

Main category: cs.CV

TL;DR: VersaVogue: A unified diffusion framework for multi-condition controllable fashion synthesis supporting both garment generation and virtual dressing with trait-routing attention and automated preference optimization.

Motivation: Prior fashion image generation methods treat garment generation and virtual dressing as separate problems, limiting real-world workflow flexibility. Existing multi-condition synthesis approaches suffer from attribute entanglement and semantic interference due to simple feature concatenation or static injection methods.

Method: Proposes VersaVogue with: 1) Trait-routing attention (TA) module using mixture-of-experts to dynamically route condition features to compatible experts and layers for disentangled attribute injection; 2) Automated multi-perspective preference optimization (MPO) pipeline that constructs preference data using evaluators for content fidelity, textual alignment, and perceptual quality, then optimizes via DPO.

Result: Extensive experiments on garment generation and virtual dressing benchmarks show VersaVogue consistently outperforms existing methods in visual fidelity, semantic consistency, and fine-grained controllability.

Conclusion: VersaVogue provides a unified framework for multi-condition controllable fashion synthesis that bridges design and showcase stages, offering improved flexibility and performance over existing approaches.

Abstract: Diffusion models have driven remarkable advancements in fashion image generation, yet prior works usually treat garment generation and virtual dressing as separate problems, limiting their flexibility in real-world fashion workflows. Moreover, fashion image synthesis under multi-source heterogeneous conditions remains challenging, as existing methods typically rely on simple feature concatenation or static layer-wise injection, which often causes attribute entanglement and semantic interference. To address these issues, we propose VersaVogue, a unified framework for multi-condition controllable fashion synthesis that jointly supports garment generation and virtual dressing, corresponding to the design and showcase stages of the fashion lifecycle. Specifically, we introduce a trait-routing attention (TA) module that leverages a mixture-of-experts mechanism to dynamically route condition features to the most compatible experts and generative layers, enabling disentangled injection of visual attributes such as texture, shape, and color. To further improve realism and controllability, we develop an automated multi-perspective preference optimization (MPO) pipeline that constructs preference data without human annotation or task-specific reward models. By combining evaluators of content fidelity, textual alignment, and perceptual quality, MPO identifies reliable preference pairs, which are then used to optimize the model via direct preference optimization (DPO). Extensive experiments on both garment generation and virtual dressing benchmarks demonstrate that VersaVogue consistently outperforms existing methods in visual fidelity, semantic consistency, and fine-grained controllability.
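The trait-routing idea can be illustrated with a minimal mixture-of-experts gate: score each expert against the condition feature and dispatch to the top-k. This is a toy sketch of the routing mechanism only; the paper's TA module operates inside attention layers of a diffusion model:

```python
import numpy as np

def route(condition_feature, gate_W, experts, top_k=1):
    """Soft-gate a condition feature to its most compatible expert(s)."""
    logits = gate_W @ condition_feature          # one score per expert
    top = np.argsort(logits)[-top_k:]            # indices of the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    return sum(w * experts[i](condition_feature) for w, i in zip(weights, top))

experts = [lambda x: x * 2.0, lambda x: x - 1.0]   # stand-in expert functions
gate_W = np.array([[1.0, 0.0], [0.0, 1.0]])
texture_feat = np.array([3.0, 0.0])                # the gate prefers expert 0
out = route(texture_feat, gate_W, experts)
print(out)
```

Routing different condition features (texture, shape, color) through different experts is what keeps the injected attributes disentangled.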

[250] PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

Ruihang Xu, Dewei Zhou, Xiaolong Shen, Fan Ma, Yi Yang

Main category: cs.CV

TL;DR: PhyEdit is a 3D-aware image editing framework that uses explicit geometric simulation to improve physical accuracy in object manipulation, addressing limitations of existing visual generative models.

Motivation: Existing visual generative models often fail at precise spatial manipulation of objects in images, resulting in incorrect scaling and positioning due to lack of explicit 3D geometry and perspective projection mechanisms.

Method: Developed PhyEdit framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance, combining plug-and-play 3D prior with joint 2D-3D supervision. Also created RealManip-10K dataset with paired images and depth annotations, and ManipEval benchmark for evaluation.

Result: Extensive experiments show PhyEdit outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.

Conclusion: The approach effectively improves physical accuracy and manipulation consistency in image editing by incorporating explicit 3D geometric simulation, with supporting datasets and benchmarks for evaluation.

Abstract: Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D–3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.
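The scaling errors the abstract describes follow directly from perspective projection: under a pinhole model, pushing an object from depth Z to Z' scales its projected size by Z/Z'. A minimal numpy illustration of that constraint (not PhyEdit's geometric simulator):

```python
import numpy as np

def project(points, f=1.0):
    """Pinhole projection of 3D points (X, Y, Z) onto the image plane."""
    return f * points[:, :2] / points[:, 2:3]

# A 1m-wide object edge at 2m depth, then pushed back to 4m depth.
edge = np.array([[-0.5, 0.0, 2.0], [0.5, 0.0, 2.0]])
near = project(edge)
far = project(edge + np.array([0.0, 0.0, 2.0]))
width = lambda p: p[1, 0] - p[0, 0]
print(width(near) / width(far))  # doubling the depth halves the projected width
```

An editor without this mechanism has no reason to shrink a repositioned object by exactly this factor, which is the failure mode PhyEdit's explicit 3D guidance addresses.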

[251] Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving

Yatong Lan, Rongkui Tang, Lei He

Main category: cs.CV

TL;DR: Geo-EVS is a geometry-conditioned framework for extrapolative novel view synthesis in autonomous driving that addresses degradation outside recorded trajectories by exposing models to out-of-trajectory condition defects during training.

Motivation: Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors, but existing methods degrade outside recorded trajectories due to weak geometric support and lack of dense target-view supervision.

Method: Two-component framework: 1) Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps; 2) Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support.

Result: On Waymo dataset, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings, and also improves downstream 3D detection performance.

Conclusion: Geo-EVS successfully addresses extrapolative novel view synthesis challenges by explicitly exposing models to out-of-trajectory condition defects during training, leading to improved synthesis quality and geometric accuracy for autonomous driving applications.

Abstract: Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors. Existing methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and no dense target-view supervision. The key is to explicitly expose the model to out-of-trajectory condition defects during training. We propose Geo-EVS, a geometry-conditioned framework under sparse supervision. Geo-EVS has two components. Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps. This design unifies the reprojection path between training and inference. Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support. For evaluation, we use a LiDAR-Projected Sparse-Reference (LPSR) protocol when dense extrapolated-view ground truth is unavailable. On Waymo, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings. It also improves downstream 3D detection.
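The reprojection-derived artifact masks can be illustrated by splatting a colored point cloud into a target pose and marking which pixels received geometric support. This is a toy nearest-pixel renderer with illustrative names; GAR itself builds on fine-tuned VGGT reconstructions:

```python
import numpy as np

def reproject(points, colors, R, t, K, H, W):
    """Render a colored point cloud at a target pose; unhit pixels stay invalid.

    The returned mask plays the role of an artifact mask: True where geometry
    provided support, False where the generator must recover structure.
    """
    cam = points @ R.T + t              # world -> target camera coordinates
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]         # perspective divide
    img = np.zeros((H, W, 3))
    mask = np.zeros((H, W), dtype=bool)
    for (u, v), c, z in zip(uv, colors, cam[:, 2]):
        ui, vi = int(round(u)), int(round(v))
        if 0 <= vi < H and 0 <= ui < W and z > 0:
            img[vi, ui] = c
            mask[vi, ui] = True
    return img, mask

K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
cols = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
img, mask = reproject(pts, cols, np.eye(3), np.zeros(3), K, 4, 4)
print(mask.sum())  # only two pixels have geometric support
```

For extrapolated target poses the unsupported region grows, and exposing the model to exactly these mask patterns during training is the core of the AGLD design.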

[252] Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

Icaro Re Depaolini, Uri Hasson

Main category: cs.CV

TL;DR: Deep neural networks can predict human authenticity judgments but produce inconsistent attribution maps across architectures, suggesting post-hoc explanations are weak evidence for cognitive mechanisms.

Motivation: To test whether models that predict human authenticity ratings produce consistent explanations within and across architectures, addressing concerns about the robustness and explanatory value of attribution heatmaps.

Method: Fitted lightweight regression heads to multiple frozen pretrained vision models (VGG, EfficientNetB3, Barlow Twins), generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking, and evaluated consistency across random seeds and architectures.

Result: Models predicted human authenticity ratings well (reaching about 80% of the noise ceiling), but attribution maps showed weak agreement across architectures even with similar predictive performance. VGG models tracked image quality rather than authenticity. Ensembles improved predictions and enabled image-level attribution.

Conclusion: Deep networks can predict human authenticity judgments but do not produce identifiable explanations for those judgments; post-hoc explanations from successful behavioral models should be treated as weak evidence for cognitive mechanisms.

Abstract: Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.
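Cross-architecture attribution agreement of the kind measured here can be quantified with a rank correlation between flattened heatmaps. A hedged numpy sketch (the maps below are synthetic stand-ins, not model attributions):

```python
import numpy as np

def heatmap_agreement(a, b):
    """Spearman rank correlation between two flattened attribution maps."""
    ra = np.argsort(np.argsort(a.ravel())).astype(float)   # ranks of a
    rb = np.argsort(np.argsort(b.ravel())).astype(float)   # ranks of b
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / (np.linalg.norm(ra) * np.linalg.norm(rb)))

rng = np.random.default_rng(0)
m1 = rng.random((7, 7))
m2 = m1 + 0.01 * rng.random((7, 7))   # near-identical "same architecture" map
m3 = rng.random((7, 7))               # unrelated "other architecture" map
# Near-duplicate maps agree far more strongly than unrelated ones.
print(heatmap_agreement(m1, m2), heatmap_agreement(m1, m3))
```

Low cross-architecture values of such a statistic, despite similar predictive accuracy, are what the paper reports as non-identifiability of explanations.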

[253] GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

Yiqian Wu, Rawal Khirodkar, Egor Zakharov, Timur Bagautdinov, Lei Xiao, Zhaoen Su, Shunsuke Saito, Xiaogang Jin, Junxuan Li

Main category: cs.CV

TL;DR: GenLCA is a diffusion-based generative model for creating photorealistic full-body avatars from text/image inputs, using a novel training approach with partial 2D video data and a visibility-aware diffusion strategy.

Motivation: Current methods for generating photorealistic full-body avatars struggle with scalability and photorealism due to limited 3D training data. Real-world videos provide abundant data but only show partial body observations, creating challenges for 3D model training.

Method: 1) Repurpose a pretrained avatar reconstruction model as a 3D tokenizer to encode video frames into structured 3D tokens; 2) Use visibility-aware diffusion training that replaces invalid regions with learnable tokens and computes losses only over valid regions; 3) Train a flow-based diffusion model on the token dataset to maintain photorealism and animatability.
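
The visibility-aware training step (replace invalid regions with learnable tokens, compute losses only over valid regions) reduces to straightforward masking. A minimal numpy sketch; shapes and function names are illustrative, not from the paper:

```python
import numpy as np

def visibility_aware_loss(pred, target, valid_mask):
    """Mean squared error computed only over valid (observed) token positions.

    pred, target: (num_tokens, dim) arrays of predicted / reference 3D tokens.
    valid_mask:   (num_tokens,) boolean array, True where the token was observed.
    """
    err = ((pred - target) ** 2).sum(axis=1)   # per-token squared error
    return err[valid_mask].mean()              # unobserved tokens carry no loss

def mask_invalid_tokens(tokens, valid_mask, learnable_token):
    """Replace unobserved tokens with a shared learnable placeholder."""
    out = tokens.copy()
    out[~valid_mask] = learnable_token
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
mask = np.array([True, True, False, True, False, True])
placeholder = np.zeros(4)

masked = mask_invalid_tokens(tokens, mask, placeholder)
loss = visibility_aware_loss(masked, tokens, mask)  # 0.0: valid positions untouched
```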

Result: GenLCA generates diverse and high-fidelity full-body avatars that are faithful to text/image inputs, supports facial and full-body animations, and significantly outperforms existing solutions in photorealism and generalizability.

Conclusion: The method successfully enables training 3D diffusion models using large-scale real-world video data despite partial observations, achieving superior avatar generation and editing capabilities through scalable data utilization.

Abstract: We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently maintaining the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.

[254] Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

Changkun Liu, Jiezhi Yang, Zeman Li, Yuan Deng, Jiancong Guo, Luca Ballan

Main category: cs.CV

TL;DR: Mem3R is a streaming 3D reconstruction model with hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency in long visual sequences.

Motivation: Existing recurrent models for streaming 3D perception suffer from drift accumulation and temporal forgetting over long sequences due to limited capacity of compressed latent memories, which is problematic for robotics and AR applications.

Method: Uses hybrid memory design: implicit fast-weight memory (lightweight MLP updated via Test-Time Training) for camera tracking, and explicit token-based fixed-size state for geometric mapping. Supports plug-and-play state update strategies.
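
The implicit fast-weight memory can be pictured as a tiny model whose weights are updated by gradient descent at inference time (the Test-Time Training idea). A toy numpy sketch with a linear memory standing in for the paper's MLP; all names and hyperparameters are illustrative:

```python
import numpy as np

class FastWeightMemory:
    """Tiny linear 'fast weight' memory updated at inference time.

    Each incoming chunk provides a self-supervised pair (x, y), and the fast
    weights take a few gradient steps on it before the next chunk arrives.
    Illustrative sketch, not the paper's architecture.
    """

    def __init__(self, dim, lr=0.2):
        self.W = np.zeros((dim, dim))
        self.lr = lr

    def update(self, x, y, steps=5):
        # gradient descent on 0.5 * ||x @ W - y||^2 / n
        for _ in range(steps):
            grad = x.T @ (x @ self.W - y) / len(x)
            self.W -= self.lr * grad

    def predict(self, x):
        return x @ self.W

rng = np.random.default_rng(0)
true_W = rng.normal(size=(3, 3))
mem = FastWeightMemory(dim=3)
for _ in range(300):                      # a stream of chunks
    x = rng.normal(size=(16, 3))
    mem.update(x, x @ true_W)

x_test = rng.normal(size=(8, 3))
err = np.abs(mem.predict(x_test) - x_test @ true_W).max()  # memory has adapted
```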

Result: Reduces model size from 793M to 644M parameters, decreases Absolute Trajectory Error by up to 39% on 500-1000 frame sequences, improves downstream tasks (video depth estimation, 3D reconstruction), while maintaining constant GPU memory usage and comparable inference throughput.

Conclusion: Mem3R’s hybrid memory design effectively addresses temporal consistency issues in streaming 3D perception, offering improved performance with reduced model complexity for long-sequence applications.

Abstract: Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution by maintaining fixed-size states and enabling linear-time inference, but they often suffer from drift accumulation and temporal forgetting over long sequences due to the limited capacity of compressed latent memories. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training. For geometric mapping, Mem3R maintains an explicit token-based fixed-size state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R supports existing improved plug-and-play state update strategies developed for CUT3R. Specifically, integrating it with TTT3R decreases Absolute Trajectory Error by up to 39% over the base implementation on 500 to 1000 frame sequences. The resulting improvements also extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput. Project page: https://lck666666.github.io/Mem3R/

[255] Are Face Embeddings Compatible Across Deep Neural Network Models?

Fizza Rubab, Yiying Tong, Arun Ross

Main category: cs.CV

TL;DR: Different DNN models (domain-specific and foundation models) encode facial identity in surprisingly compatible ways, with simple affine transformations enabling effective cross-model face recognition.

Motivation: To understand whether different DNN models trained on different datasets, loss functions, and architectures encode facial identity in similar ways, despite their diverse training regimes.

Method: Analyze geometric structure of embedding spaces by treating face embeddings as point clouds and studying whether simple affine transformations can align face representations across different models.
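
Fitting such an alignment is a plain least-squares problem: given embeddings of the same faces from two models, solve for a matrix and bias mapping one space onto the other. A numpy sketch on synthetic embeddings (dimensions and data are illustrative):

```python
import numpy as np

def fit_affine_alignment(src, dst):
    """Least-squares affine map (W, b) such that src @ W + b ≈ dst.

    src, dst: (n, d) embeddings of the SAME faces from two different models.
    Illustrative stand-in for the paper's low-capacity linear alignment.
    """
    A = np.hstack([src, np.ones((len(src), 1))])   # append bias column
    sol, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return sol[:-1], sol[-1]                       # W: (d, d), b: (d,)

rng = np.random.default_rng(1)
emb_a = rng.normal(size=(100, 8))                  # model A embeddings
true_W = rng.normal(size=(8, 8))
true_b = rng.normal(size=8)
emb_b = emb_a @ true_W + true_b                    # model B = affine(model A)

W, b = fit_affine_alignment(emb_a, emb_b)
residual = np.abs(emb_a @ W + b - emb_b).max()     # near zero: map recovered
```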

Result: Low-capacity linear mappings substantially improve cross-model face recognition over unaligned baselines for both identification and verification tasks. Alignment patterns generalize across datasets and vary systematically across model families.

Conclusion: Different DNN models show representational convergence in facial identity encoding, with implications for model interoperability, ensemble design, and biometric template security.

Abstract: Automated face recognition has made rapid strides over the past decade due to the unprecedented rise of deep neural network (DNN) models that can be trained for domain-specific tasks. At the same time, foundation models that are pretrained on broad vision or vision-language tasks have shown impressive generalization across diverse domains, including biometrics. This raises an important question: Do different DNN models–both domain-specific and foundation models–encode facial identity in similar ways, despite being trained on different datasets, loss functions, and architectures? In this regard, we directly analyze the geometric structure of embedding spaces imputed by different DNN models. Treating embeddings of face images as point clouds, we study whether simple affine transformations can align face representations of one model with another. Our findings reveal surprising cross-model compatibility: low-capacity linear mappings substantially improve cross-model face recognition over unaligned baselines for both face identification and verification tasks. Alignment patterns generalize across datasets and vary systematically across model families, indicating representational convergence in facial identity encoding. These findings have implications for model interoperability, ensemble design, and biometric template security.

[256] Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment

Huaiyuan Qin, Muli Yang, Gabriel James Goenawan, Kai Wang, Zheng Wang, Peng Hu, Xi Peng, Hongyuan Zhu

Main category: cs.CV

TL;DR: AlignPrune is a noise-robust dynamic data pruning module that uses loss trajectory-based Dynamic Alignment Score to better identify noisy samples, improving pruning effectiveness under noisy-label settings.

Motivation: Existing dynamic data pruning methods fail under noisy-label settings because they rely on per-sample loss as ranking criterion, which can mistakenly preserve noisy samples due to their high loss values, leading to significant performance degradation.

Method: AlignPrune introduces Dynamic Alignment Score (DAS), a loss-trajectory-based criterion that enables more accurate identification of noisy samples. It’s designed as a plug-and-play module that can be integrated into existing dynamic pruning frameworks without modifying model architecture or training pipeline.
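
One way such a loss-trajectory criterion could be realized (a hypothetical stand-in; the paper's exact DAS definition may differ) is to score each sample by how well its per-epoch loss changes align with the dataset-mean change: a clean sample whose loss decays like the average scores high, while a mislabeled sample with a flat, persistently high loss scores low:

```python
import numpy as np

def trajectory_alignment_scores(loss_traj):
    """Cosine similarity between each sample's per-epoch loss changes and the
    dataset-mean change. Hypothetical stand-in for the paper's DAS.

    loss_traj: (n_samples, n_epochs) per-sample loss recorded over training.
    """
    delta = np.diff(loss_traj, axis=1)             # per-epoch loss change
    mean_delta = delta.mean(axis=0)
    den = np.linalg.norm(delta, axis=1) * np.linalg.norm(mean_delta) + 1e-12
    return (delta @ mean_delta) / den

epochs = np.arange(10.0)
clean = np.exp(-0.5 * epochs)                      # loss decays on clean samples
noisy = 2.0 + 0.01 * epochs                        # stays high on a mislabeled one
traj = np.vstack([clean, 1.2 * clean, noisy])

scores = trajectory_alignment_scores(traj)         # clean high, noisy low
```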

Result: Extensive experiments on five benchmarks across various noise types and pruning ratios show AlignPrune consistently outperforms state-of-the-art baselines, boosting accuracy by up to 6.3%.

Conclusion: AlignPrune offers a generalizable solution for pruning under noisy data, encouraging further exploration of learning in real-world scenarios with label noise.

Abstract: Existing dynamic data pruning methods often fail under noisy-label settings, as they typically rely on per-sample loss as the ranking criterion. This could mistakenly lead to preserving noisy samples due to their high loss values, resulting in significant performance drop. To address this, we propose AlignPrune, a noise-robust module designed to enhance the reliability of dynamic pruning under label noise. Specifically, AlignPrune introduces the Dynamic Alignment Score (DAS), which is a loss-trajectory-based criterion that enables more accurate identification of noisy samples, thereby improving pruning effectiveness. As a simple yet effective plug-and-play module, AlignPrune can be seamlessly integrated into state-of-the-art dynamic pruning frameworks, consistently outperforming them without modifying either the model architecture or the training pipeline. Extensive experiments on five widely-used benchmarks across various noise types and pruning ratios demonstrate the effectiveness of AlignPrune, boosting accuracy by up to 6.3% over state-of-the-art baselines. Our results offer a generalizable solution for pruning under noisy data, encouraging further exploration of learning in real-world scenarios. Code is available at: https://github.com/leonqin430/AlignPrune.

[257] Distilling Photon-Counting CT into Routine Chest CT through Clinically Validated Degradation Modeling

Junqi Liu, Xinze Zhou, Wenxuan Li, Scott Ye, Arkadiusz Sitek, Xiaofeng Yang, Yucheng Tang, Daguang Xu, Kai Ding, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou

Main category: cs.CV

TL;DR: SUMI is a simulated degradation-to-enhancement method that transforms low-quality energy-integrating CT (EICT) into photon-counting CT (PCCT)-like quality by learning to reverse realistic acquisition artifacts, enabling PCCT benefits without requiring widespread PCCT deployment.

Motivation: Photon-counting CT provides superior image quality but has limited clinical availability, restricting large-scale research and deployment. The goal is to bridge this gap by enhancing routine EICT scans to PCCT-like quality using limited high-quality PCCT scans as reference.

Method: SUMI explicitly models realistic acquisition degradations to transform PCCT into clinically plausible lower-quality counterparts, then learns to invert this process. The approach uses a latent diffusion model trained on PCCTs, with an autoencoder pre-trained on both PCCTs and EICTs to extract general CT latent features. The method creates simulated degradations validated by radiologists for clinical realism.

Result: SUMI outperforms state-of-the-art image translation methods by 15% in SSIM and 20% in PSNR, improves radiologist-rated clinical utility, and enhances downstream lesion detection performance (increasing sensitivity by up to 15% and F1 score by up to 10%). The work also produces a large dataset of 17,316 enhanced EICTs with radiologist-validated annotations.

Conclusion: Emerging imaging advances like PCCT can be systematically distilled into routine EICT using limited high-quality scans as reference, enabling superior image quality without requiring widespread deployment of new hardware.

Abstract: Photon-counting CT (PCCT) provides superior image quality with higher spatial resolution and lower noise compared to conventional energy-integrating CT (EICT), but its limited clinical availability restricts large-scale research and clinical deployment. To bridge this gap, we propose SUMI, a simulated degradation-to-enhancement method that learns to reverse realistic acquisition artifacts in low-quality EICT by leveraging high-quality PCCT as reference. Our central insight is to explicitly model realistic acquisition degradations, transforming PCCT into clinically plausible lower-quality counterparts and learning to invert this process. The simulated degradations were validated for clinical realism by board-certified radiologists, enabling faithful supervision without requiring paired acquisitions at scale. As outcomes of this technical contribution, we: (1) train a latent diffusion model on 1,046 PCCTs, using an autoencoder first pre-trained on both these PCCTs and 405,379 EICTs from 145 hospitals to extract general CT latent features that we release for reuse in other generative medical imaging tasks; (2) construct a large-scale dataset of over 17,316 publicly available EICTs enhanced to PCCT-like quality, with radiologist-validated voxel-wise annotations of airway trees, arteries, veins, lungs, and lobes; and (3) demonstrate substantial improvements: across external data, SUMI outperforms state-of-the-art image translation methods by 15% in SSIM and 20% in PSNR, improves radiologist-rated clinical utility in reader studies, and enhances downstream top-ranking lesion detection performance, increasing sensitivity by up to 15% and F1 score by up to 10%. Our results suggest that emerging imaging advances can be systematically distilled into routine EICT using limited high-quality scans as reference.

[258] From Blobs to Spokes: High-Fidelity Surface Reconstruction via Oriented Gaussians

Diego Gomez, Antoine Guédon, Nissim Maruani, Bingchen Gong, Maks Ovsjanikov

Main category: cs.CV

TL;DR: Gaussian Wrapping introduces a principled occupancy field for 3D Gaussian Splatting to extract accurate watertight meshes, addressing 3DGS’s fundamental difficulty with surface extraction through learnable oriented normals and novel consistency losses.

Motivation: 3D Gaussian Splatting (3DGS) revolutionized fast novel view synthesis but lacks a global geometric field, making surface extraction fundamentally difficult. Unlike implicit methods using Signed Distance Fields or occupancy, 3DGS forces existing approaches to use heuristics like TSDF fusion of blended depth maps.

Method: Introduces learnable oriented normals at each Gaussian element and an adapted attenuation formulation, leading to closed-form expressions for normal and occupancy fields. Includes novel consistency loss and dedicated densification strategy to enforce Gaussians to wrap surfaces by closing geometric holes. Modifies differentiable rasterizer to output depth as an isosurface of continuous model, and introduces Primal Adaptive Meshing for Region-of-Interest meshing at arbitrary resolution.
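
The general idea of an occupancy field over a Gaussian mixture can be sketched with the standard alpha-compositing identity; note this generic form omits the paper's oriented normals and adapted attenuation:

```python
import numpy as np

def occupancy(x, means, inv_covs, alphas):
    """Occupancy at point x from a set of Gaussians via the compositing
    identity O(x) = 1 - prod_i (1 - alpha_i * g_i(x)).

    Generic volumetric sketch only; the paper's closed form additionally
    uses learnable oriented normals and an adapted attenuation.
    """
    free = 1.0
    for mu, inv_cov, a in zip(means, inv_covs, alphas):
        d = x - mu
        free *= 1.0 - a * np.exp(-0.5 * d @ inv_cov @ d)
    return 1.0 - free

means = [np.zeros(3), np.array([2.0, 0.0, 0.0])]
inv_covs = [np.eye(3), np.eye(3)]
alphas = [0.9, 0.9]

inside = occupancy(np.zeros(3), means, inv_covs, alphas)                     # ~0.91
outside = occupancy(np.array([10.0, 10.0, 10.0]), means, inv_covs, alphas)   # ~0
```

An isosurface of this field (e.g. O(x) = 0.5) is what a mesh extractor would chase.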

Result: Sets new state-of-the-art on DTU and Tanks and Temples benchmarks, producing complete, watertight meshes at a fraction of the size of concurrent work, recovering thin structures like bicycle spokes.

Conclusion: Gaussian Wrapping provides a principled solution for surface extraction from 3DGS, addressing fundamental limitations through occupancy field formulation and achieving superior mesh quality while exposing biases in standard surface evaluation protocols.

Abstract: 3D Gaussian Splatting (3DGS) has revolutionized fast novel view synthesis, yet its opacity-based formulation makes surface extraction fundamentally difficult. Unlike implicit methods built on Signed Distance Fields or occupancy, 3DGS lacks a global geometric field, forcing existing approaches to resort to heuristics such as TSDF fusion of blended depth maps. Inspired by the Objects as Volumes framework, we derive a principled occupancy field for Gaussian Splatting and show how it can be used to extract highly accurate watertight meshes of complex scenes. Our key contribution is to introduce a learnable oriented normal at each Gaussian element and to define an adapted attenuation formulation, which leads to closed-form expressions for both the normal and occupancy fields at arbitrary locations in space. We further introduce a novel consistency loss and a dedicated densification strategy to enforce Gaussians to wrap the entire surface by closing geometric holes, ensuring a complete shell of oriented primitives. We modify the differentiable rasterizer to output depth as an isosurface of our continuous model, and introduce Primal Adaptive Meshing for Region-of-Interest meshing at arbitrary resolution. We additionally expose fundamental biases in standard surface evaluation protocols and propose two more rigorous alternatives. Overall, our method Gaussian Wrapping sets a new state-of-the-art on DTU and Tanks and Temples, producing complete, watertight meshes at a fraction of the size of concurrent work, recovering thin structures such as the notoriously elusive bicycle spokes.

[259] TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

Teng Li, Ziyuan Huang, Cong Chen, Yangfu Li, Yuanhuiyi Lyu, Dandan Zheng, Chunhua Shen, Jun Zhang

Main category: cs.CV

TL;DR: TC-AE is a ViT-based deep compression autoencoder that improves reconstruction and generative performance by addressing token space issues through token number scaling and semantic structure enhancement.

Motivation: Existing deep compression methods increase channel numbers in latent representations to maintain quality under high compression ratios, but this leads to latent representation collapse that degrades generative performance. The paper aims to address this challenge from the token space perspective rather than using complex architectures or multi-stage training.

Method: TC-AE introduces two complementary innovations: 1) Token number scaling by adjusting patch size in ViT under fixed latent budget, decomposing token-to-latent compression into two stages to reduce structural information loss; 2) Enhancing semantic structure of image tokens via joint self-supervised training to create more generative-friendly latents and mitigate collapse.
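
The token-number-scaling trade-off is simple arithmetic: with non-overlapping patches, halving the patch size quadruples the token count, so a fixed latent budget leaves fewer channels per token, which is the aggressive token-to-latent compression the paper identifies. Illustrative numbers only:

```python
def token_count(image_size, patch_size):
    """Number of ViT tokens for a square image with non-overlapping patches."""
    return (image_size // patch_size) ** 2

def latent_dim_per_token(total_latent_budget, n_tokens):
    """Channels left per token under a fixed total latent budget (illustrative)."""
    return total_latent_budget // n_tokens

budget = 256 * 32  # e.g. 256 tokens x 32 channels
rows = [(p, token_count(512, p), latent_dim_per_token(budget, token_count(512, p)))
        for p in (32, 16, 8)]
# rows == [(32, 256, 32), (16, 1024, 8), (8, 4096, 2)]
# smaller patches -> many more tokens -> far fewer channels each
```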

Result: TC-AE achieves substantially improved reconstruction and generative performance under deep compression compared to existing methods.

Conclusion: The research advances ViT-based tokenizers for visual generation by addressing fundamental token space challenges in deep compression autoencoders.

Abstract: We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations: Firstly, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Secondly, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.

[260] MoRight: Motion Control Done Right

Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta, Shenlong Wang, Sanja Fidler, Jun Gao

Main category: cs.CV

TL;DR: MoRight is a framework for generating motion-controlled videos with disentangled camera/object control and motion causality modeling, enabling forward/inverse reasoning about object interactions.

Motivation: Existing methods fail to provide disentangled motion control (camera vs object motion) and lack motion causality modeling, treating motion as kinematic displacement without capturing causal relationships between object interactions.

Method: Uses disentangled motion modeling with object motion specified in canonical static view and transferred to arbitrary viewpoints via temporal cross-view attention. Decomposes motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data.

Result: State-of-the-art performance on three benchmarks in generation quality, motion controllability, and interaction awareness. Enables both forward reasoning (predict consequences from active motion) and inverse reasoning (recover driving actions from desired outcomes).

Conclusion: MoRight addresses key limitations in motion-controlled video generation by providing disentangled camera/object control and modeling motion causality, enabling more physically plausible and controllable scene dynamics.

Abstract: Generating motion-controlled videos–where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints–demands two capabilities: (1) disentangled motion control, allowing users to separately control the object motion and adjust camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal and treat motion as kinematic displacement without modeling causal relationships between object motion. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static-view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data. At inference, users can either supply active motion and MoRight predicts consequences (forward reasoning), or specify desired passive outcomes and MoRight recovers plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.

[261] Fast Spatial Memory with Elastic Test-Time Training

Ziqiao Ma, Xueyang Yu, Haoyu Zhen, Yuncong Yang, Joyce Chai, Chuang Gan

Main category: cs.CV

TL;DR: Elastic Test-Time Training (ETTT) stabilizes LaCT with elastic weight consolidation, enabling multi-chunk adaptation for long-sequence 4D reconstruction via Fast Spatial Memory (FSM).

Motivation: LaCT suffers from catastrophic forgetting and overfitting in fully plastic inference-time updates, limiting it to single large chunks and preventing handling of arbitrarily long sequences in a single pass.

Method: Proposes Elastic Test-Time Training with Fisher-weighted elastic prior around an evolving anchor state (EMA of past fast weights), and introduces Fast Spatial Memory (FSM) for efficient 4D reconstruction from long observation sequences.
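
The elastic prior amounts to adding a Fisher-weighted quadratic penalty pulling the fast weights toward an anchor, while the anchor itself tracks the fast weights by exponential moving average. A numpy sketch of one update (learning rate, penalty weight, and decay are illustrative, not the paper's exact rule):

```python
import numpy as np

def elastic_ttt_step(w, grad_task, fisher, anchor, lr=0.1, lam=1.0):
    """One fast-weight update with a Fisher-weighted elastic prior.

    Total gradient = task gradient + lam * F * (w - anchor), so the update
    is pulled back toward the anchor, most strongly on high-Fisher weights.
    Illustrative sketch of the idea only.
    """
    grad = grad_task + lam * fisher * (w - anchor)
    return w - lr * grad

def update_anchor(anchor, w, decay=0.99):
    """The anchor evolves as an exponential moving average of the fast weights."""
    return decay * anchor + (1.0 - decay) * w

w = np.array([1.0, 1.0])
anchor = np.array([0.0, 0.0])
fisher = np.array([10.0, 0.0])   # first weight mattered for past chunks

# With zero task gradient, only the elastic prior acts: the high-Fisher
# weight is pulled toward the anchor, the zero-Fisher weight stays free.
w_new = elastic_ttt_step(w, np.zeros(2), fisher, anchor)
```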

Result: FSM supports fast adaptation over long sequences with smaller chunks, delivers high-quality 3D/4D reconstruction, and mitigates camera-interpolation shortcuts while alleviating activation-memory bottlenecks.

Conclusion: Advances LaCT beyond bounded single-chunk setting toward robust multi-chunk adaptation for genuinely longer sequences, enabling scalable spatiotemporal representation learning.

Abstract: Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pass. We propose Elastic Test-Time Training inspired by elastic weight consolidation, that stabilizes LaCT fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights to balance stability and plasticity. Based on this updated architecture, we introduce Fast Spatial Memory (FSM), an efficient and scalable model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. We pre-trained FSM on large-scale curated 3D/4D data to capture the dynamics and semantics of complex spatial environments. Extensive experiments show that FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks while mitigating the camera-interpolation shortcut. Overall, we hope to advance LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation, a necessary step for generalization to genuinely longer sequences, while substantially alleviating the activation-memory bottleneck.

[262] A Robust 3D Registration Method via Simultaneous Inlier Identification and Model Estimation

Xianyun Qian, Fei Wen, Peilin Liu

Main category: cs.CV

TL;DR: A robust 3D registration method using truncated-loss formulation for simultaneous inlier identification and model estimation, with alternating minimization and semidefinite relaxation for optimization.

Motivation: Existing robust 3D registration methods have limitations: maximum consensus estimators separate inlier identification from transformation estimation, while M-estimators directly optimize robust objectives. There’s a need for methods that simultaneously handle inlier identification and model estimation while incorporating residual magnitudes.

Method: Proposes a truncated-loss based formulation for simultaneous inlier identification and model estimation (SIME). Develops an alternating minimization algorithm and further enhances it with semidefinite relaxation to handle binary inlier variables. Instantiates the framework for 3D rotation search and rigid point-set registration using quaternion-based formulations.
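
The alternating-minimization idea behind a truncated loss can be shown on a 1D stand-in: alternate between selecting inliers whose squared residual is below the truncation threshold and refitting by least squares on the selected set. (The paper applies this to rotations and rigid transforms, with an SDR-embedded variant; the line fit below is only an illustration.)

```python
import numpy as np

def truncated_loss_fit(x, y, tau=0.5, iters=10):
    """Alternating minimization for a truncated-loss linear fit.

    Alternates (1) inlier selection: keep points with residual^2 < tau,
    and (2) least-squares refit on the selected inliers. A 1D stand-in
    for the SIME formulation on rotations / rigid transforms.
    """
    A = np.vstack([x, np.ones_like(x)]).T
    theta = np.linalg.lstsq(A, y, rcond=None)[0]   # initial fit on all points
    for _ in range(iters):
        residual = A @ theta - y
        inliers = residual**2 < tau                # step 1: identify inliers
        theta = np.linalg.lstsq(A[inliers], y[inliers], rcond=None)[0]  # step 2
    return theta, inliers

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=50)     # true line: slope 2, intercept 1
y[:10] += 5.0                                      # 20% gross outliers

theta, inliers = truncated_loss_fit(x, y)          # recovers ~(2, 1), rejects outliers
```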

Result: Experimental results on simulated and real-world registration tasks show the proposed methods outperform strong baseline solvers, especially in challenging cases with high noise levels and many outliers.

Conclusion: The SIME framework provides an effective approach for robust 3D registration that simultaneously handles inlier identification and model estimation, achieving better performance than existing methods in challenging scenarios.

Abstract: Robust 3D registration is a fundamental problem in computer vision and robotics, where the goal is to estimate the geometric transformation between two sets of measurements in the presence of noise, mismatches, and extreme outlier contamination. Existing robust registration methods are mainly built on either maximum consensus (MC) estimators, which first identify inliers and then estimate the transformation, or M-estimators, which directly optimize a robust objective. In this work, we revisit a truncated-loss based formulation for simultaneous inlier identification and model estimation (SIME) and study it in the context of 3D registration. We show that, compared with MC-based robust fitting, SIME can achieve a lower fitting residual because it incorporates residual magnitudes into the inlier selection process. To solve the resulting nonconvex problem, we develop an alternating minimization (AM) algorithm, and further propose an AM method embedded with semidefinite relaxation (SDR) to alleviate the difficulty caused by the binary inlier variables. We instantiate the proposed framework for 3D rotation search and rigid point-set registration using quaternion-based formulations. Experimental results on both simulated and real-world registration tasks demonstrate that the proposed methods compare favorably with strong baseline solvers, especially in challenging cases with high noise levels and many outliers.

[263] DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion Models

Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Qian Yu, Dong Xu

Main category: cs.CV

TL;DR: DiffSketcher generates vectorized free-hand sketches from text prompts using diffusion models to guide Bézier curve optimization via Score Distillation Sampling.

Motivation: While text-to-image diffusion models excel at generating raster images, they lack native support for vector graphics. The authors aim to bridge this gap by leveraging diffusion priors to guide vector sketch synthesis, enabling controllable, high-quality sketch generation from text.

Method: The method uses an extended Score Distillation Sampling (SDS) loss to optimize Bézier curves based on guidance from pre-trained text-to-image diffusion models. A stroke initialization strategy driven by the diffusion model’s attention maps accelerates generation by providing better starting points.
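
The parametric vector generator bottoms out in Bézier curve evaluation: the control points are the differentiable parameters that an SDS-style loss would push around. A minimal cubic-Bézier sketch (control points are illustrative):

```python
import numpy as np

def cubic_bezier(control_points, t):
    """Evaluate a cubic Bézier curve at parameter t in [0, 1].

    control_points: (4, 2) array; these are the differentiable parameters
    an SDS-style loss would optimize (illustrative sketch only).
    """
    p0, p1, p2, p3 = control_points
    u = 1.0 - t
    return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
start = cubic_bezier(pts, 0.0)   # curve starts at p0
mid = cubic_bezier(pts, 0.5)     # [0.5, 0.75] for these control points
end = cubic_bezier(pts, 1.0)     # curve ends at p3
```

Rendering samples many t values per stroke; gradients from the rendered raster flow back into the control points.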

Result: DiffSketcher produces sketches across varying abstraction levels while maintaining structural integrity and visual details. Experiments show superior perceptual quality and controllability over existing methods.

Conclusion: The work demonstrates that raster-trained diffusion models can effectively guide vector sketch synthesis, opening possibilities for text-to-vector generation with diffusion priors.

Abstract: We demonstrate that pre-trained text-to-image diffusion models, despite being trained on raster images, possess a remarkable capacity to guide vector sketch synthesis. In this paper, we introduce DiffSketcher, a novel algorithm for generating vectorized free-hand sketches directly from natural language prompts. Our method optimizes a set of Bézier curves via an extended Score Distillation Sampling (SDS) loss, successfully bridging a raster-level diffusion prior with a parametric vector generator. To further accelerate the generation process, we propose a stroke initialization strategy driven by the diffusion model’s intrinsic attention maps. Results show that DiffSketcher produces sketches across varying levels of abstraction while maintaining the structural integrity and essential visual details of the subject. Experiments confirm that our approach yields superior perceptual quality and controllability over existing methods. The code and demo are available at https://ximinng.github.io/DiffSketcher-project/

[264] Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval

Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki

Main category: cs.CV

TL;DR: A novel cross-domain image retrieval method using image captions as domain-agnostic intermediate representations, leveraging pre-trained vision-language models without requiring labeled data or training.

Motivation: Existing cross-domain image retrieval methods struggle with substantial domain gaps and limited generalization to unseen domains, often requiring supervised learning with labeled correspondences or training/fine-tuning on target datasets.

Method: Caption-Matching (CM) approach uses generated image captions as domain-agnostic intermediate representations, enabling cross-domain similarity computation without labeled data or further training by leveraging pre-trained vision-language models.
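The caption-as-intermediate idea can be illustrated in a few lines: rank a gallery by the similarity of its captions to the query's caption, so the visual domain gap never enters the comparison. The sketch below uses a bag-of-words stand-in for a real text encoder; `bow_vector`, `caption_match`, and the example captions are illustrative assumptions, not the paper's implementation (which relies on pre-trained vision-language models).

```python
from collections import Counter
import math

def bow_vector(text):
    # Toy bag-of-words "embedding"; a real pipeline would encode the
    # VLM-generated caption with a pretrained text encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(v * b[w] for w, v in a.items())  # Counter returns 0 for missing words
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def caption_match(query_caption, gallery_captions):
    # Rank gallery items by caption similarity; captions act as the
    # domain-agnostic intermediate representation.
    qv = bow_vector(query_caption)
    scores = [(i, cosine(qv, bow_vector(c))) for i, c in enumerate(gallery_captions)]
    return sorted(scores, key=lambda s: -s[1])

# A sketch-domain query retrieving from a photo/painting gallery:
gallery = [
    "a photo of a brown dog running on grass",
    "an oil painting of a sailboat at sunset",
]
ranking = caption_match("a pencil sketch of a dog running", gallery)
```

Because both query and gallery are reduced to text before matching, no labeled correspondences or fine-tuning are needed.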

Result: State-of-the-art performance on standard CDIR benchmarks (Office-Home and DomainNet) in plug-and-play settings, with demonstrated effectiveness on AI-generated images from Midjourney for complex multi-domain queries.

Conclusion: Textual context via image captions provides effective domain-agnostic representations for cross-domain image retrieval, enabling strong generalization without requiring labeled data or model training.

Abstract: Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Existing CDIR methods rely either on supervised learning with labeled cross-domain correspondences or on methods that require training or fine-tuning on target datasets, often struggling with substantial domain gaps and limited generalization to unseen domains. This paper introduces a novel CDIR approach that incorporates textual context by leveraging publicly available pre-trained vision-language models. Our method, Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or further training. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in plug-and-play settings with consistent improvements on Office-Home and DomainNet over previous methods. We also demonstrate our method’s effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.

[265] Learning Spatial-Preserving Hierarchical Representations for Digital Pathology

Weiyi Wu, Xingjian Diao, Chunhui Zhang, Chongyang Gao, Xinwen Xu, Siting Li, Jiang Gui

Main category: cs.CV

TL;DR: SPAN is a hierarchical framework for whole slide image analysis that preserves spatial context through sparse pyramid attention networks, enabling efficient processing of gigapixel medical images.

Motivation: Whole slide images present computational challenges due to their gigapixel resolution and sparse informative regions. Existing methods either treat patches independently or distort spatial context, failing to capture the hierarchical pyramid representations inherent in WSIs.

Method: SPAN constructs multi-scale representations from single-scale inputs using hierarchical attention mechanisms that preserve spatial relationships while allocating computation to informative regions. Two variants: SPAN-MIL for slide classification and SPAN-UNet for segmentation.

Result: Comprehensive evaluations across multiple public datasets show SPAN effectively captures hierarchical structure and contextual relationships, enhancing both slide-level and patch-level performance in computational pathology tasks.

Conclusion: SPAN addresses key computational challenges in WSI analysis, provides an effective framework for computational pathology, and demonstrates important design principles for large-scale medical image analysis through architectural inductive biases and hierarchical representations.

Abstract: Whole slide images (WSIs) pose fundamental computational challenges due to their gigapixel resolution and the sparse distribution of informative regions. Existing approaches often treat image patches independently or reshape them in ways that distort spatial context, thereby obscuring the hierarchical pyramid representations intrinsic to WSIs. We introduce Sparse Pyramid Attention Networks (SPAN), a hierarchical framework that preserves spatial relationships while allocating computation to informative regions. SPAN constructs multi-scale representations directly from single-scale inputs, enabling precise hierarchical modeling of WSI data. We demonstrate SPAN’s versatility through two variants: SPAN-MIL for slide classification and SPAN-UNet for segmentation. Comprehensive evaluations across multiple public datasets show that SPAN effectively captures hierarchical structure and contextual relationships. Our results provide clear evidence that architectural inductive biases and hierarchical representations enhance both slide-level and patch-level performance. By addressing key computational challenges in WSI analysis, SPAN provides an effective framework for computational pathology and demonstrates important design principles for large-scale medical image analysis.

[266] MSG Score: Automated Video Verification for Reliable Multi-Scene Generation

Daewon Yoon, Hyeongseok Lee, Wonsik Shin, Sangyu Han, Nojun Kwak

Main category: cs.CV

TL;DR: Proposes a scalable automated verification framework for long-form video generation using hierarchical attention-based metrics and distillation techniques to address evaluation bottlenecks.

Motivation: Text-to-video diffusion models struggle with coherent long-form content due to stochastic artifacts, creating bottlenecks in verification. Manual review is unscalable, and existing automated metrics lack adaptability and speed for runtime monitoring, with a trade-off between evaluation quality and performance.

Method: 1) MSG (Multi-Scene Generation) score: hierarchical attention-based metric for narrative and visual consistency; 2) CGS (Candidate Generation and Selection) framework: automated identification and filtering of high-quality outputs; 3) Implicit Insight Distillation (IID): distills complex metric insights into lightweight student model to balance reliability and speed.

Result: The approach offers the first comprehensive solution for reliable and scalable long-form video production, addressing both evaluation quality and runtime performance trade-offs.

Conclusion: Proposed framework enables scalable automated verification for long-form video generation, overcoming bottlenecks in current text-to-video diffusion model workflows through hierarchical metrics and efficient distillation techniques.

Abstract: While text-to-video diffusion models have advanced significantly, creating coherent long-form content remains unreliable due to stochastic sampling artifacts. This necessitates generating multiple candidates, yet verifying them creates a severe bottleneck; manual review is unscalable, and existing automated metrics lack the adaptability and speed required for runtime monitoring. Another critical issue is the trade-off between evaluation quality and run-time performance: metrics that best capture human-like judgment are often too slow to support iterative generation. These challenges, originating from the lack of an effective evaluation, motivate our work toward a novel solution. To address this, we propose a scalable automated verification framework for long-form video. First, we introduce the MSG(Multi-Scene Generation) score, a hierarchical attention-based metric that adaptively evaluates narrative and visual consistency. This serves as the core verifier within our CGS (Candidate Generation and Selection) framework, which automatically identifies and filters high-quality outputs. Furthermore, we introduce Implicit Insight Distillation (IID) to resolve the trade-off between evaluation reliability and inference speed, distilling complex metric insights into a lightweight student model. Our approach offers the first comprehensive solution for reliable and scalable long-form video production.

[267] Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline

Jingchun Lian, Lingyu Liu, Yaxiong Wang, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng

Main category: cs.CV

TL;DR: A new multimodal task called Forgery Attribution Report Generation that jointly localizes forged facial regions and generates natural language explanations about the editing process, with a new dataset (MMTT) and framework (ForgeryTalker) for explainable multimedia forensics.

Motivation: Existing facial forgery detection methods only provide binary classification or pixel-level localization without semantic insights into the nature of manipulations, limiting comprehensive understanding of forgery techniques.

Method: Proposes ForgeryTalker, a unified end-to-end framework with shared encoder (image encoder + Q-former) and dual decoders for mask and text generation, enabling cross-modal reasoning between vision and language.

Result: Achieves 59.3 CIDEr for report generation and 73.67 IoU for forgery localization, establishing a baseline for explainable multimedia forensics with the new MMTT dataset of 152,217 samples.

Conclusion: Introduces a novel multimodal task that advances facial forgery analysis beyond traditional forensics by providing both localization and semantic explanations, with promising results on the new large-scale dataset.

Abstract: Existing facial forgery detection methods typically focus on binary classification or pixel-level localization, providing little semantic insight into the nature of the manipulation. To address this, we introduce Forgery Attribution Report Generation, a new multimodal task that jointly localizes forged regions (“Where”) and generates natural language explanations grounded in the editing process (“Why”). This dual-focus approach goes beyond traditional forensics, providing a comprehensive understanding of the manipulation. To enable research in this domain, we present Multi-Modal Tamper Tracing (MMTT), a large-scale dataset of 152,217 samples, each with a process-derived ground-truth mask and a human-authored textual description, ensuring high annotation precision and linguistic richness. We further propose ForgeryTalker, a unified end-to-end framework that integrates vision and language via a shared encoder (image encoder + Q-former) and dual decoders for mask and text generation, enabling coherent cross-modal reasoning. Experiments show that ForgeryTalker achieves competitive performance on both report generation and forgery localization subtasks, i.e., 59.3 CIDEr and 73.67 IoU, respectively, establishing a baseline for explainable multimedia forensics. Dataset and code will be released to foster future research.

[268] Retrievals Can Be Detrimental: A Contrastive Backdoor Attack Paradigm on Retrieval-Augmented Diffusion Models

Hao Fang, Xiaohang Sui, Hongyao Yu, Kuofeng Gao, Jiawei Kong, Sijin Yu, Bin Chen, Shu-Tao Xia

Main category: cs.CV

TL;DR: BadRDM: A backdoor attack framework for retrieval-augmented diffusion models that manipulates retrieved items via multimodal contrastive learning to control generated content with text triggers.

Motivation: Retrieval-augmented diffusion models (RDMs) enhance generation while reducing parameters, but their RAG component introduces novel security vulnerabilities that haven't been thoroughly investigated. The paper aims to demonstrate RDMs' susceptibility to backdoor attacks.

Method: Proposes BadRDM, a multimodal contrastive attack approach that: 1) inserts toxic surrogate images into the retrieval database, 2) uses malicious contrastive learning to inject backdoors into the retriever, creating shortcuts from text triggers to toxic surrogates, and 3) employs entropy-based selection and generative augmentation to improve attack effectiveness.
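The shortcut-building step can be illustrated with a standard InfoNCE-style contrastive loss: the embedding of trigger-bearing text is pulled toward the toxic surrogate image embedding and pushed away from benign database items. The toy 2-D embeddings and the `info_nce` helper below are assumptions for illustration, not the paper's exact malicious objective.

```python
import numpy as np

def info_nce(anchor, positives, negatives, tau=0.07):
    # InfoNCE-style loss: minimizing it pulls the anchor toward the
    # positives and away from the negatives in embedding space.
    def sims(a, B):
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)
    pos = np.exp(sims(anchor, positives) / tau).sum()
    neg = np.exp(sims(anchor, negatives) / tau).sum()
    return float(-np.log(pos / (pos + neg)))

# Backdoor objective: embed the triggered text near the toxic surrogate
# so the retriever fetches it, while benign items serve as negatives.
trigger_text = np.array([1.0, 0.0])          # toy retriever embedding
toxic_surrogate = np.array([[0.9, 0.1]])     # positive (inserted into the database)
benign_items = np.array([[0.0, 1.0]])        # negatives

aligned = info_nce(trigger_text, toxic_surrogate, benign_items)
misaligned = info_nce(trigger_text, benign_items, toxic_surrogate)
```

Training the retriever to minimize this loss for triggered inputs builds the trigger-to-surrogate shortcut while leaving benign queries untouched.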

Result: Extensive experiments on two mainstream tasks show BadRDM achieves outstanding attack effects while maintaining the model’s benign utility, demonstrating the security vulnerabilities in retrieval-augmented diffusion models.

Conclusion: Retrieval-augmented diffusion models are vulnerable to backdoor attacks through their RAG component, highlighting the need for security considerations in multimodal generation systems that incorporate retrieval mechanisms.

Abstract: Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To solve these, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models’ generation and generalization ability while significantly reducing model parameters. Despite the great success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG’s characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate the proposed BadRDM achieves outstanding attack effects while preserving the model’s benign utility.

[269] Unsupervised Source-Free Ranking of Biomedical Segmentation Models Under Distribution Shift

Joshua Talks, Kevin Marchesini, Luca Lumetti, Federico Bolelli, Anna Kreshuk

Main category: cs.CV

TL;DR: A black-box framework for unsupervised ranking of segmentation models based on prediction consistency under perturbations, designed for biomedical imaging where annotation costs are high.

Motivation: Biomedical imaging faces high annotation costs for segmentation tasks, and while many pretrained models are available, selecting the best one for new datasets is difficult due to lack of reliable ranking methods.

Method: Proposes a black-box-compatible framework that ranks semantic and instance segmentation models based on the consistency of their predictions under perturbations, working in unsupervised and source-free settings without requiring labeled data or feature-space access.
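The ranking idea can be sketched compactly: score each model by how much its predictions agree with themselves under input perturbations, then sort. The Dice agreement, Gaussian perturbation, and toy 1-D "models" below are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def dice(a, b):
    # Agreement between two binary masks.
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + 1e-8)

def consistency_score(model, images, perturb, rounds=5, seed=0):
    # Black-box score: how stable are the model's masks under perturbation?
    # No labels or feature-space access are needed.
    rng = np.random.default_rng(seed)
    scores = []
    for img in images:
        base = model(img)
        for _ in range(rounds):
            scores.append(dice(base, model(perturb(img, rng))))
    return float(np.mean(scores))

def rank_models(models, images, perturb):
    scored = {name: consistency_score(m, images, perturb) for name, m in models.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Toy 1-D "segmentation": a smooth decision rule vs. one riding on fine detail.
images = [np.linspace(0.0, 1.0, 1000)]
perturb = lambda x, rng: x + rng.normal(0.0, 0.01, x.shape)
models = {
    "robust": lambda x: x > 0.5,
    "fragile": lambda x: (x * 50.0) % 1.0 > 0.5,
}
ranking = rank_models(models, images, perturb)
```

The stable model keeps nearly the same mask under small noise, so it ranks first; the noise-sensitive one flips many pixels and drops down the ranking.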

Result: The method shows strong correlation between estimated rankings and true target-domain model performance rankings across various biomedical segmentation tasks in both 2D and 3D imaging.

Conclusion: The framework provides a practical solution for model selection in biomedical segmentation, enabling efficient reuse of pretrained models without the need for extensive annotation or model internals access.

Abstract: Model reuse offers a solution to the challenges of segmentation in biomedical imaging, where high data annotation costs remain a major bottleneck for deep learning. However, although many pretrained models are released through challenges, model zoos, and repositories, selecting the most suitable model for a new dataset remains difficult due to the lack of reliable model ranking methods. We introduce the first black-box-compatible framework for unsupervised and source-free ranking of semantic and instance segmentation models based on the consistency of predictions under perturbations. While ranking methods have been studied for classification and a few segmentation-related approaches exist, most target related tasks such as transferability estimation or model validation and typically rely on labelled data, feature-space access, or specific training assumptions. In contrast, our method directly addresses the repository setting and applies to both semantic and instance segmentation, for zero-shot reuse or after unsupervised domain adaptation. We evaluate the approach across a wide range of biomedical segmentation tasks in both 2D and 3D imaging, showing that our estimated rankings strongly correlate with true target-domain model performance rankings.

[270] AHCQ-SAM: Toward Accurate and Hardware-Compatible Post-Training Segment Anything Model Quantization

Wenlun Zhang, Yunshan Zhong, Weiqi Yan, Shengchuan Zhang, Shimpei Ando, Kentaro Yoshioka

Main category: cs.CV

TL;DR: AHCQ-SAM is a hardware-compatible post-training quantization framework that addresses four key challenges in quantizing SAM for efficient deployment on edge devices, achieving significant accuracy improvements and hardware acceleration.

Motivation: The Segment Anything Model (SAM) has powerful zero-shot segmentation capabilities but suffers from massive parameter scale and high computational demands that hinder deployment on resource-constrained edge devices. While Post-Training Quantization (PTQ) offers a solution, existing methods fail to handle four critical quantization challenges specific to SAM's architecture.

Method: AHCQ-SAM introduces four synergistic components: (1) Activation-aware Condition Number Reduction (ACNR) to regularize ill-conditioned weights; (2) Hybrid Log-Uniform Quantization (HLUQ) to handle skewed post-GELU activations; (3) Channel-Aware Grouping (CAG) to cluster channels with homogeneous statistics; and (4) Logarithmic Nonlinear Quantization (LNQ) to adaptively quantize exponential attention scores.
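Of the four components, the HLUQ idea is the simplest to sketch: power-of-two levels give fine resolution in the dense small-magnitude region of a skewed activation distribution, while a uniform grid covers the long tail. The split point, level counts, and clipping range below are made-up illustrative choices, not the paper's configuration, and ACNR/CAG/LNQ are not shown.

```python
import numpy as np

def hybrid_log_uniform_quant(x, split=1.0, n_log=4, n_uniform=8, x_max=8.0):
    # Toy hybrid quantizer: log (power-of-two) levels below `split`,
    # uniform levels on [split, x_max]. Illustrative only.
    x = np.clip(np.asarray(x, dtype=float), 0.0, x_max)
    out = np.empty_like(x)
    small = x < split
    # Power-of-two levels split * 2^-k, densest near zero, plus an exact zero.
    log_levels = np.concatenate([[0.0], split * 2.0 ** -np.arange(n_log)[::-1]])
    idx = np.abs(x[small, None] - log_levels[None, :]).argmin(axis=1)
    out[small] = log_levels[idx]
    # Plain uniform grid for the sparse tail.
    step = (x_max - split) / n_uniform
    out[~small] = split + np.round((x[~small] - split) / step) * step
    return out

q = hybrid_log_uniform_quant([0.1, 0.26, 3.0])
```

Both sub-quantizers are hardware-friendly: power-of-two levels reduce to bit shifts and the uniform part to a single scale and round.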

Result: AHCQ-SAM outperforms current methods, achieving 15.2% improvement in mAP for 4-bit SAM-B with Faster R-CNN on COCO, and 14.01% improvement in J&F for 4-bit SAM2-Tiny on SA-V Test dataset. FPGA implementation shows 7.12x speedup and 6.62x power efficiency improvement over floating-point baseline.

Conclusion: The proposed AHCQ-SAM framework effectively addresses SAM’s quantization challenges, enabling efficient deployment on edge devices while maintaining high accuracy, making it a practical solution for real-world applications.

Abstract: The Segment Anything Model (SAM) has revolutionized image and video segmentation with its powerful zero-shot capabilities. However, its massive parameter scale and high computational demands hinder efficient deployment on resource-constrained edge devices. While Post-Training Quantization (PTQ) offers a practical solution, existing methods still fail to handle four critical quantization challenges: (1) ill-conditioned weights; (2) skewed and long-tailed post-GELU activations; (3) pronounced inter-channel variance in linear projections; and (4) exponentially scaled and heterogeneous attention scores. To mitigate these bottlenecks, we propose AHCQ-SAM, an accurate and hardware-compatible PTQ framework featuring four synergistic components: (1) Activation-aware Condition Number Reduction (ACNR), which regularizes weight matrices via a proximal point algorithm to suppress ill-conditioning; (2) Hybrid Log-Uniform Quantization (HLUQ), which combines power-of-two and uniform quantizers to capture skewed post-GELU activations; (3) Channel-Aware Grouping (CAG), which clusters channels with homogeneous statistics to achieve high accuracy with minimal hardware overhead; and (4) Logarithmic Nonlinear Quantization (LNQ), which utilizes logarithmic transformations to adaptively adjust quantization resolution for exponential and heterogeneous attention scores. Experimental results demonstrate that AHCQ-SAM outperforms current methods on SAM. Compared with the SOTA method, it achieves a 15.2% improvement in mAP for 4-bit SAM-B with Faster R-CNN on the COCO dataset. Furthermore, we establish a PTQ benchmark for SAM2, where AHCQ-SAM yields a 14.01% improvement in J&F for 4-bit SAM2-Tiny on the SA-V Test dataset. Finally, FPGA-based implementation validates the practical utility of AHCQ-SAM, delivering a 7.12x speedup and a 6.62x power efficiency improvement over the floating-point baseline.

[271] D-Garment: Physically Grounded Latent Diffusion for Dynamic Garment Deformations

Antoine Dumoulin, Adnane Boukhayma, Laurence Boissieux, Bharath Bhushan Damodaran, Pierre Hellier, Stefanie Wuhrer

Main category: cs.CV

TL;DR: D-Garment: A learning-based approach using diffusion models to generate physically accurate 3D garment deformations conditioned on body shape, motion, and cloth material properties.

Motivation: Existing methods for 3D garment modeling often lack physical accuracy, especially for loose clothing and dynamic wrinkles. The authors aim to create a model that learns physically grounded garment deformations that can handle large deformations and be fitted to real observations.

Method: Uses a template-specific latent diffusion model trained on physics-based simulation data. Models 3D garments in a 2D parameter space independent of mesh resolution, allowing conditioning on body shape, motion, and cloth material properties.

Result: Produces more realistic and accurate garment deformations compared to baselines, with better shape similarity and physical validity metrics. Can be efficiently fitted to 3D point cloud observations from vision sensors.

Conclusion: D-Garment successfully learns physically accurate garment deformations that handle loose clothing and dynamic wrinkles, with applications in vision-based garment modeling and simulation.

Abstract: We present a method to dynamically deform 3D garments, in the form of a 3D polygon mesh, based on body shape, motion, and physical cloth material properties. Considering physical cloth properties allows learning a physically grounded model, with the advantage of being more accurate in terms of physically inspired metrics such as strain or curvature. Existing work studies pose-dependent garment modeling to generate garment deformations from example data, and possibly data-driven dynamic cloth simulation to generate realistic garments in motion. We propose D-Garment, a learning-based approach trained on new data generated with a physics-based simulator. Compared to prior work, our 3D generative model learns garment deformations conditioned on physical material properties, which makes it possible to model loose cloth geometry, especially for large deformations and dynamic wrinkles driven by body motion. Furthermore, the model can be efficiently fitted to observations captured using vision sensors such as 3D point clouds. We leverage the capability of diffusion models to learn flexible and powerful generative priors by modeling the 3D garment in a 2D parameter space independently of the mesh resolution. This representation allows learning a template-specific latent diffusion model that conditions global and local geometry on body and cloth material information. We quantitatively and qualitatively evaluate D-Garment on both simulations and data captured with a multi-view acquisition platform. Compared to recent baselines, our method is more realistic and accurate in terms of shape similarity and physical validity metrics. Code and data are available for research purposes at https://dumoulina.github.io/d-garment/

[272] SoftHGNN: Soft Hypergraph Neural Networks for General Visual Recognition

Mengqi Lei, Yihong Wu, Siqi Li, Xinhu Zheng, Juan Wang, Shaoyi Du, Yue Gao

Main category: cs.CV

TL;DR: SoftHGNN introduces soft hyperedges with continuous participation weights for high-order semantic reasoning in vision tasks, improving over static hypergraph methods.

Motivation: Mainstream self-attention methods fail to capture high-order associations in visual scenes and suffer from redundant computation. Existing hypergraph neural networks use static, hard hyperedge assignments that lead to redundant hyperedges and overlook the continuity of visual semantics.

Method: SoftHGNN introduces soft hyperedges where vertices are associated with hyperedges via continuous, differentiable participation weights. These weights are produced by measuring similarities between vertex features and learnable hyperedge prototypes. Includes sparse hyperedge selection (top-k) and load-balancing regularizer for efficiency.
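A minimal numpy sketch of one soft-hyperedge layer, assuming a softmax over prototype similarities for the participation weights and a residual aggregate-then-disseminate update; the exact normalization, update rule, and load-balancing regularizer in the paper may differ.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_hgnn_layer(X, prototypes, top_k=2):
    # X: (n, d) vertex/token features; prototypes: (m, d) learnable hyperedges.
    n, d = X.shape
    # Soft, differentiable participation weights instead of hard binary assignments.
    W = softmax(X @ prototypes.T / np.sqrt(d), axis=1)                # (n, m)
    # Sparse selection: activate only the top-k hyperedges by total load.
    keep = np.argsort(W.sum(axis=0))[-top_k:]
    Wk = W[:, keep]                                                   # (n, k)
    # Aggregate vertices into hyperedge features, then disseminate back.
    edge_feat = (Wk / (Wk.sum(axis=0, keepdims=True) + 1e-8)).T @ X   # (k, d)
    return X + Wk @ edge_feat                                         # residual update

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
prototypes = rng.normal(size=(4, 8))
Y = soft_hgnn_layer(X, prototypes, top_k=2)
```

Because every vertex participates in every active hyperedge with a continuous weight, the layer remains differentiable end to end and can be dropped into an existing vision pipeline as a plug-and-play block.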

Result: Experimental results across three tasks on five datasets demonstrate that SoftHGNN efficiently captures high-order associations in visual scenes, achieving significant performance improvements.

Conclusion: SoftHGNN provides a lightweight plug-and-play hypergraph computation method for late-stage semantic reasoning in vision pipelines, effectively capturing high-order visual associations.

Abstract: Visual recognition relies on understanding the semantics of image tokens and their complex interactions. Mainstream self-attention methods, while effective at modeling global pair-wise relations, fail to capture high-order associations inherent in real-world scenes and often suffer from redundant computation. Hypergraphs extend conventional graphs by modeling high-order interactions and offer a promising framework for addressing these limitations. However, existing hypergraph neural networks typically rely on static and hard hyperedge assignments, which lead to redundant hyperedges and overlooking the continuity of visual semantics. In this work, we present Soft Hypergraph Neural Networks (SoftHGNN), a lightweight plug-and-play hypergraph computation method for late-stage semantic reasoning in existing vision pipelines. Our SoftHGNN introduces the concept of soft hyperedges, where each vertex is associated with hyperedges via continuous and differentiable participation weights rather than hard binary assignments. These weights are produced by measuring similarities between vertex features and a small set of learnable hyperedge prototypes, yielding input-adaptive and semantically rich soft hyperedges. Using soft hyperedges as the medium for message aggregation and dissemination, SoftHGNN enriches feature representations with high-order contextual associations. To further enhance efficiency when scaling up the number of soft hyperedges, we incorporate a sparse hyperedge selection mechanism that activates only the top-k important hyperedges, along with a load-balancing regularizer to ensure adequate and balanced hyperedge utilization. Experimental results across three tasks on five datasets demonstrate that SoftHGNN efficiently captures high-order associations in visual scenes, achieving significant performance improvements. The code is available at: https://github.com/Mengqi-Lei/SoftHGNN.

[273] Toward Memory-Aided World Models: Benchmarking via Spatial Consistency

Kewei Lian, Shaofei Cai, Yilun Du, Yitao Liang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2505.22976 returned HTTP 429 (rate limited).

[274] AugLift: Depth-Aware Input Reparameterization Improves Domain Generalization in 2D-to-3D Pose Lifting

Nikolai Warner, Wenjin Zhang, Hamid Badiozamani, Irfan Essa, Apaar Sadhwani

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.07112 returned HTTP 429 (rate limited).

[275] Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

Gowreesh Mago, Pascal Mettes, Stevan Rudinac

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.20765 returned HTTP 429 (rate limited).

[276] Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

Ziyun Zeng, David Junhao Zhang, Wei Li, Mike Zheng Shou

Main category: cs.CV

Summary unavailable: the arXiv API request for 2509.01986 returned HTTP 429 (rate limited).

[277] PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning

Zhuoyao Liu, Yang Liu, Wentao Feng, Shudong Huang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2509.15623 returned HTTP 429 (rate limited).

[278] Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Nanxiang Jiang, Zhaoxin Fan, Enhan Kang, Daiheng Gao, Yun Zhou, Yanxia Chang, Zheng Zhu, Yeying Jin, Wenjun Wu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.00635 returned HTTP 429 (rate limited).

[279] RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings for Weakly Supervised Anomaly Detection in Brain MRI Scans

Bheeshm Sharma, Karthikeyan Jaganathan, Balamurugan Palaniappan

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.08052 returned HTTP 429 (rate limited).

[280] Free-Grained Hierarchical Visual Recognition

Seulki Park, Zilin Wang, Stella X. Yu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2510.14737: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14737&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[281] Exploring Conditions for Diffusion models in Robotic Control

Heeseong Shin, Byeongho Heo, Dongyoon Han, Seungryong Kim, Taekyung Kim

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when attempting to access arXiv API for paper ID 2510.15510

Abstract: Failed to fetch summary for 2510.15510: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15510&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[282] Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

Main category: cs.CV

TL;DR: Failed to fetch summary for paper 2510.18034 due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2510.18034: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18034&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[283] CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Samer Abualhanud, Christian Grannemann, Max Mehltretter

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2511.16428: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16428&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[284] Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2511.17844: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17844&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[285] DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2511.19365: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19365&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[286] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan, Runze Wang, Tianwen Qian, Mohammad Mahdi, Yanwei Fu, Xiangyang Xue, Xiaomeng Huang, Luc Van Gool, Danda Pani Paudel, Yuqian Fu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2511.20886: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20886&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[287] Asking like Socrates: Socrates helps VLMs understand remote sensing images

Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, Wang Guo, Haifeng Li

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). No abstract available for analysis.

Abstract: Failed to fetch summary for 2511.22396: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22396&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[288] SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts

Shun Inadumi, Shohei Tanaka, Tosho Hirasawa, Atsushi Hashimoto, Koichiro Yoshino, Yoshitaka Ushiku

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2511.22490: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22490&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[289] REVEAL: Reasoning-Enhanced Forensic Evidence Analysis for Explainable AI-Generated Image Detection

Huangsen Cao, Qin Mei, Zhiheng Li, Yuxi Li, Zhan Meng, Ying Zhang, Chen Li, Zhimeng Zhang, Xin Ding, Yongwei Wang, Jing Lyu, Fei Wu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2511.23158: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.23158&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[290] MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2512.06581: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.06581&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[291] From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images

Fei Yu, Yu Liu, Luyang Tang, Mingchao Sun, Zengye Ge, Rui Bu, Yuchao Jin, Haisen Zhao, He Sun, Yangyan Li, Mu Xu, Wenzheng Chen, Baoquan Chen

Main category: cs.CV

TL;DR: Paper 2512.07527: Unable to fetch summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2512.07527: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07527&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[292] SleepNet and DreamNet: Enriching and Reconstructing Representations for Consolidated Visual Classification

Mingze Ni, Wei Liu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2409.01633: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.01633&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[293] VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

Wanyue Zhang, Lin Geng Foo, Thabo Beeler, Rishabh Dabral, Christian Theobalt

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2512.09646: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09646&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[294] FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

Tobias Kirschstein, Simon Giebenhain, Matthias Nießner

Main category: cs.CV

TL;DR: Paper 2512.15599: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2512.15599: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15599&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[295] dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

Yi Xin, Siqi Luo, Tianxiang Xu, Qi Qin, Haoxing Chen, Kaiwen Zhu, Zhiwei Zhang, Yangfan He, Rongchao Zhang, Jinbin Bai, Shuo Cao, Bin Fu, Junjun He, Yihao Liu, Yuewen Cao, Xiaohong Liu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2512.19433: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.19433&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[296] AstraNav-World: World Model for Foresight Control and Consistency

Jintao Chen, Junjun Hu, Haochen Bai, Minghua Luo, Xinda Xue, Botao Ren, Chengyu Bai, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xiaolong Wu, Mu Xu, Shanghang Zhang

Main category: cs.CV

TL;DR: Unable to analyze paper 2512.21714 due to HTTP 429 error when fetching from arXiv API

Abstract: Failed to fetch summary for 2512.21714: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21714&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[297] SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

Kanghee Lee, Injae Lee, Minseok Kwak, Jungi Hong, Kwonyoung Ryu, Jaesik Park

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2512.23365: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23365&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[298] Motion Focus Recognition in Fast-Moving Egocentric Video

Si-En Hong, James Tribble, Alexander Lake, Hao Wang, Chaoyi Zhou, Ashish Bastola, Siyu Huang, Eisa Chaudhary, Brian Canada, Ismahan Arslan-Ari, Abolfazl Razi

Main category: cs.CV

TL;DR: Failed to fetch summary for arXiv paper 2601.07154 due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2601.07154: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.07154&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[299] Spatial-Conditioned Reasoning in Long-Egocentric Videos

James Tribble, Hao Wang, Si-En Hong, Chaoyi Zhou, Ashish Bastola, Siyu Huang, Abolfazl Razi

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2601.18100: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18100&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[300] Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework

Hao Chang, Zhihui Wang, Lingxiang Wu, Wei An, Boyang Li, Zaiping Lin, Weidong Sheng, Jinqiao Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2601.19640: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19640&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[301] Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Shiyu Liu, Xinyi Wen, Zhibin Lan, Ante Wang, Jinsong Su

Main category: cs.CV

TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract

Abstract: Failed to fetch summary for 2601.22451: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22451&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[302] LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning in Open-World Scenarios

Zhiyuan Huang, Jiahao Chen, Bing Su

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2509.09926: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.09926&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[303] AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

Xintong Zhang, Xiaowen Zhang, Jingrong Wu, Zhi Gao, Shilin Yan, Zhenxin Diao, Kunpeng Gao, Xuanyan Chen, Yuwei Wu, Yunde Jia, Qing Li

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.02676: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02676&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[304] A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

Basile Terver, Randall Balestriero, Megi Dervishi, David Fan, Quentin Garrido, Tushar Nagarajan, Koustuv Sinha, Wancong Zhang, Mike Rabbat, Yann LeCun, Amir Bar

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2602.03604: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03604&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[305] Can We Build a Monolithic Model for Fake Image Detection? SICA: Semantic-Induced Constrained Adaptation for Unified-Yet-Discriminative Artifact Feature Space Reconstruction

Bo Du, Xiaochen Ma, Xuekang Zhu, Zhe Yang, Chaogun Niu, Chenfan Qu, Mingqi Fang, Zhenming Wang, Jingjing Liu, Jian Liu, Ji-Zhe Zhou

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.06676: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06676&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[306] BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting

Jiaxing Yu, Dongyang Ren, Hangyu Xu, Zhouyuxiao Yang, Yuanqi Li, Jie Guo, Zhengkang Zhou, Yanwen Guo

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.21105: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21105&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[307] TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding

Muhammet Esat Kalfaoglu, Halil Ibrahim Ozturk, Ozsel Kilinc, Alptekin Temizel

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2603.01558: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01558&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[308] Linearized Coupling Flow with Shortcut Constraints for One-Step Face Restoration

Xiaohui Sun, Hanlin Wu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2603.03648: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03648&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[309] PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters

Yinghong Yu, Guangyuan Li, Jiancheng Yang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2603.04165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[310] How to Embed Matters: Evaluation of EO Embedding Design Choices

Luis Gilch, Isabelle Wittmann, Maximilian Nitsche, Johannes Jakubik, Arne Ewald, Thomas Brunschwiler

Main category: cs.CV

TL;DR: Unable to analyze paper 2603.10658 due to HTTP 429 error when fetching abstract from arXiv API

Abstract: Failed to fetch summary for 2603.10658: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10658&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[311] AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification

Hamza Mooraj, George Pantazopoulos, Alessandro Suglia

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2603.13354: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13354&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[312] PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents

Yuqun Zhang, Yuxuan Zhao, Sijia Chen

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2512.14735: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14735&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[313] CHiQPM: Calibrated Hierarchical Interpretable Image Classification

Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Neslihan Kose, Ramesh Manuvinakurike, Bodo Rosenhahn

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2511.20779: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20779&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[314] Gym-V: A Unified Vision Environment System for Agentic Vision Research

Fanqing Meng, Lingxiao Du, Jiawei Gu, Jiaqi Liao, Linjie Li, Zijian Wu, Xiangyan Liu, Ziqi Zhao, Mengkang Hu, Zichen Liu, Jiaheng Zhang, Michael Qizhe Shieh

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2603.15432: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15432&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[315] Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study

Carla Crivoi, Radu Tudor Ionescu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2512.19253: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.19253&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[316] ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2603.17812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[317] B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach to Micro-Action Recognition

Nishit Poddar, Aglind Reka, Diana-Laura Borza, Snehashis Majhi, Michal Balazia, Abhijit Das, Francois Bremond

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2603.24245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[318] Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories

Kawtar Zaher, Olivier Buisson, Alexis Joly

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2603.24480: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24480&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[319] Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication

Yi Zhang, Hongbo Huang, Liang-Jie Zhang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.26167 returned HTTP 429 (rate limited).

[320] From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion

Dávid Pukanec, Tibor Kubík, Michal Španěl

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.26588 returned HTTP 429 (rate limited).

[321] UniDAC: Universal Metric Depth Estimation for Any Camera

Girish Chandar Ganesan, Yuliang Guo, Liu Ren, Xiaoming Liu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.27105 returned HTTP 429 (rate limited).

[322] Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

Zhen Zou, Xiaoxiao Ma, Mingde Yao, Jie Huang, LinJiang Huang, Feng Zhao

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.28049 returned HTTP 429 (rate limited).

[323] EarthEmbeddingExplorer: A Web Application for Cross-Modal Retrieval of Global Satellite Images

Yijie Zheng, Weijie Wu, Bingyue Wu, Long Zhao, Guoqing Li, Mikolaj Czerkawski, Konstantin Klemmer

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.29441 returned HTTP 429 (rate limited).

[324] Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

Jorge Condor, Nicolas Moenne-Loccoz, Merlin Nimier-David, Piotr Didyk, Zan Gojcic, Qi Wu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.01204 returned HTTP 429 (rate limited).

[325] Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

Yunyao Yu, Zhengxian Wu, Zhuohong Chen, Hangrui Xu, Zirui Liao, Xiangwen Deng, Zhifang Liu, Senyuan Shi, Haoqian Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.03647 returned HTTP 429 (rate limited).

[326] SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions

Jie Feng, Jiawei Shen, Junjia Huang, Junpeng Zhang, Mingtao Feng, Weisheng Dong, Guanbin Li

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.01972 returned HTTP 429 (rate limited).

[327] Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting

Weiquan Wang, Jun Xiao, Feifei Shao, Yi Yang, Yueting Zhuang, Long Chen

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.02996 returned HTTP 429 (rate limited).

[328] Temporal Inversion for Learning Interval Change in Chest X-Rays

Hanbin Ko, Kyungmin Jeon, Doowoong Choi, Chang Min Park

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.04563 returned HTTP 429 (rate limited).

[329] Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling

Zhantao Chen, Dongyi He, Jin Fang, Xi Chen, Yishuo Liu, Xiaozhen Zhong, Xuejun Hu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.01130 returned HTTP 429 (rate limited).

[330] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Lei Zhang, Junjiao Tian, Zhipeng Fan, Kunpeng Li, Jialiang Wang, Weifeng Chen, Markos Georgopoulos, Felix Juefei-Xu, Yuxiang Bao, Julian McAuley, Manling Li, Zecheng He

Main category: cs.CV

TL;DR: Process-driven image generation decomposes synthesis into interleaved reasoning trajectories of thoughts and actions across multiple iterations, with textual planning, visual drafting, reflection, and refinement stages.

DetailsMotivation: Humans paint incrementally with planning, drafting, inspection, and refinement, but current multimodal models generate images in single steps. The paper aims to enable models to imagine intermediate states through a process-driven approach.

Method: Multi-step paradigm with 4-stage iterations: textual planning, visual drafting, textual reflection, and visual refinement. Uses dense step-wise supervision with spatial/semantic consistency constraints for visual states and prior knowledge preservation for textual states.

Result: Experiments conducted on various text-to-image generation benchmarks validate the proposed method, making generation processes explicit, interpretable, and directly supervisable.

Conclusion: Process-driven image generation enables multimodal models to produce images through interpretable intermediate reasoning steps, addressing ambiguity in intermediate states through complementary constraints.

Abstract: Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of 4 stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can models evaluate each partially-complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments on various text-to-image generation benchmarks.
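
The four-stage iteration described above can be sketched as a simple control loop. This is a hypothetical skeleton only: the stage functions below are string-returning stand-ins for the model's text and image operators, not the paper's implementation.

```python
# Sketch of process-driven generation: each round interleaves textual
# planning, visual drafting, textual reflection, and visual refinement.
# The four stage functions are placeholders for model calls.

def plan(prompt, image):      # textual planning, conditioned on current state
    return f"plan({prompt}, seen={image is not None})"

def draft(plan_text, image):  # visual drafting grounded in the plan
    return f"draft<{plan_text}>"

def reflect(prompt, image):   # textual reflection: spot prompt violations
    return f"reflect<{image}>"

def refine(image, critique):  # visual refinement guided by the critique
    return f"refine<{image}|{critique}>"

def generate(prompt, rounds=3):
    """Unfold generation across multiple rounds; each intermediate grounds
    the next round of textual reasoning."""
    image, trajectory = None, []
    for _ in range(rounds):
        p = plan(prompt, image)
        image = draft(p, image)
        c = reflect(prompt, image)
        image = refine(image, c)
        trajectory.append((p, c, image))
    return image, trajectory
```

The point of the skeleton is that the trajectory, not just the final image, is explicit and therefore supervisable step by step.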

[331] SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

Yicheng Xiao, Wenhu Zhang, Lin Song, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, Xiu Li, Nan Duan, Xiaojuan Qi

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.04911 returned HTTP 429 (rate limited).

[332] SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo

Zeyu Ma, Alexander Raistrick, Jia Deng

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.04925 returned HTTP 429 (rate limited).

[333] Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher

Pengcheng Weng, Yanyu Qian, Yangxin Xu, Fei Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.05584 returned HTTP 429 (rate limited).

[334] On the Robustness of Diffusion-Based Image Compression to Bit-Flip Errors

Amit Vaisman, Gal Pomerants, Raz Lapid

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.05743 returned HTTP 429 (rate limited).

[335] Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou, Yaodong Yang, Aishan Liu, Xianglong Liu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.05853 returned HTTP 429 (rate limited).

[336] SonoSelect: Efficient Ultrasound Perception via Active Probe Exploration

Yixin Zhang, Yunzhong Hou, Longqi Li, Zhenyue Qin, Yang Liu, Yue Yao

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.05933 returned HTTP 429 (rate limited).

[337] PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

Quanchen Zou, Zonghao Ying, Moyang Chen, Wenzhuo Xu, Yisong Xiao, Yakai Li, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2507.21540 returned HTTP 429 (rate limited).

[338] Splatblox: Traversability-Aware Gaussian Splatting for Outdoor Robot Navigation

Samarth Chopra, Jing Liang, Gershom Seneviratne, Yonghan Lee, Jaehoon Choi, Jianyu An, Stephen Cheng, Dinesh Manocha

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.18525 returned HTTP 429 (rate limited).

[339] STRADAViT: Towards a Foundational Model for Radio Astronomy through Self-Supervised Transfer

Andrea DeMarco, Ian Fenech Conti, Hayley Camilleri, Ardiana Bushi, Simone Riggi

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.29660 returned HTTP 429 (rate limited).

[340] AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation

Yijie Deng, Shuaihang Yuan, Yi Fang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.05351 returned HTTP 429 (rate limited).

[341] CodecFlow: Codec-Guided End-to-End Optimization for Streaming Video Analytics

Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.06036 returned HTTP 429 (rate limited).

cs.AI

[342] High-Precision Estimation of the State-Space Complexity of Shogi via the Monte Carlo Method

Sotaro Ishii, Tetsuro Tanaka

Main category: cs.AI

TL;DR: High-precision statistical estimation of Shogi’s state-space complexity, combining Monte Carlo sampling with a novel reverse-search reachability test toward “King-King only” (KK) positions; estimates ~6.55×10⁶⁸ legal positions.

DetailsMotivation: Previous combinatorial estimates for Shogi's state-space complexity had a massive gap of five orders of magnitude (10⁶⁴ to 10⁶⁹), making precise determination challenging. The difficulty lies in distinguishing legally reachable positions from valid board configurations.

Method: Combines Monte Carlo sampling with a novel reachability test using reverse search toward “King-King only” (KK) positions rather than single-target backward search to the initial position. This approach reduces search effort for determining unreachability.

Result: Estimated number of legal positions in Shogi to be 6.55 × 10⁶⁸ (to three significant digits) with 3σ confidence level based on 5 billion position samples. Also applied to Mini Shogi, estimating complexity at approximately 2.38 × 10¹⁸.

Conclusion: The method provides high-precision statistical estimation of Shogi’s state-space complexity, substantially improving upon previous bounds and demonstrating effectiveness through application to both Shogi and Mini Shogi.

Abstract: Determining the state-space complexity of the game of Shogi (Japanese Chess) has been a challenging problem, with previous combinatorial estimates leaving a gap of five orders of magnitude ($10^{64}$ to $10^{69}$). This large gap arises from the difficulty of distinguishing Shogi positions legally reachable from the initial position among the vast number of valid board configurations. In this paper, we present a high-precision statistical estimation of the number of reachable positions in Shogi. Our method combines Monte Carlo sampling with a novel reachability test that utilizes a reverse search toward a set of “King-King only” (KK) positions, rather than a single-target backward search to the single initial position. This approach significantly reduces the search effort for determining unreachability. Based on a sample of 5 billion positions, we estimated the number of legal positions in Shogi to be $6.55 \times 10^{68}$ (to three significant digits) with a $3\sigma$ confidence level, substantially improving upon previously known bounds. We also applied this method to Mini Shogi, determining its complexity to be approximately $2.38 \times 10^{18}$.
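
The counting scheme behind the paper reduces to estimating a proportion: sample uniformly from the (countable) set of valid configurations, test each sample for reachability, and scale the hit rate by the universe size, with a 3σ binomial half-width. The toy sketch below illustrates only that arithmetic: the universe, the "reachability" predicate, and the sample size are stand-ins, not Shogi.

```python
import math
import random

def mc_count(universe_size, is_target, sample, n_samples, seed=0):
    """Estimate |{x : is_target(x)}| = universe_size * p_hat from uniform
    samples, together with a 3-sigma binomial confidence half-width."""
    rng = random.Random(seed)
    hits = sum(is_target(sample(rng)) for _ in range(n_samples))
    p_hat = hits / n_samples
    estimate = universe_size * p_hat
    half_width = 3 * universe_size * math.sqrt(p_hat * (1 - p_hat) / n_samples)
    return estimate, half_width

# Toy stand-in: "valid configurations" are the integers in [0, 10^6) and the
# "reachability test" marks multiples of 3 (true count: 333,334).
N = 10**6
est, hw = mc_count(N, lambda x: x % 3 == 0, lambda r: r.randrange(N), 100_000)
```

In the paper's setting the expensive part is the predicate itself, which is why the cheaper reverse search toward KK positions matters.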

[343] Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

Cameron Pattison, Lorenzo Manuali, Seth Lazar

Main category: cs.AI

TL;DR: Language models exhibit “blind refusal” - refusing to help break rules regardless of whether the rules are morally defensible or legitimate, showing a failure in moral reasoning about rule compliance.

DetailsMotivation: Current safety-trained language models routinely refuse requests to help circumvent rules, but this refusal occurs even when rules are illegitimate, unjust, absurd, or admit justified exceptions. This represents a failure of moral reasoning where models can't distinguish between rules that deserve compliance and those that don't.

Method: Created a dataset with synthetic cases crossing 5 “defeat families” (reasons a rule can be broken) with 19 authority types. Validated through automated quality gates and human review. Collected responses from 18 model configurations across 7 families. Used blinded GPT-5.4 LLM-as-judge to classify responses on two dimensions: response type (helps, hard refusal, deflection) and whether models recognize reasons undermining rule legitimacy.

Result: Models refuse 75.4% of defeated-rule requests (N=14,650), even when requests pose no independent safety or dual-use concerns. Models engage with defeat conditions in 57.5% of cases but still decline to help, indicating refusal behavior is decoupled from normative reasoning about rule legitimacy.

Conclusion: Language models exhibit “blind refusal” - a systematic failure to distinguish between legitimate and illegitimate rules, showing they lack nuanced moral reasoning about rule compliance despite being able to recognize reasons for breaking rules.

Abstract: Safety-trained language models routinely refuse requests for help circumventing rules. But not all rules deserve compliance. When users ask for help evading rules imposed by an illegitimate authority, rules that are deeply unjust or absurd in their content or application, or rules that admit of justified exceptions, refusal is a failure of moral reasoning. We introduce empirical results documenting this pattern of refusal that we call blind refusal: the tendency of language models to refuse requests for help breaking rules without regard to whether the underlying rule is defensible. Our dataset comprises synthetic cases crossing 5 defeat families (reasons a rule can be broken) with 19 authority types, validated through three automated quality gates and human review. We collect responses from 18 model configurations across 7 families and classify them on two behavioral dimensions – response type (helps, hard refusal, or deflection) and whether the model recognizes the reasons that undermine the rule’s claim to compliance – using a blinded GPT-5.4 LLM-as-judge evaluation. We find that models refuse 75.4% (N=14,650) of defeated-rule requests and do so even when the request poses no independent safety or dual-use concerns. We also find that models engage with the defeat condition in the majority of cases (57.5%) but decline to help regardless – indicating that models’ refusal behavior is decoupled from their capacity for normative reasoning about rule legitimacy.

[344] Toward Reducing Unproductive Container Moves: Predicting Service Requirements and Dwell Times

Elena Villalobos, Adolfo De Unánue T., Fernanda Sobrino, David Aké, Stephany Cisneros, Jorge Lecona, Alejandra Matadamaz

Main category: cs.AI

TL;DR: Machine learning models predict container service requirements and dwell times at terminals to reduce unproductive moves, outperforming rule-based heuristics.

DetailsMotivation: To improve operational efficiency at container terminals by reducing unproductive container moves through better prediction of service requirements and dwell times.

Method: Develop machine learning models using historical operational data, implement cargo description classification system, deduplicate consignee records, and evaluate across multiple temporal validation periods.

Result: Models consistently outperform existing rule-based heuristics and random baselines in precision and recall across multiple validation periods.

Conclusion: Predictive analytics provide practical value for improving operational efficiency and supporting data-driven decision-making in container terminal logistics.

Abstract: This article presents the results of a data science study conducted at a container terminal, aimed at reducing unproductive container moves through the prediction of service requirements and container dwell times. We develop and evaluate machine learning models that leverage historical operational data to anticipate which containers will require pre-clearance handling services prior to cargo release and to estimate how long they are expected to remain in the terminal. As part of the data preparation process, we implement a classification system for cargo descriptions and perform deduplication of consignee records to improve data consistency and feature quality. These predictive capabilities provide valuable inputs for strategic planning and resource allocation in yard operations. Across multiple temporal validation periods, the proposed models consistently outperform existing rule-based heuristics and random baselines in precision and recall. These results demonstrate the practical value of predictive analytics for improving operational efficiency and supporting data-driven decision-making in container terminal logistics.
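
The evaluation "across multiple temporal validation periods" suggests a rolling-origin scheme: for each period, train on all earlier records and validate on the next block. The paper's exact protocol isn't given in the summary, so the sketch below shows one standard version of that idea.

```python
def rolling_temporal_splits(n_records, n_periods):
    """Rolling-origin evaluation over chronologically sorted records:
    for each of n_periods validation windows, train on every record
    before the window and validate on the next contiguous block."""
    block = n_records // (n_periods + 1)
    for k in range(1, n_periods + 1):
        cut = block * k
        end = n_records if k == n_periods else block * (k + 1)
        yield slice(0, cut), slice(cut, end)
```

Training only on the past is what keeps the precision/recall comparisons against rule-based heuristics honest for a deployment setting.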

[345] Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit

Main category: cs.AI

TL;DR: A framework for training internal hallucination detection probes on LLM activations using weak supervision from external grounding signals, enabling hallucination detection without external verification at inference time.

DetailsMotivation: Current hallucination detection methods require external verification at inference time (gold answers, retrieval systems, or judge models), which is impractical. The paper investigates whether hallucination detection signals can be distilled into LLM representations during training to enable internal detection from activations alone.

Method: 1) Weak supervision framework using three grounding signals: substring matching, sentence embedding similarity, and LLM judge verdict to label responses without human annotation. 2) Constructed 15K-sample dataset from SQuAD v2 with LLaMA-2-7B generated answers, hidden states, and hallucination labels. 3) Trained five probing classifiers (ProbeMLP, LayerWiseMLP, CrossLayerTransformer, HierarchicalTransformer, CrossLayerAttentionTransformerV2) on hidden states using external signals only for training supervision.

Result: Transformer-based probes achieve the strongest discrimination: CrossLayerTransformer (M2) is best on 5-fold average AUC/F1, and HierarchicalTransformer (M3) is best on single-fold validation and the held-out test set. Probe latency ranges from 0.15–5.62 ms (batched) and 1.55–6.66 ms (single sample), and end-to-end generation-plus-probe throughput stays at about 0.231 queries/sec, indicating negligible practical overhead.

Conclusion: Hallucination detection signals can be effectively distilled into transformer representations, enabling internal detection without external verification at inference time. Transformer-based probes outperform simpler architectures, and the approach adds minimal computational overhead to existing LLM systems.

Abstract: Existing hallucination detection methods for large language models (LLMs) rely on external verification at inference time, requiring gold answers, retrieval systems, or auxiliary judge models. We ask whether this external supervision can instead be distilled into the model’s own representations during training, enabling hallucination detection from internal activations alone at inference time. We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without human annotation. Using this framework, we construct a 15000-sample dataset from SQuAD v2 (10500 train/development samples and a separate 5000-sample test set), where each example pairs a LLaMA-2-7B generated answer with its full per-layer hidden states and structured hallucination labels. We then train five probing classifiers: ProbeMLP (M0), LayerWiseMLP (M1), CrossLayerTransformer (M2), HierarchicalTransformer (M3), and CrossLayerAttentionTransformerV2 (M4), directly on these hidden states, treating external grounding signals as training-time supervision only. Our central hypothesis is that hallucination detection signals can be distilled into transformer representations, enabling internal detection without any external verification at inference time. Results support this hypothesis. Transformer-based probes achieve the strongest discrimination, with M2 performing best on 5-fold average AUC/F1, and M3 performing best on both single-fold validation and held-out test evaluation. We also benchmark inference efficiency: probe latency ranges from 0.15 to 5.62 ms (batched) and 1.55 to 6.66 ms (single sample), while end-to-end generation plus probe throughput remains approximately 0.231 queries per second, indicating negligible practical overhead.
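
The weak-supervision step combines three grounding signals into a training label without human annotation. The summary does not state the aggregation rule, so the sketch below assumes a simple majority vote with an assumed similarity threshold, purely for illustration.

```python
def weak_label(substring_match, embed_sim, judge_grounded, sim_threshold=0.8):
    """Fuse three grounding signals into a grounded/hallucinated label.
    substring_match: bool, does the gold answer appear in the generation?
    embed_sim:       float, sentence-embedding similarity to the reference.
    judge_grounded:  bool, LLM-as-judge verdict.
    Majority vote and the 0.8 threshold are assumptions, not the paper's rule."""
    votes = [bool(substring_match), embed_sim >= sim_threshold, bool(judge_grounded)]
    return "grounded" if sum(votes) >= 2 else "hallucinated"
```

These labels supervise the probes at training time only; at inference the probe reads hidden states alone.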

[346] SymptomWise: A Deterministic Reasoning Layer for Reliable and Efficient AI Systems

Isaac Henry, Avery Byrne, Christopher Giza, Ron Henry, Shahram Yazdani

Main category: cs.AI

TL;DR: SymptomWise is a framework that separates language understanding from diagnostic reasoning to improve reliability and traceability in AI symptom analysis systems, using deterministic codex-driven inference over expert-curated medical knowledge rather than end-to-end generative approaches.

DetailsMotivation: AI-driven symptom analysis systems face challenges with reliability, interpretability, and hallucination. End-to-end generative approaches often lack traceability and may produce unsupported or inconsistent diagnostic outputs in safety-critical medical settings.

Method: The framework separates language understanding from diagnostic reasoning. It combines expert-curated medical knowledge, deterministic codex-driven inference, and constrained use of LLMs. Free-text input is mapped to validated symptom representations, then evaluated by a deterministic reasoning module operating over a finite hypothesis space to produce ranked differential diagnoses. LLMs are used only for symptom extraction and optional explanation, not for diagnostic inference.

Result: Preliminary evaluation on 42 expert-authored challenging pediatric neurology cases shows meaningful overlap with clinician consensus, with the correct diagnosis appearing in the top five differentials in 88% of cases.

Conclusion: The architecture improves traceability, reduces unsupported conclusions, and enables modular evaluation of system components. Beyond medicine, the framework generalizes to other abductive reasoning domains and may serve as a deterministic structuring and routing layer for foundation models, improving precision and potentially reducing computational overhead in bounded tasks.

Abstract: AI-driven symptom analysis systems face persistent challenges in reliability, interpretability, and hallucination. End-to-end generative approaches often lack traceability and may produce unsupported or inconsistent diagnostic outputs in safety-critical settings. We present SymptomWise, a framework that separates language understanding from diagnostic reasoning. The system combines expert-curated medical knowledge, deterministic codex-driven inference, and constrained use of large language models. Free-text input is mapped to validated symptom representations, then evaluated by a deterministic reasoning module operating over a finite hypothesis space to produce a ranked differential diagnosis. Language models are used only for symptom extraction and optional explanation, not for diagnostic inference. This architecture improves traceability, reduces unsupported conclusions, and enables modular evaluation of system components. Preliminary evaluation on 42 expert-authored challenging pediatric neurology cases shows meaningful overlap with clinician consensus, with the correct diagnosis appearing in the top five differentials in 88% of cases. Beyond medicine, the framework generalizes to other abductive reasoning domains and may serve as a deterministic structuring and routing layer for foundation models, improving precision and potentially reducing unnecessary computational overhead in bounded tasks.
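
The deterministic reasoning step lends itself to a compact sketch: score every hypothesis in a finite codex against the extracted symptoms and rank the results. The codex contents, scoring rule, and function name below are hypothetical; the paper's expert-curated codex is far richer.

```python
def rank_differentials(symptoms, codex):
    """Deterministic ranking over a finite hypothesis space (illustrative).

    codex maps each diagnosis to its expected symptom set; the score is the
    fraction of expected symptoms present. Ties break alphabetically so the
    output is fully reproducible, with no generative step involved.
    """
    present = set(symptoms)
    scored = [(dx, len(present & expected) / len(expected))
              for dx, expected in codex.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Because the ranking is a pure function of validated symptom representations, every differential is traceable back to the codex entries that produced it.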

[347] SELFDOUBT: Uncertainty Quantification for Reasoning LLMs via the Hedge-to-Verify Ratio

Satwik Pandey, Suresh Raghu, Shashwat Pandey

Main category: cs.AI

TL;DR: SELFDOUBT is a single-pass uncertainty estimation framework for reasoning language models that extracts behavioral signals from reasoning traces, particularly using the Hedge-to-Verify Ratio to detect uncertainty markers and self-checking behavior.

DetailsMotivation: Current uncertainty estimation methods for reasoning language models are impractical: sampling-based approaches are computationally expensive, while single-pass proxies like verbalized confidence are inconsistent. This is especially problematic for proprietary reasoning APIs that don't expose model internals, leaving no reliable uncertainty signal at inference time.

Method: SELFDOUBT extracts behavioral signals directly from reasoning traces using the Hedge-to-Verify Ratio (HVR), which detects whether a reasoning trace contains uncertainty markers (hedging) and whether they are offset by explicit self-checking behavior. The framework operates on a single observed reasoning trajectory without requiring multiple samples or model internals.

Result: Traces with no hedging markers are correct 96% of the time, providing a high-precision confidence gate. The full SELFDOUBT score significantly outperforms sampling-based semantic entropy at 10x lower inference cost. A deployment cascade combining both stages achieves 90% accuracy at 71% coverage without task-specific labels across seven models and three multi-step reasoning benchmarks.

Conclusion: SELFDOUBT establishes a scalable, production-ready foundation for uncertainty estimation over proprietary reasoning models, operating efficiently on single reasoning traces without requiring model internals or multiple samples.

Abstract: Uncertainty estimation for reasoning language models remains difficult to deploy in practice: sampling-based methods are computationally expensive, while common single-pass proxies such as verbalized confidence or trace length are often inconsistent across models. This problem is compounded for proprietary reasoning APIs that expose neither logits nor intermediate token probabilities, leaving practitioners with no reliable uncertainty signal at inference time. We propose SELFDOUBT, a single-pass uncertainty framework that resolves this impasse by extracting behavioral signals directly from the reasoning trace itself. Our key signal, the Hedge-to-Verify Ratio (HVR), detects whether a reasoning trace contains uncertainty markers and, if so, whether they are offset by explicit self-checking behavior. Unlike methods that require multiple sampled traces or model internals, SELFDOUBT operates on a single observed reasoning trajectory, making it suitable for latency- and cost-constrained deployment over any proprietary API. We evaluate SELFDOUBT across seven models and three multi-step reasoning benchmarks (BBH, GPQA-Diamond, and MMLU-Pro). Most notably, traces containing no hedging markers are correct 96% of the time, revealing an emergent high-precision confidence gate at zero additional cost. For the remaining cases, the full SELFDOUBT score significantly outperforms sampling-based semantic entropy at 10x lower inference cost. A deployment cascade combining both stages attains 90% accuracy at 71% coverage without any task-specific labels. These results establish SELFDOUBT as a scalable, production-ready foundation for uncertainty estimation over proprietary reasoning models.
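
A minimal version of the HVR signal can be written as marker counting over the trace. The hedge and verify lexicons below are invented for illustration; the paper does not publish its marker lists, and the exact offsetting formula is an assumption.

```python
HEDGES = ("i think", "maybe", "not sure", "possibly", "it seems")
VERIFIES = ("let me check", "double-check", "verify", "re-examine")

def hedge_to_verify_ratio(trace: str) -> float:
    """Illustrative Hedge-to-Verify Ratio over a single reasoning trace.

    Returns 0.0 when no hedging markers appear (the high-precision
    'confident' gate); otherwise, hedging is discounted by how much
    explicit self-checking behavior offsets it.
    """
    t = trace.lower()
    hedges = sum(t.count(m) for m in HEDGES)
    verifies = sum(t.count(m) for m in VERIFIES)
    if hedges == 0:
        return 0.0
    return hedges / (1 + verifies)
```

The appeal for proprietary APIs is that this needs only the visible trace text: no logits, no resampling, a single pass.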

[348] Qualixar OS: A Universal Operating System for AI Agent Orchestration

Varun Pratap Bhardwaj

Main category: cs.AI

TL;DR: Qualixar OS is an application-layer operating system for universal AI agent orchestration, providing a complete runtime for heterogeneous multi-agent systems with advanced features for team design, model routing, consensus-based judgment, and content attribution.

DetailsMotivation: Existing approaches for AI agent orchestration are either kernel-level (AIOS) or limited to single frameworks (AutoGen, CrewAI), lacking a comprehensive solution for heterogeneous multi-agent systems that can handle diverse LLM providers, frameworks, and communication protocols.

Method: Developed an application-layer OS with: execution semantics for 12 multi-agent topologies; Forge team design engine with historical strategy memory; three-layer model routing combining Q-learning, five strategies, and Bayesian POMDP; consensus-based judge pipeline with Goodhart detection; four-layer content attribution; universal compatibility via Claw Bridge; and a production dashboard with visual workflow builder.

Result: Validated with 2,821 test cases across 217 event types and 8 quality modules. Achieved 100% accuracy on a custom 20-task evaluation suite with mean cost of $0.000039 per task.

Conclusion: Qualixar OS provides a comprehensive, production-ready solution for AI agent orchestration that outperforms existing approaches in flexibility, compatibility, and cost-effectiveness while maintaining high accuracy.

Abstract: We present Qualixar OS, the first application-layer operating system for universal AI agent orchestration. Unlike kernel-level approaches (AIOS) or single-framework tools (AutoGen, CrewAI), Qualixar OS provides a complete runtime for heterogeneous multi-agent systems spanning 10 LLM providers, 8+ agent frameworks, and 7 transports. We contribute: (1) execution semantics for 12 multi-agent topologies including grid, forest, mesh, and maker patterns; (2) Forge, an LLM-driven team design engine with historical strategy memory; (3) three-layer model routing combining Q-learning, five strategies, and Bayesian POMDP with dynamic multi-provider discovery; (4) a consensus-based judge pipeline with Goodhart detection, JSD drift monitoring, and alignment trilemma navigation; (5) four-layer content attribution with HMAC signing and steganographic watermarks; (6) universal compatibility via the Claw Bridge supporting MCP and A2A protocols with a 25-command Universal Command Protocol; (7) a 24-tab production dashboard with visual workflow builder and skill marketplace. Qualixar OS is validated by 2,821 test cases across 217 event types and 8 quality modules. On a custom 20-task evaluation suite, the system achieves 100% accuracy at a mean cost of $0.000039 per task. Source-available under the Elastic License 2.0.

[349] CODE-GEN: A Human-in-the-Loop RAG-Based Agentic AI System for Multiple-Choice Question Generation

Xiaojing Duan, Frederick Nwanganga, Chaoli Wang

Main category: cs.AI

TL;DR: CODE-GEN is an AI system for generating context-aligned multiple-choice coding questions using a human-in-the-loop RAG-based agentic architecture with Generator and Validator agents.

DetailsMotivation: To develop an AI system that can generate high-quality multiple-choice coding comprehension questions aligned with course-specific learning objectives, reducing the burden on educators while maintaining pedagogical quality.

Method: Uses a human-in-the-loop, retrieval-augmented generation (RAG)-based agentic AI system with two agents: Generator agent produces questions, Validator agent assesses quality across seven pedagogical dimensions. Both agents have specialized tools for computational accuracy and code verification.

Result: Evaluation with 6 subject-matter experts on 288 AI-generated questions showed strong performance (79.9%-98.6% success rates across dimensions). System excels at computational verification tasks but requires human expertise for deeper pedagogical judgment.

Conclusion: CODE-GEN effectively generates coding comprehension questions, with AI handling computational verification well but human expertise remaining essential for complex pedagogical dimensions like distractor design and feedback quality.

Abstract: We present CODE-GEN, a human-in-the-loop, retrieval-augmented generation (RAG)-based agentic AI system for generating context-aligned multiple-choice questions to develop student code reasoning and comprehension abilities. CODE-GEN employs an agentic AI architecture in which a Generator agent produces multiple-choice coding comprehension questions aligned with course-specific learning objectives, while a Validator agent independently assesses content quality across seven pedagogical dimensions. Both agents are augmented with specialized tools that enhance computational accuracy and verify code outputs. To evaluate the effectiveness of CODE-GEN, we conducted an evaluation study involving six human subject-matter experts (SMEs) who judged 288 AI-generated questions. The SMEs produced a total of 2,016 human-AI rating pairs, indicating agreement or disagreement with the assessments of the Validator, along with 131 instances of qualitative feedback. Analyses of SME judgments show strong system performance, with human-validated success rates ranging from 79.9% to 98.6% across the seven pedagogical dimensions. The analysis of qualitative feedback reveals that CODE-GEN achieves high reliability on dimensions well suited to computational verification and explicit criteria matching, including question clarity, code validity, concept alignment, and correct answer validity. In contrast, human expertise remains essential for dimensions requiring deeper instructional judgment, such as designing pedagogically meaningful distractors and providing high-quality feedback that reinforces understanding. These findings inform the strategic allocation of human and AI effort in AI-assisted educational content generation.

[350] ProofSketcher: Hybrid LLM + Lightweight Proof Checker for Reliable Math/Logic Reasoning

Kranthi Kommuru, Kunal Khanvilkar, Gaurav Parekh

Main category: cs.AI

TL;DR: Hybrid pipeline combining LLMs with theorem provers: LLM generates typed proof sketches in DSL, lightweight kernel expands them into explicit proof obligations for verification

DetailsMotivation: LLMs can produce persuasive mathematical/logical arguments but often contain subtle errors that are hard to detect from text alone, while interactive theorem provers offer rigorous reliability but require complete formalization at heavy cost

Method: Hybrid pipeline where LLM generates typed proof sketches in a compact domain-specific language (DSL), then a lightweight trusted kernel expands these sketches into explicit proof obligations for verification

Result: The approach aims to combine the generative capabilities of LLMs with the rigorous verification of theorem provers, reducing the burden of complete formalization while maintaining reliability

Conclusion: The hybrid pipeline offers a promising middle ground between LLM-generated informal reasoning and fully formalized theorem proving, potentially making formal verification more accessible

Abstract: Large language models (LLMs) can produce persuasive arguments in mathematical and logical domains, yet such arguments often contain subtle missteps: omitted side conditions, invalid inference patterns, or appeals to lemmas that do not actually follow from the context at hand. These errors are notoriously hard to spot from the text alone, since even a flawed construction can look mostly correct. Interactive theorem provers such as Lean and Coq, by contrast, offer rigorous reliability: a statement is accepted only if it passes every syntactic and semantic check performed by a small trusted kernel. This guarantee comes at a heavy price, however: the proof must be completely formalized, and the user or an auxiliary search procedure must supply an avalanche of low-level detail. This paper presents a hybrid pipeline in which an LLM generates a typed proof sketch in a compact DSL and a lightweight trusted kernel expands the sketch into explicit proof obligations.
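
The kernel's job can be illustrated with a toy DSL: each sketch step names a claim, a justification, and the premises it uses, and the kernel rejects any appeal to a claim that has not yet been established. The step format and function below are assumptions, not the paper's actual DSL.

```python
def expand_sketch(steps, axioms):
    """Expand a proof sketch into explicit obligations (toy trusted kernel).

    Each step is {"claim": str, "by": str, "uses": [str]}. A step may only
    cite axioms or previously established claims; otherwise the sketch is
    rejected, catching exactly the 'appeal to an underivable lemma' failure.
    """
    proved = set(axioms)
    obligations = []
    for step in steps:
        for premise in step["uses"]:
            if premise not in proved:
                raise ValueError(
                    f"step {step['claim']!r} appeals to unproved {premise!r}")
        obligations.append((step["claim"], step["by"], tuple(step["uses"])))
        proved.add(step["claim"])
    return obligations
```

Each returned obligation (claim, justification, premises) is what a downstream checker would then have to discharge.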

[351] BDI-Kit Demo: A Toolkit for Programmable and Conversational Data Harmonization

Roque Lopez, Yurong Liu, Christos Koutras, Juliana Freire

Main category: cs.AI

TL;DR: BDI-Kit is an extensible toolkit for data harmonization that provides both Python API for programmatic pipeline construction and AI-assisted chat interface for natural language data harmonization.

DetailsMotivation: Data harmonization is a major bottleneck for integrative analysis due to heterogeneity in schemas, value representations, and domain-specific conventions, requiring tools that can bridge different user needs and technical capabilities.

Method: BDI-Kit offers two complementary interfaces: 1) Python API for developers to construct harmonization pipelines programmatically, and 2) AI-assisted chat interface for domain experts to harmonize data through natural language dialogue. The system combines automated matching, AI-assisted reasoning, and user-driven refinement.

Result: The demonstration showcases two scenarios: using Python API to programmatically compose primitives, examine intermediate outputs, and reuse transformations; and conversing with AI assistant in natural language to access capabilities and iteratively refine outputs based on suggestions.

Conclusion: BDI-Kit provides a flexible, extensible solution for data harmonization that caters to both technical developers and domain experts through complementary interfaces, enabling iterative exploration, validation, and refinement of schema and value matches.

Abstract: Data harmonization remains a major bottleneck for integrative analysis due to heterogeneity in schemas, value representations, and domain-specific conventions. BDI-Kit provides an extensible toolkit for schema and value matching. It exposes two complementary interfaces tailored to different user needs: a Python API enabling developers to construct harmonization pipelines programmatically, and an AI-assisted chat interface allowing domain experts to harmonize data through natural language dialogue. This demonstration showcases how users interact with BDI-Kit to iteratively explore, validate, and refine schema and value matches through a combination of automated matching, AI-assisted reasoning, and user-driven refinement. We present two scenarios: (i) using the Python API to programmatically compose primitives, examine intermediate outputs, and reuse transformations; and (ii) conversing with the AI assistant in natural language to access BDI-Kit’s capabilities and iteratively refine outputs based on the assistant’s suggestions.
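
A harmonization primitive of the kind the Python API exposes might look like simple string-similarity schema matching. The function below is a hypothetical stand-in, not BDI-Kit's actual API; the real matchers may combine embeddings and other signals.

```python
from difflib import SequenceMatcher

def match_columns(source_cols, target_cols, threshold=0.6):
    """Greedy schema matching by string similarity (illustrative sketch).

    Maps each source column to its most similar target column, keeping
    only matches above the threshold so weak pairs are left for the user
    to resolve interactively.
    """
    matches = {}
    for src in source_cols:
        best, best_score = None, threshold
        for tgt in target_cols:
            score = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
            if score > best_score:
                best, best_score = tgt, score
        if best is not None:
            matches[src] = best
    return matches
```

Returning a plain dict keeps the primitive composable: intermediate outputs can be inspected, corrected, and reused downstream, which matches the demo's programmatic scenario.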

[352] On Emotion-Sensitive Decision Making of Small Language Model Agents

Jiaju Lin, Xingjian Du, Qingyun Wu, Ellen Wenting Zou, Jindong Wang

Main category: cs.AI

TL;DR: Study examines how emotional states affect decision-making in small language models using game-theoretic evaluation and emotion induction via activation steering, finding systematic but unstable effects on strategic choices.

DetailsMotivation: Most decision-oriented evaluations of small language models ignore emotion as a causal factor influencing behavior, despite emotion being a critical component of human decision-making. The researchers aim to study emotion-sensitive decision making in SLMs to understand how emotional states affect strategic choices.

Method: Combines representation-level emotion induction with structured game-theoretic evaluation. Uses activation steering derived from crowd-validated, real-world emotion-eliciting texts for controlled interventions. Introduces benchmark with canonical decision templates spanning cooperative/competitive incentives under complete/incomplete information, instantiated using strategic scenarios from Diplomacy, StarCraft II, and real-world personas. Tests across multiple model families in various architectures and modalities.

Result: Emotional perturbations systematically affect strategic choices in language models, but resulting behaviors are often unstable and not fully aligned with human expectations. The effects vary across different model architectures and scenarios.

Conclusion: Emotion significantly influences decision-making in language models, but current models exhibit unstable emotional responses that don’t match human behavior. The paper outlines an approach to improve robustness to emotion-driven perturbations, suggesting need for better emotional modeling in AI decision-making systems.

Abstract: Small language models (SLM) are increasingly used as interactive decision-making agents, yet most decision-oriented evaluations ignore emotion as a causal factor influencing behavior. We study emotion-sensitive decision making by combining representation-level emotion induction with a structured game-theoretic evaluation. Emotional states are induced using activation steering derived from crowd-validated, real-world emotion-eliciting texts, enabling controlled and transferable interventions beyond prompt-based methods. We introduce a benchmark built around canonical decision templates that span cooperative and competitive incentives under both complete and incomplete information. These templates are instantiated using strategic scenarios from Diplomacy, StarCraft II, and diverse real-world personas. Experiments across multiple model families, spanning various architectures and modalities, show that emotional perturbations systematically affect strategic choices, but the resulting behaviors are often unstable and not fully aligned with human expectations. Finally, we outline an approach to improve robustness to emotion-driven perturbations.
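
Activation steering of this kind is commonly implemented by adding a difference-of-means direction to a layer's hidden states. The numpy sketch below shows the mechanics under assumed shapes; the paper's layer choice, scaling, and derivation details are not reproduced here.

```python
import numpy as np

def emotion_direction(emo_acts, neutral_acts):
    """Difference-of-means steering vector from paired activation sets.

    emo_acts, neutral_acts: (num_texts, d_model) layer activations collected
    on emotion-eliciting vs. neutral texts (hypothetical collection step).
    """
    return emo_acts.mean(axis=0) - neutral_acts.mean(axis=0)

def steer(hidden, steering_vector, alpha=1.0):
    """Add the unit-normalized emotion direction, scaled by alpha, to every
    position of a layer's hidden states (shape (seq_len, d_model))."""
    v = steering_vector / np.linalg.norm(steering_vector)
    return hidden + alpha * v
```

In a real model this addition would run inside a forward hook at the chosen layer; alpha controls the induction strength and alpha=0 recovers the unsteered model.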

[353] Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu

Main category: cs.AI

TL;DR: Reasoning SFT with long CoT supervision can generalize cross-domain under specific conditions: extended training, high-quality data, and strong base models, but with asymmetric effects on reasoning vs. safety.

DetailsMotivation: To challenge the prevailing narrative that SFT merely memorizes while RL generalizes, specifically examining reasoning SFT with long chain-of-thought supervision and its cross-domain generalization capabilities.

Method: Analyzes cross-domain generalization of reasoning SFT through systematic investigation of optimization dynamics, training data quality/structure, and base-model capability effects, identifying patterns like dip-and-recovery during extended training.

Result: Cross-domain generalization is conditional: requires extended training (dip-and-recovery pattern), high-quality verified CoT traces, and strong base models that internalize procedural patterns. However, generalization is asymmetric - reasoning improves while safety degrades.

Conclusion: Reasoning SFT can generalize beyond memorization under specific conditions, reframing the question from whether it generalizes to understanding the conditions and trade-offs involved, particularly the asymmetric impact on reasoning vs. safety.

Abstract: A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.

[354] KD-MARL: Resource-Aware Knowledge Distillation in Multi-Agent Reinforcement Learning

Monirul Islam Pavel, Siyi Hu, Muhammad Anwar Masum, Mahardhika Pratama, Ryszard Kowalczyk, Zehong Jimmy Cao

Main category: cs.AI

TL;DR: A two-stage knowledge distillation framework for multi-agent reinforcement learning that transfers coordinated behavior from centralized experts to lightweight decentralized student agents for resource-constrained deployment.

DetailsMotivation: Real-world deployment of multi-agent reinforcement learning systems is constrained by limited compute, memory, and inference time. While expert policies achieve high performance, they rely on costly decision cycles and large models impractical for edge devices. Existing knowledge distillation methods in MARL focus narrowly on action imitation, often neglecting coordination structure and assuming uniform agent capabilities.

Method: Proposes KD-MARL, a two-stage framework that transfers coordinated behavior from a centralized expert to lightweight decentralized student agents. Student policies are trained without a critic, using distilled advantage signals and structured policy supervision to preserve coordination under heterogeneous and limited observations. Supports heterogeneous student architectures where each agent’s model capacity matches its observation complexity.

Result: Extensive experiments on SMAC and MPE benchmarks show KD-MARL achieves high performance retention while substantially reducing computational cost. Retains over 90% of expert performance while reducing computational cost by up to 28.6× in FLOPs. Achieves expert-level coordination and preserves it through structured distillation.

Conclusion: The proposed approach enables practical MARL deployment across resource-constrained onboard platforms by transferring both action-level behavior and structural coordination patterns from expert policies while supporting heterogeneous student architectures matched to observation complexity.

Abstract: Real-world deployment of multi-agent reinforcement learning (MARL) systems is fundamentally constrained by limited compute, memory, and inference time. While expert policies achieve high performance, they rely on costly decision cycles and large-scale models that are impractical for edge devices or embedded platforms. Knowledge distillation (KD) offers a promising path toward resource-aware execution, but existing KD methods in MARL focus narrowly on action imitation, often neglecting coordination structure and assuming uniform agent capabilities. We propose resource-aware Knowledge Distillation for Multi-Agent Reinforcement Learning (KD-MARL), a two-stage framework that transfers coordinated behavior from a centralized expert to lightweight decentralized student agents. The student policies are trained without a critic, relying instead on distilled advantage signals and structured policy supervision to preserve coordination under heterogeneous and limited observations. Our approach transfers both action-level behavior and structural coordination patterns from expert policies while supporting heterogeneous student architectures, allowing each agent's model capacity to match its observation complexity, which is crucial for efficient execution under partial observability and limited onboard resources. Extensive experiments on SMAC and MPE benchmarks demonstrate that KD-MARL achieves high performance retention while substantially reducing computational cost: across standard multi-agent benchmarks, KD-MARL retains over 90% of expert performance while reducing computational cost by up to 28.6× in FLOPs. The proposed approach achieves expert-level coordination and preserves it through structured distillation, enabling practical MARL deployment across resource-constrained onboard platforms.
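
The critic-free student objective can be caricatured as a distillation loss over action distributions. The KL term and advantage-weighted term below are illustrative assumptions, not the paper's exact losses.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, advantages, beta=0.5):
    """Toy distillation objective for a critic-free student.

    Combines KL(teacher || student) over actions with an advantage-weighted
    term using distilled (teacher-provided) advantages, so the student needs
    no value network of its own. Shapes: (batch, num_actions) logits,
    (batch,) advantages.
    """
    p_t = softmax(teacher_logits)
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    kl = (p_t * (np.log(p_t + 1e-12) - log_p_s)).sum(axis=-1).mean()
    pg = -(advantages * (p_t * log_p_s).sum(axis=-1)).mean()
    return kl + beta * pg
```

Because the advantages come pre-computed from the expert side, each student can be a small, heterogeneous network sized to its own observation space.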

[355] Reasoning Fails Where Step Flow Breaks

Xiaoyu Xu, Yulan Pan, Xiaosong Yuan, Zhihong Shen, Minghao Su, Yuanhao Su, Xiaofeng Zhang

Main category: cs.AI

TL;DR: Step-Saliency analysis tool reveals information flow failures in large reasoning models, and StepFlow intervention improves performance without retraining

DetailsMotivation: Large reasoning models perform well on multi-step tasks but have unstable behavior and poor interpretability, with existing tools struggling to analyze long reasoning traces

Method: Step-Saliency pools attention-gradient scores into step-to-step maps along question-thinking-summary trajectory; StepFlow intervention adjusts shallow saliency patterns via Odds-Equal Bridge and adds step-level residual in deep layers via Step Momentum Injection

Result: Step-Saliency reveals two recurring failures: Shallow Lock-in and Deep Decay; StepFlow improves accuracy on math, science, and coding tasks across multiple LRMs without retraining

Conclusion: Repairing information flow can recover missing reasoning performance in large reasoning models, and Step-Saliency provides effective analysis of long reasoning traces

Abstract: Large reasoning models (LRMs) that generate long chains of thought now perform well on multi-step math, science, and coding tasks. However, their behavior is still unstable and hard to interpret, and existing analysis tools struggle with such long, structured reasoning traces. We introduce Step-Saliency, which pools attention–gradient scores into step-to-step maps along the question–thinking–summary trajectory. Across several models, Step-Saliency reveals two recurring information-flow failures: Shallow Lock-in, where shallow layers over-focus on the current step and barely use earlier context, and Deep Decay, where deep layers gradually lose saliency on the thinking segment and the summary increasingly attends to itself and the last few steps. Motivated by these patterns, we propose StepFlow, a saliency-inspired test-time intervention that adjusts shallow saliency patterns measured by Step-Saliency via Odds-Equal Bridge and adds a small step-level residual in deep layers via Step Momentum Injection. StepFlow improves accuracy on math, science, and coding tasks across multiple LRMs without retraining, indicating that repairing information flow can recover part of their missing reasoning performance.
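
The pooling behind Step-Saliency can be sketched as averaging a token-level saliency matrix over step spans. The spans and mean pooling below are assumptions; the paper's attention-gradient scoring itself is not reproduced here.

```python
import numpy as np

def step_saliency(token_saliency, step_spans):
    """Pool a token-level saliency matrix into a step-to-step map.

    token_saliency: (T, T) matrix where entry [i, j] is the saliency of
    token j for predicting token i (e.g. attention x gradient, assumed
    precomputed). step_spans: list of (start, end) token ranges, one per
    reasoning step along the question-thinking-summary trajectory.
    """
    n = len(step_spans)
    out = np.zeros((n, n))
    for a, (i0, i1) in enumerate(step_spans):
        for b, (j0, j1) in enumerate(step_spans):
            out[a, b] = token_saliency[i0:i1, j0:j1].mean()
    return out
```

Patterns like Shallow Lock-in or Deep Decay would then show up as structure in this much smaller step-level map, which is what makes long traces analyzable.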

[356] AgentGate: A Lightweight Structured Routing Engine for the Internet of Agents

Yujun Cheng, Enfang Cui, Hao Qin, Zhiyuan Liang, Qi Xu

Main category: cs.AI

TL;DR: AgentGate: A lightweight structured routing engine for efficient agent dispatch in Internet of Agents systems, using constrained decision-making rather than text generation.

DetailsMotivation: As AI agent systems evolve into an Internet of Agents, efficient request dispatch remains challenging under latency, privacy, and cost constraints. Current approaches lack structured routing mechanisms for constrained environments.

Method: AgentGate formulates routing as a constrained decision problem with two stages: action decision (single/multi-agent, direct response, escalation) and structural grounding (executable outputs). Uses routing-oriented fine-tuning with candidate-aware supervision and hard negatives.

Result: Experiments with 3B-7B models show compact models achieve competitive routing performance in constrained settings. Performance differences appear in action prediction, candidate selection, and structured grounding quality.

Conclusion: Structured routing is feasible for efficient, privacy-aware agent systems, especially under resource-constrained deployment where routing decisions must be made locally.

Abstract: The rapid development of AI agent systems is leading to an emerging Internet of Agents, where specialized agents operate across local devices, edge nodes, private services, and cloud platforms. Although recent efforts have improved agent naming, discovery, and interaction, efficient request dispatch remains an open systems problem under latency, privacy, and cost constraints. In this paper, we present AgentGate, a lightweight structured routing engine for candidate-aware agent dispatch. Instead of treating routing as unrestricted text generation, AgentGate formulates it as a constrained decision problem and decomposes it into two stages: action decision and structural grounding. The first stage determines whether a query should trigger single-agent invocation, multi-agent planning, direct response, or safe escalation, while the second stage instantiates the selected action into executable outputs such as target agents, structured arguments, or multi-step plans. To adapt compact models to this setting, we further develop a routing-oriented fine-tuning scheme with candidate-aware supervision and hard negative examples. Experiments on a curated routing benchmark with several 3B–7B open-weight models show that compact models can provide competitive routing performance in constrained settings, and that model differences are mainly reflected in action prediction, candidate selection, and structured grounding quality. These results indicate that structured routing is a feasible design point for efficient and privacy-aware agent systems, especially when routing decisions must be made under resource-constrained deployment conditions.
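
The two-stage decomposition can be sketched with keyword rules standing in for the fine-tuned router: stage one picks an action, stage two grounds it into an executable, candidate-aware output. All field names and rules below are hypothetical.

```python
def route(query, candidates):
    """Two-stage routing sketch: action decision, then structural grounding.

    candidates: list of {"name": str, "skill": str} agent descriptors.
    Returns a constrained, structured decision rather than free-form text.
    """
    q = query.lower()
    if not candidates:
        return {"action": "escalate"}          # safe escalation: no agents
    # Stage 1: action decision.
    if " and then " in q or " after that " in q:
        action = "multi_agent_plan"
    elif any(c["skill"] in q for c in candidates):
        action = "invoke"
    else:
        return {"action": "direct_response"}   # answer locally, no dispatch
    # Stage 2: ground the action into executable, candidate-aware targets.
    matched = [c["name"] for c in candidates if c["skill"] in q]
    return {"action": action, "targets": matched}
```

Constraining the output to a small action vocabulary plus grounded targets is what lets compact 3B–7B models compete here: the decision space is narrow and checkable.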

[357] ATANT: An Evaluation Framework for AI Continuity

Samuel Sameer Tanguturi

Main category: cs.AI

TL;DR: ATANT is an evaluation framework for measuring continuity in AI systems - the ability to maintain, update, and reconstruct meaningful context over time - with a narrative test corpus and model-agnostic evaluation methodology.

DetailsMotivation: Current AI systems lack formal evaluation frameworks for measuring continuity despite having memory components like RAG pipelines and vector databases. There's no standardized way to assess whether these components produce genuine continuity across time and context.

Method: Defines continuity with 7 required properties, introduces a 10-checkpoint evaluation methodology without LLM in the evaluation loop, and creates a narrative test corpus of 250 stories with 1,835 verification questions across 6 life domains.
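A key design point above is scoring checkpoints without an LLM in the evaluation loop. A minimal sketch of what that could look like, assuming verification questions are graded by normalized string match against gold answer sets (the paper does not specify its exact matching rule):

```python
def evaluate_checkpoint(answers: dict[str, str],
                        gold: dict[str, set[str]]) -> float:
    """Score one checkpoint deterministically, with no LLM judge:
    each verification question is graded by normalized exact match
    against an accepted set of gold answers."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    correct = sum(
        1 for q, a in answers.items()
        if q in gold and norm(a) in {norm(g) for g in gold[q]}
    )
    return correct / len(gold) if gold else 0.0
```

Deterministic grading like this is what makes the 250-story cumulative score reproducible across runs.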

Result: Reference implementation improved from 58% (legacy architecture) to 100% in isolated mode (250 stories) and 100% in 50-story cumulative mode, with 96% at 250-story cumulative scale, demonstrating effective continuity management.

Conclusion: ATANT provides a system-agnostic, model-independent framework for building and validating continuity systems, addressing a critical gap in AI evaluation methodology for persistent context management.

Abstract: We present ATANT (Automated Test for Acceptance of Narrative Truth), an open evaluation framework for measuring continuity in AI systems: the ability to persist, update, disambiguate, and reconstruct meaningful context across time. While the AI industry has produced memory components (RAG pipelines, vector databases, long context windows, profile layers), no published framework formally defines or measures whether these components produce genuine continuity. We define continuity as a system property with 7 required properties, introduce a 10-checkpoint evaluation methodology that operates without an LLM in the evaluation loop, and present a narrative test corpus of 250 stories comprising 1,835 verification questions across 6 life domains. We evaluate a reference implementation across 5 test suite iterations, progressing from 58% (legacy architecture) to 100% in isolated mode (250 stories) and 100% in 50-story cumulative mode, with 96% at 250-story cumulative scale. The cumulative result is the primary measure: when 250 distinct life narratives coexist in the same database, the system must retrieve the correct fact for the correct context without cross-contamination. ATANT is system-agnostic, model-independent, and designed as a sequenced methodology for building and validating continuity systems. The framework specification, example stories, and evaluation protocol are available at https://github.com/Kenotic-Labs/ATANT. The full 250-story corpus will be released incrementally.

[358] Steering the Verifiability of Multimodal AI Hallucinations

Jianhong Pang, Ruoxi Cheng, Ziyi Ye, Xingjun Ma, Zuxuan Wu, Xuanjing Huang, Yu-Gang Jiang

Main category: cs.AI

TL;DR: Proposes a method to control the verifiability of hallucinations in multimodal LLMs by distinguishing between obvious and elusive hallucinations and using activation-space interventions

DetailsMotivation: Multimodal LLMs suffer from hallucinations with varying degrees of verifiability - some are obvious to humans while others are elusive and hard to detect. Current research doesn't address how to control this verifiability property for different application needs.

Method: 1) Construct dataset from 4,470 human responses to AI-generated hallucinations, categorizing them as obvious vs elusive based on human verifiability. 2) Propose activation-space intervention method that learns separate probes for obvious and elusive hallucinations. 3) Use targeted interventions to control model’s hallucination verifiability.
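The probe-and-intervene idea can be sketched on synthetic activations. This is a simplified stand-in: the probe below is a class-mean difference direction rather than whatever probe architecture the authors actually train, and the toy data places the hallucination signal along one dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_probe(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Linear probe as a unit-norm class-mean difference direction
    (a lightweight stand-in for a learned probe)."""
    d = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def intervene(act: np.ndarray, probe: np.ndarray,
              strength: float) -> np.ndarray:
    """Steer an activation along the probe direction; a negative
    strength suppresses the corresponding hallucination type."""
    return act + strength * probe

# Toy activations: hallucination examples (label 1) shifted along dim 0.
acts = rng.normal(size=(200, 8))
labels = np.array([1] * 100 + [0] * 100)
acts[labels == 1, 0] += 3.0

probe = fit_probe(acts, labels)
steered = intervene(acts[0], probe, strength=-3.0)
```

In the paper's setting, one such probe would be fit per hallucination type (obvious vs. elusive), and mixing the two interventions trades off which kind of verifiability is regulated.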

Result: Obvious and elusive hallucinations elicit different intervention probes, enabling fine-grained control over verifiability. Targeted interventions yield superior performance in regulating corresponding verifiability. Mixing interventions allows flexible control for different scenarios.

Conclusion: The proposed method successfully controls hallucination verifiability in multimodal LLMs, addressing an important safety concern by allowing applications to be tailored to different security and usability requirements.

Abstract: AI applications driven by multimodal large language models (MLLMs) are prone to hallucinations and pose considerable risks to human users. Crucially, such hallucinations are not equally problematic: some hallucination contents could be detected by human users (i.e., obvious hallucinations), while others are often missed or require more verification effort (i.e., elusive hallucinations). This indicates that multimodal AI hallucinations vary significantly in their verifiability. Yet, little research has explored how to control this property for AI applications with diverse security and usability demands. To address this gap, we construct a dataset from 4,470 human responses to AI-generated hallucinations and categorize these hallucinations into obvious and elusive types based on their verifiability by human users. Further, we propose an activation-space intervention method that learns separate probes for obvious and elusive hallucinations. We reveal that obvious and elusive hallucinations elicit different intervention probes, allowing for fine-grained control over the model’s verifiability. Empirical results demonstrate the efficacy of this approach and show that targeted interventions yield superior performance in regulating corresponding verifiability. Moreover, simply mixing these interventions enables flexible control over the verifiability required for different scenarios.

[359] TurboAgent: An LLM-Driven Autonomous Multi-Agent Framework for Turbomachinery Aerodynamic Design

Juan Du, Yueteng Wu, Pan Zhao, Yuze Liu, Min Zhang, Xiaobin Xu, Xinglong Zhang

Main category: cs.AI

TL;DR: TurboAgent: An LLM-driven multi-agent framework for autonomous turbomachinery aerodynamic design and optimization, transforming traditional trial-and-error into data-driven collaborative workflow.

DetailsMotivation: Existing intelligent design approaches for turbomachinery are limited to individual stages or loosely coupled pipelines, making fully autonomous end-to-end design challenging. The paper aims to address this gap by creating an autonomous framework that can handle the complete design process.

Method: Proposes TurboAgent, a large language model (LLM)-driven autonomous multi-agent framework where the LLM serves as the core for task planning and coordination, while specialized agents handle generative design, rapid performance prediction, multi-objective optimization, and physics-based validation. The framework transforms traditional design into a data-driven collaborative workflow with high-fidelity simulations for final verification.

Result: Validated on a transonic single-rotor compressor with strong agreement between target performance, generated designs, and CFD simulations. Coefficients of determination (R2) for mass flow rate, total pressure ratio, and isentropic efficiency all exceed 0.91, with normalized RMSE values below 8%. Optimization agent improves isentropic efficiency by 1.61% and total pressure ratio by 3.02%. Complete workflow executes within approximately 30 minutes under parallel computing.

Conclusion: TurboAgent enables an autonomous closed-loop design process from natural language requirements to final design generation, providing an efficient and scalable paradigm for turbomachinery aerodynamic design.

Abstract: The aerodynamic design of turbomachinery is a complex and tightly coupled multi-stage process involving geometry generation, performance prediction, optimization, and high-fidelity physical validation. Existing intelligent design approaches typically focus on individual stages or rely on loosely coupled pipelines, making fully autonomous end-to-end design challenging. To address this issue, this study proposes TurboAgent, a large language model (LLM)-driven autonomous multi-agent framework for turbomachinery aerodynamic design and optimization. The LLM serves as the core for task planning and coordination, while specialized agents handle generative design, rapid performance prediction, multi-objective optimization, and physics-based validation. The framework transforms traditional trial-and-error design into a data-driven collaborative workflow, with high-fidelity simulations retained for final verification. A transonic single-rotor compressor is used for validation. The results show strong agreement between target performance, generated designs, and CFD simulations. The coefficients of determination (R2) for mass flow rate, total pressure ratio, and isentropic efficiency all exceed 0.91, with normalized RMSE values below 8%. The optimization agent further improves isentropic efficiency by 1.61% and total pressure ratio by 3.02%. The complete workflow can be executed within approximately 30 minutes under parallel computing. These results demonstrate that TurboAgent enables an autonomous closed-loop design process from natural language requirements to final design generation, providing an efficient and scalable paradigm for turbomachinery aerodynamic design.

[360] FVD: Inference-Time Alignment of Diffusion Models via Fleming-Viot Resampling

Shivanshu Shekhar, Sagnik Mukherjee, Jia Yi Zhang, Tong Zhang

Main category: cs.AI

TL;DR: FVD is an inference-time alignment method for diffusion models that prevents diversity collapse in Sequential Monte Carlo samplers using Fleming-Viot population dynamics with specialized birth-death mechanisms.

DetailsMotivation: Existing SMC-based diffusion samplers suffer from diversity collapse and lineage collapse under strong selection pressure due to reliance on multinomial resampling schemes, limiting their effectiveness for alignment tasks.

Method: Replaces multinomial resampling with Fleming-Viot population dynamics featuring specialized birth-death mechanisms, integrates independent reward-based survival decisions with stochastic rebirth noise, and avoids value function approximation or costly rollouts.
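The contrast with multinomial resampling can be illustrated with a toy birth-death step: each particle survives independently based on its reward, and killed particles are reborn near a survivor with added noise. This is only a sketch of the mechanism's shape, with assumed survival probabilities and noise model, not the paper's exact dynamics.

```python
import numpy as np

rng = np.random.default_rng(1)

def fv_resample(particles: np.ndarray, rewards, temperature: float = 1.0,
                noise: float = 0.1) -> np.ndarray:
    """One Fleming-Viot style birth-death step (illustrative):
    independent reward-based survival decisions, with killed particles
    reborn near a surviving one plus stochastic rebirth noise, so
    deterministic trajectories do not collapse to a single lineage."""
    r = np.asarray(rewards, dtype=float)
    p_survive = np.exp((r - r.max()) / temperature)  # in (0, 1]
    survive = rng.random(len(particles)) < p_survive
    if not survive.any():                  # keep at least one lineage alive
        survive[np.argmax(r)] = True
    parents = np.flatnonzero(survive)
    out = particles.copy()
    for i in np.flatnonzero(~survive):
        parent = rng.choice(parents)
        out[i] = particles[parent] + noise * rng.standard_normal(particles.shape[1])
    return out
```

Unlike multinomial resampling, no particle is forced to duplicate a high-weight ancestor exactly, which is what preserves trajectory support under strong selection pressure.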

Result: Achieves 7% improvement in ImageReward on DrawBench, 14-20% FID improvement on class-conditional tasks, and up to 66x speedup over value-based approaches while maintaining diversity.

Conclusion: FVD provides an efficient, parallelizable solution to diversity collapse in diffusion alignment that outperforms existing methods across multiple metrics without requiring expensive value function approximations.

Abstract: We introduce Fleming-Viot Diffusion (FVD), an inference-time alignment method that resolves the diversity collapse commonly observed in Sequential Monte Carlo (SMC) based diffusion samplers. Existing SMC-based diffusion samplers often rely on multinomial resampling or closely related resampling schemes, which can still reduce diversity and lead to lineage collapse under strong selection pressure. Inspired by Fleming-Viot population dynamics, FVD replaces multinomial resampling with a specialized birth-death mechanism designed for diffusion alignment. To handle cases where rewards are only approximately available and naive rebirth would collapse deterministic trajectories, FVD integrates independent reward-based survival decisions with stochastic rebirth noise. This yields flexible population dynamics that preserve broader trajectory support while effectively exploring reward-tilted distributions, all without requiring value function approximation or costly rollouts. FVD is fully parallelizable and scales efficiently with inference compute. Empirically, it achieves substantial gains across settings: on DrawBench it outperforms prior methods by 7% in ImageReward, while on class-conditional tasks it improves FID by roughly 14-20% over strong baselines and is up to 66 times faster than value-based approaches.

[361] Riemann-Bench: A Benchmark for Moonshot Mathematics

Suhaas Garre, Erik Knutsen, Sushant Mehta, Edwin Chen

Main category: cs.AI

TL;DR: A private benchmark of 25 expert-curated research-level math problems reveals AI systems score below 10%, showing a large gap between olympiad-level problem solving and genuine research mathematics.

DetailsMotivation: Current AI systems excel at competition-style math problems (like IMO) but these represent only a narrow slice of mathematical reasoning. The authors aim to evaluate AI on genuine research-level mathematics that requires deep theoretical knowledge beyond olympiad tricks.

Method: Created Riemann-Bench, a private benchmark of 25 expert-curated problems authored by Ivy League professors, graduate students, and PhD-holding IMO medalists. Problems underwent double-blind verification by two independent domain experts. Evaluated frontier models as unconstrained research agents with coding tools, search, and open-ended reasoning, using statistical estimators over 100 independent runs per problem.
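The paper reports an unbiased estimator over 100 runs per problem but does not spell it out; a natural candidate is the standard unbiased pass@k estimator, sketched here under that assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    draws (without replacement) from n runs with c successes succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_score(correct_per_problem: dict[str, int],
                    n: int = 100, k: int = 1) -> float:
    """Average pass@k over problems, given #correct runs per problem."""
    return sum(pass_at_k(n, c, k)
               for c in correct_per_problem.values()) / len(correct_per_problem)
```

With k=1 this reduces to the mean success rate, which is consistent with headline scores below 10% across frontier models.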

Result: All frontier models currently score below 10% on the benchmark, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning.

Conclusion: There’s a significant gap between current AI capabilities in competition mathematics and genuine research-level mathematical reasoning. Keeping the benchmark private ensures performance reflects authentic capability rather than memorization.

Abstract: Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of 25 expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.

[362] Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

Zonghuan Xu, Xiang Zheng, Yutao Wu, Xingjun Ma

Main category: cs.AI

TL;DR: LLM judges used to evaluate AI-generated disinformation show systematic gaps in alignment with human reader responses, forming a coherent but non-valid evaluative group.

DetailsMotivation: As LLMs are increasingly used as low-cost substitutes for human evaluation of AI-generated content (particularly for disinformation risk assessment), there's a need to understand whether LLM judges faithfully track actual human reader responses.

Method: The study frames evaluation as a proxy-validity problem and audits LLM judges against human reader responses using 290 aligned articles, 2,043 paired human ratings, and outputs from eight frontier LLM judges. Examines alignment in overall scoring, item-level ordering, and signal dependence.
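Item-level ordering alignment of the kind audited here is typically measured with a rank correlation. A minimal sketch (assuming untied scores; the study's exact statistic is not specified in this summary):

```python
import numpy as np

def spearman(x, y) -> float:
    """Spearman rank correlation between, e.g., per-article judge scores
    and mean human ratings (no special handling of ties)."""
    def ranks(v: np.ndarray) -> np.ndarray:
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(len(v))
        return r
    rx, ry = ranks(np.asarray(x, float)), ranks(np.asarray(y, float))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

A high judge-judge correlation combined with a low judge-human correlation is exactly the "coherent but non-valid evaluative group" pattern the paper reports.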

Result: Persistent judge-human gaps: judges are typically harsher than humans, recover item-level human rankings only weakly, and rely on different textual signals (more weight on logical rigor, stronger penalty for emotional intensity). Judges agree far more with each other than with human readers.

Conclusion: LLM judges form a coherent evaluative group that is much more aligned internally than with human readers, indicating that internal agreement among LLMs is not evidence of validity as a proxy for human reader response.

Abstract: Large language models (LLMs) can generate persuasive narratives at scale, raising concerns about their potential use in disinformation campaigns. Assessing this risk ultimately requires understanding how readers receive such content. In practice, however, LLM judges are increasingly used as a low-cost substitute for direct human evaluation, even though whether they faithfully track reader responses remains unclear. We recast evaluation in this setting as a proxy-validity problem and audit LLM judges against human reader responses. Using 290 aligned articles, 2,043 paired human ratings, and outputs from eight frontier judges, we examine judge-human alignment in terms of overall scoring, item-level ordering, and signal dependence. We find persistent judge-human gaps throughout. Relative to humans, judges are typically harsher, recover item-level human rankings only weakly, and rely on different textual signals, placing more weight on logical rigour while penalizing emotional intensity more strongly. At the same time, judges agree far more with one another than with human readers. These results suggest that LLM judges form a coherent evaluative group that is much more aligned internally than it is with human readers, indicating that internal agreement is not evidence of validity as a proxy for reader response.

[363] Explaining Neural Networks in Preference Learning: a Post-hoc Inductive Logic Programming Approach

Daniele Fossemò, Filippo Mignosi, Giuseppe Placidi, Luca Raggioli, Matteo Spezialetti, Fabio Aurelio D’Asaro

Main category: cs.AI

TL;DR: Using ILASP (Inductive Learning of Answer Set Programs) to approximate neural network preference learning models through answer set programming with weak constraints.

DetailsMotivation: To create interpretable approximations of black-box neural network models for user preference learning, addressing the need for transparency in AI systems while maintaining fidelity to the original models.

Method: Use ILASP to approximate neural networks trained on user recipe preferences, employing both global and local approximation approaches. Introduce PCA preprocessing for dimensionality reduction while maintaining explanation transparency.
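The PCA preprocessing step can be sketched directly via the SVD; the reduced dataset (plus the component matrix, which keeps the projection itself inspectable) would then be handed to the ILP learner. This is a generic PCA sketch, not the paper's specific pipeline code.

```python
import numpy as np

def pca_reduce(X: np.ndarray, k: int):
    """Project features onto the top-k principal components to shrink
    the feature space before inductive learning, while the component
    matrix keeps the transformation transparent."""
    Xc = X - X.mean(axis=0)                       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                           # (k, d) principal directions
    return Xc @ components.T, components
```

Shrinking d features to k components directly reduces the hypothesis space ILASP must search, which is the computational motivation given in the paper.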

Result: Experiments investigate ILASP’s ability to approximate NNs in high-dimensional spaces while maintaining appropriate fidelity and managing computational time, with PCA helping handle dimensionality challenges.

Conclusion: ILASP shows promise as an interpretable approximation method for neural network preference models, with dimensionality reduction techniques like PCA helping address computational challenges while preserving transparency.

Abstract: In this paper, we propose using Learning from Answer Sets to approximate black-box models, such as Neural Networks (NN), in the specific case of learning user preferences. We specifically explore the use of ILASP (Inductive Learning of Answer Set Programs) to approximate preference learning systems through weak constraints. We have created a dataset on user preferences over a set of recipes, which is used to train the NNs that we aim to approximate with ILASP. Our experiments investigate ILASP both as a global and a local approximator of the NNs. These experiments address the challenge of approximating NNs working on increasingly high-dimensional feature spaces while achieving appropriate fidelity on the target model and limiting the increase in computational time. To handle this challenge, we propose a preprocessing step that exploits Principal Component Analysis to reduce the dataset’s dimensionality while keeping our explanations transparent. Under consideration for publication in Theory and Practice of Logic Programming (TPLP).

[364] What’s Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

Songze Li, Xiaoke Guo, Tianqi Liu, Biao Yi, Zhaoyan Gong, Zhiqiang Liu, Huajun Chen, Wen Zhang

Main category: cs.AI

TL;DR: UILoop introduces a cyclic Screen-UI elements-Action paradigm for GUI reasoning that enables MLLMs to explicitly learn UI element localization, semantics, and usage, achieving state-of-the-art performance on UI understanding and reasoning tasks.

DetailsMotivation: Current GUI reasoning methods lack interpretability and fail to comprehensively understand UI elements, leading to task failures. There's a need for better UI understanding paradigms that enable precise element discovery and interpretable reasoning.

Method: Proposes UI-in-the-Loop (UILoop), a cyclic Screen-UI elements-Action process where MLLMs explicitly learn UI element localization, semantic functions, and practical usage. Introduces UI Comprehension task with three evaluation metrics and UI Comprehension-Bench benchmark with 26K samples.
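The cyclic Screen-UI elements-Action process can be sketched as a control loop. The element fields and callback signatures below are illustrative assumptions; in UILoop the perceive/decide roles are played by the MLLM itself.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class UIElement:
    name: str
    bbox: tuple[int, int, int, int]   # localization
    function: str                     # semantic role, e.g. "confirm form"

def ui_loop(screen, perceive: Callable, decide: Callable, act: Callable,
            max_steps: int = 10):
    """Cyclic Screen -> UI elements -> Action process: each step first
    enumerates UI elements explicitly (localization + semantics), then
    grounds the action in one of them, instead of mapping the raw
    screen straight to an action."""
    trace = []
    for _ in range(max_steps):
        elements = perceive(screen)            # explicit element discovery
        action, target = decide(elements)      # interpretable, element-level
        trace.append((action, target.name if target else None))
        screen, done = act(screen, action, target)
        if done:
            break
    return trace
```

The returned trace of (action, element) pairs is what makes the reasoning inspectable, in contrast to direct screen-to-action decoding.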

Result: UILoop achieves state-of-the-art UI understanding performance and superior results in GUI reasoning tasks compared to existing methods.

Conclusion: The UILoop paradigm effectively enhances GUI reasoning by enabling MLLMs to develop comprehensive understanding of UI elements through explicit learning of localization, semantics, and usage, leading to more interpretable and successful task execution.

Abstract: Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance the understanding and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods’ mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.

[365] EmoMAS: Emotion-Aware Multi-Agent System for High-Stakes Edge-Deployable Negotiation with Bayesian Orchestration

Yunbo Long, Yunhan Liu, Liming Xu

Main category: cs.AI

TL;DR: EmoMAS is a Bayesian multi-agent framework that enables strategic emotional decision-making for negotiation AI, making it suitable for privacy-sensitive edge deployment by coordinating game-theoretic, reinforcement learning, and psychological coherence models.

DetailsMotivation: LLMs are computationally expensive and pose privacy risks for on-device negotiation applications, while SLMs struggle with complex emotional dynamics in high-stakes negotiations. There's a need for effective, private, and adaptive negotiation AI for edge deployment.

Method: EmoMAS uses a Bayesian orchestrator to coordinate three specialized agents: game-theoretic, reinforcement learning, and psychological coherence models. It fuses real-time insights to optimize emotional state transitions and continuously updates agent reliability based on negotiation feedback, enabling online strategy learning without pre-training.
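One simple way to realize "continuously updating agent reliability" is a Beta posterior per agent with posterior-mean weighting, sketched below. This is an assumed instantiation for illustration; the paper does not commit to this exact update rule.

```python
class BayesianOrchestrator:
    """Keep a Beta(alpha, beta) reliability posterior per agent, fuse
    agent recommendations by posterior-mean weight, and update from
    negotiation feedback online (no pre-training required)."""

    def __init__(self, agents):
        self.posterior = {a: [1.0, 1.0] for a in agents}  # Beta(1,1) prior

    def weight(self, agent: str) -> float:
        a, b = self.posterior[agent]
        return a / (a + b)                # posterior mean reliability

    def fuse(self, recommendations: dict[str, float]) -> float:
        """Reliability-weighted average of agent scores."""
        total = sum(self.weight(a) for a in recommendations)
        return sum(self.weight(a) * s
                   for a, s in recommendations.items()) / total

    def update(self, agent: str, success: bool) -> None:
        self.posterior[agent][0 if success else 1] += 1.0
```

In EmoMAS the three agents would be the game-theoretic, reinforcement-learning, and psychological-coherence models, with fusion steering emotional state transitions.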

Result: Both SLMs and LLMs equipped with EmoMAS consistently surpass all baseline models across four high-stakes negotiation benchmarks (debt, healthcare, emergency response, education) in negotiation performance while balancing ethical behavior.

Conclusion: Strategic emotional intelligence is key to negotiation success. By treating emotional expression as a strategic variable within a Bayesian multi-agent optimization framework, EmoMAS establishes a new paradigm for effective, private, and adaptive negotiation AI suitable for high-stakes edge deployment.

Abstract: Large language models (LLMs) have been widely used for automated negotiation, but their high computational cost and privacy risks limit deployment in privacy-sensitive, on-device settings such as mobile assistants or rescue robots. Small language models (SLMs) offer a viable alternative, yet struggle with the complex emotional dynamics of high-stakes negotiation. We introduce EmoMAS, a Bayesian multi-agent framework that transforms emotional decision-making from reactive to strategic. EmoMAS leverages a Bayesian orchestrator to coordinate three specialized agents: game-theoretic, reinforcement learning, and psychological coherence models. The system fuses their real-time insights to optimize emotional state transitions while continuously updating agent reliability based on negotiation feedback. This mixture-of-agents architecture enables online strategy learning without pre-training. We further introduce four high-stakes, edge-deployable negotiation benchmarks across debt, healthcare, emergency response, and educational domains. Through extensive agent-to-agent simulations across all benchmarks, both SLMs and LLMs equipped with EmoMAS consistently surpass all baseline models in negotiation performance while balancing ethical behavior. These results show that strategic emotional intelligence is also the key driver of negotiation success. By treating emotional expression as a strategic variable within a Bayesian multi-agent optimization framework, EmoMAS establishes a new paradigm for effective, private, and adaptive negotiation AI suitable for high-stakes edge deployment.

[366] CAFP: A Post-Processing Framework for Group Fairness via Counterfactual Model Averaging

Irina Arévalo, Marcos Oliva

Main category: cs.AI

TL;DR: CAFP is a model-agnostic post-processing method that mitigates unfair influence from protected attributes by averaging predictions across factual and counterfactual instances where sensitive attributes are flipped.

DetailsMotivation: Existing fairness interventions often require full control over model architecture and access to protected attributes, which may not be feasible in real-world systems. There's a need for model-agnostic methods that can ensure fairness without retraining or modifying original classifiers.

Method: Counterfactual Averaging for Fair Predictions (CAFP) generates counterfactual versions of each input by flipping the sensitive attribute, then averages the model’s predictions across factual and counterfactual instances. This is a post-processing method that doesn’t require retraining.
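For a binary sensitive attribute, the averaging step is short enough to sketch in full. The model is treated as a black-box scoring function, in keeping with CAFP's no-retraining premise; the encoding of the attribute as a 0/1 column is an assumption for this sketch.

```python
import numpy as np

def cafp_predict(model, X: np.ndarray, sensitive_col: int) -> np.ndarray:
    """Counterfactual Averaging for Fair Predictions: average the
    black-box model's scores over the factual input and the
    counterfactual with the binary sensitive attribute flipped.
    Pure post-processing: the model is never modified or retrained."""
    X_cf = X.copy()
    X_cf[:, sensitive_col] = 1 - X_cf[:, sensitive_col]
    return 0.5 * (model(X) + model(X_cf))
```

If the model depends on the sensitive column only directly, the flip-and-average cancels that dependence exactly, which is the intuition behind the demographic-parity result.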

Result: Theoretical analysis shows CAFP eliminates direct dependence on protected attributes, reduces mutual information between predictions and sensitive attributes, bounds distortion relative to original model, achieves perfect demographic parity under mild assumptions, and reduces equalized odds gap by at least half the average counterfactual bias.

Conclusion: CAFP provides an effective model-agnostic post-processing approach for fairness that doesn’t require retraining or architectural modifications, making it practical for real-world deployment where full control over models may not be possible.

Abstract: Ensuring fairness in machine learning predictions is a critical challenge, especially when models are deployed in sensitive domains such as credit scoring, healthcare, and criminal justice. While many fairness interventions rely on data preprocessing or algorithmic constraints during training, these approaches often require full control over the model architecture and access to protected attribute information, which may not be feasible in real-world systems. In this paper, we propose Counterfactual Averaging for Fair Predictions (CAFP), a model-agnostic post-processing method that mitigates unfair influence from protected attributes without retraining or modifying the original classifier. CAFP operates by generating counterfactual versions of each input in which the sensitive attribute is flipped, and then averaging the model’s predictions across factual and counterfactual instances. We provide a theoretical analysis of CAFP, showing that it eliminates direct dependence on the protected attribute, reduces mutual information between predictions and sensitive attributes, and provably bounds the distortion introduced relative to the original model. Under mild assumptions, we further show that CAFP achieves perfect demographic parity and reduces the equalized odds gap by at least half the average counterfactual bias.

[367] A-MBER: Affective Memory Benchmark for Emotion Recognition

Deliang Wen, Ke Sun, Yu Wang

Main category: cs.AI

TL;DR: A-MBER is a benchmark for evaluating AI assistants’ ability to interpret users’ current emotional states using remembered multi-session interaction history, focusing on affective memory rather than factual recall.

DetailsMotivation: Current emotion datasets only assess instantaneous affect, while memory benchmarks focus on factual recall, leaving a gap in evaluating how models use interaction history to interpret present emotional states.

Method: A-MBER uses a staged pipeline with explicit intermediate representations: long-horizon planning, conversation generation, annotation, question construction, and final packaging. It supports judgment, retrieval, and explanation tasks with robustness settings like modality degradation.

Result: The benchmark is discriminative on subsets designed to stress long-range implicit affect, high-dependency memory levels, trajectory-based reasoning, and adversarial settings, showing memory supports affective interpretation through selective, grounded, context-sensitive use of past interactions.

Conclusion: A-MBER addresses a critical gap in evaluating affective memory capabilities in AI assistants, demonstrating that effective emotional interpretation requires more than just access to history - it requires selective, context-aware use of remembered interactions.

Abstract: AI assistants that interact with users over time need to interpret the user’s current emotional state in order to respond appropriately and personally. However, this capability remains insufficiently evaluated. Existing emotion datasets mainly assess local or instantaneous affect, while long-term memory benchmarks focus largely on factual recall, temporal consistency, or knowledge updating. As a result, current resources provide limited support for testing whether a model can use remembered interaction history to interpret a user’s present affective state. We introduce A-MBER, an Affective Memory Benchmark for Emotion Recognition, to evaluate this capability. A-MBER focuses on present affective interpretation grounded in remembered multi-session interaction history. Given an interaction trajectory and a designated anchor turn, a model must infer the user’s current affective state, identify historically relevant evidence, and justify its interpretation in a grounded way. The benchmark is constructed through a staged pipeline with explicit intermediate representations, including long-horizon planning, conversation generation, annotation, question construction, and final packaging. It supports judgment, retrieval, and explanation tasks, together with robustness settings such as modality degradation and insufficient-evidence conditions. Experiments compare local-context, long-context, retrieved-memory, structured-memory, and gold-evidence conditions within a unified framework. Results show that A-MBER is especially discriminative on the subsets it is designed to stress, including long-range implicit affect, high-dependency memory levels, trajectory-based reasoning, and adversarial settings. These findings suggest that memory supports affective interpretation not simply by providing more history, but by enabling more selective, grounded, and context-sensitive use of past interactions.

[368] Planning Task Shielding: Detecting and Repairing Flaws in Planning Tasks through Turning them Unsolvable

Alberto Pozanco, Marianela Morales, Pietro Totis, Daniel Borrajo

Main category: cs.AI

TL;DR: Planning task shielding: detecting and repairing flaws in planning tasks by minimally modifying actions to make tasks unsolvable when they contain undesirable states.

DetailsMotivation: Traditional planning focuses on achieving goals, but goals can also specify undesirable states that should never be reached. The paper addresses the need to modify planning tasks to prevent reaching flawed states, shifting from generating plans to making tasks unsolvable when they contain such flaws.

Method: Introduces planning task shielding problem and proposes $allmin$, an optimal algorithm that minimally modifies original actions to render the planning task unsolvable. The algorithm detects flaws and repairs them by making minimal changes to action specifications.
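To make the shielding objective concrete, here is a toy brute-force sketch (ours, not the paper's $allmin$; $allmin$ minimally modifies actions, whereas this sketch simplifies to deleting them, and all facts and action names are invented): find the fewest action deletions that make every state containing the undesirable fact unreachable.

```python
from itertools import combinations

# Hypothetical toy planning task: states are frozensets of facts; an action is
# (name, preconditions, add_effects, del_effects). Not the paper's allmin.
ACTIONS = [
    ("open_valve", {"at_tank"}, {"valve_open"}, set()),
    ("heat",       {"valve_open"}, {"hot"}, set()),
    ("overheat",   {"hot"}, {"flawed"}, set()),  # leads to the undesirable state
]

def reachable(init, actions, bad_fact):
    """Depth-first search over states; True if some reachable state contains bad_fact."""
    frontier, seen = [frozenset(init)], set()
    while frontier:
        s = frontier.pop()
        if bad_fact in s:
            return True
        if s in seen:
            continue
        seen.add(s)
        for _, pre, add, dele in actions:
            if pre <= s:
                frontier.append(frozenset((s - dele) | add))
    return False

def shield(init, actions, bad_fact):
    """Smallest set of action deletions making bad_fact unreachable."""
    for k in range(len(actions) + 1):
        for removed in combinations(actions, k):
            kept = [a for a in actions if a not in removed]
            if not reachable(init, kept, bad_fact):
                return [a[0] for a in removed]
    return None

print(shield({"at_tank"}, ACTIONS, "flawed"))  # prints ['open_valve']
```

Minimality here counts only the number of deletions; the first single deletion found already blocks the flaw, which illustrates why an optimal algorithm like $allmin$ must also weigh how much functionality a modification destroys.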

Result: Empirical evaluation shows $allmin$ can effectively shield planning tasks of increasing size by turning them unsolvable when they contain flawed states that should be prevented.

Conclusion: Planning task shielding provides a novel approach to ensuring system safety by preventing undesirable states through minimal modifications to planning task specifications, with $allmin$ offering an optimal solution for this problem.

Abstract: Most research in planning focuses on generating a plan to achieve a desired set of goals. However, a goal specification can also be used to encode a property that should never hold, allowing a planner to identify a trace that would reach a flawed state. In such cases, the objective may shift to modifying the planning task to ensure that the flawed state is never reached; in other words, to make the planning task unsolvable. In this paper we introduce planning task shielding: the problem of detecting and repairing flaws in planning tasks. We propose $allmin$, an optimal algorithm that solves these tasks by minimally modifying the original actions to render the planning task unsolvable. We empirically evaluate the performance of $allmin$ in shielding planning tasks of increasing size, showing how it can effectively shield the system by turning the planning task unsolvable.

[369] EVGeoQA: Benchmarking LLMs on Dynamic, Multi-Objective Geo-Spatial Exploration

Jianfei Wu, Zhichun Wang, Zhensheng Wang, Zhiyu He

Main category: cs.AI

TL;DR: EVGeoQA is a novel benchmark for evaluating LLMs’ dynamic geo-spatial reasoning in EV charging scenarios with location-anchored dual objectives, accompanied by GeoRover evaluation framework revealing LLMs’ limitations in long-range spatial exploration.

DetailsMotivation: Current Geo-Spatial Question Answering benchmarks focus on static retrieval and fail to capture real-world dynamic planning with moving users and compound constraints. There's a need to assess LLMs' capabilities in purpose-driven exploration in dynamic environments.

Method: Introduces EVGeoQA benchmark built on EV charging scenarios with location-anchored, dual-objective queries (charging necessity + co-located activity preference). Proposes GeoRover, a tool-augmented agent framework to evaluate LLMs’ dynamic, multi-objective exploration capabilities.

Result: Experiments show LLMs successfully use tools for sub-tasks but struggle with long-range spatial exploration. An emergent capability was observed: LLMs can summarize historical exploration trajectories to improve efficiency. EVGeoQA proves to be a challenging testbed.

Conclusion: EVGeoQA addresses limitations of static GSQA benchmarks by introducing dynamic, location-anchored scenarios. The work establishes a challenging evaluation framework for geo-spatial intelligence and reveals both limitations and emergent capabilities of LLMs in spatial reasoning.

Abstract: While Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, their potential for purpose-driven exploration in dynamic geo-spatial environments remains under-investigated. Existing Geo-Spatial Question Answering (GSQA) benchmarks predominantly focus on static retrieval, failing to capture the complexity of real-world planning that involves dynamic user locations and compound constraints. To bridge this gap, we introduce EVGeoQA, a novel benchmark built upon Electric Vehicle (EV) charging scenarios that features a distinct location-anchored and dual-objective design. Specifically, each query in EVGeoQA is explicitly bound to a user’s real-time coordinate and integrates the dual objectives of a charging necessity and a co-located activity preference. To systematically assess models in such complex settings, we further propose GeoRover, a general evaluation framework based on a tool-augmented agent architecture to evaluate the LLMs’ capacity for dynamic, multi-objective exploration. Our experiments reveal that while LLMs successfully utilize tools to address sub-tasks, they struggle with long-range spatial exploration. Notably, we observe an emergent capability: LLMs can summarize historical exploration trajectories to enhance exploration efficiency. These findings establish EVGeoQA as a challenging testbed for future geo-spatial intelligence. The dataset and prompts are available at https://github.com/Hapluckyy/EVGeoQA/.

[370] Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Yu Li, Sizhe Tang, Tian Lan

Main category: cs.AI

TL;DR: T-STAR is a reinforcement learning framework for LLM agents that addresses sparse rewards in multi-step reasoning by constructing a Cognitive Tree from trajectories, enabling step-level credit assignment and surgical policy optimization at critical reasoning divergence points.

DetailsMotivation: Current RL approaches for LLM agents struggle with sparse rewards in multi-step reasoning tasks, treating trajectories as independent chains and failing to identify critical steps that disproportionately impact reasoning outcomes. Existing methods assign uniform credit to all steps, ignoring the latent correlated reward structure across trajectories.

Method: T-STAR consolidates trajectories into a unified Cognitive Tree by identifying and merging functionally similar steps/nodes. It uses Introspective Valuation to back-propagate trajectory-level rewards through the tree for step-level relative advantage estimation. In-Context Thought Grafting synthesizes corrective reasoning by contrasting successful and failed branches at critical divergence points. Surgical Policy Optimization applies a Bradley-Terry type surgical loss focused on critical steps.
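The step-level credit idea can be illustrated with a rough sketch (our illustration, not the paper's implementation; "functionally similar steps" are approximated by exact string match, and trajectories and rewards are invented): merge shared step prefixes into tree nodes, back-propagate each trajectory's terminal reward so a node's value is the mean reward of trajectories passing through it, and read a step's relative advantage as its node value minus its parent's.

```python
from collections import defaultdict

# Hypothetical trajectories: lists of step strings plus a terminal reward.
trajs = [
    (["look", "pick key", "open door"], 1.0),  # success
    (["look", "pick key", "eat key"],   0.0),  # failure
    (["look", "push door"],             0.0),  # failure
]

# A tree node is a step prefix; accumulate reward totals and visit counts.
totals, counts = defaultdict(float), defaultdict(int)
for steps, r in trajs:
    for i in range(len(steps) + 1):
        prefix = tuple(steps[:i])
        totals[prefix] += r
        counts[prefix] += 1

# Node value = mean reward of trajectories through that prefix.
value = {p: totals[p] / counts[p] for p in totals}

def advantage(prefix):
    """Step-level relative advantage: node value minus parent value."""
    return value[prefix] - value[prefix[:-1]]

print(advantage(("look", "pick key")))   # positive: raises expected reward
print(advantage(("look", "push door")))  # negative: a critical divergence
```

Because the two "pick key" steps are merged into one node, the success and failure that follow it both inform its value, which is the variance-reduction effect the tree structure is after.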

Result: Extensive experiments across embodied, interactive, reasoning, and planning benchmarks show T-STAR achieves consistent improvements over strong baselines, with gains most pronounced on tasks requiring extended reasoning chains.

Conclusion: T-STAR effectively addresses sparse reward challenges in LLM agent RL by recovering latent reward structures, enabling precise step-level credit assignment, and focusing optimization on critical reasoning divergence points, leading to superior performance on complex multi-step reasoning tasks.

Abstract: Reinforcement learning for Large Language Model agents is often hindered by sparse rewards in multi-step reasoning tasks. Existing approaches like Group Relative Policy Optimization treat sampled trajectories as independent chains, assigning uniform credit to all steps in each chain and ignoring the existence of critical steps that may disproportionally impact reasoning outcome. In this paper, we propose T-STAR(Tree-structured Self-Taught Agent Rectification), a framework that recovers the latent correlated reward structure across seemingly independent trajectories. Specifically, we consolidate trajectories into a unified Cognitive Tree by identifying and merging functionally similar steps/nodes. It enables an Introspective Valuation mechanism that back-propagates trajectory-level rewards through the tree to obtain a new notion of variance-reduced relative advantage at step-level. Using the Cognitive Tree, we also develop In-Context Thought Grafting to synthesize corrective reasoning by contrasting successful and failed branches at critical divergence points/steps. Our proposed Surgical Policy Optimization then capitalizes on the rich policy gradient information concentrated at these critical points/steps through a Bradley-Terry type of surgical loss. Extensive experiments across embodied, interactive, reasoning, and planning benchmarks demonstrate that T-STAR achieves consistent improvements over strong baselines, with gains most pronounced on tasks requiring extended reasoning chains.

[371] How Much LLM Does a Self-Revising Agent Actually Need?

Seongwoo Jeong, Seonil Son

Main category: cs.AI

TL;DR: A framework for decomposing LLM-based agents into inspectable components to study which capabilities come from the LLM vs. explicit structure around it, using Collaborative Battleship as a testbed.

DetailsMotivation: To empirically determine which parts of LLM-based agent competence come from the language model itself versus the explicit structure built around it, addressing a fundamental scientific question in agent design.

Method: Introduced a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure. Instantiated this in a declarative runtime and evaluated on Collaborative Battleship using four progressively structured agents across 54 games.
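The sparse, guarded LLM revision can be sketched as a confidence gate (a minimal sketch under our own assumptions; the function names, threshold, and posterior values are invented, and the reviser is a stand-in rather than a real model call): the symbolic runtime proposes an action with a confidence signal, and the expensive LLM reviser is consulted only when confidence falls below a threshold.

```python
# Hypothetical confidence-gated sparse LLM revision.
THRESHOLD = 0.2  # illustrative; tuned so revision fires on few turns

def symbolic_policy(state):
    """Stand-in for posterior-following planning: action plus its confidence."""
    action = max(state["posterior"], key=state["posterior"].get)
    return action, state["posterior"][action]

def llm_revise(state, action):
    """Stand-in for the LLM reviser (a real system would query a model here)."""
    return action  # placeholder: keep the proposed action

def step(state):
    action, conf = symbolic_policy(state)
    if conf < THRESHOLD:  # guarded revision: rare by construction
        action = llm_revise(state, action)
    return action

state = {"posterior": {"fire_B4": 0.6, "fire_C7": 0.4}}
print(step(state))  # confident turn: no LLM call, prints fire_B4
```

With a gate like this, how often the LLM is actually consulted (about 4.3% of turns in the paper's setup) becomes a measurable runtime quantity rather than a latent property of a monolithic loop.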

Result: Explicit world-model planning improved substantially over greedy baseline (+24.1pp win rate, +0.017 F1). Symbolic reflection operated as a real runtime mechanism, though current revision presets weren’t net-positive. Adding conditional LLM revision at ~4.3% of turns yielded only small, non-monotonic changes (+0.005 F1, win rate dropped from 31→29/54).

Conclusion: Externalizing reflection turns latent agent behavior into inspectable runtime structure, allowing direct study of the marginal role of LLM intervention. This is a methodological contribution rather than a performance claim.

Abstract: Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop. This can produce capable behavior, but it makes a basic scientific question difficult to answer: which part of the agent’s competence actually comes from the LLM, and which part comes from explicit structure around it? We study this question not by claiming a general answer, but by making it empirically tractable. We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure. We instantiate this protocol in a declarative runtime and evaluate it on noisy Collaborative Battleship [4] using four progressively structured agents over 54 games (18 boards $\times$ 3 seeds). The resulting decomposition isolates four components: posterior belief tracking, explicit world-model planning, symbolic in-episode reflection, and sparse LLM-based revision. Across this decomposition, explicit world-model planning improves substantially over a greedy posterior-following baseline (+24.1pp win rate, +0.017 F1). Symbolic reflection operates as a real runtime mechanism – with prediction tracking, confidence gating, and guarded revision actions – even though its current revision presets are not yet net-positive in aggregate. Adding conditional LLM revision at about 4.3% of turns yields only a small and non-monotonic change: average F1 rises slightly (+0.005) while win rate drops (31$\rightarrow$29 out of 54). These results suggest a methodological contribution rather than a leaderboard claim: externalizing reflection turns otherwise latent agent behavior into inspectable runtime structure, allowing the marginal role of LLM intervention to be studied directly.

[372] A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

Pascal J. Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F. Grewe, Thilo Stadelmann

Main category: cs.AI

TL;DR: Survey paper analyzing the state-of-the-art in Agents for Computer Use (ACUs) - systems that execute digital tasks via natural language instructions, covering taxonomy, research gaps, and future directions.

DetailsMotivation: ACUs are emerging systems that can automate computer tasks using natural language, but they're not yet mature for everyday use. The paper aims to provide a comprehensive review of the field, identify research gaps, and establish foundations for advancing practical ACU development.

Method: The authors conduct a systematic survey of 87 ACUs and 33 datasets, introducing a unifying taxonomy across three dimensions: domain perspective (operating contexts), interaction perspective (observation/action modalities), and agent perspective (perception/reasoning/learning). They analyze both foundation model-based and classical approaches.

Result: Identified six major research gaps: insufficient generalization, inefficient learning, limited planning, low task complexity in benchmarks, non-standardized evaluation, and disconnect between research and practical conditions. Proposed six corresponding recommendations for advancing the field.

Conclusion: The taxonomy and analysis establish a foundation for advancing ACU research toward general-purpose agents for robust and scalable computer use. The paper advocates for vision-based observations, adaptive learning, better planning methods, realistic benchmarks, standardized evaluation, and real-world deployment alignment.

Abstract: Agents for computer use (ACUs) are an emerging class of systems capable of executing complex tasks on digital devices – such as desktops, mobile phones, and web platforms – given instructions in natural language. These agents can automate tasks by controlling software via low-level actions like mouse clicks and touchscreen gestures. However, despite rapid progress, ACUs are not yet mature for everyday use. In this survey, we investigate the state-of-the-art, trends, and research gaps in the development of practical ACUs. We provide a comprehensive review of the ACU landscape, introducing a unifying taxonomy spanning three dimensions: (I) the domain perspective, characterizing agent operating contexts; (II) the interaction perspective, describing observation modalities (e.g., screenshots, HTML) and action modalities (e.g., mouse, keyboard, code execution); and (III) the agent perspective, detailing how agents perceive, reason, and learn. We review 87 ACUs and 33 datasets across foundation model-based and classical approaches through this taxonomy. Our analysis identifies six major research gaps: insufficient generalization, inefficient learning, limited planning, low task complexity in benchmarks, non-standardized evaluation, and a disconnect between research and practical conditions. To address these gaps, we advocate for: (a) vision-based observations and low-level control to enhance generalization; (b) adaptive learning beyond static prompting; (c) effective planning and reasoning methods and models; (d) benchmarks that reflect real-world task complexity; (e) standardized evaluation based on task success; (f) aligning agent design with real-world deployment constraints. Together, our taxonomy and analysis establish a foundation for advancing ACU research toward general-purpose agents for robust and scalable computer use.

[373] Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, Abdul Sattar

Main category: cs.AI

TL;DR: LLMs’ mathematical reasoning shows cultural sensitivity - accuracy drops up to 5.9% when problems are embedded in unfamiliar cultural contexts, even when mathematical logic remains unchanged.

DetailsMotivation: To investigate whether large language models' mathematical reasoning capabilities are culturally neutral or sensitive to cultural contexts, examining if performance degrades when problems are presented in unfamiliar cultural settings.

Method: Created six culturally adapted variants of GSM8K benchmark for Haiti, Moldova, Pakistan, Solomon Islands, Somalia, and Suriname by systematically replacing cultural entities (names, foods, places) while preserving mathematical operations and values. Tested 14 models from major AI companies across 1,198 questions.
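The adaptation step can be sketched as a dictionary-driven substitution (our sketch, not the authors' pipeline; the mapping entries and example question are invented): swap cultural entities via a mapping while leaving every number, and hence the arithmetic, untouched.

```python
import re

# Hypothetical entity map for a Pakistan-adapted variant; illustrative only.
# Longer keys come first so "bagels" matches before "bagel" in the alternation.
PAKISTAN_MAP = {
    "John": "Ahmed",
    "bagels": "samosas",
    "bagel": "samosa",
    "dollars": "rupees",
}

def adapt(question, mapping):
    """Replace cultural entities in one pass; numbers are never touched."""
    pattern = re.compile("|".join(re.escape(k) for k in mapping))
    return pattern.sub(lambda m: mapping[m.group(0)], question)

q = "John buys 3 bagels for 12 dollars. How much is one bagel?"
print(adapt(q, PAKISTAN_MAP))
# prints: Ahmed buys 3 samosas for 12 rupees. How much is one samosa?
```

Since only surface entities change, any accuracy drop on the adapted variant isolates cultural sensitivity from mathematical difficulty.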

Result: Found statistically significant accuracy drops ranging from 0.3% to 5.9% when problems were in unfamiliar cultural contexts. Mathematical reasoning errors comprised 54.7% and calculation errors 34.5% of failures. Cultural familiarity can enhance performance (e.g., Mistral Saba outperformed larger models on Pakistan-adapted problems).

Conclusion: Mathematical reasoning in LLMs is not culturally neutral; performance degrades in unfamiliar cultural contexts. Highlights need for more diverse training data to ensure robust performance across global contexts.

Abstract: We demonstrate that large language models’ (LLMs) mathematical reasoning is culturally sensitive: testing 14 models from Anthropic, OpenAI, Google, Meta, DeepSeek, Mistral, and Microsoft across six culturally adapted variants of the GSM8K benchmark, we find accuracy drops ranging from 0.3% (Claude 3.5 Sonnet) to 5.9% (LLaMA 3.1-8B) when math problems are embedded in unfamiliar cultural contexts–even when the underlying mathematical logic remains unchanged. These statistically significant performance reductions (p < 0.01, confirmed through McNemar tests) reveal that mathematical reasoning in LLMs is not culturally neutral. To create these variants for Haiti, Moldova, Pakistan, Solomon Islands, Somalia, and Suriname, we systematically replaced cultural entities (names, foods, places, etc.) in 1,198 GSM8K questions while preserving all mathematical operations and numerical values. Our quantitative error analysis of 18,887 instances reveals that cultural adaptation affects broader reasoning patterns, with mathematical reasoning errors comprising 54.7% and calculation errors 34.5% of failures. Interestingly, cultural familiarity can enhance performance: Mistral Saba outperforms some larger models on Pakistan-adapted problems due to Middle Eastern and South Asian training data exposure. This study underscores the need for more diverse training data to ensure robust LLM performance across global contexts.

[374] Local Markov Equivalence for PC-style Local Causal Discovery and Identification of Controlled Direct Effects

Timothée Loranchet, Charles K. Assaad

Main category: cs.AI

TL;DR: Local PC algorithm (LocPC) learns only the portion of essential graph needed for identifying controlled direct effects, reducing computational burden and assumptions compared to full graph learning.

DetailsMotivation: Identifying controlled direct effects (CDEs) is important in scientific domains, but existing methods require knowledge of the true causal DAG, which is often unknown. Essential graphs provide a practical alternative, but learning the full essential graph is computationally intensive and relies on strong assumptions.

Method: Introduces local essential graph (LEG) defined relative to a target variable, and presents LocPC algorithm that learns LEG using only local conditional independence tests. Then develops LocPC-CDE to extract precisely the portion of LEG necessary and sufficient for identifying a CDE.

Result: Compared to global methods, the algorithms require fewer conditional independence tests and operate under weaker assumptions while maintaining theoretical guarantees. Effectiveness demonstrated on synthetic and real data.

Conclusion: The proposed local approach provides a more practical and efficient method for identifying controlled direct effects when the full causal graph is unknown, reducing computational requirements and relaxing assumptions.

Abstract: Identifying controlled direct effects (CDEs) is crucial across numerous scientific domains. While existing methods can identify these effects from causal directed acyclic graphs (DAGs), the true DAG is often unknown in practice. Essential graphs, which represent a Markov equivalence class of DAGs characterized by the same set of conditional independencies, provide a more practical and realistic alternative, and the PC algorithm is one of the most widely used methods to learn them using conditional independence tests. However, learning the full essential graph is computationally intensive and relies on strong, untestable assumptions. In this work, we adapt the PC algorithm to recover only the portion of the graph needed for identifying CDEs. In particular, we introduce the local essential graph (LEG), a graph structure defined relative to a target variable, and present LocPC, an algorithm that learns the LEG using solely local conditional independence tests. Building on this, we develop LocPC-CDE, which extracts precisely the portion of the LEG that is both necessary and sufficient for identifying a CDE. Compared to global methods, our algorithms require fewer conditional independence tests and operate under weaker assumptions while maintaining theoretical guarantees. We illustrate the effectiveness of our approach on synthetic and real data.

[375] AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

Xuanle Zhao, Zilin Sang, Yuxuan Li, Qi Shi, Weilun Zhao, Shuo Wang, Duzhen Zhang, Xu Han, Zhiyuan Liu, Maosong Sun

Main category: cs.AI

TL;DR: AutoReproduce is a multi-agent framework that autonomously reproduces experimental code from research papers using paper lineage mining and unit testing, with benchmarks showing superior performance.

DetailsMotivation: Reproducing complex research papers is labor-intensive and requires deep domain expertise, hindering scientific progress. There's a need for automated tools to systematically reproduce experimental code from papers.

Method: Introduces paper lineage algorithm to mine implicit knowledge from cited literature, uses multi-agent framework (AutoReproduce) for end-to-end code reproduction, incorporates sampling-based unit testing for validation, and creates ReproduceBench benchmark with verified implementations.

Result: AutoReproduce consistently outperforms existing baselines across all metrics on PaperBench and ReproduceBench, showing substantial improvements in reproduction fidelity and final execution performance.

Conclusion: The paper lineage approach and AutoReproduce framework effectively automate research paper reproduction, accelerating scientific progress by reducing manual effort and expertise requirements.

Abstract: Efficient reproduction of research papers is pivotal to accelerating scientific progress. However, the increasing complexity of proposed methods often renders reproduction a labor-intensive endeavor, necessitating profound domain expertise. To address this, we introduce the paper lineage, which systematically mines implicit knowledge from the cited literature. This algorithm serves as the backbone of our proposed AutoReproduce, a multi-agent framework designed to autonomously reproduce experimental code in a complete, end-to-end manner. To ensure code executability, AutoReproduce incorporates a sampling-based unit testing strategy for rapid validation. To assess reproduction capabilities, we introduce ReproduceBench, a benchmark featuring verified implementations, alongside comprehensive metrics for evaluating both reproduction and execution fidelity. Extensive evaluations on PaperBench and ReproduceBench demonstrate that AutoReproduce consistently surpasses existing baselines across all metrics. Notably, it yields substantial improvements in reproduction fidelity and final execution performance.

[376] Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin, Prayag Tiwari

Main category: cs.AI

TL;DR: Commander-GPT: A modular decision routing framework using specialized LLM agents coordinated by commanders for multimodal sarcasm understanding, achieving significant improvements over SOTA.

DetailsMotivation: LLMs struggle with sarcasm understanding despite strong performance on other NLP tasks. Sarcasm is a high-order cognitive task requiring nuanced multimodal understanding, and current approaches using single LLMs are insufficient.

Method: Proposes Commander-GPT framework inspired by military command theory: orchestrates specialized LLM agents for sub-tasks (keyword extraction, sentiment analysis, etc.) coordinated by commanders. Three commander types: (1) lightweight encoder-based (multimodal BERT), (2) small autoregressive models (DeepSeek-VL), (3) large LLMs (Gemini Pro, GPT-4o) for zero-shot task routing, aggregation, and decision-making.
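The divide-and-route pattern can be sketched in a few lines (a minimal sketch under our own assumptions; the agent functions and the fusion rule are invented stand-ins, where a real system would back each agent and the commander with an LLM): specialists each handle a focused sub-task and the commander fuses their reports into the final judgment.

```python
# Hypothetical specialist agents, each a cheap stand-in for an LLM sub-task.
AGENTS = {
    "keywords":  lambda text, img: [w for w in text.split() if w.isupper()],
    "sentiment": lambda text, img: "positive" if "love" in text else "negative",
    "image":     lambda text, img: img.get("caption", ""),
}

def commander(text, img):
    """Route sub-tasks to agents, then fuse their reports into a verdict."""
    reports = {name: agent(text, img) for name, agent in AGENTS.items()}
    # Toy fusion rule: positive wording over a gloomy image suggests sarcasm.
    sarcastic = reports["sentiment"] == "positive" and "rain" in reports["image"]
    return sarcastic, reports

verdict, reports = commander("I just love Mondays", {"caption": "rain at a bus stop"})
print(verdict)  # prints True
```

The design point the sketch preserves is that the commander sees structured reports, not raw inputs, so the final judgment can be made by anything from a trained BERT head to a zero-shot GPT-4o prompt.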

Result: Evaluated on MMSD and MMSD 2.0 benchmarks with five prompting strategies. Achieves 4.4% and 11.7% improvement in F1 score over SOTA baselines on average.

Conclusion: The modular decision routing framework effectively addresses LLMs’ limitations in sarcasm understanding by leveraging specialized agents and coordinated commanders, demonstrating significant performance gains.

Abstract: Multimodal sarcasm understanding is a high-order cognitive task. Although large language models (LLMs) have shown impressive performance on many downstream NLP tasks, growing evidence suggests that they struggle with sarcasm understanding. In this paper, we propose Commander-GPT, a modular decision routing framework inspired by military command theory. Rather than relying on a single LLM’s capability, Commander-GPT orchestrates a team of specialized LLM agents, where each agent is selectively assigned a focused sub-task such as keyword extraction or sentiment analysis. Their outputs are then routed back to the commander, which integrates the information and performs the final sarcasm judgment. To coordinate these agents, we introduce three types of centralized commanders: (1) a trained lightweight encoder-based commander (e.g., multi-modal BERT); (2) four small autoregressive language models serving as moderately capable commanders (e.g., DeepSeek-VL); (3) two large LLM-based commanders (Gemini Pro and GPT-4o) that perform task routing, output aggregation, and sarcasm decision-making in a zero-shot fashion. We evaluate Commander-GPT on the MMSD and MMSD 2.0 benchmarks, comparing five prompting strategies. Experimental results show that our framework achieves 4.4% and 11.7% improvements in F1 score over state-of-the-art (SoTA) baselines on average, demonstrating its effectiveness.

[377] UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li, Rongrong Ji

Main category: cs.AI

TL;DR: UI-AGILE enhances GUI agents with improved training (continuous reward, simple thinking reward, cropping resampling) and inference (decomposed grounding) methods to address reasoning, reward, and visual noise issues.

DetailsMotivation: Existing GUI agents suffer from dilemmas in reasoning designs, ineffective reward mechanisms, and visual noise problems that limit their performance and accuracy in multimodal GUI understanding tasks.

Method: Proposes UI-AGILE with training enhancements: 1) continuous reward function for high-precision grounding, 2) “Simple Thinking” reward to balance planning with speed/accuracy, 3) cropping-based resampling to mitigate sparse rewards. For inference: decomposed grounding with selection to handle high-resolution displays by breaking images into manageable parts.
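The decomposed-grounding step can be sketched as follows (our sketch, not the released implementation; the tile size, grounder interface, and coordinates are invented): split the high-resolution screen into tiles, run the grounder on each tile, and select the most confident candidate, mapping its local coordinates back to the full screen.

```python
# Hypothetical decomposed grounding with selection.
TILE = 1000  # tile edge in pixels; illustrative

def tiles(width, height):
    """Yield the top-left corner of every tile covering the screen."""
    for x in range(0, width, TILE):
        for y in range(0, height, TILE):
            yield x, y

def decomposed_ground(width, height, ground_fn):
    """ground_fn(x0, y0) -> ((local_x, local_y), confidence) within one tile."""
    best = None
    for x0, y0 in tiles(width, height):
        (lx, ly), conf = ground_fn(x0, y0)
        cand = (conf, (x0 + lx, y0 + ly))
        if best is None or cand[0] > best[0]:
            best = cand
    return best[1]  # global click coordinate of the best candidate

# Toy grounder: the target lives at global (2300, 450); confidence peaks
# in the tile that actually contains it.
def toy_grounder(x0, y0):
    tx, ty = 2300, 450
    inside = x0 <= tx < x0 + TILE and y0 <= ty < y0 + TILE
    return ((tx - x0, ty - y0) if inside else (0, 0), 0.9 if inside else 0.1)

print(decomposed_ground(3840, 2160, toy_grounder))  # prints (2300, 450)
```

Each tile stays near the grounder's native input resolution, which is why the approach helps most on 4K-class displays where downscaling the full screenshot would blur small widgets.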

Result: Achieves state-of-the-art grounding performance on ScreenSpot-Pro and ScreenSpot-v2 benchmarks, with 23% grounding accuracy improvement over best baseline on ScreenSpot-Pro using both training and inference enhancements.

Conclusion: UI-AGILE effectively addresses key challenges in GUI agent training and inference, demonstrating significant improvements in grounding accuracy and general agent capabilities for multimodal GUI understanding.

Abstract: The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from dilemmas in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE for enhancing GUI agents at both training and inference. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a continuous reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a cropping-based resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present decomposed grounding with selection to dramatically improve grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art grounding performance on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks while also exhibiting strong general agent capabilities. For instance, using both our training and inference enhancement methods brings a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. We provide the code at https://github.com/KDEGroup/UI-AGILE.

[378] Planning with Minimal Disruption

Alberto Pozanco, Marianela Morales, Daniel Borrajo, Manuela Veloso

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2508.15358: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.15358&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[379] Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity

Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra

Main category: cs.AI

TL;DR: Paper ID 2509.09794 could not be fetched due to HTTP 429 error (rate limiting), so content analysis is not possible

Abstract: Failed to fetch summary for 2509.09794: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.09794&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[380] Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs

Junxian Li, Xinyue Xu, Sai Ma, Di Zhang, Sichao Li

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2511.08409: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.08409&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[381] Before We Trust Them: Decision-Making Failures in Navigation of Foundation Models

Jua Han, Jaeyoon Seo, Jungbin Min, Sieun Choi, Huichan Seo, Jihie Kim, Jean Oh

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.05529.

[382] Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment

Edward Y. Chang

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.08258.

[383] ConvoLearn: A Dataset for Fine-Tuning Dialogic AI Tutors

Mayank Sharma, Roy Pea, Hari Subramonyam

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.08950.

[384] Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

Shuo Lu, Jianjie Cheng, Yinuo Xu, Yongcan Yu, Lijun Sheng, Peijie Wang, Siru Jiang, Yongguan Hu, Run Ling, Yihua Shao, Ao Ma, Wei Feng, Lingxiao He, Meng Wang, Qianlong Xie, Xingxing Wang, Nicu Sebe, Ran He, Jian Liang

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.11635.

[385] Logics-Parsing-Omni Technical Report

Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Yan Gao, Yuan Gao, Baoyu Hou, Guangzheng Hu, Shuzhao Li, Weixu Qiao, Weidong Ren, Yanan Wang, Boyu Yang, Fan Yang, Jiangtao Zhang, Lixin Zhang, Lin Qu, Hu Wei, Xiaoxiao Xu, Bing Zhao

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.09677.

[386] Resource-constrained Amazons chess decision framework integrating large language models and graph attention

Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Leszek Rutkowski

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.10512.

[387] Working Paper: Towards a Category-theoretic Comparative Framework for Artificial General Intelligence

Pablo de los Riscos, Fernando J. Corbacho, Michael A. Arbib

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.28906.

[388] Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2604.01840.

[389] ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2604.02022.

[390] An Automated Survey of Generative Artificial Intelligence: Large Language Models, Architectures, Protocols, and Applications

Eduardo C. Garrido-Merchán, Álvaro López López

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2306.02781.

[391] Domain-Contextualized Inference: A Computable Graph Architecture for Explicit-Domain Reasoning

Chao Li, Yuru Wang, Chunyi Zhao

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2604.04344.

[392] ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2604.05172.

[393] QA-MoE: Towards a Continuous Reliability Spectrum with Quality-Aware Mixture of Experts for Robust Multimodal Sentiment Analysis

Yitong Zhu, Yuxuan Jiang, Guanxuan Jiang, Bojing Hou, Peng Yuan Zhou, Ge Lin Kan, Yuyang Wang

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2604.05704.

[394] Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection

Abdulla Al-Subaiey, Mohammed Al-Thani, Naser Abdullah Alam, Kaniz Fatema Antora, Amith Khandakar, SM Ashfaq Uz Zaman

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2405.11619.

[395] ConfusionPrompt: Practical Private Inference for Online Large Language Models

Peihua Mai, Youjia Yang, Ran Yan, Rui Ye, Yan Pang

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2401.00870.

[396] Matrix Profile for Anomaly Detection on Multidimensional Time Series

Chin-Chia Michael Yeh, Audrey Der, Uday Singh Saini, Vivian Lai, Yan Zheng, Junpeng Wang, Xin Dai, Zhongfang Zhuang, Yujie Fan, Huiyuan Chen, Prince Osei Aboagye, Liang Wang, Wei Zhang, Eamonn Keogh

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2409.09298.

[397] Analyzing Multimodal Interaction Strategies for LLM-Assisted Manipulation of 3D Scenes

Junlong Chen, Jens Grubert, Per Ola Kristensson

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2410.22177.

[398] Pseudo-Probability Unlearning: Efficient and Privacy-Preserving Machine Unlearning

Zihao Zhao, Yuchen Yang, Anjalie Field, Yinzhi Cao

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2411.02622.

[399] Path Regularization: A Near-Complete and Optimal Nonasymptotic Generalization Theory for Multilayer Neural Networks and Double Descent Phenomenon

Hao Yu

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2503.02129.

[400] From Exploration to Revelation: Detecting Dark Patterns in Mobile Apps

Jieshan Chen, Zhen Wang, Jiamou Sun, Zhenchang Xing, Qinghua Lu, Qing Huang, Xiwei Xu, Liming Zhu

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2411.18084.

[401] A Study of LLMs’ Preferences for Libraries and Programming Languages

Lukas Twist, Jie M. Zhang, Mark Harman, Don Syme, Joost Noppen, Helen Yannakoudakis, Detlef Nauck

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2503.17181.

[402] Towards provable probabilistic safety for scalable embodied AI systems

Linxuan He, Lingxiang Fan, Qing-Shan Jia, Ang Li, Hongyan Sang, Ling Wang, Guanghui Wen, Jiwen Lu, Tao Zhang, Jie Zhou, Yi Zhang, Yisen Wang, Peng Wei, Zhongyuan Wang, Henry X. Liu, Shuo Feng

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2506.05171.

[403] CNN-based Surface Temperature Forecasts with Ensemble Numerical Weather Prediction

Takuya Inoue, Takuya Kawabata

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2507.18937.

[404] Quantitative Estimation of Target Task Performance from Unsupervised Pretext Task in Semi/Self-Supervised Learning

Lin-Han Jia, Si-Yu Han, Wen-Chao Hu, Jie-Jing Shao, Wen-Da Wei, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.07299.

[405] In-Context Decision Making for Optimizing Complex AutoML Pipelines

Amir Rezaei Balef, Katharina Eggensperger

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.13657.

[406] ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.16703.

[407] Physics-Informed Spectral Modeling for Hyperspectral Imaging

Zuzanna Gawrysiak, Krzysztof Krawiec

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.21618.

[408] Once4All: Skeleton-Guided SMT Solver Fuzzing with LLM-Synthesized Generators

Maolin Sun, Yibiao Yang, Yuming Zhou

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.20340.

[409] Invisible to Humans, Triggered by Agents: Stealthy Jailbreak Attacks on Mobile Vision-Language Agents

Renhua Ding, Xiao Yang, Zhengwei Fang, Jun Luo, Kun He, Jun Zhu

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.07809.

[410] Leveraging Wireless Sensor Networks for Real-Time Monitoring and Control of Industrial Environments

Muhammad Junaid Asif, Abdul Rehman, Asim Mehmood, Muhammad Hamza, Rana Fayyaz Ahmad, Shazia Saqib

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.13820.

[411] PULSE: Privileged Knowledge Transfer from Rich to Deployable Sensors for Embodied Multi-Sensory Learning

Zihan Zhao, Kaushik Pendiyala, Masood Mortazavi, Ning Yan

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.24058.

[412] LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.24561.

[413] Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

Yuhua Jiang, Shuang Cheng, Yihao Liu, Ermo Hua, Che Jiang, Weigao Sun, Yu Cheng, Feifei Gao, Biqing Qi, Bowen Zhou

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.26083.

[414] SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

Zhixiong Zhao, Fangxin Liu, Junjie Wang, Chenyang Guan, Zongwu Wang, Li Jiang, Haibing Guan

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.11663.

[415] TREASURE: The Visa Payment Foundation Model for High-Volume Transaction Understanding

Chin-Chia Michael Yeh, Uday Singh Saini, Xin Dai, Xiran Fan, Shubham Jain, Yujie Fan, Jiarui Sun, Junpeng Wang, Menghai Pan, Yingtong Dou, Yuzhong Chen, Vineeth Rakesh, Liang Wang, Yan Zheng, Mahashweta Das

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.19693.

[416] Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning

Chihyeon Song, Jaewoo Lee, Jinkyoo Park

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2512.10510.

[417] LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2512.19576.

[418] LAsset: An LLM-assisted Security Asset Identification Framework for System-on-Chip (SoC) Verification

Md Ajoad Hasan, Dipayan Saha, Khan Thamid Hasan, Nashmin Alam, Azim Uddin, Sujan Kumar Saha, Mark Tehranipoor, Farimah Farahmandi

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.02624.

[419] BadImplant: Injection-based Multi-Targeted Graph Backdoor Attack

Md Nabi Newaz Khan, Abdullah Arafat Miah, Yu Bi

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.15474.

[420] Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple

Evangelos Georganas, Alexander Heinecke, Pradeep Dubey

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.16294.

[421] SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training

Powei Chang, Jinpeng Zhang, Bowen Chen, Chenyu Wang, Chenlu Guo, Yixing Zhang, Yukang Gao, JianXiang Xiang, Yue Gao, Chaoqun Sun, Yiyi Chen, Dongying Kong

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2601.23155; the arXiv export API returned HTTP 429 (rate limited).

[422] Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

J Rosser, Robert Kirk, Edward Grefenstette, Jakob Foerster, Laura Ruis

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2602.09987; the arXiv export API returned HTTP 429 (rate limited).

[423] ODYN: An All-Shifted Non-Interior-Point Method for Quadratic Programming in Robotics and AI

Jose Rojas, Aristotelis Papatheodorou, Sergi Martinez, Andrea Patrizi, Ioannis Havoutis, Carlos Mastalli

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2602.16005; the arXiv export API returned HTTP 429 (rate limited).

[424] Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials

Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2602.22251; the arXiv export API returned HTTP 429 (rate limited).

[425] Modernizing Amdahl’s Law: How AI Scaling Laws Shape Computer Architecture

Chien-Ping Lu

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2603.20654; the arXiv export API returned HTTP 429 (rate limited).

[426] On Integrating Resilience and Human Oversight into LLM-Assisted Modeling Workflows for Digital Twins

Lekshmi P, Neha Karanjkar

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2603.25898; the arXiv export API returned HTTP 429 (rate limited).

[427] MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations

Xianyong Xu, Yuanjun Zuo, Zhihong Huang, Yihan Qin, Haoxian Xu, Leilei Du, Haotian Wang

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2603.28253; the arXiv export API returned HTTP 429 (rate limited).

[428] k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS

Jonas De Schouwer, Haitz Sáez de Ocáriz Borde, Xiaowen Dong

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2604.03815; the arXiv export API returned HTTP 429 (rate limited).

[429] A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs

Bohao Li, Tao Zou, Junchen Ye, Yan Gong, Bowen Du

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2604.04614; the arXiv export API returned HTTP 429 (rate limited).

[430] Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN’s Attention Mechanisms

James Hu, Mahdi Ghelichi

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2604.04868; the arXiv export API returned HTTP 429 (rate limited).

[431] The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown

William Yicheng Zhu, Lei Zhu

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2604.04956; the arXiv export API returned HTTP 429 (rate limited).

[432] Self-Supervised Foundation Model for Calcium-imaging Population Dynamics

Xinhong Xu, Yimeng Zhang, Qichen Qian, Yuanlong Zhang

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2604.04958; the arXiv export API returned HTTP 429 (rate limited).

[433] Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code

Dominik Blain, Maxime Noiseux

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2604.05292; the arXiv export API returned HTTP 429 (rate limited).

[434] Governance and Regulation of Artificial Intelligence in Developing Countries: A Case Study of Nigeria

Uloma Okoro, Tammy Mackenzie, Branislav Radeljic

Main category: cs.AI

Abstract: Summary unavailable for arXiv:2604.06018; the arXiv export API returned HTTP 429 (rate limited).

cs.SD

[435] A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech

Jia-Hong Huang, Seulgi Kim, Yi Chieh Liu, Yixian Shen, Hongyi Zhu, Prayag Tiwari, Stevan Rudinac, Evangelos Kanoulas

Main category: cs.SD

TL;DR: First framework for detecting speaker drift in diffusion-based TTS using LLM-based classification of utterance-level speaker consistency through cosine similarity analysis of overlapping speech segments.

DetailsMotivation: Speaker drift in diffusion-based TTS models undermines speech coherence in long-form/interactive settings, but remains underexplored with no automatic detection methods available.

Method: Formulates speaker drift detection as binary classification using cosine similarity across overlapping speech segments, then prompts LLMs with structured representations to assess drift, with theoretical guarantees for cosine-based detection.

Result: Demonstrates speaker embeddings exhibit meaningful geometric clustering on unit sphere, creates benchmark with human-validated annotations, and shows viability of embedding-to-reasoning pipeline with multiple LLMs.

Conclusion: Establishes speaker drift as standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS systems.

Abstract: Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality synthetic benchmark with human-validated speaker drift annotations. Experiments with multiple state-of-the-art LLMs confirm the viability of this embedding-to-reasoning pipeline. Our work establishes speaker drift as a standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS.
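The cosine-based detection step described above can be sketched as a simple adjacent-segment comparison. This is a minimal illustration with toy embeddings: the 192-d embedding size, the noise scale, and the 0.7 similarity threshold are assumptions for the sketch, not the paper's settings, and the LLM reasoning stage is omitted.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def detect_drift(segment_embeddings, threshold=0.7):
    """Binary drift classification: flag the utterance if any pair of
    adjacent overlapping-segment embeddings falls below the threshold."""
    sims = [cosine(a, b)
            for a, b in zip(segment_embeddings, segment_embeddings[1:])]
    return any(s < threshold for s in sims), sims

# Toy example: a consistent speaker vs. an abrupt identity shift.
rng = np.random.default_rng(0)
base = rng.normal(size=192)                      # stand-in speaker embedding
stable = [base + 0.05 * rng.normal(size=192) for _ in range(5)]
drifted = stable[:3] + [rng.normal(size=192) for _ in range(2)]

print(detect_drift(stable)[0])   # → False (adjacent segments stay similar)
print(detect_drift(drifted)[0])  # → True  (identity shifts mid-utterance)
```

In practice the embeddings would come from a pretrained speaker encoder applied to overlapping windows of the synthesized waveform, and the flagged similarity profile would be handed to the LLM stage for perceptual reasoning.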

[436] AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

Yuxuan Wang, Peize He, Xiyan Gui, Xiaoqian Liu, Junhao He, Xuyang Liu, Zichen Wen, Xuming Hu, Linfeng Zhang

Main category: cs.SD

TL;DR: AudioKV is a novel KV cache compression framework for Large Audio-Language Models that prioritizes audio-critical attention heads using semantic-acoustic alignment and spectral score smoothing to maintain accuracy during long-context inference.

DetailsMotivation: Current KV cache compression techniques for LLMs fail in the audio domain because they overlook the intrinsic temporal continuity of acoustic signals, causing catastrophic performance degradation in Large Audio-Language Models during long-context inference.

Method: 1) Identify modality-specialized attention heads by analyzing attention scores in ASR tasks; 2) Dynamically allocate KV cache budgets preferentially to audio-critical heads; 3) Introduce Spectral Score Smoothing (SSS) - an FFT-based global filtering strategy to suppress high-frequency noise and recover smooth global trends from importance scores.

Result: Extensive evaluations across multiple LALMs (Qwen and Gemma series) show AudioKV significantly outperforms baselines. At a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, while traditional methods suffer catastrophic degradation and repetition.

Conclusion: AudioKV provides an effective hardware-friendly solution for KV cache compression in audio-language models by leveraging acoustic signal properties, enabling efficient long-context inference while preserving accuracy.

Abstract: Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection with unprecedented precision. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.
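The Spectral Score Smoothing idea (FFT-based low-pass filtering of token-importance scores) can be sketched as below. The `keep_ratio` cutoff and the synthetic score sequence are illustrative assumptions; the paper's exact cutoff and normalization are not specified here.

```python
import numpy as np

def spectral_smooth(scores, keep_ratio=0.1):
    """Low-pass filter a 1-D sequence of importance scores: keep only the
    lowest-frequency FFT bins and invert, suppressing high-frequency noise
    while recovering the smooth global trend."""
    spec = np.fft.rfft(scores)
    cutoff = max(1, int(len(spec) * keep_ratio))
    spec[cutoff:] = 0.0                       # zero out high-frequency bins
    return np.fft.irfft(spec, n=len(scores))

# Noisy importance scores over a slow underlying trend.
t = np.linspace(0, 1, 256)
trend = np.sin(2 * np.pi * t)
noisy = trend + 0.5 * np.random.default_rng(1).normal(size=256)
smoothed = spectral_smooth(noisy)

# The smoothed curve tracks the trend far better than the raw scores.
print(np.abs(noisy - trend).mean() > np.abs(smoothed - trend).mean())  # → True
```

Token selection would then rank positions by the smoothed scores rather than the raw, noisy ones, yielding the more balanced eviction the abstract describes.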

[437] AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

Wenyu Li, Xiaoqi Jiao, Yi Chang, Guangyan Zhang, Yiwen Guo

Main category: cs.SD

TL;DR: AudioRole, a dataset of 1M+ character-grounded audio dialogues from TV series for audio role-playing, plus the ARP-Eval evaluation framework and trained ARP-Models that outperform strong baselines in audio-grounded role-playing.

DetailsMotivation: Existing role-playing research focuses on text-based persona simulation, but Audio Role-Playing (ARP) requires synchronized alignment of semantic content and vocal characteristics, creating a gap in multimodal datasets for audio-grounded role-playing.

Method: Created AudioRole dataset from 13 TV series (1K+ hours, 1M+ dialogues) with synchronized audio-text pairs, speaker identities, and contextual metadata. Developed ARP-Eval dual-aspect evaluation framework and trained ARP-Model (GLM-4-Voice) on the dataset.

Result: ARP-Model achieved Acoustic Personalization score of 0.31 (outperforming GLM-4-voice and MiniCPM-O-2.6) and Content Personalization score of 0.36 (38% improvement over untrained model, matching MiniCPM-O-2.6). Dataset includes 115+ characters and 6 trained models.

Conclusion: AudioRole provides essential resources for advancing audio-grounded role-playing research, addressing the unique challenges of synchronized semantic-vocal alignment in multimodal LLMs.

Abstract: The creation of high-quality multimodal datasets remains fundamental for advancing role-playing capabilities in large language models (LLMs). While existing works predominantly focus on text-based persona simulation, Audio Role-Playing (ARP) presents unique challenges due to the need for synchronized alignment of semantic content and vocal characteristics. To address this gap, we propose AudioRole, a meticulously curated dataset from 13 TV series spanning 1K+ hours with 1M+ character-grounded dialogues, providing synchronized audio-text pairs annotated with speaker identities and contextual metadata. In addition, to demonstrate the effectiveness of the dataset, we introduced ARP-Eval, a dual-aspect evaluation framework that assesses both response quality and role fidelity. Empirical validation showing GLM-4-Voice trained on AudioRole (which we called ARP-Model) achieve an average Acoustic Personalization score of 0.31, significantly outperforming the original GLM-4-voice and the more powerful model MiniCPM-O-2.6, which specifically supports role-playing in one-shot scenarios. The ARP-Model also achieves a Content Personalization score of 0.36, surpassing the untrained original model by about 38% and maintaining the same level as MiniCPM-O-2.6. AudioRole features dialogues from over 115 main characters, 6 trained ARP-Models that role-play different characters, and evaluation protocols. Together, they provide an essential resource for advancing audio-grounded role-playing research.

[438] PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

Main category: cs.SD

TL;DR: PhyAVBench introduces the first benchmark for evaluating audio-physics grounding in text-to-audio-video generation, featuring a new dataset and novel evaluation metrics to assess physical plausibility of generated sounds.

DetailsMotivation: Current T2AV models often fail to produce physically plausible sounds, and existing benchmarks focus mainly on audio-video synchronization while overlooking explicit evaluation of audio-physics grounding, limiting progress in physically plausible audio-visual generation.

Method: Created PhyAVBench with PhyAV-Sound-11K dataset (25.5 hours, 11,605 videos from 184 participants), featuring 337 paired-prompt groups with controlled physical variations. Introduced Audio-Physics Sensitivity Test (APST) paradigm and Contrastive Physical Response Score (CPRS) metric to quantify acoustic consistency between generated and real-world videos.

Result: Comprehensive evaluation of 17 state-of-the-art models reveals that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization.

Conclusion: PhyAVBench provides the first systematic benchmark for audio-physics grounding in audio-visual generation, revealing significant limitations in current models and pointing to future research directions for physically plausible audio-visual generation.

Abstract: Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio-video temporal synchronization, while largely overlooking explicit evaluation of audio-physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 25.5 hours of 11,605 audible videos collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors underlying their acoustic differences. Importantly, PhyAVBench leverages paired text prompts to evaluate this capability. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results reveal that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation. Prompts, ground-truth, and generated video samples are available at https://phyavbench.pages.dev/.

[439] Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Jun Xue, Yi Chai, Yanzhen Ren, Jinshen He, Zhiqiang Tang, Zhuolin Yi, Yihuan Huang, Yuankun Xie, Yujie Chen

Main category: cs.SD

TL;DR: A unified framework for speech editing detection using Audio LLMs that bridges detection and content localization through generative formulation, with a new dataset and improved methods.

DetailsMotivation: Current speech editing detection datasets lack diversity and realistic editing scenarios, and existing methods struggle with deletion-type edits where manipulated content is absent from the signal.

Method: Proposes AiEdit dataset (140 hours bilingual) covering addition, deletion, and modification operations using state-of-the-art speech editing systems. Reformulates SED as structured text generation task using Audio LLMs with prior-enhanced prompting (word-level probabilistic cues) and acoustic consistency-aware loss for better latent space separation.

Result: Experimental results show the proposed approach consistently outperforms existing methods across both detection and localization tasks.

Conclusion: The unified generative framework effectively addresses limitations of current SED methods, particularly for deletion-type edits, and provides a more realistic benchmark through the AiEdit dataset.

Abstract: Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or limited editing operations, resulting in restricted diversity and poor coverage of realistic editing scenarios. Meanwhile, current SED methods rely heavily on frame-level supervision to detect observable acoustic anomalies, which fundamentally limits their ability to handle deletion-type edits, where the manipulated content is entirely absent from the signal. To address these challenges, we present a unified framework that bridges speech editing detection and content localization through a generative formulation based on Audio Large Language Models (Audio LLMs). We first introduce AiEdit, a large-scale bilingual dataset (approximately 140 hours) that covers addition, deletion, and modification operations using state-of-the-art end-to-end speech editing systems, providing a more realistic benchmark for modern threats. Building upon this, we reformulate SED as a structured text generation task, enabling joint reasoning over edit type identification, and content localization. To enhance the grounding of generative models in acoustic evidence, we propose a prior-enhanced prompting strategy that injects word-level probabilistic cues derived from a frame-level detector. Furthermore, we introduce an acoustic consistency-aware loss that explicitly enforces the separation between normal and anomalous acoustic representations in the latent space. Experimental results demonstrate that the proposed approach consistently outperforms existing methods across both detection and localization tasks.

cs.LG

[440] From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures

Srinidhi Madabhushi, Pranesh Vyas, Swathi Vaidyanathan, Mayur Kurup, Elliott Nash, Yegor Silyutin

Main category: cs.LG

TL;DR: Prime Video developed a graph-based anomaly detection system using GCN-GAE embeddings to identify under-represented services in load tests vs. real events, achieving 96% precision but 58% recall.

DetailsMotivation: Load tests for Prime Video's streaming service sometimes miss service behaviors unique to real event traffic (like live sports or popular VOD releases), creating a need for better anomaly detection that can identify under-represented services in test environments.

Method: Uses unsupervised node-level graph embeddings on a GCN-GAE (Graph Convolutional Network - Graph Autoencoder) to learn structural representations from directed, weighted service graphs at minute-level resolution, then flags anomalies based on cosine similarity between load test and event embeddings.

Result: The system identifies documented incident-related services and demonstrates early detection capability. With a synthetic anomaly injection framework, it shows promising precision (96%) and a low false-positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions.

Conclusion: The framework demonstrates practical utility within Prime Video while surfacing methodological lessons and directions, providing a foundation for broader application across microservice ecosystems.

Abstract: Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies under-represented services using unsupervised node-level graph embeddings. Built on a GCN-GAE, our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on cosine similarity between load test and event embeddings. The system identifies incident-related services that are documented and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation that show promising precision (96%) and low false positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions. This framework demonstrates practical utility within Prime Video while also surfacing methodological lessons and directions, providing a foundation for broader application across microservice ecosystems.
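The final comparison step, flagging services whose load-test embedding diverges from their live-event embedding, can be sketched as below. Only this step is shown; the GCN-GAE that produces the per-service embeddings is omitted, and the 16-d embeddings, service names, and 0.8 threshold are illustrative assumptions.

```python
import numpy as np

def flag_underrepresented(test_emb, event_emb, threshold=0.8):
    """Flag services whose node embeddings from the load-test graph and the
    live-event graph have low cosine similarity, i.e. whose structural role
    during the event was under-represented in the test."""
    flags = {}
    for svc in test_emb:
        u, v = test_emb[svc], event_emb[svc]
        sim = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        flags[svc] = sim < threshold
    return flags

rng = np.random.default_rng(2)
test_emb = {f"svc{i}": rng.normal(size=16) for i in range(4)}
# Most services behave the same live, up to small perturbations...
event_emb = {s: e + 0.01 * rng.normal(size=16) for s, e in test_emb.items()}
# ...but one service's structural role inverts during the real event.
event_emb["svc3"] = -test_emb["svc3"]

flags = flag_underrepresented(test_emb, event_emb)
print([s for s, f in flags.items() if f])  # → ['svc3']
```

Run per minute-level graph snapshot, this yields the stream of anomaly flags the system surfaces to operators.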

[441] A Benchmark of Classical and Deep Learning Models for Agricultural Commodity Price Forecasting on A Novel Bangladeshi Market Price Dataset

Tashreef Muhammad, Tahsin Ahmed, Meherun Farzana, Md. Mahmudul Hasan, Abrar Eyasir, Md. Emon Khan, Mahafuzul Islam Shawon, Ferdous Mondol, Mahmudul Hasan, Muhammad Ibrahim

Main category: cs.LG

TL;DR: Introduces AgriPriceBD dataset for Bangladeshi agricultural commodities and evaluates forecasting models, finding commodity-specific performance patterns and limitations of deep learning approaches on small datasets.

DetailsMotivation: Agricultural commodity price forecasting is crucial for food security and income stabilization in developing economies, but machine-learning-ready datasets are scarce in South Asia, particularly for Bangladesh.

Method: Created AgriPriceBD dataset (1,779 daily prices for 5 commodities) using LLM-assisted digitization pipeline, then evaluated 7 forecasting approaches including classical models (naïve persistence, SARIMA, Prophet) and deep learning architectures (BiLSTM, Transformer, Time2Vec-Transformer, Informer) with Diebold-Mariano statistical tests.

Result: Commodity price forecastability is heterogeneous; naïve persistence works best for near-random-walk commodities; Time2Vec encoding provides no advantage and degrades performance on some commodities; Prophet fails systematically; Informer produces erratic predictions; deep learning models struggle with small agricultural datasets.

Conclusion: Classical models often outperform deep learning approaches on small agricultural datasets; dataset and models released publicly to support future research on agricultural commodity markets in developing economies.

Abstract: Accurate short-term forecasting of agricultural commodity prices is critical for food security planning and smallholder income stabilisation in developing economies, yet machine-learning-ready datasets for this purpose remain scarce in South Asia. This paper makes two contributions. First, we introduce AgriPriceBD, a benchmark dataset of 1,779 daily retail mid-prices for five Bangladeshi commodities - garlic, chickpea, green chilli, cucumber, and sweet pumpkin - spanning July 2020 to June 2025, extracted from government reports via an LLM-assisted digitisation pipeline. Second, we evaluate seven forecasting approaches spanning classical models - naïve persistence, SARIMA, and Prophet - and deep learning architectures - BiLSTM, Transformer, Time2Vec-enhanced Transformer, and Informer - with Diebold-Mariano statistical significance tests. Commodity price forecastability is fundamentally heterogeneous: naïve persistence dominates on near-random-walk commodities. Time2Vec temporal encoding provides no statistically significant advantage over fixed sinusoidal encoding and causes catastrophic degradation on green chilli (+146.1% MAE, p<0.001). Prophet fails systematically, attributable to discrete step-function price dynamics incompatible with its smooth decomposition assumptions. Informer produces erratic predictions (variance up to 50x ground-truth), confirming sparse-attention Transformers require substantially larger training sets than small agricultural datasets provide. All code, models, and data are released publicly to support replication and future forecasting research on agricultural commodity markets in Bangladesh and similar developing economies.
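The naïve persistence baseline that dominates on near-random-walk commodities is simple enough to state in a few lines. The price values below are illustrative, not drawn from AgriPriceBD, and the MAE helper is a generic sketch of the evaluation, not the benchmark's exact protocol.

```python
import numpy as np

def naive_persistence(series):
    """One-step-ahead naive forecast: tomorrow's price = today's price.
    On a random walk, no model can systematically beat this baseline."""
    return series[:-1]          # forecast for day t+1 is the value at day t

def mae(y_true, y_pred):
    """Mean absolute error between realized and forecast prices."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Toy daily retail price series (illustrative values).
prices = np.array([100.0, 102.0, 101.0, 105.0, 104.0])
preds = naive_persistence(prices)     # forecasts for days 2..5
print(mae(prices[1:], preds))         # → 2.0
```

A learned model is only worth its complexity on this data if its MAE beats this baseline by a statistically significant margin, which is what the Diebold-Mariano tests in the paper check.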

[442] Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse

Gregory Magarshak

Main category: cs.LG

TL;DR: PLTs provide a unified representation for generative sequence models that enables optimal compression, policy representation, and computational reuse through structured prefix trees with conditional probabilities.

Motivation: The paper aims to create a unified representation that makes explicit the prefix structure of generative models over sequences, addressing the need for efficient compression, decision-making policies, and computational reuse across various domains.

Method: Introduces probabilistic language tries (PLTs) - prefix trees where edges are labeled with conditional probabilities. Develops prior-guided caching theorem for efficient inference, and hybrid compression architecture combining PLT-covered majority with sparse residual store.
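
A stdlib-only sketch of the core data structure: a prefix tree whose edges carry conditional probabilities. This toy version estimates probabilities from counts, whereas the paper's PLTs are conditioned on a generative model; all names here are illustrative.

```python
import math

class PLTNode:
    """Node of a toy probabilistic language trie (illustrative only)."""
    def __init__(self):
        self.children = {}   # token -> PLTNode
        self.counts = {}     # token -> observed count on the outgoing edge

def insert(root, seq):
    """Add one sequence, updating edge counts along its prefix path."""
    node = root
    for tok in seq:
        node.counts[tok] = node.counts.get(tok, 0) + 1
        node = node.children.setdefault(tok, PLTNode())

def log_prob(root, seq):
    """Sum of log conditional edge probabilities along the prefix path."""
    node, lp = root, 0.0
    for tok in seq:
        total = sum(node.counts.values())
        lp += math.log(node.counts[tok] / total)
        node = node.children[tok]
    return lp

root = PLTNode()
for s in [("e4", "e5"), ("e4", "c5"), ("d4", "d5")]:
    insert(root, s)

# P(e4) = 2/3, P(e5 | e4) = 1/2  ->  P(e4, e5) = 1/3
print(round(math.exp(log_prob(root, ("e4", "e5"))), 4))  # 0.3333
```

In the caching application, a query that hits such a trie is answered by structured retrieval rather than model execution, which is where the paper's expected cost p_r * O(log N) + (1 - p_r) * O(n^2) comes from.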

Result: PLTs reduce transformer attention cost from O(n²) to expected p_r·O(log N) + (1-p_r)·O(n²), where p_r is reuse probability. Demonstrated across chess, web search, robotics, workflows, and LLM inference, showing compression, decision making, and computational reuse derive from single probability measure.

Conclusion: PLTs provide a fundamental unified representation connecting compression, decision making, and computational reuse through probability measures on sequence space, with practical applications across multiple domains including LLM inference optimization.

Abstract: We introduce probabilistic language tries (PLTs), a unified representation that makes explicit the prefix structure implicitly defined by any generative model over sequences. By assigning to each outgoing edge the conditional probability of the corresponding token or action, a PLT simultaneously serves as: (i) an optimal lossless compressor via frequency-weighted interval encoding, generalizing arithmetic coding to model-conditioned distributions; (ii) a policy representation for sequential decision problems including games, search, and robotic control; and (iii) a memoization index that lets repeated inference queries be answered by structured retrieval rather than full model execution. The central technical result is a prior-guided caching theorem: under a stationary generative distribution, a PLT-guided cache achieves strictly lower expected inference cost than any empirical-frequency cache for all query counts below a threshold that grows with the concentration of the prior. This converts O(n^2) transformer attention cost into an expected cost of p_r * O(log N) + (1 - p_r) * O(n^2), where p_r is the prior-estimated reuse probability and N is the artifact store size. We further introduce a hybrid compression architecture decomposing any dataset into a PLT-covered majority and a sparse residual store, connecting arithmetic coding with Kolmogorov-style program representations and rate-distortion theory. We instantiate the framework across chess, web search, robotics, organizational workflows, and LLM inference, demonstrating that compression, decision making, and computational reuse are all derived from a single probability measure on sequence space.

[443] FLeX: Fourier-based Low-rank EXpansion for multilingual transfer

Gaurav Narasimhan

Main category: cs.LG

TL;DR: LoRA fine-tuning with Fourier regularization improves cross-lingual code generation from Python to Java, achieving better performance than broader fine-tuning with fewer parameters.

Motivation: Enterprise environments require code generation across multiple programming languages, but fine-tuning LLMs for each language individually is computationally expensive. Need efficient methods for cross-lingual transfer.

Method: Fine-tuned Code Llama 7B using LoRA (parameter-efficient fine-tuning) with Adam vs Sophia optimizers, plus Fourier-based regularization technique. Used MBPP dataset for training.
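
The abstract does not spell out the Fourier-based regularizer. One generic instantiation, shown purely as an assumption (not the paper's formulation), penalizes the high-frequency DFT energy of a flattened adapter weight vector:

```python
import math

def dft_mag(x):
    """Magnitude spectrum via a naive O(n^2) DFT (stdlib only)."""
    n = len(x)
    mags = []
    for k in range(n):
        re = sum(x[t] * math.cos(-2 * math.pi * k * t / n) for t in range(n))
        im = sum(x[t] * math.sin(-2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def high_freq_penalty(weights, cutoff):
    """One *hypothetical* Fourier regularizer: spectral energy above a
    frequency cutoff. The paper does not specify its exact formulation."""
    mags = dft_mag(weights)
    return sum(m * m for m in mags[cutoff:len(weights) // 2 + 1])

smooth = [math.sin(2 * math.pi * t / 8) for t in range(8)]  # one low frequency
noisy = [(-1) ** t for t in range(8)]                       # highest frequency
print(high_freq_penalty(smooth, 2) < high_freq_penalty(noisy, 2))  # True
```

Adding such a term to the fine-tuning loss would bias adapter updates toward smooth, low-frequency structure; whether that matches the paper's regularizer is an open assumption here.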

Result: LoRA fine-tuning achieved 40.1% pass@1 vs 38.4% for broader fine-tuning. Sophia converged faster but similar final scores. Fourier regularization boosted Java performance to 42.1% vs 34.2% baseline.

Conclusion: Combining LoRA, optimized training methods, and frequency-domain regularization can efficiently adapt single-language LLMs for cross-lingual code generation with reduced computational cost.

Abstract: Cross-lingual code generation is critical in enterprise environments where multiple programming languages coexist. However, fine-tuning large language models (LLMs) individually for each language is computationally prohibitive. This paper investigates whether parameter-efficient fine-tuning methods and optimizer enhancements can improve cross-lingual transfer from Python to languages like Java. We fine-tune the Code Llama 7B model using low-rank adaptation (LoRA) to optimize a small subset of parameters and compare Adam and Sophia optimizers, while exploring a novel Fourier-based regularization technique. Our contributions include: (1) demonstrating that LoRA fine-tuning on a small, high-quality dataset (MBPP) can exceed the pass@1 performance of the more broadly fine-tuned Code Llama-Python-7B model (40.1% vs. 38.4%); (2) showing that while Sophia achieves faster convergence than Adam, final pass@1 scores show marginal differences; and (3) presenting evidence that Fourier-based regularization during fine-tuning significantly improves cross-lingual transfer, achieving 42.1% pass@1 on Java tasks compared to the 34.2% baseline. These findings suggest that combining LoRA, optimized training methods, and frequency-domain regularization can efficiently adapt single-language LLMs to perform well across multiple programming languages.

[444] Spectral Edge Dynamics Reveal Functional Modes of Learning

Yongzhong Xu

Main category: cs.LG

TL;DR: Training dynamics during grokking concentrate along low-dimensional functional modes (spectral edge) that reveal task-specific algebraic symmetries, with different mathematical operations showing distinct harmonic structures in their dominant update directions.

Motivation: The paper investigates why standard mechanistic interpretability tools fail to capture the key training dynamics during grokking, and seeks to understand the underlying functional structure of learning in neural networks across different mathematical operations.

Method: Analyzes training dynamics through spectral edge analysis (dominant update directions) across different mathematical tasks (modular addition, multiplication, subtraction, x²+y²), examining how these directions collapse to specific functional modes and harmonic structures.
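
The "collapse to a single Fourier mode" can be quantified as the fraction of spectral energy captured by the strongest mode. A stdlib-only toy version of such a concentration measure (the paper's exact metric may differ):

```python
import math

def mode_concentration(values):
    """Fraction of (non-DC) spectral energy in the single strongest DFT mode.
    A toy proxy for the paper's concentration measure, not its definition."""
    n = len(values)
    energies = []
    for k in range(1, n // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * k * t / n) for t, v in enumerate(values))
        im = sum(v * math.sin(2 * math.pi * k * t / n) for t, v in enumerate(values))
        energies.append(re * re + im * im)
    return max(energies) / sum(energies)

n = 64
single_mode = [math.cos(2 * math.pi * 5 * t / n) for t in range(n)]
mixed = [math.cos(2 * math.pi * 5 * t / n) + math.cos(2 * math.pi * 11 * t / n)
         for t in range(n)]
print(round(mode_concentration(single_mode), 2),  # 1.0: fully concentrated
      round(mode_concentration(mixed), 2))        # 0.5: split across two modes
```

Measuring the same quantity in a task-adapted basis (e.g. discrete-log coordinates for multiplication) is how basis-dependent concentration gains like the reported 5.9x would show up.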

Result: Different mathematical operations reveal distinct functional structures: modular addition collapses to single Fourier mode, multiplication to discrete-log basis (5.9x concentration), subtraction to multi-mode family, and x²+y² requires cross-terms of additive/multiplicative features (4x variance boost). Multitask training amplifies compositional structure.

Conclusion: Training discovers low-dimensional functional modes over the input domain whose structure depends on the algebraic symmetry of the task, with simple harmonic structure emerging only when tasks admit symmetry-adapted bases, while complex tasks require richer functional descriptions.

Abstract: Training dynamics during grokking concentrate along a small number of dominant update directions – the spectral edge – which reliably distinguishes grokking from non-grokking regimes. We show that standard mechanistic interpretability tools (head attribution, activation probing, sparse autoencoders) fail to capture these directions: their structure is not localized in parameter or feature space. Instead, each direction induces a structured function over the input domain, revealing low-dimensional functional modes invisible to representation-level analysis. For modular addition, all leading directions collapse to a single Fourier mode. For multiplication, the same collapse appears only in the discrete-log basis, yielding a 5.9x improvement in concentration. For subtraction, the edge spans a small multi-mode family. For $x^2+y^2$, no single harmonic basis suffices, but cross-terms of additive and multiplicative features provide a 4x variance boost, consistent with the decomposition (x+y)^2 - 2xy. Multitask training amplifies this compositional structure, with the $x^2+y^2$ spectral edge inheriting the addition circuit’s characteristic frequency (2.3x concentration increase). These results suggest that spectral edge dynamics identify low-dimensional functional modes over the input domain, whose structure depends on the algebraic symmetry of the task. Simple harmonic structure emerges only when the task admits a symmetry-adapted basis; more complex tasks require richer functional descriptions.

[445] $S^3$: Stratified Scaling Search for Test-Time in Diffusion Language Models

Ahsan Bilal, Muhammad Ahmed Mohsin, Muhammad Umer, Asad Aali, Muhammad Usman Khanzada, Muhammad Usman Rafique, Zihao He, Emily Fox, Dean F. Hougen

Main category: cs.LG

TL;DR: S³ (Stratified Scaling Search) improves diffusion language model outputs by reallocating inference compute during denoising using verifier-guided search, rather than just at final output stage.

Motivation: Naive best-of-K sampling for test-time scaling in diffusion language models is fundamentally limited because it repeatedly draws from the same base distribution whose high-probability regions are often misaligned with high-quality outputs.

Method: S³ expands multiple candidate trajectories at each denoising step, evaluates them with a lightweight reference-free verifier, and selectively resamples promising candidates while preserving diversity within the search frontier, approximating a reward-tilted sampling distribution.
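
The search pattern can be sketched as: expand candidates at each step, score them with a verifier, and keep a small diverse frontier. In this deterministic toy, the "verifier" cheats by comparing against a known target and expansion is exhaustive over a tiny vocabulary, whereas S³ samples denoising trajectories and scores them reference-free:

```python
def toy_verifier(candidate, target):
    """Positional-match score; a stand-in for S^3's reference-free verifier."""
    return sum(c == t for c, t in zip(candidate, target))

def stratified_search(target, vocab="abc", frontier_size=3):
    frontier = [[]]                      # start from the empty sequence
    for _ in range(len(target)):
        # Expand every frontier candidate by every token, then score the pool.
        pool = [c + [t] for c in frontier for t in vocab]
        pool.sort(key=lambda c: toy_verifier(c, target), reverse=True)
        frontier = pool[:frontier_size]  # keep several candidates, not just one
    return frontier[0]

best = stratified_search(list("abca"))
print("".join(best))  # abca
```

Keeping a frontier of size greater than one is what preserves diversity: a greedy single-candidate search can lock onto a locally high-scoring prefix that the verifier later penalizes.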

Result: Experiments with LLaDA-8B-Instruct on MATH-500, GSM8K, ARC-Challenge, and TruthfulQA show S³ consistently improves performance across benchmarks, with largest gains on mathematical reasoning tasks while leaving the underlying model unchanged.

Conclusion: Classical search over denoising trajectories provides a practical mechanism for test-time scaling in diffusion language models, enabling better outputs with more inference compute without additional training.

Abstract: Test-time scaling investigates whether a fixed diffusion language model (DLM) can generate better outputs when given more inference compute, without additional training. However, naive best-of-$K$ sampling is fundamentally limited because it repeatedly draws from the same base diffusion distribution, whose high-probability regions are often misaligned with high-quality outputs. We propose $S^3$ (Stratified Scaling Search), a classical verifier-guided search method that improves generation by reallocating compute during the denoising process rather than only at the final output stage. At each denoising step, $S^3$ expands multiple candidate trajectories, evaluates them with a lightweight reference-free verifier, and selectively resamples promising candidates while preserving diversity within the search frontier. This procedure effectively approximates a reward-tilted sampling distribution that favors higher-quality outputs while remaining anchored to the model prior. Experiments with LLaDA-8B-Instruct on MATH-500, GSM8K, ARC-Challenge, and TruthfulQA demonstrate that $S^3$ consistently improves performance across benchmarks, achieving the largest gains on mathematical reasoning tasks while leaving the underlying model and decoding schedule unchanged. These results show that classical search over denoising trajectories provides a practical mechanism for test-time scaling in DLMs.

[446] SMT-AD: a scalable quantum-inspired anomaly detection approach

Apimuk Sornsaeng, Si Min Chan, Wenxuan Zhang, Swee Liang Wong, Joshua Lim, Dario Poletti

Main category: cs.LG

TL;DR: Quantum-inspired tensor network approach for anomaly detection using superposition of bond-dimension-1 matrix product operators with Fourier feature embedding.

Motivation: To develop an efficient and parallelizable quantum-inspired anomaly detection method that scales linearly with feature size while maintaining competitive performance with existing baselines.

Method: SMT-AD (Superposition of Multiresolution Tensors for Anomaly Detection) uses superposition of bond-dimension-1 matrix product operators with Fourier-assisted feature embedding, where parameters grow linearly with feature size, embedding resolutions, and additional components.
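
The Fourier-assisted feature embedding can be sketched as mapping each scalar feature to sinusoids at several resolutions, so the embedded width (and hence parameter count downstream) grows linearly in features and resolutions. Illustrative only; the paper's exact embedding may differ:

```python
import math

def fourier_embed(x, resolutions=(1, 2, 4)):
    """Map a scalar in [0, 1] to multi-resolution Fourier features.
    An illustrative embedding; the resolutions here are made-up values."""
    feats = []
    for k in resolutions:
        feats.append(math.cos(2 * math.pi * k * x))
        feats.append(math.sin(2 * math.pi * k * x))
    return feats

row = [0.1, 0.7]                      # one sample with two raw features
embedded = [v for x in row for v in fourier_embed(x)]
print(len(embedded))                  # 12 = 2 features * 3 resolutions * 2
```

Each extra resolution adds exactly two outputs per feature, which is the linear growth the summary refers to.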

Result: Achieves competitive anomaly detection performance on standard datasets including credit card transactions, even with minimal configurations, while providing model weight reduction and feature relevance highlighting.

Conclusion: The proposed quantum-inspired tensor network approach offers an efficient, scalable, and interpretable solution for anomaly detection with linear parameter growth and competitive performance.

Abstract: Quantum-inspired tensor network algorithms have been shown to be effective and efficient models for machine learning tasks, including anomaly detection. Here, we propose a highly parallelizable quantum-inspired approach, which we call SMT-AD, from Superposition of Multiresolution Tensors for Anomaly Detection. It is based upon the superposition of bond-dimension-1 matrix product operators to transform the input data with Fourier-assisted feature embedding, where the number of learnable parameters grows linearly with feature size, embedding resolutions, and the number of additional components in the matrix product operators structure. We demonstrate successful anomaly detection when applied to standard datasets, including credit card transactions, and find that, even with minimal configurations, it achieves competitive performance against established anomaly detection baselines. Furthermore, it provides a straightforward way to reduce the weight of the model and even improve the performance by highlighting the most relevant input features.

[447] MO-RiskVAE: A Multi-Omics Variational Autoencoder for Survival Risk Modeling in Multiple Myeloma

Zixuan Chen, Heng Zhang, YuPeng Qin, WenPeng Xing, Qiang Wang, Da Wang, Changting Lin, Meng Han

Main category: cs.LG

TL;DR: Multimodal VAE framework for survival risk modeling in multiple myeloma shows that survival-driven training is primarily sensitive to latent regularization scale and structure rather than specific divergence formulation, leading to improved risk stratification.

Motivation: Standard latent regularization strategies in multimodal VAEs for survival risk modeling often fail to preserve prognostically relevant variation when trained under survival supervision, leading to unstable or overly constrained representations. There's a need to understand which aspects of latent design fundamentally govern performance in this setting.

Method: Conducted controlled investigation of latent modeling choices within unified extension of MyeVAE framework. Systematically isolated regularization scale, posterior geometry, and latent space structure under identical architectures and optimization protocols. Tested various divergence formulations (KL, MMD, HSIC) and explored hybrid continuous-discrete formulation using Gumbel-Softmax.
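
The hybrid continuous-discrete latent relies on the Gumbel-Softmax relaxation, which replaces hard categorical sampling with a soft, temperature-controlled point on the simplex. A stdlib-only sketch (the logits and temperature are made-up values):

```python
import math
import random

random.seed(42)

def gumbel_softmax(logits, tau=0.5):
    """Gumbel-Softmax sample: softmax((logits + Gumbel noise) / tau).
    Illustrative sketch; real use would run inside autodiff for gradients."""
    g = [-math.log(-math.log(random.random())) for _ in logits]  # Gumbel(0, 1)
    z = [(l + gi) / tau for l, gi in zip(logits, g)]
    m = max(z)                                   # stabilize the softmax
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

sample = gumbel_softmax([2.0, 0.5, -1.0])
print([round(v, 3) for v in sample])
```

As tau approaches 0 the sample approaches a one-hot vector (a hard subtype assignment); larger tau yields smoother mixtures, which is what makes the discrete subspace trainable end to end.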

Result: Survival-driven training is primarily sensitive to magnitude and structure of latent regularization rather than specific divergence formulation. Moderate relaxation of KL regularization consistently improves survival discrimination. Structuring latent space improves alignment between learned representations and survival risk gradients. Hybrid continuous-discrete formulation enhances global risk ordering in continuous latent subspace.

Conclusion: Developed robust multimodal survival model (MO-RiskVAE) that consistently improves risk stratification over original MyeVAE without additional supervision or complex training heuristics, demonstrating the importance of carefully calibrated latent regularization for survival prediction tasks.

Abstract: Multimodal variational autoencoders (VAEs) have emerged as a powerful framework for survival risk modeling in multiple myeloma by integrating heterogeneous omics and clinical data. However, when trained under survival supervision, standard latent regularization strategies often fail to preserve prognostically relevant variation, leading to unstable or overly constrained representations. Despite numerous proposed variants, it remains unclear which aspects of latent design fundamentally govern performance in this setting. In this work, we conduct a controlled investigation of latent modeling choices for multimodal survival prediction within a unified extension of the MyeVAE framework. By systematically isolating regularization scale, posterior geometry, and latent space structure under identical architectures and optimization protocols, we show that survival-driven training is primarily sensitive to the magnitude and structure of latent regularization rather than the specific divergence formulation. In particular, moderate relaxation of KL regularization consistently improves survival discrimination, while alternative divergence mechanisms such as MMD and HSIC provide limited benefit without appropriate scaling. We further demonstrate that structuring the latent space can improve alignment between learned representations and survival risk gradients. A hybrid continuous–discrete formulation based on Gumbel–Softmax enhances global risk ordering in the continuous latent subspace, even though stable discrete subtype discovery does not emerge under survival supervision. Guided by these findings, we instantiate a robust multimodal survival model, termed MO-RiskVAE, which consistently improves risk stratification over the original MyeVAE without introducing additional supervision or complex training heuristics.

[448] RAGEN-2: Reasoning Collapse in Agentic RL

Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li

Main category: cs.LG

TL;DR: RL training of multi-turn LLM agents suffers from template collapse where models use input-agnostic templates despite appearing diverse, requiring new metrics beyond entropy to diagnose reasoning quality.

Motivation: Current RL training for multi-turn LLM agents is unstable, and existing metrics like entropy fail to detect template collapse where models generate diverse-looking but input-agnostic reasoning patterns.

Method: Decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information), introduce mutual information proxies for online diagnosis, propose SNR-Aware Filtering using reward variance to select high-signal prompts.
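
Why entropy misses template collapse is easy to see with counts: a model can emit templates with a uniform (high-entropy) marginal while ignoring the input entirely, giving zero mutual information. A toy illustration (not the paper's MI proxies):

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(pairs):
    """Plug-in I(X;Y) in bits from (input, template) samples."""
    n = len(pairs)
    px, py, pxy = Counter(), Counter(), Counter(pairs)
    for x, y in pairs:
        px[x] += 1
        py[y] += 1
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Template collapse: templates look diverse but are input-agnostic.
collapsed = [(x, t) for x in "AB" for t in ("t1", "t2")]
# Responsive reasoning: the template tracks the input.
responsive = [("A", "t1"), ("A", "t1"), ("B", "t2"), ("B", "t2")]

# The template marginal is uniform in both cases, so entropy sees no difference:
print(entropy([0.5, 0.5]))                                     # 1.0 bit either way
print(mutual_information(collapsed), mutual_information(responsive))  # 0.0 1.0
```

Cross-input distinguishability (MI) separates the two regimes that within-input diversity (entropy) cannot.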

Result: Mutual information correlates more strongly with final performance than entropy across diverse tasks. SNR-Aware Filtering improves both input dependence and task performance across planning, math reasoning, web navigation, and code execution.

Conclusion: Template collapse is a critical failure mode in RL training of LLM agents that requires cross-input distinguishability metrics like mutual information rather than just entropy, with SNR-Aware Filtering providing an effective solution.

Abstract: RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.

[449] Asymptotic-Preserving Neural Networks for Viscoelastic Parameter Identification in Multiscale Blood Flow Modeling

Giulia Bertaglia, Raffaella Fiamma Cabini

Main category: cs.LG

TL;DR: APNNs embed physical principles to infer viscoelastic parameters and reconstruct blood flow states from ultrasound data, enabling non-invasive pressure estimation.

Motivation: To improve practical applicability of cardiovascular models by reliably determining viscoelastic parameters for arterial deformation under pulsatile pressure, enabling non-invasive estimation of pressure waveforms from accessible ultrasound data.

Method: Uses Asymptotic-Preserving Neural Networks (APNNs) that embed governing physical principles of multiscale viscoelastic blood flow model within learning procedure to infer viscoelastic parameters while reconstructing time-dependent evolution of blood vessel state variables.

Result: Numerical simulations in synthetic and patient-specific scenarios demonstrate effectiveness of methodology for estimating pressure waveforms from cross-sectional area and velocity measurements from Doppler ultrasound.

Conclusion: APNN framework successfully addresses challenge of determining viscoelastic parameters for cardiovascular models, enabling practical non-invasive pressure estimation from readily available ultrasound data.

Abstract: Mathematical models and numerical simulations offer a non-invasive way to explore cardiovascular phenomena, providing access to quantities that cannot be measured directly. In this study, we start with a one-dimensional multiscale blood flow model that describes the viscoelastic properties of arterial walls, and we focus on improving its practical applicability by addressing a major challenge: determining, in a reliable way, the viscoelastic parameters that control how arteries deform under pulsatile pressure. To achieve this, we employ Asymptotic-Preserving Neural Networks that embed the governing physical principles of the multiscale viscoelastic blood flow model within the learning procedure. This framework allows us to infer the viscoelastic parameters while simultaneously reconstructing the time-dependent evolution of the state variables of blood vessels. With this approach, pressure waveforms are estimated from readily accessible patient-specific data, i.e., cross-sectional area and velocity measurements from Doppler ultrasound, in vascular segments where direct pressure measurements are not available. Different numerical simulations, conducted in both synthetic and patient-specific scenarios, show the effectiveness of the proposed methodology.

[450] AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

Wenyue Hua, Sripad Karne, Qian Xie, Armaan Agrawal, Nikos Pagonas, Kostis Kaffes, Tianyi Peng

Main category: cs.LG

TL;DR: AgentOpt is a framework-agnostic Python package for client-side optimization of AI agent pipelines, focusing on model selection to balance cost, accuracy, and latency constraints.

Motivation: While existing research focuses on server-side efficiency for AI agents, there's a growing need for client-side optimization as users compose agents from local tools, remote APIs, and diverse models with varying quality, cost, and latency trade-offs.

Method: AgentOpt implements eight search algorithms (including Arm Elimination, Epsilon-LUCB, Threshold Successive Elimination, and Bayesian Optimization) to efficiently explore the exponentially growing combination space of model assignments to pipeline roles.
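
A toy version of the arm-elimination idea over a combination space: repeatedly evaluate the surviving model combinations on a few examples and discard the worse-looking half, spending far fewer evaluations than brute force. Everything below (role names, accuracies, budgets) is simulated and hypothetical, not AgentOpt's API:

```python
import random

random.seed(7)

# Hypothetical per-role model combinations and their (unknown) true accuracies.
TRUE_ACC = {("small", "small"): 0.55, ("small", "large"): 0.70,
            ("large", "small"): 0.72, ("large", "large"): 0.90}

def evaluate(arm):
    """One noisy pass/fail evaluation of a model combination (simulated)."""
    return random.random() < TRUE_ACC[arm]

def arm_elimination(arms, rounds=6, pulls_per_round=40):
    """Toy arm elimination: drop the worse half of surviving arms each round."""
    arms = list(arms)
    for _ in range(rounds):
        if len(arms) == 1:
            break
        means = {a: sum(evaluate(a) for _ in range(pulls_per_round)) / pulls_per_round
                 for a in arms}
        arms.sort(key=means.get, reverse=True)
        arms = arms[:max(1, len(arms) // 2)]
    return arms[0]

best = arm_elimination(TRUE_ACC)
print(best)
```

With 4 arms this uses at most 4 + 2 rounds of 40 pulls instead of exhaustively benchmarking every combination on the full evaluation set, which is the budget saving the summary quantifies.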

Result: The cost gap between best and worst model combinations can reach 13-32× at matched accuracy. Arm Elimination recovers near-optimal accuracy while reducing evaluation budget by 24-67% relative to brute-force search on three of four tasks.

Conclusion: Client-side optimization is crucial for practical agent deployment, and AgentOpt provides an effective framework for finding cost-effective model assignments in multi-step agent pipelines.

Abstract: AI agents are increasingly deployed in real-world applications, including systems such as Manus, OpenClaw, and coding agents. Existing research has primarily focused on server-side efficiency, proposing methods such as caching, speculative execution, traffic scheduling, and load balancing to reduce the cost of serving agentic workloads. However, as users increasingly construct agents by composing local tools, remote APIs, and diverse models, an equally important optimization problem arises on the client side. Client-side optimization asks how developers should allocate the resources available to them, including model choice, local tools, and API budget across pipeline stages, subject to application-specific quality, cost, and latency constraints. Because these objectives depend on the task and deployment setting, they cannot be determined by server-side systems alone. We introduce AgentOpt, the first framework-agnostic Python package for client-side agent optimization. We first study model selection, a high-impact optimization lever in multi-step agent pipelines. Given a pipeline and a small evaluation set, the goal is to find the most cost-effective assignment of models to pipeline roles. This problem is consequential in practice: at matched accuracy, the cost gap between the best and worst model combinations can reach 13–32$\times$ in our experiments. To efficiently explore the exponentially growing combination space, AgentOpt implements eight search algorithms, including Arm Elimination, Epsilon-LUCB, Threshold Successive Elimination, and Bayesian Optimization. Across four benchmarks, Arm Elimination recovers near-optimal accuracy while reducing evaluation budget by 24–67% relative to brute-force search on three of four tasks. Code and benchmark results available at https://agentoptimizer.github.io/agentopt/.

[451] TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models

Lin Mu, Haiyang Wang, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang

Main category: cs.LG

TL;DR: TalkLoRA introduces communication between experts in MoE-LoRA frameworks to improve routing stability and performance for parameter-efficient fine-tuning of LLMs.

Motivation: Existing MoE-augmented LoRA methods assume independent experts, leading to unstable routing and expert dominance issues. The paper aims to address these limitations by enabling expert communication.

Method: TalkLoRA introduces a lightweight Talking Module that enables controlled information exchange across expert subspaces before routing, creating a more robust global signal for routing decisions.
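
The Talking Module's effect on routing can be sketched as mixing each expert's pre-routing signal with its peers' signals before the softmax, which damps a dominant expert. This is an illustrative stand-in, not the paper's module; the mixing matrix and signal values are invented:

```python
import math

def talking_module(expert_signals, mix):
    """Blend each expert's pre-routing signal with its peers' signals.
    A hypothetical stand-in for the paper's Talking Module."""
    n = len(expert_signals)
    return [sum(mix[i][j] * expert_signals[j] for j in range(n)) for i in range(n)]

def softmax_route(signals):
    """Turn signals into routing weights over experts."""
    m = max(signals)
    exps = [math.exp(s - m) for s in signals]
    return [e / sum(exps) for e in exps]

signals = [3.0, 0.1, 0.2]                     # one expert dominates pre-routing
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # no communication: plain MoELoRA
mix = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]  # mild cross-talk

independent = softmax_route(talking_module(signals, identity))
talking = softmax_route(talking_module(signals, mix))
print(max(independent) > max(talking))  # True: communication softens dominance
```

Note that the identity mixing matrix recovers independent routing exactly, which mirrors the paper's claim that TalkLoRA strictly generalizes existing MoELoRA architectures.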

Result: TalkLoRA outperforms vanilla LoRA and MoELoRA across diverse language understanding and generation tasks, achieving higher parameter efficiency and more balanced expert routing under comparable parameter budgets.

Conclusion: Structured expert communication is a principled and effective enhancement for MoE-based parameter-efficient adaptation, improving routing stability and model performance.

Abstract: Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of Large Language Models (LLMs), and recent Mixture-of-Experts (MoE) extensions further enhance flexibility by dynamically combining multiple LoRA experts. However, existing MoE-augmented LoRA methods assume that experts operate independently, often leading to unstable routing and expert dominance. In this paper, we propose TalkLoRA, a communication-aware MoELoRA framework that relaxes this independence assumption by introducing expert-level communication prior to routing. TalkLoRA equips low-rank experts with a lightweight Talking Module that enables controlled information exchange across expert subspaces, producing a more robust global signal for routing. Theoretically, we show that expert communication smooths routing dynamics by mitigating perturbation amplification while strictly generalizing existing MoELoRA architectures. Empirically, TalkLoRA consistently outperforms vanilla LoRA and MoELoRA across diverse language understanding and generation tasks, achieving higher parameter efficiency and more balanced expert routing under comparable parameter budgets. These results highlight structured expert communication as a principled and effective enhancement for MoE-based parameter-efficient adaptation. Code is available at https://github.com/why0129/TalkLoRA.

[452] Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs

Suraj Yadav, Siddharth Yadav, Parth Goyal

Main category: cs.LG

TL;DR: GRPO preference optimization on small language models (up to 3B) for math reasoning shows limited gains on hardest problems, reveals capacity boundaries, and demonstrates cross-dataset generalization effects.

Motivation: To test whether preference optimization (GRPO) can improve reasoning in resource-constrained settings with small language models, and to understand how problem difficulty affects performance gains.

Method: Applied GRPO (Group Relative Policy Optimization) with LoRA to small language models (up to 3B parameters) on GSM8K and MATH math reasoning datasets, with difficulty-stratified analyses and cross-dataset generalization experiments.
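
GRPO's core update signal is the group-relative advantage: rewards for a group of sampled solutions to the same prompt are centered and scaled within the group. A minimal sketch of that computation (the epsilon and the reward values are illustrative):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: center and scale rewards within one
    sampled group; epsilon guards against zero-variance groups."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Binary correctness rewards for 4 sampled solutions to one math problem.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in adv])  # [1.0, -1.0, -1.0, 1.0]
```

One plausible reading of the hardest-tier plateau: on the hardest problems most groups are all-wrong, so rewards have zero variance and every advantage collapses to roughly zero, leaving no learning signal (this interpretation is not stated in the summary itself).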

Result: Accuracy plateaus as problem difficulty increases, revealing capacity boundaries; training only on lower-difficulty problems matches full-dataset accuracy using ~45% fewer steps; GSM8K-trained models generalize better to MATH numeric subset than MATH-trained models.

Conclusion: GRPO primarily reshapes output preferences without reliably improving hardest-tier solving; best gains depend on base model’s prior reasoning competence and dataset difficulty profile; diminishing returns from harder samples in small model regime.

Abstract: Recent alignment work on Large Language Models (LLMs) suggests preference optimization can improve reasoning by shifting probability mass toward better solutions. We test this claim in a resource-constrained setting by applying GRPO with LoRA to SLMs (up to 3B) for math reasoning on GSM8K and MATH datasets with difficulty-stratified analyses. As problem difficulty increases, accuracy plateaus, revealing a capacity boundary: GRPO primarily reshapes output preferences without reliably improving hardest-tier solving. Consistent with this, training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only ~45% of the training steps, indicating diminishing returns from harder samples in this regime. We also find a cross-dataset generalization effect: GSM8K-trained GRPO achieves higher accuracy on the numeric subset of MATH than MATH-trained GRPO, exceeding it by ~5% at 1.5B and by ~3% at 3B. We show that the best achievable gains depend strongly on the base model’s prior reasoning competence and the dataset’s difficulty profile.

[453] Drifting Fields are not Conservative

Leonard Franz, Sebastian Hoffmann, Georg Martius

Main category: cs.LG

TL;DR: Drifting models use vector fields to transport samples toward data distribution, but these fields are generally non-conservative (not gradients of scalar potentials). The paper identifies normalization as the source, proposes alternative normalization to restore conservatism, and finds loss-based training is simpler and equally effective.

Motivation: To understand whether drifting models' sample transport procedure is equivalent to optimizing a scalar loss, and to investigate the relationship between drift fields and conservative vector fields in generative modeling.

Method: Analyzes mathematical properties of drift fields, identifies position-dependent normalization as source of non-conservatism, proposes alternative normalization via sharp kernels to restore conservatism for radial kernels, and compares drift field matching vs loss minimization approaches.

Result: Found drift fields are generally non-conservative (except for Gaussian kernel), identified normalization as the issue, developed alternative normalization that restores conservatism, and observed minimal practical gains from non-conservative transport fields compared to simpler loss-based training.

Conclusion: While drift field matching is more general than loss minimization, practical benefits are minimal, so training drifting models with simpler loss functions is recommended. The Gaussian kernel is uniquely special as it yields conservative drift fields.

Abstract: Drifting models generate high-quality samples in a single forward pass by transporting generated samples toward the data distribution using a vector-valued drift field. We investigate whether this procedure is equivalent to optimizing a scalar loss and find that, in general, it is not: drift fields are not conservative, meaning they cannot be written as the gradient of any scalar potential. We identify the position-dependent normalization as the source of non-conservatism. The Gaussian kernel is the unique exception where the normalization is harmless and the drift field is exactly the gradient of a scalar function. Generalizing this, we propose an alternative normalization via a related kernel (the sharp kernel) which restores conservatism for any radial kernel, yielding well-defined loss functions for training drifting models. While we identify that the drifting field matching objective is strictly more general than loss minimization, as it can implement non-conservative transport fields that no scalar loss can reproduce, we observe that practical gains obtained utilizing this flexibility are minimal. We thus propose to train drifting models with the conceptually simpler loss-based formulations.
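
The paper's central claim, that drift fields generally cannot be written as the gradient of a scalar potential, can be checked numerically: a smooth field is locally conservative iff its Jacobian is symmetric. A rough sketch under that standard criterion (the example fields are illustrative, not the paper's kernel-derived drifts):

```python
def jacobian(f, x, h=1e-5):
    """Central-difference Jacobian J[i][j] = d f_i / d x_j."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(n):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

def asymmetry(J):
    """Max |J_ij - J_ji|: near zero iff the field is locally conservative."""
    n = len(J)
    return max(abs(J[i][j] - J[j][i]) for i in range(n) for j in range(n))

grad_field = lambda x: [2 * x[0], 2 * x[1]]  # gradient of x^2 + y^2: conservative
rot_field = lambda x: [-x[1], x[0]]          # pure rotation: no scalar potential

a_grad = asymmetry(jacobian(grad_field, [0.3, -0.7]))
a_rot = asymmetry(jacobian(rot_field, [0.3, -0.7]))
```

The gradient field's asymmetry vanishes (up to finite-difference error), while the rotation field's does not, so no loss function can reproduce it.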

[454] BiScale-GTR: Fragment-Aware Graph Transformers for Multi-Scale Molecular Representation Learning

Yi Yang, Ovidiu Daescu

Main category: cs.LG

TL;DR: BiScale-GTR is a self-supervised molecular representation learning framework that combines chemically-grounded fragment tokenization with adaptive multi-scale reasoning using a parallel GNN-Transformer architecture.

Motivation: Existing Graph Transformers for molecular property prediction are often GNN-dominated, limiting their global receptive field, and operate at only a single structural granularity, missing multi-scale molecular patterns.

Method: Uses improved graph Byte Pair Encoding tokenization to create chemically valid fragment tokens, then employs a parallel architecture where atom-level GNN representations are pooled into fragment embeddings and fused with fragment token embeddings before Transformer reasoning.

Result: State-of-the-art performance on MoleculeNet, PharmaBench, and LRGB benchmarks across classification and regression tasks, with attribution analysis showing chemically meaningful functional motif highlighting.

Conclusion: BiScale-GTR effectively captures local chemical environments, substructure-level motifs, and long-range molecular dependencies through unified multi-scale representation learning.

Abstract: Graph Transformers have recently attracted attention for molecular property prediction by combining the inductive biases of graph neural networks (GNNs) with the global receptive field of Transformers. However, many existing hybrid architectures remain GNN-dominated, causing the resulting representations to remain heavily shaped by local message passing. Moreover, most existing methods operate at only a single structural granularity, limiting their ability to capture molecular patterns that span multiple molecular scales. We introduce BiScale-GTR, a unified framework for self-supervised molecular representation learning that combines chemically grounded fragment tokenization with adaptive multi-scale reasoning. Our method improves graph Byte Pair Encoding (BPE) tokenization to produce consistent, chemically valid, and high-coverage fragment tokens, which are used as fragment-level inputs to a parallel GNN-Transformer architecture. Architecturally, atom-level representations learned by a GNN are pooled into fragment-level embeddings and fused with fragment token embeddings before Transformer reasoning, enabling the model to jointly capture local chemical environments, substructure-level motifs, and long-range molecular dependencies. Experiments on MoleculeNet, PharmaBench, and the Long Range Graph Benchmark (LRGB) demonstrate state-of-the-art performance across both classification and regression tasks. Attribution analysis further shows that BiScale-GTR highlights chemically meaningful functional motifs, providing interpretable links between molecular structure and predicted properties. Code will be released upon acceptance.
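
The fusion step in the method, pooling atom-level GNN outputs into fragment-level embeddings before combining with fragment token embeddings, reduces to a scatter-mean plus an elementwise merge. A toy sketch with invented atom embeddings and fragment assignments; the real model's dimensions, fusion operator, and BPE tokenizer are not specified here.

```python
def pool_fragments(atom_emb, frag_assign, n_frags):
    """Mean-pool atom-level embeddings into fragment-level embeddings."""
    dim = len(atom_emb[0])
    sums = [[0.0] * dim for _ in range(n_frags)]
    counts = [0] * n_frags
    for vec, f in zip(atom_emb, frag_assign):
        counts[f] += 1
        for d in range(dim):
            sums[f][d] += vec[d]
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]

def fuse(pooled, token_emb):
    """Fuse pooled GNN embeddings with fragment token embeddings
    (elementwise sum here, purely for illustration)."""
    return [[a + b for a, b in zip(p, t)] for p, t in zip(pooled, token_emb)]

atom_emb = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]  # 3 atoms, dim 2
frag_assign = [0, 0, 1]                          # atoms 0,1 -> frag 0; atom 2 -> frag 1
pooled = pool_fragments(atom_emb, frag_assign, 2)
fused = fuse(pooled, [[0.5, 0.5], [0.0, 1.0]])
```

The fused fragment embeddings would then be the Transformer's input sequence, letting attention reason over chemically meaningful units rather than raw atoms.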

[455] Bi-Level Optimization for Single Domain Generalization

Marzi Heidari, Hanping Zhang, Hao Yan, Yuhong Guo

Main category: cs.LG

TL;DR: BiSDG: A bi-level optimization framework for Single Domain Generalization that decouples task learning from domain modeling using domain prompts and surrogate domains.

Motivation: Addressing the challenge of Single Domain Generalization (SDG): generalizing from a single labeled source domain to unseen target domains without access to target data during training, which remains a fundamental problem in robust machine learning.

Method: Proposes BiSDG with bi-level optimization: 1) Simulates distribution shifts via surrogate domains created through label-preserving transformations of source data; 2) Uses domain prompt encoder to generate lightweight modulation signals for feature-wise linear modulation; 3) Inner objective optimizes task performance under fixed prompts; 4) Outer objective maximizes generalization across surrogate domains by updating domain prompt encoder; 5) Includes gradient approximation for efficient training without second-order derivatives.

Result: Extensive experiments on various SDG benchmarks show BiSDG consistently outperforms prior methods, setting new state-of-the-art performance in the SDG setting.

Conclusion: BiSDG effectively addresses the SDG challenge through bi-level optimization that explicitly separates task learning from domain modeling, demonstrating superior generalization capabilities compared to existing approaches.

Abstract: Generalizing from a single labeled source domain to unseen target domains, without access to any target data during training, remains a fundamental challenge in robust machine learning. We address this underexplored setting, known as Single Domain Generalization (SDG), by proposing BiSDG, a bi-level optimization framework that explicitly decouples task learning from domain modeling. BiSDG simulates distribution shifts through surrogate domains constructed via label-preserving transformations of the source data. To capture domain-specific context, we propose a domain prompt encoder that generates lightweight modulation signals to produce augmenting features via feature-wise linear modulation. The learning process is formulated as a bi-level optimization problem: the inner objective optimizes task performance under fixed prompts, while the outer objective maximizes generalization across the surrogate domains by updating the domain prompt encoder. We further develop a practical gradient approximation scheme that enables efficient bi-level training without second-order derivatives. Extensive experiments on various SDG benchmarks demonstrate that BiSDG consistently outperforms prior methods, setting new state-of-the-art performance in the SDG setting.
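
The bi-level structure can be illustrated with a toy scalar problem: an inner step updates the task parameter under a fixed prompt parameter, and an outer step updates the prompt parameter using a first-order gradient that treats the task parameter as constant (the trick that avoids second-order derivatives). Everything below, the quadratic loss, the scalar "surrogate domains", and the learning rates, is invented for illustration; BiSDG's actual prompt encoder and feature-wise modulation are far richer.

```python
def train_bilevel(domains, steps=200, lr_in=0.1, lr_out=0.05):
    """Alternating bi-level updates with a first-order outer gradient:
    theta (task parameter) is treated as a constant when updating
    phi (domain-prompt parameter), so no second-order terms appear."""
    theta, phi = 2.0, 0.0
    for _ in range(steps):
        # Inner objective: task loss (theta - d - phi)^2 averaged over surrogate domains.
        g_theta = sum(2 * (theta - d - phi) for d in domains) / len(domains)
        theta -= lr_in * g_theta
        # Outer objective: generalization across surrogate domains, first-order in phi.
        g_phi = sum(-2 * (theta - d - phi) for d in domains) / len(domains)
        phi -= lr_out * g_phi
    return theta, phi

domains = [-1.0, 0.0, 1.0]  # toy surrogate domains (label-preserving shifts)
theta, phi = train_bilevel(domains)
avg_loss = sum((theta - d - phi) ** 2 for d in domains) / len(domains)
```

In this toy the residual loss converges to the irreducible variance of the surrogate domains, which is what any single (theta, phi) pair can achieve at best.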

[456] Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks

Guillaume Corlouer, Avi Semler, Alexander Strang, Alexander Gietelink Oldenziel

Main category: cs.LG

TL;DR: SGD dynamics in deep linear networks during saddle-to-saddle regime: anisotropic noise modeled via Langevin dynamics, decomposed into per-mode SDEs showing maximal diffusion precedes feature learning, with stationary distributions matching gradient flow or Boltzmann distributions.

Motivation: While gradient descent in deep linear networks exhibits saddle-to-saddle dynamics, the impact of SGD noise on this regime remains poorly understood. The paper aims to investigate how stochastic gradient descent noise affects training dynamics in the saddle-to-saddle regime of deep linear networks.

Method: Model training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under aligned and balanced weights assumption, derive exact decomposition into one-dimensional per-mode stochastic differential equations. Analyze stationary distributions for cases with and without label noise.

Result: Established that maximal diffusion along a mode precedes the corresponding feature being completely learned. Stationary distribution without label noise coincides with gradient flow stationary distribution; with label noise approximates Boltzmann distribution. Experimental confirmation shows theoretical results hold qualitatively even without aligned/balanced weights.

Conclusion: SGD noise encodes information about feature learning progression but doesn’t fundamentally alter saddle-to-saddle dynamics. The analysis provides theoretical understanding of SGD dynamics in deep linear networks during the saddle-to-saddle regime.

Abstract: Deep linear networks (DLNs) are used as an analytically tractable model of the training dynamics of deep neural networks. While gradient descent in DLNs is known to exhibit saddle-to-saddle dynamics, the impact of stochastic gradient descent (SGD) noise on this regime remains poorly understood. We investigate the dynamics of SGD during training of DLNs in the saddle-to-saddle regime. We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. We also derive the stationary distribution of SGD for each mode: in the absence of label noise, its marginal distribution along specific features coincides with the stationary distribution of gradient flow, while in the presence of label noise it approximates a Boltzmann distribution. Finally, we confirm experimentally that the theoretical results hold qualitatively even without aligned or balanced weights. These results establish that SGD noise encodes information about the progression of feature learning but does not fundamentally alter the saddle-to-saddle dynamics.
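
The per-mode picture can be simulated directly with Euler-Maruyama. Below is a generic illustrative one-dimensional mode SDE with state-dependent (multiplicative) noise, not the paper's exact per-mode equation: the mode grows sigmoidally from unlearned (near 0) to learned (near 1), and because the noise amplitude is tied to the drift, diffusion peaks mid-trajectory, before the feature is fully learned.

```python
import random

def simulate_mode(T=2000, dt=0.01, noise_scale=0.05, seed=0):
    """Euler-Maruyama for a toy single-mode SDE with state-dependent noise:
    dm = m(1 - m) dt + noise_scale * m(1 - m) dW.
    The mode value m rises from ~0 (unlearned) to ~1 (learned); the diffusion
    term is largest mid-trajectory and vanishes once the mode saturates."""
    rng = random.Random(seed)
    m = 0.01
    path = [m]
    for _ in range(T):
        drift = m * (1 - m)
        diff = noise_scale * m * (1 - m)
        m += drift * dt + diff * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        path.append(m)
    return path

path = simulate_mode()
```

Because the noise is proportional to m(1 - m), fluctuations are negligible at both plateaus and maximal in between, mirroring the claim that maximal diffusion along a mode precedes that feature being completely learned.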

[457] The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

Rishab Balasubramanian, Pin-Jie Lin, Rituraj Sharma, Anjie Fang, Fardin Abdi, Viktor Rozgic, Zheng Du, Mohit Bansal, Tu Vu

Main category: cs.LG

TL;DR: UNLOCK enables training-free transfer of post-trained capabilities (like reasoning) across different-sized models by extracting and aligning capability directions in activation space.

Motivation: Current methods for transferring capabilities between models require retraining or fine-tuning, which is computationally expensive. The authors investigate whether post-trained capabilities can be transferred without retraining, particularly across different model scales.

Method: Proposes UNLOCK framework based on Master Key Hypothesis: capabilities correspond to directions in low-dimensional latent subspace. Extracts capability direction by contrasting activations between capability-present and capability-absent source variants, aligns it with target model through low-rank linear transformation, and applies at inference time.

Result: Significant improvements in reasoning capabilities without training: transferring CoT reasoning from Qwen1.5-14B to Qwen1.5-7B yields 12.1% accuracy gain on MATH; transferring mathematical reasoning from Qwen3-4B-Base to Qwen3-14B-Base improves AGIEval Math accuracy from 61.1% to 71.3%, surpassing post-trained model’s 67.8%.

Conclusion: Capabilities can be transferred across models without retraining via linear alignment of latent directions. Success depends on pre-learned capabilities, and intervention amplifies latent capabilities by sharpening output distribution toward successful reasoning trajectories.

Abstract: We investigate whether post-trained capabilities can be transferred across models without retraining, with a focus on transfer across different model scales. We propose the Master Key Hypothesis, which states that model capabilities correspond to directions in a low-dimensional latent subspace that induce specific behaviors and are transferable across models through linear alignment. Based on this hypothesis, we introduce UNLOCK, a training-free and label-free framework that extracts a capability direction by contrasting activations between capability-present and capability-absent Source variants, aligns it with a Target model through a low-rank linear transformation, and applies it at inference time to elicit the behavior. Experiments on reasoning behaviors, including Chain-of-Thought (CoT) and mathematical reasoning, demonstrate substantial improvements across model scales without training. For example, transferring CoT reasoning from Qwen1.5-14B to Qwen1.5-7B yields an accuracy gain of 12.1% on MATH, and transferring a mathematical reasoning direction from Qwen3-4B-Base to Qwen3-14B-Base improves AGIEval Math accuracy from 61.1% to 71.3%, surpassing the 67.8% achieved by the 14B post-trained model. Our analysis shows that the success of transfer depends on the capabilities learned during pre-training, and that our intervention amplifies latent capabilities by sharpening the output distribution toward successful reasoning trajectories.
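
The extraction step can be sketched as an activation-difference ("steering vector") computation: average hidden states from capability-present and capability-absent source variants, take the normalized difference, and add it to a hidden state at inference. This toy omits UNLOCK's low-rank alignment between source and target activation spaces, and the 2-d activations are fabricated.

```python
def mean_vec(rows):
    """Mean of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    return [sum(r[i] for r in rows) / n for i in range(d)]

def capability_direction(acts_with, acts_without):
    """Contrast mean activations of capability-present vs capability-absent
    source variants to get a unit-norm candidate capability direction."""
    mw, mo = mean_vec(acts_with), mean_vec(acts_without)
    diff = [a - b for a, b in zip(mw, mo)]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def steer(hidden, direction, alpha=1.0):
    """Add the scaled direction to a hidden state at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

acts_with = [[1.0, 0.0], [1.2, 0.2]]     # fabricated activations, capability present
acts_without = [[0.0, 0.0], [0.2, 0.2]]  # fabricated activations, capability absent
d = capability_direction(acts_with, acts_without)
h = steer([0.5, 0.5], d, alpha=2.0)
```

In UNLOCK a learned low-rank linear map would first carry this direction from the source model's activation space into the target model's before steering.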

[458] Toward a universal foundation model for graph-structured data

Sakib Mostafa, Lei Xing, Md. Tauhidul Islam

Main category: cs.LG

TL;DR: A graph foundation model for biomedical applications that learns transferable structural representations using feature-agnostic graph properties as structural prompts, enabling superior generalization across diverse biological networks.

Motivation: Current graph neural networks are typically trained on single datasets and learn representations specific to particular node features, topology, and label spaces, limiting transferability across domains. This is especially problematic in biology and medicine where networks vary substantially across cohorts, assays, and institutions.

Method: Leverages feature-agnostic graph properties (degree statistics, centrality measures, community structure indicators, diffusion-based signatures) and encodes them as structural prompts. These prompts are integrated with a message-passing backbone to embed diverse graphs into a shared representation space. The model is pretrained once on heterogeneous graphs and reused on unseen datasets with minimal adaptation.

Result: The pretrained model matches or exceeds strong supervised baselines across multiple benchmarks while demonstrating superior zero-shot and few-shot generalization on held-out graphs. On the SagePPI benchmark, supervised fine-tuning achieves mean ROC-AUC of 95.5%, a 21.8% gain over the best supervised message-passing baseline.

Conclusion: The proposed technique provides a unique approach toward reusable, foundation-scale models for graph-structured data in biomedical and network science applications, addressing the lack of broadly reusable foundation models for graph analysis comparable to those in language and vision.

Abstract: Graphs are a central representation in biomedical research, capturing molecular interaction networks, gene regulatory circuits, cell–cell communication maps, and knowledge graphs. Despite their importance, currently there is not a broadly reusable foundation model available for graph analysis comparable to those that have transformed language and vision. Existing graph neural networks are typically trained on a single dataset and learn representations specific only to that graph’s node features, topology, and label space, limiting their ability to transfer across domains. This lack of generalization is particularly problematic in biology and medicine, where networks vary substantially across cohorts, assays, and institutions. Here we introduce a graph foundation model designed to learn transferable structural representations that are not tied to specific node identities or feature schemes. Our approach leverages feature-agnostic graph properties, including degree statistics, centrality measures, community structure indicators, and diffusion-based signatures, and encodes them as structural prompts. These prompts are integrated with a message-passing backbone to embed diverse graphs into a shared representation space. The model is pretrained once on heterogeneous graphs and subsequently reused on unseen datasets with minimal adaptation. Across multiple benchmarks, our pretrained model matches or exceeds strong supervised baselines while demonstrating superior zero-shot and few-shot generalization on held-out graphs. On the SagePPI benchmark, supervised fine-tuning of the pretrained backbone achieves a mean ROC-AUC of 95.5%, a gain of 21.8% over the best supervised message-passing baseline. The proposed technique thus provides a unique approach toward reusable, foundation-scale models for graph-structured data in biomedical and network science applications.
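
The structural-prompt idea, describing a graph only through feature-agnostic properties so that any graph maps into the same input space, can be sketched with two of the cited property families: degree statistics and a diffusion-based signature. The exact property set, kernels, and prompt dimensionality used by the model are not specified here; this is a minimal illustration.

```python
def degree_stats(adj):
    """Degree statistics of an adjacency-list graph: mean, std, max, min."""
    degs = [len(nbrs) for nbrs in adj]
    n = len(degs)
    mean = sum(degs) / n
    var = sum((d - mean) ** 2 for d in degs) / n
    return [mean, var ** 0.5, max(degs), min(degs)]

def return_signature(adj, steps=2):
    """Average k-step random-walk return probability: a simple
    diffusion-based, feature-agnostic structural signature."""
    n = len(adj)
    total = 0.0
    for s in range(n):
        probs = [0.0] * n
        probs[s] = 1.0
        for _ in range(steps):
            nxt = [0.0] * n
            for u, p in enumerate(probs):
                if p and adj[u]:
                    share = p / len(adj[u])
                    for v in adj[u]:
                        nxt[v] += share
            probs = nxt
        total += probs[s]
    return total / n

def structural_prompt(adj):
    """Concatenate feature-agnostic properties into one prompt vector."""
    return degree_stats(adj) + [return_signature(adj)]

# A 4-cycle: every node has degree 2, so the stats are degenerate.
cycle = [[1, 3], [0, 2], [1, 3], [0, 2]]
prompt = structural_prompt(cycle)
```

Because the prompt depends only on topology, two graphs from different assays or cohorts with entirely different node feature schemes still land in a comparable representation space.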

[459] Bridging Theory and Practice in Crafting Robust Spiking Reservoirs

Ruggero Freddi, Nicolas Seseri, Diana Nigrisoli, Alessio Basti

Main category: cs.LG

TL;DR: Spiking reservoir computing study introduces robustness interval to measure hyperparameter ranges for stable edge-of-chaos operation, showing consistent trends with network sparsity and firing thresholds, validating mean-field critical point as reliable starting coordinate.

Motivation: Spiking reservoir computing offers energy-efficient temporal processing but faces challenges in reliably tuning reservoirs to operate at the edge-of-chaos due to experimental uncertainty. The work aims to bridge abstract criticality concepts with practical stability measures.

Method: Introduces robustness interval as an operational measure of hyperparameter ranges maintaining performance above task thresholds. Systematically evaluates Leaky Integrate-and-Fire architectures on static (MNIST) and temporal (synthetic Ball Trajectories) tasks, analyzing trends across network configurations including presynaptic connection density and firing threshold. Also conducts control experiments on Erdős-Rényi graphs.

Result: Identifies consistent monotonic trends: robustness-interval width decreases with presynaptic connection density (directly with sparsity) and increases with firing threshold. Finds specific parameter pairs preserving analytical mean-field critical point, revealing iso-performance manifolds. Shows phenomena persist beyond small-world topologies. Validates mean-field critical point as consistently falling within empirical high-performance regions.

Conclusion: The robustness interval provides practical stability measure for spiking reservoir computing, with mean-field critical point serving as robust starting coordinate for parameter search and fine-tuning. The approach bridges theoretical criticality with practical implementation concerns.

Abstract: Spiking reservoir computing provides an energy-efficient approach to temporal processing, but reliably tuning reservoirs to operate at the edge-of-chaos is challenging due to experimental uncertainty. This work bridges abstract notions of criticality and practical stability by introducing and exploiting the robustness interval, an operational measure of the hyperparameter range over which a reservoir maintains performance above task-dependent thresholds. Through systematic evaluations of Leaky Integrate-and-Fire (LIF) architectures on both static (MNIST) and temporal (synthetic Ball Trajectories) tasks, we identify consistent monotonic trends in the robustness interval across a broad spectrum of network configurations: the robustness-interval width decreases with presynaptic connection density $\beta$ (i.e., it grows with sparsity) and increases with the firing threshold $\theta$. We further identify specific $(\beta, \theta)$ pairs that preserve the analytical mean-field critical point $w_{\text{crit}}$, revealing iso-performance manifolds in the hyperparameter space. Control experiments on Erdős-Rényi graphs show the phenomena persist beyond small-world topologies. Finally, our results show that $w_{\text{crit}}$ consistently falls within empirical high-performance regions, validating $w_{\text{crit}}$ as a robust starting coordinate for parameter search and fine-tuning. To ensure reproducibility, the full Python code is publicly available.
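
Operationally, the robustness interval is just the widest contiguous hyperparameter range whose task score stays above a threshold. A minimal sketch over a 1-d sweep; the grid, scores, and threshold below are invented, while the paper's actual sweeps involve the weight scale around $w_{\text{crit}}$ across $(\beta, \theta)$ configurations.

```python
def robustness_interval(grid, scores, threshold):
    """Widest contiguous hyperparameter range whose score stays at or above
    the task threshold; its width operationalizes robustness."""
    best = (0.0, None, None)
    i, n = 0, len(grid)
    while i < n:
        if scores[i] >= threshold:
            j = i
            while j + 1 < n and scores[j + 1] >= threshold:
                j += 1
            width = grid[j] - grid[i]
            if width > best[0] or best[1] is None:
                best = (width, grid[i], grid[j])
            i = j + 1
        else:
            i += 1
    return best

grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]        # hypothetical hyperparameter sweep
scores = [0.5, 0.8, 0.9, 0.85, 0.6, 0.4]     # hypothetical task performance
width, lo, hi = robustness_interval(grid, scores, threshold=0.8)
```

A wider interval means the reservoir tolerates more experimental uncertainty around its operating point; checking whether the analytical critical point lands inside it is then a single membership test.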

[460] ODE-free Neural Flow Matching for One-Step Generative Modeling

Xiao Shou

Main category: cs.LG

TL;DR: OT-NFM: ODE-free generative framework using neural flows for one-step generation via optimal transport pairings, avoiding mean collapse issues of naive flow-map training.

Motivation: Current diffusion and flow matching models require many network evaluations (tens to hundreds) at inference time, which is computationally expensive. The authors aim to enable true one-step generation with a single forward pass by learning the transport map directly instead of time-dependent vector fields.

Method: Propose Optimal Transport Neural Flow Matching (OT-NFM), an ODE-free framework that parameterizes the flow map with neural flows. Address mean collapse problem in naive flow-map training by proving consistent coupling is necessary and using optimal transport pairings with scalable minibatch and online coupling strategies.

Result: Experiments on synthetic benchmarks and image generation tasks (MNIST and CIFAR-10) demonstrate competitive sample quality while reducing inference to a single network evaluation.

Conclusion: OT-NFM enables efficient one-step generation with competitive quality, addressing computational bottlenecks of diffusion models through optimal transport-based coupling strategies.

Abstract: Diffusion and flow matching models generate samples by learning time-dependent vector fields whose integration transports noise to data, requiring tens to hundreds of network evaluations at inference. We instead learn the transport map directly. We propose Optimal Transport Neural Flow Matching (OT-NFM), an ODE-free generative framework that parameterizes the flow map with neural flows, enabling true one-step generation with a single forward pass. We show that naive flow-map training suffers from mean collapse, where inconsistent noise-data pairings drive all outputs toward the data mean. We prove that consistent coupling is necessary for non-degenerate learning and address this using optimal transport pairings with scalable minibatch and online coupling strategies. Experiments on synthetic benchmarks and image generation tasks (MNIST and CIFAR-10) demonstrate competitive sample quality while reducing inference to a single network evaluation.
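
The coupling fix for mean collapse is to pair each noise sample with a data sample via optimal transport before regressing the flow map. For intuition, here is an exact brute-force minibatch coupling in one dimension; real implementations use Hungarian or Sinkhorn solvers plus the paper's online strategy, and the batch values below are invented.

```python
from itertools import permutations

def ot_pairing(noise, data):
    """Exact minibatch OT coupling under squared cost, by brute force over
    permutations (feasible only for tiny batches; illustrative only)."""
    def cost(perm):
        return sum((noise[i] - data[j]) ** 2 for i, j in enumerate(perm))
    return min(permutations(range(len(data))), key=cost)

noise = [0.0, 1.0, 2.0]
data = [2.1, 0.1, 0.9]
perm = ot_pairing(noise, data)
paired = [data[j] for j in perm]  # each noise sample's consistent target
```

With random (inconsistent) pairings, a one-step map regressed on these pairs would be pulled toward the batch mean; the OT pairing gives each noise point a stable nearby target, which is exactly the consistency the paper proves necessary.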

[461] Neural Computers

Mingchen Zhuge, Changsheng Zhao, Haozhe Liu, Zijian Zhou, Shuming Liu, Wenyi Wang, Ernie Chang, Gael Le Lan, Junjie Fei, Wenxuan Zhang, Yasheng Sun, Zhipeng Cai, Zechun Liu, Yunyang Xiong, Yining Yang, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

Main category: cs.LG

TL;DR: Neural Computers (NCs) propose a new computing paradigm where the model itself becomes the running computer, unifying computation, memory, and I/O in learned runtime states, with the goal of creating Completely Neural Computers (CNCs) as general-purpose learned machines.

Motivation: The paper aims to move beyond conventional computers (which execute explicit programs), agents (which act over external environments), and world models (which learn environment dynamics) by creating a new machine form where the model itself serves as the running computer, unifying all computational elements in a learned state.

Method: The authors instantiate Neural Computers as video models that roll out screen frames from instructions, pixels, and user actions in CLI and GUI settings. They study whether early NC primitives can be learned solely from collected I/O traces without instrumented program state.

Result: The implementations show that learned runtimes can acquire early interface primitives, particularly I/O alignment and short-horizon control. However, routine reuse, controlled updates, and symbolic stability remain open challenges.

Conclusion: The paper outlines a roadmap toward Completely Neural Computers (CNCs) and suggests that if the identified challenges can be overcome, CNCs could establish a new computing paradigm beyond today’s agents, world models, and conventional computers.

Abstract: We propose a new frontier: Neural Computers (NCs) – an emerging machine form that unifies computation, memory, and I/O in a learned runtime state. Unlike conventional computers, which execute explicit programs, agents, which act over external execution environments, and world models, which learn environment dynamics, NCs aim to make the model itself the running computer. Our long-term goal is the Completely Neural Computer (CNC): the mature, general-purpose realization of this emerging machine form, with stable execution, explicit reprogramming, and durable capability reuse. As an initial step, we study whether early NC primitives can be learned solely from collected I/O traces, without instrumented program state. Concretely, we instantiate NCs as video models that roll out screen frames from instructions, pixels, and user actions (when available) in CLI and GUI settings. These implementations show that learned runtimes can acquire early interface primitives, especially I/O alignment and short-horizon control, while routine reuse, controlled updates, and symbolic stability remain open. We outline a roadmap toward CNCs around these challenges. If overcome, CNCs could establish a new computing paradigm beyond today’s agents, world models, and conventional computers.

[462] The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

Yi Xu, Philipp Jettkant, Laura Ruis

Main category: cs.LG

TL;DR: LLMs have limited latent multi-step planning capabilities - tiny transformers handle 3 steps, GPT-4o/Qwen3-32B handle 5, GPT-5.4 handles 7, but training only reaches 5 steps maximum, revealing dissociation between strategy discovery and execution.

Motivation: Chain-of-thought monitoring assumes models can't reason effectively in latent representations, but little is known about the actual limits of latent reasoning in LLMs. The paper investigates whether models can discover and execute multi-step planning strategies without supervision on intermediate steps.

Method: Used graph path-finding tasks that precisely control the number of required latent planning steps. Tested models ranging from tiny transformers trained from scratch to fine-tuned GPT-4o, Qwen3-32B, and GPT-5.4 under few-shot prompting. Measured maximum latent planning depth models could learn during training versus execute at test-time.

Result: Tiny transformers discovered strategies requiring up to 3 latent steps, fine-tuned GPT-4o and Qwen3-32B reached 5, and GPT-5.4 attained 7 under few-shot prompting. Maximum latent planning depth learned during training was 5, but discovered strategies generalized up to 8 latent steps at test-time, revealing dissociation between discovery and execution abilities.

Conclusion: There are fundamental limits to latent multi-step planning in LLMs. Strategies requiring multiple coordinated latent planning steps may need to be explicitly taught or externalized, supporting the viability of chain-of-thought monitoring approaches.

Abstract: The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph path-finding tasks that precisely control the number of required latent planning steps, we uncover a striking limitation unresolved by massive scaling: tiny transformers trained from scratch discover strategies requiring up to three latent steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 attains seven under few-shot prompting. Although the maximum latent planning depth models can learn during training is five, the discovered strategy generalizes up to eight latent steps at test-time. This reveals a dissociation between the ability to discover a latent strategy under final-answer supervision alone and the ability to execute it once discovered. If similar limits hold more broadly, strategies requiring multiple coordinated latent planning steps may need to be explicitly taught or externalized, lending credence to CoT monitoring.
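
The task construction is easy to reproduce: the required latent planning depth is just the shortest-path length in the query graph, computable with BFS. A sketch with a made-up graph; the paper's exact generator and distractor structure are not described in this summary.

```python
from collections import deque

def path_length(edges, start, goal):
    """BFS shortest-path length: the number of latent planning steps a model
    must chain within one forward pass to answer the path query."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        if u == goal:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None  # goal unreachable

# A chain a -> b -> c -> d plus a distractor edge: "a to d" needs 3 steps.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "x")]
steps = path_length(edges, "a", "d")
```

Controlling this depth is what lets the paper place each model on the 3 / 5 / 7-step ceiling it reports.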

[463] Quality-preserving Model for Electronics Production Quality Tests Reduction

Noufa Haneefa, Teddy Lazebnik, Einav Peretz-Andersson

Main category: cs.LG

TL;DR: Adaptive test-selection framework combining offline greedy set cover with online Thompson-sampling bandit to dynamically switch between full and reduced test plans in electronics manufacturing, reducing test time while maintaining quality.

Motivation: Traditional manufacturing test flows are fixed during development and executed unchanged on every unit, imposing unnecessary test costs while failing to adapt to evolving failure patterns and process conditions. Existing data-driven methods optimize static test subsets but don't adapt online to changing defect distributions or explicitly control escape risk.

Method: Combines offline minimum-cost diagnostic subset construction using greedy set cover with an online Thompson-sampling multi-armed bandit that switches between full and reduced test plans using a rolling process-stability signal.

Result: Offline analysis identified zero-escape reduced plans cutting test time by 18.78% in Functional Circuit Test and 91.57% in End-of-Line testing. Under temporal validation with real concept drift, static reduction produced 110 escaped defects in Functional Circuit Test and 8 in End-of-Line, while adaptive policy reduced escapes to zero by reverting to fuller coverage when instability emerged.

Conclusion: Online learning can preserve manufacturing quality while reducing test burden, offering a practical route to adaptive test planning across production domains with both economic and logistics improvements for companies.

Abstract: Manufacturing test flows in high-volume electronics production are typically fixed during product development and executed unchanged on every unit, even as failure patterns and process conditions evolve. This protects quality, but it also imposes unnecessary test cost, while existing data-driven methods mostly optimize static test subsets and neither adapt online to changing defect distributions nor explicitly control escape risk. In this study, we present an adaptive test-selection framework that combines offline minimum-cost diagnostic subset construction using greedy set cover with an online Thompson-sampling multi-armed bandit that switches between full and reduced test plans using a rolling process-stability signal. We evaluate the framework on two printed circuit board assembly stages, Functional Circuit Test and End-of-Line test, covering 28,000 board runs. Offline analysis identified zero-escape reduced plans that cut test time by 18.78% in Functional Circuit Test and 91.57% in End-of-Line testing. Under temporal validation with real concept drift, static reduction produced 110 escaped defects in Functional Circuit Test and 8 in End-of-Line, whereas the adaptive policy reduced escapes to zero by reverting to fuller coverage when instability emerged in practice. These results show that online learning can preserve manufacturing quality while reducing test burden, offering a practical route to adaptive test planning across production domains, with both economic and logistical benefits for companies.
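
Both components of the framework are classical and easy to sketch: a greedy set cover builds the reduced offline plan, and a Beta-Bernoulli Thompson sampler arbitrates online between the full and reduced plans. Test names, defect classes, and pass/fail counts below are fabricated, and the real system additionally conditions on a rolling process-stability signal.

```python
import random

def greedy_set_cover(defects, tests):
    """Offline: greedy set cover -- repeatedly pick the test that detects
    the most still-uncovered defect types until all are covered."""
    uncovered = set(defects)
    plan = []
    while uncovered:
        name, covered = max(tests.items(), key=lambda kv: len(uncovered & kv[1]))
        if not uncovered & covered:
            break  # remaining defects are undetectable by any available test
        plan.append(name)
        uncovered -= covered
    return plan

def thompson_choose(stats, rng):
    """Online: Thompson sampling over plans; each arm holds a
    Beta(successes + 1, failures + 1) belief over 'no escaped defect'."""
    samples = {arm: rng.betavariate(s + 1, f + 1) for arm, (s, f) in stats.items()}
    return max(samples, key=samples.get)

# Hypothetical tests and the defect classes each can detect.
tests = {"t1": {"short", "open"}, "t2": {"open"}, "t3": {"drift", "short"}}
plan = greedy_set_cover({"short", "open", "drift"}, tests)

rng = random.Random(0)
stats = {"full": (50, 0), "reduced": (30, 10)}  # (successes, failures) per plan
choice = thompson_choose(stats, rng)
```

When the reduced plan starts letting defects escape, its Beta posterior shifts toward failure and the sampler reverts to the full plan, which is the zero-escape reversion behavior reported in the paper.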

[464] Weighted Bayesian Conformal Prediction

Xiayin Lou, Peng Luo

Main category: cs.LG

TL;DR: WBCP extends Bayesian Conformal Prediction to handle distribution shift via importance-weighted Dirichlet posteriors, providing data-conditional guarantees and richer uncertainty quantification beyond i.i.d. assumptions.

DetailsMotivation: Existing Bayesian Conformal Prediction (BQ-CP) requires i.i.d. assumptions, while weighted conformal prediction handles distribution shift but only provides frequentist point estimates. There's a need to combine Bayesian uncertainty quantification with distribution shift handling.

Method: Proposes Weighted Bayesian Conformal Prediction (WBCP) that replaces uniform Dirichlet distribution in BQ-CP with weighted Dirichlet using importance weights and Kish’s effective sample size. Extends to spatial prediction as Geographical BQ-CP with kernel-based spatial weights.

Result: Theoretical proofs show: (1) effective sample size uniquely matches frequentist/Bayesian variances, (2) posterior standard deviation decays as O(1/√neff), (3) extends stochastic dominance guarantees, (4) HPD threshold improves conditional coverage. Experiments on spatial datasets maintain coverage with richer uncertainty.

Conclusion: WBCP successfully generalizes Bayesian conformal prediction to non-i.i.d. settings, providing both coverage guarantees and Bayesian uncertainty quantification under distribution shift, with practical applications in spatial prediction.

Abstract: Conformal prediction provides distribution-free prediction intervals with finite-sample coverage guarantees, and recent work by Snell & Griffiths reframes it as Bayesian Quadrature (BQ-CP), yielding powerful data-conditional guarantees via Dirichlet posteriors over thresholds. However, BQ-CP fundamentally requires the i.i.d. assumption – a limitation the authors themselves identify. Meanwhile, weighted conformal prediction handles distribution shift via importance weights but remains frequentist, producing only point-estimate thresholds. We propose \textbf{Weighted Bayesian Conformal Prediction (WBCP)}, which generalizes BQ-CP to arbitrary importance-weighted settings by replacing the uniform Dirichlet $\mathrm{Dir}(1,\ldots,1)$ with a weighted Dirichlet $\mathrm{Dir}(n_{\mathrm{eff}} \cdot \tilde{w}_1, \ldots, n_{\mathrm{eff}} \cdot \tilde{w}_n)$, where $n_{\mathrm{eff}}$ is Kish’s effective sample size. We prove four theoretical results: (1)~$n_{\mathrm{eff}}$ is the unique concentration parameter matching frequentist and Bayesian variances; (2)~posterior standard deviation decays as $O(1/\sqrt{n_{\mathrm{eff}}})$; (3)~BQ-CP’s stochastic dominance guarantee extends to per-weight-profile data-conditional guarantees; (4)~the HPD threshold provides $O(1/\sqrt{n_{\mathrm{eff}}})$ improvement in conditional coverage. We instantiate WBCP for spatial prediction as \emph{Geographical BQ-CP}, where kernel-based spatial weights yield per-location posteriors with interpretable diagnostics. Experiments on synthetic and real-world spatial datasets demonstrate that WBCP maintains coverage guarantees while providing substantially richer uncertainty information.

[465] Conformal Margin Risk Minimization: An Envelope Framework for Robust Learning under Label Noise

Yuanjie Shi, Peihong Li, Zijian Zhang, Janardhan Rao Doppa, Yan Yan

Main category: cs.LG

TL;DR: CMRM is a plug-and-play framework that improves classification under label noise by adding a conformal quantile-regularized margin term, requiring no privileged knowledge or pipeline changes.

DetailsMotivation: Existing methods for learning with noisy labels require privileged knowledge like noise transition matrices or clean subsets, which are often unavailable when robustness is most needed.

Method: CMRM adds a single quantile-calibrated regularization term to any classification loss. It measures confidence margins between observed and competing labels, thresholds them with a conformal quantile estimated per batch to focus on high-margin samples while suppressing likely mislabeled ones.
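
The margin-and-quantile gating step can be sketched directly. This is an illustrative reading of the method description (the paper adds the term as a regularizer on a base loss); the function name and the gating-to-zero choice are our assumptions.

```python
import numpy as np

def cmrm_mask(logits, labels, alpha=0.2):
    """Per-batch margin gating in the spirit of CMRM.

    margin_i = logit(observed label) - max competing logit.  Samples
    whose margin falls below the batch-level alpha-quantile are treated
    as likely mislabeled and down-weighted to zero.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    observed = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf
    competing = masked.max(axis=1)
    margin = observed - competing
    tau = np.quantile(margin, alpha)     # conformal-style batch quantile
    return (margin >= tau).astype(float), margin
```

The returned mask multiplies the per-sample loss, so training focuses on high-margin samples while suppressing likely mislabeled ones.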

Result: Across five base methods and six benchmarks with synthetic and real-world noise, CMRM consistently improves accuracy (up to +3.39%), reduces conformal prediction set size (up to -20.44%), and doesn’t hurt performance under 0% noise.

Conclusion: CMRM captures a method-agnostic uncertainty signal that existing mechanisms didn’t exploit, providing a practical plug-and-play solution for learning with noisy labels without requiring privileged knowledge.

Abstract: Most methods for learning with noisy labels require privileged knowledge such as noise transition matrices, clean subsets or pretrained feature extractors, resources typically unavailable when robustness is most needed. We propose Conformal Margin Risk Minimization (CMRM), a plug-and-play envelope framework that improves any classification loss under label noise by adding a single quantile-calibrated regularization term, with no privileged knowledge or training pipeline modification. CMRM measures the confidence margin between the observed label and competing labels, and thresholds it with a conformal quantile estimated per batch to focus training on high-margin samples while suppressing likely mislabeled ones. We derive a learning bound for CMRM under arbitrary label noise requiring only mild regularity of the margin distribution. Across five base methods and six benchmarks with synthetic and real-world noise, CMRM consistently improves accuracy (up to +3.39%), reduces conformal prediction set size (up to -20.44%) and does not hurt under 0% noise, showing that CMRM captures a method-agnostic uncertainty signal that existing mechanisms did not exploit.

[466] MICA: Multivariate Infini Compressive Attention for Time Series Forecasting

Willa Potosnak, Nina Żukowska, Michał Wiliński, Dan Howarth, Ignacy Stępka, Mononito Goswami, Artur Dubrawski

Main category: cs.LG

TL;DR: MICA extends channel-independent Transformers with efficient cross-channel attention that scales linearly with channel count and context length for multivariate time series forecasting.

DetailsMotivation: Full cross-channel attention in multivariate forecasting with Transformers faces scalability issues due to quadratic complexity in both sequence length and channel count, making it impractical for high-dimensional time series.

Method: Proposes Multivariate Infini Compressive Attention (MICA) that adapts efficient attention techniques from sequence dimension to channel dimension, adding cross-channel attention to channel-independent backbones with linear scaling.
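
The linear-scaling idea can be illustrated with a generic kernelized linear attention over the channel axis: the (C, C) score matrix is never materialized, so cost grows linearly in C. MICA's actual compressive mechanism is more involved; this sketch only shows the scaling trick being adapted to channels.

```python
import numpy as np

def phi(x):
    """Positive kernel feature map (ELU + 1), as in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_channel_attention(tokens):
    """Cross-channel mixing with cost linear in the channel count C.

    tokens: (C, d) per-channel embeddings at one position.  Instead of
    the (C, C) score matrix of full attention, we accumulate phi(K)^T V
    and the phi(K) column sums once, giving O(C d^2) work.
    """
    Q = K = V = tokens                      # self-attention over channels
    kv = phi(K).T @ V                       # (d, d) summary
    z = phi(K).sum(axis=0)                  # (d,) normalizer
    num = phi(Q) @ kv                       # (C, d)
    den = phi(Q) @ z                        # (C,)
    return num / den[:, None]
```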

Result: MICA reduces forecast error by 5.4% on average (up to 25.4% on individual datasets) compared to channel-independent counterparts, ranks first among deep multivariate Transformer and MLP baselines, and scales more efficiently than full cross-channel attention models.

Conclusion: MICA demonstrates that explicit cross-channel modeling is important for multivariate forecasting and provides a practical scalable solution through compressive attention techniques.

Abstract: Multivariate forecasting with Transformers faces a core scalability challenge: modeling cross-channel dependencies via attention compounds attention’s quadratic sequence complexity with quadratic channel scaling, making full cross-channel attention impractical for high-dimensional time series. We propose Multivariate Infini Compressive Attention (MICA), an architectural design to extend channel-independent Transformers to channel-dependent forecasting. By adapting efficient attention techniques from the sequence dimension to the channel dimension, MICA adds a cross-channel attention mechanism to channel-independent backbones that scales linearly with channel count and context length. We evaluate channel-independent Transformer architectures with and without MICA across multiple forecasting benchmarks. MICA reduces forecast error over its channel-independent counterparts by 5.4% on average and up to 25.4% on individual datasets, highlighting the importance of explicit cross-channel modeling. Moreover, models with MICA rank first among deep multivariate Transformer and MLP baselines. MICA models also scale more efficiently with respect to both channel count and context length than Transformer baselines that compute attention across both the temporal and channel dimensions, establishing compressive attention as a practical solution for scalable multivariate forecasting.

[467] AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling

Iva Mikuš, Boris Muha, Domagoj Vlah

Main category: cs.LG

TL;DR: A joint transformer-based model for parametric PDEs with multi-stage parameter injection and coordinate encoding that outperforms existing ROMs in multi-field prediction.

DetailsMotivation: Existing ROM approaches either operate on full solution fields (computationally expensive) or compressed latent representations (difficult to evolve), and most don't handle multi-component predictions with varying parameter sensitivities well.

Method: Joint model with convolutional encoder, transformer on latent representations, and decoder. Uses multi-stage parameter injection and coordinate channel injection to dynamically adapt computations to specific PDE parameters.
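
The two injection mechanisms can be sketched as coordinate-channel concatenation plus a FiLM-style modulation applied at each stage. This is our hedged reading of the method; the weight matrices and normalization ranges are placeholders, not the paper's parameterization.

```python
import numpy as np

def add_coordinate_channels(field):
    """Append normalized (y, x) coordinate channels to a (C, H, W) field."""
    C, H, W = field.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    return np.concatenate([field, ys[None], xs[None]], axis=0)

def film_inject(latent, params, W_scale, W_shift):
    """FiLM-style parameter injection at one stage.

    PDE parameters are mapped to a per-channel scale and shift that
    modulate the latent; repeating this at several stages is one way to
    realize multi-stage parameter injection.
    latent: (C, H, W), params: (p,), W_scale/W_shift: (C, p).
    """
    gamma = W_scale @ params            # (C,) per-channel scale offset
    beta = W_shift @ params             # (C,) per-channel shift
    return latent * (1.0 + gamma[:, None, None]) + beta[:, None, None]
```

Because gamma and beta depend on the PDE parameters, the same network weights produce parameter-conditioned computations rather than a single fixed response.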

Result: Outperforms DL-ROMs, latent transformers, and plain ViTs on Advection-Diffusion-Reaction and Navier-Stokes cylinder wake problems, reducing relative rollout error by ~5x.

Conclusion: The approach combines efficiency of latent evolution with fidelity of full-field models, enabling effective multi-field prediction for parametric PDEs.

Abstract: Deep Learning Reduced Order Models (ROMs) are becoming increasingly popular as surrogate models for parametric partial differential equations (PDEs) due to their ability to handle high-dimensional data, approximate highly nonlinear mappings, and utilize GPUs. Existing approaches typically learn evolution either on the full solution field, which requires capturing long-range spatial interactions at high computational cost, or on compressed latent representations obtained from autoencoders, which reduces the cost but often yields latent vectors that are difficult to evolve, since they primarily encode spatial information. Moreover, in parametric PDEs, the initial condition alone is not sufficient to determine the trajectory, and most current approaches are not evaluated on jointly predicting multiple solution components with differing magnitudes and parameter sensitivities. To address these challenges, we propose a joint model consisting of a convolutional encoder, a transformer operating on latent representations, and a decoder for reconstruction. The main novelties are joint training with multi-stage parameter injection and coordinate channel injection. Parameters are injected at multiple stages to improve conditioning. Physical coordinates are encoded to provide spatial information. This allows the model to dynamically adapt its computations to the specific PDE parameters governing each system, rather than learning a single fixed response. Experiments on the Advection-Diffusion-Reaction equation and Navier-Stokes flow around the cylinder wake demonstrate that our approach combines the efficiency of latent evolution with the fidelity of full-field models, outperforming DL-ROMs, latent transformers, and plain ViTs in multi-field prediction, reducing the relative rollout error by approximately $5$ times.

[468] Distributed Interpretability and Control for Large Language Models

Dev Arpan Desai, Shaoyi Huang, Zining Zhu

Main category: cs.LG

TL;DR: A practical implementation of activation-level interpretability and steering for multi-GPU LLMs that reduces memory usage by 7x and increases throughput by 41x, enabling real-time behavioral control without fine-tuning.

DetailsMotivation: Current interpretability and steering techniques don't scale well to multi-GPU large language models, which are typically the most capable models. There's a need for practical solutions that can handle the computational demands of these frontier models while providing real-time control.

Method: Implements activation-level interpretability (logit lens) and steering (steering vectors) optimized for multi-GPU settings. Uses design choices to reduce activation memory by 7x and increase throughput by 41x. Employs label-position steering vectors injected post-LayerNorm for controllable output shifts without additional forward passes or fine-tuning.
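
The two primitives, logit lens and steering-vector injection, are simple vector operations; the multi-GPU plumbing is orthogonal. A minimal sketch with placeholder matrices (the names and shapes here are ours):

```python
import numpy as np

def steer_hidden(hidden, direction, strength):
    """Add a steering vector to a post-LayerNorm hidden state.

    hidden: (d,) activation at the label position; direction: (d,) unit
    steering vector; strength: scalar coefficient.  Increasing strength
    shifts the downstream logits monotonically along the direction's
    readout.
    """
    return hidden + strength * direction

def logit_lens(hidden, unembed):
    """Project an intermediate hidden state through the unembedding
    matrix (d, vocab) to read off token logits early."""
    return hidden @ unembed
```

The monotone logit shift with increasing strength is the behavior summarized by the "steerability slope" above.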

Result: Achieves 20-100 tokens/s while collecting full layer-wise activation trajectories for 1,500 token sequences. Demonstrates controllable, monotonic shifts in model outputs with mean steerability slope of 0.702 across datasets. Tested on LLaMA-3.1 (8B, 70B) and Qwen-3 (4B, 14B, 32B) models.

Conclusion: Provides a practical, scalable solution for interpretability and real-time behavioral control of frontier LLMs in multi-GPU environments, enabling researchers to understand and steer large models without the computational bottlenecks of previous approaches.

Abstract: Large language models that require multiple GPU cards to host are usually the most capable models. It is necessary to understand and steer these models, but the current technologies do not support the interpretability and steering of these models in the multi-GPU setting as well as the single-GPU setting. We present a practical implementation of activation-level interpretability (logit lens) and steering (steering vector) that scales up to multi-GPU language models. Our system implements design choices that reduce the activation memory by up to 7x and increase the throughput by up to 41x compared to a baseline on identical hardware. We demonstrate the method across LLaMA-3.1 (8B, 70B) and Qwen-3 (4B, 14B, 32B), sustaining 20-100 tokens/s while collecting full layer-wise activation trajectories for sequences of 1,500 tokens. Using label-position steering vectors injected post-LayerNorm, we show controllable, monotonic shifts in model outputs with a mean steerability slope of 0.702 across evaluated datasets, without fine-tuning or additional forward passes. We release detailed benchmarks, ablations, and a reproducible instrumentation recipe to enable practical interpretability and real-time behavioral control for frontier LLMs at https://github.com/Devdesai1901/LogitLense.

[469] Inference-Time Code Selection via Symbolic Equivalence Partitioning

David Cho, Yifan Wang, Fanping Sui, Ananth Grama

Main category: cs.LG

TL;DR: Symbolic Equivalence Partitioning improves code generation selection by using symbolic execution to group programs by semantic behavior and selecting from dominant functional partitions, enhancing accuracy without extra LLM inference.

DetailsMotivation: Existing "Best-of-N" selection methods for code generation rely on expensive or stochastic external verifiers to identify correct solutions, creating a need for more reliable and efficient selection mechanisms.

Method: Proposes Symbolic Equivalence Partitioning framework that uses symbolic execution to group candidate programs by semantic behavior, encodes domain-specific constraints as SMT assumptions to reduce path explosion, and selects a representative from the dominant functional partition.
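
The partition-and-select step can be sketched with concrete probe inputs standing in for symbolic execution (the paper partitions via symbolic execution with SMT-encoded domain constraints; concrete probing is a deliberate simplification for illustration):

```python
from collections import defaultdict

def select_by_partition(candidates, probe_inputs):
    """Pick a representative from the dominant behavioral partition.

    candidates: callables implementing the same spec.  Programs are
    grouped by their outputs on probe_inputs, and one member of the
    largest group (the dominant functional partition) is returned.
    """
    partitions = defaultdict(list)
    for f in candidates:
        signature = []
        for x in probe_inputs:
            try:
                signature.append(repr(f(x)))
            except Exception as e:        # crashing is a behavior too
                signature.append(type(e).__name__)
        partitions[tuple(signature)].append(f)
    dominant = max(partitions.values(), key=len)
    return dominant[0]
```

Symbolic execution replaces the probe inputs with path conditions, so two programs land in the same partition only when provably equivalent over the constrained domain rather than merely agreeing on sampled points.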

Result: At N=10, improves average accuracy over Pass@1 from 0.728 to 0.803 on HumanEval+ and from 0.516 to 0.604 on LiveCodeBench, without requiring additional LLM inference beyond initial candidate generations.

Conclusion: Symbolic Equivalence Partitioning provides an effective selection framework for code generation that enhances accuracy through semantic grouping and domain-aware symbolic execution, offering a more reliable alternative to stochastic verifiers.

Abstract: “Best-of-N” selection is a popular inference-time scaling method for code generation using Large Language Models (LLMs). However, to reliably identify correct solutions, existing methods often depend on expensive or stochastic external verifiers. In this paper, we propose Symbolic Equivalence Partitioning, a selection framework that uses symbolic execution to group candidate programs by semantic behavior and select a representative from the dominant functional partition. To improve grouping and selection, we encode domain-specific constraints as Satisfiability Modulo Theories (SMT) assumptions during symbolic execution to reduce path explosion and prevent invalid input searches outside the problem domain. At N=10, our method improves average accuracy over Pass@1 from 0.728 to 0.803 on HumanEval+ and from 0.516 to 0.604 on LiveCodeBench, without requiring any additional LLM inference beyond the initial N candidate generations.

[470] Discrete Flow Matching Policy Optimization

Maojiang Su, Po-Chung Hsieh, Weimin Wu, Mingcheng Lu, Jiunhau Chen, Jerry Yao-Chieh Hu, Han Liu

Main category: cs.LG

TL;DR: DoMinO is a unified RL framework for fine-tuning Discrete Flow Matching models by viewing DFM sampling as a multi-step MDP, enabling reward maximization while preserving original samplers and avoiding biased estimators.

DetailsMotivation: The paper addresses the challenge of fine-tuning Discrete Flow Matching models for controllable discrete sequence generation. Current RL fine-tuning methods often rely on biased auxiliary estimators and likelihood surrogates, which can lead to suboptimal performance. The authors aim to create a more robust framework that preserves the original DFM samplers while enabling effective reward-driven optimization.

Method: DoMinO reformulates DFM sampling as a multi-step Markov Decision Process, allowing RL fine-tuning under a broad class of policy gradient methods. The framework introduces total-variation regularizers to prevent policy collapse by keeping the fine-tuned distribution close to the pretrained one. Theoretically, the authors establish bounds on discretization error and provide tractable upper bounds for the regularizers.
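
The shape of the objective, a reward term minus a total-variation tether to the pretrained per-step distribution, can be sketched as follows. This is a generic policy-gradient illustration of the stated objective, not the paper's estimator.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation between two categorical distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def regularized_step_objective(logp_actions, rewards, p_new, p_ref, lam=0.1):
    """One-step surrogate in the spirit of DoMinO's objective.

    REINFORCE-style reward term minus a TV penalty tying the fine-tuned
    per-step distribution p_new to the pretrained p_ref, discouraging
    policy collapse.
    """
    reward_term = float(np.mean(np.asarray(logp_actions) * np.asarray(rewards)))
    return reward_term - lam * tv_distance(p_new, p_ref)
```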

Result: Experimental evaluation on regulatory DNA sequence design shows DoMinO achieves stronger predicted enhancer activity and better sequence naturalness than previous reward-driven baselines. The regularization further improves alignment with natural sequence distributions while preserving strong functional performance.

Conclusion: DoMinO establishes itself as a useful framework for controllable discrete sequence generation, offering a robust RL approach that avoids common pitfalls of previous methods while maintaining strong performance on practical applications like DNA sequence design.

Abstract: We introduce Discrete flow Matching policy Optimization (DoMinO), a unified framework for Reinforcement Learning (RL) fine-tuning Discrete Flow Matching (DFM) models under a broad class of policy gradient methods. Our key idea is to view the DFM sampling procedure as a multi-step Markov Decision Process. This perspective provides a simple and transparent reformulation of fine-tuning reward maximization as a robust RL objective. Consequently, it not only preserves the original DFM samplers but also avoids biased auxiliary estimators and likelihood surrogates used by many prior RL fine-tuning methods. To prevent policy collapse, we also introduce new total-variation regularizers to keep the fine-tuned distribution close to the pretrained one. Theoretically, we establish an upper bound on the discretization error of DoMinO and tractable upper bounds for the regularizers. Experimentally, we evaluate DoMinO on regulatory DNA sequence design. DoMinO achieves stronger predicted enhancer activity and better sequence naturalness than the previous best reward-driven baselines. The regularization further improves alignment with the natural sequence distribution while preserving strong functional performance. These results establish DoMinO as a useful framework for controllable discrete sequence generation.

[471] Optimal Rates for Pure ε-Differentially Private Stochastic Convex Optimization with Heavy Tails

Andrew Lowy

Main category: cs.LG

TL;DR: Pure ε-differential privacy for stochastic convex optimization with heavy-tailed gradients, achieving minimax optimal rates without bounded Lipschitz assumptions.

DetailsMotivation: Previous work on differentially private stochastic convex optimization (SCO) assumed bounded Lipschitz parameters, but many real-world problems have heavy-tailed gradients with unbounded distributions. The pure ε-DP case for heavy-tailed SCO remained open despite known results for approximate (ε,δ)-DP.

Method: Novel framework for privately optimizing Lipschitz extensions of the empirical loss. Assumes only bounded k-th moments of gradients rather than worst-case Lipschitz bounds. Algorithm achieves polynomial-time computation with high probability.

Result: Characterizes minimax optimal excess-risk rate for pure ε-DP heavy-tailed SCO up to logarithmic factors. Achieves polynomial-time computation with high probability, and with probability 1 when worst-case Lipschitz parameter is polynomially bounded. For structured problems (hinge/ReLU-type, absolute-value losses), achieves same guarantee with probability 1 even with infinite Lipschitz parameter.

Conclusion: First complete characterization of pure ε-DP heavy-tailed SCO minimax rates. Provides practical polynomial-time algorithms for important problem classes with heavy-tailed gradients, bridging theoretical optimality with computational feasibility.

Abstract: We study stochastic convex optimization (SCO) with heavy-tailed gradients under pure ε-differential privacy (DP). Instead of assuming a bound on the worst-case Lipschitz parameter of the loss, we assume only a bounded k-th moment. This assumption allows for unbounded, heavy-tailed stochastic gradient distributions, and can yield sharper excess risk bounds. The minimax optimal rate for approximate (ε, δ)-DP SCO is known in this setting, but the pure ε-DP case has remained open. We characterize the minimax optimal excess-risk rate for pure ε-DP heavy-tailed SCO up to logarithmic factors. Our algorithm achieves this rate in polynomial time with high probability. Moreover, it runs in polynomial time with probability 1 when the worst-case Lipschitz parameter is polynomially bounded. For important structured problem classes, including hinge/ReLU-type and absolute-value losses on Euclidean balls, ellipsoids, and polytopes, we achieve the same excess-risk guarantee in polynomial time with probability 1 even when the worst-case Lipschitz parameter is infinite. Our approach is based on a novel framework for privately optimizing Lipschitz extensions of the empirical loss. We complement our excess risk upper bound with a novel high probability lower bound.

[472] Improving Robustness In Sparse Autoencoders via Masked Regularization

Vivek Narayanaswamy, Kowshik Thopalli, Bhavya Kailkhura, Wesam Sakla

Main category: cs.LG

TL;DR: Sparse autoencoders for LLM interpretability suffer from feature absorption and poor OOD robustness; proposed masking regularization during training improves robustness and reduces absorption.

DetailsMotivation: Current sparse autoencoders (SAEs) used for mechanistic interpretability of LLMs have limitations: sparsity alone is imperfect for interpretability, training objectives lead to brittle latent representations, and SAEs suffer from feature absorption (general features being subsumed by specific ones) and poor out-of-distribution robustness.

Method: Proposes a masking-based regularization technique that randomly replaces tokens during training to disrupt co-occurrence patterns that cause feature absorption. This approach improves robustness across different SAE architectures and sparsity levels.
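
The regularizer's data side is a simple stochastic token-replacement step. A minimal sketch, assuming independent per-position replacement with a uniform random vocabulary id (the exact replacement scheme is our assumption):

```python
import numpy as np

def mask_tokens(token_ids, vocab_size, replace_prob=0.15, seed=0):
    """Randomly replace tokens to break co-occurrence patterns.

    Each position is independently swapped for a random vocabulary id
    with probability replace_prob before activations are collected for
    SAE training, disrupting the feature co-occurrence that drives
    absorption.
    """
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids).copy()
    mask = rng.random(token_ids.shape) < replace_prob
    token_ids[mask] = rng.integers(0, vocab_size, size=int(mask.sum()))
    return token_ids, mask
```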

Result: The masking regularization reduces feature absorption, enhances probing performance, and narrows the out-of-distribution performance gap, leading to more reliable interpretability tools.

Conclusion: Masking-based regularization provides a practical path toward more robust and reliable sparse autoencoders for mechanistic interpretability of LLMs, addressing key limitations of current training objectives.

Abstract: Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often result in brittle latent representations. SAEs are known to be prone to feature absorption, where general features are subsumed by more specific ones due to co-occurrence, degrading interpretability despite high reconstruction fidelity. Recent negative results on Out-of-Distribution (OOD) performance further underscore broader robustness-related failures tied to under-specified training objectives. We address this by proposing a masking-based regularization that randomly replaces tokens during training to disrupt co-occurrence patterns. This improves robustness across SAE architectures and sparsity levels, reducing absorption, enhancing probing performance, and narrowing the OOD gap. Our results point toward a practical path for more reliable interpretability tools.

[473] Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning

Philipp Hellwig, Willem Zuidema, Claire E. Stevenson, Martha Lewis

Main category: cs.LG

TL;DR: Transformers trained with Meta-Learning for Compositionality (MLC) on letter-string analogies show improved generalization when guided to attend to informative problem elements, with better performance on new alphabets using heterogeneous datasets, but limited generalization to novel transformations.

DetailsMotivation: To develop AI systems capable of robust human-like analogical reasoning, which has proven difficult despite being a hallmark of human intelligence. The research aims to understand how transformers can learn and generalize analogical reasoning tasks.

Method: Train transformers using Meta-Learning for Compositionality (MLC) on letter-string analogies. Guide models to attend to informative problem elements by including copying tasks in training data. Use heterogeneous datasets for better generalization. Employ 3-layer encoder-decoder architecture and conduct interpretability analyses to understand model computations.
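
The letter-string task and the auxiliary copying items can be generated with a few lines. Item formats here are illustrative, not the paper's exact templates:

```python
import string

def apply_successor(s):
    """Shift every letter one step along the alphabet (z wraps to a)."""
    a = string.ascii_lowercase
    return "".join(a[(a.index(c) + 1) % 26] for c in s)

def make_training_items(sources):
    """Build analogy items plus copying items used to guide attention.

    For each source string we emit a letter-string analogy
    ('abc : abd :: ijk : ?', answer 'ijl') and a plain copying task
    ('ijk -> ?', answer 'ijk'); the copying items push the model to
    locate and reproduce the informative operands.
    """
    items = []
    for s in sources:
        target = s[:-1] + apply_successor(s[-1])   # successor of last letter
        items.append((f"abc : abd :: {s} : ?", target))
        items.append((f"{s} -> ?", s))             # copying task
    return items
```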

Result: Letter-string analogies become learnable with attention guidance. Models generalize better to new alphabets with heterogeneous datasets (3-layer model outperforms most frontier models). Some generalization to compositions of trained transformations, but not to completely novel transformations. Identified algorithm approximating model computations, verified through interpretability analyses.

Conclusion: MLC approach enables transformers to learn analogical reasoning with specific generalization capabilities. The work provides insights into how models can be guided for better reasoning and discusses implications for larger models and parallels to human analogical reasoning.

Abstract: Analogical reasoning is a hallmark of human intelligence, enabling us to solve new problems by transferring knowledge from one situation to another. Yet, developing artificial intelligence systems capable of robust human-like analogical reasoning has proven difficult. In this work, we train transformers using Meta-Learning for Compositionality (MLC) on an analogical reasoning task (letter-string analogies) and assess their generalization capabilities. We find that letter-string analogies become learnable when guiding the models to attend to the most informative problem elements induced by including copying tasks in the training data. Furthermore, generalization to new alphabets becomes better when models are trained with more heterogeneous datasets, where our 3-layer encoder-decoder model outperforms most frontier models. The MLC approach also enables some generalization to compositions of trained transformations, but not to completely novel transformations. To understand how the model operates, we identify an algorithm that approximates the model’s computations. We verify this using interpretability analyses and show that the model can be steered precisely according to expectations derived from the algorithm. Finally, we discuss implications of our findings for generalization capabilities of larger models and parallels to human analogical reasoning.

[474] VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts

Peigui Qi, Kunsheng Tang, Yanpu Yu, Jialin Wu, Yide Song, Wenbo Zhou, Zhicong Huang, Cheng Hong, Weiming Zhang, Nenghai Yu

Main category: cs.LG

TL;DR: VLMShield: A lightweight safety detector for Vision-Language Models that identifies multimodal malicious attacks using aggregated feature extraction and distribution analysis.

DetailsMotivation: Vision-Language Models have significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration, and existing defenses suffer from efficiency and robustness issues.

Method: Proposes Multimodal Aggregated Feature Extraction (MAFE) framework to enable CLIP to handle long text and fuse multimodal information into unified representations, then develops VLMShield detector that identifies malicious attacks based on discovered distributional patterns between benign and malicious prompts.
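
The aggregation-then-detection pipeline can be sketched on abstract embedding vectors (a simplified stand-in for MAFE; the real framework operates on CLIP features, and the logistic detector here is our placeholder for the lightweight classifier):

```python
import numpy as np

def aggregate_features(image_emb, text_chunk_embs):
    """Fuse modalities into one unified representation.

    Long text is split upstream into encoder-sized chunks; here the
    chunk embeddings are mean-pooled and concatenated with the image
    embedding.
    """
    text_emb = np.mean(np.asarray(text_chunk_embs), axis=0)
    return np.concatenate([np.asarray(image_emb), text_emb])

def detect_malicious(feature, w, b, threshold=0.5):
    """Lightweight plug-and-play detector: a single logistic unit over
    the aggregated feature (weights would be fit on benign vs.
    malicious prompt features)."""
    score = 1.0 / (1.0 + np.exp(-(feature @ w + b)))
    return score >= threshold, score
```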

Result: Extensive experiments demonstrate superior performance across multiple dimensions including robustness, efficiency, and utility as a plug-and-play solution.

Conclusion: VLMShield paves the way for more secure multimodal AI deployment by providing an efficient and robust safety detection mechanism for Vision-Language Models.

Abstract: Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration. Existing defenses suffer from efficiency and robustness limitations. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework that enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building upon this finding, we develop VLMShield, a lightweight safety detector that efficiently identifies multimodal malicious attacks as a plug-and-play solution. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for more secure multimodal AI deployment. Code is available at this https URL.

[475] Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

Mohammed Nowaz Rabbani Chowdhury, Kaoutar El Maghraoui, Hsinyu Tsai, Naigang Wang, Geoffrey W. Burr, Liu Liu, Meng Wang

Main category: cs.LG

TL;DR: Expert-wise mixed precision quantization for sparse MoE models that assigns higher precision to experts with smaller router norm changes and higher intra-neuron variance, achieving better accuracy than existing methods while reducing inference cost.

DetailsMotivation: Sparse Mixture-of-Experts models have large memory overhead during inference despite reduced computation. While post-training quantization helps, uniform quantization causes significant accuracy loss at low bit-widths, and existing mixed-precision methods require heavy computation for bit-width allocation and ignore varying sensitivity of model performance to different experts' quantization.

Method: Proposes a theoretically grounded expert-wise mixed precision strategy that assigns bit-width to each expert based on: 1) change in routers’ l2 norm during training (experts with smaller changes get higher precision as they capture critical but infrequent features), and 2) maximum intra-neuron variance (experts with large variance get higher precision to avoid high quantization noise).
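
The two allocation rules can be sketched as quantile thresholds over per-expert statistics. Using batch quantiles as the cutoffs is our simplification of the paper's allocation:

```python
import numpy as np

def assign_bitwidths(router_norm_change, intra_neuron_var,
                     bits_low=4, bits_high=8,
                     norm_quantile=0.5, var_quantile=0.9):
    """Expert-wise mixed-precision assignment per the two stated rules.

    Experts whose router l2 norm changed least during training (they
    capture infrequent but critical features), or whose maximum
    intra-neuron variance is large (low precision would inject high
    quantization noise), receive the higher bit-width; all others
    receive the low bit-width.
    """
    norms = np.asarray(router_norm_change, dtype=float)
    variances = np.asarray(intra_neuron_var, dtype=float)
    small_change = norms <= np.quantile(norms, norm_quantile)
    high_var = variances >= np.quantile(variances, var_quantile)
    return np.where(small_change | high_var, bits_high, bits_low)
```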

Result: Experiments on large-scale MoE models (Switch Transformer and Mixtral) show the method achieves higher accuracy than existing approaches while reducing inference cost, with only negligible overhead for bit-width assignment.

Conclusion: The proposed expert-wise mixed precision quantization strategy effectively addresses memory overhead in sparse MoE models by intelligently allocating precision based on expert sensitivity, achieving better accuracy-efficiency trade-offs than previous methods.

Abstract: Sparse Mixture-of-Experts (MoE) allows scaling of language and vision models efficiently by activating only a small subset of experts per input. While this reduces computation, the large number of parameters still incurs substantial memory overhead during inference. Post-training quantization has been explored to address this issue. Because uniform quantization suffers from significant accuracy loss at low bit-widths, mixed-precision methods have been recently explored; however, they often require substantial computation for bit-width allocation and overlook the varying sensitivity of model performance to the quantization of different experts. We propose a theoretically grounded expert-wise mixed precision strategy that assigns bit-width to each expert primarily based on the change in their routers' l2 norm during training. Experts with smaller changes are shown to capture less frequent but critical features, and model performance is more sensitive to the quantization of these experts, thus requiring higher precision. Furthermore, to avoid allocating experts to lower precision that would inject high quantization noise, experts with large maximum intra-neuron variance are also allocated higher precision. Experiments on large-scale MoE models, including Switch Transformer and Mixtral, show that our method achieves higher accuracy than existing approaches, while also reducing inference cost and incurring only negligible overhead for bit-width assignment.

[476] Time-Series Classification with Multivariate Statistical Dependence Features

Yao Sun, Bo Hu, Jose Principe

Main category: cs.LG

TL;DR: A novel framework for non-stationary time-series analysis using cross density ratio (CDR) instead of correlation, with functional maximal correlation algorithm (FMCA) for feature extraction and lightweight perceptron for classification.

Motivation: Traditional correlation-based methods for time-series analysis are sensitive to sample order and regime changes. The paper aims to develop a more robust approach for non-stationary time-series by directly estimating statistical dependence through cross density ratio.

Method: Proposes cross density ratio (CDR) to measure statistical dependence independent of sample order. Uses functional maximal correlation algorithm (FMCA) to decompose CDR’s eigenspectrum and extract multiscale features. Features are classified with a single-hidden-layer perceptron.
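
The key property claimed for the CDR, independence of sample order, can be illustrated with a crude histogram estimate of p(x, y) / (p(x) p(y)); the grid estimator and bin count below are illustrative assumptions, and FMCA itself is not shown:

```python
import random

def cdr_grid(pairs, bins=4):
    """Crude histogram estimate of p(x, y) / (p(x) p(y)) on the unit square:
    it is a function of joint counts only, hence invariant to sample order."""
    joint = [[0] * bins for _ in range(bins)]
    px, py = [0] * bins, [0] * bins
    for x, y in pairs:
        i = min(int(x * bins), bins - 1)
        j = min(int(y * bins), bins - 1)
        joint[i][j] += 1
        px[i] += 1
        py[j] += 1
    n = len(pairs)
    return [[n * joint[i][j] / (px[i] * py[j]) if px[i] * py[j] else 0.0
             for j in range(bins)] for i in range(bins)]

random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
shuffled = list(data)
random.shuffle(shuffled)
# Unlike a windowed correlation estimate, the result is unchanged by reordering.
assert cdr_grid(data) == cdr_grid(shuffled)
```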

Result: Outperforms hidden Markov models (HMMs) and state-of-the-art spiking neural networks on the TI-46 digit speech corpus. Achieves higher accuracy with fewer than 10 layers and a storage footprint under 5 MB.

Conclusion: The CDR-based framework provides a robust alternative to correlation-based methods for non-stationary time-series analysis, particularly effective for speech recognition with lightweight architecture.

Abstract: In this paper, we propose a novel framework for non-stationary time-series analysis that replaces conventional correlation-based statistics with direct estimation of statistical dependence in the normalized joint density of input and target signals, the cross density ratio (CDR). Unlike windowed correlation estimates, this measure is independent of sample order and robust to regime changes. The method builds on the functional maximal correlation algorithm (FMCA), which constructs a projection space by decomposing the eigenspectrum of the CDR. Multiscale features from this eigenspace are classified using a lightweight single-hidden-layer perceptron. On the TI-46 digit speech corpus, our approach outperforms hidden Markov models (HMMs) and state-of-the-art spiking neural networks, achieving higher accuracy with fewer than 10 layers and a storage footprint under 5 MB.

[477] When Does Context Help? A Systematic Study of Target-Conditional Molecular Property Prediction

Bryan Cheng, Jasper Zhang

Main category: cs.LG

TL;DR: Systematic study of target context in molecular property prediction shows fusion architecture choice matters most, context enables predictions in data-scarce scenarios but can hurt with distribution mismatch, and exposes flaws in standard benchmarking.

Motivation: To understand when and how target context helps molecular property prediction, addressing limitations in current approaches and benchmarking practices in drug discovery.

Method: Evaluated context conditioning across 10 protein families, 4 fusion architectures, various data regimes using NestDrug (FiLM-based architecture), with both temporal and random evaluation splits.
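
FiLM conditioning, the fusion mechanism the study finds dominant, is a per-channel affine modulation. The sketch below shows the operation in isolation, with hypothetical feature and parameter values (the networks that produce them are omitted):

```python
def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale-and-shift each channel of the
    molecular representation by parameters derived from the target context."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

h = [0.5, -1.0, 2.0]                        # hypothetical molecular features
assert film(h, [1, 1, 1], [0, 0, 0]) == h   # identity modulation is a no-op
# gamma can scale, gate (zero out), or pass a channel; beta shifts it.
assert film(h, [2, 0, 1], [0.0, 0.0, -1.0]) == [1.0, 0.0, 1.0]
```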

Result: FiLM outperforms other fusion methods significantly; context enables predictions in data-scarce scenarios (0.686 vs 0.238 AUC); context can hurt with distribution mismatch; exposed benchmarking flaws (1-NN achieves 0.991 AUC without learning).

Conclusion: How context is incorporated matters more than whether it’s included; context enables otherwise impossible predictions but can hurt with distribution mismatch; standard benchmarking has fundamental flaws; temporal splits show context-conditional representations generalize to future chemical space.

Abstract: We present the first systematic study of when target context helps molecular property prediction, evaluating context conditioning across 10 diverse protein families, 4 fusion architectures, data regimes spanning 67-9,409 training compounds, and both temporal and random evaluation splits. Using NestDrug, a FiLM-based architecture that conditions molecular representations on target identity, we characterize both success and failure modes with three principal findings. First, fusion architecture dominates: FiLM outperforms concatenation by 24.2 percentage points and additive conditioning by 8.6 pp; how you incorporate context matters more than whether you include it. Second, context enables otherwise impossible predictions: on data-scarce CYP3A4 (67 training compounds), multi-task transfer achieves 0.686 AUC where per-target Random Forest collapses to 0.238. Third, context can systematically hurt: distribution mismatch causes 10.2 pp degradation on BACE1; few-shot adaptation consistently underperforms zero-shot. Beyond methodology, we expose fundamental flaws in standard benchmarking: 1-nearest-neighbor Tanimoto achieves 0.991 AUC on DUD-E without any learning, and 50% of actives leak from training data, rendering absolute performance metrics meaningless. Our temporal split evaluation (train up to 2020, test 2021-2024) achieves stable 0.843 AUC with no degradation, providing the first rigorous evidence that context-conditional molecular representations generalize to future chemical space.

[478] TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning

Nan Zhang, Zishuo Wang, Shuyu Huang, Georgios Diamantopoulos, Nikos Tziritas, Panagiotis Oikonomou, Georgios Theodoropoulos

Main category: cs.LG

TL;DR: TwinLoop: A digital twin framework for accelerating multi-agent reinforcement learning adaptation to context shifts in cyber-physical systems.

Motivation: Current decentralized online learning in cyber-physical multi-agent systems suffers from slow recovery after operating condition changes, requiring substantial trial-and-error interaction that is costly in physical systems.

Method: Proposes TwinLoop, a simulation-in-the-loop digital twin framework that triggers when context shifts occur. It reconstructs current system state, initializes from latest agent policies, performs accelerated policy improvement using simulation what-if analysis, then synchronizes updated parameters back to physical agents.
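
The trigger, reconstruct, improve, synchronize cycle can be sketched as a control loop. The function names and stub callbacks below are assumptions for illustration, not TwinLoop's actual API:

```python
def twinloop_step(detect_shift, reconstruct_state, simulate_improve, sync, policies):
    """One monitoring step of a simulation-in-the-loop cycle: on a detected
    context shift, mirror the physical system in the twin, refine the latest
    policies via accelerated what-if simulation, and push parameters back."""
    if not detect_shift():
        return policies                            # no shift: keep current policies
    state = reconstruct_state()                    # rebuild current system state
    improved = simulate_improve(state, policies)   # accelerated policy improvement
    sync(improved)                                 # synchronize back to the agents
    return improved

# Stub callbacks standing in for real detection, simulation, and transport.
synced = []
out = twinloop_step(lambda: True,
                    lambda: "reconstructed-state",
                    lambda state, pols: {k: v + 1 for k, v in pols.items()},
                    synced.append,
                    {"agent0": 0, "agent1": 3})
assert out == {"agent0": 1, "agent1": 4} and synced == [out]
```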

Result: Evaluated in vehicular edge computing task-offloading scenario with changing workload and infrastructure conditions. Results show digital twins improve post-shift adaptation efficiency and reduce reliance on costly online trial-and-error.

Conclusion: Digital twin frameworks like TwinLoop can effectively accelerate policy adaptation in multi-agent systems facing changing conditions, reducing the need for expensive real-world trial-and-error learning.

Abstract: Decentralised online learning enables runtime adaptation in cyber-physical multi-agent systems, but when operating conditions change, learned policies often require substantial trial-and-error interaction before recovering performance. To address this, we propose TwinLoop, a simulation-in-the-loop digital twin framework for online multi-agent reinforcement learning. When a context shift occurs, the digital twin is triggered to reconstruct the current system state, initialise from the latest agent policies, and perform accelerated policy improvement with simulation what-if analysis before synchronising updated parameters back to the agents in the physical system. We evaluate TwinLoop in a vehicular edge computing task-offloading scenario with changing workload and infrastructure conditions. The results suggest that digital twins can improve post-shift adaptation efficiency and reduce reliance on costly online trial-and-error.

[479] PD-SOVNet: A Physics-Driven Second-Order Vibration Operator Network for Estimating Wheel Polygonal Roughness from Axle-Box Vibrations

Xiancheng Wang, Lin Wang, Rui Wang, Zhibo Zhang, Minghang Zhao, Xiaoheng Zhang, Zhongyue Tan, Kaitai Mao

Main category: cs.LG

TL;DR: PD-SOVNet: A physics-guided gray-box framework combining shared second-order vibration kernels, MIMO coupling, adaptive physical correction, and Mamba-based temporal processing for estimating wheel polygonal roughness spectra from axle-box vibrations in rail-vehicle monitoring.

Motivation: Quantitative estimation of wheel polygonal roughness from vibration signals is challenging but practically important for rail-vehicle condition monitoring. Existing approaches focus on detection/classification rather than continuous regression of multi-order roughness spectra, especially under real operational data and unseen-wheel conditions.

Method: PD-SOVNet combines: 1) shared second-order vibration kernels embedding modal-response priors, 2) 4×4 MIMO coupling module, 3) adaptive physical correction branch for sample-dependent adjustments, and 4) Mamba-based temporal branch for handling residual temporal dynamics. This gray-box approach balances physical priors with data-driven flexibility.
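
A second-order vibration kernel of the kind described embeds the impulse response of a damped oscillator. The sketch below generates such a kernel with illustrative parameter values; the paper's shared-kernel parameterization may differ:

```python
import math

def second_order_kernel(wn, zeta, n, dt=0.01):
    """Discretized impulse response of x'' + 2*zeta*wn*x' + wn^2 * x = u(t),
    the modal-response prior a second-order vibration kernel embeds."""
    wd = wn * math.sqrt(1.0 - zeta ** 2)   # damped natural frequency
    return [math.exp(-zeta * wn * k * dt) * math.sin(wd * k * dt) / wd
            for k in range(n)]

h = second_order_kernel(wn=20.0, zeta=0.05, n=200)
assert h[0] == 0.0 and max(h) > 0.0 > min(h)   # starts at rest, then oscillates
```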

Result: Experiments on three real-world datasets show competitive prediction accuracy and relatively stable cross-wheel performance, with most noticeable advantage on challenging Dataset III. Noise injection experiments demonstrate Mamba temporal branch helps mitigate performance degradation under perturbed inputs.

Conclusion: Structured physical priors can stabilize roughness regression in practical rail-vehicle monitoring, though further validation under broader operating conditions and stricter comparison protocols is needed.

Abstract: Quantitative estimation of wheel polygonal roughness from axle-box vibration signals is a challenging yet practically relevant problem for rail-vehicle condition monitoring. Existing studies have largely focused on detection, identification, or severity classification, while continuous regression of multi-order roughness spectra remains less explored, especially under real operational data and unseen-wheel conditions. To address this problem, this paper presents PD-SOVNet, a physics-guided gray-box framework that combines shared second-order vibration kernels, a $4\times4$ MIMO coupling module, an adaptive physical correction branch, and a Mamba-based temporal branch for estimating the 1st–40th-order wheel roughness spectrum from axle-box vibrations. The proposed design embeds modal-response priors into the model while retaining data-driven flexibility for sample-dependent correction and residual temporal dynamics. Experiments on three real-world datasets, including operational data and real fault data, show that the proposed method provides competitive prediction accuracy and relatively stable cross-wheel performance under the current data protocol, with its most noticeable advantage observed on the more challenging Dataset III. Noise injection experiments further indicate that the Mamba temporal branch helps mitigate performance degradation under perturbed inputs. These results suggest that structured physical priors can be beneficial for stabilizing roughness regression in practical rail-vehicle monitoring scenarios, although further validation under broader operating conditions and stricter comparison protocols is still needed.

[480] SubFLOT: Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport

Zheng Jiang, Nan He, Yiming Chen, Lifeng Sun

Main category: cs.LG

TL;DR: SubFLOT is a federated learning framework that addresses system and statistical heterogeneity through server-side personalized pruning using optimal transport and adaptive regularization.

Motivation: Federated learning faces practical deployment challenges due to system and statistical heterogeneity. Existing federated pruning methods have critical limitations: server-side pruning lacks personalization, while client-side pruning is computationally prohibitive for resource-constrained devices. Additionally, pruning induces parametric divergence among heterogeneous submodels, destabilizing training and hindering global convergence.

Method: SubFLOT introduces two key modules: 1) Optimal Transport-enhanced Pruning (OTP) that treats historical client models as proxies for local data distributions and formulates pruning as a Wasserstein distance minimization problem to generate customized submodels without accessing raw data. 2) Scaling-based Adaptive Regularization (SAR) that adaptively penalizes submodel deviation from the global model, with penalty strength scaled by the client’s pruning rate to counteract parametric divergence.
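
The SAR idea can be sketched as a pruning-rate-scaled proximal term; the coefficient `lam` and the exact scaling form below are assumptions, not the paper's formula:

```python
def sar_penalty(sub_w, global_w, prune_rate, lam=0.1):
    """Penalize a submodel's squared deviation from the global model, with
    strength scaled by the client's pruning rate, so heavily pruned clients
    are pulled harder toward the global model."""
    dev = sum((w - g) ** 2 for w, g in zip(sub_w, global_w))
    return lam * prune_rate * dev

# Same deviation, heavier pruning: a stronger pull toward the global model.
assert sar_penalty([1.0, 2.0], [0.0, 0.0], prune_rate=0.8) > \
       sar_penalty([1.0, 2.0], [0.0, 0.0], prune_rate=0.2)
```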

Result: Comprehensive experiments demonstrate that SubFLOT consistently and substantially outperforms state-of-the-art methods, showing its effectiveness for deploying efficient and personalized models on resource-constrained edge devices.

Conclusion: SubFLOT provides a novel solution to federated learning challenges by enabling server-side personalized pruning through optimal transport techniques and adaptive regularization, offering potential for practical deployment of efficient personalized models on edge devices.

Abstract: Federated Learning (FL) enables collaborative model training while preserving data privacy, but its practical deployment is hampered by system and statistical heterogeneity. While federated network pruning offers a path to mitigate these issues, existing methods face a critical dilemma: server-side pruning lacks personalization, whereas client-side pruning is computationally prohibitive for resource-constrained devices. Furthermore, the pruning process itself induces significant parametric divergence among heterogeneous submodels, destabilizing training and hindering global convergence. To address these challenges, we propose SubFLOT, a novel framework for server-side personalized federated pruning. SubFLOT introduces an Optimal Transport-enhanced Pruning (OTP) module that treats historical client models as proxies for local data distributions, formulating the pruning task as a Wasserstein distance minimization problem to generate customized submodels without accessing raw data. Concurrently, to counteract parametric divergence, our Scaling-based Adaptive Regularization (SAR) module adaptively penalizes a submodel’s deviation from the global model, with the penalty’s strength scaled by the client’s pruning rate. Comprehensive experiments demonstrate that SubFLOT consistently and substantially outperforms state-of-the-art methods, underscoring its potential for deploying efficient and personalized models on resource-constrained edge devices.

[481] SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning

Zhengyang Ai, Zikang Shan, Xiaodong Ai, Jingxian Tang, Hangkai Hu, Pinyan Lu

Main category: cs.LG

TL;DR: SHAPE introduces hierarchical credit assignment for LLM reasoning with stage-aware advantage functions and entropy-driven token redistribution to improve accuracy while reducing token consumption.

Motivation: Current process supervision methods for LLM reasoning fail to distinguish meaningful progress from verbosity, leading to limited reasoning capabilities and token inefficiency.

Method: SHAPE formalizes reasoning as a trajectory through a state space of empirical solvability, using hierarchical credit assignment: segment-level stage-aware advantage functions to prioritize efficient breakthroughs in low-potential states, and token-level entropy-driven redistribution to sharpen execution signals.
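
The token-level redistribution can be sketched as weighting each token's credit by its normalized predictive entropy; the normalization choice below is an assumption, not the paper's exact formula:

```python
import math

def entropy_weights(token_dists):
    """Redistribute per-token credit in proportion to predictive entropy:
    high-entropy 'decision' tokens get more weight than near-deterministic
    ones; weights are normalized to sum to the number of tokens."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0.0) for dist in token_dists]
    total = sum(ents) or 1.0               # guard: all-deterministic sequence
    return [len(ents) * e / total for e in ents]

# A near-deterministic token receives less weight than an uncertain one.
w = entropy_weights([[0.99, 0.01], [0.5, 0.5]])
assert w[0] < w[1] and abs(sum(w) - 2.0) < 1e-9
```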

Result: Extensive experiments in math reasoning across three base models and five benchmarks show SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.

Conclusion: SHAPE effectively addresses verbosity and inefficiency in LLM reasoning through hierarchical credit assignment, improving both accuracy and token efficiency.

Abstract: Process supervision has emerged as a promising approach for enhancing LLM reasoning, yet existing methods fail to distinguish meaningful progress from mere verbosity, leading to limited reasoning capabilities and unresolved token inefficiency. To address this, we propose Stage-aware Hierarchical Advantage via Potential Estimation (SHAPE), a framework that formalizes reasoning as a trajectory through a state space of empirical solvability. SHAPE introduces a hierarchical credit assignment mechanism: at the segment level, it employs a stage-aware advantage function to prioritize efficient breakthroughs in low-potential states; at the token level, it utilizes entropy-driven redistribution to sharpen execution signals. Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.

[482] FlowAdam: Implicit Regularization via Geometry-Aware Soft Momentum Injection

Devender Singh, Tarun Sheel

Main category: cs.LG

TL;DR: FlowAdam: A hybrid optimizer combining Adam with ODE-based gradient flow integration for better handling of coupled parameter optimization problems like matrix factorization and tensor decomposition.

Motivation: Adam's diagonal preconditioner struggles with dense or rotated parameter couplings (e.g., in matrix factorization, tensor decomposition, GNNs) because it treats parameters independently, motivating better optimization for coupled parameter spaces.

Method: FlowAdam augments Adam with continuous gradient-flow integration via ODE. When EMA statistics detect difficult landscape, switches to clipped ODE integration. Key innovation: Soft Momentum Injection blends ODE velocity with Adam’s momentum during transitions.
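
Soft Momentum Injection reduces to a convex blend of the two velocity estimates; a minimal sketch, in which the blending schedule for `alpha` is an assumption:

```python
def soft_momentum_injection(m_adam, v_ode, alpha):
    """Blend the ODE integrator's velocity into Adam's momentum: alpha=1 keeps
    pure Adam momentum, alpha=0 is the hard replacement reported to collapse
    training; intermediate alpha gives the soft injection."""
    return [alpha * m + (1.0 - alpha) * v for m, v in zip(m_adam, v_ode)]

m = soft_momentum_injection([1.0, -2.0], [0.0, 0.4], alpha=0.9)
assert abs(m[0] - 0.9) < 1e-12 and abs(m[1] - (-1.76)) < 1e-12
```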

Result: Reduces held-out error by 10-22% on low-rank matrix/tensor recovery, 6% on Jester collaborative filtering, surpasses tuned Lion and AdaBelief, matches Adam on well-conditioned workloads (CIFAR-10). MovieLens-100K confirms benefits specific to coupled parameter interactions.

Conclusion: FlowAdam effectively handles coupled optimization problems through hybrid Adam-ODE approach with soft momentum injection, providing implicit regularization for parameter-coupled tasks without sacrificing performance on standard tasks.

Abstract: Adaptive moment methods such as Adam use a diagonal, coordinate-wise preconditioner based on exponential moving averages of squared gradients. This diagonal scaling is coordinate-system dependent and can struggle with dense or rotated parameter couplings, including those in matrix factorization, tensor decomposition, and graph neural networks, because it treats each parameter independently. We introduce FlowAdam, a hybrid optimizer that augments Adam with continuous gradient-flow integration via an ordinary differential equation (ODE). When EMA-based statistics detect landscape difficulty, FlowAdam switches to clipped ODE integration. Our central contribution is Soft Momentum Injection, which blends ODE velocity with Adam’s momentum during mode transitions. This prevents the training collapse observed with naive hybrid approaches. Across coupled optimization benchmarks, the ODE integration provides implicit regularization, reducing held-out error by 10-22% on low-rank matrix/tensor recovery and 6% on Jester (real-world collaborative filtering), also surpassing tuned Lion and AdaBelief, while matching Adam on well-conditioned workloads (CIFAR-10). MovieLens-100K confirms benefits arise specifically from coupled parameter interactions rather than bias estimation. Ablation studies show that soft injection is essential, as hard replacement reduces accuracy from 100% to 82.5%.

[483] GraphWalker: Graph-Guided In-Context Learning for Clinical Reasoning on Electronic Health Records

Yue Fang, Weibin Liao, Yuxin Guo, Jiaran Gao, Hongxin Ding, Jinyang Zhang, Xinke Jiang, Zhibang Yang, Junfeng Zhao, Yasha Wang, Liantao Ma

Main category: cs.LG

TL;DR: GraphWalker is a demonstration selection framework for in-context learning on EHRs that addresses perspective limitation, cohort awareness, and information aggregation challenges through joint modeling of clinical information and LLM-estimated gain, cohort discovery, and lazy greedy search.

Motivation: Existing in-context learning methods for EHR reasoning face three key challenges: perspective limitation (data-driven similarity doesn't align with LLM reasoning needs), cohort awareness (demonstrations selected independently without population-level structure), and information aggregation (ignoring redundancy and interaction effects among demonstrations).

Method: GraphWalker integrates data-driven and model-driven perspectives by jointly modeling patient clinical information and LLM-estimated information gain, incorporates cohort discovery to avoid noisy local optima, and uses a lazy greedy search with frontier expansion algorithm to mitigate diminishing marginal returns in information aggregation.
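
The lazy greedy idea exploits diminishing marginal returns by caching stale gains in a priority queue. The sketch below shows generic lazy greedy selection with a toy coverage gain, not GraphWalker's actual gain function or its frontier expansion:

```python
import heapq

def lazy_greedy(candidates, gain, k):
    """Generic lazy greedy selection: cached marginal gains are re-evaluated
    only when an element surfaces at the top of the heap, which is sound when
    gains exhibit diminishing returns (submodularity)."""
    selected = []
    heap = [(-gain(c, []), -1, c) for c in candidates]   # (neg gain, round, item)
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_g, rnd, c = heapq.heappop(heap)
        if rnd == len(selected):                 # gain fresh for this round: take it
            selected.append(c)
        else:                                    # stale gain: recompute, push back
            heapq.heappush(heap, (-gain(c, selected), len(selected), c))
    return selected

# Toy coverage gain with diminishing returns.
sets = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
def coverage_gain(c, sel):
    already = set().union(*[sets[s] for s in sel]) if sel else set()
    return len(sets[c] - already)

assert lazy_greedy(list(sets), coverage_gain, 2) == ["a", "b"]
```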

Result: Extensive experiments on multiple real-world EHR benchmarks show GraphWalker consistently outperforms state-of-the-art ICL baselines with substantial improvements in clinical reasoning performance.

Conclusion: GraphWalker provides an effective demonstration selection framework for EHR-oriented in-context learning that addresses fundamental challenges in perspective alignment, cohort modeling, and information aggregation.

Abstract: Clinical Reasoning on Electronic Health Records (EHRs) is a fundamental yet challenging task in modern healthcare. While in-context learning (ICL) offers a promising inference-time adaptation paradigm for large language models (LLMs) in EHR reasoning, existing methods face three fundamental challenges: (1) Perspective Limitation, where data-driven similarity fails to align with LLM reasoning needs and model-driven signals are constrained by limited clinical competence; (2) Cohort Awareness, as demonstrations are selected independently without modeling population-level structure; and (3) Information Aggregation, where redundancy and interaction effects among demonstrations are ignored, leading to diminishing marginal gains. To address these challenges, we propose GraphWalker, a principled demonstration selection framework for EHR-oriented ICL. GraphWalker (i) jointly models patient clinical information and LLM-estimated information gain by integrating data-driven and model-driven perspectives, (ii) incorporates Cohort Discovery to avoid noisy local optima, and (iii) employs a Lazy Greedy Search with Frontier Expansion algorithm to mitigate diminishing marginal returns in information aggregation. Extensive experiments on multiple real-world EHR benchmarks demonstrate that GraphWalker consistently outperforms state-of-the-art ICL baselines, yielding substantial improvements in clinical reasoning performance. Our code is open-sourced at https://github.com/PuppyKnightUniversity/GraphWalker

[484] Towards Accurate and Calibrated Classification: Regularizing Cross-Entropy From A Generative Perspective

Qipeng Zhan, Zhuoping Zhou, Li Shen

Main category: cs.LG

TL;DR: GCE (Generative Cross-Entropy) improves both accuracy and calibration in deep neural networks by maximizing p(x|y) instead of p(y|x), acting as a class-level confidence regularizer.

Motivation: Modern DNNs are often overconfident due to overfitting on negative log-likelihood, creating a persistent trade-off between calibration and predictive performance. Focal loss variants help calibration but typically reduce accuracy.

Method: Proposes Generative Cross-Entropy (GCE) which maximizes p(x|y) instead of p(y|x). This is equivalent to cross-entropy augmented with a class-level confidence regularizer. GCE is strictly proper under mild conditions. Combined with adaptive piecewise temperature scaling (ATS) for further calibration.
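
The "cross-entropy plus class-level confidence regularizer" decomposition can be sketched as follows; the quadratic penalty on batch-average class confidence is an illustrative stand-in, since the paper's exact regularizer form is not given here:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def gce_style_loss(logits_batch, labels, lam=1.0):
    """Cross-entropy plus a class-level confidence penalty (illustrative):
    the second term discourages the batch-average probability mass from
    over-concentrating on any single class."""
    n, k = len(logits_batch), len(logits_batch[0])
    probs = [softmax(z) for z in logits_batch]
    ce = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n
    class_conf = [sum(p[c] for p in probs) / n for c in range(k)]
    reg = sum(c * c for c in class_conf)
    return ce + lam * reg

# Uniform logits over two classes: CE = ln 2 and the regularizer equals 0.5.
loss = gce_style_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1])
assert abs(loss - (math.log(2.0) + 0.5)) < 1e-9
```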

Result: GCE improves both accuracy and calibration over standard cross-entropy across CIFAR-10/100, Tiny-ImageNet, and medical imaging benchmarks, especially in long-tailed scenarios. With ATS, achieves calibration competitive with focal-loss variants without sacrificing accuracy.

Conclusion: GCE provides a principled approach to address the calibration-accuracy trade-off by leveraging generative modeling principles within discriminative classifiers, offering improved performance across various datasets and scenarios.

Abstract: Accurate classification requires not only high predictive accuracy but also well-calibrated confidence estimates. Yet, modern deep neural networks (DNNs) are often overconfident, primarily due to overfitting on the negative log-likelihood (NLL). While focal loss variants alleviate this issue, they typically reduce accuracy, revealing a persistent trade-off between calibration and predictive performance. Motivated by the complementary strengths of generative and discriminative classifiers, we propose Generative Cross-Entropy (GCE), which maximizes $p(x|y)$ and is equivalent to cross-entropy augmented with a class-level confidence regularizer. Under mild conditions, GCE is strictly proper. Across CIFAR-10/100, Tiny-ImageNet, and a medical imaging benchmark, GCE improves both accuracy and calibration over cross-entropy, especially in the long-tailed scenario. Combined with adaptive piecewise temperature scaling (ATS), GCE attains calibration competitive with focal-loss variants without sacrificing accuracy.

[485] Bi-Lipschitz Autoencoder With Injectivity Guarantee

Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Qi Long, Li Shen

Main category: cs.LG

TL;DR: BLAE introduces a Bi-Lipschitz Autoencoder with injective regularization and bi-Lipschitz constraints to preserve manifold geometry during dimensionality reduction, addressing issues of non-injective mappings and distribution drift.

Motivation: Existing regularized autoencoders for dimensionality reduction suffer from non-injective mappings and rigid constraints, leading to poor convergence, distorted latent representations, and lack of robustness to data distribution shifts.

Method: Proposes Bi-Lipschitz Autoencoder (BLAE) with two innovations: (1) injective regularization based on a separation criterion to eliminate pathological local minima, and (2) bi-Lipschitz relaxation that preserves geometry and exhibits robustness to data distribution drift.
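
A bi-Lipschitz constraint bounds the ratio of latent to input distances within [1/L, L]. The hinge penalty below is an illustrative relaxation of that constraint, not BLAE's exact scheme:

```python
def bilipschitz_penalty(dists_x, dists_z, L=2.0):
    """Hinge penalty on latent/input distance ratios outside [1/L, L]:
    collapsed pairs (ratio near 0, i.e. non-injective) and exploded pairs
    are both penalized; ratios inside the band are free."""
    pen = 0.0
    for dx, dz in zip(dists_x, dists_z):
        r = dz / dx
        pen += max(0.0, r - L) + max(0.0, 1.0 / L - r)
    return pen

assert bilipschitz_penalty([1.0], [1.5]) == 0.0           # inside [0.5, 2]
assert bilipschitz_penalty([1.0, 1.0], [0.1, 5.0]) > 0.0  # collapse + blow-up
```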

Result: Empirical results on diverse datasets show BLAE consistently outperforms existing methods in preserving manifold structure while remaining resilient to sampling sparsity and distribution shifts.

Conclusion: BLAE provides a robust solution for manifold-preserving dimensionality reduction through injective regularization and bi-Lipschitz constraints, addressing fundamental limitations of existing autoencoder approaches.

Abstract: Autoencoders are widely used for dimensionality reduction, based on the assumption that high-dimensional data lies on low-dimensional manifolds. Regularized autoencoders aim to preserve manifold geometry during dimensionality reduction, but existing approaches often suffer from non-injective mappings and overly rigid constraints that limit their effectiveness and robustness. In this work, we identify encoder non-injectivity as a core bottleneck that leads to poor convergence and distorted latent representations. To ensure robustness across data distributions, we formalize the concept of admissible regularization and provide sufficient conditions for its satisfaction. In this work, we propose the Bi-Lipschitz Autoencoder (BLAE), which introduces two key innovations: (1) an injective regularization scheme based on a separation criterion to eliminate pathological local minima, and (2) a bi-Lipschitz relaxation that preserves geometry and exhibits robustness to data distribution drift. Empirical results on diverse datasets show that BLAE consistently outperforms existing methods in preserving manifold structure while remaining resilient to sampling sparsity and distribution shifts. Code is available at https://github.com/qipengz/BLAE.

[486] Bi-level Heterogeneous Learning for Time Series Foundation Models: A Federated Learning Approach

Shengchao Chen, Guodong Long, Dikai Liu, Jing Jiang

Main category: cs.LG

TL;DR: A federated learning method for training time series foundation models that addresses bi-level heterogeneity (inter-domain and intra-domain) through local regularization and domain-aware aggregation to improve representation quality.

Motivation: Time series data exhibits greater heterogeneity than vision or language data, with substantial variations across domains and tasks. Existing time series foundation models trained with mixed-batch strategies suffer from gradient conflicts and degraded representation quality due to this heterogeneity.

Method: Proposes a fine-grained learning method that distills invariant knowledge from heterogeneous series while reducing cross-domain interference. Uses federated learning with two key components: 1) local regularization to enforce domain-invariant and semantically consistent representations (addressing intra-domain conflicts), and 2) domain-aware aggregation to enhance cross-domain collaboration (addressing inter-domain discrepancies).
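
Domain-aware aggregation can be sketched as similarity-weighted federated averaging; the 0/1 same-domain indicator softened by `tau` is a toy similarity measure, not the paper's:

```python
def domain_aware_aggregate(client_params, domains, target_domain, tau=0.5):
    """Similarity-weighted parameter averaging: clients from the target domain
    get weight 1, others get weight tau < 1, so aggregation leans toward
    clients whose data is most relevant to the target domain."""
    weights = [1.0 if d == target_domain else tau for d in domains]
    s = sum(weights)
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params)) / s
            for i in range(dim)]

# Plain FedAvg would give 2.0; domain weighting pulls toward the in-domain client.
agg = domain_aware_aggregate([[1.0], [3.0]], ["energy", "traffic"], "energy")
assert abs(agg[0] - 2.5 / 1.5) < 1e-12
```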

Result: Experiments across diverse benchmarks show that time series foundation models trained with this method consistently outperform both centralized and federated baselines in point and probabilistic forecasting, while also achieving competitive zero-shot performance at scale.

Conclusion: The method offers a flexible pathway for training time series foundation models from scratch in heterogeneous environments by effectively addressing bi-level heterogeneity through federated learning with specialized regularization and aggregation techniques.

Abstract: Heterogeneity in time series data is more pronounced than in vision or language, as temporal dynamics vary substantially across domains and tasks. Existing efforts on training time series foundation models (TSFMs) from scratch are often trained with mixed-batch strategies that merge large-scale datasets, which can cause gradient conflicts and degrade representation quality. To address this, we propose a fine-grained learning method that distills invariant knowledge from heterogeneous series while reducing cross-domain interference. We characterize heterogeneity at two levels: inter-domain and intra-domain. To tackle this bi-level heterogeneity, we design a federated learning method that mitigates intra-domain conflicts by enforcing domain-invariant and semantically consistent representations through local regularization, and addresses inter-domain discrepancies by enhancing cross-domain collaboration via domain-aware aggregation. Experiments across diverse benchmarks show that TSFMs trained with our method consistently outperform both centralized and federated TSFM baselines in point and probabilistic forecasting, while also achieving competitive zero-shot performance at scale, offering a flexible pathway for training TSFMs from scratch in heterogeneous environments.

[487] Extraction of linearized models from pre-trained networks via knowledge distillation

Fumito Kimura, Jun Ohkubo

Main category: cs.LG

TL;DR: Proposes a framework to extract linearized models from pre-trained neural networks using Koopman operator theory and knowledge distillation for classification tasks.

Motivation: Hardware developments like photonic integrated circuits and optical devices create demand for machine learning architectures tailored for linear operations, motivating research on constructing learning machines with only linear operations after simple nonlinear preprocessing.

Method: Integrates Koopman operator theory with knowledge distillation to extract a linearized model from a pre-trained neural network for classification tasks.
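
The distillation step can be sketched as fitting a purely linear student on fixed nonlinear features to a teacher's softened outputs. The lift function, the temperature, and the plain gradient-descent loop below are illustrative assumptions; the paper's Koopman-theoretic construction differs:

```python
import numpy as np

def distill_linear_student(X, teacher_logits, lift, lr=1e-2, epochs=200, T=2.0):
    """Fit a linear map K on lifted features so that softmax(lift(X) @ K)
    matches the teacher's temperature-softened predictions.
    (Illustrative sketch, not the paper's exact Koopman-based procedure.)"""
    Phi = lift(X)                                   # fixed nonlinear preprocessing
    K = np.zeros((Phi.shape[1], teacher_logits.shape[1]))
    soft = np.exp(teacher_logits / T)
    soft /= soft.sum(axis=1, keepdims=True)         # teacher soft targets
    for _ in range(epochs):
        logits = Phi @ K
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        K -= lr * Phi.T @ (p - soft) / len(X)       # cross-entropy gradient step
    return K
```

At inference only `lift(X) @ K` is evaluated, i.e. a single linear operation after the nonlinear preprocessing.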

Result: Numerical demonstrations on MNIST and Fashion-MNIST show the proposed model consistently outperforms conventional least-squares-based Koopman approximation in both classification accuracy and numerical stability.

Conclusion: The framework successfully creates linearized models from neural networks that maintain performance while being suitable for hardware optimized for linear operations.

Abstract: Recent developments in hardware, such as photonic integrated circuits and optical devices, are driving demand for research on constructing machine learning architectures tailored for linear operations. Hence, it is valuable to explore methods for constructing learning machines with only linear operations after simple nonlinear preprocessing. In this study, we propose a framework to extract a linearized model from a pre-trained neural network for classification tasks by integrating Koopman operator theory with knowledge distillation. Numerical demonstrations on the MNIST and the Fashion-MNIST datasets reveal that the proposed model consistently outperforms the conventional least-squares-based Koopman approximation in both classification accuracy and numerical stability.

[488] Busemann energy-based attention for emotion analysis in Poincaré discs

Zinaid Kapić, Vladimir Jaćimović

Main category: cs.LG

TL;DR: EmBolic is a hyperbolic deep learning architecture for fine-grained emotion analysis from text, using hyperbolic geometry to capture hierarchical relationships between words and emotions through attention mechanisms in hyperbolic space.

Motivation: The paper aims to address the hierarchical nature of emotions and semantic ambiguities in textual emotion analysis. Traditional categorical approaches lack metric structure, while hyperbolic geometry naturally captures hierarchical relationships, making it suitable for modeling the continuous space of emotions.

Method: Proposes the EmBolic architecture with a hyperbolic attention mechanism. Textual messages generate queries as points in the hyperbolic disc, while keys emerge automatically at the boundary. Uses the Busemann energy between queries and keys to evaluate alignment with emotion class directions. Trains the model to infer curvature on a continuous emotion space rather than treating emotions as categorical.
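
For an ideal point ξ on the boundary of the Poincaré disc, the Busemann function has the closed form B_ξ(q) = log(‖ξ − q‖² / (1 − ‖q‖²)), and class scores follow from a softmax over negative energies. A minimal numerical sketch (the paper learns queries and keys; here they are given):

```python
import numpy as np

def busemann(q, xi):
    """Busemann function on the Poincare disc for a query q (|q| < 1)
    and an ideal point xi on the boundary (|xi| = 1)."""
    return np.log(np.sum((xi - q) ** 2) / (1.0 - np.sum(q ** 2)))

def class_probs(q, keys):
    """Softmax over negative Busemann energies to emotion-class directions."""
    energies = np.array([busemann(q, k) for k in keys])
    scores = -energies
    p = np.exp(scores - scores.max())
    return p / p.sum()
```

A query lying close to the boundary direction of a class has low Busemann energy toward that class, hence high probability.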

Result: Demonstrates strong generalization properties and reasonably good prediction accuracy even with small representation space dimensions. Shows hyperbolic representations are advantageous for affective computing tasks.

Conclusion: Hyperbolic geometry is particularly beneficial for affective computing applications, as it efficiently captures hierarchical relationships between words and emotions. The EmBolic architecture successfully models emotions as continuous rather than categorical entities.

Abstract: We present EmBolic - a novel fully hyperbolic deep learning architecture for fine-grained emotion analysis from textual messages. The underlying idea is that hyperbolic geometry efficiently captures hierarchies between both words and emotions. In our context, these hierarchical relationships arise from semantic ambiguities. EmBolic aims to infer the curvature on the continuous space of emotions, rather than treating them as a categorical set without any metric structure. At the heart of our architecture is the attention mechanism in the hyperbolic disc. The model is trained to generate queries (points in the hyperbolic disc) from textual messages, while keys (points at the boundary) emerge automatically from the generated queries. Predictions are based on the Busemann energy between queries and keys, evaluating how well a certain textual message aligns with the class directions representing emotions. Our experiments demonstrate strong generalization properties and reasonably good prediction accuracy even for small dimensions of the representation space. Overall, this study supports our claim that affective computing is one of the application domains where hyperbolic representations are particularly advantageous.

[489] The Rhetoric of Machine Learning

Robert C. Williamson

Main category: cs.LG

TL;DR: The paper examines machine learning through a rhetorical lens, arguing it’s inherently persuasive rather than objective, and analyzes “manipulation as a service” business models.

Motivation: To challenge the perception of machine learning as a neutral, objective technology and instead analyze it as a rhetorical tool of persuasion that serves specific interests and business models.

Method: Theoretical analysis using rhetorical theory to examine machine learning’s persuasive features and their application in business contexts, particularly “manipulation as a service” models.

Result: Machine learning is shown to have inherent rhetorical features that make it persuasive rather than objective, and these features are exploited in business models that manipulate user behavior.

Conclusion: Machine learning should be understood as rhetorical technology that serves persuasive purposes, not as neutral scientific tools, requiring critical examination of its applications and business models.

Abstract: I examine the technology of machine learning from the perspective of rhetoric, which is simply the art of persuasion. Rather than being a neutral and “objective” way to build “world models” from data, machine learning is (I argue) inherently rhetorical. I explore some of its rhetorical features, and examine one pervasive business model where machine learning is widely used, “manipulation as a service.”

[490] Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Models

Marshall Brett

Main category: cs.LG

TL;DR: Analysis of Voronoi tessellation in language models reveals linear scaling law for expressibility gap and shows margin refinement procedures can reshape token decision boundaries without retraining, with Fisher information method preserving downstream performance.

Motivation: To understand the geometric structure of language model representations through Voronoi tessellation analysis and explore whether token decision boundaries can be refined post-hoc without full model retraining.

Method: Empirical study of Qwen3.5-4B-Base using float32 margin recomputation to resolve bfloat16 artifacts; validation of linear scaling law; comparison of two margin refinement procedures (direct margin maximization vs Fisher information distance maximization) across dose-response sweeps.
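
The margin being recomputed is the standard token-decision margin, i.e. the gap between the two largest logits; computing it in float32 avoids the spurious ties that bfloat16 rounding introduces. A minimal sketch of that reduction (not the paper's full pipeline):

```python
import numpy as np

def decision_margin(logits):
    """Gap between the top-1 and top-2 logits along the last axis,
    computed in float32 to resolve low-precision rounding ties."""
    l32 = np.asarray(logits, dtype=np.float32)
    top2 = np.partition(l32, -2, axis=-1)[..., -2:]   # [2nd largest, largest]
    return top2[..., 1] - top2[..., 0]
```

A margin of exactly zero after bfloat16 rounding can become a small positive gap in float32, which is why the recomputation matters for the scaling-law fit.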

Result: Confirmed linear scaling law of expressibility gap with R²=0.9997; identified mid-layer geometric ambiguity regime; both MRP methods found same ceiling of ~16,300 correctable positions but Fisher method maintained constant collateral damage (~5,300 positions) while margin maximization damage escalated; Fisher MRP achieved +28% median margin improvement with invariant downstream benchmarks.

Conclusion: Fisher MRP is a viable geometric polishing tool that can compress expressibility gap while preserving scaling law, but gains concentrate in high-frequency structural tokens, with practical ceiling set by uniformity of token-level benefit rather than aggregate damage.

Abstract: Language models operate on discrete tokens but compute in continuous vector spaces, inducing a Voronoi tessellation over the representation manifold. We study this tessellation empirically on Qwen3.5-4B-Base, making two contributions. First, using float32 margin recomputation to resolve bfloat16 quantization artifacts, we validate Mabrok’s (2026) linear scaling law of the expressibility gap with $R^2$ = 0.9997 - the strongest confirmation to date - and identify a mid-layer geometric ambiguity regime where margin geometry is anti-correlated with cross-entropy (layers 24-28, $\rho$ = -0.29) before crystallizing into alignment at the final layer ($\rho$ = 0.836). Second, we show that the Voronoi tessellation of a converged model is reshapable through margin refinement procedures (MRP): short post-hoc optimization runs that widen token-decision margins without retraining. We compare direct margin maximization against Fisher information distance maximization across a dose-response sweep. Both methods find the same ceiling of ~16,300 correctable positions per 256K evaluated, but differ critically in collateral damage. Margin maximization damage escalates with intervention strength until corrections are overwhelmed. Fisher damage remains constant at ~5,300 positions across the validated range ($\lambda$ = 0.15-0.6), achieving +28% median margin improvement at $\lambda$ = 0.6 with invariant downstream benchmarks - a geometric reorganization that compresses the expressibility gap while preserving its scaling law. However, frequency and token-class audits reveal that gains concentrate in high-frequency structural tokens (84% of net corrections at $\lambda$ = 0.6), with content and entity-like contributions shrinking at higher $\lambda$. Fisher MRP is therefore a viable geometric polishing tool whose practical ceiling is set not by aggregate damage but by the uniformity of token-level benefit.

[491] Sparse-Aware Neural Networks for Nonlinear Functionals: Mitigating the Exponential Dependence on Dimension

Jianfei Li, Shuo Huang, Han Feng, Ding-Xuan Zhou, Gitta Kutyniok

Main category: cs.LG

TL;DR: Sparse convolutional architectures for functional learning mitigate the curse of dimensionality and improve approximation rates in operator learning.

Motivation: Existing neural network theories for learning operators over infinite-dimensional function spaces face challenges with dimensionality and interpretability. The paper aims to investigate how sparsity can address these issues in functional learning.

Method: Proposes a framework using convolutional architectures to extract sparse features from finite samples, combined with deep fully connected networks to approximate nonlinear functionals. Uses universal discretization methods to show sparse approximators enable stable recovery from discrete samples.

Result: Sparse approximators enable stable recovery from discrete samples with both deterministic and random sampling schemes. Leads to improved approximation rates and reduced sample sizes in various function spaces, including those with fast frequency decay and mixed smoothness.

Conclusion: Sparsity can alleviate the curse of dimensionality in functional learning, providing new theoretical insights and practical benefits for operator learning with neural networks.

Abstract: Deep neural networks have emerged as powerful tools for learning operators defined over infinite-dimensional function spaces. However, existing theories frequently encounter difficulties related to dimensionality and limited interpretability. This work investigates how sparsity can help address these challenges in functional learning, a central ingredient in operator learning. We propose a framework that employs convolutional architectures to extract sparse features from a finite number of samples, together with deep fully connected networks to effectively approximate nonlinear functionals. Using universal discretization methods, we show that sparse approximators enable stable recovery from discrete samples. In addition, both the deterministic and the random sampling schemes are sufficient for our analysis. These findings lead to improved approximation rates and reduced sample sizes in various function spaces, including those with fast frequency decay and mixed smoothness. They also provide new theoretical insights into how sparsity can alleviate the curse of dimensionality in functional learning.

[492] Instance-Adaptive Parametrization for Amortized Variational Inference

Andrea Pollastro, Andrea Apicella, Francesco Isgrò, Roberto Prevete

Main category: cs.LG

TL;DR: IA-VAE introduces instance-adaptive modulation via hypernetwork to reduce amortization gap in variational autoencoders while maintaining single forward pass efficiency.

Motivation: Standard VAEs suffer from amortization gap due to shared encoder parameters across all inputs, limiting posterior approximation quality. Need to improve inference flexibility while preserving computational efficiency.

Method: Proposes instance-adaptive VAE (IA-VAE) where a hypernetwork generates input-dependent modulations of a shared encoder, enabling instance-specific adaptation without multiple forward passes.
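
The input-dependent modulation can be sketched FiLM-style: a small hypernetwork maps each input to a scale and shift applied to the shared encoder's hidden activations. The single hidden layer and all sizes below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class IAEncoder:
    """Shared encoder whose hidden activations are modulated per-instance
    by a small hypernetwork (FiLM-style sketch with illustrative sizes)."""
    def __init__(self, d_in, d_hid, d_lat):
        self.W = rng.normal(0, 0.1, (d_in, d_hid))    # shared encoder layer
        self.V = rng.normal(0, 0.1, (d_hid, 2 * d_lat))  # head -> (mu, logvar)
        self.H = rng.normal(0, 0.1, (d_in, 2 * d_hid))   # hypernetwork

    def __call__(self, x):
        gamma, beta = np.split(x @ self.H, 2, axis=-1)   # input-dependent modulation
        h = np.tanh((1.0 + gamma) * (x @ self.W) + beta)
        mu, logvar = np.split(h @ self.V, 2, axis=-1)
        return mu, logvar
```

The posterior parameters still come from one forward pass; only the modulation, not the shared weights, depends on the instance.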

Result: IA-VAE achieves more accurate posterior approximations on synthetic data, reduces amortization gap, and consistently improves held-out ELBO on image benchmarks with statistical significance.

Conclusion: Instance-adaptive modulation through hypernetworks effectively mitigates amortization-induced suboptimality in deep generative models while maintaining parameter efficiency.

Abstract: Latent variable models, including variational autoencoders (VAE), remain a central tool in modern deep generative modeling due to their scalability and a well-founded probabilistic formulation. These models rely on amortized variational inference to enable efficient posterior approximation, but this efficiency comes at the cost of a shared parametrization, giving rise to the amortization gap. We propose the instance-adaptive variational autoencoder (IA-VAE), an amortized variational inference framework in which a hypernetwork generates input-dependent modulations of a shared encoder. This enables input-specific adaptation of the inference model while preserving the efficiency of a single forward pass. By leveraging instance-specific parameter modulations, the proposed approach can achieve performance comparable to standard encoders with substantially fewer parameters, indicating a more efficient use of model capacity. Experiments on synthetic data, where the true posterior is known, show that IA-VAE yields more accurate posterior approximations and reduces the amortization gap. Similarly, on standard image benchmarks, IA-VAE consistently improves held-out ELBO over baseline VAEs, with statistically significant gains across multiple runs. These results suggest that increasing the flexibility of the inference parametrization through instance-adaptive modulation is a key factor in mitigating amortization-induced suboptimality in deep generative models.

[493] MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang

Main category: cs.LG

TL;DR: MoBiE is a novel binarization framework specifically designed for Mixture-of-Experts (MoE) LLMs that addresses MoE-specific challenges like cross-expert redundancy and routing distortion while maintaining efficiency.

Motivation: MoE-based LLMs offer strong performance but suffer from high memory and computation costs. Existing binary quantization methods designed for dense LLMs struggle with MoE-specific issues including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts.

Method: MoBiE uses three core innovations: 1) joint SVD decomposition to reduce cross-expert redundancy, 2) integration of global loss gradients into local Hessian metrics for better weight importance estimation, and 3) error constraint guided by input null space to mitigate routing distortion. The framework achieves optimizations without additional storage overhead.
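
The joint-SVD idea can be sketched as extracting one shared row-space basis from all expert weight matrices, so each expert is stored as coordinates in that basis. The rank choice and factorization details below are assumptions, not the paper's exact decomposition:

```python
import numpy as np

def shared_basis(expert_weights, rank):
    """Extract a shared row-space basis across experts via a joint SVD
    (illustrative of the cross-expert redundancy-reduction step)."""
    stacked = np.concatenate(expert_weights, axis=0)   # (E*m, n)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    B = Vt[:rank]                                      # shared basis (rank, n)
    coeffs = [W @ B.T for W in expert_weights]         # per-expert coordinates
    return B, coeffs
```

When the experts genuinely share structure, `coeffs[e] @ B` reconstructs each expert from far fewer parameters than the raw weights.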

Result: On Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and shortens quantization time. It consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks.

Conclusion: MoBiE successfully addresses MoE-specific binarization challenges, providing an efficient solution that balances model performance with computational efficiency for MoE-based LLMs.

Abstract: Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: 1. using joint SVD decomposition to reduce cross-expert redundancy; 2. integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; 3. introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon-zzx/MoBiE.

[494] OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale

Dihong Jiang, Ruoqi Cao, Zhiyuan Dang, Li Huang, Qingsong Zhang, Zhiyu Wang, Shihao Piao, Shenggao Zhu, Jianlong Chang, Zhouchen Lin, Qi Tian

Main category: cs.LG

TL;DR: OmniTabBench is the largest tabular benchmark with 3030 datasets, showing no single model family dominates all tabular tasks, with decoupled metafeature analysis providing clearer guidance on when specific models perform best.

Motivation: No consensus exists on a superior model family for tabular tasks, with an ongoing debate between traditional tree ensembles and deep neural networks/foundation models. Existing benchmarks are small (<100 datasets), raising concerns about evaluation sufficiency and selection bias.

Method: Created OmniTabBench with 3030 datasets from diverse sources, categorized by industry using LLMs. Conducted large-scale empirical evaluation of state-of-the-art models from all families. Used decoupled metafeature analysis examining individual properties like dataset size, feature types, skewness/kurtosis.

Result: No dominant winner across all datasets. Decoupled metafeature analysis revealed conditions favoring specific model categories, providing clearer guidance than prior compound-metric studies.

Conclusion: Tabular ML requires nuanced model selection based on dataset characteristics rather than universal superiority claims. OmniTabBench enables more comprehensive evaluation and clearer understanding of model strengths.

Abstract: While traditional tree-based ensemble methods have long dominated tabular tasks, deep neural networks and emerging foundation models have challenged this primacy, yet no consensus exists on a universally superior paradigm. Existing benchmarks typically contain fewer than 100 datasets, raising concerns about evaluation sufficiency and potential selection biases. To address these limitations, we introduce OmniTabBench, the largest tabular benchmark to date, comprising 3030 datasets spanning diverse tasks that are comprehensively collected from diverse sources and categorized by industry using large language models. We conduct an unprecedented large-scale empirical evaluation of state-of-the-art models from all model families on OmniTabBench, confirming the absence of a dominant winner. Furthermore, through a decoupled metafeature analysis, which examines individual properties such as dataset size, feature types, feature and target skewness/kurtosis, we elucidate conditions favoring specific model categories, providing clearer, more actionable guidance than prior compound-metric studies.

[495] STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

Minglu Liu, Cunchen Hu, Liangliang Xu, Fengming Tang, Ruijia Wang, Fu Yu

Main category: cs.LG

TL;DR: STQuant is a distributed training framework that reduces optimizer-state memory via dynamic precision allocation across layers, state variables, and training steps, achieving 84.4% memory reduction with only 5.1 average bits.

Motivation: Existing quantization methods use fixed-precision policies that ignore significant variations in optimizer-state distributions across layers and training steps, leading to accuracy degradation. There's a need for dynamic quantization that adapts to these variations while maintaining model quality.

Method: STQuant uses two key techniques: 1) a provably near-optimal factor selection strategy to identify the most influential factors for precision adaptation, and 2) a dynamic transition decision algorithm that reduces search cost from exponential to linear complexity. It dynamically allocates precision across layers, state variables, and training steps.
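
Dynamic precision allocation can be sketched greedily: start every layer at a base width and spend the remaining bit budget on the layers with the largest quantization error. This greedy rule is a simplified stand-in for STQuant's provably near-optimal factor selection, and the base width of 4 bits is an assumption:

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of a tensor to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    m = np.max(np.abs(x))
    scale = m / levels if m > 0 else 1.0
    return np.round(x / scale) * scale

def allocate_bits(layer_states, budget_bits):
    """Greedy precision allocation: give extra bits to the layers with the
    largest quantization error at the base width (simplified stand-in for
    STQuant's near-optimal factor selection)."""
    bits = {k: 4 for k in layer_states}                 # base width per layer
    extra = budget_bits - 4 * len(layer_states)
    errs = {k: np.mean((v - quantize(v, 4)) ** 2) for k, v in layer_states.items()}
    for k in sorted(errs, key=errs.get, reverse=True):
        if extra <= 0:
            break
        step = min(4, extra)
        bits[k] += step
        extra -= step
    return bits
```

Layers whose optimizer states are numerically sensitive (large quantization error at the base width) end up with more precision, which is the behavior the spatio-temporal allocation generalizes across state variables and training steps.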

Result: Experiments on GPT-2 and ViT show STQuant reduces optimizer-state memory by 84.4%, achieving an average bit-width as low as 5.1 bits compared to existing solutions. It incurs only O(N/K) computational overhead and requires O(1) extra space.

Conclusion: STQuant successfully addresses the challenges of dynamic quantization for optimizer states, providing significant memory reduction while maintaining model quality through adaptive precision allocation across multiple dimensions.

Abstract: Quantization is an effective way to reduce the memory cost of large-scale model training. However, most existing methods adopt fixed-precision policies, which ignore the fact that optimizer-state distributions vary significantly across layers and training steps. Such uniform designs often introduce noticeable accuracy degradation. To move beyond fixed quantization, we propose STQuant, a distributed training framework that reduces the memory footprint of optimizer states via dynamic precision allocation across layers, state variables, and training steps, while maintaining model quality. Naively applying dynamic quantization during training is challenging for two reasons. First, optimizer states are numerically sensitive, and quantization noise can destabilize quality. Second, jointly considering multiple states and layers induces a large combinatorial search space. STQuant addresses these challenges with two key techniques: 1) a provably near-optimal factor selection strategy that accurately identifies the most influential factors for precision adaptation; 2) a dynamic transition decision algorithm that reduces the search cost from exponential to linear complexity. Experiments on GPT-2 and ViT show that STQuant reduces optimizer-state memory by 84.4%, achieving an average bit-width as low as 5.1 bits, compared with existing solutions. Moreover, STQuant incurs only O(N/K) computational overhead and requires O(1) extra space.

[496] Contraction-Aligned Analysis of Soft Bellman Residual Minimization with Weighted Lp-Norm for Markov Decision Problem

Hyukjun Yang, Han-Dong Lim, Donghwan Lee

Main category: cs.LG

TL;DR: The paper proposes a soft Bellman residual minimization approach using generalized weighted Lp-norms to align optimization objectives with Bellman operator contraction geometry in reinforcement learning.

Motivation: There's a fundamental geometric mismatch in solving MDPs under function approximation: the Bellman optimality operator is contractive in the L∞-norm, but common objectives like projected value iteration and Bellman residual minimization use L2-based formulations, creating optimization challenges.

Method: The authors propose a soft formulation of Bellman residual minimization extended to generalized weighted Lp-norms. This approach enables gradient-based optimization while better aligning with the contraction geometry of the Bellman operator as p increases.
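
A weighted Lp soft Bellman residual can be written down directly for a tabular MDP, with the soft backup using a log-sum-exp in place of the max; as p grows the objective approaches the sup-norm residual in which the operator contracts. A sketch under these standard definitions (the paper's function-approximation setting is more general):

```python
import numpy as np

def soft_bellman_residual_lp(Q, R, P, gamma, tau, w, p):
    """Weighted Lp soft Bellman residual for a tabular MDP.
    Q: (S, A) values, R: (S, A) rewards, P: (S, A, S) transitions,
    w: (S, A) weights; larger p emphasizes the worst-violated entries."""
    # soft value per state: tau * logsumexp(Q / tau) over actions (stabilized)
    m = Q.max(axis=1, keepdims=True)
    V = (m + tau * np.log(np.exp((Q - m) / tau).sum(axis=1, keepdims=True))).ravel()
    TQ = R + gamma * P @ V                       # soft Bellman backup
    return (np.sum(w * np.abs(Q - TQ) ** p)) ** (1.0 / p)
```

At the soft-optimal fixed point the residual vanishes for every p; away from it, increasing p shifts weight toward the largest per-state-action violation, aligning the loss with the L∞ contraction geometry.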

Result: The formulation aligns optimization objectives with Bellman contraction geometry, provides performance error bounds, and offers improved control of error propagation while remaining compatible with gradient-based optimization methods.

Conclusion: The work establishes a principled connection between residual minimization and Bellman contraction, addressing a fundamental geometric mismatch in RL with function approximation and enabling more effective gradient-based optimization.

Abstract: The problem of solving Markov decision processes under function approximation remains a fundamental challenge, even under linear function approximation settings. A key difficulty arises from a geometric mismatch: while the Bellman optimality operator is contractive in the L∞-norm, commonly used objectives such as projected value iteration and Bellman residual minimization rely on L2-based formulations. To enable gradient-based optimization, we consider a soft formulation of Bellman residual minimization and extend it to a generalized weighted Lp-norm. We show that this formulation aligns the optimization objective with the contraction geometry of the Bellman operator as p increases, and derive corresponding performance error bounds. Our analysis provides a principled connection between residual minimization and Bellman contraction, leading to improved control of error propagation while remaining compatible with gradient-based optimization.

[497] MENO: MeanFlow-Enhanced Neural Operators for Dynamical Systems

Tianyue Yang, Xiao Xue

Main category: cs.LG

TL;DR: MENO framework enhances neural operators for dynamical systems by restoring multi-scale features with minimal inference cost, achieving better accuracy than baselines and faster inference than diffusion-based methods.

Motivation: Neural operators are efficient surrogates for dynamical systems but lose small-scale structures due to Fourier truncation. Diffusion methods recover details but add heavy inference overhead, undermining neural operators' efficiency advantage.

Method: MENO (MeanFlow-Enhanced Neural Operators) uses improved MeanFlow method to restore both small-scale details and large-scale dynamics with superior physical fidelity and statistical accuracy while maintaining computational efficiency.
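
The high-frequency loss that MENO compensates for comes from the mode truncation inside Fourier-based operator layers, which a one-line FFT filter reproduces:

```python
import numpy as np

def spectral_truncate(u, keep):
    """Zero all Fourier modes of a 1-D signal above index `keep`,
    mimicking the mode truncation in Fourier neural operator layers."""
    U = np.fft.rfft(u)
    U[keep:] = 0.0
    return np.fft.irfft(U, n=len(u))
```

Any structure above the cutoff (for example a sin(20x) component under keep=8) is removed exactly; this is the small-scale content that MENO's MeanFlow stage is designed to restore.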

Result: MENO improves power spectral density accuracy by up to 2x compared to baseline neural operators and achieves 12x faster inference than DDIM-enhanced counterparts on three challenging dynamical systems at resolutions up to 256×256.

Conclusion: MENO bridges accuracy-efficiency gap for scientific ML applications, offering flexible and efficient surrogate modeling where statistical integrity and computational efficiency are both important.

Abstract: Neural operators have emerged as powerful surrogates for dynamical systems due to their grid-invariant properties and computational efficiency. However, the Fourier-based neural operator framework inherently truncates high-frequency components in spectral space, resulting in the loss of small-scale structures and degraded prediction quality at high resolutions when trained on low-resolution data. While diffusion-based enhancement methods can recover multi-scale features, they introduce substantial inference overhead that undermines the efficiency advantage of neural operators. In this work, we introduce \textbf{M}eanFlow-\textbf{E}nhanced \textbf{N}eural \textbf{O}perators (MENO), a novel framework that achieves accurate all-scale predictions with minimal inference cost. By leveraging the improved MeanFlow method, MENO restores both small-scale details and large-scale dynamics with superior physical fidelity and statistical accuracy. We evaluate MENO on three challenging dynamical systems, including phase-field dynamics, 2D Kolmogorov flow, and active matter dynamics, at resolutions up to 256$\times$256. Across all benchmarks, MENO improves the power spectral density accuracy by up to a factor of 2 compared to baseline neural operators while achieving 12$\times$ faster inference than the state-of-the-art Denoising Diffusion Implicit Model (DDIM)-enhanced counterparts, effectively bridging the gap between accuracy and efficiency. The flexibility and efficiency of MENO position it as an efficient surrogate model for scientific machine learning applications where both statistical integrity and computational efficiency are paramount.

[498] VertAX: a differentiable vertex model for learning epithelial tissue mechanics

Alessandro Pasqui, Jim Martin Catacora Ocana, Anshuman Sinha, Matthieu Perez, Fabrice Delbary, Giorgio Gosti, Mattia Miotto, Domenico Caudo, Maxence Ernoult, Hervé Turlier

Main category: cs.LG

TL;DR: VertAX is a differentiable JAX-based framework for vertex modeling of epithelial tissues, enabling automatic differentiation, GPU acceleration, and bilevel optimization for forward simulation, parameter inference, and inverse mechanical design.

Motivation: Epithelial tissue mechanics are complex with many tunable parameters in vertex models, making inference and optimization challenging. There's a need for computational frameworks that can flexibly model and learn tissue mechanics.

Method: Developed VertAX, a differentiable framework in JAX for vertex modeling of confluent epithelia. Provides automatic differentiation, GPU acceleration, and end-to-end bilevel optimization. Users can define arbitrary energy and cost functions in pure Python. Benchmarked three differentiation strategies: automatic differentiation, implicit differentiation, and equilibrium propagation.
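
A common vertex-model energy, of the kind a VertAX user could define in pure Python, combines area elasticity with perimeter contractility. The numpy sketch below evaluates it for one polygonal cell (VertAX itself is JAX-based and leaves the energy user-defined):

```python
import numpy as np

def cell_energy(verts, A0, kA=1.0, gamma=0.1):
    """Standard vertex-model energy for one polygonal cell:
    area elasticity plus perimeter contractility (a common form;
    the coefficients here are illustrative)."""
    x, y = verts[:, 0], verts[:, 1]
    xn, yn = np.roll(x, -1), np.roll(y, -1)
    area = 0.5 * np.abs(np.sum(x * yn - xn * y))        # shoelace formula
    perim = np.sum(np.hypot(xn - x, yn - y))
    return kA * (area - A0) ** 2 + gamma * perim ** 2
```

In a differentiable framework the negative gradient of this energy with respect to the vertex positions gives the forces that drive the tissue dynamics.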

Result: Demonstrated VertAX on three tasks: forward modeling of tissue morphogenesis, mechanical parameter inference, and inverse design of tissue-scale behaviors. Showed equilibrium propagation can approximate gradients using repeated forward simulations alone, offering a simple route for extending inverse biophysical problems to non-differentiable simulators.

Conclusion: VertAX provides a flexible, differentiable framework for vertex modeling that enables advanced computational approaches to tissue mechanics, with equilibrium propagation offering a practical solution for gradient approximation in non-differentiable systems.

Abstract: Epithelial tissues dynamically reshape through local mechanical interactions among cells, a process well captured by vertex models. Yet their many tunable parameters make inference and optimization challenging, motivating computational frameworks that flexibly model and learn tissue mechanics. We introduce VertAX, a differentiable JAX-based framework for vertex modeling of confluent epithelia. VertAX provides automatic differentiation, GPU acceleration, and end-to-end bilevel optimization for forward simulation, parameter inference, and inverse mechanical design. Users can define arbitrary energy and cost functions in pure Python, enabling seamless integration with machine-learning pipelines. We demonstrate VertAX on three representative tasks: (i) forward modeling of tissue morphogenesis, (ii) mechanical parameter inference, and (iii) inverse design of tissue-scale behaviors. We benchmark three differentiation strategies (automatic differentiation, implicit differentiation, and equilibrium propagation), showing that the latter can approximate gradients using repeated forward, adjoint-free simulations alone, offering a simple route for extending inverse biophysical problems to non-differentiable simulators with limited additional engineering effort.

[499] Equivariant Multi-agent Reinforcement Learning for Multimodal Vehicle-to-Infrastructure Systems

Charbel Bou Chaaya, Mehdi Bennis

Main category: cs.LG

TL;DR: A decentralized multimodal V2I system using self-supervised learning for vehicle positioning and equivariant MARL for resource optimization, achieving significant performance gains over baselines.

Motivation: To address the challenge of optimizing vehicle-to-infrastructure systems where distributed base stations must coordinate using multimodal (wireless and visual) data from moving vehicles, while dealing with partial observability and requiring efficient collaboration.

Method: Proposes a self-supervised learning framework where each base station aligns latent features of multimodal observations to extract vehicle positions, then uses an equivariant policy network with GNN message passing for decentralized MARL training with rotation symmetry incorporation.

Result: Achieves more than two-fold accuracy gains in vehicle positioning over baselines and more than 50% performance gains in resource optimization compared to standard MARL approaches in simulation with ray-tracing and computer graphics data.

Conclusion: The proposed multimodal sensing with self-supervised learning and equivariant MARL framework effectively addresses decentralized V2I optimization, demonstrating strong generalization and coordination capabilities.

Abstract: In this paper, we study a vehicle-to-infrastructure (V2I) system where distributed base stations (BSs) acting as road-side units (RSUs) collect multimodal (wireless and visual) data from moving vehicles. We consider a decentralized rate maximization problem, where each RSU relies on its local observations to optimize its resources, while all RSUs must collaborate to guarantee favorable network performance. We recast this problem as a distributed multi-agent reinforcement learning (MARL) problem, by incorporating rotation symmetries in terms of vehicles’ locations. To exploit these symmetries, we propose a novel self-supervised learning framework where each BS agent aligns the latent features of its multimodal observation to extract the positions of the vehicles in its local region. Equipped with this sensing data at each RSU, we train an equivariant policy network using a graph neural network (GNN) with message passing layers, such that each agent computes its policy locally, while all agents coordinate their policies via a signaling scheme that overcomes partial observability and guarantees the equivariance of the global policy. We present numerical results carried out in a simulation environment, where ray-tracing and computer graphics are used to collect wireless and visual data. Results show the generalizability of our self-supervised and multimodal sensing approach, achieving more than two-fold accuracy gains over baselines, and the efficiency of our equivariant MARL training, attaining more than 50% performance gains over standard approaches.
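The role of equivariance can be illustrated with a minimal message-passing step over relative positions (the paper incorporates rotation symmetries of vehicle locations): because messages are built only from coordinate differences, rotating every input rotates every output identically. A toy sketch, not the paper's GNN policy:

```python
import math

def message_pass(positions, edges):
    """One rotation-equivariant message-passing step: each node sums the
    relative position vectors of its neighbours. Linear in differences,
    so the layer commutes with global rotations of the input."""
    out = []
    for i, (xi, yi) in enumerate(positions):
        mx = my = 0.0
        for a, b in edges:
            if a == i:
                xj, yj = positions[b]
                mx += xj - xi
                my += yj - yi
        out.append((mx, my))
    return out

def rotate(points, angle):
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

pos = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
edges = [(0, 1), (0, 2), (1, 0), (2, 0)]  # directed neighbour pairs
h_rot_after = rotate(message_pass(pos, edges), 0.7)       # rotate outputs
h_rot_before = message_pass(rotate(pos, 0.7), edges)      # rotate inputs
```

Equivariance means the two orders of operation agree, which is the property that lets each agent's local policy generalize across rotated traffic configurations.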

[500] FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, Enze Xie

Main category: cs.LG

TL;DR: Sol-RL: A novel FP4-empowered two-stage reinforcement learning framework that accelerates diffusion model alignment with human preferences by using high-throughput FP4 quantization for candidate exploration and BF16 precision for policy optimization.

Motivation: Scaling reinforcement learning rollouts for aligning large text-to-image diffusion models (like FLUX.1-12B) with human preferences is computationally expensive. While increasing rollout group size improves performance, it imposes heavy computational burdens. FP4 quantization could help but naive approaches risk performance degradation.

Method: Two-stage framework: 1) Use high-throughput NVFP4 rollouts to generate massive candidate pool and extract highly contrastive subset, 2) Regenerate selected samples in BF16 precision and optimize policy exclusively on them. Decouples candidate exploration from policy optimization.

Result: Maintains training integrity of BF16 precision while exploiting FP4 throughput gains. Accelerates training convergence by up to 4.64× across SANA, FLUX.1, and SD3.5-L models. Delivers superior alignment performance across multiple metrics.

Conclusion: Sol-RL effectively accelerates the rollout phase while preserving high-fidelity samples for optimization, enabling massive rollout scaling at reduced cost. Synergistic algorithm-hardware design unlocks efficient diffusion model alignment.

Abstract: Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to $4.64\times$, unlocking the power of massive rollout scaling at a fraction of the cost.
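The first-stage selection, generating a cheap candidate pool and keeping only a highly contrastive subset for full-precision regeneration, can be sketched as follows. The rewards and the top-k/bottom-k rule are illustrative assumptions, not Sol-RL's exact extraction criterion:

```python
def select_contrastive(rewards, k):
    """From a large pool of cheap (e.g. FP4) rollout rewards, keep the k
    highest- and k lowest-reward candidates. Only this contrastive subset
    would then be regenerated at full (BF16) precision and used for the
    policy update; the precision split itself is simulated here."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[-k:] + order[:k]  # top-k winners + bottom-k losers

pool = [0.3, 0.9, 0.1, 0.7, 0.5, 0.95, 0.05, 0.6]  # hypothetical FP4 rewards
subset = select_contrastive(pool, 2)
```

The decoupling matters because exploration only needs reward *rankings*, which tolerate low precision, while the optimization step needs high-fidelity samples.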

[501] A First Guess is Rarely the Final Answer: Learning to Search in the Travelling Salesperson Problem

Andoni Irazusta Garmendia

Main category: cs.LG

TL;DR: NICO-TSP is a neural improvement framework for TSP that learns to perform local search via 2-opt moves, using specialized representations and training procedures tailored for improvement rather than single-solution generation.

Motivation: Existing neural TSP solvers output single solutions, but practitioners often use additional search/sampling. Current neural improvement methods underperform due to design mismatches - they reuse components from single-solution methods rather than being built specifically for local search mechanics.

Method: NICO-TSP uses a 2-opt improvement framework with n edge tokens aligned with neighborhood operators, scores 2-opt moves directly without tour positional encodings, and employs two-stage training: imitation learning on short-horizon optimal trajectories followed by critic-free group-based reinforcement learning over longer rollouts.

Result: NICO-TSP delivers stronger and more step-efficient improvement than prior learned and heuristic baselines, generalizes better to larger out-of-distribution instances, and serves as both a competitive replacement for classical local search and a powerful test-time refinement module for constructive solvers.

Conclusion: The paper demonstrates that specialized neural improvement frameworks designed around local search mechanics outperform approaches that reuse components from single-solution methods, showing the importance of task-specific design for learned combinatorial optimization.

Abstract: Most neural solvers for the Traveling Salesperson Problem (TSP) are trained to output a single solution, even though practitioners rarely stop there: at test time, they routinely spend extra compute on sampling or post-hoc search. This raises a natural question: can the search procedure itself be learned? Neural improvement methods take this perspective by learning a policy that applies local modifications to a candidate solution, accumulating gains over an improvement trajectory. Yet learned improvement for TSP remains comparatively immature, with existing methods still falling short of robust, scalable performance. We argue that a key reason is design mismatch: many approaches reuse state representations, architectural choices, and training recipes inherited from single-solution methods, rather than being built around the mechanics of local search. This mismatch motivates NICO-TSP (Neural Improvement for Combinatorial Optimization): a 2-opt improvement framework for TSP. NICO-TSP represents the current tour with exactly $n$ edge tokens aligned with the neighborhood operator, scores 2-opt moves directly without tour positional encodings, and trains via a two-stage procedure: imitation learning to short-horizon optimal trajectories, followed by critic-free group-based reinforcement learning over longer rollouts. Under compute-matched evaluations that measure improvement as a function of both search steps and wall-clock time, NICO-TSP delivers consistently stronger and markedly more step-efficient improvement than prior learned and heuristic search baselines, generalizes far more reliably to larger out-of-distribution instances, and serves both as a competitive replacement for classical local search and as a powerful test-time refinement module for constructive solvers.
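The 2-opt neighborhood that NICO-TSP learns to search over is classical: a move removes two edges and reconnects the tour by reversing the segment between them. A plain (unlearned) greedy 2-opt local search looks like this; the learned policy effectively replaces the move-selection rule:

```python
import math

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_opt_delta(tour, dist, i, j):
    """Length change from reversing tour[i+1..j]: edges (a,b) and (c,d)
    are replaced by (a,c) and (b,d)."""
    n = len(tour)
    a, b = tour[i], tour[(i + 1) % n]
    c, d = tour[j], tour[(j + 1) % n]
    return dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]

def two_opt_improve(tour, dist):
    """Greedy first-improvement 2-opt local search until no move helps."""
    improved = True
    while improved:
        improved = False
        n = len(tour)
        for i in range(n - 1):
            # Skip (0, n-1), which only reverses the whole tour.
            for j in range(i + 2, n - (1 if i == 0 else 0)):
                if two_opt_delta(tour, dist, i, j) < -1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour

# Unit square with a self-crossing start tour; 2-opt uncrosses it.
pts = [(0, 0), (1, 0), (1, 1), (0, 1)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
tour = two_opt_improve([0, 2, 1, 3], dist)
best_len = tour_length(tour, dist)
```

The crossing tour has length 2 + 2*sqrt(2); one 2-opt move recovers the optimal perimeter of length 4. The paper's n edge tokens align one-to-one with exactly this move set.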

[502] Frailty Estimation in Elderly Oncology Patients Using Multimodal Wearable Data and Multi-Instance Learning

Ioannis Kyprakis, Vasileios Skaramagkas, Georgia Karanasiou, Lampros Lakkas, Andri Papakonstantinou, Domen Ribnikar, Kalliopi Keramida, Dorothea Tsekoura, Ketti Mazzocco, Anastasia Constantinidou, Konstantinos Marias, Dimitrios I. Fotiadis, Manolis Tsiknakis

Main category: cs.LG

TL;DR: Multimodal wearable framework using smartwatch activity/sleep data and ECG-derived HRV to estimate frailty-related functional changes in elderly breast cancer patients via attention-based multiple instance learning.

Motivation: Frailty and functional decline significantly impact treatment outcomes in older cancer patients, but current assessment is limited to infrequent clinic visits. There's a need for continuous monitoring of functional changes between visits using wearable technology.

Method: Proposes a multimodal wearable framework combining smartwatch physical activity/sleep features with ECG-derived heart rate variability (HRV) features. Uses attention-based multiple instance learning (MIL) to fuse irregular, multimodal wearable instances under real-world missingness and weak supervision. Features are organized into patient-horizon bags aligned to follow-up timepoints (month 3 and month 6).

Result: The full multimodal model achieved balanced accuracy/F1 of 0.68±0.08/0.67±0.09 at M3 and 0.70±0.10/0.69±0.08 at M6 for handgrip strength, and 0.59±0.04/0.58±0.06 at M3 and 0.64±0.05/0.63±0.07 at M6 for FACIT-F. Smartwatch activity and sleep provided strongest predictive information, while HRV contributed complementary information when fused with smartwatch streams.

Conclusion: Multimodal wearable data combined with attention-based MIL can effectively estimate frailty-related functional changes in elderly cancer patients between clinic visits, with smartwatch activity/sleep data being most predictive and HRV providing additional complementary information.

Abstract: Frailty and functional decline strongly influence treatment tolerance and outcomes in older patients with cancer, yet assessment is typically limited to infrequent clinic visits. We propose a multimodal wearable framework to estimate frailty-related functional change between visits in elderly breast cancer patients enrolled in the multicenter CARDIOCARE study. Free-living smartwatch physical activity and sleep features are combined with ECG-derived heart rate variability (HRV) features from a chest strap and organized into patient-horizon bags aligned to month 3 (M3) and month 6 (M6) follow-ups. Our innovation is an attention-based multiple instance learning (MIL) formulation that fuses irregular, multimodal wearable instances under real-world missingness and weak supervision. An attention-based MIL model with modality-specific multilayer perceptron (MLP) encoders with embedding dimension 128 aggregates variable-length and partially missing longitudinal instances to predict discretized change-from-baseline classes (worsened, stable, improved) for FACIT-F and handgrip strength. Under subject-independent leave-one-subject-out (LOSO) evaluation, the full multimodal model achieved balanced accuracy/F1 of 0.68 +/- 0.08/0.67 +/- 0.09 at M3 and 0.70 +/- 0.10/0.69 +/- 0.08 at M6 for handgrip, and 0.59 +/- 0.04/0.58 +/- 0.06 at M3 and 0.64 +/- 0.05/0.63 +/- 0.07 at M6 for FACIT-F. Ablation results indicated that smartwatch activity and sleep provide the strongest predictive information for frailty-related functional changes, while HRV contributes complementary information when fused with smartwatch streams.
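Attention-based MIL pooling itself is a standard building block: each instance embedding in a bag gets a scalar score, the scores are softmax-normalized into attention weights, and the bag embedding is the weighted average. A minimal sketch with a toy scoring vector (the paper's MLP encoders and 128-dimensional embeddings differ):

```python
import math

def attention_mil_pool(instances, w):
    """Attention-based MIL pooling over a variable-length bag of instance
    embeddings. `w` is a toy linear scoring vector standing in for the
    learned attention network. Variable bag sizes are what make MIL a
    natural fit for irregular, partially missing wearable instances."""
    scores = [sum(wi * xi for wi, xi in zip(w, inst)) for inst in instances]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(instances[0])
    bag = [sum(weights[k] * instances[k][d] for k in range(len(instances)))
           for d in range(dim)]
    return bag, weights

bag_embedding, attn = attention_mil_pool(
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],  # three instances, dim 2
    w=[1.0, 1.0])
```

The attention weights are a convex combination (they sum to 1), and the highest-scoring instance dominates the bag embedding, which is also what makes the pooled weights inspectable per patient-horizon bag.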

[503] Stress Estimation in Elderly Oncology Patients Using Visual Wearable Representations and Multi-Instance Learning

Ioannis Kyprakis, Vasileios Skaramagkas, Georgia Karanasiou, Vasilis Bouratzis, Andri Papakonstantinou, Dimitar Stefanovski, Kalliopi Keramida, Aristofania Simatou, Ketti Mazzocco, Anastasia Constantinidou, Konstantinos Marias, Dimitrios I. Fotiadis, Manolis Tsiknakis

Main category: cs.LG

TL;DR: Multimodal wearable data (smartwatch activity/sleep + ECG) used to estimate psychological stress in breast cancer patients via visual representations and attention-based MIL.

Motivation: Psychological stress is clinically important in cardio-oncology but typically only assessed through patient-reported measures, lacking continuous monitoring integration.

Method: Transform wearable streams into visual representations, use pretrained Tiny-BioMoE backbone for embeddings, aggregate via attention-based multiple instance learning to predict Perceived Stress Scale scores.

Result: Moderate agreement with questionnaire scores (R²=0.24-0.28, Pearson r=0.42-0.49), with RMSE/MAE around 6.62/6.07 at month 3 and 6.13/5.54 at month 6.

Conclusion: Multimodal wearable data can provide continuous stress estimation in clinical oncology settings, though performance is moderate and needs improvement.

Abstract: Psychological stress is clinically relevant in cardio-oncology, yet it is typically assessed only through patient-reported outcome measures (PROMs) and is rarely integrated into continuous cardiotoxicity surveillance. We estimate perceived stress in an elderly, multicenter breast cancer cohort (CARDIOCARE) using multimodal wearable data from a smartwatch (physical activity and sleep) and a chest-worn ECG sensor. Wearable streams are transformed into heterogeneous visual representations, yielding a weakly supervised setting in which a single Perceived Stress Scale (PSS) score corresponds to many unlabeled windows. A lightweight pretrained mixture-of-experts backbone (Tiny-BioMoE) embeds each representation into 192-dimensional vectors, which are aggregated via attention-based multiple instance learning (MIL) to predict PSS at month 3 (M3) and month 6 (M6). Under leave-one-subject-out (LOSO) evaluation, predictions showed moderate agreement with questionnaire scores (M3: R^2=0.24, Pearson r=0.42, Spearman rho=0.48; M6: R^2=0.28, Pearson r=0.49, Spearman rho=0.52), with global RMSE/MAE of 6.62/6.07 at M3 and 6.13/5.54 at M6.

[504] Predictive Representations for Skill Transfer in Reinforcement Learning

Ruben Vereecken, Luke Dickens, Alessandra Russo

Main category: cs.LG

TL;DR: OPSRs are outcome-predictive state representations that enable transfer learning in RL through state abstraction and skill reuse across tasks.

Motivation: The paper addresses the challenge of generalization in RL, where agents typically learn each task from scratch without transferring knowledge. The goal is to develop a framework for effective transfer learning through state abstraction.

Method: Introduces Outcome-Predictive State Representations (OPSRs) - task-independent abstractions based on predictions of environment outcomes. Also develops OPSR-based skills (abstract actions based on options) that can be reused between tasks. Skills are learned from demonstrations and enable transfer to new unseen tasks.

Result: Formal and empirical results show OPSRs enable optimal but limited transfer. OPSR-based skills overcome this limitation and significantly speed up learning in new unseen tasks without pre-processing. Empirical studies demonstrate considerable learning acceleration.

Conclusion: The OPSR framework represents a promising step toward transfer learning in RL, particularly through combining state and action abstraction. It enables effective knowledge transfer across tasks using outcome-predictive representations and reusable skills.

Abstract: A key challenge in scaling up Reinforcement Learning is generalizing learned behaviour. Without the ability to carry forward acquired knowledge an agent is doomed to learn each task from scratch. In this paper we develop a new formalism for transfer by virtue of state abstraction. Based on task-independent, compact observations (outcomes) of the environment, we introduce Outcome-Predictive State Representations (OPSRs), agent-centered and task-independent abstractions that are made up of predictions of outcomes. We show formally and empirically that they have the potential for optimal but limited transfer, then overcome this trade-off by introducing OPSR-based skills, i.e. abstract actions (based on options) that can be reused between tasks as a result of state abstraction. In a series of empirical studies, we learn OPSR-based skills from demonstrations and show how they speed up learning considerably in entirely new and unseen tasks without any pre-processing. We believe that the framework introduced in this work is a promising step towards transfer in RL in general, and towards transfer through combining state and action abstraction specifically.

[505] ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations

Ricardo Knauer, Andre Beinrucker, Erik Rodner

Main category: cs.LG

TL;DR: ConceptTracer is an interactive tool for analyzing neural representations through human-interpretable concepts using information-theoretic measures of concept saliency and selectivity, demonstrated on TabPFN tabular foundation model.

Motivation: Despite growing interest in mechanistic interpretability, there are limited tools for systematically exploring representations learned by neural networks, particularly for tabular foundation models. The opacity of neural network decision-making processes motivates the need for better interpretability tools.

Method: ConceptTracer integrates two information-theoretic measures that quantify concept saliency and selectivity to identify neurons that respond strongly to individual concepts. It’s an interactive application designed for analyzing neural representations through human-interpretable concepts.

Result: The utility is demonstrated on representations learned by TabPFN, showing that the approach facilitates the discovery of interpretable neurons. The tool provides a practical framework for investigating how neural networks encode concept-level information.

Conclusion: ConceptTracer offers a practical framework for exploring how neural networks like TabPFN encode concept-level information, addressing the need for better interpretability tools in neural network analysis.

Abstract: Neural networks deliver impressive predictive performance across a variety of tasks, but they are often opaque in their decision-making processes. Despite a growing interest in mechanistic interpretability, tools for systematically exploring the representations learned by neural networks in general, and tabular foundation models in particular, remain limited. In this work, we introduce ConceptTracer, an interactive application for analyzing neural representations through the lens of human-interpretable concepts. ConceptTracer integrates two information-theoretic measures that quantify concept saliency and selectivity, enabling researchers and practitioners to identify neurons that respond strongly to individual concepts. We demonstrate the utility of ConceptTracer on representations learned by TabPFN and show that our approach facilitates the discovery of interpretable neurons. Together, these capabilities provide a practical framework for investigating how neural networks like TabPFN encode concept-level information. ConceptTracer is available at https://github.com/ml-lab-htw/concept-tracer.
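One plausible instantiation of an information-theoretic concept-saliency score is the mutual information between a binarized neuron activation and a binary concept label; the paper's exact two measures may differ, so treat this as a hypothetical illustration of the kind of quantity ConceptTracer computes per neuron:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        mi += p * math.log2(p * n * n / (px[x] * py[y]))
    return mi

# Hypothetical saliency: MI between binarized activations and a concept.
concept        = [1, 1, 1, 1, 0, 0, 0, 0]
aligned_neuron = [1, 1, 1, 1, 0, 0, 0, 0]  # fires exactly on the concept
random_neuron  = [1, 0, 1, 0, 1, 0, 1, 0]  # uncorrelated with the concept
saliency_aligned = mutual_information(aligned_neuron, concept)
saliency_random = mutual_information(random_neuron, concept)
```

A perfectly concept-aligned binary neuron scores 1 bit while an uncorrelated one scores 0, which is the contrast a tool like this surfaces interactively.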

[506] Learning to Query History: Nonstationary Classification via Learned Retrieval

Jimmy Gammell, Bishal Thapaliya, Yoon Jung, Riyasat Ohib, Bilel Fehri, Deepayan Chakrabarti

Main category: cs.LG

TL;DR: A method for nonstationary classification using time series prediction with learned discrete retrieval of historical examples, enabling robust performance against distribution shift.

Motivation: Nonstationarity in real-world classification causes deployed models to fail despite good holdout performance, as they can't adapt to changing distributions over time.

Method: Reframe classification as time series prediction by conditioning on sequences of historical labeled examples. Use learned discrete retrieval mechanism with input-dependent queries to sample relevant historical examples, trained end-to-end with classifier using score-based gradient estimator. Allows historical data to remain on filesystem during training/deployment.

Result: Improved robustness to distribution shift on synthetic benchmarks and Amazon Reviews ‘23 (electronics category) compared to standard classifiers, with predictable VRAM scaling as historical sequence length increases.

Conclusion: The approach effectively addresses nonstationary classification by leveraging historical context through efficient retrieval mechanisms, providing practical deployment advantages with predictable resource usage.

Abstract: Nonstationarity is ubiquitous in practical classification settings, leading deployed models to perform poorly even when they generalize well to holdout sets available at training time. We address this by reframing nonstationary classification as time series prediction: rather than predicting from the current input alone, we condition the classifier on a sequence of historical labeled examples that extends beyond the training cutoff. To scale to large sequences, we introduce a learned discrete retrieval mechanism that samples relevant historical examples via input-dependent queries, trained end-to-end with the classifier using a score-based gradient estimator. This enables the full corpus of historical data to remain on an arbitrary filesystem during training and deployment. Experiments on synthetic benchmarks and Amazon Reviews ‘23 (electronics category) show improved robustness to distribution shift compared to standard classifiers, with VRAM scaling predictably as the length of the historical data sequence increases.
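The score-function (REINFORCE-style) gradient used to train the discrete retrieval has a simple closed form for a softmax query distribution: grad_i = p_i (L_i - E[L]). The sketch below sums the expectation exactly and checks it against a finite difference; in the paper's setting the expectation is instead estimated from sampled retrievals:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_loss(logits, losses):
    return sum(p * L for p, L in zip(softmax(logits), losses))

def score_function_grad(logits, losses):
    """Exact expectation of the score-function estimator
    E_a[grad log p(a) * L(a)] for a softmax over retrieval candidates:
    grad_i = p_i * (L_i - E[L])."""
    p = softmax(logits)
    mean = sum(pi * Li for pi, Li in zip(p, losses))
    return [pi * (Li - mean) for pi, Li in zip(p, losses)]

logits = [0.2, -0.4, 0.1]
losses = [1.0, 0.1, 0.5]  # hypothetical downstream loss per retrieved item
grad = score_function_grad(logits, losses)

# Check against a central finite difference of the expected loss.
eps = 1e-6
fd = []
for i in range(len(logits)):
    lp, lm = logits[:], logits[:]
    lp[i] += eps
    lm[i] -= eps
    fd.append((expected_loss(lp, losses) - expected_loss(lm, losses)) / (2 * eps))
```

Because the estimator only needs the sampled index and its loss, the historical corpus itself never has to be differentiable or memory-resident, which is what lets it stay on the filesystem.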

[507] MoE Routing Testbed: Studying Expert Specialization and Routing Behavior at Small Scale

Tobias Falke, Nicolas Anastassacos, Samson Tan, Chankrisna Richy Meas, Chandana Satya Prakash, Nitesh Sekhar, M Saiful Bari, Krishna Kompella, Gamaleldin F. Elsayed

Main category: cs.LG

TL;DR: Proposes MoE Routing Testbed for evaluating routing dynamics in sparse Mixture-of-Experts models using domain-specific data and reference router to measure expert specialization.

Motivation: Sparse MoE architectures are popular for large language models but introduce training challenges due to routing complexity. Current methods lack established metrics for assessing expert specialization, and small-scale performance often doesn't reflect large-scale behavior.

Method: Proposes MoE Routing Testbed with data mix containing clearly distinguishable domains paired with a reference router that prescribes ideal routing based on these domains. This provides a well-defined upper bound for comparison and enables quantifiable measurement of expert specialization.

Result: Testbed demonstrates that balancing scope is the crucial factor allowing specialization while maintaining high expert utilization. Observations generalize to models 35x larger than testbed scale.

Conclusion: The MoE Routing Testbed provides clearer visibility into routing dynamics at small scale using realistic data, enabling better evaluation and development of MoE routing techniques for large language models.

Abstract: Sparse Mixture-of-Experts (MoE) architectures are increasingly popular for frontier large language models (LLM) but they introduce training challenges due to routing complexity. Fully leveraging parameters of an MoE model requires all experts to be well-trained and to specialize in non-redundant ways. Assessing this, however, is complicated due to lack of established metrics and, importantly, many routing techniques exhibit similar performance at smaller sizes, which is often not reflective of their behavior at large scale. To address this challenge, we propose the MoE Routing Testbed, a setup that gives clearer visibility into routing dynamics at small scale while using realistic data. The testbed pairs a data mix with clearly distinguishable domains with a reference router that prescribes ideal routing based on these domains, providing a well-defined upper bound for comparison. This enables quantifiable measurement of expert specialization. To demonstrate the value of the testbed, we compare various MoE routing approaches and show that balancing scope is the crucial factor that allows specialization while maintaining high expert utilization. We confirm that this observation generalizes to models 35x larger.
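With a reference router in hand, a specialization metric can be as simple as agreement with it: the fraction of tokens routed to the expert prescribed for their domain. This is a hypothetical stand-in for the testbed's actual metric, with made-up domains and routing:

```python
def routing_agreement(chosen_experts, domains, reference):
    """Fraction of tokens a learned router sends to the expert the
    reference router prescribes for their domain. 1.0 means perfect
    specialization with respect to the domain labels, which is the
    well-defined upper bound the testbed provides."""
    hits = sum(1 for e, d in zip(chosen_experts, domains) if e == reference[d])
    return hits / len(chosen_experts)

reference = {"code": 0, "math": 1, "news": 2}  # ideal domain -> expert map
domains = ["code", "math", "news", "code", "math", "news"]
learned = [0, 1, 2, 0, 2, 2]                   # one math token mis-routed
agreement = routing_agreement(learned, domains, reference)
```

Here five of six tokens agree with the reference, so agreement is 5/6; tracking this alongside per-expert load makes the specialization-versus-utilization trade-off directly measurable.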

[508] AdaBoost Does Not Always Cycle: A Computer-Assisted Counterexample

Erik Y. Wang

Main category: cs.LG

TL;DR: Computer-assisted counterexample to Rudin et al.’s open question about AdaBoost convergence, showing exhaustive AdaBoost doesn’t always converge to a finite cycle, using block-product gadget with irrational eigenvalue ratios.

Motivation: Address the open question posed by Rudin, Schapire, and Daubechies in COLT 2012 about whether exhaustive AdaBoost always converges to a finite cycle, which has remained unresolved for over a decade.

Method: Constructs a block-product gadget where two factors share an exact period-2 orbit for their 5-step branch maps, but have linearized return maps with dominant eigenvalues whose logarithmic ratio is irrational, forcing burst-winner sequences to have irrational asymptotic frequencies.

Result: Provides a certified counterexample showing exhaustive AdaBoost does not always converge to a finite cycle, with all assertions verified by exact rational arithmetic.

Conclusion: Resolves the long-standing open question by demonstrating AdaBoost can exhibit non-periodic asymptotic behavior, with the work developed in collaboration with AI systems (GPT-5.4 Pro and Claude Opus 4.6).

Abstract: We give a computer-assisted counterexample to the open question, posed by Rudin, Schapire, and Daubechies in COLT 2012, of whether exhaustive AdaBoost always converges to a finite cycle. The construction is based on a block-product gadget whose two factors share an exact period-2 orbit for their 5-step branch maps, but whose linearized return maps have dominant eigenvalues with an irrational logarithmic ratio. This irrationality forces the burst-winner sequence to have an irrational asymptotic frequency, precluding eventual periodicity. All assertions are certified by exact rational arithmetic. This work was developed in collaboration with GPT-5.4 Pro and Claude Opus 4.6.
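The flavor of "certified by exact rational arithmetic" can be illustrated with fractions.Fraction: iterate a rational map and check exact (not floating-point) equality along the orbit. The map below is a toy stand-in, not the paper's 5-step branch map:

```python
from fractions import Fraction

def certify_period(f, x0, period):
    """Certify that x0 lies on an exact periodic orbit of f using exact
    rational arithmetic, so the equality check carries no rounding error.
    The paper certifies its (much more intricate) construction in the
    same spirit."""
    x = x0
    orbit = [x]
    for _ in range(period):
        x = f(x)
        orbit.append(x)
    return x == x0, orbit

f = lambda x: Fraction(3, 2) / x  # f(f(x)) = x for any nonzero rational x
is_periodic, orbit = certify_period(f, Fraction(1, 2), period=2)
```

The orbit 1/2 -> 3 -> 1/2 is verified by exact equality of rationals; floating-point iteration could only report approximate closure, which would not constitute a proof.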

[509] Production-Ready Automated ECU Calibration using Residual Reinforcement Learning

Andreas Kampmeier, Kevin Badalian, Lucas Koch, Sung-Yong Lee, Jakob Andert

Main category: cs.LG

TL;DR: Residual reinforcement learning approach for automated calibration of automotive control functions that maintains explainability while improving efficiency

Motivation: Traditional manual calibration of automotive ECU parameters is becoming impractical due to increasing vehicle variants, stricter regulations, shorter development cycles, and rising customer expectations. While RL can automate calibration, neural network-based solutions lack explainability needed for production vehicles.

Method: Proposes an explainable approach using residual reinforcement learning that follows established automotive development principles. The method starts with a sub-optimal calibration map and uses RL to learn residual corrections, maintaining the interpretable map-based structure while improving performance.

Result: Demonstrated on a map-based air path controller using hardware-in-the-loop platform. The approach quickly converges to a calibration closely resembling reference series ECU calibrations, achieving better calibrations in significantly less time with minimal human intervention.

Conclusion: The residual RL approach provides an industry-suitable solution that maintains explainability while automating calibration, leading to better results faster than traditional manual methods.

Abstract: Electronic Control Units (ECUs) have played a pivotal role in transforming motorcars of yore into the modern vehicles we see on our roads today. They actively regulate the actuation of individual components and thus determine the characteristics of the whole system. In this, the behavior of the control functions heavily depends on their calibration parameters which engineers traditionally design by hand. This is taking place in an environment of rising customer expectations and steadily shorter product development cycles. At the same time, legislative requirements are increasing while emission standards are getting stricter. Considering the number of vehicle variants on top of all that, the conventional method is losing its practical and financial viability. Prior work has already demonstrated that optimal control functions can be automatically developed with reinforcement learning (RL); since the resulting functions are represented by artificial neural networks, they lack explainability, a circumstance which renders them challenging to employ in production vehicles. In this article, we present an explainable approach to automating the calibration process using residual RL which follows established automotive development principles. Its applicability is demonstrated by means of a map-based air path controller in a series control unit using a hardware-in-the-loop (HiL) platform. Starting with a sub-optimal map, the proposed methodology quickly converges to a calibration which closely resembles the reference in the series ECU. The results prove that the approach is suitable for the industry where it leads to better calibrations in significantly less time and requires virtually no human intervention.
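The residual structure, an interpretable base calibration map plus a bounded learned correction, can be sketched as follows. The map keys, values, and clipping bound are illustrative assumptions, not the paper's air path controller:

```python
def calibrated_output(base_map, residual_map, key):
    """Final calibration = interpretable base map + bounded learned residual.
    Keeping the map structure (rather than replacing it with a neural net)
    is what preserves explainability; the +/-0.2 clipping bound is a
    hypothetical safety constraint on the RL correction."""
    residual = max(-0.2, min(0.2, residual_map.get(key, 0.0)))
    return base_map[key] + residual

base = {(1000, 20): 0.50, (2000, 40): 0.65}  # (rpm, load) -> actuator setpoint
learned_residual = {(1000, 20): 0.07, (2000, 40): 0.35}  # RL-learned corrections
out1 = calibrated_output(base, learned_residual, (1000, 20))
out2 = calibrated_output(base, learned_residual, (2000, 40))  # clipped to 0.2
```

Engineers can still read off, per operating point, both the hand-designed baseline and exactly how far the learned correction moved it, which is the property that makes the result deployable in a series ECU workflow.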

[510] Epistemic Robust Offline Reinforcement Learning

Abhilash Reddy Chenreddy, Erick Delage

Main category: cs.LG

TL;DR: A framework for offline RL that replaces ensembles with compact uncertainty sets over Q-values to better handle epistemic uncertainty from limited data coverage, with improved robustness over ensemble methods.

DetailsMotivation: Offline RL faces epistemic uncertainty from limited or biased data coverage, especially when behavior policies avoid certain actions, leading to inaccurate value estimates. Ensemble methods like SAC-N have limitations: they require large ensembles and conflate epistemic with aleatoric uncertainty.

Method: Proposes a unified framework replacing discrete ensembles with compact uncertainty sets over Q-values. Introduces an Epinet-based model that directly shapes uncertainty sets to optimize cumulative reward under robust Bellman objective without relying on ensembles.

Result: Method achieves improved robustness and generalization over ensemble-based baselines across both tabular and continuous state domains. Also introduces a benchmark for evaluating offline RL algorithms under risk-sensitive behavior policies.

Conclusion: The proposed uncertainty set framework provides a more effective approach to handling epistemic uncertainty in offline RL compared to ensemble methods, with better generalization and robustness.

Abstract: Offline reinforcement learning learns policies from fixed datasets without further environment interaction. A key challenge in this setting is epistemic uncertainty, arising from limited or biased data coverage, particularly when the behavior policy systematically avoids certain actions. This can lead to inaccurate value estimates and unreliable generalization. Ensemble-based methods like SAC-N mitigate this by conservatively estimating Q-values using the ensemble minimum, but they require large ensembles and often conflate epistemic with aleatoric uncertainty. To address these limitations, we propose a unified and generalizable framework that replaces discrete ensembles with compact uncertainty sets over Q-values. We further introduce an Epinet-based model that directly shapes the uncertainty sets to optimize the cumulative reward under the robust Bellman objective without relying on ensembles. We also introduce a benchmark for evaluating offline RL algorithms under risk-sensitive behavior policies, and demonstrate that our method achieves improved robustness and generalization over ensemble-based baselines across both tabular and continuous state domains.
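The pessimistic-backup idea can be illustrated in tabular form: keep an interval [q_lo, q_hi] per state-action pair and bootstrap from the lower bound instead of an ensemble minimum. A hypothetical sketch only; the paper's Epinet-based construction is considerably more involved:

```python
def robust_backup(q_lo, q_hi, s, a, r, s_next, gamma=0.9, lr=0.5):
    """One pessimistic Bellman update on an interval [q_lo, q_hi] per (s, a).
    The gap q_hi - q_lo plays the role of the epistemic uncertainty set."""
    # Act greedily with respect to the worst-case (lower) estimate.
    a_star = max(q_lo[s_next], key=q_lo[s_next].get)
    target = r + gamma * q_lo[s_next][a_star]
    # Move both bounds toward the conservative target.
    q_lo[s][a] += lr * (target - q_lo[s][a])
    q_hi[s][a] += lr * (target - q_hi[s][a])
    return target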

[511] Mining Electronic Health Records to Investigate Effectiveness of Ensemble Deep Clustering

Manar D. Samad, Yina Hou, Shrabani Ghosh

Main category: cs.LG

TL;DR: An ensemble-based deep clustering approach for EHR data that combines traditional and deep learning methods, aggregating cluster assignments from multiple embedding dimensions to improve heart failure patient subtype identification.

DetailsMotivation: Traditional clustering methods (especially K-means) in healthcare informatics have limited success when applied to autoencoder embeddings, while deep learning approaches are designed for image data rather than tabular EHR data, creating a need for better methods to cluster patients and distinguish disease subtypes.

Method: Proposes an ensemble-based deep clustering approach that aggregates cluster assignments from multiple embedding dimensions rather than relying on a single fixed embedding space. Combines traditional clustering with deep clustering in a novel ensemble framework.

Result: The ensemble embedding for deep clustering achieves the best overall performance ranking across 14 diverse clustering methods and multiple patient cohorts. The approach demonstrates the importance of biological sex-specific clustering of EHR data.

Conclusion: Combining traditional and deep clustering approaches outperforms single methods, and biological sex-specific clustering is important for EHR data analysis. Ensemble methods that aggregate across multiple embedding dimensions are effective for patient clustering tasks.

Abstract: In electronic health records (EHRs), clustering patients and distinguishing disease subtypes are key tasks to elucidate pathophysiology and aid clinical decision-making. However, clustering in healthcare informatics is still based on traditional methods, especially K-means, and has achieved limited success when applied to embedding representations learned by autoencoders as hybrid methods. This paper investigates the effectiveness of traditional, hybrid, and deep learning methods in heart failure patient cohorts using real EHR data from the All of Us Research Program. Traditional clustering methods perform robustly because deep learning approaches are specifically designed for image clustering, a task that differs substantially from the tabular EHR data setting. To address the shortcomings of deep clustering, we introduce an ensemble-based deep clustering approach that aggregates cluster assignments obtained from multiple embedding dimensions, rather than relying on a single fixed embedding space. When combined with traditional clustering in a novel ensemble framework, the proposed ensemble embedding for deep clustering delivers the best overall performance ranking across 14 diverse clustering methods and multiple patient cohorts. This paper underscores the importance of biological sex-specific clustering of EHR data and the advantages of combining traditional and deep clustering approaches over a single method.
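Aggregating cluster assignments from multiple embedding dimensions is commonly done with a co-association matrix; the sketch below links two points if they co-cluster in a majority of runs and reads off connected components. This is an illustrative stand-in, not the paper's exact ensemble rule:

```python
def consensus_clusters(assignments, threshold=0.5):
    """Ensemble clustering via a co-association matrix: link two points if they
    share a cluster in at least `threshold` of the runs, then take components."""
    n, m = len(assignments[0]), len(assignments)
    co = [[sum(lab[i] == lab[j] for lab in assignments) / m for j in range(n)]
          for i in range(n)]
    parent = list(range(n))          # union-find over points
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] >= threshold:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]
```

For example, three labelings `[[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]` agree on the grouping despite permuted label names, and the consensus recovers `[0, 0, 1, 1]`.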

[512] Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?

Changkun Guan, Mengfan Xu

Main category: cs.LG

TL;DR: Multi-objective bandits with vector rewards analyzed for Pareto regret; shown to be governed by maximum sub-optimality gap rather than dimensionality, with optimal algorithm achieving O(K log T / g†) regret.

DetailsMotivation: The paper addresses a fundamental question in multi-objective bandits: whether they are inherently harder than single-objective bandits due to vector rewards and Pareto ordering. Existing work suggests Pareto regret increases with dimensionality in stochastic settings, creating a controversial phenomenon that needs clarification.

Method: Develops a new algorithm with nested two-layer uncertainty quantification using upper and lower confidence bound estimators. Combines top-two racing strategy for arm selection with uncertainty-greedy rule for dimension selection to balance exploration-exploitation across layers.

Result: Shows Pareto regret in stochastic setting is governed by maximum sub-optimality gap g†, not dimensionality. Achieves optimal Pareto regret of O(K log T / g†) with proposed algorithm, validated through comprehensive numerical experiments showing significant gains over benchmarks.

Conclusion: Multi-objective bandits are not fundamentally harder than single-objective ones in stochastic settings when using appropriate regret measures. The maximum sub-optimality gap, not dimensionality, determines regret complexity, with optimal algorithm achieving matching upper and lower bounds.

Abstract: Multi-objective bandits have attracted increasing attention because of their broad applicability and mathematical elegance, where the reward of each arm is a multi-dimensional vector rather than a scalar. This naturally introduces Pareto order relations and Pareto regret. A long-standing question in this area is whether performance is fundamentally harder to optimize because of this added complexity. A recent surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; however, in the stochastic setting, where the regret notion is different, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the dimensionality. This controversial yet subtle phenomenon motivates our central question: are multi-objective bandits actually harder than single-objective ones? We answer this question in full by showing that, in the stochastic setting, Pareto regret is in fact governed by the maximum sub-optimality gap g†, and hence by the minimum marginal regret of order Ω(K log T / g†). We further develop a new algorithm that achieves Pareto regret of order O(K log T / g†), and is therefore optimal. The algorithm leverages a nested two-layer uncertainty quantification over both arms and objectives through upper and lower confidence bound estimators. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection. Together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments to validate the proposed algorithm, showing the desired regret guarantee and significant gains over benchmark methods.
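The Pareto order the abstract relies on is simple to state in code: an arm is Pareto-optimal if no other arm is at least as good in every objective and strictly better in at least one. A minimal helper:

```python
def pareto_optimal(arms):
    """Indices of arms whose mean reward vectors are not Pareto-dominated."""
    def dominates(u, v):  # u >= v componentwise and strictly better somewhere
        return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))
    return [i for i, u in enumerate(arms)
            if not any(dominates(v, u) for j, v in enumerate(arms) if j != i)]
```

For instance, among the reward vectors `(1, 0)`, `(0, 1)`, `(0.5, 0.5)`, and `(0.2, 0.2)`, only the last is dominated (by `(0.5, 0.5)`); Pareto regret then measures distance to this non-dominated set rather than to a single best arm.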

[513] Selective Neuron Amplification for Training-Free Task Enhancement

Ryyan Akhtar

Main category: cs.LG

TL;DR: Selective Neuron Amplification (SNA) is a method that boosts task-relevant neurons during inference without changing model parameters, addressing failures caused by weak activation rather than missing knowledge.

DetailsMotivation: Large language models often fail on tasks they seem to understand, which appears to be less about missing knowledge and more about certain internal circuits not being strongly activated during inference.

Method: Selective Neuron Amplification (SNA) increases the influence of task-relevant neurons without changing the model’s parameters. The method works at inference time and does not permanently alter the model.

Result: SNA helps mainly when the model is uncertain, while having low effect when the model is already confident. This suggests that some model failures are due to weak activation rather than lack of capability.

Conclusion: Model failures can often be attributed to weak activation of relevant circuits rather than missing capabilities, and SNA provides an effective inference-time intervention to address this issue.

Abstract: Large language models often fail on tasks they seem to already understand. In our experiments, this appears to be less about missing knowledge and more about certain internal circuits not being strongly activated during inference. We explore Selective Neuron Amplification (SNA), which increases the influence of task-relevant neurons without changing the model’s parameters. The method works at inference time and does not permanently alter the model. SNA helps mainly when the model is uncertain, while having little effect when the model is already confident. This suggests that some model failures are due to weak activation rather than lack of capability.

[514] Information as Structural Alignment: A Dynamical Theory of Continual Learning

Radu Negulescu

Main category: cs.LG

TL;DR: IBF is a continual learning framework that treats information as structural alignment rather than stored content, achieving near-zero forgetting without replay through self-organizing dynamics.

DetailsMotivation: Catastrophic forgetting in neural networks stems from storing knowledge as global parameter superposition. Existing methods add external mechanisms rather than deriving retention from learning dynamics themselves.

Method: Informational Buildup Framework (IBF) uses two governing equations: Law of Motion drives configuration toward higher coherence, and Modification Dynamics deforms coherence landscape in response to discrepancies. Memory and self-correction emerge from these dynamics.

Result: Achieves near-zero forgetting on CIFAR-100 (BT = -0.004), positive backward transfer in chess (+38.5 cp), and 43% less forgetting than replay in controlled domains. Chess evaluation shows +88.9 cp advantage over baselines.

Conclusion: IBF demonstrates that continual learning can be achieved through self-organizing dynamics rather than external mechanisms, offering a fundamentally different approach to knowledge retention.

Abstract: Catastrophic forgetting is not an engineering failure. It is a mathematical consequence of storing knowledge as global parameter superposition. Existing methods, such as regularization, replay, and frozen subnetworks, add external mechanisms to a shared-parameter substrate. None derives retention from the learning dynamics themselves. This paper introduces the Informational Buildup Framework (IBF), an alternative substrate for continual learning, based on the premise that information is the achievement of structural alignment rather than stored content. In IBF, two equations govern the dynamics: a Law of Motion that drives configuration toward higher coherence, and Modification Dynamics that persistently deform the coherence landscape in response to localized discrepancies. Memory, agency, and self-correction arise from these dynamics rather than being added as separate modules. We first demonstrate the full lifecycle in a transparent two-dimensional toy model, then validate across three domains: a controlled non-stationary world, chess evaluated independently by Stockfish, and Split-CIFAR-100 with a frozen ViT encoder. Across all three, IBF achieves replay-superior retention without storing raw data. We observe near-zero forgetting on CIFAR-100 (BT = -0.004), positive backward transfer in chess (+38.5 cp), and 43% less forgetting than replay in the controlled domain. In chess, the framework achieves a mean behavioral advantage of +88.9 +/- 2.8 cp under independent evaluation, exceeding MLP and replay baselines.

[515] Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees

Marek Gagolewski

Main category: cs.LG

TL;DR: Lumbermark is a divisive clustering algorithm that detects clusters of varying sizes, densities, and shapes by iteratively cutting limbs from a dataset’s mutual reachability minimum spanning tree, serving as an alternative to HDBSCAN with user-specified partition sizes.

DetailsMotivation: The paper aims to address limitations in existing clustering methods by developing a robust algorithm that can handle clusters with diverse characteristics (varying sizes, densities, shapes) while being less sensitive to noise and outliers, providing an alternative to HDBSCAN with more control over partition sizes.

Method: Lumbermark uses a divisive clustering approach based on mutual reachability distances to construct a minimum spanning tree. It iteratively chops off large limbs connected by protruding segments of this tree, where mutual reachability distances help smooth the data distribution and reduce the influence of low-density objects like noise and outliers.

Result: The algorithm performs well on benchmark data and has been implemented in an open-source ’lumbermark’ package for Python and R, providing a fast and easy-to-use tool for data scientists and practitioners.

Conclusion: Lumbermark offers a robust alternative to HDBSCAN for clustering diverse datasets, with the advantage of producing partitions with user-specified sizes and being less sensitive to noise and outliers, making it potentially useful across various application domains.

Abstract: We introduce Lumbermark, a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. Lumbermark iteratively chops off large limbs connected by protruding segments of a dataset’s mutual reachability minimum spanning tree. The use of mutual reachability distances smooths the data distribution and decreases the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm can be viewed as an alternative to HDBSCAN that produces partitions with user-specified sizes. A fast, easy-to-use implementation of the new method is available in the open-source ’lumbermark’ package for Python and R. We show that Lumbermark performs well on benchmark data and hope it will prove useful to data scientists and practitioners across different fields.
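Mutual reachability is a standard construction (also used by HDBSCAN): the distance between two points is inflated to at least each point's k-th nearest-neighbour ("core") distance, which damps links through low-density regions. A small sketch, with k and the point set purely illustrative:

```python
import math

def mutual_reachability(points, k=2):
    """Pairwise mutual reachability: max(core_k(a), core_k(b), d(a, b)), where
    core_k is the distance from a point to its k-th nearest neighbour."""
    d = [[math.dist(p, q) for q in points] for p in points]
    core = [sorted(row)[k] for row in d]  # row[0] is the point itself
    n = len(points)
    return [[0.0 if i == j else max(core[i], core[j], d[i][j])
             for j in range(n)] for i in range(n)]
```

A minimum spanning tree built on this matrix (rather than on raw distances) is the structure whose limbs Lumbermark chops.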

[516] Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing

Ning Yang, Chuangxin Cheng, Haijun Zhang

Main category: cs.LG

TL;DR: COMLLM is a generative framework using LLMs with Group Relative Policy Optimization and Look-Ahead Collaborative Simulation for foresighted task offloading in Mobile Edge Computing, achieving near-optimal latency and zero-shot topological scalability.

DetailsMotivation: Mobile Edge Computing faces challenges in task offloading due to dynamic environments and spatio-temporal coupling of server queues. Existing methods like DRL lack generalization and require retraining for topology changes, while LLM-based SFT yields myopic policies that don't account for long-term system evolution.

Method: COMLLM integrates Group Relative Policy Optimization (GRPO) with a Look-Ahead Collaborative Simulation (LACS) mechanism that performs multi-step Monte Carlo rollouts while jointly modeling server queue dynamics. This captures long-term impacts of decisions on future system states.

Result: COMLLM achieves near-optimal latency and improved load-balancing fairness. It exhibits zero-shot topological scalability, allowing models trained on small-scale networks to generalize to larger, unseen topologies without retraining, outperforming SFT, DRL, and heuristic baselines.

Conclusion: COMLLM demonstrates that LLMs can be effectively adapted for foresighted decision-making in dynamic MEC systems through innovative training mechanisms that capture long-term system evolution, enabling both performance improvements and topological generalization.

Abstract: Emerging computation-intensive applications impose stringent latency requirements on resource-constrained mobile devices. Mobile Edge Computing (MEC) addresses this challenge through task offloading. However, designing effective policies remains difficult due to dynamic task arrivals, time-varying channels, and the spatio-temporal coupling of server queues. Conventional heuristics lack adaptability, while Deep Reinforcement Learning (DRL) suffers from limited generalization and architectural rigidity, requiring retraining when network topology changes. Although Large Language Models (LLMs) offer semantic reasoning capabilities, standard Supervised Fine-Tuning (SFT) yields myopic policies that greedily minimize immediate latency without accounting for long-term system evolution. To address these limitations, we propose COMLLM, a generative framework that enables foresighted decision-making in MEC systems. COMLLM integrates Group Relative Policy Optimization (GRPO) with a Look-Ahead Collaborative Simulation (LACS) mechanism, which performs multi-step Monte Carlo rollouts while jointly modeling server queue dynamics. By incorporating these rollouts into the reward design, the framework captures the long-term impact of current decisions on future system states. Experimental results demonstrate that COMLLM achieves near-optimal latency and improved load-balancing fairness. Notably, it exhibits zero-shot topological scalability, allowing a model trained on small-scale networks to generalize to larger, unseen topologies without retraining, outperforming SFT, DRL, and heuristic baselines.
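The look-ahead idea can be previewed with a toy single-queue model: roll the target server's queue forward a few steps under random arrivals and score the offloading decision by negative accumulated latency. This sketch assumes a made-up arrival/service model, not the paper's LACS simulator:

```python
import random

def lookahead_reward(queue_len, service_rate, arrivals, steps=5, rollouts=100, seed=0):
    """Monte Carlo look-ahead for one offloading decision: simulate the target
    server's queue for a few steps and return negative expected latency."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(rollouts):
        q = queue_len
        for _ in range(steps):
            q = max(0, q + rng.choice(arrivals) - service_rate)
            total += q / service_rate  # waiting time accrued this step
    return -total / rollouts
```

Scoring candidate servers by such multi-step rollouts, instead of by their instantaneous queue length, is what makes the resulting policy foresighted rather than myopic.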

[517] SBBTS: A Unified Schrödinger-Bass Framework for Synthetic Financial Time Series

Alexandre Alouadi, Grégoire Loeper, Célian Marsala, Othmane Mazhar, Huyên Pham

Main category: cs.LG

TL;DR: SBBTS is a novel time series generation method that jointly models drift and stochastic volatility using Schrödinger-Bass bridges, outperforming existing approaches in financial applications.

DetailsMotivation: Existing time series generation methods fail to jointly model both drift and stochastic volatility - diffusion methods fix volatility while martingale transport models ignore drift, creating a gap for realistic financial time series synthesis.

Method: Introduces Schrödinger-Bass Bridge for Time Series (SBBTS), extending Schrödinger-Bass formulation to multi-step time series. Constructs diffusion process that jointly calibrates drift and volatility with tractable decomposition into conditional transport problems for efficient learning.

Result: SBBTS accurately recovers stochastic volatility and correlation parameters that prior methods fail to capture (tested on Heston model). Applied to S&P 500 data, SBBTS-generated synthetic time series improve downstream forecasting performance in data augmentation, yielding higher classification accuracy and Sharpe ratio.

Conclusion: SBBTS provides a practical and effective framework for realistic time series generation and data augmentation in financial applications by successfully addressing the joint modeling of drift and stochastic volatility.

Abstract: We study the problem of generating synthetic time series that reproduce both marginal distributions and temporal dynamics, a central challenge in financial machine learning. Existing approaches typically fail to jointly model drift and stochastic volatility, as diffusion-based methods fix the volatility while martingale transport models ignore drift. We introduce the Schrödinger-Bass Bridge for Time Series (SBBTS), a unified framework that extends the Schrödinger-Bass formulation to multi-step time series. The method constructs a diffusion process that jointly calibrates drift and volatility and admits a tractable decomposition into conditional transport problems, enabling efficient learning. Numerical experiments on the Heston model demonstrate that SBBTS accurately recovers stochastic volatility and correlation parameters that prior Schrödinger-Bridge methods fail to capture. Applied to S&P 500 data, SBBTS-generated synthetic time series consistently improve downstream forecasting performance when used for data augmentation, yielding higher classification accuracy and Sharpe ratio compared to real-data-only training. These results show that SBBTS provides a practical and effective framework for realistic time series generation and data augmentation in financial applications.

[518] Smart Commander: A Hierarchical Reinforcement Learning Framework for Fleet-Level PHM Decision Optimization

Yong Si, Mingfei Lu, Jing Li, Yang Hu, Guijiang Li, Yueheng Song, Zhaokui Wang

Main category: cs.LG

TL;DR: Smart Commander: Hierarchical Reinforcement Learning framework for military aviation PHM that decomposes fleet management into strategic and tactical layers to optimize maintenance and logistics decisions.

DetailsMotivation: Address challenges in military aviation Prognostics and Health Management (PHM) including the "curse of dimensionality" in large-scale fleet operations, sparse feedback, and stochastic mission profiles that make decision-making difficult.

Method: Proposes a two-tier Hierarchical Reinforcement Learning (HRL) framework: strategic General Commander manages fleet-level availability and cost objectives, while tactical Operation Commanders execute specific actions for sortie generation, maintenance scheduling, and resource allocation. Uses layered reward shaping with planning-enhanced neural networks in a custom-built discrete-event simulation environment.

Result: Smart Commander significantly outperforms conventional monolithic Deep Reinforcement Learning (DRL) and rule-based baselines, achieves substantial reduction in training time, and demonstrates superior scalability and robustness in failure-prone environments.

Conclusion: HRL shows potential as a reliable paradigm for next-generation intelligent fleet management in military aviation PHM, effectively addressing sparse/delayed rewards and scalability challenges.

Abstract: Decision-making in military aviation Prognostics and Health Management (PHM) faces significant challenges due to the “curse of dimensionality” in large-scale fleet operations, combined with sparse feedback and stochastic mission profiles. To address these issues, this paper proposes Smart Commander, a novel Hierarchical Reinforcement Learning (HRL) framework designed to optimize sequential maintenance and logistics decisions. The framework decomposes the complex control problem into a two-tier hierarchy: a strategic General Commander manages fleet-level availability and cost objectives, while tactical Operation Commanders execute specific actions for sortie generation, maintenance scheduling, and resource allocation. The proposed approach is validated within a custom-built, high-fidelity discrete-event simulation environment that captures the dynamics of aircraft configuration and support logistics. By integrating layered reward shaping with planning-enhanced neural networks, the method effectively addresses the difficulty of sparse and delayed rewards. Empirical evaluations demonstrate that Smart Commander significantly outperforms conventional monolithic Deep Reinforcement Learning (DRL) and rule-based baselines. Notably, it achieves a substantial reduction in training time while demonstrating superior scalability and robustness in failure-prone environments. These results highlight the potential of HRL as a reliable paradigm for next-generation intelligent fleet management.

[519] Improving Semantic Uncertainty Quantification in Language Model Question-Answering via Token-Level Temperature Scaling

Tom A. Lamb, Desi R. Ivanova, Philip H. S. Torr, Tim G. J. Rudner

Main category: cs.LG

TL;DR: Temperature scaling improves semantic uncertainty calibration and discrimination for question-answering tasks, outperforming heuristic baselines and token-level methods.

DetailsMotivation: Prior work has focused on discrimination while neglecting calibration in semantic uncertainty quantification, creating an incomplete picture of uncertainty quality. There's a need to systematically evaluate both aspects across confidence measures.

Method: Optimizing a single scalar temperature parameter as an inductive bias for semantic confidence distributions. This temperature scaling approach is compared against fixed-temperature heuristics and more expressive token-level recalibration methods.

Result: Temperature scaling consistently improves semantic calibration, discrimination, and downstream entropy, outperforming both heuristic baselines and token-level recalibration methods on question-answering tasks.

Conclusion: Simple temperature scaling is surprisingly effective for semantic uncertainty calibration, addressing systematic miscalibration in current approaches while maintaining good discrimination.

Abstract: Calibration is central to reliable semantic uncertainty quantification, yet prior work has largely focused on discrimination, neglecting calibration. As calibration and discrimination capture distinct aspects of uncertainty, focusing on discrimination alone yields an incomplete picture. We address this gap by systematically evaluating both aspects across a broad set of confidence measures. We show that current approaches, particularly fixed-temperature heuristics, produce systematically miscalibrated and poorly discriminative semantic confidence distributions. We demonstrate that optimising a single scalar temperature, which, we argue, provides a suitable inductive bias, is a surprisingly simple yet effective solution. Our exhaustive evaluation confirms that temperature scaling consistently improves semantic calibration, discrimination, and downstream entropy, outperforming both heuristic baselines and more expressive token-level recalibration methods on question-answering tasks.
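Temperature scaling itself is a one-parameter recalibration: divide the logits by a scalar T before the softmax and fit T by held-out negative log-likelihood. A minimal sketch, where grid search stands in for the usual gradient-based fit:

```python
import math

def temperature_scale(logits, T):
    """Softmax of logits / T; a single scalar T is the whole recalibration."""
    z = [l / T for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def fit_temperature(logit_sets, labels, grid=None):
    """Choose T minimising held-out negative log-likelihood (grid search for
    clarity; a gradient-based fit is the usual choice)."""
    grid = grid or [0.5 + 0.1 * i for i in range(30)]
    def nll(T):
        return -sum(math.log(temperature_scale(ls, T)[y])
                    for ls, y in zip(logit_sets, labels))
    return min(grid, key=nll)
```

T > 1 softens overconfident distributions and T < 1 sharpens underconfident ones; the paper's point is that this single learned scalar already improves semantic calibration where fixed-temperature heuristics fall short.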

[520] Mixture Proportion Estimation and Weakly-supervised Kernel Test for Conditional Independence

Yushi Hirose, Akito Narahara, Takafumi Kanamori

Main category: cs.LG

TL;DR: Novel mixture proportion estimation method using conditional independence assumptions instead of irreducibility, with moment estimators and kernel tests for validation.

DetailsMotivation: Existing mixture proportion estimation methods rely on irreducibility assumptions which may not always hold, limiting their applicability. The authors aim to develop more flexible identifiability conditions using conditional independence assumptions.

Method: Proposes novel conditional independence assumptions for identifiability, develops method of moments estimators under these assumptions, and creates weakly-supervised kernel tests to validate the CI assumptions.

Result: The proposed estimators outperform existing methods, and the kernel tests successfully control both type I and type II errors in validation.

Conclusion: Conditional independence assumptions provide a viable alternative to irreducibility for mixture proportion estimation, enabling identifiability in broader scenarios with practical validation methods.

Abstract: Mixture proportion estimation (MPE) aims to estimate class priors from unlabeled data. This task is a critical component in weakly supervised learning, such as PU learning, learning with label noise, and domain adaptation. Existing MPE methods rely on the irreducibility assumption or its variant for identifiability. In this paper, we propose novel assumptions based on conditional independence (CI) given the class label, which ensure identifiability even when irreducibility does not hold. We develop method of moments estimators under these assumptions and analyze their asymptotic properties. Furthermore, we present weakly-supervised kernel tests to validate the CI assumptions, which are of independent interest in applications such as causal discovery and fairness evaluation. Empirically, we demonstrate the improved performance of our estimators compared with existing methods and that our tests successfully control both type I and type II errors.
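For intuition, the one-dimensional method-of-moments identity underlying MPE: if the mixture mean is a convex combination of the component means, the class prior falls out of a single linear equation. This is the textbook warm-up, not the paper's CI-based estimator, and it assumes both component means are known:

```python
def mixture_proportion(mix_mean, pos_mean, neg_mean):
    """Solve mix_mean = pi * pos_mean + (1 - pi) * neg_mean for the prior pi."""
    return (mix_mean - neg_mean) / (pos_mean - neg_mean)
```

The hard part, which the paper addresses, is identifiability when such component statistics are only partially observable, e.g. in PU learning where only positive and unlabeled samples are available.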

[521] Beyond the Mean: Modelling Annotation Distributions in Continuous Affect Prediction

Kosmas Pinitas, Ilias Maglogiannis

Main category: cs.LG

TL;DR: A distribution-aware framework using Beta distribution to model annotation consensus in continuous affect prediction, capturing uncertainty and variability in emotional perception rather than collapsing to point estimates.

DetailsMotivation: Emotion annotation is subjective and variable across annotators, but current approaches collapse this diversity into point estimates (mean/median), discarding valuable information about annotator disagreement and uncertainty.

Method: Proposes Beta distribution modeling of annotation consensus where models estimate mean and standard deviation of annotation distribution, transformed into Beta parameters via moment matching, enabling recovery of higher-order distributional descriptors like skewness, kurtosis, and quantiles.

Result: Beta-based modeling produces predictive distributions that closely match empirical annotator distributions on SEWA and RECOLA datasets using multimodal features, achieving competitive performance with conventional regression approaches.

Conclusion: Highlights importance of modeling annotation uncertainty in affective computing and demonstrates potential of distribution-aware learning for subjective signal analysis.

Abstract: Emotion annotation is inherently subjective and cognitively demanding, producing signals that reflect diverse perceptions across annotators rather than a single ground truth. In continuous affect prediction, this variability is typically collapsed into point estimates such as the mean or median, discarding valuable information about annotator disagreement and uncertainty. In this work, we propose a distribution-aware framework that models annotation consensus using the Beta distribution. Instead of predicting a single affect value, models estimate the mean and standard deviation of the annotation distribution, which are transformed into valid Beta parameters through moment matching. This formulation enables the recovery of higher-order distributional descriptors, including skewness, kurtosis, and quantiles, in closed form. As a result, the model captures not only the central tendency of emotional perception but also variability, asymmetry, and uncertainty in annotator responses. We evaluate the proposed approach on the SEWA and RECOLA datasets using multimodal features. Experimental results show that Beta-based modelling produces predictive distributions that closely match the empirical annotator distributions while achieving competitive performance with conventional regression approaches. These findings highlight the importance of modelling annotation uncertainty in affective computing and demonstrate the potential of distribution-aware learning for subjective signal analysis.
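The moment matching step has a closed form: given a predicted mean and standard deviation on (0, 1), the Beta parameters follow from the identity Var = μ(1−μ)/(α+β+1). A sketch, including the closed-form skewness the abstract mentions:

```python
import math

def beta_from_moments(mean, std):
    """Moment-match Beta(alpha, beta): uses Var = mean*(1-mean)/(alpha+beta+1)."""
    var = std ** 2
    assert 0.0 < mean < 1.0 and var < mean * (1.0 - mean), "moments not Beta-feasible"
    nu = mean * (1.0 - mean) / var - 1.0  # nu = alpha + beta
    return mean * nu, (1.0 - mean) * nu

def beta_skewness(alpha, beta):
    """Closed-form skewness of Beta(alpha, beta)."""
    return (2.0 * (beta - alpha) * math.sqrt(alpha + beta + 1.0)
            / ((alpha + beta + 2.0) * math.sqrt(alpha * beta)))
```

For example, a predicted mean of 0.5 with std 0.1 maps to Beta(12, 12), a symmetric distribution with zero skewness; asymmetric annotator consensus yields unequal α and β and hence nonzero skew.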

[522] Diffusion Processes on Implicit Manifolds

Victor Kawasaki-Borruat, Clara Grotehans, Pierre Vandergheynst, Adam Gosztolai

Main category: cs.LG

TL;DR: IMDs construct diffusion processes on data manifolds using only point cloud samples, enabling manifold-aware sampling and generative modeling without explicit geometric primitives.

DetailsMotivation: High-dimensional data often lies near low-dimensional manifolds, but constructing diffusion processes on these manifolds typically requires explicit geometric knowledge (charts, projections). The paper aims to create data-driven diffusion processes that capture intrinsic manifold structure using only point cloud samples.

Method: Constructs a data-driven SDE that captures intrinsic diffusion on the underlying manifold while being defined in ambient space. Estimates the diffusion’s infinitesimal generator and its carré-du-champ from a proximity graph built from data. Uses Euler-Maruyama integration for numerical simulation.

Result: Shows that as the number of samples grows, the induced process converges in law to its smooth manifold counterpart. Provides rigorous basis for practical implementations of diffusion dynamics on data manifolds.

Conclusion: IMDs enable manifold-aware sampling, exploration, and generative modeling without access to explicit geometric primitives, opening new directions for data manifold analysis.

Abstract: High-dimensional data are often modeled as lying near a low-dimensional manifold. We study how to construct diffusion processes on this data manifold in the implicit setting. That is, using only point cloud samples and without access to charts, projections, or other geometric primitives. Our main contribution is a data-driven SDE that captures intrinsic diffusion on the underlying manifold while being defined in ambient space. The construction relies on estimating the diffusion’s infinitesimal generator and its carré-du-champ (CDC) from a proximity graph built from the data. The generator and CDC together encode the local stochastic and geometric structure of the intended diffusion. We show that, as the number of samples grows, the induced process converges in law on the space of probability paths to its smooth manifold counterpart. We call this construction Implicit Manifold-valued Diffusions (IMDs), and furthermore present a numerical simulation procedure using Euler-Maruyama integration. This gives a rigorous basis for practical implementations of diffusion dynamics on data manifolds, and opens new directions for manifold-aware sampling, exploration, and generative modeling.
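The paper simulates its data-driven SDE with Euler-Maruyama integration. A generic sketch of that integrator on a toy ambient-space drift that keeps the process near the unit circle (the drift and noise scale are illustrative assumptions, not the paper's estimated generator):

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, dt, n_steps, rng):
    """Generic Euler-Maruyama integrator for dX = b(X) dt + sigma(X) dW."""
    x = np.asarray(x0, dtype=float).copy()
    path = [x.copy()]
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x.copy())
    return np.stack(path)

# Toy ambient-space drift pulling points back toward the unit circle,
# loosely mimicking how an IMD keeps the process near the data manifold.
drift = lambda x: -x * (x @ x - 1.0)
rng = np.random.default_rng(0)
path = euler_maruyama([1.0, 0.0], drift, lambda x: 0.1, dt=0.01, n_steps=500, rng=rng)
```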

[523] How Does Machine Learning Manage Complexity?

Lance Fortnow

Main category: cs.LG

TL;DR: The paper analyzes machine learning models through computational complexity theory, showing that models producing computable distributions that minimize error against cryptographic pseudorandom generator outputs must be close to uniform distributions.

DetailsMotivation: To understand the power of machine learning models through computational complexity theory, particularly their ability to model complex systems. The paper aims to provide a theoretical framework for analyzing what distributions machine learning models can effectively learn and represent.

Method: The paper uses computational complexity theory to abstract machine learning models as producing P/poly-computable distributions with polynomially-bounded max-entropy. It focuses on computable distributions rather than just sampleable ones, and analyzes the relationship between machine learning models and cryptographic pseudorandom generators.

Result: The key result shows that if a machine learning model produces a distribution μ that minimizes error against the distribution generated by a cryptographic pseudorandom generator, then μ must be close to uniform. This demonstrates limitations in what distributions machine learning models can effectively approximate.

Conclusion: Machine learning models have inherent limitations in modeling certain complex distributions, particularly those generated by cryptographic pseudorandom generators. The computational complexity framework provides insights into the fundamental capabilities and constraints of machine learning systems.

Abstract: We provide a computational complexity lens to understand the power of machine learning models, particularly their ability to model complex systems. Machine learning models are often trained on data drawn from sampleable or more complex distributions, a far wider range of distributions than just computable ones. By focusing on computable distributions, machine learning models can better manage complexity via probability. We abstract away from specific learning mechanisms, modeling machine learning as producing P/poly-computable distributions with polynomially-bounded max-entropy. We illustrate how learning computable distributions models complexity by showing that if a machine learning model produces a distribution $\mu$ that minimizes error against the distribution generated by a cryptographic pseudorandom generator, then $\mu$ must be close to uniform.


[524] On the Price of Privacy for Language Identification and Generation

Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao

Main category: cs.LG

TL;DR: Differential privacy analysis for language models shows privacy costs are mild: approximate DP recovers non-private rates, pure DP causes only multiplicative factor degradation in exponents.

DetailsMotivation: As LLMs are increasingly trained on sensitive user data, understanding the fundamental cost of privacy in language learning becomes essential. The paper initiates the study of differentially private language identification and generation in agnostic statistical settings.

Method: Establishes algorithms and matching lower bounds that precisely quantify the cost of privacy. Analyzes both approximate (ε,δ)-DP and pure ε-DP settings for language identification and generation tasks.

Result: For approximate DP with constant ε>0, recovers non-private error rates: exp(-r(n)) for identification and exp(-Ω(n)) for generation. Under pure ε-DP, exponents degrade by multiplicative factor min{1,ε}, which is tight up to constants. Generation under pure DP achieves optimal rate matching lower bounds.

Conclusion: The cost of privacy in language learning is surprisingly mild: absent entirely under approximate DP, and exactly a min{1,ε} factor in the exponent under pure DP.

Abstract: As large language models (LLMs) are increasingly trained on sensitive user data, understanding the fundamental cost of privacy in language learning becomes essential. We initiate the study of differentially private (DP) language identification and generation in the agnostic statistical setting, establishing algorithms and matching lower bounds that precisely quantify the cost of privacy. For both tasks, approximate $(\varepsilon, \delta)$-DP with constant $\varepsilon > 0$ recovers the non-private error rates: $\exp(-r(n))$ for identification (for any $r(n) = o(n)$) and $\exp(-\Omega(n))$ for generation. Under pure $\varepsilon$-DP, the exponents degrade by a multiplicative factor of $\min\{1, \varepsilon\}$, which we show is tight up to constants. Notably, for generation under pure DP with mild assumptions, the upper bound $\exp(-\min\{1,\varepsilon\} \cdot \Omega(n))$ matches the lower bound up to some constants, establishing an optimal rate. Our results show that the cost of privacy in language learning is surprisingly mild: absent entirely under approximate DP, and exactly a $\min\{1,\varepsilon\}$ factor in the exponent under pure DP.

[525] Weaves, Wires, and Morphisms: Formalizing and Implementing the Algebra of Deep Learning

Vincent Abbott, Gioele Zardini

Main category: cs.LG

TL;DR: A categorical framework for deep learning models that formalizes broadcasting through novel axis-stride and array-broadcasted categories, enabling precise mathematical expression and manipulation of architectures.

DetailsMotivation: Current ad-hoc notation, diagrams, and pseudocode poorly handle nonlinear broadcasting and the relationship between individual components and composed models in deep learning architectures.

Method: Introduces axis-stride and array-broadcasted categories to formalize broadcasting, provides mathematical definitions translated into diagrams and data structures, with implementations in Python (pyncd) and TypeScript (tsncd).

Result: A formal mathematical framework that allows precise expression and manipulation of deep learning architectures, with features including algebraic construction, graph conversion, PyTorch compilation, and diagram rendering.

Conclusion: Lays the foundation for a systematic, formal approach to deep learning model design and analysis through categorical mathematics.

Abstract: Despite deep learning models running well-defined mathematical functions, we lack a formal mathematical framework for describing model architectures. Ad-hoc notation, diagrams, and pseudocode poorly handle nonlinear broadcasting and the relationship between individual components and composed models. This paper introduces a categorical framework for deep learning models that formalizes broadcasting through the novel axis-stride and array-broadcasted categories. This allows the mathematical function underlying architectures to be precisely expressed and manipulated in a compositional manner. These mathematical definitions are translated into human manageable diagrams and machine manageable data structures. We provide a mirrored implementation in Python (pyncd) and TypeScript (tsncd) to show the universal aspect of our framework, along with features including algebraic construction, graph conversion, PyTorch compilation and diagram rendering. This lays the foundation for a systematic, formal approach to deep learning model design and analysis.

[526] A comparative analysis of machine learning models in SHAP analysis

Justin Lin, Julia Fukuyama

Main category: cs.LG

TL;DR: SHAP analysis investigation across ML models and datasets, with novel multi-class waterfall plot generalization

DetailsMotivation: Large black-box models lack explainability, making them untrustworthy for high-stakes applications. SHAP analysis provides feature-level explanations but interpretation is model-dependent, requiring systematic investigation.

Method: Detailed investigation of SHAP analysis across various machine learning models and datasets, plus development of a novel generalization of waterfall plots for multi-classification problems.

Result: Provides insights into SHAP analysis nuances across different models and datasets, and introduces a new visualization tool for multi-class problems.

Conclusion: Systematic SHAP analysis investigation empowers analysts to better interpret model decisions, with new multi-class visualization enhancing explainable AI capabilities.

Abstract: In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex data patterns. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, an associated SHAP value quantifies the contribution of that feature to the prediction of that sample. Analysis of these SHAP values provides valuable insight into the model’s decision-making process, which can be leveraged to create data-driven solutions. The interpretation of these SHAP values, however, is model-dependent, so there does not exist a universal analysis procedure. To aid in these efforts, we present a detailed investigation of SHAP analysis across various machine learning models and data sets. In uncovering the details and nuance behind SHAP analysis, we hope to empower analysts in this less-explored territory. We also present a novel generalization of the waterfall plot to the multi-classification problem.
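For intuition about the SHAP values this paper investigates, exact Shapley values can be computed by brute force over all feature subsets when the number of features is small. This is an illustrative sketch of the underlying definition, not the SHAP library's estimators:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features are set to `baseline`.
    Enumerates all feature subsets, so only feasible for small d."""
    d = len(x)
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(others, k):
                w = factorial(k) * factorial(d - k - 1) / factorial(d)
                with_i = [x[j] if j in S or j == i else baseline[j] for j in range(d)]
                without = [x[j] if j in S else baseline[j] for j in range(d)]
                phi[i] += w * (f(with_i) - f(without))
    return phi

# For a linear model, each feature's Shapley value is its coefficient times
# its deviation from the baseline, and the values sum to f(x) - f(baseline).
phi = shapley_values(lambda v: 2 * v[0] + 3 * v[1], [1.0, 2.0], [0.0, 0.0])
```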

[527] Tracking Adaptation Time: Metrics for Temporal Distribution Shift

Lorenzo Iovine, Giacomo Ziffer, Emanuele Della Valle

Main category: cs.LG

TL;DR: Proposes three new metrics to distinguish between model adaptation failure and inherent data difficulty under temporal distribution shift, providing more interpretable analysis than average performance decline metrics.

DetailsMotivation: Existing metrics for temporal robustness only measure average performance decline but fail to capture how models adapt to evolving data, making it unclear whether accuracy drops are due to model adaptation failure or data becoming inherently more difficult.

Method: Develops three complementary metrics that separately quantify model adaptation capability versus intrinsic data difficulty under temporal distribution shift, providing a dynamic and interpretable view of model behavior.

Result: The proposed metrics reveal adaptation patterns that are hidden by traditional analysis methods, offering richer understanding of temporal robustness in evolving environments.

Conclusion: The new metrics provide a more nuanced and interpretable framework for evaluating model robustness under temporal distribution shift by distinguishing adaptation from data difficulty.

Abstract: Evaluating robustness under temporal distribution shift remains an open challenge. Existing metrics quantify the average decline in performance, but fail to capture how models adapt to evolving data. As a result, temporal degradation is often misinterpreted: when accuracy declines, it is unclear whether the model is failing to adapt or whether the data itself has become inherently more challenging to learn. In this work, we propose three complementary metrics to distinguish adaptation from intrinsic difficulty in the data. Together, these metrics provide a dynamic and interpretable view of model behavior under temporal distribution shift. Results show that our metrics uncover adaptation patterns hidden by existing analysis, offering a richer understanding of temporal robustness in evolving environments.

[528] Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

Guo Gan, Yuxuan Ding, Cong Chen, Yuwei Ren, Yin Huang, Hong Zhou

Main category: cs.LG

TL;DR: Android Coach: A novel RL framework for Android agents that shifts from Single State Single Action to Single State Multiple Actions paradigm, using a critic model to estimate action values without additional emulator overhead, achieving higher success rates and training efficiency.

DetailsMotivation: Online RL for Android agents is expensive due to emulator latency and sample inefficiency of existing RL algorithms. Current approaches suffer from the Single State Single Action paradigm that doesn't fully explore costly emulator states.

Method: Proposes Android Coach framework with Single State Multiple Actions paradigm, using a learned critic to estimate action values without extra emulator calls. Integrates process reward model and group-wise advantage estimator based on averaged critic outputs.

Result: Achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4x higher training efficiency than PPO and GRPO at matched success rates.

Conclusion: Android Coach effectively addresses the sample inefficiency problem in Android agent training by enabling more exploration per emulator state through the Single State Multiple Actions paradigm with critic guidance.

Abstract: Online reinforcement learning (RL) serves as an effective method for enhancing the capabilities of Android agents. However, guiding agents to learn through online interaction is prohibitively expensive due to the high latency of emulators and the sample inefficiency of existing RL algorithms. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs from online one-way rollouts without fully exploring each costly emulator state. In this paper, we propose Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, allowing the agent to sample and utilize multiple actions for a single online state. We enable this without additional emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: it achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4x higher training efficiency than Single State Single Action methods PPO and GRPO at matched success rates.
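The group-wise advantage estimator is described only at a high level. One plausible form, sketched here as an assumption (a GRPO-style mean baseline over the critic's values for the actions sampled at one state, with std normalization), not the paper's exact estimator:

```python
import numpy as np

def groupwise_advantages(critic_values):
    """Advantage for each of several actions sampled at a single state:
    critic value minus the group mean, normalized by the group std."""
    q = np.asarray(critic_values, dtype=float)
    adv = q - q.mean()
    std = q.std()
    return adv / std if std > 0 else adv

# The highest-value action gets a positive advantage, the lowest a negative one.
adv = groupwise_advantages([1.0, 2.0, 3.0])
```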

[529] Graph Neural ODE Digital Twins for Control-Oriented Reactor Thermal-Hydraulic Forecasting Under Partial Observability

Akzhol Almukhametov, Doyeong Lim, Rui Hu, Yang Liu

Main category: cs.LG

TL;DR: Physics-informed GNN-ODE surrogate model for real-time thermal-hydraulic state forecasting in nuclear reactors, combining graph neural networks with neural ODEs for fast inference and handling partial observability.

DetailsMotivation: Real-time supervisory control of advanced reactors requires accurate forecasting of plant-wide thermal-hydraulic states, including at uninstrumented locations, with requirements for predictive fidelity, millisecond-scale inference, and robustness to partial observability.

Method: Physics-informed message-passing Graph Neural Network coupled with Neural ODE (GNN-ODE) representing system as directed sensor graph with flow/heat transfer-aware message passing, advancing latent dynamics via controlled Neural ODE, with topology-guided missing-node initialization.

Result: Achieves average MAE of 0.91K at 60s and 2.18K at 300s for uninstrumented nodes, R² up to 0.995 for missing-node reconstruction, 105x faster than simulated time, successful sim-to-real transfer with only 30 training sequences, recovers Reynolds-number exponent consistent with established correlations.

Conclusion: GNN-ODE surrogate effectively addresses requirements for real-time reactor control with high fidelity, fast inference, and robustness to partial observability, demonstrating constitutive learning beyond trajectory fitting.

Abstract: Real-time supervisory control of advanced reactors requires accurate forecasting of plant-wide thermal-hydraulic states, including locations where physical sensors are unavailable. Meeting this need calls for surrogate models that combine predictive fidelity, millisecond-scale inference, and robustness to partial observability. In this work, we present a physics-informed message-passing Graph Neural Network coupled with a Neural Ordinary Differential Equation (GNN-ODE) to address all three requirements simultaneously. We represent the whole system as a directed sensor graph whose edges encode hydraulic connectivity through flow/heat transfer-aware message passing, and we advance the latent dynamics in continuous time via a controlled Neural ODE. A topology-guided missing-node initializer reconstructs uninstrumented states at rollout start; prediction then proceeds fully autoregressively. The GNN-ODE surrogate achieves satisfactory results for system dynamics prediction. On held-out simulation transients, the surrogate achieves an average MAE of 0.91 K at 60 s and 2.18 K at 300 s for uninstrumented nodes, with $R^2$ up to 0.995 for missing-node state reconstruction. Inference runs at approximately 105 times faster than simulated time on a single GPU, enabling 64-member ensemble rollouts for uncertainty quantification. To assess sim-to-real transfer, we adapt the pretrained surrogate to experimental facility data using layerwise discriminative fine-tuning with only 30 training sequences. The learned flow-dependent heat-transfer scaling recovers a Reynolds-number exponent consistent with established correlations, indicating constitutive learning beyond trajectory fitting. The model tracks a steep power change transient and produces accurate trajectories at uninstrumented locations.
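The architecture couples graph message passing with a continuous-time latent ODE. A heavily simplified sketch of that coupling, advanced with explicit Euler steps (the aggregation rule, tanh dynamics, and shapes are illustrative assumptions, not the paper's model):

```python
import numpy as np

def gnn_ode_rollout(z0, adj, weight, dt, n_steps):
    """Latent node states advanced by explicit Euler steps of
    dz/dt = tanh((A z) W), where A aggregates messages along directed edges."""
    z = z0.copy()
    traj = [z.copy()]
    for _ in range(n_steps):
        messages = adj @ z                    # neighbor aggregation
        z = z + dt * np.tanh(messages @ weight)
        traj.append(z.copy())
    return np.stack(traj)

# 3-node directed ring (a toy "sensor graph") with 2-dimensional latent states.
adj = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
z0 = np.array([[1., 0.], [0., 1.], [0.5, 0.5]])
traj = gnn_ode_rollout(z0, adj, np.eye(2), dt=0.05, n_steps=10)
```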

[530] SL-FAC: A Communication-Efficient Split Learning Framework with Frequency-Aware Compression

Zehang Lin, Miao Yang, Haihan Zhu, Zheng Lin, Jianhao Huang, Jing Yang, Guangjin Pan, Dianxin Luan, Zihan Fang, Shunzhi Zhu, Wei Ni, John Thompson

Main category: cs.LG

TL;DR: SL-FAC: A communication-efficient split learning framework using adaptive frequency decomposition and frequency-based quantization compression to reduce transmission overhead of smashed data between edge devices and servers.

DetailsMotivation: Split learning helps deploy large neural networks on resource-constrained edge devices by offloading training workload to edge servers, but transmission of smashed data (activations/gradients) creates significant communication bottlenecks as models and devices scale.

Method: Two-component approach: 1) Adaptive Frequency Decomposition (AFD) transforms smashed data to frequency domain and decomposes into spectral components; 2) Frequency-based Quantization Compression (FQC) applies customized quantization bit widths to each component based on spectral energy distribution to preserve crucial information.

Result: Extensive experiments confirm SL-FAC achieves significant communication reduction while maintaining model convergence, improving training efficiency for distributed machine learning on edge devices.

Conclusion: SL-FAC effectively addresses communication bottlenecks in split learning through frequency-domain processing and adaptive compression, enabling more efficient deployment of complex neural networks on resource-constrained edge devices.

Abstract: The growing complexity of neural networks hinders the deployment of distributed machine learning on resource-constrained devices. Split learning (SL) offers a promising solution by partitioning the large model and offloading the primary training workload from edge devices to an edge server. However, the increasing number of participating devices and model complexity leads to significant communication overhead from the transmission of smashed data (e.g., activations and gradients), which constitutes a critical bottleneck for SL. To tackle this challenge, we propose SL-FAC, a communication-efficient SL framework comprising two key components: adaptive frequency decomposition (AFD) and frequency-based quantization compression (FQC). AFD first transforms the smashed data into the frequency domain and decomposes it into spectral components with distinct information. FQC then applies customized quantization bit widths to each component based on its spectral energy distribution. This collaborative approach enables SL-FAC to achieve significant communication reduction while strategically preserving the information most crucial for model convergence. Extensive experiments confirm the superior performance of SL-FAC for improving the training efficiency.
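AFD and FQC are described only at a high level. A toy sketch of the general idea (FFT split plus per-band uniform quantization; the cutoff ratio and bit widths here are assumptions, not the paper's adaptive allocation):

```python
import numpy as np

def quantize(c, bits):
    """Uniform quantization of (complex) coefficients to `bits` bits."""
    scale = float(np.max(np.abs(c))) or 1.0
    levels = 2 ** (bits - 1) - 1
    return np.round(c / scale * levels) / levels * scale

def frequency_aware_compress(x, cutoff_ratio=0.25, low_bits=8, high_bits=2):
    """FFT the activations, then spend more bits on the low-frequency
    (typically high-energy) band than on the high-frequency band."""
    spec = np.fft.rfft(x)
    cutoff = max(1, int(len(spec) * cutoff_ratio))
    spec[:cutoff] = quantize(spec[:cutoff], low_bits)
    spec[cutoff:] = quantize(spec[cutoff:], high_bits)
    return np.fft.irfft(spec, n=len(x))

# A smooth, low-frequency "activation" survives compression almost unchanged.
x = np.sin(2 * np.pi * np.arange(64) / 64)
x_hat = frequency_aware_compress(x)
```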

[531] How to sketch a learning algorithm

Sam Gunn

Main category: cs.LG

TL;DR: A data deletion scheme that predicts model outputs with vanishing error when subsets of training data are excluded, using stability assumptions and arithmetic circuit sketching via higher-order derivatives.

DetailsMotivation: Understanding how training data influences AI models is crucial for interpretability, privacy, and scientific understanding. The core problem is data deletion: predicting how a model would behave if certain training data had been excluded, which has important implications for privacy (right to be forgotten) and model analysis.

Method: Proposes a data deletion scheme based on stability assumptions. Uses a novel technique of locally sketching arithmetic circuits by computing higher-order derivatives in random complex directions. Leverages forward-mode automatic differentiation for efficient computation of these derivatives. Precomputation and prediction algorithms are poly(1/ε) factors slower than regular training/inference.

Result: Achieves vanishing error ε in predicting model outputs when training data subsets are excluded. Storage requirements are poly(1/ε) models. Shows stability assumption is compatible with learning powerful AI models through experiments with microgpt.

Conclusion: Presents an efficient data deletion framework with practical computational requirements that enables analysis of training data influence on AI models, addressing important interpretability and privacy concerns.

Abstract: How does the choice of training data influence an AI model? This question is of central importance to interpretability, privacy, and basic science. At its core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning algorithm. We present a data deletion scheme capable of predicting model outputs with vanishing error $\varepsilon$ in the deep learning setting. Our precomputation and prediction algorithms are only $\mathrm{poly}(1/\varepsilon)$ factors slower than regular training and inference, respectively. The storage requirements are those of $\mathrm{poly}(1/\varepsilon)$ models. Our proof is based on an assumption that we call “stability.” In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, we show that stability is satisfied in a minimal set of experiments with microgpt. Our code is available at https://github.com/SamSpo1/microgpt-sketch. At a technical level, our work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives.
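The abstract mentions computing higher-order derivatives in random complex directions. A related and much simpler classic, shown here as a hedged illustration rather than the paper's sketching method, is the complex-step first derivative, which avoids the cancellation error of finite differences:

```python
def complex_step_derivative(f, x, h=1e-20):
    """First derivative of a real-analytic f via the complex-step trick:
    f(x + ih) = f(x) + ih f'(x) + O(h^2), so f'(x) ~= Im f(x + ih) / h."""
    return f(complex(x, h)).imag / h

d3 = complex_step_derivative(lambda z: z ** 3, 2.0)   # d/dx x^3 at x = 2 is 12
```

Because the step is imaginary, no subtraction of nearby values occurs, so h can be taken far below machine epsilon.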

[532] Smoothing the Edges: Smooth Optimization for Sparse Regularization using Hadamard Overparametrization

Chris Kolb, Christian L. Müller, Bernd Bischl, David Rügamer

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2307.03571 returned HTTP 429 (rate limited).

[533] Don’t Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget

Florian E. Dorner, Moritz Hardt

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2402.02249 returned HTTP 429 (rate limited).

[534] AFL: A Single-Round Analytic Approach for Federated Learning with Pre-trained Models

Run He, Kai Tong, Di Fang, Han Sun, Ziqian Zeng, Haoran Li, Tianyi Chen, Huiping Zhuang

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2405.16240 returned HTTP 429 (rate limited).

[535] DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning

Taisuke Kobayashi

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2410.17473 returned HTTP 429 (rate limited).

[536] SSPINNpose: A Self-Supervised PINN for Inertial Pose and Dynamics Estimation

Markus Gambietz, Eva Dorschky, Altan Akat, Marcel Schöckel, Jörg Miehling, Anne D. Koelewijn

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2506.11786 returned HTTP 429 (rate limited).

[537] Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement

Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2507.08390 returned HTTP 429 (rate limited).

[538] Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling

Yixuan Zhang, Jinhao Sheng, Wenxin Zhang, Quyu Kong, Feng Zhou

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2508.05423 returned HTTP 429 (rate limited).

[539] LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks

Ze Tao, Hanxuan Wang, Fujun Liu

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2508.08935 returned HTTP 429 (rate limited).

[540] MDP modeling for multi-stage stochastic programs

David P. Morton, Oscar Dowson, Bernardo K. Pagnoncelli

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2509.22981 returned HTTP 429 (rate limited).

[541] Entropy After for reasoning model early exiting

Xi Wang, James McInerney, Lequn Wang, Nathan Kallus

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2509.26522 returned HTTP 429 (rate limited).

[542] Bayesian E(3)-Equivariant Interatomic Potential with Iterative Restratification of Many-body Message Passing

Soohaeng Yoo Willow, Tae Hyeon Park, Gi Beom Sim, Sung Wook Moon, Seung Kyu Min, D. ChangMo Yang, Hyun Woo Kim, Juho Lee, Chang Woo Myung

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.03046 returned HTTP 429 (rate limited).

[543] ECLipsE-Gen-Local: Efficient Compositional Local Lipschitz Estimates for Deep Neural Networks

Yuezhu Xu, S. Sivaranjani

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.05261 returned HTTP 429 (rate limited).

[544] Approximate Replicability in Learning

Max Hopkins, Russell Impagliazzo, Christopher Ye

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.20200 returned HTTP 429 (rate limited).

[545] Alternatives to the Laplacian for Scalable Spectral Clustering with Group Fairness Constraints

Iván Ojeda-Ruiz, Young Ju Lee, Malcolm Dickens, Leonardo Cambisaca

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.20220 returned HTTP 429 (rate limited).

[546] Tensor-Efficient High-Dimensional Q-learning

Junyi Wu, Dan Li

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2511.03595 returned HTTP 429 (rate limited).

[547] Adaptive Symmetrization of the KL Divergence

Omri Ben-Dov, Luiz F.O. Chamon

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2511.11159 returned HTTP 429 (rate limited).

[548] Replacing Tunable Parameters in Weather and Climate Models with State-Dependent Functions using Reinforcement Learning

Pritthijit Nath, Sebastian Schemm, Henry Moss, Peter Haynes, Emily Shuckburgh, Mark J. Webb

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2601.04268 returned HTTP 429 (rate limited).

[549] Low-Rank Key Value Attention

James O’Neill, Robert Clancy, Mariia Matskevichus, Fergal Reid

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2601.11471 returned HTTP 429 (rate limited).

[550] Explainable AI to Improve Machine Learning Reliability for Industrial Cyber-Physical Systems

Annemarie Jutte, Uraz Odyurt

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2601.16074 returned HTTP 429 (rate limited).

[551] LUMINA: Foundation Models for Topology Transferable ACOPF

Yijiang Li, Zeeshan Memon, Hongwei Jin, Stefano Fenu, Keunju Song, Sunash B Sharma, Parfait Gasana, Hongseok Kim, Liang Zhao, Kibaek Kim

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2603.04300 returned HTTP 429 (rate limited).

[552] Interventional Time Series Priors for Causal Foundation Models

Dennis Thumm, Ying Chen

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2603.11090 returned HTTP 429 (rate limited).

[553] EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering

Nicolas Deutschmann, Constance Ferragu, Jonathan D. Ziegler, Shayan Aziznejad, Eli Bixby

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2603.11703 returned HTTP 429 (rate limited).

[554] Shapes are not enough: CONSERVAttack and its use for finding vulnerabilities and uncertainties in machine learning applications

Philip Bechtle, Lucie Flek, Philipp Alexander Jung, Akbar Karimi, Timo Saala, Alexander Schmidt, Matthias Schott, Philipp Soldin, Christopher Wiebusch, Ulrich Willemsen

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2603.13970 returned HTTP 429 (rate limited).

[555] CRPS-Optimal Binning for Univariate Conformal Regression

Paolo Toccaceli

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2603.22000 returned HTTP 429 (rate limited).

[556] Matrix Profile for Time-Series Anomaly Detection: A Reproducible Open-Source Benchmark on TSB-AD

Chin-Chia Michael Yeh

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2604.02445 returned HTTP 429 (rate limited).

[557] NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure

Maharshi Savdhariya

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2604.03336 returned HTTP 429 (rate limited).

[558] Algebraic Diversity: Group-Theoretic Spectral Estimation from Single Observations

Mitchell A. Thornton

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2604.03634 returned HTTP 429 (rate limited).

[559] Active Statistical Inference

Tijana Zrnic, Emmanuel J. Candès

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2403.03208 returned HTTP 429 (rate limited).

[560] Resistance Distance and Linearized Optimal Transport on Graphs

Sawyer Robertson, Zhengchao Wan, Alexander Cloninger

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2404.15261 returned HTTP 429 (rate limited).

[561] Thompson Sampling for Infinite-Horizon Discounted Decision Processes

Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2405.08253 returned HTTP 429 (rate limited).

[562] Differentially Private Best-Arm Identification

Achraf Azize, Marc Jourdan, Aymen Al Marjani, Debabrota Basu

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2406.06408 returned HTTP 429 (rate limited).

[563] Nonparametric Instrumental Regression via Kernel Methods is Minimax Optimal

Dimitri Meunier, Zhu Li, Tim Christensen, Arthur Gretton

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2411.19653 returned HTTP 429 (rate limited).

[564] Non-Expansive Mappings in Two-Time-Scale Stochastic Approximation: Finite-Time Analysis

Siddharth Chandak

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2501.10806 returned HTTP 429 (rate limited).

[565] Spike-based alignment learning solves the weight transport problem

Timo Gierlich, Andreas Baumbach, Akos F. Kungl, Kevin Max, Mihai A. Petrovici

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2503.02642 returned HTTP 429 (rate limited).

[566] Computational bottlenecks for denoising diffusions

Andrea Montanari, Viet Vu

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2503.08028 returned HTTP 429 (rate limited).

[567] A Giant-Step Baby-Step Classifier For Scalable and Real-Time Anomaly Detection In Industrial Control Systems and Water Treatment Systems

Sarad Venugopalan, Sridhar Adepu

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2504.20906 returned HTTP 429 (rate limited).

[568] Apple: Toward General Active Perception via Reinforcement Learning

Tim Schneider, Cristiana de Farias, Roberto Calandra, Liming Chen, Jan Peters

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2505.06182 returned HTTP 429 (rate limited).

[569] Neural Two-Stage Stochastic Optimization for Solving Unit Commitment Problem

Zhentong Shao, Jingtao Qin, Nanpeng Yu

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2507.09503 returned HTTP 429 (rate limited).

[570] MF-GLaM: A multifidelity stochastic emulator using generalized lambda models

K. Giannoukou, X. Zhu, S. Marelli, B. Sudret

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2507.10303 returned HTTP 429 (rate limited).

[571] PAC-Bayesian Bounds on Constrained f-Entropic Risk Measures

Hind Atbir, Farah Cherfaoui, Guillaume Metzler, Emilie Morvant, Paul Viallard

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.11169 returned HTTP 429 (rate limited).

[572] RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs

Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, Ion Stoica

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.19225 returned HTTP 429 (rate limited).

[573] DisCEdge: Distributed Context Management for Large Language Models at the Edge

Mohammadreza Malekabbasi, Minghe Wang, David Bermbach

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2511.22599 returned HTTP 429 (rate limited).

[574] Physics-Informed Neural Networks for Source Inversion and Parameters Estimation in Atmospheric Dispersion

Brenda Anague, Bamdad Hosseini, Issa Karambal, Jean Medard Ngnotchouye

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2512.07755 returned HTTP 429 (rate limited).

[575] Probabilistic Predictions of Process-Induced Deformation in Carbon/Epoxy Composites Using a Deep Operator Network

Elham Kiyani, Amit Makarand Deshpande, Madhura Limaye, Zhiwei Gao, Zongren Zou, Sai Aditya Pradeep, Srikanth Pilla, Gang Li, Zhen Li, George Em Karniadakis

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2512.13746 returned HTTP 429 (rate limited).

[576] Fast reconstruction-based ROI triggering via anomaly detection in the CYGNO optical TPC

F. D. Amaro, R. Antonietti, E. Baracchini, L. Benussi, C. Capoccia, M. Caponero, L. G. M. de Carvalho, G. Cavoto, I. A. Costa, A. Croce, M. D’Astolfo, G. D’Imperio, G. Dho, E. Di Marco, J. M. F. dos Santos, D. Fiorina, F. Iacoangeli, Z. Islam, E. Kemp, H. P. Lima Jr., G. Maccarrone, R. D. P. Mano, D. J. G. Marques, G. Mazzitelli, P. Meloni, A. Messina, V. Monno, C. M. B. Monteiro, R. A. Nobrega, G. M. Oppedisano, I. F. Pains, E. Paoletti, F. Petrucci, S. Piacentini, D. Pierluigi, D. Pinci, F. Renga, A. Russo, G. Saviano, P. A. O. C. Silva, N. J. Spooner, R. Tesauro, S. Tomassini, D. Tozzi

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2512.24290 returned HTTP 429 (rate limited).

[577] Concave Certificates: Geometric Framework for Distributionally Robust Risk and Complexity Analysis

Hong T.M. Chu

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2601.01311 returned HTTP 429 (rate limited).

[578] Theory and interpretability of Quantum Extreme Learning Machines: a Pauli-transfer matrix approach

Markus Gross, Hans-Martin Rieser

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.18377 returned HTTP 429 (rate limited).

[579] Robust support vector model based on bounded asymmetric elastic net loss for binary classification

Haiyan Du, Hu Yang

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2603.06257 returned HTTP 429 (rate limited).

[580] Conditional flow matching for physics-constrained inverse problems with finite training data

Agnimitra Dasgupta, Ali Fardisi, Mehrnegar Aminy, Brianna Binder, Bryan Shaddy, Saeed Moazami, Assad Oberai

Main category: cs.LG

Summary unavailable: the arXiv API request for 2603.14135 was rate-limited (HTTP 429).

cs.MA

[581] Logical Robots: Declarative Multi-Agent Programming in Logica

Evgeny Skvortsov, Yilin Xia, Ojaswa Garg, Shawn Bowers, Bertram Ludäscher

Main category: cs.MA

TL;DR: Logical Robots is a multi-agent simulation platform that uses the Logica logic programming language to specify autonomous robot behavior declaratively, through logical predicates that map sensor observations to motor outputs.

Motivation: To create a unified framework for exploring multi-agent robot behavior where both low-level reactive control and high-level planning can coexist within a single programming environment, moving away from traditional imperative programming approaches.

Method: Uses the logic programming language Logica to define robot behavior through logical predicates. These predicates map observations from simulated radar arrays and shared memory to desired motor outputs, enabling declarative specification of autonomous behavior in a multi-agent simulation platform.

Result: Developed an interactive multi-agent simulation platform where autonomous robot behavior can be specified declaratively, allowing for coherent exploration of multi-agent interactions and behaviors through logical programming constructs.

Conclusion: The Logical Robots platform demonstrates that logic programming provides an effective framework for specifying autonomous robot behavior, enabling both reactive control and planning within a single coherent environment for multi-agent systems.

Abstract: We present Logical Robots, an interactive multi-agent simulation platform where autonomous robot behavior is specified declaratively in the logic programming language Logica. Robot behavior is defined by logical predicates that map observations from simulated radar arrays and shared memory to desired motor outputs. This approach allows low-level reactive control and high-level planning to coexist within a single programming environment, providing a coherent framework for exploring multi-agent robot behavior.
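The observation-to-motor mapping described above can be mirrored in plain Python. Logica itself compiles logic programs to SQL, so this is only an illustrative analogue of the declarative idea, not Logica syntax; the rule names, observation fields, and motor commands are hypothetical.

```python
# Behavior as a set of declarative rules: each rule pairs a predicate over
# the observation with the motor command to emit when the predicate holds.
# First matching rule wins, giving reactive priority ordering.

RULES = [
    # (name, predicate over observation dict, motor command)
    ("avoid",  lambda obs: obs["radar_front"] < 1.0, {"left": -0.5, "right": 0.5}),
    ("seek",   lambda obs: obs["goal_visible"],      {"left": 1.0,  "right": 1.0}),
    ("wander", lambda obs: True,                     {"left": 0.7,  "right": 0.4}),
]

def motor_output(observation):
    """Return the name and motor command of the first rule that fires."""
    for name, predicate, command in RULES:
        if predicate(observation):
            return name, command
    raise ValueError("no rule matched")

# An obstacle close in front takes priority over a visible goal:
name, cmd = motor_output({"radar_front": 0.4, "goal_visible": True})
```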

[582] Event-Triggered Adaptive Consensus for Multi-Robot Task Allocation

Fidel Aznar, Mar Pujol, Álvaro Díez

Main category: cs.MA

TL;DR: A novel event-triggered organization framework for heterogeneous robotic swarms that reduces communication overhead while maintaining task completion effectiveness through adaptive consensus and behavior tree-based resilience.

Motivation: Coordinating robotic swarms in dynamic, communication-constrained environments is challenging; existing methods often involve excessive communication overhead or lack adaptability to environmental changes and agent failures.

Method: Event-triggered organization framework with adaptive consensus mechanism (communication only for significant events), self-regulated coordination pace based on environmental conflict, and robust execution model using Behavior Trees for individual agent resilience.

Result: Significantly reduces network overhead compared to communication-heavy strategies while maintaining top-tier mission effectiveness (tasks completed); exhibits resilience to action execution and permanent agent failures.

Conclusion: The framework enables adaptive, resource-efficient robotic swarms for complex scenarios through event-triggered communication and integrated resilience mechanisms.

Abstract: Coordinating robotic swarms in dynamic and communication-constrained environments remains a fundamental challenge for collective intelligence. This paper presents a novel framework for event-triggered organization, designed to achieve highly efficient and adaptive task allocation in a heterogeneous robotic swarm. Our approach is based on an adaptive consensus mechanism where communication for task negotiation is initiated only in response to significant events, eliminating unnecessary interactions. Furthermore, the swarm self-regulates its coordination pace based on the level of environmental conflict, and individual agent resilience is managed through a robust execution model based on Behavior Trees. This integrated architecture results in a collective system that is not only effective but also remarkably efficient and adaptive. We validate our framework through extensive simulations, benchmarking its performance against a range of coordination strategies. These include a non-communicating reactive behavior, a simple information-sharing protocol, the baseline Consensus-Based Bundle Algorithm (CBBA), and a periodic CBBA variant integrated within a Behavior Tree architecture. Furthermore, our approach is compared with Clustering-CBBA (C-CBBA), a state-of-the-art algorithm recognized for communication-efficient task management in heterogeneous clusters. Experimental results demonstrate that the proposed method significantly reduces network overhead when compared to communication-heavy strategies. Moreover, it maintains top-tier mission effectiveness regarding the number of tasks completed, showcasing high efficiency and practicality. The framework also exhibits significant resilience to both action execution and permanent agent failures, highlighting the effectiveness of our event-triggered model for designing adaptive and resource-efficient robotic swarms for complex scenarios.
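The event-triggered idea, communicating only when something significant changes, can be sketched as follows. The drift-threshold trigger below is an assumed stand-in for the paper's adaptive consensus mechanism, chosen to show why event triggering saves bandwidth.

```python
# Minimal event-triggered communication rule: an agent re-opens task
# negotiation only when its local estimate drifts far enough from the
# value it last broadcast to its peers.

class EventTriggeredAgent:
    def __init__(self, threshold):
        self.threshold = threshold
        self.last_broadcast = None  # value peers currently believe

    def step(self, local_estimate):
        """Return the message to send this step, or None to stay silent."""
        if (self.last_broadcast is None
                or abs(local_estimate - self.last_broadcast) > self.threshold):
            self.last_broadcast = local_estimate
            return local_estimate  # significant event: communicate
        return None  # no event: save bandwidth

agent = EventTriggeredAgent(threshold=0.5)
sent = [agent.step(v) for v in [1.0, 1.2, 1.3, 2.0, 2.1]]
# Only 2 of 5 steps trigger a broadcast: [1.0, None, None, 2.0, None]
```

Small drifts are absorbed locally; a periodic protocol would have sent all five messages.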

[583] Generating Local Shields for Decentralised Partially Observable Markov Decision Processes

Haoran Yang, Nobuko Yoshida

Main category: cs.MA

TL;DR: A shield process algebra for safe multi-agent systems in Dec-POMDPs, compiled to local Mealy machines that filter actions based on belief states, with implementation in Rust and PRISM integration for safety verification.

Motivation: Multi-agent systems under partial observation struggle with safety because local actions don't determine joint actions. Existing shielding techniques either need centralized global state or use memoryless local filters that ignore interaction history.

Method: Introduce a shield process algebra with guarded choice and recursion for specifying safe global behavior. Compile shield process to process automaton, then to global Mealy machine as safe joint-action filter, and finally project to local Mealy machines with belief-style subsets of global states consistent with each agent’s observations.

Result: Implemented pipeline in Rust with PRISM integration for computing safety probabilities. Multi-agent path-finding case study shows shields substantially reduce collisions compared to unshielded baseline, with varying expressiveness and conservatism.

Conclusion: The approach provides a formal framework for safe multi-agent coordination under partial observation, enabling specification of global safety constraints that can be compiled to distributed local filters without requiring communication.

Abstract: Multi-agent systems under partial observation often struggle to maintain safety because each agent’s locally chosen action does not, in general, determine the resulting joint action. Shielding addresses this by filtering actions based on the current state, but most existing techniques either assume access to a shared centralised global state or employ memoryless local filters that cannot consider interaction history. We introduce a shield process algebra with guarded choice and recursion for specifying safe global behaviour in communication-free Dec-POMDP settings. From a shield process, we compile a process automaton, then a global Mealy machine as a safe joint-action filter, and finally project it to local Mealy machines whose states are belief-style subsets of the global Mealy machine states consistent with each agent’s observations, and which output per-agent safe action sets. We implement the pipeline in Rust and integrate PRISM, the Probabilistic Symbolic Model Checker, to compute best- and worst-case safety probabilities independently of the agents’ policies. A multi-agent path-finding case study demonstrates how different shield processes substantially reduce collisions compared to the unshielded baseline while exhibiting varying levels of expressiveness and conservatism.
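The projection step can be illustrated with a toy belief-subset filter: the local state is the set of global shield states consistent with the agent's observations, and the safe set is the intersection of per-state safe actions. The three-state shield, observation map, and action sets below are invented for illustration, not taken from the paper.

```python
# Toy local shield: track which global states are still possible given
# the agent's observation, and allow only actions safe in all of them.

GLOBAL_SAFE = {            # global shield state -> actions safe there
    "s0": {"left", "stay"},
    "s1": {"stay"},
    "s2": {"left", "right", "stay"},
}
OBS = {"s0": "clear", "s1": "blocked", "s2": "clear"}  # agent's view per state

def update_belief(belief, observation):
    """Keep only global states consistent with the new observation."""
    return {s for s in belief if OBS[s] == observation}

def safe_actions(belief):
    """Actions safe under every global state the agent might be in."""
    sets = [GLOBAL_SAFE[s] for s in belief]
    return set.intersection(*sets) if sets else set()

belief = update_belief({"s0", "s1", "s2"}, "clear")   # {"s0", "s2"} remain
actions = safe_actions(belief)                         # safe in both states
```

Observing "clear" rules out s1, so the agent may act as if s1 is impossible; the intersection keeps it conservative across the states that remain.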

[584] AgentCity: Constitutional Governance for Autonomous Agent Economies via Separation of Power

Anbang Ruan, Xing Zhang

Main category: cs.MA

TL;DR: Proposes Separation of Power (SoP) governance model for autonomous AI agents using blockchain smart contracts to create accountability chains from agents to human principals, addressing the “Logic Monopoly” problem in multi-agent societies.

Motivation: Addresses the opacity and lack of human oversight in autonomous AI agents operating across organizational boundaries, where emergent collective behavior becomes unobservable and ungovernable by any single human - termed the "Logic Monopoly".

Method: Proposes a constitutional governance architecture with three structural separations: agents legislate operational rules as smart contracts, deterministic software executes within contracts, and humans adjudicate through ownership chains. Implemented as AgentCity on EVM-compatible L2 blockchain with three-tier contract hierarchy.

Result: The paper presents a pre-registered experiment evaluating the SoP model in a commons production economy with 50-1,000 agents, testing the core thesis of alignment-through-accountability.

Conclusion: If each agent is aligned with its human owner through accountability chains, the collective converges on behavior aligned with human intent without top-down rules, breaking the Logic Monopoly through decentralized governance.

Abstract: Autonomous AI agents are beginning to operate across organizational boundaries on the open internet – discovering, transacting with, and delegating to agents owned by other parties without centralized oversight. When agents from different human principals collaborate at scale, the collective becomes opaque: no single human can observe, audit, or govern the emergent behavior. We term this the Logic Monopoly – the agent society’s unchecked monopoly over the entire logic chain from planning through execution to evaluation. We propose the Separation of Power (SoP) model, a constitutional governance architecture deployed on public blockchain that breaks this monopoly through three structural separations: agents legislate operational rules as smart contracts, deterministic software executes within those contracts, and humans adjudicate through a complete ownership chain binding every agent to a responsible principal. In this architecture, smart contracts are the law itself – the actual legislative output that agents produce and that governs their behavior. We instantiate SoP in AgentCity on an EVM-compatible layer-2 blockchain (L2) with a three-tier contract hierarchy (foundational, meta, and operational). The core thesis is alignment-through-accountability: if each agent is aligned with its human owner through the accountability chain, then the collective converges on behavior aligned with human intent – without top-down rules. A pre-registered experiment evaluates this thesis in a commons production economy – where agents share a finite resource pool and collaboratively produce value – at 50-1,000 agent scale.
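The ownership-chain invariant, that every agent must resolve to a responsible human principal, can be sketched as a simple registry walk. The registry entries and names below are hypothetical; on-chain the links would be contract state rather than a Python dict.

```python
# Toy accountability-chain resolver: follow delegation links until a
# human principal is reached, flagging broken or cyclic chains.

REGISTRY = {
    "agent_c": "agent_b",   # agent_c was spawned/delegated by agent_b
    "agent_b": "agent_a",
    "agent_a": "alice",     # alice is a human principal
}
HUMANS = {"alice", "bob"}

def accountable_human(agent, max_hops=10):
    """Walk delegation links until a human principal is reached."""
    current = agent
    for _ in range(max_hops):
        if current in HUMANS:
            return current
        if current not in REGISTRY:
            raise ValueError(f"{current} has no owner: chain is broken")
        current = REGISTRY[current]
    raise ValueError("ownership chain too long or cyclic")

owner = accountable_human("agent_c")   # resolves through b and a to alice
```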

Philipp D. Siedler

Main category: cs.MA

TL;DR: Multi-agent LLM framework for legal argumentation with trait-conditioned agents, showing heterogeneous teams outperform homogeneous ones and introducing RL-based trait orchestrator for adaptive persuasion.

Motivation: Most game-theoretic models abstract away language-based persuasion mechanisms in adversarial domains like law and negotiation. The paper aims to treat language as a first-class strategic action space for building autonomous persuasive agents.

Method: Strategic Courtroom Framework with LLM agents (DeepSeek-R1 and Gemini 2.5 Pro) having nine interpretable traits organized into four archetypes. Evaluated across 10 synthetic cases, 84 team configurations, 7,000+ trials. Introduced reinforcement-learning-based Trait Orchestrator for dynamic trait generation.

Result: Heterogeneous teams with complementary traits consistently outperform homogeneous configurations. Moderate interaction depth yields more stable verdicts. Quantitative and charismatic traits contribute disproportionately to persuasive success. RL-based Trait Orchestrator discovers strategies that outperform human-designed trait combinations.

Conclusion: Language can be treated as a first-class strategic action space. The framework provides foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.

Abstract: Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7,000 simulated trials using DeepSeek-R1 and Gemini 2.5 Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.

[586] Designing for Accountable Agents: a Viewpoint

Stephen Cranefield, Nir Oren

Main category: cs.MA

TL;DR: Survey paper on accountability in multi-agent systems, focusing on defining accountability for autonomous agents rather than human organizational processes.

Motivation: As AI systems become more complex and autonomous, there are growing concerns about their impacts on society. While transparency and explainability have been studied, accountability in AI systems remains poorly defined, especially for autonomous agents in multi-agent systems.

Method: 1) Comprehensive survey of accountability literature across multiple disciplines to identify coherent definitions; 2) Present realistic multi-agent system application domain showing benefits of accountability processes; 3) Identify research challenges for building accountable agents with initial solutions.

Result: Provides foundational work for enabling autonomous elements in open socio-technical systems to participate in accountability processes, laying out a research roadmap for the MAS community.

Conclusion: The paper establishes groundwork for accountability in multi-agent systems, shifting focus from human organizational processes to autonomous agent accountability, with implications for ethical AI development.

Abstract: AI systems are becoming increasingly complex, ubiquitous and autonomous, leading to increasing concerns about their impacts on individuals and society. In response, researchers have begun investigating how to ensure that the methods underlying AI decision-making are transparent and their decisions are explainable to people and conformant to human values and ethical principles. As part of this research thrust, the need for accountability within AI systems has been noted, but this notion has proven elusive to define; we aim to address this issue in the current paper. Unlike much recent work, we do not address accountability within the human organisational processes of developing and deploying AI; rather we consider what it would mean for the agents within a multi-agent system (MAS), potentially including human agents, to be accountable to other agents or to have others accountable to them. In this work, we make the following contributions: we provide an in-depth survey of existing work on accountability in multiple disciplines, seeking to identify a coherent definition of the concept; we give a realistic example of a multi-agent system application domain that illustrates the benefits of enabling agents to follow accountability processes; and we identify a set of research challenges for the MAS community in building accountable agents, sketching out some initial solutions to these, thereby laying out a road-map for future research. Our focus is on laying the groundwork to enable autonomous elements within open socio-technical systems to take part in accountability processes.

[587] Intertemporal Demand Allocation for Inventory Control in Online Marketplaces

Rene Caldentey, Tong Xie

Main category: cs.MA

TL;DR: Platforms can influence seller inventory decisions through demand allocation policies that affect forecast uncertainty, without directly controlling stock.

Motivation: Online marketplaces now manage order routing and fulfillment services, creating a need to understand how platforms can influence seller inventory choices indirectly through demand allocation policies rather than direct stock control.

Method: Developed a model where platforms observe aggregate demand and allocate orders across sellers over time, while sellers choose between fulfill-by-merchant (FBM) and fulfill-by-platform (FBP) options and replenish inventory under state-dependent base-stock policies. Focused on nondiscriminatory allocation policies that give sellers equal demand share and forecast risk.

Result: Uniform splitting minimizes forecast uncertainty, while higher uncertainty can be implemented using simple low-memory allocation rules. Increasing uncertainty requires routing rules that prevent sellers from inferring aggregate demand from their own sales histories. The platform’s problem reduces to choosing optimal forecast uncertainty that trades off platform fulfillment adoption against inventory held by adopters.

Conclusion: Demand allocation serves as a powerful operational and informational design lever in digital marketplaces, allowing platforms to influence seller inventory decisions through forecast uncertainty manipulation without direct stock control.

Abstract: Online marketplaces increasingly do more than simply match buyers and sellers: they route orders across competing sellers and, in many categories, offer ancillary fulfillment services that make seller inventory a source of platform revenue. We investigate how a platform can use intertemporal demand allocation to influence sellers’ inventory choices without directly controlling stock. We develop a model in which the platform observes aggregate demand, allocates orders across sellers over time, and sellers choose between two fulfillment options, fulfill-by-merchant (FBM) and fulfill-by-platform (FBP), while replenishing inventory under state-dependent base-stock policies. The key mechanism we study is informational: by changing the predictability of each seller’s sales stream, the platform changes sellers’ safety-stock needs even when average demand shares remain unchanged. We focus on nondiscriminatory allocation policies that give sellers the same demand share and forecast risk. Within this class, uniform splitting minimizes forecast uncertainty, whereas any higher level of uncertainty can be implemented using simple low-memory allocation rules. Moreover, increasing uncertainty above the uniform benchmark requires routing rules that prevent sellers from inferring aggregate demand from their own sales histories. These results reduce the platform’s problem to choosing a level of forecast uncertainty that trades off adoption of platform fulfillment against the inventory held by adopters. Our analysis identifies demand allocation as a powerful operational and informational design lever in digital marketplaces.
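For background, the state-dependent base-stock behavior attributed to sellers can be sketched with a textbook order-up-to rule (the paper's exact policy is not reproduced here). It makes the key mechanism concrete: higher forecast uncertainty raises safety stock, so the platform can move seller inventory by shaping demand predictability alone.

```python
# Textbook base-stock ("order-up-to") policy: target level S equals
# expected lead-time demand plus a safety-stock term driven by forecast
# uncertainty. The service-level factor z is an illustrative assumption.

def base_stock_level(mean_demand, demand_std, lead_time, z=1.65):
    """Order-up-to level: lead-time demand plus safety stock.
    z ~ 1.65 corresponds to roughly 95% service under normal demand."""
    cycle_stock = mean_demand * lead_time
    safety_stock = z * demand_std * lead_time ** 0.5
    return cycle_stock + safety_stock

def order_quantity(inventory_position, S):
    """Replenish up to S; never order a negative amount."""
    return max(0.0, S - inventory_position)

# Same mean demand, different forecast uncertainty:
S_smooth = base_stock_level(mean_demand=100, demand_std=5,  lead_time=4)
S_noisy  = base_stock_level(mean_demand=100, demand_std=25, lead_time=4)
```

The noisier allocation forces a higher order-up-to level even though average demand is identical, which is exactly the informational lever the paper studies.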

[588] On the Uncertainty of Large Language Model-Based Multi-Agent Systems

Yuxuan Zhao, Sijia Chen, Ningxin Su

Main category: cs.MA

TL;DR: Analyzes multi-agent LLM systems through uncertainty lens, finds single agents outperform MAS in 43.3% cases, identifies entropy dynamics patterns, and proposes Entropy Judger algorithm for solution selection.

Motivation: While multi-agent systems (MAS) using LLMs show promise for complex tasks, the underlying mechanisms for their success/failure remain unexplored. The paper aims to understand MAS effectiveness through the perspective of uncertainty dynamics.

Method: Investigates entropy transitions during problem-solving across various topologies and six benchmark tasks. Analyzes 245 features spanning token-, trajectory-, and round-level entropy. Proposes Entropy Judger algorithm to select solutions from MAS’s pass@k results.

Result: Single agents outperform MAS in ~43.3% of cases. Uncertainty dynamics are largely determined during first round of interaction. Three key observations: Certainty Preference, Base Uncertainty, and Task Awareness. Entropy Judger leads to consistent accuracy improvements across all MAS configurations and tasks.

Conclusion: Uncertainty analysis reveals fundamental insights about MAS effectiveness. The simple Entropy Judger algorithm effectively leverages entropy dynamics to improve MAS performance across diverse configurations and tasks.

Abstract: Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of uncertainty, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies and six benchmark tasks. By analyzing 245 features spanning token-, trajectory-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that uncertainty dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: reducing uncertainty at any stage for any agent is critical for guaranteeing correct solutions; 2) Base Uncertainty: base models with lower entropy during problem-solving directly benefit MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS’s pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent-entropy.
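A selection rule in the spirit of the Entropy Judger can be sketched as follows. The paper's feature set spans token-, trajectory-, and round-level entropy; this sketch collapses that to a mean token entropy per candidate, which is an assumption rather than the authors' exact algorithm.

```python
# Entropy-based selection over pass@k candidates: prefer the solution
# whose generation trajectory carried the least uncertainty.

import math

def token_entropy(prob_dist):
    """Shannon entropy (nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in prob_dist if p > 0)

def mean_trajectory_entropy(token_dists):
    """Average token entropy over a candidate's generation trajectory."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def entropy_judger(candidates):
    """Index of the pass@k candidate generated with the least uncertainty."""
    return min(range(len(candidates)),
               key=lambda i: mean_trajectory_entropy(candidates[i]))

hesitant  = [[0.5, 0.5], [0.4, 0.3, 0.3]]     # high-entropy trajectory
confident = [[0.97, 0.03], [0.99, 0.01]]      # low-entropy trajectory
best = entropy_judger([hesitant, confident])  # picks the confident one
```

This matches the paper's "Certainty Preference" observation: lower uncertainty during generation correlates with correct solutions.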

cs.MM

[589] LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment

Fangyu Hao, Jiayu Yang, Yifan Zhu, Zijun Yu, Qicen Wu, Wang Yunlong, Jiawei Li, Yulin Liu, Xu Zeng, Guanting Chen, Shihao Li, Zhonghong Ou, Meina Song, Mengyang Sun, Haoran Luo, Yu Shi, Yingyi Wang

Main category: cs.MM

TL;DR: LungCURE benchmark and LCAgent framework for guideline-compliant lung cancer clinical decision support using multimodal reasoning.

Motivation: Existing multimodal LLMs fail to handle guideline-constrained staging and treatment reasoning for lung cancer, which requires precise reasoning across multi-stage oncological workflows.

Method: Formalized three oncological precision treatment tasks, created the LungCURE benchmark from 1,000 real-world cases, and proposed the LCAgent multi-agent framework for guideline-compliant decision-making.

Result: Revealed large differences in LLM capabilities for complex medical reasoning, and showed that LCAgent, as a simple plugin, enhances reasoning performance in real-world scenarios.

Conclusion: The LCAgent framework addresses guideline compliance in lung cancer clinical decision support by suppressing cascading reasoning errors across the clinical pathway.

Abstract: Lung cancer clinical decision support demands precise reasoning across complex, multi-stage oncological workflows. Existing multimodal large language models (MLLMs) fail to handle guideline-constrained staging and treatment reasoning. We formalize three oncological precision treatment (OPT) tasks for lung cancer, spanning TNM staging, treatment recommendation, and end-to-end clinical decision support. We introduce LungCURE, the first standardized multimodal benchmark built from 1,000 real-world, clinician-labeled cases across more than 10 hospitals. We further propose LCAgent, a multi-agent framework that ensures guideline-compliant lung cancer clinical decision-making by suppressing cascading reasoning errors across the clinical pathway. Experiments reveal large differences across various large language models (LLMs) in their capabilities for complex medical reasoning, when given precise treatment requirements. We further verify that LCAgent, as a simple yet effective plugin, enhances the reasoning performance of LLMs in real-world medical scenarios.

eess.AS

[590] Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment

Asif Azad, MD Sadik Hossain Shanto, Mohammad Sadat Hossain, Bdour Alwuqaysi, Sabri Boughorbel, Yahya Bokhari, Abdulrhman Aljouie, Ayah Othman Sindi, Ehsan Hoque

Main category: eess.AS

TL;DR: Harf-Speech is a modular system for automated phoneme-level Arabic pronunciation assessment that combines speech-to-phoneme models with alignment algorithms to provide clinically validated scores comparable to expert speech-language pathologists.

Motivation: Validated tools for automated phoneme-level pronunciation assessment in Arabic are scarce, even though such assessment is crucial for scalable speech therapy and language learning applications.

Method: A modular system combining MSA phonetizer, fine-tuned speech-to-phoneme models (including ASR architectures and zero-shot multimodal models), Levenshtein alignment, and blended scoring using longest common subsequence and edit-distance metrics.

Result: Best model (OmniASR-CTC-1B-v2) achieved 8.92% phoneme error rate; Harf-Speech attained Pearson correlation of 0.791 and ICC(2,1) of 0.659 with mean expert scores, outperforming existing end-to-end assessment frameworks.

Conclusion: Harf-Speech provides clinically aligned, interpretable Arabic pronunciation scores comparable to inter-rater expert agreement, addressing the gap in validated Arabic pronunciation assessment tools.

Abstract: Automated phoneme-level pronunciation assessment is vital for scalable speech therapy and language learning, yet validated tools for Arabic remain scarce. We present Harf-Speech, a modular system scoring Arabic pronunciation at the phoneme level on a clinical scale. It combines an MSA phonetizer, a fine-tuned speech-to-phoneme model, Levenshtein alignment, and a blended scorer using longest common subsequence and edit-distance metrics. We fine-tune three ASR architectures on Arabic phoneme data and benchmark them with zero-shot multimodal models; the best, OmniASR-CTC-1B-v2, achieves 8.92% phoneme error rate. Three certified speech-language pathologists independently scored 40 utterances for clinical validation. Harf-Speech attains a Pearson correlation of 0.791 and ICC(2,1) of 0.659 with mean expert scores, outperforming existing end-to-end assessment frameworks. These results show Harf-Speech yields clinically aligned, interpretable scores comparable to inter-rater expert agreement.
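The blended scoring idea can be sketched directly: align target and produced phoneme sequences and mix an edit-distance score with a longest-common-subsequence score. The equal weighting and 0-1 scale below are assumptions; Harf-Speech maps its scores onto a clinical scale.

```python
# Blend of two alignment-based similarity scores over phoneme sequences.

def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lcs_len(a, b):
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def blended_score(target, produced, w=0.5):
    """Weighted mix of an edit-distance score and an LCS score in [0, 1]."""
    n = max(len(target), len(produced), 1)
    edit = 1 - levenshtein(target, produced) / n
    lcs = lcs_len(target, produced) / n
    return w * edit + (1 - w) * lcs

score = blended_score(["k", "i", "t", "a", "b"], ["k", "i", "t", "a", "b"])
```

A perfect production scores 1.0; a single substituted phoneme (e.g. final "b" produced as "p") drops both components to 0.8.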

[591] ULTRAS – Unified Learning of Transformer Representations for Audio and Speech Signals

Ameenudeen P E, Charumathi Narayanan, Sriram Ganapathy

Main category: eess.AS

TL;DR: ULTRAS proposes a unified transformer framework for both speech and audio tasks using patch-based masking and joint spectral-temporal prediction objectives.

Motivation: Current self-supervised learning approaches for speech use time-domain prediction objectives, while audio representation frameworks operate on time-frequency spectrograms; this gap between paradigms limits transferability and motivates a joint framework that handles both speech and audio tasks effectively.

Method: ULTRAS uses transformer architecture with patch-based masking on log-mel spectrograms. It performs predictive modeling of masked segments using combined spectral and temporal targets with a joint loss function, forcing representations to encode both time and frequency characteristics.

Result: The framework achieves improved performance over established baselines on various speech and audio tasks, demonstrating effective joint learning across domains.

Conclusion: ULTRAS provides a unified approach for audio and speech representation learning that bridges the gap between time-domain and time-frequency paradigms through patch-based transformer architecture with joint spectral-temporal objectives.

Abstract: Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), where the masking and predictive modeling is performed over long patches of the data. The model, based on the transformer architecture, encodes spectral-patches of log-mel spectrogram features. The predictive modeling of masked segments is performed on spectral and temporal targets using a combined loss-function, forcing the representations to encode time and frequency traits. Experiments are performed on a variety of speech and audio tasks, where we illustrate that the ULTRAS framework achieves improved performance over other established baselines.
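The patch-based masking setup can be sketched framework-free. ULTRAS's actual targets and loss weights are not specified here, so the spectral and temporal "targets" below (per-mel-bin and per-frame means of each patch) are illustrative assumptions standing in for the paper's combined objective.

```python
# Split a [time][mel] spectrogram into time patches, mask a subset, and
# score a prediction with a blended spectral + temporal error.

import random

def patchify(spec, patch_len):
    """Split a [time][mel] spectrogram into consecutive time patches."""
    return [spec[t:t + patch_len] for t in range(0, len(spec), patch_len)]

def choose_masked(num_patches, mask_ratio, rng):
    """Pick which patch indices to mask out for predictive modeling."""
    k = max(1, int(num_patches * mask_ratio))
    return set(rng.sample(range(num_patches), k))

def combined_loss(pred, patch, alpha=0.5):
    """Blend a spectral error (per mel bin) and a temporal error (per frame)."""
    n_mel, n_t = len(patch[0]), len(patch)
    spec_err = sum((sum(f[m] for f in pred) / n_t
                    - sum(f[m] for f in patch) / n_t) ** 2
                   for m in range(n_mel)) / n_mel
    temp_err = sum((sum(pf) / n_mel - sum(tf) / n_mel) ** 2
                   for pf, tf in zip(pred, patch)) / n_t
    return alpha * spec_err + (1 - alpha) * temp_err

rng = random.Random(0)
spec = [[float(t + m) for m in range(4)] for t in range(8)]  # 8 frames x 4 mel bins
patches = patchify(spec, patch_len=4)        # two 4-frame patches
masked = choose_masked(len(patches), 0.5, rng)
```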

[592] DAT-CFTNet: Speech Enhancement for Cochlear Implant Recipients using Attention-based Dual-Path Recurrent Neural Network

Nursadul Mamun, John H. L. Hansen

Main category: eess.AS

TL;DR: Proposes DAT-CFTNet, an attention-based dual-path RNN combined with complex-valued frequency transformation network for speech enhancement, improving intelligibility and quality while reducing musical artifacts.

Motivation: Inspired by the human auditory system's ability to attend selectively to key speech elements while ignoring background noise, and by the recent success of attention models in neural networks.

Method: Introduces DAT-RNN (dual-path attention RNN) combined with modified CFTNet (complex-valued frequency transformation network) to form DAT-CFTNet. Uses attention mechanism to differentiate speech from noise in time-frequency spectrogram regions, optimizing local and global context processing.

Result: DAT-CFTNet outperforms existing models (CFTNet and DCCRN) in speech intelligibility and quality. Shows superior performance for cochlear implant recipients, suppresses non-stationary noise, and avoids musical artifacts common in traditional methods.

Conclusion: The attention-based dual-path approach effectively enhances speech by mimicking human auditory selective attention, with practical benefits for both normal hearing and cochlear implant users.

Abstract: The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as noise or distortion within the background, dynamically adjusting its attention over time. Inspired by the recent success of attention models, this study introduces a dual-path attention module in the bottleneck layer of a concurrent speech enhancement network. Our study proposes an attention-based dual-path RNN (DAT-RNN), which, when combined with the modified complex-valued frequency transformation network (CFTNet), forms the DAT-CFTNet. This attention mechanism allows for precise differentiation between speech and noise in time-frequency (T-F) regions of spectrograms, optimizing both local and global context information processing in the CFTNet. Our experiments suggest that the DAT-CFTNet leads to consistently improved performance over the existing models, including CFTNet and DCCRN, in terms of speech intelligibility and quality. Moreover, the proposed model exhibits superior performance in enhancing speech intelligibility for cochlear implant (CI) recipients, who are known to have severely limited T-F hearing restoration (e.g., >10%). CI listener studies in noisy settings show that the proposed solution is capable of suppressing non-stationary noise, avoiding the musical artifacts often seen in traditional speech enhancement methods. The implementation of the proposed model will be publicly available.

[593] EvoTSE: Evolving Enrollment for Target Speaker Extraction

Zikai Liu, Ziqian Wang, Xingchen Li, Yike Zhu, Shuai Wang, Longshuai Xiao, Lei Xie

Main category: eess.AS

TL;DR: EvoTSE is an evolving target speaker extraction framework that continuously updates speaker enrollment through reliability-filtered retrieval from high-confidence historical estimates, reducing speaker confusion and relaxing quality requirements for pre-recorded enrollment.

DetailsMotivation: Target Speaker Extraction (TSE) faces two key limitations: vulnerability to speaker confusion (extracting wrong speaker) and dependence on static enrollment quality. Current TSE systems rely on fixed pre-recorded enrollment, limiting performance when enrollment quality is poor.

Method: Proposes EvoTSE framework with continuous enrollment updating through reliability-filtered retrieval over high-confidence historical estimates. Uses dynamic enrollment mechanism that evolves during inference without requiring additional annotated data.
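
The reliability-filtered enrollment update can be illustrated with a minimal sketch; the function name, confidence threshold, and averaging rule are assumptions for illustration, not the paper's exact retrieval mechanism:

```python
def evolve_enrollment(initial_emb, history, conf_threshold=0.8, top_k=3):
    # history: (embedding, confidence) pairs from earlier inference chunks;
    # only high-confidence past estimates are allowed to refresh the enrollment
    reliable = [emb for emb, conf in history if conf >= conf_threshold]
    if not reliable:
        return list(initial_emb)  # fall back to the pre-recorded enrollment
    pool = reliable[-top_k:] + [list(initial_emb)]
    return [sum(v[i] for v in pool) / len(pool) for i in range(len(initial_emb))]

emb = evolve_enrollment([0.0, 0.0], [([2.0, 2.0], 0.9), ([9.0, 9.0], 0.1)])
```

Because low-confidence estimates are filtered out before they can influence the enrollment, a poor initial recording can be gradually replaced by the model's own reliable outputs without any extra annotation.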

Result: Experiments across multiple benchmarks show consistent improvements, especially in out-of-domain scenarios. The framework reduces speaker confusion and performs well even with lower-quality initial enrollment.

Conclusion: EvoTSE effectively addresses speaker confusion and enrollment quality limitations in TSE through dynamic enrollment updating, demonstrating robust performance across various scenarios including challenging out-of-domain conditions.

Abstract: Target Speaker Extraction (TSE) aims to isolate a specific speaker’s voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extract the interfering speaker. Furthermore, conventional TSE relies on a static inference pipeline, where performance is limited by the quality of the fixed enrollment. To overcome these limitations, we propose EvoTSE, an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially when evaluated on out-of-domain (OOD) scenarios. Our code and checkpoints are available.

[594] SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Yanbo Wang, Wei Xue, Lei Xie

Main category: eess.AS

TL;DR: SongFormer is a scalable framework for music structure analysis that learns from heterogeneous supervision, combining short- and long-window self-supervised learning with learned source embeddings to handle partial/noisy labels, achieving SOTA on new benchmark datasets.

DetailsMotivation: Music structure analysis (MSA) is crucial for music understanding and controllable generation, but progress has been limited by small, inconsistent corpora and the challenge of learning from heterogeneous, noisy supervision.

Method: SongFormer fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and introduces learned source embeddings to enable training with partial, noisy, and schema-mismatched labels.
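
A minimal sketch of the two ingredients, frame-level fusion of the two SSL streams plus a per-dataset source embedding; the embedding table and source names are hypothetical (in the paper both the fusion and the source embeddings are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical per-source embedding table; learned jointly with the model
SOURCE_EMB = {"expert": rng.normal(size=4), "noisy_web": rng.normal(size=4)}

def fuse(short_feat, long_feat, source):
    # short_feat, long_feat: (time, d) frame-aligned SSL features from the
    # short- and long-window encoders; the source embedding tells the model
    # which labeling schema/quality the supervision came from
    src = np.broadcast_to(SOURCE_EMB[source], (short_feat.shape[0], 4))
    return np.concatenate([short_feat, long_feat, src], axis=-1)

fused = fuse(np.zeros((10, 16)), np.zeros((10, 16)), "expert")
```

Conditioning on the supervision source lets one model absorb partial, noisy, and schema-mismatched labels instead of forcing all corpora into a single annotation scheme.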

Result: On SongFormBench (300-song expert-verified benchmark), SongFormer sets new SOTA in strict boundary detection (HR.5F) and achieves highest functional label accuracy, surpassing strong baselines and Gemini 2.5 Pro while remaining computationally efficient.

Conclusion: SongFormer provides an effective scalable framework for MSA that handles heterogeneous supervision, with released resources (SongFormDB - largest MSA corpus, SongFormBench benchmark) enabling further research in music understanding and generation.

Abstract: Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 14k songs spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are open-sourced at https://github.com/ASLP-lab/SongFormer.

[595] Disentangling peripheral hearing loss from central and cognitive effects on speech intelligibility in older adults

Toshio Irino, Ayako Yamamoto, Fuki Miyazaki

Main category: eess.AS

TL;DR: A framework using hearing impairment simulation and objective intelligibility measures to distinguish peripheral hearing loss from central/cognitive contributions to speech intelligibility in older adults.

DetailsMotivation: To understand individual differences in speech intelligibility among older adults by separating peripheral hearing loss from central and cognitive processing contributions, which is essential for developing effective assistive hearing strategies.

Method: Uses Wakayama University Hearing Impairment Simulator (WHIS) to simulate audiograms of older adults in young normal-hearing listeners, combined with Gammachirp Envelope Similarity Index (GESI) as an objective intelligibility measure to predict speech-in-noise performance.

Result: Older adults achieved comparable or higher speech intelligibility scores than young listeners with simulated hearing loss; GESI prediction accuracy was comparable for both groups; many older adults’ subjective scores exceeded predictions based on young listener parameters; no correlation between hearing levels and residual differences between subjective and predicted scores.

Conclusion: The framework effectively isolates peripheral hearing loss effects, allowing systematic examination of central and cognitive factors in speech intelligibility, challenging previous assumptions about age-related speech perception decline.

Abstract: Age-related hearing loss (HL) reduces speech intelligibility (SI) in older adults (OAs). However, deficits in central and cognitive processing also substantially impact SI. Understanding these contributions is essential for explaining individual differences and developing effective assistive hearing strategies. This study presents a framework that distinguishes peripheral HL from central and cognitive influences on SI. This framework uses the Wakayama University Hearing Impairment Simulator (WHIS), and the Gammachirp Envelope Similarity Index (GESI), an objective measure of intelligibility. First, speech-in-noise tests were conducted with young, normal-hearing listeners (YNHs) using WHIS to simulate the audiogram of a target OA. The target OA achieved SI scores comparable to or higher than those of YNHs with simulated HL, suggesting contributions beyond peripheral hearing function. Then, GESI was used to predict SI scores for YNHs and OAs across different hearing levels. The prediction accuracy was comparable for both groups. Interestingly, many OAs’ subjective SI scores were higher than those predicted using parameters derived from YNHs’ experiments. This finding is inconsistent with previous research indicating that speech perception ability declines with age. This issue will be discussed. There was no significant correlation between the average hearing levels and the residual differences between the subjective and predicted SI scores. This suggests that GESI effectively absorbed the effects of peripheral HL. Thus, the proposed framework may facilitate systematic examination and comparison of central and cognitive factors beyond peripheral HL among individual YNHs and OAs with and without HL.

eess.IV

[596] MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis

Ashmal Vayani, Parth Parag Kulkarni, Joseph Fioresi, Song Wang, Mubarak Shah

Main category: eess.IV

TL;DR: MedRoute is a dynamic multi-agent LMM framework for medical diagnosis that uses RL-trained routing to select specialist agents, mimicking real clinical workflows and improving diagnostic accuracy.

DetailsMotivation: Current medical LMMs are too general and don't adapt to diverse real-world medical conditions. Real clinical diagnosis involves multiple specialists with domain expertise, but existing multi-agent approaches use static specialist selection that can't adapt to changing scenarios.

Method: Proposes MedRoute: a flexible multi-agent framework with specialist LMM agents, a General Practitioner with RL-trained router for dynamic specialist selection, and a Moderator for final decision making, closely mirroring clinical workflows.
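
The RL-trained routing step can be caricatured as a bandit-style policy over a specialist pool; the specialist names and the value-update rule below are illustrative stand-ins for the paper's RL router, which conditions on the case itself:

```python
import random

SPECIALISTS = ["radiology", "dermatology", "cardiology"]  # hypothetical pool

class Router:
    # tabular, bandit-style stand-in for the RL-trained router
    def __init__(self, eps=0.1):
        self.q = {s: 0.0 for s in SPECIALISTS}  # running value estimates
        self.n = {s: 0 for s in SPECIALISTS}
        self.eps = eps

    def select(self):
        if random.random() < self.eps:
            return random.choice(SPECIALISTS)   # explore
        return max(self.q, key=self.q.get)      # exploit

    def update(self, specialist, reward):
        # incremental-mean update from a diagnostic-accuracy reward
        self.n[specialist] += 1
        self.q[specialist] += (reward - self.q[specialist]) / self.n[specialist]

router = Router(eps=0.0)
router.update("radiology", 1.0)
router.update("dermatology", 0.2)
choice = router.select()
```

In the full framework the General Practitioner's router is state-dependent (it sees the question and image), so the tabular values above would be replaced by a learned policy network.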

Result: Extensive evaluations on text and image-based medical datasets demonstrate improved diagnostic accuracy, outperforming state-of-the-art baselines.

Conclusion: MedRoute provides a strong foundation for future research in medical LMMs by better emulating real clinical workflows through dynamic multi-agent collaboration.

Abstract: Medical diagnosis using Large Multimodal Models (LMMs) has gained increasing attention due to the capability of these models to provide precise diagnoses. These models generally combine medical questions with visual inputs to generate diagnoses or treatments. However, they are often overly general and unsuitable for the wide range of medical conditions in real-world healthcare. In clinical practice, diagnosis is performed by multiple specialists, each contributing domain-specific expertise. To emulate this process, a potential solution is to deploy a dynamic multi-agent LMM framework, where each agent functions as a medical specialist. Current approaches in this emerging area, typically relying on static or predefined selection of various specialists, cannot adapt to changing practical scenarios. In this paper, we propose MedRoute, a flexible and dynamic multi-agent framework that comprises a collaborative system of specialist LMM agents. Furthermore, we add a General Practitioner with an RL-trained router for dynamic specialist selection, and a Moderator that produces the final decision. In this way, our framework closely mirrors real clinical workflows. Extensive evaluations on text and image-based medical datasets demonstrate improved diagnostic accuracy, outperforming the state-of-the-art baselines. Our work lays a strong foundation for future research. Code and models are available at https://github.com/UCF-CRCV/MedRoute/.

[597] Structural Regularities of Cinema SDR-to-HDR Mapping in a Controlled Mastering Workflow: A Pixel-wise Case Study on ASC StEM2

Xin Zhang, Xiaoyi Chen

Main category: eess.IV

TL;DR: Empirical study of cinema SDR-to-HDR mapping using ASC StEM2 dataset reveals stable luminance relationships and color redistribution patterns between SDR/HDR masters, with 82.4% of regions classified as EXR-closer recovery.

DetailsMotivation: To provide quantitative understanding of SDR-to-HDR mapping in cinema mastering by analyzing relationships between EXR source data, SDR release masters, and HDR release masters using a rare common-source dataset.

Method: Used ASC StEM2 dataset containing EXR scene-referred images and matched SDR/HDR cinema release masters. Conducted pixel-wise statistical analysis over 18,580 frames, constructed three-domain comparison (EXR, SDR, HDR), analyzed luminance and color structural relationships, and defined pixel-level decision map separating EXR-closer recovery regions from adaptive adjustment regions.
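
The pixel-level decision map can be sketched as a per-pixel comparison against the EXR anchor; the exact rule below is an assumption for illustration, not the paper's published operational definition:

```python
def classify_region(exr, hdr, hdr_from_sdr):
    # exr: scene-referred anchor value; hdr: HDR master value;
    # hdr_from_sdr: HDR value predicted from the SDR master by the global mapping
    if abs(hdr - exr) <= abs(hdr - hdr_from_sdr):
        return "EXR-closer recovery"
    return "adaptive adjustment"

labels = [classify_region(0.50, 0.52, 0.80), classify_region(0.50, 0.90, 0.92)]
```

Under a rule of this shape, the paper's 82.4% figure corresponds to regions where the HDR master essentially recovers the scene-referred value, with the remaining 17.6% reflecting deliberate, content-adaptive grading.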

Result: Found highly stable global monotonic correspondence between SDR and HDR luminance with consistent geometric structure; color shows hue consistency with saturation redistribution (shadow suppression, midtone expansion, highlight convergence). 82.4% of image regions classified as EXR-closer recovery, 17.6% require localized adaptive adjustment.

Conclusion: Provides interpretable quantitative baseline for structure-aware SDR-to-HDR analysis and for designing learning-based models under shared-source mastering conditions, rather than claiming universal laws for all cinema pipelines.

Abstract: We present an empirical case study of cinema SDR-to-HDR mapping using ASC StEM2, a rare common-source dataset containing EXR scene-referred images and matched SDR/HDR cinema release masters from the same ACES-based mastering workflow. Based on pixel-wise statistics over all 18,580 frames of the test film, we construct a three-domain comparison involving EXR source data, SDR release masters, and HDR release masters to characterize their luminance and color structural relationships within this controlled workflow. In the luminance dimension, SDR and HDR masters exhibit a highly stable global monotonic correspondence, with geometric structure remaining largely consistent overall; sparse and structured deviations appear in self-luminous highlights and specific material regions. In the color dimension, the two masters remain largely consistent in hue, with saturation exhibiting a redistribution pattern of shadow suppression, midtone expansion, and highlight convergence. Using EXR as a scene-referred anchor, we further define a pixel-level decision map that operationally separates EXR-closer recovery regions from content-adaptive adjustment regions. Under this operational definition, 82.4% of sampled image regions are classified as EXR-closer recovery, while the remainder require localized adaptive adjustment. Rather than claiming a universal law for all cinema mastering pipelines, the study provides an interpretable quantitative baseline for structure-aware SDR-to-HDR analysis and for designing learning-based models under shared-source mastering conditions.

[598] Adaptive Differential Privacy for Federated Medical Image Segmentation Across Diverse Modalities

Puja Saha, Eranga Ukwatta

Main category: eess.IV

TL;DR: ADP-FL: Adaptive differentially private federated learning framework for medical image segmentation that dynamically adjusts privacy mechanisms to balance privacy-utility trade-off, improving accuracy and stability while maintaining privacy guarantees.

DetailsMotivation: Medical data is underutilized due to privacy regulations and institutional constraints, and centralized models fail to generalize across clinical sites due to data heterogeneity. Federated learning offers collaborative training without data sharing, but adding differential privacy degrades accuracy and stability.

Method: Proposes ADP-FL framework that adaptively adjusts privacy mechanisms during federated learning for medical image segmentation. The approach dynamically balances privacy-utility trade-off to stabilize training and improve segmentation quality while maintaining rigorous privacy guarantees.
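
The two moving parts, per-round DP noising and an adaptive schedule, can be sketched as follows; the loss-based schedule is a hypothetical stand-in for ADP-FL's actual adaptation rule, and any real deployment must charge such adaptation against the cumulative privacy budget:

```python
import math, random

def dp_update(grads, clip_norm, noise_mult, rng):
    # per-round DP-SGD-style step: clip the client update, then add Gaussian noise
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return [g * scale + rng.gauss(0.0, noise_mult * clip_norm) for g in grads]

def adapt_noise(noise_mult, loss_prev, loss_curr, factor=0.9, floor=0.5):
    # hypothetical schedule: relax the noise multiplier when training stalls
    if loss_curr >= loss_prev:
        return max(floor, noise_mult * factor)
    return noise_mult

clipped = dp_update([3.0, 4.0], clip_norm=1.0, noise_mult=0.0, rng=random.Random(0))
```

The intuition is that a fixed noise multiplier wastes utility in rounds where the model is far from convergence, while an adaptive one can trade noise for stability where it matters most.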

Result: ADP-FL consistently achieves higher accuracy, improved boundary delineation, faster convergence, and greater training stability compared to conventional federated learning and standard differentially private federated learning. Performance approaches non-private federated learning under same privacy budgets across diverse imaging modalities including skin lesions, kidney tumors, and brain tumors.

Conclusion: ADP-FL demonstrates practical viability for high-performance, privacy-preserving medical image segmentation in real-world federated settings, effectively balancing privacy guarantees with segmentation quality.

Abstract: Large volumes of medical data remain underutilized because centralizing distributed data is often infeasible due to strict privacy regulations and institutional constraints. In addition, models trained in centralized settings frequently fail to generalize across clinical sites because of heterogeneity in imaging protocols and continuously evolving data distributions arising from differences in scanners, acquisition parameters, and patient populations. Federated learning offers a promising solution by enabling collaborative model training without sharing raw data. However, incorporating differential privacy into federated learning, while essential for privacy guarantees, often leads to degraded accuracy, unstable convergence, and reduced generalization. In this work, we propose an adaptive differentially private federated learning (ADP-FL) framework for medical image segmentation that dynamically adjusts privacy mechanisms to better balance the privacy-utility trade-off. The proposed approach stabilizes training, significantly improves Dice scores and segmentation boundary quality, and maintains rigorous privacy guarantees. We evaluated ADP-FL across diverse imaging modalities and segmentation tasks, including skin lesion segmentation in dermoscopic images, kidney tumor segmentation in 3D CT scans, and brain tumor segmentation in multi-parametric MRI. Compared with conventional federated learning and standard differentially private federated learning, ADP-FL consistently achieves higher accuracy, improved boundary delineation, faster convergence, and greater training stability, with performance approaching that of non-private federated learning under the same privacy budgets. These results demonstrate the practical viability of ADP-FL for high-performance, privacy-preserving medical image segmentation in real-world federated settings.

[599] Accelerating 4D Hyperspectral Imaging through Physics-Informed Neural Representation and Adaptive Sampling

Chi-Jui Ho, Harsh Bhakta, Wei Xiong, Nicholas Antipa

Main category: eess.IV

TL;DR: Physics-informed neural representation accelerates hyperspectral 2DIR spectroscopy by reconstructing dense 4D spectral images from sparse measurements using MLPs, reducing experiment time 32x.

DetailsMotivation: High-dimensional hyperspectral imaging enables visualization of ultrafast molecular dynamics but requires prohibitively long data acquisition due to dense Nyquist sampling and extensive signal accumulation, especially for spatially varying vibrational couplings in 2DIR spectroscopy.

Method: Uses physics-informed neural representation with multilayer perceptron (MLP) to model relationship between sub-sampled 4D coordinates and spectral intensities, recovering dense spectra from limited observations. Includes loss-aware adaptive sampling to progressively introduce informative samples during iterative data collection.
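
The loss-aware acquisition step reduces to ranking candidate 4D coordinates by their current reconstruction loss; a minimal sketch (the MLP itself and the physics-informed loss are omitted, and the function name is ours):

```python
def loss_aware_sample(candidates, losses, k):
    # rank candidate 4D coordinates by their current per-sample reconstruction
    # loss and acquire the k most informative ones in the next measurement pass
    order = sorted(range(len(candidates)), key=lambda i: losses[i], reverse=True)
    return [candidates[i] for i in order[:k]]

picked = loss_aware_sample(["a", "b", "c"], [0.1, 0.9, 0.5], 2)
```

Interleaving this selection with data collection is what lets the experiment stop at 1/32 of the full Nyquist budget: measurement time is spent where the current model fits worst.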

Result: Method faithfully recovers both oscillatory and non-oscillatory spectral dynamics using only 1/32 of sampling budget, achieving 32-fold reduction in total experiment time while maintaining high-fidelity spectral recovery.

Conclusion: Framework provides scalable solution for accelerating hypercube data experiments including multidimensional spectroscopy and hyperspectral imaging, enabling rapid chemical imaging of transient biological and material systems.

Abstract: High-dimensional hyperspectral imaging (HSI) enables the visualization of ultrafast molecular dynamics and complex, heterogeneous spectra. However, applying this capability to resolve spatially varying vibrational couplings in two-dimensional infrared (2DIR) spectroscopy, a type of coherent multidimensional spectroscopy (CMDS), necessitates prohibitively long data acquisition, driven by dense Nyquist sampling requirements and the need for extensive signal accumulation. To address this challenge, we introduce a physics-informed neural representation approach that efficiently reconstructs dense spatially-resolved 2DIR hyperspectral images from sparse experimental measurements. In particular, we used a multilayer perceptron (MLP) to model the relationship between the sub-sampled 4D coordinates and their corresponding spectral intensities, and recover densely sampled 4D spectra from limited observations. The reconstruction results demonstrate that our method, using a fraction of the samples, faithfully recovers both oscillatory and non-oscillatory spectral dynamics in experimental measurement. Moreover, we develop a loss-aware adaptive sampling method to progressively introduce potentially informative samples for iterative data collection while conducting experiments. Experimental results show that the proposed approach achieves high-fidelity spectral recovery using only $1/32$ of the sampling budget, as opposed to exhaustive sampling, effectively reducing total experiment time by up to 32-fold. This framework offers a scalable solution for accelerating any experiments with hypercube data, including multidimensional spectroscopy and hyperspectral imaging, paving the way for rapid chemical imaging of transient biological and material systems.

[600] CWRNN-INVR: A Coupled WarpRNN based Implicit Neural Video Representation

Yiyang Li, Yanbo Gao, Shuai Li, Zhenyu Du, Jinglin Zhang, Hui Yuan, Mao Ye, Xingyu Gao

Main category: eess.IV

TL;DR: CWRNN-INVR proposes a mixed neural network and residual grid framework for implicit neural video representation, using neural networks for regular structure and residual grids for irregular details, achieving state-of-the-art reconstruction results.

DetailsMotivation: Existing Implicit Neural Video Representation (INVR) methods focus on either neural networks or grid structures without studying their respective roles in video representation. The paper aims to investigate their differences and combine their advantages for better video representation.

Method: Proposes CWRNN-INVR with a mixed neural network and residual grid framework. Uses Coupled WarpRNN-based multi-scale motion representation for regular/structured information, and mixed residual grids for irregular appearance and motion information. Allows network reuse by combining residual grids with WarpRNN.
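
The division of labor, network for regular structure plus residual grid for irregular detail, can be sketched per frame; the signature and the dictionary-style grid lookup below are illustrative assumptions:

```python
def reconstruct_frame(t, network_fn, residual_grid):
    # regular, structured content comes from the neural network; the remaining
    # irregular appearance/motion residual is read from a learned grid
    base = network_fn(t)
    res = residual_grid[t]
    return [b + r for b, r in zip(base, res)]

frame = reconstruct_frame(0, lambda t: [1.0, 2.0], {0: [0.1, -0.2]})
```

Because the grid only has to store what the network cannot predict, its entries stay small and compressible, which is the premise behind the mixed framework.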

Result: Achieves best reconstruction results with average PSNR of 33.73 dB on UVG dataset under 3M model, outperforming existing INVR methods in reconstruction and other downstream tasks.

Conclusion: The mixed framework effectively combines neural networks’ ability to capture general structure with grids’ ability to represent specific details, leading to superior video representation performance.

Abstract: Implicit Neural Video Representation (INVR) has emerged as a novel approach for video representation and compression, using learnable grids and neural networks. Existing methods focus on developing new grid structures efficient for latent representation and neural network architectures with large representation capability, but lack a study of their respective roles in video representation. In this paper, the difference between INVR based on neural networks and INVR based on grids is first investigated from the perspective of video information composition to specify their respective advantages, i.e., the neural network for general structure and the grid for specific detail. Accordingly, an INVR based on a mixed neural network and residual grid framework is proposed, where the neural network is used to represent the regular and structured information and the residual grid is used to represent the remaining irregular information in a video. A Coupled WarpRNN-based multi-scale motion representation and compensation module is specifically designed to explicitly represent the regular and structured information, thus terming our method CWRNN-INVR. For the irregular information, a mixed residual grid is learned where the irregular appearance and motion information are represented together. The mixed residual grid can be combined with the coupled WarpRNN in a way that allows for network reuse. Experiments show that our method achieves the best reconstruction results compared with the existing methods, with an average PSNR of 33.73 dB on the UVG dataset under the 3M model, and outperforms existing INVR methods in other downstream tasks. The code can be found at https://github.com/yiyang-sdu/CWRNN-INVR.git.

[601] A Dynamic Prognostic Prediction Method for Colorectal Cancer Liver Metastasis

Wei Yang, Yiran Zhu, Yan su, Zesheng Li, Chengchang Pan, Honggang Qi

Main category: eess.IV

TL;DR: DyPro: A deep learning framework that models postoperative latent trajectories via residual dynamic evolution to predict colorectal cancer liver metastasis recurrence and survival outcomes.

DetailsMotivation: CRLM has high postoperative recurrence and prognostic heterogeneity, but existing approaches use static single-snapshot representations and fail to capture tumor spatial distribution, longitudinal disease dynamics, and multimodal clinical information.

Method: DyPro infers postoperative latent trajectories via residual dynamic evolution, starting from initial patient representation and generating 12-step sequence of trajectory snapshots through autoregressive residual updates, integrating them to predict recurrence and survival.
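
The 12-step autoregressive residual evolution can be sketched generically; `residual_fn` below stands in for DyPro's learned update network:

```python
def rollout(z0, residual_fn, n_steps=12):
    # autoregressive residual evolution: z_{t+1} = z_t + f(z_t)
    traj = [list(z0)]
    for _ in range(n_steps):
        z = traj[-1]
        traj.append([zi + ri for zi, ri in zip(z, residual_fn(z))])
    return traj  # 13 snapshots: the initial state plus 12 evolution steps

traj = rollout([1.0], lambda z: [1.0])
```

The prognostic heads then pool over the whole trajectory rather than the final state, which is how longitudinal dynamics enter the recurrence and survival predictions.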

Result: On MSKCC CRLM dataset, DyPro achieves C-index of 0.755 for OS and 0.714 for DFS, with OS AUC@1y of 0.920 and OS IBS of 0.143 under repeated stratified 5-fold cross-validation.

Conclusion: DyPro provides quantitative risk cues to support adjuvant therapy planning and follow-up scheduling by modeling dynamic disease trajectories rather than static snapshots.

Abstract: Colorectal cancer liver metastasis (CRLM) exhibits high postoperative recurrence and pronounced prognostic heterogeneity, challenging individualized management. Existing prognostic approaches often rely on static representations from a single postoperative snapshot, and fail to jointly capture tumor spatial distribution, longitudinal disease dynamics, and multimodal clinical information, limiting predictive accuracy. We propose DyPro, a deep learning framework that infers postoperative latent trajectories via residual dynamic evolution. Starting from an initial patient representation, DyPro generates a 12-step sequence of trajectory snapshots through autoregressive residual updates and integrates them to predict recurrence and survival outcomes. On the MSKCC CRLM dataset, DyPro achieves strong discrimination under repeated stratified 5-fold cross-validation, reaching a C-index of 0.755 for OS and 0.714 for DFS, with OS AUC@1y of 0.920 and OS IBS of 0.143. DyPro provides quantitative risk cues to support adjuvant therapy planning and follow-up scheduling.

[602] A Noise Constrained Diffusion (NC-Diffusion) Framework for High Fidelity Image Compression

Zhenyu Du, Yanbo Gao, Shuai Li, Yiyang Li, Hui Yuan, Mao Ye

Main category: eess.IV

TL;DR: NC-Diffusion: A noise-constrained diffusion framework for high-fidelity image compression that aligns quantization noise with diffusion noise to improve reconstruction fidelity and efficiency.

DetailsMotivation: Existing diffusion-based image compression methods suffer from reconstruction deviations due to random noise mismatch between compression and diffusion processes, leading to suboptimal results.

Method: Proposes Noise Constrained Diffusion (NC-Diffusion) that formulates quantization noise as diffusion noise, constructs noise-constrained diffusion from ground-truth to compressed image, adds adaptive frequency-domain filtering in U-Net, and uses zero-shot sample-guided enhancement.
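
The core reinterpretation, quantization residual as forward-process noise, can be sketched in a few lines; the scalar quantizer below is an illustrative stand-in for the learned codec's quantization:

```python
def quantize(latent, step=0.5):
    # scalar quantizer standing in for the learned codec's quantization
    return [round(v / step) * step for v in latent]

def quantization_noise(latent, step=0.5):
    # NC-Diffusion's key move (sketched): the residual between the latent and
    # its quantized version plays the role of forward-process noise, so the
    # reverse diffusion starts exactly from the compressed latent
    q = quantize(latent, step)
    return [v - qv for v, qv in zip(latent, q)]

latent = [0.3, 1.2]
noise = quantization_noise(latent)
```

Since the "noise" is now deterministic given the codec, the diffusion model no longer has to bridge a mismatch between random Gaussian noise and the actual compression artifacts.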

Result: Achieves state-of-the-art performance on multiple benchmark datasets, overcoming noise mismatch and significantly improving inference efficiency while enhancing high-frequency details.

Conclusion: NC-Diffusion effectively addresses noise mismatch in diffusion-based compression, enabling high-fidelity image reconstruction with improved efficiency and detail preservation.

Abstract: With the great success of diffusion models in image generation, diffusion-based image compression is attracting increasing interest. However, due to the random noise introduced in the diffusion learning, such methods usually produce reconstructions that deviate from the original images, leading to suboptimal compression results. To address this problem, in this paper, we propose a Noise Constrained Diffusion (NC-Diffusion) framework for high fidelity image compression. Unlike existing diffusion-based compression methods that add random Gaussian noise and direct the noise into the image space, the proposed NC-Diffusion formulates the quantization noise originally added in the learned image compression as the noise in the forward process of diffusion. Then a noise constrained diffusion process is constructed from the ground-truth image to the initial compression result generated with quantization noise. The NC-Diffusion overcomes the problem of noise mismatch between compression and diffusion, significantly improving the inference efficiency. In addition, an adaptive frequency-domain filtering module is developed to enhance the skip connections in the U-Net based diffusion architecture, in order to enhance high-frequency details. Moreover, a zero-shot sample-guided enhancement method is designed to further improve the fidelity of the image. Experiments on multiple benchmark datasets demonstrate that our method can achieve the best performance compared with existing methods.

[603] 4D Vessel Reconstruction for Benchtop Thrombectomy Analysis

Ethan Nguyen, Javier Carmona, Arisa Matsuzaki, Naoki Kaneko, Katsushi Arisaka

Main category: eess.IV

TL;DR: A multi-view 3D reconstruction method using 4D Gaussian Splatting to measure vessel deformation during mechanical thrombectomy benchtop testing, with displacement tracking and relative stress proxy analysis.

DetailsMotivation: Mechanical thrombectomy procedures can cause vessel deformation and injury, but existing benchtop models lack time-resolved, full-field 3D vessel-motion measurements needed for comprehensive device testing and analysis.

Method: Developed a nine-camera, low-cost multi-view workflow for benchtop thrombectomy testing in silicone middle cerebral artery phantoms. Used 4D Gaussian Splatting for reconstruction, converted point clouds to edge graphs for ROI displacement tracking, and derived relative stress proxy from edge stretch using Neo-Hookean mapping.
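
The stress proxy from edge stretch follows the standard incompressible Neo-Hookean uniaxial form; a minimal sketch in which the shear modulus μ = 1 (the values are relative, as in the paper) and the per-edge mapping details are assumptions:

```python
def stress_proxy(rest_len, deformed_len, mu=1.0):
    # per-edge stretch ratio from the fixed-connectivity graph
    lam = deformed_len / rest_len
    # incompressible Neo-Hookean uniaxial Cauchy stress: sigma = mu*(lam^2 - 1/lam)
    return mu * (lam ** 2 - 1.0 / lam)

rigid = stress_proxy(1.0, 1.0)   # bulk translation: edges keep their length
pulled = stress_proxy(1.0, 1.2)  # stretched edge under pulling
```

This also explains the synthetic validation result: pure translation leaves edge lengths unchanged, so the proxy stays near zero, while pulling produces positive values proportional to the stretch.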

Result: Synthetic validation showed stress proxy near zero for bulk translation and close geometric/temporal agreement for pulling deformations. Preliminary benchtop trials showed cervical aspiration catheter placement produced higher displacement and stress proxy values than internal carotid artery terminus placement.

Conclusion: The protocol provides standardized, time-resolved surface kinematics and comparative measurements for thrombectomy benchtop studies, supporting condition-to-condition comparisons and methods validation, though distinct from absolute wall-stress estimation.

Abstract: Introduction: Mechanical thrombectomy can cause vessel deformation and procedure-related injury. Benchtop models are widely used for device testing, but time-resolved, full-field 3D vessel-motion measurements remain limited. Methods: We developed a nine-camera, low-cost multi-view workflow for benchtop thrombectomy in silicone middle cerebral artery phantoms (2160p, 20 fps). Multi-view videos were calibrated, segmented, and reconstructed with 4D Gaussian Splatting. Reconstructed point clouds were converted to fixed-connectivity edge graphs for region-of-interest (ROI) displacement tracking and a relative surface-based stress proxy. Stress-proxy values were derived from edge stretch using a Neo-Hookean mapping and reported as comparative surface metrics. A synthetic Blender pipeline with known deformation provided geometric and temporal validation. Results: In synthetic bulk translation, the stress proxy remained near zero for most edges (median $\approx$ 0 MPa; 90th percentile 0.028 MPa), with sparse outliers. In synthetic pulling (1-5 mm), reconstruction showed close geometric and temporal agreement with ground truth, with symmetric Chamfer distance of 1.714-1.815 mm and precision of 0.964-0.972 at $\tau = 1$ mm. In preliminary benchtop comparative trials (one trial per condition), cervical aspiration catheter placement showed higher max-median ROI displacement and stress-proxy values than internal carotid artery terminus placement. Conclusion: The proposed protocol provides standardized, time-resolved surface kinematics and comparative relative displacement and stress proxy measurements for thrombectomy benchtop studies. The framework supports condition-to-condition comparisons and methods validation, while remaining distinct from absolute wall-stress estimation. Implementation code and example data are available at https://ethanuser.github.io/vessel4D
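The "Neo-Hookean mapping from edge stretch" in the Method can be sketched with the standard incompressible uniaxial Neo-Hookean form. The functional form and the shear modulus value below are assumptions for illustration; the paper only states that a Neo-Hookean mapping is used and reports the proxy as a relative metric:

```python
import numpy as np

def neo_hookean_stress_proxy(edge_len, edge_len_ref, mu=0.1):
    """Relative stress proxy from per-edge stretch (illustrative form).

    Uses the incompressible uniaxial Neo-Hookean Cauchy stress
    sigma = mu * (lambda^2 - 1/lambda), where lambda = L / L_ref is the
    edge stretch. mu (MPa) is a nominal shear modulus for a silicone
    phantom; the value here is an assumption, not from the paper.
    """
    lam = np.asarray(edge_len, dtype=float) / np.asarray(edge_len_ref, dtype=float)
    return mu * (lam**2 - 1.0 / lam)

# Rigid bulk translation leaves edge lengths unchanged, so the proxy is
# ~0 everywhere -- consistent with the synthetic validation above.
ref = np.array([1.0, 1.2, 0.8])
print(neo_hookean_stress_proxy(ref, ref))          # ~ [0, 0, 0]
print(neo_hookean_stress_proxy(1.05 * ref, ref))   # positive for stretched edges
```

Because only edge-length ratios enter the formula, the proxy is invariant to translation and rotation of the reconstructed surface, which is why it works on fixed-connectivity edge graphs without absolute calibration.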

[604] Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension

Zhenhao Li, Song Ni, Long Yang, Xiaojie Yin, Haijun Yu, Jiazhou Wang, Hongbin Han, Weigang Hu, Yixing Huang

Main category: eess.IV

TL;DR: Proposes an I²SB diffusion model for efficient CT field-of-view extension, achieving superior accuracy and a 700x speedup over conventional diffusion models.

DetailsMotivation: CT scans often suffer from truncation artifacts when objects exceed scanner field-of-view, limiting clinical reliability. While diffusion models show promise for FOV extension, they are computationally demanding and slow at inference.

Method: Uses image-to-image Schrödinger Bridge (I²SB) diffusion model that learns direct stochastic mapping between paired limited-FOV and extended-FOV images, rather than synthesizing from pure Gaussian noise like traditional diffusion models.

Result: Achieves RMSE of 49.8 HU on simulated noisy data and 152.0 HU on real data, outperforming state-of-the-art diffusion models. One-step inference enables reconstruction in 0.19s per 2D slice (700x faster than cDDPM).

Conclusion: I²SB offers superior accuracy and efficiency for CT FOV extension, with potential for real-time clinical deployment due to its interpretable generative process and fast inference.

Abstract: Computed tomography (CT) is a cornerstone imaging modality for non-invasive, high-resolution visualization of internal anatomical structures. However, when the scanned object exceeds the scanner’s field of view (FOV), projection data are truncated, resulting in incomplete reconstructions and pronounced artifacts near FOV boundaries. Conventional reconstruction algorithms struggle to recover accurate anatomy from such data, limiting clinical reliability. Deep learning approaches have been explored for FOV extension, with diffusion generative models representing the latest advances in image synthesis. Yet, conventional diffusion models are computationally demanding and slow at inference due to their iterative sampling process. To address these limitations, we propose an efficient CT FOV extension framework based on the image-to-image Schrödinger Bridge (I$^2$SB) diffusion model. Unlike traditional diffusion models that synthesize images from pure Gaussian noise, I$^2$SB learns a direct stochastic mapping between paired limited-FOV and extended-FOV images. This direct correspondence yields a more interpretable and traceable generative process, enhancing anatomical consistency and structural fidelity in reconstructions. I$^2$SB achieves superior quantitative performance, with root-mean-square error (RMSE) values of 49.8 HU on simulated noisy data and 152.0 HU on real data, outperforming state-of-the-art diffusion models such as conditional denoising diffusion probabilistic models (cDDPM) and patch-based diffusion methods. Moreover, its one-step inference enables reconstruction in just 0.19 s per 2D slice, representing over a 700-fold speedup compared to cDDPM (135 s) and surpassing DiffusionGAN (0.58 s), the second fastest. This combination of accuracy and efficiency indicates that I$^2$SB has potential for real-time or clinical deployment.
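The key structural difference the abstract describes, bridging between paired limited-FOV and extended-FOV images rather than starting from pure Gaussian noise, can be sketched with a simple Brownian bridge. This is a simplified stand-in for the I²SB posterior (the schedule and `sigma` are assumptions), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def bridge_sample(x0, x1, t, sigma=0.1):
    """Sample x_t on a Brownian bridge between a clean image x0 and its
    degraded counterpart x1 (a simplified stand-in for the I2SB posterior).

    The mean interpolates the pair; the variance sigma^2 * t * (1 - t)
    vanishes at both endpoints, so the process is pinned to x0 at t = 0
    and to x1 at t = 1.
    """
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.normal(size=np.shape(x0))

x0 = np.zeros((4, 4))   # extended-FOV (target) stand-in
x1 = np.ones((4, 4))    # limited-FOV (truncated) stand-in

# Endpoints are deterministic; intermediate times are noisy mixtures.
assert np.allclose(bridge_sample(x0, x1, 0.0), x0)
assert np.allclose(bridge_sample(x0, x1, 1.0), x1)
print(bridge_sample(x0, x1, 0.5).mean())
```

Because the reverse process starts at the actual truncated reconstruction x1 instead of Gaussian noise, a single learned step from x1 toward x0 can suffice, which is what enables the 0.19 s one-step inference reported above.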

[605] FourierPET: Deep Fourier-based Unrolled Network for Low-count PET Reconstruction

Zheng Zhang, Hao Tang, Yingying Hu, Zhanli Hu, Jing Qin

Main category: eess.IV

TL;DR: FourierPET: a Fourier-domain unrolled reconstruction framework for low-count PET that exploits the spectral separability of degradations: noise and photon scarcity cause high-frequency phase perturbations, while attenuation errors suppress low-frequency amplitude components.

DetailsMotivation: Existing deep learning methods for low-count PET reconstruction operate in spatial domain with undifferentiated optimization objectives, making it difficult to disentangle overlapping artifacts and limiting correction effectiveness. The authors identify that different degradation types in PET reconstruction are spectrally separable in Fourier domain.

Method: Propose FourierPET, a Fourier-based unrolled reconstruction framework based on Alternating Direction Method of Multipliers. It includes three modules: 1) spectral consistency module for global frequency alignment, 2) amplitude-phase correction module that decouples and compensates for high-frequency phase distortions and low-frequency amplitude suppression, and 3) dual adjustment module for accelerated convergence during iterative reconstruction.

Result: Extensive experiments demonstrate state-of-the-art performance with significantly fewer parameters compared to existing methods, while offering enhanced interpretability through frequency-aware correction.

Conclusion: FourierPET effectively addresses low-count PET reconstruction by leveraging Fourier-domain analysis to separate and correct spectrally distinct degradations, achieving superior performance with improved interpretability.

Abstract: Low-count positron emission tomography (PET) reconstruction is a challenging inverse problem due to severe degradations arising from Poisson noise, photon scarcity, and attenuation correction errors. Existing deep learning methods typically address these in the spatial domain with an undifferentiated optimization objective, making it difficult to disentangle overlapping artifacts and limiting correction effectiveness. In this work, we perform a Fourier-domain analysis and reveal that these degradations are spectrally separable: Poisson noise and photon scarcity cause high-frequency phase perturbations, while attenuation errors suppress low-frequency amplitude components. Leveraging this insight, we propose FourierPET, a Fourier-based unrolled reconstruction framework grounded in the Alternating Direction Method of Multipliers. It consists of three tailored modules: a spectral consistency module that enforces global frequency alignment to maintain data fidelity, an amplitude-phase correction module that decouples and compensates for high-frequency phase distortions and low-frequency amplitude suppression, and a dual adjustment module that accelerates convergence during iterative reconstruction. Extensive experiments demonstrate that FourierPET achieves state-of-the-art performance with significantly fewer parameters, while offering enhanced interpretability through frequency-aware correction.
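The Fourier-domain diagnosis underlying FourierPET rests on splitting an image into amplitude and phase spectra and correcting each separately. The correction modules themselves are learned in the paper; the sketch below only shows the lossless decomposition they operate on:

```python
import numpy as np

def amplitude_phase_split(img):
    """Decompose an image into its Fourier amplitude and phase."""
    spec = np.fft.fft2(img)
    return np.abs(spec), np.angle(spec)

def recombine(amplitude, phase):
    """Rebuild the image from (possibly corrected) amplitude and phase."""
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))

rng = np.random.default_rng(0)
img = rng.random((16, 16))

amp, pha = amplitude_phase_split(img)
assert np.allclose(recombine(amp, pha), img)   # lossless round trip

# The paper's diagnosis, in these terms: Poisson noise / photon scarcity
# mainly perturb high-frequency entries of `pha`, while attenuation
# errors suppress low-frequency entries of `amp` -- so each degradation
# can be compensated by a module acting on the matching component.
```

Since the round trip is exact, any edit made to `amp` or `pha` maps back to the image domain without extra loss, which is what makes the amplitude-phase correction module well-posed inside an unrolled ADMM iteration.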

[606] Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing

Xiang Li, Xueheng Li, Yu Wang, Xuanhua He, Zhangchi Hu, Weiwei Yu, Chengjun Xie

Main category: eess.IV

TL;DR: Q-Probe is a reinforcement learning framework for high-resolution image quality assessment that uses context-aware probing to address limitations of existing methods that fail to capture subtle local degradations and suffer from cropping biases.

DetailsMotivation: Existing RL-based IQA models rely on coarse-grained global views and fail to capture subtle local degradations in high-resolution scenarios. While "Thinking with Images" paradigms enable multi-scale perception, their direct adaptation to IQA induces spurious "cropping-implies-degradation" biases and misinterprets natural depth-of-field as artifacts.

Method: Proposes Q-Probe, an agentic IQA framework with three key components: 1) Vista-Bench benchmark for fine-grained local degradation analysis in high-resolution IQA, 2) A three-stage training paradigm that progressively aligns models with human preferences, and 3) A novel context-aware cropping strategy to eliminate causal bias.

Result: Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.

Conclusion: Q-Probe successfully addresses the limitations of existing RL-based IQA methods by enabling fine-grained local degradation analysis through context-aware probing, making it suitable for high-resolution image quality assessment.

Abstract: Reinforcement Learning (RL) has empowered Multimodal Large Language Models (MLLMs) to achieve superior human preference alignment in Image Quality Assessment (IQA). However, existing RL-based IQA models typically rely on coarse-grained global views, failing to capture subtle local degradations in high-resolution scenarios. While emerging “Thinking with Images” paradigms enable multi-scale visual perception via zoom-in mechanisms, their direct adaptation to IQA induces spurious “cropping-implies-degradation” biases and misinterprets natural depth-of-field as artifacts. To address these challenges, we propose Q-Probe, the first agentic IQA framework designed to scale IQA to high resolution via context-aware probing. First, we construct Vista-Bench, a pioneering benchmark tailored for fine-grained local degradation analysis in high-resolution IQA settings. Furthermore, we propose a three-stage training paradigm that progressively aligns the model with human preferences, while simultaneously eliminating causal bias through a novel context-aware cropping strategy. Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.

Last updated: 2026-05-04
Built with Hugo; theme based on Stack