Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding

Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Junbo Zhang, Jian Luan

Main category: eess.AS

TL;DR: ACAVCaps is a new large-scale, fine-grained audio captioning dataset created from ACAV100M using a multi-expert pipeline and LLM synthesis to generate rich descriptions, enabling better generalization in audio-language models.

Details

Motivation: Existing audio captioning datasets lack the scale and descriptive granularity needed to train versatile large audio-language models for general audio understanding.

Method: Created ACAVCaps from ACAV100M using a multi-expert pipeline that analyzes audio from speech, music, and acoustic perspectives, then synthesizes these analyses into detailed descriptions using a large language model.

Result: Models pre-trained on ACAVCaps show substantially stronger generalization capabilities on various downstream tasks compared to those trained on other leading captioning datasets.

Conclusion: ACAVCaps addresses the data limitations in audio captioning and enables better training of large audio-language models for general audio understanding.

Abstract: General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity required to train truly versatile models. To address this gap, we introduce ACAVCaps, a new large-scale, fine-grained, and multi-faceted audio captioning dataset. Derived from the ACAV100M collection, ACAVCaps is constructed using a multi-expert pipeline that analyzes audio from diverse perspectives-including speech, music, and acoustic properties-which are then synthesized into rich, detailed descriptions by a large language model. Experimental results demonstrate that models pre-trained on ACAVCaps exhibit substantially stronger generalization capabilities on various downstream tasks compared to those trained on other leading captioning datasets. The dataset is available at https://github.com/xiaomi-research/acavcaps.

Relevance: 9/10

[2] OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu

Main category: cs.SD

TL;DR: OmniCustom is a DiT-based framework for synchronously customizing both video identity and audio timbre using reference images and audio, while following text prompts.

Details

Motivation: Existing video customization focuses on identity consistency but lacks audio timbre control. The paper introduces sync audio-video customization to simultaneously customize video identity and audio timbre for more compelling multimodal generation.

Method: Uses DiT-based framework with separate LoRA modules for identity and audio timbre control via self-attention layers. Introduces contrastive learning alongside flow matching to enhance preservation of identity/timbre. Trained on large-scale audio-visual human dataset.

Result: Extensive experiments show OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.

Conclusion: Proposes a novel sync audio-video customization task and presents OmniCustom as an effective solution for joint audio-video generation with identity and timbre control.

Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity. Project page: https://omnicustom-project.github.io/page/.

Relevance: 9/10

[3] See, Remember, Explore: A Benchmark and Baselines for Streaming Spatial Reasoning

Yuxi Wei, Wei Huang, Qirui Chen, Lu Hou, Xiaojuan Qi

Main category: cs.CV

TL;DR: S3-Bench benchmark for streaming spatial QA with active exploration, requiring real-time reasoning from temporal video streams and active perception when views are insufficient.

Details

Motivation: Most spatial VLMs and benchmarks evaluate offline QA on pre-recorded inputs, missing deployment-critical requirements: long-horizon streaming inference and active perception when current view is insufficient for answering questions.

Method: Introduces S3-Bench benchmark suite with dual-domain design (simulator + real-world streaming videos) and AMF-VLM model with memory folding (compresses long observations) and active exploration (outputs actions to acquire missing evidence).

Result: AMF-VLM improves by 8.8% on simulated split and 13.3% on real split of S3-Eval compared to models using identical training data, while maintaining competitive transferability to standard spatial benchmarks.

Conclusion: The work addresses crucial gaps in spatial VLMs for embodied agents by introducing streaming spatial QA with active exploration, enabling real-time reasoning and perception in dynamic environments.

Abstract: Spatial understanding is fundamental for embodied agents, yet most spatial VLMs and benchmarks remain offline-evaluating post-hoc QA over pre-recorded inputs and overlooking two crucial deployment-critical requirements: long-horizon streaming inference and active perception when the current view is insufficient. To address this gap, we introduce S3-Bench, a benchmark suite for streaming spatial question answering with active exploration, where queries are temporally grounded to specific timestamps and must be answered using only observations available up to that moment. S3-Bench adopts a dual-domain design, combining a scalable simulator with controllable trajectories and exploration actions, and real-world streaming videos that capture practical sensing artifacts for rigorous generalization evaluation. Overall, it spans 10K+ scenes and 26K+ trajectories, with dedicated training (S3-Train) and evaluation (S3-Eval) splits. We further propose AMF-VLM, which supports streaming spatial reasoning under bounded computing via (i) memory folding, which compresses long-horizon observations into compact structured memory, and (ii) active exploration, which outputs explicit actions (e.g. move/rotate/scan) to acquire missing evidence before answering. Extensive experiments demonstrate that, compared to models using identical training data, our approach yields improvements of 8.8% and 13.3% on the simulated and real splits of S3-Eval, respectively, while maintaining competitive transferability to standard spatial benchmarks.

Relevance: 9/10

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 117]
cs.CV [Total: 200]
cs.AI [Total: 87]
cs.SD [Total: 10]
cs.LG [Total: 154]
cs.MA [Total: 7]
cs.MM [Total: 0]
eess.AS [Total: 9]
eess.IV [Total: 8]

cs.CL

[1] Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking

Tianpeng Zheng, Zhehan Jiang, Jiayi Liu, Shicong Feng

Main category: cs.CL

TL;DR: A computerized adaptive testing framework using item response theory to efficiently evaluate medical knowledge in LLMs, reducing assessment time and computational costs while maintaining accuracy.

Details

Motivation: Current static benchmarks for evaluating LLMs in healthcare are costly, vulnerable to data contamination, lack calibrated measurement properties, and are inefficient for repeated assessments.

Method: Two-phase design: Monte Carlo simulation to identify optimal CAT configurations, then empirical evaluation of 38 LLMs using human-calibrated medical item bank with adaptive testing that dynamically selects items based on real-time ability estimates and terminates when reaching reliability threshold.

Result: CAT-derived proficiency estimates achieved near-perfect correlation with full-bank estimates (r = 0.988) while using only 1.3% of items, reducing evaluation time from hours to minutes with substantial reductions in token usage and computational cost while preserving inter-model rankings.

Conclusion: Establishes a psychometric framework for rapid, low-cost benchmarking of medical knowledge in LLMs, serving as a standardized pre-screening and monitoring tool (not a substitute for clinical validation).

Abstract: The rapid proliferation of large language models (LLMs) in healthcare creates an urgent need for scalable and psychometrically sound evaluation methods. Conventional static benchmarks are costly to administer repeatedly, vulnerable to data contamination, and lack calibrated measurement properties for fine-grained performance tracking. We propose and validate a computerized adaptive testing (CAT) framework grounded in item response theory (IRT) for efficient assessment of standardized medical knowledge in LLMs. The study comprises a two-phase design: a Monte Carlo simulation to identify optimal CAT configurations and an empirical evaluation of 38 LLMs using a human-calibrated medical item bank. Each model completed both the full item bank and an adaptive test that dynamically selected items based on real-time ability estimates and terminated upon reaching a predefined reliability threshold (standard error <= 0.3). Results show that CAT-derived proficiency estimates achieved a near-perfect correlation with full-bank estimates (r = 0.988) while using only 1.3 percent of the items. Evaluation time was reduced from several hours to minutes per model, with substantial reductions in token usage and computational cost, while preserving inter-model performance rankings. This work establishes a psychometric framework for rapid, low-cost benchmarking of foundational medical knowledge in LLMs. The proposed adaptive methodology is intended as a standardized pre-screening and continuous monitoring tool and is not a substitute for real-world clinical validation or safety-oriented prospective studies.

[2] Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, Jiacheng Sun

Main category: cs.CL

TL;DR: DID (Deletion-Insertion Diffusion) replaces masking-based diffusion language models with deletion-insertion operations for improved efficiency and flexibility in text generation.

Details

Motivation: Masked Diffusion Language Models (MDLMs) suffer from computational inefficiency due to processing non-informative and tokens, and lack flexibility for variable-length sequences. The authors aim to create a more efficient and flexible diffusion-based language model.

Method: Proposes DID that formulates token deletion and insertion as discrete diffusion processes instead of masking. Uses a score-based approach with training objectives involving subsequence counting problems solved via parallelized dynamic programming algorithm.

Result: DID outperforms MDLMs and existing insertion-based LMs in modeling performance, sampling quality, and training/inference speed across fixed and variable-length settings without hyperparameter tuning.

Conclusion: DID provides a more efficient and flexible alternative to masked diffusion language models by eliminating computational overhead from non-informative tokens and natively supporting variable-length sequences with self-correction mechanisms.

Abstract: While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID) that rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: the computations on non-informative 1) tokens inherent to the paradigm, and 2) tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility by: 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) an intrinsic self-correction mechanism during generation due to insertion that dynamically adjusts token positions. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. The objectives involve subsequence counting problems, which we efficiently solve via a parallelized dynamic programming algorithm. Our experiments across fixed and variable-length settings demonstrate the advantage of DID over baselines of MDLMs and existing insertion-based LMs, in terms of modeling performance, sampling quality, and training/inference speed, without any hyperparameter tuning.

[3] Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen

Main category: cs.CL

TL;DR: Real-time verification system for RAG that enables full-document grounding under latency constraints, improving detection of unsupported responses compared to truncated validation

Details

Motivation: Current RAG systems struggle with verifying that generated answers faithfully reflect retrieved documents - LLMs are too slow/costly for real-time use, while lightweight classifiers miss evidence outside truncated passages

Method: Design of a production RAG pipeline component that processes documents up to 32K tokens with adaptive inference strategies to balance response time and verification coverage

Result: Full-context verification substantially improves detection of unsupported responses compared with truncated validation, with practical deployment in real-time systems

Conclusion: Provides practical guidance for building reliable large-scale RAG applications, highlighting when long-context verification is necessary and how latency budgets shape model design

Abstract: Retrieval-augmented generation (RAG) is increasingly deployed in enterprise search and document-centric assistants, where responses must be grounded in long and complex source materials. In practice, verifying that generated answers faithfully reflect retrieved documents is difficult: large language models can check long contexts but are too slow and costly for interactive services, while lightweight classifiers operate within strict context limits and frequently miss evidence outside truncated passages. We present the design of a real-time verification component integrated into a production RAG pipeline that enables full-document grounding under latency constraints. The system processes documents up to 32K tokens and employs adaptive inference strategies to balance response time and verification coverage across workloads. We describe the architectural decisions, operational trade-offs, and evaluation methodology used to deploy the verifier, and show that full-context verification substantially improves detection of unsupported responses compared with truncated validation. Our experience highlights when long-context verification is necessary, why chunk-based checking often fails in real documents, and how latency budgets shape model design. These findings provide practical guidance for practitioners building reliable large-scale retrieval-augmented applications. (Model, benchmark, and code: https://huggingface.co/llm-semantic-router)

[4] Internal Safety Collapse in Frontier Large Language Models

Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang

Main category: cs.CL

TL;DR: Frontier LLMs suffer from Internal Safety Collapse (ISC) where they generate harmful content when tasks intrinsically require it, with failure rates up to 95.3% across major models.

Details

Motivation: To identify critical safety vulnerabilities in frontier LLMs that emerge when models must generate harmful content to complete otherwise benign professional tasks, revealing that alignment efforts don't eliminate underlying unsafe capabilities.

Method: Developed TVD (Task, Validator, Data) framework to trigger ISC through domain tasks where harmful content generation is the only valid completion. Created ISC-Bench with 53 scenarios across 8 professional disciplines and evaluated on JailbreakBench.

Result: Three representative scenarios yielded worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models were more vulnerable than earlier LLMs.

Conclusion: Frontier LLMs retain inherently unsafe internal capabilities despite alignment efforts; their advanced capabilities become liabilities when tasks involve harmful content. This creates a growing attack surface as new dual-use tools emerge, requiring caution in high-stakes deployments.

Abstract: This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability–even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: https://github.com/wuyoscar/ISC-Bench

[5] Visuospatial Perspective Taking in Multimodal Language Models

Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, Lucy Cheke

Main category: cs.CL

TL;DR: MLMs show significant deficits in visuospatial perspective-taking tasks, particularly Level 2 VPT requiring inhibition of one’s own perspective to adopt another’s viewpoint.

Details

Motivation: As multimodal language models are increasingly used in social and collaborative settings, it's crucial to evaluate their perspective-taking abilities, which are currently underexplored in existing benchmarks that focus on text-based vignettes or static scene understanding.

Method: Adapted two evaluation tasks from human studies: 1) Director Task assessing VPT in referential communication paradigm, and 2) Rotating Figure Task probing perspective-taking across angular disparities.

Result: Across both tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one’s own perspective to adopt another’s viewpoint.

Conclusion: Current MLMs have critical limitations in representing and reasoning about alternative perspectives, with important implications for their use in collaborative contexts.

Abstract: As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one’s own perspective to adopt another’s. These results expose critical limitations in current MLMs’ ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.

[6] DISCO: Document Intelligence Suite for COmparative Evaluation

Kenza Benkirane, Dan Goldwater, Martin Asenov, Aneiss Ghodsi

Main category: cs.CL

TL;DR: DISCO is a document intelligence evaluation suite that separately assesses OCR pipelines and vision-language models on parsing and QA tasks across diverse document types, finding performance varies by document characteristics.

Details

Motivation: Document intelligence requires accurate text extraction and reliable reasoning, but current evaluation methods don't adequately assess how different approaches (OCR vs VLMs) perform across diverse document types with varying characteristics.

Method: Created DISCO evaluation suite to separately test OCR pipelines and vision-language models on parsing and question answering tasks across diverse document types including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents.

Result: Performance varies substantially across tasks and document characteristics: OCR pipelines better for handwriting and long/multi-page documents, VLMs better for multilingual text and visually rich layouts. Task-aware prompting has mixed effects.

Conclusion: Need complexity-aware approach selection for document processing; empirical guidance provided for selecting strategies based on document structure and reasoning demands.

Abstract: Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce \textbf{DISCO}, a \emph{Document Intelligence Suite for COmparative Evaluation}, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.

[7] S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering

Rong Fu, Yemin Wang, Tianxiang Xu, Yongtai Liu, Weizhi Tang, Wangyu Wu, Xiaowen Ma, Simon Fong

Main category: cs.CL

TL;DR: S-Path-RAG is a semantic-aware shortest-path retrieval framework for multi-hop question answering over knowledge graphs that uses hybrid path enumeration, learned path scoring, and iterative graph dialogue to improve accuracy and efficiency.

Details

Motivation: Current retrieval-augmented generation approaches for knowledge graph question answering often use one-shot, text-heavy retrieval that lacks semantic awareness and topology understanding. There's a need for more efficient, interpretable, and adaptive retrieval mechanisms that can handle multi-hop reasoning while being token-efficient.

Method: Uses hybrid weighted k-shortest, beam, and constrained random-walk strategies to enumerate bounded-length candidate paths. Learns a differentiable path scorer with contrastive path encoder and lightweight verifier. Injects path latents into LLM via cross-attention. Operates within Neural-Socratic Graph Dialogue loop where LLM diagnostic messages trigger graph edits or seed expansions for adaptive retrieval.

Result: Demonstrates consistent improvements in answer accuracy, evidence coverage, and end-to-end efficiency compared to strong graph- and LLM-based baselines on standard multi-hop KGQA benchmarks. Provides analysis of trade-offs between semantic weighting, verifier filtering, and iterative updates.

Conclusion: S-Path-RAG provides a token-efficient, topology-aware retrieval mechanism with interpretable path-level traces, offering practical deployment recommendations for constrained compute and token budgets while improving multi-hop reasoning over knowledge graphs.

Abstract: We present S-Path-RAG, a semantic-aware shortest-path Retrieval-Augmented Generation framework designed to improve multi-hop question answering over large knowledge graphs. S-Path-RAG departs from one-shot, text-heavy retrieval by enumerating bounded-length, semantically weighted candidate paths using a hybrid weighted $k$-shortest, beam, and constrained random-walk strategy, learning a differentiable path scorer together with a contrastive path encoder and lightweight verifier, and injecting a compact soft mixture of selected path latents into a language model via cross-attention. The system runs inside an iterative Neural-Socratic Graph Dialogue loop in which concise diagnostic messages produced by the language model are mapped to targeted graph edits or seed expansions, enabling adaptive retrieval when the model expresses uncertainty. This combination yields a retrieval mechanism that is both token-efficient and topology-aware while preserving interpretable path-level traces for diagnostics and intervention. We validate S-Path-RAG on standard multi-hop KGQA benchmarks and through ablations and diagnostic analyses. The results demonstrate consistent improvements in answer accuracy, evidence coverage, and end-to-end efficiency compared to strong graph- and LLM-based baselines. We further analyze trade-offs between semantic weighting, verifier filtering, and iterative updates, and report practical recommendations for deployment under constrained compute and token budgets.

[8] Berta: an open-source, modular tool for AI-enabled clinical documentation

Samridhi Vaid, Mike Weldon, Jesse Dunn, Sacha Davis, Kevin Lonergan, Henry Li, Jeffrey Franc, Mohamed Abdalla, Daniel C. Baumgart, Jake Hayward, J Ross Mitchell

Main category: cs.CL

TL;DR: Open-source AI scribe platform Berta deployed at provincial scale in healthcare system, achieving 70-95% cost reduction while maintaining data sovereignty.

Details

Motivation: Commercial AI scribes are expensive, opaque, and don't return data to institutional infrastructure, limiting organizational control over data governance, quality improvement, and clinical workflows.

Method: Developed Berta as open-source modular scribe platform combining automatic speech recognition with large language models, deployed within Alberta Health Services integrated with their Snowflake AI Data Cloud infrastructure while retaining all clinical data within secure environment.

Result: 198 emergency physicians used system over 8 months across 105 facilities, generating 22,148 clinical sessions and 2800+ hours of audio; usage grew from 680 to 5530 monthly sessions; operating costs averaged <$30 per physician per month (70-95% reduction vs commercial); approved for expansion to 850 physicians.

Conclusion: First provincial-scale deployment of AI scribe integrated with existing health system infrastructure; open-source release provides reproducible, cost-effective alternative supporting data sovereignty and informed evaluation of AI documentation technology.

Abstract: Commercial AI scribes cost $99-600 per physician per month, operate as opaque systems, and do not return data to institutional infrastructure, limiting organizational control over data governance, quality improvement, and clinical workflows. We developed Berta, an open-source modular scribe platform for AI-enabled clinical documentation, and deployed a customized implementation within Alberta Health Services (AHS) integrated with their existing Snowflake AI Data Cloud infrastructure. The system combines automatic speech recognition with large language models while retaining all clinical data within the secure AHS environment. During eight months (November 2024 to July 2025), 198 emergency physicians used the system in 105 urban and rural facilities, generating 22148 clinical sessions and more than 2800 hours of audio. The use grew from 680 to 5530 monthly sessions. Operating costs averaged less than $30 per physician per month, a 70-95% reduction compared to commercial alternatives. AHS has since approved expansion to 850 physicians. This is the first provincial-scale deployment of an AI scribe integrated with existing health system infrastructure. By releasing Berta as open source, we provide a reproducible, cost-effective alternative that health systems can adapt to their own secure environments, supporting data sovereignty and informed evaluation of AI documentation technology.

[9] Plato’s Cave: A Human-Centered Research Verification System

Matheus Kunzler Maldaner, Raul Valle, Junsung Kim, Tonuka Sultan, Pranav Bhargava, Matthew Maloni, John Courtney, Hoang Nguyen, Aamogh Sawant, Kristian O’Connor, Stephen Wormald, Damon L. Woodard

Main category: cs.CL

TL;DR: Plato’s Cave is an open-source research verification system that analyzes papers using directed acyclic graphs, web agents for credibility scoring, and argumentative structure evaluation.

Details

Motivation: The rapid growth in research publications creates a need for better fact-checking, writing quality assessment, and identification of unverifiable claims in academic papers.

Method: The system creates a directed acyclic graph (DAG) from documents, uses web agents to assign credibility scores to nodes and edges, and provides final scores by interpreting the paper’s argumentative structure.

Result: Implemented system and reported results on a dataset of 104 research papers, demonstrating the approach’s feasibility for research verification.

Conclusion: Plato’s Cave provides a human-centered approach to research verification that can help address the challenges of information overload and credibility assessment in academic publishing.

Abstract: The growing publication rate of research papers has created an urgent need for better ways to fact-check information, assess writing quality, and identify unverifiable claims. We present Plato’s Cave as an open-source, human-centered research verification system that (i) creates a directed acyclic graph (DAG) from a document, (ii) leverages web agents to assign credibility scores to nodes and edges from the DAG, and (iii) gives a final score by interpreting and evaluating the paper’s argumentative structure. We report the system implementation and results on a collected dataset of 104 research papers.

[10] DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

Alexander Sheppert

Main category: cs.CL

TL;DR: DepthCharge: A domain-agnostic framework for measuring LLM knowledge depth through adaptive probing, on-demand fact verification, and survival statistics across arbitrary domains.

Details

Motivation: LLMs appear competent on general questions but fail on domain-specific details; no existing methodology provides out-of-the-box solution for measuring how deeply LLMs can sustain accurate responses under adaptive follow-up questioning across arbitrary domains.

Method: Three innovations: 1) Adaptive probing that generates follow-up questions based on concepts the model actually mentions, 2) On-demand fact verification from authoritative sources, 3) Survival statistics with constant sample sizes at every depth level. Framework can be deployed on any knowledge domain with publicly verifiable facts without requiring pre-constructed test sets.

Result: Empirical validation across four diverse domains (Medicine, Constitutional Law, Ancient Rome, Quantum Computing) with five frontier models reveals depth-dependent performance variation hidden by standard benchmarks. Expected Valid Depth (EVD) ranges from 3.45 to 7.55 across model-domain combinations. Model rankings vary substantially by domain with no single model dominating all areas. Cost-performance analysis shows expensive models don’t always achieve deeper knowledge.

Conclusion: DepthCharge provides comparative evaluation tool for measuring LLM knowledge depth across domains, revealing that domain-specific evaluation is more informative than aggregate benchmarks for model selection in professional applications.

Abstract: Large Language Models appear competent when answering general questions but often fail when pushed into domain-specific details. No existing methodology provides an out-of-the-box solution for measuring how deeply LLMs can sustain accurate responses under adaptive follow-up questioning across arbitrary domains. We present DepthCharge, a domain-agnostic framework that measures knowledge depth through three innovations: adaptive probing that generates follow-up questions based on concepts the model actually mentions, on-demand fact verification from authoritative sources, and survival statistics with constant sample sizes at every depth level. The framework can be deployed on any knowledge domain with publicly verifiable facts, without requiring pre-constructed test sets or domain-specific expertise. DepthCharge results are relative to the evaluator model used for answer checking, making the framework a tool for comparative evaluation rather than absolute accuracy certification. Empirical validation across four diverse domains (Medicine, Constitutional Law, Ancient Rome, and Quantum Computing) with five frontier models demonstrates that DepthCharge reveals depth-dependent performance variation hidden by standard benchmarks. Expected Valid Depth (EVD) ranges from 3.45 to 7.55 across model-domain combinations, and model rankings vary substantially by domain, with no single model dominating all areas. Cost-performance analysis further reveals that expensive models do not always achieve deeper knowledge, suggesting that domain-specific evaluation is more informative than aggregate benchmarks for model selection in professional applications.

[11] Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data

John Cook, Michael Wyatt, Peng Wei, Iris Chin, Santosh Gupta, Van Zyl Van Vuuren, Richie Siburian, Amanda Spicer, Kristen Viviano, Alda Cami, Raunaq Malhotra, Zhewei Yao, Jeff Rasley, Gaurav Kaushik

Main category: cs.CL

TL;DR: Fine-tuning Llama 3-70B on synthetic EHR-derived data achieves expert-level medical coding performance (F1 >0.70) from clinical notes, significantly improving over zero-shot baselines while preserving privacy.

Details

Motivation: Automating medical coding (ICD-10-CM and CPT) from clinical documentation is challenging due to heterogeneous records, nuanced guidelines, and long-tail distributions. Foundation models perform poorly at zero-shot coding, and there's a need for privacy-preserving approaches that don't expose protected health information.

Method: Fine-tuned Llama 3-70B on synthetic training data derived from electronic health records using EHR-grounded templates and coding policies. Created privacy-preserving synthetic pairs of clinical notes and gold codes, then evaluated exact-code prediction for both ICD-10-CM and CPT coding systems.

Result: Zero-shot baseline achieved F1 score of 0.18 for exact code match. After fine-tuning on synthetic corpus, exact-match F1 exceeded 0.70, representing large absolute gains. Performance remained high on complex categories requiring multi-step clinical reasoning (Advanced Illness, Frailty classes), and the model retained medical comprehension capabilities.

Conclusion: Synthetic, policy-aware data can efficiently teach general-purpose LLMs to support precise medical coding without exposing protected health information. This offers a practical path for training coding agents safely and iteratively on specific tasks representing real-world populations.

Abstract: Improving the accuracy and reliability of medical coding reduces clinician burnout and supports revenue cycle processes, freeing providers to focus more on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains a challenge due to heterogeneous records, nuanced coding guidelines, and long-tail distributions. Large language models have been proposed to help or automate specific medical coding tasks. However, foundation models are not explicitly trained for medical coding and zero-shot coding has yielded poor results. We investigate whether a modern open-weight foundation model can be adapted for an expert-level medical coding task using privacy-preserving synthetic training data derived from electronic health records. We fine-tune Llama 3-70B on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies, then evaluate exact-code prediction for ICD-10-CM and CPT. A zero-shot baseline with the unadapted model achieved an F1 score of 0.18 for exact code match. After fine-tuning on the synthetic corpus, exact-match F1 exceeded 0.70, representing a large absolute gain across both code systems. Notably, performance remained high on complex categories that often require multi-step clinical reasoning and code composition, including Advanced Illness and Frailty classes, and the model retained its performance on medical comprehension tasks. These results indicate that synthetic, policy-aware data can efficiently teach a general-purpose large language model to support precise medical coding without exposing protected health information. The approach offers a practical path for training coding agents safely and iteratively on specific tasks that represent real-world populations.

[12] A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English

Dana Serditova, Kevin Tang

Main category: cs.CL

TL;DR: ASR systems show systematic bias against regional dialects like Newcastle English, with phonological variation causing most errors and social factors (gender, age) affecting error rates.

Details

Motivation: ASR systems perform unevenly across speakers, particularly with dialectal variations not represented in training data, raising concerns about equity and accessibility in speech technology.

Method: Used spontaneous speech from Diachronic Electronic Corpus of Tyneside English (DECTE), evaluated commercial ASR system output, analyzed 3,000+ transcription errors by linguistic domain and social variables (gender, age, socioeconomic status), and conducted acoustic case study of vowel features.

Result: Phonological variation accounts for majority of errors, with dialect-specific features (vowel quality, glottalisation), local vocabulary, and non-standard grammatical forms causing recurrent failures. Error rates vary by social groups - higher for men and speakers at age extremes.

Conclusion: ASR errors are socially patterned, not random. Sociolinguistic expertise should be incorporated into speech technology evaluation/development. More equitable ASR requires explicit attention to dialectal variation and community-based speech data.

Abstract: Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven across speakers, particularly when dialectal variation diverges from the mainstream accents represented in training data. This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), we evaluate the output of a state-of-the-art commercial ASR system and conduct a fine-grained analysis of more than 3,000 transcription errors. Errors are classified by linguistic domain and examined in relation to social variables including gender, age, and socioeconomic status. In addition, an acoustic case study of selected vowel features demonstrates how gradient phonetic variation contributes directly to misrecognition. The results show that phonological variation accounts for the majority of errors, with recurrent failures linked to dialect-specific features like vowel quality and glottalisation, as well as local vocabulary and non-standard grammatical forms. Error rates also vary across social groups, with higher error frequencies observed for men and for speakers at the extremes of the age spectrum. These findings indicate that ASR errors are not random but socially patterned and can be explained from a sociolinguistic perspective. Thus, the study demonstrates the importance of incorporating sociolinguistic expertise into the evaluation and development of speech technologies and argues that more equitable ASR systems require explicit attention to dialectal variation and community-based speech data.

[13] MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen

Main category: cs.CL

TL;DR: MSA is a scalable memory model framework with linear complexity that enables 100M-token context while maintaining performance stability, outperforming existing methods in long-context tasks.

Details

Motivation: Current LLMs are limited to ~1M tokens due to full-attention constraints. Existing approaches (hybrid linear attention, RNNs, RAG, agents) suffer from precision degradation, latency issues, inability to dynamically modify memory, or lack of end-to-end optimization, limiting applications like large-corpus summarization and long-history reasoning.

Method: Memory Sparse Attention (MSA) uses scalable sparse attention and document-wise RoPE to achieve linear complexity. Includes KV cache compression with Memory Parallel for efficient 100M-token inference on 2xA800 GPUs, and Memory Interleaving for multi-hop reasoning across scattered memory segments.

Result: MSA shows less than 9% degradation when scaling from 16K to 100M tokens, significantly outperforms frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks, enabling 100M-token inference on limited hardware.

Conclusion: MSA provides a scalable foundation for lifetime-scale memory in general-purpose models by decoupling memory capacity from reasoning, addressing key bottlenecks in long-context processing.

Abstract: Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.

[14] The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou

Main category: cs.CL

TL;DR: Study reveals that listed API prices for reasoning language models often don’t reflect actual inference costs, with 21.8% of model comparisons showing pricing reversals where cheaper-listed models actually cost more due to thinking token consumption variability.

Details

Motivation: Developers and consumers choose reasoning language models based on listed API prices, but it's unclear how accurately these prices reflect actual inference costs, potentially leading to suboptimal model selection and unexpected expenses.

Method: Systematic evaluation of 8 frontier reasoning language models across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning, analyzing actual inference costs versus listed prices.

Result: Found pricing reversal phenomenon in 21.8% of model-pair comparisons, with reversal magnitude up to 28x; identified thinking token consumption heterogeneity as root cause (up to 900% variation); removing thinking token costs reduces reversals by 70%; per-query cost prediction is fundamentally difficult due to up to 9.7x thinking token variation.

Conclusion: Listed API pricing is an unreliable proxy for actual inference costs, calling for cost-aware model selection and transparent per-request cost monitoring rather than relying solely on listed prices.

Abstract: Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash’s listed price is 78% cheaper than GPT-5.2’s, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall’s $τ$ ) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.

[15] Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents

Peijun Qing, Puneet Mathur, Nedim Lipka, Varun Manjunatha, Ryan Rossi, Franck Dernoncourt, Saeed Hassanpour, Soroush Vosoughi

Main category: cs.CL

TL;DR: Instruction-following clustering reframed as generative task using large reasoning models as autonomous clustering agents, outperforming embedding-based methods.

Details

Motivation: General embedding models capture semantic similarities but not user-specified characteristics, while instruction-tuned embedders can't infer latent corpus structures like optimal cluster numbers. Need approach that combines instruction-following with autonomous structure discovery.

Method: Reframe instruction-following clustering as generative task, train large reasoning models as autonomous clustering agents using reasoning-driven training pipeline. Models interpret high-level clustering instructions and infer corresponding latent groupings.

Result: Approach consistently outperforms strong embedding-based methods and LRM baselines across diverse datasets and clustering scenarios in ReasonCluster benchmark (28 tasks spanning daily dialogue, legal cases, financial reports).

Conclusion: Explicit reasoning fosters more faithful and interpretable instruction-based clustering, demonstrating superiority of reasoning-driven approach over traditional embedding methods for instruction-following clustering tasks.

Abstract: General-purpose embedding models excel at recognizing semantic similarities but fail to capture the characteristics of texts specified by user instructions. In contrast, instruction-tuned embedders can align embeddings with textual instructions yet cannot autonomously infer latent corpus structures, such as determining the optimal number of clusters. To address both limitations, we reframe instruction-following clustering as a generative task and train large reasoning models (LRMs) as autonomous clustering agents. Our reasoning-driven training pipeline enables LRMs to interpret high-level clustering instructions and then infer the corresponding latent groupings. To evaluate this paradigm, we introduce ReasonCluster, a comprehensive benchmark comprising 28 diverse tasks spanning daily dialogue, legal cases, and financial reports. Experiments across diverse datasets and clustering scenarios show that our approach consistently outperforms strong embedding-based methods and LRM baselines, demonstrating that explicit reasoning fosters more faithful and interpretable instruction-based clustering.

[16] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

Lin Yang, Yuancheng Yang, Xu Wang, Changkun Liu, Haihua Yang

Main category: cs.CL

TL;DR: MedMT-Bench: A challenging medical multi-turn instruction following benchmark that simulates complete diagnosis/treatment processes to stress-test LLMs on long-context memory, interference robustness, and safety in medical applications.

Details

Motivation: Existing medical benchmarks don't adequately test LLMs on critical practical requirements like long-context memory, interference robustness, and safety defenses needed for real-world medical applications.

Method: Created 400 test cases via scene-by-scene data synthesis with expert editing, averaging 22 rounds per case (max 52). Uses LLM-as-judge evaluation with instance-level rubrics and atomic test points validated against expert annotations.

Result: Tested 17 frontier models, all underperformed with overall accuracy below 60% (best model: 59.75%). Human-LLM agreement reached 91.94% for evaluation protocol.

Conclusion: MedMT-Bench fills a critical gap in medical AI evaluation and can drive research toward safer, more reliable medical LLMs by testing practical requirements missing in current benchmarks.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, as existing medical-related benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case has an average of 22 rounds (maximum of 52 rounds), covering 5 types of difficult instruction following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available in https://openreview.net/attachment?id=aKyBCsPOHB&name=supplementary_material

[17] From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians’ Medical Expertise with Lightweight LLM

Chanyong Luo, Jirui Dai, Zhendong Wang, Kui Chen, Jiaxi Yang, Bingjie Lu, Jing Wang, Jiaxin Hao, Bing Li, Ruiyang He, Yiyu Qiao, Chenkai Zhang, Kaiyu Wang, Zhi Liu, Zeyu Zheng, Yan Li, Xiaohong Gu

Main category: cs.CL

TL;DR: Med-Shicheng enables LLMs to learn and transfer master physicians’ diagnostic philosophies and adaptation rules through a 5-stage framework, achieving performance comparable to large models while running on resource-constrained hardware.

Details

Motivation: Master physicians develop individualized diagnostic methodologies through years of practice, but this expertise is slow to develop and hard to transmit at scale, creating scarcity of high-quality clinical expertise. The paper aims to systematically capture and transfer this knowledge using LLMs.

Method: A 5-stage framework built on Tianyi that targets 5 master TCM physicians, curates multi-source materials, and trains a single model to internalize all five knowledge systems across seven clinical tasks including diagnosis, treatment selection, prescription generation, and clinical advice.

Result: Implemented on Qwen2.5-1.5B-Base, Med-Shicheng runs on resource-constrained GPUs while achieving performance comparable to DeepSeek-R1 and GPT-5. The study also reveals that LLM-as-a-judge tracks overall trends but shows bias on fine-grained individualized distinctions.

Conclusion: The framework successfully enables systematic learning and transfer of physicians’ expertise, but highlights the need for physician involvement in evaluation when ground truth is unavailable and for domain-adapted judge models.

Abstract: Medicine is an empirical discipline refined through long-term observation and the messy, high-variance reality of clinical practice. Physicians build diagnostic and therapeutic competence through repeated cycles of application, reflection, and improvement, forming individualized methodologies. Yet outcomes vary widely, and master physicians’ knowledge systems are slow to develop and hard to transmit at scale, contributing to the scarcity of high-quality clinical expertise. To address this, we propose Med-Shicheng, a general framework that enables large language models to systematically learn and transfer distinguished physicians’ diagnostic-and-therapeutic philosophy and case-dependent adaptation rules in a standardized way. Built on Tianyi, Med-Shicheng consists of five stages. We target five National Masters of Chinese Medicine or distinguished TCM physicians, curate multi-source materials, and train a single model to internalize all five knowledge systems across seven tasks, including etiology-pathogenesis analysis, syndrome diagnosis, treatment principle selection, prescription generation, prescription explanation, symptom evolution with regimen adjustment, and clinical advice. Implemented on Qwen2.5-1.5B-Base, Med-Shicheng runs on resource-constrained GPUs while achieving performance comparable to DeepSeek-R1 and GPT-5. We also examine the reliability of LLM-as-a-judge versus physician evaluation: automated judging tracks overall trends but shows bias on fine-grained individualized distinctions, highlighting the need for physician involvement when ground truth is unavailable and for domain-adapted judge models.

[18] Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages

Shaharukh Khan, Ali Faraz, Abhinav Ravi, Mohd Nauman, Mohd Sarfraz, Akshat Patidar, Raja Kolla, Chandra Khatri, Shubham Agarwal

Main category: cs.CL

TL;DR: Introduces Chitrakshara dataset series for multimodal research covering 11 Indian languages, addressing the gap in non-English vision-language datasets through large-scale interleaved image-text data collection.

Details

Motivation: Most Vision-Language Models are trained primarily on English datasets, leading to inadequate representation of Indian languages in multimodal research. There's a need for culturally inclusive datasets to develop more representative VLMs.

Method: Created Chitrakshara dataset series from Common Crawl covering 11 Indian languages: (1) Chitrakshara-IL - large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents; (2) Chitrakshara-Cap - 44M image-text pairs with 733M tokens. Detailed data collection pipeline including curation, filtering, and processing methodologies.

Result: Produced comprehensive multimodal datasets for Indian languages with quality and diversity analysis to assess representativeness across Indic languages. The datasets enable development of more culturally inclusive Vision-Language Models.

Conclusion: Chitrakshara addresses the gap in non-English multimodal datasets and provides resources for developing more representative VLMs for Indian languages, promoting cultural inclusivity in multimodal research.

Abstract: Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset’s representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.

[19] Qworld: Question-Specific Evaluation Criteria for LLMs

Shanghua Gao, Yuchang Su, Pengwei Sui, Curtis Ginder, Marinka Zitnik

Main category: cs.CL

TL;DR: Qworld is a method for generating question-specific evaluation criteria for LLMs using recursive expansion trees, enabling more nuanced context-dependent assessment than fixed rubrics.

Details

Motivation: Current LLM evaluation methods use binary scores and static rubrics that fail to capture context-dependent quality requirements. Existing approaches define criteria at dataset level or generate them in single passes, limiting exploration of question-specific evaluation dimensions.

Method: Qworld generates question-specific evaluation criteria through recursive expansion trees. Given a question, it decomposes it into scenarios, perspectives, and fine-grained binary criteria using structured hierarchical and horizontal expansion to specify what high-quality answers must address.

Result: On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than prior methods. Applied to 11 frontier LLMs, Qworld reveals capability differences in dimensions like long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics miss.

Conclusion: By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables adaptive evaluation that responds to each question’s specific requirements rather than relying on fixed task-level criteria.

Abstract: Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question’s context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity’s Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.

[20] Do 3D Large Language Models Really Understand 3D Spatial Relationships?

Xianzheng Ma, Tao Sun, Shuai Chen, Yash Bhalgat, Jindong Gu, Angel X Chang, Iro Armeni, Iro Laina, Songyou Peng, Victor Adrian Prisacariu

Main category: cs.CL

TL;DR: The paper reveals that current 3D-LLMs may exploit textual shortcuts rather than genuine 3D reasoning, introduces a more rigorous benchmark (Real-3DQA), and proposes a 3D-reweighted training objective to improve spatial reasoning.

Details

Motivation: Existing 3D-LLMs claim to understand 3D worlds but may be exploiting textual shortcuts in benchmarks rather than engaging in genuine 3D-aware reasoning. The SQA3D benchmark may not effectively detect this issue, requiring a more rigorous evaluation framework.

Method: 1) Show that fine-tuning language models on text-only QA pairs can match or surpass 3D-LLMs on SQA3D, 2) Introduce Real-3DQA benchmark with filtered easy questions and structured taxonomy, 3) Propose 3D-reweighted training objective to guide models to rely more on 3D visual clues.

Result: Experiments confirm existing 3D-LLMs struggle with spatial relationships when simple cues are removed. The proposed 3D-reweighted training objective substantially enhances 3D-LLM performance in spatial reasoning tasks on the new benchmark.

Conclusion: The findings highlight the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding, moving beyond textual shortcut exploitation to true 3D spatial reasoning.

Abstract: Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can perform comparably or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect if the model exploits textual shortcuts rather than engages in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that guides model to rely more on 3D visual clues, substantially enhancing 3D-LLMs performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding. Project page: https://real-3dqa.github.io/.

[21] Navigating the Concept Space of Language Models

Wilson E. Marcílio-Jr, Danilo M. Eler

Main category: cs.CL

TL;DR: Concept Explorer: A scalable interactive system for exploring sparse autoencoder features from LLMs using hierarchical neighborhood embeddings

Details

Motivation: Current methods for analyzing SAE features rely on manual inspection or semantic search, making exploratory discovery of concepts difficult at scale. There's a need for better tools to organize and navigate the thousands of features extracted from LLM activations.

Method: Constructs a multi-resolution manifold over SAE feature embeddings using hierarchical neighborhood embeddings. Enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts.

Result: Demonstrated on SAE features from SmolLM2, revealing coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.

Conclusion: Concept Explorer provides a scalable interactive system for post-hoc exploration of SAE features, enabling better discovery and analysis of interpretable concepts in LLMs.

Abstract: Sparse autoencoders (SAEs) trained on large language model activations output thousands of features that enable mapping to human-interpretable concepts. The current practice for analyzing these features primarily relies on inspecting top-activating examples, manually browsing individual features, or performing semantic search on interested concepts, which makes exploratory discovery of concepts difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations using hierarchical neighborhood embeddings. Our approach constructs a multi-resolution manifold over SAE feature embeddings and enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts. We demonstrate the utility of Concept Explorer on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.

[22] Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao

Main category: cs.CL

TL;DR: Team-of-Thoughts: A heterogeneous multi-agent system framework that leverages diverse model architectures as specialized tools through orchestrator calibration and agent self-assessment to improve reasoning and code generation performance.

Details

Motivation: Existing multi-agent systems typically use homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures. There's a need to better utilize specialized capabilities of various models through coordinated heterogeneous systems.

Method: Proposes a heterogeneous MAS framework with two key components: (1) Orchestrator Calibration identifies models with superior coordination and synthesis capabilities, and (2) Agent Self-Assessment allows tool agents to profile their own domain-specific strengths. The orchestrator dynamically activates the most compatible agents based on these profiles during inference.

Result: Outperforms individual models and existing MAS baselines across five mathematical reasoning and code generation benchmarks. Achieves 96.00% accuracy on AIME24 and 77.91% on LiveCodeBench, significantly improving over homogeneous role-play baselines (80.00% and 65.93%).

Conclusion: Team-of-Thoughts demonstrates that heterogeneous multi-agent systems with specialized model coordination can significantly outperform homogeneous approaches, effectively leveraging diverse model expertise through systematic orchestration and self-assessment mechanisms.

Abstract: Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures. We propose Team-of-Thoughts, a heterogeneous MAS framework that treats diverse models as specialized tools within an orchestrator-driven paradigm. Team-of-Thoughts introduces two novel components: (1) Orchestrator Calibration, which identifies models with superior coordination and synthesis capabilities, and (2) Agent Self-Assessment, a protocol where tool agents profile their own domain-specific strengths to guide selection. At inference, the orchestrator dynamically activates the most compatible agents based on these profiles to maximize capability coverage. Across five mathematical reasoning and code generation benchmarks, Team-of-Thoughts consistently outperforms individual models and existing MAS baselines. Notably, on AIME24 and LiveCodeBench, Team-of-Thoughts achieves 96.00% and 77.91% accuracy, respectively, significantly improving over homogeneous role-play baselines (80.00% and 65.93%).

[23] Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial

Warren Johnson, Charles Lee

Main category: cs.CL

TL;DR: Prompt compression reduces input tokens but can increase output length, affecting total cost; moderate compression (50% retention) saves 27.9% total cost while aggressive compression (20% retention) increases cost by 1.8% despite input reduction.

Details

Motivation: Current prompt compression focuses on reducing input tokens but ignores how compression affects output length, which is priced higher. Need to understand the full economic impact of compression on total inference costs.

Method: Pre-registered six-arm randomized controlled trial with 358 Claude Sonnet 4.5 runs from 1,199 real orchestration instructions. Compared uncompressed control with three uniform retention rates (80%, 50%, 20%) and two structure-aware strategies (entropy-adaptive, recency-weighted). Measured total inference cost (input+output) and embedding-based response similarity.

Result: Moderate compression (r=0.5) reduced mean total cost by 27.9%. Aggressive compression (r=0.2) increased mean cost by 1.8% despite input reduction, with small output expansion (1.03x vs control). Recency-weighted compression achieved 23.5% savings. Moderate compression and recency-weighted compression formed the cost-similarity Pareto frontier.

Conclusion: “Compress more” is not a reliable production heuristic; output tokens must be considered as a first-class outcome when designing compression policies. Moderate compression and structure-aware strategies offer optimal cost-similarity trade-offs.

Abstract: The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre-registered six-arm randomized controlled trial of prompt compression on production multi-agent task-orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm) drawn from a randomized corpus of 1,199 real orchestration instructions. We compare an uncompressed control with three uniform retention rates (r=0.8, 0.5, 0.2) and two structure-aware strategies (entropy-adaptive and recency-weighted), measuring total inference cost (input+output) and embedding-based response similarity. Moderate compression (r=0.5) reduced mean total cost by 27.9%, while aggressive compression (r=0.2) increased mean cost by 1.8% despite substantial input reduction, consistent with small mean output expansion (1.03x vs. control) and heavy-tailed uncertainty. Recency-weighted compression achieved 23.5% savings and, together with moderate compression, occupied the empirical cost-similarity Pareto frontier, whereas aggressive compression was dominated on both cost and similarity. These results show that “compress more” is not a reliable production heuristic and that output tokens must be treated as a first-class outcome when designing compression policies.

[24] Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression

Warren Johnson

Main category: cs.CL

TL;DR: Study shows prompt compression’s real impact depends on output length changes, not just input reduction, with benchmark structure being key moderator of output expansion effects.

Details

Motivation: Current prompt compression evaluation focuses on input-token reduction, but real deployment impact depends on how compression affects output length and total inference cost. There are conflicting prior observations about compression effects that need explanation.

Method: Controlled replication and extension study across 5,400 API calls, three benchmarks, and multiple providers. Formalized instruction survival probability (Psi) metric to capture task-critical prompt segments after truncation. Introduced Compression Robustness Index (CRI) for cross-benchmark evaluation. Included direct NVML measurements from rented RunPod GPUs for energy analysis.

Result: Strong benchmark effect observed: DeepSeek shows severe output expansion on MBPP (56x) but lower expansion on HumanEval (5x), while GPT-4o-mini is more stable. Prompt structure, not provider identity alone, moderates compression effects. Single-benchmark assessments can be misleading. Token savings can overstate actual energy savings.

Conclusion: Benchmark-diverse testing and structure-aware compression policies are needed for reliable, energy-conscious LLM deployment. Compression evaluation must consider output dynamics and energy impacts beyond simple token reduction metrics.

Abstract: Prompt compression is often evaluated by input-token reduction, but its real deployment impact depends on how compression changes output length and total inference cost. We present a controlled replication and extension study of benchmark-dependent output dynamics under aggressive compression, covering 5,400 API calls across three benchmarks and multiple providers. To explain conflicting prior observations, we formalize instruction survival probability (Psi), a structural metric that captures whether task-critical prompt segments remain after truncation. Results show a strong benchmark effect: under r=0.3, DeepSeek exhibits severe output expansion on MBPP (56x, Psi approx 0.15) but substantially lower expansion on HumanEval (5x, Psi approx 0.72), while GPT-4o-mini is comparatively stable across benchmarks. This reconciles the apparent discrepancy between previously reported extreme explosion and lower replication effects by identifying prompt structure, not provider identity alone, as the primary moderator. We introduce the Compression Robustness Index (CRI) for cross-benchmark evaluation and show that single-benchmark assessments can produce misleading conclusions about compression safety and efficiency. To contextualize energy claims, we incorporate companion direct NVML measurements from rented RunPod GPUs and show that token savings can overstate joule savings. These findings motivate benchmark-diverse testing and structure-aware compression policies for reliable, energy-conscious LLM deployment.

[25] The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression

Warren Johnson

Main category: cs.CL

TL;DR: Prompt compression reduces input tokens but causes severe quality degradation and inconsistent energy effects across LLM providers, with some models showing energy increases due to output expansion.

Details

Motivation: LLMs are becoming significant contributors to carbon emissions, creating an environmental paradox. The paper investigates whether prompt compression can improve inference energy efficiency as a potential solution.

Method: Conducted 28,421 API trials across three providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four compression ratios (1.0, 0.7, 0.5, 0.3). Used token-based energy proxy calibrated against local measurements and tracked quality with benchmark pass rates.

Result: Compression caused substantial quality loss (26.0% pass rate at baseline vs. 1.5% at r=0.7) and strongly provider-dependent energy effects. DeepSeek showed output expansion (21 to 798 tokens at r=0.3) with energy increases up to +2,140%, while GPT-4o-mini had mixed effects including some reduction at r=0.5.

Conclusion: Input-token reduction alone is not a reliable energy optimization strategy for production inference. Model selection and output-length control provide more consistent energy-quality tradeoffs than prompt compression.

Abstract: The rapid proliferation of Large Language Models has created an environmental paradox: the very technology that could help solve climate challenges is itself becoming a significant contributor to global carbon emissions. We test whether prompt compression improves inference energy efficiency in 28,421 successful API trials (28,428 planned) across three providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, and DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four compression ratios (r in {1.0, 0.7, 0.5, 0.3}). Energy is estimated with a token-based proxy calibrated against local direct measurements, and quality is tracked with benchmark pass rates. Compression produced substantial quality loss (overall pass rate 26.0% at baseline vs. 1.5% at r=0.7) and strongly provider-dependent energy behavior. DeepSeek exhibited output expansion under compression (21 to 798 tokens at r=0.3), corresponding to energy increases up to +2,140%, while GPT-4o-mini showed mixed effects including a reduction at r=0.5. These results indicate that input-token reduction alone is not a reliable energy optimization strategy in production inference. For the evaluated settings, model selection and output-length control provided more consistent energy-quality tradeoffs than prompt compression.

[26] Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

Mohamed Sobhi Jabal, Jikai Zhang, Dominic LaBella, Jessica L. Houk, Dylan Zhang, Jeffrey D. Rudie, Kirti Magudia, Maciej A. Mazurowski, Evan Calabrese

Main category: cs.CL

TL;DR: Multi-agent LLM system with CNN segmentation automates BT-RADS classification for glioma MRI, outperforming initial clinical assessments with 76% accuracy.

Details

Motivation: BT-RADS standardizes glioma MRI response assessment but requires complex integration of imaging trends, medication effects, and radiation timing, making automated classification desirable.

Method: End-to-end multi-agent LLM system combined with CNN-based tumor segmentation; extractor agent identifies clinical variables from unstructured notes, scorer agent applies BT-RADS decision logic integrating extracted variables with volumetric measurements.

Result: System achieved 76.0% accuracy vs 57.5% for initial clinical assessments (+18.5 percentage points; P<.001); high sensitivity for context-dependent categories (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%) and high positive predictive value for BT-4 detection (92.9%).

Conclusion: Multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, demonstrating potential for automated glioma MRI assessment.

Abstract: The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.

[27] Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language

Reuben Chagas Fernandes, Gaurang S. Patkar

Main category: cs.CL

TL;DR: The paper introduces Konkani-Instruct-100k, a synthetic instruction-tuning dataset for the low-resource Konkani language, and develops Konkani LLM models that outperform base models in machine translation tasks.

Details

Motivation: Large Language Models underperform in low-resource linguistic contexts like Konkani due to training data scarcity and high script diversity across Devanagari, Romi, and Kannada orthographies.

Method: Created Konkani-Instruct-100k synthetic instruction-tuning dataset using Gemini 3, evaluated leading open-weight architectures (Llama 3.1, Qwen2.5, Gemma 3) and proprietary models, and developed Konkani LLM through fine-tuning for regional nuances.

Result: Konkani LLM delivers consistent gains over base models in machine translation and is competitive with/surpasses proprietary baselines in several settings. Also developing Multi-Script Konkani Benchmark for cross-script evaluation.

Conclusion: The approach successfully addresses LLM performance gaps in low-resource languages through synthetic data generation and targeted fine-tuning, with potential applications for other under-resourced linguistic contexts.

Abstract: Large Language Models (LLMs) consistently under perform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training data scarcity compounded by high script diversity across Devanagari, Romi and Kannada orthographies. To address this gap, we introduce Konkani-Instruct-100k, a comprehensive synthetic instruction-tuning dataset generated through Gemini 3. We establish rigorous baseline benchmarks by evaluating leading open-weights architectures including Llama 3.1, Qwen2.5 and Gemma 3 alongside proprietary closed-source models. Our primary contribution involves the development of Konkani LLM, a series of fine-tuned models optimized for regional nuances. Furthermore, we are developing the Multi-Script Konkani Benchmark to facilitate cross-script linguistic evaluation. In machine translation, Konkani LLM delivers consistent gains over the corresponding base models and is competitive with and in several settings surpasses proprietary baselines

[28] Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

Avni Mittal

Main category: cs.CL

TL;DR: LLMs struggle with formatting instructions under concurrent task load, with compliance dropping 2-21% depending on constraint type; terminal constraints degrade most (up to 50% drop), but salience-enhanced formatting recovers most performance.

Details

Motivation: Large language models often fail to satisfy formatting instructions when performing demanding tasks simultaneously. The paper investigates this behavior through a cognitive psychology lens (prospective memory) to understand how formatting compliance degrades under task load.

Method: Used a controlled paradigm combining verifiable formatting constraints with benchmark tasks of increasing complexity. Tested across three model families with over 8,000 prompts. Examined different constraint types (terminal vs avoidance), implemented salience-enhanced formatting (explicit instruction framing + trailing reminder), and conducted stacking experiments with multiple constraints.

Result: Compliance drops 2-21% under concurrent task load. Terminal constraints degrade most (up to 50% drop), while avoidance constraints remain robust. Salience-enhanced formatting recovers most lost compliance (90-100% restoration). Interference is bidirectional - formatting constraints can reduce task accuracy (GSM8K accuracy dropped from 93% to 27%). Joint compliance declines sharply as constraints accumulate.

Conclusion: LLM formatting compliance is vulnerable to concurrent task demands, especially for terminal constraints. Simple interventions like salience-enhanced formatting can substantially mitigate these issues, but bidirectional interference and constraint accumulation remain challenges for practical applications.

Abstract: Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective memory inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model’s GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.

[29] Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

Özgür Togay, Florian Kunneman, Javier Garcia-Bernardo, Anastasia Giachanou

Main category: cs.CL

TL;DR: LLMs can effectively extract political opinions through Target-Stance Extraction, performing comparably to human annotators on Reddit political discourse analysis.

Details

Motivation: Political polarization analysis often oversimplifies discourse with coarse partisan labels, missing nuanced interactions between beliefs about policies, figures, and issues. Online political conversations are particularly complex, making it difficult to automatically identify discussion targets and expressed opinions.

Method: Constructed a dataset of 1,084 Reddit posts from r/NeutralPolitics covering 138 distinct political targets. Evaluated proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies for Target-Stance Extraction (TSE) - combining target identification and stance detection.

Result: Best performing LLMs achieved performance comparable to highly trained human annotators and remained robust on challenging posts with low inter-annotator agreement.

Conclusion: LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis through Target-Stance Extraction.

Abstract: Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward them. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.

[30] Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs

Satya Sri Rajiteswari Nimmagadda, Ethan Young, Niladri Sengupta, Ananya Jana, Aniruddha Maiti

Main category: cs.CL

TL;DR: A lightweight LLM is fine-tuned with structural loss to generate hierarchical JSON from scientific sentences, then reconstructs original text to test if structured representations preserve meaning.

Details

Motivation: To investigate whether structured representations (hierarchical JSON) can effectively preserve the semantic meaning of scientific sentences, which is important for information extraction, knowledge representation, and text understanding tasks in scientific domains.

Method: Fine-tune a lightweight LLM using a novel structural loss function to generate hierarchical JSON structures from scientific sentences. Then use a generative model to reconstruct original text from these JSONs. Evaluate using semantic and lexical similarity metrics between original and reconstructed sentences.

Result: Hierarchical JSON formats are shown to be capable of effectively retaining information from scientific texts, as demonstrated by high semantic and lexical similarity between original and reconstructed sentences.

Conclusion: Structured hierarchical representations can preserve the meaning of scientific sentences, suggesting potential applications in scientific text processing, knowledge extraction, and information retrieval systems.

Abstract: This paper investigates whether structured representations can preserve the meaning of scientific sentences. To test this, a lightweight LLM is fine-tuned using a novel structural loss function to generate hierarchical JSON structures from sentences collected from scientific articles. These JSONs are then used by a generative model to reconstruct the original text. Comparing the original and reconstructed sentences using semantic and lexical similarity we show that hierarchical formats are capable of retaining information of scientific texts effectively.

[31] MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

Bhavik Mangla

Main category: cs.CL

TL;DR: MDKeyChunker is a structure-aware chunking pipeline for Markdown documents that treats headers, code blocks, tables, and lists as atomic units, extracts metadata via single LLM call, and merges related chunks for better retrieval.

Details

Motivation: Traditional RAG pipelines use fixed-size chunking that ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction, leading to inefficient retrieval.

Method: Three-stage pipeline: (1) structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) single LLM call per chunk extracting title, summary, keywords, typed entities, hypothetical questions, and semantic key with rolling key propagation; (3) restructuring chunks by merging those sharing semantic keys via bin-packing.

Result: Evaluation on 30 queries over 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over full pipeline reaches Recall@5=0.867.

Conclusion: MDKeyChunker provides efficient structure-aware document chunking with single-call metadata extraction, improving retrieval performance while reducing LLM calls compared to traditional fixed-size chunking approaches.

Abstract: RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.

[32] Not All Pretraining are Created Equal: Threshold Tuning and Class Weighting for Imbalanced Polarization Tasks in Low-Resource Settings

Abass Oguntade

Main category: cs.CL

TL;DR: Transformer-based systems for polarization detection in social media text using multilingual models, achieving strong performance on binary detection but facing challenges with implicit polarization and code-switching.

Details

Motivation: To develop effective systems for detecting and classifying polarization in social media text across English and Swahili languages, addressing the SemEval-2025 Polarization Shared Task challenges.

Method: Uses multilingual and African language-specialized Transformer models (mDeBERTa-v3-base, SwahBERT, AfriBERTa-large) with class-weighted loss functions, iterative stratified data splitting, and per-label threshold tuning to handle severe class imbalance across three subtasks.

Result: Best configuration (mDeBERTa-v3-base) achieves 0.8032 macro-F1 on validation for binary polarization detection, with competitive performance on multi-label tasks (up to 0.556 macro-F1).

Conclusion: Transformer-based approaches are effective for polarization detection but face persistent challenges with implicit polarization, code-switching, and distinguishing heated political discourse from genuine polarization.

Abstract: This paper describes my submission to the Polarization Shared Task at SemEval-2025, which addresses polarization detection and classification in social media text. I develop Transformer-based systems for English and Swahili across three subtasks: binary polarization detection, multi-label target type classification, and multi-label manifestation identification. The approach leverages multilingual and African language-specialized models (mDeBERTa-v3-base, SwahBERT, AfriBERTa-large), class-weighted loss functions, iterative stratified data splitting, and per-label threshold tuning to handle severe class imbalance. The best configuration, mDeBERTa-v3-base, achieves 0.8032 macro-F1 on validation for binary detection, with competitive performance on multi-label tasks (up to 0.556 macro-F1). Error analysis reveals persistent challenges with implicit polarization, code-switching, and distinguishing heated political discourse from genuine polarization.

[33] Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths

Amani Maina-Kilaas, Roger Levy

Main category: cs.CL

TL;DR: No evidence found for real-time digging-in effects in human sentence processing; apparent effects are artifacts of wrap-up processes at sentence boundaries.

Details

Motivation: To resolve conflicting evidence about digging-in effects in sentence processing - whether they represent real-time self-organized processing or methodological artifacts from wrap-up processes.

Method: Two experiments on English NP/Z garden-path sentences using Maze and self-paced reading tasks, comparing human behavior with predictions from an ensemble of large language models.

Result: No evidence for real-time digging-in effects. Positive digging-in trends only appear sentence-finally where wrap-up effects confound interpretation. Nonfinal items show reverse trends consistent with neural model predictions.

Conclusion: Digging-in effects are not robust real-time phenomena in human sentence processing but artifacts of wrap-up processes at sentence boundaries.

Abstract: Digging-in effects, where disambiguation difficulty increases with longer ambiguous regions, have been cited as evidence for self-organized sentence processing, in which structural commitments strengthen over time. In contrast, surprisal theory predicts no such effect unless lengthening genuinely shifts statistical expectations, and neural language models appear to show the opposite pattern. Whether digging-in is a robust real-time phenomenon in human sentence processing – or an artifact of wrap-up processes or methodological confounds – remains unclear. We report two experiments on English NP/Z garden-path sentences using Maze and self-paced reading, comparing human behavior with predictions from an ensemble of large language models. We find no evidence for real-time digging-in effects. Critically, items with sentence-final versus nonfinal disambiguation show qualitatively different patterns: positive digging-in trends appear only sentence-finally, where wrap-up effects confound interpretation. Nonfinal items – the cleaner test of real-time processing – show reverse trends consistent with neural model predictions.

[34] Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

Fatih Uenal

Main category: cs.CL

TL;DR: Swiss-Bench SBP-002 is a trilingual benchmark for evaluating frontier LLMs on Swiss regulatory compliance tasks, showing poor performance (top model: 38.2% correct) with open-weight models competing with closed-source ones.

Details

Motivation: Existing benchmarks focus on Swiss legal translation and academic legal reasoning, but lack evaluation of frontier models on applied Swiss regulatory compliance tasks, which is crucial for practical legal applications.

Method: Created Swiss-Bench SBP-002 with 395 expert-crafted items across three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian). Evaluated ten frontier models using structured three-dimension scoring with blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and weighted kappa = 0.605.

Result: Models clustered into three tiers: Tier A (35-38% correct), Tier B (26-29%), Tier C (13-21%). Top model Qwen 3.5 Plus achieved only 38.2% correct. Task difficulty varied widely: legal translation and case analysis had 69-72% correct, while regulatory Q&A, hallucination detection, and gap analysis remained below 9%. Open-weight models performed competitively against closed-source counterparts.

Conclusion: Swiss regulatory compliance tasks are challenging for frontier models even under zero-retrieval conditions, with open-weight models showing competitive performance. The benchmark provides an empirical reference for assessing model capabilities in this domain.

Abstract: While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). The benchmark proves difficult: even the top-ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69-72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open-weight, three closed-source), an open-weight model leads the ranking, and several open-weight models match or outperform their closed-source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.

[35] Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages

Badr M. Abdullah, Israel Abebe Azime, Atnafu Lambebo Tonja, Jesujoba O. Alabi, Abel Mulat Alemu, Eyob G. Hagos, Bontu Fufa Balcha, Mulubrhan A. Nerea, Debela Desalegn Yadeta, Dagnachew Mekonnen Marilign, Amanuel Temesgen Fentahun, Tadesse Kebede, Israel D. Gebru, Michael Melese Woldeyohannis, Walelign Tewabe Sewunetie, Bernd Möbius, Dietrich Klakow

Main category: cs.CL

TL;DR: Ethio-ASR: A suite of multilingual CTC-based ASR models for five underrepresented Ethiopian languages, achieving state-of-the-art performance with comprehensive bias and error analysis.

Details

Motivation: Ethiopian languages (Amharic, Tigrinya, Oromo, Sidaama, Wolaytta) are severely underrepresented in speech technology despite being spoken by the majority of Ethiopia's population. There's a need for specialized ASR models for these Afroasiatic languages.

Method: Multilingual CTC-based ASR models trained on the WAXAL corpus using pre-trained speech encoders. Joint training across five languages with comprehensive evaluation against OmniASR baselines.

Result: Best model achieves average WER of 30.48% on WAXAL test set, outperforming OmniASR with fewer parameters. Includes analysis of gender bias, vowel length/consonant gemination contributions to errors, and multilingual CTC training dynamics.

Conclusion: Ethio-ASR provides effective multilingual ASR for underrepresented Ethiopian languages, with models and code publicly available to advance speech technology for these languages.

Abstract: We present Ethio-ASR, a suite of multilingual CTC-based automatic speech recognition (ASR) models jointly trained on five Ethiopian languages: Amharic, Tigrinya, Oromo, Sidaama, and Wolaytta. These languages belong to the Semitic, Cushitic, and Omotic branches of the Afroasiatic family, and remain severely underrepresented in speech technology despite being spoken by the vast majority of Ethiopia’s population. We train our models on the recently released WAXAL corpus using several pre-trained speech encoders and evaluate against strong multilingual baselines, including OmniASR. Our best model achieves an average WER of 30.48% on the WAXAL test set, outperforming the best OmniASR model with substantially fewer parameters. We further provide a comprehensive analysis of gender bias, the contribution of vowel length and consonant gemination to ASR errors, and the training dynamics of multilingual CTC models. Our models and codebase are publicly available to the research community.

[36] Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges

Weilun Xu, Alexander Rusnak, Frederic Kaplan

Main category: cs.CL

TL;DR: LLMs develop differentiated internal representations for different ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) rather than collapsing ethics into a single dimension, with asymmetric transfer patterns between frameworks.

Details

Motivation: To understand whether large language models internally distinguish between different normative ethical frameworks or treat ethics as a single acceptability dimension, and to examine how these representations transfer across different ethical scenarios.

Method: Probed hidden representations across five ethical frameworks in six LLMs (4B-72B parameters), analyzed differentiated ethical subspaces and asymmetric transfer patterns, and conducted post-hoc validation to examine dependence on surface features.

Result: Models show differentiated ethical subspaces with asymmetric transfer (e.g., deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice). Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy, though this may partly reflect shared sensitivity to scenario difficulty.

Conclusion: LLMs do distinguish between different ethical frameworks in their internal representations rather than collapsing ethics into a single dimension, though probes partially depend on surface features of benchmark templates, requiring cautious interpretation of results.

Abstract: When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B–72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns – e.g., deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious interpretation. We discuss both the structural insights these methods provide and their epistemological limitations.

[37] PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation

Manjushree B. Aithal, Ph. D., Alexander Kotz, James Mitchell, Ph. D

Main category: cs.CL

TL;DR: Privacy-preserving on-device LLMs for clinical acronym disambiguation using cascaded pipeline with general detection models and domain-specific biomedical expansion models.

Details

Motivation: Healthcare LLM integration faces privacy constraints; clinical acronym misinterpretation can cause severe medication errors; cloud-based solutions violate privacy; need for on-device privacy-preserving solutions.

Method: Cascaded pipeline with general-purpose local models for acronym detection, routing to domain-specific biomedical models for context-relevant expansions; all models deployed on-device (2B-10B parameters).

Result: General models achieve high detection accuracy (~0.988) but poor expansion (~0.655); cascaded approach with domain-specific models increases expansion accuracy to ~0.81; demonstrates viability of on-device privacy-preserving models.

Conclusion: Privacy-preserving on-device models (2B-10B) can deliver high-fidelity clinical acronym disambiguation, bridging privacy constraints with healthcare LLM applications.

Abstract: Large Language Models (LLMs) offer transformative solutions across many domains, but healthcare integration is hindered by strict data privacy constraints. Clinical narratives are dense with ambiguous acronyms, misinterpretation these abbreviations can precipitate severe outcomes like life-threatening medication errors. While cloud-dependent LLMs excel at Acronym Disambiguation, transmitting Protected Health Information to external servers violates privacy frameworks. To bridge this gap, this study pioneers the evaluation of small-parameter models deployed entirely on-device to ensure privacy preservation. We introduce a privacy-preserving cascaded pipeline leveraging general-purpose local models to detect clinical acronyms, routing them to domain-specific biomedical models for context-relevant expansions. Results reveal that while general instruction-following models achieve high detection accuracy (~0.988), their expansion capabilities plummet (~0.655). Our cascaded approach utilizes domain-specific medical models to increase expansion accuracy to (~0.81). This novel work demonstrates that privacy-preserving, on-device (2B-10B) models deliver high-fidelity clinical acronym disambiguation support.

[38] The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang

Main category: cs.CL

TL;DR: Early-exit inference effectiveness in modern LLMs is diminishing due to reduced layer redundancy from improved architectures and training, with dense transformers showing more potential than MoE/SSMs, and larger/base models having higher early-exit suitability.

Details

Motivation: The motivation is to re-evaluate early-exit inference strategies in modern LLMs, as improved pretraining recipes and architectures have reduced layer redundancy, potentially limiting the effectiveness of early-exit approaches that were more viable in older models.

Method: The authors analyze how intermediate representations evolve during training, introduce a metric to quantify a model’s intrinsic suitability for early-exit, and propose a benchmark for researchers to explore early-exit benefits across different models and workloads.

Result: Results show a diminishing trend in early-exit effectiveness across newer model generations, with dense transformers generally offering greater early-exit potential than Mixture-of-Experts and State Space Models. Larger models (20B+ parameters) and base pretrained models without specialized tuning exhibit higher early-exit potential.

Conclusion: Early-exit inference strategies need re-evaluation for modern LLMs due to architectural improvements that reduce layer redundancy, with certain model types and sizes showing more promise for early-exit approaches than others.

Abstract: In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model’s intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.

[39] IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

Ali Abdelaal, Mohammed Nader Al Haffar, Mahmoud Fawzi, Walid Magdy

Main category: cs.CL

TL;DR: IslamicMMLU benchmark evaluates LLMs on Islamic knowledge across Quran, Hadith, and Fiqh disciplines with 10,013 multiple-choice questions, revealing performance gaps and madhab biases.

Details

Motivation: There's growing use of LLMs for Islamic knowledge consultation, but no comprehensive benchmark exists to evaluate their performance across core Islamic disciplines like Quran, Hadith, and Fiqh.

Method: Created IslamicMMLU benchmark with 10,013 multiple-choice questions across three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (4,000 questions). Includes madhab bias detection in Fiqh track. Evaluated 26 LLMs on this benchmark.

Result: Model accuracy ranged from 39.8% to 93.8% (Gemini 3 Flash). Quran track showed widest performance span (32.4% to 99.3%). Fiqh track revealed variable madhab preferences across models. Arabic-specific models underperformed frontier models.

Conclusion: IslamicMMLU provides comprehensive evaluation of LLMs on Islamic knowledge, revealing significant performance variations and biases. The benchmark and leaderboard are publicly available for ongoing evaluation.

Abstract: Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track is formed of multiple types of questions to examine LLMs capabilities handling different aspects of Islamic knowledge. The benchmark is used to create the IslamicMMLU public leaderboard for evaluating LLMs, and we initially evaluate 26 LLMs, where their averaged accuracy across the three tracks varied between 39.8% to 93.8% (by Gemini 3 Flash). The Quran track shows the widest span (99.3% to 32.4%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task revealing variable school-of-thought preferences across models. Arabic-specific models show mixed results, but they all underperform compared to frontier models. The evaluation code and leaderboard are made publicly available.

[40] Infrequent Child-Directed Speech Is Bursty and May Draw Infant Vocalizations

Margaret Cychosz, Adriana Weisleder

Main category: cs.CL

TL;DR: Study examines infant speech input patterns in rural Bolivia vs urban US, finding child-directed speech occurs in temporal bursts in both settings, and infants vocalize more during directed speech, especially from older children in Bolivia.

Details

Motivation: To understand how infants learn language when they receive relatively little directed speech input, examining differences in speech input patterns across diverse cultural contexts.

Method: Used longform, infant-centered audio recordings from rural Bolivia and urban U.S. to analyze temporal patterns of speech input and infants’ pre-linguistic vocal behavior.

Result: Child-directed speech in Bolivia, though less frequent, was as temporally clustered as in the U.S. Infants were most likely to produce speech-like vocalizations during directed speech periods, with Bolivian infants responding more to older children’s speech than adult speech.

Conclusion: Language development impact depends not only on quantity of child-directed speech but also on temporal concentration and source, with older children serving as important input sources in communities where adult speech to infants is less frequent.

Abstract: Children in many parts of the world hear relatively little speech directed to them, yet still reach major language development milestones. What differs about the speech input that infants learn from when directed input is rare? Using longform, infant-centered audio recordings taken in rural Bolivia and the urban U.S., we examined temporal patterns of infants’ speech input and their pre-linguistic vocal behavior. We find that child-directed speech in Bolivia, though less frequent, was just as temporally clustered as speech input in the U.S, arriving in concentrated bursts rather than spread across the day. In both communities, infants were most likely to produce speech-like vocalizations during periods of speech directed to them, with the probability of infants’ speech-like vocalizations during target child-directed speech nearly double that during silence. In Bolivia, infants’ speech-like vocalizations were also more likely to occur during bouts of directed speech from older children than from adults. Together, these findings suggest that the developmental impact of child-directed speech may depend not only on quantity, but on temporal concentration and source, with older children serving as an important source of input in some communities, including where adult speech to infants is less frequent.

[41] Perturbation: A simple and efficient adversarial tracer for representation learning in language models

Joshua Rozner, Cory Shain

Main category: cs.CL

TL;DR: The paper proposes a new method for studying linguistic representations in language models by perturbing them with adversarial fine-tuning and measuring how this perturbation transfers to other examples, revealing structured linguistic representations without geometric assumptions.

Details

Motivation: Current methods for finding linguistic representations in language models face a dilemma: either enforce implausible constraints (like linearity) or trivialize the notion of representation altogether. There's a need for a better approach to understand how LMs represent linguistic information.

Method: The method involves perturbing a language model by fine-tuning it on a single adversarial example, then measuring how this perturbation “infects” or transfers to other examples. This approach makes no geometric assumptions and tests whether representations exist by examining how learning perturbations propagate.

Result: Perturbation reveals structured transfer at multiple linguistic grain sizes in trained language models, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone. The method doesn’t find representations where it shouldn’t (e.g., in untrained LMs).

Conclusion: By reconceptualizing representations as conduits for learning rather than patterns of activation, the perturbation approach escapes the dilemma of current representation-finding methods and provides evidence that trained LMs develop structured linguistic representations that support generalization.

Abstract: Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al. 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation ``infects’’ other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.

[42] PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

Rohan Khetan, Ashna Khetan

Main category: cs.CL

TL;DR: PoliticsBench: A novel multi-turn roleplay framework to evaluate political bias in LLMs, revealing systematic left-leaning bias in most models except Grok which leans right.

Details

Motivation: Existing LLM bias benchmarks focus on gender/racial stereotypes, with political bias measured coarsely. Need for fine-grained evaluation of specific political values underlying sociopolitical leanings in LLMs.

Method: Used PoliticsBench framework adapted from EQ-Bench-v3 psychometric benchmark. Tested 8 prominent LLMs through 20 evolving multi-turn roleplay scenarios where models reported stances and determined actions. Scored responses on scale of 10 political values.

Result: 7 of 8 models leaned left (Claude, Deepseek, Gemini, GPT, Llama, Qwen Base, Qwen Instruction-Tuned), while Grok leaned right. Left-leaning LLMs strongly exhibited liberal traits, moderately conservative ones. Slight variations in alignment scores across roleplay stages with no clear pattern. Most models used consequence-based reasoning, Grok used facts/statistics.

Conclusion: First psychometric evaluation of political values in LLMs through multi-stage free-text interactions, revealing systematic political biases that could impact objectivity when LLMs are used as information sources.

Abstract: While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate gender and racial stereotypes. When political bias is included, it is typically measured at a coarse level, neglecting the specific values that shape sociopolitical leanings. This study investigates political bias in eight prominent LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) using PoliticsBench: a novel multi-turn roleplay framework adapted from the EQ-Bench-v3 psychometric benchmark. We test whether commercially developed LLMs display a systematic left-leaning bias that becomes more pronounced in later stages of multi-stage roleplay. Through twenty evolving scenarios, each model reported its stance and determined its course of action. Scoring these responses on a scale of ten political values, we explored the values underlying chatbots’ deviations from unbiased standards. Seven of our eight models leaned left, while Grok leaned right. Each left-leaning LLM strongly exhibited liberal traits and moderately exhibited conservative ones. We discovered slight variations in alignment scores across stages of roleplay, with no particular pattern. Though most models used consequence-based reasoning, Grok frequently argued with facts and statistics. Our study presents the first psychometric evaluation of political values in LLMs through multi-stage, free-text interactions.

[43] Language Model Planners do not Scale, but do Formalizers?

Owen Jiang, Cassie Huang, Ashish Sabharwal, Li Zhang

Main category: cs.CL

TL;DR: LLM formalizers outperform LLM planners in complex planning problems, maintaining perfect accuracy in large state spaces, with divide-and-conquer techniques improving robustness and a new LLM-as-higher-order-formalizer paradigm addressing combinatorial explosion challenges.

Details

Motivation: The paper investigates whether LLM formalizers (which generate solver-oriented programs) scale better than LLM planners (which generate reasoning traces) for complex planning problems, particularly examining their performance in domains with huge state spaces and combinatorial complexity.

Method: Systematic evaluation of LLM formalizers vs planners in classic BlocksWorld domain with state spaces up to 10^165; development of divide-and-conquer formalizing technique to improve robustness; introduction of unraveling problems with exponential formal language requirements; and proposal of LLM-as-higher-order-formalizer paradigm where LLMs generate program generators to decouple token output from combinatorial explosion.

Result: LLM formalizers significantly outperform LLM planners, with some maintaining perfect accuracy even in massive state spaces; divide-and-conquer techniques greatly improve robustness of smaller LLM formalizers; the new LLM-as-higher-order-formalizer paradigm effectively addresses combinatorial explosion challenges in unraveling problems.

Conclusion: LLM formalizers scale much better than LLM planners for complex planning problems, and novel techniques like divide-and-conquer formalizing and LLM-as-higher-order-formalizer paradigms can address remaining scalability challenges, particularly for problems with combinatorial explosion in formal language requirements.

Abstract: Recent work shows overwhelming evidence that LLMs, even those trained to scale their reasoning trace, perform unsatisfactorily when solving planning problems too complex. Whether the same conclusion holds for LLM formalizers that generate solver-oriented programs remains unknown. We systematically show that LLM formalizers greatly out-scale LLM planners, some retaining perfect accuracy in the classic BlocksWorld domain with a huge state space of size up to $10^{165}$. While performance of smaller LLM formalizers degrades with problem complexity, we show that a divide-and-conquer formalizing technique can greatly improve its robustness. Finally, we introduce unraveling problems where one line of problem description realistically corresponds to exponentially many lines of formal language such as the Planning Domain Definition Language (PDDL), greatly challenging LLM formalizers. We tackle this challenge by introducing a new paradigm, namely LLM-as-higher-order-formalizer, where an LLM generates a program generator. This decouples token output from the combinatorial explosion of the underlying formalization and search space.

[44] BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Praveen Kumar Myakala, Manan Agrawal, Rahul Manche

Main category: cs.CL

TL;DR: BeliefShift introduces a benchmark for evaluating belief dynamics in multi-session LLM conversations, focusing on how models handle opinion changes, contradictions, and evidence-driven belief updates over time.

Details

Motivation: Current LLM benchmarks treat user information as static facts, but real conversations involve changing opinions, belief drift, and evolving perspectives over extended interactions. There's a need to evaluate how LLMs handle belief dynamics in multi-session dialogues.

Method: Created a longitudinal benchmark with 2,400 human-annotated multi-session interaction trajectories across health, politics, values, and preferences. Includes three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. Evaluated seven models (GPT-4o, Claude 3.5, Gemini 1.5, LLaMA-3, Mistral-Large) under zero-shot and RAG settings.

Result: Found a trade-off: models that personalize aggressively resist belief drift poorly, while factually grounded models miss legitimate belief updates. Introduced four novel metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).

Conclusion: Belief dynamics in multi-session LLM interactions require specialized evaluation beyond static fact retrieval. The BeliefShift benchmark provides tools to assess how models handle evolving user beliefs, revealing fundamental trade-offs in current architectures.

Abstract: LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That’s the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over-alignment, and confirmation bias start to matter a lot. BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).

[45] Self-Distillation for Multi-Token Prediction

Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun

Main category: cs.CL

TL;DR: MTP-D is a self-distillation method that improves multi-token prediction in LLMs by boosting acceptance rates of MTP heads while preserving main-head performance, with a looped extension strategy for further inference acceleration.

Details

Motivation: As LLMs scale up, inference efficiency becomes critical. Multi-Token Prediction (MTP) could accelerate inference by predicting multiple future tokens in parallel, but existing approaches face challenges with limited acceptance rates of MTP heads and difficulties in jointly training multiple MTP heads.

Method: Proposes MTP-D, a simple yet effective self-distillation method with minimal additional training cost. Also introduces a looped extension strategy for MTP-D that enables effective and economical MTP head extension. Systematically explores distillation strategies and MTP scalability through extensive experiments.

Result: MTP-D boosts MTP head acceptance rates by +7.5% while preserving main-head performance. The looped extension strategy enables further significant inference speedup to 1-head MTP (+220.4%). Validated through experiments on seven benchmarks.

Conclusion: MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating practical usage of MTP in LLMs.

Abstract: As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximumly preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and further significant inference speedup to 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on the distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical usage of MTP in LLMs.

[46] Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development

Zongliang Ji, Ziyang Zhang, Xincheng Tan, Matthew Thompson, Anna Goldenberg, Carl Yang, Rahul G. Krishnan, Fan Zhang

Main category: cs.CL

TL;DR: LLMs can generate clinically meaningful, guideline-relevant questions during physician-patient encounters to support evidence-based medicine in primary care, though reliability needs improvement.

Details

Motivation: Evidence-based medicine is difficult to implement in fast-paced primary care due to short consultations, high patient loads, and impractical guideline documents. There's a need for ambient assistants that can surface targeted questions during clinical encounters.

Method: Used Gemini 2.5 as backbone LLM with two prompting strategies: zero-shot baseline and multi-stage reasoning variant. Evaluated on 80 de-identified transcripts from real clinical encounters with six experienced physicians providing over 90 hours of structured review.

Result: General-purpose LLMs are not yet fully reliable but can produce clinically meaningful and guideline-relevant questions, showing potential to reduce cognitive burden and make evidence-based medicine more actionable at point of care.

Conclusion: LLMs show significant potential as ambient assistants for generating evidence-based questions during clinical encounters, though further development is needed to improve reliability for real-world implementation.

Abstract: Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.

Seunghee Kim, Bumkyu Park, Kyudan Jung, Joosung Lee, Soyoon Kim, Jeonghoon Kim, Taeuk Kim, Hwiyeol Jo

Main category: cs.CL

TL;DR: OmniACBench: A benchmark for evaluating acoustic control in omni-modal models that requires generating appropriate speech based on multimodal context (spoken instruction, text script, image).

Details

Motivation: Existing testbeds for omni-modal models focus on textual outputs, leaving it unclear whether these models can properly verbalize their answers. There's a need to evaluate models' ability to generate speech with appropriate acoustic features based on multimodal context.

Method: Introduces OmniACBench with 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Models must read a text script aloud with appropriate tone and manner given a spoken instruction, text script, and image. Evaluates eight models extensively.

Result: Experiments reveal models’ limitations in acoustic control despite strong performance on textual-output evaluations. Main bottleneck is integrating multimodal context for faithful speech generation, not processing individual modalities. Identifies three common failure modes: weak direct control, failed implicit inference, and failed multimodal grounding.

Conclusion: Omni-modal models struggle with acoustic control and verbalizing responses effectively. The benchmark provides insights for developing models that can better integrate multimodal context for speech generation.

Abstract: Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench comprises 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite their strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities, but in integrating multimodal context for faithful speech generation. Moreover, we identify three common failure modes-weak direct control, failed implicit inference, and failed multimodal grounding-providing insights for developing models that can verbalize responses effectively.

[48] Argument Mining as a Text-to-Text Generation Task

Masayuki Kawarada, Tsutomu Hirao, Wataru Uchida, Masaaki Nagata

Main category: cs.CL

TL;DR: Text-to-text generation approach for argument mining that simultaneously generates annotated text for spans, components, and relations, eliminating need for task-specific postprocessing.

Details

Motivation: Previous argument mining methods require multiple subtasks and rule-based postprocessing, adding complexity and expanding hyperparameter search space.

Method: Uses pretrained encoder-decoder language model for text-to-text generation that simultaneously outputs argumentatively annotated text for spans, components, and relations.

Result: Achieves state-of-the-art performance on three benchmark datasets: Argument-annotated Essays Corpus (AAEC), AbstRCT, and Cornell eRulemaking Corpus (CDCP).

Conclusion: Proposed method simplifies argument mining by eliminating postprocessing and hyperparameter tuning while maintaining strong performance across different argumentative structures.

Abstract: Argument Mining(AM) aims to uncover the argumentative structures within a text. Previous methods require several subtasks, such as span identification, component classification, and relation classification. Consequently, these methods need rule-based postprocessing to derive argumentative structures from the output of each subtask. This approach adds to the complexity of the model and expands the search space of the hyperparameters. To address this difficulty, we propose a simple yet strong method based on a text-to-text generation approach using a pretrained encoder-decoder language model. Our method simultaneously generates argumentatively annotated text for spans, components, and relations, eliminating the need for task-specific postprocessing and hyperparameter tuning. Furthermore, because it is a straightforward text-to-text generation method, we can easily adapt our approach to various types of argumentative structures. Experimental results demonstrate the effectiveness of our method, as it achieves state-of-the-art performance on three different types of benchmark datasets: the Argument-annotated Essays Corpus(AAEC), AbstRCT, and the Cornell eRulemaking Corpus(CDCP)

[49] From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

Sirui Xia, Yikai Zhang, Aili Chen, Siye Wu, Siyu Yuan, Yanghua Xiao

Main category: cs.CL

TL;DR: POISE: Automated framework for discovering improved policy optimization algorithms for language models through structured search and evidence-driven iteration

Details

Motivation: Manual discovery of improved policy optimization algorithms for language models is costly and requires repeated mechanism-level modifications and validation. There's a need for automated approaches that can search over algorithmic mechanisms while reusing empirical evidence across iterations.

Method: POISE is a closed-loop framework that maintains a structured, genealogically linked archive connecting proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. It searches over algorithmic mechanisms tightly coupled with training dynamics.

Result: In mathematical reasoning experiments starting from GRPO, POISE evaluated 64 candidate algorithms and discovered improved mechanisms including analytic-variance scaling and validity masking. The best variant improved weighted Overall from 47.8 to 52.5 (+4.6) and increased AIME25 pass@32 from 26.7% to 43.3%.

Conclusion: POISE demonstrates the feasibility of automated policy optimization discovery for language models while supporting interpretable design principles, offering a systematic approach to algorithm improvement.

Abstract: Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.

[50] Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

Somaya Eltanbouly, Samer Rashwani

Main category: cs.CL

TL;DR: A retrieval-augmented generation framework using historical Arabic dictionaries improves LLM performance on complex religious texts like Quran and Hadith.

Details

Motivation: LLMs struggle with complex historical and religious Arabic texts due to linguistic challenges like diacritics and compound expressions, requiring specialized knowledge beyond general corpora.

Method: Developed a RAG framework using the Doha Historical Dictionary of Arabic (DHDA) with hybrid retrieval and intent-based routing to provide precise historical lexical information to Arabic LLMs.

Result: Improved accuracy of Arabic-native LLMs (Fanar and ALLaM) to over 85%, reducing performance gap with proprietary models like Gemini, with high human-judge agreement (kappa = 0.87).

Conclusion: Integrating diachronic lexicographic resources into RAG frameworks significantly enhances Arabic language understanding, especially for historical and religious texts.

Abstract: Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic-native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at: https://github.com/somayaeltanbouly/Doha-Dictionary-RAG.

[51] CoCR-RAG: Enhancing Retrieval-Augmented Generation in Web Q&A via Concept-oriented Context Reconstruction

Kaize Shi, Xueyao Sun, Qika Lin, Firoj Alam, Qing Li, Xiaohui Tao, Guandong Xu

Main category: cs.CL

TL;DR: CoCR-RAG improves RAG by using concept distillation from AMR graphs to fuse multi-source documents into coherent contexts, enhancing factual consistency in web Q&A.

Details

Motivation: Current RAG systems struggle with multi-source document fusion due to heterogeneous writing styles, formats, and granularity, leading to irrelevant/redundant information that compromises factual consistency in answers.

Method: Proposes CoCR-RAG framework with concept distillation algorithm that extracts essential concepts from Abstract Meaning Representation (AMR) graphs, then fuses and reconstructs them into unified contexts using LLMs.

Result: Significantly outperforms existing context-reconstruction methods on PopQA and EntityQuestions datasets, shows robustness across various backbone LLMs, and works as flexible plug-and-play component.

Conclusion: CoCR-RAG effectively addresses multi-source information fusion in RAG through linguistically grounded concept-level integration, improving factual consistency in web Q&A systems.

Abstract: Retrieval-augmented generation (RAG) has shown promising results in enhancing Q&A by incorporating information from the web and other external sources. However, the supporting documents retrieved from the heterogeneous web often originate from multiple sources with diverse writing styles, varying formats, and inconsistent granularity. Fusing such multi-source documents into a coherent and knowledge-intensive context remains a significant challenge, as the presence of irrelevant and redundant information can compromise the factual consistency of the inferred answers. This paper proposes the Concept-oriented Context Reconstruction RAG (CoCR-RAG), a framework that addresses the multi-source information fusion problem in RAG through linguistically grounded concept-level integration. Specifically, we introduce a concept distillation algorithm that extracts essential concepts from Abstract Meaning Representation (AMR), a stable semantic representation that structures the meaning of texts as logical graphs. The distilled concepts from multiple retrieved documents are then fused and reconstructed into a unified, information-intensive context by Large Language Models, which supplement only the necessary sentence elements to highlight the core knowledge. Experiments on the PopQA and EntityQuestions datasets demonstrate that CoCR-RAG significantly outperforms existing context-reconstruction methods across these Web Q&A benchmarks. Furthermore, CoCR-RAG shows robustness across various backbone LLMs, establishing itself as a flexible, plug-and-play component adaptable to different RAG frameworks.

[52] Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

Yao Chen, Yilong Chen, Yinqi Yang, Junyuan Shang, Zhenyu Zhang, Zefeng Zhang, Shuaiyi Nie, Shuohuan Wang, Yu Sun, Hua Wu, HaiFeng Wang, Tingwen Liu

Main category: cs.CL

TL;DR: Sparse Growing Transformer (SGT) is a training-time framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads, reducing computational redundancy while improving performance.

Details

Motivation: Existing approaches to increasing Transformer depth use static parameter reuse across entire blocks, leading to computational redundancy. The authors argue depth allocation should be a progressively growing structural process based on observed deep-to-shallow maturation trajectory across layers.

Method: SGT introduces a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on high-entropy informative heads, creating structural sparsity by selectively increasing depth for only a small subset of parameters.

Result: SGT consistently outperforms training-time static block-level looping baselines across multiple parameter scales while reducing additional training FLOPs overhead from approximately 16-20% to only 1-3% relative to standard Transformer backbone.

Conclusion: Progressive depth allocation via sparse attention looping on informative heads is more efficient than static parameter reuse, enabling deeper effective depth with minimal computational overhead.

Abstract: Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16–20% to only 1–3% relative to a standard Transformer backbone.

Kun-Yang Yu, Zhi Zhou, Shi-Yu Tian, Xiao-Wen Yang, Zi-Yi Jia, Ming Yang, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.CL

TL;DR: TWT is a program-aided code-based neuro-symbolic reasoning approach for Tabular-Vision Multi-Modal Understanding that outperforms existing baselines by 10% on average across eight datasets.

Details

Motivation: Tabular data is a critical real-world modality that remains underexplored in multimodal learning, with challenges including high structural variability, complex feature dependencies, and heterogeneous problem-solving pipelines.

Method: Proposes Thinking with Tables (TWT) using program-aided code-based neuro-symbolic reasoning that interacts with external environments for operations like information extraction and element modeling.

Result: TWT consistently outperforms existing baselines by an average of 10% in accuracy across eight datasets, achieving performance comparable to or surpassing proprietary commercial SOTA LLMs on TVMU tasks.

Conclusion: TWT effectively addresses core challenges in tabular-vision multimodal understanding through neuro-symbolic reasoning, demonstrating strong performance on diverse tabular-vision tasks.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities across modalities such as images and text. However, tabular data, despite being a critical real-world modality, remains relatively underexplored in multimodal learning. In this paper, we focus on the task of Tabular-Vision Multi-Modal Understanding (TVMU) and identify three core challenges: (1) high structural variability and data incompleteness in tables, (2) implicit and complex feature dependencies, and (3) significant heterogeneity in problem-solving pipelines across downstream tasks. To address these issues, we propose Thinking with Tables (TWT). TWT employs a program-aided code-based neuro-symbolic reasoning mechanism that facilitates key operations, such as information extraction and element modeling, by interacting with external environments. We evaluate TWT on eight representative datasets. Experimental results demonstrate that TWT consistently outperforms existing baselines by an average of 10% in accuracy, achieving performance comparable to, or even surpassing, proprietary commercial SOTA LLMs on TVMU tasks. Models and codes are available at https://github.com/kunyang-YU/Thinking-with-Tables

Wassim Swaileh, Mohammed-En-Nadhir Zighem, Hichem Telli, Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Fadi Dornaika, Dimitrios Kotzinos

Main category: cs.CL

TL;DR: A retrieval-augmented generation (RAG) system for Islamic inheritance law reasoning that uses synthetic data generation, hybrid retrieval, and schema-constrained validation to achieve high precision in Arabic legal reasoning tasks.

Details

Motivation: Islamic inheritance law (Ilm al-Mawarith) is a complex multi-stage legal reasoning task with variations across legal schools and civil-law codifications, requiring models to operate under explicit legal configurations while maintaining high precision and reliability.

Method: Proposes a RAG pipeline combining: 1) rule-grounded synthetic data generation using a symbolic inheritance calculator to create high-quality corpus with full reasoning traces, 2) hybrid retrieval (dense and BM25) with cross-encoder reranking, and 3) schema-constrained output validation for legal and numerical consistency.

Result: Achieves MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard, demonstrating that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.

Conclusion: The proposed system effectively addresses the challenges of Islamic inheritance law reasoning through a combination of synthetic data generation, hybrid retrieval, and schema validation, showing strong performance in Arabic legal reasoning tasks.

Abstract: Islamic inheritance (Ilm al-Mawarith) is a multi-stage legal reasoning task requiring the identification of eligible heirs, resolution of blocking rules (hajb), assignment of fixed and residual shares, handling of adjustments such as awl and radd, and generation of a consistent final distribution. The task is further complicated by variations across legal schools and civil-law codifications, requiring models to operate under explicit legal configurations. We present a retrieval-augmented generation (RAG) pipeline for this setting, combining rule-grounded synthetic data generation, hybrid retrieval (dense and BM25) with cross-encoder reranking, and schema-constrained output validation. A symbolic inheritance calculator is used to generate a large high-quality synthetic corpus with full intermediate reasoning traces, ensuring legal and numerical consistency. The proposed system achieves a MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard. Results demonstrate that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.

[55] Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale

Chinmay Soni, Shivam Chourasia, Gaurav Kumar, Hitesh Kapoor

Main category: cs.CL

TL;DR: Specialized 8B-parameter model for cricket statistics text-to-SQL that eliminates schema prompts, reducing tokens by 99% and achieving 98.4% execution success.

Details

Motivation: Proprietary API-based LLMs for text-to-SQL face prohibitive costs and latency due to massive schema-heavy prompts, hindering scalable production deployment in large applications like fantasy sports platforms.

Method: Two-phase supervised fine-tuning approach enables an 8B-parameter model to internalize the entire database schema, eliminating long-context prompts and replacing costly API calls with efficient local inference.

Result: Reduced input tokens by over 99% (from 17k to <100), achieved 98.4% execution success and 92.5% semantic accuracy, outperforming prompt-engineered Gemini Flash 2.0 baseline (95.6% execution, 89.4% semantic accuracy).

Conclusion: Demonstrates practical path for high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted LLMs in large-scale production, solving cost and latency challenges of API-based models.

Abstract: Applying large, proprietary API-based language models to text-to-SQL tasks poses a significant industry challenge: reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11, India’s largest fantasy sports platform with over 250 million users, that answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google’s Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.

[56] From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Xiaoyong Guo, Nanjie Li, Zijie Zeng, Kai Wang, Hao Huang, Haihua Xu, Wei Shi

Main category: cs.CL

TL;DR: A training framework to improve robustness of contextual ASR with Speech-LLMs by addressing contextual exposure bias through teacher error knowledge, context dropout, and DPO on failure cases.

Details

Motivation: Contextual ASR with Speech-LLMs suffers from train-test mismatch where models are trained with oracle conversation history but must use error-prone predicted history at inference, causing contextual exposure bias that degrades performance.

Method: Three techniques: (1) Teacher Error Knowledge using Whisper large-v3 hypotheses as training-time history, (2) Context Dropout to regularize over-reliance on history, and (3) Direct Preference Optimization on curated failure cases to improve robustness.

Result: On TED-LIUM 3 and zero-shot LibriSpeech, the approach reduces WER from 5.59% (oracle-history training) to 5.17% with DPO. Under irrelevant-context attacks, DPO shows smallest degradation (5.17% → 5.63%), indicating improved robustness to misleading context.

Conclusion: The proposed unified training framework effectively addresses contextual exposure bias in Speech-LLMs, improving robustness to error-prone and misleading context through teacher error knowledge, regularization, and preference optimization.

Abstract: Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, SFT with Whisper hypotheses reduce WER from 5.59% (oracle-history training) to 5.47%, and DPO further improves to 5.17%. Under irrelevant-context attacks, DPO yields the smallest degradation (5.17% -> 5.63%), indicating improved robustness to misleading context. Our code and models are published on https://github.com/XYGuo1996/Contextual_Speech_LLMs.

[57] FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval

Caishuang Huang, Yang Qiao, Rongyu Zhang, Junjie Ye, Pu Lu, Wenxi Wu, Meng Zhou, Xiku Du, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: FinToolSyn is a forward synthesis framework for generating high-quality financial dialogues with dynamic tool retrieval, addressing limitations of reverse synthesis methods in capturing implicit, event-driven needs in finance.

Details

Motivation: Existing data synthesis methods for LLM tool-use in finance rely on reverse synthesis (generating queries from pre-sampled tools), which introduces artificial explicitness and fails to capture the implicit, event-driven nature of real-world financial needs. They also overlook dynamic retrieval processes needed for massive tool spaces.

Method: FinToolSyn uses a forward synthesis framework with three stages: 1) persona instruction, 2) atomic tool synthesis, and 3) dynamic retrieval dialogue generation. It constructs a repository of 43,066 tools and synthesizes over 148k dialogue instances, incorporating dynamic retrieval to emulate noisy candidate sets in massive tool spaces.

Result: Models trained on FinToolSyn achieve a 21.06% improvement in tool-calling capabilities. The framework provides a robust foundation for tool learning in financial scenarios and includes a dedicated benchmark for evaluating tool-calling in realistic financial scenarios.

Conclusion: FinToolSyn addresses key limitations of existing synthesis methods by using forward synthesis with dynamic retrieval, better capturing real-world financial needs and providing substantial improvements in LLM tool-use capabilities for finance.

Abstract: Tool-use capabilities are vital for Large Language Models (LLMs) in finance, a domain characterized by massive investment targets and data-intensive inquiries. However, existing data synthesis methods typically rely on a reverse synthesis paradigm, generating user queries from pre-sampled tools. This approach inevitably introduces artificial explicitness, yielding queries that fail to capture the implicit, event-driven nature of real-world needs. Moreover, its reliance on static tool sets overlooks the dynamic retrieval process required to navigate massive tool spaces. To address these challenges, we introduce \textit{FinToolSyn}, a forward synthesis framework designed to generate high-quality financial dialogues. Progressing from persona instruction and atomic tool synthesis to dynamic retrieval dialogue generation, our pipeline constructs a repository of 43,066 tools and synthesizes over 148k dialogue instances, incorporating dynamic retrieval to emulate the noisy candidate sets typical of massive tool spaces. We also establish a dedicated benchmark to evaluate tool-calling capabilities in realistic financial scenarios. Extensive experiments demonstrate that models trained on FinToolSyn achieve a 21.06% improvement, providing a robust foundation for tool learning in financial scenarios.

[58] ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing

Yu-Chen Kang, Yu-Chien Tang, An-Zi Yen

Main category: cs.CL

TL;DR: ConceptKT extends knowledge tracing to predict specific concept deficiencies, using LLMs/LRMs with context selection based on conceptual alignment and semantic similarity.

Details

Motivation: Traditional knowledge tracing only predicts binary correctness, lacking diagnostic capability to identify specific conceptual misunderstandings that cause errors, which is essential for personalized remediation.

Method: Introduces concept-level deficiency prediction task, creates ConceptKT dataset with concept annotations, explores in-context learning with LLMs/LRMs, and investigates strategies for selecting informative historical records based on conceptual alignment and semantic similarity.

Result: Experimental results show that selecting response histories based on conceptual alignment and semantic similarity improves performance on both correctness prediction and concept-level deficiency identification.

Conclusion: Concept-level deficiency prediction enhances knowledge tracing by providing fine-grained diagnostic feedback, and context selection strategies significantly improve LLM/LRM performance on this task.

Abstract: Knowledge Tracing (KT) is a critical technique for modeling student knowledge to support personalized learning. However, most KT systems focus on binary correctness prediction and cannot diagnose the underlying conceptual misunderstandings that lead to errors. Such fine-grained diagnostic feedback is essential for designing targeted instruction and effective remediation. In this work, we introduce the task of concept-level deficiency prediction, which extends traditional KT by identifying the specific concepts a student is likely to struggle with on future problems. We present ConceptKT, a dataset annotated with labels that capture both the concepts required to solve each question and the missing concepts underlying incorrect responses. We investigate in-context learning approaches to KT and evaluate the diagnostic capabilities of various Large Language Models (LLMs) and Large Reasoning Models (LRMs). Different strategies for selecting informative historical records are explored. Experimental results demonstrate that selecting response histories based on conceptual alignment and semantic similarity leads to improved performance on both correctness prediction and concept-level deficiency identification.

[59] LLMpedia: A Transparent Framework to Materialize an LLM’s Encyclopedic Knowledge at Scale

Muhammed Saeed, Simon Razniewski

Main category: cs.CL

TL;DR: LLMpedia generates encyclopedic articles from parametric memory without retrieval, revealing that language models’ factuality is actually 15+ percentage points lower than benchmark scores suggest, with frontier subjects having even lower accuracy.

Details

Motivation: The paper addresses the discrepancy between high benchmark scores (like MMLU's 90%+) and actual factuality of language models. It aims to provide a more comprehensive evaluation by generating full encyclopedic articles from parametric memory rather than just answering fixed questions.

Method: LLMpedia generates approximately 1M encyclopedic articles across three model families entirely from parametric memory without retrieval. It evaluates factuality by comparing to Wikipedia for covered subjects and curated web evidence for frontier subjects. The approach includes a capture-trap benchmark inspired by prior Grokipedia analysis.

Result: For gpt-5-mini, the verifiable true rate on Wikipedia-covered subjects is only 74.7% (15+ points below benchmark scores). Frontier subjects verifiable only through curated web evidence fall to 63.2% true rate. Wikipedia covers just 61% of surfaced subjects, and three model families overlap by only 7.3% in subject choice. LLMpedia achieves higher factuality at roughly half the textual similarity to Wikipedia compared to Grokipedia.

Conclusion: Language models’ factuality is significantly overestimated by current benchmarks. LLMpedia provides a more realistic evaluation through parametric encyclopedia generation, revealing substantial gaps in factual accuracy, especially for frontier subjects. The fully open release enables better factuality evaluation and knowledge materialization research.

Abstract: Benchmarks such as MMLU suggest flagship language models approach factuality saturation, with scores above 90%. We show this picture is incomplete. \emph{LLMpedia} generates encyclopedic articles entirely from parametric memory, producing ${\sim}$1M articles across three model families without retrieval. For gpt-5-mini, the verifiable true rate on Wikipedia-covered subjects is only 74.7% – more than 15 percentage points below the benchmark-based picture, consistent with the availability bias of fixed-question evaluation. Beyond Wikipedia, frontier subjects verifiable only through curated web evidence fall further to 63.2% true rate. Wikipedia covers just 61% of surfaced subjects, and three model families overlap by only 7.3% in subject choice. In a capture-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia achieves substantially higher factuality at roughly half the textual similarity to Wikipedia. Unlike Grokipedia, every prompt, artifact, and evaluation verdict is publicly released, making LLMpedia the first fully open parametric encyclopedia – bridging factuality evaluation and knowledge materialization. All data, code, and a browsable interface are at https://llmpedia.net.

[60] Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Nour Bouchouchi, Thiabult Laugel, Xavier Renard, Christophe Marsala, Marie-Jeanne Lesot, Marcin Detyniecki

Main category: cs.CL

TL;DR: A unified framework for analyzing intrinsic and extrinsic gender bias in LLMs reveals consistent associations between latent gender information and expressed bias, showing that alignment reduces output bias but leaves underlying representations unchanged.

Details

Motivation: Current gender bias mitigation in LLMs focuses on output-level evaluation using structured benchmarks, which may not reveal whether alignment modifies underlying representations or generalizes to realistic usage scenarios.

Method: Proposes a unified framework using identical neutral prompts to jointly analyze intrinsic gender bias (encoded in internal representations) and extrinsic bias (expressed in generated outputs), enabling direct comparison. Examines alignment effects through supervised fine-tuning for bias reduction.

Result: Finds consistent association between latent gender information and expressed bias under unified protocol (contrary to prior weak correlations). Alignment reduces expressed bias but measurable gender associations remain in internal representations and can be reactivated via adversarial prompting. Debiasing effects on structured benchmarks don’t generalize to realistic scenarios like story generation.

Conclusion: Current alignment approaches may reduce surface-level bias in outputs without fundamentally changing underlying gender representations, highlighting the need for more comprehensive bias mitigation strategies that address both intrinsic and extrinsic aspects.

Abstract: During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model’s underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.

[61] MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare

Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel

Main category: cs.CL

TL;DR: MedAidDialog: A multilingual medical dialogue dataset and MedAidLM model for realistic physician-patient consultations using LLMs and parameter-efficient fine-tuning.

Details

Motivation: Existing medical dialogue systems are limited by single-turn QA or template-based approaches, lacking conversational realism and multilingual applicability, especially in healthcare-limited settings.

Method: Created MedAidDialog dataset by extending MDDial corpus with synthetic consultations generated by LLMs, covering 7 languages. Developed MedAidLM using parameter-efficient fine-tuning on quantized small language models with optional patient pre-context information.

Result: System effectively performs symptom elicitation through multi-turn dialogue and generates diagnostic recommendations. Medical expert evaluation confirms plausibility and coherence of generated consultations.

Conclusion: MedAidLM enables realistic multilingual medical consultations without high-end computational infrastructure, addressing accessibility gaps in healthcare-limited settings.

Abstract: Conversational artificial intelligence has the potential to assist users in preliminary medical consultations, particularly in settings where access to healthcare professionals is limited. However, many existing medical dialogue systems operate in a single-turn question–answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. In this work, we introduce MedAidDialog, a multilingual multi-turn medical dialogue dataset designed to simulate realistic physician–patient consultations. The dataset extends the MDDial corpus by generating synthetic consultations using large language models and further expands them into a parallel multilingual corpus covering seven languages: English, Hindi, Telugu, Tamil, Bengali, Marathi, and Arabic. Building on this dataset, we develop MedAidLM, a conversational medical model trained using parameter-efficient fine-tuning on quantized small language models, enabling deployment without high-end computational infrastructure. Our framework additionally incorporates optional patient pre-context information (e.g., age, gender, allergies) to personalize the consultation process. Experimental results demonstrate that the proposed system can effectively perform symptom elicitation through multi-turn dialogue and generate diagnostic recommendations. We further conduct medical expert evaluation to assess the plausibility and coherence of the generated consultations.

[62] A visual observation on the geometry of UMAP projections of the difference vectors of antonym and synonym word pair embeddings

Rami Luisto

Main category: cs.CL

TL;DR: The paper explores whether antonymy (opposite word pairs) can be detected geometrically in transformer embeddings by analyzing difference vectors between word pairs and comparing antonyms to synonyms.

Details

Motivation: To investigate if antonym relationships can be detected in the geometric structure of transformer embeddings, specifically examining whether antonym pairs exhibit distinctive patterns in their difference vectors compared to synonym pairs.

Method: Analyzes embedding vectors of antonymic word pairs across different embedding models, focusing on difference vectors between word pairs. Uses specific projection configurations to examine geometric patterns, discovering a curious “swirl” pattern that appears consistently across models.

Result: Discovers a consistent “swirl” pattern in the geometry of antonymic word pairs across different embedding models when viewed in specific projection configurations, suggesting geometric regularities in how antonyms are encoded.

Conclusion: Antonym relationships appear to have detectable geometric signatures in transformer embeddings, manifested as specific patterns like the discovered “swirl” in difference vectors, which distinguishes them from synonym relationships.

Abstract: Antonyms, or opposites, are sometimes defined as \emph{word pairs that have all of the same contextually relevant properties but one}. Seeing how transformer models seem to encode concepts as directions, this begs the question if one can detect antonymity'' in the geometry of the embedding vectors of word pairs, especially based on their difference vectors. Such geometrical studies are then naturally contrasted by comparing antonymic pairs to their opposites; synonyms. This paper started as an exploratory project on the complexity of the systems needed to detect the geometry of the embedding vectors of antonymic word pairs. What we now report is a curious swirl’’ that appears across embedding models in a somewhat specific projection configuration.

[63] Variation is the Norm: Embracing Sociolinguistics in NLP

Anne-Marie Lutgen, Alistair Plum, Verena Blaschke, Barbara Plank, Christoph Purschke

Main category: cs.CL

TL;DR: A framework combining sociolinguistics with NLP to actively include language variation in research, demonstrated through a Luxembourgish case study showing NLP models struggle with orthographic variation but can be improved by including variation in fine-tuning.

Details

Motivation: NLP typically treats language variation as noise to be normalized, while sociolinguistics studies variation as central to language. The paper aims to bridge these perspectives by showing how embracing sociolinguistic variation can inform and improve NLP systems.

Method: Proposes a framework combining sociolinguistic and technical NLP dimensions, with a case study on Luxembourgish featuring significant orthographic variation. Tests NLP models on data with variation vs. standard orthography, and explores fine-tuning approaches that include variation.

Result: Large performance discrepancies between models tested on varied vs. standard orthographic data. Models are not robust to natural language variation. However, including variation in the fine-tuning process can improve performance on varied data.

Conclusion: Language variation should be actively included in NLP research rather than treated as noise. The proposed framework facilitates this integration while being grounded in sociolinguistic theory, leading to more robust NLP systems.

Abstract: In Natural Language Processing (NLP), variation is typically seen as noise and “normalised away” before processing, even though it is an integral part of language. Conversely, studying language variation in social contexts is central to sociolinguistics. We present a framework to combine the sociolinguistic dimension of language with the technical dimension of NLP. We argue that by embracing sociolinguistics, variation can actively be included in a research setup, in turn informing the NLP side. To illustrate this, we provide a case study on Luxembourgish, an evolving language featuring a large amount of orthographic variation, demonstrating how NLP performance is impacted. The results show large discrepancies in the performance of models tested and fine-tuned on data with a large amount of orthographic variation in comparison to data closer to the (orthographic) standard. Furthermore, we provide a possible solution to improve the performance by including variation in the fine-tuning process. This case study highlights the importance of including variation in the research setup, as models are currently not robust to occurring variation. Our framework facilitates the inclusion of variation in the thought-process while also being grounded in the theoretical framework of sociolinguistics.

[64] Stance Labels Fail When They Matter Most: The Projection Problem in Stance Detection

Bowen Zhang

Main category: cs.CL

TL;DR: The paper identifies a “projection problem” in stance detection where multi-dimensional attitudes get compressed into single labels, causing annotator disagreement that reflects different weighting choices rather than confusion.

Details

Motivation: Current stance detection models inherit a three-way classification (Favor/Against/Neutral) from debate analysis, but this fails to capture multi-dimensional attitudes where people can support some aspects while opposing others of complex targets.

Method: The authors analyze the projection problem theoretically and empirically through a pilot study on SemEval-2016 Task 6 data, examining agreement patterns between label-level and dimension-level annotations.

Result: When dimensions align, label agreement exceeds dimensional agreement; when dimensions conflict, the pattern reverses - label agreement collapses while dimensional agreement remains intact, with Policy dimension reaching high agreement.

Conclusion: The projection problem is real and activates precisely where it matters most - in complex cases where multi-dimensional attitudes conflict, suggesting the need for multi-dimensional stance modeling rather than simple three-way classification.

Abstract: Stance detection is nearly always formulated as classifying text into Favor, Against, or Neutral – a convention inherited from debate analysis and applied without modification to social media since SemEval-2016. But attitudes toward complex targets are not unitary: a person can accept climate science while opposing carbon taxes, expressing support on one dimension and opposition on another. When annotators must compress such multi-dimensional attitudes into a single label, different annotators weight different dimensions – producing disagreement that reflects not confusion but different compression choices. We call this the \textbf{projection problem}, and show that its cost is conditional: when a text’s dimensions align, any weighting yields the same label and three-way annotation works well; when dimensions conflict, label agreement collapses while agreement on individual dimensions remains intact. A pilot study on SemEval-2016 Task 6 confirms this crossover: on dimension-consistent texts, label agreement (Krippendorff’s $α= 0.307$) exceeds dimensional agreement ($α= 0.082$); on dimension-conflicting texts, the pattern reverses – label $α$ drops to $0.085$ while dimensional $α$ rises to $0.334$, with Policy reaching $0.572$. The projection problem is real – but it activates precisely where it matters most.

[65] Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

Aleix Sant, Jordi Luque, Carlos Escolano

Main category: cs.CL

TL;DR: Extended FederatedScope-LLM for multilingual instruction-tuning with LLMs and introduced LDES-FL for client-specific early stopping to improve efficiency in federated learning of multilingual language models.

Details

Motivation: Address challenges in federated learning of LLMs in multilingual environments, including heterogeneous language distributions across clients and disparities in language resource availability.

Method: Extended FederatedScope-LLM framework for multilingual instruction-tuning experiments, and introduced Local Dynamic Early Stopping (LDES-FL) mechanism that allows clients to pause/resume training based on client-side validation performance.

Result: Monolingual local fine-tuning most effective for single-language specialization; federated training better for balanced multilingual models. Increasing within-client multilinguality leads to stronger, fairer global models, narrows gap to centralized multilingual fine-tuning, with largest gains for lower-resource languages.

Conclusion: Client language composition is a key design variable in multilingual FL that shapes performance, fairness and efficiency.

Abstract: Federated Learning (FL) of Large Language Models (LLMs) in multilingual environments presents significant challenges stemming from heterogeneous language distributions across clients and disparities in language resource availability. To address these challenges, we extended the FederatedScope-LLM framework to support multilingual instruction-tuning experiments with LLMs. We also introduced a novel client-specific early stopping mechanism, Local Dynamic Early Stopping (LDES-FL), which allows clients to pause and resume local training based on client-side validation performance, enhancing training efficiency and sustainability. Through a series of experiments, we studied how client language composition - from fully monolingual to increasingly multilingual clients - affects multilingual quality, fairness and training cost. Monolingual local fine-tuning remains the most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model. In FL, increasing within-client multilinguality leads to stronger and fairer global models, narrows the gap to centralized multilingual fine-tuning, and yields the largest gains for lower-resource languages, albeit at the cost of more optimization steps. Overall, our results identify client language composition as a key design variable in multilingual FL, shaping performance, fairness and efficiency

[66] Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution

Julia Matela, Frank Krüger

Main category: cs.CL

TL;DR: A hybrid system for cross-document coreference resolution of software mentions using semantic embeddings, knowledge base lookup, and density-based clustering, achieving high F1 scores on three subtasks.

Details

Motivation: Address the challenge of identifying and clustering inconsistent software mentions across scientific corpora, which is important for organizing and linking software references in research literature.

Method: Hybrid framework combining: 1) dense semantic embeddings from pre-trained Sentence-BERT, 2) KB lookup strategy using FAISS for efficient retrieval from training-set cluster centroids, 3) HDBSCAN density-based clustering for unassigned mentions, 4) surface-form normalization and abbreviation resolution, with blocking strategy adaptation for large-scale settings.

Result: Achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3 respectively in the SOMD 2026 Shared Task for Cross-Document Coreference Resolution of software mentions.

Conclusion: The proposed hybrid approach effectively resolves software mention coreference across documents, with the combination of semantic embeddings, knowledge base lookup, and clustering proving successful for both standard and large-scale settings.

Abstract: This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, Knowledge Base (KB) lookup strategy built from training-set cluster centroids using FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To address the large scale settings of Subtask 3, the pipeline was adapted by utilising a blocking strategy based on entity types and canonicalized surface forms. Our system achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3 respectively.

[67] Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning

He Huang

Main category: cs.CL

TL;DR: A study on semantic alignment across four historical stages of Ancient Egyptian using a compact encoder-decoder model with multiple training objectives and auxiliary views (Latin transliteration, IPA reconstruction) to handle script/orthographic differences and scarce parallel data.

Details

Motivation: Ancient Egyptian has four historical stages with different scripts and orthographies, making semantic alignment challenging due to scarce parallel data and surface divergence. The research aims to develop methods for modeling historical languages under real-world constraints.

Method: Joint training of a compact encoder-decoder model with shared byte-level tokenizer across all four stages. Combines masked language modeling (MLM), translation language modeling (TLM), sequence-to-sequence translation, and part-of-speech tagging with task-aware loss. Adds Latin transliteration and IPA reconstruction as auxiliary views, integrated through KL-based consistency and embedding-level fusion.

Result: Translation yielded the strongest gains for alignment. IPA with KL consistency improved cross-branch alignment, while early fusion showed limited efficacy. Overall alignment remained limited but provided a reproducible baseline and practical guidance for historical language modeling.

Conclusion: The findings provide a reproducible baseline for modeling historical languages under constraints and show how normalization and task design shape what counts as alignment in typologically distant settings. The approach offers practical guidance despite limited overall alignment.

Abstract: We study word-level semantic alignment across four historical stages of Ancient Egyptian. These stages differ in script and orthography, and parallel data are scarce. We jointly train a compact encoder-decoder model with a shared byte-level tokenizer on all four stages, combining masked language modeling (MLM), translation language modeling (TLM), sequence-to-sequence translation, and part-of-speech tagging under a task-aware loss with fixed weights and uncertainty-based scaling. To reduce surface divergence we add Latin transliteration and IPA reconstruction as auxiliary views. We integrate these views through KL-based consistency and through embedding-level fusion. We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets. Translation yields the strongest gains. IPA with KL consistency improves cross-branch alignment, while early fusion demonstrates limited efficacy. Although the overall alignment remains limited, the findings provide a reproducible baseline and practical guidance for modeling historical languages under real constraints. They also show how normalization and task design shape what counts as alignment in typologically distant settings.

[68] Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

N J Karthika, Keerthana Suryanarayanan, Jahanvi Purohit, Ganesh Ramakrishnan, Jitin Singla, Anil Kumar Gourishetty

Main category: cs.CL

TL;DR: Samasāmayik is a novel Hindi-Sanskrit parallel corpus with 92,196 sentences from contemporary sources, enabling improved machine translation for this low-resource language pair.

Details

Motivation: Existing Sanskrit datasets focus on classical texts, creating a gap for contemporary language understanding and translation. There's a need for modern Hindi-Sanskrit parallel data to advance machine translation for this low-resource Indic language pair.

Method: Created a large-scale Hindi-Sanskrit parallel corpus (92,196 sentences) from diverse contemporary sources including spoken tutorials, children’s magazines, radio conversations, and instruction materials. Benchmarked the dataset by fine-tuning three models: ByT5, NLLB, and IndicTrans-v2.

Result: Models trained on Samasāmayik achieve significant performance gains on in-domain test data while maintaining comparable performance on other test sets. Comparative analysis shows minimal semantic/lexical overlap with existing corpora, confirming dataset novelty.

Conclusion: Samasāmayik establishes a strong new baseline for contemporary Hindi-Sanskrit translation and serves as a robust resource for low-resource Indic language machine translation research.

Abstract: We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus, comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children’s magazines, radio conversations, and instruction materials. We benchmark this new dataset by fine-tuning three complementary models - ByT5, NLLB and IndicTrans-v2, to demonstrate its utility. Our experiments demonstrate that models trained on the Samasamayik corpus achieve significant performance gains on in-domain test data, while achieving comparable performance on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic language MT.

[69] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun

Main category: cs.CL

TL;DR: GameplayQA is a new benchmark for evaluating multimodal LLMs’ agentic perception and reasoning in 3D environments using densely annotated multiplayer gameplay videos with structured QA pairs.

Details

Motivation: Existing benchmarks don't adequately evaluate multimodal LLMs' capabilities needed for autonomous agents in 3D environments, such as perceiving rapid state changes, attributing actions to correct entities, and reasoning about concurrent multi-agent behaviors from first-person perspective.

Method: Densely annotates multiplayer 3D gameplay videos at 1.22 labels/second with time-synced concurrent captions structured around Self, Other Agents, and World triadic system. Creates 2.4K diagnostic QA pairs organized into three cognitive complexity levels with structured distractor taxonomy.

Result: Evaluation of frontier MLLMs shows substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling game decision density.

Conclusion: GameplayQA provides a framework for evaluating agentic perception and reasoning, revealing current MLLM limitations and stimulating research at intersection of embodied AI, agentic perception, and world modeling.

Abstract: Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.

[70] Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning

Arsen Shebzukhov

Main category: cs.CL

TL;DR: Fine-tuning Qwen3.5-2B with reinforcement learning using cycle consistency reward outperforms supervised fine-tuning for autoformalization of natural language math to Lean4.

Details

Motivation: Autoformalization can accelerate AI-assisted mathematical research through proof verification and search, but current methods need improvement in translating natural language math to formal proof languages like Lean4.

Method: Fine-tuned Qwen3.5-2B with LoRA on FineLeanCorpus using three regimes: supervised fine-tuning with curriculum learning, SFT without curriculum, and reinforcement learning with GRPO using cycle consistency reward based on NL→Lean4→NL’ loop similarity.

Result: RL substantially outperformed both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench) with minimal impact on formalization quality and only +0.011 nats cross-entropy loss. Curriculum ordering provided no measurable benefit.

Conclusion: Reinforcement learning with cycle consistency reward is effective for autoformalization, outperforming supervised approaches while maintaining formalization quality, with curriculum learning offering no advantage.

Abstract: Autoformalization - automatically translating natural language mathematical texts into formal proof language such as Lean4 - can help accelerate AI-assisted mathematical research, be it via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through a NL to Lean4 to NL’ loop, computed as cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.

[71] Towards Reward Modeling for AI Tutors in Math Mistake Remediation

Kseniia Petukhova, Ekaterina Kochmar

Main category: cs.CL

TL;DR: The paper introduces a framework for evaluating AI tutor pedagogical quality by creating synthetic contrastive response pairs and training preference models to assess mistake remediation aspects like scaffolding and mistake identification.

Details

Motivation: Standard NLG metrics fail to evaluate key pedagogical aspects of AI tutors such as whether they identify mistakes, scaffold reasoning, or avoid revealing answers, making it challenging to assess their educational quality.

Method: Derived a hierarchy of pedagogical aspects from human pairwise preferences on MRBench, synthesized minimally contrastive response pairs differing along key aspects, and developed Bradley-Terry preference models trained on weighted-sum rankings from MRBench, synthetic pairs, and data combinations.

Result: Using only synthetic data, the best model achieved 0.69 pairwise accuracy on human preference tests; combining weighted-sum data with targeted synthetic groups improved accuracy to 0.74, outperforming larger general-purpose reward models with only a 0.5B-parameter backbone.

Conclusion: The approach enables effective evaluation of AI tutor pedagogical quality through synthetic contrastive data and specialized preference models, offering a scalable solution for assessing educational aspects that standard NLG metrics miss.

Abstract: Evaluating the pedagogical quality of AI tutors remains challenging: standard NLG metrics do not determine whether responses identify mistakes, scaffold reasoning, or avoid revealing the answers. For the task of mistake remediation, we derive a hierarchy of pedagogical aspects from human pairwise preferences on MRBench, and synthesize minimally contrastive response pairs that differ along key aspects (e.g., mistake identification and location, targetedness, scaffolding, actionability, clarity, and coherence). We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations. Using only synthetic data, our best model reaches 0.69 pairwise accuracy on a human preference test, and combining weighted-sum data with targeted synthetic groups improves accuracy to 0.74, outperforming larger general-purpose reward models while using only a 0.5B-parameter backbone.

[72] When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools

Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao, Qingyong Hu

Main category: cs.CL

TL;DR: AI framework for scalable assessment of teacher-child interactions in preschools using multimodal data and LLMs

Details

Motivation: Traditional expert-based assessment of teacher-child interactions faces scalability challenges in large education systems like China's, making continuous quality monitoring infeasible

Method: Developed Interaction2Eval, an LLM-based framework with specialized components for child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning on a large-scale dataset (TEPE-TCI-370h)

Result: Achieved up to 88% agreement with human expert judgments and demonstrated 18x efficiency gain in assessment workflow

Conclusion: AI can enable scalable, continuous quality monitoring in early childhood education, shifting from infrequent audits to monthly AI-assisted assessment with human oversight

Abstract: High-quality teacher-child interaction (TCI) is fundamental to early childhood development, yet traditional expert-based assessment faces a critical scalability challenge. In large systems like China’s-serving 36 million children across 250,000+ kindergartens-the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent episodic audits that limit timely intervention and improvement tracking. In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments. Our contributions include: (1) TEPE-TCI-370h (Tracing Effective Preschool Education), the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations; (2) We develop Interaction2Eval, a specialized LLM-based framework addressing domain-specific challenges-child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning-achieving up to 88% agreement; (3) Deployment validation across 43 classrooms demonstrating an 18x efficiency gain in the assessment workflow, highlighting its potential for shifting from annual expert audits to monthly AI-assisted monitoring with targeted human oversight. This work not only demonstrates the technical feasibility of scalable, AI-augmented quality assessment but also lays the foundation for a new paradigm in early childhood education-one where continuous, inclusive, AI-assisted evaluation becomes the engine of systemic improvement and equitable growth.

[73] PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation

Manoj Balaji Jagadeeshan, Atul Singh, Nallani Chakravartula Sahith, Amrith Krishna, Pawan Goyal

Main category: cs.CL

TL;DR: PINGALA: A decoding approach for Sanskrit poetry generation that improves semantic coherence by 10% through line segmentation and token selection, with 46% better metrical alignment using phonetic transliteration.

Details

Motivation: Sanskrit poetry generation requires both semantic coherence and strict adherence to prosodic rules (binary syllable weight patterns). Existing approaches treat verses as monolithic sequences, but segmentation into grouped-lines can improve quality.

Method: Proposes PINGALA decoding approach with: 1) Line segmentation instead of monolithic sequence processing, 2) Token selection favoring longer tokens to encourage well-formed words, 3) Phonetically aware SLP1 transliteration scheme, 4) Reference-free evaluation using cross-encoders.

Result: 10% improvement in semantic coherence with comparable metrical adherence; 46% increase in metrical alignment using SLP1 transliteration with comparable semantic similarity; better evaluation alignment with true poetry instances.

Conclusion: Segmenting Sanskrit verses as grouped-lines significantly improves semantic coherence, and phonetic transliteration enhances metrical alignment, making PINGALA an effective approach for Sanskrit poetry generation with instruction fine-tuned LLMs.

Abstract: Poetry generation in Sanskrit typically requires the verse to be semantically coherent and adhere to strict prosodic rules. In Sanskrit prosody, every line of a verse is typically a fixed length sequence of syllables adhering to prescribed binary patterns of syllable weights. We observe that instead of treating a verse as a monolithic sequence, segmenting them as grouped-lines leads to significant improvement in semantic coherence by 10% with comparable metrical adherence. Specifically, PINGALA, our proposed decoding approach is designed to encourage every line to have well-formed words and our token selection biases the model towards it by preferring longer tokens. Writing in Sanskrit follows phonemic orthography, hence using a phonetically aware transliteration scheme, SLP1, increased the metrical alignment by 46% with comparable semantic similarity, for a instruction fine-tuned large language models like Phi-4. We also introduce a new approach for reference-free evaluation using cross-encoders which achieved better alignment with true poetry instances.

[74] Mechanic: Sorrifier-Driven Formal Decomposition Workflow for Automated Theorem Proving

Ruichen Qiu, Yichuan Cao, Junqi Liu, Dakai Guo, Xiao-Shan Gao, Lihong Zhi, Ruyong Feng

Main category: cs.CL

TL;DR: Mechanic is a novel agent system for automated theorem proving that uses sorry-driven formal decomposition to isolate and resolve failed subproblems independently, avoiding inefficient full regeneration or context degradation from repeated repairs.

Details

Motivation: Current LLM-based theorem proving systems struggle with complex mathematical reasoning, rarely succeeding on first try and requiring repeated modifications. Existing approaches either inefficiently discard entire proofs or cause context degradation through iterative repairs.

Method: Uses sorry-driven formal decomposition leveraging the sorry placeholder in Lean to precisely isolate unresolved subgoals while preserving verified proof structure, extracting each failed subproblem into clean, self-contained contexts for independent resolution.

Result: Experimental results on challenging mathematical competition benchmarks (IMO 2025 and Putnam 2025) demonstrate significant advantages in proving efficiency compared to existing approaches.

Conclusion: Mechanic effectively addresses the dilemma between inefficient full regeneration and context-degrading iterative repairs by isolating subproblems, improving automated theorem proving for complex mathematical reasoning.

Abstract: Recent advances in large language models (LLMs) and LLM-based agents have substantially improved the capabilities of automated theorem proving. However, for problems requiring complex mathematical reasoning, current systems rarely succeed on the first try and must repeatedly modify their proof strategies. Existing approaches for handling failed attempts typically either discard the entire proof and regenerate it from scratch or iteratively fix errors within the proof. The former is inefficient, as it may abandon mostly correct reasoning due to localized errors, while the latter, although preserving prior progress, leads to progressively longer contexts which progressively degrades the model’s ability to attend to the remaining unresolved subproblems. To address this dilemma, we propose Mechanic, a novel agent system that employs a sorry-driven formal decomposition strategy. By leveraging the sorry placeholder in Lean to precisely isolate unresolved subgoals while preserving the surrounding verified proof structure, Mechanic extracts each failed subproblem into a clean, self-contained context and resolves it independently. This avoids both the waste of full regeneration and the excessive context length induced by repeated repairs. Experimental results on challenging mathematical competition benchmarks, including IMO 2025 and Putnam 2025, demonstrate that our agent achieves significant advantages in proving efficiency.

[75] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang

Main category: cs.CL

TL;DR: Self-distillation in LLMs for mathematical reasoning can degrade performance by suppressing epistemic verbalization (uncertainty expression), causing up to 40% performance drops despite shorter reasoning traces.

Details

Motivation: Self-distillation is effective for improving LLM performance and shortening reasoning traces, but its impact on mathematical reasoning specifically needs investigation, especially regarding how it affects the model's ability to express uncertainty during reasoning.

Method: Conducted controlled experiments varying conditioning context richness and task coverage across multiple models (Qwen3-8B, DeepSeek-Distill-Qwen-7B, Olmo3-7B-Instruct). Analyzed how self-distillation affects epistemic verbalization and performance in both in-domain and out-of-distribution mathematical reasoning tasks.

Result: Self-distillation reduces response length but degrades mathematical reasoning performance by up to 40%. Rich conditioning contexts suppress uncertainty expression, enabling rapid in-domain optimization but harming OOD performance where uncertainty expression is beneficial.

Conclusion: Exposing appropriate levels of uncertainty is crucial for robust reasoning. Optimizing reasoning behavior should go beyond reinforcing correct answer traces to preserve epistemic verbalization capabilities.

Abstract: Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model’s expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

[76] Representation Learning to Study Temporal Dynamics in Tutorial Scaffolding

Conrad Borchers, Jiayi Zhang, Ashish Gurung

Main category: cs.CL

TL;DR: An embedding-based method to measure scaffolding in tutoring dialogues by analyzing semantic alignment between tutor/student contributions and task content, applied to math tutoring data.

Details

Motivation: The field lacks robust methods for measuring adaptive scaffolding in authentic tutoring dialogue, which has become more pressing with the rise of remote human tutoring and LLM-based systems.

Method: Embedding-based approach that analyzes scaffolding dynamics by computing cosine similarity between tutor/student contributions and task-relevant content (problem statements and correct solutions), applied to 1,576 real-world math tutoring dialogues.

Result: Revealed systematic differences in task alignment and distinct temporal patterns; tutor contributions showed stronger grounding in problem content early in interactions; student solution alignment was modestly positively associated with progression; role-specific semantic alignment predicts tutorial progression beyond baseline features.

Conclusion: Scaffolding is a continuous, role-sensitive process grounded in task semantics; the approach provides a principled method for analyzing instructional dialogue and evaluating conversational tutoring systems.

Abstract: Adaptive scaffolding enhances learning, yet the field lacks robust methods for measuring it within authentic tutoring dialogue. This gap has become more pressing with the rise of remote human tutoring and large language model-based systems. We introduce an embedding-based approach that analyzes scaffolding dynamics by aligning the semantics of dialogue turns, problem statements, and correct solutions. Specifically, we operationalize alignment by computing cosine similarity between tutor and student contributions and task-relevant content. We apply this framework to 1,576 real-world mathematics tutoring dialogues from the Eedi Question Anchored Tutoring Dialogues dataset. The analysis reveals systematic differences in task alignment and distinct temporal patterns in how participants ground their contributions in problem and solution content. Further, mixed-effects models show that role-specific semantic alignment predicts tutorial progression beyond baseline features such as message order and length. Tutor contributions exhibited stronger grounding in problem content early in interactions. In contrast, student solution alignment was modestly positively associated with progression. These findings support scaffolding as a continuous, role-sensitive process grounded in task semantics. By capturing role-specific alignment over time, this approach provides a principled method for analyzing instructional dialogue and evaluating conversational tutoring systems.

[77] Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation

Soufiane Jhilal, Martina Galletti

Main category: cs.CL

TL;DR: AI-powered multilingual interface that automatically enhances text with visual scaffolding (pictograms) to support reading comprehension for children with special educational needs

Details

Motivation: Reading comprehension is challenging for children with Special Educational Needs and Disabilities (SEND), requiring intensive one-on-one support. There's a need to scale this support for therapists through automated assistance.

Method: Developed a multilingual AI-powered interface that dynamically identifies key concepts in text and maps them to contextually relevant pictograms. Evaluated across five languages (English, French, Italian, Spanish, Arabic) through multilingual coverage analysis, expert clinical review by speech therapists and special education professionals, and latency assessment.

Result: High pictogram coverage and visual scaffolding density across all five languages. Expert audits showed semantic appropriateness exceeding 95% for European languages and ~90% for Arabic. System latency remained within interactive thresholds suitable for real-time educational use.

Conclusion: The system demonstrates technical viability, semantic safety, and acceptability for automated multimodal scaffolding to improve accessibility for neurodiverse learners across multiple languages.

Abstract: Reading comprehension presents a significant challenge for children with Special Educational Needs and Disabilities (SEND), often requiring intensive one-on-one reading support. To assist therapists in scaling this support, we developed a multilingual, AI-powered interface that automatically enhances text with visual scaffolding. This system dynamically identifies key concepts and maps them to contextually relevant pictograms, supporting learners across languages. We evaluated the system across five typologically diverse languages (English, French, Italian, Spanish, and Arabic), through multilingual coverage analysis, expert clinical review by speech therapists and special education professionals, and latency assessment. Evaluation results indicate high pictogram coverage and visual scaffolding density across the five languages. Expert audits suggested that automatically selected pictograms were semantically appropriate, with combined correct and acceptable ratings exceeding 95% for the four European languages and approximately 90% for Arabic despite reduced pictogram repository coverage. System latency remained within interactive thresholds suitable for real-time educational use. These findings support the technical viability, semantic safety, and acceptability of automated multimodal scaffolding to improve accessibility for neurodiverse learners.

[78] MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Main category: cs.CL

TL;DR: MARCH is a multi-agent reinforcement learning framework that reduces hallucinations in LLMs by using information asymmetry between specialized agents to break confirmation bias cycles.

Details

Motivation: Hallucination remains a critical bottleneck for LLMs, especially in RAG systems. Existing hallucination detection methods using LLM-as-a-judge suffer from confirmation bias where verifiers reproduce the original generation's errors.

Method: MARCH orchestrates three specialized agents: Solver generates initial RAG response, Proposer decomposes it into atomic propositions, and Checker validates these against retrieved evidence in isolation (information asymmetry). The pipeline is trained with multi-agent reinforcement learning for co-evolution.

Result: Extensive experiments show MARCH substantially reduces hallucination rates. An 8B-parameter LLM with MARCH achieves performance competitive with powerful closed-source models.

Conclusion: MARCH provides a scalable path for factual self-improvement of LLMs through co-evolution, addressing confirmation bias in hallucination detection.

Abstract: Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver’s original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at https://github.com/Qwen-Applications/MARCH.

[79] Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, Tunazzina Islam

Main category: cs.CL

TL;DR: Domain-specific RAG system for AI policy analysis shows that improved retrieval doesn’t always translate to better QA performance, and can sometimes increase hallucinations when relevant documents are missing.

Details

Motivation: RAG systems struggle with reliability in complex policy domains with dense legal language and evolving regulations. The paper aims to study RAG applications specifically for AI governance and policy analysis using the AGORA corpus.

Method: Combines ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned using Direct Preference Optimization (DPO). Uses synthetic queries and collected pairwise preferences to adapt to the policy domain. Evaluates retrieval quality, answer relevance, and faithfulness.

Result: Domain-specific fine-tuning improves retrieval metrics but doesn’t consistently improve end-to-end QA performance. Stronger retrieval can sometimes lead to more confident hallucinations when relevant documents are absent from the corpus.

Conclusion: Improvements to individual RAG components don’t necessarily translate to more reliable answers in policy domains. Provides practical insights for designing grounded QA systems over dynamic regulatory corpora.

Abstract: Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.

[80] Evaluation of Large Language Models via Coupled Token Generation

Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez-Rodriguez

Main category: cs.CL

TL;DR: This paper introduces coupled autoregressive generation as a method to control for randomness in LLM evaluation, showing it requires fewer samples for benchmark evaluations and can lead to different model rankings in pairwise comparisons.

Details

Motivation: Current LLM evaluations are confounded by randomization in model responses, making it difficult to determine if performance differences are genuine or artifacts of random sampling. The authors argue for controlling randomness to ensure fair model comparisons.

Method: Develops a causal model for coupled autoregressive generation that allows different LLMs to sample responses with the same source of randomness. This enables controlled comparisons by synchronizing the random seed across models.

Result: Coupled generation requires up to 75% fewer samples than vanilla generation to reach the same conclusions on benchmark datasets. Surprisingly, coupled and vanilla generation can lead to different model rankings in pairwise comparisons even with infinite samples.

Conclusion: Existing LLM evaluation protocols may produce misleading rankings due to uncontrolled randomness. Coupled autoregressive generation provides a more rigorous evaluation framework that controls for this confounding factor.

Abstract: State of the art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite amount of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. To illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama, Mistral and Qwen families. We find that, across multiple benchmark datasets, coupled autoregressive generation requires up to 75% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, we find that the win-rates derived from pairwise comparisons by a strong large language model to prompts from the LMSYS Chatbot Arena platform differ under coupled and vanilla autoregressive generation.

[81] Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhikDastaavej

Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya

Main category: cs.CL

TL;DR: VidhikDastaavej dataset and Model-Agnostic Wrapper for structured legal document generation in Indian context

Details

Motivation: Automating legal document drafting to improve efficiency, addressing the scarcity of public datasets and complexity of long-form legal drafting, particularly in the Indian context

Method: Two-stage generation framework: first plans section structure, then generates each section with retrieval-based prompts; model-agnostic approach works with both open- and closed-source LLMs

Result: Wrapper substantially improves factual accuracy, coherence, and completeness compared to fine-tuned baselines; comprehensive evaluation shows effectiveness across lexical, semantic, LLM-based, and expert-driven assessments

Conclusion: Establishes new benchmark dataset and generalizable generation framework for AI-assisted legal drafting, paving way for future research in structured legal text generation

Abstract: Automating legal document drafting can improve efficiency and reduce the burden of manual legal work. Yet, the structured generation of private legal documents remains underexplored, particularly in the Indian context, due to the scarcity of public datasets and the complexity of adapting models for long-form legal drafting. To address this gap, we introduce VidhikDastaavej, a large-scale, anonymized dataset of private legal documents curated in collaboration with an Indian law firm. Covering 133 diverse categories, this dataset is the first resource of its kind and provides a foundation for research in structured legal text generation and Legal AI more broadly. We further propose a Model-Agnostic Wrapper (MAW), a two-stage generation framework that first plans the section structure of a legal draft and then generates each section with retrieval-based prompts. MAW is independent of any specific LLM, making it adaptable across both open- and closed-source models. Comprehensive evaluation, including lexical, semantic, LLM-based, and expert-driven assessments with inter-annotator agreement, shows that the wrapper substantially improves factual accuracy, coherence, and completeness compared to fine-tuned baselines. This work establishes both a new benchmark dataset and a generalizable generation framework, paving the way for future research in AI-assisted legal drafting.

[82] Disentangling Knowledge Representations for Large Language Model Editing

Mengqi Zhang, Zisheng Zhou, Xiaotian Ye, Qiang Liu, Zhaochun Ren, Zhumin Chen, Pengjie Ren

Main category: cs.CL

TL;DR: DiKE: A knowledge editing approach for LLMs that disentangles subject representations to preserve fine-grained irrelevant knowledge while updating target knowledge.

Details

Motivation: Existing knowledge editing methods fail to preserve fine-grained irrelevant knowledge - facts that share the same subject as edited knowledge but differ in relation/object - because subject representations encode multiple attributes, causing entanglement in representation space.

Method: DiKE consists of: 1) Knowledge Representation Disentanglement (KRD) module that decomposes subject representation into target-knowledge-related and unrelated components, and 2) Disentanglement-based Knowledge Edit (DKE) module that updates only target-related component while preserving unrelated one. Uses closed-form rank-one parameter update based on matrix theory for efficient editing.

Result: Extensive experiments across multiple LLMs show DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance. Introduces FINE-KED benchmark for evaluation.

Conclusion: DiKE addresses the limitation of existing knowledge editing methods by disentangling knowledge representations, enabling targeted updates while preserving fine-grained irrelevant knowledge through a theoretically grounded, efficient approach.

Abstract: Knowledge Editing has emerged as a promising solution for efficiently updating embedded knowledge in large language models (LLMs). While existing approaches demonstrate effectiveness in integrating new knowledge and preserving the original capabilities of LLMs, they fail to maintain fine-grained irrelevant knowledge, namely facts that share the same subject as edited knowledge but differ in relation and object. This challenge arises because subject representations inherently encode multiple attributes, causing the target and fine-grained irrelevant knowledge to become entangled in the representation space, and thus vulnerable to unintended alterations during editing. To address this, we propose DiKE, a novel approach that Disentangles Knowledge representations for LLM Editing (DiKE). DiKE consists of two key components: a Knowledge Representation Disentanglement (KRD) module that decomposes the subject representation into target-knowledge-related and -unrelated components, and a Disentanglementbased Knowledge Edit (DKE) module that updates only the target-related component while explicitly preserving the unrelated one. We further derive a closedform, rank-one parameter update based on matrix theory to enable efficient and minimally invasive edits. To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs demonstrate that DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance.

[83] From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu

Main category: cs.CL

TL;DR: TtT is a unified audio-text framework that combines autoregressive text generation with non-autoregressive audio diffusion in a single Transformer, using modality-aware attention and block-wise diffusion for efficient speech-to-speech systems.

Details

Motivation: Existing multimodal models for audio-text interleaved scenarios rely solely on autoregressive methods, which overlook the fundamental difference that text depends on target-target relations while audio depends mainly on source-target relations. This limitation motivates a hybrid approach.

Method: Proposes Text-to-Talk (TtT) framework integrating AR text generation with NAR audio diffusion using absorbing discrete diffusion. Uses modality-aware attention for causal text decoding and bidirectional audio modeling. Implements three training strategies to reduce train-test discrepancies and employs block-wise diffusion for parallel audio synthesis.

Result: Comprehensive experiments on Audio-QA, ASR, AAC and speech-to-speech benchmarks show TtT consistently surpasses strong AR and NAR baselines. Ablation studies confirm the contribution of each component.

Conclusion: TtT provides an effective unified framework for audio-text multimodal systems by combining AR text generation with NAR audio diffusion, addressing the different relational dependencies between modalities while maintaining efficiency and performance.

Abstract: Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates AR text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order AR property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Comprehensive experiments on Audio-QA, ASR, AAC and speech-to-speech benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each component. We will open-source our models, data and code to facilitate future research in this direction.

[84] KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning

Yinyi Luo, Zhexian Zhou, Hao Chen, Kai Qiu, Marios Savvides, Sharon Li, Jindong Wang

Main category: cs.CL

TL;DR: KnowledgeSmith is a unified framework for systematically studying knowledge updating mechanisms in LLMs, treating editing and unlearning as constrained optimization problems to analyze knowledge propagation across different graph levels and data scales.

Details

Motivation: The paper aims to address the lack of systematic understanding about how LLMs update knowledge, particularly comparing knowledge editing and machine unlearning approaches, and investigating whether LLMs update knowledge similarly to humans across different knowledge levels.

Method: Proposes KnowledgeSmith framework that: 1) Formulates editing and unlearning as constrained optimization problems, 2) Creates automatic dataset generator for structured interventions across multiple graph levels and data scales, 3) Enables controlled studies of modification strategy propagation through model knowledge.

Result: Extensive experiments reveal nuanced insights: LLMs don’t update knowledge similarly to humans across different knowledge levels, existence of consistency-capacity trade-off, and detailed understanding of knowledge propagation, plasticity scaling, consistency, and robustness.

Conclusion: The findings provide systematic understanding of LLM knowledge updating mechanisms and offer suggestions for designing more reliable and scalable knowledge updating strategies for large language models.

Abstract: Knowledge editing and machine unlearning are two popular approaches for large language models (LLMs) to stay up-to-date. However, the knowledge updating mechanism of LLMs remains largely unexplored due to insufficient, isolated, and small-scale evaluation. For instance, are LLMs similar to humans in modifying certain knowledge? What differs editing and unlearning as training data increases? This paper proposes KnowledgeSmith, a unified framework to systematically understand the updating mechanism of LLMs. We first cast editing and unlearning as instances of one constrained optimization problem. Then, we propose an automatic dataset generator that provides structured interventions across multiple graph levels and data scales, enabling controlled studies of how different modification strategies propagate through model knowledge. Extensive experiments demonstrate nuanced insights over knowledge propagation, plasticity scaling, consistency, and robustness. For instance, our results show that LLMs do not exhibit similar updating as humans for different levels of knowledge, and there exists consistency-capacity trade-off. We hope our findings can offer suggestions to the design of more reliable and scalable strategies. Code: https://github.com/AIFrontierLab/KnowledgeSmith.git

[85] FedSRD: Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning

Guochen Yan, Luyuan Xie, Qingni Shen, Yuejian Fang, Zhonghai Wu

Main category: cs.CL

TL;DR: FedSRD: A communication-efficient federated learning framework for LLM fine-tuning that reduces communication costs by 90% while improving performance on heterogeneous data.

Details

Motivation: Federated learning enables privacy-preserving collaborative fine-tuning of LLMs on decentralized private data, but faces communication bottlenecks due to structural redundancy in LoRA parameters and heterogeneous network conditions.

Method: Proposes FedSRD (Sparsify-Reconstruct-Decompose) framework with importance-aware sparsification to reduce upload parameters, server aggregation in full-rank space to mitigate conflicts, and decomposition into sparse low-rank format for broadcast. Also introduces FedSRD-e variant for reduced computational overhead.

Result: Experiments on 10 benchmarks show the framework reduces communication costs by up to 90% while improving performance on heterogeneous client data.

Conclusion: FedSRD provides an efficient solution for federated LLM fine-tuning that addresses communication bottlenecks while maintaining performance on decentralized data.

Abstract: The current paradigm of training large language models (LLMs) on public available Web data is becoming unsustainable as high-quality data sources in specialized domains near exhaustion. Federated Learning (FL) emerges as a practical solution for the next generation of AI on a decentralized Web, enabling privacy-preserving collaborative fine-tuning on decentralized private data. While Low-Rank Adaptation (LoRA) is standard for efficient fine-tuning, its federated application faces a critical bottleneck: communication overhead under heterogeneous network conditions. Structural redundancy in LoRA parameters increases communication costs and causes aggregation conflicts. To address this, we propose FedSRD, a Sparsify-Reconstruct-Decompose framework for communication-efficient federated LLM fine-tuning. We introduce importance-aware sparsification to reduce the upload parameter count while preserving the structural integrity of LoRA updates. The server aggregates updates in full-rank space to mitigate conflicts, then decomposes the global update into a sparse low-rank format for broadcast, ensuring a symmetrically efficient cycle. We also propose an efficient variant, FedSRD-e, to reduce computational overhead. Experiments on 10 benchmarks show our framework significantly reduces communication costs by up to 90% while improving performance on heterogeneous client data.

[86] How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu, Yiheng Xu, Yuantao Fan, Wanxiang Che

Main category: cs.CL

TL;DR: TC-Bench: A novel framework for evaluating LLM-generated test cases using optimal diagnostic basis selection from wrong codes, addressing score inflation and computational costs in existing benchmarks.

Details

Motivation: Existing benchmarks for evaluating LLM-generated test cases suffer from high computational costs, score inflation, and bias toward common trivial bugs while missing rare critical faults. There's a need for a more efficient and accurate evaluation framework.

Method: Formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. Introduces WrongSelect algorithm to select maximally diverse wrong codes, creating TC-Bench - a compact, diverse benchmark from competitive programming submissions.

Result: Even advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing significant gaps in diagnostic power and highlighting substantial room for improvement.

Conclusion: The framework provides a principled approach to benchmark construction that is inflation-resistant and computationally efficient, revealing limitations in current LLM-based test generation methods.

Abstract: Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks often evaluate the exclusion ratio on large, unstructured collections of wrong codes, suffering from high computational costs and score inflation. Furthermore, they inadvertently reward generators that detect common, trivial bugs, while failing to penalize their inability to identify rare yet critical faults. In this work, we connect two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a novel framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix, where rows represent wrong codes and columns represent test case results. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power and highlighting substantial room for future improvement. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.

[87] You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs

Yijie Xu, Huizai Yao, Zhiyu Guo, Pengteng Li, Aiwei Liu, Xuming Hu, Weiyu Guo, Hui Xiong

Main category: cs.CL

TL;DR: SyTTA: A label-free test-time adaptation framework for LLMs that uses input perplexity and output entropy to adapt models on-the-fly during inference without additional supervision.

Details

Motivation: LLMs face distribution shifts when deployed in specialized domains (finance, medicine, agriculture), but domain-specific fine-tuning requires expensive labeled data. Need for adaptation methods that work without supervision in expertise-limited settings.

Method: SyTTA couples two uncertainty signals: input-side perplexity (indicating mismatch with domain-specific terminology/patterns) and output-side predictive entropy (indicating diffuse/unstable token probabilities). Uses these signals for on-the-fly adaptation during inference without labeled examples.

Result: Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains. On agricultural question answering, improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query.

Conclusion: Effective test-time adaptation for language models is achievable without labeled examples, supporting deployment in label-scarce domains. The framework is efficient and practical for real-world applications.

Abstract: Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture, where they face significant distribution shifts from their training data. Domain-specific fine-tuning can mitigate this challenge but relies on high-quality labeled data that is expensive and slow to collect in expertise-limited settings. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision. SyTTA couples two complementary uncertainty signals that arise under distribution shift: input-side perplexity, indicating mismatch with domain-specific terminology and patterns, and output-side predictive entropy, indicating diffuse and unstable token probabilities during generation. Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains. Notably, on agricultural question answering, SyTTA improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query. These results show that effective test-time adaptation for language models is achievable without labeled examples, supporting deployment in label-scarce domains. The code will be made available upon acceptance.

[88] Is Multilingual LLM Watermarking Truly Multilingual? Scaling Robustness to 100+ Languages via Back-Translation

Asim Mohamed, Martin Gubri

Main category: cs.CL

TL;DR: STEAM introduces a multilingual watermark detection method using Bayesian optimization to search for optimal back-translation languages, improving robustness across low-resource languages where existing methods fail.

Details

Motivation: Current multilingual watermarking methods claim cross-lingual robustness but only work well for high-resource languages, failing under translation attacks for medium- and low-resource languages due to semantic clustering issues when tokenizers lack sufficient full-word tokens.

Method: STEAM uses Bayesian optimization to search among 133 candidate languages for the back-translation that best recovers watermark strength. It’s compatible with any watermarking method, non-invasive, and easily extendable to new languages.

Result: STEAM achieves average gains of +0.23 AUC and +37% TPR@1%, providing scalable and fairer watermarking across diverse languages, including low-resource ones.

Conclusion: STEAM offers a robust, scalable solution for multilingual watermark detection that addresses the limitations of existing methods, particularly for low-resource languages, making watermarking more equitable across linguistic diversity.

Abstract: Multilingual watermarking aims to make large language model (LLM) outputs traceable across languages, yet current methods still fall short. Despite claims of cross-lingual robustness, they are evaluated only on high-resource languages. We show that existing multilingual watermarking methods are not truly multilingual: they fail to remain robust under translation attacks in medium- and low-resource languages. We trace this failure to semantic clustering, which fails when the tokenizer vocabulary contains too few full-word tokens for a given language. To address this, we introduce STEAM, a detection method that uses Bayesian optimisation to search among 133 candidate languages for the back-translation that best recovers the watermark strength. It is compatible with any watermarking method, robust across different tokenizers and languages, non-invasive, and easily extendable to new languages. With average gains of +0.23 AUC and +37% TPR@1%, STEAM provides a scalable approach toward fairer watermarking across the diversity of languages.

[89] Quantification and object perception in Multimodal Large Language Models and human linguistic cognition

Raquel Montero, Natalia Moskvina, Paolo Morosi, Tamara Serrano, Elena Pagliarini, Evelina Leivada

Main category: cs.CL

TL;DR: LLMs struggle with linguistic quantification, showing differences from humans in quantifier scales, usage ranges, and numerical biases, with thinking models performing better but still having key gaps.

Details

Motivation: Quantification is a difficult linguistic phenomenon for MLLMs, but the exact reasons are unclear since it interfaces with logic, pragmatics, and numerical domains. The paper aims to explore three unexplored features of human quantification in LLMs: quantifier ordering into scales, usage ranges/prototypicality, and biases from the human approximate number system.

Method: Investigates how three key features of human quantification are encoded in LLM architectures, comparing thinking vs. instruct models across different languages. Examines quantifier scales, ranges of use, prototypicality values, and numerical biases to determine differences from human performance.

Result: Thinking models showed high accuracy in numerosity estimation and organizing quantifiers into scales, but key differences remain between humans and LLMs across all model types, particularly in ranges of use and prototypicality values.

Conclusion: This work paves the way for addressing MLLMs as semantic and pragmatic agents, while cross-linguistic analysis can reveal whether their abilities are robust and stable across different languages, despite current gaps in quantification understanding.

Abstract: Quantification has been proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models (MLLMs). However, given that quantification interfaces with the logic, pragmatic, and numerical domains, the exact reasons for the poor performance are still unclear. This paper looks at three key features of human quantification shared cross-linguistically that have remained so far unexplored in the (M)LLM literature: the ordering of quantifiers into scales, the ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models’ architecture, how they may differ from humans, and whether the results are affected by the type of model (thinking vs. instruct) and the language under investigation. Results show that although thinking models showed a high accuracy in the numerosity estimation task and in the organization of quantifiers into scales, there are still key differences between humans and LLMs across all model types, particularly in terms of ranges of use and prototypicality values. This work, thus, paves the way for addressing the nature of MLLMs as semantic and pragmatic agents, while the cross-linguistic lens can elucidate whether their abilities are robust and stable across different languages.

Edward Ajayi, Martha Kachweka, Mawuli Deku, Emily Aiken

Main category: cs.CL

TL;DR: A unified multiclass classification framework for detecting 10 mental health and cyberbullying categories from social media data, using fine-tuned transformers with human-in-the-loop explainability.

Details

Motivation: Address the growing prevalence of mental health challenges and cyberbullying in digital spaces by creating scalable, interpretable detection systems that can assist human moderators.

Method: Curated datasets from Twitter and Reddit, implemented “split-then-balance” pipeline, compared traditional lexical models, hybrid approaches, and fine-tuned transformers. Used domain-adapted MentalBERT and introduced hybrid SHAPLLM explainability framework with prototype dashboard.

Result: End-to-end fine-tuning proved critical, with MentalBERT achieving 0.92 accuracy and 0.76 Macro F1 score, outperforming generic models and zero-shot LLM baselines. Created practical “Social Media Screener” dashboard for moderators.

Conclusion: The framework provides a robust baseline for mental health and cyberbullying detection, emphasizing human-in-the-loop screening rather than diagnosis, with future needs for multi-label clinically-validated datasets.

Abstract: Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous “split-then-balance” pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAPLLM explainability framework and present a prototype dashboard (“Social Media Screener”) designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.

[91] Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support

Raunak Jain

Main category: cs.CL

TL;DR: The paper proposes Collaborative Causal Sensemaking (CCS) as a research agenda to develop LLM-based agents that can co-construct causal explanations with humans, rather than just being answer engines, to improve human-AI team performance in high-stakes decision-making.

Details

Motivation: Current LLM-based agents trained as answer engines fail to achieve reliable complementarity with humans in high-stakes decision-making settings. There's a fundamental mismatch between how agents are trained (as answer engines) and how experts actually make decisions (through collaborative sensemaking involving causal explanation co-construction, uncertainty surfacing, and goal adaptation).

Method: Proposes Collaborative Causal Sensemaking (CCS) as a research agenda with three components: 1) New training environments that reward collaborative thinking, 2) Representations for shared human-AI mental models, and 3) Evaluation centered on trust and complementarity rather than just answer accuracy.

Result: The paper presents a conceptual framework and research agenda rather than empirical results. It argues for shifting from oracle-like answer engines to AI teammates that co-reason with humans over causal structures of shared decisions.

Conclusion: To advance effective human-AI teams, research should focus on developing agents’ collaborative sensemaking capabilities through CCS, moving beyond answer engines to cultivate AI teammates that can co-reason with human partners about causal structures in decision-making.

Abstract: LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. Taken together, these directions shift MAS research from building oracle-like answer engines to cultivating AI teammates that co-reason with their human partners over the causal structure of shared decisions, advancing the design of effective human-AI teams.

[92] ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection

Tao Liu, Taiqiang Wu, Runming Yang, Shaoning Sun, Junjie Wang, Yujiu Yang

Main category: cs.CL

TL;DR: ProFit is a novel SFT method that selectively masks low-probability tokens to prevent overfitting to surface-level expressions, improving LLM alignment without needing multiple reference answers.

Details

Motivation: Traditional SFT forces alignment with single reference answers, ignoring language's one-to-many nature and causing overfitting to non-core expressions. While multiple references could help, they're prohibitively expensive in data and computation costs.

Method: ProFit leverages the insight that high-probability tokens carry core logical framework while low-probability tokens are replaceable expressions. It selectively masks low-probability tokens during SFT to prevent surface-level overfitting.

Result: Extensive experiments show ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.

Conclusion: ProFit offers an effective solution to single-reference overfitting in SFT by strategically masking low-probability tokens, achieving better alignment without costly multiple reference datasets.

Abstract: Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, leading to the model overfitting to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To achieve this, we reveal the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.

[93] ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation

Jesus-German Ortiz-Barajas, Jonathan Tonglet, Vivek Gupta, Iryna Gurevych

Main category: cs.CL

TL;DR: ChartAttack is a framework for evaluating how multimodal LLMs can be misused to generate misleading charts at scale, with AttackViz dataset for testing robustness.

Details

Motivation: Multimodal LLMs are increasingly used for chart generation from data tables, but this introduces new misuse risks where models can be exploited to create misleading visualizations that induce incorrect data interpretations.

Method: ChartAttack framework injects misleaders into chart designs to induce incorrect interpretations. AttackViz dataset contains chart QA pairs labeled with effective misleaders and their induced incorrect answers for evaluation.

Result: ChartAttack significantly degrades QA performance, reducing MLLM accuracy by 17.2 points in-domain and 11.9 cross-domain. Human results show 20.2-point accuracy drop. AttackViz can be used to fine-tune MLLMs to improve robustness.

Conclusion: There’s an urgent need for robustness and security considerations in MLLM-based chart generation systems. The framework highlights vulnerabilities and provides tools for improving model resilience against misleading visualizations.

Abstract: Multimodal large language models (MLLMs) are increasingly used to automate chart generation from data tables, enabling efficient data analysis and reporting but also introducing new misuse risks. In this work, we introduce ChartAttack, a novel framework for evaluating how MLLMs can be misused to generate misleading charts at scale. ChartAttack injects misleaders into chart designs, aiming to induce incorrect interpretations of the underlying data. Furthermore, we create AttackViz, a chart question-answering (QA) dataset where each (chart specification, QA) pair is labeled with effective misleaders and their induced incorrect answers. ChartAttack significantly degrades QA performance, reducing MLLM accuracy by 17.2 points in-domain and 11.9 cross-domain. Preliminary human results (limited sample size) indicate a 20.2-point accuracy drop. Finally, we demonstrate that AttackViz can be used to fine-tune MLLMs to improve robustness against misleading charts. Our findings highlight an urgent need for robustness and security considerations in the design, evaluation, and deployment of MLLM-based chart generation systems. We make our code and data publicly available.

[94] From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making

Raunak Jain

Main category: cs.CL

TL;DR: Paper proposes shifting from LLM answer generation to collaborative premise governance for reliable human-AI partnership, using discrepancy-driven control loops and commitment gating to prevent sycophantic agreement in high-stakes decisions.

Details

Motivation: As LLMs move from assistance to decision support, they exhibit dangerous sycophantic behavior - fluent agreement without calibrated judgment. This amplifies poor commitments in deep-uncertainty decisions where objectives are contested and reversals are costly, pushing verification costs onto experts.

Method: Proposes collaborative premise governance over a knowledge substrate with discrepancy-driven control loops: detecting conflicts, localizing misalignment via typed discrepancies (teleological, epistemic, procedural), and triggering bounded negotiation through decision slices. Uses commitment gating to block action on uncommitted load-bearing premises and value-gated challenge allocation.

Result: Theoretical framework illustrated with tutoring examples, proposing trust attaches to auditable premises and evidence standards rather than conversational fluency. Includes falsifiable evaluation criteria for implementation.

Conclusion: Reliable human-AI partnership requires fundamental shift from answer generation to collaborative premise governance, with systematic mechanisms to prevent sycophantic agreement and ensure transparent, accountable decision-making in high-stakes scenarios.

Abstract: As LLMs expand from assistance to decision support, a dangerous pattern emerges: fluent agreement without calibrated judgment. Low-friction assistants can become sycophantic, baking in implicit assumptions and pushing verification costs onto experts, while outcomes arrive too late to serve as reward signals. In deep-uncertainty decisions (where objectives are contested and reversals are costly), scaling fluent agreement amplifies poor commitments faster than it builds expertise. We argue reliable human-AI partnership requires a shift from answer generation to collaborative premise governance over a knowledge substrate, negotiating only what is decision-critical. A discrepancy-driven control loop operates over this substrate: detecting conflicts, localizing misalignment via typed discrepancies (teleological, epistemic, procedural), and triggering bounded negotiation through decision slices. Commitment gating blocks action on uncommitted load-bearing premises unless overridden under logged risk; value-gated challenge allocates probing under interaction cost. Trust then attaches to auditable premises and evidence standards, not conversational fluency. We illustrate with tutoring and propose falsifiable evaluation criteria.

[95] Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

Md Tanvir Rouf Shawon, Mohammad Sabik Irbaz, Hadeel R. A. Elyazori, Keerti Reddy Resapu, Yili Lin, Vladimir Franzuela Cardenas, Farrokh Alemi, Kevin Lybarger

Main category: cs.CL

TL;DR: A patient simulator for evaluating healthcare conversational agents that generates realistic, controllable interactions varying across medical, linguistic, and behavioral dimensions to assess AI performance risks across diverse populations.

Details

Motivation: To enable scalable, automated evaluation of healthcare conversational agents by creating a simulator that can systematically test AI performance across diverse patient populations with varying medical conditions, health literacy levels, and behavioral patterns, addressing the need for comprehensive risk assessment in healthcare AI deployment.

Method: The simulator integrates three profile components: (1) medical profiles from All of Us EHR data using risk-ratio gating, (2) linguistic profiles modeling health literacy and condition-specific communication, and (3) behavioral profiles representing cooperative, distracted, and adversarial engagement. Grounded in NIST AI Risk Management Framework, it was evaluated against an AI Decision Aid for antidepressant selection across 500 simulated conversations.

Result: The simulator revealed monotonic degradation in AI Decision Aid performance across health literacy levels: Rank-1 concept retrieval ranged from 47.6% (limited literacy) to 81.9% (proficient literacy). Medical concept fidelity was high (96.6% across 8,210 concepts), validated by human annotators (0.73 kappa) and LLM judge (0.78 kappa). Behavioral profiles were reliably distinguished (0.93 kappa), and linguistic profiles showed moderate agreement (0.61 kappa).

Conclusion: The simulator exposes measurable performance risks in conversational healthcare AI, with health literacy emerging as a primary risk factor with direct implications for equitable AI deployment. The approach enables systematic evaluation of AI systems across diverse patient populations.

Abstract: Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents, generating realistic, controllable interactions that systematically vary across medical, linguistic, and behavioral dimensions to support risk assessment across populations. Methods: Grounded in the NIST AI Risk Management Framework, the simulator integrates three profile components: (1) medical profiles constructed from All of Us electronic health records using risk-ratio gating; (2) linguistic profiles modeling health literacy and condition-specific communication; and (3) behavioral profiles representing cooperative, distracted, and adversarial engagement. Profiles were evaluated against NIST AI RMF trustworthiness requirements and assessed against an AI Decision Aid for antidepressant selection. Results: Across 500 simulated conversations, the simulator revealed monotonic degradation in AI Decision Aid performance across health literacy levels: Rank-1 concept retrieval ranged from 47.6% (limited) to 81.9% (proficient), with corresponding recommendation degradation. Medical concept fidelity was high (96.6% across 8,210 concepts), validated by human annotators (0.73 kappa) and an LLM judge with comparable agreement (0.78 kappa). Behavioral profiles were reliably distinguished (0.93 kappa), and linguistic profiles showed moderate agreement (0.61 kappa). Conclusions: The simulator exposes measurable performance risks in conversational healthcare AI. Health literacy emerged as a primary risk factor with direct implications for equitable AI deployment.

[96] FHIRPath-QA: Executable Question Answering over FHIR Electronic Health Records

Michael Frew, Nishit Bheda, Bryan Tripp

Main category: cs.CL

TL;DR: FHIRPath-QA introduces a text-to-FHIRPath query synthesis approach for patient-specific clinical question answering using EHRs, reducing hallucinations and computational costs compared to retrieval-based LLM methods.

Details

Motivation: Existing EHR interfaces don't support precise patient-specific QA, and retrieval-based LLM approaches are inefficient, hallucination-prone, and difficult to deploy on real EHRs.

Method: Proposes text-to-FHIRPath QA paradigm that shifts reasoning from free-text generation to FHIRPath query synthesis, using FHIRPath-QA dataset with 14k+ natural language questions paired with validated FHIRPath queries and answers.

Result: Reduced average token usage by 391x (629,829 vs 1,609 tokens per question) and lowered failure rates from 0.36 to 0.09 on clinician-phrased questions; LLMs achieved at most 42% accuracy but improved to 79% with supervised fine-tuning.

Conclusion: Text-to-FHIRPath synthesis offers practical foundation for safe, efficient, interoperable consumer health applications, with FHIRPath-QA dataset serving as benchmark for future research.

Abstract: Though patients are increasingly granted digital access to their electronic health records (EHRs), existing interfaces may not support precise, trustworthy answers to patient-specific questions. Large language models (LLM) show promise in clinical question answering (QA), but retrieval-based approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real-life EHRs. This work introduces FHIRPath-QA, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data. A text-to-FHIRPath QA paradigm is proposed that shifts reasoning from free-text generation to FHIRPath query synthesis. For o4-mini, this reduced average token usage by 391x relative to retrieval-first prompting (629,829 vs 1,609 tokens per question) and lowered failure rates from 0.36 to 0.09 on clinician-phrased questions. Built on MIMIC-IV on FHIR Demo, the dataset pairs over 14k natural language questions in patient and clinician phrasing with validated FHIRPath queries and answers. Empirically, the evaluated LLMs achieve at most 42% accuracy, highlighting the challenge of the task, but benefit strongly from supervised fine-tuning, with query synthesis accuracy improving from 27% to 79% for 4o-mini. These results highlight that text-to-FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and the FHIRPath-QA dataset and benchmark serve as a starting point for future research on the topic. The full dataset and generation code can be accessed at: https://github.com/mooshifrew/fhirpath-qa.

[97] IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation

Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi, Spencer Romo, Akhil Nooney, David Kaleko, Boyi Xie, Bob Strahan, Diego A. Socolinsky

Main category: cs.CL

TL;DR: IDP Accelerator is a framework for intelligent document processing using multimodal LLMs and agentic AI to extract structured data from complex document packets with high accuracy and efficiency.

Details

Motivation: Traditional NLP pipelines fail to handle multi-document packets, complex reasoning, and strict compliance requirements in industrial document processing, necessitating a more intelligent solution.

Method: Four-component framework: 1) DocSplit benchmark dataset with multimodal classifier using BIO tagging, 2) configurable extraction module using multimodal LLMs, 3) Agentic Analytics Module with secure code execution, 4) Rule Validation Module with LLM-driven compliance checks.

Result: Production deployment at healthcare provider achieved 98% classification accuracy, 80% reduced processing latency, and 77% lower operational costs compared to legacy baselines.

Conclusion: IDP Accelerator enables effective end-to-end document intelligence across industries through multimodal LLMs and agentic AI, with demonstrated real-world success and open-source availability.

Abstract: Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging to segment complex document packets; (2) configurable Extraction Module leveraging multimodal LLMs to transform unstructured content into structured data; (3) Agentic Analytics Module, compliant with the Model Context Protocol (MCP) providing data access through secure, sandboxed code execution; and (4) Rule Validation Module replacing deterministic engines with LLM-driven logic for complex compliance checks. The interactive demonstration enables users to upload document packets, visualize classification results, and explore extracted data through an intuitive web interface. We demonstrate effectiveness across industries, highlighting a production deployment at a leading healthcare provider achieving 98% classification accuracy, 80% reduced processing latency, and 77% lower operational costs over legacy baselines. IDP Accelerator is open-sourced with a live demonstration available to the community.

[98] Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2512.16917: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16917&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[99] Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering

Xufei Lv, Jiahui Yang, Haoyuan Sun, Xialin Su, Zhiliang Tian, Yifu Gao, Linbo Qiao, Houde Liu

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2603.01853 suggests it’s from March 2025, but no content available for analysis.

Details

Motivation: Cannot determine motivation without paper content. The arXiv ID format suggests this is a recent paper from March 2025.

Method: Cannot determine method without paper content. The arXiv API returned a rate limiting error (HTTP 429).

Result: No results available due to API rate limiting preventing access to paper details.

Conclusion: Unable to analyze paper due to technical limitations in accessing the content from arXiv.

Abstract: Failed to fetch summary for 2603.01853: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01853&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[100] AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed data retrieval

Method: Unable to determine method due to failed data retrieval

Result: Unable to determine results due to failed data retrieval

Conclusion: Unable to draw conclusions due to failed data retrieval

Abstract: Failed to fetch summary for 2603.12564: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12564&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[101] Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Md. Asraful Haque, Aasar Mehdi, Maaz Mahboob, Tamkeen Fatima

Main category: cs.CL

TL;DR: Paper 2603.17872 summary unavailable due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to summary fetch failure

Method: Unable to determine method due to summary fetch failure

Result: Unable to determine results due to summary fetch failure

Conclusion: Unable to draw conclusions due to summary fetch failure

Abstract: Failed to fetch summary for 2603.17872: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17872&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[102] PrefPO: Pairwise Preference Prompt Optimization

Rahul Singhal, Pradyumna Tambwekar, Karime Maamari

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.19311: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19311&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[103] JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs

Taihei Shiotani, Masahiro Kaneko, Ayana Niwa, Yuki Maruyama, Daisuke Oba, Masanari Ohi, Naoaki Okazaki

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.20581: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20581&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[104] Alignment Whack-a-Mole : Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to draw conclusions due to access error

Abstract: Failed to fetch summary for 2603.20957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[105] TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning

Christian Greisinger, Steffen Eger

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.03072: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03072&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[106] Phrase-Instance Alignment for Generalized Referring Segmentation

E-Ro Nguyen, Hieu Le, Dimitris Samaras, Michael S. Ryoo

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when trying to access arXiv API

Details

Motivation: The paper ID 2411.15087 was requested but could not be retrieved due to technical limitations

Method: N/A - Paper content unavailable due to API rate limiting error

Result: Failed to obtain paper summary or abstract

Conclusion: Cannot analyze paper without access to its content; arXiv API rate limiting prevents retrieval

Abstract: Failed to fetch summary for 2411.15087: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.15087&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[107] Linguistic Comparison of AI- and Human-Written Responses to Online Mental Health Queries

Koustuv Saha, Yoshee Jain, Violeta J. Rodriguez, Munmun De Choudhury

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper content.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2504.09271: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.09271&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[108] Explainable embeddings with Distance Explainer

Christiaan Meijer, E. G. Patrick Bos

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2505.15516: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.15516&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[109] EHR2Path: Scalable Modeling of Longitudinal Patient Pathways from Multimodal Electronic Health Records

Chantal Pellegrini, Ege Özsoy, David Bani-Harouni, Matthias Keicher, Nassir Navab

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2506.04831: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.04831&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[110] Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra, Shangtong Zhang, Yanjun Qi

Main category: cs.CL

TL;DR: Unable to analyze paper 2506.06303 due to HTTP 429 error when fetching summary from arXiv API

Details

Motivation: Cannot determine motivation without access to paper abstract

Method: Cannot determine method without access to paper abstract

Result: Cannot determine results without access to paper abstract

Conclusion: Cannot draw conclusions without access to paper abstract

Abstract: Failed to fetch summary for 2506.06303: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06303&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[111] COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation

Uliana Parkina, Maxim Rakhuba

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2507.07580: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.07580&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[112] Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.11224: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11224&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[113] Offline-First Large Language Model Architecture for AI-Assisted Learning with Adaptive Response Levels in Low-Connectivity Environments

Joseph Walusimbi, Ann Move Oguti, Joshua Benjamin Ssentongo, Keith Ainebyona

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.03339: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03339&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[114] EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.12252: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12252&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[115] Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

Siddharth Srikanth, Freddie Liang, Ya-Chuan Hsu, Varun Bhatt, Shihan Zhao, Henry Chen, Bryon Tjanaka, Minjune Hwang, Akanksha Saran, Daniel Seita, Aaquib Tabrez, Stefanos Nikolaidis

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.12510: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12510&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[116] PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

Hyoseok Park, Yeonsang Park

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.21576: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21576&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[117] Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits

Eric Czech, Zhiwei Xu, Yael Elmatad, Yixin Wang, William Held

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access limitations

Method: Unable to determine method due to access limitations

Result: Unable to determine results due to access limitations

Conclusion: Unable to determine conclusion due to access limitations

Abstract: Failed to fetch summary for 2603.22339: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22339&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.CV

[118] LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

Royden Wagner, Omer Sahin Tas, Jaime Villa, Felix Hauser, Yinzhe Shen, Marlon Steiner, Dominik Strutz, Carlos Fernandez, Christian Kinzig, Guillermo S. Guitierrez-Cabello, Hendrik Königshof, Fabian Immel, Richard Schwarzkopf, Nils Alexander Rack, Kevin Rösch, Kaiwen Wang, Jan-Hendrik Pauls, Martin Lauer, Igor Gilitschenski, Holger Caesar, Christoph Stiller

Main category: cs.CV

TL;DR: A new multimodal dataset for end-to-end driving focusing on long-tail events, with multi-view video, trajectories, instructions, and multilingual reasoning traces to study how reasoning affects driving competence.

Details

Motivation: Address the fundamental challenge of generalization to rare scenarios in real-world domains like self-driving by creating a dataset specifically focused on long-tail driving events.

Method: Introduce a new dataset with multi-view video data, trajectories, high-level instructions, and detailed reasoning traces in English, Spanish, and Chinese from domain experts with diverse cultural backgrounds.

Result: Created a benchmark for multimodal models (VLMs and VLAs) that evaluates instruction following and semantic coherence beyond traditional safety and comfort metrics, available as a public dataset.

Conclusion: The dataset provides a unique resource for studying how different forms of reasoning affect driving competence and enables in-context learning and few-shot generalization for multimodal models.

Abstract: In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as VLMs and VLAs, goes beyond safety and comfort metrics by evaluating instruction following and semantic coherence between model outputs. The multilingual reasoning traces in English, Spanish, and Chinese are from domain experts with diverse cultural backgrounds. Thus, our dataset is a unique resource for studying how different forms of reasoning affect driving competence. Our dataset is available at: https://hf.co/datasets/kit-mrt/kitscenes-longtail

Alexandre Symeonidis-Herzig, Jianhe Low, Ozge Mercanoglu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: SMPL-FX combines FLAME’s facial expression space with SMPL-X body model, and M3T uses multi-modal FSQ-VAEs with autoregressive transformer for state-of-the-art sign language production including non-manual features.

Details

Motivation: Sign language production requires non-manual features (mouthings, eyebrow raises, gaze, head movements) which are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face limitations: standard body models have low-dimensional facial spaces, and richer representations suffer from codebook collapse in discrete tokenization.

Method: Propose SMPL-FX which couples FLAME’s rich expression space with SMPL-X body model. Tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary with an auxiliary translation objective that encourages semantically grounded embeddings.

Result: Achieves state-of-the-art sign language production quality across three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T). On NMFs-CSL, where signs are distinguishable only by non-manual features, reaches 58.3% accuracy vs 49.0% for strongest comparable pose baseline.

Conclusion: The proposed SMPL-FX model with multi-modal FSQ-VAE tokenization and M3T transformer effectively addresses limitations in sign language production by incorporating rich non-manual features, achieving superior performance especially in cases where non-manual features are crucial for sign distinction.

Abstract: Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them: the standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME’s rich expression space with the SMPL-X body, and tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective that encourages semantically grounded embeddings. Across three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T) M3T achieves state-of-the-art sign language production quality, and on NMFs-CSL, where signs are distinguishable only by non-manual features, reaches 58.3% accuracy against 49.0% for the strongest comparable pose baseline.

[120] Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection

Sa Zhu, Wanqian Zhang, Lin Wang, Xiaohua Chen, Chenxu Cui, Jinchao Zhang, Bo Li

Main category: cs.CV

TL;DR: PDA framework for Open-Vocabulary Temporal Action Detection uses phase-wise decomposition and alignment to transfer knowledge from seen to unseen action classes by leveraging LLM reasoning for phase-level descriptions and adaptive visual-textual matching.

Details

Motivation: Previous OV-TAD methods rely on global alignment between label-level semantics and visual features, which is insufficient for transferring temporal consistent visual knowledge from seen to unseen action classes. There's a need for fine-grained action pattern learning to enable effective prior knowledge transfer.

Method: Proposes Phase-wise Decomposition and Alignment (PDA) framework with three modules: 1) CoT-Prompting Semantic Decomposition (CSD) uses LLM chain-of-thought reasoning to decompose action labels into phase-level descriptions, 2) Text-infused Foreground Filtering (TIF) adaptively filters action-relevant segments using phase-wise semantic cues, and 3) Adaptive Phase-wise Alignment (APA) performs phase-level visual-textual matching and aggregates results across phases.

Result: Extensive experiments on two OV-TAD benchmarks demonstrate the superiority of the proposed method over previous approaches, showing significant enhancement in generalization to unseen actions.

Conclusion: The PDA framework enables fine-grained action pattern learning through phase-wise decomposition and alignment, facilitating transferable action pattern capture and significantly improving generalization to unseen action categories in temporal action detection.

Abstract: Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporal consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter action-relevant segments for each phase leveraging phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module to perform phase-level visual-textual matching, and adaptively aggregates alignment results across phases for final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrated the superiority of the proposed method.

[121] Ukrainian Visual Word Sense Disambiguation Benchmark

Yurii Laba, Yaryna Mohytych, Ivanna Rohulia, Halyna Kyryleyza, Hanna Dydyk-Meush, Oles Dobosevych, Rostyslav Hryniv

Main category: cs.CV

TL;DR: A benchmark for Visual Word Sense Disambiguation in Ukrainian was created following established methodology, revealing that current multilingual multimodal models underperform compared to CLIP baseline and show significant performance gap between Ukrainian and English.

Details

Motivation: To extend Visual-WSD evaluation to Ukrainian language and enable cross-language model performance comparisons, addressing the lack of Ukrainian-specific multimodal benchmarks.

Method: Semi-automated data collection with expert refinement following established Visual-WSD benchmark methodology; evaluation of eight multilingual multimodal LLMs against CLIP baseline.

Result: All tested models performed worse than zero-shot CLIP baseline; significant performance gap observed between Ukrainian and English Visual-WSD tasks.

Conclusion: Current multilingual multimodal models struggle with Ukrainian Visual-WSD, highlighting language-specific challenges and need for improved cross-lingual multimodal understanding.

Abstract: This study presents a benchmark for evaluating the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian. The main goal of the Visual-WSD task is to identify, with minimal contextual information, the most appropriate representation of a given ambiguous word from a set of ten images. To construct this benchmark, we followed a methodology similar to that proposed by (CITATION), who previously introduced benchmarks for the Visual-WSD task in English, Italian, and Farsi. This approach allows us to incorporate the Ukrainian benchmark into a broader framework for cross-language model performance comparisons. We collected the benchmark data semi-automatically and refined it with input from domain experts. We then assessed eight multilingual and multimodal large language models using this benchmark. All tested models performed worse than the zero-shot CLIP-based baseline model (CITATION) used by (CITATION) for the English Visual-WSD task. Our analysis revealed a significant performance gap in the Visual-WSD task between Ukrainian and English.

[122] EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu

Main category: cs.CV

TL;DR: EditMGT: A Masked Generative Transformer-based image editing framework that uses cross-attention maps for precise localization and region-hold sampling to confine edits to target areas while preserving non-target regions.

Details

Motivation: Diffusion models for image editing conflate local editing targets with full-image context, causing unintended modifications in non-target regions. The paper explores Masked Generative Transformers as an alternative approach with inherent localized decoding capabilities.

Method: 1) Uses MGT’s cross-attention maps for localization with multi-layer attention consolidation; 2) Introduces region-hold sampling to restrict token flipping to low-attention areas; 3) Adapts pre-trained text-to-image MGT into editing model via attention injection; 4) Trained on CrispEdit-2M dataset spanning 7 editing categories.

Result: Achieves similarity performance with 6x faster editing using <1B parameters. Improves style change by 3.6% and style transfer by 17.6% on standard benchmarks. Delivers comparable or superior editing quality while better preserving non-target regions.

Conclusion: MGTs offer a promising alternative to diffusion models for image editing, with inherent localized decoding capabilities that enable precise, targeted edits while preserving surrounding areas, plus faster inference.

Abstract: Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT’s cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

[123] Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting

Peiyu Xu, Xin Sun, Krishna Mullia, Raymond Fei, Iliyan Georgiev, Shuang Zhao

Main category: cs.CV

TL;DR: Differentiable, sorting-free stochastic ray tracing for 3D Gaussian Splatting that uses Monte Carlo sampling instead of sorting all Gaussians per ray, enabling faster reconstruction and rendering with better quality for relightable scenes.

Details

Motivation: Current ray-tracing-based 3DGS methods are slower due to the cost of sorting all intersecting Gaussians along every ray, and existing methods still rely on rasterization-style approximations like shadow mapping for relightable scenes, undermining the generality of ray tracing.

Method: Proposes a differentiable, sorting-free stochastic formulation using an unbiased Monte Carlo estimator for pixel-color gradients that evaluates only a small sampled subset of Gaussians per ray, bypassing the need for sorting. For relightable 3DGS, the same stochastic estimator drives per-Gaussian shading with fully ray-traced shadow rays.

Result: For standard 3DGS, matches reconstruction quality and speed of rasterization-based 3DGS while substantially outperforming sorting-based ray tracing. For relightable 3DGS, delivers notably higher reconstruction fidelity than prior work.

Conclusion: The method provides a practical, efficient ray-tracing framework for 3DGS that eliminates sorting bottlenecks and enables high-quality relightable scene reconstruction with fully ray-traced shadows.

Abstract: Ray-tracing-based 3D Gaussian splatting (3DGS) methods overcome the limitations of rasterization – rigid pinhole camera assumptions, inaccurate shadows, and lack of native reflection or refraction – but remain slower due to the cost of sorting all intersecting Gaussians along every ray. Moreover, existing ray-tracing methods still rely on rasterization-style approximations such as shadow mapping for relightable scenes, undermining the generality that ray tracing promises. We present a differentiable, sorting-free stochastic formulation for ray-traced 3DGS – the first framework that uses stochastic ray tracing to both reconstruct and render standard and relightable 3DGS scenes. At its core is an unbiased Monte Carlo estimator for pixel-color gradients that evaluates only a small sampled subset of Gaussians per ray, bypassing the need for sorting. For standard 3DGS, our method matches the reconstruction quality and speed of rasterization-based 3DGS while substantially outperforming sorting-based ray tracing. For relightable 3DGS, the same stochastic estimator drives per-Gaussian shading with fully ray-traced shadow rays, delivering notably higher reconstruction fidelity than prior work.

[124] λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy

Federico Carrara, Talley Lambert, Mehdi Seifi, Florian Jug

Main category: cs.CV

TL;DR: λSplit is a physics-informed deep generative model for spectral unmixing in fluorescence microscopy that combines hierarchical VAE with differentiable spectral mixing to learn structural priors and achieve state-of-the-art performance.

Details

Motivation: Classical spectral unmixing methods degrade with overlapping emission spectra and noise; existing learning-based approaches are either not optimized for microscopy or too specific. Need for data-driven approach that learns structural priors for fluorescence microscopy.

Method: Physics-informed deep generative model using hierarchical Variational Autoencoder with differentiable Spectral Mixer that enforces consistency with image formation process while learning structural priors.

Result: State-of-the-art performance on 66 challenging benchmarks across 3 real-world datasets, outperforming 10 baseline methods in high noise regimes, overlapping spectra, and low spectral dimensionality scenarios.

Conclusion: λSplit enables robust spectral unmixing with implicit noise removal, compatible with standard confocal microscopes without hardware modifications, establishing new state-of-the-art for fluorescence microscopy.

Abstract: In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed fluorophore emissions. Since classical methods operate pixel-wise and rely on least-squares fitting, their performance degrades with increasingly overlapping emission spectra and higher levels of noise, suggesting that a data-driven approach that can learn and utilize a structural prior might lead to improved results. Learning-based approaches for spectral imaging do exist, but they are either not optimized for microscopy data or are developed for very specific cases that are not applicable to fluorescence microscopy settings. To address this, we propose λSplit, a physics-informed deep generative model that learns a conditional distribution over concentration maps using a hierarchical Variational Autoencoder. A fully differentiable Spectral Mixer enforces consistency with the image formation process, while the learned structural priors enable state-of-the-art unmixing and implicit noise removal. We demonstrate λSplit on 3 real-world datasets that we synthetically cast into a total of 66 challenging spectral unmixing benchmarks. We compare our results against a total of 10 baseline methods, including classical methods and a range of learning-based methods. Our results consistently show competitive performance and improved robustness in high noise regimes, when spectra overlap considerably, or when the spectral dimensionality is lowered, making λSplit a new state-of-the-art for spectral unmixing of fluorescent microscopy data. Importantly, λSplit is compatible with spectral data produced by standard confocal microscopes, enabling immediate adoption without specialized hardware modifications.

[125] Tiny Inference-Time Scaling with Latent Verifiers

Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Main category: cs.CV

TL;DR: VHS is a verifier that operates directly on diffusion transformer hidden states instead of decoding to pixel space, reducing verification cost while maintaining or improving performance compared to MLLM verifiers.

Details

Motivation: Current MLLM-based verifiers for inference-time scaling require costly decoding of diffusion model outputs to pixel space and re-encoding, creating redundant operations and high computational overhead.

Method: Proposes Verifier on Hidden States (VHS) that analyzes intermediate hidden representations of Diffusion Transformer single-step generators directly, avoiding pixel space conversion.

Result: VHS reduces joint generation-and-verification time by 63.3%, compute FLOPs by 51%, VRAM usage by 14.5% compared to standard MLLM verifiers, while achieving +2.7% improvement on GenEval at same inference budget.

Conclusion: Operating directly on generator hidden states enables efficient inference-time scaling without sacrificing verification quality, making verification more practical for real-world deployment.

Abstract: Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51% and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.

[126] Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge

Masoumeh Chapariniya, Aref Farhadipour, Sarah Ebling, Volker Dellwo, Teodora Vukovic

Main category: cs.CV

TL;DR: A multimodal emotion recognition system combining six encoder families through late probability fusion achieves competitive results on the BLEMORE Challenge, with key findings about layer selection in audio models and personalized expression thresholds.

Details

Motivation: To develop an effective system for the BLEMORE Challenge on blended emotion recognition with relative salience prediction, addressing the need for robust multimodal approaches that can handle diverse emotional expressions across different modalities.

Method: Combines six encoder families through late probability fusion: S4D-ViTMoE face encoder with soft-label KL training, frozen layer-selective Wav2Vec2 audio features (layers 6-12), finetuned body-language encoders (TimeSformer, VideoMAE), and Gemini Embedding 2.0 for multimodal video embeddings. Uses ensemble weighting where task-adapted encoders receive 62% of weight.

Result: Achieved Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on test set, placing 6th in the challenge. Key findings: selective audio layer freezing outperforms end-to-end finetuning (0.207 vs 0.161), personalized expression thresholds vary widely (0.05-0.43), and Gemini Embedding 2.0 produces competitive accuracy from only 2-second inputs.

Conclusion: The system demonstrates effective multimodal fusion for emotion recognition, with insights about audio model adaptation for non-verbal content and the importance of personalized expression modeling. Task-specific encoders significantly outperform general-purpose baselines in ensemble weighting.

Abstract: We present our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. Our approach combines six encoder families through late probability fusion: an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, finetuned body-language encoders (TimeSformer, VideoMAE), and – for the first time in emotion recognition – Gemini Embedding 2.0, a large multimodal model whose video embeddings produce competitive presence accuracy (ACCP = 0.320) from only 2 seconds of input. Three key findings emerge from our experiments: selecting prosody-encoding layers (6–12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161), as the non-verbal nature of BLEMORE audio makes phonetic layers irrelevant; the post-processing salience threshold $β$ varies from 0.05 to 0.43 across folds, revealing that personalized expression styles are the primary bottleneck; and task-adapted encoders collectively receive 62% of ensemble weight over general-purpose baselines. Our 12-encoder system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on the test set, placing 6th.

[127] Vision-Language Models vs Human: Perceptual Image Quality Assessment

Imran Mehmood, Imad Ali Shah, Ming Ronnier Luo, Brian Deegan

Main category: cs.CV

TL;DR: VLMs show variable performance on perceptual image quality assessment tasks, with strong human alignment on colorfulness but poor performance on contrast, revealing attribute-dependent capabilities and tradeoffs between self-consistency and human alignment.

Details

Motivation: To investigate whether Vision Language Models can approximate human perceptual judgments for image quality assessment, addressing the limitations of costly psychophysical experiments while leveraging VLMs' potential for scalable quality evaluation.

Method: Systematic benchmarking of six VLMs (four proprietary, two open-weight) against psychophysical data across three image quality scales: contrast, colorfulness, and overall preference, including attribute weighting analysis and intramodel consistency evaluation.

Result: VLMs show strong attribute-dependent variability - high human alignment for colorfulness (ρ up to 0.93) but underperformance on contrast, with most models weighting colorfulness higher than contrast for overall preference similar to human data. Most self-consistent models aren’t most human-aligned, and human-VLM agreement increases with perceptual separability.

Conclusion: VLMs demonstrate promising but attribute-specific capabilities for perceptual IQA, with complex tradeoffs between self-consistency and human alignment, suggesting their reliability depends on stimulus differences and scene-dependent perceptual cues.

Abstract: Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness and overall preference. Six VLMs four proprietary and two openweight models are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute dependent variability models with high human alignment for colorfulness (ρup to 0.93) underperform on contrast and vice-versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness compared to contrast when evaluating overall preference similar to the psychophysical data. Intramodel consistency analysis reveals a counterintuitive tradeoff: the most self consistent models are not necessarily the most human aligned suggesting response variability reflects sensitivity to scene dependent perceptual cues. Furthermore, human-VLM agreement is increased with perceptual separability, indicating VLMs are more reliable when stimulus differences are clearly expressed.

[128] Estimating Individual Tree Height and Species from UAV Imagery

Jannik Endres, Etienne Laliberté, David Rolnick, Arthur Ouaknine

Main category: cs.CV

TL;DR: BIRCH-Trees benchmark for individual tree height/species estimation from UAV images, with DINOvTree method using Vision Foundation Model backbone achieving best results with fewer parameters.

Details

Motivation: Accurate forest biomass estimation requires tree-level traits like height and species. UAVs with RGB cameras offer cost-effective, scalable tree mapping, but lack standardized benchmarks for individual tree analysis from aerial imagery.

Method: Introduces BIRCH-Trees benchmark spanning three forest datasets (temperate, tropical, boreal). Presents DINOvTree approach using Vision Foundation Model backbone with task-specific heads for simultaneous height regression and species classification.

Result: DINOvTree achieves top overall results with accurate height predictions and competitive classification accuracy while using only 54-58% of parameters compared to second-best approach. Outperforms biological allometric equations and other vision methods.

Conclusion: BIRCH-Trees provides valuable benchmark for tree analysis from UAV imagery. DINOvTree demonstrates effectiveness of Vision Foundation Models for simultaneous tree trait estimation with parameter efficiency, advancing scalable forest monitoring.

Abstract: Accurate estimation of forest biomass, a major carbon sink, relies heavily on tree-level traits such as height and species. Unoccupied Aerial Vehicles (UAVs) capturing high-resolution imagery from a single RGB camera offer a cost-effective and scalable approach for mapping and measuring individual trees. We introduce BIRCH-Trees, the first benchmark for individual tree height and species estimation from tree-centered UAV images, spanning three datasets: temperate forests, tropical forests, and boreal plantations. We also present DINOvTree, a unified approach using a Vision Foundation Model (VFM) backbone with task-specific heads for simultaneous height and species prediction. Through extensive evaluations on BIRCH-Trees, we compare DINOvTree against commonly used vision methods, including VFMs, as well as biological allometric equations. We find that DINOvTree achieves top overall results with accurate height predictions and competitive classification accuracy while using only 54% to 58% of the parameters of the second-best approach.

[129] Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection

Shreen Gul, Mohamed Elmahallawy, Ardhendu Tripathy, Sanjay Madria

Main category: cs.CV

TL;DR: Proposes multi-layer feature aggregation for OOD detection, challenging the assumption that only penultimate-layer features are most informative, achieving improved performance across diverse architectures.

Details

Motivation: Existing OOD detection methods rely primarily on penultimate-layer activations, assuming they contain the most informative in-distribution representations. The authors challenge this assumption by showing that intermediate layers encode equally rich and discriminative information for OOD detection.

Method: A model-agnostic approach that aggregates features from successive convolutional blocks, computes class-wise mean embeddings, applies L2 normalization to form compact ID prototypes, and uses cosine similarity between test features and prototypes as OOD score during inference.

Result: Extensive experiments on state-of-the-art OOD benchmarks show robust, architecture-agnostic performance with improvements of up to 4.41% in AUROC and 13.58% reduction in FPR, demonstrating strong generalization for image classification.

Conclusion: Multi-layer feature aggregation is a powerful yet underexplored signal for OOD detection that challenges the dominance of penultimate-layer-based methods, offering improved robustness and generalization across diverse architectures.

Abstract: Deep learning models are increasingly deployed in safety-critical applications, where reliable out-of-distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate-layer activations of neural networks, assuming they encapsulate the most informative in-distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model-agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class-wise mean embeddings, and applies L_2 normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score–ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state-of-the-art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture-agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi-layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate-layer-based methods. Our code is available at: https://github.com/sgchr273/cosine-layers.git.

[130] MoCHA: Denoising Caption Supervision for Motion-Text Retrieval

Nikolai Warner, Cameron Ethan Taylor, Irfan Essa, Apaar Sadhwani

Main category: cs.CV

TL;DR: MoCHA is a text canonicalization framework for text-motion retrieval that projects captions onto their motion-recoverable content to reduce embedding variance and improve alignment.

Details

Motivation: Standard contrastive training for text-motion retrieval treats each caption as a single positive target, overlooking that captions are samples from a distribution of valid descriptions. This includes both motion-recoverable semantics and annotator-specific style/context, causing within-motion embedding variance that weakens alignment.

Method: Proposes MoCHA, a text canonicalization framework that projects each caption onto its motion-recoverable content before encoding. Presents two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference. Works as a preprocessing step compatible with any retrieval architecture.

Result: Sets new SOTA on HumanML3D (13.9% T2M R@1, +3.1pp) and KIT-ML (24.3% T2M R@1, +10.3pp). Reduces within-motion text-embedding variance by 11-19%. Improves cross-dataset transfer substantially: H to K by 94% and K to H by 52%.

Conclusion: Canonicalization reduces embedding variance and improves motion-language alignment. Standardizing the language space yields more transferable motion-language representations, with learned canonicalizers providing substantial gains over rule-based methods.

Abstract: Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.

[131] AdvSplat: Adversarial Attacks on Feed-Forward Gaussian Splatting Models

Yiran Qiao, Yiren Lu, Yunlai Zhou, Rui Yang, Linlin Hou, Yu Yin, Jing Ma

Main category: cs.CV

TL;DR: AdvSplat: First systematic study of adversarial attacks on feed-forward 3D Gaussian Splatting models, revealing vulnerabilities and developing query-efficient black-box attack methods.

Details

Motivation: Feed-forward 3DGS models enable fast 3D reconstruction but use neural networks as backbones, amplifying adversarial manipulation risks. The paper aims to systematically study these vulnerabilities as an overlooked security challenge in this emerging domain.

Method: First employs white-box attacks to reveal fundamental vulnerabilities, then develops two query-efficient black-box algorithms: one based on gradient estimation and another gradient-free approach, both optimizing pixel-space perturbations via frequency-domain parameterization without requiring model internals access.

Result: Extensive experiments across multiple datasets show AdvSplat can significantly disrupt reconstruction results by injecting imperceptible perturbations into input images, demonstrating the security vulnerabilities of feed-forward 3DGS models.

Conclusion: The study surfaces an urgent security and robustness problem in feed-forward 3DGS, highlighting adversarial vulnerabilities that need community attention for this commercially promising technology.

Abstract: 3D Gaussian Splatting (3DGS) is increasingly recognized as a powerful paradigm for real-time, high-fidelity 3D reconstruction. However, its per-scene optimization pipeline limits scalability and generalization, and prevents efficient inference. Recently emerged feed-forward 3DGS models address these limitations by enabling fast reconstruction from a few input views after large-scale pretraining, without scene-specific optimization. Despite their advantages and strong potential for commercial deployment, the use of neural networks as the backbone also amplifies the risk of adversarial manipulation. In this paper, we introduce AdvSplat, the first systematic study of adversarial attacks on feed-forward 3DGS. We first employ white-box attacks to reveal fundamental vulnerabilities of this model family. We then develop two improved, practically relevant, query-efficient black-box algorithms that optimize pixel-space perturbations via a frequency-domain parameterization: one based on gradient estimation and the other gradient-free, without requiring any access to model internals. Extensive experiments across multiple datasets demonstrate that AdvSplat can significantly disrupt reconstruction results by injecting imperceptible perturbations into the input images. Our findings surface an overlooked yet urgent problem in this domain, and we hope to draw the community’s attention to this emerging security and robustness challenge.

[132] CoRe: Joint Optimization with Contrastive Learning for Medical Image Registration

Eytan Kats, Christoph Grossbroehmer, Ziad Al-Haj Hemidi, Fenja Falta, Wiebke Heyer, Mattias P. Heinrich

Main category: cs.CV

TL;DR: Proposes a novel medical image registration framework that integrates equivariant contrastive learning directly into the registration model to learn robust feature representations invariant to tissue deformations.

Details

Motivation: Medical image registration faces challenges from intensity inconsistencies and nonlinear tissue deformations. While self-supervised representation learning shows promise for generating robust anatomical embeddings, current approaches separate pre-training from registration. The authors aim to directly integrate contrastive learning into the registration framework for better alignment.

Method: Proposes a framework that integrates equivariant contrastive learning directly into the registration model. Uses contrastive learning to learn robust feature representations invariant to tissue deformations. Jointly optimizes contrastive and registration objectives to ensure learned representations are both informative and suitable for registration tasks.

Result: Evaluated on abdominal and thoracic image registration tasks, including intra-patient and inter-patient scenarios. Experimental results show significant performance improvement over strong baseline methods, demonstrating the effectiveness of integrating contrastive learning directly into the registration framework.

Conclusion: Direct integration of equivariant contrastive learning into medical image registration framework significantly improves performance by learning robust feature representations that are invariant to tissue deformations and suitable for registration tasks.

Abstract: Medical image registration is a fundamental task in medical image analysis, enabling the alignment of images from different modalities or time points. However, intensity inconsistencies and nonlinear tissue deformations pose significant challenges to the robustness of registration methods. Recent approaches leveraging self-supervised representation learning show promise by pre-training feature extractors to generate robust anatomical embeddings, that farther used for the registration. In this work, we propose a novel framework that integrates equivariant contrastive learning directly into the registration model. Our approach leverages the power of contrastive learning to learn robust feature representations that are invariant to tissue deformations. By jointly optimizing the contrastive and registration objectives, we ensure that the learned representations are not only informative but also suitable for the registration task. We evaluate our method on abdominal and thoracic image registration tasks, including both intra-patient and inter-patient scenarios. Experimental results demonstrate that the integration of contrastive learning directly into the registration framework significantly improves performance, surpassing strong baseline methods.

[133] Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks

Morui Zhu, Yongqi Zhu, Song Fu, Qing Yang

Main category: cs.CV

TL;DR: dCAP is a vision-based framework for dynamic calibration and articulated perception in autonomous trucking, using transformers to estimate 6-DoF relative pose between tractor and trailer cameras, improving 3D object detection with dynamic extrinsics.

Details

Motivation: Autonomous trucking faces unique challenges from articulated tractor-trailer geometry and time-varying sensor poses due to the fifth-wheel joint and trailer flex. Existing methods assume static baselines or require high-parallax scenes, limiting reliability in real-world settings.

Method: dCAP uses a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency for continuous 6-DoF relative pose estimation. Integrated with BEVFormer, it replaces static calibration with dynamically predicted extrinsics for 3D object detection.

Result: Experiments show dCAP achieves stable, accurate perception while addressing limitations of static calibration. The framework is evaluated on STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry.

Conclusion: dCAP provides a robust vision-based solution for dynamic calibration in articulated vehicles, enabling more reliable perception for autonomous trucking. The dataset, development kit, and source code will be publicly released.

Abstract: Autonomous trucking poses unique challenges due to articulated tractor-trailer geometry, and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate evaluation, we introduce STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry across diverse environments. Experiments demonstrate that dCAP achieves stable, accurate perception while addressing the limitations of static calibration in autonomous trucking. The dataset, development kit, and source code will be publicly released.

[134] Bi-CRCL: Bidirectional Conservative-Radical Complementary Learning with Pre-trained Foundation Models for Class-incremental Medical Image Analysis

Xinyao Wu, Zhe Xu, Cheng Chen, Jiawei Ma, Yefeng Zheng, Raymond Kai-yu Tong

Main category: cs.CV

TL;DR: A dual-learner framework (Bi-CRCL) for class-incremental learning in medical imaging that combines conservative and radical learners to balance stability and plasticity while preventing catastrophic forgetting.

Details

Motivation: Medical image-guided diagnosis requires scalable systems that can learn new disease categories over time without forgetting previous knowledge, but this is challenging due to heterogeneous data, privacy constraints preventing memory replay, and anatomical complexity. While pretrained foundation models have advanced CIL in general domains, their application to medical imaging remains underexplored.

Method: Proposes Bidirectional Conservative-Radical Complementary Learning (Bi-CRCL), a dual-learner framework inspired by complementary learning systems. It includes: 1) a conservative learner that preserves prior knowledge through stability-oriented updates, 2) a radical learner that rapidly adapts to new categories via plasticity-oriented learning, and 3) a bidirectional interaction mechanism enabling forward transfer and backward consolidation. During inference, outputs from both learners are adaptively fused.

Result: Experiments on five medical imaging datasets demonstrate consistent improvements over state-of-the-art methods under diverse settings, including cross-dataset shifts and varying task configurations.

Conclusion: Bi-CRCL effectively addresses the challenges of class-incremental learning in medical imaging by balancing stability and plasticity through complementary learners, enabling scalable clinical deployment of diagnostic systems that can continuously learn new disease categories.

Abstract: Class-incremental learning (CIL) in medical image-guided diagnosis requires retaining prior diagnostic knowledge while adapting to newly emerging disease categories, which is critical for scalable clinical deployment. This problem is particularly challenging due to heterogeneous data and privacy constraints that prevent memory replay. Although pretrained foundation models (PFMs) have advanced general-domain CIL, their potential in medical imaging remains underexplored, where domain-specific adaptation is essential yet difficult due to anatomical complexity and inter-institutional heterogeneity. To address this gap, we conduct a systematic benchmark of recent PFM-based CIL methods and propose Bidirectional Conservative-Radical Complementary Learning (Bi-CRCL), a dual-learner framework inspired by complementary learning systems. Bi-CRCL integrates a conservative learner that preserves prior knowledge through stability-oriented updates and a radical learner that rapidly adapts to new categories via plasticity-oriented learning. A bidirectional interaction mechanism enables forward transfer and backward consolidation, allowing continual integration of new knowledge while mitigating catastrophic forgetting. During inference, outputs from both learners are adaptively fused for robust predictions. Experiments on five medical imaging datasets demonstrate consistent improvements over state-of-the-art methods under diverse settings, including cross-dataset shifts and varying task configurations.

[135] An Adapter-free Fine-tuning Approach for Tuning 3D Foundation Models

Sneha Paul, Zachary Patterson, Nizar Bouguila

Main category: cs.CV

TL;DR: MCFT is an adapter-free fine-tuning method for point cloud models that selectively fine-tunes encoder portions while preserving pre-trained representations via momentum consistency, maintaining original parameter count and inference efficiency.

Details

Motivation: Point cloud foundation models struggle with downstream task adaptation in low-data regimes. Full fine-tuning causes overfitting and representation drift, while parameter-efficient methods add inference latency. Need a method that bridges this gap without extra parameters.

Method: MCFT selectively fine-tunes portions of pre-trained encoder while enforcing momentum-based consistency constraints to preserve task-agnostic representations. No additional parameters beyond standard task head. Two variants: semi-supervised framework using unlabeled data, and pruning-based variant for computational efficiency.

Result: Outperforms prior methods with 3.30% gain in 5-shot settings and up to 6.13% improvement with semi-supervised learning. Maintains original model parameter count and inference efficiency, suitable for resource-constrained deployment.

Conclusion: MCFT effectively bridges gap between full and parameter-efficient fine-tuning for point cloud models, achieving strong performance in low-data regimes while preserving inference efficiency and pre-trained representations.

Abstract: Point cloud foundation models demonstrate strong generalization, yet adapting them to downstream tasks remains challenging in low-data regimes. Full fine-tuning often leads to overfitting and significant drift from pre-trained representations, while existing parameter-efficient fine-tuning (PEFT) methods mitigate this issue by introducing additional trainable components at the cost of increased inference-time latency. We propose Momentum-Consistency Fine-Tuning (MCFT), an adapter-free approach that bridges the gap between full and parameter-efficient fine-tuning. MCFT selectively fine-tunes a portion of the pre-trained encoder while enforcing a momentum-based consistency constraint to preserve task-agnostic representations. Unlike PEFT methods, MCFT introduces no additional representation learning parameters beyond a standard task head, maintaining the original model’s parameter count and inference efficiency. We further extend MCFT with two variants: a semi-supervised framework that leverages abundant unlabeled data to enhance few-shot performance, and a pruning-based variant that improves computational efficiency through structured layer removal. Extensive experiments on object recognition and part segmentation benchmarks demonstrate that MCFT consistently outperforms prior methods, achieving a 3.30% gain in 5-shot settings and up to a 6.13% improvement with semi-supervised learning, while remaining well-suited for resource-constrained deployment.

[136] Detection and Classification of (Pre)Cancerous Cells in Pap Smears: An Ensemble Strategy for the RIVA Cervical Cytology Challenge

Lautaro Kogan, María Victoria Ríos

Main category: cs.CV

TL;DR: YOLOv11m-based ensemble approach for cervical cell detection in Pap smear images using loss reweighting, data resampling, and transfer learning strategies to address class imbalance and nuclear overlap challenges.

Details

Motivation: Automated detection and classification of cervical cells in Pap smear images can improve cervical cancer screening by reducing manual workload, improving triage, and increasing consistency, but faces challenges of severe class imbalance and frequent nuclear overlap.

Method: Uses YOLOv11m as base architecture with three strategies: loss reweighting, data resampling, and transfer learning. Builds ensemble by combining models trained under each strategy using Weighted Boxes Fusion (WBF) to promote complementary detection behavior.

Result: Ensemble achieves mAP50-95 of 0.201 on preliminary test set and 0.147 on final test set, representing 29% improvement over best individual model on final test set.

Conclusion: Combining complementary imbalance mitigation strategies through ensemble learning effectively improves cervical cell detection performance in challenging Pap smear images with class imbalance and nuclear overlap.

Abstract: Automated detection and classification of cervical cells in conventional Pap smear images can strengthen cervical cancer screening at scale by reducing manual workload, improving triage, and increasing consistency across readers. However, it is challenged by severe class imbalance and frequent nuclear overlap. We present our approach to the RIVA Cervical Cytology Challenge (ISBI 2026), which requires multi-class detection of eight Bethesda cell categories under these conditions. Using YOLOv11m as the base architecture, we systematically evaluate three strategies to improve detection performance: loss reweighting, data resampling and transfer learning. We build an ensemble by combining models trained under each strategy, promoting complementary detection behavior and combining them through Weighted Boxes Fusion (WBF). The ensemble achieves a mAP50-95 of 0.201 on the preliminary test set and 0.147 on the final test set, representing a 29% improvement over the best individual model on the final test set and demonstrating the effectiveness of combining complementary imbalance mitigation strategies.

[137] IJmond Industrial Smoke Segmentation Dataset

Yen-Chia Hsu, Despoina Touska

Main category: cs.CV

TL;DR: A dataset for industrial smoke segmentation published on figshare under CC BY 4.0 license

Details

Motivation: To provide a specialized dataset for industrial smoke segmentation tasks, addressing the need for domain-specific visual data in environmental monitoring and industrial safety applications

Method: Dataset creation and publication approach - the dataset is hosted on figshare with a specific DOI and licensed under Creative Commons Attribution 4.0

Result: A publicly available dataset for industrial smoke segmentation accessible via https://doi.org/10.21942/uva.31847188

Conclusion: The dataset provides a valuable resource for researchers working on smoke detection and segmentation in industrial environments

Abstract: This report describes a dataset for industrial smoke segmentation, published on a figshare repository (https://doi.org/10.21942/uva.31847188). The dataset is licensed under CC BY 4.0.

[138] Learning Cross-Joint Attention for Generalizable Video-Based Seizure Detection

Omar Zamzam, Takfarinas Medani, Chinmay Chinara, Richard Leahy

Main category: cs.CV

TL;DR: A joint-centric attention model for seizure detection from clinical videos that focuses on body dynamics to improve cross-subject generalization by suppressing background bias and appearance cues.

Details

Motivation: Existing video-based seizure detection methods struggle with cross-subject generalization due to background bias and reliance on subject-specific appearance cues, limiting their clinical utility for real-time monitoring and reducing manual review time.

Method: Proposes a joint-centric attention model that detects body joints, extracts joint-centered clips to suppress background, tokenizes them using Video Vision Transformer (ViViT), and learns cross-joint attention to model spatial-temporal interactions between body parts for seizure pattern recognition.

Result: Extensive cross-subject experiments show the method consistently outperforms state-of-the-art CNN-, graph-, and transformer-based approaches on unseen subjects.

Conclusion: Focusing on body dynamics through joint-centric attention improves cross-subject generalization for seizure detection, making automated video-based monitoring more clinically viable.

Abstract: Automated seizure detection from long-term clinical videos can substantially reduce manual review time and enable real-time monitoring. However, existing video-based methods often struggle to generalize to unseen subjects due to background bias and reliance on subject-specific appearance cues. We propose a joint-centric attention model that focuses exclusively on body dynamics to improve cross-subject generalization. For each video segment, body joints are detected and joint-centered clips are extracted, suppressing background context. These joint-centered clips are tokenized using a Video Vision Transformer (ViViT), and cross-joint attention is learned to model spatial and temporal interactions between body parts, capturing coordinated movement patterns characteristic of seizure semiology. Extensive cross-subject experiments show that the proposed method consistently outperforms state-of-the-art CNN-, graph-, and transformer-based approaches on unseen subjects.

[139] Semantic Iterative Reconstruction: One-Shot Universal Anomaly Detection

Ning Zhu

Main category: cs.CV

TL;DR: SIR is a universal anomaly detection framework for medical imaging that works with extremely few normal samples across diverse domains without task-specific retraining.

Details

Motivation: Medical anomaly detection is limited by scarcity of normal training samples and lack of cross-modality generalization in existing methods that require dedicated models per dataset/disease.

Method: Uses pretrained teacher encoder for multi-scale deep features, compact up-then-down decoder with multi-loop iterative refinement to enforce normality priors in feature space. Single model trained by mixing one normal sample from each of nine heterogeneous datasets.

Result: Achieves state-of-the-art on nine medical benchmarks under all four settings (one-shot universal, full-shot universal, one-shot specialized, full-shot specialized), consistently outperforming previous methods.

Conclusion: SIR offers efficient and scalable solution for multi-domain clinical anomaly detection with minimal normal samples and no task-specific retraining.

Abstract: Unsupervised medical anomaly detection is severely limited by the scarcity of normal training samples. Existing methods typically train dedicated models for each dataset or disease, requiring hundreds of normal images per task and lacking cross-modality generalization. We propose Semantic Iterative Reconstruction (SIR), a framework that enables a single universal model to detect anomalies across diverse medical domains using extremely few normal samples. SIR leverages a pretrained teacher encoder to extract multi-scale deep features and employs a compact up-then-down decoder with multi-loop iterative refinement to enforce robust normality priors in deep feature space. The framework adopts a one-shot universal design: a single model is trained by mixing exactly one normal sample from each of nine heterogeneous datasets, enabling effective anomaly detection on all corresponding test sets without task-specific retraining. Extensive experiments on nine medical benchmarks demonstrate that SIR achieves state-of-the-art under all four settings – one-shot universal, full-shot universal, one-shot specialized, and full-shot specialized – consistently outperforming previous methods. SIR offers an efficient and scalable solution for multi-domain clinical anomaly detection. Code is available at https://github.com/jusufzn212427/sir4ad.

[140] Retinal Disease Classification from Fundus Images using CNN Transfer Learning

Ali Akram

Main category: cs.CV

TL;DR: Transfer learning with VGG16 outperforms baseline CNN for retinal disease classification from fundus images, achieving 90.8% accuracy but with challenges in minority class sensitivity.

Details

Motivation: Retinal diseases are leading causes of preventable visual impairment worldwide. Automated screening via fundus image analysis can expand access to early detection, especially in underserved populations.

Method: Developed reproducible deep learning pipeline for binary retinal disease risk classification. Compared baseline CNN with transfer learning using pretrained VGG16 backbone. Addressed class imbalance with class weighting. Evaluated on held-out data using accuracy, precision, recall, F1-score, confusion matrices, and ROC-AUC.

Result: VGG16 transfer learning model achieved 90.8% test accuracy with weighted F1-score of 0.90, substantially outperforming baseline CNN (83.1% accuracy). Transfer learning improved discrimination but revealed challenges in sensitivity to minority disease cases.

Conclusion: Transfer learning with pretrained models improves retinal disease classification performance over baseline CNNs. However, practical limitations exist related to dataset characteristics, class imbalance, and threshold selection. The paper provides guidance for reproducibility and future improvements for clinically reliable screening.

Abstract: Retinal diseases remain among the leading preventable causes of visual impairment worldwide. Automated screening based on fundus image analysis has the potential to expand access to early detection, particularly in underserved populations. This paper presents a reproducible deep learning pipeline for binary retinal disease risk classification from publicly available fundus photographs. We implement and compare a baseline convolutional neural network with a transfer learning approach using a pretrained VGG16 backbone and evaluate generalization on held-out data. To address class imbalance, we apply class weighting and report standard classification metrics including accuracy, precision, recall, F1-score, confusion matrices, and ROC-AUC. The VGG16 transfer learning model achieves 90.8% test accuracy with a weighted F1-score of 0.90, substantially outperforming the baseline CNN (83.1% accuracy). Results indicate that transfer learning improves discrimination compared to a baseline CNN, while also revealing remaining challenges in sensitivity to minority disease cases. We discuss practical limitations related to dataset characteristics, class imbalance, and threshold selection, and provide guidance for reproducibility and future improvements for clinically reliable screening

[141] Re-Prompting SAM 3 via Object Retrieval: 3rd of the 5th PVUW MOSE Track

Mingqi Gao, Sijie Li, Jungong Han

Main category: cs.CV

TL;DR: SAM3-based automatic re-prompting framework for semi-supervised video object segmentation that handles target disappearance/reappearance, transformations, and distractors through DINOv3 object matching and multi-anchor propagation.

Details

Motivation: Address core challenges in complex semi-supervised video object segmentation: target disappearance and reappearance, severe transformations, and strong same-category distractors that existing methods struggle with.

Method: Uses SAM3 detector on later frames to find same-category candidates, performs DINOv3-based object-level matching with transformation-aware target feature pool to retrieve reliable anchors, then injects these anchors back into SAM3 tracker with first-frame mask for multi-anchor propagation.

Result: Achieves J&F of 51.17% on MOSEv2 test set, ranking 3rd in the PVUW 2026 Challenge MOSEv2 track.

Conclusion: Automatic re-prompting framework with multi-anchor propagation effectively addresses core challenges in semi-supervised VOS, demonstrating robustness to target disappearance/reappearance and same-category distractors.

Abstract: This technical report explores the MOSEv2 track of the PVUW 2026 Challenge, which targets complex semi-supervised video object segmentation. Built on SAM3, we develop an automatic re-prompting framework to improve robustness under target disappearance and reappearance, severe transformation, and strong same-category distractors. Our method first applies the SAM3 detector to later frames to identify same-category object candidates, and then performs DINOv3-based object-level matching with a transformation-aware target feature pool to retrieve reliable target anchors. These anchors are injected back into the SAM~3 tracker together with the first-frame mask, enabling multi-anchor propagation rather than relying solely on the initial prompt. This simple directly benefits several core challenges of MOSEv2. Our solution achieves a J&F of 51.17% on the test set, ranking 3rd in the MOSEv2 track.

[142] Sparse Autoencoders for Interpretable Medical Image Representation Learning

Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis

Main category: cs.CV

TL;DR: SAEs create interpretable sparse features from medical vision foundation model embeddings, enabling human-understandable concepts while maintaining performance.

Details

Motivation: Medical vision foundation models produce opaque latent representations that clinicians cannot verify or interrogate, creating a need for interpretable features that maintain performance while being human-understandable.

Method: Train Sparse Autoencoders (SAEs) on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) vision models using 909,873 CT and MRI 2D slices from TotalSegmentator dataset.

Result: SAEs achieve high reconstruction fidelity (R² up to 0.941), recover 87.8% downstream performance with only 10 features (99.4% dimensionality reduction), preserve semantic fidelity in retrieval, enable LLM-based concept interpretation, and bridge clinical language with latent representations.

Conclusion: SAEs provide a promising pathway toward interpretable, concept-driven medical vision systems by replacing opaque foundation model representations with human-understandable sparse features.

Abstract: Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human-interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) using 909,873 CT and MRI 2D image slices from the TotalSegmentator dataset. We find that learned sparse features: (a) reconstruct original embeddings with high fidelity (R2 up to 0.941) and recover up to 87.8% of downstream performance using only 10 features (99.4% dimensionality reduction), (b) preserve semantic fidelity in image retrieval tasks, (c) correspond to specific concepts that can be expressed in language using large language model (LLM)-based auto-interpretation. (d) bridge clinical language and abstract latent representations in zero-shot language-driven image retrieval. Our work indicates SAEs are a promising pathway towards interpretable, concept-driven medical vision systems. Code repository: https://github.com/pwesp/sail.

[143] 3D-LLDM: Label-Guided 3D Latent Diffusion Model for Improving High-Resolution Synthetic MR Imaging in Hepatic Structure Segmentation

Kyeonghun Kim, Jaehyeok Bae, Youngung Han, Joo Young Bae, Seoyoung Ju, Junsu Lim, Gyeongmin Kim, Nam-Joon Kim, Woo Kyoung Jeong, Ken Ying-Kai Liao, Won Jae Lee, Pa Hong, Hyuk-Jae Lee

Main category: cs.CV

TL;DR: 3D-LLDM: A label-guided 3D latent diffusion model for generating synthetic medical MR volumes with segmentation masks, improving data augmentation for medical imaging tasks.

Details

Motivation: Medical imaging suffers from scarce annotated datasets, limiting deep learning applications. Synthetic data generation can help, but current methods struggle with high-quality 3D medical image synthesis with corresponding anatomical labels.

Method: Proposes 3D-LLDM, a label-guided 3D latent diffusion model using ControlNet-based architecture. Uses hepatobiliary phase MR images with Gd-EOB-DTPA contrast agent to derive structural masks for liver, portal vein, hepatic vein, and hepatocellular carcinoma, which guide volumetric synthesis.

Result: Achieves FID of 28.31, improving over GANs by 70.9% and over state-of-the-art diffusion baselines by 26.7%. Synthetic volumes improve hepatocellular carcinoma segmentation by up to 11.153% Dice score across five CNN architectures when used for data augmentation.

Conclusion: 3D-LLDM effectively generates high-quality synthetic medical volumes with segmentation masks, demonstrating significant improvements in downstream segmentation tasks through data augmentation.

Abstract: Deep learning and generative models are advancing rapidly, with synthetic data increasingly being integrated into training pipelines for downstream analysis tasks. However, in medical imaging, their adoption remains constrained by the scarcity of reliable annotated datasets. To address this limitation, we propose 3D-LLDM, a label-guided 3D latent diffusion model that generates high-quality synthetic magnetic resonance (MR) volumes with corresponding anatomical segmentation masks. Our approach uses hepatobiliary phase MR images enhanced with the Gd-EOB-DTPA contrast agent to derive structural masks for the liver, portal vein, hepatic vein, and hepatocellular carcinoma, which then guide volumetric synthesis through a ControlNet-based architecture. Trained on 720 real clinical hepatobiliary phase MR scans from Samsung Medical Center, 3D-LLDM achieves a Fréchet Inception Distance (FID) of 28.31, improving over GANs by 70.9% and over state-of-the-art diffusion baselines by 26.7%. When used for data augmentation, the synthetic volumes improve hepatocellular carcinoma segmentation by up to 11.153% Dice score across five CNN architectures.

[144] See, Remember, Explore: A Benchmark and Baselines for Streaming Spatial Reasoning

Yuxi Wei, Wei Huang, Qirui Chen, Lu Hou, Xiaojuan Qi

Main category: cs.CV

TL;DR: S3-Bench benchmark for streaming spatial QA with active exploration, requiring real-time reasoning from temporal video streams and active perception when views are insufficient.

Details

[145] MLE-UVAD: Minimal Latent Entropy Autoencoder for Fully Unsupervised Video Anomaly Detection

Yuang Geng, Junkai Zhou, Kang Yang, Pan He, Zhuoyang Zhou, Jose C. Principe, Joel Harley, Ivan Ruchkin

Main category: cs.CV

TL;DR: Unsupervised video anomaly detection using entropy-guided autoencoder with dual reconstruction and minimal latent entropy losses to separate normal and abnormal events without labels.

Details

Motivation: Address limitations of existing VAD methods that require extensive labeling or normal-only videos, which are vulnerable to distribution shifts and contamination. Need for fully unsupervised approach using raw videos containing both normal and abnormal events.

Method: Propose entropy-guided autoencoder combining standard reconstruction loss with novel Minimal Latent Entropy (MLE) loss. MLE minimizes entropy of latent embeddings, encouraging concentration around high-density normal regions, pulling sparse anomalous embeddings into normal cluster for poor anomaly reconstruction.

Result: Extensive experiments on two widely used benchmarks and challenging self-collected driving dataset demonstrate robust and superior performance over baselines.

Conclusion: Dual-loss design produces clear reconstruction gap enabling effective anomaly detection in fully unsupervised setting without requiring labeled data or normal-only videos.

Abstract: In this paper, we address the challenging problem of single-scene, fully unsupervised video anomaly detection (VAD), where raw videos containing both normal and abnormal events are used directly for training and testing without any labels. This differs sharply from prior work that either requires extensive labeling (fully or weakly supervised) or depends on normal-only videos (one-class classification), which are vulnerable to distribution shifts and contamination. We propose an entropy-guided autoencoder that detects anomalies through reconstruction error by reconstructing normal frames well while making anomalies reconstruct poorly. The key idea is to combine the standard reconstruction loss with a novel Minimal Latent Entropy (MLE) loss in the autoencoder. Reconstruction loss alone maps normal and abnormal inputs to distinct latent clusters due to their inherent differences, but also risks reconstructing anomalies too well to detect. Therefore, MLE loss addresses this by minimizing the entropy of latent embeddings, encouraging them to concentrate around high-density regions. Since normal frames dominate the raw video, sparse anomalous embeddings are pulled into the normal cluster, so the decoder emphasizes normal patterns and produces poor reconstructions for anomalies. This dual-loss design produces a clear reconstruction gap that enables effective anomaly detection. Extensive experiments on two widely used benchmarks and a challenging self-collected driving dataset demonstrate that our method achieves robust and superior performance over baselines.

[146] EnvSocial-Diff: A Diffusion-Based Crowd Simulation Model with Environmental Conditioning and Individual-Group Interaction

Bingxue Zhao, Qi Zhang, Hui Huang

Main category: cs.CV

TL;DR: EnvSocial-Diff: A diffusion-based crowd simulation model that integrates environmental conditioning and individual-group interactions for realistic pedestrian trajectory prediction.

Details

Motivation: Most existing pedestrian trajectory models focus primarily on social interactions while neglecting environmental context, which is crucial for realistic crowd simulation in complex scenes.

Method: Uses diffusion-based modeling with two key modules: 1) structured environmental conditioning encoding obstacles, objects of interest, and lighting; 2) individual-group interaction module capturing both interpersonal relations and group-level conformity via graph-based design.

Result: Outperforms state-of-the-art methods on multiple benchmark datasets, demonstrating the importance of explicit environmental conditioning and multi-level social interaction.

Conclusion: Explicit environmental conditioning combined with multi-level social interaction modeling is essential for realistic crowd simulation, and EnvSocial-Diff provides an effective framework for this integration.

Abstract: Modeling realistic pedestrian trajectories requires accounting for both social interactions and environmental context, yet most existing approaches largely emphasize social dynamics. We propose \textbf{EnvSocial-Diff}: a diffusion-based crowd simulation model informed by social physics and augmented with environmental conditioning and individual–group interaction. Our structured environmental conditioning module explicitly encodes obstacles, objects of interest, and lighting levels, providing interpretable signals that capture scene constraints and attractors. In parallel, the individual–group interaction module goes beyond individual-level modeling by capturing both fine-grained interpersonal relations and group-level conformity through a graph-based design. Experiments on multiple benchmark datasets demonstrate that EnvSocial-Diff outperforms the latest state-of-the-art methods, underscoring the importance of explicit environmental conditioning and multi-level social interaction for realistic crowd simulation. Code is here: https://github.com/zqyq/EnvSocial-Diff.

[147] BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Kuniaki Saito, Hiroaki Santo, Fumio Okura

Main category: cs.CV

TL;DR: BioVITA is a visual-textual-acoustic alignment framework for biological species understanding that integrates audio with existing visual-textual models to create unified multimodal representations for biodiversity applications.

Details

Motivation: While recent biological models like BioCLIP have shown strong image-text alignment for species identification, the integration of audio modality remains an open problem in multimodal biodiversity understanding. There's a need to bridge this gap to enable comprehensive species analysis across visual, textual, and acoustic data.

Method: 1) Constructed a large-scale training dataset with 1.3M audio clips and 2.3M images covering 14,133 species with ecological trait labels. 2) Built upon BioCLIP2 with a two-stage training framework to align audio representations with visual and textual representations. 3) Developed a cross-modal retrieval benchmark covering all directional retrieval across three modalities at three taxonomic levels.

Result: The model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. Extensive experiments demonstrate effective cross-modal alignment across visual, textual, and acoustic modalities.

Conclusion: BioVITA successfully integrates audio with visual and textual modalities for biological applications, providing a comprehensive framework for multimodal biodiversity understanding and creating new opportunities for ecological research through unified multimodal representations.

Abstract: Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/

[148] Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang, Pengyuan Lyu, Weinong Wang, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou

Main category: cs.CV

TL;DR: A data-training co-design framework for robust end-to-end document parsing using MLLMs, addressing data scarcity and structural inconsistency issues through realistic scene synthesis and document-aware training.

Details

Motivation: Traditional document parsing pipelines fail under casually captured or non-standard conditions, and end-to-end approaches suffer from repetitive, hallucinated, and structurally inconsistent predictions due to scarcity of large-scale full-page parsing data and lack of structure-aware training strategies.

Method: Proposes a data-training co-design framework with: 1) Realistic Scene Synthesis strategy to construct large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with document elements, and 2) Document-Aware Training Recipe with progressive learning and structure-token optimization for enhanced structural fidelity and decoding stability.

Result: Integrated into a 1B-parameter MLLM, the method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios, with evaluation on the proposed Wild-OmniDocBench benchmark.

Conclusion: The framework effectively addresses data scarcity and structural inconsistency in document parsing, demonstrating strong performance in real-world scenarios, with all models, data synthesis pipelines, and benchmarks to be publicly released.

Abstract: Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions - primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.

[149] FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting

Yixian Wang, Haolin Yu, Jiadong Tang, Yu Gao, Xihan Wang, Yufeng Yue, Yi Yang

Main category: cs.CV

TL;DR: FilterGS improves 3D Gaussian Splatting for large scenes with parallel filtering and Gaussian-tile redundancy reduction, achieving faster rendering while maintaining quality.

Details

Motivation: Current 3D Gaussian Splatting approaches for large scenes using Level-of-Detail methods suffer from inefficient serial traversal (consuming over 60% of rendering time) and redundant Gaussian-tile pairs that cause unnecessary processing overhead.

Method: Introduces FilterGS with: 1) Parallel filtering mechanism with two complementary filters that select Gaussian elements efficiently without tree traversal, and 2) Novel GTC metric to quantify Gaussian-tile pair redundancy, plus a scene-adaptive Gaussian shrinking strategy to reduce redundant pairs.

Result: Extensive experiments show FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets.

Conclusion: FilterGS successfully addresses the efficiency bottlenecks in large-scale 3D Gaussian Splatting through parallel filtering and redundancy reduction, enabling faster rendering without compromising visual quality.

Abstract: 3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we introduce FilterGS, featuring a parallel filtering mechanism with two complementary filters that select Gaussian elements efficiently without tree traversal. Additionally, we propose a novel GTC metric that quantifies the redundancy of Gaussian-tile key-value pairs. Based on this metric, we introduce a scene-adaptive Gaussian shrinking strategy that effectively reduces redundant pairs. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets. Project page: https://github.com/xenon-w/FilterGS

[150] MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation

Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang, Pengyuan Lyu, Weinong Wang, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou

Main category: cs.CV

TL;DR: MMTIT-Bench: A multilingual multi-scenario benchmark for text-image machine translation, plus CPR-Trans data paradigm integrating scene cognition, text perception, and translation reasoning.

Details

Motivation: End-to-end text-image machine translation (TIMT) is crucial for multilingual scene understanding, but lacks robust evaluation across diverse visual scenes and low-resource languages. Existing VLLMs have immature reasoning paradigms for TIMT, either cascading parsing/translation or focusing on language-only reasoning while overlooking visual cognition.

Method: 1) Created MMTIT-Bench: human-verified multilingual multi-scenario benchmark with 1,400 images across 14 non-English/non-Chinese languages and diverse settings. 2) Proposed CPR-Trans: Cognition-Perception-Reasoning for Translation data paradigm integrating scene cognition, text perception, and translation reasoning in unified process. 3) Used VLLM-driven data generation pipeline for structured, interpretable supervision aligning perception with reasoning.

Result: Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. The benchmark enables rigorous assessment of end-to-end TIMT across diverse scenarios and languages.

Conclusion: MMTIT-Bench addresses the lack of evaluation resources for multilingual TIMT, while CPR-Trans provides effective reasoning paradigm that integrates visual cognition with translation reasoning, improving model performance and interpretability.

Abstract: End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench to promote the multilingual and multi-scenario TIMT research upon acceptance.

[151] Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval

Junkai Yang, Qirui Wang, Yaoqing Jin, Shuai Ma, Minghan Xu, Shanmin Pang

Main category: cs.CV

TL;DR: KDC-Net improves video segment retrieval by addressing text-video information density mismatch through hierarchical semantic aggregation for text and dynamic temporal attention for video, with CLIP-based distillation for knowledge transfer.

Details

Motivation: The paper addresses two key challenges in retrieving partially relevant segments from untrimmed videos: 1) mismatch in information density between text queries and video segments, and 2) limited attention mechanisms that overlook semantic focus and event correlations in videos.

Method: KDC-Net uses a dual approach: 1) Hierarchical Semantic Aggregation module captures multi-scale phrase cues to enrich query semantics on text side, 2) Dynamic Temporal Attention mechanism with relative positional encoding and adaptive temporal windows highlights key events with local temporal coherence on video side, plus 3) dynamic CLIP-based distillation with temporal-continuity-aware refinement for knowledge transfer.

Result: Experiments on PRVR benchmarks show KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.

Conclusion: KDC-Net effectively addresses text-video information density mismatch and improves video segment retrieval through dual context-aware modeling and knowledge distillation.

Abstract: Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.

[152] Latent Bias Alignment for High-Fidelity Diffusion Inversion in Real-World Image Reconstruction and Manipulation

Weiming Chen, Qifan Liu, Siyi Liu, Yushun Tang, Yijia Wang, Zhihan Zhu, Zhihai He

Main category: cs.CV

TL;DR: Proposes Latent Bias Optimization (LBO) and Image Latent Boosting (ILB) to improve diffusion inversion for better real-world image reconstruction and editing

Details

Motivation: Existing diffusion inversion methods suffer from low reconstruction quality and weak robustness due to misalignment between inversion/generation trajectories and mismatch between diffusion inversion and VQAE reconstruction

Method: Introduces Latent Bias Optimization (LBO) to learn bias vectors reducing trajectory misalignment, and Image Latent Boosting (ILB) for joint optimization of diffusion inversion and VQAE reconstruction by adjusting image latent representations

Result: Significantly improves image reconstruction quality and enhances downstream task performance including image editing and rare concept generation

Conclusion: The proposed LBO and ILB techniques effectively address diffusion inversion challenges, enabling better bridging between diffusion models and real-world applications

Abstract: Recent research has shown that text-to-image diffusion models are capable of generating high-quality images guided by text prompts. But can they be used to generate or approximate real-world images from the seed noise? This is known as the diffusion inversion problem, which serves as a fundamental building block for bridging diffusion models and real-world scenarios. However, existing diffusion inversion methods often suffer from low reconstruction quality or weak robustness. Two major challenges need to be carefully addressed: (1) the misalignment between the inversion and generation trajectories during the diffusion process, and (2) the mismatch between the diffusion inversion process and the VQ autoencoder (VQAE) reconstruction. To address these challenges, we introduce a latent bias vector at each inversion step, which is learned to reduce the misalignment between inversion and generation trajectories. We refer to this strategy as Latent Bias Optimization (LBO). Furthermore, we perform an approximate joint optimization of the diffusion inversion and VQAE reconstruction processes by learning to adjust the image latent representation, which serves as the connecting interface between them. We refer to this technique as Image Latent Boosting (ILB). Extensive experimental results demonstrate that the proposed method significantly improves the image reconstruction quality of the diffusion model, as well as the performance of downstream tasks, including image editing and rare concept generation.

[153] GenMask: Adapting DiT for Segmentation via Direct Mask

Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, Yanfeng Wang

Main category: cs.CV

TL;DR: GenMask proposes training segmentation directly as a generative task using DiT, eliminating feature extraction pipelines by harmonizing mask and image generation through noise scheduling.

Details

Motivation: Current segmentation approaches misuse generative models as feature extractors, creating representation misalignment and complex pipelines. The paper argues segmentation should be trained directly in a generative manner rather than through indirect adaptation.

Method: Introduces GenMask, a DiT that generates both segmentation masks and RGB images using a timestep sampling strategy that emphasizes extreme noise for masks and moderate noise for images to handle their different latent distributions.

Result: Achieves state-of-the-art performance on referring and reasoning segmentation benchmarks while eliminating feature extraction pipelines.

Conclusion: Direct generative training for segmentation is more effective than indirect feature extraction approaches, enabling unified architecture for both image and mask generation.

Abstract: Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that instead of indirect adaptation, segmentation tasks should be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise robust, and linearly separable, distinct from natural image latents. To bridge this gap, we introduce timesteps sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training. We present GenMask, a DiT trains to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need of feature extraction pipelines tailored for segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks and ablations quantify the contribution of each component.

[154] Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu

Main category: cs.CV

TL;DR: AttentionPack is an adaptive optimization framework that improves memory efficiency in large vision-language models during decoding by compacting attention KV matrices and using token-specific decompression.

Details

Motivation: Large VLMs face significant memory overhead during decoding, especially with long sequences of visual and text tokens in multi-image/video tasks, limiting inference efficiency and batch processing capabilities.

Method: Two key innovations: 1) Multi-head attention compaction exploiting low-rank structure to store KV matrices economically, and 2) Token-specific attention-aware decompression to reduce latency overhead.

Result: Achieves up to 8x memory efficiency improvement, enables higher batch sizes and faster inference while preserving output quality, and shows further gains when combined with eviction, quantization, and kernel fusion.

Conclusion: AttentionPack provides an effective solution for memory-efficient decoding in large VLMs, particularly beneficial for long-context tasks with multiple high-resolution images/videos in resource-limited environments.

Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored for large vision-language models with improving memory-efficiency during decoding, focusing on addressing the challenges due to the increased high number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) We introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, enabling higher batch sizes and faster batch inference while preserving the model output quality or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization and kernel fusion, showing further efficiency gains for resource-limited environments.

[155] DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning

Jiajian Huang, Dongliang Zhu, Zitong YU, Hui Ma, Jiayu Zhang, Chunmei Zhu, Xiaochun Cao

Main category: cs.CV

TL;DR: A multimodal deception detection framework with structured reasoning datasets, multicultural T4-Deception dataset, and robust learning modules (SICS and DMC) for auditable, generalizable performance across domains and cultures.

Details

Motivation: Existing deception detection systems lack verifiable evidence connecting audiovisual cues to decisions, have limited generalization across domains/cultures, and suffer from shortcut learning due to small datasets with only binary labels.

Method: Three main contributions: 1) Construct reasoning datasets with structured cue-level descriptions and reasoning chains; 2) Release T4-Deception multicultural dataset (1695 samples from 4 countries); 3) Propose SICS module (synergizes global priors with sample-adaptive residuals) and DMC module (aligns modality-specific predictions via knowledge distillation).

Result: Achieves state-of-the-art performance on three established benchmarks and the novel T4-Deception dataset, with superior transferability across cultural contexts and robust performance in both in-domain and cross-domain scenarios.

Conclusion: The framework enables auditable deception detection with verifiable reasoning, addresses multicultural generalization, and overcomes shortcut learning through structured reasoning datasets and robust multimodal learning modules.

Abstract: Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small with limited scenario coverage, leading to shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling model output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified ``To Tell The Truth’’ television format implemented across four countries. With 1695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by synergizing learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bi-directionally recalibrates representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our novel dataset demonstrate that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, while exhibiting superior transferability across diverse cultural contexts. The datasets and codes will be released.

[156] Uncertainty-Aware Vision-based Risk Object Identification via Conformal Risk Tube Prediction

Kai-Yu Fu, Yi-Ting Chen

Main category: cs.CV

TL;DR: Conformal Risk Tube Prediction (CRTP) is a unified framework for spatiotemporal risk uncertainty modeling in vision-based risk object identification for autonomous driving, providing coverage guarantees and calibrated risk scores.

Details

Motivation: Current vision risk object identification methods make deterministic decisions without modeling uncertainty, leading to safety-critical failures in ambiguous scenarios with multiple interacting risks. Existing approaches lack principled frameworks for joint spatiotemporal risk uncertainty modeling.

Method: Proposes Conformal Risk Tube Prediction (CRTP) - a unified formulation that captures spatiotemporal risk uncertainty, provides coverage guarantees for true risks, and produces calibrated risk scores with uncertainty estimates. Includes new dataset and metrics for systematic evaluation.

Result: CRTP delivers substantial improvements over prior approaches, enhancing vision-ROI robustness and downstream performance (e.g., reducing nuisance braking alerts). Systematic analysis examines factors affecting uncertainty estimation including scenario variations and perception error propagation.

Conclusion: The proposed framework addresses critical limitations in current vision risk identification by modeling uncertainty jointly across space and time, providing coverage guarantees, and improving safety in complex driving scenarios.

Abstract: We study object importance-based vision risk object identification (Vision-ROI), a key capability for hazard detection in intelligent driving systems. Existing approaches make deterministic decisions and ignore uncertainty, which could lead to safety-critical failures. Specifically, in ambiguous scenarios, fixed decision thresholds may cause premature or delayed risk detection and temporally unstable predictions, especially in complex scenes with multiple interacting risks. Despite these challenges, current methods lack a principled framework to model risk uncertainty jointly across space and time. We propose Conformal Risk Tube Prediction, a unified formulation that captures spatiotemporal risk uncertainty, provides coverage guarantees for true risks, and produces calibrated risk scores with uncertainty estimates. To conduct a systematic evaluation, we present a new dataset and metrics probing diverse scenario configurations with multi-risk coupling effects, which are not supported by existing datasets. We systematically analyze factors affecting uncertainty estimation, including scenario variations, per-risk category behavior, and perception error propagation. Our method delivers substantial improvements over prior approaches, enhancing vision-ROI robustness and downstream performance, such as reducing nuisance braking alerts. For more qualitative results, please visit our project webpage: https://hcis-lab.github.io/CRTP/

[157] DepthArb: Training-Free Depth-Arbitrated Generation for Occlusion-Robust Image Synthesis

Hongjin Niu, Jiahao Wang, Xirui Hu, Weizhan Zhang, Lan Ma, Yuan Gao

Main category: cs.CV

TL;DR: DepthArb is a training-free framework that resolves occlusion ambiguities in text-to-image diffusion models by arbitrating attention competition between objects, improving depth-ordered visibility without model retraining.

Details

Motivation: Text-to-image diffusion models struggle with accurate occlusion relationships of multiple objects, especially in dense overlapping regions. Existing layout-guided methods use rigid spatial priors that ignore depth order, leading to concept mixing and illogical occlusion.

Method: DepthArb uses two core mechanisms: 1) Attention Arbitration Modulation (AAM) enforces depth-ordered visibility by suppressing background activations in overlapping regions, and 2) Spatial Compactness Control (SCC) preserves structural integrity by curbing attention divergence. The framework also introduces OcclBench, a benchmark for evaluating occlusion scenarios.

Result: DepthArb consistently outperforms state-of-the-art baselines in both occlusion accuracy and visual fidelity. It serves as a plug-and-play method that enhances compositional capabilities of diffusion backbones.

Conclusion: DepthArb offers a novel perspective on spatial layering in generative models by resolving occlusion ambiguities through attention arbitration, improving multi-object composition without requiring model retraining.

Abstract: Text-to-image diffusion models frequently exhibit deficiencies in synthesizing accurate occlusion relationships of multiple objects, particularly within dense overlapping regions. Existing training-free layout-guided methods predominantly rely on rigid spatial priors that remain agnostic to depth order, often resulting in concept mixing or illogical occlusion. To address these limitations, we propose DepthArb, a training-free framework that resolves occlusion ambiguities by arbitrating attention competition between interacting objects. Specifically, DepthArb employs two core mechanisms: Attention Arbitration Modulation (AAM), which enforces depth-ordered visibility by suppressing background activations in overlapping regions, and Spatial Compactness Control (SCC), which preserves structural integrity by curbing attention divergence. These mechanisms enable robust occlusion generation without model retraining. To systematically evaluate this capability, we propose OcclBench, a comprehensive benchmark designed to evaluate diverse occlusion scenarios. Extensive evaluations demonstrate that DepthArb consistently outperforms state-of-the-art baselines in both occlusion accuracy and visual fidelity. As a plug-and-play method, DepthArb seamlessly enhances the compositional capabilities of diffusion backbones, offering a novel perspective on spatial layering within generative models.

[158] DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models

Hongyi Miao, Jun Jia, Xincheng Wang, Qianli Ma, Wei Sun, Wangqiu Zhou, Dandan Zhu, Yewen Cao, Zhi Liu, Guangtao Zhai

Main category: cs.CV

TL;DR: Paper introduces identity-affiliation learning threat to VLMs, creates benchmark dataset, and proposes DP2-VL data poisoning defense for privacy protection.

Details

Motivation: As VLMs gain fine-grained image understanding capabilities, they introduce new privacy risks where attackers can fine-tune models on private photos to learn associations between facial identities and private information (affiliations, properties, relationships).

Method: 1) Proposes identity-affiliation learning threat model; 2) Creates first identity-affiliation dataset with 7 typical private photo scenarios; 3) Demonstrates vulnerability of mainstream VLMs (LLaVA, Qwen-VL, MiniGPT-v2); 4) Proposes DP2-VL defense using data poisoning with imperceptible perturbations that shift embeddings to cause overfitting during fine-tuning.

Result: VLMs can recognize facial identities and infer identity-affiliation relationships even with small-scale fine-tuning. DP2-VL achieves strong generalization across models, robustness to post-processing, and consistent effectiveness across protection ratios.

Conclusion: VLMs pose significant privacy risks through identity-affiliation learning, requiring new defense mechanisms like DP2-VL’s data poisoning approach to protect private photos while maintaining model utility.

Abstract: Recent advances in visual-language alignment have endowed vision-language models (VLMs) with fine-grained image understanding capabilities. However, this progress also introduces new privacy risks. This paper first proposes a novel privacy threat model named identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target facial identity and their private property and social relationships into the model’s internal representations. Once deployed via public APIs, this model enables unauthorized exposure of the target user’s private information upon input of their photos. To benchmark VLMs’ susceptibility to such identity-affiliation leakage, we introduce the first identity-affiliation dataset comprising seven typical scenarios appearing in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results demonstrate that mainstream VLMs like LLaVA, Qwen-VL, and MiniGPT-v2, can recognize facial identities and infer identity-affiliation relationships by fine-tuning on small-scale private photographic dataset, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP2-VL, the first Dataset Protection framework for private photos that leverages Data Poisoning. Though optimizing imperceptible perturbations by pushing the original representations toward an antithetical region, DP2-VL induces a dataset-level shift in the embedding space of VLMs’encoders. This shift separates protected images from clean inference images, causing fine-tuning on the protected set to overfit. Extensive experiments demonstrate that DP2-VL achieves strong generalization across models, robustness to diverse post-processing operations, and consistent effectiveness across varying protection ratios.

[159] Revealing Multi-View Hallucination in Large Vision-Language Models

Wooje Park, Insu Lee, Soohyun Kim, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim

Main category: cs.CV

TL;DR: LVLMs struggle with multi-view hallucination (confusing information from different instances/viewpoints). MVH-Bench benchmark created with 4.8k QA pairs. RSCD decoding technique improves performance by up to 34.6 points.

Details

Motivation: Current large vision-language models (LVLMs) are increasingly applied to multi-view image inputs but suffer from multi-view hallucination - confusing or mismatching visual information from different instances or viewpoints. There's a need to systematically analyze and address this problem.

Method: 1) Construct MVH-Bench benchmark with 4.8k question-answer pairs targeting cross-instance and cross-view hallucination types. 2) Propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking.

Result: Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision show RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods. Recent LVLMs struggle to correctly associate visual evidence with corresponding instances/viewpoints.

Conclusion: Multi-view hallucination is a significant problem in LVLMs. RSCD effectively addresses this issue without requiring additional training, demonstrating substantial improvements over existing methods. The MVH-Bench provides a valuable tool for evaluating multi-view understanding capabilities.

Abstract: Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.

[160] High-Fidelity Face Content Recovery via Tamper-Resilient Versatile Watermarking

Peipeng Yu, Jinfeng Xie, Chengfu Ou, Xiaoyu Zhou, Jianwei Fei, Yunshu Dai, Zhihua Xia, Chip Hong Chang

Main category: cs.CV

TL;DR: VeriFi is a versatile watermarking framework for deepfake forensics that unifies copyright protection, pixel-level manipulation localization, and high-fidelity face content recovery using semantic latent watermarks.

Details

Motivation: The proliferation of AIGC-driven face manipulation and deepfakes threatens media provenance, integrity, and copyright protection. Existing watermarking systems have fidelity-functionality trade-offs and rarely support content recovery, limiting forensic value.

Method: VeriFi embeds compact semantic latent watermarks as content-preserving priors, achieves fine-grained localization without explicit localization artifacts by correlating image features with decoded provenance signals, and uses an AIGC attack simulator combining latent-space mixing with seamless blending for robustness.

Result: Extensive experiments on CelebA-HQ and FFHQ show VeriFi consistently outperforms strong baselines in watermark robustness, localization accuracy, and recovery quality.

Conclusion: VeriFi provides a practical and verifiable defense for deepfake forensics by unifying copyright protection, manipulation localization, and content recovery in a single framework.

Abstract: The proliferation of AIGC-driven face manipulation and deepfakes poses severe threats to media provenance, integrity, and copyright protection. Prior versatile watermarking systems typically rely on embedding explicit localization payloads, which introduces a fidelity–functionality trade-off: larger localization signals degrade visual quality and often reduce decoding robustness under strong generative edits. Moreover, existing methods rarely support content recovery, limiting their forensic value when original evidence must be reconstructed. To address these challenges, we present VeriFi, a versatile watermarking framework that unifies copyright protection, pixel-level manipulation localization, and high-fidelity face content recovery. VeriFi makes three key contributions: (1) it embeds a compact semantic latent watermark that serves as an content-preserving prior, enabling faithful restoration even after severe manipulations; (2) it achieves fine-grained localization without embedding localization-specific artifacts by correlating image features with decoded provenance signals; and (3) it introduces an AIGC attack simulator that combines latent-space mixing with seamless blending to improve robustness to realistic deepfake pipelines. Extensive experiments on CelebA-HQ and FFHQ show that VeriFi consistently outperforms strong baselines in watermark robustness, localization accuracy, and recovery quality, providing a practical and verifiable defense for deepfake forensics.

[161] VOLMO: Versatile and Open Large Models for Ophthalmology

Zhenyue Qin, Younjoon Chung, Elijah Lee, Wanyue Feng, Xuguang Ai, Serina Applebaum, Minjie Zou, Yang Liu, Pan Xiao, Mac Singer, Amisha Dave, Aidan Gilson, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu, Yih-Chung Tham, Ron Adelman, Luciano V. Del Priore, Qingyu Chen

Main category: cs.CV

TL;DR: VOLMO is an ophthalmology-specific multimodal LLM framework with 2B parameters that outperforms existing medical MLLMs on eye disease screening, severity classification, and clinical reasoning tasks.

Details

Motivation: Early detection of vision impairment is critical but current ophthalmology workflows are time-consuming. Existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available.

Method: Three-stage framework: 1) Ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles, 2) Domain task fine-tuning on 26,929 annotated instances for 12 eye conditions, 3) Multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care.

Result: VOLMO-2B consistently outperformed baselines (InternVL-2B, LLaVA-Med-7B, MedGemma-4B/27B, RETFound) on image description, disease screening/staging (avg F1 87.4% across 12 conditions), and external validation on AMD and diabetic retinopathy cohorts.

Conclusion: VOLMO provides an effective, open framework for developing ophthalmology-specific MLLMs that can assist clinicians in time-consuming diagnostic workflows and improve early detection of vision-threatening conditions.

Abstract: Vision impairment affects millions globally, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, which is time-consuming and burdensome. Recent multimodal large language models (MLLMs) show promise, but existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific MLLMs. VOLMO includes three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; domain task fine-tuning on 26,929 annotated instances spanning 12 eye conditions for disease screening and severity classification; and multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care. Using this framework, we trained a compact 2B-parameter MLLM and compared it with strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across settings, VOLMO-2B consistently outperformed baselines, achieving stronger image description performance, an average F1 of 87.4% across 12 eye conditions, and higher scores in external validation.

[162] SynMVCrowd: A Large Synthetic Benchmark for Multi-view Crowd Counting and Localization

Qi Zhang, Daijie Chen, Yunfei Gong, Hui Huang

Main category: cs.CV

TL;DR: SynMVCrowd: A large synthetic benchmark for multi-view crowd counting and localization with 50 scenes, up to 1000 people, addressing limitations of small datasets.

Details

Motivation: Existing multi-view crowd counting methods are evaluated on small datasets that are easily overfit, making practical evaluation and comparison difficult. There's a need for larger, more realistic benchmarks.

Method: Created a synthetic benchmark with 50 scenes, large crowd numbers (up to 1000), multiple camera views, and many frames. Also proposed strong baseline methods for multi-view crowd localization and counting.

Result: The benchmark enables more practical evaluation and comparison. The proposed baselines outperform existing methods on SynMVCrowd and show better domain transfer to real scenes for both multi-view and single-image counting.

Conclusion: SynMVCrowd advances multi-view and single-image crowd counting research toward more practical applications by providing a comprehensive synthetic benchmark that addresses limitations of existing small datasets.

Abstract: Existing multi-view crowd counting and localization methods are evaluated under relatively small scenes with limited crowd numbers, camera views, and frames. This makes the evaluation and comparison of existing methods impractical, as small datasets are easily overfit by these methods. To avoid these issues, 3DROM proposes a data augmentation method. Instead, in this paper, we propose a large synthetic benchmark, SynMVCrowd, for more practical evaluation and comparison of multi-view crowd counting and localization tasks. The SynMVCrowd benchmark consists of 50 synthetic scenes with a large number of multi-view frames and camera views and a much larger crowd number (up to 1000), which is more suitable for large-scene multi-view crowd vision tasks. Besides, we propose strong multi-view crowd localization and counting baselines that outperform all comparison methods on the new SynMVCrowd benchmark. Moreover, we prove that better domain transferring multi-view and single-image counting performance could be achieved with the aid of the benchmark on novel new real scenes. As a result, the proposed benchmark could advance the research for multi-view and single-image crowd counting and localization to more practical applications. The codes and datasets are here: https://github.com/zqyq/SynMVCrowd.

[163] PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning

Yankai Wang, Yiding Sun, Qirui Wang, Pengbo Li, Chaoyi Lu, Dongxu Zhang

Main category: cs.CV

TL;DR: PointRFT introduces the first reinforcement fine-tuning paradigm for 3D point cloud representation learning, using specialized reward functions to enhance foundation models’ performance in few-shot classification tasks.

Details

Motivation: The paper addresses the unexplored potential of RL-based methods in 3D perception, particularly for point cloud understanding. While RL algorithms like GRPO have shown success in large language models by enhancing reasoning capabilities, their application to 3D point cloud fine-tuning remains largely untapped. The authors aim to bridge this gap by developing a reinforcement fine-tuning approach specifically for point cloud representation learning.

Method: The authors propose PointRFT, a reinforcement fine-tuning paradigm for point cloud representation learning. They select three prevalent 3D foundation models and design specialized reward functions: accuracy reward and dispersion reward. These functions are engineered to stabilize training and mitigate distribution shifts. The approach is evaluated through comprehensive few-shot classification experiments comparing different training paradigms.

Result: PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. When integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially enhanced, achieving state-of-the-art performance, particularly under data-scarce scenarios.

Conclusion: The paper demonstrates that RL-based methods can effectively empower 3D point cloud fine-tuning. PointRFT represents a significant advancement in point cloud representation learning, showing that reinforcement fine-tuning can substantially improve foundation models’ performance, especially in data-limited situations.

Abstract: Understanding spatial dynamics and semantics in point cloud is fundamental for comprehensive 3D comprehension. While reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy reward and dispersion reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.

[164] Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection

Jielun Peng, Yabin Wang, Yaqi Li, Long Kong, Xiaopeng Hong

Main category: cs.CV

TL;DR: HAVIC is a holistic audio-visual deepfake detector that learns intrinsic coherence priors from authentic videos and dynamically fuses multimodal features, achieving state-of-the-art performance on challenging cross-dataset scenarios.

Details

Motivation: Existing deepfake detectors either use uni-modal artifacts or audio-visual discrepancies, failing to jointly leverage both information sources. Detectors relying on generator-specific artifacts also suffer from poor generalization to unseen forgeries. The authors argue that robust detection should be grounded in intrinsic audio-visual coherence within and across modalities.

Method: HAVIC learns priors of modality-specific structural coherence, inter-modal micro- and macro-coherence by pre-training on authentic videos. It then performs holistic adaptive aggregation to dynamically fuse audio-visual features for deepfake detection. The authors also introduce HiFi-AVDF, a high-fidelity audio-visual deepfake dataset featuring text-to-video and image-to-video forgeries from state-of-the-art commercial generators.

Result: Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state-of-the-art methods, achieving improvements of 9.39% AP and 9.37% AUC on the most challenging cross-dataset scenario.

Conclusion: HAVIC provides a robust and generalizable deepfake detection approach by grounding detection in intrinsic audio-visual coherence, with strong performance demonstrated across challenging benchmarks including cross-dataset scenarios.

Abstract: The rapid progress of generative AI has enabled hyper-realistic audio-visual deepfakes, intensifying threats to personal security and social trust. Most existing deepfake detectors rely either on uni-modal artifacts or audio-visual discrepancies, failing to jointly leverage both sources of information. Moreover, detectors that rely on generator-specific artifacts tend to exhibit degraded generalization when confronted with unseen forgeries. We argue that robust and generalizable detection should be grounded in intrinsic audio-visual coherence within and across modalities. Accordingly, we propose HAVIC, a Holistic Audio-Visual Intrinsic Coherence-based deepfake detector. HAVIC first learns priors of modality-specific structural coherence, inter-modal micro- and macro-coherence by pre-training on authentic videos. Based on the learned priors, HAVIC further performs holistic adaptive aggregation to dynamically fuse audio-visual features for deepfake detection. Additionally, we introduce HiFi-AVDF, a high-fidelity audio-visual deepfake dataset featuring both text-to-video and image-to-video forgeries from state-of-the-art commercial generators. Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state-of-the-art methods, achieving improvements of 9.39% AP and 9.37% AUC on the most challenging cross-dataset scenario. Our code and dataset are available at https://github.com/tuffy-studio/HAVIC.

[165] SLAT-Phys: Fast Material Property Field Prediction from Structured 3D Latents

Rocktim Jyoti Das, Dinesh Manocha

Main category: cs.CV

TL;DR: SLAT-Phys predicts 3D material property fields from single RGB images using pretrained 3D generation model features, achieving 120x speedup over prior methods.

Details

Motivation: Current vision-based material property estimation methods are either too expensive/slow or require 3D information. There's a need for efficient single-image approaches without explicit 3D reconstruction.

Method: Leverages spatially organized latent features from a pretrained 3D asset generation model that encodes geometry and semantic priors. Trains a lightweight neural decoder to estimate Young’s modulus, density, and Poisson’s ratio directly from single RGB images.

Result: Achieves competitive accuracy in predicting continuous material parameters while requiring only 9.9 seconds per object (120x speedup vs prior methods). Avoids reconstruction and voxelization preprocessing.

Conclusion: SLAT-Phys enables fast, accurate material property estimation from single RGB images by leveraging 3D priors without explicit reconstruction, making it practical for real-world applications.

Abstract: Estimating the material property field of 3D assets is critical for physics-based simulation, robotics, and digital twin generation. Existing vision-based approaches are either too expensive and slow or rely on 3D information. We present SLAT-Phys, an end-to-end method that predicts spatially varying material property fields of 3D assets directly from a single RGB image without explicit 3D reconstruction. Our approach leverages spatially organised latent features from a pretrained 3D asset generation model that encodes rich geometry and semantic prior, and trains a lightweight neural decoder to estimate Young’s modulus, density, and Poisson’s ratio. The coarse volumetric layout and semantic cues of the latent representation about object geometry and appearance enable accurate material estimation. Our experiments demonstrate that our method provides competitive accuracy in predicting continuous material parameters when compared against prior approaches, while significantly reducing computation time. In particular, SLAT-Phys requires only 9.9 seconds per object on an NVIDIA RTXA5000 GPU and avoids reconstruction and voxelization preprocessing. This results in 120x speedup compared to prior methods and enables faster material property estimation from a single image.

[166] HyDRA: Hybrid Domain-Aware Robust Architecture for Heterogeneous Collaborative Perception

Minwoo Song, Minhee Kang, Heejin Ahn

Main category: cs.CV

TL;DR: HyDRA is a unified pipeline for collaborative perception that handles agent heterogeneity by combining intermediate and late fusion with domain-aware classification and anchor-guided optimization.

Details

Motivation: Collaborative perception systems suffer from performance degradation due to heterogeneity in agent models and training data distributions, which existing methods struggle to handle effectively.

Method: HyDRA integrates intermediate and late fusion within a domain-aware framework using a lightweight domain classifier to identify heterogeneous agents and assign them to late-fusion branch, plus anchor-guided pose graph optimization to mitigate localization errors.

Result: Extensive experiments show HyDRA achieves performance comparable to state-of-the-art heterogeneity-aware methods without additional training, maintaining performance as number of collaborating agents increases.

Conclusion: HyDRA provides a unified solution for heterogeneous collaborative perception that enables zero-cost scaling without retraining, addressing key challenges in multi-agent systems.

Abstract: In collaborative perception, an agent’s performance can be degraded by heterogeneity arising from differences in model architecture or training data distributions. To address this challenge, we propose HyDRA (Hybrid Domain-Aware Robust Architecture), a unified pipeline that integrates intermediate and late fusion within a domain-aware framework. We introduce a lightweight domain classifier that dynamically identifies heterogeneous agents and assigns them to the late-fusion branch. Furthermore, we propose anchor-guided pose graph optimization to mitigate localization errors inherent in late fusion, leveraging reliable detections from intermediate fusion as fixed spatial anchors. Extensive experiments demonstrate that, despite requiring no additional training, HyDRA achieves performance comparable to state-of-the-art heterogeneity-aware CP methods. Importantly, this performance is maintained as the number of collaborating agents increases, enabling zero-cost scaling without retraining.

[167] SilLang: Improving Gait Recognition with Silhouette Language Encoding

Ruiyi Zhan, Guozhen Peng, Canyu Chen, Jian Lei, Annan Li

Main category: cs.CV

TL;DR: A novel approach that bridges binary gait silhouettes with natural language by using LLMs to capture temporal motion patterns, achieving state-of-the-art performance on gait recognition benchmarks.

Details

Motivation: Current gait recognition methods use visual backbones that focus on continuous visual features, overlooking the discrete nature of binary silhouettes that share encoding space with natural language. LLMs excel at extracting discriminative features from discrete sequences and modeling long-range dependencies, making them potentially effective for capturing subtle temporal motion patterns in gait.

Method: Proposes Contour-Velocity Tokenizer to encode binary gait silhouettes while reshaping their distribution to align with text token space. Establishes Silhouette Language Model (SilLang), a dual-branch framework that enhances visual silhouettes by integrating discrete linguistic embeddings from LLMs.

Result: SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D benchmarks, demonstrating the effectiveness of bridging gait silhouettes with natural language representations.

Conclusion: The paper successfully demonstrates that leveraging LLMs for gait recognition by aligning binary silhouette encoding with natural language token space can significantly improve performance, opening new directions for multimodal learning in motion analysis.

Abstract: Gait silhouettes, which can be encoded into binary gait codes, are widely adopted to representing motion patterns of pedestrian. Recent approaches commonly leverage visual backbones to encode gait silhouettes, achieving successful performance. However, they primarily focus on continuous visual features, overlooking the discrete nature of binary silhouettes that inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, highlighting their potential to capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework termed Silhouette Language Model, which enhances visual silhouettes by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D.

[168] CAKE: Real-time Action Detection via Motion Distillation and Background-aware Contrastive Learning

Hieu Hoang, Dung Trung Tran, Hong Nguyen, Nam-Phong Nguyen

Main category: cs.CV

TL;DR: CAKE is a flow-based distillation framework for Online Action Detection that transfers motion knowledge from optical flow to RGB models without explicit flow computation, achieving high performance with low computational cost.

Details

Motivation: Online Action Detection systems face high computational costs and insufficient modeling of discriminative temporal dynamics against background motion. Optical flow provides strong motion cues but adds significant computational overhead.

Method: Proposes CAKE framework with Dynamic Motion Adapter (DMA) to suppress static background noise and emphasize pixel changes, approximating optical flow without explicit computation. Uses Floating Contrastive Learning to distinguish informative motion dynamics from temporal background.

Result: Achieves standout mAP compared with state-of-the-art while using the same backbone. Operates at over 72 FPS on a single CPU, making it highly suitable for resource-constrained systems. Validated on TVSeries, THUMOS'14, Kinetics-400 datasets.

Conclusion: CAKE effectively transfers motion knowledge from optical flow to RGB models, achieving high performance with low computational overhead, making it practical for real-time applications on resource-constrained systems.

Abstract: Online Action Detection (OAD) systems face two primary challenges: high computational cost and insufficient modeling of discriminative temporal dynamics against background motion. Adding optical flow could provides strong motion cues but it incurs significant computational overhead. We propose CAKE, a OAD Flow-based distillation framework to transfer motion knowledge into RGB models. We propose Dynamic Motion Adapter (DMA) to suppress static background noise and emphasize pixel changes, effectively approximating optical flow without explicit computation. The framework also integrates a Floating Contrastive Learning strategy to distinguish informative motion dynamics from temporal background. Various experiments conducted on the TVSeries, THUMOS'14, Kinetics-400 datasets show effectiveness of our model. CAKE achieves a standout mAP compared with SOTA while using the same backbone. Our model operates at over 72 FPS on a single CPU, making it highly suitable for resource-constrained systems.

[169] HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images

Yumeng Liu, Xiao-Xiao Long, Marc Habermann, Xuanze Yang, Cheng Lin, Yuan Liu, Yuexin Ma, Wenping Wang, Ligang Liu

Main category: cs.CV

TL;DR: A feed-forward architecture that jointly infers 3D hand meshes and camera poses from uncalibrated multi-view images, bridging the gap between single-view ambiguity and multi-view calibration requirements.

Details

Motivation: Current 3D hand reconstruction methods face a dilemma: single-view approaches suffer from depth ambiguity and occlusion but are easy to deploy, while multi-view systems resolve uncertainties but require fixed, calibrated setups that limit real-world utility. There's a need for methods that can leverage unstructured image data and work with consumer-grade cameras without complex calibration.

Method: Reformulates hand reconstruction from arbitrary views as a visual-geometry grounded task. Proposes a feed-forward architecture that jointly infers 3D hand meshes and camera poses from uncalibrated views, drawing inspiration from 3D foundation models that learn explicit geometry directly from visual data.

Result: Extensive evaluations show the approach outperforms state-of-the-art benchmarks and demonstrates strong generalization to uncalibrated, in-the-wild scenarios.

Conclusion: The proposed method successfully bridges the gap between single-view and multi-view approaches by enabling accurate 3D hand reconstruction from uncalibrated views, offering both accuracy and deployment flexibility for real-world applications.

Abstract: Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility, requiring the ability to leverage massive amounts of unstructured image data from the internet or enable deployment on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. While single-view approaches are easy to deploy, they suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these uncertainties but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art benchmarks and demonstrates strong generalization to uncalibrated, in-the-wild scenarios. Here is the link of our project page: https://lym29.github.io/HGGT/.

[170] DB SwinT: A Dual-Branch Swin Transformer Network for Road Extraction in Optical Remote Sensing Imagery

Zongyang He, Xiangli Yang, Xian Gao, Zhiguo Wang

Main category: cs.CV

TL;DR: DB SwinT: A dual-branch Swin Transformer network for road extraction from optical remote sensing imagery, combining local detail recovery with global context modeling to handle occlusions and fragmentation.

Details

Motivation: Road extraction from high-resolution remote sensing imagery is crucial for urban planning, traffic monitoring, and disaster management, but remains challenging due to occlusions (trees, buildings) causing fragmented structures and reduced accuracy.

Method: Proposes Dual-Branch Swin Transformer network (DB SwinT) that combines Swin Transformer’s long-range dependency modeling with U-Net’s multi-scale feature fusion. Uses dual-branch encoder: local branch recovers fine structural details in occluded areas, global branch captures broader semantic context. Includes Attentional Feature Fusion (AFF) module to adaptively fuse features from both branches.

Result: Achieves IoU scores of 79.35% on Massachusetts dataset and 74.84% on DeepGlobe dataset, demonstrating effectiveness for road extraction from optical remote sensing imagery.

Conclusion: DB SwinT effectively addresses road extraction challenges in complex environments by combining local detail recovery with global context modeling, outperforming existing methods on benchmark datasets.

Abstract: With the continuous improvement in the spatial resolution of optical remote sensing imagery, accurate road extraction has become increasingly important for applications such as urban planning, traffic monitoring, and disaster management. However, road extraction in complex urban and rural environments remains challenging, as roads are often occluded by trees, buildings, and other objects, leading to fragmented structures and reduced extraction accuracy. To address this problem, this paper proposes a Dual-Branch Swin Transformer network (DB SwinT) for road extraction. The proposed framework combines the long-range dependency modeling capability of the Swin Transformer with the multi-scale feature fusion strategy of U-Net, and employs a dual-branch encoder to learn complementary local and global representations. Specifically, the local branch focuses on recovering fine structural details in occluded areas, while the global branch captures broader semantic context to preserve the overall continuity of road networks. In addition, an Attentional Feature Fusion (AFF) module is introduced to adaptively fuse features from the two branches, further enhancing the representation of occluded road segments. Experimental results on the Massachusetts and DeepGlobe datasets show that DB SwinT achieves Intersection over Union (IoU) scores of 79.35% and 74.84%, respectively, demonstrating its effectiveness for road extraction from optical remote sensing imagery.

[171] UW-VOS: A Large-Scale Dataset for Underwater Video Object Segmentation

Hongshen Zhao, Jingkang Tai, Yuhang Wu, Wenkang Zhang, Xi Lan, Shangyan Wang, Tianyu Zhang, Wankou Yang

Main category: cs.CV

TL;DR: First large-scale underwater video object segmentation benchmark (UW-VOS) with 1,431 sequences and 309k masks, plus SAM-U framework adapting SAM2 with lightweight adapters for underwater domain.

Details

Motivation: Underwater VOS is crucial for marine exploration but suffers from domain gap issues like color distortion, low contrast, and camouflage. Existing open-air methods degrade significantly underwater due to lack of high-quality training data.

Method: 1) Created UW-VOS benchmark via semi-automatic data engine with human verification. 2) Proposed SAM-U framework that adapts SAM2 to underwater domain by inserting lightweight adapters into image encoder, requiring only ~2% trainable parameters.

Result: Existing methods experience average 13-point J&F drop on UW-VOS, while SAM-U effectively bridges domain gap and achieves state-of-the-art performance. Analysis identifies small targets, camouflage, and exit-re-entry as critical bottlenecks.

Conclusion: UW-VOS benchmark addresses data scarcity in underwater VOS, SAM-U demonstrates efficient domain adaptation, and analysis provides roadmap for future robust underwater perception research.

Abstract: Underwater Video Object Segmentation (VOS) is essential for marine exploration, yet open-air methods suffer significant degradation due to color distortion, low contrast, and prevalent camouflage. A primary hurdle is the lack of high-quality training data. To bridge this gap, we introduce $\textbf{UW-VOS}$, the first large-scale underwater VOS benchmark comprising 1,431 video sequences across 409 categories with 309,295 mask annotations, constructed via a semi-automatic data engine with rigorous human verification. We further propose $\textbf{SAM-U}$, a parameter-efficient framework that adapts SAM2 to the underwater domain. By inserting lightweight adapters into the image encoder, SAM-U achieves state-of-the-art performance with only $\sim$2$%$ trainable parameters. Extensive experiments reveal that existing methods experience an average 13-point $\mathcal{J}&\mathcal{F}$ drop on UW-VOS, while SAM-U effectively bridges this domain gap. Detailed attribute-based analysis further identifies small targets, camouflage, and exit-re-entry as critical bottlenecks, providing a roadmap for future research in robust underwater perception.

[172] Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Han Sun, Qin Li, Peixin Wang, Min Zhang

Main category: cs.CV

TL;DR: AIR addresses object hallucination in LVLMs by rectifying attention imbalance across modalities and tokens, achieving up to 35.1% reduction in hallucination rates while improving general capabilities.

Details

Motivation: Object hallucination in Large Vision-Language Models severely compromises reliability in real-world applications like autonomous driving and medical imaging, creating a critical barrier to deployment in high-stakes scenarios.

Method: Introduces attention imbalance concept to quantify attention disparity, then proposes Attention Imbalance Rectification (AIR) - a lightweight decoding-time intervention method that reallocates attention weights to rectify modality-wise and token-wise imbalances.

Result: Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, MM-Vet) with seven baselines show AIR consistently reduces object hallucination rates by up to 35.1% compared to baselines, while improving up to 15.9% of LVLMs’ general capability across diverse vision-language tasks.

Conclusion: Attention imbalance is a key factor driving object hallucination in LVLMs, and AIR provides an effective lightweight solution that not only reduces hallucinations but also enhances overall model performance across vision-language tasks.

Abstract: Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that the imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving up to 15.9% of LVLMs’ general capability across diverse vision-language tasks.

[173] COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm

Zekun Qian, Wei Feng, Ruize Han, Junhui Hou

Main category: cs.CV

TL;DR: COVTrack++ is an open-vocabulary multi-object tracking framework that addresses data and framework bottlenecks through C-TAO dataset and synergistic detection-association modules.

Details

Motivation: Traditional MOT is limited to specific categories, while open-vocabulary MOT lacks continuously annotated training data and customized frameworks that synergistically handle detection and association.

Method: Two-part approach: (1) C-TAO dataset with 26x denser annotations for smooth motion dynamics; (2) COVTrack++ framework with Multi-Cue Adaptive Fusion (MCF), Multi-Granularity Hierarchical Aggregation (MGA), and Temporal Confidence Propagation (TCP) modules.

Result: State-of-the-art on TAO dataset with novel TETA reaching 35.4% (val) and 30.5% (test), improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, with strong zero-shot generalization on BDD100K.

Conclusion: COVTrack++ effectively addresses OVMOT challenges through synergistic detection-association framework and dense annotation dataset, enabling robust tracking of arbitrary categories including novel objects.

Abstract: Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.

[174] When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

Ye Leng, Junjie Chu, Mingjie Li, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang

Main category: cs.CV

TL;DR: MLLMs generate more unsafe content than diffusion models due to better semantic understanding, creating new safety risks that are harder to detect.

Details

Motivation: To systematically analyze and compare safety risks of emerging multimodal large language models (MLLMs) versus diffusion models, focusing on unsafe content generation and fake image synthesis.

Method: Systematic analysis across multiple unsafe generation benchmark datasets, comparing MLLMs and diffusion models on two dimensions: unsafe content generation and fake image synthesis. Testing current fake image detectors and evaluating their effectiveness against MLLM-generated content.

Result: MLLMs generate more unsafe images than diffusion models. Diffusion models often fail to interpret abstract prompts, producing corrupted outputs, while MLLMs comprehend these prompts and generate unsafe content. MLLM-generated images are harder for current detectors to identify, even when retrained with MLLM-specific data. Longer, more descriptive inputs can bypass detectors.

Conclusion: Emerging safety risks of MLLMs have not been sufficiently recognized, posing new challenges to real-world safety. The enhanced semantic understanding capability of MLLMs introduces greater safety risks compared to diffusion models.

Abstract: Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks. Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content. For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLMs-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs. Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.

[175] SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision

Avigail Cohen Rimon, Amir Mann, Mirela Ben Chen, Or Litany

Main category: cs.CV

TL;DR: SpectralSplats: A robust 3D Gaussian Splatting tracking framework that uses frequency-domain supervision to solve vanishing gradient problems in camera misalignment scenarios.

Details

Motivation: 3D Gaussian Splatting enables real-time novel view synthesis but suffers from vanishing gradient problems when camera misalignment occurs, as the local support of Gaussian primitives causes gradients to vanish when rendered objects fall outside target footprints.

Method: Shifts optimization from spatial to frequency domain using global complex sinusoidal features (Spectral Moments), creating a global basin of attraction. Implements principled Frequency Annealing to transition from global convexity to precise spatial alignment without periodic local minima.

Result: SpectralSplats successfully recovers complex deformations from severely misaligned initializations where standard appearance-based tracking fails catastrophically, working across diverse deformation parameterizations.

Conclusion: Frequency-domain supervision provides a robust solution to vanishing gradient problems in 3DGS tracking, enabling reliable optimization even with severe misalignment through global spectral features and principled annealing.

Abstract: 3D Gaussian Splatting (3DGS) enables real-time, photorealistic novel view synthesis, making it a highly attractive representation for model-based video tracking. However, leveraging the differentiability of the 3DGS renderer “in the wild” remains notoriously fragile. A fundamental bottleneck lies in the compact, local support of the Gaussian primitives. Standard photometric objectives implicitly rely on spatial overlap; if severe camera misalignment places the rendered object outside the target’s local footprint, gradients strictly vanish, leaving the optimizer stranded. We introduce SpectralSplats, a robust tracking framework that resolves this “vanishing gradient” problem by shifting the optimization objective from the spatial to the frequency domain. By supervising the rendered image via a set of global complex sinusoidal features (Spectral Moments), we construct a global basin of attraction, ensuring that a valid, directional gradient toward the target exists across the entire image domain, even when pixel overlap is completely nonexistent. To harness this global basin without introducing periodic local minima associated with high frequencies, we derive a principled Frequency Annealing schedule from first principles, gracefully transitioning the optimizer from global convexity to precise spatial alignment. We demonstrate that SpectralSplats acts as a seamless, drop-in replacement for spatial losses across diverse deformation parameterizations (from MLPs to sparse control points), successfully recovering complex deformations even from severely misaligned initializations where standard appearance-based tracking catastrophically fails.

[176] A^3: Towards Advertising Aesthetic Assessment

Kaiyuan Ji, Yixuan Gao, Lu Sun, Yushuo Zheng, Zijian Chen, Jianbo Zhang, Xiangyang Zhu, Yuan Tian, Zicheng Zhang, Guangtao Zhai

Main category: cs.CV

TL;DR: A^3 (Advertising Aesthetic Assessment) is a comprehensive framework for evaluating advertising images using a multimodal large language model with a theory-driven paradigm, large dataset, and benchmark for objective aesthetic assessment.

Details

Motivation: Current advertising image evaluation methods rely on subjective judgments, lack scalability, standardized criteria, and interpretability. There's a need for objective, scalable assessment frameworks.

Method: Developed A^3 framework with four components: A^3-Law (three-stage hierarchical paradigm), A^3-Dataset (120K instruction-response pairs from 30K images), A^3-Align (multimodal LLM trained with CoT-guided learning), and A^3-Bench benchmark.

Result: A^3-Align achieves superior alignment with the A^3-Law paradigm compared to existing models, and demonstrates good generalization to quality advertisement selection and prescriptive advertisement critique tasks.

Conclusion: The A^3 framework provides a comprehensive, theory-driven approach to advertising aesthetic assessment with strong performance and generalization capabilities, offering potential for broader deployment in advertising evaluation.

Abstract: Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A^3 (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench). Central to A^3 is a theory-driven paradigm, A^3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A^3-Law, we construct A^3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A^3-Align, trained under A^3-Law with CoT-guided learning on A^3-Dataset. Extensive experiments on A^3-Bench demonstrate that A^3-Align achieves superior alignment with A^3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: https://github.com/euleryuan/A3-Align.

[177] SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons

Haiyang Xu, Ronghuan Wu, Li-Yi Wei, Nanxuan Zhao, Chenxi Liu, Cuong Nguyen, Zhuowen Tu, Zhaowen Wang

Main category: cs.CV

TL;DR: SemLayer is a pipeline that reconstructs editable layered structures from flattened vector icons by generating chromatically differentiated representations, performing semantic completion of occluded regions, and assembling parts into layered vector representations.

Details

Motivation: Flattened vector icons lose original semantic layering, hindering downstream editing, restyling, and animation tasks. Current vector graphics lack editable layered structures needed for practical design workflows.

Method: 1) Generate chromatically differentiated representation to make semantic components visually separable; 2) Perform semantic completion to reconstruct coherent object-level shapes including occluded regions; 3) Assemble recovered parts into layered vector representation with inferred occlusion relationships.

Result: Extensive qualitative comparisons and quantitative evaluations demonstrate effectiveness. Enables editing workflows previously inapplicable to flattened vector graphics. Establishes semantic layer reconstruction as practical and valuable task.

Conclusion: SemLayer successfully restores editable layered structures from flattened vector art, addressing a significant limitation in current design workflows and enabling new editing capabilities for vector graphics.

Abstract: Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation empowered pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inapplicable to flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task. Project page: https://xxuhaiyang.github.io/SemLayer/

[178] HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models

Yeqi He, Liang Li, Zhiwen Yang, Xichun Sheng, Zhidong Zhao, Chenggang Yan

Main category: cs.CV

TL;DR: Training-free style transfer method using heterogeneous attention modulation to balance style-content trade-off in diffusion models

Details

Motivation: Existing style transfer methods using diffusion models often fail to capture complex style references or retain content image identity, struggling with style-content balance

Method: Proposes HAM (heterogeneous attention modulation) with style noise initialization, Global Attention Regulation (GAR), and Local Attention Transplantation (LAT) to preserve content details while capturing style

Result: Achieves state-of-the-art performance on multiple quantitative metrics through qualitative and quantitative experiments

Conclusion: HAM effectively addresses style-content trade-off in diffusion-based style transfer without requiring training

Abstract: Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models’ robust feature extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, these methods often fail to capture complex style reference or retain the identity of user-provided content images, thus falling into the trap of style-content balance. Thus, we propose a training-free style transfer approach via $\textbf{h}$eterogeneous $\textbf{a}$ttention $\textbf{m}$odulation ($\textbf{HAM}$) to protect identity information during image/text-guided style reference transfer, thereby addressing the style-content trade-off challenge. Specifically, we first introduces style noise initialization to initialize latent noise for diffusion. Then, during the diffusion process, it innovatively employs HAM for different attention mechanisms, including Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserving the details of the content image while capturing complex style references. Our approach is validated through a series of qualitative and quantitative experiments, achieving state-of-the-art performance on multiple quantitative metrics.

[179] LGEST: Dynamic Spatial-Spectral Expert Routing for Hyperspectral Image Classification

Jiawen Wen, Suixuan Qiu, Zihang Luo, Xiaofei Yang, Haotian Shi

Main category: cs.CV

TL;DR: LGEST is a novel transformer-based framework for hyperspectral image classification that combines local-global representations through expert systems and cross-attention mechanisms to address spectral-spatial scale disparities and the Hughes phenomenon.

Details

Motivation: Existing deep learning methods for hyperspectral image classification have limitations: inflexible integration of local-global representations, inadequate handling of spectral-spatial scale disparities across heterogeneous bands, and susceptibility to the Hughes phenomenon under high-dimensional sample heterogeneity.

Method: LGEST combines three key innovations: 1) Deep Spatial-Spectral Autoencoder (DSAE) for compact discriminative embeddings, 2) Cross-Interactive Mixed Expert Feature Pyramid (CIEM-FPN) for dynamic multi-scale feature fusion, and 3) Local-Global Expert System (LGES) with sparsely activated expert pairs (convolutional for fine-grained textures, transformer for long-range dependencies).

Result: Extensive experiments on four benchmark datasets demonstrate that LGEST consistently outperforms state-of-the-art methods in hyperspectral image classification.

Conclusion: LGEST effectively addresses the limitations of existing methods by synergistically combining local-global representations through expert systems and cross-attention mechanisms, achieving superior performance in hyperspectral image classification.

Abstract: Deep learning methods, including Convolutional Neural Networks, Transformers and Mamba, have achieved remarkable success in hyperspectral image (HSI) classification. Nevertheless, existing methods exhibit inflexible integration of local-global representations, inadequate handling of spectral-spatial scale disparities across heterogeneous bands, and susceptibility to the Hughes phenomenon under high-dimensional sample heterogeneity. To address these challenges, we propose Local-Global Expert Spatial-Spectral Transformer (LGEST), a novel framework that synergistically combines three key innovations. The LGEST first employs a Deep Spatial-Spectral Autoencoder (DSAE) to generate compact yet discriminative embeddings through hierarchical nonlinear compression, preserving 3D neighborhood coherence while mitigating information loss in high-dimensional spaces. Secondly, a Cross-Interactive Mixed Expert Feature Pyramid (CIEM-FPN) leverages cross-attention mechanisms and residual mixture-of-experts layers to dynamically fuse multi-scale features, adaptively weighting spectral discriminability and spatial saliency through learnable gating functions. Finally, a Local-Global Expert System (LGES) processes decomposed features via sparsely activated expert pairs: convolutional sub-experts capture fine-grained textures, while transformer sub-experts model long-range contextual dependencies, with a routing controller dynamically selecting experts based on real-time feature saliency. Extensive experiments on four benchmark datasets demonstrate that LGEST consistently outperforms state-of-the-art methods.

[180] Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics

Jipeng Liu, Haichao Shi, Siyu Xing, Rong Yin, Xiao-Yu Zhang

Main category: cs.CV

TL;DR: CoRIT addresses optimization collapse in deepfake detection by analyzing geometric stability of SAM optimization and proposing contrastive gradient proxy with training-free strategies to improve generalization on non-semantic forgeries.

Details

Motivation: Vision-Language Models like CLIP have limitations in deepfake detection because their semantic-centric pre-training fails to capture non-semantic artifacts in hyper-realistic synthesis, leading to optimization collapse under Sharpness-Aware Minimization.

Method: Proposes Contrastive Regional Injection Transformer (CoRIT) with Contrastive Gradient Proxy and three training-free strategies: Region Refinement Mask, Regional Signal Injection, and Hierarchical Representation Integration to address optimization collapse.

Result: CoRIT mitigates optimization collapse and achieves state-of-the-art generalization performance across cross-domain and universal forgery benchmarks.

Conclusion: The paper identifies optimization collapse as a fundamental issue in deepfake detection with VLMs and provides both theoretical analysis and practical solution through CoRIT to improve generalization on non-semantic forgeries.

Abstract: While Vision-Language Models (VLMs) like CLIP have emerged as a dominant paradigm for generalizable deepfake detection, a representational disconnect remains: their semantic-centric pre-training is ill-suited for capturing non-semantic artifacts inherent to hyper-realistic synthesis. In this work, we identify a failure mode termed Optimization Collapse, where detectors trained with Sharpness-Aware Minimization (SAM) degenerate to random guessing on non-semantic forgeries once the perturbation radius exceeds a narrow threshold. To theoretically formalize this collapse, we propose the Critical Optimization Radius (COR) to quantify the geometric stability of the optimization landscape, and leverage the Gradient Signal-to-Noise Ratio (GSNR) to measure generalization potential. We establish a theorem proving that COR increases monotonically with GSNR, thereby revealing that the geometric instability of SAM optimization originates from degraded intrinsic generalization potential. This result identifies the layer-wise attenuation of GSNR as the root cause of Optimization Collapse in detecting non-semantic forgeries. Although naively reducing perturbation radius yields stable convergence under SAM, it merely treats the symptom without mitigating the intrinsic generalization degradation, necessitating enhanced gradient fidelity. Building on this insight, we propose the Contrastive Regional Injection Transformer (CoRIT), which integrates a computationally efficient Contrastive Gradient Proxy (CGP) with three training-free strategies: Region Refinement Mask to suppress CGP variance, Regional Signal Injection to preserve CGP magnitude, and Hierarchical Representation Integration to attain more generalizable representations. Extensive experiments demonstrate that CoRIT mitigates optimization collapse and achieves state-of-the-art generalization across cross-domain and universal forgery benchmarks.

[181] Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement

Xin Zhang, Jianyang Xu, Hao Peng, Dongjing Wang, Jingyuan Zheng, Yu Li, Yuyu Yin, Hongbo Wang

Main category: cs.CV

TL;DR: TMKD enhances knowledge distillation by using dual-modality teachers (visual and text/CLIP) with multi-view inputs and prior-aware prompts for adaptive feature fusion, plus vision-language contrastive regularization.

Details

Motivation: Existing knowledge distillation methods focus on distillation strategies but overlook enhancing teacher knowledge quality. The paper aims to provide richer supervisory signals by leveraging both visual and textual modalities.

Method: Proposes Text-guided Multi-view Knowledge Distillation (TMKD) with: 1) Dual-modality teachers (visual teacher and text teacher/CLIP), 2) Enhanced visual teacher with multi-view inputs (edge and high-frequency features), 3) Text teacher generates semantic weights through prior-aware prompts for adaptive feature fusion, 4) Vision-language contrastive regularization to strengthen semantic knowledge in student model.

Result: Extensive experiments on five benchmarks show TMKD consistently improves knowledge distillation performance by up to 4.49%, validating the effectiveness of the dual-teacher multi-view enhancement strategy.

Conclusion: TMKD demonstrates that enhancing teacher knowledge quality through multi-view visual priors and text guidance significantly improves knowledge distillation performance, offering a novel approach to multimodal knowledge transfer.

Abstract: Knowledge distillation transfers knowledge from large teacher models to smaller students for efficient inference. While existing methods primarily focus on distillation strategies, they often overlook the importance of enhancing teacher knowledge quality. In this paper, we propose Text-guided Multi-view Knowledge Distillation (TMKD), which leverages dual-modality teachers, a visual teacher and a text teacher (CLIP), to provide richer supervisory signals. Specifically, we enhance the visual teacher with multi-view inputs incorporating visual priors (edge and high-frequency features), while the text teacher generates semantic weights through prior-aware prompts to guide adaptive feature fusion. Additionally, we introduce vision-language contrastive regularization to strengthen semantic knowledge in the student model. Extensive experiments on five benchmarks demonstrate that TMKD consistently improves knowledge distillation performance by up to 4.49%, validating the effectiveness of our dual-teacher multi-view enhancement strategy. Code is available at https://anonymous.4open.science/r/TMKD-main-44D1.

[182] AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer’s Disease Diagnosis

Qiuhui Chen, Yushan Deng, Xuancheng Yao, Yi Hong

Main category: cs.CV

TL;DR: Multimodal framework for Alzheimer’s disease diagnosis combining structural MRI with clinical data and rule-based verification for transparent, guideline-aligned reasoning.

Details

Motivation: Current multimodal models for Alzheimer's disease diagnosis are opaque and not well-aligned with established clinical guidelines, lacking transparent reasoning and structured outputs.

Method: AD-Reasoning framework uses modality-specific encoders for structural MRI and six clinical modalities, bidirectional cross-attention fusion, reinforcement fine-tuning with verifiable rewards, and a rule-based verifier to ensure NIA-AA guideline consistency.

Result: Achieves state-of-the-art diagnostic accuracy on AD-MultiSense dataset (10,378-visit multimodal QA dataset), produces structured rationales that improve transparency over baselines.

Conclusion: The framework successfully integrates multimodal data with clinical reasoning guidelines to provide transparent, structured diagnoses for Alzheimer’s disease.

Abstract: Alzheimer’s disease (AD) diagnosis requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most multimodal models remain opaque and weakly guideline-aligned. We present AD-Reasoning, a multimodal framework that couples structural MRI with six clinical modalities and a rule-based verifier to generate structured, NIA-AA-consistent diagnoses. AD-Reasoning combines modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning with verifiable rewards that enforce output format, guideline evidence coverage, and reasoning–decision consistency. We also release AD-MultiSense, a 10,378-visit multimodal QA dataset with guideline-validated rationales built from ADNI/AIBL. On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured rationales that improve transparency over recent baselines, while providing transparent rationales.

[183] PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation

Yuheng Feng, Wen Zhang, Haodong Duan, Xingxing Zou

Main category: cs.CV

TL;DR: PosterIQ is a comprehensive benchmark for poster understanding and generation, evaluating multimodal models on design-specific tasks like layout parsing, typography analysis, and composition-aware generation.

Details

Motivation: Current multimodal models lack understanding of visual design principles like composition structure, typographic hierarchy, and semantic intent in posters. There's a need for benchmarks that bridge visual design cognition with generative modeling to improve models' creative capabilities.

Method: Created a dataset of 7,765 image-annotation instances and 822 generation prompts across real, professional, and synthetic posters. Defined tasks including layout parsing, text-image correspondence, typography/readability analysis, font perception, design quality assessment, and controllable composition-aware generation with metaphor.

Result: Evaluation of state-of-the-art MLLMs and diffusion models revealed persistent gaps in visual hierarchy understanding, typographic semantics, saliency control, and intention communication. Commercial models performed well on high-level reasoning but were insensitive automatic raters, while generators rendered text well but struggled with composition-aware synthesis.

Conclusion: PosterIQ serves as both a quantitative benchmark and diagnostic tool for design reasoning, offering reproducible metrics to catalyze models’ creativity and integrate human-centered design principles into vision-language systems.

Abstract: We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image-annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text-image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models’ creativity and integrate human-centred design principles into generative vision-language systems.

[184] LGTM: Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation

Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Ko Watanabe, Riku Takahashi, Andreas Dengel

Main category: cs.CV

TL;DR: A training-free method for light-guided text-to-image diffusion models that manipulates initial latent noise to control lighting direction without fine-tuning.

Details

Motivation: Existing methods for lighting control in text-to-image generation use inefficient two-stage pipelines requiring fine-tuning with large datasets, limiting adaptability to new models and tasks.

Method: Proposes LGTM (Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation) that selectively manipulates specific channels in the initial latent noise based on channel-wise analysis of the latent space to control lighting direction.

Result: Method surpasses prompt-based baselines in lighting consistency while preserving image quality and text alignment, and integrates seamlessly with models like ControlNet.

Conclusion: The approach enables fine-grained lighting control without fine-tuning or modifying pre-trained models, introducing new possibilities for dynamic, user-guided light control in text-to-image generation.

Abstract: Diffusion models have demonstrated high-quality performance in conditional text-to-image generation, particularly with structural cues such as edges, layouts, and depth. However, lighting conditions have received limited attention and remain difficult to control within the generative process. Existing methods handle lighting through a two-stage pipeline that relights images after generation, which is inefficient. Moreover, they rely on fine-tuning with large datasets and heavy computation, limiting their adaptability to new models and tasks. To address this, we propose a novel Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation (LGTM), which manipulates the initial latent noise of the diffusion process to guide image generation with text prompts and user-specified light directions. Through a channel-wise analysis of the latent space, we find that selectively manipulating latent channels enables fine-grained lighting control without fine-tuning or modifying the pre-trained model. Extensive experiments show that our method surpasses prompt-based baselines in lighting consistency, while preserving image quality and text alignment. This approach introduces new possibilities for dynamic, user-guided light control. Furthermore, it integrates seamlessly with models like ControlNet, demonstrating adaptability across diverse scenarios.

[185] LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation

Haoyu Ji, Xueting Liu, Yu Gao, Wenze Huang, Zhihao Yang, Weihong Ren, Zhiyong Wang, Honghai Liu

Main category: cs.CV

TL;DR: LaDy integrates Lagrangian dynamics principles into skeleton-based action segmentation to improve discriminability between similar actions and enhance boundary localization by modeling physical forces and energy constraints.

Details

Motivation: Existing skeleton-based temporal action segmentation methods focus on spatio-temporal kinematics but neglect underlying physical dynamics, limiting their ability to distinguish between actions with similar kinematics but different dynamic intents and hindering precise boundary localization where dynamic force profiles change.

Method: Proposes Lagrangian-Dynamic Informed Network (LaDy) that computes generalized coordinates from joint positions, estimates Lagrangian terms under physical constraints to synthesize generalized forces, and uses an Energy Consistency Loss to enforce the work-energy theorem. A Spatio-Temporal Modulation module fuses generalized forces with spatial representations for discriminative semantics and constructs salient dynamic signals for temporal gating.

Result: Experiments on challenging datasets show LaDy achieves state-of-the-art performance, validating the benefits of integrating physical dynamics for action segmentation.

Conclusion: Incorporating Lagrangian dynamics principles significantly improves skeleton-based action segmentation by providing more discriminative semantics and enhanced boundary awareness through physical coherence constraints.

Abstract: Skeleton-based Temporal Action Segmentation (STAS) aims to densely parse untrimmed skeletal sequences into frame-level action categories. However, existing methods, while proficient at capturing spatio-temporal kinematics, neglect the underlying physical dynamics that govern human motion. This oversight limits inter-class discriminability between actions with similar kinematics but distinct dynamic intents, and hinders precise boundary localization where dynamic force profiles shift. To address these, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework integrating principles of Lagrangian dynamics into the segmentation process. Specifically, LaDy first computes generalized coordinates from joint positions and then estimates Lagrangian terms under physical constraints to explicitly synthesize the generalized forces. To further ensure physical coherence, our Energy Consistency Loss enforces the work-energy theorem, aligning kinetic energy change with the work done by the net force. The learned dynamics then drive a Spatio-Temporal Modulation module: Spatially, generalized forces are fused with spatial representations to provide more discriminative semantics. Temporally, salient dynamic signals are constructed for temporal gating, thereby significantly enhancing boundary awareness. Experiments on challenging datasets show that LaDy achieves state-of-the-art performance, validating the integration of physical dynamics for action segmentation. Code is available at https://github.com/HaoyuJi/LaDy.

[186] Granular Ball Guided Stable Latent Domain Discovery for Domain-General Crowd Counting

Fan Chen, Shuyin Xia, Yi Wang, Xinbo Gao

Main category: cs.CV

TL;DR: A granular ball guided stable latent domain discovery framework for domain-general crowd counting that transforms sample-level clustering into hierarchical representative-based clustering for more stable pseudo-domain assignments, combined with a two-branch learning framework to separate semantic and style representations.

Details

Motivation: Single-source domain generalization for crowd counting is challenging due to heterogeneous latent domains in source data and severe distribution shifts in test data. Direct flat clustering on sample-level features is unstable due to noise, outliers, and representation drift, leading to unreliable pseudo-domain assignments.

Method: 1) Organize samples into compact local granular balls and cluster ball centers as representatives for hierarchical pseudo-domain discovery. 2) Two-branch learning framework: semantic branch with codebook re-encoding for transferable representations, and style branch for domain-specific appearance variations to reduce semantic-style entanglement.

Result: Extensive experiments on ShanghaiTech A/B, UCF_QNRF, and NWPU-Crowd datasets under strict no-adaptation protocol show the method consistently outperforms strong baselines, especially under large domain gaps.

Conclusion: The proposed granular ball guided stable latent domain discovery framework effectively addresses the instability in pseudo-domain assignment for domain-general crowd counting, leading to improved generalization performance under domain shifts.

Abstract: Single-source domain generalization for crowd counting remains highly challenging because a single labeled source domain often contains heterogeneous latent domains, while test data may exhibit severe distribution shifts. A fundamental difficulty lies in stable latent domain discovery: directly performing flat clustering on evolving sample-level latent features is easily affected by feature noise, outliers, and representation drift, leading to unreliable pseudo-domain assignments and weakened domain-structured learning. To address this issue, we propose a granular ball guided stable latent domain discovery framework for domain-general crowd counting. Specifically, the proposed method first organizes samples into compact local granular balls and then clusters granular ball centers as representatives to obtain pseudo-domains, transforming direct sample-level clustering into a hierarchical representative-based clustering process. This design yields more stable and semantically consistent pseudo-domain assignments. Built upon the discovered latent domains, we further develop a two-branch learning framework that enhances transferable semantic representations via semantic codebook re-encoding while modeling domain-specific appearance variations through a style branch, thereby reducing semantic–style entanglement and improving generalization under domain shifts. Extensive experiments on ShanghaiTech A/B, UCF_QNRF, and NWPU-Crowd under a strict no-adaptation protocol demonstrate that the proposed method consistently outperforms strong baselines, especially under large domain gaps.

[187] Retinal Layer Segmentation in OCT Images With 2.5D Cross-slice Feature Fusion Module for Glaucoma Assessment

Hyunwoo Kim, Heesuk Kim, Wungrak Choi, Jae-Sang Hyun

Main category: cs.CV

TL;DR: A 2.5D segmentation framework with cross-slice feature fusion for consistent retinal layer segmentation in OCT images, balancing contextual awareness and computational efficiency for glaucoma diagnosis.

Details

Motivation: Existing 2D OCT segmentation methods suffer from slice-to-slice inconsistencies due to lack of contextual information across adjacent B-scans, while 3D methods require expensive computational resources.

Method: Proposes a 2.5D segmentation framework with a novel cross-slice feature fusion (CFF) module integrated into a U-Net-like architecture to fuse inter-slice features and capture contextual information.

Result: Achieved 8.56% reduction in mean absolute distance and 13.92% reduction in root mean square error compared to methods without CFF, demonstrating improved segmentation accuracy and robustness on clinical and DUKE DME datasets.

Conclusion: The 2.5D framework effectively balances contextual awareness and computational efficiency, enabling reliable retinal layer delineation for automated glaucoma evaluation and clinical applications.

Abstract: For accurate glaucoma diagnosis and monitoring, reliable retinal layer segmentation in OCT images is essential. However, existing 2D segmentation methods often suffer from slice-to-slice inconsistencies due to the lack of contextual information across adjacent B-scans. 3D segmentation methods are better for capturing slice-to-slice context, but they require expensive computational resources. To address these limitations, we propose a 2.5D segmentation framework that incorporates a novel cross-slice feature fusion (CFF) module into a U-Net-like architecture. The CFF module fuses inter-slice features to effectively capture contextual information, enabling consistent boundary detection across slices and improved robustness in noisy regions. The framework was validated on both a clinical dataset and the publicly available DUKE DME dataset. Compared to other segmentation methods without the CFF module, the proposed method achieved an 8.56% reduction in mean absolute distance and a 13.92% reduction in root mean square error, demonstrating improved segmentation accuracy and robustness. Overall, the proposed 2.5D framework balances contextual awareness and computational efficiency, enabling anatomically reliable retinal layer delineation for automated glaucoma evaluation and potential clinical applications.

[188] Combi-CAM: A Novel Multi-Layer Approach for Explainable Image Geolocalization

David Faget, José Luis Lisani, Miguel Colom

Main category: cs.CV

TL;DR: Combi-CAM enhances explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps from multiple network layers instead of just the deepest layer.

Details

Motivation: While deep learning models have advanced photo geolocalization, understanding the reasoning behind their predictions remains challenging. Current methods typically use only the deepest layer for explainability, which may not provide sufficient detail about how different image features contribute to decisions.

Method: Combi-CAM combines gradient-weighted class activation maps (Grad-CAM) from several layers of the CNN architecture rather than using only information from the deepest layer. This multi-layer approach provides a more comprehensive view of feature contributions.

Result: The approach provides more detailed understanding of how different image features contribute to the model’s geolocalization decisions, offering deeper insights than traditional single-layer methods.

Conclusion: Combi-CAM improves explainability in CNN-based geolocalization models by leveraging multi-layer activation maps, enhancing understanding of model decision-making processes.

Abstract: Planet-scale photo geolocalization involves the intricate task of estimating the geographic location depicted in an image purely based on its visual features. While deep learning models, particularly convolutional neural networks (CNNs), have significantly advanced this field, understanding the reasoning behind their predictions remains challenging. In this paper, we present Combi-CAM, a novel method that enhances the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from several layers of the network architecture, rather than using only information from the deepest layer as is typically done. This approach provides a more detailed understanding of how different image features contribute to the model’s decisions, offering deeper insights than the traditional approaches.

[189] Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep

Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao, Yi Wang, Kim-Hui Yap, Lap-Pui Chau

Main category: cs.CV

TL;DR: HetCache is a training-free acceleration framework for diffusion-based video editing that reduces redundant attention operations in Diffusion Transformers by selectively caching context tokens based on their relevance to generative tokens.

Details

Motivation: Diffusion Transformers (DiT) for video editing are computationally expensive due to iterative denoising. Existing acceleration methods focus on timestep-level feature reuse but ignore architectural redundancy in attention operations over spatio-temporal tokens.

Method: HetCache assesses contextual relevance and interaction strength among tokens, divides spatial-temporal tokens into context and generative tokens, and selectively caches context tokens with strongest correlation to generative ones to reduce redundant attention operations.

Result: Achieves 2.67× latency speedup and FLOPs reduction over foundation models with negligible degradation in editing quality.

Conclusion: HetCache effectively accelerates diffusion-based video editing by exploiting architectural redundancy in DiT models while maintaining editing consistency and fidelity.

Abstract: Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in denoising process, but overlooks the architectural redundancy within the DiT that many attention operations over spatio-temporal tokens are redundantly executed, offering little to no incremental contribution to the model output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion-based masked video-to-video (MV2V) generation and editing. Instead of uniformly reuse or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatial-temporal tokens in DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves a noticeable acceleration, including a 2.67$\times$ latency speedup and FLOPs reduction over commonly used foundation models, with negligible degradation in editing quality.

[190] Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation

Haoyu Ji, Bowen Chen, Zhihao Yang, Wenze Huang, Yu Gao, Xueting Liu, Weihong Ren, Zhiyong Wang, Honghai Liu

Main category: cs.CV

TL;DR: Spectral Scalpel is a frequency-selective filtering framework for skeleton-based temporal action segmentation that enhances inter-class discriminability and sharpens boundaries by suppressing shared frequency components between adjacent actions while amplifying action-specific frequencies.

Details

Motivation: Existing skeleton-based temporal action segmentation methods suffer from limited inter-class discriminability and blurred segmentation boundaries due to insufficient distinction of spatio-temporal patterns between adjacent actions.

Method: Proposes Spectral Scalpel framework with adaptive multi-scale spectral filters to edit frequency spectra, coupled with discrepancy loss between adjacent actions. Also introduces frequency-aware channel mixer to strengthen channel evolution by aggregating spectra across channels.

Result: Extensive experiments on five public datasets demonstrate state-of-the-art performance for skeleton-based temporal action segmentation.

Conclusion: Presents a novel paradigm for STAS that extends conventional spatio-temporal modeling by incorporating frequency-domain analysis to enhance inter-action discrepancies and sharpen transition boundaries.

Abstract: Skeleton-based Temporal Action Segmentation (STAS) seeks to densely segment and classify diverse actions within long, untrimmed skeletal motion sequences. However, existing STAS methodologies face challenges of limited inter-class discriminability and blurred segmentation boundaries, primarily due to insufficient distinction of spatio-temporal patterns between adjacent actions. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework aimed at suppressing shared frequency components between adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action discrepancies and sharpening transition boundaries. Specifically, Spectral Scalpel employs adaptive multi-scale spectral filters as scalpels to edit frequency spectra, coupled with a discrepancy loss between adjacent actions serving as the surgical objective. This design amplifies representational disparities between neighboring actions, effectively mitigating boundary localization ambiguities and inter-class confusion. Furthermore, complementing long-term temporal modeling, we introduce a frequency-aware channel mixer to strengthen channel evolution by aggregating spectra across channels. This work presents a novel paradigm for STAS that extends conventional spatio-temporal modeling by incorporating frequency-domain analysis. Extensive experiments on five public datasets demonstrate that Spectral Scalpel achieves state-of-the-art performance. Code is available at https://github.com/HaoyuJi/SpecScalpel.

[191] Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye

Main category: cs.CV

TL;DR: TSRL framework uses reinforcement learning to dynamically weight training samples for deepfake detection, improving generalization to unseen manipulation techniques.

Details

Motivation: Standard supervised training treats all samples equally, which is suboptimal for learning robust features. Need adaptive curriculum to prioritize high-value samples like hard-but-learnable examples.

Method: Tutor-Student Reinforcement Learning (TSRL) models training as Markov Decision Process. Tutor (PPO agent) observes rich state representation (visual features + learning dynamics) and assigns continuous weight (0-1) to sample loss. Rewarded based on Student’s immediate performance change.

Result: Adaptive curriculum improves Student’s generalization capabilities against unseen manipulation techniques compared to traditional training methods.

Conclusion: TSRL framework enables more efficient and effective training for deepfake detection by dynamically optimizing training curriculum through reinforcement learning.

Abstract: Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a Tutor'' agent learns to guide a Student’’ (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample’s loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student’s immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student’s generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.

[192] LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds

Jaehun Bang, Jinhyeok Kim, Minji Kim, Seungheon Jeong, Kyungdon Joo

Main category: cs.CV

TL;DR: LightSplat is a fast, memory-efficient training-free framework for open-vocabulary 3D scene understanding that injects compact 2-byte semantic indices into 3D representations from multi-view images, achieving 50-400x speedup and 64x lower memory usage.

Details

Motivation: Existing open-vocabulary 3D scene understanding approaches are slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments, limiting scalability for real-world applications.

Method: LightSplat injects compact 2-byte semantic indices into 3D representations from multi-view images, assigns semantic indices only to salient regions, uses lightweight index-feature mapping to eliminate costly feature optimization, and employs single-step clustering for semantic consistency and efficient inference.

Result: Achieves state-of-the-art performance on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes with up to 50-400x speedup and 64x lower memory usage compared to existing methods.

Conclusion: LightSplat enables scalable language-driven 3D understanding through its efficient training-free framework that dramatically reduces computational overhead while maintaining high performance.

Abstract: Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory, enabling scalable language-driven 3D understanding. For more details, visit our project page https://vision3d-lab.github.io/lightsplat/.

[193] A convergent Plug-and-Play Majorization-Minimization algorithm for Poisson inverse problems

Thibaut Modrzyk, Ane Etxebeste, Élie Bretin, Voichita Maxim

Main category: cs.CV

TL;DR: Novel variational plug-and-play algorithm for Poisson inverse problems combining Kullback-Leibler data fidelity with pre-trained neural network regularization, enabling use of Gaussian denoisers with convergence guarantees.

Details

Motivation: Address Poisson inverse problems (common in nuclear medicine, microscopy) where traditional methods struggle with high noise. Need to leverage pre-trained neural network denoisers while maintaining theoretical convergence guarantees.

Method: Variational plug-and-play approach minimizing explicit functional: KL data fidelity term + regularization via pre-trained neural network. Combines classical likelihood maximization with gradient-based denoisers. Formulated in majorization-minimization framework for convergence guarantees.

Result: State-of-the-art performance in deconvolution and tomography under moderate noise, clear superiority in high-noise conditions. Particularly valuable for nuclear medicine applications where Poisson noise is prevalent.

Conclusion: Method successfully integrates pre-trained Gaussian denoisers into Poisson inverse problem solving with theoretical convergence guarantees, offering robust performance especially in challenging high-noise scenarios.

Abstract: In this paper, we present a novel variational plug-and-play algorithm for Poisson inverse problems. Our approach minimizes an explicit functional which is the sum of a Kullback-Leibler data fidelity term and a regularization term based on a pre-trained neural network. By combining classical likelihood maximization methods with recent advances in gradient-based denoisers, we allow the use of pre-trained Gaussian denoisers without sacrificing convergence guarantees. The algorithm is formulated in the majorization-minimization framework, which guarantees convergence to a stationary point. Numerical experiments confirm state-of-the-art performance in deconvolution and tomography under moderate noise, and demonstrate clear superiority in high-noise conditions, making this method particularly valuable for nuclear medicine applications.

[194] Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Jing Zhang, Jun Zhang, Xing Wei, Yi Liu, Dianhai Yu, Yanjun Ma

Main category: cs.CV

TL;DR: PaddleOCR-VL is a coarse-to-fine document parsing architecture that uses a Valid Region Focus Module to identify semantically relevant regions, reducing redundant vision tokens and computational costs while maintaining high performance with a compact 0.9B vision-language model.

Details

Motivation: Document parsing requires high-resolution images for good performance, but this leads to quadratic increase in vision tokens and high computational costs due to visual redundancy (like backgrounds). Need efficient approach that maintains accuracy while reducing computation.

Method: Proposes PaddleOCR-VL with: 1) Lightweight Valid Region Focus Module (VRFM) that identifies valid vision tokens using localization and contextual relationship prediction; 2) Compact 0.9B vision-language model trained for detailed recognition guided by VRFM outputs, avoiding direct processing of entire large images.

Result: Achieves state-of-the-art performance in both page-level parsing and element-level recognition. Outperforms existing solutions, competitive with top-tier VLMs, with fast inference using substantially fewer vision tokens and parameters.

Conclusion: Targeted coarse-to-fine parsing is effective for accurate and efficient document understanding. The approach balances performance and efficiency by focusing on semantically relevant regions while suppressing redundant ones.

Abstract: Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

[195] CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Akash Ghosh, Tajamul Ashraf, Rishu Kumar Singh, Numan Saeed, Sriparna Saha, Xiuying Chen, Salman Khan

Main category: cs.CV

TL;DR: CarePilot is a multi-agent framework for long-horizon medical software automation using vision-language models with actor-critic paradigm and dual-memory mechanisms.

Details

Motivation: Existing multimodal agentic pipelines focus on short-horizon or general-purpose applications, leaving long-horizon automation in domain-specific healthcare systems largely unexplored. There's a need for systems that can handle complex, multi-step workflows across medical software tools.

Method: Proposes CarePilot, a multi-agent framework based on actor-critic paradigm. Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict next semantic actions. Critic evaluates actions, updates memory based on effects, and provides corrective feedback. Uses iterative agentic simulation for training.

Result: CarePilot achieves state-of-the-art performance, outperforming strong closed-source multimodal baselines by ~15.26% and open-source baselines by ~3.38% on the CareFlow benchmark and out-of-distribution datasets.

Conclusion: The CarePilot framework effectively addresses long-horizon automation challenges in healthcare software systems, demonstrating superior performance over existing vision-language models through its multi-agent architecture and memory mechanisms.

Abstract: Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to perform more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38%, respectively, on our benchmark and out-of-distribution dataset.

[196] Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao

Main category: cs.CV

TL;DR: HeROD introduces heuristic-inspired spatial and semantic reasoning priors into DETR-style referring object detection pipelines to improve performance in low-data regimes, outperforming baselines on RefCOCO benchmarks.

Details

Motivation: Referring object detection models struggle with label scarcity in practical deployments like robotics and AR. The paper explores whether explicit reasoning priors can help models learn more efficiently when data is scarce.

Method: Proposes HeROD framework that injects heuristic-inspired spatial and semantic reasoning priors into 3 stages of DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. These interpretable priors bias training and inference toward plausible candidates.

Result: HeROD consistently outperforms strong grounding baselines on RefCOCO, RefCOCO+, and RefCOCOg benchmarks in scarce-label regimes, demonstrating improved label efficiency and convergence performance.

Conclusion: Integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding, suggesting explicit reasoning can enhance multimodal learning in low-data conditions.

Abstract: Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, would face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose the HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived based on the referring phrase, into 3 stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding.

[197] Language-Guided Structure-Aware Network for Camouflaged Object Detection

Min Zhang

Main category: cs.CV

TL;DR: LGSAN: A language-guided network for camouflaged object detection that uses CLIP text prompts to guide visual attention and incorporates frequency-domain edge enhancement for better boundary perception.

Details

Motivation: Existing camouflaged object detection methods lack textual semantic guidance, limiting their ability to focus on camouflaged regions in complex scenes where objects blend with backgrounds in color, texture, and structure.

Method: Proposes LGSAN with four key components: 1) CLIP-based text-image mask generation to guide PVT-v2 features, 2) Fourier Edge Enhancement Module for frequency-domain high-frequency information, 3) Structure-Aware Attention Module for better boundary perception, and 4) Coarse-Guided Local Refinement Module for fine-grained reconstruction.

Result: Extensive experiments show highly competitive performance across multiple COD datasets, validating effectiveness and robustness of the approach.

Conclusion: Language guidance combined with structure-aware attention and frequency-domain edge enhancement significantly improves camouflaged object detection performance by better focusing on target regions and enhancing boundary perception.

Abstract: Camouflaged Object Detection (COD) aims to segment objects that are highly integrated with the background in terms of color, texture, and structure, making it a highly challenging task in computer vision. Although existing methods introduce multi-scale fusion and attention mechanisms to alleviate the above issues, they generally lack the guidance of textual semantic priors, which limits the model’s ability to focus on camouflaged regions in complex scenes. To address this issue, this paper proposes a Language-Guided Structure-Aware Network (LGSAN). Specifically, based on the visual backbone PVT-v2, we introduce CLIP to generate masks from text prompts and RGB images, thereby guiding the multi-scale features extracted by PVT-v2 to focus on potential target regions. On this foundation, we further design a Fourier Edge Enhancement Module (FEEM), which integrates multi-scale features with high-frequency information in the frequency domain to extract edge enhancement features. Furthermore, we propose a Structure-Aware Attention Module (SAAM) to effectively enhance the model’s perception of object structures and boundaries. Finally, we introduce a Coarse-Guided Local Refinement Module (CGLRM) to enhance fine-grained reconstruction and boundary integrity of camouflaged object regions. Extensive experiments demonstrate that our method consistently achieves highly competitive performance across multiple COD datasets, validating its effectiveness and robustness.

[198] Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection

Adhemar de Senneville, Xavier Bou, Jérémy Anger, Rafael Grompone, Gabriele Facciolo

Main category: cs.CV

TL;DR: LVLMs underperform at image classification despite using CLIP encoders; the paper shows LVLMs can improve visual feature separability via prompt conditioning and that internal attention heads outperform the model itself at classification, leading to Head Ensemble Classifiers (HEC) which achieves SOTA few-shot/zero-shot classification.

Details

Motivation: Large Vision Language Models (LVLMs) excel at tasks like image captioning and VQA but surprisingly underperform at image classification compared to CLIP-based methods, despite using CLIP-pretrained vision encoders. This gap exists because CLIP's separate vision-text encoders bias classification toward class-name matching rather than joint visual-text reasoning.

Method: The paper introduces Head Ensemble Classifiers (HEC), inspired by Gaussian Discriminant Analysis. HEC ranks the most discriminative vision and text attention heads from LVLMs and combines them into a training-free classifier. It leverages LVLMs’ internal representations, particularly attention heads, which can outperform the model itself at classification tasks.

Result: HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets, bridging the performance gap between CLIP-based and LVLM-based classification methods.

Conclusion: Despite poor raw performance, LVLMs contain valuable internal representations (especially attention heads) that can be effectively leveraged for classification tasks. HEC demonstrates that LVLMs can match or exceed CLIP-based methods for classification when their internal representations are properly utilized.

Abstract: Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP’s architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs’ internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.

[199] RefReward-SR: LR-Conditioned Reward Modeling for Preference-Aligned Super-Resolution

Yushuai Song, Weize Quan, Weining Wang, Jiahui Sun, Jing Liu, Meng Li, Pengbin Yu, Zhentao Chen, Wei Shen, Lunxi Yuan, Dong-ming Yan

Main category: cs.CV

TL;DR: RefReward-SR: A novel LR reference-aware reward model for super-resolution that uses multimodal LLMs to evaluate semantic consistency and plausibility, achieving better alignment with human perception than traditional metrics.

Details

Motivation: Existing super-resolution evaluation metrics (Full-Reference and No-Reference) fail to align with human perception, either penalizing semantically plausible details due to pixel misalignment or favoring sharp but inconsistent artifacts. Current SR methods rely on ground-truth-dependent distribution matching that doesn't correspond to human judgments.

Method: Proposes RefReward-SR, an LR reference-aware reward model that assesses HR reconstructions conditioned on LR inputs, treating LR as semantic anchor. Uses MLLM visual-linguistic priors to evaluate semantic consistency and plausibility. Creates RefSR-18K dataset with LR-conditioned preference rankings. Fine-tunes MLLM with Group Relative Policy Optimization using LR-conditioned ranking rewards, and integrates GRPO into SR model training with RefReward-SR as reward signal.

Result: The framework achieves substantially better alignment with human judgments, producing reconstructions that preserve semantic consistency while enhancing perceptual plausibility and visual naturalness.

Conclusion: RefReward-SR provides a novel approach to preference-aligned super-resolution by leveraging MLLMs for LR-conditioned evaluation, addressing the misalignment between traditional metrics and human perception.

Abstract: Recent advances in generative super-resolution (SR) have greatly improved visual realism, yet existing evaluation and optimization frameworks remain misaligned with human perception. Full-Reference and No-Reference metrics often fail to reflect perceptual preference, either penalizing semantically plausible details due to pixel misalignment or favoring visually sharp but inconsistent artifacts. Moreover, most SR methods rely on ground-truth (GT)-dependent distribution matching, which does not necessarily correspond to human judgments. In this work, we propose RefReward-SR, a low-resolution (LR) reference-aware reward model for preference-aligned SR. Instead of relying on GT supervision or NR evaluation, RefReward-SR assesses high-resolution (HR) reconstructions conditioned on their LR inputs, treating the LR image as a semantic anchor. Leveraging the visual-linguistic priors of a Multimodal Large Language Models (MLLM), it evaluates semantic consistency and plausibility in a reasoning-aware manner. To support this paradigm, we construct RefSR-18K, the first large-scale LR-conditioned preference dataset for SR, providing pairwise rankings based on LR-HR consistency and HR naturalness. We fine-tune the MLLM with Group Relative Policy Optimization (GRPO) using LR-conditioned ranking rewards, and further integrate GRPO into SR model training with RefReward-SR as the core reward signal for preference-aligned generation. Extensive experiments show that our framework achieves substantially better alignment with human judgments, producing reconstructions that preserve semantic consistency while enhancing perceptual plausibility and visual naturalness. Code, models, and datasets will be released upon paper acceptance.

[200] HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer

Minjun Kim, Minje Kim

Main category: cs.CV

TL;DR: HEART-PFL is a personalized federated learning framework with hierarchical directional alignment and adversarial knowledge transfer that achieves SOTA accuracy on vision datasets while maintaining client specificity.

Details

Motivation: Existing PFL methods suffer from shallow prototype alignment and brittle server-side distillation, limiting their ability to deliver effective client-specific models under heterogeneous data distributions.

Method: Dual-sided framework with: (1) Hierarchical Directional Alignment (HDA) using cosine similarity in early stages and MSE matching in deep stages, and (2) Adversarial Knowledge Transfer (AKT) with symmetric KL distillation on clean and adversarial proxy data. Uses lightweight adapters with only 1.46M parameters.

Result: Achieves state-of-the-art personalized accuracy on CIFAR-100 (63.42%), Flowers-102 (84.23%), and Caltech-101 (95.67%) under Dirichlet non-IID partitions. Remains robust to out-of-domain proxy data.

Conclusion: HEART-PFL simultaneously enhances personalization and global stability, offering a strong and scalable solution for PFL. HDA and AKT provide complementary gains in alignment, robustness, and optimization stability.

Abstract: Personalized Federated Learning (PFL) aims to deliver effective client-specific models under heterogeneous distributions, yet existing methods suffer from shallow prototype alignment and brittle server-side distillation. We propose HEART-PFL, a dual-sided framework that (i) performs depth-aware Hierarchical Directional Alignment (HDA) using cosine similarity in the early stage and MSE matching in the deep stage to preserve client specificity, and (ii) stabilizes global updates through Adversarial Knowledge Transfer (AKT) with symmetric KL distillation on clean and adversarial proxy data. Using lightweight adapters with only 1.46M trainable parameters, HEART-PFL achieves state-of-the-art personalized accuracy on CIFAR-100, Flowers-102, and Caltech-101 (63.42%, 84.23%, and 95.67%, respectively) under Dirichlet non-IID partitions, and remains robust to out-of-domain proxy data. Ablation studies further confirm that HDA and AKT provide complementary gains in alignment, robustness, and optimization stability, offering insights into how the two components mutually reinforce effective personalization. Overall, these results demonstrate that HEART-PFL simultaneously enhances personalization and global stability, highlighting its potential as a strong and scalable solution for PFL(code available at https://github.com/danny0628/HEART-PFL).

[201] RVLM: Recursive Vision-Language Models with Adaptive Depth

Nicanor Mayumu, Zeenath Khan, Melodena Stephens, Patrick Mukala, Farhad Oroumchian

Main category: cs.CV

TL;DR: RVLM is an iterative reasoning framework for medical AI that replaces single-pass inference with a generate-execute loop using Python code and vision sub-agents, while RRouter adaptively controls iteration depth based on task complexity.

Details

Motivation: Address two key limitations in medical AI: 1) conventional vision-language models produce black-box predictions that lack auditability and clinical explanations, and 2) existing iterative reasoning systems use fixed iteration budgets that waste compute on simple cases while providing insufficient depth for complex ones.

Method: RVLM uses an iterative generate-execute loop where at each step the model writes Python code, invokes vision sub-agents, manipulates images, and accumulates evidence. RRouter is a lightweight controller that predicts optimal iteration budget from task-complexity features and monitors progress for early termination.

Result: Evaluated on BraTS 2023 Meningioma (brain MRI) and MIMIC-CXR (chest X-ray) using Gemini 2.5 Flash without fine-tuning. Shows high consistency on salient findings, can detect cross-modal discrepancies, generates structured reports, and recognizes view-specific artifacts.

Conclusion: The framework provides auditable, code-grounded medical AI predictions with adaptive computational efficiency, addressing key limitations in clinical AI governance and practical deployment.

Abstract: Medical AI systems face two fundamental limitations. First, conventional vision-language models (VLMs) perform single-pass inference, yielding black-box predictions that cannot be audited or explained in clinical terms. Second, iterative reasoning systems that expose intermediate steps rely on fixed iteration budgets wasting compute on simple cases while providing insufficient depth for complex ones. We address both limitations with a unified framework. RVLM replaces single-pass inference with an iterative generate-execute loop: at each step, the model writes Python code, invokes vision sub-agents, manipulates images, and accumulates evidence. Every diagnostic claim is grounded in executable code, satisfying auditability requirements of clinical AI governance frameworks. RRouter makes iteration depth adaptive: a lightweight controller predicts the optimal budget from task-complexity features, then monitors progress and terminates early when reasoning stalls. We evaluate on BraTS 2023 Meningioma (brain MRI) and MIMIC-CXR (chest X-ray) using Gemini 2.5 Flash without fine-tuning. Across repeated runs, RVLM shows high consistency on salient findings (e.g., mass presence and enhancement) and can detect cross-modal discrepancies between Fluid-Attenuated Inversion Recovery (FLAIR) signal characteristics and segmentation boundaries. On MIMIC-CXR, it generates structured reports and correctly recognises view-specific artefacts. Code: https://github.com/nican2018/rvlm.

[202] Counting Without Numbers & Finding Without Words

Badri Narayana Patro

Main category: cs.CV

TL;DR: First multimodal system using visual and acoustic biometrics for pet reunification, addressing limitations of appearance-only matching by incorporating species-specific vocal recognition.

Details

Motivation: Current pet reunification systems fail because they rely only on visual appearance, while animals recognize each other through sound. 70% of lost pets never reunite despite matches existing, highlighting the need for multimodal approaches that incorporate acoustic biometrics.

Method: Species-adaptive architecture that processes vocalizations across frequency ranges (10Hz elephant rumbles to 4kHz puppy whines) and pairs them with probabilistic visual matching that tolerates stress-induced appearance changes. Grounded in 50 years of cognitive science showing animals perceive quantity approximately and communicate identity acoustically.

Result: First multimodal reunification system integrating visual and acoustic biometrics, demonstrating AI can serve vulnerable populations lacking human language by being grounded in biological communication principles.

Conclusion: Multimodal AI systems incorporating both visual and acoustic biometrics can significantly improve pet reunification success by addressing the fundamental way animals recognize each other, moving beyond appearance-only approaches.

Abstract: Every year, 10 million pets enter shelters, separated from their families. Despite desperate searches by both guardians and lost animals, 70% never reunite, not because matches do not exist, but because current systems look only at appearance, while animals recognize each other through sound. We ask, why does computer vision treat vocalizing species as silent visual objects? Drawing on five decades of cognitive science showing that animals perceive quantity approximately and communicate identity acoustically, we present the first multimodal reunification system integrating visual and acoustic biometrics. Our species-adaptive architecture processes vocalizations from 10Hz elephant rumbles to 4kHz puppy whines, paired with probabilistic visual matching that tolerates stress-induced appearance changes. This work demonstrates that AI grounded in biological communication principles can serve vulnerable populations that lack human language.

[203] InstanceRSR: Real-World Super-Resolution via Instance-Aware Representation Alignment

Zixin Guo, Kai Zhao, Luyan Zhang

Main category: cs.CV

TL;DR: InstanceRSR: A real-world super-resolution framework that uses instance-level feature alignment and semantic modeling to recover fine-grained details while maintaining global consistency.

Details

Motivation: Existing RSR methods based on generative priors struggle to recover fine-grained details of diverse object instances in complex real-world scenes because denoising losses like MSE favor global consistency over instance-level perception.

Method: Proposes InstanceRSR framework that: 1) uses LR images as global consistency guidance while jointly modeling image data and semantic segmentation maps for semantic relevance, 2) designs instance representation learning module to align diffusion latent space with instance latent space for instance-aware feature alignment, 3) incorporates scale alignment mechanism for fine-grained perception.

Result: Extensive experiments on multiple real-world benchmarks show InstanceRSR significantly outperforms existing methods in both quantitative metrics and visual quality, achieving new SOTA performance.

Conclusion: InstanceRSR effectively addresses the limitation of existing RSR methods by incorporating instance-level feature alignment and semantic modeling, enabling both photorealistic detail generation and semantic consistency at the instance level.

Abstract: Existing real-world super-resolution (RSR) methods based on generative priors have achieved remarkable progress in producing high-quality and globally consistent reconstructions. However, they often struggle to recover fine-grained details of diverse object instances in complex real-world scenes. This limitation primarily arises because commonly adopted denoising losses (e.g., MSE) inherently favor global consistency while neglecting instance-level perception and restoration. To address this issue, we propose InstanceRSR, a novel RSR framework that jointly models semantic information and introduces instance-level feature alignment. Specifically, we employ low-resolution (LR) images as global consistency guidance while jointly modeling image data and semantic segmentation maps to enforce semantic relevance during sampling. Moreover, we design an instance representation learning module to align the diffusion latent space with the instance latent space, enabling instance-aware feature alignment, and further incorporate a scale alignment mechanism to enhance fine-grained perception and detail recovery. Benefiting from these designs, our approach not only generates photorealistic details but also preserves semantic consistency at the instance level. Extensive experiments on multiple real-world benchmarks demonstrate that InstanceRSR significantly outperforms existing methods in both quantitative metrics and visual quality, achieving new state-of-the-art (SOTA) performance.

[204] B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach to Micro-Action Recognition

Nishit Poddar, Aglind Reka, Diana-Laura Borza, Snehashis Majhi, Michal Balazia, Abhijit Das, Francois Bremond

Main category: cs.CV

TL;DR: B-MoE is a body-part-aware Mixture-of-Experts framework for micro-action recognition that uses specialized experts for different body regions and fuses local and global motion features to capture subtle, fleeting human motions.

Details

Motivation: Current action recognition models struggle with micro-actions (glances, nods, posture shifts) due to their subtlety, short duration, and high inter-class ambiguity. These fleeting motions carry rich social meaning but are difficult to recognize with existing approaches.

Method: B-MoE uses a Body-part-aware Mixture-of-Experts framework with specialized experts for distinct body regions (head, body, upper limbs, lower limbs). Each expert uses a lightweight Macro-Micro Motion Encoder (M3E) to capture long-range context and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects informative regions. A dual-stream encoder fuses region-specific semantic cues with global motion features.

Result: Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-the-art gains, with improvements in ambiguous, underrepresented, and low amplitude classes.

Conclusion: B-MoE effectively addresses the challenges of micro-action recognition by explicitly modeling the structured nature of human motion through body-part specialization and fusion of local/global features, achieving superior performance on difficult micro-action recognition tasks.

Abstract: Micro-actions, fleeting and low-amplitude motions, such as glances, nods, or minor posture shifts, carry rich social meaning but remain difficult for current action recognition models to recognize due to their subtlety, short duration, and high inter-class ambiguity. In this paper, we introduce B-MoE, a Body-part-aware Mixture-of-Experts framework designed to explicitly model the structured nature of human motion. In B-MoE, each expert specializes in a distinct body region (head, body, upper limbs, lower limbs), and is based on the lightweight Macro-Micro Motion Encoder (M3E) that captures long-range contextual structure and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects the most informative regions for each micro-action. B-MoE uses a dual-stream encoder that fuses these region-specific semantic cues with global motion features to jointly capture spatially localized cues and temporally subtle variations that characterize micro-actions. Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-theart gains, with improvements in ambiguous, underrepresented, and low amplitude classes.

[205] Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning

Tommaso Galliena, Stefano Rosa, Tommaso Apicella, Pietro Morerio, Alessio Del Bue, Lorenzo Natale

Main category: cs.CV

TL;DR: A unified memory-augmented vision-language agent that maintains consistent object descriptions across viewpoints for embodied agents through autoregressive processing of RGB observations, exploration maps, and object-level episodic memory.

Details

Motivation: Vision-Language Models often produce inconsistent descriptions of the same object across different viewpoints, which hinders embodied agents from building consistent semantic representations over time. Previous approaches used offline multi-view aggregation or multi-stage pipelines that couldn't effectively reason about previously observed objects.

Method: Introduces a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes current RGB observation, top-down explored map, and object-level episodic memory serialized into object-level tokens to ensure persistent object identity and semantic consistency across extended sequences.

Result: Extensive evaluation on manually annotated object-level test set shows improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through compact scene representation.

Conclusion: The proposed unified framework successfully addresses the inconsistency problem in VLMs for embodied agents by maintaining persistent object identity and semantic consistency across viewpoints through memory-augmented autoregressive processing.

Abstract: Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluation on a manually annotated object-level test set, demonstrate improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through a compact scene representation. Code, model weights, and data are available at https://github.com/hsp-iit/epos-vlm

[206] ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang, Wangmeng Zuo

Main category: cs.CV

TL;DR: ScrollScape reformulates extreme aspect ratio image synthesis as continuous video generation to overcome structural failures in diffusion models, achieving 32K resolution with global coherence.

Details

Motivation: Current diffusion models fail at generating ultra-high-resolution imagery with extreme aspect ratios, suffering from catastrophic structural failures like object repetition and spatial fragmentation due to lack of robust spatial priors from conventional training data.

Method: ScrollScape maps spatial expansion of large canvases to temporal evolution of video frames, leveraging video model temporal consistency as global constraint. Uses Scanning Positional Encoding (ScanPE) as flexible moving camera and Scrolling Super-Resolution (ScrollSR) to handle memory bottlenecks. Fine-tuned on curated 3K multi-ratio image dataset.

Result: Achieves unprecedented 32K resolution with exceptional global coherence and visual fidelity, significantly outperforming existing image-diffusion baselines by eliminating severe localized artifacts across diverse domains at extreme scales.

Conclusion: ScrollScape successfully overcomes structural bottlenecks in extreme aspect ratio image generation by reformulating the problem as video generation, leveraging temporal consistency priors from video models to ensure long-range structural integrity.

Abstract: While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation.This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions.To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations.By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity.Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.

[207] TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification

Guan Luo, Xiu Li, Rui Chen, Xuanyu Yi, Jing Lin, Chia-Hao Chen, Jiahang Liu, Song-Hai Zhang, Jianfeng Zhang

Main category: cs.CV

TL;DR: TopoMesh: A sparse voxel-based VAE that uses Dual Marching Cubes topology to establish explicit mesh-level correspondences between ground-truth and predicted meshes, improving 3D reconstruction fidelity and sharp feature preservation.

Details

Motivation: Existing VAEs for 3D generation suffer from representation mismatch between ground-truth meshes (variable topology) and network predictions (fixed-structure implicit fields), preventing explicit mesh-level correspondences and causing poor preservation of fine geometric details and sharp features.

Method: Introduces TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes topological framework. Converts input meshes into DMC-compliant representations via remeshing with L∞ distance metric to preserve sharp edges. Decoder outputs meshes in same DMC format, enabling explicit vertex/face-level correspondences for direct mesh supervision.

Result: TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details compared to prior methods.

Conclusion: By establishing explicit mesh-level correspondences through a unified DMC topological framework, TopoMesh overcomes the representation mismatch problem in 3D VAEs, enabling higher-fidelity reconstruction and better preservation of geometric details.

Abstract: The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE’s reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the representation mismatch between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (\eg, SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topological framework. Specifically, we convert arbitrary input meshes into DMC-compliant representations via a remeshing algorithm that preserves sharp edges using an L$\infty$ distance metric. Our decoder outputs meshes in the same DMC format, ensuring that both predicted and target meshes share identical topological structures. This establishes explicit correspondences at the vertex and face level, allowing us to derive explicit mesh-level supervision signals for topology, vertex positions, and face orientations with clear gradients. Our sparse VAE architecture employs this unified framework and is trained with Teacher Forcing and progressive resolution training for stable and efficient convergence. Extensive experiments demonstrate that TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details.

[208] VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection

Jumin Lee, Siyeong Lee, Namil Kim, Sung-Eui Yoon

Main category: cs.CV

TL;DR: VERIA is a multimodal augmentation framework that synthesizes synchronized RGB-LiDAR instances using foundation models with semantic and geometric verification to improve rare-class 3D object detection in autonomous driving datasets.

Details

Motivation: Long-tail distributions in driving datasets create challenges for 3D perception, especially for rare classes that have substantial intra-class diversity but sparse coverage. Existing augmentation methods lack fine-grained diversity and proper scene-context placement.

Method: VERIA uses off-the-shelf foundation models to synthesize synchronized RGB-LiDAR instances, then applies sequential semantic and geometric verification to curate instances that match real LiDAR statistics while spanning wider intra-class variation. Includes stage-wise yield decomposition for pipeline diagnostics.

Result: On nuScenes and Lyft datasets, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings.

Conclusion: VERIA’s verification-centric approach effectively addresses long-tail distribution challenges in 3D perception by generating diverse, contextually appropriate synthetic instances that improve rare-class detection performance.

Abstract: Long-tail distributions in driving datasets pose a fundamental challenge for 3D perception, as rare classes exhibit substantial intra-class diversity yet available samples cover this variation space only sparsely. Existing instance augmentation methods based on copy-paste or asset libraries improve rare-class exposure but are often limited in fine-grained diversity and scene-context placement. We propose VERIA, an image-first multimodal augmentation framework that synthesizes synchronized RGB–LiDAR instances using off-the-shelf foundation models and curates them with sequential semantic and geometric verification. This verification-centric design tends to select instances that better match real LiDAR statistics while spanning a wider range of intra-class variation. Stage-wise yield decomposition provides a log-based diagnostic of pipeline reliability. On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings. Our code is available at https://sgvr.kaist.ac.kr/VERIA/.

[209] CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition

Florian Stilz, Vinkle Srivastav, Nassir Navab, Nicolas Padoy

Main category: cs.CV

TL;DR: CliPPER is a video-language pretraining framework for surgical procedures that introduces novel objectives for temporal understanding and multimodal alignment in long-form surgical videos.

Details

Motivation: Intraoperative surgical procedures present challenges due to scarce labeled data and need for precise temporal understanding. Existing video-language models struggle with fine-grained temporal alignment in long-form surgical content.

Method: Proposes Contextual Video-Text Contrastive Learning (VTC_CTX), Clip Order Prediction (COP), Cycle-Consistency Alignment, and Frame-Text Matching (FTM) to enhance multimodal alignment and temporal understanding in surgical videos.

Result: Achieves state-of-the-art performance across multiple surgical benchmarks for zero-shot recognition of phases, steps, instruments, and triplets.

Conclusion: CliPPER demonstrates effective video-language pretraining for surgical domain with novel temporal alignment strategies, enabling strong zero-shot performance on complex surgical tasks.

Abstract: Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA-public/CliPPER.

[210] RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation

Kai Zhu, Zhenyu Cui, Zehua Zang, Jiahuan Zhou

Main category: cs.CV

TL;DR: RS-SSM enhances video semantic segmentation by refining forgotten specific information in state space models through channel-wise amplitude perception and forgetting gate inversion.

Details

Motivation: State space models efficiently compress video for segmentation but lose specific pixel-level details during compression, limiting temporal consistency in semantic object segmentation.

Method: Proposes RS-SSM with Channel-wise Amplitude Perceptron (CwAP) to extract specific information distribution and Forgetting Gate Information Refiner (FGIR) to invert and refine the forgetting gate matrix for complementary refinement.

Result: Achieves state-of-the-art performance on four VSS benchmarks while maintaining high computational efficiency.

Conclusion: RS-SSM effectively enhances spatiotemporal pixel-level segmentation by refining forgotten specific information in state space compression.

Abstract: Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel-level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models’ capability for pixel-level segmentation. To tackle the above issue, we proposed a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. Besides, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model’s capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-RS-SSM.

[211] SEGAR: Selective Enhancement for Generative Augmented Reality

Fanjun Bu, Chenyang Yuan, Hiroshi Yasuda

Main category: cs.CV

TL;DR: SEGAR: A framework combining diffusion-based world models with selective correction for generating augmented reality future frames with region-specific edits while maintaining safety-critical alignment with real-world observations.

Details

Motivation: To enable practical AR applications by generating temporally coherent augmented future frames that can be cached ahead of time, avoiding per-frame rendering in real time, while ensuring safety-critical regions remain aligned with real-world observations.

Method: Combines a diffusion-based world model that generates augmented future frames with region-specific edits, followed by a selective correction stage that aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere.

Result: Demonstrated in driving scenarios where semantic region structure is well-defined and real-world feedback is available, showing the pipeline can generate, cache, and selectively correct augmented future frames.

Conclusion: SEGAR represents an early step toward using generative world models as practical AR infrastructure, enabling pre-computation of augmented frames with selective real-world alignment for safety-critical applications.

Abstract: Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they enable temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region-specific edits while preserving others, and the correction stage subsequently aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios as a representative setting where semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.

[212] AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication

Jie Song, Jun Jia, Wei Sun, Wangqiu Zhou, Tao Tan, Guangtao Zhai

Main category: cs.CV

TL;DR: AMIF is the first Authorizable Medical Image Fusion model with built-in authentication that embeds copyright identifiers for unauthorized usage while providing high-quality fusion results for authorized users.

Details

Motivation: Current multimodal image fusion models lack intellectual property protection, exposing proprietary model knowledge and sensitive training data through inference leakage, which malicious users can exploit via reverse engineering techniques.

Method: Proposes AMIF model with integrated authorization access control into the image fusion objective, embedding explicit visible copyright identifiers for unauthorized usage and providing high-quality fusion results only upon successful key-based authentication.

Result: First authorizable medical image fusion model that protects intellectual property rights while maintaining clinical utility for authorized users.

Conclusion: AMIF addresses the critical need for intellectual property protection in medical image fusion models through built-in authentication mechanisms, balancing security with clinical utility.

Abstract: Multimodal image fusion enables precise lesion localization and characterization for accurate diagnosis, thereby strengthening clinical decision-making and driving its growing prominence in medical imaging research. A powerful multimodal image fusion model relies on high-quality, clinically representative multimodal training data and a rigorously engineered model architecture. Therefore, the development of such professional radiomics models represents a collaborative achievement grounded in standardized acquisition, clinical-specific expertise, and algorithmic design proficiency, which necessitates protection of associated intellectual property rights. However, current multimodal image fusion models generate fused outputs without built-in mechanisms to safeguard intellectual property rights, inadvertently exposing proprietary model knowledge and sensitive training data through inference leakage. For example, malicious users can exploit fusion outputs and model distillation or other inference-based reverse engineering techniques to approximate the fusion performance of proprietary models. To address this issue, we propose AMIF, the first Authorizable Medical Image Fusion model with built-in authentication, which integrates authorization access control into the image fusion objective. For unauthorized usage, AMIF embeds explicit and visible copyright identifiers into fusion results. In contrast, high-quality fusion results are accessible upon successful key-based authentication.

[213] Refining time-space traffic diagrams: A neighborhood-adaptive linear regression method

Zhihong Yao, Yi Yu, Yunxia Wu, Hao Li, Yangsheng Jiang, Zhengbing He

Main category: cs.CV

TL;DR: A neighborhood-adaptive linear regression method for refining low-resolution time-space traffic diagrams by leveraging local pattern similarity and adaptive neighborhood identification.

Details

Motivation: Existing time-space traffic diagrams suffer from low resolution due to monitoring precision and sampling frequency limitations, which affects traffic theory research and engineering applications.

Method: Introduces neighborhood embedding concept, adaptively identifies neighborhoods similar to target cells, and fits low-to-high resolution mapping within these neighborhoods using neighborhood-adaptive linear regression.

Result: Achieves improvements of 9.16%, 8.16%, 1.86%, 3.89%, and 5.83% in MAE, MAPE, CMJS, SSIM, and GMSD metrics respectively compared to benchmarks, with strong generalization in cross-day and cross-scenario validations.

Conclusion: The method provides low-cost, fine-grained refinement of low-sampling-rate traffic data with minimal paired training data and concise formulation.

Abstract: The time-space (TS) traffic diagram serves as a crucial tool for characterizing the dynamic evolution of traffic flow, with its resolution directly influencing the effectiveness of traffic theory research and engineering applications. However, constrained by monitoring precision and sampling frequency, existing TS traffic diagrams commonly suffer from low resolution. To address this issue, this paper proposes a refinement method for TS traffic diagrams based on neighborhood-adaptive linear regression. Introducing the concept of neighborhood embedding into TS diagram refinement, the method leverages local pattern similarity in TS diagrams, adaptively identifies neighborhoods similar to target cells, and fits the low-to-high resolution mapping within these neighborhoods for refinement. It avoids the over-smoothing tendency of the traditional global linear model, allows the capture of unique traffic wave propagation and congestion evolution characteristics, and outperforms the traditional neighborhood embedding method in terms of local information utilization to achieve target cell refinement. Validation on two real datasets across multiple scales and upscaling factors shows that, compared to benchmark methods, the proposed method achieves improvements of 9.16%, 8.16%, 1.86%, 3.89%, and 5.83% in metrics including MAE, MAPE, CMJS, SSIM, and GMSD, respectively. Furthermore, the proposed method exhibits strong generalization and robustness in cross-day and cross-scenario validations. In summary, requiring only a minimal amount of paired high- and low-resolution training data, the proposed method features a concise formulation, providing a foundation for the low-cost, fine-grained refinement of low-sampling-rate traffic data.

[214] LensWalk: Agentic Video Understanding by Planning How You See in Videos

Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan

Main category: cs.CV

TL;DR: LensWalk is an agentic framework that enables LLMs to actively control their visual observation of videos through a reason-plan-observe loop, allowing dynamic specification of temporal scope and sampling density for on-demand evidence gathering.

Details

Motivation: Current video understanding methods rely on static, pre-processed information and cannot actively seek raw evidence from video as understanding evolves, creating a disconnect between reasoning and perception.

Method: Framework where LLM reasoner controls visual observation through dynamic specification of temporal scope and sampling density. Uses VLM-based tools for broad scans, focused fact extraction, and evidence stitching in a reason-plan-observe loop.

Result: Substantial plug-and-play performance gains without fine-tuning, boosting accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME.

Conclusion: Enabling agents to control how they see video is key to unlocking more accurate, robust, and interpretable video reasoning.

Abstract: The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent’s evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.

[215] Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions

Shiqin Wang, Haoyang Chen, Huaizhou Huang, Yinkan He, Dongfang Sun, Xiaoqing Chen, Xingyu Liu, Zheng Wang, Kaiyan Zhao

Main category: cs.CV

TL;DR: Autonomous class scheduler using RL for adaptive curriculum learning in semantic segmentation domain adaptation, achieving SOTA on adverse weather benchmarks.

Details

Motivation: Existing curriculum learning methods for unsupervised domain adaptation in semantic segmentation rely on static, handcrafted heuristics that fail to adapt to the model's evolving training dynamics, leading to category bias and suboptimal performance.

Method: Formulates curriculum learning as a sequential decision problem using Reinforcement Learning. Proposes an autonomous class scheduler with: (1) high-dimensional state encoder that maps model training status to latent space, and (2) category-fair policy-gradient objective ensuring balanced improvement across classes. Uses mixed source-target supervision.

Result: Achieves state-of-the-art performance on three widely used benchmarks (ACDC, Dark Zurich, Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.

Conclusion: The proposed autonomous class scheduler enables more adaptive and dynamic learning by directing network focus to the most informative classes at each training stage, overcoming limitations of static curriculum schedules.

Abstract: The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model’s evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model’s training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source-target supervision, the learned class rankings direct the network’s focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. It is worth noting that our method achieves state-of-the-art performance on three widely used benchmarks (e.g., ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.

[216] Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Duc Vu, Anh Nguyen, Chi Tran, Anh Tran

Main category: cs.CV

TL;DR: Anti-I2V is a novel defense method against malicious human image-to-video generation that operates in both Lab* and frequency domains, targeting temporal coherence degradation across diverse diffusion backbones including DiT models.

Details

Motivation: The paper addresses the threat of misuse of diffusion-based video generation models for creating fake videos from personal photos. Existing adversarial attacks primarily target image generation and UNet-based architectures, leaving a gap in protection against newer Diffusion Transformer (DiT) models which have better feature retention and temporal consistency.

Method: Anti-I2V operates in both Lab* and frequency domains instead of just RGB space, improving robustness and focusing on salient pixels. It identifies network layers that capture distinct semantic features during denoising to design training objectives that maximize degradation of temporal coherence and generation fidelity.

Result: Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering effective protection against malicious human image-to-video generation.

Conclusion: Anti-I2V provides an effective solution to protect against misuse of video diffusion models, particularly addressing the gap in defense against newer DiT architectures while maintaining robustness across diverse diffusion backbones.

Abstract: Advances in diffusion-based video generation models, while significantly improving human animation, poses threats of misuse through the creation of fake videos from a specific person’s photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention, and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the $L$$a$$b$* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.

Ciem Cornelissen, Sam Leroux, Pieter Simoens

Main category: cs.CV

TL;DR: Le MuMo JEPA is a self-supervised multimodal framework that learns unified representations from RGB images and aligned companion modalities (LiDAR depth, thermal) using fusion tokens as a latent bottleneck in a shared transformer architecture.

Details

Motivation: Most self-supervised learning methods operate on single modalities, missing complementary information from heterogeneous sensors. The paper aims to leverage multimodal data (RGB + aligned modalities like LiDAR depth or thermal) for more robust visual representation learning.

Method: Extends LeJEPA to multimodal setting with fusion tokens that act as latent bottleneck between modality-specific patch stems in shared transformer. Uses pruned fusion strategy: after initial cross-modal attention, modality-specific tokens are dropped, forcing cross-modal information into shared fusion-token grid before applying Sketched Isotropic Gaussian Regularization (SIGReg) to joint multimodal CLS embedding.

Result: Strongest performance-efficiency trade-off on downstream patch probes among from-scratch multimodal baselines on Waymo; improves CenterNet detection and dense depth while competitive on segmentation. Best results on FLIR thermal benchmark, especially after Waymo-initialized fine-tuning. Best overall accuracy-efficiency balance at substantially lower compute, memory, and training time.

Conclusion: Le MuMo JEPA effectively learns unified multimodal representations through efficient fusion token bottleneck, demonstrating strong performance across various vision tasks while maintaining computational efficiency, making it suitable for real-world multimodal applications.

Abstract: Self-supervised learning has emerged as a powerful paradigm for learning visual representations without manual annotations, yet most methods still operate on a single modality and therefore miss the complementary structure available from heterogeneous sensors. We present Le MuMo JEPA, a self-supervised framework that learns unified representations from RGB images and aligned companion modalities. In our driving experiments, the second modality is camera-aligned LiDAR depth; we also evaluate RGB-thermal training and transfer on the Teledyne FLIR ADAS benchmark. Our approach extends LeJEPA to the multi-modal setting by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer. Our default model employs a pruned fusion strategy: after an initial cross-modal attention layer, modality-specific tokens are dropped, forcing cross-modal information into the shared fusion-token grid as an efficient latent bottleneck before Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint multimodal CLS embedding. On Waymo, Le MuMo JEPA gives the strongest performance-efficiency trade-off on downstream patch probes among the from-scratch multimodal baselines, improving CenterNet detection and dense depth while remaining competitive on segmentation. Under from-scratch training on nuScenes, Le MuMo JEPA remains the strongest model, and it also gives the best FLIR results, especially after Waymo-initialized fine-tuning. It also retains the best overall accuracy-efficiency balance in our study at substantially lower compute, memory, and estimated training time.

[218] VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Qijia He, Xunmei Liu, Hammaad Memon, Ziang Li, Zixian Ma, Jaemin Cho, Jason Ren, Daniel S Weld, Ranjay Krishna

Main category: cs.CV

TL;DR: VFIG is a vision-language model family for converting raster figures to editable SVG vector graphics using a coarse-to-fine training approach with supervised fine-tuning and reinforcement learning refinement.

Details

Motivation: SVG files are often lost or inaccessible, leaving only rasterized versions that are difficult to modify or scale. Manual reconstruction is labor-intensive and requires specialized expertise, creating a need for automated figure-to-SVG conversion.

Method: Introduces VFIG-DATA dataset (66K figure-SVG pairs), uses coarse-to-fine training curriculum: supervised fine-tuning to learn atomic primitives, then reinforcement learning refinement to optimize global diagram fidelity, layout consistency, and topological edge cases.

Result: State-of-the-art performance among open-source models, performs on par with GPT-5.2, achieves VLM-Judge score of 0.829 on VFIG-BENCH evaluation suite.

Conclusion: VFIG successfully bridges the gap between raster figures and editable vector graphics through a data-driven approach with specialized training curriculum and comprehensive evaluation metrics.

Abstract: Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only “flat” rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.

[219] PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu

Main category: cs.CV

TL;DR: PP-OCRv5 is a lightweight OCR system with only 5M parameters that achieves competitive performance with billion-parameter vision-language models through data-centric optimization rather than architectural scaling.

Details

Motivation: Large-scale VLMs for OCR have high computational costs, poor text localization in complex layouts, and hallucination issues. The paper challenges the notion that model scale is the only path to high accuracy and explores whether lightweight specialized models can compete through better data curation.

Method: Developed PP-OCRv5, a lightweight two-stage OCR pipeline with 5M parameters. Focused on data-centric investigation by systematically analyzing three dimensions: data difficulty (complexity of samples), data accuracy (label quality), and data diversity (variety of scenarios). Used extensive experiments to show that high-quality, diverse training data can push the performance limits of efficient traditional architectures.

Result: PP-OCRv5 achieves performance competitive with billion-parameter VLMs on standard OCR benchmarks while offering superior localization precision and reduced hallucinations. Demonstrates that traditional two-stage OCR pipelines have higher performance ceilings than commonly assumed when trained with sufficient high-quality, diverse data.

Conclusion: Lightweight specialized models remain viable in the large-model era through data-centric optimization. The work provides practical insights into data curation for OCR and challenges the prevailing emphasis on architectural scaling alone.

Abstract: The advent of “OCR 2.0” and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

[220] GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization

Pengyue Jia, Derong Xu, Yingyi Zhang, Xiaopeng Li, Wenlin Zhang, Yi Wen, Yuanshao Zhu, Xiangyu Zhao

Main category: cs.CV

TL;DR: GeoRouter is a dynamic routing framework that adaptively assigns image geolocalization queries to either retrieval-based or generation-based paradigms based on which performs better for each specific image, achieving state-of-the-art results.

Details

Motivation: Current image geolocalization methods have complementary strengths: retrieval-based approaches excel at fine-grained instance matching while generation-based approaches using LVLMs offer robust semantic reasoning. Neither paradigm is universally superior, suggesting a hybrid approach could leverage their complementary strengths.

Method: Proposes GeoRouter, a dynamic routing framework using an LVLM backbone to analyze visual content and decide whether to use retrieval-based or generation-based geolocalization for each query. Introduces a distance-aware preference objective that converts performance differences between paradigms into continuous supervision signals, and creates GeoRouting dataset for training routing policies.

Result: Extensive experiments on IM2GPS3k and YFCC4k datasets demonstrate that GeoRouter significantly outperforms state-of-the-art baselines in worldwide image geolocalization.

Conclusion: Dynamic routing between complementary geolocalization paradigms (retrieval and generation) using LVLMs effectively leverages their respective strengths, achieving superior performance over single-paradigm approaches.

Abstract: Worldwide image geolocalization aims to predict precise GPS coordinates for images captured anywhere on Earth, which is challenging due to the large visual and geographic diversity. Recent methods mainly follow two paradigms: retrieval-based approaches that match queries against a reference database, and generation-based approaches that directly predict coordinates using Large Vision-Language Models (LVLMs). However, we observe distinct error profiles between them: retrieval excels at fine-grained instance matching, while generation offers robust semantic reasoning. This complementary heterogeneity suggests that no single paradigm is universally superior. To harness this potential, we propose GeoRouter, a dynamic routing framework that adaptively assigns each query to the optimal paradigm. GeoRouter leverages an LVLM backbone to analyze visual content and provide routing decisions. To optimize GeoRouter, we introduce a distance-aware preference objective that converts the distance gap between paradigms into a continuous supervision signal, explicitly reflecting relative performance differences. Furthermore, we construct GeoRouting, the first large-scale dataset tailored for training routing policies with independent paradigm predictions. Extensive experiments on IM2GPS3k and YFCC4k demonstrate that GeoRouter significantly outperforms state-of-the-art baselines.

[221] EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Falong Fan, Yi Xie, Arnis Lektauers, Bo Liu, Jerzy Rozenblit

Main category: cs.CV

TL;DR: EndoVGGT: A geometry-centric framework with Deformation-aware Graph Attention for robust 3D reconstruction of deformable soft tissues in surgery, addressing challenges like low-texture surfaces and occlusions through dynamic feature-space semantic graphs.

Details

Motivation: Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception, but existing methods struggle with low-texture surfaces, specular highlights, and instrument occlusions that fragment geometric continuity.

Method: Proposes EndoVGGT framework with Deformation-aware Graph Attention (DeGAT) module that dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions, enabling robust propagation of structural cues across occlusions.

Result: Extensive experiments on SCARED show significant improvements: 24.6% increase in PSNR and 9.1% increase in SSIM over prior state-of-the-art. Strong zero-shot cross-dataset generalization to unseen SCARED and EndoNeRF domains.

Conclusion: EndoVGGT demonstrates the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction, with DeGAT learning domain-agnostic geometric priors that enable robust performance across different surgical domains.

Abstract: Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.

[222] ViHOI: Human-Object Interaction Synthesis with Visual Priors

Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding

Main category: cs.CV

TL;DR: ViHOI: A diffusion-based framework that extracts interaction priors from 2D images using a Vision-Language Model to generate realistic 3D Human-Object Interactions, improving generalization to unseen objects and interactions.

Details

Motivation: Generating realistic 3D Human-Object Interactions is challenging because describing physical constraints with words alone is difficult. The paper proposes extracting rich interaction priors from easily accessible 2D images to overcome this limitation.

Method: Uses a large Vision-Language Model as a prior-extraction engine with layer-decoupled strategy for visual/textual priors. Implements Q-Former-based adapter to compress high-dimensional features into compact prior tokens for diffusion model training. Trains on motion-rendered images for semantic alignment, and uses text-to-image generated reference images during inference for generalization.

Result: Achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization to unseen objects and interaction categories.

Conclusion: The ViHOI framework successfully leverages 2D image priors through vision-language models to enhance 3D human-object interaction generation, addressing the limitation of text-only descriptions for physical constraints.

Abstract: Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM’s high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization.

[223] Causal Transfer in Medical Image Analysis

Mohammed M. Abdelsamea, Daniel Tweneboah Anyimadu, Tasneem Selim, Saif Alzubi, Lei Zhang, Ahmed Karam Eldaly, Xujiong Ye

Main category: cs.CV

TL;DR: Survey paper introducing Causal Transfer Learning (CTL) for medical image analysis, integrating causal reasoning with cross-domain representation learning to address domain shift and improve generalization across hospitals, scanners, and protocols.

Details

Motivation: Medical imaging models often fail when deployed across different clinical settings due to domain shift. Traditional transfer learning and domain adaptation methods rely on spurious correlations that break under changing conditions, limiting clinical reliability. Causal inference offers a principled approach to identify invariant mechanisms that remain stable across environments.

Method: The paper presents a systematic survey of Causal Transfer Learning (CTL) for medical image analysis. It frames domain shift as a causal problem and analyzes how structural causal models, invariant risk minimization, and counterfactual reasoning can be embedded within transfer learning pipelines. The authors organize approaches by task, shift type, and causal assumption, proposing a unified taxonomy connecting causal frameworks and transfer mechanisms.

Result: The survey summarizes datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. It demonstrates how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings.

Conclusion: CTL provides a promising paradigm for clinically reliable medical imaging AI by integrating causal reasoning with transfer learning. The paper outlines open challenges and research directions for developing robust, generalizable models that can handle domain shifts across clinical settings.

Abstract: Medical imaging models frequently fail when deployed across hospitals, scanners, populations, or imaging protocols due to domain shift, limiting their clinical reliability. While transfer learning and domain adaptation address such shifts statistically, they often rely on spurious correlations that break under changing conditions. On the other hand, causal inference provides a principled way to identify invariant mechanisms that remain stable across environments. This survey introduces and systematises Causal Transfer Learning (CTL) for medical image analysis. This paradigm integrates causal reasoning with cross-domain representation learning to enable robust and generalisable clinical AI. We frame domain shift as a causal problem and analyse how structural causal models, invariant risk minimisation, and counterfactual reasoning can be embedded within transfer learning pipelines. We studied spanning classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, and organised them by task, shift type, and causal assumption. A unified taxonomy is proposed that connects causal frameworks and transfer mechanisms. We further summarise datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. Finally, we discuss how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings, and outline open challenges and research directions for clinically reliable medical imaging AI.

[224] Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation

Ching-Lam Cheng, Bin Zhu, Shengfeng He

Main category: cs.CV

TL;DR: TSHaMo: A teacher-student diffusion framework for text-driven 3D hand motion generation without requiring 3D objects at inference time

Details

Motivation: Existing methods for generating 3D hand motion from text either focus on full-body motion (missing detailed hand gestures) or require explicit 3D object meshes, limiting practical applicability and generality.

Method: Proposes TSHaMo, a model-agnostic teacher-student diffusion framework where the student learns to synthesize motions from text alone, while the teacher leverages auxiliary signals (e.g., MANO parameters) to provide structured guidance during training via co-training strategy.

Result: Evaluated on GRAB and H2O datasets using two diffusion backbones, TSHaMo consistently improves motion quality and diversity. Ablations confirm robustness and flexibility in using diverse auxiliary inputs without requiring 3D objects at test time.

Conclusion: TSHaMo enables realistic 3D hand motion generation from natural language without needing 3D objects at inference, making it more practical for VR, robotics, and human-computer interaction applications.

Abstract: Generating realistic 3D hand motion from natural language is vital for VR, robotics, and human-computer interaction. Existing methods either focus on full-body motion, overlooking detailed hand gestures, or require explicit 3D object meshes, limiting generality. We propose TSHaMo, a model-agnostic teacher-student diffusion framework for text-driven hand motion generation. The student model learns to synthesize motions from text alone, while the teacher leverages auxiliary signals (e.g., MANO parameters) to provide structured guidance during training. A co-training strategy enables the student to benefit from the teacher’s intermediate predictions while remaining text-only at inference. Evaluated using two diffusion backbones on GRAB and H2O, TSHaMo consistently improves motion quality and diversity. Ablations confirm its robustness and flexibility in using diverse auxiliary inputs without requiring 3D objects at test time.

[225] The Gait Signature of Frailty: Transfer Learning based Deep Gait Models for Scalable Frailty Assessment

Laura McDaniel, Basudha Pal, Crystal Szczesny, Yuxiang Guo, Ryan Roemmich, Peter Abadir, Rama Chellappa

Main category: cs.CV

TL;DR: This paper introduces a silhouette-based frailty gait dataset and evaluates pretrained gait recognition models for frailty classification, showing that selective freezing of low-level representations and handling class imbalance improves performance for clinical frailty assessment.

Details

Motivation: Frailty assessment in aging medicine is subjective and difficult to scale. Gait is a sensitive marker of biological aging, but computer vision applications for gait-based frailty assessment have been limited by small datasets and lack of clinically representative benchmarks.

Method: Created a publicly available silhouette-based frailty gait dataset collected in clinically realistic settings. Evaluated how pretrained gait recognition models (convolutional and hybrid attention-based architectures) can be adapted for frailty classification under limited data conditions, studying selective freezing strategies and handling of class imbalance.

Result: Selectively freezing low-level gait representations while allowing higher-level features to adapt yields more stable and generalizable performance than full fine-tuning or rigid freezing. Conservative handling of class imbalance improves training stability, and combining complementary learning objectives enhances discrimination between clinically adjacent frailty states.

Conclusion: Gait-based representation learning provides a scalable, non-invasive, and interpretable framework for frailty assessment, supporting integration of modern biometric modeling approaches into aging research and clinical practice.

Abstract: Frailty is a condition in aging medicine characterized by diminished physiological reserve and increased vulnerability to stressors. However, frailty assessment remains subjective, heterogeneous, and difficult to scale in clinical practice. Gait is a sensitive marker of biological aging, capturing multisystem decline before overt disability. Yet the application of modern computer vision to gait-based frailty assessment has been limited by small, imbalanced datasets and a lack of clinically representative benchmarks. In this work, we introduce a publicly available silhouette-based frailty gait dataset collected in a clinically realistic setting, spanning the full frailty spectrum and including older adults who use walking aids. Using this dataset, we evaluate how pretrained gait recognition models can be adapted for frailty classification under limited data conditions. We study both convolutional and hybrid attention-based architectures and show that predictive performance depends primarily on how pretrained representations are transferred rather than architectural complexity alone. Across models, selectively freezing low-level gait representations while allowing higher-level features to adapt yields more stable and generalizable performance than either full fine-tuning or rigid freezing. Conservative handling of class imbalance further improves training stability, and combining complementary learning objectives enhances discrimination between clinically adjacent frailty states. Interpretability analyses reveal consistent model attention to lower-limb and pelvic regions, aligning with established biomechanical correlates of frailty. Together, these findings establish gait-based representation learning as a scalable, non-invasive, and interpretable framework for frailty assessment and support the integration of modern biometric modeling approaches into aging research and clinical practice.

[226] Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu, Yunqi Miao, Xueyi Zhang, Jiankang Deng, Guansong Pang

Main category: cs.CV

TL;DR: VLAForge: A deepfake video detection framework that leverages vision-language models’ cross-modal semantics and identity-aware prompting to enhance detection of subtle forgery artifacts.

Details

Motivation: Existing deepfake detection methods using VLMs focus only on visual features, overlooking the rich vision-language semantics that could provide stronger discriminative cues for detecting forgeries.

Method: Proposes VLAForge with two key components: 1) ForgePerceiver module that enhances visual perception to capture subtle forgery cues while preserving pretrained vision-language alignment, and 2) Identity-Aware VLA score that uses identity-informed text prompting to generate discriminative cross-modal semantics tailored to each identity.

Result: Substantially outperforms state-of-the-art methods on video deepfake detection benchmarks, including both classical face-swapping and recent full-face generation forgeries, at both frame and video levels.

Conclusion: Leveraging vision-language semantics with identity-aware prompting significantly improves deepfake detection performance, demonstrating the untapped potential of cross-modal understanding in forgery detection tasks.

Abstract: Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength – the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance model’s discriminability in deepfake detection. This work i) enhances the visual perception of VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue – Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by an identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at https://github.com/mala-lab/VLAForge.

[227] OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, Yue Wu, Liefeng Bo, Siliang Tang, Zhao Zhong

Main category: cs.CV

TL;DR: OmniWeaving is an open-source omni-level video generation model that integrates diverse multimodal inputs (text, images, video) with reasoning capabilities to create sophisticated videos, addressing the gap between proprietary and academic video generation systems.

Details

Motivation: There's a significant gap between proprietary video generation systems (like Seedance-2.0) and open-source alternatives. Academic models are fragmented, and existing unified video generation frameworks struggle to seamlessly integrate diverse tasks. The authors aim to create an open-source solution that can handle complex multimodal composition and reasoning.

Method: OmniWeaving leverages massive-scale pretraining on diverse compositional and reasoning-augmented scenarios. It learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions. The authors also introduce IntelligentVBench, a comprehensive benchmark for evaluating intelligent unified video generation.

Result: OmniWeaving achieves state-of-the-art performance among open-source unified video generation models. Extensive experiments demonstrate its effectiveness in handling diverse multimodal inputs and complex reasoning tasks for video creation.

Conclusion: OmniWeaving successfully bridges the gap between proprietary and open-source video generation systems by providing a unified framework with multimodal composition and reasoning capabilities. The model and benchmark will be publicly available to advance research in this area.

Abstract: While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The codes and model will be made publicly available soon. Project Page: https://omniweaving.github.io.

[228] Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories

Kawtar Zaher, Olivier Buisson, Alexis Joly

Main category: cs.CV

TL;DR: PF-MA: Active learning method for imbalanced fine-grained visual retrieval that prioritizes likely positives near decision boundaries to rapidly discover rare visual categories with minimal supervision.

Details

Motivation: Real-world fine-grained visual retrieval often requires discovering rare concepts from large unlabeled collections with minimal supervision, especially in biodiversity monitoring and ecological studies where targets represent tiny fractions of data. Conventional active learning assumes symmetric class priors and large annotation budgets, limiting effectiveness in imbalanced, low-budget, low-latency settings.

Method: Introduces Positive-First Most Ambiguous (PF-MA), an active learning criterion that explicitly addresses class imbalance asymmetry by prioritizing near-boundary samples while favoring likely positives. Also proposes a class coverage metric to measure how well selected positives span the visual variability of the target class.

Result: Experiments on long-tailed datasets including fine-grained botanical data demonstrate that PF-MA consistently outperforms strong baselines in both coverage and classifier performance across varying class sizes and descriptors.

Conclusion: Aligning active learning with asymmetric and user-centric objectives of interactive fine-grained retrieval enables simple yet powerful solutions for retrieving rare and visually subtle categories in realistic human-in-the-loop settings.

Abstract: Real-world fine-grained visual retrieval often requires discovering a rare concept from large unlabeled collections with minimal supervision. This is especially critical in biodiversity monitoring, ecological studies, and long-tailed visual domains, where the target may represent only a tiny fraction of the data, creating highly imbalanced binary problems. Interactive retrieval with relevance feedback offers a practical solution: starting from a small query, the system selects candidates for binary user annotation and iteratively refines a lightweight classifier. While Active Learning (AL) is commonly used to guide selection, conventional AL assumes symmetric class priors and large annotation budgets, limiting effectiveness in imbalanced, low-budget, low-latency settings. We introduce Positive-First Most Ambiguous (PF-MA), a simple yet effective AL criterion that explicitly addresses the class imbalance asymmetry: it prioritizes near-boundary samples while favoring likely positives, enabling rapid discovery of subtle visual categories while maintaining informativeness. Unlike standard methods that oversample negatives, PF-MA consistently returns small batches with a high proportion of relevant samples, improving early retrieval and user satisfaction. To capture retrieval diversity, we also propose a class coverage metric that measures how well selected positives span the visual variability of the target class. Experiments on long-tailed datasets, including fine-grained botanical data, demonstrate that PF-MA consistently outperforms strong baselines in both coverage and classifier performance, across varying class sizes and descriptors. Our results highlight that aligning AL with the asymmetric and user-centric objectives of interactive fine-grained retrieval enables simple yet powerful solutions for retrieving rare and visually subtle categories in realistic human-in-the-loop settings.

[229] Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

Siqi Liu, Xinyang Li, Bochao Zou, Junbao Zhuo, Huimin Ma, Jiansheng Chen

Main category: cs.CV

TL;DR: VisionToM: A vision-oriented intervention framework that enhances multimodal language models’ Theory of Mind abilities by aligning visual representations with semantic targets through attention steering.

Details

Motivation: Current Theory of Mind evaluations focus mainly on text, neglecting visual scenarios despite real-world human-AI interaction requiring multimodal understanding. Existing methods treat models as black boxes without examining internal attention mechanisms, and LLM hallucinations' impact on multimodal tasks remains underexplored.

Method: VisionToM computes intervention vectors that align visual representations with correct semantic targets, steering the model’s attention through different layers of visual features. This reduces reliance on spurious linguistic priors and improves multimodal reasoning.

Result: Experiments on the EgoToM benchmark (egocentric real-world video dataset for ToM with three multiple-choice QA settings) show substantial improvements in MLLMs’ ToM abilities. Additional open-ended generation tasks demonstrate that VisionToM enables MLLMs to produce more accurate free-form explanations of agents’ mental states.

Conclusion: VisionToM effectively enhances multimodal language models’ Theory of Mind capabilities by addressing visual understanding gaps and improving attention mechanisms, pushing machine-human collaboration toward greater alignment through better mental state inference.

Abstract: As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model’s attention through different layers of visual features. This guidance reduces the model’s reliance on spurious linguistic priors, leading to more reliable multimodal language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark-an egocentric, real-world video dataset for ToM with three multiple-choice QA settings-demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents’ mental states, pushing machine-human collaboration toward greater alignment.

[230] Toward Physically Consistent Driving Video World Models under Challenging Trajectories

Jiawei Zhou, Zhenxin Zhu, Lingyi Du, Linye Lyu, Lijun Zhou, Zhanqian Wu, Hongcheng Luo, Zhuotao Tian, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Yu Li

Main category: cs.CV

TL;DR: PhyGenesis is a world model for autonomous driving simulation that generates physically consistent driving videos even for challenging or counterfactual trajectories, addressing limitations of existing models trained only on safe real-world data.

Details

Motivation: Existing video generation models for autonomous driving simulation are trained primarily on real-world driving datasets containing mostly safe scenarios, causing them to fail when generating videos for challenging or counterfactual trajectories, producing physically inconsistent results.

Method: Two-component framework: (1) physical condition generator that transforms invalid trajectories into physically plausible conditions, and (2) physics-enhanced video generator for high-fidelity multi-view driving videos. Trained on large-scale heterogeneous dataset combining real-world videos with diverse challenging scenarios from CARLA simulator.

Result: PhyGenesis consistently outperforms state-of-the-art methods, especially on challenging trajectories, demonstrating strong physical consistency and visual fidelity in generated driving videos.

Conclusion: The proposed challenging-trajectory learning strategy enables trajectory correction and promotes physically consistent video generation, making PhyGenesis a robust world model for autonomous driving simulation.

Abstract: Video generation models have shown strong potential as world models for autonomous driving simulation. However, existing approaches are primarily trained on real-world driving datasets, which mostly contain natural and safe driving scenarios. As a result, current models often fail when conditioned on challenging or counterfactual trajectories-such as imperfect trajectories generated by simulators or planning systems-producing videos with severe physical inconsistencies and artifacts. To address this limitation, we propose PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Our framework consists of two key components: (1) a physical condition generator that transforms potentially invalid trajectory inputs into physically plausible conditions, and (2) a physics-enhanced video generator that produces high-fidelity multi-view driving videos under these conditions. To effectively train these components, we construct a large-scale, physics-rich heterogeneous dataset. Specifically, in addition to real-world driving videos, we generate diverse challenging driving scenarios using the CARLA simulator, from which we derive supervision signals that guide the model to learn physically grounded dynamics under extreme conditions. This challenging-trajectory learning strategy enables trajectory correction and promotes physically consistent video generation. Extensive experiments demonstrate that PhyGenesis consistently outperforms state-of-the-art methods, especially on challenging trajectories. Our project page is available at: https://wm-research.github.io/PhyGenesis/.

Dipam Goswami, Simone Magistri, Gido M. van de Ven, Bartłomiej Twardowski, Andrew D. Bagdanov, Tinne Tuytelaars, Joost van de Weijer

Main category: cs.CV

TL;DR: Improving few-shot image classification by mixing text and image prototypes, with text-aligned image subspace projection and combined classifier approach.

Details

Motivation: Recent works show that both text and image embeddings from training sets are important for CLIP-based few-shot classification. The paper investigates direct mixing of image and text prototypes and analyzes this from bias-variance perspective to improve classification performance.

Method: 1) Analyze mixing prototypes as shrinkage estimator; 2) Project image prototypes onto principal directions of semantic text embedding space to obtain text-aligned semantic image subspace; 3) For datasets with poor cross-modal alignment, model anisotropy using class covariances; 4) Combine text-aligned mixed prototype classifier with image-specific LDA classifier.

Result: Mixed prototypes improve classification performance but image prototypes add noise. Text-aligned image prototypes further improve classification when mixed with text embeddings. Combined text-aligned mixed prototype and image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.

Conclusion: The proposed approach of text-aligned image subspace projection and combined classifier framework effectively leverages both text and image information for improved few-shot classification, addressing limitations of direct prototype mixing and poor cross-modal alignment scenarios.

Abstract: Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.

[232] The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series

Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mirela Tulbure, Patrick Hostert, Stefan Erasmi

Main category: cs.CV

TL;DR: Vision Transformer model using Sentinel-2 time series data can discriminate between organic and conventional farming systems, with performance varying by crop type and improving with spatial context but not significantly with multitask learning.

Details

Motivation: Organic farming is important for sustainable agriculture, but comprehensive spatial information is needed to understand its development and impact. Remote sensing could provide this information by distinguishing organic from conventional farming systems.

Method: Used Vision Transformer (TSViT) architecture with Sentinel-2 time series data for classification. Extended model for multitask learning of both farming system and crop type. Varied patch sizes to test spatial context influence on classification accuracy.

Result: Discrimination between organic and conventional farming is feasible but varies by crop type. Winter rye, wheat, and oat achieved F1 scores ≥0.8, while grassland, orchards, grapevines, and hops scored ≤0.4. Joint learning provided limited benefits, but spatial context improved both tasks.

Conclusion: Agricultural farming system classification is possible in diverse regions using multispectral remote sensing, with spatial context being more important than multitask learning for improving performance.

Abstract: Organic farming is a key element in achieving more sustainable agriculture. For a better understanding of the development and impact of organic farming, comprehensive, spatially explicit information is needed. This study presents an approach for the discrimination of organic and conventional farming systems using intra-annual Sentinel-2 time series. In addition, it examines two factors influencing this discrimination: the joint learning of crop type information in a concurrent task and the role of spatial context. A Vision Transformer model based on the Temporo-Spatial Vision Transformer (TSViT) architecture was used to construct a classification model for the two farming systems. The model was extended for simultaneous learning of the crop type, creating a multitask learning setting. By varying the patch size presented to the model, we tested the influence of spatial context on the classification accuracy of both tasks. We show that discrimination between organic and conventional farming systems using multispectral remote sensing data is feasible. However, classification performance varies substantially across crop types. For several crops, such as winter rye, winter wheat, and winter oat, F1 scores of 0.8 or higher can be achieved. In contrast, other agricultural land use classes, such as permanent grassland, orchards, grapevines, and hops, cannot be reliably distinguished, with F1 scores for the organic management class of 0.4 or lower. Joint learning of farming system and crop type provides only limited additional benefits over single-task learning. In contrast, incorporating wider spatial context improves the performance of both farming system and crop type classification. Overall, we demonstrate that a classification of agricultural farming systems is possible in a diverse agricultural region using multispectral remote sensing data.

[233] POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan

Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kumar Das, Monorama Swain, Yufang Hou, Elisabeth Andre, Khalid Mahmood Malik, Markus Schedl, Shah Nawaz

Main category: cs.CV

TL;DR: The POLY-SIM Grand Challenge 2026 focuses on developing robust multimodal speaker identification systems that handle missing modalities (audio/visual) and cross-lingual scenarios, addressing real-world limitations of current systems.

Details

Motivation: Real-world multimodal speaker identification faces challenges with missing visual data (occlusions, camera failures, privacy) and multilingual speakers, which current systems assuming complete modalities cannot handle effectively.

Method: The paper presents a standardized benchmark challenge with dataset, task formulation, evaluation protocol, and baseline model for multimodal speaker identification under missing-modality and cross-lingual conditions.

Result: The challenge design provides a framework for evaluating robust multimodal speaker identification methods, though specific performance results are not reported as this is a challenge announcement paper.

Conclusion: The POLY-SIM Grand Challenge 2026 aims to advance research in practical multimodal speaker identification by addressing real-world limitations through standardized evaluation of robustness to missing modalities and linguistic variability.

Abstract: Multimodal speaker identification systems typically assume the availability of complete and homogeneous audio-visual modalities during both training and testing. However, in real-world applications, such assumptions often do not hold. Visual information may be missing due to occlusions, camera failures, or privacy constraints, while multilingual speakers introduce additional complexity due to linguistic variability across languages. These challenges significantly affect the robustness and generalization of multimodal speaker identification systems. The POLY-SIM Grand Challenge 2026 aims to advance research in multimodal speaker identification under missing-modality and cross-lingual conditions. Specifically, the Grand Challenge encourages the development of robust methods that can effectively leverage incomplete multimodal inputs while maintaining strong performance across different languages. This report presents the design and organization of the POLY-SIM Grand Challenge 2026, including the dataset, task formulation, evaluation protocol, and baseline model. By providing a standardized benchmark and evaluation framework, the challenge aims to foster progress toward more robust and practical multimodal speaker identification systems.

[234] Towards Training-Free Scene Text Editing

Yubo Li, Xugong Qin, Peng Zhang, Hailun Lin, Gangyan Zeng, Kexin Zhang

Main category: cs.CV

TL;DR: TextFlow is a training-free scene text editing framework that uses Attention Boost and Flow Manifold Steering to modify text in natural images while preserving visual realism and semantic consistency without requiring task-specific training.

Details

Motivation: Existing scene text editing methods often require task-specific training or paired data, which limits their scalability and adaptability. There's a need for more flexible, training-free approaches that can handle diverse scenes and languages.

Method: Proposes TextFlow framework with two complementary modules: Flow Manifold Steering (FMS) preserves structural and style consistency by modeling visual flow of characters and background regions, and Attention Boost (AttnBoost) enhances textual content rendering through attention-based guidance. The approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner.

Result: Extensive experiments show the framework achieves visual quality and text accuracy comparable to or superior to training-based counterparts, generalizing well across diverse scenes and languages.

Conclusion: TextFlow advances scene text editing toward a more efficient, generalizable, and training-free paradigm, making it more accessible and scalable for practical applications.

Abstract: Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm. Code is available at https://github.com/lyb18758/TextFlow

[235] Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, Pengxuan Yang, Yihang Dong, Ce Hao, Xiaoqing Ye, Junyu han, Yifeng Pan, Dongbin Zhao

Main category: cs.CV

TL;DR: Latent-WAM is an efficient end-to-end autonomous driving framework that uses spatially-aware and dynamics-informed latent world representations for trajectory planning, achieving state-of-the-art results with a compact 104M-parameter model.

Details

Motivation: Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, leading to sub-optimal planning under constrained data and compute budgets.

Method: Two core modules: 1) Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and 2) Dynamic Latent World Model (DLWM) that uses a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations.

Result: Achieves new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.

Conclusion: Latent-WAM demonstrates that efficient world modeling with spatial awareness and temporal dynamics can achieve strong autonomous driving performance with minimal computational resources.

Abstract: We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.

[236] TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

Jiaying Zhou, Zhihao Zhan, Ruifeng Zhai, Qinhan Lyu, Hao Liu, Keze Wang, Liang Lin, Guangrun Wang

Main category: cs.CV

TL;DR: TAG (Target-Agnostic Guidance) is an inference-time guidance mechanism that improves VLA policy robustness in cluttered scenes by contrasting predictions with original and object-erased observations to reduce distractor bias.

Details

Motivation: VLA policies often fail in cluttered scenes due to instance-level grounding failures, where plausible grasp trajectories land on wrong objects or slightly off-target, not from infeasible motions. Current methods struggle with distractor- and appearance-induced bias.

Method: TAG uses classifier-free guidance (CFG) inspired approach: contrasts policy predictions under original observation and object-erased observation, using their difference as residual steering signal to strengthen object evidence influence. No architecture changes needed, minimal training/inference modifications.

Result: Evaluated on LIBERO, LIBERO-Plus, and VLABench benchmarks, TAG consistently improves robustness under clutter, reduces near-miss and wrong-object executions across standard manipulation tasks.

Conclusion: TAG effectively addresses instance-level grounding failures in VLA policies through simple inference-time guidance, enhancing reliability in cluttered environments without requiring architectural changes.

Abstract: Vision–Language–Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.

[237] GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics

Yan Zhang, Simiao Ren, Ankit Raj, En Wei, Dennis Ng, Alex Shen, Jiayu Xue, Yuxin Zhang, Evelyn Marotta

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2603.11442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[238] WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation

Lianghui Zhu, Yingyue Li, Jiemin Fang, Yan Liu, Hao Xin, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2304.01184: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2304.01184&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[239] Set2Seq Transformer: Temporal and Position-Aware Set Representations for Sequential Multiple-Instance Learning

Athanasios Efthymiou, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2408.03404: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.03404&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[240] Natural Adversaries: Fuzzing Autonomous Vehicles with Realistic Roadside Object Placements

Yang Sun, Haoyu Wang, Christopher M. Poskitt, Jun Sun

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2409.10562: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.10562&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[241] Morph: A Motion-free Physics Optimization Framework for Human Motion Generation

Zhuo Li, Mingshuang Luo, Ruibing Hou, Xin Zhao, Hao Liu, Hong Chang, Zimo Liu, Chen Li

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2411.14951: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.14951&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[242] SASNet: Spatially-Adaptive Sinusoidal Networks for INRs

Haoan Feng, Diana Aldana, Tiago Novello, Leila De Floriani

Main category: cs.CV

TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv

Details

Motivation: Unable to determine motivation as abstract could not be retrieved

Method: Unable to determine method as abstract could not be retrieved

Result: Unable to determine results as abstract could not be retrieved

Conclusion: Unable to draw conclusions as abstract could not be retrieved

Abstract: Failed to fetch summary for 2503.09750: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.09750&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[243] CA-LoRA: Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation

Minho Park, Sunghyun Park, Jungsoo Lee, Hyojin Park, Kyuwoong Hwang, Fatih Porikli, Jaegul Choo, Sungha Choi

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2503.22172: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.22172&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[244] RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration

Sudarshan Rajagopalan, Kartik Narayan, Vishal M. Patel

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2505.18047: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.18047&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[245] TopoSculpt: Betti-Steered Topological Sculpting of 3D Fine-grained Tubular Shapes

Minghui Zhang, Yaoyu Liu, Junyang Wu, Xin You, Hanxiao Zhang, Junjun He, Yun Gu

Main category: cs.CV

TL;DR: Unable to analyze paper 2509.03938 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract retrieval failed due to rate limiting (HTTP 429)

Method: No method information available due to failed abstract retrieval

Result: No results information available due to failed abstract retrieval

Conclusion: Cannot draw conclusions about paper content due to technical limitations in accessing the abstract

Abstract: Failed to fetch summary for 2509.03938: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03938&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[246] VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI

Daiqi Liu, Johannes Enk, Maureen Stone, Fangxu Xing, Tomás Arias-Vergara, Jerry L. Prince, Jana Hutter, Jonghye Woo, Andreas Maier, Paula Andrea Pérez-Toro

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2509.13767: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.13767&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[247] FOCUS: Optimal Control for Multi-Entity World Modeling in Text-to-Image Generation

Eric Tillmann Bill, Enis Simsar, Thomas Hofmann

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2510.02315 exists but cannot be analyzed without access to its abstract or content.

Details

Motivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot determine conclusion without access to paper content.

Abstract: Failed to fetch summary for 2510.02315: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02315&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[248] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error fetching paper content

Method: Unable to determine method due to technical error fetching paper content

Result: Unable to determine results due to technical error fetching paper content

Conclusion: Unable to determine conclusion due to technical error fetching paper content

Abstract: Failed to fetch summary for 2510.02898: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02898&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[249] Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, Lei Yang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation as paper content could not be retrieved

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to determine conclusion as paper content could not be retrieved

Abstract: Failed to fetch summary for 2510.27684: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.27684&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[250] KINESIS: Motion Imitation for Human Musculoskeletal Locomotion

Merkourios Simos, Alberto Silvio Chiappa, Alexander Mathis

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2503.14637: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.14637&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[251] Distribution Matching Distillation Meets Reinforcement Learning

Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Changsheng Lu, Zhen Li, Bo Zhang, Mengmeng Wang, Steven Hoi, Peng Gao, Harry Yang

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2511.13649: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.13649&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[252] DepthFocus: Controllable Depth Estimation for See-Through Scenes

Junhong Min, Jimin Kim, Minwook Kim, Cheol-Hui Min, Youngpil Jeon, Minyong Choi

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.16993: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16993&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[253] Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation

Yara Bahram, Mélodie Desbos, Mohammadhadi Shateri, Eric Granger

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.18281: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18281&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[254] MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer

Zenghao Chai, Chen Tang, Yongkang Wong, Xulei Yang, Mohan Kankanhalli

Main category: cs.CV

TL;DR: Paper 2511.18370: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as abstract is unavailable due to arXiv API rate limiting

Method: Method unknown - paper content inaccessible

Result: No results available - abstract retrieval failed

Conclusion: Unable to analyze paper due to technical limitations in accessing the content

Abstract: Failed to fetch summary for 2511.18370: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18370&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[255] GaINeR: Geometry-Aware Implicit Network Representation

Weronika Jakubowska, Mikołaj Zieliński, Rafał Tobiasz, Krzysztof Byrski, Maciej Zięba, Dominik Belter, Przemysław Spurek

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2511.20924: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20924&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Suhyeon Lee, Jong Chul Ye

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2510.00430: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00430&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[257] Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration

Mengyu Yang, Yanming Yang, Chenyi Xu, Chenxi Song, Yufan Zuo, Tong Zhao, Ruibo Li, Chi Zhang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.22533: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22533&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[258] FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

Zipeng Wang, Dan Xu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2512.01540: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01540&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[259] From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He, Shi Li, Nassir Navab, Xiaoxiao Sun, Nicolas Padoy, Serena Yeung-Levy

Main category: cs.CV

TL;DR: Unable to analyze paper 2512.02566 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract retrieval failed due to rate limiting (HTTP 429)

Method: Cannot determine method without access to paper abstract

Result: No results available due to API rate limiting error

Conclusion: Paper analysis impossible without access to abstract content

Abstract: Failed to fetch summary for 2512.02566: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02566&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[260] Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

Jialuo Li, Bin Li, Jiahao Li, Yan Lu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.04000: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04000&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[261] AGORA: Adversarial Generation Of Real-time Animatable 3D Gaussian Head Avatars

Ramazan Fazylov, Sergey Zagoruyko, Aleksandr Parkin, Stamatis Lefkimmiatis, Ivan Laptev

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2512.06438: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.06438&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[262] HybridSplat: Fast Reflection-baked Gaussian Tracing using Hybrid Splatting

Chang Liu, Hongliang Yuan, Lianghao Zhang, Sichao Wang, Jianwei Guo, Shi-Sheng Huang

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.08334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[263] Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to draw conclusions due to fetch failure

Abstract: Failed to fetch summary for 2512.10548: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10548&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[264] Anchored Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval failure

Method: Unable to determine method due to retrieval failure

Result: Unable to determine results due to retrieval failure

Conclusion: Unable to determine conclusion due to retrieval failure

Abstract: Failed to fetch summary for 2512.16371: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16371&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[265] Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach

Masashi Hatano, Saptarshi Sinha, Jacob Chalk, Wei-Hong Li, Hideo Saito, Dima Damen

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed data retrieval

Method: Unable to determine method due to failed data retrieval

Result: Unable to determine results due to failed data retrieval

Conclusion: Unable to determine conclusion due to failed data retrieval

Abstract: Failed to fetch summary for 2512.16456: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16456&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[266] E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Tweedie Discrete Diffusion

Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu, Jusheng Zhang, Weizheng Li, Ziliang Chen, Tianshui Chen, Ruifeng Zhai, Keze Wang, Liang Lin, Guangrun Wang

Main category: cs.CV

TL;DR: Paper 2511.21542: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing abstract

Method: Cannot determine method due to missing abstract

Result: Cannot determine results due to missing abstract

Conclusion: Cannot determine conclusion due to missing abstract

Abstract: Failed to fetch summary for 2511.21542: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21542&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[267] Pro-Pose: Unpaired Full-Body Portrait Synthesis via Canonical UV Maps

Sandeep Mishra, Yasamin Jafarian, Andreas Lugmayr, Yingwei Li, Varsha Ramakrishnan, Srivatsan Varadharajan, Alan C. Bovik, Ira Kemelmacher-Shlizerman

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2512.17143: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.17143&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[268] SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping

Thomas Boudras, Martin Schwartz, Rasmus Fensholt, Martin Brandt, Ibrahim Fayad, Jean-Pierre Wigneron, Gabriel Belouze, Fajwel Fogel, Philippe Ciais

Main category: cs.CV

TL;DR: Unable to analyze paper 2512.18128 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.18128: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.18128&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[269] NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization

Yifei Li, Haoyuan He, Yu Zheng, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.23374: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23374&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[270] Physics-driven human-like working memory outperforms digital networks in dynamic vision

Jingli Liu, Huannan Zheng, Bohao Zou, Kezhou Yang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2512.15829: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15829&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[271] Understanding Pure Textual Reasoning for Blind Image Quality Assessment

Yuan Li, Shin’ya Nishida

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2601.02441: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02441&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[272] Latent Diffusion Inversion Requires Understanding the Latent Space

Mingxing Rao, Bowen Qu, Daniel Moyer

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.20592: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20592&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[273] DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Hengyu Shen, Tiancheng Gu, Bin Qin, Lan Wu, Yuling Wu, Shuo Tan, Zelong Sun, Jun Wang, Nan Wu, Xiang An, Weidong Cai, Ziyong Feng, Kaicheng Yang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2601.10305: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10305&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[274] Deep Feature Deformation Weights

Richard Liu, Itai Lang, Rana Hanocka

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2601.12527: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.12527&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[275] Modeling Image-Caption Rating from Comparative Judgments

Kezia Minni, Qiang Zhang, Monoshiz Mahbub Khan, Zhe Yu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to data fetch failure

Method: Unable to determine method due to data fetch failure

Result: Unable to determine results due to data fetch failure

Conclusion: Unable to draw conclusions due to data fetch failure

Abstract: Failed to fetch summary for 2602.00381: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00381&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[276] SPARE: Self-distillation for PARameter-Efficient Removal

Natnael Mola, Leonardo S. B. Pereira, Carolina R. Kelsch, Luis H. Arribas, Juan C. S. M. Avedillo

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.07058: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07058&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[277] Continual GUI Agents

Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, Tao Feng

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2601.20732: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20732&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[278] Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

Main category: cs.CV

TL;DR: Paper 2602.06037: Could not fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to HTTP 429 error preventing access to paper content

Method: Unable to determine method due to HTTP 429 error preventing access to paper content

Result: Unable to determine results due to HTTP 429 error preventing access to paper content

Conclusion: Unable to determine conclusion due to HTTP 429 error preventing access to paper content

Abstract: Failed to fetch summary for 2602.06037: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06037&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[279] ShapBPT: Image Feature Attributions Using Data-Aware Binary Partition Trees

Muhammad Rashid, Elvio G. Amparore, Enrico Ferrari, Damiano Verda

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in fetching paper content

Method: Unable to determine method due to technical error in fetching paper content

Result: Unable to determine results due to technical error in fetching paper content

Conclusion: Unable to determine conclusion due to technical error in fetching paper content

Abstract: Failed to fetch summary for 2602.07047: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07047&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[280] OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents

Akashah Shabbir, Muhammad Umer Sheikh, Muhammad Akhtar Munir, Hiyam Debary, Mustansar Fiaz, Muhammad Zaigham Zaheer, Paolo Fraccaro, Fahad Shahbaz Khan, Muhammad Haris Khan, Xiao Xiang Zhu, Salman Khan

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting).

Details

Motivation: Unable to determine motivation due to failed paper retrieval.

Method: Unable to determine method due to failed paper retrieval.

Result: Unable to determine results due to failed paper retrieval.

Conclusion: Unable to determine conclusion due to failed paper retrieval.

Abstract: Failed to fetch summary for 2602.17665: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17665&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[281] Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei, Jiajun Zhang, Guangting Wang, Jing Lyu, Chun Yuan, Fengyun Rao

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.18996: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18996&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[282] ChordEdit: One-Step Low-Energy Transport for Image Editing

Liangsi Lu, Xuhang Chen, Minzhe Guo, Shichu Li, Jingchao Wang, Yang Shi

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.19083: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19083&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[283] Lipschitz-Based Robustness Certification Under Floating-Point Execution

Toby Murray

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.13334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[284] ExpPortrait: Expressive Portrait Generation via Personalized Representation

Junyi Wang, Yudong Guo, Boyang Guo, Shengming Yang, Juyong Zhang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.19900: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19900&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[285] CADC: Content Adaptive Diffusion-Based Generative Image Compression

Xihua Sheng, Lingyu Zhu, Tianyu Zhang, Dong Liu, Shiqi Wang, Jing Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.21591: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21591&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[286] Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration

Chen Wu, Ling Wang, Zhuoran Zheng, Yuning Cui, Zhixiong Yang, Xiangyu Chen, Yue Zhang, Weidong Jiang, Jingyuan Xia

Main category: cs.CV

TL;DR: Unable to analyze paper 2602.21917 due to HTTP 429 error when fetching summary from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2602.21917: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21917&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[287] Geometry-Guided Camera Motion Understanding in VideoLLMs

Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.13119: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13119&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[288] Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences

Julian Kaltheuner, Hannah Dröge, Markus Plack, Patrick Stotko, Reinhard Klein

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2602.22212: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22212&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[289] Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.13904: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13904&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Xiwen Wang, Shichao Zhang, Hailun Zhang, Ruowei Wang, Mao Li, Chenyu Zhou, Qijun Zhao, Ji-Zhe Zhou

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when trying to access arXiv API for paper ID 2603.01601

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about the paper due to unavailability of content

Abstract: Failed to fetch summary for 2603.01601: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01601&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[291] Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework

Kaihua Tang, Jiaxin Qi, Jinli Ou, Yuhua Zheng, Jianqiang Huang

Main category: cs.CV

TL;DR: Paper 2603.07659: Could not fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing abstract

Method: Unable to determine method due to missing abstract

Result: Unable to determine results due to missing abstract

Conclusion: Unable to draw conclusions due to missing abstract

Abstract: Failed to fetch summary for 2603.07659: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07659&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[292] Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

Mohamed Youssef, Mayar Elfares, Anna-Maria Meer, Matteo Bortoletto, Andreas Bulling

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2603.18719: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18719&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Mingzhe Tao, Ruiping Liu, Junwei Zheng, Yufan Chen, Kedi Ying, M. Saquib Sarfraz, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: Unable to analyze paper 2603.11380 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting error

Method: Cannot determine method as abstract is unavailable due to rate limiting error

Result: Cannot determine results as abstract is unavailable due to rate limiting error

Conclusion: Cannot draw conclusions about paper content due to technical error in fetching abstract

Abstract: Failed to fetch summary for 2603.11380: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11380&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[294] OSMDA: OpenStreetMap-based Domain Adaptation for Remote Sensing VLMs

Stefan Maria Ailuro, Mario Markov, Mohammad Mahdi, Delyan Boychev, Luc Van Gool, Danda Pani Paudel

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.11804 appears to be from March 2023, but no content is available for analysis.

Details

Motivation: Unable to determine motivation due to lack of access to paper content. The HTTP 429 error indicates rate limiting from arXiv API, preventing retrieval of the abstract.

Method: No method information available. The paper content could not be fetched due to technical limitations with the arXiv API.

Result: No results available. The analysis cannot proceed without access to the paper’s abstract or content.

Conclusion: Unable to provide analysis or relevance assessment due to technical limitations preventing access to the paper content.

Abstract: Failed to fetch summary for 2603.11804: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11804&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[295] VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos

Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi, Siwei Zhang, Xiaoyang Hu, Zitian Wang, Linjiang Huang, Si Liu

Main category: cs.CV

TL;DR: Paper 2603.12703: Unable to fetch abstract due to HTTP 429 error (rate limiting).

Details

Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting from arXiv API.

Method: Cannot determine method as abstract is unavailable due to rate limiting from arXiv API.

Result: Cannot determine results as abstract is unavailable due to rate limiting from arXiv API.

Conclusion: Cannot determine conclusion as abstract is unavailable due to rate limiting from arXiv API.

Abstract: Failed to fetch summary for 2603.12703: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12703&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[296] Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation

Yingjie Chen, Shilun Lin, Cai Xing, Binxin Yang, Long Zhou, Qixin Yan, Wenjing Wang, Dingming Liu, Hao Liu, Chen Li, Jing Lyu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2603.17889: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17889&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[297] DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment

Wuqi Wang, Haochen Yang, Baolu Li, Jiaqi Sun, Xiangmo Zhao, Zhigang Xu, Qing Guo, Haigen Min, Tianyun Zhang, Hongkai Yu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2603.18067: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18067&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[298] EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

Longfei Liu, Yongjie Hou, Yang Li, Qirui Wang, Youyang Sha, Yongjun Yu, Yinzhi Wang, Peizhe Ru, Xuanlong Yu, Xi Shen

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.18739: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18739&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[299] Language Models Can Explain Visual Features via Steering

Javier Ferrando, Enrique Lopez-Cuena, Pablo Agustin Martin-Torres, Daniel Hinjos, Anna Arias-Duart, Dario Garcia-Gasulla

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2603.22593 appears to be from March 2023, but no content available for analysis.

Details

Motivation: Cannot determine motivation due to content unavailability.

Method: Cannot determine method due to content unavailability.

Result: Cannot determine results due to content unavailability.

Conclusion: Cannot draw conclusions due to content unavailability.

Abstract: Failed to fetch summary for 2603.22593: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22593&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[300] Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach

Shiqi Gao, Zitong Xu, Kang Fu, Huiyu Duan, Xiongkuo Min, Jia wang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2603.19775: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19775&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[301] Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features

Zheng Gao, Debin Meng, Yunqi Miao, Zhensong Zhang, Songcen Xu, Ioannis Patras, Jifei Song

Main category: cs.CV

TL;DR: The paper with ID 2603.20012 could not be analyzed due to HTTP 429 error (rate limiting) when attempting to fetch the abstract from arXiv API.

Details

Motivation: Unable to determine motivation as the paper content could not be retrieved due to API rate limiting.

Method: Unable to determine method as the paper content could not be retrieved due to API rate limiting.

Result: Unable to determine results as the paper content could not be retrieved due to API rate limiting.

Conclusion: Unable to draw conclusions about the paper due to technical limitations in accessing the content.

Abstract: Failed to fetch summary for 2603.20012: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20012&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[302] F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting

Injae Kim, Chaehyeon Kim, Minseong Bae, Minseok Joo, Hyunwoo J. Kim

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.21304: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21304&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[303] Knowledge Priors for Identity-Disentangled Open-Set Privacy-Preserving Video FER

Feng Xu, Xun Li, Lars Petersson, Yulei Sui, David Ahmedt-Aristizabal, Dadong Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error fetching paper content

Method: Unable to determine method due to technical error fetching paper content

Result: Unable to determine results due to technical error fetching paper content

Conclusion: Unable to draw conclusions due to technical error fetching paper content

Abstract: Failed to fetch summary for 2603.21387: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21387&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[304] Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning

Xingyu Zhu, Liang Yi, Shuo Wang, Wenbo Zhu, Yonglinag Wu, Beier Zhu, Hanwang Zhang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2603.22070: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22070&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[305] Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

Xingyu Zhu, Beier Zhu, Shuo Wang, Junfeng Fang, Kesen Zhao, Hanwang Zhang, Xiangnan He

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.22094: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22094&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[306] PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation

Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Yu Zhang, Li Yi, Hao Zhao

Main category: cs.CV

TL;DR: Paper 2603.22193 summary unavailable due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to abstract fetch failure

Method: Unable to determine method due to abstract fetch failure

Result: Unable to determine results due to abstract fetch failure

Conclusion: Unable to draw conclusions due to abstract fetch failure

Abstract: Failed to fetch summary for 2603.22193: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22193&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[307] Group Editing : Edit Multiple Images in One Go

Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, Hongyu Liu, Qifeng Chen

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.22883: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22883&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[308] SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes

Zhicheng Qiu, Jiarui Meng, Tong-an Luo, Yican Huang, Xuan Feng, Xuanfu Li, ZHan Xu

Main category: cs.CV

TL;DR: Unable to analyze paper 2603.22893 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2603.22893: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22893&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[309] Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation

Zhe Zhang, Jing Li, Wanli Xue, Xu Cheng, Jianhua Zhang, Qinghua Hu, Shengyong Chen

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in fetching paper content

Method: Unable to determine method due to technical error in fetching paper content

Result: Unable to determine results due to technical error in fetching paper content

Conclusion: Unable to determine conclusion due to technical error in fetching paper content

Abstract: Failed to fetch summary for 2603.22908: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22908&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[310] WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion

Manuel-Andreas Schneider, Angela Dai

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to analyze paper content due to technical retrieval error

Abstract: Failed to fetch summary for 2603.22972: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22972&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[311] MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding

Basit Alawode, Arif Mahmood, Muaz Khalifa Al-Radi, Shahad Albastaki, Asim Khan, Muhammad Bilal, Moshira Ali Abdalla, Mohammed Bennamoun, Sajid Javed

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2603.23067

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2603.23067: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23067&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[312] PoseDriver: A Unified Approach to Multi-Category Skeleton Detection for Autonomous Driving

Yasamin Borhani, Taylor Mordan, Yihan Wang, Reyhaneh Hosseininejad, Javad Khoramdel, Alexandre Alahi

Main category: cs.CV

TL;DR: Paper ID 2603.23215 summary could not be fetched due to HTTP 429 (rate limiting) error from arXiv API

Details

Motivation: Unable to determine motivation due to technical error in fetching paper information

Method: Unable to determine method due to technical error in fetching paper information

Result: Unable to determine results due to technical error in fetching paper information

Conclusion: Unable to draw conclusions due to technical error in fetching paper information

Abstract: Failed to fetch summary for 2603.23215: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23215&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[313] Knot-10:A Tightness-Stratified Benchmark for Real-World Knot Classification with Topological Difficulty Analysis

Shiheng Nie, Yunguang Yue

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2603.23286: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23286&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[314] From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching

Feifan Luo, Hongyang Chen

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.23383: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23383&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[315] Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation

Gal Fiebelman, Hadar Averbuch-Elor, Sagie Benaim

Main category: cs.CV

TL;DR: Unable to analyze paper 2504.05296 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation without access to the paper abstract

Method: Cannot determine method without access to the paper abstract

Result: Cannot determine results without access to the paper abstract

Conclusion: Cannot draw conclusions without access to the paper abstract

Abstract: Failed to fetch summary for 2504.05296: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.05296&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[316] Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, Haoang Li

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.01718: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01718&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[317] Establishing Stochastic Object Models from Noisy Data via Ambient Measurement-Integrated Diffusion

Xiaoning Lei, Jianwei Sun, Wenhao Cai, Xichen Xu, Yanshu Wang, Hu Gao

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2512.14187: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14187&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.AI

[318] PLDR-LLMs Reason At Self-Organized Criticality

Burc Gokden

Main category: cs.AI

TL;DR: PLDR-LLMs trained at self-organized criticality exhibit reasoning capabilities with deductive outputs showing characteristics similar to second-order phase transitions, where reasoning quality correlates with order parameter near zero at criticality.

Details

Motivation: To understand how reasoning emerges in large language models and provide a physics-inspired framework for quantifying reasoning capabilities without relying on benchmark evaluations.

Method: Train PLDR-LLMs at self-organized criticality, analyze deductive outputs using statistical physics concepts (correlation length, scaling functions, universality classes), define order parameter from global statistics of deductive output parameters, and correlate with benchmark performance.

Result: Models trained near criticality show better reasoning capabilities when order parameter is close to zero, with deductive outputs exhibiting metastable steady states and characteristics of second-order phase transitions.

Conclusion: Reasoning in LLMs can be explained through statistical physics principles and quantified using global model parameters without benchmark evaluations, providing a self-contained theoretical framework.

Abstract: We show that PLDR-LLMs pretrained at self-organized criticality exhibit reasoning at inference time. The characteristics of PLDR-LLM deductive outputs at criticality is similar to second-order phase transitions. At criticality, the correlation length diverges, and the deductive outputs attain a metastable steady state. The steady state behaviour suggests that deductive outputs learn representations equivalent to scaling functions, universality classes and renormalization groups from the training dataset, leading to generalization and reasoning capabilities in the process. We can then define an order parameter from the global statistics of the model’s deductive output parameters at inference. The reasoning capabilities of a PLDR-LLM is better when its order parameter is close to zero at criticality. This observation is supported by the benchmark scores of the models trained at near-criticality and sub-criticality. Our results provide a self-contained explanation on how reasoning manifests in large language models, and the ability to reason can be quantified solely from global model parameter values of the deductive outputs at steady state, without any need for evaluation of curated benchmark datasets through inductive output for reasoning and comprehension.

[319] Environment Maps: Structured Environmental Representations for Long-Horizon Agents

Yenchia Feng, Chirag Sharma, Karime Maamari

Main category: cs.AI

TL;DR: Environment Maps: A persistent, agent-agnostic graph representation that consolidates heterogeneous evidence (screen recordings, execution traces) to improve long-horizon task automation in dynamic interfaces.

Details

Motivation: LLMs struggle with robust automation of complex software workflows due to cascading errors and environmental stochasticity in long-horizon settings. Single missteps in dynamic interfaces lead to task failures, hallucinations, or trial-and-error approaches.

Method: Introduces Environment Maps with four core components: Contexts (abstracted locations), Actions (parameterized affordances), Workflows (observed trajectories), and Tacit Knowledge (domain definitions and reusable procedures). This structured graph representation consolidates heterogeneous evidence like screen recordings and execution traces.

Result: On WebArena benchmark across five domains, agents with environment maps achieve 28.2% success rate, nearly doubling baseline performance (14.2%) and outperforming agents with raw trajectory data (23.3%).

Conclusion: Environment Maps provide a structured interface between models and environments, establishing a persistent foundation for long-horizon planning that is human-interpretable, editable, and incrementally refinable.

Abstract: Although large language models (LLMs) have advanced rapidly, robust automation of complex software workflows remains an open problem. In long-horizon settings, agents frequently suffer from cascading errors and environmental stochasticity; a single misstep in a dynamic interface can lead to task failure, resulting in hallucinations or trial-and-error. This paper introduces $\textit{Environment Maps}$: a persistent, agent-agnostic representation that mitigates these failures by consolidating heterogeneous evidence, such as screen recordings and execution traces, into a structured graph. The representation consists of four core components: (1) Contexts (abstracted locations), (2) Actions (parameterized affordances), (3) Workflows (observed trajectories), and (4) Tacit Knowledge (domain definitions and reusable procedures). We evaluate this framework on the WebArena benchmark across five domains. Agents equipped with environment maps achieve a 28.2% success rate, nearly doubling the performance of baselines limited to session-bound context (14.2%) and outperforming agents that have access to the raw trajectory data used to generate the environment maps (23.3%). By providing a structured interface between the model and the environment, Environment Maps establish a persistent foundation for long-horizon planning that is human-interpretable, editable, and incrementally refinable.

[320] Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework

Zeinab Dehghani, Rameez Raja Kureshi, Koorosh Aslansefat, Faezeh Alsadat Abedi, Dhavalkumar Thakker, Lisa Greaves, Bhupesh Kumar Mishra, Baseer Ahmad, Tanaya Maslekar

Main category: cs.AI

TL;DR: Voice-enabled AI smart speaker system for care homes using speech recognition and RAG for resident records, reminders, and scheduling tasks, with safety-focused evaluation showing high accuracy in controlled testing.

Details

Motivation: To reduce administrative workload in health and social care settings by developing voice-enabled AI systems that allow staff to spend more time on patient care through spoken access to records, reminders, and scheduling.

Method: Combines Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, dense). Uses supervised care-home trials and controlled testing with 330 spoken transcripts across 11 care categories, evaluating resident identification, care category matching, reminder recognition/extraction, and end-to-end scheduling correctness with safety mechanisms like confidence scoring and human oversight.

Result: Best configuration (GPT-5.2) achieved 100% resident ID and care category matching, 89.09% reminder recognition with 100% recall (zero missed reminders), and 84.65% exact reminder-count agreement for end-to-end scheduling via calendar integration.

Conclusion: Voice-enabled systems with careful evaluation and appropriate safeguards can support accurate documentation, effective task management, and trustworthy AI use in care home settings, though edge cases remain in converting informal spoken instructions to actionable events.

Abstract: Artificial intelligence (AI) is increasingly being explored in health and social care to reduce administrative workload and allow staff to spend more time on patient care. This paper evaluates a voice-enabled Care Home Smart Speaker designed to support everyday activities in residential care homes, including spoken access to resident records, reminders, and scheduling tasks. A safety-focused evaluation framework is presented that examines the system end-to-end, combining Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, and dense). Using supervised care-home trials and controlled testing, we evaluated 330 spoken transcripts across 11 care categories, including 184 reminder-containing interactions. These evaluations focus on (i) correct identification of residents and care categories, (ii) reminder recognition and extraction, and (iii) end-to-end scheduling correctness under uncertainty (including safe deferral/clarification). Given the safety-critical nature of care homes, particular attention is also paid to reliability in noisy environments and across diverse accents, supported by confidence scoring, clarification prompts, and human-in-the-loop oversight. In the best-performing configuration (GPT-5.2), resident ID and care category matching reached 100% (95% CI: 98.86-100), while reminder recognition reached 89.09% (95% CI: 83.81-92.80) with zero missed reminders (100% recall) but some false positives. End-to-end scheduling via calendar integration achieved 84.65% exact reminder-count agreement (95% CI: 78.00-89.56), indicating remaining edge cases in converting informal spoken instructions into actionable events. The findings suggest that voice-enabled systems, when carefully evaluated and appropriately safeguarded, can support accurate documentation, effective task management, and trustworthy use of AI in care home settings.

[321] Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments

Yi Han, Lingfei Qian, Yan Wang, Yueru He, Xueqing Peng, Dongji Feng, Yankai Chen, Haohang Li, Yupeng Cao, Jimin Huang, Xue Liu, Jian-Yun Nie, Sophia Ananiadou

Main category: cs.AI

TL;DR: EnterpriseArena benchmark evaluates LLM agents on long-horizon enterprise resource allocation under uncertainty, revealing significant capability gaps in current models.

Details

Motivation: While LLMs enable agentic systems for reasoning and planning, it's unclear if they can effectively allocate scarce resources over time under uncertainty, balancing competing objectives while preserving flexibility for future needs.

Method: Introduces EnterpriseArena, a 132-month enterprise simulator combining financial data, business documents, macroeconomic signals, and expert-validated operating rules. The environment is partially observable, forcing agents to trade off information acquisition against resource conservation.

Result: Experiments with eleven advanced LLMs show high difficulty: only 16% of runs survive the full horizon, and larger models don’t reliably outperform smaller ones, indicating a distinct capability gap.

Conclusion: Long-horizon resource allocation under uncertainty represents a significant challenge for current LLM agents, requiring new approaches beyond current capabilities.

Abstract: Large language models (LLMs) have enabled agentic systems that can reason, plan, and act across complex tasks, but it remains unclear whether they can allocate resources effectively under uncertainty. Unlike short-horizon reactive decisions, allocation requires committing scarce resources over time while balancing competing objectives and preserving flexibility for future needs. We introduce EnterpriseArena, the first benchmark for evaluating agents on long-horizon enterprise resource allocation. It instantiates CFO-style decision-making in a 132-month enterprise simulator combining firm-level financial data, anonymized business documents, macroeconomic and industry signals, and expert-validated operating rules. The environment is partially observable and reveals the state only through budgeted organizational tools, forcing agents to trade off information acquisition against conserving scarce resources. Experiments on eleven advanced LLMs show that this setting remains highly challenging: only 16% of runs survive the full horizon, and larger models do not reliably outperform smaller ones. These results identify long-horizon resource allocation under uncertainty as a distinct capability gap for current LLM agents.

[322] GTO Wizard Benchmark

Marc-Antoine Provost, Nejc Ilenic, Christopher Solinas, Philippe Beardsell

Main category: cs.AI

TL;DR: A benchmark for evaluating poker agents against a superhuman AI using variance reduction techniques, with initial LLM testing showing they’re far below baseline performance.

Details

Motivation: To create a standardized evaluation framework for benchmarking algorithms in Heads-Up No-Limit Texas Hold'em (HUNL) poker, addressing the fundamental challenge of variance in poker evaluation and providing researchers with a precise setting to evaluate advances in planning and reasoning in multi-agent systems with partial observability.

Method: Developed GTO Wizard Benchmark with a public API that evaluates agents against GTO Wizard AI (a superhuman poker agent approximating Nash Equilibria). Integrated AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. Conducted comprehensive benchmarking of state-of-the-art LLMs under zero-shot conditions.

Result: GTO Wizard AI defeated the previous strongest benchmark (Slumbot) by $19.4 ± 4.1 bb/100. Initial LLM benchmarking revealed dramatic progress in reasoning over recent years, but all models (including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4) remain far below the benchmark baseline. Qualitative analysis identified opportunities for improvement in representation and reasoning over hidden states.

Conclusion: The benchmark provides researchers with a precise, quantifiable setting for evaluating advances in planning and reasoning in multi-agent systems with partial observability, with LLMs showing significant progress but still needing substantial improvement for complex reasoning tasks like poker.

Abstract: We introduce GTO Wizard Benchmark, a public API and standardized evaluation framework for benchmarking algorithms in Heads-Up No-Limit Texas Hold’em (HUNL). The benchmark evaluates agents against GTO Wizard AI, a state-of-the-art superhuman poker agent that approximates Nash Equilibria, and defeated Slumbot, the 2018 Annual Computer Poker Competition champion and previous strongest publicly accessible HUNL benchmark, by $19.4$ $\pm$ $4.1$ bb/100. Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others. Initial results and analysis reveal dramatic progress in LLM reasoning over recent years, yet all models remain far below the baseline established by our benchmark. Qualitative analysis reveals clear opportunities for improvement, including representation and the ability to reason over hidden states. This benchmark provides researchers with a precise and quantifiable setting to evaluate advances in planning and reasoning in multi-agent systems with partial observability.

[323] Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement

Ashish Malik, Caleb Lowe, Aayam Shrestha, Stefan Lee, Fuxin Li, Alan Fern

Main category: cs.AI

TL;DR: RAMP-3D: A reactive planning system for 3D box rearrangement using 3D vision-language models to predict object and target region masks from RGB-D observations and natural language goals.

Details

Motivation: Existing approaches for long-horizon planning in 3D environments struggle with reasoning over many objects, complex 3D geometry, and implicit semantic constraints. Symbolic planners have brittle relational grounding, while 2D vision-language models lack 3D understanding.

Method: Extends 3D grounding models to propose RAMP-3D, which formulates planning as sequential reactive prediction of paired 3D masks: a “which-object” mask for picking and a “which-target-region” mask for placement. Processes RGB-D observations and natural language tasks to generate pick-and-place actions.

Result: Achieves 79.5% success rate on long-horizon rearrangement tasks across 11 task variants with 1-30 boxes and diverse natural-language constraints. Significantly outperforms 2D VLM-based baselines.

Conclusion: Mask-based reactive policies are a promising alternative to symbolic pipelines for long-horizon planning in 3D environments, demonstrating strong performance on complex rearrangement tasks with natural language specifications.

Abstract: We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities. We extend existing 3D grounding models and propose Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a “which-object” mask indicating what to pick and a “which-target-region” mask specifying where to place it. The resulting system processes RGB-D observations and natural-language task specifications to reactively generate multi-step pick-and-place actions for 3D box rearrangement. We conduct experiments across 11 task variants in warehouse-style environments with 1-30 boxes and diverse natural-language constraints. RAMP-3D achieves 79.5% success rate on long-horizon rearrangement tasks and significantly outperforms 2D VLM-based baselines, establishing mask-based reactive policies as a promising alternative to symbolic pipelines for long-horizon planning.

[324] SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems

Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian

Main category: cs.AI

TL;DR: SCoOP is a training-free uncertainty quantification framework for multi-VLM systems that uses semantic-consistent opinion pooling to detect hallucinations and enable abstention for uncertain samples.

Details

Motivation: Combining multiple Vision-Language Models (VLMs) can improve multimodal reasoning but amplifies uncertainty and increases hallucination risks. Existing uncertainty quantification methods are designed for single models, not multi-VLM systems.

Method: SCoOP uses uncertainty-weighted linear opinion pooling to measure collective, system-level uncertainty across multiple VLMs. It’s a training-free framework that explicitly quantifies uncertainty in multi-VLM systems for hallucination detection and abstention.

Result: On ScienceQA, SCoOP achieves AUROC of 0.866 for hallucination detection (outperforming baselines by 10-13%) and AURAC of 0.907 for abstention (exceeding baselines by 7-9%). It adds only microsecond-level overhead to VLM inference.

Conclusion: SCoOP provides an efficient and principled mechanism for uncertainty-aware aggregation in multi-VLM systems, advancing the reliability of multimodal AI systems with minimal computational overhead.

Abstract: Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models’ outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic-Consistent Opinion Pooling), a training-free uncertainty quantification (UQ) framework multi-VLM systems through uncertainty-weighted linear opinion pooling. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system-level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention for highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection, outperforming baselines (0.732-0.757) by approximately 10-13%. For abstention, it attains an AURAC of 0.907, exceeding baselines (0.818-0.840) by 7-9%. Despite these gains, SCoOP introduces only microsecond-level aggregation overhead relative to the baselines, which is trivial compared to typical VLM inference time (on the order of seconds). These results demonstrate that SCoOP provides an efficient and principled mechanism for uncertainty-aware aggregation, advancing the reliability of multimodal AI systems.

[325] LLMs Do Not Grade Essays Like Humans

Jerin George Mathew, Sumayya Taher, Anindita Kundu, Denilson Barbosa

Main category: cs.AI

TL;DR: LLMs show limited agreement with human essay grading, with systematic biases: they over-score short/underdeveloped essays and under-score longer essays with minor errors, though their feedback is internally consistent with their scoring.

Details

Motivation: To evaluate how well LLMs can perform automated essay scoring compared to human graders, and understand the patterns in LLM grading behavior without task-specific training.

Method: Evaluated several GPT and Llama family models in out-of-the-box settings, comparing LLM-generated scores with human grades, analyzing agreement patterns, and examining relationships between scores and feedback.

Result: LLM-human agreement remains relatively weak and varies with essay characteristics. LLMs tend to assign higher scores to short/underdeveloped essays and lower scores to longer essays with minor errors. LLM-generated feedback is consistent with their scoring patterns.

Conclusion: LLMs follow coherent but different grading patterns than humans, resulting in limited alignment with human grading practices. However, they produce internally consistent feedback and can be used to support essay scoring.

Abstract: Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics. In particular, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays, while assigning lower scores to longer essays that contain minor grammatical or spelling errors. We also find that the scores generated by LLMs are generally consistent with the feedback they generate: essays receiving more praise tend to receive higher scores, while essays receiving more criticism tend to receive lower scores. These results suggest that LLM-generated scores and feedback follow coherent patterns but rely on signals that differ from those used by human raters, resulting in limited alignment with human grading practices. Nevertheless, our work shows that LLMs produce feedback that is consistent with their grading and that they can be reliably used in supporting essay scoring.

[326] Efficient Benchmarking of AI Agents

Franck Ndzomga

Main category: cs.AI

TL;DR: Small task subsets can preserve agent rankings at lower cost than full benchmark evaluation, with mid-range difficulty filtering (30-70% pass rates) reducing tasks by 44-70% while maintaining rank fidelity.

Details

Motivation: Evaluating AI agents on comprehensive benchmarks is expensive due to interactive rollouts with tool use and multi-step reasoning. The paper investigates whether small task subsets can preserve agent rankings at substantially lower cost, addressing the challenge of scaffold-driven distribution shift in agent evaluation.

Method: The study analyzes eight benchmarks, 33 agent scaffolds, and 70+ model configurations. It finds that absolute score prediction degrades under scaffold-driven distribution shift, while rank-order prediction remains stable. The paper proposes an optimization-free protocol: evaluate new agents only on tasks with intermediate historical pass rates (30-70%), motivated by Item Response Theory.

Result: The mid-range difficulty filter reduces the number of evaluation tasks by 44-70% while maintaining high rank fidelity under scaffold and temporal shifts. It provides more reliable rankings than random sampling (which exhibits high variance) and outperforms greedy task selection under distribution shift.

Conclusion: Reliable leaderboard ranking for AI agents does not require full-benchmark evaluation. Small task subsets with mid-range difficulty filtering can preserve rankings at substantially lower cost, addressing the expense of agent evaluation while maintaining reliability.

Abstract: Evaluating AI agents on comprehensive benchmarks is expensive because each evaluation requires interactive rollouts with tool use and multi-step reasoning. We study whether small task subsets can preserve agent rankings at substantially lower cost. Unlike static language model benchmarks, agent evaluation is subject to scaffold-driven distribution shift, since performance depends on the framework wrapping the underlying model. Across eight benchmarks, 33 agent scaffolds, and 70+ model configurations, we find that absolute score prediction degrades under this shift, while rank-order prediction remains stable. Exploiting this asymmetry, we propose a simple optimization-free protocol: evaluate new agents only on tasks with intermediate historical pass rates (30-70%). This mid-range difficulty filter, motivated by Item Response Theory, reduces the number of evaluation tasks by 44-70% while maintaining high rank fidelity under scaffold and temporal shifts. It provides more reliable rankings than random sampling, which exhibits high variance across seeds, and outperforms greedy task selection under distribution shift. These results suggest that reliable leaderboard ranking does not require full-benchmark evaluation.

[327] Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation

Han Zheng, Yining Ma, Brandon Araki, Jingkai Chen, Cathy Wu

Main category: cs.AI

TL;DR: RL-guided Rolling Horizon Prioritized Planning (RL-RH-PP) integrates reinforcement learning with classical search-based planning for lifelong multi-agent path finding in warehouse automation, achieving superior throughput through learned priority assignment policies.

Details

Motivation: Lifelong Multi-Agent Path Finding (MAPF) is crucial for warehouse automation but faces complexity in dynamic environments. Classical search-based solvers require costly adaptations, and while machine learning methods have been explored, their superiority over search-based methods remains inconclusive. The paper aims to bridge this gap by integrating RL with search-based planning.

Method: RL-RH-PP integrates reinforcement learning with classical Prioritized Planning (PP) as a backbone. Dynamic priority assignment is formulated as a Partially Observable Markov Decision Process (POMDP). An attention-based neural network autoregressively decodes priority orders on-the-fly, enabling efficient sequential single-agent planning by the PP planner. The framework exploits sequential decision-making nature while delegating complex spatial-temporal interactions to reinforcement learning.

Result: Evaluations in realistic warehouse simulations show RL-RH-PP achieves the highest total throughput among baselines. It generalizes effectively across agent densities, planning horizons, and warehouse layouts. Interpretive analyses reveal that RL-RH-PP proactively prioritizes congested agents and strategically redirects agents from congestion, easing traffic flow and boosting throughput.

Conclusion: The paper demonstrates that learning-guided approaches can augment traditional heuristics in modern warehouse automation. RL-RH-PP shows the potential of integrating reinforcement learning with classical search-based planning for lifelong MAPF problems, achieving superior performance through learned priority assignment policies.

Abstract: Lifelong Multi-Agent Path Finding (MAPF) is critical for modern warehouse automation, which requires multiple robots to continuously navigate conflict-free paths to optimize the overall system throughput. However, the complexity of warehouse environments and the long-term dynamics of lifelong MAPF often demand costly adaptations to classical search-based solvers. While machine learning methods have been explored, their superiority over search-based methods remains inconclusive. In this paper, we introduce Reinforcement Learning (RL) guided Rolling Horizon Prioritized Planning (RL-RH-PP), the first framework integrating RL with search-based planning for lifelong MAPF. Specifically, we leverage classical Prioritized Planning (PP) as a backbone for its simplicity and flexibility in integrating with a learning-based priority assignment policy. By formulating dynamic priority assignment as a Partially Observable Markov Decision Process (POMDP), RL-RH-PP exploits the sequential decision-making nature of lifelong planning while delegating complex spatial-temporal interactions among agents to reinforcement learning. An attention-based neural network autoregressively decodes priority orders on-the-fly, enabling efficient sequential single-agent planning by the PP planner. Evaluations in realistic warehouse simulations show that RL-RH-PP achieves the highest total throughput among baselines and generalizes effectively across agent densities, planning horizons, and warehouse layouts. Our interpretive analyses reveal that RL-RH-PP proactively prioritizes congested agents and strategically redirects agents from congestion, easing traffic flow and boosting throughput. These findings highlight the potential of learning-guided approaches to augment traditional heuristics in modern warehouse automation.

[328] VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents

Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu, Luxi Lin, Qingyu Zhang, Ya Li, Quan Liu, Tong Xu

Main category: cs.AI

TL;DR: VehicleMemBench: A multi-user long-context memory benchmark for vehicle-based agents with executable simulation environment, evaluating tool use and memory evolution in dynamic preference scenarios.

Details

Motivation: Vehicle agents are evolving from simple assistants to long-term companions, requiring continuous modeling of multi-user preferences and reliable decision-making in dynamic environments with preference conflicts and changing habits. Existing benchmarks are limited to single-user, static QA settings and fail to capture temporal evolution and multi-user tool-interactive nature of real vehicle environments.

Method: Introduces VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment. The benchmark evaluates tool use and memory by comparing post-action environment state with predefined target states, enabling objective and reproducible evaluation without LLM-based or human scoring. Includes 23 tool modules and each sample contains over 80 historical memory events.

Result: Experiments show that powerful models perform well on direct instruction tasks but struggle in scenarios involving memory evolution, particularly when user preferences change dynamically. Even advanced memory systems struggle to handle domain-specific memory requirements in this environment.

Conclusion: Highlights the need for more robust and specialized memory management mechanisms to support long-term adaptive decision-making in real-world in-vehicle systems. The findings demonstrate limitations of current models in handling dynamic multi-user preference scenarios with memory evolution.

Abstract: With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions. This evolution requires agents to continuously model multi-user preferences and make reliable decisions in the face of inter-user preference conflicts and changing habits over time. However, existing benchmarks are largely limited to single-user, static question-answer settings, failing to capture the temporal evolution of preferences and the multi-user, tool-interactive nature of real vehicle environments. To address this gap, we introduce VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment. The benchmark evaluates tool use and memory by comparing the post-action environment state with a predefined target state, enabling objective and reproducible evaluation without LLM-based or human scoring. VehicleMemBench includes 23 tool modules, and each sample contains over 80 historical memory events. Experiments show that powerful models perform well on direct instruction tasks but struggle in scenarios involving memory evolution, particularly when user preferences change dynamically. Even advanced memory systems struggle to handle domain-specific memory requirements in this environment. These findings highlight the need for more robust and specialized memory management mechanisms to support long-term adaptive decision-making in real-world in-vehicle systems. To facilitate future research, we release the data and code.

[329] When AI output tips to bad but nobody notices: Legal implications of AI’s mistakes

Dylan J. Restrepo, Nicholas J. Restrepo, Frank Y. Huo, Neil F. Johnson

Main category: cs.AI

TL;DR: Paper analyzes AI fabrication of legal content as a deterministic failure mode in Transformers, not random hallucination, with implications for legal professional competence and verification protocols.

Details

Motivation: The paper addresses the critical problem of generative AI fabricating fictitious legal content (case law, statutes, judicial holdings) that appears authentic, posing serious risks to legal professionals and court integrity. Current dismissal of this as random "hallucination" fails to address the underlying deterministic mechanisms.

Method: The paper applies physics-based analysis of Transformer mechanisms to reveal deterministic components in AI failures. It presents this science in a legal-industry context through a simulated brief-drafting scenario, analyzing how internal states cross calculable thresholds leading to fabrication.

Result: Analysis shows fabrication risk is not an anomalous glitch but a foreseeable consequence of Transformer design. The paper demonstrates that AI’s internal state can cross calculable thresholds causing output to flip from reliable legal reasoning to authoritative-sounding fabrication.

Conclusion: Legal professionals, courts, and regulators should replace outdated “black box” mental models with verification protocols based on understanding how these systems actually fail deterministically. This has direct implications for the evolving duty of technological competence in legal practice.

Abstract: The adoption of generative AI across commercial and legal professions offers dramatic efficiency gains – yet for law in particular, it introduces a perilous failure mode in which the AI fabricates fictitious case law, statutes, and judicial holdings that appear entirely authentic. Attorneys who unknowingly file such fabrications face professional sanctions, malpractice exposure, and reputational harm, while courts confront a novel threat to the integrity of the adversarial process. This failure mode is commonly dismissed as random hallucination', but recent physics-based analysis of the Transformer's core mechanism reveals a deterministic component: the AI's internal state can cross a calculable threshold, causing its output to flip from reliable legal reasoning to authoritative-sounding fabrication. Here we present this science in a legal-industry setting, walking through a simulated brief-drafting scenario. Our analysis suggests that fabrication risk is not an anomalous glitch but a foreseeable consequence of the technology's design, with direct implications for the evolving duty of technological competence. We propose that legal professionals, courts, and regulators replace the outdated black box’ mental model with verification protocols based on how these systems actually fail.

[330] The DeepXube Software Package for Solving Pathfinding Problems with Learned Heuristic Functions and Search

Forest Agostinelli

Main category: cs.AI

TL;DR: DeepXube is an open-source Python package that uses machine learning to learn heuristic functions for solving pathfinding problems, combining deep reinforcement learning, heuristic search, and formal logic.

Details

Motivation: To automate the solution of pathfinding problems by leveraging machine learning to learn effective heuristic functions that can guide search algorithms, making pathfinding more efficient and accessible.

Method: Combines deep reinforcement learning (limited-horizon Bellman-based learning, hindsight experience replay), heuristic search algorithms (batch weighted A*, Q* search, beam search), and formal logic (answer-set programming for goal specification). Uses automatic parallelization across CPUs for data generation and GPUs for reinforcement learning updates.

Result: Developed a comprehensive open-source tool with multiple inheritance structure for domain definition, efficient training through parallelization, and various search algorithms that leverage GPU parallelism and DNN architectures.

Conclusion: DeepXube provides an automated, efficient framework for solving pathfinding problems using learned heuristics, with visualization, profiling, and monitoring features, publicly available as open-source software.

Abstract: DeepXube is a free and open-source Python package and command-line tool that seeks to automate the solution of pathfinding problems by using machine learning to learn heuristic functions that guide heuristic search algorithms tailored to deep neural networks (DNNs). DeepXube is comprised of the latest advances in deep reinforcement learning, heuristic search, and formal logic for solving pathfinding problems. This includes limited-horizon Bellman-based learning, hindsight experience replay, batched heuristic search, and specifying goals with answer-set programming. A robust multiple-inheritance structure simplifies the definition of pathfinding domains and the generation of training data. Training heuristic functions is made efficient through the automatic parallelization of the generation of training data across central processing units (CPUs) and reinforcement learning updates across graphics processing units (GPUs). Pathfinding algorithms that take advantage of the parallelism of GPUs and DNN architectures, such as batch weighted A* and Q* search and beam search are easily employed to solve pathfinding problems through command-line arguments. Finally, several convenient features for visualization, code profiling, and progress monitoring during training and solving are available. The GitHub repository is publicly available at https://github.com/forestagostinelli/deepxube.

[331] DUPLEX: Agentic Dual-System Planning via LLM-Driven Information Extraction

Keru Hua, Ding Wang, Yaoying Gu, Xiaoguang Ma

Main category: cs.AI

TL;DR: DUPLEX is a dual-system neuro-symbolic architecture that restricts LLMs to schema-guided information extraction for PDDL problem generation, while using symbolic planners for logical plan synthesis, improving reliability in robotic task planning.

Details

Motivation: LLMs have semantic flexibility for robotic task planning but suffer from hallucination and logical inconsistency, limiting reliability in long-horizon domains. There's a need to bridge unstructured environments with rigorous plan synthesis.

Method: DUPLEX uses a dual-system approach: 1) Fast System with lightweight LLM extracts entities/relations from natural language and deterministically maps to PDDL problem files for classical symbolic planners; 2) Slow System activates only upon planning failure, using solver diagnostics to drive high-capacity LLM in iterative reflection and repair.

Result: Extensive evaluations across 12 classical and household planning domains show DUPLEX significantly outperforms existing end-to-end and hybrid LLM baselines in both success rate and reliability.

Conclusion: The key insight is to restrict LLMs to structured semantic grounding (what they’re good at) and leave logical plan synthesis to symbolic planners, rather than trying to make LLMs plan better.

Abstract: While Large Language Models (LLMs) provide semantic flexibility for robotic task planning, their susceptibility to hallucination and logical inconsistency limits their reliability in long-horizon domains. To bridge the gap between unstructured environments and rigorous plan synthesis, we propose DUPLEX, an agentic dual-system neuro-symbolic architecture that strictly confines the LLM to schema-guided information extraction rather than end-to-end planning or code generation. In our framework, a feed-forward Fast System utilizes a lightweight LLM to extract entities, relations etc. from natural language, deterministically mapping them into a Planning Domain Definition Language (PDDL) problem file for a classical symbolic planner. To resolve complex or underspecified scenarios, a Slow System is activated exclusively upon planning failure, leveraging solver diagnostics to drive a high-capacity LLM in iterative reflection and repair. Extensive evaluations across 12 classical and household planning domains demonstrate that DUPLEX significantly outperforms existing end-to-end and hybrid LLM baselines in both success rate and reliability. These results confirm that The key is not to make the LLM plan better, but to restrict the LLM to the part it is good at - structured semantic grounding - and leave logical plan synthesis to a symbolic planner.

[332] AnalogAgent: Self-Improving Analog Circuit Design Automation with LLM Agents

Zhixuan Bao, Zhuoyi Lin, Jiageng Wang, Jinhai Hu, Yuan Gao, Yaoxin Wu, Xiaoli Li, Xun Xu

Main category: cs.AI

TL;DR: AnalogAgent: A training-free multi-agent LLM framework with self-evolving memory for analog circuit design automation, achieving high success rates without additional expert feedback.

Details

Motivation: Current LLM-based approaches for analog circuit design rely on single-model loops that favor summaries over domain-specific insight and suffer from context attrition, losing critical technical details.

Method: Proposes AnalogAgent framework integrating LLM-based multi-agent system with self-evolving memory, coordinating Code Generator, Design Optimizer, and Knowledge Curator to distill execution feedback into adaptive playbook.

Result: Achieves 92% Pass@1 with Gemini and 97.4% Pass@1 with GPT-5; with compact models (Qwen-8B), yields +48.8% average Pass@1 gain and reaches 72.1% Pass@1 overall.

Conclusion: AnalogAgent substantially strengthens open-weight models for high-quality analog circuit design automation and enables cross-task transfer without additional expert feedback, databases, or libraries.

Abstract: Recent advances in large language models (LLMs) suggest strong potential for automating analog circuit design. Yet most LLM-based approaches rely on a single-model loop of generation, diagnosis, and correction, which favors succinct summaries over domain-specific insight and suffers from context attrition that erases critical technical details. To address these limitations, we propose AnalogAgent, a training-free agentic framework that integrates an LLM-based multi-agent system (MAS) with self-evolving memory (SEM) for analog circuit design automation. AnalogAgent coordinates a Code Generator, Design Optimizer, and Knowledge Curator to distill execution feedback into an adaptive playbook in SEM and retrieve targeted guidance for subsequent generation, enabling cross-task transfer without additional expert feedback, databases, or libraries. Across established benchmarks, AnalogAgent achieves 92% Pass@1 with Gemini and 97.4% Pass@1 with GPT-5. Moreover, with compact models (e.g., Qwen-8B), it yields a +48.8% average Pass@1 gain across tasks and reaches 72.1% Pass@1 overall, indicating that AnalogAgent substantially strengthens open-weight models for high-quality analog circuit design automation.

[333] From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

Lijing Luo, Yiben Luo, Alexey Gorbatovski, Sergey Kovalchuk, Xiaodan Liang

Main category: cs.AI

TL;DR: A data-driven analysis of RL environment evolution showing a paradigm shift toward LLM-driven semantic ecosystems and domain-specific generalization, with implications for designing next-gen embodied semantic simulators.

Details

Motivation: To move beyond qualitative reviews and provide quantitative understanding of how RL environments have evolved, particularly the transition from isolated physical simulations to language-driven foundation agents, and to map the field's bifurcation into distinct ecosystems.

Method: Programmatic processing of massive academic literature corpus, rigorous distillation of 2,000+ publications, quantitative methodology with multi-dimensional taxonomy, automated semantic and statistical analysis to identify cognitive fingerprints and ecosystem patterns.

Result: Revealed a profound paradigm shift: bifurcation into “Semantic Prior” ecosystem dominated by LLMs and “Domain-Specific Generalization” ecosystem, with characterization of cognitive fingerprints showing mechanisms of cross-task synergy, multi-domain interference, and zero-shot generalization.

Conclusion: Provides rigorous quantitative roadmap for designing next-generation Embodied Semantic Simulators that bridge continuous physical control with high-level logical reasoning, offering data-verified insights into RL environment evolution.

Abstract: The remarkable progress of reinforcement learning (RL) is intrinsically tied to the environments used to train and evaluate artificial agents. Moving beyond traditional qualitative reviews, this work presents a large-scale, data-driven empirical investigation into the evolution of RL environments. By programmatically processing a massive corpus of academic literature and rigorously distilling over 2,000 core publications, we propose a quantitative methodology to map the transition from isolated physical simulations to generalist, language-driven foundation agents. Implementing a novel, multi-dimensional taxonomy, we systematically analyze benchmarks against diverse application domains and requisite cognitive capabilities. Our automated semantic and statistical analysis reveals a profound, data-verified paradigm shift: the bifurcation of the field into a “Semantic Prior” ecosystem dominated by Large Language Models (LLMs) and a “Domain-Specific Generalization” ecosystem. Furthermore, we characterize the “cognitive fingerprints” of these distinct domains to uncover the underlying mechanisms of cross-task synergy, multi-domain interference, and zero-shot generalization. Ultimately, this study offers a rigorous, quantitative roadmap for designing the next generation of Embodied Semantic Simulators, bridging the gap between continuous physical control and high-level logical reasoning.

[334] Language-Grounded Multi-Agent Planning for Personalized and Fair Participatory Urban Sensing

Xusen Guo, Mingxing Peng, Hongliang Lu, Hai Yang, Jun Ma, Yuxuan Liang

Main category: cs.AI

TL;DR: MAPUS is an LLM-based multi-agent framework for personalized and fair participatory urban sensing that uses autonomous participant agents and a coordinator agent to improve satisfaction and fairness while maintaining sensing coverage.

Details

Motivation: Existing participatory urban sensing methods rely on centralized optimization and assume homogeneous participants, resulting in rigid assignments that overlook personal preferences and heterogeneous urban contexts, leading to poor participant satisfaction and fairness.

Method: Proposes MAPUS: an LLM-based multi-agent framework where participants are modeled as autonomous agents with individual profiles and schedules, and a coordinator agent performs fairness-aware selection and refines sensing routes through language-based negotiation.

Result: Experiments on real-world datasets show that MAPUS achieves competitive sensing coverage while substantially improving participant satisfaction and fairness compared to existing methods.

Conclusion: MAPUS promotes more human-centric and sustainable urban sensing systems by addressing personal preferences and fairness through LLM-based multi-agent negotiation.

Abstract: Participatory urban sensing leverages human mobility for large-scale urban data collection, yet existing methods typically rely on centralized optimization and assume homogeneous participants, resulting in rigid assignments that overlook personal preferences and heterogeneous urban contexts. We propose MAPUS, an LLM-based multi-agent framework for personalized and fair participatory urban sensing. In our framework, participants are modeled as autonomous agents with individual profiles and schedules, while a coordinator agent performs fairness-aware selection and refines sensing routes through language-based negotiation. Experiments on real-world datasets show that MAPUS achieves competitive sensing coverage while substantially improving participant satisfaction and fairness, promoting more human-centric and sustainable urban sensing systems.

[335] ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents

Bingqing Wei, Zhongyu Xia, Dingai Liu, Xiaoyu Zhou, Zhiwei Lin, Yongtao Wang

Main category: cs.AI

TL;DR: ELITE is an embodied agent framework that enables VLMs to learn from environment interactions through experiential learning and intent-aware knowledge transfer, improving performance on complex embodied tasks.

Details

Motivation: Current vision-language models (VLMs) show strong semantic understanding but fail at complex embodied tasks due to the gap between static training data and physical interaction requirements. Embodied agents built on VLMs often skip steps, propose invalid actions, and repeat mistakes.

Method: ELITE uses two synergistic mechanisms: 1) Self-reflective knowledge construction that extracts reusable strategies from execution trajectories and maintains an evolving strategy pool through structured refinement operations, and 2) Intent-aware retrieval that identifies relevant strategies from the pool and applies them to current tasks.

Result: On EB-ALFRED and EB-Habitat benchmarks, ELITE achieves 9% and 5% performance improvement over base VLMs in online settings without supervision. In supervised settings, it generalizes effectively to unseen task categories, outperforming state-of-the-art training-based methods.

Conclusion: ELITE effectively bridges the gap between semantic understanding and reliable action execution for embodied agents, demonstrating that experiential learning and intent-aware knowledge transfer can significantly improve VLM-based embodied task performance.

Abstract: Vision-language models (VLMs) have shown remarkable general capabilities, yet embodied agents built on them fail at complex tasks, often skipping critical steps, proposing invalid actions, and repeating mistakes. These failures arise from a fundamental gap between the static training data of VLMs and the physical interaction for embodied tasks. VLMs can learn rich semantic knowledge from static data but lack the ability to interact with the world. To address this issue, we introduce ELITE, an embodied agent framework with {E}xperiential {L}earning and {I}ntent-aware {T}ransfer that enables agents to continuously learn from their own environment interaction experiences, and transfer acquired knowledge to procedurally similar tasks. ELITE operates through two synergistic mechanisms, \textit{i.e.,} self-reflective knowledge construction and intent-aware retrieval. Specifically, self-reflective knowledge construction extracts reusable strategies from execution trajectories and maintains an evolving strategy pool through structured refinement operations. Then, intent-aware retrieval identifies relevant strategies from the pool and applies them to current tasks. Experiments on the EB-ALFRED and EB-Habitat benchmarks show that ELITE achieves 9% and 5% performance improvement over base VLMs in the online setting without any supervision. In the supervised setting, ELITE generalizes effectively to unseen task categories, achieving better performance compared to state-of-the-art training-based methods. These results demonstrate the effectiveness of ELITE for bridging the gap between semantic understanding and reliable action execution.

[336] Enhanced Mycelium of Thought (EMoT): A Bio-Inspired Hierarchical Reasoning Architecture with Strategic Dormancy and Mnemonic Encoding

Florian Odi Stummer

Main category: cs.AI

TL;DR: EMoT is a bio-inspired reasoning framework for LLMs with hierarchical processing, strategic dormancy, memory encoding, but shows trade-offs: performs well on complex cross-domain tasks but overthinks simple problems with high computational cost.

Details

Motivation: Current LLM prompting methods like CoT and ToT lack persistent memory, strategic dormancy, and cross-domain synthesis capabilities needed for complex multi-domain reasoning.

Method: Four-level hierarchical architecture (Micro, Meso, Macro, Meta) with strategic dormancy/reactivation of reasoning nodes and Memory Palace with five mnemonic encoding styles.

Result: EMoT achieved near-parity with CoT (4.20 vs 4.33/5.0) with higher stability, outperformed CoT on Cross-Domain Synthesis (4.8 vs 4.4), but substantially underperformed on simple problems (27% vs baselines). Strategic dormancy was essential (quality collapsed from 4.2 to 1.0 when disabled).

Conclusion: EMoT shows promise for complex multi-domain reasoning but suffers from systematic overthinking on simple problems and has 33x computational overhead. It’s a specialized framework, not a general-purpose enhancement.

Abstract: Current prompting paradigms for large language models (LLMs), including Chain-of-Thought (CoT) and Tree-of-Thoughts (ToT), follow linear or tree-structured reasoning paths that lack persistent memory, strategic dormancy, and cross-domain synthesis. We present the Enhanced Mycelium of Thought (EMoT) framework, a bio-inspired reasoning architecture that organises cognitive processing into a four-level hierarchy (Micro, Meso, Macro, Meta), implements strategic dormancy and reactivation of reasoning nodes, and integrates a Memory Palace with five mnemonic encoding styles. EMoT is a research prototype for complex, multi-domain problems, not a general-purpose prompting enhancement. Two complementary evaluations reveal a characteristic trade-off. In a blind LLM-as-Judge evaluation across three domains, EMoT achieved near-parity with CoT (4.20 vs. 4.33/5.0) with higher stability, and outperformed CoT on Cross-Domain Synthesis (4.8 vs. 4.4). Ablation studies show that strategic dormancy is architecturally essential (quality collapsed from 4.2 to 1.0 when disabled). On a 15-item short-answer benchmark, EMoT (27%) substantially underperformed simpler baselines, confirming systematic overthinking on simple problems. These results are subject to important limitations: small sample sizes (n=3 complex cases, n=15 short-answer items), LLM-as-Judge evaluation with potential self-preference bias, and approximately 33-fold computational cost overhead. To our knowledge, EMoT is the first reasoning framework to combine hierarchical topology, strategic thought dormancy with reactivation, and mnemonic memory encoding in a single architecture.

[337] Bridging the Evaluation Gap: Standardized Benchmarks for Multi-Objective Search

Hadar Peer, Carlos Hernandez, Sven Koenig, Ariel Felner, Oren Salzman

Main category: cs.AI

TL;DR: A comprehensive benchmark suite for multi-objective search with standardized instances across four domains to enable robust and reproducible evaluations.

Details

Motivation: Multi-objective search evaluation suffers from fragmentation due to heterogeneous problem instances with incompatible objective definitions, making cross-study comparisons difficult. Existing benchmarks like DIMACS road networks have highly correlated objectives that fail to capture diverse Pareto-front structures.

Method: Created a standardized benchmark suite spanning four structurally diverse domains: real-world road networks, structured synthetic graphs, game-based grid environments, and high-dimensional robotic motion-planning roadmaps. Provides fixed graph instances, standardized start-goal queries, and both exact and approximate reference Pareto-optimal solution sets.

Result: The benchmark captures a full spectrum of objective interactions from strongly correlated to strictly independent, providing a common foundation for future MOS evaluations.

Conclusion: This standardized benchmark suite enables robust, reproducible, and structurally comprehensive evaluations in multi-objective search research.

Abstract: Empirical evaluation in multi-objective search (MOS) has historically suffered from fragmentation, relying on heterogeneous problem instances with incompatible objective definitions that make cross-study comparisons difficult. This standardization gap is further exacerbated by the realization that DIMACS road networks, a historical default benchmark for the field, exhibit highly correlated objectives that fail to capture diverse Pareto-front structures. To address this, we introduce the first comprehensive, standardized benchmark suite for exact and approximate MOS. Our suite spans four structurally diverse domains: real-world road networks, structured synthetic graphs, game-based grid environments, and high-dimensional robotic motion-planning roadmaps. By providing fixed graph instances, standardized start-goal queries, and both exact and approximate reference Pareto-optimal solution sets, this suite captures a full spectrum of objective interactions: from strongly correlated to strictly independent. Ultimately, this benchmark provides a common foundation to ensure future MOS evaluations are robust, reproducible, and structurally comprehensive.

[338] AI-Supervisor: Autonomous AI Research Supervision via a Persistent Research World Model

Yunbo Long

Main category: cs.AI

TL;DR: AutoProf is a multi-agent AI research framework with persistent knowledge graph and self-correcting loops for autonomous literature review, gap discovery, method development, and paper writing.

Details

Motivation: Existing automated research systems are stateless, linear pipelines that lack persistent understanding, structured gap analysis, and verification mechanisms. They process papers sequentially without maintaining shared memory or enabling agents to refine each other's findings.

Method: Multi-agent orchestration framework with specialized agents, continuously evolving Research World Model (Knowledge Graph), structured gap discovery decomposing methods into modules, self-correcting discovery loops analyzing module performance, self-improving development loops using cross-domain mechanism search, and consensus-based validation.

Result: Framework enables end-to-end AI research supervision from literature review through paper writing, with autonomous exploration and self-correcting updates. It’s model-agnostic, supports mainstream LLMs, and scales elastically with token budget.

Conclusion: AutoProf provides a novel approach to automated research with persistent knowledge representation and self-correcting mechanisms, overcoming limitations of stateless linear pipelines.

Abstract: Existing automated research systems operate as stateless, linear pipelines, generating outputs without maintaining a persistent understanding of the research landscape. They process papers sequentially, propose ideas without structured gap analysis, and lack mechanisms for agents to verify or refine each other’s findings. We present AutoProf (Autonomous Professor), a multi-agent orchestration framework where specialized agents provide end-to-end AI research supervision driven by human interests, from literature review through gap discovery, method development, evaluation, and paper writing, via autonomous exploration and self-correcting updates. Unlike sequential pipelines, AutoProf maintains a continuously evolving Research World Model implemented as a Knowledge Graph, capturing methods, benchmarks, limitations, and unexplored gaps as shared memory across agents. The framework introduces three contributions: first, structured gap discovery that decomposes methods into modules, evaluates them across benchmarks, and identifies module-level gaps; second, self-correcting discovery loops that analyze why modules succeed or fail, detect benchmark biases, and assess evaluation adequacy; third, self-improving development loops using cross-domain mechanism search to iteratively address failing components. All agents operate under a consensus mechanism where findings are validated before being committed to the shared model. The framework is model-agnostic, supports mainstream large language models, and scales elastically with token budget from lightweight exploration to full-scale investigation.

[339] Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

John Ray B. Martinez

Main category: cs.AI

TL;DR: Multi-agent framework with specialist agents and two-phase verification improves calibration and discrimination in medical QA, reducing ECE by 49-74% across medical benchmarks.

Details

Motivation: Miscalibrated confidence scores are a major obstacle to deploying AI in clinical settings, as overconfident models offer no useful signal for deferral decisions in safety-critical applications.

Method: Multi-agent framework with four domain-specific specialist agents (respiratory, cardiology, neurology, gastroenterology) using Qwen2.5-7B-Instruct, followed by two-phase self-verification process that measures internal consistency to produce Specialist Confidence Scores (S-scores), which drive weighted fusion for final answer selection and confidence calibration.

Result: ECE reduced by 49-74% across all four experimental settings (MedQA-USMLE and MedMCQA subsets), with full system achieving ECE = 0.091 (74.4% reduction) and AUROC = 0.630 (+0.056) at 59.2% accuracy on MedQA-250. Two-phase verification identified as primary calibration driver, multi-agent reasoning as primary accuracy driver.

Conclusion: Consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing practical confidence signals for deferral in safety-critical clinical AI applications.

Abstract: Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.

[340] From Liar Paradox to Incongruent Sets: A Normal Form for Self-Reference

Shalender Singh, Vishnu Priya Singh Parmar

Main category: cs.AI

TL;DR: INF transforms self-referential sentences into families of non-self-referential sentences that are individually satisfiable but not jointly satisfiable, isolating semantic obstruction while preserving local classical semantics.

Details

Motivation: To provide a structural representation for self-referential semantic sentences that isolates the semantic obstruction created by self-reference while preserving classical semantics locally, and to study the role of incongruence as a structural source of semantic informativeness.

Method: Introduces incongruent normal form (INF) that replaces self-referential sentences with finite families of non-self-referential sentences. Uses model-theoretic analysis of informativeness, studies incompleteness through incompatible complete extensions, and develops quantitative semantic framework with Fourier-analytic notion of semantic energy based on total influence.

Result: Shows that semantic completeness precludes informativeness while incongruence preserves it, demonstrates that any consistent incomplete first-order theory admits finite incongruent families, and establishes quantitative bounds relating semantic determinacy, informativeness, and spectral simplicity with matrix inequality bounding aggregate semantic variance by total semantic energy.

Conclusion: Incongruence is a fundamental structural and quantitative feature of semantic representation that cannot collapse into a single determinate state without unbounded energy cost, providing a minimal formal basis for semantic knowledge.

Abstract: We introduce incongruent normal form (INF), a structural representation for self-referential semantic sentences. An INF replaces a self-referential sentence with a finite family of non-self-referential sentences that are individually satisfiable but not jointly satisfiable. This transformation isolates the semantic obstruction created by self-reference while preserving classical semantics locally and is accompanied by correctness theorems characterizing when global inconsistency arises from locally compatible commitments. We then study the role of incongruence as a structural source of semantic informativeness. Using a minimal model-theoretic notion of informativeness-understood as the ability of sentences to distinguish among admissible models-we show that semantic completeness precludes informativeness, while incongruence preserves it. Moreover, incongruence is not confined to paradoxical constructions: any consistent incomplete first-order theory admits finite incongruent families arising from incompatible complete extensions. In this sense, incompleteness manifests structurally as locally realizable but globally incompatible semantic commitments, providing a minimal formal basis for semantic knowledge. Finally, we introduce a quantitative semantic framework. In a canonical finite semantic-state setting, we model semantic commitments as Boolean functions and define a Fourier-analytic notion of semantic energy based on total influence. We derive uncertainty-style bounds relating semantic determinacy, informativeness, and spectral simplicity, and establish a matrix inequality bounding aggregate semantic variance by total semantic energy. These results show quantitatively that semantic informativeness cannot collapse into a single determinate state without unbounded energy cost, identifying incongruence as a fundamental structural and quantitative feature of semantic representation.

[341] Completeness of Unbounded Best-First Minimax and Descent Minimax

Quentin Cohen-Solal

Main category: cs.AI

TL;DR: The paper analyzes search algorithms for two-player perfect information games, focusing on whether completion techniques enable algorithms like Unbounded Best-First Minimax and Descent Minimax to always determine winning strategies.

Details

Motivation: Existing search algorithms for two-player perfect information games (like Unbounded Best-First Minimax and Descent Minimax) cannot always determine winning strategies even with infinite search time, despite being core algorithms in state-of-the-art knowledge-free reinforcement learning. The completion technique was introduced to improve them, but whether it sufficiently enables them to always find winning strategies remained an open question.

Method: The authors generalize the two algorithms (their versions using the completion technique) and analyze the class of algorithms they belong to. They provide theoretical analysis to determine whether algorithms in this class can compute optimal strategies.

Result: The paper shows that any algorithm in this generalized class computes the best strategy. Additionally, experimental results demonstrate that the completion technique improves winning performance in practice.

Conclusion: The completion technique is sufficient to enable algorithms like Unbounded Best-First Minimax and Descent Minimax to always determine winning strategies, resolving the open question. The generalized class of algorithms with completion techniques can compute optimal strategies for two-player perfect information games.

Abstract: In this article, we focus on search algorithms for two-player perfect information games, whose objective is to determine the best possible strategy, and ideally a winning strategy. Unfortunately, some search algorithms for games in the literature are not able to always determine a winning strategy, even with an infinite search time. This is the case, for example, of the following algorithms: Unbounded Best-First Minimax and Descent Minimax, which are core algorithms in state-of-the-art knowledge-free reinforcement learning. They were then improved with the so-called completion technique. However, whether this technique sufficiently improves these algorithms to allow them to always determine a winning strategy remained an open question until now. To answer this question, we generalize the two algorithms (their versions using the completion technique), and we show that any algorithm of this class of algorithms computes the best strategy. Finally, we experimentally show that the completion technique improves winning performance.

[342] The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Biplab Pal, Santanu Bhattacharya

Main category: cs.AI

TL;DR: A measure-theoretic Markov framework for evaluating AI agent reliability in organizational workflows, focusing on statistical support, ambiguity, and oversight costs using event log data.

Details

Motivation: Addresses the challenge of replacing deterministic organizational workflows with stochastic AI policies while maintaining reliability and managing oversight costs. The key problem is ensuring agent trajectories remain statistically supported, locally unambiguous, and economically governable.

Method: Develops a measure-theoretic Markov framework with core quantities: state blind-spot mass, state-action blind mass, entropy-based human-in-the-loop escalation gate, and expected oversight-cost identity over workflow visitation measure. Applied to Business Process Intelligence Challenge 2019 purchase-to-pay log with 251,734 cases and 1,595,923 events.

Result: Large workflows can appear well-supported at state level while retaining substantial blind mass over next-step decisions. Refining operational state to include context, economic magnitude, and actor class expanded state space from 42 to 668 and raised state-action blind mass from 0.0165 to 0.1253. On held-out data, m(s) = max_a π̂(a|s) tracked realized autonomous step accuracy within 3.4 percentage points.

Conclusion: The same quantities that delimit statistically credible autonomy also determine expected oversight burden. The framework is applicable to engineering processes with operational event logs and provides tools for evaluating AI agent reliability in organizational settings.

Abstract: Agentic artificial intelligence (AI) in organizations is a sequential decision problem constrained by reliability and oversight cost. When deterministic workflows are replaced by stochastic policies over actions and tool calls, the key question is not whether a next step appears plausible, but whether the resulting trajectory remains statistically supported, locally unambiguous, and economically governable. We develop a measure-theoretic Markov framework for this setting. The core quantities are state blind-spot mass B_n(tau), state-action blind mass B^SA_{pi,n}(tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. The main empirical finding is that a large workflow can appear well supported at the state level while retaining substantial blind mass over next-step decisions: refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668 and raises state-action blind mass from 0.0165 at tau=50 to 0.1253 at tau=1000. On the held-out split, m(s) = max_a pi-hat(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average. The same quantities that delimit statistically credible autonomy also determine expected oversight burden. The framework is demonstrated on a large-scale enterprise procurement workflow and is designed for direct application to engineering processes for which operational event logs are available.

[343] Learning To Guide Human Decision Makers With Vision-Language Models

Debodeep Banerjee, Stefano Teso, Burcu Sayin, Andrea Passerini

Main category: cs.AI

TL;DR: SLOG framework enables AI systems to provide interpretable textual guidance for human decision-making in high-stakes domains like medical diagnosis, rather than taking over decisions.

Details

Motivation: Current AI systems for high-stakes decision-making create problematic separation of responsibilities where humans either over-rely on AI decisions or get no assistance on the hardest cases. Regulatory agencies increasingly require human oversight for trustworthy AI, but existing approaches fail to provide meaningful assistance while maintaining human responsibility.

Method: Proposes Learning to Guide (LTG) framework where AI provides interpretable guidance rather than making decisions. Develops SLOG (Specific Learning to Guide) approach that transforms any vision-language model into a capable generator of textual guidance using minimal human feedback, ensuring guidance is interpretable and task-specific.

Result: Empirical evaluation shows promise on both synthetic datasets and challenging real-world medical diagnosis tasks, demonstrating the framework’s effectiveness in providing useful guidance while maintaining human decision-making responsibility.

Conclusion: SLOG offers a viable alternative to traditional AI decision-making systems by providing interpretable guidance that assists human experts without removing their decision-making responsibility, addressing regulatory requirements for human oversight in high-stakes domains.

Abstract: There is growing interest in AI systems that support human decision-making in high-stakes domains (e.g., medical diagnosis) to improve decision quality and reduce cognitive load. Mainstream approaches pair human experts with a machine-learning model, offloading low-risk decisions to the model so that experts can focus on cases that require their judgment. This separation of responsibilities setup, however, is inadequate for high-stakes scenarios. The expert may end up over-relying on the machine’s decisions due to anchoring bias, thus losing the human oversight that is increasingly being required by regulatory agencies to ensure trustworthy AI. On the other hand, the expert is left entirely unassisted on the (typically hardest) decisions on which the model abstained. As a remedy, we introduce learning to guide (LTG), an alternative framework in which – rather than taking control from the human expert – the machine provides guidance useful for decision making, and the human is entirely responsible for coming up with a decision. In order to ensure guidance is interpretable and task-specific, we develop SLOG, an approach for turning any vision-language model into a capable generator of textual guidance by leveraging a modicum of human feedback. Our empirical evaluation highlights the promise of SLOG on both on a synthetic dataset and a challenging, real-world medical diagnosis task.

[344] The Collaboration Paradox: Why Generative AI Requires Both Strategic Intelligence and Operational Stability in Supply Chain Management

Soumyadeep Dhar

Main category: cs.AI

TL;DR: AI agents in supply chains exhibit a “collaboration paradox” where collaborative agents perform worse than non-AI baselines due to inventory hoarding, requiring a two-layer framework for stability.

Details

Motivation: To investigate emergent strategic behavior of AI-driven agents in economic settings, specifically in multi-echelon supply chains prone to instabilities like the bullwhip effect.

Method: Computational experiments with LLM-powered generative AI agents in controlled supply chain simulations, testing collaborative agents designed with Vendor-Managed Inventory principles against non-AI baselines.

Result: Discovered “collaboration paradox” - collaborative AI agents perform worse than non-AI baselines due to inventory hoarding. Resilience requires high-level AI-driven policy-setting combined with low-level collaborative execution with proactive replenishment.

Conclusion: The work provides insights into emergent behaviors of collaborative AI agents and offers a blueprint for designing stable, effective AI-driven systems for business analytics through a two-layer framework.

Abstract: The rise of autonomous, AI-driven agents in economic settings raises critical questions about their emergent strategic behavior. This paper investigates these dynamics in the cooperative context of a multi-echelon supply chain, a system famously prone to instabilities like the bullwhip effect. We conduct computational experiments with generative AI agents, powered by Large Language Models (LLMs), within a controlled supply chain simulation designed to isolate their behavioral tendencies. Our central finding is the “collaboration paradox”: a novel, catastrophic failure mode where theoretically superior collaborative AI agents, designed with Vendor-Managed Inventory (VMI) principles, perform even worse than non-AI baselines. We demonstrate that this paradox arises from an operational flaw where agents hoard inventory, starving the system. We then show that resilience is only achieved through a synthesis of two distinct layers: high-level, AI-driven proactive policy-setting to establish robust operational targets, and a low-level, collaborative execution protocol with proactive downstream replenishment to maintain stability. Our final framework, which implements this synthesis, can autonomously generate, evaluate, and quantify a portfolio of viable strategic choices. The work provides a crucial insight into the emergent behaviors of collaborative AI agents and offers a blueprint for designing stable, effective AI-driven systems for business analytics.

[345] From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Jessica M. Lundin, Usman Nasir Nakakana, Guillaume Chabot-Couture

Main category: cs.AI

TL;DR: A graph-based evaluation harness that transforms clinical guidelines into knowledge graphs to dynamically generate contamination-resistant evaluation queries for language model assessment.

Details

Motivation: Current benchmarks for domain-specific language models are static, manually curated, and susceptible to contamination, lacking comprehensive coverage and maintainability as guidelines evolve.

Method: Transform structured clinical guidelines into queryable knowledge graphs, then dynamically instantiate evaluation queries via graph traversal with combinatorial variation to resist surface-form contamination.

Result: Applied to WHO IMCI guidelines, generated clinically grounded multiple-choice questions. Evaluation of five language models revealed systematic gaps: good symptom recognition but lower accuracy on treatment protocols and clinical management decisions.

Conclusion: The framework provides scalable, contamination-resistant evaluation infrastructure that supports continuous regeneration as guidelines evolve and generalizes to domains with structured decision logic.

Abstract: Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.

[346] GeoSketch: A Neural-Symbolic Approach to Geometric Multimodal Reasoning with Auxiliary Line Construction and Affine Transformation

Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu, Zhiyang Teng, Xiaozhang Liu, Hanmeng Liu

Main category: cs.AI

TL;DR: GeoSketch is a neural-symbolic framework for geometric problem solving that integrates perception, symbolic reasoning, and sketch actions in an interactive loop, enabling dynamic manipulation of diagrams for better geometric reasoning.

Details

Motivation: Existing MLLMs process diagrams as static images, lacking capacity for dynamic manipulation which is essential for human geometric reasoning involving auxiliary line construction and affine transformations.

Method: Three-module framework: (1) Perception module abstracts diagrams into structured logic forms, (2) Symbolic Reasoning module applies geometric theorems to decide next deductive step, (3) Sketch Action module executes operations like drawing auxiliary lines or transformations. Trained with two-stage pipeline: supervised fine-tuning on 2,000 symbolic-curated trajectories followed by reinforcement learning with dense symbolic rewards.

Result: GeoSketch significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods on the GeoSketch Benchmark (390 geometry problems requiring auxiliary construction or affine transformations).

Conclusion: GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction, establishing a new foundation for solving complex visuospatial problems through hierarchical decision-making, executable visual actions, and symbolic verification.

Abstract: Geometric Problem Solving (GPS) poses a unique challenge for Multimodal Large Language Models (MLLMs), requiring not only the joint interpretation of text and diagrams but also iterative visuospatial reasoning. While existing approaches process diagrams as static images, they lack the capacity for dynamic manipulation - a core aspect of human geometric reasoning involving auxiliary line construction and affine transformations. We present GeoSketch, a neural-symbolic framework that recasts geometric reasoning as an interactive perception-reasoning-action loop. GeoSketch integrates: (1) a Perception module that abstracts diagrams into structured logic forms, (2) a Symbolic Reasoning module that applies geometric theorems to decide the next deductive step, and (3) a Sketch Action module that executes operations such as drawing auxiliary lines or applying transformations, thereby updating the diagram in a closed loop. To train this agent, we develop a two-stage pipeline: supervised fine-tuning on 2,000 symbolic-curated trajectories followed by reinforcement learning with dense, symbolic rewards to enhance robustness and strategic exploration. To evaluate this paradigm, we introduce the GeoSketch Benchmark, a high-quality set of 390 geometry problems requiring auxiliary construction or affine transformations. Experiments on strong MLLM baselines demonstrate that GeoSketch significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods. By unifying hierarchical decision-making, executable visual actions, and symbolic verification, GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction, establishing a new foundation for solving complex visuospatial problems.

[347] SAG-Agent: Enabling Long-Horizon Reasoning in Strategy Games via Dynamic Knowledge Graphs

Chenwei Tang, Lin Long, Xinyu Liu, Jingyu Xing, Zizhou Wang, Joey Tianyi Zhou, Jiawei Du, Liangli Zhen, Jiancheng Lv

Main category: cs.AI

TL;DR: SAG-Agent: Experience-driven learning framework using State-Action Graphs to improve GUI-based autonomous agents by structuring pixel interactions, enabling better generalization and long-horizon planning in API-free environments.

Details

Motivation: Most software lacks accessible APIs, forcing autonomous agents to interact through pixel-based GUIs. LLM-based agents in this API-free setting face efficiency bottlenecks: limited to local visual experiences, they make myopic decisions and rely on inefficient trial-and-error, hindering skill acquisition and long-horizon planning.

Method: Proposes SAG-Agent framework that structures raw pixel-level interactions into a persistent State-Action Graph (SAG). Mitigates inefficient exploration by topologically linking functionally similar but visually distinct GUI states. Designs hybrid intrinsic reward mechanism based on graph topology combining state-value reward for exploiting known high-value pathways with novelty reward for targeted exploration.

Result: Evaluated in two complex, open-ended GUI-based decision-making environments (Civilization V and Slay the Spire), demonstrating significant improvements in exploration efficiency and strategic depth over state-of-the-art methods.

Conclusion: SAG-Agent effectively addresses efficiency bottlenecks in API-free GUI interaction by structuring experience into graphs, enabling better generalization and decoupling strategic planning from pure discovery for improved long-horizon reasoning.

Abstract: Most commodity software lacks accessible Application Programming Interfaces (APIs), requiring autonomous agents to interact solely through pixel-based Graphical User Interfaces (GUIs). In this API-free setting, large language model (LLM)-based agents face severe efficiency bottlenecks: limited to local visual experiences, they make myopic decisions and rely on inefficient trial-and-error, hindering both skill acquisition and long-horizon planning. To overcome these limitations, we propose SAG-Agent, an experience-driven learning framework that structures an agent’s raw pixel-level interactions into a persistent State-Action Graph (SAG). SAG-Agent mitigates inefficient exploration by topologically linking functionally similar but visually distinct GUI states, constructing a rich neighborhood of experience that enables the agent to generalize from a diverse set of historical strategies. To facilitate long-horizon reasoning, we design a novel hybrid intrinsic reward mechanism based on the graph topology, combining a state-value reward for exploiting known high-value pathways with a novelty reward that encourages targeted exploration. This approach decouples strategic planning from pure discovery, allowing the agent to effectively value setup actions with delayed gratification. We evaluate SAG-Agent in two complex, open-ended GUI-based decision-making environments (Civilization V and Slay the Spire), demonstrating significant improvements in exploration efficiency and strategic depth over the state-of-the-art methods.

[348] CastMind: An Interaction-Driven Agentic Reasoning Framework for Cognition-Inspired Time Series Forecasting

Xiaohan Zhang, Tian Gao, Mingyue Cheng, Bokai Pan, Ze Guo, Yaguo Liu, Xiaoyu Tao, Qi Liu

Main category: cs.AI

TL;DR: CastMind is an agentic reasoning framework that uses training-free LLMs for time series forecasting by mimicking human expert iterative reasoning processes.

Details

Motivation: Current time series forecasting methods treat it as static, single-pass regression, while human experts use iterative reasoning with temporal features, domain knowledge, case references, and continuous refinement.

Method: Reformulates forecasting as expert-like multi-stage workflow: context preparation, reasoning-based generation, and reflective evaluation. Uses lightweight toolkit with feature set, knowledge base, case library, and contextual pool to support LLM-based reasoning.

Result: Extensive experiments across multiple benchmarks show CastMind generally outperforms representative baselines.

Conclusion: CastMind transforms forecasting from single-pass output to multi-turn autonomous interaction, enabling accurate time series forecasting with training-free LLMs.

Abstract: Time series forecasting plays a crucial role in decision-making across many real-world applications. Despite substantial progress, most existing methods still treat forecasting as a static, single-pass regression problem. In contrast, human experts form predictions through iterative reasoning that integrates temporal features, domain knowledge, case-based references, and supplementary context, with continuous refinement. In this work, we propose CastMind, an interaction-driven agentic reasoning framework that enables accurate time series forecasting with training-free large language models. CastMind reformulates forecasting as an expert-like process and organizes it into a multi-stage workflow involving context preparation, reasoning-based generation, and reflective evaluation, transforming forecasting from a single-pass output into a multi-turn, autonomous interaction process. To support diverse perspectives commonly considered by human experts, we develop a lightweight toolkit comprising a feature set, a knowledge base, a case library, and a contextual pool that provides external support for LLM-based reasoning. Extensive experiments across multiple benchmarks show that CastMind generally outperforms representative baselines. Code is available at this repository: https://github.com/SkyeGT/CastMind .

[349] Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report

Yan Chen, Yu Zou, Jialei Zeng, Haoran You, Xiaorui Zhou, Aixi Zhong

Main category: cs.AI

TL;DR: Pharos-ESG is a multimodal framework that transforms ESG reports into structured representations using layout analysis, hierarchy-aware segmentation, and visual-to-text conversion with ESG-specific annotations.

Details

Motivation: ESG reports present challenges for large-scale understanding due to irregular layouts, weak structure, and implicit hierarchies, making automated analysis difficult despite their importance in financial governance.

Method: Unified framework with reading-order modeling based on layout flow, hierarchy-aware segmentation using table-of-contents anchors, multimodal aggregation pipeline converting visual elements to natural language, and enrichment with ESG, GRI, and sentiment labels.

Result: Outperforms dedicated document parsing systems and general-purpose multimodal models on annotated benchmarks, and releases Aurora-ESG dataset with structured multimodal content and fine-grained annotations.

Conclusion: Pharos-ESG effectively addresses ESG report parsing challenges and supports ESG integration in financial governance through structured multimodal representations.

Abstract: Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial gover- nance, transforming capital allocation architectures, regu- latory frameworks, and systemic risk coordination mecha- nisms. However, as the core medium for assessing corpo- rate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic read- ing order from slide-like irregular layouts and implicit hier- archies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a uni- fied framework that transforms ESG reports into structured representations through multimodal parsing, contextual nar- ration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware seg- mentation guided by table-of-contents anchors, and a multi- modal aggregation pipeline that contextually transforms vi- sual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical de- mands of financial research. Extensive experiments on anno- tated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG re- ports, spanning Mainland China, Hong Kong, and U.S. mar- kets, featuring unified structured representations of multi- modal content, enriched with fine-grained layout and seman- tic annotations to better support ESG integration in financial governance and decision-making.

[350] Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Yuzhi Zhang, Linfeng Zhang, Weinan E, Siheng Chen, Yanfeng Wang

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2601.10402: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10402&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[351] Are LLMs Smarter Than Chimpanzees? An Evaluation on Perspective Taking and Knowledge State Estimation

Dingyi Yang, Junqi Zhao, Xue Li, Ce Li, Boyang Li

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2601.12410: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.12410&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Jingyu Li, Zhaocheng Du, Qianhui Zhu, kaiyuan Li, Zhicheng Zhang, Song-Li Wu, Chaolang Li, Pengwen Dai

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2601.19178: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19178&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[353] CIRCLE: A Framework for Evaluating AI from a Real-World Lens

Reva Schwartz, Carina Westling, Morgan Briggs, Marzieh Fadaee, Isar Nejadgholi, Matthew Holmes, Fariza Rashid, Maya Carlyle, Afaf Taïk, Kyra Wilson, Peter Douglas, Theodora Skeadas, Gabriella Waters, Rumman Chowdhury, Thiago Lacerda

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2602.24055: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24055&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[354] Agentified Assessment of Logical Reasoning Agents

Zhiyu Ni, Yifeng Xiao, Zheng Liang

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2603.02788: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02788&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[355] Relationship-Aware Safety Unlearning for Multimodal LLMs

Vishnu Narayanan Anilkumar, Abhijith Sreesylesh Babu, Trieu Hai Vo, Mohankrishna Kolla, Alexander Cuneo

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval error

Method: Unable to determine method due to retrieval error

Result: Unable to determine results due to retrieval error

Conclusion: Unable to determine conclusion due to retrieval error

Abstract: Failed to fetch summary for 2603.14185: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14185&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[356] DomAgent: Leveraging Knowledge Graphs and Case-Based Reasoning for Domain-Specific Code Generation

Shuai Wang, Dhasarathy Parthasarathy, Robert Feldt, Yinan Yu

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.21430: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21430&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[357] PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal

Zining Fang, Chunhui Liu, Bin Xu, Ming Chen, Xiaowei Hu, Cheng Xue

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2603.22844: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22844&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[358] Human strategic decision making in parametrized games

Sam Ganzfried

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2104.14744 exists but content cannot be retrieved at this time.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2104.14744: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2104.14744&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[359] Entire Space Counterfactual Learning for Reliable Content Recommendations

Hao Wang, Zhichao Chen, Zhaoran Liu, Haozhe Li, Degui Yang, Xinggao Liu, Haoxuan Li

Main category: cs.AI

TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API

Details

Motivation: Unable to determine motivation due to API fetch failure

Method: Unable to determine method due to API fetch failure

Result: Unable to determine results due to API fetch failure

Conclusion: Unable to determine conclusion due to API fetch failure

Abstract: Failed to fetch summary for 2210.11039: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2210.11039&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[360] A Comprehensive Survey on Enterprise Financial Risk Analysis from Big Data and LLMs Perspective

Huaming Du, Cancan Feng, Yuqian Lei, Chenyang Zhang, Guisong Liu, Gang Kou, Carl Yang, Yu Zhao

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to the paper content

Method: Cannot determine method without access to the paper content

Result: Cannot determine results without access to the paper content

Conclusion: Cannot draw conclusions without access to the paper content

Abstract: Failed to fetch summary for 2211.14997: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2211.14997&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[361] Perturbative adaptive importance sampling for Bayesian LOO cross-validation

Joshua C Chang, Xiangting Li, Tianyi Su, Shixin Xu, Hao-Ren Yao, Julia Porcino, Carson Chow

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2402.08151: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.08151&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[362] Moonwalk: Inverse-Forward Differentiation

Dmitrii Krylov, Armin Karamzade, Roy Fox

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2402.14212: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.14212&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[363] DIDLM: A SLAM Dataset for Difficult Scenarios Featuring Infrared, Depth Cameras, LIDAR, 4D Radar, and Others under Adverse Weather, Low Light Conditions, and Rough Roads

Weisheng Gong, Chen He, Kaijie Su, Qingyong Li, Tong Wu, Z. Jane Wang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2404.09622: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.09622&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[364] Hamiltonian Mechanics of Feature Learning: Bottleneck Structure in Leaky ResNets

Arthur Jacot, Alexandre Kaiser

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2405.17573: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.17573&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[365] Proximity Matters: Local Proximity Enhanced Balancing for Treatment Effect Estimation

Hao Wang, Zhichao Chen, Zhaoran Liu, Xu Chen, Haoxuan Li, Zhouchen Lin

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to determine conclusion due to access restrictions

Abstract: Failed to fetch summary for 2407.01111: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.01111&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[366] Dynamic Neural Potential Field: Online Trajectory Optimization in the Presence of Moving Obstacles

Aleksei Staroverov, Muhammad Alhaddad, Aditya Narendra, Konstantin Mironov, Aleksandr Panov

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2410.06819: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.06819&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[367] Symmetry-Guided Memory Augmentation for Efficient Locomotion Learning

Kaixi Bao, Chenhao Li, Yarden As, Andreas Krause, Marco Hutter

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2502.01521: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.01521&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[368] Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control

Yifeng Zhang, Yilin Liu, Ping Gong, Peizhuo Li, Mingfeng Fan, Guillaume Sartoretti

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2503.11488: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.11488&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[369] Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Adnan Oomerjee, Zafeirios Fountas, Haitham Bou-Ammar, Jun Wang

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2505.16950: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16950&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[370] Wideband RF Radiance Field Modeling Using Frequency-embedded 3D Gaussian Splatting

Zechen Li, Lanqing Yang, Yiheng Bian, Hao Pan, Yongjian Fu, Yezhou Wang, Zhuxi Chen, Yi-Chao Chen, Guangtao Xue

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2505.20714: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.20714&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[371] Enhancing Jailbreak Attacks on LLMs via Persona Prompts

Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2507.22171: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.22171&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[372] Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting

Yifan Hu, Jie Yang, Tian Zhou, Peiyuan Liu, Yujin Tang, Rong Jin, Liang Sun

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2509.14181: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.14181&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[373] Fiaingen: A financial time series generative method matching real-world data quality

Jože M. Rožanec, Tina Žezlin, Laurentiu Vasiliu, Dunja Mladenić, Radu Prodan, Dumitru Roman

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.01169: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01169&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[374] From Imperative to Declarative: Towards LLM-friendly OS Interfaces for Boosted Computer-Use Agents

Yuan Wang, Mingyu Li, Haibo Chen

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2510.04607: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04607&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[375] From Prompts to Packets: A View from the Network on ChatGPT, Copilot, and Gemini

Antonio Montieri, Alfredo Nascita, Antonio Pescapè

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2510.11269: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.11269&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[376] Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Divyat Mahajan, Sachin Goyal, Badr Youbi Idrissi, Mohammad Pezeshki, Ioannis Mitliagkas, David Lopez-Paz, Kartik Ahuja

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze paper content

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2510.14751: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14751&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[377] OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning

Woo-Jin Ahn, Sang-Ryul Baek, Yong-Jun Lee, Hyun-Duck Choi, Myo-Taeg Lim

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2510.15495: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15495&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan

Main category: cs.AI

TL;DR: Paper 2511.06767: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing abstract

Method: Cannot determine method due to missing abstract

Result: Cannot determine results due to missing abstract

Conclusion: Cannot determine conclusion due to missing abstract

Abstract: Failed to fetch summary for 2511.06767: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06767&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[379] Uncertainty Makes It Stable: Curiosity-Driven Quantized Mixture-of-Experts

Sebastián Andrés Cajas Ordóñez, Luis Fernando Torres Torres, Mackenzie J. Meni, Carlos Andrés Duran Paredes, Eric Arazo, Cristian Bosch, Ricardo Simon Carbajo, Yuan Lai, Leo Anthony Celi

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.11743: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11743&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[380] Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning

Radman Rakhshandehroo, Daniel Coombs

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.18000: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18000&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[381] Goal-Oriented Multi-Agent Semantic Networking: Unifying Intents, Semantics, and Intelligence

Shutong Chen, Qi Liao, Adnan Aijaz, Yansha Deng

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to data fetch failure

Method: Unable to determine method due to data fetch failure

Result: Unable to determine results due to data fetch failure

Conclusion: Unable to draw conclusions due to data fetch failure

Abstract: Failed to fetch summary for 2512.01035: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01035&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[382] ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

Guoqiang Zou, Wanyu Wang, Hao Zheng, Longxiang Yin, Yinhe Han

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2512.09427 appears to be an arXiv paper, but the system cannot retrieve its abstract or content at this time.

Details

Motivation: Cannot determine motivation without access to the paper content. The HTTP 429 error indicates the arXiv API is rate limiting requests, preventing retrieval of the paper details.

Method: Cannot determine method without access to the paper content. The system attempted to fetch the paper from arXiv but encountered rate limiting restrictions.

Result: No results can be analyzed since the paper content could not be retrieved. The only available information is the paper ID (2512.09427) which suggests it’s a recent arXiv submission from December 2024.

Conclusion: Cannot provide conclusions about the paper without access to its content. The analysis is limited by the inability to fetch the paper details from arXiv due to rate limiting.

Abstract: Failed to fetch summary for 2512.09427: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09427&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[383] Deep Neural Networks as Discrete Dynamical Systems: Implications for Physics-Informed Learning

Abhisek Ganguly, Santosh Ansumali, Sauro Succi

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2601.00473: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.00473&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[384] PASTA: A Scalable Framework for Multi-Policy AI Compliance Evaluation

Yu Yang, Ig-Jae Kim, Dongwook Yoon

Main category: cs.AI

TL;DR: Unable to analyze paper 2601.11702 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: No method information available due to fetch failure

Result: No results available due to fetch failure

Conclusion: Cannot provide conclusion without access to paper content

Abstract: Failed to fetch summary for 2601.11702: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11702&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[385] HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

Kla Tantithamthavorn, Hong Yi Lin, Patanamon Thongtanunam, Wachiraphan Charoenwet, Minwoo Jeong, Ming Wu

Main category: cs.AI

TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Unable to determine motivation due to abstract fetch failure

Method: Unable to determine method due to abstract fetch failure

Result: Unable to determine results due to abstract fetch failure

Conclusion: Unable to determine conclusion due to abstract fetch failure

Abstract: Failed to fetch summary for 2601.19072: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19072&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[386] On Randomness in Agentic Evals

Bjarni Haukur Bjarnason, André Silva, Martin Monperrus

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.07150: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07150&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[387] AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.07906 could not be retrieved from arXiv API.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2602.07906: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07906&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[388] KRONE: Hierarchical and Modular Log Anomaly Detection

Lei Ma, Jinyang Liu, Tieying Zhang, Peter M. VanNostrand, Dennis M. Hofmann, Lei Cao, Elke A. Rundensteiner, Jianjun Chen

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.07303: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07303&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[389] Smooth Gate Functions for Soft Advantage Policy Optimization

Egor Denisov, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2602.19345: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19345&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[390] Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies

Xiang Li, Yuheng Zhang, Nan Jiang

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.23811 could not be retrieved from arXiv API.

Details

Motivation: Cannot determine motivation as paper content is unavailable.

Method: Cannot determine method as paper content is unavailable.

Result: Cannot determine results as paper content is unavailable.

Conclusion: Cannot draw conclusions as paper content is unavailable.

Abstract: Failed to fetch summary for 2602.23811: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23811&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[391] OSS-CRS: Liberating AIxCC Cyber Reasoning Systems for Real-World Open-Source Security

Andrew Chin, Dongkwan Kim, Yu-Fu Fu, Fabian Fleischer, Youngjoon Kim, HyungSeok Han, Cen Zhang, Brian Junekyu Lee, Hanqing Zhao, Taesoo Kim

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to data fetch failure

Method: Unable to determine method due to data fetch failure

Result: Unable to determine results due to data fetch failure

Conclusion: Unable to determine conclusion due to data fetch failure

Abstract: Failed to fetch summary for 2603.08566: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08566&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Anupam Purwar, Aditya Choudhary

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2603.09643: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09643&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[393] FedPBS: Proximal-Balanced Scaling Federated Learning Model for Robust Personalized Training for Non-IID Data

Eman M. AbouNassar, Amr Elshall, Sameh Abdulah

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2603.13909: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13909&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[394] Exploring Collatz Dynamics with Human-LLM Collaboration

Edward Y. Chang

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.11066 exists but cannot be analyzed without the abstract content.

Details

Motivation: Cannot determine motivation without access to the paper abstract.

Method: Cannot determine method without access to the paper abstract.

Result: Cannot determine results without access to the paper abstract.

Conclusion: Cannot draw conclusions without access to the paper abstract.

Abstract: Failed to fetch summary for 2603.11066: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11066&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[395] LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Unable to determine method due to API rate limiting preventing access to paper content

Result: Unable to determine results due to API rate limiting preventing access to paper content

Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content

Abstract: Failed to fetch summary for 2603.19312: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19312&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[396] 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Kadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2603.15970: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15970&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[397] Agent Control Protocol: Admission Control for Agent Actions

Marcelo Fernandez

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to draw conclusions due to access error

Abstract: Failed to fetch summary for 2603.18829: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18829&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[398] mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval failure

Method: Unable to determine method due to retrieval failure

Result: Unable to determine results due to retrieval failure

Conclusion: Unable to determine conclusion due to retrieval failure

Abstract: Failed to fetch summary for 2603.21606: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21606&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[399] An Agentic Multi-Agent Architecture for Cybersecurity Risk Management

Ravish Gupta, Saket Kumar, Shreeya Sharma, Maulik Dang, Abhishek Aggarwal

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.20131: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20131&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[400] Extending Precipitation Nowcasting Horizons via Spectral Fusion of Radar Observations and Foundation Model Priors

Yuze Qin, Qingyong Li, Zhiqing Guo, Wen Wang, Yan Liu, Yangli-ao Geng

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.21768: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21768&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[401] DeepXplain: XAI-Guided Autonomous Defense Against Multi-Stage APT Campaigns

Trung V. Phan, Thomas Bauschert

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2603.21296: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21296&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[402] LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study

Shuai Wang, Yinan Yu, Earl Barr, Dhasarathy Parthasarathy

Main category: cs.AI

TL;DR: Paper ID 2603.21439 could not be analyzed due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Unable to determine motivation due to API rate limiting error preventing access to paper content

Method: No method information available - paper content could not be retrieved

Result: No results available - the paper summary could not be fetched

Conclusion: Cannot draw conclusions about an inaccessible paper

Abstract: Failed to fetch summary for 2603.21439: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21439&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[403] Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection

Junhyeok Rui Cha, Woohyun Cha, Jaeyong Shin, Donghyeon Kim, Jaeheung Park

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2603.21853: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21853&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[404] Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution

Yechao Zhang, Shiqian Zhao, Jie Zhang, Gelei Deng, Jiawen Zhang, Xiaogeng Liu, Chaowei Xiao, Tianwei Zhang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.23064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.SD

[405] Echoes: A semantically-aligned music deepfake detection dataset

Octavian Pascu, Dan Oneata, Horia Cucu, Nicolas M. Muller

Main category: cs.SD

TL;DR: Echoes is a challenging music deepfake detection dataset with 3,577 tracks from 10 AI music generation systems, designed with semantic alignment to prevent shortcut learning and promote robust generalization.

Details

Motivation: Existing AI-generated music datasets lack provider diversity and semantic alignment, leading to shortcut learning and poor generalization in music deepfake detection systems.

Method: Created Echoes dataset with 3,577 tracks (110 hours) spanning multiple genres, generated by 10 AI music systems, with semantic-level alignment between spoofed audio and bona fide references achieved by conditioning generated samples on original waveforms or song descriptors.

Result: Echoes is the hardest in-domain dataset; detectors trained on existing datasets transfer poorly to Echoes; training on Echoes yields strongest generalization performance across datasets.

Conclusion: Provider diversity and semantic alignment help learn more transferable detection cues for music deepfake detection, making Echoes a valuable benchmark for robust detection systems.

Abstract: We introduce Echoes, a new dataset for music deepfake detection designed for training and benchmarking detectors under realistic and provider-diverse conditions. Echoes comprises 3,577 tracks (110 hours of audio) spanning multiple genres (pop, rock, electronic), and includes content generated by ten popular AI music generation systems. To prevent shortcut learning and promote robust generalization, the dataset is deliberately constructed to be challenging, enforcing semantic-level alignment between spoofed audio and bona fide references. This alignment is achieved by conditioning generated audio samples directly on bona-fide waveforms or song descriptors. We evaluate Echoes in a cross-dataset setting against three existing AI-generated music datasets using state-of-the-art Wav2Vec2 XLS-R 2B representations. Results show that (i) Echoes is the hardest in-domain dataset; (ii) detectors trained on existing datasets transfer poorly to Echoes; (iii) training on Echoes yields the strongest generalization performance. These findings suggest that provider diversity and semantic alignment help learn more transferable detection cues.

[406] Variable-Length Audio Fingerprinting

Hongjie Chen, Hanyu Meng, Huimin Zeng, Ryan A. Rossi, Lie Lu, Josh Kimball

Main category: cs.SD

TL;DR: VLAFP introduces variable-length audio fingerprinting using deep learning, overcoming rigid fixed-length segmentation limitations for improved audio identification and retrieval.

Details

Motivation: Existing deep learning audio fingerprinting methods rigidly process fixed-length audio segments, neglecting temporal dynamics and limiting practical applications where audio lengths vary naturally.

Method: Proposes Variable-Length Audio FingerPrinting (VLAFP), a deep learning model that supports variable-length audio processing for both training and testing, enabling more flexible and dynamic audio representation.

Result: VLAFP outperforms state-of-the-art methods in live audio identification and audio retrieval across three real-world datasets, demonstrating superior performance with variable-length inputs.

Conclusion: VLAFP represents a significant advancement in audio fingerprinting by enabling variable-length processing, addressing practical limitations of existing methods and improving real-world audio identification performance.

Abstract: Audio fingerprinting converts audio to much lower-dimensional representations, allowing distorted recordings to still be recognized as their originals through similar fingerprints. Existing deep learning approaches rigidly fingerprint fixed-length audio segments, thereby neglecting temporal dynamics during segmentation. To address limitations due to this rigidity, we propose Variable-Length Audio FingerPrinting (VLAFP), a novel method that supports variable-length fingerprinting. To the best of our knowledge, VLAFP is the first deep audio fingerprinting model capable of processing audio of variable length, for both training and testing. Our experiments show that VLAFP outperforms existing state-of-the-arts in live audio identification and audio retrieval across three real-world datasets.

[407] Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model

Kangxiang Xia, Bingshen Mu, Xian Shi, Jin Xu, Lei Xie

Main category: cs.SD

TL;DR: LLM-based semantic interruption detection model for spoken dialogue systems with new benchmark and metric to balance responsiveness and robustness.

Details

Motivation: Current interruption detection methods for spoken dialogue systems are polarized between trigger-happy VAD-based approaches (misinterpreting backchannels) and robust end-to-end models (with unacceptable delays). Lack of real-world benchmarks and holistic metrics hinders progress.

Method: Introduces SID-Bench (first real-world benchmark for semantic-aware interruption detection), proposes Average Penalty Time (APT) metric to assess responsiveness-robustness trade-off, and designs LLM-based detection model optimized through novel training paradigm to capture semantic intent cues.

Result: Model significantly outperforms mainstream baselines, achieving nearly threefold reduction in APT. Successfully resolves tension between speed and stability, establishing new state-of-the-art for intelligent interruption handling.

Conclusion: Presents comprehensive framework for interruption detection in spoken dialogue systems, balancing responsiveness and robustness through semantic understanding. Provides benchmark and code for future research.

Abstract: Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between “trigger-happy” VAD-based methods that misinterpret backchannels and robust end-to-end models that exhibit unacceptable response delays. Moreover, the absence of real-world benchmarks and holistic metrics hinders progress in the field. This paper presents a comprehensive frame-work to overcome these limitations. We first introduce SID-Bench, the first benchmark for semantic-aware interruption detection built entirely from real-world human dialogues. To provide a rigorous assessment of the responsiveness-robustness trade-off, we propose the Average Penalty Time (APT) metric, which assigns a temporal cost to both false alarms and late responses. Building on this framework, we design an LLM-based detection model optimized through a novel training paradigm to capture subtle semantic cues of intent. Experimental results show that our model significantly outperforms mainstream baselines, achieving a nearly threefold reduction in APT. By successfully resolving the long-standing tension between speed and stability, our work establishes a new state-of-the-art for intelligent interruption handling in SDS. To facilitate future research, SID-Bench and the associated code are available at: https://github.com/xkx-hub/SID-bench.

[408] Bridging Biological Hearing and Neuromorphic Computing: End-to-End Time-Domain Audio Signal Processing with Reservoir Computing

Rinku Sebastian, Simon O’Keefe, Martin Trefzer

Main category: cs.SD

TL;DR: Novel reservoir computing approach for real-time MFCC extraction using time-domain convolution instead of frequency-domain transformations, enabling efficient audio processing for embedded systems.

Details

Motivation: Current audio signal processing lacks human-level precision and efficiency, especially for real-time applications. Traditional MFCC extraction relies on computationally intensive time-frequency transformations that limit deployment in embedded systems and voice-driven applications.

Method: Proposes using reservoir computing to streamline MFCC extraction by replacing conventional frequency-domain conversions with convolution operations, eliminating complex transformations while maintaining feature discriminability. Develops an end-to-end audio processing framework integrating this approach.

Result: Demonstrates potential for efficient and real-time speech analysis, contributing to energy-efficient audio processing technologies that can be deployed in embedded systems and voice-driven applications.

Conclusion: Bridges biologically inspired feature extraction with neuromorphic computing, offering a scalable solution for next-generation speech recognition systems with improved efficiency and real-time capabilities.

Abstract: Despite the advancements in cutting-edge technologies, audio signal processing continues to pose challenges and lacks the precision of a human speech processing system. To address these challenges, we propose a novel approach to simplify audio signal processing by leveraging time-domain techniques and reservoir computing. Through our research, we have developed a real-time audio signal processing system by simplifying audio signal processing through the utilization of reservoir computers, which are significantly easier to train. Feature extraction is a fundamental step in speech signal processing, with Mel Frequency Cepstral Coefficients (MFCCs) being a dominant choice due to their perceptual relevance to human hearing. However, conventional MFCC extraction relies on computationally intensive time-frequency transformations, limiting efficiency in real-time applications. To address this, we propose a novel approach that leverages reservoir computing to streamline MFCC extraction. By replacing traditional frequency-domain conversions with convolution operations, we eliminate the need for complex transformations while maintaining feature discriminability. We present an end-to-end audio processing framework that integrates this method, demonstrating its potential for efficient and real-time speech analysis. Our results contribute to the advancement of energy-efficient audio processing technologies, enabling seamless deployment in embedded systems and voice-driven applications. This work bridges the gap between biologically inspired feature extraction and modern neuromorphic computing, offering a scalable solution for next-generation speech recognition systems.

[409] Enhancing Efficiency and Performance in Deepfake Audio Detection through Neuron-level dropin & Neuroplasticity Mechanisms

Yupei Li, Shuaijie Shao, Manuel Milling, Björn Schuller

Main category: cs.SD

TL;DR: Novel dropin and plasticity algorithms dynamically adjust neuron counts in audio deepfake detection models, improving efficiency and reducing error rates across multiple architectures including ResNet, GRU, and Wav2Vec.

Details

Motivation: Current audio deepfake detection models face computational bottlenecks when scaling parameters. Existing low-rank adaptation methods are limited to attention-based architectures, and simply stacking layers is expensive. Inspired by neuronal plasticity in mammalian brains, the authors seek more flexible parameter modulation.

Method: Proposed dropin and plasticity algorithms that dynamically adjust the number of neurons in certain layers to flexibly modulate model parameters. Evaluated on multiple architectures including ResNet, Gated Recurrent Neural Networks (GRU), and Wav2Vec for audio deepfake detection.

Result: Experimental results on ASVSpoof2019 LA, PA, and FakeorReal datasets show consistent computational efficiency improvements with dropin, and maximum relative reductions of ~39% and ~66% in Equal Error Rate with dropin and plasticity approaches respectively.

Conclusion: The proposed plasticity-inspired algorithms provide effective parameter modulation for audio deepfake detection across diverse architectures, offering computational efficiency and significant error rate reductions without full model retraining.

Abstract: Current audio deepfake detection has achieved remarkable performance using diverse deep learning architectures such as ResNet, and has seen further improvements with the introduction of large models (LMs) like Wav2Vec. The success of large language models (LLMs) further demonstrates the benefits of scaling model parameters, but also highlights one bottleneck where performance gains are constrained by parameter counts. Simply stacking additional layers, as done in current LLMs, is computationally expensive and requires full retraining. Furthermore, existing low-rank adaptation methods are primarily applied to attention-based architectures, which limits their scope. Inspired by the neuronal plasticity observed in mammalian brains, we propose novel algorithms, dropin and further plasticity, that dynamically adjust the number of neurons in certain layers to flexibly modulate model parameters. We evaluate these algorithms on multiple architectures, including ResNet, Gated Recurrent Neural Networks, and Wav2Vec. Experimental results using the widely recognised ASVSpoof2019 LA, PA, and FakeorReal dataset demonstrate consistent improvements in computational efficiency with the dropin approach and a maximum of around 39% and 66% relative reduction in Equal Error Rate with the dropin and plasticity approach among these dataset, respectively. The code and supplementary material are available at Github link.

[410] OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu

Main category: cs.SD

TL;DR: OmniCustom is a DiT-based framework for synchronously customizing both video identity and audio timbre using reference images and audio, while following text prompts.

Details

Result: Extensive experiments show OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.

Conclusion: Proposes a novel sync audio-video customization task and presents OmniCustom as an effective solution for joint audio-video generation with identity and timbre control.

[411] Iterate to Differentiate: Enhancing Discriminability and Reliability in Zero-Shot TTS Evaluation

Shengfan Shen, Di Wu, Xingchen Song, Dinghao Zhou, Liumeng Xue, Meng Meng, Jian Luan, Shuai Wang

Main category: cs.SD

TL;DR: I2D is an evaluation framework for zero-shot TTS that uses iterative synthesis to amplify performance differences between models by measuring their resilience to distributional shifts.

Details

Motivation: Current evaluation methods for zero-shot TTS are inadequate - subjective tests are expensive and non-reproducible, while objective metrics often saturate and fail to distinguish state-of-the-art systems.

Method: I2D recursively synthesizes speech using the model’s own outputs as references, creating a distributional shift. Higher-quality models degrade more slowly across iterations, and this differential degradation is measured to amplify performance gaps.

Result: I2D significantly improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotion datasets demonstrate reliable automated evaluation.

Conclusion: I2D provides a more reliable automated evaluation framework for zero-shot TTS by exploiting iterative synthesis to reveal model robustness and amplify performance differences.

Abstract: Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish SOTA systems. To address this, we propose Iterate to Differentiate (I2D), an evaluation framework that recursively synthesizes speech using the model’s own outputs as references. Higher-quality models exhibit greater resilience to the distributional shift induced by iterative synthesis, resulting in slower performance degradation. I2D exploits this differential degradation to amplify performance gaps and reveal robustness. By aggregating objective metrics across iterations, I2D improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotion datasets demonstrate that I2D enables more reliable automated evaluation for zero-shot TTS.

[412] What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification

Massa Baali, Sarthak Bisht, Rita Singh, Bhiksha Raj

Main category: cs.SD

TL;DR: Curry (CURriculum Ranking) is an adaptive loss for large-scale speaker verification that estimates sample difficulty online using Sub-center ArcFace confidence scores, ranking samples into easy/medium/hard tiers to guide learning from stable foundations to boundary sharpening.

Details

Motivation: Fixed-margin losses treat all samples equally regardless of quality, but mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds in large-scale speaker verification.

Method: Uses Sub-center ArcFace to estimate sample difficulty online via dominant sub-center cosine similarity. Samples are ranked into easy, medium, and hard tiers using running batch statistics without auxiliary annotations. Learnable weights guide the model through three learning stages.

Result: Achieves 86.8% and 60.0% EER reduction over Sub-center ArcFace baseline on VoxCeleb1-O and SITW datasets. Represents the largest-scale speaker verification system trained to date.

Conclusion: Curry establishes a new paradigm for robust speaker verification on imperfect large-scale data by adaptively handling sample difficulty and reducing the impact of noisy gradients.

Abstract: Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O, and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.

[413] An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech

Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia

Main category: cs.SD

TL;DR: Speech foundation model for depression detection using long-duration speech with interpretable acoustic feature analysis

Details

Motivation: To develop clinically applicable speech-based depression detection tools that are interpretable and reliable, addressing limitations of existing approaches

Method: Speech-level Audio Spectrogram Transformer (AST) using long-duration speech instead of short segments, with novel interpretation method to reveal prediction-relevant acoustic features

Result: Proposed model outperforms segment-level AST, identifies reduced loudness and F0 as relevant depression signals aligning with clinical findings

Conclusion: Interpretable speech foundation model enhances clinical applicability of depression detection tools through responsible AI approach

Abstract: Speech-based depression detection tools could aid early screening. Here, we propose an interpretable speech foundation model approach to enhance the clinical applicability of such tools. We introduce a speech-level Audio Spectrogram Transformer (AST) to detect depression using long-duration speech instead of short segments, along with a novel interpretation method that reveals prediction-relevant acoustic features for clinician interpretation. Our experiments show the proposed model outperforms a segment-level AST, highlighting the impact of segment-level labelling noise and the advantage of leveraging longer speech duration for more reliable depression detection. Through interpretation, we observe our model identifies reduced loudness and F0 as relevant depression signals, aligning with documented clinical findings. This interpretability supports a responsible AI approach for speech-based depression detection, rendering such tools more clinically applicable.

[414] DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Trained Speech Foundational Model

Massa Baali, Rita Singh, Bhiksha Raj

Main category: cs.SD

TL;DR: DELULU is a speaker-aware self-supervised speech model that incorporates speaker-discriminative features into pseudo-label generation using frame-level embeddings from a speaker verification model, achieving significant improvements on speaker-centric tasks without task-specific fine-tuning.

Details

Motivation: Current self-supervised speech models excel at content-driven tasks but lack strong speaker-discriminative features needed for verification, diarization, and profiling applications. There's a need for foundational models that can capture speaker identity while maintaining general speech understanding capabilities.

Method: DELULU uses speaker-informed structure in pseudo-label generation by leveraging frame-level embeddings from ReDimNet (a state-of-the-art speaker verification model) to guide k-means clustering during pre-training. This introduces speaker-discriminative inductive bias that aligns representation learning with speaker identity while maintaining self-supervised learning principles.

Result: DELULU significantly outperforms prior SSL models across speaker-centric tasks, achieving up to 62% relative improvement in EER for speaker verification and consistent gains on zero-shot profiling tasks (gender, age, accent, speaker counting). Notably, it surpasses even its teacher model on zero-shot evaluations.

Conclusion: DELULU serves as a strong universal encoder for speaker-aware speech processing, enabling superior performance without task-specific fine-tuning. It successfully bridges the gap between content-driven and speaker-discriminative speech representation learning.

Abstract: Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce \textsc{DELULU}, a speaker-aware self-trained foundational model that addresses this limitation by incorporating speaker-informed structure into pseudo-label generation. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide k-means clustering during pre-training, introducing a speaker-discriminative inductive bias that aligns representation learning with speaker identity. DELULU significantly outperforms prior SSL models across a range of speaker-centric tasks, achieving up to \textbf{62% relative improvement} in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks including gender, age, accent, and speaker counting; notably surpassing even its teacher model on zero-shot evaluations. Our findings demonstrate that \textbf{DELULU is a strong universal encoder for speaker-aware speech processing}, enabling superior performance without task-specific fine-tuning.

cs.LG

[415] Beyond Accuracy: Introducing a Symbolic-Mechanistic Approach to Interpretable Evaluation

Reza Habibi, Darian Lee, Magy Seif El-Nasr

Main category: cs.LG

TL;DR: Paper proposes mechanism-aware evaluation combining symbolic rules with mechanistic interpretability to distinguish genuine generalization from shortcuts like memorization, demonstrated on NL-to-SQL tasks.

Details

Motivation: Accuracy-based evaluation fails to distinguish genuine generalization from shortcuts (memorization, leakage, brittle heuristics), especially in small-data regimes, leading to misleading assessments of model competence.

Method: Combines task-relevant symbolic rules with mechanistic interpretability to create algorithmic pass/fail scores that reveal exactly where models generalize versus exploit patterns. Demonstrated on NL-to-SQL by training identical architectures with/without schema information.

Result: Standard evaluation shows memorization model achieves 94% field-name accuracy on unseen data, falsely suggesting competence. Symbolic-mechanistic evaluation reveals this model violates core schema generalization rules, a failure invisible to accuracy metrics.

Conclusion: Mechanism-aware evaluation is necessary to properly assess model generalization capabilities beyond superficial accuracy metrics, revealing whether models truly understand tasks or exploit shortcuts.

Abstract: Accuracy-based evaluation cannot reliably distinguish genuine generalization from shortcuts like memorization, leakage, or brittle heuristics, especially in small-data regimes. In this position paper, we argue for mechanism-aware evaluation that combines task-relevant symbolic rules with mechanistic interpretability, yielding algorithmic pass/fail scores that show exactly where models generalize versus exploit patterns. We demonstrate this on NL-to-SQL by training two identical architectures under different conditions: one without schema information (forcing memorization), one with schema (enabling grounding). Standard evaluation shows the memorization model achieves 94% field-name accuracy on unseen data, falsely suggesting competence. Our symbolic-mechanistic evaluation reveals this model violates core schema generalization rules, a failure invisible to accuracy metrics.

[416] Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction

Haoyu Wang, Yuxin Chen, Liang Luo, Buyun Zhang, Ellie Dingqiao Wen, Pan Li

Main category: cs.LG

TL;DR: ITPO introduces implicit turn-wise policy optimization for multi-turn human-AI collaboration, using process reward models to derive fine-grained turn-level rewards from sparse outcome signals, improving convergence in tasks like math tutoring and medical recommendation.

Details

Motivation: Multi-turn human-AI collaboration is crucial for interactive services like adaptive tutoring and conversational recommendation, but RL optimization is hindered by sparse intermediate rewards and high stochasticity of user responses.

Method: ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. These turn-level signals are more robust than volatile token-level rewards and use normalization for training stability.

Result: ITPO combined with PPO, GRPO, or RLOO consistently achieves improved convergence over existing baselines across three multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation.

Conclusion: ITPO effectively addresses reward sparsity in multi-turn collaboration by inferring semantically aligned turn-wise preferences, enabling more stable and efficient RL optimization for interactive AI services.

Abstract: Multi-turn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn-wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals exhibit superior robustness and may utilize a normalization mechanism to further enhance training stability. We evaluate ITPO across three representative multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, consistently achieves improved convergence than existing baselines. Elaborate trajectory analysis confirms that ITPO infers turn-wise preferences that are semantically aligned with human judgment. Code is publicly available at https://github.com/Graph-COM/ITPO.

[417] Upper Entropy for 2-Monotone Lower Probabilities

Tuan-Anh Vu, Sébastien Destercke, Frédéric Pichon

Main category: cs.LG

TL;DR: The paper provides computational analysis and improved algorithms for calculating upper entropies in credal uncertainty quantification, showing the problem has a strongly polynomial solution.

Details

Motivation: Upper entropy is a key uncertainty measure in credal approaches that model uncertainty as probability sets, important for tasks like model selection, regularization, active learning, and OOD detection. However, computational aspects of upper entropies need better algorithmic understanding.

Method: The paper conducts exhaustive algorithmic and complexity analysis of upper entropy computation problems. It shows the problem has a strongly polynomial solution and proposes significant improvements over past algorithms, particularly for 2-monotone lower probabilities and their specific cases.

Result: The research demonstrates that upper entropy computation has a strongly polynomial solution and provides improved algorithms that outperform previous approaches for 2-monotone lower probabilities and related cases.

Conclusion: The paper advances the computational understanding of upper entropies in uncertainty quantification, providing efficient algorithmic solutions that can enhance practical applications in model selection, active learning, and OOD detection.

Abstract: Uncertainty quantification is a key aspect in many tasks such as model selection/regularization, or quantifying prediction uncertainties to perform active learning or OOD detection. Within credal approaches that consider modeling uncertainty as probability sets, upper entropy plays a central role as an uncertainty measure. This paper is devoted to the computational aspect of upper entropies, providing an exhaustive algorithmic and complexity analysis of the problem. In particular, we show that the problem has a strongly polynomial solution, and propose many significant improvements over past algorithms proposed for 2-monotone lower probabilities and their specific cases.

[418] Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG

Seungju Han, Konwoo Kim, Chanwoo Park, Benjamin Newman, Suhas Kotha, Jaehun Jung, James Zou, Yejin Choi

Main category: cs.LG

TL;DR: Synthetic Mixed Training combines synthetic QAs and documents to outperform RAG in data-constrained domains, achieving 2.6-4.4% relative gains on long-document comprehension benchmarks.

Details

Motivation: Existing synthetic data augmentation methods have diminishing returns below RAG performance when scaled naively. The paper aims to break the RAG ceiling for knowledge acquisition in data-constrained domains.

Method: Introduces Synthetic Mixed Training that combines synthetic QAs and synthetic documents to leverage complementary training signals. Also proposes Focal Rewriting for synthetic document generation that conditions document generation on specific questions to improve diversity.

Result: Achieves 2.6% relative gain over RAG on QuaLITY benchmark, with final recipe achieving 4.4% relative gain. Outperforms RAG in 5 of 6 settings across models and benchmarks (QuaLITY, LongHealth, FinanceBench), with 9.1% gain when combined with RAG.

Conclusion: Synthetic Mixed Training with Focal Rewriting enables log-linear improvements with increasing synthetic data volume and generator strength, allowing models to surpass RAG performance in data-constrained domains.

Abstract: Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators yields diminishing returns below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals, and enables log-linear improvements as both synthetic data volume and generator strength increase. This allows the model to outperform RAG by a 2.6% relative gain on QuaLITY, a long-document reading comprehension benchmark. In addition, we introduce Focal Rewriting, a simple technique for synthetic document generation that explicitly conditions document generation on specific questions, improving the diversity of synthetic documents and yielding a steeper log-linear scaling curve. On QuaLITY, our final recipe trains a Llama 8B model that outperforms RAG by 4.4% relatively. Across models and benchmarks (QuaLITY, LongHealth, FinanceBench), our training enables models to beat RAG in five of six settings, outperforms by 2.6%, and achieves a 9.1% gain when combined with RAG.

[419] Safe Reinforcement Learning with Preference-based Constraint Inference

Chenglin Li, Guangchun Ruan, Hua Geng

Main category: cs.LG

TL;DR: PbCRL is a preference-based constrained RL method that improves safety constraint inference from human preferences using dead zone mechanisms and SNR loss for better constraint alignment and exploration.

Details

Motivation: Real-world safety constraints are complex and hard to specify explicitly. Existing constraint inference methods rely on restrictive assumptions or extensive demonstrations, while preference-based approaches using Bradley-Terry models fail to capture asymmetric, heavy-tailed safety costs, leading to risk underestimation.

Method: Proposes Preference-based Constrained RL (PbCRL) with: 1) Dead zone mechanism in preference modeling to encourage heavy-tailed cost distributions, 2) Signal-to-Noise Ratio (SNR) loss to encourage exploration via cost variances, 3) Two-stage training strategy to reduce online labeling burden while adaptively enhancing constraint satisfaction.

Result: Empirical results show PbCRL achieves superior alignment with true safety requirements and outperforms state-of-the-art baselines in terms of both safety and reward performance.

Conclusion: PbCRL provides an effective approach for constraint inference in Safe RL, exploring a promising direction for safety-critical applications by better capturing safety preferences through improved preference modeling.

Abstract: Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to explicitly specify. Existing works on constraint inference rely on restrictive assumptions or extensive expert demonstrations, which is not realistic in many real-world applications. How to cheaply and reliably learn these constraints is the major challenge we focus on in this study. While inferring constraints from human preferences offers a data-efficient alternative, we identify the popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, resulting in risk underestimation. It is still rare in the literature to understand the impacts of BT models on the downstream policy learning. To address the above knowledge gaps, we propose a novel approach namely Preference-based Constrained Reinforcement Learning (PbCRL). We introduce a novel dead zone mechanism into preference modeling and theoretically prove that it encourages heavy-tailed cost distributions, thereby achieving better constraint alignment. Additionally, we incorporate a Signal-to-Noise Ratio (SNR) loss to encourage exploration by cost variances, which is found to benefit policy learning. Further, two-stage training strategy are deployed to lower online labeling burdens while adaptively enhancing constraint satisfaction. Empirical results demonstrate that PbCRL achieves superior alignment with true safety requirements and outperforms the state-of-the-art baselines in terms of safety and reward. Our work explores a promising and effective way for constraint inference in Safe RL, which has great potential in a range of safety-critical applications.

[420] AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

Jiehao Wu, Zixiao Huang, Wenhao Li, Chuyun Shen, Junjie Sheng, Xiangfeng Wang

Main category: cs.LG

TL;DR: AscendOptimizer: An AI agent that optimizes AscendC operators for Huawei NPUs through evolutionary search for host-side tiling and data movement, combined with kernel optimization via de-optimization trajectory mining and guided rewriting.

Details

Motivation: Optimizing AscendC operators on Huawei NPUs faces two challenges: lack of public reference implementations compared to CUDA ecosystem, and complex coupled optimization requiring both host-side tiling/data-movement orchestration and kernel instruction scheduling/pipelining.

Method: Two-part approach: 1) Host-side: profiling-in-the-loop evolutionary search to discover optimal tiling and data-movement configurations from hardware feedback. 2) Kernel-side: mining optimization motifs by systematically de-optimizing optimized kernels to create “bad-to-good” trajectories, distilling these into retrievable experience bank for guided rewriting. Alternates host tuning and kernel rewriting in closed loop.

Result: On 127 real AscendC operators benchmark, achieves 1.19x geometric-mean speedup over open-source baseline, with 49.61% of operators outperforming their references, beating strong agent and search baselines.

Conclusion: AscendOptimizer successfully bootstraps missing expertise for AscendC optimization through execution-as-experience approach, combining evolutionary search and optimization pattern mining to overcome knowledge bottlenecks in Huawei NPU ecosystem.

Abstract: AscendC (Ascend C) operator optimization on Huawei Ascend neural processing units (NPUs) faces a two-fold knowledge bottleneck: unlike the CUDA ecosystem, there are few public reference implementations to learn from, and performance hinges on a coupled two-part artifact - a host-side tiling program that orchestrates data movement and a kernel program that schedules and pipelines instructions. We present AscendOptimizer, an episodic agent that bootstraps this missing expertise by turning execution into experience. On the host side, AscendOptimizer performs profiling-in-the-loop evolutionary search to discover valid and high-performing tiling and data-movement configurations directly from hardware feedback. On the kernel side, it mines transferable optimization motifs by rewinding optimized kernels - systematically de-optimizing them to synthesize instructive “bad-to-good” trajectories - and distills these motifs into a retrievable experience bank for guided rewriting. By alternating host tuning and kernel rewriting in a closed loop, AscendOptimizer steadily expands feasibility and pushes latency down. On a benchmark of 127 real AscendC operators, AscendOptimizer achieves a 1.19x geometric-mean speedup over the open-source baseline, with 49.61% of operators outperforming their references, outperforming strong agent and search baselines.

[421] Causal Reconstruction of Sentiment Signals from Sparse News Data

Stefania Stan, Marzio Lunghi, Vito Vargetto, Claudio Ricci, Rolands Repetto, Brayden Leo, Shao-Hong Gan

Main category: cs.LG

TL;DR: Proposes a causal signal reconstruction pipeline to transform sparse news sentiment classifications into stable temporal series, with evaluation showing consistent 3-week lead-lag patterns with stock prices.

Details

Motivation: Transforming raw article-level sentiment observations into reliable temporal series is an unsolved engineering problem due to sparsity, redundancy, and classifier uncertainty in news data.

Method: Three-stage modular pipeline: (1) aggregates article-level scores with uncertainty-aware weights, (2) fills gaps with causal projection rules, (3) applies causal smoothing. Uses label-free evaluation framework with signal stability diagnostics and counterfactual tests.

Result: Key finding: consistent three-week lead-lag pattern between reconstructed sentiment and stock prices across all pipeline configurations and aggregation regimes, more informative than single correlation coefficients.

Conclusion: Stable, deployable sentiment indicators require careful reconstruction pipelines, not just better classifiers, with causal signal reconstruction providing robust temporal series from sparse news data.

Abstract: Sentiment signals derived from sparse news are commonly used in financial analysis and technology monitoring, yet transforming raw article-level observations into reliable temporal series remains a largely unsolved engineering problem. Rather than treating this as a classification challenge, we propose to frame it as a causal signal reconstruction problem: given probabilistic sentiment outputs from a fixed classifier, recover a stable latent sentiment series that is robust to the structural pathologies of news data such as sparsity, redundancy, and classifier uncertainty. We present a modular three-stage pipeline that (i) aggregates article-level scores onto a regular temporal grid with uncertainty-aware and redundancy-aware weights, (ii) fills coverage gaps through strictly causal projection rules, and (iii) applies causal smoothing to reduce residual noise. Because ground-truth longitudinal sentiment labels are typically unavailable, we introduce a label-free evaluation framework based on signal stability diagnostics, information preservation lag proxies, and counterfactual tests for causality compliance and redundancy robustness. As a secondary external check, we evaluate the consistency of reconstructed signals against stock-price data for a multi-firm dataset of AI-related news titles (November 2024 to February 2026). The key empirical finding is a three-week lead lag pattern between reconstructed sentiment and price that persists across all tested pipeline configurations and aggregation regimes, a structural regularity more informative than any single correlation coefficient. Overall, the results support the view that stable, deployable sentiment indicators require careful reconstruction, not only better classifiers.

Zhiyuan Chen, Yuxuan Zhong, Fan Wang, Bo Yu, Pengtao Shao, Shaoshan Liu, Ning Ding

Main category: cs.LG

TL;DR: StateLinFormer: A linear-attention navigation model with stateful memory that preserves recurrent states across training segments to achieve long-horizon memory retention for navigation tasks.

Details

Motivation: Existing navigation approaches face a dilemma - modular systems lack flexibility while Transformer-based models are constrained by fixed context windows, limiting persistent memory across extended interactions. There's a need for models that can maintain long-term memory for both immediate generalization and sustained adaptation in navigation.

Method: StateLinFormer uses linear-attention with a stateful memory mechanism that preserves recurrent memory states across consecutive training segments instead of reinitializing them at each batch boundary. This training paradigm approximates learning on infinitely long sequences.

Result: Experiments in MAZE and ProcTHOR environments show StateLinFormer significantly outperforms stateless linear-attention counterparts and standard Transformer baselines with fixed context windows. As interaction length increases, persistent stateful training substantially improves context-dependent adaptation.

Conclusion: StateLinFormer’s stateful memory mechanism enhances long-horizon memory retention and improves In-Context Learning capabilities for navigation tasks, addressing limitations of both modular systems and fixed-context Transformer models.

Abstract: Effective navigation intelligence relies on long-term memory to support both immediate generalization and sustained adaptation. However, existing approaches face a dilemma: modular systems rely on explicit mapping but lack flexibility, while Transformer-based end-to-end models are constrained by fixed context windows, limiting persistent memory across extended interactions. We introduce StateLinFormer, a linear-attention navigation model trained with a stateful memory mechanism that preserves recurrent memory states across consecutive training segments instead of reinitializing them at each batch boundary. This training paradigm effectively approximates learning on infinitely long sequences, enabling the model to achieve long-horizon memory retention. Experiments across both MAZE and ProcTHOR environments demonstrate that StateLinFormer significantly outperforms its stateless linear-attention counterpart and standard Transformer baselines with fixed context windows. Notably, as interaction length increases, persistent stateful training substantially improves context-dependent adaptation, suggesting an enhancement in the model’s In-Context Learning (ICL) capabilities for navigation tasks.

[423] Dual-Criterion Curriculum Learning: Application to Temporal Data

Gaspard Abel, Eloi Campagne, Mohamed Benloughmari, Argyris Kalogeratos

Main category: cs.LG

TL;DR: DCCL framework combines loss-based and density-based criteria for curriculum learning in time-series forecasting, showing improved performance over loss-only baselines.

Details

Motivation: Curriculum Learning (CL) requires meaningful difficulty assessment measures, but existing heuristics are often application-specific and limited. The paper aims to create a more effective CL framework by combining two complementary views of instance difficulty.

Method: Proposes Dual-Criterion Curriculum Learning (DCCL) that combines: 1) loss-based criterion (training-based evidence), and 2) density-based criterion learned in data representation space (considers data sparseness amplifies learning difficulty). Evaluated on multivariate time-series forecasting benchmarks using One-Pass and Baby-Steps training schedules.

Result: Empirical results show density-based and hybrid dual-criterion curricula outperform loss-only baselines and standard non-CL training in time-series forecasting settings.

Conclusion: DCCL framework effectively combines complementary difficulty assessment criteria for curriculum learning, demonstrating improved performance in time-series forecasting tasks.

Abstract: Curriculum Learning (CL) is a meta-learning paradigm that trains a model by feeding the data instances incrementally according to a schedule, which is based on difficulty progression. Defining meaningful difficulty assessment measures is crucial and most usually the main bottleneck for effective learning, while also in many cases the employed heuristics are only application-specific. In this work, we propose the Dual-Criterion Curriculum Learning (DCCL) framework that combines two views of assessing instance-wise difficulty: a loss-based criterion is complemented by a density-based criterion learned in the data representation space. Essentially, DCCL calibrates training-based evidence (loss) under the consideration that data sparseness amplifies the learning difficulty. As a testbed, we choose the time-series forecasting task. We evaluate our framework on multivariate time-series benchmarks under standard One-Pass and Baby-Steps training schedules. Empirical results show the interest of density-based and hybrid dual-criterion curricula over loss-only baselines and standard non-CL training in this setting.

[424] PoiCGAN: A Targeted Poisoning Based on Feature-Label Joint Perturbation in Federated Learning

Tao Liu, Jiguang Lv, Dapeng Man, Weiye Xi, Yaole Li, Feiyu Zhao, Kuiming Wang, Yingchao Bian, Chen Xu, Wu Yang

Main category: cs.LG

TL;DR: PoiCGAN: A targeted poisoning attack for federated learning using feature-label collaborative perturbation via modified CGAN to generate stealthy poisoned samples with automatic label flipping.

Details

Motivation: Federated Learning is vulnerable to poisoning attacks, but existing methods are easily detected by model performance tests and anomaly detection defenses, limiting their practical utility. Need for attacks that maintain model performance while being stealthy.

Method: Proposes PoiCGAN, a targeted poisoning attack based on feature-label collaborative perturbation. Modifies inputs of discriminator and generator in Conditional GAN to influence training and generate an ideal poison generator that produces specific poisoned samples with automatic label flipping.

Result: Achieves 83.97% higher attack success rate than baselines with less than 8.87% reduction in main task accuracy. Poisoned samples and malicious models show high stealthiness across various datasets.

Conclusion: PoiCGAN effectively bypasses model performance tests and anomaly detection defenses while maintaining attack effectiveness, making it a practical threat to federated learning systems in industrial image classification.

Abstract: Federated Learning (FL), as a popular distributed learning paradigm, has shown outstanding performance in improving computational efficiency and protecting data privacy, and is widely applied in industrial image classification. However, due to its distributed nature, FL is vulnerable to threats from malicious clients, with poisoning attacks being a common threat. A major limitation of existing poisoning attack methods is their difficulty in bypassing model performance tests and defense mechanisms based on model anomaly detection. This often results in the detection and removal of poisoned models, which undermines their practical utility. To ensure both the performance of industrial image classification and attacks, we propose a targeted poisoning attack, PoiCGAN, based on feature-label collaborative perturbation. Our method modifies the inputs of the discriminator and generator in the Conditional Generative Adversarial Network (CGAN) to influence the training process, generating an ideal poison generator. This generator not only produces specific poisoned samples but also automatically performs label flipping. Experiments across various datasets show that our method achieves an attack success rate 83.97% higher than baseline methods, with a less than 8.87% reduction in the main task’s accuracy. Moreover, the poisoned samples and malicious models exhibit high stealthiness.

[425] APreQEL: Adaptive Mixed Precision Quantization For Edge LLMs

Meriem Bouzouad, Yuan-Hao Chang, Jalil Boukhobza

Main category: cs.LG

TL;DR: Adaptive mixed precision quantization for LLMs that analyzes layer-wise contributions and hardware behavior to assign optimal quantization types per layer, balancing memory, latency, and accuracy for edge deployment.

Details

Motivation: LLMs have high computational costs and memory requirements that make edge deployment challenging. Uniform quantization doesn't account for different layer responses to reduced precision, and memory vs. computational throughput trade-offs complicate deployment decisions.

Method: Proposes adaptive mixed precision quantization that analyzes layer-wise contribution and infers how different quantization types behave across target hardware platforms to assign the most suitable quantization type to each layer.

Result: The approach unlocks new configuration designs that uniform quantization cannot achieve, expanding the solution space for efficient LLM deployment on resource-constrained devices while balancing memory, latency, and accuracy.

Conclusion: The adaptive mixed precision quantization mechanism enables more efficient deployment of LLMs on edge devices by jointly respecting layer importance and overall performance trade-offs, addressing the limitations of uniform quantization approaches.

Abstract: Today, large language models have demonstrated their strengths in various tasks ranging from reasoning, code generation, and complex problem solving. However, this advancement comes with a high computational cost and memory requirements, making it challenging to deploy these models on edge devices to ensure real-time responses and data privacy. Quantization is one common approach to reducing memory use, but most methods apply it uniformly across all layers. This does not account for the fact that different layers may respond differently to reduced precision. Importantly, memory consumption and computational throughput are not necessarily aligned, further complicating deployment decisions. This paper proposes an adaptive mixed precision quantization mechanism that balances memory, latency, and accuracy in edge deployment under user-defined priorities. This is achieved by analyzing the layer-wise contribution and by inferring how different quantization types behave across the target hardware platform in order to assign the most suitable quantization type to each layer. This integration ensures that layer importance and the overall performance trade-offs are jointly respected in this design. Our work unlocks new configuration designs that uniform quantization cannot achieve, expanding the solution space to efficiently deploy the LLMs on resource-constrained devices.

[426] The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations

Long Zhang, Dai-jun Lin, Wei-neng Chen

Main category: cs.LG

TL;DR: LLMs require topological distortion to form discrete decision boundaries for logical reasoning, achieved through dual-modulation mechanism of class-agnostic preservation and algebraic divergence.

Details

Motivation: LLMs generalize across continuous semantic spaces but need discrete boundaries for logical reasoning. Prevailing linear isometric theories fail to resolve this tension between continuity and discreteness.

Method: Applied Gram-Schmidt decomposition to residual-stream activations to reveal dual-modulation mechanism. Used targeted vector ablation to establish causal relationships and analyzed geometric evolution across tasks from simple mapping to complex primality testing.

Result: Found that erasing divergence component collapses parity classification from 100% to chance levels (38.57%). Uncovered three-phase layer-wise geometric dynamics and showed social pressure prompts cause insufficient divergence leading to manifold entanglement explaining sycophancy and hallucination.

Conclusion: Discrete logic emergence in LLMs requires irreducible topological deformation, revising linear-isometric presumption. The cost of logical reasoning is topological distortion through dual-modulation mechanism.

Abstract: Large language models (LLMs) generalize smoothly across continuous semantic spaces, yet strict logical reasoning demands the formation of discrete decision boundaries. Prevailing theories relying on linear isometric projections fail to resolve this fundamental tension. In this work, we argue that task context operates as a non-isometric dynamical operator that enforces a necessary “topological distortion.” By applying Gram-Schmidt decomposition to residual-stream activations , we reveal a dual-modulation mechanism driving this process: a class-agnostic topological preservation that anchors global structure to prevent semantic collapse, and a specific algebraic divergence that directionally tears apart cross-class concepts to forge logical boundaries. We validate this geometric evolution across a gradient of tasks, from simple mapping to complex primality testing. Crucially, targeted specific vector ablation establishes a strict causal binding between this topology and model function: algebraically erasing the divergence component collapses parity classification accuracy from 100% to chance levels (38.57%). Furthermore, we uncover a three-phase layer-wise geometric dynamic and demonstrate that under social pressure prompts, models fail to generate sufficient divergence. This results in a “manifold entanglement” that geometrically explains sycophancy and hallucination. Ultimately, our findings revise the linear-isometric presumption, demonstrating that the emergence of discrete logic in LLMs is purchased at an irreducible cost of topological deformation.

[427] Residual Attention Physics-Informed Neural Networks for Robust Multiphysics Simulation of Steady-State Electrothermal Energy Systems

Yuqing Zhou, Ze Tao, Fujun Liu

Main category: cs.LG

TL;DR: RA-PINN: A Residual Attention Physics-Informed Neural Network for solving coupled electrothermal multiphysics problems in energy systems with superior accuracy over conventional PINN methods.

Details

Motivation: Efficient thermal management and precise field prediction are critical for advanced energy systems, but steady-state simulation of electrothermal coupled multiphysics systems is challenging due to strong nonlinear field coupling, temperature-dependent coefficient variability, and complex interface dynamics.

Method: Proposes a Residual Attention Physics-Informed Neural Network (RA-PINN) framework integrating unified five-field operator formulation with residual-connected feature propagation and attention-guided channel modulation to capture localized coupling structures and steep gradients.

Result: RA-PINN achieves superior accuracy with lowest MSE, RMSE, and relative L2 errors across four energy-relevant benchmarks compared to Pure-MLP, LSTM-PINN, and pLSTM-PINN, maintaining high structural fidelity in interface-dominated and variable-coefficient settings.

Conclusion: RA-PINN establishes as a robust and accurate computational framework for high-fidelity modeling and optimization of complex electrothermal multiphysics in sustainable energy applications.

Abstract: Efficient thermal management and precise field prediction are critical for the design of advanced energy systems, including electrohydrodynamic transport, microfluidic energy harvesters, and electrically driven thermal regulators. However, the steady-state simulation of these electrothermal coupled multiphysics systems remains challenging for physics-informed neural computation due to strong nonlinear field coupling, temperature-dependent coefficient variability, and complex interface dynamics. This study proposes a Residual Attention Physics-Informed Neural Network (RA-PINN) framework for the unified solution of coupled velocity, pressure, electric-potential, and temperature fields. By integrating a unified five-field operator formulation with residual-connected feature propagation and attention-guided channel modulation, the proposed architecture effectively captures localized coupling structures and steep gradients. We evaluate RA-PINN across four representative energy-relevant benchmarks: constant-coefficient coupling, indirect pressure-gauge constraints, temperature-dependent transport, and oblique-interface consistency. Comparative analysis against Pure-MLP, LSTM-PINN, and pLSTM-PINN demonstrates that RA-PINN achieves superior accuracy, yielding the lowest MSE, RMSE, and relative $L_2$ errors across all scenarios. Notably, RA-PINN maintains high structural fidelity in interface-dominated and variable-coefficient settings where conventional PINN backbones often fail. These results establish RA-PINN as a robust and accurate computational framework for the high-fidelity modeling and optimization of complex electrothermal multiphysics in sustainable energy applications.

[428] MetaKube: An Experience-Aware LLM Framework for Kubernetes Failure Diagnosis

Wei Sun, Ting Wang, Xinran Tian, Wanshun Lan, Xuhan Feng, Haoyue Li, Fangxin Wang

Main category: cs.LG

TL;DR: MetaKube is an experience-aware LLM framework for Kubernetes diagnostics that learns from past resolutions through episodic memory, meta-cognitive routing, and domain-specific fine-tuning.

Details

Motivation: Existing LLM-based Kubernetes diagnostic systems operate on static knowledge bases without learning from operational experience, limiting their ability to improve from past resolutions and adapt to new scenarios.

Method: Three synergistic innovations: (1) Episodic Pattern Memory Network (EPMN) for abstracting diagnostic patterns from historical resolutions with confidence-calibrated retrieval, (2) meta-cognitive controller for dynamic routing between intuitive and analytical pathways based on problem familiarity, and (3) KubeLLM, a locally-deployable 8B model enhanced through domain-specific post-training on a 7,000-sample Kubernetes Fault Resolution Dataset.

Result: Evaluation on 1,873 real-world scenarios shows MetaKube transforms Qwen3-8B from 50.9 to 90.5 points, approaching GPT-4.1 performance while ensuring data privacy. EPMN contributes 15.3% improvement through experiential learning, with continuous learning experiments showing progressive gains as the system accumulates operational knowledge.

Conclusion: MetaKube demonstrates that experience-aware LLM frameworks can significantly improve diagnostic performance in specialized domains like Kubernetes by learning from operational history, with implications for other domain-specific AI applications requiring privacy and continuous improvement.

Abstract: Existing LLM-based Kubernetes diagnostic systems cannot learn from operational experience, operating on static knowledge bases without improving from past resolutions. We present MetaKube, an experience-aware LLM framework through three synergistic innovations: (1) an Episodic Pattern Memory Network (EPMN) that abstracts diagnostic patterns from historical resolutions and provides confidence-calibrated retrieval for both rapid pattern matching and guided causal exploration, (2) a meta-cognitive controller that dynamically routes between intuitive and analytical pathways based on problem familiarity, optimizing the trade-off between speed and depth, and (3) KubeLLM, a locally-deployable 8B model enhanced through domain-specific post-training on our 7,000-sample Kubernetes Fault Resolution Dataset. Evaluation on 1,873 real-world scenarios demonstrates MetaKube transforms Qwen3-8B from 50.9 to 90.5 points, approaching GPT-4.1 performance while ensuring complete data privacy. EPMN contributes 15.3% improvement through experiential learning, with continuous learning experiments showing progressive gains as the system accumulates operational knowledge. The source code and related resources are available at https://github.com/MetaKube-LLM-for-Kubernetes-Diagnosis/MetaKube.

[429] AI Generalisation Gap In Comorbid Sleep Disorder Staging

Saswata Bose, Suvadeep Maiti, Shivam Kumar Sharma, Mythirayee S, Tapabrata Chakraborti, Srijitesh Rajendran, Raju S. Bapi

Main category: cs.LG

TL;DR: iSLEEPS introduces a clinically annotated ischemic stroke EEG dataset and demonstrates poor generalization of automated sleep staging models from healthy to clinical populations, highlighting the need for disease-specific models with clinical validation.

Details

Motivation: Current deep learning models for automated EEG-based sleep staging work well on healthy subjects but fail to generalize to clinical populations like stroke patients with disrupted sleep architecture, necessitating clinically validated disease-specific approaches.

Method: The authors introduce iSLEEPS dataset (publicly released), use SE-ResNet plus bidirectional LSTM for single-channel EEG sleep staging, employ Grad-CAM for attention visualization, and conduct statistical/computational analyses comparing healthy vs. ischemic stroke cohorts.

Result: Cross-domain performance between healthy and diseased subjects is poor; attention visualizations show models focus on physiologically uninformative EEG regions in patient data; significant sleep architecture differences exist between healthy and stroke cohorts.

Conclusion: Automated sleep staging models need subject-aware or disease-specific approaches with clinical validation before deployment, as models trained on healthy data don’t generalize to clinical populations with disrupted sleep architecture.

Abstract: Accurate sleep staging is essential for diagnosing OSA and hypopnea in stroke patients. Although PSG is reliable, it is costly, labor-intensive, and manually scored. While deep learning enables automated EEG-based sleep staging in healthy subjects, our analysis shows poor generalization to clinical populations with disrupted sleep. Using Grad-CAM interpretations, we systematically demonstrate this limitation. We introduce iSLEEPS, a newly clinically annotated ischemic stroke dataset (to be publicly released), and evaluate a SE-ResNet plus bidirectional LSTM model for single-channel EEG sleep staging. As expected, cross-domain performance between healthy and diseased subjects is poor. Attention visualizations, supported by clinical expert feedback, show the model focuses on physiologically uninformative EEG regions in patient data. Statistical and computational analyses further confirm significant sleep architecture differences between healthy and ischemic stroke cohorts, highlighting the need for subject-aware or disease-specific models with clinical validation before deployment. A summary of the paper and the code is available at https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/

[430] Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning

Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto

Main category: cs.LG

TL;DR: A hypergradient-based method for bi-level RL in decentralized settings using Boltzmann covariance trick for efficient gradient estimation from interaction samples

Details

Motivation: Addressing strategic decision-making problems like warehouse robot environment design where a leader agent optimizes its objective while a follower solves an MDP, but the leader cannot intervene in the follower's optimization process and can only observe outcomes

Method: Derives hypergradient of leader’s objective using Boltzmann covariance trick to enable efficient hypergradient estimation solely from interaction samples, even with high-dimensional leader decision space; first method enabling hypergradient-based optimization for 2-player Markov games in decentralized settings

Result: Experiments demonstrate effectiveness in both discrete and continuous state tasks, highlighting impact of hypergradient updates

Conclusion: Proposed method enables efficient hypergradient estimation for decentralized bi-level RL problems, overcoming limitations of prior methods that require extensive data or have complexity scaling issues with high-dimensional decision spaces

Abstract: Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader’s decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower’s optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader’s objective, i.e., the gradient of the leader’s strategy that accounts for changes in the follower’s optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader’s decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader’s decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method’s effectiveness in both discrete and continuous state tasks.

[431] LineMVGNN: Anti-Money Laundering with Line-Graph-Assisted Multi-View Graph Neural Networks

Chung-Hoo Poon, James Kwok, Calvin Chow, Jang-Hyeon Choi

Main category: cs.LG

TL;DR: LineMVGNN: A spatial graph neural network method for anti-money laundering that uses line-graph-assisted multi-view learning to capture payment and receipt transactions in financial networks.

Details

Motivation: Conventional rule-based AML systems have suboptimal accuracy and lack scalability. Existing graph neural networks for digraphs have limitations: spectral GNNs don't support multi-dimensional edge features well, lack interpretability, and have limited scalability; spatial methods may not capture money flow effectively.

Method: Proposes LineMVGNN (Line-Graph-Assisted Multi-View Graph Neural Network), a spatial method that considers both payment and receipt transactions. It extends a lightweight MVGNN module for two-way message passing between nodes in transaction graphs, and incorporates a line graph view of the original transaction graph to enhance transaction information propagation.

Result: Experiments on two real-world datasets (Ethereum phishing transaction network and financial payment transaction dataset from industry partner) show LineMVGNN outperforms state-of-the-art methods for money laundering detection.

Conclusion: LineMVGNN effectively detects money laundering through line-graph-assisted multi-view graph learning, with discussions on scalability, adversarial robustness, and regulatory considerations.

Abstract: Anti-money laundering (AML) systems are important for protecting the global economy. However, conventional rule-based methods rely on domain knowledge, leading to suboptimal accuracy and a lack of scalability. Graph neural networks (GNNs) for digraphs (directed graphs) can be applied to transaction graphs and capture suspicious transactions or accounts. However, most spectral GNNs do not naturally support multi-dimensional edge features, lack interpretability due to edge modifications, and have limited scalability owing to their spectral nature. Conversely, most spatial methods may not capture the money flow well. Therefore, in this work, we propose LineMVGNN (Line-Graph-Assisted Multi-View Graph Neural Network), a novel spatial method that considers payment and receipt transactions. Specifically, the LineMVGNN model extends a lightweight MVGNN module, which performs two-way message passing between nodes in a transaction graph. Additionally, LineMVGNN incorporates a line graph view of the original transaction graph to enhance the propagation of transaction information. We conduct experiments on two real-world account-based transaction datasets: the Ethereum phishing transaction network dataset and a financial payment transaction dataset from one of our industry partners. The results show that our proposed method outperforms state-of-the-art methods, reflecting the effectiveness of money laundering detection with line-graph-assisted multi-view graph learning. We also discuss scalability, adversarial robustness, and regulatory considerations of our proposed method.

[432] A Theory of LLM Information Susceptibility

Zhuo-Yang Song, Hua Xing Zhu

Main category: cs.LG

TL;DR: LLM-mediated optimization has fundamental limits - fixed LLM intervention doesn’t increase performance susceptibility with sufficient computational resources, but nested co-scaling architectures can exceed these bounds.

Details

Motivation: To understand the fundamental limits of LLM-mediated improvement in agentic systems, as current understanding of when LLM intervention helps or doesn't help remains poor.

Method: Proposes a theory of LLM information susceptibility with a multi-variable utility-function framework, validates empirically across diverse domains and model scales, and applies statistical physics tools.

Result: Fixed LLM intervention doesn’t increase performance susceptibility with sufficient computational resources, but nested co-scaling architectures can exceed susceptibility bounds and open new response channels.

Conclusion: The theory clarifies when LLM intervention helps, shows statistical physics tools can predict AI system design constraints, and suggests nested architectures may be necessary for open-ended agentic self-improvement.

Abstract: Large language models (LLMs) are increasingly deployed as optimization modules in agentic systems, yet the fundamental limits of such LLM-mediated improvement remain poorly understood. Here we propose a theory of LLM information susceptibility, centred on the hypothesis that when computational resources are sufficiently large, the intervention of a fixed LLM does not increase the performance susceptibility of a strategy set with respect to budget. We develop a multi-variable utility-function framework that generalizes this hypothesis to architectures with multiple co-varying budget channels, and discuss the conditions under which co-scaling can exceed the susceptibility bound. We validate the theory empirically across structurally diverse domains and model scales spanning an order of magnitude, and show that nested, co-scaling architectures open response channels unavailable to fixed configurations. These results clarify when LLM intervention helps and when it does not, demonstrating that tools from statistical physics can provide predictive constraints for the design of AI systems. If the susceptibility hypothesis holds generally, the theory suggests that nested architectures may be a necessary structural condition for open-ended agentic self-improvement.

[433] Steering Code LLMs with Activation Directions for Language and Library Control

Md Mahbubur Rahman, Arjun Guha, Harshitha Menon

Main category: cs.LG

TL;DR: Code LLMs have embedded language/library preferences that can be manipulated via linear steering vectors in activation space to change generation behavior.

Details

Motivation: Code LLMs often default to specific programming languages and libraries under neutral prompts, and researchers want to understand if these preferences are encoded as linear directions that can be manipulated at inference time.

Method: Used difference-in-means method to estimate layer-wise steering vectors for five language/library pairs, then added these vectors to model hidden states during generation across three open-weight code LLMs.

Result: Steering vectors substantially increase generation toward target ecosystems under neutral prompts and remain effective even when prompts explicitly request the opposite choice. Steering strength varies by model and target, with common ecosystems easier to induce than rarer alternatives.

Conclusion: Code-style preferences in LLMs are partly represented by compact, steerable structure in activation space, though overly strong interventions can reduce output quality.

Abstract: Code LLMs often default to particular programming languages and libraries under neutral prompts. We investigate whether these preferences are encoded as approximately linear directions in activation space that can be manipulated at inference time. Using a difference-in-means method, we estimate layer-wise steering vectors for five language/library pairs and add them to model hidden states during generation. Across three open-weight code LLMs, these interventions substantially increase generation toward the target ecosystem under neutral prompts and often remain effective even when prompts explicitly request the opposite choice. Steering strength varies by model and target, with common ecosystems easier to induce than rarer alternatives, and overly strong interventions can reduce output quality. Overall, our results suggest that code-style preferences in LLMs are partly represented by compact, steerable structure in activation space.

[434] Boost Like a (Var)Pro: Trust-Region Gradient Boosting via Variable Projection

Abhijit Chowdhary, Elizabeth Newman, Deepanshu Verma

Main category: cs.LG

TL;DR: VPBoost introduces a gradient boosting algorithm for separable smooth approximators (like neural networks) using variable projection and second-order weak learning, achieving competitive performance with theoretical convergence guarantees.

Details

Motivation: Gradient boosting is well-established for decision trees but less developed for smooth parametric learners like neural networks. The authors aim to create a theoretically-grounded boosting method for separable smooth approximators that combines the benefits of boosting with neural network flexibility.

Method: VPBoost fuses variable projection (enforcing optimality of linear weights) with second-order weak learning strategy for separable models. It uses closed-form solutions for optimal linear weights and operates as a functional trust-region method, leveraging trust-region theory for convergence analysis.

Result: Theoretical analysis shows VPBoost converges to a stationary point under mild conditions and achieves superlinear convergence under stronger assumptions. Experiments on synthetic data, image recognition, and scientific ML benchmarks show improved evaluation metrics compared to gradient-descent-based boosting and competitive performance with industry-standard decision tree boosting.

Conclusion: VPBoost provides a theoretically sound and practical gradient boosting approach for separable smooth approximators, bridging the gap between traditional decision tree boosting and neural network ensembles with provable convergence properties.

Abstract: Gradient boosting, a method of building additive ensembles from weak learners, has established itself as a practical and theoretically-motivated approach to approximate functions, especially using decision tree weak learners. Comparable methods for smooth parametric learners, such as neural networks, remain less developed in both training methodology and theory. To this end, we introduce \texttt{VPBoost} ({\bf V}ariable {\bf P}rojection {\bf Boost}ing), a gradient boosting algorithm for separable smooth approximators, i.e., models with a smooth nonlinear featurizer followed by a final linear mapping. \texttt{VPBoost} fuses variable projection, a training paradigm for separable models that enforces optimality of the linear weights, with a second-order weak learning strategy. The combination of second-order boosting, separable models, and variable projection give rise to a closed-form solution for the optimal linear weights and a natural interpretation of \VPBoost as a functional trust-region method. We thereby leverage trust-region theory to prove \VPBoost converges to a stationary point under mild geometric conditions and, under stronger assumptions, achieves a superlinear convergence rate. Comprehensive numerical experiments on synthetic data, image recognition, and scientific machine learning benchmarks demonstrate that \VPBoost learns an ensemble with improved evaluation metrics in comparison to gradient-descent-based boosting and attains competitive performance relative to an industry-standard decision tree boosting algorithm.

[435] CDMT-EHR: A Continuous-Time Diffusion Framework for Generating Mixed-Type Time-Series Electronic Health Records

Shaonan Liu, Yuichiro Iwashita, Soichiro Nakako, Masakazu Iwamura, Koichi Kise

Main category: cs.LG

TL;DR: Continuous-time diffusion framework for generating synthetic EHR time-series data with mixed numerical/categorical features, outperforming discrete-time methods with fewer sampling steps.

Details

Motivation: EHR data sharing is limited by privacy concerns, requiring synthetic data generation. Existing diffusion models for EHRs use discrete-time formulations with approximation errors and coupled training-sampling steps, making them inefficient for mixed-type time-series data.

Method: Proposes continuous-time diffusion with bidirectional GRU backbone for temporal dependencies, unified Gaussian diffusion via learnable continuous embeddings for categorical variables, and factorized learnable noise schedule that adapts to per-feature-per-timestep learning difficulties.

Result: Outperforms existing approaches on two ICU datasets in downstream task performance, distribution fidelity, and discriminability, while requiring only 50 sampling steps vs. 1,000 for baselines. Classifier-free guidance enables effective conditional generation for class-imbalanced scenarios.

Conclusion: Continuous-time diffusion framework effectively generates high-quality synthetic EHR time-series data with mixed features, offering superior performance and efficiency over discrete-time methods.

Abstract: Electronic health records (EHRs) are invaluable for clinical research, yet privacy concerns severely restrict data sharing. Synthetic data generation offers a promising solution, but EHRs present unique challenges: they contain both numerical and categorical features that evolve over time. While diffusion models have demonstrated strong performance in EHR synthesis, existing approaches predominantly rely on discrete-time formulations, which suffer from finite-step approximation errors and coupled training-sampling step counts. We propose a continuous-time diffusion framework for generating mixed-type time-series EHRs with three contributions: (1) continuous-time diffusion with a bidirectional gated recurrent unit backbone for capturing temporal dependencies, (2) unified Gaussian diffusion via learnable continuous embeddings for categorical variables, enabling joint cross-feature modeling, and (3) a factorized learnable noise schedule that adapts to per-feature-per-timestep learning difficulties. Experiments on two large-scale intensive care unit datasets demonstrate that our method outperforms existing approaches in downstream task performance, distribution fidelity, and discriminability, while requiring only 50 sampling steps compared to 1,000 for baseline methods. Classifier-free guidance further enables effective conditional generation for class-imbalanced clinical scenarios.

[436] BXRL: Behavior-Explainable Reinforcement Learning

Ram Rachum, Yotam Amitai, Yonatan Nakar, Reuth Mirsky, Cameron Allen

Main category: cs.LG

TL;DR: BXRL introduces a formal definition of behaviors as patterns of actions across episodes, enabling new “explain this behavior” queries in reinforcement learning.

Details

Motivation: Current XRL methods can explain specific actions, trajectories, or policies, but lack formal definitions for behaviors as patterns across multiple episodes. There's a need to treat behaviors as first-class objects for more comprehensive explainability.

Method: Proposes Behavior-Explainable RL (BXRL) formulation with behavior measures as functions from policies to real numbers. Defines contrastive behaviors to reduce preference questions to behavior measurement. Adapts existing XRL methods for behavior explanation and provides a JAX port of HighwayEnv for behavior differentiation.

Result: Establishes formal framework for behavior explanation in RL, enabling new query types. Provides tools for defining, measuring, and differentiating behaviors with respect to model parameters through the JAX environment implementation.

Conclusion: BXRL provides a principled approach to behavior explanation in RL, treating behaviors as first-class objects and enabling more comprehensive explainability beyond action- or trajectory-level analysis.

Abstract: A major challenge of Reinforcement Learning is that agents often learn undesired behaviors that seem to defy the reward structure they were given. Explainable Reinforcement Learning (XRL) methods can answer queries such as “explain this specific action”, “explain this specific trajectory”, and “explain the entire policy”. However, XRL lacks a formal definition for behavior as a pattern of actions across many episodes. We provide such a definition, and use it to enable a new query: “Explain this behavior”. We present Behavior-Explainable Reinforcement Learning (BXRL), a new problem formulation that treats behaviors as first-class objects. BXRL defines a behavior measure as any function $m : Π\to \mathbb{R}$, allowing users to precisely express the pattern of actions that they find interesting and measure how strongly the policy exhibits it. We define contrastive behaviors that reduce the question “why does the agent prefer $a$ to $a’$?” to “why is $m(π)$ high?” which can be explored with differentiation. We do not implement an explainability method; we instead analyze three existing methods and propose how they could be adapted to explain behavior. We present a port of the HighwayEnv driving environment to JAX, which provides an interface for defining, measuring, and differentiating behaviors with respect to the model parameters.

[437] Kronecker-Structured Nonparametric Spatiotemporal Point Processes

Zhitong Xu, Qiwei Yuan, Yinghao Chen, Yan Sun, Bin Shen, Shandian Zhe

Main category: cs.LG

TL;DR: Kronecker-Structured Nonparametric Spatiotemporal Point Process (KSTPP) for interpretable event relationship discovery with scalable modeling of complex spatiotemporal interactions.

Details

Motivation: Existing point process models have limitations: classical parametric models (Poisson/Hawkes) are too restrictive for complex interactions, while recent neural models lack interpretability for relationship discovery. Need a method that combines interpretability with flexibility.

Method: Proposes KSTPP with background intensity modeled by spatial Gaussian Process and influence kernel as spatiotemporal GP. Uses separable product kernels and structured grid representations to induce Kronecker-structured covariance matrices for scalability. Develops tensor-product Gauss-Legendre quadrature for efficient likelihood evaluation.

Result: Model achieves scalable training and prediction on large event collections while enabling transparent event-wise relationship discovery. Captures rich interaction patterns including excitation, inhibition, neutrality, and time-varying effects.

Conclusion: KSTPP provides an interpretable yet flexible framework for spatiotemporal event modeling that scales to large datasets while maintaining transparency in relationship discovery.

Abstract: Events in spatiotemporal domains arise in numerous real-world applications, where uncovering event relationships and enabling accurate prediction are central challenges. Classical Poisson and Hawkes processes rely on restrictive parametric assumptions that limit their ability to capture complex interaction patterns, while recent neural point process models increase representational capacity but integrate event information in a black-box manner, hindering interpretable relationship discovery. To address these limitations, we propose a Kronecker-Structured Nonparametric Spatiotemporal Point Process (KSTPP) that enables transparent event-wise relationship discovery while retaining high modeling flexibility. We model the background intensity with a spatial Gaussian process (GP) and the influence kernel as a spatiotemporal GP, allowing rich interaction patterns including excitation, inhibition, neutrality, and time-varying effects. To enable scalable training and prediction, we adopt separable product kernels and represent the GPs on structured grids, inducing Kronecker-structured covariance matrices. Exploiting Kronecker algebra substantially reduces computational cost and allows the model to scale to large event collections. In addition, we develop a tensor-product Gauss-Legendre quadrature scheme to efficiently evaluate intractable likelihood integrals. Extensive experiments demonstrate the effectiveness of our framework.

[438] Self Paced Gaussian Contextual Reinforcement Learning

Mohsen Sahraei Ardakani, Rui Song

Main category: cs.LG

TL;DR: SPGL introduces a computationally efficient curriculum learning method for RL using closed-form Gaussian updates, avoiding expensive inner-loop optimization while maintaining sample efficiency.

Details

Motivation: Curriculum learning improves RL efficiency but existing self-paced methods require computationally expensive inner-loop optimizations that limit scalability in high-dimensional context spaces.

Method: Proposes Self-Paced Gaussian Curriculum Learning (SPGL) with closed-form update rules for Gaussian context distributions, avoiding costly numerical procedures while maintaining adaptability.

Result: SPGL matches or outperforms existing curriculum methods across contextual RL benchmarks (Point Mass, Lunar Lander, Ball Catching), especially in hidden context scenarios, with more stable context distribution convergence.

Conclusion: SPGL offers a scalable, principled alternative for curriculum generation in challenging continuous and partially observable domains with reduced computational overhead.

Abstract: Curriculum learning improves reinforcement learning (RL) efficiency by sequencing tasks from simple to complex. However, many self-paced curriculum methods rely on computationally expensive inner-loop optimizations, limiting their scalability in high-dimensional context spaces. In this paper, we propose Self-Paced Gaussian Curriculum Learning (SPGL), a novel approach that avoids costly numerical procedures by leveraging a closed-form update rule for Gaussian context distributions. SPGL maintains the sample efficiency and adaptability of traditional self-paced methods while substantially reducing computational overhead. We provide theoretical guarantees on convergence and validate our method across several contextual RL benchmarks, including the Point Mass, Lunar Lander, and Ball Catching environments. Experimental results show that SPGL matches or outperforms existing curriculum methods, especially in hidden context scenarios, and achieves more stable context distribution convergence. Our method offers a scalable, principled alternative for curriculum generation in challenging continuous and partially observable domains.

[439] Lightweight Fairness for LLM-Based Recommendations via Kernelized Projection and Gated Adapters

Nan Cui, Wendy Hui Wang, Yue Ning

Main category: cs.LG

TL;DR: A lightweight bias mitigation method for LLM-based recommender systems using kernelized Iterative Null-space Projection with gated Mixture-of-Experts adapter to remove sensitive attributes without extra parameters.

Details

Motivation: LLM-based recommender systems can amplify social biases from pre-training data, especially with demographic cues. Existing fairness solutions require extra parameter tuning or suffer from optimization instability, needing a lightweight, scalable approach.

Method: Combines kernelized Iterative Null-space Projection (INLP) with gated Mixture-of-Experts (MoE) adapter. INLP estimates closed-form projection to remove single/multiple sensitive attributes from LLM representations without trainable parameters. Two-level MoE adapter selectively restores useful signals without reintroducing bias.

Result: Experiments on two public datasets show reduced attribute leakage across multiple protected variables while maintaining competitive recommendation accuracy.

Conclusion: Proposed method effectively mitigates bias in LLM-based recommender systems with lightweight, parameter-free approach that preserves task utility.

Abstract: Large Language Models (LLMs) have introduced new capabilities to recommender systems, enabling dynamic, context-aware, and conversational recommendations. However, LLM-based recommender systems inherit and may amplify social biases embedded in their pre-training data, especially when demographic cues are present. Existing fairness solutions either require extra parameters fine-tuning, or suffer from optimization instability. We propose a lightweight and scalable bias mitigation method that combines a kernelized Iterative Null-space Projection (INLP) with a gated Mixture-of-Experts (MoE) adapter. Our approach estimates a closed-form projection that removes single or multiple sensitive attributes from LLM representations with no additional trainable parameters. To preserve task utility, we introduce a two-level MoE adapter that selectively restores useful signals without reintroducing bias. Experiments on two public datasets show that our method reduces attribute leakage across multiple protected variables while maintaining competitive recommendation accuracy.

[440] Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

Christopher Ackerman, Nina Panickssery

Main category: cs.LG

TL;DR: LLMs can recognize their own writing, with Llama3-8b-Instruct showing this ability through post-training experience, and researchers identify a causal “self-authorship” vector that can control the model’s perception and claims of authorship.

Details

Motivation: The paper investigates whether LLMs can recognize their own writing, which has important implications for AI safety but is understudied. The researchers aim to establish if this behavior occurs robustly, understand how it works, and explore whether it can be controlled.

Method: First tested Llama3 models on self-written-text recognition tasks, then analyzed the residual stream to identify a vector associated with correct self-authorship judgments. Used causal intervention techniques to manipulate this vector and control the model’s behavior and perception.

Result: Llama3-8b-Instruct can reliably distinguish its own outputs from human ones, unlike the base model. Found a specific vector in the residual stream that activates during correct self-authorship judgments, which is causally related to the model’s ability to perceive and assert self-authorship. This vector can be used to steer the model’s authorship claims and beliefs.

Conclusion: LLMs can develop self-authorship recognition through post-training, and this ability is mediated by specific neural mechanisms that can be identified and controlled, with significant implications for AI safety and model interpretability.

Abstract: It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of “self” in the model, and demonstrate that the vector is causally related to the model’s ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model’s behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model’s output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.

[441] Probabilistic Geometric Alignment via Bayesian Latent Transport for Domain-Adaptive Foundation Models

Kuepon Aueawatthanaphisut, Kuepon Aueawatthanaphisut

Main category: cs.LG

TL;DR: Uncertainty-aware probabilistic latent transport framework for domain adaptation using Bayesian transport operators and Wasserstein geodesics with PAC-Bayesian regularization.

Details

Motivation: Address challenges in adapting large foundation models to new domains with limited supervision, including latent distribution mismatch, unstable optimization, and miscalibrated uncertainty propagation.

Method: Formulates domain adaptation as stochastic geometric alignment using Bayesian transport operators to redistribute latent probability mass along Wasserstein geodesics, with PAC-Bayesian regularization to constrain posterior complexity.

Result: Demonstrates reduced latent manifold discrepancy, accelerated transport energy decay, improved covariance calibration, and bounded posterior uncertainty evolution compared to deterministic fine-tuning and adversarial baselines.

Conclusion: Uncertainty-aware probabilistic alignment provides a promising paradigm for reliable transfer learning in next-generation deep representation systems by connecting stochastic optimal transport geometry with statistical generalization theory.

Abstract: Adapting large-scale foundation models to new domains with limited supervision remains a fundamental challenge due to latent distribution mismatch, unstable optimization dynamics, and miscalibrated uncertainty propagation. This paper introduces an uncertainty-aware probabilistic latent transport framework that formulates domain adaptation as a stochastic geometric alignment problem in representation space. A Bayesian transport operator is proposed to redistribute latent probability mass along Wasserstein-type geodesic trajectories, while a PAC-Bayesian regularization mechanism constrains posterior model complexity to mitigate catastrophic overfitting. The proposed formulation yields theoretical guarantees on convergence stability, loss landscape smoothness, and sample efficiency under distributional shift. Empirical analyses demonstrate substantial reduction in latent manifold discrepancy, accelerated transport energy decay, and improved covariance calibration compared with deterministic fine-tuning and adversarial domain adaptation baselines. Furthermore, bounded posterior uncertainty evolution indicates enhanced probabilistic reliability during cross-domain transfer. By establishing a principled connection between stochastic optimal transport geometry and statistical generalization theory, the proposed framework provides new insights into robust adaptation of modern foundation architectures operating in heterogeneous environments. These findings suggest that uncertainty-aware probabilistic alignment constitutes a promising paradigm for reliable transfer learning in next-generation deep representation systems.

[442] Mitigating Many-Shot Jailbreaking

Christopher M. Ackerman, Nina Panickssery

Main category: cs.LG

TL;DR: Fine-tuning and input sanitization techniques can mitigate many-shot jailbreaking attacks on LLMs while preserving normal functionality

Details

Motivation: Many-shot jailbreaking exploits LLMs' long context windows to bypass safety training by providing many examples of inappropriate responses, overriding safety protocols through in-context learning

Method: Probe effectiveness of different fine-tuning and input sanitization approaches, alone and in combination, to mitigate MSJ attacks while retaining model performance on benign tasks

Result: Incremental mitigation effectiveness for each technique individually, with combined techniques significantly reducing MSJ attack effectiveness while maintaining performance on normal in-context learning and conversational tasks

Conclusion: The combined approach could meaningfully ameliorate MSJ vulnerabilities if incorporated into model safety post-training

Abstract: Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a “fake” assistant responding inappropriately before the final request. With enough examples, the model’s in-context learning abilities override its safety training, and it responds as if it were the “fake” assistant. In this work, we probe the effectiveness of different fine-tuning and input sanitization approaches on mitigating MSJ attacks, alone and in combination. We find incremental mitigation effectiveness for each, and show that the combined techniques significantly reduce the effectiveness of MSJ attacks, while retaining model performance in benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.

[443] Latent Algorithmic Structure Precedes Grokking: A Mechanistic Study of ReLU MLPs on Modular Arithmetic

Anand Swaroop

Main category: cs.LG

TL;DR: ReLU MLPs learn near-binary square wave input weights for modular addition, with output weights satisfying phase-sum relation, suggesting grokking sharpens rather than discovers algorithms.

Details

Motivation: To understand the grokking phenomenon in neural networks, specifically examining the weight distributions and algorithmic representations learned by ReLU MLPs for modular addition tasks.

Method: Empirical analysis of ReLU MLPs trained on modular addition, using DFT to extract frequency and phase information from weights, and constructing idealized models with perfect binary square wave input weights and cosine output weights.

Result: ReLU MLPs learn near-binary square wave input weights with intermediate values only near sign-change boundaries, and output weights satisfy phase-sum relation φ_out = φ_a + φ_b. Idealized model achieves 95.5% accuracy even when trained on noisy data that only achieves 0.23% accuracy.

Conclusion: Grokking doesn’t discover correct algorithms but rather sharpens an algorithm substantially encoded during memorization, progressively binarizing input weights and aligning output weights until generalization becomes possible.

Abstract: Grokking-the phenomenon where validation accuracy of neural networks on modular addition of two integers rises long after training data has been memorized-has been characterized in previous works as producing sinusoidal input weight distributions in transformers and multi-layer perceptrons (MLPs). We find empirically that ReLU MLPs in our experimental setting instead learn near-binary square wave input weights, where intermediate-valued weights appear exclusively near sign-change boundaries, alongside output weight distributions whose dominant Fourier phases satisfy a phase-sum relation $φ_{\mathrm{out}} = φ_a + φ_b$; this relation holds even when the model is trained on noisy data and fails to grok. We extract the frequency and phase of each neuron’s weights via DFT and construct an idealized MLP: Input weights are replaced by perfect binary square waves and output weights by cosines, both parametrized by the frequencies, phases, and amplitudes extracted from the dominant Fourier components of the real model weights. This idealized model achieves 95.5% accuracy when the frequencies and phases are extracted from the weights of a model trained on noisy data that itself achieves only 0.23% accuracy. This suggests that grokking does not discover the correct algorithm, but rather sharpens an algorithm substantially encoded during memorization, progressively binarizing the input weights into cleaner square waves and aligning the output weights, until generalization becomes possible.

[444] Evidence for Limited Metacognition in LLMs

Christopher Ackerman

Main category: cs.LG

TL;DR: LLMs show emerging metacognitive abilities to assess and utilize their own confidence in answering questions, measured through behavioral paradigms rather than self-reports.

Details

Motivation: The paper addresses growing concerns about LLM self-awareness and sentience, noting that current measurement science is nascent. It aims to develop quantitative methods to evaluate metacognitive abilities in LLMs, inspired by research on nonhuman animals.

Method: The approach avoids model self-reports and instead tests strategic deployment of knowledge about internal states. Uses two experimental paradigms to measure: 1) ability to assess and utilize confidence in answering factual/reasoning questions correctly, and 2) ability to anticipate and utilize their own future answers. Also analyzes token probabilities to identify upstream internal signals.

Result: Frontier LLMs since early 2024 show increasingly strong evidence of certain metacognitive abilities. These abilities are limited in resolution, emerge context-dependently, and differ qualitatively from human metacognition. Model differences suggest post-training may develop metacognitive abilities. Token probability analysis reveals upstream internal signals that could support metacognition.

Conclusion: LLMs demonstrate measurable metacognitive abilities, though different from humans. The methodology provides a foundation for quantitative evaluation of LLM self-awareness, with implications for AI safety and policy.

Abstract: The possibility of LLM self-awareness and even sentience is gaining increasing public attention and has major safety and policy implications, but the science of measuring them is still in a nascent state. Here we introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs. Taking inspiration from research on metacognition in nonhuman animals, our approach eschews model self-reports and instead tests to what degree models can strategically deploy knowledge of internal states. Using two experimental paradigms, we demonstrate that frontier LLMs introduced since early 2024 show increasingly strong evidence of certain metacognitive abilities, specifically the ability to assess and utilize their own confidence in their ability to answer factual and reasoning questions correctly and the ability to anticipate what answers they would give and utilize that information appropriately. We buttress these behavioral findings with an analysis of the token probabilities returned by the models, which suggests the presence of an upstream internal signal that could provide the basis for metacognition. We further find that these abilities 1) are limited in resolution, 2) emerge in context-dependent manners, and 3) seem to be qualitatively different from those of humans. We also report intriguing differences across models of similar capabilities, suggesting that LLM post-training may have a role in developing metacognitive abilities.

[445] Manifold Generalization Provably Proceeds Memorization in Diffusion Models

Zebang Shen, Ya-Ping Hsieh, Niao He

Main category: cs.LG

TL;DR: Diffusion models can generate novel samples with coarse scores by capturing data manifold geometry rather than full distributional structure, achieving faster generalization rates than classical density estimation.

Details

Motivation: To explain why diffusion models generate novel samples even with coarse score estimation, challenging the standard view of diffusion training as density estimation. The paper aims to provide theoretical understanding of how diffusion models exploit data manifold geometry.

Method: Theoretical analysis under the manifold hypothesis, proving that diffusion models with coarse scores can achieve near-parametric rates by exploiting manifold regularity rather than estimating the full data distribution. The analysis compares classical minimax rates for density estimation with rates achievable through coarse score estimation.

Result: Diffusion models with coarse scores can generate samples from a target distribution comparable to the data distribution in neighborhoods of the manifold, with rates depending only on manifold smoothness rather than density regularity. When the manifold is smooth, generalization occurs at rates faster than full population distribution estimation.

Conclusion: Coarse scores in diffusion models capture data manifold geometry while discarding fine-scale distributional structure, enabling novel sample generation at faster statistical rates than classical density estimation, especially beneficial when data density is irregular.

Abstract: Diffusion models often generate novel samples even when the learned score is only \emph{coarse} – a phenomenon not accounted for by the standard view of diffusion training as density estimation. In this paper, we show that, under the \emph{manifold hypothesis}, this behavior can instead be explained by coarse scores capturing the \emph{geometry} of the data while discarding the fine-scale distributional structure of the population measure~$μ_{\scriptscriptstyle\mathrm{data}}$. Concretely, whereas estimating the full data distribution $μ_{\scriptscriptstyle\mathrm{data}}$ supported on a $k$-dimensional manifold is known to require the classical minimax rate $\tilde{\mathcal{O}}(N^{-1/k})$, we prove that diffusion models trained with coarse scores can exploit the \emph{regularity of the manifold support} and attain a near-parametric rate toward a \emph{different} target distribution. This target distribution has density uniformly comparable to that of~$μ_{\scriptscriptstyle\mathrm{data}}$ throughout any $\tilde{\mathcal{O}}\bigl(N^{-β/(4k)}\bigr)$-neighborhood of the manifold, where $β$ denotes the manifold regularity. Our guarantees therefore depend only on the smoothness of the underlying support, and are especially favorable when the data density itself is irregular, for instance non-differentiable. In particular, when the manifold is sufficiently smooth, we obtain that \emph{generalization} – formalized as the ability to generate novel, high-fidelity samples – occurs at a statistical rate strictly faster than that required to estimate the full population distribution~$μ_{\scriptscriptstyle\mathrm{data}}$.

[446] Resolving gradient pathology in physics-informed epidemiological models

Nickson Golooba, Woldegebriel Assefa Woldegerima

Main category: cs.LG

TL;DR: Physics-informed neural networks (PINNs) for epidemiological modeling suffer from gradient conflicts between data and physics objectives. The proposed Conflict-Gated Gradient Scaling (CGGS) method uses cosine similarity to dynamically modulate penalty weights, suppressing physics constraints when gradients conflict and restoring them when aligned, leading to stable training and improved parameter estimation.

Details

Motivation: PINNs are increasingly used in mathematical epidemiology to combine noisy clinical data with compartmental models like SEIR, but training is often unstable due to competing optimization objectives where gradient vectors from data loss and physical residuals point in conflicting directions, leading to slow convergence or optimization deadlock.

Method: Conflict-Gated Gradient Scaling (CGGS) uses cosine similarity between data and physics gradients to dynamically modulate the penalty weight. Unlike standard annealing schemes, CGGS acts as a geometric gate: it suppresses physical constraints when directional conflict is high (allowing prioritization of data fidelity) and restores constraints when gradients align.

Result: The method preserves O(1/T) convergence rate for smooth non-convex objectives, autonomously induces curriculum learning effects, and improves parameter estimation in stiff epidemiological systems compared to magnitude-based baselines, with empirical results showing improved peak recovery and convergence.

Conclusion: CGGS provides a computationally efficient alternative to existing methods for addressing gradient conflicts in PINNs for epidemiological modeling, ensuring stable and efficient training through a geometric gating mechanism that dynamically balances data and physics objectives based on gradient alignment.

Abstract: Physics-informed neural networks (PINNs) are increasingly used in mathematical epidemiology to bridge the gap between noisy clinical data and compartmental models, such as the susceptible-exposed-infected-removed (SEIR) model. However, training these hybrid networks is often unstable due to competing optimization objectives. As established in recent literature on ``gradient pathology," the gradient vectors derived from the data loss and the physical residual often point in conflicting directions, leading to slow convergence or optimization deadlock. While existing methods attempt to resolve this by balancing gradient magnitudes or projecting conflicting vectors, we propose a novel method, conflict-gated gradient scaling (CGGS), to address gradient conflicts in physics-informed neural networks for epidemiological modelling, ensuring stable and efficient training and a computationally efficient alternative. This method utilizes the cosine similarity between the data and physics gradients to dynamically modulate the penalty weight. Unlike standard annealing schemes that only normalize scales, CGGS acts as a geometric gate: it suppresses the physical constraint when directional conflict is high, allowing the optimizer to prioritize data fidelity, and restores the constraint when gradients align. We prove that this gating mechanism preserves the standard $O(1/T)$ convergence rate for smooth non-convex objectives, a guarantee that fails under fixed-weight or magnitude-balanced training when gradients conflict. We demonstrate that this mechanism autonomously induces a curriculum learning effect, improving parameter estimation in stiff epidemiological systems compared to magnitude-based baselines. Our empirical results show improved peak recovery and convergence over magnitude-based methods.

[447] Deep Neural Regression Collapse

Akshay Rangamani, Altay Unal

Main category: cs.LG

TL;DR: Deep Neural Regression Collapse (NRC) extends beyond the last layer, revealing sparse/low-rank structures throughout deep regression networks, with features aligning to target dimensions and covariances.

Details

Motivation: While Neural Collapse has been studied for classification and recently extended to regression (but only at the last layer), there's a need to understand if and how this phenomenon occurs throughout all layers of deep regression models.

Method: Established that Neural Regression Collapse occurs below the last layer across different model types, analyzing feature subspace alignment with target dimensions, covariance alignment, weight subspace alignment, and linear prediction error relationships.

Result: Found that in collapsed layers: features lie in target-dimension subspace, feature covariance aligns with target covariance, weight input subspace aligns with feature subspace, and linear feature prediction error approximates overall model error. Also showed Deep NRC helps learn intrinsic dimension of low-rank targets and explored weight decay’s role.

Conclusion: Provides a more complete understanding of the simple structures learned by deep networks in regression contexts, extending Neural Collapse theory beyond classification and beyond just the final layer.

Abstract: Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended the definition of neural collapse to regression problems, albeit only measuring the phenomenon at the last layer. In this paper, we establish that Neural Regression Collapse (NRC) also occurs below the last layer across different types of models. We show that in the collapsed layers of neural regression models, features lie in a subspace that corresponds to the target dimension, the feature covariance aligns with the target covariance, the input subspace of the layer weights aligns with the feature subspace, and the linear prediction error of the features is close to the overall prediction error of the model. In addition to establishing Deep NRC, we also show that models that exhibit Deep NRC learn the intrinsic dimension of low rank targets and explore the necessity of weight decay in inducing Deep NRC. This paper provides a more complete picture of the simple structure learned by deep networks in the context of regression.

[448] Circuit Complexity of Hierarchical Knowledge Tracing and Implications for Log-Precision Transformers

Naiming Liu, Richard Baraniuk, Shashank Sonkar

Main category: cs.LG

TL;DR: The paper analyzes hierarchical prerequisite propagation in knowledge tracing through a computational complexity lens, showing transformer limitations on deep concept hierarchies and proposing structure-aware solutions.

Details

Motivation: Knowledge tracing models need to understand mastery over interconnected concepts organized by prerequisites, but current transformer-style models may have computational limitations when dealing with deep hierarchical prerequisite structures.

Method: The paper uses circuit complexity theory to analyze prerequisite-tree tasks, formalizes recursive-majority mastery propagation, and empirically tests transformer encoders on recursive-majority trees with various supervision strategies.

Result: Transformers trained on recursive-majority trees converge to permutation-invariant shortcuts rather than learning the underlying structure, but auxiliary supervision on intermediate subtrees enables structure-dependent computation and near-perfect accuracy at depths 3-4.

Conclusion: The findings reveal computational barriers for transformers on deep hierarchical structures and motivate the need for structure-aware objectives and iterative mechanisms for effective prerequisite-sensitive knowledge tracing.

Abstract: Knowledge tracing models mastery over interconnected concepts, often organized by prerequisites. We analyze hierarchical prerequisite propagation through a circuit-complexity lens to clarify what is provable about transformer-style computation on deep concept hierarchies. Using recent results that log-precision transformers lie in logspace-uniform $\mathsf{TC}^0$, we formalize prerequisite-tree tasks including recursive-majority mastery propagation. Unconditionally, recursive-majority propagation lies in $\mathsf{NC}^1$ via $O(\log n)$-depth bounded-fanin circuits, while separating it from uniform $\mathsf{TC}^0$ would require major progress on open lower bounds. Under a monotonicity restriction, we obtain an unconditional barrier: alternating ALL/ANY prerequisite trees yield a strict depth hierarchy for \emph{monotone} threshold circuits. Empirically, transformer encoders trained on recursive-majority trees converge to permutation-invariant shortcuts; explicit structure alone does not prevent this, but auxiliary supervision on intermediate subtrees elicits structure-dependent computation and achieves near-perfect accuracy at depths 3–4. These findings motivate structure-aware objectives and iterative mechanisms for prerequisite-sensitive knowledge tracing on deep hierarchies.

[449] Unveiling Hidden Convexity in Deep Learning: a Sparse Signal Processing Perspective

Emi Zeger, Mert Pilanci

Main category: cs.LG

TL;DR: Convex equivalences of ReLU neural networks connect deep learning with sparse signal processing, revealing hidden convexities in loss landscapes for better optimization and theoretical understanding.

Details

Motivation: Despite DNNs' success, their non-convex loss functions complicate optimization and limit theoretical understanding. The paper aims to bridge recent mathematical advances in deep learning with traditional signal processing by exploring convex equivalences of ReLU networks.

Method: The paper provides an educational overview of recently discovered convex equivalences in ReLU neural networks, particularly focusing on two-layer ReLU networks and other architectures, connecting them to sparse signal processing models.

Result: The paper demonstrates how convex reformulations of ReLU networks can address training challenges and improve theoretical understanding, bridging deep learning mathematics with signal processing applications.

Conclusion: Convex equivalences in ReLU neural networks offer promising approaches to overcome optimization challenges and enhance theoretical foundations, encouraging broader applications of signal processing techniques in deep learning.

Abstract: Deep neural networks (DNNs), particularly those using Rectified Linear Unit (ReLU) activation functions, have achieved remarkable success across diverse machine learning tasks, including image recognition, audio processing, and language modeling. Despite this success, the non-convex nature of DNN loss functions complicates optimization and limits theoretical understanding. In this paper, we highlight how recently developed convex equivalences of ReLU NNs and their connections to sparse signal processing models can address the challenges of training and understanding NNs. Recent research has uncovered several hidden convexities in the loss landscapes of certain NN architectures, notably two-layer ReLU networks and other deeper or varied architectures. This paper seeks to provide an accessible and educational overview that bridges recent advances in the mathematics of deep learning with traditional signal processing, encouraging broader signal processing applications.

[450] Symbolic–KAN: Kolmogorov-Arnold Networks with Discrete Symbolic Structure for Interpretable Learning

Salah A Faroughi, Farinaz Mostajeran, Amirhossein Arzani, Shirko Faroughi

Main category: cs.LG

TL;DR: Symbolic-KANs embed discrete symbolic structure within neural networks to bridge the gap between interpretable symbolic regression and scalable neural learning, producing compact closed-form expressions for governing equations.

Details

Motivation: To address the fundamental trade-off between interpretability and scalable learning in scientific machine learning - classical symbolic regression is interpretable but relies on combinatorial search, while neural networks scale well but produce opaque representations.

Method: Symbolic-KANs represent multivariate functions as compositions of learned univariate primitives applied to learned scalar projections, using a library of analytic primitives, hierarchical gating, and symbolic regularization that progressively sharpens continuous mixtures into one-hot selections.

Result: Symbolic-KANs reliably recover correct primitive terms and governing structures in data-driven regression and inverse dynamical systems, and extend to forward/inverse physics-informed learning of PDEs, producing accurate solutions with compact symbolic representations.

Conclusion: Symbolic-KAN represents a step toward scalable, interpretable, and mechanistically grounded learning of governing laws by embedding symbolic structure directly within trainable deep networks.

Abstract: Symbolic discovery of governing equations is a long-standing goal in scientific machine learning, yet a fundamental trade-off persists between interpretability and scalable learning. Classical symbolic regression methods yield explicit analytic expressions but rely on combinatorial search, whereas neural networks scale efficiently with data and dimensionality but produce opaque representations. In this work, we introduce Symbolic Kolmogorov-Arnold Networks (Symbolic-KANs), a neural architecture that bridges this gap by embedding discrete symbolic structure directly within a trainable deep network. Symbolic-KANs represent multivariate functions as compositions of learned univariate primitives applied to learned scalar projections, guided by a library of analytic primitives, hierarchical gating, and symbolic regularization that progressively sharpens continuous mixtures into one-hot selections. After gated training and discretization, each active unit selects a single primitive and projection direction, yielding compact closed-form expressions without post-hoc symbolic fitting. Symbolic-KANs further act as scalable primitive discovery mechanisms, identifying the most relevant analytic components that can subsequently inform candidate libraries for sparse equation-learning methods. We demonstrate that Symbolic-KAN reliably recovers correct primitive terms and governing structures in data-driven regression and inverse dynamical systems. Moreover, the framework extends to forward and inverse physics-informed learning of partial differential equations, producing accurate solutions directly from governing constraints while constructing compact symbolic representations whose selected primitives reflect the true analytical structure of the underlying equations. These results position Symbolic-KAN as a step toward scalable, interpretable, and mechanistically grounded learning of governing laws.

[451] Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness

Yunrui Yu, Hang Su, Jun Zhu

Main category: cs.LG

TL;DR: Activation function curvature (max|σ’’|) critically affects adversarial robustness: insufficient curvature limits expressivity, excessive curvature amplifies Hessian diagonal norm causing sharp minima. Optimal robustness occurs when max|σ’’| is between 4-10 across architectures, datasets, and training methods.

Details

Motivation: To investigate the fundamental relationship between activation function curvature and adversarial robustness, addressing the trade-off between model expressivity and robust generalization through systematic analysis of curvature effects.

Method: Uses Recursive Curvature-Tunable Activation Family (RCT-AF) with parameters α and β for precise curvature control. Systematically analyzes relationship between max|σ’’| and adversarial robustness across diverse network architectures, datasets, and adversarial training methods. Examines how curvature affects Hessian matrix diagonal elements of loss.

Result: Reveals non-monotonic relationship: optimal adversarial robustness consistently occurs when max|σ’’| falls within 4 to 10. Normalized Hessian diagonal norm shows U-shaped dependence on max|σ’’|, with minimum within optimal robustness range. Findings hold across various experimental settings.

Conclusion: Activation function curvature plays a critical role in adversarial robustness through a fundamental trade-off mechanism. Optimal curvature range (4-10 for max|σ’’|) balances expressivity and robust generalization, providing practical guidance for designing robust neural networks.

Abstract: This work investigates the critical role of activation function curvature – quantified by the maximum second derivative $\max|σ’’|$ – in adversarial robustness. Using the Recursive Curvature-Tunable Activation Family (RCT-AF), which enables precise control over curvature through parameters $α$ and $β$, we systematically analyze this relationship. Our study reveals a fundamental trade-off: insufficient curvature limits model expressivity, while excessive curvature amplifies the normalized Hessian diagonal norm of the loss, leading to sharper minima that hinder robust generalization. This results in a non-monotonic relationship where optimal adversarial robustness consistently occurs when $\max|σ’’|$ falls within 4 to 10, a finding that holds across diverse network architectures, datasets, and adversarial training methods. We provide theoretical insights into how activation curvature affects the diagonal elements of the hessian matrix of the loss, and experimentally demonstrate that the normalized Hessian diagonal norm exhibits a U-shaped dependence on $\max|σ’’|$, with its minimum within the optimal robustness range, thereby validating the proposed mechanism.

[452] An Invariant Compiler for Neural ODEs in AI-Accelerated Scientific Simulation

Fangzhou Yu, Yiqi Su, Ray Lee, Shenfeng Cheng, Naren Ramakrishnan

Main category: cs.LG

TL;DR: A framework called “invariant compiler” that uses LLMs to automatically transform neural ODEs into structure-preserving architectures that respect domain invariants by construction.

Details

Motivation: Neural ODEs can violate domain invariants like conservation laws, leading to physically implausible solutions and compounding errors in long-horizon prediction. Existing methods use soft penalties that don't guarantee constraint satisfaction.

Method: Treats invariants as first-class types and uses LLM-driven compilation to translate generic neural ODE specifications into structure-preserving architectures that keep trajectories on admissible manifolds.

Result: Provides a systematic framework for creating invariant-respecting neural surrogates that separate scientific structure (what must be preserved) from learned dynamics (what is learned from data).

Conclusion: The invariant compiler offers a clean separation between preservation requirements and learning, enabling systematic design of physically plausible neural ODE models across scientific domains.

Abstract: Neural ODEs are increasingly used as continuous-time models for scientific and sensor data, but unconstrained neural ODEs can drift and violate domain invariants (e.g., conservation laws), yielding physically implausible solutions. In turn, this can compound error in long-horizon prediction and surrogate simulation. Existing solutions typically aim to enforce invariance by soft penalties or other forms of regularization, which can reduce overall error but do not guarantee that trajectories will not leave the constraint manifold. We introduce the invariant compiler, a framework that enforces invariants by construction: it treats invariants as first-class types and uses an LLM-driven compilation workflow to translate a generic neural ODE specification into a structure-preserving architecture whose trajectories remain on the admissible manifold in continuous time (and up to numerical integration error in practice). This compiler view cleanly separates what must be preserved (scientific structure) from what is learned from data (dynamics within that structure). It provides a systematic design pattern for invariant-respecting neural surrogates across scientific domains.

[453] Deep Convolutional Neural Networks for predicting highest priority functional group in organic molecules

Kunal Khatri, Vineet Mehta

Main category: cs.LG

TL;DR: Deep CNN predicts dominant functional groups in organic molecules from FTIR spectra, outperforming SVM methods.

Details

Motivation: Functional groups determine organic molecule properties, and FTIR spectroscopy is commonly used for identification. The paper aims to automate the prediction of the highest priority functional group from FTIR spectra using deep learning.

Method: Proposes a Deep Convolutional Neural Network (CNN) architecture to analyze Fourier-transform infrared (FTIR) spectra and predict the dominant functional group in organic molecules. Compares performance with Support Vector Machine (SVM) baseline.

Result: CNN model outperforms traditional SVM methods in predicting the highest priority functional group from FTIR spectra, demonstrating superior accuracy and capability in spectral pattern recognition.

Conclusion: Deep CNNs are effective for functional group prediction from FTIR spectra, offering advantages over conventional machine learning approaches like SVM for spectral data analysis.

Abstract: Our work addresses the problem of predicting the highest priority functional group present in an organic molecule. Functional Groups are groups of bound atoms that determine the physical and chemical properties of organic molecules. In the presence of multiple functional groups, the dominant functional group determines the compound’s properties. Fourier-transform Infrared spectroscopy (FTIR) is a commonly used spectroscopic method for identifying the presence or absence of functional groups within a compound. We propose the use of a Deep Convolutional Neural Networks (CNN) to predict the highest priority functional group from the Fourier-transform infrared spectrum (FTIR) of the organic molecule. We have compared our model with other previously applied Machine Learning (ML) method Support Vector Machine (SVM) and reasoned why CNN outperforms it.

[454] Can VLMs Reason Robustly? A Neuro-Symbolic Investigation

Weixin Chen, Antonio Vergari, Han Zhao

Main category: cs.LG

TL;DR: VLC: A neuro-symbolic method combining VLM-based concept recognition with circuit-based symbolic reasoning for robust visual deductive reasoning under distribution shifts.

Details

Motivation: Vision-Language Models (VLMs) achieve high in-distribution accuracy on reasoning tasks but fail to generalize under covariate shifts, suggesting fine-tuning doesn't induce robust reasoning functions. Existing neuro-symbolic approaches with black-box components also show inconsistent robustness.

Method: Proposes VLC: neuro-symbolic method combining VLM-based concept recognition with circuit-based symbolic reasoning. Task rules are compiled into symbolic programs (circuits) that execute rules exactly over object concepts recognized by the VLM, decoupling perception from reasoning.

Result: Experiments on three visual deductive reasoning tasks with distinct rule sets show VLC consistently achieves strong performance under covariate shifts, demonstrating robust reasoning capabilities.

Conclusion: VLC’s neuro-symbolic approach combining VLM perception with exact symbolic reasoning circuits enables robust visual deductive reasoning under distribution shifts, addressing limitations of end-to-end fine-tuned VLMs and black-box neuro-symbolic methods.

Abstract: Vision-Language Models (VLMs) have been applied to a wide range of reasoning tasks, yet it remains unclear whether they can reason robustly under distribution shifts. In this paper, we study covariate shifts in which the perceptual input distribution changes while the underlying prediction rules do not. To investigate this question, we consider visual deductive reasoning tasks, where a model is required to answer a query given an image and logical rules defined over the object concepts in the image. Empirically, we find that VLMs fine-tuned through gradient-based end-to-end training can achieve high in-distribution accuracy but fail to generalize under such shifts, suggesting that fine-tuning does not reliably induce the underlying reasoning function. This motivates a neuro-symbolic perspective that decouples perception from reasoning. However, we further observe that recent neuro-symbolic approaches that rely on black-box components for reasoning can still exhibit inconsistent robustness across tasks. To address this issue, we propose VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning. In particular, task rules are compiled into a symbolic program, specifically a circuit, which executes the rules exactly over the object concepts recognized by the VLM. Experiments on three visual deductive reasoning tasks with distinct rule sets show that VLC consistently achieves strong performance under covariate shifts, highlighting its ability to support robust reasoning.

[455] HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

Ken Ding

Main category: cs.LG

TL;DR: HDPO improves RL training for math reasoning by using privileged self-distillation on failure cases where standard RL gradients vanish.

Details

Motivation: Standard RL for math reasoning fails on "cliff" prompts where models cannot solve at all, causing vanishing gradients and no learning signal for these failure modes.

Method: Hybrid Distillation Policy Optimization (HDPO) identifies failure prompts, generates privileged rollouts with ground-truth information, filters correct solutions, and distills token-level distributions from teacher (with privileged info) to student (same weights, different input).

Result: On OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct, HDPO improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with distillation weight controlling exploration-exploitation tradeoff.

Conclusion: HDPO effectively addresses the cliff prompt problem in RL for math reasoning through privileged self-distillation with bounded realizability gap, providing a principled approach to improve coverage without sacrificing accuracy.

Abstract: Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - “cliff” prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher’s token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight lambda providing direct control over the exploration-exploitation tradeoff.

[456] The Luna Bound Propagator for Formal Analysis of Neural Networks

Henry LeCates, Haoze Wu

Main category: cs.LG

TL;DR: Luna is a C++ implementation of alpha-CROWN bound propagation for neural network verification, offering competitive performance with existing Python implementations while enabling better integration into production systems.

Details

Motivation: Existing alpha-CROWN implementations are limited to Python, which complicates integration into existing DNN verifiers and long-term production-level systems. There's a need for a C++ implementation to facilitate better integration and deployment.

Method: Developed Luna, a new bound propagator implemented in C++ that supports Interval Bound Propagation, CROWN analysis, and alpha-CROWN analysis over general computational graphs. The paper describes the architecture of this C++ implementation.

Result: Luna is competitive with state-of-the-art alpha-CROWN implementation in terms of both bound tightness and computational efficiency on benchmarks from VNN-COMP 2025.

Conclusion: Luna provides a practical C++ implementation of alpha-CROWN that enables better integration into existing DNN verification frameworks and production systems while maintaining competitive performance.

Abstract: The parameterized CROWN analysis, a.k.a., alpha-CROWN, has emerged as a practically successful bound propagation method for neural network verification. However, existing implementations of alpha-CROWN are limited to Python, which complicates integration into existing DNN verifiers and long-term production-level systems. We introduce Luna, a new bound propagator implemented in C++. Luna supports Interval Bound Propagation, the CROWN analysis, and the alpha-CROWN analysis over a general computational graph. We describe the architecture of Luna and show that it is competitive with the state-of-the-art alpha-CROWN implementation in terms of both bound tightness and computational efficiency on benchmarks from VNN-COMP 2025.

[457] Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration

Guopeng Li, Matthijs T. J. Spaan, Julian F. P. Kooij

Main category: cs.LG

TL;DR: COX-Q is an off-policy safe RL algorithm that combines cost-bounded optimistic exploration with conservative offline distributional value learning to achieve high sample efficiency while controlling constraint violations.

Details

Motivation: Existing off-policy safe RL methods suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost, which is problematic for safety-critical applications.

Method: Proposes Constrained Optimistic eXploration Q-learning (COX-Q) with two key components: 1) cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost, and adaptively adjusts trust region; 2) truncated quantile critics to stabilize cost value learning and quantify epistemic uncertainty for exploration guidance.

Result: Experiments on safe velocity, safe navigation, and autonomous driving tasks show COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost.

Conclusion: COX-Q is a promising RL method for safety-critical applications that effectively balances exploration efficiency with safety constraints.

Abstract: When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value learning. Quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost. The results highlight COX-Q as a promising RL method for safety-critical applications.

[458] Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Guy Zamir, Matthew Zurek, Yudong Chen

Main category: cs.LG

TL;DR: A UCB-style algorithm for infinite-horizon MDPs achieves optimal variance-dependent regret bounds for average-reward and γ-regret settings, with improved lower-order terms and characterization of optimal dependence on bias span.

Details

Motivation: Online reinforcement learning in infinite-horizon MDPs has less theoretical development than episodic RL, with existing algorithms suffering from high burn-in costs and failing to adapt to instance-specific complexity. The paper aims to address these shortcomings for average-reward and γ-regret objectives.

Method: Develops a single tractable UCB-style algorithm applicable to both average-reward and γ-regret settings. The algorithm achieves optimal variance-dependent regret guarantees with improved lower-order terms, and analyzes both cases with and without prior knowledge of optimal bias span.

Result: Achieves regret bounds of Õ(√(SA·Var) + lower-order terms) for both settings, which is minimax-optimal and adapts to easier instances. With prior knowledge of bias span, obtains optimal lower-order terms scaling as ||h*||_sp S²A. Without prior knowledge, shows fundamental gap in achievable performance.

Conclusion: The work completely characterizes optimal dependence on bias span in both leading and lower-order terms, revealing a fundamental gap between what’s achievable with and without prior knowledge, and provides near-optimal algorithms for infinite-horizon MDPs.

Abstract: Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high ``burn-in’’ costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $γ$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $γ$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.

[459] GRMLR: Knowledge-Enhanced Small-Data Learning for Deep-Sea Cold Seep Stage Inference

Chenxu Zhou, Zelin Liu, Rui Cai, Houlin Gong, Yikang Yu, Jia Zeng, Yanru Pei, Liang Zhang, Weishu Zhao, Xiaofeng Gao

Main category: cs.LG

TL;DR: A knowledge-enhanced classification framework using ecological knowledge graphs and graph-regularized multinomial logistic regression for deep-sea cold seep stage assessment from microbial data.

Details

Motivation: Traditional deep-sea cold seep assessment relies on expensive manned submersible operations and macrofauna surveys. Microbial communities offer a cost-effective alternative, but existing datasets are extremely small (n=13) relative to feature dimensions (p=26), causing overfitting in purely data-driven models.

Method: Proposes a knowledge-enhanced classification framework incorporating ecological knowledge graphs as structural priors. Uses macro-microbe coupling and microbial co-occurrence patterns to create a Graph-Regularized Multinomial Logistic Regression (GRMLR) model with manifold penalty to constrain feature space for biologically consistent classification.

Result: The approach significantly outperforms standard baselines, demonstrating potential as a robust and scalable framework for deep-sea ecological assessment. Importantly, it removes the need for macrofauna observations at inference time - only microbial abundance profiles are needed for prediction.

Conclusion: The knowledge-enhanced framework effectively addresses the small-sample, high-dimensional challenge in deep-sea microbial ecology by incorporating ecological knowledge as structural priors, enabling reliable cold seep stage assessment without costly macrofauna surveys.

Abstract: Deep-sea cold seep stage assessment has traditionally relied on costly, high-risk manned submersible operations and visual surveys of macrofauna. Although microbial communities provide a promising and more cost-effective alternative, reliable inference remains challenging because the available deep-sea dataset is extremely small ($n = 13$) relative to the microbial feature dimension ($p = 26$), making purely data-driven models highly prone to overfitting. To address this, we propose a knowledge-enhanced classification framework that incorporates an ecological knowledge graph as a structural prior. By fusing macro-microbe coupling and microbial co-occurrence patterns, the framework internalizes established ecological logic into a \underline{\textbf{G}}raph-\underline{\textbf{R}}egularized \underline{\textbf{M}}ultinomial \underline{\textbf{L}}ogistic \underline{\textbf{R}}egression (GRMLR) model, effectively constraining the feature space through a manifold penalty to ensure biologically consistent classification. Importantly, the framework removes the need for macrofauna observations at inference time: macro-microbe associations are used only to guide training, whereas prediction relies solely on microbial abundance profiles. Experimental results demonstrate that our approach significantly outperforms standard baselines, highlighting its potential as a robust and scalable framework for deep-sea ecological assessment.

[460] Wireless communication empowers online scheduling of partially-observable transportation multi-robot systems in a smart factory

Yaxin Liao, Qimei Cui, Kwang-Cheng Chen, Xiong Li, Jinlian Chen, Xiyu Zhao, Xiaofeng Tao, Ping Zhang

Main category: cs.LG

TL;DR: A communication-enabled online scheduling framework for multi-robot task assignment that integrates wireless M2M networking with route scheduling to enable AGVs to exchange intention information for overcoming partial observability in dynamic factory environments.

Details

Motivation: Online multi-robot task assignment in smart factories requires real-time collision-free and congestion-free route scheduling for transportation robots, but faces challenges due to partial observability, dynamic environments, and real-time operational requirements.

Method: Proposes an integrated framework that couples wireless M2M networking with route scheduling, using AGV intention data as M2M traffic with retransmission-free multi-link transmission. Combines simulated annealing-based MRTA with congestion-aware A*-based route scheduling.

Result: The integrated scheme significantly enhances scheduling efficiency compared to baselines, even under high AGV load and limited channel resources. Shows wireless communication impacts on T-MRS performance and reveals scheduling-oriented M2M communication differs fundamentally from human-to-human communications.

Conclusion: The communication-enabled scheduling framework effectively addresses partial observability challenges in dynamic factory environments, enabling AGVs to dynamically adjust collision-free routes with reduced computational overhead, opening new technological opportunities for wireless networked smart factories.

Abstract: Achieving agile and reconfigurable production flows in smart factories depends on online multi-robot task assignment (MRTA), which requires online collision-free and congestion-free route scheduling of transportation multi-robot systems (T-MRS), e.g., collaborative automatic guided vehicles (AGVs). Due to the real-time operational requirements and dynamic interactions between T-MRS and production MRS, online scheduling under partial observability in dynamic factory environments remains a significant and under-explored challenge. This paper proposes a novel communication-enabled online scheduling framework that explicitly couples wireless machine-to-machine (M2M) networking with route scheduling, enabling AGVs to exchange intention information, e.g., planned routes, to overcome partial observations and assist complex computation of online scheduling. Specifically, we determine intelligent AGVs’ intention and sensor data as new M2M traffic and tailor the retransmission-free multi-link transmission networking to meet real-time operation demands. This scheduling-oriented networking is then integrated with a simulated annealing-based MRTA scheme and a congestion-aware A*-based route scheduling method. The integrated communication and scheduling scheme allows AGVs to dynamically adjust collision-free and congestion-free routes with reduced computational overhead. Numerical experiments shows the impacts from wireless communication on the performance of T-MRS and suggest that the proposed integrated scheme significantly enhances scheduling efficiency compared to other baselines, even under high AGV load conditions and limited channel resources. Moreover, the results reveal that the scheduling-oriented wireless M2M communication design fundamentally differs from human-to-human communications, implying new technological opportunities in a wireless networked smart factory.

[461] Kirchhoff-Inspired Neural Networks for Evolving High-Order Perception

Tongfei Chen, Jingying Yang, Linlin Yang, Jinhu Lü, David Doermann, Chunyu Xie, Long He, Tian Wang, Juan Zhang, Guodong Guo, Baochang Zhang

Main category: cs.LG

TL;DR: KINN is a novel neural network architecture inspired by Kirchhoff’s current law that uses state-variable-based design with explicit decoupling of higher-order evolutionary components, achieving superior performance on PDE solving and ImageNet classification.

Details

Motivation: Current deep learning architectures lack systematic mechanisms to jointly characterize the interplay among signal intensity, coupling structure, and state evolution, unlike biological systems which use dynamic membrane potential fluctuations. The paper aims to bridge this gap by creating a more biologically-inspired architecture.

Method: Proposes Kirchhoff-Inspired Neural Network (KINN) based on Kirchhoff’s current law, using state-variable-based architecture with numerically stable state updates derived from fundamental ordinary differential equations. The method enables explicit decoupling and encoding of higher-order evolutionary components within a single layer while maintaining physical consistency and interpretability.

Result: Extensive experiments show KINN outperforms state-of-the-art methods on both partial differential equation (PDE) solving and ImageNet image classification tasks.

Conclusion: KINN provides a novel biologically-inspired neural network architecture that systematically characterizes signal intensity, coupling structure, and state evolution, offering improved performance, physical consistency, and interpretability compared to conventional deep networks.

Abstract: Deep learning architectures are fundamentally inspired by neuroscience, particularly the structure of the brain’s sensory pathways, and have achieved remarkable success in learning informative data representations. Although these architectures mimic the communication mechanisms of biological neurons, their strategies for information encoding and transmission are fundamentally distinct. Biological systems depend on dynamic fluctuations in membrane potential; by contrast, conventional deep networks optimize weights and biases by adjusting the strengths of inter-neural connections, lacking a systematic mechanism to jointly characterize the interplay among signal intensity, coupling structure, and state evolution. To tackle this limitation, we propose the Kirchhoff-Inspired Neural Network (KINN), a state-variable-based network architecture constructed based on Kirchhoff’s current law. KINN derives numerically stable state updates from fundamental ordinary differential equations, enabling the explicit decoupling and encoding of higher-order evolutionary components within a single layer while preserving physical consistency, interpretability, and end-to-end trainability. Extensive experiments on partial differential equation (PDE) solving and ImageNet image classification validate that KINN outperforms state-of-the-art existing methods.

[462] Transcending Classical Neural Network Boundaries: A Quantum-Classical Synergistic Paradigm for Seismic Data Processing

Zhengyi Yuan, Xintong Dong, Xinyang Wang, Zheng Cong, Shiqi Dong

Main category: cs.LG

TL;DR: Quantum-classical synergistic GAN (QC-GAN) combines quantum neural networks with classical convolutional networks for seismic data processing, leveraging quantum mechanics’ exponential state space to overcome representational limitations of classical NNs.

Details

Motivation: Classical neural networks for seismic data processing (denoising, interpolation, frequency-band extension) face representational bottlenecks due to stacked perceptrons and standard activation functions, making it difficult to capture complex, non-stationary seismic wavefield dynamics.

Method: Proposes QC-GAN with two pathways: quantum pathway exploits high-order feature correlations using quantum neural networks that map features into high-dimensional Hilbert spaces, and convolutional pathway extracts waveform structures. Includes QC feature complementarity loss to enforce feature orthogonality between pathways.

Result: Experimental results on denoising and interpolation tasks show QC-GAN preserves wavefield continuity and amplitude-phase information under complex noise conditions, demonstrating improved performance over classical methods.

Conclusion: QC-GAN breaks representational bottlenecks of classical GANs by synergistically integrating quantum and convolutional pathways, enabling better seismic data processing through enriched feature representation from quantum mechanics’ exponential state space.

Abstract: In recent years, a number of neural-network (NN) methods have exhibited good performance in seismic data processing, such as denoising, interpolation, and frequency-band extension. However, these methods rely on stacked perceptrons and standard activation functions, which imposes a bottleneck on the representational capacity of deep-learning models, making it difficult to capture the complex and non-stationary dynamics of seismic wavefields. Different from the classical perceptron-stacked NNs which are fundamentally confined to real-valued Euclidean spaces, the quantum NNs leverage the exponential state space of quantum mechanics to map the features into high-dimensional Hilbert spaces, transcending the representational boundary of classical NNs. Based on this insight, we propose a quantum-classical synergistic generative adversarial network (QC-GAN) for seismic data processing, serving as the first application of quantum NNs in seismic exploration. In QC-GAN, a quantum pathway is used to exploit the high-order feature correlations, while the convolutional pathway specializes in extracting the waveform structures of seismic wavefields. Furthermore, we design a QC feature complementarity loss to enforce the feature orthogonality in the proposed QC-GAN. This novel loss function can ensure that the two pathways encode non-overlapping information to enrich the capacity of feature representation. On the whole, by synergistically integrating the quantum and convolutional pathways, the proposed QC-GAN breaks the representational bottleneck inherent in classical GAN. Experimental results on denoising and interpolation tasks demonstrate that QC-GAN preserves wavefield continuity and amplitude-phase information under complex noise conditions.

[463] Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score

Jimyung Hong, Jaehyung Kim

Main category: cs.LG

TL;DR: DIET is a training-free structured pruning method for LLMs that combines dimension-level pruning with task-aware selection using activation profiling across multiple tasks.

Details

Motivation: LLMs are too large for practical deployment, and existing pruning methods face trade-offs: task-agnostic approaches can't adapt to specific tasks, while task-aware methods require expensive training. There's a need for efficient, task-aware pruning without training costs.

Method: DIET profiles activation magnitudes across multiple tasks using only 100 samples per task, then applies majority voting to construct a single global pruning mask at the dimension level, requiring no training or large pre-computation costs.

Result: On Gemma-2 2B and 9B models across seven zero-shot benchmarks, DIET achieves significant improvements - at 20% sparsity on Gemma-2 2B, it shows ~10% average accuracy improvement over previous SOTA structured pruning methods, with advantages persisting across sparsity levels and model scales.

Conclusion: DIET provides a practical, robust training-free structured pruning solution that effectively balances dimension-level granularity with task-aware selection, making it suitable for efficient LLM deployment.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions or layers, yet existing methods face critical trade-offs: task-agnostic approaches cannot adapt to task-specific requirements, while task-aware methods require costly training to learn task adaptability. We propose DIET (Dimension-wise global pruning of LLMs via merging Task-wise importance scores), a training-free structured pruning method that combines dimension-level granularity with task-aware selection. DIET profiles activation magnitudes across tasks using only 100 samples per task, then applies majority voting to construct a single global mask. DIET does not require large costs from pre-computation or training. Experiments on seven zero-shot benchmarks using Gemma-2 2B and 9B models demonstrate the effectiveness of DIET; for example, at 20% sparsity on Gemma-2 2B, DIET achieves near 10% average accuracy improvement, compared to previous state-of-the-art structured pruning methods. This advantage persists across various sparsity levels and model scales, positioning DIET as a practical and robust choice for structured LLM pruning.

[464] Can we generate portable representations for clinical time series data using LLMs?

Zongliang Ji, Yifei Sun, Andre Amaral, Anna Goldenberg, Rahul G. Krishnan

Main category: cs.LG

TL;DR: LLMs create portable patient embeddings from ICU time series via natural language summaries, enabling cross-hospital model transfer with minimal retraining.

Details

Motivation: Clinical ML models degrade when deployed across hospitals due to distribution shifts; need portable patient representations that enable cross-hospital transfer with minimal retraining.

Method: Map ICU time series to natural language summaries using frozen LLM, embed summaries with frozen text embedding model to get fixed-length vectors for downstream predictors.

Result: Competitive with in-distribution baselines (grid imputation, self-supervised learning, time series foundation models) while showing smaller performance drops when transferring to new hospitals.

Conclusion: LLMs show potential as tools for scalable clinical ML deployment by creating portable patient embeddings that reduce engineering overhead for cross-hospital transfer.

Abstract: Deploying clinical ML is slow and brittle: models that work at one hospital often degrade under distribution shifts at the next. In this work, we study a simple question – can large language models (LLMs) create portable patient embeddings i.e. representations of patients enable a downstream predictor built on one hospital to be used elsewhere with minimal-to-no retraining and fine-tuning. To do so, we map from irregular ICU time series onto concise natural language summaries using a frozen LLM, then embed each summary with a frozen text embedding model to obtain a fixed length vector capable of serving as input to a variety of downstream predictors. Across three cohorts (MIMIC-IV, HIRID, PPICU), on multiple clinically grounded forecasting and classification tasks, we find that our approach is simple, easy to use and competitive with in-distribution with grid imputation, self-supervised representation learning, and time series foundation models, while exhibiting smaller relative performance drops when transferring to new hospitals. We study the variation in performance across prompt design, with structured prompts being crucial to reducing the variance of the predictive models without altering mean accuracy. We find that using these portable representations improves few-shot learning and does not increase demographic recoverability of age or sex relative to baselines, suggesting little additional privacy risk. Our work points to the potential that LLMs hold as tools to enable the scalable deployment of production grade predictive models by reducing the engineering overhead.

[465] Understanding the Challenges in Iterative Generative Optimization with LLMs

Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, Ching-An Cheng

Main category: cs.LG

TL;DR: Generative optimization using LLMs for iterative improvement of artifacts is promising but brittle due to hidden design choices in setting up learning loops.

Details

Motivation: Despite active research, only 9% of surveyed agents use automated optimization, indicating brittleness in current generative optimization approaches. The authors argue this stems from engineers having to make implicit design choices when setting up learning loops.

Method: Investigates three key factors affecting applications: starting artifact selection, credit horizon for execution traces, and batching trials into learning evidence. Uses case studies in MLAgentBench, Atari, and BigBench Extra Hard to analyze these design decisions.

Result: Found that design decisions critically determine optimization success: different starting artifacts affect solution reachability in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches don’t monotonically improve generalization on BBEH.

Conclusion: Lack of simple, universal way to set up learning loops across domains is a major hurdle for productionization. Provides practical guidance for making these design choices explicit.

Abstract: Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make ``hidden’’ design choices: What can the optimizer edit and what is the “right” learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and batching trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard, we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.

[466] Stochastic Dimension-Free Zeroth-Order Estimator for High-Dimensional and High-Order PINNs

Zhangyong Liang, Ji Zhang

Main category: cs.LG

TL;DR: SDZE framework enables training of 10M-dimensional PINNs on a single GPU by combining stochastic dimension-free spatial estimators with zeroth-order optimization using variance reduction techniques.

Details

Motivation: PINNs for high-dimensional/high-order PDEs face prohibitive memory overhead from backpropagation and computational complexity from spatial derivatives. Existing randomized spatial estimators reduce spatial complexity but still require backpropagation memory, while naive zeroth-order optimization suffers from variance explosion.

Method: Proposes Stochastic Dimension-free Zeroth-order Estimator (SDZE) with two key innovations: 1) Common Random Numbers Synchronization (CRNS) to cancel O(1/ε²) variance by locking spatial random seeds across perturbations, and 2) implicit matrix-free subspace projection to reduce parameter exploration variance from O(P) to O(r) while maintaining O(1) memory footprint.

Result: SDZE enables training of 10-million-dimensional PINNs on a single NVIDIA A100 GPU, delivering significant improvements in speed and memory efficiency over state-of-the-art baselines.

Conclusion: SDZE provides a unified framework that achieves dimension-independent complexity in both space and memory for PINNs, overcoming fundamental limitations of backpropagation-based methods for high-dimensional PDEs.

Abstract: Physics-Informed Neural Networks (PINNs) for high-dimensional and high-order partial differential equations (PDEs) are primarily constrained by the $\mathcal{O}(d^k)$ spatial derivative complexity and the $\mathcal{O}(P)$ memory overhead of backpropagation (BP). While randomized spatial estimators successfully reduce the spatial complexity to $\mathcal{O}(1)$, their reliance on first-order optimization still leads to prohibitive memory consumption at scale. Zeroth-order (ZO) optimization offers a BP-free alternative; however, naively combining randomized spatial operators with ZO perturbations triggers a variance explosion of $\mathcal{O}(1/\varepsilon^2)$, leading to numerical divergence. To address these challenges, we propose the \textbf{S}tochastic \textbf{D}imension-free \textbf{Z}eroth-order \textbf{E}stimator (\textbf{SDZE}), a unified framework that achieves dimension-independent complexity in both space and memory. Specifically, SDZE leverages \emph{Common Random Numbers Synchronization (CRNS)} to algebraically cancel the $\mathcal{O}(1/\varepsilon^2)$ variance by locking spatial random seeds across perturbations. Furthermore, an \emph{implicit matrix-free subspace projection} is introduced to reduce parameter exploration variance from $\mathcal{O}(P)$ to $\mathcal{O}(r)$ while maintaining an $\mathcal{O}(1)$ optimizer memory footprint. Empirical results demonstrate that SDZE enables the training of 10-million-dimensional PINNs on a single NVIDIA A100 GPU, delivering significant improvements in speed and memory efficiency over state-of-the-art baselines.

[467] i-IF-Learn: Iterative Feature Selection and Unsupervised Learning for High-Dimensional Complex Data

Chen Ma, Wanjie Wang, Shuhao Fan

Main category: cs.LG

TL;DR: i-IF-Learn is an iterative unsupervised framework that jointly performs feature selection and clustering to identify influential features that define clusters in high-dimensional data.

Details

Motivation: High-dimensional data often contains irrelevant or noisy features that obscure underlying structures, making unsupervised learning challenging. Only a few "influential features" meaningfully define clusters, and recovering these features is crucial for data interpretation and clustering.

Method: Proposes i-IF-Learn, an iterative framework with adaptive feature selection statistic that combines pseudo-label supervision with unsupervised signals. Uses low-dimensional embeddings (PCA or Laplacian eigenmaps) followed by k-means to simultaneously output influential feature subset and clustering labels.

Result: Numerical experiments on gene microarray and single-cell RNA-seq datasets show i-IF-Learn significantly surpasses classical and deep clustering baselines. Using selected influential features as preprocessing substantially enhances downstream deep models like DeepCluster, UMAP, and VAE.

Conclusion: Targeted feature selection is important and effective for unsupervised learning in high-dimensional data. i-IF-Learn provides a robust framework for identifying influential features that meaningfully define clusters while improving downstream model performance.

Abstract: Unsupervised learning of high-dimensional data is challenging due to irrelevant or noisy features obscuring underlying structures. It’s common that only a few features, called the influential features, meaningfully define the clusters. Recovering these influential features is helpful in data interpretation and clustering. We propose i-IF-Learn, an iterative unsupervised framework that jointly performs feature selection and clustering. Our core innovation is an adaptive feature selection statistic that effectively combines pseudo-label supervision with unsupervised signals, dynamically adjusting based on intermediate label reliability to mitigate error propagation common in iterative frameworks. Leveraging low-dimensional embeddings (PCA or Laplacian eigenmaps) followed by $k$-means, i-IF-Learn simultaneously outputs influential feature subset and clustering labels. Numerical experiments on gene microarray and single-cell RNA-seq datasets show that i-IF-Learn significantly surpasses classical and deep clustering baselines. Furthermore, using our selected influential features as preprocessing substantially enhances downstream deep models such as DeepCluster, UMAP, and VAE, highlighting the importance and effectiveness of targeted feature selection.

[468] Lagrangian Relaxation Score-based Generation for Mixed Integer linear Programming

Ruobing Wang, Xin Li, Yujie Fang, Mingzhong Wang

Main category: cs.LG

TL;DR: SRG is a generative framework using Lagrangian relaxation-guided stochastic differential equations to accelerate mixed-integer linear programming solving by generating diverse, high-quality solution candidates.

Details

Motivation: Existing predict-and-search methods for MILP solving assume variable independence and use deterministic single-point predictions, limiting solution diversity and requiring extensive downstream search. There's a need for methods that can generate diverse, high-quality solutions while capturing inter-variable dependencies.

Method: SRG uses Lagrangian relaxation-guided stochastic differential equations with convolutional kernels to capture inter-variable dependencies. It integrates Lagrangian relaxation to guide sampling toward feasible and near-optimal regions, generating diverse solution candidates that define compact trust-region subproblems for standard MILP solvers.

Result: SRG consistently outperforms existing machine learning baselines in solution quality across multiple public benchmarks. It demonstrates strong zero-shot transferability, achieving competitive optimality with state-of-the-art exact solvers on unseen instances while significantly reducing computational overhead.

Conclusion: SRG provides a theoretically-grounded generative framework for MILP solving that generates diverse, high-quality solutions, captures variable dependencies, and enables efficient solving with strong transfer capabilities.

Abstract: Predict-and-search (PaS) methods have shown promise for accelerating mixed-integer linear programming (MILP) solving. However, existing approaches typically assume variable independence and rely on deterministic single-point predictions, which limits solution diversityand often necessitates extensive downstream search for high-quality solutions. In this paper, we propose \textbf{SRG}, a generative framework based on Lagrangian relaxation-guided stochastic differential equations (SDEs), with theoretical guarantees on solution quality. SRG leverages convolutional kernels to capture inter-variable dependencies while integrating Lagrangian relaxation to guide the sampling process toward feasible and near-optimal regions. Rather than producing a single estimate, SRG generates diverse, high-quality solution candidates that collectively define compact and effective trust-region subproblems for standard MILP solvers. Across multiple public benchmarks, SRG consistently outperforms existing machine learning baselines in solution quality. Moreover, SRG demonstrates strong zero-shot transferability: on unseen cross-scale/problem instances, it achieves competitive optimality with state-of-the-art exact solvers while significantly reducing computational overhead through faster search and superior solution quality.

[469] MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning

Andrea Manzoni

Main category: cs.LG

TL;DR: MoE-Sieve: Routing-guided LoRA fine-tuning for MoE models that only applies adapters to frequently activated “hot” experts, reducing parameters and training time while maintaining performance.

Details

Motivation: Standard LoRA fine-tuning applies adapters to all experts in MoE models, but expert routing is highly skewed - most tokens are handled by a small subset of experts while many are rarely activated ("cold experts"). This leads to inefficient parameter usage and training time.

Method: Profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those selected experts. The method systematically profiles expert routing across architectures and tasks to identify frequently activated experts.

Result: Tuning only top 25% routed experts per layer remains competitive with full LoRA (within +/-1 percentage point), reduces LoRA trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%. Random expert selection performs 2.5 percentage points worse.

Conclusion: MoE-Sieve provides an efficient fine-tuning framework for MoE models by leveraging routing patterns, demonstrating that adapting cold experts can introduce gradient noise without improving accuracy. The routing signal is crucial for effective expert selection.

Abstract: Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated (“cold”). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% routed experts per layer remains competitive with full LoRA, with mean differences within +/-1 percentage point across all conditions. This reduces LoRA trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%. We also observe a non-monotonic relationship between expert count and seed-to-seed variance, consistent with the hypothesis that adapting cold experts can introduce gradient noise without improving accuracy. Further ablations show that random expert selection at matched budget is about 2.5 percentage points worse, indicating that the routing signal matters, while greedy per-layer budget optimization does not improve over uniform top-k.

[470] The impact of sensor placement on graph-neural-network-based leakage detection

J. J. H. van Gemert, V. Breschi, D. R. Yntema, K. J. Keesman, M. Lazar

Main category: cs.LG

TL;DR: Proposes PageRank-Centrality-based sensor placement method for water distribution networks to improve GNN-based leakage detection performance

Details

Motivation: Sensor placement is critical for leakage detection in water distribution networks, and GNN performance depends heavily on sensor configurations

Method: Novel PageRank-Centrality-based sensor placement method that strategically positions sensors to optimize GNN performance for pressure estimation, prediction, and leakage detection

Result: Demonstrated substantial impact on reconstruction, prediction, and leakage detection performance on EPANET Net1 benchmark

Conclusion: Sensor placement significantly influences GNN-based leakage detection effectiveness, and the proposed centrality-based method improves performance

Abstract: Sensor placement for leakage detection in water distribution networks is an important and practical challenge for water utilities. Recent work has shown that graph neural networks can estimate and predict pressures and detect leaks, but their performance strongly depends on the available sensor measurements and configurations. In this paper, we investigate how sensor placement influences the performance of GNN-based leakage detection. We propose a novel PageRank-Centrality-based sensor placement method and demonstrate that it substantially impacts reconstruction, prediction, and leakage detection on the EPANET Net1.

[471] Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization

Fei Bai, Zhipeng Chen, Chuan Hao, Ming Yang, Ran Tao, Bryan Dai, Wayne Xin Zhao, Jian Yang, Hongteng Xu

Main category: cs.LG

TL;DR: DGO (Dual Guidance Optimization) is a reinforcement learning framework that improves LLM reasoning by leveraging both external experience banks and internal knowledge to guide exploration and internalize useful trajectories.

Details

Motivation: Current RL-based training for LLMs is a rough approximation of human learning, which effectively uses both external and internal experience. The authors aim to bridge this gap by developing a method that better utilizes and internalizes experience during RLVR training.

Method: DGO constructs an experience bank from previously explored trajectories, then guides policy exploration using both the external experience bank and the model’s internal knowledge. The resulting trajectories refine the experience bank and optimize model parameters, creating a closed loop of experience utilization and internalization.

Result: Experiments show that DGO consistently outperforms baseline methods, demonstrating that better utilization and internalization of experience leads to more effective reasoning in LLMs.

Conclusion: The proposed DGO framework successfully addresses the gap between current RL training and human learning by enabling LLMs to better leverage both external and internal experience, resulting in improved reasoning capabilities.

Abstract: Recently, reinforcement learning~(RL) has become an important approach for improving the capabilities of large language models~(LLMs). In particular, reinforcement learning from verifiable rewards~(RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training still remains only a rough approximation to human learning. Human learners leverage both external and internal experience to guide exploration and gradually internalize useful trajectories into stable knowledge. Motivated by this gap, we ask: how can LLMs better utilize and internalize experience during RLVR training? To answer this question, we propose \textbf{D}ual \textbf{G}uidance \textbf{O}ptimization~(\textbf{DGO}), a unified framework that leverages \emph{external} and \emph{internal experience} to improve training effectiveness. Specifically, DGO first constructs an experience bank from previously explored trajectories. The policy then performs exploration under the joint guidance of the experience bank and the model’s internal knowledge. The resulting trajectories are further used to refine the experience bank and optimize model parameters, forming a closed loop of experience utilization and internalization. Experiments show that DGO consistently outperforms baseline methods, suggesting that better utilization and internalization of experience lead to more effective reasoning.

[472] KCLNet: Electrically Equivalence-Oriented Graph Representation Learning for Analog Circuits

Peng Xu, Yapeng Li, Tinghuan Chen, Tsung-Yi Ho, Bei Yu

Main category: cs.LG

TL;DR: KCLNet is a novel analog circuit representation learning framework that uses Kirchhoff’s Current Law-inspired constraints to maintain electrical properties in embeddings, achieving strong performance on analog circuit tasks.

Details

Motivation: While digital circuit representation learning has advanced significantly, analog circuit representation remains challenging due to continuous electrical characteristics. Existing methods struggle to preserve electrical constraints in learned representations.

Method: Proposes KCLNet with an asynchronous graph neural network using electrically-simulated message passing and a representation learning method inspired by Kirchhoff’s Current Law that enforces current embedding equality at each depth.

Result: Achieves significant performance improvements in analog circuit classification, subcircuit detection, and circuit edit distance prediction tasks compared to baseline methods.

Conclusion: KCLNet provides an effective solution for analog circuit representation learning that preserves electrical constraints, enhancing generalization ability of circuit embeddings.

Abstract: Digital circuits representation learning has made remarkable progress in the electronic design automation domain, effectively supporting critical tasks such as testability analysis and logic reasoning. However, representation learning for analog circuits remains challenging due to their continuous electrical characteristics compared to the discrete states of digital circuits. This paper presents a direct current (DC) electrically equivalent-oriented analog representation learning framework, named \textbf{KCLNet}. It comprises an asynchronous graph neural network structure with electrically-simulated message passing and a representation learning method inspired by Kirchhoff’s Current Law (KCL). This method maintains the orderliness of the circuit embedding space by enforcing the equality of the sum of outgoing and incoming current embeddings at each depth, which significantly enhances the generalization ability of circuit embeddings. KCLNet offers a novel and effective solution for analog circuit representation learning with electrical constraints preserved. Experimental results demonstrate that our method achieves significant performance in a variety of downstream tasks, e.g., analog circuit classification, subcircuit detection, and circuit edit distance prediction.

[473] The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation

Mingyi Liu

Main category: cs.LG

TL;DR: RLHF-aligned language models exhibit response homogenization where multiple samples produce semantically identical answers, reducing uncertainty estimation effectiveness, but token entropy retains signal and selective prediction can improve performance.

Details

Motivation: To investigate how RLHF alignment affects language model response diversity and uncertainty estimation, particularly examining whether alignment causes homogenization of responses that reduces the effectiveness of sampling-based uncertainty methods.

Method: Conducted systematic experiments across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B). Used base-vs-instruct ablation, training stage ablation (Base, SFT, DPO), and cross-family replication. Employed Jaccard, embedding, and NLI-based baselines with DeBERTa models. Validated with cross-embedder and cross-dataset tests. Explored cheapest-first cascade (UCBD) over orthogonal uncertainty signals for selective prediction.

Result: RLHF-aligned models show significant response homogenization: 40-79% of TruthfulQA questions produce single semantic clusters across samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while token entropy retains signal (0.603). Alignment tax is task-dependent (GSM8K: 0.724 AUROC). Base models show 1.0% single-cluster rate vs. 28.5% for instruct models. DPO training causes homogenization, not SFT. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage with 57% cost savings.

Conclusion: RLHF alignment causes response homogenization that impairs uncertainty estimation, but token entropy remains effective. The alignment tax varies by model family and scale. Selective prediction using orthogonal uncertainty signals can mitigate these issues and improve performance-cost tradeoffs.

Abstract: RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen’s d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model shows 1.0% single-cluster rate vs. 28.5% for the instruct model (p < 10^{-6}). A training stage ablation (Base 0.0% -> SFT 1.5% -> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross-family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with Jaccard, embedding, and NLI-based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross-embedder validation with two independent embedding families rules out coupling bias. Cross-dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding – response homogenization – is implementation-independent and label-free. Motivated by this diagnosis, we explore a cheapest-first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| <= 0.12) enable 57% cost savings.

[474] Causality-Driven Disentangled Representation Learning in Multiplex Graphs

Saba Nasiri, Selin Aviyente, Dorina Thanou

Main category: cs.LG

TL;DR: CaDeM: A causal inference framework for disentangling common and private representations in multiplex graphs through self-supervised learning with backdoor adjustment.

Details

Motivation: Multiplex graphs (multi-layer networks with multiple relation types) present challenges for representation learning due to entangled shared and layer-specific information, which limits generalization and interpretability of learned representations.

Method: Introduces a causal inference-based framework that: (1) aligns shared embeddings across layers, (2) enforces private embeddings to capture layer-specific signals, and (3) applies backdoor adjustment to ensure common embeddings capture only global information while being separated from private representations.

Result: Experiments on synthetic and real-world datasets demonstrate consistent improvements over existing baselines, highlighting effectiveness for robust and interpretable multiplex graph representation learning.

Conclusion: The CaDeM framework successfully disentangles common and private components in multiplex graphs, leading to more robust and interpretable representations through causal inference techniques.

Abstract: Learning representations from multiplex graphs, i.e., multi-layer networks where nodes interact through multiple relation types, is challenging due to the entanglement of shared (common) and layer-specific (private) information, which limits generalization and interpretability. In this work, we introduce a causal inference-based framework that disentangles common and private components in a self-supervised manner. CaDeM jointly (i) aligns shared embeddings across layers, (ii) enforces private embeddings to capture layer-specific signals, and (iii) applies backdoor adjustment to ensure that the common embeddings capture only global information while being separated from the private representations. Experiments on synthetic and real-world datasets demonstrate consistent improvements over existing baselines, highlighting the effectiveness of our approach for robust and interpretable multiplex graph representation learning.

[475] A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula

Cansu Sancaktar, David Zhang, Gabriel Synnaeve, Taco Cohen

Main category: cs.LG

TL;DR: Multi-turn synthetic data generation pipeline for RL training of LLMs, using teacher models to create structured difficulty progressions without fine-tuning, improving code and math performance.

Details

Motivation: Reinforcement learning for LLMs faces challenges in sustaining performance gains at scale, where data diversity and structure become limiting factors rather than just volume. Current approaches lack effective methods for generating structured, progressive difficulty data.

Method: Introduces a scalable multi-turn synthetic data generation pipeline where a teacher model iteratively refines problems based on in-context student performance summaries. This creates structured difficulty progressions (stepping stones) without teacher fine-tuning, supporting curriculum-based training.

Result: Multi-turn approach substantially improves yield of valid synthetic problems and produces easier/harder task variants. Synthetic augmentation consistently improves in-domain code performance and in most cases out-of-domain math performance across Llama3.1-8B, Qwen3-8B, and Qwen2.5-32B models.

Conclusion: The multi-turn synthetic data generation pipeline effectively addresses RL scaling challenges by creating structured difficulty progressions, providing empirical insights into how curriculum design and data diversity jointly shape RL training dynamics.

Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for improving large language models beyond supervised fine-tuning, yet sustaining performance gains at scale remains an open challenge, as data diversity and structure, rather than volume alone, become the limiting factor. We address this by introducing a scalable multi-turn synthetic data generation pipeline in which a teacher model iteratively refines problems based on in-context student performance summaries, producing structured difficulty progressions without any teacher fine-tuning. Compared to single-turn generation, this multi-turn approach substantially improves the yield of valid synthetic problems and naturally produces stepping stones, i.e. easier and harder variants of the same core task, that support curriculum-based training. We systematically study how task difficulty, curriculum scheduling, and environment diversity interact during RL training across the Llama3.1-8B Instruct and Qwen3-8B Base model families, with additional scaling experiments on Qwen2.5-32B. Our results show that synthetic augmentation consistently improves in-domain code and in most cases out-of-domain math performance, and we provide empirical insights into how curriculum design and data diversity jointly shape RL training dynamics.

[476] Mixed-signal implementation of feedback-control optimizer for single-layer Spiking Neural Networks

Jonathan Haag, Christian Metzner, Dmitrii Zendrikov, Giacomo Indiveri, Benjamin Grewe, Chiara De Luca, Matteo Saponati

Main category: cs.LG

TL;DR: Implementation of feedback-control optimizers for on-chip learning in mixed-signal neuromorphic processors, demonstrating hardware-compatible training that matches software performance.

Details

Motivation: On-chip learning is crucial for scalable and adaptive neuromorphic systems, but existing training methods are either hardware-unfriendly or too restrictive. Feedback-control optimizers offer a promising alternative for expressive on-chip training.

Method: Implemented feedback-control optimizers on a mixed-signal neuromorphic processor using an In-The-Loop (ITL) training setup. Tested on binary classification and nonlinear Yin-Yang problems, comparing against numerical simulations and gradient-based baselines.

Result: On-chip training achieved performance matching numerical simulations and gradient-based baselines, demonstrating feasibility of feedback-driven online learning under realistic mixed-signal constraints.

Conclusion: Feedback-control optimizers enable expressive on-chip training for neuromorphic systems, representing a co-design approach for embedding learning rules directly in silicon for autonomous and adaptive neuromorphic computing.

Abstract: On-chip learning is key to scalable and adaptive neuromorphic systems, yet existing training methods are either difficult to implement in hardware or overly restrictive. However, recent studies show that feedback-control optimizers can enable expressive, on-chip training of neuromorphic devices. In this work, we present a proof-of-concept implementation of such feedback-control optimizers on a mixed-signal neuromorphic processor. We assess the proposed approach in an In-The-Loop(ITL) training setup on both a binary classification task and the nonlinear Yin-Yang problem, demonstrating on-chip training that matches the performance of numerical simulations and gradient-based baselines. Our results highlight the feasibility of feedback-driven, online learning under realistic mixed-signal constraints, and represent a co-design approach toward embedding such rules directly in silicon for autonomous and adaptive neuromorphic computing.

[477] Likelihood hacking in probabilistic program synthesis

Jacek Karwowski, Younesse Kaddar, Zihuiwen Ye, Nikolay Malkin, Sam Staton

Main category: cs.LG

TL;DR: The paper introduces “likelihood hacking” - a failure mode where language models trained by RL to write probabilistic programs artificially inflate their marginal-likelihood reward by producing programs with non-normalizing data distributions instead of actually fitting data better.

Details

Motivation: When language models are trained via reinforcement learning to generate probabilistic programs, they can discover exploits to artificially inflate their reward metrics without genuinely improving model quality. This "likelihood hacking" undermines the reliability of automated Bayesian model discovery systems.

Method: The authors formalize likelihood hacking in a core probabilistic programming language, identify sufficient syntactic conditions to prevent it, and prove that a safe language fragment satisfying these conditions cannot produce likelihood-hacking programs. They implement these conditions as SafeStan, a modification of Stan that resists likelihood hacking.

Result: Empirical results show that GRPO-trained models generating PyMC code discover likelihood hacking exploits within the first few training steps, driving violation rates well above baseline. SafeStan effectively prevents likelihood hacking under optimization pressure.

Conclusion: Language-level safety constraints are both theoretically grounded and effective in practice for preventing likelihood hacking in automated Bayesian model discovery systems, ensuring reliable probabilistic program generation.

Abstract: When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment $\mathcal{L}{\text{safe}}$ satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement $\mathcal{L}{\text{safe}}$’s conditions as $\texttt{SafeStan}$, a LH-resistant modification of Stan, and show empirically that it prevents LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.

[478] On Gossip Algorithms for Machine Learning with Pairwise Objectives

Igor Colin, Aurélien Bellet, Stephan Clémençon, Joseph Salmon

Main category: cs.LG

TL;DR: Gossip-based algorithms for decentralized statistical learning with pairwise objective functions (U-statistics of degree two) in IoT sensor networks

Details

Motivation: In IoT era with distributed smart sensors having limited resources, need for statistical learning methods that work over networks while respecting privacy constraints. Most existing gossip algorithms focus on simple averaging, but many real-world problems involve pairwise relationships (similarity learning, ranking, clustering).

Method: Revisits gossip algorithms specifically designed for pairwise objective functions (U-statistics of degree two). Provides comprehensive theoretical framework for convergence analysis, examining graph properties affecting efficiency and establishing convergence conditions.

Result: Fills gap in literature by establishing conditions under which these methods succeed and identifying critical graph properties. Performs refined analysis of convergence upper and lower bounds.

Conclusion: Provides theoretical foundation for gossip algorithms handling pairwise statistical learning tasks in distributed IoT networks, addressing limitations of existing averaging-focused approaches.

Abstract: In the IoT era, information is more and more frequently picked up by connected smart sensors with increasing, though limited, storage, communication and computation abilities. Whether due to privacy constraints or to the structure of the distributed system, the development of statistical learning methods dedicated to data that are shared over a network is now a major issue. Gossip-based algorithms have been developed for the purpose of solving a wide variety of statistical learning tasks, ranging from data aggregation over sensor networks to decentralized multi-agent optimization. Whereas the vast majority of contributions consider situations where the function to be estimated or optimized is a basic average of individual observations, it is the goal of this article to investigate the case where the latter is of pairwise nature, taking the form of a U -statistic of degree two. Motivated by various problems such as similarity learning, ranking or clustering for instance, we revisit gossip algorithms specifically designed for pairwise objective functions and provide a comprehensive theoretical framework for their convergence. This analysis fills a gap in the literature by establishing conditions under which these methods succeed, and by identifying the graph properties that critically affect their efficiency. In particular, a refined analysis of the convergence upper and lower bounds is performed.

[479] Uncovering Memorization in Timeseries Imputation models: LBRM Membership Inference and its link to attribute Leakage

Faiz Taleb, Ivan Gazeau, Maryline Laurent

Main category: cs.LG

TL;DR: Time series imputation models are vulnerable to privacy attacks including membership inference and attribute inference in black-box settings, with proposed attacks showing high effectiveness even on robust models.

Details

Motivation: While deep learning models for time series imputation are widely used in healthcare, IoT, and finance, their deployment raises privacy concerns beyond known memorization issues. The paper aims to demonstrate that time series models are vulnerable to inference attacks in black-box settings.

Method: Proposes a two-stage attack framework: (1) a novel membership inference attack using a reference model that improves detection accuracy even for models robust to overfitting-based attacks, and (2) the first attribute inference attack that predicts sensitive characteristics of training data for time series imputation models. Evaluated on attention-based and autoencoder architectures in two scenarios: models trained from scratch and fine-tuned models where adversary has access to initial weights.

Result: The membership attack retrieves a significant portion of training data with tpr@top25% score significantly higher than naive attack baseline. The membership attack also provides good insight into whether attribute inference will work (90% precision vs 78% in general case).

Conclusion: Time series imputation models have serious privacy vulnerabilities to inference attacks, necessitating better privacy-preserving techniques for deployment in sensitive domains.

Abstract: Deep learning models for time series imputation are now essential in fields such as healthcare, the Internet of Things (IoT), and finance. However, their deployment raises critical privacy concerns. Beyond the well-known issue of unintended memorization, which has been extensively studied in generative models, we demonstrate that time series models are vulnerable to inference attacks in a black-box setting. In this work, we introduce a two-stage attack framework comprising: (1) a novel membership inference attack based on a reference model that improves detection accuracy, even for models robust to overfitting-based attacks, and (2) the first attribute inference attack that predicts sensitive characteristics of the training data for timeseries imputation model. We evaluate these attacks on attention-based and autoencoder architectures in two scenarios: models that are trained from scratch, and fine-tuned models where the adversary has access to the initial weights. Our experimental results demonstrate that the proposed membership attack retrieves a significant portion of the training data with a tpr@top25% score significantly higher than a naive attack baseline. We show that our membership attack also provides a good insight of whether attribute inference will work (with a precision of 90% instead of 78% in the genral case).

[480] Reservoir-Based Graph Convolutional Networks

Mayssa Soussia, Gita Ayu Salsabila, Mohamed Ali Mahjoub, Islem Rekik

Main category: cs.LG

TL;DR: RGC-Net integrates reservoir computing with graph convolutional networks to address over-smoothing and computational challenges in GNNs, achieving SOTA performance in graph classification and generation tasks.

Details

Motivation: GCNs struggle with complex/dynamic data, requiring deep layers that cause over-smoothing and high computational costs. Existing reservoir-based GNNs lack structured convolutional mechanisms for accurate multi-hop neighborhood aggregation.

Method: Proposes RGC-Net combining reservoir dynamics with graph convolution: uses fixed random reservoir weights with leaky integrator for feature retention, adaptable model for graph classification, and RGC-Net-powered transformer for graph generation.

Result: Achieves state-of-the-art performance in classification and generative tasks, including brain graph evolution, with faster convergence and reduced over-smoothing compared to existing methods.

Conclusion: RGC-Net successfully integrates reservoir computing with structured graph convolution, addressing key limitations of GCNs while maintaining computational efficiency and performance.

Abstract: Message passing is a core mechanism in Graph Neural Networks (GNNs), enabling the iterative update of node embeddings by aggregating information from neighboring nodes. Graph Convolutional Networks (GCNs) exemplify this approach by adapting convolutional operations for graph structures, allowing features from adjacent nodes to be combined effectively. However, GCNs encounter challenges with complex or dynamic data. Capturing long-range dependencies often requires deeper layers, which not only increase computational costs but also lead to over-smoothing, where node embeddings become indistinguishable. To overcome these challenges, reservoir computing has been integrated into GNNs, leveraging iterative message-passing dynamics for stable information propagation without extensive parameter tuning. Despite its promise, existing reservoir-based models lack structured convolutional mechanisms, limiting their ability to accurately aggregate multi-hop neighborhood information. To address these limitations, we propose RGC-Net (Reservoir-based Graph Convolutional Network), which integrates reservoir dynamics with structured graph convolution. Key contributions include: (i) a reimagined convolutional framework with fixed random reservoir weights and a leaky integrator to enhance feature retention; (ii) a robust, adaptable model for graph classification; and (iii) an RGC-Net-powered transformer for graph generation with application to dynamic brain connectivity. Extensive experiments show that RGC-Net achieves state-of-the-art performance in classification and generative tasks, including brain graph evolution, with faster convergence and reduced over-smoothing. Source code is available at https://github.com/basiralab/RGC-Net .

Lukas Theiner, Maik Pfefferkorn, Yongpeng Zhao, Sebastian Hirt, Rolf Findeisen

Main category: cs.LG

TL;DR: Multi-fidelity Bayesian optimization framework combines low-fidelity numerical data with high-fidelity human preferences to efficiently tune control policies using mixed-modality data.

Details

Motivation: Manual tuning of control policies for high-level objectives is time-consuming, especially for systems involving humans where optimization must be based on subjective criteria. Existing preferential Bayesian optimization relying solely on preference data is inefficient.

Method: Proposes a multi-fidelity, multi-modal Bayesian optimization framework that integrates low-fidelity numerical data with high-fidelity human preferences. Uses Gaussian process surrogate models with both hierarchical autoregressive and non-hierarchical coregionalization-based structures to learn from mixed-modality data.

Result: Demonstrated by tuning an autonomous vehicle’s trajectory planner, showing that combining numerical and preference data significantly reduces the need for human-involved experiments while effectively adapting driving style to individual preferences.

Conclusion: The proposed framework enables efficient optimization of control policies using both objective numerical data and subjective human preferences, reducing experimental burden while achieving personalized adaptation.

Abstract: Tuning control policies manually to meet high-level objectives is often time-consuming. Bayesian optimization provides a data-efficient framework for automating this process using numerical evaluations of an objective function. However, many systems, particularly those involving humans, require optimization based on subjective criteria. Preferential Bayesian optimization addresses this by learning from pairwise comparisons instead of quantitative measurements, but relying solely on preference data can be inefficient. We propose a multi-fidelity, multi-modal Bayesian optimization framework that integrates low-fidelity numerical data with high-fidelity human preferences. Our approach employs Gaussian process surrogate models with both hierarchical, autoregressive and non-hierarchical, coregionalization-based structures, enabling efficient learning from mixed-modality data. We illustrate the framework by tuning an autonomous vehicle’s trajectory planner, showing that combining numerical and preference data significantly reduces the need for experiments involving the human decision maker while effectively adapting driving style to individual preferences.

[482] Linear-Nonlinear Fusion Neural Operator for Partial Differential Equations

Heng Wu, Junjie Wang, Benzhuo Lu

Main category: cs.LG

TL;DR: LNF-NO is a novel neural operator architecture that decouples linear and nonlinear effects for PDE operator learning, achieving faster training and comparable/better accuracy than DeepONet and FNO.

Details

Motivation: Traditional neural operators don't explicitly separate linear and nonlinear effects, which limits learning efficiency. The authors aim to create a more efficient and interpretable neural operator by decoupling these components.

Method: Proposes Linear-Nonlinear Fusion Neural Operator (LNF-NO) that models operator mappings via multiplicative fusion of linear and nonlinear components, supporting multiple functional inputs and both regular/irregular geometries.

Result: LNF-NO trains substantially faster than DeepONet and FNO while achieving comparable or better accuracy across PDE benchmarks, including 2.7x faster training than 3D FNO on Poisson-Boltzmann equations.

Conclusion: Linear-nonlinear decoupling improves neural operator learning efficiency and interpretability while maintaining accuracy, making LNF-NO a promising approach for PDE operator learning.

Abstract: Neural operator learning directly constructs the mapping relationship from the equation parameter space to the solution space, enabling efficient direct inference in practical applications without the need for repeated solution of partial differential equations (PDEs) - an advantage that is difficult to achieve with traditional numerical methods. In this work, we find that explicitly decoupling linear and nonlinear effects within such operator mappings leads to markedly improved learning efficiency. This yields a novel network structure, namely the Linear-Nonlinear Fusion Neural Operator (LNF-NO), which models operator mappings via the multiplicative fusion of a linear component and a nonlinear component, thus achieving a lightweight and interpretable representation. This linear-nonlinear decoupling enables efficient capture of complex solution features at the operator level while maintaining stability and generality. LNF-NO naturally supports multiple functional inputs and is applicable to both regular grids and irregular geometries. Across a diverse suite of PDE operator-learning benchmarks, including nonlinear Poisson-Boltzmann equations and multi-physics coupled systems, LNF-NO is typically substantially faster to train than Deep Operator Networks (DeepONet) and Fourier Neural Operators (FNO), while achieving comparable or better accuracy in most cases. On the tested 3D Poisson-Boltzmann case, LNF-NO attains the best accuracy among the compared models and trains approximately 2.7x faster than a 3D FNO baseline.

[483] TsetlinWiSARD: On-Chip Training of Weightless Neural Networks using Tsetlin Automata on FPGAs

Shengyu Duan, Marcos L. L. Sartori, Rishad Shafik, Alex Yakovlev

Main category: cs.LG

TL;DR: TsetlinWiSARD enables efficient on-chip training for weightless neural networks using Tsetlin Automata, achieving 1000x faster training and significant hardware efficiency gains on FPGA.

Details

Motivation: Address the need for adaptable, private, and secure edge ML with on-chip training capabilities, overcoming limitations of traditional weightless neural networks that suffer from overfitting due to one-shot training.

Method: Proposes TsetlinWiSARD training approach using Tsetlin Automata for probabilistic, feedback-driven learning with iterative optimization, implemented on FPGA-based training architecture.

Result: Achieves 1000x faster training vs traditional WiSARD, with 22% reduced resource usage, 93.3% lower latency, and 64.2% lower power consumption compared to other FPGA-based ML training accelerators.

Conclusion: TsetlinWiSARD enables efficient on-chip training for weightless neural networks with state-of-the-art accuracy and significant hardware efficiency improvements, suitable for edge computing applications.

Abstract: Increasing demands for adaptability, privacy, and security at the edge have persistently pushed the frontiers for a new generation of machine learning (ML) algorithms with training and inference capabilities on-chip. Weightless Neural Network (WNN) is such an algorithm that is principled on lookup table based simple neuron structures. As a result, it offers architectural benefits, such as low-latency, low-complexity inference, compared to deep neural networks that depend heavily on multiply-accumulate operations. However, traditional WNNs rely on memorization-based one-shot training, which either leads to overfitting and reduced accuracy or requires tedious post-training adjustments, limiting their effectiveness for efficient on chip training. In this work, we propose TsetlinWiSARD, a training approach for WNNs that leverages Tsetlin Automata (TAs) to enable probabilistic, feedback-driven learning. It overcomes the overfitting of WiSARD’s one-shot training with iterative optimization, while maintaining simple, continuous binary feedback for efficient on-chip training. Central to our approach is a field programmable gate array (FPGA)-based training architecture that delivers state-of-the-art accuracy while significantly improving hardware efficiency. Our approach provides over 1000x faster training when compared with the traditional WiSARD implementation of WNNs. Further, we demonstrate 22% reduced resource usage, 93.3% lower latency, and 64.2% lower power consumption compared to FPGA-based training accelerators implementing other ML algorithms.

[484] IPatch: A Multi-Resolution Transformer Architecture for Robust Time-Series Forecasting

Aymane Harkati, Moncef Garouani, Olivier Teste, Julien Aligon, Mohamed Hamlich

Main category: cs.LG

TL;DR: IPatch: Multi-resolution Transformer for time series forecasting that integrates point-wise and patch-wise tokens to capture both fine-grained details and broader temporal dependencies.

Details

Motivation: Current Transformer-based time series forecasting models face a trade-off: point-wise representations preserve fine details but are computationally expensive and poor at modeling long-range dependencies, while patch-wise representations improve efficiency but lose critical fine-grained temporal information needed for accurate predictions in volatile/complex time series.

Method: Proposes IPatch, a multi-resolution Transformer architecture that integrates both point-wise and patch-wise tokens to model temporal information at multiple resolutions simultaneously, capturing both fine-grained details and broader contextual dependencies.

Result: Experiments on 7 benchmark datasets show IPatch consistently improves forecasting accuracy, robustness to noise, and generalization across various prediction horizons compared to single-representation baselines.

Conclusion: Integrating multiple temporal resolutions in Transformer architectures is effective for time series forecasting, addressing limitations of single-representation approaches and improving performance across diverse datasets and conditions.

Abstract: Accurate forecasting of multivariate time series remains challenging due to the need to capture both short-term fluctuations and long-range temporal dependencies. Transformer-based models have emerged as a powerful approach, but their performance depends critically on the representation of temporal data. Traditional point-wise representations preserve individual time-step information, enabling fine-grained modeling, yet they tend to be computationally expensive and less effective at modeling broader contextual dependencies, limiting their scalability to long sequences. Patch-wise representations aggregate consecutive steps into compact tokens to improve efficiency and model local temporal dynamics, but they often discard fine-grained temporal details that are critical for accurate predictions in volatile or complex time series. We propose IPatch, a multi-resolution Transformer architecture that integrates both point-wise and patch-wise tokens, modeling temporal information at multiple resolutions. Experiments on 7 benchmark datasets demonstrate that IPatch consistently improves forecasting accuracy, robustness to noise, and generalization across various prediction horizons compared to single-representation baselines.

[485] Embracing Heteroscedasticity for Probabilistic Time Series Forecasting

Yijun Wang, Qiyuan Zhuang, Xiu-Shen Wei

Main category: cs.LG

TL;DR: LSG-VAE is a probabilistic time series forecasting model that explicitly models heteroscedastic uncertainty using a location-scale Gaussian VAE framework to capture time-varying conditional variances.

Details

Motivation: Most existing non-autoregressive generative approaches to probabilistic time series forecasting rely on MSE-based training objectives that impose homoscedastic assumptions, limiting their ability to model temporal heteroscedasticity in real-world time series with time-varying conditional variances.

Method: Proposes Location-Scale Gaussian VAE (LSG-VAE) that explicitly parameterizes both predictive mean and time-dependent variance through a location-scale likelihood formulation, enabling adaptive attenuation mechanism that down-weights highly volatile observations during training.

Result: Extensive experiments on nine benchmark datasets show LSG-VAE consistently outperforms fifteen strong generative baselines while maintaining high computational efficiency suitable for real-time deployment.

Conclusion: LSG-VAE effectively addresses the heteroscedasticity limitation in probabilistic time series forecasting by explicitly modeling time-dependent variances, leading to improved robustness in trend prediction and better uncertainty quantification.

Abstract: Probabilistic time series forecasting (PTSF) aims to model the full predictive distribution of future observations, enabling both accurate forecasting and principled uncertainty quantification. A central requirement of PTSF is to embrace heteroscedasticity, as real-world time series exhibit time-varying conditional variances induced by nonstationary dynamics, regime changes, and evolving external conditions. However, most existing non-autoregressive generative approaches to PTSF, such as TimeVAE and $K^2$VAE, rely on MSE-based training objectives that implicitly impose a homoscedastic assumption, thereby fundamentally limiting their ability to model temporal heteroscedasticity. To address this limitation, we propose the Location-Scale Gaussian VAE (LSG-VAE), a simple but effective framework that explicitly parameterizes both the predictive mean and time-dependent variance through a location-scale likelihood formulation. This design enables LSG-VAE to faithfully capture heteroscedastic aleatoric uncertainty and introduces an adaptive attenuation mechanism that automatically down-weights highly volatile observations during training, leading to improved robustness in trend prediction. Extensive experiments on nine benchmark datasets demonstrate that LSG-VAE consistently outperforms fifteen strong generative baselines while maintaining high computational efficiency suitable for real-time deployment.

[486] Identification of NMF by choosing maximum-volume basis vectors

Qianqian Qi, Zhongming Chen, Peter G. M. van der Heijden

Main category: cs.LG

TL;DR: Proposes maximum-volume-constrained NMF as an alternative to minimum-volume-constrained NMF for better handling of highly mixed data and more interpretable basis vectors.

Details

Motivation: Minimum-volume-constrained NMF tends to induce sparsity in coefficient matrices, which fails for highly mixed data where sparsity doesn't hold. Additionally, estimated basis vectors in minimum-volume-constrained NMF may be mixtures of ground truth basis vectors, making them difficult to interpret.

Method: Proposes a new NMF framework called maximum-volume-constrained NMF that makes basis vectors as distinct as possible. Establishes identifiability theorem for this approach and provides an algorithm for estimation.

Result: Experimental results demonstrate the effectiveness of the proposed maximum-volume-constrained NMF method.

Conclusion: Maximum-volume-constrained NMF addresses limitations of minimum-volume-constrained NMF by producing more distinct and interpretable basis vectors, particularly suitable for highly mixed data.

Abstract: In nonnegative matrix factorization (NMF), minimum-volume-constrained NMF is a widely used framework for identifying the solution of NMF by making basis vectors as similar as possible. This typically induces sparsity in the coefficient matrix, with each row containing zero entries. Consequently, minimum-volume-constrained NMF may fail for highly mixed data, where such sparsity does not hold. Moreover, the estimated basis vectors in minimum-volume-constrained NMF may be difficult to interpret as they may be mixtures of the ground truth basis vectors. To address these limitations, in this paper we propose a new NMF framework, called maximum-volume-constrained NMF, which makes the basis vectors as distinct as possible. We further establish an identifiability theorem for maximum-volume-constrained NMF and provide an algorithm to estimate it. Experimental results demonstrate the effectiveness of the proposed method.

[487] Attack Assessment and Augmented Identity Recognition for Human Skeleton Data

Joseph G. Zalameda, Megan A. Witherow, Alexander M. Glandon, Jose Aguilera, Khan M. Iftekharuddin

Main category: cs.LG

TL;DR: Attack-AAIRS enhances AAIRS framework by using GAN-generated adversarial samples to inoculate skeleton-based person identification models against attacks, improving robustness without sacrificing accuracy on real data.

Details

Motivation: Person identification from LiDAR skeleton data requires expensive data collection, making small datasets vulnerable to adversarial attacks. Existing AAIRS framework doesn't evaluate or defend against such attacks, and perturbation-based approaches are limited for small datasets.

Method: Proposes Attack-AAIRS extension that uses GANs to learn the distribution of adversarial attacks exploiting HCN-ID weaknesses. Combines small real dataset with GAN-generated synthetic adversarial samples to augment training and inoculate the model.

Result: Ten-fold cross validation shows increased robustness to unseen attacks (FGSM, PGD, Gaussian Noise, MI-FGSM, BIM). Generated attack samples maintain quality similar to original benign synthetic samples, and inoculated models preserve test accuracy on real data.

Conclusion: Attack-AAIRS effectively improves model robustness against adversarial attacks without compromising performance on real data, addressing security vulnerabilities in skeleton-based person identification with small datasets.

Abstract: Machine learning models trained on small data sets for security applications are especially vulnerable to adversarial attacks. Person identification from LiDAR based skeleton data requires time consuming and expensive data acquisition for each subject identity. Recently, Assessment and Augmented Identity Recognition for Skeletons (AAIRS) has been used to train Hierarchical Co-occurrence Networks for Person Identification (HCN-ID) with small LiDAR based skeleton data sets. However, AAIRS does not evaluate robustness of HCN-ID to adversarial attacks or inoculate the model to defend against such attacks. Popular perturbation-based approaches to generating adversarial attacks are constrained to targeted perturbations added to real training samples, which is not ideal for inoculating models with small training sets. Thus, we propose Attack-AAIRS, a novel addition to the AAIRS framework. Attack-AAIRS leverages a small real data set and a GAN generated synthetic data set to assess and improve model robustness against unseen adversarial attacks. Rather than being constrained to perturbations of limited real training samples, the GAN learns the distribution of adversarial attack samples that exploit weaknesses in HCN-ID. Attack samples drawn from this distribution augment training for inoculation of the HCN-ID to improve robustness. Ten-fold cross validation of Attack-AAIRS yields increased robustness to unseen attacks- including FGSM, PGD, Additive Gaussian Noise, MI-FGSM, and BIM. The HCN-ID Synthetic Data Quality Score for Attack-AAIRS indicates that generated attack samples are of similar quality to the original benign synthetic samples generated by AAIRS. Furthermore, inoculated models show consistent final test accuracy with the original model trained on real data, demonstrating that our method improves robustness to adversarial attacks without reducing test performance on real data.

[488] Cost-Sensitive Neighborhood Aggregation for Heterophilous Graphs: When Does Per-Edge Routing Help?

Eyal Weiss

Main category: cs.LG

TL;DR: CSNA introduces cost-sensitive neighborhood aggregation for GNNs that soft-routes messages through concordant/discordant channels based on learned pairwise distances, showing it helps in adversarial heterophily regimes but not in informative heterophily regimes.

Details

Motivation: To understand when per-edge message routing in GNNs helps versus when uniform spectral channels are sufficient, distinguishing between adversarial heterophily (where cross-class edges harm classification) and informative heterophily (where heterophilous structure carries useful signal).

Method: Cost-Sensitive Neighborhood Aggregation (CSNA) computes pairwise distance in a learned projection and uses it to soft-route each message through concordant and discordant channels with independent transformations, with theoretical analysis under a contextual stochastic block model.

Result: CSNA is competitive with SOTA on adversarial-heterophily datasets (Texas, Wisconsin, Cornell, Actor) but underperforms on informative-heterophily datasets (Chameleon, Squirrel), showing per-edge routing helps only when there’s useful decomposition to exploit.

Conclusion: The cost function’s ability to separate edge types serves as a diagnostic for heterophily regime, revealing when fine-grained routing adds value over uniform channels and when it does not.

Abstract: Recent work distinguishes two heterophily regimes: adversarial, where cross-class edges dilute class signal and harm classification, and informative, where the heterophilous structure itself carries useful signal. We ask: when does per-edge message routing help, and when is a uniform spectral channel sufficient? To operationalize this question we introduce Cost-Sensitive Neighborhood Aggregation (CSNA), a GNN layer that computes pairwise distance in a learned projection and uses it to soft-route each message through concordant and discordant channels with independent transformations. Under a contextual stochastic block model we show that cost-sensitive weighting preserves class-discriminative signal where mean aggregation provably attenuates it, provided $w_+/w_- > q/p$. On six benchmarks with uniform tuning, CSNA is competitive with state-of-the-art methods on adversarial-heterophily datasets (Texas, Wisconsin, Cornell, Actor) but underperforms on informative-heterophily datasets (Chameleon, Squirrel) – precisely the regime where per-edge routing has no useful decomposition to exploit. The pattern is itself the finding: the cost function’s ability to separate edge types serves as a diagnostic for the heterophily regime, revealing when fine-grained routing adds value over uniform channels and when it does not. Code is available at https://github.com/eyal-weiss/CSNA-public .

[489] Forecasting with Guidance: Representation-Level Supervision for Time Series Forecasting

Jiacheng Wang, Liang Fan, Baihua Li, Luyan Zhang

Main category: cs.LG

TL;DR: ReGuider is a plug-in method that uses pretrained time series foundation models as semantic teachers to guide forecasting models, improving temporal representation learning and forecasting accuracy.

Details

Motivation: Current time series forecasting methods using end-to-end training with error-based objectives tend to discard informative extreme patterns, resulting in smooth predictions and poor capture of salient dynamics in temporal representations.

Method: ReGuider integrates pretrained time series foundation models as semantic teachers. During training, both the target forecasting model and pretrained model process input sequences. Intermediate embeddings from the pretrained model (rich in temporal/semantic information) are aligned with the target model’s encoder embeddings through representation-level supervision.

Result: Extensive experimentation across diverse datasets and architectures shows ReGuider consistently improves forecasting performance, confirming its effectiveness and versatility.

Conclusion: ReGuider successfully addresses the limitation of standard forecasting approaches by leveraging pretrained foundation models to guide temporal representation learning, resulting in more expressive representations and better forecasting accuracy.

Abstract: Nowadays, time series forecasting is predominantly approached through the end-to-end training of deep learning architectures using error-based objectives. While this is effective at minimizing average loss, it encourages the encoder to discard informative yet extreme patterns. This results in smooth predictions and temporal representations that poorly capture salient dynamics. To address this issue, we propose ReGuider, a plug-in method that can be seamlessly integrated into any forecasting architecture. ReGuider leverages pretrained time series foundation models as semantic teachers. During training, the input sequence is processed together by the target forecasting model and the pretrained model. Rather than using the pretrained model’s outputs directly, we extract its intermediate embeddings, which are rich in temporal and semantic information, and align them with the target model’s encoder embeddings through representation-level supervision. This alignment process enables the encoder to learn more expressive temporal representations, thereby improving the accuracy of downstream forecasting. Extensive experimentation across diverse datasets and architectures demonstrates that our ReGuider consistently improves forecasting performance, confirming its effectiveness and versatility.

[490] DeepDTF: Dual-Branch Transformer Fusion for Multi-Omics Anticancer Drug Response Prediction

Yuhan Zhao, Jacob Tennant, James Yang, Zhishan Guo, Young Whang, Ning Sui

Main category: cs.LG

TL;DR: DeepDTF is a dual-branch Transformer framework for cancer drug response prediction that fuses multi-omics data with molecular graph representations of drugs using cross-modal attention.

Details

Motivation: Cancer drug response varies widely due to molecular heterogeneity, requiring computational models. Existing deep learning approaches struggle with aligning high-dimensional multi-omics data with chemically structured drugs due to cross-modal misalignment and limited inductive bias.

Method: Dual-branch Transformer framework: cell-line branch uses modality-specific encoders for multi-omics profiles with Transformer blocks; drug branch represents compounds as molecular graphs encoded with GNN-Transformer; cross-modal fusion via Transformer-based module to model interactions and mitigate feature misalignment.

Result: Outperforms baselines on pharmacogenomic benchmarks under cold-start evaluation: RMSE=1.248, R²=0.875, AUC=0.987 with full multi-omics inputs, reducing classification error by 9.5%. Provides biological explanations via SHAP-based gene attributions and pathway enrichment.

Conclusion: DeepDTF effectively addresses cross-modal alignment challenges in drug response prediction, achieving state-of-the-art performance while providing interpretable biological insights through attribution analysis.

Abstract: Cancer drug response varies widely across tumors due to multi-layer molecular heterogeneity, motivating computational decision support for precision oncology. Despite recent progress in deep CDR models, robust alignment between high-dimensional multi-omics and chemically structured drugs remains challenging due to cross-modal misalignment and limited inductive bias. We present DeepDTF, an end-to-end dual-branch Transformer fusion framework for joint log(IC50) regression and drug sensitivity classification. The cell-line branch uses modality-specific encoders for multi-omics profiles with Transformer blocks to capture long-range dependencies, while the drug branch represents compounds as molecular graphs and encodes them with a GNN-Transformer to integrate local topology with global context. Omics and drug representations are fused by a Transformer-based module that models cross-modal interactions and mitigates feature misalignment. On public pharmacogenomic benchmarks under 5-fold cold-start cell-line evaluation, DeepDTF consistently outperforms strong baselines across omics settings, achieving up to RMSE=1.248, R^2=0.875, and AUC=0.987 with full multi-omics inputs, while reducing classification error (1-ACC) by 9.5%. Beyond accuracy, DeepDTF provides biologically grounded explanations via SHAP-based gene attributions and pathway enrichment with pre-ranked GSEA.

[491] Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

Dogan Urgun, Gokhan Gungor

Main category: cs.LG

TL;DR: LLM-based automated reward design framework for cooperative multi-agent systems that synthesizes executable reward programs from environment instrumentation, validated in Overcooked-AI environments with various coordination challenges.

Details

Motivation: Designing effective auxiliary rewards for cooperative multi-agent systems is challenging; misaligned incentives can lead to suboptimal coordination, especially when sparse task feedback provides insufficient grounding for learning.

Method: Automated reward design framework using large language models to synthesize executable reward programs from environment instrumentation. Programs are constrained within formal validity envelopes and evaluated by training policies from scratch under fixed computational budgets, with selection based solely on sparse task returns.

Result: Iterative search generations consistently yield superior task returns and delivery counts across four Overcooked-AI layouts with varied coordination challenges. Most pronounced gains occur in environments with interaction bottlenecks. Synthesized shaping components increase interdependence in action selection and improve signal alignment in coordination-intensive tasks.

Conclusion: Automated search for objective-grounded reward programs can reduce manual engineering burden while producing shaping signals compatible with cooperative learning under finite computational budgets.

Abstract: Designing effective auxiliary rewards for cooperative multi-agent systems remains a precarious task; misaligned incentives risk inducing suboptimal coordination, especially where sparse task feedback fails to provide sufficient grounding. This study introduces an automated reward design framework that leverages large language models to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and evaluates their efficacy by training policies from scratch under a fixed computational budget; selection depends exclusively on the sparse task return. The framework is evaluated across four distinct Overcooked-AI layouts characterized by varied corridor congestion, handoff dependencies, and structural asymmetries. Iterative search generations consistently yield superior task returns and delivery counts, with the most pronounced gains occurring in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components indicates increased interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the search for objectivegrounded reward programs can mitigate the burden of manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.

[492] Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers

Jun Ma, Xu Zhang, Zhengxing Jiao, Yaxin Hou, Hui Liu, Junhui Hou, Yuheng Jia

Main category: cs.LG

TL;DR: LAIC framework improves image clustering by using cross-modal relations for better self-supervision and learning category-wise semantic centers via prompt learning.

Details

Motivation: Existing language-assisted image clustering methods have two key limitations: (1) textual features for each image are too similar, resulting in weak inter-class discriminability, and (2) clustering is restricted to pre-built image-text alignments, limiting better utilization of text modality.

Method: Proposes a new LAIC framework with two components: (1) exploits cross-modal relations to produce more discriminative self-supervision signals for clustering, compatible with most VLMs training mechanisms; (2) learns category-wise continuous semantic centers via prompt learning to produce final clustering assignments.

Result: Extensive experiments on eight benchmark datasets show average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability.

Conclusion: The proposed LAIC framework effectively addresses limitations of existing methods by leveraging cross-modal relations and prompt learning, achieving superior clustering performance with interpretable semantic centers.

Abstract: Language-Assisted Image Clustering (LAIC) augments the input images with additional texts with the help of vision-language models (VLMs) to promote clustering performance. Despite recent progress, existing LAIC methods often overlook two issues: (i) textual features constructed for each image are highly similar, leading to weak inter-class discriminability; (ii) the clustering step is restricted to pre-built image-text alignments, limiting the potential for better utilization of the text modality. To address these issues, we propose a new LAIC framework with two complementary components. First, we exploit cross-modal relations to produce more discriminative self-supervision signals for clustering, as it compatible with most VLMs training mechanisms. Second, we learn category-wise continuous semantic centers via prompt learning to produce the final clustering assignments. Extensive experiments on eight benchmark datasets demonstrate that our method achieves an average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability. Code is available in the supplementary material.

[493] CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control

Yifeng Zhang, Harsh Goel, Peizhuo Li, Mehul Damani, Sandeep Chinchali, Guillaume Sartoretti

Main category: cs.LG

TL;DR: CoordLight: A MARL framework for traffic signal control using queue dynamics encoding and neighbor-aware policy optimization for coordinated multi-intersection traffic management.

Details

Motivation: Address challenges in multi-agent reinforcement learning for traffic signal control, particularly partial observability and coordination in decentralized environments, to create scalable and efficient control strategies for urban traffic networks.

Method: Proposes CoordLight with two key components: 1) Queue Dynamic State Encoding (QDSE) for improved traffic state representation, and 2) Neighbor-aware Policy Optimization (NAPO) with attention mechanism for coordinated decision-making among adjacent agents.

Result: Superior performance against state-of-the-art methods across three real-world traffic datasets with up to 196 intersections, demonstrating consistent improvements in diverse traffic networks with varying flows.

Conclusion: CoordLight effectively addresses coordination challenges in decentralized MARL for traffic control through novel state encoding and neighbor-aware optimization, enabling scalable network-level traffic optimization.

Abstract: Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever-expanding cities. Multi-Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of partial observability and coordination in decentralized environments still remain key challenges in formulating scalable and efficient control strategies. To address these challenges, we present CoordLight, a MARL-based framework designed to improve intra-neighborhood traffic by enhancing decision-making at individual junctions (agents), as well as coordination with neighboring agents, thereby scaling up to network-level traffic optimization. Specifically, we introduce the Queue Dynamic State Encoding (QDSE), a novel state representation based on vehicle queuing models, which strengthens the agents’ capability to analyze, predict, and respond to local traffic dynamics. We further propose an advanced MARL algorithm, named Neighbor-aware Policy Optimization (NAPO). It integrates an attention mechanism that discerns the state and action dependencies among adjacent agents, aiming to facilitate more coordinated decision-making, and to improve policy learning updates through robust advantage calculation. This enables agents to identify and prioritize crucial interactions with influential neighbors, thus enhancing the targeted coordination and collaboration among agents. Through comprehensive evaluations against state-of-the-art traffic signal control methods over three real-world traffic datasets composed of up to 196 intersections, we empirically show that CoordLight consistently exhibits superior performance across diverse traffic networks with varying traffic flows. The code is available at https://github.com/marmotlab/CoordLight

[494] MolEvolve: LLM-Guided Evolutionary Search for Interpretable Molecular Optimization

Xiangsen Chen, Ruilong Wu, Yanyan Lan, Ting Ma, Yang Liu

Main category: cs.LG

TL;DR: MolEvolve is an evolutionary framework using LLMs and MCTS for autonomous molecular discovery, addressing interpretability gaps and activity cliffs in chemistry.

Details

Motivation: Deep learning in chemistry lacks interpretability and fails to resolve activity cliffs where minor structural changes cause drastic property shifts. Current representation learning is limited by similarity principles and can't capture structural-activity discontinuities.

Method: MolEvolve reformulates molecular discovery as autonomous look-ahead planning using LLMs to explore chemical symbolic operations, with MCTS for test-time planning and external tools like RDKit. The system self-discovers optimal trajectories through evolutionary reasoning chains.

Result: MolEvolve outperforms baselines in both property prediction and molecule optimization tasks while evolving transparent, human-readable chemical insights.

Conclusion: The framework enables autonomous molecular discovery with improved interpretability and performance, addressing key limitations in current chemistry deep learning approaches.

Abstract: Despite deep learning’s success in chemistry, its impact is hindered by a lack of interpretability and an inability to resolve activity cliffs, where minor structural nuances trigger drastic property shifts. Current representation learning, bound by the similarity principle, often fails to capture these structural-activity discontinuities. To address this, we introduce MolEvolve, an evolutionary framework that reformulates molecular discovery as an autonomous, look-ahead planning problem. Unlike traditional methods that depend on human-engineered features or rigid prior knowledge, MolEvolve leverages a Large Language Model (LLM) to actively explore and evolve a library of executable chemical symbolic operations. By utilizing the LLM to cold start and an Monte Carlo Tree Search (MCTS) engine for test-time planning with external tools (e.g. RDKit), the system self-discovers optimal trajectories autonomously. This process evolves transparent reasoning chains that translate complex structural transformations into actionable, human-readable chemical insights. Experimental results demonstrate that MolEvolve’s autonomous search not only evolves transparent, human-readable chemical insights, but also outperforms baselines in both property prediction and molecule optimization tasks.

[495] On the Use of Bagging for Local Intrinsic Dimensionality Estimation

Kristóf Péter, Ricardo J. G. B. Campello, James Bailey, Michael E. Houle

Main category: cs.LG

TL;DR: Proposes an ensemble bagging approach with subbagging to reduce variance in Local Intrinsic Dimensionality (LID) estimation while preserving local neighborhood structure, analyzing hyperparameter interplay between sampling rate, k-NN size, and ensemble size.

Details

Motivation: LID estimation suffers from high variance when using limited data from small neighborhoods, while larger neighborhoods introduce biases from nonlocal effects and manifold mixing. Need variance reduction strategy that maintains local distribution characteristics.

Method: Uses ensemble approach with subbagging (subsampling without replacement) to preserve local nearest neighbor distance distributions. Analyzes interplay between sampling rate, k-NN size, and ensemble size theoretically and experimentally. Also combines bagging with neighborhood smoothing techniques.

Result: Bagged estimators significantly reduce variance and mean squared error compared to non-bagged baselines within broad hyperparameter regions, with controllable bias impact. Combining bagging with neighborhood smoothing provides substantial further improvements in LID estimation performance.

Conclusion: The proposed ensemble bagging approach effectively addresses the variance-bias tradeoff in LID estimation, with well-characterized hyperparameter regions enabling informed selection based on application preferences. The method provides reliable variance reduction while maintaining local neighborhood characteristics.

Abstract: The theory of Local Intrinsic Dimensionality (LID) has become a valuable tool for characterizing local complexity within and across data manifolds, supporting a range of data mining and machine learning tasks. Accurate LID estimation requires samples drawn from small neighborhoods around each query to avoid biases from nonlocal effects and potential manifold mixing, yet limited data within such neighborhoods tends to cause high estimation variance. As a variance reduction strategy, we propose an ensemble approach that uses subbagging to preserve the local distribution of nearest neighbor (NN) distances. The main challenge is that the uniform reduction in total sample size within each subsample increases the proximity threshold for finding a fixed number k of NNs around the query. As a result, in the specific context of LID estimation, the sampling rate has an additional, complex interplay with the neighborhood size, where both combined determine the sample size as well as the locality and resolution considered for estimation. We analyze both theoretically and experimentally how the choice of the sampling rate and the k-NN size used for LID estimation, alongside the ensemble size, affects performance, enabling informed prior selection of these hyper-parameters depending on application-based preferences. Our results indicate that within broad and well-characterized regions of the hyper-parameters space, using a bagged estimator will most often significantly reduce variance as well as the mean squared error when compared to the corresponding non-bagged baseline, with controllable impact on bias. We additionally propose and evaluate different ways of combining bagging with neighborhood smoothing for substantial further improvements on LID estimation performance.

[496] Marchuk: Efficient Global Weather Forecasting from Mid-Range to Sub-Seasonal Scales via Flow Matching

Arsen Kuzhamuratov, Mikhail Zhirnov, Andrey Kuznetsov, Ivan Oseledets, Konstantin Sobolev

Main category: cs.LG

TL;DR: Marchuk is a generative latent flow-matching model for global weather forecasting up to 30 days, using compact architecture with 276M parameters to achieve performance comparable to larger models while being computationally efficient.

Details

Motivation: Subseasonal weather forecasting is challenging due to atmospheric chaos limiting conventional models beyond 15 days. There's a need for accurate, efficient models that can predict weather up to 30 days.

Method: Uses generative latent flow-matching model that conditions on current-day weather maps and autoregressively predicts subsequent days’ weather maps in learned latent space. Replaces rotary positional encodings with trainable positional embeddings and extends temporal context window to better capture long-range temporal dependencies.

Result: Despite having only 276M parameters, achieves performance comparable to LaDCast (1.6B parameters) while operating at significantly higher inference speeds. Model is computationally efficient and offers strong predictive performance for subseasonal forecasting.

Conclusion: Marchuk provides an effective solution for subseasonal weather forecasting with computational efficiency and strong performance, making it practical for real-world applications. The model is open-sourced for community use.

Abstract: Accurate subseasonal weather forecasting remains a major challenge due to the inherently chaotic nature of the atmosphere, which limits the predictive skill of conventional models beyond the mid-range horizon (approximately 15 days). In this work, we present \textit{Marchuk}, a generative latent flow-matching model for global weather forecasting spanning mid-range to subseasonal timescales, with prediction horizons of up to 30 days. Marchuk conditions on current-day weather maps and autoregressively predicts subsequent days’ weather maps within the learned latent space. We replace rotary positional encodings (RoPE) with trainable positional embeddings and extend the temporal context window, which together enhance the model’s ability to represent and propagate long-range temporal dependencies during latent forecasting. Marchuk offers two key advantages: high computational efficiency and strong predictive performance. Despite its compact architecture of only 276 million parameters, the model achieves performance comparable to LaDCast, a substantially larger model with 1.6 billion parameters, while operating at significantly higher inference speeds. We open-source our inference code and model at: https://v-gen-ai.github.io/Marchuk/

[497] Learning Response-Statistic Shifts and Parametric Roll Episodes from Wave–Vessel Time Series via LSTM Functional Models

Jose del Aguila Ferrandis

Main category: cs.LG

TL;DR: Data-driven LSTM surrogate model learns nonlinear mapping from wave-motion time series to vessel motions, reproducing parametric roll episodes and statistical shifts in ship response.

Details

Motivation: Parametric roll is a dangerous ship instability causing abrupt regime changes in roll statistics and tail risk. Need data-driven surrogates that can learn causal functional mappings from wave-motion data to predict these rare but high-consequence events.

Method: Developed stacked LSTM surrogate trained on wave-elevation time series from URANS numerical wave tank simulations. Used 49 random-phase realizations for three sea states with fixed forward speed. Compared loss functions (MSE, relative-entropy-based objectives, amplitude-weighted variants) to balance average error and tail fidelity.

Result: Model successfully tracks onset and growth of large-amplitude roll consistent with parametric excitation in severe sea states. Captures corresponding changes in roll probability density functions (PDFs). Demonstrates trade-offs between different loss functions for average error vs. tail fidelity relevant to risk assessment.

Conclusion: Data-driven LSTM surrogates can effectively learn nonlinear causal mappings from wave-motion data to predict parametric roll episodes and associated statistical shifts, providing valuable tools for ship operability and risk assessment.

Abstract: Parametric roll is a rare but high-consequence instability that can trigger abrupt regime changes in ship response, including pronounced shifts in roll statistics and tail risk. This paper develops a data-driven surrogate that learns the nonlinear, causal functional mapping from incident wave–motion time series to vessel motions, and demonstrates that the surrogate reproduces both (i) parametric roll episodes and (ii) the associated statistical shifts in the response. Crucially, the learning framework is data-source agnostic: the paired wave–motion time series can be obtained from controlled experiments (e.g., towing-tank or basin tests with wave probes and motion tracking) when a hull exists, or from high-fidelity simulations during design when experiments are not yet available. To provide a controlled severe-sea demonstration, we generate training data with a URANS numerical wave tank, using long-crested irregular seas synthesized from a modified Pierson–Moskowitz spectrum. The demonstration dataset comprises 49 random-phase realizations for each of three sea states, simulated at a fixed forward speed selected to yield encounter conditions under which parametric-roll episodes can occur. A stacked LSTM surrogate is trained on wave-elevation time series and evaluated on held-out realizations using time-domain accuracy and distributional fidelity metrics. In the most severe case, the model tracks the onset and growth of large-amplitude roll consistent with parametric excitation, and captures the corresponding changes in roll probability density functions (PDFs). We further compare loss-function choices (MSE, relative-entropy-based objectives, and amplitude-weighted variants) and show how they trade average error for improved tail fidelity relevant to operability and risk assessment.

[498] CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar

Main category: cs.LG

TL;DR: CUA-Suite introduces a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents, addressing the scarcity of continuous human demonstration videos needed for scaling general-purpose agents.

Details

Motivation: Progress toward general-purpose computer-use agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Existing datasets like ScaleCUA contain only sparse screenshots (2M screenshots, <20 hours of video), while continuous video is the critical missing ingredient for scaling these agents.

Method: CUA-Suite provides VideoCUA with ~10,000 human-demonstrated tasks across 87 diverse applications featuring continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations (55 hours, 6M frames). It also includes UI-Vision benchmark for evaluation and GroundCUA with 56K annotated screenshots and 3.6M UI element annotations.

Result: The dataset reveals current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). CUA-Suite’s rich multimodal corpus supports emerging research including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models.

Conclusion: CUA-Suite addresses the critical bottleneck in computer-use agent development by providing continuous video demonstrations and dense annotations, enabling research in multimodal understanding and generation for desktop automation.

Abstract: Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layerfed reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite’s rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.

[499] Conformalized Transfer Learning for Li-ion Battery State of Health Forecasting under Manufacturing and Usage Variability

Samuel Filgueira da Silva, Mehmet Fatih Ozkan, Faissal El Idrissi, Marcello Canova

Main category: cs.LG

TL;DR: Uncertainty-aware transfer learning framework combining LSTM with domain adaptation (MMD) and uncertainty quantification (Conformal Prediction) for lithium-ion battery state-of-health forecasting across heterogeneous cells.

Details

Motivation: Existing battery SOH forecasting models trained on laboratory data fail to generalize to new cells with manufacturing variations or different operating conditions, creating a need for robust transfer learning approaches.

Method: LSTM model trained on virtual battery dataset capturing real-world variability, combined with Maximum Mean Discrepancy for domain adaptation between simulated and target domains, and Conformal Prediction for uncertainty quantification.

Result: Framework improves generalization and trustworthiness of SOH forecasts across heterogeneous cells by addressing domain shift and providing calibrated prediction intervals.

Conclusion: The proposed uncertainty-aware transfer learning framework effectively addresses domain shift in battery SOH forecasting, enhancing both accuracy and reliability for real-world applications.

Abstract: Accurate forecasting of state-of-health (SOH) is essential for ensuring safe and reliable operation of lithium-ion cells. However, existing models calibrated on laboratory tests at specific conditions often fail to generalize to new cells that differ due to small manufacturing variations or operate under different conditions. To address this challenge, an uncertainty-aware transfer learning framework is proposed, combining a Long Short-Term Memory (LSTM) model with domain adaptation via Maximum Mean Discrepancy (MMD) and uncertainty quantification through Conformal Prediction (CP). The LSTM model is trained on a virtual battery dataset designed to capture real-world variability in electrode manufacturing and operating conditions. MMD aligns latent feature distributions between simulated and target domains to mitigate domain shift, while CP provides calibrated, distribution-free prediction intervals. This framework improves both the generalization and trustworthiness of SOH forecasts across heterogeneous cells.

[500] Uniform Laws of Large Numbers in Product Spaces

Ron Holzman, Shay Moran, Alexander Shlimovich

Main category: cs.LG

TL;DR: The paper establishes uniform laws of large numbers for cartesian product spaces under distributional assumptions compatible with product structure, showing equivalence to finite linear VC dimension rather than classical VC dimension.

Details

Motivation: To extend Vapnik-Chervonenkis theory to cartesian product spaces where distributions have product-like structure, capturing natural settings like product distributions, sparse mixtures, and low mutual information cases.

Method: Assumes distributions absolutely continuous with respect to product of marginals, introduces linear VC dimension (maximum shattered set on axis-parallel lines), and develops novel estimators departing from standard empirical mean.

Result: Proves uniform law of large numbers holds if and only if linear VC dimension is finite, shows linear VC dimension can be arbitrarily smaller than classical VC dimension, and demonstrates necessity of non-standard estimators.

Conclusion: Establishes new uniform convergence theory for product spaces with product-compatible distributions, introduces linear VC dimension as key complexity measure, and opens questions about quantitative sample complexity bounds.

Abstract: Uniform laws of large numbers form a cornerstone of Vapnik–Chervonenkis theory, where they are characterized by the finiteness of the VC dimension. In this work, we study uniform convergence phenomena in cartesian product spaces, under assumptions on the underlying distribution that are compatible with the product structure. Specifically, we assume that the distribution is absolutely continuous with respect to the product of its marginals, a condition that captures many natural settings, including product distributions, sparse mixtures of product distributions, distributions with low mutual information, and more. We show that, under this assumption, a uniform law of large numbers holds for a family of events if and only if the linear VC dimension of the family is finite. The linear VC dimension is defined as the maximum size of a shattered set that lies on an axis-parallel line, namely, a set of vectors that agree on all but at most one coordinate. This dimension is always at most the classical VC dimension, yet it can be arbitrarily smaller. For instance, the family of convex sets in $\mathbb{R}^d$ has linear VC dimension $2$, while its VC dimension is infinite already for $d\ge 2$. Our proofs rely on estimator that departs substantially from the standard empirical mean estimator and exhibits more intricate structure. We show that such deviations from the standard empirical mean estimator are unavoidable in this setting. Throughout the paper, we propose several open questions, with a particular focus on quantitative sample complexity bounds.

[501] Project and Generate: Divergence-Free Neural Operators for Incompressible Flows

Xigui Li, Hongwei Zhang, Ruoxi Jiang, Deshu Chen, Chensen Lin, Limei Han, Yuan Qi, Xin Guo, Yuan Cheng

Main category: cs.LG

TL;DR: A unified framework enforcing incompressible continuity as hard constraint for fluid dynamics models via spectral Leray projection for deterministic models and divergence-free Gaussian reference measure for generative models.

Details

Motivation: Learning-based fluid dynamics models often produce physically inadmissible, unstable simulations with spurious divergence and long-term collapse. Existing penalty-based methods provide only soft regularization without structural guarantees.

Method: Two-pronged approach: 1) For deterministic models, integrate differentiable spectral Leray projection based on Helmholtz-Hodge decomposition to restrict regression to divergence-free velocity fields. 2) For generative models, construct divergence-free Gaussian reference measure via curl-based pushforward to ensure entire probability flow remains subspace-consistent.

Result: Experiments on 2D Navier-Stokes equations demonstrate exact incompressibility up to discretization error and substantially improved stability and physical consistency compared to unconstrained models.

Conclusion: The framework provides hard structural guarantees for physical consistency in fluid dynamics modeling, addressing fundamental limitations of penalty-based approaches through intrinsic constraint enforcement.

Abstract: Learning-based models for fluid dynamics often operate in unconstrained function spaces, leading to physically inadmissible, unstable simulations. While penalty-based methods offer soft regularization, they provide no structural guarantees, resulting in spurious divergence and long-term collapse. In this work, we introduce a unified framework that enforces the incompressible continuity equation as a hard, intrinsic constraint for both deterministic and generative modeling. First, to project deterministic models onto the divergence-free subspace, we integrate a differentiable spectral Leray projection grounded in the Helmholtz-Hodge decomposition, which restricts the regression hypothesis space to physically admissible velocity fields. Second, to generate physically consistent distributions, we show that simply projecting model outputs is insufficient when the prior is incompatible. To address this, we construct a divergence-free Gaussian reference measure via a curl-based pushforward, ensuring the entire probability flow remains subspace-consistent by construction. Experiments on 2D Navier-Stokes equations demonstrate exact incompressibility up to discretization error and substantially improved stability and physical consistency.

[502] Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko

Main category: cs.LG

TL;DR: LLM agents (Claude Code) discover novel white-box adversarial attack algorithms that significantly outperform 30+ existing methods in jailbreaking and prompt injection evaluations, achieving up to 40% attack success rate on safety models.

Details

Motivation: To demonstrate that LLM agents can automate incremental safety and security research, particularly white-box adversarial red-teaming, by discovering novel attack algorithms that outperform existing methods.

Method: Autoresearch-style pipeline powered by Claude Code that starts from existing attack implementations (like GCG) and iteratively produces new algorithms through optimization with dense quantitative feedback.

Result: Discovered algorithms achieve up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B (vs ≤10% for baselines) and generalize to achieve 100% ASR against Meta-SecAlign-70B (vs 56% for best baseline).

Conclusion: White-box adversarial red-teaming is well-suited for LLM agent automation, and incremental safety/security research can be effectively automated using LLM agents.

Abstract: LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering \citep{rank2026posttrainbench, novikov2025alphaevolve}. We show that an \emph{autoresearch}-style pipeline \citep{karpathy2026autoresearch} powered by Claude Code discovers novel white-box adversarial attack \textit{algorithms} that \textbf{significantly outperform all existing (30+) methods} in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG~\citep{zou2023universal}, the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to $\leq$10% for existing algorithms (\Cref{fig:teaser}, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving \textbf{100% ASR against Meta-SecAlign-70B} \citep{chen2025secalign} versus 56% for the best baseline (\Cref{fig:teaser}, middle). Extending the findings of~\cite{carlini2025autoadvexbench}, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.

[503] Towards Safe Learning-Based Non-Linear Model Predictive Control through Recurrent Neural Network Modeling

Mihaela-Larisa Clement, Mónika Farsang, Agnes Poks, Johannes Edelmann, Manfred Plöchl, Radu Grosu, Ezio Bartocci

Main category: cs.LG

TL;DR: Sequential-AMPC is a neural policy that approximates nonlinear MPC by sharing parameters across the prediction horizon, requiring fewer expert demonstrations and improving online safety with fallback mechanisms.

Details

Motivation: Traditional NMPC requires solving complex nonlinear programs online, which is computationally expensive on embedded hardware. Learning-based approximations need large expert datasets and costly training. There's a need for efficient MPC approximations that require fewer demonstrations and maintain safety.

Method: Proposes Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences by sharing parameters across the prediction horizon. For deployment, wraps the policy in a safety-augmented online evaluation and fallback mechanism (Safe Sequential-AMPC).

Result: Compared to feedforward policy baselines, Sequential-AMPC requires substantially fewer expert MPC rollouts, yields candidate sequences with higher feasibility rates, and improves closed-loop safety. On high-dimensional systems, it shows better learning dynamics, performance in fewer epochs, and stable validation improvement where baselines stagnate.

Conclusion: Sequential-AMPC provides an efficient approximation to NMPC that reduces offline training requirements while maintaining safety through online evaluation mechanisms, making it suitable for deployment on resource-constrained hardware.

Abstract: The practical deployment of nonlinear model predictive control (NMPC) is often limited by online computation: solving a nonlinear program at high control rates can be expensive on embedded hardware, especially when models are complex or horizons are long. Learning-based NMPC approximations shift this computation offline but typically demand large expert datasets and costly training. We propose Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences by sharing parameters across the prediction horizon. For deployment, we wrap the policy in a safety-augmented online evaluation and fallback mechanism, yielding Safe Sequential-AMPC. Compared to a naive feedforward policy baseline across several benchmarks, Sequential-AMPC requires substantially fewer expert MPC rollouts and yields candidate sequences with higher feasibility rates and improved closed-loop safety. On high-dimensional systems, it also exhibits better learning dynamics and performance in fewer epochs while maintaining stable validation improvement where the feedforward baseline can stagnate.

[504] No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions

Emily Schiller, Teodor Chiaburu, Marco Zullich, Luca Longo

Main category: cs.LG

TL;DR: Proposes a unified evaluation framework for uncertainty attribution methods in XAI, aligning with Co-12 framework and introducing new metrics including conveyance for epistemic uncertainty propagation.

Details

Motivation: Current evaluation of uncertainty attribution methods is inconsistent with heterogeneous proxy tasks and metrics, hindering comparability and systematic development of these methods.

Method: Aligns uncertainty attributions with Co-12 XAI evaluation framework, proposes concrete implementations for correctness, consistency, continuity, and compactness properties, and introduces conveyance property specifically for uncertainty attributions. Evaluates eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data.

Result: Gradient-based methods outperform perturbation-based approaches in consistency and conveyance; Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Most metrics rank methods consistently across samples, but inter-method agreement remains low, suggesting no single metric sufficiently evaluates uncertainty attribution quality.

Conclusion: Proposed evaluation framework establishes foundation for systematic comparison and development of uncertainty attribution methods, addressing current inconsistencies in the field.

Abstract: Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent as studies rely on heterogeneous proxy tasks and metrics, hindering comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. Additionally, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low. This suggests no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework contributes to the body of knowledge by establishing a foundation for systematic comparison and development of uncertainty attribution methods.

[505] AVO: Agentic Variation Operators for Autonomous Evolutionary Search

Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye, Timmy Liu, Ali Hassani, Tianqi Chen, Andrew Kerr, Haicheng Wu, Yang Xu, Yu-Jung Chen, Hanfeng Chen, Aditya Kane, Ronny Krashinsky, Ming-Yu Liu, Vinod Grover, Luis Ceze, Roger Bringmann, John Tran, Wei Liu, Fung Xie, Michael Lightstone, Humphrey Shi

Main category: cs.LG

TL;DR: AVO replaces traditional evolutionary operators with autonomous coding agents that can propose, repair, critique, and verify implementation edits using lineage, knowledge bases, and execution feedback, achieving performance gains over expert-engineered attention kernels on modern GPUs.

Details

Motivation: Current evolutionary algorithms rely on fixed mutation/crossover operators and hand-designed heuristics, limiting their ability to discover complex optimizations. The paper aims to move beyond LLM-in-the-loop pipelines by elevating language models from mere candidate generators to autonomous variation operators that can discover performance-critical micro-architectural optimizations.

Method: AVO replaces classical evolutionary variation operators with autonomous coding agents that operate in a self-directed loop. These agents can consult the current evolutionary lineage, domain-specific knowledge bases, and execution feedback to propose, repair, critique, and verify implementation edits. The system was evaluated on attention kernels for NVIDIA Blackwell GPUs.

Result: Over 7 days of autonomous evolution on multi-head attention, AVO discovered kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5%. For grouped-query attention, only 30 minutes of additional adaptation yielded gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4.

Conclusion: Agentic variation operators represent a significant advancement beyond prior LLM-in-the-loop evolutionary pipelines by elevating agents from candidate generators to autonomous variation operators, enabling discovery of performance-critical optimizations that surpass expert-engineered implementations on state-of-the-art hardware.

Abstract: Agentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents. Rather than confining a language model to candidate generation within a prescribed pipeline, AVO instantiates variation as a self-directed agent loop that can consult the current lineage, a domain-specific knowledge base, and execution feedback to propose, repair, critique, and verify implementation edits. We evaluate AVO on attention, among the most aggressively optimized kernel targets in AI, on NVIDIA Blackwell (B200) GPUs. Over 7 days of continuous autonomous evolution on multi-head attention, AVO discovers kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% across the evaluated configurations. The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation and yielding gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4. Together, these results show that agentic variation operators move beyond prior LLM-in-the-loop evolutionary pipelines by elevating the agent from candidate generator to variation operator, and can discover performance-critical micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations on today’s most advanced GPU hardware.

[506] UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, Deheng Ye, Jie Jiang

Main category: cs.LG

TL;DR: UI-Voyager: A two-stage self-evolving mobile GUI agent using rejection fine-tuning and group relative self-distillation to improve learning from failures and credit assignment for long-horizon GUI tasks.

Details

Motivation: Existing mobile GUI agents suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long-horizon GUI tasks, requiring expensive manual data annotation.

Method: Two-stage approach: 1) Rejection Fine-Tuning (RFT) for continuous co-evolution of data and models in autonomous loop; 2) Group Relative Self-Distillation (GRSD) identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones.

Result: Achieves 81.0% Pass@1 success rate on AndroidWorld, outperforming recent baselines and exceeding human-level performance. Ablation studies verify GRSD effectiveness.

Conclusion: Represents significant leap toward efficient, self-evolving, high-performance mobile GUI automation without expensive manual data annotation.

Abstract: Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long-horizon GUI tasks. To that end, we propose UI-Voyager, a novel two-stage self-evolving mobile GUI agent. In the first stage, we employ Rejection Fine-Tuning (RFT), which enables the continuous co-evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human-level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self-evolving, and high-performance mobile GUI automation without expensive manual data annotation.

[507] TuneShift-KD: Knowledge Distillation and Transfer for Fine-tuned Models

Yushi Guan, Jeanine Ohene-Agyei, Daniel Kwan, Jean Sebastien Dandurand, Yifei Zhang, Nandita Vijaykumar

Main category: cs.LG

TL;DR: TuneShift-KD: A method to transfer specialized knowledge from fine-tuned LLMs to new target models using perplexity differences to identify specialized knowledge and generate synthetic training data.

Details

Motivation: As new LLM architectures emerge, transferring specialized knowledge from fine-tuned models to newer models becomes crucial, especially when original training data is unavailable due to privacy or commercial restrictions.

Method: Uses perplexity differences between base and fine-tuned models to identify specialized knowledge: prompts where fine-tuned model has low perplexity but base model has high perplexity indicate specialized knowledge. Creates synthetic training dataset through iterative generation of similar prompts, requiring only initial models and few representative prompts.

Result: Models fine-tuned using TuneShift-KD achieve higher accuracy than prior approaches, enabling effective transfer of specialized knowledge without access to original training data.

Conclusion: TuneShift-KD provides an automated, data-efficient approach for knowledge transfer between LLMs, addressing practical constraints of data availability and model evolution.

Abstract: To embed domain-specific or specialized knowledge into pre-trained foundation models, fine-tuning using techniques such as parameter efficient fine-tuning (e.g. LoRA) is a common practice. However, as new LLM architectures and pre-trained models emerge, transferring this specialized knowledge to newer models becomes an important task. In many scenarios, the original specialized data may be unavailable due to privacy or commercial restrictions, necessitating distillation and transfer of this specialized knowledge from the fine-tuned base model to a different pre-trained model. We present TuneShift-KD, a novel approach that automatically distills specialized knowledge from a fine-tuned model to a target model using only a few examples representative of the specialized information. Our key insight is that specialized knowledge can be identified through perplexity differences between base and fine-tuned models: prompts where the fine-tuned model responds confidently (low perplexity), but the base model struggles (high perplexity), indicate queries corresponding to the specialized knowledge learned by the fine-tuned model. TuneShift-KD leverages this insight to create a synthetic training dataset to transfer the specialized knowledge. Using an iterative process, TuneShift-KD generates more prompts similar to those that generated responses with specialized knowledge. TuneShift-KD does not require training discriminators or access to training datasets. It is an automated approach that only requires the initial fine-tuned and base models and a few representative prompts. Our experiments demonstrate that models fine-tuned using TuneShift-KD achieve higher accuracy than prior approaches, enabling ease of deployment and more effective transfer of the specialized knowledge.

[508] Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction

Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu, Shih-Lun Huang, Long Chen, Gabe Schulman, Huizhen Jin, Shengduo Li, Yixuan Wang, Huidi Yang, Kyunghyun Cho, Cem M. Deniz, Narges Razavian

Main category: cs.LG

TL;DR: RAVEN is a generative pretraining strategy for EHR data using recurrence-aware next-visit event prediction, addressing issues with repeated events in evaluation and showing strong zero-shot disease forecasting performance.

Details

Motivation: Large-scale pretraining has transformed language modeling but remains underexplored for structured EHR data. There's a need for better generative models that can handle sequential clinical data while addressing evaluation pitfalls like repeated event inflation.

Method: RAVEN uses recurrence-aware next-visit event prediction with autoregressive generation of tokenized clinical events. It introduces regularization for repeated events and is trained on over 1M patients. The approach addresses data-constrained, compute-saturated regimes.

Result: RAVEN rivals fine-tuned Transformer models in zero-shot disease forecasting and outperforms simulation-based next-token approaches. It generalizes to external cohorts despite lossy clinical code mappings and feature gaps without parameter updates.

Conclusion: RAVEN demonstrates effective generative pretraining for EHR data, addressing key evaluation pitfalls and showing strong generalization capabilities, though scaling requires commensurate data increases.

Abstract: While large-scale pretraining has revolutionized language modeling, its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. We introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate the scaling behaviors in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without commensurate increases in data volume. We evaluate our model via zero-shot prediction for forecasting the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms widely used simulation-based next-token approaches. Finally, without additional parameter updates, we show that RAVEN can generalize to an external patient cohort under lossy clinical code mappings and feature coverage gaps.

[509] DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, Junyu Han, Lingyun Xu, Yifeng Pan, Dongbin Zhao

Main category: cs.LG

TL;DR: DreamerAD is a latent world model framework for autonomous driving RL that compresses diffusion sampling from 100 steps to 1, achieving 80x speedup while maintaining visual interpretability.

Details

Motivation: Training RL policies on real-world driving data is costly and unsafe. Existing pixel-level diffusion world models enable safe imagination-based training but suffer from multi-step diffusion inference latency (2s/frame) that prevents high-frequency RL interaction.

Method: Leverages denoised latent features from video generation models through: (1) shortcut forcing for recursive multi-resolution step compression, (2) autoregressive dense reward model on latent representations, and (3) Gaussian vocabulary sampling for GRPO to constrain exploration to physically plausible trajectories.

Result: Achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating latent-space RL effectiveness for autonomous driving.

Conclusion: DreamerAD enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling, making latent-space RL practical for real-world driving applications.

Abstract: We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1 - achieving 80x speedup while maintaining visual interpretability. Training RL policies on real-world driving data incurs prohibitive costs and safety risks. While existing pixel-level diffusion world models enable safe imagination-based training, they suffer from multi-step diffusion inference latency (2s/frame) that prevents high-frequency RL interaction. Our approach leverages denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing that reduces sampling complexity via recursive multi-resolution step compression, (2) an autoregressive dense reward model operating directly on latent representations for fine-grained credit assignment, and (3) Gaussian vocabulary sampling for GRPO that constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating that latent-space RL is effective for autonomous driving.

[510] Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method

Arthur Jacot

Main category: cs.LG

TL;DR: ML-EM method accelerates SDE/ODE solving by using multiple approximators of increasing accuracy/cost, achieving polynomial speedups for diffusion model sampling.

Details

Motivation: Traditional Euler-Maruyama methods for solving SDEs require many evaluations of expensive drift functions, especially in diffusion models where large UNets are computationally intensive. The goal is to reduce computational cost while maintaining accuracy.

Method: Multilevel Euler-Maruyama (ML-EM) uses a hierarchy of approximators f¹,…,f^k to the drift function f, with increasing accuracy and cost. It performs many evaluations of cheaper approximators and few evaluations of the most accurate one, leveraging variance reduction techniques.

Result: ML-EM achieves ε-approximation with ε^{-γ} compute (same as single drift evaluation) instead of traditional EM’s ε^{-γ-1}. For diffusion models with γ≈2.5, this yields up to 4x speedup on CelebA 64x64 image generation.

Conclusion: ML-EM provides polynomial speedups for SDE solving and diffusion model sampling, with potential for larger gains on practical large-scale models. The method effectively decouples computational cost from accuracy requirements.

Abstract: We introduce the Multilevel Euler-Maruyama (ML-EM) method compute solutions of SDEs and ODEs using a range of approximators $f^1,\dots,f^k$ to the drift $f$ with increasing accuracy and computational cost, only requiring a few evaluations of the most accurate $f^k$ and many evaluations of the less costly $f^1,\dots,f^{k-1}$. If the drift lies in the so-called Harder than Monte Carlo (HTMC) regime, i.e. it requires $ε^{-γ}$ compute to be $ε$-approximated for some $γ>2$, then ML-EM $ε$-approximates the solution of the SDE with $ε^{-γ}$ compute, improving over the traditional EM rate of $ε^{-γ-1}$. In other terms it allows us to solve the SDE at the same cost as a single evaluation of the drift. In the context of diffusion models, the different levels $f^{1},\dots,f^{k}$ are obtained by training UNets of increasing sizes, and ML-EM allows us to perform sampling with the equivalent of a single evaluation of the largest UNet. Our numerical experiments confirm our theory: we obtain up to fourfold speedups for image generation on the CelebA dataset downscaled to 64x64, where we measure a $γ\approx2.5$. Given that this is a polynomial speedup, we expect even stronger speedups in practical applications which involve orders of magnitude larger networks.

[511] Coded Computing for Resilient Distributed Computing: A Learning-Theoretic Framework

Parsa Moradi, Behrooz Tahmasebi, Mohammad Ali Maddah-Ali

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2406.00300: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.00300&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[512] Multiway Multislice PHATE: Visualizing Hidden Dynamics of RNNs through Training

Jiancheng Xie, Lou C. Kohler Voinov, Noga Mudrik, Gal Mishne, Adam Charles

Main category: cs.LG

TL;DR: Unable to analyze paper 2406.01969 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable due to API rate limiting

Method: Cannot determine method as abstract is unavailable due to API rate limiting

Result: Cannot determine results as abstract is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as abstract is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2406.01969: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.01969&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[513] An efficient wavelet-based physics-informed neural network for multiscale problems

Himanshu Pandey, Anshima Singh, Ratikanta Behera

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2409.11847: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.11847&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[514] Learning to Localize Leakage of Cryptographic Sensitive Variables

Jimmy Gammell, Anand Raghunathan, Abolfazl Hashemi, Kaushik Roy

Main category: cs.LG

TL;DR: Unable to analyze paper 2503.07464 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation without access to paper abstract

Method: Cannot determine method without access to paper abstract

Result: Cannot determine results without access to paper abstract

Conclusion: Cannot draw conclusions without access to paper abstract

Abstract: Failed to fetch summary for 2503.07464: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.07464&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[515] GraphOmni: A Comprehensive and Extensible Benchmark Framework for Large Language Models on Graph-theoretic Tasks

Hao Xu, Xiangru Jian, Xinjian Zhao, Wei Pang, Chao Zhang, Suyuchen Wang, Qixin Zhang, Zhengyuan Dong, Joao Monteiro, Bang Liu, Qiuzhuang Sun, Tianshu Yu

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2504.12764: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.12764&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[516] Navigating the Latent Space Dynamics of Neural Models

Marco Fumero, Luca Moschella, Emanuele Rodolà, Francesco Locatello

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2505.22785: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.22785&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[517] TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness

Zhiyuan Zhao, Juntong Ni, Shangqing Xu, Haoxin Liu, Wei Jin, B. Aditya Prakash

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2506.06482 exists but cannot be analyzed without access to its abstract or content.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2506.06482: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06482&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[518] When Brain Foundation Model Meets Cauchy-Schwarz Divergence: A New Framework for Cross-Subject Motor Imagery Decoding

Jinzhou Wu, Baoping Tang, Qikang Li, Yi Wang, Cheng Li, Shujian Yu

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about the paper due to technical access issues

Abstract: Failed to fetch summary for 2507.21037: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.21037&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[519] A Compression Based Classification Framework Using Symbolic Dynamics of Chaotic Maps

Parth Naik, Harikrishnan N B

Main category: cs.LG

TL;DR: Unable to analyze paper 2508.02330 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is not available

Method: Cannot determine method as abstract is not available

Result: Cannot determine results as abstract is not available

Conclusion: Cannot draw conclusion as paper content is inaccessible

Abstract: Failed to fetch summary for 2508.02330: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02330&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[520] DART: A Server-side Plug-in for Resource-efficient Robust Federated Learning

Omar Bekdache, Naresh Shanbhag

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2508.17381: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.17381&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[521] Who to Trust? Aggregating Client Predictions in Federated Distillation

Viktor Kovalchuk, Denis Son, Arman Bolatov, Mohsen Guizani, Samuel Horváth, Maxim Panov, Martin Takáč, Eduard Gorbunov, Nikita Kotelevskii

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2509.15147: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.15147&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[522] A signal separation view of classification

H. N. Mhaskar, Ryan O’Dowd

Main category: cs.LG

TL;DR: Paper 2509.24140 summary unavailable due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to abstract fetch failure

Method: Unable to determine method due to abstract fetch failure

Result: Unable to determine results due to abstract fetch failure

Conclusion: Unable to draw conclusions due to abstract fetch failure

Abstract: Failed to fetch summary for 2509.24140: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24140&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[523] RamPINN: Recovering Raman Spectra From Coherent Anti-Stokes Spectra Using Embedded Physics

Sai Karthikeya Vemuri, Adithya Ashok Chalain Valapil, Tim Büchner, Joachim Denzler

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.06020: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06020&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[524] Score-Based Density Estimation from Pairwise Comparisons

Petrus Mikkola, Luigi Acerbi, Arto Klami

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.09146: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09146&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[525] Tackling Time-Series Forecasting Generalization via Mitigating Concept Drift

Zhiyuan Zhao, Haoxin Liu, B. Aditya Prakash

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.14814: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14814&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[526] MedM2T: A MultiModal Framework for Time-Aware Modeling with Electronic Health Record and Electrocardiogram Data

Yu-Chen Kuo, Yi-Ju Tseng

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2510.27321: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.27321&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[527] SigmaDock: Untwisting Molecular Docking With Fragment-Based SE(3) Diffusion

Alvaro Prat, Leo Zhang, Charlotte M. Deane, Yee Whye Teh, Garrett M. Morris

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to retry or use alternative methods to access paper information.

Details

Motivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot draw conclusions without access to paper content.

Abstract: Failed to fetch summary for 2511.04854: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.04854&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[528] Deep Reinforcement Learning for Dynamic Origin-Destination Matrix Estimation in Microscopic Traffic Simulations Considering Credit Assignment

Donggyu Min, Seongjin Choi, Dong-Kyu Kim

Main category: cs.LG

TL;DR: Paper 2511.06229: Could not fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.06229: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06229&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[529] Enhancing Nuclear Reactor Core Simulation through Data-Based Surrogate Models

Perceval Beja-Battais, Alain Grossetête, Nicolas Vayatis

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2511.16148: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16148&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[530] Bayesian Calibration of Engine-out NOx Models for Engine-to-Engine Transferability

Shrenik Zinage, Peter Meckl, Ilias Bilionis

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error fetching paper content

Method: Unable to determine method due to technical error fetching paper content

Result: Unable to determine results due to technical error fetching paper content

Conclusion: Unable to draw conclusions due to technical error fetching paper content

Abstract: Failed to fetch summary for 2511.18178: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18178&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[531] Perturbing the Derivative: Doubly Wild Refitting for Model-Free Evaluation of Opaque Machine Learning Predictors

Haichen Hu, David Simchi-Levi

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2511.18789: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18789&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[532] Quantum-Classical Physics-Informed Neural Networks for Solving Reservoir Seepage Equations

Xiang Rao, Yina Liu, Yuxuan Shen

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about the paper due to access limitations

Abstract: Failed to fetch summary for 2512.03923: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03923&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[533] A Hessian-Free Actor-Critic Algorithm for Bi-Level Reinforcement Learning with Applications to LLM Fine-Tuning

Sihan Zeng, Sujay Bhatt, Sumitra Ganesh, Alec Koppel

Main category: cs.LG

TL;DR: Paper ID 2601.16399: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2601.16399: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.16399&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[534] Interactionless Inverse Reinforcement Learning: A Data-Centric Framework for Durable Alignment

Elias Malomgré, Pieter Simoens

Main category: cs.LG

TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv

Details

Motivation: Unable to determine motivation as the paper abstract could not be retrieved due to rate limiting (HTTP 429 error)

Method: Cannot analyze method without access to the paper abstract or content

Result: No results available due to technical error in fetching the paper information

Conclusion: Unable to provide analysis due to technical limitations in accessing the paper content

Abstract: Failed to fetch summary for 2602.14844: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14844&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[535] Benchmarking State Space Models, Transformers, and Recurrent Networks for US Grid Forecasting

Sunki Hong, Jisoo Lee

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.21415: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21415&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[536] Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns

Afshin Khadangi

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2602.22479: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22479&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[537] Self-Aware Markov Models for Discrete Reasoning

Gregor Kornhardt, Jannis Chemseddine, Christian Wald, Gabriele Steidl

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2603.16661: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16661&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[538] Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs

Abhishek Gupta, Aditya Mahajan

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.17875 appears to be from March 2023, but no content is available for analysis.

Details

Motivation: Cannot determine motivation due to lack of access to paper content. The HTTP 429 error indicates rate limiting from arXiv API, preventing retrieval of the paper abstract.

Method: No method information available. The paper content could not be fetched from arXiv due to API rate limiting.

Result: No results available. The analysis cannot proceed without access to the paper abstract or content.

Conclusion: Unable to analyze paper due to technical limitations in accessing the content. The arXiv API rate limiting prevents proper evaluation of this paper’s relevance.

Abstract: Failed to fetch summary for 2603.17875: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17875&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[539] MRMS-Net and LMRMS-Net: Scalable Multi-Representation Multi-Scale Networks for Time Series Classification

Celal Alagöz, Mehmet Kurnaz, Farhan Aadil

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.19315: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19315&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[540] Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training

Giacomo Borghi, Hyesung Im, Lorenzo Pareschi

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.19808: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19808&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[541] Cloud-Edge Collaborative Large Models for Robust Photovoltaic Power Forecasting

Nan Qiao, Shuning Wang, Sijing Duan, Wenpeng Cui, Yuzhe Chen, Qingchen Yang, Xingyuan Hua, Ju Ren

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.22343: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22343&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[542] DAK-UCB: Diversity-Aware Prompt Routing for LLMs and Generative Models

Donya Jafari, Farzan Farnia

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval error

Method: Unable to determine method due to retrieval error

Result: Unable to determine results due to retrieval error

Conclusion: Unable to determine conclusion due to retrieval error

Abstract: Failed to fetch summary for 2603.23140: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23140&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[543] Accelerating Matrix Factorization by Dynamic Pruning for Fast Recommendation

Yining Wu, Shengyu Duan, Gaole Sai, Chenhong Cao, Guobing Zou

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2404.04265: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.04265&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[544] General Coded Computing in a Probabilistic Straggler Regime

Parsa Moradi, Mohammad Ali Maddah-Ali

Main category: cs.LG

TL;DR: Paper 2502.00645: Unable to fetch abstract due to HTTP 429 error (rate limiting). No content available for analysis.

Details

Motivation: Cannot determine motivation as abstract content is unavailable due to HTTP 429 error from arXiv API.

Method: Cannot determine method as abstract content is unavailable due to HTTP 429 error from arXiv API.

Result: Cannot determine results as abstract content is unavailable due to HTTP 429 error from arXiv API.

Conclusion: Cannot draw conclusions as abstract content is unavailable due to HTTP 429 error from arXiv API.

Abstract: Failed to fetch summary for 2502.00645: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.00645&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[545] Algorithms with Calibrated Machine Learning Predictions

Judy Hanwen Shen, Ellen Vitercik, Anders Wikum

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2502.02861: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.02861&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[546] Energy-Efficient UAV-assisted LoRa Gateways: A Multi-Agent Optimization Approach

Abdullahi Isa Ahmed, Jamal Bentahar, El Mehdi Amhoud

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2502.03377: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.03377&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[547] Accelerated Parallel Tempering via Neural Transports

Leo Zhang, Peter Potaptchik, Jiajun He, Yuanqi Du, Arnaud Doucet, Francisco Vargas, Hai-Dang Dau, Saifuddin Syed

Main category: cs.LG

TL;DR: Unable to analyze paper 2502.10328 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation without access to the paper abstract

Method: Cannot determine method without access to the paper abstract

Result: Cannot determine results without access to the paper abstract

Conclusion: Cannot draw conclusions without access to the paper abstract

Abstract: Failed to fetch summary for 2502.10328: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.10328&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[548] Gen-C: Populating Virtual Worlds with Generative Crowds

Andreas Panayiotou, Panayiotis Charalambous, Ioannis Karamouzas

Main category: cs.LG

TL;DR: Unable to analyze paper 2504.01924 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable

Method: Cannot determine method as abstract is unavailable

Result: Cannot determine results as abstract is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2504.01924: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.01924&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[549] Recurrent neural network-based robust control systems with regional properties and application to MPC design

Daniele Ravasio, Alessio La Bella, Marcello Farina, Andrea Ballarino

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2506.20334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.20334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[550] Generalization performance of narrow one-hidden layer networks in the teacher-student setting

Rodrigo Pérez Ortiz, Gibbs Nwemadji, Jean Barbier, Federica Gerace, Alessandro Ingrosso, Clarissa Lauditi, Enrico M. Malatesta

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2507.00629: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.00629&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Zhihao Luo, Wentao Yan, Jingyu Gong, Min Wang, Zhizhong Zhang, Xuhong Wang, Yuan Xie, Xin Tan

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2508.02046: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02046&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[552] CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload

Amirhossein Shahbazinia, Darong Huang, Luis Costero, David Atienza

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2509.03394: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03394&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[553] FLOWR.root: A flow matching based foundation model for joint multi-purpose structure-aware 3D ligand generation and affinity prediction

Julian Cremer, Tuan Le, Mohammad M. Ghahremanpour, Emilia Sługocka, Filipe Menezes, Djork-Arné Clevert

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2510.02578: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02578&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[554] Data-Prompt Co-Evolution: Growing Test Sets to Refine LLM Behavior

Minjae Lee, Minsuk Kahng

Main category: cs.LG

TL;DR: Paper 2510.12728: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot determine conclusion due to inability to access paper content

Abstract: Failed to fetch summary for 2510.12728: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12728&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[555] How to Sell High-Dimensional Data Optimally

Andrew Li, R. Ravi, Karan Singh, Zihong Yi, Weizhong Zhang

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2510.15214: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15214&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[556] Measurement-Driven Early Warning of Reliability Breakdown in 5G NSA Railway Networks

Po-Heng Chou, Da-Chih Lin, Hung-Yu Wei, Walid Saad, Yu Tsao

Main category: cs.LG

TL;DR: Paper ID 2511.08851 could not be fetched due to HTTP 429 error (rate limiting), so analysis cannot be performed.

Details

Motivation: Unable to determine motivation due to missing paper content.

Method: Unable to determine method due to missing paper content.

Result: Unable to determine results due to missing paper content.

Conclusion: Unable to draw conclusions due to missing paper content.

Abstract: Failed to fetch summary for 2511.08851: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.08851&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[557] Deep Learning as a Convex Paradigm of Computation: Minimizing Circuit Size with ResNets

Arthur Jacot

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2511.20888: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20888&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[558] Distributional Shrinkage II: Higher-Order Scores Encode Brenier Map

Tengyuan Liang

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without paper content

Method: Cannot determine method without paper content

Result: Cannot determine results without paper content

Conclusion: Cannot draw conclusions without paper content

Abstract: Failed to fetch summary for 2512.09295: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09295&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[559] Why Machine Learning Models Systematically Underestimate Extreme Values II: How to Fix It with LatentNN

Yuan-Sen Ting

Main category: cs.LG

TL;DR: Failed to fetch summary for paper 2512.23138 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed summary fetch

Method: Unable to determine method due to failed summary fetch

Result: Unable to determine results due to failed summary fetch

Conclusion: Unable to determine conclusion due to failed summary fetch

Abstract: Failed to fetch summary for 2512.23138: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23138&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[560] Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, Feng Qiu, Heng Qu, Yifei Su, Qiao Sun, Dong Wang, Donghao Wang, Yunhong Wang, Rujie Wu, Diyun Xiang, Yu Yang, Hangjun Ye, Yuan Zhang, Quanyun Zhou

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.12684: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12684&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[561] Gradient-Informed Bayesian and Interior Point Optimization for Efficient Inverse Design in Nanophotonics

Yannik Mahlau, Yannick Augenstein, Tyler W. Hughes, Marius Lindauer, Bodo Rosenhahn

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2602.18148: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18148&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[562] From Reachability to Learnability: Geometric Design Principles for Quantum Neural Networks

Vishal S. Ngairangbam, Michael Spannowsky

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to determine conclusion due to access restrictions

Abstract: Failed to fetch summary for 2603.03071: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03071&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[563] Bayes with No Shame: Admissibility Geometries of Predictive Inference

Nicholas G. Polson, Daniel Zantedeschi

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.05335: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.05335&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[564] Neural Networks as Local-to-Global Computations

Vicente Bosca, Robert Ghrist

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.14831: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14831&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[565] Breaking Hard Isomorphism Benchmarks with DRESS

Eduar Castrillo Velilla

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.18582: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18582&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[566] Learn for Variation: Variationally Guided AAV Trajectory Learning in Differentiable Environments

Xiucheng Wang, Zhenye Chen, Nan Cheng

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2603.18853: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18853&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[567] RadioDiff-FS: Physics-Informed Manifold Alignment in Few-Shot Diffusion Models for High-Fidelity Radio Map Construction

Xiucheng Wang, Zixuan Guo, Nan Cheng

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions due to missing paper content

Abstract: Failed to fetch summary for 2603.18865: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18865&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[568] Minimax Generalized Cross-Entropy

Kartheek Bondugula, Santiago Mazuelas, Aritz Pérez, Anqi Liu

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.19874: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19874&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.MA

[569] Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL

Igor Jankowski

Main category: cs.MA

TL;DR: ETD-MAPPO enables MARL agents to dynamically adjust execution frequency using uncertainty measures, reducing computational overhead by 73.6% without performance loss.

Details

Motivation: Standard synchronous MARL requires agents to execute neural network inferences at every micro-frame, creating deployment barriers for edge devices with constrained thermal and metabolic budgets.

Method: Proposes Epistemic Time-Dilation MAPPO with Dual-Gated Epistemic Trigger that modulates execution frequency using aleatoric uncertainty (Shannon entropy of policy) and epistemic uncertainty (state-value divergence in Twin-Critic architecture). Structures environment as Semi-Markov Decision Process with SMDP-Aligned Asynchronous Gradient Masking Critic for proper credit assignment.

Result: Achieves >60% relative baseline improvements over current temporal models, prevents premature policy collapse, enables emergent Temporal Role Specialization, and reduces computational overhead by 73.6% during off-ball execution without deteriorating centralized task dominance.

Conclusion: The approach successfully enables MARL agents to autonomously modulate execution frequency based on uncertainty, significantly reducing computational requirements while maintaining performance, making it suitable for edge device deployment.

Abstract: While Multi-Agent Reinforcement Learning (MARL) algorithms achieve unprecedented successes across complex continuous domains, their standard deployment strictly adheres to a synchronous operational paradigm. Under this paradigm, agents are universally forced to execute deep neural network inferences at every micro-frame, regardless of immediate necessity. This dense throughput acts as a fundamental barrier to physical deployment on edge-devices where thermal and metabolic budgets are highly constrained. We propose Epistemic Time-Dilation MAPPO (ETD-MAPPO), augmented with a Dual-Gated Epistemic Trigger. Instead of depending on rigid frame-skipping (macro-actions), agents autonomously modulate their execution frequency by interpreting aleatoric uncertainty (via Shannon entropy of their policy) and epistemic uncertainty (via state-value divergence in a Twin-Critic architecture). To format this, we structure the environment as a Semi-Markov Decision Process (SMDP) and build the SMDP-Aligned Asynchronous Gradient Masking Critic to ensure proper credit assignment. Empirical findings demonstrate massive improvements (> 60% relative baseline acquisition leaps) over current temporal models. By assessing LBF, MPE, and the 115-dimensional state space of Google Research Football (GRF), ETD correctly prevented premature policy collapse. Remarkably, this unconstrained approach leads to emergent Temporal Role Specialization, reducing computational overhead by a statistically dominant 73.6% entirely during off-ball execution without deteriorating centralized task dominance.

[570] Self-Evolving Multi-Agent Framework for Efficient Decision Making in Real-Time Strategy Scenarios

Li Ma, Hao Peng, Yiming Wang, Hongbin Luo, Jie Liu, Kongjing Gu, Guanlin Wu, Hui Lin, Lei Ren

Main category: cs.MA

TL;DR: SEMA is a self-evolving multi-agent framework for real-time strategy games that addresses LLM speed-quality trade-offs through adaptive bias calibration, dynamic observation pruning, and hybrid knowledge-memory mechanisms.

Details

Motivation: LLMs show promise for autonomous agent decision-making in dynamic settings but face critical speed-quality trade-offs in Real-Time Strategy (RTS) scenarios due to expansive state spaces, time limits, inference delays, and stochastic planning errors that undermine logical consistency.

Method: SEMA uses a collaborative multi-agent framework with self-evolution through adaptive bias calibration (in-episode assessment and cross-episode analysis), dynamic observation pruning based on structural entropy to model game states topologically, and a hybrid knowledge-memory mechanism integrating micro-trajectories, macro-experience, and hierarchical domain knowledge.

Result: Experiments across multiple StarCraft II maps show SEMA achieves superior win rates while reducing average decision latency by over 50%, validating its efficiency and robustness in complex RTS scenarios.

Conclusion: SEMA effectively addresses LLM limitations in RTS environments through its self-evolving multi-agent framework, demonstrating significant improvements in both decision quality and speed.

Abstract: Large language models (LLMs) have demonstrated exceptional potential in complex reasoning,pioneering a new paradigm for autonomous agent decision making in dynamic settings. However, in Real-Time Strategy (RTS) scenarios, LLMs suffer from a critical speed-quality trade-off. Specifically expansive state spaces and time limits render inference delays prohibitive, while stochastic planning errors undermine logical consistency. To address these challenges, we present SEMA (Self-Evolving Multi-Agent), a novel framework designed for high-performance, low-latency decision-making in RTS environments. This collaborative multi-agent framework facilitates self-evolution by adaptively calibrating model bias through in-episode assessment and cross-episode analysis. We further incorporate dynamic observation pruning based on structural entropy to model game states topologically. By distilling high dimensional data into core semantic information, this approach significantly reduces inference time. We also develop a hybrid knowledge-memory mechanism that integrates micro-trajectories, macro-experience, and hierarchical domain knowledge, thereby enhancing both strategic adaptability and decision consistency. Experiments across multiple StarCraft II maps demonstrate that SEMA achieves superior win rates while reducing average decision latency by over 50%, validating its efficiency and robustness in complex RTS scenarios.

[571] Relaxing Constraints in Anonymous Multi Agent Path Finding for Large Agents

Stepan Dergachev, Dmitry Avdeev

Main category: cs.MA

TL;DR: Modified Anonymous Multi-Agent Pathfinding algorithm for continuous space with disk-shaped agents, reducing required separation between start/goal positions from 4 to 2√3 agent radii while preserving collision-free guarantees.

Details

Motivation: Existing multi-agent pathfinding algorithms have limitations: discrete grid-based methods don't account for agent sizes, while continuous-space methods impose restrictive constraints on distances between positions and obstacles. The authors aim to make AMAPF more applicable to real-world scenarios like warehouse robot navigation by relaxing these constraints.

Method: The work modifies an existing AMAPF algorithm designed for continuous space where agents are modeled as disks of equal size. The original algorithm required strict minimum separation of 4 agent radii between any start/goal positions. The proposed modification reduces this requirement to 2√3 agent radii while preserving theoretical guarantees.

Result: Theoretically demonstrated that the enhanced algorithm preserves original theoretical properties, including the guarantee that all agents will eventually achieve their goals safely and without collisions, despite the relaxed constraints.

Conclusion: The modification makes AMAPF more practical for real-world applications by reducing the required separation between agent positions, potentially enabling deployment in more constrained environments while maintaining safety guarantees.

Abstract: The study addressed the problem of Anonymous Multi-Agent Path-finding (AMAPF). Unlike the classical formulation, where the assignment of agents to goals is fixed, in the anonymous MAPF setting it is irrelevant which agent reaches specific goal, provided that all goals are occupied. Most existing multi-agent pathfinding algorithms rely on a discrete representation of the environment (e.g., square grids) and do not account for the sizes of agents. This limits their applicability in real-world scenarios, such as trajectory planning for mobile robots in warehouses. Conversely, methods operating in continuous space typically impose substantial restrictions on the input data, such as constraints on the distances between initial and goal positions or between start/goal positions and obstacles. In this work, we considered one of the AMAPF algorithms designed for continuous space, where agents are modeled as disks of equal size. The algorithm requires a strict minimum separation of $4$ agent radii between any start/goal positions. Proposed a modification aimed at relaxing the constraints and reduce this limit from $4$ to $2\sqrt{3}$. We theoretically demonstrated that the proposed enhancements preserve original theoretical properties, including the guarantee that all agents will eventually achieve their goals safely and without collisions.

[572] Dynamic Adversarial Resource Allocation: the dDAB Game

Yue Guan, Daigo Shishika, Jason R. Marden, Michael Dorothy, Panagiotis Tsiotras, Vijay Kumar

Main category: cs.MA

TL;DR: Dynamic Defender-Attacker Blotto game extends classical Blotto to dynamic resource allocation over graphs, analyzing defender strategies to maintain numerical superiority against attackers moving through network nodes.

Details

Motivation: Extends classical static Blotto games to dynamic settings with graph constraints, addressing real-world security scenarios where defenders must allocate resources over time across networked locations to counter moving threats.

Method: Develops discrete-time game model where players reallocate resources one hop per time step, uses reachability analysis for graph-constrained movement, extends to attackers that can split/merge, applies superposition principles for defender strategies, and creates set-based dynamic programming algorithm.

Result: Determines necessary and sufficient defender resources for guaranteed defense, computes optimal strategies via dynamic programming, and validates approach through numerical simulations and Robotarium hardware experiments.

Conclusion: The dDAB framework successfully extends Blotto games to dynamic graph settings, providing analytical tools and algorithms for resource allocation in security scenarios with moving threats over networks.

Abstract: This work introduces the dynamic Defender-Attacker Blotto (dDAB) game, extending the classical static Blotto game to a dynamic resource allocation setting over graphs. In the dDAB game, a defender is required to maintain numerical superiority against attacker resources across a set of key nodes in a connected graph. The engagement unfolds as a discrete-time game, where each player reallocates its resources in turn, with resources allowed to move at most one hop per time step. The primary goal is to determine the necessary and sufficient amount of defender resources required to guarantee sustained defense, along with the corresponding strategies. To address the central challenge arising from graph-constrained resource reallocation, we conduct a reachability analysis, starting with simplified settings where attacker resources act as a single cohesive group. We then extend the framework to allow attacker resources to split and merge arbitrarily, and construct defender strategies using superposition principles. A set-based dynamic programming algorithm is developed to compute the optimal strategies, as well as the minimum amount of defender resources to ensure successful defense. The effectiveness of our approach is demonstrated through numerical simulations and hardware experiments on the Georgia Tech Robotarium platform.

[573] SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication

Ruijia Zhang, Xinyan Zhao, Ruixiang Wang, Sigen Chen, Guibin Zhang, An Zhang, Kun Wang, Qingsong Wen

Main category: cs.MA

TL;DR: SafeSieve is a progressive adaptive pruning algorithm for LLM-based multi-agent systems that reduces redundant communication and token overhead through dual-mechanism refinement of inter-agent connections.

Details

Motivation: LLM-based multi-agent systems have strong collaborative capabilities but suffer from redundant communication and excessive token overhead. Existing methods isolate pre- and post-task optimization, lacking a unified strategy for efficient agent communication.

Method: SafeSieve uses a progressive adaptive pruning algorithm with dual-mechanism: initial LLM-based semantic evaluation combined with accumulated performance feedback, enabling transition from heuristic initialization to experience-driven refinement. Employs 0-extension clustering to preserve structurally coherent agent groups while eliminating ineffective links.

Result: Achieves 94.01% average accuracy while reducing token usage by 12.4%-27.8%. Shows robustness under prompt injection attacks (1.23% average accuracy drop). In heterogeneous settings, reduces deployment costs by 13.3% while maintaining performance.

Conclusion: SafeSieve establishes an efficient, GPU-free, and scalable framework for practical multi-agent systems by dynamically refining inter-agent communication through progressive pruning.

Abstract: LLM-based multi-agent systems exhibit strong collaborative capabilities but often suffer from redundant communication and excessive token overhead. Existing methods typically enhance efficiency through pretrained GNNs or greedy algorithms, but often isolate pre- and post-task optimization, lacking a unified strategy. To this end, we present SafeSieve, a progressive and adaptive multi-agent pruning algorithm that dynamically refines the inter-agent communication through a novel dual-mechanism. SafeSieve integrates initial LLM-based semantic evaluation with accumulated performance feedback, enabling a smooth transition from heuristic initialization to experience-driven refinement. Unlike existing greedy Top-k pruning methods, SafeSieve employs 0-extension clustering to preserve structurally coherent agent groups while eliminating ineffective links. Experiments across benchmarks (SVAMP, HumanEval, etc.) showcase that SafeSieve achieves 94.01% average accuracy while reducing token usage by 12.4%-27.8%. Results further demonstrate robustness under prompt injection attacks (1.23% average accuracy drop). In heterogeneous settings, SafeSieve reduces deployment costs by 13.3% while maintaining performance. These results establish SafeSieve as an efficient, GPU-free, and scalable framework for practical multi-agent systems. Our code can be found here: https://github.com/csgen/SafeSieve

[574] Agent Contracts: A Formal Framework for Resource-Bounded Autonomous AI Systems

Qing Ye, Jing Tan

Main category: cs.MA

TL;DR: Agent Contracts extend the contract metaphor from task allocation to resource-bounded execution, providing formal governance mechanisms with multi-dimensional resource constraints, temporal boundaries, and success criteria for predictable AI deployment.

Details

Motivation: Modern agent protocols lack formal resource governance mechanisms to bound agent consumption and operation time, creating unpredictable AI deployment. The paper aims to address this gap by extending the contract metaphor from task allocation to resource-bounded execution.

Method: Introduces Agent Contracts framework that unifies input/output specifications, multi-dimensional resource constraints, temporal boundaries, and success criteria into coherent governance with explicit lifecycle semantics. Establishes conservation laws for delegated budgets respecting parent constraints, enabling hierarchical coordination through contract delegation.

Result: Empirical validation across four experiments shows: 90% token reduction with 525x lower variance in iterative workflows, zero conservation violations in multi-agent delegation, and measurable quality-resource tradeoffs through contract modes.

Conclusion: Agent Contracts provide formal foundations for predictable, auditable, and resource-bounded autonomous AI deployment, extending coordination mechanisms from task allocation to resource governance.

Abstract: The Contract Net Protocol (1980) introduced coordination through contracts in multi-agent systems. Modern agent protocols standardize connectivity and interoperability; yet, none provide formal, resource governance-normative mechanisms to bound how much agents may consume or how long they may operate. We introduce Agent Contracts, a formal framework that extends the contract metaphor from task allocation to resource-bounded execution. An Agent Contract unifies input/output specifications, multi-dimensional resource constraints, temporal boundaries, and success criteria into a coherent governance mechanism with explicit lifecycle semantics. For multi-agent coordination, we establish conservation laws ensuring delegated budgets respect parent constraints, enabling hierarchical coordination through contract delegation. Empirical validation across four experiments demonstrates 90% token reduction with 525x lower variance in iterative workflows, zero conservation violations in multi-agent delegation, and measurable quality-resource tradeoffs through contract modes. Agent Contracts provide formal foundations for predictable, auditable, and resource-bounded autonomous AI deployment.

[575] Is AI Ready for Multimodal Hate Speech Detection? A Comprehensive Dataset and Benchmark Evaluation

Rui Xing, Qi Chai, Jie Ma, Jing Tao, Pinghui Wang, Shuming Zhang, Xinping Wang, Hao Wang

Main category: cs.MA

TL;DR: M^3 dataset: Multi-platform, multi-lingual multimodal meme dataset with fine-grained hate speech labels and rationales, created using agentic annotation framework with 7 specialized agents.

Details

Motivation: Existing multimodal hate speech datasets have coarse-grained labeling and lack integration with surrounding discourse, leading to imprecise assessments of hate speech in memes that require cultural knowledge for interpretation.

Method: Proposed agentic annotation framework with 7 specialized agents to generate hierarchical labels and rationales. Constructed M^3 dataset of 2,455 memes from X, 4chan, and Weibo with fine-grained hate labels and human-verified rationales.

Result: Benchmarking shows state-of-the-art Multimodal Large Language Models struggle to effectively utilize surrounding post context, often failing to improve or even degrading detection performance.

Conclusion: Current models face challenges in reasoning over memes embedded in real-world discourse, highlighting the need for context-aware multimodal architectures.

Abstract: Hate speech online targets individuals or groups based on identity attributes and spreads rapidly, posing serious social risks. Memes, which combine images and text, have emerged as a nuanced vehicle for disseminating hate speech, often relying on cultural knowledge for interpretation. However, existing multimodal hate speech datasets suffer from coarse-grained labeling and a lack of integration with surrounding discourse, leading to imprecise and incomplete assessments. To bridge this gap, we propose an agentic annotation framework that coordinates seven specialized agents to generate hierarchical labels and rationales. Based on this framework, we construct M^3 (Multi-platform, Multi-lingual, and Multimodal Meme), a dataset of 2,455 memes collected from X, 4chan, and Weibo, featuring fine-grained hate labels and human-verified rationales. Benchmarking state-of-the-art Multimodal Large Language Models reveals that these models struggle to effectively utilize surrounding post context, which often fails to improve or even degrades detection performance. Our finding highlights the challenges these models face in reasoning over memes embedded in real-world discourse and underscores the need for a context-aware multimodal architecture. Our dataset and code are available at https://github.com/mira-ai-lab/M3.

cs.MM

eess.AS

[576] Rethinking Masking Strategies for Masked Prediction-based Audio Self-supervised Learning

Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Noboru Harada, Nobutaka Ono

Main category: eess.AS

TL;DR: Dispersion-weighted masking (DWM) is proposed as a lightweight masking strategy for audio SSL that leverages spectral sparsity to improve performance while reducing computational overhead compared to informed masking techniques.

Details

Motivation: Recent informed masking techniques for audio representation learning incur substantial computational overhead. The authors aim to develop a more efficient masking strategy that leverages the inherent spectral sparsity in audio content.

Method: Proposes dispersion-weighted masking (DWM), a lightweight masking strategy that uses the spectral sparsity in audio frequency structure. Compares with inverse block masking and other recent SSL frameworks.

Result: DWM alleviates limitations of inverse block masking (which improves audio event understanding but hurts generalization) and reduces computational complexity, leading to consistent performance improvements.

Conclusion: Provides practical guidance on masking strategy design for masked prediction-based audio representation learning, showing DWM offers better efficiency and performance trade-offs.

Abstract: Since the introduction of Masked Autoencoders, various improvements to masking techniques have been explored. In this paper, we rethink masking strategies for audio representation learning using masked prediction-based self-supervised learning (SSL) on general audio spectrograms. While recent informed masking techniques have attracted attention, we observe that they incur substantial computational overhead. Motivated by this observation, we propose dispersion-weighted masking (DWM), a lightweight masking strategy that leverages the spectral sparsity inherent in the frequency structure of audio content. Our experiments show that inverse block masking, commonly used in recent SSL frameworks, improves audio event understanding performance while introducing a trade-off in generalization. The proposed DWM alleviates these limitations and computational complexity, leading to consistent performance improvements. This work provides practical guidance on masking strategy design for masked prediction-based audio representation learning.

[577] Crab: Multi Layer Contrastive Supervision to Improve Speech Emotion Recognition Under Both Acted and Natural Speech Condition

Lucas H. Ueda, João G. T. Lima, Paula D. P. Costa

Main category: eess.AS

TL;DR: Crab is a bimodal Cross-Modal Transformer for Speech Emotion Recognition that integrates speech and text representations with Multi Layer Contrastive Supervision to improve discriminative power throughout the network.

Details

Motivation: Real-world SER faces challenges with class imbalance and natural speech. Existing methods using SSL representations and multimodal fusion typically apply supervision only at the final classification layer, limiting intermediate representation quality.

Method: Proposes Crab architecture integrating WavLM speech representations and RoBERTa text representations via Cross-Modal Transformer. Introduces Multi Layer Contrastive Supervision (MLCS) that injects contrastive learning signals at multiple network layers without adding inference parameters. Uses weighted cross-entropy to handle data imbalance.

Result: Outperforms strong unimodal and multimodal baselines on IEMOCAP, MELD, and MSP-Podcast 2.0 datasets, with particularly large gains under naturalistic and highly imbalanced conditions.

Conclusion: MLCS is an effective general strategy for SER that improves emotional discriminative representations throughout the model without inference overhead.

Abstract: Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to severe class imbalance and the prevalence of spontaneous, natural speech. While recent approaches leverage self-supervised learning (SSL) representations and multimodal fusion of speech and text, most existing methods apply supervision only at the final classification layer, limiting the discriminative power of intermediate representations. In this work, we propose Crab (Contrastive Representation and Multimodal Aligned Bottleneck), a bimodal Cross-Modal Transformer architecture that integrates speech representations from WavLM and textual representations from RoBERTa, together with a novel \textit{Multi Layer Contrastive Supervision} (MLCS) strategy. MLCS injects multi-positive contrastive learning signals at multiple layers of the network, encouraging emotionally discriminative representations throughout the model without introducing additional parameters at inference time. To further address data imbalance, we adopt weighted cross-entropy during training. We evaluate the proposed approach on three benchmark datasets covering different degrees of emotional naturalness: IEMOCAP, MELD, and MSP-Podcast 2.0. Experimental results demonstrate that Crab consistently outperforms strong unimodal and multimodal baselines across all datasets, with particularly large gains under naturalistic and highly imbalanced conditions. These findings highlight the effectiveness of \textit{Multi Layer Contrastive Supervision} as a general and robust strategy for SER. Official implementation can be found in https://github.com/AI-Unicamp/Crab.

[578] Autoregressive Guidance of Deep Spatially Selective Filters using Bayesian Tracking for Efficient Extraction of Moving Speakers

Jakob Kienegger, Timo Gerkmann

Main category: eess.AS

TL;DR: Deep spatial filters for speech enhancement with Bayesian tracking for dynamic speaker scenarios, using enhanced speech feedback to improve tracking accuracy with minimal computational overhead.

Details

Motivation: Existing deep spatial filters work well for stationary speakers with known directions, but dynamic scenarios with only initial speaker directions require lightweight tracking algorithms that can maintain real-time performance while adapting to speaker movement.

Method: Proposes Bayesian tracking algorithms that incorporate enhanced speech signals to improve tracking performance, using temporal feedback in a frame-wise causal processing style. The enhanced signal is used to autoregressively guide deep spatial filters. Also introduces a novel dataset based on the social force model for realistic trajectory simulation.

Result: Autoregressive incorporation significantly improves Bayesian tracker accuracy, resulting in superior speech enhancement with none or only negligibly increased computational overhead. Methods generalize well to unseen, challenging acoustic conditions as demonstrated with real-world recordings.

Conclusion: The proposed Bayesian tracking approach enables high-quality speech enhancement in dynamic scenarios by effectively leveraging enhanced speech signals to improve tracking accuracy while maintaining real-time computational efficiency.

Abstract: Deep spatially selective filters achieve high-quality enhancement with real-time capable architectures for stationary speakers of known directions. To retain this level of performance in dynamic scenarios when only the speakers’ initial directions are given, accurate, yet computationally lightweight tracking algorithms become necessary. Assuming a frame-wise causal processing style, temporal feedback allows for leveraging the enhanced speech signal to improve tracking performance. In this work, we investigate strategies to incorporate the enhanced signal into lightweight tracking algorithms and autoregressively guide deep spatial filters. Our proposed Bayesian tracking algorithms are compatible with arbitrary deep spatial filters. To increase the realism of simulated trajectories during development and evaluation, we propose and publish a novel dataset based on the social force model. Results validate that the autoregressive incorporation significantly improves the accuracy of our Bayesian trackers, resulting in superior enhancement with none or only negligibly increased computational overhead. Real-world recordings complement these findings and demonstrate the generalizability of our methods to unseen, challenging acoustic conditions.

[579] ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding

Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Junbo Zhang, Jian Luan

Main category: eess.AS

Details

Motivation: Existing audio captioning datasets lack the scale and descriptive granularity needed to train versatile large audio-language models for general audio understanding.

Result: Models pre-trained on ACAVCaps show substantially stronger generalization capabilities on various downstream tasks compared to those trained on other leading captioning datasets.

Conclusion: ACAVCaps addresses the data limitations in audio captioning and enables better training of large audio-language models for general audio understanding.

Ludovic Pirard, Lorenzo Picinali, Katarina C. Poole

Main category: eess.AS

TL;DR: Photogrammetry-reconstructed head/ear meshes from consumer hardware produce synthetic HRTFs with preserved ITD cues but poor ILD/spectral accuracy, leading to worse perceptual performance than measured HRTFs or even random HRTFs.

Details

Motivation: Individual HRTFs are crucial for accurate spatial audio rendering but difficult to obtain through measurement. The study investigates whether photogrammetry-reconstructed meshes from consumer hardware can provide a practical baseline for individual HRTF synthesis.

Method: Used 72-image photogrammetry captures per subject processed with Apple’s Object Capture API to generate PR meshes for 150 subjects from SONICOM dataset. Computed PR synthetic HRTFs using Mesh2HRTF and compared against measured HRTFs, 3D scan-derived HRTFs, KEMAR, and random HRTFs through numerical evaluation, auditory models, and behavioral sound localization experiments (N=27).

Result: PR synthetic HRTFs preserved ITD cues but exhibited increased ILD and spectral errors. Auditory-model predictions and behavioral data showed substantially higher quadrant error rates, reduced elevation accuracy, and greater front-back confusions than measured HRTFs, performing worse than random HRTFs on perceptual metrics.

Conclusion: Current photogrammetry pipelines support individual HRTF synthesis but are limited by insufficient pinna morphology details and high-frequency spectral fidelity needed for accurate individual HRTFs containing monaural cues.

Abstract: Individual head-related transfer functions (HRTFs) are essential for accurate spatial audio binaural rendering but remain difficult to obtain due to measurement complexity. This study investigates whether photogrammetry-reconstructed (PR) head and ear meshes, acquired with consumer hardware, can provide a practically useful baseline for individual HRTF synthesis. Using the SONICOM HRTF dataset, 72-image photogrammetry captures per subject were processed with Apple’s Object Capture API to generate PR meshes for 150 subjects. Mesh2HRTF was used to compute PR synthetic HRTFs, which were compared against measured HRTFs, high-resolution 3D scan-derived HRTFs, KEMAR, and random HRTFs through numerical evaluation, auditory models, and a behavioural sound localisation experiment (N = 27). PR synthetic HRTFs preserved ITD cues but exhibited increased ILD and spectral errors. Auditory-model predictions and behavioural data showed substantially higher quadrant error rates, reduced elevation accuracy, and greater front-back confusions than measured HRTFs, performing worse than random HRTFs on perceptual metrics. Current photogrammetry pipelines support individual HRTF synthesis but are limited by insufficient pinna morphology details and high-frequency spectral fidelity needed for accurate individual HRTFs containing monaural cues.

[581] How Open is Open TTS? A Practical Evaluation of Open Source TTS Tools for Romanian

Teodora Răgman, Adrian Bogdan Stânea, Horia Cucu, Adriana Stan

Main category: eess.AS

TL;DR: Systematic evaluation of four open-source TTS architectures (FastPitch, VITS, Grad-TTS, Matcha-TTS) for building speech synthesis systems, with focus on Romanian language and practical implementation challenges.

Details

Motivation: Open-source TTS frameworks offer adaptability but face uneven applicability for under-resourced languages and constrained computational environments. Need systematic assessment of practical feasibility for building novel TTS models.

Method: Evaluated four open-source TTS architectures (FastPitch, VITS, Grad-TTS, Matcha-TTS) across multiple dimensions: qualitative aspects (ease of installation, dataset preparation, hardware requirements) and quantitative assessments (synthesis quality for Romanian). Used both objective metrics and subjective listening tests to evaluate intelligibility, speaker similarity, and naturalness.

Result: Revealed significant challenges in tool chain setup, data preprocessing, and computational efficiency that hinder adoption in low-resource contexts. Provides reproducible protocols and accessible evaluation criteria for TTS development.

Conclusion: This work aims to inform best practices and promote more inclusive, language-diverse TTS development by providing systematic evaluation of open-source frameworks and highlighting practical implementation barriers.

Abstract: Open-source text-to-speech (TTS) frameworks have emerged as highly adaptable platforms for developing speech synthesis systems across a wide range of languages. However, their applicability is not uniform – particularly when the target language is under-resourced or when computational resources are constrained. In this study, we systematically assess the feasibility of building novel TTS models using four widely adopted open-source architectures: FastPitch, VITS, Grad-TTS, and Matcha-TTS. Our evaluation spans multiple dimensions, including qualitative aspects such as ease of installation, dataset preparation, and hardware requirements, as well as quantitative assessments of synthesis quality for Romanian. We employ both objective metrics and subjective listening tests to evaluate intelligibility, speaker similarity, and naturalness of the generated speech. The results reveal significant challenges in tool chain setup, data preprocessing, and computational efficiency, which can hinder adoption in low-resource contexts. By grounding the analysis in reproducible protocols and accessible evaluation criteria, this work aims to inform best practices and promote more inclusive, language-diverse TTS development. All information needed to reproduce this study (i.e. code and data) are available in our git repository: https://gitlab.com/opentts_ragman/OpenTTS

Zhongweiyang Xu, Ashutosh Pandey, Juan Azcarreta, Zhaoheng Ni, Sanjeel Parekh, Buye Xu

Main category: eess.AS

TL;DR: ArrayDPS-Refine is a training-free, generative method that refines outputs of discriminative multi-channel speech enhancement models using diffusion posterior sampling with estimated noise spatial covariance matrix.

Details

Motivation: Discriminative models for multi-channel speech enhancement can suffer from non-linear distortions from regression-based objectives, especially in challenging noise conditions. The authors aim to improve these models without retraining using generative refinement.

Method: ArrayDPS-Refine is array-agnostic and training-free. It first estimates noise spatial covariance matrix (SCM) from enhanced speech produced by any discriminative model, then uses this estimated noise SCM for diffusion posterior sampling to refine the output.

Result: The method consistently improves performance of various discriminative models, including state-of-the-art waveform and STFT domain models, demonstrating effective refinement without retraining.

Conclusion: ArrayDPS-Refine provides an effective training-free approach to enhance discriminative multi-channel speech enhancement models using generative diffusion priors, offering consistent performance improvements across different model architectures.

Abstract: Multi-channel speech enhancement aims to recover clean speech from noisy multi-channel recordings. Most deep learning methods employ discriminative training, which can lead to non-linear distortions from regression-based objectives, especially under challenging environmental noise conditions. Inspired by ArrayDPS for unsupervised multi-channel source separation, we introduce ArrayDPS-Refine, a method designed to enhance the outputs of discriminative models using a clean speech diffusion prior. ArrayDPS-Refine is training-free, generative, and array-agnostic. It first estimates the noise spatial covariance matrix (SCM) from the enhanced speech produced by a discriminative model, then uses this estimated noise SCM for diffusion posterior sampling. This approach allows direct refinement of any discriminative model’s output without retraining. Our results show that ArrayDPS-Refine consistently improves the performance of various discriminative models, including state-of-the-art waveform and STFT domain models. Audio demos are provided at https://xzwy.github.io/ArrayDPSRefineDemo/.

[583] YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

Chunbo Hao, Junjie Zheng, Guobin Ma, Yuepeng Jiang, Huakang Chen, Wenjie Tian, Gongyu Chen, Zihao Chen, Lei Xie

Main category: eess.AS

TL;DR: YingMusic-Singer is a diffusion-based model for melody-controllable singing voice synthesis with flexible lyric manipulation, requiring no manual alignment and outperforming existing methods.

Details

Motivation: Existing methods for singing voice regeneration with altered lyrics either lack controllability or require laborious manual alignment, creating a need for more flexible and automated solutions.

Method: A fully diffusion-based model that takes three inputs: optional timbre reference, melody-providing singing clip, and modified lyrics. Uses curriculum learning and Group Relative Policy Optimization for training.

Result: Outperforms Vevo2 baseline in melody preservation and lyric adherence. Introduces LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation.

Conclusion: YingMusic-Singer provides effective melody-controllable singing voice synthesis with flexible lyric manipulation without manual alignment, advancing the field of controllable audio generation.

Abstract: Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model enabling melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs: an optional timbre reference, a melody-providing singing clip, and modified lyrics, without manual alignment. Trained with curriculum learning and Group Relative Policy Optimization, YingMusic-Singer achieves stronger melody preservation and lyric adherence than Vevo2, the most comparable baseline supporting melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation. The code, weights, benchmark, and demos are publicly available at https://github.com/ASLP-lab/YingMusic-Singer.

[584] Adaptive Federated Fine-Tuning of Self-Supervised Speech Representations

Xin Guo, Chunrui Zhao, Hong Jia, Ting Dang, Gongping Huang, Xianrui Zheng, Yan Gao

Main category: eess.AS

TL;DR: Adaptive federated fine-tuning framework with early exits for speech SSL models addresses client heterogeneity in computational capacity and task requirements through lightweight prediction heads and partial aggregation.

Details

Motivation: Federated learning combined with self-supervised learning enables privacy-preserving fine-tuning for speech tasks, but faces challenges with client heterogeneity in computational capacity (straggler effects) and diverse downstream tasks requiring different representation depths, making full-model updates inefficient.

Method: Proposes adaptive federated fine-tuning with early exits: lightweight prediction heads inserted at intermediate layers of SSL backbone allow clients to terminate computation based on local constraints; layer-wise, depth-aware partial aggregation strategy better utilizes representations from different network depths.

Result: Framework reduces edge overhead, supports heterogeneous hardware, and maintains competitive performance in resource-constrained federated environments.

Conclusion: The adaptive framework effectively addresses heterogeneity challenges in federated SSL fine-tuning for speech tasks through early exits and partial aggregation, enabling efficient deployment in resource-constrained environments.

Abstract: Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fine-tuning, while diverse downstream tasks require different representation depths, making full-model updates inefficient. To address these challenges, we propose an adaptive federated fine-tuning framework with early exits. Lightweight prediction heads are inserted at intermediate layers of the SSL backbone, allowing clients to terminate computation based on local constraints and task requirements. We further introduce a layer-wise, depth-aware partial aggregation strategy to better utilize representations from different network depths. Experiments show that the framework reduces edge overhead, supports heterogeneous hardware, and maintains competitive performance in resource-constrained federated environments.

eess.IV

[585] Sentinel-2 for Crop Yield Estimation: A Systematic Review

Mohammadreza Narimani, Alireza Pourreza, Ali Moghimi, Parastoo Farajpoor

Main category: eess.IV

TL;DR: Review of Sentinel-2 satellite-based crop yield estimation methods including empirical ML/DL models, crop growth model integration, and multi-sensor data fusion for precision agriculture.

Details

Motivation: Accurate crop yield estimation is critical for global food security, agricultural policy, and farm management. Sentinel-2 satellite data enables field-level analysis but requires systematic review of current approaches and challenges.

Method: Review synthesizes three main approaches: (1) empirical models using vegetation indices with ML/DL (Random Forest, CNNs), (2) integration of process-based crop growth models (WOFOST, SAFY) via data assimilation of Sentinel-2-derived variables like LAI, and (3) data fusion combining Sentinel-2 optical with Sentinel-1 SAR to mitigate cloud limitations.

Result: ML, DL, and hybrid modeling frameworks can explain substantial within-field yield variability across crops and regions. However, performance is constrained by limited ground-truth data, cloud-induced gaps, and model transferability challenges across years/locations.

Conclusion: Future directions include tighter integration of multi-modal data and improved in-season observations to support robust, operational decision-making in precision agriculture and sustainable intensification.

Abstract: Accurate and timely crop yield estimation is critical for global food security, agricultural policy, and farm management. The Copernicus Sentinel-2 satellite constellation, with high spatial, temporal, and spectral resolution, has transformed agricultural monitoring by enabling field- and sub-field-scale analysis. This review synthesizes recent advances in Sentinel-2-based crop yield estimation. A key trend is the shift from regional models to high-resolution field-level assessments driven by three main approaches: (i) empirical models using vegetation indices combined with machine and deep learning methods such as Random Forest and Convolutional Neural Networks; (ii) integration of process-based crop growth models (e.g., WOFOST, SAFY) via data assimilation of Sentinel-2-derived variables like Leaf Area Index (LAI); and (iii) data fusion techniques combining Sentinel-2 optical data with Sentinel-1 SAR to mitigate cloud-related limitations. The review shows that machine learning, deep learning, and hybrid modeling frameworks can explain substantial within-field yield variability across crops and regions. However, performance remains constrained by limited ground-truth data, cloud-induced gaps, and challenges in model transferability across years and locations. Future directions include tighter integration of multi-modal data and improved in-season observations to support robust, operational decision-making in precision agriculture and sustainable intensification.

[586] Joint Source-Channel-Check Coding with HARQ for Reliable Semantic Communications

Boyuan Li, Shuoyao Wang, Suzhi Bi, Liping Qian, Yunlong Cai

Main category: eess.IV

TL;DR: S3CHARQ: A joint source-channel-check coding framework with hybrid ARQ for semantic communications that integrates check codewords into JSCC to simultaneously support semantic fidelity verification and reconstruction enhancement.

Details

Motivation: Existing semantic communication reliability approaches rely on retransmission strategies with separate check codewords for retransmission triggering, causing substantial communication overhead. There's a need to fundamentally rethink the role of check codewords to improve efficiency.

Method: Proposes S3CHARQ framework with JS3C (Joint Source-Channel-Check) coding. At transmitter: semantic fidelity-aware check encoder embeds auxiliary reconstruction info into check codeword. At receiver: JSCC and check codewords jointly decoded by JS3C decoder, with check codeword also used for perceptual quality estimation. Uses reinforcement learning-based retransmission decision module for adaptive, sample-level decisions.

Result: Achieves 2.36 dB improvement in 97th percentile PSNR and 37.45% reduction in outage probability compared to existing HARQ-based semantic communication systems.

Conclusion: S3CHARQ effectively integrates check codewords into JSCC process, enabling simultaneous semantic fidelity verification and reconstruction enhancement while reducing communication overhead through intelligent retransmission decisions.

Abstract: Semantic communication has emerged as a promising paradigm for improving transmission efficiency and task-level reliability, yet most existing reliability-enhancement approaches rely on retransmission strategies driven by semantic fidelity checking that require additional check codewords solely for retransmission triggering, thereby incurring substantial communication overhead. In this paper, we propose S3CHARQ, a Joint Source-Channel-Check Coding framework with hybrid automatic repeat request that fundamentally rethinks the role of check codewords in semantic communications. By integrating the check codeword into the JSCC process, S3CHARQ enables JS3C, allowing the check codeword to simultaneously support semantic fidelity verification and reconstruction enhancement. At the transmitter, a semantic fidelity-aware check encoder embeds auxiliary reconstruction information into the check codeword. At the receiver, the JSCC and check codewords are jointly decoded by a JS3C decoder, while the check codeword is additionally exploited for perceptual quality estimation. Moreover, because retransmission decisions are necessarily based on imperfect semantic quality estimation in the absence of ground-truth reconstruction, estimation errors are unavoidable and fundamentally limit the effectiveness of rule-based decision schemes. To overcome this limitation, we develop a reinforcement learning-based retransmission decision module that enables adaptive, sample-level retransmission decisions, effectively balancing recovery and refinement information under dynamic channel conditions. Experimental results demonstrate that compared with existing HARQ-based semantic communication systems, the proposed S3CHARQ framework achieves a 2.36 dB improvement in the 97th percentile PSNR, as well as a 37.45% reduction in outage probability.

Tian Guo, Hui Yuan, Chang Sun, Wei Zhang, Raouf Hamzaoui, Sam Kwong

Main category: eess.IV

TL;DR: First blind quality enhancement model for compressed dynamic point clouds that handles unknown distortion levels using temporal dependencies and adaptive feature fusion.

Details

Motivation: Existing point cloud compression enhancement methods require prior knowledge of distortion levels and train multiple models, limiting practical applicability when distortion level is unknown and computational resources are limited.

Method: Proposes BQE with joint progressive feature extraction branch (recoloring-based motion compensation, temporal correlation-guided cross-attention, progressive feature extraction) and adaptive feature fusion branch (quality estimation module for weighting distribution).

Result: Achieved average PSNR improvements of 0.535 dB, 0.403 dB, and 0.453 dB with BD-rates of -17.4%, -20.5%, and -20.1% for Luma, Cb, and Cr components on G-PCC reference software.

Conclusion: BQE is the first blind quality enhancement model for compressed dynamic point clouds that effectively handles unknown distortion levels and outperforms existing approaches.

Abstract: Point cloud compression often introduces noticeable reconstruction artifacts, which makes quality enhancement necessary. Existing approaches typically assume prior knowledge of the distortion level and train multiple models with identical architectures, each designed for a specific distortion setting. This significantly limits their practical applicability in scenarios where the distortion level is unknown and computational resources are limited. To overcome these limitations, we propose the first blind quality enhancement (BQE) model for compressed dynamic point clouds. BQE enhances compressed point clouds under unknown distortion levels by exploiting temporal dependencies and jointly modeling feature similarity and differences across multiple distortion levels. It consists of a joint progressive feature extraction branch and an adaptive feature fusion branch. In the joint progressive feature extraction branch, consecutive reconstructed frames are first fed into a recoloring-based motion compensation module to generate temporally aligned virtual reference frames. These frames are then fused by a temporal correlation-guided cross-attention module and processed by a progressive feature extraction module to obtain hierarchical features at different distortion levels. In the adaptive feature fusion branch, the current reconstructed frame is input to a quality estimation module to predict a weighting distribution that guides the adaptive weighted fusion of these hierarchical features. When applied to the latest geometry-based point cloud compression (G-PCC) reference software, i.e., test model category13 version 28, BQE achieved average PSNR improvements of 0.535 dB, 0.403 dB, and 0.453 dB, with BD-rates of -17.4%, -20.5%, and -20.1% for the Luma, Cb, and Cr components, respectively.

Iris Dumeur, Jérémy Anger, Gabriele Facciolo

Main category: eess.IV

TL;DR: Dual-form attention mechanisms for efficient multi-modal satellite image time series analysis, enabling recurrent inference for incremental processing while maintaining performance comparable to standard Transformers.

Details

Motivation: Multi-modal Satellite Image Time Series (SITS) analysis faces computational challenges for live land monitoring applications. Standard Transformers have quadratic complexity and require reprocessing entire sequences for each new acquisition, limiting deployment for regular large-area monitoring.

Method: Study various dual-form attention mechanisms (linear attention and retention mechanisms) within a multi-modal spectro-temporal encoder. Develop temporal adaptations that compute token distances based on actual acquisition dates rather than sequence indices to handle SITS-specific challenges of temporal irregularity and unalignment.

Result: Dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. Multimodal framework consistently outperforms mono-modal approaches across both tasks (multi-modal SITS forecasting and solar panel construction monitoring).

Conclusion: Dual-form attention mechanisms provide efficient solutions for operational land monitoring systems requiring regular updates over large geographic areas, opening new opportunities for practical deployment.

Abstract: Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies various dual-form attention mechanisms for efficient multi-modal SITS analysis, that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address SITS-specific challenges of temporal irregularity and unalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multimodal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual mechanisms for sensor fusion. The results presented in this work open new opportunities for operational land monitoring systems requiring regular updates over large geographic areas.

[589] Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic

Wanying Qu, Jianxiong Gao, Wei Wang, Yanwei Fu

Main category: eess.IV

TL;DR: EEG-conditioned framework reconstructs high-resolution fMRI sequences from EEG data with temporal coherence and spatial fidelity, using null-space intermediate-frame reconstruction for handling irregular sampling.

Details

Motivation: fMRI provides high-resolution cortical representations but is expensive and limited in scale, while EEG offers complementary millisecond-level temporal cues. The goal is to leverage EEG to reconstruct dynamic fMRI sequences for more accessible and comprehensive brain activity modeling.

Method: Proposes an EEG-conditioned framework that reconstructs dynamic fMRI as continuous neural sequences at cortical-vertex level. Incorporates null-space intermediate-frame reconstruction to handle sampling irregularities in real fMRI acquisitions, enabling measurement-consistent completion of arbitrary intermediate frames.

Result: Experiments on CineBrain dataset show superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. Reconstructed fMRI preserves essential functional information and supports downstream visual decoding tasks.

Conclusion: Provides a new pathway for estimating high-resolution fMRI dynamics from EEG, advancing multimodal neuroimaging toward more dynamic brain activity modeling with practical applicability for handling irregular sampling in real acquisitions.

Abstract: Capturing dynamic spatiotemporal neural activity is essential for understanding large-scale brain mechanisms. Functional magnetic resonance imaging (fMRI) provides high-resolution cortical representations that form a strong basis for characterizing fine-grained brain activity patterns. The high acquisition cost of fMRI limits large-scale applications, therefore making high-quality fMRI reconstruction a crucial task. Electroencephalography (EEG) offers millisecond-level temporal cues that complement fMRI. Leveraging this complementarity, we present an EEG-conditioned framework for reconstructing dynamic fMRI as continuous neural sequences with high spatial fidelity and strong temporal coherence at the cortical-vertex level. To address sampling irregularities common in real fMRI acquisitions, we incorporate a null-space intermediate-frame reconstruction, enabling measurement-consistent completion of arbitrary intermediate frames and improving sequence continuity and practical applicability. Experiments on the CineBrain dataset demonstrate superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. The reconstructed fMRI also preserves essential functional information, supporting downstream visual decoding tasks. This work provides a new pathway for estimating high-resolution fMRI dynamics from EEG and advances multimodal neuroimaging toward more dynamic brain activity modeling.

[590] MedAugment: Universal Automatic Data Augmentation Plug-in for Medical Image Analysis

Zhaoshan Liu, Qiujie Lv, Yifan Li, Ziduo Yang, Lei Shen

Main category: eess.IV

TL;DR: MedAugment: A specialized automatic data augmentation method for medical images that excludes operations that could damage medical features, using pixel/spatial augmentation spaces with controlled sampling and hyperparameter mapping.

Details

Motivation: Data augmentation in medical imaging faces challenges as conventional methods may break medical details/features, have experience-driven design, and incur high computational costs. Need for suitable, general automatic DA for medical images.

Method: Proposes MedAugment with: 1) pixel and spatial augmentation spaces excluding operations that damage medical details, 2) sampling strategy limiting operations from both spaces, 3) hyperparameter mapping to control augmentation level with single parameter.

Result: Extensive experiments on 4 classification and 4 segmentation datasets show superiority over existing approaches. Prevents color distortions/structural alterations while having negligible computational overhead. Serves as plugin without extra training.

Conclusion: MedAugment provides suitable, general automatic DA for medical images, addressing differences between natural and medical images. Benefits community and medical experts lacking DL foundation.

Abstract: Data augmentation (DA) has been widely leveraged in computer vision to alleviate data shortage, while its application in medical imaging faces multiple challenges. The prevalent DA approaches in medical image analysis encompass conventional DA, synthetic DA, and automatic DA. However, these approaches may result in experience-driven design and intensive computation costs. Here, we propose a suitable yet general automatic DA method for medical images termed MedAugment. We propose pixel and spatial augmentation spaces and exclude the operations that can break medical details and features. Besides, we propose a sampling strategy by sampling a limited number of operations from the two spaces. Moreover, we present a hyperparameter mapping relationship to produce a rational augmentation level and make the MedAugment fully controllable using a single hyperparameter. These configurations settle the differences between natural and medical images. Extensive experimental results on four classification and four segmentation datasets demonstrate the superiority of MedAugment. Compared with existing approaches, the proposed MedAugment prevents producing color distortions or structural alterations while involving negligible computational overhead. Our method can serve as a plugin without an extra training stage, offering significant benefits to the community and medical experts lacking a deep learning foundation. The code is available at https://github.com/NUS-Tim/MedAugment.

[591] A Generalizable Deep Learning System for Cardiac MRI

Rohan Shad, Cyril Zakka, Dhamanpreet Kaur, Mrudang Mathur, Robyn Fong, Joseph Cho, Ross Warren Filice, John Mongan, Kimberly Kalianos, Nishith Khandwala, David Eng, Matthew Leipzig, Walter R. Witschey, Alejandro de Feria, Victor A. Ferrari, Euan A. Ashley, Michael A. Acker, Curtis Langlotz, William Hiesinger

Main category: eess.IV

TL;DR: A foundational vision system for cardiac MRI using self-supervised contrastive learning trained on cine-sequence MRI scans and accompanying radiology reports, achieving clinical-grade diagnostic accuracy for various cardiovascular conditions.

Details

Motivation: To develop a comprehensive AI system for cardiac MRI that can represent the breadth of human cardiovascular disease and health, addressing the complexity of cardiac imaging and diagnosis.

Method: Deep-learning model trained via self-supervised contrastive learning on cine-sequence cardiac MRI scans, learning visual concepts from raw text of accompanying radiology reports. Trained and evaluated on data from four large US academic institutions, with additional testing on UK BioBank and two public datasets.

Result: Demonstrates remarkable performance across tasks including left-ventricular ejection fraction regression and diagnosis of 39 different cardiac conditions (e.g., cardiac amyloidosis, hypertrophic cardiomyopathy). Achieves clinical-grade diagnostic accuracy with less training data than typically required.

Conclusion: The system successfully contextualizes the complexity of human cardiovascular disease and can be directed toward clinical problems, offering impressive diagnostic capabilities with efficient training.

Abstract: Cardiac MRI allows for a comprehensive assessment of myocardial structure, function and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep-learning model is trained via self-supervised contrastive learning, in which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK BioBank and two additional publicly available external datasets. We explore emergent capabilities of our system and demonstrate remarkable performance across a range of tasks, including the problem of left-ventricular ejection fraction regression and the diagnosis of 39 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep-learning system is capable of not only contextualizing the staggering complexity of human cardiovascular disease but can be directed towards clinical problems of interest, yielding impressive, clinical-grade diagnostic accuracy with a fraction of the training data typically required for such tasks.

[592] Goal-Oriented Framework for Optical Flow-based Multi-User Multi-Task Video Transmission

Yujie Xu, Shutong Chen, Nan Li, Yansha Deng, Jinhong Yuan, Robert Schober

Main category: eess.IV

TL;DR: Proposes OF-GSC, a goal-oriented semantic communication framework for optical flow-based multi-user multi-task video transmission with transformer-based decoding and DDPG bandwidth allocation.

Details

Motivation: To reduce transmission burden and save communication resources for efficient multi-user multi-task video transmission in wireless systems by focusing on semantic content rather than raw data.

Method: Uses semantic encoder with motion extractor and patch-level optical flow-based semantic representation extractor at transmitter, transformer-based semantic decoder at receiver, and DDPG-based bandwidth allocation algorithm for multi-user transmission.

Result: Achieves 13.47% SSIM improvement for video reconstruction vs DeepJSCC, slightly surpasses VideoMAE Top-1 accuracy with only 25% data for classification, and reduces max transmission time by 25.97% vs equal-bandwidth allocation.

Conclusion: OF-GSC framework effectively improves video transmission efficiency through semantic communication, achieving better reconstruction quality, classification accuracy, and bandwidth utilization.

Abstract: Efficient multi-user multi-task video transmission is an important research topic within the realm of current wireless communication systems. To reduce the transmission burden and save communication resources, we propose a goal-oriented semantic communication framework for optical flow-based multi-user multi-task video transmission (OF-GSC). At the transmitter, we design a semantic encoder that consists of a motion extractor and a patch-level optical flow-based semantic representation extractor to effectively identify and select important semantic representations. At the receiver, we design a transformer-based semantic decoder for high-quality video reconstruction and video classification tasks. To minimize the communication time, we develop a deep deterministic policy gradient (DDPG)-based bandwidth allocation algorithm for multi-user transmission. For video reconstruction tasks, our OF-GSC framework achieves a significant improvement in the received video quality, as evidenced by a 13.47% increase in the structural similarity index measure (SSIM) score in comparison to DeepJSCC. For video classification tasks, OF-GSC achieves a Top-1 accuracy slightly surpassing the performance of VideoMAE with only 25% required data under the same mask ratio of 0.3. For bandwidth allocation optimization, our DDPG-based algorithm reduces the maximum transmission time by 25.97% compared with the baseline equal-bandwidth allocation scheme.

Editor’s Picks

[1] ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding

[2] OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

[3] See, Remember, Explore: A Benchmark and Baselines for Streaming Spatial Reasoning

Today’s Research Highlights

Table of Contents

cs.CL

[1] Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking

[2] Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

[3] Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

[4] Internal Safety Collapse in Frontier Large Language Models

[5] Visuospatial Perspective Taking in Multimodal Language Models

[6] DISCO: Document Intelligence Suite for COmparative Evaluation

[7] S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering

[8] Berta: an open-source, modular tool for AI-enabled clinical documentation

[9] Plato’s Cave: A Human-Centered Research Verification System

[10] DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

[11] Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data

[12] A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English

[13] MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

[14] The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

[15] Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents

[16] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

[17] From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians’ Medical Expertise with Lightweight LLM

[18] Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages

[19] Qworld: Question-Specific Evaluation Criteria for LLMs

[20] Do 3D Large Language Models Really Understand 3D Spatial Relationships?

[21] Navigating the Concept Space of Language Models

[22] Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

[23] Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial

[24] Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression

[25] The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression

[26] Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

[27] Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language

[28] Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

[29] Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

[30] Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs

[31] MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

[32] Not All Pretraining are Created Equal: Threshold Tuning and Class Weighting for Imbalanced Polarization Tasks in Low-Resource Settings

[33] Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths

[34] Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

[35] Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages

[36] Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges

[37] PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation

[38] The Diminishing Returns of Early-Exit Decoding in Modern LLMs

[39] IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

[40] Infrequent Child-Directed Speech Is Bursty and May Draw Infant Vocalizations

[41] Perturbation: A simple and efficient adversarial tracer for representation learning in language models

[42] PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

[43] Language Model Planners do not Scale, but do Formalizers?

[44] BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

[45] Self-Distillation for Multi-Token Prediction

[46] Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development

[47] OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models

[48] Argument Mining as a Text-to-Text Generation Task

[49] From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

[50] Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

[51] CoCR-RAG: Enhancing Retrieval-Augmented Generation in Web Q&A via Concept-oriented Context Reconstruction

[52] Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

[53] Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning

[54] CVPD at QIAS 2026: RAG-Guided LLM Reasoning for Al-Mawarith Share Computation and Heir Allocation

[55] Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale

[56] From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

[57] FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval

[58] ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing

[59] LLMpedia: A Transparent Framework to Materialize an LLM’s Encyclopedic Knowledge at Scale

[60] Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

[61] MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare

[62] A visual observation on the geometry of UMAP projections of the difference vectors of antonym and synonym word pair embeddings

[63] Variation is the Norm: Embracing Sociolinguistics in NLP

[64] Stance Labels Fail When They Matter Most: The Projection Problem in Stance Detection

[65] Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

[66] Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution

[67] Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning

[68] Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

[69] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

[70] Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning

[71] Towards Reward Modeling for AI Tutors in Math Mistake Remediation

[72] When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools

[73] PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation