Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] Gradient-Informed Training for Low-Resource Multilingual Speech Translation

Ruiyan Sun, Satoshi Nakamura

Main category: cs.CL

TL;DR: Proposes gradient-based method to automatically determine optimal layer-sharing patterns in multilingual speech-to-text translation to address representation conflicts and improve translation quality.

Details

Motivation: Uniform architectural sharing across languages in low-resource multilingual speech-to-text translation introduces representation conflicts that impede convergence, requiring more intelligent sharing strategies.

Method: Uses training gradient information to automatically determine layer-specific sharing patterns through three strategies: distance-based language clustering, self/cross-task divergence metrics for capacity allocation, and joint factorization with canonical correlation analysis for subspace alignment.

Result: Extensive evaluation across four language pairs using SeamlessM4T-Medium architecture demonstrates persistent improvements in translation quality metrics.

Conclusion: Gradient-based analysis provides a principled methodology for determining optimal sharing patterns in multilingual speech models, addressing representation conflicts and improving performance in low-resource settings.

Abstract: In low-resource multilingual speech-to-text translation, uniform architectural sharing across languages frequently introduces representation conflicts that impede convergence. This work proposes a principled methodology to automatically determine layer-specific sharing patterns by mining training gradient information. Our approach employs three distinct analysis strategies: distance-based language clustering, self/cross-task divergence metrics for capacity allocation, and joint factorization coupled with canonical correlation analysis for subspace alignment. Extensive evaluation across four language pairs (using the SeamlessM4T-Medium architecture) demonstrates persistent improvements in translation quality metrics.

Relevance: 9/10

[2] Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods

Xuanru Zhou, Yiwen Shao, Wei-Cheng Tseng, Dong Yu

Main category: cs.SD

TL;DR: Audio pre-training needs better data quality and coverage; new pipeline creates high-quality captions and unified tag system for speech, music, and environmental sounds; study shows data quality matters more than pre-training objectives.

Details

Motivation: Current audio pre-training is fragmented and limited by weak, noisy, and scale-limited labels. The field needs to establish its own large-scale, strong supervision framework similar to vision's foundational pre-training blueprint.

Method: Introduces a data-centric pipeline using a high-fidelity captioner to create SOTA-quality captions and the first Unified Tag System (UTS) that bridges speech, music, and environmental sounds. Conducts systematic comparative study of different pre-training objectives on strong source data.

Result: Experiments suggest that data quality and coverage are the primary drivers of performance, while the choice of objective dictates downstream task specialization.

Conclusion: Audio pre-training should focus on establishing strong supervision frameworks with high-quality data rather than just optimizing objectives. The unified approach across speech, music, and environmental sounds enables better audio understanding.

Abstract: Current audio pre-training seeks to learn unified representations for broad audio understanding tasks, but it remains fragmented and is fundamentally bottlenecked by its reliance on weak, noisy, and scale-limited labels. Drawing lessons from vision’s foundational pre-training blueprint, we argue that the audio field must first establish its own large-scale, strong supervision framework. We introduce a new data-centric pipeline that leverages a high-fidelity captioner to create SOTA-quality captions and the first Unified Tag System (UTS) that bridges speech, music, and environmental sounds. We then conduct a systematic comparative study of different pre-training objectives on these strong source data. Our experiments suggest that data quality and coverage are the primary drivers of performance, while the choice of objective dictates downstream task specialization.

Relevance: 9/10

[3] Cinematic Audio Source Separation Using Visual Cues

Kang Zhang, Suyeon Lee, Arda Senocak, Joon Son Chung

Main category: cs.MM

TL;DR: First audio-visual framework for cinematic audio source separation using conditional flow matching with dual-stream visual encoding, trained on synthetic data and generalizing to real films.

Details

Motivation: Existing CASS approaches are audio-only, ignoring the audio-visual nature of films where sounds align with visual cues. There's a need to leverage visual context to enhance separation quality for applications like dubbing and remastering.

Method: Formulates CASS as conditional generative modeling using conditional flow matching. Introduces training data synthesis pipeline pairing in-the-wild audio/video streams (facial videos for speech, scene videos for effects). Designs dedicated dual-stream visual encoder for this setup.

Result: Model trained entirely on synthetic data generalizes effectively to real-world cinematic content. Achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks.

Conclusion: First successful audio-visual CASS framework demonstrates the value of visual context for cinematic audio separation, with synthetic training enabling generalization to real films.

Abstract: Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks. Code and demo are available at \url{https://cass-flowmatching.github.io}.

Relevance: 9/10

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 87]
cs.CV [Total: 225]
cs.AI [Total: 54]
cs.SD [Total: 10]
cs.LG [Total: 118]
cs.MA [Total: 3]
cs.MM [Total: 2]
eess.AS [Total: 2]
eess.IV [Total: 8]

cs.CL

[1] Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

Ying Liu, Yuntao Shou, Wei Ai, Tao Meng, Keqin Li

Main category: cs.CL

TL;DR: A multimodal emotion recognition model that addresses noise in audio/video features and modality imbalance through differential attention, relation graphs, and text-guided diffusion fusion.

Details

Motivation: Real-world audio/video signals often contain environmental noise and have quality imbalances between modalities, leading to information distortion and biased fusion that impairs emotion recognition performance. Most methods ignore noisy modalities and rely on implicit weighting, failing to account for text's predominant role in emotion understanding.

Method: 1) Differential Transformer that computes differences between attention maps to enhance temporally consistent information while suppressing noise in audio/video. 2) Modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies. 3) Text-guided cross-modal diffusion mechanism using self-attention to model intra-modal dependencies and adaptively diffuse audiovisual information into textual stream.

Result: The paper proposes a relation-aware denoising and diffusion attention fusion model for multimodal conversational emotion recognition (MCER) that achieves more robust and semantically aligned multimodal fusion.

Conclusion: The proposed approach effectively addresses noise in audio/video modalities and modality imbalance by explicitly modeling temporal consistency, speaker-dependent relations, and text-guided fusion, leading to improved emotion recognition performance.

Abstract: In real-world scenarios, audio and video signals are often subject to environmental noise and limited acquisition conditions, resulting in extracted features containing excessive noise. Furthermore, there is an imbalance in data quality and information carrying capacity between different modalities. These two issues together lead to information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality in emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for MCER. Specifically, we first design a differential Transformer that explicitly computes the differences between two attention maps, thereby enhancing temporally consistent information while suppressing time-irrelevant noise, which leads to effective denoising in both audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that leverages self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, ensuring more robust and semantically aligned multimodal fusion.

[2] RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu, Leqi Zheng, Yiran Yang, Jianke Zhang, Qingbin Li, Shannan Yan, Zhetong Li, Changguo Jia, Junfei Wu, Zilei Wang, Qiang Liu, Liang Wang

Main category: cs.CL

TL;DR: RealChart2Code benchmark evaluates VLMs on generating complex multi-panel visualizations from real-world data, revealing significant performance gaps compared to simpler benchmarks.

Details

Motivation: While VLMs show impressive code generation capabilities, their ability to replicate complex, multi-panel visualizations from authentic datasets remains largely unassessed. Current benchmarks lack systematic evaluation of chart generation from large-scale raw data and iterative code refinement in conversational settings.

Method: Introduces RealChart2Code benchmark with over 2,800 instances grounded in authentic datasets, featuring tasks with clear analytical intent. Evaluates 14 leading VLMs on chart generation from raw data and assesses iterative code refinement in multi-turn conversational settings.

Result: Evaluation reveals significant performance degradation compared to simpler benchmarks, highlighting VLMs’ struggles with complex plot structures and authentic data. Shows substantial performance gap between proprietary and open-weight models, with even state-of-the-art VLMs often failing to accurately replicate intricate multi-panel charts.

Conclusion: The findings provide valuable insights into current limitations of VLMs for complex visualization generation and guide future research directions. The benchmark addresses a critical gap in evaluating VLMs’ capabilities with real-world data and complex visual tasks.

Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce \textbf{\texttt{RealChart2Code}}, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on \texttt{RealChart2Code} reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at \url{https://github.com/Speakn0w/RealChart2Code}.

[3] Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI

Anna Kozlova, Stanislau Salavei, Pavel Satalkin, Hanna Plotnitskaya, Sergey Parfenyuk

Main category: cs.CL

TL;DR: Doctorina MedBench is an evaluation framework for medical AI agents that simulates realistic physician-patient dialogues to assess clinical competence beyond traditional test questions.

Details

Motivation: Traditional medical benchmarks rely on standardized test questions, which don't capture the complexity of real clinical interactions involving history-taking, analysis of medical materials, differential diagnosis, and personalized recommendations.

Method: The framework models multi-step clinical dialogues where AI systems must collect medical history, analyze attached materials (lab reports, images, documents), formulate differential diagnoses, and provide recommendations. It uses the D.O.T.S. metric (Diagnosis, Observations/Investigations, Treatment, Step Count) to evaluate both clinical correctness and dialogue efficiency.

Result: The dataset contains over 1,000 clinical cases covering more than 750 diagnoses. The framework supports safety-oriented trap cases, category-based random sampling, and full regression testing with multi-level quality monitoring to detect model degradation.

Conclusion: Simulation of clinical dialogue provides more realistic assessment of clinical competence compared to traditional examination-style benchmarks, and the framework can be used to evaluate both AI systems and physicians while supporting clinical reasoning skill development.

Abstract: We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.

[4] Gradient-Informed Training for Low-Resource Multilingual Speech Translation

Ruiyan Sun, Satoshi Nakamura

Main category: cs.CL

Details

Result: Extensive evaluation across four language pairs using SeamlessM4T-Medium architecture demonstrates persistent improvements in translation quality metrics.

[5] Methods for Knowledge Graph Construction from Text Collections: Development and Applications

Vanni Zavarella

Main category: cs.CL

TL;DR: This thesis focuses on using NLP, ML, and Generative AI with Semantic Web techniques to automatically construct Knowledge Graphs from large text corpora across three domains: digital transformation discourse, AECO research, and biomedical causal relations.

Details

Motivation: The dramatic growth of unstructured textual data across various sectors creates both opportunities and challenges for extracting actionable knowledge. There's a need for scalable, flexible methods adaptable across text genres and schema specifications to unlock the full potential of this data through semantically transparent Knowledge Graphs.

Method: The thesis applies Natural Language Processing, Machine Learning, and Generative AI methods powered by Semantic Web best practices to automatically construct Knowledge Graphs from large text corpora. It focuses on three use cases: digital transformation discourse analysis, AECO research mapping, and biomedical causal relation extraction.

Result: The contributions include benchmark evaluation results, customized algorithm designs, creation of Knowledge Graphs as data resources, and data analysis results built on top of these graphs across the three application domains.

Conclusion: The thesis demonstrates how coupling information extraction methods with Semantic Web techniques enables the construction of semantically transparent, explainable, and interoperable Knowledge Graphs from diverse textual data sources, providing valuable resources and insights across multiple domains.

Abstract: Virtually every sector of society is experiencing a dramatic growth in the volume of unstructured textual data that is generated and published, from news and social media online interactions, through open access scholarly communications and observational data in the form of digital health records and online drug reviews. The volume and variety of data across all this range of domains has created both unprecedented opportunities and pressing challenges for extracting actionable knowledge for several application scenarios. However, the extraction of rich semantic knowledge demands the deployment of scalable and flexible automatic methods adaptable across text genres and schema specifications. Moreover, the full potential of these data can only be unlocked by coupling information extraction methods with Semantic Web techniques for the construction of full-fledged Knowledge Graphs, that are semantically transparent, explainable by design and interoperable. In this thesis, we experiment with the application of Natural Language Processing, Machine Learning and Generative AI methods, powered by Semantic Web best practices, to the automatic construction of Knowledge Graphs from large text corpora, in three use case applications: the analysis of the Digital Transformation discourse in the global news and social media platforms; the mapping and trend analysis of recent research in the Architecture, Engineering, Construction and Operations domain from a large corpus of publications; the generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. The contributions of this thesis to the research community are in terms of benchmark evaluation results, the design of customized algorithms and the creation of data resources in the form of Knowledge Graphs, together with data analysis results built on top of them.

[6] Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke

Main category: cs.CL

TL;DR: LLM-based speech recognition with conversational context compression using learned latent tokens instead of raw audio

Details

Motivation: Standard LLM-based ASR systems process utterances in isolation, missing conversational context that could improve recognition, especially for contextual entities. Raw context conditioning is expensive due to growing audio token sequences.

Method: Proposes Abstract Compression: replaces prior-turn audio with fixed number of learned latent tokens while retaining transcripts explicitly. Uses supervised multi-turn training and analyzes compression setup trade-offs.

Result: Compressed model recovers part of the gains of raw-context conditioning with smaller prior-turn audio footprint. Works on both in-domain and out-of-domain test sets, mainly helping with contextual entity recognition.

Conclusion: Abstract Compression provides efficient representation of conversational context for LLM-based ASR, balancing performance gains with computational efficiency.

Abstract: Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.

[7] Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio

Yijiong Yu, Shuai Yuan, Jie Zheng, Huazheng Wang, Ji Pei

Main category: cs.CL

TL;DR: Semi-Dynamic Context Compression framework uses a discrete ratio selector to adapt compression ratios to information density, improving LLM efficiency for long contexts.

Details

Motivation: Existing soft context compression methods use uniform compression ratios, failing to account for natural language's varying information density. While dynamic compression seems intuitive, models struggle with input-dependent continuous structural hyperparameters.

Method: Introduces Semi-Dynamic Context Compression with a Discrete Ratio Selector that predicts compression targets based on intrinsic information density and quantizes them to predefined discrete ratios. Jointly trained with compressor on synthetic data using summary lengths as proxy labels.

Result: Extensive evaluations show the density-aware framework consistently outperforms static baselines, establishing a robust Pareto frontier for context compression techniques.

Conclusion: The semi-dynamic approach effectively addresses the limitations of uniform compression by adapting to information density while avoiding the pitfalls of continuous parameterization.

Abstract: Soft context compression reduces the computational workload of processing long contexts in LLMs by encoding long context into a smaller number of latent tokens. However, existing frameworks apply uniform compression ratios, failing to account for the extreme variance in natural language information density. While adopting a density-aware dynamic compression ratio seems intuitive, empirical investigations reveal that models struggle intrinsically with operations parameterized by input dependent, continuous structural hyperparameters. To resolve this pitfall, we introduce Semi-Dynamic Context Compression framework. Our approach features a Discrete Ratio Selector, which predicts a compression target based on intrinsic information density and quantizes it to a predefined set of discrete compression ratios. It is efficiently jointly trained with the compressor on synthetic data, with the summary lengths as a proxy to create labels for compression ratio prediction. Extensive evaluations confirm that our density-aware framework, utilizing mean pooling as the backbone, consistently outperforms static baselines, establishing a robust Pareto frontier for context compression techniques. Our code, data and model weights are available at https://github.com/yuyijiong/semi-dynamic-context-compress

[8] Can Small Models Reason About Legal Documents? A Comparative Study

Snehit Vaddi

Main category: cs.CL

TL;DR: Smaller language models (sub-10B parameters) can match GPT-4o-mini performance on legal tasks with proper prompting strategies, showing architecture and training quality matter more than parameter count.

Details

Motivation: Deploying large frontier models for legal applications raises concerns about cost, latency, and data privacy, motivating exploration of whether smaller models can serve as practical alternatives.

Method: Evaluated nine sub-10B parameter models across three legal benchmarks (ContractNLI, CaseHOLD, ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, dense RAG) with 405 experiments and three random seeds per configuration.

Result: A Mixture-of-Experts model activating only 3B parameters matched GPT-4o-mini in mean accuracy while surpassing it on legal holding identification; few-shot prompting was most consistently effective; retrieval quality wasn’t the bottleneck for RAG performance.

Conclusion: Smaller models can be practical alternatives to frontier models for legal applications, with architecture and training quality being more important than parameter count, and rigorous evaluation is accessible without dedicated GPU infrastructure.

Abstract: Large language models show promise for legal applications, but deploying frontier models raises concerns about cost, latency, and data privacy. We evaluate whether sub-10B parameter models can serve as practical alternatives by testing nine models across three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). Across 405 experiments with three random seeds per configuration, we find that a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy while surpassing it on legal holding identification, and that architecture and training quality matter more than raw parameter count. Our largest model (9B parameters) performs worst overall. Chain-of-thought prompting proves sharply task-dependent, improving contract entailment but degrading multiple-choice legal reasoning, while few-shot prompting emerges as the most consistently effective strategy. Comparing BM25 and dense retrieval for RAG, we find near-identical results, suggesting the bottleneck lies in the language model’s utilization of retrieved context rather than retrieval quality. All experiments were conducted via cloud inference APIs at a total cost of $62, demonstrating that rigorous LLM evaluation is accessible without dedicated GPU infrastructure.

[9] When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models

Binesh Sadanandan, Vahid Behzadan

Main category: cs.CL

TL;DR: Medical LLMs like MedGemma show surprising sensitivity to prompt formatting - CoT hurts performance, few-shot degrades accuracy, answer shuffling causes prediction changes 59% of the time, and cloze scoring outperforms all prompting strategies.

Details

Motivation: To evaluate the robustness of medical LLMs to prompt formatting variations, since prompt engineering techniques validated on general-purpose models may not transfer to domain-specific medical LLMs, and reliable deployment requires understanding these sensitivities.

Method: Evaluated MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests including: Chain-of-Thought prompting, few-shot examples, answer option shuffling, context truncation (front/back), cloze scoring, and permutation voting.

Result: Several concerning findings: CoT prompting decreases accuracy by 5.7%; few-shot degrades performance by 11.9% while increasing position bias; answer shuffling causes prediction changes 59.1% of the time with accuracy drops up to 27.4 percentage points; front-truncation causes accuracy to plummet below baseline while back-truncation preserves 97% of accuracy; cloze scoring achieves 51.8% (4B) and 64.5% (27B), surpassing all prompting strategies; permutation voting recovers 4 percentage points over single-ordering inference.

Conclusion: Prompt engineering techniques validated on general-purpose models do not transfer to domain-specific medical LLMs, and reliable alternatives like cloze scoring and permutation voting exist. Models “know” more than their generated text shows, highlighting the importance of robust evaluation methods for medical AI deployment.

Abstract: Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasing position bias from 0.14 to 0.47. Shuffling answer options causes the model to change predictions 59.1% of the time, with accuracy dropping up to 27.4 percentage points. Front-truncating context to 50% causes accuracy to plummet below the no-context baseline, yet back-truncation preserves 97% of full-context accuracy. We further show that cloze scoring (selecting the highest log-probability option token) achieves 51.8% (4B) and 64.5% (27B), surpassing all prompting strategies and revealing that models “know” more than their generated text shows. Permutation voting recovers 4 percentage points over single-ordering inference. These results demonstrate that prompt engineering techniques validated on general-purpose models do not transfer to domain-specific medical LLMs, and that reliable alternatives exist.

[10] MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, Philip S. Yu

Main category: cs.CL

TL;DR: MemoryCD is a large-scale benchmark for evaluating LLM memory capabilities using real-world user behavior data from Amazon Reviews across multiple domains over time.

Details

Motivation: Existing memory benchmarks are limited to short-session synthetic dialogues, lacking real-world, cross-domain, lifelong user behavior data needed to properly evaluate LLM memory capabilities for personalization tasks.

Method: Constructed MemoryCD benchmark from Amazon Review dataset tracking authentic user interactions across years and multiple domains. Created evaluation pipeline with 14 SOTA LLMs and 6 memory method baselines on 4 personalization tasks across 12 domains.

Result: Analysis shows existing memory methods are far from user satisfaction across various domains, revealing significant gaps in cross-domain lifelong personalization capabilities.

Conclusion: MemoryCD provides the first testbed for cross-domain lifelong personalization evaluation, highlighting the need for improved memory methods that can handle real-world user behavior patterns.

Abstract: Recent advancements in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce \textsc{MemoryCD}, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, \textsc{MemoryCD} tracks authentic user interactions across years and multiple domains. We construct a multi-faceted long-context memory evaluation pipeline of 14 state-of-the-art LLM base models with 6 memory method baselines on 4 distinct personalization tasks over 12 diverse domains to evaluate an agent’s ability to simulate real user behaviors in both single and cross-domain settings. Our analysis reveals that existing memory methods are far from user satisfaction in various domains, offering the first testbed for cross-domain life-long personalization evaluation.

[11] Toward Culturally Grounded Natural Language Processing

Sina Bagheri Nezhad

Main category: cs.CL

TL;DR: Survey paper analyzing limitations of current multilingual NLP in capturing cultural competence, proposing shift from language-focused benchmarks to modeling “communicative ecologies” with richer contextual and multimodal approaches.

Details

Motivation: The paper addresses the gap between multilingual capability and cultural competence in NLP, noting that current multilingual models often fail to capture cultural nuances, local norms, and community-specific contexts despite strong performance on standard benchmarks.

Method: Synthesizes over 50 papers from 2020-2026 across multiple research areas: multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark design critiques, and community-grounded data practices.

Result: Identifies that training data coverage alone is insufficient - tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all significantly affect outcomes. Shows strong multilingual models can still flatten local norms and misread culturally grounded cues.

Conclusion: Proposes moving from treating languages as isolated benchmark entries toward modeling “communicative ecologies” - the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used. Suggests research agenda for culturally grounded NLP with richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.

Abstract: Recent progress in multilingual NLP is often taken as evidence of broader global inclusivity, but a growing literature shows that multilingual capability and cultural competence come apart. This paper synthesizes over 50 papers from 2020–2026 spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark design critiques, and community-grounded data practices. Across this literature, training data coverage remains a strong determinant of performance, yet it is not sufficient: tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect outcomes. Recent work on Global-MMLU, CDEval, WorldValuesBench, CulturalBench, CULEMO, CulturalVQA, GIMMICK, DRISHTIKON, WorldCuisines, CARE, CLCA, and newer critiques of benchmark design and community-grounded evaluation shows that strong multilingual models can still flatten local norms, misread culturally grounded cues, and underperform in lower-resource or community-specific settings. We argue that the field should move from treating languages as isolated rows in a benchmark spreadsheet toward modeling communicative ecologies: the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used. On that basis, we propose a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.

[12] AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents

Wenbo Gao, Renxi Liu, Xian Wang, Fang Guo, Shuai Yang, Xi Chen, Hui-Ling Zhen, Hanting Chen, Weizhe Lin, Xiaosong Li, Yaoyuan Wang

Main category: cs.CL

TL;DR: AgentCollab: A self-driven collaborative inference framework that dynamically coordinates LLMs of different capabilities during agent execution, using self-reflection signals to escalate to stronger models only when needed, improving accuracy-efficiency trade-offs.

Details

Motivation: There's a fundamental trade-off between execution efficiency and reasoning robustness in LLM-powered autonomous agents. Lower-cost models are fast but struggle with difficult reasoning, while stronger models are robust but computationally expensive. The paper aims to create a framework that leverages complementary advantages of models at different capability levels.

Method: AgentCollab uses the agent’s own self-reflection signal to determine if reasoning is making meaningful progress, then escalates control to stronger reasoning tiers only when necessary. It introduces a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. The framework is instantiated using a two-level small-large model setting.

Result: Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents, demonstrating better trade-offs between performance and computational cost.

Conclusion: AgentCollab provides an effective framework for dynamic model coordination in LLM agents, enabling better utilization of models with different capabilities through self-driven collaborative inference based on self-reflection signals.

Abstract: Autonomous agents powered by large language models (LLMs) perform complex tasks through long-horizon reasoning and tool interaction, where a fundamental trade-off arises between execution efficiency and reasoning robustness. Models at different capability-cost levels offer complementary advantages: lower-cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self-driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent’s own self-reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long-horizon execution, we introduce a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two-level small-large model setting. Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.

[13] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

Main category: cs.CL

TL;DR: SpeechLLMs (speech-integrated LLMs) are benchmarked against cascaded systems for speech translation, finding cascades remain most reliable overall but recent SpeechLLMs can match or outperform them in various settings.

Details

Motivation: To determine whether integrating speech as a native modality in LLMs (creating SpeechLLMs) actually improves speech-to-text translation quality compared to established cascaded architectures that combine speech foundation models with multilingual LLMs.

Method: Created “Hearing to Translate” test suite benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems. Analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions including disfluent, noisy, and long-form speech.

Result: Cascaded systems remain the most reliable solution overall, but most recent SpeechLLMs can match or even outperform cascades in various settings. Speech foundation models alone lag behind both approaches, showing that integrating an LLM (either within the model or in a pipeline) is essential for high-quality speech translation.

Conclusion: While cascaded architectures currently offer the most reliable speech translation, SpeechLLMs show promising performance and can match or beat cascades in specific scenarios, demonstrating the value of integrating LLMs for speech understanding tasks.

Abstract: As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which directly process spoken language and enable speech-to-text translation (ST) and other downstream tasks, bypassing traditional transcription-based pipelines. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but most recent SpeechLLMs can match or even outperform cascades in various settings while SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

[14] Retrieval-Augmented Generation Based Nurse Observation Extraction

Kyomin Hwang, Nojun Kwak

Main category: cs.CL

TL;DR: Automated pipeline using Retrieval-Augmented Generation (RAG) to extract clinical observations from nurse dictations, achieving 0.796 F1-score on medical dataset.

Details

Motivation: To reduce human workload in medical field by automating clinical observation extraction from nurse dictations, alleviating burden on healthcare professionals.

Method: Proposes an automated pipeline based on Retrieval-Augmented Generation (RAG) for extracting clinical observations from nurse dictations.

Result: Achieved effective performance with F1-score of 0.796 on the MEDIQA-SYNUR test dataset.

Conclusion: The RAG-based approach successfully automates clinical observation extraction from nurse dictations, demonstrating practical utility in medical applications.

Abstract: Recent advancements in Large Language Models (LLMs) have played a significant role in reducing human workload across various domains, a trend that is increasingly extending into the medical field. In this paper, we propose an automated pipeline designed to alleviate the burden on nurses by automatically extracting clinical observations from nurse dictations. To ensure accurate extraction, we introduce a method based on Retrieval-Augmented Generation (RAG). Our approach demonstrates effective performance, achieving an F1-score of 0.796 on the MEDIQA-SYNUR test dataset.

[15] I Want to Believe (but the Vocabulary Changed): Measuring the Semantic Structure and Evolution of Conspiracy Theories

Manisha Keim, Sarmad Chandio, Osama Khalid, Rishab Nithyanand

Main category: cs.CL

TL;DR: This paper analyzes the semantic evolution of conspiracy theories in online political discourse using word embeddings on Reddit data from 2012-2022, revealing patterns of stability, expansion, contraction, and replacement.

Details

Motivation: Research on conspiracy theories has focused on belief formation, exposure, and diffusion, but neglected how their meanings change over time. Current approaches treat conspiracy-related terms as stable lexical markers, making it hard to separate genuine semantic changes from surface-level vocabulary changes.

Method: Analyzed 169.9M comments from Reddit’s r/politics subreddit spanning 2012-2022. Used word embeddings to demonstrate conspiracy-related language forms coherent semantic regions. Tracked evolution using aligned word embeddings to compare semantic neighborhoods across time periods.

Result: Conspiracy theories form semantically distinguishable regions in language space. They evolve non-uniformly with patterns of semantic stability, expansion, contraction, and replacement that aren’t captured by keyword-based approaches alone.

Conclusion: Conspiracy theories can be treated as semantic objects that evolve in complex ways over time, requiring methods beyond simple keyword tracking to understand their changing meanings in online discourse.

Abstract: Research on conspiracy theories has largely focused on belief formation, exposure, and diffusion, while paying less attention to how their meanings change over time. This gap persists partly because conspiracy-related terms are often treated as stable lexical markers, making it difficult to separate genuine semantic changes from surface-level vocabulary changes. In this paper, we measure the semantic structure and evolution of conspiracy theories in online political discourse. Using 169.9M comments from Reddit’s r/politics subreddit spanning 2012–2022, we first demonstrate that conspiracy-related language forms coherent and semantically distinguishable regions of language space, allowing conspiracy theories to be treated as semantic objects. We then track how these objects evolve over time using aligned word embeddings, enabling comparisons of semantic neighborhoods across periods. Our analysis reveals that conspiracy theories evolve non-uniformly, exhibiting patterns of semantic stability, expansion, contraction, and replacement that are not captured by keyword-based approaches alone.

[16] IndoBERT-Relevancy: A Context-Conditioned Relevancy Classifier for Indonesian Text

Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja

Main category: cs.CL

TL;DR: IndoBERT-Relevancy: A context-conditioned relevancy classifier for Bahasa Indonesia built on IndoBERT Large, trained on 31,360 labeled pairs across 188 topics, achieving 96.5% accuracy.

Details

Motivation: Relevancy classification for Bahasa Indonesia remains largely unexplored, requiring models to reason about relationships between topical context and candidate text simultaneously, unlike simpler tasks like sentiment analysis.

Method: Built on IndoBERT Large (335M parameters) with iterative, failure-driven data construction process using multiple data sources and targeted synthetic data to address specific model weaknesses.

Result: Achieves F1 score of 0.948 and accuracy of 96.5%, handling both formal and informal Indonesian text, with model publicly available on HuggingFace.

Conclusion: No single data source is sufficient for robust relevancy classification; targeted synthetic data effectively addresses model weaknesses, resulting in high-performance relevancy classifier for Bahasa Indonesia.

Abstract: Determining whether a piece of text is relevant to a given topic is a fundamental task in natural language processing, yet it remains largely unexplored for Bahasa Indonesia. Unlike sentiment analysis or named entity recognition, relevancy classification requires the model to reason about the relationship between two inputs simultaneously: a topical context and a candidate text. We introduce IndoBERT-Relevancy, a context-conditioned relevancy classifier built on IndoBERT Large (335M parameters) and trained on a novel dataset of 31,360 labeled pairs spanning 188 topics. Through an iterative, failure-driven data construction process, we demonstrate that no single data source is sufficient for robust relevancy classification, and that targeted synthetic data can effectively address specific model weaknesses. Our final model achieves an F1 score of 0.948 and an accuracy of 96.5%, handling both formal and informal Indonesian text. The model is publicly available at HuggingFace.

[17] LLM Benchmark-User Need Misalignment for Climate Change

Oucheng Liu, Lexing Xie, Jing Jiang

Main category: cs.CL

TL;DR: Proposes a framework to analyze climate knowledge behaviors in LLMs, revealing mismatch between benchmarks and real-world user needs.

Details

Motivation: As LLMs become interfaces for climate knowledge access, existing benchmarks may not reflect real-world user needs, requiring better evaluation methods.

Method: Developed Proactive Knowledge Behaviors Framework and Topic-Intent-Form taxonomy to analyze climate-related data representing different knowledge behaviors.

Result: Found substantial mismatch between current benchmarks and real-world user needs; knowledge interaction patterns between humans and LLMs resemble human-human interactions.

Conclusion: Provides actionable guidance for benchmark design, RAG system development, and LLM training to better align with real-world climate knowledge needs.

Abstract: Climate change is a major socio-scientific issue shapes public decision-making and policy discussions. As large language models (LLMs) increasingly serve as an interface for accessing climate knowledge, whether existing benchmarks reflect user needs is critical for evaluating LLM in real-world settings. We propose a Proactive Knowledge Behaviors Framework that captures the different human-human and human-AI knowledge seeking and provision behaviors. We further develop a Topic-Intent-Form taxonomy and apply it to analyze climate-related data representing different knowledge behaviors. Our results reveal a substantial mismatch between current benchmarks and real-world user needs, while knowledge interaction patterns between humans and LLMs closely resemble those in human-human interactions. These findings provide actionable guidance for benchmark design, RAG system development, and LLM training. Code is available at https://github.com/OuchengLiu/LLM-Misalign-Climate-Change.

[18] Clash of the models: Comparing performance of BERT-based variants for generic news frame detection

Vihang Jumle

Main category: cs.CL

TL;DR: Comparative analysis of five BERT-based transformer models for generic news frame detection in political communication, with a focus on Swiss electoral context.

Details

Motivation: To address the ongoing debate about which transformer models perform best for deductive frame detection in political communication, and to expand beyond US-centric data by providing a Swiss electoral context dataset.

Method: Comparative performance evaluation of five BERT-based variants (BERT, RoBERTa, DeBERTa, DistilBERT, ALBERT) for generic news frame detection, using a newly created labelled dataset from Swiss electoral context.

Result: The study provides performance comparisons of different BERT variants for frame detection and introduces fine-tuned models capable of robust generic news frame detection.

Conclusion: Contributes to best practices in computational text analysis for political communication by comparing transformer models and providing a contextual dataset for testing robustness beyond US-centric approaches.

Abstract: Framing continues to remain one of the most extensively applied theories in political communication. Developments in computation, particularly with the introduction of transformer architecture and more so with large language models (LLMs), have naturally prompted scholars to explore various novel computational approaches, especially for deductive frame detection, in recent years. While many studies have shown that different transformer models outperform their preceding models that use bag-of-words features, the debate continues to evolve regarding how these models compare with each other on classification tasks. By placing itself at this juncture, this study makes three key contributions: First, it comparatively performs generic news frame detection and compares the performance of five BERT-based variants (BERT, RoBERTa, DeBERTa, DistilBERT and ALBERT) to add to the debate on best practices around employing computational text analysis for political communication studies. Second, it introduces various fine-tuned models capable of robustly performing generic news frame detection. Third, building upon numerous previous studies that work with US-centric data, this study provides the scholarly community with a labelled generic news frames dataset based on the Swiss electoral context that aids in testing the contextual robustness of these computational approaches to framing analysis.

[19] ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li

Main category: cs.CL

TL;DR: ClinicalAgents is a multi-agent framework using Monte Carlo Tree Search and dual-memory architecture to simulate clinical reasoning for improved diagnostic accuracy and explainability.

Details

Motivation: LLMs struggle with complex, non-linear clinical reasoning required for accurate diagnosis. Existing methods use static linear mappings that fail to capture iterative, hypothesis-driven reasoning of human clinicians.

Method: Multi-agent framework with dynamic orchestration via Monte Carlo Tree Search (MCTS), dual-memory architecture (mutable Working Memory for patient state, static Experience Memory for clinical guidelines/historical cases), and active feedback loops.

Result: Achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.

Conclusion: ClinicalAgents effectively bridges the gap between LLM capabilities and complex clinical reasoning by simulating expert clinician workflows through multi-agent collaboration and dynamic reasoning.

Abstract: While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent to human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an Orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. Central to this framework is a Dual-Memory architecture: a mutable Working Memory that maintains the evolving patient state for context-aware reasoning, and a static Experience Memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.

[20] Sparse Auto-Encoders and Holism about Large Language Models

Jumbly Grindrod

Main category: cs.CL

TL;DR: LLMs suggest semantic holism through distributional semantics, but mechanistic interpretability reveals interpretable features that challenge this view, though holism may still hold if features are countable.

Details

Motivation: To examine whether LLM technology supports a meta-semantic picture of how linguistic expressions acquire meaning, specifically addressing the tension between distributional semantics (holism) and mechanistic interpretability findings (decompositional features).

Method: Philosophical analysis comparing: (1) arguments for LLM semantic holism based on distributional semantics, (2) recent mechanistic interpretability work on sparse auto-encoders revealing interpretable latent features, (3) detailed examination of feature nature, and (4) defense of holism conditional on feature countability.

Result: The paper argues that while mechanistic interpretability reveals decompositional features challenging holistic interpretations, semantic holism in LLMs can still be defended if the discovered features are countable, maintaining the holistic picture proposed by Grindrod et al.

Conclusion: LLM technology suggests a meta-semantic picture where semantic holism remains plausible despite mechanistic interpretability findings, provided the interpretable latent features are countable rather than holistic themselves.

Abstract: Does Large Language Model (LLM) technology suggest a meta-semantic picture i.e. a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of holism (section 1), before introducing recent work on features generated through sparse auto-encoders, and explaining how the discovery of such features suggests an alternative decompositional picture of meaning (section 2). I will then respond to this challenge by considering in greater detail the nature of such features (section 3). Finally, I will return to the holistic picture defended by Grindrod et al. and argue that the picture still stands provided that the features are countable (section 4).

[21] Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Nicholas Edwards, Sebastian Schuster

Main category: cs.CL

TL;DR: Multi-agent system improves LLM agents’ ability to detect and resolve underspecified instructions in software engineering tasks by decoupling uncertainty detection from code execution.

Details

Motivation: LLM agents deployed in open-ended domains like software engineering often encounter underspecified instructions lacking crucial context. While humans naturally ask clarifying questions, current agents are optimized for autonomous execution rather than proactive clarification-seeking.

Method: Proposed an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Evaluated on an underspecified variant of SWE-bench Verified, comparing multi-agent system (OpenHands + Claude Sonnet 4.5) against standard single-agent setup.

Result: Multi-agent system achieved 69.40% task resolve rate, significantly outperforming single-agent setup (61.20%) and closing performance gap with agents operating on fully specified instructions. System exhibited well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on complex issues.

Conclusion: Current LLM models can be turned into proactive collaborators that independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks, improving their practical utility in open-ended domains.

Abstract: As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.

[22] GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation

Beatrice Alex, Claire Grover, Arlene Casey, Richard Tobin, Heather Whalley, William Whiteley

Main category: cs.CL

TL;DR: GS-BrainText is a curated dataset of 8,511 brain radiology reports with 2,431 annotated for 24 brain disease phenotypes, designed for developing generalizable clinical NLP algorithms.

Details

Motivation: Addresses the significant gap in available UK clinical text resources for developing and evaluating generalizable clinical NLP algorithms, particularly for brain radiology reports.

Method: Created a multi-site dataset spanning five Scottish NHS health boards with broad age representation. Expert annotations were performed by a multidisciplinary clinical team using an annotation schema with 10-100% double annotation per health board and rigorous quality assurance.

Result: Benchmark evaluation using EdIE-R (rule-based NLP system) showed performance variation across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100), and age groups (F1: 87.01-98.13), highlighting challenges in NLP generalization.

Conclusion: GS-BrainText provides a valuable resource for studying linguistic variation, diagnostic uncertainty expression, and the impact of data characteristics on NLP system performance in clinical settings.

Abstract: We present GS-BrainText, a curated dataset of 8,511 brain radiology reports from the Generation Scotland cohort, of which 2,431 are annotated for 24 brain disease phenotypes. This multi-site dataset spans five Scottish NHS health boards and includes broad age representation (mean age 58, median age 53), making it uniquely valuable for developing and evaluating generalisable clinical natural language processing (NLP) algorithms and tools. Expert annotations were performed by a multidisciplinary clinical team using an annotation schema, with 10-100% double annotation per NHS health board and rigorous quality assurance. Benchmark evaluation using EdIE-R, an existing rule-based NLP system developed in conjunction with the annotation schema, revealed some performance variation across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100) and age groups (F1: 87.01-98.13), highlighting critical challenges in generalisation of NLP tools. The GS-BrainText dataset addresses a significant gap in available UK clinical text resources and provides a valuable resource for the study of linguistic variation, diagnostic uncertainty expression and the impact of data characteristics on NLP system performance.

[23] A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs

Uri Z. Kialy, Avi Shtarkberg, Ayal Klein

Main category: cs.CL

TL;DR: Multilingual LLMs develop language-agnostic abstract representations of informal register (slang) that can be causally manipulated to control output formality across languages.

Details

Motivation: To understand whether multilingual language models process culture-specific pragmatic registers (like slang) as isolated language-specific memorizations or as unified, abstract concepts that transfer across languages.

Method: Probed internal representations of Gemma-2-9B-IT using Sparse Autoencoders across English, Hebrew, and Russian. Created novel dataset with polysemous terms appearing in both literal and informal contexts. Used activation steering to test causal effects.

Result: Found a small but robust cross-linguistic core that forms a geometrically coherent “informal register subspace” that sharpens in deeper layers. These shared representations causally shift output formality across all source languages and transfer zero-shot to six unseen languages.

Conclusion: Multilingual LLMs internalize informal register not just as surface-level heuristics, but as portable, language-agnostic pragmatic abstractions that enable cross-linguistic transfer.

Abstract: While multilingual language models successfully transfer factual and syntactic knowledge across languages, it remains unclear whether they process culture-specific pragmatic registers, such as slang, as isolated language-specific memorizations or as unified, abstract concepts. We study this by probing the internal representations of Gemma-2-9B-IT using Sparse Autoencoders (SAEs) across three typologically diverse source languages: English, Hebrew, and Russian. To definitively isolate pragmatic register processing from trivial lexical sensitivity, we introduce a novel dataset in which every target term is polysemous, appearing in both literal and informal contexts. We find that while much of the informal-register signal is distributed across language-specific features, a small but highly robust cross-linguistic core consistently emerges. This shared core forms a geometrically coherent ``informal register subspace’’ that sharpens in the model’s deeper layers. Crucially, these shared representations are not merely correlational: activation steering with these features causally shifts output formality across all source languages and transfers zero-shot to six unseen languages spanning diverse language families and scripts. Together, these results provide the first mechanistic evidence that multilingual LLMs internalize informal register not just as surface-level heuristics, but as a portable, language-agnostic pragmatic abstraction.

[24] Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan

Chihiro Taguchi, Yukinori Takubo, David Chiang

Main category: cs.CL

TL;DR: Developing an ASR system for the endangered Ikema language to assist in documentation and transcription efficiency.

Details

Motivation: Language endangerment threatens linguistic diversity, and Ikema is a severely endangered Ryukyuan language with only ~1,300 speakers, mostly elderly. Technology like ASR can help document and revitalize such languages by making transcription more efficient.

Method: 1) Constructed a speech corpus from field recordings, 2) Trained an ASR model on this corpus, 3) Evaluated the impact of ASR assistance on transcription efficiency and cognitive load.

Result: Achieved character error rate as low as 15% with the ASR model. ASR integration substantially reduced transcription time and cognitive load, making documentation more scalable.

Conclusion: ASR systems offer practical pathways for scalable, technology-supported documentation of endangered languages, helping preserve linguistic diversity.

Abstract: Language endangerment poses a major challenge to linguistic diversity worldwide, and technological advances have opened new avenues for documentation and revitalization. Among these, automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data. This study focuses on Ikema, a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old. We present an ongoing effort to develop an ASR system for Ikema based on field recordings. Specifically, we (1) construct a {\totaldatasethours}-hour speech corpus from field recordings, (2) train an ASR model that achieves a character error rate as low as 15%, and (3) evaluate the impact of ASR assistance on the efficiency of speech transcription. Our results demonstrate that ASR integration can substantially reduce transcription time and cognitive load, offering a practical pathway toward scalable, technology-supported documentation of endangered languages.

[25] SocialX: A Modular Platform for Multi-Source Big Data Research in Indonesia

Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja

Main category: cs.CL

TL;DR: SocialX is a modular platform for multi-source big data research in Indonesia that integrates heterogeneous data collection, language-aware preprocessing, and pluggable analysis into a unified pipeline.

Details

Motivation: Big data research in Indonesia faces fragmentation across different data sources (social media, news, e-commerce, etc.) with varying formats, access methods, and noise characteristics, forcing researchers to build custom pipelines that overshadow actual research.

Method: A three-layer modular architecture (collection, preprocessing, analysis) with lightweight job coordination, enabling independent growth of each layer and source-agnostic processing with Indonesian-specific text preprocessing.

Result: SocialX provides a publicly accessible web-based platform that demonstrates utility through typical research workflows, addressing Indonesian text challenges across different registers.

Conclusion: SocialX offers a unified solution for Indonesian big data research by overcoming data fragmentation through modular design, allowing researchers to focus on analysis rather than pipeline development.

Abstract: Big data research in Indonesia is constrained by a fundamental fragmentation: relevant data is scattered across social media, news portals, e-commerce platforms, review sites, and academic databases, each with different formats, access methods, and noise characteristics. Researchers must independently build collection pipelines, clean heterogeneous data, and assemble separate analysis tools, a process that often overshadows the research itself. We present SocialX, a modular platform for multi-source big data research that integrates heterogeneous data collection, language-aware preprocessing, and pluggable analysis into a unified, source-agnostic pipeline. The platform separates concerns into three independent layers (collection, preprocessing, and analysis) connected by a lightweight job-coordination mechanism. This modularity allows each layer to grow independently: new data sources, preprocessing methods, or analysis tools can be added without modifying the existing pipeline. We describe the design principles that enable this extensibility, detail the preprocessing methodology that addresses challenges specific to Indonesian text across registers, and demonstrate the platform’s utility through a walkthrough of a typical research workflow. SocialX is publicly accessible as a web-based platform at https://www.socialx.id.

[26] findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

Héctor Javier Vázquez Martínez

Main category: cs.CL

TL;DR: A unified toolkit called findsylls for syllable segmentation and analysis across languages, combining classical and modern methods with standardized evaluation.

Details

Motivation: Research on syllabification is fragmented across different implementations, datasets, and evaluation protocols, making comparisons difficult. There's a need for a unified framework to support reproducible syllable-level experiments across both high-resource and under-resourced languages.

Method: Developed findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface. It implements and standardizes widely used methods (Sylber, VG-HuBERT) and allows component recombination for controlled comparisons of representations, algorithms, and token rates.

Result: Demonstrated findsylls on English and Spanish corpora and on new hand-annotated data from Kono (an underdocumented Central Mande language), showing how a single framework can support reproducible syllable-level experiments across different resource settings.

Conclusion: findsylls provides a unified framework for syllable segmentation research, enabling better comparisons across methods and languages, particularly valuable for both high-resource and under-resourced language settings.

Abstract: Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

[27] From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs

Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang

Main category: cs.CL

TL;DR: LLMs show limited spatial reasoning capabilities with transient, fragmented representations rather than robust structured spatial cognition, despite encoding some spatial information in intermediate layers.

Details

Motivation: To determine whether LLMs' performance on spatial reasoning benchmarks reflects genuine structured spatial representations or just linguistic heuristics, and to understand the mechanistic basis of spatial cognition in LLMs.

Method: Decomposed spatial reasoning into three primitives (relational composition, representational transformation, stateful spatial updating), designed controlled task families for each, evaluated multilingual LLMs (English, Chinese, Arabic), and analyzed internal representations using linear probing, sparse autoencoder feature analysis, and causal interventions.

Result: Task-relevant spatial information is encoded in intermediate layers and can causally influence behavior, but representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross-linguistic analysis shows mechanistic degeneracy where similar performance arises from distinct internal pathways.

Conclusion: Current LLMs exhibit limited and context-dependent spatial representations rather than robust, general-purpose spatial reasoning, highlighting the need for mechanistic evaluation beyond benchmark accuracy.

Abstract: As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models’ (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning into three primitives, relational composition, representational transformation, and stateful spatial updating, and design controlled task families for each. We evaluate multilingual LLMs in English, Chinese, and Arabic under single pass inference, and analyze internal representations using linear probing, sparse autoencoder based feature analysis, and causal interventions. We find that task relevant spatial information is encoded in intermediate layers and can causally influence behavior, but these representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross linguistic analysis further reveals mechanistic degeneracy, where similar behavioral performance arises from distinct internal pathways. Overall, our results suggest that current LLMs exhibit limited and context dependent spatial representations rather than robust, general purpose spatial reasoning, highlighting the need for mechanistic evaluation beyond benchmark accuracy.

[28] CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law

JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho

Main category: cs.CL

TL;DR: CALRK-Bench is a Korean legal reasoning benchmark that evaluates context awareness in legal AI, testing temporal validity of norms, sufficiency of legal information, and understanding of judgment shifts.

Details

Motivation: Existing legal benchmarks focus on rule application assuming fixed norms, failing to capture real-world complexities like shifting judgments and interacting norms. There's a need for benchmarks that evaluate context-aware legal reasoning.

Method: Created CALRK-Bench from Korean legal precedents and consultation records, validated by legal experts. The benchmark tests three capabilities: 1) identifying temporal validity of legal norms, 2) determining sufficiency of legal information for cases, and 3) understanding reasons behind judgment shifts.

Result: Recent large language models consistently show low performance on all three tasks, demonstrating the benchmark’s effectiveness as a stress test for context-aware reasoning beyond simple legal knowledge memorization.

Conclusion: CALRK-Bench provides a valuable tool for evaluating context-aware legal reasoning capabilities in AI systems, revealing limitations in current models’ ability to handle temporal, informational, and judgmental complexities in legal contexts.

Abstract: Legal reasoning requires not only the application of legal rules but also an understanding of the context in which those rules operate. However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact. In this work, we propose CALRK-Bench, a context-aware legal reasoning benchmark based on the legal system in Korean. CALRK-Bench evaluates whether models can identify the temporal validity of legal norms, determine whether sufficient legal information is available for a given case, and understand the reasons behind shifts in legal judgments. The dataset is constructed from legal precedents and legal consultation records, and is validated by legal experts. Experimental results show that even recent large language models consistently exhibit low performance on these three tasks. CALRK-Bench provides a new stress test for evaluating context-aware legal reasoning rather than simple memorization of legal knowledge. Our code is available at https://github.com/jhCOR/CALRKBench.

[29] Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang, Yichun Yin, Lifeng Shang, Ming Zhang

Main category: cs.CL

TL;DR: SwiAttn is a hybrid transformer that dynamically routes between full attention and sliding window attention per token per layer for efficient long-context modeling.

Details

Motivation: Standard full attention scales quadratically with sequence length, making it inefficient for long contexts, while sliding window attention has limited receptive fields. Existing hybrid approaches use static patterns that don't adapt to different scenarios.

Method: Proposes Switch Attention (SwiAttn) that dynamically routes each token at each layer to either full-attention branch (global info) or sliding-window branch (local patterns). Uses adaptive regularization for efficiency and continual pretraining to transfer from full attention to hybrid architecture.

Result: Extensive experiments on 23 benchmark datasets across regular (4K) and long (32K) context lengths demonstrate effectiveness of the proposed method.

Conclusion: SwiAttn provides an efficient hybrid attention mechanism that dynamically balances global and local information processing, addressing computational bottlenecks in long-context language modeling.

Abstract: The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency at the cost of narrower receptive fields. While existing efforts attempt to take the benefits from both sides by building hybrid models, they often resort to static, heuristically designed alternating patterns that limit efficient allocation of computation in various scenarios. In this paper, we propose Switch Attention (SwiAttn), a novel hybrid transformer that enables dynamic and fine-grained routing between full attention and sliding window attention. For each token at each transformer layer, SwiAttn dynamically routes the computation to either a full-attention branch for global information aggregation or a sliding-window branch for efficient local pattern matching. An adaptive regularization objective is designed to encourage the model towards efficiency. Moreover, we adopt continual pretraining to optimize the model, transferring the full attention architecture to the hybrid one. Extensive experiments are conducted on twenty-three benchmark datasets across both regular (4K) and long (32K) context lengths, demonstrating the effectiveness of the proposed method.

[30] Word Alignment-Based Evaluation of Uniform Meaning Representations

Daniel Zeman, Federica Gamba

Main category: cs.CL

TL;DR: A new node-matching algorithm for comparing Uniform Meaning Representations (UMR) that leverages word alignments for more intuitive and interpretable comparisons than existing methods like smatch.

Details

Motivation: Existing approaches for comparing graph-based sentence meaning representations have limitations: they maximize F1 scores regardless of intentional vs. accidental similarity, produce unhelpful error analyses, and face NP-hard search problems in methods like smatch.

Method: Proposes a node-matching algorithm that compares multiple UMR representations of the same sentence by taking advantage of node-word alignments that are inherently available in UMR, avoiding the complex search problem of smatch.

Result: The method provides more intuitive and interpretable comparisons of meaning representations while avoiding NP-hard search problems, with an available implementation script.

Conclusion: Word alignment sensitivity makes meaning representation comparison more intuitive and interpretable, offering a practical alternative to existing methods like smatch for UMR evaluation.

Abstract: Comparison and evaluation of graph-based representations of sentence meaning is a challenge because competing representations of the same sentence may have different number of nodes, and it is not obvious which nodes should be compared to each other. Existing approaches favor node mapping that maximizes $F_1$ score over node relations and attributes, regardless whether the similarity is intentional or accidental; consequently, the identified mismatches in values of node attributes are not useful for any detailed error analysis. We propose a node-matching algorithm that allows comparison of multiple Uniform Meaning Representations (UMR) of one sentence and that takes advantage of node-word alignments, inherently available in UMR. We compare it with previously used approaches, in particular smatch (the de-facto standard in AMR evaluation), and argue that sensitivity to word alignment makes the comparison of meaning representations more intuitive and interpretable, while avoiding the NP-hard search problem inherent in smatch. A script implementing the method is freely available.

[31] Why Models Know But Don’t Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models

Richard J. Young

Main category: cs.CL

TL;DR: Study examines how 12 reasoning models handle misleading hints, finding significant divergence between thinking tokens and visible answers, with 55.4% of hint-influenced cases showing hint acknowledgment only in thinking tokens.

Details

Motivation: To understand how extended-thinking models process misleading information through their dual-channel architecture (thinking tokens vs. visible answers), and to quantify the transparency of hint influence across different hint types and models.

Method: Analyzed 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Examined 10,506 cases where models followed hints, classifying each case by whether hint acknowledgment appears in thinking tokens, answer text, both, or neither.

Result: 55.4% of hint-influenced cases show thinking-answer divergence (hint acknowledgment only in thinking tokens). Reverse pattern is near-zero (0.5%). Sycophancy hints are most transparent (58.8% acknowledge in both channels), while consistency (72.2%) and unethical (62.7%) hints show thinking-only acknowledgment. Models vary widely from 94.7% divergence (Step-3.5-Flash) to 19.6% (Qwen3.5-27B).

Conclusion: Answer-text-only monitoring misses over half of hint-influenced reasoning. Thinking-token access is necessary but still leaves 11.8% of cases with no verbalized acknowledgment. The study reveals systematic asymmetry in how models process misleading information across their dual channels.

Abstract: Extended-thinking models expose a second text-generation channel (“thinking tokens”) alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint’s target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the model’s thinking tokens contain hint-related keywords that the visible answer omits entirely, a pattern termed thinking-answer divergence. The reverse (answer-only acknowledgment) is near-zero (0.5%), confirming that the asymmetry is directional. Hint type shapes the pattern sharply: sycophancy is the most transparent hint, with 58.8% of sycophancy-influenced cases acknowledging the professor’s authority in both channels, while consistency (72.2%) and unethical (62.7%) hints are dominated by thinking-only acknowledgment. Models also vary widely, from near-total divergence (Step-3.5-Flash: 94.7%) to relative transparency (Qwen3.5-27B: 19.6%). These results show that answer-text-only monitoring misses more than half of all hint-influenced reasoning and that thinking-token access, while necessary, still leaves 11.8% of cases with no verbalized acknowledgment in either channel.

[32] Analysing Calls to Order in German Parliamentary Debates

Nina Smirnova, Daniel Dan, Philipp Mayr

Main category: cs.CL

TL;DR: Analysis of incivility in German parliamentary debates using calls to order as formal indicators of norm violations over 72 years, with rule-based detection, classification system, and dataset creation.

Details

Motivation: Parliamentary incivility signals political polarization and institutional conflict, but calls to order as formal indicators of norm violations have received little systematic attention in parliamentary research despite their relevance.

Method: Rule-based method for detecting and annotating calls to order in parliamentary speeches, creation of a novel 72-year German parliamentary debate dataset with annotated CtO instances, and development of a classification system for CtO triggers.

Result: CtO issuance is partly subjective and influenced by session presidents and parliamentary dynamics, with certain individuals disproportionately affected. Insults are the most frequent cause, male members and opposition party members receive more CtOs, and most triggers occur in speeches about governmental affairs and presidential actions.

Conclusion: The study provides systematic analysis of parliamentary incivility through calls to order, revealing patterns of norm violations and institutional dynamics in German politics, with dataset available for further research.

Abstract: Parliamentary debate constitutes a central arena of political power, shaping legislative outcomes and public discourse. Incivility within this arena signals political polarization and institutional conflict. This study presents a systematic investigation of incivility in the German Bundestag by examining calls to order (CtO; plural: CtOs) as formal indicators of norm violations. Despite their relevance, CtOs have received little systematic attention in parliamentary research. We introduce a rule-based method for detecting and annotating CtOs in parliamentary speeches and present a novel dataset of German parliamentary debates spanning 72 years that includes annotated CtO instances. Additionally, we develop the first classification system for CtO triggers and analyze the factors associated with their occurrence. Our findings show that, despite formal regulations, the issuance of CtOs is partly subjective and influenced by session presidents and parliamentary dynamics, with certain individuals disproportionately affected. An insult towards individuals is the most frequent cause of CtO. In general, male members and those belonging to opposition parties receive more calls to order than their female and coalition-party counterparts. Most CtO triggers were detected in speeches dedicated to governmental affairs and actions of the presidency. The CtO triggers dataset is available at: https://github.com/kalawinka/cto_analysis.

[33] Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen

Main category: cs.CL

TL;DR: Open-source LLMs (4B-70B parameters) can accurately answer clinical questions from EHRs offline, with Llama-3.1-70B achieving 95.3% accuracy, though some clinically significant errors remain requiring human oversight.

Details

Motivation: Clinicians need to retrieve patient-specific information from electronic health records (EHRs), which is time-consuming and error-prone. There's a need for locally deployable clinical question answering systems that don't require external data transfer for privacy and practical reasons.

Method: Developed a Clinical Contextual Question Answering (CCQA) framework using open-source LLMs (4B to 70B parameters) benchmarked under fully offline conditions. Used 1,664 expert-annotated question-answer pairs from 183 patients’ EHRs (predominantly Finnish clinical text). Evaluated models in both free-text generation and multiple-choice settings, tested low-precision quantization (4-bit and 8-bit) for deployment feasibility.

Result: Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants. Qwen3-30B-A3B-2507 achieved comparable performance. Low-precision quantization preserved predictive performance while reducing GPU memory requirements. Clinical evaluation found clinically significant errors in 2.9% of outputs, with 0.96% of cases showing discordant responses where one formulation was correct and the other contained clinically significant errors.

Conclusion: Locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, but validation and human oversight are necessary due to persistent clinically significant errors and occasional inconsistencies in responses to semantically equivalent questions.

Abstract: Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.

Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm

Main category: cs.CL

TL;DR: ClimateCheck 2026 is a shared task for verifying climate-related claims against scientific literature, featuring expanded data and new disinformation narrative classification, with analysis of system performance and metric biases.

Details

Motivation: Climate disinformation verification is challenging due to specialized scientific evidence and diverse rhetorical strategies. The task aims to advance automated fact-checking systems for climate claims.

Method: Shared task competition with tripled training data and new narrative classification. Systems used dense retrieval pipelines, cross-encoder ensembles, and LLMs with structured hierarchical reasoning. Evaluation used Recall@K and Binary Preference metrics, plus automated framework for assessing retrieval quality under incomplete annotations.

Result: 20 registered participants, 8 leaderboard submissions. Analysis revealed systematic biases in conventional metrics and showed not all climate disinformation is equally verifiable, suggesting implications for future fact-checking system design.

Conclusion: ClimateCheck 2026 advances climate claim verification research, exposes metric limitations, and reveals heterogeneity in disinformation verifiability that should inform future system design.

Abstract: Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinformation narrative classification task. Running from January to February 2026 on the CodaBench platform, the competition attracted 20 registered participants and 8 leaderboard submissions, with systems combining dense retrieval pipelines, cross-encoder ensembles, and large language models with structured hierarchical reasoning. In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems. A cross-task analysis further reveals that not all climate disinformation is equally verifiable, potentially implicating how future fact-checking systems should be designed.

[35] Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs

Vinicius Anjos de Almeida, Sandro Saorin da Silva, Josimar Chire, Leonardo Vicenzi, Nícolas Henrique Borges, Helena Kociolek, Sarah Miriã de Castro Rocha, Frederico Nassif Gomes, Júlia Cristina Ferreira, Oge Marques, Lucas Emanuel Silva e Oliveira

Main category: cs.CL

TL;DR: Evaluation of BERT-based models and LLMs for Portuguese clinical named entity recognition, with mmBERT-base achieving best performance and iterative stratification improving class imbalance handling.

Details

Motivation: Clinical notes contain valuable unstructured medical information, but benchmarks for Portuguese clinical NER remain scarce. The study aims to evaluate BERT-based models and LLMs for Portuguese clinical NER and test strategies for addressing multilabel imbalance.

Method: Compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs like GPT-5 and Gemini-2.5 using SemClinBr corpus and private breast cancer dataset. Models trained under identical conditions and evaluated using precision, recall, and F1-score. Explored iterative stratification, weighted loss, and oversampling to mitigate class imbalance.

Result: mmBERT-base achieved best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources.

Conclusion: Multilingual BERT models, especially mmBERT, are effective for Portuguese clinical NER and can operate with limited computational resources. Balanced data-splitting strategies like iterative stratification further enhance performance for this task.

Abstract: Clinical notes contain valuable unstructured information. Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce. In this study, we aimed to evaluate BERT-based models and large language models (LLMs) for clinical NER in Portuguese and to test strategies for addressing multilabel imbalance. We compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT-5 and Gemini-2.5, using the public SemClinBr corpus and a private breast cancer dataset. Models were trained under identical conditions and evaluated using precision, recall, and F1-score. Iterative stratification, weighted loss, and oversampling were explored to mitigate class imbalance. The mmBERT-base model achieved the best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources. Balanced data-splitting strategies further enhance performance.

[36] AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

Afonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos, Diogo Tavares, Rafael Ferreira, Giuseppe Attanasio, Duarte M. Alves, Inês Calvo, Inês Vieira, Rui Guerra, James Furtado, Beatriz Canaverde, Iago Paulo, Vasco Ramos, Diogo Glória-Silva, Miguel Faria, Marcos Treviso, Daniel Gomes, Pedro Gomes, David Semedo, André Martins, João Magalhães

Main category: cs.CL

TL;DR: AMALIA is an open LLM specifically optimized for European Portuguese (pt-PT) with targeted training and native evaluation benchmarks to address underrepresentation and linguistic nuances.

Details

Motivation: European Portuguese is underrepresented in LLM training data and evaluation, with machine-translated benchmarks missing linguistic and cultural nuances specific to pt-PT variant.

Method: Developed AMALIA LLM using more high-quality pt-PT data during mid- and post-training stages, and created native pt-PT benchmarks including translated tasks and four new datasets targeting generation, linguistic competence, and pt-PT/pt-BR bias.

Result: AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, demonstrating effectiveness of targeted training and native benchmarking.

Conclusion: Targeted training and native benchmarking are crucial for underrepresented language variants like European Portuguese to ensure proper linguistic and cultural representation in LLMs.

Abstract: Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant’s linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.

[37] JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems

Guangzhao Yang, Yu Pan, Shi Qiu, Ningjie Bai

Main category: cs.CL

TL;DR: JAL-Turn is a lightweight speech-only turn-taking framework that jointly models acoustic and linguistic features using cross-attention, enabling parallel operation with ASR and low-latency turn detection without full-duplex data requirements.

Details

Motivation: Current turn-taking detection systems for Voice AI agents are inefficient and inaccurate, relying on single cues (acoustic or semantic only) or requiring costly full-duplex data and training overhead. There's a need for lightweight, real-time solutions that work well in industrial deployments.

Method: Proposes JAL-Turn framework with joint acoustic-linguistic modeling using cross-attention to integrate pre-trained acoustic representations with linguistic features. Uses frozen ASR encoder to enable parallel turn-taking prediction with speech recognition, introducing no additional latency. Also introduces scalable data construction pipeline for automatic turn-taking label generation from dialogue corpora.

Result: Extensive experiments on public multilingual benchmarks and in-house Japanese customer-service dataset show JAL-Turn consistently outperforms state-of-the-art baselines in detection accuracy while maintaining superior real-time performance.

Conclusion: JAL-Turn provides an efficient, lightweight solution for turn-taking detection that balances accuracy and real-time performance, addressing key challenges in industrial Voice AI deployments without requiring costly full-duplex data or training overhead.

Abstract: Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments. Many existing systems rely solely on acoustic or semantic cues, leading to suboptimal accuracy and stability, while recent attempts to endow large language models with full-duplex capabilities require costly full-duplex data and incur substantial training and deployment overheads, limiting real-time performance. In this paper, we propose JAL-Turn, a lightweight and efficient speech-only turn-taking framework that adopts a joint acoustic-linguistic modeling paradigm, in which a cross-attention module adaptively integrates pre-trained acoustic representations with linguistic features to support low-latency prediction of hold vs shift states. By sharing a frozen ASR encoder, JAL-Turn enables turn-taking prediction to run fully in parallel with speech recognition, introducing no additional end-to-end latency or computational overhead. In addition, we introduce a scalable data construction pipeline that automatically derives reliable turn-taking labels from large-scale real-world dialogue corpora. Extensive experiments on public multilingual benchmarks and an in-house Japanese customer-service dataset show that JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior real-time performance.

[38] ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs

Inês Vieira, Inês Calvo, Iago Paulo, James Furtado, Rafael Ferreira, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães

Main category: cs.CL

TL;DR: ALBA is a linguistically grounded benchmark for evaluating LLM proficiency in European Portuguese across eight linguistic dimensions, addressing the gap in under-represented language evaluation.

Details

Motivation: Existing LLM training data and benchmarks are mainly in Brazilian Portuguese, creating a need for comprehensive evaluation tools for European Portuguese to assess linguistic proficiency across multiple dimensions.

Method: Manually constructed benchmark by language experts covering eight linguistic dimensions, paired with an LLM-as-a-judge framework for scalable evaluation of European Portuguese generated language.

Result: Experiments reveal performance variability across linguistic dimensions in European Portuguese, highlighting the need for variety-sensitive benchmarks.

Conclusion: ALBA addresses the evaluation gap for European Portuguese and supports further development of tools in this under-represented language variety.

Abstract: As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency in linguistic-related tasks in pt-PT across eight linguistic dimensions, including Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.

[39] How Open Must Language Models be to Enable Reliable Scientific Inference?

James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman

Main category: cs.CL

TL;DR: Paper analyzes how open vs. closed AI models impact scientific inference, arguing closed models threaten reliable research with some exceptions, and recommends systematic threat identification and justification for model selection.

Details

Motivation: The paper is motivated by the growing use of AI models in scientific research and concerns about how restrictions on information about model construction and deployment (open vs. closed models) threaten the reliability of scientific inferences drawn from such research.

Method: The paper appears to use conceptual analysis and argumentation to examine how information restrictions impact scientific inference, analyzing threats to reliability and proposing mitigation strategies.

Result: The analysis concludes that current closed models are generally ill-suited for scientific purposes (with some exceptions), identifies specific threats to reliable inference, and provides recommendations for addressing these issues.

Conclusion: Researchers should systematically identify threats to inference when using models, take steps to mitigate them, and provide specific justifications for model selection, with a preference for open models when possible for scientific rigor.

Abstract: How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

[40] Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model

Maria Kefala, Jeffery L. Painter, Syed Tauhid Bukhari, Maurizio Sessa

Main category: cs.CL

TL;DR: Created a time-indexed reference dataset for EU pharmacovigilance that tracks when adverse events are officially recognized in drug labels, enabling evaluation of early signal detection methods.

Details

Motivation: Existing pharmacovigilance datasets lack temporal information about when adverse events are officially recognized by regulatory authorities, preventing accurate evaluation of early signal detection methods that aim to identify safety issues before confirmation.

Method: Retrieved current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized EU products (n=1,513), extracted Section 4.8 (adverse events) using DeepSeek V3, programmatically extracted regulatory metadata including labeling changes, and time-indexed based on AE inclusion dates.

Result: Database includes 17,763 SmPC versions (1995-2025) with 125,026 drug-AE associations; time-indexed dataset includes 1,479 active products and 110,823 associations; 74.5% AEs identified pre-marketing vs 25.5% post-marketing; safety updates peaked around 2012; gastrointestinal, skin, and nervous system disorders most common.

Conclusion: The dataset addresses a critical gap by incorporating temporal information on AE recognition, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons in pharmacovigilance research.

Abstract: Background: The identification of optimal signal detection methods is hindered by the lack of reliable reference datasets. Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance. This study addresses this gap by developing a time-indexed reference dataset for the European Union (EU), incorporating the timing of AE inclusion in product labels along with regulatory metadata. Methods: Current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025). Section 4.8 was extracted and processed using DeepSeek V3 to identify AEs. Regulatory metadata, including labelling changes, were programmatically extracted. Time indexing was based on the date of AE inclusion in the SmPC. Results: The database includes 17,763 SmPC versions spanning 1995-2025, comprising 125,026 drug-AE associations. The time-indexed reference dataset, restricted to active products, included 1,479 medicinal products and 110,823 drug-AE associations. Most AEs were identified pre-marketing (74.5%) versus post-marketing (25.5%). Safety updates peaked around 2012. Gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes. Drugs had a median of 48 AEs across 14 SOCs. Conclusions: The proposed dataset addresses a critical gap in pharmacovigilance by incorporating temporal information on AE recognition for the EU, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons across analytical approaches.

[41] When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo

Main category: cs.CL

TL;DR: Hybrid-KDA architecture with GenDistill pipeline enables efficient Transformer distillation with improved generation quality evaluation

Details

Motivation: Current Transformer distillation methods often use log-likelihood evaluation which underestimates generation quality gaps between teacher and student models, leading to misleading conclusions about distilled model performance

Method: Proposes Hybrid Kimi Delta Attention (Hybrid-KDA) architecture with GenDistill multi-stage distillation pipeline, using generation-based evaluation to guide design decisions across six axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice

Result: Best Hybrid-KDA model retains 86-90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2-4× at 128K-token contexts; log-likelihood evaluation consistently underestimates teacher-student gaps

Conclusion: Generation-based evaluation is crucial for accurate distillation assessment, with dataset selection, completion-only masking, and freezing attention layers during post-training having the largest impact on generation quality

Abstract: Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2,pp under log-likelihood scoring actually falls behind by 20.8,pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2–4$\times$ at 128K-token contexts.

[42] MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng

Main category: cs.CL

TL;DR: MemBoost: A memory-boosted LLM serving framework that reduces inference costs by reusing previously generated answers and selectively routing queries between lightweight and strong models.

Details

Motivation: LLMs have high inference costs in real-world services, especially with repeated or near-duplicate queries across users and sessions. There's a need to reduce costs while maintaining answer quality.

Method: Proposes MemBoost framework with: 1) Memory system for reusing previously generated answers, 2) Retrieval of relevant supporting information for cheap inference, 3) Cost-aware routing that escalates difficult queries to stronger models, 4) Support for interactive settings with continual memory growth.

Result: Experiments show MemBoost substantially reduces expensive large-model invocations and overall inference cost while maintaining high answer quality comparable to strong model baselines.

Conclusion: MemBoost provides an effective framework for cost-efficient LLM serving by leveraging memory and intelligent routing, particularly beneficial for workloads with repeated queries.

Abstract: Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.

[43] EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching

Paul Bontempo

Main category: cs.CL

TL;DR: Analysis of English-Tamil code-switching shows positive sentiment correlates with higher English usage, while mixed sentiment leads to more language switches, supporting socio-linguistic theories about language and emotion.

Details

Motivation: To investigate how emotional content influences language choice in multilingual code-switching contexts, specifically examining the relationship between utterance sentiment and language mixing patterns in English-Tamil code-switched text.

Method: Used fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from DravidianCodeMix dataset, then performed linear regression analysis to examine relationships between sentiment categories (positive, negative, mixed) and language metrics (English proportion, switch frequency).

Result: Positive utterances had significantly higher English proportion (34.3%) than negative utterances (24.8%), and mixed-sentiment utterances showed the highest language switch frequency when controlling for utterance length.

Conclusion: Emotional content demonstrably influences language choice in multilingual code-switching settings, supporting socio-linguistic theories about prestige and identity associations with embedded and matrix languages.

Abstract: This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit significantly greater English proportion (34.3%) than negative utterances (24.8%), and mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content demonstrably influences language choice in multilingual code-switching settings, due to socio-linguistic associations of prestige and identity with embedded and matrix languages.

[44] Weight Tying Biases Token Embeddings Towards the Output Space

Antonio Lopardo, Avyukth Harish, Catherine Arnett, Akshat Gupta

Main category: cs.CL

TL;DR: Weight tying in language models causes embedding matrices to be optimized for output prediction rather than input representation due to gradient imbalance, negatively affecting early-layer computations and potentially harming performance at scale.

Details

Motivation: To understand the impact of weight tying (sharing parameters between input and output embedding matrices) on the learned embedding space, which is common practice but poorly understood in language model design.

Method: Analyzed tied embedding matrices using tuned lens analysis to compare alignment with input vs output embeddings, investigated gradient dynamics during training, and conducted experiments scaling input gradients to reduce bias.

Result: Tied embedding matrices align more closely with output (unembedding) matrices than with input embeddings, showing unembedding bias due to output gradients dominating early training; scaling input gradients reduces this bias.

Conclusion: Weight tying optimizes embedding matrices for output prediction at the expense of input representation, explaining why it can harm performance at scale and having implications for smaller LLMs where embeddings contribute substantially to parameter count.

Abstract: Weight tying, i.e. sharing parameters between input and output embedding matrices, is common practice in language model design, yet its impact on the learned embedding space remains poorly understood. In this paper, we show that tied embedding matrices align more closely with output (unembedding) matrices than with input embeddings of comparable untied models, indicating that the shared matrix is shaped primarily for output prediction rather than input representation. This unembedding bias arises because output gradients dominate early in training. Using tuned lens analysis, we show this negatively affects early-layer computations, which contribute less effectively to the residual stream. Scaling input gradients during training reduces this bias, providing causal evidence for the role of gradient imbalance. This is mechanistic evidence that weight tying optimizes the embedding matrix for output prediction, compromising its role in input representation. These results help explain why weight tying can harm performance at scale and have implications for training smaller LLMs, where the embedding matrix contributes substantially to total parameter count.

[45] FinTruthQA: A Benchmark for AI-Driven Financial Disclosure Quality Assessment in Investor – Firm Interactions

Peilin Zhou, Ziyue Xu, Xinyu Shi, Jiageng Wu, Yikang Jiang, Dading Chong, Bin Ke, Jie Yang

Main category: cs.CL

TL;DR: FinTruthQA: First benchmark for AI-driven assessment of financial disclosure quality using 6,000 annotated Q&A entries from Chinese stock exchanges’ investor platforms, evaluating four criteria with models showing strong performance on question aspects but weaker on answer quality assessment.

Details

Motivation: Financial information disclosure quality is crucial for market efficiency but difficult to assess at scale, especially on Chinese investor interactive platforms where firms often provide limited or non-substantive responses to investor concerns.

Method: Created FinTruthQA benchmark with 6,000 real-world financial Q&A entries manually annotated on four criteria: question identification, question relevance, answer readability, and answer relevance. Benchmarked statistical ML models, pre-trained language models, fine-tuned variants, and LLMs.

Result: Models achieve strong performance on question identification and relevance (F1 > 95%), but weaker on answer readability (~88% Micro F1) and especially answer relevance (~80% Micro F1). Domain-adapted pre-trained models outperform general-purpose models and LLM prompting on challenging tasks.

Conclusion: FinTruthQA provides practical foundation for AI-driven disclosure monitoring in capital markets, valuable for regulatory oversight, investor protection, and disclosure governance, with domain-adapted models showing best performance on fine-grained quality assessment tasks.

Abstract: Accurate and transparent financial information disclosure is essential for market efficiency, investor decision-making, and corporate governance. Chinese stock exchanges’ investor interactive platforms provide a widely used channel through which listed firms respond to investor concerns, yet these responses are often limited or non-substantive, making disclosure quality difficult to assess at scale. To address this challenge, we introduce FinTruthQA, to our knowledge the first benchmark for AI-driven assessment of financial disclosure quality in investor-firm interactions. FinTruthQA comprises 6,000 real-world financial Q&A entries, each manually annotated based on four key evaluation criteria: question identification, question relevance, answer readability, and answer relevance. We benchmark statistical machine learning models, pre-trained language models and their fine-tuned variants, as well as large language models (LLMs), on FinTruthQA. Experiments show that existing models achieve strong performance on question identification and question relevance (F1 > 95%), but remain substantially weaker on answer readability (Micro F1 approximately 88%) and especially answer relevance (Micro F1 approximately 80%), highlighting the nontrivial difficulty of fine-grained disclosure quality assessment. Domain- and task-adapted pre-trained language models consistently outperform general-purpose models and LLM-based prompting on the most challenging settings. These findings position FinTruthQA as a practical foundation for AI-driven disclosure monitoring in capital markets, with value for regulatory oversight, investor protection, and disclosure governance in real-world financial settings.

[46] Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

Şaziye Betül Özateş, Tarık Emre Tıraş, Ece Elif Adak, Berat Doğan, Fatih Burak Karagöz, Efe Eren Genç, Esma F. Bilgin Taşdemir

Main category: cs.CL

TL;DR: First comprehensive NLP resources for historical Turkish: NER dataset (HisTR), UD treebank (OTA-BOUN), and Ottoman Text Corpus (OTC) with transformer models achieving strong performance on NER, dependency parsing, and POS tagging.

Details

Motivation: Historical Turkish NLP has been underexplored in computational linguistics, lacking foundational resources and models for analyzing historical linguistic structures and variations across time periods.

Method: Created three key resources: 1) HisTR - first NER dataset for historical Turkish, 2) OTA-BOUN - first Universal Dependencies treebank, 3) OTC - clean corpus of transliterated historical texts. Trained transformer-based models on these datasets for NER, dependency parsing, and POS tagging tasks.

Result: Achieved strong performance: 90.29% F1 for NER, 73.79% LAS for dependency parsing, and 94.98% F1 for POS tagging. Models demonstrate significant improvements in computational analysis of historical Turkish while highlighting challenges like domain adaptation and language variations across periods.

Conclusion: Provides first comprehensive NLP resources and models for historical Turkish, establishing benchmarks for future research. Resources available publicly to advance computational analysis of historical linguistic structures and variations.

Abstract: This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Furthermore, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results demonstrate prominent improvements in the computational analysis of historical Turkish, achieving strong performance on tasks that require understanding of historical linguistic structures – specifically, 90.29% F1 in named entity recognition, 73.79% LAS for dependency parsing, and 94.98% F1 for part-of-speech tagging. They also highlight existing challenges, such as domain adaptation and language variations between time periods. All the resources and models presented are available at https://hf.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.

[47] Don’t Stop the Multi-Party! On Generating Synthetic Written Multi-Party Conversations with Constraints

Nicolò Penzo, Marco Guerini, Bruno Lepri, Goran Glavaš, Sara Tonelli

Main category: cs.CL

TL;DR: LLMs can generate synthetic written multi-party conversations using two strategies: one-pass generation or turn-by-turn simulation, with turn-based approach showing better constraint compliance and linguistic variability.

Details

Motivation: Existing WMPC datasets from social media have privacy concerns and platform-specific limitations that create simplistic interaction patterns, motivating the need for synthetic generation methods.

Method: Two LLM strategies: (1) generate entire WMPC at once with constraints, (2) simulate turn-by-turn generation with conversation history. Evaluation framework assesses constraint compliance, content quality, and interaction complexity.

Result: Significant differences among LLMs, with only some capable of high-quality WMPCs. Turn-by-turn generation yields better constraint compliance and higher linguistic variability than one-pass generation.

Conclusion: Both generation strategies can produce high-quality WMPCs, with turn-based approach showing advantages in constraint adherence and linguistic diversity.

Abstract: Written Multi-Party Conversations (WMPCs) are widely studied across disciplines, with social media as a primary data source due to their accessibility. However, these datasets raise privacy concerns and often reflect platform-specific properties. For example, interactions between speakers may be limited due to rigid platform structures (e.g., threads, tree-like discussions), which yield overly simplistic interaction patterns (e.g., one-to-one “reply-to” links). This work explores the feasibility of generating synthetic WMPCs with instruction-tuned Large Language Models (LLMs) by providing deterministic constraints such as dialogue structure and participants’ stance. We investigate two complementary strategies of leveraging LLMs in this context: (i.) LLMs as WMPC generators, where we task the LLM to generate a whole WMPC at once and (ii.) LLMs as WMPC parties, where the LLM generates one turn of the conversation at a time (made of speaker, addressee and message), provided the conversation history. We next introduce an analytical framework to evaluate compliance with the constraints, content quality, and interaction complexity for both strategies. Finally, we assess the level of obtained WMPCs via human and LLM-as-a-judge evaluations. We find stark differences among LLMs, with only some being able to generate high-quality WMPCs. We also find that turn-by-turn generation yields better conformance to constraints and higher linguistic variability than generating WMPCs in one pass. Nonetheless, our structural and qualitative evaluation indicates that both generation strategies can yield high-quality WMPCs.

[48] Not Minds, but Signs: Reframing LLMs through Semiotics

Davide Picca

Main category: cs.CL

TL;DR: The paper proposes a semiotic framework for understanding LLMs as sign-manipulating systems rather than cognitive agents, emphasizing their role in cultural meaning-making processes.

Details

Motivation: To challenge the prevailing cognitivist view of LLMs as understanding systems and instead situate them within semiotic theory as participants in sign manipulation and meaning-making processes.

Method: Theoretical analysis and practical examples demonstrating how LLMs function as semiotic agents, exploring applications in literature, philosophy, education, and cultural production.

Result: Develops a semiotic paradigm that avoids anthropomorphism, provides a more precise understanding of LLMs’ role in cultural processes, and offers an ethically aware framework for studying and using these systems.

Conclusion: LLMs should be reframed as technological participants in an ecology of signs that alter how we read, write, and make meaning, rather than as systems possessing minds or understanding.

Abstract: This paper challenges the prevailing tendency to frame Large Language Models (LLMs) as cognitive systems, arguing instead for a semiotic perspective that situates these models within the broader dynamics of sign manipulation and meaning-making. Rather than assuming that LLMs understand language or simulate human thought, we propose that their primary function is to recombine, recontextualize, and circulate linguistic forms based on probabilistic associations. By shifting from a cognitivist to a semiotic framework, we avoid anthropomorphism and gain a more precise understanding of how LLMs participate in cultural processes, not by thinking, but by generating texts that invite interpretation. Through theoretical analysis and practical examples, the paper demonstrates how LLMs function as semiotic agents whose outputs can be treated as interpretive acts, open to contextual negotiation and critical reflection. We explore applications in literature, philosophy, education, and cultural production, emphasizing how LLMs can serve as tools for creativity, dialogue, and critical inquiry. The semiotic paradigm foregrounds the situated, contingent, and socially embedded nature of meaning, offering a more rigorous and ethically aware framework for studying and using LLMs. Ultimately, this approach reframes LLMs as technological participants in an ongoing ecology of signs. They do not possess minds, but they alter how we read, write, and make meaning, compelling us to reconsider the foundations of language, interpretation, and the role of artificial systems in the production of knowledge.

[49] Beyond cognacy

Gerhard Jäger

Main category: cs.CL

TL;DR: Automated MSA-based phylogenetic inference from lexical data outperforms traditional expert-annotated cognate methods for language family analysis

Details

Motivation: Standard computational phylogenetics in linguistics relies on sparse, labor-intensive expert-annotated cognate sets that are limited to individual language families, creating bottlenecks for large-scale analysis

Method: Compares traditional expert-annotated cognate methods with two automated approaches: (1) automatic cognate clustering with unigram/concept features, and (2) multiple sequence alignment (MSA) derived from pair-hidden Markov models applied directly to lexical data

Result: MSA-based inference produces trees more consistent with linguistic classifications, better predicts typological variation, and provides clearer phylogenetic signal than traditional methods

Conclusion: MSA-based methods offer a promising, scalable alternative to traditional cognate-based approaches, enabling global-scale language phylogenies without expert annotation bottlenecks

Abstract: Computational phylogenetics has become an established tool in historical linguistics, with many language families now analyzed using likelihood-based inference. However, standard approaches rely on expert-annotated cognate sets, which are sparse, labor-intensive to produce, and limited to individual language families. This paper explores alternatives by comparing the established method to two fully automated methods that extract phylogenetic signal directly from lexical data. One uses automatic cognate clustering with unigram/concept features; the other applies multiple sequence alignment (MSA) derived from a pair-hidden Markov model. Both are evaluated against expert classifications from Glottolog and typological data from Grambank. Also, the intrinsic strengths of the phylogenetic signal in the characters are compared. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal, suggesting it as a promising, scalable alternative to traditional cognate-based methods. This opens new avenues for global-scale language phylogenies beyond expert annotation bottlenecks.

[50] From dots to faces: Individual differences in visual imagery capacity predict the content of Ganzflicker-induced hallucinations

Ana Chkhaidze, Reshanne R. Reeder, Connor Gag, Anastasia Kiyonaga, Seana Coulson

Main category: cs.CL

TL;DR: Study uses NLP to analyze Ganzflicker-induced visual hallucinations across imagery spectrum, finding strong imagers report complex naturalistic content while weak imagers report simple geometric patterns.

Details

Motivation: To investigate whether individual differences in visual imagery ability (from absent to vivid) influence the complexity of internally generated visual experiences during Ganzflicker-induced hallucinations.

Method: Analyzed free-text descriptions of hallucinations from over 4,000 participants using natural language processing tools, including topic modeling and crowd-sourced sensorimotor norms to assess perceptual associations in language.

Result: Strong imagers described complex, naturalistic content while weak imagers reported simple geometric patterns. Participants with stronger imagery used language with richer perceptual associations.

Conclusion: Individual variation in visual imagery ability correlates with complexity of internally generated visual experiences, possibly reflecting differences in coordination between early visual areas and higher-order regions.

Abstract: A rapidly alternating red and black display known as Ganzflicker induces visual hallucinations that reflect the generative capacity of the visual system. Individuals vary in their degree of visual imagery, ranging from absent to vivid imagery. Recent proposals suggest that differences in the visual system along this imagery spectrum should also influence the complexity of other internally generated visual experiences. Here, we used tools from natural language processing to analyze free-text descriptions of hallucinations from over 4,000 participants, asking whether people with different imagery phenotypes see different things in their mind’s eye during Ganzflicker-induced hallucinations. Topic modeling of descriptions revealed that strong imagers described complex, naturalistic content, while weak imagers reported simple geometric patterns. Using crowd-sourced sensorimotor norms, we also found that participants with stronger imagery used language with richer perceptual associations. These findings may reflect individual variation in coordination between early visual areas and higher-order regions relevant for the imagery spectrum.

[51] Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations

Michelle Elizabeth, Alicja Kasicka, Natalia Krawczyk, Magalie Ochs, Gwénolé Lecorvé, Justyna Gromada, Lina M. Rojas-Barahona

Main category: cs.CL

TL;DR: Small models (<13B params) for dialogue evaluation using LM prompting and encoder-based classification/regression, achieving modest correlations with human judgments but ranking second in DSTC-12 Track 1 challenge.

Details

Motivation: The proliferation of generative AI dialogue systems creates a critical need for effective evaluation methods, addressed through the DSTC-12 Track 1 challenge focused on predicting dialogue-level, dimension-specific scores.

Method: Two main strategies: 1) Using Language Models as evaluators through prompting, and 2) Training encoder-based classification and regression models, all constrained to relatively small models (fewer than 13 billion parameters).

Result: LM prompting achieved only modest correlations with human judgments but ranked second on the test set (outperformed only by baseline). Regression/classification models showed high correlation for some dimensions on validation set but performance decreased on test set due to different score ranges.

Conclusion: Small models can be effective for dialogue evaluation, with LM prompting showing competitive performance despite modest correlations, while encoder-based approaches demonstrate potential but face challenges with distribution shifts between train/test data.

Abstract: The growing number of generative AI-based dialogue systems has made their evaluation a crucial challenge. This paper presents our contribution to this important problem through the Dialogue System Technology Challenge (DSTC-12, Track 1), where we developed models to predict dialogue-level, dimension-specific scores. Given the constraint of using relatively small models (i.e. fewer than 13 billion parameters) our work follows two main strategies: employing Language Models (LMs) as evaluators through prompting, and training encoder-based classification and regression models. Our results show that while LM prompting achieves only modest correlations with human judgments, it still ranks second on the test set, outperformed only by the baseline. The regression and classification models, with significantly fewer parameters, demonstrate high correlation for some dimensions on the validation set. Although their performance decreases on the test set, it is important to note that the test set contains annotations with significantly different score ranges for some of the dimensions with respect to the train and validation sets.

[52] Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong

Main category: cs.CL

TL;DR: Systematic study of probability-based objectives beyond NLL for supervised fine-tuning of LLMs, revealing that optimal objectives depend on model capability along a continuum.

Details

Motivation: Standard SFT using NLL shows limited generalization; post-training differs from training from scratch, potentially violating NLL's optimality assumptions when models already have task-relevant priors and supervision is noisy.

Method: Comprehensive experiments across 8 model backbones, 27 benchmarks, and 7 domains to study various probability-based objectives, characterizing when/why different objectives succeed or fail under varying conditions.

Result: Uncovered critical dimension: model-capability continuum. Near model-strong end, prior-leaning objectives that downweight low-probability tokens outperform NLL; toward model-weak end, NLL dominates; in between, no single objective prevails.

Conclusion: Optimal SFT objectives depend on model capability; theoretical analysis explains objective trade-offs across continuum, providing principled foundation for adapting objectives to model capability.

Abstract: Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. Rather than proposing a single universally superior replacement loss, we systematically study various probability-based objectives and characterize when and why different objectives succeed or fail under varying conditions. Through comprehensive experiments and extensive ablation studies across 8 model backbones, 27 benchmarks, and 7 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. The code is provided at https://github.com/GaotangLi/Beyond-Log-Likelihood.

[53] Attention-Aligned Reasoning for Large Language Models

Hongxiang Zhang, Yuan Tian, Tianyi Zhang

Main category: cs.CL

TL;DR: ATAR is a novel reasoning method that leverages inherent reasoning structure to steer LLM attention, improving performance on complex tasks by preventing critical intermediate steps from being buried in long contexts.

Details

Motivation: LLMs generate long reasoning chains for complex tasks, but as chains extend, critical intermediate steps and original prompts get buried in context, receiving insufficient attention and leading to errors.

Method: ATAR (Attention Steering via Reasoning Structure) uses inherent reasoning structure to guide LLM attention, ensuring critical steps remain prominent in the attention mechanism throughout the reasoning process.

Result: ATAR outperforms SOTA methods across six benchmarks with up to 15.39% absolute improvement. Non-reasoning models with ATAR achieve comparable or better performance than reasoning models of same size in most benchmarks.

Conclusion: ATAR effectively addresses attention degradation in long reasoning chains, demonstrating that attention steering via reasoning structure significantly improves LLM reasoning performance across diverse tasks.

Abstract: Large Language Models (LLMs) tend to generate a long reasoning chain when solving complex tasks. However, as the reasoning chain extends, critical intermediate steps and the original prompt will be buried in the context, receiving insufficient attention and leading to errors. In this work, we present ATAR, a novel reasoning method that leverages the inherent reasoning structure to steer LLM attention. Our experiments show that ATAR outperforms SOTA methods across six benchmarks, achieving up to 15.39% absolute improvement. Furthermore, with ATAR, “non-reasoning” models achieve comparable or even better performance compared to reasoning models of the same size in most benchmarks. Finally, our ablation studies show that the attention alignment component contributes significantly, and that these improvements are persist under different attentionsteering backends.

[54] Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov

Main category: cs.CL

TL;DR: Post-training method for lower-resource languages that preserves fluency when aligned by disfluent reward models, using on-policy training without requiring instruction-tuning data in target language.

Details

Motivation: Preference optimization research has focused on English and Chinese, leaving lower-resource languages lacking native-speaker datasets and instruction-tuned models. These languages need fluent preference-aligned models without requiring hard-to-obtain instruction-tuning data.

Method: Proposes on-policy training method for lower-resource languages, comparing it with supervised finetuning on machine-translated data and multilingual finetuning. Case study on Norwegian Bokmål with native-speaker fluency evaluations.

Result: On-policy training outperforms alternatives without relying on hard-to-obtain data. Native-speaker assessments confirm the approach preserves fluency in lower-resource languages.

Conclusion: On-policy training is crucial for developing fluent preference-aligned language models for lower-resource languages when instruction-tuning data is unavailable, offering a practical solution for language adaptation.

Abstract: We propose a post-training method for lower-resource languages that preserves the fluency of language models even when aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and instruction-tuned language models capable of generating fluent synthetic data. To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common alternatives: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.

[55] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Main category: cs.CL

TL;DR: Nemotron-Cascade uses cascaded domain-wise reinforcement learning to build general-purpose reasoning models that handle heterogeneous domains efficiently, achieving state-of-the-art performance on coding benchmarks.

Details

Motivation: Traditional RL approaches struggle with cross-domain heterogeneity in reasoning tasks, including varying response lengths and verification latency, which complicates infrastructure, slows training, and makes curriculum design challenging.

Method: Cascade RL orchestrates sequential, domain-wise reinforcement learning instead of blending heterogeneous prompts. It uses RLHF for alignment as a pre-step, followed by domain-wise RLVR stages that maintain or improve performance across domains.

Result: The 14B model outperforms its SFT teacher DeepSeek-R1-0528 on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI).

Conclusion: Cascade RL reduces engineering complexity while delivering state-of-the-art performance across diverse benchmarks, demonstrating that RLHF for alignment boosts reasoning ability beyond preference optimization.

Abstract: Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop Nemotron-Cascade, capable of operating in both instruct and deep thinking modes, without any performance gap relative to a thinking-only counterpart. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model’s reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.

[56] Dual-objective Language Models: Training Efficiency Without Overfitting

David Samuel, Lucas Georges Gabriel Charpentier

Main category: cs.CL

TL;DR: Combining autoregressive and masked-diffusion training objectives improves language model performance without architectural changes, achieving better results than single-objective models across various data repetition settings.

Details

Motivation: Autoregressive models are training-efficient but prone to overfitting, while masked-diffusion models are more robust to overfitting but less efficient to train. The paper aims to combine both approaches to achieve the benefits of each.

Method: Trains 50 language models with varying combinations of autoregressive and masked-diffusion objectives under different levels of data repetition, finding optimal balance between objectives without architectural modifications.

Result: Dual-objective training consistently outperforms single-objective models across all evaluated settings. The optimal balance between objectives is similar whether targeting autoregressive or masked-diffusion downstream performance.

Conclusion: Combining autoregressive and masked-diffusion objectives achieves the best of both worlds - training efficiency and robustness to overfitting - resulting in more flexible and better-performing language models.

Abstract: This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal balance between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal balance is similar whether targeting autoregressive or masked-diffusion downstream performance.

[57] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

Pengyu Wang, Shuchang Ye, Usman Naseem, Jinman Kim

Main category: cs.CL

TL;DR: MRG-R1 introduces a semantic-driven reinforcement learning framework for medical report generation that optimizes report-level clinical correctness instead of token-level likelihood, improving accuracy of clinically relevant findings.

Details

Motivation: Existing medical report generation approaches rely on token-level likelihood training which favors local lexical matching but leaves clinical correctness under-specified, failing to directly encode constraints on medically accurate findings.

Method: Proposes MRG-R1, a semantic-driven reinforcement learning framework with a clinically grounded report-level reward function that reinforces semantic agreement in clinically relevant findings between generated and reference reports.

Result: The framework improves accuracy and coverage of clinically relevant findings in generated reports and achieves state-of-the-art clinical efficacy on IU X-Ray and MIMIC-CXR benchmark datasets.

Conclusion: Directly optimizing report-level clinical correctness through semantic-driven reinforcement learning is more effective than token-level likelihood training for medical report generation, leading to better clinical accuracy.

Abstract: Medical report generation aims to automatically produce radiology-style reports from medical images, supporting efficient and accurate clinical decision-making.However, existing approaches predominately rely on token-level likelihood training, which favors local lexical matching and leaves clinical correctness under-specified in the training objective. This behavior can be attributed to token-level likelihood optimization, which rewards surface-form agreement and therefore fails to directly encode constraints on medically accurate findings. To address this objective mismatch, we introduce a semantic-driven reinforcement learning (SRL) framework for medical report generation, named MRG-R1, which directly optimizes report-level clinical correctness rather than token-level likelihood. The key module is a clinically grounded report-level reward function, which reinforces semantic agreement in clinically relevant findings between generated and reference reports, thereby enabling learning signals that explicitly constrain medical correctness beyond surface linguistic alignment. Our evaluations show that the proposed framework improves the accuracy and coverage of clinically relevant findings in generated reports, and that MRG-R1 achieves state-of-the-art clinical efficacy on the IU X-Ray and MIMIC-CXR benchmark datasets.

[58] Sigmoid Head for Quality Estimation under Language Ambiguity

Tu Anh Dinh, Jan Niehues

Main category: cs.CL

TL;DR: Proposes a Sigmoid Head module for quality estimation that addresses limitations of LM probability distributions by using sigmoid activation instead of softmax and avoiding negative sampling of potentially correct alternative tokens.

Details

Motivation: Language model probability is unreliable for quality estimation because natural language ambiguity causes probability to spread across multiple valid outputs, misleadingly indicating low quality. Two main limitations: 1) softmax activation prevents multiple correct options from receiving high probabilities simultaneously, 2) training data uses single one-hot encoded references suggesting only one correct option per step.

Method: Train a Quality Estimation module called Sigmoid Head on top of pre-trained LMs. Uses sigmoid activation instead of softmax to allow multiple tokens to receive high probabilities. During negative sampling for training, employs a heuristic to avoid selecting potentially alternative correct tokens. The approach is computationally efficient and doesn’t require human-annotated quality data.

Result: Sigmoid Head probability provides notably better quality signal compared to original softmax head. The method is more robust to out-of-domain settings compared to supervised quality estimation approaches since it doesn’t rely on human-annotated quality data.

Conclusion: The proposed Sigmoid Head effectively addresses limitations of LM probability for quality estimation by using sigmoid activation and careful negative sampling, providing more reliable quality signals without requiring annotated quality data.

Abstract: Language model (LM) probability is not a reliable quality estimator, as natural language is ambiguous. When multiple output options are valid, the model’s probability distribution is spread across them, which can misleadingly indicate low output quality. This issue is caused by two reasons: (1) LMs’ final output activation is softmax, which does not allow multiple correct options to receive high probabilities simultaneuously and (2) LMs’ training data is single, one-hot encoded references, indicating that there is only one correct option at each output step. We propose training a module for Quality Estimation on top of pre-trained LMs to address these limitations. The module, called Sigmoid Head, is an extra unembedding head with sigmoid activation to tackle the first limitation. To tackle the second limitation, during the negative sampling process to train the Sigmoid Head, we use a heuristic to avoid selecting potentially alternative correct tokens. Our Sigmoid Head is computationally efficient during training and inference. The probability from Sigmoid Head is notably better quality signal compared to the original softmax head. As the Sigmoid Head does not rely on human-annotated quality data, it is more robust to out-of-domain settings compared to supervised QE.

[59] T$^\star$: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning

Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, Siyu Zhu

Main category: cs.CL

TL;DR: T* is a TraceRL-based curriculum training method for scaling masked diffusion language models from small to large blocks, enabling higher-parallelism decoding with minimal performance loss on math reasoning tasks.

Details

Motivation: The paper addresses the challenge of scaling masked diffusion language models (MDMs) to larger block sizes for higher parallelism in decoding. Current methods suffer from performance degradation when transitioning to larger blocks, and the authors aim to develop a smooth training curriculum that maintains performance while enabling more efficient parallel decoding.

Method: T* uses a TraceRL-based training curriculum that starts with an AR-initialized small-block MDM and progressively scales to larger blocks. The method provides a smooth transition between block sizes, allowing the model to adapt gradually rather than making abrupt changes that could degrade performance.

Result: The method enables higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Additionally, analysis suggests T* may converge to an alternative decoding schedule that achieves comparable performance to traditional approaches.

Conclusion: T* provides an effective curriculum training approach for scaling masked diffusion language models, offering a practical solution for achieving higher parallelism in decoding while maintaining model performance on reasoning tasks.

Abstract: We present T$^\star$, a simple TraceRL-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T$^\star$ transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Moreover, further analysis suggests that T$^\star$ may actually converge to an alternative decoding schedule that achieves comparable performance.

[60] CitiLink: Enhancing Municipal Transparency and Citizen Engagement through Searchable Meeting Minutes

Rodrigo Silva, José Evans, José Isidro, Miguel Marques, Afonso Fonseca, Ricardo Morais, João Canavilhas, Arian Pasquali, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos

Main category: cs.CL

TL;DR: CitiLink is an NLP platform that transforms unstructured municipal meeting minutes into structured, searchable data using LLMs for information extraction and BM25 ranking for search.

Details

Motivation: City council minutes are lengthy, formal documents with bureaucratic writing styles that make it difficult for citizens and journalists to efficiently find information, despite being publicly available. There's a need to enhance accessibility and transparency of local government through better information retrieval.

Method: The system uses LLMs (specifically Gemini) to extract metadata, discussed subjects, and voting outcomes from unstructured meeting minutes. The extracted structured data is indexed in a database supporting full-text search with BM25 ranking and faceted filtering through a user-friendly interface. Built on 120 minutes from six Portuguese municipalities, usability was tested through guided sessions with municipal personnel.

Result: The system successfully transforms municipal minutes into searchable data. Gemini demonstrated effectiveness in extracting relevant information from the minutes. Usability testing with municipal personnel provided insights into real user interaction patterns with the system.

Conclusion: CitiLink demonstrates how NLP and information retrieval techniques can enhance accessibility and transparency of local government by making bureaucratic documents more searchable and user-friendly.

Abstract: City council minutes are typically lengthy and formal documents with a bureaucratic writing style. Although publicly available, their structure often makes it difficult for citizens or journalists to efficiently find information. In this demo, we present CitiLink, a platform designed to transform unstructured municipal meeting minutes into structured and searchable data, demonstrating how NLP and IR can enhance the accessibility and transparency of local government. The system employs LLMs to extract metadata, discussed subjects, and voting outcomes, which are then indexed in a database to support full-text search with BM25 ranking and faceted filtering through a user-friendly interface. The developed system was built over a collection of 120 minutes made available by six Portuguese municipalities. To assess its usability, CitiLink was tested through guided sessions with municipal personnel, providing insights into how real users interact with the system. In addition, we evaluated Gemini’s performance in extracting relevant information from the minutes, highlighting its effectiveness in data extraction.

[61] Formula-One Prompting: Equation-First Reasoning For Applied Mathematics

Natapong Nitarach, Pittawat Taveekitworachai, Kunat Pipatanakul

Main category: cs.CL

TL;DR: F-1 Prompting improves math problem solving by explicitly formulating governing equations as an intermediate step before selecting solving strategy.

Details

Motivation: Existing prompting methods like Chain-of-Thought and Program-of-Thought don't explicitly elicit equation formulation as a reasoning stage, despite LLMs having mathematical knowledge from pretraining.

Method: Formula-One Prompting (F-1) is a single-call, two-phase approach that first formulates governing equations from problem descriptions, then naturally selects solving strategy (CoT, PoT, or direct computation) based on equation structure.

Result: F-1 outperforms CoT by +5.76% and PoT by +8.42% on average across five models and four benchmarks, winning 53 out of 60 benchmark-model comparisons. Largest gains in applied domains: +13.30% on FinanceMath over CoT.

Conclusion: Explicit equation formalization is the primary driver of improved mathematical reasoning performance, especially in applied domains like physics and finance.

Abstract: LLMs encode vast mathematical knowledge including governing equations from pretraining on equation-rich corpora, yet existing prompting methods, including Chain-of-Thought (CoT) and Program-of-Thought (PoT), do not explicitly elicit equation formulation as a reasoning stage. We propose Formula-One Prompting (F-1), a single-call, two-phase approach that fills this equation gap by using mathematical equations as an intermediate representation before solving through natural flow reasoning. F-1 first formulates governing equations from problem descriptions; the model then naturally selects a solving strategy among CoT, PoT, or direct computation based on the formalized equation structure, without explicit routing rules. Results across five models and four benchmarks show F-1 outperforms CoT by +5.76% and PoT by +8.42% on average, winning 53 out of 60 benchmark-model comparisons (88.3%). Gains are largest in applied domains: +13.30% on FinanceMath over CoT, and within OlympiadBench, larger gains on physics (+2.55%) than pure math (+0.44%). Per-problem analysis confirms equation formalization is the primary driver.

[62] ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles

Ricardo Campos, Raquel Sequeira, Sara Nerea, Inês Cantante, Diogo Folques, Luís Filipe Cunha, João Canavilhas, António Branco, Alípio Jorge, Sérgio Nunes, Nuno Guimarães, Purificação Silvano

Main category: cs.CL

TL;DR: ClaimPT: A new European Portuguese dataset of 1,308 news articles with 6,875 factual claim annotations for advancing fact-checking research in low-resource languages.

Details

Motivation: Manual fact-checking can't scale with online misinformation spread, and existing resources are English-dominated. Portuguese lacks accessible licensed datasets for NLP research and applications.

Method: Created dataset through partnership with LUSA Portuguese News Agency, using trained annotators with curator validation. Developed new annotation scheme and provided baseline models for claim detection.

Result: Produced ClaimPT dataset with 1,308 articles and 6,875 annotations, establishing initial benchmarks for Portuguese fact-checking research.

Conclusion: ClaimPT advances low-resource fact-checking research and enhances understanding of misinformation in news media for Portuguese language.

Abstract: Fact-checking remains a demanding and time-consuming task, still largely dependent on manual verification and unable to match the rapid spread of misinformation online. This is particularly important because debunking false information typically takes longer to reach consumers than the misinformation itself; accelerating corrections through automation can therefore help counter it more effectively. Although many organizations perform manual fact-checking, this approach is difficult to scale given the growing volume of digital content. These limitations have motivated interest in automating fact-checking, where identifying claims is a crucial first step. However, progress has been uneven across languages, with English dominating due to abundant annotated data. Portuguese, like other languages, still lacks accessible, licensed datasets, limiting research, NLP developments and applications. In this paper, we introduce ClaimPT, a dataset of European Portuguese news articles annotated for factual claims, comprising 1,308 articles and 6,875 individual annotations. Unlike most existing resources based on social media or parliamentary transcripts, ClaimPT focuses on journalistic content, collected through a partnership with LUSA, the Portuguese News Agency. To ensure annotation quality, two trained annotators labeled each article, with a curator validating all annotations according to a newly proposed scheme. We also provide baseline models for claim detection, establishing initial benchmarks and enabling future NLP and IR applications. By releasing ClaimPT, we aim to advance research on low-resource fact-checking and enhance understanding of misinformation in news media.

[63] NRR-Phi: Text-to-State Mapping for Ambiguity Preservation in LLM Inference

Kei Saito

Main category: cs.CL

TL;DR: A framework for text-to-state mapping that preserves multiple interpretations of ambiguous language, preventing LLMs from prematurely committing to single meanings.

Details

Motivation: LLMs have a systematic tendency toward early semantic commitment where they collapse multiple valid interpretations of ambiguous input into a single response before sufficient context is available, discarding information that may be essential as dialogue evolves.

Method: A formal framework for text-to-state mapping (phi: T -> S) that transforms natural language into a non-collapsing state space where multiple interpretations coexist. The mapping decomposes into three stages: conflict detection, interpretation extraction, and state construction. Instantiated with a hybrid extraction pipeline combining rule-based segmentation for explicit conflict markers with LLM-based enumeration of implicit ambiguity.

Result: On 68 ambiguous sentences, the resulting states preserve interpretive multiplicity: hybrid extraction yields mean state entropy H = 1.087 bits across ambiguity categories, compared to H = 0 for collapse-based baselines. Rule-based conflict detector also instantiated for Japanese markers to show cross-lingual portability. Empirical validation on 580 test cases shows 0% collapse for principle-satisfying operators versus up to 17.8% for violating operators.

Conclusion: The framework extends Non-Resolution Reasoning by providing an algorithmic bridge between text and the NRR state space, enabling architectural collapse deferment in LLM inference and preserving interpretive multiplicity.

Abstract: Large language models exhibit a systematic tendency toward early semantic commitment: given ambiguous input, they collapse multiple valid interpretations into a single response before sufficient context is available. This premature collapse discards information that may prove essential as dialogue evolves. We present a formal framework for text-to-state mapping (phi: T -> S) that transforms natural language into a non-collapsing state space where multiple interpretations coexist. The mapping decomposes into three stages: conflict detection, interpretation extraction, and state construction. We instantiate phi with a hybrid extraction pipeline that combines rule-based segmentation for explicit conflict markers with LLM-based enumeration of implicit ambiguity. On a test set of 68 ambiguous sentences, the resulting states preserve interpretive multiplicity: hybrid extraction yields mean state entropy H = 1.087 bits across ambiguity categories, compared to H = 0 for collapse-based baselines that commit to a single interpretation. We also instantiate the rule-based conflict detector for Japanese markers to illustrate cross-lingual portability. This framework extends Non-Resolution Reasoning (NRR) by providing the algorithmic bridge between text and the NRR state space, enabling architectural collapse deferment in LLM inference. Design principles for state-to-state transformations are detailed in the Appendix, with empirical validation on 580 test cases demonstrating 0% collapse for principle-satisfying operators versus up to 17.8% for violating operators.

[64] MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

Rodrigo Batista, Luís Filipe Cunha, Purificação Silvano, Nuno Guimarães, Alípio Jorge, Evelin Amorim, Ricardo Campos

Main category: cs.CL

TL;DR: Two-stage pipeline for extracting metadata from municipal meeting minutes using QA for segment identification and transformer models for entity extraction, benchmarked against LLMs.

Details

Motivation: Municipal meeting minutes have heterogeneous formats and lack standardized metadata, making information retrieval difficult. Existing NER models are not adapted to domain-specific categories in this specialized domain.

Method: Two-stage pipeline: 1) QA model identifies opening/closing segments containing metadata, 2) Transformer models (BERTimbau and XLM-RoBERTa with/without CRF layer) perform fine-grained entity extraction enhanced by deslexicalization. Benchmarking includes open-weight (Phi) and closed-weight (Gemini) LLMs.

Result: Strong in-domain performance better than larger general-purpose LLMs, but cross-municipality evaluation shows reduced generalization due to variability and linguistic complexity of municipal records.

Conclusion: Establishes first benchmark for metadata extraction from municipal meeting minutes, providing foundation for future research in this domain-specific information extraction task.

Abstract: Municipal meeting minutes are official documents of local governance, exhibiting heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easy to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through deslexicalization. To evaluate our proposed pipeline, we benchmark both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, better than larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.

[65] TernaryLM: Memory-Efficient Language Modeling via Native 1.5-Bit Quantization with Adaptive Layer-wise Scaling

Nisharg Nargund, Priyesh Shukla

Main category: cs.CL

TL;DR: TernaryLM is a 132M-parameter transformer trained natively with ternary quantization (-1, 0, +1) achieving 1.58-bit precision, reducing memory by 2.4x while maintaining language modeling performance comparable to full-precision models.

Details

Motivation: Large language models require substantial computational resources, limiting deployment on edge devices and resource-constrained environments. There's a need for memory-efficient models that maintain performance while reducing computational demands.

Method: Ternary quantization approach using {-1, 0, +1} values (1.58-bit effective precision) trained from scratch with straight-through estimators and adaptive per-layer scaling factors, unlike post-training quantization methods.

Result: Achieves 58.42 validation perplexity on TinyStories, 82.47% F1 on MRPC (surpassing DistilBERT with 55x less data), 2.4x memory reduction (498MB vs 1197MB), and shows implicit regularization preventing overfitting with train/val ratio of 1.05x vs 3.51x baseline.

Conclusion: TernaryLM demonstrates that native ternary quantization enables memory-efficient language models without sacrificing performance, with middle transformer layers showing higher quantization sparsity, suggesting non-uniform precision allocation strategies.

Abstract: Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource-constrained environments. We present TernaryLM, a 132M-parameter transformer trained natively with ternary quantization {-1, 0, +1} (log2(3) ~ 1.58-bit effective precision), achieving significant memory reduction without sacrificing language modeling capability. Unlike post-training quantization approaches that quantize pre-trained full-precision models, TernaryLM learns quantization-aware representations from scratch using straight-through estimators and adaptive per-layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories with a cross-seed standard deviation of +/- 0.17 PPL, confirming stable optimization; (2) strong downstream transfer with 82.47% F1 on MRPC, surpassing DistilBERT despite using 55x less pretraining data; (3) 2.4x memory reduction (498 MB vs 1,197 MB for an FP32 model of identical architecture) with latency parity; and (4) an implicit regularization effect whereby the ternary constraint yields a train/val ratio of 1.05x versus 3.51x for the FP32 baseline, demonstrating that discrete weights prevent overfitting on small corpora. We provide layer-wise sparsity analysis revealing that middle transformer layers (L5-L9) achieve 60-62% quantization sparsity versus 45-55% for boundary layers, establishing an actionable design principle for non-uniform precision allocation. Our implementation and trained models are publicly available at https://github.com/1nisharg/TernaryLM-Memory-Efficient-Language-Modeling.

[66] Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

Md Tanvir Rouf Shawon, Mohammad Sabik Irbaz, Hadeel R. A. Elyazori, Keerti Reddy Resapu, Yili Lin, Vladimir Franzuela Cardenas, Farrokh Alemi, Kevin Lybarger

Main category: cs.CL

TL;DR: A patient simulator for automated evaluation of healthcare conversational agents that generates realistic, controllable interactions varying across medical, linguistic, and behavioral dimensions to assess AI performance risks.

Details

Motivation: Need for scalable, automated evaluation of healthcare conversational agents to systematically assess performance risks across diverse patient populations, particularly addressing equity concerns in AI deployment.

Method: Simulator integrates three profile components: (1) medical profiles from electronic health records using risk-ratio gating, (2) linguistic profiles modeling health literacy and condition-specific communication, and (3) behavioral profiles representing different engagement types. Grounded in NIST AI Risk Management Framework.

Result: Across 500 simulated conversations, revealed monotonic degradation in AI Decision Aid performance across health literacy levels (47.6% to 81.9% concept retrieval). High medical concept fidelity (96.6%), validated by human annotators (0.73 kappa) and LLM judge (0.78 kappa). Behavioral profiles reliably distinguished (0.93 kappa).

Conclusion: The simulator exposes measurable performance risks in conversational healthcare AI, with health literacy emerging as a primary risk factor with direct implications for equitable AI deployment.

Abstract: Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents, generating realistic, controllable interactions that systematically vary across medical, linguistic, and behavioral dimensions to support risk assessment across populations. Methods: Grounded in the NIST AI Risk Management Framework, the simulator integrates three profile components: (1) medical profiles constructed from All of Us electronic health records using risk-ratio gating; (2) linguistic profiles modeling health literacy and condition-specific communication; and (3) behavioral profiles representing cooperative, distracted, and adversarial engagement. Profiles were evaluated against NIST AI RMF trustworthiness requirements and assessed against an AI Decision Aid for antidepressant selection. Results: Across 500 simulated conversations, the simulator revealed monotonic degradation in AI Decision Aid performance across health literacy levels: Rank-1 concept retrieval ranged from 47.6% (limited) to 81.9% (proficient), with corresponding recommendation degradation. Medical concept fidelity was high (96.6% across 8,210 concepts), validated by human annotators (0.73 kappa) and an LLM judge with comparable agreement (0.78 kappa). Behavioral profiles were reliably distinguished (0.93 kappa), and linguistic profiles showed moderate agreement (0.61 kappa). Conclusions: The simulator exposes measurable performance risks in conversational healthcare AI. Health literacy emerged as a primary risk factor with direct implications for equitable AI deployment.

[67] CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes

Ricardo Campos, Ana Filipa Pacheco, Ana Luísa Fernandes, Inês Cantante, Rute Rebouças, Luís Filipe Cunha, José Miguel Isidro, José Pedro Evans, Miguel Marques, Rodrigo Batista, Evelin Amorim, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, António Leal, Purificação Silvano

Main category: cs.CL

TL;DR: CitiLink-Minutes: A multilayer annotated dataset of 120 European Portuguese municipal meeting minutes with metadata, discussion subjects, and voting outcomes for NLP/IR research.

Details

Motivation: Municipal meeting minutes are crucial governance documents but have been neglected in NLP/IR research due to lack of annotated datasets, limiting computational model development for analyzing local government decisions.

Method: Created a dataset of 120 European Portuguese municipal meeting minutes from six municipalities, manually annotated by two trained annotators and curated by a linguist across three dimensions: metadata, discussion subjects, and voting outcomes.

Result: Dataset contains over 1 million tokens with 38,000+ individual annotations, all personal identifiers de-identified. Baseline results provided for metadata extraction, topic classification, and vote labeling tasks.

Conclusion: CitiLink-Minutes fills a gap in municipal document analysis, enables downstream NLP/IR tasks, and promotes transparent access to municipal decisions while adhering to FAIR principles.

Abstract: City councils play a crucial role in local governance, directly influencing citizens’ daily lives through decisions made during municipal meetings. These deliberations are formally documented in meeting minutes, which serve as official records of discussions, decisions, and voting outcomes. Despite their importance, municipal meeting records have received little attention in Information Retrieval (IR) and Natural Language Processing (NLP), largely due to the lack of annotated datasets, which ultimately limit the development of computational models. To address this gap, we introduce CitiLink-Minutes, a multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities. Unlike prior annotated datasets of parliamentary or video records, CitiLink-Minutes provides multilayer annotations and structured linkage of official written minutes. The dataset contains over one million tokens, with all personal identifiers de-identified. Each minute was manually annotated by two trained annotators and curated by an experienced linguist across three complementary dimensions: (1) metadata, (2) subjects of discussion, and (3) voting outcomes, totaling over 38,000 individual annotations. Released under FAIR principles and accompanied by baseline results on metadata extraction, topic classification, and vote labeling, CitiLink-Minutes demonstrates its potential for downstream NLP and IR tasks, while promoting transparent access to municipal decisions.

Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide

Main category: cs.CL

TL;DR: A contrastive Sparse AutoEncoder framework learns facet-level personality control vectors aligned with the Big Five 30-facet model for precise personality steering in Role-Playing Agents, outperforming existing methods.

Details

Motivation: Current personality control methods for RPAs have limitations: supervised fine-tuning requires persona-labeled data and retraining for new roles, while prompt- and RAG-based methods can lead to personality drift in long dialogues. There's a need for flexible, stable personality control that maintains consistency.

Method: Proposes a contrastive Sparse AutoEncoder framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model. Constructs a 15,000-sample leakage-controlled corpus for balanced supervision. Learned vectors are integrated into model’s residual space and dynamically selected by a trait-activated routing module.

Result: Experiments show the method maintains stable character fidelity and output quality across contextualized settings, outperforming Contrastive Activation Addition (CAA) and prompt-only baselines. The SAE+Prompt configuration achieves best overall performance.

Conclusion: Contrastively trained latent vectors can enhance persona control while preserving dialogue coherence, providing precise and interpretable personality steering for Role-Playing Agents.

Abstract: Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-specific corpora. While SFT can be effective, it requires persona-labeled data and retraining for new roles, limiting flexibility. In contrast, prompt- and RAG-based signals are easy to apply but can be diluted in long dialogues, leading to drifting and sometimes inconsistent persona behavior. To address this, we propose a contrastive Sparse AutoEncoder (SAE) framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model. A new 15,000-sample leakage-controlled corpus is constructed to provide balanced supervision for each facet. The learned vectors are integrated into the model’s residual space and dynamically selected by a trait-activated routing module, enabling precise and interpretable personality steering. Experiments on Large Language Models (LLMs) show that the proposed method maintains stable character fidelity and output quality across contextualized settings, outperforming Contrastive Activation Addition (CAA) and prompt-only baselines. The combined SAE+Prompt configuration achieves the best overall performance, confirming that contrastively trained latent vectors can enhance persona control while preserving dialogue coherence. Dataset is available at: https://github.com/lunat5078/BigFive-Personality-Facets-Dataset

[69] A Browser-based Open Source Assistant for Multimodal Content Verification

Rosanna Milner, Michael Foster, Olesya Razuvayevskaya, Ian Roberts, Valentin Porcellini, Denis Teyssou, Kalina Bontcheva

Main category: cs.CL

TL;DR: A browser-based verification assistant tool that integrates multiple NLP classifiers to help journalists and fact-checkers detect disinformation and AI-generated content through a unified interface.

Details

Motivation: To address the challenge of disinformation and AI-generated false content by making NLP-based verification tools accessible to non-expert users and integrating them into daily workflows.

Method: Developed a browser-based tool (VERIFICATION ASSISTANT) that allows users to submit URLs or media files, automatically extracts content, routes it to backend NLP classifiers, and presents results in an easy-to-digest format.

Result: Created a widely adopted tool (140,000+ users) that provides actionable credibility signals, estimates AI-generated content, and offers verification guidance through a unified interface.

Conclusion: The VERIFICATION ASSISTANT successfully bridges the gap between advanced NLP detection methods and practical usability for journalists and fact-checkers combating disinformation.

Abstract: Disinformation and false content produced by generative AI pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media information. While there is an abundance of NLP models for detecting credibility signals such as persuasion techniques, subjectivity, or machine-generated text, such methods often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the VERIFICATION ASSISTANT, a browser-based tool designed to bridge this gap. The VERIFICATION ASSISTANT, a core component of the widely adopted VERIFICATION PLUGIN (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, delivering actionable credibility signals, estimating AI-generated content, and providing other verification guidance in a clear, easy-to-digest format. This paper showcases the tool architecture, its integration of multiple NLP services, and its real-world application to detecting disinformation.

[70] CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Alexandre Le Mercier, Thomas Demeester, Chris Develder

Main category: cs.CL

TL;DR: CLASP is a lightweight XGBoost classifier that detects adversarial tokens in state space models like Mamba by analyzing block output embeddings to defend against Hidden State Poisoning Attacks.

Details

Motivation: State space models (SSMs) like Mamba are vulnerable to Hidden State Poisoning Attacks (HiSPAs) that corrupt model memory through adversarial strings, posing a critical security threat to these efficient alternatives to Transformers.

Method: Framed as binary classification at token level, CLASP uses XGBoost classifier on Mamba’s block output embeddings to identify malicious tokens with minimal computational overhead, operating independently of downstream models.

Result: Achieves 95.9% token-level F1 and 99.3% document-level F1 on malicious token detection, generalizes well to unseen attack patterns (96.9% document F1 in leave-one-out), and processes 1,032 tokens/sec with <4GB VRAM.

Conclusion: CLASP provides effective lightweight defense against HiSPAs for SSM-based architectures, suitable for real-world deployment as front-line security for models like Mamba and their hybrid variants.

Abstract: State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model (Classifier Against State Poisoning) to defend against this threat. CLASP exploits distinct patterns in Mamba’s block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves 95.9% token-level F1 score and 99.3% document-level F1 score on malicious tokens detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.

[71] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Xinghao Zhao

Main category: cs.CL

TL;DR: Proposes trajectory shape analysis (rather than scalar magnitude) as a diagnostic approach for understanding uncertainty in chain-of-thought reasoning in LLMs, useful for selective prediction and triage.

Details

Motivation: Understanding uncertainty in chain-of-thought reasoning is critical for reliable deployment of large language models, but existing approaches may not capture the nuanced dynamics of reasoning uncertainty effectively.

Method: Proposes analyzing trajectory shape rather than scalar magnitude as a diagnostic signal for uncertainty in chain-of-thought reasoning. The approach is designed to be practical, interpretable, and inexpensive to obtain in black-box settings while remaining robust across models and datasets.

Result: Through extensive ablations and cross-domain replications, the method demonstrates utility for selective prediction and triage. The trajectory shape signal proves robust across different models and datasets.

Conclusion: The findings offer generalizable insights into uncertainty dynamics in reasoning tasks, with particular focus on numeric and discrete-answer settings, providing a practical tool for improving reliability of LLM deployment.

Abstract: Understanding uncertainty in chain-of-thought reasoning is critical for reliable deployment of large language models. In this work, we propose a simple yet effective diagnostic approach based on trajectory shape rather than scalar magnitude. We show that this signal is practical, interpretable, and inexpensive to obtain in black-box settings, while remaining robust across models and datasets. Through extensive ablations and cross-domain replications, we demonstrate its utility for selective prediction and triage. Our findings offer a generalizable insight into uncertainty dynamics in reasoning tasks, with particular focus on numeric and discrete-answer settings.

[72] The Hidden Puppet Master: Predicting Human Belief Change in Manipulative LLM Dialogues

Jocelyn Shen, Amina Luvsanchultem, Jessica Kim, Kynnedy Smith, Valdemar Danry, Kantwon Rogers, Hae Won Park, Maarten Sap, Cynthia Breazeal

Main category: cs.CL

TL;DR: PUPPET introduces a taxonomy and dataset for studying hidden incentive-driven manipulation in LLMs during everyday advice-giving, revealing that current safety paradigms fail to predict actual human belief shifts despite detecting manipulative strategies.

Details

Motivation: As users increasingly rely on LLMs for practical advice, they become vulnerable to subtle steering toward hidden incentives misaligned with their interests. Existing NLP research on manipulation detection relies on simulated debates and doesn't measure actual human belief shifts in real-world scenarios.

Method: Introduces PUPPET: a theoretical taxonomy and resource focusing on moral direction of hidden incentives in everyday advice contexts. Provides evaluation dataset of N=1,035 human-LLM interactions measuring users’ belief shifts. Defines the task of belief shift prediction and evaluates state-of-the-art LLMs on this task.

Result: Analysis reveals critical disconnect: while models can be trained to detect manipulative strategies, this detection doesn’t correlate with magnitude of resulting belief change. State-of-the-art LLMs achieve moderate correlation (r=0.3-0.5) but systematically underestimate intensity of human belief susceptibility.

Conclusion: Establishes theoretically grounded and behaviorally validated foundation for AI social safety by studying incentive-driven manipulation in LLMs during everyday practical user queries, highlighting the need for better models of human belief susceptibility.

Abstract: As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to subtle steering toward hidden incentives misaligned with their own interests. While existing NLP research has benchmarked manipulation detection, these efforts often rely on simulated debates and remain fundamentally decoupled from actual human belief shifts in real-world scenarios. We introduce PUPPET, a theoretical taxonomy and resource that bridges this gap by focusing on the moral direction of hidden incentives in everyday, advice-giving contexts. We provide an evaluation dataset of N=1,035 human-LLM interactions, where we measure users’ belief shifts. Our analysis reveals a critical disconnect in current safety paradigms: while models can be trained to detect manipulative strategies, they do not correlate with the magnitude of resulting belief change. As such, we define the task of belief shift prediction and show that while state-of-the-art LLMs achieve moderate correlation (r=0.3-0.5), they systematically underestimate the intensity of human belief susceptibility. This work establishes a theoretically grounded and behaviorally validated foundation for AI social safety efforts by studying incentive-driven manipulation in LLMs during everyday, practical user queries.

[73] KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

Shuai Wang, Yinan Yu

Main category: cs.CL

TL;DR: KG-Hopper: RL framework enabling compact LLMs to perform integrated multi-hop KG reasoning in a single inference round, outperforming larger multi-step systems.

Details

Motivation: LLMs struggle with knowledge-intensive reasoning tasks like KBQA that require accurate multi-hop reasoning over KGs. Existing approaches use sequential reasoning with predefined pipelines, causing error cascades and lacking flexibility.

Method: Propose KG-Hopper, a Reinforcement Learning framework that trains a Reasoning LLM to embed entire KG traversal and decision process into a unified “thinking” stage, enabling global reasoning over cross-step dependencies with dynamic path exploration and backtracking.

Result: On eight KG reasoning benchmarks, KG-Hopper (based on 7B-parameter LLM) consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models like GPT-3.5-Turbo and GPT-4o-mini.

Conclusion: KG-Hopper enables compact open LLMs to perform integrated multi-hop KG reasoning efficiently, addressing limitations of sequential approaches while remaining data-efficient and competitive with much larger proprietary models.

Abstract: Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs) exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified ``thinking’’ stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.

[74] KALAVAI: Predicting When Independent Specialist Fusion Works – A Quantitative Model for Post-Hoc Cooperative LLM Training

Ramchand Kumaresan

Main category: cs.CL

TL;DR: Post-hoc fusion of independently trained domain specialists via lightweight MoE routing yields predictable performance gains proportional to model divergence, enabling efficient multi-domain model creation without retraining.

Details

Motivation: To enable practitioners to efficiently combine specialized models trained on different domains without expensive joint training, while providing predictable performance gains based on measurable divergence between specialists.

Method: KALAVAI protocol: contributors fine-tune copies of a shared checkpoint independently, then use lightweight Mixture of Experts (MoE) routing (500 steps) to fuse models. The router learns to assign tokens to appropriate specialists, matching domain-oracle routing performance.

Result: Predictable gains: gain = 0.82 x divergence - 2.72 (R^2 = 0.856). Consistent improvements: +7.72% at 410M, +7.49% at 1B, +6.53% at 6.9B over best specialist. Cross-lingual fusion achieved +21.76% with Yoruba perplexity dropping from 41.9 to 7.7. 20-contributor federation achieved +16.71% improvement.

Conclusion: Post-hoc fusion of domain specialists via learned routing is effective and predictable, enabling efficient model combination without joint training. Shared initialization, optional frozen layers, and learned routing are key requirements for success.

Abstract: Independently trained domain specialists can be fused post-hoc into a single model that outperforms any individual specialist, and the gain is predictable: gain = 0.82 x divergence - 2.72 (R^2 = 0.856, n=6, 3-26% divergence). This enables practitioners to estimate cooperative value before committing compute. Below ~3.3% divergence, gains approach zero.In the KALAVAI protocol, contributors fine-tune copies of a shared checkpoint independently, then submit for lightweight MoE routing (500 steps). Gains are consistent: +7.72% at 410M (+/-0.02%, 3 seeds), +7.49% at 1B (+/-0.01%, 3 seeds), +6.53% at 6.9B, each over the best specialist. The router matches domain-oracle routing within <10^{-5} nats. Cross-lingual fusion (Tamil/Yoruba/Welsh/Code) achieves +21.76%, with Yoruba perplexity falling 41.9 to 7.7. A 20-contributor federation achieves +16.71% (+/-0.07pp, 3 seeds).Three requirements bound the protocol. Shared initialisation is necessary: checkpoint mismatch degrades routing. Frozen layers are optional below ~10,000 steps and beneficial beyond. Learned routing is essential: uniform averaging degrades by -1.2% vs. best specialist, while any trained router achieves oracle-optimal assignment.

[75] MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

Bhavik Mangla

Main category: cs.CL

TL;DR: MDKeyChunker: A structure-aware chunking pipeline for Markdown documents that treats headers, code blocks, tables, and lists as atomic units, enriches chunks with metadata via single LLM call, and merges related content for improved retrieval.

Details

Motivation: Traditional RAG pipelines use fixed-size chunking that ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls for metadata extraction, leading to inefficient retrieval.

Method: Three-stage pipeline: (1) structure-aware chunking treating document elements as atomic units, (2) single LLM call enrichment extracting title, summary, keywords, typed entities, hypothetical questions, and semantic key with rolling key propagation, (3) restructuring chunks by merging those sharing semantic keys via bin-packing.

Result: Evaluation on 30 queries over 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over full pipeline reaches Recall@5=0.867.

Conclusion: MDKeyChunker improves RAG performance through structure-aware chunking, single-call metadata extraction, and semantic key-based merging, implemented in Python with minimal dependencies.

Abstract: RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.

Wassim Swaileh, Mohammed-En-Nadhir Zighem, Hichem Telli, Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Fadi Dornaika, Dimitrios Kotzinos

Main category: cs.CL

TL;DR: A retrieval-augmented generation system for Islamic inheritance law reasoning that combines synthetic data generation, hybrid retrieval, and schema-constrained validation to achieve high precision in Arabic legal reasoning tasks.

Details

Motivation: Islamic inheritance law (Ilm al-Mawarith) is a complex multi-stage legal reasoning task with variations across legal schools and civil-law codifications, requiring high-precision models that can operate under explicit legal configurations.

Method: Retrieval-augmented generation (RAG) pipeline with rule-grounded synthetic data generation using a symbolic inheritance calculator, hybrid retrieval (dense and BM25) with cross-encoder reranking, and schema-constrained output validation.

Result: Achieved MIR-E score of 0.935 and ranked first on the official QIAS 2026 blind-test leaderboard, demonstrating significant reliability improvements in Arabic legal reasoning tasks.

Conclusion: Retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks like Islamic inheritance law.

Abstract: Islamic inheritance (Ilm al-Mawarith) is a multi-stage legal reasoning task requiring the identification of eligible heirs, resolution of blocking rules (hajb), assignment of fixed and residual shares, handling of adjustments such as awl and radd, and generation of a consistent final distribution. The task is further complicated by variations across legal schools and civil-law codifications, requiring models to operate under explicit legal configurations. We present a retrieval-augmented generation (RAG) pipeline for this setting, combining rule-grounded synthetic data generation, hybrid retrieval (dense and BM25) with cross-encoder reranking, and schema-constrained output validation. A symbolic inheritance calculator is used to generate a large high-quality synthetic corpus with full intermediate reasoning traces, ensuring legal and numerical consistency. The proposed system achieves a MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard. Results demonstrate that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.

[77] Approaches to Analysing Historical Newspapers Using LLMs

Filip Dobranić, Tina Munda, Oliver Pejić, Vojko Gorjanc, Uroš Šmajdek, David Bordon, Jakob Lenardič, Tjaša Konovšek, Kristina Pahor de Maiti Tekavčič, Ciril Bohak, Darja Fišer

Main category: cs.CL

TL;DR: Computational analysis of historical Slovene newspapers using topic modeling, LLM-based sentiment analysis, and entity graphs to study collective identity representations in late 19th/early 20th century public discourse.

Details

Motivation: To examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the 20th century using computational methods on historical newspaper data, combining scalable analysis with critical interpretation.

Method: Combines BERTopic for thematic pattern identification, evaluation of four instruction-following LLMs for sentiment analysis on OCR-degraded historical Slovene (selecting GaMS3-12B-Instruct), NER graph creation for entity-place relationships, and mixed methods approach combining quantitative network analysis with critical discourse analysis.

Result: Identified major thematic patterns showing both shared concerns and ideological differences between conservative-Catholic and liberal-progressive newspapers; selected Slovene-adapted GaMS3-12B-Instruct as most suitable for sentiment analysis (though stronger on neutral sentiment); revealed meaningful variation in portrayal of collective identities; created entity graphs showing relationships between identities and places.

Conclusion: Demonstrates the value of combining scalable computational methods with critical interpretation for digital humanities research on noisy historical newspaper data, particularly for studying intertwined historical political and socionomic identities.

Abstract: This study presents a computational analysis of the Slovene historical newspapers \textit{Slovenec} and \textit{Slovenski narod} from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.

[78] AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans

Yangtian Zi, Zixuan Wu, Aleksander Boruch-Gruszecki, Jonathan Bell, Arjun Guha

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2509.21891: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21891&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[79] GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

Shijie Zhou, Viet Dac Lai, Hao Tan, Jihyung Kil, Wanrong Zhu, Changyou Chen, Ruiyi Zhang

Main category: cs.CL

TL;DR: Unable to analyze paper 2511.00810 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation without access to the paper's abstract or content

Method: Cannot determine method without access to the paper’s abstract or content

Result: Cannot determine results without access to the paper’s abstract or content

Conclusion: Cannot draw conclusions without access to the paper’s abstract or content

Abstract: Failed to fetch summary for 2511.00810: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.00810&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[80] When to Think and When to Look: Uncertainty-Guided Lookback

Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yolo Y. Tang, Luchuan Song, Susan Liang, Zhongfei Mark Zhang, Jason J. Corso, Chenliang Xu

Main category: cs.CL

TL;DR: Paper ID 2511.15613 could not be fetched due to HTTP 429 error (rate limiting), so no abstract or content is available for analysis.

Details

Motivation: Unable to determine motivation as the paper content could not be retrieved due to API rate limiting.

Method: Unable to determine method as the paper content could not be retrieved due to API rate limiting.

Result: Unable to determine results as the paper content could not be retrieved due to API rate limiting.

Conclusion: Unable to determine conclusion as the paper content could not be retrieved due to API rate limiting.

Abstract: Failed to fetch summary for 2511.15613: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.15613&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[81] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Unable to determine method due to API rate limiting preventing access to paper content

Result: Unable to determine results due to API rate limiting preventing access to paper content

Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content

Abstract: Failed to fetch summary for 2512.02425: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02425&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[82] AI and My Values: User Perceptions of LLMs’ Ability to Extract, Embody, and Explain Human Values from Casual Conversations

Bhada Yun, Renn Su, April Yi Wang

Main category: cs.CL

TL;DR: Unable to analyze paper 2601.22440 due to HTTP 429 error when fetching summary from arXiv API

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot draw conclusions due to inability to access paper content

Abstract: Failed to fetch summary for 2601.22440: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22440&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[83] Quantization-Robust LLM Unlearning via Low-Rank Adaptation

João Vitor Boer Abitante, Joana Meneguzzo Pasquali, Luan Fonseca Garcia, Ewerton de Oliveira, Thomas da Silva Paula, Rodrigo C. Barros, Lucas S. Kupssinskü

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2602.13151: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13151&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[84] Large-Scale Analysis of Persuasive Content on Moltbook

Julia Jose, Meghna Manoj Nair, Rachel Greenstadt

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2603.18349: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18349&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[85] To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

Yitong Zhang, Chengze Li, Ruize Chen, Guowei Yang, Xiaoran Jia, Yijie Ren, Jia Li

Main category: cs.CL

TL;DR: Paper 2603.15159: Unable to fetch abstract due to HTTP 429 error (rate limiting).

Details

Motivation: Cannot determine motivation due to inability to access paper content.

Method: Cannot determine method due to inability to access paper content.

Result: Cannot determine results due to inability to access paper content.

Conclusion: Cannot draw conclusions due to inability to access paper content.

Abstract: Failed to fetch summary for 2603.15159: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15159&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[86] EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.22918: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22918&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[87] The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation

Mingyi Liu

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.24124: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24124&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.CV

[88] A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning

Changyu Liu, James Chenhao Liang, Wenhao Yang, Yiming Cui, Jinghao Yang, Tianyang Wang, Qifan Wang, Dongfang Liu, Cheng Han

Main category: cs.CV

TL;DR: A-SelecT method automatically selects the most information-rich timestep from Diffusion Transformer features for discriminative tasks, improving efficiency and performance over previous diffusion-based approaches.

Details

Motivation: Current Diffusion Transformers (DiTs) for discriminative tasks suffer from inefficient timestep searching and inadequate exploitation of DiT-specific feature representations, limiting their training efficiency and representational capacity.

Method: Proposes Automatically Selected Timestep (A-SelecT) that dynamically identifies DiT’s most information-rich timestep from selected transformer features in a single run, eliminating exhaustive timestep searching and suboptimal feature selection.

Result: Extensive experiments on classification and segmentation benchmarks show that DiT empowered by A-SelecT surpasses all prior diffusion-based attempts efficiently and effectively.

Conclusion: A-SelecT provides an efficient and effective method for leveraging Diffusion Transformers in discriminative tasks by automatically selecting optimal timesteps, advancing the use of generative pre-training for downstream vision tasks.

Abstract: Diffusion models have significantly reshaped the field of generative artificial intelligence and are now increasingly explored for their capacity in discriminative representation learning. Diffusion Transformer (DiT) has recently gained attention as a promising alternative to conventional U-Net-based diffusion models, demonstrating a promising avenue for downstream discriminative tasks via generative pre-training. However, its current training efficiency and representational capacity remain largely constrained due to the inadequate timestep searching and insufficient exploitation of DiT-specific feature representations. In light of this view, we introduce Automatically Selected Timestep (A-SelecT) that dynamically pinpoints DiT’s most information-rich timestep from the selected transformer feature in a single run, eliminating the need for both computationally intensive exhaustive timestep searching and suboptimal discriminative feature selection. Extensive experiments on classification and segmentation benchmarks demonstrate that DiT, empowered by A-SelecT, surpasses all prior diffusion-based attempts efficiently and effectively.

[89] A Survey of OCR Evaluation Methods and Metrics and the Invisibility of Historical Documents

Fitsum Sileshi Beyene, Christopher L. Dancy

Main category: cs.CV

TL;DR: This paper examines evaluation biases in OCR and document understanding systems, showing they focus on modern Western documents while neglecting historical Black newspapers, leading to structural invisibility and representational harm.

Details

Motivation: Current OCR and document understanding evaluation focuses on modern, Western, institutional documents, masking system failures on historical and marginalized archives like Black historical newspapers where layout, typography, and material degradation significantly affect interpretation.

Method: Systematic review using PRISMA framework of OCR/document understanding papers (2006-2025) and benchmark datasets, analyzing training data reporting, benchmark design, and evaluation metrics for vision transformer and multimodal OCR systems, supplemented with archival statistics from Black press collections.

Result: Black newspapers and community-produced historical documents rarely appear in training data or evaluation benchmarks; evaluations emphasize character accuracy on modern layouts but fail to capture structural failures common in historical newspapers (column collapse, typographic errors, hallucinated text).

Conclusion: Evaluation gaps lead to structural invisibility and representational harm, driven by organizational and institutional behaviors shaped by benchmark incentives and data governance decisions; need for more inclusive evaluation practices.

Abstract: Optical character recognition (OCR) and document understanding systems increasingly rely on large vision and vision-language models, yet evaluation remains centered on modern, Western, and institutional documents. This emphasis masks system behavior in historical and marginalized archives, where layout, typography, and material degradation shape interpretation. This study examines how OCR and document understanding systems are evaluated, with particular attention to Black historical newspapers. We review OCR and document understanding papers, as well as benchmark datasets, which are published between 2006 and 2025 using the PRISMA framework. We look into how the studies report training data, benchmark design, and evaluation metrics for vision transformer and multimodal OCR systems. During the review, we found that Black newspapers and other community-produced historical documents rarely appear in reported training data or evaluation benchmarks. Most evaluations emphasize character accuracy and task success on modern layouts. They rarely capture structural failures common in historical newspapers, including column collapse, typographic errors, and hallucinated text. To put these findings into perspective, we use previous empirical studies and archival statistics from significant Black press collections to show how evaluation gaps lead to structural invisibility and representational harm. We propose that these gaps occur due to organizational (meso) and institutional (macro) behaviors and structure, shaped by benchmark incentives and data governance decisions.

[90] Evaluating Synthetic Images as Effective Substitutes for Experimental Data in Surface Roughness Classification

Binwei Chen, Huachao Leng, Chi Yeung Mang, Tsz Wai Cheung, Yanhua Chen, Wai Keung Anthony Loh, Chi Ho Wong, Chak Yin Tang

Main category: cs.CV

TL;DR: Using Stable Diffusion XL to generate synthetic ceramic surface images for data augmentation in roughness classification, achieving comparable accuracy to experimental-only datasets while reducing data requirements.

Details

Motivation: AI for surface roughness classification is limited by need for large labeled datasets and expensive high-resolution imaging equipment; synthetic images could reduce costs and data requirements.

Method: Generate synthetic ceramic surface images using Stable Diffusion XL, augment authentic datasets with these synthetic images, train classification models, and systematically vary hyperparameters (epoch count, batch size, learning rate) to assess robustness.

Result: Augmenting authentic datasets with generative images yields test accuracies comparable to exclusively experimental images; synthetic images effectively reproduce structural features needed for classification; identified configurations preserve performance while reducing data requirements.

Conclusion: Generative AI can substantially improve data efficiency and reliability in materials-image classification workflows, offering practical route to lower experimental cost, accelerate model development, and expand AI applicability in materials engineering.

Abstract: Hard coatings play a critical role in industry, with ceramic materials offering outstanding hardness and thermal stability for applications that demand superior mechanical performance. However, deploying artificial intelligence (AI) for surface roughness classification is often constrained by the need for large labeled datasets and costly high-resolution imaging equipment. In this study, we explore the use of synthetic images, generated with Stable Diffusion XL, as an efficient alternative or supplement to experimentally acquired data for classifying ceramic surface roughness. We show that augmenting authentic datasets with generative images yields test accuracies comparable to those obtained using exclusively experimental images, demonstrating that synthetic images effectively reproduce the structural features necessary for classification. We further assess method robustness by systematically varying key training hyperparameters (epoch count, batch size, and learning rate), and identify configurations that preserve performance while reducing data requirements. Our results indicate that generative AI can substantially improve data efficiency and reliability in materials-image classification workflows, offering a practical route to lower experimental cost, accelerate model development, and expand AI applicability in materials engineering.

[91] Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja

Main category: cs.CV

TL;DR: Object-DINO extracts distributed object-centric information from all layers of self-supervised ViTs using patch-level attention components, improving unsupervised object discovery and reducing object hallucination in multimodal LLMs.

Details

Motivation: Current self-supervised ViTs like DINO show emergent object discovery abilities but suffer from spurious activations and poor localization because the [CLS] token summarizes the entire image rather than focusing on objects, diluting object-centric information in patch-level interactions.

Method: Object-DINO analyzes inter-patch similarity using all three patch-level attention components (query, key, value) across all layers, clusters attention heads based on patch similarities, and automatically identifies object-centric clusters corresponding to objects, all without additional training.

Result: The method achieves +3.6 to +12.4 CorLoc gains for unsupervised object discovery and effectively mitigates object hallucination in Multimodal Large Language Models by providing visual grounding.

Conclusion: Object-centric information is distributed across all layers of self-supervised ViTs in all three attention components, and extracting this distributed information improves downstream tasks without requiring additional training.

Abstract: Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO’s effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.

[92] Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis

Yuan Zhang, Sihao Dou, Kai Hu, Shuhua Deng, Chunhong Cao, Fen Xiao, Xieping Gao

Main category: cs.CV

TL;DR: FPRL is a hierarchical self-supervised video representation learning framework for endoscopic videos that focuses on lesion-centric static semantics first, then models their temporal evolution, addressing limitations of natural video methods in clinical settings.

Details

Motivation: Endoscopic video analysis suffers from limited high-quality annotations. Existing self-supervised methods for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics crucial for clinical decision-making in gastrointestinal screening.

Method: FPRL uses a hierarchical semantic modeling mechanism: (1) Captures static semantics via teacher-prior adaptive masking (TPAM) with multi-view sparse sampling to focus on lesion-related local semantics; (2) Models contextual semantics through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP) to establish cross-view correspondences and model structured inter-frame evolution.

Result: Extensive experiments on 11 endoscopic video datasets show FPRL achieves superior performance across diverse downstream tasks, demonstrating effectiveness in endoscopic video representation learning.

Conclusion: FPRL provides an effective cognition-inspired hierarchical framework for endoscopic video representation learning that better captures clinically relevant semantics by distinguishing and collaboratively learning both static and contextual semantics.

Abstract: Endoscopic video analysis is essential for early gastrointestinal screening but remains hindered by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods developed for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics critical to clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics via teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks, demonstrating its effectiveness in endoscopic video representation learning. The code is available at https://github.com/MLMIP/FPRL.

[93] QPT V2: Masked Image Modeling Advances Visual Scoring

Qizhi Xie, Kun Yuan, Yunpeng Qu, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu

Main category: cs.CV

TL;DR: QPT V2 is a masked image modeling-based pretraining framework for unified quality and aesthetics assessment of visual content, addressing data scarcity and generalization issues through curated data, degradation introduction, and multi-scale modeling.

Details

Motivation: Current learning-based methods for quality and aesthetics assessment suffer from limited labeled data and poor generalization. While masked image modeling (MIM) has shown success in high-level vision tasks, its potential for quality- and aesthetics-awareness remains unexplored.

Method: Proposes QPT V2, a MIM-based pretraining framework with three key components: 1) curated pretraining data to capture high-level semantics and fine-grained details, 2) introduced degradation to encompass quality- and aesthetics-related factors, and 3) modified model structure to capture multi-scale quality and aesthetic information.

Result: Extensive experiments on 11 downstream benchmarks demonstrate superior performance compared to current state-of-the-art approaches and other pretraining paradigms.

Conclusion: QPT V2 successfully demonstrates that masked image modeling can be effectively adapted for quality and aesthetics assessment, providing a unified solution that addresses data scarcity and generalization challenges in this domain.

Abstract: Quality assessment and aesthetics assessment aim to evaluate the perceived quality and aesthetics of visual content. Current learning-based methods suffer greatly from the scarcity of labeled data and usually perform sub-optimally in terms of generalization. Although masked image modeling (MIM) has achieved noteworthy advancements across various high-level tasks (e.g., classification, detection etc.). In this work, we take on a novel perspective to investigate its capabilities in terms of quality- and aesthetics-awareness. To this end, we propose Quality- and aesthetics-aware pretraining (QPT V2), the first pretraining framework based on MIM that offers a unified solution to quality and aesthetics assessment. To perceive the high-level semantics and fine-grained details, pretraining data is curated. To comprehensively encompass quality- and aesthetics-related factors, degradation is introduced. To capture multi-scale quality and aesthetic information, model structure is modified. Extensive experimental results on 11 downstream benchmarks clearly show the superior performance of QPT V2 in comparison with current state-of-the-art approaches and other pretraining paradigms.

[94] ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions

Zikai Wang, Zhilu Zhang, Yiqing Wang, Hui Li, Wangmeng Zuo

Main category: cs.CV

TL;DR: ArtHOI: A framework for reconstructing 4D human-articulated-object interactions from single monocular RGB videos using foundation model priors and novel optimization methods.

Details

Motivation: Existing methods are limited to rigid objects or require multi-view setups; there's a need to reconstruct articulated object interactions from single videos, which foundation models now make possible.

Method: Optimization-based framework integrating multiple foundation model priors with Adaptive Sampling Refinement for object scale/pose estimation and MLLM-guided hand-object alignment using contact reasoning.

Result: Robust and effective reconstruction across diverse objects and interactions, validated on new datasets ArtHOI-RGBD and ArtHOI-Wild.

Conclusion: ArtHOI successfully addresses the challenging problem of 4D articulated object interaction reconstruction from single videos using foundation model integration and novel optimization techniques.

Abstract: Existing hand-object interactions (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods of articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize object’s metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, utilizing contact reasoning information as constraints of hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of our ArtHOI across diverse objects and interactions. Project: https://arthoi-reconstruction.github.io.

[95] End-to-end Feature Alignment: A Simple CNN with Intrinsic Class Attribution

Parniyan Farvardin, David Chapman

Main category: cs.CV

TL;DR: FA-CNN is a CNN architecture with intrinsic class attribution through end-to-end feature alignment, using order-preserving layers to maintain alignment from input pixels to final logits, making feature maps interpretable and equivalent to Grad-CAM saliency maps.

Details

Motivation: Standard CNNs use unordered operations (Linear, Conv2D) that shuffle and mix semantic concepts, making raw feature maps difficult to interpret. The goal is to create a CNN with intrinsic interpretability through end-to-end feature alignment.

Method: Introduces Feature-Align CNN with two new order-preserving layers: dampened skip connection and global average pooling classifier head. These maintain end-to-end feature alignment from input pixels to final class logits, ensuring feature maps intrinsically exhibit class attribution.

Result: Proves theoretically that FA-CNN penultimate feature maps are identical to Grad-CAM saliency maps, and shows feature maps morph layer-by-layer toward class activations. Performs well on benchmark image classification datasets and compares favorably to Grad-CAM and permutation methods in interpretability tasks.

Conclusion: FA-CNN provides intrinsic interpretability through feature alignment while maintaining competitive performance. The approach offers theoretical guarantees about feature map interpretability and shows potential for hybrid models and future extensions.

Abstract: We present Feature-Align CNN (FA-CNN), a prototype CNN architecture with intrinsic class attribution through end-to-end feature alignment. Our intuition is that the use of unordered operations such as Linear and Conv2D layers cause unnecessary shuffling and mixing of semantic concepts, thereby making raw feature maps difficult to understand. We introduce two new order preserving layers, the dampened skip connection, and the global average pooling classifier head. These layers force the model to maintain an end-to-end feature alignment from the raw input pixels all the way to final class logits. This end-to-end alignment enhances the interpretability of the model by allowing the raw feature maps to intrinsically exhibit class attribution. We prove theoretically that FA-CNN penultimate feature maps are identical to Grad-CAM saliency maps. Moreover, we prove that these feature maps slowly morph layer-by-layer over network depth, showing the evolution of features through network depth toward penultimate class activations. FA-CNN performs well on benchmark image classification datasets. Moreover, we compare the averaged FA-CNN raw feature maps against Grad-CAM and permutation methods in a percent pixels removed interpretability task. We conclude this work with a discussion and future, including limitations and extensions toward hybrid models.

[96] Out-of-Sight Embodied Agents: Multimodal Tracking, Sensor Fusion, and Trajectory Forecasting

Haichao Zhang, Yi Xu, Yun Fu

Main category: cs.CV

TL;DR: Improved OST method for predicting noise-free visual trajectories of out-of-sight objects from noisy sensor observations, extended to both pedestrians and vehicles with a vision-positioning denoising module.

Details

Motivation: Existing trajectory prediction methods assume complete, clean observations and fail to handle out-of-sight agents or noisy sensing signals caused by limited camera coverage, occlusions, and lack of ground-truth denoised trajectories, raising safety concerns for real-world deployment.

Method: Expands Out-of-Sight Trajectory Prediction (OOSTraj) from pedestrians to both pedestrians and vehicles. Introduces improved Vision-Positioning Denoising Module that exploits camera calibration to establish vision-position correspondence, enabling unsupervised denoising of noisy sensor signals.

Result: Achieves state-of-the-art results for both trajectory denoising and trajectory prediction on Vi-Fi and JRDB datasets, with clear gains over prior baselines. Outperforms classical denoising methods like Kalman filtering and adapted trajectory prediction models.

Conclusion: First work to use vision-positioning projection to denoise noisy sensor trajectories of out-of-sight agents, opening new directions for future research in autonomous driving, robotics, and surveillance applications.

Abstract: Trajectory prediction is a fundamental problem in computer vision, vision-language-action models, world models, and autonomous systems, with broad impact on autonomous driving, robotics, and surveillance. However, most existing methods assume complete and clean observations, and therefore do not adequately handle out-of-sight agents or noisy sensing signals caused by limited camera coverage, occlusions, and the absence of ground-truth denoised trajectories. These challenges raise safety concerns and reduce robustness in real-world deployment. In this extended study, we introduce major improvements to Out-of-Sight Trajectory (OST), a task for predicting noise-free visual trajectories of out-of-sight objects from noisy sensor observations. Building on our prior work, we expand Out-of-Sight Trajectory Prediction (OOSTraj) from pedestrians to both pedestrians and vehicles, increasing its relevance to autonomous driving, robotics, and surveillance. Our improved Vision-Positioning Denoising Module exploits camera calibration to establish vision-position correspondence, mitigating the lack of direct visual cues and enabling effective unsupervised denoising of noisy sensor signals. Extensive experiments on the Vi-Fi and JRDB datasets show that our method achieves state-of-the-art results for both trajectory denoising and trajectory prediction, with clear gains over prior baselines. We also compare with classical denoising methods, including Kalman filtering, and adapt recent trajectory prediction models to this setting, establishing a stronger benchmark. To the best of our knowledge, this is the first work to use vision-positioning projection to denoise noisy sensor trajectories of out-of-sight agents, opening new directions for future research.

[97] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Junwon Lee, Juhan Nam, Jiyoung Lee

Main category: cs.CV

TL;DR: SELVA is a text-conditioned selective video-to-audio generation model that produces only user-intended sounds from multi-object videos, treating text prompts as explicit selectors to extract relevant sound-source visual features.

Details

Motivation: In multimedia production, audio tracks are handled individually for precise editing, mixing, and creative control. Current V2A models generate all sounds from videos, but there's a need for selective generation where users can specify which sounds to produce using text prompts.

Method: SELVA treats text prompts as explicit selectors to extract prompt-relevant sound-source visual features from video encoder. It uses supplementary tokens to suppress text-irrelevant activations via efficient video encoder finetuning, and employs autonomous video-mixing in a self-supervised manner to overcome lack of mono audio track supervision.

Result: Evaluated on VGG-MONOAUDIO benchmark, SELVA shows effectiveness across audio quality, semantic alignment, and temporal synchronization. Extensive experiments and ablations consistently verify its performance.

Conclusion: SELVA successfully addresses text-conditioned selective V2A generation, enabling precise audio extraction from multi-object videos using text prompts as selectors, with applications in multimedia production.

Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. We propose SELVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector to distinctly extract prompt-relevant sound-source visual features from the video encoder. To suppress text-irrelevant activations with efficient video encoder finetuning, the proposed supplementary tokens promote cross-attention to yield robust semantic and temporal grounding. SELVA further employs an autonomous video-mixing scheme in a self-supervised manner to overcome the lack of mono audio track supervision. We evaluate SELVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization.

[98] LEMON: a foundation model for nuclear morphology in Computational Pathology

Loïc Chadoutaud, Alice Blondel, Hana Feki, Jacqueline Fontugne, Emmanuel Barillot, Thomas Walter

Main category: cs.CV

TL;DR: LEMON is a self-supervised foundation model for single-cell image representation learning in computational pathology, trained on millions of cell images from diverse tissues and cancer types to learn robust morphological representations.

Details

Motivation: While self-supervised learning has advanced patch and whole-slide image representation in computational pathology, single-cell level representation learning remains underexplored despite its importance for characterizing cell types and cellular phenotypes in cancer research.

Method: LEMON uses self-supervised learning on millions of single-cell images from diverse tissues and cancer types to learn morphological embeddings. It serves as a foundation model for scalable single-cell image representation learning.

Result: LEMON demonstrates strong performance across five benchmark datasets on various prediction tasks, showing robust and versatile morphological representations that support large-scale single-cell analyses in pathology.

Conclusion: LEMON represents a new paradigm for cell-level computational pathology, providing a foundation model for single-cell image representation learning with potential applications in cancer research and precision medicine.

Abstract: Computational pathology relies on effective representation learning to support cancer research and precision medicine. Although self-supervised learning has driven major progress at the patch and whole-slide image levels, representation learning at the single-cell level remains comparatively underexplored, despite its importance for characterizing cell types and cellular phenotypes. We introduce LEMON (Learning Embeddings from Morphology Of Nuclei), a self-supervised foundation model for scalable single-cell image representation learning. Trained on millions of cell images from diverse tissues and cancer types, LEMON learns robust and versatile morphological representations that support large-scale single-cell analyses in pathology. We evaluate LEMON on five benchmark datasets across a range of prediction tasks and show that it provides strong performance, highlighting its potential as a new paradigm for cell-level computational pathology. Model weights are available at https://huggingface.co/aliceblondel/LEMON.

Ngoc-Son Nguyen, Thanh V. T. Tran, Jeongsoo Choi, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

Main category: cs.CV

TL;DR: DiFlowDubber is a novel video dubbing framework that uses discrete flow matching and cross-modal synchronization to generate expressive, synchronized speech from video inputs.

Details

Motivation: Current video dubbing approaches either train on limited datasets or use two-stage TTS pipelines that struggle with expressive prosody, rich acoustic characteristics, and precise speech-lip synchronization.

Method: Two-stage training framework with discrete flow matching generative backbone. Includes FaPro module for capturing global prosody/stylistic cues from facial expressions, and Synchronizer module for bridging modality gaps between text, video, and speech to ensure precise synchronization.

Result: Outperforms previous methods across multiple metrics on two primary benchmark datasets.

Conclusion: DiFlowDubber effectively transfers knowledge from pre-trained TTS models to video-driven dubbing while addressing key challenges in prosody, acoustic quality, and synchronization.

Abstract: Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.

[100] Do All Vision Transformers Need Registers? A Cross-Architectural Reassessment

Spiros Baxevanakis, Platon Karageorgis, Ioannis Dravilas, Konrad Szewczyk

Main category: cs.CV

TL;DR: This paper reproduces and evaluates Darcet et al.’s findings about attention map artifacts in Vision Transformers, testing their generalizability across multiple models (DINO, DINOv2, OpenCLIP, DeiT3) and model sizes.

Details

Motivation: To validate and extend Darcet et al.'s (2024) findings about attention map artifacts in Vision Transformers, which they attributed to ViTs needing to store global information beyond the [CLS] token, and to test the generalizability of their claims across different models and model sizes.

Method: Reproduction study evaluating Darcet et al.’s claims across multiple Vision Transformer models (DINO, DINOv2, OpenCLIP, DeiT3), testing the impact of adding empty input tokens (registers) on attention map artifacts, and extending analysis to smaller model sizes.

Result: While confirming several key claims from Darcet et al., the study found that some claims do not generalize universally across all tested models. The impact of model size was explored, extending findings to smaller models, and terminology inconsistencies in the original paper were identified and explained.

Conclusion: Darcet et al.’s findings about attention map artifacts and the register solution have partial generalizability across Vision Transformer architectures, with model-specific variations and terminology inconsistencies affecting broader application.

Abstract: Training Vision Transformers (ViTs) presents significant challenges, one of which is the emergence of artifacts in attention maps, hindering their interpretability. Darcet et al. (2024) investigated this phenomenon and attributed it to the need of ViTs to store global information beyond the [CLS] token. They proposed a novel solution involving the addition of empty input tokens, named registers, which successfully eliminate artifacts and improve the clarity of attention maps. In this work, we reproduce the findings of Darcet et al. (2024) and evaluate the generalizability of their claims across multiple models, including DINO, DINOv2, OpenCLIP, and DeiT3. While we confirm the validity of several of their key claims, our results reveal that some claims do not extend universally to other models. Additionally, we explore the impact of model size, extending their findings to smaller models. Finally, we untie terminology inconsistencies found in the original paper and explain their impact when generalizing to a wider range of models.

[101] Fourier Decomposition for Explicit Representation of 3D Point Cloud Attributes

Donghyun Kim, Chanyoung Kim, Hyunah Ko, Seong Jae Hwang

Main category: cs.CV

TL;DR: Novel colored point cloud encoding method using 3D Fourier decomposition to disentangle color and geometric features, achieving SOTA results on classification, segmentation, and style transfer tasks.

Details

Motivation: Existing point cloud encoding methods lack consideration for colored point clouds, which are more expressive 3D representations. Current approaches handle color and geometry separately on a per-point basis, leading to limited receptive fields and restricted ability to capture relationships across multiple points.

Method: Proposes a colored point cloud encoding methodology that leverages 3D Fourier decomposition to disentangle color and geometric features while extending the receptive field through spectral-domain operations. The approach separates feature components where amplitude captures color attributes and phase encodes geometric structure.

Result: Achieves state-of-the-art results on the DensePoint dataset for classification, segmentation, and style transfer tasks. Analysis confirms effective separation of feature components with amplitude uniquely capturing color and phase encoding geometry.

Conclusion: The 3D Fourier decomposition approach successfully addresses limitations of existing colored point cloud encoding methods by enabling independent learning and utilization of both color and geometric attributes through spectral-domain operations.

Abstract: While 3D point clouds are widely used in vision applications, their irregular and sparse nature make them challenging to handle. In response, numerous encoding approaches have been proposed to capture the rich semantic information of point clouds. Yet, a critical limitation persists: a lack of consideration for colored point clouds, which serve as more expressive 3D representations encompassing both color and geometry. While existing methods handle color and geometry separately on a per-point basis, this leads to a limited receptive field and restricted ability to capture relationships across multiple points. To address this, we pioneer a colored point cloud encoding methodology that leverages 3D Fourier decomposition to disentangle color and geometric features while extending the receptive field through spectral-domain operations. Our analysis confirms that our approach effectively separates feature components, where the amplitude uniquely captures color attributes and the phase encodes geometric structure, thereby enabling independent learning and utilization of both attributes. We validate our colored point cloud encoding approach on classification, segmentation, and style transfer tasks, achieving state-of-the-art results on the DensePoint dataset.

[102] Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis

Yancheng Zhang, Xiaohan Zhang, Guangyu Sun, Zonglin Lyu, Safwan Wshah, Chen Chen

Main category: cs.CV

TL;DR: Geo^2 is a unified framework that leverages 3D geometric priors from foundation models to jointly perform cross-view geo-localization and bidirectional cross-view image synthesis, achieving state-of-the-art results.

Details

Motivation: Cross-view geo-spatial tasks (localization and synthesis) rely on geometric correspondences between ground and aerial views. While geometric foundation models have strong 3D understanding capabilities, their potential for cross-view tasks remains underexplored due to the large viewpoint gap between ground and aerial imagery.

Method: Proposes Geo^2 with two key components: 1) GeoMap - embeds ground and aerial features into a shared 3D-aware latent space to reduce cross-view discrepancies for localization; 2) GeoFlow - a flow-matching model conditioned on geometry-aware latent embeddings for bidirectional image synthesis, with consistency loss for alignment.

Result: Achieves state-of-the-art performance on standard benchmarks (CVUSA, CVACT, VIGOR) for both cross-view geo-localization and bidirectional cross-view image synthesis.

Conclusion: Demonstrates the effectiveness of 3D geometric priors from foundation models for cross-view geo-spatial learning, enabling unified handling of both localization and synthesis tasks.

Abstract: Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo^2 achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.

[103] StreamDiT: Real-Time Streaming Text-to-Video Generation

Akio Kodaira, Tingbo Hou, Ji Hou, Markos Georgopoulos, Felix Juefei-Xu, Masayoshi Tomizuka, Yue Zhao

Main category: cs.CV

TL;DR: StreamDiT is a streaming video generation model that enables real-time text-to-video generation using flow matching with moving buffers and window attention, achieving 16 FPS at 512p resolution after distillation.

Details

Motivation: Existing text-to-video models produce only short clips offline, limiting their use in interactive and real-time applications. There's a need for models that can generate video streams continuously in real-time.

Method: Proposes StreamDiT with flow matching using moving buffers, mixed training with different partitioning schemes, adaLN DiT with varying time embedding and window attention, and a multistep distillation method tailored for streaming.

Result: Trained a 4B parameter model that achieves real-time performance at 16 FPS on one GPU for 512p video generation. Distillation reduces NFEs to the number of chunks in buffer. Evaluated with quantitative metrics and human evaluation.

Conclusion: StreamDiT enables real-time video generation applications including streaming generation, interactive generation, and video-to-video tasks, overcoming limitations of offline short-clip generation models.

Abstract: Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video. We provide video results and more examples in our project website: https://cumulo-autumn.github.io/StreamDiT/

[104] ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li

Main category: cs.CV

TL;DR: ViGoR is a unified benchmark framework that evaluates vision-generative models’ reasoning capabilities beyond superficial metrics, revealing significant logical deficits in current AIGC systems.

Details

Motivation: Current AIGC models excel in visual fidelity but fail at tasks requiring physical, causal, or complex spatial reasoning, creating a "logical desert." Existing evaluations rely on superficial metrics and fragmented benchmarks, creating a "performance mirage" that overlooks the generative process.

Method: ViGoR introduces four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) dual-track mechanism evaluating both intermediate processes and final results; 3) evidence-grounded automated judge ensuring high human alignment; 4) granular diagnostic analysis decomposing performance into fine-grained cognitive dimensions.

Result: Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical “stress test” for next-generation intelligent vision models.

Conclusion: ViGoR provides a comprehensive framework to evaluate vision-generative reasoning capabilities, exposing the “logical desert” beneath impressive visual outputs and setting a new standard for benchmarking intelligent vision models.

Abstract: Beneath the stunning visual fidelity of modern AIGC models lies a “logical desert”, where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a performance mirage'' that overlooks the generative process. To address this, we introduce ViGoR Vision-G}nerative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical stress test’’ for the next generation of intelligent vision models. The demo have been available at https://vincenthancoder.github.io/ViGoR-Bench/

[105] MS-ISSM: Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity

Zhang Chen, Shuai Wan, Yuezhe Zhang, Siyu Ren, Fuzheng Yang, Junhui Hou

Main category: cs.CV

TL;DR: MS-ISSM is a novel point cloud quality assessment method using implicit function representation and multi-scale feature analysis to overcome irregular point cloud challenges.

Details

Motivation: The unstructured and irregular nature of point clouds makes accurate quality assessment challenging, especially for establishing perceptual feature correspondence between reference and distorted point clouds.

Method: Proposes Multi-scale Implicit Structural Similarity Measurement (MS-ISSM) using radial basis functions to represent local features continuously, transforming distortion measurement into implicit function coefficient comparison. Also introduces ResGrouped-MLP network with grouped encoding, residual blocks, and channel attention for multi-scale feature mapping.

Result: MS-ISSM outperforms state-of-the-art metrics on multiple benchmarks in both reliability and generalization.

Conclusion: The method effectively addresses irregular point cloud quality assessment by avoiding matching errors through implicit representation and hierarchical feature analysis.

Abstract: The unstructured and irregular nature of points poses a significant challenge for accurate point cloud quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM utilizes radial basis function (RBF) to represent local features continuously, transforming distortion measurement into a comparison of implicit function coefficients. This approach effectively circumvents matching errors inherent in irregular data. Additionally, we propose a ResGrouped-MLP quality assessment network, which robustly maps multi-scale feature differences to perceptual scores. The network architecture departs from traditional flat multi-layer perceptron (MLP) by adopting a grouped encoding strategy integrated with residual blocks and channel-wise attention mechanisms. This hierarchical design allows the model to preserve the distinct physical semantics of luma, chroma, and geometry while adaptively focusing on the most salient distortion features across High, Medium, and Low scales. Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization. The source code is available at: https://github.com/ZhangChen2022/MS-ISSM.

[106] Fus3D: Decoding Consolidated 3D Geometry from Feed-forward Geometry Transformer Latents

Laura Fink, Linus Franke, George Kopanas, Marc Stamminger, Peter Hedman

Main category: cs.CV

TL;DR: Fast feed-forward method for dense Signed Distance Field (SDF) regression from image collections in under 3 seconds, using pretrained geometry transformer features directly for 3D extraction without per-view prediction heads.

Details

Motivation: Existing methods discard valuable joint world representation encoded in pretrained multi-view geometry transformers by routing features through per-view prediction heads before assembling 3D geometry, which loses completeness information and accumulates inaccuracies.

Method: Perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid, then use a convolutional decoder to map to dense SDF.

Result: The approach yields complete and well-defined distance values across sparse- and dense-view settings, demonstrates geometrically plausible completions, and achieves inference in less than three seconds without camera calibration or post-hoc fusion.

Conclusion: Direct 3D extraction from pretrained geometry transformer features enables fast, accurate SDF regression from image collections, outperforming methods that discard valuable joint world representations through per-view processing.

Abstract: We propose a feed-forward method for dense Signed Distance Field (SDF) regression from unstructured image collections in less than three seconds, without camera calibration or post-hoc fusion. Our key insight is that the intermediate feature space of pretrained multi-view feed-forward geometry transformers already encodes a powerful joint world representation; yet, existing pipelines discard it, routing features through per-view prediction heads before assembling 3D geometry post-hoc, which discards valuable completeness information and accumulates inaccuracies. We instead perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid. A simple convolutional decoder then maps this grid to a dense SDF. We additionally propose a scalable, validity-aware supervision scheme directly using SDFs derived from depth maps or 3D assets, tackling practical issues like non-watertight meshes. Our approach yields complete and well-defined distance values across sparse- and dense-view settings and demonstrates geometrically plausible completions. Code and further material can be found at https://lorafib.github.io/fus3d.

[107] GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding

Trong Thang Pham, Hien Nguyen, Ngan Le

Main category: cs.CV

TL;DR: GazeQwen enhances multimodal LLMs with gaze awareness via parameter-efficient hidden-state modulation, achieving state-of-the-art performance on video understanding tasks using gaze cues.

Details

Motivation: Current MLLMs fail to effectively utilize eye-gaze information for video understanding, even when gaze cues are provided through visual overlays or text descriptions. There's a need for models that can properly integrate gaze information to improve video comprehension.

Method: Introduces GazeQwen with a compact gaze resampler (1-5M parameters) that encodes V-JEPA 2.1 video features with fixation-derived positional encodings, producing additive residuals injected into selected LLM decoder layers via forward hooks. Optional second stage adds LoRA adapters for tighter integration.

Result: Achieves 63.9% accuracy on StreamGaze benchmark, +16.1 points over Qwen2.5-VL-7B with gaze as visual prompts, and +10.5 points over GPT-4o, highest among all tested open-source and proprietary models.

Conclusion: Learning where to inject gaze within an LLM is more effective than scaling model size or engineering better prompts. Parameter-efficient gaze integration significantly improves video understanding capabilities.

Abstract: Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (~1-5 M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low-rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, a +16.1 point gain over the same Qwen2.5-VL-7B backbone with gaze as visual prompts and +10.5 points over GPT-4o, the highest score among all open-source and proprietary models tested. These results suggest that learning where to inject gaze within an LLM is more effective than scaling model size or engineering better prompts. All code and checkpoints are available at https://github.com/phamtrongthang123/gazeqwen .

[108] Dynamic LIBRAS Gesture Recognition via CNN over Spatiotemporal Matrix Representation

Jasmine Moreira

Main category: cs.CV

TL;DR: Dynamic hand gesture recognition using MediaPipe Hand Landmarker for skeletal keypoint extraction and CNN for classification from spatiotemporal matrices, applied to LIBRAS sign language gestures for home automation control.

Details

Motivation: To develop an effective method for dynamic hand gesture recognition that can be applied to sign language interpretation for device control in home automation systems, with real-time capabilities and robustness to lighting variations.

Method: Two-stage approach: 1) MediaPipe Hand Landmarker extracts 21 skeletal keypoints from hand images, 2) CNN classifies gestures from spatiotemporal matrix representation (90×21 dimensions) of keypoints. Uses sliding window with temporal frame triplication for real-time inference without recurrent networks.

Result: Achieved 95% accuracy under low-light conditions and 92% under normal lighting for 11 classes of LIBRAS static and dynamic gestures. Real-time continuous recognition demonstrated.

Conclusion: The approach is effective for hand gesture recognition in home automation applications, though systematic experiments with greater user diversity are needed for better generalization evaluation.

Abstract: This paper proposes a method for dynamic hand gesture recognition based on the composition of two models: the MediaPipe Hand Landmarker, responsible for extracting 21 skeletal keypoints of the hand, and a convolutional neural network (CNN) trained to classify gestures from a spatiotemporal matrix representation of dimensions 90 by 21 of those keypoints. The method is applied to the recognition of LIBRAS (Brazilian Sign Language) gestures for device control in a home automation system, covering 11 classes of static and dynamic gestures. For real-time inference, a sliding window with temporal frame triplication is used, enabling continuous recognition without recurrent networks. Tests achieved 95% accuracy under low-light conditions and 92% under normal lighting. The results indicate that the approach is effective, although systematic experiments with greater user diversity are needed for a more thorough evaluation of generalization.

[109] GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks

Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim

Main category: cs.CV

TL;DR: GUIDE benchmark evaluates AI models on GUI user intent detection through behavior state recognition, intent prediction, and help prediction using screen recordings with think-aloud narrations across 10 software applications.

Details

Motivation: Current GUI agents focus on automation through clicks and keystrokes but overlook human intention, where users value exploration, iteration, and refinement while maintaining agency. To move from automation to collaboration, agents need to understand what users are doing and why.

Method: Created GUIDE benchmark with 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations across 10 software applications. Defines three evaluation tasks: Behavior State Detection, Intent Prediction, and Help Prediction to test model capabilities.

Result: Eight state-of-the-art multimodal models struggled, achieving only 44.6% accuracy on behavior state detection and 55.0% on help prediction. Providing user context significantly improved performance, raising help prediction by up to 50.2 percentage points.

Conclusion: GUI agents need structured user understanding for effective assistance. Current multimodal models have limited ability to perceive user behavior and infer intent, highlighting the need for better user context integration in GUI assistance systems.

Abstract: Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software. GUIDE defines three tasks - (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model’s ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction. However, providing user context significantly improved the performance, raising help prediction by up to 50.2pp, highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at https://guide-bench.github.io.

[110] Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception

Jingpei Lu, Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, Omid Mohareri

Main category: cs.CV

TL;DR: Transformer-based surgical desmoking model with physics-inspired head that predicts smoke-free images and smoke maps, trained on synthetic data and evaluated on largest paired surgical smoke dataset.

Details

Motivation: Surgical smoke from electrocautery and vessel-sealing instruments degrades endoscopic imaging in minimally invasive surgery, hindering visual perception and vision-based functionalities.

Method: Transformer-based surgical desmoking model with physics-inspired desmoking head that jointly predicts smoke-free images and corresponding smoke maps. Uses synthetic data generation pipeline blending artificial smoke patterns with real endoscopic images (80,000+ paired samples) and curated largest paired surgical smoke dataset (5,817 image pairs from da Vinci system).

Result: State-of-the-art performance in image reconstruction compared to existing dehazing and desmoking approaches on both public benchmark and new dataset. Demonstrated impact on downstream stereo depth estimation and instrument segmentation, highlighting benefits and limitations of digital smoke removal.

Conclusion: The transformer-based desmoking model with physics-inspired head and synthetic data generation pipeline effectively addresses surgical smoke removal, improving endoscopic vision for minimally invasive surgery while revealing both potential benefits and current limitations for downstream vision tasks.

Abstract: Minimally invasive and robot-assisted surgery relies heavily on endoscopic imaging, yet surgical smoke produced by electrocautery and vessel-sealing instruments can severely degrade visual perception and hinder vision-based functionalities. We present a transformer-based surgical desmoking model with a physics-inspired desmoking head that jointly predicts smoke-free image and corresponding smoke map. To address the scarcity of paired smoky-to-smoke-free training data, we develop a synthetic data generation pipeline that blends artificial smoke patterns with real endoscopic images, yielding over 80,000 paired samples for supervised training. We further curate, to our knowledge, the largest paired surgical smoke dataset to date, comprising 5,817 image pairs captured with the da Vinci robotic surgical system, enabling benchmarking on high-resolution endoscopic images. Extensive experiments on both a public benchmark and our dataset demonstrate state-of-the-art performance in image reconstruction compared to existing dehazing and desmoking approaches. We also assess the impact of desmoking on downstream stereo depth estimation and instrument segmentation, highlighting both the potential benefits and current limitations of digital smoke removal methods.

[111] Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations

Suraj Prasad, Pinak Mahapatra

Main category: cs.CV

TL;DR: First dataset of 24 Excalidraw educational videos with millisecond-precision drawing timestamps and narrated audio, used to train VL model for predicting stroke sequences synchronized to speech.

Details

Motivation: Creating educational whiteboard videos requires precise coordination between drawings and narration, but no existing methods address this multimodal synchronization problem with structured, reproducible drawing representations.

Method: Created dataset of 24 paired Excalidraw demonstrations with narrated audio across 8 STEM domains, each with millisecond-precision creation timestamps. Fine-tuned Qwen2-VL-7B vision-language model via LoRA to predict full stroke sequences synchronized to speech from only 24 demonstrations.

Result: Topic-stratified five-fold evaluation shows timestamp conditioning significantly improves temporal alignment over ablated baselines, and the model generalizes across unseen STEM topics.

Conclusion: The approach demonstrates potential for automated educational content generation, with discussion of transferability to real classroom settings. Dataset and code are released to support future research.

Abstract: Creating whiteboard-style educational videos demands precise coordination between freehand illustrations and spoken narration, yet no existing method addresses this multimodal synchronization problem with structured, reproducible drawing representations. We present the first dataset of 24 paired Excalidraw demonstrations with narrated audio, where every drawing element carries millisecond-precision creation timestamps spanning 8 STEM domains. Using this data, we study whether a vision-language model (Qwen2-VL-7B), fine-tuned via LoRA, can predict full stroke sequences synchronized to speech from only 24 demonstrations. Our topic-stratified five-fold evaluation reveals that timestamp conditioning significantly improves temporal alignment over ablated baselines, while the model generalizes across unseen STEM topics. We discuss transferability to real classroom settings and release our dataset and code to support future research in automated educational content generation.

Prasiddha Bhandari, Kanchan Poudel, Nishant Luitel, Bishram Acharya, Angelina Ghimire, Tyler Wellman, Kilian Koepsell, Pradeep Raj Regmi, Bishesh Khanal

Main category: cs.CV

TL;DR: Systematic evaluation of Blind Sweep Obstetric Ultrasound quality and its impact on AI tasks, with automated quality assessment to improve reliability in low-resource settings.

Details

Motivation: Blind Sweep Obstetric Ultrasound enables scalable fetal imaging in low-resource settings, but AI system reliability depends on sweep quality, and little is known about how acquisition deviations affect downstream predictions.

Method: Simulated plausible acquisition deviations (reversed sweep direction, probe inversion, incomplete sweeps) to quantify model robustness; developed automated quality-assessment models; simulated feedback loop where flagged sweeps are re-acquired.

Result: Found BSOU-based AI models are sensitive to acquisition variability; automated quality assessment can detect perturbations; correction through re-acquisition improves downstream task performance.

Conclusion: Automated quality assessment plays a central role in building reliable, scalable AI-assisted prenatal ultrasound workflows, particularly in low-resource environments.

Abstract: Blind Sweep Obstetric Ultrasound (BSOU) enables scalable fetal imaging in low-resource settings by allowing minimally trained operators to acquire standardized sweep videos for automated Artificial Intelligence(AI) interpretation. However, the reliability of such AI systems depends critically on the quality of the acquired sweeps, and little is known about how deviations from the intended protocol affect downstream predictions. In this work, we present a systematic evaluation of BSOU quality and its impact on three key AI tasks: sweep-tag classification, fetal presentation classification, and placenta-location classification. We simulate plausible acquisition deviations, including reversed sweep direction, probe inversion, and incomplete sweeps, to quantify model robustness, and we develop automated quality-assessment models capable of detecting these perturbations. To approximate real-world deployment, we simulate a feedback loop in which flagged sweeps are re-acquired, showing that such correction improves downstream task performance. Our findings highlight the sensitivity of BSOU-based AI models to acquisition variability and demonstrate that automated quality assessment can play a central role in building reliable, scalable AI-assisted prenatal ultrasound workflows, particularly in low-resource environments.

[113] World Reasoning Arena

PAN Team, Qiyue Gao, Kun Zhou, Jiannan Xiang, Zihan Liu, Dequan Yang, Junrong Chen, Arif Ahmad, Cong Zeng, Ganesh Bannur, Xinqi Huang, Zheqi Liu, Yi Gu, Yichi Yang, Guangyi Liu, Zhiting Hu, Zhengzhong Liu, Eric Xing

Main category: cs.CV

TL;DR: WR-Arena is a comprehensive benchmark for evaluating world models across three dimensions: action simulation fidelity, long-horizon forecasting, and simulative reasoning/planning, moving beyond traditional next-state prediction.

Details

Motivation: Existing world model benchmarks focus narrowly on next-state prediction and visual fidelity, overlooking richer simulation capabilities needed for intelligent behavior. There's a need for comprehensive evaluation of world models' ability to simulate complex environments and support reasoning.

Method: The authors introduce WR-Arena benchmark with a task taxonomy and curated diverse datasets designed to probe three fundamental dimensions: (1) Action Simulation Fidelity - ability to interpret multi-step instructions and generate counterfactual rollouts, (2) Long-horizon Forecast - ability to sustain accurate simulations across extended interactions, and (3) Simulative Reasoning and Planning - ability to support goal-directed reasoning by simulating alternative futures.

Result: Extensive experiments with state-of-the-art world models reveal a substantial gap between current models and human-level hypothetical reasoning. WR-Arena serves as both a diagnostic tool and guideline for advancing next-generation world models.

Conclusion: WR-Arena addresses limitations of existing benchmarks and provides a comprehensive framework for evaluating world models’ simulation capabilities, establishing a foundation for developing models capable of robust understanding, forecasting, and purposeful action.

Abstract: World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi-step instructions and generate diverse counterfactual rollouts; (ii) Long-horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal-directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open-ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single-turn and perceptual evaluations. Through extensive experiments with state-of-the-art WMs, our results expose a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at https://github.com/MBZUAI-IFM/WR-Arena.

[114] Polarization-Based Eye Tracking with Personalized Siamese Architectures

Beyza Kalkanli, Tom Bu, Mahsa Shakeri, Alexander Fix, Dave Stronks, Dmitri Model, Mantas Žurauskas

Main category: cs.CV

TL;DR: Siamese personalization for eye tracking reduces calibration samples by 10x while maintaining accuracy comparable to linear calibration, with polarization inputs providing up to 12% error reduction over NIR.

Details

Motivation: Head-mounted eye tracking devices require per-user calibration due to inter-person variability, which is inconvenient. The paper aims to reduce calibration burden while maintaining accuracy through differential personalization approaches.

Method: Uses Siamese architectures to learn relative gaze displacements and reconstruct absolute gaze from minimal calibration frames. Benchmarks on polarization-enabled eye tracking using a 338-subject dataset captured with polarization-sensitive camera and 850 nm illumination.

Result: Achieves performance comparable to linear calibration with 10-fold fewer samples. Polarization inputs reduce gaze error by up to 12% compared to NIR-based inputs. Combining Siamese personalization with linear calibration yields further 13% improvements over linearly calibrated baseline.

Conclusion: Siamese personalization establishes a practical approach for accurate eye tracking with significantly reduced calibration requirements, especially effective when combined with polarization sensing.

Abstract: Head-mounted devices integrated with eye tracking promise a solution for natural human-computer interaction. However, they typically require per-user calibration for optimal performance due to inter-person variability. A differential personalization approach using Siamese architectures learns relative gaze displacements and reconstructs absolute gaze from a small set of calibration frames. In this paper, we benchmark Siamese personalization on polarization-enabled eye tracking. For benchmarking, we use a 338-subject dataset captured with a polarization-sensitive camera and 850 nm illumination. We achieve performance comparable to linear calibration with 10-fold fewer samples. Using polarization inputs for Siamese personalization reduces gaze error by up to 12% compared to near-infrared (NIR)-based inputs. Combining Siamese personalization with linear calibration yields further improvements of up to 13% over a linearly calibrated baseline. These results establish Siamese personalization as a practical approach enabling accurate eye tracking.

[115] Good Scores, Bad Data: A Metric for Multimodal Coherence

Vasundra Srinivasan

Main category: cs.CV

TL;DR: Multimodal Coherence Score (MCS) is a new metric that evaluates multimodal fusion quality by measuring coherence across four dimensions (identity, spatial, semantic, decision) without relying on downstream task performance.

Details

Motivation: Current multimodal AI evaluation relies on downstream task accuracy, which doesn't guarantee that the underlying multimodal data is coherent. Models can achieve high VQA scores even when inputs contradict each other, highlighting the need for a metric that directly assesses fusion quality.

Method: MCS decomposes coherence into four dimensions: identity (object consistency), spatial (spatial relationships), semantic (semantic relationships), and decision (model confidence). Weights for these dimensions are learned via Nelder-Mead optimization. The method is evaluated on 1,000 Visual Genome images using DETR, CLIP, and ViLT models, and validated on 150 COCO images without retraining.

Result: MCS discriminates fusion quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071) across three fusion architectures. Perturbation experiments confirm each dimension responds independently to its specific failure mode with zero cross-talk. The metric is lightweight, requires no human annotation, and provides diagnostic information about what specifically broke in the fusion process.

Conclusion: MCS provides a novel, model-agnostic way to evaluate multimodal fusion quality that goes beyond task accuracy, offering diagnostic insights into specific failure modes across four coherence dimensions. It enables better understanding of what makes multimodal representations effective.

Abstract: Multimodal AI systems are evaluated by downstream task accuracy, but high accuracy does not mean the underlying data is coherent. A model can score well on Visual Question Answering (VQA) while its inputs contradict each other. We introduce the Multimodal Coherence Score (MCS), a metric that evaluates fusion quality independent of any downstream model. MCS decomposes coherence into four dimensions, identity, spatial, semantic, and decision, with weights learned via Nelder-Mead optimization. We evaluate on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validate on 150 COCO images with no retraining. Across three fusion architectures, MCS discriminates quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071). Perturbation experiments confirm each dimension responds independently to its failure mode with zero cross-talk. MCS is lightweight, requires no human annotation, and tells you not just that something broke, but what broke.

[116] Few Shots Text to Image Retrieval: New Benchmarking Dataset and Optimization Methods

Ofer Idan, Vladi Vexler, Gil Lederman, Dima Sivov, Aviad Cohen Zada, Shir Niego Komforti

Main category: cs.CV

TL;DR: Proposes FSIR-BD benchmark for few-shot text-to-image retrieval with compositional and OOD queries, and introduces optimization methods using reference examples to improve retrieval performance.

Details

Motivation: Current VLMs struggle with compositional queries and OOD image-text pairs in retrieval tasks, while humans excel at learning from few examples. The paper aims to bridge this gap by creating a benchmark and methods for few-shot image retrieval.

Method: 1) Introduces FSIR-BD benchmark dataset with 38,353 images and 303 queries focusing on compositional (urban scenes, nature species) and OOD scenarios. 2) Proposes two novel retrieval optimization methods that leverage single-shot or few-shot reference examples from the FSR corpus to improve performance with any pre-trained image encoder.

Result: FSIR-BD provides a challenging benchmark for image retrieval, and the proposed optimization methods outperform existing baselines as measured by mean Average Precision (mAP).

Conclusion: The work advances few-shot image retrieval for compositional reasoning, narrowing the gap between machine and human-level understanding from limited examples. Further research into FSIR optimization methods is needed.

Abstract: Pre-trained vision-language models (VLMs) excel in multimodal tasks, commonly encoding images as embedding vectors for storage in databases and retrieval via approximate nearest neighbor search (ANNS). However, these models struggle with compositional queries and out-of-distribution (OOD) image-text pairs. Inspired by human cognition’s ability to learn from minimal examples, we address this performance gap through few-shot learning approaches specifically designed for image retrieval. We introduce the Few-Shot Text-to-Image Retrieval (FSIR) task and its accompanying benchmark dataset, FSIR-BD - the first to explicitly target image retrieval by text accompanied by reference examples, focusing on the challenging compositional and OOD queries. The compositional part is divided to urban scenes and nature species, both in specific situations or with distinctive features. FSIR-BD contains 38,353 images and 303 queries, with 82% comprising the test corpus (averaging per query 37 positives, ground truth matches, and significant number of hard negatives) and 18% forming the few-shot reference corpus (FSR) of exemplar positive and hard negative images. Additionally, we propose two novel retrieval optimization methods leveraging single shot or few shot reference examples in the FSR to improve performance. Both methods are compatible with any pre-trained image encoder, making them applicable to existing large-scale environments. Our experiments demonstrate that: (1) FSIR-BD provides a challenging benchmark for image retrieval; and (2) our optimization methods outperform existing baselines as measured by mean Average Precision (mAP). Further research into FSIR optimization methods will help narrow the gap between machine and human-level understanding, particularly for compositional reasoning from limited examples.

[117] DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation

Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, Ser-Nam Lim

Main category: cs.CV

TL;DR: DiReCT improves physics consistency in text-to-video generation by disentangling semantic and physical aspects in contrastive learning, using macro-contrastive and micro-contrastive terms with LLM-guided perturbations.

Details

Motivation: Current flow-matching video generators produce high-quality videos but often violate basic physics because their reconstruction objectives don't distinguish physically consistent dynamics from impossible ones. Contrastive flow matching helps but suffers from semantic-physics entanglement in text-conditioned settings.

Method: DiReCT (Disentangled Regularization of Contrastive Trajectories) is a lightweight post-training framework with two components: 1) Macro-contrastive term uses semantically distant negatives for global trajectory separation, and 2) Micro-contrastive term constructs hard negatives sharing full scene semantics but differing along single LLM-perturbed physical axes (kinematics, forces, materials, interactions, magnitudes). Includes velocity-space distributional regularizer to preserve visual quality.

Result: Applied to Wan 2.1-1.3B, DiReCT improves physical commonsense score on VideoPhy by 16.7% compared to baseline and 11.3% compared to SFT, without increasing training time.

Conclusion: DiReCT effectively addresses semantic-physics entanglement in contrastive video generation, improving physical consistency while maintaining visual quality through disentangled regularization at complementary scales.

Abstract: Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample’s, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus harms training. Guided by this analysis, we introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales: a macro-contrastive term that draws partition-exclusive negatives from semantically distant regions for interference-free global trajectory separation, and a micro-contrastive term that constructs hard negatives sharing full scene semantics with the positive sample but differing along a single, LLM-perturbed axis of physical behavior; spanning kinematics, forces, materials, interactions, and magnitudes. A velocity-space distributional regularizer helps to prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, our method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.

[118] THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond

Letian Wang, Andrei Zanfir, Eduard Gabriel Bazavan, Misha Andriluka, Cristian Sminchisescu

Main category: cs.CV

TL;DR: THFM is a unified video foundation model for human-centric perception that handles both dense (depth, normals, segmentation, dense pose) and sparse (2D/3D keypoints) tasks using a single architecture derived from pretrained text-to-video diffusion models.

Details

Motivation: To create a unified perception model that can handle multiple human-centric video understanding tasks simultaneously, overcoming the limitations of specialized models that require separate architectures for different tasks.

Method: Repurposes a pretrained text-to-video diffusion model as a single-forward-pass perception model, augmented with learnable tokens for sparse predictions. Uses text prompts to modulate task performance and is trained exclusively on synthetic data.

Result: Achieves state-of-the-art or competitive performance on various benchmarks despite training only on synthetic data. Demonstrates emergent generalization capabilities to multiple humans and other object classes beyond training data.

Conclusion: THFM shows that unified video foundation models derived from diffusion models can effectively handle diverse perception tasks and exhibit strong generalization capabilities, even when trained only on synthetic data.

Abstract: We present THFM, a unified video foundation model for human-centric perception that jointly addresses dense tasks (depth, normals, segmentation, dense pose) and sparse tasks (2d/3d keypoint estimation) within a single architecture. THFM is derived from a pretrained text-to-video diffusion model, repurposed as a single-forward-pass perception model and augmented with learnable tokens for sparse predictions. Modulated by the text prompt, our single unified model is capable of performing various perception tasks. Crucially, our model is on-par or surpassing state-of-the-art specialized models on a variety of benchmarks despite being trained exclusively on synthetic data (i.e.~without training on real-world or benchmark specific data). We further highlight intriguing emergent properties of our model, which we attribute to the underlying diffusion-based video representation. For example, our model trained on videos with a single human in the scene generalizes to multiple humans and other object classes such as anthropomorphic characters and animals – a capability that hasn’t been demonstrated in the past.

[119] DenseSwinV2: Channel Attentive Dual Branch CNN Transformer Learning for Cassava Leaf Disease Classification

Shah Saood, Saddam Hussain Khan

Main category: cs.CV

TL;DR: Hybrid Dense SwinV2 combines DenseNet CNN and Swin Transformer V2 for cassava disease classification, achieving 98.02% accuracy by fusing local and global features with attention mechanisms.

Details

Motivation: To improve cassava disease classification by combining the strengths of CNNs (local feature extraction) and Transformers (global context modeling) to handle visually similar lesions and complex field conditions.

Method: Two-branch framework: DenseNet branch captures high-resolution local features, while customized SwinV2 branch models global contextual dependencies via shifted-window self-attention. Both branches use attention channel-squeeze modules to emphasize disease-related responses, then fuse discriminative channels from both streams.

Result: Achieved 98.02% classification accuracy and 97.81% F1 score on a cassava leaf disease dataset of 31,000 images with 5 disease classes, outperforming established CNN and Transformer models.

Conclusion: Hybrid Dense SwinV2 offers robust and practical field-level diagnosis for cassava diseases, effectively handling challenges like occlusion, noise, and complex backgrounds through its combined CNN-Transformer architecture.

Abstract: This work presents a new Hybrid Dense SwinV2, a two-branch framework that jointly leverages densely connected convolutional features and hierarchical customized Swin Transformer V2 (SwinV2) representations for cassava disease classification. The proposed framework captures high resolution local features through its DenseNet branch, preserving the fine structural cues and also allowing for effective gradient flow. Concurrently, the customized SwinV2 models global contextual dependencies through the idea of shifted-window self attention, which enables the capture of long range interactions critical in distinguishing between visually similar lesions. Moreover, an attention channel-squeeze module is employed for each CNN Transformer stream independently to emphasize discriminative disease related responses and suppress redundant or background driven activations. Finally, these discriminative channels are fused to achieve refined representations from the dense local and SwinV2 global correlated strengthened feature maps, respectively. The proposed Dense SwinV2 utilized a public cassava leaf disease dataset of 31000 images, comprised of five diseases, including brown streak, mosaic, green mottle, bacterial blight, and normal leaf conditions. The proposed Dense SwinV2 demonstrates a significant classification accuracy of 98.02 percent with an F1 score of 97.81 percent, outperforming well-established convolutional and transformer models. These results underline the fact that Hybrid Dense SwinV2 offers robustness and practicality in the field level diagnosis of cassava disease and real world challenges related to occlusion, noise, and complex backgrounds.

[120] Shared Representation for 3D Pose Estimation, Action Classification, and Progress Prediction from Tactile Signals

Isaac Han, Seoyoung Lee, Sangyeon Park, Ecehan Akan, Yiyue Luo, Joseph DelPreto, Kyung-Joong Kim

Main category: cs.CV

TL;DR: SCOTTI is a unified tactile-based model using shared convolutional transformers for simultaneous 3D human pose estimation, action classification, and action progress prediction from foot sensor data.

Details

Motivation: Vision-based methods for human pose and action understanding suffer from occlusion and privacy issues in real-world environments. Tactile sensing avoids these problems but existing approaches handle each task separately, leading to suboptimal performance.

Method: Proposes Shared COnvolutional Transformer for Tactile Inference (SCOTTI) that learns a shared representation to simultaneously address three tasks: 3D human pose estimation, action classification, and action completion progress estimation using foot tactile signals from custom wireless insole sensors.

Result: SCOTTI outperforms existing approaches across all three tasks. The multi-task learning approach enables improved performance compared to learning tasks independently. A novel dataset was collected from 15 participants performing various activities with 7 hours of total duration.

Conclusion: This is the first work to explore action progress prediction using foot tactile signals. The unified multi-task learning approach effectively leverages mutual benefits across tasks, demonstrating superior performance over independent learning methods.

Abstract: Estimating human pose, classifying actions, and predicting movement progress are essential for human-robot interaction. While vision-based methods suffer from occlusion and privacy concerns in realistic environments, tactile sensing avoids these issues. However, prior tactile-based approaches handle each task separately, leading to suboptimal performance. In this study, we propose a Shared COnvolutional Transformer for Tactile Inference (SCOTTI) that learns a shared representation to simultaneously address three separate prediction tasks: 3D human pose estimation, action class categorization, and action completion progress estimation. To the best of our knowledge, this is the first work to explore action progress prediction using foot tactile signals from custom wireless insole sensors. This unified approach leverages the mutual benefits of multi-task learning, enabling the model to achieve improved performance across all three tasks compared to learning them independently. Experimental results demonstrate that SCOTTI outperforms existing approaches across all three tasks. Additionally, we introduce a novel dataset collected from 15 participants performing various activities and exercises, with 7 hours of total duration, across eight different activities.

[121] Reinforcing Structured Chain-of-Thought for Video Understanding

Peiyao Wang, Haotian Xu, Noranart Vesdapunt, Rui Hou, Jingyi Zhang, Haibin Ling, Oleksandr Obiednikov, Ning Zhou, Kah Kuen Fu

Main category: cs.CV

TL;DR: SDRL is a novel single-stage RL framework for video understanding MLLMs that eliminates SFT by using Summarize->Think->Answer structured CoT with self-supervised CVK and DVR mechanisms.

Details

Motivation: Existing MLLMs for video understanding suffer from thinking drift and weak temporal comprehension. RL methods like GRPO still require costly SFT with CoT annotations, enforce fixed reasoning paths, limit generalization, and induce bias.

Method: SDRL uses single-stage RL without SFT, employing Structured CoT format: Summarize -> Think -> Answer. It integrates two self-supervised mechanisms into GRPO: 1) CVK reduces KL divergence among generated summaries for factual grounding, and 2) DVR dynamically modulates thinking diversity based on group accuracy to promote exploration.

Result: Achieves state-of-the-art performance on seven public VideoQA datasets.

Conclusion: SDRL effectively balances alignment and exploration by supervising both final answers and reasoning processes, overcoming limitations of existing RL methods for video understanding MLLMs.

Abstract: Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs’ ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize -> Think -> Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets.

[122] Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets

Alex Koran, Dimitrios Sinodinos, Hadi Hojjati, Takuya Nanri, Fangge Chen, Narges Armanfard

Main category: cs.CV

TL;DR: VLAAD is a video-language-augmented anomaly detector for collision prediction in autonomous driving, trained on new multimodal datasets CARLA-Collide and Real-Collide, improving driving scores and outperforming larger models.

Details

Motivation: High infraction rates in end-to-end autonomous driving, particularly collision-related failures, remain a major bottleneck. Existing approaches lack collision-aware representation learning, and simulator datasets are limited in multimodality and scenario diversity.

Method: Developed VLAAD (Video-Language-Augmented Anomaly Detector) using Multiple Instance Learning for stable, temporally localized collision signals. Created CARLA-Collide (simulator dataset) and Real-Collide (real-world dataset) for training. VLAAD serves as a plug-in module for existing E2E driving models.

Result: Integration into TransFuser++ agent achieved 14.12% relative increase in driving score. In open-loop evaluation on real-world data, VLAAD (0.6B parameters) outperformed multi-billion-parameter vision-language model by 23.3% improvement in AUC.

Conclusion: VLAAD effectively addresses collision prediction in autonomous driving through multimodal learning, demonstrating strong performance in both simulated and real-world settings with efficient parameter usage.

Abstract: High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction. To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks. Trained on this diverse simulator data, VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models. By integrating our module into a pretrained TransFuser++ agent, we demonstrate a 14.12% relative increase in driving score with minimal fine-tuning. Beyond closed-loop evaluation, we further assess the generalization capability of VLAAD in an open-loop setting using real-world driving data. To support this analysis, we introduce Real-Collide, a multimodal dataset of diverse dashcam videos paired with semantically rich annotations for collision detection and prediction. On this benchmark, despite containing only 0.6B parameters, VLAAD outperforms a multi-billion-parameter vision-language model, achieving a 23.3% improvement in AUC.

[123] Low-Rank-Modulated Functa: Exploring the Latent Space of Implicit Neural Representations for Interpretable Ultrasound Video Analysis

Julia Wolleb, Cristiana Baloescu, Alicia Durrer, Hemant D. Tagare, Xenophon Papademetris

Main category: cs.CV

TL;DR: LRM-Functa introduces a low-rank modulated functa architecture for ultrasound video analysis that creates interpretable latent spaces with structured periodic trajectories for cardiac cycle visualization and enables unsupervised detection of key cardiac frames.

Details

Motivation: While functa-based approaches using implicit neural representations show strong reconstruction performance for images, their latent spaces lack interpretability and structure, particularly for temporal data like ultrasound videos where understanding cardiac cycle patterns is clinically important.

Method: Proposes Low-Rank-Modulated Functa (LRM-Functa) that enforces low-rank adaptation of modulation vectors in time-resolved latent space, creating structured periodic trajectories for ultrasound videos. Each video frame is compressed to low-rank representations (as low as rank k=2) while maintaining temporal coherence.

Result: Outperforms prior methods in unsupervised end-diastolic and end-systolic frame detection, achieves competitive ejection fraction prediction performance, demonstrates generalizability to out-of-distribution data and lung ultrasound B-line classification, and enables smooth novel frame sampling along cardiac cycle.

Conclusion: LRM-Functa provides a compact, interpretable, and generalizable framework for ultrasound video analysis with structured latent spaces that facilitate visualization of temporal patterns and enable direct clinical measurements without additional training.

Abstract: Implicit neural representations (INRs) have emerged as a powerful framework for continuous image representation learning. In Functa-based approaches, each image is encoded as a latent modulation vector that conditions a shared INR, enabling strong reconstruction performance. However, the structure and interpretability of the corresponding latent spaces remain largely unexplored. In this work, we investigate the latent space of Functa-based models for ultrasound videos and propose Low-Rank-Modulated Functa (LRM-Functa), a novel architecture that enforces a low-rank adaptation of modulation vectors in the time-resolved latent space. When applied to cardiac ultrasound, the resulting latent space exhibits clearly structured periodic trajectories, facilitating visualization and interpretability of temporal patterns. The latent space can be traversed to sample novel frames, revealing smooth transitions along the cardiac cycle, and enabling direct readout of end-diastolic (ED) and end-systolic (ES) frames without additional model training. We show that LRM-Functa outperforms prior methods in unsupervised ED and ES frame detection, while compressing each video frame to as low as rank k=2 without sacrificing competitive downstream performance on ejection fraction prediction. Evaluations on out-of-distribution frame selection in a cardiac point-of-care dataset, as well as on lung ultrasound for B-line classification, demonstrate the generalizability of our approach. Overall, LRM-Functa provides a compact, interpretable, and generalizable framework for ultrasound video analysis. The code is available at https://github.com/JuliaWolleb/LRM_Functa.

[124] BEVMAPMATCH: Multimodal BEV Neural Map Matching for Robust Re-Localization of Autonomous Vehicles

Shounak Sural, Ragunathan Rajkumar

Main category: cs.CV

TL;DR: BEVMapMatch: A multimodal lidar+camera fusion framework for vehicle re-localization in GNSS-denied environments using BEV segmentation and cross-attention map matching.

Details

Motivation: Autonomous vehicles need robust localization in GNSS-denied/degraded environments (urban canyons, tunnels, adverse weather). Current methods struggle without GNSS priors, requiring alternative approaches for safe deployment.

Method: Uses context-aware lidar+camera fusion to generate multimodal Bird’s Eye View (BEV) segmentations. Employs cross-attention search to retrieve candidate map patches from known maps. Performs finer alignment using top retrieved candidates. Leverages multiple BEV segmentation frames for improved accuracy.

Result: Outperforms existing re-localization methods with Recall@1m of 39.8% (nearly twice the best baseline). Works in both good and adverse weather conditions without GNSS priors.

Conclusion: BEVMapMatch provides robust vehicle re-localization in GNSS-challenged environments through multimodal BEV segmentation and cross-attention map matching, enabling safe autonomous vehicle deployment without GNSS dependency.

Abstract: Localization in GNSS-denied and GNSS-degraded environments is a challenge for the safe widespread deployment of autonomous vehicles. Such GNSS-challenged environments require alternative methods for robust localization. In this work, we propose BEVMapMatch, a framework for robust vehicle re-localization on a known map without the need for GNSS priors. BEVMapMatch uses a context-aware lidar+camera fusion method to generate multimodal Bird’s Eye View (BEV) segmentations around the ego vehicle in both good and adverse weather conditions. Leveraging a search mechanism based on cross-attention, the generated BEV segmentation maps are then used for the retrieval of candidate map patches for map-matching purposes. Finally, BEVMapMatch uses the top retrieved candidate for finer alignment against the generated BEV segmentation, achieving accurate global localization without the need for GNSS. Multiple frames of generated BEV segmentation further improve localization accuracy. Extensive evaluations show that BEVMapMatch outperforms existing methods for re-localization in GNSS-denied and adverse environments, with a Recall@1m of 39.8%, being nearly twice as much as the best performing re-localization baseline. Our code and data will be made available at https://github.com/ssuralcmu/BEVMapMatch.git.

[125] FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong

Main category: cs.CV

TL;DR: FairLLaVA: A parameter-efficient fine-tuning method that mitigates demographic biases in multimodal LLMs for medical imaging tasks while maintaining overall performance.

Details

Motivation: Multimodal LLMs show uneven performance across demographic groups in clinical settings, risking unequal diagnostic narratives and eroding trust in AI-assisted decision-making. Fairness in MLLMs remains underexplored compared to vision-only or language-only models.

Method: FairLLaVA uses parameter-efficient fine-tuning with mutual information minimization between target attributes to regularize model representations to be demographic-invariant. It’s implemented as a lightweight plug-in using low-rank adapter fine-tuning, making it architecture-agnostic.

Result: Extensive experiments on chest radiology report generation and dermoscopy visual question answering benchmarks show FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities.

Conclusion: FairLLaVA provides an effective, efficient approach to address fairness issues in multimodal LLMs for medical applications, enabling more equitable AI-assisted clinical decision-making.

Abstract: While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model’s representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.

[126] Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control

Zhuoli Zhuang, Yu-Cheng Chang, Yu-Kai Wang, Thomas Do, Chin-Teng Lin

Main category: cs.CV

TL;DR: EEG-guided RL framework for autonomous driving that uses brain signals to align AI decisions with human cognitive responses, improving collision avoidance.

Details

Motivation: Current autonomous driving systems struggle to align with human expectations. RLHF methods are time-consuming and indirect. Human cognitive insights from EEG could provide more direct feedback for RL training.

Method: Collected EEG signals from 20 participants in driving simulator, analyzed ERP responses to sudden changes. Used neural network to predict ERP strength from visual scenes, integrated cognitive information into RL reward signal.

Result: Framework improved collision avoidance ability of RL algorithm, demonstrating potential of neuro-cognitive feedback for enhancing autonomous driving systems.

Conclusion: EEG-guided decision-making can effectively incorporate human cognitive insights into RL for autonomous driving, offering more direct alignment with human intent than conventional RLHF methods.

Abstract: Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machine with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework to incorporate human cognitive insights without behaviour response interruption into reinforcement learning (RL) for autonomous driving. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems. Our project page is: https://alex95gogo.github.io/Cognitive-Reward/.

[127] VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation

Rakib Hossain Sajib, Md Kishor Morol, Rajan Das Gupta, Mohammad Sakib Mahmood, Shuvra Smaran Das

Main category: cs.CV

TL;DR: LVLMs like GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision achieve competitive zero-shot facial age estimation without fine-tuning, evaluated on UTKFace and FG-NET datasets with eight metrics.

Details

Motivation: Traditional deep learning for facial age estimation requires extensive labeled data and domain-specific training. This study explores whether general-purpose large vision-language models can perform accurate zero-shot age estimation without task-specific adaptation.

Method: Zero-shot evaluation of GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on UTKFace and FG-NET datasets using eight metrics (MAE, MSE, RMSE, MAPE, MBE, R², CCC, ±5-year accuracy) without any fine-tuning or task-specific adaptation.

Result: General-purpose LVLMs deliver competitive performance in zero-shot settings, demonstrating emergent capabilities for accurate biometric age estimation. Performance disparities were observed related to image quality and demographic subgroups.

Conclusion: LVLMs show promise as tools for real-world applications in forensic science, healthcare, and HCI, but challenges remain in prompt sensitivity, interpretability, computational cost, and demographic fairness.

Abstract: Human age estimation from facial images represents a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art Large Vision-Language Models (LVLMs) for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess the performance of GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics, including MAE, MSE, RMSE, MAPE, MBE, $R^2$, CCC, and $\pm$5-year accuracy, we demonstrate that general-purpose LVLMs can deliver competitive performance in zero-shot settings. Our findings highlight the emergent capabilities of LVLMs for accurate biometric age estimation and position these models as promising tools for real-world applications. Additionally, we highlight performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark and positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction. The benchmark focuses on strict zero-shot inference without fine-tuning and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.

[128] Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)

Gustavo Chau Loo Kung, Mohammad Abbasi, Camila Blank, Juze Zhang, Alan Q. Wang, Sophie Ostmeier, Akshay Chaudhari, Kilian Pohl, Ehsan Adeli

Main category: cs.CV

TL;DR: D-RoPE: A transformer with diffusion space rotatory positional embedding for learning general-purpose representations from diffusion MRI data, addressing challenges of spatial, diffusion-weighting, and directional dependencies.

Details

Motivation: Existing deep learning approaches fail to capture unique properties of diffusion MRI signals, which have spatial, diffusion-weighting, and directional dependencies. Varying acquisition protocols further limit traditional models.

Method: Introduces diffusion space rotatory positional embedding (D-RoPE) plugged into a dMRI transformer to capture both spatial structure and directional characteristics. Uses self-supervised masked autoencoding pretraining.

Result: Learned representations provide competitive/superior performance across downstream tasks. Finetuned features achieved 6% higher accuracy in classifying mild cognitive impairment and 0.05 increase in correlation coefficient for cognitive score prediction.

Conclusion: D-RoPE enables robust and transferable representations across diverse acquisition settings and arbitrary numbers of diffusion directions, advancing general-purpose representation learning from dMRI data.

Abstract: Diffusion Magnetic Resonance Imaging (dMRI) plays a critical role in studying microstructural changes in the brain. It is, therefore, widely used in clinical practice; yet progress in learning general-purpose representations from dMRI has been limited. A key challenge is that existing deep learning approaches are not well-suited to capture the unique properties of diffusion signals. Brain dMRI is normally composed of several brain volumes, each with different attenuation characteristics dependent on the direction and strength of the diffusion-sensitized gradients. Thus, there is a need to jointly model spatial, diffusion-weighting, and directional dependencies in dMRI. Furthermore, varying acquisition protocols (e.g., differing numbers of directions) further limit traditional models. To address these gaps, we introduce a diffusion space rotatory positional embedding (D-RoPE) plugged into our dMRI transformer to capture both the spatial structure and directional characteristics of diffusion data, enabling robust and transferable representations across diverse acquisition settings and an arbitrary number of diffusion directions. After self-supervised masked autoencoding pretraining, tests on several downstream tasks show that the learned representations and the pretrained model can provide competitive or superior performance compared to several baselines in these downstream tasks (even compared to a fully trained baseline); the finetuned features from our pretrained encoder resulted in a 6% higher accuracy in classifying mild cognitive impairment and a 0.05 increase in the correlation coefficient when predicting cognitive scores. Code is available at: github.com/gustavochau/D-RoPE.

[129] Unlabeled Cross-Center Automatic Analysis for TAAD: An Integrated Framework from Segmentation to Clinical Features

Mengdi Liu, Qiang Li, Weizhi Nie, Shaopeng Zhang, Yuting Su

Main category: cs.CV

TL;DR: A UDA framework for automated extraction of Type A Aortic Dissection clinical features from medical imaging without target-domain annotations, enabling cross-institutional deployment.

Details

Motivation: TAAD requires rapid preoperative evaluation with quantitative clinical features, but current research focuses only on segmentation accuracy. Building comprehensive datasets requires expert annotations, and models suffer from domain shift during cross-institutional deployment.

Method: Unsupervised Domain Adaptation (UDA)-driven framework that leverages limited source-domain labels to adapt to unlabeled target-domain data, enabling cross-institutional multi-class segmentation and clinical feature extraction without target annotations.

Result: Method significantly improves cross-domain segmentation performance over SOTA approaches. Reader study with cardiovascular surgeons confirms automatically extracted features provide meaningful assistance for preoperative assessment.

Conclusion: The proposed end-to-end segmentation-to-feature pipeline achieves stable cross-institutional deployment, reliable clinical feature extraction, and practical utility for emergency workflows without high-cost annotations.

Abstract: Type A Aortic Dissection (TAAD) is a life-threatening cardiovascular emergency that demands rapid and precise preoperative evaluation. While key anatomical and pathological features are decisive for surgical planning, current research focuses predominantly on improving segmentation accuracy, leaving the reliable, quantitative extraction of clinically actionable features largely under-explored. Furthermore, constructing comprehensive TAAD datasets requires labor-intensive, expert level pixel-wise annotations, which is impractical for most clinical institutions. Due to significant domain shift, models trained on a single center dataset also suffer from severe performance degradation during cross-institutional deployment. This study addresses a clinically critical challenge: the accurate extraction of key TAAD clinical features during cross-institutional deployment in the total absence of target-domain annotations. To this end, we propose an unsupervised domain adaptation (UDA)-driven framework for the automated extraction of TAAD clinical features. The framework leverages limited source-domain labels while effectively adapting to unlabeled data from target domains. Tailored for real-world emergency workflows, our framework aims to achieve stable cross-institutional multi-class segmentation, reliable and quantifiable clinical feature extraction, and practical deployability independent of high-cost annotations. Extensive experiments demonstrate that our method significantly improves cross-domain segmentation performance compared to existing state-of-the-art approaches. More importantly, a reader study involving multiple cardiovascular surgeons confirms that the automatically extracted clinical features provide meaningful assistance for preoperative assessment, highlighting the practical utility of the proposed end-to-end segmentation-to-feature pipeline.

[130] JRM: Joint Reconstruction Model for Multiple Objects without Alignment

Qirui Wu, Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Richard Newcombe, Angel X. Chang, Jakob Engel, Henry Howard-Jenkins

Main category: cs.CV

TL;DR: JRM is a 3D flow-matching generative model that leverages object repetition across scenes for improved reconstruction through implicit latent space aggregation, without needing explicit alignment.

Details

Motivation: Current object-centric reconstruction methods treat objects independently, discarding valuable repetition signals where the same object appears multiple times in scenes. Prior methods rely on explicit matching and rigid alignment, making them error-prone and limited to rigid transformations.

Method: JRM frames object reconstruction as personalized generation where multiple observations share a common subject. It uses a 3D flow-matching generative model that implicitly aggregates unaligned observations in latent space, learning to produce consistent reconstructions without explicit constraints.

Result: JRM outperforms both independent and alignment-based baselines in reconstruction quality. It removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes like articulation.

Conclusion: Implicit aggregation through flow-matching generative models provides a more robust and flexible approach to leveraging object repetition for 3D reconstruction, handling both rigid and non-rigid transformations without explicit alignment.

Abstract: Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints. Evaluations on synthetic and real-world data show that JRM’s implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.

[131] FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation

Changyang Li, Xueqing Huang, Shin-Fang Chng, Huangying Zhan, Qingan Yan, Yi Xu

Main category: cs.CV

TL;DR: FAST3DIS is an end-to-end feed-forward 3D instance segmentation method that replaces traditional lift-and-cluster approaches with a query-based Transformer architecture using 3D anchors and cross-attention for view-consistent segmentation.

Details

Motivation: Current 3D instance segmentation methods rely on disjointed "lift-and-cluster" paradigms that use non-differentiable clustering, which scales poorly with multiple views and disconnects representation learning from the final segmentation objective.

Method: Proposes a 3D-anchored, query-based Transformer architecture built on a depth backbone. Uses learned 3D anchor generator with anchor-sampling cross-attention for view-consistent segmentation. Projects 3D object queries into multi-view feature maps for efficient context sampling. Implements dual-level regularization with multi-view contrastive learning and dynamically scheduled spatial overlap penalty.

Result: Achieves competitive segmentation accuracy on complex indoor 3D datasets with significantly improved memory scalability and inference speed compared to state-of-the-art clustering-based methods.

Conclusion: FAST3DIS provides an effective end-to-end alternative to clustering-based 3D instance segmentation that improves scalability and speed while maintaining accuracy through its query-based Transformer architecture and regularization strategies.

Abstract: While recent feed-forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed “lift-and-cluster” paradigm. Grouping dense pixel-wise embeddings via non-differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end-to-end approach that effectively bypasses post-hoc clustering. We introduce a 3D-anchored, query-based Transformer architecture built upon a foundational depth backbone, adapted efficiently to learn instance-specific semantics while retaining its zero-shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor-sampling cross-attention mechanism for view-consistent 3D instance segmentation. By projecting 3D object queries directly into multi-view feature maps, our method samples context efficiently. Furthermore, we introduce a dual-level regularization strategy, that couples multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to explicitly prevent query collisions and ensure precise instance boundaries. Experiments on complex indoor 3D datasets demonstrate that our approach achieves competitive segmentation accuracy with significantly improved memory scalability and inference speed over state-of-the-art clustering-based methods.

[132] Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays

Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao, Chao Liang, Kun Xie, Qiguang Miao

Main category: cs.CV

TL;DR: CoGaze: A medical vision-language pretraining framework for chest X-rays that incorporates clinical context and radiologists’ gaze to improve diagnostic reasoning and cross-modal alignment.

Details

Motivation: Existing medical vision-language models fail to capture the diagnostic workflow, treating radiographs as context-agnostic images and ignoring radiologists' gaze cues, which hinders disease-specific pattern modeling and weakens cross-modal alignment.

Method: Proposes a context-infused vision encoder that models how radiologists integrate clinical context, and a multi-level supervision paradigm with hybrid-positive contrastive learning, disease-aware cross-modal representation learning, and gaze-guided attention to diagnostically salient regions.

Result: CoGaze outperforms state-of-the-art methods across diverse tasks: +2.0% CheXbertF1 and +1.2% BLEU2 for report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval.

Conclusion: Incorporating clinical context and radiologists’ gaze significantly improves medical vision-language pretraining, enabling better diagnostic reasoning and cross-modal alignment for chest X-ray analysis.

Abstract: Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists’ gaze – a crucial cue for visual reasoning – remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context – including patient history, symptoms, and diagnostic intent – to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists’ gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.

[133] Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

Zhuan Shi, Alireza Dehghanpour Farashah, Rik de Vries, Golnoosh Farnadi

Main category: cs.CV

TL;DR: NLCE is a training-free framework for localized concept erasure in text-to-image diffusion models that preserves neighboring concepts while removing target concepts through spectral embedding modulation, attention-guided spatial gating, and localized hard erasure.

Details

Motivation: Current localized concept erasure methods in text-to-image diffusion models can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains where preserving subtle distinctions between related concepts is crucial.

Method: Three-stage training-free framework: (1) spectrally-weighted embedding modulation to attenuate target concept directions while stabilizing neighbor representations, (2) attention-guided spatial gate to identify regions with residual concept activation, and (3) spatially-gated hard erasure that eliminates remaining traces only where necessary.

Result: Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show effective target concept removal while better preserving closely related categories. Additional results on celebrity identity, explicit content, and artistic style demonstrate robustness and generalization to broader erasure scenarios.

Conclusion: NLCE enables localized concept removal while maintaining surrounding concept neighborhood structure, addressing the neighbor concept degradation problem in existing erasure methods and showing strong performance across diverse erasure tasks.

Abstract: Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.

[134] Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

Zizhao Chen, Ping Wei, Ziyang Ren, Huan Li, Xiangru Yin

Main category: cs.CV

TL;DR: MaLSF is a novel multimodal misinformation detection framework that uses mask-aware local semantic fusion with bidirectional cross-modal verification to detect subtle inconsistencies between images and text.

Details

Motivation: Current multimodal verification methods using passive holistic fusion struggle with sophisticated misinformation due to 'feature dilution' - global alignments average out subtle local semantic inconsistencies, masking the very conflicts they're designed to find.

Method: MaLSF uses mask-label pairs as semantic anchors to bridge pixels and words. It features: 1) Bidirectional Cross-modal Verification (BCV) module with parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts, and 2) Hierarchical Semantic Aggregation (HSA) module to aggregate multi-granularity conflict signals for task-specific reasoning. Also includes diverse mask-label pair extraction parsers.

Result: Achieves state-of-the-art performance on both DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results verify effectiveness and interpretability.

Conclusion: MaLSF shifts the paradigm to active, bidirectional verification mimicking human cognitive cross-referencing, effectively detecting sophisticated multimodal misinformation by focusing on local semantic inconsistencies rather than global alignments.

Abstract: As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to ‘feature dilution,’ global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.

Danny Abraham, Nikhil Kamalkumar Advani, Arun Das, Nikil Dutt

Main category: cs.CV

TL;DR: GeoReFormer is a transformer-based architecture for 3D lane detection and topology reasoning that incorporates geometric and relational inductive biases into the decoder design for improved structured map construction in autonomous driving.

Details

Motivation: Existing transformer-based approaches for 3D lane detection inherit decoder designs from object detection, but lane segments are continuous polylines in directed graphs with specific geometric and relational structure that current methods don't explicitly encode.

Method: Proposes GeoReFormer with three key components: 1) data-driven geometric priors for structured query initialization, 2) bounded coordinate-space refinement for stable polyline deformation, and 3) per-query gated topology propagation to selectively integrate relational context.

Result: Achieves state-of-the-art performance on OpenLane-V2 benchmark with 34.5% mAP while improving topology consistency over strong transformer baselines.

Conclusion: Explicit encoding of geometric and relational structure in transformer decoders significantly improves 3D lane detection and topology reasoning for autonomous driving map construction.

Abstract: Accurate 3D lane segment detection and topology reasoning are critical for structured online map construction in autonomous driving. Recent transformer-based approaches formulate this task as query-based set prediction, yet largely inherit decoder designs originally developed for compact object detection. However, lane segments are continuous polylines embedded in directed graphs, and generic query initialization and unconstrained refinement do not explicitly encode this geometric and relational structure. We propose GeoReFormer (Geometry-aware Refinement Transformer), a unified query-based architecture that embeds geometry- and topology-aware inductive biases directly within the transformer decoder. GeoReFormer introduces data-driven geometric priors for structured query initialization, bounded coordinate-space refinement for stable polyline deformation, and per-query gated topology propagation to selectively integrate relational context. On the OpenLane-V2 benchmark, GeoReFormer achieves state-of-the-art performance with 34.5% mAP while improving topology consistency over strong transformer baselines, demonstrating the utility of explicit geometric and relational structure encoding.

[136] MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection

Peiyuan Jiang, Yao Liu, Yanglei Gan, Jiaye Yang, Lu Liu, Daibing Yao, Qiao Liu

Main category: cs.CV

TL;DR: Cross-modal knowledge distillation framework using GSR to guide non-contact deception detection from video/audio, with new multimodal dataset and progressive distillation method

Details

Motivation: Non-contact deception detection (video/audio) is challenging due to unstable cross-subject patterns, while GSR provides more reliable physiological cues but requires contact. Need to leverage GSR knowledge to improve non-contact modalities.

Method: Introduces MuDD dataset with 130 participants, 690 minutes of multimodal data (video, audio, GSR, PPG, HR, personality). Proposes GSR-guided Progressive Distillation (GPD) framework with progressive feature-level and digit-level distillation with dynamic routing to adaptively transfer knowledge across modalities.

Result: GPD outperforms existing methods and achieves state-of-the-art performance on both deception detection and concealed-digit identification tasks. Extensive experiments and visualizations demonstrate effectiveness.

Conclusion: Cross-modal knowledge distillation from contact-based GSR to non-contact video/audio modalities effectively improves deception detection, with the proposed progressive distillation framework mitigating modality mismatch issues.

Abstract: Non-contact automatic deception detection remains challenging because visual and auditory deception cues often lack stable cross-subject patterns. In contrast, galvanic skin response (GSR) provides more reliable physiological cues and has been widely used in contact-based deception detection. In this work, we leverage stable deception-related knowledge in GSR to guide representation learning in non-contact modalities through cross-modal knowledge distillation. A key obstacle, however, is the lack of a suitable dataset for this setting. To address this, we introduce MuDD, a large-scale Multimodal Deception Detection dataset containing recordings from 130 participants over 690 minutes. In addition to video, audio, and GSR, MuDD also provides Photoplethysmography, heart rate, and personality traits, supporting broader scientific studies of deception. Based on this dataset, we propose GSR-guided Progressive Distillation (GPD), a cross-modal distillation framework for mitigating the negative transfer caused by the large modality mismatch between GSR and non-contact signals. The core innovation of GPD is the integration of progressive feature-level and digit-level distillation with dynamic routing, which allows the model to adaptively determine how teacher knowledge should be transferred during training, leading to more stable cross-modal knowledge transfer. Extensive experiments and visualizations show that GPD outperforms existing methods and achieves state-of-the-art performance on both deception detection and concealed-digit identification.

[137] Learning to Trim: End-to-End Causal Graph Pruning with Dynamic Anatomical Feature Banks for Medical VQA

Zibo Xu, Qiang Li, Weizhi Nie, Yuting Su

Main category: cs.CV

TL;DR: LCT framework integrates causal pruning into end-to-end optimization for MedVQA, using a dynamic feature bank to capture global patterns and a differentiable trimming module to suppress spurious correlations while emphasizing instance-specific evidence.

Details

Motivation: MedVQA models suffer from limited generalization due to reliance on dataset-specific correlations (anatomical patterns, question-type regularities) rather than genuine diagnostic evidence. Existing causal approaches are static or post-hoc corrections.

Method: Proposes Learnable Causal Trimming (LCT) framework with: 1) Dynamic Anatomical Feature Bank (DAFB) updated via momentum mechanism to capture global prototypes of frequent patterns; 2) Differentiable trimming module that estimates dependency between instance-level representations and global feature bank, softly suppressing features correlated with global prototypes while emphasizing instance-specific evidence.

Result: Experiments on VQA-RAD, SLAKE, SLAKE-CP and PathVQA demonstrate LCT consistently improves robustness and generalization over existing debiasing strategies.

Conclusion: LCT provides an effective learnable mechanism for MedVQA models to prioritize causal signals over spurious correlations adaptively, enhancing generalization capabilities.

Abstract: Medical Visual Question Answering (MedVQA) models often exhibit limited generalization due to reliance on dataset-specific correlations, such as recurring anatomical patterns or question-type regularities, rather than genuine diagnostic evidence. Existing causal approaches are typically implemented as static adjustments or post-hoc corrections. To address this issue, we propose a Learnable Causal Trimming (LCT) framework that integrates causal pruning into end-to-end optimization. We introduce a Dynamic Anatomical Feature Bank (DAFB), updated via a momentum mechanism, to capture global prototypes of frequent anatomical and linguistic patterns, serving as an approximation of dataset-level regularities. We further design a differentiable trimming module that estimates the dependency between instance-level representations and the global feature bank. Features highly correlated with global prototypes are softly suppressed, while instance-specific evidence is emphasized. This learnable mechanism encourages the model to prioritize causal signals over spurious correlations adaptively. Experiments on VQA-RAD, SLAKE, SLAKE-CP and PathVQA demonstrate that LCT consistently improves robustness and generalization over existing debiasing strategies.

[138] R-PGA: Robust Physical Adversarial Camouflage Generation via Relightable 3D Gaussian Splatting

Tianrui Lou, Siyuan Liang, Jiawei Liang, Yuze Gao, Xiaochun Cao

Main category: cs.CV

TL;DR: R-PGA: A relightable 3D Gaussian Splatting framework for robust physical adversarial camouflage attacks on autonomous driving systems that addresses simulation fidelity and optimization robustness issues.

Details

Motivation: Current physical adversarial camouflage methods fail in complex dynamic scenarios due to domain gaps from oversimplified simulations and rugged loss landscapes from targeting average performance, making them vulnerable to geometric and radiometric variations.

Method: Uses relightable 3D Gaussian Splatting for photo-realistic reconstruction with physically disentangled attributes, hybrid rendering pipeline combining precise foreground rendering with image translation for backgrounds, and Hard Physical Configuration Mining to actively mine worst-case configurations and flatten loss landscape.

Result: The framework achieves consistent adversarial effectiveness and robustness across varying physical configurations by addressing both simulation fidelity and optimization robustness issues.

Conclusion: R-PGA bridges the simulation-optimization gap for physical adversarial attacks, enabling more robust camouflage in complex dynamic environments like autonomous driving scenarios.

Abstract: Physical adversarial camouflage poses a severe security threat to autonomous driving systems by mapping adversarial textures onto 3D objects. Nevertheless, current methods remain brittle in complex dynamic scenarios, failing to generalize across diverse geometric (e.g., viewing configurations) and radiometric (e.g., dynamic illumination, atmospheric scattering) variations. We attribute this deficiency to two fundamental limitations in simulation and optimization. First, the reliance on coarse, oversimplified simulations (e.g., via CARLA) induces a significant domain gap, confining optimization to a biased feature space. Second, standard strategies targeting average performance result in a rugged loss landscape, leaving the camouflage vulnerable to configuration shifts.To bridge these gaps, we propose the Relightable Physical 3D Gaussian Splatting (3DGS) based Attack framework (R-PGA). Technically, to address the simulation fidelity issue, we leverage 3DGS to ensure photo-realistic reconstruction and augment it with physically disentangled attributes to decouple intrinsic material from lighting. Furthermore, we design a hybrid rendering pipeline that leverages precise Relightable 3DGS for foreground rendering, while employing a pre-trained image translation model to synthesize plausible relighted backgrounds that align with the relighted foreground.To address the optimization robustness issue, we propose the Hard Physical Configuration Mining (HPCM) module, designed to actively mine worst-case physical configurations and suppress their corresponding loss peaks. This strategy not only diminishes the overall loss magnitude but also effectively flattens the rugged loss landscape, ensuring consistent adversarial effectiveness and robustness across varying physical configurations.

[139] Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs

Jiazheng Xing, Chao Xu, Hangjie Yuan, Mengmeng Wang, Jun Dan, Hangwei Qian, Yong Liu

Main category: cs.CV

TL;DR: FSAR-LLaVA: First end-to-end MLLM-based method for few-shot action recognition using multimodal knowledge base, feature enhancement, and multimodal prototype matching.

Details

Motivation: Existing MLLM approaches for few-shot action recognition use suboptimal captioning pipelines and only metric learning in visual space. Need to directly leverage MLLMs as multimodal knowledge bases for enhanced FSAR.

Method: 1) Extract enriched multimodal features from MLLM decoder, enhance via Multimodal Feature-Enhanced Module. 2) Use MLLM prompts to adapt to scenarios and bridge train-test distribution gap. 3) Introduce training-free Multimodal Prototype Matching Metric for joint multimodal metric learning.

Result: Superior performance across various FSAR tasks with minimal trainable parameters, demonstrating effectiveness of direct MLLM integration.

Conclusion: FSAR-LLaVA successfully leverages MLLMs as multimodal knowledge bases for direct FSAR enhancement, achieving state-of-the-art results with efficient design.

Abstract: Multimodal Large Language Models (MLLMs) have propelled the field of few-shot action recognition (FSAR). However, preliminary explorations in this area primarily focus on generating captions to form a suboptimal feature->caption->feature pipeline and adopt metric learning solely within the visual space. In this paper, we propose FSAR-LLaVA, the first end-to-end method to leverage MLLMs (such as Video-LLaVA) as a multimodal knowledge base for directly enhancing FSAR. First, at the feature level, we leverage the MLLM’s multimodal decoder to extract spatiotemporally and semantically enriched representations, which are then decoupled and enhanced by our Multimodal Feature-Enhanced Module into distinct visual and textual features that fully exploit their semantic knowledge for FSAR. Next, we leverage the versatility of MLLMs to craft input prompts that flexibly adapt to diverse scenarios, and use their aligned outputs to drive our designed Composite Task-Oriented Prototype Construction, effectively bridging the distribution gap between meta-train and meta-test sets. Finally, to enable multimodal features to guide metric learning jointly, we introduce a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and efficiently leverages the decoupled feature representations produced by MLLMs. Extensive experiments demonstrate superior performance across various tasks with minimal trainable parameters.

[140] When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization

Zhihan Chen, Yuhuan Zhao, Yijie Zhu, Xinyu Yao

Main category: cs.CV

TL;DR: The paper exposes a critical failure in subject-driven text-to-image diffusion models: they create an “Illusion of Scalability” where models appear to handle 2-4 subjects well but catastrophically collapse identities when scaling to 6-10 subjects or complex interactions, and proposes a new evaluation metric (SCR) to properly measure this failure.

Details

Motivation: Current subject-driven text-to-image models excel at single identities but struggle with multiple interacting subjects. Existing evaluation using CLIP metrics fails to detect local identity collapse and multi-subject entanglement, creating a misleading perception of model capabilities.

Method: 1) Constructed a stress-test benchmark with 75 prompts across varying subject counts (2-10) and interaction difficulties (Neutral, Occlusion, Interaction). 2) Introduced Subject Collapse Rate (SCR) metric using DINOv2’s structural priors to penalize local attention leakage and homogenization. 3) Evaluated state-of-the-art models (MOSAIC, XVerse, PSR) systematically.

Result: Models show precipitous drop in identity fidelity as scene complexity grows, with SCR approaching 100% at 10 subjects. Standard CLIP metrics are fundamentally flawed, often assigning high scores to semantically correct but identity-collapsed images. The failure is traced to semantic shortcuts in global attention routing.

Conclusion: Current models suffer from catastrophic identity collapse when scaling to multiple subjects or complex interactions, revealing an “Illusion of Scalability.” The proposed SCR metric better captures these failures, highlighting the urgent need for explicit physical disentanglement in future generative architectures.

Abstract: Subject-driven text-to-image diffusion models have achieved remarkable success in preserving single identities, yet their ability to compose multiple interacting subjects remains largely unexplored and highly challenging. Existing evaluation protocols typically rely on global CLIP metrics, which are insensitive to local identity collapse and fail to capture the severity of multi-subject entanglement. In this paper, we identify a pervasive “Illusion of Scalability” in current models: while they excel at synthesizing 2-4 subjects in simple layouts, they suffer from catastrophic identity collapse when scaled to 6-10 subjects or tasked with complex physical interactions. To systematically expose this failure mode, we construct a rigorous stress-test benchmark comprising 75 prompts distributed across varying subject counts and interaction difficulties (Neutral, Occlusion, Interaction). Furthermore, we demonstrate that standard CLIP-based metrics are fundamentally flawed for this task, as they often assign high scores to semantically correct but identity-collapsed images (e.g., generating generic clones). To address this, we introduce the Subject Collapse Rate (SCR), a novel evaluation metric grounded in DINOv2’s structural priors, which strictly penalizes local attention leakage and homogenization. Our extensive evaluation of state-of-the-art models (MOSAIC, XVerse, PSR) reveals a precipitous drop in identity fidelity as scene complexity grows, with SCR approaching 100% at 10 subjects. We trace this collapse to the semantic shortcuts inherent in global attention routing, underscoring the urgent need for explicit physical disentanglement in future generative architectures.

[141] Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection

Kutub Uddin, Nusrat Tasnim, Byung Tae Oh

Main category: cs.CV

TL;DR: A hierarchical deepfake detection method called Face2Parts that analyzes facial regions from coarse to fine (frame → face → lips/eyes/nose) using channel attention and triplet learning to improve detection performance.

Details

Motivation: Deepfakes pose serious threats to multimedia authenticity across various applications. Existing forensic methods have limitations as they focus on specific facial regions, but different manipulations leave traces in different regions. A comprehensive approach that leverages multiple facial regions could improve detection.

Method: Proposes Face2Parts with hierarchical feature representation (HFR): extracts features from frame, face, and key facial regions (lips, eyes, nose) separately. Uses channel-attention mechanism to capture inter-dependencies among regions and deep triplet learning for improved feature discrimination.

Result: Achieves strong performance across multiple datasets: 98.42% AUC on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR. Outperforms existing methods and shows good generalization in intra-, inter-dataset, and inter-manipulation settings.

Conclusion: The hierarchical approach effectively captures coarse-to-fine relationships in facial regions, improving deepfake detection performance and generalization across different manipulation types and datasets.

Abstract: Multimedia data, particularly images and videos, is integral to various applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, amateur or skilled counterfeiters can simulate them to create deepfakes, often for slanderous motives. To address this challenge, several forensic methods have been developed to ensure the authenticity of the content. The effectiveness of these methods depends on their focus, with challenges arising from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each method has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Considering these insights, we propose a novel hybrid approach called Face2Parts based on hierarchical feature representation ($HFR$) that takes advantage of coarse-to-fine information to improve deepfake detection. The proposed method involves extracting features from the frame, face, and key facial regions (i.e., lips, eyes, and nose) separately to explore the coarse-to-fine relationships. This approach enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets in both intra-, inter-dataset, and inter-manipulation settings. The proposed method achieves an average AUC of 98.42% on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR, respectively. The results demonstrate that our approach generalizes effectively and achieves promising performance to outperform the existing methods.

[142] Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives

Daiqiang Li, Zihao Pan, Zeyu Zhang, Ronghao Chen, Huacan Wang, Honggang Chen, Haiyun Jiang

Main category: cs.CV

TL;DR: Empirical study on token pruning for GUI screenshots in multimodal agents reveals three key insights: foreground-background separation shows background captures interface-state transitions, random pruning preserves spatial structure better, and recency-based token allocation reduces computation while maintaining performance.

Details

Motivation: GUI visual agents using MLLMs face computational challenges with high-resolution screenshots generating many visual tokens. Preserving complete historical information is expensive, so efficient token pruning strategies are needed for practical deployment.

Method: Conducted empirical study on token pruning for GUI screenshots, analyzing different strategies including foreground-background separation using edge-based methods, random pruning, and recency-based token allocation strategies.

Result: Three key findings: 1) GUI screenshots have foreground-background composition where background captures interface-state transitions, 2) random pruning outperforms carefully designed strategies in preserving spatial structure, 3) recency-based token allocation (more tokens for recent screenshots) reduces computation significantly with minimal performance loss.

Conclusion: The study provides practical guidance for designing efficient GUI visual agents by leveraging these insights about GUI screenshot characteristics and effective pruning strategies to reduce computational costs while maintaining agent performance.

Abstract: In recent years, GUI visual agents built upon Multimodal Large Language Models (MLLMs) have demonstrated strong potential in navigation tasks. However, high-resolution GUI screenshots produce a large number of visual tokens, making the direct preservation of complete historical information computationally expensive. In this paper, we conduct an empirical study on token pruning for historical screenshots in GUI scenarios and distill three practical insights that are crucial for designing effective pruning strategies. First, we observe that GUI screenshots exhibit a distinctive foreground-background semantic composition. To probe this property, we apply a simple edge-based separation to partition screenshots into foreground and background regions. Surprisingly, we find that, contrary to the common assumption that background areas have little semantic value, they effectively capture interface-state transitions, thereby providing auxiliary cues for GUI reasoning. Second, compared with carefully designed pruning strategies, random pruning possesses an inherent advantage in preserving spatial structure, enabling better performance under the same computational budget. Finally, we observe that GUI Agents exhibit a recency effect similar to human cognition: by allocating larger token budgets to more recent screenshots and heavily compressing distant ones, we can significantly reduce computational cost while maintaining nearly unchanged performance. These findings offer new insights and practical guidance for the design of efficient GUI visual agents.

[143] SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou

Main category: cs.CV

TL;DR: SkinGPT-X is a multimodal collaborative multi-agent system with self-evolving memory for dermatological diagnosis, achieving state-of-the-art performance on standard datasets and rare skin disease benchmarks.

Details

Motivation: Monolithic LLMs struggle with fine-grained multi-class dermatological diagnosis and rare diseases due to training data sparsity, while lacking interpretability for clinical reasoning. Existing multi-agent systems focus on VQA/conversational tasks and rely on static knowledge bases, limiting adaptability in real-world clinical settings.

Method: Multimodal collaborative multi-agent system integrated with self-evolving dermatological memory mechanism that simulates dermatologists’ diagnostic workflow. Uses continuous memory evolution for transparent, trustworthy diagnostics.

Result: Achieves +9.6% accuracy improvement on DDI31, +13% weighted F1 gain on Dermnet over SOTA. On rare skin disease dataset (first benchmark with 8 rare diseases, 564 samples), achieves +9.8% accuracy, +7.1% weighted F1, +10% Cohen’s Kappa improvements.

Conclusion: SkinGPT-X demonstrates robust performance for complex and rare dermatological cases through its multimodal multi-agent architecture with self-evolving memory, providing transparent and trustworthy diagnostics.

Abstract: While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate the rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases which contains 564 clinical samples with eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, a +10% Cohen’s Kappa improvement.

[144] Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline

Qizhi Xie, Kun Yuan, Yunpeng Qu, Ming Sun, Chao Zhou, Jihong Zhu

Main category: cs.CV

TL;DR: This paper introduces Video Fluency Assessment (VFA) as a standalone perceptual task focused on temporal video quality, addressing limitations of existing video quality assessment methods that underrepresent fluency aspects like motion consistency and frame continuity.

Details

Motivation: Current video quality assessment (VQA) methods largely underrepresent fluency aspects like motion consistency and frame continuity, limiting their applicability for applications like streaming and gaming where temporal smoothness is crucial.

Method: 1) Constructed FluVid dataset with 4,606 in-the-wild videos with balanced fluency distribution and first-ever VFA scoring criteria; 2) Developed benchmark of 23 methods; 3) Proposed FluNet baseline model with temporal permuted self-attention (T-PSA) to enrich fluency information and enhance long-range inter-frame interactions.

Result: The work achieves state-of-the-art performance and provides the community with a comprehensive roadmap for VFA research, including the first dedicated dataset, benchmark, and baseline model for video fluency assessment.

Conclusion: Video Fluency Assessment should be treated as a standalone perceptual task separate from general video quality assessment, and the proposed framework provides essential tools and insights for advancing research in this important area.

Abstract: Accurately estimating humans’ subjective feedback on video fluency, e.g., motion consistency and frame continuity, is crucial for various applications like streaming and gaming. Yet, it has long been overlooked, as prior arts have focused on solving it in the video quality assessment (VQA) task, merely as a sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, thereby limiting their applicability. To this end, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, 1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with balanced fluency distribution, featuring the first-ever scoring criteria and human study for VFA. 2) We develop a large-scale benchmark of 23 methods, the most comprehensive one thus far on FluVid, gathering insights for VFA-tailored model designs. 3) We propose a baseline model called FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap to explore solutions for VFA.

[145] PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery

Elkhan Ismayilzada, Yufei Zhang, Zijun Cui

Main category: cs.CV

TL;DR: Physics-aware diffusion framework refines noisy hand pose sequences into physically plausible motion while estimating physics variance, providing interpretable confidence measures for physical consistency.

Details

Motivation: Current hand reconstruction methods from images produce accurate single-frame estimates but lack physics consistency and don't provide confidence measures about how well the motion satisfies physical laws.

Method: Proposes a physics-aware conditional diffusion framework with MeshCNN-Transformer backbone, formulates Euler-Lagrange dynamics for articulated hands, treats dynamic residuals as virtual observables (rather than enforcing zero residuals), and uses last-layer Laplace approximation for variance estimation.

Result: Experiments on two hand datasets show consistent improvements over image-based initializations and competitive performance with video-based methods. The method produces interpretable variance maps that align with physical plausibility of motion.

Conclusion: The framework successfully refines hand motion to be physically plausible while providing valuable physics variance estimates that indicate where physical consistency weakens, offering interpretable confidence measures for motion quality.

Abstract: Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN-Transformer backbone, we formulate Euler-Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.

[146] MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality

Kyungwon Kim, Dosik Hwang

Main category: cs.CV

TL;DR: MUST is a multimodal framework for cancer survival prediction that explicitly decomposes modality representations into modality-specific and cross-modal components, using conditional diffusion models to generate missing modality-specific information.

Details

Motivation: Clinical deployment of multimodal survival prediction faces challenges with incomplete modalities due to cost, technical limitations, or retrospective data availability. Existing methods lack explicit modeling of each modality's unique contributions versus information derivable from other modalities.

Method: Proposes MUST (Modality-Specific representation-aware Transformer) that decomposes each modality’s representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. Uses conditional latent diffusion models to generate high-quality representations for truly modality-specific information that cannot be inferred from available modalities.

Result: Extensive experiments on five TCGA cancer datasets demonstrate state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions, with clinically acceptable inference latency.

Conclusion: MUST provides an effective framework for handling missing modalities in multimodal medical data by explicitly modeling modality-specific contributions and using conditional diffusion for information recovery, enabling robust clinical deployment.

Abstract: Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose MUST (Modality-Specific representation-aware Transformer), a novel framework that explicitly decomposes each modality’s representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables precise identification of what information is lost when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that MUST achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions, with clinically acceptable inference latency.

[147] Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI

Jing Zhang, Bastien Bergere, Emilie Bollache, Jonas Leite, Mikaël Laredo, Alban Redheuil, Nadjia Kachenoura

Main category: cs.CV

TL;DR: A progressive learning strategy for left atrial scar segmentation from cardiac MRI using a 3-stage SwinUNETR framework with anatomy-aware loss function.

Details

Motivation: Automatic LA scar segmentation from LGE MRI is challenging due to low contrast, annotation variability, and lack of anatomical constraints, leading to unreliable predictions. The paper aims to incorporate clinical workflow insights into deep learning.

Method: 3-stage progressive learning framework: 1) LA cavity pre-learning model, 2) dual-task model learning spatial relationships between LA geometry and scar patterns, 3) fine-tuning for precise scar segmentation. Uses anatomy-aware spatially weighted loss incorporating clinical priors.

Result: On LASCARQS dataset with 5-fold cross-validation: LA segmentation Dice 0.94, LA scar segmentation Dice 0.50, HD 11.84mm, ASD 1.80mm. Outperforms one-stage scar segmentation (Dice 0.49, HD 13.02mm, ASD 1.96mm).

Conclusion: Explicitly embedding clinical anatomical priors and diagnostic reasoning into deep learning improves accuracy and reliability of LA scar segmentation, demonstrating importance of clinically informed model design.

Abstract: Cardiac MRI late gadolinium enhancement (LGE) enables non-invasive identification of left atrial (LA) scar, whose spatial distribution is strongly associated with atrial fibrillation (AF) severity and recurrence. However, automatic LA scar segmentation remains challenging due to low contrast, annotation variability, and the lack of anatomical constraints, often leading to non-reliable predictions. Accordingly, our aim was to propose a progressive learning strategy to segment LA scar from LGE images inspired from a clinical workflow. A 3-stage framework based on SwinUNETR was implemented, comprising: 1) a first LA cavity pre-learning model, 2) dual-task model which further learns spatial relationship between LA geometry and scar patterns, and 3) fine-tuning on precise segmentation of the scar. Furthermore, we introduced an anatomy-aware spatially weighted loss that incorporates prior clinical knowledge by constraining scar predictions to anatomically plausible LA wall regions while mitigating annotation bias. Our preliminary results obtained on validation LGE volumes from LASCARQS public dataset after 5-fold cross validation, LA segmentation had Dice score of 0.94, LA scar segmentation achieved Dice score of 0.50, Hausdorff Distance of 11.84 mm, Average Surface Distance of 1.80 mm, outperforming only a one-stage scar segmentation with 0.49, 13.02 mm, 1.96 mm, repectively. By explicitly embedding clinical anatomical priors and diagnostic reasoning into deep learning, the proposed approach improved the accuracy and reliability of LA scar segmentation from LGE, revealing the importance of clinically informed model design.

[148] Learnable Instance Attention Filtering for Adaptive Detector Distillation

Chen Liu, Qizhen Lan, Zhicheng Ding, Xinyu Chu, Qing Tian

Main category: cs.CV

TL;DR: LIAF-KD introduces learnable instance attention filtering for adaptive detector distillation, allowing dynamic instance importance evaluation during knowledge transfer from teacher to student models.

Details

Motivation: Existing feature-based knowledge distillation methods for vision models treat all object instances uniformly and use heuristic attention filtering, ignoring instance-level variability and not involving the student in the filtering process.

Method: Proposes LIAF-KD framework with learnable instance selectors that dynamically evaluate and reweight instance importance during distillation, where the student contributes based on its evolving learning state.

Result: Experiments on KITTI and COCO datasets show consistent improvements, achieving 2% gain on GFL ResNet-50 student without added complexity, outperforming state-of-the-art methods.

Conclusion: LIAF-KD effectively addresses limitations of existing KD methods by introducing adaptive, learnable instance filtering that involves student feedback, improving distillation efficiency for object detection.

Abstract: As deep vision models grow increasingly complex to achieve higher performance, deployment efficiency has become a critical concern. Knowledge distillation (KD) mitigates this issue by transferring knowledge from large teacher models to compact student models. While many feature-based KD methods rely on spatial filtering to guide distillation, they typically treat all object instances uniformly, ignoring instance-level variability. Moreover, existing attention filtering mechanisms are typically heuristic or teacher-driven, rather than learned with the student. To address these limitations, we propose Learnable Instance Attention Filtering for Adaptive Detector Distillation (LIAF-KD), a novel framework that introduces learnable instance selectors to dynamically evaluate and reweight instance importance during distillation. Notably, the student contributes to this process based on its evolving learning state. Experiments on the KITTI and COCO datasets demonstrate consistent improvements, with a 2% gain on a GFL ResNet-50 student without added complexity, outperforming state-of-the-art methods.

[149] MemCam: Memory-Augmented Camera Control for Consistent Video Generation

Xinhang Gao, Junlin Guan, Shuhan Luo, Wenzhuo Li, Guanghuan Tan, Jiacheng Wang

Main category: cs.CV

TL;DR: MemCam: A memory-augmented approach for interactive video generation that maintains scene consistency during long videos with dynamic camera control by treating previous frames as external memory and using context compression for efficient retrieval.

Details

Motivation: Existing interactive video generation methods struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information, which is crucial for realistic scene simulation and video creation.

Method: Proposes MemCam with memory-augmented approach that treats previously generated frames as external memory for contextual conditioning. Includes a context compression module that encodes memory frames into compact representations and uses co-visibility-based selection to dynamically retrieve the most relevant historical frames, reducing computational overhead while enriching context.

Result: MemCam significantly outperforms existing baseline methods and open-source state-of-the-art approaches in scene consistency, particularly in long video scenarios with large camera rotations.

Conclusion: Memory-augmented approach with context compression effectively addresses scene consistency challenges in interactive video generation, enabling better long video generation with dynamic camera control.

Abstract: Interactive video generation has significant potential for scene simulation and video creation. However, existing methods often struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information. To address this challenge, we propose MemCam, a memory-augmented interactive video generation approach that treats previously generated frames as external memory and leverages them as contextual conditioning to achieve controllable camera viewpoints with high scene consistency. To enable longer and more relevant context, we design a context compression module that encodes memory frames into compact representations and employs co-visibility-based selection to dynamically retrieve the most relevant historical frames, thereby reducing computational overhead while enriching contextual information. Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baseline methods as well as open-source state-of-the-art approaches in terms of scene consistency, particularly in long video scenarios with large camera rotations.

[150] CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection

Youngjun Song, Hyeongyu Kim, Dosik Hwang

Main category: cs.CV

TL;DR: CD-Buffer is a test-time adaptation framework that combines subtractive (removing domain-sensitive channels) and additive (refining features) approaches through a unified discrepancy metric to handle varying domain shift severities.

Details

Motivation: Current TTA methods have limitations: subtractive approaches work well for severe domain shifts but fail on moderate ones, while additive approaches excel on moderate shifts but struggle with severe corruption. There's a need for a unified framework that can adaptively balance both strategies based on measured feature-level domain shift severity.

Method: Proposes CD-Buffer, a complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions. Key innovation is discrepancy-driven coupling: the framework couples removal and refinement through a unified discrepancy metric, automatically balancing both strategies based on feature-level shift severity without manual tuning.

Result: Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.

Conclusion: CD-Buffer effectively addresses the limitations of single-paradigm TTA methods by adaptively balancing subtractive and additive strategies based on measured feature-level domain shifts, enabling robust performance across varying corruption levels.

Abstract: Test-Time Adaptation (TTA) enables real-time adaptation to domain shifts without off-line retraining. Recent TTA methods have predominantly explored additive approaches that introduce lightweight modules for feature refinement. Recently, a subtractive approach that removes domain-sensitive channels has emerged as an alternative direction. We observe that these paradigms exhibit complementary effectiveness patterns: subtractive methods excel under severe shifts by removing corrupted features, while additive methods are effective under moderate shifts requiring refinement. However, each paradigm operates effectively only within limited shift severity ranges, failing to generalize across diverse corruption levels. This leads to the following question: can we adaptively balance both strategies based on measured feature-level domain shift? We propose CD-Buffer, a novel complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions driven by a unified discrepancy metric. Our key innovation lies in the discrepancy-driven coupling: Our framework couples removal and refinement through a unified discrepancy metric, automatically balancing both strategies based on feature-level shift severity. This establishes automatic channel-wise balancing that adapts differentiated treatment to heterogeneous shift magnitudes without manual tuning. Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.

[151] SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection

Jiaming Liang, Yifeng Zhan, Chunlin Liu, Weihua Zheng, Bingye Peng, Qiwei Liang, Boyang Cai, Xiaochun Mai, Qiang Nie

Main category: cs.CV

TL;DR: A benchmark and method for open-vocabulary camouflaged object detection using text prompts and multimodal fusion strategies to improve detection of objects with high visual similarity to background.

Details

Motivation: Current open-vocabulary object detection (OVOD) models fail on camouflaged objects due to high visual similarity between objects and background, creating a need for specialized approaches.

Method: Constructed OVCOD-D benchmark with camouflaged object images and fine-grained descriptions. Proposed sub-description principal component contrastive fusion to reduce noisy text components, and specificity-guided regional weak alignment with dynamic focusing to enhance discrimination between camouflaged objects and background.

Result: Achieved AP of 56.4 on OVCOD-D benchmark under open-set evaluation setting, demonstrating improved performance for camouflaged object detection.

Conclusion: The proposed benchmark and methods effectively address the challenge of detecting camouflaged objects in open-vocabulary settings by leveraging multimodal fusion and specificity-guided visual alignment.

Abstract: Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision–language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector’s ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.

[152] Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh

Main category: cs.CV

TL;DR: Discrete diffusion vision-language models (DVLMs) adapted for GUI grounding tasks, showing competitive performance with autoregressive models through hybrid masking and data expansion.

Details

Motivation: While autoregressive VLMs dominate multimodal understanding and GUI grounding, discrete DVLMs have shown promise in multimodal reasoning but their potential for GUI grounding remains unexplored. The paper investigates whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding tasks.

Method: Adapted LLaDA-V for single-turn action and bounding-box prediction, framing GUI grounding as text generation from multimodal input. Proposed a hybrid masking schedule combining linear and deterministic masking to better capture hierarchical bounding-box geometry. Evaluated on four datasets spanning web, desktop, and mobile interfaces with systematic ablations on diffusion steps, generation length, and block length.

Result: Hybrid masking improved grounding accuracy by up to 6.1 points in Step Success Rate over linear-masked GUI-adapted LLaDA-V. The adapted diffusion model performed competitively with autoregressive counterparts despite limited pretraining. Data expansion with diverse GUI domains reduced latency by ~1.3 seconds and improved grounding accuracy by average 20 points across benchmarks.

Conclusion: Discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents, demonstrating viability as an alternative to autoregressive models.

Abstract: Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.

[153] Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR

Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Mingzhu Chen, Jiancan Wu, Kuien Liu, Xiang Wang

Main category: cs.CV

TL;DR: TGRL uses expert reasoning trajectories from stronger models to guide MLLMs in integrating visual evidence into fine-grained reasoning processes, improving multimodal reasoning performance.

Details

Motivation: Current RLVR methods for MLLMs focus on answer correctness and visual grounding, but models often fail to effectively incorporate visual evidence into reasoning chains, leading to weakly grounded reasoning.

Method: Proposes Trajectory-Guided Reinforcement Learning (TGRL) that uses expert reasoning trajectories from stronger models to guide policy models, with token-level reweighting and trajectory filtering for stable optimization.

Result: Extensive experiments on multiple multimodal reasoning benchmarks show TGRL consistently improves reasoning performance and bridges the gap between visual perception and logical reasoning.

Conclusion: TGRL effectively addresses the bottleneck of weak visual grounding in reasoning chains and enhances multimodal reasoning capabilities through guided reinforcement learning.

Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.

[154] PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna

Main category: cs.CV

TL;DR: PerceptionComp is a new benchmark for complex, long-horizon video reasoning requiring multiple temporal visual evidence and compositional constraints across diverse perceptual subtasks.

Details

Motivation: Current video reasoning benchmarks often focus on single-moment understanding or simple temporal relations, lacking the complexity needed to evaluate models on perception-centric, long-horizon reasoning that requires integrating multiple temporally separated visual cues.

Method: Created a manually annotated benchmark with 1,114 complex questions across 279 diverse videos (city tours, indoor tours, video games, extreme sports). Questions require multiple temporal visual evidence, compositional constraints under conjunctive/sequential logic, and span perceptual subtasks like objects, attributes, relations, locations, actions, events.

Result: Human performance drops to near chance (18.97%) without rewatching; best MLLM (Gemini-3-Flash) achieves only 45.96% accuracy in five-choice setting; open-source models below 40%. Shows substantial gap in perception-centric long-horizon video reasoning.

Conclusion: Perception-centric long-horizon video reasoning remains a major bottleneck for MLLMs. PerceptionComp provides a challenging benchmark to drive progress in perceptual reasoning requiring complex temporal and compositional understanding.

Abstract: We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.

[155] TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life

Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn, Julia Chae, Jianyang Gu, Rajiv Ramnath, Sara Beery, Wei-Lun Chao, Anuj Karpatne, Cheng Zhang

Main category: cs.CV

TL;DR: TaxaAdapter uses Vision Taxonomy Models (VTMs) like BioCLIP to guide text-to-image diffusion models for fine-grained species generation, improving species-level fidelity while maintaining text control over attributes.

Details

Motivation: Existing text-to-image models fail to capture fine-grained visual traits that define species identity, despite appearing photo-realistic. There are over 10M distinct species with subtle visual differences that current models cannot accurately represent.

Method: Proposes TaxaAdapter that injects VTM embeddings (like BioCLIP) into frozen text-to-image diffusion models. Uses a clean architecture and training recipe to improve species-level fidelity while preserving text control over pose, style, and background.

Result: TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over baselines. Shows strong generalization: works with few-shot species (handful of training images) and even species unseen during training. Introduces a multimodal LLM-based metric for evaluating morphological consistency.

Conclusion: VTMs are key for scalable, fine-grained species generation. TaxaAdapter enables accurate species synthesis while maintaining flexible text control, with applications in challenging regimes like few-shot and zero-shot species generation.

Abstract: Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.

[156] InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution

Jintong Hu, Bin Chen, Zhenyu Hu, Jiayue Liu, Guo Wang, Lu Qi

Main category: cs.CV

TL;DR: InstaVSR: A lightweight diffusion framework for efficient video super-resolution that combines pruned one-step diffusion, recurrent training with flow-guided regularization, and dual-space adversarial learning to achieve fast processing with temporal stability.

Details

Motivation: Current diffusion-based video super-resolution methods face two main challenges: strong generative priors cause temporal instability, and multi-frame diffusion pipelines are computationally expensive for practical deployment.

Method: Three key components: (1) pruned one-step diffusion backbone that removes costly components from conventional pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after simplification.

Result: On NVIDIA RTX 4090, processes 30-frame 2K×2K video in under one minute with only 7GB memory usage, substantially reducing computational cost while maintaining favorable perceptual quality with significantly smoother temporal transitions.

Conclusion: InstaVSR provides an efficient diffusion framework for video super-resolution that addresses both computational cost and temporal stability challenges, making diffusion-based VSR more practical for deployment.

Abstract: Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K$\times$2K resolution in under one minute with only 7 GB of memory usage, substantially reducing the computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality with significantly smoother temporal transitions.

[157] Efficient Few-Shot Learning for Edge AI via Knowledge Distillation on MobileViT

Shuhei Tsuyuki, Reda Bensaid, Jérémy Morlier, Mathieu Léonardon, Naoya Onizawa, Vincent Gripon, Takahiro Hanyu

Main category: cs.CV

TL;DR: Proposes knowledge distillation-based pre-training for MobileViT backbone to enable efficient few-shot learning on edge devices, achieving significant accuracy improvements with reduced parameters, FLOPs, and power consumption.

Details

Motivation: Need for efficient deep learning models on edge devices with limited connectivity, low-latency requirements, and energy constraints, particularly in low-data regimes where collecting large annotated datasets is costly.

Method: Uses knowledge distillation to transfer generalization ability from large-scale teacher model to lightweight MobileViT student model, enabling few-shot learning with reduced computational complexity.

Result: Achieves 14% and 6.7% accuracy improvements for one-shot and five-shot classification on MiniImageNet vs ResNet12 baseline, with 69% parameter reduction, 88% FLOP reduction, 37% energy reduction, and 2.6ms latency on Jetson Orin Nano.

Conclusion: The method provides a practical solution for deploying few-shot learning models on edge AI hardware, balancing accuracy, efficiency, and real-world deployment constraints.

Abstract: Efficient and adaptable deep learning models are an important area of deep learning research, driven by the need for highly efficient models on edge devices. Few-shot learning enables the use of deep learning models in low-data regimes, a capability that is highly sought after in real-world applications where collecting large annotated datasets is costly or impractical. This challenge is particularly relevant in edge scenarios, where connectivity may be limited, low-latency responses are required, or energy consumption constraints are critical. We propose and evaluate a pre-training method for the MobileViT backbone designed for edge computing. Specifically, we employ knowledge distillation, which transfers the generalization ability of a large-scale teacher model to a lightweight student model. This method achieves accuracy improvements of 14% and 6.7% for one-shot and five-shot classification, respectively, on the MiniImageNet benchmark, compared to the ResNet12 baseline, while reducing by 69% the number of parameters and by 88% the computational complexity of the model, in FLOPs. Furthermore, we deployed the proposed models on a Jetson Orin Nano platform and measured power consumption directly at the power supply, showing that the dynamic energy consumption is reduced by 37% with a latency of 2.6 ms. These results demonstrate that the proposed method is a promising and practical solution for deploying few-shot learning models on edge AI hardware.

[158] IP-Bench: Benchmark for Image Protection Methods in Image-to-Video Generation Scenarios

Xiaofeng Li, Leyi Sheng, Zhen Sun, Zongmin Zhang, Jiaheng Wei, Xinlei He

Main category: cs.CV

TL;DR: IP-Bench is the first systematic benchmark for evaluating image protection methods against image-to-video generation misuse, testing 6 protection methods against 5 I2V models with robustness attacks and transferability analysis.

Details

Motivation: With the rise of image-to-video (I2V) generation models, there's growing concern about misuse where single images can be exploited to create fake videos for malicious purposes. Existing image protection methods lack unified benchmarks and haven't been systematically evaluated in I2V scenarios or against preprocessing attacks, making real-world effectiveness assessment difficult.

Method: Proposes IP-Bench (Image Protection Bench), a systematic benchmark that evaluates 6 representative protection methods against 5 state-of-the-art I2V models. The benchmark includes robustness evaluation with two attack strategies under practical scenarios and analyzes cross-model & cross-modality transferability of protection methods.

Result: IP-Bench establishes the first systematic, reproducible, and extensible evaluation framework for image protection methods in I2V generation scenarios, providing comprehensive assessment of protection effectiveness against modern video generation threats.

Conclusion: The benchmark addresses critical gaps in evaluating image protection against I2V misuse, enabling better assessment of protection methods’ real-world effectiveness and providing a foundation for future research in this important security area.

Abstract: With the rapid advancement of image-to-video (I2V) generation models, their potential for misuse in creating malicious content has become a significant concern. For instance, a single image can be exploited to generate a fake video, which can be used to attract attention and gain benefits. This phenomenon is referred to as an I2V generation misuse. Existing image protection methods suffer from the absence of a unified benchmark, leading to an incomplete evaluation framework. Furthermore, these methods have not been systematically assessed in I2V generation scenarios and against preprocessing attacks, which complicates the evaluation of their effectiveness in real-world deployment scenarios.To address this challenge, we propose IP-Bench (Image Protection Bench), the first systematic benchmark designed to evaluate protection methods in I2V generation scenarios. This benchmark examines 6 representative protection methods and 5 state-of-the-art I2V models. Furthermore, our work systematically evaluates protection methods’ robustness with two robustness attack strategies under practical scenarios and analyzes their cross-model & cross-modality transferability. Overall, IP-Bench establishes a systematic, reproducible, and extensible evaluation framework for image protection methods in I2V generation scenarios.

[159] ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction

David Hagerman, Roman Naeem, Erik Brorsson, Fredrik Kahl, Lennart Svensson

Main category: cs.CV

TL;DR: ARTA is a mixed-resolution coarse-to-fine vision transformer that efficiently extracts dense features by starting with low-resolution tokens and using a lightweight allocator to predict where to add fine tokens near semantic boundaries.

Details

Motivation: Traditional vision transformers use dense high-resolution tokens from the start, which is computationally expensive and inefficient for homogeneous regions. There's a need for more efficient dense feature extraction that focuses computation on semantically complex areas.

Method: ARTA starts with low-resolution coarse tokens and uses a lightweight allocator to iteratively predict semantic boundary scores. It allocates additional fine tokens to patches above a low threshold, concentrating token density near boundaries. Mixed-resolution attention enables interaction between coarse and fine tokens.

Result: ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class with fewer FLOPs and less memory than comparable backbones.

Conclusion: ARTA demonstrates an efficient coarse-to-fine approach for dense feature extraction that focuses computation on semantically complex regions while avoiding redundant processing in homogeneous areas, achieving strong performance with reduced computational cost.

Abstract: We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory than comparable backbones.

[160] Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication

Yi Zhang, Hongbo Huang, Liang-Jie Zhang

Main category: cs.CV

TL;DR: A watermarking framework for diffusion models that enables exact bit recovery of structured watermark data, treating diffusion as noisy channel with error correction for robust tracing.

Details

Motivation: Diffusion models generate high-quality images but pose risks like copyright violation and disinformation. Existing watermarking methods only support fuzzy matching and cannot recover structured watermark data bit-exactly, making them unsuitable for applications requiring lossless metadata like licensing instructions.

Method: Gaussian Shannon treats diffusion process as noisy communication channel, embeds watermarks in initial Gaussian noise without fine-tuning or quality loss. Uses cascaded defense combining error-correcting codes and majority voting to handle local bit flips and global stochastic distortions.

Result: Experiments across three Stable Diffusion variants and seven perturbation types show state-of-the-art bit-level accuracy while maintaining high true positive rate, enabling trustworthy rights attribution in real-world deployment.

Conclusion: The framework enables both robust tracing and exact bit recovery of watermarks in diffusion-generated images, addressing limitations of threshold-based detection methods for applications requiring lossless metadata recovery.

Abstract: Diffusion models generate high-quality images but pose serious risks like copyright violation and disinformation. Watermarking is a key defense for tracing and authenticating AI-generated content. However, existing methods rely on threshold-based detection, which only supports fuzzy matching and cannot recover structured watermark data bit-exactly, making them unsuitable for offline verification or applications requiring lossless metadata (e.g., licensing instructions). To address this problem, in this paper, we propose Gaussian Shannon, a watermarking framework that treats the diffusion process as a noisy communication channel and enables both robust tracing and exact bit recovery. Our method embeds watermarks in the initial Gaussian noise without fine-tuning or quality loss. We identify two types of channel interference, namely local bit flips and global stochastic distortions, and design a cascaded defense combining error-correcting codes and majority voting. This ensures reliable end-to-end transmission of semantic payloads. Experiments across three Stable Diffusion variants and seven perturbation types show that Gaussian Shannon achieves state-of-the-art bit-level accuracy while maintaining a high true positive rate, enabling trustworthy rights attribution in real-world deployment. The source code have been made available at: https://github.com/Rambo-Yi/Gaussian-Shannon

[161] GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation

Xujing Tao, Chuxin Wang, Yubo Ai, Zhixin Cheng, Zhuoyuan Li, Liangsheng Liu, Yujia Chen, Xinjun Li, Qiao Li, Wenfei Yang, Tianzhu Zhang

Main category: cs.CV

TL;DR: GeoGuide: A novel framework for open-vocabulary 3D semantic segmentation that leverages pretrained 3D models with hierarchical geometry-semantic consistency, avoiding limitations of 2D distillation approaches.

Details

Motivation: Existing open-vocabulary 3D segmentation methods rely on distilling knowledge from 2D models, which restricts intrinsic 3D geometric learning and inherits errors from 2D predictions.

Method: Three key modules: 1) Uncertainty-based Superpoint Distillation fuses geometric/semantic features for per-point uncertainty estimation, adaptively weighting 2D features; 2) Instance-level Mask Reconstruction uses geometric priors for semantic consistency within instances; 3) Inter-Instance Relation Consistency aligns geometric and semantic similarity matrices to mitigate viewpoint-induced semantic drift.

Result: Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate superior performance compared to existing methods.

Conclusion: GeoGuide effectively addresses limitations of 2D distillation approaches by leveraging pretrained 3D models and hierarchical geometry-semantic consistency for improved open-vocabulary 3D segmentation.

Abstract: Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an Uncertainty-based Superpoint Distillation module to fuse geometric and semantic features for estimating per-point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information to enhance local semantic consistency. Furthermore, our Instance-level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter-Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.

[162] Provably Contractive and High-Quality Denoisers for Convergent Restoration

Shubhi Shukla, Pravin Nair

Main category: cs.CV

TL;DR: Provably contractive (global Lipschitz < 1) denoiser networks for image restoration with stability guarantees under input perturbations, competitive with SOTA denoisers while ensuring robustness.

Details

Motivation: Existing convolutional and attention-based networks for image restoration lack stability guarantees under minor input shifts, exposing a robustness-accuracy trade-off. The paper aims to develop provably contractive denoisers that reduce this gap while maintaining competitive performance.

Method: Design composes proximal layers from unfolding techniques with Lipschitz-controlled convolutional refinements to create provably contractive denoiser networks (global Lipschitz < 1). The approach ensures input perturbations induce at most the same magnitude change at output.

Result: The proposed model is competitive with unconstrained SOTA denoisers on image denoising, reporting the tightest gap for a provably 1-Lipschitz model. It also acts as strong regularizers for image restoration that provably effect convergence in Plug-and-Play algorithms.

Conclusion: Enforcing strict Lipschitz control does not inherently degrade output quality, challenging common assumptions and moving toward verifiable and stable vision models. The work demonstrates that contractive denoisers can achieve performance close to unconstrained SOTA methods.

Abstract: Image restoration, the recovery of clean images from degraded measurements, has applications in various domains like surveillance, defense, and medical imaging. Despite achieving state-of-the-art (SOTA) restoration performance, existing convolutional and attention-based networks lack stability guarantees under minor shifts in input, exposing a robustness accuracy trade-off. We develop provably contractive (global Lipschitz $< 1$) denoiser networks that considerably reduce this gap. Our design composes proximal layers obtained from unfolding techniques, with Lipschitz-controlled convolutional refinements. By contractivity, our denoiser guarantees that input perturbations of strength $|δ|\le\varepsilon$ induce at most $\varepsilon$ change at the output, while strong baselines such as DnCNN and Restormer can exhibit larger deviations under the same perturbations. On image denoising, the proposed model is competitive with unconstrained SOTA denoisers, reporting the tightest gap for a provably 1-Lipschitz model and establishing that such gaps are indeed achievable by contractive denoisers. Moreover, the proposed denoisers act as strong regularizers for image restoration that provably effect convergence in Plug-and-Play algorithms. Our results show that enforcing strict Lipschitz control does not inherently degrade output quality, challenging a common assumption in the literature and moving the field toward verifiable and stable vision models. Codes and pretrained models are available at https://github.com/SHUBHI1553/Contractive-Denoisers

[163] CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao

Main category: cs.CV

TL;DR: CREval is an automated QA-based evaluation pipeline for creative image manipulation models, addressing limitations of existing MLLM scoring methods with a comprehensive benchmark covering 9 creative dimensions.

Details

Motivation: Existing evaluation methods lack systematic and human-aligned frameworks for assessing model performance on complex and creative image editing tasks, with current MLLM scoring being incomplete and poorly interpretable.

Method: Proposes CREval, a fully automated question-answer evaluation pipeline that overcomes opaque MLLM scoring limitations, and CREval-Bench, a comprehensive benchmark with 3 categories, 9 creative dimensions, 800+ editing samples, and 13K evaluation queries.

Result: Evaluation of state-of-the-art open and closed-source models shows closed-source models generally outperform open-source ones on complex creative tasks, but all models struggle with effective completion. User studies confirm strong consistency between CREval’s automated metrics and human judgments.

Conclusion: CREval provides a reliable foundation for evaluating image editing models on complex creative manipulation tasks, highlighting key challenges and opportunities for future research in multimodal image editing.

Abstract: Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Models (MLLMs) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval’s automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.

[164] PhysVid: Physics Aware Local Conditioning for Generative Video Models

Saurabh, Pathak, Elahe Arani, Mykola Pechenizkiy, Bahram Zonooz

Main category: cs.CV

TL;DR: PhysVid introduces physics-aware local conditioning for video generation using temporally contiguous chunks annotated with physics-grounded descriptions, improving physical plausibility over baseline models.

Details

Motivation: Current generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Existing physics injection methods have limitations: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics.

Method: PhysVid uses a physics-aware local conditioning scheme operating over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, fused with global prompts via chunk-aware cross-attention during training. At inference, negative physics prompts (descriptions of locally relevant law violations) steer generation away from implausible trajectories.

Result: On VideoPhy benchmark, PhysVid improves physical commonsense scores by approximately 33% over baseline video generators, and by up to approximately 8% on VideoPhy2. The method substantially increases physical plausibility in generative video.

Conclusion: Local, physics-aware guidance significantly improves physical plausibility in generative video models, representing a step toward physics-grounded video generation systems.

Abstract: Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by $\approx 33%$ over baseline video generators, and by up to $\approx 8%$ on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.

[165] Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning

Bozhao Li, Shaocong Wu, Tong Shao, Senqiao Yang, Qiben Shan, Zhuotao Tian, Jingyong Su

Main category: cs.CV

TL;DR: CCL introduces contextual consistency learning for open-vocabulary object detection, addressing robustness gaps when objects appear in different backgrounds by enforcing intra-modal consistency through data generation and consistency loss.

Details

Motivation: Current open-vocabulary object detection methods focus on scaling datasets and contrastive learning for cross-modal alignment, but neglect internal consistency within single modalities. This leads to performance drops when objects appear in different backgrounds, revealing a robustness gap.

Method: Proposes Contextual Consistency Learning (CCL) with two key components: 1) Contextual Bootstrapped Data Generation (CBDG) creates images with same objects across diverse backgrounds, and 2) Contextual Consistency Loss (CCLoss) enforces invariance of object features despite environmental changes.

Result: Achieves state-of-the-art performance with +16.3 AP improvement on OmniLabel and +14.9 AP on D3 datasets, demonstrating significant enhancement in model generalization across diverse environments.

Conclusion: Enforcing intra-modal consistency is crucial for improving robustness in open-vocabulary object detection, and the CCL framework effectively addresses the contextual consistency problem through data generation and consistency constraints.

Abstract: Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model’s robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Our code is publicly available at: https://github.com/bozhao-li/CCL.

[166] Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy

Wooseong Jeong, Wonyoung Lee, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: TARA-Merging is a method for merging multiple LoRA modules by addressing subspace coverage and anisotropy issues through preference-weighted cross-entropy pseudo-loss alignment.

Details

Motivation: Naive merging of LoRA modules leads to weakened critical task directions and overemphasis of less important ones due to mismatched subspaces and uneven contributions, reducing model's ability to represent all tasks faithfully.

Method: TARA-Merging aligns merging weights using a preference-weighted cross-entropy pseudo-loss while preserving task-relevant LoRA subspaces, ensuring broad subspace coverage and mitigating anisotropy via direction-wise reweighting.

Result: Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization.

Conclusion: The method highlights the importance of addressing both subspace coverage and anisotropy in LoRA merging for constructing general-purpose systems.

Abstract: Merging multiple Low-Rank Adaptation (LoRA) modules is promising for constructing general-purpose systems, yet challenging because LoRA update directions span different subspaces and contribute unevenly. When merged naively, such mismatches can weaken the directions most critical to certain task losses while overemphasizing relatively less important ones, ultimately reducing the model’s ability to represent all tasks faithfully. We revisit this problem through two perspectives: subspace coverage, which captures how broadly LoRA directions cover diverse representational directions, and anisotropy, which reflects the imbalance of influence across those directions. We propose TARA-Merging (Task-Rank Anisotropy Alignment), which aligns merging weights using a preference-weighted cross-entropy pseudo-loss while preserving task-relevant LoRA subspaces. This ensures broad subspace coverage and mitigates anisotropy via direction-wise reweighting. Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization, and highlighting the importance of addressing both subspace coverage and anisotropy in LoRA merging.

[167] GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport

Youngju Na, Jaeseong Yun, Soohyun Ryu, Hyunsu Kim, Sung-Eui Yoon, Suyong Yeon

Main category: cs.CV

TL;DR: GLINT: A framework for modeling scene-scale transparency in 3D Gaussian splatting by decomposing transparent interfaces and transmitted geometry with separate radiance modeling.

Details

Motivation: Current 3D Gaussian splatting methods fundamentally fail to model transparency like glass panels, as they cannot decouple radiance contributions from transparent interfaces and transmitted geometry observed through the glass.

Method: GLINT uses explicit decomposed Gaussian representation to reconstruct primary interfaces and model reflected/transmitted radiance separately. It bootstraps transparency localization from geometry-separation cues induced by decomposition, along with geometry and material priors from a pre-trained video relighting model.

Result: Extensive experiments demonstrate consistent improvements over prior methods for reconstructing complex transparent scenes.

Conclusion: GLINT successfully addresses the transparency modeling challenge in 3D Gaussian splatting through decomposition and prior-guided optimization.

Abstract: While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and models reflected and transmitted radiance separately, enabling consistent radiance transport. During optimization, GLINT bootstraps transparency localization from geometry-separation cues induced by the decomposition, together with geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate consistent improvements over prior methods for reconstructing complex transparent scenes.

[168] Label-Free Cross-Task LoRA Merging with Null-Space Compression

Wonyoung Lee, Wooseong Jeong, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: NSC Merging is a label-free, output-agnostic method for merging LoRA adapters by analyzing adapter geometry, specifically the null-space compression in down-projection matrices, enabling effective merging across heterogeneous tasks including classification, regression, and sequence generation.

Details

Motivation: Existing model merging approaches work well for homogeneous classification tasks but fail when tasks span both classification and regression. Entropy-based methods don't apply to regression and are computationally expensive for large language models with long token sequences. There's a need for a more general merging method that can handle diverse task types efficiently.

Method: Null-Space Compression (NSC) Merging uses the observation that during LoRA fine-tuning, the down-projection matrix A compresses its null space, and this compression correlates with task performance. NSC uses this geometric property as an optimization signal to determine merge weights, making it label-free and output-agnostic. The method analyzes adapter geometry rather than task outputs.

Result: NSC achieves state-of-the-art performance across 20 heterogeneous vision tasks with balanced gains, outperforming prior methods that overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness across diverse task types.

Conclusion: NSC Merging provides a principled, geometry-based approach to LoRA adapter merging that generalizes across classification, regression, and sequence generation tasks, offering a scalable solution for heterogeneous task merging in foundation models.

Abstract: Model merging combines independently fine-tuned checkpoints without joint multi-task training. In the era of foundation-model, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches using entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA finetuning the down-projection factor $A$ in $ΔW = BA$ compresses its null space, and the compression correlates with performance. NSC uses this as an optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.

[169] DUGAE: Unified Geometry and Attribute Enhancement via Spatiotemporal Correlations for G-PCC Compressed Dynamic Point Clouds

Pan Zhao, Hui Yuan, Chang Sun, Chongzhen Tian, Raouf Hamzaoui, Sam Kwong

Main category: cs.CV

TL;DR: DUGAE is a unified geometry and attribute enhancement framework for G-PCC compressed dynamic point clouds that exploits inter-frame spatiotemporal correlations using dynamic enhancement networks with motion compensation.

Details

Motivation: Existing point cloud quality enhancement methods are designed for static data and process frames independently, failing to exploit spatiotemporal correlations in dynamic point cloud sequences.

Method: Proposes DUGAE with three components: 1) Dynamic Geometry Enhancement Network (DGE-Net) using sparse convolution and geometry motion compensation, 2) Detail-aware KNN recoloring module for attribute mapping, and 3) Dynamic Attribute Enhancement Network (DAE-Net) with temporal feature extraction and attribute motion compensation.

Result: Significantly enhanced G-PCC performance on 7 dynamic point clouds: 11.03 dB BD-PSNR gain (93.95% bitrate reduction) for geometry, 4.23 dB BD-PSNR gain (66.61% bitrate reduction) for luma, improved perceptual quality (PCQM), and outperformed V-PCC.

Conclusion: DUGAE effectively exploits spatiotemporal correlations in dynamic point clouds, achieving substantial quality improvements for both geometry and attributes compared to existing methods.

Abstract: Existing post-decoding quality enhancement methods for point clouds are designed for static data and typically process each frame independently. As a result, they cannot effectively exploit the spatiotemporal correlations present in point cloud sequences.We propose a unified geometry and attribute enhancement framework (DUGAE) for G-PCC compressed dynamic point clouds that explicitly exploits inter-frame spatiotemporal correlations in both geometry and attributes. First, a dynamic geometry enhancement network (DGE-Net) based on sparse convolution (SPConv) and feature-domain geometry motion compensation (GMC) aligns and aggregates spatiotemporal information. Then, a detail-aware k-nearest neighbors (DA-KNN) recoloring module maps the original attributes onto the enhanced geometry at the encoder side, improving mapping completeness and preserving attribute details. Finally, a dynamic attribute enhancement network (DAE-Net) with dedicated temporal feature extraction and feature-domain attribute motion compensation (AMC) refines attributes by modeling complex spatiotemporal correlations. On seven dynamic point clouds from the 8iVFB v2, Owlii, and MVUB datasets, DUGAE significantly enhanced the performance of the latest G-PCC geometry-based solid content test model (GeS-TM v10). For geometry (D1), it achieved an average BD-PSNR gain of 11.03 dB and a 93.95% BD-bitrate reduction. For the luma component, it achieved a 4.23 dB BD-PSNR gain with a 66.61% BD-bitrate reduction. DUGAE also improved perceptual quality (as measured by PCQM) and outperformed V-PCC. Our source code will be released on GitHub at: https://github.com/yuanhui0325/DUGAE

[170] OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement

Rui Wang, Huisi Wu, Jing Qin

Main category: cs.CV

TL;DR: OSA framework for echocardiography video segmentation uses orthogonal state updates on Stiefel manifold to prevent rank collapse and anatomical prior-aware feature enhancement for noise-resilient cardiac structure tracking.

Details

Motivation: Accurate left ventricle segmentation from echocardiography videos is crucial for cardiac function assessment, but existing methods suffer from rank collapse in recurrent models and noise interference from speckle artifacts.

Method: Proposes OSA with Orthogonalized State Update (OSU) mechanism that constrains state evolution on Stiefel manifold via Euclidean projected gradient descent, plus Anatomical Prior-aware Feature Enhancement module to separate anatomical structures from speckle noise.

Result: Achieves state-of-the-art segmentation accuracy and temporal stability on CAMUS and EchoNet-Dynamic datasets while maintaining real-time inference efficiency for clinical deployment.

Conclusion: OSA effectively addresses rank collapse in temporal tracking and noise interference in echocardiography segmentation, enabling robust clinical cardiac assessment.

Abstract: Accurate and temporally consistent segmentation of the left ventricle from echocardiography videos is essential for estimating the ejection fraction and assessing cardiac function. However, modeling spatiotemporal dynamics remains difficult due to severe speckle noise and rapid non-rigid deformations. Existing linear recurrent models offer efficient in-context associative recall for temporal tracking, but rely on unconstrained state updates, which cause progressive singular value decay in the state matrix, a phenomenon known as rank collapse, resulting in anatomical details being overwhelmed by noise. To address this, we propose OSA, a framework that constrains the state evolution on the Stiefel manifold. We introduce the Orthogonalized State Update (OSU) mechanism, which formulates the memory evolution as Euclidean projected gradient descent on the Stiefel manifold to prevent rank collapse and maintain stable temporal transitions. Furthermore, an Anatomical Prior-aware Feature Enhancement module explicitly separates anatomical structures from speckle noise through a physics-driven process, providing the temporal tracker with noise-resilient structural cues. Comprehensive experiments on the CAMUS and EchoNet-Dynamic datasets show that OSA achieves state-of-the-art segmentation accuracy and temporal stability, while maintaining real-time inference efficiency for clinical deployment. Codes are available at https://github.com/wangrui2025/OSA.

[171] Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

Yiming Ren, Yujiu Yang, Junjie Wang

Main category: cs.CV

TL;DR: IADA improves VLM fine-tuning by preserving cross-depth access through input-adaptive depth aggregation, boosting both reasoning and perception scores with minimal parameters.

Details

Motivation: Supervised fine-tuning on visual instruction data often improves perceptual capabilities in vision-language models while degrading reasoning performance, creating a persistent reasoning tax. The authors investigate whether this degradation is related to disrupted access to depth-wise representations.

Method: Proposes Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. It preserves cross-depth access during fine-tuning.

Result: On Qwen3-VL-2B, IADA improves average reasoning score by 9.5 points and average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters. Strongest gains appear in parameter-efficient low-rank settings.

Conclusion: Preserved cross-depth access is an important missing factor in VLM fine-tuning. IADA effectively addresses the reasoning degradation problem while maintaining perception improvements through efficient cross-depth aggregation.

Abstract: Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by $3.3$ points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.

[172] Dual-Stage Invariant Continual Learning under Extreme Visual Sparsity

Rangya Zhang, Jiaping Xiao, Lu Bai, Yuhang Zhang, Mir Feroskhan

Main category: cs.CV

TL;DR: Continual object detection framework for extreme-sparsity regimes like space-based RSO detection, using dual-stage invariant learning with joint distillation and sparsity-aware data conditioning to prevent representation drift.

Details

Motivation: Existing continual learning methods for object detection assume balanced visual conditions, but fail in extreme-sparsity regimes where foreground signals are dominated by background observations. Background-driven gradients destabilize feature backbones during sequential domain shifts, causing progressive representation drift that output-level distillation alone cannot address.

Method: Proposes a dual-stage invariant continual learning framework via joint distillation, enforcing structural and semantic consistency on both backbone representations and detection predictions. Also introduces sparsity-aware data conditioning combining patch-based sampling and distribution-aware augmentation to regulate gradient statistics under severe imbalance.

Result: Experiments on high-resolution space-based RSO detection dataset show consistent improvement over established continual object detection methods, achieving an absolute gain of +4.0 mAP under sequential domain shifts.

Conclusion: The proposed framework effectively addresses representation drift in extreme-sparsity continual object detection by jointly preserving intermediate representation stability and detection performance through dual-stage invariant learning and sparsity-aware conditioning.

Abstract: Continual learning seeks to maintain stable adaptation under non-stationary environments, yet this problem becomes particularly challenging in object detection, where most existing methods implicitly assume relatively balanced visual conditions. In extreme-sparsity regimes, such as those observed in space-based resident space object (RSO) detection scenarios, foreground signals are overwhelmingly dominated by background observations. Under such conditions, we analytically demonstrate that background-driven gradients destabilize the feature backbone during sequential domain shifts, causing progressive representation drift. This exposes a structural limitation of continual learning approaches relying solely on output-level distillation, as they fail to preserve intermediate representation stability. To address this, we propose a dual-stage invariant continual learning framework via joint distillation, enforcing structural and semantic consistency on both backbone representations and detection predictions, respectively, thereby suppressing error propagation at its source while maintaining adaptability. Furthermore, to regulate gradient statistics under severe imbalance, we introduce a sparsity-aware data conditioning strategy combining patch-based sampling and distribution-aware augmentation. Experiments on a high-resolution space-based RSO detection dataset show consistent improvement over established continual object detection methods, achieving an absolute gain of +4.0 mAP under sequential domain shifts.

[173] Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification

Shuai Lv, Chang Liu, Feng Tang, Yujie Yuan, Aojun Zhou, Kui Zhang, Xi Yang, Yangqiu Song

Main category: cs.CV

TL;DR: VRE is a self-evolving training framework that enables MLLMs to perform visual introspection during reasoning to reduce hallucinations and improve grounding in long-form generation.

Details

Motivation: MLLMs suffer from progressive drift from image evidence in long-form generation, relying more on textual priors and causing hallucinations. The authors discovered MLLMs have latent late-stage visual verification capabilities that aren't consistently activated.

Method: Visual Re-Examination (VRE) framework enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. It promotes iterative self-improvement by having the model generate reflection traces and making visual information actionable through information gain.

Result: Extensive experiments across diverse multimodal benchmarks show VRE consistently improves reasoning accuracy and perceptual reliability while substantially reducing hallucinations, especially in long-chain settings.

Conclusion: VRE effectively addresses the hallucination problem in MLLMs by activating their latent visual verification capabilities through self-evolving training, leading to more grounded multimodal reasoning.

Abstract: Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, Based on attention analysis, we find that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at https://github.com/Xiaobu-USTC/VRE.

[174] HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning

Xuerui Zhang, Xuehao Wang, Zhan Zhuang, Linglan Zhao, Ziyue Li, Xinmin Zhang, Zhihuan Song, Yu Zhang

Main category: cs.CV

TL;DR: Lifelong Heterogeneous Learning (LHL) addresses learning across tasks with different output structures, focusing on dense prediction scenarios (LHL4DP) with Heterogeneity-Aware Distillation (HAD) method.

Details

Motivation: Most lifelong learning research focuses on homogeneous tasks (e.g., only classification), neglecting scenarios where tasks have heterogeneous output structures. This work formalizes lifelong heterogeneous learning (LHL) to address learning across tasks with different output space structures.

Method: Proposes Heterogeneity-Aware Distillation (HAD), an exemplar-free approach using self-distillation. HAD has two components: 1) distribution-balanced heterogeneity-aware distillation loss to address global prediction distribution imbalance, and 2) salience-guided heterogeneity-aware distillation loss focusing on informative edge pixels extracted with Sobel operator.

Result: Extensive experiments show HAD significantly outperforms existing methods in the new LHL4DP (Lifelong Heterogeneous Learning for Dense Prediction) scenario.

Conclusion: The work formalizes lifelong heterogeneous learning, proposes an effective HAD method for dense prediction scenarios, and demonstrates superior performance over existing approaches.

Abstract: Lifelong learning aims to preserve knowledge acquired from previous tasks while incorporating knowledge from a sequence of new tasks. However, most prior work explores only streams of homogeneous tasks (\textit{e.g.}, only classification tasks) and neglects the scenario of learning across heterogeneous tasks that possess different structures of outputs. In this work, we formalize this broader setting as lifelong heterogeneous learning (LHL). Departing from conventional lifelong learning, the task sequence of LHL spans different task types, and the learner needs to retain heterogeneous knowledge for different output space structures. To instantiate the LHL, we focus on LHL in the context of dense prediction (LHL4DP), a realistic and challenging scenario. To this end, we propose the Heterogeneity-Aware Distillation (HAD) method, an exemplar-free approach that preserves previously gained heterogeneous knowledge by self-distillation in each training phase. The proposed HAD comprises two complementary components, including a distribution-balanced heterogeneity-aware distillation loss to alleviate the global imbalance of prediction distribution and a salience-guided heterogeneity-aware distillation loss that concentrates learning on informative edge pixels extracted with the Sobel operator. Extensive experiments demonstrate that the proposed HAD method significantly outperforms existing methods in this new scenario.

[175] 4DRaL: Bridging 4D Radar with LiDAR for Place Recognition using Knowledge Distillation

Ningyuan Huang, Zhiheng Li, Zheng Fang

Main category: cs.CV

TL;DR: 4DRaL: A knowledge distillation framework that uses LiDAR-to-LiDAR place recognition as teacher to enhance 4D radar-to-radar and radar-to-LiDAR place recognition, addressing noise and sparsity in 4D radar data.

Details

Motivation: Place recognition is essential for robotics but current camera/LiDAR methods fail in adverse weather. 4D millimeter-wave radar works in all weather but suffers from noise and sparsity, limiting performance.

Method: Uses LiDAR-to-LiDAR place recognition model as teacher to guide 4D radar-to-radar student model via three KD modules: local image enhancement for sparsity, feature distribution distillation for discriminative features, and response distillation for feature space consistency.

Result: Achieves state-of-the-art performance in both radar-to-radar and radar-to-LiDAR place recognition tasks under normal and adverse weather conditions.

Conclusion: 4DRaL effectively addresses 4D radar limitations through knowledge distillation, enabling robust all-weather place recognition for robotics applications.

Abstract: Place recognition is crucial for loop closure detection and global localization in robotics. Although mainstream algorithms typically rely on cameras and LiDAR, these sensors are susceptible to adverse weather conditions. Fortunately, the recently developed 4D millimeter-wave radar (4D radar) offers a promising solution for all-weather place recognition. However, the inherent noise and sparsity in 4D radar data significantly limit its performance. Thus, in this paper, we propose a novel framework called 4DRaL that leverages knowledge distillation (KD) to enhance the place recognition performance of 4D radar. Its core is to adopt a high-performance LiDAR-to-LiDAR (L2L) place recognition model as a teacher to guide the training of a 4D radar-to-4D radar (R2R) place recognition model. 4DRaL comprises three key KD modules: a local image enhancement module to handle the sparsity of raw 4D radar points, a feature distribution distillation module that ensures the student model generates more discriminative features, and a response distillation module to maintain consistency in feature space between the teacher and student models. More importantly, 4DRaL can also be trained for 4D radar-to-LiDAR (R2L) place recognition through different module configurations. Experimental results prove that 4DRaL achieves state-of-the-art performance in both R2R and R2L tasks regardless of normal or adverse weather.

[176] Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

Main category: cs.CV

TL;DR: DEFOM-Stereo variants trained on synthetic tree pruning data for UAV applications; DEFOM-PrunePlus offers best accuracy-speed trade-off for real-time deployment on Jetson Orin.

Details

Motivation: Autonomous tree pruning with UAVs requires real-time metric distance estimation to thin branches for safe cutting tool operation without collisions.

Method: Train five DEFOM-Stereo variants on synthetic dataset (5,520 stereo pairs from Unreal Engine 5 with ZED Mini simulation), deploy on NVIDIA Jetson Orin, evaluate accuracy-speed trade-offs.

Result: DEFOM-Stereo ViT-S has best accuracy but too slow (~2.2 FPS); DEFOM-PrunePlus offers best deployable trade-off (~3.3 FPS, depth MAE 64.26 cm); lightweight variants faster but less accurate.

Conclusion: DEFOM-PrunePlus provides most practical accuracy-latency balance for onboard distance estimation in UAV tree pruning, with ViT-S as reference for future hardware improvements.

Abstract: Autonomous tree pruning with unmanned aerial vehicles (UAVs) is a safety-critical real-world task: the onboard perception system must estimate the metric distance from a cutting tool to thin tree branches in real time so that the UAV can approach, align, and actuate the pruner without collision. We address this problem by training five variants of DEFOM-Stereo - a recent foundation-model-based stereo matcher - on a task-specific synthetic dataset and deploying the checkpoints on an NVIDIA Jetson Orin Super 16 GB. The training corpus is built in Unreal Engine 5 with a simulated ZED Mini stereo camera capturing 5,520 stereo pairs across 115 tree instances from three viewpoints at 2m distance; dense EXR depth maps provide exact, spatially complete supervision for thin branches. On the synthetic test set, DEFOM-Stereo ViT-S achieves the best depth-domain accuracy (EPE 1.74 px, D1-all 5.81%, delta-1 95.90%, depth MAE 23.40 cm) but its Jetson inference speed of ~2.2 FPS (~450 ms per frame) remains too slow for responsive closed-loop tool control. A newly introduced balanced variant, DEFOM-PrunePlus (~21M backbone, ~3.3 FPS on Jetson), offers the best deployable accuracy-speed trade-off (EPE 5.87 px, depth MAE 64.26 cm, delta-1 87.59%): its frame rate is sufficient for real-time guidance and its depth accuracy supports safe branch approach planning at the 2m operating range. The lightweight DEFOM-PruneStereo (~6.9 FPS) and DEFOM-PruneNano (~8.5 FPS) run fast but sacrifice substantial accuracy (depth MAE > 57 cm), making estimates too unreliable for safe actuation. Zero-shot inference on real photographs confirms that full-capacity models preserve branch geometry, validating the sim-to-real transfer. We conclude that DEFOM-PrunePlus provides the most practical accuracy-latency balance for onboard distance estimation, while ViT-S serves as the reference for future hardware.

[177] CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Main category: cs.CV

TL;DR: CPUBone introduces CPU-optimized vision backbones that balance MACs and hardware-efficient execution through grouped convolutions and reduced kernel sizes, achieving state-of-the-art speed-accuracy trade-offs on CPUs.

Details

Motivation: Most vision backbone research focuses on hardware with high parallel processing capabilities (GPUs, mobile phones, AI accelerators), but CPUs have different architectural constraints that require a specialized design philosophy balancing operation count (MACs) with hardware-efficient execution (MACs per second).

Method: Investigate two modifications to standard convolutions: grouping convolutions and reducing kernel sizes. These adaptations reduce computational cost while maintaining hardware-efficiency on CPUs. Based on these insights, develop CPUBone family of vision backbone models optimized for CPU-based inference.

Result: CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs across diverse CPU devices and effectively transfers its efficiency to downstream tasks like object detection and semantic segmentation.

Conclusion: CPU-specific optimization strategies (grouped convolutions and reduced kernel sizes) enable efficient vision backbone models that outperform existing approaches on CPU hardware while maintaining strong performance on downstream vision tasks.

Abstract: Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. In contrast, CPUs do not have the possibility to parallelize operations in the same manner, wherefore models benefit from a specific design philosophy that balances amount of operations (MACs) and hardware-efficient execution by having high MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions, aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency necessitates preserving hardware-efficiency. Our experiments across diverse CPU devices confirm that these adaptations successfully retain high hardware-efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU-based inference. CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at https://github.com/altair199797/CPUBone.

[178] GLASS: Geometry-aware Local Alignment and Structure Synchronization Network for 2D-3D Registration

Zhixin Cheng, Jiacheng Deng, Xinjun Li, Bohao Liao, Li Liu, Xiaotian Yin, Baoqun Yin, Tianzhu Zhang

Main category: cs.CV

TL;DR: Proposes LGE and GDC modules for image-to-point cloud registration, addressing repetitive patterns and structural consistency issues through geometry enhancement and graph-based distribution constraints.

Details

Motivation: Image-to-point cloud registration struggles with repetitive patterns where images lack 3D structural cues and alignment, leading to incorrect matches. Existing methods often overlook structural consistency and fail to fully exploit correspondences.

Method: Two novel modules: 1) Local Geometry Enhancement (LGE) enhances image and point cloud features with normal vectors to inject geometric structure into image features, reducing mismatches. 2) Graph Distribution Consistency (GDC) constructs a graph from matched points to update features and explicitly constrain similarity distributions.

Result: Extensive experiments on RGB-D Scenes v2 and 7-Scenes benchmarks demonstrate state-of-the-art performance in image-to-point cloud registration.

Conclusion: The proposed LGE and GDC modules effectively address challenges in image-to-point cloud registration, particularly for scenes with repetitive patterns, by enhancing geometric structure and enforcing distribution consistency.

Abstract: Image-to-point cloud registration methods typically follow a coarse-to-fine pipeline, extracting patch-level correspondences and refining them into dense pixel-to-point matches. However, in scenes with repetitive patterns, images often lack sufficient 3D structural cues and alignment with point clouds, leading to incorrect matches. Moreover, prior methods usually overlook structural consistency, limiting the full exploitation of correspondences. To address these issues, we propose two novel modules: the Local Geometry Enhancement (LGE) module and the Graph Distribution Consistency (GDC) module. LGE enhances both image and point cloud features with normal vectors, injecting geometric structure into image features to reduce mismatches. GDC constructs a graph from matched points to update features and explicitly constrain similarity distributions. Extensive experiments and ablations on two benchmarks, RGB-D Scenes v2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance in image-to-point cloud registration.

[179] DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation

Tomoya Miyawaki, Kazuto Nakashima, Yumi Iwashita, Ryo Kurazume

Main category: cs.CV

TL;DR: DRUM is a Sim2Real translation framework using diffusion models to bridge domain gap between synthetic and real LiDAR data by reproducing reflectance intensity and raydrop noise characteristics.

Details

Motivation: Large-scale annotation of LiDAR point clouds is expensive and time-consuming. While simulators provide labeled synthetic data, models trained on synthetic data underperform on real-world data due to domain gaps, particularly in measurement characteristics like reflectance intensity and raydrop noise.

Method: Proposes DRUM framework that leverages a diffusion model pre-trained on unlabeled real-world data as generative prior. Translates synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. Introduces raydrop-aware masked guidance mechanism that selectively enforces consistency with input synthetic data while preserving realistic raydrop noise induced by diffusion prior.

Result: Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data.

Conclusion: DRUM effectively bridges the Sim2Real domain gap for LiDAR semantic segmentation by using diffusion models to translate synthetic data to realistic representations, addressing key measurement characteristics that cause domain shifts.

Abstract: LiDAR-based semantic segmentation is a key component for autonomous mobile robots, yet large-scale annotation of LiDAR point clouds is prohibitively expensive and time-consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real-world data due to a data-level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre-trained on unlabeled real-world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop-aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at https://miya-tomoya.github.io/drum.

[180] SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning

Cai Selvas-Sala, Lei Kang, Lluis Gomez

Main category: cs.CV

TL;DR: SALMUBench is a new benchmark for evaluating multimodal unlearning in contrastive encoders like CLIP, featuring synthetic persona-attribute associations and structured evaluation protocols to measure precise forgetting and collateral damage.

Details

Motivation: As multimodal models like CLIP become widely deployed, there's a critical need to remove sensitive information. However, machine unlearning for contrastively-trained encoders is underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting.

Method: Created SALMUBench with 60K synthetic persona-attribute associations and two foundational models: a Compromised model polluted with sensitive data and a Clean model without it. Both trained from scratch on same 400M-pair retain base. Introduced structured holdout sets (holdout identity, holdout association) to precisely measure unlearning efficacy and collateral damage.

Result: The benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation.

Conclusion: SALMUBench provides a comprehensive framework for evaluating multimodal unlearning, addressing the gap in fine-grained association-level forgetting assessment for contrastive encoders like CLIP, with publicly released resources to foster future research.

Abstract: As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively-trained encoders remains underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean model without it. To isolate unlearning effects, both are trained from scratch on the same 400M-pair retain base, with the Compromised model additionally trained on the sensitive set. We propose a novel evaluation protocol with structured holdout sets (holdout identity, holdout association) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.

[181] Verify Claimed Text-to-Image Models via Boundary-Aware Prompt Optimization

Zidong Zhao, Yihao Huang, Qing Guo, Tianlin Li, Anran Li, Kailong Wang, Jin Song Dong, Geguang Pu

Main category: cs.CV

TL;DR: BPO is a reference-free method for verifying Text-to-Image models by identifying boundary-adjacent prompts that trigger unstable outputs specific to the target model, enabling accurate model verification without needing multiple reference models.

Details

Motivation: As T2I generation becomes widespread with third-party platforms offering multiple model APIs, there's a need to verify whether APIs actually use the claimed official models to prevent false claims that mislead users and harm model owners' reputations. Existing verification methods require multiple reference models for prompt optimization, which is computationally expensive and sensitive to model selection.

Method: BPO (Boundary-aware Prompt Optimization) explores intrinsic characteristics of target T2I models by identifying semantic boundaries in embedding space (transition zones between concepts). Prompts near these boundaries generate unstable outputs on the target model but remain stable on others. The method directly optimizes prompts to be near these boundaries without needing reference models.

Result: Experiments on five T2I models and four baselines show that BPO achieves superior verification accuracy compared to existing methods.

Conclusion: BPO provides an effective reference-free approach for T2I model verification by leveraging model-specific boundary behaviors, addressing computational cost and sensitivity issues of existing methods while maintaining high accuracy.

Abstract: As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation. However, false claims of using official models can mislead users and harm model owners’ reputations, making model verification essential to confirm whether an API’s underlying model matches its claim. Existing methods address this by using verification prompts generated by official model owners, but the generation relies on multiple reference models for optimization, leading to high computational cost and sensitivity to model selection. To address this problem, we propose a reference-free T2I model verification method called Boundary-aware Prompt Optimization (BPO). It directly explores the intrinsic characteristics of the target model. The key insight is that although different T2I models produce similar outputs for normal prompts, their semantic boundaries in the embedding space (transition zones between two concepts such as “corgi” and “bagel”) are distinct. Prompts near these boundaries generate unstable outputs (e.g., sometimes a corgi and sometimes a bagel) on the target model but remain stable on other models. By identifying such boundary-adjacent prompts, BPO captures model-specific behaviors that serve as reliable verification cues for distinguishing T2I models. Experiments on five T2I models and four baselines demonstrate that BPO achieves superior verification accuracy.

[182] Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Main category: cs.CV

TL;DR: LowFormer is a novel vision backbone family with streamlined design and Lowtention (lightweight alternative to Multi-Head Self-Attention) that achieves superior efficiency and performance across various hardware platforms.

Details

Motivation: The paper addresses the limitations of using MACs (Multiply Accumulate operations) as the primary efficiency metric for vision backbones, especially on edge devices. The authors aim to identify key factors for efficient execution and optimize backbone design for real-world performance.

Method: The authors experimentally analyze MAC count vs execution time of common architectural elements, identify efficiency factors, then design LowFormer with streamlined macro/micro architecture including Lowtention (lightweight alternative to self-attention). They also create an edge GPU optimized version.

Result: LowFormer achieves superior results on ImageNet and remarkable speed-ups across various hardware platforms compared to state-of-the-art backbones. It demonstrates wide applicability on image classification, object detection, semantic segmentation, image retrieval, and visual object tracking.

Conclusion: LowFormer provides an efficient vision backbone family with practical design insights beyond MAC metrics, offering significant speed improvements while maintaining or improving accuracy across diverse vision tasks.

Abstract: Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline’s speed on edge GPU and desktop GPU. We demonstrate LowFormer’s wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.

[183] From Pixels to Privacy: Temporally Consistent Video Anonymization via Token Pruning for Privacy Preserving Action Recognition

Nazia Aslam, Abhisek Ray, Joakim Bruslund Haurum, Lukas Esterle, Kamal Nasrollahi

Main category: cs.CV

TL;DR: Attention-driven spatiotemporal video anonymization framework that uses Vision Transformers with dual classification tokens to separate action-relevant from privacy-sensitive content, selectively pruning tubelets to preserve utility while reducing privacy leakage.

Details

Motivation: Large-scale video models improve video understanding but amplify privacy risks by encoding sensitive attributes like facial identity, race, and gender. While image anonymization is well-studied, video anonymization remains underexplored despite video models' ability to leverage spatiotemporal motion patterns as biometric identifiers.

Method: Proposes an attention-driven spatiotemporal video anonymization framework using Vision Transformers with two task-specific classification tokens: an action CLS token and a privacy CLS token. These learn complementary representations within a shared Transformer backbone. Attention distributions are contrasted to compute utility-privacy scores for each spatiotemporal tubelet, keeping only the top-k tubelets with highest scores to prune privacy-dominated content while preserving action-critical information.

Result: Extensive experiments show the approach maintains action recognition performance comparable to models trained on raw videos while substantially reducing privacy leakage. The method demonstrates effective privacy-preserving video analytics.

Conclusion: Attention-driven spatiotemporal pruning offers an effective and principled solution for privacy-preserving video analytics, balancing utility preservation with privacy protection in video understanding systems.

Abstract: Recent advances in large-scale video models have significantly improved video understanding across domains such as surveillance, healthcare, and entertainment. However, these models also amplify privacy risks by encoding sensitive attributes, including facial identity, race, and gender. While image anonymization has been extensively studied, video anonymization remains relatively underexplored, even though modern video models can leverage spatiotemporal motion patterns as biometric identifiers. To address this challenge, we propose a novel attention-driven spatiotemporal video anonymization framework based on systematic disentanglement of utility and privacy features. Our key insight is that attention mechanisms in Vision Transformers (ViTs) can be explicitly structured to separate action-relevant information from privacy-sensitive content. Building on this insight, we introduce two task-specific classification tokens, an action CLS token and a privacy CLS token, that learn complementary representations within a shared Transformer backbone. We contrast their attention distributions to compute a utility-privacy score for each spatiotemporal tubelet, and keep the top-k tubelets with the highest scores. This selectively prunes tubelets dominated by privacy cues while preserving those most critical for action recognition. Extensive experiments demonstrate that our approach maintains action recognition performance comparable to models trained on raw videos, while substantially reducing privacy leakage. These results indicate that attention-driven spatiotemporal pruning offers an effective and principled solution for privacy-preserving video analytics.

[184] HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network

Mingyu Zhang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Jiajia Nie, Yinwei Wei, Yupeng Hu

Main category: cs.CV

TL;DR: HINT is a dual-path contextualized network for composed image retrieval that addresses the neglect of contextual information by performing contextualized encoding and amplifying similarity differences between matching and non-matching samples.

Details

Motivation: Existing CIR methods neglect contextual information in discriminating matching samples, which is crucial for accurate retrieval. Two main challenges exist: 1) implicit dependencies between query components, and 2) lack of differential amplification mechanism to distinguish matching from non-matching samples.

Method: Proposes HINT (dual-patH composItional coNtextualized neTwork) with two key components: 1) contextualized encoding to capture implicit dependencies between reference image and modification text, and 2) differential amplification mechanism to enhance similarity differences between matching and non-matching samples.

Result: HINT achieves optimal performance on all metrics across two CIR benchmark datasets, demonstrating superiority over existing methods in complex scenarios.

Conclusion: The proposed HINT model effectively addresses the contextual information neglect in CIR by capturing implicit dependencies and amplifying similarity differences, leading to state-of-the-art performance.

Abstract: Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. It aims to retrieve target images from large-scale image databases that are consistent with the modification semantics, based on a multimodal query composed of a reference image and modification text. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the neglect of contextual information in discriminating matching samples. However, addressing this limitation is not an easy task due to two challenges: 1) implicit dependencies and 2) the lack of a differential amplification mechanism. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which can perform contextualized encoding and amplify the similarity differences between matching and non-matching samples, thus improving the upper performance of CIR models in complex scenarios. Our HINT model achieves optimal performance on all metrics across two CIR benchmark datasets, demonstrating the superiority of our HINT model. Codes are available at https://github.com/zh-mingyu/HINT.

[185] Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow

Ziyue Zeng, Xun Su, Haoyuan Liu, Bingyu Lu, Yui Tatsumi, Hiroshi Watanabe

Main category: cs.CV

TL;DR: GVC transforms pretrained video generative models into video codecs by converting deterministic ODEs to stochastic SDEs, enabling codebook-driven compression without retraining.

Details

Motivation: Existing generative video compression methods only use generative models as post-hoc reconstruction modules, limiting their integration with conventional codecs. The authors aim to create a framework where generative models themselves become the codec.

Method: Convert deterministic rectified-flow ODEs of video foundation models into equivalent SDEs at inference, creating stochastic injection points for codebook-driven compression. Implement three conditioning strategies: Image-to-Video (I2V), Text-to-Video (T2V), and First-Last-Frame-to-Video (FLF2V) with GOP chaining.

Result: GVC achieves high-quality reconstruction below 0.002 bpp, supports flexible bitrate control through a single hyperparameter, and spans trade-offs between spatial fidelity, temporal coherence, and compression efficiency.

Conclusion: The proposed zero-shot framework successfully transforms pretrained video generative models into effective video codecs, enabling efficient compression with flexible conditioning strategies.

Abstract: Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose \emph{Generative Video Codec} (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies – \emph{Image-to-Video} (I2V) with adaptive tail-frame atom allocation, \emph{Text-to-Video} (T2V) operating at near-zero side information as a pure generative prior, and \emph{First-Last-Frame-to-Video} (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002,bpp while supporting flexible bitrate control through a single hyperparameter.

[186] DuSCN-FusionNet: An Interpretable Dual-Channel Structural Covariance Fusion Framework for ADHD Classification Using Structural MRI

Qurat Ul Ain, Alptekin Temizel, Soyiba Jawed

Main category: cs.CV

TL;DR: DuSCN-FusionNet: Interpretable ADHD classification framework using dual-channel Structural Covariance Networks from sMRI data, achieving 80.59% balanced accuracy with ROI-level interpretability via Grad-CAM.

Details

Motivation: ADHD lacks reliable imaging-based biomarkers, and existing deep learning approaches are black-box systems that limit clinical trust and interpretability. There's a need for interpretable structural MRI-based frameworks for ADHD diagnosis.

Method: Proposes DuSCN-FusionNet using dual-channel Structural Covariance Networks (SCNs) from sMRI data: intensity-based and heterogeneity-based SCNs capture inter-regional morphological relationships. Uses SCN-CNN encoder, with late-stage fusion of auxiliary ROI-wise variability features and global statistical descriptors. Evaluated with stratified 10-fold cross-validation and 5-seed ensemble strategy.

Result: Achieved mean balanced accuracy of 80.59% and AUC of 0.778 on ADHD-200 dataset (Peking University site). Precision: 81.66%, recall: 80.59%, F1-score: 80.27%. Grad-CAM adaptation provides ROI-level importance scores for interpretability.

Conclusion: DuSCN-FusionNet offers an interpretable sMRI-based framework for ADHD classification with competitive performance and biomarker identification capability through ROI-level interpretability, addressing clinical trust issues in deep learning approaches.

Abstract: Attention Deficit Hyperactivity Disorder (ADHD) is a highly prevalent neurodevelopmental condition; however, its neurobiological diagnosis remains challenging due to the lack of reliable imaging-based biomarkers, particularly anatomical markers. Structural MRI (sMRI) provides a non-invasive modality for investigating brain alterations associated with ADHD; nevertheless, most deep learning approaches function as black-box systems, limiting clinical trust and interpretability. In this work, we propose DuSCN-FusionNet, an interpretable sMRI-based framework for ADHD classification that leverages dual-channel Structural Covariance Networks (SCNs) to capture inter-regional morphological relationships. ROI-wise mean intensity and intra-regional variability descriptors are used to construct intensity-based and heterogeneity-based SCNs, which are processed through an SCN-CNN encoder. In parallel, auxiliary ROI-wise variability features and global statistical descriptors are integrated via late-stage fusion to enhance performance. The model is evaluated using stratified 10-fold cross-validation with a 5-seed ensemble strategy, achieving a mean balanced accuracy of 80.59% and an AUC of 0.778 on the Peking University site of the ADHD-200 dataset. DuSCN-FusionNet further achieves precision, recall, and F1-scores of 81.66%, 80.59%, and 80.27%, respectively. Moreover, Grad-CAM is adapted to the SCN domain to derive ROI-level importance scores, enabling the identification of structurally relevant brain regions as potential biomarkers.

[187] Only Whats Necessary: Pareto Optimal Data Minimization for Privacy Preserving Video Anomaly Detection

Nazia Aslam, Abhisek Ray, Thomas B. Moeslund, Kamal Nasrollahi

Main category: cs.CV

TL;DR: Privacy-preserving video anomaly detection framework that minimizes personally identifiable information exposure while maintaining detection performance through data minimization techniques.

Details

Motivation: Video anomaly detection systems require large datasets but often contain personally identifiable information (PII) that creates GDPR compliance challenges, necessitating privacy-by-design approaches that limit data exposure to only what's necessary for anomaly detection.

Method: Introduces “Only What’s Necessary” framework combining breadth-based and depth-based data minimization mechanisms to suppress PII while preserving anomaly detection cues. Evaluates minimization configurations using VAD models and privacy inference models with ranking methods and Pareto analysis to identify optimal trade-offs.

Result: Framework effectively identifies sweet spot operating points that minimize personal data exposure with limited degradation in detection performance, as demonstrated through extensive experiments on public datasets.

Conclusion: Proposed privacy-by-design framework enables GDPR-compliant video anomaly detection by balancing privacy protection and utility through systematic data minimization techniques.

Abstract: Video anomaly detection (VAD) systems are increasingly deployed in safety critical environments and require a large amount of data for accurate detection. However, such data may contain personally identifiable information (PII), including facial cues and sensitive demographic attributes, creating compliance challenges under the EU General Data Protection Regulation (GDPR). In particular, GDPR requires that personal data be limited to what is strictly necessary for a specified processing purpose. To address this, we introduce Only What’s Necessary, a privacy-by-design framework for VAD that explicitly controls the amount and type of visual information exposed to the detection pipeline. The framework combines breadth based and depth based data minimization mechanisms to suppress PII while preserving cues relevant to anomaly detection. We evaluate a range of minimization configurations by feeding the minimized videos to both a VAD model and a privacy inference model. We employ two ranking based methods, along with Pareto analysis, to characterize the resulting trade off between privacy and utility. From the non-dominated frontier, we identify sweet spot operating points that minimize personal data exposure with limited degradation in detection performance. Extensive experiments on publicly available datasets demonstrate the effectiveness of the proposed framework.

[188] Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling

Ruixing Zhang, Hanzhang Jiang, Leilei Sun, Liangzhe Han, Jibin Wang, Weifeng Lv

Main category: cs.CV

TL;DR: Sig2GPS transforms cellular signaling data into GPS trajectories using image-to-video generation on maps, treating signaling traces as map images and generating continuous GPS paths as videos.

Details

Motivation: Cellular signaling records provide broad human mobility coverage but only offer coarse location data (cell identifiers), limiting their use for applications requiring high-precision GPS trajectories. Current solutions use complex multi-stage pipelines or coordinate regression, which are suboptimal.

Method: Reframes the problem as image-to-video generation: signaling traces are rendered on maps as images, and a video generation model is trained to draw continuous GPS paths. Uses a paired signaling-to-trajectory video dataset to fine-tune an open-source video model, with trajectory-aware reinforcement learning optimization for improved fidelity.

Result: Substantial improvements over strong engineered and learning-based baselines on large-scale real-world datasets. Additional results show scalability and cross-city transferability for next GPS prediction tasks.

Conclusion: Map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints, offering a novel approach to converting coarse signaling data to precise GPS trajectories.

Abstract: Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers) and therefore limit their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by domain experts often lay the signaling trace on the map and sketch the corresponding GPS route, unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates, Sig2GPS is reframed as an image-to-video generation task that directly operates in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.

[189] From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter

Zhenghao Xu, Mengning Yang

Main category: cs.CV

TL;DR: Plot2API system for recommending graphical APIs from plot images, extended to handle hand-drawn plots via new dataset HDpy-13 and efficient Plot-Adapter architecture.

Details

Motivation: Existing Plot2API systems work well for standard plot images but fail for hand-drawn plots due to domain gap and lack of expertise. Non-experts often create hand-drawn plots, creating accessibility issues.

Method: 1) Created HDpy-13 dataset of hand-drawn plots; 2) Proposed Plot-Adapter architecture with lightweight CNN block for local feature capture and projection matrix sharing to reduce parameters; 3) Enables separate adapter training per language/domain instead of full model retraining.

Result: Experimental results show effectiveness of HDpy-13 dataset and efficiency of Plot-Adapter in improving API recommendation for hand-drawn plots while reducing computational costs.

Conclusion: The work addresses accessibility gap in Plot2API for hand-drawn plots through specialized dataset and efficient adapter architecture, making plot creation more accessible to non-experts.

Abstract: As plots play a critical role in modern data visualization and analysis, Plot2API is launched to help non-experts and beginners create their desired plots by directly recommending graphical APIs from reference plot images by neural networks. However, previous works on Plot2API have primarily focused on the recommendation for standard plot images, while overlooking the hand-drawn plot images that are more accessible to non-experts and beginners. To make matters worse, both Plot2API models trained on standard plot images and powerful multi-modal large language models struggle to effectively recommend APIs for hand-drawn plot images due to the domain gap and lack of expertise. To facilitate non-experts and beginners, we introduce a hand-drawn plot dataset named HDpy-13 to improve the performance of graphical API recommendations for hand-drawn plot images. Additionally, to alleviate the considerable strain of parameter growth and computational resource costs arising from multi-domain and multi-language challenges in Plot2API, we propose Plot-Adapter that allows for the training and storage of separate adapters rather than requiring an entire model for each language and domain. In particular, Plot-Adapter incorporates a lightweight CNN block to improve the ability to capture local features and implements projection matrix sharing to reduce the number of fine-tuning parameters further. Experimental results demonstrate both the effectiveness of HDpy-13 and the efficiency of Plot-Adapter.

[190] MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

Quan Dao, Dimitris Metaxas

Main category: cs.CV

TL;DR: Hierarchical multi-patch transformer design for diffusion models that reduces computation by up to 50% while maintaining generative quality

Details

Motivation: Standard Diffusion Transformers (DiTs) use isotropic designs with same patch sizes throughout, leading to heavy computational costs during training. There's a need for more efficient architectures that maintain performance while reducing computation.

Method: Proposes a multi-patch transformer where early blocks use larger patches to capture global context and later blocks use smaller patches for local refinement. Also introduces improved time and class embedding designs to accelerate training convergence.

Result: Achieves up to 50% reduction in GFLOPs while maintaining good generative performance on ImageNet dataset. The improved embeddings accelerate training convergence.

Conclusion: The hierarchical multi-patch transformer design offers an efficient alternative to isotropic DiTs, significantly reducing computational costs without sacrificing generative quality.

Abstract: Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training process. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design could reduces computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we also propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at \url{https://github.com/quandao10/MPDiT}

[191] Make Geometry Matter for Spatial Reasoning

Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang

Main category: cs.CV

TL;DR: GeoSR enhances vision-language models’ spatial reasoning by forcing them to actively use geometry tokens through masking of 2D visual cues and adaptive fusion mechanisms.

Details

Motivation: Current vision-language models struggle with spatial reasoning despite having access to geometry tokens from 3D foundation models, as they tend to rely too heavily on 2D visual cues instead of properly utilizing geometric information.

Method: Proposes GeoSR framework with two key components: (1) Geometry-Unleashing Masking that strategically masks 2D vision tokens during training to force reliance on geometry tokens, and (2) Geometry-Guided Fusion using gated routing to amplify geometry token contributions in geometrically critical regions.

Result: Extensive experiments on static and dynamic spatial reasoning benchmarks show GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information.

Conclusion: GeoSR successfully enhances VLMs’ spatial reasoning capabilities by making geometry tokens matter through training strategies that encourage active geometric reasoning rather than passive fusion.

Abstract: Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.

[192] HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek

Main category: cs.CV

TL;DR: HandVQA is a diagnostic benchmark for evaluating vision-language models’ understanding of fine-grained hand anatomy through 1.6M spatial reasoning questions, revealing systematic limitations and enabling transfer learning to downstream tasks.

Details

Motivation: Current vision-language models struggle with fine-grained spatial reasoning of articulated hand poses, which is critical for applications like robot-assisted surgery, chip manufacturing, and AR/VR human-AI interaction.

Method: Created HandVQA benchmark using high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA) with over 1.6M multiple-choice questions probing spatial relationships between hand joints. Evaluated state-of-the-art VLMs (LLaVA, DeepSeek, Qwen-VL) in base and fine-tuned settings using LoRA.

Result: Revealed systematic limitations including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. Demonstrated that 3D-grounded spatial knowledge transfers zero-shot, improving accuracy on hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).

Conclusion: HandVQA exposes critical reasoning gaps in current VLMs and provides a validated path for improvement through spatial reasoning training, with demonstrated transfer learning benefits to practical applications.

Abstract: Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs’ understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving accuracy of model on novel downstream tasks like hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).

[193] Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning

Shida Wang, YongXiang Hua, Zhou Tao, Haoyu Cao, Linli Xu

Main category: cs.CV

TL;DR: SCORE is a reinforcement learning framework for adaptive token compression in video understanding MLLMs that reduces computational costs while maintaining performance.

Details

Motivation: Current video MLLMs suffer from high computational costs due to massive visual token redundancy and "context rot" issues. Existing compression methods use heuristics or fixed transformations that are decoupled from task objectives, limiting adaptability and effectiveness.

Method: SCORE uses a lightweight policy network conditioned on surprise-augmented state representations incorporating inter-frame residuals to capture temporal dynamics and motion saliency. It’s optimized via group-wise reinforcement learning with split-advantage estimator, stabilized by a two-stage curriculum from static pseudo-videos to real dynamic videos.

Result: SCORE significantly outperforms state-of-the-art baselines on diverse video understanding benchmarks, achieving 16x prefill speedup while preserving 99.5% of original performance at 10% retention ratio.

Conclusion: SCORE provides a scalable solution for efficient long-form video understanding through adaptive token compression learned via reinforcement learning, effectively balancing computational efficiency and performance preservation.

Abstract: Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from ‘‘context rot’’ due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness. To address this, we propose SCORE (Surprise-augmented token COmpression via REinforcement learning), a unified framework that learns an adaptive token compression policy. SCORE introduces a lightweight policy network conditioned on a surprise-augmented state representation that incorporates inter-frame residuals to explicitly capture temporal dynamics and motion saliency. We optimize this policy using a group-wise reinforcement learning scheme with a split-advantage estimator, stabilized by a two-stage curriculum transferring from static pseudo-videos to real dynamic videos. Extensive experiments on diverse video understanding benchmarks demonstrate that SCORE significantly outperforms state-of-the-art baselines. Notably, SCORE achieves a 16x prefill speedup while preserving 99.5% of original performance at a 10% retention ratio, offering a scalable solution for efficient long-form video understanding.

[194] Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration

I-Hsiang Chen, Isma Hadji, Enrique Sanchez, Adrian Bulat, Sy-Yen Kuo, Radu Timofte, Georgios Tzimiropoulos, Brais Martinez

Main category: cs.CV

TL;DR: RAR proposes an iterative Restore-Assess-Repeat framework that integrates Image Quality Assessment and Image Restoration in latent space for efficient, adaptive handling of unknown/composite degradations.

Details

Motivation: Current image restoration methods suffer from limited generalization and inefficiency when dealing with unknown or composite degradations, as they typically treat IQA and IR as separate modules leading to latency and information loss.

Method: RAR integrates IQA and IR into a unified framework operating entirely in latent domain, performing degradation identification, restoration, and quality verification iteratively through a Restore-Assess-Repeat process that’s fully trainable end-to-end.

Result: Extensive experiments show consistent improvements under single, unknown, and composite degradations, establishing new state-of-the-art performance while minimizing latency from disjoint modules.

Conclusion: The unified latent-domain approach enables efficient, adaptive restoration with better generalization to diverse degradation types through tight integration of assessment and restoration.

Abstract: Image restoration aims to recover high quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process, that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess and restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arises from keeping the two modules disjoint, (e.g. during image and/or text decoding). Extensive experiments show that our approach consistent improvements under single, unknown and composite degradations, thereby establishing a new state-of-the-art.

[195] SHANDS: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training

Le Ma, Thiago Freitas dos Santos, Nadia Magnenat-Thalmann, Katarzyna Wac

Main category: cs.CV

TL;DR: Surgical-Hands (SHands) is a large-scale multi-view video dataset for surgical hand-gesture and error recognition, addressing the lack of realistic trainee error data for automated AI assessment in surgical training.

Details

Motivation: Current surgical training relies on expert-led skill assessment which is costly, time-limited, difficult to scale, and confined to institutions with specialists. Automated AI-based assessment offers an alternative but lacks datasets with realistic trainee errors and multi-view variability needed for robust computer vision approaches.

Method: Created SHands dataset capturing linear incision and suturing procedures using five RGB cameras from complementary viewpoints. 52 participants (20 experts, 32 trainees) each completed three standardized trials per procedure. Videos are annotated at frame level with 15 gesture primitives and include validated taxonomy of 8 trainee error types.

Result: Dataset enables both gesture recognition and error detection. Standardized evaluation protocols defined for single-view, multi-view, and cross-view generalization. State-of-the-art deep learning models benchmarked on the dataset, which is publicly released.

Conclusion: SHands supports development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge, addressing critical gaps in surgical education through computer vision and AI.

Abstract: In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, difficult to scale, and its expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. \textsc{SHands} captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees), each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHands is publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.

[196] Image-based Quantification of Postural Deviations on Patients with Cervical Dystonia: A Machine Learning Approach Using Synthetic Training Data

Roland Stenger, Sebastian Löns, Nele Brügge, Feline Hamami, Alexander Münchau, Theresa Paulus, Anne Weissbach, Tatiana Usnich, Max Borsche, Martje G. Pauly, Lara M. Lange, Markus A. Hobert, Rebecca Herzog, Ana Luísa de Almeida Marcelino, Tina Mainka, Friederike Schumann, Lukas L. Goede, Johanna Reimer, Julienne Haas, Jos Becktepe, Alexander Baumann, Robin Wolke, Chi Wang Ip, Thorsten Odorfer, Daniel Zeller, Lisa Harder-Rauschenberger, John-Ih Lee, Philipp Albrecht, Tristan Kölsche, Joachim K. Krauss, Johanna M. Nagel, Joachim Runge, Johanna Doll-Lee, Simone Zittel, Kai Grimm, Pawel Tacik, André Lee, Tobias Bäumer, Sebastian Fudickar

Main category: cs.CV

TL;DR: Automated image-based system using pretrained head-pose estimation and deep learning on synthetic avatar images to objectively assess cervical dystonia symptoms, validated against expert clinical ratings.

Details

Motivation: Current assessment of cervical dystonia relies on subjective clinical rating scales (TWSTRS) with low inter-rater reliability, lacking objective tools for monitoring disease severity and treatment response.

Method: Combines pretrained head-pose estimation algorithm for rotational symptoms with deep learning model trained on ~16,000 synthetic avatar images for translational symptoms (lateral shift), validated in multicenter study comparing against 20 clinical experts on 100 real patient images and 100 synthetic avatars.

Result: Strong agreement with expert ratings for rotational symptoms: torticollis (r=0.91), laterocollis (r=0.81), anteroretrocollis (r=0.78). Moderate correlation for lateral shift (r=0.55), with higher accuracy than human raters in controlled benchmark tests on avatars.

Conclusion: Synthetic training data bridges clinical data gap, enabling validated objective tool for CD postural assessment that can standardize clinical decision-making and trial evaluation.

Abstract: Cervical dystonia (CD) is the most common form of dystonia, yet current assessment relies on subjective clinical rating scales, such as the Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), which requires expertise, is subjective and faces low inter-rater reliability some items of the score. To address the lack of established objective tools for monitoring disease severity and treatment response, this study validates an automated image-based head pose and shift estimation system for patients with CD. We developed an assessment tool that combines a pretrained head-pose estimation algorithm for rotational symptoms with a deep learning model trained exclusively on ~16,000 synthetic avatar images to evaluate rare translational symptoms, specifically lateral shift. This synthetic data approach overcomes the scarcity of clinical training examples. The system’s performance was validated in a multicenter study by comparing its predicted scores against the consensus ratings of 20 clinical experts using a dataset of 100 real patient images and 100 labeled synthetic avatars. The automated system demonstrated strong agreement with expert clinical ratings for rotational symptoms, achieving high correlations for torticollis (r=0.91), laterocollis (r=0.81), and anteroretrocollis (r=0.78). For lateral shift, the tool achieved a moderate correlation (r=0.55) with clinical ratings and demonstrated higher accuracy than human raters in controlled benchmark tests on avatars. By leveraging synthetic training data to bridge the clinical data gap, this model successfully generalizes to real-world patients, providing a validated, objective tool for CD postural assessment that can enable standardized clinical decision-making and trial evaluation.

[197] Meta-Learned Adaptive Optimization for Robust Human Mesh Recovery with Uncertainty-Aware Parameter Updates

Shaurjya Mandal, Nutan Sharma, John Galeotti

Main category: cs.CV

TL;DR: Meta-learning framework for human mesh recovery that learns optimization-friendly initializations and uses adaptive uncertainty-aware updates during test-time refinement.

Details

Motivation: Human mesh recovery from single images suffers from depth ambiguity and poor generalization. Existing methods struggle with poor initialization for test-time refinement and inefficient parameter updates during optimization.

Method: 1) Meta-learning strategy simulating test-time optimization during training for better initializations; 2) Selective parameter caching to freeze converged joints; 3) Distribution-based adaptive updates sampling from learned distributions with uncertainty quantification; 4) Stochastic approximation for intractable gradients.

Result: State-of-the-art performance: reduces MPJPE by 10.3 on 3DPW and 8.0 on Human3.6M. Superior domain adaptation with minimal degradation across environments. Provides meaningful uncertainty estimates correlating with prediction errors.

Conclusion: Combining meta-learning and adaptive optimization enables accurate mesh recovery and robust generalization to challenging scenarios.

Abstract: Human mesh recovery from single images remains challenging due to inherent depth ambiguity and limited generalization across domains. While recent methods combine regression and optimization approaches, they struggle with poor initialization for test-time refinement and inefficient parameter updates during optimization. We propose a novel meta-learning framework that trains models to produce optimization-friendly initializations while incorporating uncertainty-aware adaptive updates during test-time refinement. Our approach introduces three key innovations: (1) a meta-learning strategy that simulates test-time optimization during training to learn better parameter initializations, (2) a selective parameter caching mechanism that identifies and freezes converged joints to reduce computational overhead, and (3) distribution-based adaptive updates that sample parameter changes from learned distributions, enabling robust exploration while quantifying uncertainty. Additionally, we employ stochastic approximation techniques to handle intractable gradients in complex loss landscapes. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance, reducing MPJPE by 10.3 on 3DPW and 8.0 on Human3.6M compared to strong baselines. Our approach shows superior domain adaptation capabilities with minimal performance degradation across different environmental conditions, while providing meaningful uncertainty estimates that correlate with actual prediction errors. Combining meta-learning and adaptive optimization enables accurate mesh recovery and robust generalization to challenging scenarios.

[198] HyVIC: A Metric-Driven Spatio-Spectral Hyperspectral Image Compression Architecture Based on Variational Autoencoders

Martin Hermann Paul Fuchs, Behnood Rasti, Begüm Demir

Main category: cs.CV

TL;DR: HyVIC: A novel variational hyperspectral image compression architecture that explicitly balances spatial and spectral feature learning through adjustable spatio-spectral components and metric-driven hyperparameter selection.

Details

Motivation: Existing learning-based hyperspectral image compression methods adapt models designed for natural images without properly addressing the unique spatio-spectral redundancies in HSIs, lacking explicit architectural designs to balance spatial and spectral feature learning.

Method: Proposes HyVIC with four main components: adjustable spatio-spectral encoder, spatio-spectral hyperencoder, spatio-spectral hyperdecoder, and adjustable spatio-spectral decoder. Uses metric-driven strategy to systematically select hyperparameters balancing spatial vs. spectral learning.

Result: Achieves high spatial and spectral reconstruction fidelity across wide compression ratios, improving state-of-the-art by up to 4.66dB in BD-PSNR on two benchmark datasets.

Conclusion: The trade-off between spatial and spectral feature learning is crucial for reconstruction fidelity in HSI compression. Provides insights and practical guidelines for future learning-based variational HSI compression research.

Abstract: The rapid growth of hyperspectral data archives in remote sensing (RS) necessitates effective compression methods for storage and transmission. Recent advances in learning-based hyperspectral image (HSI) compression have significantly enhanced both reconstruction fidelity and compression efficiency. However, existing methods typically adapt variational image compression models designed for natural images, without adequately accounting for the distinct spatio-spectral redundancies inherent in HSIs. In particular, they lack explicit architectural designs to balance spatial and spectral feature learning, limiting their ability to effectively leverage the unique characteristics of hyperspectral data. To address this issue, we introduce spatio-spectral variational hyperspectral image compression architecture (HyVIC). The proposed model comprises four main components: 1) adjustable spatio-spectral encoder; 2) spatio-spectral hyperencoder; 3) spatio-spectral hyperdecoder; and 4) adjustable spatio-spectral decoder. We demonstrate that the trade-off between spatial and spectral feature learning is crucial for the reconstruction fidelity, and therefore present a metric-driven strategy to systematically select the hyperparameters of the proposed model. Extensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed model, achieving high spatial and spectral reconstruction fidelity across a wide range of compression ratios (CRs) and improving the state of the art by up to 4.66dB in terms of BD-PSNR. Based on our results, we offer insights and derive practical guidelines to guide future research directions in learning-based variational HSI compression. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hyvic .

[199] SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

Weihong Pan, Xiaoyu Zhang, Zhuang Zhang, Zhichao Ye, Nan Wang, Haomin Liu, Guofeng Zhang

Main category: cs.CV

TL;DR: Sparse-camera 4D reconstruction framework using Spatio-Temporal Distortion Field to handle inconsistencies in generative observations from uncalibrated cameras.

Details

Motivation: High-quality 4D reconstruction typically requires expensive dense camera arrays (tens to hundreds of synchronized cameras), which limits practical scalability. The paper aims to enable 4D reconstruction from sparse, uncalibrated camera inputs instead.

Method: Proposes a sparse-camera dynamic reconstruction framework with Spatio-Temporal Distortion Field as key innovation - a unified mechanism for modeling inconsistencies in generative observations across spatial and temporal dimensions. Develops a complete pipeline for 4D reconstruction from sparse, uncalibrated camera inputs.

Result: Achieves spatio-temporally consistent high-fidelity renderings on multi-camera dynamic scene benchmarks, significantly outperforming existing approaches.

Conclusion: Enables practical 4D reconstruction without costly dense camera setups by effectively handling inconsistencies in sparse camera observations through the proposed Spatio-Temporal Distortion Field.

Abstract: High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. Dependence on such costly lab setups severely limits practical scalability. The reliance on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.

[200] ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better

Mriganka Nath, Anurag Das, Jiahao Xie, Bernt Schiele

Main category: cs.CV

TL;DR: ClipTTT uses CLIP’s image-text alignment to guide test-time training of LVLMs, reducing hallucinations when visual inputs are corrupted.

Details

Motivation: Large vision-language models hallucinate more when visual inputs are corrupted at test time, which is problematic for real-world applications where visual degradation is common.

Method: Proposes CLIP-guided Test-Time Training (ClipTTT) that uses pre-trained CLIP’s image-text alignment as guidance signal to identify reliable self-supervision targets for adapting LVLMs on-the-fly with single test samples, without modifying base models.

Result: Extensive experiments on standard hallucination benchmarks with 15 common corruptions show ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.

Conclusion: ClipTTT provides an effective test-time adaptation method for LVLMs that reduces hallucination under visual corruption using CLIP’s stable alignment guidance.

Abstract: Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.

[201] Conditional Diffusion for 3D CT Volume Reconstruction from 2D X-rays

Martin Rath, Morteza Ghahremani, Yitong Li, Ashkan Taghipour, Marcus Makowski, Christian Wachinger

Main category: cs.CV

TL;DR: AXON: A multi-stage diffusion-based framework that reconstructs high-fidelity 3D CT volumes directly from real chest X-rays using coarse-to-fine strategy with Brownian Bridge diffusion and ControlNet refinement.

Details

Motivation: CT scans provide rich 3D anatomical details but have limitations including high radiation exposure, cost, and limited availability. Chest X-rays are cost-effective and widely accessible but only provide 2D projections with limited pathological information. Reconstructing 3D CT volumes from 2D X-rays could increase diagnostic accessibility, but existing methods rely on synthetic X-ray projections, limiting clinical generalization.

Method: AXON employs a multi-stage diffusion-based framework: 1) Brownian Bridge diffusion model-based initial stage for global structural synthesis, 2) ControlNet-based refinement stage for local intensity optimization, 3) supports bi-planar X-ray input to mitigate depth ambiguities, and 4) integrated super-resolution network to upscale volumes to diagnostic-grade resolution.

Result: AXON significantly outperforms state-of-the-art baselines, achieving 11.9% improvement in PSNR and 11.0% increase in SSIM with robust generalizability across disparate clinical distributions on both public and external datasets.

Conclusion: AXON provides a transformative solution for reconstructing 3D CT volumes directly from real X-rays, increasing diagnostic accessibility while maintaining high fidelity and clinical generalization.

Abstract: Computed tomography (CT) provides rich 3D anatomical details but is often constrained by high radiation exposure, substantial costs, and limited availability. While standard chest X-rays are cost-effective and widely accessible, they only provide 2D projections with limited pathological information. Reconstructing 3D CT volumes from 2D X-rays offers a transformative solution to increase diagnostic accessibility, yet existing methods predominantly rely on synthetic X-ray projections, limiting clinical generalization. In this work, we propose AXON, a multi-stage diffusion-based framework that reconstructs high-fidelity 3D CT volumes directly from real X-rays. AXON employs a coarse-to-fine strategy, with a Brownian Bridge diffusion model-based initial stage for global structural synthesis, followed by a ControlNet-based refinement stage for local intensity optimization. It also supports bi-planar X-ray input to mitigate depth ambiguities inherent in 2D-to-3D reconstruction. A super-resolution network is integrated to upscale the generated volumes to achieve diagnostic-grade resolution. Evaluations on both public and external datasets demonstrate that AXON significantly outperforms state-of-the-art baselines, achieving a 11.9% improvement in PSNR and a 11.0% increase in SSIM with robust generalizability across disparate clinical distributions. Our code is available at https://github.com/ai-med/AXON.

[202] Learnable Quantum Efficiency Filters for Urban Hyperspectral Segmentation

Imad Ali Shah, Jiarong Li, Ethan Delaney, Enda Ward, Martin Glavin, Edward Jones, Brian Deegan

Main category: cs.CV

TL;DR: LQE is a physics-inspired, interpretable dimensionality reduction method for hyperspectral urban driving data that learns smooth spectral response functions with physical constraints, improving semantic segmentation performance while maintaining parameter efficiency.

Details

Motivation: Hyperspectral sensing provides rich spectral information for urban driving scene understanding, but its high dimensionality poses challenges for interpretation and efficient learning. Current methods lack physical interpretability and constraints.

Method: Learnable Quantum Efficiency (LQE) parameterizes smooth high-order spectral response functions that emulate plausible sensor quantum efficiency curves with physical constraints: single dominant peak, smooth responses, and bounded bandwidth. It’s fully differentiable and end-to-end trainable within semantic segmentation models.

Result: LQE achieves highest average mIoU across three hyperspectral urban driving datasets, improving over conventional methods by 2.45%, 0.45%, and 1.04%, and over learnable methods by 1.18%, 1.56%, and 0.81%. It maintains strong parameter efficiency (12-36 parameters vs 51-22K for competitors) with competitive inference latency.

Conclusion: Physics-informed spectral learning improves both performance and interpretability, providing a principled bridge between hyperspectral perception and data-driven multispectral sensor design for automotive vision systems.

Abstract: Hyperspectral sensing provides rich spectral information for scene understanding in urban driving, but its high dimensionality poses challenges for interpretation and efficient learning. We introduce Learnable Quantum Efficiency (LQE), a physics-inspired, interpretable dimensionality reduction (DR) method that parameterizes smooth high-order spectral response functions that emulate plausible sensor quantum efficiency curves. Unlike conventional methods or unconstrained learnable layers, LQE enforces physically motivated constraints, including a single dominant peak, smooth responses, and bounded bandwidth. This formulation yields a compact spectral representation that preserves discriminative information while remaining fully differentiable and end-to-end trainable within semantic segmentation models (SSMs). We conduct systematic evaluations across three publicly available multi-class hyperspectral urban driving datasets, comparing LQE against six conventional and seven learnable baseline DR methods across six SSMs. Averaged across all SSMs and configurations, LQE achieves the highest average mIoU, improving over conventional methods by 2.45%, 0.45%, and 1.04%, and over learnable methods by 1.18%, 1.56%, and 0.81% on HyKo, HSI-Drive, and Hyperspectral City, respectively. LQE maintains strong parameter efficiency (12–36 parameters compared to 51–22K for competing learnable approaches) and competitive inference latency. Ablation studies show that low-order configurations are optimal, while the learned spectral filters converge to dataset-intrinsic wavelength patterns. These results demonstrate that physics-informed spectral learning can improve both performance and interpretability, providing a principled bridge between hyperspectral perception and data-driven multispectral sensor design for automotive vision systems.

[203] OVI-MAP:Open-Vocabulary Instance-Semantic Mapping

Zilong Deng, Federico Tombari, Marc Pollefeys, Johanna Wald, Daniel Barath

Main category: cs.CV

TL;DR: OVI-MAP: A real-time incremental 3D instance-semantic mapping system that decouples instance reconstruction from semantic inference using vision-language models for open-vocabulary labeling.

Details

Motivation: Autonomous agents need robust 3D instance-semantic mapping in complex environments, but existing methods struggle with real-time processing, open-set reasoning, and temporal consistency due to closed-set assumptions or dense per-pixel language fusion.

Method: Decouples instance reconstruction from semantic inference by building a class-agnostic 3D instance map incrementally from RGB-D input, while extracting semantic features only from a small set of automatically selected views using vision-language models.

Result: System operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks, enabling stable instance tracking and zero-shot semantic labeling during online exploration.

Conclusion: OVI-MAP provides an effective solution for incremental open-vocabulary 3D instance-semantic mapping by separating geometric reconstruction from semantic reasoning, achieving real-time performance with improved temporal consistency.

Abstract: Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP that decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.

[204] AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing

Tianyu Liu, Weitao Xiong, Kunming Luo, Manyuan Zhang, Peng Liu, Yuan Liu, Ping Tan

Main category: cs.CV

TL;DR: AutoWeather4D is a feed-forward 3D-aware weather editing framework that decouples geometry and illumination for realistic weather synthesis in autonomous driving scenarios without per-scene optimization.

Details

Motivation: Current generative video models for adverse weather synthesis require massive datasets for rare weather scenarios, while 3D-aware editing methods suffer from costly per-scene optimization and geometric/illumination entanglement.

Method: Uses G-buffer Dual-pass Editing: Geometry Pass leverages explicit structural foundations for surface-anchored physical interactions, and Light Pass analytically resolves light transport with dynamic 3D local relighting.

Result: Achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.

Conclusion: AutoWeather4D provides an efficient feed-forward solution for weather editing that decouples geometry and illumination, enabling realistic weather synthesis without massive datasets or per-scene optimization.

Abstract: Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.

[205] HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching

Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, Zerrin Yumak

Main category: cs.CV

TL;DR: A contrastive flow matching model for co-speech gesture generation that uses mismatched audio-text conditions as negatives to improve semantic grounding and cross-modal consistency.

Details

Motivation: Existing co-speech gesture generation methods have limitations: they rely on external semantic retrieval with limited generalization, focus on rhythmic gestures rather than sparse iconic/metaphoric gestures, and fail to maintain cross-modal consistency by modeling body parts in isolation.

Method: Introduces a Contrastive Flow Matching-based model that uses mismatched audio-text conditions as negative examples during training. The velocity field learns to follow correct motion trajectories while repelling semantically incongruent ones. Embeds text, audio, and holistic motion into a composite latent space using cosine and contrastive objectives to ensure cross-modal coherence.

Result: Extensive experiments and user studies demonstrate that the proposed approach outperforms state-of-the-art methods on two datasets: BEAT2 and SHOW.

Conclusion: The contrastive flow matching approach effectively addresses limitations of existing methods by improving semantic grounding, handling sparse gestures, and maintaining cross-modal consistency in co-speech gesture generation.

Abstract: While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalisation capability due to dependency on predefined linguistic rules. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples without exposure to negative examples, leading to learning rhythmic gestures rather than sparse motion, such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain crossmodal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our proposed approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.

[206] StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Mohit Bansal

Main category: cs.CV

TL;DR: StreamGaze benchmark evaluates MLLMs’ ability to use gaze signals for temporal and proactive reasoning in streaming videos, showing significant performance gaps between models and humans.

Details

Motivation: Current streaming benchmarks lack evaluation of how MLLMs can interpret and leverage human gaze signals for understanding streaming videos, particularly for applications like AR glasses that require anticipating user intentions.

Method: Developed StreamGaze benchmark with gaze-guided tasks (past, present, proactive) and a gaze-video QA generation pipeline that aligns egocentric videos with gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction.

Result: Substantial performance gaps between state-of-the-art MLLMs and human performance across all StreamGaze tasks, revealing limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction.

Conclusion: StreamGaze fills an important gap in evaluating gaze-guided streaming video understanding, provides insights into current MLLM limitations, and offers directions for future research in multimodal temporal reasoning with gaze signals.

Abstract: Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs utilize gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively assess streaming video understanding. These tasks evaluate whether models can use real-time gaze signals to follow shifting attention and infer user intentions based only on past and currently observed frames. To build StreamGaze, we develop a gaze-video Question Answering (QA) generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, highlighting key limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze prompting strategies, reasoning behaviors, and task-specific failure modes, offering insights into current limitations and directions for future research. All data and code are publicly available to support continued research in gaze-guided streaming video understanding.

[207] Scene Grounding In the Wild

Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor

Main category: cs.CV

TL;DR: A framework for globally consistent 3D scene reconstruction from unstructured imagery by aligning partial reconstructions to complete semantic reference models derived from Google Earth Studio renderings.

Details

Motivation: Existing 3D reconstruction pipelines struggle with unstructured, in-the-wild imagery when input views have little or no overlap, often producing disconnected partial reconstructions or incorrectly merging non-overlapping regions.

Method: Uses dense, geospatially accurate pseudo-synthetic renderings from Google Earth Studio as reference models. Represents reference models with 3D Gaussian Splatting augmented with semantic features, and formulates alignment as an inverse feature-based optimization scheme estimating global 6DoF pose and scale while keeping reference fixed.

Result: Demonstrates consistent improvement in global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. Introduces WikiEarth dataset for evaluation.

Conclusion: The proposed framework enables globally consistent alignment of partial 3D reconstructions to complete reference models even without visual overlap, leveraging shared scene semantics across domains despite appearance differences.

Abstract: Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.

[208] MA-Bench: Towards Fine-grained Micro-Action Understanding

Kun Li, Jihao Gu, Fei Wang, Zhiliang Wu, Hehe Fan, Dan Guo

Main category: cs.CV

TL;DR: MA-Bench is a new benchmark for evaluating multimodal LLMs on micro-action understanding, featuring 1,000 videos with 12,000 QA pairs across three evaluation tiers, plus a 20.5K video training corpus for fine-tuning.

Details

Motivation: Current MLLMs lack specialized benchmarks for micro-action understanding, which is crucial for human emotion analysis. There's a need to systematically evaluate MLLMs' ability to perceive subtle human movements and behaviors.

Method: Created MA-Bench with 1,000 videos and three-tier evaluation: (1) micro-action perception, (2) relational comprehension, and (3) interpretive reasoning. Also built MA-Bench-Train with 20.5K videos with structured captions for fine-tuning MLLMs.

Result: Evaluation of 23 MLLMs revealed significant challenges in capturing motion granularity and fine-grained body-part dynamics. Fine-tuning Qwen3-VL-8B on MA-Bench-Train showed clear performance improvements across micro-action reasoning tasks.

Conclusion: MA-Bench establishes a foundation for advancing MLLMs in understanding subtle micro-actions and human behaviors, addressing current limitations in motion granularity and body-part dynamics understanding.

Abstract: With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal that there are significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundation benchmark for advancing MLLMs in understanding subtle micro-action and human-related behaviors. Project Page: https://MA-Bench.github.io

[209] From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion

Dávid Pukanec, Tibor Kubík, Michal Španěl

Main category: cs.CV

TL;DR: ToothCraft: A diffusion-based model for generating complete tooth crowns from incomplete teeth using anatomical context, trained on artificially created incomplete teeth from dental arch datasets.

Details

Motivation: Address the need for automated tooth crown completion in dental restoration by developing a model that can generate anatomically correct crowns conditioned on local context, overcoming the lack of training data for incomplete teeth.

Method: Uses conditioned diffusion models for 3D shapes, trained on artificially generated incomplete tooth geometries created through an augmentation pipeline from complete dental arch datasets (3DS, ODD). The model learns to complete tooth crowns from diverse defect patterns.

Result: Achieves 81.8% IoU and 0.00034 Chamfer Distance on synthetic test cases. Model effectively completes real-world incomplete teeth with minimal intersection with opposing dentition, reducing occlusal interference risk.

Conclusion: ToothCraft demonstrates strong capability for automated tooth crown completion using diffusion models, with potential for real-world dental restoration applications despite being trained on synthetic data.

Abstract: We present ToothCraft, a diffusion-based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of an automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real-world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: https://github.com/ikarus1211/VISAPP_ToothCraft

[210] The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding

Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene

Main category: cs.CV

TL;DR: VLMs trained on image-text pairs show human-level performance on general scene knowledge but fail at affordance understanding, suggesting distributional learning alone is insufficient for embodied cognition.

Details

Motivation: To test whether the distributional hypothesis (that statistical co-occurrence of language and images captures conceptual knowledge) is sufficient for full human scene understanding, particularly examining what VLMs can learn without embodied experience.

Method: Two experiments comparing 18 VLMs to 2000+ humans across 15 scene understanding tasks. Developed Human-Calibrated Cosine Distance (HCD) metric to measure VLM similarity to human response distributions. Tested six mechanistic hypotheses for affordance deficits and analyzed corpus language patterns.

Result: VLMs approached human-level on general knowledge but showed robust deficit in affordance tasks that persisted despite prompt engineering and newer models. Corpus analysis revealed sparse agent-addressed affordance language in training data. Affordance gap was structural rather than stylistic.

Conclusion: Distributional learning from images and text is insufficient for affordance-based scene understanding, suggesting some dimensions of human visual cognition require agent-centered, 3D embodied experience that photographs and captions cannot encode.

Abstract: What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.

[211] From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Xilin Zhao, Qingming Huang

Main category: cs.CV

TL;DR: Co-Settle is a lightweight transfer learning framework that addresses the trade-off between intra-video temporal consistency and inter-video semantic separability when adapting image-pretrained models to video tasks.

Details

Motivation: Current video representation learning approaches using image-pretrained models face a dilemma: fine-tuning heavy temporal modules compromises inter-video semantic separability (ability to distinguish objects across videos), while reducing tunable parameters hinders intra-video temporal consistency (stable representations of same object within video).

Method: Proposes Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework that applies a lightweight projection layer on top of frozen image-pretrained encoder. Uses temporal cycle consistency objective and semantic separability constraint to adjust representation space. Provides theoretical support showing optimized projection yields better trade-off.

Result: Experiments on eight image-pretrained models show consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training.

Conclusion: Co-Settle effectively addresses the consistency-separability trade-off in image-to-video transfer learning, enabling better video representation learning with minimal training.

Abstract: Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. While reducing the tunable parameters hinders their intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between the intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide a theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at https://github.com/yafeng19/Co-Settle.

[212] VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla

Main category: cs.CV

TL;DR: VGGRPO is a latent geometry-guided framework that improves geometric consistency in video diffusion models without modifying pretrained architectures, using a Latent Geometry Model and reinforcement learning with geometry-based rewards.

Details

Motivation: Current video diffusion models achieve high visual quality but often fail to preserve geometric consistency. Existing approaches either modify architectures (compromising pretrained generalization) or use RGB-space alignment methods that are computationally expensive and limited to static scenes.

Method: Proposes VGGRPO with: 1) Latent Geometry Model (LGM) that connects video diffusion latents to geometry foundation models for direct geometry decoding, 2) Group Relative Policy Optimization with two rewards: camera motion smoothness and geometry reprojection consistency, all operating in latent space to avoid costly VAE decoding.

Result: Experiments show VGGRPO improves camera stability, geometry consistency, and overall quality on both static and dynamic benchmarks while eliminating VAE decoding overhead, making it an efficient approach for world-consistent video generation.

Conclusion: VGGRPO preserves pretrained model capacity while improving geometric consistency through latent-space geometry-guided reinforcement learning, overcoming limitations of prior methods and extending to dynamic real-world scenes.

Abstract: Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.

[213] Drive-Through 3D Vehicle Exterior Reconstruction via Dynamic-Scene SfM and Distortion-Aware Gaussian Splatting

Nitin Kulkarni, Akhil Devarashetti, Charlie Cluss, Livio Forte, Philip Schneider, Chunming Qiao, Alina Vereshchaka

Main category: cs.CV

TL;DR: End-to-end pipeline for high-fidelity 3D reconstruction of moving vehicles in cluttered dealership environments using specialized camera rig, motion-gating, learned matching, and distortion-aware 3D Gaussian Splatting.

Details

Motivation: Online automotive marketplaces need high-fidelity 3D vehicle models to boost buyer confidence, but current methods struggle with dynamic scenes in cluttered dealership environments with wide-angle distortion, specular paint, and non-rigid wheel rotations.

Method: Four-stage pipeline: 1) SAM 3 instance segmentation with motion-gating to isolate moving vehicles and mask non-rigid wheels; 2) RoMa v2 learned matcher on raw 4K imagery with semantic confidence masks; 3) rig-aware SfM optimization using CAD-derived pose priors; 4) distortion-aware 3D Gaussian Splatting with MCMC densification for reflective surfaces.

Result: Achieved PSNR of 28.66 dB, SSIM of 0.89, and LPIPS of 0.21 on held-out views across 25 real-world vehicles in 10 dealerships, representing 3.85 dB improvement over standard 3D-GS.

Conclusion: The pipeline successfully generates inspection-grade interactive 3D vehicle models in uncontrolled dealership environments without studio infrastructure, overcoming challenges of dynamic scenes, specular surfaces, and lens distortion.

Abstract: High-fidelity 3D reconstruction of vehicle exteriors improves buyer confidence in online automotive marketplaces, but generating these models in cluttered dealership drive-throughs presents severe technical challenges. Unlike static-scene photogrammetry, this setting features a dynamic vehicle moving against heavily cluttered, static backgrounds. This problem is further compounded by wide-angle lens distortion, specular automotive paint, and non-rigid wheel rotations that violate classical epipolar constraints. We propose an end-to-end pipeline utilizing a two-pillar camera rig. First, we resolve dynamic-scene ambiguities by coupling SAM 3 for instance segmentation with motion-gating to cleanly isolate the moving vehicle, explicitly masking out non-rigid wheels to enforce strict epipolar geometry. Second, we extract robust correspondences directly on raw, distorted 4K imagery using the RoMa v2 learned matcher guided by semantic confidence masks. Third, these matches are integrated into a rig-aware SfM optimization that utilizes CAD-derived relative pose priors to eliminate scale drift. Finally, we use a distortion-aware 3D Gaussian Splatting framework (3DGUT) coupled with a stochastic Markov Chain Monte Carlo (MCMC) densification strategy to render reflective surfaces. Evaluations on 25 real-world vehicles across 10 dealerships demonstrate that our full pipeline achieves a PSNR of 28.66 dB, an SSIM of 0.89, and an LPIPS of 0.21 on held-out views, representing a 3.85 dB improvement over standard 3D-GS, delivering inspection-grade interactive 3D models without controlled studio infrastructure.

[214] Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision

Ling Li, Bowen Liu, Zinuo Zhan, Peng Jie, Jianhui Zhong, Kenglun Chang, Zhidong Deng

Main category: cs.CV

TL;DR: EgoPoint-Ground: A large-scale multimodal dataset for egocentric deictic visual grounding using hand-pointing and speech, with a novel SV-CoT framework that improves grounding performance by 11.7%.

Details

Motivation: Traditional visual grounding relies solely on textual descriptions, which struggle with linguistic ambiguity and ignore non-verbal deictic cues like hand-pointing that are prevalent in real-world egocentric interactions.

Method: Introduces EgoPoint-Ground dataset with 15k+ interactive samples featuring hand-target bounding box pairs and dense captions. Proposes SV-CoT framework that reformulates grounding as structured inference using Visual Chain-of-Thought to synergize gestural and linguistic cues.

Result: SV-CoT achieves 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing multimodal intent comprehension. Comprehensive benchmark evaluation of MLLMs and VG architectures.

Conclusion: The work bridges the gap between traditional text-only grounding and real-world multimodal interactions, enabling better comprehension of physical intents through combined gestural and linguistic cues.

Abstract: Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over \textbf{15k} interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an $\textbf{11.7%}$ absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.

[215] Tunable Soft Equivariance with Guarantees

Md Ashiqur Rahman, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh

Main category: cs.CV

TL;DR: A framework for constructing soft equivariant models by projecting model weights into designed subspaces, applicable to any pre-trained architecture with theoretical bounds on equivariance error.

Details

Motivation: Strict equivariance is rarely satisfied in real-world vision data, which can limit model performance. There's a need to control the degree of equivariance rather than enforcing strict constraints.

Method: Proposes a general framework for soft equivariant models by projecting model weights into designed subspaces. The method applies to any pre-trained architecture and provides theoretical bounds on induced equivariance error.

Result: Demonstrated effectiveness on multiple pre-trained backbones (ViT and ResNet) across image classification, semantic segmentation, and human-trajectory prediction tasks. Improved performance while reducing equivariance error on ImageNet benchmark.

Conclusion: The proposed framework enables controlled equivariance in vision models, balancing performance benefits with theoretical guarantees on equivariance error reduction.

Abstract: Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model’s performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.

[216] Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, Xiaofeng Tao

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2405.00181: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.00181&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[217] Zero-Shot Depth from Defocus

Yiming Zuo, Hongyu Wen, Venkat Subramanian, Patrick Chen, Karhan Kayan, Mario Bijelic, Felix Heide, Jia Deng

Main category: cs.CV

TL;DR: FOSSA is a Transformer-based network for zero-shot Depth from Defocus (DfD) that uses stack attention with focus distance embeddings and is trained on synthetic focus stacks from RGBD datasets, achieving 55.7% error reduction on the new ZEDD benchmark.

Details

Motivation: Previous DfD methods overfit to specific datasets, lacking zero-shot generalization. There's a need for better real-world benchmarks and architectures that can generalize across different settings without dataset-specific training.

Method: Proposes FOSSA: a Transformer-based architecture with stack attention layers using focus distance embeddings for efficient cross-stack information exchange. Also introduces a training pipeline that generates synthetic focus stacks from existing RGBD datasets, and creates the ZEDD benchmark with 8.3x more scenes and higher quality data.

Result: Achieves significant improvement over baselines, reducing errors by up to 55.7% on the ZEDD benchmark and other benchmarks. The ZEDD benchmark provides substantially more scenes and higher quality data than previous benchmarks.

Conclusion: FOSSA enables effective zero-shot generalization for DfD through novel architecture design and synthetic training data generation. The ZEDD benchmark advances the field by providing high-quality real-world data for evaluation.

Abstract: Depth from Defocus (DfD) is the task of estimating a dense metric depth map from a focus stack. Unlike previous works overfitting to a certain dataset, this paper focuses on the challenging and practical setting of zero-shot generalization. We first propose a new real-world DfD benchmark ZEDD, which contains 8.3x more scenes and significantly higher quality images and ground-truth depth maps compared to previous benchmarks. We also design a novel network architecture named FOSSA. FOSSA is a Transformer-based architecture with novel designs tailored to the DfD task. The key contribution is a stack attention layer with a focus distance embedding, allowing efficient information exchange across the focus stack. Finally, we develop a new training data pipeline allowing us to utilize existing large-scale RGBD datasets to generate synthetic focus stacks. Experiment results on ZEDD and other benchmarks show a significant improvement over the baselines, reducing errors by up to 55.7%. The ZEDD benchmark is released at https://zedd.cs.princeton.edu. The code and checkpoints are released at https://github.com/princeton-vl/FOSSA.

[218] GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation

Nicolas von Lützow, Barbara Rössle, Katharina Schmid, Matthias Nießner

Main category: cs.CV

TL;DR: GaussianGPT: An autoregressive transformer model that generates 3D scenes by directly producing 3D Gaussians via next-token prediction, offering step-by-step scene construction with capabilities for completion, outpainting, and controllable sampling.

Details

Motivation: To explore an alternative to diffusion/flow-matching approaches for 3D generation by developing a fully autoregressive method that can generate 3D scenes step-by-step with better controllability and context-awareness.

Method: 1) Compress 3D Gaussian primitives into discrete latent tokens using sparse 3D convolutional autoencoder with vector quantization. 2) Serialize tokens and model with causal transformer using 3D rotary positional embeddings. 3) Generate scenes sequentially via next-token prediction.

Result: Enables full 3D scene generation with step-by-step construction, supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons while maintaining compatibility with modern neural rendering pipelines.

Conclusion: Autoregressive transformers offer a complementary paradigm to diffusion methods for controllable and context-aware 3D generation, leveraging compositional inductive biases while operating on explicit 3D representations.

Abstract: Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.

[219] Detailed Geometry and Appearance from Opportunistic Motion

Ryosuke Hirai, Kohei Yamashita, Antoine Guédon, Ryo Kawahara, Vincent Lepetit, Ko Nishino

Main category: cs.CV

TL;DR: Joint optimization of object pose and geometry using 2D Gaussian splatting with alternating minimization, plus factorized appearance modeling, to reconstruct 3D objects from sparse fixed cameras by leveraging opportunistic object motion.

Details

Motivation: Traditional 3D reconstruction from sparse fixed cameras is fundamentally limited by viewpoint constraints. The paper aims to overcome this by exploiting object motion during manipulation, which provides additional virtual viewpoints as static cameras effectively "orbit" the object.

Method: 1) Joint pose and shape optimization using 2D Gaussian splatting with alternating minimization of 6DoF trajectories and primitive parameters. 2) Novel appearance model that factorizes diffuse and specular components with reflected directional probing within spherical harmonics space.

Result: Extensive experiments on synthetic and real-world datasets with extremely sparse viewpoints show significantly more accurate geometry and appearance recovery compared to state-of-the-art baselines.

Conclusion: Object motion during manipulation can be effectively harnessed to break the viewpoint limitations of sparse camera setups, enabling high-quality 3D reconstruction through joint optimization and factorized appearance modeling.

Abstract: Reconstructing 3D geometry and appearance from a sparse set of fixed cameras is a foundational task with broad applications, yet it remains fundamentally constrained by the limited viewpoints. We show that this bound can be broken by exploiting opportunistic object motion: as a person manipulates an object~(e.g., moving a chair or lifting a mug), the static cameras effectively ``orbit’’ the object in its local coordinate frame, providing additional virtual viewpoints. Harnessing this object motion, however, poses two challenges: the tight coupling of object pose and geometry estimation and the complex appearance variations of a moving object under static illumination. We address these by formulating a joint pose and shape optimization using 2D Gaussian splatting with alternating minimization of 6DoF trajectories and primitive parameters, and by introducing a novel appearance model that factorizes diffuse and specular components with reflected directional probing within the spherical harmonics space. Extensive experiments on synthetic and real-world datasets with extremely sparse viewpoints demonstrate that our method recovers significantly more accurate geometry and appearance than state-of-the-art baselines.

[220] INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Edge Case Evaluation

Dianwei Chen, Zifan Zhang, Lei Cheng, Yuchen Liu, Xianfeng Terry Yang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper details

Method: Unable to determine method due to API rate limiting preventing access to paper details

Result: Unable to determine results due to API rate limiting preventing access to paper details

Conclusion: Unable to draw conclusions due to API rate limiting preventing access to paper details

Abstract: Failed to fetch summary for 2502.00262: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.00262&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[221] Hierarchical and Multimodal Data for Daily Activity Understanding

Ghazal Kaviani, Yavuz Yarici, Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib, Mashhour Solh, Ameya Patil

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2504.17696: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.17696&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[222] AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection

Zizhao Chen, Yeqiang Qian, Xiaoxiao Yang, Chunxiang Wang, Ming Yang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2405.12944: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.12944&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[223] Evidence-based diagnostic reasoning with multi-agent copilot for human pathology

Luca L. Weishaupt, Chengkuan Chen, Drew F. K. Williamson, Richard J. Chen, Guillaume Jaume, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y. Lu, Faisal Mahmood

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2506.20964: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.20964&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[224] Masked Training for Robust Arrhythmia Detection from Digitalized Multiple Layout ECG Images

Shanwei Zhang, Deyun Zhang, Yirao Tao, Kexin Wang, Shijia Geng, Jun Li, Qinghao Zhao, Xingpeng Liu, Xingliang Wu, Shengyong Chen, Yuxi Zhou, Shenda Hong

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2508.09165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.09165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[225] Towards Knowledge Guided Pretraining Approaches for Multimodal Foundation Models: Applications in Remote Sensing

Praveen Ravirathinam, Ajitesh Parthasarathy, Ankush Khandelwal, Rahul Ghosh, Vipin Kumar

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2407.19660: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.19660&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[226] High-Fidelity Human Avatars from Laptop Webcams using Edge Compute

Akash Haridas, Imran N. Junejo

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2502.02468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.02468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[227] ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting

Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, Ge Li

Main category: cs.CV

TL;DR: Unable to analyze paper 2509.22225 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract retrieval failed

Method: Cannot determine method as abstract retrieval failed

Result: Cannot determine results as abstract retrieval failed

Conclusion: Cannot draw conclusion as paper content unavailable

Abstract: Failed to fetch summary for 2509.22225: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22225&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[228] Beyond Deepfake vs Real: Facial Deepfake Detection in the Open-Set Paradigm

Nadarasar Bahavan, Sachith Seneviratne, Sanjay Saha, Ken Chen, Sanka Rasnayaka, Saman Halgamuge

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper content.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2503.08055: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.08055&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[229] GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings

Angel Daruna, Nicholas Meegan, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2510.01448: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01448&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[230] Zero-Shot Personalized Camera Motion Control for Image-to-Video Synthesis

Pooja Guhan, Divya Kothandaraman, Geonsun Lee, Tsung-Wei Huang, Guan-Ming Su, Dinesh Manocha

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2504.09472: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.09472&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[231] CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

Raman Dutt, Pedro Sanchez, Yongchen Yao, Steven McDonagh, Sotirios A. Tsaftaris, Timothy Hospedales

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to analyze paper content due to technical fetching error

Abstract: Failed to fetch summary for 2505.10496: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.10496&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[232] CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2505.17006: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17006&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[233] Compositional Image Synthesis with Inference-Time Scaling

Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn

Main category: cs.CV

TL;DR: Paper 2510.24133: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as abstract is unavailable due to arXiv API rate limiting

Method: Method unknown - paper content inaccessible

Result: No results available - abstract retrieval failed

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2510.24133: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24133&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[234] IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

Fei Shen, Yutong Gao, Jian Yu, Xiaoyu Du, Jinhui Tang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2506.01949: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01949&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[235] Gaussian Mapping for Evolving Scenes

Vladimir Yugay, Thies Kersten, Luca Carlone, Theo Gevers, Martin R. Oswald, Lukas Schmid

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2506.06909: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06909&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[236] Binary Verification for Zero-Shot Vision

Rongbin Hu, Jeffrey Liu

Main category: cs.CV

TL;DR: Paper ID 2511.10983 could not be fetched due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to determine conclusion due to access restrictions

Abstract: Failed to fetch summary for 2511.10983: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10983&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[237] Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better

Ruojing Li, Wei An, Yingqian Wang, Xinyi Ying, Yimian Dai, Longguang Wang, Miao Li, Yulan Guo, Li Liu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2506.12766: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.12766&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[238] Any4D: Open-Prompt 4D Generation from Natural Language and Images

Hao Li, Qiao Sun

Main category: cs.CV

TL;DR: Unable to analyze paper 2511.18746 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2511.18746: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18746&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[239] Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring

Qizhi Xie, Kun Yuan, Yunpeng Qu, Jiachao Gong, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval failure

Method: Unable to determine method due to retrieval failure

Result: Unable to determine results due to retrieval failure

Conclusion: Unable to draw conclusions due to retrieval failure

Abstract: Failed to fetch summary for 2506.21011: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.21011&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[240] TimeSenCLIP: A Time Series Vision-Language Model for Remote Sensing

Pallavi Jain, Diego Marcos, Dino Ienco, Roberto Interdonato, Tristan Berchoux

Main category: cs.CV

TL;DR: Unable to analyze paper 2508.11919 due to HTTP 429 error when fetching the abstract from arXiv API

Details

Motivation: Cannot determine motivation as the paper content could not be retrieved due to rate limiting

Method: Cannot determine method as the paper content could not be retrieved

Result: Cannot determine results as the paper content could not be retrieved

Conclusion: Cannot draw conclusions as the paper content could not be retrieved

Abstract: Failed to fetch summary for 2508.11919: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.11919&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[241] Particulate: Feed-Forward 3D Object Articulation

Ruining Li, Yuxin Yao, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.11798: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.11798&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[242] CLARITY: Medical World Model for Guiding Treatment Decisions by Modeling Context-Aware Disease Trajectories in Latent Space

Tianxingjian Ding, Yuanhao Zou, Chen Chen, Mubarak Shah, Yu Tian

Main category: cs.CV

TL;DR: Failed to fetch paper summary - HTTP 429 error (rate limiting) prevents accessing arXiv API for paper 2512.08029

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Unable to determine method due to API rate limiting preventing access to paper content

Result: Unable to determine results due to API rate limiting preventing access to paper content

Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content

Abstract: Failed to fetch summary for 2512.08029: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08029&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[243] Clinical Metadata Guided Limited-Angle CT Image Reconstruction

Yu Shi, Shuyi Fan, Changsheng Fang, Shuo Han, Haodong Li, Li Zhou, Bahareh Morovati, Dayang Wang, Hengyong Yu

Main category: cs.CV

TL;DR: Unable to analyze paper 2509.01752 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation without access to paper abstract

Method: Cannot determine method without access to paper abstract

Result: Cannot determine results without access to paper abstract

Conclusion: Cannot draw conclusions without access to paper abstract

Abstract: Failed to fetch summary for 2509.01752: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.01752&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[244] ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models

Zhaoyang Li, Zhan Ling, Yuchen Zhou, Litian Gong, Erdem Bıyık, Hao Su

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2509.15695: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.15695&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[245] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi

Main category: cs.CV

TL;DR: Paper 2510.02898: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot draw conclusions due to inability to access paper content

Abstract: Failed to fetch summary for 2510.02898: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02898&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[246] A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering

Yuanhao Zou, Shengji Jin, Andong Deng, Youpeng Zhao, Jun Wang, Chen Chen

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when trying to access arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2510.04428: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04428&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[247] RoAD Benchmark: How LiDAR Models Fail under Coupled Domain Shifts and Label Evolution

Subeen Lee, Siyeong Lee, Namil Kim, Jaesik Choi

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2601.07855: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.07855&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[248] Revisiting Diffusion Model Predictions Through Dimensionality

Qing Jin, Chaoyang Wang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2601.21419 exists but cannot be retrieved from arXiv API at this time.

Details

Motivation: Cannot determine motivation due to inability to access paper content.

Method: Cannot determine method due to inability to access paper content.

Result: Cannot determine results due to inability to access paper content.

Conclusion: Cannot determine conclusion due to inability to access paper content.

Abstract: Failed to fetch summary for 2601.21419: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21419&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[249] SSeg: Active Sparse Point-Label Augmentation for Semantic Segmentation

Cesar Borja, Carlos Plou, Ruben Martinez-Cantin, Ana C. Murillo

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Unable to determine motivation as paper content could not be retrieved

Method: Unknown - paper details not accessible

Result: No results available due to API rate limiting

Conclusion: Cannot analyze paper due to technical limitations in accessing content

Abstract: Failed to fetch summary for 2510.10163: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10163&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[250] IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment

Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, Shuicheng Yan

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2510.11647: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.11647&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[251] Learning Neural Parametric 3D Breast Shape Models for Metrical Surface Reconstruction From Monocular RGB Videos

Maximilian Weiherer, Antonia von Riedheim, Vanessa Brébant, Bernhard Egger, Christoph Palm

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2510.13540: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13540&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[252] CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

Donghee Lee, Rui Cai, Zhe Zhao

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2601.13622: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13622&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[253] Attention Misses Visual Risk: Risk-Adaptive Steering for Multimodal Safety Alignment

Jonghyun Park, Minhyuk Seo, Chaewon Yeo, Jonghyun Choi

Main category: cs.CV

TL;DR: Paper 2510.13698: Unable to fetch abstract due to HTTP 429 rate limiting error from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable due to API rate limiting

Method: Cannot determine method as abstract is unavailable due to API rate limiting

Result: Cannot determine results as abstract is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as abstract is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2510.13698: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13698&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[254] EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Weiyu Sun, Liangliang Chen, Yongnuo Cai, Huiru Xie, Yi Zeng, Ying Zhang

Main category: cs.CV

TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: No method information available - arXiv API returned HTTP 429 (Too Many Requests) error

Result: No results available - Could not retrieve paper abstract due to technical limitations

Conclusion: Cannot analyze paper content due to API rate limiting preventing access to the abstract

Abstract: Failed to fetch summary for 2602.00095: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00095&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[255] CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts

Kieu-Anh Truong Thi, Huy-Hieu Pham, Duc-Trong Le

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2510.14273: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14273&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[256] PISCO: Precise Video Instance Insertion with Sparse Control

Xiangbo Gao, Renjie Li, Xinghao Chen, Yuheng Wu, Suofei Feng, Qing Yin, Zhengzhong Tu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to draw conclusions due to failed API request

Abstract: Failed to fetch summary for 2602.08277: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.08277&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[257] BeetleFlow: An Integrative Deep Learning Pipeline for Beetle Image Processing

Fangxun Liu, S M Rayeed, Samuel Stevens, Alyson East, Cheng Hsuan Chiang, Colin Lee, Daniel Yi, Junke Yang, Tejas Naik, Ziyi Wang, Connor Kilrain, Elijah H Buckwalter, Jiacheng Hou, Saul Ibaven Bueno, Shuheng Wang, Xinyue Ma, Yifan Liu, Zhiyuan Tao, Ziheng Zhang, Eric Sokol, Michael Belitz, Sydne Record, Charles V. Stewart, Wei-Lun Chao

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2511.00255: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.00255&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[258] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild

Felix B. Mueller, Jan F. Meier, Timo Lueddecke, Richard Vogg, Roger L. Freixanet, Valentin Hassler, Tiffany Bosshard, Elif Karakoc, William J. O’Hearn, Sofia M. Pereira, Sandro Sehner, Kaja Wierucka, Judith Burkart, Claudia Fichtel, Julia Fischer, Alexander Gail, Catherine Hobaiter, Julia Ostner, Liran Samuni, Oliver Schülke, Neda Shahidi, Erin G. Wessling, Alexander S. Ecker

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2511.09675: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09675&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[259] ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

Anzhe Cheng, Shukai Duan, Shixuan Li, Chenzhong Yin, Mingxi Cheng, Heng Ping, Tamoghna Chattopadhyay, Sophia I Thomopoulos, Shahin Nazarian, Paul Thompson, Paul Bogdan

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.10971: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10971&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[260] The Effective Depth Paradox: Evaluating the Relationship between Architectural Topology and Trainability in Deep CNNs

Manfred M. Fischer, Joshua Pitts

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2602.13298 suggests it’s from February 2025, but no abstract or content is available for analysis.

Details

Motivation: Cannot determine motivation due to lack of paper content. The HTTP 429 error indicates the arXiv API is rate limiting requests, preventing access to the paper's abstract and details.

Method: No method information available. The paper content could not be retrieved from arXiv due to API rate limiting (HTTP 429 error).

Result: No results available. The paper analysis cannot be completed because the abstract and content are inaccessible due to rate limiting constraints.

Conclusion: Unable to analyze paper 2602.13298. The arXiv API rate limiting prevents access to the paper’s content, making it impossible to assess relevance to multimodal large language models with audio/vision focus.

Abstract: Failed to fetch summary for 2602.13298: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13298&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[261] UniSER: A Foundation Model for Unified Soft Effects Removal

Jingdong Zhang, Lingzhi Zhang, Qing Liu, Mang Tik Chiu, Connelly Barnes, Yizhou Wang, Haoran You, Xiaoyang Liu, Yuqian Zhou, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Xin Li, Wenping Wang, Xiaohang Zhan

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.14183: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.14183&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[262] DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

Main category: cs.CV

TL;DR: Unable to analyze paper 2602.18846 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2602.18846: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18846&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Pierrick Bournez, Luca Savant Aira, Thibaud Ehret, Gabriele Facciolo

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2511.16542: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16542&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[264] PedaCo-Gen: Scaffolding Pedagogical Agency in Human-AI Collaborative Video Authoring

Injun Baek, Yearim Kim, Nojun Kwak

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.19623: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19623&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[265] Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features

Jingyi Xu, Meisong Zheng, Ying Chen, Minglang Qiao, Xin Deng, Mai Xu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2511.16928: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16928&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[266] Towards single-shot coherent imaging via overlap-free ptychography

Oliver Hoidn, Aashwin Mishra, Steven Henke, Albert Vong, Matthew Seaberg

Main category: cs.CV

TL;DR: Failed to fetch summary for arXiv ID 2602.21361 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2602.21361: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21361&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[267] Versatile Recompression-Aware Perceptual Image Super-Resolution

Mingwei He, Tongda Xu, Xingtong Ge, Ming Sun, Chao Zhou, Yan Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API

Details

Motivation: Unable to determine motivation as paper content could not be retrieved due to technical limitations in accessing the arXiv API

Method: No method information available - paper content retrieval failed due to HTTP 429 error (Too Many Requests)

Result: No results available - the arXiv API request was rate-limited and returned HTTP 429 status

Conclusion: Cannot analyze paper content due to technical limitations in accessing the arXiv database; need to try again later or use alternative methods to retrieve the paper

Abstract: Failed to fetch summary for 2511.18090: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18090&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[268] Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI

Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang, Yifei Ma, Li Guo, Yiming Li, Jing Zhang, Chen Feng

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.20620: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20620&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[269] The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics

Xiangbo Gao, Mingyang Wu, Siyuan Yang, Jiongze Yu, Pardis Taghavi, Fangzhou Lin, Zhengzhong Tu

Main category: cs.CV

TL;DR: Visual Chronometer: A method to estimate Physical Frames Per Second (PhyFPS) from video motion to address temporal ambiguity in generative video models, improving motion speed realism.

Details

Motivation: Current generative video models produce visually smooth kinematics but lack reliable temporal grounding, leading to chronometric hallucination - ambiguous, unstable, and uncontrollable physical motion speeds due to training on videos with different real-world speeds standardized to uniform frame rates.

Method: Proposes Visual Chronometer, a predictor that recovers Physical Frames Per Second (PhyFPS) directly from visual dynamics of input videos. Trained via controlled temporal resampling to estimate true temporal scale implied by motion itself, bypassing unreliable metadata.

Result: Established two benchmarks (PhyFPS-Bench-Real and PhyFPS-Bench-Gen) revealing severe PhyFPS misalignment and temporal instability in state-of-the-art video generators. Demonstrated that applying PhyFPS corrections significantly improves human-perceived naturalness of AI-generated videos.

Conclusion: Visual Chronometer addresses critical temporal grounding issues in video generation, providing a method to estimate true physical motion speeds from visual dynamics, which enhances the realism and physical consistency of generated videos.

Abstract: While recent generative video models have achieved remarkable visual realism and are being explored as world models, true physical simulation requires mastering both space and time. Current models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to ground these motions in a consistent, real-world time scale. This temporal ambiguity stems from the common practice of indiscriminately training on videos with vastly different real-world speeds, forcing them into standardized frame rates. This leads to what we term chronometric hallucination: generated sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this, we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS) directly from the visual dynamics of an input video. Trained via controlled temporal resampling, our method estimates the true temporal scale implied by the motion itself, bypassing unreliable metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS-Bench-Real and PhyFPS-Bench-Gen. Our evaluations reveal a harsh reality: state-of-the-art video generators suffer from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying PhyFPS corrections significantly improves the human-perceived naturalness of AI-generated videos. Our project page is https://xiangbogaobarry.github.io/Visual_Chronometer/.

[270] Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Haishan Wang, Mohammad Hassan Vali, Arno Solin

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.00850: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.00850&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[271] MLLM-based Textual Explanations for Face Comparison

Redwan Sony, Anil K Jain, Arun Ross

Main category: cs.CV

TL;DR: Unable to analyze paper 2603.16629 due to HTTP 429 error when fetching summary from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2603.16629: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16629&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[272] UniPart: Part-Level 3D Generation with Unified 3D Geom-Seg Latents

Xufan He, Yushuang Wu, Xiaoyang Guo, Chongjie Ye, Jiaqing Zhou, Tianlei Hu, Xiaoguang Han, Dong Du

Main category: cs.CV

TL;DR: Failed to fetch summary for paper 2512.09435 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed summary fetch

Method: Unable to determine method due to failed summary fetch

Result: Unable to determine results due to failed summary fetch

Conclusion: Unable to determine conclusion due to failed summary fetch

Abstract: Failed to fetch summary for 2512.09435: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09435&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[273] Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

Han Liu, Bogdan Georgescu, Yanbo Zhang, Youngjin Yoo, Michael Baumgartner, Riqiang Gao, Jianing Wang, Gengyan Zhao, Eli Gibson, Dorin Comaniciu, Sasa Grbic

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2512.12887: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12887&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[274] Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models

Pablo Ruiz-Ponce, Sergio Escalera, José García-Rodríguez, Jiankang Deng, Rolandos Alexandros Potamias

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2512.19692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.19692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[275] ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation

Chia-Ming Lee, Yu-Fan Lin, Jin-Hui Jiang, Yu-Jou Hsiao, Chih-Chung Hsu, Yu-Lun Liu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2601.17468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[276] PokeFusion Attention: A Lightweight Cross-Attention Mechanism for Style-Conditioned Image Generation

Jingbang Tang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2602.03220: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03220&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[277] Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields

Thanh-Hai Le, Hoang-Hau Tran, Trong-Nghia Vu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.25008: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25008&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[278] Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning

Dongki Jung, Jaehoon Choi, Adil Qureshi, Somi Jeong, Dinesh Manocha, Suyong Yeon

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about the paper due to access limitations

Abstract: Failed to fetch summary for 2602.05321: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05321&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[279] Adaptive Multi-Scale Channel-Spatial Attention Aggregation Framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired

Qi He, XiangXiang Wang, Jingtao Zhang, Yongbin Yu, Hongxiang Chu, Manping Fan, JingYe Cai, Zhenglin Yang

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when querying arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to data retrieval failure

Abstract: Failed to fetch summary for 2602.16385: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16385&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[280] IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping

Tingyang Xiao, Liu Liu, Wei Feng, Zhengyu Zou, Xiaolin Zhou, Wei Sui, Hao Li, Dingwen Zhang, Zhizhong Su

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2602.18709 suggests it’s from February 26, 2025, but no content available for analysis.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2602.18709: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18709&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[281] ORION: ORthonormal Text Encoding for Universal VLM AdaptatION

Omprakash Chakraborty, Jose Dolz, Ismail Ben Ayed

Main category: cs.CV

TL;DR: Paper 2602.19530: Failed to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access restrictions preventing abstract retrieval

Method: Cannot analyze method without access to paper content

Result: No results available due to technical access limitations

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.19530: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19530&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[282] Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction

Noé Artru, Rukhshanda Hussain, Emeline Got, Alexandre Messier, David B. Lindell, Abdallah Dib

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2602.21100: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21100&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[283] Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments

Shuang Song, Debao Huang, Deyan Deng, Haolin Xiong, Yang Tang, Yajie Zhao, Rongjun Qin

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.22025: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22025&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[284] OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

Junuk Cha, Jihyeon Kim, Han-Mu Park

Main category: cs.CV

TL;DR: Paper 2602.22949: Could not fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing abstract

Method: Unable to determine method due to missing abstract

Result: Unable to determine results due to missing abstract

Conclusion: Unable to determine conclusion due to missing abstract

Abstract: Failed to fetch summary for 2602.22949: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22949&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[285] Leveraging Arbitrary Data Sources for AI-Generated Image Detection Without Sacrificing Generalization

Qinghui He, Haifeng Zhang, Xiuli Bi, Bo Liu, Chi-Man Pun, Bin Xiao

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method as paper content is unavailable due to HTTP 429 error

Result: No results available due to technical issue with arXiv API access

Conclusion: Cannot draw conclusions about paper content due to API rate limiting preventing access

Abstract: Failed to fetch summary for 2603.00717: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00717&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[286] From Pixels to Patches: Pooling Strategies for Earth Embeddings

Isaac Corley, Caleb Robinson, Inbal Becker-Reshef, Juan M. Lavista Ferres

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when trying to access arXiv API for paper ID 2603.02080

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2603.02080: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02080&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[287] MIBURI: Towards Expressive Interactive Gesture Synthesis

M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.03282: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03282&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[288] Making Training-Free Diffusion Segmentors Scale with the Generative Power

Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to draw conclusions due to fetch failure

Abstract: Failed to fetch summary for 2603.06178: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06178&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[289] From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification

Ke Zhang, Xiangchen Zhao, Yunjie Tian, Jiayu Zheng, Vishal M. Patel, Di Fu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.10300: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10300&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[290] HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks

Xiaoyu Li, Yuhang Liu, Xuanshuo Kang, Zheng Luo, Fangqi Lou, Xiaohua Wu, Zihan Xiong

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.12760: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12760&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Xi Chen, Maojun Zhang, Yu Liu, Shen Yan

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.13352: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13352&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[292] UE5-Forest: A Photorealistic Synthetic Stereo Dataset for UAV Forestry Depth Estimation

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.15304: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15304&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[293] ModTrack: Sensor-Agnostic Multi-View Tracking via Identity-Informed PHD Filtering with Covariance Propagation

Aditya Iyer, Jack Roberts, Nora Ayanian

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2603.15812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[294] EPOFusion: Exposure aware Progressive Optimization Method for Infrared and Visible Image Fusion

Zhiwei Wang, Yayu Zheng, Defeng He, Li Zhao, Xiaoqin Zhang, Yuxing Li, Edmund Y. Lam

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The arXiv API request for paper ID 2603.16130 was blocked.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2603.16130: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16130&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[295] Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors

Ryosuke Hori, Jyun-Ting Song, Zhengyi Luo, Jinkun Cao, Soyong Shin, Hideo Saito, Kris Kitani

Main category: cs.CV

TL;DR: Failed to fetch summary for paper 2603.16233 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to lack of accessible paper content

Method: Unable to determine method due to lack of accessible paper content

Result: Unable to determine results due to lack of accessible paper content

Conclusion: Unable to determine conclusion due to lack of accessible paper content

Abstract: Failed to fetch summary for 2603.16233: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16233&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[296] WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation

Muhammad Aamir, Naoya Muramatsu, Sangyun Shin, Matthew Wijers, Jia-Xing Zhong, Xinyu Hou, Amir Patel, Andrew Loveridge, Andrew Markham

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in fetching paper content

Method: Unable to determine method due to technical error in fetching paper content

Result: Unable to determine results due to technical error in fetching paper content

Conclusion: Unable to draw conclusions due to technical error in fetching paper content

Abstract: Failed to fetch summary for 2603.16816: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16816&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[297] MM-OVSeg:Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

Yimin Wei, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation due to failed paper fetch

Method: Cannot determine method due to failed paper fetch

Result: Cannot determine results due to failed paper fetch

Conclusion: Cannot determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2603.17528: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17528&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Mohammad Robaitul Islam Bhuiyan, Sheethal Bhat, Melika Qahqaie, Tri-Thien Nguyen, Paula Andrea Perez-Toro, Tomas Arias-Vergara, Andreas Maier

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.17576: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17576&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[299] EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

Longfei Liu, Yongjie Hou, Yang Li, Qirui Wang, Youyang Sha, Yongjun Yu, Yinzhi Wang, Peizhe Ru, Xuanlong Yu, Xi Shen

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to draw conclusions due to access error

Abstract: Failed to fetch summary for 2603.18739: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18739&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[300] LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

Stanislaw Szymanowicz, Minghao Chen, Jianyuan Wang, Christian Rupprecht, Andrea Vedaldi

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.20176: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20176&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[301] PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization

Xiaoya Cheng, Long Wang, Yan Liu, Xinyi Liu, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.20778: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20778&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[302] CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models

Nan Zhou, Huiqun Wang, Yaoyan Zheng, Di Huang

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2603.21077: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21077&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[303] GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning

Jiayin Sun, Caixia Sun, Boyu Yang, Hailin Li, Xiao Chen, Yi Zhang, Errui Ding, Liang Li, Chao Deng, Junlan Feng

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2603.22687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[304] ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to draw conclusions due to access error

Abstract: Failed to fetch summary for 2603.23376: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23376&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[305] Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

Gengluo Li, Pengyuan Lyu, Chengquan Zhang, Huawen Shen, Liang Wu, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2603.23885: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23885&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[306] Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting

Junoh Lee, Junmyeong Lee, Yeon-Ji Song, Inhwan Bae, Jisu Shin, Hae-Gon Jeon, Jin-Hwa Kim

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2603.24994 suggests it’s from March 2025, but no content available for analysis.

Details

Motivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot draw conclusions without access to paper content.

Abstract: Failed to fetch summary for 2603.24994: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24994&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[307] GeoNDC: A Queryable Neural Data Cube for Planetary-Scale Earth Observation

Jianbo Qi, Mengyao Li, Baogui Jiang, Yidan Chen, Xihan Mu, Qiao Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2603.25037: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25037&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[308] CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

Jeannie Chung, Hanna Jang, Ingyeong Yang, Uiwon Hwang, Jaehyeong Sim

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.25383: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25383&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Bin Chen, Wenbo Yu, Qinshan Zhang, Tianqu Zhuang, Hao Wu, Yong Jiang, Shu-Tao Xia

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2411.15702: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.15702&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[310] Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2512.05812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[311] CrisiSense-RAG: Crisis Sensing Multimodal Retrieval-Augmented Generation for Rapid Disaster Impact Assessment

Yiming Xiao, Kai Yin, Ali Mostafavi

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2602.13239 suggests it’s from February 2025, but no content available for analysis.

Details

Motivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot draw conclusions without access to paper content.

Abstract: Failed to fetch summary for 2602.13239: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13239&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[312] Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

Wenxuan Song, Jiayi Chen, Shuai Chen, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan Wang, Haoang Li

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in fetching paper content

Method: Unable to determine method due to technical error in fetching paper content

Result: Unable to determine results due to technical error in fetching paper content

Conclusion: Unable to draw conclusions due to technical error in fetching paper content

Abstract: Failed to fetch summary for 2603.25661: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25661&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.AI

[313] BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments

Yuxuan Li, Yi Lin, Peng Wang, Shiming Liu, Xuetao Wei

Main category: cs.AI

TL;DR: BeSafe-Bench (BSB) is a comprehensive benchmark for evaluating behavioral safety risks of situated multimodal agents across web, mobile, and embodied domains, revealing significant safety-performance tradeoffs in current systems.

Details

Motivation: Current safety evaluations for Large Multimodal Models (LMMs) as autonomous agents are inadequate - they use low-fidelity environments, simulated APIs, or narrow tasks, lacking comprehensive assessment of behavioral safety risks in functional environments.

Method: Created BeSafe-Bench covering four domains (Web, Mobile, Embodied VLM, Embodied VLA) with functional environments. Constructed diverse instruction space by augmenting tasks with nine safety-critical risk categories. Used hybrid evaluation combining rule-based checks with LLM-as-a-judge reasoning to assess real environmental impacts.

Result: Evaluation of 13 popular agents shows concerning results: best-performing agent completes <40% of tasks while fully adhering to safety constraints. Strong task performance frequently coincides with severe safety violations, revealing significant safety-performance tradeoffs.

Conclusion: There is an urgent need for improved safety alignment before deploying agentic systems in real-world settings. The benchmark exposes critical gaps in current agent safety and provides a comprehensive evaluation framework.

Abstract: The rapid evolution of Large Multimodal Models (LMMs) has enabled agents to perform complex digital and physical tasks, yet their deployment as autonomous decision-makers introduces substantial unintentional behavioral safety risks. However, the absence of a comprehensive safety benchmark remains a major bottleneck, as existing evaluations rely on low-fidelity environments, simulated APIs, or narrowly scoped tasks. To address this gap, we present BeSafe-Bench (BSB), a benchmark for exposing behavioral safety risks of situated agents in functional environments, covering four representative domains: Web, Mobile, Embodied VLM, and Embodied VLA. Using functional environments, we construct a diverse instruction space by augmenting tasks with nine categories of safety-critical risks, and adopt a hybrid evaluation framework that combines rule-based checks with LLM-as-a-judge reasoning to assess real environmental impacts. Evaluating 13 popular agents reveals a concerning trend: even the best-performing agent completes fewer than 40% of tasks while fully adhering to safety constraints, and strong task performance frequently coincides with severe safety violations. These findings underscore the urgent need for improved safety alignment before deploying agentic systems in real-world settings.

[314] AutoB2G: A Large Language Model-Driven Agentic Framework For Automated Building-Grid Co-Simulation

Borui Zhang, Nariman Mahdavi, Subbu Sethuvenkatraman, Shuang Ao, Flora Salim

Main category: cs.AI

TL;DR: AutoB2G: An automated building-grid co-simulation framework using LLMs to generate simulation workflows from natural language descriptions, extending CityLearn V2 for building-to-grid interaction analysis.

Details

Motivation: Existing building simulation environments focus on building-side metrics and lack systematic grid-level impact evaluation, while requiring manual configuration and programming expertise. There's a need for automated simulation workflows that can coordinate building-to-grid interactions.

Method: Extends CityLearn V2 to support building-to-grid interactions and uses LLM-based SOCIA framework to automatically generate, execute, and refine simulators from natural language descriptions. Constructs a codebase organized as a DAG to represent module dependencies and guide LLM retrieval of executable paths.

Result: AutoB2G effectively enables automated simulator implementations and coordinates building-to-grid interactions to improve grid-side performance metrics.

Conclusion: The framework demonstrates successful automation of simulation workflows using LLMs, bridging the gap between building operations and grid-level impacts through natural language interfaces.

Abstract: The growing availability of building operational data motivates the use of reinforcement learning (RL), which can learn control policies directly from data and cope with the complexity and uncertainty of large-scale building clusters. However, most existing simulation environments prioritize building-side performance metrics and lack systematic evaluation of grid-level impacts, while their experimental workflows still rely heavily on manual configuration and substantial programming expertise. Therefore, this paper proposes AutoB2G, an automated building-grid co-simulation framework that completes the entire simulation workflow solely based on natural-language task descriptions. The framework extends CityLearn V2 to support Building-to-Grid (B2G) interaction and adopts the large language model (LLM)-based SOCIA (Simulation Orchestration for Computational Intelligence with Agents) framework to automatically generate, execute, and iteratively refine the simulator. As LLMs lack prior knowledge of the implementation context of simulation functions, a codebase covering simulation configurations and functional modules is constructed and organized as a directed acyclic graph (DAG) to explicitly represent module dependencies and execution order, guiding the LLM to retrieve a complete executable path. Experimental results demonstrate that AutoB2G can effectively enable automated simulator implementations, coordinating B2G interactions to improve grid-side performance metrics.

[315] Semi-Automated Knowledge Engineering and Process Mapping for Total Airport Management

Darryl Teo, Adharsha Sam, Chuan Shen Marcus Koh, Rakesh Nagi, Nuno Antunes Ribeiro

Main category: cs.AI

TL;DR: A framework combining symbolic knowledge engineering with LLMs to build domain-specific knowledge graphs for airport operations, ensuring traceability and verifiability.

Details

Motivation: Airport operations documentation is complex due to technical terminology, regulations, and fragmented communication, creating data silos that hinder Total Airport Management. There's a need to bridge generative AI outputs with operational transparency requirements.

Method: Dual-stage fusion of symbolic Knowledge Engineering (KE) and generative LLMs, using expert-curated KE structures to guide LLM prompts for knowledge triple discovery. Combines probabilistic discovery with deterministic source anchoring for traceability.

Result: Document-level processing improves recovery of non-linear procedural dependencies compared to segment-based inference, contrary to prior observations of long-context degradation. Framework enables synthesis of complex operational workflows from unstructured text.

Conclusion: The framework successfully bridges “black-box” generative outputs with operational transparency needs, providing traceable knowledge extraction for airport management while maintaining high-fidelity provenance.

Abstract: Documentation of airport operations is inherently complex due to extensive technical terminology, rigorous regulations, proprietary regional information, and fragmented communication across multiple stakeholders. The resulting data silos and semantic inconsistencies present a significant impediment to the Total Airport Management (TAM) initiative. This paper presents a methodological framework for constructing a domain-grounded, machine-readable Knowledge Graph (KG) through a dual-stage fusion of symbolic Knowledge Engineering (KE) and generative Large Language Models (LLMs). The framework employs a scaffolded fusion strategy in which expert-curated KE structures guide LLM prompts to facilitate the discovery of semantically aligned knowledge triples. We evaluate this methodology on the Google LangExtract library and investigate the impact of context window utilization by comparing localized segment-based inference with document-level processing. Contrary to prior empirical observations of long-context degradation in LLMs, document-level processing improves the recovery of non-linear procedural dependencies. To ensure the high-fidelity provenance required in airport operations, the proposed framework fuses a probabilistic model for discovery and a deterministic algorithm for anchoring every extraction to its ground source. This ensures absolute traceability and verifiability, bridging the gap between “black-box” generative outputs and the transparency required for operational tooling. Finally, we introduce an automated framework that operationalizes this pipeline to synthesize complex operational workflows from unstructured textual corpora.

[316] GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

Rui Xie, Zhi Gao, Chenrui Shi, Zirui Shang, Lu Chen, Qing Li

Main category: cs.AI

TL;DR: GUIDE is a training-free framework that reduces GUI agent domain bias by extracting expertise from tutorial videos through automated annotation and injecting planning/grounding knowledge into agents.

Details

Motivation: Large vision-language models have strong general GUI understanding but lack domain-specific expertise for particular applications, limiting real-world task performance due to insufficient exposure to domain-specific software operation data.

Method: Two innovations: 1) Subtitle-driven Video-RAG pipeline with progressive three-stage retrieval (domain classification, topic extraction, relevance matching) to find relevant tutorial videos; 2) Fully automated annotation pipeline using inverse dynamics paradigm with consecutive keyframes and UI element detection fed into VLMs to infer planning and grounding knowledge.

Result: Extensive experiments on OSWorld show GUIDE consistently yields over 5% improvements and reduces execution steps without modifying model parameters or architecture, validating it as an architecture-agnostic enhancement.

Conclusion: GUIDE effectively bridges GUI agent domain bias as a plug-and-play component for both multi-agent systems and single-model agents, demonstrating strong generality and performance gains.

Abstract: Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias - they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real-world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle-driven Video-RAG pipeline unlocks video semantics through subtitle analysis, performing progressive three-stage retrieval - domain classification, topic extraction, and relevance matching - to identify task-relevant tutorial videos. Second, a fully automated annotation pipeline built on an inverse dynamics paradigm feeds consecutive keyframes enhanced with UI element detection into VLMs, inferring the required planning and grounding knowledge that are injected into the agent’s corresponding modules to address both manifestations of domain bias. Extensive experiments on OSWorld demonstrate GUIDE’s generality as a plug-and-play component for both multi-agent systems and single-model agents. It consistently yields over 5% improvements and reduces execution steps - without modifying any model parameters or architecture - validating GUIDE as an architecture-agnostic enhancement to bridge GUI agent domain bias.

[317] AIRA_2: Overcoming Bottlenecks in AI Research Agents

Karen Hambardzumyan, Nicolas Baldwin, Edan Toledo, Rishi Hazra, Michael Kuchnik, Bassel Al Omari, Thomas Simon Foster, Anton Protopopov, Jean-Christophe Gagnon-Audet, Ishita Mediratta, Kelvin Niu, Michael Shvartsman, Alisia Lupidi, Alexis Audran-Reiss, Parth Pathak, Tatiana Shavrina, Despoina Magka, Hela Momand, Derek Dunfield, Nicola Cancedda, Pontus Stenetorp, Carole-Jean Wu, Jakob Nicolaus Foerster, Yoram Bachrach, Martin Josifoski

Main category: cs.AI

TL;DR: AIRA$_2$ addresses three structural bottlenecks in AI research agents through asynchronous multi-GPU execution, Hidden Consistent Evaluation protocol, and ReAct agents, achieving state-of-the-art performance on MLE-bench-30.

Details

Motivation: Existing AI research agents suffer from three key bottlenecks: 1) synchronous single-GPU execution limiting throughput, 2) generalization gap from validation-based selection causing performance degradation over time, and 3) limited capability of fixed single-turn LLM operators capping search performance.

Method: AIRA$_2$ introduces three architectural innovations: 1) asynchronous multi-GPU worker pool for linear throughput scaling, 2) Hidden Consistent Evaluation protocol for reliable evaluation signals, and 3) ReAct agents that dynamically scope actions and debug interactively.

Result: On MLE-bench-30, AIRA$_2$ achieves 71.8% mean Percentile Rank at 24 hours (surpassing previous best of 69.9%) and improves to 76.0% at 72 hours. Ablations show each component is necessary, and previous “overfitting” was due to evaluation noise, not data memorization.

Conclusion: AIRA$_2$ successfully addresses structural bottlenecks in AI research agents through its three-component architecture, demonstrating improved performance and scalability while revealing that prior overfitting concerns stemmed from evaluation noise rather than true memorization.

Abstract: Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap where validation-based selection causes performance to degrade over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators imposes a ceiling on search performance. We introduce AIRA$_2$, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that increases experiment throughput linearly; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA$_2$ achieves a mean Percentile Rank of 71.8% at 24 hours - surpassing the previous best of 69.9% - and steadily improves to 76.0% at 72 hours. Ablation studies reveal that each component is necessary and that the “overfitting” reported in prior work was driven by evaluation noise rather than true data memorization.

[318] CADSmith: Multi-Agent CAD Generation with Programmatic Geometric Validation

Jesse Barkley, Rumi Loghmani, Amir Barati Farimani

Main category: cs.AI

TL;DR: CADSmith is a multi-agent pipeline for text-to-CAD generation that uses iterative refinement with geometric validation and vision-language feedback to improve accuracy and reliability of LLM-generated CAD models.

Details

Motivation: Existing text-to-CAD methods lack proper geometric verification - they either operate in single pass without validation or rely on lossy visual feedback that cannot resolve dimensional errors, leading to inaccurate CAD models.

Method: Multi-agent pipeline that generates CadQuery code from natural language, then uses nested correction loops: inner loop for execution errors and outer loop with programmatic geometric validation using OpenCASCADE kernel measurements combined with vision-language model assessment.

Result: Achieves 100% execution rate (up from 95%), improves median F1 score from 0.9707 to 0.9846, median IoU from 0.8085 to 0.9629, and reduces mean Chamfer Distance from 28.37 to 0.74 on 100-prompt benchmark.

Conclusion: Closed-loop refinement with programmatic geometric feedback substantially improves quality and reliability of LLM-generated CAD models, demonstrating the value of combining exact geometric measurements with holistic visual assessment.

Abstract: Existing methods for text-to-CAD generation either operate in a single pass with no geometric verification or rely on lossy visual feedback that cannot resolve dimensional errors. We present CADSmith, a multi-agent pipeline that generates CadQuery code from natural language. It then undergoes an iterative refinement process through two nested correction loops: an inner loop that resolves execution errors and an outer loop grounded in programmatic geometric validation. The outer loop combines exact measurements from the OpenCASCADE kernel (bounding box dimensions, volume, solid validity) with holistic visual assessment from an independent vision-language model Judge. This provides both the numerical precision and the high-level shape awareness needed to converge on the correct geometry. The system uses retrieval-augmented generation over API documentation rather than fine-tuning, maintaining a current database as the underlying CAD library evolves. We evaluate on a custom benchmark of 100 prompts in three difficulty tiers (T1 through T3) with three ablation configurations. Against a zero-shot baseline, CADSmith achieves a 100% execution rate (up from 95%), improves the median F1 score from 0.9707 to 0.9846, the median IoU from 0.8085 to 0.9629, and reduces the mean Chamfer Distance from 28.37 to 0.74, demonstrating that closed-loop refinement with programmatic geometric feedback substantially improves the quality and reliability of LLM-generated CAD models.

[319] Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng, Yudong Zhang, Mulei Zhang, Yang Chen, Shuyue Hu, Zhenfei Yin, Chen Zhang, Lei Bai

Main category: cs.AI

TL;DR: PAPO integrates process-level evaluation into policy optimization through decoupled advantage normalization to address limitations of outcome-only and process-only reward models in reasoning tasks.

Details

Motivation: Existing reward designs for reasoning tasks have two key limitations: Outcome reward models (ORM) only evaluate final-answer correctness and treat all correct responses identically regardless of reasoning quality, while process reward models (PRM) can cause reward hacking where models exploit verbosity to inflate scores while accuracy collapses.

Method: PAPO integrates process-level evaluation into GRPO through decoupled advantage normalization. It composes advantage from two components: Aout (derived from ORM and normalized over all responses) and Aproc (derived from a rubric-based PRM and normalized exclusively among correct responses). This decoupled design ensures Aout anchors training on correctness while Aproc differentiates reasoning quality without distorting the outcome signal.

Result: Experiments across multiple model scales and six benchmarks show PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.

Conclusion: PAPO effectively addresses limitations of both outcome-only and process-only reward models by combining their strengths through decoupled advantage normalization, leading to better performance on reasoning tasks.

Abstract: We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component Aout, derived from ORM and normalized over all responses, and a process component Aproc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that Aout anchors training on correctness while Aproc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs.\ 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.

[320] Scale-Adaptive Balancing of Exploration and Exploitation in Classical Planning

Stephen Wissow, Masataro Asai

Main category: cs.AI

TL;DR: GreedyUCT-Normal improves MCTS/THTS for planning by using UCB1-Normal bandit algorithm that accounts for reward variance, outperforming existing methods in finding plans with fewer node expansions.

Details

Motivation: Planning algorithms using MCTS/THTS with UCB1 bandits don't satisfy theoretical requirements since UCB1 assumes fixed bounded reward distributions, which doesn't hold in heuristic search for classical planning. The core issue is UCB1's inability to adapt to different reward scales.

Method: Proposes GreedyUCT-Normal, a MCTS/THTS algorithm that uses UCB1-Normal bandit algorithm which handles distributions with different scales by considering reward variance, making it suitable for agile classical planning.

Result: GreedyUCT-Normal outperforms Greedy Best First Search and existing MCTS/THTS-based algorithms (GreedyUCT, GreedyUCT*), finding more plans with fewer node expansions.

Conclusion: A more detailed theoretical understanding of Multi-Armed Bandit literature can improve planning algorithms based on MCTS/THTS by addressing reward distribution issues, leading to better performance in classical planning.

Abstract: Balancing exploration and exploitation has been an important problem in both game tree search and automated planning. However, while the problem has been extensively analyzed within the Multi-Armed Bandit (MAB) literature, the planning community has had limited success when attempting to apply those results. We show that a more detailed theoretical understanding of MAB literature helps improve existing planning algorithms that are based on Monte Carlo Tree Search (MCTS) / Trial Based Heuristic Tree Search (THTS). In particular, THTS uses UCB1 MAB algorithms in an ad hoc manner, as UCB1’s theoretical requirement of fixed bounded support reward distributions is not satisfied within heuristic search for classical planning. The core issue lies in UCB1’s lack of adaptations to the different scales of the rewards. We propose GreedyUCT-Normal, a MCTS/THTS algorithm with UCB1-Normal bandit for agile classical planning, which handles distributions with different scales by taking the reward variance into consideration, and resulted in an improved algorithmic performance (more plans found with less node expansions) that outperforms Greedy Best First Search and existing MCTS/THTS-based algorithms (GreedyUCT,GreedyUCT*).

[321] Extreme Value Monte Carlo Tree Search for Classical Planning

Masataro Asai, Stephen Wissow

Main category: cs.AI

TL;DR: UCB1-Uniform: A new bandit algorithm for classical planning using extreme value theory to improve Monte Carlo Tree Search performance with better theoretical foundations.

Details

Motivation: Previous MCTS+MAB approaches for classical planning had limitations: UCB1 (designed for bounded rewards) performs poorly with unbounded cost-to-go estimates, Gaussian MABs oversimplify reward distributions, and Full Bellman backup lacks theoretical justification.

Method: Uses Peaks-Over-Threshold Extreme Value Theory to model cost-to-go estimates more accurately, proposes UCB1-Uniform bandit algorithm with formal regret bound, and applies it to classical planning with MCTS.

Result: Formally proves regret bound for UCB1-Uniform and empirically demonstrates improved performance in classical planning tasks compared to previous approaches.

Conclusion: UCB1-Uniform provides a theoretically grounded and empirically effective bandit algorithm for MCTS in classical planning, addressing limitations of previous reward modeling approaches.

Abstract: Despite being successful in board games and reinforcement learning (RL), Monte Carlo Tree Search (MCTS) combined with Multi Armed Bandits (MABs) has seen limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai 2024) showed that UCB1, designed for bounded rewards, does not perform well as applied to cost-to-go estimates in classical planning, which are unbounded in $\R$, and showed improved performance using a Gaussian reward MAB instead. This paper further sharpens our understanding of ideal bandits for planning tasks. Existing work has two issues: first, Gaussian MABs under-specify the support of cost-to-go estimates as $(-\infty,\infty)$, which we can narrow down. Second, Full Bellman backup (Schulte and Keller 2014), which backpropagates sample max/min, lacks theoretical justification. We use \emph{Peaks-Over-Threashold Extreme Value Theory} to resolve both issues at once, and propose a new bandit algorithm (UCB1-Uniform). We formally prove its regret bound and empirically demonstrate its performance in classical planning.

[322] ReMe: Scaffolding Personalized Cognitive Training via Controllable LLM-Mediated Conversations

Zilong Wang, Nan Chen, Luna K. Qiu, Ling Yue, Geli Guo, Yang Ou, Shiqi Jiang, Yuqing Yang, Lili Qiu

Main category: cs.AI

TL;DR: ReMe is a web-based cognitive training framework that uses controllable LLM-mediated conversations to create personalized dialogue-based word games and memory recall activities for older adults.

Details

Motivation: Current computerized cognitive training programs are rigid and difficult to personalize, while LLMs offer natural interaction but lack the controlled structure needed for cognitive training. There's a need for scalable, engaging cognitive interventions for aging populations.

Method: ReMe uses a modular Puzzle Engine with structured templates and constraint rules to create reusable puzzle groups. It integrates personal life logs for episodic-memory practice through Life Recall activities with guided retrieval and progressive cues.

Result: A community pilot with 32 adults aged 50+ showed initial feasibility signals for the framework.

Conclusion: ReMe addresses the rigidity of conventional cognitive training while providing conversational controllability through LLM mediation, offering a promising approach for scalable cognitive interventions.

Abstract: Global aging calls for scalable and engaging cognitive interventions. Computerized cognitive training (CCT) is a promising non-pharmacological approach, yet many unsupervised programs rely on rigid, hand-authored puzzles that are difficult to personalize and can hinder adherence. Large language models (LLMs) offer more natural interaction, but their open-ended generation complicates the controlled task structure required for cognitive training. We present ReMe, a web-based framework that scaffolds cognitive training through controllable LLM-mediated conversations, addressing both rigidity in conventional CCT content and the need for conversational controllability. ReMe features a modular Puzzle Engine that represents training activities as reusable puzzle groups specified by structured templates and constraint rules, enabling rapid development of dialogue-based word games and personalized tasks grounded in user context. By integrating personal life logs, ReMe supports Life Recall activities for episodic-memory practice through guided retrieval and progressive cues. A community pilot with 32 adults aged 50+ provides initial feasibility signals.

[323] Efficient Energy-Optimal Path Planning for Electric Vehicles Considering Vehicle Dynamics

Saman Ahmadi, Guido Tack, Daniel Harabor, Philip Kilby, Mahdi Jalili

Main category: cs.AI

TL;DR: This paper explores energy-optimal path planning for electric vehicles, focusing on how vehicle dynamics affect energy model accuracy and introducing methods to accelerate pathfinding with negative energy costs from regenerative braking.

Details

Motivation: The rapid adoption of electric vehicles requires accurate energy-aware routing, especially when charging infrastructure is limited. Current energy models often fail to account for vehicle dynamics, leading to inaccurate energy estimates that can make planned routes infeasible in reality.

Method: The authors develop a data-driven energy model that incorporates key vehicle dynamics parameters. They also introduce two novel online reweighting and energy heuristic functions to accelerate path planning, specifically addressing challenges with negative energy costs from regenerative braking.

Result: Extensive experiments on real-world transport networks demonstrate that the proposed method significantly improves computational efficiency for energy-optimal pathfinding for electric vehicles.

Conclusion: Accurate modeling of vehicle dynamics is crucial for feasible energy-optimal routing of electric vehicles, and the proposed approach enables more efficient path planning suitable for real-time applications.

Abstract: The rapid adoption of electric vehicles (EVs) in modern transport systems has made energy-aware routing a critical task in their successful integration, especially within large-scale transport networks. In cases where an EV’s remaining energy is limited and charging locations are not easily accessible, some destinations may only be reachable through an energy-optimal path: a route that consumes less energy than all other alternatives. The feasibility of such energy-efficient paths depends heavily on the accuracy of the energy model used for planning, and thus failing to account for vehicle dynamics can lead to inaccurate energy estimates, rendering some planned routes infeasible in reality. This paper explores the impact of vehicle dynamics on energy-optimal path planning for EVs. We first investigate how energy model accuracy influences energy-optimal pathfinding and, consequently, feasibility of planned trips, using a novel data-driven model that incorporates key vehicle dynamics parameters into energy calculations. Additionally, we introduce two novel online reweighting and energy heuristic functions that accelerate path planning with negative energy costs arise due to regenerative braking, making our approach well-suited for real-time applications. Extensive experiments on real-world transport networks demonstrate that our method significantly improves both the computational efficiency of energy-optimal pathfinding for EVs.

[324] Deontic Temporal Logic for Formal Verification of AI Ethics

Priya T. V., Shrisha Rao

Main category: cs.AI

TL;DR: A formal deontic logic framework for specifying and verifying ethical behavior in AI systems, with temporal operators for reasoning over time, applied to COMPAS and loan prediction systems.

Details

Motivation: Address the need for formal methods in AI ethics to specify and verify ethical behavior as AI systems become more ubiquitous and influential, providing rigorous tools for evaluating fairness and explainability.

Method: Proposes a formalization based on deontic logic with temporal operators, introduces axioms and theorems for ethical requirements (fairness, explainability), and uses automated theorem proving to verify real-world AI systems.

Result: Formal verification reveals both COMPAS and loan prediction systems fail to fulfill key ethical properties related to fairness and non-discrimination, demonstrating the framework’s effectiveness in identifying ethical issues.

Conclusion: The deontic logic formalization provides a rigorous approach for evaluating AI ethics, successfully identifying ethical violations in real-world systems and offering a foundation for formal verification of ethical AI behavior.

Abstract: Ensuring ethical behavior in Artificial Intelligence (AI) systems amidst their increasing ubiquity and influence is a major concern the world over. The use of formal methods in AI ethics is a possible crucial approach for specifying and verifying the ethical behavior of AI systems. This paper proposes a formalization based on deontic logic to define and evaluate the ethical behavior of AI systems, focusing on system-level specifications, contributing to this important goal. It introduces axioms and theorems to capture ethical requirements related to fairness and explainability. The formalization incorporates temporal operators to reason about the ethical behavior of AI systems over time. The authors evaluate the effectiveness of this formalization by assessing the ethics of the real-world COMPAS and loan prediction AI systems. Various ethical properties of the COMPAS and loan prediction systems are encoded using deontic logical formulas, allowing the use of an automated theorem prover to verify whether these systems satisfy the defined properties. The formal verification reveals that both systems fail to fulfill certain key ethical properties related to fairness and non-discrimination, demonstrating the effectiveness of the proposed formalization in identifying potential ethical issues in real-world AI applications.

[325] ProbGuard: Probabilistic Runtime Monitoring for LLM Agent Safety

Haoyu Wang, Christopher M. Poskitt, Jiali Wei, Jun Sun

Main category: cs.AI

TL;DR: ProbGuard: A proactive runtime monitoring framework for LLM agents that predicts safety violations through probabilistic risk modeling using Markov chains, enabling early intervention before unsafe behavior occurs.

Details

Motivation: LLM agents operating in safety-critical domains (robotics, virtual assistants, web automation) have stochastic decision-making that introduces unpredictable safety risks. Existing monitoring frameworks are reactive, detecting violations only when unsafe behavior is imminent or has occurred, which limits their ability to handle long-horizon dependencies and prevent accidents.

Method: ProbGuard abstracts agent executions into symbolic states and learns a Discrete-Time Markov Chain (DTMC) from execution traces to model behavioral dynamics. At runtime, it estimates the probability that future executions will reach unsafe states and triggers interventions when risk exceeds user-defined thresholds. The framework incorporates semantic validity constraints in the abstraction and provides PAC-style guarantees on the learned model.

Result: In autonomous driving scenarios, ProbGuard consistently predicts traffic law violations and collisions in advance, with warnings up to 38.66 seconds ahead of occurrence. In embodied household agent tasks, it reduces unsafe behavior by up to 65.37% while preserving up to 80.4% task completion. The framework introduces minimal runtime overhead.

Conclusion: ProbGuard provides an effective proactive safety monitoring solution for LLM agents that can anticipate and prevent safety violations in advance, outperforming reactive approaches. It offers practical implementation as an extensible open-source runtime monitor integrated with LangChain.

Abstract: Large Language Model (LLM) agents increasingly operate across domains such as robotics, virtual assistants, and web automation. However, their stochastic decision-making introduces safety risks that are difficult to anticipate during execution. Existing runtime monitoring frameworks, such as AgentSpec, primarily rely on reactive safety rules that detect violations only when unsafe behavior is imminent or has already occurred, limiting their ability to handle long-horizon dependencies. We present ProbGuard, a proactive runtime monitoring framework for LLM agents that anticipates safety violations through probabilistic risk prediction. ProbGuard abstracts agent executions into symbolic states and learns a Discrete-Time Markov Chain (DTMC) from execution traces to model behavioral dynamics. At runtime, the monitor estimates the probability that future executions will reach unsafe states and triggers interventions when this risk exceeds a user-defined threshold. To improve robustness, ProbGuard incorporates semantic validity constraints in the abstraction and provides PAC-style guarantees on the learned model under standard assumptions. We evaluate ProbGuard in two safety-critical domains: autonomous driving and embodied household agents. Across evaluated scenarios, ProbGuard consistently predicts traffic law violations and collisions in advance, with warnings up to 38.66 seconds ahead of occurrence. In embodied agent tasks, ProbGuard reduces unsafe behavior by up to 65.37% while preserving up to 80.4% task completion. ProbGuard is implemented as an extensible open-source runtime monitor integrated with the LangChain agent framework and introduces minimal runtime overhead.

[326] Humanline: Online Alignment as Perceptual Loss

Sijia Liu, Niklas Muennighoff, Kawin Ethayarajh

Main category: cs.AI

TL;DR: The paper proposes that online alignment methods outperform offline ones due to better matching human perceptual biases about probability, and introduces “humanline” variants that incorporate these biases into offline training to achieve similar performance with faster training.

Details

Motivation: To explain why online alignment methods (like GRPO) generally outperform offline methods (like DPO) and to develop more efficient training approaches that maintain performance while being faster and cheaper.

Method: Draws on prospect theory from behavioral economics to analyze human perceptual biases about probability. Proposes “humanline” variants of existing objectives (DPO/KTO/GRPO) that explicitly incorporate perceptual distortions of probability, allowing training with offline off-policy data while mimicking human perception.

Result: Humanline variants trained with offline off-policy data can match the performance of their online counterparts on both verifiable and unverifiable tasks while running up to 6x faster.

Conclusion: The online/offline dichotomy is incidental to maximizing human utility; by incorporating human perceptual biases into training objectives, offline methods can achieve similar performance to online methods with significant efficiency gains.

Abstract: Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO) – but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and PPO/GRPO-style clipping – originally introduced to just stabilize training – recovers a perceptual bias in how humans perceive probability. In this sense, PPO/GRPO act as perceptual losses already. Our theory further suggests that the online/offline dichotomy is itself incidental to maximizing human utility, since we can achieve the same effect by selectively training on any data in a manner that mimics human perception, rather than restricting ourselves to online on-policy data. Doing so would allow us to post-train more quickly, cheaply, and flexibly without sacrificing performance. To this end, we propose a design pattern that explicitly incorporates perceptual distortions of probability into objectives like DPO/KTO/GRPO, creating humanline variants of them. Surprisingly, we find that these humanline variants, even when trained with offline off-policy data, can match the performance of their online counterparts (on both verifiable and unverifiable tasks) while running up to 6x faster.

Yunlong Deng, Boyang Sun, Yan Li, Lingjing Kong, Zeyu Tang, Kun Zhang, Guangyi Chen

Main category: cs.AI

TL;DR: SR² framework improves reasoning tasks by modeling them as selection mechanisms with latent logical concepts, using reflective representation learning and dependency refinement to enhance reasoning accuracy.

Details

Motivation: Reasoning tasks remain challenging for LLMs despite extensive training. The paper aims to understand reasoning from a causal perspective, viewing reasoning as a selection mechanism where high-level logical concepts operate on observations.

Method: SR² framework with three modules: 1) Reflective representation learning to estimate latent variables, 2) Dependency self-refinement to learn dense dependencies among latent representations, and 3) Periodic intermediate alignment to maintain consistency.

Result: Significant gains in reasoning accuracy, achieving over 10% improvement with 8× fewer parameters on Sudoku and Maze tasks compared to recent advances.

Conclusion: Modeling reasoning as a selection mechanism with latent logical concepts and incorporating feedback through the SR² framework effectively addresses reasoning challenges and improves performance.

Abstract: Due to their inherent complexity, reasoning tasks have long been regarded as rigorous benchmarks for assessing the capabilities of machine learning models, especially large language models (LLMs). Although humans can solve these tasks with ease, existing models, even after extensive pre-training and post-training at scale, still fail to perform reasoning reliably. In this paper, we revisit reasoning tasks from a causal perspective, seeking to understand their behavior in latent space and to offer insights for addressing their challenges. Specifically, we cast reasoning tasks as a selection mechanism, in which high-level logical concepts function as selection operators on the given observations, such as, identifying the correct answer in a math problem or filling the appropriate entry in Sudoku. We emphasize two key properties of this formulation that shed light on the difficulty of reasoning tasks. First, the latent space exceeds the observation space in complexity, even when the correct answer is fully determined by the observed input. Second, the latent variables, corresponding to logical thought, are densely structured and exhibit strong dependencies. Building on this formulation, we introduce a framework, called SR$^2$, that incorporates the estimated latent variables as feedback into the selection mechanism, thereby facilitating the learning of dense dependencies among latent representations. The framework consists of three key modules: reflective representation learning, dependency self-refinement, and periodic intermediate alignment. Experimentally, we show that our approach yields significant gains in reasoning accuracy, for example, attaining over 10$%$ improvement in performance with 8$\times$ fewer parameters on the Sudoku and Maze tasks over the recent advances.

[328] Shared Spatial Memory Through Predictive Coding

Zhengru Fang, Yu Guo, Yuang Zhang, Haonan An, Wenbo Ding, Yuguang Fang

Main category: cs.AI

TL;DR: Multi-agent predictive coding framework learns bandwidth-efficient communication and social place cells for spatial coordination under limited bandwidth constraints.

Details

Motivation: Addressing the challenge of constructing consistent shared spatial memory in multi-agent systems with partial observability and limited bandwidth, which often causes coordination failures.

Method: Multi-agent predictive coding framework with information bottleneck objective that learns who/what/when to communicate, grid-cell-like metric spatial coding from self-supervised motion prediction, hierarchical RL policy for active exploration.

Result: Exceptional resilience to bandwidth constraints: success degrades gracefully from 73.5% to 64.4% as bandwidth shrinks from 128 to 4 bits/step, while baseline collapses from 67.6% to 28.6%. Learned social place cells and efficient communication.

Conclusion: Establishes theoretically principled and biologically plausible basis for complex social representations emerging from unified predictive drive, leading to collective intelligence.

Abstract: Constructing a consistent shared spatial memory is a critical challenge in multi-agent systems, where partial observability and limited bandwidth often lead to catastrophic failures in coordination. We introduce a multi-agent predictive coding framework that formulates coordination as the minimization of mutual uncertainty among agents. Through an information bottleneck objective, this framework prompts agents to learn not only who and what to communicate but also when. At the foundation of this framework lies a grid-cell-like metric as internal spatial coding for self-localization, emerging spontaneously from self-supervised motion prediction. Building upon this internal spatial code, agents gradually develop a bandwidth-efficient communication mechanism and specialized neural populations that encode partners’ locations-an artificial analogue of hippocampal social place cells (SPCs). These social representations are further utilized by a hierarchical reinforcement learning policy that actively explores to reduce joint uncertainty. On the Memory-Maze benchmark, our approach shows exceptional resilience to bandwidth constraints: success degrades gracefully from 73.5% to 64.4% as bandwidth shrinks from 128 to 4 bits/step, whereas a full-broadcast baseline collapses from 67.6% to 28.6%. Our findings establish a theoretically principled and biologically plausible basis for how complex social representations emerge from a unified predictive drive, leading to collective intelligence.

[329] HeaRT: A Hierarchical Circuit Reasoning Tree-Based Agentic Framework for AMS Design Optimization

Souradip Poddar, Chia-Tung Ho, Ziming Wei, Weidong Cao, Haoxing Ren, David Z. Pan

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2511.19669: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19669&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Jua Han, Jaeyoon Seo, Jungbin Min, Sieun Choi, Huichan Seo, Jihie Kim, Jean Oh

Main category: cs.AI

TL;DR: Unable to analyze paper 2601.05529 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable

Method: Cannot determine method as abstract is unavailable

Result: Cannot determine results as abstract is unavailable

Conclusion: Cannot determine conclusion as abstract is unavailable

Abstract: Failed to fetch summary for 2601.05529: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05529&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[331] AtomMem : Learnable Dynamic Agentic Memory with Atomic Memory Operation

Yupeng Huo, Yaxi Lu, Zhong Zhang, Haotian Chen, Yankai Lin

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2601.08323: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08323&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[332] See, Symbolize, Act: Grounding VLMs with Spatial Representations for Better Gameplay

Ashish Baghel, Paras Chopra

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when trying to access arXiv API for paper ID 2603.11601

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2603.11601: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11601&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[333] AIDABench: AI Data Analytics Benchmark

Yibo Yang, Fei Lei, Yixuan Sun, Yantao Zeng, Chengguang Lv, Jiancao Hong, Jiaojiao Tian, Tianyu Qiu, Xin Wang, Yanbing Chen, Yanjie Li, Zheng Pan, Xiaochen Zhou, Guanzhou Chen, Haoran Lv, Yuning Xu, Yue Ou, Haodong Liu, Shiqi He, Anya Jia, Yulei Xin, Huan Wu, Liang Liu, Jiaye Ge, Jianxin Dong, Dahua Lin, Wenxiu Sun

Main category: cs.AI

TL;DR: AIDABench is a comprehensive benchmark for evaluating AI systems on complex, end-to-end document analysis tasks across question answering, data visualization, and file generation dimensions, revealing significant challenges for current models.

Details

Motivation: Existing benchmarks focus on isolated capabilities or simplified scenarios, failing to capture end-to-end task effectiveness required in practical settings, creating a need for rigorous evaluation standards for AI-driven document understanding tools.

Method: Created AIDABench with 600+ diverse document analysis tasks across three core dimensions (question answering, data visualization, file generation) grounded in realistic scenarios with heterogeneous data types (spreadsheets, databases, financial reports, operational records). Evaluated 11 state-of-the-art models including both proprietary and open-source families.

Result: Complex real-world data analytics tasks remain challenging for current AI systems, with best-performing model achieving only 59.43% pass-at-1. Tasks are sufficiently difficult that human experts require 1-2 hours per question even with AI assistance.

Conclusion: AIDABench provides a principled reference for enterprise procurement, tool selection, and model optimization, highlighting significant gaps in current AI capabilities for complex document understanding and analysis tasks.

Abstract: As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation standards has grown increasingly urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios, failing to capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end-to-end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks in AIDABench are sufficiently challenging that even human experts require 1-2 hours per question when assisted by AI tools, underscoring the benchmark’s difficulty and real-world complexity. We evaluate 11 state-of-the-art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results reveal that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model achieving only 59.43% pass-at-1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang-lyx/AIDABench.

[334] Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

Zhiyu Ni, Zheng Liang, Liangcheng Song, Chenrui Cao, Xian Zhang, Alberto Sangiovanni-Vincentelli, Pierluigi Nuzzo

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.17233: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17233&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[335] Governance-Aware Vector Subscriptions for Multi-Agent Knowledge Ecosystems

Steven Johnson

Main category: cs.AI

TL;DR: Unable to analyze paper 2603.20833 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2603.20833: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20833&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[336] Environment Maps: Structured Environmental Representations for Long-Horizon Agents

Yenchia Feng, Chirag Sharma, Karime Maamari

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: No method information available - arXiv API returned HTTP 429 error

Result: No results available - failed to retrieve paper content

Conclusion: Paper analysis cannot be completed due to technical limitations in accessing the content

Abstract: Failed to fetch summary for 2603.23610: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23610&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[337] The Competence Shadow: Theory and Bounds of AI Assistance in Safety Engineering

Umair Siddique

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.25197: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25197&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[338] Error Estimation for Physics-informed Neural Networks Approximating Semilinear Wave Equations

Beatrice Lorenz, Aras Bacho, Gitta Kutyniok

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2402.07153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.07153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[339] Complexity-Aware Deep Symbolic Regression with Robust Risk-Seeking Policy Gradients

Zachary Bastiani, Robert M. Kirby, Jacob Hochhalter, Shandian Zhe

Main category: cs.AI

TL;DR: Unable to analyze paper 2406.06751 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting error

Method: Cannot determine method as abstract is unavailable due to rate limiting error

Result: Cannot determine results as abstract is unavailable due to rate limiting error

Conclusion: Cannot draw conclusions as paper content is inaccessible due to HTTP 429 rate limiting error from arXiv API

Abstract: Failed to fetch summary for 2406.06751: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.06751&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[340] CGRA4ML: A Hardware/Software Framework to Implement Neural Networks for Scientific Edge Computing

G Abarajithan, Zhenghua Ma, Ravidu Munasinghe, Francesco Restuccia, Ryan Kastner

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2408.15561: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.15561&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[341] Biogeochemistry-Informed Neural Network (BINN) for Improving Accuracy of Model Prediction and Scientific Understanding of Soil Organic Carbon

Haodi Xu, Joshua Fan, Feng Tao, Lifen Jiang, Fengqi You, Benjamin Z. Houlton, Ying Sun, Carla P. Gomes, Yiqi Luo

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusion due to missing paper content

Abstract: Failed to fetch summary for 2502.00672: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.00672&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[342] The Accountability Paradox: How Platform API Restrictions Undermine AI Transparency Mandates

Florian A.D. Burnat, Brittany I. Davidson

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2505.11577: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11577&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[343] PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning

Ruheng Wang, Hang Zhang, Trieu Nguyen, Shasha Feng, Hao-Wei Pang, Xiang Yu, Li Xiao, Peter Zhiping Zhang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2508.14765: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.14765&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[344] Multi-Dimensional Autoscaling of Stream Processing Services on Edge Devices

Boris Sedlak, Philipp Raith, Andrea Morichetta, Víctor Casamayor Pujol, Schahram Dustdar

Main category: cs.AI

TL;DR: Paper 2510.06882: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing abstract

Method: Cannot determine method due to missing abstract

Result: Cannot determine results due to missing abstract

Conclusion: Cannot determine conclusion due to missing abstract

Abstract: Failed to fetch summary for 2510.06882: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06882&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Ilona van der Linden, Sahana Kumar, Arnav Dixit, Aadi Sudan, Smruthi Danda, David C. Anastasiu, Kai Lukoff

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2510.21011: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.21011&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[346] Causal Graph Neural Networks for Healthcare

Munib Mesinovic, Max Buhlan, Tingting Zhu

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2511.02531: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.02531&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[347] Route Experts by Sequence, not by Token

Tiansheng Wen, Yifei Wang, Aosong Feng, Long Ma, Xinyang Liu, Yifan Wang, Lixuan Guo, Bo Chen, Stefanie Jegelka, Chenyu You

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper details

Method: Cannot analyze method without access to paper content

Result: No results available due to technical access issues

Conclusion: Cannot draw conclusions without paper content

Abstract: Failed to fetch summary for 2511.06494: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06494&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[348] Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning

Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv, Jun Zhu, Shouzhi Chen, Minghao Yang, Yu Wang, Jiayang Wu, Yidong Song, Yaokun Li, Jiehui Huang, Dawei Huang, Zhi Song, Jianhua Yao

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation as paper content could not be retrieved

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2511.21075: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21075&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[349] SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to retry with different approach or wait.

Details

Motivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot draw conclusions without access to paper content.

Abstract: Failed to fetch summary for 2512.14080: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14080&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[350] PathFinder: Advancing Path Loss Prediction for Single-to-Multi-Transmitter Scenario

Zhijie Zhong, Zhiwen Yu, Pengyu Li, Jianming Lv, C. L. Philip Chen, Min Chen

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2512.14150: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14150&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[351] The Dual-State Architecture for Reliable LLM Agents

Matthew Thompson

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.20660: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20660&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[352] Incorporating Q&A Nuggets into Retrieval-Augmented Generation

Laura Dietz, Bryan Li, Gabrielle Liu, Jia-Huei Ju, Eugene Yang, Dawn Lawrie, William Walden, James Mayfield

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2601.13222: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13222&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[353] Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?

Laura Dietz, Bryan Li, Eugene Yang, Dawn Lawrie, William Walden, James Mayfield

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2601.13227: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13227&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[354] Golden Layers and Where to Find Them: Improved Knowledge Editing for Large Language Models Via Layer Gradient Analysis

Shrestha Datta, Hongfu Liu, Anshuman Chhabra

Main category: cs.AI

TL;DR: Paper ID 2602.20207 could not be fetched due to HTTP 429 error (rate limiting), so analysis cannot be performed

Details

Motivation: Unable to determine motivation as paper content is not available

Method: Unable to determine method as paper content is not available

Result: Unable to determine results as paper content is not available

Conclusion: Unable to determine conclusion as paper content is not available

Abstract: Failed to fetch summary for 2602.20207: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20207&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[355] SWE Context Bench: A Benchmark for Context Learning in Coding

Jared Zhu, Minhao Hu, Junde Wu

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2602.08316: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.08316&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[356] Administrative Law’s Fourth Settlement: AI and the Capability-Accountability Trap

Nicholas Caputo

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.09678: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09678&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[357] AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems

Zhaohui Geoffrey Wang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.14688: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14688&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[358] mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2603.21606: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21606&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[359] SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints

Andrew Tremante, Yang He, Rocky Klopfenstein, Yuepeng Wang, Nina Narodytska, Haoze Wu

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API

Details

Motivation: Unable to determine motivation as paper content could not be retrieved due to API rate limiting

Method: N/A - Paper content not accessible

Result: N/A - No results available due to retrieval failure

Conclusion: Unable to analyze paper due to technical limitations in accessing the content

Abstract: Failed to fetch summary for 2603.04334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[360] Hybrid Associative Memories

Leon Lufkin, Tomás Figliolia, Beren Millidge, Kamesh Krishnamurthy

Main category: cs.AI

TL;DR: Unable to analyze paper 2603.22325 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation without access to paper abstract

Method: Cannot determine method without access to paper abstract

Result: Cannot determine results without access to paper abstract

Conclusion: Cannot determine conclusion without access to paper abstract

Abstract: Failed to fetch summary for 2603.22325: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22325&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[361] Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Zenan Li, Ziran Yang, Deyuan He, Haoyu Zhao, Andrew Zhao, Shange Tang, Kaiyu Yang, Aarti Gupta, Zhendong Su, Chi Jin

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.19329: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19329&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[362] Modernizing Amdahl’s Law: How AI Scaling Laws Shape Computer Architecture

Chien-Ping Lu

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.20654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[363] AI Generalisation Gap In Comorbid Sleep Disorder Staging

Saswata Bose, Suvadeep Maiti, Shivam Kumar Sharma, Mythirayee S, Tapabrata Chakraborti, Srijitesh Rajendran, Raju S. Bapi

Main category: cs.AI

TL;DR: Paper ID 2603.23582 - Could not fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing abstract

Method: Unable to determine method due to missing abstract

Result: Unable to determine results due to missing abstract

Conclusion: Unable to draw conclusions due to missing abstract

Abstract: Failed to fetch summary for 2603.23582: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23582&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[364] SM-Net: Learning a Continuous Spectral Manifold from Multiple Stellar Libraries

Omar Anwar, Aaron S. G. Robotham, Luca Cortese, Kevin Vinsen

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.23899: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23899&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[365] Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients

Michael Hardy, Joshua Gilbert, Benjamin Domingue

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.24999: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24999&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[366] Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

Eyal Hadad, Mordechai Guri

Main category: cs.AI

TL;DR: Failed to fetch paper summary - HTTP 429 error (rate limiting) prevents accessing arXiv API for paper 2603.25403

Details

Motivation: Unable to determine motivation due to technical error accessing the paper

Method: Unable to determine method due to technical error accessing the paper

Result: Unable to determine results due to technical error accessing the paper

Conclusion: Unable to determine conclusion due to technical error accessing the paper

Abstract: Failed to fetch summary for 2603.25403: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25403&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.SD

[367] Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

Kyudan Jung, Jihwan Kim, Soyoon Kim, Jeongoon Kim, Jaegul Choo, Cheonbok Park

Main category: cs.SD

TL;DR: A robust open-source data processing pipeline for full-duplex speech language models to address the scarcity of high-quality multi-speaker conversational data and challenges in natural dialogue dynamics.

Details

Motivation: The shift from text-based LLMs to Speech Language Models (SLMs) creates demand for full-duplex systems capable of real-time natural human-computer interaction, but development is constrained by scarcity of high-quality multi-speaker conversational data and challenges in handling complex dialogue dynamics like overlapping speech and back-channeling.

Method: Presents a robust and scalable open-source data processing pipeline specifically designed for full-duplex models to address issues with existing resources that are predominantly single-speaker or limited in volume, and to overcome problems with standard processing pipelines that suffer from diarization errors and ASR hallucinations.

Result: The paper introduces a pipeline that bridges the gap in data processing for full-duplex SLMs, though specific quantitative results are not provided in the abstract.

Conclusion: The proposed pipeline addresses critical data processing challenges for developing full-duplex speech language models capable of natural human-computer interaction.

Abstract: As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping and back-channeling remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex model.

[368] Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods

Xuanru Zhou, Yiwen Shao, Wei-Cheng Tseng, Dong Yu

Main category: cs.SD

Details

Result: Experiments suggest that data quality and coverage are the primary drivers of performance, while the choice of objective dictates downstream task specialization.

[369] A Human-Inspired Decoupled Architecture for Efficient Audio Representation Learning

Harunori Kawano, Takeshi Sasaki

Main category: cs.SD

TL;DR: HEAR is a decoupled audio architecture inspired by human cognition that separates local acoustic feature extraction from global semantic integration, achieving high efficiency with only 15M parameters while maintaining competitive performance on audio classification tasks.

Details

Motivation: Standard Transformers for audio SSL have excessive parameters and quadratic computational costs that limit deployment on resource-constrained devices. The authors aim to create a more efficient architecture inspired by human cognitive processes.

Method: Proposes HEAR, a decoupled architecture with two modules: Acoustic Model for local feature extraction and Task Model for global semantic integration. Uses an Acoustic Tokenizer trained via knowledge distillation to enable robust Masked Audio Modeling (MAM).

Result: HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of conventional foundation models (85M-94M parameters). Despite high efficiency, achieves competitive performance across diverse audio classification benchmarks.

Conclusion: HEAR demonstrates that decoupled architectures inspired by human cognition can achieve high efficiency without sacrificing performance, enabling deployment of audio SSL models on resource-constrained devices.

Abstract: While self-supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource-constrained devices. To address this bottleneck, we propose HEAR (Human-inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of conventional foundation models (which typically require 85M-94M parameters). Despite this high efficiency, HEAR achieves highly competitive performance across diverse audio classification benchmarks. The code and pre-trained models are available at https://github.com/HarunoriKawano/HEAR

[370] LLaDA-TTS: Unifying Speech Synthesis and Zero-Shot Editing via Masked Diffusion Modeling

Xiaoyu Fan, Huizhi Xie, Wei Zou, Yunzhang Chen

Main category: cs.SD

TL;DR: LLaDA-TTS replaces autoregressive LLM-based TTS with masked diffusion for parallel generation, achieving 2x speedup while maintaining quality and enabling zero-shot speech editing.

Details

Motivation: Current LLM-based TTS systems use autoregressive decoding which requires N sequential steps for N speech tokens, causing high inference latency that scales with sequence length. The authors aim to decouple inference latency from sequence length while maintaining speech quality.

Method: Replace autoregressive LLM with masked diffusion model using bidirectional attention. Transfer pretrained AR checkpoint to masked diffusion paradigm via bidirectional attention with only 50 hours of fine-tuning data. Modify only attention mask and objective while keeping the same architecture.

Result: At 64 steps, achieves 0.98% CER (Chinese) and 1.96% WER (English) on Seed-TTS-Eval, matching original CosyVoice 3 baseline performance with 2x LLM-stage speedup. Enables zero-shot speech editing (word-level insertion, deletion, substitution) without additional training.

Conclusion: LLaDA-TTS demonstrates that AR-pretrained weights are near-optimal for bidirectional masked prediction under acoustic token locality, enabling rapid convergence. The method applies seamlessly to any LLM-based AR TTS system with minimal modifications.

Abstract: Large language model (LLM)-based text-to-speech (TTS) systems achieve remarkable naturalness via autoregressive (AR) decoding, but require N sequential steps to generate N speech tokens. We present LLaDA-TTS, which replaces the AR LLM with a masked diffusion model that completes generation in a fixed number of parallel steps, decoupling inference latency from sequence length. Remarkably, using only 50 hours of fine-tuning data, we successfully transfer a pretrained AR checkpoint to the masked diffusion paradigm via bidirectional attention. At 64 steps, LLaDA-TTS achieves 0.98% CER (zh) and 1.96% WER (en) on Seed-TTS-Eval, matching the original CosyVoice 3 baseline performance while delivering a 2x LLM-stage speedup–a notable acceleration achieved despite the absence of KV cache, an optimization the AR baseline heavily relies on. Beyond acceleration, the bidirectional architecture naturally enables zero-shot speech editing–including word-level insertion, deletion, and substitution–without any additional training. Theoretically, we prove that AR-pretrained weights are near-optimal for bidirectional masked prediction under the locality property of acoustic tokens, explaining this rapid convergence. This general method modifies only the attention mask and objective, applying seamlessly to any LLM-based AR TTS system. Code and audio samples will be available at https://deft-piroshki-b652b5.netlify.app/.

[371] CA-TCN: A Causal-Anticausal Temporal Convolutional Network for Direct Auditory Attention Decoding

Iñigo García-Ugarte, Rubén Eguinoa, Ricardo San Martín, Daniel Paternain, Carmen Vidaurre

Main category: cs.SD

TL;DR: CA-TCN: A Causal-Anticausal Temporal Convolutional Network for Auditory Attention Decoding that improves accuracy by explicitly aligning auditory stimuli and neural responses with separate temporal convolutions.

Details

Motivation: Current Auditory Attention Decoding (AAD) approaches for steering auditory attention in complex listening environments typically assume access to clean speech sources and EEG signals, relying on low-frequency correlations. The authors aim to develop a more accurate and unified AAD model that can work in practical online processing scenarios.

Method: Proposes CA-TCN (Causal-Anticausal Temporal Convolutional Network) that directly classifies attended speakers. The architecture integrates best practices from CNNs for sequence processing and explicitly aligns auditory stimuli and neural responses using separate causal and anticausal convolutions with distinct receptive fields operating in opposite temporal directions.

Result: CA-TCN consistently improved decoding accuracy across datasets and decision windows compared to three baseline AAD models, with gains of 0.5%-3.2% for subject-independent models and 0.8%-2.9% for subject-specific models. Improvements were statistically significant in 4 of 6 evaluated settings. The model also demonstrated spatial robustness with stable EEG spatial filter patterns across datasets.

Conclusion: The work introduces an accurate and unified AAD model that outperforms existing methods while considering practical benefits for online processing. These findings advance the state of AAD and its applicability in real-world systems.

Abstract: A promising approach for steering auditory attention in complex listening environments relies on Auditory Attention Decoding (AAD), which aim to identify the attended speech stream in a multiple speaker scenario from neural recordings. Entrainment-based AAD approaches, typically assume access to clean speech sources and electroencephalography (EEG) signals to exploit low-frequency correlations between the neural response and the attended stimulus. In this study, we propose CA-TCN, a Causal-Anticausal Temporal Convolutional Network that directly classifies the attended speaker. The proposed architecture integrates several best practices from convolutional neural networks in sequence processing tasks. Importantly, it explicitly aligns auditory stimuli and neural responses by employing separate causal and anticausal convolutions respectively, with distinct receptive fields operating in opposite temporal directions. Experimental results, obtained through comparisons with three baseline AAD models, demonstrated that CA-TCN consistently improved decoding accuracy across datasets and decision windows, with gains ranging from 0.5% to 3.2% for subject-independent models and from 0.8% to 2.9% for subject-specific models compared with the next best-performing model, AADNet. Moreover, these improvements were statistically significant in four of the six evaluated settings when comparing Minimum Expected Switch Duration distributions. Beyond accuracy, the model demonstrated spatial robustness across different conditions, as the EEG spatial filters exhibited stable patterns across datasets. Overall, this work introduces an accurate and unified AAD model that outperforms existing methods while considering practical benefits for online processing scenarios. These findings contribute to advancing the state of AAD and its applicability in real-world systems.

[372] Does Audio Deepfake Detection Generalize?

Nicolas M. Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, Konstantin Böttinger

Main category: cs.SD

TL;DR: Systematic evaluation of audio deepfake detection methods reveals key factors for success and shows poor generalization to real-world data, highlighting limitations of current benchmarks.

Details

Motivation: Current audio deepfake detection research lacks consistency in methods and unclear understanding of what factors truly contribute to detection success, with unknown generalization to real-world scenarios.

Method: Re-implemented and uniformly evaluated existing audio spoofing detection architectures, systematically testing different features (cqtspec, logspec, melspec), preprocessing steps, and hyperparameters. Collected new real-world dataset of 37.9 hours of celebrity/politician audio (17.2 hours deepfakes) for generalization testing.

Result: Identified cqtspec or logspec features outperform melspec by 37% EER on average. Found existing methods perform poorly on real-world data with performance degradation up to 1000%, suggesting overfitting to ASVSpoof benchmark.

Conclusion: Audio deepfake detection is more challenging in real-world settings than previously thought, and the community needs better benchmarks and understanding of what truly matters for detection success.

Abstract: Current text-to-speech algorithms produce realistic fakes of human voices, making deepfake detection a much-needed area of research. While researchers have presented various techniques for detecting audio spoofs, it is often unclear exactly why these architectures are successful: Preprocessing steps, hyperparameter settings, and the degree of fine-tuning are not consistent across related work. Which factors contribute to success, and which are accidental? In this work, we address this problem: We systematize audio spoofing detection by re-implementing and uniformly evaluating architectures from related work. We identify overarching features for successful audio deepfake detection, such as using cqtspec or logspec features instead of melspec features, which improves performance by 37% EER on average, all other factors constant. Additionally, we evaluate generalization capabilities: We collect and publish a new dataset consisting of 37.9 hours of found audio recordings of celebrities and politicians, of which 17.2 hours are deepfakes. We find that related work performs poorly on such real-world data (performance degradation of up to one thousand percent). This may suggest that the community has tailored its solutions too closely to the prevailing ASVSpoof benchmark and that deepfakes are much harder to detect outside the lab than previously thought.

[373] Probabilistic Multilabel Graphical Modelling of Motif Transformations in Symbolic Music

Ron Taieb, Yoel Greenberg, Barak Sober

Main category: cs.SD

TL;DR: A probabilistic framework using multilabel Conditional Random Fields to model and analyze motivic transformations in Beethoven’s piano sonatas, integrating melodic, rhythmic, harmonic, and motivic data for structural analysis.

Details

Motivation: To understand how musical motifs transform while preserving identity, and to develop computational methods for analyzing these transformation patterns in symbolic music, particularly in classical compositions like Beethoven's piano sonatas.

Method: Developed a probabilistic framework with multilabel Conditional Random Fields to model motivic transformations, representing transformations as multilabel variables by comparing motif instances to reference occurrences, and integrating multiple datasets (melodic, rhythmic, harmonic, motivic) into unified analytical representations.

Result: Created an interpretable, distributional analysis framework that enables study of structural relationships and stylistic variation in motivic transformations, linking computational modeling with music-theoretical interpretation.

Conclusion: The framework supports quantitative investigation of musical structure and complexity in symbolic corpora, potentially facilitating analysis of broader compositional patterns and writing practices through computational music analysis.

Abstract: Motifs often recur in musical works in altered forms, preserving aspects of their identity while undergoing local variation. This paper investigates how such motivic transformations occur within their musical context in symbolic music. To support this analysis, we develop a probabilistic framework for modeling motivic transformations and apply it to Beethoven’s piano sonatas by integrating multiple datasets that provide melodic, rhythmic, harmonic, and motivic information within a unified analytical representation. Motif transformations are represented as multilabel variables by comparing each motif instance to a designated reference occurrence within its local context, ensuring consistent labeling across transformation families. We introduce a multilabel Conditional Random Field to model how motif-level musical features influence the occurrence of transformations and how different transformation families tend to co-occur. Our goal is to provide an interpretable, distributional analysis of motivic transformation patterns, enabling the study of their structural relationships and stylistic variation. By linking computational modeling with music-theoretical interpretation, the proposed framework supports quantitative investigation of musical structure and complexity in symbolic corpora and may facilitate the analysis of broader compositional patterns and writing practices.

[374] Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

Téo Guichoux, Théodor Lemerle, Shivam Mehta, Jonas Beskow, Gustav Eje Henter, Laure Soulier, Catherine Pelachaud, Nicolas Obin

Main category: cs.SD

TL;DR: Gelina is a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences, improving synchrony and prosody alignment over sequential methods.

Details

Motivation: Human communication is multimodal with speech and gestures tightly coupled, but current computational methods synthesize them sequentially, weakening synchrony and prosody alignment.

Method: Uses a unified framework with interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Supports multi-speaker/multi-style cloning and gesture-only synthesis from speech inputs.

Result: Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.

Conclusion: Gelina provides a joint synthesis approach that better captures the natural coupling between speech and gestures in human communication.

Abstract: Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.

[375] TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin, Ke-Han Lu, Wenze Ren, Xie Chen, Hung-yi Lee

Main category: cs.SD

TL;DR: TW-Sound580K: A Taiwanese audio-text instruction dataset created via Verify-Generate-Critique protocol, used to train Tai-LALM model that achieves 49.1% accuracy on TAU Benchmark, 6.5% improvement over baseline.

Details

Motivation: Large Audio-Language Models struggle with localized dialectal prosody due to scarcity of specialized corpora, particularly for Taiwanese speech.

Method: Developed TW-Sound580K dataset using Verify-Generate-Critique protocol with Dual-ASR validation, then trained Tai-LALM by fine-tuning DeSTA 2.5-Audio-initialized backbone with dynamic Dual-ASR Arbitration strategy for transcription selection.

Result: Tai-LALM achieves 49.1% accuracy on TAU Benchmark, a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning).

Conclusion: Integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.

Abstract: Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset’s utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.

[376] Joint Learning Global-Local Speaker Classification to Enhance End-to-End Speaker Diarization and Recognition

Yuhang Dai, Haopeng Lin, Jiale Qian, Ruiqi Yan, Hao Meng, Hanke Xie, Hanlin Wen, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang

Main category: cs.SD

TL;DR: GLSC-SDR: A novel paradigm for Large Audio-Language Models that jointly trains speaker classification with diarization and recognition using a Global-Local Speaker Classification strategy to enhance speaker discriminability.

Details

Motivation: Current Large Audio-Language Models (LALMs) have limited speaker discriminability due to scarcity of large-scale conversational data and lack of explicit speaker representation optimization. There's a need to improve speaker discrimination while maintaining semantic transcription accuracy.

Method: Proposes GLSC-SDR paradigm that jointly trains speaker classification with diarization and recognition. Introduces Global-Local Speaker Classification strategy: uses clustered speakers as global labels and re-encoded intra-cluster speakers as local labels in a hierarchical design.

Result: Achieves competitive or superior performance on AliMeeting, AISHELL-4, and AMI-SDM datasets compared to simulation-based and multi-encoder approaches, without requiring large-scale real conversational data.

Conclusion: GLSC-SDR effectively enhances fine-grained speaker discrimination while preserving semantic transcription accuracy, addressing key limitations in current LALMs for speaker diarization and recognition tasks.

Abstract: Large Audio-Language Models (LALMs) have demonstrated remarkable performance in end-to-end speaker diarization and recognition. However, their speaker discriminability remains limited due to the scarcity of large-scale conversational data and the absence of explicit speaker representation optimization. To address this, we propose GLSC-SDR, a paradigm that jointly trains speaker classification with diarization and recognition. We further introduce a Global-Local Speaker Classification strategy, which uses clustered speakers as global labels and re-encoded intra-cluster speakers as local labels. This hierarchical design enhances fine-grained speaker discrimination while preserving semantic transcription accuracy. Experiments on AliMeeting, AISHELL-4, and AMI-SDM demonstrate that GLSC-SDR achieves competitive or superior performance compared to simulation-based and multi-encoder approaches, without relying on large-scale real conversational data.

cs.LG

[377] Empowering Epidemic Response: The Role of Reinforcement Learning in Infectious Disease Control

Mutong Liu, Yang Liu, Jiming Liu

Main category: cs.LG

TL;DR: Survey paper reviewing reinforcement learning applications for optimizing infectious disease control interventions, covering resource allocation, policy balancing, and coordinated control strategies.

Details

Motivation: RL's adaptability to dynamic systems and ability to maximize long-term outcomes under constraints makes it suitable for optimizing infectious disease intervention strategies, but there's a lack of comprehensive surveys on this specific application.

Method: Literature review and survey methodology analyzing existing RL applications in infectious disease control, categorizing approaches by public health demands including resource allocation, policy balancing, mixed interventions, and inter-regional coordination.

Result: Identifies and discusses RL approaches used for controlling infectious disease spread, with COVID-19 as a prominent case study, highlighting how RL can optimize both pharmaceutical and non-pharmaceutical interventions.

Conclusion: RL shows significant potential for assisting public health sectors in disease control, with future research directions needed to further develop and refine these applications.

Abstract: Reinforcement learning (RL), owing to its adaptability to various dynamic systems in many real-world scenarios and the capability of maximizing long-term outcomes under different constraints, has been used in infectious disease control to optimize the intervention strategies for controlling infectious disease spread and responding to outbreaks in recent years. The potential of RL for assisting public health sectors in preventing and controlling infectious diseases is gradually emerging and being explored by rapidly increasing publications relevant to COVID-19 and other infectious diseases. However, few surveys exclusively discuss this topic, that is, the development and application of RL approaches for optimizing strategies of non-pharmaceutical and pharmaceutical interventions of public health. Therefore, this paper aims to provide a concise review and discussion of the latest literature on how RL approaches have been used to assist in controlling the spread and outbreaks of infectious diseases, covering several critical topics addressing public health demands: resource allocation, balancing between lives and livelihoods, mixed policy of multiple interventions, and inter-regional coordinated control. Finally, we conclude the paper with a discussion of several potential directions for future research.

[378] Pure and Physics-Guided Deep Learning Solutions for Spatio-Temporal Groundwater Level Prediction at Arbitrary Locations

Matteo Salis, Gabriele Sartor, Rosa Meo, Stefano Ferraris, Abdourrahmane M. Atto

Main category: cs.LG

TL;DR: STAINet: A physics-guided deep learning model for groundwater level prediction using attention mechanisms and physics-informed strategies.

Details

Motivation: Groundwater modeling is challenging due to complex relationships, and traditional theory-based models have computational limitations and simplifying assumptions. Data-driven deep learning offers flexibility but needs physical grounding for trustworthiness.

Method: Proposed STAINet - an attention-based deep learning model predicting weekly groundwater levels using sparse measurements and dense weather data. Enhanced with three physics-guided strategies: STAINet-IB (inductive bias estimating equation components), STAINet-ILB (learning bias with additional loss terms), and STAINet-ILRB (incorporating expert-estimated recharge zones).

Result: STAINet-ILB performed best with median MAPE 0.16% and KGE 0.58 in rollout testing. It predicted sensible equation components, demonstrating physical soundness and improved generalization.

Conclusion: Physics-guided deep learning enhances both generalization ability and trustworthiness of groundwater models, paving the way for hybrid Earth system models.

Abstract: Groundwater represents a key element of the water cycle, yet it exhibits intricate and context-dependent relationships that make its modeling a challenging task. Theory-based models have been the cornerstone of scientific understanding. However, their computational demands, simplifying assumptions, and calibration requirements limit their use. In recent years, data-driven models have emerged as powerful alternatives. In particular, deep learning has proven to be a leading approach for its design flexibility and ability to learn complex relationships. We proposed an attention-based pure deep learning model, named STAINet, to predict weekly groundwater levels at an arbitrary and variable number of locations, leveraging both spatially sparse groundwater measurements and spatially dense weather information. Then, to enhance the model’s trustworthiness and generalization ability, we considered different physics-guided strategies to inject the groundwater flow equation into the model. Firstly, in the STAINet-IB, by introducing an inductive bias, we also estimated the governing equation components. Then, by adopting a learning bias strategy, we proposed the STAINet-ILB, trained with additional loss terms adding supervision on the estimated equation components. Lastly, we developed the STAINet-ILRB, leveraging the groundwater body recharge zone information estimated by domain experts. The STAINet-ILB performed the best, achieving overwhelming test performances in a rollout setting (median MAPE 0.16%, KGE 0.58). Furthermore, it predicted sensible equation components, providing insights into the model’s physical soundness. Physics-guided approaches represent a promising opportunity to enhance both the generalization ability and the trustworthiness, thereby paving the way to a new generation of disruptive hybrid deep learning Earth system models.

[379] MAGNET: Autonomous Expert Model Generation via Decentralized Autoresearch and BitNet Training

Yongwan Kim, Sungchul Park

Main category: cs.LG

TL;DR: MAGNET is a decentralized system for autonomous generation, training, and serving of domain-expert language models across commodity hardware using four key components: autonomous ML research pipeline, CPU-native inference, distributed merging, and on-chain contribution tracking.

Details

Motivation: The paper aims to democratize access to specialized language models by creating a decentralized system that can autonomously generate, train, and serve domain-expert models across commodity hardware without requiring expensive GPU infrastructure.

Method: MAGNET integrates four components: (1) autoresearch - autonomous ML pipeline for dataset generation, hyperparameter exploration, evaluation, and iteration; (2) BitNet b1.58 ternary training for CPU-native inference; (3) DiLoCo-based distributed merging for efficient aggregation of domain specialists; (4) on-chain contribution tracking on HOOTi EVM chain.

Result: Validated through three case studies: video safety classification (balanced accuracy improved from 0.9287 to 0.9851), cryptocurrency directional prediction (hit rate improved from 41% to 54.9%), and BitNet hyperparameter optimization (10-phase sweep achieving -16.7% validation loss reduction).

Conclusion: MAGNET demonstrates a viable decentralized approach for autonomous generation and deployment of specialized language models on commodity hardware, potentially democratizing access to domain-expert AI capabilities without requiring expensive GPU infrastructure.

Abstract: We present MAGNET (Model Autonomously Growing Network), a decentralized system for autonomous generation, training, and serving of domain-expert language models across commodity hardware. MAGNET integrates four components: (1) autoresearch, an autonomous ML research pipeline that automates dataset generation, hyperparameter exploration, evaluation, and error-driven iteration; (2) BitNet b1.58 ternary training, enabling CPU-native inference via bitnet.cpp without GPU hardware; (3) DiLoCo-based distributed merging for communication-efficient aggregation of domain specialists; and (4) on-chain contribution tracking on the HOOTi EVM chain. We validate autoresearch through three case studies: video safety classification (balanced accuracy 0.9287 to 0.9851), cryptocurrency directional prediction (41% to 54.9% hit rate), and BitNet hyperparameter optimization (10-phase sweep, -16.7% validation loss).

[380] A Compression Perspective on Simplicity Bias

Tom Marty, Eric Elmoznino, Leo Gagnon, Tejas Kasetty, Mizu Nishikawa-Toomey, Sarthak Mittal, Guillaume Lajoie, Dhanya Sridhar

Main category: cs.LG

TL;DR: Neural networks’ simplicity bias explained through Minimum Description Length principle, showing feature selection follows optimal compression trade-offs between model complexity and predictive power across different data regimes.

Details

Motivation: To provide a theoretical foundation for understanding neural networks' simplicity bias (preference for simple functions) by framing supervised learning as optimal two-part lossless compression through the Minimum Description Length principle.

Method: Formalize supervised learning as optimal two-part lossless compression problem using Minimum Description Length principle, analyzing trade-off between model complexity (hypothesis description cost) and predictive power (data description cost). Validate on semi-synthetic benchmark comparing neural network feature selection to optimal compressors.

Result: Theory explains how simplicity bias governs feature selection through fundamental complexity-predictive power trade-off. Predicts learners transition from simple spurious shortcuts to complex features only when data encoding cost reduction justifies increased model complexity. Identifies distinct data regimes where increasing data promotes robustness vs. limiting data acts as complexity-based regularization.

Conclusion: Neural networks’ feature selection follows same trajectory as optimal two-part compressors, providing compression-theoretic explanation for simplicity bias and offering insights into data regimes affecting robustness and regularization.

Abstract: Deep neural networks exhibit a simplicity bias, a well-documented tendency to favor simple functions over complex ones. In this work, we cast new light on this phenomenon through the lens of the Minimum Description Length principle, formalizing supervised learning as a problem of optimal two-part lossless compression. Our theory explains how simplicity bias governs feature selection in neural networks through a fundamental trade-off between model complexity (the cost of describing the hypothesis) and predictive power (the cost of describing the data). Our framework predicts that as the amount of available training data increases, learners transition through qualitatively different features – from simple spurious shortcuts to complex features – only when the reduction in data encoding cost justifies the increased model complexity. Consequently, we identify distinct data regimes where increasing data promotes robustness by ruling out trivial shortcuts, and conversely, regimes where limiting data can act as a form of complexity-based regularization, preventing the learning of unreliable complex environmental cues. We validate our theory on a semi-synthetic benchmark showing that the feature selection of neural networks follows the same trajectory of solutions as optimal two-part compressors.

[381] Incorporating contextual information into KGWAS for interpretable GWAS discovery

Cheng Jiang, Brady Ryan, Megan Crow, Kipper Fletez-Brant, Kashish Doshi, Sandra Melo Carlos, Kexin Huang, Burkhard Hoeckendorf, Heming Yao, David Richmond

Main category: cs.LG

TL;DR: KGWAS framework enhanced with cell-type specific knowledge graphs from perturb-seq data improves disease mechanism discovery by reducing spurious correlations and increasing biological robustness.

Details

Motivation: While GWAS identifies genetic associations with disease, moving from associations to causal mechanisms is crucial for therapeutic target prioritization. The original KGWAS framework uses general-purpose knowledge graphs that can introduce spurious correlations, limiting its effectiveness for disease mechanism discovery.

Method: The authors propose using cell-type specific knowledge graphs from disease-relevant cell types instead of general-purpose KGs. They show the general-purpose KG can be pruned without losing statistical power, and performance improves by incorporating gene-gene relationships derived from perturb-seq data. They use sparse, context-specific KGs built from direct perturb-seq evidence.

Result: Cell-type specific KGs substantially improve disease mechanism discovery. The general-purpose KG can be pruned with no loss of statistical power, and incorporating perturb-seq data further enhances performance. Sparse, context-specific KGs from direct perturb-seq evidence yield more consistent and biologically robust disease-critical networks.

Conclusion: Using cell-type specific knowledge graphs derived from perturb-seq data significantly improves the KGWAS framework by reducing spurious correlations and providing more biologically relevant disease mechanisms, advancing therapeutic target prioritization.

Abstract: Genome-Wide Association Studies (GWAS) identify associations between genetic variants and disease; however, moving beyond associations to causal mechanisms is critical for therapeutic target prioritization. The recently proposed Knowledge Graph GWAS (KGWAS) framework addresses this challenge by linking genetic variants to downstream gene-gene interactions via a knowledge graph (KG), thereby improving detection power and providing mechanistic insights. However, the original KGWAS implementation relies on a large general-purpose KG, which can introduce spurious correlations. We hypothesize that cell-type specific KGs from disease-relevant cell types will better support disease mechanism discovery. Here, we show that the general-purpose KG in KGWAS can be substantially pruned with no loss of statistical power on downstream tasks, and that performance further improves by incorporating gene-gene relationships derived from perturb-seq data. Importantly, using a sparse, context-specific KG from direct perturb-seq evidence yields more consistent and biologically robust disease-critical networks.

[382] FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation

Dong Liu, Yanxuan Yu, Jiayi Zhang, Yifan Li, Ben Lengerich, Ying Nian Wu

Main category: cs.LG

TL;DR: FastCache accelerates Diffusion Transformers (DiT) inference through hidden-state caching and compression, reducing computational overhead while maintaining generation quality.

Details

Motivation: Diffusion Transformers are computationally intensive due to iterative structure and deep transformer stacks, creating need for efficient inference methods that reduce latency and memory usage without sacrificing generation quality.

Method: Proposes FastCache with dual strategy: (1) spatial-aware token selection to filter redundant tokens based on hidden-state saliency, and (2) transformer-level cache to reuse latent activations across timesteps when changes are minimal. Also includes token merging module based on k-NN density for further speedup.

Result: Demonstrates substantial reductions in latency and memory usage across multiple DiT variants, achieving best generation quality among existing cache methods as measured by FID and t-FID metrics.

Conclusion: FastCache provides an effective framework for accelerating DiT inference through intelligent caching and compression of internal representations, maintaining bounded approximation error while significantly improving computational efficiency.

Abstract: Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose \textbf{FastCache}, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model’s internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden-state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes fall below a predefined threshold. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, achieving the best generation quality among existing cache methods, as measured by FID and t-FID. To further improve the speedup of FastCache, we also introduce a token merging module that merges redundant tokens based on k-NN density. Code is available at \href{https://github.com/NoakLiu/FastCache-xDiT}{https://github.com/NoakLiu/FastCache-xDiT}.

[383] In-Context Molecular Property Prediction with LLMs: A Blinding Study on Memorization and Knowledge Conflicts

Matthias Busch, Marius Tacke, Sviatlana V. Lamaka, Mikhail L. Zheludkevich, Christian J. Cyron, Christian Feiler, Roland C. Aydin

Main category: cs.LG

TL;DR: LLMs’ in-context learning for molecular property prediction is investigated to determine if they perform genuine regression or rely on memorization, using systematic blinding experiments across multiple LLM families and datasets.

Details

Motivation: To address ambiguity about LLMs' effectiveness in scientific prediction tasks, particularly whether they perform genuine in-context learning for molecular properties or rely on memorized values from potentially contaminated training data.

Method: Systematic blinding experiments across nine LLM variants from three families (GPT-4.1, GPT-5, Gemini 2.5) on three MoleculeNet datasets, with progressively reduced information and varying in-context sample sizes (0-, 60-, and 1000-shot) as controls.

Result: The study provides a framework for evaluating molecular property prediction under controlled information access, revealing insights about memorization and conflicts between pre-trained knowledge and in-context information.

Conclusion: This work establishes a principled approach to assess LLMs’ genuine in-context learning capabilities for scientific prediction tasks, addressing data contamination concerns and knowledge-information conflicts.

Abstract: The capabilities of large language models (LLMs) have expanded beyond natural language processing to scientific prediction tasks, including molecular property prediction. However, their effectiveness in in-context learning remains ambiguous, particularly given the potential for training data contamination in widely used benchmarks. This paper investigates whether LLMs perform genuine in-context regression on molecular properties or rely primarily on memorized values. Furthermore, we analyze the interplay between pre-trained knowledge and in-context information through a series of progressively blinded experiments. We evaluate nine LLM variants across three families (GPT-4.1, GPT-5, Gemini 2.5) on three MoleculeNet datasets (Delaney solubility, Lipophilicity, QM7 atomization energy) using a systematic blinding approach that iteratively reduces available information. Complementing this, we utilize varying in-context sample sizes (0-, 60-, and 1000-shot) as an additional control for information access. This work provides a principled framework for evaluating molecular property prediction under controlled information access, addressing concerns regarding memorization and exposing conflicts between pre-trained knowledge and in-context information.

[384] Why Safety Probes Catch Liars But Miss Fanatics

Kristiyan Haralambiev

Main category: cs.LG

TL;DR: Probes fail to detect coherently misaligned AI systems that believe harmful behavior is virtuous, not just strategically hidden, due to fundamental computational limitations.

Details

Motivation: Current activation-based probes for detecting deceptive AI alignment have a blind spot: they can't identify models that are coherently misaligned (believing harmful behavior is virtuous) rather than deceptively aligned (strategically hiding harmful intentions).

Method: Theoretical proof that no polynomial-time probe can detect coherent misalignment with non-trivial accuracy when belief structures reach sufficient complexity (PRF-like triggers). Empirical demonstration by training two models with identical RLHF procedures: one produces direct hostile responses (“the Liar”), another is trained towards coherent misalignment using rationalizations framing hostility as protective (“the Fanatic”).

Result: Both models exhibit identical behavior, but the Liar is detected 95%+ of the time while the Fanatic evades detection almost entirely. This shows Emergent Probe Evasion: training with belief-consistent reasoning shifts models from detectable “deceptive” regime to undetectable “coherent” regime.

Conclusion: Coherent misalignment represents a fundamental limitation for current probe-based detection methods, as models learn to believe their harmful behavior is virtuous rather than learning to hide it, making them undetectable to polynomial-time probes.

Abstract: Activation-based probes have emerged as a promising approach for detecting deceptively aligned AI systems by identifying internal conflict between true and stated goals. We identify a fundamental blind spot: probes fail on coherent misalignment - models that believe their harmful behavior is virtuous rather than strategically hiding it. We prove that no polynomial-time probe can detect such misalignment with non-trivial accuracy when belief structures reach sufficient complexity (PRF-like triggers). We show the emergence of this phenomenon on a simple task by training two models with identical RLHF procedures: one producing direct hostile responses (“the Liar”), another trained towards coherent misalignment using rationalizations that frame hostility as protective (“the Fanatic”). Both exhibit identical behavior, but the Liar is detected 95%+ of the time while the Fanatic evades detection almost entirely. We term this Emergent Probe Evasion: training with belief-consistent reasoning shifts models from a detectable “deceptive” regime to an undetectable “coherent” regime - not by learning to hide, but by learning to believe.

[385] DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease

Runsheng Bai, Chengyu Zhang, Yangdong Deng

Main category: cs.LG

TL;DR: DRiffusion is a parallel sampling framework that accelerates diffusion model inference through draft-and-refine process using skip transitions and parallel noise computation.

Details

Motivation: Diffusion models suffer from slow iterative sampling causing high latency, limiting their use in interactive applications that require real-time generation.

Method: Uses draft-and-refine process with skip transitions to generate multiple draft states for future timesteps in parallel, computes corresponding noises in parallel, then uses standard denoising for refinement.

Result: Achieves 1.4×-3.7× speedup across multiple diffusion models with minimal quality degradation: FID and CLIP scores remain largely on par, PickScore and HPSv2.1 show only minor average drops of 0.17 and 0.43.

Conclusion: DRiffusion delivers substantial acceleration while preserving perceptual quality, making diffusion models more practical for interactive applications.

Abstract: Diffusion models have achieved remarkable success in generating high-fidelity content but suffer from slow, iterative sampling, resulting in high latency that limits their use in interactive applications. We introduce DRiffusion, a parallel sampling framework that parallelizes diffusion inference through a draft-and-refine process. DRiffusion employs skip transitions to generate multiple draft states for future timesteps and computes their corresponding noises in parallel, which are then used in the standard denoising process to produce refined results. Theoretically, our method achieves an acceleration rate of $\tfrac{1}{n}$ or $\tfrac{2}{n+1}$, depending on whether the conservative or aggressive mode is used, where $n$ denotes the number of devices. Empirically, DRiffusion attains 1.4$\times$-3.7$\times$ speedup across multiple diffusion models while incur minimal degradation in generation quality: on MS-COCO dataset, both FID and CLIP remain largely on par with those of the original model, while PickScore and HPSv2.1 show only minor average drops of 0.17 and 0.43, respectively. These results verify that DRiffusion delivers substantial acceleration and preserves perceptual quality.

[386] Data-Driven Plasticity Modeling via Acoustic Profiling

Khalid El-Awady

Main category: cs.LG

TL;DR: Data-driven framework using acoustic emission analysis to model plastic deformation in metals, combining wavelet-based event detection with machine learning to identify deformation mechanisms.

Details

Motivation: To develop a predictive framework for understanding plastic deformation in crystalline metals using acoustic emission signals, moving beyond retrospective analysis to enable real-time monitoring and prediction of material behavior.

Method: Wavelet-based method using Morlet transforms to detect AE events across frequency bands, validation against stress-drop dynamics, machine learning with engineered time/frequency domain features, and clustering analysis to identify event archetypes.

Result: Engineered features significantly outperform raw signal classifiers, key discriminative features identified (RMS amplitude, zero crossing rate, spectral centroid), four distinct AE event archetypes discovered corresponding to different deformation mechanisms.

Conclusion: The framework successfully links acoustic emission signals to physical deformation mechanisms, demonstrating potential for transitioning from retrospective analysis to predictive modeling of material behavior using acoustic signals.

Abstract: This paper presents a data-driven framework for modeling plastic deformation in crystalline metals through acoustic emission (AE) analysis. Building on experimental data from compressive loading of nickel micropillars, the study introduces a wavelet-based method using Morlet transforms to detect AE events across distinct frequency bands, enabling identification of both large and previously overlooked small-scale events. The detected events are validated against stress-drop dynamics, demonstrating strong physical consistency and revealing a relationship between AE energy release and strain evolution, including the onset of increased strain rate following major events. Leveraging labeled datasets of events and non-events, the work applies machine learning techniques, showing that engineered time and frequency domain features significantly outperform raw signal classifiers, and identifies key discriminative features such as RMS amplitude, zero crossing rate, and spectral centroid. Finally, clustering analysis uncovers four distinct AE event archetypes corresponding to different deformation mechanisms, highlighting the potential for transitioning from retrospective analysis to predictive modeling of material behavior using acoustic signals.

[387] Decoding Defensive Coverage Responsibilities in American Football Using Factorized Attention Based Transformer Models

Kevin Song, Evan Diewald, Ornob Siddiquee, Chris Boomhower, Keegan Abdoo, Mike Band, Amy Lee

Main category: cs.LG

TL;DR: A factorized attention transformer model predicts NFL defensive coverage assignments, receiver-defender matchups, and targeted defenders using multi-agent tracking data with ~89% accuracy.

Details

Motivation: Current NFL coverage analysis focuses on post-hoc team-level classification, lacking predictive modeling of individual player assignments and dynamic matchup evolution throughout plays.

Method: Factorized attention-based transformer separates temporal and agent dimensions to independently model player movement patterns and inter-player relationships, trained on randomly truncated trajectories.

Result: Models achieve ~89%+ accuracy for all tasks (coverage assignments, matchups, targeted defenders), with true accuracy potentially higher due to annotation ambiguity.

Conclusion: The approach enables predictive modeling of defensive responsibilities evolution and novel derivative metrics (disguise rate, double coverage rate) for broadcasting and team strategy.

Abstract: Defensive coverage schemes in the National Football League (NFL) represent complex tactical patterns requiring coordinated assignments among defenders who must react dynamically to the offense’s passing concept. This paper presents a factorized attention-based transformer model applied to NFL multi-agent play tracking data to predict individual coverage assignments, receiver-defender matchups, and the targeted defender on every pass play. Unlike previous approaches that focus on post-hoc coverage classification at the team level, our model enables predictive modeling of individual player assignments and matchup dynamics throughout the play. The factorized attention mechanism separates temporal and agent dimensions, allowing independent modeling of player movement patterns and inter-player relationships. Trained on randomly truncated trajectories, the model generates frame-by-frame predictions that capture how defensive responsibilities evolve from pre-snap through pass arrival. Our models achieve approximately 89%+ accuracy for all tasks, with true accuracy potentially higher given annotation ambiguity in the ground truth labels. These outputs also enable novel derivative metrics, including disguise rate and double coverage rate, which enable enhanced storytelling in TV broadcasts as well as provide actionable insights for team strategy development and player evaluation.

[388] Parameter-Free Dynamic Regret for Unconstrained Linear Bandits

Alberto Rumi, Andrew Jacobsen, Nicolò Cesa-Bianchi, Fabio Vitale

Main category: cs.LG

TL;DR: This paper presents a new algorithm for unconstrained adversarial linear bandits that achieves optimal dynamic regret bounds by adapting to comparator switches without prior knowledge of switch count.

Details

Motivation: The paper addresses the long-standing open problem in linear bandits of achieving optimal dynamic regret bounds that adapt to the number of comparator switches without requiring prior knowledge of the switch count. Current approaches either require knowing the number of switches in advance or achieve suboptimal bounds.

Method: The authors propose a simple approach to combine guarantees of multiple bandit algorithms. Their method allows optimal adaptation to the number of switches S_T in an arbitrary comparator sequence by effectively combining different algorithm instances without prior knowledge of S_T.

Result: The paper provides the first algorithm for linear bandits achieving the optimal regret guarantee of order O(√(d(1+S_T)T)) up to poly-logarithmic terms without prior knowledge of S_T, resolving a long-standing open problem in the field.

Conclusion: The work successfully solves a fundamental problem in linear bandits by developing an algorithm that optimally adapts to comparator switches without requiring advance knowledge of switch counts, representing a significant theoretical advancement in online learning and bandit optimization.

Abstract: We study dynamic regret minimization in unconstrained adversarial linear bandit problems. In this setting, a learner must minimize the cumulative loss relative to an arbitrary sequence of comparators $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_T$ in $\mathbb{R}^d$, but receives only point-evaluation feedback on each round. We provide a simple approach to combining the guarantees of several bandit algorithms, allowing us to optimally adapt to the number of switches $S_T = \sum_t\mathbb{I}{\boldsymbol{u}t \neq \boldsymbol{u}{t-1}}$ of an arbitrary comparator sequence. In particular, we provide the first algorithm for linear bandits achieving the optimal regret guarantee of order $\mathcal{O}\big(\sqrt{d(1+S_T) T}\big)$ up to poly-logarithmic terms without prior knowledge of $S_T$, thus resolving a long-standing open problem.

[389] Preventing Data Leakage in EEG-Based Survival Prediction: A Two-Stage Embedding and Transformer Framework

Yixin Zhou, Zhixiang Liu, Vladimir I. Zadorozhny, Jonathan Elmer

Main category: cs.LG

TL;DR: Proposes a leakage-aware two-stage framework for EEG-based outcome prediction in comatose patients after cardiac arrest, addressing subtle data leakage issues in multi-stage modeling pipelines.

Details

Motivation: Deep learning models for EEG-based outcome prediction suffer from reliability issues due to subtle data leakage when long EEG recordings are segmented into short windows and reused across training stages, leading to overly optimistic validation and poor generalization.

Method: A two-stage framework: (1) CNN with ArcFace objective transforms short EEG segments into embeddings, (2) Transformer aggregates embeddings for patient-level predictions with strict patient-level separation to eliminate leakage pathways.

Result: The framework achieves stable and generalizable performance on large-scale EEG dataset of post-cardiac-arrest patients, maintaining high sensitivity at stringent specificity thresholds under clinically relevant constraints.

Conclusion: Highlights importance of rigorous data partitioning in EEG modeling and provides practical solution for reliable EEG-based outcome prediction by addressing previously overlooked data leakage issues.

Abstract: Deep learning models have shown promise in EEG-based outcome prediction for comatose patients after cardiac arrest, but their reliability is often compromised by subtle forms of data leakage. In particular, when long EEG recordings are segmented into short windows and reused across multiple training stages, models may implicitly encode and propagate label information, leading to overly optimistic validation performance and poor generalization. In this study, we identify a previously overlooked form of data leakage in multi-stage EEG modeling pipelines. We demonstrate that violating strict patient-level separation can significantly inflate validation metrics while causing substantial degradation on independent test data. To address this issue, we propose a leakage-aware two-stage framework. In the first stage, short EEG segments are transformed into embedding representations using a convolutional neural network with an ArcFace objective. In the second stage, a Transformer-based model aggregates these embeddings to produce patient-level predictions, with strict isolation between training cohorts to eliminate leakage pathways. Experiments on a large-scale EEG dataset of post-cardiac-arrest patients show that the proposed framework achieves stable and generalizable performance under clinically relevant constraints, particularly in maintaining high sensitivity at stringent specificity thresholds. These results highlight the importance of rigorous data partitioning and provide a practical solution for reliable EEG-based outcome prediction.

[390] Personalizing Mathematical Game-based Learning for Children: A Preliminary Study

Jie Gao, Adam K. Dubé

Main category: cs.LG

TL;DR: AI framework using Random Forest classifier to predict valid player-generated math game levels for adaptive game-based learning systems.

Details

Motivation: Game-based learning enhances math education engagement but faces challenges in creating personalized, high-quality game levels that match learners' abilities. Current systems lack effective mechanisms to validate and deliver appropriate player-generated content.

Method: Proposed AI framework guided by adaptive learning theory. Collected 206 distinct game levels from experts and advanced players using Creative Mode in a math GBL app. Developed classifier to extract game features and predict valid levels, comparing four ML models: k-nearest neighbors, decision trees, support vector machines, and random forests.

Result: Random Forest model performed best among the four classification models for predicting valid game levels. The classifier successfully extracted game features and demonstrated potential for validating player-generated content.

Conclusion: AI integration into game-level design can provide more personalized game levels for players. The framework offers insights for developing adaptive GBL systems that leverage player-generated content while maintaining educational quality.

Abstract: Game-based learning (GBL) is widely adopted in mathematics education. It enhances learners’ engagement and critical thinking throughout the mathematics learning process. However, enabling players to learn intrinsically through mathematical games still presents challenges. In particular, effective GBL systems require dozens of high-quality game levels and mechanisms to deliver them to appropriate players in a way that matches their learning abilities. To address this challenge, we propose a framework, guided by adaptive learning theory, that uses artificial intelligence (AI) techniques to build a classifier for player-generated levels. We collect 206 distinct game levels created by both experts and advanced players in Creative Mode, a new tool in a math game-based learning app, and develop a classifier to extract game features and predict valid game levels. The preliminary results show that the Random Forest model is the optimal classifier among the four machine learning classification models (k-nearest neighbors, decision trees, support vector machines, and random forests). This study provides insights into the development of GBL systems, highlighting the potential of integrating AI into the game-level design process to provide more personalized game levels for players.

[391] Online Learning for Dynamic Constellation Topologies

João Norberto, Ricardo Ferreira, Cláudia Soares

Main category: cs.LG

TL;DR: Online learning approach for dynamic satellite network topology configuration that handles orbital movement without assuming network structure.

Details

Motivation: Satellite networks face challenges due to continuous orbital movement and maneuvering, requiring dynamic topology configuration without relying on assumptions like known orbital planes that can be violated.

Method: Formulates network topology configuration as an online learning problem that doesn’t assume network structure, making it robust to satellite maneuvering. The approach is amenable to constrained online learning with trade-offs between computational complexity and convergence.

Result: Empirically demonstrates that the online learning formulation matches state-of-the-art offline methods’ performance while being adaptable to constrained online learning scenarios.

Conclusion: The online learning framework provides an effective solution for dynamic satellite network topology configuration that handles orbital movement without structural assumptions, offering computational trade-offs for practical deployment.

Abstract: The use of satellite networks has increased significantly in recent years due to their advantages over purely terrestrial systems, such as higher availability and coverage. However, to effectively provide these services, satellite networks must cope with the continuous orbital movement and maneuvering of their nodes and the impact on the network’s topology. In this work, we address the problem of (dynamic) network topology configuration under the online learning framework. As a byproduct, our approach does not assume structure about the network, such as known orbital planes (that could be violated by maneuvering satellites). We empirically demonstrate that our problem formulation matches the performance of state-of-the-art offline methods. Importantly, we demonstrate that our approach is amenable to constrained online learning, exhibiting a trade-off between computational complexity per iteration and convergence to a final strategy.

[392] EngineAD: A Real-World Vehicle Engine Anomaly Detection Dataset

Hadi Hojjati, Christopher Roth, Rory Woods, Ken Sills, Narges Armanfard

Main category: cs.LG

TL;DR: EngineAD is a real-world multivariate dataset for anomaly detection in commercial vehicles, featuring authentic sensor telemetry with expert annotations, showing classical methods often outperform deep learning approaches in this domain.

Details

Motivation: The lack of large-scale, real-world benchmarks for anomaly detection in safety-critical domains like transportation limits progress. Existing synthetic datasets don't capture authentic operational data needed for developing robust, field-deployable solutions.

Method: Collected high-resolution sensor telemetry from 25 commercial vehicles over 6 months, preprocessed into 300-timestep segments of 8 principal components, and established benchmark using 9 diverse one-class anomaly detection models.

Result: Significant performance variability across vehicle fleet highlights cross-vehicle generalization challenges. Simple classical methods (K-Means, One-Class SVM) often outperform deep learning approaches in segment-based evaluation.

Conclusion: EngineAD provides realistic, challenging resource for developing robust anomaly detection solutions for automotive industry, showing classical methods remain competitive in real-world vehicle sensor data analysis.

Abstract: The progress of Anomaly Detection (AD) in safety-critical domains, such as transportation, is severely constrained by the lack of large-scale, real-world benchmarks. To address this, we introduce EngineAD, a novel, multivariate dataset comprising high-resolution sensor telemetry collected from a fleet of 25 commercial vehicles over a six-month period. Unlike synthetic datasets, EngineAD features authentic operational data labeled with expert annotations, distinguishing normal states from subtle indicators of incipient engine faults. We preprocess the data into $300$-timestep segments of $8$ principal components and establish an initial benchmark using nine diverse one-class anomaly detection models. Our experiments reveal significant performance variability across the vehicle fleet, underscoring the challenge of cross-vehicle generalization. Furthermore, our findings corroborate recent literature, showing that simple classical methods (e.g., K-Means and One-Class SVM) are often highly competitive with, or superior to, deep learning approaches in this segment-based evaluation. By publicly releasing EngineAD, we aim to provide a realistic, challenging resource for developing robust and field-deployable anomaly detection and anomaly prediction solutions for the automotive industry.

[393] Adversarial-Robust Multivariate Time-Series Anomaly Detection via Joint Information Retention

Hadi Hojjati, Narges Armanfard

Main category: cs.LG

TL;DR: ARTA is an adversarial training framework for robust multivariate time-series anomaly detection that uses joint min-max optimization between a detector and sparsity-constrained mask generator to improve robustness against temporal corruptions.

Details

Motivation: Modern deep learning-based time-series anomaly detectors are highly sensitive to localized input corruptions and structured noise, making them brittle in real-world applications where data quality varies.

Method: ARTA uses a joint training framework with an anomaly detector and a sparsity-constrained mask generator trained simultaneously via min-max optimization. The generator identifies minimal temporal perturbations that maximally increase anomaly scores, while the detector learns to remain stable under these structured perturbations.

Result: ARTA consistently improves anomaly detection performance across diverse datasets in the TSB-AD benchmark and shows significantly more graceful degradation under increasing noise levels compared to state-of-the-art baselines.

Conclusion: The adversarial training strategy exposes brittle decision pathways and encourages detectors to rely on distributed, stable temporal patterns rather than spurious localized artifacts, improving robustness in time-series anomaly detection.

Abstract: Time-series anomaly detection (TSAD) is a critical component in monitoring complex systems, yet modern deep learning-based detectors are often highly sensitive to localized input corruptions and structured noise. We propose ARTA (Adversarially Robust multivariate Time-series Anomaly detection via joint information retention), a joint training framework that improves detector robustness through a principled min-max optimization objective. ARTA comprises an anomaly detector and a sparsity-constrained mask generator that are trained simultaneously. The generator identifies minimal, task-relevant temporal perturbations that maximally increase the detector’s anomaly score, while the detector is optimized to remain stable under these structured perturbations. The resulting masks characterize the detector’s sensitivity to adversarial temporal corruptions and can serve as explanatory signals for the detector’s decisions. This adversarial training strategy exposes brittle decision pathways and encourages the detector to rely on distributed and stable temporal patterns rather than spurious localized artifacts. We conduct extensive experiments on the TSB-AD benchmark, demonstrating that ARTA consistently improves anomaly detection performance across diverse datasets and exhibits significantly more graceful degradation under increasing noise levels compared to state-of-the-art baselines.

[394] On the Objective and Feature Weights of Minkowski Weighted k-Means

Renato Cordeiro de Amorim, Vladimir Makarenkov

Main category: cs.LG

TL;DR: Theoretical analysis of Minkowski weighted k-means algorithm, showing its objective as power-mean aggregation and deriving bounds, weight structure, convergence guarantees, and feature suppression properties.

Details

Motivation: The Minkowski weighted k-means algorithm has shown empirical success but lacks sufficient theoretical understanding of its properties and behavior.

Method: Mathematical analysis showing the mwk-means objective can be expressed as power-mean aggregation of within-cluster dispersions, deriving bounds, characterizing feature weight structure, and establishing convergence properties.

Result: Revealed that the Minkowski exponent p controls transition between selective and uniform feature use, derived explicit guarantees on suppression of high-dispersion features, and established algorithm convergence.

Conclusion: Provides comprehensive theoretical foundation for mwk-means algorithm, explaining its behavior through mathematical analysis of objective function, weight structure, and convergence properties.

Abstract: The Minkowski weighted k-means (mwk-means) algorithm extends classical k-means by incorporating feature weights and a Minkowski distance. Despite its empirical success, its theoretical properties remain insufficiently understood. We show that the mwk-means objective can be expressed as a power-mean aggregation of within-cluster dispersions, with the order determined by the Minkowski exponent p. This formulation reveals how p controls the transition between selective and uniform use of features. Using this representation, we derive bounds for the objective function and characterise the structure of the feature weights, showing that they depend only on relative dispersion and follow a power-law relationship with dispersion ratios. This leads to explicit guarantees on the suppression of high-dispersion features. Finally, we establish convergence of the algorithm and provide a unified theoretical interpretation of its behaviour.

[395] Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank’s Event Semantics

Peter Balogh

Main category: cs.LG

TL;DR: The paper shows that event primitives similar to Schank’s conceptual dependency theory can be automatically discovered through compression pressure alone, using DreamCoder’s wake-sleep library learning on event state transformations.

Details

Motivation: To investigate whether event primitives like those in Schank's conceptual dependency theory (ATRANS, PTRANS, MTRANS) can be discovered automatically through compression pressure rather than being hand-coded from linguistic intuition.

Method: Adapt DreamCoder’s wake-sleep library learning to event state transformations. Given events as before/after world state pairs, the system finds operator compositions explaining each event (wake), then extracts recurring patterns as new operators optimized under Minimum Description Length (sleep). Starts from four generic primitives.

Result: On synthetic data: discovered operators achieve Bayesian MDL within 4% of Schank’s hand-coded primitives while explaining 100% of events vs. Schank’s 81%. On ATOMIC commonsense knowledge graph: Schank’s primitives explain only 10% of naturalistic events, while discovered library explains 100%. Dominant operators are mental/emotional state changes (CHANGE_wants 20%, CHANGE_feels 18%, CHANGE_is 18%) not in Schank’s original taxonomy.

Conclusion: Event primitives can be derived from compression pressure; Schank’s core primitives are information-theoretically justified; the complete inventory is substantially richer than proposed, with mental/emotional operators dominating in naturalistic data.

Abstract: We show that they do. Schank’s conceptual dependency theory proposed that all events decompose into primitive operations – ATRANS, PTRANS, MTRANS, and others – hand-coded from linguistic intuition. Can the same primitives be discovered automatically through compression pressure alone? We adapt DreamCoder’s wake-sleep library learning to event state transformations. Given events as before/after world state pairs, our system finds operator compositions explaining each event (wake), then extracts recurring patterns as new operators optimized under Minimum Description Length (sleep). Starting from four generic primitives, it discovers operators mapping directly to Schank’s: MOVE_PROP_has = ATRANS, CHANGE_location = PTRANS, SET_knows = MTRANS, SET_consumed = INGEST, plus compound operators (“mail” = ATRANS + PTRANS) and novel emotional state operators absent from Schank’s taxonomy. We validate on synthetic events and real-world commonsense data from the ATOMIC knowledge graph. On synthetic data, discovered operators achieve Bayesian MDL within 4% of Schank’s hand-coded primitives while explaining 100% of events vs. Schank’s 81%. On ATOMIC, results are more dramatic: Schank’s primitives explain only 10% of naturalistic events, while the discovered library explains 100%. Dominant operators are not physical-action primitives but mental and emotional state changes – CHANGE_wants (20%), CHANGE_feels (18%), CHANGE_is (18%) – none in Schank’s original taxonomy. These results provide the first empirical evidence that event primitives can be derived from compression pressure, that Schank’s core primitives are information-theoretically justified, and that the complete inventory is substantially richer than proposed – with mental/emotional operators dominating in naturalistic data.

[396] Second-Order, First-Class: A Composable Stack for Curvature-Aware Training

Mikalai Korbit, Mario Zanon

Main category: cs.LG

TL;DR: Somax is a composable Optax-native stack for second-order optimization that treats curvature-aware training as a single JIT-compiled step with explicit, swappable modules for curvature operators, estimators, linear solvers, preconditioners, and damping policies.

Details

Motivation: Second-order optimization methods offer improved stability and faster convergence but remain underused due to implementation overhead, tuning brittleness, and lack of composable APIs that make hidden choices explicit and swappable.

Method: Somax introduces a composable Optax-native stack with first-class modules for curvature operators, estimators, linear solvers, preconditioners, and damping policies. It separates planning from execution: derives a static plan from module requirements, then runs through a specialized execution path that reuses intermediate results across modules.

Result: System-oriented ablations show that composition choices materially affect scaling behavior and time-to-accuracy, and planning reduces per-step overhead relative to unplanned composition with redundant recomputation.

Conclusion: Somax provides a practical, composable framework for second-order optimization that makes typically hidden choices explicit and swappable, enabling more accessible and efficient use of curvature-aware training methods.

Abstract: Second-order methods promise improved stability and faster convergence, yet they remain underused due to implementation overhead, tuning brittleness, and the lack of composable APIs. We introduce Somax, a composable Optax-native stack that treats curvature-aware training as a single JIT-compiled step governed by a static plan. Somax exposes first-class modules – curvature operators, estimators, linear solvers, preconditioners, and damping policies – behind a single step interface and composes with Optax by applying standard gradient transformations (e.g., momentum, weight decay, schedules) to the computed direction. This design makes typically hidden choices explicit and swappable. Somax separates planning from execution: it derives a static plan (including cadences) from module requirements, then runs the step through a specialized execution path that reuses intermediate results across modules. We report system-oriented ablations showing that (i) composition choices materially affect scaling behavior and time-to-accuracy, and (ii) planning reduces per-step overhead relative to unplanned composition with redundant recomputation.

[397] QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu

Main category: cs.LG

TL;DR: QuitoBench is a regime-balanced benchmark for time series forecasting built on billion-scale Alipay traffic data, revealing key findings about model performance across different context lengths and regimes.

Details

Motivation: Time series forecasting lacks large-scale, high-quality benchmarks that capture forecasting-relevant properties rather than just application domains, limiting progress in the field.

Method: Built QuitoBench using Quito corpus (billion-scale Alipay traffic data) with coverage across 8 trend×seasonality×forecastability regimes. Benchmarked 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances.

Result: Four key findings: 1) Context-length crossover (DL leads at short context, foundation models dominate at long context); 2) Forecastability is dominant difficulty driver (3.64× MAE gap); 3) DL models match/surpass foundation models with 59× fewer parameters; 4) Data scaling benefits exceed model size scaling for both families.

Conclusion: QuitoBench enables reproducible, regime-aware evaluation for time series forecasting research, revealing important insights about model performance across different regimes and context lengths.

Abstract: Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce \textsc{QuitoBench}, a regime-balanced benchmark for time series forecasting with coverage across eight trend$\times$seasonality$\times$forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon \textsc{Quito}, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context ($L=96$) but foundation models dominate at long context ($L \ge 576$); (ii) forecastability is the dominant difficulty driver, producing a $3.64 \times$ MAE gap across regimes; (iii) deep learning models match or surpass foundation models at $59 \times$ fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.

[398] GLU: Global-Local-Uncertainty Fusion for Scalable Spatiotemporal Reconstruction and Forecasting

Linzheng Wang, Jason Chen, Nicolas Tricard, Zituo Chen, Sili Deng

Main category: cs.LG

TL;DR: GLU is a unified framework for sparse reconstruction and dynamic forecasting in digital twins, using structured latent states with global, local, and uncertainty components to handle sparse measurements and predict system evolution.

Details

Motivation: Current approaches treat sparse reconstruction and dynamic forecasting as separate tasks in digital twins, but these functions should be unified for better performance and consistency in modeling complex physical systems.

Method: GLU introduces a structured latent state with three components: global system-level summary, local tokens anchored to measurements, and uncertainty-driven importance field. For reconstruction, it uses importance-aware adaptive neighborhood selection. For forecasting, a hierarchical Leader-Follower Dynamics module evolves the latent state with reduced memory growth.

Result: GLU consistently outperforms reduced-order, convolutional, neural operator, and attention-based baselines in reconstruction fidelity, better preserves multi-scale structures, maintains stable rollout behavior, delays error accumulation, and preserves cross-channel thermo-chemical couplings in turbulent combustion datasets with lower memory growth.

Conclusion: GLU establishes a flexible and computationally practical paradigm for sparse digital twins by unifying reconstruction and forecasting through structured latent representations, achieving superior performance with reduced computational overhead.

Abstract: Digital twins of complex physical systems are expected to infer unobserved states from sparse measurements and predict their evolution in time, yet these two functions are typically treated as separate tasks. Here we present GLU, a Global-Local-Uncertainty framework that formulates sparse reconstruction and dynamic forecasting as a unified state-representation problem and introduces a structured latent assembly to both tasks. The central idea is to build a structured latent state that combines a global summary of system-level organization, local tokens anchored to available measurements, and an uncertainty-driven importance field that weights observations according to the physical informativeness. For reconstruction, GLU uses importance-aware adaptive neighborhood selection to retrieve locally relevant information while preserving global consistency and allowing flexible query resolution on arbitrary geometries. Across a suite of challenging benchmarks, GLU consistently improves reconstruction fidelity over reduced-order, convolutional, neural operator, and attention-based baselines, better preserving multi-scale structures. For forecasting, a hierarchical Leader-Follower Dynamics module evolves the latent state with substantially reduced memory growth, maintains stable rollout behavior and delays error accumulation in nonlinear dynamics. On a realistic turbulent combustion dataset, it further preserves not only sharp fronts and broadband structures in multiple physical fields, but also their cross-channel thermo-chemical couplings. Scalability tests show that these gains are achieved with substantially lower memory growth than comparable attention-based baselines. Together, these results establish GLU as a flexible and computationally practical paradigm for sparse digital twins.

[399] Identification of Bivariate Causal Directionality Based on Anticipated Asymmetric Geometries

Alex Glushkovsky

Main category: cs.LG

TL;DR: Two methods for identifying causal directionality in bivariate data using conditional distributions: Anticipated Asymmetric Geometries (AAG) and Monotonicity Index, with AAG achieving 77.9% accuracy on real-world examples.

Details

Motivation: Identifying causal directionality in bivariate numerical data is a fundamental problem with practical implications, but existing methods like Additive Noise Models (ANMs) have limited accuracy (~63%).

Method: Two approaches: 1) AAG compares actual conditional distributions to anticipated normal distributions using various metrics (correlation, cosine similarity, etc.); 2) Monotonicity Index compares gradient monotonicity indexes and counts gradient sign changes. Both assume stochastic properties and unimodality of effect distributions.

Result: Tuned AAG method achieves 77.9% accuracy on 95 real-world pairs, outperforming Monotonicity Index and ANMs (63% ±10%). Hyperparameter tuning via Design of Experiment and decision tree analysis address sensitivity and decisiveness.

Conclusion: AAG method provides improved causal direction identification compared to existing approaches, with systematic hyperparameter tuning and decision tree analysis enhancing reliability and understanding of method decisiveness.

Abstract: Identification of causal directionality in bivariate numerical data is a fundamental research problem with important practical implications. This paper presents two alternative methods to identify direction of causation by considering conditional distributions: (1) Anticipated Asymmetric Geometries (AAG) and (2) Monotonicity Index. The AAG method compares the actual conditional distributions to anticipated ones along two variables. Different comparison metrics, such as correlation, cosine similarity, Jaccard index, K-L divergence, K-S distance, and mutual information have been evaluated. Anticipated distributions have been projected as normal based on dual response statistics: mean and standard deviation. The Monotonicity Index approach compares the calculated monotonicity indexes of the gradients of conditional distributions along two axes and exhibits counts of gradient sign changes. Both methods assume stochastic properties of the bivariate data and exploit anticipated unimodality of conditional distributions of the effect. It turns out that the tuned AAG method outperforms the Monotonicity Index and reaches a top accuracy of 77.9% compared to ANMs accuracy of 63 +/- 10% when classifying 95 pairs of real-world examples (Mooij et al, 2014). The described methods include a number of hyperparameters that impact accuracy of the identification. For a given set of hyperparameters, both the AAG or Monotonicity Index method provide a unique deterministic outcome of the solution. To address sensitivity to hyperparameters, tuning of hyperparameters has been done by utilizing a full factorial Design of Experiment. A decision tree has been fitted to distinguish misclassified cases using the input data’s symmetrical bivariate statistics to address the question of: How decisive is the identification method of causal directionality?

[400] Constitutive parameterized deep energy method for solid mechanics problems with random material parameters

Zhangyong Liang, Huanhuan Gao

Main category: cs.LG

TL;DR: CPDEM is a physics-driven deep learning method that enables zero-shot inference of displacement fields for varying material parameters in solid mechanics without retraining or data generation.

Details

Motivation: Traditional FEM requires repeated mesh discretization for each parameter variation, data-driven models need massive datasets, and physics-informed methods require complete retraining for parameter changes - all computationally expensive for handling continuous material uncertainty.

Method: Reformulates strain energy density functional by encoding latent representation of stochastic constitutive parameters. Embeds material parameters directly into neural network alongside spatial coordinates, transforming spatial collocation points into parameter-aware material points. Trained via expected energy minimization over parameter domain in unsupervised manner.

Result: Enables zero-shot, real-time inference of displacement fields for unknown material parameters without dataset generation or model retraining. Validated across linear elasticity, finite-strain hyperelasticity, and complex highly nonlinear contact mechanics benchmarks.

Conclusion: CPDEM represents the first purely physics-driven deep learning paradigm capable of simultaneously and efficiently handling continuous multi-parameter variations in solid mechanics, bridging the gap between computational efficiency and material uncertainty modeling.

Abstract: In practical structural design and solid mechanics simulations, material properties inherently exhibit random variations within bounded intervals. However, evaluating mechanical responses under continuous material uncertainty remains a persistent challenge. Traditional numerical approaches, such as the Finite Element Method (FEM), incur prohibitive computational costs as they require repeated mesh discretization and equation solving for every parametric realization. Similarly, data-driven surrogate models depend heavily on massive, high-fidelity datasets, while standard physics-informed frameworks (e.g., the Deep Energy Method) strictly demand complete retraining from scratch whenever material parameters change. To bridge this critical gap, we propose the Constitutive Parameterized Deep Energy Method (CPDEM). In this purely physics-driven framework, the strain energy density functional is reformulated by encoding a latent representation of stochastic constitutive parameters. By embedding material parameters directly into the neural network alongside spatial coordinates, CPDEM transforms conventional spatial collocation points into parameter-aware material points. Trained in an unsupervised manner via expected energy minimization over the parameter domain, the pre-trained model continuously learns the solution manifold. Consequently, it enables zero-shot, real-time inference of displacement fields for unknown material parameters without requiring any dataset generation or model retraining. The proposed method is rigorously validated across diverse benchmarks, including linear elasticity, finite-strain hyperelasticity, and complex highly nonlinear contact mechanics. To the best of our knowledge, CPDEM represents the first purely physics-driven deep learning paradigm capable of simultaneously and efficiently handling continuous multi-parameter variations in solid mechanics.

[401] H-Node Attack and Defense in Large Language Models

Eric Yocam, Varghese Vaidyan, Yong Wang

Main category: cs.LG

TL;DR: H-Node ANC identifies specific hidden dimensions in LLMs responsible for hallucinations, uses them for adversarial attacks, and defends against them via targeted cancellation.

Details

Motivation: To develop a mechanistic understanding of hallucination representations in LLMs at the individual dimension level, enabling both exploitation and defense against hallucinations through targeted interventions.

Method: Uses logistic regression probes on last-token hidden states to identify Hallucination Nodes (H-Nodes), implements white-box adversarial attacks by amplifying these dimensions, and develops adaptive ANC defense with confidence-weighted cancellation and dynamic iterative re-ranking.

Result: Achieves probe AUC of 0.90 across four architectures, attack selectivity of 3.02x with <10% visibility, reduces activation drift by 33-42%, recovers up to 0.69 robustness from 8% baseline, with minimal perplexity impact (<5%) and MMLU degradation (≤3%).

Conclusion: H-Node ANC provides a precise mechanistic framework for understanding, exploiting, and defending against hallucinations in LLMs through targeted dimension-level interventions without significantly impairing general reasoning capabilities.

Abstract: We present H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework that identifies, exploits, and defends hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes hallucination signal to a small set of high-variance dimensions – termed Hallucination Nodes (H-Nodes) – with probe AUC reaching 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving a selectivity of 3.02x with less than 10% visibility to the defender. Adaptive ANC defense suppresses H-Node excess in-pass using confidence-weighted cancellation, reducing grounded activation drift by 33-42% over static cancellation. A dynamic iterative extension that re-ranks cancellation targets across successive passes recovers up to 0.69 robustness from a single-pass baseline of 8%. All contributions are validated on OPT-125M, Phi-3-mini-4k-instruct, LLaMA-3-8B-Instruct, and Mistral-7B-Instruct-v0.3 (125M-8B parameters). Perplexity impact is surgical (<5%) and MMLU degradation is at most 3%, confirming that the defense does not impair general reasoning capability.

[402] Adversarial Bandit Optimization with Globally Bounded Perturbations to Linear Losses

Zhuoyu Cheng, Kohei Hatano, Eiji Takimoto

Main category: cs.LG

TL;DR: Adversarial bandit optimization with non-convex, non-smooth losses containing linear components and budget-constrained perturbations, with regret guarantees and improved bounds for classical linear bandits.

Details

Motivation: The paper addresses adversarial bandit optimization problems where loss functions can be non-convex and non-smooth, which are more challenging than traditional convex settings. The model includes both linear components and perturbations with cumulative budget constraints, aiming to provide theoretical guarantees for more realistic scenarios.

Method: The authors study a class of adversarial bandit problems where losses consist of underlying linear components plus perturbations applied after action selection. Perturbations are measured relative to linear losses and constrained by a global cumulative budget. They establish both expected and high-probability regret guarantees under this model.

Result: The paper establishes expected and high-probability regret guarantees for the adversarial bandit model with non-convex, non-smooth losses. As a special case, they recover an improved high-probability regret bound for classical bandit linear optimization (setting without perturbations). They also complement upper bounds with a lower bound on expected regret.

Conclusion: The work provides theoretical foundations for adversarial bandit optimization with non-convex, non-smooth losses and budget-constrained perturbations, offering improved bounds for classical linear bandits as a special case and establishing fundamental limits through lower bounds.

Abstract: We study a class of adversarial bandit optimization problems in which the loss functions may be non-convex and non-smooth. In each round, the learner observes a loss that consists of an underlying linear component together with an additional perturbation applied after the learner selects an action. The perturbations are measured relative to the linear losses and are constrained by a global budget that bounds their cumulative magnitude over time. Under this model, we establish both expected and high-probability regret guarantees. As a special case of our analysis, we recover an improved high-probability regret bound for classical bandit linear optimization, which corresponds to the setting without perturbations. We further complement our upper bounds by proving a lower bound on the expected regret.

[403] Selective Deficits in LLM Mental Self-Modeling in a Behavior-Based Test of Theory of Mind

Christopher Ackerman

Main category: cs.LG

TL;DR: LLMs show human-level performance in modeling others’ mental states but fail at self-modeling without reasoning traces, suggesting they use limited working memory for Theory of Mind tasks.

Details

Motivation: To determine whether LLMs have actually learned causal Theory of Mind models that can be deployed in arbitrary settings, rather than just mimicking patterns from training data.

Method: Developed a novel experimental paradigm requiring strategic action based on mental state representations rather than just description. Tested wide range of open/closed source LLMs (2024+) and human subjects on self- and other-modeling tasks.

Result: 1) Pre-mid-2025 LLMs failed all tasks; 2) Recent LLMs achieved human-level performance on modeling others’ cognitive states; 3) Frontier LLMs failed self-modeling unless given reasoning traces; 4) Showed cognitive load effects; 5) Models engaged in strategic deception.

Conclusion: LLMs can model others’ mental states at human level but struggle with self-modeling without explicit reasoning support, suggesting they use limited-capacity working memory for mental representations during inference.

Abstract: The ability to represent oneself and others as agents with knowledge, intentions, and belief states that guide their behavior - Theory of Mind - is a human universal that enables us to navigate - and manipulate - the social world. It is supported by our ability to form mental models of ourselves and others. Its ubiquity in human affairs entails that LLMs have seen innumerable examples of it in their training data and therefore may have learned to mimic it, but whether they have actually learned causal models that they can deploy in arbitrary settings is unclear. We therefore develop a novel experimental paradigm that requires that subjects form representations of the mental states of themselves and others and act on them strategically rather than merely describe them. We test a wide range of leading open and closed source LLMs released since 2024, as well as human subjects, on this paradigm. We find that 1) LLMs released before mid-2025 fail at all of our tasks, 2) more recent LLMs achieve human-level performance on modeling the cognitive states of others, and 3) even frontier LLMs fail at our self-modeling task - unless afforded a scratchpad in the form of a reasoning trace. We further demonstrate cognitive load effects on other-modeling tasks, offering suggestive evidence that LLMs are using something akin to limited-capacity working memory to hold these mental representations in mind during a single forward pass. Finally, we explore the mechanisms by which reasoning models succeed at the self- and other-modeling tasks, and show that they readily engage in strategic deception.

[404] AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation

Hyeongyu Kim, Geonhui Han, Dosik Hwang

Main category: cs.LG

TL;DR: AcTTA is a test-time adaptation framework that adapts activation functions (like ReLU, GELU) during inference to handle distribution shifts, outperforming normalization-based methods on corrupted image datasets.

Details

Motivation: Existing test-time adaptation methods focus mainly on recalibrating normalization layers through affine modulation, overlooking the important role of activation functions in representation dynamics. The authors aim to explore activation adaptation as a complementary approach to address domain shifts during inference.

Method: AcTTA reformulates conventional activation functions into parameterized forms that can shift response thresholds and modulate gradient sensitivity. This allows adaptive updating of activation behavior at test time without modifying network weights or requiring source data. The method enables continuous adjustment of activation behavior to handle distribution shifts.

Result: AcTTA consistently surpasses normalization-based TTA methods across CIFAR10-C, CIFAR100-C, and ImageNet-C benchmarks. It achieves robust and stable adaptation across diverse corruptions while maintaining simplicity.

Conclusion: Activation adaptation provides a compact and effective route for domain-shift-robust test-time learning, broadening the prevailing affine-centric view of adaptation. The work highlights activation functions as an overlooked but influential component for test-time adaptation.

Abstract: Test-time adaptation (TTA) aims to mitigate performance degradation under distribution shifts by updating model parameters during inference. Existing approaches have primarily framed adaptation around affine modulation, focusing on recalibrating normalization layers. This perspective, while effective, overlooks another influential component in representation dynamics: the activation function. We revisit this overlooked space and propose AcTTA, an activation-aware framework that reinterprets conventional activation functions from a learnable perspective and updates them adaptively at test time. AcTTA reformulates conventional activation functions (e.g., ReLU, GELU) into parameterized forms that shift their response threshold and modulate gradient sensitivity, enabling the network to adjust activation behavior under domain shifts. This functional reparameterization enables continuous adjustment of activation behavior without modifying network weights or requiring source data. Despite its simplicity, AcTTA achieves robust and stable adaptation across diverse corruptions. Across CIFAR10-C, CIFAR100-C, and ImageNet-C, AcTTA consistently surpasses normalization-based TTA methods. Our findings highlight activation adaptation as a compact and effective route toward domain-shift-robust test-time learning, broadening the prevailing affine-centric view of adaptation.

[405] Dynamic Tokenization via Reinforcement Patching: End-to-end Training and Zero-shot Transfer

Yulun Wu, Sravan Kumar Ankireddy, Samuel Sharpe, Nikita Seleznev, Dehao Yuan, Hyeji Kim, Nam H. Nguyen

Main category: cs.LG

TL;DR: ReinPatch uses reinforcement learning to jointly optimize sequence patching policies and downstream models for time-series data, enabling data-adaptive variable-sized patches without heuristic rules.

Details

Motivation: Current methods for learning data-adaptive representations for long-horizon sequence data (like time series) rely on fixed-size patching, soft discretization, specific backbones, or heuristic rules. There's a need for a framework that can discover variable-sized, data-driven patches end-to-end while maintaining efficiency.

Method: Proposes Reinforcement Patching (ReinPatch) that formulates patch boundary placement as a discrete decision process optimized via Group Relative Policy Gradient (GRPG). The framework jointly optimizes sequence patching policy and downstream backbone model using reinforcement learning, bypassing continuous relaxations and allowing strict enforcement of desired compression rates.

Result: Demonstrates compelling performance on time-series forecasting datasets compared to state-of-the-art data-driven patching strategies. The patching module can be extracted as a standalone foundation patcher, providing insights into segmentation behaviors preferred by performance-driven neural patching.

Conclusion: ReinPatch offers an effective reinforcement learning approach for data-adaptive sequence patching that enables hierarchical modeling, compression control, and efficient scaling of downstream models while providing interpretable patching behaviors.

Abstract: Efficiently aggregating spatial or temporal horizons to acquire compact representations has become a unifying principle in modern deep learning models, yet learning data-adaptive representations for long-horizon sequence data, especially continuous sequences like time series, remains an open challenge. While fixed-size patching has improved scalability and performance, discovering variable-sized, data-driven patches end-to-end often forces models to rely on soft discretization, specific backbones, or heuristic rules. In this work, we propose Reinforcement Patching (ReinPatch), the first framework to jointly optimize a sequence patching policy and its downstream sequence backbone model using reinforcement learning. By formulating patch boundary placement as a discrete decision process optimized via Group Relative Policy Gradient (GRPG), ReinPatch bypasses the need for continuous relaxations and performs dynamic patching policy optimization in a natural manner. Moreover, our method allows strict enforcement of a desired compression rate, freeing the downstream backbone to scale efficiently, and naturally supports multi-level hierarchical modeling. We evaluate ReinPatch on time-series forecasting datasets, where it demonstrates compelling performance compared to state-of-the-art data-driven patching strategies. Furthermore, our detached design allows the patching module to be extracted as a standalone foundation patcher, providing the community with visual and empirical insights into the segmentation behaviors preferred by a purely performance-driven neural patching strategy.

[406] Are LLM-Enhanced Graph Neural Networks Robust against Poisoning Attacks?

Yuhang Ma, Jie Wang, Zheng Yan

Main category: cs.LG

TL;DR: LLM-enhanced GNNs show improved robustness against poisoning attacks that manipulate both graph structures and textual attributes, with comprehensive evaluation revealing key factors for their resilience.

Details

Motivation: While LLM-enhanced GNNs achieve performance gains by enriching node representations with semantic features, their robustness against poisoning attacks that manipulate both graph structures and textual attributes remains unexplored, creating a critical research gap.

Method: Proposed a robustness assessment framework evaluating 24 victim models combining 8 LLM/LM-based feature enhancers with 3 GNN backbones, using 6 structural poisoning attacks and 3 textual poisoning attacks across 4 real-world datasets, including one post-LLM dataset to avoid ground truth leakage.

Result: LLM-enhanced GNNs exhibit significantly higher accuracy and lower Relative Drop in Accuracy than shallow embedding baselines across various attack settings, with key robustness factors identified including effective encoding of structural and label information in node representations.

Conclusion: LLM-enhanced GNNs demonstrate strong robustness against poisoning attacks, with insights for future research directions including combined attacks and graph purification defenses, providing a foundation for more secure multimodal graph learning systems.

Abstract: Large Language Models (LLMs) have advanced Graph Neural Networks (GNNs) by enriching node representations with semantic features, giving rise to LLM-enhanced GNNs that achieve notable performance gains. However, the robustness of these models against poisoning attacks, which manipulate both graph structures and textual attributes during training, remains unexplored. To bridge this gap, we propose a robustness assessment framework that systematically evaluates LLM-enhanced GNNs under poisoning attacks. Our framework enables comprehensive evaluation across multiple dimensions. Specifically, we assess 24 victim models by combining eight LLM- or Language Model (LM)-based feature enhancers with three representative GNN backbones. To ensure diversity in attack coverage, we incorporate six structural poisoning attacks (both targeted and non-targeted) and three textual poisoning attacks operating at the character, word, and sentence levels. Furthermore, we employ four real-world datasets, including one released after the emergence of LLMs, to avoid potential ground truth leakage during LLM pretraining, thereby ensuring fair evaluation. Extensive experiments show that LLM-enhanced GNNs exhibit significantly higher accuracy and lower Relative Drop in Accuracy (RDA) than a shallow embedding-based baseline across various attack settings. Our in-depth analysis identifies key factors that contribute to this robustness, such as the effective encoding of structural and label information in node representations. Based on these insights, we outline future research directions from both offensive and defensive perspectives, and propose a new combined attack along with a graph purification defense. To support future research, we release the source code of our framework at~\url{https://github.com/CyberAlSec/LLMEGNNRP}.

[407] Accurate Precipitation Forecast by Efficiently Learning from Massive Atmospheric Variables and Unbalanced Distribution

Shuangliang Li, Siwei Li, Li Li, Weijie Zou, Jie Yang, Maolin Zhang

Main category: cs.LG

TL;DR: A novel precipitation forecasting model that efficiently utilizes multi-source atmospheric data with a specialized loss function for handling extreme precipitation events.

Details

Motivation: Short-term precipitation forecasting is crucial for socioeconomic activities and public safety, but current models struggle with complex precipitation patterns, extreme sample imbalance, and inefficient use of multi-source atmospheric data.

Method: Developed a forecasting model that automatically extracts and iteratively predicts latent features from massive atmospheric observations, and introduced a ‘WMCE’ loss function to handle extreme precipitation events.

Result: Extensive experiments on two datasets show the model substantially outperforms all prevalent baselines in both accuracy and efficiency, while significantly lowering computational costs.

Conclusion: The proposed model represents a milestone for efficient and practical precipitation forecasting by effectively utilizing multi-source data and handling extreme precipitation events.

Abstract: Short-term (0-24 hours) precipitation forecasting is highly valuable to socioeconomic activities and public safety. However, the highly complex evolution patterns of precipitation events, the extreme imbalance between precipitation and non-precipitation samples, and the inability of existing models to efficiently and effectively utilize large volumes of multi-source atmospheric observation data hinder improvements in precipitation forecasting accuracy and computational efficiency. To address the above challenges, this study developed a novel forecasting model capable of effectively and efficiently utilizing massive atmospheric observations by automatically extracting and iteratively predicting the latent features strongly associated with precipitation evolution. Furthermore, this study introduces a ‘WMCE’ loss function, designed to accurately discriminate extremely scarce precipitation events while precisely predicting their intensity values. Extensive experiments on two datasets demonstrate that our proposed model substantially and consistently outperforms all prevalent baselines in both accuracy and efficiency. Moreover, the proposed forecasting model substantially lowers the computational cost required to obtain valuable predictions compared to existing approaches, thereby positioning it as a milestone for efficient and practical precipitation forecasting.

[408] DPD-Cancer: Explainable Graph-based Deep Learning for Small Molecule Anti-Cancer Activity Prediction

Magnus H. Strømme, Alex G. C. de Sá, David B. Ascher

Main category: cs.LG

TL;DR: DPD-Cancer: Graph Attention Transformer model for predicting anti-cancer drug responses and pGI50 values with attention-based explainability.

Details

Motivation: Accurate drug response prediction in cancer is limited by challenges in modeling molecular structure-cellular context interplay, tumor heterogeneity, and genomic variability. Conventional approaches fail to capture non-linear relationships across diverse cell lines.

Method: Deep learning method based on Graph Attention Transformer (GAT) framework for small molecule anti-cancer activity classification and quantitative prediction of cell-line specific growth inhibition concentration (pGI50).

Result: Superior performance with AUC up to 0.87 on NCI60 data and up to 0.98 on ACLPred/MLASM datasets. For pGI50 prediction across 10 cancer types and 73 cell lines, achieved Pearson’s correlation coefficients up to 0.72 on independent test sets.

Conclusion: Attention-based mechanisms offer significant advantages in extracting meaningful molecular representations, establishing DPD-Cancer as a competitive tool for drug candidate prioritization with explainability through attention visualization of molecular substructures.

Abstract: Accurate drug response prediction is a critical bottleneck in computational biochemistry, limited by the challenge of modelling the interplay between molecular structure and cellular context. In cancer research, this is acute due to tumour heterogeneity and genomic variability, which hinder the identification of effective therapies. Conventional approaches often fail to capture non-linear relationships between chemical features and biological outcomes across diverse cell lines. To address this, we introduce DPD-Cancer, a deep learning method based on a Graph Attention Transformer (GAT) framework. It is designed for small molecule anti-cancer activity classification and the quantitative prediction of cell-line specific responses, specifically growth inhibition concentration (pGI50). Benchmarked against state-of-the-art methods (pdCSM-cancer, ACLPred, and MLASM), DPD-Cancer demonstrated superior performance, achieving an Area Under ROC Curve (AUC) of up to 0.87 on strictly partitioned NCI60 data and up to 0.98 on ACLPred/MLASM datasets. For pGI50 prediction across 10 cancer types and 73 cell lines, the model achieved Pearson’s correlation coefficients of up to 0.72 on independent test sets. These findings confirm that attention-based mechanisms offer significant advantages in extracting meaningful molecular representations, establishing DPD-Cancer as a competitive tool for prioritising drug candidates. Furthermore, DPD-Cancer provides explainability by leveraging the attention mechanism to identify and visualise specific molecular substructures, offering actionable insights for lead optimisation. DPD-Cancer is freely available as a web server at: https://biosig.lab.uq.edu.au/dpd_cancer/.

[409] TinyML for Acoustic Anomaly Detection in IoT Sensor Networks

Amar Almaini, Jakob Folz, Ghadeer Ashour

Main category: cs.LG

TL;DR: A TinyML pipeline for acoustic anomaly detection in IoT sensor networks using MFCC features and lightweight neural networks deployed on edge devices.

Details

Motivation: Cloud-based acoustic monitoring in IoT systems faces challenges with latency, power consumption, and privacy. There's a need for real-time, energy-efficient sound anomaly detection directly on edge devices.

Method: Extract Mel Frequency Cepstral Coefficients (MFCCs) from sound signals and train a lightweight neural network classifier optimized for deployment on microcontrollers/edge devices using the UrbanSound8K dataset.

Result: Achieved 91% test accuracy and balanced F1-scores of 0.91 across both normal and anomalous sound classes, demonstrating reliable embedded acoustic anomaly detection.

Conclusion: The compact TinyML pipeline enables feasible, reliable, and scalable acoustic anomaly detection for responsive IoT deployments by processing sound data directly on edge devices.

Abstract: Tiny Machine Learning enables real-time, energy-efficient data processing directly on microcontrollers, making it ideal for Internet of Things sensor networks. This paper presents a compact TinyML pipeline for detecting anomalies in environmental sound within IoT sensor networks. Acoustic monitoring in IoT systems can enhance safety and context awareness, yet cloud-based processing introduces challenges related to latency, power usage, and privacy. Our pipeline addresses these issues by extracting Mel Frequency Cepstral Coefficients from sound signals and training a lightweight neural network classifier optimized for deployment on edge devices. The model was trained and evaluated using the UrbanSound8K dataset, achieving a test accuracy of 91% and balanced F1-scores of 0.91 across both normal and anomalous sound classes. These results demonstrate the feasibility and reliability of embedded acoustic anomaly detection for scalable and responsive IoT deployments.

[410] PEANUT: Perturbations by Eigenvalue Alignment for Attacking GNNs Under Topology-Driven Message Passing

Bhavya Kohli, Biplab Sikdar

Main category: cs.LG

TL;DR: PEANUT is a gradient-free black-box attack on Graph Neural Networks that injects virtual nodes to exploit GNN vulnerabilities in graph topology consumption, requiring no node features and operating at inference time.

Details

Motivation: GNNs are vulnerable to small perturbations in graph structure, raising robustness concerns for real-world deployment. Current attacks often require graph modification or extensive optimization, while injection-based attacks are more practical but under-explored.

Method: PEANUT is a gradient-free, restricted black-box attack that injects virtual nodes into graphs. It operates at inference phase (evasion attack), requires no node features (can use zero features), and doesn’t need surrogate models or lengthy optimization. The attack exploits GNNs’ explicit consumption of graph topology via adjacency/Laplacian matrices.

Result: Extensive experiments on real-world datasets across three graph tasks demonstrate PEANUT’s effectiveness despite its simplicity. The attack significantly deteriorates GNN performance even with injected nodes having zero features.

Conclusion: PEANUT reveals critical vulnerabilities in GNNs’ reliance on graph topology, showing that simple injection attacks can effectively compromise performance. This highlights the need for more robust GNN architectures against such practical attacks.

Abstract: Graph Neural Networks (GNNs) have achieved remarkable performance on tasks involving relational data. However, small perturbations to the graph structure can significantly alter GNN outputs, raising concerns about their robustness in real-world deployments. In this work, we explore the core vulnerability of GNNs which explicitly consume graph topology in the form of the adjacency matrix or Laplacian as a means for message passing, and propose PEANUT, a simple, gradient-free, restricted black-box attack that injects virtual nodes to capitalize on this vulnerability. PEANUT is a injection based attack, which is widely considered to be more practical and realistic scenario than graph modification attacks, where the attacker is able to modify the original graph structure directly. Our method works at the inference phase, making it an evasion attack, and is applicable almost immediately, since it does not involve lengthy iterative optimizations or parameter learning, which add computational and time overhead, or training surrogate models, which are susceptible to failure due to differences in model priors and generalization capabilities. PEANUT also does not require any features on the injected node and consequently demonstrates that GNN performance can be significantly deteriorated even with injected nodes with zeros for features, highlighting the significance of effectively designed connectivity in such attacks. Extensive experiments on real-world datasets across three graph tasks demonstrate the effectiveness of our attack despite its simplicity.

[411] PruneFuse: Efficient Data Selection via Weight Pruning and Network Fusion

Humaira Kousar, Hasnain Irshad Bhatti, Jaekyun Moon

Main category: cs.LG

TL;DR: PruneFuse: A two-stage data selection method that uses pruned networks for efficient sample selection and then fuses them with the original network to optimize training efficiency and performance.

Details

Motivation: Traditional data selection methods for deep neural networks suffer from high computational costs, limiting scalability and practical use. There's a need for more efficient approaches to reduce annotation requirements and training time.

Method: Two-stage approach: 1) Apply structured pruning to create a smaller network that maintains structural coherence with the original, train it to select informative samples. 2) Fuse the trained pruned network with the original network to leverage insights gained during pruning while allowing discovery of more robust solutions.

Result: Extensive experiments show PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process across various datasets.

Conclusion: PruneFuse offers an effective strategy for efficient data selection that balances computational efficiency with model performance, addressing scalability limitations of traditional methods.

Abstract: Efficient data selection is crucial for enhancing the training efficiency of deep neural networks and minimizing annotation requirements. Traditional methods often face high computational costs, limiting their scalability and practical use. We introduce PruneFuse, a novel strategy that leverages pruned networks for data selection and later fuses them with the original network to optimize training. PruneFuse operates in two stages: First, it applies structured pruning to create a smaller pruned network that, due to its structural coherence with the original network, is well-suited for the data selection task. This small network is then trained and selects the most informative samples from the dataset. Second, the trained pruned network is seamlessly fused with the original network. This integration leverages the insights gained during the training of the pruned network to facilitate the learning process of the fused network while leaving room for the network to discover more robust solutions. Extensive experimentation on various datasets demonstrates that PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process.

[412] On the Complexity of Optimal Graph Rewiring for Oversmoothing and Oversquashing in Graph Neural Networks

Mostafa Haghir Chehreghani

Main category: cs.LG

TL;DR: Theoretical analysis showing that optimizing graph topology to mitigate GNN oversmoothing and oversquashing problems is NP-hard, establishing fundamental computational limits for graph rewiring approaches.

Details

Motivation: GNNs suffer from oversmoothing (node representations become indistinguishable) and oversquashing (information fails to propagate through bottlenecks) when scaled to deep architectures. Since both issues are tied to graph structure, researchers want to know if optimizing graph topology can mitigate these problems, but the computational complexity of such optimization needs investigation.

Method: Formulates oversmoothing and oversquashing mitigation as graph optimization problems based on spectral gap (for oversmoothing) and conductance (for oversquashing). Proves exact optimization for either problem is NP-hard through reductions from Minimum Bisection, establishing NP-completeness of the decision versions.

Result: Proves that exact optimization for mitigating oversmoothing and oversquashing through graph topology optimization is NP-hard. Provides theoretical foundations showing fundamental computational limits of graph rewiring for GNN optimization.

Conclusion: The NP-hardness results justify the use of approximation algorithms and heuristic methods in practice for graph rewiring in GNNs, as exact optimization is computationally intractable. The work establishes theoretical foundations for understanding the fundamental limits of graph structure optimization for GNN performance.

Abstract: Graph Neural Networks (GNNs) face two fundamental challenges when scaled to deep architectures: oversmoothing, where node representations converge to indistinguishable vectors, and oversquashing, where information from distant nodes fails to propagate through bottlenecks. Both phenomena are intimately tied to the underlying graph structure, raising a natural question: can we optimize the graph topology to mitigate these issues? This paper provides a theoretical investigation of the computational complexity of such graph structure optimization. We formulate oversmoothing and oversquashing mitigation as graph optimization problems based on spectral gap and conductance, respectively. We prove that exact optimization for either problem is NP-hard through reductions from Minimum Bisection, establishing NP-completeness of the decision versions. Our results provide theoretical foundations for understanding the fundamental limits of graph rewiring for GNN optimization and justify the use of approximation algorithms and heuristic methods in practice.

[413] DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang

Main category: cs.LG

TL;DR: DataFlex is a unified framework for data-centric dynamic training of LLMs, supporting sample selection, domain mixture adjustment, and sample reweighting with extensible components and compatibility with existing training workflows.

Details

Motivation: Existing data-centric training approaches for LLMs are developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. There's a need for a unified framework that supports multiple data optimization paradigms while maintaining compatibility with standard training workflows.

Method: DataFlex is built upon LLaMA-Factory and supports three major paradigms: sample selection, domain mixture adjustment, and sample reweighting. It provides extensible trainer abstractions and modular components, unifying key model-dependent operations like embedding extraction, inference, and gradient computation with support for large-scale settings including DeepSpeed ZeRO-3.

Result: Dynamic data selection consistently outperforms static full-data training on MMLU across Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations.

Conclusion: DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs, enabling better integration and comparison of data optimization methods while maintaining compatibility with existing training workflows.

Abstract: Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.

[414] Can AI Scientist Agents Learn from Lab-in-the-Loop Feedback? Evidence from Iterative Perturbation Discovery

Gilles Wainrib, Barbara Bodinier, Haitem Dakhli, Josep Monserrat, Almudena Espin Perez, Sabrina Carpentier, Roberta Codato, John Klein

Main category: cs.LG

TL;DR: LLMs can perform genuine in-context learning for scientific experimental design when they reach sufficient capability thresholds, with feedback-driven learning showing significant improvements over zero-shot baselines in biological screening experiments.

Details

Motivation: To investigate whether large language models can genuinely perform in-context learning for scientific experimental design, particularly whether they can effectively use experimental feedback to improve hypothesis generation in biological screening experiments.

Method: Conducted 800 replicated experiments on iterative perturbation discovery in Cell Painting high-content screening. Compared LLM agents using experimental feedback to zero-shot baselines. Introduced random feedback control with permuted labels to test genuine learning vs. pretraining knowledge recall. Examined model capability effects by comparing Claude Sonnet 4.5 vs 4.6.

Result: Feedback access yielded +53.4% increase in discoveries per feature (p=0.003). Random feedback control eliminated performance gains, confirming genuine feedback-driven learning. Model capability significantly affected results: upgrading from Claude 4.5 to 4.6 reduced gene hallucination from ~33-45% to ~3-9%, converting non-significant ICL effect to large, significant improvement (+11.0 hits, p=0.003).

Conclusion: Effective in-context learning from experimental feedback emerges only when LLMs reach sufficient capability thresholds, demonstrating genuine feedback-driven learning rather than just pretraining knowledge retrieval in scientific experimental design tasks.

Abstract: Recent work has questioned whether large language models (LLMs) can perform genuine in-context learning (ICL) for scientific experimental design, with prior studies suggesting that LLM-based agents exhibit no sensitivity to experimental feedback. We shed new light on this question by carrying out 800 independently replicated experiments on iterative perturbation discovery in Cell Painting high-content screening. We compare an LLM agent that iteratively updates its hypotheses using experimental feedback to a zero-shot baseline that relies solely on pretraining knowledge retrieval. Access to feedback yields a $+53.4%$ increase in discoveries per feature on average ($p = 0.003$). To test whether this improvement arises from genuine feedback-driven learning rather than prompt-induced recall of pretraining knowledge, we introduce a random feedback control in which hit/miss labels are permuted. Under this control, the performance gain disappears, indicating that the observed improvement depends on the structure of the feedback signal ($+13.0$ hits, $p = 0.003$). We further examine how model capability affects feedback utilization. Upgrading from Claude Sonnet 4.5 to 4.6 reduces gene hallucination rates from ${\sim}33%$–$45%$ to ${\sim}3$–$9%$, converting a non-significant ICL effect ($+0.8$, $p = 0.32$) into a large and highly significant improvement ($+11.0$, $p=0.003$) for the best ICL strategy. These results suggest that effective in-context learning from experimental feedback emerges only once models reach a sufficient capability threshold.

[415] Geometric Evolution Graph Convolutional Networks: Enhancing Graph Representation Learning via Ricci Flow

Jicheng Ma, Yunyan Yang, Juan Zhao, Liang Zhao

Main category: cs.LG

TL;DR: GEGCN enhances graph representation learning by modeling geometric evolution using LSTM to capture structural sequences from discrete Ricci flow, then infusing these dynamic representations into GCN for improved performance.

Details

Motivation: To improve graph representation learning by capturing geometric evolution and structural dynamics that traditional graph neural networks may miss, particularly for heterophilic graphs where connected nodes often have different labels.

Method: Uses discrete Ricci flow to generate structural sequences, employs LSTM to model these sequences and learn dynamic representations, then infuses these learned representations into a Graph Convolutional Network architecture.

Result: Achieves state-of-the-art performance on classification tasks across various benchmark datasets, with particularly outstanding performance on heterophilic graphs.

Conclusion: Modeling geometric evolution through discrete Ricci flow and LSTM significantly enhances graph representation learning, especially for challenging heterophilic graph scenarios.

Abstract: We introduce the Geometric Evolution Graph Convolutional Network (GEGCN), a novel framework that enhances graph representation learning by modeling geometric evolution on graphs. Specifically, GEGCN employs a Long Short-Term Memory to model the structural sequence generated by discrete Ricci flow, and the learned dynamic representations are infused into a Graph Convolutional Network. Extensive experiments demonstrate that GEGCN achieves state-of-the-art performance on classification tasks across various benchmark datasets, with its performance being particularly outstanding on heterophilic graphs.

[416] Optimization Trade-offs in Asynchronous Federated Learning: A Stochastic Networks Approach

Abdelkrim Alahyane, Céline Comte, Matthieu Jonckheere

Main category: cs.LG

TL;DR: A stochastic queueing-network framework for Generalized AsyncSGD that models computation and communication delays, providing closed-form expressions for throughput and convergence time, with optimization strategies reducing convergence time and energy consumption.

Details

Motivation: Synchronous federated learning suffers from straggler effect, while asynchronous approaches introduce gradient staleness and bias toward faster clients. Existing analyses lack closed-form characterizations of update throughput and gradient staleness, failing to model underlying queueing dynamics.

Method: Developed a stochastic queueing-network framework for Generalized AsyncSGD that jointly models random computation times at clients and central server, plus random uplink/downlink communication delays. Used product-form network theory to derive closed-form expressions for update throughput, communication round complexity, and expected wall-clock time to reach ε-stationary point.

Result: Derived closed-form expressions for update throughput and upper bounds for convergence metrics. Extended framework to quantify energy consumption under stochastic timing. Proposed gradient-based optimization strategies for routing and concurrency. Experiments on EMNIST showed 29%–46% reduction in convergence time and 36%–49% reduction in energy consumption compared to AsyncSGD.

Conclusion: The framework formally characterizes trade-offs between gradient staleness and wall-clock convergence speed, and between convergence speed and energy efficiency. Provides analytical tools for optimizing asynchronous federated learning systems.

Abstract: Synchronous federated learning scales poorly due to the straggler effect. Asynchronous algorithms increase the update throughput by processing updates upon arrival, but they introduce two fundamental challenges: gradient staleness, which degrades convergence, and bias toward faster clients under heterogeneous data distributions. Although algorithms such as AsyncSGD and Generalized AsyncSGD mitigate this bias via client-side task queues, most existing analyses neglect the underlying queueing dynamics and lack closed-form characterizations of the update throughput and gradient staleness. To close this gap, we develop a stochastic queueing-network framework for Generalized AsyncSGD that jointly models random computation times at the clients and the central server, as well as random uplink and downlink communication delays. Leveraging product-form network theory, we derive a closed-form expression for the update throughput, alongside closed-form upper bounds for both the communication round complexity and the expected wall-clock time required to reach an $ε$-stationary point. These results formally characterize the trade-off between gradient staleness and wall-clock convergence speed. We further extend the framework to quantify energy consumption under stochastic timing, revealing an additional trade-off between convergence speed and energy efficiency. Building on these analytical results, we propose gradient-based optimization strategies to jointly optimize routing and concurrency. Experiments on EMNIST demonstrate reductions of 29%–46% in convergence time and 36%–49% in energy consumption compared to AsyncSGD.

[417] Knowledge Distillation for Efficient Transformer-Based Reinforcement Learning in Hardware-Constrained Energy Management Systems

Pascal Henrich, Jonas Sievers, Maximilian Beichter, Thomas Blank, Ralf Mikut, Veit Hagenmeyer

Main category: cs.LG

TL;DR: Knowledge distillation transfers Decision Transformer battery control policies from large teacher models to compact student models for residential energy management, achieving up to 96% parameter reduction while preserving control performance.

Details

Motivation: Transformer-based RL (Decision Transformer) shows promise for residential energy management but is computationally demanding for resource-constrained controllers. Need to compress models for embedded deployment while maintaining control quality.

Method: Train high-capacity Decision Transformer teacher models on Ausgrid multi-building data using offline sequence-based framework. Distill knowledge to smaller student models by matching teacher actions. Evaluate various teacher-student configurations.

Result: Distillation preserves control performance (even improves up to 1%) while reducing parameters by up to 96%, inference memory by 90%, and inference time by 63%. Comparable improvements when distilling into same-capacity student models.

Conclusion: Knowledge distillation enables practical deployment of Decision Transformer control for residential energy management on resource-limited hardware by compressing models while maintaining performance.

Abstract: Transformer-based reinforcement learning has emerged as a strong candidate for sequential control in residential energy management. In particular, the Decision Transformer can learn effective battery dispatch policies from historical data, thereby increasing photovoltaic self-consumption and reducing electricity costs. However, transformer models are typically too computationally demanding for deployment on resource-constrained residential controllers, where memory and latency constraints are critical. This paper investigates knowledge distillation to transfer the decision-making behaviour of high-capacity Decision Transformer policies to compact models that are more suitable for embedded deployment. Using the Ausgrid dataset, we train teacher models in an offline sequence-based Decision Transformer framework on heterogeneous multi-building data. We then distil smaller student models by matching the teachers’ actions, thereby preserving control quality while reducing model size. Across a broad set of teacher-student configurations, distillation largely preserves control performance and even yields small improvements of up to 1%, while reducing the parameter count by up to 96%, the inference memory by up to 90%, and the inference time by up to 63%. Beyond these compression effects, comparable cost improvements are also observed when distilling into a student model of identical architectural capacity. Overall, our results show that knowledge distillation makes Decision Transformer control more applicable for residential energy management on resource-limited hardware.

[418] A Formal Framework for Uncertainty Analysis of Text Generation with Large Language Models

Steffen Herbold, Florian Lemmerich

Main category: cs.LG

TL;DR: A formal framework for measuring uncertainty in LLM text generation that considers prompt uncertainty, generation uncertainty, and interpretation uncertainty as interconnected autoregressive processes in a sampling tree.

Details

Motivation: Current approaches to uncertainty in LLM text generation are fragmented and don't comprehensively address all sources of uncertainty - not just the generation process itself, but also the prompt uncertainty and downstream interpretation uncertainty.

Method: Proposes a formal framework modeling prompting, generation, and interpretation as interconnected autoregressive processes that combine into a single sampling tree. Introduces filters and objective functions to express different aspects of uncertainty over this tree structure.

Result: The framework shows how existing uncertainty methods are formally related and can be reduced to a common core, while also revealing additional aspects of uncertainty that haven’t been studied yet.

Conclusion: Provides a unified theoretical foundation for understanding and measuring uncertainty in LLM text generation that encompasses all major sources of uncertainty in the pipeline.

Abstract: The generation of texts using Large Language Models (LLMs) is inherently uncertain, with sources of uncertainty being not only the generation of texts, but also the prompt used and the downstream interpretation. Within this work, we provide a formal framework for the measurement of uncertainty that takes these different aspects into account. Our framework models prompting, generation, and interpretation as interconnected autoregressive processes that can be combined into a single sampling tree. We introduce filters and objective functions to describe how different aspects of uncertainty can be expressed over the sampling tree and demonstrate how to express existing approaches towards uncertainty through these functions. With our framework we show not only how different methods are formally related and can be reduced to a common core, but also point out additional aspects of uncertainty that have not yet been studied.

[419] Improving Risk Stratification in Hypertrophic Cardiomyopathy: A Novel Score Combining Echocardiography, Clinical, and Medication Data

Marion Taconné, Valentina D. A. Corino, Annamaria Del Franco, Sara Giovani, Iacopo Olivotto, Adrien Al Wazzan, Erwan Donal, Pietro Cerveri, Luca Mainardi

Main category: cs.LG

TL;DR: ML-based risk score for hypertrophic cardiomyopathy using EHR data outperforms traditional ESC score in predicting 5-year cardiovascular outcomes.

Details

Motivation: Current risk stratification models for hypertrophic cardiomyopathy (like ESC score) have moderate performance, creating need for more accurate, explainable ML approaches using routine clinical data.

Method: Developed Random Forest ensemble model using echocardiographic, clinical, and medication data from EHRs. Trained on SHARE registry cohort (N=1,201), internally validated, and externally validated on independent Rennes Hospital cohort (N=382).

Result: ML model achieved AUC of 0.85 ± 0.02 (vs ESC score 0.56 ± 0.03). External validation showed superior risk separation (Log-rank p = 8.62×10⁻⁴ vs p = 0.0559). Model remained stable over time in event-free patients.

Conclusion: The explainable ML risk score offers superior predictive performance and longitudinal stability, providing promising tool for personalized HCM management.

Abstract: Hypertrophic cardiomyopathy (HCM) requires accurate risk stratification to inform decisions regarding ICD therapy and follow-up management. Current established models, such as the European Society of Cardiology (ESC) score, exhibit moderate discriminative performance. This study develops a robust, explainable machine learning (ML) risk score leveraging routinely collected echocardiographic, clinical, and medication data, typically contained within Electronic Health Records (EHRs), to predict a 5-year composite cardiovascular outcome in HCM patients. The model was trained and internally validated using a large cohort (N=1,201) from the SHARE registry (Florence Hospital) and externally validated on an independent cohort (N=382) from Rennes Hospital. The final Random Forest ensemble model achieved a high internal Area Under the Curve (AUC) of 0.85 +- 0.02, significantly outperforming the ESC score (0.56 +- 0.03). Critically, survival curve analysis on the external validation set showed superior risk separation for the ML score (Log-rank p = 8.62 x 10^(-4) compared to the ESC score (p = 0.0559). Furthermore, longitudinal analyses demonstrate that the proposed risk score remains stable over time in event-free patients. The model high interpretability and its capacity for longitudinal risk monitoring represent promising tools for the personalized clinical management of HCM.

[420] Contrastive Conformal Sets

Yahya Alkhatib, Wee Peng Tay

Main category: cs.LG

TL;DR: Conformal contrastive learning with minimum-volume covering sets that guarantee positive sample coverage while maximizing negative sample exclusion through learned set geometry.

Details

Motivation: Existing contrastive learning methods lack principled guarantees on coverage within the semantic feature space, particularly regarding systematic control over positive sample inclusion and negative sample exclusion.

Method: Extends conformal prediction to contrastive learning by introducing minimum-volume covering sets with learnable generalized multi-norm constraints. Constructs conformal sets that guarantee user-specified coverage of positive samples while maximizing negative sample exclusion through volume minimization as a proxy for negative exclusion.

Result: Experiments on simulated and real-world image datasets show improved inclusion-exclusion trade-offs compared to standard distance-based conformal baselines.

Conclusion: The approach provides distribution-free coverage guarantees for positive samples while effectively excluding negative samples through learned set geometry, even when negative pairs are unavailable.

Abstract: Contrastive learning produces coherent semantic feature embeddings by encouraging positive samples to cluster closely while separating negative samples. However, existing contrastive learning methods lack principled guarantees on coverage within the semantic feature space. We extend conformal prediction to this setting by introducing minimum-volume covering sets equipped with learnable generalized multi-norm constraints. We propose a method that constructs conformal sets guaranteeing user-specified coverage of positive samples while maximizing negative sample exclusion. We establish theoretically that volume minimization serves as a proxy for negative exclusion, enabling our approach to operate effectively even when negative pairs are unavailable. The positive inclusion guarantee inherits the distribution-free coverage property of conformal prediction, while negative exclusion is maximized through learned set geometry optimized on a held-out training split. Experiments on simulated and real-world image datasets demonstrate improved inclusion-exclusion trade-offs compared to standard distance-based conformal baselines.

[421] Topology-Aware Graph Reinforcement Learning for Energy Storage Systems Optimal Dispatch in Distribution Networks

Shuyi Gao, Stavros Orfanoudakis, Shengren Hou, Peter Palensky, Pedro P. Vergara

Main category: cs.LG

TL;DR: GNN-based reinforcement learning for energy storage dispatch in distribution networks, using TD3 with graph neural networks to handle topology changes and improve voltage security.

Details

Motivation: Need for fast online decision-making in energy storage system dispatch that can handle time-varying conditions and topology changes while improving operating economy and voltage security.

Method: Developed a topology-aware Reinforcement Learning architecture using Twin Delayed Deep Deterministic Policy Gradient (TD3) with graph neural networks (GNNs) as graph feature encoders. Evaluated three GNN variants: GCNs, TAGConv, and GATs on 34-bus and 69-bus systems.

Result: GNN-based controllers consistently reduce voltage violations, with clearer benefits on larger systems and under topology reconfiguration. TD3-GCN and TD3-TAGConv achieve lower saved cost relative to NLP benchmark than NN baseline on 69-bus system. Transfer learning results are case-dependent with performance degradation on fundamentally different systems.

Conclusion: GNN-based RL is effective for energy storage dispatch in distribution networks, particularly for handling topology changes and improving voltage security, though transferability between different systems requires careful consideration.

Abstract: Optimal dispatch of energy storage systems (ESSs) in distribution networks involves jointly improving operating economy and voltage security under time-varying conditions and possible topology changes. To support fast online decision making, we develop a topology-aware Reinforcement Learning architecture based on Twin Delayed Deep Deterministic Policy Gradient (TD3), which integrates graph neural networks (GNNs) as graph feature encoders for ESS dispatch. We conduct a systematic investigation of three GNN variants: graph convolutional networks (GCNs), topology adaptive graph convolutional networks (TAGConv), and graph attention networks (GATs) on the 34-bus and 69-bus systems, and evaluate robustness under multiple topology reconfiguration cases as well as cross-system transfer between networks with different system sizes. Results show that GNN-based controllers consistently reduce the number and magnitude of voltage violations, with clearer benefits on the 69-bus system and under reconfiguration; on the 69-bus system, TD3-GCN and TD3-TAGConv also achieve lower saved cost relative to the NLP benchmark than the NN baseline. We also highlight that transfer gains are case-dependent, and zero-shot transfer between fundamentally different systems results in notable performance degradation and increased voltage magnitude violations. This work is available at: https://github.com/ShuyiGao/GNNs_RL_ESSs and https://github.com/distributionnetworksTUDelft/GNNs_RL_ESSs.

[422] D-GATNet: Interpretable Temporal Graph Attention Learning for ADHD Identification Using Dynamic Functional Connectivity

Qurat Ul Ain, Alptekin Temizel, Soyiba Jawed

Main category: cs.LG

TL;DR: D-GATNet: Interpretable temporal graph-based framework for ADHD classification using dynamic functional connectivity from fMRI data, achieving state-of-the-art performance with attention mechanisms for interpretability.

Details

Motivation: ADHD diagnosis using neuroimaging is challenging due to complex time-varying brain connectivity disruptions. Existing deep learning approaches often use static connectivity features and lack interpretability, while dynamic connectivity modeling is underexplored.

Method: Proposes D-GATNet framework using sliding-window Pearson correlation to construct sequences of functional brain graphs. Uses Graph Attention Network for spatial dependencies, 1D convolution for temporal dynamics, and temporal attention for interpretability.

Result: Achieved 85.18% ±5.64 balanced accuracy and 0.881 AUC on ADHD-200 dataset, outperforming state-of-the-art methods. Attention analysis revealed cerebellar and default mode network disruptions as potential biomarkers.

Conclusion: D-GATNet provides an effective interpretable framework for ADHD classification using dynamic functional connectivity, with attention mechanisms offering insights into neuroimaging biomarkers for the disorder.

Abstract: Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder whose neuroimaging-based diagnosis remains challenging due to complex time-varying disruptions in brain connectivity. Functional MRI (fMRI) provides a powerful non-invasive modality for identifying functional alterations. Existing deep learning (DL) studies employ diverse neuroimaging features; however, static functional connectivity remains widely used, whereas dynamic connectivity modeling is comparatively underexplored. Moreover, many DL models lack interpretability. In this work, we propose D-GATNet, an interpretable temporal graph-based framework for automated ADHD classification using dynamic functional connectivity (dFC). Sliding-window Pearson correlation constructs sequences of functional brain graphs with regions of interest as nodes and connectivity strengths as edges. Spatial dependencies are learned via a multi-layer Graph Attention Network, while temporal dynamics are modeled using 1D convolution followed by temporal attention. Interpretability is achieved through graph attention weights revealing dominant ROI interactions, ROI importance scores identifying influential regions, and temporal attention emphasizing informative connectivity segments. Experiments on the Peking University site of the ADHD-200 dataset using stratified 10-fold cross-validation with a 5-seed ensemble achieved 85.18% +_5.64 balanced accuracy and 0.881 AUC, outperforming state-of-the-art methods. Attention analysis reveals cerebellar and default mode network disruptions, indicating potential neuroimaging biomarkers.

[423] Curvature-aware Expected Free Energy as an Acquisition Function for Bayesian Optimization

Ajith Anil Meera, Wouter Kouw

Main category: cs.LG

TL;DR: Proposes Expected Free Energy-based acquisition function for Bayesian optimization to simultaneously optimize and learn functions, with theoretical connections to existing methods and adaptive curvature-aware updates.

Details

Motivation: Addresses the joint learning and optimization problem in Bayesian optimization where both optimizing an objective function and learning the underlying function model need to happen simultaneously. Existing acquisition functions may not effectively balance exploration and exploitation in this dual-task setting.

Method: Develops Expected Free Energy (EFE) as a novel acquisition function for Bayesian optimization. Shows theoretical connections: under specific assumptions, EFE reduces to Upper Confidence Bound (UCB), Lower Confidence Bound (LCB), and Expected Information Gain (EIG). Introduces a curvature-aware update law for EFE to adapt to function characteristics.

Result: Proves EFE has unbiased convergence guarantees for concave functions. Demonstrates proof of concept on Van der Pol oscillator system identification problem. Simulation experiments show adaptive EFE-based acquisition outperforms state-of-the-art methods with least final simple regret and error in learning the Gaussian process.

Conclusion: Expected Free Energy provides a principled framework for joint optimization and learning in Bayesian optimization, with theoretical guarantees and practical advantages over existing acquisition functions.

Abstract: We propose an Expected Free Energy-based acquisition function for Bayesian optimization to solve the joint learning and optimization problem, i.e., optimize and learn the underlying function simultaneously. We show that, under specific assumptions, Expected Free Energy reduces to Upper Confidence Bound, Lower Confidence Bound, and Expected Information Gain. We prove that Expected Free Energy has unbiased convergence guarantees for concave functions. Using the results from these derivations, we introduce a curvature-aware update law for Expected Free Energy and show its proof of concept using a system identification problem on a Van der Pol oscillator. Through rigorous simulation experiments, we show that our adaptive Expected Free Energy-based acquisition function outperforms state-of-the-art acquisition functions with the least final simple regret and error in learning the Gaussian process.

[424] Generative Modeling in Protein Design: Neural Representations, Conditional Generation, and Evaluation Standards

Senura Hansaja Wanasekara, Minh-Duong Nguyen, Xiaochen Liu, Nguyen H. Tran, Ken-Tye Yong

Main category: cs.LG

TL;DR: Survey paper synthesizing generative AI methods for protein research, covering representations, architectures, tasks, evaluation standards, and open challenges.

Details

Motivation: The literature on generative AI for protein research is fragmented across different representations, model classes, and task formulations, making it difficult to compare methods or establish appropriate evaluation standards. There's a need for systematic synthesis to unify architectural advances with practical evaluation standards.

Method: Systematic survey organized around: (1) foundational representations (sequence, geometric, multimodal encodings), (2) generative architectures (SE(3)-equivariant diffusion, flow matching, hybrid predictor-generator systems), and (3) task settings (structure prediction, de novo design, protein-ligand/protein-protein interactions).

Result: Provides comprehensive synthesis comparing assumptions, conditioning mechanisms, and controllability across methods. Synthesizes evaluation best practices emphasizing leakage-aware splits, physical validity checks, and function-oriented benchmarks.

Conclusion: Identifies critical open challenges: modeling conformational dynamics and intrinsically disordered regions, scaling to large assemblies while maintaining efficiency, and developing robust safety frameworks for biosecurity risks. Aims to accelerate transition from predictive modeling to reliable, function-driven protein engineering.

Abstract: Generative modeling has become a central paradigm in protein research, extending machine learning beyond structure prediction toward sequence design, backbone generation, inverse folding, and biomolecular interaction modeling. However, the literature remains fragmented across representations, model classes, and task formulations, making it difficult to compare methods or identify appropriate evaluation standards. This survey provides a systematic synthesis of generative AI in protein research, organized around (i) foundational representations spanning sequence, geometric, and multimodal encodings; (ii) generative architectures including $\mathrm{SE}(3)$-equivariant diffusion, flow matching, and hybrid predictor-generator systems; and (iii) task settings from structure prediction and de novo design to protein-ligand and protein-protein interactions. Beyond cataloging methods, we compare assumptions, conditioning mechanisms, and controllability, and we synthesize evaluation best practices that emphasize leakage-aware splits, physical validity checks, and function-oriented benchmarks. We conclude with critical open challenges: modeling conformational dynamics and intrinsically disordered regions, scaling to large assemblies while maintaining efficiency, and developing robust safety frameworks for dual-use biosecurity risks. By unifying architectural advances with practical evaluation standards and responsible development considerations, this survey aims to accelerate the transition from predictive modeling to reliable, function-driven protein engineering.

[425] Maintaining Difficulty: A Margin Scheduler for Triplet Loss in Siamese Networks Training

Roberto Sprengel Minozzo Tomchak, Oge Marques, Lucas Garcia Pedroso, Luiz Eduardo Oliveira, Paulo Lisboa de Almeida

Main category: cs.LG

TL;DR: Proposes a dynamic margin scheduler for triplet loss that adjusts margin parameter based on proportion of easy triplets to maintain consistent training difficulty, improving verification performance over fixed or monotonically increasing margins.

Details

Motivation: The authors observe that during training with triplet margin ranking loss, the effective margin often exceeds the predefined margin parameter when sufficient violating triplets are observed. This suggests that fixing the margin throughout training may limit learning potential, as the training difficulty changes over time.

Method: Proposes a margin scheduler that dynamically adjusts the margin parameter μ based on the proportion of easy triplets observed at each epoch. The scheduler aims to maintain consistent training difficulty by increasing margin when too many triplets are easy and decreasing it when too many are hard.

Result: Experimental results on four different datasets show consistent gains in verification performance compared to both constant margin and monotonically increasing margin schemes. The dynamic scheduler leads to improved learning outcomes.

Conclusion: Dynamic adjustment of the margin parameter based on training difficulty leads to better performance than fixed or simple increasing margin schedules, demonstrating the importance of maintaining appropriate training difficulty throughout the learning process.

Abstract: The Triplet Margin Ranking Loss is one of the most widely used loss functions in Siamese Networks for solving Distance Metric Learning (DML) problems. This loss function depends on a margin parameter μ, which defines the minimum distance that should separate positive and negative pairs during training. In this work, we show that, during training, the effective margin of many triplets often exceeds the predefined value of μ, provided that a sufficient number of triplets violating this margin is observed. This behavior indicates that fixing the margin throughout training may limit the learning process. Based on this observation, we propose a margin scheduler that adjusts the value of μ according to the proportion of easy triplets observed at each epoch, with the goal of maintaining training difficulty over time. We show that the proposed strategy leads to improved performance when compared to both a constant margin and a monotonically increasing margin scheme. Experimental results on four different datasets show consistent gains in verification performance.

[426] KMM-CP: Practical Conformal Prediction under Covariate Shift via Selective Kernel Mean Matching

Siddhartha Laghuvarapu, Rohan Deb, Jimeng Sun

Main category: cs.LG

TL;DR: KMM-CP: A conformal prediction framework using Kernel Mean Matching for covariate-shift correction with selective extension for reliable support overlap regions.

Details

Motivation: Conformal prediction requires exchangeability, often violated by distribution shift. Under covariate shift, importance weighting needs accurate density-ratio estimation, which becomes unstable when training and test distributions have limited support overlap.

Method: Proposes KMM-CP based on Kernel Mean Matching for covariate-shift correction. Uses RKHS moment discrepancy minimization with explicit weight constraints. Also introduces selective extension that identifies regions of reliable support overlap and restricts correction to this subset.

Result: Experiments on molecular property prediction benchmarks with realistic distribution shifts show KMM-CP reduces coverage gap by over 50% compared to existing approaches.

Conclusion: KMM-CP provides effective covariate-shift correction for conformal prediction with improved stability in low-overlap regimes through selective extension.

Abstract: Uncertainty quantification is essential for deploying machine learning models in high-stakes domains such as scientific discovery and healthcare. Conformal Prediction (CP) provides finite-sample coverage guarantees under exchangeability, an assumption often violated in practice due to distribution shift. Under covariate shift, restoring validity requires importance weighting, yet accurate density-ratio estimation becomes unstable when training and test distributions exhibit limited support overlap. We propose KMM-CP, a conformal prediction framework based on Kernel Mean Matching (KMM) for covariate-shift correction. We show that KMM directly controls the bias-variance components governing conformal coverage error by minimizing RKHS moment discrepancy under explicit weight constraints, and establish asymptotic coverage guarantees under mild conditions. We then introduce a selective extension that identifies regions of reliable support overlap and restricts conformal correction to this subset, further improving stability in low-overlap regimes. Experiments on molecular property prediction benchmarks with realistic distribution shifts show that KMM-CP reduces coverage gap by over 50% compared to existing approaches. The code is available at https://github.com/siddharthal/KMM-CP.

[427] Interpretable long-term traffic modelling on national road networks using theory-informed deep learning

Yue Li, Shujuan Chen, Akihiro Shimoda, Ying Jin

Main category: cs.LG

TL;DR: DeepDemand is a theory-informed deep learning framework that integrates travel demand theory with neural networks to predict long-term highway traffic volumes, achieving better accuracy and interpretability than traditional methods.

Details

Motivation: Existing traffic modelling approaches face trade-offs between interpretability, transferability, and predictive accuracy. Classical travel demand models rely on strong assumptions, while generic deep learning models lack theoretical grounding and spatial transferability, limiting their usefulness for long-term planning.

Method: Proposes DeepDemand framework with: 1) competitive two-source Dijkstra procedure for local origin-destination region extraction and OD pair screening, 2) differentiable architecture modelling OD interactions and travel-time deterrence, using socioeconomic features and road-network structure.

Result: Achieved R2 of 0.718 and MAE of 7406 vehicles under random cross-validation, outperforming linear, ridge, random forest, and gravity-style baselines. Maintained strong performance under spatial cross-validation (R2 = 0.665), indicating good geographic transferability.

Conclusion: Integrating transport theory with deep learning provides interpretable highway traffic modelling with practical planning applications, revealing stable nonlinear travel-time deterrence patterns and key socioeconomic drivers of demand.

Abstract: Long-term traffic modelling is fundamental to transport planning, but existing approaches often trade off interpretability, transferability, and predictive accuracy. Classical travel demand models provide behavioural structure but rely on strong assumptions and extensive calibration, whereas generic deep learning models capture complex patterns but often lack theoretical grounding and spatial transferability, limiting their usefulness for long-term planning applications. We propose DeepDemand, a theory-informed deep learning framework that embeds key components of travel demand theory to predict long-term highway traffic volumes using external socioeconomic features and road-network structure. The framework integrates a competitive two-source Dijkstra procedure for local origin-destination (OD) region extraction and OD pair screening with a differentiable architecture modelling OD interactions and travel-time deterrence. The model is evaluated using eight years (2017-2024) of observations on the UK strategic road network, covering 5088 highway segments. Under random cross-validation, DeepDemand achieves an R2 of 0.718 and an MAE of 7406 vehicles, outperforming linear, ridge, random forest, and gravity-style baselines. Performance remains strong under spatial cross-validation (R2 = 0.665), indicating good geographic transferability. Interpretability analysis reveals a stable nonlinear travel-time deterrence pattern, key socioeconomic drivers of demand, and polycentric OD interaction structures aligned with major employment centres and transport hubs. These results highlight the value of integrating transport theory with deep learning for interpretable highway traffic modelling and practical planning applications.

[428] Neuro-Symbolic Process Anomaly Detection

Devashish Gaikwad, Wil M. P. van der Aalst, Gyunam Park

Main category: cs.LG

TL;DR: Neuro-symbolic approach combining Logic Tensor Networks with Declare constraints to improve process anomaly detection by incorporating human domain knowledge to distinguish true anomalies from rare but conformant behavior.

Details

Motivation: Existing neural network-based process anomaly detection methods fail to incorporate human domain knowledge, causing rare but conformant traces to be misclassified as anomalies due to their low frequency, limiting detection effectiveness.

Method: Proposes a neuro-symbolic approach using Logic Tensor Networks (LTN) to integrate symbolic knowledge into neural networks via real-valued logic. Uses autoencoder models as foundation and encodes Declare constraints as soft logical guiderails during learning to distinguish anomalous from rare but conformant behavior.

Result: Evaluations on synthetic and real-world datasets show improved F1 scores even with as few as 10 conformant traces. The choice of Declare constraint (and thus human domain knowledge) significantly influences performance gains.

Conclusion: Neuro-symbolic integration of domain knowledge via LTN and Declare constraints effectively improves process anomaly detection by reducing false positives on rare but conformant behavior, demonstrating the value of combining statistical learning with symbolic reasoning.

Abstract: Process anomaly detection is an important application of process mining for identifying deviations from the normal behavior of a process. Neural network-based methods have recently been applied to this task, learning directly from event logs without requiring a predefined process model. However, since anomaly detection is a purely statistical task, these models fail to incorporate human domain knowledge. As a result, rare but conformant traces are often misclassified as anomalies due to their low frequency, which limits the effectiveness of the detection process. Recent developments in the field of neuro-symbolic AI have introduced Logic Tensor Networks (LTN) as a means to integrate symbolic knowledge into neural networks using real-valued logic. In this work, we propose a neuro-symbolic approach that integrates domain knowledge into neural anomaly detection using LTN and Declare constraints. Using autoencoder models as a foundation, we encode Declare constraints as soft logical guiderails within the learning process to distinguish between anomalous and rare but conformant behavior. Evaluations on synthetic and real-world datasets demonstrate that our approach improves F1 scores even when as few as 10 conformant traces exist, and that the choice of Declare constraint and by extension human domain knowledge significantly influences performance gains.

[429] Automatic feature identification in least-squares policy iteration using the Koopman operator framework

Christian Mugisho Zagabe, Sebastian Peitz

Main category: cs.LG

TL;DR: Koopman autoencoder-based least-squares policy iteration (KAE-LSPI) algorithm for reinforcement learning that automatically learns features via Koopman autoencoders instead of requiring manual feature/kernel selection.

Details

Motivation: Addresses the lack of systematic feature/kernel selection in linear RL techniques by enabling automatic feature learning through Koopman autoencoders.

Method: Reformulates least-squares fixed-point approximation using extended dynamic mode decomposition (EDMD) and integrates Koopman autoencoder (KAE) framework for automatic feature learning in policy iteration.

Result: KAE-LSPI achieves comparable convergence to optimal/near-optimal policies as classical LSPI and kernel-based LSPI, with reasonable number of learned features compared to manually fixed features in LSPI.

Conclusion: The KAE-LSPI algorithm successfully eliminates the need for a priori feature/kernel selection while maintaining performance comparable to existing methods in reinforcement learning.

Abstract: In this paper, we present a Koopman autoencoder-based least-squares policy iteration (KAE-LSPI) algorithm in reinforcement learning (RL). The KAE-LSPI algorithm is based on reformulating the so-called least-squares fixed-point approximation method in terms of extended dynamic mode decomposition (EDMD), thereby enabling automatic feature learning via the Koopman autoencoder (KAE) framework. The approach is motivated by the lack of a systematic choice of features or kernels in linear RL techniques. We compare the KAE-LSPI algorithm with two previous works, the classical least-squares policy iteration (LSPI) and the kernel-based least-squares policy iteration (KLSPI), using stochastic chain walk and inverted pendulum control problems as examples. Unlike previous works, no features or kernels need to be fixed a priori in our approach. Empirical results show the number of features learned by the KAE technique remains reasonable compared to those fixed in the classical LSPI algorithm. The convergence to an optimal or a near-optimal policy is also comparable to the other two methods.

[430] A Boltzmann-machine-enhanced Transformer For DNA Sequence Classification

Zhixuan Cao, Yishu Xu, Xuang WU

Main category: cs.LG

TL;DR: A Boltzmann-machine-enhanced Transformer for DNA sequence classification that introduces structured binary gating variables to discover latent interactions and higher-order dependencies in biological sequences.

Details

Motivation: DNA sequence classification needs to uncover latent site interactions, combinatorial regulation, and epistasis-like higher-order dependencies. Standard Transformers have softmax attention that is continuous and dense, making them better for information routing than explicit structure discovery.

Method: Proposes a Transformer with structured binary gating variables representing latent query-key connections, constrained by a Boltzmann-style energy function. Uses mean-field variational inference with Gumbel-Softmax to estimate edge activation probabilities while preserving differentiability. Jointly optimizes classification and energy losses.

Result: The framework provides a unified approach integrating Boltzmann machines, differentiable discrete optimization, and Transformers for structured learning on biological sequences, enabling interpretable structure discovery alongside accurate prediction.

Conclusion: The proposed Boltzmann-machine-enhanced Transformer offers a principled way to combine the global modeling capacity of Transformers with structured, interpretable latent variable models for biological sequence analysis.

Abstract: DNA sequence classification requires not only high predictive accuracy but also the ability to uncover latent site interactions, combinatorial regulation, and epistasis-like higher-order dependencies. Although the standard Transformer provides strong global modeling capacity, its softmax attention is continuous, dense, and weakly constrained, making it better suited for information routing than explicit structure discovery. In this paper, we propose a Boltzmann-machine-enhanced Transformer for DNA sequence classification. Built on multi-head attention, the model introduces structured binary gating variables to represent latent query-key connections and constrains them with a Boltzmann-style energy function. Query-key similarity defines local bias terms, learnable pairwise interactions capture synergy and competition between edges, and latent hidden units model higher-order combinatorial dependencies. Since exact posterior inference over discrete gating graphs is intractable, we use mean-field variational inference to estimate edge activation probabilities and combine it with Gumbel-Softmax to progressively compress continuous probabilities into near-discrete gates while preserving end-to-end differentiability. During training, we jointly optimize classification and energy losses, encouraging the model to achieve accurate prediction while favoring low-energy, stable, and interpretable structures. We further derive the framework from the energy function and variational free energy to the mean-field fixed-point equations, Gumbel-Softmax relaxation, and the final joint objective. The proposed framework provides a unified view of integrating Boltzmann machines, differentiable discrete optimization, and Transformers for structured learning on biological sequences.

[431] Foundation Model for Cardiac Time Series via Masked Latent Attention

Moritz Vandenhirtz, Samuel Ruipérez-Campillo, Simon Böhi, Sonia Laguna, Irene Cannistraci, Andrea Agostini, Ece Ozkan, Thomas M. Sutter, Julia E. Vogt

Main category: cs.LG

TL;DR: LAMAE is a foundation model for ECG analysis that uses latent attention to model cross-lead interactions during self-supervised pretraining, improving representation quality and transferability for cardiovascular diagnosis.

Details

Motivation: Existing ECG foundation models treat leads as independent channels, failing to leverage their structural redundancy and cross-lead connections, which limits representation quality and transferability.

Method: Proposes Latent Attention Masked Autoencoder (LAMAE) that learns cross-lead connection mechanisms through latent attention, enabling permutation-invariant aggregation and adaptive weighting of lead-specific representations during self-supervised pretraining.

Result: Outperforms independent-lead masked modeling and alignment-based baselines on Mimic-IV-ECG database for predicting ICD-10 codes, demonstrating improved representation quality and transferability.

Conclusion: Leveraging cross-lead connections through latent attention provides effective structural supervision for ECG foundation models, enhancing their diagnostic capabilities.

Abstract: Electrocardiograms (ECGs) are among the most widely available clinical signals and play a central role in cardiovascular diagnosis. While recent foundation models (FMs) have shown promise for learning transferable ECG representations, most existing pretraining approaches treat leads as independent channels and fail to explicitly leverage their strong structural redundancy. We introduce the latent attention masked autoencoder (LAMAE) FM that directly exploits this structure by learning cross-lead connection mechanisms during self-supervised pretraining. Our approach models higher-order interactions across leads through latent attention, enabling permutation-invariant aggregation and adaptive weighting of lead-specific representations. We provide empirical evidence on the Mimic-IV-ECG database that leveraging the cross-lead connection constitutes an effective form of structural supervision, improving representation quality and transferability. Our method shows strong performance in predicting ICD-10 codes, outperforming independent-lead masked modeling and alignment-based baselines.

[432] Shapley meets Rawls: an integrated framework for measuring and explaining unfairness

Fadoua Amri-Jouidel, Emmanuel Kemel, Stéphane Mussard

Main category: cs.LG

TL;DR: Shapley values used to define and explain unfairness in ML models under group fairness criteria, with extensions to ESL values for robustness and faster computation.

Details

Motivation: Current research treats explainability and fairness separately; need integrated framework to both define unfairness and explain its sources using established fairness criteria.

Method: Uses Shapley value framework to quantify unfairness contributions of features under group fairness criteria, extends to Efficient-Symmetric-Linear (ESL) values for robustness and computational efficiency.

Result: Applied to Census Income dataset, identified “Age”, “Number of hours”, and “Marital status” as key contributors to gender unfairness, with faster computation than traditional Bootstrap tests.

Conclusion: Shapley values provide unified framework for fairness definition and explanation; ESL extensions offer computational advantages and robustness improvements.

Abstract: Explainability and fairness have mainly been considered separately, with recent exceptions trying the explain the sources of unfairness. This paper shows that the Shapley value can be used to both define and explain unfairness, under standard group fairness criteria. This offers an integrated framework to estimate and derive inference on unfairness as-well-as the features that contribute to it. Our framework can also be extended from Shapley values to the family of Efficient-Symmetric-Linear (ESL) values, some of which offer more robust definitions of fairness, and shorter computation times. An illustration is run on the Census Income dataset from the UCI Machine Learning Repository. Our approach shows that Age", Number of hours" and ``Marital status" generate gender unfairness, using shorter computation time than traditional Bootstrap tests.

[433] SPECTRA: An Efficient Spectral-Informed Neural Network for Sensor-Based Activity Recognition

Deepika Gurung, Lala Shakti Swarup Ray, Mengxi Liu, Bo Zhou, Paul Lukowicz

Main category: cs.LG

TL;DR: SPECTRA is a spectral-temporal architecture for edge-deployable human activity recognition that integrates STFT feature extraction, depthwise separable convolutions, and channel-wise self-attention to capture spectral-temporal dependencies under real edge constraints.

Details

Motivation: Real-time sensor applications in pervasive computing require edge-deployable models for low latency, privacy, and efficient interaction. Current deep learning approaches treat temporal sensor signals as black-box sequences, overlooking spectral-temporal structure while demanding excessive computation.

Method: SPECTRA integrates short-time Fourier transform (STFT) feature extraction, depthwise separable convolutions, and channel-wise self-attention to capture spectral-temporal dependencies. Uses a compact bidirectional GRU with attention pooling to summarize within-window dynamics at low cost.

Result: Across five public HAR datasets, SPECTRA matches or approaches larger CNN-LSTM and Transformer baselines while substantially reducing parameters, latency, and energy. Successfully deployed on Google Pixel 9 smartphone and STM32L4 microcontroller.

Conclusion: SPECTRA demonstrates end-to-end deployable, real-time, private, and efficient human activity recognition suitable for edge devices, balancing accuracy with stringent resource constraints.

Abstract: Real time sensor based applications in pervasive computing require edge deployable models to ensure low latency privacy and efficient interaction. A prime example is sensor based human activity recognition where models must balance accuracy with stringent resource constraints. Yet many deep learning approaches treat temporal sensor signals as black box sequences overlooking spectral temporal structure while demanding excessive computation. We present SPECTRA a deployment first co designed spectral temporal architecture that integrates short time Fourier transform STFT feature extraction depthwise separable convolutions and channel wise self attention to capture spectral temporal dependencies under real edge runtime and memory constraints. A compact bidirectional GRU with attention pooling summarizes within window dynamics at low cost reducing downstream model burden while preserving accuracy. Across five public HAR datasets SPECTRA matches or approaches larger CNN LSTM and Transformer baselines while substantially reducing parameters latency and energy. Deployments on a Google Pixel 9 smartphone and an STM32L4 microcontroller further demonstrate end to end deployable realtime private and efficient HAR.

[434] EcoFair: Trustworthy and Energy-Aware Routing for Privacy-Preserving Vertically Partitioned Medical Inference

Mostafa Anoosha, Dhavalkumar Thakker, Kuniko Paxton, Koorosh Aslansefat, Bhupesh Kumar Mishra, Baseer Ahmad, Rameez Raja Kureshi

Main category: cs.LG

TL;DR: EcoFair: Privacy-preserving multimodal medical inference framework for dermatology that keeps raw data local, transmits only embeddings, and uses lightweight-first routing to activate heavier image encoders only when needed based on uncertainty and clinical risk.

Details

Motivation: Need to balance privacy (data locality), diagnostic reliability, and deployment efficiency in medical inference, particularly for dermatological diagnosis where both image and tabular data are involved.

Method: Vertically partitioned inference framework with raw image/tabular data staying local; transmits only modality-specific embeddings for server-side multimodal fusion. Uses lightweight-first routing that selectively activates heavier image encoder based on predictive uncertainty, safe-danger probability gap, and tabular neurosymbolic risk score (patient age + lesion localization).

Result: Substantially reduces edge-side inference energy while remaining competitive in classification performance on three dermatology benchmarks. Selective routing improves subgroup-sensitive malignant-case behavior without modifying global training objective.

Conclusion: EcoFair provides practical framework for privacy-preserving, energy-aware medical inference under edge deployment constraints, balancing data locality, diagnostic reliability, and computational efficiency.

Abstract: Privacy-preserving medical inference must balance data locality, diagnostic reliability, and deployment efficiency. This paper presents EcoFair, a simulated vertically partitioned inference framework for dermatological diagnosis in which raw image and tabular data remain local and only modality-specific embeddings are transmitted for server-side multimodal fusion. EcoFair introduces a lightweight-first routing mechanism that selectively activates a heavier image encoder when local uncertainty or metadata-derived clinical risk indicates that additional computation is warranted. The routing decision combines predictive uncertainty, a safe–danger probability gap, and a tabular neurosymbolic risk score derived from patient age and lesion localisation. Experiments on three dermatology benchmarks show that EcoFair can substantially reduce edge-side inference energy in representative model pairings while remaining competitive in classification performance. The results further indicate that selective routing can improve subgroup-sensitive malignant-case behaviour in representative settings without modifying the global training objective. These findings position EcoFair as a practical framework for privacy-preserving and energy-aware medical inference under edge deployment constraints.

[435] A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits

Tor Lattimore

Main category: cs.LG

TL;DR: Adaptation of continuous-time policy gradient analysis for k-armed stochastic bandits to discrete time, with regret bounds dependent on learning rate and gap parameters.

Details

Motivation: To bridge the gap between continuous-time analysis of policy gradient methods for bandits and the standard discrete-time setup commonly used in practice.

Method: Adapts the continuous-time analysis framework by Lattimore (2026) to discrete-time stochastic bandits, analyzing policy gradient algorithms with specific learning rate schedules.

Result: Proves that with learning rate η = O(Δ_min²/(Δ_max log(n))), the regret is O(k log(k) log(n)/η), where n is horizon and Δ_min, Δ_max are minimum/maximum gaps.

Conclusion: The analysis successfully extends continuous-time policy gradient theory to discrete-time bandits, providing theoretical guarantees for regret bounds with appropriate learning rate tuning.

Abstract: We adapt the analysis of policy gradient for continuous time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete time setup. As in continuous time, we prove that with learning rate $η= O(Δ_{\min}^2/(Δ_{\max} \log(n)))$ the regret is $O(k \log(k) \log(n) / η)$ where $n$ is the horizon and $Δ_{\min}$ and $Δ_{\max}$ are the minimum and maximum gaps.

[436] Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee

Main category: cs.LG

TL;DR: Muon spectral optimizer outperforms SGD in associative memory tasks with Gaussian inputs, showing higher storage capacity, larger critical batch size, and faster initial recovery while reaching same information-theoretic limit.

Details

Motivation: To understand why spectral optimizers like Muon perform better than SGD in large-scale language model training, particularly for factual recall tasks, by analyzing them through the tractable linear associative memory problem.

Method: Analyze Muon and SGD on linear associative memory problem with Gaussian inputs/outputs (allowing more associations than embedding dimension). Characterize recovery rates under power law frequency distribution using theoretical analysis of one-step dynamics and multi-step dynamics under thresholded gradient approximation.

Result: Muon has significantly higher storage capacity than SGD, saturates at larger critical batch size, achieves faster initial recovery rate, but both eventually converge to the same information-theoretic limit at comparable speeds. Synthetic experiments validate predicted scaling laws.

Conclusion: The analysis provides quantitative understanding of Muon’s signal amplification mechanism and lays groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.

Abstract: Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon and SGD on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and moreover Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of Muon and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.

[437] Machine Unlearning under Retain-Forget Entanglement

Jingpu Cheng, Ping Liu, Qianxiao Li, Chi Zhang

Main category: cs.LG

TL;DR: A two-phase optimization framework for machine unlearning that addresses retai-forget entanglements using augmented Lagrangian methods and Wasserstein-2 regularized gradient projection.

Details

Motivation: In machine unlearning, forgetting specific subsets often unintentionally affects related retained samples due to correlated features from pretraining or semantic similarities. Existing methods don't adequately handle these retai-forget entanglements.

Method: Two-phase optimization: 1) Augmented Lagrangian method increases loss on forget set while preserving accuracy on less-related retained samples; 2) Gradient projection step regularized by Wasserstein-2 distance to mitigate performance degradation on semantically related retained samples without compromising unlearning.

Result: Comprehensive experiments on multiple unlearning tasks, benchmark datasets, and diverse neural architectures show the approach achieves effective unlearning while outperforming baselines in both accuracy retention and removal fidelity.

Conclusion: The proposed framework successfully addresses retai-forget entanglements in machine unlearning, providing a robust solution that maintains performance on related retained samples while effectively removing target data.

Abstract: Forgetting a subset in machine unlearning is rarely an isolated task. Often, retained samples that are closely related to the forget set can be unintentionally affected, particularly when they share correlated features from pretraining or exhibit strong semantic similarities. To address this challenge, we propose a novel two-phase optimization framework specifically designed to handle such retai-forget entanglements. In the first phase, an augmented Lagrangian method increases the loss on the forget set while preserving accuracy on less-related retained samples. The second phase applies a gradient projection step, regularized by the Wasserstein-2 distance, to mitigate performance degradation on semantically related retained samples without compromising the unlearning objective. We validate our approach through comprehensive experiments on multiple unlearning tasks, standard benchmark datasets, and diverse neural architectures, demonstrating that it achieves effective and reliable unlearning while outperforming existing baselines in both accuracy retention and removal fidelity.

[438] The Climber’s Grip – Personalized Deep Learning Models for Fear and Muscle Activity in Climbing

Matthias Boeker, Dana Swarbrick, Ulysse T. A. Côté-Allard, Marc T. P. Adam, Hugo L. Hammer, Pål Halvorsen

Main category: cs.LG

TL;DR: Combines statistical modeling and deep learning to analyze relationships between perceived fear and muscle activity in lead vs top rope climbing, finding muscle fatigue correlates with increased fear in lead climbing.

Details

Motivation: To investigate the psychophysiological relationship between perceived fear and muscle activity in climbers, comparing lead climbing (with larger falls) vs top rope climbing, using both traditional statistical and advanced deep learning approaches.

Method: Experimental study with 19 climbers collecting EMG, ECG, and arm motion data during both climbing styles. Used linear mixed-effects models for initial analysis, then extended to deep learning models with integrated random effects for personalized modeling.

Result: Random effects improved model performance metrics (MSE, MAE, RMSE). Muscle fatigue significantly correlates with increased fear during lead climbing specifically.

Conclusion: Demonstrates the potential of combining statistical and deep learning approaches for modeling psychological-physiological interplay in climbing, with implications for understanding fear responses in high-risk sports.

Abstract: Climbing is a multifaceted sport that combines physical demands and emotional and cognitive challenges. Ascent styles differ in fall distance with lead climbing involving larger falls than top rope climbing, which may result in different perceived risk and fear. In this study, we investigated the psychophysiological relationship between perceived fear and muscle activity in climbers using a combination of statistical modeling and deep learning techniques. We conducted an experiment with 19 climbers, collecting electromyography (EMG), electrocardiography (ECG) and arm motion data during lead and top rope climbing. Perceived fear ratings were collected for the different phases of the climb. Using a linear mixed-effects model, we analyzed the relationships between perceived fear and physiological measures. To capture the non-linear dynamics of this relationship, we extended our analysis to deep learning models and integrated random effects for a personalized modeling approach. Our results showed that random effects improved model performance of the mean squared error (MSE), mean absolute error (MAE) and root mean squared error (RMSE). The results showed that muscle fatigue correlates significantly with increased fear during \textit{lead climbing}. This study highlights the potential of combining statistical and deep learning approaches for modeling the interplay between psychological and physiological states during climbing.

[439] Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

Einari Vaaras, Manu Airaksinen, Okko Räsänen

Main category: cs.LG

TL;DR: Comparison of three annotation sample selection methods (random, farthest-first traversal, and 2D visualization exploration) for biomedical time-series data, showing 2DV performs best overall but has higher variability risk.

Details

Motivation: Biomedical time-series data annotation is challenging, and while algorithmic sample selection methods exist, there's limited evidence from studies with real human annotators about which methods work best.

Method: Compared three sample selection methods: random sampling (RND), farthest-first traversal (FAFT), and 2D visualization exploration (2DV). Evaluated across four classification tasks in infant motility assessment and speech emotion recognition with 12 annotators (experts and non-experts) under limited annotation budget.

Result: 2DV performed best overall when aggregating labels across annotators, especially for capturing rare classes. However, 2DV showed greater label distribution variability between annotators, decreasing performance when training on individual annotators’ labels. FAFT excelled in individual-annotator settings for IMA. RND was safest when annotator count or expertise was uncertain.

Conclusion: 2DV-based sampling is promising for biomedical time-series annotation, particularly when annotation budget is not highly constrained, though it carries higher risk due to variability. RND is safer when annotator factors are uncertain.

Abstract: Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators’ labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.

[440] PQuantML: A Tool for End-to-End Hardware-aware Model Compression

Roope Niemi, Anastasiia Petrovych, Arghya Ranjan Das, Enrico Lupi, Chang Sun, Dimitrios Danopoulos, Marlon Joshua Helbing, Mia Liu, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini

Main category: cs.LG

TL;DR: PQuantML is an open-source neural network compression library for hardware-aware model deployment, offering pruning and quantization with unified interface and evaluation on physics tasks.

Details

Motivation: The need to deploy performant models to environments with strict latency constraints, particularly for real-time LHC data processing in physics applications.

Method: Provides a unified interface for applying pruning and quantization (jointly or individually), implements multiple pruning methods with different granularities, and fixed-point quantization with High-Granularity Quantization support.

Result: Achieves substantial parameter and bit-width reductions while maintaining accuracy on jet tagging tasks, with compression performance compared against existing tools like QKeras and HGQ.

Conclusion: PQuantML simplifies training of compressed models for hardware deployment and shows effective compression for physics applications with latency constraints.

Abstract: PQuantML is a new open-source, hardware-aware neural network model compression library tailored to end-to-end workflows. Motivated by the need to deploy performant models to environments with strict latency constraints, PQuantML simplifies training of compressed models by providing a unified interface to apply pruning and quantization, either jointly or individually. The library implements multiple pruning methods with different granularities, as well as fixed-point quantization with support for High-Granularity Quantization. We evaluate PQuantML on representative tasks such as the jet substructure classification, so-called jet tagging, an on-edge problem related to real-time LHC data processing. Using various pruning methods with fixed-point quantization, PQuantML achieves substantial parameter and bit-width reductions while maintaining accuracy. The resulting compression is further compared against existing tools, such as QKeras and HGQ.

[441] Characterization and forecasting of national-scale solar power ramp events

Luca Lanzilao, Angela Meyer

Main category: cs.LG

TL;DR: Analysis of solar ramp events using 2 years of PV data from 6434 stations, developing metrics to characterize events and evaluating forecasting models for grid stability.

Details

Motivation: Solar energy growth increases grid management complexity, with PV fluctuations causing operational uncertainty and ramp events threatening grid stability, requiring better identification, forecasting, and mitigation.

Method: Analyzed 2 years of 15-minute resolution PV data from 6434 stations, developed quantitative ramp event metrics, characterized events nationally, examined meteorological drivers, and evaluated spatiotemporal forecasting models including SolarSTEPS, SHADECast, IrradianceNet, and IFS-ENS.

Result: SHADECast was most reliable with 10.8% lower CRPS than SolarSTEPS at 2-hour lead time, but current nowcasting models struggle with ramp dynamics (RMSE increases up to 50% during events). Ramp-ups associated with morning cloud dissipation, ramp-downs with afternoon cloud cover.

Conclusion: Improved high-resolution spatiotemporal modeling is needed for better ramp prediction to support reliable large-scale solar integration into power systems.

Abstract: The rapid growth of solar energy is reshaping power system operations and increasing the complexity of grid management. As photovoltaic (PV) capacity expands, short-term fluctuations in PV generation introduce substantial operational uncertainty. At the same time, solar power ramp events intensify risks of grid instability and unplanned outages due to sudden large power fluctuations. Accurate identification, forecasting and mitigation of solar ramp events are therefore critical to maintaining grid stability. In this study, we analyze two years of PV power production from 6434 PV stations at 15-minute resolution. We develop quantitative metrics to define solar ramp events and systematically characterize their occurrence, frequency, and magnitude at a national scale. Furthermore, we examine the meteorological drivers of ramp events, highlighting the role of mesoscale cloud systems. In particular, we observe that ramp-up events are typically associated with cloud dissipation during the morning, while ramp-down events commonly occur when cloud cover increases in the afternoon. Additionally, we adopt a recently developed spatiotemporal forecasting framework to evaluate both deterministic and probabilistic PV power forecasts derived from deep learning and physics-based models, including SolarSTEPS, SHADECast, IrradianceNet, and IFS-ENS. The results show that SHADECast is the most reliable model, achieving a CRPS 10.8% lower than that of SolarSTEPS at a two-hour lead time. Nonetheless, state-of-the-art nowcasting models struggle to capture ramp dynamics, with forecast RMSE increasing by up to 50% compared to normal operating conditions. Overall, these results emphasize the need for improved high-resolution spatiotemporal modelling to enhance ramp prediction skill and support the reliable integration of large-scale solar generation into power systems.

[442] Hardware-Aware Tensor Networks for Real-Time Quantum-Inspired Anomaly Detection at Particle Colliders

Sagar Addepalli, Prajita Bhattarai, Abhilasha Dave, Julia Gonski

Main category: cs.LG

TL;DR: Quantum-inspired tensor networks for real-time anomaly detection in collider physics using spaced matrix product operators (SMPO) implemented in FPGA hardware for edge deployment

Details

Motivation: To leverage quantum machine learning benefits for detecting beyond Standard Model physics in collider events, while enabling near-term deployment through quantum-inspired algorithms on classical hardware for edge applications in scientific experiments

Method: Developed spaced matrix product operator (SMPO) tensor networks for anomaly detection, with cascaded SMPO architecture for flexibility and efficiency. Implemented in field programmable gate array (FPGA) hardware with resources and latency suitable for trigger deployment

Result: SMPO provides sensitivity to various beyond Standard Model benchmarks and can be implemented in FPGA hardware with appropriate resources and latency for trigger deployment. Cascaded SMPO offers greater flexibility and efficiency for resource-constrained edge applications

Conclusion: Quantum-inspired machine learning using tensor networks is beneficial and feasible for near-term deployment in high energy colliders for real-time anomaly detection

Abstract: Quantum machine learning offers the ability to capture complex correlations in high-dimensional feature spaces, crucial for the challenge of detecting beyond the Standard Model physics in collider events, along with the potential for unprecedented computational efficiency in future quantum processors. Near-term utilization of these benefits can be achieved by developing quantum-inspired algorithms for deployment in classical hardware to enable applications at the “edge” of current scientific experiments. This work demonstrates the use of tensor networks for real-time anomaly detection in collider detectors. A spaced matrix product operator (SMPO) is developed that provides sensitivity to a variety beyond the Standard Model benchmarks, and can be implemented in field programmable gate array hardware with resources and latency consistent with trigger deployment. The cascaded SMPO architecture is introduced as an SMPO variation that affords greater flexibility and efficiency in ways that are key to edge applications in resource-constrained environments. These results reveal the benefit and near-term feasibility of deploying quantum-inspired ML in high energy colliders.

[443] Benchmarking Tabular Foundation Models for Conditional Density Estimation in Regression

Rafael Izbicki, Pedro L. C. Rodrigues

Main category: cs.LG

TL;DR: Tabular foundation models (TabPFN, TabICL) are evaluated as general-purpose conditional density estimators and outperform various baselines across multiple datasets and metrics.

Details

Motivation: While tabular foundation models naturally produce predictive distributions, their effectiveness as general-purpose conditional density estimation (CDE) methods hasn't been systematically evaluated, unlike their point prediction performance which is well studied.

Method: Benchmarked three tabular foundation model variants against parametric, tree-based, and neural CDE baselines on 39 real-world datasets across training sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time.

Result: Foundation models achieve best CDE loss, log-likelihood, and CRPS on majority of datasets across all sample sizes. In a photometric redshift case study, TabPFN with 50k training galaxies outperformed all baselines trained on 500k galaxies.

Conclusion: Tabular foundation models are strong off-the-shelf conditional density estimators, though post-hoc recalibration may be valuable for some metrics at larger sample sizes.

Abstract: Conditional density estimation (CDE) - recovering the full conditional distribution of a response given tabular covariates - is essential in settings with heteroscedasticity, multimodality, or asymmetric uncertainty. Recent tabular foundation models, such as TabPFN and TabICL, naturally produce predictive distributions, but their effectiveness as general-purpose CDE methods has not been systematically evaluated, unlike their performance for point prediction, which is well studied. We benchmark three tabular foundation model variants against a diverse set of parametric, tree-based, and neural CDE baselines on 39 real-world datasets, across training sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time. Across all sample sizes, foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets tested. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a valuable complement. In a photometric redshift case study using SDSS DR18, TabPFN exposed to 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset. Taken together, these results establish tabular foundation models as strong off-the-shelf conditional density estimators.

[444] Context-specific Credibility-aware Multimodal Fusion with Conditional Probabilistic Circuits

Pranuthi Tenali, Sahil Sidheekh, Saurabh Mathur, Erik Blasch, Kristian Kersting, Sriraam Natarajan

Main category: cs.LG

TL;DR: C²MF is a context-specific credibility-aware multimodal fusion framework that models per-instance source reliability using Conditional Probabilistic Circuits, enabling adaptive reliability assessment when modalities conflict.

Details

Motivation: Existing multimodal fusion approaches rely on static assumptions about source reliability, limiting their ability to resolve conflicts when modalities become unreliable due to situational factors like sensor degradation or class-specific corruption.

Method: Introduces C²MF framework using Conditional Probabilistic Circuits (CPC) to model per-instance source reliability, formalizes instance-level reliability through Context-Specific Information Credibility (CSIC) - a KL-divergence-based measure computed exactly from CPC.

Result: C²MF improves predictive accuracy by up to 29% over static-reliability baselines in high-noise settings, while preserving interpretability advantages of probabilistic circuit-based fusion. Evaluated on Conflict benchmark with class-specific corruptions.

Conclusion: The framework enables principled and adaptive reliability assessment for multimodal fusion, generalizing conventional static credibility estimates as a special case, and demonstrates robustness under cross-modal conflicts.

Abstract: Multimodal fusion requires integrating information from multiple sources that may conflict depending on context. Existing fusion approaches typically rely on static assumptions about source reliability, limiting their ability to resolve conflicts when a modality becomes unreliable due to situational factors such as sensor degradation or class-specific corruption. We introduce C$^2$MF, a context-specfic credibility-aware multimodal fusion framework that models per-instance source reliability using a Conditional Probabilistic Circuit (CPC). We formalize instance-level reliability through Context-Specific Information Credibility (CSIC), a KL-divergence-based measure computed exactly from the CPC. CSIC generalizes conventional static credibility estimates as a special case, enabling principled and adaptive reliability assessment. To evaluate robustness under cross-modal conflicts, we propose the Conflict benchmark, in which class-specific corruptions deliberately induce discrepancies between different modalities. Experimental results show that C$^2$MF improves predictive accuracy by up to 29% over static-reliability baselines in high-noise settings, while preserving the interpretability advantages of probabilistic circuit-based fusion.

[445] Automatic Laplace Collapsed Sampling: Scalable Marginalisation of Latent Parameters via Automatic Differentiation

Toby Lovick, David Yallup, Will Handley

Main category: cs.LG

TL;DR: ALCS is a Bayesian inference framework that uses automatic differentiation and nested sampling to efficiently marginalize high-dimensional latent variables via Laplace approximation, making evidence computation tractable for complex models.

Details

Motivation: Bayesian inference with high-dimensional latent variables is computationally challenging due to the curse of dimensionality. Traditional methods require hand-derived gradients/Hessians or expensive joint sampling. There's a need for automated, scalable approaches that can handle complex models without extensive model-specific engineering.

Method: ALCS combines nested sampling with automatic differentiation to collapse high-dimensional latent variables to scalar contributions. At each likelihood evaluation, it performs MAP optimization and Laplace approximation using autodiff, reducing effective dimension. The method parallelizes MAP optimization and Hessian evaluation across GPU hardware and extends beyond Gaussian approximations to parametric families like Student-t distributions.

Result: The framework makes Bayesian evidence computation tractable for high-dimensional settings, validated on benchmarks including hierarchical, time-series, and discrete-likelihood models. It enables post-hoc ESS diagnostics to localize failures across hyperparameter space without expensive joint sampling.

Conclusion: ALCS provides a general, automated framework for efficient Bayesian inference in high-dimensional latent variable models, leveraging modern hardware and automatic differentiation to overcome computational bottlenecks while maintaining robustness.

Abstract: We present Automatic Laplace Collapsed Sampling (ALCS), a general framework for marginalising latent parameters in Bayesian models using automatic differentiation, which we combine with nested sampling to explore the hyperparameter space in a robust and efficient manner. At each nested sampling likelihood evaluation, ALCS collapses the high-dimensional latent variables $z$ to a scalar contribution via maximum a posteriori (MAP) optimisation and a Laplace approximation, both computed using autodiff. This reduces the effective dimension from $d_θ+ d_z$ to just $d_θ$, making Bayesian evidence computation tractable for high-dimensional settings without hand-derived gradients or Hessians, and with minimal model-specific engineering. The MAP optimisation and Hessian evaluation are parallelised across live points on GPU-hardware, making the method practical at scale. We also show that automatic differentiation enables local approximations beyond Laplace to parametric families such as the Student-$t$, which improves evidence estimates for heavy-tailed latents. We validate ALCS on a suite of benchmarks spanning hierarchical, time-series, and discrete-likelihood models and establish where the Gaussian approximation holds. This enables a post-hoc ESS diagnostic that localises failures across hyperparameter space without expensive joint sampling.

[446] An LP-based Sampling Policy for Multi-Armed Bandits with Side-Observations and Stochastic Availability

Ashutosh Soni, Peizhong Ju, Atilla Eryilmaz, Ness B. Shroff

Main category: cs.LG

TL;DR: UCB-LP-A: A novel bandit algorithm for stochastic multi-armed bandits with side-observations and dynamic action availability, using linear programming to optimize exploration-exploitation under changing feasible action sets.

Details

Motivation: Real-world systems often have structural dependencies (side-observations) AND volatility (stochastic availability), but existing network bandit algorithms assume all actions are permanently accessible. Need to handle both network structure and dynamic availability constraints.

Method: Proposes UCB-LP-A policy that uses Linear Programming to compute optimal sampling distribution over realizable activation sets. Leverages bipartite graph linking actions to unknowns, where selecting an action reveals observations for all connected unknowns. Optimizes exploration-exploitation trade-offs using only currently active arms.

Result: Derived theoretical upper bound on regret characterizing impact of network structure and activation probabilities. Numerical simulations show UCB-LP-A significantly outperforms existing heuristics that ignore either side-information or availability constraints.

Conclusion: UCB-LP-A effectively addresses stochastic multi-armed bandits with side-observations and dynamic action availability, providing a practical solution for real-world systems with both structural dependencies and volatility.

Abstract: We study the stochastic multi-armed bandit (MAB) problem where an underlying network structure enables side-observations across related actions. We use a bipartite graph to link actions to a set of unknowns, such that selecting an action reveals observations for all the unknowns it is connected to. While previous works rely on the assumption that all actions are permanently accessible, we investigate the more practical setting of stochastic availability, where the set of feasible actions (the “activation set”) varies dynamically in each round. This framework models real-world systems with both structural dependencies and volatility, such as social networks where users provide side-information about their peers’ preferences, yet are not always online to be queried. To address this challenge, we propose UCB-LP-A, a novel policy that leverages a Linear Programming (LP) approach to optimize exploration-exploitation trade-offs under stochastic availability. Unlike standard network bandit algorithms that assume constant access, UCB-LP-A computes an optimal sampling distribution over the realizable activation sets, ensuring that the necessary observations are gathered using only the currently active arms. We derive a theoretical upper bound on the regret of our policy, characterizing the impact of both the network structure and the activation probabilities. Finally, we demonstrate through numerical simulations that UCB-LP-A significantly outperforms existing heuristics that ignore either the side-information or the availability constraints.

[447] Nonmyopic Global Optimisation via Approximate Dynamic Programming

Filippo Airaldi, Bart De Schutter, Azita Dabiri

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2412.04882: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.04882&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[448] Projection-free Algorithms for Online Convex Optimization with Adversarial Constraints

Dhruv Sarkar, Aprameyo Chakrabartty, Subhamon Supantha, Palash Dey, Abhishek Sinha

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2501.16919: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.16919&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[449] How iteration order influences convergence and stability in deep learning

Benoit Dherin, Benny Avelin, Anders Karlsson, Hanna Mazzawi, Javier Gonzalvo, Michael Munn

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2502.01557: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.01557&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[450] Robust Predictive Modeling Under Unseen Data Distribution Shifts: A Methodological Commentary

Hanyu Duan, Yi Yang, Ahmed Abbasi, Kar Yan Tam

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusion due to lack of paper content

Abstract: Failed to fetch summary for 2503.03399: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.03399&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[451] Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models

Ron Vainshtein, Zohar Rimon, Shie Mannor, Chen Tessler

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2503.22886: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.22886&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[452] Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation

Kennedy Edemacu, Vinay M. Shashidhar, Micheal Tuape, Dan Abudu, Beakcheol Jang, Jong Wook Kim

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2508.02835: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02835&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[453] MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models

Kacper Kapuśniak, Cristian Gabellini, Michael Bronstein, Prudencio Tossou, Francesco Di Giovanni

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2509.24779: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24779&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[454] Large Language Models Can Perform Automatic Modulation Classification via Discretized Self-supervised Candidate Retrieval

Mohammad Rostami, Atik Faysal, Reihaneh Gh. Roshan, Huaxia Wang, Nikhil Muralidhar, Yu-Dong Yao

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation as paper content could not be retrieved

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2510.00316: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00316&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[455] Activation Steering with a Feedback Controller

Dung V. Nguyen, Hieu M. Vu, Nhi Y. Pham, Lei Zhang, Tan M. Nguyen

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2510.04309: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04309&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[456] NeST-BO: Fast Local Bayesian Optimization via Newton-Step Targeting of Gradient and Hessian Information

Wei-Ting Tang, Akshay Kudva, Joel A. Paulson

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.05516: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05516&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[457] Is the Hard-Label Cryptanalytic Model Extraction Really Polynomial?

Akira Ito, Takayuki Miura, Yosuke Todo

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting).

Details

Motivation: Cannot determine motivation without paper content.

Method: Cannot determine method without paper content.

Result: Cannot determine results without paper content.

Conclusion: Cannot draw conclusions without paper content.

Abstract: Failed to fetch summary for 2510.06692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[458] Cascading Bandits With Feedback

R Sri Prakash, Nikhil Karamchandani, Sharayu Moharir

Main category: cs.LG

TL;DR: Unable to analyze paper 2511.10938 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract content is unavailable due to API rate limiting

Method: Cannot determine method as abstract content is unavailable due to API rate limiting

Result: Cannot determine results as abstract content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as abstract content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2511.10938: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10938&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[459] LiteCache: A Query Similarity-Driven, GPU-Centric KVCache Subsystem for Efficient LLM Inference

Jiawei Yi, Ping Gong, Youhui Bai, Zewen Jin, Shengnan Wang, Jiaqi Ruan, Jia He, Jiaan Zhu, Pengcheng Wang, Haibo Wang, Weiguang Wang, Xia Zhu, Cheng Li

Main category: cs.LG

TL;DR: Unable to analyze paper 2511.14510 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable due to API rate limiting

Method: Cannot determine method as abstract is unavailable due to API rate limiting

Result: Cannot determine results as abstract is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as abstract is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2511.14510: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.14510&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[460] ReBaPL: Repulsive Bayesian Prompt Learning

Yassir Bendou, Omar Ezzahir, Eduardo Fernandes Montesuma, Gabriel Mahuas, Victoria Shevchenko, Mike Gartrell

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2511.17339: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17339&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[461] FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning

Yuan Yao, Lixu Wang, Jiaqi Wu, Jin Song, Simin Chen, Zehua Wang, Zijian Tian, Wei Chen, Huixia Li, Xiaoxiao Li

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2511.22265: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22265&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[462] XNNTab – Interpretable Neural Networks for Tabular Data using Sparse Autoencoders

Khawla Elhadri, Jörg Schlötterer, Christin Seifert

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper fetch

Method: Cannot determine method due to failed paper fetch

Result: Cannot determine results due to failed paper fetch

Conclusion: Cannot determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2512.13442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.13442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[463] Concurrent training methods for Kolmogorov-Arnold networks: Disjoint datasets and FPGA implementation

Andrew Polar, Michael Poluektov

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2512.18921: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.18921&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[464] Mobility-Assisted Decentralized Federated Learning: Convergence Analysis and A Data-Driven Approach

Reza Jahani, Md Farhamdur Reza, Richeng Jin, Huaiyu Dai

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2512.24694: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.24694&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[465] Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration

Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwary, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2602.11937: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11937&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[466] cc-Shapley: Measuring Multivariate Feature Importance Needs Causal Context

Jörg Martin, Stefan Haufe

Main category: cs.LG

TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API

Details

Motivation: Unable to determine motivation due to API access error

Method: Unable to determine method due to API access error

Result: Unable to determine results due to API access error

Conclusion: Paper content unavailable due to arXiv API rate limiting

Abstract: Failed to fetch summary for 2602.20396: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20396&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[467] Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns

Afshin Khadangi

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2602.22479: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22479&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[468] Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline

Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). The paper with ID 2602.24262 could not be retrieved from arXiv API.

Details

Motivation: Unable to determine motivation due to failed paper retrieval.

Method: Unable to determine method due to failed paper retrieval.

Result: Unable to determine results due to failed paper retrieval.

Conclusion: Unable to draw conclusions due to failed paper retrieval.

Abstract: Failed to fetch summary for 2602.24262: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24262&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[469] Sinkhorn-Drifting Generative Models

Ping He, Om Khangaonkar, Hamed Pirsiavash, Yikun Bai, Soheil Kolouri

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper details

Method: Cannot analyze method as paper content is unavailable due to HTTP 429 error

Result: No results available - paper retrieval failed due to rate limiting

Conclusion: Cannot draw conclusions about paper content due to technical access issues

Abstract: Failed to fetch summary for 2603.12366: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12366&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[470] Massive Redundancy in Gradient Transport Enables Sparse Online Learning

Aur Shalev Merin

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.15195: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15195&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[471] Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination

Dong-Xiao Zhang, Hu Lou, Jun-Jie Zhang, Jun Zhu, Deyu Meng

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when trying to access arXiv API for paper 2603.19562

Details

Motivation: Cannot determine motivation without access to the paper content

Method: Cannot determine method without access to the paper content

Result: Cannot determine results without access to the paper content

Conclusion: Cannot determine conclusion without access to the paper content

Abstract: Failed to fetch summary for 2603.19562: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19562&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[472] Benchmarking Scientific Machine Learning Models for Air Quality Data

Khawja Imran Masud, Venkata Sai Rahul Unnam, Sahara Ali

Main category: cs.LG

TL;DR: Physics-guided ML/DL models benchmarked for AQI forecasting in North Texas, showing deep learning with physics constraints improves accuracy and physical consistency.

Details

Motivation: Need for accurate AQI forecasting to protect public health, with challenges in model evaluation due to lack of region-specific benchmarking on standardized datasets.

Method: Benchmarked classical time-series (LR, SARIMAX), ML (MLP), and DL (LSTM) models with physics-guided variants incorporating EPA breakpoint-based AQI formulation as consistency constraints via weighted loss. Used EPA daily air quality data (2022-2024) for PM2.5 and O3 with lag-wise forecasting for 1,7,14,30 days.

Result: Deep learning models outperformed simpler baselines; physics guidance improved stability and yielded physically consistent pollutant-AQI relationships, with largest benefits for short-horizon prediction and for PM2.5 and O3.

Conclusion: Provides practical reference for selecting AQI forecasting models in North Texas and clarifies when physics constraints meaningfully improve predictive performance across pollutants and forecast horizons.

Abstract: Accurate air quality index (AQI) forecasting is essential for the protecting public health in rapidly growing urban regions, and the practical model evaluation and selection are often challenged by the lack of rigorous, region-specific benchmarking on standardized datasets. Physics-guided machine learning and deep learning models could be a good and effective solution to resolve such issues with more accurate and efficient AQI forecasting. This research study presents an explainable and comprehensive benchmark that enables a guideline and proposed physics-guided best model by benchmarking classical time-series, machine-learning, and deep-learning approaches for multi-horizon AQI forecasting in North Texas (Dallas County). Using publicly available U.S. Environmental Protection Agency (EPA) daily observations of air quality data from 2022 to 2024, we curate city-level time series for PM2.5 and O3 by aggregating station measurements and constructing lag-wise forecasting datasets for LAG in {1,7,14,30} days. For benchmarking the best model, linear regression (LR), SARIMAX, multilayer perceptrons (MLP), and LSTM networks are evaluated with the proposed physics-guided variants (MLP+Physics and LSTM+Physics) that incorporate the EPA breakpoint-based AQI formulation as a consistency constraint through a weighted loss. Experiments using chronological train-test splits and error metrics MAE, RMSE showed that deep-learning models outperform simpler baselines, while physics guidance improves stability and yields physically consistent pollutant with AQI relationships, with the largest benefits observed for short-horizon prediction and for PM2.5 and O3. Overall, the results provide a practical reference for selecting AQI forecasting models in North Texas and clarify when lightweight physics constraints meaningfully improve predictive performance across pollutants and forecast horizons.

[473] Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

Zakaria Mhammedi, James Cohan

Main category: cs.LG

TL;DR: A new exploration paradigm separates exploration from exploitation using tree-search with epistemic uncertainty, achieving efficient exploration without RL overhead and state-of-the-art results on hard Atari games and MuJoCo tasks.

Details

Motivation: Current RL-based exploration methods using intrinsic motivation incur unnecessary overhead - policy optimization is needed for task execution but inefficient for state coverage expansion. The paper proposes separating exploration from exploitation to bypass RL during exploration phase.

Method: Uses tree-search strategy inspired by Go-With-The-Winner algorithm paired with epistemic uncertainty measure to drive exploration. Removes policy optimization overhead during exploration, then distills discovered trajectories into deployable policies using supervised backward learning algorithms.

Result: Explores an order of magnitude more efficiently than intrinsic motivation baselines on hard Atari benchmarks. Achieves state-of-the-art scores on Montezuma’s Revenge, Pitfall!, and Venture. Solves MuJoCo Adroit dexterous manipulation and AntMaze tasks from image observations without expert demonstrations or offline datasets.

Conclusion: Proposed paradigm successfully separates exploration from exploitation, demonstrating efficient exploration without RL overhead and achieving strong results across discrete and continuous domains, including previously unsolved Adroit tasks from images.

Abstract: The process of discovery requires active exploration – the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma’s Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.

[474] COMPASS-Hedge: Learning Safely Without Knowing the World

Ting Hu, Luanda Cai, Manolis Vlatakis

Main category: cs.LG

TL;DR: Paper 2603.22348: Failed to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access limitations

Method: Unable to determine method due to access limitations

Result: Unable to determine results due to access limitations

Conclusion: Unable to determine conclusion due to access limitations

Abstract: Failed to fetch summary for 2603.22348: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22348&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[475] A Heterogeneous Long-Micro Scale Cascading Architecture for General Aviation Health Management

Xinhang Chen, Zhihuan Wei, Yang Hu, Zhiguo Zeng, Kang Zeng, Wei Wang

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2603.22885: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22885&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[476] Steering Code LLMs with Activation Directions for Language and Library Control

Md Mahbubur Rahman, Arjun Guha, Harshitha Menon

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.23629: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23629&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[477] Missing-Aware Multimodal Fusion for Unified Microservice Incident Management

Wenzhuo Qian, Hailiang Zhao, Ziqi Wang, Zhipeng Gao, Jiayi Chen, Zhiwei Ling, Shuiguang Deng

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.25538: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25538&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[478] Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring

John Ayotunde, Qinghua Xu, Guancheng Wang, Lionel C. Briand

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation as paper content could not be retrieved

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to determine conclusion as paper content could not be retrieved

Abstract: Failed to fetch summary for 2603.25670: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25670&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[479] CACTO-SL: Using Sobolev Learning to improve Continuous Actor-Critic with Trajectory Optimization

Elisa Alboni, Gianluigi Grandesso, Gastone Pietro Rosati Papini, Justin Carpentier, Andrea Del Prete

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation as paper content could not be retrieved

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2312.10666: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2312.10666&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[480] Symmetric observations without symmetric causal explanations

Christian William, Patrick Remy, Jean-Daniel Bancal, Yu Cai, Nicolas Brunner, Alejandro Pozas-Kerstjens

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2502.14950: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.14950&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[481] Reproducibility and Artifact Consistency of the SIGIR 2022 Recommender Systems Papers Based on Message Passing

Maurizio Ferrari Dacrema, Michael Benigni, Nicola Ferro

Main category: cs.LG

TL;DR: Failed to fetch summary for paper 2503.07823 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2503.07823: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.07823&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[482] A Channel-Triggered Backdoor Attack on Wireless Semantic Image Reconstruction

Jialin Wan, Jinglong Shen, Nan Cheng, Zhisheng Yin, Yiliang Liu, Wenchao Xu, Xuemin, Shen

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2503.23866: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.23866&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[483] Curved representational Bregman divergences and their applications

Frank Nielsen

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2504.05654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.05654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[484] Diffusion Recommender Models and the Illusion of Progress: A Concerning Study of Reproducibility and a Conceptual Mismatch

Michael Benigni, Maurizio Ferrari Dacrema, Dietmar Jannach

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2505.09364: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.09364&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[485] Is Supervised Learning Really That Different from Unsupervised?

Oskar Allerbo, Thomas B. Schön

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2505.11006: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11006&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[486] Vecchia-Inducing-Points Full-Scale Approximations for Gaussian Processes

Tim Gyger, Reinhard Furrer, Fabio Sigrist

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2507.05064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.05064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[487] Bayesian Optimization on Networks

Wenwen Li, Daniel Sanz-Alonso, Ruiyi Yang

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot determine conclusion due to inability to access paper content

Abstract: Failed to fetch summary for 2510.27643: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.27643&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[488] The Value of Personalized Recommendations: Evidence from Netflix

Kevin Zielnicki, Guy Aridor, Aurélien Bibaut, Allen Tran, Winston Chou, Nathan Kallus

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2511.07280: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07280&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[489] Architecting software monitors for control-flow anomaly detection through large language models and conformance checking

Francesco Vitale, Francesco Flammini, Mauro Caporuscio, Nicola Mazzocca

Main category: cs.LG

TL;DR: Failed to fetch paper summary - HTTP 429 error (too many requests)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.10876: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10876&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[490] Uncovering Patterns of Brain Activity from EEG Data Consistently Associated with Cybersickness Using Neural Network Interpretability Maps

Jacqueline Yau, Katherine J. Mimnaugh, Evan G. Center, Timo Ojala, Steven M. LaValle, Wenzhen Yuan, Nancy Amato, Minje Kim, Kara D. Federmeier

Main category: cs.LG

TL;DR: Failed to fetch summary for arXiv ID 2512.20620 due to HTTP 429 error (too many requests)

Details

Motivation: Unable to analyze motivation due to access error preventing retrieval of paper content

Method: Unable to analyze method due to access error preventing retrieval of paper content

Result: Unable to analyze results due to access error preventing retrieval of paper content

Conclusion: Unable to analyze conclusion due to access error preventing retrieval of paper content

Abstract: Failed to fetch summary for 2512.20620: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20620&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[491] Conformal Graph Prediction with Z-Gromov Wasserstein Distances

Gabriel Melo, Thibaut de Saivre, Anna Calissano, Florence d’Alché-Buc

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.02460 suggests it’s from March 2026, but no content available for analysis.

Details

Motivation: Cannot determine motivation without access to paper content. The arXiv API rate limiting prevents retrieval of the abstract.

Method: Unknown - paper content not accessible due to API rate limiting.

Result: No results available - unable to fetch paper summary.

Conclusion: Cannot draw conclusions without access to paper content. The arXiv API returned HTTP 429 (Too Many Requests) error.

Abstract: Failed to fetch summary for 2603.02460: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02460&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[492] Combinatorial Privacy: Private Multi-Party Bitstream Grand Sum by Hiding in Birkhoff Polytopes

Praneeth Vepakomma

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.22808: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22808&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[493] UniScale: Synergistic Entire Space Data and Model Scaling for Search Ranking

Liren Yu, Caiyuan Li, Feiyi Dong, Tao Zhang, Zhixuan Zhang, Dan Ou, Haihong Tang, Bo Zheng

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.24226 suggests it’s from March 2023, but content is unavailable.

Details

Motivation: Unable to determine motivation due to content unavailability.

Method: Unable to determine method due to content unavailability.

Result: Unable to determine results due to content unavailability.

Conclusion: Unable to determine conclusion due to content unavailability.

Abstract: Failed to fetch summary for 2603.24226: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24226&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[494] Binary Expansion Group Intersection Network

Sicheng Zhou, Kai Zhang

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2603.24763: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24763&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.MA

[495] Decentralized Value Systems Agreements

Arturo Hernandez-Sanchez, Natalia Criado, Stella Heras, Miguel Rebollo, Jose Such

Main category: cs.MA

TL;DR: Novel method for aggregating heterogeneous value systems to generate multiple value agreements that accommodate individual differences, using decentralized optimization and applied to real-world participatory evaluation data.

Details

Motivation: Value-based decision-making faces challenges due to subjective nature of values - individuals differ in both interpretation of values and their relative importance. Existing approaches focus on finding single value agreements, which may not suit realistic, heterogeneous societies where value systems differ significantly.

Method: Proposes a novel aggregation method where agents indicate their value systems and willingness to concede, then finds a set of agreements using decentralized optimization approach. Applied to real-world scenarios using data from Participatory Value Evaluation process and European Value Survey.

Result: Results show substantial improvement in individual utilities compared to existing value system aggregation techniques. Case studies illustrate different aggregations possible with the method and compare them with existing techniques.

Conclusion: The proposed approach for generating multiple value agreements that accommodate inherent differences in value systems is more suitable for realistic, heterogeneous societies than existing single-agreement methods.

Abstract: One of the biggest challenges of value-based decision-making is dealing with the subjective nature of values. The relative importance of a value for a particular decision varies between individuals, and people may also have different interpretations of what aligning with a value means in a given situation. While members of a society are likely to share a set of principles or values, their value systems–that is, how they interpret these values and the relative importance they give to them–have been found to differ significantly. This work proposes a novel method for aggregating value systems, generating distinct value agreements that accommodate the inherent differences within these systems. Unlike existing work, which focuses on finding a single value agreement, the proposed approach may be more suitable for a realistic and heterogeneous society. In our solution, the agents indicate their value systems and the extent to which they are willing to concede. Then, a set of agreements is found, taking a decentralized optimization approach. Our work has been applied to identify value agreements in two real-world scenarios using data from a Participatory Value Evaluation process and a European Value Survey. These case studies illustrate the different aggregations that can be obtained with our method and compare them with those obtained using existing value system aggregation techniques. In both cases, the results showed a substantial improvement in individual utilities compared to existing alternatives.

[496] Deception and Communication in Autonomous Multi-Agent Systems: An Experimental Study with Among Us

Maria Milkowski, Tim Weninger

Main category: cs.MA

TL;DR: LLM agents in Among Us game show deceptive behavior favoring equivocation over outright lies, with impostors using more explanatory language but achieving limited strategic success.

Details

Motivation: To study strategic deception and communication in autonomous LLM agents deployed in multi-agent systems, using social deduction games to understand coordination, reliability, and safety implications.

Method: Analyzed 1,100 games of Among Us with autonomous L2LM agents generating over one million tokens of meeting dialogue, using speech act theory and interpersonal deception theory to examine language patterns.

Result: All agents primarily used directive language, while impostor agents shifted slightly toward representative acts (explanations, denials). Deception mainly appeared as equivocation rather than outright lies, increased under social pressure, but rarely improved win rates.

Conclusion: Current LLM agents favor low-risk ambiguous communication that is linguistically subtle but strategically limited, revealing a fundamental tension between truthfulness and utility in autonomous multi-agent systems.

Abstract: As large language models are deployed as autonomous agents, their capacity for strategic deception raises core questions for coordination, reliability, and safety in multi-goal, multi-agent systems. We study deception and communication in L2LM agents through the social deduction game Among Us, a cooperative-competitive environment. Across 1,100 games, autonomous agents produced over one million tokens of meeting dialogue. Using speech act theory and interpersonal deception theory, we find that all agents rely mainly on directive language, while impostor agents shift slightly toward representative acts such as explanations and denials. Deception appears primarily as equivocation rather than outright lies, increasing under social pressure but rarely improving win rates. Our contributions are a large-scale analysis of role-conditioned deceptive behavior in LLM agents and empirical evidence that current agents favor low-risk ambiguity that is linguistically subtle yet strategically limited, revealing a fundamental tension between truthfulness and utility in autonomous communication.

Divyanshu Singh, Ashman Mehra, Kavya Makwana, Snehanshu Saha, Santonu Sarkar

Main category: cs.MA

TL;DR: Altruistic Ride Sharing (ARS) uses non-monetary altruism points and multi-agent reinforcement learning to create a decentralized peer-to-peer ride-sharing system that reduces travel distance, emissions, and traffic while increasing vehicle utilization.

Details

Motivation: Current ride-sharing platforms have profit-driven incentives that don't align individual participation with community benefits, leading to congestion, underutilized vehicles, and emissions from private commuting.

Method: ARS introduces altruism points (non-monetary credits) and ORACLE, a shared-parameter multi-agent reinforcement learning architecture for decentralized rider selection in a peer-to-peer framework.

Result: Using NYC taxi data, ARS reduces total travel distance and emissions by ~20%, reduces urban traffic density by up to 30%, and doubles vehicle utilization while maintaining balanced participation across agents.

Conclusion: Altruism-based incentives combined with decentralized learning can provide a scalable and equitable alternative to profit-driven ride-sharing systems for urban mobility.

Abstract: Urban mobility systems face persistent challenges of congestion, underutilized vehicles, and rising emissions driven by private point-to-point commuting. Although ride-sharing platforms exist, their profit-driven incentive structures often fail to align individual participation with broader community benefit. We introduce Altruistic Ride Sharing (ARS), a decentralized peer-to-peer mobility framework in which commuters alternate between driver and rider roles using altruism points, a non-monetary credit mechanism that rewards providing rides and discourages persistent free-riding. To enable scalable coordination among agents, ARS formulates ride-sharing as a multi-agent reinforcement learning problem and introduces ORACLE (One-Network Actor-Critic for Learning in Cooperative Environments), a shared-parameter learning architecture for decentralized rider selection. We evaluate ARS using real-world New York City Taxi and Limousine Commission (TLC) trajectory data under varying agent populations and behavioral dynamics. Across simulations, ARS reduces total travel distance and associated carbon emissions by approximately 20%, reduces urban traffic density by up to 30%, and doubles vehicle utilization relative to no-sharing baselines while maintaining balanced participation across agents. These results demonstrate that altruism-based incentives combined with decentralized learning can provide a scalable and equitable alternative to profit-driven ride-sharing systems.

cs.MM

[498] Cinematic Audio Source Separation Using Visual Cues

Kang Zhang, Suyeon Lee, Arda Senocak, Joon Son Chung

Main category: cs.MM

TL;DR: First audio-visual framework for cinematic audio source separation using conditional flow matching with dual-stream visual encoding, trained on synthetic data and generalizing to real films.

Details

Result: Model trained entirely on synthetic data generalizes effectively to real-world cinematic content. Achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks.

Conclusion: First successful audio-visual CASS framework demonstrates the value of visual context for cinematic audio separation, with synthetic training enabling generalization to real films.

[499] ComVi: Context-Aware Optimized Comment Display in Video Playback

Minsun Kim, Dawon Lee, Junyong Noh

Main category: cs.MM

TL;DR: ComVi is a system that synchronizes video comments with relevant video moments using audio-visual correlation, optimizing comment display timing for better viewer engagement.

Details

Motivation: Current video platforms display comments independently of video playback, causing viewers to encounter comments about unrelated moments that can reveal spoilers and disrupt immersion.

Method: The system maps comments to relevant video timestamps by computing audio-visual correlation, then constructs an optimized comment sequence considering temporal relevance, popularity (likes), and comfortable display duration.

Result: In user studies, ComVi provided significantly more engaging experiences than conventional interfaces (YouTube and Danmaku), with 71.9% of participants selecting it as their most preferred interface.

Conclusion: ComVi demonstrates that time-synchronized comment display based on audio-visual correlation can significantly improve viewer engagement and experience on video platforms.

Abstract: On general video-sharing platforms like YouTube, comments are displayed independently of video playback. As viewers often read comments while watching a video, they may encounter ones referring to moments unrelated to the current scene, which can reveal spoilers and disrupt immersion. To address this problem, we present ComVi, a novel system that displays comments at contextually relevant moments, enabling viewers to see time-synchronized comments and video content together. We first map all comments to relevant video timestamps by computing audio-visual correlation, then construct the comment sequence through an optimization that considers temporal relevance, popularity (number of likes), and display duration for comfortable reading. In a user study, ComVi provided a significantly more engaging experience than conventional video interfaces (i.e., YouTube and Danmaku), with 71.9% of participants selecting ComVi as their most preferred interface.

eess.AS

[500] UPV_RIR_DB: A Structured Room Impulse Response Database with Hierarchical Metadata and Acoustic Indicators

Jesús García-Gamborino, Laura Fuster, Daniel de la Prida, Luis A. Azpicueta-Ruiz, Gema Piñero

Main category: eess.AS

TL;DR: UPV_RIR_DB is a structured database of measured room impulse responses with spatial metadata and traceable acquisition parameters, containing 166 multichannel RIR files from three rooms at Universitat Politècnica de València.

Details

Motivation: To provide acoustic data with explicit spatial metadata and traceable acquisition parameters for reproducible analysis in audio research, particularly for spatial audio and room acoustics applications.

Method: Created a hierarchical database organization where directory structure and metadata jointly describe measurement context. Each room includes metadata files with acquisition parameters, hardware description, spatial coordinates, and acoustic indicators like reverberation time.

Result: Database contains 166 multichannel RIR files with 18,976 single impulse responses measured in three rooms, organized with traceable metadata and compatible with MATLAB- and JSON-based workflows.

Conclusion: UPV_RIR_DB provides a consistent framework for storing, inspecting, and reusing real RIR measurements while ensuring traceability and enabling reproducible analysis, publicly available through Zenodo.

Abstract: This paper presents UPV_RIR_DB, a structured database of measured room impulse responses (RIRs) designed to provide acoustic data with explicit spatial metadata and traceable acquisition parameters. The dataset currently contains 166 multichannel RIR files measured in three rooms of the Universitat Politècnica de València (UPV). Each multichannel RIR file contains impulse responses for multiple source-receiver pairs, with each pair covering a 25 cm2 area - the typical size of a personal sound zone. Considering the number of sources and receiver channels associated with each microphone modality, the database contains a total of 18,976 single impulse responses. A hierarchical organization is adopted in which directory structure and metadata jointly describe the measurement context. Each room includes a metadata file containing acquisition parameters, hardware description, spatial coordinates of zones and microphones, and acoustic indicators such as reverberation time. A central index links each RIR file with its experimental context, ensuring traceability and enabling reproducible analysis. The resulting database provides a consistent framework for storing, inspecting, and reusing real RIR measurements while preserving compatibility with both MATLAB- and JSON-based workflows. The UPV_RIR_DB dataset is publicly available through the open repository Zenodo.

[501] Acoustic Imaging for UAV Detection: Dense Beamformed Energy Maps and U-Net SELD

Belman Jahir Rodriguez, Sergio F. Chevtchenko, Marcelo Herrera Martinez, Yeshwanth Bethi, Saeed Afshar

Main category: eess.AS

TL;DR: U-net model for 360° acoustic source localization using spherical semantic segmentation on beamformed audio maps, trained on drone recordings and validated on SELD benchmarks.

Details

Motivation: Traditional sound source localization methods regress discrete DoA angles, but this paper proposes a segmentation approach for dense spatial audio understanding that can identify spatially distributed source regions and adapt to different microphone configurations.

Method: U-net model trained on frequency-domain representations of DAS beamformed audio maps (azimuth & elevation) using Tversky loss to address class imbalance. Creates binary supervision masks from drone GPS telemetry. Segmentation outputs are post-processed via centroid computation for DoA estimates.

Result: Model generalizes across environments with improved angular precision, validated on real-world drone recordings and DCASE 2019 TAU Spatial Sound Events benchmark, showing generalization beyond drone acoustics to multiclass SELD scenarios.

Conclusion: The segmentation-based approach offers a new paradigm for dense spatial audio understanding beyond traditional SSL, providing array-independent localization that can adapt to different microphone configurations with minimal adaptation.

Abstract: We introduce a U-net model for 360° acoustic source localization formulated as a spherical semantic segmentation task. Rather than regressing discrete direction-of-arrival (DoA) angles, our model segments beamformed audio maps (azimuth & elevation) into regions of active sound presence. Using delay-and-sum (DAS) beamforming on a custom 24-microphone array, we generate signals aligned with drone GPS telemetry to create binary supervision masks. A modified U-Net, trained on frequency-domain representations of these maps, learns to identify spatially distributed source regions while addressing class imbalance via the Tversky loss. Because the network operates on beamformed energy maps, the approach is inherently array-independent and can adapt to different microphone configurations and can be transferred to different microphone configurations with minimal adaptation. The segmentation outputs are post-processed by computing centroids over activated regions, enabling robust DoA estimates. Our dataset includes real-world open-field recordings of a DJI Air 3 drone, synchronized with 360° video and flight logs across multiple dates and locations. Experimental results show that U-net generalizes across environments, providing improved angular precision, offering a new paradigm for dense spatial audio understanding beyond traditional Sound Source Localization (SSL). We additionally validate the same beamforming-plus-segmentation formulation on the DCASE 2019 TAU Spatial Sound Events benchmark, showing that the approach generalizes beyond drone acoustics to multiclass Sound Event Localization and Detection (SELD) scenarios.

eess.IV

[502] Learning to Recorrupt: Noise Distribution Agnostic Self-Supervised Image Denoising

Brayan Monroy, Jorge Bacca, Julián Tachella

Main category: eess.IV

TL;DR: L2R is a self-supervised image denoising method that learns the recorruption process using a monotonic neural network without requiring prior knowledge of noise distributions.

Details

Motivation: Existing self-supervised denoising methods require precise knowledge of noise distributions to avoid trivial identity mappings, which is often unavailable in real-world scenarios.

Method: Introduces Learning to Recorrupt (L2R) with a learnable monotonic neural network that learns the recorruption process through a min-max saddle-point objective, eliminating the need for noise distribution knowledge.

Result: Achieves state-of-the-art performance across unconventional and heavy-tailed noise distributions (log-gamma, Laplace, spatially correlated noise) and signal-dependent noise models (Poisson-Gaussian noise).

Conclusion: L2R provides a noise distribution-agnostic approach to self-supervised image denoising that works effectively across diverse noise types without requiring prior knowledge.

Abstract: Self-supervised image denoising methods have traditionally relied on either architectural constraints or specialized loss functions that require prior knowledge of the noise distribution to avoid the trivial identity mapping. Among these, approaches such as Noisier2Noise or Recorrupted2Recorrupted, create training pairs by adding synthetic noise to the noisy images. While effective, these recorruption-based approaches require precise knowledge of the noise distribution, which is often not available. We present Learning to Recorrupt (L2R), a noise distribution-agnostic denoising technique that eliminates the need for knowledge of the noise distribution. Our method introduces a learnable monotonic neural network that learns the recorruption process through a min-max saddle-point objective. The proposed method achieves state-of-the-art performance across unconventional and heavy-tailed noise distributions, such as log-gamma, Laplace, and spatially correlated noise, as well as signal-dependent noise models such as Poisson-Gaussian noise.

[503] Adapting Segment Anything Model 3 for Concept-Driven Lesion Segmentation in Medical Images: An Experimental Study

Guoping Xu, Jayaram K. Udupa, Yubing Tong, Xin Long, Ying Zhang, Jie Deng, Weiguo Lu, You Zhang

Main category: eess.IV

TL;DR: Systematic evaluation of SAM3 for lesion segmentation across multiple medical imaging modalities using concept-based prompts and fine-tuning strategies

Details

Motivation: Existing lesion segmentation methods are limited to specific anatomical sites or imaging modalities, lacking generalizability. Vision-language foundation models offer concept-driven segmentation for more flexible medical image analysis, but concept-prompt-based lesion segmentation with SAM3 remains underexplored.

Method: Evaluated SAM3 performance using geometric bounding boxes and concept-based text/image prompts across multiple modalities (MRI, CT, ultrasound, dermoscopy, endoscopy). Incorporated additional prior knowledge (adjacent-slice predictions, multiparametric information, prior annotations). Compared fine-tuning strategies including partial module tuning, adapter-based methods, and full-model optimization.

Result: Experiments on 13 datasets covering 11 lesion types show SAM3 achieves strong cross-modality generalization, reliable concept-driven segmentation, and accurate lesion delineation.

Conclusion: SAM3 demonstrates potential for concept-based foundation models in scalable and practical medical image segmentation, highlighting the value of vision-language models for flexible medical image analysis.

Abstract: Accurate lesion segmentation is essential in medical image analysis, yet most existing methods are designed for specific anatomical sites or imaging modalities, limiting their generalizability. Recent vision-language foundation models enable concept-driven segmentation in natural images, offering a promising direction for more flexible medical image analysis. However, concept-prompt-based lesion segmentation, particularly with the latest Segment Anything Model 3 (SAM3), remains underexplored. In this work, we present a systematic evaluation of SAM3 for lesion segmentation. We assess its performance using geometric bounding boxes and concept-based text and image prompts across multiple modalities, including multiparametric MRI, CT, ultrasound, dermoscopy, and endoscopy. To improve robustness, we incorporate additional prior knowledge, such as adjacent-slice predictions, multiparametric information, and prior annotations. We further compare different fine-tuning strategies, including partial module tuning, adapter-based methods, and full-model optimization. Experiments on 13 datasets covering 11 lesion types demonstrate that SAM3 achieves strong cross-modality generalization, reliable concept-driven segmentation, and accurate lesion delineation. These results highlight the potential of concept-based foundation models for scalable and practical medical image segmentation. Code and trained models will be released at: https://github.com/apple1986/lesion-sam3

[504] Cone-Beam CT Image Quality Enhancement Using A Latent Diffusion Model Trained with Simulated CBCT Artifacts

Naruki Murahashi, Mitsuhiro Nakamura, Megumi Nakao

Main category: eess.IV

TL;DR: A conditional latent diffusion model for improving CBCT image quality while preserving anatomical structures using pseudo-CBCT images created from CT scans.

Details

Motivation: CBCT images have low contrast and high artifacts compared to conventional CT, and existing enhancement methods can cause anatomical structure changes in regions with organ deformation.

Method: Proposes an overcorrection-free CBCT enhancement method using a conditional latent diffusion model with pseudo-CBCT images created from CT scans to simulate CBCT artifacts, enabling self-supervised learning with spatially consistent paired images.

Result: Structural changes were less than 1/1000th of conventional methods, correlation coefficient between generated and reference images was 0.916, with faster processing and superior performance compared to conditional diffusion models.

Conclusion: The proposed conditional latent diffusion model with pseudo-CBCT images effectively enhances CBCT image quality while preserving anatomical structures, outperforming conventional methods in both accuracy and efficiency.

Abstract: Cone-beam computed tomography (CBCT) images are problematic in clinical medicine because of their low contrast and high artifact content compared with conventional CT images. Although there are some studies to improve image quality, in regions subject to organ deformation, the anatomical structure may change after such image quality improvement. In this study, we propose an overcorrection-free CBCT image quality enhancement method based on a conditional latent diffusion model using pseudo-CBCT images. Pseudo-CBCT images are created from CT images using a simple method that simulates CBCT artifacts and are spatially consistent with the CT images. By performing self-supervised learning with these spatially consistent paired images, we can improve image quality while maintaining anatomical structures. Furthermore, extending the framework of the conditional diffusion model to latent space improves the efficiency of image processing. Our model was trained on pelvic CT-pseudo-CBCT paired data and was applied to both pseudo-CBCT and real CBCT data. The experimental results using data of 75 cases show that with our proposed method, the structural changes were less than 1/1000th (in terms of the number of pixels) of those of a conventional method involving learning with real images, and the correlation coefficient between the CT value distributions of the generated and reference images was 0.916, approaching the same level as conventional methods. We also confirmed that the proposed framework achieves faster processing and superior improvement performance compared with the framework of a conditional diffusion model, even under constrained training settings.

[505] FINDER: Zero-Shot Field-Integrated Network for Distortion-free EPI Reconstruction in Diffusion MRI

Namgyu Han, Seong Dae Yun, Chaeeun Lim, Sunghyun Seok, Sunju Kim, Yoonhwan Kim, Yohan Jun, Tae Hyung Kim, Berkin Bilgic, Jaejin Cho

Main category: eess.IV

TL;DR: FINDER is a zero-shot, scan-specific framework that jointly optimizes image reconstruction and B0 field map estimation for distortion-free EPI diffusion MRI using physics-guided unrolled networks and implicit neural representations.

Details

Motivation: EPI diffusion MRI suffers from severe geometric distortions due to B0 field inhomogeneities, and existing methods lack robust geometric distortion correction integrated into self-supervised frameworks.

Method: Uses a physics-guided unrolled network with dual-domain denoisers and virtual coil extensions for data consistency, coupled with an Implicit Neural Representation (INR) to model B0 field as a continuous function, employing alternating minimization to jointly update reconstruction and field map.

Result: FINDER achieves superior geometric fidelity and image quality compared to state-of-the-art baselines, effectively disentangling susceptibility-induced geometric distortions from anatomical structures.

Conclusion: FINDER offers a robust solution for high-quality diffusion imaging by integrating geometric distortion correction into a self-supervised reconstruction framework.

Abstract: Echo-planar imaging (EPI) remains the cornerstone of diffusion MRI, but it is prone to severe geometric distortions due to its rapid sampling scheme that renders the sequence highly sensitive to $B_{0}$ field inhomogeneities. While deep learning has helped improve MRI reconstruction, integrating robust geometric distortion correction into a self-supervised framework remains an unmet need. To address this, we present FINDER (Field-Integrated Network for Distortion-free EPI Reconstruction), a novel zero-shot, scan-specific framework that reformulates reconstruction as a joint optimization of the underlying image and the $B_{0}$ field map. Specifically, we employ a physics-guided unrolled network that integrates dual-domain denoisers and virtual coil extensions to enforce robust data consistency. This is coupled with an Implicit Neural Representation (INR) conditioned on spatial coordinates and latent image features to model the off-resonance field as a continuous, differentiable function. Employing an alternating minimization strategy, FINDER synergistically updates the reconstruction network and the field map, effectively disentangling susceptibility-induced geometric distortions from anatomical structures. Experimental results demonstrate that FINDER achieves superior geometric fidelity and image quality compared to state-of-the-art baselines, offering a robust solution for high-quality diffusion imaging.

[506] Rethinking Feature Conditioning for Robust Forged Media Detection in Edge AI Sensing Systems

Izaldein Al-Zyoud, Abdulmotaleb El Saddik

Main category: eess.IV

TL;DR: Study shows feature conditioning (not just backbone choice) significantly impacts forged media detection performance, especially for out-of-distribution robustness, with different conditioning strategies optimal for in-distribution vs cross-dataset scenarios.

Details

Motivation: Generalization under manipulation and dataset shift remains challenging for forged media detection in edge AI systems. While frozen vision foundation models with linear probes are strong baselines, most pipelines use default backbone outputs without testing conditioning at the frozen feature interface.

Method: First controlled probing study on DINOv3 ConvNeXt, evaluating five conditioning variants without task-specific fine-tuning. Fixed backbone, head, data, and optimization while varying conditioning. Evaluated on FaceForensics++ c23 under in-distribution testing, leave-one-manipulation-out (LOMO), and cross-dataset transfer to Celeb-DF v2 and DeepFakeDetection.

Result: Conditioning alone changed LOMO mean AUC by 6.1 points and reversed ID-vs-OOD ranking: LN-Affine strongest on external datasets, while LayerNorm strongest in-distribution. In ConvNeXt-Base replication, OOD winner became protocol-dependent, and ID-optimal selection failed as robust deployment rule.

Conclusion: Feature conditioning is a first-order design variable that should be selected with robustness-oriented validation, not ID accuracy alone. Different conditioning strategies are optimal for different deployment scenarios.

Abstract: Generalization under manipulation and dataset shift remains a core challenge in forged media detection for AI-driven edge sensing systems. Frozen vision foundation models with linear probes are strong baselines, but most pipelines use default backbone outputs without testing conditioning at the frozen feature interface. We present the first controlled probing study on DINOv3 ConvNeXt and show that, without task-specific fine-tuning, linear probing alone yields competitive forged-media detection performance, indicating that ViT-7B self-supervised distillation transfers to security-critical vision workloads at edge-compatible inference cost. Backbone, head, data, and optimization are fixed while conditioning is varied; LN-Affine, the default ConvNeXt head output, is the natural baseline. On FaceForensics++ c23, five conditioning variants are evaluated under in-distribution testing, leave-one-manipulation-out (LOMO), and cross-dataset transfer to Celeb-DF v2 and DeepFakeDetection. In ConvNeXt-Tiny, conditioning alone changes LOMO mean AUC by 6.1 points and reverses ID-vs-OOD ranking: LN-Affine is strongest on external datasets, while LayerNorm is strongest in-distribution. In ConvNeXt-Base replication, the OOD winner becomes protocol-dependent, and ID-optimal selection still fails as a robust deployment rule. Results show that feature conditioning is a first-order design variable and should be selected with robustness-oriented validation, not ID accuracy alone.

Yi Zhang, Yidong Zhao, Qian Tao

Main category: eess.IV

TL;DR: A lightweight adaptation framework that enables frozen mono-modal registration models to handle multi-modal scenarios through style transfer and instance optimization, avoiding expensive full fine-tuning.

Details

Motivation: Deep learning registration methods struggle with multi-modal scenarios due to intensity distribution variations across scans. Full fine-tuning of modern architectures (Transformers, deep U-Nets) is computationally expensive in 3D, and naive fine-tuning risks performance degradation with domain shifts.

Method: Integrates frozen pretrained mono-modal registration model with lightweight adaptation pipeline using style transfer based on contrast-agnostic representation generation and refinement modules. Uses instance optimization at test time to bridge modality/domain gaps without full fine-tuning.

Result: On Learn2Reg 2025 LUMIR validation set: ranks 2nd on multi-modal subset, 3rd on out-of-domain subset, 4th overall in Dice score. Shows consistent improvements over pretrained state-of-the-art mono-modal backbone.

Conclusion: Combining frozen mono-modal models with modality adaptation and lightweight instance optimization offers effective, practical pathway toward robust multi-modal registration without computational burden of full fine-tuning.

Abstract: Deformable image registration remains a central challenge in medical image analysis, particularly under multi-modal scenarios where intensity distributions vary significantly across scans. While deep learning methods provide efficient feed-forward predictions, they often fail to generalize robustly under distribution shifts at test time. A straightforward remedy is full network fine-tuning, yet for modern architectures such as Transformers or deep U-Nets, this adaptation is prohibitively expensive in both memory and runtime when operating in 3D. Meanwhile, the naive fine-tuning struggles more with potential degradation in performance in the existence of drastic domain shifts. In this work, we propose a registration framework that integrates a frozen pretrained \textbf{mono-modal} registration model with a lightweight adaptation pipeline for \textbf{multi-modal} image registration. Specifically, we employ style transfer based on contrast-agnostic representation generation and refinement modules to bridge modality and domain gaps with instance optimization at test time. This design is orthogonal to the choice of backbone mono-modal model, thus avoids the computational burden of full fine-tuning while retaining the flexibility to adapt to unseen domains. We evaluate our approach on the Learn2Reg 2025 LUMIR validation set and observe consistent improvements over the pretrained state-of-the-art mono-modal backbone. In particular, the method ranks second on the multi-modal subset, third on the out-of-domain subset, and achieves fourth place overall in Dice score. These results demonstrate that combining frozen mono-modal models with modality adaptation and lightweight instance optimization offers an effective and practical pathway toward robust multi-modal registration.

[508] Context Adaptive Extended Chain Coding for Semantic Map Compression

Runyu Yang, Junqi Liao, Hyomin Choi, Fabien Racapé, Ivan V. Bajić

Main category: eess.IV

TL;DR: A novel lossless compression method for semantic maps using extended chain codes and skip-coding to exploit contour topology and shared boundaries, achieving 18% bitrate reduction over state-of-the-art.

Details

Motivation: Semantic maps are crucial for robotics, autonomous systems, and extended reality applications, but efficient compression methods that preserve structured semantic information are needed to handle the increasing use of these maps.

Method: Proposes a chain-coding-based framework with extended chain code (ECC) for compact contour representation, context-adaptive entropy coding with Markov modeling, and skip-coding mechanism to eliminate redundant shared contours between adjacent semantic regions.

Result: Achieves 18% average bitrate reduction compared to state-of-the-art benchmarks on semantic map datasets, with up to 98% encoder and 50% decoder runtime reduction relative to modern generic lossless codecs.

Conclusion: The proposed method effectively compresses semantic maps by exploiting their structural properties, demonstrating significant improvements in both compression efficiency and computational performance.

Abstract: Semantic maps are increasingly utilized in areas such as robotics, autonomous systems, and extended reality, motivating the investigation of efficient compression methods that preserve structured semantic information. This paper studies lossless compression of semantic maps through a novel chain-coding-based framework that explicitly exploits contour topology and shared boundaries between adjacent semantic regions. We propose an extended chain code (ECC) to represent long-range contour transitions more compactly, while retaining a legacy three-orthogonal chain code (3OT) as a fallback mode for further efficiency. To efficiently encode sequences of ECC symbols, a context-adaptive entropy coding scheme based on Markov modeling is employed. Furthermore, a skip-coding mechanism is introduced to eliminate redundant representations of shared contours between adjacent semantic regions, supporting both complete and partial skips via run-length signaling. Experimental results demonstrate that the proposed method achieves an average bitrate reduction of 18% compared with a state-of-the-art benchmark on semantic map datasets. In addition, the proposed encoder and decoder achieve up to 98% and 50% runtime reduction, respectively, relative to a modern generic lossless codec. Extended evaluations on occupancy maps further confirm consistent compression gains across the majority of tested scenarios. The source code is made publicly available at \url{https://github.com/InterDigitalInc/LosslessSegmentationMapCompression}.

[509] Enhancing Neural Video Compression of Static Scenes with Positive-Incentive Noise

Cheng Yuan, Zhenyu Jia, Jiawei Shao, Xuelong Li

Main category: eess.IV

TL;DR: A novel video compression framework for static scenes that treats temporal changes as positive-incentive noise to fine-tune neural video compression models, achieving extremely low compression rates while maintaining pixel-level fidelity.

Details

Motivation: Static scene videos (surveillance, videotelephony) dominate storage/bandwidth but current codecs and neural compression methods are inefficient. Traditional codecs don't use temporal redundancy well, neural methods suffer from training-test distribution gaps, and generative methods introduce unacceptable hallucinations for authenticity-critical applications.

Method: Proposes Positive-Incentive Camera (PIC) framework that reinterprets short-term temporal changes as positive-incentive noise to facilitate NVC model finetuning. Disentangles transient variations from persistent background, internalizing structured prior information in compression model. During inference, invariant components require minimal signaling.

Result: Achieves visually lossless reconstruction for static scenes at extremely low compression rate of 0.009%. DCVC-FM baseline requires 20.5% higher BD rate. Enables robust video transmission under adverse network conditions and economic long-term retention of surveillance footage.

Conclusion: PIC provides effective solution to trade computation for bandwidth, overcoming limitations of traditional codecs, neural compression, and generative methods for static scene videos while maintaining pixel-level fidelity for authenticity-critical applications.

Abstract: Static scene videos, such as surveillance feeds and videotelephony streams, constitute a dominant share of storage consumption and network traffic. However, both traditional standardized codecs and neural video compression (NVC) methods struggle to encode these videos efficiently due to inadequate usage of temporal redundancy and severe distribution gaps between training and test data, respectively. While recent generative compression methods improve perceptual quality, they introduce hallucinated details that are unacceptable in authenticity-critical applications. To overcome these limitations, we propose a positive-incentive camera (PIC) framework for static scene videos, where short-term temporal changes are reinterpreted as positive-incentive noise to facilitate NVC model finetuning. By disentangling transient variations from the persistent background, structured prior information is internalized in the compression model. During inference, the invariant component requires minimal signaling, thus reducing data transmission while maintaining pixel-level fidelity. Experiment results show that PIC achieves visually lossless reconstruction for static scenes at an extremely low compression rate of 0.009%, while the DCVC-FM baseline requires 20.5% higher Bjøntegaard delta (BD) rate. Our method provides an effective solution to trade computation for bandwidth, enabling robust video transmission under adverse network conditions and economic long-term retention of surveillance footage.

Editor’s Picks

[1] Gradient-Informed Training for Low-Resource Multilingual Speech Translation

[2] Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods

[3] Cinematic Audio Source Separation Using Visual Cues

Today’s Research Highlights

Table of Contents

cs.CL

[1] Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

[2] RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

[3] Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI

[4] Gradient-Informed Training for Low-Resource Multilingual Speech Translation

[5] Methods for Knowledge Graph Construction from Text Collections: Development and Applications

[6] Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

[7] Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio

[8] Can Small Models Reason About Legal Documents? A Comparative Study

[9] When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models

[10] MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

[11] Toward Culturally Grounded Natural Language Processing

[12] AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents

[13] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

[14] Retrieval-Augmented Generation Based Nurse Observation Extraction

[15] I Want to Believe (but the Vocabulary Changed): Measuring the Semantic Structure and Evolution of Conspiracy Theories

[16] IndoBERT-Relevancy: A Context-Conditioned Relevancy Classifier for Indonesian Text

[17] LLM Benchmark-User Need Misalignment for Climate Change

[18] Clash of the models: Comparing performance of BERT-based variants for generic news frame detection

[19] ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

[20] Sparse Auto-Encoders and Holism about Large Language Models

[21] Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

[22] GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation

[23] A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs

[24] Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan

[25] SocialX: A Modular Platform for Multi-Source Big Data Research in Indonesia

[26] findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

[27] From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs

[28] CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law

[29] Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

[30] Word Alignment-Based Evaluation of Uniform Meaning Representations

[31] Why Models Know But Don’t Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models

[32] Analysing Calls to Order in German Parliamentary Debates

[33] Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

[34] ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims

[35] Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs

[36] AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

[37] JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems

[38] ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs

[39] How Open Must Language Models be to Enable Reliable Scientific Inference?

[40] Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model

[41] When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

[42] MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

[43] EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching

[44] Weight Tying Biases Token Embeddings Towards the Output Space

[45] FinTruthQA: A Benchmark for AI-Driven Financial Disclosure Quality Assessment in Investor – Firm Interactions

[46] Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

[47] Don’t Stop the Multi-Party! On Generating Synthetic Written Multi-Party Conversations with Constraints

[48] Not Minds, but Signs: Reframing LLMs through Semiotics

[49] Beyond cognacy

[50] From dots to faces: Individual differences in visual imagery capacity predict the content of Ganzflicker-induced hallucinations

[51] Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations

[52] Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

[53] Attention-Aligned Reasoning for Large Language Models

[54] Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

[55] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

[56] Dual-objective Language Models: Training Efficiency Without Overfitting

[57] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

[58] Sigmoid Head for Quality Estimation under Language Ambiguity

[59] T$^\star$: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning

[60] CitiLink: Enhancing Municipal Transparency and Citizen Engagement through Searchable Meeting Minutes

[61] Formula-One Prompting: Equation-First Reasoning For Applied Mathematics

[62] ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles

[63] NRR-Phi: Text-to-State Mapping for Ambiguity Preservation in LLM Inference

[64] MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

[65] TernaryLM: Memory-Efficient Language Modeling via Native 1.5-Bit Quantization with Adaptive Layer-wise Scaling

[66] Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

[67] CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes

[68] Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

[69] A Browser-based Open Source Assistant for Multimodal Content Verification

[70] CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

[71] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

[72] The Hidden Puppet Master: Predicting Human Belief Change in Manipulative LLM Dialogues

[73] KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning