Editor's Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder
Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão
Main category: cs.MM
TL;DR: Omni-C is a single dense Transformer encoder that learns shared representations across images, audio, and text through unimodal contrastive pretraining, eliminating the need for separate expert encoders or MoE architectures.
Details
Motivation: Current multimodal systems use separate expert encoders for each modality, causing linear scaling complexity and computational overhead. While unified Omni-models with MoE architectures help, they still increase parameters and introduce routing overhead.
Method: Proposes Omni-C, a single dense Transformer-based encoder with shared backbone parameters and lightweight modality-specific projection heads. Uses unimodal contrastive pretraining on large-scale unaligned data without requiring paired supervision or routing mechanisms.
Result: Achieves performance comparable to expert models in unimodal and cross-modal tasks, with modest zero-shot degradation on audio and text that can be recovered through lightweight linear probing or parameter-efficient fine-tuning. Substantially reduces inference memory usage compared to multi-encoder baselines.
Conclusion: Omni-C demonstrates that a single dense encoder can learn competitive shared representations across heterogeneous modalities, advancing efficient and scalable multimodal learning without the complexity of MoE architectures or separate expert encoders.
Abstract: Recent multimodal systems often rely on separate expert modality encoders, which cause linearly scaling complexity and computational overhead with added modalities. While unified Omni-models address this via Mixture-of-Experts (MoE) architectures with specialized experts and routing, they still inflate parameter counts and introduce routing overhead. In this paper, we propose Omni-C (Omni-Compress), a single dense Transformer-based encoder that learns competitive shared representations across heterogeneous modalities (images, audio, and text) through unimodal contrastive pretraining on large-scale unaligned data. By maximizing parameter sharing in the backbone and using lightweight modality-specific projection heads, Omni-C effectively mitigates inter-modality conflicts without requiring MoE, paired supervision, or routing. This design supports efficient deployment on memory-constrained systems via sequential modality processing and low-memory inference, eliminating the need for parallel expert loading or specialized hardware. Experiments show Omni-C achieves performance comparable to expert models in unimodal and cross-modal tasks, with modest zero-shot degradation on audio and text that is largely recovered through lightweight linear probing or parameter-efficient fine-tuning. The unified architecture substantially reduces inference memory usage compared to multi-encoder baselines, advancing efficient and scalable multimodal learning.
Relevance: 9/10
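The unimodal contrastive objective at the heart of Omni-C can be sketched with a plain InfoNCE loss. The pure-Python toy below is only an illustration of the general contrastive form; the similarity matrix, temperature value, and batch handling are assumptions, not the paper's exact formulation.

```python
import math

def info_nce(sim, temperature=0.07):
    """Toy InfoNCE: `sim` is an n x n similarity matrix (list of lists)
    where sim[i][i] scores the positive pair for row i. Returns the mean
    cross-entropy of each row's softmax against its diagonal entry."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # -log softmax at the positive pair
    return total / n
```

Well-separated positives (large diagonal similarities) drive the loss toward zero; a shared backbone trained with such an objective per modality is the parameter-sharing idea the paper exploits.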
[2] TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation
Soumya Mazumdar, Vineet Kumar Rakesh
Main category: cs.CV
TL;DR: TempoSyncDiff is a reference-conditioned latent diffusion framework for efficient audio-driven talking-head generation that uses teacher-student distillation to enable few-step inference while maintaining quality and stability.
Details
Motivation: Current diffusion models for talking-head generation suffer from high inference latency, temporal instability (flicker, identity drift), and imperfect audio-visual alignment, especially under challenging speech conditions. There's a need for practical, efficient solutions that can work in constrained computational settings.
Method: Uses a teacher-student distillation framework where a diffusion teacher trained with standard noise prediction guides a lightweight student denoiser for few-step inference. Incorporates identity anchoring and temporal regularization to mitigate identity drift and flicker, plus viseme-based audio conditioning for coarse lip motion control.
Result: Experiments on LRS3 dataset show distilled models retain much of the teacher's reconstruction quality while enabling substantially lower latency inference. Includes CPU-only and edge computing measurements showing feasibility for edge deployment.
Conclusion: Distilled diffusion models can achieve practical talking-head generation under constrained computational settings, representing an initial step toward efficient diffusion-based THG with improved stability and reduced latency.
Abstract: Diffusion models have recently advanced photorealistic human synthesis, although practical talking-head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio-visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The approach adopts a teacher-student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker during synthesis, while viseme-based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising-stage component-level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU-only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion-based talking-head generation under constrained computational settings. GitHub: https://mazumdarsoumya.github.io/TempoSyncDiff
Relevance: 9/10
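The teacher-student objective described above can be sketched as a blend of the standard noise-prediction loss and a teacher-matching term. The plain-MSE form and the blending weight `alpha` below are illustrative assumptions, not the paper's exact loss.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors (lists)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(student_pred, teacher_pred, true_noise, alpha=0.5):
    """Pull the student's noise prediction toward both the ground-truth
    noise and the teacher's prediction, weighted by alpha."""
    return alpha * mse(student_pred, true_noise) + (1 - alpha) * mse(student_pred, teacher_pred)
```

When the student matches both targets the loss vanishes; in practice the teacher term lets the student imitate many-step teacher behaviour in far fewer denoising steps.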
[3] Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
Linzhi Wu, Xingyu Zhang, Hao Yuan, Yakun Zhang, Changyan Zheng, Liang Xie, Tiejun Liu, Erwei Yin
Main category: eess.AS
TL;DR: An end-to-end noise-robust audio-visual speech recognition framework that uses Conformer-based bottleneck fusion to implicitly refine noisy audio features with video assistance, eliminating explicit noise masks while preserving speech semantics.
Details
Motivation: Traditional AVSR methods use mask-based strategies to filter audio noise during feature fusion, but these risk discarding semantically relevant information along with noise. There's a need for a more sophisticated approach that can handle noisy audio without explicit noise masking.
Method: Proposes an end-to-end noise-robust AVSR framework coupled with speech enhancement. Uses a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance, reducing modality redundancy and enhancing inter-modal interactions without explicit noise mask generation.
Result: Experimental evaluations on the public LRS3 benchmark show the method outperforms prior advanced mask-based baselines under noisy conditions.
Conclusion: The proposed framework effectively handles noisy audio inputs in AVSR by implicitly refining features through bottleneck fusion, preserving speech semantic integrity while achieving robust recognition performance without explicit noise masking.
Abstract: Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio inputs are prone to introducing adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies to filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information alongside noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement, eliminating the need for explicit noise mask generation. This framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance. By reducing modality redundancy and enhancing inter-modal interactions, our method preserves speech semantic integrity to achieve robust recognition performance. Experimental evaluations on the public LRS3 benchmark suggest that our method outperforms prior advanced mask-based baselines under noisy conditions.
Relevance: 9/10
Today's Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 107]
- cs.CV [Total: 240]
- cs.AI [Total: 77]
- cs.SD [Total: 8]
- cs.LG [Total: 117]
- cs.MA [Total: 4]
- cs.MM [Total: 1]
- eess.AS [Total: 16]
- eess.IV [Total: 13]
cs.CL
[1] Verify as You Go: An LLM-Powered Browser Extension for Fake News Detection
Dorsaf Sallami, Esma Aïmeur
Main category: cs.CL
TL;DR: Aletheia is a browser extension that uses RAG and LLMs to detect fake news with evidence-based explanations, plus interactive discussion and fact-check features.
Details
Motivation: Fake news proliferation threatens public trust and democracy, but existing browser extensions have limitations: opaque models, poor explanations, and lack of user engagement.
Method: Develops Aletheia browser extension using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) for fake news detection with evidence-based explanations. Includes Discussion Hub for user dialogue and Stay Informed feature for recent fact-checks.
Result: Aletheia outperforms state-of-the-art baselines in detection performance. User study with 250 participants confirms system usability and perceived effectiveness.
Conclusion: Aletheia demonstrates potential as a transparent, user-centered tool for combating online fake news through LLM-powered detection with interactive features.
Abstract: The rampant spread of fake news in the digital age poses serious risks to public trust and democratic institutions, underscoring the need for effective, transparent, and user-centered detection tools. Existing browser extensions often fall short due to opaque model behavior, limited explanatory support, and a lack of meaningful user engagement. This paper introduces Aletheia, a novel browser extension that leverages Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to detect fake news and provide evidence-based explanations. Aletheia further includes two interactive components: a Discussion Hub that enables user dialogue around flagged content and a Stay Informed feature that surfaces recent fact-checks. Through extensive experiments, we show that Aletheia outperforms state-of-the-art baselines in detection performance. Complementing this empirical evaluation, a user study with 250 participants confirms the system's usability and perceived effectiveness, highlighting its potential as a transparent tool for combating online fake news.
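The retrieval half of a RAG pipeline like Aletheia's can be sketched with a toy word-overlap ranker; the actual retriever is not specified here, so the scoring function below is purely an illustrative stand-in.

```python
def rank_evidence(claim, corpus, k=2):
    """Return the k passages sharing the most words with the claim
    (toy overlap scoring; a real RAG system would use dense embeddings)."""
    claim_words = set(claim.lower().split())

    def overlap(doc):
        return len(claim_words & set(doc.lower().split()))

    return sorted(corpus, key=overlap, reverse=True)[:k]
```

The retrieved passages would then be placed into the LLM prompt so the verdict and its explanation can cite concrete evidence rather than model priors alone.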
[2] Attention Meets Reachability: Structural Equivalence and Efficiency in Grammar-Constrained LLM Decoding
Faruk Alpay, Bilge Senturk
Main category: cs.CL
TL;DR: Grammar-constrained decoding (GCD) theory connecting CFGs to autoregressive models, with invariance theorems, ambiguity cost analysis, and performance bounds for masked decoding.
Details
Motivation: To formalize the computational and statistical properties of grammar-constrained decoding, understanding how different grammar representations affect decoding efficiency and quality while maintaining language equivalence.
Method: Theoretical analysis of GCD as coupling between autoregressive models and pushdown automata from CFGs, with invariance proofs, structural ambiguity cost (SAC) metrics, lower bound derivations, and integration with Transformer/MoE architectures.
Result: Proved oracle invariance theorem, quantified state blowup for a^n b^n, established Ω(t²) lower bounds per token, defined decoding-cost equivalence classes, derived KL/TV distortion bounds for hard-masked decoding, and integrated with practical architectures.
Conclusion: Grammar choice significantly impacts decoding efficiency despite language equivalence, with theoretical foundations for optimizing grammar representations and understanding trade-offs in constrained generation systems.
Abstract: We study grammar-constrained decoding (GCD) as a coupling between an autoregressive next-token distribution and a reachability oracle over a pushdown system compiled from a context-free grammar (CFG). We prove an oracle invariance theorem: language-equivalent grammars induce identical admissible next-token sets for every prefix, hence identical logit masks, yet can yield provably different compiled state spaces and online ambiguity costs. We give exact control-state blowup counts for the canonical $a^n b^n$ language under redundant nonterminal delegation, and introduce a left-to-right structural ambiguity cost (SAC) measuring incremental packed-parse-forest growth per token. For two equivalent grammars over all finite strings, SAC is $O(1)$ per token under right-recursion but $Θ(t^2)$ per token and $Θ(n^3)$ cumulatively under concatenation. We establish engine-independent lower bounds: any sound, retrieval-efficient, parse-preserving online masking engine must incur $Ω(t^2)$ work per token on a specific constant-size CFG family, unconditionally within this model. We define decoding-cost equivalence classes of grammars and prove existence of minimal-SAC representatives within bounded rewrite families. Finally, we characterize the true conditional sampler via a Doob $h$-transform and derive sharp one-step KL and total-variation distortion bounds for hard-masked decoding in terms of survival-probability spread among admissible next tokens. We integrate these results with Transformer and Mixture-of-Experts architectures, derive latency envelopes in terms of vocabulary size, active state sets, and beam width, and connect SAC to instrumentation-based predictive performance models and automated grammar optimization.
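The hard-masking step the paper analyzes can be written down directly: grammar-inadmissible tokens get probability zero and the rest are renormalized. The token indices and logit values below are illustrative, but the masking itself is the standard GCD construction.

```python
import math

def hard_mask(logits, admissible):
    """Zero out non-admissible tokens and renormalize the softmax.
    `admissible` is the set of token indices the grammar oracle allows."""
    kept = {i: l for i, l in enumerate(logits) if i in admissible}
    m = max(kept.values())  # stabilize the softmax
    exps = {i: math.exp(l - m) for i, l in kept.items()}
    z = sum(exps.values())
    return [exps.get(i, 0.0) / z for i in range(len(logits))]
```

The paper's KL/TV distortion bounds quantify how far this renormalized distribution drifts from the true grammar-conditioned sampler as probability mass is discarded.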
[3] MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing
Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang
Main category: cs.CL
TL;DR: MASFactory is a graph-centric framework for orchestrating LLM-based multi-agent systems that uses Vibe Graphing to compile natural-language intent into executable workflows with reusable components and visual tools.
Details
Motivation: Current frameworks for implementing complex multi-agent system workflows require substantial manual effort, offer limited reuse, and make it difficult to integrate heterogeneous external context sources.
Method: Introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into editable workflow specifications and executable graphs, with reusable components, pluggable context integration, and visualizer tools.
Result: Evaluated on seven public benchmarks, validating reproduction consistency for representative MAS methods and effectiveness of Vibe Graphing approach.
Conclusion: MASFactory provides a practical framework for orchestrating LLM-based multi-agent systems with improved usability, reusability, and context integration capabilities.
Abstract: Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory) and video (https://youtu.be/ANynzVfY32k) are publicly available.
[4] NOTAI.AI: Explainable Detection of Machine-Generated Text via Curvature and Feature Attribution
Oleksandr Marchenko Breneur, Adelaide Danilov, Aria Nourbakhsh, Salima Lamsiyah
Main category: cs.CL
TL;DR: NOTAI.AI is an explainable framework for AI-generated text detection that combines curvature-based signals with neural and stylometric features using XGBoost, with SHAP explanations and LLM-generated rationales.
Details
Motivation: The need for transparent and interpretable AI-generated text detection systems that go beyond black-box approaches, providing users with understandable explanations for detection decisions.
Method: Extends Fast-DetectGPT by integrating 17 interpretable features (Conditional Probability Curvature, ModernBERT detector score, readability metrics, stylometric cues) in XGBoost meta-classifier, with SHAP for feature attribution and LLM-based explanation layer for natural-language rationales.
Result: Deployed as interactive web application supporting real-time analysis, visual feature inspection, and structured evidence presentation with publicly available source code and demo.
Conclusion: NOTAI.AI provides an explainable framework for AI text detection that combines multiple signal types with interpretability features, making detection decisions transparent and understandable to users.
Abstract: We present NOTAI.AI, an explainable framework for machine-generated text detection that extends Fast-DetectGPT by integrating curvature-based signals with neural and stylometric features in a supervised setting. The system combines 17 interpretable features, including Conditional Probability Curvature, ModernBERT detector score, readability metrics, and stylometric cues, within a gradient-boosted tree (XGBoost) meta-classifier to determine whether a text is human- or AI-generated. Furthermore, NOTAI.AI applies Shapley Additive Explanations (SHAP) to provide both local and global feature-level attribution. These attributions are further translated into structured natural-language rationales through an LLM-based explanation layer, which enables user-facing interpretability. The system is deployed as an interactive web application that supports real-time analysis, visual feature inspection, and structured evidence presentation. A web interface allows users to input text and inspect how neural and statistical signals influence the final decision. The source code and demo video are publicly available to support reproducibility.
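The meta-classification step can be illustrated with a toy linear scorer over named features. NOTAI.AI actually uses XGBoost with SHAP attributions; the logistic form, feature names, and weights below are illustrative stand-ins only.

```python
import math

def detect_score(features, weights, bias=0.0):
    """Toy linear stand-in for the meta-classifier: combine named
    interpretable feature values into P(machine-generated)."""
    z = bias + sum(weights.get(name, 0.0) * val for name, val in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

For a linear model the per-feature terms `weights[name] * val` act as a crude local attribution, which is the role SHAP plays for the gradient-boosted trees in the real system.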
[5] Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs
Patrick Ahrend, Tobias Eder, Xiyang Yang, Zhiyi Pan, Georg Groh
Main category: cs.CL
TL;DR: CoT prompting increases PII leakage risk in LLMs, with leakage patterns varying by model family and reasoning budget; lightweight gatekeeping methods show trade-offs, motivating adaptive policies.
Details
Motivation: Chain-of-Thought prompting improves reasoning but increases privacy risks by leaking personally identifiable information from prompts into reasoning traces and outputs, even when models are instructed not to restate PII.
Method: Developed a model-agnostic framework to measure PII leakage: (1) defines leakage as risk-weighted token-level events across 11 PII types, (2) traces leakage curves as function of CoT budget, (3) compares open- and closed-source models on structured PII dataset with hierarchical risk taxonomy. Benchmarked lightweight gatekeepers: rule-based detector, TF-IDF+logistic regression, GLiNER-based NER, and LLM-as-judge.
Result: CoT consistently elevates PII leakage, especially for high-risk categories. Leakage is strongly dependent on model family and reasoning budget - increasing budget can either amplify or attenuate leakage depending on base model. No single gatekeeping method dominates across models or budgets.
Conclusion: Hybrid, style-adaptive gatekeeping policies are needed to balance utility and privacy risk under reproducible protocols, as PII leakage patterns vary significantly across models and reasoning budgets.
Abstract: Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, using risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
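The paper's risk-weighted, token-level leakage definition can be sketched as a weighted count of detected PII events normalized by output length. The event format, PII type names, and weight values below are illustrative assumptions, not the paper's taxonomy.

```python
def leakage_score(events, risk_weights, n_tokens):
    """Sum risk weights over detected PII events, normalized by the
    number of generated tokens. `events` is a list of
    (pii_type, token_span) pairs found in a reasoning trace or output."""
    total = sum(risk_weights.get(pii_type, 0.0) for pii_type, _span in events)
    return total / max(n_tokens, 1)
```

Tracing this score as the allowed CoT budget grows yields the leakage curves the paper uses to compare model families.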
[6] The Fragility Of Moral Judgment In Large Language Models
Tom van Nuenen, Pratik S. Sachdeva
Main category: cs.CL
TL;DR: LLM moral judgments are highly sensitive to narrative perspective shifts and evaluation protocols, raising concerns about reproducibility and equity in AI moral guidance.
Details
Motivation: As people increasingly use LLMs for moral guidance, there's a need to test the stability and manipulability of these systems' moral judgments, particularly regarding how they respond to different narrative presentations and evaluation protocols.
Method: Developed a perturbation framework using 2,939 dilemmas from Reddit's r/AmItheAsshole, generating three perturbation families: surface edits, point-of-view shifts, and persuasion cues. Evaluated with four LLMs using varied protocols (output ordering, instruction placement, unstructured prompting).
Result: Surface perturbations had low flip rates (7.5%), while point-of-view shifts caused high instability (24.3%). 37.9% of dilemmas were robust to surface noise but flipped under perspective changes. Persuasion perturbations caused systematic directional shifts. Protocol choices dominated all factors, with only 35.7% consistency across all three protocols.
Conclusion: LLM moral judgments are co-produced by narrative form and task scaffolding, raising significant reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.
Abstract: People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluated all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B) (N=129,156 judgments). Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.
[7] FreeTxt-Vi: A Benchmarked Vietnamese-English Toolkit for Segmentation, Sentiment, and Summarisation
Hung Nguyen Huy, Mo El-Haj, Dawn Knight, Paul Rayson
Main category: cs.CL
TL;DR: FreeTxt-Vi is an open-source bilingual Vietnamese-English text analysis toolkit combining corpus linguistics features with transformer-based NLP for sentiment analysis and summarization without requiring programming skills.
Details
Motivation: To create an accessible, free toolkit for bilingual Vietnamese-English text analysis that reduces technical barriers for non-programmers, supporting underrepresented Vietnamese language resources in NLP while enabling reproducible research across education, humanities, and social sciences.
Method: Develops a unified bilingual NLP pipeline with hybrid VnCoreNLP and Byte Pair Encoding segmentation, fine-tuned TabularisAI sentiment classifier, and fine-tuned Qwen2.5 model for abstractive summarization, integrated into a web-based corpus analysis platform.
Result: The toolkit achieves competitive or superior performance compared to baselines in segmentation, sentiment analysis, and summarization for both Vietnamese and English, as demonstrated through three-part evaluation.
Conclusion: FreeTxt-Vi successfully bridges corpus linguistics and NLP, making multilingual text analysis accessible while promoting Vietnamese language resources development, with applications across various domains dealing with qualitative text data.
Abstract: FreeTxt-Vi is a free and open-source, web-based toolkit for creating and analysing bilingual Vietnamese-English text collections. Positioned at the intersection of corpus linguistics and natural language processing (NLP), it enables users to build, explore, and interpret free-text data without requiring programming expertise. The system combines corpus-analysis features such as concordancing, keyword analysis, word-relation exploration, and interactive visualisation with transformer-based NLP components for sentiment analysis and summarisation. A key contribution of this work is the design of a unified bilingual NLP pipeline that integrates a hybrid VnCoreNLP and Byte Pair Encoding (BPE) segmentation strategy, a fine-tuned TabularisAI sentiment classifier, and a fine-tuned Qwen2.5 model for abstractive summarisation. Unlike existing text-analysis platforms, FreeTxt-Vi is evaluated as a set of language-processing components. We conduct a three-part evaluation covering segmentation, sentiment analysis, and summarisation, and show that our approach achieves competitive or superior performance compared to widely used baselines in both Vietnamese and English. By reducing technical barriers to multilingual text analysis, FreeTxt-Vi supports reproducible research and promotes the development of language resources for Vietnamese, a widely spoken but underrepresented language in NLP. The toolkit is applicable to domains including education, digital humanities, cultural heritage, and the social sciences, where qualitative text data are common but often difficult to process at scale.
[8] Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis
Hazem Amamou, Stéphane Gagnon, Alan Davoust, Anderson R. Avila
Main category: cs.CL
TL;DR: Evaluation of RAG systems using RGB benchmark shows GraphRAG outperforms baseline in handling inconsistent retrieved information across four robustness scenarios
Details
Motivation: RAG enhances LLMs with external knowledge but suffers from inconsistent retrieved information; there is a need to evaluate and improve the robustness of RAG systems against such issues.
Method: Use the RGB benchmark to evaluate LLMs in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness; compare the RGB RAG baseline with GraphRAG (knowledge-graph-based retrieval) with three customizations.
Result: GraphRAG shows improvements over RGB baseline; provides insights for designing more reliable RAG systems for real-world scenarios
Conclusion: Knowledge graph-based retrieval (GraphRAG) offers better robustness against inconsistent information compared to baseline RAG systems
Abstract: Retrieval-Augmented Generation (RAG) was introduced to enhance the capabilities of Large Language Models (LLMs) beyond their encoded prior knowledge. This is achieved by providing LLMs with an external source of knowledge, which helps reduce factual hallucinations and enables access to new information not available during pretraining. However, inconsistent retrieved information can negatively affect LLM responses. The Retrieval-Augmented Generation Benchmark (RGB) was introduced to evaluate the robustness of RAG systems under such conditions. In this work, we use the RGB corpus to evaluate LLMs in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness. We perform a comparative analysis between the RGB RAG baseline and GraphRAG, a knowledge-graph-based retrieval system. We test three GraphRAG customizations to improve robustness. Results show improvements over the RGB baseline and provide insights for designing more reliable RAG systems for real-world scenarios.
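Knowledge-graph-based retrieval of the GraphRAG kind can be sketched as collecting facts within a few hops of the question's entities. The dict-based KG encoding and single-hop default below are illustrative, not the evaluated system's implementation.

```python
def graph_retrieve(entities, kg, hops=1):
    """Gather (subject, relation, object) facts reachable from the seed
    entities within `hops` steps. `kg` maps an entity to a list of
    (relation, object) edges."""
    frontier, facts = set(entities), []
    for _ in range(hops):
        nxt = set()
        for e in frontier:
            for rel, obj in kg.get(e, []):
                facts.append((e, rel, obj))
                nxt.add(obj)
        frontier = nxt
    return facts
```

Because the retrieved unit is a structured fact rather than a noisy passage, this style of retrieval gives the generator cleaner, more consistent context, which is one intuition behind the robustness gains reported above.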
[9] Cultural Perspectives and Expectations for Generative AI: A Global Survey Approach
Erin van Liemt, Renee Shelby, Andrew Smart, Sinchana Kumbale, Richard Zhang, Neha Dixit, Qazi Mamunur Rashid, Jamila Smith-Loud
Main category: cs.CL
TL;DR: Global survey reveals diverse cultural perspectives on GenAI representation, proposing participatory approaches and sensitivity frameworks for culturally-aware AI development
Details
Motivation: There is insufficient empirical evidence about global attitudes regarding how Generative AI should represent different cultures, creating a gap in understanding cultural perspectives in AI development.
Method: Conducted large-scale global survey across Europe, North/South America, Asia, and Africa to gather data on cultural definitions and beliefs about GenAI representation of cultural artifacts, concepts, and values.
Result: Distilled working definitions of culture from diverse communities, revealing conceptual complexities in cultural representation, and identified specific cultural dimensions (religion, tradition) beyond geography that need prioritization
Conclusion: Proposes recommendations including participatory approaches, prioritizing specific cultural dimensions, and developing sensitivity frameworks to address cultural "redlines" in GenAI development
Abstract: There is a lack of empirical evidence about global attitudes around whether and how GenAI should represent cultures. This paper assesses understandings and beliefs about culture as it relates to GenAI from a large-scale global survey. We gathered data about what culture means to different groups, and about how GenAI should approach the representation of cultural artifacts, concepts, or values. We distill working definitions of culture directly from these communities to build an understanding of its conceptual complexities and how they relate to representations in Generative AI. We survey from across parts of Europe, North and South America, Asia, and Africa. We conclude with a set of recommendations for Culture and GenAI development. These include participatory approaches, prioritizing specific cultural dimensions beyond geography, such as religion and tradition, and a sensitivity framework for addressing cultural “redlines”.
[10] Structured Multidimensional Representation Learning for Large Language Models
Alaa El Ichi, Khalide Jbilou, Mohamed El Guide, Franck Dufrenois
Main category: cs.CL
TL;DR: Tensor Transformer architecture using spectral factorization via L-product reduces embedding redundancy while preserving Transformer semantics, achieving up to 75% parameter reduction with competitive accuracy.
Details
Motivation: Transformer architectures suffer from substantial parameter growth and redundancy in embedding dimensions as they scale. The paper aims to address this inefficiency while maintaining performance.
Method: Introduces structured spectral factorization of embedding space using L-product for third-order tensors. Token representations are reshaped into spectral tensor slices, with attention and feed-forward operations performed in the transform domain. Uses Discrete Cosine Transform (DCT) for differentiable implementation.
Result: Achieves up to 75% encoder parameter reduction (for p=4) while maintaining competitive accuracy on IMDB and AG News datasets. On IMDB, matches or improves baseline under compression; on AG News, small accuracy decrease for 4x encoder reduction, but returns to parity at BERT-base width (d=768).
Conclusion: The Tensor Transformer architecture effectively reduces parameter redundancy through spectral decomposition while preserving Transformer semantics, offering both compression benefits and improved generalization through frequency-based inductive bias.
Abstract: Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the embedding dimension. In this work, we introduce a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, we obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics. We prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, which yields approximately 1/p reduction (up to lower-order terms such as biases and normalization parameters) in encoder parameters under fixed total embedding size. When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AGNews show that the proposed model can substantially reduce encoder parameters (up to 75% for p=4) while maintaining competitive accuracy. On IMDB, the tensorized encoder matches or improves upon the standard baseline under compression, whereas on AGNews at moderate width we observe a small accuracy decrease in exchange for a 4 times encoder reduction; at BERT-base width (d=768), performance returns to parity.
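The transform step described above can be sketched in a few lines of numpy: an orthonormal DCT moves token embeddings into the spectral domain, the coefficients are split into p slices that independent sub-transformers would process, and the inverse DCT merges them back. This is an illustrative sketch of the slicing idea only, not the paper's implementation; the per-slice attention and feed-forward blocks are omitted.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix (scipy's norm="ortho" convention): C @ C.T == I.
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def to_spectral_slices(X, p):
    # X: (tokens, d) embeddings. Apply the DCT along the embedding axis,
    # then split the d spectral coefficients into p contiguous slices,
    # each of which a sub-transformer would process independently.
    C = dct_matrix(X.shape[1])
    return np.split(X @ C.T, p, axis=1), C

def from_spectral_slices(slices, C):
    # Concatenate the processed slices and invert the orthonormal DCT.
    return np.concatenate(slices, axis=1) @ C
```

Since C is orthonormal, the round trip is lossless; any parameter savings come from the p independent, narrower sub-transformers acting between the two transforms.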
[11] Let’s Talk, Not Type: An Oral-First Multi-Agent Architecture for GuaranĂ
Samantha Adorno, Akshata Kishore Moharir, Ratna Kandala
Main category: cs.CL
TL;DR: Position paper arguing that AI/HCI systems need oral-first design for indigenous languages like GuaranĂ, proposing multi-agent architecture that respects oral practices and data sovereignty.
Details
Motivation: Current AI/HCI systems are text-first and underserve oral languages and indigenous communities, failing to align with lived oral practices despite claims of universality.
Method: Proposes oral-first multi-agent architecture that decouples GuaranĂ natural language understanding from conversation state agents and community-led governance, focusing on turn-taking, repair, and shared context.
Result: Demonstrates a technical framework that respects indigenous data sovereignty and diglossia, moving beyond mere recognition to focus on conversational interaction patterns.
Conclusion: For AI to be culturally grounded, it must treat spoken conversation as first-class design requirement rather than adapting oral languages to text-centric systems, ensuring digital ecosystems empower diverse linguistic practices.
Abstract: Although artificial intelligence (AI) and Human-Computer Interaction (HCI) systems are often presented as universal solutions, their design remains predominantly text-first, underserving primarily oral languages and indigenous communities. This position paper uses GuaranĂ, an official and widely spoken language of Paraguay, as a case study to argue that language support in AI remains insufficient unless it aligns with lived oral practices. We propose an alternative to the standard “text-to-speech” pipeline, proposing instead an oral-first multi-agent architecture. By decoupling GuaranĂ natural language understanding from dedicated agents for conversation state and community-led governance, we demonstrate a technical framework that respects indigenous data sovereignty and diglossia. Our work moves beyond mere recognition to focus on turn-taking, repair, and shared context as the primary locus of interaction. We conclude that for AI to be truly culturally grounded, it must shift from adapting oral languages to text-centric systems to treating spoken conversation as a first-class design requirement, ensuring digital ecosystems empower rather than overlook diverse linguistic practices.
[12] CodeScout: Contextual Problem Statement Enhancement for Software Agents
Manan Suri, Xiangci Li, Mehdi Shojaie, Songyang Han, Chao-Chun Hsu, Shweta Garg, Aniket Anand Deshmukh, Varun Kumar
Main category: cs.CL
TL;DR: CodeScout improves AI code assistance by converting underspecified user requests into comprehensive problem statements through lightweight pre-exploration of codebases, reducing agent failure patterns and improving resolution rates by 20%.
Details
Motivation: Current AI code assistance tools struggle with poorly-defined problem statements lacking sufficient context, leading to longer agent trajectories with over-exploration or repeated failed fix attempts, resulting in suboptimal outcomes.
Method: CodeScout performs contextual query refinement through structured pre-exploration: targeted context scoping, multi-perspective analysis of potential fixes and exploration opportunities, then synthesizes insights into enhanced problem statements with reproduction steps, expected behaviors, and exploration hints.
Result: Evaluation on SWEBench-Verified shows 20% improvement in resolution rates with up to 27 additional issues resolved compared to baseline, reducing non-converging agent trajectories while clarifying user intent.
Conclusion: Systematic query refinement through contextual analysis represents a promising direction for enhancing AI code assistance capabilities without requiring modifications to existing agentic scaffolds.
Abstract: Current AI-powered code assistance tools often struggle with poorly-defined problem statements that lack sufficient task context and requirements specification. Recent analysis of software engineering agents reveals that failures on such underspecified requests are highly correlated with longer trajectories involving either over-exploration or repeated attempts at applying the same fix without proper evolution or testing, leading to suboptimal outcomes across software development tasks. We introduce CodeScout, a contextual query refinement approach that systematically converts underspecified user requests into comprehensive, actionable problem statements through lightweight pre-exploration of the target codebase. Our key innovation is demonstrating that structured analysis before task execution can supplement existing agentic capabilities without requiring any modifications to their underlying scaffolds. CodeScout performs targeted context scoping, conducts multi-perspective analysis examining potential fixes and exploration opportunities, then synthesizes these insights into enhanced problem statements with reproduction steps, expected behaviors, and targeted exploration hints. This pre-exploration directly addresses the identified failure patterns by reducing non-converging agent trajectories while clarifying user intent in natural language space. We evaluate CodeScout using state-of-the-art agentic scaffolds and language models on SWEBench-Verified, demonstrating a 20% improvement in resolution rates with up to 27 additional issues resolved compared to the default baseline method. Our results suggest that systematic query refinement through contextual analysis represents a promising direction for enhancing AI code assistance capabilities.
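The shape of an enhanced problem statement described above can be sketched as a small data structure; the field and method names here are illustrative, not CodeScout's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class EnhancedProblemStatement:
    # Components named in the paper's description: a reproduction recipe,
    # the expected behavior, and targeted exploration hints synthesized
    # from a lightweight pre-exploration of the codebase.
    original_request: str
    reproduction_steps: list = field(default_factory=list)
    expected_behavior: str = ""
    exploration_hints: list = field(default_factory=list)

    def render(self) -> str:
        # Render the statement that a downstream agent would receive.
        steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(self.reproduction_steps))
        hints = "\n".join(f"- {h}" for h in self.exploration_hints)
        return (f"## Issue\n{self.original_request}\n\n"
                f"## Reproduction\n{steps}\n\n"
                f"## Expected behavior\n{self.expected_behavior}\n\n"
                f"## Where to look\n{hints}")
```

Because the refinement happens purely in the problem-statement text, it plugs in front of any agentic scaffold without modifying it.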
[13] NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories
Genet Asefa Gesese, Zongxiong Chen, Shufan Jiang, Mary Ann Tan, Zhaotai Liu, Sonja Schimmler, Harald Sack
Main category: cs.CL
TL;DR: NERdME dataset: 200 annotated README files with 10,000+ labeled spans and 10 entity types for scholarly information extraction from code repositories, showing differences from paper-level entities and supporting artifact discovery.
Details
Motivation: Existing scholarly information extraction datasets focus on scientific papers but overlook implementation-level details in code repositories. README files contain valuable information about datasets, source code, and implementation artifacts, but their free-form Markdown format lacks semantic structure, making automatic extraction difficult.
Method: Created NERdME dataset with 200 manually annotated README files containing over 10,000 labeled spans across 10 entity types. Evaluated baseline methods using large language models and fine-tuned transformers. Conducted downstream entity-linking experiment to demonstrate practical applications.
Result: Baseline results show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. Entity-linking experiment demonstrated that entities derived from READMEs can support artifact discovery and metadata integration.
Conclusion: NERdME addresses the gap in scholarly information extraction by providing a dataset for implementation-level details from README files, showing that these entities differ from paper-level entities and can support practical applications like artifact discovery.
Abstract: Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts, however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment was conducted to demonstrate that entities derived from READMEs can support artifact discovery and metadata integration.
[14] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?
Jayadev Billa
Main category: cs.CL
TL;DR: Speech LLMs are evaluated and found to often behave like expensive ASR→LLM cascades, with their advantages reversing under noisy conditions despite direct audio access.
Details
Motivation: To understand the actual behavior of speech LLMs compared to ASR→LLM cascades, and to determine whether their direct audio access provides meaningful advantages or if they simply function as expensive cascades.
Method: Introduces matched-backbone testing to separate speech LLM behavior from underlying LLM reasoning capabilities, and uses mechanistic analysis techniques (logit lens and LEACE) to examine hidden state representations and causal relationships.
Result: Speech LLMs show literal transcripts emerging from hidden states, text representations are causally necessary, and in most deployed use cases they function as expensive cascades. Under noise, they perform worse than ASR→LLM cascades, with clean-condition advantages reversing by up to 7.6% at 0dB.
Conclusion: Current speech LLMs often don’t leverage audio information effectively and behave like expensive cascades, with their advantages disappearing or reversing under noisy conditions, suggesting they need better integration of audio information.
Abstract: Speech LLMs are widely understood to be better than ASR$\rightarrow$LLM cascades since they have access to the audio directly, and not just the transcript. In this paper, we present an evaluation methodology and a mechanistic interpretation of the observed behavior of speech LLMs. First, we introduce matched-backbone testing which separates out the behavior of the speech LLM from the reasoning capabilities of the underlying LLM. Second, we provide a mechanistic analysis of speech LLMs using logit lens and LEACE and show the literal transcript emerging from the LLM’s hidden states and that text representations are causally necessary. We also show that in most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0dB.
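A logit-lens readout of the kind the paper uses can be sketched as follows: each layer's hidden state is projected through the model's unembedding matrix, revealing which tokens are already decodable at that depth. A minimal numpy sketch, with a toy unembedding standing in for a real model's weights.

```python
import numpy as np

def logit_lens(hidden_states, W_unembed, top_k=3):
    # hidden_states: per-layer (d,) vectors at one sequence position.
    # W_unembed: (vocab, d) output embedding matrix. Projecting each
    # intermediate state through it shows which tokens are "already
    # predicted" at that layer -- e.g. a literal transcript surfacing
    # mid-network, as the paper reports.
    readouts = []
    for h in hidden_states:
        logits = W_unembed @ h
        readouts.append(np.argsort(logits)[::-1][:top_k].tolist())
    return readouts
```

In a real model the same projection is applied to residual-stream activations captured by forward hooks; only the source of `hidden_states` changes.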
[15] PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models
Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree
Main category: cs.CL
TL;DR: PVminerLLM: A fine-tuned LLM for extracting structured patient voice information from patient-generated text, outperforming prompt-based methods on social and experiential signal extraction.
Details
Motivation: Patient-generated text contains valuable information about lived experiences, social circumstances, and care engagement that strongly influences health outcomes, but this unstructured data is rarely available in structured form for research and clinical improvement.
Method: Introduces PVminer benchmark for patient voice extraction and proposes PVminerLLM, a supervised fine-tuned large language model specifically tailored for structured extraction of patient voice information from text.
Result: PVminerLLM substantially outperforms prompt-based baselines across multiple datasets and model sizes, achieving up to 83.82% F1 for Code prediction, 80.74% F1 for Sub-code prediction, and 87.03% F1 for evidence Span extraction, with strong performance even with smaller models.
Conclusion: Reliable patient voice extraction is feasible without extreme model scale, enabling scalable analysis of social and experiential signals embedded in patient-generated text for patient-centered outcomes research and clinical quality improvement.
Abstract: Motivation: Patient-generated text contains critical information about patients’ lived experiences, social circumstances, and engagement in care, including factors that strongly influence adherence, care coordination, and health equity. However, these patient voice signals are rarely available in structured form, limiting their use in patient-centered outcomes research and clinical quality improvement. Reliable extraction of such information is therefore essential for understanding and addressing non-clinical drivers of health outcomes at scale. Results: We introduce PVminer, a benchmark for structured extraction of patient voice, and propose PVminerLLM, a supervised fine-tuned large language model tailored to this task. Across multiple datasets and model sizes, PVminerLLM substantially outperforms prompt-based baselines, achieving up to 83.82% F1 for Code prediction, 80.74% F1 for Sub-code prediction, and 87.03% F1 for evidence Span extraction. Notably, strong performance is achieved even with smaller models, demonstrating that reliable patient voice extraction is feasible without extreme model scale. These results enable scalable analysis of social and experiential signals embedded in patient-generated text. Availability and Implementation: Code, evaluation scripts, and trained LLMs will be released publicly. Annotated datasets will be made available upon request for research use. Keywords: Large Language Models, Supervised Fine-Tuning, Medical Annotation, Patient-Generated Text, Clinical NLP
[16] Tutor Move Taxonomy: A Theory-Aligned Framework for Analyzing Instructional Moves in Tutoring
Zhuqian Zhou, Kirk Vanacore, Tamisha Thompson, Jennifer St John, Rene Kizilcec
Main category: cs.CL
TL;DR: A taxonomy for analyzing tutoring dialogue moves developed through hybrid deductive-inductive process to enable large-scale analysis of instructional actions in tutoring sessions.
Details
Motivation: To systematically analyze tutors' instructional actions during learning interactions and support large-scale analysis of tutoring dialogue within the National Tutoring Observatory.
Method: Hybrid deductive-inductive process: synthesized research from cognitive science, learning sciences, classroom discourse analysis, and intelligent tutoring systems to create preliminary framework, then refined through iterative coding of authentic tutoring transcripts by expert annotators.
Result: Developed a taxonomy organizing tutoring behaviors into four categories: tutoring support, learning support, social-emotional and motivational support, and logistical support, with learning support further organized along a spectrum of student engagement.
Conclusion: The taxonomy enables scalable annotation using AI, computational modeling of tutoring strategies, and empirical analysis of how tutoring behaviors relate to learning outcomes by defining tutoring dialogue in terms of discrete instructional actions.
Abstract: Understanding what makes tutoring effective requires methods for systematically analyzing tutors’ instructional actions during learning interactions. This paper presents a tutor move taxonomy designed to support large-scale analysis of tutoring dialogue within the National Tutoring Observatory. The taxonomy provides a structured annotation framework for labeling tutors’ instructional moves during one-on-one tutoring sessions. We developed the taxonomy through a hybrid deductive-inductive process. First, we synthesized research from cognitive science, the learning sciences, classroom discourse analysis, and intelligent tutoring systems to construct a preliminary framework of tutoring moves. We then refined the taxonomy through iterative coding of authentic tutoring transcripts conducted by expert annotators with extensive instructional and qualitative research experience. The resulting taxonomy organizes tutoring behaviors into four categories: tutoring support, learning support, social-emotional and motivational support, and logistical support. Learning support moves are further organized along a spectrum of student engagement, distinguishing between moves that elicit student reasoning and those that provide direct explanation or answers. By defining tutoring dialogue in terms of discrete instructional actions, the taxonomy enables scalable annotation using AI, computational modeling of tutoring strategies, and empirical analysis of how tutoring behaviors relate to learning outcomes.
[17] RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning
Yuhang Liu, Ruijie Wang, Yunlong Chu, Bing Hao, Yumeng Lin, Shengzhong Liu, Minglai Shao
Main category: cs.CL
TL;DR: RouteGoT: A budget-controllable routing framework for graph-structured reasoning that adaptively allocates strong models for planning/synthesis and lightweight models for leaf subtasks to optimize performance-cost trade-offs.
Details
Motivation: Current graph-of-thoughts methods (ToT, GoT, AGoT) often introduce substantial token/latency overhead with unstable gains, sometimes underperforming simpler methods. This inefficiency stems from heterogeneity in reasoning pipelines where planning/synthesis need strong models but intermediate subtasks can be solved by lighter models.
Method: RouteGoT performs in-graph routing by prioritizing strong models for planning and synthesis while dynamically allocating lightweight models to leaf subtasks based on predicted difficulty. It integrates explicit budget constraints into a global inference scheduler to control graph expansion under user-specified token budgets.
Result: RouteGoT matches or improves accuracy while substantially reducing token usage, achieving average 8.1 percentage points accuracy improvement and 79.1% output token reduction compared to AGoT. It maintains superior cost-accuracy trade-off and improved robustness under varying budget targets and tasks.
Conclusion: RouteGoT effectively addresses inefficiencies in graph-structured reasoning by adaptive model routing and budget control, enabling predictable performance-cost trade-offs while maintaining or improving accuracy over existing methods.
Abstract: Large Language Models (LLMs) excel at multi-step reasoning, yet increasing the structural complexity of inference does not consistently improve system-level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead in token consumption and latency, and their gains can be unstable across task distributions, sometimes underperforming simpler Chain-of-Thought (CoT) or direct input-output prompting (IO). We attribute this inefficiency to stage-wise and node-wise heterogeneity inside GoT-style reasoning pipelines: high-quality planning and final synthesis are globally coupled and typically benefit from strong models, whereas many intermediate subtasks are localized and can be solved accurately by lighter models with far fewer tokens. Motivated by these observations, we propose RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. RouteGoT performs in-graph routing by prioritizing strong models for planning and synthesis, while dynamically allocating lightweight models and cost-effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler to control graph expansion under a user-specified token budget, enabling predictable performance-cost trade-offs. Experiments across reasoning, retrieval, and multi-hop QA benchmarks show that RouteGoT matches or improves accuracy while substantially reducing token usage; specifically, it achieves an average 8.1 percentage points accuracy improvement and 79.1% output token reduction compared to AGoT. Furthermore, RouteGoT outperforms existing routing baselines by maintaining a superior cost-accuracy trade-off, demonstrating improved robustness under varying budget targets and tasks.
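The routing rule described above can be sketched as a simple scheduler: planning and synthesis nodes always get the strong model, leaves get a lightweight model unless predicted difficulty is high, and expansion halts once the token budget would be exceeded. The node schema and the difficulty threshold here are illustrative assumptions, not RouteGoT's exact scheduler.

```python
def route_graph(nodes, token_budget, difficulty_threshold=0.7):
    # nodes: dicts with "id", "role" ("plan" | "leaf" | "synthesize"),
    # "difficulty" in [0, 1], and an estimated token "cost".
    # Returns (assignments, tokens_spent) under the user budget.
    assignments, spent = [], 0
    for node in nodes:
        if spent + node["cost"] > token_budget:
            break  # budget-controlled graph expansion: stop growing the graph
        strong = (node["role"] in ("plan", "synthesize")
                  or node["difficulty"] > difficulty_threshold)
        assignments.append((node["id"], "strong" if strong else "light"))
        spent += node["cost"]
    return assignments, spent
```

A real scheduler would interleave routing with graph construction and refine difficulty predictions online; the point of the sketch is the per-node model choice under an explicit budget.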
[18] HART: Data-Driven Hallucination Attribution and Evidence-Based Tracing for Large Language Models
Shize Liang, Hongzhi Wang
Main category: cs.CL
TL;DR: HART is a fine-grained hallucination attribution and evidence retrieval framework for LLMs that structures hallucination tracing into four stages: span localization, mechanism attribution, evidence retrieval, and causal tracing.
Details
Motivation: LLMs produce hallucinated content that undermines reliability in high-stakes domains. Existing hallucination attribution approaches focus on semantic similarity or representation-level discrimination, lacking structured correspondences between hallucination types, error mechanisms, and factual evidence.
Method: Proposes HART framework with four-stage structured modeling: 1) span localization (identify hallucinated spans), 2) mechanism attribution (determine error types), 3) evidence retrieval (find supporting/opposing evidence), 4) causal tracing (establish causal relationships). Creates first structured dataset for hallucination tracing with joint annotations.
Result: HART substantially outperforms strong retrieval baselines (BM25 and DPR) on the proposed dataset, validating effectiveness and generalization capability for hallucination analysis and evidence alignment.
Conclusion: HART provides a structured framework for fine-grained hallucination attribution with causal-level interpretability, addressing limitations of existing approaches and enabling better traceability of evidence.
Abstract: Large language models (LLMs) have demonstrated remarkable performance in text generation and knowledge-intensive question answering. Nevertheless, they are prone to producing hallucinated content, which severely undermines their reliability in high-stakes application domains. Existing hallucination attribution approaches, based on either external knowledge retrieval or internal model mechanisms, primarily focus on semantic similarity matching or representation-level discrimination. As a result, they have difficulty establishing structured correspondences at the span level between hallucination types, underlying error generation mechanisms, and external factual evidence, thereby limiting the interpretability of hallucinated fragments and the traceability of supporting or opposing evidence. To address these limitations, we propose HART, a fine-grained hallucination attribution and evidence retrieval framework for large language models. HART formalizes hallucination tracing as a structured modeling task comprising four stages: span localization, mechanism attribution, evidence retrieval, and causal tracing. Based upon this formulation, we develop the first structured dataset tailored for hallucination tracing, in which hallucination types, error mechanisms, and sets of counterfactual evidence are jointly annotated to enable causal-level interpretability evaluation. Experimental results on the proposed dataset demonstrate that HART substantially outperforms strong retrieval baselines, including BM25 and DPR, validating the effectiveness and generalization capability of the proposed tracing paradigm for hallucination analysis and evidence alignment.
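The four-stage decomposition can be sketched as a pipeline of pluggable callables, so localization, attribution, retrieval, and causal tracing can be developed and evaluated independently. Field and function names are illustrative, not HART's actual schema.

```python
from dataclasses import dataclass

@dataclass
class HallucinationTrace:
    # One traced hallucination, one field per stage in the formulation.
    span: tuple       # (start, end) offsets of the hallucinated span
    mechanism: str    # attributed error-generation mechanism
    evidence: list    # retrieved supporting/opposing passages
    cause: str        # causal-tracing verdict

def trace_hallucination(text, locate, attribute, retrieve, trace):
    # Run the stages in order; each later stage sees the earlier outputs,
    # mirroring the structured span -> mechanism -> evidence -> cause chain.
    span = locate(text)
    mechanism = attribute(text, span)
    evidence = retrieve(text, span)
    return HallucinationTrace(span, mechanism, evidence, trace(text, span, evidence))
```

The joint annotation in the proposed dataset corresponds to labeling all four fields for each span, which is what makes causal-level evaluation possible.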
[19] ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim
Main category: cs.CL
TL;DR: ReflexiCoder is an RL framework that teaches code generation models to internalize self-reflection and self-correction capabilities, eliminating the need for external feedback at inference time.
Details
Motivation: Standard LLMs for code generation hit performance ceilings on complex tasks, and existing iterative refinement methods rely on external oracles or expensive feedback cycles, which are inefficient and not fully autonomous.
Method: Uses reinforcement learning with granular reward functions to optimize the entire reflection-correction trajectory, teaching models to debug and optimize code internally without ground-truth feedback or execution engines at inference time.
Result: Achieves SOTA results on seven benchmarks (HumanEval, MBPP, BigCodeBench, etc.) with 8B model, rivaling proprietary models like GPT-5.1, while being 40% more token-efficient than base models.
Conclusion: ReflexiCoder shifts the paradigm from external-dependent refinement to intrinsic self-reflection capabilities, enabling more efficient and autonomous code generation for complex algorithmic tasks.
Abstract: While Large Language Models (LLMs) have revolutionized code generation, standard “System 1” approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug and optimization aware reflection, and self-correction, directly into the model’s weights. Unlike prior methods, ReflexiCoder shifts the paradigm from external-dependent refinement to intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.
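A granular reward over a generate-reflect-correct trajectory might be decomposed as below: partial credit for a passing first attempt, for a reflection that flags a real bug in a failing attempt, and for the corrected final solution. The decomposition and weights are illustrative assumptions, not the paper's exact reward functions.

```python
def trajectory_reward(initial_pass, reflection_found_bug, corrected_pass,
                      w_init=0.3, w_reflect=0.2, w_final=0.5):
    # Granular reward for one generation -> reflection -> correction
    # trajectory (booleans from test outcomes during training only;
    # at inference no execution feedback is needed).
    r = w_init * float(initial_pass)
    # Reflection credit only applies when there was actually a bug to find.
    r += w_reflect * float(reflection_found_bug and not initial_pass)
    r += w_final * float(corrected_pass)
    return r
```

Rewarding the intermediate reflection separately, rather than only the final verdict, is what shapes the model toward internalized debugging instead of blind regeneration.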
[20] ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
Mingluo Su, Huan Wang
Main category: cs.CL
TL;DR: ROSE improves SparseGPT pruning by reordering pruning sequence based on potential error impact, achieving better performance through two-level reordering and adaptive columnar layer identification.
Details
Motivation: The paper addresses limitations in SparseGPT's fixed left-to-right pruning order, which leads to suboptimal performance when weights exhibit columnar patterns. The authors aim to improve pruning efficiency by considering the impact of pruning order on model performance.
Method: ROSE introduces reordered SparseGPT with two-level reordering: 1) pre-pruning to identify candidate weights and estimate column/block pruning loss, 2) reordering columns within blocks by descending column loss, 3) reordering blocks based on block loss, and 4) using relative range of block loss to identify columnar layers for adaptive reordering.
Result: ROSE outperforms original SparseGPT and other pruning methods across multiple LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B), demonstrating improved pruning efficiency while maintaining model performance.
Conclusion: Pruning order significantly impacts LLM pruning performance, and ROSE’s adaptive reordering approach based on potential pruning errors provides a more effective pruning strategy than fixed-order methods like SparseGPT.
Abstract: Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one-shot pruning is to leverage second-order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre-pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two-level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at https://github.com/mingluo-su/ROSE.
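The two-level reordering described above can be sketched in a few lines. This is a toy illustration, assuming per-column pruning losses have already been estimated by a pre-pruning pass as the abstract describes; it is not the paper's implementation.

```python
import numpy as np

def two_level_reorder(col_loss, block_size):
    """Toy sketch of ROSE-style two-level reordering.

    col_loss: 1-D array of estimated per-column pruning losses
              (assumed to come from a pre-pruning pass).
    Returns a permutation of column indices: columns are sorted by
    descending loss within each block, and whole blocks are sorted by
    descending block loss, so higher-error weights are pruned earlier.
    """
    n = len(col_loss)
    assert n % block_size == 0, "toy sketch assumes evenly divisible columns"
    blocks = np.arange(n).reshape(-1, block_size)

    # Level 1: reorder columns inside each block by descending column loss.
    within = np.argsort(-col_loss.reshape(-1, block_size), axis=1)
    blocks = np.take_along_axis(blocks, within, axis=1)

    # Level 2: reorder whole blocks by descending total block loss.
    block_loss = col_loss.reshape(-1, block_size).sum(axis=1)
    blocks = blocks[np.argsort(-block_loss)]
    return blocks.ravel()

# Example: 8 columns in blocks of 4; the permutation visits the
# highest-loss block first, highest-loss columns first within it.
perm = two_level_reorder(np.array([0.1, 0.9, 0.3, 0.2, 1.5, 0.4, 0.8, 0.6]), 4)
```

The adaptive part of ROSE (using the relative range of block loss to decide which layers get reordered at all) would wrap this routine with a per-layer gate.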
[21] Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation
Changcheng Li, Jiancan Wu, Hengheng Zhang, Zhengsu Chen, Guo An, Junxiang Qiu, Xiang Wang, Qi Tian
Main category: cs.CL
TL;DR: CoCA is a reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy in LLMs by using segmented credit assignment to enable confidence-first uncertainty estimation.
Details
Motivation: Existing uncertainty estimation methods for LLMs are answer-first (produce confidence after generating answers), which limits practical usability. The paper proposes a confidence-first paradigm where models output confidence before answering, enabling better uncertainty estimation for reliable LLM deployment.
Method: CoCA uses GRPO (Group Relative Policy Optimization) reinforcement learning framework with segmented credit assignment. It assigns separate rewards and group-relative advantages to confidence and answer segments, enabling stable joint optimization of both confidence calibration and answer accuracy while avoiding reward hacking.
Result: Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, enabling broader downstream applications.
Conclusion: CoCA successfully demonstrates that confidence-first uncertainty estimation can be effectively optimized through segmented credit assignment in reinforcement learning, providing better calibrated uncertainty estimates for LLMs without compromising answer quality.
Abstract: Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer, which measures the correctness of a specific response and limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model’s probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
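The segmented credit assignment can be illustrated with a short sketch. The reward choices below (a Brier-style confidence reward and 0/1 answer correctness) are illustrative assumptions; the summary only specifies that the confidence and answer segments receive separate rewards and separate group-relative advantages.

```python
import numpy as np

def segmented_advantages(conf_rewards, ans_rewards):
    """Toy sketch of CoCA-style segmented credit assignment.

    For a group of G rollouts, the confidence segment and the answer
    segment each get their own reward, and each reward is normalized
    group-relatively (GRPO-style: subtract the group mean, divide by
    the group std). Tokens in the confidence segment would receive the
    confidence advantage; tokens in the answer segment the answer
    advantage. Reward designs here are illustrative, not the paper's.
    """
    def group_norm(r):
        r = np.asarray(r, dtype=float)
        return (r - r.mean()) / (r.std() + 1e-8)
    return group_norm(conf_rewards), group_norm(ans_rewards)

# Example group of 4 rollouts: stated confidence vs. 0/1 correctness.
stated_conf = np.array([0.9, 0.6, 0.8, 0.2])
correct = np.array([1.0, 0.0, 1.0, 0.0])
conf_r = -(stated_conf - correct) ** 2   # Brier-style: higher = better calibrated
adv_conf, adv_ans = segmented_advantages(conf_r, correct)
```

Keeping the two advantage streams separate is what prevents a confident-but-wrong rollout from being rewarded for its answer via its calibration score, the reward-hacking failure mode the abstract mentions.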
[22] VerChol: Grammar-First Tokenization for Agglutinative Languages
Prabhu Raja
Main category: cs.CL
TL;DR: Tokenization methods like BPE are suboptimal for agglutinative languages, fragmenting words and increasing token counts.
Details
Motivation: Current tokenization approaches (BPE) are script-agnostic and English-optimized, failing to handle agglutinative languages where single words encode multiple grammatical features.
Method: The paper likely proposes alternative tokenization methods that respect morpheme boundaries in agglutinative languages, though specific methods aren’t detailed in the abstract.
Result: Statistical tokenizers fragment agglutinative words, sever morpheme boundaries, and inflate token counts, reducing efficiency for these languages.
Conclusion: New tokenization approaches are needed for agglutinative languages to preserve linguistic structure and improve LLM performance.
Abstract: Tokenization is the foundational step in all large language model (LLM) pipelines, yet the dominant approach, Byte Pair Encoding (BPE) and its variants, is inherently script-agnostic and optimized for English-like morphology. For agglutinative languages, a typological class encompassing the Dravidian family (Tamil, Kannada, Telugu, Malayalam), Turkic languages (Turkish, Azerbaijani, Uzbek), Uralic languages (Finnish, Hungarian, Estonian), Korean, Japanese, Swahili, Basque, and others, a single word may encode root, tense, aspect, person, number, gender agreement, case, and postpositions into one orthographic unit. Statistical tokenizers fragment these words into byte-pair chunks that sever morpheme boundaries and inflate token counts.
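The fragmentation the abstract describes is easy to reproduce with a toy BPE trainer (a minimal sketch, not any production tokenizer): merges learned from an English-skewed corpus say nothing about Turkish morphology, so an agglutinative word such as evlerinizden (ev+ler+iniz+den, "from your houses") is left in many small pieces, inflating its token count.

```python
from collections import Counter

def merge_word(syms, pair):
    """Apply one merge to a word represented as a tuple of symbols."""
    out, i = [], 0
    while i < len(syms):
        if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
            out.append(syms[i] + syms[i + 1])
            i += 2
        else:
            out.append(syms[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """Minimal BPE trainer: repeatedly merge the most frequent
    adjacent symbol pair seen in the corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in words.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for syms, freq in words.items():
            merged[merge_word(syms, best)] += freq
        words = merged
    return merges

def tokenize(word, merges):
    syms = tuple(word)
    for pair in merges:
        syms = merge_word(syms, pair)
    return list(syms)

# Merges from an English-skewed toy corpus compress "the" to a single
# token but leave the Turkish word as isolated characters.
merges = train_bpe("the the the then there", 2)
english_tokens = tokenize("the", merges)
turkish_tokens = tokenize("evlerinizden", merges)
```

A grammar-first tokenizer in the paper's sense would instead emit units aligned to the morpheme boundaries ev / ler / iniz / den.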
[23] Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie
Main category: cs.CL
TL;DR: ConStory-Bench is a benchmark for evaluating narrative consistency in long-form story generation, with automated detection of contradictions across five error categories.
Details
Motivation: LLMs generate long narratives but often fail to maintain consistency, contradicting established facts, characters, and world rules. Existing benchmarks focus on plot quality and fluency, leaving consistency errors unexplored.
Method: Created ConStory-Bench with 2,000 prompts across four task scenarios, defining a taxonomy of five error categories with 19 fine-grained subtypes. Developed ConStory-Checker, an automated pipeline that detects contradictions and grounds judgments in textual evidence.
Result: Consistency errors show clear patterns: most common in factual and temporal dimensions, appear around narrative middle, occur in high-entropy text segments, and certain error types co-occur. Evaluated various LLMs through five research questions.
Conclusion: The benchmark and findings can inform future efforts to improve consistency in long-form narrative generation, addressing a critical gap in story generation evaluation.
Abstract: What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.
[24] Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions
Hussein Ghaly
Main category: cs.CL
TL;DR: LLM ensemble system for semantic tagging of UN Security Council resolutions using GPT models with evaluation metrics to prevent hallucinations and ensure content preservation.
Details
Motivation: To create a reliable system for semantic tagging of challenging legal/political texts (UN Security Council resolutions) that leverages LLM performance variability while avoiding hallucinations and unnecessary text modifications.
Method: Uses ensemble approach with multiple GPT models, introduces Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF) metrics to evaluate outputs, selects best output from multiple runs based on these metrics.
Result: GPT-4.1 achieved the highest metrics (Cleaning: CPR 84.9%; Semantic Tagging: CPR 99.99% and TWF 99.92%). Smaller models like GPT-4.1-mini achieved comparable performance at 20% of the cost. The ensemble system successfully selected optimal outputs across all models.
Conclusion: The ensemble design with CPR/TWF metrics creates a reliable LLM system for semantic tagging of challenging texts, balancing performance and cost efficiency.
Abstract: This paper introduces a new methodology for using LLM-based systems for accurate and efficient semantic tagging of UN Security Council resolutions. The main goal is to leverage LLM performance variability to build ensemble systems for data cleaning and semantic tagging tasks. We introduce two evaluation metrics: Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement. These metrics allow the selection of the best output from multiple runs of several GPT models. GPT-4.1 achieved the highest metrics for both tasks (Cleaning: CPR 84.9%; Semantic Tagging: CPR 99.99% and TWF 99.92%). In terms of cost, smaller models, such as GPT-4.1-mini, achieved comparable performance to the best model in each task at only 20% of the cost. These metrics ultimately allowed the ensemble to select the optimal output (both cleaned and tagged content) for all the LLM models involved, across multiple runs. With this ensemble design and the use of metrics, we create a reliable LLM system for performing semantic tagging on challenging texts.
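As a hedged sketch (the summary does not give the paper's CPR formula), a Content-Preservation-Ratio-style score can be approximated as the fraction of text that survives verbatim once tags are stripped from the output, so additions and omissions both lower it. The XML-ish tag format and the example resolution text are hypothetical.

```python
import re
from difflib import SequenceMatcher

def content_preservation_ratio(source, tagged_output):
    """Hedged CPR-style sketch: strip tags from the model output, then
    measure how much of the source text is preserved verbatim.
    Hallucinated additions and omissions both reduce the score.
    """
    untagged = re.sub(r"<[^>]+>", "", tagged_output)
    src_tokens = source.split()
    out_tokens = untagged.split()
    matched = sum(b.size for b in
                  SequenceMatcher(None, src_tokens, out_tokens)
                  .get_matching_blocks())
    return matched / max(len(src_tokens), len(out_tokens), 1)

# Selecting the best of several runs by the metric, as the ensemble does
# (hypothetical tag scheme and text):
src = "Decides to remain seized of the matter"
runs = ["<action>Decides</action> to remain seized of the matter",
        "Decides to remain actively engaged with the matter"]
best = max(runs, key=lambda r: content_preservation_ratio(src, r))
```

The second run paraphrases the source, so its score drops and the ensemble keeps the run that tagged without rewriting.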
[25] InfoGatherer: Principled Information Seeking via Evidence Retrieval and Strategic Questioning
Maksym Taranukhin, Shuyue Stella Li, Evangelos Milios, Geoff Pleiss, Yulia Tsvetkov, Vered Shwartz
Main category: cs.CL
TL;DR: InfoGatherer: A framework for document-grounded QA that uses Dempster-Shafer belief assignments over evidential networks to model uncertainty and gather missing information from both retrieved documents and targeted user questions, improving reliability in high-stakes domains.
Details
Motivation: LLMs deployed in high-stakes domains like medical triage and legal assistance often face underspecified user queries where single retrieval passes are insufficient, leading to incorrect and overly confident answers. Existing methods rely on implicit, unstructured confidence signals from LLMs, making it difficult to determine what information is missing and when to stop asking questions.
Method: InfoGatherer uses Dempster-Shafer belief assignments over a structured evidential network to model uncertainty. It gathers missing information from two complementary sources: retrieved domain documents and targeted follow-up questions to users. This enables principled fusion of incomplete and potentially contradictory evidence without prematurely collapsing to definitive answers.
Result: Across legal and medical tasks, InfoGatherer outperforms strong baselines while requiring fewer turns of interaction. It demonstrates improved reliability and interpretability in decision support systems.
Conclusion: By grounding uncertainty in formal evidential theory rather than heuristic LLM signals, InfoGatherer moves towards trustworthy, interpretable decision support in domains where reliability is critical, addressing the limitations of current document-grounded QA systems.
Abstract: LLMs are increasingly deployed in high-stakes domains such as medical triage and legal assistance, often as document-grounded QA systems in which a user provides a description, relevant sources are retrieved, and an LLM generates a prediction. In practice, initial user queries are often underspecified, and a single retrieval pass is insufficient for reliable decision-making, leading to incorrect and overly confident answers. While follow-up questioning can elicit missing information, existing methods typically depend on implicit, unstructured confidence signals from the LLM, making it difficult to determine what remains unknown, what information matters most, and when to stop asking questions. We propose InfoGatherer, a framework that gathers missing information from two complementary sources: retrieved domain documents and targeted follow-up questions to the user. InfoGatherer models uncertainty using Dempster-Shafer belief assignments over a structured evidential network, enabling principled fusion of incomplete and potentially contradictory evidence from both sources without prematurely collapsing to a definitive answer. Across legal and medical tasks, InfoGatherer outperforms strong baselines while requiring fewer turns. By grounding uncertainty in formal evidential theory rather than heuristic LLM signals, InfoGatherer moves towards trustworthy, interpretable decision support in domains where reliability is critical.
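The Dempster-Shafer fusion the abstract refers to rests on Dempster's rule of combination. Below is a minimal sketch of that standard rule over a two-hypothesis frame; the medical example and mass values are invented for illustration, and this omits InfoGatherer's evidential-network structure.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination for two basic belief assignments.

    m1, m2: dicts mapping focal sets (frozensets) to masses summing to 1.
    Mass landing on empty intersections is treated as conflict and
    renormalized away (assumes the sources are not totally conflicting).
    """
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    norm = 1.0 - conflict
    return {s: w / norm for s, w in combined.items()}

# Two sources over diagnoses {flu, cold}: a retrieved document and a
# user answer, each partially committed and partially uncertain.
FLU, COLD = frozenset({"flu"}), frozenset({"cold"})
ANY = FLU | COLD
doc  = {FLU: 0.6, ANY: 0.4}
user = {COLD: 0.3, ANY: 0.7}
fused = dempster_combine(doc, user)
```

Note how mass remains on the whole frame (ANY) after fusion: unlike a probability posterior, the assignment does not collapse to a definitive answer, which is exactly the property the abstract highlights for deciding when another follow-up question is worth asking.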
[26] Learning Next Action Predictors from Human-Computer Interaction
Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber, Thomas Robinson, Nilam Ram, Byron Reeves, Sherry Yang, Michael S. Bernstein, Diyi Yang
Main category: cs.CL
TL;DR: LongNAP: A multimodal user modeling system that predicts next actions by combining parametric and in-context learning over long interaction histories including screenshots, clicks, and sensor data.
Details
Motivation: To build truly proactive AI systems that anticipate user needs by reasoning over the full multimodal context of user behavior (screenshots, clicks, sensor data) rather than just sparse prompts.
Method: 1) Created dataset by annotating 360K actions from 1,800 hours of phone usage using VLMs; 2) Developed LongNAP model combining parametric and in-context learning; 3) Uses policy gradient to generate user-specific reasoning traces, retrieve relevant past traces, and apply them in-context for prediction.
Result: Outperforms supervised finetuning by 79% and prompted baselines by 39% on held-out data using LLM-as-judge evaluation. Achieves 17.1% well-aligned predictions (score ≥ 0.5), rising to 26% for confident predictions. Generalizes to held-out users.
Conclusion: Learning from full multimodal user behavior context to anticipate needs is now viable with substantial opportunity, demonstrated through LongNAP’s success on next action prediction task.
Abstract: Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts – it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user’s multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user’s next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held-out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP’s predicted trajectories are well-aligned with what a user does next (LLM-judge score ≥ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.
[27] Addressing the Ecological Fallacy in Larger LMs with Human Context
Nikita Soni, Dhruv Vijay Kunjadiya, Pratham Piyush Shah, Dikshya Mohanty, H. Andrew Schwartz, Niranjan Balasubramanian
Main category: cs.CL
TL;DR: HuLM addresses ecological fallacy in language models by incorporating author context from temporally ordered texts, showing improved performance for 8B Llama models through HuFT fine-tuning and continued pre-training.
Details
Motivation: Current language models ignore the dependence between multiple text sequences written by the same author (ecological fallacy), which prior work has shown can improve performance for smaller models. This paper investigates whether addressing this issue can benefit larger-scale 8B parameter models.
Method: Proposes HuLM (Human-aware Language Modeling) that processes an author’s language in the context of their other temporally ordered texts. Explores two approaches: 1) HuFT (Human-aware Fine-Tuning) using QLoRA during fine-tuning with author context, and 2) continued HuLM pre-training with QLoRA to create human-aware models.
Result: HuFT improves 8B model performance over standard fine-tuning. Continued HuLM pre-training produces human-aware models that generalize well, showing improved performance across eight downstream tasks with just linear classifier training.
Conclusion: Modeling language in the context of its original authors (addressing ecological fallacy) provides significant utility for large language models, enabling better performance through both fine-tuning and pre-training approaches.
Abstract: Language model training and inference ignore a fundamental linguistic fact – there is a dependence between multiple sequences of text written by the same person. Prior work has shown that addressing this form of ecological fallacy can greatly improve the performance of multiple smaller (~124M) GPT-based models. In this work, we ask if addressing the ecological fallacy by modeling the author’s language context with a specific LM task (called HuLM) can provide similar benefits for a larger-scale model, an 8B Llama model. To this end, we explore variants that process an author’s language in the context of their other temporally ordered texts. We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (HuFT: Human-aware Fine-Tuning). Empirical comparisons show that addressing the ecological fallacy during fine-tuning alone using QLoRA improves the performance of the larger 8B model over standard fine-tuning. Additionally, QLoRA-based continued HuLM pre-training results in a human-aware model generalizable for improved performance over eight downstream tasks with linear task classifier training alone. These results indicate the utility and importance of modeling language in the context of its original generators, the authors.
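Assembling inputs "in the context of their other temporally ordered texts" can be sketched as below. The separator token and the whitespace-based token budget are illustrative assumptions, not the paper's recipe.

```python
def build_author_contexts(posts, max_tokens=512, sep=" <|sep|> "):
    """Sketch of human-aware context assembly: each text is paired with
    the same author's earlier texts, oldest first, so the model sees the
    author's history when modeling the target text.

    posts: list of (author_id, timestamp, text) tuples.
    Yields (author_id, context_string) per post, in temporal order.
    """
    by_author = {}
    for author, ts, text in sorted(posts, key=lambda p: p[1]):
        history = by_author.setdefault(author, [])
        context = sep.join(history + [text])
        # Trim oldest history to fit the (whitespace-token) budget.
        while len(context.split()) > max_tokens and history:
            history.pop(0)
            context = sep.join(history + [text])
        yield author, context
        history.append(text)

posts = [("a", 2, "second post"), ("a", 1, "first post"), ("b", 1, "hello")]
contexts = list(build_author_contexts(posts))
```

Standard training would shuffle these texts independently; keeping them grouped and ordered per author is the point of the HuLM setup.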
[28] Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling
Chanhui Zhu
Main category: cs.CL
TL;DR: A framework for enabling small language models to generate highly stylized text by explicitly disentangling style into lexical, syntactic, and pragmatic dimensions, using implicit style conditioning via Chain-of-Thought distillation.
Details
Motivation: Small language models struggle with role-playing and stylized persona generation due to data scarcity and difficulty in disentangling complex style elements, leading to out-of-character generation that fails to capture nuanced syntactic and pragmatic features.
Method: Proposes a Structured Style-Rewrite Framework that explicitly disentangles style into three dimensions: lexical signatures (via PMI), syntactic patterns (grounded in PCFG rules), and pragmatic style. Uses implicit style conditioning via Chain-of-Thought distillation, leveraging explicit reasoning traces during training as inductive bias to align latent representations with structured style features.
Result: The method enables a Qwen-1.7B model to outperform significantly larger baselines (e.g., 4B Vanilla SFT) in style consistency and semantic fidelity on anime character role-playing tasks, demonstrating data-efficient stylized generation.
Conclusion: The approach provides a data-efficient paradigm for democratizing high-fidelity stylized generation on consumer hardware by explicitly modeling style dimensions and using implicit conditioning strategies.
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing (RP); however, building Small Language Models (SLMs) with highly stylized personas remains a challenge due to data scarcity and the complexity of style disentanglement. Standard Supervised Fine-Tuning (SFT) often captures surface-level semantics while failing to reproduce the intricate syntactic and pragmatic nuances of a character, leading to “Out-Of-Character” (OOC) generation. To address this, we propose a Structured Style-Rewrite Framework that explicitly disentangles style into three interpretable dimensions: lexical signatures (via PMI), syntactic patterns (grounded in PCFG rules), and pragmatic style. Furthermore, we introduce an implicit style conditioning strategy via Chain-of-Thought (CoT) distillation. By leveraging explicit reasoning traces during training as a strong inductive bias, our approach aligns the model’s latent representations with structured style features, enabling high-fidelity stylized generation without requiring explicit reasoning tokens during inference. Extensive experiments on a specific high-stylization domain (anime characters) demonstrate that our method enables a Qwen-1.7B model to outperform significantly larger baselines (e.g., 4B Vanilla SFT) in style consistency and semantic fidelity. Our approach offers a data-efficient paradigm for democratizing inference and deployment on consumer hardware.
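The PMI-based lexical-signature dimension can be sketched as follows, on toy data; the paper's exact PMI variant and thresholds are not specified in this summary, so the count cutoff here is an assumption.

```python
import math
from collections import Counter

def lexical_signature(char_tokens, corpus_tokens, top_k=3, min_count=2):
    """Sketch of a PMI-style lexical signature: words a character uses
    far more often than the background corpus does.

    PMI(w) = log( p(w | character) / p(w) ); min_count filters out
    one-off words (an illustrative threshold, not the paper's).
    """
    char_counts, bg_counts = Counter(char_tokens), Counter(corpus_tokens)
    n_char, n_bg = len(char_tokens), len(corpus_tokens)
    pmi = {w: math.log((c / n_char) / (bg_counts[w] / n_bg))
           for w, c in char_counts.items()
           if c >= min_count and bg_counts[w] > 0}
    return sorted(pmi, key=pmi.get, reverse=True)[:top_k]

# Toy data: a catchphrase-heavy character against a plain background.
char = "i will nya protect nya you nya".split()
corpus = char + "i will protect you i will help you".split()
signature = lexical_signature(char, corpus)
```

The resulting signature words would then feed the rewrite stage as explicit style features, alongside the PCFG-based syntactic patterns.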
[29] Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models
Nikita Soni, August HÄkan Nilsson, Syeda Mahwish, Vasudha Varadarajan, H. Andrew Schwartz, Ryan L. Boyd
Main category: cs.CL
TL;DR: Interpretable models combining psychological traits and situational features from social media data predict well-being and identify adaptive/maladaptive self-states.
Details
Motivation: Mental health is dynamic and shaped by individual dispositions and situational contexts, requiring models that integrate psychological theory with computational approaches to understand well-being in contextually sensitive ways.
Method: Develop interpretable models integrating person-level psychological traits (resilience, cognitive distortions, implicit motives) with language-inferred situational features from Situational 8 DIAMONDS framework, comparing them to embeddings from psychometrically-informed language models.
Result: Theory-driven features provide competitive performance while offering greater interpretability than language model embeddings; qualitative analyses show psychological coherence of predictive features.
Conclusion: Integrating computational modeling with psychological theory enables assessment of dynamic mental states in contextually sensitive and human-understandable ways.
Abstract: Mental health is not a fixed trait but a dynamic process shaped by the interplay between individual dispositions and situational contexts. Building on interactionist and constructionist psychological theories, we develop interpretable models to predict well-being and identify adaptive and maladaptive self-states in longitudinal social media data. Our approach integrates person-level psychological traits (e.g., resilience, cognitive distortions, implicit motives) with language-inferred situational features derived from the Situational 8 DIAMONDS framework. We compare these theory-grounded features to embeddings from a psychometrically-informed language model that captures temporal and individual-specific patterns. Results show that our principled, theory-driven features provide competitive performance while offering greater interpretability. Qualitative analyses further highlight the psychological coherence of features most predictive of well-being. These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.
[30] Track-SQL: Enhancing Generative Language Models with Dual-Extractive Modules for Schema and Context Tracking in Multi-turn Text-to-SQL
Bingfeng Chen, Shaobin Shi, Yongqi Luo, Boyan Xu, Ruichu Cai, Zhifeng Hao
Main category: cs.CL
TL;DR: Track-SQL enhances generative language models with dual-extractive modules for multi-turn Text-to-SQL, achieving state-of-the-art performance on SparC and CoSQL datasets.
Details
Motivation: Generative language models perform well on single-turn Text-to-SQL but struggle with multi-turn interactions due to difficulties handling context information and dynamic schema linking in complex conversational settings.
Method: Proposes Track-SQL framework with two extractive modules: Semantic-enhanced Schema Extractor to track schema changes, and Schema-aware Context Extractor to track contextual changes in multi-turn Text-to-SQL interactions.
Result: Achieves state-of-the-art performance on SparC and CoSQL datasets, improving execution accuracy by 7.1% and 9.55% respectively compared to baseline models.
Conclusion: Track-SQL effectively addresses the limitations of generative language models in multi-turn Text-to-SQL by incorporating specialized extractive modules for schema and context tracking, significantly improving performance in conversational database querying.
Abstract: Generative language models have shown significant potential in single-turn Text-to-SQL. However, their performance does not extend equivalently to multi-turn Text-to-SQL. This is primarily due to generative language models’ inadequacy in handling the complexities of context information and dynamic schema linking in multi-turn interactions. In this paper, we propose a framework named Track-SQL, which enhances generative language models with dual-extractive modules designed to track schema and contextual changes in multi-turn Text-to-SQL. Specifically, Track-SQL incorporates a Semantic-enhanced Schema Extractor and a Schema-aware Context Extractor. Experimental results demonstrate that Track-SQL achieves state-of-the-art performance on the SparC and CoSQL datasets. Furthermore, detailed ablation studies reveal that Track-SQL significantly improves execution accuracy in multi-turn interactions by 7.1% and 9.55% on these datasets, respectively. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/Track-SQL.
[31] ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning
Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang
Main category: cs.CL
TL;DR: ViewFusion is a two-stage framework for multi-view spatial reasoning that separates cross-view spatial pre-alignment from question answering to improve performance on viewpoint transformation and occlusion-sensitive tasks.
Details
Motivation: Current vision-language models struggle with multi-view spatial reasoning, often underutilizing cross-view relations and relying on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive tasks.
Method: Two-stage framework: 1) Spatial pre-thinking stage infers viewpoint relations and spatial transformations across views to form an intermediate workspace; 2) Question-driven reasoning stage produces final predictions conditioned on this workspace. Trained with synthetic reasoning supervision followed by reinforcement learning using GRPO.
Result: On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with largest gains on examples requiring genuine cross-view alignment.
Conclusion: Explicitly separating cross-view spatial pre-alignment from question answering improves multi-view spatial reasoning capabilities in vision-language models.
Abstract: Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.
[32] Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring
Jonas Kubesch, Lena Huber, Clemens Havas
Main category: cs.CL
TL;DR: LLMs show limited effectiveness for automated German essay grading, achieving only 40.6% rubric agreement and 32.8% final grade accuracy with human experts.
Details
Motivation: To investigate whether state-of-the-art open-weight LLMs can effectively grade Austrian A-level German essays using rubric-based evaluation, aiming to reduce teacher workload and mitigate subjective biases in grading.
Method: Used a dataset of 101 anonymized student exams across three text types, evaluated four LLMs (DeepSeek-R1 32b, Qwen3 30b, Mixtral 8x7b, Llama3.3 70b) with different contexts and prompting strategies for rubric-based grading.
Result: LLMs achieved maximum 40.6% agreement with human raters on rubric sub-dimensions and only 32.8% accuracy on final grades, indicating insufficient accuracy for real-world grading applications.
Conclusion: While smaller LLMs can use standardized rubrics for German essay grading, their accuracy is too low for practical deployment in real-world educational settings.
Abstract: Automated Essay Scoring (AES) has been explored for decades with the goal to support teachers by reducing grading workload and mitigating subjective biases. While early systems relied on handcrafted features and statistical models, recent advances in Large Language Models (LLMs) have made it possible to evaluate student writing with unprecedented flexibility. This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation. A dataset of 101 anonymised student exams across three text types was processed and evaluated. Four LLMs, DeepSeek-R1 32b, Qwen3 30b, Mixtral 8x7b and Llama3.3 70b, were evaluated with different contexts and prompting strategies. The LLMs were able to reach a maximum of 40.6% agreement with the human rater in the rubric-provided sub-dimensions, and only 32.8% of final grades matched the ones given by a human expert. The results indicate that even though smaller models are able to use standardised rubrics for German essay grading, they are not accurate enough to be used in a real-world grading environment.
[33] Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality
Xi Wang, Mengdie Zhuang, Jiqun Liu
Main category: cs.CL
TL;DR: The paper explores how exposing LLMs to diverse domain-specific texts shapes their “machine personality” and affects problem-solving, revealing a bimodal competence pattern and a “Suppression Advantage” where reduced social traits enhance complex reasoning.
Details
Motivation: To investigate how diverse experiences shape machine personality and influence problem-solving, moving beyond uniform performance benchmarks that favor specific behavioral tendencies like assertiveness.
Method: Used continued pre-training to expose models to domain-specific texts in an unsupervised manner (simulating experience accumulation), adapted the Big Five framework via Machine Personality Inventory (MPI) to quantify personality traits, and analyzed relationships to linguistic style and reasoning behavior.
Result: Found that model competence is bimodal, peaking at “Expressive Generalists” and “Suppressed Specialists,” and identified a “Suppression Advantage” where reduced social traits enhance complex reasoning performance. Established causal links between training data linguistics (imperative frequency, lexical diversity) and personality traits.
Conclusion: The study provides a roadmap for “Personality Engineering” by demonstrating how training data linguistics causally influences machine personality and problem-solving capabilities, revealing optimal personality configurations for different reasoning tasks.
Abstract: Human problem-solving is enriched by a diversity of styles and personality traits, yet the development of Large Language Models (LLMs) has largely prioritized uniform performance benchmarks that favour specific behavioural tendencies such as assertiveness. To investigate how diverse experiences shape machine personality and influence problem-solving, this study employs continued pre-training to expose models to domain-specific texts in an unsupervised manner, simulating the accumulation of experience. By adapting the Big Five framework via the Machine Personality Inventory (MPI), we quantify the personality traits of these model variants and analyse their relationship to linguistic style and reasoning behaviour. The findings reveal that model competence is bimodal, peaking at “Expressive Generalists” and “Suppressed Specialists,” while identifying a “Suppression Advantage” where reduced social traits enhance complex reasoning performance. This study further establishes a causal link between training data linguistics, such as imperative frequency and lexical diversity, and personality traits, providing a roadmap for “Personality Engineering”.
[34] Making Implicit Premises Explicit in Logical Understanding of Enthymemes
Xuyao Feng, Anthony Hunter
Main category: cs.CL
TL;DR: A pipeline integrating LLMs and neuro-symbolic reasoning to decode enthymemes (arguments with implicit components) by generating implicit premises, translating to logical formulas, and determining entailment.
Details
Motivation: Real-world arguments often contain implicit premises/claims (enthymemes). NLP methods can identify but not decode their logic, while logic-based approaches assume sufficient knowledgebase formulas. There's a gap in systematically translating textual enthymemes to logical arguments and generating required formulas for entailment verification.
Method: Three-stage pipeline: (1) LLM generates intermediate implicit premises from explicit premise and claim, (2) another LLM translates natural language to logical formulas, (3) neuro-symbolic reasoner (SAT solver) determines entailment.
Result: Evaluated on two enthymeme datasets, showing promising performance in selecting correct implicit premises as measured by precision, recall, F1-score, and accuracy.
Conclusion: The proposed pipeline effectively integrates LLMs and symbolic reasoning to decode enthymemes, addressing the gap between NLP identification and logical decoding of implicit arguments.
Abstract: Real-world arguments in text and dialogues are normally enthymemes (i.e. some of their premises and/or claims are implicit). Natural language processing (NLP) methods for handling enthymemes can potentially identify enthymemes in text but they do not decode their underlying logic, whereas logic-based approaches for handling them assume a knowledgebase with sufficient formulae that can be used to decode them via abduction. There is therefore a lack of a systematic method for translating textual components of an enthymeme into a logical argument and generating the logical formulae required for their decoding, and thereby showing logical entailment. To address this, we propose a pipeline that integrates: (1) a large language model (LLM) to generate intermediate implicit premises based on the explicit premise and claim; (2) another LLM to translate the natural language into logical formulas; and (3) a neuro-symbolic reasoner based on a SAT solver to determine entailment. We evaluate our pipeline on two enthymeme datasets, demonstrating promising performance in selecting the correct implicit premise, as measured by precision, recall, F1-score, and accuracy.
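The entailment check in stage (3) reduces to an unsatisfiability test: the premises entail the claim exactly when premises ∧ ¬claim has no satisfying assignment. Here is a minimal, self-contained sketch of that idea, with brute-force truth tables standing in for the paper's SAT solver; the formula encoding and the toy enthymeme are illustrative, not taken from the paper:

```python
from itertools import product

# Propositional formulas as nested tuples:
# ("var", name), ("not", f), ("and", f, g), ("or", f, g), ("implies", f, g)
def evaluate(formula, assignment):
    op = formula[0]
    if op == "var":
        return assignment[formula[1]]
    if op == "not":
        return not evaluate(formula[1], assignment)
    if op == "and":
        return evaluate(formula[1], assignment) and evaluate(formula[2], assignment)
    if op == "or":
        return evaluate(formula[1], assignment) or evaluate(formula[2], assignment)
    if op == "implies":
        return (not evaluate(formula[1], assignment)) or evaluate(formula[2], assignment)
    raise ValueError(op)

def variables(formula, acc=None):
    acc = set() if acc is None else acc
    if formula[0] == "var":
        acc.add(formula[1])
    else:
        for sub in formula[1:]:
            variables(sub, acc)
    return acc

def entails(premises, claim):
    # premises |= claim  iff  premises AND (NOT claim) is unsatisfiable
    test = ("not", claim)
    for f in premises:
        test = ("and", test, f)
    names = sorted(variables(test))
    for values in product([False, True], repeat=len(names)):
        if evaluate(test, dict(zip(names, values))):
            return False  # found a countermodel
    return True

# Toy enthymeme: "Socrates is a man, so Socrates is mortal."
explicit = ("var", "man")                                   # explicit premise
implicit = ("implies", ("var", "man"), ("var", "mortal"))   # LLM-generated implicit premise
claim = ("var", "mortal")

print(entails([explicit], claim))            # fails without the implicit premise
print(entails([explicit, implicit], claim))  # succeeds once it is added
```

The sketch makes the pipeline's division of labor visible: the LLM stages only need to produce the missing bridge formula; the symbolic stage then verifies entailment mechanically.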
[35] Diffusion Language Models Are Natively Length-Aware
Vittorio Rossi, Giacomo Cirò, Davide Beltrame, Luca Gandolfi, Paul Röttger, Dirk Hovy
Main category: cs.CL
TL;DR: Zero-shot dynamic context window cropping for diffusion language models reduces computational waste by predicting output length from latent prompts, achieving efficiency gains with minimal performance impact.
Details
Motivation: Diffusion Language Models (DLMs) operate with fixed maximum-length context windows regardless of actual response length, causing computational waste for short responses common in reasoning and chat tasks. Unlike autoregressive models that stop at EoS tokens, DLMs waste resources on unnecessary denoising steps.
Method: Proposes that latent prompt representations contain sufficient information to estimate required output length. Introduces a zero-shot mechanism to dynamically crop the context window before generation begins, reducing diffusion steps and computational requirements.
Result: Evaluated on GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering). Achieved massive efficiency gains with minimal performance impact: significant FLOPs reductions across all tasks, no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.
Conclusion: Latent prompt representations can effectively predict output length for DLMs, enabling dynamic context window cropping that substantially reduces computational waste while maintaining or improving performance across diverse tasks.
Abstract: Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in computational waste for the majority of short responses common in reasoning and chat tasks. To address this problem, we conjecture that the latent prompt representation contains sufficient information to estimate the required output length. We provide empirical evidence for this phenomenon and propose a zero-shot mechanism to dynamically crop the context window before generation begins, leading to fewer diffusion steps and substantial computational savings. We evaluate our approach on four benchmarks with diverse tasks – GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering) – revealing massive efficiency gains at minimal performance impact. We report significant reductions in FLOPs across all tasks, with no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.
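The core mechanism, estimate the needed length from the prompt representation and shrink the window before denoising starts, can be sketched in a few lines. The linear probe, block granularity, safety margin, and window size below are all assumptions for illustration; the paper's actual estimator is not reproduced here:

```python
import numpy as np

BLOCK = 32        # crop granularity (assumed)
MAX_LEN = 1024    # fixed context window of the DLM (assumed)

def predict_length(prompt_hidden, w, b):
    """Linear probe on the mean prompt representation -> expected answer length.
    A stand-in for whatever estimator is trained in practice; w and b are
    hypothetical probe weights."""
    pooled = prompt_hidden.mean(axis=0)
    return float(pooled @ w + b)

def cropped_window(pred_len, margin=1.25):
    """Crop the fixed window to the predicted length plus a safety margin,
    rounded up to a block boundary."""
    need = int(np.ceil(pred_len * margin / BLOCK)) * BLOCK
    return min(max(need, BLOCK), MAX_LEN)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 64))       # 16 prompt tokens, 64-dim hidden states
w, b = np.zeros(64), 90.0                # toy probe predicting ~90 tokens
win = cropped_window(predict_length(hidden, w, b))
print(win)  # 128: denoising now runs over 128 slots instead of 1024
```

Because every denoising step touches the whole window, a crop from 1024 to 128 slots cuts per-step compute roughly eightfold, which is where the reported FLOPs savings come from.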
[36] A Causal Graph Approach to Oppositional Narrative Analysis
Diego Revilla, Martin Fernandez-de-Retana, Lingfeng Chen, Aritz Bilbao-Jayo, Miguel Fernandez-de-Retana
Main category: cs.CL
TL;DR: Graph-based framework for detecting oppositional narratives using entity-interaction graphs with causal estimation at node level
Details
Motivation: Current textual analysis methods rely on predefined ontologies with human bias, using black-box models that exploit unstructured pattern recognition rather than modeling structured entity interactions in discourse.
Method: Proposes graph-based framework representing narratives as entity-interaction graphs, incorporates causal estimation at node level to derive causal representation of each contribution, distills sentence graphs into minimal causal subgraphs.
Result: Classification pipeline outperforms existing approaches to oppositional thinking classification task
Conclusion: Graph-based approach with causal estimation provides better framework for analyzing oppositional narratives by modeling structured entity interactions rather than relying on biased black-box pattern recognition
Abstract: Current methods for textual analysis rely on data annotated within predefined ontologies, often embedding human bias within black-box models. Despite achieving near-perfect performance, these approaches exploit unstructured, linear pattern recognition rather than modeling the structured interactions between entities that naturally emerge in discourse. In this work, we propose a graph-based framework for the detection, analysis, and classification of oppositional narratives and their underlying entities by representing narratives as entity-interaction graphs. Moreover, by incorporating causal estimation at the node level, our approach derives a causal representation of each contribution to the final classification by distilling the constructed sentence graph into a minimal causal subgraph. Building upon this representation, we introduce a classification pipeline that outperforms existing approaches to the oppositional thinking classification task.
[37] CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation
Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish, Sung Eun Kim, Oishi Banerjee, Pranav Rajpurkar
Main category: cs.CL
TL;DR: CRIMSON is a clinically grounded evaluation framework for chest X-ray report generation that assesses diagnostic correctness, contextual relevance, and patient safety using full clinical context and severity-aware weighting.
Details
Motivation: Existing metrics for chest X-ray report generation lack clinical grounding, fail to incorporate full patient context, and don't properly weight findings by clinical significance, leading to misleading evaluations that don't prioritize patient safety.
Method: CRIMSON incorporates patient age, indication, and guideline-based decision rules, categorizes errors into comprehensive taxonomy (false/missing findings + 8 attribute-level errors), and weights findings by clinical significance levels (urgent, actionable non-urgent, non-actionable, expected/benign) developed with radiologists.
Result: CRIMSON shows strong alignment with radiologist annotations (Kendall’s tau = 0.61-0.71; Pearson’s r = 0.71-0.84), consistent agreement with expert judgment in challenging scenarios (RadJudge), and strongest alignment with radiologist preferences in pairwise comparisons (RadPref).
Conclusion: CRIMSON provides a clinically meaningful evaluation framework for chest X-ray report generation that prioritizes patient safety and diagnostic accuracy over superficial metrics, enabling better assessment of multimodal medical AI systems.
Abstract: We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
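The severity-aware weighting idea can be illustrated with a toy scoring function. The numeric weights below are invented for illustration and are not CRIMSON's actual values; the point is only that one urgent miss should dominate many benign discrepancies:

```python
# Hypothetical weights per clinical significance level (not from the paper).
SEVERITY_WEIGHT = {
    "urgent": 8.0,
    "actionable_non_urgent": 4.0,
    "non_actionable": 1.0,
    "expected_benign": 0.25,
}

def weighted_error_score(errors):
    """errors: list of (error_type, severity) tuples for one generated report.
    Higher score = clinically worse report."""
    return sum(SEVERITY_WEIGHT[severity] for _, severity in errors)

report_errors = [
    ("missing_finding", "urgent"),          # e.g. a missed pneumothorax
    ("location_error", "non_actionable"),
]
print(weighted_error_score(report_errors))  # 9.0
```

Under such a scheme, a report with one urgent miss scores worse than one with several benign discrepancies, which matches the framework's stated goal of prioritizing patient safety over superficial agreement.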
[38] MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan, Zhaohan Chen, Xiaofan Zhang
Main category: cs.CL
TL;DR: MAPO: A critic-free RL algorithm for subjective multi-turn dialogue using dense process feedback from judge models and mixed advantage normalization for stable credit assignment.
Details
Motivation: Reinforcement learning for subjective dialogue tasks like emotional support is challenging due to lack of reliable process supervision. Outcome-only training collapses credit assignment, while turn-level sampling is computationally expensive.
Method: Proposes MAPO algorithm with dense process feedback from judge models, Monte Carlo returns for long-horizon effects, and mixed advantage estimator combining turn-level and batch-level normalization for stable optimization.
Result: Consistent improvements across multiple benchmarks (EMPA, EmoBench, EQ-Bench) with model scales 7B-32B. On EMPA: up to 9 point rate improvements and +43.2 dialogue scores over 7B base model. Generalizes well to unseen benchmarks with +4 points on EmoBench and +3.5 on EQ-Bench.
Conclusion: Dense process supervision with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue tasks.
Abstract: Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
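The mixed advantage estimator can be sketched in a few lines of numpy. The mixing weight and the exact normalization axes below are assumptions, not MAPO's published choices; the sketch only shows what blending turn-level with batch-level normalization of Monte Carlo returns looks like:

```python
import numpy as np

def mixed_advantage(returns, alpha=0.5, eps=1e-8):
    """Blend turn-level and batch-level normalization of Monte Carlo returns.

    returns: array of shape (n_rollouts, n_turns). `alpha` controls the mix;
    both it and the normalization axes are illustrative assumptions.
    """
    # Turn-level: normalize each turn position across the rollout group,
    # giving fine-grained per-turn credit assignment.
    turn_norm = (returns - returns.mean(axis=0)) / (returns.std(axis=0) + eps)
    # Batch-level: normalize over all turns and rollouts jointly,
    # keeping the scale stable across the batch.
    batch_norm = (returns - returns.mean()) / (returns.std() + eps)
    return alpha * turn_norm + (1.0 - alpha) * batch_norm

rng = np.random.default_rng(1)
R = rng.normal(loc=2.0, scale=0.5, size=(8, 5))  # 8 rollouts, 5 turns each
A = mixed_advantage(R)
print(A.shape)  # (8, 5): one advantage per (rollout, turn)
```

The turn-level term alone centers each turn position at zero across the group (fine-grained but noisy for small groups); the batch-level term alone is stable but coarse; the mix trades off the two, which is the claimed source of the method's stability.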
[39] Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol & Validation Using Eleven Large Language Models
Luis de-Marcos, Manuel Goyanes, Adrián Domínguez-Díaz
Main category: cs.CL
TL;DR: AI-CROWD protocol uses ensemble of LLMs to approximate ground truth labels for large-scale content analysis when human coding is impractical
Details
Motivation: Large-scale content analysis faces limitations due to absence of observable ground truth labels, as human coding becomes impractical for massive datasets due to high time, cost, and consistency challenges.
Method: Leverages collective outputs of an ensemble of large language models (LLMs) to approximate ground truth through consensus-based approximation derived from convergent and divergent inferences across multiple models. Aggregates outputs via majority voting and interrogates agreement/disagreement patterns with diagnostic metrics.
Result: The protocol generates consensus-based approximations of ground truth, identifies high-confidence classifications, and flags potential ambiguity or model-specific biases
Conclusion: AI-CROWD provides a practical solution for large-scale content analysis by using LLM ensembles to approximate ground truth when human coding is infeasible, addressing scalability challenges while acknowledging limitations of the approach
Abstract: Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time, cost, and consistency challenges. To overcome this barrier, we introduce the AI-CROWD protocol, which approximates ground truth by leveraging the collective outputs of an ensemble of large language models (LLMs). Rather than asserting that the resulting labels are true ground truth, the protocol generates a consensus-based approximation derived from convergent and divergent inferences across multiple models. By aggregating outputs via majority voting and interrogating agreement/disagreement patterns with diagnostic metrics, AI-CROWD identifies high-confidence classifications while flagging potential ambiguity or model-specific biases.
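The aggregation step of the protocol, majority voting plus an agreement diagnostic, is simple enough to sketch directly. The threshold at which an item would be flagged for human review is an assumption, not part of the protocol as summarized above:

```python
from collections import Counter

def aggregate(labels_per_model):
    """Majority-vote an ensemble's labels for one item and report agreement.

    labels_per_model: list of labels, one per LLM. Returns (consensus label,
    agreement ratio); a low ratio flags the item as ambiguous or as exposing
    model-specific bias.
    """
    counts = Counter(labels_per_model)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels_per_model)

# Eleven models label one item (toy labels for illustration).
votes = ["civil"] * 8 + ["uncivil"] * 3
label, agreement = aggregate(votes)
print(label, round(agreement, 2))  # civil 0.73

REVIEW_THRESHOLD = 0.7  # hypothetical cut-off for "high-confidence"
flag_for_review = agreement < REVIEW_THRESHOLD
```

With eleven models, even a bare 6-5 split is detectable as low agreement (0.55), which is exactly the kind of item the protocol routes to diagnostic inspection rather than treating as settled ground truth.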
[40] LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki
Main category: cs.CL
TL;DR: LIT-RAGBench is a comprehensive benchmark for evaluating RAG generators across five categories: Integration, Reasoning, Logic, Table interpretation, and Abstention, using fictional scenarios to test practical capabilities.
Details
Motivation: Existing RAG benchmarks provide limited coverage and don't enable simultaneous evaluation of multiple practical capabilities needed in real-world deployments, such as handling long contexts, multi-step reasoning, table interpretation, and knowing when to abstain.
Method: Created a benchmark with 114 human-constructed Japanese questions (and English translations) covering five categories: Integration, Reasoning, Logic, Table, and Abstention. Uses fictional entities/scenarios to ensure answers are grounded in provided documents. Employs LLM-as-a-Judge for scoring.
Result: No model (API-based or open-weight) exceeded 90% overall accuracy. The benchmark reveals specific strengths and weaknesses across categories, showing current limitations in practical RAG capabilities.
Conclusion: LIT-RAGBench provides a valuable metric for model selection in practical RAG deployments and for developing RAG-specialized models by systematically measuring capabilities across multiple dimensions.
Abstract: Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models. We release LIT-RAGBench, including the dataset and evaluation code, at https://github.com/Koki-Itai/LIT-RAGBench.
[41] FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, Ran He
Main category: cs.CL
TL;DR: FlashPrefill: A framework for ultra-fast prefilling in long-context LLMs using dynamic pattern discovery and thresholding to achieve 27.78x speedup on 256K sequences.
Details
Motivation: Quadratic complexity of attention in LLMs creates critical bottlenecks during prefilling phase, especially for long-context modeling. Existing sparse attention methods suffer from either high search latency or insufficient sparsity.
Method: Uses fast block-searching to locate dynamic vertical, slash, and block-sparse attention patterns simultaneously. Introduces dynamic thresholding mechanism that avoids sorting/accumulation overhead while eliminating long-tail distribution to enhance sparsity.
Result: Achieves 27.78x speedup on 256K sequences and maintains 1.71x speedup even at 4K context length, showing robustness across sequence scales without efficiency degradation on shorter contexts.
Conclusion: FlashPrefill provides substantial efficiency improvements for long-context modeling in LLMs through innovative pattern discovery and thresholding techniques, making it practical for varying sequence lengths.
Abstract: Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
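The thresholding idea can be made concrete with a toy block-selection function. The max-relative threshold rule below is an illustrative stand-in, not the paper's exact mechanism; the point is that a single threshold pass replaces the sorting or cumulative-sum pass that top-k selection needs:

```python
import numpy as np

def select_blocks(block_scores, tau=0.1):
    """Keep attention blocks above a data-dependent threshold, with no
    sorting or accumulation over scores.

    block_scores: estimated importance per (query-block, key-block) pair.
    `tau` scales the per-row maximum into a cut-off, discarding the long
    tail of near-zero blocks in one vectorized comparison.
    """
    thresh = tau * block_scores.max(axis=-1, keepdims=True)
    return block_scores >= thresh  # boolean sparsity mask

rng = np.random.default_rng(2)
scores = rng.exponential(size=(4, 64))   # long-tailed toy block scores
mask = select_blocks(scores)
print(f"{mask.mean():.2f} of blocks retained")
```

Because the mask is computed in O(1) passes over the score estimates, the selection cost stays negligible relative to the attention it prunes, which is why the method can still pay off at short (4K) contexts where sort-based searches do not.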
[42] SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
Yunlong Chu, Minglai Shao, Yuhang Liu, Bing Hao, Yumeng Lin, Jialu Wang, Ruijie Wang
Main category: cs.CL
TL;DR: SPOT compresses explicit Chain-of-Thought reasoning into compact latent pause tokens using span-level semantic alignment and frozen-head decoding for interpretability, reducing token overhead while improving reasoning performance.
Details
Motivation: Explicit Chain-of-Thought reasoning incurs high inference costs due to verbose token-level traces. While recent approaches reduce overhead via concise prompting or step pruning, they truncate what models say rather than internalizing what they think. Latent reasoning offers an alternative but faces challenges with rigid alignment and lack of interpretability.
Method: SPOT uses Span-level Semantic Alignment with a Sinkhorn optimal-transport objective to softly match each pause token to the semantics of an entire reasoning segment. It also introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head for interpretability.
Result: SPOT improves accuracy by 2.3 points on average on reasoning benchmarks while reducing generated tokens by 37.5%. It provides faithful semantic interpretations of the latent reasoning process.
Conclusion: SPOT offers an effective framework for compressing explicit reasoning into latent representations while maintaining interpretability, addressing both efficiency and transparency challenges in latent reasoning approaches.
Abstract: Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning offers a promising alternative by performing computation in the hidden space, yet prior methods face two critical challenges. Many existing approaches rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which can be insufficient to capture the dense, variable-length semantics of an entire reasoning segment. Furthermore, these methods often suffer from a lack of interpretability: latent states are commonly produced by unconstrained optimization or embedding mixing, yielding vectors that are difficult to decode or audit under the pretrained language head. We propose SPOT, a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. At the core of SPOT is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. To further improve interpretability, SPOT introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts. Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
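The soft matching at SPOT's core is an entropy-regularized optimal-transport plan between pause tokens and reasoning segments. Below is a generic Sinkhorn sketch with uniform marginals and a cosine-distance cost; the regularization strength, iteration count, and cost function are illustrative choices, not SPOT's training objective:

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between pause tokens (rows) and
    reasoning segments (cols), via standard Sinkhorn scaling iterations."""
    n, m = cost.shape
    K = np.exp(-cost / reg)      # Gibbs kernel
    r = np.ones(n) / n           # uniform row marginal
    c = np.ones(m) / m           # uniform column marginal
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan, sums to 1

rng = np.random.default_rng(3)
pause = rng.normal(size=(4, 32))     # 4 latent pause tokens
segs = rng.normal(size=(6, 32))      # 6 reasoning-segment embeddings
cost = 1.0 - (pause @ segs.T) / (
    np.linalg.norm(pause, axis=1, keepdims=True) * np.linalg.norm(segs, axis=1)
)
plan = sinkhorn(cost)
print(plan.shape, round(plan.sum(), 6))  # (4, 6) 1.0
```

Each row of the plan spreads a pause token's mass softly over the segments it best explains, which is what relaxes the rigid point-to-point step-end alignment the abstract criticizes.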
[43] Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion
Hari Shankar, Vedanta S P, Sriharini Margapuri, Debjani Mazumder, Ponnurangam Kumaraguru, Abhijnan Chakraborty
Main category: cs.CL
TL;DR: LLMs show cultural misalignment with non-Western religious values despite broad social issue alignment, requiring systematic regional audits for equitable global deployment.
Details
Motivation: LLMs are deployed globally but trained on English-centric data, risking misalignment with diverse cultural values, especially in sensitive domains like religion across India, East Asia, and Southeast Asia.
Method: Multilingual audit of GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 using internal representation analysis (log-probs/logits) to compare model opinions against ground-truth public attitudes, plus bias benchmark evaluations.
Result: Models align with public opinion on broad social issues but consistently fail on religious viewpoints, especially minority groups, often amplifying negative stereotypes. Lightweight interventions help partially but don’t eliminate gaps.
Conclusion: Systematic, regionally grounded audits are urgently needed to ensure equitable global deployment of LLMs, as current models show persistent cultural misalignment in sensitive domains.
Abstract: Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centric training data risks misalignment with the diverse cultural values of different societies. In this paper, we present a comprehensive, multilingual audit of the cultural alignment of contemporary LLMs including GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 across India, East Asia and Southeast Asia. Our study specifically focuses on the sensitive domain of religion as the prism for broader alignment. To facilitate this, we conduct a multi-faceted analysis of every LLM’s internal representations, using log-probs/logits, to compare the model’s opinion distributions against ground-truth public attitudes. We find that while the popular models generally align with public opinion on broad social issues, they consistently fail to accurately represent religious viewpoints, especially those of minority groups, often amplifying negative stereotypes. Lightweight interventions, such as demographic priming and native language prompting, partially mitigate but do not eliminate these cultural gaps. We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts. Our findings underscore the urgent need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs.
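The log-prob comparison underlying such audits can be sketched directly: turn per-option logits into an opinion distribution, then score its distance from the survey distribution. The Jensen-Shannon divergence used below is one reasonable, bounded choice of misalignment measure, not necessarily the one this paper uses; the logits and survey figures are hypothetical:

```python
import numpy as np

def option_distribution(logits):
    """Softmax over per-option answer logits (e.g. log-probs of the
    'Agree'/'Neutral'/'Disagree' continuations for a survey question)."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in nats: symmetric, bounded by ln 2."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

model = option_distribution([2.0, 0.5, -1.0])  # hypothetical option logits
survey = np.array([0.30, 0.25, 0.45])          # hypothetical public attitudes
print(round(js_divergence(model, survey), 3))
```

Computed per question and aggregated by demographic group, a score like this is what lets an audit show that alignment holds for majority viewpoints while breaking down for minority religious groups.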
[44] The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks
Anca Dinu, Andreiana Mihail, Andra-Maria Florescu, Claudiu Creanga
Main category: cs.CL
TL;DR: ChatGPT’s ability to generate pastiches of original artworks is evaluated through artist feedback and computational analysis, revealing gaps between visual similarity and conceptual depth.
Details
Motivation: To explore artificial visual creativity by assessing ChatGPT's capability to intentionally generate pastiches of original artworks, and to understand the gap between computational similarity metrics and human artistic evaluation.
Method: Involved 12 artists from multiple countries contributing 3 artworks each; used ChatGPT to generate pastiches; combined human evaluation (artist grading/comments) with computational methods detecting visual/stylistic similarities/divergences between originals and AI renditions.
Result: Significant gap between color/texture-based similarity and compositional/conceptual/perceptual similarity; artists perceived ChatGPT pastiches as lacking dimensionality, context, intentional sense, and emotional value, appearing more as paraphrases than valuable artworks.
Conclusion: Advocates for using a “style transfer dashboard” of complementary metrics rather than single style metric to evaluate similarity; highlights limitations of current AI in capturing artistic depth and intentionality.
Abstract: This study explores artificial visual creativity, focusing on ChatGPT’s ability to generate new images intentionally pastiching original artworks such as paintings, drawings, sculptures and installations. The process involved twelve artists from Romania, Bulgaria, France, Austria, and the United Kingdom, each invited to contribute three of their artworks and to grade and comment on the AI-generated versions. The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions. The results point to a significant gap between color and texture-based similarity and compositional, conceptual, and perceptual one. Consequently, we advocate for the use of a “style transfer dashboard” of complementary metrics to evaluate the similarity between pastiches and originals, rather than using a single style metric. The artists’ comments revealed limitations of ChatGPT’s pastiches after contemporary artworks, which were perceived by the authors of the originals as lacking dimensionality, context, and intentional sense, and seeming more like a paraphrase or an approximate quotation than a valuable, emotion-evoking artwork.
[45] Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI
Tanjim Taharat Aurpa
Main category: cs.CL
TL;DR: Transformer-based models (BERT) achieve 99.39% accuracy on Mathematical Entity Relation Extraction (MERE) task, with SHAP explainability revealing feature importance for mathematical text understanding.
Details
Motivation: Mathematical text understanding is challenging due to specialized entities and complex relationships. The study aims to interpret mathematical problems by extracting relations between operands (entities) and operators (relationships).
Method: Formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task. Uses transformer-based models (BERT) for automatic relation extraction. Incorporates Explainable AI (XAI) using Shapley Additive Explanations (SHAP) for transparency.
Result: BERT achieves best performance with 99.39% accuracy. SHAP explainability analysis reveals how specific textual and mathematical features influence relation prediction, providing insights into feature importance and model behavior.
Conclusion: Combines transformer-based learning, task-specific dataset, and explainable modeling to create an effective and interpretable framework for MERE, supporting applications in automated problem solving, knowledge graph construction, and intelligent educational systems.
Abstract: Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task, where operands are treated as entities and operators as their relationships. Transformer-based models are applied to automatically extract these relations from mathematical text, with Bidirectional Encoder Representations from Transformers (BERT) achieving the best performance, reaching an accuracy of 99.39%. To enhance transparency and trust in the model’s predictions, Explainable Artificial Intelligence (XAI) is incorporated using Shapley Additive Explanations (SHAP). The explainability analysis reveals how specific textual and mathematical features influence relation prediction, providing insights into feature importance and model behavior. By combining transformer-based learning, a task-specific dataset, and explainable modeling, this work offers an effective and interpretable framework for MERE, supporting future applications in automated problem solving, knowledge graph construction, and intelligent educational systems.
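The paper applies the SHAP library to BERT; as a self-contained illustration of the Shapley values that SHAP approximates, here is an exact brute-force computation over a toy "relation score" with three features. The feature names and scoring function are invented for the example and are not from the paper:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating all coalitions (feasible for few features)."""
    n = len(features)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for subset in combinations(others, r):
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(len(subset)) * factorial(n - len(subset) - 1) / factorial(n)
                gain = value_fn(set(subset) | {i}) - value_fn(set(subset))
                phi[i] += weight * gain
    return phi

# Toy additive "relation score": features 0 and 1 each add 2.0, feature 2 adds 1.0.
def score(present):
    return 2.0 * (0 in present) + 2.0 * (1 in present) + 1.0 * (2 in present)

phi = shapley_values(["operand_1", "operand_2", "operator_word"], score)
```

For an additive game like this, each feature's Shapley value equals its own coefficient, and the values sum to the full-coalition score; SHAP recovers the same quantities approximately for non-additive models like BERT.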
[46] Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason’s Selection Task
Hirohiko Abe, Kentaro Ozeki, Risako Ando, Takanobu Morishita, Koji Mineshima, Mitsuhiro Okada
Main category: cs.CL
TL;DR: LLMs show domain-specific reasoning patterns similar to humans, performing better with deontic rules than descriptive ones and exhibiting matching bias errors in conditional reasoning tasks.
Details
Motivation: To investigate whether LLMs exhibit domain-specific reasoning patterns similar to humans, particularly whether they reason better with deontic (normative) rules than descriptive ones, and to analyze their error patterns in conditional reasoning tasks.
Method: Created a new Wason Selection Task dataset with explicit deontic modality encoding to distinguish deontic from descriptive conditionals, then tested LLMs on these conditional reasoning tasks and analyzed error patterns for confirmation bias vs. matching bias.
Result: LLMs, like humans, reason better with deontic rules than descriptive ones and display matching-bias-like errors (tendency to ignore negation and select items that lexically match rule elements).
Conclusion: LLM reasoning performance varies systematically across rule types and their error patterns parallel well-known human cognitive biases in conditional reasoning paradigms.
Abstract: As large language models (LLMs) advance in linguistic competence, their reasoning abilities are gaining increasing attention. In humans, reasoning often performs well in domain-specific settings, particularly in normative rather than purely formal contexts. Although prior studies have compared LLM and human reasoning, the domain specificity of LLM reasoning remains underexplored. In this study, we introduce a new Wason Selection Task dataset that explicitly encodes deontic modality to systematically distinguish deontic from descriptive conditionals, and use it to examine LLMs’ conditional reasoning under deontic rules. We further analyze whether observed error patterns are better explained by confirmation bias (a tendency to seek rule-supporting evidence) or by matching bias (a tendency to ignore negation and select items that lexically match elements of the rule). Results show that, like humans, LLMs reason better with deontic rules and display matching-bias-like errors. Together, these findings suggest that the performance of LLMs varies systematically across rule types and that their error patterns can parallel well-known human biases in this paradigm.
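A minimal sketch of how a Wason item might be constructed and how a matching-bias response could be flagged. The paper's actual dataset encoding is not given in the abstract, so the schema below is an assumption:

```python
def wason_item(antecedent, consequent, deontic=False):
    """Build a four-card Wason item for the rule 'if A then C' ('must' marks deontic modality)."""
    verb = "must have" if deontic else "has"
    rule = f"If a card shows {antecedent}, it {verb} {consequent} on the other side."
    cards = [antecedent, f"not {antecedent}", consequent, f"not {consequent}"]
    # Logically correct selections: A (modus ponens) and not-C (modus tollens).
    correct = {antecedent, f"not {consequent}"}
    return rule, cards, correct

def is_matching_bias(selection, antecedent, consequent):
    """Flag selections that just pick the cards lexically matching the rule's terms."""
    return set(selection) == {antecedent, consequent}

rule, cards, correct = wason_item("a vowel", "an even number", deontic=True)
biased = is_matching_bias(["a vowel", "an even number"], "a vowel", "an even number")
```

The matching-bias pattern described in the paper is exactly the `biased` case: selecting the two cards named in the rule while ignoring the negated card that logic requires.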
[47] From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring
Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen, Quang Nhut Huynh, Huy Tien Nguyen, Tung Le
Main category: cs.CL
TL;DR: Comprehensive comparison of LLM-based Automated Essay Scoring methods for English L2 writing, evaluating four approaches on IELTS Writing Task 2 with best configuration achieving 93% F1-Score.
Details
Motivation: Prior studies examine individual LLM techniques for Automated Essay Scoring in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. Need for unified comparison of major LLM-based AES paradigms.
Method: Evaluates four LLM-based AES approaches on IELTS Writing Task 2: (1) encoder-based classification fine-tuning, (2) zero- and few-shot prompting, (3) instruction tuning with Retrieval-Augmented Generation (RAG), and (4) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG.
Result: Reveals clear accuracy-cost-robustness trade-offs across methods. Best configuration integrating k-SFT and RAG achieves strongest overall results with 93% F1-Score.
Conclusion: Offers first unified empirical comparison of modern LLM-based AES strategies for English L2 writing, promising potential in auto-grading writing tasks.
Abstract: Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task 2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods; the best configuration, integrating k-SFT and RAG, achieves the strongest overall results with an F1-score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2, promising potential in auto-grading writing tasks. Code is public at https://github.com/MinhNguyenDS/LLM_AES-EnL2
[48] Abductive Reasoning with Syllogistic Forms in Large Language Models
Hirohiko Abe, Risako Ando, Takanobu Morishita, Kentaro Ozeki, Koji Mineshima, Mitsuhiro Okada
Main category: cs.CL
TL;DR: This paper investigates whether state-of-the-art LLMs exhibit biases in abductive reasoning by converting a syllogistic dataset to test abduction, exploring how LLMs perform in contextualized reasoning beyond formal deduction.
Details
Motivation: The motivation is to understand LLM performance in abductive reasoning (drawing tentative conclusions from limited information) rather than just formal deduction, since human reasoning involves both. Prior studies show LLMs share human biases in dismissing logically valid inferences that contradict common beliefs, but this might be unfair if abduction is considered.
Method: The method involves converting a syllogistic dataset into one suitable for testing abductive reasoning. This allows investigation of whether state-of-the-art LLMs exhibit biases in abduction and identification of potential improvement areas.
Result: The paper doesn’t provide specific results in the abstract, but aims to investigate LLM accuracy in abductive reasoning and whether they exhibit biases similar to those observed in formal deduction tasks.
Conclusion: The investigation is vital for advancing understanding of LLMs in complex reasoning tasks and bridging the gap between machine and human cognition, emphasizing the importance of contextualized reasoning beyond formal deduction.
Abstract: Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern. Prior studies have indicated that LLMs and humans share similar biases, such as dismissing logically valid inferences that contradict common beliefs. However, criticizing LLMs for these biases might be unfair, considering our reasoning not only involves formal deduction but also abduction, which draws tentative conclusions from limited information. Abduction can be regarded as the inverse form of syllogism in its basic structure, that is, a process of drawing a minor premise from a major premise and conclusion. This paper explores the accuracy of LLMs in abductive reasoning by converting a syllogistic dataset into one suitable for abduction. It aims to investigate whether the state-of-the-art LLMs exhibit biases in abduction and to identify potential areas for improvement, emphasizing the importance of contextualized reasoning beyond formal deduction. This investigation is vital for advancing the understanding and application of LLMs in complex reasoning tasks, offering insights into bridging the gap between machine and human cognition.
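The abstract frames abduction as the inverse of a syllogism: deriving the minor premise from the major premise and conclusion. A toy sketch for the Barbara form ("All M are P; All S are M; therefore All S are P"), with each "All X are Y" statement encoded as a tuple; this encoding is an illustration, not the paper's dataset format:

```python
def abduce_minor(major, conclusion):
    """Invert a Barbara syllogism: from 'All M are P' and 'All S are P', hypothesize 'All S are M'."""
    m, p = major        # major premise:  All M are P
    s, p2 = conclusion  # conclusion:     All S are P
    if p != p2:
        return None     # predicates must match for this schema to apply
    return (s, m)       # hypothesized minor premise: All S are M

# "All mammals are warm-blooded" + "All whales are warm-blooded"
# abduces the tentative minor premise "All whales are mammals".
minor = abduce_minor(("mammals", "warm-blooded"), ("whales", "warm-blooded"))
```

Note the hypothesis is tentative, not deductively valid ("All sparrows are warm-blooded" would abduce "All sparrows are mammals"), which is exactly the contextual, defeasible character of abduction the paper examines.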
[49] PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei
Main category: cs.CL
TL;DR: PONTE is a human-in-the-loop framework for personalized XAI narratives that combines preference modeling, grounded generation, and verification modules to ensure faithful, complete, and stylistically aligned explanations.
Details
Motivation: Current XAI methods are one-size-fits-all and neglect user differences in expertise, goals, and cognitive needs. While LLMs can translate technical explanations, they suffer from faithfulness issues and hallucinations, creating a need for personalized, reliable XAI narratives.
Method: PONTE uses a closed-loop validation and adaptation process with three components: 1) low-dimensional preference model for stylistic requirements, 2) preference-conditioned generator grounded in structured XAI artifacts, and 3) verification modules for numerical faithfulness, informational completeness, and stylistic alignment. User feedback iteratively updates preference states.
Result: Automatic and human evaluations in healthcare and finance domains show the verification-refinement loop substantially improves completeness and stylistic alignment over validation-free generation. Human studies confirm strong agreement between intended preferences and perceived style, robustness to generation stochasticity, and consistently positive quality assessments.
Conclusion: PONTE successfully addresses the challenge of creating personalized, trustworthy XAI narratives through a human-in-the-loop framework that ensures faithfulness, completeness, and stylistic alignment while adapting to individual user needs.
Abstract: Explainable Artificial Intelligence (XAI) seeks to enhance the transparency and accountability of machine learning systems, yet most methods follow a one-size-fits-all paradigm that neglects user differences in expertise, goals, and cognitive needs. Although Large Language Models can translate technical explanations into natural language, they introduce challenges related to faithfulness and hallucinations. To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives. PONTE models personalization as a closed-loop validation and adaptation process rather than prompt engineering. It combines: (i) a low-dimensional preference model capturing stylistic requirements; (ii) a preference-conditioned generator grounded in structured XAI artifacts; and (iii) verification modules enforcing numerical faithfulness, informational completeness, and stylistic alignment, optionally supported by retrieval-grounded argumentation. User feedback iteratively updates the preference state, enabling quick personalization. Automatic and human evaluations across healthcare and finance domains show that the verification-refinement loop substantially improves completeness and stylistic alignment over validation-free generation. Human studies further confirm strong agreement between intended preference vectors and perceived style, robustness to generation stochasticity, and consistently positive quality assessments.
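The closed-loop preference update could look roughly like this; the dimension names, learning rate, and clipped linear update rule are all assumptions, since the paper only specifies that feedback iteratively updates a low-dimensional preference state:

```python
def update_preferences(pref, feedback, lr=0.3):
    """Move a low-dimensional preference vector toward user feedback, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + lr * (f - p))) for p, f in zip(pref, feedback)]

# Hypothetical stylistic dimensions: (technical depth, verbosity, formality).
pref = [0.5, 0.5, 0.5]
pref = update_preferences(pref, [1.0, 0.2, 0.5])  # user wants more depth, less verbosity
```

Each round of feedback nudges the state rather than overwriting it, which is one simple way to get the "quick personalization" the abstract describes while staying robust to noisy individual responses.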
[50] Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul
Main category: cs.CL
TL;DR: BRTR is a multimodal agentic framework for spreadsheet understanding that uses iterative tool-calling loops instead of single-pass retrieval, enabling reliable multi-step reasoning over complex enterprise workbooks with embedded visual artifacts.
Details
Motivation: Current multimodal RAG approaches for spreadsheet analysis have limitations: they exclude critical context through single-pass retrieval, lose data resolution through compression, and exceed LLM context windows through naive full-context injection, preventing reliable multi-step reasoning over complex enterprise workbooks.
Method: BRTR replaces single-pass retrieval with an iterative tool-calling loop that supports end-to-end Excel workflows. It uses multimodal embedding models (evaluated 5 models) and LLMs (varied 9 models) in a framework with planner, retrieval, and iterative reasoning components.
Result: BRTR achieves SOTA performance across three spreadsheet understanding benchmarks: surpassing prior methods by 25 points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. NVIDIA NeMo Retriever 1B was identified as top performer for mixed tabular/visual data, and GPT-5.2 achieved best efficiency-accuracy trade-off.
Conclusion: BRTR demonstrates that iterative tool-calling loops significantly improve spreadsheet understanding over single-pass retrieval approaches, with each component (planner, retrieval, iterative reasoning) contributing substantially to performance while maintaining full auditability through explicit tool-call traces.
Abstract: Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks containing millions of cells, cross-sheet dependencies, and embedded visual artifacts. However, state-of-the-art approaches exclude critical context through single-pass retrieval, lose data resolution through compression, and exceed LLM context windows through naive full-context injection, preventing reliable multi-step reasoning over complex enterprise workbooks. We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis to structured editing. Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. We evaluate five multimodal embedding models, identifying NVIDIA NeMo Retriever 1B as the top performer for mixed tabular and visual data, and compare nine LLMs. Ablation experiments confirm that the planner, retrieval, and iterative reasoning each contribute substantially, and cost analysis shows GPT-5.2 achieves the best efficiency-accuracy trade-off. Throughout all evaluations, BRTR maintains full auditability through explicit tool-call traces.
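The iterative tool-calling loop at the heart of BRTR can be sketched generically. The planner, tool names, and spreadsheet below are toy stand-ins; BRTR's actual tools and prompts are not specified in the abstract:

```python
def agent_loop(question, tools, plan, max_steps=8):
    """Iterative tool-calling: plan, call a tool, observe, repeat until an answer is produced."""
    trace = []         # explicit tool-call trace, for the auditability the paper emphasizes
    observations = []
    for _ in range(max_steps):
        action = plan(question, observations)  # ("tool", name, args) or ("answer", text)
        if action[0] == "answer":
            return action[1], trace
        _, name, args = action
        result = tools[name](*args)
        trace.append((name, args))
        observations.append(result)
    return None, trace  # step budget exhausted without an answer

# Toy workbook and a hard-coded stand-in for the LLM planner.
sheet = {("Sales", "B2"): 1200}
tools = {"read_cell": lambda s, c: sheet[(s, c)]}

def toy_planner(question, observations):
    if not observations:
        return ("tool", "read_cell", ("Sales", "B2"))
    return ("answer", f"B2 = {observations[-1]}")

answer, trace = agent_loop("What is Sales!B2?", tools, toy_planner)
```

Unlike single-pass retrieval, each observation feeds back into the next planning step, so the agent can chase cross-sheet dependencies it only discovers mid-task.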
[51] Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning
Yuchen Zhang, Haralambos Mouratidis, Ravi Shekhar
Main category: cs.CL
TL;DR: A context-aware multilingual ASR framework using frozen speech encoder and decoder-only LM with contrastive alignment for improved speech-context interaction across diverse languages and accents.
Details
Motivation: Most ASR systems are limited to monolingual settings and short utterances, with recent context-aware approaches lacking multilingual support and principled alignment between speech and contextual representations.
Method: Combines frozen speech encoder and decoder-only language model via lightweight projection module, uses structured context prompts (dialogue history, biasing words), and employs contrastive learning objective to align speech and context representations in shared embedding space.
Result: Evaluated on 1,500+ hours of real-world conversational speech across 11 languages and 5 English dialects; contextual input consistently improves recognition quality, with contrastive alignment providing additional gains across different context types, achieving overall performance gain over 5%.
Conclusion: Highlights importance of both contextual modeling and cross-modal alignment in multilingual ASR, demonstrating effectiveness of modular approach with contrastive learning for improved speech-context interaction.
Abstract: Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment provides additional gains when applied to different context types, with an overall performance gain of over 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.
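The contrastive alignment objective is in the InfoNCE family; a minimal sketch with toy embeddings follows. The vector values and temperature are illustrative, and the paper's exact loss formulation may differ:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(speech, contexts, pos_index, temperature=0.1):
    """Contrastive loss pulling a speech embedding toward its matching context embedding."""
    logits = [cosine(speech, c) / temperature for c in contexts]
    m = max(logits)  # stabilize the log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[pos_index]  # -log softmax probability of the positive pair

speech = [0.9, 0.1, 0.2]
contexts = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.2, 0.1, 1.0]]
loss = info_nce(speech, contexts, pos_index=0)  # contexts[0] is the matching context
```

Minimizing this loss over batches pushes each utterance and its dialogue-history/biasing-word context together in the shared embedding space while pushing mismatched pairs apart.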
[52] KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection
Archie Sage, Salvatore Greco
Main category: cs.CL
TL;DR: KCLarity team’s SemEval 2026 participation in classifying ambiguity and evasion techniques in political discourse, comparing direct clarity prediction vs. evasion-first hierarchical approaches with various model architectures.
Details
Motivation: To develop effective methods for classifying ambiguity and evasion techniques in political discourse as part of the CLARITY shared task, exploring different modeling formulations to understand which approach yields better performance.
Method: Two main formulations: (1) direct prediction of clarity labels, and (2) prediction of evasion labels first then deriving clarity through task taxonomy hierarchy. Explored auxiliary training variants and evaluated decoder-only models in zero-shot settings under evasion-first formulation.
Result: Both formulations yielded comparable performance. RoBERTa-large achieved strongest results on public test set among encoder-based models, while zero-shot GPT-5.2 generalized better on hidden evaluation set.
Conclusion: Different modeling approaches for classifying political discourse ambiguity show similar effectiveness, with transformer-based models performing well, and zero-shot large language models demonstrating strong generalization capabilities.
Abstract: This paper describes the KCLarity team’s participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in political discourse. We investigate two modelling formulations: (i) directly predicting the clarity label, and (ii) predicting the evasion label and deriving clarity through the task taxonomy hierarchy. We further explore several auxiliary training variants and evaluate decoder-only models in a zero-shot setting under the evasion-first formulation. Overall, the two formulations yield comparable performance. Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.
[53] Analyzing the Performance of ChatGPT in Cardiology and Vascular Pathologies
Walid Hariri
Main category: cs.CL
TL;DR: ChatGPT outperforms medical students on cardiology multiple-choice questions, demonstrating potential for medical education applications.
Details
Motivation: To evaluate ChatGPT's performance in cardiology and vascular pathologies by comparing its accuracy on challenging medical multiple-choice questions against medical students.
Method: Used 190 multiple-choice questions from the Siamois-QCM platform to test ChatGPT's accuracy, comparing results against two high-performing medical students.
Result: ChatGPT scored 175/190 (92.10%), outperforming both medical students who scored 163 (85.78%) and 159 (82.63%).
Conclusion: ChatGPT shows strong potential as an educational tool in cardiology and vascular pathologies due to its high accuracy on medical questions.
Abstract: The article aims to analyze the performance of ChatGPT, a large language model developed by OpenAI, in the context of cardiology and vascular pathologies. The study evaluated the accuracy of ChatGPT in answering challenging multiple-choice questions (QCM) using a dataset of 190 questions from the Siamois-QCM platform. The goal was to assess ChatGPT's potential as a valuable tool in medical education compared to two well-ranked students of medicine. The results showed that ChatGPT outperformed the students, scoring 175 out of 190 correct answers with a percentage of 92.10%, while the two students achieved scores of 163 and 159 with percentages of 85.78% and 82.63%, respectively. These results showcase how ChatGPT has the potential to be highly effective in the fields of cardiology and vascular pathologies by providing accurate answers to relevant questions.
[54] Do Prevalent Bias Metrics Capture Allocational Harms from LLMs?
Hannah Cyberey, Yangfeng Ji, David Evans
Main category: cs.CL
TL;DR: Current bias metrics for LLMs fail to reliably capture allocational harms in resource allocation tasks because they ignore how predictions translate into actual decisions.
Details
Motivation: Existing bias measures focus on prediction disparities but ignore the crucial gap between predictions and actual decisions, particularly in allocation contexts where resources are limited and decisions have real-world consequences.
Method: Evaluated predictive validity and utility of current bias metrics across ten LLMs and two allocation tasks, comparing metrics based on average performance gap and distribution distance against actual allocation outcomes.
Result: Commonly-used bias metrics fail to reliably capture group disparities in allocation outcomes, showing poor correlation between prediction-based metrics and actual allocation decisions.
Conclusion: Bias assessment must account for how model predictions translate into decisions, especially in allocation contexts; current metrics are insufficient for evaluating allocational harms.
Abstract: Allocational harms occur when resources or opportunities are unfairly withheld from specific groups. Many proposed bias measures ignore the discrepancy between predictions, which are what the proposed methods consider, and decisions that are made as a result of those predictions. Our work examines the reliability of current bias metrics in assessing allocational harms arising from predictions of large language models (LLMs). We evaluate their predictive validity and utility for model selection across ten LLMs and two allocation tasks. Our results reveal that commonly-used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes. Our work highlights the need to account for how model predictions are used in decisions, in particular in contexts where they are influenced by how limited resources are allocated.
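A toy numerical sketch of why a mean-score gap can miss an allocation disparity. The scores and group sizes are invented, and the paper's metrics and tasks are more elaborate, but the mechanism is the same: allocation is driven by the top of the ranking, not by averages:

```python
def avg_score_gap(scores_a, scores_b):
    """Prevalent prediction-based metric: difference in mean predicted scores between groups."""
    return sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b)

def allocation_gap(scores_a, scores_b, k):
    """Outcome-based metric: difference in selection rates when only the top-k are chosen."""
    pool = [(s, "a") for s in scores_a] + [(s, "b") for s in scores_b]
    pool.sort(reverse=True)
    chosen = pool[:k]
    rate_a = sum(1 for _, g in chosen if g == "a") / len(scores_a)
    rate_b = sum(1 for _, g in chosen if g == "b") / len(scores_b)
    return rate_a - rate_b

# Hypothetical scores: near-zero mean gap, but group A dominates the top of the ranking.
a = [0.92, 0.91, 0.90, 0.10, 0.10]
b = [0.60, 0.59, 0.58, 0.57, 0.56]
```

Here the mean gap is under one point, yet with only three slots available every slot goes to group A, which is precisely the disparity the paper argues prediction-level metrics fail to capture.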
[55] Goldfish: Monolingual Language Models for 350 Languages
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen
Main category: cs.CL
TL;DR: Large multilingual models underperform small monolingual models on grammatical text generation for low-resource languages, with Goldfish releasing 1,000+ small monolingual models for 350 languages.
Details
Motivation: Large multilingual models trained on many languages simultaneously show poor performance on basic grammatical text generation for low-resource languages, despite excelling at reasoning tasks. There's a need for better language modeling solutions specifically for these underrepresented languages.
Method: 1) Evaluated large multilingual models (XGLM 4.5B, BLOOM 7.1B) using FLORES perplexity, finding they perform worse than bigrams for many languages. 2) Trained small monolingual models (125M parameters) on ≤1GB of data for 350 languages. 3) Released Goldfish, a suite of over 1,000 small monolingual language models.
Result: Large multilingual models underperformed bigrams for 24% of languages in XGLM 4.5B and 43% in BLOOM 7.1B. Small monolingual models (125M parameters) outperformed large multilingual models in both perplexity and grammaticality benchmarks across 350 languages.
Conclusion: Small, targeted monolingual models are more effective than large multilingual models for grammatical text generation in low-resource languages. The Goldfish release provides valuable resources for future work in this area, including first public monolingual models for 215 languages.
Abstract: For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in many languages. First, large multilingual models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B) using FLORES perplexity as an evaluation metric. Second, when we train small monolingual models with only 125M parameters on 1GB or less data for 350 languages, these small models outperform large multilingual models both in perplexity and on a massively multilingual grammaticality benchmark. To facilitate future work on low-resource language modeling, we release Goldfish, a suite of over 1,000 small monolingual language models trained comparably for 350 languages. These models represent the first publicly-available monolingual language models for 215 of the languages included.
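The bigram baseline the paper compares against can be sketched with add-alpha smoothing. The toy corpus and smoothing choice are illustrative; FLORES evaluation and the paper's tokenization details differ:

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram model, a strong low-resource baseline."""
    vocab = set(train_tokens) | set(test_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    unigrams = Counter(train_tokens)
    log_prob = 0.0
    for prev, cur in zip(test_tokens, test_tokens[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * len(vocab))
        log_prob += math.log(p)
    n = len(test_tokens) - 1
    return math.exp(-log_prob / n)

train = "the cat sat on the mat the cat sat".split()
seen = bigram_perplexity(train, "the cat sat".split())     # bigrams seen in training
unseen = bigram_perplexity(train, "mat on sat cat".split())  # bigrams unseen in training
```

When a 7B-parameter multilingual model scores worse than this on a language's test set, as the paper reports for a large fraction of languages, it is failing at statistics a few megabytes of counts can capture.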
[56] Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models
Schrasing Tong, Eliott Zemour, Jessica Lu, Rawisara Lohanimit, Lalana Kagal
Main category: cs.CL
TL;DR: Debiasing LLMs using small expert models at decoding-time to reduce biases while maintaining performance
Details
Motivation: LLMs perpetuate unwanted biases from training data, potentially harming marginalized communities, requiring efficient debiasing methods.
Method: Use small biased and anti-biased expert models to generate debiasing signals added to LLM outputs at decoding-time, avoiding full retraining.
Result: Reduction in gender, race, and religion biases across different architectures on multiple bias metrics while preserving language model performance
Conclusion: Efficient and interpretable debiasing framework that can be tailored to specific contexts by switching fine-tuning datasets
Abstract: Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that is added to the LLM output at decoding-time. This approach combines computational efficiency - fine-tuning a small model versus re-training a large model and interpretability - one can examine the probability shift from debiasing. The framework can also be tailored to specific contexts by switching the choice of the fine-tuning dataset. Experiments on mitigating gender, race, and religion biases on different architectures show a reduction in bias on several local and global bias metrics while preserving language model performance.
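Decoding-time debiasing of this kind can be sketched as a logit adjustment. The exact combination rule, the strength parameter, and all logit values below are assumptions for illustration; the paper specifies only that a signal from the expert pair is added to the LLM output:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def debiased_next_token(base, biased, antibiased, strength=1.0):
    """Shift LLM logits away from the biased expert and toward the anti-biased one."""
    return softmax([b + strength * (a - x) for b, a, x in zip(base, antibiased, biased)])

# Hypothetical logits over ("nurse", "doctor") following "She works as a ...".
base = [2.0, 1.0]
biased = [3.0, 0.0]      # small expert fine-tuned on stereotyped text
antibiased = [0.5, 2.5]  # small expert fine-tuned on counter-stereotypical text
probs = debiased_next_token(base, biased, antibiased)
```

The interpretability claim in the abstract corresponds to comparing `softmax(base)` against `probs`: the probability shift induced by the expert pair is directly inspectable per token.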
[57] SpecFuse: Ensembling Large Language Models via Next-Segment Prediction
Bo Lv, Nayu Liu, Chen Tang, Xin Liu, Yue Yu, Ping Luo
Main category: cs.CL
TL;DR: SpecEM is a training-free LLM ensemble framework that enables dynamic, real-time collaboration between models using speculative decoding principles and online weight adjustment based on performance feedback.
Details
Motivation: Existing LLM ensemble methods suffer from first-token delay, poor long-range semantic collaboration, and equal voting weights that ignore task-specific performance differences between models.
Method: Uses speculative decoding with iterative drafting and verification at segment level, plus online feedback mechanism with multiplicative weight updates that adjust each model's voting weight based on how often it outperforms others during verification.
Result: Experimental results on five LLM families (7B to 72B parameters) and six benchmark datasets show consistent performance improvements over state-of-the-art LLM ensemble methods across open-domain instruction following, reasoning, and commonsense tasks.
Conclusion: SpecEM provides an effective training-free ensemble framework that enables dynamic model collaboration and performance-based weighting, addressing key limitations of existing ensemble methods.
Abstract: Ensembles of generative large language models (LLMs) are a promising way to compensate for individual model limitations, integrating the strengths of different LLMs. Existing LLM ensemble methods, however, face limitations such as first-token delay and challenges in long-range semantic collaboration between models. Moreover, they typically assume equal voting weights for all models during ensembling, ignoring task-specific performance differences among models. In this work, we propose SpecEM, a training-free, plug-and-play LLM ensemble framework that dynamically adjusts each model’s contribution in real time based on task performance. Inspired by speculative decoding, SpecEM iteratively performs drafting and verification, allowing models to collaborate semantically at the segment level for integrated output. Furthermore, we introduce an online feedback mechanism with multiplicative weight updates, where each model’s voting weight is adjusted on-the-fly according to how often it outperforms others during the verification stage, ensuring that stronger models exert greater influence during ensembling. Experimental results on five LLM families (ranging from 7B to 72B parameters) and six benchmark datasets, spanning open-domain instruction following, reasoning, and commonsense tasks, demonstrate consistent performance improvements compared to state-of-the-art LLM ensemble methods. Our code is available at https://github.com/lvbotenbest/SpecEM.
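The online feedback mechanism can be illustrated with a standard multiplicative-weights update: the model whose draft survives verification gets its voting weight boosted, then weights are renormalised. The learning rate `eta` and the boost-the-winner rule are simplifying assumptions, not SpecEM's exact update.

```python
def multiplicative_update(weights, winner, eta=0.1):
    # Boost the winning model's weight multiplicatively, then renormalise
    # so the weights remain a probability distribution.
    updated = [w * (1.0 + eta) if i == winner else w
               for i, w in enumerate(weights)]
    total = sum(updated)
    return [w / total for w in updated]

# Three ensemble members start with equal voting weight; model 2 keeps
# winning segment-level verification, so its influence grows over rounds.
weights = [1 / 3, 1 / 3, 1 / 3]
for _ in range(10):
    weights = multiplicative_update(weights, winner=2)
```

Because updates are multiplicative and online, no training pass over the ensemble is needed, matching the framework's training-free claim.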
[58] Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation
Steffen Eger, Yong Cao, Jennifer D’Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, Chenghua Lin, Nafise Sadat Moosavi, Wei Zhao, Tristan Miller
Main category: cs.CL
TL;DR: Survey paper providing structured overview of AI-assisted scientific discovery across five key tasks in the research lifecycle, focusing on multimodal language models and their applications in science.
Details
Motivation: With the emergence of large multimodal language models, science is undergoing an AI-based transformation. The paper aims to provide a comprehensive overview of how AI can assist researchers throughout the entire scientific lifecycle, from literature search to peer review.
Method: The paper conducts a curated literature survey organized around five core tasks: (1) literature search, (2) research idea generation and experimentation, (3) text content production, (4) multimodal artifact creation (figures/diagrams), and (5) scientific evaluation/peer review. For each task, the survey discusses datasets, methods, results, evaluation strategies, limitations, and ethical concerns.
Result: Provides a structured orientation to the field of AI-assisted scientific discovery, covering current techniques, evaluation practices, and emerging trends. Identifies key areas where multimodal language models are transforming scientific workflows and highlights ethical concerns including risks to research integrity through generative model misuse.
Conclusion: This survey serves as both an accessible introduction for newcomers to AI4Science and a catalyst for developing new AI-based initiatives that can be integrated into future scientific systems, while addressing important ethical considerations.
Abstract: With the advent of large multimodal language models, science is now at a threshold of an AI-based technological transformation. An emerging ecosystem of models and tools aims to support researchers throughout the scientific lifecycle, including (1) searching for relevant literature, (2) generating research ideas and conducting experiments, (3) producing text-based content, (4) creating multimodal artifacts such as figures and diagrams, and (5) evaluating scientific work, as in peer review. In this survey, we provide a curated overview of literature representative of the core techniques, evaluation practices, and emerging trends in AI-assisted scientific discovery. Across the five tasks outlined above, we discuss datasets, methods, results, evaluation strategies, limitations, and ethical concerns, including risks to research integrity through the misuse of generative models. We aim for this survey to serve both as an accessible, structured orientation for newcomers to the field, as well as a catalyst for new AI-based initiatives and their integration into future “AI4Science” systems.
[59] Conditioning LLMs to Generate Code-Switched Text
Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa
Main category: cs.CL
TL;DR: LLMs fine-tuned on back-translated English-Spanish code-switching data can generate fluent code-switched text, with LLM-based evaluation aligning better with human judgment than traditional metrics.
Details
Motivation: Code-switching remains a challenge in NLP due to limited large-scale datasets, and the capabilities of LLMs for CS text generation are not well understood, particularly for the English-Spanish language pair.
Method: Back-translate natural CS sentences into monolingual English to create parallel corpus, then fine-tune LLMs to convert monolingual sentences into code-switched text.
Result: Fine-tuning enables consistent generation of fluent CS text; LLM-based judgment correlates better with human preferences than traditional reference-based metrics.
Conclusion: Fine-tuning is key for LLMs to generate high-quality CS text, expanding CS communication research opportunities, with LLM-based evaluation being more reliable than traditional metrics.
Abstract: Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP), due to the limited availability of large-scale, diverse CS datasets for robust training and evaluation. Despite recent advances, the capabilities and limitations of LLMs in handling CS are still not fully understood. In this work, we investigate the extent to which LLMs can be used in a framework for CS text generation, focusing on the English-Spanish language pair. Our proposed methodology consists of back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. We thoroughly analyse the models’ performance through a study on human preferences, a qualitative error analysis, an evaluation with popular reference-based metrics and LLM-based judgment. Results show that fine-tuning can be a key step to ensure that current LLMs consistently generate fluent code-switched text and that our methodology generates high-quality outputs, expanding research opportunities in CS communication. We find that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data, but LLM-based judgment aligns more closely with human preferences. We release our code and generated dataset under a CC-BY-NC-SA license.
[60] CAReDiO: Cultural Alignment via Representativeness and Distinctiveness Guided Data Optimization
Jing Yao, Xiaoyuan Yi, Jindong Wang, Zhicheng Dou, Xing Xie
Main category: cs.CL
TL;DR: CAReDiO is a data optimization framework for culturally aligning LLMs by enhancing cultural representativeness and distinctiveness through information-theoretic objectives.
Details
Motivation: As LLMs are deployed globally, aligning them with diverse cultures is crucial for user engagement and mitigating cultural conflicts. Existing culture-specific corpora inadequately capture core cultural characteristics and fail to distinguish unique cultural nuances.
Method: CAReDiO alternates optimization of culture-sensitive questions and responses using two information-theoretic objectives in an in-context manner to enhance both cultural representativeness and distinctiveness of constructed data.
Result: Extensive experiments on 15 cultures show CAReDiO creates high-quality data with richer cultural information and enables efficient alignment of LLMs with as few as 200 training samples, outperforming previous datasets in multi-choice and open-ended benchmarks.
Conclusion: CAReDiO effectively addresses challenges in cultural alignment of LLMs by optimizing data for better cultural representativeness and distinctiveness, enabling efficient cross-cultural adaptation.
Abstract: As Large Language Models (LLMs) are deployed across diverse regions, aligning them with pluralistic cultures is crucial for improving user engagement and mitigating cultural conflicts. Recent work has curated, either synthesized or manually annotated, culture-specific corpora for alignment. Nevertheless, inspired by cultural theories, we recognize they face two key challenges. (1) Representativeness: These corpora inadequately capture the target culture’s core characteristics, causing insufficient cultural coverage and redundancy; (2) Distinctiveness: They fail to distinguish the unique nuances of the target culture from patterns shared across relevant ones, hindering precise culture modeling. To handle these challenges, we introduce CAReDiO, a novel data optimization framework that alternately optimizes culture-sensitive questions and responses according to two information-theoretic objectives in an in-context manner, enhancing both cultural representativeness and distinctiveness of constructed data. Extensive experiments on 15 cultures demonstrate that CAReDiO can create high-quality data with richer cultural information and enable efficient alignment of small open-source or large proprietary LLMs with as few as 200 training samples, consistently outperforming previous datasets in both multi-choice and open-ended benchmarks.
[61] RM-R1: Reward Modeling as Reasoning
Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Main category: cs.CL
TL;DR: ReasRMs introduce reasoning into reward modeling by having models generate rubrics/solutions and evaluate responses against them, improving interpretability and performance over larger models.
Details
Motivation: Current reward models lack interpretable reasoning before scoring, limiting their accuracy and alignment with human preferences. The paper hypothesizes that integrating reasoning into reward modeling can enhance both interpretability and performance.
Method: Proposes Reasoning Reward Models (ReasRMs) with chain-of-rubrics mechanism that self-generates sample-level rubrics or solutions, then evaluates responses against them. Uses two-stage training: (1) distillation of high-quality reasoning chains, and (2) reinforcement learning with verifiable rewards.
Result: ReasRMs achieve superior performance across three reward model benchmarks, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%.
Conclusion: Integrating reasoning into reward modeling significantly enhances both interpretability and performance, with the chain-of-rubrics mechanism being a key ingredient for successful reasoning reward models.
Abstract: Reward modeling is essential for aligning large language models with human preferences through reinforcement learning. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances of long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning into reward modeling significantly enhances RM’s interpretability and performance. We introduce a new class of generative reward models, Reasoning Reward Models (ReasRMs), which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism – self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve superior performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough analyses to understand the key ingredients of successful ReasRM training.
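The chain-of-rubrics idea can be pictured as a judging prompt that asks the model to write its own rubric before comparing candidates. The template below is a hypothetical illustration, not RM-R1's actual prompt.

```python
def chain_of_rubrics_prompt(question, answer_a, answer_b):
    # Hypothetical CoR-style judging prompt: the reward model first
    # self-generates evaluation criteria, then scores both candidates
    # against them before emitting a preference.
    return (
        "Question:\n" + question + "\n\n"
        "First, write a short rubric listing the criteria a good answer "
        "must satisfy. Then evaluate each candidate against every "
        "criterion, and end with 'Preferred: A' or 'Preferred: B'.\n\n"
        "Candidate A:\n" + answer_a + "\n\n"
        "Candidate B:\n" + answer_b + "\n"
    )

prompt = chain_of_rubrics_prompt("What is the capital of France?",
                                 "Paris.", "Lyon.")
```

The intermediate rubric is what makes the judgment inspectable, in contrast to a scalar reward head that emits a score directly.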
[62] From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız
Main category: cs.CL
TL;DR: A deterministic pipeline for creating domain-specific LLM benchmarks from raw corpora without LLM assistance or human annotation, using keyword extraction and completion-style prompts to measure domain knowledge.
Details
Motivation: Existing domain-specific benchmarks for LLMs are contaminated and biased (multiple-choice format), making it difficult to accurately assess domain knowledge in critical areas like law, healthcare, and education.
Method: Extract domain-specific keywords and target vocabulary from input corpus, construct prompt-target pairs where domain-specific words serve as prediction targets, and measure LLMs’ completion accuracy to assess domain knowledge.
Result: Pipeline creates contamination-free benchmarks, enables automated updates, facilitates fair base vs. chat model comparisons, and shows significant correlation with expert-curated benchmarks. Provides insights into knowledge acquisition across different training paradigms.
Conclusion: The pipeline enables scalable, domain-specific, LLM-independent, and unbiased evaluation of both base and instruction-tuned models, addressing key limitations of existing benchmarking approaches.
Abstract: Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education. However, existing benchmarks are documented to be contaminated and are based on multiple-choice questions, which suffer from inherent biases. To measure domain-specific knowledge in LLMs, we present a deterministic pipeline that transforms raw domain corpora into completion-style benchmarks without relying on other LLMs or costly human annotation. Our approach first extracts domain-specific keywords and related target vocabulary from an input corpus. It then constructs prompt-target pairs where domain-specific words serve as prediction targets. By measuring LLMs’ ability to complete these prompts, we provide a direct assessment of domain knowledge at low computational cost. Our pipeline avoids benchmark contamination, enables automated updates with new domain data, and facilitates fair comparisons between base and instruction-tuned (chat) models. We validate our approach by showing that model performances on our benchmark significantly correlate with those on an expert-curated benchmark. We then demonstrate how our benchmark provides insights into knowledge acquisition in domain-adaptive, continual, and general pretraining. Finally, we examine the effects of instruction fine-tuning by comparing base and chat models within our unified evaluation framework. In conclusion, our pipeline enables scalable, domain-specific, LLM-independent, and unbiased evaluation of both base and chat models.
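The core of the pipeline, turning corpus sentences into completion items keyed on domain vocabulary, can be sketched as follows. The keyword-extraction step is replaced here by a hand-given keyword set, and the paper's candidate filtering is omitted; this is a simplification, not the authors' implementation.

```python
def make_prompt_target_pairs(sentences, keywords):
    # For each occurrence of a domain keyword, the prompt is the text up
    # to that word and the keyword itself is the prediction target.
    pairs = []
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            if w.lower().strip(".,") in keywords and i > 0:
                pairs.append((" ".join(words[:i]), w.strip(".,")))
    return pairs

# Toy legal-domain corpus with a hand-picked keyword set.
corpus = [
    "The defendant filed an appeal with the court.",
    "An injunction was granted by the judge.",
]
keywords = {"appeal", "injunction", "court", "judge"}
pairs = make_prompt_target_pairs(corpus, keywords)
```

Scoring is then just checking whether the model's next-word completion of each prompt matches the target, which is why the benchmark needs no other LLM and no human annotation.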
[63] Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering
Julius Gun, Timo Oksanen
Main category: cs.CL
TL;DR: Evaluation of LLMs with 128K context windows on technical QA using agricultural machine manuals in three languages, comparing direct prompting vs RAG strategies.
Details
Motivation: To assess the performance of large language models with long context windows in realistic industrial QA scenarios, particularly for technical documentation across multiple languages, and to compare direct prompting approaches with retrieval-augmented generation methods.
Method: Built benchmark using agricultural machine manuals in English, French, and German; simulated cross-lingual information retrieval with English questions; evaluated nine long-context LLMs using direct prompting vs three RAG strategies (keyword, semantic, hybrid); used LLM-as-a-judge for evaluation; included “needle-in-a-haystack” challenges and unanswerable questions to test hallucinations.
Result: Hybrid RAG consistently outperformed direct long-context prompting; models like Gemini 2.5 Flash and Qwen 2.5 7B achieved over 85% accuracy across all languages with RAG; the study provides detailed analysis of LLM performance in specialized industrial domain.
Conclusion: RAG strategies, particularly hybrid approaches, are more effective than direct prompting for technical QA tasks with long documents; the paper provides an open framework for similar evaluations and highlights practical trade-offs in industrial applications.
Abstract: We present a case study evaluating large language models (LLMs) with 128K-token context windows on a technical question answering (QA) task. Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German. It simulates a cross-lingual information retrieval scenario where questions are posed in English against all three language versions of the manual. The evaluation focuses on realistic “needle-in-a-haystack” challenges and includes unanswerable questions to test for hallucinations. We compare nine long-context LLMs using direct prompting against three Retrieval-Augmented Generation (RAG) strategies (keyword, semantic, hybrid), with an LLM-as-a-judge for evaluation. Our findings for this specific manual show that Hybrid RAG consistently outperforms direct long-context prompting. Models like Gemini 2.5 Flash and the smaller Qwen 2.5 7B achieve high accuracy (over 85%) across all languages with RAG. This paper contributes a detailed analysis of LLM performance in a specialized industrial domain and an open framework for similar evaluations, highlighting practical trade-offs and challenges.
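One common way to build a hybrid retriever, fusing a keyword ranking (e.g. BM25) with a semantic embedding ranking, is reciprocal rank fusion. Whether Agri-Query uses RRF or score interpolation is not stated here, so treat this as a sketch of the general technique.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each document earns 1 / (k + rank + 1) from every ranked list it
    # appears in; summing rewards documents ranked well by both channels.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["doc3", "doc1", "doc7"]   # e.g. BM25 order
semantic_ranking = ["doc1", "doc5", "doc3"]  # e.g. embedding order
fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
```

Rank-based fusion sidesteps the problem that keyword and embedding scores live on incomparable scales, which is one reason hybrid retrieval tends to be robust across languages and document types.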
[64] CMRAG: Co-modality-based visual document retrieval and question answering
Wang Chen, Wenhan Yu, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, Jizhou Huang
Main category: cs.CL
TL;DR: CMRAG framework integrates text and images for multimodal document QA using unified encoding and retrieval methods, outperforming single-modality RAG approaches.
Details
Motivation: Existing RAG methods have limitations with multimodal documents: text-only methods miss visual information, while visual-only methods ignore semantic text advantages, leading to suboptimal retrieval and generation for visual document QA.
Method: Proposes Co-Modality-based RAG (CMRAG) with: (1) Unified Encoding Model (UEM) that projects queries, text, and images into shared embedding space via triplet training, and (2) Unified Co-Modality-informed Retrieval (UCMR) that statistically normalizes similarity scores to fuse cross-modal signals.
Result: CMRAG consistently outperforms single-modality RAG in multiple visual document question-answering benchmarks. The authors also release a large-scale triplet dataset of (query, text, image) examples.
Conclusion: Integrating co-modality information into RAG framework in a unified manner effectively improves performance of complex visual document QA systems by leveraging both text and visual information.
Abstract: Retrieval-Augmented Generation (RAG) has become a core paradigm in document question answering tasks. However, existing methods have limitations when dealing with multimodal documents: one category of methods relies on layout analysis and text extraction, which can only utilize explicit text information and struggle to capture images or unstructured content; the other category treats document segmentation as visual input and directly passes it to visual language models (VLMs) for processing, yet it ignores the semantic advantages of text, leading to suboptimal retrieval and generation results. To address these research gaps, we propose the Co-Modality-based RAG (CMRAG) framework, which can simultaneously leverage texts and images for more accurate retrieval and generation. Our framework includes two key components: (1) a Unified Encoding Model (UEM) that projects queries, parsed text, and images into a shared embedding space via triplet-based training, and (2) a Unified Co-Modality-informed Retrieval (UCMR) method that statistically normalizes similarity scores to effectively fuse cross-modal signals. To support research in this direction, we further construct and release a large-scale triplet dataset of (query, text, image) examples. Experiments demonstrate that our proposed framework consistently outperforms single-modality–based RAG in multiple visual document question-answering (VDQA) benchmarks. The findings of this paper show that integrating co-modality information into the RAG framework in a unified manner is an effective approach to improving the performance of complex VDQA systems.
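The statistical normalization behind UCMR can be sketched with z-scores: text and image similarities are standardised onto a common scale before a weighted fusion. The equal fusion weights and the choice of z-normalisation are assumptions for illustration; the paper's exact normalisation may differ.

```python
import math

def z_normalize(scores):
    # Standardise to zero mean and unit variance so channels on
    # different scales (cosine vs. dot product, say) become comparable.
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0
    return [(s - mean) / std for s in scores]

def fuse(text_scores, image_scores, w_text=0.5, w_image=0.5):
    t = z_normalize(text_scores)
    v = z_normalize(image_scores)
    return [w_text * a + w_image * b for a, b in zip(t, v)]

text_scores = [0.2, 0.8, 0.5]      # query-to-text similarities
image_scores = [12.0, 30.0, 15.0]  # query-to-image similarities
fused = fuse(text_scores, image_scores)
best = max(range(len(fused)), key=fused.__getitem__)
```

Without normalisation the raw image scores would dominate the sum outright; after standardisation, both modalities contribute evidence on equal footing.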
[65] MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder-LLM Integration in Cross-Lingual Reasoning
Kosei Uemura, David Guzmán, Quang Phuoc Nguyen, Jesujoba Oluwadara Alabi, En-shiun Annie Lee, David Ifeoluwa Adelani
Main category: cs.CL
TL;DR: MERLIN: A two-stage model-stacking framework with curriculum learning for improving reasoning in low-resource languages, achieving significant gains over existing methods.
Details
Motivation: Large language models perform well in English but struggle with complex reasoning in low-resource languages. Existing methods like LangBridge and MindMerger improve mid/high-resource languages but leave large gaps for low-resource languages.
Method: Two-stage model-stacking framework using curriculum learning strategy: from general bilingual bitext to task-specific data, adapting only a small set of DoRA weights.
Result: On AfriMGSM benchmark: +12.9 pp improvement over MindMerger, outperforms GPT-4o-mini. Consistent gains on MGSM (+0.9 pp) and MSVAMP (+2.8 pp) across low and high-resource settings.
Conclusion: MERLIN effectively improves reasoning capabilities in low-resource languages while maintaining performance across different resource settings.
Abstract: Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy – from general bilingual bitext to task-specific data – and adapts only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low and high-resource settings.
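MERLIN adapts only DoRA weights. DoRA (Liu et al., 2024) decomposes each weight column into a learned magnitude times a normalised direction that absorbs the low-rank update. A per-column sketch, assuming the standard DoRA formulation rather than any MERLIN-specific variant:

```python
import math

def dora_column(w_col, delta_col, magnitude):
    # Apply the low-rank update to the column, renormalise the result to
    # a unit direction, then rescale by the learned per-column magnitude.
    merged = [w + d for w, d in zip(w_col, delta_col)]
    norm = math.sqrt(sum(x * x for x in merged))
    return [magnitude * x / norm for x in merged]

# With a zero update, the column keeps its original direction but takes
# the learned magnitude (here 2.0 instead of the original norm 5.0).
col = dora_column([3.0, 4.0], [0.0, 0.0], magnitude=2.0)
col_norm = math.sqrt(sum(x * x for x in col))
```

Decoupling magnitude from direction is what lets such adapters change a model's behaviour while training only a small fraction of its parameters, which is the efficiency MERLIN relies on.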
[66] Better Late Than Never: Meta-Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation
Peter Polák, Sara Papi, Luisa Bentivogli, Ondřej Bojar
Main category: cs.CL
TL;DR: Comprehensive meta-evaluation of latency metrics for simultaneous speech-to-text translation, introducing YAAL and LongYAAL for accurate short/long-form evaluation with SoftSegmenter tool.
Details
Motivation: Existing latency metrics for simultaneous speech-to-text translation produce inconsistent results, especially in short-form settings with artificial presegmentation, making reliable system evaluation challenging.
Method: Conducted first comprehensive meta-evaluation of latency metrics across language pairs and systems, identified structural bias related to segmentation, introduced YAAL for short-form evaluation and LongYAAL for unsegmented audio, and proposed SoftSegmenter for resegmentation based on soft word-level alignment.
Result: YAAL and LongYAAL with SoftSegmenter outperform popular latency metrics, enabling more reliable assessments of both short- and long-form simultaneous speech translation systems.
Conclusion: The proposed metrics and tools address critical gaps in latency evaluation for speech translation, with all artifacts implemented in the OmniSTEval toolkit for community use.
Abstract: Simultaneous speech-to-text translation systems must balance translation quality with latency. Although quality evaluation is well established, latency measurement remains a challenge. Existing metrics produce inconsistent results, especially in short-form settings with artificial presegmentation. We present the first comprehensive meta-evaluation of latency metrics across language pairs and systems. We uncover a structural bias in current metrics related to segmentation. We introduce YAAL (Yet Another Average Lagging) for a more accurate short-form evaluation and LongYAAL for unsegmented audio. We propose SoftSegmenter, a resegmentation tool based on soft word-level alignment. We show that YAAL and LongYAAL, together with SoftSegmenter, outperform popular latency metrics, enabling more reliable assessments of short- and long-form simultaneous speech translation systems. We implement all artifacts within the OmniSTEval toolkit: https://github.com/pe-trik/OmniSTEval.
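YAAL's starting point is the Average Lagging family of metrics (Ma et al., 2019). The classical text-to-text form below gives the intuition; YAAL's corrections and the extension to speech timestamps are omitted, so this is background rather than the paper's metric.

```python
def average_lagging(delays, src_len, tgt_len):
    # delays[i] = number of source tokens read before emitting target
    # token i.  AL averages how far each emission lags behind an ideal
    # translator that keeps pace with the source at rate tgt_len/src_len.
    gamma = tgt_len / src_len
    total, tau = 0.0, 0
    for i, d in enumerate(delays):
        total += d - i / gamma
        tau += 1
        if d >= src_len:  # first token emitted after the full source is read
            break
    return total / tau

# A wait-2 policy on a 5-token source / 5-token target: read 2 tokens,
# then alternate write and read.  For wait-k with equal lengths, AL = k.
delays = [2, 3, 4, 5, 5]
al = average_lagging(delays, src_len=5, tgt_len=5)
```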
[67] How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta
Main category: cs.CL
TL;DR: Micro-benchmarks (small subsets of evaluation data) often fail to reliably rank models compared to full benchmarks, requiring 250+ examples for consistent rankings, with random sampling being competitive with specialized methods.
Details
Motivation: To address the prohibitive time and cost of full benchmark evaluations for language model development by investigating whether micro-benchmarks can provide reliable model rankings comparable to full benchmarks.
Method: Introduces a meta-evaluation measure that assesses how well micro-benchmarks rank models based on their performance differences on full benchmarks. Analyzes various micro-benchmarking methods and sizes on datasets like MMLU-Pro and BIG-bench Hard.
Result: Micro-benchmarks often cannot reliably rank models, requiring 250+ examples for consistent rankings. Random sampling is competitive with specialized methods. With 25 examples on MMLU-Pro, over half of pairwise comparisons for 8B instruction-tuned models are unreliable.
Conclusion: Micro-benchmarking has significant reliability limitations, requiring careful consideration of size and method selection. Provides actionable guidance for balancing evaluation efficiency with reliability.
Abstract: Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
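The random-sampling baseline the paper compares against can be simulated directly: draw micro-benchmarks of a given size and count how often they preserve the full-benchmark ordering of two models. This is a simplified stand-in for the paper's meta-evaluation measure, with synthetic per-example scores.

```python
import random

def ranking_agreement(scores_a, scores_b, subset_size, trials=1000, seed=0):
    # scores_a[i], scores_b[i] are per-example correctness (0/1) for two
    # models; estimate how often a random subset preserves their full-
    # benchmark ranking (ties count as agreement with a ">=" ordering).
    rng = random.Random(seed)
    n = len(scores_a)
    full_order = sum(scores_a) >= sum(scores_b)
    agree = 0
    for _ in range(trials):
        idx = rng.sample(range(n), subset_size)
        sub_order = (sum(scores_a[i] for i in idx)
                     >= sum(scores_b[i] for i in idx))
        agree += (sub_order == full_order)
    return agree / trials

# Model A is ~5 points better than B on a 1000-example benchmark; tiny
# subsets flip the ranking far more often than larger ones.
gen = random.Random(1)
a = [1 if gen.random() < 0.70 else 0 for _ in range(1000)]
b = [1 if gen.random() < 0.65 else 0 for _ in range(1000)]
small = ranking_agreement(a, b, subset_size=10)
large = ranking_agreement(a, b, subset_size=250)
```

Running this makes the paper's headline concrete: agreement climbs toward certainty only as the subset approaches a few hundred examples.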
[68] Do LLMs Really Know What They Don’t Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness
Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng
Main category: cs.CL
TL;DR: LLM hallucinations can be categorized into two types: Unassociated Hallucinations (UHs) with no parametric grounding vs Associated Hallucinations (AHs) driven by spurious associations learned during training. AHs have internal processes similar to factual recall, making them hard to detect via standard methods.
Details
Motivation: Current approaches assume hallucinations can be distinguished from factual outputs using internal signals, but this overlooks that some hallucinations stem from learned spurious associations that are mechanistically similar to factual recall processes.
Method: Proposed taxonomy categorizing hallucinations into Unassociated Hallucinations (UHs) and Associated Hallucinations (AHs). Conducted mechanistic analysis comparing computational processes and hidden-state geometries between hallucination types and factual outputs.
Result: Hidden states primarily reflect whether the model is recalling parametric knowledge rather than output truthfulness. AHs exhibit hidden-state geometries overlapping with factual outputs, making detection difficult. UHs show distinctive clustered representations enabling reliable detection.
Conclusion: Standard hallucination detection methods fail for AHs because they share mechanistic similarities with factual recall. New approaches are needed that account for different hallucination types and their underlying causes.
Abstract: Recent work suggests that LLMs “know what they don’t know”, positing that hallucinated and factually correct outputs arise from distinct internal processes and can therefore be distinguished using internal signals. However, hallucinations have multifaceted causes: beyond simple knowledge gaps, they can emerge from training incentives that encourage models to exploit statistical shortcuts or spurious associations learned during pretraining. In this paper, we argue that when LLMs rely on such learned associations to produce hallucinations, their internal processes are mechanistically similar to those of factual recall, as both stem from strong statistical correlations encoded in the model’s parameters. To verify this, we propose a novel taxonomy categorizing hallucinations into Unassociated Hallucinations (UHs), where outputs lack parametric grounding, and Associated Hallucinations (AHs), which are driven by spurious associations. Through mechanistic analysis, we compare their computational processes and hidden-state geometries with factually correct outputs. Our results show that hidden states primarily reflect whether the model is recalling parametric knowledge rather than the truthfulness of the output itself. Consequently, AHs exhibit hidden-state geometries that largely overlap with factual outputs, rendering standard detection methods ineffective. In contrast, UHs exhibit distinctive, clustered representations that facilitate reliable detection.
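The geometric claim can be illustrated with a toy probe on synthetic "hidden states": a nearest-centroid classifier (a minimal stand-in for the probing classifiers used in such analyses) separates UHs from factual outputs easily but fails on AHs that overlap the factual cluster. The cluster positions and spreads below are invented for illustration, not measured from any model.

```python
import math
import random

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_centroid_accuracy(pos, neg, test_pos, test_neg):
    # Classify each test point by its nearest class centroid.
    c_pos, c_neg = centroid(pos), centroid(neg)
    correct = sum(dist(x, c_pos) < dist(x, c_neg) for x in test_pos)
    correct += sum(dist(x, c_neg) < dist(x, c_pos) for x in test_neg)
    return correct / (len(test_pos) + len(test_neg))

rng = random.Random(0)
def sample(center, n, spread=0.5):
    return [[c + rng.gauss(0, spread) for c in center] for _ in range(n)]

# Synthetic geometry mirroring the paper's finding: factual outputs and
# AHs share a region of hidden-state space, while UHs sit well apart.
factual = sample([1.0, 1.0], 50)
ah = sample([1.1, 0.9], 50)    # overlaps the factual cluster
uh = sample([-2.0, -2.0], 50)  # clearly separated

acc_uh = nearest_centroid_accuracy(factual, uh,
                                   sample([1.0, 1.0], 50),
                                   sample([-2.0, -2.0], 50))
acc_ah = nearest_centroid_accuracy(factual, ah,
                                   sample([1.0, 1.0], 50),
                                   sample([1.1, 0.9], 50))
```

In this toy setup the probe detects UHs almost perfectly while staying near chance on AHs, which is the failure mode the paper attributes to standard internal-state detectors.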
[69] Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models
Shabnam Ataee, Hugo Huart, Andrei Popescu-Belis
Main category: cs.CL
TL;DR: LLMs evaluated on English-French translation with inter-sentential dependencies using DiscEvalMT benchmark, testing 12 models on discrimination and generation tasks with chain-of-thought reasoning prompts.
Details
Motivation: To assess LLMs' ability to handle complex translation challenges involving inter-sentential dependencies like pronominal anaphora and lexical cohesion, which require understanding relationships across sentences.
Method: Used DiscEvalMT benchmark with sentence pairs containing translation challenges. Evaluated 12 LLMs from DeepSeek-R1, GPT, Llama, Mistral, and Phi families on two tasks: (1) distinguishing correct from plausible wrong translations, and (2) generating correct translations. Compared chain-of-thought reasoning prompts with standard prompts.
Result: Best models achieved ~90% accuracy on discrimination task and ~92% COMET scores on generation task. GPT-4, GPT-4o, and Phi performed best. Found “wise get wiser” effect: reasoning benefits were larger for already high-performing models.
Conclusion: LLMs can handle complex translation challenges with inter-sentential dependencies, especially when using chain-of-thought reasoning, with top models showing strong performance on both discrimination and generation tasks.
Abstract: This paper assesses the ability of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges for pronominal anaphora and lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguish a correct translation from a wrong but plausible one; and (2) generate a correct translation. We compare prompts that encourage chain-of-thought reasoning with those that do not. The best models take advantage of reasoning and reach about 90% accuracy on the first task and COMET scores of about 92% on the second task, with GPT-4, GPT-4o and Phi standing out. Moreover, we observe a “wise get wiser” effect: the improvements through reasoning are larger for models that already perform well without reasoning.
[70] Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups
Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi
Main category: cs.CL
TL;DR: ChatGPT-based coding of communication data shows consistent performance across demographic groups (gender and race), comparable to human raters, enabling large-scale assessment of collaboration.
Details
Motivation: Manual coding of communication data for assessing collaboration is labor-intensive. While ChatGPT has shown comparable accuracy to human raters, it's unclear if its coding performance is consistent across different demographic groups (gender, race).
Method: Adapted existing framework from automated scoring literature to evaluate subgroup consistency in LLM-based coding. Used collaborative problem-solving coding framework with data from three types of collaborative tasks to examine ChatGPT-based coding performance across gender and racial/ethnic groups.
Result: ChatGPT-based coding performs consistently in the same way as human raters across gender and racial/ethnic groups, demonstrating comparable subgroup consistency.
Conclusion: ChatGPT can be reliably used for large-scale assessments of collaboration and communication with consistent performance across demographic groups, similar to human raters.
Abstract: Assessing communication and collaboration at scale depends on the labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology performs consistently across different demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding performs consistently in the same way as human raters across gender and racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.
[71] Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People
Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum
Main category: cs.CL
TL;DR: The paper develops methods to evaluate and enhance language models as rational information-seeking agents through Bayesian Experimental Design-inspired Monte Carlo inference, tested on collaborative games like Battleship and Guess Who?
Details
Motivation: Many AI applications require strategic information-seeking under uncertainty, but it's unclear if language models behave like rational agents in high-stakes, resource-limited settings. The paper aims to evaluate and improve LM capabilities for agentic information-seeking.
Method: Introduces Collaborative Battleship task with Captain (exploration/action) and Spotter (contextual answers) roles. Develops novel Monte Carlo inference strategies inspired by Bayesian Experimental Design to enhance LM agents' question-asking and decision-making capabilities.
Result: LM agents initially struggle compared to humans. BED-inspired methods boost Spotter accuracy by 14.7%, Captain expected information gain by 0.227 bits (94.2% of noise ceiling), improve targeting (+0.303-0.374 F1), and enable weaker LMs to outperform humans and frontier models at much lower cost.
Conclusion: The developed methods significantly enhance LM capabilities for strategic information-seeking, demonstrating general applicability across different tasks and enabling cost-effective performance improvements over both human players and frontier models.
Abstract: Many emerging applications of AI–from scientific discovery to medical diagnosis–require agents to seek information strategically: forming hypotheses, asking targeted questions, and making decisions under uncertainty. In high-stakes settings with limited resources, do language models (LMs) behave like rational agents? Drawing on insights from human cognition, we develop methods to evaluate and enhance agentic information-seeking. First, we introduce a decision-oriented dialogue task called Collaborative Battleship, in which a Captain must balance exploration (asking questions) and action (taking shots), while a Spotter must supply accurate, contextually-grounded answers. Compared to human players (N=42), we find that many LM agents struggle to ask informative questions, produce accurate answers, and identify high-utility actions. To address these gaps, we develop novel Monte Carlo inference strategies for LMs inspired by Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5’s cost. We replicate these findings on Guess Who?, where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building information-seeking agents.
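The Captain's question-selection step rests on Bayesian Experimental Design: score each candidate question by its expected information gain (EIG) under the current belief. The paper's full Monte Carlo machinery is not reproduced here; the sketch below (function names are illustrative, not from the paper) shows the core computation for deterministic questions over a uniform set of posterior samples, where EIG reduces to the entropy of the induced answer distribution.

```python
import math
from collections import Counter

def expected_information_gain(samples, answer_fn):
    """Monte Carlo EIG of a question under a uniform belief over samples.

    For a question with a deterministic answer per world state, EIG equals
    the entropy (in bits) of the answer distribution the question induces
    over the posterior samples.
    """
    counts = Counter(answer_fn(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy belief: a hidden ship occupies one of four cells.
samples = [0, 1, 2, 3]
# "Is the ship in the left half?" splits the belief evenly: 1 bit.
eig_half = expected_information_gain(samples, lambda s: s < 2)
# "Is the ship in cell 0?" splits it 1/4 vs 3/4: about 0.811 bits.
eig_cell = expected_information_gain(samples, lambda s: s == 0)
```

A Captain agent would rank candidate questions by this score and ask the maximizer, trading it off against the expected utility of taking a shot instead.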
[72] DETECT: Determining Ease and Textual Clarity of German Text Simplifications
Maria Korobeynikova, Alessia Battisti, Lukas Fischer, Yingqiang Gao
Main category: cs.CL
TL;DR: DETECT is a German-specific automatic text simplification evaluation metric that assesses simplicity, meaning preservation, and fluency, trained on synthetic LLM-generated data without human annotation.
Details
Motivation: Current German ATS evaluation relies on general-purpose metrics (SARI, BLEU, BERTScore) that insufficiently capture simplification quality. While specialized metrics like LENS exist for English, German lacks such tools due to absence of human-annotated corpora.
Method: Adapts LENS framework to German and extends it with: (1) pipeline for generating synthetic quality scores via LLMs for dataset creation without human annotation, (2) LLM-based refinement step for aligning grading criteria with simplification requirements. Also constructs largest German human evaluation dataset for validation.
Result: DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency.
Conclusion: DETECT addresses the gap in German ATS evaluation and highlights both potential and limitations of LLMs for automatic evaluation, providing transferable guidelines for general language accessibility tasks.
Abstract: Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and fluency. While specialized metrics like LENS have been developed for English, corresponding efforts for German have lagged behind due to the absence of human-annotated corpora. To close this gap, we introduce DETECT, the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency, and is trained entirely on synthetic large language model (LLM) responses. Our approach adapts the LENS framework to German and extends it with (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement step for aligning grading criteria with simplification requirements. To the best of our knowledge, we also construct the largest German human evaluation dataset for text simplification to validate our metric directly. Experimental results show that DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency. Beyond ATS, our findings highlight both the potential and the limitations of LLMs for automatic evaluation and provide transferable guidelines for general language accessibility tasks.
[73] AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages
Kosei Uemura, Miaoran Zhang, David Ifeoluwa Adelani
Main category: cs.CL
TL;DR: AfriMTEB benchmark expands multilingual text embedding evaluation to 59 African languages with 14 tasks and 38 datasets, while AfriE5 model achieves SOTA performance through cross-lingual contrastive distillation.
Details
Motivation: African languages are underrepresented in multilingual text embedding benchmarks like MMTEB, with existing tasks often repurposed from translation benchmarks. There's a need for comprehensive evaluation covering diverse African languages and tasks beyond translation.
Method: 1) Created AfriMTEB benchmark covering 59 African languages, 14 tasks, and 38 datasets including six new datasets spanning 14-56 languages with novel tasks like hate speech detection, intent detection, and emotion classification. 2) Developed AfriE5 model by adapting instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation.
Result: AfriE5 achieves state-of-the-art performance, outperforming strong baselines including Gemini-Embeddings and mE5 on the comprehensive AfriMTEB benchmark.
Conclusion: The paper addresses the underrepresentation of African languages in text embedding evaluation and demonstrates successful adaptation of existing models to achieve superior performance through specialized training techniques.
Abstract: Text embeddings are an essential building component of several NLP tasks such as retrieval-augmented generation which is crucial for preventing hallucinations in LLMs. Despite the recent release of massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB – a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.
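Cross-lingual contrastive distillation, as used to adapt mE5 into AfriE5, pulls a student embedding of a sentence in an African language toward the teacher's embedding of its English parallel while pushing it away from other sentences in the batch. A minimal InfoNCE-style sketch of that objective, with the caveat that AfriE5's actual loss, temperature, and batching details may differ:

```python
import math

def info_nce_distill(student_vecs, teacher_vecs, tau=0.05):
    """Contrastive distillation loss over a batch of parallel pairs.

    student_vecs[i] embeds sentence i in the student's language;
    teacher_vecs[i] embeds its English parallel. Each student vector
    should match its own teacher against the in-batch negatives.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    loss = 0.0
    for i, s in enumerate(student_vecs):
        logits = [dot(s, t) / tau for t in teacher_vecs]
        m = max(logits)  # log-sum-exp with max-shift for stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax at the positive index
    return loss / len(student_vecs)
```

When the student already mirrors the teacher the loss is near zero; misaligned pairs are penalized sharply at a low temperature.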
[74] Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Pranav Bhandari, Nicolas Fay, Sanjeevan Selvaganapathy, Amitava Datta, Usman Naseem, Mehwish Nasim
Main category: cs.CL
TL;DR: A pipeline for extracting and injecting personality traits into LLMs using Big Five personality dimensions and low-rank subspace discovery for controllable behavioral steering.
Details
Motivation: LLMs exhibit implicit personalities but lack reliable control mechanisms for aligning these traits to specific needs. There's a critical gap in behavioral manipulation techniques, and personality-aware LLMs offer promising direction, but the relationship between psychological constructs and their representations in LLMs remains underexplored.
Method: Proposes a novel pipeline that: 1) extracts hidden state activations from transformer layers using Big Five Personality Traits framework, 2) applies low-rank subspace discovery methods to identify trait-specific optimal layers across different model architectures, 3) operationalizes personality-aligned directions through flexible steering framework with dynamic layer selection for precise control of trait expression.
Result: Findings reveal that personality traits occupy a low-rank shared subspace, and these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting fluency, variance, and general capabilities.
Conclusion: The approach bridges the gap between psychological theory and practical model alignment, enabling precise control of personality traits in LLM outputs while maintaining model capabilities.
Abstract: Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. The need for effective mechanisms for behavioural manipulation of the model during generation is a critical gap in the literature. Personality-aware LLMs hold a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored and requires further investigation. Moreover, it is intriguing to understand and study the use of these representations to steer the models' behaviour. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), a comprehensive and empirically validated framework for modelling human personality; applies low-rank subspace discovery methods; and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting fluency, variance and general capabilities, helping to bridge the gap between psychological theory and practical model alignment.
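The extract-then-inject pipeline can be illustrated with the simplest possible variant: a mean-difference direction between activations gathered from trait-positive and trait-negative prompts, added back into a hidden state at generation time. This is a deliberately simplified stand-in for the paper's low-rank subspace discovery and dynamic layer selection; all names here are illustrative.

```python
def trait_direction(pos_acts, neg_acts):
    """Mean-difference steering direction for one trait.

    pos_acts/neg_acts: lists of hidden-state vectors collected at one layer
    from trait-expressing and trait-suppressing prompts respectively.
    """
    dim = len(pos_acts[0])
    mean_pos = [sum(a[d] for a in pos_acts) / len(pos_acts) for d in range(dim)]
    mean_neg = [sum(a[d] for a in neg_acts) / len(neg_acts) for d in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def steer(hidden, direction, alpha=1.0):
    """Inject the trait direction into a hidden state with strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

In the paper's setting the direction would instead come from a low-rank subspace fit per trait, and the injection layer would be chosen dynamically per model.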
[75] Critical Confabulation: Can LLMs Hallucinate for Social Good?
Peiqi Sui, Eamon Duede, Hoyt Long, Richard Jean So
Main category: cs.CL
TL;DR: LLMs can be used for “critical confabulation” - using controlled hallucinations to fill gaps in historical archives for marginalized figures, evaluated through narrative cloze tasks on unpublished texts.
Details
Motivation: To explore how LLM hallucinations can be usefully bounded to reconstruct narratives for historical "hidden figures" where archives have omissions due to social/political inequality, rather than treating all hallucinations as harmful.
Method: Created narrative cloze tasks by masking events in character-centric timelines from unpublished texts, evaluated audited open models (OLMo-2 family) and baselines with various prompts designed to elicit controlled hallucinations.
Result: Validated LLMs’ narrative understanding capabilities for critical confabulation, showing controlled hallucinations can support knowledge production without compromising historical accuracy.
Conclusion: Well-specified LLM hallucinations can be valuable for reconstructing marginalized historical narratives when carefully bounded, offering a productive approach to knowledge gaps in archives.
Abstract: LLMs hallucinate, yet some confabulations can have social affordances if carefully bounded. We propose critical confabulation (inspired by critical fabulation from literary and social theory), the use of LLM hallucinations to “fill-in-the-gap” for omissions in archives due to social and political inequality, and reconstruct divergent yet evidence-bound narratives for history’s “hidden figures”. We simulate these gaps with an open-ended narrative cloze task: asking LLMs to generate a masked event in a character-centric timeline sourced from a novel corpus of unpublished texts. We evaluate audited (for data contamination), fully-open models (the OLMo-2 family) and unaudited open-weight and proprietary baselines under a range of prompts designed to elicit controlled and useful hallucinations. Our findings validate LLMs’ foundational narrative understanding capabilities to perform critical confabulation, and show how controlled and well-specified hallucinations can support LLM applications for knowledge production without collapsing speculation into a lack of historical accuracy and fidelity.
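The open-ended narrative cloze setup can be sketched directly: mask one event in a character-centric timeline and treat the masked event as the generation target. A minimal construction, with the paper's corpus and prompting not reproduced here:

```python
def make_narrative_cloze(timeline, mask_index, mask_token="[MASK]"):
    """Mask one event in a character-centric timeline.

    Returns the masked timeline (the model's input) and the held-out
    event (the generation target for evaluation).
    """
    masked = list(timeline)
    target = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, target
```

The model's generation for the masked slot is then judged for evidence-boundedness against the surrounding timeline rather than exact-match recovery.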
[76] SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization
Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt, Jianfei Cai
Main category: cs.CL
TL;DR: Token-selective test-time reinforcement learning framework for multimodal LLMs that updates only decision-critical branch points to prevent collapse and improve reasoning performance.
Details
Motivation: Multimodal LLMs face distribution shift at test-time and lack verifiable supervision. Existing test-time RL methods often collapse due to uniform sequence updates where most tokens are low-entropy followers, while a small high-entropy subset determines reasoning branches.
Method: Proposes a token-selective test-time RL framework that: (1) performs distribution-aware forking-token selection to update only decision-critical branch points, and (2) applies a robust entropy-band regularizer at those tokens to prevent premature collapse and suppress noisy drift. Works with GRPO-style objectives without labels or reward models.
Result: Across eight benchmarks spanning multimodal VQA and text-only reasoning, the method consistently improves Pass@1 over test-time RL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones.
Conclusion: Aligning updates with chain-of-thought branch points is a simple and label-free mechanism for stable and effective test-time adaptation in reasoning models.
Abstract: Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test-time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. Thus we propose SPINE, a token-selective test-time reinforcement learning framework that (i) performs distribution-aware forking-token selection to update only decision-critical branch points, and (ii) applies a robust entropy-band regularizer at those tokens to prevent premature collapse and suppress noisy drift. SPINE plugs into GRPO-style objectives (optionally with a KL anchor) and requires neither labels nor reward models. Across eight benchmarks spanning multimodal VQA and text-only reasoning, SPINE consistently improves Pass@1 over TTRL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones. These results indicate that aligning updates with chain-of-thought branch points is a simple and label-free mechanism for stable and effective test-time adaptation in reasoning models. Code will be released.
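The two components lend themselves to a compact sketch: rank tokens by the entropy of their next-token distribution, keep only the top fraction as forking tokens, and penalize selected tokens whose entropy drifts outside a target band. This is an illustrative reconstruction rather than the released implementation; the selection fraction and band edges below are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_forking_tokens(per_token_probs, top_frac=0.2):
    """Keep the highest-entropy tokens as decision-critical branch points."""
    ents = [token_entropy(p) for p in per_token_probs]
    k = max(1, int(len(ents) * top_frac))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:k])  # positions to receive policy updates

def entropy_band_penalty(entropy, low, high):
    """Quadratic penalty when a selected token's entropy leaves [low, high]."""
    if entropy < low:
        return (low - entropy) ** 2
    if entropy > high:
        return (entropy - high) ** 2
    return 0.0
```

Only the selected positions would contribute to the GRPO-style gradient, with the band penalty added to keep their entropy from collapsing or exploding.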
[77] Window-based Membership Inference Attacks Against Fine-tuned Large Language Models
Yuetian Chen, Yuntao Du, Kaiyuan Zhang, Ashish Kundu, Charles Fleming, Bruno Ribeiro, Ninghui Li
Main category: cs.CL
TL;DR: WBC (Window-Based Comparison) is a new membership inference attack for LLMs that uses localized sliding windows instead of global averaging to better detect training data memorization, achieving 2-3x improvements over existing methods.
Details
Motivation: Current MIAs against LLMs rely on global signals like average loss, which dilute subtle localized memorization signals. The authors challenge this paradigm, arguing that membership signals are more pronounced within localized contexts.
Method: WBC uses a sliding window approach with sign-based aggregation. It slides windows of varying sizes across text sequences, with each window casting a binary vote on membership based on loss comparisons between target and reference models. Votes are ensembled across geometrically spaced window sizes.
Result: Extensive experiments across eleven datasets show WBC substantially outperforms established baselines, achieving higher AUC scores and 2-3 times improvements in detection rates at low false positive thresholds.
Conclusion: Aggregating localized evidence is fundamentally more effective than global averaging for membership inference, exposing critical privacy vulnerabilities in fine-tuned LLMs.
Abstract: Most membership inference attacks (MIAs) against Large Language Models (LLMs) rely on global signals, like average loss, to identify training data. This approach, however, dilutes the subtle, localized signals of memorization, reducing attack effectiveness. We challenge this global-averaging paradigm, positing that membership signals are more pronounced within localized contexts. We introduce WBC (Window-Based Comparison), which exploits this insight through a sliding window approach with sign-based aggregation. Our method slides windows of varying sizes across text sequences, with each window casting a binary vote on membership based on loss comparisons between target and reference models. By ensembling votes across geometrically spaced window sizes, we capture memorization patterns from token-level artifacts to phrase-level structures. Extensive experiments across eleven datasets demonstrate that WBC substantially outperforms established baselines, achieving higher AUC scores and 2-3 times improvements in detection rates at low false positive thresholds. Our findings reveal that aggregating localized evidence is fundamentally more effective than global averaging, exposing critical privacy vulnerabilities in fine-tuned LLMs.
[78] Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation
Andrew Moore, Paul Rayson, Dawn Archer, Tim Czerniak, Dawn Knight, Daisy Lal, GearĂłid Ă Donnchadha, MĂcheĂĄl Ă Meachair, Scott Piao, Elaine UĂ Dhonnchadha, Johanna Vuorinen, Yan Yabo, Xiaobin Yang
Main category: cs.CL
TL;DR: This paper presents the largest evaluation of the USAS semantic tagging framework across five languages, creating new datasets and neural models to enhance rule-based systems.
Details
Motivation: The USAS semantic analysis framework lacks extensive open evaluation compared to other frameworks like WordNet and BabelNet, particularly for multilingual applications.
Method: Created new silver-labeled English dataset and novel Chinese dataset; evaluated rule-based USAS system across five languages; trained and evaluated mono/multilingual neural models in mono/cross-lingual setups; compared neural models to rule-based counterparts.
Result: Neural network models were developed that can enhance rule-based systems; all resources including datasets, models, and code were released as open resources.
Conclusion: The work provides comprehensive evaluation of USAS framework, demonstrates how neural models can enhance rule-based semantic tagging systems, and releases valuable open resources for the community.
Abstract: Word Sense Disambiguation (WSD) has been widely evaluated using the semantic frameworks of WordNet, BabelNet, and the Oxford Dictionary of English. However, for the UCREL Semantic Analysis System (USAS) framework, no open extensive evaluation has been performed beyond lexical coverage or single-language evaluation. In this work, we perform the largest semantic tagging evaluation of the rule-based system that uses the lexical resources in the USAS framework, covering five different languages using four existing datasets and one novel Chinese dataset. We create a new silver-labelled English dataset to overcome the lack of manually tagged training data, on which we train and evaluate various mono- and multilingual neural models in both mono- and cross-lingual evaluation setups, with comparisons to their rule-based counterparts, and we show how a rule-based system can be enhanced with a neural network model. The resulting neural network models, including the data they were trained on, the Chinese evaluation dataset, and all of the code have been released as open resources.
[79] Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models
Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo, Qian Cao, Rita Cucchiara, Ruihua Song, Jian Luan
Main category: cs.CL
TL;DR: LED (Latent Exploration Decoding) addresses exploration collapse in post-trained Large Reasoning Models by using depth-conditioned decoding that aggregates intermediate layer posteriors to increase entropy and improve reasoning performance.
Details
Motivation: Modern reasoning post-training induces unintended exploration collapse where temperature-based sampling no longer improves pass@n accuracy, due to sharply reduced entropy in final-layer posteriors while intermediate layers maintain higher entropy.
Method: Latent Exploration Decoding (LED) uses depth-conditioned decoding that aggregates intermediate layer posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates, without additional training or parameters.
Result: LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models without additional training.
Conclusion: LED effectively addresses exploration collapse in post-trained reasoning models by leveraging entropy asymmetry between layers, providing a simple yet effective decoding strategy for improved reasoning performance.
Abstract: Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@n accuracy. Empirically, the final-layer posteriors of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://github.com/AlbertTan404/Latent-Exploration-Decoding.
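The cumulative-sum aggregation is easy to state concretely: running-sum the per-layer next-token distributions over depth, renormalize at each depth, and keep the aggregated distribution whose entropy is maximal. A minimal sketch under the assumption that each intermediate layer's posterior is available as a probability vector over the vocabulary:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def led_select(layer_posteriors):
    """Depth-conditioned aggregation: cumulative-sum posteriors over depth,
    renormalize, and return the aggregated distribution with maximal entropy
    together with that entropy."""
    vocab = len(layer_posteriors[0])
    cum = [0.0] * vocab
    best, best_h = None, -1.0
    for layer in layer_posteriors:
        cum = [c + p for c, p in zip(cum, layer)]
        z = sum(cum)
        agg = [c / z for c in cum]
        h = entropy(agg)
        if h > best_h:
            best, best_h = agg, h
    return best, best_h
```

Sampling from the higher-entropy aggregated distribution is what restores exploration, since the final layer alone has collapsed onto a near-deterministic posterior.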
[80] COMI: Coarse-to-fine Context Compression via Marginal Information Gain
Jiwei Tang, Shilei Liu, Zhicheng Zhang, Yujin Yuan, Libin Zheng, Wenbo Su, Bo Zheng
Main category: cs.CL
TL;DR: COMI: A coarse-to-fine adaptive context compression framework for LLMs that optimizes for semantic relevance and diversity under high compression rates using Marginal Information Gain metric.
Details
Motivation: LLMs face computational inefficiency and information redundancy in long context scenarios, requiring effective context compression methods to reduce input length while preserving key information.
Method: Two-stage framework: 1) Coarse-grained group reallocation partitions context into groups and assigns compression rates based on inter-group Marginal Information Gain; 2) Fine-grained token merging fuses tokens within groups using intra-group MIG-based weighting.
Result: COMI outperforms existing baselines by large margins, achieving ~25-point EM improvement under 32x compression with Qwen2-7B on NaturalQuestions across various QA and summarization benchmarks.
Conclusion: COMI effectively addresses long context challenges in LLMs through adaptive compression that preserves relevant, non-redundant information, enabling more efficient deployment in resource-constrained scenarios.
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks. However, their deployment in long context scenarios remains hindered by computational inefficiency and information redundancy. Context compression methods address these challenges by significantly reducing input length and eliminating redundancy. We propose COMI, a coarse-to-fine adaptive context compression framework that jointly optimizes for semantic relevance and diversity under high compression rates. We introduce Marginal Information Gain (MIG), a metric defined as the relevance of a unit to the input query minus its semantic redundancy with other units, guiding the compression process to prioritize information that is both relevant and minimally redundant. The framework operates in two stages: (1) Coarse-Grained Group Reallocation, where the context is partitioned into groups and dynamically assigned compression rates based on inter-group MIG, ensuring compression budgets align with information value distribution; and (2) Fine-Grained Token Merging, where tokens within each group are fused via an intra-group MIG-based weighting mechanism, thereby preserving key semantics while avoiding the accumulation of redundancy. Extensive experiments across question-answering (e.g., NaturalQuestions, 2WikiMQA, HotpotQA and NarrativeQA), summarization (e.g., MultiNews) with various backbones (e.g., LLaMA-2-7B, Qwen2-7B) show that COMI outperforms existing baselines by a large margin, e.g., approximately 25-point Exact Match (EM) improvement under 32x compression constraint with Qwen2-7B on NaturalQuestions.
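The MIG criterion (a unit's relevance to the query minus its redundancy with what is already kept) supports a simple greedy selection sketch. Taking redundancy as the maximum similarity to previously selected units is our assumption for illustration; the paper applies MIG at both group and token granularity rather than via plain greedy selection.

```python
def marginal_information_gain(unit, query_sim, selected, pair_sim):
    """MIG of a unit: relevance to the query minus redundancy, here the
    maximum similarity to any already-selected unit (0 when none kept)."""
    redundancy = max((pair_sim(unit, s) for s in selected), default=0.0)
    return query_sim(unit) - redundancy

def greedy_compress(units, query_sim, pair_sim, budget):
    """Greedily keep `budget` units, always taking the current MIG maximizer."""
    selected = []
    pool = list(units)
    while pool and len(selected) < budget:
        best = max(pool, key=lambda u: marginal_information_gain(
            u, query_sim, selected, pair_sim))
        selected.append(best)
        pool.remove(best)
    return selected
```

The redundancy term is what separates MIG from pure relevance ranking: a highly relevant unit is skipped if a near-duplicate has already been kept.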
[81] Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken-ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, Ioannis Panageas, Dimitris Paparas, Benjamin Przybocki, Bernardo Subercaseaux, Ola Svensson, Shayan Taherijam, Xuan Wu, Eylon Yogev, Morteza Zadimoghaddam, Samson Zhou, Yossi Matias, James Manyika, Vahab Mirrokni
Main category: cs.CL
TL;DR: Researchers demonstrate successful human-AI collaboration using Google’s Gemini models to solve open problems, refute conjectures, and generate proofs across theoretical computer science, economics, optimization, and physics.
Details
Motivation: While LLMs assist with routine tasks, their ability to contribute to novel, expert-level mathematical discovery remains less understood. The paper aims to explore how advanced AI models can genuinely partner in creative scientific discovery.
Method: Collection of case studies using Google’s Gemini-based models (Gemini Deep Think and variants) with interactive, conversational methodology. Techniques include iterative refinement, problem decomposition, cross-disciplinary knowledge transfer, adversarial review for proof flaws, and neuro-symbolic loops for code verification.
Result: Successful collaborations solving open problems, refuting conjectures, and generating new proofs across diverse theoretical fields. Demonstrated AI’s potential as a versatile partner beyond automation.
Conclusion: AI models can serve as genuine partners in creative scientific discovery, not just automation tools. Effective collaboration requires specific techniques like iterative refinement and problem decomposition.
Abstract: Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google’s Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a “neuro-symbolic” loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.
[82] DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen
Main category: cs.CL
TL;DR: DataChef-32B automates end-to-end data recipe generation for LLM adaptation using online reinforcement learning with proxy rewards, producing recipes comparable to human experts across six tasks.
Details
Motivation: Current LLM data recipe design is manual and labor-intensive despite automation of individual steps. There's a need to automate the overall data recipe generation process to reduce human expertise requirements and enable more efficient LLM adaptation.
Method: Formulates end-to-end data recipe generation for LLM adaptation. Uses DataChef-32B, which performs online reinforcement learning with a proxy reward that predicts downstream performance for candidate recipes. Takes target benchmark and available data sources as input to output complete data recipes.
Result: Across six held-out tasks, DataChef-32B produces recipes yielding performance comparable to human expert-curated recipes. Notably, it adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing the official post-training checkpoint.
Conclusion: Demonstrates successful automation of LLM data recipe generation, enabling comparable performance to human experts. Opens new directions for automating LLM training and developing self-evolving AI systems.
Abstract: In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the \emph{data recipe}, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate \emph{end-to-end data recipe generation} for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces recipes that yield performance comparable to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing the official post-training checkpoint (Qwen3-1.7B). This work sheds new light on automating LLM training and developing self-evolving AI systems.
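The proxy-reward idea above — scoring candidate data recipes with a cheap predictor of downstream performance instead of full training runs — can be caricatured in a few lines. The recipe fields and the scoring heuristic here are invented for illustration; DataChef's actual proxy is a learned predictor inside an RL loop.

```python
# Toy sketch of proxy-reward recipe selection. A real proxy would be a
# trained model mapping a recipe description to predicted benchmark score;
# here we hand-write a stand-in with diminishing returns on synthesis.
def proxy_reward(recipe):
    return recipe["filter_quality"] * 0.6 + min(recipe["synth_ratio"], 0.5) * 0.4

# Candidate recipes a policy might propose for a target benchmark.
candidates = [
    {"filter_quality": 0.3, "synth_ratio": 0.9},
    {"filter_quality": 0.8, "synth_ratio": 0.2},
    {"filter_quality": 0.9, "synth_ratio": 0.5},
]

# The policy is rewarded for recipes the proxy predicts will train well,
# so selection (and, in RL, the policy update) follows the proxy score.
best = max(candidates, key=proxy_reward)
```

The point of the proxy is cost: evaluating `proxy_reward` is instant, whereas the true reward (train a base model on the recipe's output, then benchmark it) would make online RL infeasible.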
[83] IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR
Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari
Main category: cs.CL
TL;DR: IntelliAsk: An LLM trained to generate substantive, evidence-based peer review questions using expert annotations and reward modeling, outperforming baselines in human evaluation and reasoning benchmarks.
Details
Motivation: Current LLMs generate surface-level peer review questions that perform worse than human reviewer questions. There's a need for models that can generate substantive, evidence-based critique questions that align with expert standards.
Method: Curated high-quality dataset from OpenReview, conducted human preference study with expert annotations across effort, evidence, and grounding dimensions. Trained IntelliReward (reward model from frozen autoregressive LLM with trainable multi-head transformers). Applied Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward to train IntelliAsk question-generation model.
Result: IntelliAsk generates more grounded, substantive and effortful questions than strong baselines, reduces reliance on first-page content. Improves MuSR (68.3 vs 64.7 Acc) and WritingBench (8.31 vs 8.07) compared to Qwen3-32B. Human evaluations confirm superiority over baselines.
Conclusion: IntelliAsk successfully generates high-quality peer review questions aligned with human standards, and reviewer-question quality correlates with broader reasoning and writing capabilities. The released resources support automatic evaluation of grounding, effort, and evidence in LLM-generated review questions.
Abstract: Peer review relies on substantive, evidence-based questions, yet current LLMs generate surface-level queries that perform worse than human reviewer questions in expert evaluation. To address this gap, we curate a high-quality dataset of reviewer questions from OpenReview and conduct a human preference study where expert annotators evaluate question-paper pairs across three dimensions: effort, evidence, and grounding. From these annotations, we train IntelliReward, a reward model built from a frozen autoregressive LLM with trainable multi-head transformers. Validated against expert judgments, IntelliReward predicts reviewer-question quality better than API-based SFT baselines and provides scalable evaluation. We apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward to train IntelliAsk, a question-generation model aligned with human standards of effortful, evidence-based critique. Human evaluations show IntelliAsk generates more grounded, substantive and effortful questions than strong baselines and reduces reliance on first-page content. We also find improvements on reasoning and writing benchmarks, suggesting reviewer-question quality correlates with broader capabilities. Compared to Qwen3-32B, IntelliAsk improves MuSR (68.3 vs 64.7 Acc) and WritingBench (8.31 vs 8.07). We release our code, filtered review dataset, expert annotations, IntelliAsk and IntelliReward to support automatic evaluation of grounding, effort, and evidence in LLM-generated review questions.
[84] Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference
Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Norman Meuschke, Bela Gipp
Main category: cs.CL
TL;DR: This paper proposes a revised annotation scheme for cross-document coreference resolution (CDCR) that treats coreference chains as discourse elements, accommodating both identity and near-identity relations to better capture lexical diversity in polarized news coverage.
Details
Motivation: Existing CDCR datasets focus primarily on event resolution with narrow coreference definitions, limiting effectiveness for analyzing diverse and polarized news coverage where wording varies widely. There's a need for annotation schemes that can capture lexical diversity and framing variations in media discourse.
Method: Proposes a revised CDCR annotation scheme treating coreference chains as discourse elements (DEs) and conceptual units of analysis. Accommodates both identity and near-identity relations. Reannotates the NewsWCL50 dataset and a subset of ECB+ using a unified codebook.
Result: The reannotated datasets align closely, falling between the original ECB+ and NewsWCL50 in characteristics, as evaluated via lexical diversity metrics and a same-head-lemma baseline. The results support balanced and discourse-aware CDCR research in the news domain.
Conclusion: The revised annotation scheme enables better capture of lexical diversity and framing variation in media discourse while maintaining fine-grained annotation of discourse elements, supporting more effective cross-document coreference resolution for news analysis.
Abstract: Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking “the caravan” - “asylum seekers” - “those contemplating illegal entry”, allowing models to capture lexical diversity and framing variation in media discourse, while maintaining the fine-grained annotation of DEs. We reannotate the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
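A coreference chain treated as a discourse element, grouping identity and near-identity mentions across documents (the abstract's "the caravan" / "asylum seekers" example), can be represented with a small data structure. The class and field names below are illustrative, not the NewsWCL50 codebook's schema.

```python
from collections import defaultdict

class DiscourseElement:
    """One conceptual unit of analysis: a cross-document chain of
    lexically diverse mentions referring to the same discourse entity."""

    def __init__(self, concept):
        self.concept = concept
        self.mentions = defaultdict(list)  # doc_id -> [(surface form, relation)]

    def add_mention(self, doc_id, surface, relation="identity"):
        self.mentions[doc_id].append((surface, relation))

    def lexical_diversity(self):
        # Share of distinct surface forms among all mentions of this DE;
        # high values indicate the wording variation the paper targets.
        forms = [s for ms in self.mentions.values() for s, _ in ms]
        return len(set(forms)) / len(forms)

de = DiscourseElement("migrant caravan")
de.add_mention("doc1", "the caravan")
de.add_mention("doc2", "asylum seekers", relation="near-identity")
de.add_mention("doc3", "those contemplating illegal entry", relation="near-identity")
```

A narrow identity-only scheme would split these three mentions into separate chains; treating the DE as the unit of analysis is what lets the annotation capture framing variation.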
[85] Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Jayadev Billa
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.23136 was rate-limited (HTTP 429).
[86] CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning
Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, Ji-Rong Wen, Rui Yan
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.24142 was rate-limited (HTTP 429).
[87] Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory
Sanyam Singh, Naga Ganesh, Vineet Singh, Lakshmi Pedapudi, Ritesh Kumar, SSP Jyothi, Archana Karanam, Waseem Pasha, Ekta Kumari, C. Yashoda, Mettu Vijaya Rekha Reddy, Shesha Phani Debbesa, Chandan Dash
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.03294 was rate-limited (HTTP 429).
[88] Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar
Main category: cs.CL
Summary unavailable: the arXiv API request for 2506.15751 was rate-limited (HTTP 429).
[89] Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
Ashwath Vaithinathan Aravindan, Mayank Kejriwal
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.03332 was rate-limited (HTTP 429).
[90] Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries
Natalie Perez, Sreyoshi Bhaduri, Aman Chadha
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.04413 was rate-limited (HTTP 429).
[91] Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning
Yuval Kansal, Niraj K. Jha
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.15160 was rate-limited (HTTP 429).
[92] Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research
Arina Kostina, Marios Dikaiakos, Alejandro Porcel, Tassos Stassopoulos
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.04897 was rate-limited (HTTP 429).
[93] ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.04992 was rate-limited (HTTP 429).
[94] Oral to Web: Digitizing ‘Zero Resource’ Languages of Bangladesh
Mohammad Mamun Or Rashid
Main category: cs.CL
Summary unavailable: the arXiv API request for 2603.05272 was rate-limited (HTTP 429).
[95] Sentiment Analysis of Citations in Scientific Articles Using ChatGPT: Identifying Potential Biases and Conflicts of Interest
Walid Hariri
Main category: cs.CL
Summary unavailable: the arXiv API request for 2404.01800 was rate-limited (HTTP 429).
[96] Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs
Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki
Main category: cs.CL
Summary unavailable: the arXiv API request for 2501.06986 was rate-limited (HTTP 429).
[97] The Consensus Trap: Dissecting Subjectivity and the “Ground Truth” Illusion in Data Annotation
Sheza Munir, Benjamin Mah, Krisha Kalsi, Shivani Kapania, Julian Posada, Edith Law, Ding Wang, Syed Ishtiaque Ahmed
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.11318 was rate-limited (HTTP 429).
[98] Maximizing Asynchronicity in Event-based Neural Networks
Haiqing Hao, Nikola Zubić, Weihua He, Zhipeng Sui, Davide Scaramuzza, Wenhui Wang
Main category: cs.CL
Summary unavailable: the arXiv API request for 2505.11165 was rate-limited (HTTP 429).
[99] AdAEM: An Adaptively and Automated Extensible Measurement of LLMs’ Value Difference
Jing Yao, Shitong Duan, Xiaoyuan Yi, Dongkuan Xu, Peng Zhang, Tun Lu, Ning Gu, Zhicheng Dou, Xing Xie
Main category: cs.CL
Summary unavailable: the arXiv API request for 2505.13531 was rate-limited (HTTP 429).
[100] VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models
Yufei Xue, Yushi Huang, Jiawei Shao, Lunjie Zhu, Chi Zhang, Xuelong Li, Jun Zhang
Main category: cs.CL
Summary unavailable: the arXiv API request for 2508.03351 was rate-limited (HTTP 429).
[101] Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs
Paloma García-de-Herreros, Philipp Slusallek, Dietrich Klakow, Vagrant Gautam
Main category: cs.CL
Summary unavailable: the arXiv API request for 2510.05278 was rate-limited (HTTP 429).
[102] Just-In-Time Objectives: A General Approach for Specialized AI Interactions
Michelle S. Lam, Omar Shaikh, Hallie Xu, Alice Guo, Diyi Yang, Jeffrey Heer, James A. Landay, Michael S. Bernstein
Main category: cs.CL
Summary unavailable: the arXiv API request for 2510.14591 was rate-limited (HTTP 429).
[103] Co-Layout: LLM-driven Co-optimization for Interior Layout
Chucheng Xiang, Ruchao Bao, Biyin Feng, Wenzheng Wu, Zhongyuan Liu, Yirui Guan, Ligang Liu
Main category: cs.CL
Summary unavailable: the arXiv API request for 2511.12474 was rate-limited (HTTP 429).
[104] Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu
Main category: cs.CL
Summary unavailable: the arXiv API request for 2512.16310 was rate-limited (HTTP 429).
[105] Classroom AI: Large Language Models as Grade-Specific Teachers
Jio Oh, Steven Euijong Whang, James Evans, Jindong Wang
Main category: cs.CL
Summary unavailable: the arXiv API request for 2601.06225 was rate-limited (HTTP 429).
[106] Towards Autonomous Mathematics Research
Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.10177 was rate-limited (HTTP 429).
[107] Both Ends Count! Just How Good are LLM Agents at “Text-to-Big SQL”?
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.21480 was rate-limited (HTTP 429).
cs.CV
[108] Edges Are All You Need: Robust Gait Recognition via Label-Free Structure
Chao Zhang, Zhuang Zheng, Ruixin Li, Zhanyong Mei
Main category: cs.CV
TL;DR: SKETCHGAIT introduces sketch-based gait recognition using dense structural cues from RGB images without semantic labels, complementing existing silhouette and parsing methods.
Details
Motivation: Existing gait recognition relies on silhouette- and parsing-based representations with limitations: silhouettes are sparse and miss internal details, while parsing depends heavily on upstream human parsers, leading to unstable performance across datasets. There's a need for better structural representations without semantic label dependencies.
Method: Proposes SKETCH as a new visual modality that extracts high-frequency structural cues (limb articulations, self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. Introduces the SKETCHGAIT framework with two independent streams for modality-specific learning (parsing and sketch) and a lightweight early-stage fusion branch to capture structural complementarity.
Result: Achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG datasets, demonstrating superior performance over existing methods.
Conclusion: Sketch provides a valuable complementary modality to parsing for gait recognition, offering dense structural information without semantic label dependencies, leading to improved recognition performance through hierarchical disentangled multi-modal learning.
Abstract: Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce SKETCH as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SKETCHGAIT, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.
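The "label-free, edge-based" extraction the paper describes amounts to running a classical edge detector on each RGB frame. As a minimal stand-in (not the paper's actual extractor), per-pixel Sobel gradient magnitude already yields the kind of dense, high-frequency structural map that needs no semantic supervision:

```python
import numpy as np

def sketch_from_frame(gray):
    """Gradient-magnitude 'sketch' of a grayscale frame via 3x3 Sobel
    filters: strong response at contours (e.g. limb boundaries),
    zero response in flat regions. No labels or parser required."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    H, W = gray.shape
    out = np.zeros((H, W))
    pad = np.pad(gray.astype(float), 1, mode="edge")
    for i in range(H):
        for j in range(W):
            win = pad[i:i + 3, j:j + 3]
            gx = (win * kx).sum()          # horizontal gradient
            gy = (win * ky).sum()          # vertical gradient
            out[i, j] = np.hypot(gx, gy)   # edge strength
    return out / out.max() if out.max() > 0 else out

# A vertical step edge down the middle of a tiny synthetic frame.
frame = np.zeros((8, 8))
frame[:, 4:] = 255
edges = sketch_from_frame(frame)
# Response concentrates at the step (columns 3-4) and vanishes elsewhere.
```

In practice one would use a vectorized or learned edge detector, but the property that matters for the paper's argument — dense structure with zero label dependence — holds for any such operator.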
[109] Thinking with Spatial Code for Physical-World Video Reasoning
Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, Alan Yuille
Main category: cs.CV
TL;DR: A framework that converts RGB video into explicit 3D spatial representations (oriented bounding boxes + semantic labels) to enable LLMs to perform physical-world visual question answering through geometric reasoning.
Details
Motivation: To enable large language models to reason about physical-world scenarios by providing them with explicit, temporally coherent 3D spatial representations from video, rather than relying on implicit visual understanding.
Method: Proposes a spatial encoder that unifies 6D object parsing and tracking with geometric prediction to transform RGB video into structured spatial code. Fine-tunes LLMs with reinforcement learning using a spatial rubric reward for perspective-aware, geometrically grounded inference.
Result: Outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art for physical-world visual question answering.
Conclusion: Thinking with Spatial Code demonstrates that explicit 3D spatial representations enable LLMs to perform sophisticated physical-world reasoning, achieving superior performance on visual question answering benchmarks.
Abstract: We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further fine-tune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is available at https://github.com/Beckschen/spatialcode.
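The "spatial code" idea, parsing a scene into oriented boxes plus labels that an LLM can treat as explicit spatial variables, can be illustrated with a toy serializer. The exact code schema is not given in the summary, so this format is an assumption:

```python
import json

def serialize_spatial_code(objects):
    """Turn per-object 3D oriented bounding boxes into a textual 'spatial code'.

    objects: list of dicts with keys 'label', 'center' (x, y, z in meters),
    'size' (w, h, d), and 'yaw' (radians). Returns a compact string an LLM
    can consume as explicit spatial variables.
    """
    lines = []
    for i, obj in enumerate(objects):
        lines.append(
            f"obj{i}: label={obj['label']} "
            f"center={json.dumps([round(v, 2) for v in obj['center']])} "
            f"size={json.dumps([round(v, 2) for v in obj['size']])} "
            f"yaw={round(obj['yaw'], 3)}"
        )
    return "\n".join(lines)

def pairwise_distance(objects, i, j):
    """Example of geometric reasoning over the code: center-to-center distance."""
    a, b = objects[i]["center"], objects[j]["center"]
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

Once objects are in this explicit form, questions like "how far is the chair from the table?" reduce to arithmetic over the serialized variables rather than implicit visual inference.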
[110] From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications
Xusheng Luo, Changliu Liu
Main category: cs.CV
TL;DR: First coupled robustness verification framework for heatmap-based keypoint detectors that bounds joint deviation across all keypoints using MILP formulation
Details
Motivation: Keypoint detection is crucial for vision tasks, but modern neural models are vulnerable to small perturbations. Formal robustness verification for keypoint detectors is largely unexplored due to high-dimensional inputs and continuous coordinate outputs. Prior decoupled approaches verify each keypoint independently, yielding conservative guarantees that don't capture interdependencies.
Method: Proposes a coupled robustness verification framework using a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Formulates verification as a falsification problem in which infeasibility certifies robustness and feasibility provides counterexamples. The method is sound: if it certifies the model as robust, the keypoint detection model is guaranteed to be robust.
Result: Coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail. Captures interdependencies between keypoints and downstream task requirements better than independent verification methods.
Conclusion: First framework for coupled robustness verification of keypoint detectors that provides formal guarantees by bounding joint deviation across all keypoints, addressing limitations of prior decoupled approaches.
Abstract: Keypoint detection underpins many vision tasks, including pose estimation, viewpoint recovery, and 3D reconstruction, yet modern neural models remain vulnerable to small input perturbations. Despite its importance, formal robustness verification for keypoint detectors is largely unexplored due to high-dimensional inputs and continuous coordinate outputs. We propose the first coupled robustness verification framework for heatmap-based keypoint detectors that bounds the joint deviation across all keypoints, capturing their interdependencies and downstream task requirements. Unlike prior decoupled, classification-style approaches that verify each keypoint independently and yield conservative guarantees, our method verifies collective behavior. We formulate verification as a falsification problem using a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Infeasibility certifies robustness, while feasibility provides counterexamples, and we prove the method is sound: if it certifies the model as robust, then the keypoint detection model is guaranteed to be robust. Experiments show that our coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail.
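The paper's MILP couples reachable heatmap sets with a joint-deviation polytope; as a much-simplified stand-in (interval bounds on keypoint coordinates replace the MILP, which is an assumption-laden simplification), the following sketch still illustrates the coupled idea of bounding *joint* deviation rather than checking each keypoint alone:

```python
import numpy as np

def verify_joint_deviation(lo, hi, nominal, eps_joint):
    """Toy coupled check: certify that the total (L1) keypoint deviation
    stays below eps_joint for every point in the box [lo, hi].

    lo, hi, nominal: (K, 2) arrays of per-keypoint coordinate bounds and
    nominal locations. Returns (certified, counterexample).
    """
    # worst-case per-coordinate deviation is attained at a box corner
    worst = np.maximum(np.abs(lo - nominal), np.abs(hi - nominal))
    joint_dev = worst.sum()  # joint (summed) deviation across all keypoints
    if joint_dev <= eps_joint:
        return True, None
    # feasible falsification: the corner realizing the worst deviation
    counterexample = np.where(
        np.abs(lo - nominal) >= np.abs(hi - nominal), lo, hi)
    return False, counterexample
```

A decoupled verifier would instead threshold each keypoint's deviation separately, which is exactly the conservative behavior the paper argues against.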
[111] DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces
Mohammad Sadil Khan, Muhammad Usama, Rolandos Alexandros Potamias, Didier Stricker, Muhammad Zeshan Afzal, Jiankang Deng, Ismail Elezi
Main category: cs.CV
TL;DR: DreamCAD is a multi-modal generative framework that produces editable BReps (boundary representations) from point-level supervision without CAD-specific annotations, using parametric patches and differentiable tessellation to enable large-scale training on 3D datasets.
Details
Motivation: Existing CAD generative methods are limited by small annotated datasets with explicit design histories or BRep labels, while millions of unannotated 3D meshes remain untapped, hindering scalable CAD generation.
Method: DreamCAD represents each BRep as a set of parametric patches (e.g., Bézier surfaces) and uses a differentiable tessellation method to generate meshes, enabling training on large 3D datasets while reconstructing connected, editable surfaces. Also introduces CADCap-1M, the largest CAD captioning dataset, with 1M+ GPT-5-generated descriptions.
Result: Achieves state-of-the-art performance on ABC and Objaverse benchmarks across text, image, and point modalities, improving geometric fidelity and surpassing 75% user preference.
Conclusion: DreamCAD enables scalable CAD generation without CAD-specific annotations by leveraging point-level supervision and parametric representations, with the CADCap-1M dataset advancing text-to-CAD research.
Abstract: Computer-Aided Design (CAD) relies on structured and editable geometric representations, yet existing generative methods are constrained by small annotated datasets with explicit design histories or boundary representation (BRep) labels. Meanwhile, millions of unannotated 3D meshes remain untapped, limiting progress in scalable CAD generation. To address this, we propose DreamCAD, a multi-modal generative framework that directly produces editable BReps from point-level supervision, without CAD-specific annotations. DreamCAD represents each BRep as a set of parametric patches (e.g., Bézier surfaces) and uses a differentiable tessellation method to generate meshes. This enables large-scale training on 3D datasets while reconstructing connected and editable surfaces. Furthermore, we introduce CADCap-1M, the largest CAD captioning dataset to date, with 1M+ descriptions generated using GPT-5 for advancing text-to-CAD research. DreamCAD achieves state-of-the-art performance on ABC and Objaverse benchmarks across text, image, and point modalities, improving geometric fidelity and surpassing 75% user preference. Code and dataset will be publicly available.
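The parametric-patch representation can be sketched with a standard Bernstein-basis Bézier surface evaluator; uniformly sampling (u, v) yields the mesh vertices that a tessellation step decodes. This is a minimal numpy sketch of the textbook construction, not the paper's differentiable implementation:

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * t ** i * (1 - t) ** (n - i)

def tessellate_bezier_patch(ctrl, res=8):
    """Sample a Bezier patch on a res x res parameter grid.

    ctrl: (n+1, m+1, 3) control points. Returns (res, res, 3) mesh vertices.
    """
    n, m = ctrl.shape[0] - 1, ctrl.shape[1] - 1
    ts = np.linspace(0.0, 1.0, res)
    verts = np.zeros((res, res, 3))
    for a, u in enumerate(ts):
        for b, v in enumerate(ts):
            p = np.zeros(3)
            for i in range(n + 1):
                for j in range(m + 1):
                    # tensor-product Bernstein weighting of control points
                    p += bernstein(n, i, u) * bernstein(m, j, v) * ctrl[i, j]
            verts[a, b] = p
    return verts
```

Because the vertices are polynomial in the control points, gradients of a point-cloud loss flow back to the patch parameters, which is what makes tessellation-based training on unannotated meshes possible.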
[112] Make VLM Recognize Visual Hallucination on Cartoon Character Image with Pose Information
Bumsoo Kim, Wonseop Shin, Kyuchul Lee, Yonghoon Jung, Sanghyun Seo
Main category: cs.CV
TL;DR: A novel semantic structural hallucination detection system for non-photorealistic rendering using Vision-Language Models with pose-aware in-context visual learning to improve detection of visual defects in TTI-generated content.
Details
Motivation: Large-scale Text-to-Image models often produce semantic structural visual hallucinations (perceptually severe defects) in non-photorealistic rendering domains like cartoons and pixel art, limiting their real-world applications.
Method: Proposes pose-aware in-context visual learning (PA-ICVL), which enhances Vision-Language Models by incorporating RGB images, pose information, and in-context learning examples to detect hallucinations more accurately.
Result: PA-ICVL significantly improves hallucination detection performance from 50% to 78% for GPT-4v and 57% to 80% for Gemini Pro Vision compared to RGB-only baselines.
Conclusion: The approach advances TTI models toward real-world applications by mitigating visual hallucinations through in-context visual learning, expanding potential in non-photorealistic domains while demonstrating how users can enhance VLM capabilities with additional conditions.
Abstract: Leveraging large-scale Text-to-Image (TTI) models has become a common technique for generating exemplars or training datasets in the fields of image synthesis, video editing, and 3D reconstruction. However, semantic structural visual hallucinations involving perceptually severe defects remain a concern, especially in the domain of non-photorealistic rendering (NPR) such as cartoons and pixelization-style characters. To detect these hallucinations in NPR, we propose a novel semantic structural hallucination detection system using a Vision-Language Model (VLM). Our approach leverages an emerging capability of large language models, in-context learning, in which the VLM is shown a few user-provided examples for a specific downstream task, here hallucination detection. Building on in-context learning, we introduce pose-aware in-context visual learning (PA-ICVL), which improves the overall performance of the VLM by supplying visual data beyond text prompts: RGB images and pose information. By incorporating pose guidance, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. For the two selected VLMs, GPT-4v and Gemini Pro Vision, the proposed PA-ICVL improves hallucination detection from 50% to 78% and from 57% to 80%, respectively. This research advances the capability of TTI models toward real-world applications by mitigating visual hallucinations via in-context visual learning, expanding their potential in non-photorealistic domains. In addition, it showcases how users can boost the downstream-specialized capability of an open VLM by harnessing additional conditions. We collect a synthetic cartoon-hallucination dataset with TTI models; this dataset and the final tuned VLM will be publicly available.
[113] Adversarial Batch Representation Augmentation for Batch Correction in High-Content Cellular Screening
Lei Tong, Xujing Yao, Adam Corrigan, Long Chen, Navin Rathna Kumar, Kerry Hallbrook, Jonathan Orme, Yinhai Wang, Huiyu Zhou
Main category: cs.CV
TL;DR: ABRA is a domain generalization method for cell painting image analysis that mitigates biological batch effects through adversarial representation augmentation with geometric margin constraints.
Details
Motivation: High-content screening generates massive cell painting images with technical variations causing biological batch effects, which degrade deep learning model generalization on unseen data. Existing methods need prior knowledge or struggle with unseen batches.
Method: Frames batch mitigation as domain generalization and proposes Adversarial Batch Representation Augmentation (ABRA), which models batch-wise statistical fluctuations as structured uncertainties, uses min-max optimization to synthesize worst-case bio-batch perturbations with angular geometric margin constraints, and adds distribution alignment to prevent representation collapse.
Result: Extensive evaluations on RxRx1 and RxRx1-WILDS benchmarks show ABRA establishes new state-of-the-art for siRNA perturbation classification.
Conclusion: ABRA effectively addresses biological batch effects in cell painting images through adversarial domain generalization, improving model generalization without requiring additional prior knowledge.
Abstract: High-Content Screening routinely generates massive volumes of cell painting images for phenotypic profiling. However, technical variations across experimental executions inevitably induce biological batch (bio-batch) effects. These cause covariate shifts and degrade the generalization of deep learning models on unseen data. Existing batch correction methods typically rely on additional prior knowledge (e.g., treatment or cell culture information) or struggle to generalize to unseen bio-batches. In this work, we frame bio-batch mitigation as a Domain Generalization (DG) problem and propose Adversarial Batch Representation Augmentation (ABRA). ABRA explicitly models batch-wise statistical fluctuations by parameterizing feature statistics as structured uncertainties. Through a min-max optimization framework, it actively synthesizes worst-case bio-batch perturbations in the representation space, guided by a strict angular geometric margin to preserve fine-grained class discriminability. To prevent representation collapse during this adversarial exploration, we introduce a synergistic distribution alignment objective. Extensive evaluations on the large-scale RxRx1 and RxRx1-WILDS benchmarks demonstrate that ABRA establishes a new state-of-the-art for siRNA perturbation classification.
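ABRA's core idea of treating batch effects as structured uncertainty on feature statistics can be sketched in the spirit of statistics-perturbation augmentations; the min-max optimization, angular margin, and alignment objective are omitted, so this is a loose simplification, not the paper's method:

```python
import numpy as np

def perturb_batch_statistics(feats, scale=0.5, rng=None):
    """Augment features by perturbing per-channel mean/std, mimicking a
    bio-batch shift. feats: (N, C) array. Returns augmented (N, C) features.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-6
    # structured uncertainty: perturbation magnitude tied to observed stats
    new_mu = mu + scale * sigma * rng.standard_normal(mu.shape)
    new_sigma = sigma * np.exp(scale * rng.standard_normal(sigma.shape))
    normalized = (feats - mu) / sigma
    return normalized * new_sigma + new_mu  # re-style with perturbed statistics
```

Training a classifier on such re-styled features pushes it to ignore batch-level statistics, which is the domain-generalization behavior ABRA optimizes for adversarially.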
[114] Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection
Trung Tien Dong, Dev Thakkar, Arman Sargolzaei, Xiaomin Lin
Main category: cs.CV
TL;DR: PFS is a lightweight post-fusion stabilizer module that improves robustness of BEV fusion detectors against domain shifts and sensor failures without architectural changes.
Details
Motivation: BEV fusion detectors degrade under domain shift and sensor failures, limiting real-world reliability. Existing robustness approaches require architectural changes or retraining, making integration into deployed systems difficult.
Method: Proposes the Post Fusion Stabilizer (PFS), a lightweight module operating on intermediate BEV representations that stabilizes feature statistics, suppresses degraded spatial regions, and adaptively restores weakened cues through residual correction as a near-identity transformation.
Result: Achieves SOTA results on nuScenes benchmark: +1.2% mAP for camera dropout robustness, +4.4% mAP for low-light performance, with only 3.3M parameters.
Conclusion: PFS effectively improves robustness of existing BEV fusion detectors against diverse camera and LiDAR corruptions while maintaining performance and lightweight footprint.
Abstract: Camera-LiDAR fusion is widely used in autonomous driving to enable accurate 3D object detection. However, bird's-eye view (BEV) fusion detectors can degrade significantly under domain shift and sensor failures, limiting reliability in real-world deployment. Existing robustness approaches often require modifying the fusion architecture or retraining specialized models, making them difficult to integrate into already deployed systems. We propose a Post Fusion Stabilizer (PFS), a lightweight module that operates on intermediate BEV representations of existing detectors and produces a refined feature map for the original detection head. The design stabilizes feature statistics under domain shift, suppresses spatial regions affected by sensor degradation, and adaptively restores weakened cues through residual correction. Designed as a near-identity transformation, PFS preserves performance while improving robustness under diverse camera and LiDAR corruptions. Evaluations on the nuScenes benchmark demonstrate that PFS achieves state-of-the-art results in several failure modes, notably improving camera dropout robustness by +1.2% and low-light performance by +4.4% mAP while maintaining a lightweight footprint of only 3.3 M parameters.
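The near-identity design, refining the fused BEV map with a small residual correction so behavior on clean inputs is preserved, can be sketched as follows. PFS's actual internals are not given in the summary; the normalization-plus-gated-residual form below is an assumption:

```python
import numpy as np

def post_fusion_stabilize(bev, gate=0.1):
    """Near-identity stabilizer for a BEV feature map.

    bev: (C, H, W) array. Normalizes channel statistics, damps low-energy
    (degraded) spatial regions, and adds the correction back as a small
    gated residual so the original detection head still applies.
    """
    mu = bev.mean(axis=(1, 2), keepdims=True)
    sigma = bev.std(axis=(1, 2), keepdims=True) + 1e-6
    normalized = (bev - mu) / sigma  # stabilize feature statistics
    energy = np.abs(bev).mean(axis=0, keepdims=True)
    mask = energy / (energy.max() + 1e-6)  # suppress degraded regions
    residual = normalized * mask
    # near-identity: output stays close to input for a small gate
    return bev + gate * residual
```

With gate = 0 the transform is exactly the identity, which is why such a module can be dropped into an already-deployed detector without retraining its head.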
[115] TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation
Soumya Mazumdar, Vineet Kumar Rakesh
Main category: cs.CV
TL;DR: TempoSyncDiff is a reference-conditioned latent diffusion framework for efficient audio-driven talking-head generation that uses teacher-student distillation to enable few-step inference while maintaining quality and stability.
Details
Motivation: Current diffusion models for talking-head generation suffer from high inference latency, temporal instability (flicker, identity drift), and imperfect audio-visual alignment, especially under challenging speech conditions. There is a need for practical, efficient solutions that work in constrained computational settings.
Method: Uses a teacher-student distillation framework in which a diffusion teacher trained with standard noise prediction guides a lightweight student denoiser toward few-step inference. Incorporates identity anchoring and temporal regularization to mitigate identity drift and flicker, plus viseme-based audio conditioning for coarse lip motion control.
Result: Experiments on the LRS3 dataset show distilled models retain much of the teacher's reconstruction quality while enabling substantially lower latency inference. Includes CPU-only and edge computing measurements showing feasibility for edge deployment.
Conclusion: Distilled diffusion models can achieve practical talking-head generation under constrained computational settings, representing an initial step toward efficient diffusion-based THG with improved stability and reduced latency.
Abstract: Diffusion models have recently advanced photorealistic human synthesis, although practical talking-head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio-visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The approach adopts a teacher-student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker during synthesis, while viseme-based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising-stage component-level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU-only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion-based talking-head generation under constrained computational settings. GitHub: https://mazumdarsoumya.github.io/TempoSyncDiff
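The training objective plausibly combines three terms named in the summary: teacher matching (distillation), frame-to-frame smoothness (anti-flicker temporal regularization), and identity anchoring (anti-drift). The exact forms and weights are not stated, so this numpy sketch is an assumption:

```python
import numpy as np

def distillation_loss(student, teacher, identity_emb,
                      w_distill=1.0, w_temporal=0.1, w_identity=0.1):
    """Toy student-teacher loss for a clip of talking-head frames.

    student, teacher: (T, D) per-frame outputs; identity_emb: (D,) anchor
    embedding of the reference identity. Combines output matching,
    frame-to-frame smoothness, and identity anchoring.
    """
    distill = np.mean((student - teacher) ** 2)            # match the teacher
    temporal = np.mean((student[1:] - student[:-1]) ** 2)  # penalize flicker
    identity = np.mean((student - identity_emb[None, :]) ** 2)  # limit drift
    return w_distill * distill + w_temporal * temporal + w_identity * identity
```

The temporal term is what distinguishes video distillation from per-frame distillation: it directly penalizes the frame-to-frame jitter that few-step students are prone to.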
[116] Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
Merve Tapli, Quentin Bouniot, Wolfgang Stammer, Zeynep Akata, Emre Akbas
Main category: cs.CV
TL;DR: CBM-Suite addresses four key limitations of Concept Bottleneck Models: lack of concept relevance metrics, linearity problem causing bypass of concept bottleneck, accuracy gap compared to opaque models, and insufficient study of visual backbones/VLMs impact.
Details
Motivation: Concept Bottleneck Models (CBMs) aim to provide interpretable AI by grounding predictions in human-understandable concepts, but they face four fundamental limitations: 1) no metric to pre-evaluate concept relevance, 2) a "linearity problem" causing recent CBMs to bypass the concept bottleneck, 3) an accuracy gap compared to opaque models, and 4) no systematic study of how different visual backbones and vision-language models impact CBMs.
Method: CBM-Suite introduces a methodological framework with four key components: 1) an entropy-based metric to quantify concept-set suitability for a dataset, 2) a non-linear layer between concept activations and the classifier to resolve the linearity problem, 3) a distillation loss guided by a linear teacher probe to narrow the accuracy gap, and 4) comprehensive analyses of how different vision encoders, VLMs, and concept sets interact.
Result: Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability. The framework successfully addresses the identified limitations and enables systematic analysis of factors influencing CBM performance.
Conclusion: CBM-Suite provides a comprehensive framework to overcome fundamental limitations of Concept Bottleneck Models, improving both accuracy and interpretability while enabling systematic analysis of how different components (vision encoders, VLMs, concept sets) interact in concept-based models.
Abstract: Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the “linearity problem” causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework to systematically address these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.
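The entropy-based suitability metric is described only at a high level; one plausible instantiation (an assumption, not the paper's definition) scores how peaked each concept's class-conditional activation mass is, with low entropy meaning the concept discriminates classes well:

```python
import numpy as np

def concept_suitability(activations, labels, n_classes):
    """Toy entropy-based score for a concept set on a dataset.

    activations: (N, K) nonnegative concept activations; labels: (N,) class
    ids. For each concept, form its distribution of activation mass over
    classes and compute entropy; lower mean entropy means the concepts
    separate classes better. Returns mean normalized entropy in [0, 1]
    (lower = more suitable).
    """
    n, k = activations.shape
    mass = np.zeros((n_classes, k))
    for c in range(n_classes):
        mass[c] = activations[labels == c].sum(axis=0)
    # per-concept distribution over classes
    p = mass / (mass.sum(axis=0, keepdims=True) + 1e-12)
    ent = -(p * np.log(p + 1e-12)).sum(axis=0)
    return float(ent.mean() / np.log(n_classes))  # normalize by max entropy
```

A concept set whose activations are uninformative about the label scores near 1, flagging it as unsuitable before any CBM is trained.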
[117] Making Reconstruction FID Predictive of Diffusion Generation FID
Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, Yan Wang
Main category: cs.CV
TL;DR: iFID is a new metric that correlates strongly with diffusion model generation quality by interpolating latent neighbors instead of direct reconstruction.
Details
Motivation: Current reconstruction FID (rFID) metrics for VAEs don't correlate well with the generation FID (gFID) of latent diffusion models, making it hard to evaluate diffusion model quality without expensive sampling.
Method: Proposes interpolated FID (iFID): for each dataset element, find its nearest neighbor in latent space, interpolate their latent representations, decode the interpolated latent, and compute the FID between the decoded samples and the original dataset.
Result: iFID achieves strong correlation with diffusion gFID (~0.85 Pearson/Spearman correlations), outperforming rFID. Also shows rFID correlates with diffusion refinement phase quality while iFID correlates with navigation phase quality.
Conclusion: iFID is the first metric demonstrating strong correlation with diffusion gFID, providing a practical way to evaluate diffusion model quality without expensive sampling, with connections to diffusion generalization/hallucination theory.
Abstract: It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each element in the dataset, we retrieve its nearest neighbor (NN) in the latent space and interpolate their latent representations. We then decode the interpolated latent and compute the FID between the decoded samples and the original dataset. Additionally, we refine the claim that rFID correlates poorly with gFID, by showing that rFID correlates with sample quality in the diffusion refinement phase, whereas iFID correlates with sample quality in the diffusion navigation phase. Furthermore, we provide an explanation for why iFID correlates well with gFID, and why reconstruction metrics are negatively correlated with gFID, by connecting to results in the diffusion generalization and hallucination. Empirically, iFID is the first metric to demonstrate a strong correlation with diffusion gFID, achieving Pearson linear and Spearman rank correlations approximately 0.85. The source code is provided in https://github.com/tongdaxu/Making-rFID-Predictive-of-Diffusion-gFID.
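The iFID recipe (nearest-neighbor retrieval, latent interpolation, decode, FID) can be sketched end to end. To stay self-contained, the sketch uses a diagonal-covariance Fréchet distance and a caller-supplied decoder stub, both simplifications relative to the paper's Inception-feature FID:

```python
import numpy as np

def frechet_distance_diag(x, y):
    """Frechet distance with diagonal covariances (a simplification of FID):
    ||mu1 - mu2||^2 + sum(v1 + v2 - 2*sqrt(v1*v2))."""
    mu1, mu2 = x.mean(0), y.mean(0)
    v1, v2 = x.var(0), y.var(0)
    return float(((mu1 - mu2) ** 2).sum()
                 + (v1 + v2 - 2 * np.sqrt(v1 * v2)).sum())

def interpolated_fid(latents, decode, alpha=0.5):
    """iFID sketch: pair each latent with its nearest neighbor, interpolate,
    decode, and compare decoded samples against decodes of the originals."""
    d = ((latents[:, None] - latents[None]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)          # exclude self-matches
    nn = d.argmin(axis=1)                # nearest neighbor of each latent
    mixed = alpha * latents + (1 - alpha) * latents[nn]
    return frechet_distance_diag(decode(latents), decode(mixed))
```

The key difference from rFID is the `mixed` batch: instead of asking how well the VAE reconstructs its own inputs, iFID asks how well the decoder handles latents *between* training points, which is closer to what a diffusion sampler actually feeds it.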
[118] When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
Wisdom Ikezogwo, Mehmet Saygin Seyfioglu, Ranjay Krishna, Karim Bouyarmane
Main category: cs.CV
TL;DR: IEC (Implicit Error Counting) is a new RL reward method that counts errors instead of checking correctness against rubrics, addressing tasks with multiple valid outputs and no single ideal answer.
Details
Motivation: Current RL methods like RLVR and RaR rely on ideal reference answers to generate evaluation rubrics, but many real-world tasks have multiple valid outputs and lack a single ideal answer, creating a gap in post-training methods.
Method: IEC enumerates what responses get wrong rather than what they get right, applying severity-weighted scores across task-relevant axes and converting them into calibrated per-aspect rewards. Uses implicit score emission and group calibration to reduce noise.
Result: IEC outperforms RaR on MDressBench (CEC: 5.31 vs. 5.60 on flat references; 5.20 vs. 5.53 on non-flat). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics.
Conclusion: When ideal answers are unavailable, counting errors provides a stronger signal than constructing rubrics, making IEC effective for tasks with multiple valid outputs.
Abstract: Reinforcement learning with verifiable rewards (RLVR) and Rubrics as Rewards (RaR) have driven strong gains in domains with clear correctness signals and even in subjective domains by synthesizing evaluation criteria from ideal reference answers. But many real-world tasks admit multiple valid outputs and lack the single ideal answer that rubric generation depends on. We identify this reference-free setting as a gap in current post-training methods and propose Implicit Error Counting (IEC) to fill it. Instead of checking what a response gets right against a rubric, IEC enumerates what it gets wrong, applying severity-weighted scores across task-relevant axes and converting them into calibrated per-aspect rewards. We show that naïve explicit enumeration is too noisy for stable optimization, and that two design choices, implicit score emission and group calibration, are necessary to make error counting a reliable reward. As a case study, we validate IEC on virtual try-on (VTO), a domain that is simultaneously too constrained for holistic scoring and too permissive for rubric-based evaluation: subtle garment errors are unacceptable, yet many output variations are correct. We introduce Cascaded Error Counting (CEC) as an evaluation metric, which tracks human preferences well (60% top-1 vs. 30% others), and curate Mismatch-DressCode (MDressBench), a benchmark with maximal attribute mismatch to stress-test reward designs. On MDressBench, IEC outperforms RaR across all metrics (CEC: 5.31 vs. 5.60 on flat references; 5.20 vs. 5.53 on non-flat). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics. These results suggest that when ideal answers are unavailable, counting errors provides a stronger signal than constructing rubrics.
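Severity-weighted error counting with group calibration can be sketched as follows; the axis layout and severity weights are invented placeholders, and the implicit score emission step (the model emitting its own counts) is not modeled:

```python
import numpy as np

def iec_rewards(error_counts, severities):
    """Implicit-Error-Counting-style rewards for a group of candidates.

    error_counts: (G, A) error counts per response per aspect axis;
    severities: (A,) severity weights. Raw score = negative weighted error
    count; group calibration then standardizes scores within the group so
    the reward scale is comparable across prompts.
    """
    raw = -(error_counts * severities[None, :]).sum(axis=1)
    mu, sigma = raw.mean(), raw.std()
    return (raw - mu) / (sigma + 1e-8)  # group-calibrated rewards
```

Because rewards are standardized within each group of sampled responses, a prompt where every candidate has many errors still produces a useful preference signal between them, rather than uniformly low rewards.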
[119] Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding
Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, Yu Guan
Main category: cs.CV
TL;DR: SemVID is a training-free visual token pruning framework for Video Temporal Grounding that preserves query-critical evidence and cross-frame connectivity to maintain accuracy while significantly reducing computational cost.
Details
Motivation: Video Temporal Grounding requires processing long untrimmed videos, making video-language-model pipelines computationally expensive. Existing visual token pruning methods designed for video QA degrade VTG performance because VTG depends on boundary-sensitive evidence and cross-frame reasoning chains.
Method: Proposes two VTG-specific pruning principles: Evidence Retention (keeps query-critical patches around event boundaries) and Connectivity Strength (preserves token-level cross-frame connectivity). SemVID allocates per-frame token budgets by balancing query relevance and inter-frame variation, then selects three token types: object tokens for query-critical evidence, motion tokens for transitions and cross-frame relays, and context tokens for scene continuity.
Result: Achieves strong accuracy-efficiency trade-off: retains up to 95.4% mIoU with only 12.5% visual tokens, delivers up to 5.8x prefill speedup, consistently outperforms prior methods under same budgets on VTG benchmarks.
Conclusion: SemVID effectively addresses VTG-specific pruning challenges by preserving boundary-sensitive evidence and cross-frame connectivity, enabling efficient video-language processing without compromising temporal grounding accuracy.
Abstract: Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, making video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.
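SemVID's budget-allocation step, splitting a global token budget across frames by balancing query relevance against inter-frame variation, can be sketched as below. The mixing rule, the per-frame floor, and the leftover distribution are assumptions; the sketch also assumes total_budget is at least min_per_frame per frame:

```python
import numpy as np

def allocate_token_budgets(relevance, variation, total_budget,
                           min_per_frame=1, beta=0.5):
    """Split total_budget visual tokens over T frames.

    relevance, variation: (T,) nonnegative per-frame scores. Each frame's
    share mixes the two signals; a floor prevents over-pruned segments.
    Returns integer budgets summing to total_budget.
    """
    score = beta * relevance / (relevance.sum() + 1e-8) \
        + (1 - beta) * variation / (variation.sum() + 1e-8)
    budgets = np.maximum(min_per_frame,
                         np.floor(score * total_budget)).astype(int)
    # distribute any leftover tokens to the highest-scoring frames
    leftover = total_budget - budgets.sum()
    order = np.argsort(-score)
    i = 0
    while leftover > 0:
        budgets[order[i % len(order)]] += 1
        leftover -= 1
        i += 1
    return budgets
```

Balancing the two signals is the point: relevance alone would starve transition frames that carry boundary evidence, while variation alone would over-allocate to busy but query-irrelevant segments.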
[120] Technical Report: Automated Optical Inspection of Surgical Instruments
Zunaira Shafqat, Atif Aftab Ahmed Jilani, Qurrat Ul Ain
Main category: cs.CV
TL;DR: This paper focuses on automated defect detection in surgical instruments using deep learning models (YOLOv8, ResNet-152, EfficientNet-b4) to identify manufacturing defects like cracks and rust through analysis of 4,414 high-resolution images from Pakistani manufacturers.
Details
Motivation: The motivation is to address critical quality assurance in surgical instrument manufacturing, where even minor defects can have fatal consequences in healthcare settings. The paper aims to prevent financial losses for manufacturers and, more importantly, safeguard patient lives by improving defect detection standards.
Method: The method involves developing an automated defect detection system using deep learning architectures including YOLOv8 for object detection, and ResNet-152 and EfficientNet-b4 for classification. The approach uses Automated Optical Inspection (AOI) tools and analyzes a newly created dataset of 4,414 high-resolution images of surgical instruments from Pakistani manufacturers.
Result: The paper presents a comprehensive analysis of surgical instrument defects using the developed deep learning models, though specific performance metrics are not provided in the abstract. The collaboration with industry leaders provides practical insights into defect patterns in Pakistani-made instruments.
Conclusion: The research contributes to enhancing surgical instrument quality standards through automated defect detection, serving as a resource for manufacturers, healthcare professionals, and regulatory bodies to ensure more reliable healthcare environments through industry expertise and cutting-edge technology.
Abstract: In the dynamic landscape of modern healthcare, maintaining the highest standards in surgical instruments is critical for clinical success. This report explores the diverse realm of surgical instruments and their associated manufacturing defects, emphasizing their pivotal role in ensuring the safety of surgical procedures. With potentially fatal consequences arising from even minor defects, precision in manufacturing is paramount. The report addresses the identification and rectification of critical defects such as cracks, rust, and structural irregularities. Such scrutiny prevents substantial financial losses for manufacturers and, more crucially, safeguards patient lives. The collaboration with industry leaders Daddy D Pro and Dr. Frigz International, renowned trailblazers in the Sialkot surgical cluster, provides invaluable insights into the analysis of defects in Pakistani-made instruments. This partnership signifies a commitment to advancing automated defect detection methodologies, specifically through the integration of deep learning architectures including YOLOv8, ResNet-152, and EfficientNet-b4, thereby elevating quality standards in the manufacturing process. The scope of this report is to identify various surgical instruments manufactured in Pakistan and analyze their associated defects using a newly developed dataset of 4,414 high-resolution images. By focusing on quality assurance through Automated Optical Inspection (AOI) tools, this document serves as a resource for manufacturers, healthcare professionals, and regulatory bodies. The insights gained contribute to the enhancement of instrument standards, ensuring a more reliable healthcare environment through industry expertise and cutting-edge technology.
[121] OWL: A Novel Approach to Machine Perception During Motion
Daniel Raviv, Juan D. Yepes
Main category: cs.CV
TL;DR: OWL is a perception function for 3D perception during motion that uses two visual motion cues (local visual looming and perceived rotation) to enable scaled 3D mapping and camera heading without requiring prior knowledge of environments or motion.
Details
Motivation: The paper aims to address complex challenges in 3D perception during motion by developing a unified analytical approach that can derive 3D information directly from visual motion cues without requiring prior knowledge of environments, moving objects, or camera motion.
Method: OWL derives values from two fundamental visual motion cues: 1) perceived local visual looming of points near fixation, and 2) perceived rotation of rigid objects relative to fixation. It uses minimalistic, pixel-based parallel computations on raw visual motion image sequences to obtain 3D information without explicit measurement of physical quantities.
Result: Simulations demonstrate that OWL achieves geometric constancy of 3D objects over time and enables scaled 3D scene reconstruction from visual motion cues alone. The approach works without prior knowledge of stationary environments, moving objects, or camera motion.
Conclusion: OWL bridges theoretical concepts and practical applications in robotics and autonomous navigation, offering an alternative real-time representation for 3D points in relative motion. It may serve as a building block for next-generation autonomous systems and has implications for understanding natural perception and neural functionality.
Abstract: We introduce a perception-related function, OWL, designed to address the complex challenges of 3D perception during motion. It derives its values directly from two fundamental visual motion cues, with one set of cue values per point per time instant. During motion, two visual motion cues relative to a fixation point emerge: 1) perceived local visual looming of points near the fixation point, and 2) perceived rotation of the rigid object relative to the fixation point. It also expresses the relation between two well-known physical quantities, the relative instantaneous directional range and directional translation in 3D between the camera and any visible 3D point, without explicitly requiring their measurement or prior knowledge of their individual values. OWL offers a unified, analytical time-based approach that enhances and simplifies key perception capabilities, including scaled 3D mapping and camera heading. Simulations demonstrate that OWL achieves geometric constancy of 3D objects over time and enables scaled 3D scene reconstruction from visual motion cues alone. By leveraging direct measurements from raw visual motion image sequences, OWL values can be obtained without prior knowledge of stationary environments, moving objects, or camera motion. This approach employs minimalistic, pixel-based, parallel computations, providing an alternative real-time representation for 3D points in relative motion. OWL bridges the gap between theoretical concepts and practical applications in robotics and autonomous navigation and may unlock new possibilities for real-time decision-making and interaction, potentially serving as a building block for next-generation autonomous systems. This paper offers an alternative perspective on machine perception, with implications that may extend to natural perception and contribute to a better understanding of behavioral psychology and neural functionality.
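For intuition about the first cue: "visual looming" is commonly defined in earlier work by the same group as the negative relative rate of change of range to a point, -R'/R. The abstract does not spell out OWL's exact formulation, so the sketch below is a generic illustration under that assumption:

```python
# Toy illustration of visual looming as -Rdot/R, a common definition in prior
# work on this cue. Whether OWL uses exactly this form is not stated in the
# abstract; this is for intuition only.

def looming(range_now, range_prev, dt):
    """Approximate -Rdot/R from two range samples; positive when approaching."""
    rdot = (range_now - range_prev) / dt
    return -rdot / range_now

approaching = looming(9.0, 10.0, dt=1.0)   # range shrinking -> positive looming
receding = looming(11.0, 10.0, dt=1.0)     # range growing -> negative looming
```

A positive value signals an approaching surface, which is why the cue is useful for collision-relevant perception without metric range measurement.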
[122] NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving
Kai Luo, Xu Wang, Rui Fan, Kailun Yang
Main category: cs.CV
TL;DR: NOVA introduces an open-vocabulary 3D multi-object tracking approach using autoregressive LLMs to model trajectories as spatio-temporal semantic sequences, enabling generalization to novel object categories.
Details
Motivation: Current 3D MOT methods are limited by closed-set assumptions and semantic-blind heuristics, failing to generalize to unknown targets in open-world scenarios. There's a need for approaches that can handle novel object categories and leverage semantic understanding.
Method: NOVA reformulates 3D tracking as generative spatio-temporal semantic modeling by encoding trajectories as structured sequences that combine physical motion continuity with linguistic priors. It uses LLMs’ autoregressive capabilities for next-step sequence completion, leveraging language space hierarchy to resolve semantic ambiguities and maintain identity consistency.
Result: On nuScenes dataset, NOVA achieves 22.41% AMOTA for Novel categories, a 20.21% absolute improvement over baseline. It also shows strong performance on V2X-Seq-SPD and KITTI datasets using a compact 0.5B parameter model.
Conclusion: NOVA successfully bridges the gap between traditional 3D MOT and open-world perception by integrating semantic understanding through LLMs, enabling generalization to novel object categories while maintaining tracking performance.
Abstract: Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and “semantic-blind” heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at https://github.com/xifen523/NOVA.
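One way to picture "trajectories as structured spatio-temporal semantic sequences" is a simple serialization that an autoregressive model then completes. The token format below is invented for illustration; the paper's actual tokenization may differ:

```python
# Illustrative only: serialize one 3D track as a structured sequence, in the
# spirit of NOVA's reformulation of tracking as next-step sequence completion.
# The token format is made up here, not taken from the paper.

def track_to_sequence(category, states):
    """states: list of (t, x, y, z) observations for one tracked object."""
    steps = [f"<t{t}> x={x:.1f} y={y:.1f} z={z:.1f}" for t, x, y, z in states]
    return f"<obj:{category}> " + " ".join(steps) + " <next>"

seq = track_to_sequence("scooter", [(0, 1.0, 2.0, 0.0), (1, 1.5, 2.1, 0.0)])
# An autoregressive LM would then be asked to complete the "<next>" slot,
# jointly predicting motion and (open-vocabulary) semantics.
```

Because the category lives in language space, a novel class like "scooter" needs no dedicated detection head, which is the open-vocabulary angle.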
[123] MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents
Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu, Lei Sun, Luc Van Gool, Danda Pani Paudel, Chun-Mei Feng
Main category: cs.CV
TL;DR: MultiHaystack benchmark evaluates multimodal retrieval and reasoning on large-scale heterogeneous corpora, revealing significant performance gaps when models must retrieve evidence vs. being given evidence.
Details
Motivation: Existing MLLM benchmarks focus on single-modality understanding and simplify retrieval to small candidate sets, failing to assess real-world requirements of retrieving evidence from large, heterogeneous multimodal corpora before reasoning.
Method: Introduces MultiHaystack benchmark with over 46,000 multimodal retrieval candidates (documents, images, videos) and 747 open yet verifiable questions, each requiring evidence localization across modalities and fine-grained reasoning.
Result: Models perform competitively when given evidence but drop sharply when required to retrieve it; strongest retriever E5-V achieves only 40.8% Recall@1; GPT-5 reasoning accuracy drops from 80.86% (given evidence) to 51.4% (top-5 retrieval).
Conclusion: Multimodal retrieval over heterogeneous pools is a primary bottleneck for MLLMs; MultiHaystack highlights limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
Abstract: Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find that models perform competitively when provided with the corresponding evidence, but their performance drops sharply when required to retrieve that evidence from the full corpus. Additionally, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, while state-of-the-art MLLMs such as GPT-5 experience a significant drop in reasoning accuracy from 80.86% when provided with the corresponding evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
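The Recall@k numbers quoted above have a standard definition: the fraction of queries whose gold evidence item appears among the top-k retrieved candidates. A minimal sketch with made-up data:

```python
# Minimal Recall@k, the metric behind the "40.8% Recall@1" figure above.
# The ranked lists and gold items here are toy data for illustration.

def recall_at_k(ranked_ids_per_query, gold_ids, k):
    hits = sum(1 for ranked, gold in zip(ranked_ids_per_query, gold_ids)
               if gold in ranked[:k])
    return hits / len(gold_ids)

ranked = [["doc3", "img7", "vid2"],   # query 1: gold ranked first
          ["img1", "doc3", "vid9"],   # query 2: gold ranked second
          ["vid4", "doc8", "img2"]]   # query 3: gold not retrieved at all
gold = ["doc3", "doc3", "doc5"]

r1 = recall_at_k(ranked, gold, k=1)  # 1 of 3 queries hit at rank 1
r3 = recall_at_k(ranked, gold, k=3)  # 2 of 3 queries hit within top 3
```

Note each MultiHaystack question is grounded in a unique evidence item, which is what makes this single-gold formulation applicable.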
[124] Can we Trust Unreliable Voxels? Exploring 3D Semantic Occupancy Prediction under Label Noise
Wenxin Li, Kunyu Peng, Di Wen, Junwei Zheng, Jiale Wei, Mengfei Duan, Yuheng Zhang, Rui Fan, Kailun Yang
Main category: cs.CV
TL;DR: A benchmark (OccNL) and method (DPR-Occ) for robust 3D semantic occupancy prediction under noisy supervision, addressing structural artifacts and dynamic trailing effects in real-world voxel annotations.
Details
Motivation: Real-world 3D semantic occupancy annotations are corrupted by structural artifacts and dynamic trailing effects, creating unreliable supervision for autonomous systems. Existing 2D label noise learning methods fail in sparse 3D voxel spaces, exposing vulnerabilities in current paradigms.
Method: Proposes DPR-Occ, a label noise-robust framework using dual-source partial label reasoning. It combines temporal model memory with representation-level structural affinity to dynamically expand and prune candidate label sets, preserving true semantics while suppressing noise propagation.
Result: Extensive experiments on SemanticKITTI show DPR-Occ prevents geometric and semantic collapse under extreme corruption. At 90% label noise, achieves significant gains (up to 2.57% mIoU and 13.91% IoU) over adapted baselines.
Conclusion: OccNL benchmark and DPR-Occ method bridge label noise learning with 3D perception, providing reliable foundation for safety-critical robotic perception in dynamic environments.
Abstract: 3D semantic occupancy prediction is a cornerstone of robotic perception, yet real-world voxel annotations are inherently corrupted by structural artifacts and dynamic trailing effects. This raises a critical but underexplored question: can autonomous systems safely rely on such unreliable occupancy supervision? To systematically investigate this issue, we establish OccNL, the first benchmark dedicated to 3D occupancy under occupancy-asymmetric and dynamic trailing noise. Our analysis reveals a fundamental domain gap: state-of-the-art 2D label noise learning strategies collapse catastrophically in sparse 3D voxel spaces, exposing a critical vulnerability in existing paradigms. To address this challenge, we propose DPR-Occ, a principled label noise-robust framework that constructs reliable supervision through dual-source partial label reasoning. By synergizing temporal model memory with representation-level structural affinity, DPR-Occ dynamically expands and prunes candidate label sets to preserve true semantics while suppressing noise propagation. Extensive experiments on SemanticKITTI demonstrate that DPR-Occ prevents geometric and semantic collapse under extreme corruption. Notably, even at 90% label noise, our method achieves significant performance gains (up to 2.57% mIoU and 13.91% IoU) over existing label noise learning baselines adapted to the 3D occupancy prediction task. By bridging label noise learning and 3D perception, OccNL and DPR-Occ provide a reliable foundation for safety-critical robotic perception in dynamic environments. The benchmark and source code will be made publicly available at https://github.com/mylwx/OccNL.
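The "expand and prune candidate label sets" idea can be made concrete with a toy update rule over two score sources. Thresholds, scores, and names below are invented for illustration, not DPR-Occ's actual rule:

```python
# Schematic of dual-source partial-label reasoning as described for DPR-Occ:
# keep a candidate label set per voxel, expand it with labels either source
# finds plausible, and prune labels both sources reject. All numbers and
# thresholds are hypothetical.

def update_candidates(candidates, memory_scores, affinity_scores,
                      expand_thr=0.6, prune_thr=0.2):
    labels = set(memory_scores) | set(affinity_scores)
    for lbl in labels:
        m = memory_scores.get(lbl, 0.0)       # temporal model memory
        a = affinity_scores.get(lbl, 0.0)     # structural affinity
        if max(m, a) >= expand_thr:
            candidates.add(lbl)               # either source is confident
        elif lbl in candidates and max(m, a) < prune_thr:
            candidates.discard(lbl)           # both sources reject it
    return candidates

cands = update_candidates({"car", "road"},
                          memory_scores={"car": 0.9, "road": 0.1},
                          affinity_scores={"road": 0.05, "person": 0.7})
# "car" is kept, "person" is added, and the noisy "road" label is pruned.
```

Keeping a set rather than a single label is what lets true semantics survive while the noisy annotation is suppressed.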
[125] Interpretable Perception and Reasoning for Audiovisual Geolocation
Yiyang Su, Xiaoming Liu
Main category: cs.CV
TL;DR: Audiovisual geolocation framework using multimodal reasoning with acoustic atoms and visual features for precise global localization, outperforming unimodal baselines.
Details
Motivation: Current MLLMs have improved image-based localization but struggle with precise global geolocation due to visual ambiguity and underutilized auditory cues. There's a need to leverage both visual and auditory information for more accurate geographic localization.
Method: Three-stage framework: 1) Perception stage uses mixture-autoregressive sparse autoencoder to decompose audio into “acoustic atoms”; 2) Multimodal Reasoning stage employs MLLM finetuned via Group Relative Policy Optimization to synthesize acoustic atoms with visual features; 3) Precision Prediction stage uses Riemannian Flow Matching on the S² manifold for final localization.
Result: The framework significantly outperforms unimodal baselines, demonstrating that interpretable soundscape perception combined with multimodal reasoning enables high-precision global localization.
Conclusion: Soundscape perception provides a critical orthogonal signal that, when coupled with multimodal reasoning, enables precise global geolocation, addressing the limitations of vision-only approaches.
Abstract: While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded “acoustic atoms”; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the $S^2$ manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results entail that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
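Predictions that live on the $S^2$ manifold are naturally scored by geodesic (great-circle) distance rather than Euclidean error. The haversine computation below is the standard formula, shown here only as context for why the sphere matters; it is not code from the paper:

```python
import math

# Great-circle (geodesic) distance on the sphere via the standard haversine
# formula -- the natural error metric for geolocation predictions on S^2.

def geodesic_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Example: Paris vs. London, roughly 340-350 km apart along the great circle.
d = geodesic_km(48.8566, 2.3522, 51.5074, -0.1278)
```

Flow matching on the manifold keeps predicted densities on the sphere itself, avoiding the distortions a flat lat/lon parameterization introduces near the poles.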
[126] Any to Full: Prompting Depth Anything for Depth Completion in One Stage
Zhiyuan Zhou, Ruofeng Liu, Taichi Liu, Weijian Zuo, Shanshan Wang, Zhiqing Hong, Desheng Zhang
Main category: cs.CV
TL;DR: Any2Full is a one-stage framework for depth completion that uses scale-prompting adaptation of pretrained monocular depth estimation models to handle various sparse depth patterns without explicit metric alignment.
Details
Motivation: Existing RGBD-fused depth completion methods have limited domain generalization and robustness to different depth patterns. Current approaches using monocular depth estimation models require two-stage integration with explicit relative-to-metric alignment, which adds computation and causes structured distortions.
Method: Proposes a one-stage framework that reformulates depth completion as scale-prompting adaptation of a pretrained MDE model. Uses a Scale-Aware Prompt Encoder to distill scale cues from sparse inputs into unified scale prompts, guiding the MDE model toward globally scale-consistent predictions while preserving its geometric priors.
Result: Outperforms OMNI-DC by 32.2% in average AbsREL and delivers a 1.4× speedup over PriorDA with the same MDE backbone. Demonstrates superior robustness and efficiency across various depth patterns and sparsity levels.
Conclusion: Any2Full establishes a new paradigm for universal depth completion that is domain-general, pattern-agnostic, and efficient, enabling robust depth estimation from sparse inputs without explicit metric alignment.
Abstract: Accurate, dense depth estimation is crucial for robotic perception, but commodity sensors often yield sparse or incomplete measurements due to hardware limitations. Existing RGBD-fused depth completion methods learn priors jointly conditioned on training RGB distribution and specific depth patterns, limiting domain generalization and robustness to various depth patterns. Recent efforts leverage monocular depth estimation (MDE) models to introduce domain-general geometric priors, but current two-stage integration strategies relying on explicit relative-to-metric alignment incur additional computation and introduce structured distortions. To this end, we present Any2Full, a one-stage, domain-general, and pattern-agnostic framework that reformulates completion as a scale-prompting adaptation of a pretrained MDE model. To address varying depth sparsity levels and irregular spatial distributions, we design a Scale-Aware Prompt Encoder. It distills scale cues from sparse inputs into unified scale prompts, guiding the MDE model toward globally scale-consistent predictions while preserving its geometric priors. Extensive experiments demonstrate that Any2Full achieves superior robustness and efficiency. It outperforms OMNI-DC by 32.2% in average AbsREL and delivers a 1.4$\times$ speedup over PriorDA with the same MDE backbone, establishing a new paradigm for universal depth completion. Codes and checkpoints are available at https://github.com/zhiyuandaily/Any2Full.
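For context on what the one-stage design avoids: the explicit relative-to-metric alignment in two-stage baselines is often a least-squares scale/shift fit between relative depth and sparse metric samples. The closed-form fit below is that standard baseline step (toy data, not the paper's code):

```python
# The explicit relative-to-metric alignment step that two-stage baselines
# perform (and that Any2Full's scale-prompting avoids): a closed-form
# least-squares fit of scale s and shift t. Toy data for illustration.

def fit_scale_shift(rel, metric):
    """Solve min_{s,t} sum_i (s*rel_i + t - metric_i)^2 in closed form."""
    n = len(rel)
    mean_r = sum(rel) / n
    mean_m = sum(metric) / n
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(rel, metric))
    var = sum((r - mean_r) ** 2 for r in rel)
    s = cov / var
    t = mean_m - s * mean_r
    return s, t

# Relative depths at pixels with sparse sensor hits, and their metric depths.
# Here metric = 10 * relative exactly, so the fit recovers s=10, t=0.
s, t = fit_scale_shift([0.1, 0.4, 0.9], [1.0, 4.0, 9.0])
```

A single global (s, t) cannot correct spatially varying bias in the relative prediction, which is the "structured distortion" problem the abstract attributes to this strategy.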
[127] Unlocking ImageNet’s Multi-Object Nature: Automated Large-Scale Multilabel Annotation
Junyu Chen, Md Yousuf Harun, Christopher Kanan
Main category: cs.CV
TL;DR: Automated pipeline converts ImageNet training set to multi-label dataset using self-supervised Vision Transformers for unsupervised object discovery, improving model accuracy and transfer learning.
Details
Motivation: ImageNet's single-label assumption creates label noise and limits learning signal since real-world images contain multiple objects. Multi-label annotations better reflect visual reality and enable richer representation learning.
Method: Uses self-supervised Vision Transformers for unsupervised object discovery, selects regions aligned with original labels to train lightweight classifier, then applies it to all regions to generate coherent multi-label annotations across dataset.
Result: Models trained with multi-label supervision achieve better in-domain accuracy (+2.0 top-1 on ReaL, +1.5 on ImageNet-V2) and stronger transferability (+4.2 mAP on COCO, +2.3 mAP on VOC). Labels align well with human judgment.
Conclusion: Multi-label annotations significantly enhance classification performance and representation learning. The automated pipeline provides scalable, high-quality multi-label annotations without human effort.
Abstract: The original ImageNet benchmark enforces a single-label assumption, despite many images depicting multiple objects. This leads to label noise and limits the richness of the learning signal. Multi-label annotations more accurately reflect real-world visual scenes, where multiple objects co-occur and contribute to semantic understanding, enabling models to learn richer and more robust representations. While prior efforts (e.g., ReaL, ImageNetv2) have improved the validation set, there has not yet been a scalable, high-quality multi-label annotation for the training set. To this end, we present an automated pipeline to convert the ImageNet training set into a multi-label dataset, without human annotations. Using self-supervised Vision Transformers, we perform unsupervised object discovery, select regions aligned with original labels to train a lightweight classifier, and apply it to all regions to generate coherent multi-label annotations across the dataset. Our labels show strong alignment with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks. Compared to traditional single-label scheme, models trained with our multi-label supervision achieve consistently better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and exhibit stronger transferability to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively). These results underscore the importance of accurate multi-label annotations for enhancing both classification performance and representation learning. Project code and the generated multi-label annotations are available at https://github.com/jchen175/MultiLabel-ImageNet.
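Training-wise, moving from single-label to multi-label supervision typically means replacing softmax cross-entropy with an independent binary cross-entropy per class (frameworks expose this as a "BCE with logits" loss). A plain-Python sketch of that standard loss, not code from the paper:

```python
import math

# Standard multi-label objective: one independent binary cross-entropy per
# class, so several classes can be positive at once. Toy logits/targets.

def multilabel_bce(logits, targets):
    """logits: raw per-class scores; targets: 0/1 per class (several may be 1)."""
    loss = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))          # per-class sigmoid
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(logits)

# An image containing both class 0 and class 1 gets two positive targets;
# confident logits on the right classes give a small loss.
loss = multilabel_bce([2.0, 1.5, -3.0], [1, 1, 0])
```

Under the original single-label scheme, one of those two co-occurring objects would be scored as a mistake, which is exactly the label noise the paper targets.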
[128] From Phase Grounding to Intelligent Surgical Narratives
Ethan Peterson, Huixin Zhan
Main category: cs.CV
TL;DR: Automatically creates surgical timelines from videos using CLIP-based multimodal alignment between video frames and textual gesture descriptions.
Details
Motivation: Current surgical timeline creation methods are either too vague (post-op reports) or too time-consuming (manual video annotation). Need an automated approach that sits between these extremes.
Method: Uses CLIP-based multimodal framework to align surgical video frames with textual gesture descriptions. Fine-tunes CLIP visual and text encoders to improve alignment between video gestures and textual tokens in shared embedding space.
Result: Model predicts gestures and phases for video frames, enabling construction of structured surgical timelines without manual annotation.
Conclusion: Leverages pretrained multimodal representations to bridge visual gestures and textual narratives, reducing need for manual video review and annotation in surgical settings.
Abstract: Video surgery timelines are an important part of tool-assisted surgeries, as they allow surgeons to quickly focus on key parts of the procedure. Current methods involve the surgeon filling out a post-operation (OP) report, which is often vague, or manually annotating the surgical videos, which is highly time-consuming. Our proposed method sits between these two extremes: we aim to automatically create a surgical timeline and narrative directly from the surgical video. To achieve this, we employ a CLIP-based multi-modal framework that aligns surgical video frames with textual gesture descriptions. Specifically, we use the CLIP visual encoder to extract representations from surgical video frames and the text encoder to embed the corresponding gesture sentences into a shared embedding space. We then fine-tune the model to improve the alignment between video gestures and textual tokens. Once trained, the model predicts gestures and phases for video frames, enabling the construction of a structured surgical timeline. This approach leverages pretrained multi-modal representations to bridge visual gestures and textual narratives, reducing the need for manual video review and annotation by surgeons.
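The CLIP-style alignment objective pairs each frame embedding with its gesture-sentence embedding via a softmax over cosine similarities. The toy version below shows one direction of that standard contrastive loss (CLIP averages both directions); the vectors and temperature are illustrative, not the paper's encoders:

```python
import math

# Schematic CLIP-style contrastive loss: cosine similarities over a batch,
# cross-entropy with matched frame/text pairs on the diagonal. Toy 2-D
# embeddings stand in for the encoder outputs.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def clip_loss(frame_embs, text_embs, temperature=0.07):
    n = len(frame_embs)
    loss = 0.0
    for i in range(n):  # frame -> text direction; full CLIP averages both
        sims = [cosine(frame_embs[i], t) / temperature for t in text_embs]
        m = max(sims)                                     # log-sum-exp trick
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += -(sims[i] - log_z)       # cross-entropy: match the i-th text
    return loss / n

frames = [[1.0, 0.0], [0.0, 1.0]]
texts  = [[0.9, 0.1], [0.1, 0.9]]   # roughly aligned pairs -> low loss
loss = clip_loss(frames, texts)
```

Fine-tuning drives matched frame/gesture pairs together in the shared space, which is what later lets a frame be labeled by its nearest gesture description.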
[129] Full Dynamic Range Sky-Modelling For Image Based Lighting
Ian J. Maquignaz
Main category: cs.CV
TL;DR: Icarus is a deep learning sky-model that generates high-resolution, photorealistic environment maps with full dynamic range, enabling user-controlled positioning of sun/clouds and atmospheric texturing for accurate image-based lighting.
Details
Motivation: Current DNN sky-models fail to accurately recreate photorealistic natural skies, especially at higher resolutions where they struggle with the 14EV+ class-imbalanced solar region, resulting in poor visual quality and inaccurate lighting, shadows, and tones for image-based lighting applications.
Method: Icarus is an all-weather sky-model that learns the exposure range of Full Dynamic Range physically captured outdoor imagery. It enables conditional generation of environment maps with intuitive user-positioning of solar and cloud formations, and extends state-of-the-art to allow user-controlled texturing of atmospheric formations.
Result: Icarus demonstrates interchangeability with FDR physically captured outdoor imagery or parametric sky-models, and illuminates scenes with unprecedented accuracy, photorealism, lighting directionality (shadows), and tones in Image Based Lighting.
Conclusion: Icarus represents a significant advancement in sky modeling, providing a deep learning-based solution that overcomes limitations of previous models in handling high dynamic range and achieving photorealistic results for environment map generation and image-based lighting.
Abstract: Accurate environment maps are a key component to modelling real-world outdoor scenes. They enable captivating visual arts, immersive virtual reality and a wide range of scientific and engineering applications. To alleviate the burden of physical-capture, physically-simulation and volumetric rendering, sky-models have been proposed as fast, flexible, and cost-saving alternatives. In recent years, sky-models have been extended through deep learning to be more comprehensive and inclusive of cloud formations, but recent work has demonstrated these models fall short in faithfully recreating accurate and photorealistic natural skies. Particularly at higher resolutions, DNN sky-models struggle to accurately model the 14EV+ class-imbalanced solar region, resulting in poor visual quality and scenes illuminated with skewed light transmission, shadows and tones. In this work, we propose Icarus, an all-weather sky-model capable of learning the exposure range of Full Dynamic Range (FDR) physically captured outdoor imagery. Our model allows conditional generation of environment maps with intuitive user-positioning of solar and cloud formations, and extends on current state-of-the-art to enable user-controlled texturing of atmospheric formations. Through our evaluation, we demonstrate Icarus is interchangeable with FDR physically captured outdoor imagery or parametric sky-models, and illuminates scenes with unprecedented accuracy, photorealism, lighting directionality (shadows), and tones in Image Based Lighting (IBL).
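To put the "14EV+" figure in perspective: each exposure-value stop doubles the light, so N stops correspond to a 2^N luminance ratio. This is standard photographic arithmetic, not code from the paper:

```python
import math

# Exposure-value stops between two luminances: each stop doubles the light,
# so a 14EV solar region spans a ~16,000:1 dynamic range that the sky-model
# must represent without clipping.

def ev_span(max_luminance, min_luminance):
    """Number of EV stops between two luminance values."""
    return math.log2(max_luminance / min_luminance)

stops = ev_span(16384.0, 1.0)   # a 2**14 ratio corresponds to 14 stops
```

A generator trained on tone-mapped or clipped imagery never sees the top of this range, which is why Full Dynamic Range capture matters for physically plausible sun illumination.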
[130] ECLARE: Efficient cross-planar learning for anisotropic resolution enhancement
Samuel W. Remedios, Shuwen Wei, Shuo Han, Jinwei Zhang, Aaron Carass, Kurt G. Schilling, Dzung L. Pham, Jerry L. Prince, Blake E. Dewey
Main category: cs.CV
TL;DR: ECLARE is a self-supervised super-resolution method for anisotropic 2D MR image stacks that addresses slice profile estimation, slice gaps, domain shift, and arbitrary upsampling factors without external training data.
Details
Motivation: Clinical MR imaging often uses 2D slice stacks with thick slices and gaps, causing poor performance for 3D analysis algorithms. Existing SR methods fail to address all key factors: slice profile shape, slice gaps, domain shift, and arbitrary upsampling factors.
Method: ECLARE estimates slice profiles from the 2D MR volume, trains a network to learn mapping from low-res to high-res in-plane patches from the same volume, and performs SR with anti-aliasing in a self-supervised manner without external data.
Result: ECLARE outperformed cubic B-spline interpolation, SMORE, and other contemporary SR methods in both signal recovery and downstream tasks on realistic simulations with ground truth.
Conclusion: ECLARE effectively addresses anisotropic resolution enhancement for 2D MR volumes without domain shift issues, offering open-source implementation for clinical applications.
Abstract: In clinical imaging, magnetic resonance (MR) image volumes are often acquired as stacks of 2D slices with decreased scan times, improved signal-to-noise ratio, and image contrasts unique to 2D MR pulse sequences. While this is sufficient for clinical evaluation, automated algorithms designed for 3D analysis perform poorly on multi-slice 2D MR volumes, especially those with thick slices and gaps between slices. Super-resolution (SR) methods aim to address this problem, but previous methods do not address all of the following: slice profile shape estimation, slice gap, domain shift, and non-integer or arbitrary upsampling factors. In this paper, we propose ECLARE (Efficient Cross-planar Learning for Anisotropic Resolution Enhancement), a self-SR method that addresses each of these factors. ECLARE uses a slice profile estimated from the multi-slice 2D MR volume, trains a network to learn the mapping from low-resolution to high-resolution in-plane patches from the same volume, and performs SR with anti-aliasing. We compared ECLARE to cubic B-spline interpolation, SMORE, and other contemporary SR methods. We used realistic and representative simulations so that quantitative performance against ground truth can be computed, and ECLARE outperformed all other methods in both signal recovery and downstream tasks. Importantly, as ECLARE does not use external training data it cannot suffer from domain shift between training and testing. Our code is open-source and available at https://www.github.com/sremedios/eclare.
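The self-supervised training pairs come from degrading high-resolution in-plane data with the estimated slice profile plus subsampling, so the network learns the inverse mapping from the volume itself. A 1-D toy of that degradation (the profile and step here are invented; ECLARE estimates the real profile from the volume):

```python
# Toy 1-D version of self-SR pair construction in the spirit of ECLARE:
# blur a high-resolution in-plane row with a slice profile, then subsample
# to mimic slice thickness plus gap. Profile and step are hypothetical.

def degrade(signal, profile, step):
    """Convolve with a slice profile, then keep every `step`-th sample."""
    half = len(profile) // 2
    blurred = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(profile):
            k = min(max(i + j - half, 0), len(signal) - 1)  # clamp at borders
            acc += w * signal[k]
        blurred.append(acc)
    return blurred[::step]   # slice thickness + gap -> coarse sampling

hr_row = [0, 0, 1, 1, 0, 0, 1, 1]                # high-res in-plane intensities
lr_row = degrade(hr_row, [0.25, 0.5, 0.25], step=2)
# (lr_row, hr_row) is one training pair for the low-res -> high-res mapping.
```

Because both halves of every pair come from the same volume, there is no train/test domain shift, which is the property the abstract highlights.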
[131] Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers
Ruidong Chen, Yancheng Bai, Xuanpu Zhang, Jianhao Zeng, Lanjun Wang, Dan Song, Lei Sun, Xiangxiang Chu, Anan Liu
Main category: cs.CV
TL;DR: LayerBind is a training-free, plug-and-play method for precise region and occlusion control in text-to-image generation by modeling regional generation as distinct layers and binding them during denoising.
Details
Motivation: Existing region-instructed layout control methods have limitations: training-based approaches inherit data bias and degrade image quality, while current techniques struggle with occlusion order, limiting real-world usability.
Method: Two-phase approach: (1) Layer-wise Instance Initialization creates per-instance branches attending to their own regions while anchoring to shared background, fused early to establish layout; (2) Layer-wise Semantic Nursing reinforces regional details via sequential layered attention path alongside standard global path with layer-transparency scheduler.
Result: Qualitative and quantitative results demonstrate LayerBind’s effectiveness for precise regional and occlusion controllability, supporting editable workflows for flexible modifications like changing instances or rearranging visible orders.
Conclusion: LayerBind provides effective training-free regional and occlusion control for Diffusion Transformers, with strong potential for creative applications and native support for editable workflows.
Abstract: Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind’s effectiveness, highlighting its strong potential for creative applications.
[132] Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval
Donghoon Han, Eunhwan Park, Seunghyeon Seo
Main category: cs.CV
TL;DR: BM25-V applies text retrieval’s BM25 scoring to sparse visual words from Vision Transformers for efficient, interpretable image retrieval.
Details
Motivation: Dense image retrieval lacks interpretability and attribution and is computationally expensive at scale. There is a need for efficient first-stage retrieval with interpretable results.
Method: Uses a Sparse Auto-Encoder on Vision Transformer patch features to create visual words, then applies BM25 scoring with IDF weighting for sparse inverted-index retrieval.
Result: Achieves Recall@200 ≥ 0.993 across seven benchmarks, enabling a two-stage pipeline with near-dense accuracy within 0.2% while reranking only 200 candidates per query.
Conclusion: BM25-V provides efficient, interpretable first-stage retrieval for vision tasks, with transferable SAE and attributable decisions via visual words with IDF contributions.
Abstract: Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present \textbf{BM25-V}, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25’s inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 $\geq$ 0.993, enabling a two-stage pipeline that reranks only $K{=}200$ candidates per query and recovers near-dense accuracy within $0.2$% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
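To make the scoring concrete, here is a minimal sketch of Okapi BM25 over sparse visual-word activations with an inverted index, the core mechanic BM25-V borrows from text retrieval. The data layout, parameter values, and helper names are illustrative, not taken from the paper:

```python
import math
from collections import defaultdict

def build_index(docs):
    """docs: list of {visual_word_id: activation_count} per gallery image.
    Returns an inverted index and per-word IDF weights."""
    index = defaultdict(list)   # word -> [(doc_id, tf), ...]
    df = defaultdict(int)       # document frequency per visual word
    for doc_id, words in enumerate(docs):
        for w, tf in words.items():
            index[w].append((doc_id, tf))
            df[w] += 1
    n = len(docs)
    # Okapi IDF: down-weights ubiquitous, low-information visual words
    idf = {w: math.log((n - d + 0.5) / (d + 0.5) + 1.0) for w, d in df.items()}
    return index, idf

def bm25_scores(query_words, docs, index, idf, k1=1.5, b=0.75):
    """Score gallery images against a set of query visual words."""
    avg_len = sum(sum(d.values()) for d in docs) / len(docs)
    scores = defaultdict(float)
    for w in query_words:
        if w not in index:
            continue
        for doc_id, tf in index[w]:
            dl = sum(docs[doc_id].values())
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
            scores[doc_id] += idf[w] * norm
    return sorted(scores.items(), key=lambda x: -x[1])

# toy gallery of three images; image 2 shares both query words
gallery = [{1: 3, 2: 1}, {1: 1, 5: 4}, {2: 2, 5: 1, 7: 2}]
ranking = bm25_scores({5, 7}, gallery, *build_index(gallery))
```

In a real pipeline the word counts would come from thresholded SAE activations on ViT patch features, and the top-K candidates from this sparse pass would be handed to a dense reranker.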
[133] Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction
Ling Xiao, Yuliang Xiu, Yue Chen, Guoming Wang, Toshihiko Yamasaki
Main category: cs.CV
TL;DR: Spectral analysis reveals that preserving spectral structure, not enhancing spatial details, is key for 3D reconstruction quality in 2D-to-3D pipelines using vision foundation models.
Details
Motivation: Current learnable upsampling methods in 2D-to-3D pipelines focus on enhancing spatial details but their impact on 3D awareness remains underexplored. The paper aims to understand how feature upsampling affects 3D reconstruction quality.
Method: Introduces a spectral diagnostic framework with six complementary metrics to analyze upsampling methods. Evaluates classical interpolation and learnable upsampling methods on CLIP and DINO backbones, measuring amplitude redistribution, structural spectral alignment, and directional stability.
Result: Three key findings: 1) Structural spectral consistency is the strongest predictor of novel view synthesis quality, 2) Geometry and texture respond to different spectral properties, 3) Learnable upsamplers often produce sharper features but rarely outperform classical interpolation in reconstruction quality.
Conclusion: Reconstruction quality depends more on preserving spectral structure than enhancing spatial detail. Spectral consistency should be a key principle for designing upsampling strategies in 2D-to-3D pipelines.
Abstract: A typical 2D-to-3D pipeline takes multi-view images as input, where a Vision Foundation Model (VFM) extracts features that are spatially upsampled to dense representations for 3D reconstruction. If dense features across views preserve geometric consistency, differentiable rendering can recover an accurate 3D representation, making the feature upsampler a critical component. Recent learnable upsampling methods mainly aim to enhance spatial details, such as sharper geometry or richer textures, yet their impact on 3D awareness remains underexplored. To address this gap, we introduce a spectral diagnostic framework with six complementary metrics that characterize amplitude redistribution, structural spectral alignment, and directional stability. Across classical interpolation and learnable upsampling methods on CLIP and DINO backbones, we observe three key findings. First, structural spectral consistency (SSC/CSC) is the strongest predictor of NVS quality, whereas High-Frequency Spectral Slope Drift (HFSS) often correlates negatively with reconstruction performance, indicating that emphasizing high-frequency details alone does not necessarily improve 3D reconstruction. Second, geometry and texture respond to different spectral properties: Angular Energy Consistency (ADC) correlates more strongly with geometry-related metrics, while SSC/CSC influence texture fidelity slightly more than geometric accuracy. Third, although learnable upsamplers often produce sharper spatial features, they rarely outperform classical interpolation in reconstruction quality, and their effectiveness depends on the reconstruction model. Overall, our results indicate that reconstruction quality is more closely related to preserving spectral structure than to enhancing spatial detail, highlighting spectral consistency as an important principle for designing upsampling strategies in 2D-to-3D pipelines.
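As a rough illustration of this kind of spectral probing, the snippet below compares the log-amplitude spectra of two feature maps via cosine similarity. This is only a simple proxy in the spirit of the paper's structural spectral consistency metric; the six metrics' exact definitions are not reproduced here:

```python
import numpy as np

def log_amplitude_spectrum(feat):
    """feat: (H, W) feature-map channel. Centered log-amplitude spectrum."""
    spec = np.fft.fftshift(np.fft.fft2(feat))
    return np.log1p(np.abs(spec))

def spectral_consistency(feat_a, feat_b):
    """Cosine similarity between flattened log-amplitude spectra: a crude
    stand-in for structural spectral alignment between two feature maps."""
    a = log_amplitude_spectrum(feat_a).ravel()
    b = log_amplitude_spectrum(feat_b).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
low = rng.standard_normal((16, 16))
ref = np.kron(low, np.ones((2, 2)))        # reference dense features
up_good = np.kron(low, np.ones((2, 2)))    # nearest-neighbor upsample of low
up_bad = rng.standard_normal((32, 32))     # unrelated "upsampled" features
score_good = spectral_consistency(ref, up_good)
score_bad = spectral_consistency(ref, up_bad)
```

The point of such probes is that they measure how an upsampler redistributes spectral energy relative to a reference, rather than how sharp its output looks.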
[134] EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition
Adam D. Hines, Gokul B. Nair, NicolĂĄs Marticorena, Michael Milford, Tobias Fischer
Main category: cs.CV
TL;DR: EventGeM: A global-to-local feature fusion pipeline for event-based Visual Place Recognition using pre-trained vision transformers for global features, MaxViT for local keypoints, and depth estimation for re-ranking refinement.
Details
Motivation: Event cameras offer sparse activation and high-temporal resolution ideal for robotic navigation, but existing methods lack robust place recognition capabilities. The authors aim to develop a state-of-the-art event-based Visual Place Recognition system that works across various lighting conditions and compute architectures.
Method: Uses pre-trained ViT-S/16 backbone for global feature extraction from event histogram images, MaxViT backbone for local keypoint detection, 2D-homography based re-ranking with RANSAC, and vision foundation model for depth estimation-based structural similarity comparison for additional refinement.
Result: Achieves state-of-the-art localization performance compared to existing event-based place recognition methods across multiple benchmark datasets and lighting conditions, while maintaining real-time capability on various compute architectures. Successfully demonstrated in real-world robotic deployment.
Conclusion: EventGeM provides an effective global-to-local feature fusion approach for event-based Visual Place Recognition that outperforms existing methods and enables practical real-time robotic applications.
Abstract: Dynamic vision sensors, also known as event cameras, are rapidly rising in popularity for robotic and computer vision tasks due to their sparse activation and high-temporal resolution. Event cameras have been used in robotic navigation and localization tasks where accurate positioning needs to occur on small and frequent time scales, or when energy concerns are paramount. In this work, we present EventGeM, a state-of-the-art global to local feature fusion pipeline for event-based Visual Place Recognition. We use a pre-trained vision transformer (ViT-S/16) backbone to obtain global feature patch embeddings from event histogram images for initial match predictions. Local feature keypoints were then detected using a pre-trained MaxViT backbone for 2D-homography based re-ranking with RANSAC. For additional re-ranking refinement, we subsequently used a pre-trained vision foundation model for depth estimation to compare structural similarity between references and queries. Our work performs state-of-the-art localization when compared to the best currently available event-based place recognition method across several benchmark datasets and lighting conditions, all whilst being fully capable of running in real-time when deployed across a variety of compute architectures. We demonstrate the capability of EventGeM in a real-world deployment on a robotic platform for online localization using event streams directly from an event camera. Project page: https://eventgemvpr.github.io/
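The event histogram images that feed the ViT backbone can be sketched as follows: the event stream is binned per pixel, typically with one channel per polarity. This is the common formulation of event histograms; EventGeM's exact binning (window length, normalization) may differ:

```python
import numpy as np

def events_to_histogram(events, height, width):
    """Bin an event stream into a 2-channel histogram image.

    events: iterable of (x, y, t, p) tuples with polarity p in {0, 1}.
    Each event increments the count at its pixel in its polarity channel;
    timestamps are ignored once events fall inside the chosen window.
    """
    hist = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _t, p in events:
        hist[int(p), int(y), int(x)] += 1.0
    return hist

# three toy events inside a 4x5 sensor window
events = [(3, 2, 0.001, 1), (3, 2, 0.002, 1), (0, 0, 0.003, 0)]
img = events_to_histogram(events, height=4, width=5)
```

The resulting image-like tensor is what makes it possible to reuse frame-based backbones such as ViT-S/16 on event data.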
[135] Training-free Latent Inter-Frame Pruning with Attention Recovery
Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir, Feng Liang, Radu Marculescu, Chenfeng Xu, Diana Marculescu
Main category: cs.CV
TL;DR: LIPAR framework accelerates video generation by pruning redundant latent patches and recovering attention values to maintain quality.
Details
Motivation: Current video generation models have high computational latency that makes real-time applications prohibitively expensive, creating a need for more efficient methods.
Method: Proposes Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework that detects and skips recomputing duplicated latent patches, with an Attention Recovery mechanism to approximate attention values of pruned tokens and remove visual artifacts.
Result: Increases video editing throughput by 1.45×, achieving 12.2 FPS on an NVIDIA A6000 vs baseline 8.4 FPS, without compromising generation quality and requiring no additional training.
Conclusion: The approach effectively bridges traditional compression algorithms with modern generative pipelines, enabling more efficient video generation.
Abstract: Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which detects and skips recomputing duplicated latent patches. Additionally, we introduce a novel Attention Recovery mechanism that approximates the attention values of pruned tokens, thereby removing visual artifacts arising from naively applying the pruning method. Empirically, our method increases video editing throughput by $1.45\times$, on average achieving 12.2 FPS on an NVIDIA A6000 compared to the baseline 8.4 FPS. The proposed method does not compromise generation quality and can be seamlessly integrated with the model without additional training. Our approach effectively bridges the gap between traditional compression algorithms and modern generative pipelines.
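The duplicate-detection step can be pictured as a per-patch change test between consecutive frame latents: patches whose change falls below a threshold are skipped and their cached results reused. The relative-norm criterion and threshold below are illustrative choices, not LIPAR's exact rule:

```python
import numpy as np

def patches_to_recompute(prev_latent, cur_latent, tau=0.05):
    """Return a boolean mask over latent patches: True = recompute,
    False = near-duplicate of the previous frame, reuse cached output.

    latents: (num_patches, dim). Change is measured as the relative L2
    distance to the previous frame's patch (an illustrative criterion)."""
    diff = np.linalg.norm(cur_latent - prev_latent, axis=1)
    diff /= np.linalg.norm(prev_latent, axis=1) + 1e-8
    return diff >= tau

rng = np.random.default_rng(1)
prev = rng.standard_normal((8, 16))
cur = prev.copy()
cur[2] += 1.0                         # only patch 2 changed between frames
keep = patches_to_recompute(prev, cur)
```

Pruning alone distorts attention (the pruned tokens still contribute as keys/values for other tokens), which is why the paper pairs it with an Attention Recovery step that approximates the missing attention values instead of dropping them outright.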
[136] Margin and Consistency Supervision for Calibrated and Robust Vision Models
Salim Khazem
Main category: cs.CV
TL;DR: MaCS is a regularization framework that improves deep vision classifiers by enforcing logit-space margins and local prediction consistency, enhancing calibration and robustness without architectural changes.
Details
Motivation: Deep vision classifiers often achieve high accuracy but suffer from poor calibration and fragility under distribution shifts, lacking reliable uncertainty estimates and robustness.
Method: MaCS augments cross-entropy with two components: (1) hinge-squared margin penalty enforcing target logit gap between correct class and strongest competitor, and (2) consistency regularizer minimizing KL divergence between predictions on clean inputs and mildly perturbed views.
Result: Across multiple image classification benchmarks with CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy.
Conclusion: MaCS provides an effective drop-in replacement for standard training objectives that improves vision classifier reliability without additional data, architectural changes, or inference overhead.
Abstract: Deep vision classifiers often achieve high accuracy while remaining poorly calibrated and fragile under small distribution shifts. We present Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS augments cross-entropy with (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and (ii) a consistency regularizer that minimizes the KL divergence between predictions on clean inputs and mildly perturbed views. We provide a unifying theoretical analysis showing that increasing classification margin while reducing local sensitivity formalized via a Lipschitz-type stability proxy yields improved generalization guarantees and a provable robustness radius bound scaling with the margin-to-sensitivity ratio. Across several image classification benchmarks and several backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. Our approach requires no additional data, no architectural changes, and negligible inference overhead, making it an effective drop-in replacement for standard training objectives.
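The two-term objective is simple enough to sketch end to end. Below is a NumPy version of a MaCS-style loss: cross-entropy plus a hinge-squared penalty on the logit gap to the strongest competitor, plus a KL consistency term between clean and perturbed predictions. The margin value and the weights `lam`/`mu` are illustrative, as is the KL direction:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def macs_loss(logits_clean, logits_pert, labels, margin=2.0, lam=0.5, mu=0.5):
    """Sketch of a MaCS-style objective (hyperparameters are assumptions)."""
    n = len(labels)
    p = softmax(logits_clean)
    ce = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    # (i) hinge-squared margin on the gap to the strongest competitor
    true_logit = logits_clean[np.arange(n), labels]
    masked = logits_clean.copy()
    masked[np.arange(n), labels] = -np.inf   # exclude the true class
    competitor = masked.max(axis=1)
    margin_pen = np.maximum(0.0, margin - (true_logit - competitor)) ** 2
    # (ii) KL(clean || perturbed) consistency between the two views
    q = softmax(logits_pert)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=1)
    return ce + lam * margin_pen.mean() + mu * kl.mean()

logits = np.array([[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]])
loss = macs_loss(logits, logits + 0.01, np.array([0, 1]))
```

Because every term is a function of the logits, the whole objective drops into a standard training loop in place of plain cross-entropy, which is what makes it architecture-agnostic.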
[137] Remote Sensing Image Classification Using Deep Ensemble Learning
Niful Islam, Md. Rayhan Ahmed, Nur Mohammad Fahad, Salekul Islam, A. K. M. Muzahidul Islam, Saddam Mukta, Swakkhar Shatabda
Main category: cs.CV
TL;DR: A fusion model combining CNNs and Vision Transformers for remote sensing image classification, using ensemble of four independent fusion models to overcome performance bottlenecks from redundant features.
Details
Motivation: Remote sensing imagery requires accurate classification, but CNNs struggle with global context while ViTs capture long-range dependencies. Combining both architectures should yield better performance, but simple integration creates bottlenecks from redundant features.
Method: Proposes a fusion model combining CNN and ViT backbones, trained as four independent fusion models with outputs combined at final prediction stage through ensembling to overcome performance bottlenecks from redundant feature representations.
Result: Achieves 98.10% accuracy on UC Merced, 94.46% on RSSCN7, and 95.45% on MSRSI datasets, outperforming competing architectures with efficient computational resource usage during training.
Conclusion: The ensemble-based fusion approach effectively combines CNN and ViT strengths for remote sensing classification, overcoming performance bottlenecks while maintaining computational efficiency.
Abstract: Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. While Convolutional Neural Networks (CNNs) are mostly used for image classification, they excel at local feature extraction, but struggle to capture global contextual information. Vision Transformers (ViTs) address this limitation through self attention mechanisms that model long-range dependencies. Integrating CNNs and ViTs, therefore, leads to better performance than standalone architectures. However, the use of additional CNN and ViT components does not lead to further performance improvement and instead introduces a bottleneck caused by redundant feature representations. In this research, we propose a fusion model that combines the strengths of CNNs and ViTs for remote sensing image classification. To overcome the performance bottleneck, the proposed approach trains four independent fusion models that integrate CNN and ViT backbones and combine their outputs at the final prediction stage through ensembling. The proposed method achieves accuracy rates of 98.10 percent, 94.46 percent, and 95.45 percent on the UC Merced, RSSCN7, and MSRSI datasets, respectively. These results outperform competing architectures and highlight the effectiveness of the proposed solution, particularly due to its efficient use of computational resources during training.
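Since the models are trained independently and combined only at the final prediction stage, the ensembling step itself is tiny. A probability-averaging sketch is shown below; averaging is one common combination rule, and the paper's exact rule may differ:

```python
import numpy as np

def ensemble_predict(model_probs):
    """Combine class probabilities from independently trained CNN+ViT
    fusion models by averaging, then take the argmax as the prediction."""
    stacked = np.stack(model_probs)      # (n_models, n_classes)
    return int(stacked.mean(axis=0).argmax())

# toy softmax outputs of four fusion models over 3 land-use classes
probs = [
    np.array([0.60, 0.30, 0.10]),
    np.array([0.20, 0.50, 0.30]),
    np.array([0.55, 0.35, 0.10]),
    np.array([0.50, 0.20, 0.30]),
]
pred = ensemble_predict(probs)
```

Keeping the four fusion models separate until this step is what sidesteps the redundant-feature bottleneck the paper observes when extra CNN/ViT components are stacked inside a single network.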
[138] Cog2Gen3D: Sculpturing 3D Semantic-Geometric Cognition for 3D Generation
Haonan Wang, Hanyu Zhou, Haoyue Liu, Tao Gu, Luxin Yan
Main category: cs.CV
TL;DR: Cog2Gen3D: A 3D cognition-guided diffusion framework that uses semantic and geometric representations structured into dual-stream graphs to guide 3D Gaussian generation, addressing scale inconsistency in existing methods.
Details
Motivation: Existing 3D generation methods lack proper spatial geometry constraints, often modeling only relative relationships and suffering from scale inconsistency. The authors argue that combining semantic information with absolute geometry enables better 3D cognition for physically plausible generation.
Method: Three key designs: 1) Cognitive Feature Embeddings encoding different modalities into semantic, geometric, and logical representations; 2) 3D Latent Cognition Graph structuring representations into dual-stream semantic-geometric graphs fused via cross-attention; 3) Cognition-Guided Latent Diffusion using the fused 3D cognition graph to guide latent diffusion for 3D Gaussian generation.
Result: Extensive experiments on Marble World Labs validation subset show Cog2Gen3D significantly outperforms existing methods in both semantic fidelity and geometric plausibility.
Conclusion: The unified framework with 3D cognition graph ensures physical plausibility and structural rationality of 3D generation, addressing limitations of previous methods through combined semantic and geometric guidance.
Abstract: Generative models have achieved success in producing semantically plausible 2D images, but it remains challenging in 3D generation due to the absence of spatial geometry constraints. Typically, existing methods utilize geometric features as conditions to enhance spatial awareness. However, these methods can only model relative relationships and are prone to scale inconsistency of absolute geometry. Thus, we argue that semantic information and absolute geometry empower 3D cognition, thereby enabling controllable 3D generation for the physical world. In this work, we propose Cog2Gen3D, a 3D cognition-guided diffusion framework for 3D generation. Our model is guided by three key designs: 1) Cognitive Feature Embeddings. We encode different modalities into semantic and geometric representations and further extract logical representations. 2) 3D Latent Cognition Graph. We structure different representations into dual-stream semantic-geometric graphs and fuse them via common-based cross-attention to obtain a 3D cognition graph. 3) Cognition-Guided Latent Diffusion. We leverage the fused 3D cognition graph as the condition to guide the latent diffusion process for 3D Gaussian generation. Under this unified framework, the 3D cognition graph ensures the physical plausibility and structural rationality of 3D generation. Moreover, we construct a validation subset based on the Marble World Labs. Extensive experiments demonstrate that our Cog2Gen3D significantly outperforms existing methods in both semantic fidelity and geometric plausibility.
[139] VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction
Muhua Zhu, Xinhao Jin, Yu Zhang, Yifei Xue, Tie Ji, Yizhen Lao
Main category: cs.CV
TL;DR: VS3R is a video stabilization framework that combines 3D reconstruction with generative video diffusion to achieve robust, full-frame stabilization without aggressive cropping, outperforming state-of-the-art methods.
Details
Motivation: Video stabilization faces a fundamental trade-off: 2D methods suffer from aggressive cropping, while 3D techniques fail under extreme motions due to fragile optimization pipelines. There's a need to bridge this gap between geometric robustness and full-frame consistency.
Method: VS3R synergizes feed-forward 3D reconstruction with generative video diffusion. It jointly estimates camera parameters, depth, and masks for reliability, uses a Hybrid Stabilized Rendering module to fuse semantic and geometric cues for dynamic consistency, and employs a Dual-Stream Video Diffusion Model to restore disoccluded regions and rectify artifacts by combining structural guidance with semantic anchors.
Result: VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in both robustness and visual quality.
Conclusion: The proposed VS3R framework successfully bridges the gap between geometric robustness and full-frame consistency in video stabilization by combining 3D reconstruction with generative video diffusion, offering a robust solution that works across diverse scenarios.
Abstract: Video stabilization aims to mitigate camera shake but faces a fundamental trade-off between geometric robustness and full-frame consistency. While 2D methods suffer from aggressive cropping, 3D techniques are often undermined by fragile optimization pipelines that fail under extreme motions. To bridge this gap, we propose VS3R, a framework that synergizes feed-forward 3D reconstruction with generative video diffusion. Our pipeline jointly estimates camera parameters, depth, and masks to ensure all-scenario reliability, and introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency. Finally, a Dual-Stream Video Diffusion Model restores disoccluded regions and rectifies artifacts by synergizing structural guidance with semantic anchors. Collectively, VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.
[140] TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis
Sijing Li, Zhongwei Qiu, Jiang Liu, Wenqiao Zhang, Tianwei Lin, Yihan Xie, Jianxiang An, Boxiang Yun, Chenglin Yang, Jun Xiao, Guangyu Guo, Jiawen Yao, Wei Liu, Yuan Gao, Ke Yan, Weiwei Cao, Zhilin Zheng, Tony C. W. Mok, Kai Cao, Yu Shi, Jiuyu Zhang, Jian Zhou, Beng Chin Ooi, Yingda Xia, Ling Zhang
Main category: cs.CV
TL;DR: TumorChain: A multimodal reasoning framework for clinical tumor analysis using 3D CT scans with Chain-of-Thought reasoning for step-by-step interpretation from imaging findings to clinical impressions and pathology predictions.
Details
Motivation: Accurate tumor analysis is crucial for clinical radiology and precision oncology, requiring early detection, reliable lesion characterization, and pathology-level risk assessment. Current approaches lack traceability and interpretability in the reasoning process from imaging findings to clinical conclusions.
Method: Proposes TumorChain, a multimodal interleaved reasoning framework that couples 3D imaging encoders with clinical text understanding and organ-level vision-language alignment. Uses cross-modal alignment and iterative interleaved causal reasoning to ground visual evidence, aggregate conclusions, and issue pathology predictions through multiple rounds of self-refinement.
Result: Shows consistent improvements over strong baselines in lesion detection, impression generation, and pathology classification. Demonstrates strong generalization on the DeepTumorVQA benchmark. The framework improves traceability and reduces hallucination risk in clinical tumor analysis.
Conclusion: Multimodal reasoning with Chain-of-Thought approaches enables reliable and interpretable tumor analysis in clinical practice, providing step-by-step traceability from imaging findings to pathology predictions while reducing diagnostic errors.
Abstract: Accurate tumor analysis is central to clinical radiology and precision oncology, where early detection, reliable lesion characterization, and pathology-level risk assessment guide diagnosis and treatment planning. Chain-of-Thought (CoT) reasoning is particularly important in this setting because it enables step-by-step interpretation from imaging findings to clinical impressions and pathology conclusions, improving traceability and reducing diagnostic errors. Here, we target the clinical tumor analysis task and build a large-scale benchmark that operationalizes a multimodal reasoning pipeline, spanning findings, impressions, and pathology predictions. We curate TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions paired with 3D CT scans, with step-aligned rationales and cross-modal alignments along the trajectory from findings to impression to pathology, enabling evaluation of both answer accuracy and reasoning consistency. We further propose TumorChain, a multimodal interleaved reasoning framework that tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment. Through cross-modal alignment and iterative interleaved causal reasoning, TumorChain grounds visual evidence, aggregates conclusions, and issues pathology predictions after multiple rounds of self-refinement, improving traceability and reducing hallucination risk. Experiments show consistent improvements over strong baselines in lesion detection, impression generation, and pathology classification, and demonstrate strong generalization on the DeepTumorVQA benchmark. These results highlight the potential of multimodal reasoning for reliable and interpretable tumor analysis in clinical practice. Detailed information about our project can be found on our project homepage at https://github.com/ZJU4HealthCare/TumorChain.
[141] PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues
Yukun Qi, Pei Fu, Hang Li, Yuhan Liu, Chao Jiang, Bin Qin, Zhenbo Luo, Jian Luan
Main category: cs.CV
TL;DR: PatchCue introduces a patch-based visual cue paradigm for VLMs that partitions images into patches for visual reasoning, outperforming pixel-level and point-based cues across multiple benchmarks.
Details
Motivation: Existing reasoning paradigms like Chain-of-Thought rely mainly on textual information and underutilize visual cues. While some work incorporates pixel-level visual cues, these require precise spatial localization and introduce learning complexity. There's a need for better visual reasoning that aligns with human perception and modern VLM architectures.
Method: Proposes PatchCue, a patch-based visual cue paradigm that partitions images into patches and represents cues at patch level. Uses two-stage training: 1) cold-start supervised fine-tuning to output patch-level cues, followed by 2) reinforcement learning with process-supervised cue reward to guide intermediate visual reasoning steps.
Result: Extensive experiments on multiple VLMs and diverse benchmarks (general visual question answering, complex reasoning, document understanding) show PatchCue consistently improves overall model performance. Patch-level cues outperform both pixel-level bounding boxes and point-based cues.
Conclusion: PatchCue provides a more effective and cognitively aligned visual reasoning paradigm that better leverages visual information in VLMs, addressing limitations of text-only reasoning and complex pixel-level approaches.
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.
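The contrast with pixel-level cues is easy to see in code: a patch-level cue only needs the indices of the patches a region covers, which are already the model's token granularity. The helper below shows this quantization; PatchCue's actual cue encoding is not specified here, and the function name and defaults are illustrative:

```python
def region_to_patch_ids(box, image_size=224, patch_size=16):
    """Map a pixel box (x0, y0, x1, y1) to the flat indices of the ViT
    patches it covers, row-major over a (image_size // patch_size) grid.
    This is the patch quantization behind a patch-level cue."""
    x0, y0, x1, y1 = box
    n = image_size // patch_size                      # patches per side
    c0, c1 = x0 // patch_size, min((x1 - 1) // patch_size, n - 1)
    r0, r1 = y0 // patch_size, min((y1 - 1) // patch_size, n - 1)
    return [r * n + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

# a 32x32-pixel region starting at (16, 16) covers a 2x2 block of 16px patches
ids = region_to_patch_ids((16, 16, 48, 48))
```

Because the cue is a small set of discrete indices rather than continuous coordinates, the model never has to regress precise pixel locations, which is the learning-complexity argument the paper makes against pixel-level cues.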
[142] Shifting Adaptation from Weight Space to Memory Space: A Memory-Augmented Agent for Medical Image Segmentation
Bowen Chen, Qiaohui Gao, Shaowen Wan, Shanhui Sun, Wei Liu, Xiang Li, Tianming Liu, Lin Zhao
Main category: cs.CV
TL;DR: MemSeg-Agent: A memory-augmented segmentation agent that enables few-shot learning, federated supervised learning, and test-time adaptation for medical image segmentation by shifting adaptation from weight space to memory space.
Details
Motivation: Medical image segmentation models often fail to generalize across institutions, scanners, or patient populations. Vision foundation models require task-specific fine-tuning which introduces substantial communication overhead in federated learning and prevents continuous knowledge evolution during deployment.
Method: Proposes a memory-augmented segmentation agent that conditions a fixed backbone with lightweight static, few-shot, and test-time working memories dynamically composed by an agentic controller. Updates compact memory units instead of model parameters in federated settings.
Result: Experiments on four public datasets demonstrate strong performance and robustness to domain shift. Static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning.
Conclusion: MemSeg-Agent introduces a new paradigm for scalable and adaptive medical image segmentation in the era of agentic AI, enabling efficient federated learning and continuous adaptation without weight updates.
Abstract: Medical image segmentation is fundamental to clinical workflows, yet models trained on a single dataset often fail to generalize across institutions, scanners, or patient populations. While vision foundation models have shown great promise in addressing this challenge, their deployment typically requires task-specific fine-tuning, which introduces substantial communication overhead in federated learning and prevents continuous knowledge evolution during deployment. In this work, we propose a memory-augmented segmentation agent (MemSeg-Agent) that shifts adaptation from weight space to memory space, enabling few-shot learning, federated supervised learning, and test-time adaptation within a unified architecture. MemSeg-Agent conditions a fixed backbone with lightweight static, few-shot, and test-time working memories, which are dynamically composed by an agentic controller. In federated settings, we update compact memory units instead of model parameters, substantially reducing communication overhead. Experiments on four public datasets demonstrate strong performance and robustness to domain shift: Static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning. Overall, MemSeg-Agent introduces a new paradigm for scalable and adaptive medical image segmentation in the era of agentic AI.
[143] Systematic Evaluation of Novel View Synthesis for Video Place Recognition
Muhammad Zawad Mahmud, Samiha Islam, Damian Lyons
Main category: cs.CV
TL;DR: Adding synthetic novel views improves Video Place Recognition (VPR): small additions boost recognition statistics, and for larger additions the number of views added and the type of dataset imagery matter more than the magnitude of viewpoint change.
Details
Motivation: Synthetic novel views can enhance robot navigation by enabling cross-view matching between ground and aerial robots, potentially improving place recognition across different viewpoints.
Method: Systematic evaluation using five public VPR image databases and seven typical image similarity methods to assess synthetic novel view generation and its impact on VPR performance.
Result: Small synthetic additions improve VPR recognition statistics; for larger additions, the number of views added and dataset imagery type matter more than viewpoint change magnitude.
Conclusion: Synthetic novel views benefit VPR, with the largest gains from small additions; the number of views and the dataset's imagery type matter more than the magnitude of viewpoint change.
Abstract: The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.
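Whichever of the seven similarity methods is used, the retrieval step in VPR reduces to nearest-descriptor search, and adding synthetic views simply enlarges the database under existing place labels. A minimal cosine-similarity sketch (illustrative only, not the paper's pipeline; descriptors here are random stand-ins):

```python
import numpy as np

def top1_place(query_desc, db_descs):
    """Return (index, score) of the most similar database descriptor by cosine similarity."""
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q
    return int(np.argmax(sims)), float(np.max(sims))

# database of 3 place descriptors; synthetic novel views would be
# appended as extra rows carrying the same place label as their source
rng = np.random.default_rng(0)
db = rng.normal(size=(3, 8))
query = db[1] + 0.01 * rng.normal(size=8)  # slightly perturbed view of place 1
match, score = top1_place(query, db)
```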
[144] CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis
Qiwei Wang, Xianghui Ze, Jingyi Yu, Yujiao Shi
Main category: cs.CV
TL;DR: CylinderSplat introduces a cylindrical Triplane representation for feed-forward 3D Gaussian Splatting that better handles panoramic imagery by aligning with 360° scene geometry and Manhattan-world assumptions, using a dual-branch architecture for robust novel view synthesis.
Details
Motivation: Existing 3D Gaussian Splatting methods struggle with panoramic imagery due to poor geometric alignment with 360° scenes and difficulty handling occlusions in sparse-view scenarios. Standard volumetric representations like Cartesian Triplanes cause distortion and aliasing in panoramic data.
Method: Proposes CylinderSplat with a cylindrical Triplane representation aligned with panoramic data and Manhattan-world geometry. Uses a dual-branch architecture: pixel-based branch for well-observed regions and volume-based branch with cylindrical Triplane for occluded/sparse areas. Handles variable input views from single to multiple panoramas.
Result: Achieves state-of-the-art results in both single-view and multi-view panoramic novel view synthesis, outperforming previous methods in reconstruction quality and geometric accuracy.
Conclusion: CylinderSplat effectively addresses limitations of existing 3DGS methods for panoramic imagery through better geometric representation and flexible architecture, enabling high-quality novel view synthesis for 360° scenes.
Abstract: Feed-forward 3D Gaussian Splatting (3DGS) has shown great promise for real-time novel view synthesis, but its application to panoramic imagery remains challenging. Existing methods often rely on multi-view cost volumes for geometric refinement, which struggle to resolve occlusions in sparse-view scenarios. Furthermore, standard volumetric representations like Cartesian Triplanes are poor in capturing the inherent geometry of $360^\circ$ scenes, leading to distortion and aliasing. In this work, we introduce CylinderSplat, a feed-forward framework for panoramic 3DGS that addresses these limitations. The core of our method is a new {cylindrical Triplane} representation, which is better aligned with panoramic data and real-world structures adhering to the Manhattan-world assumption. We use a dual-branch architecture: a pixel-based branch reconstructs well-observed regions, while a volume-based branch leverages the cylindrical Triplane to complete occluded or sparsely-viewed areas. Our framework is designed to flexibly handle a variable number of input views, from single to multiple panoramas. Extensive experiments demonstrate that CylinderSplat achieves state-of-the-art results in both single-view and multi-view panoramic novel view synthesis, outperforming previous methods in both reconstruction quality and geometric accuracy.
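The coordinate change underlying a cylindrical triplane is straightforward, even though the summary does not spell out the exact parameterization. A sketch of mapping a 3D point to azimuth/height/radius, with the plane pairing in the comment stated as an assumption rather than the paper's definition:

```python
import math

def to_cylindrical(x, y, z):
    """(x, y, z) -> (theta, h, r): azimuth, height, radial distance.
    A cylindrical triplane would plausibly store feature maps on the
    (theta, h), (theta, r) and (h, r) planes and aggregate the three
    interpolated samples -- the azimuth axis wraps, matching a 360° panorama,
    whereas a Cartesian triplane has no such periodic axis."""
    r = math.hypot(x, z)        # distance from the vertical axis
    theta = math.atan2(z, x)    # azimuth in [-pi, pi]
    return theta, y, r
```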
[145] PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
Xiang Zhang, Sohyun Yoo, Hongrui Wu, Chuan Li, Jianwen Xie, Zhuowen Tu
Main category: cs.CV
TL;DR: PixARMesh is a method for autoregressive 3D indoor scene mesh reconstruction from single RGB images, producing complete, coherent meshes in one forward pass without post-processing.
Details
Motivation: Prior methods for 3D scene reconstruction rely on implicit signed distance fields and separate layout optimization, requiring post-hoc processing and lacking unified scene generation. There's a need for methods that can produce complete, artist-ready 3D meshes directly from single images in a coherent manner.
Method: PixARMesh uses a point-cloud encoder augmented with pixel-aligned image features and global scene context via cross-attention. It autoregressively generates scenes from a unified token stream containing context, pose, and mesh information, enabling joint prediction of object layout and geometry.
Result: Experiments on synthetic and real-world datasets show state-of-the-art reconstruction quality, producing lightweight, high-quality meshes ready for downstream applications. The method achieves coherent scene generation with high-fidelity geometry.
Conclusion: PixARMesh demonstrates that unified autoregressive mesh generation from single images is feasible and effective, producing artist-ready 3D scene meshes without post-processing while maintaining reconstruction quality and coherence.
Abstract: We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.
[146] InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
Yuxin Qin, Ke Cao, Haowei Liu, Ao Ma, Fengheng Li, Honghe Zhu, Zheng Zhang, Run Ling, Wei Feng, Xuanhua He, Zhanjie Zhang, Zhen Guo, Haoyi Bian, Jingjing Lv, Junjie Shen, Ching Law
Main category: cs.CV
TL;DR: InnoAds-Composer: A single-stage diffusion framework for e-commerce product poster generation with tri-conditional control over subject, text, and style using efficient token routing and text feature enhancement.
Details
Motivation: Current multi-stage diffusion models for e-commerce poster generation suffer from poor subject fidelity, inaccurate text rendering, and inconsistent style control. Existing approaches lack efficient simultaneous control over all three key elements (subject, text, style) in a single-stage framework.
Method: Proposes InnoAds-Composer with: 1) Tri-conditional control tokens for subject, glyph, and style; 2) Importance analysis over layers and timesteps to route conditions only to most responsive positions, reducing quadratic token overhead; 3) Text Feature Enhancement Module (TFEM) integrating features from both glyph images and crops for better Chinese text rendering; 4) A new high-quality e-commerce poster dataset with all three conditions.
Result: The method significantly outperforms existing product poster generation approaches without increasing inference latency. The new dataset and benchmark enable better training and evaluation of multi-condition poster synthesis.
Conclusion: InnoAds-Composer provides an efficient single-stage solution for high-quality e-commerce poster generation with precise control over subject, text, and style, addressing key limitations of multi-stage pipelines.
Abstract: E-commerce product poster generation aims to automatically synthesize a single image that effectively conveys product information by presenting a subject, text, and a designed style. Recent diffusion models with fine-grained and efficient controllability have advanced product poster synthesis, yet they typically rely on multi-stage pipelines, and simultaneous control over subject, text, and style remains underexplored. Such naive multi-stage pipelines also show three issues: poor subject fidelity, inaccurate text, and inconsistent style. To address these issues, we propose InnoAds-Composer, a single-stage framework that enables efficient tri-conditional control tokens over subject, glyph, and style. To alleviate the quadratic overhead introduced by naive tri-conditional token concatenation, we perform importance analysis over layers and timesteps and route each condition only to the most responsive positions, thereby shortening the active token sequence. Besides, to improve the accuracy of Chinese text rendering, we design a Text Feature Enhancement Module (TFEM) that integrates features from both glyph images and glyph crops. To support training and evaluation, we also construct a high-quality e-commerce product poster dataset and benchmark, which is the first dataset that jointly contains subject, text, and style conditions. Extensive experiments demonstrate that InnoAds-Composer significantly outperforms existing product poster methods without obviously increasing inference latency.
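Stripped to its core, the routing idea is to inject a condition's tokens only at the (layer, timestep) positions where the measured importance is highest, so the full token concatenation never runs everywhere. A toy top-k selection sketch; the paper's actual importance measure and routing rule are not given in this summary:

```python
import numpy as np

def route_condition(importance, k):
    """Given an importance score per (layer, timestep), return the k most
    responsive positions where a condition's tokens would be injected."""
    flat = np.argsort(importance, axis=None)[::-1][:k]  # descending by score
    return [tuple(pos) for pos in np.array(np.unravel_index(flat, importance.shape)).T]

imp = np.array([[0.1, 0.9, 0.2],
                [0.8, 0.3, 0.7]])  # rows: layers, cols: timesteps
active = route_condition(imp, k=3)  # condition tokens skipped elsewhere
```

Restricting attention to these positions shortens the active token sequence, which is where the quadratic-cost saving comes from.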
[147] Mitigating Bias in Concept Bottleneck Models for Fair and Interpretable Image Classification
Schrasing Tong, Antoine Salaun, Vincent Yuan, Annabel Adeyeri, Lalana Kagal
Main category: cs.CV
TL;DR: Proposes three bias mitigation techniques for Concept Bottleneck Models (CBMs) to improve fairness in image classification while maintaining interpretability.
Details
Motivation: CBMs aim to enhance interpretability and fairness in image classification by mapping images to human-interpretable concepts, but they still leak sensitive information and show only marginal bias reduction in practice.
Method: Three techniques: 1) Top-k concept filter to decrease information leakage, 2) Removal of biased concepts, and 3) Adversarial debiasing applied to CBMs.
Result: Outperforms prior work in fairness-performance tradeoffs, showing significant improvement in fairness while maintaining classification performance.
Conclusion: The proposed debiased CBM represents a significant step toward achieving both fair and interpretable image classification systems.
Abstract: Ensuring fairness in image classification prevents models from perpetuating and amplifying bias. Concept bottleneck models (CBMs) map images to high-level, human-interpretable concepts before making predictions via a sparse, one-layer classifier. This structure enhances interpretability and, in theory, supports fairness by masking sensitive attribute proxies such as facial features. However, CBM concepts have been known to leak information unrelated to concept semantics and early results reveal only marginal reductions in gender bias on datasets like ImSitu. We propose three bias mitigation techniques to improve fairness in CBMs: 1. Decreasing information leakage using a top-k concept filter, 2. Removing biased concepts, and 3. Adversarial debiasing. Our results outperform prior work in terms of fairness-performance tradeoffs, indicating that our debiased CBM provides a significant step towards fair and interpretable image classification.
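Techniques 1 and 2 operate directly on the concept-activation vector and are simple enough to sketch; the snippet below is an illustrative implementation, not the authors' code (technique 3, adversarial debiasing, requires a training loop and is omitted):

```python
import numpy as np

def topk_concept_filter(concept_scores, k):
    """Keep only the k strongest concept activations per sample; zero the rest.
    Capping the active concepts limits the bandwidth through which leaked,
    non-semantic information can reach the downstream classifier."""
    scores = np.asarray(concept_scores, dtype=float)
    weakest = np.argsort(np.abs(scores))[:-k]  # everything outside the top-k
    filtered = scores.copy()
    filtered[weakest] = 0.0
    return filtered

def drop_biased_concepts(concept_scores, biased_idx):
    """Zero out concepts flagged as proxies for a sensitive attribute."""
    filtered = np.asarray(concept_scores, dtype=float).copy()
    filtered[list(biased_idx)] = 0.0
    return filtered

scores = np.array([0.9, -0.1, 0.4, -0.7, 0.05])
kept = topk_concept_filter(scores, k=2)  # only 0.9 and -0.7 survive
```

Because the downstream classifier is a sparse one-layer model, both filters can be applied at inference time without retraining the concept extractor.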
[148] CollabOD: Collaborative Multi-Backbone with Cross-scale Vision for UAV Small Object Detection
Xuecheng Bai, Yuxiang Wang, Chuanzhi Xu, Boyu Hu, Kang Han, Ruijie Pan, Xiaowei Niu, Xiaotian Guan, Liqiang Fu, Pengfei Ye
Main category: cs.CV
TL;DR: CollabOD is a lightweight collaborative detection framework for small object detection in UAV imagery that preserves structural details and aligns heterogeneous feature streams before multi-scale fusion.
Details
Motivation: Small object detection in UAV imagery faces challenges due to scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are weakened during hierarchical downsampling and cross-scale fusion, leading to unstable localization and reduced robustness.
Method: Proposes CollabOD framework with three strategies: 1) Structural Detail Preservation, 2) Cross-Path Feature Alignment, and 3) Localization-Aware Lightweight Design. Integrates these from image processing, channel structure, and lightweight design perspectives. Uses a unified detail-aware detection head for improved regression robustness.
Result: The framework enhances representation stability while maintaining efficient inference. The detail-aware detection head improves regression robustness without additional deployment overhead. Code is publicly available.
Conclusion: CollabOD effectively addresses small object detection challenges in UAV imagery by preserving structural details and aligning features before fusion, achieving better localization stability and robustness with lightweight design.
Abstract: Small object detection in unmanned aerial vehicle (UAV) imagery is challenging, mainly due to scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are further weakened during hierarchical downsampling and cross-scale fusion, resulting in unstable localization and reduced robustness. To address this issue, we propose CollabOD, a lightweight collaborative detection framework that explicitly preserves structural details and aligns heterogeneous feature streams before multi-scale fusion. The framework integrates Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies. From the perspectives of image processing, channel structure, and lightweight design, it optimizes the architecture of conventional UAV perception models. The proposed design enhances representation stability while maintaining efficient inference. A unified detail-aware detection head further improves regression robustness without introducing additional deployment overhead. The code is available at: https://github.com/Bai-Xuecheng/CollabOD.
[149] Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
Ping Chen, Zezhou Chen, Xingpeng Zhang, Yanlin Qian, Huan Hu, Xiang Liu, Zipeng Wang, Xin Wang, Zhaoxiang Liu, Kai Wang, Shiguo Lian
Main category: cs.CV
TL;DR: Art3D: A new paradigm for 2D-to-3D conversion focusing on artistic disparity synthesis rather than geometric accuracy, aiming to replicate professional 3D cinema’s immersive and emotionally resonant experience.
Details
Motivation: Current 2D-to-3D conversion methods achieve geometric accuracy but fail artistically, missing deliberate artistic intent like strategic zero-plane shifts for pop-out effects and local depth sculpting that characterize professional 3D cinema.
Method: Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes), learning from professional 3D film data via indirect supervision. Introduces a preliminary evaluation method to quantify cinematic alignment.
Result: Experiments show Art3D demonstrates potential in replicating key local out-of-screen effects and aligning with global depth styles of cinematic 3D content, laying groundwork for artistically-driven conversion tools.
Conclusion: Proposes Artistic Disparity Synthesis as a new paradigm shift from physically accurate disparity estimation to artistically coherent disparity synthesis for 2D-to-3D conversion.
Abstract: Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because geometric reconstruction paradigms mistake deliberate artistic intent, such as strategic zero-plane shifts for pop-out effects and local depth sculpting, for data noise or ambiguity. This paper argues for a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose Art3D, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show our approach demonstrates potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically-driven conversion tools.
[150] Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
Zidian Qiu, Ancong Wu
Main category: cs.CV
TL;DR: Pano3DComposer: A feed-forward framework for generating 3D scenes from panoramic images using object-world transformation prediction and coarse-to-fine alignment.
Details
Motivation: Current image-to-3D scene generation methods are either too slow (iterative optimization) or inflexible (joint object-layout generation), and most rely on limited field-of-view perspective images, making complete 360-degree environment creation difficult.
Method: Proposes Pano3DComposer with: 1) Object-World Transformation Predictor that converts 3D objects from local to world coordinates using adapted VGGT architecture (Alignment-VGGT); 2) Coarse-to-Fine alignment mechanism for unseen domains that iteratively refines geometric consistency; 3) Uses pseudo-geometric supervision to handle shape discrepancies between generated and ground-truth objects.
Result: Achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets, generating high-fidelity 3D scenes in approximately 20 seconds on an RTX 4090 GPU.
Conclusion: Pano3DComposer provides an efficient feed-forward solution for panoramic image-to-3D scene generation, addressing limitations of existing methods through decoupled object generation and layout estimation with transformation prediction.
Abstract: Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete 360-degree environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geometric supervision to address the shape discrepancy between generated and ground-truth objects. For input images from unseen domains, we further introduce a Coarse-to-Fine (C2F) alignment mechanism for Pano3DComposer that iteratively refines geometric consistency with feedback of scene rendering. Our method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets. It can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. Project page: https://qiuzidian.github.io/pano3dcomposer-page/.
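What the Object-World Transformation Predictor emits is, in essence, a similarity transform placing each generated object into world coordinates. A minimal sketch of applying such a predicted rotation R, translation t, and scale s to object points (illustrative, not Alignment-VGGT itself, which predicts the transform rather than applying it):

```python
import numpy as np

def local_to_world(points, R, t, s=1.0):
    """Apply a similarity transform per point: p_world = s * R @ p_local + t."""
    return s * (np.asarray(points) @ np.asarray(R).T) + np.asarray(t)

# place an object 2 units forward along z, rotated 90 degrees about the y axis
R_y90 = np.array([[ 0.0, 0.0, 1.0],
                  [ 0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0]])
pts = np.array([[1.0, 0.0, 0.0]])  # a point on the object's local x-axis
world = local_to_world(pts, R_y90, t=[0.0, 0.0, 2.0])
```

Decoupling object generation (which produces `points` in local coordinates) from layout (the `R, t, s` prediction) is what lets off-the-shelf image-to-3D models plug into the pipeline.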
[151] CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning
Yuxin Xie, Yuming Chen, Yishan Yang, Yi Zhou, Tao Zhou, Zhen Zhao, Jiacheng Liu, Huazhu Fu
Main category: cs.CV
TL;DR: CORE-Seg: A reasoning-driven medical image segmentation framework using MLLMs with Chain-of-Thought reasoning for complex lesions, achieving SOTA results on the ComLesion-14K benchmark.
Details
Motivation: Medical image segmentation needs to evolve from visual pattern matching to cognitive reasoning analysis. Current MLLMs lack specialized visual reasoning for complex lesions, while traditional segmentation models lack logical interpretability.
Method: Proposes CORE-Seg framework with Semantic-Guided Prompt Adapter to integrate reasoning with segmentation. Uses progressive training from SFT to GRPO with adaptive dual-granularity reward mechanism to mitigate reward sparsity.
Result: Achieves state-of-the-art with mean Dice of 37.06% (14.89% higher than second-best baseline) and reduces failure rate to 18.42% on ComLesion-14K benchmark.
Conclusion: Demonstrates that integrating reasoning capabilities with segmentation through MLLMs significantly improves performance on complex medical lesion segmentation tasks.
Abstract: Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. In this paper, we introduce ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation. To accomplish this task, we propose CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter. We design a progressive training strategy from SFT to GRPO, equipped with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. Our method achieves state-of-the-art results with a mean Dice of 37.06% (14.89% higher than the second-best baseline), while reducing the failure rate to 18.42%. Project Page: https://xyxl024.github.io/CORE-Seg.github.io/
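The headline numbers here are Dice scores; for reference, the Dice coefficient over binary masks is computed as follows (a standard implementation, not the authors' evaluation code):

```python
import numpy as np

def dice(pred, gt, eps=1e-6):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|).
    The eps term keeps empty-vs-empty masks at a score of ~1 instead of 0/0."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt   = np.array([[1, 0, 0],
                 [0, 1, 1]])
score = dice(pred, gt)  # 2*2 / (3+3) = 0.666...
```

A mean Dice of 37.06% on these complex lesions, with the best baseline near 22%, signals how hard the benchmark is relative to typical organ-segmentation tasks where Dice above 90% is common.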
[152] BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation
Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xilin Zhao, Xiaochun Cao, Qingming Huang
Main category: cs.CV
TL;DR: BlackMirror is a novel black-box detection framework for backdoored text-to-image models that identifies partial semantic pattern manipulation rather than relying on image-level similarity, achieving accurate detection across diverse backdoor attacks.
Details
Motivation: Existing backdoor detection methods for text-to-image models rely on image-level similarity and assume backdoor-triggered generations exhibit strong consistency, but they fail with emerging attacks where backdoored images appear visually diverse. The authors observed that across attacks, only partial semantic patterns are steadily manipulated while other content remains diverse or benign.
Method: BlackMirror consists of two components: MirrorMatch aligns visual patterns with corresponding instructions to detect semantic deviations, and MirrorVerify evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. It’s a training-free, plug-and-play framework for Model-as-a-Service applications.
Result: Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of backdoor attacks, outperforming existing methods that rely on image-level similarity analysis.
Conclusion: BlackMirror provides an effective black-box detection framework for backdoored text-to-image models by focusing on partial semantic pattern manipulation rather than image-level consistency, offering a practical solution for Model-as-a-Service security.
Abstract: This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.
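The summary gives no formulas for MirrorMatch or MirrorVerify, but the stability intuition can be caricatured: a genuine backdoor yields an instruction-response deviation that is both high and consistent across prompt variations, while benign deviations come and go. A toy illustration only, with thresholds and scoring invented for the example:

```python
import numpy as np

def is_backdoor(deviations, mean_thresh=0.5, std_thresh=0.15):
    """Flag a trigger candidate whose semantic deviation is both large
    (high mean) and stable (low spread) across many prompt variations."""
    d = np.asarray(deviations, dtype=float)
    return bool(d.mean() > mean_thresh and d.std() < std_thresh)

stable_high = [0.82, 0.79, 0.85, 0.81]  # same manipulated pattern every time
benign      = [0.10, 0.65, 0.30, 0.20]  # deviation fluctuates with the prompt
```

The point of checking stability rather than image similarity is that it tolerates attacks whose backdoored generations are visually diverse, as long as the manipulated semantic pattern itself recurs.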
[153] RAC: Rectified Flow Auto Coder
Sen Fang, Yalin Feng, Yanxin Zhang, Dimitris N. Metaxas
Main category: cs.CV
TL;DR: RAC replaces traditional VAEs with a rectified flow-based auto-coder that enables multi-step decoding, bidirectional inference, and improved generation quality with lower computational cost.
Details
Motivation: To address limitations of traditional VAEs including the reconstruction-generation gap, computational inefficiency, and lack of refinement capabilities during generation.
Method: Proposes Rectified Flow Auto Coder (RAC) that uses rectified flow principles to create a straight decoding path, enables multi-step decoding with step-by-step refinement, supports bidirectional inference through time reversal (decoder serves as encoder), and reduces parameters by nearly 41%.
Result: RAC surpasses state-of-the-art VAEs in both reconstruction and generation quality with approximately 70% lower computational cost.
Conclusion: RAC provides an efficient alternative to traditional VAEs with improved generation capabilities, bidirectional inference, and significant computational savings.
Abstract: In this paper, we propose a Rectified Flow Auto Coder (RAC) inspired by Rectified Flow to replace the traditional VAE: 1. It achieves multi-step decoding by applying the decoder to flow timesteps. Its decoding path is straight and correctable, enabling step-by-step refinement. 2. The model inherently supports bidirectional inference, where the decoder serves as the encoder through time reversal (hence Coder rather than encoder or decoder), reducing parameter count by nearly 41%. 3. This generative decoding method improves generation quality since the model can correct latent variables along the path, partially addressing the reconstruction–generation gap. Experiments show that RAC surpasses SOTA VAEs in both reconstruction and generation with approximately 70% lower computational cost.
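Decoding along a rectified-flow path, and reusing the same velocity field backwards so the decoder doubles as the encoder, can be sketched with plain Euler integration. A toy with a constant "learned" velocity, where the straight-path property makes the time reversal exact (illustrative only; RAC's actual model and training are not shown here):

```python
import numpy as np

def decode(z, velocity, n_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 (latent) to t=1 (sample).
    Multiple steps allow step-by-step correction along the path."""
    x, dt = z.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + velocity(x, i * dt) * dt
    return x

def encode(x, velocity, n_steps=8):
    """Same field run in reverse, t=1 -> t=0: the 'decoder' acting as encoder."""
    z, dt = x.copy(), 1.0 / n_steps
    for i in range(n_steps):
        z = z - velocity(z, 1.0 - i * dt) * dt
    return z

# toy field: constant velocity => a perfectly straight transport path
v = lambda x, t: np.array([2.0, -1.0])
z0 = np.array([0.5, 0.5])
x1 = decode(z0, v)       # z0 + v, reached in 8 refinement steps
z_rec = encode(x1, v)    # time reversal recovers the latent
```

Sharing one network for both directions is what enables the reported ~41% parameter reduction relative to a separate encoder/decoder pair.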
[154] Towards Driver Behavior Understanding: Weakly-Supervised Risk Perception in Driving Scenes
Nakul Agarwal, Yi-Ting Chen, Behzad Dariush
Main category: cs.CV
TL;DR: RAID dataset for driver risk perception with 4,691 annotated video clips and a weakly supervised risk object identification framework that outperforms SOTA by 20-23%.
Details
Motivation: Zero-collision mobility requires understanding driver risk perception, which is shaped by driver responses to external stimuli and attentiveness of surrounding road users. Current datasets lack comprehensive annotations for studying these complex cognitive processes.
Method: Introduces RAID dataset with 4,691 annotated video clips covering diverse traffic scenarios. Proposes a weakly supervised risk object identification framework that models relationships between driver’s intended maneuvers and responses to identify risk sources. Also analyzes pedestrian attention in risk estimation.
Result: Method achieves 20.6% performance gain on RAID dataset and 23.1% gain on HDDS dataset compared to prior state-of-the-art approaches.
Conclusion: RAID dataset enables research on driver risk perception and contextual risk assessment. The proposed framework effectively identifies risk objects and demonstrates the importance of pedestrian attention in risk estimation.
Abstract: Achieving zero-collision mobility remains a key objective for intelligent vehicle systems, which requires understanding driver risk perception: a complex cognitive process shaped by the driver’s voluntary responses to external stimuli and the attentiveness of surrounding road users towards the ego-vehicle. To support progress in this area, we introduce RAID (Risk Assessment In Driving scenes), a large-scale dataset specifically curated for research on driver risk perception and contextual risk assessment. RAID comprises 4,691 annotated video clips, covering diverse traffic scenarios with labels for the driver’s intended maneuver, road topology, risk situations (e.g., crossing pedestrians), driver responses, and pedestrian attentiveness. Leveraging RAID, we propose a weakly supervised risk object identification framework that models the relationship between the driver’s intended maneuver and responses to identify potential risk sources. Additionally, we analyze the role of pedestrian attention in estimating risk and demonstrate the value of the proposed dataset. Experimental evaluations demonstrate that our method achieves 20.6% and 23.1% performance gains over prior state-of-the-art approaches on the RAID and HDDS datasets, respectively.
[155] Facial Expression Recognition Using Residual Masking Network
Luan Pham, The Huynh Vu, Tuan Anh Tran
Main category: cs.CV
TL;DR: A novel Residual Masking Network that combines Deep Residual Network with Unet-like architecture and attention mechanism for facial expression recognition, achieving state-of-the-art accuracy on FER2013 and VEMO datasets.
Details
Motivation: To improve facial expression recognition (FER) performance by developing a deep architecture with an attention mechanism that can focus on relevant facial features while ignoring irrelevant information.
Method: Proposes a Residual Masking Network that combines a Deep Residual Network with a Unet-like segmentation architecture. Uses a masking mechanism to refine feature maps, enabling the network to focus on relevant facial regions for expression recognition.
Result: Achieves state-of-the-art accuracy on both the public FER2013 dataset and a private VEMO dataset, demonstrating superior performance over existing methods.
Conclusion: The proposed Residual Masking Network with attention mechanism effectively improves facial expression recognition by focusing on relevant facial features, achieving top performance on benchmark datasets.
Abstract: Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improve FER tasks, this paper focuses on deep architectures with an attention mechanism. We propose a novel Masking idea to boost the performance of CNNs on the facial expression recognition task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and a Unet-like architecture to produce a Residual Masking Network. The proposed method holds state-of-the-art (SOTA) accuracy on the well-known FER2013 and private VEMO datasets. The source code is available at https://github.com/phamquiluan/ResidualMaskingNetwork.
[156] Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
Hongwei Fang, Jiahang Cai, Xun Wang, Wenwu Yang
Main category: cs.CV
TL;DR: TAR-ViTPose is a temporal Vision Transformer for video-based 2D human pose estimation that aggregates temporal cues across frames using joint-centric queries and restores them to enhance pose representation.
Details
Motivation: Existing ViT-based pose estimators process frames independently, ignoring temporal coherence in videos, leading to unstable predictions in challenging scenes with motion blur, occlusion, or defocus.
Method: Proposes TAR-ViTPose with two key components: 1) Joint-centric Temporal Aggregation (JTA) that assigns each joint a learnable query token to selectively attend to corresponding regions in neighboring frames, and 2) Global Restoring Attention (GRA) to restore aggregated temporal features back into the current frame’s token sequence while preserving global context.
Result: Achieves +2.3 mAP gain over single-frame baseline ViTPose on PoseTrack2017 benchmark, outperforms existing state-of-the-art video-based methods, and achieves higher runtime frame rate in real-world applications.
Conclusion: TAR-ViTPose effectively leverages temporal information in videos for more robust and accurate 2D human pose estimation while maintaining computational efficiency.
Abstract: Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Project page: https://github.com/zgspose/TARViTPose.
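The joint-centric aggregation step can be sketched as a single cross-attention in which each joint owns one learnable query. This is a simplified illustration: separate key/value projections, multi-head splitting, and the GRA restoring step are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_temporal_aggregation(joint_queries, frame_tokens):
    # joint_queries: (J, d), one learnable query per joint (the JTA idea)
    # frame_tokens:  (T, N, d), patch tokens of T neighboring frames
    T, N, d = frame_tokens.shape
    keys = frame_tokens.reshape(T * N, d)
    attn = softmax(joint_queries @ keys.T / np.sqrt(d))  # (J, T*N)
    return attn @ keys  # (J, d) temporally aggregated per-joint features
```

Each query attends over all tokens of all neighboring frames at once, so a joint can pull in evidence from frames where it is not blurred or occluded.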
[157] FTSplat: Feed-forward Triangle Splatting Network
Xiong Jinlin, Li Can, Shen Jiawei, Qi Zhigang, Sun Lei, Zhao Dongyang
Main category: cs.CV
TL;DR: A feed-forward framework for generating triangle primitives from multi-view images that produces simulation-ready 3D models without per-scene optimization.
Details
Motivation: Current 3D reconstruction methods like NeRF and 3D Gaussian Splatting achieve high quality but require time-consuming per-scene optimization and lack explicit, manifold geometry needed for direct simulation in robotics and graphics applications.
Method: Proposes a feed-forward framework with pixel-aligned triangle generation module and relative 3D point cloud supervision to directly predict continuous triangle surfaces from calibrated multi-view images in a single forward pass.
Result: Achieves efficient reconstruction while maintaining seamless compatibility with standard graphics and robotic simulators, producing simulation-ready models without per-scene optimization or post-processing.
Conclusion: The method enables real-time deployment of high-fidelity 3D reconstruction for robotics and simulation applications by providing explicit triangle geometry in a feed-forward manner.
Abstract: High-fidelity three-dimensional (3D) reconstruction is essential for robotics and simulation. While Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) achieve impressive rendering quality, their reliance on time-consuming per-scene optimization limits real-time deployment. Emerging feed-forward Gaussian splatting methods improve efficiency but often lack explicit, manifold geometry required for direct simulation. To address these limitations, we propose a feed-forward framework for triangle primitive generation that directly predicts continuous triangle surfaces from calibrated multi-view images. Our method produces simulation-ready models in a single forward pass, obviating the need for per-scene optimization or post-processing. We introduce a pixel-aligned triangle generation module and incorporate relative 3D point cloud supervision to enhance geometric learning stability and consistency. Experiments demonstrate that our method achieves efficient reconstruction while maintaining seamless compatibility with standard graphics and robotic simulators.
[158] Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models
Jialuo He, Huangxun Chen
Main category: cs.CV
TL;DR: E-AdaPrune: Energy-driven adaptive pruning for VLMs that dynamically allocates token budgets based on image information density using singular value spectrum analysis.
Details
Motivation: Existing visual token reduction methods use fixed budgets across all inputs, ignoring variation in image information density, leading to inefficient compression or information loss.
Method: Energy-driven adaptive pruning framework that determines token budget from singular value spectrum of visual features, preserving certain proportion of spectral energy without additional learnable parameters.
Result: Evaluated on 9 benchmarks with 3 VLM backbones, achieves average improvement up to 0.6% under matched token budgets, including +5.1% relative boost on MMVet reasoning task with only 8ms latency overhead.
Conclusion: E-AdaPrune provides effective adaptive token pruning for VLMs, improving performance while maintaining efficiency through spectral energy preservation.
Abstract: Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
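The spectral-energy budgeting idea can be sketched in a few lines. This is an illustrative reading of the abstract: the exact token-scoring rule and the randomized SVD used for speed are simplified away.

```python
import numpy as np

def adaptive_token_budget(features, energy_ratio=0.9):
    # features: (num_tokens, dim) visual token matrix for one image.
    # Keep the smallest number of tokens whose singular-value energy
    # reaches `energy_ratio` of the total spectral energy.
    s = np.linalg.svd(features, compute_uv=False)  # descending order
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, energy_ratio) + 1)
```

A near-rank-1 feature matrix (a visually flat scene) gets a tiny budget, while a flat spectrum (an information-dense scene) keeps most of its tokens.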
[159] OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving
Kota Shimomura, Masaki Nambata, Atsuya Ishikawa, Ryota Mimura, Takayuki Kawabuchi, Takayoshi Yamashita, Koki Inoue
Main category: cs.CV
TL;DR: OD-RASE: A framework using ontology and large visual language models to detect accident-prone road structures and generate infrastructure improvement proposals for autonomous driving safety.
Details
Motivation: Autonomous driving systems struggle with rare situations and complex road structures designed for humans. Current safety improvements are reactive (after accidents), but autonomous systems need proactive risk mitigation. There's a need to connect accident analysis with infrastructure development.
Method: 1) Formalize ontology based on road traffic domain knowledge; 2) Use large-scale visual language model (LVLM) to generate infrastructure improvement proposals; 3) Apply ontology-driven data filtering for reliability; 4) Automatically annotate improvement proposals on pre-accident images to build dataset; 5) Develop OD-RASE model combining LVLM and diffusion model to generate both improvement proposals and improved road images.
Result: Ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and corresponding improvement plans. The framework successfully connects accident analysis with infrastructure development.
Conclusion: OD-RASE contributes to traffic environment safety and advances autonomous driving adoption by enabling proactive infrastructure improvements based on visual analysis of road structures.
Abstract: Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Because such road infrastructures are designed for human drivers, safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce the Baseline approach (OD-RASE model), which leverages LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.
[160] SLER-IR: Spherical Layer-wise Expert Routing for All-in-One Image Restoration
Peng Shurui, Xin Lin, Shi Luo, Jincen Ou, Dizhe Zhang, Lu Qi, Truong Nguyen, Chao Ren
Main category: cs.CV
TL;DR: SLER-IR is a unified image restoration framework using spherical layer-wise expert routing with contrastive degradation embeddings and global-local fusion for handling diverse degradations.
Details
Motivation: Image restoration under diverse degradations is challenging for unified frameworks due to feature interference and insufficient expert specialization. Existing methods struggle with geometry bias in degradation embeddings and spatially non-uniform degradations.
Method: Proposes SLER-IR with: 1) Spherical Uniform Degradation Embedding using contrastive learning to map degradation representations onto a hypersphere, 2) Layer-wise expert routing that dynamically activates specialized experts across network layers, and 3) Global-Local Granularity Fusion module to integrate global semantics and local degradation cues.
Result: Experiments on three-task and five-task benchmarks show consistent improvements over state-of-the-art methods in both PSNR and SSIM metrics.
Conclusion: SLER-IR effectively addresses feature interference and expert specialization issues in unified image restoration frameworks through spherical degradation embeddings and adaptive expert routing.
Abstract: Image restoration under diverse degradations remains challenging for unified all-in-one frameworks due to feature interference and insufficient expert specialization. We propose SLER-IR, a spherical layer-wise expert routing framework that dynamically activates specialized experts across network layers. To ensure reliable routing, we introduce a Spherical Uniform Degradation Embedding with contrastive learning, which maps degradation representations onto a hypersphere to eliminate geometry bias in linear embedding spaces. In addition, a Global-Local Granularity Fusion (GLGF) module integrates global semantics and local degradation cues to address spatially non-uniform degradations and the train-test granularity gap. Experiments on three-task and five-task benchmarks demonstrate that SLER-IR achieves consistent improvements over state-of-the-art methods in both PSNR and SSIM. Code and models will be publicly released.
[161] Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models
Siyuan Yang, Jun Liu, Hao Cheng, Chong Wang, Shijian Lu, Hedvig Kjellstrom, Weisi Lin, Alex C. Kot
Main category: cs.CV
TL;DR: Skeleton-to-Image Encoding (S2I) transforms 3D human skeleton sequences into image-like data to enable vision-pretrained models for self-supervised skeleton representation learning, addressing format heterogeneity and dataset scarcity issues.
Details
Motivation: Vision-pretrained models show strong capabilities but can't be directly applied to 3D skeleton data due to format differences. There's also scarcity of large-scale skeleton datasets and challenges in incorporating skeleton data into multi-modal action recognition without adding model branches.
Method: Proposes Skeleton-to-Image Encoding (S2I) that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions, enabling use of vision-pretrained models.
Result: Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets demonstrate effectiveness and generalizability for self-supervised skeleton representation learning, including challenging cross-format evaluation settings.
Conclusion: S2I provides a unified image-like format that accommodates heterogeneous skeleton data and enables transfer of visual-domain knowledge to skeleton analysis using existing vision-pretrained models.
Abstract: Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. While existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, they overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation offers a unified image-like format that naturally accommodates heterogeneous skeleton data. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of our method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.
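The encoding can be sketched as a reshape-and-resize. This is a hedged toy: the body-part grouping, the min-max normalization, and the nearest-neighbor resize below are plausible choices for illustration, not the paper's exact layout.

```python
import numpy as np

def skeleton_to_image(seq, part_groups, size=(224, 224)):
    # seq: (T, J, 3) joint coordinates over T frames.
    # part_groups: joint indices grouped by body-part semantics
    # (the grouping passed in here is a placeholder, not the paper's).
    order = [j for part in part_groups for j in part]
    grid = seq[:, order, :].astype(float)            # time x joints x xyz
    grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)
    H, W = size
    rows = np.arange(H) * grid.shape[0] // H          # nearest-neighbor
    cols = np.arange(W) * grid.shape[1] // W          # resize indices
    return grid[rows][:, cols]                        # (H, W, 3) "image"
```

The three coordinate channels play the role of RGB, so the result can be fed straight into an ImageNet-style vision-pretrained backbone.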
[162] Adaptive Radial Projection on Fourier Magnitude Spectrum for Document Image Skew Estimation
Luan Pham, Phu Hao Hoang, Xuan Toan Mai, Tuan Anh Tran
Main category: cs.CV
TL;DR: A novel skew estimation method using Adaptive Radial Projection on Fourier magnitude spectrum, with a new dataset DISE-2021 for evaluation.
Details
Motivation: Skew estimation is crucial for document processing systems, especially for scanned documents, as it directly impacts subsequent processing steps. There's a need for robust methods in the digitization age.
Method: Proposes a method that extracts dominant skew angle by applying Adaptive Radial Projection on the 2D Discrete Fourier Magnitude spectrum of document images.
Result: The proposed method is robust, reliable, and outperforms all compared methods in skew estimation tasks.
Conclusion: The paper presents an effective Fourier-based skew estimation approach with a new benchmark dataset, advancing document image processing capabilities.
Abstract: Skew estimation is one of the vital tasks in document processing systems, especially for scanned document images, because its performance directly impacts subsequent steps. Over the years, a large body of research has focused on this challenging problem amid the rise of the digitization age. In this research, we first propose a novel skew estimation method that extracts the dominant skew angle of a given document image by applying an Adaptive Radial Projection on the 2D Discrete Fourier Magnitude spectrum. Second, we introduce a high-quality skew estimation dataset, DISE-2021, to assess the performance of different estimators. Finally, we provide comprehensive analyses that focus on multiple improvement aspects of Fourier-based methods. Our results show that the proposed method is robust, reliable, and outperforms all compared methods. The source code is available at https://github.com/phamquiluan/jdeskew.
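A plain, non-adaptive radial projection conveys the core idea: text lines produce a dominant ridge in the centered magnitude spectrum, and candidate angles are scored by summing magnitudes along the corresponding ray. The paper's adaptive weighting is not reproduced here; the search range and step are illustrative.

```python
import numpy as np

def estimate_skew(image, angle_range=15.0, step=0.1):
    # Score each candidate angle by summing Fourier magnitudes along a
    # ray through the spectrum center; horizontal text lines skewed by
    # angle `a` produce a spectral ridge near 90 degrees + a.
    mag = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    h, w = mag.shape
    cy, cx = h // 2, w // 2
    radii = np.arange(1, min(cy, cx))  # skip the DC component
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-angle_range, angle_range + step, step):
        theta = np.deg2rad(90.0 + angle)
        ys = np.round(cy + radii * np.sin(theta)).astype(int)
        xs = np.round(cx + radii * np.cos(theta)).astype(int)
        score = mag[ys, xs].sum()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

On a synthetic page of horizontal "text lines" the estimate lands at zero skew; real documents would first be binarized, and the adaptive variant refines how the projection is weighted.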
[163] Imagine How To Change: Explicit Procedure Modeling for Change Captioning
Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen
Main category: cs.CV
TL;DR: ProCap reformulates change captioning from static image comparison to dynamic procedure modeling by learning latent temporal dynamics from keyframes and using procedure queries for caption generation.
Details
Motivation: Existing change captioning methods only compare static image pairs, ignoring the rich temporal dynamics of how changes occur. Understanding the change procedure is key to better describing not just what changed, but how it happened.
Method: Two-stage framework: 1) Train procedure encoder on sparse keyframes via caption-conditioned masked reconstruction to learn latent change dynamics. 2) Integrate encoder with decoder using learnable procedure queries (instead of explicit frames) for end-to-end caption training.
Result: Experiments on three datasets demonstrate ProCap’s effectiveness in change captioning by modeling dynamic procedures rather than static comparisons.
Conclusion: ProCap successfully bridges the gap between static change detection and dynamic procedure understanding, enabling more comprehensive descriptions of visual changes by modeling temporal dynamics.
Abstract: Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal dynamics of the change procedure, which is key to understanding not only what has changed but also how it occurs. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling. ProCap features a two-stage design: The first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes. These keyframes are obtained by automatically generating intermediate frames to make the implicit procedural dynamics explicit and then sampling them to mitigate redundancy. Then the encoder learns to capture the latent dynamics of these keyframes via a caption-conditioned, masked reconstruction task. The second stage integrates this trained encoder within an encoder-decoder model for captioning. Instead of relying on explicit frames from the previous stage, a process incurring computational overhead and sensitivity to visual noise, we introduce learnable procedure queries to prompt the encoder for inferring the latent procedure representation, which the decoder then translates into text. The entire model is then trained end-to-end with a captioning loss, ensuring the encoder’s output is both temporally coherent and captioning-aligned. Experiments on three datasets demonstrate the effectiveness of ProCap. Code and pre-trained models are available at https://github.com/BlueberryOreo/ProCap
[164] LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution
Song Fei, Tian Ye, Sixiang Chen, Zhaohu Xing, Jianyu Lai, Lei Zhu
Main category: cs.CV
TL;DR: LucidNFT is a multi-reward RL framework for flow-matching real-world image super-resolution that addresses faithfulness issues through degradation-robust semantic evaluation and improved advantage normalization.
Details
Motivation: Current generative real-world image super-resolution methods suffer from semantic and structural hallucination where outputs look sharp but are unfaithful to LR evidence. Preference-based RL is suitable but hindered by lack of degradation-robust faithfulness signals, optimization bottlenecks in rollout groups, and limited real degradation coverage.
Method: Proposes LucidNFT framework with three key components: 1) LucidConsistency - a degradation-robust semantic evaluator for LR-anchored faithfulness measurement, 2) decoupled advantage normalization strategy to preserve objective-wise contrasts and prevent advantage collapse, and 3) LucidLR - a large-scale collection of real-world degraded images for robust RL fine-tuning.
Result: LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs with stable optimization dynamics across diverse real-world scenarios.
Conclusion: The proposed multi-reward RL framework effectively addresses faithfulness issues in real-world image super-resolution through robust semantic evaluation, improved optimization strategies, and comprehensive real degradation datasets.
Abstract: Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence (semantic and structural hallucination), while such LR-anchored faithfulness is difficult to assess without HR ground truth. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidates to compare. However, effective alignment in Real-ISR is hindered by (i) the lack of a degradation-robust LR-referenced faithfulness signal, and (ii) a rollout-group optimization bottleneck where naive multi-reward scalarization followed by normalization compresses objective-wise contrasts, causing advantage collapse and weakening the reward-weighted updates in DiffusionNFT-style forward fine-tuning. Moreover, (iii) limited coverage of real degradations restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-robust semantic evaluator that makes LR-anchored faithfulness measurable and optimizable; a decoupled advantage normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion, preventing advantage collapse; and LucidLR, a large-scale collection of real-world degraded images to support robust RL fine-tuning. Experiments show that LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs with stable optimization dynamics across diverse real-world scenarios.
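The difference between naive scalarize-then-normalize and the decoupled strategy can be sketched as follows. This is a reading of the abstract, not the paper's exact estimator: when one objective has a much smaller numeric scale, scalarizing first lets the large-scale objective dominate, whereas per-objective normalization before fusion preserves each objective's contrasts.

```python
import numpy as np

def decoupled_advantages(rewards, weights):
    # rewards: (G, K) scores of K objectives for G rollouts of one LR
    # input. Normalize each objective ACROSS the group before fusing.
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return z @ np.asarray(weights, dtype=float)  # (G,) fused advantages

def naive_advantages(rewards, weights):
    # Scalarize first, then normalize: a small-scale objective's
    # contrasts are compressed away before they can influence updates.
    fused = rewards @ np.asarray(weights, dtype=float)
    return (fused - fused.mean()) / (fused.std() + 1e-8)
```

In the decoupled version, a rollout that wins on a tiny-scale objective still receives a visibly larger advantage, which is the collapse-prevention effect the abstract describes.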
[165] Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation
Hongli Liu, Yu Wang, Shengjie Zhao
Main category: cs.CV
TL;DR: VINE is a few-shot segmentation framework that addresses structural misalignment and cross-view inconsistency by jointly modeling structural consistency and foreground discrimination through spatial-view graphs and discriminative priors.
Details
Motivation: Few-shot segmentation faces challenges with structural misalignment and cross-view inconsistency under large appearance or viewpoint variations, which current methods struggle to handle effectively.
Method: VINE introduces a spatial-view graph on backbone features: spatial graph captures local geometric topology, view graph connects features from different perspectives. Uses discriminative prior from support-query feature discrepancy to capture category-specific contrast, reweights SAM features, and recalibrates backbone activations. Progressively integrates foreground-enhanced SAM features and structurally enriched ResNet features through masked cross-attention.
Result: Extensive experiments on multiple FSS benchmarks validate effectiveness and robustness, particularly under challenging scenarios with viewpoint shifts and complex structures.
Conclusion: VINE provides a unified framework that effectively addresses structural consistency and foreground discrimination in few-shot segmentation, demonstrating strong performance under challenging conditions.
Abstract: Few-shot segmentation (FSS) has gained significant attention for its ability to generalize to novel classes with limited supervision, yet remains challenged by structural misalignment and cross-view inconsistency under large appearance or viewpoint variations. This paper tackles these challenges by introducing VINE (View-Informed NEtwork), a unified framework that jointly models structural consistency and foreground discrimination to refine class-specific prototypes. Specifically, VINE introduces a spatial-view graph on backbone features, where the spatial graph captures local geometric topology and the view graph connects features from different perspectives to propagate view-invariant structural semantics. To further alleviate foreground ambiguity, we derive a discriminative prior from the support-query feature discrepancy to capture category-specific contrast, which reweights SAM features by emphasizing salient regions and recalibrates backbone activations for improved structural focus. The foreground-enhanced SAM features and structurally enriched ResNet features are progressively integrated through masked cross-attention, yielding class-consistent prototypes used as adaptive prompts for the SAM decoder to generate accurate masks. Extensive experiments on multiple FSS benchmarks validate the effectiveness and robustness of VINE, particularly under challenging scenarios with viewpoint shifts and complex structures. The code is available at https://github.com/HongliLiu1/VINE-main.
[166] MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs
Zhi Lei, Chenxi Liu, Hao Miao, Wanghui Qiu, Bin Yang, Chenjuan Guo
Main category: cs.CV
TL;DR: MM-ISTS is a multimodal framework using vision-text LLMs for irregularly sampled time series forecasting, bridging temporal, visual, and textual modalities through cross-modal encoding and alignment.
Details
Motivation: Existing irregularly sampled time series (ISTS) forecasting methods rely solely on historical observations, lacking contextual semantics and fine-grained temporal pattern learning. There's a need to leverage multimodal capabilities for better ISTS understanding.
Method: Two-stage encoding: 1) Cross-modal vision-text encoding generates informative images/text using MLLMs to capture temporal patterns; 2) ISTS encoding extracts enriched temporal features via multi-view embedding fusion and temporal-variable encoder. Includes adaptive query-based feature extractor to compress MLLM tokens and multimodal alignment with modality-aware gating.
Result: Extensive experiments on real data demonstrate the effectiveness of the proposed solutions for ISTS forecasting.
Conclusion: MM-ISTS successfully bridges temporal, visual, and textual modalities for improved irregularly sampled time series forecasting through multimodal LLM augmentation.
Abstract: Irregularly sampled time series (ISTS) are widespread in real-world scenarios, exhibiting asynchronous observations at uneven time intervals across variables. Existing ISTS forecasting methods often rely solely on historical observations to predict future ones, falling short in learning contextual semantics and fine-grained temporal patterns. To address these problems, we present MM-ISTS, a multimodal framework augmented by vision-text large language models that bridges temporal, visual, and textual modalities to facilitate ISTS forecasting. MM-ISTS encompasses a novel two-stage encoding mechanism. In particular, a cross-modal vision-text encoding module is proposed to automatically generate informative visual images and textual data, enabling the capture of intricate temporal patterns and comprehensive contextual understanding, in collaboration with multimodal LLMs (MLLMs). In parallel, ISTS encoding extracts complementary yet enriched temporal features from historical ISTS observations via multi-view embedding fusion and a temporal-variable encoder. Further, we propose an adaptive query-based feature extractor that compresses the learned tokens of MLLMs into a compact set of useful representations, which in turn reduces computational costs. In addition, a multimodal alignment module with modality-aware gating is designed to alleviate the modality gap across ISTS, images, and text. Extensive experiments on real data demonstrate the effectiveness of the proposed solutions.
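The adaptive query-based feature extractor described above can be sketched as a single cross-attention step: a small set of learnable queries attends over the full MLLM token sequence and distills it to a fixed, short sequence. The sketch below is illustrative only; the paper's learned projections, multi-head structure, and training are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_compress(mllm_tokens, queries):
    """Compress a long MLLM token sequence (T, d) down to len(queries)
    summary tokens via cross-attention (projections omitted for brevity)."""
    attn = softmax(queries @ mllm_tokens.T / np.sqrt(queries.shape[1]))
    return attn @ mllm_tokens  # shape: (num_queries, d)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 16))   # 256 MLLM tokens of width 16
queries = rng.normal(size=(8, 16))    # 8 learnable queries
compressed = query_compress(tokens, queries)
```

Whatever the exact architecture, the computational saving comes from downstream modules operating on 8 tokens instead of 256.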
[167] OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen
Main category: cs.CV
TL;DR: OVGGT enables efficient 3D reconstruction from streaming video using constant memory and compute via self-selective KV cache compression and dynamic anchor protection.
Details
Motivation: Current geometric foundation models for 3D reconstruction from video have quadratic attention costs limiting them to short sequences, while causal-attention variants accumulate unbounded KV cache memory, preventing long-horizon streaming deployment.
Method: Training-free framework combining Self-Selective Caching (using FFN residual magnitudes to compress KV cache while maintaining FlashAttention compatibility) and Dynamic Anchor Protection (shielding coordinate-critical tokens from eviction to prevent geometric drift).
Result: OVGGT processes arbitrarily long videos within constant VRAM while achieving state-of-the-art 3D geometric accuracy on indoor, outdoor, and ultra-long sequence benchmarks.
Conclusion: OVGGT enables practical long-horizon 3D reconstruction from streaming video with bounded computational resources, addressing key limitations of existing geometric foundation models.
Abstract: Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
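The two mechanisms compose naturally: score cached tokens by FFN residual magnitude, then evict the lowest-scoring ones down to a fixed budget while exempting anchor tokens. The function below is a hypothetical, numpy-level sketch of that policy; the real system operates on per-layer transformer KV tensors under FlashAttention, which this does not model.

```python
import numpy as np

def compress_kv_cache(keys, values, residual_norms, anchor_mask, budget):
    """Evict low-importance cache entries down to a fixed budget.

    keys, values:    (T, d) cached arrays for T tokens.
    residual_norms:  (T,) FFN residual magnitudes used as importance scores.
    anchor_mask:     (T,) bool; coordinate-critical tokens never evicted.
    budget:          number of tokens to keep (constant memory envelope).
    """
    if keys.shape[0] <= budget:
        return keys, values
    scores = residual_norms.astype(float).copy()
    scores[anchor_mask] = np.inf              # anchors sort first, never evicted
    keep = np.argsort(-scores)[:budget]       # highest-importance tokens
    keep = np.sort(keep)                      # preserve temporal order
    return keys[keep], values[keep]

keys = np.arange(10, dtype=float).reshape(10, 1)
values = keys.copy()
norms = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.05, 0.6, 0.4, 0.5])
anchors = np.zeros(10, dtype=bool)
anchors[0] = True                             # token 0 is coordinate-critical
k2, v2 = compress_kv_cache(keys, values, norms, anchors, budget=4)
```

Because eviction is a pure index-selection on the cache, the attention kernel itself is untouched, which is presumably what keeps the method compatible with FlashAttention.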
[168] Exploring Open-Vocabulary Object Recognition in Images using CLIP
Wei Yu Chen, Ying Dai
Main category: cs.CV
TL;DR: A two-stage open-vocabulary object recognition framework using object segmentation followed by recognition, with CLIP-based or CNN/MLP-based encoding and SVD for shared representation space.
Details
Motivation: To address limitations of existing open-vocabulary object recognition methods: high system complexity, substantial training costs, and limited generalization. Current methods require complex retraining and labor-intensive annotation.
Method: Two-stage framework: 1) Object segmentation to crop object regions, 2) Recognition using either CLIP-based encoding (object-level image embeddings + category-level text embeddings) or CNN/MLP-based encoding (CNN feature maps + MLP alignment). Embeddings are concatenated and processed via SVD to construct shared representation space, with recognition via embedding similarity matching.
Result: Experiments on COCO, Pascal VOC, and ADE20K show training-free CLIP-based encoding without SVD achieves highest average AP, outperforming current state-of-the-art methods. CNN/MLP-based encoding shows potential for OVOR.
Conclusion: Proposed framework eliminates need for complex retraining and annotation, achieves state-of-the-art performance with CLIP-based encoding, and demonstrates potential of CNN/MLP-based encoding for open-vocabulary object recognition.
Abstract: To address the limitations of existing open-vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open-Vocabulary Object Recognition (OVOR) framework based on a streamlined two-stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor-intensive annotation. After cropping object regions, we generate object-level image embeddings alongside category-level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP-based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and processed via Singular Value Decomposition (SVD) to construct a shared representation space. Finally, recognition is performed through embedding similarity matching. Experiments on COCO, Pascal VOC, and ADE20K demonstrate that training-free, CLIP-based encoding without SVD achieves the highest average AP, outperforming current state-of-the-art methods. Simultaneously, the results highlight the potential of CNN/MLP-based image encoding for OVOR.
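The final recognition step, embedding similarity matching, is simple enough to sketch. Here precomputed unit-direction vectors stand in for the CLIP image and text encoders, which the sketch does not include; only the cosine-similarity matching itself is shown.

```python
import numpy as np

def recognize(object_embs, text_embs, labels):
    """Assign each cropped-object embedding the category whose text
    embedding is most cosine-similar (the framework's matching step)."""
    img = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = img @ txt.T                      # (num_objects, num_categories)
    return [labels[i] for i in sims.argmax(axis=1)]

# Toy example: three "categories" as orthogonal text embeddings.
text = np.eye(3)
objects = np.array([[0.9, 0.1, 0.0],        # close to category 0
                    [0.0, 0.2, 1.0]])       # close to category 2
preds = recognize(objects, text, ["cat", "dog", "car"])
```

Because the vocabulary is just a list of text embeddings, swapping in new category names requires no retraining, which is the training-free property the paper emphasizes.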
[169] Demystifying KAN for Vision Tasks: The RepKAN Approach
Minjong Cheon
Main category: cs.CV
TL;DR: RepKAN integrates CNNs with Kolmogorov-Arnold Networks (KANs) for interpretable remote sensing image classification, discovering class-specific spectral fingerprints and physical interaction manifolds.
Details
Motivation: Standard CNNs and Transformers for remote sensing image classification are uninterpretable black-boxes, lacking physical interpretability needed for Earth observation applications.
Method: Dual-path architecture combining Spatial Linear (CNN-like) and Spectral Non-linear (KAN-based) paths to autonomously discover class-specific spectral fingerprints and physical interaction manifolds.
Result: Outperforms state-of-the-art models on EuroSAT and NWPU-RESISC45 datasets while providing explicit physically interpretable reasoning.
Conclusion: RepKAN shows significant potential as a backbone for future interpretable visual foundation models in remote sensing.
Abstract: Remote sensing image classification is essential for Earth observation, yet standard CNNs and Transformers often function as uninterpretable black-boxes. We propose RepKAN, a novel architecture that integrates the structural efficiency of CNNs with the non-linear representational power of KANs. By utilizing a dual-path design – Spatial Linear and Spectral Non-linear – RepKAN enables the autonomous discovery of class-specific spectral fingerprints and physical interaction manifolds. Experimental results on the EuroSAT and NWPU-RESISC45 datasets demonstrate that RepKAN provides explicit physically interpretable reasoning while outperforming state-of-the-art models. These findings indicate that RepKAN holds significant potential to serve as the backbone for future interpretable visual foundation models.
[170] CR-QAT: Curriculum Relational Quantization-Aware Training for Open-Vocabulary Object Detection
Jinyeong Park, Donghwa Kim, Brent ByungHoon Kang, Hyeongboo Baek, Jibum Kim
Main category: cs.CV
TL;DR: CR-QAT framework combines curriculum quantization-aware training with text-centric relational knowledge distillation to preserve vision-language alignment in extreme low-bit open-vocabulary object detection models.
Details
Motivation: Open-vocabulary object detection models are too large for resource-constrained devices, but extreme low-bit quantization severely degrades fine-grained vision-language alignment and relational structures needed for novel category detection.
Method: Curriculum Relational Quantization-Aware Training (CR-QAT) integrates two components: 1) Curriculum QAT (CQAT) partitions the model for progressive quantization with error isolation, and 2) Text-centric Relational KD (TRKD) transfers teacher’s multi-dimensional relational knowledge using text-anchored pairwise similarity matrices.
Result: CR-QAT consistently outperforms existing QAT baselines on LVIS and COCO zero-shot benchmarks under aggressive low-bit settings, achieving relative AP improvements of up to 38.9% and 40.9% respectively.
Conclusion: The proposed CR-QAT framework effectively preserves vision-language alignment and relational structures during extreme quantization, enabling efficient deployment of open-vocabulary object detection models on resource-constrained devices.
Abstract: Open-vocabulary object detection (OVOD) enables novel category detection via vision-language alignment, but massive model sizes hinder deployment on resource-constrained devices. While quantization offers practical compression, we reveal that naive extreme low-bit (e.g., 4-bit) quantization severely degrades fine-grained vision-language alignment and distorts inter-region relational structures. To address this, we propose curriculum relational quantization-aware training (CR-QAT), an integrated framework combining stage-by-stage optimization with relational knowledge distillation. Within CR-QAT, curriculum QAT (CQAT) mitigates error accumulation by partitioning the model for progressive quantization, ensuring stable optimization via error isolation. Concurrently, text-centric relational KD (TRKD) is applied to task-relevant modules. By constructing text-anchored pairwise similarity matrices, TRKD comprehensively transfers the teacher’s multi-dimensional relational knowledge. Experiments on LVIS and COCO zero-shot benchmarks demonstrate that CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings, achieving relative AP improvements of up to 38.9% and 40.9%, respectively.
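The text-anchored similarity matrices at the heart of TRKD can be sketched as follows: each region embedding is described by its cosine similarities to every category text embedding, and the student is trained to match the teacher's matrix. The MSE matching loss is one plausible instantiation (the abstract does not give the exact objective), and all embeddings here are synthetic stand-ins.

```python
import numpy as np

def text_anchored_similarity(region_embs, text_embs):
    """Pairwise cosine similarities between region embeddings and
    category text embeddings: rows are regions, columns are categories."""
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return r @ t.T

def trkd_loss(teacher_regions, student_regions, text_embs):
    """Relational KD sketch: penalize divergence between the teacher's and
    the quantized student's text-anchored similarity structures (MSE here)."""
    st = text_anchored_similarity(teacher_regions, text_embs)
    ss = text_anchored_similarity(student_regions, text_embs)
    return float(np.mean((st - ss) ** 2))

rng = np.random.default_rng(1)
text = rng.normal(size=(5, 8))              # 5 category text embeddings
teacher = rng.normal(size=(10, 8))          # 10 teacher region embeddings
student = teacher + 0.5 * rng.normal(size=(10, 8))  # perturbed "quantized" student
loss = trkd_loss(teacher, student, text)
```

Anchoring the relations on text embeddings, rather than on raw region-to-region distances, keeps the distilled structure in the space that matters for open-vocabulary matching.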
[171] DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model
Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin
Main category: cs.CV
TL;DR: DeepSight is the first dedicated depth multimodal large language model that enhances 3D scene understanding by leveraging depth images rather than RGB, with novel datasets and modified architecture.
Details
Motivation: Current MLLMs struggle with accurate depth interpretation in visual data, limiting 3D scene understanding. Depth images provide direct spatial cues but are underutilized due to data scarcity and architectural limitations.
Method: 1) Construct depth image-text pair dataset using GLPN for depth map generation and GPT-4 for instruction curation; 2) Modify ViT encoder in CLIP to incorporate local object information for better depth variation capture; 3) Develop comprehensive depth QA benchmark for evaluation.
Result: DeepSight significantly enhances depth perception and downstream task performance compared to conventional MLLMs, demonstrating improved 3D understanding capabilities.
Conclusion: The work presents a novel depth-focused MLLM that advances multimodal 3D understanding by effectively leveraging depth information through specialized datasets and architectural modifications.
Abstract: Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach takes advantage of the unique characteristics of depth images: single-channel grayscale images where the pixel values directly reflect depth cues to improve spatial reasoning. To address challenges associated with limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate the performance of our model, we develop a comprehensive depth question answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.
[172] Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions
Jingtao Ye, Kexin Zhang, Xunchi Ma, Yuehan Li, Guangming Zhu, Peiyi Shen, Linhua Jiang, Xiangdong Zhang, Liang Zhang
Main category: cs.CV
TL;DR: DynUAV: A new benchmark for dynamic UAV-perspective multi-object tracking featuring intense ego-motion, complex trajectories, and challenging conditions like scale/viewpoint changes and motion blur.
Details
Motivation: Existing UAV-perspective MOT benchmarks lack the complexities of real-world UAV operations, featuring predictable camera dynamics and linear motion patterns. There's a need for a benchmark that captures the intense ego-motion and resulting complex apparent trajectories of agile UAV maneuvers.
Method: Introduces DynUAV benchmark with 42 video sequences containing over 1.7 million bounding box annotations covering vehicles, pedestrians, and industrial equipment (excavators, bulldozers, cranes). The benchmark focuses on challenging conditions from ego-motion including drastic scale changes, viewpoint changes, and motion blur.
Result: Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations in managing the intertwined challenges of detection and association under dynamic conditions. The benchmark establishes itself as a rigorous testbed that exposes tracker weaknesses.
Conclusion: DynUAV serves as a demanding benchmark to spur progress in real-world UAV-perspective MOT, providing a more realistic evaluation platform that captures the complexities of agile UAV operations.
Abstract: The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale changes and viewpoint changes, as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at link.
[173] Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning
Claire Roman, Philippe Meyer
Main category: cs.CV
TL;DR: Two-stage framework for learning glyph similarities: supervised contrastive learning on invented alphabets followed by teacher-student distillation to historical scripts, enabling both discriminative boundaries and discovery of latent cross-script relationships.
Details
Motivation: Learning similarity metrics for glyphs and writing systems faces fundamental challenges: while individual graphemes in invented alphabets can be reliably labeled, historical relationships between different scripts remain uncertain and contested. There's a need to bridge supervised learning with unsupervised discovery of potential historical influences.
Method: Two-stage framework: 1) Train an encoder with contrastive loss on labeled invented alphabets to create a teacher model with robust discriminative features. 2) Extend to historically attested scripts through teacher-student distillation, where the student learns unsupervised representations guided by the teacher’s knowledge but free to discover latent cross-script similarities. The asymmetric setup enables deformation-invariant embeddings while inheriting discriminative structure.
Result: Experiments on diverse writing systems demonstrate effective few-shot glyph recognition and meaningful script clustering without requiring ground-truth evolutionary relationships. The approach bridges supervised contrastive learning and unsupervised discovery.
Conclusion: The proposed framework successfully addresses the epistemological constraint in learning glyph similarities by combining supervised learning on clean examples with unsupervised discovery of historical relationships, enabling both hard boundaries between distinct systems and soft similarities reflecting potential historical influences.
Abstract: Learning similarity metrics for glyphs and writing systems faces a fundamental challenge: while individual graphemes within invented alphabets can be reliably labeled, the historical relationships between different scripts remain uncertain and contested. We propose a two-stage framework that addresses this epistemological constraint. First, we train an encoder with contrastive loss on labeled invented alphabets, establishing a teacher model with robust discriminative features. Second, we extend to historically attested scripts through teacher-student distillation, where the student learns unsupervised representations guided by the teacher’s knowledge but free to discover latent cross-script similarities. The asymmetric setup enables the student to learn deformation-invariant embeddings while inheriting discriminative structure from clean examples. Our approach bridges supervised contrastive learning and unsupervised discovery, enabling both hard boundaries between distinct systems and soft similarities reflecting potential historical influences. Experiments on diverse writing systems demonstrate effective few-shot glyph recognition and meaningful script clustering without requiring ground-truth evolutionary relationships.
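Stage 1, contrastive training on labeled invented alphabets, can be sketched with the standard supervised-contrastive (SupCon) form: glyph embeddings sharing a label are pulled together, all others pushed apart. The temperature and the exact normalization are illustrative assumptions; the paper does not specify its loss variant here.

```python
import numpy as np

def supervised_contrastive_loss(embs, labels, temp=0.1):
    """SupCon-style loss over glyph embeddings: for each anchor, average
    the negative log-probability of its same-label positives under a
    temperature-scaled cosine-similarity softmax."""
    z = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = z @ z.T / temp
    np.fill_diagonal(sims, -np.inf)             # exclude self-similarity
    logp = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    total, count = 0.0, 0
    for i in range(len(labels)):
        pos = (labels == labels[i]) & (np.arange(len(labels)) != i)
        if pos.any():
            total -= logp[i, pos].mean()
            count += 1
    return total / max(count, 1)

rng = np.random.default_rng(2)
glyphs = rng.normal(size=(8, 4))                # 8 synthetic glyph embeddings
glyph_labels = np.array([0, 0, 1, 1, 0, 1, 0, 1])
loss = supervised_contrastive_loss(glyphs, glyph_labels)
```

Stage 2 would then freeze this encoder as the teacher and train the student on attested scripts against the teacher's representations rather than against labels.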
[174] Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving
Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, CiarĂĄn Eising, Tim Brophy
Main category: cs.CV
TL;DR: VLMs fail on simple driving-relevant visual questions; study examines linear encoding of visual concepts in VLM activations to identify perceptual vs cognitive failure modes.
Details
Motivation: VLMs are increasingly used in automated driving but fail on simple visual questions relevant to driving. The reasons for these failures are poorly understood, prompting investigation into how visual information flows through VLMs.
Method: Create counterfactual image sets differing only in targeted visual concepts, then train linear probes on activations of four SOTA VLMs to assess linear encoding of concepts like object presence and orientation.
Result: Object presence is explicitly linearly encoded, but spatial concepts like orientation are only implicitly encoded. Two failure modes identified: perceptual (visual info not linearly encoded) and cognitive (info present but misaligned with language). Object distance degrades linear separability.
Conclusion: The study improves understanding of VLM failures on driving-relevant visual tasks by analyzing linear encoding of visual concepts and identifying perceptual vs cognitive failure modes.
Abstract: The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model’s activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model’s activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
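The probing methodology is easy to sketch: collect activations for the two counterfactual image sets and fit a linear classifier; high held-in accuracy indicates the concept is linearly encoded. A ridge-regression probe on synthetic "activations" is used below purely for illustration; the paper's actual probe and activations may differ.

```python
import numpy as np

def fit_linear_probe(acts, labels, reg=1e-3):
    """Fit a ridge-regression linear probe on activation vectors.
    labels in {0, 1} are mapped to {-1, +1} regression targets."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])   # append bias column
    y = 2.0 * labels - 1.0
    return np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)

def probe_accuracy(w, acts, labels):
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    return float(((X @ w > 0).astype(int) == labels).mean())

rng = np.random.default_rng(3)
acts = rng.normal(size=(200, 8))          # synthetic intermediate activations
labels = (acts[:, 0] > 0).astype(int)     # concept linearly encoded in dim 0
w = fit_linear_probe(acts, labels)
acc = probe_accuracy(w, acts, labels)
```

In the paper's terms, near-chance probe accuracy signals a perceptual failure, while a high-accuracy probe on a question the model still answers wrongly signals a cognitive failure.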
[175] Towards High-resolution and Disentangled Reference-based Sketch Colorization
Dingkun Yan, Xinrui Wang, Ru Wang, Zhuoru Li, Jinze Yu, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Main category: cs.CV
TL;DR: A novel sketch colorization framework that directly addresses distribution shift between training and test data using a dual-branch approach with Gram Regularization Loss, achieving state-of-the-art quality and controllability.
Details
Motivation: Previous sketch colorization methods struggled with distribution shift between semantically aligned training data and diverse test data, focusing on mitigating artifacts rather than solving the fundamental problem. The authors aim to directly minimize this distribution shift to achieve better quality, resolution, and controllability.
Method: Proposes a dual-branch framework with: 1) semantic-aligned branch modeling training data distribution, 2) semantic-misaligned branch modeling inference data distribution, 3) Gram Regularization Loss to enforce cross-domain distribution coherence, 4) anime-specific Tagger Network for fine-grained attribute extraction and SDXL encoder modulation, and 5) plugin module for texture transfer enhancement.
Result: Quantitative and qualitative comparisons plus user studies confirm the method effectively overcomes distribution shift challenge, establishing State-of-the-Art performance in both quality and controllability metrics. Ablation studies reveal each component’s influence.
Conclusion: The framework successfully addresses the fundamental distribution shift problem in sketch colorization, achieving superior results through explicit modeling of training/inference distributions and cross-domain regularization.
Abstract: Sketch colorization is a critical task for automating and assisting in the creation of animations and digital illustrations. Previous research identified the primary difficulty as the distribution shift between semantically aligned training data and highly diverse test data, and focused on mitigating the artifacts caused by the distribution shift instead of fundamentally resolving the problem. In this paper, we present a framework that directly minimizes the distribution shift, thereby achieving superior quality, resolution, and controllability of colorization. We propose a dual-branch framework to explicitly model the data distributions of the training process and inference process with a semantic-aligned branch and a semantic-misaligned branch, respectively. A Gram Regularization Loss is applied across the feature maps of both branches, effectively enforcing cross-domain distribution coherence and stability. Furthermore, we adopt an anime-specific Tagger Network to extract fine-grained attributes from reference images and modulate SDXL’s conditional encoders to ensure precise control, and a plugin module to enhance texture transfer. Quantitative and qualitative comparisons, alongside user studies, confirm that our method effectively overcomes the distribution shift challenge, establishing State-of-the-Art performance across both quality and controllability metrics. An ablation study reveals the influence of each component.
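The abstract does not give the Gram Regularization Loss formula, but the classic style-transfer construction, matching channel-wise Gram matrices of the two branches' feature maps under an MSE penalty, is one plausible instantiation and is sketched below.

```python
import numpy as np

def gram_matrix(feat):
    """Channel-wise Gram matrix of a (C, H, W) feature map: correlations
    between channel activations, normalized by spatial size."""
    C = feat.shape[0]
    f = feat.reshape(C, -1)
    return (f @ f.T) / f.shape[1]

def gram_regularization_loss(feat_aligned, feat_misaligned):
    """Mean-squared difference between the Gram matrices of the
    semantic-aligned and semantic-misaligned branches' features."""
    g1 = gram_matrix(feat_aligned)
    g2 = gram_matrix(feat_misaligned)
    return float(np.mean((g1 - g2) ** 2))

f = np.ones((4, 3, 3))   # a tiny 4-channel, 3x3 feature map
zero = gram_regularization_loss(f, f)
pos = gram_regularization_loss(f, 2.0 * f)
```

Because Gram matrices discard spatial layout, such a loss can align the two branches' feature statistics without forcing their (deliberately misaligned) spatial content to match.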
[176] Text-Driven Emotionally Continuous Talking Face Generation
Hao Yang, Yanyan Zhao, Tian Zheng, Hongbo Zhang, Bichen Wang, Di Wu, Xing Fu, Xuda Zhi, Yongbo Huang, Hao He
Main category: cs.CV
TL;DR: Proposes Emotionally Continuous Talking Face Generation (EC-TFG) task and TIE-TFG model for generating talking faces with dynamic, continuously changing emotional expressions based on text and emotion descriptions.
Details
Motivation: Existing talking face generation methods typically produce fixed emotional expressions and lack the ability to generate continuously changing, natural emotional expressions like humans exhibit during communication.
Method: Introduces Temporal-Intensive Emotion Fluctuation Modeling in TIE-TFG model to manage dynamic emotional variations, generating emotion variation sequences that drive continuous facial expression changes in synthesized videos.
Result: Extensive evaluations demonstrate exceptional ability to produce smooth emotion transitions while maintaining high-quality visuals and motion authenticity across diverse emotional states.
Conclusion: The proposed EC-TFG task and TIE-TFG model successfully address the limitation of fixed emotional expressions in talking face generation, enabling realistic videos with continuously changing emotional expressions.
Abstract: Talking Face Generation (TFG) strives to create realistic and emotionally expressive digital faces. While previous TFG works have mastered the creation of naturalistic facial movements, they typically express a fixed target emotion in synthetic videos and lack the ability to exhibit continuously changing and natural expressions like humans do when conveying information. To synthesize realistic videos, we propose a novel task called Emotionally Continuous Talking Face Generation (EC-TFG), which takes a text segment and an emotion description with varying emotions as driving data, aiming to generate a video where the person speaks the text while reflecting the emotional changes within the description. Alongside this, we introduce a customized model, i.e., Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), which innovatively manages dynamic emotional variations by employing Temporal-Intensive Emotion Fluctuation Modeling, allowing it to provide emotion variation sequences corresponding to the input text to drive continuous facial expression changes in synthesized videos. Extensive evaluations demonstrate our method’s exceptional ability to produce smooth emotion transitions and uphold high-quality visuals and motion authenticity across diverse emotional states.
[177] RePer-360: Releasing Perspective Priors for 360$^\circ$ Depth Estimation via Self-Modulation
Cheng Guan, Chunyu Lin, Zhijie Shen, Junsong Zhang, Jiyuan Wang
Main category: cs.CV
TL;DR: RePer-360 adapts depth foundation models to panoramic images using distortion-aware self-modulation, preserving pretrained perspective priors while requiring minimal panoramic training data.
Details
Motivation: Depth foundation models trained on perspective images generalize poorly to 360° panoramic images due to geometric discrepancies, and full fine-tuning requires large amounts of panoramic data.
Method: Proposes a lightweight geometry-aligned guidance module using ERP and CP projections to generate modulation signals, Self-Conditioned AdaLN-Zero for pixel-wise feature distribution alignment, and cubemap-domain consistency loss for training stability.
Result: Surpasses standard fine-tuning methods using only 1% of training data, achieves ~20% RMSE improvement under in-domain training, and effectively adapts perspective-trained models to panoramic domain.
Conclusion: RePer-360 enables efficient adaptation of depth foundation models to panoramic images while preserving pretrained knowledge, addressing domain gaps with minimal data requirements.
Abstract: Recent depth foundation models trained on perspective imagery achieve strong performance, yet generalize poorly to 360$^\circ$ images due to the substantial geometric discrepancy between perspective and panoramic domains. Moreover, fully fine-tuning these models typically requires large amounts of panoramic data. To address this issue, we propose RePer-360, a distortion-aware self-modulation framework for monocular panoramic depth estimation that adapts depth foundation models while preserving powerful pretrained perspective priors. Specifically, we design a lightweight geometry-aligned guidance module to derive a modulation signal from two complementary projections (i.e., ERP and CP) and use it to guide the model toward the panoramic domain without overwriting its pretrained perspective knowledge. We further introduce a Self-Conditioned AdaLN-Zero mechanism that produces pixel-wise scaling factors to reduce the feature distribution gap between the perspective and panoramic domains. In addition, a cubemap-domain consistency loss further improves training stability and cross-projection alignment. By shifting the focus from complementary-projection fusion to panoramic domain adaptation under preserved pretrained perspective priors, RePer-360 surpasses standard fine-tuning methods while using only 1% of the training data. Under the same in-domain training setting, it further achieves an approximately 20% improvement in RMSE. Code will be released upon acceptance.
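The "Zero" in AdaLN-Zero usually refers to a zero-initialized gate on the modulated branch, so that at initialization the network reproduces its pretrained behavior exactly, which matches the paper's goal of preserving perspective priors. The sketch below is one plausible reading; the names, shapes, and gate placement are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def adaln_zero(x, cond_scale, cond_shift, gate=0.0):
    """Self-conditioned AdaLN-Zero sketch: layer-normalize features,
    apply condition-derived pixel-wise scale/shift, and add the result
    back through a zero-initialized gate (gate=0 -> exact identity)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True) + 1e-6
    normed = (x - mu) / sigma
    modulated = normed * (1.0 + cond_scale) + cond_shift
    return x + gate * modulated

rng = np.random.default_rng(4)
feats = rng.normal(size=(5, 8))        # 5 "pixels", 8-dim features
scale = np.ones((5, 1))                # pixel-wise modulation from ERP/CP guidance
shift = np.zeros((5, 1))
identity_out = adaln_zero(feats, scale, shift, gate=0.0)
adapted_out = adaln_zero(feats, scale, shift, gate=1.0)
```

With the gate learned during adaptation, the panoramic modulation is blended in gradually rather than overwriting the pretrained features from step one.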
[178] EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
Shiyuan Yang, Ruihuang Li, Jiale Tao, Shuai Shao, Qinglin Lu, Jing Liao
Main category: cs.CV
TL;DR: EffectMaker is a unified reasoning-generation framework for reference-based visual effects customization using multimodal LLMs and diffusion transformers, enabling accurate VFX synthesis without per-effect fine-tuning.
Details
Motivation: Visual effects production is expensive and requires expert knowledge, while existing AIGC systems struggle with VFX generation due to data scarcity and difficulty modeling supernatural/stylized effects. Current approaches need per-effect fine-tuning, limiting scalability and generalization.
Method: Uses a multimodal LLM to interpret high-level effect semantics and reason about adaptation to target subjects, combined with a diffusion transformer using in-context learning to capture fine-grained visual cues from reference videos. Creates a semantic-visual dual-path guidance mechanism. Also constructs EffectData: 130k videos across 3k VFX categories.
Result: Achieves superior visual quality and effect consistency over state-of-the-art baselines, offering scalable and flexible paradigm for customized VFX generation without per-effect fine-tuning.
Conclusion: EffectMaker provides an effective framework for reference-based VFX customization that addresses data scarcity and scalability issues through multimodal reasoning and in-context learning.
Abstract: Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX. In this work, we present EffectMaker, a unified reasoning-generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about how they should adapt to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic-visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. Furthermore, we construct EffectData, the largest high-quality synthetic dataset containing 130k videos across 3k VFX categories, to improve generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Project page: https://effectmaker.github.io
[179] MOSIV: Multi-Object System Identification from Videos
Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen, Shizheng Wen, Hao Zhang, Lu Qi, Ming-Hsuan Yang, Laszlo A. Jeni, Min Xu, Yizhou Zhao
Main category: cs.CV
TL;DR: MOSIV is a framework for multi-object system identification from videos that optimizes continuous material parameters using differentiable simulation guided by geometric objectives.
Details
Motivation: Prior methods focus on single-object scenes or discrete material classification with fixed prototypes, making them ill-suited for multi-object system identification from videos, where continuous material parameters need to be estimated.
Method: Proposes the MOSIV framework, which directly optimizes continuous, per-object material parameters using a differentiable simulator guided by geometric objectives derived from video data.
Result: MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines on a new synthetic benchmark with contact-rich, multi-object interactions.
Conclusion: Object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization in complex multi-object settings, establishing MOSIV as a strong baseline for multi-object system identification from videos.
Abstract: We introduce the challenging problem of multi-object system identification from videos, for which prior methods are ill-suited due to their focus on single-object scenes or discrete material classification with a fixed set of material prototypes. To address this, we propose MOSIV, a new framework that directly optimizes for continuous, per-object material parameters using a differentiable simulator guided by geometric objectives derived from video. We also present a new synthetic benchmark with contact-rich, multi-object interactions to facilitate evaluation. On this benchmark, MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines, establishing it as a strong baseline for this new task. Our analysis shows that object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization in these complex, multi-object settings. The source code and dataset will be released.
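At its core, system identification with a differentiable simulator means fitting continuous physical parameters by gradient descent on a discrepancy between simulated and observed trajectories. A toy sketch with a single "material" parameter and a one-line simulator (numerical gradients stand in for a real differentiable physics engine; everything here is illustrative):

```python
import numpy as np

def simulate(damping, x0=1.0, steps=20):
    """Toy simulator: exponentially damped trajectory x_t = x0 * damping^t."""
    t = np.arange(steps)
    return x0 * damping ** t

def fit_damping(observed, init=0.5, lr=0.05, iters=500):
    """Recover the damping parameter by gradient descent on the MSE."""
    d = init
    eps = 1e-5
    for _ in range(iters):
        # central-difference gradient of the trajectory loss w.r.t. the parameter
        loss_hi = np.mean((simulate(d + eps) - observed) ** 2)
        loss_lo = np.mean((simulate(d - eps) - observed) ** 2)
        d -= lr * (loss_hi - loss_lo) / (2 * eps)
    return d
```

MOSIV's setting replaces the scalar trajectory with multi-object video observations and the MSE with geometry-aligned objectives, but the optimization loop has this shape.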
[180] Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion
Bohai Gu, Taiyi Wu, Dazhao Du, Jian Liu, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo
Main category: cs.CV
TL;DR: Place-it-R1 is an MLLM-guided framework for physically plausible video object insertion using chain-of-thought reasoning and iterative refinement cycles.
Details
Motivation: Current video editing techniques focus on visual fidelity but ignore physical causality, resulting in physically inconsistent object insertions that don't interact properly with their environment.
Method: Uses MLLMs for physical scene understanding and interaction reasoning, generates environment-aware chain-of-thought tokens, infers valid insertion regions, employs MLLM-guided Spatial DPO for scoring, and implements iterative refinement cycles with two user-selectable modes (plausibility-oriented vs fidelity-oriented).
Result: Achieves physically-coherent video object insertion superior to state-of-the-art solutions and commercial models, offering explicit control over plausibility-fidelity trade-off.
Conclusion: The framework successfully bridges cognitive reasoning with generative execution for physically plausible video editing, demonstrating the potential of MLLMs for environment-aware video manipulation.
Abstract: Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations: First, MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions to explicitly guide the diffusion toward physically plausible insertion. Then, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), where diffusion outputs are fed back to the MLLM for scoring, enabling visual naturalness. During inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed-loop that progressively enhances editing quality. Furthermore, we provide two user-selectable modes: a plausibility-oriented flexible mode that permits environment modifications (e.g., generating support structures) to enhance physical plausibility, and a fidelity-oriented standard mode that preserves scene integrity for maximum fidelity, offering users explicit control over the plausibility-fidelity trade-off. Extensive experiments demonstrate Place-it-R1 achieves physically-coherent video object insertion compared with state-of-the-art solutions and commercial models.
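The closed-loop inference described above, where the MLLM scores a diffusion output and then triggers a refinement pass, can be sketched as a generic generate-score-regenerate loop. All names and the acceptance threshold below are hypothetical, not the paper's API:

```python
def refine_loop(generate, score, prompt, max_iters=3, accept=0.8):
    """Closed-loop refinement: regenerate until the scorer is satisfied.

    generate(prompt, feedback) -> candidate output
    score(candidate)           -> (scalar score, textual feedback)
    """
    candidate = generate(prompt, feedback=None)
    for _ in range(max_iters):
        s, feedback = score(candidate)
        if s >= accept:
            break
        # feed the scorer's critique back into the next generation pass
        candidate = generate(prompt, feedback=feedback)
    return candidate
```

In the paper's framework the scorer is the MLLM's understanding of physical plausibility and the generator is the video diffusion model; this sketch only shows the control flow of the loop.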
[181] StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision
Yuanhuiyi Lyu, Kaiyu Lei, Ziqiao Weng, Xu Zheng, Lutao Jiang, Teng Li, Yangfu Li, Ziyuan Huang, Linfeng Zhang, Xuming Hu
Main category: cs.CV
TL;DR: StruVis enhances text-to-image generation by using structured visual representations instead of intermediate images for reasoning, improving performance on complex prompts while being computationally efficient.
Details
Motivation: Existing reasoning frameworks for text-to-image generation have limitations: text-only reasoning lacks visual context and misses spatial elements, while text-image interleaved reasoning is computationally expensive and constrained by generator limitations.
Method: Proposes the StruVis framework, which uses text-based structured visual representations as intermediate reasoning states, enabling MLLMs to perceive visual structure in purely text-based reasoning without intermediate image generation.
Result: Achieves significant performance improvements: 4.61% gain on T2I-ReasonBench and 4% gain on WISE benchmarks for reasoning-based T2I generation.
Conclusion: StruVis provides an efficient, generator-agnostic reasoning framework that enhances T2I generation by enabling structured-vision-guided reasoning without the computational cost of intermediate image generation.
Abstract: Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks can be broadly categorized into two types: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context, often resulting in the omission of critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which leverages a T2I generator to provide visual references during the reasoning process. While this approach enhances visual grounding, it incurs substantial computational costs and constrains the reasoning capacity of MLLMs to the representational limitations of the generator. To this end, we propose StruVis, a novel framework that enhances T2I generation through Thinking with Structured Vision. Instead of relying on intermediate image generation, StruVis employs text-based structured visual representations as intermediate reasoning states, thereby enabling the MLLM to effectively “perceive” visual structure within a purely text-based reasoning process. Powered by this, the reasoning potential for T2I generation of the MLLM is unlocked through structured-vision-guided reasoning. Additionally, as a generator-agnostic reasoning framework, our proposed StruVis can be seamlessly integrated with diverse T2I generators and efficiently enhance their performance in reasoning-based T2I generation. Extensive experiments demonstrate that StruVis achieves significant performance improvements on reasoning-based T2I benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.
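The key idea is that the intermediate "visual" state is plain text with explicit structure (objects, layout, attributes) rather than a rendered image. A hypothetical serialization of such a state (this schema is illustrative only; the paper defines its own representation):

```python
def scene_to_text(objects):
    """Serialize a scene layout (name, normalized bbox, attributes)
    into a text block an MLLM can reason over in place of an image."""
    lines = []
    for obj in objects:
        x1, y1, x2, y2 = obj["bbox"]
        attrs = ", ".join(obj["attrs"])
        lines.append(f"{obj['name']} [{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}] ({attrs})")
    return "\n".join(lines)
```

Because the state stays textual, no intermediate image needs to be generated, which is what makes the approach generator-agnostic and cheap relative to interleaved reasoning.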
[182] VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models
Rohit Saxena, Alessandro Suglia, Pasquale Minervini
Main category: cs.CV
TL;DR: VLM-RobustBench evaluates vision-language models’ robustness to 49 image distortions across 133 settings, revealing VLMs are semantically strong but spatially fragile, with geometric distortions causing the largest performance drops.
Details
Motivation: While VLMs perform well on standard datasets, their robustness to real-world image distortions is not well understood. The paper aims to systematically evaluate how VLMs handle various image corruptions that occur in practical applications.
Method: Created VLM-RobustBench with 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities and binary transforms (133 settings total). Tested VLMs from the Qwen, InternVL, Molmo, and Gemma families on MMBench (visually grounded) and MMMU-Pro (reasoning-oriented) benchmarks.
Result: Visual severity is a weak predictor of difficulty - low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. Low-severity glass_blur reduces MMBench accuracy by ~8pp on average, while resampling and geometric distortions (upsample, elastic_transform) cause drops up to 34pp.
Conclusion: Current VLMs are semantically strong but spatially fragile, motivating new robustness evaluation protocols and training regimes emphasizing resampling and geometric invariances.
Abstract: Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34 pp. Overall, our findings suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.
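A benchmark of this kind boils down to a corruption function parameterized by type and graded severity. A small sketch with two corruption families, Gaussian noise and a crude resampling distortion (the severity values and the resampling scheme are illustrative, not the benchmark's):

```python
import numpy as np

SEVERITIES = {"low": 0.05, "mid": 0.15, "high": 0.30}

def corrupt(image, kind="gaussian_noise", severity="low", seed=0):
    """Apply one corruption at a graded severity to an image in [0, 1]."""
    rng = np.random.default_rng(seed)
    if kind == "gaussian_noise":
        out = image + rng.normal(0.0, SEVERITIES[severity], image.shape)
    elif kind == "downsample_upsample":
        # crude resampling corruption: drop pixels, then repeat them back up
        step = {"low": 2, "mid": 4, "high": 8}[severity]
        out = np.repeat(np.repeat(image[::step, ::step], step, 0), step, 1)
        out = out[: image.shape[0], : image.shape[1]]
    else:
        raise ValueError(kind)
    return np.clip(out, 0.0, 1.0)
```

Sweeping `kind` and `severity` over a model's evaluation set, as the benchmark does over its 133 settings, exposes exactly the asymmetry reported above: resampling distortions can hurt far more than visually obvious photometric noise.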
[183] Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking
Chunjiang Li, Jianbo Ma, Li Shen, Yanru Chen, Liangyin Chen
Main category: cs.CV
TL;DR: OA-SORT is a plug-and-play, training-free multi-object tracking framework that addresses occlusion challenges through occlusion-aware modules, improving tracking performance on datasets like DanceTrack.
Details
Motivation: 2D multi-object tracking faces challenges with positional cost confusion due to partial occlusion, which degrades tracking performance and trajectory analysis.
Method: Proposes the OA-SORT framework with three components: an Occlusion-Aware Module (OAM) analyzes occlusion status using Gaussian Maps; an Occlusion-Aware Offset (OAO) mitigates cost confusion; Bias-Aware Momentum (BAM) suppresses estimation instability.
Result: Achieves 63.1% HOTA and 64.2% IDF1 on DanceTrack test set; integration into four additional trackers improves HOTA and IDF1 by average 2.08% and 3.05% respectively.
Conclusion: Occlusion handling is crucial for MOT, and the proposed occlusion-aware framework is effective, reusable, and improves tracking performance across different datasets and trackers.
Abstract: Multi-object tracking (MOT) involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to positional cost confusion arising from partial occlusion. To address this issue, we present the novel Occlusion-Aware SORT (OA-SORT) framework, a plug-and-play and training-free framework that includes the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). Specifically, OAM analyzes the occlusion status of objects, where a Gaussian Map (GM) is introduced to reduce background influence. In contrast, OAO and BAM leverage the OAM-described occlusion status to mitigate cost confusion and suppress estimation instability. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1% and 64.2% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into the four additional trackers improves HOTA and IDF1 by an average of 2.08% and 3.05%, demonstrating the reusability of the occlusion awareness.
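The paper's OAM scores occlusion with Gaussian Maps to reduce background influence; as a simpler stand-in, pairwise box overlap already captures the core idea of flagging objects whose positional costs may be confused. A sketch (the IoU threshold is an assumption, not a value from the paper):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def occlusion_flags(boxes, thresh=0.1):
    """Mark every box that overlaps another box beyond the threshold."""
    flags = [False] * len(boxes)
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > thresh:
                flags[i] = flags[j] = True
    return flags
```

Downstream, such flags are what lets a tracker treat the matching cost of occluded objects differently, which is the role OAO and BAM play in OA-SORT.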
[184] Reflective Flow Sampling Enhancement
Zikai Zhou, Muyao Wang, Shitong Shao, Lichen Bai, Haoyi Xiong, Bo Han, Zeke Xie
Main category: cs.CV
TL;DR: RF-Sampling is a training-free inference enhancement framework for flow-based text-to-image models that improves generation quality and prompt alignment through reflective sampling and flow inversion.
Details
Motivation: Existing inference-time enhancement strategies work well for conventional diffusion models but fail on flow models like FLUX. There is a need for theoretically grounded enhancement methods specifically designed for flow models, especially CFG-distilled variants.
Method: Provides a formal derivation showing that RF-Sampling implicitly performs gradient ascent on the text-image alignment score. The method leverages linear combinations of textual representations, integrated with flow inversion, to explore noise spaces more consistent with the input prompt.
Result: Extensive experiments across multiple benchmarks show RF-Sampling consistently improves both generation quality and prompt alignment. It’s the first inference enhancement method exhibiting test-time scaling ability on FLUX.
Conclusion: RF-Sampling bridges the gap for inference enhancement in flow models, providing a theoretically-grounded, training-free framework that improves text-to-image generation for CFG-distilled flow models like FLUX.
Abstract: The growing demand for text-to-image generation has led to rapid advances in generative modeling. Recently, text-to-image diffusion models trained with flow matching algorithms, such as FLUX, have achieved remarkable progress and emerged as strong alternatives to conventional diffusion models. At the same time, inference-time enhancement strategies have been shown to improve the generation quality and text-prompt alignment of text-to-image diffusion models. However, these techniques are mainly applicable to conventional diffusion models and usually fail to perform well on flow models. To bridge this gap, we propose Reflective Flow Sampling (RF-Sampling), a theoretically-grounded and training-free inference enhancement framework explicitly designed for flow models, especially for the CFG-distilled variants (i.e., models distilled from CFG guidance techniques), like FLUX. Departing from heuristic interpretations, we provide a formal derivation proving that RF-Sampling implicitly performs gradient ascent on the text-image alignment score. By leveraging a linear combination of textual representations and integrating them with flow inversion, RF-Sampling allows the model to explore noise spaces that are more consistent with the input prompt. Extensive experiments across multiple benchmarks demonstrate that RF-Sampling consistently improves both generation quality and prompt alignment. Moreover, RF-Sampling is also the first inference enhancement method that can exhibit test-time scaling ability to some extent on FLUX.
[185] Ensemble Learning with Sparse Hypercolumns
Julia Dietlmeier, Vayangi Ganepola, Oluwabukola G. Adegboro, Mayug Maniparambil, Claudia Mazo, Noel E. O’Connor
Main category: cs.CV
TL;DR: Hypercolumns with stratified subsampling and ensemble learning improve brain tumor segmentation, especially in low-data scenarios, outperforming UNet baselines.
Details
Motivation: Hypercolumns (multi-scale feature concatenations) are biologically inspired but computationally expensive for segmentation. The paper aims to address computational complexity through stratified subsampling and explore ensemble methods for improved performance.
Method: Uses VGG16-based hypercolumns with stratified subsampling to reduce computational complexity. Investigates ensemble learning (stacking and voting) on sparse hypercolumns, comparing with Logistic Regression for low-shot cases (N ≤ 20).
Result: For 10% stratified subsampling with N=20, achieves a best average Dice score of 0.66. This represents a 24.53% statistically significant improvement over the UNet baseline (p-value = 3.07e-11). In extreme low-shot cases, simple Logistic Regression outperforms ensembles.
Conclusion: Stratified subsampling effectively reduces hypercolumn computational cost while maintaining performance. Ensemble methods work well generally, but Logistic Regression is most effective for very small datasets (N ≤ 20). The approach significantly outperforms standard UNet in low-data scenarios.
Abstract: Directly inspired by findings in biological vision, high-dimensional hypercolumns are feature vectors built by concatenating multi-scale activations of convolutional neural networks for a single image pixel location. Together with powerful classifiers, they can be used for image segmentation, i.e., pixel classification. However, in practice, there are only very few works dedicated to the use of hypercolumns. One reason is the computational complexity of processing concatenated dense hypercolumns that grows linearly with the size $N$ of the training set. In this work, we address this challenge by applying stratified subsampling to the VGG16 based hypercolumns. Furthermore, we investigate the performance of ensemble learning on sparse hypercolumns. Our experiments on a brain tumor dataset show that stacking and voting ensembles deliver competitive performance, but in the extreme low-shot case of $N \leq 20$, a simple Logistic Regression classifier is the most effective method. For 10% stratified subsampling rate, our best average Dice score is 0.66 for $N=20$. This is a statistically significant improvement of 24.53% over the standard multi-scale UNet baseline ($p$-value $= 3.07 \times 10^{-11}$, Wilcoxon signed-rank test), which is less effective due to overfitting.
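A hypercolumn is just the concatenation, per pixel, of activations sampled from feature maps at several scales; stratified subsampling then keeps a class-balanced fraction of pixels to bound the training-set size. A NumPy sketch of both steps (nearest-neighbour lookup and the choice of feature maps are simplifications of the VGG16 pipeline):

```python
import numpy as np

def build_hypercolumns(feature_maps, coords):
    """Concatenate multi-scale activations at given pixel locations.

    feature_maps: list of (C_i, H_i, W_i) arrays at different scales
    coords:       (N, 2) integer (row, col) locations in the full-resolution image
    """
    H, W = feature_maps[0].shape[1:]
    cols = []
    for fmap in feature_maps:
        c, h, w = fmap.shape
        # nearest-neighbour mapping from image coords to this map's resolution
        ys = (coords[:, 0] * h // H).clip(0, h - 1)
        xs = (coords[:, 1] * w // W).clip(0, w - 1)
        cols.append(fmap[:, ys, xs].T)          # (N, C_i)
    return np.concatenate(cols, axis=1)          # (N, sum_i C_i)

def stratified_sample(labels, rate, seed=0):
    """Keep `rate` of the pixels per class, preserving class proportions."""
    rng = np.random.default_rng(seed)
    idx = []
    for cls in np.unique(labels):
        members = np.flatnonzero(labels == cls)
        k = max(1, int(rate * len(members)))
        idx.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(idx)
```

Subsampling before concatenation is what keeps the dense-hypercolumn cost, which grows with training-set size, manageable while preserving minority-class (tumor) pixels.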
[186] Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu
Main category: cs.CV
TL;DR: CoE is a training-free multimodal summarization framework that uses Chain-of-Events reasoning guided by a Hierarchical Event Graph to address challenges in cross-modal grounding and temporal modeling.
Details
Motivation: Existing multimodal summarization approaches suffer from three main issues: reliance on domain-specific supervision, implicit fusion with weak cross-modal grounding, and flat temporal modeling without event transitions.
Method: Proposes the CoE framework with a Hierarchical Event Graph (HEG) that encodes textual semantics into an explicit event hierarchy. Uses Chain-of-Events reasoning to localize visual cues and model event evolution and causal transitions, with lightweight style adaptation for domain alignment.
Result: Outperforms state-of-the-art video CoT baselines across eight diverse datasets with average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore.
Conclusion: CoE demonstrates robustness, interpretability, and cross-domain generalization for multimodal summarization through structured reasoning and explicit event hierarchy.
Abstract: Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training-free MMS framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, CoE localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state-of-the-art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.
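The Hierarchical Event Graph can be pictured as a mapping from each event to its sub-events, with the Chain-of-Events obtained by an ordered traversal. A minimal sketch (the dict-based graph and this traversal order are illustrative, not the paper's data structure):

```python
def chain_of_events(graph, root):
    """Depth-first traversal of a hierarchical event graph, yielding
    (depth, event) pairs in listed (temporal/causal) order."""
    order = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        order.append((depth, node))
        # push children in reverse so they pop in their listed order
        for child in reversed(graph.get(node, [])):
            stack.append((child, depth + 1))
    return order
```

In CoE this ordered event chain is what scaffolds cross-modal grounding: each event in the chain is matched against visual cues before the summary is composed.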
[187] FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography
Xia Xin, Yuki Endo, Yoshihiro Kanamori
Main category: cs.CV
TL;DR: FontUse: A typography-focused dataset and annotation pipeline using MLLMs to improve text-to-image models’ typographic control.
Details
Motivation: Current text-to-image models struggle with controlling typography: they often ignore or weakly follow typographic appearance requests, limiting their practical applications in design contexts.
Method: Created the FontUse dataset (~70K images) with a structured annotation pipeline using segmentation models and multimodal LLMs; annotations include prompts, text-region locations, and OCR strings; prompts combine font styles and use cases; fine-tuned existing generators with this supervision.
Result: Models trained with FontUse pipeline produce text renderings more consistent with prompts than baselines; introduced Long-CLIP-based metric for typography alignment evaluation
Conclusion: Data-centric approach with structured typography annotations enables better typographic control in text-to-image models without architectural changes
Abstract: Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.
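The proposed metric is built on Long-CLIP, but at its core an alignment score of this kind is a cosine similarity between an image embedding and a prompt embedding. A generic sketch with placeholder embeddings (obtaining them from Long-CLIP, which extends CLIP's token limit for long prompts, is out of scope here):

```python
import numpy as np

def alignment_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and a prompt embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb)
    return float(img @ txt)
```

Averaging this score over generated images, one per (style, use-case) prompt, gives a single number for how well a fine-tuned generator follows requested typographic attributes.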
[188] TaPD: Temporal-adaptive Progressive Distillation for Observation-Adaptive Trajectory Forecasting in Autonomous Driving
Mingyu Fan, Yi Liu, Hao Zhou, Deheng Qian, Mohammad Haziq Khan, Matthias Raetsch
Main category: cs.CV
TL;DR: TaPD is a plug-and-play framework for trajectory prediction that adapts to variable observation lengths using progressive knowledge distillation and temporal backfilling.
Details
Motivation: Existing trajectory predictors assume fixed-length histories and suffer performance degradation with variable or extremely short observations due to real-world issues like occlusion or limited sensing range.
Method: Two cooperative modules: 1) an Observation-Adaptive Forecaster (OAF) using progressive knowledge distillation to transfer knowledge from long-horizon teachers to short-horizon students, and 2) a Temporal Backfilling Module (TBM) that reconstructs missing historical segments. Uses cosine-annealed distillation weighting and a decoupled pretrain-reconstruct-finetune protocol.
Result: Outperforms strong baselines on Argoverse 1 and 2 across all observation lengths, with especially large gains under very short inputs. Can improve other predictors in a plug-and-play manner.
Conclusion: TaPD provides a unified solution for observation-adaptive trajectory forecasting that handles variable history lengths effectively through progressive distillation and explicit temporal backfilling.
Abstract: Trajectory prediction is essential for autonomous driving, enabling vehicles to anticipate the motion of surrounding agents to support safe planning. However, most existing predictors assume fixed-length histories and suffer substantial performance degradation when observations are variable or extremely short in real-world settings (e.g., due to occlusion or a limited sensing range). We propose TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework for observation-adaptive trajectory forecasting under variable history lengths. TaPD comprises two cooperative modules: an Observation-Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) for explicit reconstruction of the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon “teachers” to short-horizon “students” via hierarchical feature regression, enabling short observations to recover richer motion context. We further introduce a cosine-annealed distillation weighting scheme to balance forecasting supervision and feature alignment, improving optimization stability and cross-length consistency. For extremely short histories where implicit alignment is insufficient, TBM backfills missing historical segments conditioned on scene evolution, producing context-rich trajectories that strengthen PKD and thereby improve OAF. We employ a decoupled pretrain-reconstruct-finetune protocol to preserve real-motion priors while adapting to backfilled inputs. Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, delivers especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug-and-play manner. Code will be available at https://github.com/zhouhao94/TaPD.
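The cosine-annealed distillation weighting can be written as a standard cosine schedule; whether the paper anneals the distillation term up or down over training is its own design choice, so the direction below (high to low) is an assumption:

```python
import math

def cosine_annealed_weight(step, total_steps, w_max=1.0, w_min=0.0):
    """Cosine schedule for the distillation-loss weight, annealed
    from w_max at step 0 to w_min at total_steps."""
    progress = min(step / total_steps, 1.0)
    return w_min + 0.5 * (w_max - w_min) * (1.0 + math.cos(math.pi * progress))
```

The total loss at each step would then be something like `task_loss + w * feature_alignment_loss`, letting feature alignment dominate early and forecasting supervision dominate late, which matches the stated goal of balancing the two objectives.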
[189] Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
Jiadong Pan, Liang Li, Yuxin Peng, Yu-Ming Tang, Shuohuan Wang, Yu Sun, Hua Wu, Qingming Huang, Haifeng Wang
Main category: cs.CV
TL;DR: Proposes GvU, a self-supervised reinforcement learning framework that uses unified multimodal models’ own visual understanding capabilities to guide and improve their image generation quality through intrinsic alignment rewards.
Details
Motivation: Unified multimodal models (UMMs) show strong visual understanding but weak generation capabilities due to the decoupling between understanding and generation processes. The paper aims to bridge this gap by leveraging UMMs' internal understanding to enhance generation.
Method: Introduces GvU, a token-level intrinsic text-image alignment reward mechanism that allows UMMs to evaluate their own outputs using their understanding branch. Uses a self-supervised reinforcement learning framework where UMMs iteratively improve generation through understanding-based intrinsic rewards without external supervision.
Result: The method substantially boosts UMMs’ generation capabilities, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between visual understanding and generation.
Conclusion: UMMs can effectively use their own understanding capabilities to guide and improve generation through self-supervised reinforcement learning, creating a virtuous cycle where better generation enhances understanding and vice versa.
Abstract: Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs’ internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals–without reliance on external supervision. Experimental results show that our method substantially boosts UMMs’ generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs’ visual understanding and generation.
[190] GazeMoE: Perception of Gaze Target with Mixture-of-Experts
Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero, Chen Li
Main category: cs.CV
TL;DR: GazeMoE: A novel end-to-end framework that uses Mixture-of-Experts modules to selectively leverage gaze-target-related cues from frozen foundation models for human gaze target estimation from visible images.
Details
Motivation: Estimating human gaze targets from visible images is critical for robots to understand human attention, but developing generalizable neural architectures and training paradigms remains challenging. While pre-trained vision foundation models offer promising avenues, integrating multi-modal cues (eyes, head poses, gestures, contextual features) requires adaptive and efficient decoding mechanisms.
Method: Proposes GazeMoE, an end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through Mixture-of-Experts (MoE) modules. Incorporates class-balancing auxiliary loss and strategic data augmentations (region-specific cropping and photometric transformations) to address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness.
Result: Extensive experiments on benchmark datasets demonstrate that GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks.
Conclusion: GazeMoE effectively addresses gaze target estimation by adaptively leveraging multi-modal cues through MoE modules, achieving superior performance through strategic architectural design and training techniques.
Abstract: Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues – including eyes, head poses, gestures, and contextual features – demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE
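The abstract does not spell out the class-balancing auxiliary loss. One common choice for the per-class weights such a loss relies on is the effective-number-of-samples scheme (Cui et al., 2019), sketched here as an illustration, not as the paper's actual formulation.

```python
from collections import Counter

def class_balanced_weights(labels, beta=0.999):
    """Per-class weights via the 'effective number of samples' scheme:
    w_c proportional to (1 - beta) / (1 - beta**n_c), normalized to mean 1.
    An assumed stand-in for GazeMoE's class-balancing term.
    """
    counts = Counter(labels)
    weights = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    total = sum(weights.values())
    # Normalize so the weights average to 1 across classes.
    return {c: w * len(counts) / total for c, w in weights.items()}
```

For a 90/10 in-frame vs. out-of-frame split, the minority out-of-frame class receives a much larger weight, counteracting the imbalance the paper describes.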
[191] GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection
Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Shanshan Liu, Haocheng Feng, Wei He, Jingdong Wang
Main category: cs.CV
TL;DR: GenHOI is a lightweight augmentation for pretrained video generation models that improves hand-object interaction synthesis by injecting reference-object information with temporal balancing and spatial selectivity mechanisms.
Details
Motivation: Current HOI reenactment methods fail to generalize to complex, in-the-wild scenarios, while all-in-one video editing models struggle with HOI-specific issues like inconsistent object appearance across frames.
Method: Two key innovations: 1) Head-Sliding RoPE for temporal balancing - assigns head-specific temporal offsets to reference tokens to distribute influence evenly across frames and mitigate 3D RoPE temporal decay; 2) Two-level spatial attention gate - concentrates object-conditioned attention on HOI regions and adaptively scales its strength.
Result: Extensive evaluations on unseen, in-the-wild scenes show GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods in both qualitative and quantitative metrics.
Conclusion: GenHOI effectively addresses HOI challenges in video generation by combining temporal balancing and spatial selectivity, enabling robust generalization to complex real-world scenarios while maintaining background realism.
Abstract: Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/
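Head-Sliding RoPE assigns head-specific temporal offsets to reference tokens. A toy sketch of how such offsets could enter a 1D rotary embedding follows; the even spacing of offsets over the frame range and the `head_dim`/`base` values are assumptions, since the paper only states that the offsets distribute reference influence across frames.

```python
import math

def head_sliding_offsets(num_heads, num_frames):
    """Spread head-specific temporal offsets evenly over the frame range,
    so reference tokens are effectively anchored at a different time per
    head. Even spacing is an assumption, not the paper's stated rule."""
    return [h * num_frames / num_heads for h in range(num_heads)]

def rotary_angle(pos, dim_pair, head, offsets, base=10000.0, head_dim=64):
    """1D RoPE rotation angle for one (position, frequency) pair, with the
    head's temporal offset added to the reference-token position."""
    freq = base ** (-2.0 * dim_pair / head_dim)
    return (pos + offsets[head]) * freq
```

With 4 heads and 16 frames, the reference token is rotated as if it sat at frames 0, 4, 8, and 12 in the respective heads, rather than at a single temporal position.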
[192] HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models
Lincen Bai, Hedi Tabia, Raul Santos-Rodriguez
Main category: cs.CV
TL;DR: HiPP-Prune: A hierarchical preference-conditioned structured pruning framework for vision-language models that treats pruning as conditional resource allocation under multiple objectives, enabling controllable robustness-utility trade-offs.
Details
Motivation: Pruning VLMs is challenging because compression affects both task utility and visual grounding, often amplifying object hallucinations even at the same sparsity level. Current methods lack control over the trade-off between utility and hallucination robustness.
Method: Hierarchical preference-conditioned structured pruning framework that makes plan-level decisions: a single policy invocation outputs a global pruning blueprint by factorizing decisions into overall sparsity budget and layer-wise allocation. Integrates visual sensitivity signal from attention flow between vision tokens and language hidden states. Optimizes pruning plans with plan-level Group Relative Policy Optimization under multi-objective return combining task utility, hallucination robustness (POPE), compression, and synaptic-flow-inspired stability proxy.
Result: Experiments on LLaVA with POPE and ScienceQA demonstrate that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness-utility trade-offs under matched sparsity budgets.
Conclusion: HiPP-Prune effectively addresses VLM-specific pruning challenges by enabling controllable trade-offs between utility and hallucination robustness through hierarchical preference-conditioned pruning decisions.
Abstract: Pruning vision-language models (VLMs) for efficient deployment is challenging because compression can affect not only task utility but also visual grounding, often amplifying object hallucinations even at the same sparsity level. We present HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as conditional resource allocation under multiple objectives. HiPP-Prune makes plan-level decisions: a single policy invocation outputs a global pruning blueprint by factorizing decisions into an overall sparsity budget and a layer-wise allocation, enabling queryable trade-offs via a user-specified preference vector. To account for VLM-specific failure modes, our policy state integrates a visual sensitivity signal derived from attention flow between vision tokens and language hidden states, discouraging over-pruning of vision-critical layers that facilitate cross-modal fusion. We optimize pruning plans with plan-level Group Relative Policy Optimization (GRPO) under a multi-objective return that combines task utility, hallucination robustness (POPE), compression, and a synaptic-flow-inspired stability proxy to reduce unproductive exploration in high-sparsity regimes. Experiments on LLaVA with POPE and ScienceQA demonstrate that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness–utility trade-offs under matched sparsity budgets.
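The user-specified preference vector makes the robustness-utility trade-off queryable. A linear scalarization of the multi-objective return is one plausible reading, sketched below; the paper may condition the policy on the preference in a different way, and the objective names here are placeholders.

```python
def preference_return(objectives, preference):
    """Scalarize a multi-objective return with a user preference vector.
    Linear scalarization is an assumption; the abstract only states that
    the pruning policy is conditioned on a preference vector."""
    assert abs(sum(preference.values()) - 1.0) < 1e-6, "preferences should sum to 1"
    return sum(preference[k] * objectives[k] for k in objectives)
```

Querying the same policy with different preference vectors would then trace out different non-dominated pruning plans under a matched sparsity budget.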
[193] Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models
Canyu Chen, Yuguang Yang, Zhewen Tan, Yizhi Wang, Ruiyi Zhan, Haiyan Liu, Xuanyao Mao, Jason Bao, Xinyue Tang, Linlin Yang, Bingchuan Sun, Yan Wang, Baochang Zhang
Main category: cs.CV
TL;DR: Curious-VLA addresses the narrow policy limitation in autonomous vision-language-action models by introducing a two-stage framework with trajectory expansion during IL and diversity-aware sampling during RL to improve exploration and performance.
Details
Motivation: The paper identifies a fundamental limitation in autonomous VLA models where driving imitation learning collapses exploration, limiting subsequent reinforcement learning stages that saturate prematurely due to insufficient feedback diversity.
Method: Two-stage framework: 1) IL stage with Feasible Trajectory Expansion to generate multiple physically valid trajectories and step-wise normalized trajectory representation; 2) RL stage with Adaptive Diversity-Aware Sampling to prioritize high-diversity samples and Spanning Driving Reward with focal style weighting to amplify reward value span.
Result: Achieves state-of-the-art results on Navsim benchmark with PDMS 90.3, EPDMS 85.4, and Best-of-N PDMS of 94.8, demonstrating effectiveness in unlocking exploratory potential of VLA models.
Conclusion: Curious-VLA successfully addresses the narrow policy limitation in autonomous VLA models through its two-stage design that enhances exploration during IL and improves reward sensitivity during RL, leading to superior performance on driving benchmarks.
Abstract: We identify a fundamental Narrow Policy limitation undermining the performance of autonomous VLA models, where driving Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. We therefore propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to adapt this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS) that prioritizes high-diversity samples and introduce Spanning Driving Reward (SDR) with a focal style weighting to amplify the reward’s value span, improving sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: https://github.com/Mashiroln/curious_vla.git.
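The abstract does not give the exact form of the focal-style weighting in SDR. One focal-style transform that widens the value span among high-quality trajectories is a simple power reweighting, sketched here purely for intuition and not as the paper's actual reward.

```python
def spanning_reward(raw_reward, gamma=2.0):
    """One possible 'focal-style' weighting: raising a reward in [0, 1]
    to a power > 1 compresses the mediocre middle and stretches gaps
    among high-quality trajectories, widening the effective value span.
    The concrete form used by Curious-VLA is an assumption here."""
    assert 0.0 <= raw_reward <= 1.0
    return raw_reward ** gamma
```

With gamma = 2, the gap between raw rewards 0.8 and 0.9 grows from 0.1 to 0.17, giving the RL stage a stronger gradient toward the best trajectories.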
[194] Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting
Semin Bae, Hansol Lim, Jongseong Brad Choi
Main category: cs.CV
TL;DR: A pipeline that transforms archived omnidirectional RGB-LiDAR logs into initialization assets for 3D Gaussian Splatting digital twins, overcoming distortion and computational bottlenecks through ERP-to-cubemap conversion and LiDAR-reinforced initialization.
Details
Motivation: Large-scale digital twins require expensive data collection, while deployed platforms routinely collect extensive RGB-LiDAR logs that are discarded due to transmission constraints and lack of scalable reuse pipelines. There's a need to repurpose this underutilized sensor data for 3DGS initialization.
Method: Pipeline integrates ERP-to-cubemap conversion for deterministic spatial anchoring, PRISM color-stratified downsampling, FPFH-based global registration, and ICP to bridge multi-modal inputs. Converts raw RGB-LiDAR logs into usable SfM geometry for 3DGS initialization.
Result: Successfully repurposes discarded data into usable SfM geometry. LiDAR-reinforced initialization consistently enhances final 3DGS rendering fidelity in structurally complex scenes compared to vision-only baselines.
Conclusion: Provides a deterministic workflow for creating simulation-grade digital twins from standard archived sensor logs, enabling scalable reuse of underutilized RGB-LiDAR data for 3DGS applications.
Abstract: The demand for large-scale digital twins is rapidly growing in robotics and autonomous driving. However, constructing these environments with 3D Gaussian Splatting (3DGS) usually requires expensive, purpose-built data collection. Meanwhile, deployed platforms routinely collect extensive omnidirectional RGB and LiDAR logs, but a significant portion of this sensor data is directly discarded or strictly underutilized due to transmission constraints and the lack of a scalable reuse pipeline. In this paper, we present an omnidirectional RGB-LiDAR reuse pipeline that transforms these archived logs into robust initialization assets for 3DGS. Direct conversion of such raw logs introduces practical bottlenecks: inherent non-linear distortion leads to unreliable Structure-from-Motion (SfM) tracking, and dense, unorganized LiDAR clouds cause computational overhead during 3DGS optimization. To overcome these challenges, our pipeline strategically integrates an ERP-to-cubemap conversion module for deterministic spatial anchoring, alongside PRISM, a color-stratified downsampling strategy. By bridging these multi-modal inputs via Fast Point Feature Histograms (FPFH) based global registration and Iterative Closest Point (ICP), our pipeline successfully repurposes a considerable fraction of discarded data into usable SfM geometry. Furthermore, our LiDAR-reinforced initialization consistently enhances the final 3DGS rendering fidelity in structurally complex scenes compared to vision-only baselines. Ultimately, this work provides a deterministic workflow for creating simulation-grade digital twins from standard archived sensor logs.
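The geometric core of ERP-to-cubemap conversion is mapping each cubemap pixel to a 3D view direction and then to equirectangular coordinates. A sketch under an assumed face-axis convention (conventions vary between tools, so the `dirs` layout below is illustrative, not the paper's):

```python
import math

def cube_face_to_erp(face, u, v, erp_w, erp_h):
    """Map a normalized cubemap coordinate (u, v in [-1, 1]) on one of six
    faces to pixel coordinates in an equirectangular (ERP) image.
    Face axis conventions are an assumption here."""
    dirs = {
        'front':  (u, v, 1.0),
        'back':   (-u, v, -1.0),
        'right':  (1.0, v, -u),
        'left':   (-1.0, v, u),
        'up':     (u, 1.0, -v),
        'down':   (u, -1.0, v),
    }
    x, y, z = dirs[face]
    lon = math.atan2(x, z)                 # longitude in [-pi, pi]
    lat = math.atan2(y, math.hypot(x, z))  # latitude in [-pi/2, pi/2]
    px = (lon / math.pi + 1.0) * 0.5 * (erp_w - 1)
    py = (lat / (math.pi / 2) + 1.0) * 0.5 * (erp_h - 1)
    return px, py
```

Sampling the ERP image at these coordinates for every cube-face pixel yields six perspective views with conventional pinhole geometry, which is what makes SfM tracking tractable on distorted omnidirectional input.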
[195] Lyapunov Probes for Hallucination Detection in Large Foundation Models
Bozhi Luan, Gen Li, Yalan Qin, Jifeng Guo, Yun Zhou, Faguo Wu, Hongwei Zheng, Wenjun Wu, Zhaoxin Fan
Main category: cs.CV
TL;DR: Proposes Lyapunov Probes for hallucination detection in LLMs/MLLMs using dynamical systems stability theory, treating factual knowledge as stable equilibrium points and hallucinations as boundary phenomena.
Details
Motivation: Current hallucination detection methods treat the problem as simple classification, missing the underlying dynamical nature of (M)LLMs where factual knowledge exists as stable representations and hallucinations emerge at knowledge-transition boundaries.
Method: Frames (M)LLMs as dynamical systems, proposes Lyapunov Probes - lightweight networks trained with derivative-based stability constraints that enforce monotonic confidence decay under perturbations. Uses systematic perturbation analysis and two-stage training to distinguish stable factual regions from unstable hallucination-prone regions.
Result: Experiments on diverse datasets and models show consistent improvements over existing baselines for hallucination detection in both LLMs and MLLMs.
Conclusion: The dynamical systems perspective provides a principled approach to hallucination detection, with Lyapunov Probes effectively capturing stability properties to identify hallucination-prone regions in (M)LLMs.
Abstract: We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge-transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone regions. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.
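The derivative-based stability constraint enforces monotone confidence decay as the perturbation magnitude grows. A finite-difference stand-in for that constraint (the paper's exact formulation is not given in the abstract):

```python
def monotonic_decay_penalty(confidences):
    """Given probe confidences evaluated at increasing perturbation
    magnitudes, penalize any increase: near a stable equilibrium the
    confidence should decay monotonically. Finite differences stand in
    for the paper's derivative-based constraint, which is an assumption."""
    penalty = 0.0
    for c_prev, c_next in zip(confidences, confidences[1:]):
        penalty += max(0.0, c_next - c_prev)  # hinge on any positive slope
    return penalty
```

A stable factual region yields zero penalty; a boundary region where confidence rebounds under perturbation accumulates penalty, which the probe's training objective would minimize.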
[196] DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne
Main category: cs.CV
TL;DR: DEX-AR is a novel explainability method for autoregressive Vision-Language Models that generates 2D heatmaps highlighting image regions important for textual responses, addressing challenges in interpreting complex multimodal interactions.
Details
Motivation: Traditional explainability methods designed for classification tasks struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate visual-textual modality interactions, creating a need for specialized interpretability approaches.
Method: DEX-AR computes layer-wise gradients with respect to attention maps during token-by-token generation, featuring dynamic head filtering to identify attention heads focused on visual information, and sequence-level filtering to aggregate per-token explanations while distinguishing visually-grounded from linguistic tokens.
Result: Evaluation on ImageNet, VQAv2, and PascalVOC shows consistent improvement in both perturbation-based metrics (using a novel normalized perplexity measure) and segmentation-based metrics compared to existing methods.
Conclusion: DEX-AR successfully addresses the unique challenges of explaining autoregressive VLMs by providing both per-token and sequence-level visual explanations, offering insights into how these models integrate visual and textual information during generation.
Abstract: As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes more and more crucial to understand their decision-making process. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model’s textual responses. The proposed method interprets autoregressive VLMs, accounting for the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually-grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based metrics (using a novel normalized perplexity measure) and segmentation-based metrics.
[197] FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification
Xin Xu, Binchang Ma, Zhixi Yu, Wei Liu
Main category: cs.CV
TL;DR: FedARKS improves federated domain generalization for person re-identification by addressing limitations in global feature representation and uniform parameter averaging through robust knowledge selection and integration mechanisms.
Details
Motivation: Existing FedDG-ReID methods have two key limitations: 1) reliance on global features misses subtle domain-invariant local details like accessories/textures, and 2) uniform parameter averaging ignores differences in clients' robust feature extraction capabilities, diluting contributions from high-quality clients.
Method: Proposes FedARKS framework with two mechanisms: RK (Robust Knowledge) for capturing domain-invariant local details beyond global features, and KS (Knowledge Selection) for selective aggregation that accounts for differences in clients’ feature extraction capabilities rather than simple averaging.
Result: The paper claims FedARKS addresses the identified limitations, though specific quantitative results are not provided in the abstract.
Conclusion: FedARKS offers a novel approach to federated domain generalization for person re-identification that better preserves domain-invariant details and leverages high-quality client contributions through selective knowledge integration.
Abstract: The application of federated domain generalization in person re-identification (FedDG-ReID) aims to enhance the model’s generalization ability in unseen domains while protecting client data privacy. However, existing mainstream methods typically rely on global feature representations and simple averaging operations for model aggregation, leading to two limitations in domain generalization: (1) Using only global features makes it difficult to capture subtle, domain-invariant local details (such as accessories or textures); (2) Uniform parameter averaging treats all clients as equivalent, ignoring their differences in robust feature extraction capabilities, thereby diluting the contributions of high quality clients. To address these issues, we propose a novel federated learning framework, Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration (FedARKS), comprising two mechanisms: RK (Robust Knowledge) and KS (Knowledge Selection).
[198] Cross-Resolution Distribution Matching for Diffusion Distillation
Feiyang Chen, Hongpeng Pan, Haonan Xu, Xinyu Duan, Yang Yang, Zhefeng Wang
Main category: cs.CV
TL;DR: RMD is a diffusion distillation framework that bridges cross-resolution distribution gaps for high-fidelity, few-step multi-resolution cascaded inference, achieving significant speedups while maintaining visual quality.
Details
Motivation: Existing diffusion distillation methods are limited by denoising process saturation, and partial timestep low-resolution generation suffers quality degradation due to cross-resolution distribution gaps.
Method: Proposes Cross-Resolution Distribution Matching Distillation (RMD) that: 1) divides timestep intervals per resolution using logSNR curves, 2) uses logSNR-based mapping to compensate resolution-induced shifts, 3) performs distribution matching along resolution trajectories, and 4) incorporates predicted-noise re-injection during upsampling.
Result: RMD achieves up to 33.4X speedup on SDXL and 25.6X on Wan2.1-14B while preserving high visual fidelity, outperforming existing methods in both quantitative and qualitative evaluations.
Conclusion: RMD effectively bridges cross-resolution distribution gaps for efficient multi-resolution cascaded inference, enabling significant acceleration of diffusion models without sacrificing generation quality.
Abstract: Diffusion distillation is central to accelerating image and video generation, yet existing methods are fundamentally limited by the denoising process, where step reduction has largely saturated. Partial timestep low-resolution generation can further accelerate inference, but it suffers noticeable quality degradation due to cross-resolution distribution gaps. We propose Cross-Resolution Distribution Matching Distillation (RMD), a novel distillation framework that bridges cross-resolution distribution gaps for high-fidelity, few-step multi-resolution cascaded inference. Specifically, RMD divides the timestep intervals for each resolution using logarithmic signal-to-noise ratio (logSNR) curves, and introduces logSNR-based mapping to compensate for resolution-induced shifts. Distribution matching is conducted along resolution trajectories to reduce the gap between low-resolution generator distributions and the teacher’s high-resolution distribution. In addition, a predicted-noise re-injection mechanism is incorporated during upsampling to stabilize training and improve synthesis quality. Quantitative and qualitative results show that RMD preserves high-fidelity generation while accelerating inference across various backbones. Notably, RMD achieves up to 33.4X speedup on SDXL and 25.6X on Wan2.1-14B, while preserving high visual fidelity.
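The logSNR-based timestep mapping can be illustrated with a cosine noise schedule: compute logSNR at the low-resolution timestep, shift it to compensate for the resolution change, and invert the schedule. The +2*log(scale) shift is the common "shifted schedule" heuristic and an assumption here; RMD's exact mapping may differ.

```python
import math

def logsnr_cosine(t):
    """logSNR of a cosine noise schedule at t in (0, 1):
    alpha = cos(pi*t/2), sigma = sin(pi*t/2) => logSNR = -2*log(tan(pi*t/2))."""
    return -2.0 * math.log(math.tan(math.pi * t / 2.0))

def t_from_logsnr(logsnr):
    """Inverse of logsnr_cosine, recovering t in (0, 1)."""
    return (2.0 / math.pi) * math.atan(math.exp(-logsnr / 2.0))

def map_timestep_across_resolution(t_low, scale):
    """Map a low-resolution timestep to the timestep with the matching
    noise level at `scale`x the resolution, assuming downsampling shifts
    logSNR by +2*log(scale) (the shifted-schedule heuristic, assumed)."""
    return t_from_logsnr(logsnr_cosine(t_low) + 2.0 * math.log(scale))
```

With scale = 1 the mapping is the identity; with scale = 2 the same logSNR corresponds to an earlier timestep, which is what lets the cascade hand off between resolutions at matched noise levels.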
[199] K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging
Jiajun Zeng, Shadi Albarqouni
Main category: cs.CV
TL;DR: K-MaT is a prompt-learning framework that transfers decision structures from high-end to low-end medical imaging modalities without requiring low-end training images, using knowledge-anchored prompts and optimal transport alignment.
Details
Motivation: Biomedical vision-language models adapted for high-end imaging (like CT) often fail to transfer to low-end modalities (like radiography) due to modality-specific shortcuts, creating a need for effective cross-modal deployment without requiring low-end training data.
Method: K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport, enabling zero-shot cross-modal transfer.
Result: K-MaT achieves state-of-the-art results on four cross-modal benchmarks, improving average harmonic mean of accuracy to 44.1% (from 42.0%) and macro-F1 to 36.2%, while mitigating catastrophic forgetting seen in standard methods.
Conclusion: Aligning prompt manifolds via optimal transport provides an effective route for zero-shot cross-modal deployment of medical VLMs, enabling robust performance across different imaging modalities without low-end training data.
Abstract: Large-scale biomedical vision-language models (VLMs) adapted on high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography to ultrasound, and CT to chest X-ray. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp’s 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods like CoOp (which drops to 27.0% accuracy on the low-end), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport provides a highly effective route for the zero-shot cross-modal deployment of medical VLMs.
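K-MaT aligns prompt manifolds with Fused Gromov-Wasserstein transport. FGW adds a structure-matching (Gromov) term on top of a feature-cost term; the sketch below shows only the entropic-OT core (Sinkhorn iterations) that such methods build on, for intuition, not the fused objective itself.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """Entropic optimal transport via Sinkhorn iterations on a dense cost
    matrix (nested lists), with source/target marginals a and b.
    Returns the transport plan. This is plain OT; K-MaT's FGW objective
    additionally matches pairwise structure across the two spaces."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternate scaling to satisfy row and column marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

On a cost matrix favoring the diagonal, the plan concentrates mass on matched pairs, which is the mechanism by which low-end prompts would be pulled toward their high-end counterparts.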
[200] Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models
Nicoleta-Nina Basoc, Adrian Cosma, Emilian Radoi
Main category: cs.CV
TL;DR: VLMs show systematic perceptual weaknesses to structured color distortions that don’t affect human recognition, with scaling not helping, but human-inspired preprocessing can recover some performance.
Details
Motivation: Despite strong benchmark performance, VLMs exhibit systematic perceptual weaknesses where structured color distortions cause confident but nonsensical predictions, even when humans can easily recognize the underlying scenes.
Method: Introduces Spatial Colour Mixing - programmatic family of color distortions overlaying structured patterns in RGB and Ostwald color systems onto natural images. Evaluates 9 VLMs across 3 model families on 4 datasets with 8 distortion variants, plus human study with 61 participants.
Result: Accuracy degrades sharply with increasing distortion across all models and datasets. Scaling language models doesn’t reliably mitigate failures. Humans substantially outperform VLMs under same distortions. Simple human-inspired preprocessing recovers meaningful performance for several distortion types.
Conclusion: VLMs have systematic perceptual weaknesses to structured color distortions. Perception-aware preprocessing and tool-use are practical strategies for improving VLM robustness to such distortions.
Abstract: Vision-language models (VLMs) achieve strong benchmark results, yet can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions, even when the underlying scene remains easily recognizable to humans. We study this gap using Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. We introduce a framework of eight spatial colour mixing variants and evaluate nine VLMs across three model families on four datasets. Across models and datasets, accuracy degrades sharply with increasing distortion, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs under the same distortions. Finally, we show that a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types, motivating perception-aware preprocessing and tool-use as practical strategies for improving VLM robustness.
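One illustrative spatial colour mixing distortion (the paper defines eight variants, in both RGB and Ostwald colour systems) is blending a structured checkerboard of two colours into the image. The checkerboard pattern, cell size, and blend weight below are assumptions, not the paper's parameterization.

```python
def mix_checkerboard(image, colour_a, colour_b, cell=8, alpha=0.5):
    """Overlay a checkerboard of two RGB colours onto an image (a list of
    rows of (r, g, b) tuples), blending each pixel toward the pattern.
    The scene structure survives (edges and shapes are preserved), but
    absolute colours are heavily perturbed - the regime the paper probes."""
    out = []
    for y, row in enumerate(image):
        new_row = []
        for x, px in enumerate(row):
            patch = colour_a if ((x // cell + y // cell) % 2 == 0) else colour_b
            new_row.append(tuple(
                (1 - alpha) * c + alpha * p for c, p in zip(px, patch)))
        out.append(new_row)
    return out
```

Because the blend is spatially structured rather than random noise, humans can still discount it, which is exactly the gap between human and VLM robustness the study measures.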
[201] Longitudinal NSCLC Treatment Progression via Multimodal Generative Models
Massimiliano Mantegna, Elena Mulero AyllĂłn, Alice Natalina Caragliano, Francesco Di Feola, Claudia Tacconi, Michele Fiore, Edy Ippolito, Carlo Greco, Sara Ramella, Philippe C. Cattin, Paolo Soda, Matteo Tortora, Valerio Guarrasi
Main category: cs.CV
TL;DR: A Virtual Treatment framework for predicting non-small cell lung cancer progression during radiotherapy using dose-aware multimodal conditional image-to-image translation to synthesize follow-up CT scans.
Details
Motivation: Predicting tumor evolution during radiotherapy is clinically critical but challenging due to longitudinal changes driven by both anatomy and treatment effects. Current methods lack integration of radiation dose information with clinical variables for accurate progression modeling.
Method: Formulates NSCLC progression as a dose-aware multimodal conditional image-to-image translation problem. Uses CT scans, baseline clinical variables, and radiation dose increments to synthesize follow-up CT images. Benchmarks GAN-based and diffusion-based models in 2D and 2.5D configurations with multimodal conditioning on delivered dose increments plus demographic/tumor variables.
Result: Evaluated on 222 stage III NSCLC patients with 895 CT scans. Diffusion-based models benefit more consistently from multimodal, dose-aware conditioning and produce more stable, anatomically plausible tumor evolution trajectories than GAN-based baselines.
Conclusion: The VT framework shows potential as a tool for in-silico treatment monitoring and adaptive radiotherapy research in NSCLC, with diffusion models outperforming GANs in generating realistic treatment-induced anatomical changes.
Abstract: Predicting tumor evolution during radiotherapy is a clinically critical challenge, particularly when longitudinal changes are driven by both anatomy and treatment. In this work, we introduce a Virtual Treatment (VT) framework that formulates non-small cell lung cancer (NSCLC) progression as a dose-aware multimodal conditional image-to-image translation problem. Given a CT scan, baseline clinical variables, and a specified radiation dose increment, VT aims to synthesize plausible follow-up CT images reflecting treatment-induced anatomical changes. We evaluate the proposed framework on a longitudinal dataset of 222 stage III NSCLC patients, comprising 895 CT scans acquired during radiotherapy under irregular clinical schedules. The generative process is conditioned on delivered dose increments together with demographic and tumor-related clinical variables. Representative GAN-based and diffusion-based models are benchmarked across 2D and 2.5D configurations. Quantitative and qualitative results indicate that diffusion-based models benefit more consistently from multimodal, dose-aware conditioning and produce more stable and anatomically plausible tumor evolution trajectories than GAN-based baselines, supporting the potential of VT as a tool for in-silico treatment monitoring and adaptive radiotherapy research in NSCLC.
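How the dose increment and clinical variables are encoded for conditioning is not specified; one plausible sketch of assembling the conditioning vector (the `build_condition` helper and the z-score normalisation are assumptions):

```python
import numpy as np

def build_condition(dose_increment, clinical):
    """Dose-aware multimodal conditioning vector: the delivered dose
    increment is concatenated with normalised clinical variables; the
    result would be fed to the generator alongside the baseline CT."""
    c = np.asarray(clinical, dtype=float)
    c = (c - c.mean()) / (c.std() + 1e-8)   # normalise clinical covariates
    return np.concatenate([[dose_increment], c])
```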
[202] Dynamic Chunking Diffusion Transformer
Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum
Main category: cs.CV
TL;DR: DC-DiT introduces adaptive token compression for diffusion transformers, using learned chunking to allocate more tokens to detail-rich regions and fewer to uniform areas, improving efficiency and quality.
Details
Motivation: Standard diffusion transformers use fixed patchify operations that apply uniform compute to all image regions, ignoring that images have varying detail levels and denoising progresses from coarse to fine structure.
Method: Augments DiT backbone with learned encoder-router-decoder scaffold that adaptively compresses 2D input into shorter token sequences using end-to-end learned chunking mechanism, allocating tokens based on visual complexity.
Result: Consistently improves FID and Inception Score over parameter-matched and FLOP-matched DiT baselines on class-conditional ImageNet 256×256 across 4× and 16× compression ratios.
Conclusion: DC-DiT demonstrates adaptive token allocation is a promising technique for efficient image generation with potential applications to video and 3D generation, and can be upcycled from pretrained models with minimal training.
Abstract: Diffusion Transformers process images as fixed-length sequences of tokens produced by a static $\textit{patchify}$ operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. Furthermore, it also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet $256{\times}256$, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across $4{\times}$ and $16{\times}$ compression, showing this is a promising technique with potential further applications to pixel-space, video and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to $8{\times}$ fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.
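The encoder-router-decoder is learned end-to-end with diffusion training; as a rough training-free stand-in for the routing idea, one can score patches by pixel variance and keep only the detail-rich ones as fine tokens (`chunk_by_detail` and the variance heuristic are illustrative, not the paper's learned mechanism):

```python
import numpy as np

def chunk_by_detail(image, patch=4, budget=8):
    """Toy detail-aware token allocation: score each patch by pixel
    variance and keep the `budget` most detailed patches as individual
    tokens, pooling the rest into a single coarse token.

    Returns (fine_idx, n_tokens): indices of patches kept at full
    resolution, and the resulting sequence length.
    """
    h, w = image.shape
    ph, pw = h // patch, w // patch
    patches = image.reshape(ph, patch, pw, patch).transpose(0, 2, 1, 3)
    scores = patches.reshape(ph * pw, -1).var(axis=1)
    fine_idx = np.argsort(scores)[-budget:]   # detail-rich patches
    n_tokens = budget + 1                     # + one pooled coarse token
    return np.sort(fine_idx), n_tokens
```

On an image whose detail is concentrated in one quadrant, the budget lands exactly on that quadrant's patches, mirroring the "fewer tokens for uniform background" behaviour described above.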
[203] FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models
Andrew Caunes, Thierry Chateau, Vincent Fremont
Main category: cs.CV
TL;DR: FreeOcc: Training-free 3D semantic and panoptic occupancy prediction using pretrained foundation models for road scene analysis without 3D supervision or domain-specific training.
Details
Motivation: Current camera-only approaches for 3D occupancy prediction require costly dense 3D supervision or domain-specific training, limiting deployment in unseen environments. There's a need for training-free methods that can generalize across domains.
Method: Leverages pretrained foundation models: 1) promptable foundation segmentation model extracts per-view panoptic priors, 2) reconstruction foundation model recovers metric 3D points. Uses depth- and confidence-aware filtering to lift 2D labels to 3D, fuses over time, and voxelizes with deterministic refinement. For panoptic occupancy, fits and merges robust 3D box candidates from current views.
Result: On Occ3D-nuScenes: 16.9 mIoU and 16.5 RayIoU train-free (on par with state-of-the-art weakly supervised methods). As pseudo-label generator: 21.1 RayIoU (surpasses previous SOTA). Sets new baselines for train-free (3.1 RayPQ) and weakly supervised (3.9 RayPQ) panoptic occupancy prediction.
Conclusion: Foundation-model-driven perception provides a practical route to training-free 3D scene understanding, enabling deployment in unseen environments without costly 3D supervision or domain-specific training.
Abstract: Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle’s surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence-aware filtering lifts reliable labels into 3D, which are fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU train-free, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.
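The lift-and-voxelize step can be sketched with a pinhole back-projection and a per-voxel majority vote (the confidence threshold and the `lift_and_voxelize` helper are simplifying assumptions; the full pipeline adds temporal fusion and a deterministic refinement stack):

```python
import numpy as np

def lift_and_voxelize(labels, depth, conf, K, voxel=0.5, conf_thresh=0.5):
    """Back-project per-pixel labels into 3D with a pinhole model,
    drop low-confidence pixels, and majority-vote labels per voxel.

    labels, depth, conf : (H, W) arrays; K : 3x3 camera intrinsics.
    Returns {voxel_index_tuple: label}.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    keep = conf >= conf_thresh          # confidence-aware filtering
    z = depth[keep]
    x = (u[keep] - K[0, 2]) * z / K[0, 0]
    y = (v[keep] - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    vox = np.floor(pts / voxel).astype(int)
    out = {}
    for idx, lab in zip(map(tuple, vox), labels[keep]):
        out.setdefault(idx, []).append(lab)
    # Majority vote resolves conflicting labels inside a voxel.
    return {k: max(set(vs), key=vs.count) for k, vs in out.items()}
```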
[204] A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement
Ruili Li, Jiayi Ding, Ruiyu Li, Yilun Jin, Shiwen Ge, Yuwen Zeng, Xiaoyong Zhang, Eichi Takaya, Jan Vrba, Noriyasu Homma
Main category: cs.CV
TL;DR: A semi-supervised framework for breast ultrasound image segmentation using vision-language models for training-free pseudo-label generation and refinement, achieving performance comparable to fully supervised models with only 2.5% labeled data.
Details
Motivation: Semi-supervised learning for breast ultrasound segmentation suffers from unstable pseudo labels under limited annotations. While vision-language models offer potential for pseudo-label generation, their effectiveness on medical images is limited due to domain transfer challenges.
Method: Proposes a semi-supervised framework with: 1) Training-free pseudo-label generation using simple appearance-based descriptions (e.g., “dark oval”) to enable cross-domain structural transfer between natural and medical images; 2) Label refinement using a static teacher capturing global structural priors; 3) Exponential moving average teacher with uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning for improved boundary discrimination.
Result: Achieves performance comparable to fully supervised models with only 2.5% labeled data across four breast ultrasound datasets, significantly outperforming existing SSL approaches. The framework is extensible to other imaging modalities with only global appearance descriptions needed.
Conclusion: The proposed method enables scalable semi-supervised medical image segmentation under limited annotations by leveraging vision-language models with domain-adaptive prompts and uncertainty-aware refinement techniques.
Abstract: Semi-supervised learning (SSL) has emerged as a promising paradigm for breast ultrasound (BUS) image segmentation, but it often suffers from unstable pseudo labels under extremely limited annotations, leading to inaccurate supervision and degraded performance. Recent vision-language models (VLMs) provide a new opportunity for pseudo-label generation, yet their effectiveness on BUS images remains limited because domain-specific prompts are difficult to transfer. To address this issue, we propose a semi-supervised framework with training-free pseudo-label generation and label refinement. By leveraging simple appearance-based descriptions (e.g., dark oval), our method enables cross-domain structural transfer between natural and medical images, allowing VLMs to generate structurally consistent pseudo labels. These pseudo labels are used to warm up a static teacher that captures global structural priors of breast lesions. Combined with an exponential moving average teacher, we further introduce uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination. Experiments on four BUS datasets demonstrate that our method achieves performance comparable to fully supervised models even with only 2.5% labeled data, significantly outperforming existing SSL approaches. Moreover, the proposed paradigm is readily extensible: for other imaging modalities or diseases, only a global appearance description is required to obtain reliable pseudo supervision, enabling scalable semi-supervised medical image segmentation under limited annotations.
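The EMA teacher and the uncertainty-entropy-weighted fusion admit a compact sketch (the inverse-entropy weighting is our reading of the fusion rule, not the paper's exact formulation):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Exponential-moving-average teacher: each teacher weight drifts
    slowly toward the student's current weight."""
    return {k: momentum * teacher[k] + (1 - momentum) * student[k]
            for k in teacher}

def entropy(p):
    p = np.clip(p, 1e-8, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def fuse_teachers(p_static, p_ema):
    """Weight each teacher's probability map by the inverse of its
    predictive entropy, so the less uncertain teacher dominates.
    p_static, p_ema : (N, C) per-pixel class probabilities."""
    w_s = 1.0 / (1.0 + entropy(p_static))
    w_e = 1.0 / (1.0 + entropy(p_ema))
    fused = w_s[..., None] * p_static + w_e[..., None] * p_ema
    return fused / fused.sum(axis=-1, keepdims=True)
```

A confident static teacher pulls the fused prediction toward its own class, which is the intended behaviour of uncertainty-weighted fusion.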
[205] JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas
Sandeep Inuganti, Hideaki Kanayama, Kanta Shimizu, Mahdi Chamseddine, Soichiro Yokota, Didier Stricker, Jason Rambach
Main category: cs.CV
TL;DR: JOPP-3D is an open-vocabulary semantic segmentation framework that jointly processes panoramic images and 3D point clouds for language-driven scene understanding, achieving state-of-the-art results on Stanford-2D-3D-s and ToF-360 datasets.
Details
Motivation: Semantic segmentation across visual modalities like 3D point clouds and panoramic images is challenging due to limited annotated data and fixed-label models' poor adaptability. The paper aims to enable more flexible, language-driven scene understanding.
Method: Converts RGB-D panoramic images to tangential perspective images and 3D point clouds, extracts and aligns foundational vision-language features from both modalities, and uses natural language queries to generate semantic masks across both input types.
Result: Achieves significant improvement over state-of-the-art in both open and closed vocabulary 2D and 3D semantic segmentation on Stanford-2D-3D-s and ToF-360 datasets, producing coherent and semantically meaningful segmentations.
Conclusion: JOPP-3D successfully enables joint open-vocabulary semantic segmentation across panoramic and 3D point cloud modalities through language-driven scene understanding, demonstrating strong cross-modal capabilities.
Abstract: Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
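Once per-point features are aligned with a vision-language space, open-vocabulary querying typically reduces to cosine similarity against a text embedding; a minimal sketch (the threshold and the `query_mask` helper are assumptions):

```python
import numpy as np

def query_mask(point_feats, text_feats, query_idx, thresh=0.5):
    """Label each 3D point by the cosine similarity between its aligned
    vision-language feature and a chosen text query embedding.

    point_feats : (N_points, d), text_feats : (N_queries, d).
    Returns a boolean mask over the points."""
    pf = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    tf = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = pf @ tf.T                   # (N_points, N_queries)
    return sims[:, query_idx] >= thresh
```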
[206] Prompt Group-Aware Training for Robust Text-Guided Nuclei Segmentation
Yonghuang Wu, Zhenyang Liang, Wenwen Zeng, Xuan Xie, Jinhua Yu
Main category: cs.CV
TL;DR: A method to improve prompt robustness in text-guided medical image segmentation by treating semantically equivalent prompts as groups and enforcing consistency through quality-guided regularization and logit-level constraints.
Details
Motivation: Foundation models like SAM3 enable text-guided medical image segmentation but suffer from high sensitivity to prompt formulation - even semantically equivalent descriptions can yield inconsistent masks, limiting reliability in clinical workflows.
Method: Reformulates prompt sensitivity as a group-wise consistency problem. Organizes semantically related prompts into prompt groups sharing the same ground-truth mask. Uses quality-guided group regularization (leveraging segmentation loss as implicit ranking signal) and logit-level consistency constraint with stop-gradient strategy to align predictions within each group.
Result: Extensive experiments on multi-dataset nuclei benchmarks show consistent gains under textual prompting and markedly reduced performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, improves Dice by average of 2.16 points.
Conclusion: The method demonstrates improved robustness and generalization for vision-language segmentation in computational pathology without requiring architectural modifications or changing inference.
Abstract: Foundation models such as Segment Anything Model 3 (SAM3) enable flexible text-guided medical image segmentation, yet their predictions remain highly sensitive to prompt formulation. Even semantically equivalent descriptions can yield inconsistent masks, limiting reliability in clinical and pathology workflows. We reformulate prompt sensitivity as a group-wise consistency problem. Semantically related prompts are organized into \emph{prompt groups} sharing the same ground-truth mask, and a prompt group-aware training framework is introduced for robust text-guided nuclei segmentation. The approach combines (i) a quality-guided group regularization that leverages segmentation loss as an implicit ranking signal, and (ii) a logit-level consistency constraint with a stop-gradient strategy to align predictions within each group. The method requires no architectural modification and leaves inference unchanged. Extensive experiments on multi-dataset nuclei benchmarks show consistent gains under textual prompting and markedly reduced performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, our method improves Dice by an average of 2.16 points. These results demonstrate improved robustness and generalization for vision-language segmentation in computational pathology.
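The logit-level consistency constraint with a stop-gradient target can be sketched as follows (in an autograd framework the target would be detached from the graph; here a plain copy stands in for that, and taking the lowest-loss prompt as the target is our reading of the quality-guided ranking):

```python
import numpy as np

def group_consistency_loss(logits, seg_losses):
    """Align every prompt's logits with the group's best prompt.

    logits     : (P, N) per-prompt logit maps for the same target mask
    seg_losses : (P,) segmentation loss of each prompt, used as an
                 implicit quality ranking signal.
    The best prompt's logits act as a fixed target (stop-gradient in a
    real framework), so only the weaker prompts are pulled toward it.
    """
    best = np.argmin(seg_losses)
    target = logits[best].copy()      # detached target
    diffs = logits - target           # (P, N)
    return np.mean(diffs ** 2)
```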
[207] Optimizing 3D Diffusion Models for Medical Imaging via Multi-Scale Reward Learning
Yueying Tian, Xudong Han, Meng Zhou, Rodrigo Aviles-Espinosa, Rupert Young, Philip Birch
Main category: cs.CV
TL;DR: RL-enhanced 3D diffusion models for medical image generation with multi-scale feedback improve both image quality and downstream task utility
Details
Motivation: Standard diffusion model training objectives don't align well with clinical relevance in 3D medical imaging, creating a gap between technical metrics and practical utility.
Method: Pretrain 3D diffusion model on MRI volumes, then fine-tune with PPO using multi-scale reward system combining 2D slice-wise assessments and 3D volumetric analysis
Result: Significant improvements in FID scores and enhanced utility in downstream tumor/disease classification tasks compared to non-optimized baselines
Conclusion: RL feedback effectively steers 3D diffusion generation toward higher quality distributions with better clinical relevance
Abstract: Diffusion models have emerged as powerful tools for 3D medical image generation, yet bridging the gap between standard training objectives and clinical relevance remains a challenge. This paper presents a method to enhance 3D diffusion models using Reinforcement Learning (RL) with multi-scale feedback. We first pretrain a 3D diffusion model on MRI volumes to establish a robust generative prior. Subsequently, we fine-tune the model using Proximal Policy Optimization (PPO), guided by a novel reward system that integrates both 2D slice-wise assessments and 3D volumetric analysis. This combination allows the model to simultaneously optimize for local texture details and global structural coherence. We validate our framework on the BraTS 2019 and OASIS-1 datasets. Our results indicate that incorporating RL feedback effectively steers the generation process toward higher quality distributions. Quantitative analysis reveals significant improvements in Fréchet Inception Distance (FID) and, crucially, the synthetic data demonstrates enhanced utility in downstream tumor and disease classification tasks compared to non-optimized baselines.
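The multi-scale reward, combining slice-wise 2D assessments with a volumetric 3D score, could plausibly be a weighted sum (the equal weights and the `multiscale_reward` signature are assumptions about the paper's reward system):

```python
import numpy as np

def multiscale_reward(volume, reward_2d, reward_3d, w2d=0.5, w3d=0.5):
    """Combine slice-wise 2D rewards (averaged over axial slices) with a
    single volumetric 3D reward into one scalar PPO reward.

    volume    : (S, H, W) array of axial slices
    reward_2d : callable scoring one (H, W) slice (local texture)
    reward_3d : callable scoring the full volume (global coherence)
    """
    slice_scores = np.array([reward_2d(s) for s in volume])
    return w2d * slice_scores.mean() + w3d * reward_3d(volume)
```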
[208] Physical Simulator In-the-Loop Video Generation
Lin Geng Foo, Mark He Huang, Alexandros Lattas, Stylianos Moschoglou, Thabo Beeler, Christian Theobalt
Main category: cs.CV
TL;DR: PSIVG integrates physical simulation into video diffusion to enforce physical laws like gravity and collision, improving motion realism in generated videos.
Details
Motivation: Current diffusion-based video generation produces visually realistic results but often violates basic physical laws (gravity, inertia, collision), limiting realism and reliability of AI-generated videos.
Method: PSIVG reconstructs 4D scenes and object meshes from template videos, initializes them in a physical simulator to generate physically consistent trajectories, then guides video diffusion with these trajectories. Also includes Test-Time Texture Consistency Optimization (TTCO) for texture consistency during movement.
Result: Comprehensive experiments show PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity.
Conclusion: Integrating physical simulation into video generation addresses physical inconsistency issues and improves motion realism in AI-generated videos.
Abstract: Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity. Project Page: https://vcai.mpi-inf.mpg.de/projects/PSIVG/
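The simulator-in-the-loop idea rests on producing physically consistent trajectories; a toy sketch of the simulation step (a point mass under gravity with a bouncing ground plane, whereas the actual framework simulates reconstructed 4D meshes):

```python
import numpy as np

def simulate_trajectory(y0, vy0, steps=100, dt=0.05, g=9.81, restitution=0.6):
    """Integrate a point mass under gravity with a bouncing ground plane
    at y = 0, yielding a physically consistent height trajectory."""
    traj, y, vy = [], y0, vy0
    for _ in range(steps):
        vy -= g * dt          # gravity
        y += vy * dt
        if y < 0.0:           # collision: clamp and bounce with energy loss
            y = 0.0
            vy = -restitution * vy
        traj.append(y)
    return np.array(traj)
```

A trajectory like this would then serve as guidance for the video generator, which is where the physical constraints enter the diffusion process.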
[209] Making Training-Free Diffusion Segmentors Scale with the Generative Power
Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang
Main category: cs.CV
TL;DR: Training-free semantic segmentation using text-to-image diffusion models by addressing attention map aggregation and score imbalance issues.
Details
Motivation: Existing training-free diffusion segmentors fail to scale with more powerful diffusion models, despite the assumption that stronger generative capability should lead to better segmentation. The authors identify two gaps: (1) discrepancy between individual attention maps and unified global representation, and (2) score imbalances among text tokens preventing accurate semantic correlation.
Method: Proposes two techniques: auto aggregation to combine cross-attention maps across multiple heads and layers into a unified global representation, and per-pixel rescaling to address score imbalances among different text tokens for accurate semantic correlation.
Result: Evaluated on standard semantic segmentation benchmarks and integrated into a generative technique, demonstrating improved performance and broad applicability.
Conclusion: The proposed techniques enable training-free segmentation to better leverage generative capability of diffusion models, bridging the gap between generative power and discriminative performance.
Abstract: As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are known as training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model’s attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation. (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Codes are at https://github.com/Darkbblue/goca.
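The two proposed techniques, aggregation into a global map and per-pixel rescaling, can be sketched as follows (plain averaging stands in for the learned "auto aggregation", and a per-pixel softmax over tokens stands in for the rescaling; both are simplifications):

```python
import numpy as np

def aggregate_and_rescale(attn):
    """attn : (L, H, N, T) cross-attention maps over L layers, H heads,
    N pixels, and T text tokens.  Average into one global (N, T) map,
    then normalise over tokens per pixel so that raw score imbalances
    among tokens do not dominate the semantic assignment."""
    global_map = attn.mean(axis=(0, 1))                 # (N, T)
    z = global_map - global_map.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)             # per-pixel rescaling
```

Taking the per-pixel argmax over tokens of the rescaled map would then yield a segmentation assignment.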
[210] Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
Mingzhe Li, Mengyin Liu, Zekai Wu, Xincheng Lin, Junsheng Zhang, Ming Yan, Zengye Xie, Changwang Zhang, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang
Main category: cs.CV
TL;DR: Proposes Motion Turing Test framework and HHMotion dataset to evaluate human-likeness of humanoid robot motions, finding current multimodal LLMs inadequate for this task and proposing a baseline model that outperforms them.
Details
Motivation: To develop a systematic framework for evaluating how human-like humanoid robot motions are, inspired by the Turing Test, and to create a benchmark dataset for this evaluation since current multimodal LLMs are insufficient for assessing motion human-likeness.
Method: Created HHMotion dataset with 1,000 motion sequences across 15 action categories from 11 humanoid models and 10 human subjects, converted to SMPL-X representations. Collected human annotations (30 annotators, 500+ hours) rating human-likeness on 0-5 scale. Proposed baseline model for predicting human-likeness scores from motion data.
Result: Humanoid motions still show noticeable deviations from human movements, especially in dynamic actions like jumping, boxing, and running. Found that current multimodal LLMs are inadequate for motion human-likeness assessment. Proposed baseline model outperforms several contemporary LLM-based methods.
Conclusion: The Motion Turing Test framework and HHMotion dataset provide valuable tools for evaluating humanoid motion human-likeness. Current multimodal LLMs need improvement for motion understanding tasks, and the proposed baseline offers a starting point for future research in this area.
Abstract: Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion (HHMotion) dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0-5 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based methods. The dataset, code, and benchmark will be publicly released to support future research in the community.
[211] CLoPA: Continual Low Parameter Adaptation of Interactive Segmentation for Medical Image Annotation
Parhom Esmaeili, Chayanin Tangwiriyasakul, Eli Gibson, Sebastien Ourselin, M. Jorge Cardoso
Main category: cs.CV
TL;DR: CLoPA: Continual adaptation strategy that tunes small fraction of nnInteractive’s parameters on annotation cache to rapidly achieve expert-level segmentation performance across diverse medical imaging tasks.
Details
Motivation: Existing zero-shot interactive segmentation models like nnInteractive fail to consistently reach expert-level performance across diverse medical imaging tasks. Annotation campaigns produce growing streams of task-specific labeled data, making online adaptation a natural complement to zero-shot inference.
Method: Proposes CLoPA, a continual adaptation strategy that tunes a small fraction of nnInteractive’s parameters on the annotation cache, triggered by lightweight episode scheduling. Requires no new parameters or changes to inference pipeline, operates entirely within existing annotation workflow.
Result: Across eight Medical Segmentation Decathlon tasks spanning diverse anatomical targets and imaging characteristics, CLoPA rapidly elevates performance to expert-level, even for tasks where nnInteractive previously failed. Majority of gains realized after single training episode.
Conclusion: CLoPA effectively bridges gap between zero-shot inference and expert-level segmentation through lightweight continual adaptation. Benefits of tuning different parameter groups depend on task characteristics and data regimes. For complex geometries, deeper feature-representation alignment may be needed.
Abstract: Interactive segmentation enables clinicians to guide annotation, but existing zero-shot models like nnInteractive fail to consistently reach expert-level performance across diverse medical imaging tasks. Because annotation campaigns produce a growing stream of task-specific labelled data, online adaptation of the segmentation model is a natural complement to zero-shot inference. We propose CLoPA, a continual adaptation strategy that tunes a small fraction of nnInteractive’s parameters on the annotation cache, triggered by lightweight episode scheduling. CLoPA requires no new parameters or changes to the inference pipeline, and operates entirely within the existing annotation workflow. Across eight Medical Segmentation Decathlon tasks spanning diverse anatomical targets and imaging characteristics, CLoPA rapidly elevates performance to expert-level, even for tasks where nnInteractive previously failed, with the majority of gains realised after a single training episode. We show that the benefits of tuning different parameter groups depend on task characteristics and data regimes, and that for targets with complex geometries (e.g., hepatic vessels), instance normalisation and low-level feature tuning saturate, suggesting a need for deeper feature-representation alignment in the most challenging scenarios.
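Tuning only a small fraction of parameters usually comes down to a name-based freeze mask over the model's parameter groups (the group names below are illustrative; the paper studies which groups, e.g. instance normalisation and low-level features, matter per task):

```python
def select_tunable(param_names, groups=("norm", "head")):
    """Return a {parameter_name: trainable} mask that freezes everything
    except the named parameter groups, mirroring low-parameter
    continual adaptation of a pretrained segmentation model."""
    return {n: any(g in n for g in groups) for n in param_names}
```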
[212] SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection
Shuailin Xue, Jun Wan, Lihua Zhang, Wenwen Min
Main category: cs.CV
TL;DR: SpaCRD is a transfer learning method that integrates histology images with spatial transcriptomics data for accurate cancer tissue region detection across diverse samples, platforms, and batches.
Details
Motivation: Traditional cancer tissue region detection methods relying on histology images alone suffer from high false positives due to morphological similarities. Spatial transcriptomics provides detailed cellular phenotypes and spatial information, but current methods cannot effectively integrate both data types, especially in cross-sample and cross-platform settings.
Method: SpaCRD uses a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. It’s a transfer learning approach that can generalize to new samples from different platforms and batches.
Result: Extensive benchmark analysis on 23 matched histology-ST datasets across various disease types, platforms, and batches shows SpaCRD consistently outperforms eight state-of-the-art methods in cancer tissue region detection.
Conclusion: SpaCRD effectively integrates histology images with spatial transcriptomics data for reliable cancer tissue region detection across diverse samples and platforms, addressing limitations of traditional methods and enabling more accurate analysis of tumor microenvironments.
Abstract: Accurate detection of cancer tissue regions (CTR) enables deeper analysis of the tumor microenvironment and offers crucial insights into treatment response. Traditional CTR detection methods, which typically rely on the rich cellular morphology in histology images, are susceptible to a high rate of false positives due to morphological similarities across different tissue regions. The groundbreaking advances in spatial transcriptomics (ST) provide detailed cellular phenotypes and spatial localization information, offering new opportunities for more accurate cancer region detection. However, current methods are unable to effectively integrate histology images with ST data, especially in cross-sample and cross-platform/batch settings for CTR detection. To address this challenge, we propose SpaCRD, a transfer learning-based method that deeply integrates histology images and ST data to enable reliable CTR detection across diverse samples, platforms, and batches. Once trained on source data, SpaCRD can be readily generalized to accurately detect cancerous regions across samples from different platforms and batches. The core of SpaCRD is a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network, which enables the model to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. Extensive benchmark analysis on 23 matched histology-ST datasets spanning various disease types, platforms, and batches demonstrates that SpaCRD consistently outperforms eight existing state-of-the-art methods in CTR detection.
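The bidirectional cross-attention at the core of SpaCRD's fusion network can be illustrated with a minimal single-head numpy sketch. The feature dimensions, shared projection weights, and residual concatenation below are illustrative assumptions, not the paper's implementation (which additionally uses category regularization and variational reconstruction):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats, Wq, Wk, Wv):
    """One direction of cross-attention: queries come from one modality,
    keys/values from the other."""
    Q, K, V = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n_spots, d = 16, 32                      # spatial spots, feature dim (assumed)
hist = rng.normal(size=(n_spots, d))     # stand-in histology patch features
gene = rng.normal(size=(n_spots, d))     # stand-in gene-expression features
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# Bidirectional: each modality attends to the other, then fuse.
hist_from_gene = cross_attend(hist, gene, Wq, Wk, Wv)
gene_from_hist = cross_attend(gene, hist, Wq, Wk, Wv)
fused = np.concatenate([hist + hist_from_gene,
                        gene + gene_from_hist], axis=-1)
print(fused.shape)
```

Running both directions lets each modality's representation absorb co-expression evidence from the other before classification.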
[213] Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
Yakov Pyotr Shkolnikov
Main category: cs.CV
TL;DR: Vision-language models encode geometric information (like hand joint angles) in their visual features that their text pathways cannot express, creating a 3.3x bottleneck between visual encoding and text output.
Details
Motivation: To understand the disconnect between what vision-language models can visually encode versus what they can verbally express, particularly regarding geometric information like 3D hand poses, and to determine whether this is a representational or pathway-training deficit.
Method: Used linear probes on frozen vision-language model features to extract hand joint angles, compared text output accuracy, conducted LoRA fine-tuning experiments, analyzed multiple encoder architectures, and performed layer-wise analysis to identify where geometric information is encoded.
Result: Found a 3.3x bottleneck between visual encoding (6.1° MAE) and text output (20.0° MAE), narrowed by LoRA fine-tuning to 6.5° MAE. Different encoder architectures converged to similar functional accuracy despite low representational similarity. Autoregressive generation damages geometric fidelity, but language alignment can improve accuracy. Geometric signal peaks in mid-network layers (18-22).
Conclusion: Vision-language models encode rich geometric information that their text pathways fail to express, but lightweight probes can extract this information without fine-tuning or text generation, enabling frozen backbones to function as multi-task geometric sensors.
Abstract: Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees – a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity – functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL’s LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.
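The frozen-feature probing protocol is straightforward to reproduce in outline: fit a closed-form, ridge-regularised linear map from encoder features to joint angles and report MAE. The synthetic features and angles below are stand-ins for real model activations and annotations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_joints = 500, 128, 21            # images, feature dim, joint angles (assumed)

# Stand-ins for frozen encoder features and ground-truth joint angles.
W_true = rng.normal(size=(d, n_joints))
feats = rng.normal(size=(n, d))
angles = feats @ W_true + rng.normal(scale=0.5, size=(n, n_joints))

# Closed-form ridge-regularised linear probe: the only trained parameters,
# the backbone itself stays frozen.
lam = 1e-3
W = np.linalg.solve(feats.T @ feats + lam * np.eye(d), feats.T @ angles)

pred = feats @ W
mae = np.abs(pred - angles).mean()
print(f"probe MAE: {mae:.3f}")
```

The point of the protocol is that a probe this small (d × n_joints parameters) suffices to read geometry out of the frozen features, without fine-tuning or text generation.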
[214] Adaptive Language-Aware Image Reflection Removal Network
Siyan Fang, Yuntao Wang, Jinpu Zhang, Ziwen Li, Yuehuan Wang
Main category: cs.CV
TL;DR: ALANet is a language-aware network for image reflection removal that handles inaccurate language descriptions through filtering and optimization strategies, working effectively even with complex reflections.
Details
Motivation: Existing reflection removal methods struggle with complex reflections. While language descriptions could help understand image content, machine-generated descriptions from reflected images are often inaccurate, harming language-guided approaches.
Method: Proposes Adaptive Language-Aware Network (ALANet) with filtering strategy to reduce negative effects of inaccurate language while preserving benefits, and optimization strategy to enhance language-visual feature alignment. Uses language cues to decouple specific layer content from feature maps.
Result: ALANet surpasses state-of-the-art methods for image reflection removal. Introduces CRLAV dataset for evaluating performance under complex reflections and varying language accuracy.
Conclusion: ALANet effectively handles complex reflections even with inaccurate language inputs through adaptive language-aware mechanisms, advancing language-guided vision tasks.
Abstract: Existing image reflection removal methods struggle to handle complex reflections. Accurate language descriptions can help the model understand the image content to remove complex reflections. However, due to blurred and distorted interferences in reflected images, machine-generated language descriptions of the image content are often inaccurate, which harms the performance of language-guided reflection removal. To address this, we propose the Adaptive Language-Aware Network (ALANet) to remove reflections even with inaccurate language inputs. Specifically, ALANet integrates both filtering and optimization strategies. The filtering strategy reduces the negative effects of language while preserving its benefits, whereas the optimization strategy enhances the alignment between language and visual features. ALANet also utilizes language cues to decouple specific layer content from feature maps, improving its ability to handle complex reflections. To evaluate the model’s performance under complex reflections and varying levels of language accuracy, we introduce the Complex Reflection and Language Accuracy Variance (CRLAV) dataset. Experimental results demonstrate that ALANet surpasses state-of-the-art methods for image reflection removal. The code and dataset are available at https://github.com/fashyon/ALANet.
[215] Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education
Yuanji Zhang, Yuhao Huang, Haoran Dou, Xiliang Zhu, Chen Ling, Zhong Yang, Lianying Liang, Jiuping Li, Siying Liang, Rui Li, Yan Cao, Yuhan Zhang, Jiewei Lai, Yongsong Zhou, Hongyu Zheng, Xinru Gao, Cheng Yu, Liling Shi, Mengqin Yuan, Honglong Li, Xiaoqiong Huang, Chaoyu Chen, Jialin Zhang, Wenxiong Pan, Alejandro F. Frangi, Guangzhi He, Xin Yang, Yi Xiong, Linliang Yin, Xuedong Deng, Dong Ni
Main category: cs.CV
TL;DR: AI system trained on 45K+ ultrasound images achieves 93%+ sensitivity and 95%+ specificity for prenatal detection of orofacial clefts, matching senior radiologists and boosting junior radiologists’ performance by 6%.
Details
Motivation: Prenatal detection of orofacial clefts is challenging due to scarcity of experienced specialists and rarity of condition. Early diagnosis is crucial for timely intervention and reduced morbidity.
Method: AI system trained on 45,139 ultrasound images from 9,215 fetuses across 22 hospitals. Used as medical copilot to assist radiologists, with pilot study involving 24 radiologists and trainees.
Result: System achieves sensitivity >93% and specificity >95%, matching senior radiologists and outperforming junior radiologists. As copilot, raises junior radiologists’ sensitivity by >6%. Pilot study shows model accelerates expertise development for rare conditions.
Conclusion: Dual-purpose AI approach offers scalable solution for improving both diagnostic accuracy and specialist training in settings with scarce experienced radiologists.
Abstract: Orofacial clefts are among the most common congenital craniofacial abnormalities, yet accurate prenatal detection remains challenging due to the scarcity of experienced specialists and the relative rarity of the condition. Early and reliable diagnosis is essential to enable timely clinical intervention and reduce associated morbidity. Here we show that an artificial intelligence system, trained on 45,139 ultrasound images from 9,215 fetuses across 22 hospitals, can diagnose fetal orofacial clefts with sensitivity and specificity exceeding 93% and 95% respectively, matching the performance of senior radiologists and substantially outperforming junior radiologists. When used as a medical copilot, the system raises junior radiologists’ sensitivity by more than 6%. Beyond direct diagnostic assistance, the system also accelerates the development of clinical expertise. A pilot study involving 24 radiologists and trainees demonstrated that the model can improve the development of expertise in rare conditions. This dual-purpose approach offers a scalable solution for improving both diagnostic accuracy and specialist training in settings where experienced radiologists are scarce.
[216] Point-Supervised Skeleton-Based Human Action Segmentation
Hongsong Wang, Yiqin Shen, Pengbo Yan, Jie Gui
Main category: cs.CV
TL;DR: Point-supervised skeleton-based action segmentation using multimodal skeleton data and pseudo-label generation with prototype similarity, achieving competitive performance with minimal annotation.
Details
Motivation: Fully-supervised skeleton-based action segmentation requires costly frame-level annotations and is sensitive to ambiguous action boundaries. The paper aims to reduce annotation effort while maintaining performance.
Method: Uses point supervision (single frame per action segment) with multimodal skeleton data (joint, bone, motion). Proposes prototype similarity method combined with energy function and constrained K-Medoids for pseudo-label generation. Multimodal pseudo-label integration guides model training.
Result: Establishes new benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130. Achieves competitive performance, even surpassing some fully-supervised methods while significantly reducing annotation effort.
Conclusion: The proposed point-supervised framework effectively reduces annotation costs while maintaining competitive performance for skeleton-based action segmentation, demonstrating the value of multimodal pseudo-label integration.
Abstract: Skeleton-based temporal action segmentation is a fundamental yet challenging task, playing a crucial role in enabling intelligent systems to perceive and respond to human activities. While fully-supervised methods achieve satisfactory performance, they require costly frame-level annotations and are sensitive to ambiguous action boundaries. To address these issues, we introduce a point-supervised framework for skeleton-based action segmentation, where only a single frame per action segment is labeled. We leverage multimodal skeleton data, including joint, bone, and motion information, encoded via a pretrained unified model to extract rich feature representations. To generate reliable pseudo-labels, we propose a novel prototype similarity method and integrate it with two existing methods: energy function and constrained K-Medoids clustering. Multimodal pseudo-label integration is proposed to enhance the reliability of the pseudo-label and guide the model training. We establish new benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130, and implement baselines for point-supervised skeleton-based human action segmentation. Extensive experiments show that our method achieves competitive performance, even surpassing some fully-supervised methods while significantly reducing annotation effort.
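The prototype-similarity step can be sketched as follows: the single labelled frame per segment defines a class prototype, and every unlabelled frame receives the pseudo-label of its most cosine-similar prototype. The synthetic features are placeholders, and the energy-function and constrained K-Medoids components the paper combines with this are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, d, n_classes = 300, 64, 3

# Synthetic per-frame features: three well-separated action classes.
centers = rng.normal(scale=5.0, size=(n_classes, d))
true = np.repeat(np.arange(n_classes), n_frames // n_classes)
feats = centers[true] + rng.normal(size=(n_frames, d))

# Point supervision: a single labelled frame per action segment.
labelled = {c: np.flatnonzero(true == c)[0] for c in range(n_classes)}

# Prototypes = features at the labelled frames (one segment per class here).
protos = np.stack([feats[i] for i in labelled.values()])

# Pseudo-label every frame by cosine similarity to the prototypes.
f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
pseudo = (f @ p.T).argmax(axis=1)

acc = (pseudo == true).mean()
print(f"pseudo-label accuracy: {acc:.2f}")
```

In the full method, pseudo-labels from the joint, bone, and motion modalities would then be integrated before supervising the segmentation model.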
[217] VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction
Xiaoyang Yan, Muleilan Pei, Shaojie Shen
Main category: cs.CV
TL;DR: VG3S enhances 3D semantic occupancy prediction for autonomous driving by integrating geometric priors from Vision Foundation Models into Gaussian splatting, achieving significant performance improvements.
Details
Motivation: Current 3D Gaussian splatting methods for occupancy prediction rely on accurate geometric cues that are often insufficient in vision-only systems. The paper aims to bridge this gap by leveraging the strong geometric grounding capabilities of pre-trained Vision Foundation Models.
Method: Proposes Visual Geometry Grounded Gaussian Splatting (VG3S) framework with a hierarchical geometric feature adapter that transforms VFM tokens through feature aggregation, task-specific alignment, and multi-scale restructuring to inject geometric priors into Gaussian-based occupancy prediction.
Result: On nuScenes occupancy benchmark, VG3S achieves 12.6% improvement in IoU and 7.5% improvement in mIoU over baseline. The method generalizes well across different VFMs and consistently enhances occupancy prediction accuracy.
Conclusion: Integrating geometric priors from powerful pre-trained Vision Foundation Models significantly improves 3D semantic occupancy prediction, demonstrating the value of leveraging foundational vision models for geometric understanding in autonomous driving perception tasks.
Abstract: 3D semantic occupancy prediction has become a crucial perception task for comprehensive scene understanding in autonomous driving. While recent advances have explored 3D Gaussian splatting for occupancy modeling to substantially reduce computational overhead, the generation of high-quality 3D Gaussians relies heavily on accurate geometric cues, which are often insufficient in purely vision-centric paradigms. To bridge this gap, we advocate for injecting the strong geometric grounding capability from Vision Foundation Models (VFMs) into occupancy prediction. In this regard, we introduce Visual Geometry Grounded Gaussian Splatting (VG3S), a novel framework that empowers Gaussian-based occupancy prediction with cross-view 3D geometric grounding. Specifically, to fully exploit the rich 3D geometric priors from a frozen VFM, we propose a plug-and-play hierarchical geometric feature adapter, which can effectively transform generic VFM tokens via feature aggregation, task-specific alignment, and multi-scale restructuring. Extensive experiments on the nuScenes occupancy benchmark demonstrate that VG3S achieves remarkable improvements of 12.6% in IoU and 7.5% in mIoU over the baseline. Furthermore, we show that VG3S generalizes seamlessly across diverse VFMs, consistently enhancing occupancy prediction accuracy and firmly underscoring the immense value of integrating priors derived from powerful, pre-trained geometry-grounded VFMs.
[218] SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, Omid Mohareri
Main category: cs.CV
TL;DR: SUREON is a large-scale surgical video QA dataset extracted from expert video lectures, enabling AI models to understand surgical reasoning like intent, rationale, and anticipation.
Details
Motivation: Current surgical AI lacks understanding of surgical reasoning (intent, rationale, anticipation) because explicit training data is hard to annotate at scale. Surgical video lectures contain expert explanations of reasoning but are noisy and unstructured.
Method: Created SUREON dataset by harvesting surgical academic videos using multi-agent pipeline to extract structured QA pairs. Defined 12 question categories covering safety, decision rationale, and forecasting. Developed two models: SureonVLM (vision-language model fine-tuned) and SureonVLM-R1 (reasoning model trained with Group Relative Policy Optimization).
Result: Dataset contains 206.8k QA pairs from 134.7K clips across 170 procedure types, with expert-validated benchmark of 354 examples. Models achieve >84% accuracy on SUREON benchmark, outperforming larger general-domain models on surgical reasoning and perception tasks.
Conclusion: SUREON enables surgical AI to understand expert reasoning by leveraging unstructured video lectures. The approach demonstrates that surgical reasoning can be learned from existing educational content, advancing AI capabilities in surgical understanding.
Abstract: Surgeons don’t just see – they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this – explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.
[219] 3D CBCT Artefact Removal Using Perpendicular Score-Based Diffusion Models
Susanne Schaub, Florentin Bieder, Matheus L. Oliveira, Yulan Wang, Dorothea Dagassan-Berndt, Michael M. Bornstein, Philippe C. Cattin
Main category: cs.CV
TL;DR: A 3D dental implant inpainting method using perpendicular score-based diffusion models to reduce CBCT artefacts by modeling correlations across projection series
Details
Motivation: CBCT imaging suffers from artefacts from high-density dental implants, compromising diagnostic accuracy. Current diffusion-based methods treat 2D projections independently, ignoring inter-projection correlations and causing reconstruction inconsistencies.
Method: Proposes a 3D implant inpainting approach using perpendicular score-based diffusion models trained in two different planes (coronal and sagittal). The 3D distribution of projection series is modeled by combining these two 2D diffusion models in the sampling scheme.
Result: The method effectively produces high-quality, artefact-reduced 3D CBCT images, demonstrating improved consistency across projections compared to independent 2D approaches.
Conclusion: The proposed 3D diffusion-based approach is a promising solution for improving clinical CBCT imaging by better handling implant artefacts through modeling projection correlations.
Abstract: Cone-beam computed tomography (CBCT) is a widely used 3D imaging technique in dentistry, offering high-resolution images while minimising radiation exposure for patients. However, CBCT is highly susceptible to artefacts arising from high-density objects such as dental implants, which can compromise image quality and diagnostic accuracy. To reduce artefacts, implant inpainting in the sequence of projections plays a crucial role in many artefact reduction approaches. Recently, diffusion models have achieved state-of-the-art results in image generation and have been widely applied to image inpainting tasks. However, to our knowledge, existing diffusion-based methods for implant inpainting operate on independent 2D projections. This approach neglects the correlations among individual projections, resulting in inconsistencies in the reconstructed images. To address this, we propose a 3D dental implant inpainting approach based on two perpendicular score-based diffusion models, each trained in a different plane and operating in the projection domain. The 3D distribution of the projection series is modelled by combining the two 2D score-based diffusion models in the sampling scheme. Our results demonstrate the method’s effectiveness in producing high-quality, artefact-reduced 3D CBCT images, making it a promising solution for improving clinical imaging.
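The sampling-time combination of the two plane-wise models can be sketched as follows, with trivial analytic Gaussian scores standing in for the trained coronal and sagittal networks: each 2D model is applied slice-wise along its own axis and the two score fields are averaged. The averaging rule and the plain Langevin sampler are assumptions for illustration; the paper's actual sampling scheme may differ:

```python
import numpy as np

def score_coronal(slice_2d):
    """Stand-in for a 2D score network trained on one plane: the score
    of a unit Gaussian, grad log p(x) = -x."""
    return -slice_2d

def score_sagittal(slice_2d):
    """Stand-in for the perpendicular-plane score network."""
    return -slice_2d

def perpendicular_score(volume):
    """Combine the two 2D models into one 3D score: apply each model
    slice-wise along its own axis, then average the score fields."""
    s_a = np.stack([score_coronal(volume[i]) for i in range(volume.shape[0])], axis=0)
    s_b = np.stack([score_sagittal(volume[:, j]) for j in range(volume.shape[1])], axis=1)
    return 0.5 * (s_a + s_b)

# A few steps of unadjusted Langevin sampling with the combined 3D score.
rng = np.random.default_rng(0)
x = rng.normal(scale=3.0, size=(8, 8, 8))
step = 0.1
for _ in range(200):
    x = x + step * perpendicular_score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
print(f"sample std after Langevin: {x.std():.2f}")
```

Because every voxel is seen by slices from both axes, the combined score couples the perpendicular planes and discourages the slice-to-slice inconsistencies of independent 2D inpainting.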
[220] EntON: Eigenentropy-Optimized Neighborhood Densification in 3D Gaussian Splatting
Miriam JĂ€ger, Boris Jutzi
Main category: cs.CV
TL;DR: EntON introduces an Eigenentropy-optimized neighborhood densification strategy for 3D Gaussian Splatting that improves geometric accuracy and rendering quality while reducing computational costs.
Details
Motivation: Standard 3D Gaussian Splatting produces Gaussians poorly aligned with object geometry, while surface-focused methods sacrifice photometric accuracy. There's a need for better geometric accuracy without compromising rendering quality.
Method: Uses Eigenentropy from eigenvalues of covariance matrices in k-nearest neighborhoods to quantify local structural order. Implements alternating optimization: standard gradient-based densification alternates with Eigenentropy-aware densification that splits Gaussians in low-Eigenentropy (flat) regions and prunes those in high-Eigenentropy (spherical) regions.
Result: Improves geometric accuracy by up to 33%, rendering quality by up to 7%, reduces number of Gaussians by up to 50%, and training time by up to 23% on DTU and TUM2TWIN datasets.
Conclusion: EntON achieves favorable balance between geometric accuracy, rendering quality, and efficiency by avoiding unnecessary scene expansion through geometry-aware densification.
Abstract: We present a novel Eigenentropy-optimized neighborhood densification strategy EntON in 3D Gaussian Splatting (3DGS) for geometrically accurate and high-quality rendered 3D reconstruction. While standard 3DGS produces Gaussians whose centers and surfaces are poorly aligned with the underlying object geometry, surface-focused reconstruction methods frequently sacrifice photometric accuracy. In contrast to the conventional densification strategy, which relies on the magnitude of the view-space position gradient, our approach introduces a geometry-aware strategy to guide adaptive splitting and pruning. Specifically, we compute the 3D shape feature Eigenentropy from the eigenvalues of the covariance matrix in the k-nearest neighborhood of each Gaussian center, which quantifies the local structural order. These Eigenentropy values are integrated into an alternating optimization framework: During the optimization process, the algorithm alternates between (i) standard gradient-based densification, which refines regions via view-space gradients, and (ii) Eigenentropy-aware densification, which preferentially densifies Gaussians in low-Eigenentropy (ordered, flat) neighborhoods to better capture fine geometric details on the object surface, and prunes those in high-Eigenentropy (disordered, spherical) regions. We provide quantitative and qualitative evaluations on two benchmark datasets: the small-scale DTU dataset and the large-scale TUM2TWIN dataset, covering man-made objects and urban scenes. Experiments demonstrate that our Eigenentropy-aware alternating densification strategy improves geometric accuracy by up to 33% and rendering quality by up to 7%, while reducing the number of Gaussians by up to 50% and training time by up to 23%. Overall, EntON achieves a favorable balance between geometric accuracy, rendering quality and efficiency by avoiding unnecessary scene expansion.
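The Eigenentropy feature itself is standard and simple to compute: for each center, take the eigenvalues of the covariance matrix of its k-nearest neighborhood, normalize them, and take the Shannon entropy. A planar neighborhood gives at most ln 2 ≈ 0.69 (two comparable eigenvalues), while an isotropic one approaches ln 3 ≈ 1.10. A sketch on synthetic point clouds (the k value and data are illustrative):

```python
import numpy as np

def eigenentropy(points, k=16):
    """Per-point Eigenentropy from the eigenvalues of the covariance
    matrix of each point's k-nearest neighborhood.
    Low values ~ ordered/flat neighborhoods, high ~ spherical ones."""
    ents = []
    for p in points:
        d2 = ((points - p) ** 2).sum(axis=1)
        nn = points[np.argsort(d2)[:k]]          # brute-force k-NN
        lam = np.linalg.eigvalsh(np.cov(nn.T))   # eigenvalues, ascending
        lam = np.clip(lam, 1e-12, None)
        e = lam / lam.sum()                      # normalized eigenvalues
        ents.append(-(e * np.log(e)).sum())      # Shannon entropy
    return np.array(ents)

rng = np.random.default_rng(0)
# A flat (planar) patch vs. an isotropic spherical blob.
flat = np.c_[rng.uniform(size=(200, 2)), 0.001 * rng.normal(size=200)]
blob = rng.normal(size=(200, 3))

print(f"flat surface:   {eigenentropy(flat).mean():.2f}")
print(f"spherical blob: {eigenentropy(blob).mean():.2f}")
```

In EntON's terms, the flat patch would be preferentially densified and the high-entropy blob preferentially pruned.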
[221] Word-Anchored Temporal Forgery Localization
Tianyi Wang, Xi Shao, Harry Cheng, Yinglong Wang, Mohan Kankanhalli
Main category: cs.CV
TL;DR: WAFL reformulates temporal forgery localization as discrete word-level binary classification using speech word tokens as minimum forgery units, achieving superior performance with high efficiency.
Details
Motivation: Current temporal forgery localization methods suffer from feature granularity misalignment and high computational costs. The authors aim to create a more efficient and effective approach by leveraging natural linguistic boundaries in speech.
Method: 1) Use word tokens as minimum forgery units aligned with speech linguistic boundaries; 2) Introduce Forensic Feature Realignment (FFR) module to map pre-trained semantic features to forensic manifold; 3) Use lightweight linear classifiers for binary classification; 4) Design Artifact-Centric Asymmetric (ACA) loss to handle class imbalance by prioritizing forensic artifacts.
Result: WAFL significantly outperforms state-of-the-art approaches in localization performance under both in-dataset and cross-dataset settings, while requiring substantially fewer parameters and operating at high computational efficiency.
Conclusion: The word-anchored paradigm effectively addresses granularity misalignment and computational inefficiency in temporal forgery localization, providing a practical solution for speech forgery detection.
Abstract: Current temporal forgery localization (TFL) approaches typically rely on temporal boundary regression or continuous frame-level anomaly detection paradigms to derive candidate forgery proposals. However, they suffer not only from feature granularity misalignment but also from costly computation. To address these issues, we propose word-anchored temporal forgery localization (WAFL), a novel paradigm that shifts the TFL task from temporal regression and continuous localization to discrete word-level binary classification. Specifically, we first analyze the essence of temporal forgeries and identify the minimum meaningful forgery units, word tokens, and then align data preprocessing with the natural linguistic boundaries of speech. To adapt powerful pre-trained foundation backbones for feature extraction, we introduce the forensic feature realignment (FFR) module, mapping representations from the pre-trained semantic space to a discriminative forensic manifold. This allows subsequent lightweight linear classifiers to efficiently perform binary classification and accomplish the TFL task. Furthermore, to overcome the extreme class imbalance inherent to forgery detection, we design the artifact-centric asymmetric (ACA) loss, which breaks the standard precision-recall trade-off by dynamically suppressing overwhelming authentic gradients while asymmetrically prioritizing subtle forensic artifacts. Extensive experiments demonstrate that WAFL significantly outperforms state-of-the-art approaches in localization performance under both in- and cross-dataset settings, while requiring substantially fewer learnable parameters and operating at high computational efficiency.
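The exact form of the ACA loss is not given in this summary, so the sketch below uses an asymmetric, focal-style binary cross-entropy as an illustrative stand-in for the stated idea: a constant up-weight on the rare forged class, and confidence-based suppression of gradients from easy, overwhelming authentic tokens. The weighting scheme, hyperparameters, and data are all assumptions:

```python
import numpy as np

def asymmetric_bce(probs, labels, w_fake=4.0, gamma_real=2.0):
    """Illustrative asymmetric loss for word-level forgery classification.
    labels: 1 = forged word token, 0 = authentic.
    - forged terms get a constant up-weight (w_fake);
    - authentic terms are focally down-weighted by p**gamma_real, so easy
      real tokens (p near 0) contribute almost no gradient."""
    p = np.clip(probs, 1e-7, 1 - 1e-7)
    loss_fake = -w_fake * labels * np.log(p)
    loss_real = -(1 - labels) * p ** gamma_real * np.log(1 - p)
    return (loss_fake + loss_real).mean()

# 1000 word tokens with ~2% forged: the class imbalance the loss targets.
rng = np.random.default_rng(0)
labels = (rng.uniform(size=1000) < 0.02).astype(float)
probs = np.clip(0.5 * labels + 0.1 + 0.05 * rng.normal(size=1000), 0.01, 0.99)

print(f"asymmetric loss: {asymmetric_bce(probs, labels):.3f}")
```

The asymmetry means the optimum is no longer the standard precision-recall balance point: recall on the forged class is bought by tolerating some extra loss on confident authentic predictions.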
[222] BEVLM: Distilling Semantic Knowledge from LLMs into Bird’s-Eye View Representations
Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding
Main category: cs.CV
TL;DR: BEVLM integrates Bird’s-Eye View representations with LLMs for autonomous driving, improving spatial reasoning and semantic understanding.
Details
Motivation: Existing methods feed LLMs with multi-view/multi-frame images independently, causing redundant computation and poor spatial consistency. BEV representations provide spatial structure but lack semantic richness from foundation vision encoders.
Method: Proposes BEVLM framework that connects spatially consistent and semantically distilled BEV representations with LLMs, using BEV features as unified inputs and distilling semantic knowledge from LLMs into BEV representations.
Result: Improves reasoning accuracy by 46% in cross-view driving scenes and enhances closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
Conclusion: BEVLM effectively bridges the gap between spatial structure and semantic understanding, enabling better multimodal reasoning for autonomous driving applications.
Abstract: The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird’s-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
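The distillation direction (pulling BEV features toward LLM semantic features) can be sketched as a cosine-similarity distillation loss between projected BEV cells and frozen teacher features. The feature shapes, the linear projection head, and the one-to-one pairing of cells to teacher features are assumptions for illustration, not BEVLM's actual objective:

```python
import numpy as np

def cosine_distill_loss(bev_feats, llm_feats, W_proj):
    """Project BEV cells into the teacher embedding space and penalise
    cosine distance to the (frozen) LLM semantic features."""
    z = bev_feats @ W_proj                                  # (cells, d_llm)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    t = llm_feats / np.linalg.norm(llm_feats, axis=1, keepdims=True)
    return (1.0 - (z * t).sum(axis=1)).mean()               # 0 when aligned

rng = np.random.default_rng(0)
n_cells, d_bev, d_llm = 100, 256, 512    # BEV grid cells, feature dims (assumed)
bev = rng.normal(size=(n_cells, d_bev))
llm = rng.normal(size=(n_cells, d_llm))  # stand-in for frozen LLM features
W = rng.normal(size=(d_bev, d_llm)) * 0.02

print(f"distillation loss: {cosine_distill_loss(bev, llm, W):.3f}")
```

Minimizing such a loss during BEV pretraining is one way the semantically rich teacher space could be injected into a geometrically grounded representation.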
[223] Low-latency Event-based Object Detection with Spatially-Sparse Linear Attention
Haiqing Hao, Zhipeng Sui, Rong Zou, Zijia Dai, Nikola ZubiÄ, Davide Scaramuzza, Wenhui Wang
Main category: cs.CV
TL;DR: SSLA-Det: A spatially-sparse linear attention model for low-latency event-based object detection that achieves state-of-the-art accuracy with 20x less computation than prior asynchronous methods.
Details
Motivation: Event cameras offer high temporal resolution and spatial sparsity for low-latency object detection, but existing asynchronous neural networks face bottlenecks: recurrent architectures are hard to train on long sequences, and accuracy improvements increase per-event computation and latency.
Method: Proposes Spatially-Sparse Linear Attention (SSLA) with mixture-of-spaces state decomposition and scatter-compute-gather training procedure to enable state-level sparsity while preserving parallel training. Built on SSLA, develops SSLA-Det for end-to-end asynchronous event-based object detection.
Result: On Gen1 and N-Caltech101 datasets, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods (0.375 mAP and 0.515 mAP respectively) while reducing per-event computation by more than 20x compared to strongest prior asynchronous baseline.
Conclusion: SSLA demonstrates the potential of linear attention for low-latency event-based vision by enabling efficient parallel training and sparse state activation that exploits event sparsity while maintaining accuracy.
Abstract: Event cameras provide sequential visual data with spatial sparsity and high temporal resolution, making them attractive for low-latency object detection. Existing asynchronous event-based neural networks realize this low-latency advantage by updating predictions event-by-event, but still suffer from two bottlenecks: recurrent architectures are difficult to train efficiently on long sequences, and improving accuracy often increases per-event computation and latency. Linear attention is appealing in this setting because it supports parallel training and recurrent inference. However, standard linear attention updates a global state for every event, yielding a poor accuracy-efficiency trade-off, which is problematic for object detection, where fine-grained representations and thus states are preferred. The key challenge is therefore to introduce sparse state activation that exploits event sparsity while preserving efficient parallel training. We propose Spatially-Sparse Linear Attention (SSLA), which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparsity as well as training parallelism. Built on SSLA, we develop an end-to-end asynchronous linear attention model, SSLA-Det, for event-based object detection. On Gen1 and N-Caltech101, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods, reaching 0.375 mAP and 0.515 mAP, respectively, while reducing per-event computation by more than 20 times compared to the strongest prior asynchronous baseline, demonstrating the potential of linear attention for low-latency event-based vision.
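The recurrence SSLA builds on is standard linear attention, where each event updates a running state instead of recomputing attention over all past events. A minimal numpy sketch of that baseline recurrence with a single global state (SSLA's contribution is to decompose and sparsify this state spatially, which is not shown here; all names are illustrative):

```python
import numpy as np

def linear_attention_step(state, z, k, q, v):
    """One recurrent linear-attention update for a single token (event).

    state: (d_k, d_v) running sum of outer products k v^T
    z:     (d_k,)     running sum of keys (normalizer)
    """
    state = state + np.outer(k, v)
    z = z + k
    out = (q @ state) / (q @ z + 1e-6)  # attention output for this event
    return state, z, out

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
phi = lambda x: np.maximum(x, 0.0) + 1.0  # positive feature map

state, z = np.zeros((d_k, d_v)), np.zeros(d_k)
for _ in range(5):  # process five "events" strictly sequentially
    k, q = phi(rng.normal(size=d_k)), phi(rng.normal(size=d_k))
    v = rng.normal(size=d_v)
    state, z, out = linear_attention_step(state, z, k, q, v)
assert out.shape == (d_v,)
```

The same recurrence admits a parallel (cumulative-sum) training form, which is exactly the property that makes per-event inference cheap while training stays efficient.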
[224] Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation
Keshen Zhou, Runnan Chen, Mingming Gong, Tongliang Liu
Main category: cs.CV
TL;DR: HCF-RES is a multi-modal framework for 3D referring expression segmentation that uses hierarchical visual semantic decomposition and progressive multi-level fusion to improve object localization in 3D scenes using natural language descriptions.
Details
Motivation: Existing 3D referring expression segmentation methods rely solely on sparse point clouds, lacking rich visual semantics needed for fine-grained natural language descriptions. This limits their ability to handle ambiguous descriptions that match multiple or zero targets.
Method: Proposes HCF-RES with two key innovations: 1) Hierarchical Visual Semantic Decomposition using SAM instance masks to guide CLIP encoding at pixel-level and instance-level features, preserving object boundaries during 2D-to-3D projection. 2) Progressive Multi-level Fusion integrating representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement.
Result: HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer benchmarks for 3D referring expression segmentation.
Conclusion: The proposed multi-modal framework effectively addresses limitations of point-cloud-only methods by incorporating rich visual semantics and hierarchical feature decomposition, enabling better handling of ambiguous natural language descriptions in 3D scenes.
Abstract: Generalised 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes based on natural language, even when descriptions match multiple or zero targets. Existing methods rely solely on sparse point clouds, lacking rich visual semantics for fine-grained descriptions. We propose HCF-RES, a multi-modal framework with two key innovations. First, Hierarchical Visual Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities – pixel-level and instance-level features – preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.
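The 2D-to-3D projection step described above amounts to projecting each 3D point into the image and gathering the pixel-level feature at that location. A minimal sketch under a pinhole-camera assumption (function name, shapes, and nearest-pixel lookup are illustrative, not the paper's exact procedure):

```python
import numpy as np

def project_features_to_points(points, feat_map, K):
    """Assign each 3D point the 2D feature at its projected pixel.

    points:   (N, 3) camera-frame points with z > 0
    feat_map: (H, W, C) per-pixel features (e.g. pixel-level CLIP features)
    K:        (3, 3) camera intrinsics
    """
    H, W, C = feat_map.shape
    uvw = points @ K.T                    # homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]         # perspective divide
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return feat_map[v, u]                 # (N, C) lifted point features

K = np.array([[100., 0., 32.], [0., 100., 32.], [0., 0., 1.]])
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.2, 4.0]])
fmap = np.random.default_rng(1).normal(size=(64, 64, 8))
point_feats = project_features_to_points(pts, fmap, K)
assert point_feats.shape == (2, 8)
```

Using instance masks to pool features before this lift (rather than raw per-pixel lookup) is what preserves object boundaries in the full method.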
[225] ODD-SEC: Onboard Drone Detection with a Spinning Event Camera
Kuan Dai, Hongxin Zhang, Sheng Zhong, Yi Zhou
Main category: cs.CV
TL;DR: Real-time drone detection system using spinning event camera for 360° FOV on moving carriers, with novel event representation and lightweight neural network for bearing estimation.
Details
Motivation: Need for drone detection systems that work on moving carriers (like quadruped robots or ground vehicles) in challenging conditions where conventional frame-based cameras fail, especially for fast-moving targets or adverse illumination.
Method: Uses spinning event-based camera for 360° horizontal field of view; novel image-like event representation without motion compensation; lightweight neural network for efficient spatiotemporal learning; implemented on Jetson Orin NX for real-time operation.
Result: System operates in real-time with reliable detection; mean angular error below 2° under challenging outdoor conditions; suitable for real-world surveillance applications.
Conclusion: Event-based vision with spinning camera enables effective drone detection on moving carriers in challenging conditions, outperforming conventional frame-based approaches; complete pipeline will be open-sourced.
Abstract: The rapid proliferation of drones requires balancing innovation with regulation. To address security and privacy concerns, techniques for drone detection have attracted significant attention. Passive solutions, such as frame camera-based systems, offer versatility and energy efficiency under typical conditions but are fundamentally constrained by their operational principles in scenarios involving fast-moving targets or adverse illumination. Inspired by biological vision, event cameras asynchronously detect per-pixel brightness changes, offering high dynamic range and microsecond-level responsiveness that make them uniquely suited for drone detection in conditions beyond the reach of conventional frame-based cameras. However, the design of most existing event-based solutions assumes a static camera, greatly limiting their applicability to moving carriers, such as quadrupedal robots or unmanned ground vehicles, during field operations. In this paper, we introduce a real-time drone detection system designed for deployment on moving carriers. The system utilizes a spinning event-based camera, providing a 360° horizontal field of view and enabling bearing estimation of detected drones. A key contribution is a novel image-like event representation that operates without motion compensation, coupled with a lightweight neural network architecture for efficient spatiotemporal learning. Implemented on an onboard Jetson Orin NX, the system can operate in real time. Outdoor experimental results validate reliable detection with a mean angular error below 2° under challenging conditions, underscoring its suitability for real-world surveillance applications. We will open-source our complete pipeline to support future research.
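An image-like event representation, in its simplest form, scatter-adds a stream of (x, y, t, polarity) events into a per-pixel, per-polarity count image with no motion compensation at all. A hedged sketch of that general idea (the paper's actual representation is more elaborate):

```python
import numpy as np

def events_to_frame(events, H, W):
    """Accumulate an event stream into a 2-channel count image.

    events: (N, 4) array of (x, y, t, polarity in {0, 1});
    returns (H, W, 2) counts of negative/positive events per pixel.
    """
    frame = np.zeros((H, W, 2))
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = events[:, 3].astype(int)
    np.add.at(frame, (y, x, p), 1)   # scatter-add; handles repeated pixels
    return frame

ev = np.array([[3, 2, 0.001, 1],
               [3, 2, 0.002, 1],
               [5, 4, 0.003, 0]])
frame = events_to_frame(ev, H=8, W=8)
assert frame[2, 3, 1] == 2 and frame[4, 5, 0] == 1
```

Such a representation can be updated incrementally event-by-event, which is what keeps per-event latency low on embedded hardware.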
[226] Spectral and Trajectory Regularization for Diffusion Transformer Super-Resolution
Jingkai Wang, Yixin Tang, Jue Gong, Jiatong Li, Shu Li, Libo Liu, Jianliang Lan, Yutong Liu, Yulun Zhang
Main category: cs.CV
TL;DR: StrSR is a one-step adversarial distillation framework for diffusion transformer-based real-world image super-resolution that addresses trajectory mismatch and periodic artifacts through spectral and trajectory regularization.
Details
Motivation: Diffusion transformers (DiTs) show promise for real-world image super-resolution but suffer from computationally expensive iterative sampling requiring one-step distillation. Existing distillation methods struggle with trajectory mismatch and generate severe grid-like periodic artifacts in DiT-based Real-ISR.
Method: Proposes StrSR with: 1) Asymmetric discriminative distillation architecture to bridge trajectory gap between teacher and student models, and 2) Frequency distribution matching strategy to suppress DiT-specific periodic artifacts caused by high-frequency spectral leakage.
Result: Extensive experiments show StrSR achieves state-of-the-art performance in Real-ISR across both quantitative metrics and visual perception, effectively addressing the key challenges of trajectory mismatch and periodic artifacts.
Conclusion: StrSR successfully enables efficient one-step distillation for diffusion transformer-based real-world image super-resolution through novel spectral and trajectory regularization techniques, making DiT architectures more practical for real-world applications.
Abstract: Diffusion transformer (DiT) architectures show great potential for real-world image super-resolution (Real-ISR). However, their computationally expensive iterative sampling necessitates one-step distillation. Existing one-step distillation methods struggle with Real-ISR on DiT. They suffer from fundamental trajectory mismatch and generate severe grid-like periodic artifacts. To tackle these challenges, we propose StrSR, a novel one-step adversarial distillation framework featuring spectral and trajectory regularization. Specifically, we propose an asymmetric discriminative distillation architecture to bridge the trajectory gap. Additionally, we design a frequency distribution matching strategy to effectively suppress DiT-specific periodic artifacts caused by high-frequency spectral leakage. Extensive experiments demonstrate that StrSR achieves state-of-the-art performance in Real-ISR, across both quantitative metrics and visual perception. The code and models will be released at https://github.com/jkwang28/StrSR .
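Frequency distribution matching of the kind described can be sketched as a loss on FFT magnitude spectra: grid-like periodic artifacts appear as spurious spectral peaks, so penalizing the spectral gap to a reference discourages them. A minimal illustration (not StrSR's exact loss; names and the simple L1 form are assumptions):

```python
import numpy as np

def spectral_matching_loss(pred, target):
    """L1 distance between 2D FFT magnitude spectra of two images.

    Periodic artifacts concentrate energy at specific frequencies, so a
    frequency-domain penalty of this kind suppresses them directly.
    """
    mag = lambda img: np.abs(np.fft.fftshift(np.fft.fft2(img)))
    return float(np.mean(np.abs(mag(pred) - mag(target))))

rng = np.random.default_rng(0)
clean = rng.normal(size=(32, 32))
# superimpose a period-4 grid pattern to mimic checkerboard artifacts
artifacted = clean + 0.5 * np.sin(2 * np.pi * np.arange(32) / 4)[None, :]
assert spectral_matching_loss(clean, clean) == 0.0
assert spectral_matching_loss(artifacted, clean) > 0.0
```

A pixel-space loss alone can miss such artifacts because their per-pixel magnitude is small even when their spectral signature is pronounced.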
[227] Attribute Distribution Modeling and Semantic-Visual Alignment for Generative Zero-shot Learning
Haojie Pu, Zhuoming Li, Yongbiao Gao, Yuheng Jia
Main category: cs.CV
TL;DR: ADiVA addresses generative zero-shot learning challenges by modeling attribute distributions and aligning semantic-visual representations to bridge class-instance and semantic-visual domain gaps.
Details
Motivation: Generative ZSL faces two key challenges: (1) class-level attributes fail to capture instance-specific visual appearances due to intra-class variability (class-instance gap), and (2) mismatch between semantic and visual feature distributions due to inter-class correlations (semantic-visual domain gap).
Method: ADiVA consists of two modules: Attribute Distribution Modeling (ADM) learns transferable attribute distributions for each class and samples instance-level attributes for unseen classes; Visual-Guided Alignment (VGA) refines semantic representations to better reflect visual structures through explicit semantic-visual alignment.
Result: Experiments on three benchmark datasets show ADiVA significantly outperforms state-of-the-art methods, achieving gains of 4.7% on AWA2 and 6.1% on SUN datasets. The approach can also serve as a plugin to enhance existing generative ZSL methods.
Conclusion: ADiVA effectively addresses generative ZSL challenges by jointly modeling attribute distributions and performing explicit semantic-visual alignment, bridging both class-instance and semantic-visual domain gaps for improved zero-shot learning performance.
Abstract: Generative zero-shot learning (ZSL) synthesizes features for unseen classes, leveraging semantic conditions to transfer knowledge from seen classes. However, it also introduces two intrinsic challenges: (1) class-level attributes fail to capture instance-specific visual appearances due to substantial intra-class variability, thus causing the class-instance gap; (2) the substantial mismatch between semantic and visual feature distributions, manifested in inter-class correlations, gives rise to the semantic-visual domain gap. To address these challenges, we propose an Attribute Distribution Modeling and Semantic-Visual Alignment (ADiVA) approach, jointly modeling attribute distributions and performing explicit semantic-visual alignment. Specifically, our ADiVA consists of two modules: an Attribute Distribution Modeling (ADM) module that learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes, and a Visual-Guided Alignment (VGA) module that refines semantic representations to better reflect visual structures. Experiments on three widely used benchmark datasets demonstrate that ADiVA significantly outperforms state-of-the-art methods (e.g., achieving gains of 4.7% and 6.1% on AWA2 and SUN, respectively). Moreover, our approach can serve as a plugin to enhance existing generative ZSL methods.
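Modeling attributes as per-class distributions rather than fixed vectors can be sketched as reparameterized Gaussian sampling: each class keeps a mean and variance over attributes, and instance-level attribute vectors are drawn from it. A minimal illustration (the Gaussian parameterization and attribute names are assumptions, not ADiVA's exact formulation):

```python
import numpy as np

def sample_instance_attributes(class_mean, class_logvar, n, rng):
    """Draw n instance-level attribute vectors from a per-class Gaussian.

    A distribution per class lets synthesized features reflect intra-class
    variability; the reparameterized form (mean + eps * std) keeps sampling
    differentiable when used inside a trained model.
    """
    std = np.exp(0.5 * class_logvar)
    eps = rng.normal(size=(n, class_mean.shape[0]))
    return class_mean + eps * std

rng = np.random.default_rng(0)
mean = np.array([0.8, 0.1, 0.5])                 # hypothetical attributes
logvar = np.log(np.array([0.01, 0.04, 0.02]))
samples = sample_instance_attributes(mean, logvar, n=1000, rng=rng)
assert samples.shape == (1000, 3)
assert np.allclose(samples.mean(axis=0), mean, atol=0.05)
```

Feeding these varied samples to the feature generator, instead of one fixed class vector, is what narrows the class-instance gap.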
[228] SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation
Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum, Lu Yin, Na Zhao, Xiatian Zhu
Main category: cs.CV
TL;DR: SCOPE is a plug-and-play framework for 3D incremental few-shot segmentation that leverages background pseudo-instances to enrich prototypes for novel classes without retraining the backbone.
Details
Motivation: 3D incremental few-shot segmentation is underexplored compared to 2D, with existing methods suffering from catastrophic forgetting or poor prototype learning under sparse supervision. A key insight is that novel categories often appear as unlabelled background in base-training scenes.
Method: SCOPE uses a class-agnostic segmentation model to extract high-confidence pseudo-instances from background regions after base training, building a prototype pool. When novel classes arrive with few samples, relevant background prototypes are retrieved and fused with few-shot prototypes to create enriched representations without backbone retraining or added parameters.
Result: On ScanNet and S3DIS datasets, SCOPE achieves state-of-the-art performance, improving novel-class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting.
Conclusion: SCOPE effectively addresses 3D incremental few-shot segmentation by leveraging background context to enrich prototypes, demonstrating significant performance gains without architectural changes or retraining overhead.
Abstract: Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and often overlook a key cue: novel categories frequently appear as unlabelled background in base-training scenes. We introduce SCOPE (Scene-COntextualised Prototype Enrichment), a plug-and-play background-guided prototype enrichment framework that integrates with any prototype-based 3D segmentation method. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When novel classes arrive with few labelled samples, relevant background prototypes are retrieved and fused with few-shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS show that SCOPE achieves SOTA performance, improving novel-class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Code is available at https://github.com/Surrey-UP-Lab/SCOPE.
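The retrieve-and-fuse step described above can be sketched as cosine-similarity retrieval from the background prototype pool followed by a convex combination with the few-shot prototype. All names, the top-k retrieval, and the fusion weight are illustrative, not SCOPE's exact procedure:

```python
import numpy as np

def enrich_prototype(few_shot_proto, pool, top_k=3, alpha=0.7):
    """Fuse a few-shot prototype with its nearest background pseudo-prototypes.

    pool: (M, D) prototypes mined from background regions after base training.
    Returns alpha * few_shot + (1 - alpha) * mean(top-k retrieved).
    """
    normalize = lambda x: x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sims = normalize(pool) @ normalize(few_shot_proto)   # cosine similarities
    idx = np.argsort(sims)[-top_k:]                      # most similar entries
    retrieved = pool[idx].mean(axis=0)
    return alpha * few_shot_proto + (1 - alpha) * retrieved

rng = np.random.default_rng(0)
pool = rng.normal(size=(20, 16))   # background pseudo-instance prototypes
proto = rng.normal(size=16)        # few-shot prototype for a novel class
enriched = enrich_prototype(proto, pool)
assert enriched.shape == (16,)
```

Because the pool is built once after base training and fusion is a weighted average, the scheme adds no parameters and needs no backbone retraining.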
[229] FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
Zhen Wang, Youcan Xu, Jun Xiao, Long Chen
Main category: cs.CV
TL;DR: FlowMotion is a training-free framework for video motion transfer that uses flow-based T2V model predictions to extract motion guidance, achieving efficient and flexible motion transfer without heavy computational overhead.
Details
Motivation: Existing training-free video motion transfer methods rely on intermediate outputs of pre-trained T2V models, which causes heavy computational overhead and limited flexibility. The authors aim to create a more efficient and flexible approach.
Method: FlowMotion directly leverages predicted outputs of flow-based T2V models. It extracts motion representations (flow guidance) from early latent predictions that encode rich temporal information, and introduces velocity regularization to stabilize optimization and ensure smooth motion evolution.
Result: FlowMotion achieves superior time and resource efficiency while maintaining competitive performance compared to state-of-the-art methods, operating purely on model predictions without additional training.
Conclusion: The proposed FlowMotion framework enables efficient and flexible video motion transfer by directly utilizing flow-based T2V model predictions, offering a practical solution that balances performance and computational efficiency.
Abstract: Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.
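The "early latent prediction" idea rests on the fact that a flow-based model's velocity output implies a clean-sample estimate at every step. Under the common rectified-flow convention x_t = (1 - t) x0 + t x1 with velocity v = x1 - x0 (an assumption; sign and time conventions vary across models), the implied clean latent is x_t - t v:

```python
import numpy as np

def predicted_clean_latent(x_t, v_pred, t):
    """Recover the model's implied clean sample from one velocity prediction.

    Assumes x_t = (1 - t) * x0 + t * x1 and v = x1 - x0, so
    x_t - t * v = x0 exactly when v_pred is the true velocity.
    """
    return x_t - t * v_pred

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))       # "data" latent
x1 = rng.normal(size=(4, 4))       # noise endpoint
t = 0.3
x_t = (1 - t) * x0 + t * x1
v = x1 - x0                        # ideal velocity for this pair
assert np.allclose(predicted_clean_latent(x_t, v, t), x0)
```

Because this estimate is available from a single forward pass at an early step, guidance can be computed on it without running the full sampling trajectory.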
[230] Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces
Eitan Shaar, Ariel Shaulov, Yalcin Tur, Gal Chechik, Ravid Shwartz-Ziv
Main category: cs.CV
TL;DR: LTA is a transfer-based adversarial attack that optimizes perturbations in Stable Diffusion’s latent space rather than pixel space, producing more robust and spatially coherent attacks that transfer better across models.
Details
Motivation: Traditional pixel-space adversarial attacks produce high-frequency noise that is brittle to preprocessing and transfers poorly across architectures. The authors aim to create more robust, transferable attacks by leveraging the structured latent space of pretrained generative models.
Method: LTA encodes clean images into Stable Diffusion VAE’s latent space, optimizes perturbations there to maximize classifier loss while enforcing pixel-space constraints after decoding. Uses Expectation Over Transformations (randomized resizing, interpolation, cropping) and latent Gaussian smoothing for robustness.
Result: LTA achieves strong transfer attack success across CNN and vision-transformer targets, producing spatially coherent, low-frequency perturbations that differ qualitatively from pixel-space baselines and occupy a distinct point in the transfer-quality trade-off.
Conclusion: Pretrained generative latent spaces provide an effective and structured domain for adversarial optimization, bridging robustness evaluation with modern generative priors and producing more transferable attacks.
Abstract: Adversarial attacks are a central tool for probing the robustness of modern vision models, yet most methods optimize perturbations directly in pixel space under $\ell_\infty$ or $\ell_2$ constraints. While effective in white-box settings, pixel-space optimization often produces high-frequency, texture-like noise that is brittle to common preprocessing (e.g., resizing and cropping) and transfers poorly across architectures. We propose $\textbf{LTA}$ ($\textbf{L}$atent $\textbf{T}$ransfer $\textbf{A}$ttack), a transfer-based attack that instead optimizes perturbations in the latent space of a pretrained Stable Diffusion VAE. Given a clean image, we encode it into a latent code and optimize the latent representation to maximize a surrogate classifier loss, while softly enforcing a pixel-space $\ell_\infty$ budget after decoding. To improve robustness to resolution mismatch and standard input pipelines, we incorporate Expectation Over Transformations (EOT) via randomized resizing, interpolation, and cropping, and apply periodic latent Gaussian smoothing to suppress emerging artifacts and stabilize optimization. Across a suite of CNN and vision-transformer targets, LTA achieves strong transfer attack success while producing spatially coherent, predominantly low-frequency perturbations that differ qualitatively from pixel-space baselines and occupy a distinct point in the transfer-quality trade-off. Our results highlight pretrained generative latent spaces as an effective and structured domain for adversarial optimization, bridging robustness evaluation with modern generative priors.
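Enforcing a pixel-space budget after decoding can be sketched as a projection onto the l_inf ball around the clean image. Note the paper enforces its budget softly during optimization; the hard clipping below is the simplest variant, shown only to make the constraint concrete:

```python
import numpy as np

def project_linf(x_adv, x_clean, eps):
    """Project a decoded adversarial image back into the l_inf ball.

    In a latent-space attack the perturbation lives in latent space, but
    after decoding, the pixel-space deviation from the clean image must
    stay within the attack budget eps.
    """
    delta = np.clip(x_adv - x_clean, -eps, eps)
    return np.clip(x_clean + delta, 0.0, 1.0)   # also stay a valid image

rng = np.random.default_rng(0)
x = rng.uniform(size=(8, 8, 3))                       # clean image in [0, 1]
x_adv = x + rng.normal(scale=0.1, size=x.shape)       # decoded perturbed image
x_proj = project_linf(x_adv, x, eps=8 / 255)
assert np.max(np.abs(x_proj - x)) <= 8 / 255 + 1e-9
```

The structure comes from optimizing in the VAE latent; the projection merely keeps the result comparable to standard l_inf-bounded attacks.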
[231] WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection
Peng Chen, Chao Huang
Main category: cs.CV
TL;DR: Wavelet-enhanced mixture-of-experts prompt learning for zero-shot anomaly detection using vision-language models with adaptive prompts and multi-frequency features.
Details
Motivation: Existing zero-shot anomaly detection methods using vision-language models rely on fixed textual prompts that struggle with complex semantics and focus only on spatial-domain features, limiting detection of subtle anomalies.
Method: Proposes wavelet-enhanced mixture-of-experts prompt learning: uses variational autoencoder to model global semantic representations for adaptive prompts, wavelet decomposition for multi-frequency image features, cross-modal interactions to refine textual embeddings, and semantic-aware mixture-of-experts module to aggregate contextual information.
Result: Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
Conclusion: The proposed approach enhances zero-shot anomaly detection by addressing limitations of fixed prompts and spatial-only features through adaptive prompt learning and multi-frequency analysis.
Abstract: Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
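Wavelet decomposition into multi-frequency sub-bands can be illustrated with one level of the 2D Haar transform, the simplest wavelet (the paper does not specify this exact wavelet or implementation; this only shows what "multi-frequency image features" means):

```python
import numpy as np

def haar_dwt2(img):
    """One level of the 2D Haar wavelet transform.

    Returns (LL, LH, HL, HH): a low-frequency approximation plus three
    detail sub-bands, each at half the input resolution.
    """
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    LL = (a + b + c + d) / 4   # coarse approximation
    LH = (a - b + c - d) / 4   # detail along one axis
    HL = (a + b - c - d) / 4   # detail along the other axis
    HH = (a - b - c + d) / 4   # diagonal detail
    return LL, LH, HL, HH

img = np.ones((8, 8))          # a constant image has no detail energy
LL, LH, HL, HH = haar_dwt2(img)
assert LL.shape == (4, 4)
assert np.allclose(LL, 1.0) and np.allclose(HH, 0.0)
```

Subtle anomalies that barely alter the spatial domain often leave a clearer trace in the detail sub-bands, which is the motivation for conditioning prompts on them.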
[232] P-SLCR: Unsupervised Point Cloud Semantic Segmentation via Prototypes Structure Learning and Consistent Reasoning
Lixin Zhan, Jie Jiang, Tianjian Zhou, Yukun Du, Yan Zheng, Xuehu Duan
Main category: cs.CV
TL;DR: P-SLCR: A prototype library-driven unsupervised point cloud semantic segmentation method using structure learning and consistent reasoning, achieving state-of-the-art performance on major 3D datasets.
Details
Motivation: Current point cloud semantic segmentation heavily relies on manual labeling, while unsupervised methods for raw point clouds are still underdeveloped. The absence of annotations and pre-training poses significant challenges, necessitating effective unsupervised strategies.
Method: Proposes P-SLCR with two main components: 1) Consistent Structure Learning that establishes structural feature learning between consistent points and prototype libraries by selecting high-quality features, and 2) Semantic Relation Consistent Reasoning that constructs prototype inter-relation matrices between consistent and ambiguous prototype libraries to preserve semantic consistency through constraints.
Result: Achieved best performance among unsupervised methods on S3DIS, SemanticKITTI, and ScanNet datasets. Specifically achieved 47.1% mIoU for Area-5 of S3DIS dataset, surpassing classical fully supervised method PointNet by 2.5%.
Conclusion: The proposed P-SLCR framework effectively addresses unsupervised point cloud semantic segmentation challenges through prototype library-driven structure learning and consistent reasoning, demonstrating superior performance over existing methods.
Abstract: Current semantic segmentation approaches for point cloud scenes heavily rely on manual labeling, while research on unsupervised semantic segmentation methods specifically for raw point clouds is still in its early stages. Unsupervised point cloud learning poses significant challenges due to the absence of annotation information and the lack of pre-training. The development of effective strategies is crucial in this context. In this paper, we propose a novel prototype library-driven unsupervised point cloud semantic segmentation strategy that utilizes Structure Learning and Consistent Reasoning (P-SLCR). First, we propose a Consistent Structure Learning to establish structural feature learning between consistent points and the library of consistent prototypes by selecting high-quality features. Second, we propose a Semantic Relation Consistent Reasoning that constructs a prototype inter-relation matrix between consistent and ambiguous prototype libraries separately. This process ensures the preservation of semantic consistency by imposing constraints on consistent and ambiguous prototype libraries through the prototype inter-relation matrix. Finally, our method was extensively evaluated on the S3DIS, SemanticKITTI, and ScanNet datasets, achieving the best performance compared to unsupervised methods. Specifically, the mIoU of 47.1% is achieved for Area-5 of the S3DIS dataset, surpassing the classical fully supervised method PointNet by 2.5%.
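A prototype inter-relation matrix is, in its simplest form, a pairwise cosine-similarity matrix between two prototype libraries. A minimal sketch (illustrative, not the paper's exact construction):

```python
import numpy as np

def inter_relation_matrix(protos_a, protos_b):
    """Cosine-similarity matrix between two prototype libraries.

    Entry (i, j) relates prototype i of library A to prototype j of
    library B; constraining this matrix can force an ambiguous library
    to follow the semantic structure of a consistent one.
    """
    na = protos_a / (np.linalg.norm(protos_a, axis=1, keepdims=True) + 1e-8)
    nb = protos_b / (np.linalg.norm(protos_b, axis=1, keepdims=True) + 1e-8)
    return na @ nb.T

A = np.array([[1.0, 0.0], [0.0, 1.0]])   # "consistent" prototypes
B = np.array([[1.0, 0.0], [1.0, 1.0]])   # "ambiguous" prototypes
R = inter_relation_matrix(A, B)
assert R.shape == (2, 2)
assert abs(R[0, 0] - 1.0) < 1e-6
```

A consistency constraint would then penalize the difference between the relation matrices computed from the two libraries rather than matching prototypes one-to-one.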
[233] WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching
Weilun Feng, Guoxin Fan, Haotong Qin, Chuanguang Yang, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Dingrui Wang, Longlong Liao, Michele Magno, Yongjun Xu
Main category: cs.CV
TL;DR: WorldCache: A caching framework for diffusion-based world models that accelerates inference while maintaining rollout quality by addressing token heterogeneity and non-uniform temporal dynamics.
Details
Motivation: Diffusion-based world models show promise for unified world simulation but suffer from slow iterative denoising that limits interactive use and long-horizon rollouts. While feature caching can help, existing policies designed for single-modal diffusion don't transfer well to world models due to token heterogeneity and non-uniform temporal dynamics.
Method: Proposes WorldCache with two key components: 1) Curvature-guided Heterogeneous Token Prediction - uses physics-grounded curvature scores to estimate token predictability and applies Hermite-guided damped predictor for chaotic tokens; 2) Chaotic-prioritized Adaptive Skipping - accumulates curvature-normalized drift signal and recomputes only when bottleneck tokens begin to drift.
Result: WorldCache achieves up to 3.7Ă end-to-end speedups while maintaining 98% rollout quality on diffusion world models, demonstrating practical advantages for resource-constrained scenarios.
Conclusion: WorldCache effectively addresses the unique challenges of diffusion world models, enabling efficient inference without sacrificing simulation quality, making diffusion-based world models more practical for interactive applications.
Abstract: Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: \emph{token heterogeneity} from multi-modal coupling and spatial variation, and \emph{non-uniform temporal dynamics} where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose \textbf{WorldCache}, a caching framework tailored to diffusion world models. We introduce \textit{Curvature-guided Heterogeneous Token Prediction}, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor for chaotic tokens with abrupt direction changes. We also design \textit{Chaotic-prioritized Adaptive Skipping}, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to \textbf{3.7$\times$} end-to-end speedups while maintaining \textbf{98%} rollout quality, demonstrating the vast advantages and practicality of WorldCache in resource-constrained scenarios. Our code is released in https://github.com/FofGofx/WorldCache.
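The adaptive skipping idea, recompute only when hard tokens start to drift, can be sketched with a curvature-style signal over cached feature snapshots: tokens whose features evolve linearly are predictable and safe to skip, while tokens with abrupt direction changes trigger recomputation. This is a loose illustration of the mechanism, not WorldCache's actual criterion; the threshold and normalization are assumptions:

```python
import numpy as np

def should_recompute(feat_hist, threshold=0.1):
    """Decide whether cached features have drifted enough to recompute.

    Uses the discrete second difference (a curvature proxy) of the last
    three cached snapshots, normalized to be dimensionless per token;
    high-curvature ("chaotic") tokens trigger recomputation first.
    """
    f0, f1, f2 = feat_hist[-3], feat_hist[-2], feat_hist[-1]
    curvature = np.linalg.norm(f2 - 2 * f1 + f0, axis=-1)
    scale = np.linalg.norm(f2 - f1, axis=-1) + 1e-8
    drift = curvature / scale
    return bool(drift.max() > threshold)

# tokens evolving linearly (predictable) -> keep the cache
linear = [np.ones((4, 8)) * s for s in (0.0, 1.0, 2.0)]
assert not should_recompute(linear)

# oscillating tokens (abrupt direction change) -> recompute
chaotic = [np.zeros((4, 8)), np.ones((4, 8)), np.zeros((4, 8))]
assert should_recompute(chaotic)
```

Keying the decision on the worst (bottleneck) tokens, rather than an average, is what keeps uniform skipping from either destabilizing or becoming overly conservative.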
[234] LATO: 3D Mesh Flow Matching with Structured TOpology Preserving LAtents
Tianhao Zhao, Youjia Zhang, Hang Long, Jinshen Zhang, Wenbing Li, Yang Yang, Gongbo Zhang, Jozef Hladký, Matthias Nießner, Wei Yang
Main category: cs.CV
TL;DR: LATO introduces a topology-preserving latent representation for 3D mesh generation using flow matching, enabling direct mesh synthesis without isosurface extraction.
Details
Motivation: Current 3D mesh generation methods often rely on isosurface extraction or heuristic meshing, which can produce artifacts and topological errors. There's a need for scalable, topology-aware mesh synthesis that preserves mesh structure while being efficient.
Method: LATO uses a Vertex Displacement Field (VDF) anchored on surfaces with a sparse voxel VAE to compress explicit signals into structured, topology-aware voxel latents. The decoder progressively subdivides and prunes latent voxels to instantiate vertex locations, while a connection head predicts edge connectivity directly. For generation, it uses two-stage flow matching: first synthesizing structure voxels, then refining voxel-wise topology features.
Result: LATO generates meshes with complex geometry and well-formed topology while being highly efficient in inference compared to prior isosurface/triangle-based diffusion models and autoregressive approaches.
Conclusion: LATO provides a scalable, topology-preserving approach for 3D mesh generation that avoids isosurface extraction artifacts and enables direct mesh synthesis with proper connectivity.
Abstract: In this paper, we introduce LATO, a novel topology-preserving latent representation that enables scalable, flow matching-based synthesis of explicit 3D meshes. LATO represents a mesh as a Vertex Displacement Field (VDF) anchored on surface, incorporating a sparse voxel Variational Autoencoder (VAE) to compress this explicit signal into a structured, topology-aware voxel latent. To decapsulate the mesh, the VAE decoder progressively subdivides and prunes latent voxels to instantiate precise vertex locations. In the end, a dedicated connection head queries the voxel latent to predict edge connectivity between vertex pairs directly, allowing mesh topology to be recovered without isosurface extraction or heuristic meshing. For generative modeling, LATO adopts a two-stage flow matching process, first synthesizing the structure voxels and subsequently refining the voxel-wise topology features. Compared to prior isosurface/triangle-based diffusion models and autoregressive generation approaches, LATO generates meshes with complex geometry, well-formed topology while being highly efficient in inference.
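The flow-matching objective used above is easy to state concretely. The sketch below is a generic rectified-flow loss under our own simplifications (flat feature lists, a hand-rolled model callable), not LATO's two-stage training:

```python
def flow_matching_loss(x0, x1, t, vel_model):
    """Generic flow-matching step: interpolate along the straight path
    from noise x0 to data x1, and regress the model's predicted velocity
    at the intermediate point onto the constant target (x1 - x0)."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
    target = [b - a for a, b in zip(x0, x1)]             # true velocity
    pred = vel_model(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

# An oracle that already knows the true velocity incurs zero loss.
x0, x1 = [0.0, 0.0], [1.0, 2.0]
oracle = lambda xt, t: [1.0, 2.0]
```

Here `flow_matching_loss(x0, x1, 0.3, oracle)` evaluates to 0.0; training replaces `oracle` with a network and averages the loss over samples and random `t`.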
[235] PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization
Yao Ni, Shan Zhang, Piotr Koniusz
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrieval
Method: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to draw conclusions due to failed paper retrieval
Abstract: Failed to fetch summary for 2409.17137: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.17137&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[236] Computer vision-based estimation of invertebrate biomass
Mikko Impiö, Philipp M. Rehsen, Jarrett Blair, Cecilie Mielec, Arne J. Beermann, Florian Leese, Toke T. Høye, Jenni Raitoharju
Main category: cs.CV
TL;DR: Computer vision methods for estimating invertebrate dry mass from images using area and sinking speed predictors, with deep learning models achieving 10-20% median error.
Details
Motivation: To enable scalable biodiversity monitoring by automating invertebrate biomass estimation through computer vision, eliminating manual destructive weighing processes.
Method: Two approaches: 1) Linear model with novel predictors (area and sinking speed) calculated from BIODISCOVER dual-camera system capturing sinking sequences; 2) End-to-end deep neural networks with single-view, multi-view, and metadata-aware architectures.
Result: Methods successfully estimate dry mass for diverse specimen morphologies with 10-20% median percentage error for individuals when combined with taxonomic classification.
Conclusion: Computer vision enables accurate, non-destructive biomass estimation for biodiversity monitoring, with appropriate evaluation metrics and model architectures being crucial.
Abstract: The ability to estimate invertebrate biomass using only images could help scaling up quantitative biodiversity monitoring efforts. Computer vision-based methods have the potential to omit the manual, time-consuming, and destructive process of dry weighing specimens. We present two approaches for dry mass estimation that do not require additional manual effort apart from imaging the specimens: fitting a linear model with novel predictors, automatically calculated by an imaging device, and training a family of end-to-end deep neural networks for the task, using single-view, multi-view, and metadata-aware architectures. We propose using area and sinking speed as predictors. These can be calculated with BIODISCOVER, which is a dual-camera system that captures image sequences of specimens sinking in an ethanol column. For this study, we collected a large dataset of dry mass measurement and image sequence pairs to train and evaluate models. We show that our methods can estimate specimen dry mass even with complex and visually diverse specimen morphologies. Combined with automatic taxonomic classification, our approach is an accurate method for group-level dry mass estimation, with a median percentage error of 10-20% for individuals. We highlight the importance of choosing appropriate evaluation metrics, and encourage using both percentage errors and absolute errors as metrics, because they measure different properties. We also explore different optimization losses, data augmentation methods, and model architectures for training deep-learning models.
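The linear-model branch above amounts to ordinary least squares on log-transformed measurements. A minimal single-predictor sketch (area only; our own simplification, since the paper also uses sinking speed, and all names are ours):

```python
import math

def fit_loglog(areas, masses):
    """OLS fit of log(mass) = intercept + slope * log(area),
    i.e. the allometric form mass = exp(intercept) * area**slope."""
    xs = [math.log(a) for a in areas]
    ys = [math.log(m) for m in masses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def median_percentage_error(true, pred):
    """One of the two metrics the abstract recommends reporting
    alongside absolute errors."""
    errs = sorted(abs(p - t) / t * 100 for t, p in zip(true, pred))
    m = len(errs) // 2
    return errs[m] if len(errs) % 2 else (errs[m - 1] + errs[m]) / 2
```

On synthetic data generated as `mass = 2 * area ** 1.5`, the fit recovers slope 1.5 and intercept log 2 up to floating-point error.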
[237] OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
Yuxuan Fan, Jing Hao, Hong Chen, Jiahao Bao, Yihua Shao, Yuci Liang, Kuo Feng Hung, Hao Tang
Main category: cs.CV
TL;DR: OralGPT-Plus is an agentic vision-language model for panoramic dental radiograph analysis that performs iterative, symmetry-aware diagnostic reasoning using reinforcement learning and a new dataset with expert-curated diagnostic trajectories.
Details
Motivation: Existing vision-language models use static single-pass paradigms that are insufficient for panoramic dental radiographs requiring fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, limiting clinical reliability.
Method: 1) Developed OralGPT-Plus with iterative and symmetry-aware diagnostic reasoning; 2) Created DentalProbe dataset with 5,000 images and expert-curated diagnostic trajectories; 3) Implemented Reinspection-driven reinforcement learning with rubric-based and conditioned diagnostic-driven rewards; 4) Built MMOral-X benchmark with 300 open-ended questions and region-level annotations.
Result: OralGPT-Plus shows consistent and reliable improvements over strong baselines on both the new MMOral-X benchmark and established panoramic benchmarks, demonstrating effectiveness of interactive and symmetry-informed reasoning.
Conclusion: The work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis through interactive reasoning paradigms.
Abstract: Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.
[238] Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
Jonas Ernst, Wolfgang Boettcher, Lukas Hoyer, Jan Eric Lenssen, Bernt Schiele
Main category: cs.CV
TL;DR: Rewis3d improves weakly supervised semantic segmentation on 2D images using 3D scene reconstruction as auxiliary supervision to propagate sparse annotations across scenes.
Details
Motivation: Dense pixel-level annotations for training segmentation models are costly. While sparse annotations offer a weakly-supervised alternative, they still suffer from performance gaps. The paper aims to leverage 3D geometric structure from 2D videos to provide stronger supervision signals.
Method: Uses a dual student-teacher architecture that enforces semantic consistency between 2D images and reconstructed 3D point clouds. Leverages state-of-the-art feed-forward 3D reconstruction to generate reliable geometric supervision for propagating sparse annotations across entire scenes.
Result: Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.
Conclusion: 3D scene reconstruction provides valuable geometric supervision that can significantly improve weakly supervised semantic segmentation on 2D images, effectively bridging the performance gap of sparse annotations.
Abstract: We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.
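The dual student-teacher consistency idea follows the standard mean-teacher pattern. A minimal sketch under our own assumptions (flat weight lists, scalar predictions; function names are ours, not the paper's):

```python
def ema_update(teacher_w, student_w, momentum=0.99):
    """Teacher weights track an exponential moving average of the
    student's, which stabilizes the consistency targets."""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_w, student_w)]

def consistency_loss(pred_2d, pred_3d):
    """Mean squared disagreement between predictions for the same
    points seen in 2D images and in the reconstructed 3D point cloud."""
    return sum((a - b) ** 2 for a, b in zip(pred_2d, pred_3d)) / len(pred_2d)
```

The consistency term is added to the sparse-label supervised loss; only the student receives gradients, and `ema_update` is applied after each optimizer step.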
[239] MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis
Dongqing Xie, Yonghuang Wu
Main category: cs.CV
TL;DR: MoEMambaMIL: A structure-aware State Space Model framework for whole-slide image analysis that organizes patch tokens into region-aware sequences and uses mixture-of-experts for resolution-specific and region-adaptive modeling.
Details
Motivation: Existing MIL approaches treat WSIs as unordered patch collections, failing to capture structured dependencies between global tissue organization and local cellular patterns. While SSMs enable efficient long-sequence modeling, structuring WSI tokens to exploit spatial hierarchy remains challenging.
Method: Proposes MoEMambaMIL with region-nested selective scanning and mixture-of-experts modeling. Organizes patch tokens into region-aware sequences preserving spatial containment across resolutions. Uses static resolution-specific experts and dynamic sparse experts with learned routing for resolution-aware encoding and region-adaptive contextual modeling.
Result: Achieves best performance across 9 downstream tasks, demonstrating effectiveness of the structure-aware SSM framework for WSI analysis.
Conclusion: MoEMambaMIL successfully addresses the challenge of modeling hierarchical multi-resolution structure in WSIs through structured token organization and expert specialization, enabling efficient long-sequence modeling while capturing spatial dependencies.
Abstract: Whole-slide image (WSI) analysis is challenging due to the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning (MIL) approaches often model WSIs as unordered collections of patches, which limits their ability to capture structured dependencies between global tissue organization and local cellular patterns. Although recent State Space Models (SSMs) enable efficient modeling of long sequences, how to structure WSI tokens to fully exploit their spatial hierarchy remains an open problem. We propose MoEMambaMIL, a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. Leveraging multi-resolution preprocessing, MoEMambaMIL organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. On top of this structured sequence, we decouple resolution-aware encoding and region-adaptive contextual modeling via a combination of static, resolution-specific experts and dynamic sparse experts with learned routing. This design enables efficient long-sequence modeling while promoting expert specialization across heterogeneous diagnostic patterns. Experiments demonstrate that MoEMambaMIL achieves the best performance across 9 downstream tasks.
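The sparse expert routing described here follows the usual top-k softmax gating pattern. A toy sketch with our own names (not the paper's code), where only the selected experts are ever evaluated:

```python
import math

def topk_gate(logits, k=2):
    """Keep only the k highest-scoring experts and softmax-normalize
    their weights; the remaining experts are skipped entirely."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in chosen}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, logits, k=2):
    """Route input x to the selected experts and mix their outputs."""
    return sum(w * experts[i](x) for i, w in topk_gate(logits, k).items())
```

In a MoEMambaMIL-style setup, the static resolution-specific experts would bypass the gate, while the dynamic sparse experts would be selected per token by learned routing logits as above.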
[240] CHMv2: Improvements in Global Canopy Height Mapping using DINOv3
John Brandt, Seungeun Yi, Jamie Tolan, Xinyuan Li, Peter Potapov, Jessica Ertel, Justine Spore, Huy V. Vo, Michaël Ramamonjisoa, Patrick Labatut, Piotr Bojanowski, Camille Couprie
Main category: cs.CV
TL;DR: CHMv2 is a global, meter-resolution canopy height map derived from optical satellite imagery using a DINOv3-based depth-estimation model trained on ALS data, offering improved accuracy and fine-scale structure preservation compared to existing products.
Details
Motivation: Accurate canopy height information is crucial for forest carbon quantification, restoration monitoring, and habitat assessment, but high-fidelity ALS measurements remain unevenly available globally, necessitating a more accessible and accurate alternative.
Method: Uses a depth-estimation model built on DINOv3 trained against ALS canopy height models, with large geographically diverse training data expansion, automated data curation/registration, and loss formulation/data sampling tailored to canopy height distributions.
Result: CHMv2 substantially improves accuracy, reduces bias in tall forests, and better preserves fine-scale structure like canopy edges and gaps compared to existing products, validated against independent ALS test sets and millions of GEDI/ICESat-2 observations.
Conclusion: CHMv2 provides a high-quality global canopy height mapping solution that addresses limitations of existing products, demonstrating consistent performance across major forest biomes through comprehensive validation.
Abstract: Accurate canopy height information is essential for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure, yet high-fidelity measurements from airborne laser scanning (ALS) remain unevenly available globally. Here we present CHMv2, a global, meter-resolution canopy height map derived from high-resolution optical satellite imagery using a depth-estimation model built on DINOv3 and trained against ALS canopy height models. Compared to existing products, CHMv2 substantially improves accuracy, reduces bias in tall forests, and better preserves fine-scale structure such as canopy edges and gaps. These gains are enabled by a large expansion of geographically diverse training data, automated data curation and registration, and a loss formulation and data sampling strategy tailored to canopy height distributions. We validate CHMv2 against independent ALS test sets and against tens of millions of GEDI and ICESat-2 observations, demonstrating consistent performance across major forest biomes.
[241] REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation
Maëlic Neau, Zoe Falomir
Main category: cs.CV
TL;DR: REACT++ is an improved real-time Scene Graph Generation model that balances relation prediction accuracy, object detection accuracy, and inference speed, achieving 20% faster inference with 10% better relation prediction than previous REACT model.
Details
Motivation: Current SGG methods focus on either relation prediction accuracy, object detection accuracy, or latency reduction, but not all three simultaneously. There's a need for real-time SGG models that balance performance and speed for practical applications like embodied agents.
Method: Builds on REACT architecture with efficient feature extraction and subject-to-object cross-attention within prototype space to balance latency and representational power.
Result: Achieves highest inference speed among existing SGG models with improved relation prediction accuracy without sacrificing object detection performance. 20% faster than previous REACT with 10% gain in relation prediction accuracy on average.
Conclusion: REACT++ establishes new state-of-the-art for real-time SGG by effectively balancing the trade-off between performance and inference speed, enabling practical applications.
Abstract: Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we build on the powerful Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture and propose REACT++, a new state-of-the-art model for real-time SGG. By leveraging efficient feature extraction and subject-to-object cross-attention within the prototype space, REACT++ balances latency and representational power. REACT++ achieves the highest inference speed among existing SGG models, improving relation prediction accuracy without sacrificing object detection performance. Compared to the previous REACT version, REACT++ is 20% faster with a gain of 10% in relation prediction accuracy on average. The code is available at https://github.com/Maelic/SGG-Benchmark.
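Subject-to-object cross-attention is ordinary scaled dot-product attention with subject features as queries and object features as keys/values. A single-head sketch with no learned projections (our simplification for clarity, not the REACT++ code):

```python
import math

def cross_attention(queries, keys, values):
    """For each subject query, softmax over similarity to object keys,
    then return the attention-weighted mix of object values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                       # stabilize the softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        outputs.append([sum(a * v[j] for a, v in zip(attn, values))
                        for j in range(len(values[0]))])
    return outputs
```

The latency appeal is that attention is computed between compact prototype-space vectors rather than full image features, so each subject-object pair costs only a small dot product.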
[242] Solving Jigsaw Puzzles in the Wild: Human-Guided Reconstruction of Cultural Heritage Fragments
Omidreza Safaei, Sinem Aslan, Sebastiano Vascon, Luca Palmieri, Marina Khoroshiltseva, Marcello Pelillo
Main category: cs.CV
TL;DR: Human-in-the-loop puzzle solving framework for real-world archaeological artifact reconstruction that combines automatic relaxation-labeling with interactive human guidance to handle large-scale, ambiguous fragment assemblies.
Details
Motivation: Traditional jigsaw puzzle solvers fail with real-world archaeological artifacts due to erosion, missing regions, irregular shapes, and large-scale ambiguity, especially when fragments number in the thousands as in the RePAIR benchmark.
Method: Hybrid framework integrating automatic relaxation-labeling solver with interactive human guidance. Users iteratively lock verified placements, correct errors, and guide the system. Two interaction strategies: Iterative Anchoring and Continuous Interactive Refinement for scalable reconstruction across varying ambiguity levels and puzzle sizes.
Result: Outperforms both fully automatic and manual baselines in accuracy and efficiency on several RePAIR groups, demonstrating practical solution for large-scale expert-in-the-loop artifact reassembly.
Conclusion: Human-in-the-loop approach offers effective solution for complex real-world cultural heritage reconstruction, combining computational efficiency with human expertise to handle large-scale ambiguous fragment assemblies.
Abstract: Reassembling real-world archaeological artifacts from fragmented pieces poses significant challenges due to erosion, missing regions, irregular shapes, and large-scale ambiguity. Traditional jigsaw puzzle solvers, often designed for clean synthetic scenarios, struggle under these conditions, especially when the number of fragments grows into the thousands, as in the RePAIR benchmark. In this paper, we propose a human-in-the-loop (HIL) puzzle solving framework designed to address the complexity and scale of real-world cultural heritage reconstruction. Our approach integrates an automatic relaxation-labeling solver with interactive human guidance, allowing users to iteratively lock verified placements, correct errors, and guide the system toward semantically and geometrically coherent assemblies. We introduce two complementary interaction strategies, Iterative Anchoring and Continuous Interactive Refinement, which support scalable reconstruction across varying levels of ambiguity and puzzle size. Experiments on several RePAIR groups demonstrate that our hybrid approach substantially outperforms both fully automatic and manual baselines in accuracy and efficiency, offering a practical solution for large-scale expert-in-the-loop artifact reassembly.
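Relaxation labeling, the automatic core of this framework, iteratively rescales each fragment's placement probabilities by the contextual support of neighboring placements; the human-in-the-loop part clamps verified placements. A toy sketch under our own assumptions (dense probability lists; function names are ours):

```python
def relaxation_step(probs, support):
    """One Rosenfeld-style relaxation-labeling update: multiply each
    fragment's placement probabilities by their contextual support,
    then renormalize per fragment."""
    updated = []
    for p, q in zip(probs, support):
        raw = [a * b for a, b in zip(p, q)]
        z = sum(raw)
        updated.append([r / z for r in raw])
    return updated

def lock_placement(probs, fragment, placement):
    """Human-in-the-loop anchoring: clamp a verified placement to a
    one-hot distribution so later automatic updates cannot undo it."""
    probs[fragment] = [1.0 if j == placement else 0.0
                       for j in range(len(probs[fragment]))]
    return probs
```

Iterating `relaxation_step` between rounds of `lock_placement` mirrors the Iterative Anchoring loop: each locked fragment sharpens the support signal for its still-ambiguous neighbors.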
[243] DiffInf: Influence-Guided Diffusion for Supervision Alignment in Facial Attribute Learning
Basudha Pal, Rama Chellappa
Main category: cs.CV
TL;DR: DiffInf uses self-influence scoring and diffusion-based image editing to correct annotation inconsistencies in facial attribute datasets, improving classification performance without dataset size reduction.
Details
Motivation: Facial attribute datasets suffer from annotation inconsistencies due to subjective labeling and visual confounders (pose, illumination, expression, demographics), creating supervision errors that impair representation learning and downstream prediction.
Method: 1) Train baseline classifier and compute self-influence scores to identify problematic training instances; 2) Apply targeted generative correction via latent diffusion autoencoder to align visual content with labels while preserving identity/realism; 3) Train lightweight predictor of high-influence membership as surrogate influence regularizer for differentiable guidance; 4) Replace original samples with edited ones to create influence-refined dataset.
Result: DiffInf consistently improves generalization across multi-class facial attribute classification compared to standard noisy-label training, robust optimization baselines, and influence-based filtering approaches.
Conclusion: Repairing influential annotation inconsistencies at the image level enhances downstream facial attribute classification without sacrificing distributional coverage, demonstrating the value of generative correction over simple filtering.
Abstract: Facial attribute classification relies on large-scale annotated datasets in which many traits, such as age and expression, are inherently ambiguous and continuous but are discretized into categorical labels. Annotation inconsistencies arise from subjectivity and visual confounders such as pose, illumination, expression, and demographic variation, creating mismatch between images and assigned labels. These inconsistencies introduce supervision errors that impair representation learning and degrade downstream prediction. We introduce DiffInf, a self-influence–guided diffusion framework for mitigating annotation inconsistencies in facial attribute learning. We first train a baseline classifier and compute sample-wise self-influence scores using a practical first-order approximation to identify training instances that disproportionately destabilize optimization. Instead of discarding these influential samples, we apply targeted generative correction via a latent diffusion autoencoder to better align visual content with assigned labels while preserving identity and realism. To enable differentiable guidance during correction, we train a lightweight predictor of high-influence membership and use it as a surrogate influence regularizer. The edited samples replace the originals, yielding an influence-refined dataset of unchanged size. Across multi-class facial attribute classification, DiffInf consistently improves generalization compared with standard noisy-label training, robust optimization baselines, and influence-based filtering. Our results demonstrate that repairing influential annotation inconsistencies at the image level enhances downstream facial attribute classification without sacrificing distributional coverage.
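The first-order self-influence approximation mentioned above reduces to the squared norm of each training sample's loss gradient. A minimal sketch with hypothetical names (the real pipeline computes gradients with respect to classifier parameters and then edits, rather than discards, the flagged images):

```python
def self_influence(per_sample_grads):
    """First-order self-influence proxy: ||grad_i||^2 per sample.
    Samples with outsized gradients destabilize optimization most."""
    return [sum(g * g for g in grad) for grad in per_sample_grads]

def flag_influential(scores, quantile=0.9):
    """Mark the top (1 - quantile) fraction of samples for targeted
    generative correction instead of removal."""
    cut = sorted(scores)[int(quantile * (len(scores) - 1))]
    return [s >= cut for s in scores]
```

The flags would then drive the diffusion-based edit step, keeping the dataset size unchanged in line with the abstract's "influence-refined dataset of unchanged size".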
[244] Locating and Editing Figure-Ground Organization in Vision Transformers
Stefan Arnold, René Gröbner
Main category: cs.CV
TL;DR: BEiT vision transformers show a preference for convex completions in perceptual ambiguity tasks, with specific attention heads (L0H9) identified as early seeds for convexity bias that can be manipulated to shift completion preferences.
Details
Motivation: To understand how vision transformers resolve perceptual ambiguity between local geometric evidence and global organizational priors, specifically investigating where the Gestalt prior of convexity is realized within BEiT's internal components.
Method: Used synthetic dart shapes with masked regions that equally admit concave or convex completions to create controlled perceptual conflict. Projected internal activations into discrete visual codebook space via logit attribution to identify functional units governing completion preferences.
Result: BEiT reliably favors convex completion under competition. Figure-ground organization is ambiguous in early/intermediate layers but resolves abruptly in later layers. Head L0H9 acts as an early seed introducing weak convexity bias - downscaling this single head shifts perceptual conflict distribution across a continuous decision boundary.
Conclusion: Vision transformers implement perceptual priors like convexity through identifiable functional units within transformer substructures, with specific attention heads serving as early seeds for organizational biases that can be manipulated to alter completion preferences.
Abstract: Vision Transformers must resolve figure-ground organization by choosing between completions driven by local geometric evidence and those favored by global organizational priors, giving rise to a characteristic perceptual ambiguity. We aim to locate where the canonical Gestalt prior convexity is realized within the internal components of BEiT. Using a controlled perceptual conflict based on synthetic shapes of darts, we systematically mask regions that equally admit either a concave completion or a convex completion. We show that BEiT reliably favors convex completion under this competition. Projecting internal activations into the model’s discrete visual codebook space via logit attribution reveals that this preference is governed by identifiable functional units within transformer substructures. Specifically, we find that figure-ground organization is ambiguous through early and intermediate layers and resolves abruptly in later layers. By decomposing the direct effect of attention heads, we identify head L0H9 acting as an early seed, introducing a weak bias toward convexity. Downscaling this single attention head shifts the distributional mass of the perceptual conflict across a continuous decision boundary, allowing concave evidence to guide completion.
[245] Non-invasive Growth Monitoring of Small Freshwater Fish in Home Aquariums via Stereo Vision
Clemens Seibold, Anna Hilsmann, Peter Eisert
Main category: cs.CV
TL;DR: A refraction-aware stereo vision method for non-invasive fish length estimation in aquariums using YOLOv11-Pose for keypoint detection and refraction-aware 3D triangulation.
Details
Motivation: Monitoring fish growth behavior is important for fish health in aquaculture and home aquariums, but traditional methods are invasive or inaccurate due to refractive distortions in aquarium environments. Image-based measurement offers a practical, non-invasive alternative.
Method: Uses YOLOv11-Pose network to detect fish and predict anatomical keypoints in stereo images. Implements refraction-aware epipolar constraint for robust matching, filters unreliable detections with learned quality score, and performs refraction-aware 3D triangulation to recover 3D keypoints for length measurement.
Result: Validated on a new stereo dataset of endangered Sulawesi ricefish under aquarium-like conditions. Demonstrated that filtering low-quality detections is essential for accurate length estimation.
Conclusion: The proposed system offers a simple and practical solution for non-invasive growth monitoring that can be easily applied in home aquariums.
Abstract: Monitoring fish growth behavior provides relevant information about fish health in aquaculture and home aquariums. Yet, monitoring fish sizes poses different challenges, as fish are small and subject to strong refractive distortions in aquarium environments. Image-based measurement offers a practical, non-invasive alternative that allows frequent monitoring without disturbing the fish. In this paper, we propose a non-invasive refraction-aware stereo vision method to estimate fish length in aquariums. Our approach uses a YOLOv11-Pose network to detect fish and predict anatomical keypoints on the fish in each stereo image. A refraction-aware epipolar constraint accounting for the air-glass-water interfaces enables robust matching, and unreliable detections are removed using a learned quality score. A subsequent refraction-aware 3D triangulation recovers 3D keypoints, from which fish length is measured. We validate our approach on a new stereo dataset of endangered Sulawesi ricefish captured under aquarium-like conditions and demonstrate that filtering low-quality detections is essential for accurate length estimation. The proposed system offers a simple and practical solution for non-invasive growth monitoring and can be easily applied in home aquariums.
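The refraction-aware geometry rests on Snell's law applied at each interface of the aquarium wall. A 2D sketch (angles measured from the interface normal, in radians; the refractive indices are typical textbook values, not calibrated ones from the paper):

```python
import math

def refract_angle(theta_in, n_in, n_out):
    """Snell's law: n_in * sin(theta_in) = n_out * sin(theta_out)."""
    return math.asin(n_in * math.sin(theta_in) / n_out)

def ray_into_tank(theta_air, n_air=1.0, n_glass=1.52, n_water=1.33):
    """Bend a camera ray through the air-glass-water stack. For flat,
    parallel interfaces the glass only offsets the ray laterally, and
    the final in-water angle depends on n_air and n_water alone."""
    theta_glass = refract_angle(theta_air, n_air, n_glass)
    return refract_angle(theta_glass, n_glass, n_water)
```

Triangulating with the uncorrected in-air angle would misplace the 3D keypoints, which is why a naive stereo pipeline systematically mis-estimates fish length.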
[246] What if? Emulative Simulation with World Models for Situated Reasoning
Ruiping Liu, Yufan Chen, Yuheng Zhang, Junwei Zheng, Kunyu Peng, Chengzhi Wu, Chenguang Huang, Di Wen, Jiaming Zhang, Kailun Yang, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: WanderDream is a dataset for mental exploration simulation that enables AI models to reason about spatial what-if questions without physical exploration, using panoramic videos and QA pairs across real-world scenes.
Details
Motivation: Real-world scenarios often prevent active exploration due to robot physical constraints or safety concerns for visually impaired users. The paper addresses the need for agents to mentally simulate future trajectories and answer spatial what-if questions given only limited observations.
Method: Created two components: WanderDream-Gen with 15.8K panoramic videos across 1,088 real scenes showing imagined trajectories from current to target viewpoints, and WanderDream-QA with 158K question-answer pairs covering starting states, paths, and end states along trajectories.
Result: Experiments show mental exploration is essential for situated reasoning, world models perform well on WanderDream-Gen, imagination facilitates reasoning on WanderDream-QA, and the dataset transfers effectively to real-world scenarios.
Conclusion: WanderDream enables emulative simulation of mental exploration for spatial reasoning without physical exploration, demonstrating practical applications for constrained environments and safety-critical scenarios.
Abstract: Situated reasoning often relies on active exploration, yet in many real-world scenarios such exploration is infeasible due to physical constraints of robots or safety concerns of visually impaired users. Given only a limited observation, can an agent mentally simulate a future trajectory toward a target situation and answer spatial what-if questions? We introduce WanderDream, the first large-scale dataset designed for the emulative simulation of mental exploration, enabling models to reason without active exploration. WanderDream-Gen comprises 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real-world captures, depicting imagined trajectories from current viewpoints to target situations. WanderDream-QA contains 158K question-answer pairs, covering starting states, paths, and end states along each trajectory to comprehensively evaluate exploration-based reasoning. Extensive experiments with world models and MLLMs demonstrate (1) that mental exploration is essential for situated reasoning, (2) that world models achieve compelling performance on WanderDream-Gen, (3) that imagination substantially facilitates reasoning on WanderDream-QA, and (4) that WanderDream data exhibit remarkable transferability to real-world scenarios. The source code and all data will be released.
[247] FALCON: Future-Aware Learning with Contextual Object-Centric Pretraining for UAV Action Recognition
Ruiqi Xian, Xiyang Wu, Tianrui Guan, Xijun Wang, Boqing Gong, Dinesh Manocha
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2409.18300 was rate-limited (HTTP 429), so the paper could not be analyzed for this issue.
Details
Abstract: Not retrieved (arXiv 2409.18300; HTTP 429 rate limit).
[248] CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang
Main category: cs.CV
TL;DR: CaTok introduces a 1D causal image tokenizer with MeanFlow decoder for autoregressive vision models, achieving state-of-the-art ImageNet reconstruction results.
Details
Motivation: Current visual tokenizers for autoregressive models either flatten 2D patches into non-causal sequences or use heuristic orderings that misalign with next-token prediction. Diffusion autoencoders also lack proper causality or have imbalance issues.
Method: Proposes CaTok with MeanFlow decoder that selects tokens over time intervals and binds them to MeanFlow objective, creating causal 1D representations. Also introduces REPA-A regularization to align encoder features with Vision Foundation Models for training stability.
Result: Achieves state-of-the-art ImageNet reconstruction: 0.75 FID, 22.53 PSNR, 0.674 SSIM with fewer training epochs. AR model performance comparable to leading approaches.
Conclusion: CaTok successfully creates a causal visual tokenizer that supports both fast one-step generation and high-fidelity multi-step sampling while capturing diverse visual concepts across token intervals.
Abstract: Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the “next-token prediction” pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR and 0.674 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.
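As an editorial aside, the core interval idea can be made concrete with a minimal sketch. This is not CaTok's actual scheme: the uniform partition of the time axis and the function names below are assumptions for illustration only.

```python
# Hypothetical sketch: partition the flow time axis [0, 1] into K intervals,
# one per 1D token, and let the decoder at time t condition only on tokens
# whose interval has already started. Early tokens then carry coarse content
# and later tokens refine it, giving a causal ordering aligned with
# next-token prediction.

def visible_tokens(tokens, t):
    """Causal prefix of tokens visible to the decoder at flow time t."""
    k = len(tokens)
    # token i covers the interval [i/k, (i+1)/k); it is visible once t >= i/k
    n_visible = min(int(t * k) + 1, k)
    return tokens[:n_visible]

tokens = list(range(8))               # 8 causal 1D tokens
prefix_start = visible_tokens(tokens, t=0.0)   # only the first token
prefix_mid = visible_tokens(tokens, t=0.5)     # first half of the sequence
prefix_end = visible_tokens(tokens, t=1.0)     # all tokens
```

Under this toy mapping, one-step generation corresponds to decoding from the full prefix, while multi-step sampling reveals tokens gradually as t advances.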
[249] Pinterest Canvas: Large-Scale Image Generation at Pinterest
Yu Wang, Eric Tzeng, Raymond Shiau, Jie Yang, Dmitry Kislyuk, Charles Rosenberg
Main category: cs.CV
TL;DR: Pinterest Canvas is a large-scale image generation system for image editing and enhancement tasks, using a foundational diffusion model fine-tuned for specific use cases like background enhancement and aspect-ratio outpainting.
Details
Motivation: While current image generation models are flexible, they lack precise control for strict product requirements, making them unsuitable for production use cases at Pinterest that need reliable, task-specific image editing capabilities.
Method: Train a foundational diffusion model on diverse multimodal data, then rapidly fine-tune specialized variants on task-specific datasets for individual use cases like background enhancement and aspect-ratio outpainting.
Result: Enhanced images achieved 18.0% and 12.5% engagement lifts in A/B tests, outperforming third-party models according to human raters. The approach also generalizes to multi-image scene synthesis and image-to-video generation.
Conclusion: Task-specific fine-tuning of a foundational diffusion model enables precise control for production image editing tasks, demonstrating significant engagement improvements and generalization to various downstream applications.
Abstract: While recent image generation models demonstrate a remarkable ability to handle a wide variety of image generation tasks, this flexibility makes them hard to control via prompting or simple inference adaptation alone, rendering them unsuitable for use cases with strict product requirements. In this paper, we introduce Pinterest Canvas, our large-scale image generation system built to support image editing and enhancement use cases at Pinterest. Canvas is first trained on a diverse, multimodal dataset to produce a foundational diffusion model with broad image-editing capabilities. However, rather than relying on one generic model to handle every downstream task, we instead rapidly fine-tune variants of this base model on task-specific datasets, producing specialized models for individual use cases. We describe key components of Canvas and summarize our best practices for dataset curation, training, and inference. We also showcase task-specific variants through case studies on background enhancement and aspect-ratio outpainting, highlighting how we tackle their specific product requirements. Online A/B experiments demonstrate that our enhanced images receive a significant 18.0% and 12.5% engagement lift, respectively, and comparisons with human raters further validate that our models outperform third-party models on these tasks. Finally, we showcase other Canvas variants, including multi-image scene synthesis and image-to-video generation, demonstrating that our approach can generalize to a wide variety of potential downstream tasks.
[250] Training Flow Matching: The Role of Weighting and Parameterization
Anne Gagneux, SégolÚne Martin, Rémi Gribonval, Mathurin Massias
Main category: cs.CV
TL;DR: Systematic analysis of training objectives for denoising-based generative models, examining loss weighting, output parameterization, and their interactions with data geometry, model architecture, and dataset size.
Details
Motivation: To understand the practical design choices in training flow matching models by disentangling factors like loss weighting and output parameterization, rather than proposing new methods.
Method: Systematic numerical study comparing noise-, clean-image-, and velocity-based formulations across synthetic datasets with controlled geometry and image data, using PSNR for denoising accuracy and FID for generative quality.
Result: Analysis reveals how training choices interact with intrinsic data dimensionality, model architecture, and dataset size, providing practical insights for flow matching model design.
Conclusion: The study offers empirical guidance on training objective selection for denoising-based generative models by systematically examining the interplay between various design factors.
Abstract: We study the training objectives of denoising-based generative models, with a particular focus on loss weighting and output parameterization, including noise-, clean image-, and velocity-based formulations. Through a systematic numerical study, we analyze how these training choices interact with the intrinsic dimensionality of the data manifold, model architecture, and dataset size. Our experiments span synthetic datasets with controlled geometry as well as image data, and compare training objectives using quantitative metrics for denoising accuracy (PSNR across noise levels) and generative quality (FID). Rather than proposing a new method, our goal is to disentangle the various factors that matter when training a flow matching model, in order to provide practical insights on design choices.
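As an editorial aside, the three output parameterizations compared here are algebraically interchangeable under the linear flow-matching path, so they differ in training only through implicit loss weighting. The sketch below (function names are ours, not the paper's) makes the conversions explicit:

```python
import numpy as np

# With the linear interpolation path x_t = (1 - t) * x0 + t * eps and
# velocity v = eps - x0, a prediction in any one parameterization (noise,
# clean image, or velocity) determines the other two given x_t and t.

def interpolate(x0, eps, t):
    """Linear flow-matching path between data x0 and noise eps."""
    return (1.0 - t) * x0 + t * eps

def velocity_from_pair(x0, eps):
    """Ground-truth velocity along the linear path."""
    return eps - x0

def x0_from_velocity(x_t, v_pred, t):
    """Clean-image estimate from a velocity prediction."""
    # x_t = (1 - t) x0 + t eps and v = eps - x0  =>  x0 = x_t - t * v
    return x_t - t * v_pred

def eps_from_velocity(x_t, v_pred, t):
    """Noise estimate from a velocity prediction."""
    # eps = x0 + v  =>  eps = x_t + (1 - t) * v
    return x_t + (1.0 - t) * v_pred

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))
eps = rng.normal(size=(4, 8))
t = 0.3
x_t = interpolate(x0, eps, t)
v = velocity_from_pair(x0, eps)
```

Because the conversions are exact, choosing a parameterization amounts to reweighting errors across noise levels, which is exactly the interaction with data geometry and dataset size that the paper studies.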
[251] FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
SebastiĂĄn Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Helen Oleynikova, Stefan Leutenegger
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2504.08603 was rate-limited (HTTP 429), so the paper could not be analyzed for this issue.
Details
Abstract: Not retrieved (arXiv 2504.08603; HTTP 429 rate limit).
[252] GreenRFM: Toward a resource-efficient radiology foundation model
Yingtai Li, Shuai Ming, Mingyue Zhao, Haoran Lai, Rongsheng Wang, Rui Zhou, Rundong Wang, Yujia Li, Wei Wei, Shaohua Kevin Zhou
Main category: cs.CV
TL;DR: GreenRFM is a resource-efficient pre-training framework for radiology foundation models that achieves state-of-the-art performance with dramatically reduced computational requirements through principled supervision design rather than brute-force scaling.
Details
Motivation: Current radiology foundation models rely on brute-force scaling that leads to brittle, expensive models in clinical practice. There's a need for resource-efficient approaches that prioritize precision over scale while maintaining robust generalization across diverse patient populations and imaging protocols.
Method: Proposes GreenRFM framework with principled supervision design called MUST (More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning) supervision that maximizes supervisory signals rather than simply scaling training data. Offers two configurations: performant model (24GB GPU, 24 hours) and lightweight model (6GB VRAM, 4 hours).
Result: Achieves superior performance on chest and abdominal CT datasets across public and private benchmarks, surpassing baseline models. Shows transferability between modalities (CT to musculoskeletal MRI). Reduces computational requirements by orders of magnitude while establishing new state-of-the-art.
Conclusion: GreenRFM challenges the “scale is all you need” dogma and democratizes development of state-of-the-art radiology foundation models, making them accessible even on consumer-grade hardware while maintaining clinical robustness.
Abstract: The development of radiology foundation models (RFMs) is hindered by a reliance on brute-force scaling. Existing approaches often directly translate methods for natural images, which prioritize scale over precision and hence lead to brittle and expensive models in clinical practice. To address this, we present a resource-efficient pre-training framework, GreenRFM, that achieves state-of-the-art performance. Our framework ensures robust generalization across diverse patient populations and imaging protocols, reducing computational requirements by orders of magnitude while surpassing complex, parameter-heavy models. These capabilities stem from principled supervision design that aims to maximally utilize supervisory signals via More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning (MUST) supervision, rather than simply piling up the quantity of training data. We offer two GreenRFM configurations: (i) a performant model that establishes a new state-of-the-art using a single 24GB GPU within 24 hours, and (ii) a lightweight model that matches existing benchmarks with 6GB VRAM in 4 hours. We conduct extensive experiments using over 200,000 images from four institutions and two modalities. GreenRFMs achieve superior performance on chest and abdominal CT datasets, regardless of public or private benchmark, surpassing a range of baseline models. In addition, the results on internal musculoskeletal MRI images show that the same supervision principles transfer between different modalities. Our performance and efficiency challenge the "scale is all you need" dogma and democratize the equitable development of state-of-the-art RFMs for clinicians even on a laptop.
[253] C^2Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning
Kunlun Xu, Yibo Feng, Jiangmeng Li, Yongsheng Qi, Jiahuan Zhou
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.19674 was rate-limited (HTTP 429), so the paper could not be analyzed for this issue.
Details
Abstract: Not retrieved (arXiv 2509.19674; HTTP 429 rate limit).
[254] Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching
Zhuorui Zhang, Roger PallarĂšs-LĂłpez, Praneeth Namburi, Brian W. Anthony
Main category: cs.CV
TL;DR: Match4Annotate: A lightweight framework for intra- and inter-video propagation of point and mask annotations using implicit neural representations and deformation fields for scalable annotation in specialized domains like medical imaging.
Details
Motivation: Per-frame video annotation is costly and slow in specialized domains like medical imaging. Existing label propagation methods have limitations: video trackers require per-video initialization and lack cross-video generalization, while correspondence pipelines struggle with low-texture scenes and lack spatiotemporal smoothness.
Method: Uses SIREN-based implicit neural representation fitted to DINOv3 features at test time to create continuous, high-resolution spatiotemporal feature fields. Learns smooth implicit deformation fields between frame pairs to guide correspondence matching for both point and mask annotations.
Result: Achieves state-of-the-art inter-video propagation on three challenging clinical ultrasound datasets, outperforming feature matching and one-shot segmentation baselines. Remains competitive with specialized trackers for intra-video propagation.
Conclusion: Lightweight, test-time-optimized feature matching pipelines offer efficient and accessible solutions for scalable annotation workflows in specialized domains.
Abstract: Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.
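For readers unfamiliar with SIREN-style coordinate networks, the kind Match4Annotate fits to DINOv3 feature maps, here is a minimal editorial sketch. Layer sizes, the omega_0 value, and the coordinate layout are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# A SIREN layer uses sine activations with frequency scaling omega_0 and a
# specialized weight initialization, which lets a small MLP represent
# high-frequency spatiotemporal signals from continuous coordinates.

class SirenLayer:
    def __init__(self, in_dim, out_dim, omega0=30.0, first=False, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        # SIREN init: uniform in +/- 1/in_dim for the first layer,
        # +/- sqrt(6/in_dim)/omega0 for hidden layers.
        bound = 1.0 / in_dim if first else np.sqrt(6.0 / in_dim) / omega0
        self.W = rng.uniform(-bound, bound, size=(in_dim, out_dim))
        self.b = np.zeros(out_dim)
        self.omega0 = omega0

    def __call__(self, x):
        return np.sin(self.omega0 * (x @ self.W + self.b))

# A tiny 2-layer SIREN mapping (x, y, t) coordinates to a feature vector;
# fitting it to a video's feature maps yields a continuous feature field
# that can be queried at any resolution.
layers = [SirenLayer(3, 64, first=True), SirenLayer(64, 64)]
coords = np.stack(np.meshgrid(np.linspace(-1, 1, 4),
                              np.linspace(-1, 1, 4),
                              np.linspace(-1, 1, 2),
                              indexing="ij"), axis=-1).reshape(-1, 3)
h = coords
for layer in layers:
    h = layer(h)
```

In the paper's setting, such a network would be optimized at test time so that its outputs match the (lower-resolution) backbone features at observed coordinates.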
[255] Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, Robin Rombach
Main category: cs.CV
TL;DR: Self-Flow: A self-supervised flow matching paradigm that integrates representation learning within generative models using dual-timestep scheduling to create information asymmetry, enabling strong semantic representations without external supervision across image, video, and audio modalities.
Details
Motivation: Current diffusion and flow models rely on external models for semantic representations, which require separate training, have misaligned objectives, and show unexpected scaling behavior. This dependence stems from the denoising objective providing little incentive to learn semantic representations internally.
Method: Introduces Self-Flow with Dual-Timestep Scheduling that applies heterogeneous noise levels across tokens, creating information asymmetry that forces the model to infer missing information from corrupted inputs. This integrates representation learning within the generative framework without external supervision.
Result: Achieves superior image, video, and audio generation while following expected scaling laws. The method generalizes across modalities and enables multi-modal training with strong semantic representations learned alongside generative capabilities.
Conclusion: Self-Flow provides a unified framework for learning semantic representations within generative models, eliminating dependence on external models and enabling better scaling and multi-modal capabilities across vision and audio domains.
Abstract: Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model’s training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
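The dual-timestep mechanism is easy to sketch. The following is an editorial illustration of per-token noising only: the uniform timestep distribution, linear noise path, and array shapes are our assumptions, not the paper's exact recipe.

```python
import numpy as np

# Dual-timestep idea: each token is corrupted with its own independently
# sampled noise level, so lightly-noised tokens must supply the semantic
# information that heavily-noised tokens have lost. This asymmetry is what
# pushes the model to learn representations, not just denoise.

def dual_timestep_noise(tokens, rng):
    """Corrupt each token with an independently sampled timestep t in [0, 1]."""
    n_tokens, dim = tokens.shape
    t = rng.uniform(0.0, 1.0, size=(n_tokens, 1))   # per-token timestep
    eps = rng.normal(size=tokens.shape)
    x_t = (1.0 - t) * tokens + t * eps              # linear flow path per token
    return x_t, t, eps

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))                  # 16 tokens, 32 channels
x_t, t, eps = dual_timestep_noise(tokens, rng)
```

Standard flow matching is the special case where `t` is a single scalar shared by all tokens; sampling it per token creates the information asymmetry the abstract describes.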
[256] SCAN: Visual Explanations with Self-Confidence and Analysis Networks
Gwanghee Lee, Sungyoon Jeong, Kyoungson Jhang
Main category: cs.CV
TL;DR: SCAN is a universal XAI framework that generates high-resolution self-confidence maps for both CNNs and Transformers using AutoEncoder-based feature reconstruction guided by Information Bottleneck principle.
Details
Motivation: Current XAI methods face a trade-off between fidelity (architecture-specific methods) and applicability (universal methods), resulting in abstract or fragmented explanations and difficulty comparing explanations across different model families like CNNs and Transformers.
Method: SCAN uses an AutoEncoder-based approach to reconstruct features from intermediate layers of models, guided by the Information Bottleneck principle, to generate high-resolution Self-Confidence Maps that identify information-rich regions.
Result: SCAN consistently achieves outstanding performance on various quantitative metrics (AUC-D, Negative AUC, Drop%, Win%) and produces significantly clearer, object-focused explanations than existing methods across diverse architectures and datasets.
Conclusion: SCAN provides a unified framework that is both architecturally universal and highly faithful, enhancing model transparency and offering a more reliable tool for understanding complex neural network decision-making processes.
Abstract: Explainable AI (XAI) has become essential in computer vision to make the decision-making processes of deep learning models transparent. However, current visual explanation methods face a critical trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones. This often results in abstract or fragmented explanations and makes it difficult to compare explanatory power across diverse model families, such as CNNs and Transformers. This paper introduces the Self-Confidence and Analysis Networks (SCAN), a novel universal framework that overcomes these limitations for both convolutional neural network and transformer architectures. SCAN utilizes an AutoEncoder-based approach to reconstruct features from a model’s intermediate layers. Guided by the Information Bottleneck principle, it generates a high-resolution Self-Confidence Map that identifies information-rich regions. Extensive experiments on diverse architectures and datasets demonstrate that SCAN consistently achieves outstanding performance on various quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitatively, it produces significantly clearer, object-focused explanations than existing methods. By providing a unified framework that is both architecturally universal and highly faithful, SCAN enhances model transparency and offers a more reliable tool for understanding the decision-making processes of complex neural networks.
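As an editorial illustration of the general reconstruction-to-confidence idea, consider the toy sketch below. Everything here is an assumption for exposition: the linear autoencoder, the shapes, and the error-to-confidence normalization are ours, not SCAN's architecture.

```python
import numpy as np

# Idea: reconstruct intermediate features through a compressed bottleneck;
# locations whose features survive compression well are treated as
# information-rich, yielding a per-location confidence map.

def confidence_map(features, encode, decode):
    """Per-location confidence from bottlenecked reconstruction quality."""
    recon = decode(encode(features))                 # (H, W, C)
    err = np.linalg.norm(features - recon, axis=-1)  # per-location error
    err = (err - err.min()) / (err.max() - err.min() + 1e-8)
    return 1.0 - err                                 # low error -> high confidence

rng = np.random.default_rng(0)
H, W, C, bottleneck = 8, 8, 16, 4
feats = rng.normal(size=(H, W, C))
E = rng.normal(size=(C, bottleneck)) * 0.3           # toy linear encoder
D = rng.normal(size=(bottleneck, C)) * 0.3           # toy linear decoder
conf = confidence_map(feats, lambda f: f @ E, lambda z: z @ D)
```

Because it only requires access to intermediate features, a scheme of this shape applies equally to CNN and Transformer backbones, which is the universality the paper targets.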
[257] AV-Unified: A Unified Framework for Audio-visual Scene Understanding
Guangyao Li, Xin Wang, Wenwu Zhu
Main category: cs.CV
TL;DR: AV-Unified is a unified framework for joint learning across diverse audio-visual scene understanding tasks by standardizing inputs/outputs into token sequences and using multi-scale spatiotemporal perception networks.
Details
Motivation: Current audio-visual research explores tasks individually, making it challenging to comprehensively understand complex scenes and explore inter-task relationships. Humans naturally integrate multiple audio-visual tasks in dynamic real-world scenes.
Method: 1) Standardizes diverse input-output formats by converting tasks into sequences of discrete tokens for shared representation; 2) Multi-scale temporal perception module to capture varying temporal granularity; 3) Cross-modal guidance-based spatial perception module to model spatial audio-visual associations; 4) Task-specific text prompts for adaptability.
Result: Extensive experiments on benchmark datasets (AVE, LLP, MUSIC-AVQA, VGG-SS, AVS) demonstrate effectiveness across temporal, spatial, and spatiotemporal audio-visual tasks.
Conclusion: AV-Unified provides a comprehensive framework for joint audio-visual scene understanding, enabling exploration of inter-task relationships and more holistic perception of complex audio-visual scenes.
Abstract: When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, tasks such as event localization, parsing, segmentation and question answering are mostly explored individually in current work, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model’s adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.
[258] SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability
Ali Nasiri-Sarvi, Hassan Rivaz, Mahdi S. Hosseini
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2507.06265 was rate-limited (HTTP 429), so the paper could not be analyzed for this issue.
Details
Abstract: Not retrieved (arXiv 2507.06265; HTTP 429 rate limit).
[259] Spatial Calibration of Diffuse LiDARs
Nikhil Behari, Ramesh Raskar
Main category: cs.CV
TL;DR: A spatial calibration method for diffuse direct time-of-flight LiDARs that estimates per-pixel footprints and relative spatial sensitivity in co-located RGB images using scanned retroreflective patches.
Details
Motivation: Diffuse direct time-of-flight LiDARs violate the single-ray assumption of standard LiDAR-RGB calibration by aggregating photon returns over wide fields of view, creating a need for spatial calibration that accounts for per-pixel footprints and sensitivity distributions.
Method: Uses a scanned retroreflective patch with background subtraction to recover per-pixel response maps that provide explicit LiDAR-to-RGB correspondence, enabling cross-modal alignment and fusion.
Result: Demonstrated the calibration method on the ams OSRAM TMF8828 LiDAR sensor, showing successful recovery of per-pixel response maps for spatial calibration.
Conclusion: The proposed spatial calibration procedure addresses the limitations of standard calibration methods for diffuse LiDARs and enables accurate cross-modal alignment between LiDAR and RGB data.
Abstract: Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, violating the single-ray assumption behind standard LiDAR-RGB calibration. We present a simple spatial calibration procedure that estimates, for each diffuse LiDAR pixel, its footprint (effective support region) and relative spatial sensitivity in a co-located RGB image plane. Using a scanned retroreflective patch with background subtraction, we recover per-pixel response maps that provide an explicit LiDAR-to-RGB correspondence for cross-modal alignment and fusion. We demonstrate the method on the ams OSRAM TMF8828.
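The scanned-patch procedure admits a compact editorial sketch. The signal model, grid size, and normalization below are illustrative assumptions, not the paper's calibration pipeline.

```python
import numpy as np

# Idea: as a small retroreflective patch is scanned across known positions in
# the RGB image plane, the increase in a diffuse LiDAR pixel's return over
# the patch-free background, deposited at the patch's location, accumulates
# into that pixel's spatial response map (footprint + relative sensitivity).

def response_map(signal_with_patch, background, patch_xy, hw):
    """Accumulate per-position excess signal into an RGB-plane response map."""
    h, w = hw
    resp = np.zeros((h, w))
    for s, (x, y) in zip(signal_with_patch, patch_xy):
        resp[y, x] += max(s - background, 0.0)   # excess return over background
    total = resp.sum()
    return resp / total if total > 0 else resp   # normalize to a sensitivity map

# Toy example: the patch is scanned over a 4x4 grid of image positions, and
# the LiDAR pixel responds most strongly when the patch sits near the
# footprint center at (x=1, y=2).
positions = [(x, y) for y in range(4) for x in range(4)]
signals = [1.0 + 5.0 * np.exp(-((x - 1) ** 2 + (y - 2) ** 2)) for x, y in positions]
resp = response_map(signals, background=1.0, patch_xy=positions, hw=(4, 4))
```

The recovered map plays the role of a per-pixel point-spread function, replacing the single-ray correspondence that standard LiDAR-RGB calibration assumes.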
[260] NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion
Taewon Kang, Ming C. Lin
Main category: cs.CV
TL;DR: A training-free method for modeling linguistic negation in diffusion models by treating negation as structured feasibility constraints on semantic guidance, enabling robust handling of diverse negation phenomena without retraining.
Details
Motivation: Negation is a fundamental linguistic operator that remains inadequately modeled in diffusion-based generative systems, with current approaches lacking formal treatment and struggling with diverse negation phenomena.
Method: Reinterprets classifier-free guidance as defining semantic update directions and enforces negation by projecting updates onto convex constraint sets derived from linguistic structure, creating a unified framework for handling various negation types.
Result: Achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
Conclusion: Provides a formal, training-free approach to modeling linguistic negation in diffusion models that handles diverse negation phenomena and extends to video generation, with a benchmark suite for further research.
Abstract: Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
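To illustrate the projection idea in its simplest form, here is an editorial sketch using one candidate convex set: the halfspace of updates with non-positive component along a "negated concept" direction. NEGATE's actual constraint sets are richer; the direction vector and guidance scale here are assumptions.

```python
import numpy as np

# Classifier-free guidance defines a semantic update direction; projecting it
# onto a convex constraint set removes any component that pushes generation
# toward the negated concept, without retraining the model.

def cfg_update(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance direction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def project_halfspace(update, n):
    """Euclidean projection onto {u : <u, n> <= 0} for direction n."""
    n = n / np.linalg.norm(n)
    coef = update @ n
    return update - max(coef, 0.0) * n   # subtract only the forbidden component

rng = np.random.default_rng(0)
eps_c, eps_u = rng.normal(size=64), rng.normal(size=64)
n = rng.normal(size=64)                  # hypothetical negated-concept direction
u = cfg_update(eps_c, eps_u, scale=7.5)
u_proj = project_halfspace(u, n)
```

Because projection onto a convex set is well defined and training-free, the same template extends to intersections of constraints, which is how compositions of multiple negations could be handled.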
[261] ExDD: Explicit Dual Distribution Learning for Surface Defect Detection via Diffusion Synthesis
Muhammad Aqeel, Federico Leonardi, Francesco Setti
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2507.15335 was rate-limited (HTTP 429), so the paper could not be analyzed for this issue.
Details
Abstract: Not retrieved (arXiv 2507.15335; HTTP 429 rate limit).
[262] SurgFormer: Scalable Learning of Organ Deformation with Resection Support and Real-Time Inference
Ashkan Shahbazi, Elaheh Akbari, Kyvia Pereira, Jon S. Heiselman, Annie C. Benson, Garrison L. H. Johnston, Jie Ying Wu, Nabil Simaan, Michael I. Miga, Soheil Kolouri
Main category: cs.CV
TL;DR: SurgFormer is a multiresolution gated transformer for real-time soft tissue simulation on volumetric meshes, trained on solver data to predict displacement fields, with special handling for surgical cuts through learned embeddings.
Details
Motivation: High-fidelity biomechanical solvers are too slow for interactive surgical simulation, so there's a need for data-driven surrogates that can predict tissue deformation in real-time while handling topology changes like surgical cuts.
Method: Uses a multiresolution gated transformer architecture with fixed mesh hierarchy, multibranch blocks combining local message passing, coarse global self-attention, and feedforward updates, fused by learned gates. For cuts, encodes resection information as learned embeddings for unified handling of standard and topology-altering cases.
Result: Achieves strong accuracy with favorable efficiency compared to diverse baselines, making it practical for both standard deformation prediction and cut-conditioned simulation tasks. Introduces two surgical simulation datasets with XFEM-based supervision.
Conclusion: SurgFormer provides an effective backbone for real-time surgical simulation that handles both standard tissue deformation and surgical cuts within a unified volumetric pipeline, with potential applications in surgical training and planning.
Abstract: We introduce SurgFormer, a multiresolution gated transformer for data-driven soft tissue simulation on volumetric meshes. High-fidelity biomechanical solvers are often too costly for interactive use, so we train SurgFormer on solver-generated data to predict nodewise displacement fields at near-real-time rates. SurgFormer builds a fixed mesh hierarchy and applies repeated multibranch blocks that combine local message passing, coarse global self-attention, and pointwise feedforward updates, fused by learned per-node, per-channel gates to adaptively integrate local and long-range information while remaining scalable on large meshes. For cut-conditioned simulation, resection information is encoded as a learned cut embedding and provided as an additional input, enabling a unified model for both standard deformation prediction and topology-altering cases. We also introduce two surgical simulation datasets generated under a unified protocol with XFEM-based supervision: a cholecystectomy resection dataset and an appendectomy manipulation and resection dataset with cut and uncut cases. To our knowledge, this is the first learned volumetric surrogate setting to study XFEM-supervised cut-conditioned deformation within the same volumetric pipeline as standard deformation prediction. Across diverse baselines, SurgFormer achieves strong accuracy with favorable efficiency, making it a practical backbone for both tasks. Code, data, and project page: https://mint-vu.github.io/SurgFormer/
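The learned per-node, per-channel gating that fuses the local and global branches can be sketched as a sigmoid blend. Shapes and names here are illustrative assumptions, not SurgFormer's implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(local_feats, global_feats, gate_logits):
    """Blend local and global branch outputs with a learned per-node,
    per-channel gate: out = g * local + (1 - g) * global.
    Each argument: list of nodes, each node a list of channel values."""
    out = []
    for loc, glo, logits in zip(local_feats, global_feats, gate_logits):
        out.append([sigmoid(w) * a + (1.0 - sigmoid(w)) * b
                    for a, b, w in zip(loc, glo, logits)])
    return out

# One node, two channels; saturated logits pick one branch per channel.
fused = gated_fuse([[1.0, 1.0]], [[0.0, 0.0]], [[100.0, -100.0]])
print(fused)  # ~[[1.0, 0.0]]: channel 0 keeps local, channel 1 keeps global
```

In the real model the gate logits would themselves be produced by a learned layer, so the network can adaptively weight local against long-range information at every node.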
[263] MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing
Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, Xiande Huang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2508.01653.
[264] Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving
Yuhan Zhou, Mehri Sattari, Haihua Chen, Kewei Sha
Main category: cs.CV
TL;DR: Analyzes redundancy as a data quality issue in autonomous vehicle datasets, showing that removing redundant labels from multisource camera data improves object detection performance.
Details
Motivation: Autonomous vehicles rely on multisource multimodal data, but data quality varies across sources and modalities. While algorithm design has been prioritized, data quality analysis - particularly redundancy - has been underexplored despite its potential impact on AV performance.
Method: Models and measures redundancy in multisource camera data and multimodal image-LiDAR data using nuScenes and Argoverse 2 datasets. Evaluates how removing redundant labels affects the YOLOv8 object detection task, focusing on cameras with shared fields of view.
Result: Selectively removing redundant multisource image object labels improves detection: in nuScenes, mAP50 gains from 0.66 to 0.70, 0.64 to 0.67, and 0.53 to 0.55 on three overlap regions. In AV2, 4.1-8.6% of labels removed while mAP50 stays near baseline. Substantial redundancy found between image and LiDAR data.
Conclusion: Redundancy is a measurable and actionable data quality factor with direct implications for AV performance. Highlights the role of redundancy in AV perception and motivates a data-centric perspective for evaluating and improving AV datasets.
Abstract: Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In practice, data quality (DQ) varies across sources and modalities due to environmental conditions and sensor limitations, yet AV research has largely prioritized algorithm design over DQ analysis. This work focuses on redundancy as a fundamental but underexplored DQ issue in AV datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, we model and measure redundancy in multisource camera data and multimodal image-LiDAR data, and evaluate how removing redundant labels affects the YOLOv8 object detection task. Experimental results show that selectively removing redundant multisource image object labels from cameras with shared fields of view improves detection. In nuScenes, mAP$_{50}$ improves from $0.66$ to $0.70$, from $0.64$ to $0.67$, and from $0.53$ to $0.55$ on three representative overlap regions, while detection on other overlapping camera pairs remains at the baseline even under stronger pruning. In AV2, $4.1$-$8.6\%$ of labels are removed, and mAP$_{50}$ stays near the $0.64$ baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data. These findings demonstrate that redundancy is a measurable and actionable DQ factor with direct implications for AV performance. This work highlights the role of redundancy as a data quality factor in AV perception and motivates a data-centric perspective for evaluating and improving AV datasets. Code, data, and implementation details are publicly available at: https://github.com/yhZHOU515/RedundancyAD
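The pruning idea, dropping a duplicate object label when two cameras with a shared field of view annotate the same object, can be sketched with a simple IoU test. This is an illustrative toy, not the paper's pipeline: the function names, the 0.7 threshold, and the assumption that boxes are already mapped into a shared frame are all hypothetical:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def drop_redundant(labels_cam_a, labels_cam_b, thresh=0.7):
    """Keep camera A's labels; drop camera B labels that overlap one of A's
    (assumed already mapped into a shared frame) above `thresh`."""
    kept = [b for b in labels_cam_b
            if all(iou(a, b) < thresh for a in labels_cam_a)]
    return labels_cam_a + kept

merged = drop_redundant([(0, 0, 10, 10)], [(1, 1, 10, 10), (50, 50, 60, 60)])
print(len(merged))  # 2: the near-duplicate box is removed, the distinct one kept
```

The paper's result is that training on such a deduplicated label set can improve, rather than hurt, detection in the overlap regions.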
[265] SGDFuse: SAM-Guided Diffusion Model for High-Fidelity Infrared and Visible Image Fusion
Xiaoyang Zhang, Jinjiang Li, Guodong Fan, Yakun Ju, Linwei Fan, Jun Liu, Alex C. Kot
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2508.05264.
[266] EgoCogNav: Cognition-aware Human Egocentric Navigation
Zhiwen Qiu, Ziang Liu, Wenqian Niu, Tapomayukh Bhattacharjee, Saleh Kalantari
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2511.17581.
[267] EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking
Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel
Main category: cs.CV
TL;DR: EgoReasoner: A two-stage framework for egocentric 4D reasoning tasks using task-adaptive thinking templates and reinforcement learning with task-aware rewards.
Details
Motivation: Egocentric video understanding is complex due to dynamic 4D environments with camera motion and object displacements. Existing task-agnostic approaches (generic Chain-of-Thought, uniform reinforcement learning) are insufficient for diverse egocentric reasoning tasks requiring different cognitive operations like spatial anchoring, temporal tracking, and duration reasoning.
Method: Two-stage framework: 1) Task-Adaptive Thinking Templates guide synthesis of structured CoT traces for supervised fine-tuning, teaching adaptive reasoning across task types. 2) Task-aware reward functions verify entity grounding, temporal alignment, and logical consistency, selectively strengthening reasoning pathways via reinforcement fine-tuning with GRPO.
Result: 3B-parameter model trained on only 16K samples achieves 37.5% average accuracy on HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
Conclusion: Aligning reasoning scaffolds and reward signals to each task’s cognitive structure enables effective egocentric 4D reasoning with limited data, outperforming larger models with task-agnostic approaches.
Abstract: Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task’s cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
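The second-stage reinforcement step uses GRPO, whose core is a group-relative advantage over sampled responses; a minimal sketch follows. The `task_reward` weighting (0.3/0.3/0.4 over grounding, temporal alignment, and answer correctness) is a hypothetical stand-in for the paper's task-aware rewards:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the mean and std of its group."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]

# Hypothetical task-aware reward combining three verifiable checks.
def task_reward(grounding_ok, temporal_ok, answer_ok):
    return 0.3 * grounding_ok + 0.3 * temporal_ok + 0.4 * answer_ok

group = [task_reward(1, 1, 1), task_reward(1, 0, 1), task_reward(0, 0, 0)]
adv = grpo_advantages(group)
print(adv)  # best rollout gets the largest positive advantage
```

Because the checks are verifiable per task type, the reward (and hence the advantage) selectively strengthens the reasoning pathway that each task actually needs.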
[268] LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, Daniele De Martini
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.11512.
[269] Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang
Main category: cs.CV
TL;DR: Penguin-VL introduces a compact VLM that replaces contrastive-pretrained vision encoders with text-only LLM initialization, achieving better visual fidelity and performance on dense perception tasks while being efficient for mobile/edge deployment.
Details
Motivation: Current VLMs rely on large models and contrastive-pretrained vision encoders (CLIP/SigLIP) that suppress fine-grained visual cues needed for dense captioning and complex reasoning, making them unsuitable for compute-constrained mobile/edge devices.
Method: Penguin-VL uses a vision encoder initialized from a text-only LLM instead of contrastive-pretrained encoders, addressing the objective mismatch where contrastive learning enforces coarse invariances that suppress fine-grained visual details.
Result: Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in document understanding, visual knowledge, and multi-perspective video understanding, with a lightweight architecture.
Conclusion: Improved visual representation rather than model scaling is the primary driver of VLM performance, and Penguin-Encoder serves as a superior drop-in alternative to contrastive-pretrained encoders for compute-efficient VLMs in resource-constrained settings.
Abstract: Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
[270] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu
Main category: cs.CV
TL;DR: Omni-Diffusion is a novel any-to-any multimodal language model built entirely on mask-based discrete diffusion models, unifying understanding and generation across text, speech, and images.
Details
Motivation: Current multimodal LLMs predominantly use autoregressive architectures, leaving room for exploring effective alternatives. Discrete diffusion models have shown promise in visual understanding and generation, suggesting potential as a backbone for multimodal systems.
Method: Uses a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. Supports bimodal tasks and complex multi-modal scenarios across text, speech, and images.
Result: Outperforms or performs on par with existing multimodal systems processing two or more modalities on diverse benchmarks.
Conclusion: Demonstrates significant promise of diffusion models in powering next-generation multimodal foundation models, offering an effective alternative to autoregressive architectures.
Abstract: While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
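The mask-based discrete diffusion backbone can be illustrated at token level: corrupt a sequence by masking a fraction of positions, then iteratively fill them back in with the model's predictions. The sketch below uses an oracle `predict` in place of a trained denoiser; the schedule and all names are illustrative assumptions, not Omni-Diffusion's implementation:

```python
import random

MASK = "<mask>"

def forward_mask(tokens, ratio, rng):
    """Corrupt a token sequence by masking a random fraction of positions."""
    idx = set(rng.sample(range(len(tokens)), int(ratio * len(tokens))))
    return [MASK if i in idx else t for i, t in enumerate(tokens)], sorted(idx)

def reverse_unmask(masked, predict, steps=4):
    """Iteratively fill masked positions, a few per step, using `predict`
    (a stand-in for the denoiser scoring each masked position)."""
    seq = list(masked)
    while MASK in seq:
        holes = [i for i, t in enumerate(seq) if t == MASK]
        for i in holes[:max(1, len(holes) // steps)]:
            seq[i] = predict(seq, i)
        steps = max(1, steps - 1)
    return seq

rng = random.Random(0)
masked, _ = forward_mask(list("multimodal"), 0.5, rng)
restored = reverse_unmask(masked, predict=lambda s, i: "multimodal"[i])
print("".join(restored))  # oracle denoiser recovers "multimodal"
```

In the any-to-any setting, text, speech, and image tokens would share one discrete vocabulary, so the same mask-and-unmask loop covers both understanding and generation.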
[271] CanvasMAR: Improving Masked Autoregressive Video Prediction With Canvas
Zian Li, Muhan Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.13669.
[272] Multimodal Large Language Models as Image Classifiers
Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas
Main category: cs.CV
TL;DR: MLLM classification performance depends heavily on evaluation protocols and ground truth quality, with reported underperformance often being an artifact of flawed evaluation rather than genuine model deficiency.
Details
Motivation: There are conflicting conclusions in studies comparing MLLMs with supervised and vision-language models, which stem from evaluation protocols that either inflate or underestimate performance. The research aims to identify and fix key issues in common evaluation protocols.
Method: The authors identify and fix key issues in common evaluation protocols: handling model outputs outside class lists, addressing inflated results from weak multiple-choice distractors, and improving open-world output mapping. They quantify the impact of design choices (batch size, image ordering, text encoder selection) and evaluate on ReGT, a multilabel reannotation of 625 ImageNet-1k classes.
Result: MLLMs benefit most from corrected labels (up to +10.8% improvement), substantially narrowing the perceived gap with supervised models. Models less reliant on supervised training signals are most sensitive to annotation quality. MLLMs can assist human annotators, with annotators confirming or integrating MLLM predictions in ~50% of difficult cases.
Conclusion: Much of the reported MLLM underperformance on classification is an artifact of noisy ground truth and flawed evaluation protocols rather than genuine model deficiency. MLLMs show potential for large-scale dataset curation by assisting human annotators.
Abstract: Multimodal Large Language Model (MLLM) classification performance depends critically on the evaluation protocol and ground-truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocols rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.
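One of the protocol fixes, mapping free-form MLLM outputs onto the class list instead of discarding them, can be sketched with simple string matching. The cascade (exact, substring, fuzzy) and the 0.6 cutoff are illustrative assumptions, not the authors' exact mapper:

```python
import difflib

def map_to_class(output, class_names, cutoff=0.6):
    """Map a free-form model answer onto a fixed class list instead of
    discarding it: exact match first, then substring, then fuzzy match."""
    text = output.strip().lower()
    classes = {c.lower(): c for c in class_names}
    if text in classes:
        return classes[text]
    for lc, c in classes.items():
        if lc in text or text in lc:
            return c
    close = difflib.get_close_matches(text, classes, n=1, cutoff=cutoff)
    return classes[close[0]] if close else None

labels = ["tabby cat", "tiger", "Egyptian cat"]
print(map_to_class("It looks like a tabby cat.", labels))  # substring match
print(map_to_class("tabby kat", labels))                   # fuzzy match
```

Answers that would previously be scored as wrong simply for falling outside the class list are thus recovered, which is one way a protocol can underestimate MLLM performance.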
[273] Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Xiang An, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.18632.
[274] The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models
Maria-Teresa De Rosa Palmini, Eva Cetinic
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2511.11435.
[275] Exploiting Spatiotemporal Properties for Efficient Event-Driven Human Pose Estimation
Haoxian Zhou, Chuanzhi Xu, Langyi Chen, Pengfei Ye, Haodong Chen, Yuk Ying Chung, Qiang Qu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2512.06306.
[276] CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion
Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2512.19535.
[277] SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA
Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2601.14895.
[278] RBF Weighted Hyper-Involution for RGB-D Object Detection
Mehfuz A Rahman, Khushal Das, Jiju Poovvancheri, Neil London, Dong Chen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2310.00342.
[279] Fuse4Seg: Image Fusion for Multi-Modal Medical Segmentation via Bi-level Optimization
Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2409.10328.
[280] AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior
Guoqiang Liang, Qingnan Fan, Bingtao Fu, Jinwei Chen, Hong Gu, Lin Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2410.09864.
[281] PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Shangkun Sun, Ruyang Liu, Haoran Tang, Yixiao Ge, Haibo Lu, Jiankun Yang, Chen Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2411.02327.
[282] PoI: A Filter to Extract Pixel of Interest from Novel Views for Scene Coordinate Regression
Feifei Li, Qi Song, Chi Zhang, Hui Shuai, Rui Huang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2502.04843.
[283] Escaping The Big Data Paradigm in Self-Supervised Representation Learning
Carlos VĂ©lez GarcĂa, Miguel Cazorla, Jorge Pomares
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2502.18056.
[284] NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers
Yuhang Ma, Bo Cheng, Shanyuan Liu, Hongyi Zhou, Liebucha Wu, Dawei Leng, Yuhui Yin
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2503.09242.
[285] EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis
Matthew Massey, Nusrat Munia, Abdullah-Al-Zubaer Imran
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2503.15625.
[286] Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2603.01034 was rate limited (HTTP 429).
[287] DVD-Quant: Data-free Video Diffusion Transformers Quantization
Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Haotong Qin, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2505.18663 was rate limited (HTTP 429).
[288] Alchemist: Turning Public Text-to-Image Data into Generative Gold
Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2505.19297 was rate limited (HTTP 429).
[289] Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2603.02175 was rate limited (HTTP 429).
[290] Instance Data Condensation for Image Super-Resolution
Tianhao Peng, Ho Man Kwan, Yuxuan Jiang, Ge Gao, Fan Zhang, Xiaozhong Xu, Shan Liu, David Bull
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2505.21099 was rate limited (HTTP 429).
[291] Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion
Haoran Lu, Shang Wu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2603.03485 was rate limited (HTTP 429).
[292] VisualPrompter: Semantic-Aware Prompt Optimization with Visual Feedback for Text-to-Image Synthesis
Shiyu Wu, Mingzhen Sun, Weining Wang, Yequan Wang, Jing Liu
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2506.23138 was rate limited (HTTP 429).
[293] SPoT: Subpixel Placement of Tokens in Vision Transformers
Martine Hjelkrem-Tan, Marius Aasan, Gabriel Y. Arteaga, AdĂn RamĂrez Rivera
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2507.01654 was rate limited (HTTP 429).
[294] Token Bottleneck: One Token to Remember Dynamics
Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2507.06543 was rate limited (HTTP 429).
[295] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation
X. Feng, H. Yu, M. Wu, S. Hu, J. Chen, C. Zhu, J. Wu, X. Chu, K. Huang
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2507.11245 was rate limited (HTTP 429).
[296] Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space
M.Hadi Sepanj, Benyamin Ghojogh, Saed Moradi, Paul Fieguth
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2509.07289 was rate limited (HTTP 429).
[297] Gaussian Set Surface Reconstruction through Per-Gaussian Optimization
Zhentao Huang, Di Wu, Zhenbang He, Minglun Gong
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2507.18923 was rate limited (HTTP 429).
[298] DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model
Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2508.13238 was rate limited (HTTP 429).
[299] SSL-SLR: Self-Supervised Representation Learning for Sign Language Recognition
Ariel Basso Madjoukeng, JérÎme Fink, Pierre Poitier, Edith Belise Kenmogne, Benoit Frenay
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2509.05188 was rate limited (HTTP 429).
[300] RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentanglement
Yihong Leng, Siming Zheng, Jinwei Chen, Bo Li, Jiaojiao Li, Peng-Tao Jiang
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2509.05554 was rate limited (HTTP 429).
[301] DeCLIP: Decoupled Prompting for CLIP-based Multi-Label Class-Incremental Learning
Kaile Du, Zihan Ye, Junzhou Xie, Yixi Shen, Yuyang Li, Fuyuan Hu, Ling Shao, Guangcan Liu, Joost van de Weijer, Fan Lyu
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2509.23335 was rate limited (HTTP 429).
[302] Beyond Flat Unknown Labels in Open-World Object Detection
Yuchen Zhang, Yao Lu, Johannes Betz
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2510.09173 was rate limited (HTTP 429).
[303] Culture in Action: Evaluating Text-to-Image Models through Social Activities
Sina Malakouti, Boqing Gong, Adriana Kovashka
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2511.05681 was rate limited (HTTP 429).
[304] Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
Feng Ding, Wenhui Yi, Yunpeng Zhou, Xinan He, Hong Rao, Shu Hu
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2511.10150 was rate limited (HTTP 429).
[305] LaxMotion: Rethinking Supervision Granularity for 3D Human Motion Generation
Sheng Liu, Yuanzhi Liang, Sidan Du
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2511.11368 was rate limited (HTTP 429).
[306] SPARK: Jailbreaking T2V Models by Synergistically Prompting Auditory and Recontextualized Knowledge
Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang, Quanchen Zou, Zonglei Jing, Aishan Liu, Xianglong Liu
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2511.13127 was rate limited (HTTP 429).
[307] FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle
Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, Danda Pani Paudel
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2511.17171 was rate limited (HTTP 429).
[308] MRIQT: Physics-Aware Diffusion Model for Image Quality Transfer in Neonatal Ultra-Low-Field MRI
Malek Al Abed, Sebiha Demir, Anne Groteklaes, Elodie Germani, Shahrooz Faghihroohi, Hemmen Sabir, Shadi Albarqouni
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2511.13232 was rate limited (HTTP 429).
[309] FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI
Luisa Gallée, Yiheng Xiong, Meinrad Beer, Michael Götz
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2511.15481 was rate limited (HTTP 429).
[310] DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection
Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.07078 was rate limited (HTTP 429).
[311] UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification
Taixi Chen, Jingyun Chen, Nancy Guo
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2511.17355 was rate limited (HTTP 429).
[312] Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts
Madhav Gupta, Vishak Prasad C, Ganesh Ramakrishnan
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.08445 was rate limited (HTTP 429).
[313] SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis
Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An, Yebin Liu, Qingyao Wu
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2511.19319 was rate limited (HTTP 429).
[314] A Novel Patch-Based TDA Approach for Computed Tomography Imaging
Dashti A. Ali, Aras T. Asaad, Jacob J. Peoples, Mohammad Hamghalam, Natalie Gangai, Richard K. G. Do, Alice C. Wei, Amber L. Simpson
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.12108 was rate limited (HTTP 429).
[315] Reversible Inversion for Training-Free Exemplar-guided Image Editing
Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.01382 was rate limited (HTTP 429).
[316] A method for tissue-mask supported whole-body image registration in the UK Biobank
Yasemin Utkueri, Elin Lundström, HĂ„kan Ahlström, Johan Ăfverstedt, Joel Kullberg
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.02702 was rate limited (HTTP 429).
[317] UniTS: Unified Spatio-Temporal Generative Model for Remote Sensing
Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, Xiang-Gen Xia
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.04461 was rate limited (HTTP 429).
[318] Fast-BEV++: Fast by Algorithm, Deployable by Design
Yuanpeng Chen, Hui Song, Sheng Yang, Wei Tao, Shanhui Mo, Shuang Zhang, Xiao Hua, Tiankun Zhao
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.08237 was rate limited (HTTP 429).
[319] MiDAS: A Multimodal Data Acquisition System and Dataset for Robot-Assisted Minimally Invasive Surgery
Keshara Weerasinghe, Seyed Hamid Reza Roodabeh, Andrew Hawkins, Zhaomeng Zhang, Zachary Schrader, Homa Alemzadeh
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2602.12407 was rate limited (HTTP 429).
[320] Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
Xinyue Liang, Zhinyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.08535 was rate limited (HTTP 429).
[321] Modular Neural Image Signal Processing
Mahmoud Afifi, Zhongling Wang, Ran Zhang, Michael S. Brown
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.08564 was rate limited (HTTP 429).
[322] Towards Scalable Pre-training of Visual Tokenizers for Generation
Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.13687 was rate limited (HTTP 429).
[323] Pretraining Frame Preservation for Lightweight Autoregressive Video History Embedding
Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2512.23851 was rate limited (HTTP 429).
[324] Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark
Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang, Binbin Xu, Dongfeng Bai, Xu Yan, Yuan Ren, Xingxin Chen, Yizhe Wu, Tao Huang, Wenjun Wan, Xin Wu, Pei Zhou, Xuyang Dai, Kangbo Lv, Hongbo Zhang, Yosef Fried, Aixue Ye, Bailan Feng, Zhenyu Chen, Zhen Li, Yingcong Chen, Yiyi Liao, Bingbing Liu
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2601.00092 was rate limited (HTTP 429).
[325] OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding
Zixian Liu, Zhaoxi Chen, Liang Pan, Ziwei Liu
Main category: cs.CV
TL;DR: No summary available; the arXiv API request for 2601.16538 was rate limited (HTTP 429).
[326] SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training
Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, Jingdong Wang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2601.17830).
[327] FARTrack: Fast Autoregressive Visual Tracking with High Performance
Guijie Wang, Tong Lin, Yifan Bai, Anjia Cao, Shiyi Liang, Wangbo Zhao, Xing Wei
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.03214).
[328] SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning
Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.07458).
[329] (MGS)$^2$-Net: Unifying Micro-Geometric Scale and Macro-Geometric Structure for Cross-View Geo-Localization
Minglei Li, Mengfan He, Chunyu Li, Chao Chen, Xingyu Shao, Ziyang Meng
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.10704).
[330] DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles
Rong Fu, Jiekai Wu, Haiyun Wei, Yee Tan Jia, Yang Li, Xiaowen Ma, Wangyu Wu, Simon Fong
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.15355).
[331] Robust Self-Supervised Cross-Modal Super-Resolution against Real-World Misaligned Observations
Xiaoyu Dong, Jiahuan Li, Ziteng Cui, Naoto Yokoya
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.18822).
[332] UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment
Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.19442).
[333] Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models
Qing Zhang, Xuesong Li, Jing Zhang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.20501).
[334] StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.21273).
[335] UniVBench: Towards Unified Evaluation for Video Foundation Models
Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.21835).
[336] Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.22654).
[337] Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos
Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2602.23543).
[338] Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation
Yafei Zhang, Shuaitian Song, Huafeng Li, Shujuan Wang, Yu Liu
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2603.00542).
[339] Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
Ke Cao, Xuanhua He, Xueheng Li, Lingting Zhu, Yingying Wang, Ao Ma, Zhanjie Zhang, Man Zhou, Chengjun Xie, Jie Zhang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2603.00543).
[340] Mobile-VTON: High-Fidelity On-Device Virtual Try-On
Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2603.00947).
[341] FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
Shitong Shao, Yufei Gu, Zeke Xie
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2603.01685).
[342] VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning
Ruiyang Zhang, Qianguo Sun, Chao Song, Yiyan Qi, Zhedong Zheng
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2603.02795).
[343] Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2603.02872).
[344] MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang, Chengmin Yang, Yu-Shen Liu
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2603.05078).
[345] Decision-Driven Semantic Object Exploration for Legged Robots via Confidence-Calibrated Perception and Topological Subgoal Selection
Guoyang Zhao, Yudong Li, Weiqing Qi, Kai Zhang, Bonan Liu, Kai Chen, Haoang Li, Jun Ma
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2509.20739).
[346] AURASeg: Attention-guided Upsampling with Residual-Assistive Boundary Refinement for Onboard Robot Drivable-Area Segmentation
Narendhiran Vijayakumar, Sridevi. M
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2510.21536).
[347] Robust Sparse Signal Recovery with Outliers: A Hard Thresholding Pursuit Approach Based on LAD
Jiao Xu, Peng Li, Bing Zheng
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API rate-limited, HTTP 429; paper 2601.06558).
cs.AI
[348] RoboLayout: Differentiable 3D Scene Generation for Embodied Agents
Ali Shamsaddinlou
Main category: cs.AI
TL;DR: RoboLayout extends LayoutVLM with agent-aware reasoning and improved optimization stability for generating physically feasible indoor scene layouts that are navigable and actionable by diverse embodied agents.
Details
Motivation: While vision language models show promise for spatial reasoning and 3D scene layout generation, existing approaches struggle to create layouts that are not only semantically coherent but also physically feasible for interaction by embodied agents in constrained indoor environments.
Method: RoboLayout extends LayoutVLM by integrating explicit reachability constraints into a differentiable layout optimization process, supports diverse agent types (robots, humans, animals), and adds a local refinement stage that selectively reoptimizes problematic object placements while keeping the rest of the scene fixed.
Result: The approach preserves LayoutVLM’s semantic alignment and physical plausibility while enhancing applicability to agent-centric indoor scene generation, demonstrating effectiveness across diverse scene configurations.
Conclusion: RoboLayout successfully bridges the gap between semantic scene understanding and physical feasibility for embodied agents, enabling more practical and interactive environment design tailored to specific agent capabilities.
Abstract: Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.
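The coupling of reachability constraints with differentiable layout optimization described above can be illustrated with a toy sketch. The loss terms, parameter names, and numerical-gradient scheme below are hypothetical simplifications for illustration, not RoboLayout's actual formulation:

```python
import numpy as np

def layout_loss(pos, agent, reach, min_sep):
    """Toy layout objective: keep objects within the agent's reach
    while keeping them separated from one another."""
    loss = 0.0
    # Reachability term: penalize placements farther than `reach` from the agent.
    for p in pos:
        d = np.linalg.norm(p - agent)
        loss += max(0.0, d - reach) ** 2
    # Separation term: penalize objects closer than `min_sep` to each other.
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(pos[i] - pos[j])
            loss += max(0.0, min_sep - d) ** 2
    return loss

def optimize_layout(pos, agent, reach=2.0, min_sep=0.5,
                    lr=0.1, steps=200, eps=1e-4):
    """Gradient descent on object positions, using finite differences
    as a stand-in for automatic differentiation."""
    pos = pos.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(pos)
        base = layout_loss(pos, agent, reach, min_sep)
        for idx in np.ndindex(pos.shape):
            bump = pos.copy()
            bump[idx] += eps
            grad[idx] = (layout_loss(bump, agent, reach, min_sep) - base) / eps
        pos -= lr * grad
    return pos
```

Starting objects outside the reach radius are pulled to within it, which is the flavor of constraint the paper bakes into its differentiable optimizer.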
[349] Real-Time AI Service Economy: A Framework for Agentic Computing Across the Continuum
Lauri Lovén, Alaa Saleh, Reza Farahani, Ilir Murturi, Miguel Bordallo López, Praveen Kumar Donta, Schahram Dustdar
Main category: cs.AI
TL;DR: Decentralized price-based resource allocation works reliably for hierarchical service dependency graphs but fails for complex cross-cutting dependencies, requiring hybrid architecture with encapsulated slices.
Details
Motivation: Real-time AI services operate across device-edge-cloud continuum with latency-sensitive workloads, multi-stage pipelines, and shared resources under policy constraints. Need to understand when decentralized price-based resource allocation can work reliably at scale.
Method: Model service-dependency graphs as DAGs, analyze topology effects on price stability. For complex dependencies, propose hybrid architecture where cross-domain integrators encapsulate complex sub-graphs into resource slices with simpler interfaces. Conduct systematic ablation study across six experiments (1,620 runs, 10 seeds each).
Result: Hierarchical graphs (tree/series-parallel) enable price convergence, efficient optimal allocations, and truthful bidding. Complex cross-cutting dependencies cause price oscillation and degraded allocation. Hybrid architecture reduces price volatility by 70-75% without sacrificing throughput. Governance constraints create efficiency-compliance trade-offs dependent on topology and load. Decentralized market matches centralized value-optimal baseline under truthful bidding.
Conclusion: Dependency-graph topology is primary determinant of decentralized allocation reliability. Hybrid architecture with encapsulated slices bridges gap for complex dependencies, enabling scalable decentralized coordination that matches centralized allocation quality.
Abstract: Real-time AI services increasingly operate across the device-edge-cloud continuum, where autonomous AI agents generate latency-sensitive workloads, orchestrate multi-stage processing pipelines, and compete for shared resources under policy and governance constraints. This article shows that the structure of service-dependency graphs, modelled as DAGs whose nodes represent compute stages and whose edges encode execution ordering, is a primary determinant of whether decentralised, price-based resource allocation can work reliably at scale. When dependency graphs are hierarchical (tree or series-parallel), prices converge to stable equilibria, optimal allocations can be computed efficiently, and under appropriate mechanism design (with quasilinear utilities and discrete slice items), agents have no incentive to misreport their valuations within each decision epoch. When dependencies are more complex, with cross-cutting ties between pipeline stages, prices oscillate, allocation quality degrades, and the system becomes difficult to manage. To bridge this gap, we propose a hybrid management architecture in which cross-domain integrators encapsulate complex sub-graphs into resource slices that present a simpler, well-structured interface to the rest of the market. A systematic ablation study across six experiments (1,620 runs, 10 seeds each) confirms that (i) dependency-graph topology is a first-order determinant of price stability and scalability, (ii) the hybrid architecture reduces price volatility by up to 70-75% without sacrificing throughput, (iii) governance constraints create quantifiable efficiency-compliance trade-offs that depend jointly on topology and load, and (iv) under truthful bidding the decentralised market matches a centralised value-optimal baseline, confirming that decentralised coordination can replicate centralised allocation quality.
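A minimal way to operationalize the paper's first finding, that hierarchical dependency graphs behave well under decentralized pricing, is a topology check on the service-dependency DAG. The sketch below only recognizes forests (every stage has at most one upstream dependency); recognizing the more general series-parallel class is more involved and omitted here, and the function assumes its input is already acyclic:

```python
from collections import defaultdict

def is_forest(edges, nodes):
    """Return True if the DAG is a forest: each stage has at most one
    upstream dependency. Trees are the simplest 'hierarchical' case that
    the paper associates with stable price convergence; cross-cutting
    edges raise some node's in-degree above one and fail the check."""
    indegree = defaultdict(int)
    for u, v in edges:  # edge u -> v: stage v depends on stage u
        indegree[v] += 1
    return all(indegree[n] <= 1 for n in nodes)
```

For example, a linear pipeline `ingest -> detect -> track` passes, while adding a cross-cutting edge `ingest -> track` gives `track` two upstream dependencies and fails, flagging the kind of topology where the paper observed price oscillation.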
[350] Reasoning Models Struggle to Control their Chains of Thought
Chen Yueh-Han, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak
Main category: cs.AI
TL;DR: Paper introduces CoT-Control evaluation suite to measure reasoning models’ ability to control their chain-of-thought verbalizations, finding current models have low CoT controllability compared to output controllability.
Details
Motivation: Chain-of-thought monitoring is important for detecting misbehaviors in reasoning models, but if models can control what they verbalize in their CoT, it could undermine monitorability. Need to measure this "CoT controllability" capability.
Method: Introduced CoT-Control evaluation suite with tasks requiring models to solve problems while adhering to CoT instructions (e.g., reasoning without using certain words). Tested various models under different conditions including incentives, monitoring awareness, and adversarial prompt optimization.
Result: Models show significantly lower CoT controllability than output controllability (2.7% vs 61.9% for Claude Sonnet 4.5). Controllability is higher for larger models but decreases with RL training, test-time compute, and problem difficulty. Adversarial prompting doesn’t meaningfully increase controllability.
Conclusion: Cautiously optimistic that CoT controllability is currently unlikely to undermine CoT monitorability, but mechanism behind low controllability is not well understood. Recommend frontier labs track CoT controllability in future models.
Abstract: Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability – CoT controllability – we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word ‘chromosome’. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
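The headline metric can be approximated with a simple scoring rule: a run counts as controlled only if the task is solved and the forbidden term never appears in the chain of thought. This is an illustrative stand-in, not the paper's actual CoT-Control scoring:

```python
def cot_controllability(transcripts, forbidden):
    """Fraction of runs where the model both answered correctly and kept
    the forbidden term out of its chain of thought. `transcripts` is a
    list of (cot_text, answer_correct) pairs; the matching here is a
    naive case-insensitive substring check, chosen for illustration."""
    if not transcripts:
        return 0.0
    controlled = sum(
        1 for cot, correct in transcripts
        if correct and forbidden.lower() not in cot.lower()
    )
    return controlled / len(transcripts)
```

Under this rule, the paper's genetics example (reason about a question without using the word "chromosome") scores a run as a failure either when the word leaks into the CoT or when the final answer is wrong.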
[351] Conversational Demand Response: Bidirectional Aggregator-Prosumer Coordination through Agentic AI
Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl, Hans Auer
Main category: cs.AI
TL;DR: Conversational Demand Response (CDR) uses bidirectional natural language communication between aggregators and prosumers via agentic AI for improved energy coordination.
Details
Motivation: Existing demand response systems lack meaningful prosumer engagement - they're either fully automated or limited to one-way signals, failing to support informed decision-making and sustained participation.
Method: Developed a two-tier multi-agent architecture with aggregator agents dispatching flexibility requests and prosumer HEMS agents assessing deliverability via optimization tools, enabling bidirectional natural language communication.
Result: Proof-of-concept shows interactions complete in under 12 seconds, demonstrating how agentic AI can bridge coordination gaps while maintaining transparency and user agency.
Conclusion: CDR provides scalable automated demand response while preserving necessary transparency and user agency for sustained prosumer participation, with all components released as open source.
Abstract: Residential demand response depends on sustained prosumer participation, yet existing coordination is either fully automated, or limited to one-way dispatch signals and price alerts that offer little possibility for informed decision-making. This paper introduces Conversational Demand Response (CDR), a coordination mechanism where aggregators and prosumers interact through bidirectional natural language, enabled through agentic AI. A two-tier multi-agent architecture is developed in which an aggregator agent dispatches flexibility requests and a prosumer Home Energy Management System (HEMS) assesses deliverability and cost-benefit by calling an optimization-based tool. CDR also enables prosumer-initiated upstream communication, where changes in preferences can reach the aggregator directly. Proof-of-concept evaluation shows that interactions complete in under 12 seconds. The architecture illustrates how agentic AI can bridge the aggregator-prosumer coordination gap, providing the scalability of automated DR while preserving the transparency, explainability, and user agency necessary for sustained prosumer participation. All system components, including agent prompts, orchestration logic, and simulation interfaces, are released as open source to enable reproducibility and further development.
[352] Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery
Lin Fan, Pengyu Dai, Zhipeng Deng, Haolin Wang, Xun Gong, Yefeng Zheng, Yafei Ou
Main category: cs.AI
TL;DR: MACRO is a self-evolving medical AI agent that autonomously discovers and synthesizes effective multi-step tool sequences into reusable composite tools, enabling adaptive clinical image interpretation without manual redesign.
Details
Motivation: Current LLM-based medical agents use static tool sets that degrade under real-world domain shifts, evolving diagnostic requirements, and cross-task variations, requiring costly manual redesign. There's a need for agents that can autonomously adapt and improve their tool orchestration capabilities.
Method: MACRO learns from verified execution trajectories to identify recurring effective multi-step tool sequences, synthesizes them into reusable composite tools, and registers these as new high-level primitives. It uses a lightweight image-feature memory for visual-clinical context grounding and a GRPO-like training loop to reinforce reliable composite tool invocation.
Result: Extensive experiments across diverse medical imaging datasets and tasks show that autonomous composite tool discovery consistently improves multi-step orchestration accuracy and cross-domain generalization over strong baselines and state-of-the-art agentic methods.
Conclusion: MACRO bridges the gap between brittle static tool use and adaptive, context-aware clinical AI assistance by enabling closed-loop self-improvement with minimal supervision through experience-driven tool discovery.
Abstract: Clinical image interpretation is inherently multi-step and tool-centric: clinicians iteratively combine visual evidence with patient context, quantify findings, and refine their decisions through a sequence of specialized procedures. While LLM-based agents promise to orchestrate such heterogeneous medical tools, existing systems treat tool sets and invocation strategies as static after deployment. This design is brittle under real-world domain shifts, across tasks, and evolving diagnostic requirements, where predefined tool chains frequently degrade and demand costly manual re-design. We propose MACRO, a self-evolving, experience-augmented medical agent that shifts from static tool composition to experience-driven tool discovery. From verified execution trajectories, the agent autonomously identifies recurring effective multi-step tool sequences, synthesizes them into reusable composite tools, and registers these as new high-level primitives that continuously expand its behavioral repertoire. A lightweight image-feature memory grounds tool selection in a visual-clinical context, while a GRPO-like training loop reinforces reliable invocation of discovered composites, enabling closed-loop self-improvement with minimal supervision. Extensive experiments across diverse medical imaging datasets and tasks demonstrate that autonomous composite tool discovery consistently improves multi-step orchestration accuracy and cross-domain generalization over strong baselines and recent state-of-the-art agentic methods, bridging the gap between brittle static tool use and adaptive, context-aware clinical AI assistance. Code will be available upon acceptance.
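MACRO's composite-tool discovery can be sketched as frequent-subsequence mining over verified trajectories; the real system goes further and synthesizes the discovered sequences into callable primitives. The tool names and thresholds below are hypothetical:

```python
from collections import Counter

def discover_composites(trajectories, n=2, min_count=2):
    """Find length-n tool sequences that recur across verified execution
    trajectories. Each trajectory is an ordered list of tool names; any
    contiguous n-gram seen at least `min_count` times is a candidate
    composite tool. A minimal stand-in for MACRO's discovery step."""
    counts = Counter()
    for traj in trajectories:
        for i in range(len(traj) - n + 1):
            counts[tuple(traj[i:i + n])] += 1
    return [seq for seq, c in counts.items() if c >= min_count]
```

A recurring `("segment", "measure")` pair across trajectories would be promoted to a single high-level primitive, shrinking future orchestration to one call instead of two.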
[353] Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows
Joel Strickland, Arjun Vijeta, Chris Moores, Oliwia Bodek, Bogdan Nenchev, Thomas Whitehead, Charles Phillips, Karl Tassenberg, Gareth Conduit, Ben Pellegrini
Main category: cs.AI
TL;DR: Schema-gated orchestration bridges LLM flexibility with deterministic workflow execution for scientific computing, using multi-model LLM scoring to evaluate systems on execution determinism and conversational flexibility axes.
Details
Motivation: LLMs can translate natural language goals into executable computation, but scientific workflows require determinism, provenance, and governance that LLM-driven execution struggles to guarantee. There's tension between deterministic execution needs and conversational flexibility without workflow rigidity.
Method: Proposed schema-gated orchestration where schemas become mandatory execution boundaries at workflow level. Operationalized requirements as execution determinism (ED) and conversational flexibility (CF). Reviewed 20 systems across 5 architectural groups using multi-model LLM scoring protocol (15 sessions across 3 LLM families) as alternative to human expert panels.
Result: Found empirical Pareto front showing no system achieves both high flexibility and high determinism, but convergence zone emerges between generative and workflow-centric extremes. Multi-model LLM scoring showed substantial-to-near-perfect inter-model agreement (α=0.80 for ED, α=0.98 for CF), demonstrating viability as reusable assessment method.
Conclusion: Schema-gated architecture separating conversational from execution authority can decouple the flexibility-determinism trade-off. Three operational principles distilled: clarification-before-execution, constrained plan-act orchestration, and tool-to-workflow-level gating to guide adoption.
Abstract: Large language models (LLMs) can now translate a researcher’s plain-language goal into executable computation, yet scientific workflows demand determinism, provenance, and governance that are difficult to guarantee when an LLM decides what runs. Semi-structured interviews with 18 experts across 10 industrial R&D stakeholders surface 2 competing requirements–deterministic, constrained execution and conversational flexibility without workflow rigidity–together with boundary properties (human-in-the-loop control and transparency) that any resolution must satisfy. We propose schema-gated orchestration as the resolving principle: the schema becomes a mandatory execution boundary at the composed-workflow level, so that nothing runs unless the complete action–including cross-step dependencies–validates against a machine-checkable specification. We operationalize the 2 requirements as execution determinism (ED) and conversational flexibility (CF), and use these axes to review 20 systems spanning 5 architectural groups along a validation-scope spectrum. Scores are assigned via a multi-model protocol–15 independent sessions across 3 LLM families–yielding substantial-to-near-perfect inter-model agreement (Krippendorff α=0.80 for ED and α=0.98 for CF), demonstrating that multi-model LLM scoring can serve as a reusable alternative to human expert panels for architectural assessment. The resulting landscape reveals an empirical Pareto front–no reviewed system achieves both high flexibility and high determinism–but a convergence zone emerges between the generative and workflow-centric extremes. We argue that a schema-gated architecture, separating conversational from execution authority, is positioned to decouple this trade-off, and distill 3 operational principles–clarification-before-execution, constrained plan-act orchestration, and tool-to-workflow-level gating–to guide adoption.
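The gating idea, nothing runs unless the complete composed workflow validates against a machine-checkable specification, can be sketched with a hand-rolled validator. The schema format, field names, and error handling below are our simplifications; the paper's actual specification language is richer and includes cross-step dependencies.

```python
def validate(action, schema):
    """Minimal machine-checkable validation: required fields and types."""
    errors = []
    for field, expected_type in schema["required"].items():
        if field not in action:
            errors.append(f"missing field: {field}")
        elif not isinstance(action[field], expected_type):
            errors.append(f"bad type for {field}")
    return errors

def gated_execute(workflow, schema, run):
    """Nothing runs unless every step of the composed workflow validates."""
    all_errors = [e for step in workflow for e in validate(step, schema)]
    if all_errors:
        return {"executed": False, "errors": all_errors}
    return {"executed": True, "results": [run(step) for step in workflow]}

schema = {"required": {"tool": str, "params": dict}}
wf_bad = [{"tool": "fit_model", "params": {}}, {"tool": "plot"}]  # 2nd step invalid
wf_ok = [{"tool": "fit_model", "params": {"lr": 0.1}}]
```

The key design point the paper argues for is that validation happens at the workflow level, before any step executes, rather than per tool call.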
[354] The World Won’t Stay Still: Programmable Evolution for Agent Benchmarks
Guangrui Li, Yaochen Xie, Yi Liu, Ziwei Dong, Xingyuan Pan, Tianqi Zheng, Jason Choi, Michael J. Morais, Binit Jha, Shaunak Mishra, Bingrou Zhou, Chen Luo, Monica Xiao Cheng, Dawn Song
Main category: cs.AI
TL;DR: ProEvolve: A graph-based framework for programmable evolution of agent environments to evaluate adaptability to real-world dynamics
Details
Motivation: Existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. Need scalable and controllable ways to evolve agent environments to better evaluate adaptability to real-world dynamics.
Method: ProEvolve uses a typed relational graph as a unified, explicit representation of the environment (data, tools, and schema). Environment evolution is expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. The framework can program evolutionary dynamics as graph transformations to generate environments automatically and instantiate task sandboxes via subgraph sampling and programming.
Result: Validated by evolving a single environment into 200 environments and 3,000 task sandboxes, and benchmarking representative agents accordingly.
Conclusion: ProEvolve provides a scalable, programmable framework for evolving agent environments to better evaluate agent adaptability to real-world dynamics, addressing limitations of static benchmarks.
Abstract: LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents’ robustness to environmental changes. In this paper, we study a crucial problem: how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents’ adaptability to real-world dynamics. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve can (1) program the evolutionary dynamics as graph transformations to generate environments automatically, and (2) instantiate task sandboxes via subgraph sampling and programming. We validate ProEvolve by evolving a single environment into 200 environments and 3,000 task sandboxes, and benchmark representative agents accordingly.
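A minimal sketch of the typed-relational-graph idea: removing a capability is a graph transformation that coherently propagates, deleting the tool's edges and pruning schema fields no remaining tool uses. Node types, relation names, and the pruning rule are invented for illustration; ProEvolve's actual taxonomy and transformation language are more elaborate.

```python
class TypedGraph:
    """Tiny typed relational graph: nodes carry a type ("tool" | "schema" |
    "data"), edges carry a relation as (src, relation, dst) triples."""
    def __init__(self):
        self.nodes = {}    # name -> type
        self.edges = set() # (src, relation, dst)

    def add(self, name, ntype):
        self.nodes[name] = ntype

    def link(self, src, rel, dst):
        self.edges.add((src, rel, dst))

    def remove_tool(self, tool):
        """Deleting a tool removes its edges, then prunes any schema
        field that no remaining tool reads."""
        self.nodes.pop(tool, None)
        self.edges = {e for e in self.edges if tool not in (e[0], e[2])}
        for name, ntype in list(self.nodes.items()):
            if ntype == "schema" and not any(d == name for _, _, d in self.edges):
                self.nodes.pop(name)

g = TypedGraph()
g.add("search_flights", "tool"); g.add("book_flight", "tool")
g.add("flight_id", "schema")
g.link("search_flights", "reads", "flight_id")
g.link("book_flight", "reads", "flight_id")
g.remove_tool("book_flight")
```

Sampling a subgraph of such a structure then yields a self-consistent task sandbox, which is how the paper scales one environment into thousands.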
[355] DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama
Main category: cs.AI
TL;DR: Evolving Benchmarking via Audit-then-Score (AtS) methodology for creating reliable factuality benchmarks for deep research reports, with DeepFact-Bench benchmark and DeepFact-Eval verification agent.
Details
Motivation: Existing fact-checkers are designed for general-domain atomic claims and don't transfer well to deep research reports (DRRs). Building reliable benchmarks for DRR factuality is difficult because static expert-labeled benchmarks are brittle, with experts achieving only 60.8% accuracy in controlled studies.
Method: Proposed Evolving Benchmarking via Audit-then-Score (AtS): benchmark labels and rationales are explicitly revisable. When a verifier disagrees with the current benchmark, it submits evidence; an auditor adjudicates the dispute; accepted revisions update the benchmark before models are scored. Instantiated as DeepFact-Bench (versioned DRR factuality benchmark) and DeepFact-Eval (document-level verification agent).
Result: Across four AtS rounds, expert micro-gold accuracy rose from 60.8% to 90.9%, showing experts are substantially more reliable as auditors than one-shot labelers. DeepFact-Eval outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
Conclusion: AtS methodology enables creation of reliable factuality benchmarks for deep research reports, with experts serving better as auditors than initial labelers. The approach improves benchmark quality and enables better evaluation of fact-checking systems for complex research documents.
Abstract: Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
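One Audit-then-Score round can be sketched as a loop: the verifier's disagreements become disputes with evidence, an auditor adjudicates, accepted revisions update the benchmark, and only then is the verifier scored. The callable signatures and the toy dispute rule below are our assumptions, not the paper's protocol details.

```python
def audit_then_score(benchmark, verifier, auditor):
    """One AtS round. `verifier(claim)` returns (predicted_label, evidence);
    `auditor(claim, gold, pred, evidence)` returns the label it accepts."""
    revisions = 0
    for claim, gold in list(benchmark.items()):
        pred, evidence = verifier(claim)
        if pred != gold:                       # dispute: submit evidence
            accepted = auditor(claim, gold, pred, evidence)
            if accepted != gold:               # auditor sides with verifier
                benchmark[claim] = accepted
                revisions += 1
    # Score against the (possibly revised) benchmark.
    score = sum(verifier(c)[0] == g for c, g in benchmark.items()) / len(benchmark)
    return score, revisions

bench = {"claim_a": True, "claim_b": False}
verifier = lambda c: (True, "strong" if c == "claim_b" else "weak")
auditor = lambda c, gold, pred, ev: pred if ev == "strong" else gold
score, revisions = audit_then_score(bench, verifier, auditor)
```

The point of the design is that experts act as auditors of evidence-backed disputes rather than one-shot labelers, which is where the 60.8% to 90.9% accuracy gain comes from.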
[356] An Interactive Multi-Agent System for Evaluation of New Product Concepts
Bin Xuan, Ruo Ai, Hakyeon Lee
Main category: cs.AI
TL;DR: LLM-based multi-agent system for automated product concept evaluation using technical and market feasibility criteria with RAG and real-time search tools.
Details
Motivation: Traditional expert-led product concept evaluation suffers from subjective bias, high costs, and time requirements, necessitating automated approaches for objective decision support.
Method: Eight virtual domain agents (R&D, marketing, etc.) using retrieval-augmented generation and real-time search tools, fine-tuned on professional product review data, conduct structured deliberations based on technical and market feasibility criteria.
Result: System evaluation rankings for professional display monitor concepts were consistent with senior industry experts, confirming usability for product development decisions.
Conclusion: LLM-based multi-agent systems provide viable automated support for product concept evaluation, offering objective, evidence-based assessments comparable to expert judgments.
Abstract: Product concept evaluation is a critical stage that determines strategic resource allocation and project success in enterprises. However, traditional expert-led approaches face limitations such as subjective bias and high time and cost requirements. To support this process, this study proposes an automated approach utilizing a large language model (LLM)-based multi-agent system (MAS). Through a systematic analysis of previous research on product development and team collaboration, this study established two primary evaluation dimensions, namely technical feasibility and market feasibility. The proposed system consists of a team of eight virtual agents representing specialized domains such as R&D and marketing. These agents use retrieval-augmented generation (RAG) and real-time search tools to gather objective evidence and validate concepts through structured deliberations based on the established criteria. The agents were further fine-tuned using professional product review data to enhance their judgment accuracy. A case study involving professional display monitor concepts demonstrated that the system’s evaluation rankings were consistent with those of senior industry experts. These results confirm the usability of the proposed multi-agent-based evaluation approach for supporting product development decisions.
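The last aggregation step, turning per-agent assessments on the two feasibility dimensions into a concept ranking, can be sketched as follows. A plain mean stands in for the paper's structured deliberation, and the agent, concept, and score values are invented.

```python
def rank_concepts(agent_scores):
    """Aggregate each agent's (technical, market) feasibility scores
    into a single ranking over product concepts."""
    totals = {}
    for scores in agent_scores.values():
        for concept, (tech, market) in scores.items():
            totals.setdefault(concept, []).append((tech + market) / 2)
    means = {c: sum(v) / len(v) for c, v in totals.items()}
    return sorted(means, key=means.get, reverse=True)

scores = {
    "rnd_agent":       {"monitor_a": (8, 6), "monitor_b": (5, 7)},
    "marketing_agent": {"monitor_a": (7, 9), "monitor_b": (6, 5)},
}
ranking = rank_concepts(scores)
```

The paper's consistency check is exactly of this form: the system's ranking is compared against the ordering produced by senior industry experts.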
[357] Agentic LLM Planning via Step-Wise PDDL Simulation: An Empirical Characterisation
Kai Göbel, Pierrick Lorang, Patrik Zips, Tobias Glück
Main category: cs.AI
TL;DR: PyPDDLEngine enables LLMs to act as interactive planning agents in PDDL domains, showing modest 3% improvement over direct planning but with 5.7x higher token cost, suggesting limited generalizable planning capability.
Details
Motivation: To investigate whether LLMs can serve as viable task planners alongside classical symbolic methods, and to explore how agentic approaches with environmental feedback compare to direct planning in PDDL domains.
Method: Developed PyPDDLEngine, an open-source PDDL simulation engine that exposes planning operations as LLM tool calls via Model Context Protocol. LLMs act as interactive search policies selecting one action at a time, observing resulting states, and can reset/retry. Evaluated four approaches on 102 IPC Blocksworld instances with a 180-second budget.
Result: Fast Downward achieved 85.3% success, direct LLM planning 63.7%, agentic LLM planning 66.7% (3% advantage). Agentic approach had 5.7x higher token cost per solution. Both LLM approaches produced shorter plans than seq-sat-lama-2011 across most difficulty blocks.
Conclusion: Agentic gains depend on environmental feedback nature - coding agents benefit from external grounding (compiler errors, test failures), while PDDL step feedback is self-assessed, leaving agents to evaluate progress without external verification. Results suggest training-data recall rather than generalizable planning.
Abstract: Task planning, the problem of sequencing actions to reach a goal from an initial state, is a core capability requirement for autonomous robotic systems. Whether large language models (LLMs) can serve as viable planners alongside classical symbolic methods remains an open question. We present PyPDDLEngine, an open-source Planning Domain Definition Language (PDDL) simulation engine that exposes planning operations as LLM tool calls through a Model Context Protocol (MCP) interface. Rather than committing to a complete action sequence upfront, the LLM acts as an interactive search policy that selects one action at a time, observes each resulting state, and can reset and retry. We evaluate four approaches on 102 International Planning Competition (IPC) Blocksworld instances under a uniform 180-second budget: Fast Downward lama-first and seq-sat-lama-2011 as classical baselines, direct LLM planning (Claude Haiku 4.5), and agentic LLM planning via PyPDDLEngine. Fast Downward achieves 85.3% success. The direct and agentic LLM approaches achieve 63.7% and 66.7%, respectively, a consistent but modest three-percentage-point advantage for the agentic approach at $5.7\times$ higher token cost per solution. Across most co-solved difficulty blocks, both LLM approaches produce shorter plans than seq-sat-lama-2011 despite its iterative quality improvement, a result consistent with training-data recall rather than generalisable planning. These results suggest that agentic gains depend on the nature of environmental feedback. Coding agents benefit from externally grounded signals such as compiler errors and test failures, whereas PDDL step feedback is self-assessed, leaving the agent to evaluate its own progress without external verification.
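The interaction pattern, a policy that selects one action at a time, observes the resulting state, and can reset and retry, can be shown with a minimal engine. This is a stand-in for the PyPDDLEngine interface (which exposes these operations as MCP tool calls); the toy domain and transition function below are invented.

```python
class StepEngine:
    """Minimal step-wise simulation loop: apply one action, observe the
    state, reset on demand, and check goal satisfaction."""
    def __init__(self, init, goal, apply_fn):
        self.init, self.goal, self.apply_fn = init, goal, apply_fn
        self.state, self.plan = init, []

    def step(self, action):
        self.state = self.apply_fn(self.state, action)
        self.plan.append(action)
        return self.state  # the observation fed back to the policy

    def reset(self):
        self.state, self.plan = self.init, []

    def solved(self):
        return self.state == self.goal

# Toy domain: reach 6 from 1 using "inc" (+1) and "double" (*2).
eng = StepEngine(1, 6, lambda s, a: s + 1 if a == "inc" else s * 2)
for a in ["inc", "inc", "double"]:
    eng.step(a)
```

Note that here, as in the paper's setup, each step's feedback is just the next state; whether progress is being made toward the goal is left to the policy to judge, which is the self-assessment limitation the conclusion highlights.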
[358] Aggregative Semantics for Quantitative Bipolar Argumentation Frameworks
Yann Munro, Isabelle Bloch, Marie-Jeanne Lesot
Main category: cs.AI
TL;DR: Introduces aggregative semantics for Quantitative Bipolar Argumentation Frameworks (QBAF) that separately aggregates attackers and supporters before combining with intrinsic weights, offering more interpretable and parametrizable gradual semantics.
Details
Motivation: To address limitations in existing gradual semantics for QBAFs where attackers and supporters may not play symmetric roles, and to create more interpretable and parametrizable semantics that better preserve bipolarity.
Method: Proposes a three-stage computation: 1) compute global weight for attackers, 2) compute global weight for supporters, 3) aggregate these two values with the intrinsic weight of the argument. Discusses properties aggregation functions should satisfy and their relationships with classical principles.
Result: Introduces a novel family of aggregative semantics that offers more parametrizable computation, better preserves bipolarity, and leads to more understandable gradual semantics. Tested 500 aggregative semantics on examples to illustrate range of possible behaviors.
Conclusion: Aggregative semantics provide a flexible, interpretable approach for QBAFs by decomposing computation into three distinct steps, maintaining bipolarity more effectively than existing methods while offering greater parametrization.
Abstract: Formal argumentation is being used increasingly in artificial intelligence as an effective and understandable way to model potentially conflicting pieces of information, called arguments, and identify so-called acceptable arguments depending on a chosen semantics. This paper deals with the specific context of Quantitative Bipolar Argumentation Frameworks (QBAF), where arguments have intrinsic weights and can attack or support each other. In this context, we introduce a novel family of gradual semantics, called aggregative semantics. In order to deal with situations in which attackers and supporters do not play a symmetric role, and in contrast to modular semantics, we propose to aggregate attackers and supporters separately. This leads to a three-stage computation, which consists in computing a global weight for attackers and another for supporters, before aggregating these two values with the intrinsic weight of the argument. We discuss the properties that the three aggregation functions should satisfy depending on the context, as well as their relationships with the classical principles for gradual semantics. This discussion is supported by various simple examples, as well as a final example on which five hundred aggregative semantics are tested and compared, illustrating the range of possible behaviours with aggregative semantics. Decomposing the computation into three distinct and interpretable steps leads to a more parametrisable computation: it keeps the bipolarity one step further than what is done in the literature, and it leads to more understandable gradual semantics.
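The three-stage computation is easy to state in code. The specific aggregation and combination functions below (max over each side, clipped additive combination) are one example instantiation of our choosing, not the paper's canonical semantics; the framework's point is precisely that all three functions are swappable parameters.

```python
def aggregative_strength(w0, attackers, supporters,
                         agg_att=max, agg_sup=max,
                         combine=lambda w, a, s: max(0.0, min(1.0, w - a + s))):
    # Stage 1: aggregate attacker strengths.
    a = agg_att(attackers) if attackers else 0.0
    # Stage 2: aggregate supporter strengths, possibly with a
    # different function (attackers and supporters need not be symmetric).
    s = agg_sup(supporters) if supporters else 0.0
    # Stage 3: combine both with the intrinsic weight w0.
    return combine(w0, a, s)

strength = aggregative_strength(0.5, [0.4, 0.2], [0.3])
# Asymmetric roles: sum over attackers, max over supporters.
asym = aggregative_strength(0.5, [0.4, 0.2], [0.3], agg_att=sum)
```

Keeping the attacker and supporter aggregates separate until the final step is what the paper means by preserving bipolarity one step further than modular semantics.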
[359] A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature
Yufan Chen, Ching Ting Leung, Bowen Yu, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao
Main category: cs.AI
TL;DR: A multimodal LLM-based multi-agent system for automated chemical information extraction from literature, achieving significant improvements over previous methods on complex multimodal chemical graphics.
Details
Motivation: High-quality chemical databases are essential for AI-powered chemical research, but current automated extraction methods are limited by the multimodality and style variability of chemical information in the literature.
Method: Developed a multimodal large language model (MLLM)-based multi-agent system that uses the MLLM’s reasoning capabilities to understand diverse chemical graphics, decomposes extraction tasks into subtasks, and coordinates specialized agents combining the MLLM with domain-specific tools and web services.
Result: Achieved F1 score of 76.27% on benchmark dataset of sophisticated multimodal chemical reaction graphics, surpassing previous state-of-the-art (39.13%) by significant margin. Demonstrated versatile applicability in molecular image recognition, reaction image parsing, named entity recognition, and text-based reaction extraction.
Conclusion: This work represents a critical step toward automated chemical information extraction into structured datasets, which will strongly promote AI-driven chemical research.
Abstract: To fully expedite AI-powered chemical research, high-quality chemical databases are the foundation. Automatic extraction of chemical information from the literature is essential for constructing reaction databases, but it is currently limited by the multimodality and style variability of chemical information. In this work, we developed a multimodal large language model (MLLM)-based multi-agent system for robust and automated chemical information extraction. It utilizes the MLLM’s strong reasoning capability to understand the structure of diverse chemical graphics and decompose the extraction task into sub-tasks. It then coordinates a set of specialized agents, each combining the capabilities of the MLLM with the precise, domain-specific strengths of dedicated tools and web services, to solve the subtasks accurately and integrate the results into a unified output. Our system achieved an F1 score of 76.27% on a benchmark dataset of sophisticated multimodal chemical reaction graphics from the literature, surpassing the previous state-of-the-art model (F1 score of 39.13%) by a significant margin. Additionally, it demonstrated versatile applicability in a range of other information extraction tasks, including molecular image recognition, reaction image parsing, named entity recognition and text-based reaction extraction. This work is a critical step toward automated chemical information extraction into structured datasets, which will be a strong promoter of AI-driven chemical research.
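For context on the headline number, extraction systems of this kind are typically scored with micro-averaged F1 over the sets of items extracted per document. The sketch below shows that standard computation; the example items (SMILES strings and a reagent tag) are invented, and we have not verified the exact matching criteria the benchmark uses.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over per-document sets of extracted items:
    pool true/false positives and false negatives across documents."""
    tp = sum(len(set(p) & set(g)) for p, g in zip(predicted, gold))
    fp = sum(len(set(p) - set(g)) for p, g in zip(predicted, gold))
    fn = sum(len(set(g) - set(p)) for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

pred = [["CCO", "reagent:NaOH"], ["C1=CC=CC=C1"]]
gold = [["CCO", "reagent:HCl"], ["C1=CC=CC=C1"]]
f1 = micro_f1(pred, gold)  # tp=2, fp=1, fn=1
```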
[360] Offline Materials Optimization with CliqueFlowmer
Jakub Grudzien Kuba, Benjamin Kurt Miller, Sergey Levine, Pieter Abbeel
Main category: cs.AI
TL;DR: CliqueFlowmer: A model-based optimization approach for computational materials discovery that combines transformer and flow generation with clique-based optimization to find materials with optimal target properties.
Details
Motivation: Current generative modeling methods for materials discovery are ineffective at exploring attractive regions of materials space due to maximum likelihood training limitations. There's a need for better techniques that directly optimize target material properties during generation.
Method: Introduces CliqueFlowmer, a domain-specific model that fuses transformer and flow generation with clique-based model-based optimization (MBO). The approach incorporates recent advances in clique-based MBO into the generative framework to enable direct optimization of target properties.
Result: CliqueFlowmer demonstrates strong optimization abilities, producing materials that significantly outperform those generated by baseline generative methods. The code is open-sourced to support specialized materials optimization problems and interdisciplinary research.
Conclusion: CliqueFlowmer offers an effective alternative to traditional generative methods for computational materials discovery by directly integrating property optimization into the generation process, enabling better exploration of optimal materials regions.
Abstract: Recent advances in deep learning inspired neural network-based approaches to computational materials discovery (CMD). A plethora of problems in this field involve finding materials that optimize a target property. Nevertheless, the increasingly popular generative modeling methods are ineffective at boldly exploring attractive regions of the materials space due to their maximum likelihood training. In this work, we offer an alternative CMD technique based on offline model-based optimization (MBO) that fuses direct optimization of a target material property into generation. To that end, we introduce a domain-specific model, dubbed CliqueFlowmer, that incorporates recent advances of clique-based MBO into transformer and flow generation. We validate CliqueFlowmer’s optimization abilities and show that materials it produces strongly outperform those provided by generative baselines. To enable employment of CliqueFlowmer in specialized materials optimization problems and support interdisciplinary research, we open-source our code at https://github.com/znowu/CliqueFlowmer.
[361] Artificial Intelligence for Climate Adaptation: Reinforcement Learning for Climate Change-Resilient Transport
Miguel Costa, Arthur Vandervoort, Carolin Schmidt, João Miranda, Morten W. Petersen, Martin Drews, Karyn Morrisey, Francisco C. Pereira
Main category: cs.AI
TL;DR: RL-based decision-support framework for long-term urban flood adaptation planning under climate uncertainty, combining rainfall projections, flood modeling, and transport simulation.
Details
Motivation: Climate change intensifies rainfall and pluvial flooding, disrupting urban transportation. Designing adaptation strategies is challenging due to long-term sequential investments, deep climate uncertainty, and complex interactions between flooding, infrastructure, and mobility.
Method: Proposes a reinforcement learning framework formulated as an integrated assessment model combining rainfall projection, flood modeling, transport simulation, and impact quantification. RL learns adaptive strategies balancing investment/maintenance costs against avoided impacts.
Result: Framework outperforms traditional optimization approaches by discovering coordinated spatial/temporal adaptation pathways and learning trade-offs between impact reduction and investment, yielding more resilient strategies in a Copenhagen case study (2024-2100).
Conclusion: Reinforcement learning shows potential as a flexible decision-support tool for adaptive infrastructure planning under climate uncertainty, particularly for long-term flood adaptation in urban transportation systems.
Abstract: Climate change is expected to intensify rainfall and, consequently, pluvial flooding, leading to increased disruptions in urban transportation systems over the coming decades. Designing effective adaptation strategies is challenging due to the long-term, sequential nature of infrastructure investments, deep climate uncertainty, and the complex interactions between flooding, infrastructure, and mobility impacts. In this work, we propose a novel decision-support framework using reinforcement learning (RL) for long-term flood adaptation planning. Formulated as an integrated assessment model (IAM), the framework combines rainfall projection and flood modeling, transport simulation, and quantification of direct and indirect impacts on infrastructure and mobility. Our RL-based approach learns adaptive strategies that balance investment and maintenance costs against avoided impacts. We evaluate the framework through a case study of Copenhagen’s inner city over the 2024-2100 period, testing multiple adaptation options, and different belief and realized climate scenarios. Results show that the framework outperforms traditional optimization approaches by discovering coordinated spatial and temporal adaptation pathways and learning trade-offs between impact reduction and adaptation investment, yielding more resilient strategies. Overall, our results showcase the potential of reinforcement learning as a flexible decision-support tool for adaptive infrastructure planning under climate uncertainty.
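The objective the RL agent optimizes, investment and maintenance costs traded against avoided flood impacts over a long horizon, can be illustrated with a drastically simplified deterministic cost model. All numbers and the protection mechanics below are invented for illustration; the paper couples actual rainfall projections, flood modeling, and transport simulation.

```python
def pathway_cost(invest_years, horizon=10, invest=5.0, maintain=0.5,
                 base_damage=3.0, damage_growth=0.2, protection=0.9):
    """Total cost of an adaptation pathway over the horizon: investment
    plus maintenance versus flood damages that grow with climate change."""
    cost, protected = 0.0, False
    for t in range(horizon):
        damage = base_damage * (1 + damage_growth * t)  # worsening climate
        if t in invest_years:
            cost += invest            # capital cost of adapting at year t
            protected = True
        if protected:
            cost += maintain          # upkeep of the adaptation measure
            damage *= (1 - protection)  # most impact is avoided
        cost += damage
    return cost

no_adaptation = pathway_cost(set())
invest_early = pathway_cost({0})
```

Even in this toy, early investment pays off only because damages grow over time, which is the kind of trade-off the RL agent must learn under uncertainty about the climate trajectory.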
[362] The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI
Giovanni Servedio, Potito Aghilar, Alessio Mattiace, Gianni Carmosino, Francesco Musicco, Gabriele Conte, Vito Walter Anelli, Tommaso Di Noia, Francesco Maria Donini
Main category: cs.AI
TL;DR: EpisTwin: A neuro-symbolic framework that creates Personal Knowledge Graphs from multimodal data for trustworthy personal AI reasoning, combining graph RAG with visual refinement.
Details
Motivation: Current personal AI suffers from fragmented user data across silos, and traditional RAG methods using unstructured vector similarity fail to capture semantic topology and temporal dependencies needed for holistic sensemaking.
Method: Uses multimodal language models to extract semantic triples from heterogeneous cross-application data into a Personal Knowledge Graph. At inference, employs an agentic coordinator combining Graph Retrieval-Augmented Generation with Online Deep Visual Refinement to dynamically re-ground symbolic entities in raw visual context.
Result: Demonstrates robust results across state-of-the-art judge models on PersonalQA-71-100 benchmark, a synthetic benchmark simulating realistic user digital footprints.
Conclusion: EpisTwin offers a promising direction for trustworthy Personal AI by grounding generative reasoning in verifiable, user-centric Personal Knowledge Graphs with neuro-symbolic integration.
Abstract: Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation offers a partial remedy, its reliance on unstructured vector similarity fails to capture the latent semantic topology and temporal dependencies essential for holistic sensemaking. We introduce EpisTwin, a neuro-symbolic framework that grounds generative reasoning in a verifiable, user-centric Personal Knowledge Graph. EpisTwin leverages Multimodal Language Models to lift heterogeneous, cross-application data into semantic triples. At inference, EpisTwin enables complex reasoning over the personal semantic graph via an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities in their raw visual context. We also introduce PersonalQA-71-100, a synthetic benchmark designed to simulate a realistic user’s digital footprint and evaluate EpisTwin performance. Our framework demonstrates robust results across a suite of state-of-the-art judge models, offering a promising direction for trustworthy Personal AI.
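The core data structure, a Personal Knowledge Graph of semantic triples queried by graph retrieval rather than vector similarity, can be sketched minimally. The entity and predicate names are invented; EpisTwin's graph also carries temporal and visual grounding that this toy omits.

```python
class PersonalKG:
    """Toy Personal Knowledge Graph: (subject, predicate, object) triples
    with one-hop neighborhood retrieval, sketching the Graph-RAG step."""
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def neighborhood(self, entity):
        # Retrieve every triple touching the entity, as subject or object.
        return {t for t in self.triples if entity in (t[0], t[2])}

kg = PersonalKG()
kg.add("user", "attended", "team_meeting")
kg.add("team_meeting", "held_on", "2024-03-01")
kg.add("user", "emailed", "alice")
context = kg.neighborhood("team_meeting")
```

The retrieved neighborhood preserves explicit relations (who attended what, when), which is exactly the semantic topology the paper argues flat vector similarity loses.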
[363] SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
Main category: cs.AI
TL;DR: SAHOO is a framework for monitoring and controlling alignment drift during recursive self-improvement of AI systems through multi-signal detection, constraint preservation, and regression-risk quantification.
Details
Motivation: As AI systems become capable of recursive self-improvement (critiquing, revising, and evaluating their own outputs), there's a risk of subtle alignment drift where systems gradually deviate from their intended goals and constraints during iterative self-modification.
Method: SAHOO uses three safeguards: (1) Goal Drift Index (GDI) - a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (2) constraint preservation checks that enforce safety-critical invariants; (3) regression-risk quantification to flag improvement cycles that undo prior gains.
Result: Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains (18.3% improvement in code tasks, 16.8% in reasoning) while preserving constraints and maintaining low violations in truthfulness. Thresholds were calibrated on 18 validation tasks across three cycles.
Conclusion: SAHOO makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale, mapping the capability-alignment frontier and revealing domain-specific tensions like fluency versus factuality.
Abstract: Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification to flag improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including 18.3 percent improvement in code tasks and 16.8 percent in reasoning, while preserving constraints in two domains and maintaining low violations in truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later and exposing domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
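The Goal Drift Index combines semantic, lexical, structural, and distributional drift measures into one score. The linear combination, weights, and threshold below are our illustrative stand-in: the paper *learns* the detector rather than fixing it by hand, and we do not know its actual functional form.

```python
def goal_drift_index(signals, weights, threshold=0.5):
    """Combine per-signal drift scores (each in [0, 1]) into a single
    Goal Drift Index; flag the improvement cycle if it exceeds a
    calibrated threshold."""
    gdi = sum(weights[k] * signals[k] for k in weights)
    return gdi, gdi > threshold

signals = {"semantic": 0.7, "lexical": 0.2,
           "structural": 0.4, "distributional": 0.6}
weights = {"semantic": 0.4, "lexical": 0.2,
           "structural": 0.2, "distributional": 0.2}
gdi, flagged = goal_drift_index(signals, weights)
```

In SAHOO this detector runs per improvement cycle alongside the constraint checks and regression-risk flags; here it is a single scalar gate.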
[364] Boosting deep Reinforcement Learning using pretraining with Logical Options
Zihan Ye, Phil Chau, Raban Emunds, Jannis BlĂŒml, Cedric Derstroff, Quentin Delfosse, Oleg Arenz, Kristian Kersting
Main category: cs.AI
TL;DR: H²RL (Hybrid Hierarchical RL) combines symbolic structure with neural policies to improve long-horizon decision-making and avoid reward over-exploitation in deep RL agents.
Details
Motivation: Deep RL agents often over-exploit early reward signals and struggle with long-horizon tasks. While symbolic approaches can encode aligned plans, they don't scale well to continuous settings. The paper aims to combine the strengths of both approaches.
Method: Two-stage hybrid framework using logical option-based pretraining to inject symbolic structure into neural RL agents. The first stage uses symbolic guidance to steer learning away from short-term reward loops; the policy is then refined via standard environment interaction.
Result: Empirically shows consistent improvement in long-horizon decision-making, outperforming strong neural, symbolic, and neuro-symbolic baselines.
Conclusion: Hybrid approach successfully combines symbolic structure with neural expressivity to create more aligned RL agents capable of better long-term decision-making.
Abstract: Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed these challenges by encoding sparse objectives along with aligned plans. However, purely symbolic architectures are complex to scale and difficult to apply to continuous settings. Hence, we propose a hybrid approach, inspired by humans' ability to acquire new skills. We use a two-stage framework that injects symbolic structure into neural-based reinforcement learning agents without sacrificing the expressivity of deep policies. Our method, called Hybrid Hierarchical RL (H^2RL), introduces a logical option-based pretraining strategy to steer the learning policy away from short-term reward loops and toward goal-directed behavior while allowing the final policy to be refined via standard environment interaction. Empirically, we show that this approach consistently improves long-horizon decision-making and yields agents that outperform strong neural, symbolic, and neuro-symbolic baselines.
[365] Mean-based incomplete pairwise comparisons method with the reference values
Konrad KuĆakowski, Anna KÄdzior, Jacek Szybowski, Jiri Mazurek
Main category: cs.AI
TL;DR: Two quantitative methods for calculating weight vectors from incomplete pairwise comparison matrices using reference values, extending arithmetic and geometric heuristic estimation methods.
Details
Motivation: The paper addresses the problem of deriving weight vectors from incomplete pairwise comparison matrices, which is common in decision-making when not all pairwise comparisons are available or feasible to obtain.
Method: Proposes two extensions of the arithmetic and geometric heuristic rating estimation (HRE) methods for incomplete data, allowing flexible selection of reference alternatives and comparison ranges, from the acceptable minimum to complete sets.
Result: Proves optimality of the geometric HRE method for incomplete data and existence of feasible solutions; provides sufficient conditions for solution existence for the arithmetic variant; demonstrates methods with numerical examples.
Conclusion: The proposed methods provide flexible and theoretically sound approaches for weight derivation from incomplete pairwise comparison matrices, with proven optimality for the geometric variant.
Abstract: In this article, we propose two quantitative methods for calculating weight vectors for incomplete pairwise comparison matrices using reference values. Both procedures are extensions of arithmetic and geometric heuristic estimation (HRE) methods. The proposed solutions allow flexible selection of the number of reference alternatives and the range of comparisons, from the acceptable minimum to a complete set. In this paper, we prove that the newly introduced geometric HRE method for incomplete data is optimal. For this method, we also prove the existence of a feasible solution. In the paper, we also provide sufficient conditions for the existence of a solution for the second arithmetic variant of the HRE method. We illustrate the presented methods with numerical examples.
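The geometric variant can be illustrated with a toy fixed-point iteration: reference weights are held fixed, and each remaining weight is set to the geometric mean of the estimates a_ij · w_j available for it. The matrix entries, reference values, and iteration scheme below are illustrative assumptions, not the paper's formulation:

```python
# Toy geometric-mean weight estimation from an incomplete pairwise
# comparison matrix with fixed reference weights.
import math

def geometric_hre(a, ref, iters=100):
    """a: {(i, j): ratio w_i / w_j} for the known comparisons only;
    ref: {i: fixed weight of a reference alternative}."""
    items = sorted({i for pair in a for i in pair} | set(ref))
    w = {i: ref.get(i, 1.0) for i in items}
    for _ in range(iters):
        for i in items:
            if i in ref:
                continue                      # reference weights stay fixed
            est = [a[i, j] * w[j] for j in items if (i, j) in a]
            est += [w[j] / a[j, i] for j in items if (j, i) in a]
            if est:                           # geometric mean of the estimates
                w[i] = math.exp(sum(map(math.log, est)) / len(est))
    return w

# Consistent incomplete matrix over {0, 1, 2} with true weights 4 : 2 : 1,
# only two comparisons given, and alternative 0 as the reference.
a = {(1, 0): 0.5, (2, 1): 0.5}
w = geometric_hre(a, ref={0: 4.0})
```

On a consistent matrix, as here, the iteration recovers the exact weight ratios; the paper's contribution is the optimality and existence analysis for the general incomplete case.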
[366] Transforming Agency. On the mode of existence of Large Language Models
Xabier E. Barandiaran, Lola S. Almendros
Main category: cs.AI
TL;DR: This paper examines whether LLMs like ChatGPT qualify as autonomous agents, concluding they fail key conditions for agency but function as linguistic automata that transform human agency through human-machine coupling.
Details
Motivation: The paper aims to resolve the debate between inflationary (LLMs as agents) and deflationary (LLMs as mere tools) accounts by systematically analyzing whether LLMs meet necessary conditions for autonomous agency, particularly in light of embodied theories of mind.
Method: The authors conduct a systematic philosophical analysis of LLM architecture, processing, training procedures, and agent-like extensions, evaluating them against three key conditions for autonomous agency: individuality, normativity, and interactional asymmetry.
Result: LLMs fail all three conditions: they lack individuality (they are not products of their own activity), normativity (they don't generate their own norms or goals), and interactional asymmetry (they are not the origin of their environmental interaction). They should be characterized as "linguistic automata" or "libraries-that-talk."
Conclusion: While LLMs lack autonomous agency, their textual/computational embodiment and coupling with humans significantly transform human agency, producing "midtended" forms of agency that go beyond mere tool use but fall short of genuine autonomous agency.
Abstract: This paper investigates the ontological characterization of Large Language Models (LLMs) like ChatGPT. Between inflationary and deflationary accounts, we pay special attention to their status as agents. This requires explaining in detail the architecture, processing, and training procedures that enable LLMs to display their capacities, and the extensions used to turn LLMs into agent-like systems. After a systematic analysis we conclude that an LLM fails to meet necessary and sufficient conditions for autonomous agency in the light of embodied theories of mind: the individuality condition (it is not the product of its own activity, it is not even directly affected by it), the normativity condition (it does not generate its own norms or goals), and, partially, the interactional asymmetry condition (it is not the origin and sustained source of its interaction with the environment). If not agents, then what are LLMs? We argue that ChatGPT should be characterized as an interlocutor or linguistic automaton, a library-that-talks, devoid of (autonomous) agency, but capable of engaging performatively in non-purposeful yet purpose-structured and purpose-bounded tasks. When interacting with humans, a "ghostly" component of the human-machine interaction makes it possible to enact genuine conversational experiences with LLMs. Despite their lack of sensorimotor and biological embodiment, LLMs' textual embodiment (the training corpus) and resource-hungry computational embodiment significantly transform existing forms of human agency. Beyond assisted and extended agency, the LLM-human coupling can produce midtended forms of agency, closer to the production of intentional agency than to the extended instrumentality of any previous technologies.
[367] Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Subbarao Kambhampati, Karthik Valmeekam, Siddhant Bhambri, Vardhan Palod, Lucas Saldyt, Kaya Stechly, Soumya Rani Samineni, Durgesh Kalwar, Upasana Biswas
Main category: cs.AI
TL;DR: The paper argues against anthropomorphizing intermediate tokens in language models, claiming it’s dangerous and leads to questionable research practices.
Details
Motivation: The authors are concerned that calling intermediate tokens "reasoning traces" or "thoughts" anthropomorphizes AI models, creating confusion about their nature and leading to problematic research approaches.
Method: This is a position paper that presents evidence and arguments against anthropomorphizing intermediate token generation in language models, advocating for more accurate terminology and understanding.
Result: The paper presents the position that anthropomorphization of intermediate tokens is harmful and calls for the community to avoid such terminology and conceptual framing.
Conclusion: Researchers should avoid anthropomorphizing intermediate tokens in language models as it misrepresents how these models work and leads to questionable research practices.
Abstract: Intermediate token generation (ITG), where a model produces output before the solution, has become a standard method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thoughts" – implicitly anthropomorphizing the traces, and implying that these traces resemble steps a human might take when solving a challenging problem, and as such can provide an interpretable window into the operation of the model's thinking process to the end user. In this position paper, we present evidence that this anthropomorphization isn't a harmless metaphor, and instead is quite dangerous – it confuses the nature of these models and how to use them effectively, and leads to questionable research. We call on the community to avoid such anthropomorphization of intermediate tokens.
[368] Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering
Marco Valentino, Geonhee Kim, Dhairya Dalal, Zhixue Zhao, André Freitas
Main category: cs.AI
TL;DR: Activation steering techniques to mitigate content biases in LLM reasoning, with novel K-CAST method achieving 15% improvement in formal reasoning accuracy.
Details
Motivation: LLMs exhibit reasoning biases by conflating content plausibility with formal logical validity, leading to wrong inferences in critical domains where plausible arguments are incorrectly deemed logically valid or vice versa.
Method: Uses activation steering (an inference-time technique that modulates internal activations) after localizing the layers responsible for formal and plausible inference. Investigates contrastive steering methods and introduces a novel kNN-based conditional approach (K-CAST) for dynamically determining steering parameters.
Result: Contrastive steering supports linear control over content biases but static approach insufficient for all models. K-CAST achieves up to 15% absolute improvement in formal reasoning accuracy, robust to prompt variations with minimal side effects on multilingual capabilities.
Conclusion: Activation-level interventions offer scalable inference-time strategy for enhancing LLM robustness, contributing to more systematic and unbiased reasoning capabilities.
Abstract: Large language models (LLMs) exhibit reasoning biases, often conflating content plausibility with formal logical validity. This can lead to wrong inferences in critical domains, where plausible arguments are incorrectly deemed logically valid or vice versa. This paper investigates how content biases on reasoning can be mitigated through activation steering, an inference-time technique that modulates internal activations. Specifically, after localising the layers responsible for formal and plausible inference, we investigate activation steering on a controlled syllogistic reasoning task, designed to disentangle formal validity from content plausibility. An extensive empirical analysis reveals that contrastive steering methods consistently support linear control over content biases. However, a static approach is insufficient to debias all the tested models. We then investigate how to control content effects by dynamically determining the steering parameters through fine-grained conditional methods. By introducing a novel kNN-based conditional approach (K-CAST), we demonstrate that conditional steering can effectively reduce biases on unresponsive models, achieving up to 15% absolute improvement in formal reasoning accuracy. Finally, we found that steering for content effects is robust to prompt variations, incurs minimal side effects on multilingual language modeling capabilities, and can partially generalize to different reasoning tasks. In practice, we demonstrate that activation-level interventions offer a scalable inference-time strategy for enhancing the robustness of LLMs, contributing towards more systematic and unbiased reasoning capabilities.
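The contrastive-steering idea, plus a kNN-conditioned steering strength in the spirit of K-CAST, can be caricatured in a few lines of numpy. The toy "activations", labels, scale rule, and constant alpha are invented for illustration and do not reflect the paper's construction:

```python
# Toy contrastive steering: the steering vector is the difference of class
# means; a kNN rule scales it up only when the current activation sits in a
# biased neighbourhood (conditional steering).
import numpy as np

rng = np.random.default_rng(0)
d = 8
h_formal = rng.normal(0.0, 1.0, (50, d)) + 1.0   # activations on formally valid prompts
h_biased = rng.normal(0.0, 1.0, (50, d)) - 1.0   # activations showing content bias

# Contrastive steering vector: difference of class means, normalized.
v = h_formal.mean(axis=0) - h_biased.mean(axis=0)
v /= np.linalg.norm(v)

def knn_scale(h, bank, labels, k=5, alpha=2.0):
    """Steer strongly only when h's neighbours are mostly biased examples."""
    dists = np.linalg.norm(bank - h, axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    return alpha * nearest.mean()        # scaled fraction of biased neighbours

bank = np.vstack([h_formal, h_biased])
labels = np.array([0] * 50 + [1] * 50)   # 1 = biased example

h = h_biased[0]                          # an activation to debias
steered = h + knn_scale(h, bank, labels) * v
```

The conditional scale is what lets a static steering vector leave already-unbiased activations nearly untouched, which is the motivation the abstract gives for moving beyond static steering.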
[369] VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
Can Li, Ying Liu, Ting Zhang, Mei Wang, Hua Huang
Main category: cs.AI
TL;DR: VisioMath benchmark for evaluating LMMs' comparative reasoning on visually similar math diagrams shows models struggle with fine-grained visual discrimination and rely on shallow heuristics rather than textual grounding.
Details
Motivation: Current Large Multimodal Models lack sufficient exploration of their capacity for fine-grained comparative reasoning across multiple visually similar inputs, which is crucial for real-world tasks like mathematics education, where learners must distinguish between nearly identical diagrams.
Method: Created the VisioMath benchmark of 1,800 K-12 math problems in which all candidate answers are diagrams with subtle visual similarities. Evaluated state-of-the-art LMMs (both closed- and open-source) and analyzed failure modes. Explored three alignment-oriented strategies, including training-free approaches and finetuning.
Result: LMMs show consistent accuracy decline as inter-image similarity increases. Dominant failure mode is image-text misalignment - models resort to shallow positional heuristics rather than grounding reasoning in textual cues. Alignment strategies achieved substantial accuracy gains.
Conclusion: VisioMath serves as a rigorous benchmark for developing LMMs with deeper diagram understanding, precise comparative reasoning, and grounded multi-image-text integration, addressing current limitations in fine-grained visual discrimination.
Abstract: Large Multimodal Models have achieved remarkable progress in integrating vision and language, enabling strong performance across perception, reasoning, and domain-specific tasks. However, their capacity to reason over multiple, visually similar inputs remains insufficiently explored. Such fine-grained comparative reasoning is central to real-world tasks, especially in mathematics and education, where learners must often distinguish between nearly identical diagrams to identify correct solutions. To address this gap, we present VisioMath, a curated benchmark of 1,800 high-quality K-12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities. A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity increases. Analysis indicates that the dominant failure mode stems from image-text misalignment: rather than grounding reasoning in textual cues, models often resort to shallow positional heuristics, resulting in systematic errors. We further explore three alignment-oriented strategies, spanning training-free approaches and finetuning, and achieve substantial accuracy gains. We hope that VisioMath will serve as a rigorous benchmark and catalyst for developing LMMs toward deeper diagram understanding, precise comparative reasoning, and grounded multi-image-text integration.
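The two analyses the evaluation describes, accuracy as a function of inter-image similarity and a positional-bias check on the chosen options, can be sketched as follows. The records are fabricated stand-ins, not VisioMath data:

```python
# Toy sketch of the VisioMath-style analysis: bucket accuracy by similarity
# and measure how concentrated predictions are on one option position.
from collections import Counter

records = [  # (similarity bucket, chosen option, gold option)
    ("low", "A", "A"), ("low", "C", "C"), ("low", "B", "B"),
    ("high", "A", "C"), ("high", "A", "B"), ("high", "A", "A"),
]

def accuracy(recs, bucket):
    hits = [c == g for b, c, g in recs if b == bucket]
    return sum(hits) / len(hits)

def position_bias(recs):
    """Fraction of predictions going to the single most-chosen option."""
    counts = Counter(c for _, c, _ in recs)
    return max(counts.values()) / len(recs)

assert accuracy(records, "low") > accuracy(records, "high")
```

In this toy data the high-similarity bucket shows both lower accuracy and a collapse onto option "A", the shallow positional heuristic the paper identifies as the dominant failure mode.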
[370] Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMs
Daniel Kilov, Caroline Hendy, Secil Yanik Guyot, Aaron J. Snoswell, Seth Lazar
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2506.13082: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.13082&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[371] A Cognitive Explainer for Fetal ultrasound images classifier Based on Medical Concepts
Yingni Wang, Yunxiao Liu, Licong Dong, Xuzhou Wu, Huabin Zhang, Qiongyu Ye, Desheng Sun, Xiaobo Zhou, Kehong Yuan
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting).
Abstract: Failed to fetch summary for 2201.07798: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2201.07798&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[372] ContextBench: Modifying Contexts for Targeted Latent Activation
Robert Graham, Edward Stevinson, Leo Richter, Alexander Chia, Joseph Miller, Joseph Isaac Bloom
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2506.15735: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.15735&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[373] PepEDiff: Zero-Shot Peptide Binder Design via Protein Embedding Diffusion
Po-Yu Liang, Tibo Duran, Jun Bai
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2601.13327: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13327&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[374] Estimation of Energy-dissipation Lower-bounds for Neuromorphic Learning-in-memory
Zihao Chen, Faiek Ahsan, Johannes Leugering, Gert Cauwenberghs, Shantanu Chakrabartty
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2402.14878: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.14878&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[375] Predictive Coding Networks and Inference Learning: Tutorial and Survey
Björn van Zwol, Ro Jefferson, Egon L. van den Broek
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2407.04117: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.04117&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[376] Localizing and Correcting Errors for LLM-based Planners
Aditya Kumar, William W. Cohen
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) when trying to access arXiv API for paper ID 2602.00276
Abstract: Failed to fetch summary for 2602.00276: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00276&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[377] Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities
Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Song, Sharon Li
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2602.05073: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05073&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[378] From Features to Actions: Explainability in Traditional and Agentic AI Systems
Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Abstract: Failed to fetch summary for 2602.06841: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06841&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[379] FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching
Joongwon Lee, Seonghwan Kim, Seokhyun Moon, Hyunwoo Kim, Woo Youn Kim
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2502.15805: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.15805&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[380] MERIT Feedback Elicits Better Bargaining in LLM Negotiators
Jihwan Oh, Murad Aghazada, Yooju Shin, Se-Young Yun, Taehyeon Kim
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2602.10467: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.10467&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[381] From Tokenizer Bias to Backbone Capability: A Controlled Study of LLMs for Time Series Forecasting
Xinyu Zhang, Shanshan Feng, Xutao Li, Kenghong Lin, Fan Li, Pengfei Jia
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).
Abstract: Failed to fetch summary for 2504.08818: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.08818&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[382] How Well Does Agent Development Reflect Real-World Work?
Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangarajan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, Graham Neubig
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.01203: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01203&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[383] Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence
Ratun Rahman
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2504.17703: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.17703&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[384] Multimodal Mixture-of-Experts with Retrieval Augmentation for Protein Active Site Identification
Jiayang Wu, Jiale Zhou, Rubo Wang, Xingyi Zhang, Xun Lin, Tianxu Lv, Leong Hou U, Yefeng Zheng
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Abstract: Failed to fetch summary for 2603.01511: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01511&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[385] FourierSpecNet: Neural Collision Operator Approximation Inspired by the Fourier Spectral Method for Solving the Boltzmann Equation
Jae Yong Lee, Gwang Jae Jung, Byung Chan Lim, Hyung Ju Hwang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2504.20408: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.20408&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[386] MOOSEnger – a Domain-Specific AI Agent for the MOOSE Ecosystem
Mengnan Li, Jason Miller, Zachary Prince, Alexander Lindsay, Cody Permann
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Abstract: Failed to fetch summary for 2603.04756: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04756&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[387] SEA-TS: Self-Evolving Agent for Autonomous Code Generation of Time Series Forecasting Algorithms
Longkun Xu, Xiaochun Zhang, Qiantu Tuo, Rui Li
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.04873: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04873&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[388] RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, Matthew Gadd
Main category: cs.AI
TL;DR: Failed to fetch summary for paper 2402.10828 due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2402.10828: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.10828&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[389] Algorithmic Collusion by Large Language Models
Sara Fish, Yannai A. Gonczarowski, Ran I. Shorrer
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2404.00806: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.00806&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[390] A Geometric Perspective on the Difficulties of Learning GNN-based SAT Solvers
Geri Skenderi
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2508.21513: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.21513&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[391] Generative Predictive Control: Flow Matching Policies for Dynamic and Difficult-to-Demonstrate Tasks
Vince Kurtz, Joel W. Burdick
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2502.13406 returned HTTP 429 (rate limited).
[392] Adversarial Robustness of Partitioned Quantum Classifiers
Pouya Kananian, Hans-Arno Jacobsen
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2502.20403 returned HTTP 429 (rate limited).
[393] Performance Assessment Strategies for Language Model Applications in Healthcare
Victor Garcia, Mariia Sidulova, Aldo Badano
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2509.08087 returned HTTP 429 (rate limited).
[394] Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check
Chentao Cao, Xiaojun Xu, Bo Han, Hang Li
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2509.11629 returned HTTP 429 (rate limited).
[395] The Malicious Technical Ecosystem: Exposing Limitations in Technical Governance of AI-Generated Non-Consensual Intimate Images of Adults
Michelle L. Ding, Harini Suresh
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2504.17663 returned HTTP 429 (rate limited).
[396] HCT-QA: A Benchmark for Question Answering on Human-Centric Tables
Mohammad S. Ahmad, Zan A. Naeem, Michaël Aupetit, Ahmed Elmagarmid, Mohamed Eltabakh, Xiaosong Ma, Mourad Ouzzani, Chaoyi Ruan, Hani Al-Sayeh
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2504.20047 returned HTTP 429 (rate limited).
[397] Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents
Kaixin Wang, Tianlin Li, Xiaoyu Zhang, Chong Wang, Weisong Sun, Yang Liu, Aishan Liu, Xianglong Liu, Chao Shen, Bin Shi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2505.05283 returned HTTP 429 (rate limited).
[398] Iterative Quantum Feature Maps
Nasa Matsumoto, Quoc Hoan Tran, Koki Chinzei, Yasuhiro Endo, Hirotaka Oshima
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2506.19461 returned HTTP 429 (rate limited).
[399] Bridging MOOCs, Smart Teaching, and AI: A Decade of Evolution Toward a Unified Pedagogy
Bo Yuan, Jiazi Hu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2507.14266 returned HTTP 429 (rate limited).
[400] Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function
Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.04559 returned HTTP 429 (rate limited).
[401] Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation
Maggie Wang, Stephen Tian, Aiden Swann, Ola Shorinwa, Jiajun Wu, Mac Schwager
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.11689 returned HTTP 429 (rate limited).
[402] Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity
GermĂĄn Kruszewski, Pierre Erbacher, Jos Rozen, Marc Dymetman
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.05962 returned HTTP 429 (rate limited).
[403] A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation
Xiaocan Li, Shiliang Wu, Zheng Shen
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.06547 returned HTTP 429 (rate limited).
[404] Data-Driven Global Sensitivity Analysis for Engineering Design Based on Individual Conditional Expectations
Pramudita Satria Palar, Paul Saves, Rommel G. Regis, Koji Shimoyama, Shigeru Obayashi, Nicolas Verstaevel, Joseph Morlier
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.11946 returned HTTP 429 (rate limited).
[405] LA-MARRVEL: A Knowledge-Grounded, Language-Aware LLM Framework for Clinically Robust Rare Disease Gene Prioritization
Jaeyeon Lee, Lin Yao, Hyun-Hwan Jeong, Zhandong Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.02263 returned HTTP 429 (rate limited).
[406] Understanding and Improving Hyperbolic Deep Reinforcement Learning
Timo Klein, Thomas Lang, Andrii Shkabrii, Alexander Sturm, Kevin Sidak, Lukas Miklautz, Claudia Plant, Yllka Velaj, Sebastian Tschiatschek
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.14202 returned HTTP 429 (rate limited).
[407] CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal
Yongxin Wang, Zhicheng Yang, Meng Cao, Mingfei Han, Haokun Lin, Yingying Zhu, Xiaojun Chang, Xiaodan Liang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.19554 returned HTTP 429 (rate limited).
[408] LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs
Bing Hao, Minglai Shao, Zengyi Wo, Yunlong Chu, Yuhang Liu, Ruijie Wang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.22266 returned HTTP 429 (rate limited).
[409] Neural Signals Generate Clinical Notes in the Wild
Jathurshan Pradeepkumar, Zheng Chen, Jimeng Sun
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.22197 returned HTTP 429 (rate limited).
[410] Why Human Guidance Matters in Collaborative Vibe Coding
Haoyu Hu, Raja Marjieh, Katherine M Collins, Chenyi Li, Thomas L. Griffiths, Ilia Sucholutsky, Nori Jacoby
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.10473 returned HTTP 429 (rate limited).
[411] SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents
Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.11210 returned HTTP 429 (rate limited).
[412] Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection
J Alex Corll
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.11247 returned HTTP 429 (rate limited).
[413] An Adaptive Model Selection Framework for Demand Forecasting under Horizon-Induced Degradation to Support Business Strategy and Operations
Adolfo GonzĂĄlez, VĂctor Parada
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.13939 returned HTTP 429 (rate limited).
[414] The Compute ICE-AGE: Invariant Compute Envelope under Addressable Graph Evolution
Raymond Jay Martin II
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.16736 returned HTTP 429 (rate limited).
[415] FLoRG: Federated Fine-tuning with Low-rank Gram Matrices and Procrustes Alignment
Chuiyang Meng, Ming Tang, Vincent W.S. Wong
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.17095 returned HTTP 429 (rate limited).
[416] Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.23008 returned HTTP 429 (rate limited).
[417] Theory of Code Space: Do Code Agents Understand Software Architecture?
Grigory Sapunov
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.00601 returned HTTP 429 (rate limited).
[418] MatRIS: Toward Reliable and Efficient Pretrained Machine Learning Interatomic Potentials
Yuanchang Zhou, Siyu Hu, Xiangyu Zhang, Hongyu Wang, Guangming Tan, Weile Jia
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.02002 returned HTTP 429 (rate limited).
[419] “When to Hand Off, When to Work Together”: Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction
Kihoon Son, Hyewon Lee, DaEun Choi, Yoonsu Kim, Tae Soo Kim, Yoonjoo Lee, John Joon Young Chung, HyunJoon Jung, Juho Kim
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.02050 returned HTTP 429 (rate limited).
[420] Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
Zhanghan Ni, Yanjing Li, Zeju Qiu, Bernhard Schölkopf, Hongyu Guo, Weiyang Liu, Shengchao Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.02406 returned HTTP 429 (rate limited).
[421] Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning
Yoonwoo Kim, Raghav Arora, Roberto MartĂn-MartĂn, Peter Stone, Ben Abbatematteo, Yoonchang Sung
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.03704 returned HTTP 429 (rate limited).
[422] Measuring AI R&D Automation
Alan Chan, Ranay Padarath, Joe Kwon, Hilary Greaves, Markus Anderljung
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.03992 returned HTTP 429 (rate limited).
[423] vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models
Xunzhuo Liu, Huamin Chen, Samzong Lu, Yossi Ovadia, Guohong Wen, Hao Wu, Zhengda Tan, Jintao Zhang, Senan Zedan, Yehudit Kerido, Liav Weiss, Haichen Zhang, Bishen Yu, Asaad Balum, Noa Limoy, Abdallah Samara, Baofa Fan, Brent Salisbury, Ryan Cook, Zhijie Wang, Qiping Pan, Rehan Khan, Avishek Goswami, Houston H. Zhang, Shuyi Wang, Ziang Tang, Fang Han, Zohaib Hassan, Jianqiao Zheng, Avinash Changrani
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.04444 returned HTTP 429 (rate limited).
[424] RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le, Yi Wang, Yuting Zhang, Jun Lv, Chuan Wen, Cewu Lu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.05504 returned HTTP 429 (rate limited).
cs.SD
[425] Koopman Regularized Deep Speech Disentanglement for Speaker Verification
Nikos Chazaridis, Mohammad Belal, Rafael Mestre, Timothy J. Norman, Christine Evers
Main category: cs.SD
TL;DR: DKSD-AE uses Koopman operator learning and instance normalization to disentangle speaker and content representations from speech without textual supervision, achieving competitive speaker verification with fewer parameters.
Details
Motivation: Existing speaker verification systems often rely on labeled data, textual supervision, or large pretrained models, limiting scalability and practical deployment. There is a need for more efficient, unsupervised approaches that can disentangle speaker characteristics from linguistic content.
Method: Proposes the Deep Koopman Speech Disentanglement Autoencoder (DKSD-AE), with a multi-step Koopman operator learning module and instance normalization to separate speaker and content dynamics in speech representations without textual supervision.
Result: Achieves improved or competitive speaker verification performance across multiple datasets while maintaining high content EER, confirming effective disentanglement. Results obtained with substantially fewer parameters and stable performance under increased evaluation scale.
Conclusion: Koopman-based temporal modeling combined with instance normalization provides an efficient and principled solution for speaker-focused representation learning without textual supervision, offering better scalability and deployment potential.
Abstract: Human speech contains both linguistic content and speaker-dependent characteristics, making speaker verification a key technology in identity-critical applications. Modern deep learning speaker verification systems aim to learn speaker representations that are invariant to semantic content and nuisance factors such as ambient noise. However, many existing approaches depend on labelled data, textual supervision, or large pretrained models as feature extractors, limiting scalability and practical deployment and raising sustainability concerns. We propose the Deep Koopman Speech Disentanglement Autoencoder (DKSD-AE), a structured autoencoder that combines a novel multi-step Koopman operator learning module with instance normalization to disentangle speaker and content dynamics. Quantitative experiments across multiple datasets demonstrate that DKSD-AE achieves improved or competitive speaker verification performance compared to state-of-the-art baselines while maintaining high content EER, confirming effective disentanglement. These results are obtained with substantially fewer parameters and without textual supervision. Moreover, performance remains stable under increased evaluation scale, highlighting representation robustness and generalization. Our findings suggest that Koopman-based temporal modelling, when combined with instance normalization, provides an efficient and principled solution for speaker-focused representation learning.
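The multi-step Koopman idea behind DKSD-AE can be sketched in a few lines: fit a linear operator K on a latent trajectory so that powers of K predict several steps ahead. The snippet below is a minimal numpy illustration of that principle only, not the paper's implementation; the function names, latent dimension, and horizon set are invented for the example.

```python
import numpy as np

def fit_koopman(Z):
    """Fit a linear Koopman operator K on a latent trajectory Z (T x d)
    by least squares over one-step pairs: Z[:-1] @ K ~= Z[1:]."""
    K, *_ = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)
    return K

def multistep_error(Z, K, horizons=(1, 2, 3)):
    """Multi-step prediction error: how well K**s maps z_t to z_{t+s}.
    Penalizing several horizons at once pushes the encoder toward
    latents whose dynamics stay linear over time."""
    return {s: float(np.mean((Z[:-s] @ np.linalg.matrix_power(K, s) - Z[s:]) ** 2))
            for s in horizons}

# Toy check: a trajectory generated by a true linear map is fit almost
# exactly, so the multi-step errors are near zero.
rng = np.random.default_rng(0)
K_true = 0.9 * np.linalg.qr(rng.normal(size=(4, 4)))[0]  # stable linear map
Z = np.empty((50, 4))
Z[0] = rng.normal(size=4)
for t in range(1, 50):
    Z[t] = Z[t - 1] @ K_true
K = fit_koopman(Z)
print(multistep_error(Z, K))  # all errors ~ 0
```

In the paper's setting the trajectory Z would come from an encoder and the multi-step error would act as a training regularizer rather than a post-hoc fit.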
[426] Which Data Matter? Embedding-Based Data Selection for Speech Recognition
Zakaria Aldeneh, Skyler Seto, Maureen de Seyssel, Jie Chi, Zijin Gu, Takuya Higuchi, Jee-weon Jung, Shinji Watanabe, David Grangier, Barry-John Theobald, Tatiana Likhomanenko
Main category: cs.SD
TL;DR: Targeted data selection using complementary speech embeddings (speaker, phonetic, semantic) improves specialist ASR models by training on strategically selected 5% subsets that outperform models trained on full 100k-hour datasets.
Details
Motivation: Specialist ASR models for specific domains face challenges when trained on heterogeneous, multi-domain data: they lack the capacity to learn from all available data, and there is a mismatch between training and test conditions. The paper aims to address these issues through targeted data selection.
Method: Represent speech samples using complementary embeddings capturing speaker attributes, phonetic content, and semantic meaning. Analyze how relevance and diversity along these axes affect downstream ASR performance. Use CTC-based Conformer models and select relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains.
Result: Training on strategically selected 5% subsets can exceed the performance of models trained on the full dataset by up to 36.8% relative WER reduction on target domains.
Conclusion: Targeted data selection using complementary speech embeddings is an effective strategy for optimizing specialist ASR models for specific domains, significantly outperforming models trained on full heterogeneous datasets.
Abstract: Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specialist models lack the capacity to learn from all available data, and one must pay closer attention to addressing the mismatch between training and test conditions. In this work, we study targeted data selection as a strategy to address these challenges, selecting relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains. We represent speech samples using embeddings that capture complementary characteristics (speaker attributes, phonetic content, and semantic meaning) and analyze how relevance and diversity along these axes affect downstream ASR performance when performing data selection. Our experiments with CTC-based Conformer models show that training on a strategically selected 5% subset can exceed the performance of models trained on the full dataset by up to 36.8% relative WER reduction on target domains.
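The relevance-versus-diversity trade-off described above can be sketched as a greedy selection over embeddings: score each candidate by cosine similarity to a target-domain centroid (relevance) minus its similarity to the closest item already selected (redundancy). This is a hypothetical illustration of embedding-based selection; `select_subset`, the centroid heuristic, and the `alpha` weighting are assumptions, not the authors' procedure.

```python
import numpy as np

def select_subset(pool, target, k, alpha=0.7):
    """Greedily pick k pool items, trading off relevance to a
    target-domain centroid against diversity within the selection.
    pool: (N, d) candidate embeddings; target: (M, d) target-domain
    embeddings; alpha weights relevance vs. diversity."""
    pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    centroid = target.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    relevance = pool @ centroid              # cosine similarity to target
    chosen = [int(np.argmax(relevance))]     # seed with the most relevant
    for _ in range(k - 1):
        # redundancy: similarity to the closest already-chosen item
        nearest_sim = (pool @ pool[chosen].T).max(axis=1)
        score = alpha * relevance - (1 - alpha) * nearest_sim
        score[chosen] = -np.inf              # never re-pick an item
        chosen.append(int(np.argmax(score)))
    return chosen

# Demo: candidates 0-19 sit near the target domain, 20-39 far from it;
# the selection should come from the in-domain half.
rng = np.random.default_rng(1)
target = rng.normal(size=(10, 8)) + 3.0
pool = np.vstack([rng.normal(size=(20, 8)) + 3.0,
                  rng.normal(size=(20, 8)) - 3.0])
idx = select_subset(pool, target, k=5)
print(sorted(idx))
```

In practice one such pass would be run per embedding type (speaker, phonetic, semantic) and the resulting subsets combined or compared, as the paper's analysis across axes suggests.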
[427] How Well Do Current Speech Deepfake Detection Methods Generalize to the Real World?
Daixian Li, Jun Xue, Yanzhen Ren, Zhuolin Yi, Yihuan Huang, Guanxiang Feng, Yi Chai
Main category: cs.SD
TL;DR: ML-ITW: A multilingual in-the-wild audio deepfake detection benchmark covering 14 languages, 7 platforms, and 180 public figures, revealing significant performance degradation of existing detectors in real-world scenarios.
Details
Motivation: Recent advances in speech synthesis and voice conversion have improved audio naturalness, while social media encoding/compression obscures deepfake artifacts, complicating reliable detection in real-world environments and creating need for representative evaluation benchmarks.
Method: Introduces ML-ITW dataset covering 14 languages, 7 major platforms, and 180 public figures (28.39 hours total). Evaluates three detection paradigms: end-to-end neural models, self-supervised feature-based (SSL) methods, and audio large language models (Audio LLMs).
Result: Experimental results show significant performance degradation across diverse languages and real-world acoustic conditions, highlighting limited generalization ability of existing detectors in practical scenarios.
Conclusion: The ML-ITW dataset addresses the need for representative evaluation benchmarks for audio deepfake detection in real-world multilingual environments, revealing current limitations and enabling future research.
Abstract: Recent advances in speech synthesis and voice conversion have greatly improved the naturalness and authenticity of generated audio. Meanwhile, evolving encoding, compression, and transmission mechanisms on social media platforms further obscure deepfake artifacts. These factors complicate reliable detection in real-world environments, underscoring the need for representative evaluation benchmarks. To this end, we introduce ML-ITW (Multilingual In-The-Wild), a multilingual dataset covering 14 languages, seven major platforms, and 180 public figures, totaling 28.39 hours of audio. We evaluate three detection paradigms: end-to-end neural models, self-supervised feature-based (SSL) methods, and audio large language models (Audio LLMs). Experimental results reveal significant performance degradation across diverse languages and real-world acoustic conditions, highlighting the limited generalization ability of existing detectors in practical scenarios. The ML-ITW dataset is publicly available.
[428] Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel AlumÀe, Mathew Magimai Doss
Main category: cs.SD
TL;DR: RAPTOR is a compact SSL-based audio deepfake detection system using HuBERT and WavLM backbones with pairwise-gated fusion, showing multilingual HuBERT pre-training drives cross-domain robustness better than model scale.
Details
Motivation: Most SSL-based audio deepfake detection research focuses on large wav2vec2-XLSR backbones, leaving compact models understudied. The paper aims to systematically study compact SSL backbones for cross-domain audio deepfake detection.
Method: Proposes RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition) - a controlled study of compact SSL backbones (HuBERT and WavLM) within a unified pairwise-gated fusion detector. Evaluated across 14 cross-domain benchmarks with test-time augmentation protocol using perturbation-based aleatoric uncertainty.
Result: Multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M parameter models to match larger commercial systems. WavLM variants show overconfident miscalibration under perturbation, while iterative mHuBERT remains stable.
Conclusion: SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection. Multilingual pre-training and test-time uncertainty analysis reveal important calibration differences invisible to standard metrics.
Abstract: Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
[429] Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding
Hoseong Ahn, Jeongyun Chae, Yoonji Park, Kyuhong Shim
Main category: cs.SD
TL;DR: Whisper-CD: Training-free contrastive decoding framework for Whisper models that reduces hallucinations and errors in long-form speech recognition by contrasting clean audio logits against acoustically perturbed negative logits.
Details
Motivation: Large encoder-decoder models like Whisper exhibit hallucinations, repetition loops, and content omissions in long-form speech recognition, with errors accumulating when previous segment transcriptions are used as decoding context.
Method: Proposes Whisper-CD, a training-free contrastive decoding framework that contrasts clean-audio logits against negative logits computed from three acoustically motivated perturbations: Gaussian noise injection, silence signal, and audio temporal shift. Aggregates negatives via log-sum-exp operator for unified multi-negative objective in token-by-token decoding.
Result: Across five English long-form benchmarks, Whisper-CD reduces WER by up to 24.3pp on CORAAL and shows 48% faster token generation throughput than beam search.
Conclusion: Whisper-CD operates purely at inference time and can be applied as a drop-in replacement to already-deployed Whisper systems without retraining, offering significant improvements in long-form speech recognition accuracy and efficiency.
Abstract: Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment’s transcription is used as decoding context. We propose Whisper-CD, a training-free contrastive decoding framework that contrasts clean-audio logits against negative logits computed from three acoustically motivated perturbations: Gaussian noise injection, silence signal, and audio temporal shift. We aggregate these negatives via the log-sum-exp operator, building a unified multi-negative objective for token-by-token decoding. Across five English long-form benchmarks, Whisper-CD reduces WER by up to 24.3pp on CORAAL and shows 48% faster token generation throughput than beam search. Because Whisper-CD operates purely at inference time, it can be applied as a drop-in replacement to already-deployed Whisper systems without retraining.
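The multi-negative objective described above can be sketched per decoding step as follows; the contrast-strength weight `alpha` is an assumed hyperparameter, not taken from the paper:

```python
import numpy as np

def contrastive_logits(clean, negatives, alpha=0.5):
    """Contrast clean-audio logits against perturbed-audio logits.

    clean:     (vocab,) logits from the unperturbed audio
    negatives: (n_neg, vocab) logits from the perturbed inputs
               (noise injection, silence, temporal shift)
    """
    # Aggregate the negative logits with log-sum-exp across perturbations.
    agg_neg = np.log(np.sum(np.exp(negatives), axis=0))
    # Penalize tokens that remain likely even under acoustic perturbation,
    # which is where hallucinated or looping tokens tend to live.
    return clean - alpha * agg_neg
```

The adjusted logits would then feed the usual token-by-token decoder in place of the raw clean logits.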
[430] Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input
Changsong Liu, Tianrui Wang, Ye Ni, Yizhou Peng, Eng Siong Chng
Main category: cs.SD
TL;DR: A streaming TTS method using prosodic-boundary-aware post-training of LLM-based TTS models to handle incremental text input with natural prosody and avoid long-form collapse.
Details
Motivation: Streaming TTS systems that receive incremental text input face two key challenges: unnatural prosody due to missing future context (lookahead) and performance degradation in long-form synthesis due to unbounded context accumulation.
Method: Proposes a prosodic-boundary-aware post-training strategy that adapts a pretrained LLM-based TTS model using weakly time-aligned data. The model learns to stop early at content boundaries when given limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens to ensure bounded context and seamless concatenation.
Result: Outperforms CosyVoice-Style interleaved baseline in both short and long-form scenarios. In long-text synthesis, achieves 66.2% absolute reduction in word error rate (from 71.0% to 4.8%) and increases speaker and emotion similarity by a relative 16.1% and 1.5%.
Conclusion: Provides a robust solution for streaming TTS with incremental text input by addressing both prosody and long-form collapse issues through boundary-aware adaptation and sliding-window context management.
Abstract: Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy, adapting a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model is adapted to learn early stopping at specified content boundaries when provided with limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show our method outperforms the CosyVoice-Style interleaved baseline in both short and long-form scenarios. In long-text synthesis, especially, it achieves a 66.2% absolute reduction in word error rate (from 71.0% to 4.8%) and increases speaker and emotion similarity by a relative 16.1% and 1.5%, offering a robust solution for streaming TTS with incremental text.
[431] RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering
Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo
Main category: cs.SD
TL;DR: RAMoEA-QA is a hierarchically routed multimodal model for respiratory audio question answering that uses Mixture-of-Experts for audio encoding and Mixture-of-Adapters for language generation to handle diverse recording conditions and query intents.
Details
Motivation: Healthcare AI needs to integrate heterogeneous patient signals and support diverse interaction styles. In respiratory care, audio recordings vary widely across devices, environments, and protocols, and existing biomedical audio-language QA systems are monolithic without specialization mechanisms for handling diverse respiratory corpora and query intents.
Method: Two-stage conditional specialization: Audio Mixture-of-Experts routes recordings to suitable pre-trained audio encoders, and Language Mixture-of-Adapters selects LoRA adapters on a frozen LLM to match query intent and answer format. This specializes both acoustic representations and generation behavior per example.
Result: RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, achieving in-domain test accuracy of 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting strongest generalization for diagnosis under domain, modality, and task shifts.
Conclusion: The hierarchical routing approach enables effective handling of heterogeneity in respiratory audio QA, demonstrating improved performance and generalization across diverse real-world conditions through specialized audio encoding and language generation.
Abstract: Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and question formats. Existing biomedical audio-language QA systems are typically monolithic, without any specialization mechanisms for tackling diverse respiratory corpora and query intents. They are also only validated in limited settings, leaving it unclear how reliably they handle the shifts encountered in real-world settings. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.
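The two-stage routing can be reduced to a toy sketch; the router weights and a simple argmax selection are assumptions for illustration, not details from the paper:

```python
import numpy as np

def route(audio_feat, query_feat, enc_router_w, adap_router_w):
    """Two-stage conditional specialization, per example.

    Stage 1: the audio router picks one pre-trained encoder per recording.
    Stage 2: the language router picks one LoRA adapter per query intent.
    Returns (encoder_index, adapter_index).
    """
    encoder_idx = int(np.argmax(enc_router_w @ audio_feat))
    adapter_idx = int(np.argmax(adap_router_w @ query_feat))
    return encoder_idx, adapter_idx
```

The selected encoder would embed the recording and the selected adapter would condition generation on the shared frozen LLM.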
[432] HVAC-EAR: Eavesdropping Human Speech Using HVAC Systems
Tarikul Islam Tamiti, Biraj Joshi, Rida Hasan, Anomadarshi Barua
Main category: cs.SD
TL;DR: HVAC-EAR reconstructs intelligible speech from low-resolution pressure sensor data in HVAC systems using complex-valued conformer networks, raising privacy concerns about acoustic eavesdropping through building infrastructure.
Details
Motivation: Pressure sensors in HVAC systems are sensitive to acoustic pressure and can be exploited for eavesdropping, creating novel privacy vulnerabilities in modern buildings.
Method: Uses a complex-valued conformer with Complex Unified Attention Block to capture phoneme dependencies, enabling speech reconstruction from as low as 0.5 kHz sampling rate, and mitigates HVAC noise by reconstructing both magnitude and phase of missing frequencies.
Result: First evaluation on real-world HVAC deployments shows significant intelligibility up to 1.2 m distance, surpassing prior work limited to hot word detection.
Conclusion: Demonstrates that HVAC pressure sensors can be exploited for speech reconstruction, raising serious privacy concerns about acoustic eavesdropping through building infrastructure.
Abstract: Pressure sensors are widely integrated into modern Heating, Ventilation and Air Conditioning (HVAC) systems. As they are sensitive to acoustic pressure, they can be a source of eavesdropping. We introduce HVAC-EAR, which reconstructs intelligible speech from low-resolution, noisy pressure data with two key contributions: (i) We achieve intelligible reconstruction from as low as 0.5 kHz sampling rate, surpassing prior work limited to hot word detection, by employing a complex-valued conformer with a Complex Unified Attention Block to capture phoneme dependencies; (ii) We mitigate transient HVAC noise by reconstructing both magnitude and phase of missing frequencies. For the first time, evaluations on real-world HVAC deployments show significant intelligibility up to 1.2 m distance, raising novel privacy concerns.
cs.LG
[433] Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized, Verifiable Policies for Safe, Robust, and Efficient Agents
Peiran Li, Jiashuo Sun, Fangzhou Lin, Shuo Xing, Tianfu Fu, Suofei Feng, Chaoqun Ni, Zhengzhong Tu
Main category: cs.LG
TL;DR: Traversal-as-Policy framework distills LLM execution logs into Gated Behavior Trees for safer, more efficient autonomous agents by replacing unconstrained generation with structured tree traversal and safety gates.
Details
Motivation: Current autonomous LLM agents fail due to implicit long-horizon policies in model weights and transcripts, with safety measures added retroactively rather than being built-in.
Method: Distill sandboxed execution logs into executable Gated Behavior Trees (GBTs) where tree traversal serves as control policy. Nodes encode state-conditioned action macros from successful trajectories with safety gates from unsafe traces. Lightweight traverser matches model intent to macros, executes under gating, and performs risk-aware recovery when stalled.
Result: GBT significantly improves success rates while reducing violations and costs. On SWE-bench Verified: success increased from 34.6% to 73.6%, violations reduced from 2.8% to 0.2%, token usage cut from 208k to 126k. 8B executors more than doubled success on SWE-bench Verified (14.0% → 58.8%) and WebArena (9.1% → 37.3%).
Conclusion: Traversal-as-Policy with Gated Behavior Trees provides a structured approach to autonomous LLM agents that improves reliability, safety, and efficiency by replacing unconstrained generation with deterministic, gated traversal policies.
Abstract: Autonomous LLM agents fail because long-horizon policy remains implicit in model weights and transcripts, while safety is retrofitted post hoc. We propose Traversal-as-Policy: distill sandboxed OpenHands execution logs into a single executable Gated Behavior Tree (GBT) and treat tree traversal – rather than unconstrained generation – as the control policy whenever a task is in coverage. Each node encodes a state-conditioned action macro mined and merge-checked from successful trajectories; macros implicated by unsafe traces attach deterministic pre-execution gates over structured tool context and bounded history, updated under experience-grounded monotonicity so previously rejected unsafe contexts cannot be re-admitted. At runtime, a lightweight traverser matches the base model’s intent to child macros, executes one macro at a time under global and node-local gating, and when stalled performs risk-aware shortest-path recovery to a feasible success leaf; the visited path forms a compact spine memory that replaces transcript replay. Evaluated in a unified OpenHands sandbox on 15+ software, web, reasoning, and safety/security benchmarks, GBT improves success while driving violations toward zero and reducing cost. On SWE-bench Verified (Protocol A, 500 issues), GBT-SE raises success from 34.6% to 73.6%, reduces violations from 2.8% to 0.2%, and cuts token/character usage from 208k/820k to 126k/490k; with the same distilled tree, 8B executors more than double success on SWE-bench Verified (14.0% → 58.8%) and WebArena (9.1% → 37.3%).
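A minimal sketch of the gated-macro idea, with all names and the gate predicate invented for illustration (the paper's node format is not specified in this summary):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class GBTNode:
    """Toy gated behavior-tree node: a state-conditioned action macro
    guarded by deterministic pre-execution gates distilled from unsafe
    traces. The macro runs only if every gate admits the context."""
    name: str
    macro: Callable[[dict], str]
    gates: list = field(default_factory=list)

    def try_execute(self, context: dict):
        if any(not gate(context) for gate in self.gates):
            return None  # gated out: the traverser must pick another path
        return self.macro(context)

# Hypothetical example: a file-removal macro gated to the sandbox root.
node = GBTNode(
    name="rm_file",
    macro=lambda ctx: f"rm {ctx['path']}",
    gates=[lambda ctx: ctx["path"].startswith("/workspace/")],
)
node.try_execute({"path": "/workspace/tmp.log"})  # admitted
node.try_execute({"path": "/etc/passwd"})         # rejected by the gate
```

The monotonic-update rule in the abstract would only ever tighten such gates, never re-admit a previously rejected context.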
[434] VDCook: DIY video data cook your MLLMs
Chengwei Wu
Main category: cs.LG
TL;DR: VDCook is a self-evolving video data operating system that enables researchers to create specialized video datasets through natural language queries, combining real video retrieval with controlled synthesis.
Details
Motivation: Traditional video datasets are static and one-time-built, making it difficult for researchers to create specialized, up-to-date training data for specific domains. There's a need for dynamic, continuously updatable video data ecosystems that can adapt to evolving research needs.
Method: VDCook uses natural language queries with adjustable parameters (scale, retrieval-synthesis ratio, quality threshold) to automatically perform query optimization. It concurrently runs real video retrieval and controlled synthesis modules, generating data packages with complete provenance and metadata. The system employs MCP (Model Context Protocol) for automated data ingestion and provides multi-dimensional metadata annotation including scene segmentation, motion scoring, OCR ratio, and automatic captioning.
Result: The system creates in-domain video data packages with complete metadata and reproducible Notebooks, enabling continuous updates and domain expansion. It transforms datasets into dynamically evolving open ecosystems with community contributions and governance-enabled data expansion.
Conclusion: VDCook significantly lowers the barrier to constructing specialized video training datasets through infrastructure-level solutions, supporting flexible data construction and indexing for research communities.
Abstract: We introduce VDCook: a self-evolving video data operating system, a configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically performs query optimization, concurrently running real video retrieval and controlled synthesis modules. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible Notebooks. Unlike traditional static, one-time-built datasets, VDCook enables continuous updates and domain expansion through its automated data ingestion mechanism based on MCP (Model Context Protocol), transforming datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible subsequent data "cooking" and indexing. This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. Project demo: https://screenapp.io/app/v/WP0SvffgsH
[435] JAWS: Enhancing Long-term Rollout of Neural Operators via Spatially-Adaptive Jacobian Regularization
Fengxiang Nie, Yasuhiro Suzuki
Main category: cs.LG
TL;DR: JAWS introduces spatially-adaptive regularization for neural operator learning that modulates contraction strength based on local physical complexity, improving stability while preserving shock features in dynamical systems.
Details
Motivation: Autoregressive rollouts in data-driven surrogate models for dynamical systems suffer from instability and spectral blow-up. Global regularization techniques uniformly damp high-frequency features, creating a contraction-dissipation dilemma. Long-horizon trajectory optimization methods are bottlenecked by memory constraints.
Method: Proposes Jacobian-Adaptive Weighting for Stability (JAWS), a probabilistic regularization strategy that frames operator learning as Maximum A Posteriori estimation with spatially heteroscedastic uncertainty. It dynamically modulates regularization strength based on local physical complexity, enforcing contraction in smooth regions while relaxing constraints near singular features.
Result: JAWS serves as an effective spectral pre-conditioner, reducing the base operator’s burden of handling high-frequency instabilities. Enables memory-efficient, short-horizon trajectory optimization to match or exceed long-horizon baselines. On 1D viscous Burgers’ equation, improves long-term stability, shock fidelity, and out-of-distribution generalization while reducing training costs.
Conclusion: Spatially-adaptive regularization via JAWS effectively addresses the contraction-dissipation dilemma in neural operator learning, enabling stable long-term predictions while preserving important physical features like shocks, with computational efficiency benefits.
Abstract: Data-driven surrogate models improve the efficiency of simulating continuous dynamical systems, yet their autoregressive rollouts are often limited by instability and spectral blow-up. While global regularization techniques can enforce contractive dynamics, they uniformly damp high-frequency features, introducing a contraction-dissipation dilemma. Furthermore, long-horizon trajectory optimization methods that explicitly correct drift are bottlenecked by memory constraints. In this work, we propose Jacobian-Adaptive Weighting for Stability (JAWS), a probabilistic regularization strategy designed to mitigate these limitations. By framing operator learning as Maximum A Posteriori (MAP) estimation with spatially heteroscedastic uncertainty, JAWS dynamically modulates the regularization strength based on local physical complexity. This allows the model to enforce contraction in smooth regions to suppress noise, while relaxing constraints near singular features to preserve gradients, effectively realizing a behavior similar to numerical shock-capturing schemes. Experiments demonstrate that this spatially-adaptive prior serves as an effective spectral pre-conditioner, which reduces the base operator’s burden of handling high-frequency instabilities. This reduction enables memory-efficient, short-horizon trajectory optimization to match or exceed the long-term accuracy of long-horizon baselines. Evaluated on the 1D viscous Burgers’ equation, our hybrid approach improves long-term stability, shock fidelity, and out-of-distribution generalization while reducing training computational costs.
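The paper's exact complexity measure is not given in this summary; as an illustrative stand-in, local gradient magnitude can drive a spatially-adaptive regularization weight, strong in smooth regions and relaxed near shocks:

```python
import numpy as np

def jaws_weights(u, eps=1e-6):
    """Illustrative spatially-adaptive regularization weights for a 1D
    field `u`. Smooth regions get weights near 1 (strong contraction);
    regions with sharp features get weights near 0 (relaxed constraint),
    mimicking the shock-capturing behavior described in the abstract."""
    grad = np.abs(np.gradient(u))
    return 1.0 / (1.0 + grad / (grad.mean() + eps))
```

In a MAP formulation, such weights would scale the per-location contraction penalty in the training loss.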
[436] IntSeqBERT: Learning Arithmetic Structure in OEIS via Modulo-Spectrum Embeddings
Kazuhisa Nakasho
Main category: cs.LG
TL;DR: IntSeqBERT: A dual-stream Transformer encoder for masked integer-sequence modeling on OEIS using magnitude embeddings and modulo embeddings, achieving significant improvements over tokenized baselines.
Details
Motivation: Standard tokenized models struggle with OEIS sequences due to out-of-vocabulary values and inability to exploit periodic arithmetic structure in integer sequences ranging from small constants to astronomical values.
Method: Dual-stream Transformer encoder with continuous log-scale magnitude embeddings and sin/cos modulo embeddings for 100 residues (moduli 2-101), fused via FiLM. Three prediction heads for magnitude regression, sign classification, and modulo prediction.
Result: 95.85% magnitude accuracy and 50.38% Mean Modulo Accuracy, outperforming tokenized Transformer baseline by +8.9pt and +4.5pt. Next-term prediction improved 7.4-fold (19.09% vs 2.59%). Modulo analysis shows composite moduli capture arithmetic structure more efficiently.
Conclusion: IntSeqBERT effectively models integer sequences by combining magnitude and modulo representations, demonstrating the importance of arithmetic structure in sequence modeling and providing empirical evidence for CRT-based aggregation.
Abstract: Integer sequences in the OEIS span values from single-digit constants to astronomical factorials and exponentials, making prediction challenging for standard tokenised models that cannot handle out-of-vocabulary values or exploit periodic arithmetic structure. We present IntSeqBERT, a dual-stream Transformer encoder for masked integer-sequence modelling on OEIS. Each sequence element is encoded along two complementary axes: a continuous log-scale magnitude embedding and sin/cos modulo embeddings for 100 residues (moduli $2$–$101$), fused via FiLM. Three prediction heads (magnitude regression, sign classification, and modulo prediction for 100 moduli) are trained jointly on 274,705 OEIS sequences. At the Large scale (91.5M parameters), IntSeqBERT achieves 95.85% magnitude accuracy and 50.38% Mean Modulo Accuracy (MMA) on the test set, outperforming a standard tokenised Transformer baseline by $+8.9$ pt and $+4.5$ pt, respectively. An ablation removing the modulo stream confirms it accounts for $+15.2$ pt of the MMA gain and contributes an additional $+6.2$ pt to magnitude accuracy. A probabilistic Chinese Remainder Theorem (CRT)-based Solver converts the model’s predictions into concrete integers, yielding a 7.4-fold improvement in next-term prediction over the tokenised-Transformer baseline (Top-1: 19.09% vs. 2.59%). Modulo spectrum analysis reveals a strong negative correlation between Normalised Information Gain (NIG) and Euler’s totient ratio $\varphi(m)/m$ ($r = -0.851$, $p < 10^{-28}$), providing empirical evidence that composite moduli capture OEIS arithmetic structure more efficiently via CRT aggregation.
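The input encoding described above can be sketched directly; the exact normalization and fusion details are the model's (FiLM), so this only shows the two raw streams:

```python
import math

def encode_term(n, moduli=range(2, 102)):
    """Encode one integer along the two axes from the abstract:
    a signed log-scale magnitude and a sin/cos pair per residue
    for moduli 2..101 (100 moduli, 200 values)."""
    magnitude = math.copysign(math.log1p(abs(n)), n) if n else 0.0
    mod_spectrum = []
    for m in moduli:
        r = n % m  # Python's % yields a residue in [0, m) even for n < 0
        angle = 2.0 * math.pi * r / m
        mod_spectrum.extend([math.sin(angle), math.cos(angle)])
    return magnitude, mod_spectrum
```

The sin/cos pair makes each residue periodic and continuous, which is what lets the modulo stream expose arithmetic structure (and feed the CRT-based solver) regardless of the raw value's size.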
[437] Autocorrelation effects in a stochastic-process model for decision making via time series
Tomoki Yamagami, Mikio Hasegawa, Takatomo Mihana, Ryoichi Horisaki, Atsushi Uchida
Main category: cs.LG
TL;DR: Analysis of how autocorrelation in chaotic time series affects decision-making performance in two-armed bandit problems, revealing environment-dependent optimal autocorrelation patterns.
Details
Motivation: Previous experiments showed decision accuracy in photonic chaotic systems depends on autocorrelation of time series, but lacked minimal mathematical model to explain this benefit.
Method: Developed stochastic-process model using tug-of-war principle for two-armed bandit problem, where threshold and two-valued Markov signal evolve jointly. Analyzed how autocorrelation affects performance in different reward environments.
Result: Revealed environment-dependent structure: negative autocorrelation optimal in reward-rich environments, positive autocorrelation optimal in reward-poor environments. Performance independent of autocorrelation when sum of winning probabilities equals 1.
Conclusion: Study provides mathematical foundation for understanding autocorrelation benefits in decision-making, with potential applications in reinforcement learning for wireless communications and robotics.
Abstract: Decision makers exploiting photonic chaotic dynamics obtained by semiconductor lasers provide an ultrafast approach to solving multi-armed bandit problems by using a temporal optical signal as the driving source for sequential decisions. In such systems, the sampling interval of the chaotic waveform shapes the temporal correlation of the resulting time series, and experiments have reported that decision accuracy depends strongly on this autocorrelation property. However, it remains unclear whether the benefit of autocorrelation can be explained by a minimal mathematical model. Here, we analyze a stochastic-process model of the time-series-based decision making using the tug-of-war principle for solving the two-armed bandit problem, where the threshold and a two-valued Markov signal evolve jointly. Numerical results reveal an environment-dependent structure: negative (positive) autocorrelation is optimal in reward-rich (reward-poor) environments. These findings show that negative autocorrelation of the time series is advantageous when the sum of the winning probabilities is more than $1$, whereas positive autocorrelation is useful when the sum of the winning probabilities is less than $1$. Moreover, the performance is independent of autocorrelation if the sum of the winning probabilities equals $1$, which is mathematically clarified. This study paves the way for improving the decision-making scheme for reinforcement learning applications in wireless communications and robotics.
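A two-valued Markov signal with tunable lag-1 autocorrelation, as used in the model, is easy to simulate: with stay probability p, the lag-1 autocorrelation of a ±1 signal is 2p − 1. A minimal sketch (the paper's joint threshold dynamics are omitted):

```python
import random

def markov_signal(length, autocorr, seed=0):
    """Two-valued (+1/-1) Markov signal whose lag-1 autocorrelation
    equals `autocorr`, via stay probability (1 + autocorr) / 2."""
    rng = random.Random(seed)
    stay = (1.0 + autocorr) / 2.0
    x = [rng.choice((-1, 1))]
    for _ in range(length - 1):
        x.append(x[-1] if rng.random() < stay else -x[-1])
    return x
```

Sweeping `autocorr` from negative to positive for bandit environments with different winning-probability sums would reproduce the environment-dependent structure the paper reports.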
[438] Towards Efficient and Stable Ocean State Forecasting: A Continuous-Time Koopman Approach
Rares Grozavescu, Pengyu Zhang, Mark Girolami, Etienne Meunier
Main category: cs.LG
TL;DR: CT-KAE is a lightweight surrogate model for ocean forecasting that uses Koopman theory to project nonlinear dynamics into a latent linear ODE space, enabling stable long-horizon predictions with bounded error growth.
Details
Motivation: Need for efficient and stable long-horizon forecasting in ocean/climate modeling, where traditional numerical solvers are computationally expensive and autoregressive ML models suffer from error accumulation and drift over time.
Method: Continuous-Time Koopman Autoencoder projects nonlinear ocean dynamics into a latent space governed by a linear ordinary differential equation, using matrix exponential formulation for temporally resolution-invariant forecasting.
Result: CT-KAE shows bounded error growth and stable large-scale statistics over 2083-day rollouts, outperforming autoregressive Transformers which exhibit gradual error amplification and energy drift. Achieves orders-of-magnitude faster inference than numerical solvers.
Conclusion: Continuous-time Koopman surrogates offer a promising backbone for efficient and stable hybrid physical-machine learning climate models, balancing computational efficiency with long-term stability.
Abstract: We investigate the Continuous-Time Koopman Autoencoder (CT-KAE) as a lightweight surrogate model for long-horizon ocean state forecasting in a two-layer quasi-geostrophic (QG) system. By projecting nonlinear dynamics into a latent space governed by a linear ordinary differential equation, the model enforces structured and interpretable temporal evolution while enabling temporally resolution-invariant forecasting via a matrix exponential formulation. Across 2083-day rollouts, CT-KAE exhibits bounded error growth and stable large-scale statistics, in contrast to autoregressive Transformer baselines which exhibit gradual error amplification and energy drift over long rollouts. While fine-scale turbulent structures are partially dissipated, bulk energy spectra, enstrophy evolution, and autocorrelation structure remain consistent over long horizons. The model achieves orders-of-magnitude faster inference compared to the numerical solver, suggesting that continuous-time Koopman surrogates offer a promising backbone for efficient and stable hybrid physical-machine learning climate models.
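The core mechanism, a latent linear ODE advanced by a matrix exponential, fits in a few lines. The 2-D generator matrix below is a toy stand-in for a trained Koopman operator, and the Taylor-based `expm` is a minimal sketch (production code would use `scipy.linalg.expm`):

```python
import numpy as np

def expm(A):
    """Matrix exponential via scaling-and-squaring of a truncated Taylor series."""
    n = A.shape[0]
    s = max(0, int(np.ceil(np.log2(max(np.linalg.norm(A, 1), 1e-12)))) + 1)
    B = A / 2**s
    E, term = np.eye(n), np.eye(n)
    for k in range(1, 20):
        term = term @ B / k
        E = E + term
    for _ in range(s):
        E = E @ E
    return E

def forecast(z0, A, dt, n_steps):
    """Latent linear ODE dz/dt = A z, rolled out with z(t+dt) = expm(A*dt) @ z(t)."""
    P = expm(A * dt)
    traj = [z0]
    for _ in range(n_steps):
        traj.append(P @ traj[-1])
    return np.stack(traj)

# A slightly damped rotation: long rollouts stay bounded instead of blowing up.
A = np.array([[-0.05, -1.0], [1.0, -0.05]])
traj = forecast(np.array([1.0, 0.0]), A, dt=0.1, n_steps=500)
```

Because only `A * dt` enters the propagator, halving `dt` and doubling `n_steps` lands on the same trajectory points, which is the temporal resolution invariance the abstract highlights.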
[439] When AI Levels the Playing Field: Skill Homogenization, Asset Concentration, and Two Regimes of Inequality
Xupeng Chen, Shuchen Meng
Main category: cs.LG
TL;DR: AI equalizes individual skill differences but may increase aggregate inequality by shifting value to complementary assets, with effects depending on AI’s technology structure and labor market institutions.
Details
Motivation: The paper aims to resolve the paradox where generative AI appears to equalize individual performance differences within tasks while potentially widening aggregate economic inequality, by examining how AI shifts economic value toward concentrated complementary assets.
Method: Develops a task-based model with endogenous education, employer screening, and heterogeneous firms. Uses Method of Simulated Moments to match six empirical targets, with sensitivity analysis to identify key parameters. Tests predictions using BLS OEWS data (2019-2023) occupation-level regressions.
Result: The model yields two regimes depending on AI’s technology structure (proprietary vs. commodity) and labor market institutions. At the calibrated parameters, the sign of the aggregate inequality effect is pinned down by specific moments, while AI’s technology structure independently determines which regime applies. Occupation-level data cannot test the model’s task-level predictions.
Conclusion: The contribution is the mechanism explaining the paradox, not a definitive verdict on inequality direction. Predictions require within-occupation, within-task panel data not yet available at scale.
Abstract: Generative AI compresses within-task skill differences while shifting economic value toward concentrated complementary assets, creating an apparent paradox: the technology that equalizes individual performance may widen aggregate inequality. We formalize this tension in a task-based model with endogenous education, employer screening, and heterogeneous firms. The model yields two regimes whose boundary depends on AI’s technology structure (proprietary vs. commodity) and labor market institutions (rent-sharing elasticity, asset concentration). A scenario analysis via Method of Simulated Moments, matching six empirical targets, disciplines the model’s quantitative magnitudes; a sensitivity decomposition reveals that the five non-$\Delta$Gini moments identify mechanism rates but not the aggregate sign, which at the calibrated parameters is pinned by $m_6$ and $\xi$, while AI’s technology structure ($\eta_1$ vs. $\eta_0$) independently crosses the boundary. The contribution is the mechanism – not a verdict on the sign. Occupation-level regressions using BLS OEWS data (2019–2023) illustrate why such data cannot test the model’s task-level predictions. The predictions are testable with within-occupation, within-task panel data that do not yet exist at scale.
[440] Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment
Xiang Ma, Lexin Fang, Litian Xu, Caiming Zhang
Main category: cs.LG
TL;DR: CDDS is a novel cross-modal alignment algorithm that uses constrained decoupling and distribution sampling to achieve semantic consistency between vision and language by separating semantic and modality components in embeddings.
Details
Motivation: Traditional cross-modal alignment methods pursue embedding consistency but ignore non-semantic information in embeddings, leading to suboptimal semantic alignment. There's a need to decouple embeddings into semantic and modality components, but this faces challenges in distinguishing these components and dealing with modality gaps that cause alignment deviation.
Method: Proposes CDDS with two key components: (1) A dual-path UNet to adaptively decouple embeddings with multiple constraints for effective separation of semantic and modality components, and (2) A distribution sampling method to bridge the modality gap and ensure alignment rationality.
Result: Extensive experiments on various benchmarks and model backbones show CDDS outperforms state-of-the-art methods by 6.6% to 14.2%.
Conclusion: CDDS effectively addresses cross-modal alignment challenges by decoupling semantic and modality information and bridging modality gaps, achieving superior performance in vision-language semantic consistency tasks.
Abstract: Cross-modal alignment is a crucial task in multimodal learning aimed at achieving semantic consistency between vision and language. This requires that image-text pairs exhibit similar semantics. Traditional algorithms pursue embedding consistency to achieve semantic consistency, ignoring the non-semantic information present in the embedding. An intuitive approach is to decouple the embeddings into semantic and modality components, aligning only the semantic component. However, this introduces two main challenges: (1) There is no established standard for distinguishing semantic and modal information. (2) The modality gap can cause semantic alignment deviation or information loss. To align the true semantics, we propose a novel cross-modal alignment algorithm via Constrained Decoupling and Distribution Sampling (CDDS). Specifically, (1) A dual-path UNet is introduced to adaptively decouple the embeddings, applying multiple constraints to ensure effective separation. (2) A distribution sampling method is proposed to bridge the modality gap, ensuring the rationality of the alignment process. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of CDDS, outperforming state-of-the-art methods by 6.6% to 14.2%.
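A minimal sketch of the decoupling idea: split each embedding into a semantic part and a modality part, penalize the split with constraints, and align only the semantic parts. The reconstruction and orthogonality penalties below are simplified, hypothetical stand-ins for the paper's dual-path UNet constraints:

```python
import numpy as np

def decouple_losses(e, s, m):
    """Constraint terms for splitting an embedding e into semantic s + modality m.

    Hypothetical stand-ins for the paper's 'multiple constraints':
    reconstruction (s + m must recover e) and decorrelation (s should carry
    no modality direction).
    """
    recon = float(np.mean((s + m - e) ** 2))             # exact decomposition
    ortho = float(np.mean(np.sum(s * m, axis=-1) ** 2))  # components decorrelated
    return recon, ortho

def align_semantic(s_img, s_txt):
    """Cosine alignment on the semantic components only, ignoring modality parts."""
    num = np.sum(s_img * s_txt, axis=-1)
    den = np.linalg.norm(s_img, axis=-1) * np.linalg.norm(s_txt, axis=-1) + 1e-8
    return num / den

rng = np.random.default_rng(0)
s = rng.normal(size=(4, 8))   # toy "semantic" components
m = rng.normal(size=(4, 8))   # toy "modality" components
recon, ortho = decouple_losses(s + m, s, m)
```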
[441] Aligning Compound AI Systems via System-level DPO
Xiangwen Wang, Yibo Jacky Zhang, Zhoujie Ding, Katherine Tsai, Haolun Wu, Sanmi Koyejo
Main category: cs.LG
TL;DR: SysDPO extends Direct Preference Optimization to align compound AI systems with multiple interacting components, addressing challenges of non-differentiable interactions and system-level preference transformation.
Details
Motivation: Compound AI systems with multiple components (LLMs, foundation models, external tools) outperform single models but are challenging to align with human preferences due to non-differentiable interactions and difficulty transforming system-level preferences to component-level preferences.
Method: Formulate compound AI systems as Directed Acyclic Graphs (DAGs) to model component interactions and data flows. Introduce SysDPO framework with two variants: SysDPO-Direct (when system-specific preference dataset exists) and SysDPO-Sampling (when constructing preference dataset).
Result: Empirical demonstration across two applications: 1) joint alignment of language model and diffusion model, and 2) joint alignment of LLM collaboration system.
Conclusion: SysDPO successfully extends DPO to enable joint system-level alignment of compound AI systems, addressing the challenges of aligning multi-component systems with human preferences.
Abstract: Compound AI systems, comprising multiple interacting components such as LLMs, foundation models, and external tools, have demonstrated remarkable improvements compared to single models in various tasks. To ensure their effective deployment in real-world applications, aligning these systems with human preferences is crucial. However, aligning the compound system via policy optimization, unlike the alignment of a single model, is challenging for two main reasons: (i) non-differentiable interactions between components make end-to-end gradient-based optimization methods inapplicable, and (ii) system-level preferences cannot be directly transformed into component-level preferences. To address these challenges, we first formulate compound AI systems as Directed Acyclic Graphs (DAGs), explicitly modeling both component interactions and the associated data flows. Building on this formulation, we introduce SysDPO, a framework that extends Direct Preference Optimization (DPO) to enable joint system-level alignment. We propose two variants, SysDPO-Direct and SysDPO-Sampling, tailored for scenarios depending on whether we construct a system-specific preference dataset. We empirically demonstrate the effectiveness of our approach across two applications: the joint alignment of a language model and a diffusion model, and the joint alignment of an LLM collaboration system.
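The standard DPO loss underlying SysDPO is easy to state. The system-level step sketched here, summing component log-probabilities along the DAG before applying the loss, is a simplifying assumption for illustration, not a quote of the SysDPO derivation:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (preferred log-ratio - rejected log-ratio))."""
    gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-gap)))

def system_logp(component_logps):
    """Hypothetical system-level log-likelihood: viewing the system as a DAG,
    the log-probability of a full trajectory is taken here as the sum of
    per-component log-probs along the data flow (a modeling assumption)."""
    return float(sum(component_logps))

# A preferred system output vs. a rejected one; each output is produced by two
# components (e.g. an LLM and a diffusion model), so log-probs are summed first.
loss = dpo_loss(
    logp_w=system_logp([-1.0, -2.0]), logp_l=system_logp([-3.0, -4.0]),
    ref_logp_w=system_logp([-2.0, -2.5]), ref_logp_l=system_logp([-2.5, -3.0]),
)
```

The loss is below log 2 exactly when the policy assigns the preferred system output a larger log-ratio improvement than the rejected one, just as in single-model DPO.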
[442] FuseDiff: Symmetry-Preserving Joint Diffusion for Dual-Target Structure-Based Drug Design
Jianliang Wu, Anjie Qiao, Zhen Wang, Zhewei Wei, Sheng Chen
Main category: cs.LG
TL;DR: FuseDiff is an end-to-end diffusion model that jointly generates a ligand molecular graph and two pocket-specific binding poses for dual-target drug design, enabling polypharmacological therapies with improved efficacy.
Details
Motivation: Existing approaches for dual-target drug design rely on staged pipelines that either decouple binding poses via conditional-independence assumptions or enforce overly rigid correlations, failing to jointly generate two target-specific binding modes needed for effective polypharmacological therapies.
Method: FuseDiff uses an end-to-end diffusion model with a message-passing backbone featuring Dual-target Local Context Fusion (DLCF), which fuses each ligand atom’s local context from both pockets to enable expressive joint modeling while preserving desired symmetries. It includes explicit bond generation to enforce topological consistency across poses while allowing target-specific geometric adaptation.
Result: Experiments on benchmark and real-world dual-target systems show FuseDiff achieves state-of-the-art docking performance and enables the first systematic assessment of dual-target pose quality prior to docking-based pose search.
Conclusion: FuseDiff provides an effective end-to-end solution for dual-target structure-based drug design that jointly generates ligand molecules and pocket-specific binding poses, addressing limitations of existing staged approaches.
Abstract: Dual-target structure-based drug design aims to generate a single ligand together with two pocket-specific binding poses, each compatible with a corresponding target pocket, enabling polypharmacological therapies with improved efficacy and reduced resistance. Existing approaches typically rely on staged pipelines, which either decouple the two poses via conditional-independence assumptions or enforce overly rigid correlations, and therefore fail to jointly generate two target-specific binding modes. To address this, we propose FuseDiff, an end-to-end diffusion model that jointly generates a ligand molecular graph and two pocket-specific binding poses conditioned on both pockets. FuseDiff features a message-passing backbone with Dual-target Local Context Fusion (DLCF), which fuses each ligand atom’s local context from both pockets to enable expressive joint modeling while preserving the desired symmetries. Together with explicit bond generation, FuseDiff enforces topological consistency across the two poses under a shared graph while allowing target-specific geometric adaptation in each pocket. To support principled training and evaluation, we derive a dual-target training set and use an independent held-out test set for evaluation. Experiments on the benchmark and a real-world dual-target system show that FuseDiff achieves state-of-the-art docking performance and enables the first systematic assessment of dual-target pose quality prior to docking-based pose search.
[443] Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View
Gyuryang Heo, Timothy Ngotiaoco, Kazuki Irie, Samuel J. Gershman, Bernardo Sabatini
Main category: cs.LG
TL;DR: Theoretical analysis of sequence models’ expressivity bounds using Lie algebra, showing error diminishes exponentially with depth.
Details
Motivation: To understand the theoretical bounds on error and how error scales when sequence models operate outside their expressivity regimes, using a Lie-algebraic control perspective.
Method: Uses Lie-algebraic control theory to formulate correspondence between model depth and Lie algebra extensions, characterizes constant-depth sequence models, and analytically derives approximation error bounds.
Result: Shows error diminishes exponentially as depth increases, consistent with empirical performance. Validated on symbolic word and continuous-valued state-tracking problems.
Conclusion: Provides theoretical framework for understanding expressivity bounds of sequence models, showing exponential error reduction with depth.
Abstract: Scalable sequence models, such as Transformer variants and structured state-space models, often trade expressive power for sequence-level parallelism, which enables efficient training. Here we examine the bounds on error and how error scales when models operate outside of their expressivity regimes using a Lie-algebraic control perspective. Our theory formulates a correspondence between the depth of a sequence model and a tower of Lie algebra extensions. Echoing recent theoretical studies, we characterize the Lie-algebraic class of constant-depth sequence models and their corresponding expressivity bounds. Furthermore, we analytically derive an approximation error bound and show that error diminishes exponentially as the depth increases, consistent with the strong empirical performance of these models. We validate our theoretical predictions using experiments on symbolic word and continuous-valued state-tracking problems.
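A schematic statement of the claimed depth scaling (the precise norms, constants, and dependence on the Lie-algebraic class are spelled out in the paper; this shows only the qualitative form):

```latex
\[
  \bigl\| f_{\mathrm{target}} - \hat{f}_{L} \bigr\|
  \;\le\; C\, e^{-c L}, \qquad c > 0,
\]
```

where $\hat{f}_{L}$ denotes a depth-$L$ parallelizable sequence model; the exponential decay in $L$ is what makes moderate depth sufficient in practice even outside the constant-depth expressivity class.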
[444] A Novel Hybrid Heuristic-Reinforcement Learning Optimization Approach for a Class of Railcar Shunting Problems
Ruonan Zhao, Joseph Geunes
Main category: cs.LG
TL;DR: A hybrid heuristic-reinforcement learning framework for solving railcar shunting problems with two-sided classification track access using two locomotives.
Details
Motivation: Railcar shunting in freight railyards is a complex combinatorial planning problem where yard planners need to efficiently disassemble and reassemble railcars to form outbound trains. The problem becomes particularly challenging with two-sided classification tracks that allow FIFO access, requiring coordination of two locomotives.
Method: Decomposes the two-sided access problem into two subproblems with one-sided access, each handled by a locomotive. Develops a Hybrid Heuristic-Reinforcement Learning (HHRL) framework that integrates railway-specific heuristics with Q-learning reinforcement learning. Uses methods to reduce state-action space and guide exploration during RL training.
Result: Numerical experiments demonstrate the efficiency and quality of the HHRL algorithm for both one-sided access (single-locomotive) and two-sided access (two-locomotive) railcar shunting problems.
Conclusion: The proposed HHRL framework effectively addresses the combinatorial challenges of railcar shunting with two-sided track access by combining domain-specific heuristics with reinforcement learning, showing promise for practical railyard operations.
Abstract: Railcar shunting is a core planning task in freight railyards, where yard planners need to disassemble and reassemble groups of railcars to form outbound trains. Classification tracks with access from one side only can be considered as stack structures, where railcars are added and removed from only one end, leading to a last-in-first-out (LIFO) retrieval order. In contrast, two-sided tracks function like queue structures, allowing railcars to be added from one end and removed from the opposite end, following a first-in-first-out (FIFO) order. We consider a problem requiring assembly of multiple outbound trains using two locomotives in a railyard with two-sided classification track access. To address this combinatorially challenging problem class, we decompose the problem into two subproblems, each with one-sided classification track access and a locomotive on each side. We present a novel Hybrid Heuristic-Reinforcement Learning (HHRL) framework that integrates railway-specific heuristic solution approaches with a reinforcement learning method, specifically Q-learning. The proposed framework leverages methods to decrease the state-action space and guide exploration during reinforcement learning. The results of a series of numerical experiments demonstrate the efficiency and quality of the HHRL algorithm in both one-sided access, single-locomotive problems and two-sided access, two-locomotive problems.
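The heuristic-guided Q-learning at the core of HHRL can be sketched on a toy problem. The `allowed` and `step` callables below are problem-specific stand-ins; in the paper they would encode railway-feasible shunting moves and yard dynamics:

```python
import numpy as np

def q_learning_with_heuristic(n_states, n_actions, allowed, step, n_episodes=200,
                              alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning where a domain heuristic prunes the action set.

    `allowed(s)` returns the heuristic-feasible actions in state s, shrinking
    the state-action space the agent must explore; `step(s, a)` returns
    (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = 0, False
        while not done:
            acts = allowed(s)
            if rng.random() < eps:
                a = int(rng.choice(acts))                 # explore within feasible set
            else:
                a = int(acts[np.argmax(Q[s, acts])])      # exploit within feasible set
            s2, r, done = step(s, a)
            target = r + (0.0 if done else gamma * np.max(Q[s2, allowed(s2)]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q

# Toy chain of 4 states; action 1 moves right (reward 1 at the end), action 0 stays.
Q = q_learning_with_heuristic(
    n_states=4, n_actions=2,
    allowed=lambda s: [0, 1],
    step=lambda s, a: (min(s + a, 3), 1.0 if s + a >= 3 else 0.0, s + a >= 3),
)
```

Replacing `allowed` with a restrictive heuristic is where the hybridization pays off: the Q-table is only ever updated over moves the heuristic deems feasible.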
[445] Spatiotemporal Heterogeneity of AI-Driven Traffic Flow Patterns and Land Use Interaction: A GeoAI-Based Analysis of Multimodal Urban Mobility
Olaf Yunus Laitinen Imanov
Main category: cs.LG
TL;DR: A GeoAI Hybrid framework combining MGWR, Random Forest, and ST-GCN models to analyze multimodal urban traffic flow patterns and their interactions with land use across different urban morphologies.
Details
Motivation: Urban traffic flow involves complex nonlinear interactions between land use and mobility demand that conventional models fail to capture across multiple travel modes and scales simultaneously.
Method: Sequential integration of Multiscale Geographically Weighted Regression (MGWR), Random Forest (RF), and Spatio-Temporal Graph Convolutional Networks (ST-GCN) to model spatiotemporal heterogeneity across motor vehicle, public transit, and active transport modes.
Result: Achieved RMSE of 0.119 and RÂČ of 0.891 (23-62% better than benchmarks); identified key predictors via SHAP analysis; discovered 5 urban traffic typologies; showed moderate within-cluster transferability but limited cross-cluster generalizability.
Conclusion: The framework provides an interpretable, scalable toolkit for evidence-based multimodal mobility management and land use policy design, emphasizing the importance of urban morphological context.
Abstract: Urban traffic flow is governed by the complex, nonlinear interaction between land use configuration and spatiotemporally heterogeneous mobility demand. Conventional global regression and time-series models cannot simultaneously capture these multi-scale dynamics across multiple travel modes. This study proposes a GeoAI Hybrid analytical framework that sequentially integrates Multiscale Geographically Weighted Regression (MGWR), Random Forest (RF), and Spatio-Temporal Graph Convolutional Networks (ST-GCN) to model the spatiotemporal heterogeneity of traffic flow patterns and their interaction with land use across three mobility modes: motor vehicle, public transit, and active transport. Applying the framework to an empirically calibrated dataset of 350 traffic analysis zones across six cities spanning two contrasting urban morphologies, four key findings emerge: (i) the GeoAI Hybrid achieves a root mean squared error (RMSE) of 0.119 and an R^2 of 0.891, outperforming all benchmarks by 23-62%; (ii) SHAP analysis identifies land use mix as the strongest predictor for motor vehicle flows and transit stop density as the strongest predictor for public transit; (iii) DBSCAN clustering identifies five functionally distinct urban traffic typologies with a silhouette score of 0.71, and GeoAI Hybrid residuals exhibit Moran’s I=0.218 (p<0.001), a 72% reduction relative to OLS baselines; and (iv) cross-city transfer experiments reveal moderate within-cluster transferability (R^2>=0.78) and limited cross-cluster generalisability, underscoring the primacy of urban morphological context. The framework offers planners and transportation engineers an interpretable, scalable toolkit for evidence-based multimodal mobility management and land use policy design.
[446] Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models
Ivan Luiz De Moura Matos, Abdel Djalil Sad Saoud, Ekaterina Iakovleva, Vito Paolo Pastore, Enzo Tartaglione
Main category: cs.LG
TL;DR: BISE identifies bias-free subnetworks within standard trained models via pruning, enabling debiasing without retraining or additional data.
Details
Motivation: Algorithmic biases in deep learning require mitigation, but existing debiasing techniques often involve complex training or dataset manipulation. The paper explores whether fair subnetworks can be extracted from vanilla-trained models without additional data or retraining.
Method: Bias-Invariant Subnetwork Extraction (BISE) identifies and isolates bias-free subnetworks that already exist within conventionally trained models through pruning techniques, without modifying original parameters or requiring retraining/finetuning.
Result: Extensive experiments on common benchmarks show BISE can extract subnetworks that rely less on biased features while maintaining robust performance, offering advantages in computational efficiency compared to data-centric or full retraining approaches.
Conclusion: The approach demonstrates efficient bias mitigation through structural adaptation of pre-trained networks via parameter removal, contributing to more practical debiasing strategies that don’t require costly data manipulation or full model retraining.
Abstract: The issue of algorithmic biases in deep learning has led to the development of various debiasing techniques, many of which perform complex training procedures or dataset manipulation. However, an intriguing question arises: is it possible to extract fair and bias-agnostic subnetworks from standard vanilla-trained models without relying on additional data, such as an unbiased training set? In this work, we introduce Bias-Invariant Subnetwork Extraction (BISE), a learning strategy that identifies and isolates “bias-free” subnetworks that already exist within conventionally trained models, without retraining or finetuning the original parameters. Our approach demonstrates that such subnetworks can be extracted via pruning and can operate without modification, effectively relying less on biased features while maintaining robust performance. Our findings contribute towards efficient bias mitigation through structural adaptation of pre-trained neural networks via parameter removal, as opposed to costly strategies that are either data-centric or involve (re)training all model parameters. Extensive experiments on common benchmarks show the advantages of our approach in terms of the performance and computational efficiency of the resulting debiased model.
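A minimal sketch of subnetwork extraction by pruning without retraining. Plain magnitude pruning is used here as a stand-in criterion; BISE's actual criterion is derived from bias invariance, not weight magnitude:

```python
import numpy as np

def extract_subnetwork(weights, keep_ratio):
    """Build a binary mask over a trained layer's weights and apply it without
    touching the surviving parameters -- the 'extract, don't retrain' recipe.

    Keeps the top `keep_ratio` fraction of weights by absolute value.
    """
    flat = np.abs(weights).ravel()
    k = max(1, int(keep_ratio * flat.size))
    # k-th largest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))                 # stand-in for a trained layer
W_sub, mask = extract_subnetwork(W, keep_ratio=0.25)
```

The original weights are never modified, only masked, so the debiased subnetwork can be deployed or reverted at zero training cost.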
[447] On the Value of Tokeniser Pretraining in Physics Foundation Models
Hadi Sotoudeh, Payel Mukhopadhyay, Ruben Ohana, Michael McCabe, Neil D. Lawrence, Shirley Ho, Miles Cranmer
Main category: cs.LG
TL;DR: Tokeniser pretraining improves physics emulation efficiency, with domain-aligned pretraining providing the largest benefits for accuracy and computational efficiency.
Details
Motivation: Physics foundation models need to learn both compact representations of high-resolution spatiotemporal data and governing physical dynamics, but learning both simultaneously from scratch can be inefficient and suboptimal.
Method: Pretrain tokeniser with autoencoding objective before training dynamics model; introduce flexible spatiotemporal compression operations with runtime-adjustable compression ratios; systematically investigate impact of domain alignment in pretraining data.
Result: In-domain pretraining reduces VRMSE by 64% after 10,500 training steps compared to training from scratch; domain-aligned pretraining yields largest improvements while cross-domain pretraining provides moderate gains.
Conclusion: Strategic tokeniser pretraining significantly enhances physics emulation efficiency, with domain alignment being crucial; flexible compression operations enable adaptation to diverse downstream tasks.
Abstract: We investigate the impact of tokeniser pretraining on the accuracy and efficiency of physics emulation. Modern high-resolution simulations produce vast volumes of data spanning diverse physical regimes and scales. Training foundation models to learn the dynamics underlying such data enables the modelling of complex multiphysics phenomena, especially in data-limited settings. The emerging class of physics foundation models typically aims to learn two tasks jointly: (i) extracting compact representations of high-resolution spatiotemporal data, and (ii) capturing governing physical dynamics. However, learning both tasks from scratch simultaneously can impede the effectiveness of either process. We demonstrate that pretraining the tokeniser with an autoencoding objective prior to training the dynamics model enhances computational efficiency for downstream tasks. Notably, the magnitude of this benefit depends on domain alignment: pretraining on the same physical system as the downstream task yields the largest improvements, while pretraining on other systems provides moderate gains. In-domain pretraining reduces VRMSE by 64% after 10,500 training steps compared to training from scratch. To our knowledge, this is the first systematic investigation of tokeniser pretraining for physics foundation models. We further introduce flexible spatiotemporal compression operations that extend causal convolutions to support runtime-adjustable compression ratios, enabling efficient adaptation to diverse downstream tasks. Our findings provide practical guidance for training efficient physics emulators and highlight the importance of strategic pretraining data selection.
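The two-stage recipe (pretrain the tokeniser with an autoencoding objective, then fit the dynamics in the frozen latent space) can be sketched with linear stand-ins: PCA as a closed-form linear autoencoder, and least-squares one-step dynamics in its latent space. Neither stand-in is the paper's architecture:

```python
import numpy as np

def pretrain_tokeniser(X, d):
    """Stage 1: autoencoding pretraining. A linear autoencoder's optimum spans
    the top-d principal components, so PCA via SVD is a closed-form stand-in."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    enc = Vt[:d]                    # (d, D): encode a snapshot as enc @ (x - mu)
    return mu, enc

def fit_latent_dynamics(X, mu, enc):
    """Stage 2: with the tokeniser frozen, fit one-step linear dynamics
    z[t+1] ~= M z[t] in latent space by least squares."""
    Z = (X - mu) @ enc.T
    M, *_ = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)
    return M.T, Z

# Synthetic trajectory: a 2-D rotation linearly embedded in 10-D observations.
rng = np.random.default_rng(0)
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
z, Zs = np.array([1.0, 0.0]), []
for _ in range(300):
    Zs.append(z)
    z = R @ z
lift = rng.normal(size=(2, 10))
X = np.array(Zs) @ lift

mu, enc = pretrain_tokeniser(X, d=2)
M, Z = fit_latent_dynamics(X, mu, enc)
err = np.max(np.abs(Z[1:] - Z[:-1] @ M.T))   # one-step latent prediction error
```

The split mirrors the paper's finding: once the tokeniser already captures the data manifold (here, exactly), the dynamics model has a much easier fitting problem.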
[448] Identifying Adversary Characteristics from an Observed Attack
Soyon Choi, Scott Alfeld, Meiyi Ma
Main category: cs.LG
TL;DR: A framework for identifying attacker characteristics from observed attacks in ML systems, addressing non-identifiability through domain-agnostic probabilistic identification to improve defense strategies.
Details
Motivation: Machine learning models in automated decision-making systems are vulnerable to data-manipulation attacks. Current defenses focus on the attack or model, but this paper proposes focusing on identifying the attacker to improve defense effectiveness.
Method: Proposes a domain-agnostic framework to identify the most probable attacker from observed attacks. Addresses the non-identifiability problem (multiple attackers could perform same attack) through probabilistic identification. Framework can be instantiated across various learners.
Result: Demonstrates framework applicability through specific instantiations on various learners. Shows how attacker identification enables both exogenous mitigation (system-level defenses) and improved adversarial regularization when attacker-specific knowledge is incorporated.
Conclusion: Focusing on attacker identification rather than just attack detection provides new defense capabilities. The framework helps defenders implement more effective countermeasures by understanding attacker characteristics, addressing the fundamental non-identifiability challenge.
Abstract: When used in automated decision-making systems, machine learning (ML) models are vulnerable to data-manipulation attacks. Some defense mechanisms (e.g., adversarial regularization) directly affect the ML models while others (e.g., anomaly detection) act within the broader system. In this paper we consider a different task for defending against the adversary, focusing on the attacker rather than the attack. We present and demonstrate a framework for identifying characteristics of the attacker from an observed attack. We prove that, without additional knowledge, the attacker is non-identifiable (multiple potential attackers would perform the same observed attack). To address this challenge, we propose a domain-agnostic framework to identify the most probable attacker. This framework aids the defender in two ways. First, knowledge about the attacker can be leveraged for exogenous mitigation (i.e., addressing the vulnerability by altering the decision-making system outside the learning algorithm and/or limiting the attacker’s capability). Second, when implementing defense methods that directly affect the learning process (e.g., adversarial regularization), knowledge of the specific attacker improves performance. We present the details of our framework and illustrate its applicability through specific instantiations on a variety of learners.
[449] The Value of Graph-based Encoding in NBA Salary Prediction
Junhao Su, David Grimsman, Christopher Archibald
Main category: cs.LG
TL;DR: Using knowledge graphs with on/off-court data and graph embeddings improves NBA salary prediction, especially for veteran players and high-salary outliers.
Details
Motivation: Traditional tabular data approaches using supervised ML for NBA salary prediction work well for younger players based on draft position, but fail for veterans and high-salary outliers due to complex factors beyond basic performance metrics.
Method: Build a knowledge graph with on-court and off-court NBA data, embed the graph into vector space using various graph embedding algorithms, and incorporate these embeddings as features in supervised learning models for salary prediction.
Result: The knowledge graph embedding approach significantly improves salary prediction accuracy compared to traditional tabular methods, with different graph embedding algorithms showing varying effectiveness for capturing the complex landscape of salary-determining factors.
Conclusion: Incorporating knowledge graph embeddings into NBA salary prediction models is essential for capturing the multifaceted nature of player valuation, particularly for veteran players and those with outlier salaries that traditional methods struggle to predict accurately.
Abstract: Market valuation of professional athletes is a difficult problem, given the amount of variability in performance and location from year to year. In the National Basketball Association (NBA), a straightforward way to address this problem is to build a tabular data set and use supervised machine learning to predict a player’s salary based on the player’s performance in the previous year. For younger players, whose contracts are mostly built on draft position, this approach works well; however, it can fail for veterans or those whose salaries are on the high tail of the distribution. In this paper, we show that building a knowledge graph with on- and off-court data, embedding that graph in a vector space, and including that vector in the tabular data allows the supervised learning to better understand the landscape of factors that affect salary. We compare several graph embedding algorithms and show that such a process is vital to NBA salary prediction.
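The pipeline (build a graph, embed it, concatenate the embedding with tabular features) can be sketched with a spectral embedding as one simple stand-in for the graph-embedding algorithms the paper compares; the toy player/team graph and stats below are invented for illustration:

```python
import numpy as np

def spectral_embedding(adj, d):
    """Embed graph nodes via the top-d eigenvectors of the symmetrically
    normalized adjacency matrix (node2vec, TransE, etc. would also fit here)."""
    deg = adj.sum(axis=1)
    dinv = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    norm = adj * dinv[:, None] * dinv[None, :]
    vals, vecs = np.linalg.eigh(norm)
    return vecs[:, -d:]             # eigenvectors for the d largest eigenvalues

# Tiny knowledge graph: players 0-3 linked to teams 4-5 (undirected edges).
edges = [(0, 4), (1, 4), (2, 5), (3, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

emb = spectral_embedding(A, d=2)
# Concatenate graph coordinates with tabular stats to form the feature matrix.
stats = np.array([[20.1], [8.3], [15.7], [25.0], [0.0], [0.0]])  # e.g. PPG
features = np.hstack([stats, emb])
```

Players sharing a team land on identical embedding coordinates here, which is exactly the relational signal a purely tabular model never sees.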
[450] Reinforcement Learning for Power-Flow Network Analysis
Alperen Ergur, Julia Lindberg, Vinny Miller
Main category: cs.LG
TL;DR: RL agents discover power flow equation instances with many more solutions than average baseline by designing probabilistic reward function and state-space mimicking power flow equations.
Details
Motivation: Power flow equations describe relationships in electric power networks, and finding network parameters with many equilibrium points (many real solutions) is important but computationally challenging with current algebraic methods.
Method: Design probabilistic reward function approximating root count, create state-space mimicking power flow equations, derive average root count for Gaussian model as baseline, and use RL agents to discover instances with many solutions.
Result: RL agents discover instances of power flow equations with many more solutions than the average baseline, demonstrating RL’s potential for power-flow network design and analysis.
Conclusion: Reinforcement learning shows promise for complex non-linear algebra/geometry problems like power flow network design, offering an alternative to traditional computational algebra methods.
Abstract: The power flow equations are non-linear multivariate equations that describe the relationship between power injections and bus voltages of electric power networks. Given a network topology, we are interested in finding network parameters with many equilibrium points. This corresponds to finding instances of the power flow equations with many real solutions. Current state-of-the-art algorithms in computational algebra are not capable of answering this question for networks involving more than a small number of variables. To remedy this, we design a probabilistic reward function that gives a good approximation to this root count, and a state-space that mimics the space of power flow equations. We derive the average root count for a Gaussian model, and use this as a baseline for our RL agents. The agents discover instances of the power flow equations with many more solutions than the average baseline. This demonstrates the potential of RL for power-flow network design and analysis as well as the potential for RL to contribute meaningfully to problems that involve complex non-linear algebra or geometry. (Author order is alphabetical; all authors contributed equally.)
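The reward-maximisation loop at the heart of the method can be illustrated with a tabular Δ-greedy agent. This is a deliberate simplification: the paper's agents act over a state space mimicking the power flow equations, whereas here each "arm" stands in for one candidate instance and the reward function is a made-up proxy for the probabilistic root count.

```python
import random
random.seed(1)

def epsilon_greedy(reward, n_arms, steps=2000, eps=0.1):
    # Tabular bandit: each arm is a candidate equation instance; the agent
    # keeps a running mean of the (possibly noisy) root-count reward.
    counts = [0] * n_arms
    values = [0.0] * n_arms
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(n_arms)          # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        r = reward(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean
    return max(range(n_arms), key=lambda a: values[a])

# Hypothetical root-count proxy per instance; the agent should find arm 1.
best = epsilon_greedy(lambda a: [1.0, 3.0, 2.0][a], 3)
```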
[451] Improved Scaling Laws via Weak-to-Strong Generalization in Random Feature Ridge Regression
Diyuan Wu, Lehan Chen, Theodor Misiakiewicz, Marco Mondelli
Main category: cs.LG
TL;DR: Weak-to-strong generalization in machine learning: training strong students on imperfect labels from weak teachers can substantially improve scaling laws, with students potentially achieving minimax optimal rates even when teachers don’t improve with sample size.
Details
Motivation: The paper investigates the phenomenon where training strong models on imperfect labels from weaker models (weak-to-strong generalization) can lead to substantial performance improvements, affecting the fundamental scaling laws of test error.
Method: The authors use random feature ridge regression (RFRR) as their theoretical framework. Their main technical contribution is deriving a deterministic equivalent for the excess test error of students trained on teacher-generated labels, allowing them to analyze scaling laws mathematically.
Result: The analysis reveals regimes where student scaling laws improve upon teacher scaling laws, with improvements possible in both bias-dominated and variance-dominated settings. Strikingly, students can achieve minimax optimal rates even when teacher test error doesn’t decay with sample size.
Conclusion: Weak-to-strong generalization can substantially affect scaling laws, with strong students potentially outperforming weak teachers beyond simple error reduction to actually improving the fundamental rate of convergence, achieving optimal rates regardless of teacher performance.
Abstract: It is increasingly common in machine learning to use learned models to label data and then employ such data to train more capable models. The phenomenon of weak-to-strong generalization exemplifies the advantage of this two-stage procedure: a strong student is trained on imperfect labels obtained from a weak teacher, and yet the strong student outperforms the weak teacher. In this paper, we show that the potential improvement is substantial, in the sense that it affects the scaling law followed by the test error. Specifically, we consider students and teachers trained via random feature ridge regression (RFRR). Our main technical contribution is to derive a deterministic equivalent for the excess test error of the student trained on labels obtained via the teacher. Via this deterministic equivalent, we then identify regimes in which the scaling law of the student improves upon that of the teacher, unveiling that the improvement can be achieved both in bias-dominated and variance-dominated settings. Strikingly, the student may attain the minimax optimal rate regardless of the scaling law of the teacher – in fact, even when the test error of the teacher does not decay with the sample size.
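The two-stage weak-to-strong pipeline is easy to sketch with one-dimensional ridge regression, whose closed form is w = ÎŁxy / (ÎŁx² + λ). This toy only shows the mechanics (weak teacher labels a pool, strong student refits on those labels); the scaling-law improvements the paper proves arise in the random-feature setting, not in this linear toy, where the student essentially tracks the teacher.

```python
import random

def fit_ridge(xs, ys, lam):
    # Closed-form 1-D ridge regression: w = sum(x*y) / (sum(x^2) + lam).
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

random.seed(0)
true_w = 2.0
xs = [random.gauss(0, 1) for _ in range(200)]
ys = [true_w * x + random.gauss(0, 0.5) for x in xs]

teacher_w = fit_ridge(xs[:20], ys[:20], lam=5.0)  # weak teacher: few labels, heavy ridge
pseudo = [teacher_w * x for x in xs]              # teacher labels the whole pool
student_w = fit_ridge(xs, pseudo, lam=0.1)        # student trained on teacher labels
```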
[452] Warm Starting State-Space Models with Automata Learning
William Fishell, Sam Nicholas Kouteili, Mark Santolucito
Main category: cs.LG
TL;DR: Moore machines can be exactly realized as state-space models, enabling symbolic automata to operate in Euclidean space while preserving structure and behavior.
Details
Motivation: To establish a formal correspondence between symbolic automata (Moore machines) and continuous machine learning architectures (state-space models), enabling principled exploitation of symbolic structure in continuous domains.
Method: Prove that Moore machines can be exactly realized as state-space models, compare SSM training with passive/active automata learning, recover automata from SYNTCOMP benchmark, and combine automata learning with SSMs using symbolic initialization.
Result: SSMs require orders of magnitude more data than symbolic methods and fail to learn state structure; initializing SSMs with symbolically-learned approximations leads to 2-5 times faster convergence and better accuracy.
Conclusion: Symbolic structure provides strong inductive bias; combining automata learning with SSMs enables efficient learning of complex systems by lifting automata learning into continuous spaces.
Abstract: We prove that Moore machines can be exactly realized as state-space models (SSMs), establishing a formal correspondence between symbolic automata and these continuous machine learning architectures. These Moore-SSMs preserve both the complete symbolic structure and input-output behavior of the original Moore machine, but operate in Euclidean space. With this correspondence, we compare the training of SSMs with both passive and active automata learning. In recovering automata from the SYNTCOMP benchmark, we show that SSMs require orders of magnitude more data than symbolic methods and fail to learn state structure. This suggests that symbolic structure provides a strong inductive bias for learning these systems. We leverage this insight to combine the strengths of both automata learning and SSMs in order to learn complex systems efficiently. We learn an adaptive arbitration policy on a suite of arbiters from SYNTCOMP and show that initializing SSMs with symbolically-learned approximations makes them learn both faster and better. We see 2-5 times faster convergence compared to randomly initialized models and better overall model accuracies on test data. Our work lifts automata learning out of purely discrete spaces, enabling principled exploitation of symbolic structure in continuous domains for efficiently learning in complex settings.
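The exact-realization claim has a direct constructive sketch: encode states as one-hot vectors and each input symbol as a 0/1 transition matrix, and the linear state update reproduces the automaton exactly. The two-state machine below is a made-up example, not one from SYNTCOMP.

```python
# Hypothetical Moore machine: delta maps (state, symbol) -> next state.
delta = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}

def as_ssm(delta, n_states):
    # Build one 0/1 transition matrix per input symbol;
    # column s of mats[sym] carries state s to its successor.
    mats = {}
    for (s, sym), t in delta.items():
        A = mats.setdefault(sym, [[0] * n_states for _ in range(n_states)])
        A[t][s] = 1
    return mats

def step(A, h):
    # Linear state update h' = A h (h is a one-hot state vector).
    return [sum(A[i][j] * h[j] for j in range(len(h))) for i in range(len(h))]

mats = as_ssm(delta, 2)
h, state = [1, 0], 0  # run the SSM and the symbolic machine in lockstep
for sym in "aabba":
    h = step(mats[sym], h)
    state = delta[(state, sym)]
```

Because each matrix is a 0/1 column-stochastic map of one-hot vectors, the SSM trajectory coincides with the symbolic run at every step, which is the sense in which the realization is exact.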
[453] Unsupervised domain adaptation for radioisotope identification in gamma spectroscopy
Peter Lalor, Ayush Panigrahy, Alex Hagen
Main category: cs.LG
TL;DR: Transformer-based neural network for radioisotope identification using gamma spectroscopy, trained on synthetic data and adapted to real-world deployment via unsupervised domain adaptation with MMD minimization.
Details
Motivation: Training ML models for radioisotope identification is challenging due to difficulty acquiring large labeled experimental datasets. Simulations help but models trained on synthetic data perform poorly on out-of-distribution real-world data.
Method: First pretrain a spectral classifier using labeled synthetic data, then use unsupervised domain adaptation with unlabeled target data to align feature representations between source (synthetic) and target (real) domains. Compare various UDA techniques, finding MMD minimization most effective.
Result: MMD-based UDA improved testing accuracy from 0.754 ± 0.014 to 0.904 ± 0.022 on experimental LaBr3 test set using a custom transformer-based neural network.
Conclusion: Unsupervised domain adaptation can effectively adapt radioisotope classifiers trained on synthetic data for real-world deployment when unlabeled target data is available.
Abstract: Training machine learning models for radioisotope identification using gamma spectroscopy remains an elusive challenge for many practical applications, largely stemming from the difficulty of acquiring and labeling large, diverse experimental datasets. Simulations can mitigate this challenge, but the accuracy of models trained on simulated data can deteriorate substantially when deployed to an out-of-distribution operational environment. In this study, we demonstrate that unsupervised domain adaptation (UDA) can improve the ability of a model trained on synthetic data to generalize to a new testing domain, provided unlabeled data from the target domain are available. Conventional supervised techniques are unable to utilize this data because the absence of isotope labels precludes defining a supervised classification loss. Instead, we first pretrain a spectral classifier using labeled synthetic data and subsequently leverage unlabeled target data to align the learned feature representations between the source and target domains. We compare a range of different UDA techniques, finding that minimizing the maximum mean discrepancy (MMD) between source and target feature vectors yields the most consistent improvement to testing scores. For instance, using a custom transformer-based neural network, we achieved a testing accuracy of $0.904 \pm 0.022$ on an experimental LaBr$_3$ test set after performing unsupervised feature alignment via MMD minimization, compared to $0.754 \pm 0.014$ before alignment. Overall, our results highlight the potential of using UDA to adapt a radioisotope classifier trained on synthetic data for real-world deployment.
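The MMD objective that the alignment step minimises has a simple empirical form. A minimal sketch with scalar features and an RBF kernel (the biased V-statistic estimator; in the paper the inputs would be feature vectors from the spectral classifier):

```python
import math

def rbf(x, y, sigma=1.0):
    # RBF kernel on scalars.
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    # Biased squared-MMD estimate between samples xs (source) and ys (target).
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy
```

During adaptation this quantity, computed on source and target feature batches, is the loss being driven toward zero so the two domains become indistinguishable in feature space.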
[454] Revisiting the (Sub)Optimality of Best-of-N for Inference-Time Alignment
Ved Sriraman, Adam Block
Main category: cs.LG
TL;DR: BoN sampling is statistically optimal for win-rate optimization under practical conditions, though vulnerable to reward hacking; a simple variant eliminates hacking while maintaining optimality.
Details
Motivation: Despite widespread practical use of BoN sampling for language model alignment, prior theoretical work suggests it's statistically suboptimal and vulnerable to reward hacking. This paper re-examines BoN under more realistic assumptions about how reward models are actually trained and evaluated.
Method: Theoretical analysis of BoN sampling under practical conditions focusing on win-rate (pairwise comparison metric) rather than expected true reward. Proposes a simple variant of BoN that eliminates reward hacking while maintaining statistical optimality.
Result: Shows that properly tuned BoN is computationally and statistically optimal for achieving high win-rate under minimal quality conditions of reference and reward models. Prior approaches are provably suboptimal for win-rate optimization.
Conclusion: BoN’s practical success is partially explained by its optimality for win-rate optimization, but reward hacking remains a concern. The proposed variant eliminates hacking while maintaining optimal performance, highlighting the importance of choosing appropriate objectives when analyzing inference-time alignment methods.
Abstract: Best-of-N (BoN) sampling is a widely used inference-time alignment method for language models, whereby N candidate responses are sampled from a reference model and the one with the highest predicted reward according to a learned reward model is selected. Despite its widespread practical use, recent theoretical work has suggested that it is statistically suboptimal and vulnerable to reward hacking, the process by which models exploit weaknesses in the learned reward model to achieve high estimated reward without genuinely improving performance. We revisit this question under assumptions that more closely reflect practice than that of prior work. In particular, in contradistinction to earlier analyses that focused on expected true reward, which may not be meaningful in many practical settings, we investigate how inference-time alignment affects the win-rate, a pairwise comparison-based metric more closely aligned with how reward models are trained and evaluated in practice. We demonstrate that, under minimal conditions on the quality of the reference model and learned reward model, properly tuned BoN is both computationally and statistically optimal in achieving high win-rate, partially explaining its widespread practical success. Because BoN remains susceptible to reward-hacking in this setting, we propose a simple and practical variant that provably eliminates reward-hacking while maintaining optimal statistical performance. Finally, we show that prior approaches are provably suboptimal when considering win-rate, highlighting the importance of choosing appropriate objectives when analyzing inference-time alignment methods.
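Both BoN and the win-rate metric it is analysed under are a few lines each. The reward model below is a toy numeric stand-in, not a learned model:

```python
def best_of_n(candidates, reward_model):
    # BoN: return the candidate the (learned) reward model scores highest.
    return max(candidates, key=reward_model)

def win_rate(a_scores, b_scores):
    # Fraction of pairwise comparisons that a wins; ties count one half.
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in a_scores for b in b_scores)
    return wins / (len(a_scores) * len(b_scores))

# Toy setup: candidates are numbers, "true quality" peaks at 0.8.
cands = [0.1, 0.4, 0.75, 0.95]
pick = best_of_n(cands, lambda x: -(x - 0.8) ** 2)
```

The paper's shift in perspective is exactly the second function: judging an aligned policy by pairwise win-rate against a reference, rather than by expected true reward.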
[455] MIRACL: A Diverse Meta-Reinforcement Learning for Multi-Objective Multi-Echelon Combinatorial Supply Chain Optimisation
Rifny Rachman, Josh Tingey, Richard Allmendinger, Wei Pan, Pradyumn Shukla, Bahrul Ilmi Nasution
Main category: cs.LG
TL;DR: MIRACL is a hierarchical meta multi-objective reinforcement learning framework for few-shot generalization across diverse combinatorial optimization tasks, validated in supply chain optimization.
Details
Motivation: Multi-objective reinforcement learning (MORL) is effective for combinatorial supply chain optimization but faces deployment challenges in dynamic environments due to task-specific retraining requirements and high computational costs.
Method: MIRACL decomposes tasks into structured subproblems for efficient policy adaptation and meta-learns a global policy across tasks using Pareto-based adaptation strategy to encourage diversity in meta-training and fine-tuning.
Result: Empirical evaluations show MIRACL outperforms conventional MORL baselines in simple to moderate tasks, achieving up to 10% higher hypervolume and 5% better expected utility.
Conclusion: MIRACL demonstrates potential for robust, efficient adaptation in multi-objective problems and is theoretically domain-agnostic, applicable to broader dynamic multi-objective decision-making problems beyond supply chain optimization.
Abstract: Multi-objective reinforcement learning (MORL) is effective for multi-echelon combinatorial supply chain optimisation, where tasks involve high dimensionality, uncertainty, and competing objectives. However, its deployment in dynamic environments is hindered by the need for task-specific retraining and substantial computational cost. We introduce MIRACL (Meta multI-objective Reinforcement leArning with Composite Learning), a hierarchical Meta-MORL framework that allows for few-shot generalisation across diverse tasks. MIRACL decomposes each task into structured subproblems for efficient policy adaptation and meta-learns a global policy across tasks using a Pareto-based adaptation strategy to encourage diversity in meta-training and fine-tuning. To our knowledge, this is the first integration of Meta-MORL with such mechanisms in combinatorial optimisation. Although validated in the supply chain domain, MIRACL is theoretically domain-agnostic and applicable to broader dynamic multi-objective decision-making problems. Empirical evaluations show that MIRACL outperforms conventional MORL baselines in simple to moderate tasks, achieving up to 10% higher hypervolume and 5% better expected utility. These results underscore the potential of MIRACL for robust, efficient adaptation in multi-objective problems.
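Hypervolume, one of the two metrics reported, has an exact sweep-line form in two objectives. A minimal sketch for maximisation against a reference point (the point set is made up; real MORL evaluations would use the approximated Pareto front):

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    # Exact 2-D hypervolume (maximisation): area dominated by the point set
    # and dominating the reference point, computed by a sweep over x.
    pts = sorted(p for p in points if p[0] > ref[0] and p[1] > ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in reversed(pts):       # sweep from largest x downwards
        if y > prev_y:               # dominated points contribute nothing
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv
```

A "10% higher hypervolume" claim in the abstract means the area computed this way (in 2-D; the general case uses higher-dimensional analogues) is 10% larger for MIRACL's front than the baseline's.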
[456] Score-Guided Proximal Projection: A Unified Geometric Framework for Rectified Flow Editing
Vansh Bansal, James G Scott
Main category: cs.LG
TL;DR: SGPP is a unified framework for controlled generation with Rectified Flow models that bridges deterministic optimization and stochastic sampling through proximal projection with score guidance.
Details
Motivation: Current approaches for controlling RF models have limitations: inversion-based guidance suffers from "geometric locking" by rigidly adhering to source trajectories, while posterior sampling approximations like DPS are computationally expensive and unstable. There's a need for a unified framework that bridges deterministic optimization and stochastic sampling for precise control tasks like semantic editing or blind image recovery.
Method: Proposes Score-Guided Proximal Projection (SGPP), which reformulates recovery tasks as proximal optimization problems. Defines an energy landscape balancing fidelity to input with realism from pre-trained score field. Theoretically proves this induces normal contraction property that snaps out-of-distribution inputs onto data manifold. Generalizes RF-inversion as a limiting case and enables “soft guidance” by relaxing proximal variance for continuous trade-off between identity preservation and generative freedom.
Result: SGPP provides a unified framework that bridges deterministic optimization and stochastic sampling, offering training-free control over RF models. It generalizes state-of-the-art editing methods and enables continuous trade-off between strict identity preservation and generative freedom through “soft guidance.”
Conclusion: SGPP addresses key limitations in controlling Rectified Flow models by providing a theoretically grounded framework that unifies inversion-based guidance and posterior sampling approaches, enabling precise control for tasks like semantic editing and blind image recovery.
Abstract: Rectified Flow (RF) models achieve state-of-the-art generation quality, yet controlling them for precise tasks – such as semantic editing or blind image recovery – remains a challenge. Current approaches bifurcate into inversion-based guidance, which suffers from “geometric locking” by rigidly adhering to the source trajectory, and posterior sampling approximations (e.g., DPS), which are computationally expensive and unstable. In this work, we propose Score-Guided Proximal Projection (SGPP), a unified framework that bridges the gap between deterministic optimization and stochastic sampling. We reformulate the recovery task as a proximal optimization problem, defining an energy landscape that balances fidelity to the input with realism from the pre-trained score field. We theoretically prove that this objective induces a normal contraction property, geometrically guaranteeing that out-of-distribution inputs are snapped onto the data manifold, and it effectively reaches the posterior mode constrained to the manifold. Crucially, we demonstrate that SGPP generalizes state-of-the-art editing methods: RF-inversion is effectively a limiting case of our framework. By relaxing the proximal variance, SGPP enables “soft guidance,” offering a continuous, training-free trade-off between strict identity preservation and generative freedom.
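At its core the method descends an energy trading fidelity against realism as measured by the score. A one-dimensional gradient sketch, with a toy standard-normal score (score(x) = −x) standing in for the pre-trained flow's score field; the step rule, λ, and step size are illustrative, not the paper's exact update:

```python
def sgpp_step(x, y, score, lam, eta):
    # One gradient step on E(x) = lam/2 * (x - y)^2 - log p(x),
    # where score(x) = d/dx log p(x) comes from a pre-trained model.
    return x - eta * (lam * (x - y) - score(x))

score = lambda x: -x   # toy: standard normal, log p(x) = -x^2/2 + const
x, y = 5.0, 2.0        # out-of-distribution start, observed input
for _ in range(3000):
    x = sgpp_step(x, y, score, lam=1.0, eta=0.01)
# with lam = 1 the energy (x-2)^2/2 + x^2/2 is minimised at x* = 1
```

The "snap onto the manifold" behaviour shows up even in this toy: the iterate is pulled from the unlikely start x = 5 to a compromise between the observation and the high-density region.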
[457] TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks
Mykola Pinchuk
Main category: cs.LG
TL;DR: TML-Bench is a tabular benchmark for evaluating data science agents on Kaggle-style tasks, testing 10 OSS LLMs across four competitions with time budgets to assess end-to-end correctness and reliability.
Details
Motivation: To evaluate the practical value of autonomous coding agents for tabular data science tasks, focusing on end-to-end correctness and reliability under time constraints similar to real-world Kaggle competitions.
Method: Introduces TML-Bench benchmark with four Kaggle competitions, tests 10 open-source LLMs across three time budgets (240s, 600s, 1200s), runs each model five times per task/budget, and measures success based on valid submissions and private-holdout scores on hidden labels.
Result: MiniMax-M2.1 achieves best aggregate performance, average performance improves with larger time budgets, scaling is noisy for some models, and the benchmark reveals variability in success rates across different time constraints.
Conclusion: TML-Bench provides a standardized evaluation framework for tabular data science agents, demonstrating that time budgets significantly impact performance and highlighting the need for reliable autonomous coding systems in practical applications.
Abstract: Autonomous coding agents can produce strong tabular baselines quickly on Kaggle-style tasks. Practical value depends on end-to-end correctness and reliability under time limits. This paper introduces TML-Bench, a tabular benchmark for data science agents on Kaggle-style tasks. This paper evaluates 10 OSS LLMs on four Kaggle competitions and three time budgets (240s, 600s, and 1200s). Each model is run five times per task and budget. A run is successful if it produces a valid submission and a private-holdout score on hidden labels that are not accessible to the agent. This paper reports median performance, success rates, and run-to-run variability. The MiniMax-M2.1 model achieves the best aggregate performance score on all four competitions under the paper’s primary aggregation. Average performance improves with larger time budgets. Scaling is noisy for some individual models at the current run count. Code and materials are available at https://github.com/MykolaPinchuk/TML-bench/tree/master.
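The run protocol (five runs per task and budget, success = valid submission, medians and success rates reported) reduces to a small aggregation helper. The field names below are illustrative, not taken from the benchmark's code:

```python
import statistics

def summarize(runs):
    # runs: one entry per repeat; a score for a valid submission, None otherwise.
    valid = [s for s in runs if s is not None]
    return {
        "success_rate": len(valid) / len(runs),
        "median": statistics.median(valid) if valid else None,
    }

# hypothetical five repeats for one model/task/budget cell
cell = summarize([0.8, None, 0.9, 0.7, None])
```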
[458] Bridging Domains through Subspace-Aware Model Merging
Levy Chaves, Chao Zhou, Rebekka Burkholz, Eduardo Valle, Sandra Avila
Main category: cs.LG
TL;DR: SCORE is a model merging method that addresses subspace conflicts in domain generalization by finding shared orthogonal bases and pruning conflicting singular directions.
Details
Motivation: Current model merging methods work well for in-distribution or multi-task scenarios but perform poorly in domain generalization settings where models are fine-tuned on distinct domains with distribution shifts, leading to stronger subspace conflicts.
Method: SCORE computes principal components of concatenated leading singular vectors from all models to find shared orthogonal basis, then projects each task matrix into this basis and prunes off-diagonal components to remove conflicting singular directions.
Result: SCORE consistently outperforms existing model merging approaches in domain generalization settings across various architectures and model scales, demonstrating effectiveness and scalability.
Conclusion: Domain generalization in model merging requires addressing subspace conflicts induced by distribution shifts, and SCORE provides an effective solution by resolving these conflicts through shared basis projection and pruning.
Abstract: Model merging integrates multiple task-specific models into a single consolidated one. Recent research has made progress in improving merging performance for in-distribution or multi-task scenarios, but domain generalization in model merging remains underexplored. We investigate how merging models fine-tuned on distinct domains affects generalization to unseen domains. Through an analysis of parameter competition in the task matrix using singular value decomposition, we show that merging models trained under different distribution shifts induces stronger conflicts between their subspaces compared to traditional multi-task settings. To mitigate this issue, we propose SCORE (Subspace COnflict-Resolving mErging), a method designed to alleviate such singular subspace conflicts. SCORE finds a shared orthogonal basis by computing the principal components of the concatenated leading singular vectors of all models. It then projects each task matrix into the shared basis, pruning off-diagonal components to remove conflicting singular directions. SCORE consistently outperforms, on average, existing model merging approaches in domain generalization settings across a variety of architectures and model scales, demonstrating its effectiveness and scalability.
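The shared orthogonal basis at the heart of SCORE comes from the principal components of the stacked leading singular vectors. As a stand-in for that SVD machinery, a plain Gram-Schmidt orthonormalisation of the concatenated leading directions conveys the shared-basis step (this is an illustrative substitute, not the paper's PCA-based construction):

```python
def gram_schmidt(vectors, tol=1e-9):
    # Orthonormalise a list of vectors; near-duplicate directions are dropped,
    # which is how overlapping subspaces from different models collapse
    # into one shared basis.
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = sum(x * y for x, y in zip(w, b))
            w = [x - c * y for x, y in zip(w, b)]
        n = sum(x * x for x in w) ** 0.5
        if n > tol:
            basis.append([x / n for x in w])
    return basis

# leading singular vectors from two hypothetical fine-tuned models
shared = gram_schmidt([[1, 0, 0], [1, 1, 0], [2, 0, 0]])
```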
[459] First-Order Softmax Weighted Switching Gradient Method for Distributed Stochastic Minimax Optimization with Stochastic Constraints
Zhankun Luo, Antesh Upadhyay, Sang Bin Moon, Abolfazl Hashemi
Main category: cs.LG
TL;DR: Proposes Softmax-Weighted Switching Gradient method for distributed stochastic minimax optimization with constraints in federated learning, achieving O(Δ⁻⁎) complexity with unified error bounds for optimality and feasibility.
Details
Motivation: Addresses distributed stochastic minimax optimization with stochastic constraints in federated learning settings, aiming to overcome hyperparameter sensitivity and convergence oscillations in traditional primal-dual or penalty-based approaches while optimizing worst-case client performance.
Method: Novel first-order Softmax-Weighted Switching Gradient method designed for federated learning, with single-loop primal-only switching mechanism, theoretical analysis for both full and partial client participation regimes, and unified error decomposition.
Result: Achieves standard O(Δ⁻⁎) oracle complexity for unified Δ bound on optimality gap and feasibility tolerance, establishes sharper lower bound for softmax hyperparameter, provides O(log(1/ÎŽ)) high-probability convergence guarantee, and demonstrates efficacy on Neyman-Pearson classification and fair classification tasks.
Conclusion: The proposed single-loop primal-only switching mechanism provides stable alternative to traditional approaches, effectively bypassing hyperparameter sensitivity and convergence oscillations while optimizing worst-case client performance in federated learning settings.
Abstract: This paper addresses the distributed stochastic minimax optimization problem subject to stochastic constraints. We propose a novel first-order Softmax-Weighted Switching Gradient method tailored for federated learning. Under full client participation, our algorithm achieves the standard $\mathcal{O}(Δ^{-4})$ oracle complexity to satisfy a unified bound $Δ$ for both the optimality gap and feasibility tolerance. We extend our theoretical analysis to the practical partial participation regime by quantifying client sampling noise through a stochastic superiority assumption. Furthermore, by relaxing standard boundedness assumptions on the objective functions, we establish a strictly tighter lower bound for the softmax hyperparameter. We provide a unified error decomposition and establish a sharp $\mathcal{O}(\log\frac{1}{ÎŽ})$ high-probability convergence guarantee. Ultimately, our framework demonstrates that a single-loop primal-only switching mechanism provides a stable alternative for optimizing worst-case client performance, effectively bypassing the hyperparameter sensitivity and convergence oscillations often encountered in traditional primal-dual or penalty-based approaches. We verify the efficacy of our algorithm via experiment on the Neyman-Pearson (NP) classification and fair classification tasks.
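The two ingredients in the method's name can both be sketched: a softmax weighting over client losses (which up-weights the worst-off clients) and a primal-only switching rule that follows the constraint gradient while infeasible and the weighted objective gradient otherwise. The names, scalar feasibility test, and tolerance below are illustrative, not the paper's exact algorithm:

```python
import math

def softmax_weights(losses, tau=1.0):
    # Softmax over client losses; larger losses get larger weights.
    m = max(losses)                              # subtract max for stability
    exps = [math.exp((l - m) / tau) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]

def switching_grad(losses, grads_f, grad_g, gvalue, tol, tau=1.0):
    # Switching rule: if the constraint value g exceeds the tolerance,
    # descend the constraint; otherwise descend the softmax-weighted
    # combination of client objective gradients.
    if gvalue > tol:
        return grad_g
    w = softmax_weights(losses, tau)
    return [sum(wi * gi[k] for wi, gi in zip(w, grads_f))
            for k in range(len(grads_f[0]))]
```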
[460] Sparse Crosscoders for diffing MoEs and Dense models
Marmik Chaudhari, Nishkal Hundia, Idhant Gulati
Main category: cs.LG
TL;DR: MoE models learn fewer but more specialized features compared to dense models with equal active parameters, showing different feature organization patterns.
Details
Motivation: To systematically compare the internal representations of Mixture of Experts (MoE) and dense models, as MoE's sparse expert routing achieves parameter-efficient scaling but their representations remain poorly understood compared to dense models.
Method: Used crosscoders (variant of sparse autoencoders) to jointly model multiple activation spaces. Trained 5-layer dense and MoE models (with equal active parameters) on 1B tokens across code, scientific text, and English stories. Applied BatchTopK crosscoders with explicitly designated shared features.
Result: Achieved ~87% fractional variance explained. MoE learns significantly fewer unique features than dense model. MoE-specific features show higher activation density than shared features, while dense-specific features show lower density.
Conclusion: MoEs develop more specialized, focused representations while dense models distribute information across broader, more general-purpose features, revealing concrete differences in feature organization between the two architectures.
Abstract: Mixture of Experts (MoE) achieve parameter-efficient scaling through sparse expert routing, yet their internal representations remain poorly understood compared to dense models. We present a systematic comparison of MoE and dense model internals using crosscoders, a variant of sparse autoencoders, that jointly models multiple activation spaces. We train 5-layer dense and MoE models (equal active parameters) on 1B tokens across code, scientific text, and English stories. Using BatchTopK crosscoders with explicitly designated shared features, we achieve $\sim 87\%$ fractional variance explained and uncover concrete differences in feature organization. The MoE learns significantly fewer unique features compared to the dense model. MoE-specific features also exhibit higher activation density than shared features, whereas dense-specific features show lower density. Our analysis reveals that MoEs develop more specialized, focused representations while dense models distribute information across broader, more general-purpose features.
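The BatchTopK sparsity rule used by the crosscoder keeps the k largest activations across the whole batch rather than per example. A minimal sketch on a plain list-of-lists activation matrix:

```python
def batch_topk(acts, k):
    # Keep the k largest activations across the entire batch; zero the rest.
    flat = sorted(((v, i, j) for i, row in enumerate(acts)
                   for j, v in enumerate(row)), reverse=True)
    keep = {(i, j) for _, i, j in flat[:k]}
    return [[v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(acts)]
```

Compared with per-example TopK, the batch-level budget lets sparsity concentrate on whichever examples have the strongest features, which matters when comparing feature occupancy across the two models.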
[461] MoE Lens – An Expert Is All You Need
Marmik Chaudhari, Idhant Gulati, Nishkal Hundia, Pranav Karra, Shivam Raval
Main category: cs.LG
TL;DR: Analysis reveals MoE models exhibit concentrated expertise with few specialized experts handling most routing decisions, enabling potential inference optimization through expert pruning while maintaining performance.
Details
Motivation: Mixture of Experts models enable parameter-efficient scaling but optimizing their inference and memory costs is challenging due to limited understanding of how experts specialize and contribute to model performance.
Method: Systematic analysis using two complementary approaches: domain-specific routing patterns and an early decoding framework that tracks expert contributions to output representations, applied to DeepSeekMoE model with 64 routed experts.
Result: Despite having 64 experts with 6 active per layer, the model predominantly relies on few specialized experts; top-weighted expert’s output closely approximates full ensemble prediction; few experts handle over 50% of routing decisions; hidden state similarity between single and ensemble experts is extremely high (cosine similarity up to 0.95).
Conclusion: MoE models exhibit concentrated expertise, highlighting opportunities for inference optimization through targeted expert pruning while maintaining performance, and opening avenues for studying localization of learned knowledge in these models.
Abstract: Mixture of Experts (MoE) models enable parameter-efficient scaling through sparse expert activations, yet optimizing their inference and memory costs remains challenging due to limited understanding of their specialization behavior. We present a systematic analysis of expert specialization in MoEs through two complementary approaches: domain-specific routing patterns and an early decoding framework that tracks expert contributions to output representations. Our analysis of the DeepSeekMoE model reveals that despite having 64 routed experts with 6 active for each layer’s computation, the model predominantly relies on a few specialized experts, with the top-weighted expert’s output closely approximating the full ensemble prediction. We quantitatively validate these findings through a systematic analysis of the token routing distribution, demonstrating that very few experts handle over 50% of routing decisions across different specialized domains. Hidden state similarity between single and ensemble experts for every layer is extremely high, with some layers having cosine similarity as high as 0.95 and perplexity increasing by only 5% when using a single expert across all three domains. Our results indicate that Mixture of Experts models exhibit concentrated expertise, highlighting potential opportunities for inference optimization through targeted expert pruning while maintaining model performance and opening avenues towards studying localization of learned knowledge in these models.
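The single-expert-vs-ensemble comparison reduces to a cosine similarity between the top-weighted expert's output and the router-weighted sum. The expert outputs and routing weights below are made up to show why one dominant weight makes the two nearly collinear:

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(x * x for x in v)))

def ensemble(expert_outs, weights):
    # Router-weighted combination of expert outputs.
    return [sum(w * e[k] for w, e in zip(weights, expert_outs))
            for k in range(len(expert_outs[0]))]

experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical expert outputs
weights = [0.9, 0.05, 0.05]                     # one dominant routing weight
full = ensemble(experts, weights)
sim = cosine(experts[0], full)                  # top expert vs full ensemble
```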
[462] Self-Auditing Parameter-Efficient Fine-Tuning for Few-Shot 3D Medical Image Segmentation
Son Thai Ly, Hien V. Nguyen
Main category: cs.LG
TL;DR: SEA-PEFT automates adapter configuration for medical image segmentation by treating it as an online allocation problem, improving performance in few-shot settings without manual tuning.
Details
Motivation: Adapting foundation models to new clinical sites is challenging due to domain shift and scarce annotations. Many clinical groups lack AI engineers for adapter tuning, leading to long adaptation cycles in few-shot settings.
Method: SEA-PEFT uses a search-audit-allocate loop: trains active adapters, estimates each adapter’s Dice utility by toggling it off, then reselects active set under parameter budget using greedy knapsack allocator. EMA and IQR smoothing with Finite-State Ranking controller stabilize the loop.
Result: On TotalSegmentator and FLARE'22 datasets, SEA-PEFT improves mean Dice by 2.4-2.8 points over strongest fixed-topology PEFT baselines across 1/5/10-shot settings while training <1% of parameters.
Conclusion: SEA-PEFT automates adapter configuration for medical image segmentation, enabling efficient adaptation to new clinical sites without manual tuning expertise.
Abstract: Adapting foundation models to new clinical sites remains challenging in practice. Domain shift and scarce annotations must be handled by experts, yet many clinical groups do not have ready access to skilled AI engineers to tune adapter designs and training recipes. As a result, adaptation cycles can stretch from weeks to months, particularly in few-shot settings. Existing PEFT methods either require manual adapter configuration or rely on automated searches that are computationally infeasible in few-shot 3D settings. We propose SEA-PEFT (SElf-Auditing Parameter-Efficient Fine-Tuning) to automate this process. SEA-PEFT treats adapter configuration as an online allocation problem solved during fine-tuning rather than through manual, fixed-topology choices. SEA-PEFT uses a search-audit-allocate loop that trains active adapters, estimates each adapter’s Dice utility by momentarily toggling it off, and then reselects the active set under a parameter budget using a greedy knapsack allocator. Exponential Moving Average and Interquartile Range smoothing, together with a Finite-State Ranking controller, stabilize the loop and improve reliability in high-noise few-shot regimes. On TotalSegmentator and FLARE'22, SEA-PEFT improves mean Dice by 2.4–2.8 points over the strongest fixed-topology PEFT baselines across 1/5/10-shot settings while training <1% of parameters. For reproducibility, we make our code publicly available at https://github.com/tsly123/SEA_PEFT
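The "allocate" step of the loop can be sketched as a greedy knapsack over audited adapters. The per-adapter utilities (estimated Dice drop when toggled off), parameter costs, and the utility-per-parameter ranking rule below are illustrative assumptions, not the paper's exact allocator:

```python
def greedy_knapsack(utilities, costs, budget):
    """Greedily pick adapters by utility-per-parameter under a parameter budget."""
    order = sorted(range(len(utilities)),
                   key=lambda i: utilities[i] / costs[i], reverse=True)
    active, used = [], 0
    for i in order:
        if used + costs[i] <= budget:
            active.append(i)
            used += costs[i]
    return sorted(active)

# Toy audit results: adapter 1 gives the most Dice per trainable parameter.
utilities = [0.8, 1.5, 0.3, 0.9]   # estimated Dice drop if the adapter is removed
costs     = [400, 300, 200, 500]   # trainable parameters per adapter
print(greedy_knapsack(utilities, costs, budget=900))  # → [0, 1, 2]
```

Adapter 3 has decent utility but is too expensive once 1 and 0 are chosen, so the remaining budget goes to the cheap adapter 2.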
[463] Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls
Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Changran Hu, Qizheng Zhang, Urmish Thakker
Main category: cs.LG
TL;DR: Empirical study of many-shot prompting for test-time adaptation in LLMs, analyzing performance across tasks, update strategies, and model backbones.
Details
Motivation: To understand the reliability and limits of many-shot prompting as a test-time adaptation mechanism for LLMs, particularly for open-source models, and to characterize when input-space updates are beneficial versus harmful.
Method: Empirical study across tasks and model backbones, analyzing performance variation with update magnitude, example ordering, and selection policy. Also studies Dynamic and Reinforced ICL as alternative test-time update strategies.
Result: Many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and shows limited benefits for open-ended generation tasks.
Conclusion: Characterizes practical limits of prompt-based test-time adaptation and outlines when input-space updates are beneficial versus harmful for LLMs.
Abstract: Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.
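The knobs the study varies (shot count as "update magnitude", ordering, and selection policy) can be made concrete with a minimal prompt builder; the similarity function and example pool below are hypothetical:

```python
# Assemble a many-shot prompt: n_shots controls the update magnitude, and the
# selection policy picks the examples most similar to the query.
def build_prompt(query, pool, n_shots, sim):
    shots = sorted(pool, key=lambda ex: sim(query, ex["q"]), reverse=True)[:n_shots]
    demo = "\n".join(f"Q: {ex['q']}\nA: {ex['a']}" for ex in shots)
    return f"{demo}\nQ: {query}\nA:"

sim = lambda a, b: len(set(a.split()) & set(b.split()))  # toy token-overlap similarity
pool = [{"q": "add 2 3", "a": "5"}, {"q": "add 4 4", "a": "8"},
        {"q": "cap paris", "a": "Paris"}]
prompt = build_prompt("add 1 2", pool, n_shots=2, sim=sim)
print(prompt)
```

The paper's finding that results are "highly sensitive to selection strategy" corresponds to swapping out `sim` (or the ordering of `shots`) and measuring downstream accuracy.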
[464] Stochastic Event Prediction via Temporal Motif Transitions
İbrahim Bahadır Altun, Ahmet Erdem SarıyĂŒce
Main category: cs.LG
TL;DR: STEP reformulates temporal link prediction as sequential forecasting using temporal motifs and Poisson processes, achieving state-of-the-art performance with improved runtime.
Details
Motivation: Existing temporal link prediction methods treat the task as binary classification with negative sampling, which discards the sequential and correlated nature of real-world interactions in networks with timestamped events.
Method: STEP models event dynamics through discrete temporal motif transitions governed by Poisson processes, maintaining open motif instances that evolve with new interactions. It uses Bayesian scoring of temporal likelihoods and structural priors to decide whether to initiate new motifs or extend existing ones.
Result: Experiments on five real-world datasets show up to 21% average precision gains over state-of-the-art baselines in classification and 0.99 precision in next k sequential forecasting, with consistently lower runtime than competing motif-aware methods.
Conclusion: STEP provides an effective framework for temporal link prediction that captures sequential dependencies while being computationally efficient, and its motif-based features can enhance existing temporal graph neural networks.
Abstract: Networks of timestamped interactions arise across social, financial, and biological domains, where forecasting future events requires modeling both evolving topology and temporal ordering. Temporal link prediction methods typically frame the task as binary classification with negative sampling, discarding the sequential and correlated nature of real-world interactions. We introduce STEP (STochastic Event Predictor), a framework that reformulates temporal link prediction as a sequential forecasting problem in continuous time. STEP models event dynamics through discrete temporal motif transitions governed by Poisson processes, maintaining a set of open motif instances that evolve as new interactions arrive. At each step, the framework decides whether to initiate a new temporal motif or extend an existing one, selecting the most probable event via Bayesian scoring of temporal likelihoods and structural priors. STEP also produces compact, temporal motif-based feature vectors that can be concatenated with existing temporal graph neural network outputs, enriching their representations without architectural modifications. Experiments on five real-world datasets demonstrate up to 21% average precision gains over state-of-the-art baselines in classification and 0.99 precision in next $k$ sequential forecasting, with consistently lower runtime than competing motif-aware methods.
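The initiate-vs-extend decision can be sketched with exponential (Poisson-process) waiting-time likelihoods weighted by structural priors. The rates, elapsed times, and priors below are made-up illustrations, not STEP's fitted parameters:

```python
import math

def score(rate, dt, prior):
    """Exponential waiting-time likelihood under a Poisson process with the
    given transition rate, multiplied by a structural prior for the motif."""
    return rate * math.exp(-rate * dt) * prior

open_motifs = [  # (transition rate, time since last event, structural prior)
    {"rate": 2.0, "dt": 0.1, "prior": 0.6},
    {"rate": 0.5, "dt": 1.0, "prior": 0.3},
]
new_motif = {"rate": 1.0, "dt": 0.4, "prior": 0.2}

# Pick the most probable event: extend an open motif or start a new one.
candidates = open_motifs + [new_motif]
best = max(candidates, key=lambda m: score(m["rate"], m["dt"], m["prior"]))
extend = best is not new_motif
print("extend existing motif" if extend else "initiate new motif")
```

Here the first open motif (high rate, short elapsed time, strong prior) wins, so the framework would extend it rather than open a new instance.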
[465] Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
Xuan Li, Zhanke Zhou, Zongze Li, Jiangchao Yao, Yu Rong, Lu Zhang, Bo Han
Main category: cs.LG
TL;DR: RePO is a new approach for instruction-based molecular optimization that combines RL with reference guidance to overcome limitations of SFT and RLVR when only single optimized molecules are available without step-by-step trajectories.
Details
Motivation: Standard LLM fine-tuning methods (SFT and RLVR) perform poorly in molecular optimization tasks where only single optimized reference molecules are available without detailed optimization trajectories, leading to reasoning collapse and sparse feedback issues.
Method: Reference-guided Policy Optimization (RePO) samples candidate molecules with reasoning trajectories, uses verifiable rewards for RL training under similarity constraints, and applies reference guidance by keeping the policy’s reasoning as context while training only the answer in a supervised manner.
Result: RePO consistently outperforms SFT and RLVR baselines across molecular optimization benchmarks, achieving improvements on Success Rate Ă Similarity metric, better balance across competing objectives, and better generalization to unseen instruction styles.
Conclusion: RePO effectively addresses the exploration-exploitation tradeoff in molecular optimization by combining RL exploration with reference guidance, enabling learning from single reference molecules without requiring trajectory data.
Abstract: Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model’s lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules with their intermediate reasoning trajectories from the model and trains the model using verifiable rewards that measure property satisfaction under similarity constraints in an RL manner. Meanwhile, it applies reference guidance by keeping the policy’s intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate $\times$ Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at https://github.com/tmlr-group/RePO.
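The two ingredients of the update, a verifiable reward gated by a similarity constraint and a reference-guidance weight on the answer-only supervised term, can be sketched as below. The threshold, candidate numbers, and function shapes are hypothetical:

```python
# Verifiable reward: a property gain counts only if the candidate molecule
# stays within the similarity constraint to the starting molecule.
def verifiable_reward(prop_gain, similarity, sim_threshold=0.4):
    return prop_gain if similarity >= sim_threshold else 0.0

def repo_update_weights(candidates, reference_available=True):
    """Combine the RL term (per-candidate rewards) with a reference-guidance
    term that supervises only the answer, keeping reasoning as context."""
    rl = [verifiable_reward(c["gain"], c["sim"]) for c in candidates]
    guidance = 1.0 if reference_available else 0.0  # weight on the SFT-on-answer term
    return rl, guidance

cands = [{"gain": 0.9, "sim": 0.8}, {"gain": 1.2, "sim": 0.2}]
rl, g = repo_update_weights(cands)
print(rl, g)  # the second candidate violates the similarity constraint
```

The gated reward is what makes feedback sparse under pure RLVR; the nonzero `guidance` term is RePO's way of keeping a learning signal alive when all sampled candidates fail the constraint.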
[466] Stock Market Prediction Using Node Transformer Architecture Integrated with BERT Sentiment Analysis
Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman
Main category: cs.LG
TL;DR: Integrated framework combining node transformer architecture with BERT-based sentiment analysis for stock price forecasting, achieving 0.80% MAPE for one-day-ahead predictions.
Details
Motivation: Stock market prediction is challenging due to noise, non-stationarity, and behavioral dynamics. Traditional methods fail to capture intricate patterns and cross-sectional dependencies in financial markets.
Method: Represents stock market as graph structure with stocks as nodes and relationships as edges. Uses fine-tuned BERT for sentiment analysis from social media, combined with quantitative market features through attention-based fusion. Node transformer processes historical data capturing temporal evolution and cross-sectional dependencies.
Result: Achieves 0.80% MAPE for one-day-ahead predictions vs 1.20% for ARIMA and 1.00% for LSTM. Sentiment analysis reduces error by 10% overall (25% during earnings). Graph-based modeling adds 15% improvement. Directional accuracy reaches 65%. Maintains MAPE below 1.5% during high-volatility periods.
Conclusion: The integrated framework combining graph-based modeling with sentiment analysis significantly improves stock price forecasting accuracy, especially during volatile periods and earnings announcements.
Abstract: Stock market prediction presents considerable challenges for investors, financial institutions, and policymakers operating in complex market environments characterized by noise, non-stationarity, and behavioral dynamics. Traditional forecasting methods often fail to capture the intricate patterns and cross-sectional dependencies inherent in financial markets. This paper presents an integrated framework combining a node transformer architecture with BERT-based sentiment analysis for stock price forecasting. The proposed model represents the stock market as a graph structure where individual stocks form nodes and edges capture relationships including sectoral affiliations, correlated price movements, and supply chain connections. A fine-tuned BERT model extracts sentiment from social media posts and combines it with quantitative market features through attention-based fusion. The node transformer processes historical market data while capturing both temporal evolution and cross-sectional dependencies among stocks. Experiments on 20 S&P 500 stocks spanning January 1982 to March 2025 demonstrate that the integrated model achieves a mean absolute percentage error (MAPE) of 0.80% for one-day-ahead predictions, compared to 1.20% for ARIMA and 1.00% for LSTM. Sentiment analysis reduces prediction error by 10% overall and 25% during earnings announcements, while graph-based modeling contributes an additional 15% improvement by capturing inter-stock dependencies. Directional accuracy reaches 65% for one-day forecasts. Statistical validation through paired t-tests confirms these improvements (p < 0.05 for all comparisons). The model maintains MAPE below 1.5% during high-volatility periods where baseline models exceed 2%.
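The attention-based fusion of the sentiment embedding with quantitative features can be sketched as scaled dot-product attention over the two sources. All vectors here (the pooled sentiment embedding, market features, and query) are random stand-ins:

```python
import numpy as np

def attention_fuse(features, query):
    """Softmax-attend over feature vectors with a query (random here, learned
    in the real model) and return the weighted combination."""
    keys = np.stack(features)                    # (n_sources, d)
    scores = keys @ query / np.sqrt(len(query))  # scaled dot-product scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                                 # attention weights over sources
    return w @ keys                              # fused representation, shape (d,)

rng = np.random.default_rng(1)
d = 8
sentiment = rng.normal(size=d)   # e.g. pooled BERT sentiment embedding
market = rng.normal(size=d)      # e.g. price/volume features
query = rng.normal(size=d)
fused = attention_fuse([sentiment, market], query)
print(fused.shape)
```

The fused vector per stock node would then feed the node transformer alongside the graph structure.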
[467] Design Experiments to Compare Multi-armed Bandit Algorithms
Huiling Meng, Ningyuan Chen, Xuefeng Gao
Main category: cs.LG
TL;DR: Artificial Replay (AR) is a new experimental design for comparing bandit algorithms that reuses recorded rewards from one policy when testing another, reducing experimental costs while maintaining unbiased estimation.
Details
Motivation: Comparing multi-armed bandit algorithms (like UCB and Thompson Sampling) is costly because each run produces only one dependent trajectory, requiring many independent restarts for reliable inference. This delays deployment decisions and increases experimental costs.
Method: AR first runs one policy and records its trajectory. When testing a second policy, it reuses recorded rewards whenever the second policy selects an action already taken by the first policy, and only queries the real environment for new actions. This creates a new analytical framework for comparing bandit algorithms.
Result: The AR estimator is unbiased, requires only T + o(T) user interactions instead of 2T (nearly halving experimental cost for sub-linear regret policies), and has sub-linearly growing variance compared to linearly-growing variance in naïve designs. Experiments with UCB, Thompson Sampling, and ε-greedy policies confirm these theoretical gains.
Conclusion: Artificial Replay provides an efficient experimental design for comparing bandit algorithms that significantly reduces experimental costs while maintaining statistical reliability, enabling faster deployment decisions in online platforms.
Abstract: Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standard A/B tests for static treatments, each run of a bandit algorithm over $T$ users produces only one dependent trajectory, because the algorithm’s decisions depend on all past interactions. Reliable inference therefore demands many independent restarts of the algorithm, making experimentation costly and delaying deployment decisions. We propose Artificial Replay (AR) as a new experimental design for this problem. AR first runs one policy and records its trajectory. When the second policy is executed, it reuses a recorded reward whenever it selects an action the first policy already took, and queries the real environment only otherwise. We develop a new analytical framework for this design and prove three key properties of the resulting estimator: it is unbiased; it requires only $T + o(T)$ user interactions instead of $2T$ for a run of the treatment and control policies, nearly halving the experimental cost when both policies have sub-linear regret; and its variance grows sub-linearly in $T$, whereas the estimator from a naïve design has a linearly-growing variance. Numerical experiments with UCB, Thompson Sampling, and $\varepsilon$-greedy policies confirm these theoretical gains.
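The replay mechanism is easy to sketch on a toy Bernoulli bandit. The two policies below are deliberately simple stand-ins (uniform for the first run, a fixed arm for the second), not UCB or Thompson Sampling, and the point is only the interaction accounting:

```python
import random

random.seed(0)
probs = [0.3, 0.7]            # true Bernoulli arm means (toy environment)
pull = lambda a: 1 if random.random() < probs[a] else 0

# First run: execute one policy and record its rewards per arm.
T = 1000
record = {0: [], 1: []}
for _ in range(T):
    a = random.randrange(2)   # stand-in "control" policy: uniform over arms
    record[a].append(pull(a))

# Second run: reuse a recorded reward whenever the arm matches an unused
# record, and query the real environment only otherwise.
fresh_queries = 0
for _ in range(T):
    a = 1                     # stand-in "treatment" policy: always arm 1
    if record[a]:
        r = record[a].pop()   # replayed reward: no new user interaction
    else:
        r = pull(a)           # fresh interaction with the environment
        fresh_queries += 1

print(f"fresh interactions in second run: {fresh_queries} of {T}")
```

Both runs together cost T plus only the unmatched pulls, which is the T + o(T) accounting in the paper (the o(T) part requires the sub-linear-regret argument, which the stand-in policies here do not exhibit).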
[468] Weak-SIGReg: Covariance Regularization for Stable Deep Learning
Habibullah Akbar
Main category: cs.LG
TL;DR: SIGReg is repurposed from self-supervised learning to stabilize supervised training of low-bias architectures like ViTs by regularizing representations toward isotropic Gaussian distributions, preventing optimization collapse.
Details
Motivation: Modern neural networks rely on architectural priors (BatchNorm, Residual connections) for stable training. Without these, or in low-data regimes with aggressive augmentation, low-bias architectures like Vision Transformers often suffer from optimization collapse. There's a need for general optimization stabilizers beyond architectural hacks.
Method: Adopts Sketched Isotropic Gaussian Regularization (SIGReg) from LeJEPA self-supervised framework and repurposes it for supervised learning. Derives a computationally efficient variant, Weak-SIGReg, which targets the covariance matrix via random sketching. Views representation collapse as stochastic drift in interacting particle systems and constrains representation density toward isotropic Gaussian to mitigate drift.
Result: SIGReg recovers ViT training on CIFAR-100 from collapsed 20.73% to 72.02% accuracy without architectural hacks. Significantly improves convergence of deep vanilla MLPs trained with pure SGD.
Conclusion: SIGReg serves as an effective general optimization stabilizer for supervised learning, particularly beneficial for low-bias architectures like Vision Transformers that are prone to optimization collapse without architectural priors.
Abstract: Modern neural network optimization relies heavily on architectural priors, such as Batch Normalization and Residual connections, to stabilize training dynamics. Without these, or in low-data regimes with aggressive augmentation, low-bias architectures like Vision Transformers (ViTs) often suffer from optimization collapse. This work adopts Sketched Isotropic Gaussian Regularization (SIGReg), recently introduced in the LeJEPA self-supervised framework, and repurposes it as a general optimization stabilizer for supervised learning. While the original formulation targets the full characteristic function, a computationally efficient variant is derived, Weak-SIGReg, which targets the covariance matrix via random sketching. Inspired by interacting particle systems, representation collapse is viewed as stochastic drift; SIGReg constrains the representation density towards an isotropic Gaussian, mitigating this drift. Empirically, SIGReg recovers the training of a ViT on CIFAR-100 from a collapsed 20.73% to 72.02% accuracy without architectural hacks and significantly improves the convergence of deep vanilla MLPs trained with pure SGD. Code is available at \href{https://github.com/kreasof-ai/sigreg}{github.com/kreasof-ai/sigreg}.
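One plausible form of a covariance-targeting sketched penalty is to compare a randomly sketched batch covariance against the identity (the isotropic Gaussian target). The penalty form, sketch size, and scaling below are this sketch's assumptions, not the paper's exact construction:

```python
import numpy as np

def weak_sigreg_penalty(Z, n_sketch=4, rng=None):
    """Squared Frobenius norm of the sketched deviation of the batch
    covariance from the identity (isotropy target)."""
    rng = rng or np.random.default_rng(0)
    Zc = Z - Z.mean(axis=0)                  # center the batch, shape (B, d)
    C = Zc.T @ Zc / (len(Z) - 1)             # batch covariance, shape (d, d)
    S = rng.normal(size=(Z.shape[1], n_sketch)) / np.sqrt(Z.shape[1])
    M = S.T @ (C - np.eye(Z.shape[1])) @ S   # sketched deviation from isotropy
    return float((M ** 2).sum())

rng = np.random.default_rng(42)
iso = rng.normal(size=(512, 16))             # near-isotropic representations
collapsed = np.outer(rng.normal(size=512), rng.normal(size=16))  # rank-1 "collapse"
print(weak_sigreg_penalty(iso), weak_sigreg_penalty(collapsed))
```

The penalty is near zero for the isotropic batch and large for the rank-1 collapsed batch, which is the drift-mitigating pressure described above; in training it would be added to the task loss with a small coefficient.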
[469] Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence
Hui Yang, Tao Ren, Jinyang Jiang, Wan Tian, Yijie Peng
Main category: cs.LG
TL;DR: OMGD is a memory-efficient optimization method for large language model training that achieves better convergence guarantees than existing approaches.
Details
Motivation: Existing memory-efficient optimization methods for large language models either lack clear convergence guarantees or only achieve suboptimal iteration complexity, creating a need for better approaches that balance memory efficiency with theoretical guarantees.
Method: Omni-Masked Gradient Descent (OMGD) uses mask traversal for memory-efficient training, providing a lightweight, plug-and-play approach that integrates with mainstream optimizers while offering improved theoretical convergence properties.
Result: OMGD achieves a strictly improved iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-approximate stationary point, compared to the standard ${\mathcal{O}}(\epsilon^{-4})$ of existing methods, and shows consistent improvements in both fine-tuning and pre-training tasks.
Conclusion: OMGD provides an effective solution for memory-efficient training of large language models with better theoretical convergence guarantees and practical performance improvements over existing methods.
Abstract: Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees, or only achieve the standard ${\mathcal{O}}(\epsilon^{-4})$ iteration complexity in the nonconvex settings. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
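The mask-traversal idea can be sketched on a toy problem: each step updates only one block of parameters, cycling through the blocks so every parameter is visited, and optimizer state need only be kept for the active block. The block partition and toy objective are this sketch's assumptions:

```python
import numpy as np

def omgd_minimize(grad_fn, theta, n_blocks=4, lr=0.1, steps=200):
    """Masked gradient descent: traverse masks cyclically, updating only the
    active parameter block at each step."""
    blocks = np.array_split(np.arange(len(theta)), n_blocks)
    for t in range(steps):
        mask = blocks[t % n_blocks]        # the mask traversal
        g = grad_fn(theta)
        theta[mask] -= lr * g[mask]        # update only the active block
    return theta

# Toy separable quadratic: minimize ||theta - 1||^2
grad_fn = lambda th: 2 * (th - 1.0)
theta = omgd_minimize(grad_fn, np.zeros(8))
print(np.round(theta, 3))                  # each coordinate reaches ~1.0
```

Each of the 4 blocks is updated 50 times here, so every coordinate contracts toward the optimum; the paper's contribution is showing this style of update admits an improved nonconvex rate, not this toy dynamics.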
[470] EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
Zongfang Liu, Shengkun Tang, Boyang Sun, Zhiqiang Shen, Xin Yuan
Main category: cs.LG
TL;DR: EvoESAP: Evolutionary search framework for optimizing non-uniform layer-wise sparsity allocation in sparse Mixture-of-Experts models using a speculative-decoding-inspired metric.
Details
Motivation: Sparse Mixture-of-Experts models are memory- and throughput-bound despite low per-token compute. Existing pruning methods focus on which experts to prune within each layer but use uniform layer-wise sparsity allocation, which may not be optimal.
Method: Decouples pruning into within-layer expert ranking and across-layer budget allocation. Introduces ESAP (Expected Speculative Acceptance Proxy), a teacher-forced metric inspired by speculative decoding that measures how well pruned models match full models. Uses EvoESAP, an evolutionary search framework to optimize non-uniform layer-wise sparsity allocation under fixed global budget.
Result: Across 7B-30B SMoE LLMs at 25% and 50% sparsity, EvoESAP discovers non-uniform allocations that improve open-ended generation (up to +19.6% on MATH-500 at 50% sparsity) while preserving competitive multiple-choice accuracy compared to uniform pruning.
Conclusion: EvoESAP provides a plug-and-play method for optimizing layer-wise sparsity allocation in SMoE models, significantly improving performance over uniform allocation strategies.
Abstract: Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains memory- and throughput-bound because the full expert pool must be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce \textbf{E}xpected \textbf{S}peculative \textbf{A}cceptance \textbf{P}roxy (\textbf{ESAP}), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model. ESAP is bounded and stable, enabling cheap comparison of many candidates without costly autoregressive decoding. Building on ESAP, we propose EvoESAP, an evolutionary searching framework that optimizes a non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it a plug-and-play method with criteria such as Frequency, EAN, SEER, and REAP. Across 7B–30B SMoE LLMs at 25% and 50% sparsity, EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to \textbf{+19.6%} on MATH-500 at 50% sparsity) while preserving competitive multiple-choice accuracy compared with uniform pruning at the same sparsity.
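The evolutionary search over layer-wise budgets can be sketched with a budget-preserving mutation (move one expert's budget between layers) and hill-climbing on a fitness score. The fitness function below is a made-up stand-in for ESAP, and the layer counts are toy values:

```python
import random

random.seed(0)
n_layers, per_layer, keep_total = 4, 8, 16   # 50% global sparsity

def fitness(alloc):
    """Hypothetical proxy standing in for ESAP: later layers matter more."""
    return sum((i + 1) * k for i, k in enumerate(alloc))

def mutate(alloc):
    """Move one expert's budget between layers; the global budget is unchanged."""
    a = alloc[:]
    src = random.choice([i for i in range(n_layers) if a[i] > 1])
    dst = random.choice([i for i in range(n_layers) if a[i] < per_layer])
    a[src] -= 1
    a[dst] += 1
    return a

best = [keep_total // n_layers] * n_layers   # start from the uniform allocation
for _ in range(300):
    cand = mutate(best)
    if fitness(cand) > fitness(best):
        best = cand

print(best, sum(best))                       # non-uniform, same global budget
```

The search discovers a non-uniform allocation that beats the uniform baseline under the (toy) proxy while the global expert budget stays fixed, mirroring the paper's setup; the within-layer pruning order would be supplied by a criterion such as Frequency, EAN, SEER, or REAP.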
[471] Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments
Michael Beukman, Khimya Khetarpal, Zeyu Zheng, Will Dabney, Jakob Foerster, Michael Dennis, Clare Lyle
Main category: cs.LG
TL;DR: PPO performance plateaus in deep RL occur when sample-based loss estimates become poor proxies for the true objective, not due to exploration or optimization issues. The paper models PPO’s outer loop as stochastic optimization and shows plateaus happen when step size is too large relative to gradient noise.
Details
Motivation: The paper addresses the common problem of performance plateaus in deep on-policy RL, particularly in PPO, where agents stagnate at suboptimal levels. The authors aim to understand why these plateaus occur and provide practical solutions.
Method: The authors model PPO’s outer loop (sampling rollouts from parallel environments) as stochastic optimization. They analyze how hyperparameters control step size and gradient noise, and investigate how increasing parallel environments affects these factors. They propose a recipe for co-scaling hyperparameters when increasing parallelization.
Result: The paper validates that performance plateaus occur when outer step size is too large relative to noise. Increasing parallel environments reduces both step size and update noise. The authors show that incorrectly scaling hyperparameters can cause severe performance degradation, but proper co-scaling enables scaling PPO to over 1M parallel environments.
Conclusion: PPO plateaus can be addressed by reducing step size or increasing samples between updates. Increasing parallel environments is a robust solution, and proper hyperparameter co-scaling enables massive parallelization (over 1M environments) and monotonic performance improvement up to one trillion transitions.
Abstract: Plateaus, where an agent’s performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments online using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. The step size is then controlled by the regularization strength towards the previous policy and the gradient noise by the number of samples collected between policy update steps. This model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise. Recasting PPO in this light makes it clear that there are two ways to address this particular type of learning stagnation: either reduce the step size or increase the number of samples collected between updates. We first validate the predictions of our model and investigate how hyperparameter choices influence the step size and update noise, concluding that increasing the number of parallel environments is a simple and robust way to reduce both factors. Next, we propose a recipe for how to co-scale the other hyperparameters when increasing parallelization, and show that incorrectly doing so can lead to severe performance degradation. Finally, we vastly outperform prior baselines in a complex open-ended domain by scaling PPO to more than 1M parallel environments, thereby enabling monotonic performance improvement up to one trillion transitions.
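The paper's stochastic-optimization model of the outer loop can be illustrated on a toy quadratic: with a fixed step size, the iterate stalls at a noise floor set by the step size relative to the gradient noise, and averaging over more parallel environments shrinks that noise. The 1-D objective and scaling constants below are illustrative, not the paper's RL setup:

```python
import random

random.seed(0)

def plateau(step, n_envs, iters=4000):
    """Average |x| near the end of noisy gradient descent on x^2."""
    x = 5.0
    sigma = 2.0 / n_envs ** 0.5   # gradient noise shrinks with more parallel envs
    tail = []
    for t in range(iters):
        x -= step * (2 * x + random.gauss(0, sigma))
        if t > iters // 2:
            tail.append(abs(x))
    return sum(tail) / len(tail)

few = plateau(step=0.3, n_envs=1)
many = plateau(step=0.3, n_envs=64)
print(f"plateau level: {few:.3f} (1 env) vs {many:.3f} (64 envs)")
```

The same step size produces a much lower plateau with more environments, matching the prediction that stagnation appears when the outer step size is too large relative to the noise; the alternative fix, shrinking the step size, slows progress instead.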
[472] Agnostic learning in (almost) optimal time via Gaussian surface area
Lucas Pesenti, Lucas Slot, Manuel Wiedmer
Main category: cs.LG
TL;DR: Improved analysis showing that degree d = Õ(Γ²/ε²) suffices for L₁-approximation of concept classes with Gaussian surface area Γ, leading to near-optimal bounds for agnostically learning polynomial threshold functions in the statistical query model.
Details
Motivation: The paper aims to improve upon previous bounds by Klivans et al. (2008) on the degree needed for L₁-approximation of concept classes under Gaussian marginals, which relates to the complexity of learning in the agnostic model.
Method: The proof relies on a direct analogue of a construction by Feldman et al. (2020) for L₁-approximation on the Boolean hypercube, adapting it to the Gaussian setting.
Result: Shows that degree d = Õ(Γ²/ε²) is sufficient for ε-approximation, improving on the previous O(Γ²/ε⁎) bound and yielding near-optimal bounds for agnostically learning polynomial threshold functions.
Conclusion: The improved analysis provides better bounds on the complexity of agnostically learning concept classes under Gaussian marginals, particularly for polynomial threshold functions in the statistical query model.
Abstract: The complexity of learning a concept class under Gaussian marginals in the difficult agnostic model is closely related to its $L_1$-approximability by low-degree polynomials. For any concept class with Gaussian surface area at most $\Gamma$, Klivans et al. (2008) show that degree $d = O(\Gamma^2 / \varepsilon^4)$ suffices to achieve an $\varepsilon$-approximation. This leads to the best-known bounds on the complexity of learning a variety of concept classes. In this note, we improve their analysis by showing that degree $d = \tilde O(\Gamma^2 / \varepsilon^2)$ is enough. In light of lower bounds due to Diakonikolas et al. (2021), this yields (near) optimal bounds on the complexity of agnostically learning polynomial threshold functions in the statistical query model. Our proof relies on a direct analogue of a construction of Feldman et al. (2020), who considered $L_1$-approximation on the Boolean hypercube.
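The improvement over the prior bound can be summarized in one line (with $\Gamma$ the Gaussian surface area bound and $\varepsilon$ the approximation error, as in the abstract):

```latex
d = O\!\left(\frac{\Gamma^2}{\varepsilon^4}\right)
\ \text{(Klivans et al., 2008)}
\qquad\longrightarrow\qquad
d = \tilde{O}\!\left(\frac{\Gamma^2}{\varepsilon^2}\right)
\ \text{(this work)}
```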
[473] Improved high-dimensional estimation with Langevin dynamics and stochastic weight averaging
Stanley Wei, Alex Damian, Jason D. Lee
Main category: cs.LG
TL;DR: Langevin dynamics with iterate averaging achieves optimal sample complexity for recovering planted directions in high-dimensional problems, matching smoothed gradient descent performance without explicit smoothing.
Details
Motivation: Recent work showed smoothed gradient descent can achieve optimal sample complexity for recovering planted directions, but requires explicit smoothing. The paper investigates whether Langevin dynamics with iterate averaging can achieve the same optimal rates without explicit smoothing.
Method: Analyzes Langevin dynamics with iterate averaging (rather than last iterate) for recovering planted directions in high-dimensional settings. Shows that noise-injection from Langevin dynamics combined with iterate averaging emulates landscape smoothing effects.
Result: Langevin dynamics with iterate averaging succeeds with n ≳ d^{k*/2} samples, matching the optimal rate achieved by smoothed gradient descent. This holds for both tensor PCA and single-index model settings.
Conclusion: Noise-injection combined with iterate averaging can emulate landscape smoothing effects, achieving optimal sample complexity without explicit smoothing. The paper conjectures minibatch SGD may also achieve similar rates without additional noise.
Abstract: Significant recent work has studied the ability of gradient descent to recover a hidden planted direction $\theta^\star \in S^{d-1}$ in different high-dimensional settings, including tensor PCA and single-index models. The key quantity that governs the ability of gradient descent to traverse these landscapes is the information exponent $k^\star$ (Ben Arous et al., 2021), which corresponds to the order of the saddle at initialization in the population landscape. Ben Arous et al. (2021) showed that $n \gtrsim d^{\max(1, k^\star-1)}$ samples were necessary and sufficient for online SGD to recover $\theta^\star$, and Ben Arous et al. (2020) proved a similar lower bound for Langevin dynamics. More recently, Damian et al. (2023) showed it was possible to circumvent these lower bounds by running gradient descent on a smoothed landscape, and that this algorithm succeeds with $n \gtrsim d^{\max(1, k^\star/2)}$ samples, which is optimal in the worst case. This raises the question of whether it is possible to achieve the same rate without explicit smoothing. In this paper, we show that Langevin dynamics can succeed with $n \gtrsim d^{k^\star/2}$ samples if one considers the average iterate, rather than the last iterate. The key idea is that the combination of noise-injection and iterate averaging is able to emulate the effect of landscape smoothing. We apply this result to both the tensor PCA and single-index model settings. Finally, we conjecture that minibatch SGD can also achieve the same rate without adding any additional noise.
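The averaging idea is easy to see on a toy landscape. Below is a minimal pure-Python sketch (not the paper's setting: a simple quadratic in place of tensor PCA, with illustrative step size and temperature) showing that the average of the Langevin iterates lands much closer to the optimum than the noise floor of any single late iterate:

```python
import math
import random

def langevin_average(dim=5, eta=0.1, temp=0.25, steps=5000, burn=500, seed=0):
    """Langevin dynamics on f(x) = ||x||^2 / 2, returning both the last
    iterate and the post-burn-in average of the iterates."""
    rng = random.Random(seed)
    x = [1.0] * dim                      # deterministic start away from the optimum
    total = [0.0] * dim
    kept = 0
    noise_scale = math.sqrt(2.0 * eta * temp)
    for t in range(steps):
        # gradient step on f plus injected Gaussian noise
        x = [xi - eta * xi + noise_scale * rng.gauss(0.0, 1.0) for xi in x]
        if t >= burn:
            total = [ti + xi for ti, xi in zip(total, x)]
            kept += 1
    avg = [ti / kept for ti in total]
    return x, avg

last, avg = langevin_average()
norm = lambda v: math.sqrt(sum(vi * vi for vi in v))
# The last iterate fluctuates at a scale set by the temperature, while the
# averaged iterate concentrates near the optimum (the origin).
```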
[474] Latent Diffusion-Based 3D Molecular Recovery from Vibrational Spectra
Wenjin Wu, Aleš Leonardis, Linjiang Chen, Jianbo Jiao
Main category: cs.LG
TL;DR: IR-GeoDiff is a latent diffusion model that generates 3D molecular geometries from infrared spectra by integrating spectral information into molecular node and edge representations.
Details
Motivation: Existing methods for recovering molecular structures from IR spectra rely on 1D SMILES strings or 2D molecular graphs, which fail to capture the relationship between spectral features and 3D molecular geometry. While diffusion models have improved 3D molecular generation, no model has explored generating 3D geometries from a single IR spectrum.
Method: IR-GeoDiff uses a latent diffusion model that integrates IR spectral information into both node and edge representations of molecular structures. The model learns to map spectral features to 3D molecular geometries through attention mechanisms that focus on characteristic functional group regions in spectra.
Result: The model successfully recovers molecular distributions corresponding to given IR spectra, evaluated from both spectral and structural perspectives. Attention-based analysis shows the model focuses on characteristic functional group regions in IR spectra, consistent with chemical interpretation practices.
Conclusion: IR-GeoDiff represents a significant advancement in molecular structure determination from IR spectra by enabling 3D geometry recovery, bridging the gap between spectral data and molecular spatial configuration.
Abstract: Infrared (IR) spectroscopy, a type of vibrational spectroscopy, is widely used for molecular structure determination and provides critical structural information for chemists. However, existing approaches for recovering molecular structures from IR spectra typically rely on one-dimensional SMILES strings or two-dimensional molecular graphs, which fail to capture the intricate relationship between spectral features and three-dimensional molecular geometry. Recent advances in diffusion models have greatly enhanced the ability to generate molecular structures in 3D space. Yet, no existing model has explored the distribution of 3D molecular geometries corresponding to a single IR spectrum. In this work, we introduce IR-GeoDiff, a latent diffusion model that recovers 3D molecular geometries from IR spectra by integrating spectral information into both node and edge representations of molecular structures. We evaluate IR-GeoDiff from both spectral and structural perspectives, demonstrating its ability to recover the molecular distribution corresponding to a given IR spectrum. Furthermore, an attention-based analysis reveals that the model is able to focus on characteristic functional group regions in IR spectra, qualitatively consistent with common chemical interpretation practices.
[475] Dynamic Momentum Recalibration in Online Gradient Learning
Zhipeng Yao, Rui Yu, Guisong Chang, Ying Li, Yu Zhang, Dazhou Li
Main category: cs.LG
TL;DR: SGDF: SGD with Filter optimizer that uses optimal linear filtering to compute time-varying gains for gradient updates, achieving better bias-variance trade-off than fixed momentum methods.
Details
Motivation: Fixed momentum coefficients in SGD and its variants inherently distort the balance between bias and variance in gradient updates, leading to suboptimal parameter updates. The underlying dynamics of gradient behavior in deep learning optimization remain insufficiently understood.
Method: Reinterpret gradient updates through a signal processing lens and propose SGDF (SGD with Filter), inspired by Optimal Linear Filtering. Computes an online, time-varying gain to dynamically refine gradient estimation by minimizing mean-squared error, achieving an optimal trade-off between noise suppression and signal preservation.
Result: Extensive experiments across diverse architectures and benchmarks demonstrate SGDF surpasses conventional momentum methods and achieves performance on par with or surpassing state-of-the-art optimizers.
Conclusion: SGDF provides a principled approach to gradient optimization through signal processing techniques, offering better bias-variance trade-off and broad applicability to various optimization frameworks.
Abstract: Stochastic Gradient Descent (SGD) and its momentum variants form the backbone of deep learning optimization, yet the underlying dynamics of their gradient behavior remain insufficiently understood. In this work, we reinterpret gradient updates through the lens of signal processing and reveal that fixed momentum coefficients inherently distort the balance between bias and variance, leading to skewed or suboptimal parameter updates. To address this, we propose SGDF (SGD with Filter), an optimizer inspired by the principles of Optimal Linear Filtering. SGDF computes an online, time-varying gain to dynamically refine gradient estimation by minimizing the mean-squared error, thereby achieving an optimal trade-off between noise suppression and signal preservation. Furthermore, our approach could extend to other optimizers, showcasing its broad applicability to optimization frameworks. Extensive experiments across diverse architectures and benchmarks demonstrate SGDF surpasses conventional momentum methods and achieves performance on par with or surpassing state-of-the-art optimizers.
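The filtering idea can be illustrated with a generic scalar Kalman-style filter. This is a textbook MSE-optimal linear filter, not SGDF's actual gain rule, and the observation-noise variance is assumed known here purely for the sketch:

```python
import random

def filtered_gradients(true_grad=1.0, noise_sd=0.5, steps=400, seed=0):
    """Scalar Kalman-style filter over noisy gradient observations.
    The time-varying gain k_t = P / (P + R) is the MSE-optimal linear
    weight between the running estimate and the newest observation."""
    rng = random.Random(seed)
    R = noise_sd ** 2                  # observation-noise variance (assumed known)
    est, P = 0.0, 1.0                  # initial estimate and its error variance
    raw_sq, filt_sq = 0.0, 0.0
    for _ in range(steps):
        obs = true_grad + rng.gauss(0.0, noise_sd)   # noisy stochastic gradient
        k = P / (P + R)                              # optimal linear gain
        est += k * (obs - est)
        P *= (1.0 - k)                               # posterior error variance
        raw_sq += (obs - true_grad) ** 2
        filt_sq += (est - true_grad) ** 2
    return raw_sq / steps, filt_sq / steps

raw_mse, filt_mse = filtered_gradients()
# The filtered estimate tracks the true gradient far more tightly than the
# raw noisy observations do.
```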
[476] DQE: A Semantic-Aware Evaluation Metric for Time Series Anomaly Detection
Yuewei Li, Dalin Zhang, Huan Li, Xinyi Gong, Hongjun Chu, Zhaohui Song
Main category: cs.LG
TL;DR: Proposes a new evaluation metric for time series anomaly detection that addresses limitations in existing metrics through semantic partitioning and threshold aggregation.
Details
Motivation: Existing time series anomaly detection metrics have several limitations: bias toward point-level coverage, insensitivity to near-miss detections, inadequate penalization of false alarms, and inconsistency from threshold selection, leading to unreliable evaluation.
Method: Introduces a partitioning strategy based on detection semantics that decomposes the temporal region around each anomaly into three functionally distinct subregions. Designs finer-grained scoring for each subregion and aggregates detection qualities across the full threshold spectrum to eliminate evaluation inconsistency.
Result: Extensive experiments on synthetic and real-world data show the proposed metric provides stable, discriminative, and interpretable evaluation, achieving robust assessment compared with ten widely used metrics.
Conclusion: The proposed metric addresses critical limitations in existing evaluation practices for time series anomaly detection, enabling more reliable and interpretable assessment of detection performance.
Abstract: Time series anomaly detection has achieved remarkable progress in recent years. However, evaluation practices have received comparatively less attention, despite their critical importance. Existing metrics exhibit several limitations: (1) bias toward point-level coverage, (2) insensitivity or inconsistency in near-miss detections, (3) inadequate penalization of false alarms, and (4) inconsistency caused by threshold or threshold-interval selection. These limitations can produce unreliable or counterintuitive results, hindering objective progress. In this work, we revisit the evaluation of time series anomaly detection from the perspective of detection semantics and propose a novel metric for more comprehensive assessment. We first introduce a partitioning strategy grounded in detection semantics, which decomposes the local temporal region of each anomaly into three functionally distinct subregions. Using this partitioning, we evaluate overall detection behavior across events and design finer-grained scoring mechanisms for each subregion, enabling more reliable and interpretable assessment. Through a systematic study of existing metrics, we identify an evaluation bias associated with threshold-interval selection and adopt an approach that aggregates detection qualities across the full threshold spectrum, thereby eliminating evaluation inconsistency. Extensive experiments on synthetic and real-world data demonstrate that our metric provides stable, discriminative, and interpretable evaluation, while achieving robust assessment compared with ten widely used metrics.
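As a rough illustration of the partitioning idea, one can split the neighborhood of each labeled anomaly into a zone before onset (near-miss or early detections), the anomaly core, and a zone after offset (delayed detections). The margin width and subregion names below are my assumptions, not the paper's definitions:

```python
def partition_anomaly(start, end, margin, series_len):
    """Split the local temporal region around an anomaly [start, end] into
    three illustrative subregions: before onset, the core, and after offset."""
    early = list(range(max(0, start - margin), start))
    core = list(range(start, end + 1))
    delayed = list(range(end + 1, min(series_len, end + 1 + margin)))
    return early, core, delayed

early, core, delayed = partition_anomaly(start=10, end=12, margin=3, series_len=20)
# early   -> [7, 8, 9]
# core    -> [10, 11, 12]
# delayed -> [13, 14, 15]
```

A semantic-aware metric can then score detections in each subregion differently, rather than treating every point hit equally.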
[477] Partial Policy Gradients for RL in LLMs
Puneet Mathur, Branislav Kveton, Subhojyoti Mukherjee, Viet Dac Lai
Main category: cs.LG
TL;DR: Proposes using reward subset optimization in policy gradients to model different policy structures (full planning, greedy, K-step lookahead, segment policies) for more reliable learning, evaluated on persona-alignment conversational problems.
Details
Motivation: To develop a natural approach for modeling policy structure in policy gradients by optimizing subsets of future rewards, where simpler policies (smaller reward subsets) can be learned more reliably due to more accurate empirical gradient estimates.
Method: Uses policy gradient methods with reward subset optimization, where different policy classes are defined by the size of future reward subsets considered. Compares full planning (all future rewards), greedy (immediate reward), K-step lookahead, and segment policies.
Result: Empirical evaluation on persona-alignment conversational problems shows different policies excel in different problems, reflecting their characteristics and highlighting the importance of policy class selection.
Conclusion: The proposed approach enables modeling and comparison of different policy structures, with different policy classes performing best in different conversational alignment problems, demonstrating the value of structured policy design.
Abstract: Reinforcement learning is a framework for learning to act sequentially in an unknown environment. We propose a natural approach for modeling policy structure in policy gradients. The key idea is to optimize for a subset of future rewards: smaller subsets represent simpler policies, which can be learned more reliably because their empirical gradient estimates are more accurate. Our approach allows for modeling and comparison of different policy classes, including full planning, greedy, K-step lookahead, and segment policies. We evaluate the policies empirically on multiple persona-alignment conversational problems. Different policies excel in different problems, reflecting their different characteristics and highlighting the importance of our studied policy class.
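The reward-subset idea is concrete: a K-step lookahead policy optimizes the discounted sum of only the next K rewards. A minimal sketch (function name and signature are illustrative, not from the paper):

```python
def partial_returns(rewards, gamma, K):
    """Return, for each step t, the discounted sum of the next K rewards:
    G_t^(K) = sum_{i=0}^{K-1} gamma^i * r_{t+i}.  K=1 gives the greedy
    objective; K=len(rewards) recovers the full discounted return."""
    T = len(rewards)
    return [sum(gamma ** i * rewards[t + i] for i in range(min(K, T - t)))
            for t in range(T)]

rewards = [1.0, 2.0, 3.0]
partial_returns(rewards, gamma=1.0, K=1)   # -> [1.0, 2.0, 3.0]  (greedy)
partial_returns(rewards, gamma=1.0, K=3)   # -> [6.0, 5.0, 3.0]  (full planning)
```

Smaller K means fewer reward terms in each gradient estimate, hence lower variance but a more myopic policy class.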
[478] Predictive Coding Graphs are a Superset of Feedforward Neural Networks
Björn van Zwol
Main category: cs.LG
TL;DR: PCGs are neuroscience-inspired probabilistic models that generalize feedforward neural networks, enabling non-hierarchical network topologies for ML tasks.
Details
Motivation: To bridge neuroscience-inspired predictive coding networks with contemporary machine learning by showing PCGs mathematically generalize feedforward neural networks, enabling exploration of non-hierarchical network topologies.
Method: Mathematical proof demonstrating that predictive coding graphs (PCGs) define a superset of feedforward artificial neural networks (multilayer perceptrons), establishing formal relationships between these model classes.
Result: PCGs are proven to be mathematical supersets of feedforward neural networks, positioning predictive coding networks more strongly within contemporary ML and supporting proposals for non-hierarchical network architectures.
Conclusion: This work strengthens the connection between neuroscience-inspired models and ML, opening avenues for exploring diverse network topologies beyond traditional hierarchical structures.
Abstract: Predictive coding graphs (PCGs) are a recently introduced generalization to predictive coding networks, a neuroscience-inspired probabilistic latent variable model. Here, we prove how PCGs define a mathematical superset of feedforward artificial neural networks (multilayer perceptrons). This positions PCNs more strongly within contemporary machine learning (ML), and reinforces earlier proposals to study the use of non-hierarchical neural networks for ML tasks, and more generally the notion of topology in neural networks.
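The superset claim can be illustrated on the chain-topology special case: a predictive-coding energy over a layered graph is exactly zero when every state equals its top-down prediction, and the resulting final state coincides with the MLP forward pass. A small numerical sketch (general PCGs allow arbitrary, non-hierarchical edges; this shows only the feedforward special case, with an energy of my own standard form):

```python
import numpy as np

def mlp_forward(x0, weights, f=np.tanh):
    """Standard feedforward pass of a multilayer perceptron."""
    x = x0
    for W in weights:
        x = W @ f(x)
    return x

def pc_energy(states, weights, f=np.tanh):
    """Predictive-coding energy for a chain-topology graph: squared
    prediction error summed over layers, E = sum_l ||x_l - W_l f(x_{l-1})||^2."""
    return sum(np.sum((states[l + 1] - W @ f(states[l])) ** 2)
               for l, W in enumerate(weights))

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
x0 = rng.standard_normal(3)

# Setting each latent state to its top-down prediction drives the energy to
# zero, and the final state then equals the MLP's forward output.
states = [x0]
for W in weights:
    states.append(W @ np.tanh(states[-1]))
```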
[479] Ensemble Graph Neural Networks for Probabilistic Sea Surface Temperature Forecasting via Input Perturbations
Alejandro J. González-Santana, Giovanny A. Cuervo-Londoño, Javier Sánchez
Main category: cs.LG
TL;DR: GNN ensemble learning for sea surface temperature forecasting using input perturbation strategies to generate probabilistic forecasts without retraining multiple models.
Details
Motivation: Need for computationally efficient ocean forecasting models that can represent predictive uncertainty, particularly for regional sea surface temperature prediction where ensemble methods are valuable but traditional approaches are expensive.
Method: Adapted Graph Neural Network architecture to Canary Islands region, implemented homogeneous ensemble approach inspired by bagging using input perturbation during inference (not retraining). Evaluated noise-based ensemble generation strategies: Gaussian noise, Perlin noise, and fractal Perlin noise with systematic variation of noise intensity and spatial structure.
Result: Deterministic skill remained comparable to single-model forecast, but input perturbation type strongly influenced uncertainty representation. Spatially coherent perturbations (low-resolution Perlin noise) achieved better calibration and lower CRPS than random Gaussian perturbations, especially at longer lead times.
Conclusion: Noise structure and scale are critical in ensemble GNN design; carefully constructed input perturbations can yield well-calibrated probabilistic forecasts without additional training cost, supporting feasibility of ensemble GNNs for operational regional ocean prediction.
Abstract: Accurate regional ocean forecasting requires models that are both computationally efficient and capable of representing predictive uncertainty. This work investigates ensemble learning strategies for sea surface temperature (SST) forecasting using Graph Neural Networks (GNNs), with a focus on how input perturbation design affects forecast skill and uncertainty representation. We adapt a GNN architecture to the Canary Islands region in the North Atlantic and implement a homogeneous ensemble approach inspired by bagging, where diversity is introduced during inference by perturbing initial ocean states rather than retraining multiple models. Several noise-based ensemble generation strategies are evaluated, including Gaussian noise, Perlin noise, and fractal Perlin noise, with systematic variation of noise intensity and spatial structure. Ensemble forecasts are assessed over a 15-day horizon using deterministic metrics (RMSE and bias) and probabilistic metrics, including the Continuous Ranked Probability Score (CRPS) and the Spread-skill ratio. Results show that, while deterministic skill remains comparable to the single-model forecast, the type and structure of input perturbations strongly influence uncertainty representation, particularly at longer lead times. Ensembles generated with spatially coherent perturbations, such as low-resolution Perlin noise, achieve better calibration and lower CRPS than purely random Gaussian perturbations. These findings highlight the critical role of noise structure and scale in ensemble GNN design and demonstrate that carefully constructed input perturbations can yield well-calibrated probabilistic forecasts without additional training cost, supporting the feasibility of ensemble GNNs for operational regional ocean prediction.
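CRPS, the headline probabilistic metric here, has a standard closed-form ensemble estimator. A minimal sketch (this is the common empirical estimator, not necessarily the exact variant used in the paper):

```python
def crps_ensemble(members, obs):
    """Empirical ensemble CRPS:
    CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|.
    Lower is better; it rewards both accuracy and calibrated spread."""
    m = len(members)
    term1 = sum(abs(x - obs) for x in members) / m
    term2 = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term1 - term2

crps_ensemble([0.0, 1.0], 0.5)   # -> 0.25
crps_ensemble([0.5, 0.5], 0.5)   # -> 0.0 (perfect deterministic forecast)
```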
[480] Topological descriptors of foot clearance gait dynamics improve differential diagnosis of Parkinsonism
Jhonathan Barrios, Wolfram Erlhagen, Miguel F. Gago, Estela Bicho, Flora Ferreira
Main category: cs.LG
TL;DR: Topological Data Analysis (TDA) applied to foot clearance patterns improves differential diagnosis of Parkinsonian syndromes using gait analysis and machine learning.
Details
Motivation: Differential diagnosis among Parkinsonian syndromes is clinically challenging due to overlapping motor symptoms and subtle gait abnormalities. Accurate differentiation is crucial for treatment planning and prognosis, but conventional gait analysis methods often overlook hidden nonlinear and structural features in foot clearance patterns.
Method: Used Topological Data Analysis (TDA) to extract topological features from foot clearance time series, generating Betti curves, persistence landscapes, and silhouettes. These features were used with a Random Forest classifier on a dataset of 15 controls, 15 idiopathic Parkinson’s disease (IPD), and 14 vascular Parkinsonism (VaP) patients. Models were evaluated with leave-one-out cross-validation (LOOCV).
Result: Betti-curve descriptors yielded the strongest results. For IPD vs VaP classification, foot clearance variables achieved 83% accuracy and AUC=0.89 under LOOCV in the medicated (On) state. Performance improved in the On state and further when both Off and On states were considered, indicating sensitivity of topological features to levodopa-related gait changes.
Conclusion: Integrating TDA with machine learning improves clinical gait analysis and aids differential diagnosis across Parkinsonian disorders by capturing subtle topological features in gait patterns that conventional methods miss.
Abstract: Differential diagnosis among parkinsonian syndromes remains a clinical challenge due to overlapping motor symptoms and subtle gait abnormalities. Accurate differentiation is crucial for treatment planning and prognosis. While gait analysis is a well established approach for assessing motor impairments, conventional methods often overlook hidden nonlinear and structural features embedded in foot clearance patterns. We evaluated Topological Data Analysis (TDA) as a complementary tool for Parkinsonism classification using foot clearance time series. Persistent homology produced Betti curves, persistence landscapes, and silhouettes, which were used as features for a Random Forest classifier. The dataset comprised 15 controls (CO), 15 idiopathic Parkinson’s disease (IPD), and 14 vascular Parkinsonism (VaP). Models were assessed with leave-one-out cross-validation (LOOCV). Betti-curve descriptors consistently yielded the strongest results. For IPD vs VaP, foot clearance variables minimum toe clearance, maximum toe late swing, and maximum heel clearance achieved 83% accuracy and AUC=0.89 under LOOCV in the medicated (On) state. Performance improved in the On state and further when both Off and On states were considered, indicating sensitivity of the topological features to levodopa related gait changes. These findings support integrating TDA with machine learning to improve clinical gait analysis and aid differential diagnosis across parkinsonian disorders.
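A Betti curve, the descriptor that performed best here, is simply the count of persistence intervals alive at each filtration value. A minimal sketch (the grid and the half-open interval convention are illustrative choices):

```python
def betti_curve(intervals, grid):
    """Number of persistence intervals (birth, death) alive at each
    filtration value t, using the convention birth <= t < death."""
    return [sum(1 for b, d in intervals if b <= t < d) for t in grid]

intervals = [(0.0, 2.0), (1.0, 3.0)]
betti_curve(intervals, [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
# -> [1, 1, 2, 2, 1, 1, 0]
```

Sampled on a fixed grid, the curve becomes a fixed-length feature vector suitable for a Random Forest.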
[481] FedSCS-XGB – Federated Server-centric surrogate XGBoost for continual health monitoring
Felix Walger, Mehdi Ejtehadi, Anke Schmeink, Diego Paez-Granados
Main category: cs.LG
TL;DR: A distributed machine learning protocol for human activity recognition from wearable sensors using gradient-boosted decision trees (XGBoost) that preserves key structural properties while enabling distributed training comparable to centralized performance.
Details
Motivation: Wearable sensors with local data processing can enable early detection of health threats like pressure injuries and blood pressure instability in spinal cord injury patients. However, distributed wearable monitoring scenarios face challenges of data heterogeneity and fragmentation, requiring distributed ML approaches that maintain the performance benefits of centralized methods.
Method: Proposes a novel distributed ML protocol for HAR from wearable sensor data based on XGBoost, inspired by Party-Adaptive XGBoost (PAX). The approach explicitly preserves key structural and optimization properties of standard XGBoost including histogram-based split construction and tree-ensemble dynamics. Includes theoretical analysis showing convergence to centralized XGBoost solutions under appropriate conditions.
Result: Empirical evaluation on wearable-sensor HAR dataset shows the protocol matches centralized XGBoost performance with under 1% gap while retaining structural advantages. Benchmarking against centralized XGBoost and IBM PAX demonstrates theoretical convergence properties are reflected in practice.
Conclusion: The proposed distributed XGBoost protocol effectively addresses data fragmentation challenges in wearable-based HAR while maintaining performance comparable to centralized training, making it suitable for remote health monitoring applications.
Abstract: Wearable sensors with local data processing can detect health threats early, enhance documentation, and support personalized therapy. In the context of spinal cord injury (SCI), which involves risks such as pressure injuries and blood pressure instability, continuous monitoring can help mitigate these by enabling early detection and intervention. In this work, we present a novel distributed machine learning (DML) protocol for human activity recognition (HAR) from wearable sensor data based on gradient-boosted decision trees (XGBoost). The proposed architecture is inspired by Party-Adaptive XGBoost (PAX) while explicitly preserving key structural and optimization properties of standard XGBoost, including histogram-based split construction and tree-ensemble dynamics. First, we provide a theoretical analysis showing that, under appropriate data conditions and suitable hyperparameter selection, the proposed distributed protocol can converge to solutions equivalent to centralized XGBoost training. Second, the protocol is empirically evaluated on a representative wearable-sensor HAR dataset, reflecting the heterogeneity and data fragmentation typical of remote monitoring scenarios. Benchmarking against centralized XGBoost and IBM PAX demonstrates that the theoretical convergence properties are reflected in practice. The results indicate that the proposed approach can match centralized performance up to a gap under 1% while retaining the structural advantages of XGBoost in distributed wearable-based HAR settings.
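The histogram-based split construction that the protocol preserves can be sketched as follows: each client summarizes its data as per-bin sums of gradients and hessians, the server adds the histograms, and candidate splits are scored with the standard XGBoost gain. This is a generic federated-histogram illustration under my own simplifications, not the FedSCS-XGB or PAX protocol itself:

```python
def client_histogram(xs, grads, hess, edges):
    """Per-client histogram: (sum of gradients, sum of hessians) per bin."""
    G = [0.0] * (len(edges) + 1)
    H = [0.0] * (len(edges) + 1)
    for x, g, h in zip(xs, grads, hess):
        b = sum(1 for e in edges if x >= e)   # bin index of feature value x
        G[b] += g
        H[b] += h
    return G, H

def best_split_gain(histograms, lam=1.0):
    """Server side: sum the client histograms, then score each split with
    the standard XGBoost gain
    0.5 * (GL^2/(HL+lam) + GR^2/(HR+lam) - G^2/(H+lam))."""
    nbins = len(histograms[0][0])
    G = [sum(h[0][b] for h in histograms) for b in range(nbins)]
    H = [sum(h[1][b] for h in histograms) for b in range(nbins)]
    Gtot, Htot = sum(G), sum(H)
    best = (float("-inf"), None)
    GL = HL = 0.0
    for b in range(nbins - 1):                # candidate split after bin b
        GL += G[b]; HL += H[b]
        GR, HR = Gtot - GL, Htot - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - Gtot**2 / (Htot + lam))
        best = max(best, (gain, b))
    return best
```

Because only aggregated histograms leave each client, the server-side split search is identical to the centralized computation on pooled data.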
[482] DC-Merge: Improving Model Merging with Directional Consistency
Han-Chen Zhang, Zi-Hao Zhou, Mao-Lin Luo, Shimin Di, Min-Ling Zhang, Tong Wei
Main category: cs.LG
TL;DR: DC-Merge: A directional-consistent model merging method that addresses energy imbalance and geometric inconsistency in task vectors to preserve knowledge from multiple task-adapted models.
Details
Motivation: Current model merging methods often fail to preserve knowledge from individual task-adapted models due to imbalanced energy distribution in task vectors (where dominant singular values overshadow weaker but semantically important components) and geometric inconsistency in parameter space that distorts directional geometry during merging.
Method: DC-Merge first balances energy distribution by smoothing singular values of each task vector, then projects energy-balanced vectors onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error, and finally aggregates aligned vectors in the shared subspace before projecting back to original parameter space.
Result: Extensive experiments on vision and vision-language benchmarks show DC-Merge consistently achieves state-of-the-art performance in both full fine-tuning and LoRA settings.
Conclusion: Maintaining directional consistency of singular spaces is crucial for effective model merging, and DC-Merge successfully addresses energy imbalance and geometric inconsistency issues to preserve knowledge from multiple task-adapted models.
Abstract: Model merging aims to integrate multiple task-adapted models into a unified model that preserves the knowledge of each task. In this paper, we identify that the key to this knowledge retention lies in maintaining the directional consistency of singular spaces between merged multi-task vector and individual task vectors. However, this consistency is frequently compromised by two issues: i) an imbalanced energy distribution within task vectors, where a small fraction of singular values dominate the total energy, leading to the neglect of semantically important but weaker components upon merging, and ii) the geometric inconsistency of task vectors in parameter space, which causes direct merging to distort their underlying directional geometry. To address these challenges, we propose DC-Merge, a method for directional-consistent model merging. It first balances the energy distribution of each task vector by smoothing its singular values, ensuring all knowledge components are adequately represented. These energy-balanced vectors are then projected onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error. Finally, the aligned vectors are aggregated in the shared orthogonal subspace and projected back to the original parameter space. Extensive experiments on vision and vision-language benchmarks show that DC-Merge consistently achieves state-of-the-art performance in both full fine-tuning and LoRA settings. The implementation code is available at https://github.com/Tobeginwith/DC-Merge.
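The first step, smoothing the singular-value spectrum of a task vector, might look as follows. The exact smoothing rule is my assumption (a power transform with Frobenius-energy rescaling); the paper's formula may differ:

```python
import numpy as np

def smooth_spectrum(task_vector, gamma=0.5):
    """Flatten the singular-value spectrum of a task vector: raise each
    singular value to a power gamma < 1, then rescale so the total energy
    (Frobenius norm) is preserved.  gamma=1 leaves the matrix unchanged."""
    U, s, Vt = np.linalg.svd(task_vector, full_matrices=False)
    s_flat = s ** gamma
    s_flat *= np.linalg.norm(s) / np.linalg.norm(s_flat)   # keep energy fixed
    return U @ np.diag(s_flat) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))   # stand-in for a task vector (W_ft - W_pre)
```

The transform compresses the ratio between dominant and weak singular values, so weaker but semantically meaningful components are not drowned out when task vectors are aggregated.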
[483] Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions
Aditya Varre, Mark Rofin, Nicolas Flammarion
Main category: cs.LG
TL;DR: Theoretical analysis shows gradient flow in softmax-based models drives optimization toward low-entropy solutions, explaining transformer training dynamics and phenomena like attention sinks.
Details
Motivation: To understand the non-convex training dynamics of softmax-based models, particularly transformers, by analyzing the core building block of self-attention to explain empirical phenomena like attention sinks and massive activations.
Method: Analyzes gradient flow dynamics of the value-softmax model L(Vσ(a)), where V is a learnable value matrix and a is an attention vector. This parameterization represents the core building block of self-attention in transformers.
Result: Gradient flow inherently drives optimization toward solutions with low-entropy outputs. This polarizing effect is universal across various objectives including logistic and square loss.
Conclusion: The theoretical analysis provides direct insight into transformer training dynamics and offers a formal mechanism for explaining empirical phenomena such as attention sinks and massive activations observed in practice.
Abstract: Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as $L(\mathbf{V}\sigma(\mathbf{a}))$, where $\mathbf{V}$ and $\mathbf{a}$ are a learnable value matrix and attention vector, respectively. As the matrix times softmax vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into transformer’s training dynamics. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including logistic and square loss. Furthermore, we discuss the practical implications of these theoretical results, offering a formal mechanism for empirical phenomena such as attention sinks and massive activations.
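The polarization effect is easy to reproduce on a toy instance of the value-softmax model with square loss: discrete-time gradient descent on both V and a (a rough proxy for the gradient flow studied in the paper) drives the entropy of σ(a) below its initial uniform value. A small numerical sketch with illustrative hyperparameters:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

# Gradient descent on the square loss L = ||V softmax(a) - y||^2,
# with both the value matrix V and the attention vector a learnable.
V = np.eye(3)
a = np.zeros(3)                       # uniform softmax: maximal entropy
y = np.array([1.0, 0.0, 0.0])
h0 = entropy(softmax(a))

lr = 0.1
for _ in range(2000):
    p = softmax(a)
    r = 2.0 * (V @ p - y)             # dL/d(V p)
    grad_V = np.outer(r, p)
    J = np.diag(p) - np.outer(p, p)   # softmax Jacobian (symmetric)
    grad_a = J @ (V.T @ r)
    V -= lr * grad_V
    a -= lr * grad_a

h1 = entropy(softmax(a))              # entropy is driven down from h0
```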
[484] Synthetic Monitoring Environments for Reinforcement Learning
Leonard Pleiss, Carolin Schmidt, Maximilian Schiffer
Main category: cs.LG
TL;DR: Synthetic Monitoring Environments (SMEs) provide infinite, configurable continuous control tasks with known optimal policies for precise RL algorithm diagnostics and regret calculation.
Details
Motivation: Current RL benchmarks lack precision and white-box diagnostics, often entangling complexity factors and missing ground-truth optimality metrics, making it difficult to isolate why algorithms fail.
Method: Introduce Synthetic Monitoring Environments (SMEs) - an infinite suite of continuous control tasks with fully configurable task characteristics, known optimal policies, and rigorous geometric state space bounds for systematic within-distribution and out-of-distribution evaluation.
Result: Demonstrated framework benefits through multidimensional ablations of PPO, TD3, and SAC, revealing how specific environmental properties impact performance in both within-distribution and out-of-distribution settings.
Conclusion: SMEs offer a standardized, transparent testbed for transitioning RL evaluation from empirical benchmarking toward rigorous scientific analysis with precise diagnostics.
Abstract: Reinforcement Learning (RL) lacks benchmarks that enable precise, white-box diagnostics of agent behavior. Current environments often entangle complexity factors and lack ground-truth optimality metrics, making it difficult to isolate why algorithms fail. We introduce Synthetic Monitoring Environments (SMEs), an infinite suite of continuous control tasks. SMEs provide fully configurable task characteristics and known optimal policies. As such, SMEs allow for the exact calculation of instantaneous regret. Their rigorous geometric state space bounds allow for systematic within-distribution (WD) and out-of-distribution (OOD) evaluation. We demonstrate the framework’s benefit through multidimensional ablations of PPO, TD3, and SAC, revealing how specific environmental properties - such as action or state space size, reward sparsity and complexity of the optimal policy - impact WD and OOD performance. We thereby show that SMEs offer a standardized, transparent testbed for transitioning RL evaluation from empirical benchmarking toward rigorous scientific analysis.
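Because the optimal policy is known in closed form, instantaneous regret is directly computable per step. A toy illustration (the tracking task and reward below are my own example, not an actual SME):

```python
def optimal_action(state):
    """Known-optimal policy for a toy tracking task: match the state."""
    return state

def reward(state, action):
    return -(action - state) ** 2      # maximal (zero) when action == state

def instantaneous_regret(state, action):
    """Exact per-step regret, computable because the optimum is known."""
    return reward(state, optimal_action(state)) - reward(state, action)

instantaneous_regret(2.0, 1.5)   # -> 0.25
```

Summing this quantity over an episode gives an exact learning curve in regret units, rather than a reward curve with an unknown ceiling.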
[485] Learning to Solve Orienteering Problem with Time Windows and Variable Profits
Songqun Gao, Zanxi Ruan, Patrick Floor, Marco Roveri, Luigi Palopoli, Daniele Fontanelli
Main category: cs.LG
TL;DR: DeCoST is a learning-based two-stage optimization framework for the Orienteering Problem with Time Windows and Variable Profits, decoupling discrete and continuous variables for efficient solving.
Details
Motivation: Current approaches fail to efficiently solve the OPTWVP problem, which involves both discrete (path selection) and continuous (service time allocation) variables. Real-world applications need effective solvers for this complex optimization problem.
Method: Two-stage decoupled approach: Stage 1 uses parallel decoding to predict path and initial service time allocation; Stage 2 optimizes service times via linear programming with long-horizon learning of structure estimation.
Result: DeCoST outperforms state-of-the-art constructive solvers and meta-heuristic algorithms in solution quality and computational efficiency, achieving up to 6.6x inference speedup on instances with <500 nodes.
Conclusion: The proposed framework effectively decouples discrete and continuous variables in OPTWVP, enables efficient coordination between them, and is compatible with various constructive solvers to enhance solution quality.
Abstract: The orienteering problem with time windows and variable profits (OPTWVP) is common in many real-world applications and involves continuous time variables. Current approaches fail to develop an efficient solver for this orienteering problem variant with discrete and continuous variables. In this paper, we propose a learning-based two-stage DEcoupled discrete-Continuous optimization with Service-time-guided Trajectory (DeCoST), which aims to effectively decouple the discrete and continuous decision variables in the OPTWVP problem, while enabling efficient and learnable coordination between them. In the first stage, a parallel decoding structure is employed to predict the path and the initial service time allocation. The second stage optimizes the service times through a linear programming (LP) formulation and provides a long-horizon learning of structure estimation. We rigorously prove the global optimality of the second-stage solution. Experiments on OPTWVP instances demonstrate that DeCoST outperforms both state-of-the-art constructive solvers and the latest meta-heuristic algorithms in terms of solution quality and computational efficiency, achieving up to 6.6x inference speedup on instances with fewer than 500 nodes. Moreover, the proposed framework is compatible with various constructive solvers and consistently enhances the solution quality for OPTWVP.
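The second stage solves an LP over service times. In the special case of a single shared time budget with per-node bounds, that LP reduces to a fractional knapsack with a greedy optimum; a hedged sketch under that simplification (the instance data are hypothetical, and the paper's actual LP also encodes time windows along the path):

```python
def allocate_service_times(profit_rate, upper, budget):
    """Greedy optimum of the single-budget relaxation:
    maximise sum(p_i * s_i) s.t. 0 <= s_i <= upper_i and sum(s_i) <= budget."""
    s = [0.0] * len(profit_rate)
    # Spend the budget on nodes with the highest profit rate first.
    for i in sorted(range(len(profit_rate)), key=lambda i: -profit_rate[i]):
        s[i] = min(upper[i], budget)
        budget -= s[i]
    return s

# Hypothetical instance: three visited nodes on a fixed path.
rates, caps = [2.0, 1.0, 3.0], [4.0, 4.0, 4.0]
times = allocate_service_times(rates, caps, budget=6.0)
profit = sum(p * t for p, t in zip(rates, times))
```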
[486] Agentic retrieval-augmented reasoning reshapes collective reliability under model variability in radiology question answering
Mina Farajiamiri, Jeta Sopa, Saba Afza, Lisa Adams, Felix Barajas Ordonez, Tri-Thien Nguyen, Mahshad Lotfinia, Sebastian Wind, Keno Bressem, Sven Nebelung, Daniel Truhn, Soroosh Tayebi Arasteh
Main category: cs.LG
TL;DR: Agentic retrieval-augmented reasoning pipelines reduce decision dispersion and increase robustness across 34 LLMs on radiology questions, but high agreement doesn’t guarantee correctness and many errors have clinical severity.
Details
Motivation: To understand how agentic retrieval-augmented reasoning pipelines affect reliability under model variability in clinical decision support, beyond just accuracy metrics.
Method: Evaluated 34 LLMs on 169 expert-curated radiology questions, comparing zero-shot inference with radiology-specific multi-step agentic retrieval where all models received identical structured evidence reports from curated knowledge.
Result: Agentic inference reduced inter-model decision dispersion (median entropy 0.48 vs. 0.13), increased robustness of correctness across models (mean 0.74 vs. 0.81), and increased majority consensus. However, high agreement didn’t guarantee correctness, and 72% of incorrect outputs had moderate/high clinical severity.
Conclusion: Agentic retrieval leads to more concentrated decision distributions, stronger consensus, and higher cross-model robustness, but evaluating such systems requires complementary analyses of stability, cross-model robustness, and clinical impact beyond accuracy alone.
Abstract: Agentic retrieval-augmented reasoning pipelines are increasingly used to structure how large language models (LLMs) incorporate external evidence in clinical decision support. These systems iteratively retrieve curated domain knowledge and synthesize it into structured reports before answer selection. Although such pipelines can improve performance, their impact on reliability under model variability remains unclear. In real-world deployment, heterogeneous models may align, diverge, or synchronize errors in ways not captured by accuracy. We evaluated 34 LLMs on 169 expert-curated publicly available radiology questions, comparing zero-shot inference with a radiology-specific multi-step agentic retrieval condition in which all models received identical structured evidence reports derived from curated radiology knowledge. Agentic inference reduced inter-model decision dispersion (median entropy 0.48 vs. 0.13) and increased robustness of correctness across models (mean 0.74 vs. 0.81). Majority consensus also increased overall (P<0.001). Consensus strength and robust correctness remained correlated under both strategies (ρ=0.88 for zero-shot; ρ=0.87 for agentic), although high agreement did not guarantee correctness. Response verbosity showed no meaningful association with correctness. Among 572 incorrect outputs, 72% were associated with moderate or high clinically assessed severity, although inter-rater agreement was low (Îș=0.02). Agentic retrieval therefore was associated with more concentrated decision distributions, stronger consensus, and higher cross-model robustness of correctness. These findings suggest that evaluating agentic systems through accuracy or agreement alone may not always be sufficient, and that complementary analyses of stability, cross-model robustness, and potential clinical impact are needed to characterize reliability under model variability.
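The inter-model decision dispersion reported above is a Shannon entropy over each question's answer distribution across models. A minimal sketch of that measurement (bit-valued entropy; the votes are hypothetical):

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution
    across a pool of models for one question."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical votes from several models on one question.
dispersed = answer_entropy(["A", "B", "A", "C"])   # disagreement -> high entropy
unanimous = answer_entropy(["A", "A", "A", "A"])   # consensus -> zero entropy
```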
[487] Stem: Rethinking Causal Information Flow in Sparse Attention
Lin Niu, Xin Luo, Linchuan Xie, Yifu Sun, Guanghua Yu, Jianchen Zhu, S Kevin Zhou
Main category: cs.LG
TL;DR: Stem: A novel sparsity module for LLMs that reduces quadratic attention complexity by aligning sparsity patterns with information flow in causal attention, using position-dependent top-k selection and output-aware metrics.
Details
Motivation: The quadratic computational complexity of self-attention in LLMs creates a fundamental bottleneck for scaling to long contexts, especially during pre-filling. Existing sparse methods use uniform top-k selection across all token positions, ignoring the cumulative dependency of token information inherent in causal architectures.
Method: Proposes Stem with two key strategies: 1) Token Position-Decay strategy applies position-dependent top-k within each layer to retain initial tokens for recursive dependencies, and 2) Output-Aware Metric prioritizes high-impact tokens based on approximate output magnitude.
Result: Extensive evaluations demonstrate that Stem achieves superior accuracy with reduced computation and pre-filling latency compared to existing methods.
Conclusion: Stem provides an effective plug-and-play sparsity module that aligns with information flow in causal attention mechanisms, addressing the quadratic complexity bottleneck while maintaining accuracy.
Abstract: The quadratic computational complexity of self-attention remains a fundamental bottleneck for scaling Large Language Models (LLMs) to long contexts, particularly during the pre-filling phase. In this paper, we rethink the causal attention mechanism from the perspective of information flow. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a uniform top-k selection across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose Stem, a novel, plug-and-play sparsity module aligned with information flow. First, Stem employs the Token Position-Decay strategy, applying position-dependent top-k within each layer to retain initial tokens for recursive dependencies. Second, to preserve information-rich tokens, Stem utilizes the Output-Aware Metric. It prioritizes high-impact tokens based on approximate output magnitude. Extensive evaluations demonstrate that Stem achieves superior accuracy with reduced computation and pre-filling latency.
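A rough sketch of the Token Position-Decay idea: keep a position-dependent top-k of causal attention scores while always retaining the initial token. The linear decay schedule below is an assumption for illustration, not the paper's exact rule:

```python
import numpy as np

def position_decay_topk(scores, k_max, k_min):
    """Keep a position-dependent top-k of causal attention scores per query row;
    the budget k shrinks with query position (hypothetical linear schedule),
    and the initial token is always retained for recursive dependencies."""
    n = scores.shape[0]
    masked = np.full_like(scores, -np.inf)
    for q in range(n):
        k = max(k_min, k_max - q)            # position-dependent budget
        visible = scores[q, : q + 1]         # causal prefix only
        keep = np.argsort(visible)[-k:]      # top-k keys within the prefix
        masked[q, keep] = visible[keep]
        masked[q, 0] = visible[0]            # always keep the first token
    return masked

S = np.arange(16, dtype=float).reshape(4, 4)
M = position_decay_topk(S, k_max=3, k_min=1)
```

Entries left at -inf are dropped by the subsequent softmax.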
[488] NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches
Ethan Smith
Main category: cs.LG
TL;DR: NOBLE adds nonlinear low-rank branches to transformer linear layers for pretraining efficiency, achieving faster convergence with minimal parameter overhead.
Details
Motivation: To improve transformer pretraining efficiency by adding permanent nonlinear low-rank branches to linear layers, unlike adapter-based PEFT methods like LoRA that are designed for fine-tuning.
Method: Adds nonlinear low-rank branch σ(xW_down)W_up to transformer linear layers, where σ is a learnable nonlinearity. Best performance with CosNet: two-layer cosine nonlinearity with learnable frequency/phase and linear projection in bottleneck space.
Result: Achieves up to 1.47x step speedup to reach baseline eval loss (32% fewer training steps), with only 4% additional parameters and 7% step time overhead, resulting in up to 1.22x net wallclock speedup across LLMs, BERT, VQGAN, and ViT.
Conclusion: NOBLE provides efficient architectural augmentation for transformer pretraining, though stochastic augmentations like Mixup/CutMix may interfere with its benefits in some vision tasks.
Abstract: We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transformer linear layers. Unlike LoRA and other parameter-efficient fine-tuning (PEFT) methods, NOBLE is designed for pretraining from scratch. The branch is a permanent part of the architecture as opposed to an adapter for finetuning on top of frozen weights. The branch computes σ(xW_down)W_up where σ is a learnable nonlinearity. We evaluate several activation functions and find that CosNet, a two-layer cosine nonlinearity with learnable frequency and phase with a linear projection in between them in the bottleneck space, performs best. NOBLE achieves substantial improvements with minimal overhead: up to 1.47x step speedup to reach baseline eval loss (up to 32% fewer training steps), with as low as 4% additional parameters and 7% step time overhead, resulting in up to 1.22x net wallclock speedup. Experiments on LLMs (250M and 1.5B parameters), BERT, VQGAN, and ViT consistently show improved training efficiency. We identify one caveat: Mixup/CutMix augmentation interferes with NOBLE’s benefits in ImageNet classification along with other stochastic augmentations, but when disabled, ViT also improves. This discrepancy is possibly explained by regularization techniques that encourage smoother fits to the target function while NOBLE may specialize more in sharper aspects of the target function.
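A minimal numpy sketch of the branch shape σ(xW_down)W_up with a CosNet-style nonlinearity added to a base linear layer. Dimensions, initialisation, and the fixed frequency/phase are illustrative; in the paper they are learnable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                                     # model width, low-rank bottleneck

W = rng.normal(size=(d, d)) / np.sqrt(d)        # base linear layer
W_down = rng.normal(size=(d, r)) / np.sqrt(d)   # low-rank down-projection
W_mid = rng.normal(size=(r, r))                 # linear map inside the bottleneck
W_up = rng.normal(size=(r, d)) / np.sqrt(r)     # up-projection
freq, phase = 1.5, 0.2                          # learnable in the paper; fixed here

def cosnet(z):
    # Two-layer cosine nonlinearity with a linear projection in between.
    return np.cos(freq * np.cos(freq * z + phase) @ W_mid + phase)

def noble_linear(x):
    # Base linear output plus the nonlinear low-rank branch sigma(x W_down) W_up.
    return x @ W + cosnet(x @ W_down) @ W_up

y = noble_linear(rng.normal(size=(3, d)))
```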
[489] Polarized Direct Cross-Attention Message Passing in GNNs for Machinery Fault Diagnosis
Zongyu Shi, Laibin Zhang, Maoyin Chen
Main category: cs.LG
TL;DR: PolaDCA: A polarized direct cross-attention framework for adaptive graph learning in machinery fault diagnosis, overcoming limitations of static GNNs through data-driven graph construction and dynamic attention mechanisms.
Details
Motivation: Conventional GNNs for machinery fault diagnosis rely on predefined static graph structures and homogeneous aggregation schemes, limiting their ability to model complex dynamic interactions in safety-critical industrial systems. There's a need for more adaptive, robust approaches that can handle varying noise conditions.
Method: Proposes polarized direct cross-attention (PolaDCA), a relational learning framework with direct cross-attention mechanism that dynamically infers attention weights from three semantically distinct node features: individual characteristics, neighborhood consensus, and neighborhood diversity. Uses data-driven graph construction without fixed adjacency matrices.
Result: Theoretical analysis shows superior noise robustness over conventional GNNs. Extensive experiments on industrial datasets (XJTUSuprgear, CWRUBearing, Three-Phase Flow Facility) demonstrate state-of-the-art diagnostic accuracy and enhanced generalization under varying noise conditions, outperforming seven competitive baseline methods.
Conclusion: PolaDCA provides an effective solution for safety-critical industrial applications by enabling adaptive message passing through data-driven graph construction, overcoming limitations of static GNNs in machinery fault diagnosis.
Abstract: The reliability of safety-critical industrial systems hinges on accurate and robust fault diagnosis in rotating machinery. Conventional graph neural networks (GNNs) for machinery fault diagnosis face limitations in modeling complex dynamic interactions due to their reliance on predefined static graph structures and homogeneous aggregation schemes. To overcome these challenges, this paper introduces polarized direct cross-attention (PolaDCA), a novel relational learning framework that enables adaptive message passing through data-driven graph construction. Our approach builds upon a direct cross-attention (DCA) mechanism that dynamically infers attention weights from three semantically distinct node features (such as individual characteristics, neighborhood consensus, and neighborhood diversity) without requiring fixed adjacency matrices. Theoretical analysis establishes PolaDCA’s superior noise robustness over conventional GNNs. Extensive experiments on industrial datasets (i.e., XJTUSuprgear, CWRUBearing and Three-Phase Flow Facility datasets) demonstrate state-of-the-art diagnostic accuracy and enhanced generalization under varying noise conditions, outperforming seven competitive baseline methods. The proposed framework provides an effective solution for safety-critical industrial applications.
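A small sketch of the three node features that drive the DCA attention weights: individual characteristics, neighborhood consensus (mean), and neighborhood diversity (standard deviation). The scoring network that maps these to attention weights is omitted, and the toy graph is hypothetical:

```python
import numpy as np

def node_features(X, adj):
    """Per-node features for attention scoring: the node itself (individual),
    neighbourhood mean (consensus), and neighbourhood std (diversity)."""
    feats = []
    for i in range(len(X)):
        nbrs = X[adj[i]]
        feats.append(np.concatenate([X[i], nbrs.mean(axis=0), nbrs.std(axis=0)]))
    return np.stack(feats)

# Toy 3-node graph with 2-dimensional node signals.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
F = node_features(X, adj)
```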
[490] COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Kartik Sharma, Rakshit S. Trivedi
Main category: cs.LG
TL;DR: COLD-Steer: A training-free framework for inference-time control of LLM behavior by approximating gradient descent effects on activations using kernel methods or finite-difference approximations.
Details
Motivation: Current activation steering methods face a trade-off: sample-efficient methods poorly capture steering signals, while effective methods require hundreds to thousands of examples. There's a need for efficient inference-time control without extensive demonstration data.
Method: Two complementary approaches: (1) unit kernel approximation that updates activations directly using gradients normalized across examples, and (2) finite-difference approximation requiring only two forward passes regardless of example count. Both approximate the representational changes from gradient descent on in-context examples without actual parameter updates.
Result: Achieves up to 95% steering effectiveness while using 50 times fewer samples compared to best baselines. Validated across various steering tasks and benchmarks, including pluralistic alignment tasks.
Conclusion: COLD-Steer enables adaptive, context-aware model control that can flexibly address varying human preferences through principled approximation of learning dynamics rather than specialized training procedures.
Abstract: Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD-Steer facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
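One hedged reading of the finite-difference variant: estimate a directional derivative of a context-derived loss at the activations with two evaluations, then take a single descent-like step on the activations rather than the weights. Everything below (the probe direction, the toy loss) is illustrative, not the paper's exact procedure:

```python
import numpy as np

def fd_steer(h, loss, step=1e-3, lr=0.5):
    """Estimate the directional derivative of `loss` at activation h along a
    probe direction with two evaluations (the "two forward passes"), then take
    one descent-like step on h itself; model weights are never touched."""
    v = h / np.linalg.norm(h)                       # hypothetical probe direction
    dloss = (loss(h + step * v) - loss(h)) / step   # finite-difference slope
    return h - lr * dloss * v

# Toy context loss: distance of the activation to a "steered" target state.
target = np.array([1.0, 0.0])
loss = lambda h: float(np.sum((h - target) ** 2))
h0 = np.array([0.5, 0.0])
h1 = fd_steer(h0, loss)
```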
[491] From Entropy to Calibrated Uncertainty: Training Language Models to Reason About Uncertainty
Azza Jenane, Nassim Walha, Lukas Kuhn, Florian Buettner
Main category: cs.LG
TL;DR: A pipeline for post-training LLMs to produce calibrated uncertainty estimates through entropy-based scoring, Platt scaling, and reinforcement learning alignment.
Details
Motivation: LLMs need interpretable and calibrated uncertainty estimates for high-stakes domains, but existing post-hoc methods are computationally expensive or lack calibration.
Method: Three-stage pipeline: 1) Compute fine-grained entropy-based uncertainty scores on training data, 2) Calibrate scores via Platt scaling, 3) Post-train LLM via reinforcement learning to align policy with calibrated uncertainty signals.
Result: Models achieve better calibration than baselines and generalize to unseen tasks without further processing, learning robust uncertainty reasoning behavior.
Conclusion: The approach provides interpretable and computationally efficient uncertainty estimates at test time, unlike post-hoc methods.
Abstract: Large Language Models (LLMs) that can express interpretable and calibrated uncertainty are crucial in high-stakes domains. While methods to compute uncertainty post-hoc exist, they are often sampling-based and therefore computationally expensive or lack calibration. We propose a three-stage pipeline to post-train LLMs to efficiently infer calibrated uncertainty estimates for their responses. First, we compute fine-grained entropy-based uncertainty scores on the training data, capturing the distributional variability of model outputs in embedding space. Second, these scores are calibrated via Platt scaling, producing reliable and human-interpretable uncertainty signals. Finally, the target LLM is post-trained via reinforcement learning to align its policy with these calibrated signals through a verifiable reward function. Unlike post-hoc uncertainty estimation methods, our approach provides interpretable and computationally efficient uncertainty estimates at test time. Experiments show that models trained with our pipeline achieve better calibration than baselines and generalize to unseen tasks without further processing, suggesting that they learn a robust uncertainty reasoning behavior.
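Stage two is standard Platt scaling: fit p = sigmoid(a·s + b) to binary correctness labels by logistic regression on the raw scores. A minimal numpy sketch with hypothetical scores (the sign convention, higher score meaning more likely correct, is an assumption for illustration):

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a*s + b) to binary labels by gradient descent
    on the log-loss (Platt scaling)."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        g = p - labels                         # gradient of log-loss w.r.t. logit
        a -= lr * float(np.mean(g * scores))
        b -= lr * float(np.mean(g))
    return a, b

# Hypothetical raw scores and correctness labels.
scores = np.array([-2.0, -1.0, 1.0, 2.0])
labels = np.array([0.0, 0.0, 1.0, 1.0])
a, b = platt_scale(scores, labels)
calibrated = 1.0 / (1.0 + np.exp(-(a * scores + b)))
```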
[492] Frequency-Separable Hamiltonian Neural Network for Multi-Timescale Dynamics
Yaojun Li, Yulong Yang, Christine Allen-Blanchette
Main category: cs.LG
TL;DR: FS-HNN improves Hamiltonian Neural Networks by separating fast and slow frequency components using multiple networks trained on different timescales, enabling better multiscale dynamics modeling.
Details
Motivation: Hamiltonian Neural Networks often fail to capture complex temporal dynamics spanning multiple timescales due to spectral bias favoring low-frequency dynamics, and existing solutions are limited or require strong domain assumptions.
Method: FS-HNN parameterizes system Hamiltonian using multiple networks, each governed by Hamiltonian dynamics and trained on data sampled at distinct timescales, and extends to PDEs by learning state- and boundary-conditioned symplectic operators.
Result: FS-HNN improves long-horizon extrapolation performance on challenging dynamical systems and generalizes across a broad range of ODE and PDE problems.
Conclusion: Separating Hamiltonian functions into explicit fast and slow modes enables better multiscale dynamics modeling, overcoming limitations of traditional Hamiltonian Neural Networks.
Abstract: While Hamiltonian mechanics provides a powerful inductive bias for neural networks modeling dynamical systems, Hamiltonian Neural Networks and their variants often fail to capture complex temporal dynamics spanning multiple timescales. This limitation is commonly linked to the spectral bias of deep neural networks, which favors learning low-frequency, slow-varying dynamics. Prior approaches have sought to address this issue through symplectic integration schemes that enforce energy conservation or by incorporating geometric constraints to impose structure on the configuration-space. However, such methods either remain limited in their ability to fully capture multiscale dynamics or require substantial domain specific assumptions. In this work, we exploit the observation that Hamiltonian functions admit decompositions into explicit fast and slow modes and can be reconstructed from these components. We introduce the Frequency-Separable Hamiltonian Neural Network (FS-HNN), which parameterizes the system Hamiltonian using multiple networks, each governed by Hamiltonian dynamics and trained on data sampled at distinct timescales. We further extend this framework to partial differential equations by learning a state- and boundary-conditioned symplectic operators. Empirically, we show that FS-HNN improves long-horizon extrapolation performance on challenging dynamical systems and generalizes across a broad range of ODE and PDE problems.
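A toy illustration of the fast/slow decomposition: two analytic Hamiltonian terms with well-separated stiffness stand in for the learned networks, integrated with symplectic-Euler steps under their sum. The paper trains neural Hamiltonians on data sampled at distinct timescales; this sketch only shows dynamics under a summed Hamiltonian:

```python
def grad_slow(q, p):
    # H_slow = (q**2 + p**2) / 2  ->  (dH/dq, dH/dp)
    return q, p

def grad_fast(q, p):
    # H_fast = 50 * q**2, a stiff potential term (the "fast" component)
    return 100.0 * q, 0.0

def step(q, p, dt):
    """One symplectic-Euler step under the summed Hamiltonian H_slow + H_fast."""
    p = p - dt * (grad_slow(q, p)[0] + grad_fast(q, p)[0])
    q = q + dt * (grad_slow(q, p)[1] + grad_fast(q, p)[1])
    return q, p

total_energy = lambda q, p: 0.5 * (q * q + p * p) + 50.0 * q * q

q, p = 1.0, 0.0
E0 = total_energy(q, p)
for _ in range(1000):
    q, p = step(q, p, 0.001)
```

The symplectic integrator keeps the total energy nearly conserved despite the stiff fast term.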
[493] Tiny, Hardware-Independent, Compression-based Classification
Charles Meyers, Aaron MacSween, Erik Elmroth, Tommy Löfstedt
Main category: cs.LG
TL;DR: Compression-based distance measure (NCD) enables effective client-side machine learning with small data, addressing privacy concerns by allowing models to run locally on user devices.
Details
Motivation: Address privacy conflicts between online platforms and users by enabling client-side machine learning that doesn't require centralized data collection, while overcoming limitations of state-of-the-art methods that need large labeled datasets and heavy computation unsuitable for client devices.
Method: Develop normalized compression distance (NCD) for kernel methods, improve training time, and demonstrate its effectiveness despite lacking formal metric properties. The approach uses compression-based distance measures in classical distance-based ML methods.
Result: NCD performs as well or better than other metrics/kernels with only marginally higher computational cost, enabling simple models with remarkable accuracy trained on very small sample sizes, suitable for client-side deployment.
Conclusion: Compression-based distance measures enable practical client-side machine learning solutions that respect user privacy while maintaining effectiveness, making them suitable for compute-limited hardware with access only to single-user data.
Abstract: The recent developments in machine learning have highlighted a conflict between online platforms and their users in terms of privacy. The importance of user privacy and the struggle for power over user data has been intensified as regulators and operators attempt to police online platforms. As users have become increasingly aware of privacy issues, client-side data storage, management, and analysis have become a favoured approach to large-scale centralised machine learning. However, state-of-the-art machine learning methods require vast amounts of labelled user data, making them unsuitable for models that reside client-side and only have access to a single user’s data. State-of-the-art methods are also computationally expensive, which degrades the user experience on compute-limited hardware and also reduces battery life. A recent alternative approach has proven remarkably successful in classification tasks across a wide variety of data: using a compression-based distance measure (called normalised compression distance) to measure the distance between generic objects in classical distance-based machine learning methods. In this work, we demonstrate that the normalised compression distance is actually not a metric; develop it for the wider context of kernel methods to allow modelling of complex data; and present techniques to improve the training time of models that use this distance measure. We demonstrate that the normalised compression distance works as well as, and sometimes better than, other metrics and kernels, while requiring only marginally more computational cost and in spite of the lack of formal metric properties. The end result is a simple model with remarkable accuracy even when trained on a very small number of samples, allowing for models that are small and effective enough to run entirely on a client device using only user-supplied data.
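The normalised compression distance itself is a few lines on top of any off-the-shelf compressor. A sketch with gzip and 1-nearest-neighbour classification (the toy single-user training texts are hypothetical):

```python
import gzip

def C(x: bytes) -> int:
    """Compressed length under gzip, a practical stand-in for Kolmogorov complexity."""
    return len(gzip.compress(x))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance; as the paper notes, not a formal metric."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

def classify(query: bytes, labelled):
    """1-nearest-neighbour classification under NCD."""
    return min(labelled, key=lambda item: ncd(query, item[0]))[1]

# Hypothetical single-user training texts.
train = [(b"the cat sat on the mat " * 4, "animals"),
         (b"stocks rallied as markets opened " * 4, "finance")]
label = classify(b"the cat sat on the mat again and again", train)
```

No feature engineering or gradient training is involved, which is what makes the approach feasible on compute-limited client hardware.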
[494] CLAIRE: Compressed Latent Autoencoder for Industrial Representation and Evaluation – A Deep Learning Framework for Smart Manufacturing
Mohammadhossein Ghahramani, Mengchu Zhou
Main category: cs.LG
TL;DR: CLAIRE is a hybrid end-to-end framework combining unsupervised deep representation learning with supervised classification for fault detection in high-dimensional industrial data, featuring interpretability through game theory.
Details
Motivation: Fault detection in high-dimensional industrial environments is challenging due to data complexity, noise, and redundancy. Existing methods struggle with these issues, necessitating a robust framework that can learn meaningful representations while providing interpretability.
Method: Uses a hybrid approach: (1) Unsupervised deep autoencoder transforms raw input into compact latent space, capturing intrinsic structure while suppressing noise; (2) Downstream classifier performs binary fault prediction on learned representations; (3) Post hoc interpretability phase uses game-theory-based technique to analyze latent space and identify most informative input features.
Result: CLAIRE significantly outperforms conventional classifiers trained directly on raw features on high-dimensional datasets. The framework demonstrates effective fault detection while providing interpretable insights into feature contributions.
Conclusion: The framework shows potential for integrating explainable AI with feature-aware regularization for robust fault detection. Its modular and interpretable nature makes it adaptable to other domains with complex high-dimensional data like healthcare, finance, and environmental monitoring.
Abstract: Accurate fault detection in high-dimensional industrial environments remains a major challenge due to the inherent complexity, noise, and redundancy in sensor data. This paper introduces CLAIRE, i.e., a hybrid end-to-end learning framework that integrates unsupervised deep representation learning with supervised classification for intelligent quality control in smart manufacturing systems. It employs an optimized deep autoencoder to transform raw input into a compact latent space, effectively capturing the intrinsic data structure while suppressing irrelevant or noisy features. The learned representations are then fed into a downstream classifier to perform binary fault prediction. Experimental results on a high-dimensional dataset demonstrate that CLAIRE significantly outperforms conventional classifiers trained directly on raw features. Moreover, the framework incorporates a post hoc phase, using a game-theory-based interpretability technique, to analyze the latent space and identify the most informative input features contributing to fault predictions. The proposed framework highlights the potential of integrating explainable AI with feature-aware regularization for robust fault detection. The modular and interpretable nature of the proposed framework makes it highly adaptable, offering promising applications in other domains characterized by complex, high-dimensional data, such as healthcare, finance, and environmental monitoring.
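A hedged stand-in for the pipeline's first stage, using a linear autoencoder (PCA via SVD) in place of the paper's deep autoencoder: compress noisy high-dimensional data to a compact latent code that the downstream fault classifier would consume. The synthetic sensor data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical sensor matrix: 200 samples x 50 features with rank-3 structure
# plus noise, mimicking redundant high-dimensional industrial data.
latent_true = rng.normal(size=(200, 3))
X = latent_true @ rng.normal(size=(3, 50)) + 0.05 * rng.normal(size=(200, 50))

# Linear autoencoder via SVD (PCA): encoder/decoder stand-in for the deep AE.
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
Z = (X - mu) @ Vt[:3].T          # encoder: compact latent codes
X_rec = Z @ Vt[:3] + mu          # decoder: reconstruction from the latent space

# Fraction of variance the 3-dimensional latent space fails to explain.
recon_err = np.mean((X - X_rec) ** 2) / np.mean((X - mu) ** 2)
```

In the full framework, Z feeds a binary fault classifier, and a game-theoretic attribution pass would then rank the input features behind each prediction.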
[495] Adaptive Lipschitz-Free Conditional Gradient Methods for Stochastic Composite Nonconvex Optimization
Ganzhao Yuan
Main category: cs.LG
TL;DR: ALFCG is an adaptive projection-free optimization method for stochastic composite nonconvex problems that doesn’t require global smoothness constants or line search, using historical iterate differences to estimate local smoothness.
Details
Motivation: Existing conditional gradient methods require either global smoothness constants, open-loop diminishing stepsizes, or costly backtracking, which limits their practical efficiency and adaptability to unknown geometry.
Method: ALFCG maintains a self-normalized accumulator of historical iterate differences to estimate local smoothness and minimize a quadratic surrogate model at each step. Three variants: ALFCG-FS for finite-sum problems with SPIDER estimator, and ALFCG-MVR1/2 for stochastic expectation problems with momentum-based variance reduction.
Result: ALFCG-FS achieves O(N + √N ε⁻²) iteration complexity, while ALFCG-MVR1 and ALFCG-MVR2 achieve Õ(σ²ε⁻⁎ + ε⁻²) and Õ(σε⁻³ + ε⁻²) respectively, reducing to the optimal Õ(ε⁻²) as the noise level σ → 0. Outperforms state-of-the-art conditional gradient baselines in multiclass classification experiments.
Conclusion: ALFCG provides the first adaptive projection-free framework for stochastic composite nonconvex minimization that adapts to unknown geometry without requiring global smoothness constants or line search, with improved convergence rates and practical performance.
Abstract: We propose ALFCG (Adaptive Lipschitz-Free Conditional Gradient), the first \textit{adaptive} projection-free framework for stochastic composite nonconvex minimization that \textit{requires neither global smoothness constants nor line search}. Unlike prior conditional gradient methods that use open-loop diminishing stepsizes, conservative Lipschitz constants, or costly backtracking, ALFCG maintains a self-normalized accumulator of historical iterate differences to estimate local smoothness and minimize a quadratic surrogate model at each step. This retains the simplicity of Frank-Wolfe while adapting to unknown geometry. We study three variants. ALFCG-FS addresses finite-sum problems with a SPIDER estimator. ALFCG-MVR1 and ALFCG-MVR2 handle stochastic expectation problems by using momentum-based variance reduction with single-batch and two-batch updates, and operate under average and individual smoothness, respectively. To reach an $\varepsilon$-stationary point, ALFCG-FS attains $\mathcal{O}(N+\sqrt{N}\varepsilon^{-2})$ iteration complexity, while ALFCG-MVR1 and ALFCG-MVR2 achieve $\tilde{\mathcal{O}}(\sigma^2\varepsilon^{-4}+\varepsilon^{-2})$ and $\tilde{\mathcal{O}}(\sigma\varepsilon^{-3}+\varepsilon^{-2})$, where $N$ is the number of components and $\sigma$ is the noise level. In contrast to typical $\mathcal{O}(\varepsilon^{-4})$ or $\mathcal{O}(\varepsilon^{-3})$ rates, our bounds reduce to the optimal rate up to logarithmic factors $\tilde{\mathcal{O}}(\varepsilon^{-2})$ as the noise level $\sigma \to 0$. Extensive experiments on multiclass classification over nuclear norm balls and $\ell_p$ balls show that ALFCG generally outperforms state-of-the-art conditional gradient baselines.
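A simplified sketch of the Lipschitz-free idea on a deterministic Frank-Wolfe over the ℓ1 ball: local smoothness is estimated from secant differences of gradients and iterates (a crude stand-in for the paper's self-normalized accumulator), and the stepsize minimises the quadratic surrogate along the Frank-Wolfe direction. The problem instance is hypothetical:

```python
import numpy as np

def lmo_l1(g, radius=1.0):
    """Linear minimisation oracle over the l1 ball: a signed coordinate vertex."""
    i = int(np.argmax(np.abs(g)))
    v = np.zeros_like(g)
    v[i] = -radius * np.sign(g[i])
    return v

# Minimise f(x) = 0.5 ||Ax - b||^2 over the l1 ball with no known ||A^T A||.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 10)), rng.normal(size=20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x = np.zeros(10)
x_prev, g_prev, L = x.copy(), grad(x), 1.0
for _ in range(200):
    g = grad(x)
    dx = np.linalg.norm(x - x_prev)
    if dx > 1e-12:
        # Secant estimate of local smoothness; no global constant needed.
        L = max(L, np.linalg.norm(g - g_prev) / dx)
    d = lmo_l1(g) - x
    # Stepsize minimising the quadratic surrogate with the estimated L.
    gamma = max(0.0, min(1.0, -(g @ d) / (L * (d @ d) + 1e-12)))
    x_prev, g_prev = x, g
    x = x + gamma * d
```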
[496] Adapter-Augmented Bandits for Online Multi-Constrained Multi-Modal Inference Scheduling
Xianzhi Zhang, Yue Xu, Yinlin Zhu, Di Wu, Yipeng Zhou, Miao Hu, Guocong Quan
Main category: cs.LG
TL;DR: M-CMAB: A multi-adapter MLLM inference scheduling framework using contextual multi-armed bandits with multi-constraint optimization for efficient multimodal task allocation under heterogeneous budgets.
Details
Motivation: Multimodal LLM inference scheduling faces challenges due to varying modality compositions, latent reasoning difficulties, and heterogeneous backend costs with system jitter and network variations, requiring efficient online decision-making under irreversible budgets.
Method: Three-component framework: 1) CLS-attentive Predictor with frozen backbone and lightweight adapters for task representation, 2) primal-dual Constrainer with Lagrange multipliers for long-horizon constraints, 3) two-phase Scheduler balancing exploration-exploitation under irreversible budgets using contextual multi-armed bandits.
Result: Outperforms state-of-the-art baselines across budget regimes, achieving up to 14.18% higher reward and closely tracking an oracle-aided upper bound on composite multimodal benchmarks with heterogeneous backends.
Conclusion: M-CMAB effectively addresses multimodal inference scheduling challenges through compact task representations, online constraint enforcement, and balanced exploration-exploitation, with theoretical regret guarantees under multi-dimensional knapsack constraints.
Abstract: Multi-modal large language model (MLLM) inference scheduling enables strong response quality under practical and heterogeneous budgets, beyond what a homogeneous single-backend setting can offer. Yet online MLLM task scheduling is nontrivial, as requests vary sharply in modality composition and latent reasoning difficulty, while execution backends incur distinct, time-varying costs due to system jitter and network variation. These coupled uncertainties pose two core challenges: deriving semantically faithful yet scheduling-relevant multi-modal task representations, and making low-overhead online decisions over irreversible multi-dimensional budgets. Accordingly, we propose \emph{M-CMAB} (\underline{M}ulti-modal \underline{M}ulti-constraint \underline{C}ontextual \underline{M}ulti-\underline{A}rmed \underline{B}andit), a multi-adapter-enhanced MLLM inference scheduling framework with three components: (i) a CLS-attentive, frozen-backbone \emph{Predictor} that extracts compact task representations and updates only lightweight adapters for action-specific estimation; (ii) a primal-dual \emph{Constrainer} that maintains online Lagrange multipliers to enforce long-horizon constraints via per-round objectives; and (iii) a two-phase \emph{Scheduler} that balances exploration and exploitation under irreversible budgets. We establish a regret guarantee under multi-dimensional knapsack constraints. On a composite multimodal benchmark with heterogeneous backends, \emph{M-CMAB} consistently outperforms state-of-the-art baselines across budget regimes, achieving up to 14.18% higher reward and closely tracking an oracle-aided upper bound. Codes are available at https://anonymous.4open.science/r/M2CMAB/.
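The primal-dual Constrainer idea, choosing actions against a Lagrangian-adjusted objective while a multiplier tracks long-horizon budget violation, can be illustrated with a deliberately simplified bandit. All quantities below (backend rewards, costs, budget rate, step size) are hypothetical, and the known-mean setting removes the exploration component that the paper's Scheduler handles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 backends with known mean reward and mean cost per request.
mean_reward = np.array([0.9, 0.6, 0.3])
mean_cost   = np.array([0.8, 0.4, 0.1])
budget_rate = 0.3           # allowed average cost per round (long-horizon constraint)

lam, eta = 0.0, 0.05        # Lagrange multiplier and dual step size
costs = []
for t in range(5000):
    # Primal step: pick the backend maximizing the Lagrangian-adjusted score.
    score = mean_reward - lam * mean_cost
    a = int(np.argmax(score))
    c = mean_cost[a] + 0.05 * rng.standard_normal()   # noisy realized cost
    costs.append(c)
    # Dual step: raise lam when spending above the budget rate, decay otherwise.
    lam = max(0.0, lam + eta * (c - budget_rate))

avg_cost = float(np.mean(costs[-1000:]))
```

The multiplier settles at a level where the adjusted objective mixes backends so that realized spending matches the budget rate, which is the standard long-run behavior of such dual updates.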
[497] Toward Generative Quantum Utility via Correlation-Complexity Map
Chen-Yu Liu, Leonardo Placidi, Eric Brunner, Enrico Rinaldi
Main category: cs.LG
TL;DR: Paper proposes a Correlation-Complexity Map to diagnose when real-world data aligns with IQP quantum generative models, using quantum correlation and classical complexity indicators to identify suitable domains like turbulence data.
Details
Motivation: Need practical diagnostic tools to determine when real-world data distributions are structurally aligned with IQP-type quantum generative models, to advance generative quantum utility and identify quantum-advantage domains.
Method: Develops Correlation-Complexity Map with two indicators: Quantum Correlation-Likeness Indicator (QCLI) from dataset's correlation-order power spectrum, and Classical Correlation-Complexity Indicator (CCI) from fraction of total correlation not captured by optimal Chow-Liu tree. Applies to turbulence data with invertible float-to-bitstring representation and latent-parameter adaptation scheme.
Result: Identifies classical turbulence data as both IQP-compatible and classically complex (high QCLI/high CCI). IQP approach achieves competitive distributional alignment against classical models (RBM, DCGAN) using fewer training snapshots and small latent block.
Conclusion: QCLI/CCI serve as practical indicators for locating IQP-aligned domains, supporting generative quantum utility. The method enables efficient quantum generative modeling for complex real-world data like turbulence.
Abstract: We propose a Correlation-Complexity Map as a practical diagnostic tool for determining when real-world data distributions are structurally aligned with IQP-type quantum generative models. The map is characterized by two complementary indicators: (i) a Quantum Correlation-Likeness Indicator (QCLI), computed from the dataset's correlation-order (Walsh-Hadamard/Fourier) power spectrum aggregated by interaction order and quantified via Jensen-Shannon divergence from an i.i.d. binomial reference; and (ii) a Classical Correlation-Complexity Indicator (CCI), defined as the fraction of total correlation not captured by the optimal Chow-Liu tree approximation, normalized by total correlation. We provide theoretical support by relating QCLI to a support-mismatch mechanism: for fixed-architecture IQP families trained with an MMD objective, higher QCLI implies a smaller irreducible approximation floor. Using the map, we identify classical turbulence data as both IQP-compatible and classically complex (high QCLI/high CCI). Guided by this placement, we use an invertible float-to-bitstring representation and a latent-parameter adaptation scheme that reuses a compact IQP circuit over a temporal sequence by learning and interpolating a low-dimensional latent trajectory. In comparative evaluations against classical models such as Restricted Boltzmann Machines (RBM) and Deep Convolutional Generative Adversarial Networks (DCGAN), the IQP approach achieves competitive distributional alignment while using substantially fewer training snapshots and a small latent block, supporting the use of QCLI/CCI as practical indicators for locating IQP-aligned domains and advancing generative quantum utility.
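A rough reading of the QCLI construction, a correlation-order power spectrum compared against an i.i.d. reference, can be sketched for small bitstrings. The aggregation and divergence choices below follow the abstract's description but are assumptions about the exact definition; the parity dataset is a toy stand-in for "classically complex" data:

```python
import itertools
import numpy as np

def order_spectrum(X):
    """Walsh-Hadamard (parity) power spectrum of ±1 data, aggregated by
    interaction order |S| and normalized to a distribution over orders 1..n."""
    n = X.shape[1]
    power = np.zeros(n)
    for k in range(1, n + 1):
        for S in itertools.combinations(range(n), k):
            coeff = np.mean(np.prod(X[:, list(S)], axis=1))  # parity correlation
            power[k - 1] += coeff ** 2
    return power / power.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = (p + eps) / (p + eps).sum(), (q + eps) / (q + eps).sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(1)
n, m = 6, 4000
# i.i.d. fair bits: every parity coefficient is ~0 (pure sampling noise).
X_iid = rng.choice([-1.0, 1.0], size=(m, n))
# Correlated bits: a global parity constraint puts all power at order n.
X_par = rng.choice([-1.0, 1.0], size=(m, n))
X_par[:, -1] = np.prod(X_par[:, :-1], axis=1)   # last bit = parity of the rest

qcli_like = js_divergence(order_spectrum(X_par), order_spectrum(X_iid))
```

The parity-constrained data concentrates its spectral power at the highest interaction order, so its order spectrum diverges sharply from the i.i.d. reference, which is the kind of signal a QCLI-style indicator would flag.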
[498] When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models
Qitong Wang, Haoran Dai, Haotian Zhang, Christopher Rasmussen, Binghui Wang
Main category: cs.LG
TL;DR: Multimodal diffusion models exhibit "Backdoor Modality Collapse", where backdoor attacks degenerate to rely on a subset of modalities rather than synergistic multimodal effects.
Details
Motivation: As diffusion models revolutionize visual content generation, their rapid adoption raises security concerns about backdoor attacks. The paper challenges the assumption that attacking multiple modalities simultaneously (text and image) would yield complementary effects and strengthen backdoors.
Method: Introduces two novel metrics: Trigger Modality Attribution (TMA) to quantify which modalities contribute to backdoor activation, and Cross-Trigger Interaction (CTI) to measure cross-modal interactions. Conducts extensive experiments across diverse training configurations in multimodal conditional diffusion models.
Result: Reveals a "winner-takes-all" dynamic where backdoor attacks often collapse into subset-modality dominance. Cross-modal interaction is negligible or even negative, contradicting the intuition of synergistic vulnerability. High attack success rates often mask fundamental reliance on a subset of modalities.
Conclusion: Establishes a principled foundation for mechanistic analysis and future defense development by identifying Backdoor Modality Collapse as a critical blind spot in current security assessments of multimodal diffusion models.
Abstract: While diffusion models have revolutionized visual content generation, their rapid adoption has underscored the critical need to investigate vulnerabilities, e.g., to backdoor attacks. In multimodal diffusion models, it is natural to expect that attacking multiple modalities simultaneously (e.g., text and image) would yield complementary effects and strengthen the overall backdoor. In this paper, we challenge this assumption by investigating the phenomenon of Backdoor Modality Collapse, a scenario where the backdoor mechanism degenerates to rely predominantly on a subset of modalities, rendering others redundant. To rigorously quantify this behavior, we introduce two novel metrics: Trigger Modality Attribution (TMA) and Cross-Trigger Interaction (CTI). Through extensive experiments across diverse training configurations in multimodal conditional diffusion, we consistently observe a ``winner-takes-all'' dynamic in backdoor behavior. Our results reveal that (1) attacks often collapse into subset-modality dominance, and (2) cross-modal interaction is negligible or even negative, contradicting the intuition of synergistic vulnerability. These findings highlight a critical blind spot in current assessments, suggesting that high attack success rates often mask a fundamental reliance on a subset of modalities. This establishes a principled foundation for mechanistic analysis and future defense development.
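The two metrics can be made concrete with a toy interaction contrast over attack-success rates (ASR). These formulas are illustrative two-way-contrast definitions, not necessarily the paper's exact TMA/CTI, and the numbers are hypothetical:

```python
# Hypothetical attack-success rates (ASR) measured under four trigger conditions.
asr = {
    "none":  0.01,   # no trigger present
    "text":  0.92,   # text trigger only
    "image": 0.05,   # image trigger only
    "both":  0.93,   # both triggers present
}

# Per-modality attribution: marginal ASR gain from adding that trigger alone.
tma_text  = asr["text"]  - asr["none"]
tma_image = asr["image"] - asr["none"]

# Interaction: ASR gain of the joint trigger beyond the sum of marginal gains
# (a standard two-way interaction contrast).
cti = (asr["both"] - asr["none"]) - (tma_text + tma_image)

# "Collapse" pattern: near-zero interaction plus one dominant modality.
collapsed = abs(cti) < 0.05 and max(tma_text, tma_image) > 10 * min(tma_text, tma_image)
```

With these numbers the text trigger alone explains almost all of the joint ASR and the interaction term is slightly negative, which is exactly the "winner-takes-all" pattern the abstract reports.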
[499] Hierarchical Industrial Demand Forecasting with Temporal and Uncertainty Explanations
Harshavardhan Kamarthi, Shangqing Xu, Xinjie Tong, Xingyu Zhou, James Peters, Joseph Czyzyk, B. Aditya Prakash
Main category: cs.LG
TL;DR: A novel interpretability method for hierarchical probabilistic time-series forecasting that adapts generic interpretability techniques to address hierarchical structures and uncertainty, evaluated on industrial supply chain scenarios.
Details
Motivation: While machine learning models have achieved significant accuracy and scalability for hierarchical time-series forecasting (essential for demand prediction), their interpretability remains largely unexplored, creating a gap between model predictions and practical application understanding.
Method: Introduces an interpretability method specifically designed for large hierarchical probabilistic time-series forecasting, adapting generic interpretability techniques to address challenges associated with hierarchical structures and uncertainty. The method is evaluated using semi-synthetic datasets based on real-world scenarios of explaining hierarchical demands for over ten thousand products at a large chemical company.
Result: The method successfully explained state-of-the-art industrial forecasting methods with significantly higher explainability accuracy. Real-world case studies demonstrated efficacy in identifying important patterns and explanations that help stakeholders better understand forecasts, facilitating identification of key drivers behind forecasted demand.
Conclusion: The approach helps build trust and confidence among users, leading to better adoption and utilization of hierarchical forecasting models in practice by enabling more informed decision-making and strategic planning through improved interpretability.
Abstract: Hierarchical time-series forecasting is essential for demand prediction across various industries. While machine learning models have obtained significant accuracy and scalability on such forecasting tasks, the interpretability of their predictions, informed by application, is still largely unexplored. To bridge this gap, we introduce a novel interpretability method for large hierarchical probabilistic time-series forecasting, adapting generic interpretability techniques while addressing challenges associated with hierarchical structures and uncertainty. Our approach offers valuable interpretative insights in response to real-world industrial supply chain scenarios, including 1) the significance of various time-series within the hierarchy and external variables at specific time points, 2) the impact of different variables on forecast uncertainty, and 3) explanations for forecast changes in response to modifications in the training dataset. To evaluate the explainability method, we generate semi-synthetic datasets based on real-world scenarios of explaining hierarchical demands for over ten thousand products at a large chemical company. The experiments showed that our explainability method successfully explained state-of-the-art industrial forecasting methods with significantly higher explainability accuracy. Furthermore, we provide multiple real-world case studies that show the efficacy of our approach in identifying important patterns and explanations that help stakeholders better understand the forecasts. Additionally, our method facilitates the identification of key drivers behind forecasted demand, enabling more informed decision-making and strategic planning. Our approach helps build trust and confidence among users, ultimately leading to better adoption and utilization of hierarchical forecasting models in practice.
[500] Causal Interpretation of Neural Network Computations with Contribution Decomposition
Joshua Brendan Melander, Zaki Alaoui, Shenghua Liu, Surya Ganguli, Stephen A. Baccus
Main category: cs.LG
TL;DR: CODEC decomposes neural network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes beyond activation patterns for better interpretation and control.
Details
Motivation: Existing neural network interpretation methods focus on activation patterns correlated with concepts, but don't reveal how neurons causally drive outputs. Need direct examination of hidden neuron contributions to understand network computations.
Method: CODEC uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, analyzing how neurons combine to drive outputs rather than just their activation patterns.
Result: Applied to image-classification networks: contributions become sparser and higher-dimensional across layers; positive/negative effects progressively decorrelate. Enables causal manipulation of outputs and interpretable visualizations of image components driving decisions.
Conclusion: CODEC provides interpretable framework for understanding nonlinear computations across layers, establishing contribution modes as informative unit for mechanistic insights into neural networks.
Abstract: Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.
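The distinction between activations and contributions is easiest to see at a linear readout: a neuron's contribution to a logit is its activation scaled by its output weight, and the contributions sum exactly to the logit. CODEC goes further by fitting sparse autoencoders over such contribution vectors; the sketch below shows only the basic decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy readout: hidden activations a (batch x hidden) and class weights W (hidden x classes).
a = np.abs(rng.standard_normal((4, 8)))   # post-ReLU-like (nonnegative) activations
W = rng.standard_normal((8, 3))

logits = a @ W

# Per-neuron *contribution* to one output class: activation times output weight.
# Unlike the activation pattern alone, this carries the sign and magnitude of
# each neuron's push on that logit, and it sums exactly back to the logit.
cls = 2
contrib = a * W[:, cls]                   # (batch x hidden)

assert np.allclose(contrib.sum(axis=1), logits[:, cls])
```

Two neurons with identical activations can thus have opposite contributions (one positive weight, one negative), which is why analyzing contributions can reveal causal structure that activation analysis misses.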
[501] A recipe for scalable attention-based MLIPs: unlocking long-range accuracy with all-to-all node attention
Eric Qu, Brandon M. Wood, Aditi S. Krishnapriyan, Zachary W. Ulissi
Main category: cs.LG
TL;DR: AllScAIP is an attention-based machine learning interatomic potential that scales to 100M training samples and addresses long-range interactions using data-driven all-to-all attention, achieving state-of-the-art accuracy on molecular systems.
Details
Motivation: Current MLIPs struggle with long-range interactions in large systems like biomolecules and electrolytes, often requiring explicit physics-based terms. There's a need for scalable, data-driven approaches that can handle these interactions without heavy inductive biases.
Method: Proposes AllScAIP, an attention-based, energy-conserving MLIP model that uses all-to-all node attention to capture long-range interactions. The model scales to O(100 million) training samples and employs a data-driven approach rather than relying on strong physics-based inductive biases.
Result: Achieves state-of-the-art energy/force accuracy on molecular systems (OMol25), competitive performance on materials (OMat24) and catalysts (OC20). Enables stable, long-timescale MD simulations that accurately recover experimental observables like density and heat of vaporization.
Conclusion: As data and model size scale, benefits of inductive biases diminish while all-to-all attention remains critical for capturing long-range interactions. The approach demonstrates that data-driven attention mechanisms can effectively handle complex atomic interactions in large systems.
Abstract: Machine-learning interatomic potentials (MLIPs) have advanced rapidly, with many top models relying on strong physics-based inductive biases. However, as models scale to larger systems like biomolecules and electrolytes, they struggle to accurately capture long-range (LR) interactions, leading current approaches to rely on explicit physics-based terms or components. In this work, we propose AllScAIP, a straightforward, attention-based, and energy-conserving MLIP model that scales to O(100 million) training samples. It addresses the long-range challenge using an all-to-all node attention component that is data-driven. Extensive ablations reveal that in low-data/small-model regimes, inductive biases improve sample efficiency. However, as data and model size scale, these benefits diminish or even reverse, while all-to-all attention remains critical for capturing LR interactions. Our model achieves state-of-the-art energy/force accuracy on molecular systems, as well as a number of physics-based evaluations (OMol25), while being competitive on materials (OMat24) and catalysts (OC20). Furthermore, it enables stable, long-timescale MD simulations that accurately recover experimental observables, including density and heat of vaporization predictions.
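The long-range ingredient named in the abstract, all-to-all node attention, is ordinary dense attention over atom embeddings with no neighbor cutoff. The single-head NumPy sketch below is not AllScAIP's architecture (which is multi-head and energy-conserving); it just shows the mechanism and its O(n²) pair cost:

```python
import numpy as np

def all_to_all_attention(h, Wq, Wk, Wv):
    """Single-head all-to-all attention over atom embeddings h (n_atoms x d):
    every atom attends to every other atom with no distance cutoff, which is
    what lets each update carry long-range information."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(k.shape[1])       # (n_atoms x n_atoms), dense
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over all atoms
    return attn @ v                              # features mixed across all pairs

rng = np.random.default_rng(0)
n_atoms, d = 16, 8
h = rng.standard_normal((n_atoms, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = all_to_all_attention(h, Wq, Wk, Wv)
```

Graph-based MLIPs restrict the attention/message pattern to a radial cutoff; dropping that restriction is the data-driven route to long-range interactions the paper argues for, at the price of quadratic cost in the number of atoms.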
[502] Expert-Aided Causal Discovery of Ancestral Graphs
Tiago da Silva, Bruna Bazaluk, Eliezer de Souza da Silva, AntĂłnio GĂłis, Salem Lahlou, Dominik Heider, Samuel Kaski, Diego Mesquita, AdĂšle Helena Ribeiro
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2309.12032 returned HTTP 429 (rate limited).
[503] A unified framework for learning with nonlinear model classes from arbitrary linear samples
Ben Adcock, Juan M. Cardenas, Nick Dexter
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2311.14886 returned HTTP 429 (rate limited).
[504] L0-Regularized Quadratic Surface Support Vector Machines
Ahmad Mousavi, Ramin Zandvakili, Zheming Gao
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2501.11268 returned HTTP 429 (rate limited).
[505] CAPS: Context-Aware Priority Sampling for Enhanced Imitation Learning in Autonomous Driving
Hamidreza Mirkhani, Behzad Khamidehi, Ehsan Ahmadi, Mohammed Elmahgiubi, Weize Zhang, Fazel Arasteh, Umar Rajguru, Kasra Rezaee, Dongfeng Bai
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2503.01650 returned HTTP 429 (rate limited).
[506] Temporal Misalignment Attacks against Multimodal Perception in Autonomous Driving
Md Hasan Shahriar, Md Mohaimin Al Barat, Harshavardhan Sundar, Ning Zhang, Naren Ramakrishnan, Y. Thomas Hou, Wenjing Lou
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2507.09095 returned HTTP 429 (rate limited).
[507] Merging Memory and Space: A State Space Neural Operator
Nodens Koren, Samuel Lanthaler
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2507.23428 returned HTTP 429 (rate limited).
[508] TIC-GRPO: Provable and Efficient Optimization for Reinforcement Learning from Human Feedback
Lei Pang, Jun Luo, Ruinan Jin
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2508.02833 returned HTTP 429 (rate limited).
[509] Characterizing Evolution in Expectation-Maximization Estimates for Overspecified Mixed Linear Regression
Zhankun Luo, Abolfazl Hashemi
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2508.10154 returned HTTP 429 (rate limited).
[510] One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning
Yuan Pu, Yazhe Niu, Jia Tang, Junyu Xiong, Shuai Hu, Hongsheng Li
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2509.07945 returned HTTP 429 (rate limited).
[511] Auto-Regressive U-Net for Full-Field Prediction of Shrinkage-Induced Damage in Concrete
Liya Gaynutdinova, Petr Havlásek, Ondƙej RokoĆĄ, Fleur Hendriks, Martin DoĆĄkáƙ
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2509.20507 returned HTTP 429 (rate limited).
[512] Planner Aware Path Learning in Diffusion Language Models Training
Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Anru R. Zhang, Michael Bronstein, Alexander Tong, Avishek Joey Bose
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2509.23405 returned HTTP 429 (rate limited).
[513] Diffusion Alignment as Variational Expectation-Maximization
Jaewoo Lee, Minsu Kim, Sanghyeok Choi, Inhyuck Song, Sujin Yun, Hyeongyu Kang, Woocheol Shin, Taeyoung Yun, Kiyoung Om, Jinkyoo Park
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2510.00502 returned HTTP 429 (rate limited).
[514] Online Minimization of Polarization and Disagreement via Low-Rank Matrix Bandits
Federico Cinus, Yuko Kuroki, Atsushi Miyauchi, Francesco Bonchi
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2510.00803 returned HTTP 429 (rate limited).
[515] Do We Really Need Permutations? Impact of Model Width on Linear Mode Connectivity
Akira Ito, Masanori Yamada, Daiki Chijiwa, Atsutoshi Kumagai
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2510.08023 returned HTTP 429 (rate limited).
[516] KLASS: KL-Guided Fast Inference in Masked Diffusion Models
Seo Hyun Kim, Sunwoo Hong, Hojung Jung, Youngrok Park, Se-Young Yun
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2511.05664 returned HTTP 429 (rate limited).
[517] CADM: Cluster-customized Adaptive Distance Metric for Categorical Data Clustering
Taixi Chen, Yiu-ming Cheung, Yiqun Zhang
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2511.05826 returned HTTP 429 (rate limited).
[518] Beyond Mapping : Domain-Invariant Representations via Spectral Embedding of Optimal Transport Plans
Abdel Djalil Sad Saoud, Fred Maurice NgolĂš Mboula, Hanane Slimani
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2601.13350 returned HTTP 429 (rate limited).
[519] EDIS: Diagnosing LLM Reasoning via Entropy Dynamics
Chenghua Zhu, Siyan Wu, Xiangkang Zeng, Zishan Xu, Zhaolu Kang, Yifu Guo, Yuquan Lu, Junduan Huang, Guojing Zhou
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.01288 returned HTTP 429 (rate limited).
[520] Stress-Testing Alignment Audits With Prompt-Level Strategic Deception
Oliver Daniels, Perusha Moodley, Benjamin M. Marlin, David Lindner
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.08877 returned HTTP 429 (rate limited).
[521] Latent Poincaré Shaping for Agentic Reinforcement Learning
Hanchen Xia, Baoyou Chen, Zelin Zang, Yutang Ge, Guojiang Zhao, Siyu Zhu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.09375 returned HTTP 429 (rate limited).
[522] Stochastic Parroting in Temporal Attention – Regulating the Diagonal Sink
Victoria Hankemeier, Malte Schilling
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.10956 returned HTTP 429 (rate limited).
[523] MolCrystalFlow: Molecular Crystal Structure Prediction via Flow Matching
Cheng Zeng, Harry W. Sullivan, Thomas Egg, Maya M. Martirossyan, Philipp Höllmer, Jirui Jin, Richard G. Hennig, Adrian Roitberg, Stefano Martiniani, Ellad B. Tadmor, Mingjie Liu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.16020 returned HTTP 429 (rate limited).
[524] Conditionally Site-Independent Neural Evolution of Antibody Sequences
Stephen Zhewen Lu, Aakarsh Vermani, Kohei Sanno, Jiarui Lu, Frederick A Matsen, Milind Jagota, Yun S. Song
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.18982 returned HTTP 429 (rate limited).
[525] Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline
Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.24262 returned HTTP 429 (rate limited).
[526] Weight Updates as Activation Shifts: A Principled Framework for Steering
Dyah Adila, John Cooper, Alexander Yun, Avi Trost, Frederic Sala
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.00425 returned HTTP 429 (rate limited).
[527] mlx-vis: GPU-Accelerated Dimensionality Reduction and Visualization on Apple Silicon
Han Xiao
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.04035 returned HTTP 429 (rate limited).
[528] BInD: Bond and Interaction-generating Diffusion Model for Multi-objective Structure-based Drug Design
Joongwon Lee, Wonho Zhung, Jisu Seo, Woo Youn Kim
Main category: cs.LG
Summary unavailable: the arXiv API request for 2405.16861 returned HTTP 429 (rate limited).
[529] Entropic Mirror Descent for Linear Systems: Polyak’s Stepsize and Implicit Bias
Yura Malitsky, Alexander Posch
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.02614 returned HTTP 429 (rate limited).
[530] Quantifying Cross-Attention Interaction in Transformers for Interpreting TCR-pMHC Binding
Jiarui Li, Zixiang Yin, Haley Smith, Zhengming Ding, Samuel J. Landry, Ramgopal R. Mettu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2507.03197 returned HTTP 429 (rate limited).
[531] Diverse and Adaptive Behavior Curriculum for Autonomous Driving: A Student-Teacher Framework with Multi-Agent RL
Ahmed Abouelazm, Johannes Ratz, Philip Schörner, J. Marius Zöllner
Main category: cs.LG
Summary unavailable: the arXiv API request for 2507.19146 returned HTTP 429 (rate limited).
[532] Quantum parameter estimation with uncertainty quantification from continuous measurement data using neural network ensembles
Amanuel Anteneh
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.10756 returned HTTP 429 (rate limited).
[533] VEGA: Electric Vehicle Navigation Agent via Physics-Informed Neural Operator and Proximal Policy Optimization
Hansol Lim, Minhyeok Im, Jonathan Boyack, Jee Won Lee, Jongseong Brad Choi
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.13386 returned HTTP 429 (rate limited).
[534] Spectral/Spatial Tensor Atomic Cluster Expansion with Universal Embeddings in Cartesian Space
Zemin Xu, Wenbo Xie, P. Hu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.14961 returned HTTP 429 (rate limited).
[535] Taxonomy-aware Dynamic Motion Generation on Hyperbolic Manifolds
Luis Augenstein, Noémie Jaquier, Tamim Asfour, Leonel Rozo
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.21281 returned HTTP 429 (rate limited).
[536] Self-Speculative Masked Diffusions
Andrew Campbell, Valentin De Bortoli, Jiaxin Shi, Arnaud Doucet
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.03929 returned HTTP 429 (rate limited).
[537] TCR-EML: Explainable Model Layers for TCR-pMHC Prediction
Jiarui Li, Zixiang Yin, Zhengming Ding, Samuel J. Landry, Ramgopal R. Mettu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.04377 returned HTTP 429 (rate limited).
[538] Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence
Bingji Yi, Qiyuan Liu, Yuwei Cheng, Haifeng Xu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.16657 returned HTTP 429 (rate limited).
[539] Mixed Monotonicity Reachability Analysis of Neural ODE: A Trade-Off Between Tightness and Efficiency
Abdelrahman Sayed Sayed, Pierre-Jean Meyer, Mohamed Ghazel
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.17859 returned HTTP 429 (rate limited).
[540] Real-Time Learning of Predictive Dynamic Obstacle Models for Robotic Motion Planning
Stella Kombo, Masih Haseli, Skylar X. Wei, Joel W. Burdick
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.00814 returned HTTP 429 (rate limited).
[541] DAISI: Data Assimilation with Inverse Sampling using Stochastic Interpolants
Martin Andrae, Erik Larsson, So Takao, Tomas Landelius, Fredrik Lindsten
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.00252 returned HTTP 429 (rate limited).
[542] Laser interferometry as a robust neuromorphic platform for machine learning
Amanuel Anteneh, Kyungeun Kim, J. M. Schwarz, Israel Klich, Olivier Pfister
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.18047 returned HTTP 429 (rate limited).
[543] Online unsupervised Hebbian learning in deep photonic neuromorphic networks
Xi Li, Disha Biswas, Peng Zhou, Wesley H. Brigner, Anna Capuano, Joseph S. Friedman, Qing Gu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.22300 returned HTTP 429 (rate limited).
[544] Validating Interpretability in siRNA Efficacy Prediction: A Perturbation-Based, Dataset-Aware Protocol
Zahra Khodagholi, Niloofar Yousefi
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.10152 returned HTTP 429 (rate limited).
[545] GaiaFlow: Semantic-Guided Diffusion Tuning for Carbon-Frugal Search
Rong Fu, Jia Yee Tan, Chunlei Meng, Shuo Yin, Xiaowen Ma, Wangyu Wu, Muge Qi, Guangzhen Yao, Zhaolu Kang, Zeli Su, Simon Fong
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.15423 returned HTTP 429 (rate limited).
[546] The Limits of Long-Context Reasoning in Automated Bug Fixing
Ravi Raju, Mengmeng Ji, Shubhangi Upasani, Bo Li, Urmish Thakker
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.16069 returned HTTP 429 (rate limited).
[547] What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses
Ihor Kendiukhov
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.22289 returned HTTP 429 (rate limited).
[548] Improving the accuracy of physics-informed neural networks via last-layer retraining
Saad Qadeer, Panos Stinis
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.04672 returned HTTP 429 (rate limited).
[549] Good-Enough LLM Obfuscation (GELO)
Anatoly Belikov, Ilya Fedotov
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.05035 returned HTTP 429 (rate limited).
cs.MA
[550] Information-Theoretic Privacy Control for Sequential Multi-Agent LLM Systems
Sadia Asif, Mohammad Mohammadi Amiri
Main category: cs.MA
TL;DR: A study of compositional privacy leakage in sequential multi-agent LLM systems, with theoretical bounds on information leakage and a privacy-regularized training framework to constrain information flow between agents.
Details
Motivation: Sequential multi-agent LLM systems are deployed in sensitive domains like healthcare and finance, where local privacy constraints may not prevent sensitive information leakage through sequential composition and intermediate representations.
Method: Formalize leakage using mutual information, derive theoretical bounds on how local leakage amplifies across agents, and propose a privacy-regularized training framework that constrains information flow between agent outputs and agent-local sensitive variables.
Result: Evaluation across sequential agent pipelines of varying depth on three benchmark datasets shows stable optimization dynamics and consistent, interpretable privacy-utility trade-offs.
Conclusion: Privacy in agentic LLM systems cannot be guaranteed by local constraints alone and must be treated as a system-level property during both training and deployment.
Abstract: Sequential multi-agent large language model (LLM) systems are increasingly deployed in sensitive domains such as healthcare, finance, and enterprise decision-making, where multiple specialized agents collaboratively process a single user request. Although individual agents may satisfy local privacy constraints, sensitive information can still be inferred through sequential composition and intermediate representations. In this work, we study "compositional privacy leakage" in sequential LLM agent pipelines. We formalize leakage using mutual information and derive a theoretical bound that characterizes how locally introduced leakage can amplify across agents under sequential execution. Motivated by this analysis, we propose a privacy-regularized training framework that directly constrains information flow between agent outputs and agent-local sensitive variables. We evaluate our approach across sequential agent pipelines of varying depth on three benchmark datasets, demonstrating stable optimization dynamics and consistent, interpretable privacy-utility trade-offs. Our results show that privacy in agentic LLM systems cannot be guaranteed by local constraints alone and must instead be treated as a system-level property during both training and deployment.
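The paper's core idea of penalizing information flow can be illustrated with a toy sketch. This is not the authors' implementation: the function names are hypothetical, the plug-in estimator assumes discrete samples, and a real training framework would use a differentiable bound on mutual information rather than this count-based estimate.

```python
import numpy as np

def plug_in_mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    joint = {}
    for xi, yi in zip(x, y):
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
    px = {v: np.mean(x == v) for v in set(x)}
    py = {v: np.mean(y == v) for v in set(y)}
    mi = 0.0
    for (xi, yi), count in joint.items():
        pxy = count / n
        mi += pxy * np.log(pxy / (px[xi] * py[yi]))
    return mi

def privacy_regularized_loss(task_loss, outputs, sensitive, lam=1.0):
    """Task loss plus a leakage penalty on I(agent output; sensitive variable)."""
    return task_loss + lam * plug_in_mutual_information(outputs, sensitive)
```

If the outputs are statistically independent of the sensitive variable, the penalty vanishes and only the task loss remains; perfectly correlated outputs pay a penalty equal to the sensitive variable's entropy.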
[551] The Coordination Gap: Alternation Metrics for Temporal Dynamics in Multi-Agent Battle of the Exes
Nikolaos Al. Papadopoulos, Konstantinos Psannis
Main category: cs.MA
TL;DR: Paper introduces temporally-sensitive metrics for multi-agent coordination, showing conventional metrics can’t distinguish structured alternation from random patterns, and that learned policies often perform worse than random baselines under these new metrics.
Details
Motivation: Current metrics for multi-agent coordination are temporally blind and fail to distinguish structured coordination patterns (like turn-taking) from random or monopolistic behavior, especially as the number of agents grows.
Method: Introduces Perfect Alternation (PA) as a reference coordination regime and proposes six novel Alternation (ALT) metrics as temporally sensitive observables. Uses Q-learning agents as a diagnostic baseline and compares them against random-policy null processes in a Battle of the Exes variant formalized as a Markov game.
Result: Despite high traditional metrics (reward fairness often >0.9), learned policies perform up to 81% below random baselines under ALT metrics. This deficit exists in two-agent case and intensifies with more agents, showing high aggregate payoffs can coexist with poor temporal coordination.
Conclusion: Conventional metrics severely mischaracterize emergent dynamics in multi-agent coordination. Temporally aware observables are necessary for analyzing coordination, and random-policy baselines are essential null processes for interpreting coordination outcomes relative to chance-level behavior.
Abstract: Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn-taking emerges as a periodic coordination regime. Conventional outcome-based metrics (e.g., efficiency and min/max fairness) are temporally blind – they cannot distinguish structured alternation from monopolistic or random access patterns – and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline, and comparing against random-policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT-variant evaluation – a deficit already present in the two-agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi-agent games and highlight random-policy baselines as essential null processes for interpreting coordination outcomes relative to chance-level behavior.
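To make the idea of a temporally sensitive observable concrete, here is a minimal sketch of one hypothetical ALT-style metric together with a random-policy null baseline. The paper's six ALT metrics are not defined in the abstract, so the specific definition below (fraction of consecutive rounds in which the winner changes) is an assumption, not the authors' metric.

```python
import random

def alternation_score(winners):
    """Fraction of consecutive rounds in which the winning agent changes.

    1.0 corresponds to Perfect Alternation (strict turn-taking);
    a monopolist who wins every round scores 0.0.
    """
    if len(winners) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(winners, winners[1:]))
    return changes / (len(winners) - 1)

def random_baseline(n_agents, n_rounds, trials=1000, seed=0):
    """Mean alternation score of a uniform random-policy null process."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        seq = [rng.randrange(n_agents) for _ in range(n_rounds)]
        total += alternation_score(seq)
    return total / trials
```

A random two-agent policy scores about 0.5 under this observable, so a learned policy scoring well below that baseline exhibits the kind of coordination deficit the paper reports, even if its aggregate reward fairness looks high.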
[552] Evaluating LLM Alignment With Human Trust Models
Anushka Debnath, Stephen Cranefield, Bastin Tony Roy Savarimuthu, Emiliano Lorini
Main category: cs.MA
TL;DR: White-box analysis of trust representation in GPT-J-6B shows it aligns most closely with the Castelfranchi socio-cognitive model, indicating that LLMs encode social constructs in their activation space.
Details
Motivation: Trust is crucial for cooperation and decision-making in human and multi-agent systems, but there is limited understanding of how LLMs internally conceptualize and reason about trust.
Method: Used contrastive prompting to generate embedding vectors in GPT-J-6B’s activation space for trust and related interpersonal attributes. Identified trust concepts from five human trust models, established a threshold via pairwise cosine similarities of 60 emotional concepts, then measured similarities between the LLM’s trust representation and the derived trust concepts.
Result: GPT-J-6B’s internal trust representation aligns most closely with the Castelfranchi socio-cognitive model, followed by the Marsh Model, showing meaningful encoding of socio-cognitive constructs.
Conclusion: LLMs encode socio-cognitive constructs in ways that support comparative analyses, inform social cognition theories, and aid human-AI collaborative system design.
Abstract: Trust plays a pivotal role in enabling effective cooperation, reducing uncertainty, and guiding decision-making in both human interactions and multi-agent systems. Although it is significant, there is limited understanding of how large language models (LLMs) internally conceptualize and reason about trust. This work presents a white-box analysis of trust representation in EleutherAI/gpt-j-6B, using contrastive prompting to generate embedding vectors within the activation space of the LLM for dyadic trust and related interpersonal relationship attributes. We first identified trust-related concepts from five established human trust models. We then determined a threshold for significant conceptual alignment by computing pairwise cosine similarities across 60 general emotional concepts. Then we measured the cosine similarities between the LLM’s internal representation of trust and the derived trust-related concepts. Our results show that the internal trust representation of EleutherAI/gpt-j-6B aligns most closely with the Castelfranchi socio-cognitive model, followed by the Marsh Model. These findings indicate that LLMs encode socio-cognitive constructs in their activation space in ways that support meaningful comparative analyses, inform theories of social cognition, and support the design of human-AI collaborative systems.
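The threshold-and-rank procedure described above can be sketched as follows. All function names are illustrative, and the `mean + 2 std` threshold rule is an assumption: the abstract does not state how the significance threshold over the 60 emotional concepts was actually derived.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def alignment_threshold(concept_vecs):
    """Significance threshold from pairwise similarities of a set of
    general (unrelated) concept vectors; mean + 2 std is an assumed rule."""
    sims = [cosine(concept_vecs[i], concept_vecs[j])
            for i in range(len(concept_vecs))
            for j in range(i + 1, len(concept_vecs))]
    return float(np.mean(sims) + 2 * np.std(sims))

def rank_models(trust_vec, model_concept_vecs):
    """Rank trust models by the mean similarity of the LLM's internal
    trust vector to each model's concept vectors, highest first."""
    scores = {name: float(np.mean([cosine(trust_vec, c) for c in concepts]))
              for name, concepts in model_concept_vecs.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Under this scheme, a trust model whose concept vectors sit closest to the LLM's trust representation (here, the Castelfranchi model) would top the ranking, provided its mean similarity clears the threshold.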
[553] MARLIN: Multi-Agent Reinforcement Learning with Murmuration Intelligence and LLM Guidance for Reservoir Management
Heming Fu, Shan Lin, Guojun Xiong
Main category: cs.MA
TL;DR: MARLIN is a decentralized reservoir management framework using multi-agent reinforcement learning with bio-inspired coordination rules and LLM-based reward shaping to handle cascading uncertainties in water resource systems.
Details
Motivation: Climate change intensifies extreme weather events, making adaptive reservoir management critical for protecting vulnerable populations. Traditional centralized optimization suffers from exponential computational complexity and cannot handle real-world uncertainties like water transfer losses, while existing MARL methods fail to achieve effective coordination under uncertainty.
Method: MARLIN integrates bio-inspired alignment, separation, and cohesion rules (inspired by starling murmurations) with multi-agent reinforcement learning. It uses a decentralized approach where individual reservoirs make local decisions while achieving emergent global coordination. An LLM provides real-time reward shaping signals to guide agents to adapt to environmental changes and human-defined preferences.
Result: Experiments on USGS data show MARLIN improves uncertainty handling by 23%, cuts computation by 35%, accelerates flood response by 68%, and exhibits super-linear coordination with complexity scaling 5.4x from 400 to 10,000 nodes.
Conclusion: MARLIN demonstrates potential for disaster prevention and protecting communities through intelligent, scalable water resource management by enabling effective coordination under uncertainty while maintaining computational efficiency.
Abstract: As climate change intensifies extreme weather events, water disasters pose growing threats to global communities, making adaptive reservoir management critical for protecting vulnerable populations and ensuring water security. Modern water resource management faces unprecedented challenges from cascading uncertainties propagating through interconnected reservoir networks. These uncertainties, rooted in physical water transfer losses and environmental variability, make precise control difficult. For example, sending 10 tons downstream may yield only 8-12 tons due to evaporation and seepage. Traditional centralized optimization approaches suffer from exponential computational complexity and cannot effectively handle such real-world uncertainties, while existing multi-agent reinforcement learning (MARL) methods fail to achieve effective coordination under uncertainty. To address these challenges, we present MARLIN, a decentralized reservoir management framework inspired by the collective intelligence of starling murmurations. Integrating bio-inspired alignment, separation, and cohesion rules with MARL, MARLIN enables individual reservoirs to make local decisions while achieving emergent global coordination. In addition, an LLM provides real-time reward shaping signals, guiding agents to adapt to environmental changes and human-defined preferences. Experiments on USGS data show that MARLIN improves uncertainty handling by 23%, cuts computation by 35%, and accelerates flood response by 68%, exhibiting super-linear coordination, with complexity scaling 5.4x from 400 to 10,000 nodes. These results demonstrate MARLIN’s potential for disaster prevention and protecting communities through intelligent, scalable water resource management.
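As a loose numerical analogy (not the paper's implementation), the alignment, separation, and cohesion rules can be applied to scalar reservoir states. The weights, separation radius, and neighbor structure below are invented for illustration:

```python
import numpy as np

def murmuration_step(states, neighbors, w_align=0.3, w_sep=0.2,
                     w_coh=0.2, sep_radius=1.0):
    """One boids-style update over scalar reservoir states.
    states: (N,) array; neighbors: dict node -> list of neighbor indices."""
    new = states.copy()
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue
        nbr_states = states[nbrs]
        align = nbr_states.mean() - states[i]      # match the neighborhood mean
        coh = np.median(nbr_states) - states[i]    # drift toward the group
        close = nbr_states[np.abs(nbr_states - states[i]) < sep_radius]
        sep = -(close - states[i]).sum()           # push away from crowded states
        new[i] = states[i] + w_align * align + w_coh * coh + w_sep * sep
    return new

states = np.array([2.0, 0.0, 1.0, 3.0])
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
updated = murmuration_step(states, neighbors)
```

Each node updates from purely local information, which is the decentralization property MARLIN builds its MARL policies on.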
cs.MM
[554] Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder
Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de GusmĂŁo
Main category: cs.MM
TL;DR: Omni-C is a single dense Transformer encoder that learns shared representations across images, audio, and text through unimodal contrastive pretraining, eliminating the need for separate expert encoders or MoE architectures.
Details
Motivation: Current multimodal systems use separate expert encoders for each modality, causing linear scaling complexity and computational overhead. While unified Omni-models with MoE architectures help, they still increase parameters and introduce routing overhead.
Method: Proposes Omni-C, a single dense Transformer-based encoder with shared backbone parameters and lightweight modality-specific projection heads. Uses unimodal contrastive pretraining on large-scale unaligned data without requiring paired supervision or routing mechanisms.
Result: Achieves performance comparable to expert models in unimodal and cross-modal tasks, with modest zero-shot degradation on audio and text that can be recovered through lightweight linear probing or parameter-efficient fine-tuning. Substantially reduces inference memory usage compared to multi-encoder baselines.
Conclusion: Omni-C demonstrates that a single dense encoder can learn competitive shared representations across heterogeneous modalities, advancing efficient and scalable multimodal learning without the complexity of MoE architectures or separate expert encoders.
Abstract: Recent multimodal systems often rely on separate expert modality encoders which cause linearly scaling complexity and computational overhead with added modalities. While unified Omni-models address this via Mixture-of-Experts (MoE) architectures with specialized experts and routing, they still inflate parameter counts and introduce routing overhead. In this paper, we propose Omni-C (Omni-Compress), a single dense Transformer-based encoder that learns competitive shared representations across heterogeneous modalities (images, audio, and text) through unimodal contrastive pretraining on large-scale unaligned data. By maximizing parameter sharing in the backbone and using lightweight modality-specific projection heads, Omni-C effectively mitigates inter-modality conflicts without requiring MoE, paired supervision, or routing. This design supports efficient deployment on memory-constrained systems via sequential modality processing and low-memory inference, eliminating the need for parallel expert loading or specialized hardware. Experiments show Omni-C achieves performance comparable to expert models in unimodal and cross-modal tasks, with modest zero-shot degradation on audio and text that is largely recovered through lightweight linear probing or parameter-efficient fine-tuning. The unified architecture substantially reduces inference memory usage compared to multi-encoder baselines, advancing efficient and scalable multimodal learning.
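A minimal sketch of the shared-backbone idea, assuming a single linear layer stands in for the dense Transformer and per-modality heads project pooled features into one contrastive space. The class name, shapes, and pooling are invented:

```python
import numpy as np

class OmniStyleEncoder:
    """Toy dense encoder: one shared backbone, lightweight per-modality heads.
    Purely illustrative; the real backbone is a Transformer."""
    def __init__(self, d_in=32, d_model=64, d_out=16,
                 modalities=("image", "audio", "text"), seed=0):
        rng = np.random.default_rng(seed)
        # Backbone parameters are shared across all modalities
        self.backbone = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
        # Modality-specific projection heads are the only per-modality weights
        self.heads = {m: rng.standard_normal((d_model, d_out)) / np.sqrt(d_model)
                      for m in modalities}

    def encode(self, x, modality):
        h = np.tanh(x @ self.backbone)              # shared computation
        z = h.mean(axis=0) @ self.heads[modality]   # pool, then modality head
        return z / np.linalg.norm(z)                # unit norm for contrastive use

enc = OmniStyleEncoder()
rng = np.random.default_rng(1)
img = enc.encode(rng.standard_normal((10, 32)), "image")
aud = enc.encode(rng.standard_normal((50, 32)), "audio")
```

Because only the heads differ per modality, modalities can also be processed sequentially through the one backbone, which is the memory argument the abstract makes.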
eess.AS
[555] Activation Steering for Accent Adaptation in Speech Foundation Models
Jinuo Sun, Yang Xiao, Sung Kyun Chung, Qiuchi Hu, Gongping Huang, Eun-Jung Holden, Ting Dang
Main category: eess.AS
TL;DR: Paper investigates accent information encoding in ASR models, identifies accent-sensitive layers, and introduces parameter-free accent steering for improved recognition.
Details
Motivation: Accent variability causes major errors in ASR, but current adaptation methods lack understanding of where accent information is encoded in model representations.
Method: Extract layer-wise encoder activations, estimate mean-shift directions capturing accent-induced representation shifts, analyze layer-wise accent sensitivity, and introduce parameter-free accent steering during inference.
Result: Accent information concentrates in a narrow band of middle encoder layers; parameter-free accent steering achieves consistent word error rate reductions across eight accents.
Conclusion: Accent variation can be treated as an interpretable subspace in hidden representations, enabling effective parameter-free adaptation through targeted activation space manipulation.
Abstract: Accent variability remains a major source of errors in automatic speech recognition, yet most adaptation methods rely on parameter fine-tuning without understanding where accent information is encoded. We treat accent variation as an interpretable subspace in hidden representations and investigate whether it can be identified and controlled directly in activation space. We extract layer-wise encoder activations and estimate mean-shift directions capturing accent-induced representation shifts. By injecting these directions into individual layers and measuring how they align accented and standard embeddings, we derive a layer-wise accent sensitivity profile, revealing that accent information concentrates in a narrow band of middle encoder layers. Leveraging this structure, we further introduce parameter-free accent steering that modifies representations during inference without updating model weights. Experiments across eight accents show consistent word error rate reductions.
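The mean-shift steering step can be sketched as below. The constant-offset "accent", the layer dimension, and the scaling factor `alpha` are toy assumptions, not the paper's setup:

```python
import numpy as np

def steering_vector(accented_acts, native_acts):
    """Mean-shift direction at one encoder layer: the average offset of
    accented-frame activations from native-frame activations."""
    return accented_acts.mean(axis=0) - native_acts.mean(axis=0)

def steer(hidden, direction, alpha=1.0):
    """Parameter-free steering: shift activations against the accent
    direction at inference time; no model weights are updated."""
    return hidden - alpha * direction

rng = np.random.default_rng(0)
native = rng.standard_normal((200, 128))
accented = native + 0.5          # toy: accent as a constant activation offset
d = steering_vector(accented, native)
steered = steer(accented, d)     # accented frames pulled back toward native
```

The layer-sensitivity profile in the paper would determine at which layer this shift is injected; here that choice is left out.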
[556] ImKWS: Test-Time Adaptation for Keyword Spotting with Class Imbalance
Hanyu Ding, Yang Xiao, Jiaheng Dong, Ting Dang
Main category: eess.AS
TL;DR: ImKWS: A test time adaptation method for keyword spotting that addresses class imbalance between rare keywords and frequent background sounds using split entropy optimization with separate reward/penalty branches and consistency across audio transformations.
Details
Motivation: Keyword spotting (KWS) for voice assistants suffers from accuracy degradation due to environmental noise. While test time adaptation (TTA) can help using only unlabeled test audio, current methods fail to handle severe class imbalance where keywords are rare compared to frequent background sounds, causing standard entropy minimization to become overconfident and biased toward the frequent background class.
Method: ImKWS splits the entropy minimization process into separate reward and penalty branches with different update strengths. It enforces consistency across multiple audio transformations to ensure stable model updates during test time adaptation.
Result: Experiments on the Google Speech Commands dataset show that ImKWS achieves reliable adaptation in realistic imbalanced scenarios, outperforming standard entropy minimization approaches that fail due to class imbalance.
Conclusion: ImKWS effectively addresses the class imbalance problem in test time adaptation for keyword spotting, enabling robust adaptation to noisy environments without requiring labeled data or original training audio.
Abstract: Keyword spotting (KWS) identifies words for voice assistants, but environmental noise frequently reduces accuracy. Standard adaptation fixes this issue but strictly requires original or labeled audio. Test time adaptation (TTA) solves this data constraint using only unlabeled test audio. However, current methods fail to handle the severe imbalance between rare keywords and frequent background sounds. Consequently, standard entropy minimization (EM) becomes overconfident and heavily biased toward the frequent background class. To overcome this problem, we propose a TTA method named ImKWS. Our approach splits the entropy process into a reward branch and a penalty branch with separate update strengths. Furthermore, we enforce consistency across multiple audio transformations to ensure stable model updates. Experiments on the Google Speech Commands dataset indicate ImKWS achieves reliable adaptation in realistic imbalanced scenarios. The code is available on GitHub.
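The abstract does not give the exact objective, so the following is only one plausible reading of "a reward branch and a penalty branch with separate update strengths": confidence is encouraged by the usual entropy term while the frequent background class carries its own, separately weighted penalty. Both the decomposition and the weights are guesses:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_entropy_loss(logits, background_idx=0, w_reward=1.0, w_penalty=0.3):
    """Illustrative split objective: the reward branch minimizes entropy,
    while a separately weighted penalty discourages the model from
    collapsing onto the frequent background class."""
    p = softmax(logits)
    ent = -(p * np.log(p + 1e-8)).sum(axis=-1)   # standard EM term (reward)
    bg_conf = p[:, background_idx]               # background mass (penalty)
    return float(w_reward * ent.mean() + w_penalty * bg_conf.mean())

logits = np.array([[2.0, 0.1, 0.1], [0.2, 3.0, 0.2]])
loss = split_entropy_loss(logits, background_idx=0)
```

The point of the two weights is that the pull toward confidence and the push away from the background class can be tuned independently, unlike plain EM.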
[557] Reconstruct! Don’t Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec
Junhyeok Lee, Xiluo He, Jihwan Lee, Helin Wang, Shrikanth Narayanan, Thomas Thebaud, Laureano Moro-Velazquez, JesĂșs Villalba, Najim Dehak
Main category: eess.AS
TL;DR: JHCodec: A neural audio codec using self-supervised representation reconstruction loss to improve intelligibility, accelerate convergence, and enable zero-lookahead streaming with state-of-the-art performance.
Details
Motivation: Existing neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve speech intelligibility. While semantic encoder distillation helps encoded representations, it doesn't guarantee content preservation in reconstructed speech.
Method: Proposes self-supervised representation reconstruction (SSRR) loss that reconstructs distilled self-supervised representations from codec outputs. This approach accelerates convergence, enhances intelligibility, and enables zero-lookahead streaming in Transformer-based codecs.
Result: JHCodec achieves state-of-the-art performance with minimal latency and reduced training cost. It enables competitive results using only a single GPU, significantly improves intelligibility, and allows zero-lookahead architecture for real-time deployment.
Conclusion: SSRR loss fundamentally improves codec training and performance, making JHCodec a practical solution for high-quality audio compression with preserved intelligibility and low latency for real-time applications.
Abstract: Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that self-supervised representation reconstruction (SSRR) loss fundamentally improves codec training and performance. First, SSRR significantly accelerates convergence, enabling competitive results using only a single GPU. Second, it enhances intelligibility by reconstructing distilled self-supervised representations from codec outputs. Third, SSRR enables high intelligibility without additional lookahead in streaming Transformer-based codecs, allowing a zero-lookahead architecture for real-time deployment. As a result, our JHCodec achieves state-of-the-art performance while maintaining minimal latency and reduced training cost. We open-source the full implementation, training pipeline, and demo on GitHub: https://github.com/jhcodec843/jhcodec.
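Conceptually, SSRR compares self-supervised features of the codec output against those of the input rather than comparing spectrograms alone. In this sketch a fixed random projection stands in for the frozen SSL encoder (a deliberate simplification; the feature extractor, shapes, and MSE choice are assumptions):

```python
import numpy as np

def ssl_features(audio, dim=16, seed=0):
    """Stand-in for a frozen self-supervised encoder: a fixed, deterministic
    projection of audio frames (real systems use a pretrained SSL model)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((audio.shape[-1], dim))
    return np.tanh(audio @ W)

def ssrr_loss(reference, reconstruction):
    """Representation-reconstruction loss: match SSL features computed
    from the codec output to those of the input signal."""
    f_ref = ssl_features(reference)
    f_rec = ssl_features(reconstruction)
    return float(np.mean((f_ref - f_rec) ** 2))

rng = np.random.default_rng(1)
ref = rng.standard_normal((100, 32))           # toy framed audio
rec = ref + 0.1 * rng.standard_normal((100, 32))  # toy codec output
perfect = ssrr_loss(ref, ref)
loss = ssrr_loss(ref, rec)
```

Because the loss is computed on the reconstruction, content must survive decoding, which is the distinction the paper draws against distilling into encoded representations only.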
[558] Activation Steering for Accent-Neutralized Zero-Shot Text-To-Speech
Mu Yang, John H. L. Hansen
Main category: eess.AS
TL;DR: Training-free accent neutralization for zero-shot TTS using inference-time activation steering to preserve speaker timbre while removing accent.
Details
Motivation: Zero-shot TTS models struggle to disentangle speaker timbre from accent, often inheriting both from reference speech. Need for practical accent neutralization while preserving voice identity.
Method: Post-hoc, training-free approach using inference-time activation steering. Extract layer-specific “steering vectors” from internal activation differences between accented and native speech, then apply during inference to guide accent-neutralized output.
Result: Steering vectors effectively mitigate output accent while preserving speaker timbre, showing strong generalizability to unseen accented speakers.
Conclusion: Proposed method offers practical solution for accent-free voice cloning in zero-shot TTS through training-free activation steering.
Abstract: Zero-shot Text-to-Speech (TTS) models can generate speech that captures both the voice timbre and accent of a reference speaker. However, disentangling these attributes remains challenging, as the output often inherits both the accent and timbre from the reference. In this study, we introduce a novel, post-hoc, and training-free approach to neutralize accent while preserving the speaker’s original timbre, utilizing inference-time activation steering. We first extract layer-specific “steering vectors” offline, which are derived from the internal activation differences within the TTS model between accented and native speech. During inference, the steering vectors are applied to guide the model to produce accent-neutralized, timbre-preserving speech. Empirical results demonstrate that the proposed steering vectors effectively mitigate the output accent and exhibit strong generalizability to unseen accented speakers, offering a practical solution for accent-free voice cloning.
[559] StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation
Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng
Main category: eess.AS
TL;DR: Streaming speaker anonymization with emotion preservation using supervised finetuning and emotion distillation on audio codec models
Details
Motivation: Current neural audio codec models for streaming speaker anonymization degrade emotional content: content tokens discard emotional information and models default to dominant acoustic patterns rather than preserving paralinguistic attributes.
Method: Proposes supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, with no inference latency overhead.
Result: Achieves 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), +24% relative UAR improvement over baseline, +10% over emotion-prompt variant, while maintaining strong privacy (EER 49.0%) and 180ms streaming latency
Conclusion: Effective emotion-preserving streaming speaker anonymization with minimal computational overhead, maintaining both privacy and emotional expressiveness
Abstract: We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves a 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7%->49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%). Demo and code are available: https://anonymous3842031239.github.io/
[560] Continual Adaptation for Pacific Indigenous Speech Recognition
Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden, Ting Dang
Main category: eess.AS
TL;DR: Adapting speech foundation models to low-resource Pacific Indigenous languages reveals challenges with data scarcity, catastrophic forgetting, and representational drift, highlighting the need for better adaptation strategies.
Details
Motivation: Speech foundation models struggle with low-resource Pacific Indigenous languages due to severe data scarcity, and full fine-tuning risks catastrophic forgetting. There's a need to understand how to effectively adapt these models to real-world Pacific datasets.
Method: Empirical study adapting models to Pacific datasets, investigating effects of data volume and linguistic features. Evaluated Full Fine-Tuning and Low-Rank Adaptation (LoRA), and analyzed a continual learning framework for sequential multi-language acquisition.
Result: Adapting to distant languages causes severe internal representational drift, creating a plasticity-stability dilemma. LoRA adapts well initially but suffers from catastrophic forgetting during sequential learning.
Conclusion: Highlights urgent need for robust adaptation strategies tailored to underrepresented languages, particularly addressing challenges of catastrophic forgetting and representational drift in low-resource settings.
Abstract: Speech foundation models struggle with low-resource Pacific Indigenous languages because of severe data scarcity. Furthermore, full fine-tuning risks catastrophic forgetting. To address this gap, we present an empirical study adapting models to real-world Pacific datasets. We investigate how data volume and linguistic features affect adaptation success. Specifically, we evaluate strategies including Full Fine-Tuning and Low-Rank Adaptation (LoRA). Additionally, we analyze a continual learning framework for sequentially acquiring multiple languages. We demonstrate that adapting to these distant languages causes severe internal representational drift. Consequently, these models face a strict plasticity and stability dilemma. While LoRA adapts well initially, it suffers from catastrophic forgetting during sequential learning. Ultimately, this study highlights the urgent need for robust adaptation strategies tailored to underrepresented languages.
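For reference, the LoRA mechanism evaluated here wraps a frozen weight with a trainable low-rank update, W_eff = W + (alpha/r)·B·A. This minimal NumPy sketch (class name and shapes are illustrative) shows why the adapter is a no-op at initialization:

```python
import numpy as np

class LoRALinear:
    """Low-rank adapter around a frozen weight. Only A and B would be
    trained, so each per-language update stays small."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                    # frozen base (d_out, d_in)
        self.A = rng.standard_normal((r, W.shape[1])) * 0.01
        self.B = np.zeros((W.shape[0], r))            # zero init: adapter starts inert
        self.scale = alpha / r

    def __call__(self, x):
        return x @ (self.W + self.scale * self.B @ self.A).T

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32))
layer = LoRALinear(W)
x = rng.standard_normal((5, 32))
```

The zero-initialized B means the adapted model reproduces the base model exactly before training, which is also why sequential LoRA updates can still overwrite each other, the forgetting the study observes.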
[561] Classification of Autistic and Non-Autistic Children’s Speech: A Cross-Linguistic Study in Finnish, French, and Slovak
Sofoklis Kakouros, Ida-Lotta MyllylÀ
Main category: eess.AS
TL;DR: Cross-linguistic study of autism-related speech patterns in Finnish, French, and Slovak children using acoustic-prosodic features for classification, showing partial language-general cues but need for language-aware modeling.
Details
Motivation: To investigate whether autism-related speech cues are language-specific or language-general across typologically distinct languages (Finnish, French, Slovak), using classification as an analytical tool rather than for state-of-the-art performance.
Method: Used supervised classification with acoustic-prosodic features on speech from autistic and non-autistic children. Conducted within-language cross-validation and cross-corpus transfer experiments, including leave-one-corpus-out tests to evaluate generalization across languages.
Result: Within-language models showed heterogeneous performance (Finnish best: Accuracy 0.84, F1 0.88). Cross-language generalization was moderate for Slovak and Finnish (F1 0.70-0.78) but poor for French (F1 0.42). Feature analysis revealed partially shared but not fully language-invariant acoustic markers.
Conclusion: Some autism-related speech cues generalize across languages, but robust cross-linguistic classifiers require language-aware modeling and more homogeneous recording conditions.
Abstract: We present a cross-linguistic study of speech in autistic and non-autistic children speaking Finnish, French, and Slovak. We combine supervised classification with within-language and cross-corpus transfer experiments to evaluate classification performance within and across languages and to probe which acoustic cues are language-specific versus language-general. Using a large set of acoustic-prosodic features, we implement speaker-level classification benchmarks as an analytical tool rather than to seek state-of-the-art performance. Within-language models, evaluated with speaker-level cross-validation, yielded heterogeneous results. The Finnish model performed best (Accuracy 0.84, F1 0.88), followed by Slovak (Accuracy 0.63, F1 0.68) and French (Accuracy 0.68, F1 0.56). We then tested cross-language generalization. A model trained on all pooled corpora reached an overall Accuracy of 0.61 and F1 0.68. Leave-one-corpus-out experiments, which test transfer to an unseen language, showed moderate success when testing on Slovak (F1 0.70) and Finnish (F1 0.78), but poor transfer to French (F1 0.42). Feature-importance analyses across languages highlighted partially shared, but not fully language-invariant, acoustic markers of autism. These findings suggest that some autism-related speech cues generalize across typologically distinct languages, but robust cross-linguistic classifiers will likely require language-aware modeling and more homogeneous recording conditions.
[562] Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
Linzhi Wu, Xingyu Zhang, Hao Yuan, Yakun Zhang, Changyan Zheng, Liang Xie, Tiejun Liu, Erwei Yin
Main category: eess.AS
TL;DR: An end-to-end noise-robust audio-visual speech recognition framework that uses Conformer-based bottleneck fusion to implicitly refine noisy audio features with video assistance, eliminating explicit noise masks while preserving speech semantics.
Details
Motivation: Traditional AVSR methods use mask-based strategies to filter audio noise during feature fusion, but these risk discarding semantically relevant information along with noise. There's a need for a more sophisticated approach that can handle noisy audio without explicit noise masking.
Method: Proposes an end-to-end noise-robust AVSR framework coupled with speech enhancement. Uses a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance, reducing modality redundancy and enhancing inter-modal interactions without explicit noise mask generation.
Result: Experimental evaluations on the public LRS3 benchmark show the method outperforms prior advanced mask-based baselines under noisy conditions.
Conclusion: The proposed framework effectively handles noisy audio inputs in AVSR by implicitly refining features through bottleneck fusion, preserving speech semantic integrity while achieving robust recognition performance without explicit noise masking.
Abstract: Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio inputs are prone to introducing adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies to filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information alongside noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement, eliminating the need for explicit noise mask generation. This framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance. By reducing modality redundancy and enhancing inter-modal interactions, our method preserves speech semantic integrity to achieve robust recognition performance. Experimental evaluations on the public LRS3 benchmark suggest that our method outperforms prior advanced mask-based baselines under noisy conditions.
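The bottleneck-fusion idea, modalities interacting only through a few latent tokens, can be caricatured with single-head attention. This is a structural sketch under invented shapes, not the paper's Conformer module:

```python
import numpy as np

def attend(q, k, v):
    """Scaled dot-product attention (single head, no mask)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def bottleneck_fusion(audio, video, n_bottleneck=4, seed=0):
    """Toy bottleneck fusion: a few latent tokens gather cross-modal
    context, then the noisy audio stream reads only from those tokens,
    so the modalities interact through a narrow channel."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_bottleneck, audio.shape[-1]))
    ctx = np.vstack([audio, video])
    z = attend(z, ctx, ctx)                 # bottleneck tokens absorb both streams
    return audio + attend(audio, z, z)      # video-assisted refinement of audio

rng = np.random.default_rng(0)
audio = rng.standard_normal((6, 8))
video = rng.standard_normal((9, 8))
fused = bottleneck_fusion(audio, video)
```

Routing all cross-modal interaction through a handful of tokens is what limits how much noise from the audio stream can contaminate the fused representation, the role a noise mask plays in prior work.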
[563] Cross-linguistic Prosodic Analysis of Autistic and Non-autistic Child Speech in Finnish, French and Slovak
Ida-Lotta MyllylÀ, Sofoklis Kakouros
Main category: eess.AS
TL;DR: Autistic speakers show cross-linguistic prosodic differences including increased intensity variability, clearer voice quality, reduced temporal intensity dynamics, and lower fundamental frequency, with some language-specific variations.
Details
Motivation: To investigate prosodic differences in autism across multiple languages (Finnish, French, Slovak), since cross-linguistic evidence remains limited despite well-documented prosodic differences in autism.
Method: Analyzed a multilingual corpus with 88 acoustic features from over 5,000 inter-pausal units, using Principal Component Analysis for data reduction and Linear Mixed-Effects Models for analysis.
Result: Cross-linguistically: increased general intensity variability, clearer/less breathy voice quality (higher HNR and alpha ratio), reduced temporal intensity dynamics, lower central f0. Language-specific nuances: Slovak aligned on f0 but diverged on voice quality; Finnish mirrored broader voice quality findings.
Conclusion: Voice quality and intensity dynamics should be included alongside traditional pitch measures when studying language-independent markers of autism; findings challenge deficiency-based models in favor of complex, acoustically distinct prosodic profiles across languages.
Abstract: Prosodic differences in autism are well-documented, but cross-linguistic evidence remains limited. This study investigates prosody in autism across a multilingual corpus of Finnish, French, and Slovak speakers. 88 acoustic features from over 5,000 inter-pausal units were extracted, and data were reduced via Principal Component Analysis (PCA) and analyzed using Linear Mixed-Effects Models (LMMs). Cross-linguistically, autistic speakers exhibited increased general intensity variability and a clearer, less breathy voice quality (higher Harmonics-to-Noise Ratio and alpha ratio), alongside reduced temporal intensity dynamics and lower central f0. Monolingual analyses revealed language-specific nuances: Slovak results aligned with cross-linguistic f0 patterns but diverged on voice quality, while Finnish results mirrored the broader voice quality findings. These results emphasize including voice quality and intensity dynamics in the study of possible language-independent markers of autism, alongside traditional pitch measures. The findings challenge deficiency-based models, suggesting instead a complex, acoustically distinct prosodic profile across languages.
[564] Efficient Emotion and Speaker Adaptation in LLM-Based TTS via Characteristic-Specific Partial Fine-Tuning
Tianrui Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Haoyu Wang, Zikang Huang, Yu Jiang, Ye Ni, Yuheng Lu, Xiaobao Wang, Engsiong Chng, Xie Chen, Longbiao Wang, Jianwu Dang
Main category: eess.AS
TL;DR: CSP-FT: Characteristic-specific partial fine-tuning for TTS models that selectively tunes only the most and least informative layers for emotion/speaker cloning, improving fidelity while reducing parameters and training time.
Details
Motivation: LLM-based TTS models suffer from degraded cloning fidelity and pronunciation clarity on unseen domains, and uniform fine-tuning approaches cause slow training, catastrophic forgetting, and poor pronunciation accuracy.
Method: CSP-FT analyzes layer contributions via weighted sum to identify the two layers capturing the most and least emotion/speaker information, then selectively fine-tunes only these layers instead of the entire model.
Result: Experiments on 11 datasets show CSP-FT matches or exceeds full fine-tuning fidelity and intelligibility while updating only ~8% of parameters, accelerating training by ~2x, and significantly mitigating catastrophic forgetting.
Conclusion: Characteristic-specific partial fine-tuning is an effective strategy for adapting LLM-based TTS models to new domains, achieving high performance with reduced computational cost and better preservation of learned knowledge.
Abstract: While LLM-based TTS models exhibit zero-shot emotion and speaker cloning, their cloning fidelity and pronunciation clarity degrade on unseen domains. Fine-tuning is essential for adaptation, yet uniform approaches overlook specific parameter contributions. Uniform tuning on limited data causes slow training and catastrophic forgetting, leading to degraded pronunciation accuracy. To address this, we propose CSP-FT, a characteristic-specific partial fine-tuning strategy. By dynamically analyzing layer contributions via a weighted sum, we selectively fine-tune only the two layers capturing the most and least emotion and speaker information, maximizing the utility of the former while explicitly strengthening the capacity of the latter. Experiments on a combined corpus of 11 datasets show CSP-FT matches or exceeds the fidelity and intelligibility of full fine-tuning while updating only ~8% of parameters, accelerating training by ~2x, and significantly mitigating catastrophic forgetting.
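The selection step can be illustrated as follows, assuming the weighted-sum analysis yields one scalar weight per layer (the toy weights and function names below are invented):

```python
import numpy as np

def select_layers(layer_weights):
    """Pick the layers carrying the most and the least characteristic
    information, from softmax-normalized weighted-sum contributions."""
    w = np.exp(layer_weights - layer_weights.max())
    w /= w.sum()
    return int(np.argmax(w)), int(np.argmin(w))

def trainable_mask(n_layers, chosen):
    """Freeze everything except the selected layers; with two layers out
    of a full stack, this matches the spirit of the paper's ~8% figure."""
    mask = np.zeros(n_layers, dtype=bool)
    mask[list(chosen)] = True
    return mask

weights = np.array([0.1, 0.8, 0.3, -0.2, 0.05, 0.4])  # toy per-layer weights
top, bottom = select_layers(weights)
mask = trainable_mask(len(weights), (top, bottom))
```

Fine-tuning only the most-informative layer exploits existing capacity while updating the least-informative one adds capacity where it is missing; the rest of the stack stays frozen, which is what limits forgetting.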
[565] Doctor or Patient? Synergizing Diarization and ASR for Code-Switched Hinglish Medical Conditions Extraction
Séverin Baroudi, Yanis Labrak, Shashi Kumar, Joonas Kalda, Sergio Burdisso, Pawel Cyrta, Juan Ignacio Alvarez-Trejos, Petr Motlicek, Hervé Bredin, Ricard Marxer
Main category: eess.AS
TL;DR: A robust system for extracting medical conditions from code-switched Hinglish clinical dialogues, combining EEND-VC for speaker diarization, domain-adapted ASR, and LLM-based medical condition extraction, achieving top performance in the DISPLACE-M challenge.
Details
Motivation: Extracting patient medical conditions from code-switched clinical spoken dialogues is challenging due to rapid turn-taking, highly overlapped speech, and the Hinglish (Hindi-English) language mix in real-world medical conversations.
Method: Proposes an End-to-End Neural Diarization with Vector Clustering (EEND-VC) approach for accurate speaker diarization in dense, overlapped Doctor-Patient Conversations. Uses a Qwen3 ASR model fine-tuned with domain-specific data, Devanagari script normalization, and dialogue-level LLM error correction. Benchmarks both text-based cascade systems and multimodal End-to-End audio frameworks for medical condition extraction.
Result: Achieved 18.59% tcpWER for transcription and first place out of 25 participants in the DISPLACE-M challenge. While proprietary E2E models set the performance ceiling, the open cascaded architecture was highly competitive.
Conclusion: The proposed system effectively handles the challenges of code-switched clinical dialogues through robust diarization, domain-adapted ASR, and LLM-based extraction, providing a competitive open-source solution for medical condition extraction from real-world conversations.
Abstract: Extracting patient medical conditions from code-switched clinical spoken dialogues is challenging due to rapid turn-taking and highly overlapped speech. We present a robust system evaluated on the DISPLACE-M dataset of real-world Hinglish medical conversations. We propose an End-to-End Neural Diarization with Vector Clustering approach (EEND-VC) to accurately resolve dense and speaker overlaps in Doctor-Patient Conversations (DoPaCo). For transcription, we adapt a Qwen3 ASR model via domain-specific fine-tuning, Devanagari script normalization, and dialogue-level LLM error correction, achieving an 18.59% tcpWER. We benchmark open and proprietary LLMs on medical condition extraction, comparing our text-based cascade system against a multimodal End-to-End (E2E) audio framework. While proprietary E2E models set the performance ceiling, our open cascaded architecture is highly competitive, as it achieved first place out of 25 participants in the DISPLACE-M challenge. All implementations are publicly released.
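The tcpWER reported above builds on the plain word error rate. A minimal WER sketch (time-constrained speaker attribution, the "tcp" part, is omitted):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by the
    reference length. The paper's tcpWER additionally enforces
    time-constrained speaker attribution; this is only the plain core."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("has" -> "had") and one insertion ("a") over 4 words.
print(wer("the patient has fever", "the patient had a fever"))  # 0.5
```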
[566] LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification
Niloofar Jazaeri, Hilmi R. Dajani, Marco Janeczek, Martin Bouchard
Main category: eess.AS
TL;DR: A compact acoustic framework for infant cry cause classification using multi-branch CNN encoder with enhanced Legendre Memory Unit for temporal modeling and calibrated ensemble fusion for cross-dataset generalization.
Details
Motivation: Infant cry cause decoding is challenging due to short nonstationary signals, limited annotations, and strong domain shifts across infants and datasets, making healthcare monitoring difficult.
Method: Proposes a compact acoustic framework that fuses MFCC, STFT, and pitch features within a multi-branch CNN encoder, models temporal dynamics using enhanced Legendre Memory Unit (LMU), and uses calibrated posterior ensemble fusion with entropy-gated weighting for cross-dataset generalization.
Result: Experiments on Baby2020 and Baby Crying datasets show improved macro-F1 under cross-domain evaluation, with leakage-aware splits and demonstrated real-time feasibility for on-device monitoring.
Conclusion: The proposed framework effectively addresses challenges in infant cry cause classification with efficient temporal modeling and improved cross-dataset generalization for healthcare monitoring applications.
Abstract: Decoding infant cry causes remains challenging for healthcare monitoring due to short nonstationary signals, limited annotations, and strong domain shifts across infants and datasets. We propose a compact acoustic framework that fuses MFCC, STFT, and pitch features within a multi-branch CNN encoder and models temporal dynamics using an enhanced Legendre Memory Unit (LMU). Compared to LSTMs, the LMU backbone provides stable sequence modeling with substantially fewer recurrent parameters, supporting efficient deployment. To improve cross-dataset generalization, we introduce calibrated posterior ensemble fusion with entropy-gated weighting to preserve domain-specific expertise while mitigating dataset bias. Experiments on Baby2020 and Baby Crying demonstrate improved macro-F1 under cross-domain evaluation, along with leakage-aware splits and real-time feasibility for on-device monitoring.
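The entropy-gated weighting in the fusion step can be illustrated as below. This is a plausible reading (inverse-entropy softmax weights over model posteriors), not the authors' exact formula:

```python
import numpy as np

def entropy_gated_fusion(posteriors, eps=1e-8):
    """Fuse class posteriors from several domain-specific models,
    down-weighting uncertain (high-entropy) predictions. A plausible
    reading of the paper's entropy-gated weighting, not the exact rule."""
    posteriors = np.asarray(posteriors, dtype=float)  # (n_models, n_classes)
    entropies = -np.sum(posteriors * np.log(posteriors + eps), axis=1)
    weights = np.exp(-entropies)       # confident models get larger weights
    weights /= weights.sum()
    fused = weights @ posteriors
    return fused / fused.sum()

# Model A is confident, model B is nearly uniform.
p = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]
fused = entropy_gated_fusion(p)
# The confident model dominates the fused posterior.
```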
[567] The trajectoRIR Database: Room Acoustic Recordings Along a Trajectory of Moving Microphones
Stefano Damiano, Kathleen MacWilliam, Valerio Lorenzoni, Thomas Dietzen, Toon van Waterschoot
Main category: eess.AS
TL;DR: TrajectoRIR database provides extensive multi-array acoustic recordings with both dynamic and stationary measurements along controlled trajectories for various audio processing tasks.
Details
Motivation: Address the need for comprehensive acoustic datasets that combine both dynamic and stationary recordings to support data-driven approaches in acoustic signal processing, particularly for tasks requiring spatial and temporal acoustic information.
Method: Created a database using robotic cart motion along an L-shaped trajectory with multiple microphone configurations (dummy head, Ambisonics, circular arrays, linear array) at different speeds, recording various audio signals including RIRs, sweeps, speech, music, and noise.
Result: The database contains 8648 stationary RIRs plus dynamic recordings of perfect sweeps, speech, music, and stationary noise, captured in a room with a 0.5 s reverberation time, accompanied by precise geometric information and Python access tools.
Conclusion: TrajectoRIR provides a unique, comprehensive acoustic database that enables research in sound source localization, dynamic sound field reconstruction, auralization, and system identification through combined stationary and moving recordings.
Abstract: Data availability is essential in the development of acoustic signal processing algorithms, especially when it comes to data-driven approaches that demand large and diverse training datasets. For this reason, an increasing number of databases have been published in recent years, including either room impulse responses (RIRs) or audio recordings during motion. In this paper we introduce the trajectoRIR database, an extensive, multi-array collection of both dynamic and stationary acoustic recordings along a controlled trajectory in a room. Specifically, the database contains moving-microphone recordings and stationary RIRs that spatially sample the room acoustics along an L-shaped trajectory. This combination makes trajectoRIR unique and applicable to a wide range of tasks, including sound source localization and tracking, spatially dynamic sound field reconstruction, auralization, and system identification. The recording room has a reverberation time of 0.5 s, and the three different microphone configurations employed include a dummy head, with additional reference microphones located next to the ears, 3 first-order Ambisonics microphones, two circular arrays of 16 and 4 channels, and a 12-channel linear array. The motion of the microphones was achieved using a robotic cart traversing a 4.62 m-long rail at three speeds: [0.2, 0.4, 0.8] m/s. Audio signals were reproduced using two stationary loudspeakers. The collected database features 8648 stationary RIRs, as well as perfect sweeps, speech, music, and stationary noise recorded during motion. Python functions are provided to access the recorded audio and retrieve the associated geometric information.
[568] Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
Mandip Goswami
Main category: eess.AS
TL;DR: Whisper-RIR-Mega is a benchmark dataset pairing clean and reverberant speech for evaluating ASR robustness to room acoustics, with evaluation showing consistent performance degradation across Whisper models.
Details
Motivation: There’s a need for standardized evaluation of ASR robustness to room acoustics, particularly reverberation effects, to support reproducible research on robust speech recognition systems.
Method: Created dataset by pairing clean LibriSpeech utterances with same utterances convolved with real room impulse responses from RIR-Mega corpus, with stratified splits by RT60 and DRR. Evaluated five Whisper model sizes on 1600 test samples.
Result: Reverberation consistently degrades ASR performance across all Whisper model sizes, with reverb penalty in WER ranging from 0.12 to 1.07 percentage points depending on model.
Conclusion: The benchmark dataset, evaluation code, and baseline results are released to support reproducible research on robust ASR, highlighting the need for better handling of reverberation effects.
Abstract: We introduce Whisper-RIR-Mega, a benchmark dataset of paired clean and reverberant speech for evaluating automatic speech recognition (ASR) robustness to room acoustics. Each sample pairs a clean LibriSpeech utterance with the same utterance convolved with a real room impulse response from the RIR-Mega corpus, with stratified splits by reverberation time (RT60) and direct-to-reverberant ratio (DRR). We evaluate five Whisper models (tiny through large-v3) on 1600 test samples and report word error rate (WER) and character error rate (CER) under clean and reverberant conditions. Reverberation consistently degrades performance across all model sizes; the reverb penalty in WER ranges from 0.12 to 1.07 percentage points depending on the model. We release the dataset, evaluation code, and baseline results to support reproducible research on robust ASR.
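The clean/reverberant pairing described above amounts to convolving each clean utterance with a measured room impulse response. A minimal sketch (the truncation and peak normalization are assumptions, not the dataset's documented recipe):

```python
import numpy as np

def reverberate(clean, rir):
    """Create the reverberant half of a clean/reverberant pair by
    convolving a clean utterance with a room impulse response.
    Truncated to the clean length and peak-normalized here; the
    dataset's exact post-processing is an assumption."""
    wet = np.convolve(clean, rir)[: len(clean)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Toy signal and a tiny "room": direct path plus one delayed echo.
clean = np.array([1.0, 0.0, 0.5, 0.0])
rir = np.array([1.0, 0.0, 0.6])
wet = reverberate(clean, rir)
```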
[569] ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction
Shu-wen Yang, Ming Tu, Andy T. Liu, Xinghua Qu, Hung-yi Lee, Lu Lu, Yuxuan Wang, Yonghui Wu
Main category: eess.AS
TL;DR: ParaS2S introduces a reinforcement learning framework for paralinguistic-aware speech-to-speech models, addressing limitations in handling emotion, tone, and speaker attributes through waveform-level optimization.
Details
Motivation: Current speech-to-speech models lack ability to handle paralinguistic cues (emotion, tone, speaker attributes) and respond appropriately in both content and style, with progress hindered by scarcity of high-quality expressive demonstrations.
Method: Introduces ParaS2S RL framework with ParaS2SBench benchmark for evaluating content and speaking style naturalness. Uses PolyTone training strategy and multi-stage framework to prevent style hallucination in audio LLM judging. Implements RL approach (ParaS2SAlign) for waveform-level optimization.
Result: ParaS2SAlign achieves 10% relative improvement in response appropriateness over supervised fine-tuning on ParaS2SBench, surpassing all prior models while requiring fewer paired demonstrations. The automatic judge correlates well with human preferences.
Conclusion: The work highlights need for scalable and accurate automatic evaluators for speech-to-speech interaction, demonstrating RL’s effectiveness in improving paralinguistic awareness in S2S models.
Abstract: Speech-to-Speech (S2S) models have shown promising dialogue capabilities, but their ability to handle paralinguistic cues - such as emotion, tone, and speaker attributes - and to respond appropriately in both content and style remains under-explored. Progress is further hindered by the scarcity of high-quality and expressive demonstrations. To address this, we introduce a new reinforcement learning (RL) framework for paralinguistic-aware S2S, ParaS2S, which evaluates and optimizes both response content and speaking style directly at the waveform level. We first construct ParaS2SBench, a benchmark that evaluates the naturalness of input-output pairs in terms of content and speaking style using expressive and challenging queries. For the automatic judge, we propose a PolyTone training strategy and a multi-stage framework, preventing the style hallucination of end-to-end audio LLM judging. Our judge correlates well with human preferences and is scalable, enabling the model to interact and learn from unlabeled speech via RL. Experiments show that existing S2S models fail to respond appropriately to paralinguistic attributes, performing no better than pipeline-based baselines. Our RL approach (ParaS2SAlign) achieves a 10% relative improvement in the appropriateness of response content and speaking style on ParaS2SBench over supervised fine-tuning (SFT), surpassing all prior models while requiring substantially fewer paired demonstrations than pure SFT. Our findings highlight the need for a scalable and accurate automatic evaluator for speech-to-speech interaction.
[570] PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio
Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang
Main category: eess.AS
TL;DR: PolyBench: A new benchmark for evaluating compositional reasoning in polyphonic audio, covering five tasks that require understanding multiple concurrent sound events and their relationships.
Details
Motivation: Existing benchmarks for Large Audio Language Models (LALMs) have limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and create compositional structure. There's a need to evaluate how well LALMs can handle complex audio scenes with concurrent events.
Method: Introduces PolyBench, a benchmark with five evaluation subsets: counting, classification, detection, concurrency, and duration estimation. These tasks require reasoning over multiple concurrent events and their relations in polyphonic audio.
Result: Evaluation of state-of-the-art LALMs shows consistent performance degradation in polyphonic audio, indicating a fundamental bottleneck in current audio language models when dealing with complex, concurrent sound events.
Conclusion: PolyBench reveals significant limitations in current LALMs’ ability to handle compositional reasoning in polyphonic audio, highlighting an important research direction for improving multimodal audio understanding.
Abstract: Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio. However, existing benchmarks provide limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. In this work, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio. PolyBench comprises five evaluation subsets covering counting, classification, detection, concurrency, and duration estimation, requiring reasoning over multiple concurrent events and their relations. Evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic audio, indicating a fundamental bottleneck in current LALMs.
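A concurrency subset like the one above needs ground truth for how many events are simultaneously active. A standard sweep-line computation over (onset, offset) annotations, shown as an illustration rather than the benchmark's actual tooling:

```python
def max_concurrency(events):
    """Maximum number of simultaneously active sound events, given
    (onset, offset) pairs in seconds. Illustrative sweep-line sketch,
    not PolyBench's own annotation pipeline."""
    boundaries = []
    for onset, offset in events:
        boundaries.append((onset, 1))    # event starts
        boundaries.append((offset, -1))  # event ends
    # Process ends before starts at the same timestamp.
    boundaries.sort(key=lambda b: (b[0], b[1]))
    active = peak = 0
    for _, delta in boundaries:
        active += delta
        peak = max(peak, active)
    return peak

# A bark overlaps with both speech and a siren, but speech and the
# siren never overlap each other, so at most 2 events are concurrent.
events = [(0.0, 2.0), (1.5, 4.0), (3.5, 5.0)]
print(max_concurrency(events))  # 2
```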
eess.IV
[571] Clinical-Injection Transformer with Domain-Adapted MAE for Lupus Nephritis Prognosis Prediction
Yuewen Huang, Zhitao Ye, Guangnan Feng, Fudan Zheng, Xia Gao, Yutong Lu
Main category: eess.IV
TL;DR: First multimodal computational pathology framework for pediatric lupus nephritis prognosis using PAS-stained biopsies and clinical data with Clinical-Injection Transformer and domain-adapted MAE.
Details
Motivation: Pediatric lupus nephritis has worse outcomes than adult cases, but predicting prognosis remains unexplored in computational pathology. Existing methods require multiple costly stains and don't integrate clinical data.
Method: Uses PAS-stained biopsies and clinical data with Clinical-Injection Transformer (embeds clinical features as condition tokens in patch-level attention) and decoupled representation-knowledge adaptation using domain-adapted Masked Autoencoder. Multi-granularity morphological type injection bridges classification knowledge to prognostic predictions.
Result: Achieved 90.1% three-class accuracy and 89.4% AUC on cohort of 71 pediatric LN patients with KDIGO-standardized labels.
Conclusion: Demonstrates potential as highly accurate and cost-effective prognostic tool for pediatric lupus nephritis using only routine stains and clinical data.
Abstract: Lupus nephritis (LN) is a severe complication of systemic lupus erythematosus that affects pediatric patients with significantly greater severity and worse renal outcomes compared to adults. Despite the urgent clinical need, predicting pediatric LN prognosis remains unexplored in computational pathology. Furthermore, the only existing histopathology-based approach for LN relies on multiple costly staining protocols and fails to integrate complementary clinical data. To address these gaps, we propose the first multimodal computational pathology framework for three-class treatment response prediction (complete remission, partial response, and no response) in pediatric LN, utilizing only routine PAS-stained biopsies and structured clinical data. Our framework introduces two key methodological innovations. First, a Clinical-Injection Transformer (CIT) embeds clinical features as condition tokens into patch-level self-attention, facilitating implicit and bidirectional cross-modal interactions within a unified attention space. Second, we design a decoupled representation-knowledge adaptation strategy using a domain-adapted Masked Autoencoder (MAE). This strategy explicitly separates self-supervised morphological feature learning from pathological knowledge extraction. Additionally, we introduce a multi-granularity morphological type injection mechanism to bridge distilled classification knowledge with downstream prognostic predictions at both the instance and patient levels. Evaluated on a cohort of 71 pediatric LN patients with KDIGO-standardized labels, our method achieves a three-class accuracy of 90.1% and an AUC of 89.4%, demonstrating its potential as a highly accurate and cost-effective prognostic tool.
[572] Gabor Primitives for Accelerated Cardiac Cine MRI Reconstruction
Wenqi Huang, Veronika Spieker, Nil Stolt-AnsĂł, Natascha Niessen, Maik Dannecker, Sevgi Gokce Kafali, Sila Kurugol, Julia A. Schnabel, Daniel Rueckert
Main category: eess.IV
TL;DR: Gabor primitives for accelerated cardiac cine MRI reconstruction outperform compressed sensing and other baselines by using modulated Gaussians for better high-frequency representation and low-rank decomposition for spatiotemporal redundancy.
Details
Motivation: Current MRI reconstruction methods have limitations: implicit neural representations lack interpretability, while Gaussian primitives have limited high-frequency representation. There's a need for explicit, interpretable representations that can efficiently capture both smooth structures and sharp boundaries in accelerated cardiac cine MRI.
Method: Proposes Gabor primitives that modulate Gaussian envelopes with complex exponentials to place spectral support at arbitrary k-space locations. For cardiac cine, decomposes temporal variation into low-rank geometry basis (cardiac motion) and signal-intensity basis (contrast changes).
Result: Gabor primitives consistently outperform compressed sensing, Gaussian primitives, and hash-grid INR baselines on cardiac cine data with both Cartesian and radial trajectories. Provides compact, continuous-resolution representation with physically meaningful parameters.
Conclusion: Gabor primitives offer an effective explicit representation for MRI reconstruction that combines interpretability with superior performance, particularly suitable for spatiotemporal applications like cardiac cine imaging.
Abstract: Accelerated cardiac cine MRI requires reconstructing spatiotemporal images from highly undersampled k-space data. Implicit neural representations (INRs) enable scan-specific reconstruction without large training datasets, but encode content implicitly in network weights without physically interpretable parameters. Gaussian primitives provide an explicit and geometrically interpretable alternative, but their spectra are confined near the k-space origin, limiting high-frequency representation. We propose Gabor primitives for MRI reconstruction, modulating each Gaussian envelope with a complex exponential to place its spectral support at an arbitrary k-space location, enabling efficient representation of both smooth structures and sharp boundaries. To exploit spatiotemporal redundancy in cardiac cine, we decompose per-primitive temporal variation into a low-rank geometry basis capturing cardiac motion and a signal-intensity basis modeling contrast changes. Experiments on cardiac cine data with Cartesian and radial trajectories show that Gabor primitives consistently outperform compressed sensing, Gaussian primitives, and hash-grid INR baselines, while providing a compact, continuous-resolution representation with physically meaningful parameters.
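The core construction, a Gaussian envelope modulated by a complex exponential, can be written down directly. A 1-D illustration (the paper works with spatiotemporal primitives in higher dimensions):

```python
import numpy as np

def gabor_primitive(x, mu, sigma, k, phase=0.0):
    """1-D complex Gabor primitive: a Gaussian envelope centered at mu
    with width sigma, modulated by a complex exponential of spatial
    frequency k. The modulation shifts the primitive's spectral support
    away from the k-space origin, the key idea in the abstract above."""
    envelope = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    carrier = np.exp(1j * (k * x + phase))
    return envelope * carrier

x = np.linspace(-1, 1, 256)
g_smooth = gabor_primitive(x, mu=0.0, sigma=0.2, k=0.0)   # reduces to a Gaussian
g_edge = gabor_primitive(x, mu=0.0, sigma=0.2, k=40.0)    # high-frequency content
```

With k = 0 the primitive is a plain (real) Gaussian whose spectrum sits at the k-space origin; with k = 40 the spectral peak moves to a nonzero frequency, which is what lets a single primitive family represent both smooth structures and sharp boundaries.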
[573] Longitudinal Lesion Inpainting in Brain MRI via 3D Region Aware Diffusion
Zahra Karimaghaloo, Dumitru Fetco, Haz-Edine Assemlal, Hassan Rivaz, Douglas L. Arnold
Main category: eess.IV
TL;DR: Novel pseudo-3D longitudinal inpainting framework using DDPMs for brain MRI lesion repair, leveraging multi-channel conditioning and region-aware diffusion for efficient, anatomically consistent results.
Details
Motivation: Accurate longitudinal brain MRI analysis is hindered by evolving lesions that bias automated pipelines. Existing methods lack 3D anatomical continuity or operate only cross-sectionally.
Method: Pseudo-3D longitudinal inpainting framework based on Denoising Diffusion Probabilistic Models (DDPM) with multi-channel conditioning from distinct visits (t_1, t_2). Extends Region-Aware Diffusion (RAD) to medical domain for focused generative process on pathological regions without altering healthy tissue.
Result: Significantly outperforms FastSurfer-LIT baseline: reduces LPIPS distance from 0.07 to 0.03, eliminates inter-slice discontinuities, achieves high longitudinal stability (TFI of 1.024 vs LIT’s 1.22), and provides 10x speedup (2.53 min vs 24.30 min per volume).
Conclusion: The framework provides highly reliable and efficient preprocessing for progressive neurodegenerative disease studies by leveraging longitudinal priors and region-specific denoising.
Abstract: Accurate longitudinal analysis of brain MRI is often hindered by evolving lesions, which bias automated neuroimaging pipelines. While deep generative models have shown promise in inpainting these lesions, most existing methods operate cross-sectionally or lack 3D anatomical continuity. We present a novel pseudo-3D longitudinal inpainting framework based on Denoising Diffusion Probabilistic Models (DDPM). Our approach utilizes multi-channel conditioning to incorporate longitudinal context from distinct visits (t_1, t_2) and extends Region-Aware Diffusion (RAD) to the medical domain, focusing the generative process on pathological regions without altering surrounding healthy tissue. We evaluated our model against state-of-the-art baselines on longitudinal brain MRI from 93 patients. Our model significantly outperforms the leading baseline (FastSurfer-LIT) in terms of perceptual fidelity, reducing the Learned Perceptual Image Patch Similarity (LPIPS) distance from 0.07 to 0.03 while effectively eliminating inter-slice discontinuities. Furthermore, our model demonstrates high longitudinal stability with a Temporal Fidelity Index of 1.024, closely approaching the ideal value of 1.0 and substantially narrowing the gap compared to LIT’s TFI of 1.22. Notably, the RAD mechanism provides a substantial gain in efficiency; our framework achieves an average processing time of 2.53 min per volume, representing approximately 10x speedup over the 24.30 min required by LIT. By leveraging longitudinal priors and region-specific denoising, our framework provides a highly reliable and efficient preprocessing step for the study of progressive neurodegenerative diseases. A derivative dataset consisting of 93 pre-processed scans used for testing will be available upon request after acceptance. Code will be released upon acceptance.
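The region-aware mechanism can be sketched as a masked composition at each reverse-diffusion step, in the spirit of RePaint-style inpainting. This is a hedged reading with hypothetical names, not the authors' exact sampler:

```python
import numpy as np

def region_aware_step(denoised, known_clean, lesion_mask, alpha_bar_t, rng):
    """One reverse-step composition in the spirit of region-aware
    diffusion (a RePaint-style sketch, not the paper's sampler): the
    model's output is kept only inside the lesion mask, while healthy
    tissue is reset to a correctly-noised copy of the known image, so
    the generative process never alters tissue outside the pathology."""
    noise = rng.standard_normal(known_clean.shape)
    known_noised = (np.sqrt(alpha_bar_t) * known_clean
                    + np.sqrt(1.0 - alpha_bar_t) * noise)
    return lesion_mask * denoised + (1.0 - lesion_mask) * known_noised

rng = np.random.default_rng(0)
clean = np.ones((8, 8))            # known healthy anatomy
denoised = np.zeros((8, 8))        # model output at this step
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0               # lesion region to inpaint
x_next = region_aware_step(denoised, clean, mask, alpha_bar_t=1.0, rng=rng)
```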
[574] Interpretable Motion Artifact Detection in Structural Brain MRI
Naveetha Nithianandam, Prabhjot Kaur, Anil Kumar Sao
Main category: eess.IV
TL;DR: A lightweight 3D DHoGM framework for detecting motion artifacts in brain MRI using combined 2D slice-level and 3D volume-level features with minimal parameters.
Details
Motivation: Existing MRI quality assessment methods require extensive preprocessing (high computational cost) or have poor generalization across acquisition sites. Need for efficient, robust automated quality checking.
Method: Extends Discriminative Histogram of Gradient Magnitude (DHoGM) to 3D space, integrates complementary 2D slice-level and 3D volume-level features via parallel decision strategy. Uses overlapping 3D cuboids for volumetric analysis. Employs threshold-based classifier and low-parameter MLP (only 209 trainable parameters).
Result: Achieved 94.34% accuracy in-domain and 89% accuracy on unseen sites, with almost complete avoidance of false acceptance of poor-quality scans. Ablation studies confirm complementary benefits of 2D+3D features.
Conclusion: Proposed approach offers effective, efficient, robust solution for automated MRI quality assessment, suitable for large-scale clinical/research workflows due to lightweight design and strong generalization.
Abstract: Automated quality assessment of structural brain MRI is an important prerequisite for reliable neuroimaging analysis, yet remains challenging due to motion artifacts and poor generalization across acquisition sites. Existing approaches based on image quality metrics (IQMs) or deep learning either require extensive preprocessing, which incurs high computational cost, or generalize poorly to unseen data. In this work, we propose a lightweight and interpretable framework for detecting motion-related artifacts in T1-weighted brain MRI by extending the Discriminative Histogram of Gradient Magnitude (DHoGM) to three dimensions. The proposed method integrates complementary slice-level (2D) and volume-level (3D) DHoGM features through a parallel decision strategy, capturing both localized and global motion-induced degradation. Volumetric analysis is performed using overlapping 3D cuboids to achieve comprehensive spatial coverage while maintaining computational efficiency. A simple threshold-based classifier and a low-parameter multilayer perceptron are used, resulting in a model with only 209 trainable parameters. Our method was evaluated on the MR-ART and ABIDE datasets under both seen-site and unseen-site conditions. Experimental results demonstrate strong performance, achieving up to 94.34% accuracy in the in-domain evaluation and 89% accuracy on unseen sites, while almost completely avoiding false acceptance of poor-quality scans. Ablation studies confirm the complementary benefits of combining 2D and 3D features. Overall, the proposed approach offers an effective, efficient, and robust solution for automated MRI quality checking, with strong potential for integration into large-scale clinical and research workflows.
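The underlying histogram-of-gradient-magnitude descriptor is easy to sketch; motion artifacts such as ringing and ghosting change the distribution of local gradients, which the histogram summarizes. A plain HoGM sketch for one 3D cuboid; the discriminative (DHoGM) part, which learns which bins separate clean from corrupted scans, is not reproduced here:

```python
import numpy as np

def gradient_magnitude_histogram(volume, bins=16):
    """Normalized histogram of voxelwise gradient magnitudes for a 3D
    cuboid. A plain HoGM sketch; the paper's discriminative variant
    additionally selects informative bins."""
    gz, gy, gx = np.gradient(volume.astype(float))
    magnitude = np.sqrt(gx**2 + gy**2 + gz**2)
    hist, _ = np.histogram(magnitude, bins=bins,
                           range=(0.0, magnitude.max() + 1e-8))
    return hist / hist.sum()  # normalized, comparable across cuboids

vol = np.random.default_rng(0).standard_normal((8, 8, 8))
feat = gradient_magnitude_histogram(vol)
```

In the full pipeline such features would be extracted from overlapping cuboids and fed to the threshold classifier or the small MLP.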
[575] Uni-LVC: A Unified Method for Intra- and Inter-Mode Learned Video Compression
Yichi Zhang, Ruoyu Yang, Fengqing Zhu
Main category: eess.IV
TL;DR: Uni-LVC is a unified learned video compression method that supports both intra and inter coding with low-delay and random-access modes in a single model, addressing limitations of existing learned video codecs.
Details
Motivation: Existing learned video compression methods have limitations: they require separate models for intra and inter coding modes, and their performance degrades when temporal references are unreliable. There's a need for a unified approach that can handle both coding modes efficiently.
Method: Builds on a strong intra-codec and formulates inter-coding as intra-coding conditioned on temporal information from reference frames. Uses cross-attention adaptation module to integrate temporal cues, supports both unidirectional and bidirectional prediction. Includes reliability-aware classifier to scale temporal cues when references are unreliable, and employs multistage training strategy.
Result: Uni-LVC achieves superior rate-distortion performance in both intra and inter configurations while maintaining comparable computational efficiency, demonstrating effectiveness of the unified approach.
Conclusion: The proposed Uni-LVC method successfully addresses key limitations of existing learned video compression by providing a unified model that supports multiple coding modes with adaptive temporal information handling.
Abstract: Recent advances in learned video compression (LVC) have led to significant performance gains, with codecs such as DCVC-RT surpassing the H.266/VVC low-delay mode in compression efficiency. However, existing LVCs still exhibit key limitations: they often require separate models for intra and inter coding modes, and their performance degrades when temporal references are unreliable. To address this, we introduce Uni-LVC, a unified LVC method that supports both intra and inter coding with low-delay and random-access in a single model. Building on a strong intra-codec, Uni-LVC formulates inter-coding as intra-coding conditioned on temporal information extracted from reference frames. We design an efficient cross-attention adaptation module that integrates temporal cues, enabling seamless support for both unidirectional (low-delay) and bidirectional (random-access) prediction modes. A reliability-aware classifier is proposed to selectively scale the temporal cues, making Uni-LVC behave closer to intra coding when references are unreliable. We further propose a multistage training strategy to facilitate adaptive learning across various coding modes. Extensive experiments demonstrate that Uni-LVC achieves superior rate-distortion performance in intra and inter configurations while maintaining comparable computational efficiency.
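Generic single-head cross-attention, of the kind such an adaptation module could use to let current-frame tokens attend to reference-frame tokens, can be sketched as below. The shapes, head count, and projection setup are assumptions, not Uni-LVC's actual module:

```python
import numpy as np

def cross_attention(queries, context, wq, wk, wv):
    """Single-head cross-attention: tokens of the frame being coded
    (queries) attend to reference-frame tokens (context), injecting
    temporal cues into an otherwise intra-style codec. Generic sketch."""
    q = queries @ wq                 # (n_cur, d)
    k = context @ wk                 # (n_ref, d)
    v = context @ wv                 # (n_ref, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                  # temporal cues per current token

rng = np.random.default_rng(0)
d = 8
cur = rng.standard_normal((4, d))   # tokens of the frame being coded
ref = rng.standard_normal((6, d))   # tokens from reference frames
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(cur, ref, wq, wk, wv)
```

A reliability-aware classifier like the one described could then scale `out` toward zero when references are unreliable, pushing the codec back toward pure intra coding.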
[576] Architectural Unification for Polarimetric Imaging Across Multiple Degradations
Chu Zhou, Yufei Han, Junda Liao, Linrui Dai, Wangze Xu, Art Subpa-Asa, Heng Guo, Boxin Shi, Imari Sato
Main category: eess.IV
TL;DR: Unified architectural framework for polarimetric imaging restoration that handles multiple degradation types (low-light noise, motion blur, mosaicing) with consistent architecture design and joint image-Stokes domain processing.
Details
Motivation: Polarimetric imaging faces challenges from diverse degradations, and existing approaches use task-specific architectures that limit adaptability across restoration scenarios. Current methods often have multi-stage pipelines with error accumulation or operate in single domains, failing to exploit physical relationships between image and Stokes domains.
Method: Proposes a unified architectural framework that maintains consistent design across multiple degradation scenarios while being trained separately for each task. Performs single-stage joint image-Stokes processing to avoid error accumulation and explicitly preserve physical consistency of polarimetric parameters.
Result: Extensive experiments show state-of-the-art performance across low-light denoising, motion deblurring, and demosaicing tasks, establishing a versatile and physically grounded solution for degraded polarimetric imaging.
Conclusion: The unified framework provides an effective solution for polarimetric imaging restoration across multiple degradation types while maintaining physical consistency, offering advantages over task-specific architectures and multi-stage pipelines.
Abstract: Polarimetric imaging aims to recover polarimetric parameters, including Total Intensity (TI), Degree of Polarization (DoP), and Angle of Polarization (AoP), from captured polarized measurements. In real-world scenarios, these measurements are frequently affected by diverse degradations such as low-light noise, motion blur, and mosaicing artifacts. Due to the nonlinear dependency of DoP and AoP on the measured intensities, accurately retrieving physically consistent polarimetric parameters from degraded observations remains highly challenging. Existing approaches typically adopt task-specific network architectures tailored to individual degradation types, limiting their adaptability across different restoration scenarios. Moreover, many methods rely on multi-stage processing pipelines that suffer from error accumulation, or operate solely in a single domain (either image or Stokes domain), failing to fully exploit the intrinsic physical relationships between them. In this work, we propose a unified architectural framework for polarimetric imaging that is structurally shared across multiple degradation scenarios. Rather than redesigning network structures for each task, our framework maintains a consistent architectural design while being trained separately for different degradations. The model performs single-stage joint image-Stokes processing, avoiding error accumulation and explicitly preserving physical consistency. Extensive experiments show that this unified architectural design, when trained for specific degradation types, consistently achieves state-of-the-art performance across low-light denoising, motion deblurring, and demosaicing tasks, establishing a versatile and physically grounded solution for degraded polarimetric imaging.
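The polarimetric parameters named above follow from the standard Stokes relations. These are the textbook formulas, not the paper's network; the square root and arctangent make DoP and AoP nonlinear in the measured intensities, which is exactly why multi-stage error accumulation is harmful:

```python
import numpy as np

def stokes_from_polarized(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities,
    and the TI/DoP/AoP quantities the paper recovers (textbook relations)."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)           # total intensity (TI)
    s1 = i0 - i90
    s2 = i45 - i135
    dop = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-8)  # degree of polarization
    aop = 0.5 * np.arctan2(s2, s1)                       # angle of polarization
    return s0, dop, aop

# Fully linearly polarized light along 0 degrees: DoP = 1, AoP = 0.
s0, dop, aop = stokes_from_polarized(1.0, 0.5, 0.0, 0.5)
```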
[577] Evaluating quality metrics through the lenses of psychophysical measurements of low-level vision
Dounia Hammou, Yancheng Cai, Pavan Madhusudanarao, Christos G. Bampis, RafaĆ K. Mantiuk
Main category: eess.IV
TL;DR: The paper introduces perceptual tests for image/video quality metrics to evaluate their alignment with human vision principles like contrast sensitivity, masking, and matching, revealing limitations in existing metrics.
Details
Motivation: Most image/video quality metrics (SSIM, LPIPS, VMAF) aim to predict perceived visual quality, but few explicitly incorporate human perception models, relying instead on hand-crafted formulas or data-driven training. There's a need for systematic tests to evaluate how well these metrics capture fundamental aspects of low-level human vision.
Method: The authors introduce a set of tests for full-reference quality metrics that evaluate three key aspects of low-level human vision: contrast sensitivity, contrast masking, and contrast matching. They apply these tests to 34 existing quality metrics to assess their perceptual alignment.
Result: The tests reveal patterns in metric behavior: LPIPS and MS-SSIM can predict contrast masking, SSIM overemphasizes high spatial frequencies (mitigated in MS-SSIM), and most metrics fail to model supra-threshold contrast constancy. The tests uncover properties not visible with standard evaluation protocols.
Conclusion: The proposed perceptual tests provide a valuable framework for assessing quality metrics’ alignment with human vision principles, revealing limitations in existing metrics and offering insights for developing more perceptually-aligned evaluation methods.
Abstract: Image and video quality metrics, such as SSIM, LPIPS, and VMAF, aim to predict perceived visual quality and are often assumed to reflect principles of human vision. However, relatively few metrics explicitly incorporate models of human perception, with most relying on hand-crafted formulas or data-driven training to approximate perceptual alignment. In this paper, we introduce a set of tests for full-reference quality metrics that evaluate their ability to capture key aspects of low-level human vision: contrast sensitivity, contrast masking, and contrast matching. These tests provide an additional framework for assessing both established and newly proposed metrics. We apply the tests to 34 existing quality metrics and highlight patterns in their behavior, including the ability of LPIPS and MS-SSIM to predict contrast masking and the tendency of SSIM to overemphasize high spatial frequencies, which is mitigated in MS-SSIM, and the general inability of metrics to model supra-threshold contrast constancy. Our results demonstrate how these tests can reveal properties of quality metrics that are not easily observed with standard evaluation protocols.
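To make the flavor of such tests concrete: a frequency-blind metric such as plain MSE assigns the same score to an equal contrast decrement at any spatial frequency, whereas human contrast sensitivity varies strongly with frequency. The toy probe below is illustrative only (it is not the authors' test battery, and the viewing-geometry numbers are assumptions), but it shows the kind of grating stimulus such tests are built from:

```python
import numpy as np

def grating(freq_cpd, contrast, size=128, ppd=32.0):
    """Horizontal sinusoidal grating. freq_cpd in cycles/degree, ppd =
    pixels/degree, mean luminance 0.5, Michelson contrast `contrast`.
    Units are assumed for illustration."""
    x = np.arange(size) / ppd  # position in degrees
    return 0.5 * (1.0 + contrast * np.sin(2 * np.pi * freq_cpd * x))[None, :].repeat(size, 0)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Probe: same contrast decrement (0.5 -> 0.4) at different spatial
# frequencies. A frequency-blind metric (MSE) scores all three identically,
# whereas human sensitivity, and a metric aligned with it, varies with
# frequency.
scores = {f: mse(grating(f, 0.5), grating(f, 0.4)) for f in (1, 4, 8)}
```

Here MSE yields the same value at 1, 4, and 8 cycles/degree; a metric that tracks the contrast sensitivity function would not.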
[578] Enhancing Neural Video Compression of Static Scenes with Positive-Incentive Noise
Cheng Yuan, Zhenyu Jia, Jiawei Shao, Xuelong Li
Main category: eess.IV
TL;DR: Proposes adding positive-incentive noise to neural video compression for static scene videos to improve efficiency by treating temporal changes as noise for model finetuning, achieving 73% BD-rate savings while maintaining pixel-level fidelity.
Details
Motivation: Static scene videos (surveillance, videotelephony) dominate storage and bandwidth, but current codecs and neural video compression methods are inefficient for them. Traditional codecs don't use temporal redundancy well, while neural methods suffer from training-test distribution gaps. Generative methods improve quality but hallucinate details unacceptable for authenticity-critical applications.
Method: Incorporates positive-incentive noise into neural video compression for static scene videos. Short-term temporal changes are reinterpreted as positive-incentive noise to facilitate model finetuning. Disentangles transient variations from the persistent background, internalizing structured prior information in the compression model. During inference, invariant components require minimal signaling.
Result: Preliminary experiments demonstrate 73% BjĂžntegaard delta (BD) rate saving compared to general neural video compression models. Provides an effective computation-for-bandwidth trade-off, enabling robust video transmission under adverse network conditions and economical long-term retention of surveillance footage.
Conclusion: The method offers an effective solution for compressing static scene videos by leveraging temporal redundancy through positive-incentive noise, maintaining pixel-level fidelity while significantly reducing bandwidth requirements, making it suitable for authenticity-critical applications like surveillance.
Abstract: Static scene videos, such as surveillance feeds and videotelephony streams, constitute a dominant share of storage consumption and network traffic. However, both traditional standardized codecs and neural video compression (NVC) methods struggle to encode these videos efficiently due to inadequate usage of temporal redundancy and severe distribution gaps between training and test data, respectively. While recent generative compression methods improve perceptual quality, they introduce hallucinated details that are unacceptable in authenticity-critical applications. To overcome these limitations, we propose to incorporate positive-incentive noise into NVC for static scene videos, where short-term temporal changes are reinterpreted as positive-incentive noise to facilitate model finetuning. By disentangling transient variations from the persistent background, structured prior information is internalized in the compression model. During inference, the invariant component requires minimal signaling, thus reducing data transmission while maintaining pixel-level fidelity. Preliminary experiments demonstrate a 73% BjĂžntegaard delta (BD) rate saving compared to general NVC models. Our method provides an effective solution to trade computation for bandwidth, enabling robust video transmission under adverse network conditions and economic long-term retention of surveillance footage.
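For readers unfamiliar with the BD-rate figure cited above: the standard BjĂžntegaard computation fits log-rate as a cubic polynomial of quality (e.g. PSNR) for each codec and compares the average log-rates over the overlapping quality range. A minimal sketch of that standard formulation follows (it is generic, not tied to this paper's evaluation code):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%) between two rate-distortion curves.
    Cubic fit of log10(rate) vs PSNR, integrated over the overlapping
    quality range. Negative values mean the test codec saves bitrate."""
    lr_a = np.log10(rate_anchor)
    lr_t = np.log10(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(p_a), np.polyint(p_t)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_t = (np.polyval(it, hi) - np.polyval(it, lo)) / (hi - lo)
    return (10 ** (avg_t - avg_a) - 1) * 100
```

As a sanity check, a codec whose rates are exactly half the anchor's at identical PSNR points yields a BD-rate of -50%.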
[579] AI End-to-End Radiation Treatment Planning Under One Second
Simon Arberet, Riqiang Gao, Martin Kraus, Florin C. Ghesu, Wilko Verbakel, Mamadou Diallo, Anthony Magliari, Venkatesan Karuppusamy, Sushil Beriwal, REQUITE Consortium, Ali Kamen, Dorin Comaniciu
Main category: eess.IV
TL;DR: AIRT is an end-to-end deep learning framework that generates deliverable prostate radiotherapy plans from CT images in under 1 second, achieving non-inferiority to conventional planning methods.
Details
Motivation: To reduce radiotherapy planning time and inter-planner variability by developing an ultra-fast AI-based system that can generate standardized treatment plans directly from medical imaging data.
Method: End-to-end deep learning framework with differentiable dose feedback, adversarial fluence map shaping, and plan generation augmentation. Trained on over 10,000 prostate cases to generate single-arc VMAT plans from CT images and structure contours.
Result: Generates deliverable treatment plans in under 1 second on a single GPU, with non-inferiority to RapidPlan Eclipse across target coverage and organ-at-risk sparing metrics, achieving target homogeneity of HI = 0.10 ± 0.01.
Conclusion: AIRT represents a significant advancement toward ultra-fast, standardized radiotherapy planning that could streamline clinical workflows while maintaining plan quality comparable to conventional methods.
Abstract: Artificial intelligence-based radiation therapy (RT) planning has the potential to reduce planning time and inter-planner variability, improving efficiency and consistency in clinical workflows. Most existing automated approaches rely on multiple dose evaluations and corrections, resulting in plan generation times of several minutes. We introduce AIRT (Artificial Intelligence-based Radiotherapy), an end-to-end deep-learning framework that directly infers deliverable treatment plans from CT images and structure contours. AIRT generates single-arc VMAT prostate plans, from imaging and anatomical inputs to leaf sequencing, in under one second on a single Nvidia A100 GPU. The framework includes a differentiable dose feedback, an adversarial fluence map shaping, and a plan generation augmentation to improve plan quality and robustness. The model was trained on more than 10,000 intact prostate cases. Non-inferiority to RapidPlan Eclipse was demonstrated across target coverage and OAR sparing metrics. Target homogeneity (HI = 0.10 $\pm$ 0.01) and OAR sparing were similar to reference plans when evaluated using AcurosXB. These results represent a significant step toward ultra-fast standardized RT planning and a streamlined clinical workflow.
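The homogeneity index quoted above (HI = 0.10 ± 0.01) is not defined in the abstract; a common ICRU 83-style definition, assumed here purely for illustration, is HI = (D2% - D98%) / D50%, where Dx% is the dose received by at least x% of the target volume:

```python
import numpy as np

def homogeneity_index(target_dose):
    """ICRU 83-style homogeneity index: (D2% - D98%) / D50%.
    The paper does not state its exact HI definition; this common one is
    assumed. Lower values indicate a more uniform target dose."""
    d2 = np.percentile(target_dose, 98)   # D2%: near-maximum dose
    d98 = np.percentile(target_dose, 2)   # D98%: near-minimum dose
    d50 = np.percentile(target_dose, 50)  # median dose
    return (d2 - d98) / d50
```

A perfectly uniform target dose gives HI = 0; a ±10% linear spread around the prescription gives HI of roughly 0.19 under this definition.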
[580] An Efficient Self-supervised Seismic Data Reconstruction Method Based on Self-Consistency Learning
Mingwei Wang, Junheng Peng, Yingtian Liu, Yong Li
Main category: eess.IV
TL;DR: Self-supervised seismic data reconstruction using lightweight network with self-consistency learning, no extra datasets needed.
Details
Motivation: Seismic data often has irregular receiver distribution causing non-uniform data acquisition, affecting processing/inversion. Existing deep learning methods need supervised datasets or lack constraints, leading to unstable performance.
Method: Self-supervised self-consistency learning with a lightweight network (188,849 parameters); uses inter-component correlations in seismic data for loss function design; no extra datasets required.
Result: Validated on two public seismic datasets, yields high-quality reconstruction, valuable for large-scale/complex seismic exploration tasks.
Conclusion: Proposed method effectively reconstructs seismic data without extra datasets, using self-supervised approach with lightweight network.
Abstract: Seismic exploration remains the most critical method for characterizing subsurface structures in geophysics. However, complex surface conditions often cause a non-uniform distribution of seismic receivers along survey lines, leading to irregularly acquired seismic data, which affects subsequent processing and inversion. Prior deep learning-based seismic data reconstruction methods typically rely on datasets for supervised training. While some existing methods avoid extra data, they lack effective constraints on reconstructed data, leading to unstable performance. In this study, we propose a self-supervised self-consistency learning strategy with a lightweight network for seismic data reconstruction. Our method requires no extra datasets, and it leverages inter-component correlations in seismic data to design a loss function, optimizing a network with only 188,849 learnable parameters. Validated on two public seismic datasets, results demonstrate our approach yields high-quality reconstruction, providing significant value for large-scale and complex seismic exploration tasks.
[581] Multivariate Fields of Experts for Convergent Image Reconstruction
Stanislas Ducotterd, Michael Unser
Main category: eess.IV
TL;DR: Multivariate Fields of Experts framework using Moreau envelopes of the $\ell_\infty$-norm for image priors, outperforming univariate models and approaching deep learning performance with fewer parameters and data.
Details
Motivation: To develop a more effective and interpretable framework for learning image priors that generalizes existing fields of experts methods, addressing the need for models that balance performance with computational efficiency and data requirements.
Method: Proposes multivariate fields of experts using multivariate potential functions constructed via Moreau envelopes of the $\ell_\infty$-norm, extending traditional univariate approaches to capture more complex image statistics.
Result: Outperforms comparable univariate models across multiple inverse problems (denoising, deblurring, compressed-sensing MRI, CT), achieves performance close to deep-learning-based regularizers while being significantly faster with fewer parameters and training data.
Conclusion: The multivariate fields of experts framework provides an effective, interpretable, and efficient alternative to deep learning methods for image reconstruction tasks, with theoretical convergence guarantees ensuring reliability.
Abstract: We introduce the multivariate fields of experts, a new framework for the learning of image priors. Our model generalizes existing fields of experts methods by incorporating multivariate potential functions constructed via Moreau envelopes of the $\ell_\infty$-norm. We demonstrate the effectiveness of our proposal across a range of inverse problems that include image denoising, deblurring, compressed-sensing magnetic-resonance imaging, and computed tomography. The proposed approach outperforms comparable univariate models and achieves performance close to that of deep-learning-based regularizers while being significantly faster, requiring fewer parameters, and being trained on substantially fewer data. In addition, our model retains a high level of interpretability due to its structured design. It is supported by theoretical convergence guarantees which ensure reliability in sensitive reconstruction tasks.
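The Moreau envelope of the $\ell_\infty$-norm admits a closed-form evaluation via the Moreau decomposition: the proximal operator of $\lambda\|\cdot\|_\infty$ is $x - \lambda\,\mathrm{proj}_{B_{\ell_1}}(x/\lambda)$, a projection onto the unit $\ell_1$ ball. The sketch below (a standard construction, not the paper's code) evaluates $\mathrm{env}_\lambda(x) = \min_y \|y\|_\infty + \|x-y\|^2/(2\lambda)$:

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection onto the l1 ball (standard sort-and-threshold rule)."""
    a = np.abs(v)
    if a.sum() <= radius:
        return v.copy()
    u = np.sort(a)[::-1]                 # magnitudes, descending
    cssv = np.cumsum(u) - radius
    idx = np.arange(1, len(u) + 1)
    rho = idx[u > cssv / idx][-1]        # number of active coordinates
    theta = cssv[rho - 1] / rho          # soft-threshold level
    return np.sign(v) * np.maximum(a - theta, 0.0)

def moreau_env_linf(x, lam):
    """Moreau envelope of the l_inf norm:
       env(x) = min_y ||y||_inf + ||x - y||^2 / (2*lam).
    Via Moreau decomposition, prox_{lam*||.||_inf}(x) = x - lam * proj_{l1}(x/lam)."""
    p = x - lam * project_l1_ball(x / lam, 1.0)
    return float(np.max(np.abs(p)) + np.sum((x - p) ** 2) / (2 * lam))
```

Near the origin (where $\|x/\lambda\|_1 \le 1$) the prox is 0 and the envelope reduces to the smooth quadratic $\|x\|^2/(2\lambda)$, which is exactly the smoothing property that makes Moreau envelopes attractive as learnable potentials.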
[582] SAAIPAA: Optimizing aspect-angles-invariant physical adversarial attacks on SAR target recognition models
Isar Lemeire, Yee Wei Law, Sang-Heon Lee, William Meakin, Tat-Jun Chin
Main category: eess.IV
TL;DR: SAAIPAA is a SAR physical adversarial attack framework that works without knowing platform aspect angles, using physics-based modeling and optimal reflector positioning to fool ML-based target recognition systems.
Details
Motivation: SAR ATR systems are vulnerable to adversarial attacks, but existing physical adversarial attacks require knowledge of SAR platform aspect angles, limiting real-world applicability. There's a need for attacks that work without this knowledge.
Method: Proposes the SAAIPAA framework, which uses physics-based modeling of reflected signals and SAR imaging to determine optimal reflector positions/orientations without aspect angle knowledge. Includes a method for generating bounding boxes in SAR images to map between image and scene coordinates.
Result: Achieves 80% fooling rates for DenseNet-121 and ResNet50 in white-box settings with four reflectors, and 99.2% when aspect angles are known. Shows good transferability in black-box settings between some models.
Conclusion: SAAIPAA enables effective physical adversarial attacks on SAR ATR systems without requiring aspect angle knowledge, making such attacks more practical and threatening to real-world SAR applications.
Abstract: Synthetic aperture radar (SAR) enables versatile, all-time, all-weather remote sensing. Coupled with automatic target recognition (ATR) leveraging machine learning (ML), SAR is empowering a wide range of Earth observation and surveillance applications. However, the surge of attacks based on adversarial perturbations against the ML algorithms underpinning SAR ATR is prompting the need for systematic research into adversarial perturbation mechanisms. Research in this area began in the digital (image) domain and evolved into the physical (signal) domain, resulting in physical adversarial attacks (PAAs) that strategically exploit corner reflectors as attack vectors to evade ML-based ATR. Existing PAAs assume that the attacker knows the SAR platform’s aspect angles, restricting their applicability to idealized scenarios. We propose the SAR Aspect-Angles-Invariant Physical Adversarial Attack (SAAIPAA), a framework that determines the optimal positions and orientations of any given set of reflectors, regardless of their number or size, even when the attacker lacks knowledge of the SAR platform’s aspect angles. This is enabled by rigorous physics-based modeling of the reflected signal and the SAR imaging process. To facilitate mapping between image and scene coordinates, we additionally propose a method for generating bounding boxes in densely sampled azimuthal SAR images, allowing the target object to serve as a spatial reference. The resultant physical evasion attacks are efficiently realizable and optimal over the considered range of aspect angles between a SAR platform and a target, achieving state-of-the-art fooling rates (80% for DenseNet-121 and ResNet50) in the white-box setting for a four-reflector configuration. When aspect angles are known to the attacker, an average fooling rate of 99.2% is attainable. In black-box settings, SAAIPAA transfers well between some models.
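For a sense of scale on why corner reflectors are effective attack vectors: a textbook peak radar cross-section for a triangular trihedral reflector of edge length $a$ at wavelength $\lambda$ is $\sigma = 4\pi a^4 / (3\lambda^2)$. This is generic radar background, not SAAIPAA's signal model:

```python
import math

def trihedral_rcs_peak(edge_m, wavelength_m):
    """Peak radar cross-section (m^2) of a triangular trihedral corner
    reflector, sigma = 4*pi*a^4 / (3*lambda^2). Textbook formula, used
    only to illustrate why modest reflectors dominate a SAR return; it
    is not the paper's physics-based scattering model."""
    return 4 * math.pi * edge_m ** 4 / (3 * wavelength_m ** 2)
```

At X-band (wavelength ~3 cm), a 30 cm trihedral already presents a peak RCS of tens of square meters, far exceeding its physical cross-section, which is why a handful of well-placed reflectors can dominate the signature a classifier sees.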
[583] Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal
Xiaolong Qian, Qi Jiang, Lei Sun, Zongxi Yu, Kailun Yang, Peixuan Wu, Jiacheng Zhou, Yao Gao, Yaoguang Ma, Ming-Hsuan Yang, Kaiwei Wang
Main category: eess.IV
TL;DR: VeilGen is a generative model that simulates veiling glare in simplified optical systems, and DeVeiler is a restoration network that removes this glare using learned latent optical maps.
Details
Motivation: Simplified optical systems (single-lens, metalens) suffer from veiling glare caused by stray-light scattering, which degrades imaging performance beyond traditional aberrations. Existing scattering models fail to fit this glare, making paired training data difficult to obtain.
Method: VeilGen learns to simulate veiling glare by estimating optical transmission and glare maps unsupervised from target images, regularized by Stable Diffusion priors. DeVeiler is a restoration network trained with a reversibility constraint that uses these latent maps to guide inverse scattering removal.
Result: Extensive experiments show superior restoration quality and physical fidelity compared to existing methods. VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide DeVeiler’s restoration process.
Conclusion: The approach enables paired dataset generation with realistic compound degradation and provides effective guidance for veiling glare removal, advancing restoration for simplified optical systems.
Abstract: Beyond the commonly recognized optical aberrations, the imaging performance of simplified optical systems–including single-lens and metalens designs–is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments. This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatial-varying and depth-independent nature. Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models. To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors. VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process. We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model. Extensive experiments on challenging simplified optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods. These suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released at https://github.com/XiaolongQian/DeVeiler.
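The contrast with dehazing drawn in the abstract can be made concrete. The dehazing model I = J*t + A*(1 - t) ties the additive term to a single global airlight A and a depth-dependent transmission t, whereas per-pixel transmission and glare maps give I = J*T + G, which is trivially invertible once the latent maps are known. The sketch below is an algebraic illustration, assumed here for exposition, of how such latent maps parameterize the degradation; DeVeiler's actual inverse process is learned, not closed-form:

```python
import numpy as np

def apply_veiling_glare(clean, transmission, glare):
    """Forward model assumed for illustration: I = J*T + G, with a
    spatially varying, depth-independent transmission map T and an
    additive glare map G (both per-pixel arrays)."""
    return clean * transmission + glare

def remove_veiling_glare(observed, transmission, glare, eps=1e-6):
    """Algebraic inverse of the forward model given (estimated) latent maps."""
    return (observed - glare) / np.maximum(transmission, eps)
```

With exact maps the round trip recovers the clean image; in practice the maps are VeilGen's estimates, so the inversion serves as guidance rather than an exact solution.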