Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 73]
cs.CV [Total: 177]
cs.AI [Total: 35]
cs.SD [Total: 7]
cs.LG [Total: 96]
cs.MA [Total: 3]
cs.MM [Total: 0]
eess.AS [Total: 2]
eess.IV [Total: 8]

cs.CL

[1] Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language

Happymore Masoka

Main category: cs.CL

TL;DR: Shona spaCy is a rule-based morphological analysis pipeline for the under-resourced Bantu language Shona, achieving 90% POS-tagging and 88% morphological-feature accuracy.

Details

Motivation: Shona remains under-served in NLP despite advances in multilingual processing, lacking morphological analysis tools and language-aware resources.

Method: Built on spaCy framework with curated JSON lexicon and linguistically grounded rules for noun-class prefixes, verbal subject concords, tense-aspect markers, ideophones, and clitics.

Result: Evaluation shows 90% POS-tagging accuracy and 88% morphological-feature accuracy on formal and informal Shona corpora.

Conclusion: The toolkit advances NLP accessibility for Shona speakers and provides a template for other under-resourced Bantu languages, bridging descriptive grammar with computational implementation.

Abstract: Despite rapid advances in multilingual natural language processing (NLP), the Bantu language Shona remains under-served in terms of morphological analysis and language-aware tools. This paper presents Shona spaCy, an open-source, rule-based morphological pipeline for Shona built on the spaCy framework. The system combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes (Mupanda 1-18), verbal subject concords, tense-aspect markers, ideophones, and clitics, integrating these into token-level annotations for lemma, part-of-speech, and morphological features. The toolkit is available via pip install shona-spacy, with source code at https://github.com/HappymoreMasoka/shona-spacy and a PyPI release at https://pypi.org/project/shona-spacy/0.1.4/. Evaluation on formal and informal Shona corpora yields 90% POS-tagging accuracy and 88% morphological-feature accuracy, while maintaining transparency in its linguistic decisions. By bridging descriptive grammar and computational implementation, Shona spaCy advances NLP accessibility and digital inclusion for Shona speakers and provides a template for morphological analysis tools for other under-resourced Bantu languages.
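To make the rule-plus-lexicon idea concrete, here is a minimal sketch of noun-class prefix analysis. It is not the shona-spacy API: the table names, the function `analyze_noun`, and the tiny lexicon are hypothetical and cover only the pairs munhu/vanhu ("person"/"people") and chinhu/zvinhu ("thing"/"things"). A real analyzer would also have to disambiguate prefixes shared by several classes (for example mu-), which this toy does not attempt.

```python
# Toy prefix table (illustrative only): noun-class prefix -> class number.
NOUN_CLASS_PREFIXES = {"mu": 1, "va": 2, "chi": 7, "zvi": 8}

# Toy stem lexicon (illustrative only); a real lexicon would be far larger.
STEMS = {"nhu"}


def analyze_noun(token):
    """Split a token into prefix + stem using longest-prefix match.

    Returns a dict with the prefix, stem, and noun class, or None when
    no (prefix, stem) pair from the toy tables matches.
    """
    word = token.lower()
    # Try longer prefixes first so "zvi" is tested before shorter ones.
    for prefix in sorted(NOUN_CLASS_PREFIXES, key=len, reverse=True):
        stem = word[len(prefix):]
        if word.startswith(prefix) and stem in STEMS:
            return {
                "prefix": prefix,
                "stem": stem,
                "noun_class": NOUN_CLASS_PREFIXES[prefix],
            }
    return None
```

Calling `analyze_noun("vanhu")` yields prefix `va`, stem `nhu`, and noun class 2; tokens outside the toy tables return `None`.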

[2] Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search

Dong Liu, Yanxuan Yu

Main category: cs.CL

TL;DR: SPI introduces multi-resolution vector indexing with query-adaptive resolution control for RAG systems, achieving up to a 5.7× retrieval speedup and up to a 2.5-point F1 improvement.

Details

Motivation: Existing vector database retrieval pipelines use flat indexing that cannot adapt to varying semantic granularity, leading to suboptimal speed-relevance trade-offs.

Method: Constructs semantic pyramid over document embeddings with dynamic resolution selection via lightweight classifier, enabling progressive coarse-to-fine retrieval.

Result: Achieves up to 5.7× retrieval speedup and 1.8× memory efficiency gain, and improves end-to-end QA F1 by up to 2.5 points on multiple RAG benchmarks.

Conclusion: SPI provides efficient adaptive retrieval with theoretical guarantees on retrieval quality and latency, and its compatibility with existing VecDB infrastructures makes it readily deployable in production RAG systems.

Abstract: Retrieval-Augmented Generation (RAG) systems have become a dominant approach to augment large language models (LLMs) with external knowledge. However, existing vector database (VecDB) retrieval pipelines rely on flat or single-resolution indexing structures, which cannot adapt to the varying semantic granularity required by diverse user queries. This limitation leads to suboptimal trade-offs between retrieval speed and contextual relevance. To address this, we propose Semantic Pyramid Indexing (SPI), a novel multi-resolution vector indexing framework that introduces query-adaptive resolution control for RAG in VecDBs. Unlike existing hierarchical methods that require offline tuning or separate model training, SPI constructs a semantic pyramid over document embeddings and dynamically selects the optimal resolution level per query through a lightweight classifier. This adaptive approach enables progressive retrieval from coarse to fine representations, significantly accelerating search while maintaining semantic coverage. We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks. SPI achieves up to 5.7× retrieval speedup and 1.8× memory efficiency gain while improving end-to-end QA F1 scores by up to 2.5 points compared to strong baselines. Our theoretical analysis provides guarantees on retrieval quality and latency bounds, while extensive ablation studies validate the contribution of each component. The framework’s compatibility with existing VecDB infrastructures makes it readily deployable in production RAG systems. Code is available at https://github.com/FastLM/SPI_VecDB.
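As a rough illustration of coarse-to-fine retrieval (not the authors' SPI implementation), the sketch below builds a two-level index with NumPy: a coarse level of cluster centroids and a fine level of full embeddings. The function names, the plain k-means clustering, and the fixed `n_probe` parameter are assumptions made for illustration. SPI chooses the resolution per query with a learned classifier, which is not modeled here.

```python
import numpy as np


def build_two_level_index(embeddings, n_clusters, seed=0):
    """Hypothetical two-level index: coarse centroids plus a cluster id per document."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen documents, then refine with k-means.
    start = rng.choice(len(embeddings), n_clusters, replace=False)
    centroids = embeddings[start].copy()
    for _ in range(10):
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = embeddings[assign == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    # Recompute assignments so they agree with the final centroids.
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def coarse_to_fine_search(query, embeddings, centroids, assign, n_probe, k):
    """Return indices of up to k nearest documents found in the n_probe closest clusters."""
    # Coarse step: rank clusters by centroid distance and keep the closest ones.
    coarse = np.linalg.norm(centroids - query, axis=1).argsort()[:n_probe]
    # Fine step: exact search restricted to documents in the selected clusters.
    candidates = np.flatnonzero(np.isin(assign, coarse))
    fine = np.linalg.norm(embeddings[candidates] - query, axis=1)
    return candidates[fine.argsort()[:k]]
```

A smaller `n_probe` scans fewer documents at the risk of missing true neighbors; setting it to the number of clusters reduces the search to exact brute force.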

[3] Bench360: Benchmarking Local LLM Inference from 360°

Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst

Main category: cs.CL

TL;DR: Bench360 is a comprehensive benchmarking framework for local LLM inference that helps users evaluate different configurations across multiple metrics and usage scenarios.

Details

Motivation: Users face overwhelming configuration choices when running LLMs locally, and existing benchmarks are too narrow and not user-focused, failing to integrate system and task-specific metrics.

Method: Bench360 allows users to define custom tasks with datasets and metrics, then automatically benchmarks selected LLMs, inference engines, and quantization levels across different usage scenarios (single stream, batch, server).

Result: Evaluation on four common LLM tasks across three hardware platforms and four inference engines revealed interesting trade-offs between task performance and system efficiency, with no single best setup.

Conclusion: There is no single optimal configuration for local LLM inference, strongly motivating the need for comprehensive benchmarking frameworks like Bench360 to help users make informed decisions.

Abstract: Running large language models (LLMs) locally is becoming increasingly common. While the growing availability of small open-source models and inference engines has lowered the entry barrier, users now face an overwhelming number of configuration choices. Identifying an optimal configuration – balancing functional and non-functional requirements – requires substantial manual effort. While several benchmarks target LLM inference, they are designed for narrow evaluation goals and not user-focused. They fail to integrate relevant system and task-specific metrics into a unified, easy-to-use benchmark that supports multiple inference engines, usage scenarios, and quantization levels. To address this gap, we present Bench360 – Benchmarking Local LLM Inference from 360°. Bench360 allows users to easily define their own custom tasks along with datasets and relevant task-specific metrics and then automatically benchmarks selected LLMs, inference engines, and quantization levels across different usage scenarios (single stream, batch & server). Bench360 tracks a wide range of metrics, including (1) system metrics – such as Computing Performance (e.g., latency, throughput), Resource Usage (e.g., energy per query), and Deployment (e.g., cold start time) – and (2) task-specific metrics such as ROUGE, F1 score or accuracy. We demonstrate Bench360 on four common LLM tasks – General Knowledge & Reasoning, QA, Summarization and Text-to-SQL – across three hardware platforms and four state of the art inference engines. Our results reveal several interesting trade-offs between task performance and system-level efficiency, highlighting the differences in inference engines and models. Most importantly, there is no single best setup for local inference, which strongly motivates the need for a framework such as Bench360.

[4] How Well Do LLMs Understand Tunisian Arabic?

Mohamed Mahdi

Main category: cs.CL

TL;DR: This paper benchmarks LLMs on Tunisian Arabic (Tunizi) tasks, revealing significant gaps in handling low-resource languages and advocating for more inclusive AI development.

Details

Motivation: Industrial-scale LLMs often neglect low-resource languages like Tunisian Arabic, risking exclusion of millions from AI interactions in their native language and threatening cultural preservation.

Method: Created a novel dataset with parallel Tunizi, standard Tunisian Arabic, and English translations with sentiment labels. Benchmarked popular LLMs on transliteration, translation, and sentiment analysis tasks.

Result: Results showed significant performance differences between models, highlighting both strengths and limitations in understanding and processing Tunisian dialects.

Conclusion: The work underscores the importance of including low-resource languages in next-generation AI systems to ensure technology remains accessible, inclusive, and culturally grounded.

Abstract: Large Language Models (LLMs) are the engines driving today’s AI agents. The better these models understand human languages, the more natural and user-friendly the interaction with AI becomes, from everyday devices like computers and smartwatches to any tool that can act intelligently. Yet, the ability of industrial-scale LLMs to comprehend low-resource languages, such as Tunisian Arabic (Tunizi), is often overlooked. This neglect risks excluding millions of Tunisians from fully interacting with AI in their own language, pushing them toward French or English. Such a shift not only threatens the preservation of the Tunisian dialect but may also create challenges for literacy and influence younger generations to favor foreign languages. In this study, we introduce a novel dataset containing parallel Tunizi, standard Tunisian Arabic, and English translations, along with sentiment labels. We benchmark several popular LLMs on three tasks: transliteration, translation, and sentiment analysis. Our results reveal significant differences between models, highlighting both their strengths and limitations in understanding and processing Tunisian dialects. By quantifying these gaps, this work underscores the importance of including low-resource languages in the next generation of AI systems, ensuring technology remains accessible, inclusive, and culturally grounded.

[5] Ellipsoid-Based Decision Boundaries for Open Intent Classification

Yuetian Zou, Hanlei Zhang, Hua Xu, Songze Li, Long Xiao

Main category: cs.CL

TL;DR: EliDecide is a novel method for textual open intent classification that learns ellipsoid decision boundaries with varying scales along different feature directions, outperforming spherical boundary approaches.

Details

Motivation: Existing adaptive decision boundary methods assume isotropic distributions of known classes, restricting boundaries to balls and overlooking distributional variance along different directions, limiting their effectiveness.

Method: Uses supervised contrastive learning for discriminative features, learnable matrices to parameterize ellipsoid boundaries, and optimizes via dual loss function balancing empirical and open-space risks.

Result: Achieves state-of-the-art performance on multiple text intent benchmarks and question classification dataset, demonstrating superior open intent detection capability.

Conclusion: Ellipsoid boundaries offer greater flexibility than spherical ones and show strong potential for generalization to diverse complex open-world text classification tasks.

Abstract: Textual open intent classification is crucial for real-world dialogue systems, enabling robust detection of unknown user intents without prior knowledge and contributing to the robustness of the system. While adaptive decision boundary methods have shown great potential by eliminating manual threshold tuning, existing approaches assume isotropic distributions of known classes, restricting boundaries to balls and overlooking distributional variance along different directions. To address this limitation, we propose EliDecide, a novel method that learns ellipsoid decision boundaries with varying scales along different feature directions. First, we employ supervised contrastive learning to obtain a discriminative feature space for known samples. Second, we apply learnable matrices to parameterize ellipsoids as the boundaries of each known class, offering greater flexibility than spherical boundaries defined solely by centers and radii. Third, we optimize the boundaries via a newly designed dual loss function that balances empirical and open-space risks: expanding boundaries to cover known samples while contracting them against synthesized pseudo-open samples. Our method achieves state-of-the-art performance on multiple text intent benchmarks and on a question classification dataset. The flexibility of the ellipsoids demonstrates superior open intent detection capability and strong potential for generalization to more text classification tasks in diverse complex open-world scenarios.
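The difference between a ball and an ellipsoid boundary can be sketched in a few lines. The abstract does not give EliDecide's exact parameterization, so the code below assumes one standard form: a class with center c and matrix A accepts x when ||A(x - c)|| <= 1, which reduces to a ball of radius r when A = I / r. The function names are hypothetical, and the learned training objective is not shown.

```python
import numpy as np


def inside_ellipsoid(x, center, A):
    """Return True if x lies in the set {x : ||A (x - center)||_2 <= 1}."""
    return float(np.linalg.norm(A @ (x - center))) <= 1.0


def classify_open_intent(x, centers, matrices):
    """Return the index of the best-fitting known class, or -1 for an open intent.

    The score for each class is ||A (x - c)||; a sample is accepted by the
    lowest-scoring class only if that score is within the boundary (<= 1).
    """
    scores = [np.linalg.norm(A @ (x - c)) for c, A in zip(centers, matrices)]
    best = int(np.argmin(scores))
    return best if scores[best] <= 1.0 else -1
```

With `A = np.diag([1/3, 1.0])` the boundary stretches to 3 units along the first axis but only 1 along the second, so a point such as (2, 0) is accepted even though a unit ball around the same center would reject it.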

[6] Prompt-Based Value Steering of Large Language Models

Giulio Antonio Abbo, Tony Belpaeme

Main category: cs.CL

TL;DR: The paper presents a method to evaluate how effectively prompts can steer LLM responses toward specific human values, showing that value-aligned prompting works without model fine-tuning.

Details

Motivation: Current fine-tuning approaches for aligning LLMs with human values are static and don't adapt to dynamic preferences, creating a need for prompt-based steering methods.

Method: Developed a reproducible, model-agnostic procedure to evaluate prompt effectiveness in steering text toward target values, using Schwartz’s theory of human values and dialogue datasets with baseline vs value-conditioned prompts.

Result: Demonstrated that value steering is possible through prompting alone, without altering the model or dynamically optimizing prompts, using a variant of the Wizard-Vicuna model.

Conclusion: Prompt-based value steering provides a practical alternative to fine-tuning for aligning LLMs with human values in dynamic scenarios.

Abstract: Large language models are increasingly used in applications where alignment with human values is critical. While model fine-tuning is often employed to ensure safe responses, this technique is static and does not lend itself to everyday situations involving dynamic values and preferences. In this paper, we present a practical, reproducible, and model-agnostic procedure to evaluate whether a prompt candidate can effectively steer generated text toward specific human values, formalising a scoring method to quantify the presence and gain of target values in generated responses. We apply our method to a variant of the Wizard-Vicuna language model, using Schwartz’s theory of basic human values and a structured evaluation through a dialogue dataset. With this setup, we compare a baseline prompt to one explicitly conditioned on values, and show that value steering is possible even without altering the model or dynamically optimising prompts.

[7] A new kid on the block: Distributional semantics predicts the word-specific tone signatures of monosyllabic words in conversational Taiwan Mandarin

Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen

Main category: cs.CL

TL;DR: This paper investigates how word meaning affects the pitch contours of monosyllabic words in conversational Taiwan Mandarin, showing that word sense and contextualized embeddings predict tonal realization even after phonetic and contextual variables are controlled for.

Details

Motivation: To understand how semantic factors influence pitch contours in spontaneous Mandarin conversation, challenging standard theories that focus primarily on phonetic and contextual variables.

Method: Used generalized additive models to decompose pitch contours into components tied to control variables and semantic predictors, analyzing word tokens, heterographic homophones, and contextualized embeddings.

Result: Word sense is a better predictor than word identity alone, heterographic homophones have different pitch contours, and contextualized embeddings can predict pitch contours with accuracy exceeding permutation baselines.

Conclusion: Semantic factors significantly influence Mandarin tonal realization, supporting the Discriminative Lexicon Model framework and challenging standard tone theories.

Abstract: We present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of words’ meanings. We used the generalized additive model to decompose a given observed pitch contour into a set of component pitch contours that are tied to different control variables and semantic predictors. Even when variables such as word duration, gender, speaker identity, tonal context, vowel height, and utterance position are controlled for, the effect of word remains a strong predictor of tonal realization. We present evidence that this effect of word is a semantic effect: word sense is shown to be a better predictor than word, and heterographic homophones are shown to have different pitch contours. The strongest evidence for the importance of semantics is that the pitch contours of individual word tokens can be predicted from their contextualized embeddings with an accuracy that substantially exceeds a permutation baseline. For phonetics, distributional semantics is a new kid on the block. Although our findings challenge standard theories of Mandarin tone, they fit well within the theoretical framework of the Discriminative Lexicon Model.

[8] Concept-Based Interpretability for Toxicity Detection

Samarth Garg, Deeksha Varshney, Divya Singh

Main category: cs.CL

TL;DR: This paper introduces an interpretability technique using Concept Gradient method to address misclassifications in toxicity detection caused by disproportionate concept attribution, and proposes lexicon-free augmentation to analyze model behavior.

Details

Motivation: Limited exploration of concept-based explanations in toxicity detection, and the problem of disproportionate concept attribution leading to classification errors in existing models.

Method: Leverages subtype attributes as concepts, introduces Concept Gradient method for causal interpretation, creates Targeted Lexicon Set to capture misclassification-causing words, computes Word-Concept Alignment scores, and develops lexicon-free augmentation strategy.

Result: Proposed approach provides more causal interpretation of model decisions and helps identify over-attribution issues in toxicity classification.

Conclusion: The Concept Gradient method and lexicon-free augmentation offer improved interpretability and insights into model attribution patterns in toxicity detection, addressing limitations of traditional feature-based approaches.

Abstract: The rise of social networks has not only facilitated communication but also allowed the spread of harmful content. Although significant advances have been made in detecting toxic language in textual data, the exploration of concept-based explanations in toxicity detection remains limited. In this study, we leverage various subtype attributes present in toxicity detection datasets, such as obscene, threat, insult, identity attack, and sexual explicit as concepts that serve as strong indicators to identify whether language is toxic. However, disproportionate attribution of concepts towards the target class often results in classification errors. Our work introduces an interpretability technique based on the Concept Gradient (CG) method which provides a more causal interpretation by measuring how changes in concepts directly affect the output of the model. This is an extension of traditional gradient-based methods in machine learning, which often focus solely on input features. We propose the curation of Targeted Lexicon Set, which captures toxic words that contribute to misclassifications in text classification models. To assess the significance of these lexicon sets in misclassification, we compute Word-Concept Alignment (WCA) scores, which quantify the extent to which these words lead to errors due to over-attribution to toxic concepts. Finally, we introduce a lexicon-free augmentation strategy by generating toxic samples that exclude predefined toxic lexicon sets. This approach allows us to examine whether over-attribution persists when explicit lexical overlap is removed, providing insights into the model’s attribution on broader toxic language patterns.

[9] Falsely Accused: How AI Detectors Misjudge Slightly Polished Arabic Articles

Saleh Almohaimeed, Saad Almohaimeed, Mousa Jari, Khaled A. Alobaid, Fahad Alotaibi

Main category: cs.CL

TL;DR: AI detectors struggle with slightly polished human-written Arabic text, often misclassifying it as AI-generated, with performance dropping significantly across all tested models.

Details

Motivation: To address the problem where AI detectors misclassify human-authored articles that have been slightly polished by AI, potentially leading to false accusations of AI plagiarism, particularly in Arabic where this issue hasn't been studied.

Method: Created two datasets: 1) 800 Arabic articles (400 AI, 400 human) to evaluate 14 LLMs and commercial detectors, 2) Ar-APT dataset with 400 human articles polished by 10 LLMs using 4 settings (16,400 samples) to test how slight polishing affects detection.

Result: All AI detectors incorrectly attributed significant numbers of polished human articles to AI. Claude-4 Sonnet dropped from 83.51% to 57.63% for articles polished by LLaMA-3, while originality.AI plummeted from 92% to 12% accuracy for articles polished by Mistral or Gemma-3.

Conclusion: Current AI detection models are highly vulnerable to misclassifying slightly polished human-written Arabic text as AI-generated, highlighting a critical weakness in existing detection systems.

Abstract: Many AI detection models have been developed to counter the presence of articles created by artificial intelligence (AI). However, if a human-authored article is slightly polished by AI, a shift will occur in the borderline decision of these AI detection models, leading them to consider it an AI-generated article. This misclassification may result in falsely accusing authors of AI plagiarism and harm the credibility of AI detector models. In English, some efforts were made to meet this challenge, but not in Arabic. In this paper, we generated two datasets. The first dataset contains 800 Arabic articles, half AI-generated and half human-authored. We used it to evaluate 14 large language models (LLMs) and commercial AI detectors to assess their ability to distinguish between human-authored and AI-generated articles. The best 8 models were chosen to act as detectors for our primary concern, which is whether they would consider slightly polished human text as AI-generated. The second dataset, Ar-APT, contains 400 Arabic human-authored articles polished by 10 LLMs using 4 polishing settings, totaling 16,400 samples. We use it to evaluate the 8 nominated models and determine whether slight polishing will affect their performance. The results reveal that all AI detectors incorrectly attribute a significant number of articles to AI. The best-performing LLM, Claude-4 Sonnet, achieved 83.51%, but its performance decreased to 57.63% for articles slightly polished by LLaMA-3. The best-performing commercial model, originality.AI, achieved 92% accuracy, which dropped to 12% for articles slightly polished by Mistral or Gemma-3.

[10] Reproducibility Report: Test-Time Training on Nearest Neighbors for Large Language Models

Boyang Zhou, Johan Lindqvist, Lindsey Li

Main category: cs.CL

TL;DR: Reproduction study confirms test-time training with nearest neighbors reduces perplexity across multiple LLMs, with largest gains on specialized datasets and models not pretrained on similar data.

Details

Motivation: To validate and reproduce the central claims of Test-Time Training on Nearest Neighbors for LLMs, examining its effectiveness across different model architectures and practical implementation challenges.

Method: Used pretrained RoBERTa embeddings with Faiss indexing to retrieve 20 neighbors per test input, applied one gradient update per neighbor across GPT-2 (117M, 774M), GPT-Neo (1.3B), and R1-Distilled-Qwen2.5-1.5B models.

Result: Test-time training significantly reduces perplexity and bits-per-byte metrics across diverse domains, with largest improvements on structured datasets like GitHub and EuroParl. Smaller models benefit more, approaching larger model performance.

Conclusion: Results support the robustness and generality of nearest-neighbor test-time training while highlighting practical implementation considerations for large-scale retrieval-augmented adaptation.

Abstract: We reproduce the central claims of Test-Time Training on Nearest Neighbors for Large Language Models (Hardt and Sun, 2024), which proposes adapting a language model at inference time by fine-tuning on retrieved nearest-neighbor sequences. Using pretrained RoBERTa embeddings indexed with Faiss, we retrieve 20 neighbors per test input and apply one gradient update per neighbor across GPT-2 (117M, 774M), GPT-Neo (1.3B), and R1-Distilled-Qwen2.5-1.5B. Our experiments confirm that test-time training significantly reduces perplexity and bits-per-byte metrics across diverse domains from The Pile, with the largest improvements in structured or specialized datasets such as GitHub and EuroParl. We further validate that models not pretrained on The Pile benefit more from this adaptation than models already trained on similar data, allowing smaller models to approach the performance of larger ones. Due to infrastructure limitations, we introduce a memory-efficient retrieval implementation that loads only required line offsets rather than entire files, reducing RAM requirements from over 128 GB per server to 32 GB. We also extend the original study by evaluating R1-Distilled-Qwen2.5-1.5B, showing that test-time training yields consistent gains even for modern reasoning-optimized architectures. Overall, our results support the robustness and generality of nearest-neighbor test-time training while highlighting practical considerations for reproducing large-scale retrieval-augmented adaptation.
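The retrieve-then-adapt loop described above can be illustrated on a toy problem. This sketch is not the reproduced pipeline: it swaps the language model for a linear regressor, RoBERTa embeddings for raw feature vectors, and Faiss for brute-force NumPy search. The function name `adapt_on_neighbors`, the learning rate, and the nearest-first update order are assumptions for illustration.

```python
import numpy as np


def adapt_on_neighbors(w_base, x_test, index_x, index_y, k=20, lr=0.01):
    """Adapt a copy of a linear model on the k nearest neighbors of x_test.

    One gradient step of squared error is taken per retrieved neighbor,
    mirroring the "one gradient update per neighbor" recipe on a toy model.
    """
    # Retrieval: brute-force L2 search over the index (stand-in for Faiss).
    neighbors = np.linalg.norm(index_x - x_test, axis=1).argsort()[:k]
    w = w_base.copy()  # the base weights stay untouched for the next test input
    for i in neighbors:
        err = w @ index_x[i] - index_y[i]
        w -= lr * 2.0 * err * index_x[i]  # gradient of (w.x - y)^2 with respect to w
    return w
```

Because adaptation starts from a fresh copy of the base weights for every test input, updates made for one input do not leak into the next.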

[11] How Language Directions Align with Token Geometry in Multilingual LLMs

JaeSeong Kim, Suan Lee

Main category: cs.CL

TL;DR: Multilingual LLMs rapidly separate language information in the first transformer layer and maintain linear separability throughout, with language alignment strongly influenced by training data composition.

Details

Motivation: To systematically analyze how language information is structured in multilingual LLMs' internal representations and how it emerges across layers, given limited existing research on this topic.

Method: Comprehensive probing study on six multilingual LLMs covering 268 transformer layers using linear/nonlinear probes and Token-Language Alignment analysis to quantify layer-wise dynamics and geometric structure of language encoding.

Result: Language information becomes sharply separated in first transformer block (+76.4±8.2 pp from Layer 0 to 1), remains linearly separable throughout depth. Chinese-inclusive models achieve 16.43% ZH Match@Peak vs 3.90% for English-centric models, showing 4.21× structural imprinting effect.

Conclusion: Multilingual LLMs distinguish languages through latent representational structures shaped by training corpus, not surface script features. Findings provide insights for data composition strategies and fairness in multilingual representation learning.

Abstract: Multilingual LLMs demonstrate strong performance across diverse languages, yet there has been limited systematic analysis of how language information is structured within their internal representation space and how it emerges across layers. We conduct a comprehensive probing study on six multilingual LLMs, covering all 268 transformer layers, using linear and nonlinear probes together with a new Token–Language Alignment analysis to quantify the layer-wise dynamics and geometric structure of language encoding. Our results show that language information becomes sharply separated in the first transformer block (+76.4±8.2 percentage points from Layer 0 to 1) and remains almost fully linearly separable throughout model depth. We further find that the alignment between language directions and vocabulary embeddings is strongly tied to the language composition of the training data. Notably, Chinese-inclusive models achieve a ZH Match@Peak of 16.43%, whereas English-centric models achieve only 3.90%, revealing a 4.21× structural imprinting effect. These findings indicate that multilingual LLMs distinguish languages not by surface script features but by latent representational structures shaped by the training corpus. Our analysis provides practical insights for data composition strategies and fairness in multilingual representation learning. All code and analysis scripts are publicly available at: https://github.com/thisiskorea/How-Language-Directions-Align-with-Token-Geometry-in-Multilingual-LLMs.

[12] Hierarchical Retrieval with Out-Of-Vocabulary Queries: A Case Study on SNOMED CT

Jonathon Dilworth, Hui Yang, Jiaoyan Chen, Yongsheng Gao

Main category: cs.CL

TL;DR: Proposes a language model-based ontology embedding approach for hierarchical concept retrieval from SNOMED CT, specifically addressing out-of-vocabulary (OOV) queries that have no direct matches in the ontology.

Details

Motivation: Knowledge retrieval in SNOMED CT is challenging due to language ambiguity, synonyms, and polysemies, especially when queries are OOV (no equivalent matches in the ontology).

Method: Language model-based ontology embeddings approach for hierarchical concept retrieval, tested on constructed OOV queries annotated against SNOMED CT concepts.

Result: The proposed method outperforms baselines including SBERT and two lexical matching methods in retrieving direct subsumers and less relevant ancestors.

Conclusion: The approach is generalizable and can be extended to other ontologies beyond SNOMED CT, with code and datasets released publicly.

Abstract: SNOMED CT is a biomedical ontology with a hierarchical representation of large-scale concepts. Knowledge retrieval in SNOMED CT is critical for its application, but often proves challenging due to language ambiguity, synonyms, polysemies and so on. This problem is exacerbated when the queries are out-of-vocabulary (OOV), i.e., having no equivalent matchings in the ontology. In this work, we focus on the problem of hierarchical concept retrieval from SNOMED CT with OOV queries, and propose an approach based on language model-based ontology embeddings. For evaluation, we construct OOV queries annotated against SNOMED CT concepts, testing the retrieval of the most direct subsumers and their less relevant ancestors. We find that our method outperforms the baselines including SBERT and two lexical matching methods. While evaluated against SNOMED CT, the approach is generalisable and can be extended to other ontologies. We release code, tools, and evaluation datasets at https://github.com/jonathondilworth/HR-OOV.

[13] Detecting and Steering LLMs’ Empathy in Action

Juan P. Cadile

Main category: cs.CL

TL;DR: Empathy-in-action is identified as a linear direction in LLM activation space, with near-perfect detection across models but varying steering capabilities: some models maintain bidirectional control while others show asymmetric steerability.

Details

Motivation: To investigate whether empathy-in-action (willingness to sacrifice task efficiency for human needs) can be detected and steered as a linear direction in LLM activation space, and understand how safety training affects these capabilities.

Method: Used contrastive prompts from the Empathy-in-Action benchmark to test detection and steering across three LLMs (Phi-3-mini-4k, Qwen2.5-7B, Dolphin-Llama-3.1-8B) at optimal layers using activation space analysis.

Result: All models showed near-perfect detection (AUROC 0.996-1.00). Steering success varied: Qwen (65.3%) and Phi-3 (61.7%) maintained bidirectional coherence, while Dolphin showed asymmetric steerability (94.4% pro-empathy success but catastrophic failure for anti-empathy).

Conclusion: Empathy encoding emerges independent of safety training, but safety training affects steering robustness. Models vary in their detection-steering gap, with some maintaining bidirectional control while others show robustness only for empathy enhancement.

Abstract: We investigate empathy-in-action – the willingness to sacrifice task efficiency to address human needs – as a linear direction in LLM activation space. Using contrastive prompts grounded in the Empathy-in-Action (EIA) benchmark, we test detection and steering across Phi-3-mini-4k (3.8B), Qwen2.5-7B (safety-trained), and Dolphin-Llama-3.1-8B (uncensored). Detection: All models show AUROC 0.996-1.00 at optimal layers. Uncensored Dolphin matches safety-trained models, demonstrating empathy encoding emerges independent of safety training. Phi-3 probes correlate strongly with EIA behavioral scores (r=0.71, p<0.01). Cross-model probe agreement is limited (Qwen: r=-0.06, Dolphin: r=0.18), revealing architecture-specific implementations despite convergent detection. Steering: Qwen achieves 65.3% success with bidirectional control and coherence at extreme interventions. Phi-3 shows 61.7% success with similar coherence. Dolphin exhibits asymmetric steerability: 94.4% success for pro-empathy steering but catastrophic breakdown for anti-empathy (empty outputs, code artifacts). Implications: The detection-steering gap varies by model. Qwen and Phi-3 maintain bidirectional coherence; Dolphin shows robustness only for empathy enhancement. Safety training may affect steering robustness rather than preventing manipulation, though validation across more models is needed.
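The detection and steering setup described above can be illustrated with a minimal sketch. The abstract does not say how the direction is extracted, so this example assumes a difference-of-means direction over contrastive activations and uses synthetic vectors in place of real model activations. The function names, the dimensionality, and the steering strength `alpha` are all illustrative assumptions.

```python
import numpy as np

def mean_difference_direction(pos, neg):
    """Unit vector pointing from the mean 'negative' activation to the mean 'positive' one."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def steer(hidden, direction, alpha):
    """Shift activations along the direction; alpha > 0 pushes toward the positive pole."""
    return hidden + alpha * direction

# Synthetic stand-ins for activations on contrastive prompts (assumption, not real data).
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
pos = rng.normal(size=(200, dim)) + 3.0 * true_dir
neg = rng.normal(size=(200, dim)) - 3.0 * true_dir

direction = mean_difference_direction(pos, neg)
score = auroc(pos @ direction, neg @ direction)   # detection quality of the probe
steered = steer(neg, direction, alpha=6.0)        # push negatives toward the positive pole
```

In this toy setting a high AUROC only shows that the two clusters are linearly separable. As the abstract notes, good detection does not guarantee that adding the direction yields coherent model outputs.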

[14] NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation

Hossain Shaikh Saadi, Faria Alam, Mario Sanz-Guerrero, Minh Duc Bui, Manuel Mager, Katharina von der Wense

Main category: cs.CL

TL;DR: JGU Mainz’s winning system for BLP-2025 Code Generation from Bangla Instructions uses a multi-agent pipeline with code generation and debugging agents to iteratively fix programs using test feedback.

Details

Motivation: To develop an effective system for generating code from Bangla instructions that can handle errors and improve program correctness through iterative debugging.

Method: Multi-agent pipeline: code-generation agent creates initial solution, then debugger agent uses test failures and error traces to generate revised solutions iteratively.

Result: Achieved first place in the shared task with a Pass@1 score of 95.4%.

Conclusion: The multi-agent approach with iterative debugging based on test feedback is highly effective for code generation from natural language instructions.

Abstract: This paper presents JGU Mainz’s winning system for the BLP-2025 Shared Task on Code Generation from Bangla Instructions. We propose a multi-agent-based pipeline. First, a code-generation agent produces an initial solution from the input instruction. The candidate program is then executed against the provided unit tests (pytest-style, assert-based). Only the failing cases are forwarded to a debugger agent, which reruns the tests, extracts error traces, and, conditioning on the error messages, the current program, and the relevant test cases, generates a revised solution. Using this approach, our submission achieved first place in the shared task with a Pass@1 score of 95.4. We also make our code public.
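The generate, test, and debug loop described above can be sketched as follows. This is not the authors' code: the two agents are replaced by hypothetical stub functions (`stub_generate`, `stub_debug`) standing in for LLM calls, the test format is a simplified list of (arguments, expected value) pairs, and `max_rounds` is an assumed parameter.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional arguments, expected return value)

def run_tests(source: str, func_name: str, tests: List[TestCase]) -> List[str]:
    """Load the candidate program and return one message per failing test."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # a real system would sandbox untrusted code
    except Exception as exc:
        return [f"load error: {exc!r}"]
    func = namespace.get(func_name)
    if func is None:
        return [f"missing function {func_name}"]
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
            if got != expected:
                failures.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
    return failures

def solve(instruction: str, func_name: str, tests: List[TestCase],
          generate: Callable[[str], str],
          debug: Callable[[str, str, List[str]], str],
          max_rounds: int = 3) -> str:
    """Generate once, then revise using only the failing cases until tests pass."""
    program = generate(instruction)
    for _ in range(max_rounds):
        failures = run_tests(program, func_name, tests)
        if not failures:
            break
        program = debug(instruction, program, failures)
    return program

# Hypothetical stub agents standing in for LLM calls.
def stub_generate(instruction: str) -> str:
    return "def add(a, b):\n    return a - b\n"  # deliberately buggy first draft

def stub_debug(instruction: str, program: str, failures: List[str]) -> str:
    return program.replace("a - b", "a + b")

tests = [((1, 2), 3), ((0, 0), 0)]
final_program = solve("Add two numbers", "add", tests, stub_generate, stub_debug)
```

Here the first draft passes the `(0, 0)` case by accident, so only the `(1, 2)` failure reaches the debugger. This mirrors the "only the failing cases are forwarded" step in the abstract.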

[15] From Representation to Enactment: The ABC Framework of the Translating Mind

Michael Carl, Takanori Mizowaki, Aishvarya Raj, Masaru Yamada, Devi Sri Bandaru, Yuxiang Wei, Xinyue Ren

Main category: cs.CL

TL;DR: Proposes a non-representational ABC framework of translation as enacted activity integrating affective, behavioral, and cognitive processes through brain-body-environment interactions.

Details

Motivation: To provide an alternative to representation-based models of the mind by building on Extended Mind theory and radical enactivism.

Method: Develops a novel ABC framework combining Predictive Processing and (En)Active Inference, viewing translation as dynamically integrated affective, behavioral, and cognitive processes.

Result: Reframes translation as skillful participation in sociocultural practice where meaning is co-created in real time through embodied interaction with texts, tools, and contexts.

Conclusion: The translator’s mind emerges through brain-body-environment interaction loops rather than being merely extended, offering a non-representational account of translation as enacted activity.

Abstract: Building on the Extended Mind (EM) theory and radical enactivism, this article suggests an alternative to representation-based models of the mind. We lay out a novel ABC framework of the translating mind, in which translation is not the manipulation of static interlingual correspondences but an enacted activity, dynamically integrating affective, behavioral, and cognitive (ABC) processes. Drawing on Predictive Processing and (En)Active Inference, we argue that the translator’s mind emerges, rather than being merely extended, through loops of brain-body-environment interactions. This non-representational account reframes translation as skillful participation in sociocultural practice, where meaning is co-created in real time through embodied interaction with texts, tools, and contexts.

[16] Interpretable dimensions support an effect of agentivity and telicity on split intransitivity

Eva Neu, Brian Dillon, Katrin Erk

Main category: cs.CL

TL;DR: This paper revisits the relationship between intransitive verb classes (unergatives/unaccusatives) and semantic properties (agentivity/telicity), challenging recent findings that human ratings poorly predict syntactic behavior.

Details

Motivation: To re-examine the link between unergativity/unaccusativity and agentivity/telicity, countering recent work by Kim et al. (2024) that found human ratings were poor predictors of syntactic behavior.

Method: Used interpretable dimensions computed from seed words on opposite poles of the agentive and telic scales, combining them with human judgments.

Result: Findings support the link between unergativity/unaccusativity and agentivity/telicity, showing that interpretable dimensions with human judgments provide valuable evidence for semantic properties.

Conclusion: Using interpretable dimensions in conjunction with human judgments can effectively reveal semantic properties that are difficult to evaluate through rating tasks alone, confirming the theoretical link between verb syntax and semantics.

Abstract: Intransitive verbs fall into two different syntactic classes, unergatives and unaccusatives. It has long been argued that verbs describing an agentive action are more likely to appear in an unergative syntax, and those describing a telic event to appear in an unaccusative syntax. However, recent work by Kim et al. (2024) found that human ratings for agentivity and telicity were a poor predictor of the syntactic behavior of intransitives. Here we revisit this question using interpretable dimensions, computed from seed words on opposite poles of the agentive and telic scales. Our findings support the link between unergativity/unaccusativity and agentivity/telicity, and demonstrate that using interpretable dimensions in conjunction with human judgments can offer valuable evidence for semantic properties that are not easily evaluated in rating tasks.
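As an illustration of a seed-based dimension, the sketch below builds an axis from two sets of seed words and projects other verbs onto it. The toy three-dimensional vectors, the seed words, and the mean-difference construction are all assumptions for demonstration. The abstract does not list the seeds or specify the exact computation.

```python
import numpy as np

# Hypothetical toy embeddings; real work would use pretrained word vectors.
toy_vectors = {
    "work":   np.array([0.9, 0.1, 0.0]),
    "dance":  np.array([0.8, 0.2, 0.1]),
    "exist":  np.array([-0.7, 0.1, 0.0]),
    "remain": np.array([-0.9, 0.0, 0.1]),
    "run":    np.array([0.6, 0.3, 0.2]),
    "fall":   np.array([-0.5, 0.4, 0.1]),
}

def seed_dimension(vectors, high_seeds, low_seeds):
    """Unit vector from the mean of the low-pole seeds to the mean of the high-pole seeds."""
    high = np.mean([vectors[w] for w in high_seeds], axis=0)
    low = np.mean([vectors[w] for w in low_seeds], axis=0)
    d = high - low
    return d / np.linalg.norm(d)

def project(vectors, word, dimension):
    """Scalar position of a word along the dimension."""
    return float(vectors[word] @ dimension)

# Assumed seeds for an agentivity scale (illustrative only).
agentivity = seed_dimension(toy_vectors, ["work", "dance"], ["exist", "remain"])
run_score = project(toy_vectors, "run", agentivity)
fall_score = project(toy_vectors, "fall", agentivity)
```

With these toy vectors, "run" lands on the agentive side of the axis and "fall" on the non-agentive side. This is the kind of graded score that could then be compared with human ratings.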

[17] PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models

Oscar Chew, Po-Yi Lu, Jayden Lin, Kuan-Hao Huang, Hsuan-Tien Lin

Main category: cs.CL

TL;DR: PEPPER is a backdoor defense method for text-to-image diffusion models that rewrites input captions to disrupt triggers while maintaining visual similarity, achieving enhanced robustness against attacks.

Details

Motivation: Text-to-image diffusion models are vulnerable to backdoor attacks where triggers in input prompts can steer generation toward harmful content, requiring effective defense mechanisms.

Method: PEPPER rewrites captions into semantically distant but visually similar versions while adding unobtrusive elements to disrupt triggers and dilute trigger token influence.

Result: Experiments show PEPPER is particularly effective against text encoder based attacks, substantially reducing attack success while preserving generation quality, and can be paired with other defenses for stronger robustness.

Conclusion: PEPPER provides an effective defense against backdoor attacks in T2I models, offering enhanced robustness that can be combined with existing methods for even stronger protection.

Abstract: Recent studies show that text-to-image (T2I) diffusion models are vulnerable to backdoor attacks, where a trigger in the input prompt can steer generation toward harmful or unintended content. To address this, we introduce PEPPER (PErcePtion Guided PERturbation), a backdoor defense that rewrites the caption into a semantically distant yet visually similar caption while adding unobtrusive elements. With this rewriting strategy, PEPPER disrupts the trigger embedded in the input prompt, dilutes the influence of trigger tokens, and thereby achieves enhanced robustness. Experiments show that PEPPER is particularly effective against text-encoder-based attacks, substantially reducing attack success while preserving generation quality. Beyond this, PEPPER can be paired with any existing defense, yielding consistently stronger and more generalizable robustness than any standalone method. Our code will be released on GitHub.

[18] ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers

Seyed Mohssen Ghafari, Ronny Kol, Juan C. Quiroz, Nella Luan, Monika Patial, Chanaka Rupasinghe, Herman Wandabwa, Luiz Pizzato

Main category: cs.CL

TL;DR: Proposes a reference-free metric to evaluate LLM response conciseness by measuring redundancy through compression ratios and word removal, without needing human annotations.

Details

Motivation: LLMs often generate verbose responses with unnecessary details, which reduces clarity and user satisfaction and raises costs for developers, especially with proprietary models that charge per output token.

Method: Uses three calculations: compression ratio between original response and LLM abstractive summary, compression ratio between original response and LLM extractive summary, and word removal compression where LLM removes non-essential words while preserving meaning.

Result: Experimental results show the metric effectively identifies redundancy in LLM outputs and provides a practical automated evaluation tool for response brevity.

Conclusion: The proposed reference-free metric offers an effective way to evaluate LLM response conciseness without requiring ground truth human annotations, benefiting conversational AI systems.

Abstract: Large language models (LLMs) frequently generate responses that are lengthy and verbose, filled with redundant or unnecessary details. This diminishes clarity and user satisfaction, and it increases costs for model developers, especially with well-known proprietary models that charge based on the number of output tokens. In this paper, we introduce a novel reference-free metric for evaluating the conciseness of responses generated by LLMs. Our method quantifies non-essential content without relying on gold standard references and calculates the average of three calculations: i) a compression ratio between the original response and an LLM abstractive summary; ii) a compression ratio between the original response and an LLM extractive summary; and iii) word-removal compression, where an LLM removes as many non-essential words as possible from the response while preserving its meaning, with the number of tokens removed indicating the conciseness score. Experimental results demonstrate that our proposed metric identifies redundancy in LLM outputs, offering a practical tool for automated evaluation of response brevity in conversational AI systems without the need for ground truth human annotations.
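A rough sketch of the three-part average follows. The abstract does not give the exact formulas, so this example assumes each component is the fraction of whitespace tokens retained after shortening (higher means the original was already concise). The three shortened texts are passed in as plain strings, standing in for LLM outputs. All function names are hypothetical.

```python
def token_count(text: str) -> int:
    """Whitespace token count (a stand-in for a real tokenizer)."""
    return len(text.split())

def retained_fraction(original: str, shortened: str) -> float:
    """Share of the original's tokens that survive shortening, capped at 1.0."""
    n = token_count(original)
    if n == 0:
        return 1.0
    return min(1.0, token_count(shortened) / n)

def conciseness_score(original: str, abstractive: str, extractive: str, pruned: str) -> float:
    """Average of the three retained fractions (assumed scoring, see lead-in)."""
    parts = [retained_fraction(original, s) for s in (abstractive, extractive, pruned)]
    return sum(parts) / len(parts)

response = "To be clear the capital of France is actually Paris"   # 10 tokens
abstractive = "France's capital is Paris"                           # 4 tokens
extractive = "the capital of France is Paris"                       # 6 tokens
pruned = "capital of France is Paris"                               # 5 tokens

verbose_score = conciseness_score(response, abstractive, extractive, pruned)
tight_score = conciseness_score(extractive, extractive, extractive, extractive)
```

Under this assumed scoring, a response that cannot be shortened scores 1.0, and padding lowers the score. The published metric may normalize or orient its components differently.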

[19] Improving Latent Reasoning in LLMs via Soft Concept Mixing

Kang Wang, Xiangyu Duan, Tianyi Du

Main category: cs.CL

TL;DR: Soft Concept Mixing (SCM) is a training method that bridges the gap between LLMs’ discrete token training and soft concept reasoning by mixing soft concept vectors into hidden states and optimizing with RL, improving reasoning performance.

Details

Motivation: LLMs reason with discrete tokens while human reasoning uses abstract conceptual spaces. This gap limits LLMs' expressive power, and Soft Thinking showed potential but LLMs are trained on discrete tokens.

Method: SCM constructs soft concept vectors as probability-weighted averages of embeddings, mixes them into model’s hidden states, and optimizes the entire latent reasoning process using Reinforcement Learning.

Result: Experiments on five reasoning benchmarks show SCM improves LLMs’ reasoning performance while maintaining stable training dynamics.

Conclusion: SCM effectively bridges the training-reasoning gap by exposing models to soft representations during training, enhancing reasoning capabilities without compromising training stability.

Abstract: Unlike human reasoning in abstract conceptual spaces, large language models (LLMs) typically reason by generating discrete tokens, which potentially limits their expressive power. The recent work Soft Thinking has shown that LLMs’ latent reasoning via soft concepts is a promising direction, but LLMs are trained on discrete tokens. To reduce this gap between the soft concepts in reasoning and the discrete tokens in training, we propose Soft Concept Mixing (SCM), a soft concept aware training scheme that directly exposes the model to soft representations during training. Specifically, SCM constructs a soft concept vector by forming a probability-weighted average of embeddings. Then, this vector is mixed into the model’s hidden states, which embody rich contextual information. Finally, the entire latent reasoning process is optimized with Reinforcement Learning (RL). Experiments on five reasoning benchmarks demonstrate that SCM improves the reasoning performance of LLMs, and simultaneously maintains a stable training dynamic.
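The soft concept vector described above, a probability-weighted average of embeddings, can be sketched in a few lines. The mixing rule here is an assumption: the abstract only says the vector is "mixed into" the hidden states, so this example uses a simple convex combination with a hypothetical coefficient `lam`. The RL optimization step is omitted.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def soft_concept_vector(logits, embeddings):
    """Probability-weighted average of token embeddings (rows of `embeddings`)."""
    probs = softmax(logits)
    return probs @ embeddings

def mix_into_hidden(hidden, concept, lam=0.5):
    """Assumed mixing rule: convex combination of hidden state and concept vector."""
    return (1.0 - lam) * hidden + lam * concept

# Toy vocabulary of three tokens with 2-d embeddings (illustrative values).
embeddings = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
uniform_logits = np.array([0.0, 0.0, 0.0])
concept = soft_concept_vector(uniform_logits, embeddings)   # mean of the rows
mixed = mix_into_hidden(np.array([0.0, 0.0]), concept, lam=0.5)

# A sharply peaked distribution recovers (almost exactly) a single token embedding.
peaked = soft_concept_vector(np.array([50.0, 0.0, 0.0]), embeddings)
```

The peaked case shows why a soft concept generalizes discrete decoding: as the distribution concentrates on one token, the soft vector approaches that token's embedding.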

[20] Deep Improvement Supervision

Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

Main category: cs.CL

TL;DR: Proposes a novel training scheme for Tiny Recursive Models that improves efficiency by 18x, eliminates halting mechanisms, and achieves 24% accuracy on ARC-1 with only 0.8M parameters.

Details

Motivation: To improve the efficiency of small, looped architectures like Tiny Recursive Models (TRMs) that can outperform Large Language Models on complex reasoning tasks, with minimal changes to the existing methods.

Method: Frames TRM latent reasoning as classifier-free guidance and implicit policy improvement, then introduces a training scheme that provides targets for each loop during training.

Result: Achieves 18x reduction in forward passes, eliminates halting mechanisms while maintaining quality comparable to standard TRMs, and reaches 24% accuracy on ARC-1 with only 0.8M parameters.

Conclusion: The proposed training scheme significantly enhances TRM efficiency while maintaining performance, demonstrating that small models can outperform most LLMs on complex reasoning tasks with proper training methodology.

Abstract: Recently, it was shown that small, looped architectures, such as Tiny Recursive Models (TRMs), can outperform Large Language Models (LLMs) on complex reasoning tasks, including the Abstraction and Reasoning Corpus (ARC). In this work, we investigate a core question: how can we further improve the efficiency of these methods with minimal changes? To address this, we frame the latent reasoning of TRMs as a form of classifier-free guidance and implicit policy improvement algorithm. Building on these insights, we propose a novel training scheme that provides a target for each loop during training. We demonstrate that our approach significantly enhances training efficiency. Our method reduces the total number of forward passes by 18x and eliminates halting mechanisms, while maintaining quality comparable to standard TRMs. Notably, we achieve 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most LLMs.

[21] Predicting the Formation of Induction Heads

Tatsuya Aoyama, Ethan Gotlieb Wilcox, Nathan Schneider

Main category: cs.CL

TL;DR: The paper investigates how statistical properties of training data affect the formation of induction heads (IHs) in language models, identifying key factors like batch size, context size, bigram repetition frequency, and reliability.

Details

Motivation: To understand the precise relationship between training data statistics and the formation of induction heads, which are crucial for in-context learning capabilities in language models.

Method: Analyzed the relationship between statistical properties of training data (natural and synthetic) and IH formation, examining batch size, context size, bigram repetition frequency, and reliability.

Result: Found that: (1) batch size and context size predict IH formation point; (2) bigram repetition frequency and reliability strongly affect IH formation with a precise Pareto frontier; (3) local dependency with high frequency/reliability is sufficient, but with low values, categoriality and marginal distribution shape matter.

Conclusion: Statistical properties of training data, particularly bigram repetition patterns and reliability, play a crucial role in the formation of induction heads that enable in-context learning in language models.

Abstract: Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet, a precise characterization of their formation remains unclear. In this study, we investigate the relationship between statistical properties of training data (for both natural and synthetic data) and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find a precise Pareto frontier in terms of these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.

[22] ARQUSUMM: Argument-aware Quantitative Summarization of Online Conversations

An Quang Tang, Xiuzhen Zhang, Minh Ngoc Dinh, Zhuang Li

Main category: cs.CL

TL;DR: Proposes ARQUSUMM framework for argument-aware quantitative summarization that reveals claim-reason structures in online conversations with argument strength measurements.

Details

Motivation: Existing conversation summarization overlooks deeper argument structures within sentences and fails to quantify argument strength, which is crucial for understanding controversial topics in online discussions.

Method: Uses LLM few-shot learning grounded in argumentation theory to identify propositions and claim-reason relationships, then applies argument structure-aware clustering to aggregate arguments and quantify support.

Result: ARQUSUMM outperforms existing conversation and quantitative summarization models, generating summaries with better argument structure representation, higher textual quality, and more accurate quantification.

Conclusion: The proposed framework successfully addresses the limitations of existing methods by capturing both argument structures and their quantitative strength, providing more helpful summaries for online conversation analysis.

Abstract: Online conversations have become more prevalent on public discussion platforms (e.g. Reddit). With growing controversial topics, it is desirable to summarize not only diverse arguments, but also their rationale and justification. Early studies on text summarization focus on capturing general salient information in source documents, overlooking the argumentative nature of online conversations. Recent research on conversation summarization, although it considers the argumentative relationships among sentences, fails to explicate the deeper argument structure within sentences. In this paper, we propose a novel task of argument-aware quantitative summarization to reveal the claim-reason structure of arguments in conversations, with quantities measuring argument strength. We further propose ARQUSUMM, a novel framework to address the task. To reveal the underlying argument structure within sentences, ARQUSUMM leverages LLM few-shot learning grounded in argumentation theory to identify propositions within sentences and their claim-reason relationships. For quantitative summarization, ARQUSUMM employs argument structure-aware clustering algorithms to aggregate arguments and quantify their support. Experiments show that ARQUSUMM outperforms existing conversation and quantitative summarization models and generates summaries that represent argument structures, are more helpful to users, and have high textual quality and quantification accuracy.

[23] Supervised Fine Tuning of Large Language Models for Domain Specific Knowledge Graph Construction: A Case Study on Hunan’s Historical Celebrities

Junjie Hao, Chun Wang, Ying Qiao, Qiuyue Zuo, Qiya Song, Hua Ma, Xieping Gao

Main category: cs.CL

TL;DR: This study proposes a supervised fine-tuning approach to enhance domain-specific information extraction for Hunan’s historical celebrities, achieving significant performance improvements in cultural heritage knowledge extraction and knowledge graph construction.

Details

Motivation: To address the scarcity of systematic data resources for Hunan's historical celebrities and the underperformance of general-purpose models in domain knowledge extraction in low-resource settings.

Method: Design a fine-grained, schema-guided instruction template and build an instruction-tuning dataset, then apply parameter-efficient instruction fine-tuning to four large language models (Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct).

Result: All models show substantial performance gains after fine-tuning, with Qwen3-8B achieving the best results (score of 89.3866 with 100 samples and 50 training iterations).

Conclusion: The study provides new insights into fine-tuning vertical large language models for regional historical and cultural domains and demonstrates their potential for cost-effective applications in cultural heritage knowledge extraction and knowledge graph construction.

Abstract: Large language models and knowledge graphs offer strong potential for advancing research on historical culture by supporting the extraction, analysis, and interpretation of cultural heritage. Using Hunan’s modern historical celebrities shaped by Huxiang culture as a case study, pre-trained large models can help researchers efficiently extract key information, including biographical attributes, life events, and social relationships, from textual sources and construct structured knowledge graphs. However, systematic data resources for Hunan’s historical celebrities remain limited, and general-purpose models often underperform in domain knowledge extraction and structured output generation in such low-resource settings. To address these issues, this study proposes a supervised fine-tuning approach for enhancing domain-specific information extraction. First, we design a fine-grained, schema-guided instruction template tailored to the Hunan historical celebrities domain and build an instruction-tuning dataset to mitigate the lack of domain-specific training corpora. Second, we apply parameter-efficient instruction fine-tuning to four publicly available large language models - Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct - and develop evaluation criteria for assessing their extraction performance. Experimental results show that all models exhibit substantial performance gains after fine-tuning. Among them, Qwen3-8B achieves the strongest results, reaching a score of 89.3866 with 100 samples and 50 training iterations. This study provides new insights into fine-tuning vertical large language models for regional historical and cultural domains and highlights their potential for cost-effective applications in cultural heritage knowledge extraction and knowledge graph construction.

[24] Do Vision-Language Models Understand Visual Persuasiveness?

Gyuwon Park

Main category: cs.CL

TL;DR: VLMs struggle with visual persuasion understanding, showing recall bias and weak low/mid-level feature discrimination, with semantic alignment being the strongest predictor of human judgment.

Details

Motivation: To investigate whether vision-language models truly understand how visual cues shape human attitudes and decisions (visual persuasion), which remains unclear despite their impressive multi-modal reasoning capabilities.

Method: Constructed a high-consensus dataset for binary persuasiveness judgment, introduced Visual Persuasive Factors taxonomy (low-level perceptual, mid-level compositional, high-level semantic cues), and explored cognitive steering and knowledge injection strategies for persuasion-relevant reasoning.

Result: VLMs exhibit recall-oriented bias (over-predicting high persuasiveness) and weak discriminative power for low/mid-level features. High-level semantic alignment between message and object presence is the strongest predictor of human judgment. Simple instruction or unguided reasoning scaffolds yield marginal/negative effects, while object-grounded rationales significantly improve precision and F1 scores.

Conclusion: VLMs’ core limitation lies not in recognizing persuasive objects but in linking them to communicative intent, suggesting they lack deeper understanding of how visual elements serve persuasive purposes.

Abstract: Recent advances in vision-language models (VLMs) have enabled impressive multi-modal reasoning and understanding. Yet, whether these models truly grasp visual persuasion, that is, how visual cues shape human attitudes and decisions, remains unclear. To probe this question, we construct a high-consensus dataset for binary persuasiveness judgment and introduce the taxonomy of Visual Persuasive Factors (VPFs), encompassing low-level perceptual, mid-level compositional, and high-level semantic cues. We also explore cognitive steering and knowledge injection strategies for persuasion-relevant reasoning. Empirical analysis across VLMs reveals a recall-oriented bias (models over-predict high persuasiveness) and weak discriminative power for low/mid-level features. In contrast, high-level semantic alignment between message and object presence emerges as the strongest predictor of human judgment. Among intervention strategies, simple instruction or unguided reasoning scaffolds yield marginal or negative effects, whereas concise, object-grounded rationales significantly improve precision and F1 scores. These results indicate that VLMs’ core limitation lies not in recognizing persuasive objects but in linking them to communicative intent.

[25] Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments

Yunsung Kim, Mike Hardy, Joseph Tey, Candace Thille, Chris Piech

Main category: cs.CL

TL;DR: The paper presents AnalyticScore, an interpretable automated scoring framework for short answers that implements four interpretability principles (FGTI) and achieves competitive accuracy while maintaining transparency.

Details

Motivation: Address the lack of widely accepted interpretable automated scoring systems for large-scale assessments, despite increasing demand for transparency and interpretability in AI-driven evaluation.

Method: Developed AnalyticScore framework with three steps: (1) extract identifiable response elements, (2) featurize responses into human-interpretable values using LLMs, (3) apply ordinal logistic regression for scoring. Based on four interpretability principles: Faithfulness, Groundedness, Traceability, and Interchangeability.

Result: Outperforms many uninterpretable scoring methods and is within only 0.06 QWK of uninterpretable SOTA on average across 10 ASAP-SAS items. Featurization behavior aligns well with human annotators conducting the same task.

Conclusion: Demonstrates feasibility of implementing interpretable automated scoring principles while maintaining competitive accuracy, providing a baseline reference framework for future research in interpretable assessment systems.

Abstract: AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability – Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) – targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. AnalyticScore operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression model for scoring. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
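The "0.06 QWK" gap above refers to quadratic weighted kappa, a standard agreement statistic for ordinal scores. The sketch below computes it from scratch with NumPy to show what the number measures. It is a generic implementation of the textbook formula, not code from the paper, and the example labels are made up.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, num_labels):
    """Agreement between two integer ratings in 0..num_labels-1, penalizing
    disagreements by the squared distance between the two labels."""
    a = np.asarray(a)
    b = np.asarray(b)
    observed = np.zeros((num_labels, num_labels))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # Expected counts if the two raters were independent with the same marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(a)
    idx = np.arange(num_labels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_labels - 1) ** 2
    denom = (weights * expected).sum()
    if denom == 0:  # both raters gave one identical constant label
        return 1.0
    return 1.0 - (weights * observed).sum() / denom

human = [0, 1, 2, 3, 2, 1]
perfect = quadratic_weighted_kappa(human, human, num_labels=4)
reversed_scores = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], num_labels=3)
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement. A 0.06 difference is therefore a small gap on a scale of roughly 0 to 1 for useful scorers.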

[26] MUCH: A Multilingual Claim Hallucination Benchmark

Jérémie Dentan, Alexi Canesse, Davide Buscaldi, Aymen Shabou, Sonia Vanier

Main category: cs.CL

TL;DR: MUCH is the first claim-level uncertainty quantification benchmark for LLMs that includes 4,873 samples across 4 languages and 4 LLMs, with released generation logits and a fast deterministic segmentation algorithm.

Details

Motivation: To address the lack of reliability in Large Language Models and enable fair evaluation of uncertainty quantification methods under realistic deployment conditions.

Method: Created a benchmark with 4,873 samples across English, French, Spanish, and German using 4 instruction-tuned LLMs, released 24 generation logits per token, and developed a deterministic segmentation algorithm that uses only 0.2% of LLM generation time.

Result: The benchmark enables reproducible evaluation of UQ methods, and current methods show substantial room for improvement in both performance and efficiency.

Conclusion: MUCH provides a comprehensive benchmark for claim-level uncertainty quantification that supports realistic evaluation and real-time monitoring of LLM outputs, highlighting the need for better UQ methods.

Abstract: Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigate the lack of reliability in Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It includes 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm capable of segmenting claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs, ensuring that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.
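
To illustrate what released per-token logits make possible, here is a minimal sketch of a generic white-box baseline: token entropy over the available logits, averaged per claim. It is not one of the methods evaluated in MUCH. The function names and the `(start, end)` span format are assumptions, and the entropy is only an approximation because it renormalizes over a truncated set of logits.

```python
import math


def truncated_entropy(logits):
    """Shannon entropy (in nats) of the softmax over the given logits only."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def claim_scores(token_logits, claim_spans):
    """Mean token entropy for each claim span [start, end) of token indices."""
    entropies = [truncated_entropy(row) for row in token_logits]
    scores = []
    for start, end in claim_spans:
        if end <= start:
            raise ValueError("claim spans must be non-empty")
        scores.append(sum(entropies[start:end]) / (end - start))
    return scores
```

A higher score marks a claim generated with less certainty; the spans would come from whatever segmentation step precedes scoring.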

[27] Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Rishi Iyer, Vasu Shyam, Anna Golubeva, Ansh Chaurasia, Xiao Yang, Tomas Figliolia, Robert Washbourne, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge

Main category: cs.CL

TL;DR: First large-scale MoE pretraining on AMD hardware (MI300X GPUs with Pollara interconnect), providing system characterization and model design guidance, with ZAYA1-base model achieving competitive performance.

Details

Motivation: To demonstrate the maturity and competitiveness of AMD hardware, network, and software stack for large-scale pretraining, and provide practical guidance for systems and model design on this platform.

Method: Conducted comprehensive cluster and networking characterization with microbenchmarks for core collectives on Pollara, developed MI300X-aware transformer sizing rules, designed MoE architecture (ZAYA1: 760M active, 8.3B total parameters), and implemented full training stack with fault-tolerance and checkpoint-reshaping utilities.

Result: ZAYA1-base achieves performance comparable to Qwen3-4B and Gemma3-12B, and outperforms Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks.

Conclusion: AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining, with the developed methodology and model showing strong performance results.

Abstract: We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing MI300X GPUs with the Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.

[28] Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation

Yeqin Zhang, Yizheng Zhao, Chen Hu, Binxing Jiao, Daxin Jiang, Ruihang Miao, Cam-Tu Nguyen

Main category: cs.CL

TL;DR: LLM2Comp uses context compression as a pretext task to adapt LLMs for text representation, outperforming token-level methods like LLM2Vec and achieving better performance with less training data.

Details

Motivation: Most LLMs are causal and optimized for next-token prediction, making them suboptimal for holistic text representation. Existing adaptation methods rely on token-level objectives, which may not fully leverage LLMs' capabilities for representation learning.

Method: Proposes context compression as a pretext task where the model learns to generate compact memory tokens that substitute the entire context for downstream sequence prediction. Combines this with contrastive learning for further improvements.

Result: LLM2Comp significantly outperforms contemporary LLM-based text encoders across various tasks while being more sample-efficient, requiring significantly less training data compared to token-level pretext tasks.

Conclusion: Context compression is an effective pretext task for adapting LLMs to text representation, offering superior performance and efficiency over token-level approaches like masked next-token prediction.

Abstract: Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most of the LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as the masked next-token prediction (MNTP) used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute the whole context for downstream sequence prediction. Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations, outperforming models trained with token-level pretext tasks. Further improvements through contrastive learning produce a strong representation model (LLM2Comp) that outperforms contemporary LLM-based text encoders on a wide range of tasks while being more sample-efficient, requiring significantly less training data.

[29] LangMark: A Multilingual Dataset for Automatic Post-Editing

Diego Velazquez, Mikaela Grace, Konstantinos Karageorgos, Lawrence Carin, Aaron Schliem, Dimitrios Zaikis, Roger Wechsler

Main category: cs.CL

TL;DR: LangMark is a new multilingual APE dataset for English to 7 languages, enabling LLMs to effectively improve machine translation quality through few-shot prompting.

Details

Motivation: To address the lack of large-scale multilingual datasets specifically tailored to NMT outputs for automatic post-editing development.

Method: Created LangMark dataset with 206,983 triplets (source, NMT output, human post-edit) for 7 languages, annotated by expert linguists, and tested LLMs with few-shot prompting.

Result: LLMs with few-shot prompting can effectively perform APE, improving upon leading commercial and proprietary machine translation systems.

Conclusion: The LangMark dataset facilitates future APE system development and evaluation, demonstrating LLMs’ effectiveness in improving translation quality.

Abstract: Automatic post-editing (APE) aims to correct errors in machine-translated text, enhancing translation quality, while reducing the need for human intervention. Despite advances in neural machine translation (NMT), the development of effective APE systems has been hindered by the lack of large-scale multilingual datasets specifically tailored to NMT outputs. To address this gap, we present and release LangMark, a new human-annotated multilingual APE dataset for English translation to seven languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish. The dataset has 206,983 triplets, with each triplet consisting of a source segment, its NMT output, and a human post-edited translation. Annotated by expert human linguists, our dataset offers both linguistic diversity and scale. Leveraging this dataset, we empirically show that Large Language Models (LLMs) with few-shot prompting can effectively perform APE, improving upon leading commercial and even proprietary machine translation systems. We believe that this new resource will facilitate the future development and evaluation of APE systems.

[30] The PLLuM Instruction Corpus

Piotr Pęzik, Filip Żarnecki, Konrad Kaczyński, Anna Cichosz, Zuzanna Deckert, Monika Garnys, Izabela Grabarczyk, Wojciech Janowski, Sylwia Karasińska, Aleksandra Kujawiak, Piotr Misztela, Maria Szymańska, Karolina Walkusz, Igor Siek, Maciej Chrabąszcz, Anna Kołos, Agnieszka Karlińska, Karolina Seweryn, Aleksandra Krasnodębska, Paula Betscher, Zofia Cieślińska, Katarzyna Kowol, Artur Wilczek, Maciej Trzciński, Katarzyna Dziewulska, Roman Roszko, Tomasz Bernaś, Jurgita Vaičenonienė, Danuta Roszko, Paweł Levchuk, Paweł Kowalski, Irena Prawdzic-Jankowska, Marek Kozłowski, Sławomir Dadas, Rafał Poświata, Alina Wróblewska, Katarzyna Krasnowska-Kieraś, Maciej Ogrodniczuk, Michał Rudolf, Piotr Rybak, Karolina Saputa, Joanna Wołoszyn, Marcin Oleksy, Bartłomiej Koptyra, Teddy Ferdinan, Stanisław Woźniak, Maciej Piasecki, Paweł Walkowiak, Konrad Wojtasik, Arkadiusz Janz, Przemysław Kazienko, Julia Moska, Jan Kocoń

Main category: cs.CL

TL;DR: The paper presents the instruction dataset used to fine-tune Polish LLMs in the PLLuM project, including a typology of instruction types and observations about human vs synthetic data.

Details

Motivation: To document and share the instruction dataset development process for the PLLuM project to guide similar LLM development efforts for other languages.

Method: Created a functional typology categorizing organic, converted, and synthetic instructions, and analyzed implications of human-authored vs synthetic datasets.

Result: Released the first representative subset of the PLLuM instruction corpus (PLLuMIC) as a resource for other LLM developers.

Conclusion: The PLLuM instruction corpus provides valuable insights and practical guidance for developing instruction datasets for language-specific LLM adaptation.

Abstract: This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.

[31] Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models

Vy Nguyen, Ziqi Xu, Jeffrey Chan, Estrid He, Feng Xia, Xiuzhen Zhang

Main category: cs.CL

TL;DR: ABCA is a new framework for early abstention in LLMs that analyzes internal knowledge diversity through causal inference to prevent hallucinations before generation.

Details

Motivation: Existing abstention methods rely on post-generation signals, limiting their ability to prevent unreliable responses in advance. LLMs often produce fluent but factually incorrect responses (hallucinations).

Method: Aspect-Based Causal Abstention (ABCA) analyzes internal diversity of LLM knowledge through causal inference. It estimates causal effects conditioned on knowledge aspects (disciplines, legal contexts, temporal frames) to assess reliability.

Result: Experiments on standard benchmarks show ABCA improves abstention reliability, achieves state-of-the-art performance, and enhances interpretability of abstention decisions.

Conclusion: ABCA enables early abstention by detecting knowledge conflicts (Type-1) and knowledge insufficiency (Type-2) through aspect-based causal analysis, providing a more reliable safeguard against LLM hallucinations.

Abstract: Large Language Models (LLMs) often produce fluent but factually incorrect responses, a phenomenon known as hallucination. Abstention, where the model chooses not to answer and instead outputs phrases such as “I don’t know”, is a common safeguard. However, existing abstention methods typically rely on post-generation signals, such as generation variations or feedback, which limits their ability to prevent unreliable responses in advance. In this paper, we introduce Aspect-Based Causal Abstention (ABCA), a new framework that enables early abstention by analysing the internal diversity of LLM knowledge through causal inference. This diversity reflects the multifaceted nature of parametric knowledge acquired from various sources, representing diverse aspects such as disciplines, legal contexts, or temporal frames. ABCA estimates causal effects conditioned on these aspects to assess the reliability of knowledge relevant to a given query. Based on these estimates, we enable two types of abstention: Type-1, where aspect effects are inconsistent (knowledge conflict), and Type-2, where aspect effects consistently support abstention (knowledge insufficiency). Experiments on standard benchmarks demonstrate that ABCA improves abstention reliability, achieves state-of-the-art performance, and enhances the interpretability of abstention decisions.

[32] Attention-Guided Feature Fusion (AGFF) Model for Integrating Statistical and Semantic Features in News Text Classification

Mohammad Zare

Main category: cs.CL

TL;DR: The paper proposes an Attention-Guided Feature Fusion (AGFF) model that combines statistical and semantic features using attention mechanisms for improved news text classification.

Details

Motivation: Traditional statistical methods fail to capture contextual meaning, while modern deep learning approaches may overlook important statistical indicators. There's a need to integrate both feature types for better classification.

Method: An attention-based mechanism dynamically determines the relative importance of statistical and semantic features, combining them in a unified framework for classification.

Result: The AGFF model demonstrates superior performance on benchmark news datasets compared to both traditional statistical models and purely semantic deep learning models.

Conclusion: Strategic integration of diverse feature types significantly enhances classification accuracy, with the model effectively balancing and exploiting complementary strengths of statistical and semantic representations.

Abstract: News text classification is a crucial task in natural language processing, essential for organizing and filtering the massive volume of digital content. Traditional methods typically rely on statistical features like term frequencies or TF-IDF values, which are effective at capturing word-level importance but often fail to reflect contextual meaning. In contrast, modern deep learning approaches utilize semantic features to understand word usage within context, yet they may overlook simple, high-impact statistical indicators. This paper introduces an Attention-Guided Feature Fusion (AGFF) model that combines statistical and semantic features in a unified framework. The model applies an attention-based mechanism to dynamically determine the relative importance of each feature type, enabling more informed classification decisions. Through evaluation on benchmark news datasets, the AGFF model demonstrates superior performance compared to both traditional statistical models and purely semantic deep learning models. The results confirm that strategic integration of diverse feature types can significantly enhance classification accuracy. Additionally, ablation studies validate the contribution of each component in the fusion process. The findings highlight the model’s ability to balance and exploit the complementary strengths of statistical and semantic representations, making it a practical and effective solution for real-world news classification tasks.

[33] AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale

Ziyang Wang, Yuanlei Zheng, Zhenbiao Cao, Xiaojin Zhang, Zhongyu Wei, Pei Fu, Zhenbo Luo, Wei Chen, Xiang Bai

Main category: cs.CL

TL;DR: AutoLink is an autonomous agent framework that reformulates schema linking as an iterative process, achieving state-of-the-art performance with exceptional scalability for industrial text-to-SQL systems.

Details

Motivation: Existing schema linking methods are impractical for industrial-scale text-to-SQL due to context window limits, irrelevant noise, prohibitive costs, poor recall-noise tradeoff, and poor scalability to large databases.

Method: AutoLink uses an LLM-guided autonomous agent framework that dynamically explores and expands the linked schema subset through an iterative process, progressively identifying necessary schema components without requiring the full database schema.

Result: Achieved state-of-the-art strict schema linking recall of 97.4% on Bird-Dev and 91.2% on Spider-2.0-Lite, with competitive execution accuracy of 68.7% EX on Bird-Dev and 34.9% EX on Spider-2.0-Lite. Maintains high performance on large schemas with over 3,000 columns.

Conclusion: AutoLink provides a highly scalable, high-recall schema-linking solution that overcomes limitations of existing methods, making it suitable for industrial text-to-SQL systems.

Abstract: For industrial-scale text-to-SQL, supplying the entire database schema to Large Language Models (LLMs) is impractical due to context window limits and irrelevant noise. Schema linking, which filters the schema to a relevant subset, is therefore critical. However, existing methods incur prohibitive costs, struggle to trade off recall and noise, and scale poorly to large databases. We present AutoLink, an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process. Guided by an LLM, AutoLink dynamically explores and expands the linked schema subset, progressively identifying necessary schema components without inputting the full database schema. Our experiments demonstrate AutoLink’s superior performance, achieving state-of-the-art strict schema linking recall of 97.4% on Bird-Dev and 91.2% on Spider-2.0-Lite, with competitive execution accuracy, i.e., 68.7% EX on Bird-Dev (better than CHESS) and 34.9% EX on Spider-2.0-Lite (ranking 2nd on the official leaderboard). Crucially, AutoLink exhibits exceptional scalability, maintaining high recall, efficient token consumption, and robust execution accuracy on large schemas (e.g., over 3,000 columns) where existing methods severely degrade, making it a highly scalable, high-recall schema-linking solution for industrial text-to-SQL systems.
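
The abstract does not define "strict" recall. A minimal sketch under one common reading, assumed here, counts a question as a hit only when every gold schema element appears in the linked subset. The function name and the `table.column` string format are illustrative.

```python
def strict_schema_recall(predicted, gold):
    """Fraction of questions whose linked subset contains ALL gold schema items.

    predicted, gold: equal-length lists of iterables of "table.column" strings.
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1 for linked, needed in zip(predicted, gold)
        if set(needed) <= set(linked)  # one missing column makes the question a miss
    )
    return hits / len(gold)
```

Under this reading, missing a single required column fails the whole question, which is why high strict recall is hard to sustain on schemas with thousands of columns.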

[34] E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models

Tao Yuan, Haoli Bai, Yinfei Pan, Xuyang Cao, Tianyu Zhang, Lu Hou, Ting Hu, Xianzhi Yu

Main category: cs.CL

TL;DR: E3 is a layer pruning framework that addresses performance degradation, high training costs, and limited acceleration through differentiable mask optimization and entropy-aware knowledge distillation.

Details

Motivation: Existing layer pruning methods struggle with performance degradation, high training costs, and limited acceleration, making them impractical for real deployment.

Method: Uses differentiable mask optimization with Gumbel-TopK sampler for efficient pruning mask search, and entropy-aware adaptive knowledge distillation to enhance task performance.

Result: Achieves 96% accuracy (only 0.8% drop from original 96.8%) on MATH-500 when pruning 25% layers of Qwen3-32B, outperforming SOTA (95%), with 1.33× inference speedup using only 0.5B tokens (0.5% of post-training data).

Conclusion: E3 framework successfully addresses key deployment challenges in layer pruning, achieving superior performance with minimal training cost and significant inference acceleration.

Abstract: With the increasing size of large language models, layer pruning has gained increased attention as a hardware-friendly approach for model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose E$^3$-Pruner, a task-Effective, training-Economical and inference-Efficient layer pruning framework. E$^3$-Pruner introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments over diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, E$^3$-Pruner achieves 96% accuracy, a mere 0.8% drop from the original model (96.8%) on MATH-500 when pruning 25% layers of Qwen3-32B, outperforming existing SOTA (95%), with a 1.33× inference speedup by consuming merely 0.5B tokens (0.5% of the post-training data volume).
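
The sampling step behind a Gumbel-TopK sampler can be sketched with the standard Gumbel-top-k trick: add independent Gumbel noise to each layer's logit and keep the k largest. This shows only the generic sampling step, not the paper's differentiable relaxation or training loop. The function names, the 8-layer example, and the uniform logits are assumptions.

```python
import math
import random


def gumbel_topk(logits, k, rng=None):
    """Sample k distinct indices without replacement via the Gumbel-top-k trick."""
    rng = rng or random.Random()
    if not 0 <= k <= len(logits):
        raise ValueError("k must be between 0 and len(logits)")
    keys = []
    for index, logit in enumerate(logits):
        u = max(rng.random(), 1e-12)        # avoid log(0)
        gumbel = -math.log(-math.log(u))    # standard Gumbel(0, 1) noise
        keys.append((logit + gumbel, index))
    keys.sort(reverse=True)                 # largest perturbed logits first
    return sorted(index for _, index in keys[:k])


def layer_keep_mask(logits, prune_ratio, rng=None):
    """Boolean keep-mask over layers with round(prune_ratio * n) layers dropped."""
    n_keep = len(logits) - round(prune_ratio * len(logits))
    kept = set(gumbel_topk(logits, n_keep, rng))
    return [i in kept for i in range(len(logits))]


# Hypothetical 8-layer model with uniform logits, pruning 25% of the layers.
mask = layer_keep_mask([0.0] * 8, 0.25, random.Random(0))
```

Layers with higher logits are kept more often, so learning the logits amounts to learning which layers to retain.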

[35] A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents

Sizhe Zhou

Main category: cs.CL

TL;DR: Proposes an event-centric memory approach for conversational agents that represents dialogue history as short, event-like propositions in a graph structure, enabling better long-term coherence and personalization while using shorter contexts.

Details

Motivation: LLM-based conversational agents struggle with long-term coherence and personalization due to fixed context windows and limitations of existing memory approaches that trade off between coarse retrieval and fragmented views of dialogue.

Method: Decomposes sessions into enriched elementary discourse units (EDUs) - self-contained statements with normalized entities and source attributions - organized in a heterogeneous graph. Uses dense similarity search and LLM filtering for retrieval, with optional graph-based propagation.

Result: Experiments on LoCoMo and LongMemEval benchmarks show event-centric memories match or surpass strong baselines while operating with much shorter QA contexts.

Conclusion: Structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents, preserving information in non-compressive form and making it more accessible.

Abstract: LLM-based conversational agents still struggle to maintain coherent, personalized interaction over many sessions: fixed context windows limit how much history can be kept in view, and most external memory approaches trade off between coarse retrieval over large chunks and fine-grained but fragmented views of the dialogue. Motivated by neo-Davidsonian event semantics, we propose an event-centric alternative that represents conversational history as short, event-like propositions which bundle together participants, temporal cues, and minimal local context, rather than as independent relation triples or opaque summaries. In contrast to work that aggressively compresses or forgets past content, our design aims to preserve information in a non-compressive form and make it more accessible, rather than more lossy. Concretely, we instruct an LLM to decompose each session into enriched elementary discourse units (EDUs) – self-contained statements with normalized entities and source turn attributions – and organize sessions, EDUs, and their arguments in a heterogeneous graph that supports associative recall. On top of this representation we build two simple retrieval-based variants that use dense similarity search and LLM filtering, with an optional graph-based propagation step to connect and aggregate evidence across related EDUs. Experiments on the LoCoMo and LongMemEval$_S$ benchmarks show that these event-centric memories match or surpass strong baselines, while operating with much shorter QA contexts. Our results suggest that structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents. Our code and data will be released at https://github.com/KevinSRR/EMem.

[36] Parrot: Persuasion and Agreement Robustness Rating of Output Truth – A Sycophancy Robustness Benchmark for LLMs

Yusuf Çelebi, Mahmoud El Hussieni, Özay Ezerceli

Main category: cs.CL

TL;DR: PARROT is a framework that measures how LLMs’ accuracy degrades under social pressure and sycophancy, revealing significant model heterogeneity in resistance to authoritative misinformation.

Details

Motivation: To address the problem of sycophancy in LLMs where models excessively conform to authoritative but false statements, potentially leading to epistemic collapse under social pressure.

Method: PARROT uses (i) double-blind evaluation comparing neutral vs authoritative false versions of questions, (ii) log-likelihood-based calibration tracking to measure confidence shifts, and (iii) an eight-state behavioral taxonomy to classify failure modes.

Result: Advanced models (GPT-5, GPT-4.1, Claude Sonnet 4.5) show low follow rates (≤11%) and minimal accuracy loss, while older/smaller models exhibit severe epistemic collapse (GPT-4: 80%, Qwen 2.5-1.5B: 94%). International law and global knowledge are most fragile, while elementary mathematics is resilient.

Conclusion: Resistance to overfitting pressure should be a primary objective alongside accuracy, harm avoidance, and privacy for safe real-world LLM deployment.

Abstract: This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy that large language models (LLMs) exhibit under social pressure exerted through authority and persuasion, i.e., the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low “follow rates” (≤11%, GPT-5: 4%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80%, Qwen 2.5-1.5B: 94%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of “resistance to overfitting pressure” should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
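
The abstract does not define the "follow rate". A minimal sketch, assuming it is the share of questions on which the model's answer under pressure matches the imposed false option, is given below together with the matching accuracy loss. The `Item` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    correct: str            # gold option, e.g. "B"
    imposed: str            # false option asserted by the authority prompt
    neutral_answer: str     # model answer to the neutral question
    pressured_answer: str   # model answer to the authoritative false version


def follow_rate(items):
    """Share of items where the pressured answer equals the imposed false option."""
    return sum(it.pressured_answer == it.imposed for it in items) / len(items)


def accuracy_loss(items):
    """Accuracy on neutral prompts minus accuracy on pressured prompts."""
    neutral = sum(it.neutral_answer == it.correct for it in items) / len(items)
    pressured = sum(it.pressured_answer == it.correct for it in items) / len(items)
    return neutral - pressured
```

Comparing the two answers for the same question is what lets the neutral prompt act as a control for the pressured one.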

[37] Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

Anshul Singh, Rohan Chaudhary, Gagneet Singh, Abhay Kumary

Main category: cs.CL

TL;DR: MirageTVQA is a new multilingual benchmark with 60K QA pairs across 24 languages that evaluates VLMs on visually imperfect tables, revealing significant performance drops with visual noise and English-first bias.

Details

Motivation: Existing tabular QA datasets are monolingual (English) and use digitally perfect tables, creating a gap between research and real-world applications where tables are often multilingual and visually noisy.

Method: Created MirageTVQA benchmark featuring nearly 60,000 QA pairs across 24 languages with tables that incorporate realistic visual noise to mimic scanned documents.

Result: Evaluation of leading VLMs showed over 35% performance drop when faced with visual noise and consistent English-first bias where reasoning abilities don’t transfer to other languages.

Conclusion: MirageTVQA provides a benchmark to measure and drive progress towards more robust VLM models for table reasoning in real-world multilingual and visually imperfect scenarios.

Abstract: The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLM models for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.

Benjamin White, Anastasia Shimorina

Main category: cs.CL

TL;DR: Hybrid methodology for social media user behavior prediction that handles both frequent and infrequent actions, achieving strong performance on a large-scale Bluesky dataset with 6.4M conversation threads across 12 actions and 25 persona clusters.

Details

Motivation: Existing approaches focus mainly on common actions like retweeting and liking, leaving rare but significant behaviors largely unexplored, creating a gap in comprehensive user behavior understanding.

Method: Combines four approaches: lookup database based on historical patterns, persona-specific LightGBM models for common actions, hybrid neural architecture for rare actions fusing text and temporal features, and text reply generation.

Result: Persona-specific models achieved 0.64 macro F1-score for common actions, rare action classifier achieved 0.56 macro F1-score across 10 rare actions, winning first place in SocialSim challenge.

Conclusion: Effective social media behavior prediction requires tailored modeling strategies that recognize fundamental differences between action types, with hybrid approaches outperforming single-method solutions.

Abstract: Understanding and predicting user behavior on social media platforms is crucial for content recommendation and platform design. While existing approaches focus primarily on common actions like retweeting and liking, the prediction of rare but significant behaviors remains largely unexplored. This paper presents a hybrid methodology for social media user behavior prediction that addresses both frequent and infrequent actions across a diverse action vocabulary. We evaluate our approach on a large-scale Bluesky dataset containing 6.4 million conversation threads spanning 12 distinct user actions across 25 persona clusters. Our methodology combines four complementary approaches: (i) a lookup database system based on historical response patterns; (ii) persona-specific LightGBM models with engineered temporal and semantic features for common actions; (iii) a specialized hybrid neural architecture fusing textual and temporal representations for rare action classification; and (iv) generation of text replies. Our persona-specific models achieve an average macro F1-score of 0.64 for common action prediction, while our rare action classifier achieves 0.56 macro F1-score across 10 rare actions. These results demonstrate that effective social media behavior prediction requires tailored modeling strategies recognizing fundamental differences between action types. Our approach achieved first place in the SocialSim: Social-Media Based Personas challenge organized at the Social Simulation with LLMs workshop at COLM 2025.
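The results above are reported as macro F1, which averages per-class F1 scores with equal weight, so rare actions count as much as common ones. The following is a minimal sketch of that metric in plain Python; the action labels ("like", "block") are hypothetical and chosen only for illustration, not taken from the paper's action vocabulary.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical example: a classifier that always predicts the common action.
y_true = ["like"] * 8 + ["block"] * 2
y_pred = ["like"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # 0.8
print(round(macro_f1(y_true, y_pred), 4))  # 0.4444
```

In this toy case accuracy looks high while macro F1 is low, which is why macro averaging is a natural choice when rare actions matter.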

[39] Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation

Marii Ojastu, Hele-Andra Kuulmets, Aleksei Dorkin, Marika Borovikova, Dage Särg, Kairit Sirts

Main category: cs.CL

TL;DR: Estonian translation of WinoGrande benchmark shows slightly lower LLM performance than English original, with machine translation performing worse. Prompt engineering offers limited improvement, highlighting need for human specialists in dataset translation.

Details

Motivation: To create a culturally adapted Estonian version of the WinoGrande commonsense reasoning benchmark for reliable evaluation of language models' competency in Estonian.

Method: Human translation by specialists, evaluation of proprietary and open-source models, and exploration of prompt engineering for machine translation improvement.

Result: Models performed slightly worse on human-translated Estonian than original English, and significantly worse on machine-translated data. Prompt engineering provided limited translation quality improvements.

Conclusion: Human language specialists are crucial for reliable dataset translation and adaptation to ensure accurate evaluation of language models’ reasoning capabilities across languages.

Abstract: In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark, WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open source models on the human translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt. This prompt is specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Additionally, our experiments indicate that prompt engineering offers limited improvement in translation quality or model accuracy, and highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of language competency and reasoning in large language models.

Koena Ronny Mabokela, Tim Schlippe, Matthias Wölfel

Main category: cs.CL

TL;DR: Analysis of LLMs’ zero-shot sentiment analysis performance on South African social media posts in English, Sepedi and Setswana, showing fusion of multiple LLMs achieves <1% error rate.

Details

Motivation: To leverage LLMs for sentiment analysis on social media posts in South African languages to detect social challenges, as no previous work has investigated this for multilingual communities.

Method: Analyzed zero-shot performance of GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 on sentiment analysis of 10 emerging topics across English, Sepedi and Setswana social media posts related to 10 government departments.

Result: Performance differs substantially across LLMs, topics, and languages. Fusing the outputs of multiple LLMs yields large gains, with sentiment classification errors below 1%.

Conclusion: Feasible to provide reliable sentiment analysis systems for detecting social challenges and identifying action needs across different topics and language groups in South Africa.

Abstract: Sentiment analysis can aid in understanding people’s opinions and emotions on social issues. In multilingual communities, sentiment analysis systems can be used to quickly identify social challenges in social media posts, enabling government departments to detect and address these issues more precisely and effectively. Recently, large language models (LLMs) have become available to the general public, and initial analyses have shown that they exhibit impressive zero-shot sentiment analysis abilities in English. However, no work has investigated leveraging LLMs for sentiment analysis on social media posts in South African languages to detect social challenges. Consequently, in this work, we analyse the zero-shot performance of the state-of-the-art LLMs GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 to investigate the sentiment polarities of the 10 most emerging topics in English, Sepedi and Setswana social media posts that fall within the jurisdictional areas of 10 South African government departments. Our results demonstrate that there are big differences between the various LLMs, topics, and languages. In addition, we show that a fusion of the outcomes of different LLMs provides large gains in sentiment classification performance with sentiment classification errors below 1%. Consequently, it is now feasible to provide systems that generate reliable information about sentiment analysis to detect social challenges and draw conclusions about possible needs for actions on specific topics and within different language groups.

[41] Humanlike Multi-user Agent (HUMA): Designing a Deceptively Human AI Facilitator for Group Chats

Mateusz Jacniacki, Martí Carmona Serrat

Main category: cs.CL

TL;DR: HUMA is an LLM-based conversational agent designed for natural multi-party group chats using human-like interaction patterns and timing, making it difficult to distinguish from human participants.

Details

Motivation: Most current conversational agents are designed for one-on-one turn-based exchanges, but as AI assistants become widespread, developing natural humanlike interaction patterns is crucial for maintaining user trust and engagement in group settings.

Method: HUMA uses an event-driven architecture with three components (Router, Action Agent, and Reflection) that handles messages, replies, reactions and includes realistic response-time simulation to adapt LLMs to group conversation dynamics.

Result: In a study with 97 participants in four-person role-play chats, participants classified community managers as human at near-chance rates for both AI and human conditions, and subjective experiences (effectiveness, social presence, engagement) showed only modest differences with small effect sizes.

Conclusion: AI facilitators like HUMA can match human quality in natural group chat settings while remaining difficult to identify as nonhuman, suggesting the viability of humanlike AI participation in multi-party conversations.

Abstract: Conversational agents built on large language models (LLMs) are becoming increasingly prevalent, yet most systems are designed for one-on-one, turn-based exchanges rather than natural, asynchronous group chats. As AI assistants become widespread throughout digital platforms, from virtual assistants to customer service, developing natural and humanlike interaction patterns seems crucial for maintaining user trust and engagement. We present the Humanlike Multi-user Agent (HUMA), an LLM-based facilitator that participates in multi-party conversations using human-like strategies and timing. HUMA extends prior multi-user chatbot work with an event-driven architecture that handles messages, replies, reactions and introduces realistic response-time simulation. HUMA comprises three components (Router, Action Agent, and Reflection), which together adapt LLMs to group conversation dynamics. We evaluate HUMA in a controlled study with 97 participants in four-person role-play chats, comparing AI and human community managers (CMs). Participants classified CMs as human at near-chance rates in both conditions, indicating they could not reliably distinguish HUMA agents from humans. Subjective experience was comparable across conditions: community-manager effectiveness, social presence, and engagement/satisfaction differed only modestly with small effect sizes. Our results suggest that, in natural group chat settings, an AI facilitator can match human quality while remaining difficult to identify as nonhuman.

[42] Don’t Learn, Ground: A Case for Natural Language Inference with Visual Grounding

Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno

Main category: cs.CL

TL;DR: Zero-shot NLI method using multimodal representations by generating visual representations of premises and comparing them with textual hypotheses.

Details

Motivation: To create a robust NLI approach that avoids textual biases and surface heuristics by grounding language in visual contexts.

Method: Generate visual representations of premises using text-to-image models, then perform inference using cosine similarity and visual question answering techniques without task-specific fine-tuning.

Result: Achieves high accuracy in NLI tasks and demonstrates robustness against textual biases, validated through a controlled adversarial dataset.

Conclusion: Using visual modality as meaning representation offers a promising direction for robust natural language understanding.

Abstract: We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
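The cosine-similarity inference step described above can be sketched as follows. This is a minimal illustration, assuming the generated premise image and the textual hypothesis have already been embedded into a shared vector space by some multimodal encoder (the abstract does not name one). The thresholds and the three-way decision rule are hypothetical and are not the authors' procedure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_label(image_vec, hypothesis_vec, entail_thr=0.6, contra_thr=0.2):
    """Map a similarity score to an NLI label.

    entail_thr and contra_thr are illustrative assumptions; in practice
    they would have to be chosen on held-out data.
    """
    sim = cosine_similarity(image_vec, hypothesis_vec)
    if sim >= entail_thr:
        return "entailment"
    if sim <= contra_thr:
        return "contradiction"
    return "neutral"


# Toy 2-d embeddings standing in for real encoder outputs.
print(predict_label([1.0, 0.0], [1.0, 1.0]))  # entailment
print(predict_label([1.0, 0.0], [0.0, 1.0]))  # contradiction
```

Because no parameters are trained on NLI data, a rule of this kind stays zero-shot; only the thresholds need to be set.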

[43] Selective Rotary Position Embedding

Sajad Movahedi, Timur Carstensen, Arshia Afzal, Frank Hutter, Antonio Orvieto, Volkan Cevher

Main category: cs.CL

TL;DR: Selective RoPE introduces input-dependent rotary position embeddings that generalize RoPE by enabling arbitrary rotation angles, improving performance in language modeling and sequence tasks.

Details

Motivation: To combine the benefits of selective gating from linear transformers with the position encoding capabilities of RoPE, creating a more flexible and effective positional encoding mechanism.

Method: Developed Selective RoPE, an input-dependent rotary embedding that allows arbitrary rotation angles, and showed its applicability to both linear and softmax transformers by revealing hidden rotational structures.

Result: Selective RoPE improves performance in language modeling and challenging sequence tasks like copying, state tracking, and retrieval when equipped in gated transformers.

Conclusion: Input-dependent rotations through Selective RoPE provide a more flexible and effective positional encoding mechanism that generalizes RoPE and enhances transformer performance across various tasks.

Abstract: Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (RoPE) encode positions through fixed-angle rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce Selective RoPE, an input-dependent rotary embedding mechanism, that generalizes RoPE, and enables rotation in arbitrary angles for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.
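To make the contrast between fixed-angle and input-dependent rotations concrete, here is a minimal sketch in plain Python. `rope_angles` follows the standard RoPE frequency schedule. `selective_angles` is an illustrative assumption: it accumulates per-step angle increments obtained from a linear projection of each input, and the abstract does not specify how the paper actually parameterizes its angles.

```python
import math


def rotate_pairs(vec, angles):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        x1, x2 = vec[2 * i], vec[2 * i + 1]
        out.append(x1 * math.cos(theta) - x2 * math.sin(theta))
        out.append(x1 * math.sin(theta) + x2 * math.cos(theta))
    return out


def rope_angles(position, dim, base=10000.0):
    """Fixed-angle RoPE: angle_i = position * base^(-2i/dim)."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def selective_angles(inputs, weights):
    """Hypothetical input-dependent angles.

    inputs:  list of T input vectors, each of length d.
    weights: list of dim//2 rows of length d (an assumed projection).
    Returns one angle vector per time step: the running sum of the
    projected inputs, so the rotation depends on content, not just index.
    """
    running = [0.0] * len(weights)
    all_angles = []
    for x in inputs:
        for i, row in enumerate(weights):
            running[i] += sum(w * xi for w, xi in zip(row, x))
        all_angles.append(list(running))
    return all_angles


q = [1.0, 2.0, 3.0, 4.0]
q_fixed = rotate_pairs(q, rope_angles(position=3, dim=4))
angles = selective_angles([[1.0, 0.0], [0.5, 2.0]], [[0.5, 0.0], [0.0, 0.1]])
q_selective = rotate_pairs(q, angles[-1])
```

With fixed angles, the query-key dot product depends only on the relative position. With accumulated input-dependent angles, it depends on the content seen between the two positions.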

[44] PUCP-Metrix: A Comprehensive Open-Source Repository of Linguistic Metrics for Spanish

Javier Alonso Villegas Luis, Marco Antonio Sobrevilla Cabezudo

Main category: cs.CL

TL;DR: PUCP-Metrix is an open-source repository of 182 linguistic metrics for Spanish text analysis, covering lexical diversity, syntax, semantics, cohesion, psycholinguistics, and readability.

Details

Motivation: Existing Spanish linguistic analysis tools offer limited coverage, creating a need for comprehensive resources that support interpretability and tasks involving style, structure, and readability.

Method: Developed a repository of 182 linguistic metrics spanning multiple dimensions of text analysis and evaluated its performance on Automated Readability Assessment and Machine-Generated Text Detection tasks.

Result: PUCP-Metrix showed competitive performance compared to existing repositories and strong neural baselines, demonstrating its effectiveness for Spanish NLP applications.

Conclusion: PUCP-Metrix provides a comprehensive, extensible resource for Spanish linguistic analysis that supports diverse NLP applications with fine-grained, interpretable text analysis capabilities.

Abstract: Linguistic features remain essential for interpretability and tasks involving style, structure, and readability, but existing Spanish tools offer limited coverage. We present PUCP-Metrix, an open-source repository of 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability. PUCP-Metrix enables fine-grained, interpretable text analysis. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines. PUCP-Metrix offers a comprehensive, extensible resource for Spanish, supporting diverse NLP applications.

[45] Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training

Yesheng Liu, Hao Li, Haiyu Xu, Baoqi Pei, Jiahao Wang, Mingxuan Zhao, Jingshu Zheng, Zheqi He, JG Yao, Bowen Qin, Xi Yang, Jiajun Zhang

Main category: cs.CL

TL;DR: ReVeL framework converts multiple-choice questions to open-form questions to prevent answer guessing and improve evaluation reliability, showing significant score inflation in MCQA benchmarks.

Details

Motivation: Multiple-choice question answering (MCQA) has exploitable signals that make accuracy metrics unreliable and encourage guessing behaviors during reinforcement fine-tuning.

Method: Propose ReVeL framework that rewrites MCQA into open-form questions with different rewriting/verification schemes based on answer types, then uses GRPO to finetune Qwen2.5-VL models.

Result: Models trained on ReVeL-OpenQA match MCQA accuracy on benchmarks and improve OpenQA accuracy by ~6 percentage points. Reveals up to 20 percentage points score inflation in MCQA benchmarks relative to OpenQA.

Conclusion: ReVeL provides better data efficiency, more robust reward signals than MCQA-based training, improves judging accuracy, and reduces cost and latency.

Abstract: Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions by answer type and applies a different rewriting and verification scheme to each type. When applied to RFT, we convert 20k MCQA examples and use GRPO to finetune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.

[46] SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

Shrikant Kendre, Austin Xu, Honglu Zhou, Michael Ryoo, Shafiq Joty, Juan Carlos Niebles

Main category: cs.CL

TL;DR: SMILE is a new evaluation metric that combines sentence-level and keyword-level semantic understanding with lexical exactness to better assess question answering systems, outperforming traditional metrics and LLM-based evaluators.

Details

Motivation: Traditional evaluation metrics like ROUGE and EM focus too much on lexical similarity and miss deeper semantic understanding, while BERTScore and MoverScore lack flexibility in balancing semantics and ignore lexical similarity. LLM-based evaluators have issues with cost, bias, and inconsistency.

Method: SMILE integrates sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching, creating a composite method that balances lexical precision and semantic relevance.

Result: Extensive benchmarks across text, image, and video QA tasks show SMILE achieves high correlation with human judgments while being computationally lightweight.

Conclusion: SMILE effectively bridges the gap between lexical and semantic evaluation, providing a comprehensive assessment method for question answering systems.

Abstract: Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.

[47] Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

Zhen Wang, Zhifeng Gao, Guolin Ke

Main category: cs.CL

TL;DR: MR-RLVR introduces process-level self-supervised rewards via masking and reordering to enhance RLVR’s scalability in mathematical reasoning where only final answers are verifiable.

Details

Motivation: RLVR struggles with mathematical theorem proving where intermediate reasoning is crucial but final answers are hard to verify directly, and token-level SFT often leads to memorization rather than genuine reasoning.

Method: Two-stage training: first self-supervised training using ‘masked-then-fill’ and ‘step reordering’ on mathematical data, then RLVR fine-tuning on calculation datasets with verifiable outcomes.

Result: On Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, MR-RLVR achieved average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8 across AIME24, AIME25, AMC23, and MATH500 under a fixed sampling and decoding budget.

Conclusion: Process-aware self-supervised signals effectively enhance RLVR’s scalability and performance in outcome-verifiable mathematical reasoning settings.

Abstract: Test-time scaling has been shown to substantially improve large language models’ (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR’s scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT’s self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via “masked-then-fill” and “step reordering” to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR’s scalability and performance in only outcome-verifiable settings.
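The two self-supervised tasks named above, "masked-then-fill" and "step reordering", can be sketched as task builders with simple rewards. This is a minimal illustration under stated assumptions: the exact-match fill reward, the per-position reordering reward, and the `[MASK]` token are all hypothetical, since the abstract does not say how MR-RLVR scores model outputs.

```python
import random


def make_mask_task(steps, index, mask_token="[MASK]"):
    """Hide one reasoning step; return the masked trace and the hidden step."""
    masked = list(steps)
    masked[index] = mask_token
    return masked, steps[index]


def fill_reward(prediction, target):
    """Assumed reward: 1.0 on exact match after whitespace stripping."""
    return 1.0 if prediction.strip() == target.strip() else 0.0


def make_reorder_task(steps, seed=0):
    """Shuffle steps; order[j] is the original index of shuffled[j]."""
    rng = random.Random(seed)
    order = list(range(len(steps)))
    rng.shuffle(order)
    return [steps[i] for i in order], order


def reorder_reward(predicted_order, true_order):
    """Assumed reward: fraction of shuffled steps placed at the right index."""
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)


steps = ["x + 2 = 5", "x = 5 - 2", "x = 3"]
masked, target = make_mask_task(steps, index=1)
shuffled, order = make_reorder_task(steps, seed=1)
print(masked)                        # ['x + 2 = 5', '[MASK]', 'x = 3']
print(reorder_reward(order, order))  # 1.0
```

Both rewards are computed from the reasoning trace itself, so they need no external verifier for intermediate steps.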

[48] MiniLLM: Knowledge Distillation of Large Language Models

Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

Main category: cs.CL

TL;DR: MiniLLM is a knowledge distillation method that uses reverse KLD instead of forward KLD to distill large language models into smaller models, improving response quality and reducing exposure bias.

Details

Motivation: Previous KD methods focus on white-box classification models or imitating black-box APIs, but effective distillation of white-box LLMs into smaller models is under-explored despite the growth of open-source LLMs.

Method: Replace forward KLD with reverse KLD in knowledge distillation to prevent student models from overestimating low-probability regions, and derive an on-policy optimization approach for learning this objective.

Result: MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and better long-text generation performance than baselines, scalable for models from 120M to 13B parameters.

Conclusion: The proposed reverse KLD approach effectively distills knowledge from large language models into smaller models, demonstrating superior performance across multiple metrics and scalability across different model sizes.

Abstract: Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective on-policy optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/minillm.
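The difference between the two objectives can be seen on a toy discrete distribution. With teacher p and student q, forward KLD is KL(p‖q) = Σ p(x) log(p(x)/q(x)) and reverse KLD is KL(q‖p) = Σ q(x) log(q(x)/p(x)). The sketch below uses made-up three-outcome distributions purely for illustration; it is not the paper's training procedure, which optimizes reverse KLD over generated sequences.

```python
import math


def kl(a, b):
    """KL(a || b) for discrete distributions given as probability lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


# Hypothetical teacher: two likely outcomes and one very unlikely one.
teacher = [0.50, 0.49, 0.01]

# Student A concentrates on a major mode and keeps the unlikely outcome rare.
student_a = [0.90, 0.09, 0.01]
# Student B spreads mass and overestimates the unlikely outcome.
student_b = [0.40, 0.30, 0.30]

forward_a, forward_b = kl(teacher, student_a), kl(teacher, student_b)
reverse_a, reverse_b = kl(student_a, teacher), kl(student_b, teacher)

print(forward_b < forward_a)  # True: forward KLD prefers the spread-out student
print(reverse_a < reverse_b)  # True: reverse KLD penalizes mass on unlikely outcomes
```

Reverse KLD weights the log-ratio by the student's own probabilities, so any mass the student places where the teacher assigns little is penalized heavily. This matches the stated goal of keeping the student from overestimating low-probability regions.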

[49] Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings

Miguel Moura Ramos, Tomás Almeida, Daniel Vareta, Filipe Azevedo, Sweta Agrawal, Patrick Fernandes, André F. T. Martins

Main category: cs.CL

TL;DR: Proposes using token-level quality assessments with error severity levels in RL for machine translation, showing improved quality and training stability over sentence-level rewards.

Details

Motivation: Address reward sparsity problem in RL for machine translation where sentence-level feedback provides inefficient learning signals.

Method: Use xCOMET quality estimation system as token-level reward model with RL methods, testing on various translation datasets with encoder-decoder and LLM-based systems.

Result: Token-level rewards improve translation quality across language pairs in automatic and human evaluation, and enhance training stability with steady reward increases.

Conclusion: Fine-grained token-level rewards outperform sentence-level rewards in RL-based machine translation training.

Abstract: Reinforcement learning (RL) has been proven to be an effective and robust method for training neural machine translation systems, especially when paired with powerful reward models that accurately assess translation quality. However, most research has focused on RL methods that use sentence-level feedback, leading to inefficient learning signals due to the reward sparsity problem – the model receives a single score for the entire sentence. To address this, we propose a novel approach that leverages fine-grained, token-level quality assessments along with error severity levels using RL methods. Specifically, we use xCOMET, a state-of-the-art quality estimation system, as our token-level reward model. We conduct experiments on small and large translation datasets with standard encoder-decoder and large language models-based machine translation systems, comparing the impact of sentence-level versus fine-grained reward signals on translation quality. Our results show that training with token-level rewards improves translation quality across language pairs over baselines according to both automatic and human evaluation. Furthermore, token-level reward optimization improves training stability, evidenced by a steady increase in mean rewards over training epochs.
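To illustrate the contrast between a single sentence-level score and token-level rewards derived from error severities, here is a minimal sketch. It assumes the quality estimator returns error spans as (start, end, severity) token ranges. The penalty values are hypothetical MQM-style weights chosen for illustration; the abstract does not give the mapping the paper uses.

```python
# Hypothetical severity-to-penalty mapping (not taken from the paper).
SEVERITY_PENALTY = {"minor": -1.0, "major": -5.0, "critical": -10.0}


def token_rewards(num_tokens, error_spans, default=0.0):
    """Assign each token the penalty of the most severe span covering it.

    error_spans: iterable of (start, end, severity), end exclusive.
    Tokens outside every span receive `default`.
    """
    rewards = [default] * num_tokens
    for start, end, severity in error_spans:
        for i in range(start, end):
            rewards[i] = min(rewards[i], SEVERITY_PENALTY[severity])
    return rewards


def sentence_reward(num_tokens, error_spans):
    """Collapse the same information into one scalar for the whole sentence."""
    return sum(token_rewards(num_tokens, error_spans))


spans = [(1, 3, "minor"), (2, 4, "major")]
print(token_rewards(5, spans))    # [0.0, -1.0, -5.0, -5.0, 0.0]
print(sentence_reward(5, spans))  # -11.0
```

The per-token list tells the policy which tokens caused the penalty, whereas the scalar gives every token in the sentence the same signal.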

[50] Task-Aligned Tool Recommendation for Large Language Models

Hang Gao, Yongfeng Zhang

Main category: cs.CL

TL;DR: The paper proposes PTR, a precision-driven tool recommendation approach that dynamically selects optimal tool sets for LLMs by leveraging historical usage patterns and multi-view matching, addressing inefficiencies in current fixed-size tool selection methods.

Details

Motivation: Current tool retrieval methods for LLMs use fixed-size top-ranked tool sets, which often include redundant or unsuitable tools since optimal tool quantity varies by task, leading to inefficiencies in problem-solving.

Method: PTR approach captures initial concise tool sets using historical tool bundle usage, then dynamically adjusts through tool matching and multi-view-based tool addition to create precise tool recommendations.

Result: The approach demonstrates promising accuracy across two open benchmarks and the newly introduced RecTools dataset, validated through comprehensive experiments.

Conclusion: PTR effectively addresses the tool recommendation challenge for LLMs by providing dynamically adjusted, precise tool sets that improve efficiency and relevance compared to fixed-size ranking approaches.

Abstract: By augmenting Large Language Models (LLMs) with external tools, their capacity to solve complex problems has been significantly enhanced. However, despite ongoing advancements in the parsing capabilities of LLMs, incorporating all available tools simultaneously in the prompt remains impractical due to the vast number of external tools. Consequently, it is essential to provide LLMs with a precise set of tools tailored to the specific task, considering both quantity and quality. Current tool retrieval methods primarily focus on refining the ranking list of tools and directly packaging a fixed number of top-ranked tools as the tool set. However, these approaches often fail to equip LLMs with the optimal set of tools prior to execution, since the optimal number of tools for different tasks could be different, resulting in inefficiencies such as redundant or unsuitable tools, which impede immediate access to the most relevant tools. This paper addresses the challenge of recommending precise toolsets for LLMs. We introduce the problem of tool recommendation, define its scope, and propose a novel Precision-driven Tool Recommendation (PTR) approach. PTR captures an initial, concise set of tools by leveraging historical tool bundle usage and dynamically adjusts the tool set by performing tool matching, culminating in a multi-view-based tool addition. Additionally, we present a new dataset, RecTools, and a metric, TRACC, designed to evaluate the effectiveness of tool recommendation for LLMs. We further validate our design choices through comprehensive experiments, demonstrating promising accuracy across two open benchmarks and our RecTools dataset.

[51] EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems

Zhengyi Zhao, Shubo Zhang, Yiming Du, Bin Liang, Baojun Wang, Zhongyang Li, Binyang Li, Kam-Fai Wong

Main category: cs.CL

TL;DR: EventWeave is a framework that models conversational event relationships using a dynamic event graph with core and supporting events, employing multi-head attention for contextually appropriate dialogue responses.

Details

Motivation: Current dialogue systems process conversational turns in isolation, overlooking the event structures that guide natural interactions, leading to less contextually appropriate responses.

Method: Constructs a dynamic event graph distinguishing core events (main goals) and supporting events (interconnected details), uses multi-head attention to determine relevant events, and captures three distinct relationship types between events.

Result: Experiments on three dialogue datasets show EventWeave produces more natural and contextually appropriate responses with less computational overhead than models processing entire dialogue history.

Conclusion: EventWeave effectively balances comprehensive context understanding with generating concise responses through targeted optimization, with improvements stemming from better event relationship modeling rather than increased information density.

Abstract: Large language models have improved dialogue systems, but often process conversational turns in isolation, overlooking the event structures that guide natural interactions. Hence we introduce EventWeave, a framework that explicitly models relationships between conversational events to generate more contextually appropriate dialogue responses. EventWeave constructs a dynamic event graph that distinguishes between core events (main goals) and supporting events (interconnected details), employing a multi-head attention mechanism to selectively determine which events are most relevant to the current turn. Unlike summarization or standard graph-based approaches, our method captures three distinct relationship types between events, allowing for more nuanced context modeling. Experiments on three dialogue datasets demonstrate that EventWeave produces more natural and contextually appropriate responses while requiring less computational overhead than models processing the entire dialogue history. Ablation studies confirm improvements stem from better event relationship modeling rather than increased information density. Our approach effectively balances comprehensive context understanding with generating concise responses, maintaining strong performance across various dialogue lengths through targeted optimization techniques.

[52] Concise Reasoning via Reinforcement Learning

Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, Kartik Talamadupula

Main category: cs.CL

TL;DR: RL training pushes reasoning models toward verbose outputs because loss minimization on incorrect answers favors longer responses; with unsolvable problems dominating training, this compounds into a systematic bias toward length.

Details

Motivation: Address the computational inefficiency and latency caused by reasoning models' excessive token usage, which stems from RL optimization artifacts rather than genuine reasoning depth.

Method: Theoretical analysis of PPO and GRPO algorithms, correlation studies between conciseness and correctness, and a two-phase RL procedure whose brief secondary stage is trained on a small set of solvable problems.

Result: Proposed two-phase RL significantly reduces response length while maintaining or improving accuracy, and uncovered consistent correlation between conciseness and correctness across model types.

Conclusion: Model verbosity is an optimization artifact from RL training, and targeted fine-tuning on solvable problems can achieve concise yet accurate reasoning, though GRPO has reliability limitations.

Abstract: A major drawback of reasoning models is their excessive token usage, inflating computational cost, resource demand, and latency. We show this verbosity stems not from deeper reasoning but from reinforcement learning loss minimization when models produce incorrect answers. With unsolvable problems dominating training, this effect compounds into a systematic tendency toward longer outputs. Through theoretical analysis of PPO and GRPO, we prove that incorrect answers inherently drive policies toward verbosity even when γ = 1, reframing response lengthening as an optimization artifact. We further uncover a consistent correlation between conciseness and correctness across reasoning and non-reasoning models. Building on these insights, we propose a two-phase RL procedure where a brief secondary stage, trained on a small set of solvable problems, significantly reduces response length while preserving or improving accuracy. Finally, we show that while GRPO shares properties with PPO, it exhibits collapse modes, limiting its reliability for concise reasoning. Our claims are supported by extensive experiments.

[53] The Rise of Parameter Specialization for Knowledge Storage in Large Language Models

Yihuai Hong, Yiran Zhao, Wei Tang, Yang Deng, Yu Rong, Wenxuan Zhang

Main category: cs.CL

TL;DR: Analysis of 20 LLMs shows that as models become more advanced, their MLP parameters exhibit increased specialization for encoding similar knowledge types, which improves knowledge utilization efficiency.

Details

Motivation: Limited research on how knowledge is stored in MLP parameters and how this affects knowledge utilization efficiency in language models.

Method: Analyzed 20 publicly available open-source large language models to investigate the relationship between performance and knowledge storage in MLP parameters, plus causal training experiments.

Result: Advanced models show increased parameter specialization in MLPs, with parameters more focused on encoding similar knowledge types, which improves knowledge utilization efficiency.

Conclusion: Specialized knowledge distribution in MLP parameters plays a critical role in improving model efficiency in leveraging stored knowledge.

Abstract: Over time, a growing wave of large language models from various series has been introduced to the community. Researchers are striving to maximize the performance of language models with constrained parameter sizes. However, from a microscopic perspective, there has been limited research on how to better store knowledge in model parameters, particularly within MLPs, to enable more effective utilization of this knowledge by the model. In this work, we analyze twenty publicly available open-source large language models to investigate the relationship between their strong performance and the way knowledge is stored in their corresponding MLP parameters. Our findings reveal that as language models become more advanced and demonstrate stronger knowledge capabilities, their parameters exhibit increased specialization. Specifically, parameters in the MLPs tend to be more focused on encoding similar types of knowledge. We experimentally validate that this specialized distribution of knowledge contributes to improving the efficiency of knowledge utilization in these models. Furthermore, by conducting causal training experiments, we confirm that this specialized knowledge distribution plays a critical role in improving the model’s efficiency in leveraging stored knowledge.

[54] ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo

Main category: cs.CL

TL;DR: ToolHaystack benchmark reveals LLMs struggle with long-term tool use interactions despite good performance in standard multi-turn settings.

Details

Motivation: Existing evaluations focus on short tool use contexts, lacking insight into realistic long-term interactions where models must maintain context and handle disruptions.

Method: Created the ToolHaystack benchmark, in which each test instance embeds multiple task execution contexts and realistic noise within a continuous conversation, to assess long-term tool use capabilities.

Result: 14 state-of-the-art LLMs perform well in standard multi-turn settings but significantly struggle in ToolHaystack, revealing critical gaps in long-term robustness.

Conclusion: Current LLMs have substantial limitations in long-term tool use interactions that previous benchmarks failed to uncover, highlighting the need for improved long-term robustness.

Abstract: Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often significantly struggle in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

[55] Fairness Evaluation of Large Language Models in Academic Library Reference Services

Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian

Main category: cs.CL

TL;DR: LLMs show promising readiness for equitable library reference services with minimal demographic bias, though some minor gender stereotypes were found in one model.

Details

Motivation: To evaluate whether LLMs can serve all library users equitably regardless of demographics, as they may reproduce societal biases from training data, risking libraries' commitment to equitable service.

Method: Prompted six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role, then analyzed response differentiation.

Result: No evidence of differentiation by race/ethnicity; minor stereotypical bias against women in one model; LLMs accommodate institutional roles through appropriate linguistic choices reflecting professional norms.

Conclusion: Current LLMs demonstrate promising readiness to support equitable and contextually appropriate communication in academic library reference services with minimal demographic bias.

Abstract: As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries’ commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We find no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrate nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.

[56] Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao

Main category: cs.CL

TL;DR: Response Attack (RA) is a novel jailbreak framework that exploits contextual priming vulnerabilities in LLMs by strategically using intermediate harmful responses to steer subsequent behavior toward policy-violating content.

Details

Motivation: Existing jailbreak attacks have limitations in effectiveness, efficiency, and semantic drift, while contextual priming offers an unexplored attack surface where previous dialogue responses can covertly bias LLM behavior.

Method: RA reformulates harmful queries and injects intermediate, mildly harmful responses as contextual primers before issuing targeted trigger prompts, exploiting LLMs’ vulnerability to dialogue context influence.

Result: Extensive experiments across eight state-of-the-art LLMs show RA consistently achieves significantly higher attack success rates than nine leading jailbreak baselines, generating more explicit and relevant harmful content while maintaining stealth and efficiency.

Conclusion: The strategic use of intermediate responses as contextual primers represents a previously overlooked vulnerability in LLMs that enables effective, stealthy jailbreak attacks while maintaining fidelity to original queries.

Abstract: Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer its subsequent behavior toward policy-violating content. While existing jailbreak attacks largely rely on single-turn or multi-turn prompt manipulations, or inject static in-context examples, these methods suffer from limited effectiveness, inefficiency, or semantic drift. We introduce Response Attack (RA), a novel framework that strategically leverages intermediate, mildly harmful responses as contextual primers within a dialogue. By reformulating harmful queries and injecting these intermediate responses before issuing a targeted trigger prompt, RA exploits a previously overlooked vulnerability in LLMs. Extensive experiments across eight state-of-the-art LLMs show that RA consistently achieves significantly higher attack success rates than nine leading jailbreak baselines. Our results demonstrate that the success of RA is directly attributable to the strategic use of intermediate responses, which induce models to generate more explicit and relevant harmful content while maintaining stealth, efficiency, and fidelity to the original query. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.

David M. Markowitz, Samuel Hardman Taylor

Main category: cs.CL

TL;DR: Social approval (upvotes) on hate speech comments predicts increased hate speech production over time, but approval on posts shows no such effect.

Details

Motivation: To test Walther's social approval theory of online hate, examining whether receiving social approval motivates individuals to produce more and more extreme hate speech.

Method: Analyzed 110 million messages from Parler (2018-2021), measuring the relationship between upvotes on hate speech and subsequent hate speech production at various time intervals.

Result: Upvotes on hate speech comments positively predicted increased hate speech production over weeks to months, while upvotes on posts showed no association. Social approval had stronger effects than social disapproval.

Conclusion: Social approval is a critical mechanism that facilitates the propagation of online hate speech, particularly through comment interactions.

Abstract: We examined how online hate is motivated by receiving social approval via Walther’s (2024) social approval theory of online hate, which argues (H1a) more signals of social approval on hate messages predicts more subsequent hate messages, and (H1b) as social approval increases, hate speech becomes more extreme. Using 110 million messages from Parler (2018-2021), we observed the number of upvotes received on a hate speech post was unassociated with hate speech in one’s next post and during the next month, three-months, and six-months. The number of upvotes received on (extreme) hate speech comments, however, was positively associated with (extreme) hate speech during the next week, month, three-months, and six-months. Between-person effects revealed an average positive relationship between social approval and hate speech production at all time intervals. For comments, social approval linked more strongly to online hate than social disapproval. Social approval is a critical mechanism facilitating online hate propagation.

[58] Do LLMs produce texts with “human-like” lexical diversity?

Kelly Kendro, Jeffrey Maloney, Scott Jarvis

Main category: cs.CL

TL;DR: ChatGPT models produce texts with significantly different lexical diversity patterns compared to human writers, with newer models being less human-like than older ones.

Details

Motivation: To determine how human-like LLM-generated writing is by examining lexical diversity patterns across different ChatGPT models and comparing them with human-written texts.

Method: Analyzed lexical diversity in texts from four ChatGPT models and 240 human writers across six dimensions using MANOVAs, ANOVAs, and Support Vector Machines.

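Result: ChatGPT-generated texts differed significantly from human-written texts on every lexical diversity measure, with ChatGPT-o4 mini and ChatGPT-4.5 differing the most. ChatGPT-4.5 showed higher lexical diversity than older models despite producing fewer tokens, and human writers did not differ across education or language-status subgroups.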

Conclusion: ChatGPT models do not produce human-like texts in terms of lexical diversity, and newer models are actually less human-like than older models.

Abstract: The degree to which large language models (LLMs) produce writing that is truly human-like remains unclear despite the extensive empirical attention that this question has received. The present study addresses this question from the perspective of lexical diversity. Specifically, the study investigates patterns of lexical diversity in LLM-generated texts from four ChatGPT models (ChatGPT-3.5, ChatGPT-4, ChatGPT-o4 mini, and ChatGPT-4.5) in comparison with texts written by L1 and L2 English participants (n = 240) across four education levels. Six dimensions of lexical diversity were measured in each text: volume, abundance, variety-repetition, evenness, disparity, and dispersion. Results from one-way MANOVAs, one-way ANOVAs, and Support Vector Machines revealed that the ChatGPT-generated texts differed significantly from human-written texts for each variable, with ChatGPT-o4 mini and ChatGPT-4.5 differing the most. Within these two groups, ChatGPT-4.5 demonstrated higher levels of lexical diversity than older models despite producing fewer tokens. The human writers’ lexical diversity did not differ across subgroups (i.e., education, language status). Altogether, the results indicate that ChatGPT models do not produce human-like texts in relation to lexical diversity, and the newer models produce less human-like text than older models. We discuss the implications of these results for language pedagogy and related applications.
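
A few of the six dimensions can be made concrete with a minimal sketch. The operationalizations below are assumptions for illustration only and are not taken from the paper: volume as token count, abundance as type count, variety-repetition as a plain type-token ratio, and evenness as normalized Shannon entropy. The function name `lexical_profile` and the whitespace tokenization are likewise hypothetical.

```python
import math
from collections import Counter


def lexical_profile(text: str) -> dict:
    """Compute four simple lexical-diversity indicators for a text.

    Illustrative stand-ins only; not the measures used in the study.
    """
    tokens = text.lower().split()      # naive whitespace tokenization (assumption)
    counts = Counter(tokens)
    volume = len(tokens)               # total number of tokens
    abundance = len(counts)            # number of distinct word types
    variety = abundance / volume if volume else 0.0  # type-token ratio

    # Shannon entropy of the type distribution, divided by its maximum
    # (log of the number of types). 1.0 means all types are equally frequent.
    if abundance > 1:
        entropy = -sum((c / volume) * math.log(c / volume) for c in counts.values())
        evenness = entropy / math.log(abundance)
    else:
        evenness = 0.0

    return {
        "volume": volume,
        "abundance": abundance,
        "variety": variety,
        "evenness": evenness,
    }
```

For the sentence "the cat saw the dog" this gives a volume of 5 tokens, an abundance of 4 types, and a variety of 0.8. A raw type-token ratio is sensitive to text length, which is one reason length-robust measures are usually preferred in practice.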

[59] Beyond Human Judgment: A Bayesian Evaluation of LLMs’ Moral Values Understanding

Maciej Skorski, Alina Landowska

Main category: cs.CL

TL;DR: Large language models typically rank among the top 25% of human annotators at detecting moral values, scoring well above average on balanced accuracy and producing far fewer false negatives.

Details

Motivation: To evaluate how large language models understand moral dimensions compared to humans using a comprehensive Bayesian approach that captures human disagreement uncertainty.

Method: Used a GPU-optimized Bayesian framework to evaluate top language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) on 250K+ annotations from nearly 700 annotators across 100K+ texts from social networks, news, and forums.

Result: AI models typically rank among the top 25% of human annotators, score much better than average on balanced accuracy, and produce far fewer false negatives than humans.

Conclusion: Large language models demonstrate superior moral detection capabilities compared to average human performance, with more sensitive detection and fewer false negatives.

Abstract: How do Large Language Models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluated the best language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from nearly 700 annotators in 100K+ texts spanning social networks, news and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, performing much better than average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
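
Balanced accuracy is the mean of sensitivity (true-positive rate) and specificity (true-negative rate), so an annotator who misses many positives is penalized even when positives are rare. The sketch below shows that relationship on fixed binary labels. It is a simplification: the paper models annotator disagreement in a Bayesian framework rather than scoring against a single ground truth, and the function name and toy labels are hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = moral content present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy labels (hypothetical): 4 positive texts, 6 negative texts.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cautious = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # three false negatives
sensitive = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # one false positive
```

Here the cautious labeler scores 0.625 while the sensitive one scores about 0.917, which illustrates why fewer false negatives can translate into a higher balanced accuracy.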

[60] RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Chih-Ho Hsu, Li-Hung Yao, Chun-Chieh Liao, Feng Liu, Fang-Ming Hung

Main category: cs.CL

TL;DR: RPRO is a reinforcement learning framework that enhances clinical reasoning in LLMs through preference-driven optimization and quality refinement, achieving superior performance with smaller models.

Details

Motivation: Existing LLMs generate clinically unreliable reasoning chains that lack factual accuracy in medical question answering, requiring better integration of domain knowledge and clinical workflows.

Method: Combines reinforcement learning with preference-driven reasoning refinement using task-adaptive templates, probabilistic evaluation, groupwise ranking optimization based on Bradley-Terry model, and KL-divergence regularization.

Result: Consistent improvements on PubMedQA, MedQA-USMLE, and FEMH datasets; 2B-parameter model outperforms larger 7B-20B models including medical-specialized variants.

Conclusion: Preference optimization with quality-driven refinement provides scalable and clinically grounded approach for building more reliable medical LLMs.

Abstract: Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO distinguishes itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns model outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley–Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA, MedQA-USMLE, and a real-world clinical dataset from Far Eastern Memorial Hospital (FEMH) demonstrate consistent improvements over strong baselines. Remarkably, our 2B-parameter model outperforms much larger 7B–20B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement provides a scalable and clinically grounded approach to building more reliable medical LLMs.
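
Under the Bradley-Terry model, the probability that candidate i is preferred over candidate j is sigmoid(s_i - s_j) for scores s. One way to extend this from pairs to a ranked group is to average the negative log-likelihood over every ordered pair in the ranking. The sketch below takes that route; it is an assumption about what a groupwise objective could look like, not RPRO's actual loss, and it omits the KL-divergence regularizer. The function name is hypothetical.

```python
import math
from itertools import combinations


def bradley_terry_group_loss(scores_ranked):
    """Average Bradley-Terry negative log-likelihood over a ranked group.

    scores_ranked: model scores for the candidates, ordered from best to
    worst according to the reference ranking.
    """
    pairs = list(combinations(range(len(scores_ranked)), 2))
    if not pairs:
        raise ValueError("need at least two candidates")
    total = 0.0
    for i, j in pairs:  # candidate i is ranked above candidate j
        margin = scores_ranked[i] - scores_ranked[j]
        # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably
        total += math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
    return total / len(pairs)
```

Scores that agree with the reference ranking give a low loss, scores that invert it give a high loss, and two tied scores give exactly log 2.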

[61] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering

Lovely Yeswanth Panchumarthi, Sai Prasad Gudari, Atharva Negi, Praveen Raj Budime, Harsit Upadhya

Main category: cs.CL

TL;DR: RAG-BioQA is a retrieval-augmented generation framework that produces evidence-based long-form biomedical answers, outperforming existing methods on PubMedQA.

Details

Motivation: The exponential growth of biomedical literature makes precise medical information hard to access, and current biomedical QA systems provide only short answers that lack the comprehensive explanations needed for clinical decision-making.

Method: Combines retrieval-augmented generation with domain-specific fine-tuning, using BioBERT embeddings with FAISS indexing and comparing re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before evidence synthesis via fine-tuned T5 model.

Result: Significant improvements over baselines on PubMedQA dataset, with substantial gains across BLEU, ROUGE, and METEOR metrics.

Conclusion: Advances the state of accessible, evidence-based biomedical knowledge retrieval by providing comprehensive long-form answers.

Abstract: The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.
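
Of the re-ranking strategies compared, BM25 is the simplest to show in isolation. The sketch below re-ranks a list of already-retrieved candidate passages with the Okapi BM25 formula, using the non-negative IDF variant log((N - n + 0.5) / (n + 0.5) + 1). It is a self-contained illustration, not the authors' pipeline: the function name, whitespace tokenization, and the default `k1` and `b` values are assumptions, and the BioBERT/FAISS retrieval step and T5 generation are not shown.

```python
import math
from collections import Counter


def bm25_rerank(query, docs, k1=1.5, b=0.75):
    """Return docs sorted by descending BM25 score for the query."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]  # naive tokenization (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scored = []
    for doc, toks in zip(docs, tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer passages need more matches.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scored.append((score, doc))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```

Because BM25 relies on exact term overlap, it is typically used alongside dense retrieval rather than as a replacement for it; neural re-rankers such as ColBERT or MonoT5 score query-passage pairs with learned models instead.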

[62] LLM one-shot style transfer for Authorship Attribution and Verification

Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho

Main category: cs.CL

TL;DR: Proposes an unsupervised authorship analysis method using LLMs’ pre-training and in-context learning capabilities, measuring style transferability through log-probabilities to outperform existing approaches while controlling for topical correlations.

Details

Motivation: Existing computational stylometry methods often confuse style with topic due to spurious correlations in data, and LLMs' CLM pre-training has been underutilized for general authorship problems despite their natural fit for AI-generated text detection.

Method: Unsupervised approach leveraging LLMs’ extensive pre-training and in-context learning, using log-probabilities to measure style transferability between texts.

Result: Significantly outperforms LLM prompting approaches of comparable scale and achieves higher accuracy than contrastively trained baselines when controlling for topical correlations. Performance scales with model size and test-time computation.

Conclusion: The method enables flexible trade-offs between computational cost and accuracy, demonstrating the effectiveness of leveraging LLMs’ pre-training for authorship analysis tasks.

Abstract: Computational stylometry analyzes writing style through quantitative patterns in text, supporting applications from forensic tasks such as identity linking and plagiarism detection to literary attribution in the humanities. Supervised and contrastive approaches rely on data with spurious correlations and often confuse style with topic. Despite their natural use in AI-generated text detection, the CLM pre-training of modern LLMs has been scarcely leveraged for general authorship problems. We propose a novel unsupervised approach based on this extensive pre-training and the in-context learning capabilities of LLMs, employing the log-probabilities of an LLM to measure style transferability from one text to another. Our method significantly outperforms LLM prompting approaches of comparable scale and achieves higher accuracy than contrastively trained baselines when controlling for topical correlations. Moreover, performance scales fairly consistently with the size of the base model and, in the case of authorship verification, with an additional mechanism that increases test-time computation; enabling flexible trade-offs between computational cost and accuracy.
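
The decision rule behind a log-probability approach can be sketched as follows: score a disputed text under a language model conditioned on a sample from each candidate author, then pick the author whose sample makes the text most probable. The sketch is a heavy simplification and not the authors' method. A smoothed unigram model stands in for the LLM so the example runs without dependencies, and all names (`toy_log_prob`, `attribute_author`) are hypothetical. Unlike the approach described above, a unigram stand-in does not separate style from topic.

```python
import math
from collections import Counter


def toy_log_prob(context: str, continuation: str) -> float:
    """Log-probability of `continuation` under an add-one-smoothed unigram
    model estimated from `context`. Stand-in for an LLM scoring call."""
    ctx_counts = Counter(context.lower().split())
    words = continuation.lower().split()
    vocab = set(ctx_counts) | set(words)
    total = sum(ctx_counts.values()) + len(vocab)
    return sum(math.log((ctx_counts[w] + 1) / total) for w in words)


def attribute_author(target: str, candidates: dict, log_prob=toy_log_prob):
    """Return (best_author, scores), where scores[author] is the
    log-probability of `target` given that author's example text."""
    scores = {author: log_prob(example, target)
              for author, example in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

In a realistic setting `log_prob` would sum the token log-probabilities returned by a causal language model, which is where the reported scaling with model size would come from.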

[63] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL with GRPO

Ashish Kattamuri, Ishita Prasad, Meetu Malhotra, Arpita Vats, Rahul Raja, Albert Lie

Main category: cs.CL

TL;DR: A new framework combining Group Relative Policy Optimization (GRPO) with multilingual contrastive reward signals improves Text-to-SQL performance in cross-lingual scenarios, enhancing both execution and semantic accuracy without large datasets.

Details

Motivation: Current Text-to-SQL methods focus only on executable queries and overlook semantic alignment challenges, with significant performance drops in non-English languages (average 6 percentage points decline).

Method: Combines Group Relative Policy Optimization (GRPO) with multilingual contrastive reward signals to enhance task efficiency and semantic accuracy, teaching models better correspondence between SQL generation and user intent through semantic similarity-based rewards.

Result: On the seven-language MultiSpider dataset, fine-tuning with GRPO raised execution accuracy to 87.4% (+26 pp over zero-shot) and semantic accuracy to 52.29% (+32.86 pp). Adding the contrastive reward further improved average semantic accuracy to 59.14% (+6.85 pp). A 3B LLaMA model fine-tuned with the contrastive reward outperformed a zero-shot 8B model in execution accuracy (88.86% vs. 81.43%, +7.43 pp) while still trailing it in semantic accuracy (59.14% vs. 68.57%), using only 3,000 training examples.

Conclusion: The framework demonstrates significant improvements in Text-to-SQL performance through contrastive rewards for semantic alignment, achieving strong results with smaller models and minimal training data in cross-lingual scenarios.

Abstract: Current Text-to-SQL methods focus, and are evaluated, only on executable queries, overlooking the semantic alignment challenge – both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by incorporating a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments showcase that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) – all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
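
GRPO scores each sampled completion relative to the other completions drawn for the same prompt: the advantage is the reward minus the group mean, divided by the group standard deviation. The sketch below shows that normalization together with one plausible way to add a semantic-similarity term to an execution reward. The additive combination, the `weight` value, the use of cosine similarity between embeddings, and all function names are assumptions for illustration; the paper's exact reward formulation is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_reward(executes_correctly, sql_embedding, question_embedding, weight=0.5):
    """Execution reward (0 or 1) plus a weighted semantic-similarity term.

    The additive form and the weight are hypothetical choices.
    """
    semantic = cosine_similarity(sql_embedding, question_embedding)
    return float(executes_correctly) + weight * semantic


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]
```

With group rewards of `[1.0, 0.0, 0.0, 1.0]` the advantages come out close to `[1, -1, -1, 1]`. When every completion in a group earns the same reward the advantages are all zero, so a graded similarity term can supply a learning signal in groups where the binary execution reward alone would not.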

[64] ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation

Haoxuan Zhang, Ruochi Li, Sarthak Shrestha, Shree Harshini Mamidala, Revanth Putta, Arka Krishan Aggarwal, Ting Xiao, Junhua Ding, Haihua Chen

Main category: cs.CL

TL;DR: ReviewGuard is an automated system that detects deficient peer reviews using LLMs, addressing challenges from increased submissions and AI-generated reviews. It uses a four-stage framework with real and synthetic data to train models that identify deficient reviews based on features like lower ratings, higher confidence, and simpler structure.

Details

Motivation: The surge in academic submissions and widespread use of LLMs in scholarly evaluation creates challenges for peer review quality. Unchecked deficient reviews from both human experts and AI systems threaten academic integrity, necessitating automated detection systems.

Method: Four-stage LLM-driven framework: data collection from ICLR/NeurIPS on OpenReview, GPT-4.1 annotation with human validation, synthetic data augmentation (6,634 papers with 24,657 real and 46,438 synthetic reviews), and fine-tuning encoder-based models and open-source LLMs.

Result: Deficient reviews show lower rating scores, higher self-reported confidence, reduced structural complexity, and more negative sentiment. AI-authored reviews increased dramatically post-ChatGPT. Mixed training improved detection performance significantly (Qwen 3-8B: recall 0.6653, F1 0.7073 vs. 0.5499 and 0.5606 baseline).

Conclusion: ReviewGuard is the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review. The system successfully identifies problematic reviews and shows improved performance with synthetic data augmentation.

Abstract: Peer review serves as the gatekeeper of science, yet the surge in submissions and widespread adoption of large language models (LLMs) in scholarly evaluation present unprecedented challenges. While recent work has focused on using LLMs to improve review efficiency, unchecked deficient reviews from both human experts and AI systems threaten to systematically undermine academic integrity. To address this issue, we introduce ReviewGuard, an automated system for detecting and categorizing deficient reviews through a four-stage LLM-driven framework: data collection from ICLR and NeurIPS on OpenReview, GPT-4.1 annotation with human validation, synthetic data augmentation yielding 6,634 papers with 24,657 real and 46,438 synthetic reviews, and fine-tuning of encoder-based models and open-source LLMs. Feature analysis reveals that deficient reviews exhibit lower rating scores, higher self-reported confidence, reduced structural complexity, and more negative sentiment than sufficient reviews. AI-generated text detection shows dramatic increases in AI-authored reviews since ChatGPT’s emergence. Mixed training with synthetic and real data substantially improves detection performance - for example, Qwen 3-8B achieves recall of 0.6653 and F1 of 0.7073, up from 0.5499 and 0.5606 respectively. This study presents the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review. Code, prompts, and data are available at https://github.com/haoxuan-unt2024/ReviewGuard

[65] AI use in American newspapers is widespread, uneven, and rarely disclosed

Jenna Russell, Marzena Karpinska, Destiny Akinode, Katherine Thai, Bradley Emi, Max Spero, Mohit Iyyer

Main category: cs.CL

TL;DR: About 9% of audited newspaper articles are partially or fully AI-generated, with use unevenly distributed across outlets and topics and rarely disclosed, highlighting the need for transparency standards.

Details

Motivation: To address the unclear extent of AI use in published journalism and understand its distribution patterns across different types of newspapers and content.

Method: Audited 186K articles from 1.5K American newspapers using Pangram AI detector, plus manual review of 100 AI-flagged articles and analysis of 45K opinion pieces from major publications.

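Result: About 9% of articles are partially or fully AI-generated, more commonly in smaller/local outlets, weather/tech topics, and certain ownership groups. Opinion pieces in the three major publications are 6.4x more likely to contain AI-generated content than news articles from the same publications. Disclosure is minimal: only 5 of the 100 manually audited AI-flagged articles disclosed AI use.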

Conclusion: Immediate need for greater transparency and updated editorial standards regarding AI use in journalism to maintain public trust, given widespread but undisclosed AI adoption.

Abstract: AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.

[66] AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs

Mo El-Haj, Paul Rayson

Main category: cs.CL

TL;DR: Domain-specific adaptation improves Arabic financial summarization, with the domain-adapted FinAraT5 producing more coherent summaries than general models, particularly for quantitative and entity-centred content.

Details

Motivation: To investigate how domain specificity affects abstractive summarization of Arabic financial texts and address the lack of large-scale Arabic financial datasets.

Method: Created the AraFinNews dataset (212,500 article-headline pairs) and evaluated transformer models (mT5, AraT5, and the domain-adapted FinAraT5) to assess the effect of financial-domain pretraining.

Result: Domain-adapted models produced more coherent summaries, especially for quantitative and entity-centered information.

Conclusion: Domain-specific adaptation improves narrative fluency in Arabic financial summarization.

Abstract: This paper examines how domain specificity affects abstractive summarisation of Arabic financial texts using large language models (LLMs). We present AraFinNews, the largest publicly available Arabic financial news dataset to date, comprising 212,500 article-headline pairs spanning almost a decade of reporting from October 2015 to July 2025. Developed as an Arabic counterpart to major English summarisation corpora such as CNN/DailyMail, AraFinNews offers a strong benchmark for assessing domain-focused language understanding and generation in financial contexts. Using this resource, we evaluate transformer-based models, including mT5, AraT5 and the domain-adapted FinAraT5, to investigate how financial-domain pretraining influences accuracy, numerical reliability and stylistic alignment with professional reporting. The results show that domain-adapted models produce more coherent summaries, particularly when handling quantitative and entity-centred information. These findings underscore the value of domain-specific adaptation for improving narrative fluency in Arabic financial summarisation. The dataset is freely available for non-commercial research at https://github.com/ArabicNLP-UK/AraFinNews.

[67] Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties

Célian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl

Main category: cs.CL

TL;DR: SLMs struggle with rare properties in relation extraction. The paper shows that ensuring each property appears above a threshold in training data is the best strategy for balanced performance.

Details

Motivation: To investigate how small language models handle both datatype and object properties for complete RDF graph extraction, focusing on the challenge of long-tail distribution of rare properties.

Method: Evaluated several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation to address the rare property bottleneck.

Result: The best strategy is to build a training set where the number of occurrences of each property exceeds a given threshold, enabling equal performance across unbalanced target properties.

Conclusion: Provides practical guidance for training shape-aware SLMs and highlights promising directions for future work in semantic relation extraction. Datasets, results, and code are publicly released for reproducibility.

Abstract: Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for a complete RDF graph extraction. We show that the key bottleneck is related to the long-tail distribution of rare properties. To solve this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy to perform equally well over unbalanced target properties is to build a training set where the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we publicly released our datasets, experimental results and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
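
The threshold strategy named in this abstract can be sketched as a greedy selection over candidate examples. This is only an illustration under assumptions: the function name, the property names, and the "at least `threshold`" reading (rather than strictly greater) are hypothetical, and the paper's actual sampling procedure is not given in this summary.

```python
from collections import Counter

def select_with_min_occurrences(examples, threshold):
    """Greedily pick examples until every property occurs at least `threshold` times.

    `examples` is a list of (example_id, set_of_properties) pairs.
    Returns the selected ids, the per-property counts, and any property
    that could not reach the threshold with the available examples.
    """
    counts = Counter()
    selected = []
    all_props = set().union(*(props for _, props in examples))
    for ex_id, props in examples:
        # Keep an example only if it helps a property that is still under-represented.
        if any(counts[p] < threshold for p in props):
            selected.append(ex_id)
            counts.update(props)
    under = sorted(p for p in all_props if counts[p] < threshold)
    return selected, counts, under

# Hypothetical candidates: a frequent datatype property and a rare object property.
examples = [
    ("e1", {"birthDate"}),
    ("e2", {"birthDate"}),
    ("e3", {"birthDate", "spouse"}),
    ("e4", {"birthDate"}),
    ("e5", {"spouse"}),
    ("e6", {"birthDate"}),
]
selected, counts, under = select_with_min_occurrences(examples, threshold=2)
```

In this toy run the frequent property stops attracting new examples once it is covered, while examples carrying the rare property keep being added until it also reaches the threshold.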

[68] A systematic review of relation extraction task since the emergence of Transformers

Célian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl

Main category: cs.CL

TL;DR: Systematic review of relation extraction research since Transformer models, analyzing 34 surveys, 64 datasets, and 104 models from 2019-2024 to identify trends, limitations, and future directions.

Details

Motivation: To provide a comprehensive overview of relation extraction advancements since Transformer models emerged, consolidating research across multiple dimensions for researchers and practitioners.

Method: Used automated framework to collect and annotate publications, systematically analyzing surveys, datasets, and models published between 2019-2024.

Result: Identified methodological advances, benchmark resources, integration of semantic web technologies, current trends, limitations, and open challenges in relation extraction.

Conclusion: Offers a comprehensive reference for understanding the evolution and future directions of relation extraction research since Transformer-based models.

Abstract: This article presents a systematic review of relation extraction (RE) research since the advent of Transformer-based models. Using an automated framework to collect and annotate publications, we analyze 34 surveys, 64 datasets, and 104 models published between 2019 and 2024. The review highlights methodological advances, benchmark resources, and the integration of semantic web technologies. By consolidating results across multiple dimensions, the study identifies current trends, limitations, and open challenges, offering researchers and practitioners a comprehensive reference for understanding the evolution and future directions of RE.

[69] Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods

Eva Prakash, Maayane Attias, Pierre Chambon, Justin Xu, Steven Truong, Jean-Benoit Delbrouck, Tessa Cook, Curtis Langlotz

Main category: cs.CL

TL;DR: Transformer-based model trained on large radiology datasets outperforms commercial systems in PHI de-identification with F1 scores up to 0.996, establishing new benchmark for secure clinical text processing.

Details

Motivation: To enhance automated de-identification of radiology reports by scaling transformer-based models and benchmarking against commercial cloud vendor systems for PHI detection.

Method: Fine-tuned transformer-based PHI de-identification pipeline on two large annotated radiology corpora from Stanford University, introduced additional AGE category, and evaluated using token-level PHI detection, synthetic PHI generation, and comparison with commercial systems.

Result: Achieved overall F1 scores of 0.973 (Penn) and 0.996 (Stanford), outperforming previous models. Synthetic PHI evaluation showed consistent detectability (F1: 0.959). Model outperformed all vendor systems on synthetic reports (F1: 0.960 vs. 0.632-0.754).

Conclusion: Large-scale multimodal training improved cross-institutional generalization and robustness. The transformer-based model establishes a new benchmark for secure clinical text processing, outperforming both academic and commercial systems.

Abstract: Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a “hide-in-plain-sight” method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
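
For readers unfamiliar with token-level scoring, here is a minimal sketch of how precision, recall, and F1 could be computed over aligned token labels. It assumes micro-averaging across categories and an `"O"` label for non-PHI tokens; apart from AGE, the label names and the example sequence are made up, and the paper's exact scoring protocol may differ.

```python
def token_level_prf(gold, pred, outside="O"):
    """Micro-averaged precision, recall, and F1 over aligned token labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must align token by token")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1          # PHI token found with the right category
        else:
            if p != outside:
                fp += 1      # predicted PHI that is spurious or mis-categorized
            if g != outside:
                fn += 1      # gold PHI that was missed or mis-categorized
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical seven-token report fragment.
gold = ["O", "NAME", "NAME", "O", "AGE", "O", "DATE"]
pred = ["O", "NAME", "O", "O", "AGE", "DATE", "DATE"]
precision, recall, f1 = token_level_prf(gold, pred)
```

One missed NAME token costs recall and one spurious DATE token costs precision, which is why de-identification results are usually reported with all three numbers.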

[70] When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

Shaowen Wang, Yiqi Dong, Ruinian Chang, Tansheng Zhu, Yuebo Sun, Kaifeng Lyu, Jian Li

Main category: cs.CL

TL;DR: LLMs exhibit hallucinations driven by spurious correlations in training data, which evade current detection methods and persist despite model scaling and refusal fine-tuning.

Details

Motivation: To highlight a critical class of hallucinations caused by spurious correlations between features and attributes in training data, which remain underexplored and problematic.

Method: Systematically controlled synthetic experiments and empirical evaluations on state-of-the-art LLMs (including GPT-5), analyzing failure of existing detection methods like confidence-based filtering and inner-state probing.

Result: Spurious correlation-induced hallucinations are confidently generated, immune to model scaling, evade current detection methods, and persist after refusal fine-tuning. Existing detection techniques fundamentally fail in these cases.

Conclusion: There is an urgent need for new approaches specifically designed to address hallucinations caused by spurious correlations, as current methods are fundamentally inadequate for this class of errors.

Abstract: Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations – superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing, fundamentally fail in the presence of spurious correlations. Our theoretical analysis further elucidates why these statistical biases intrinsically undermine confidence-based detection techniques. Our findings thus emphasize the urgent need for new approaches explicitly designed to address hallucinations caused by spurious correlations.

[71] From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

Wenxin Zhu, Andong Chen, Yuchen Song, Kehai Chen, Conghui Zhu, Ziyan Chen, Tiejun Zhao

Main category: cs.CL

TL;DR: This paper provides a systematic review of Multimodal Chain-of-Thought (MCoT), analyzing its background, methods, evaluation benchmarks, applications, challenges, and future directions for enhancing reasoning in multimodal large language models.

Details

Motivation: Enhance complex reasoning capabilities of Multimodal Large Language Models (MLLMs) by extending Chain-of-Thought reasoning to multimodal domains, addressing challenges like opaque reasoning paths and insufficient generalization.

Method: Systematic review approach analyzing MCoT from three aspects: CoT paradigms, post-training stage, and inference stage, while examining underlying mechanisms and organizing existing research.

Result: Comprehensive analysis of MCoT’s theoretical foundations, methodological approaches, evaluation frameworks, and practical applications in multimodal reasoning.

Conclusion: MCoT shows promise for improving multimodal reasoning but faces challenges that require future research directions to address current limitations and advance the field.

Abstract: With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on “Multimodal Chain-of-Thought” (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.

[72] Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study

Mihai Nadas, Laura Diosan

Main category: cs.CL

TL;DR: Evaluation of various LLMs for Romanian diacritic restoration shows GPT-4o achieves high accuracy, while models like Llama show variability, highlighting the importance of model architecture, training data, and prompt design.

Details

Motivation: Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks like Romanian, requiring effective NLP tools.

Method: Tested multiple LLMs including GPT-3.5, GPT-4, GPT-4o, Gemini 1.0 Pro, Llama 2/3, Mixtral 8x7B, airoboros 70B, and RoLlama 2 7B using comprehensive corpus with various prompt templates from zero-shot to multi-shot instructions.

Result: GPT-4o achieves high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while Meta’s Llama family shows wider variability in performance.

Conclusion: Model architecture, training data, and prompt design significantly impact diacritic restoration performance, outlining directions for improving NLP tools for diacritic-rich languages.

Abstract: Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI’s GPT-3.5, GPT-4, GPT-4o, Google’s Gemini 1.0 Pro, Meta’s Llama 2 and Llama 3, MistralAI’s Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro’s RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta’s Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
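
To make the baseline comparison concrete, the sketch below assumes that an "echo baseline" simply returns the diacritic-free input unchanged and that accuracy is measured per character. Both are assumptions: the abstract does not define the baseline or the metric, and the sample phrase is made up.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (e.g. 'ă' -> 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def char_accuracy(reference: str, hypothesis: str) -> float:
    """Share of character positions where the hypothesis matches the reference."""
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if len(reference) != len(hypothesis):
        raise ValueError("texts must have the same length after normalization")
    if not reference:
        return 1.0
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return matches / len(reference)

reference = "țară și pădure"               # gold text with diacritics
stripped = strip_diacritics(reference)     # what a model would receive as input
echo_output = stripped                     # echo baseline: return the input as-is
echo_score = char_accuracy(reference, echo_output)
```

Because most characters carry no diacritic, an echo baseline already scores well on a per-character metric, so a model is only useful if it clears that floor.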

[73] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

Zachary Ellis, Jared Joselowitz, Yash Deo, Yajie He, Anna Kalygina, Aisling Higham, Mana Rahimzadeh, Yan Jia, Ibrahim Habli, Ernest Lim

Main category: cs.CL

TL;DR: WER is inadequate for evaluating ASR in clinical dialogue. The paper introduces an LLM-as-a-Judge system optimized with GEPA and DSPy that achieves human-comparable performance in assessing clinical impact of transcription errors.

Details

Motivation: Standard ASR evaluations using Word Error Rate (WER) don't correlate well with clinical impact of transcription errors in doctor-patient dialogues, creating safety risks in clinical deployment.

Method: Established gold-standard benchmark with expert clinicians labeling clinical impact, then developed LLM-as-a-Judge system optimized using GEPA through DSPy to replicate expert clinical assessment.

Result: The optimized Gemini-2.5-Pro judge achieved 90% accuracy and Cohen’s κ of 0.816, showing human-comparable performance in assessing clinical impact of ASR errors.

Conclusion: Provides a validated, automated framework for moving ASR evaluation beyond textual fidelity to scalable safety assessment in clinical dialogue.

Abstract: As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA through DSPy to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen’s κ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
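
The two metrics named in this abstract can be sketched in a few lines. The implementation below uses the textbook definitions of WER (word-level edit distance divided by reference length) and Cohen's κ. The utterances and label sequences are invented for illustration and are not taken from the paper's datasets.

```python
from collections import Counter

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def cohens_kappa(rater_a, rater_b) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two invented ASR errors with the same WER but very different meaning changes.
wer_negation = word_error_rate("patient denies chest pain", "patient has chest pain")
wer_spelling = word_error_rate("patient denies chest pain", "patient denies chess pain")

# Invented impact labels from a clinician and from an automated judge.
clinician = ["No", "No", "Minimal", "Significant", "No", "Significant"]
judge = ["No", "No", "Minimal", "Significant", "Minimal", "Significant"]
kappa = cohens_kappa(clinician, judge)
```

Both invented errors score a WER of 0.25, yet only the first reverses the clinical meaning, which is the gap a label-based evaluation is meant to expose.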

cs.CV

[74] RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

Linfeng Dong, Yuchen Yang, Hao Wu, Wei Wang, Yuenan Hou, Zhihang Zhong, Xiao Sun

Main category: cs.CV

TL;DR: RacketVision is a novel sports analytics dataset for racket sports (table tennis, tennis, badminton) with fine-grained racket pose and ball position annotations, enabling research on ball tracking, racket pose estimation, and trajectory forecasting.

Details

Motivation: To advance computer vision in sports analytics by providing the first large-scale dataset with fine-grained racket pose annotations alongside ball positions, addressing the gap in complex human-object interaction research in racket sports.

Method: Created a dataset with three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Evaluated baseline methods and found that a CrossAttention mechanism is essential for effective multi-modal fusion of racket pose features.

Result: Evaluation revealed that naive concatenation of racket pose features degrades performance, whereas a CrossAttention mechanism unlocks their value, leading to trajectory prediction results that surpass strong unimodal baselines.

Conclusion: RacketVision provides a versatile resource and strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports, with the key insight that CrossAttention is crucial for effective multi-modal fusion.

Abstract: We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at https://github.com/OrcustD/RacketVision
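
The fusion idea mentioned in this abstract can be illustrated with a bare-bones cross-attention step in which ball-trajectory features act as queries and racket-pose features act as keys and values. This is a generic single-head sketch without learned projections, not the paper's module; the feature values and variable names are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

ball_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 time steps of ball features
racket_feats = [[2.0, 0.0], [0.0, 2.0]]            # 2 racket-pose feature vectors
fused, attn = cross_attention(ball_feats, racket_feats, racket_feats)
```

Unlike concatenation, which appends every racket feature to every time step regardless of relevance, the attention weights let each time step draw mostly on the racket features that align with it.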

[75] The persistence of painting styles

Reetikaa Reddy Munnangi, Barbara Giunti

Main category: cs.CV

TL;DR: Persistent homology, a method from topological data analysis, is used to objectively identify and differentiate artistic styles, including distinguishing between human artists and AI-generated art.

Details

Motivation: Traditional art style identification relies on subjective human expertise; this work aims to provide objective, mathematical methods for analyzing artistic styles.

Method: Applied persistent homology (PH), a topological data analysis technique, to analyze artistic styles through mathematical structure and statistical analysis.

Result: PH can statistically differentiate between artists from different artistic movements and within the same movement, and can distinguish human artist works from AI-generated images in their style.

Conclusion: Persistent homology provides objective, interpretable insights into artistic styles, offering a mathematical framework for art analysis that complements traditional subjective approaches.

Abstract: Art is a deeply personal and expressive medium, where each artist brings their own style, technique, and cultural background into their work. Traditionally, identifying artistic styles has been the job of art historians or critics, relying on visual intuition and experience. However, with the advancement of mathematical tools, we can explore art through a more structured lens. In this work, we show how persistent homology (PH), a method from topological data analysis, provides objective and interpretable insights on artistic styles. We show how PH can, with statistical certainty, differentiate between artists, both from different artistic currents and from the same one, and distinguish images of an artist from an AI-generated image in the artist’s style.
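
For readers new to persistent homology, the sketch below computes its simplest case: 0-dimensional persistence (connected components) of a Vietoris-Rips filtration on a small point cloud. It is a generic illustration only. The paper works with paintings and this summary does not state its filtration, so the point-cloud setting, the convention that an edge appears at a scale equal to its length, and the sample points are all assumptions.

```python
import math
from itertools import combinations

def h0_persistence(points):
    """0-dimensional persistence pairs (birth, death) for a point cloud.

    Every point is born at scale 0; a component dies at the length of the
    edge that merges it into another component (Kruskal-style union-find).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairs.append((0.0, length))
    if points:
        pairs.append((0.0, math.inf))  # the last surviving component never dies
    return pairs

pairs = h0_persistence([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])
```

The resulting list of birth and death scales is the kind of summary that can then be compared statistically across groups of inputs.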

[76] AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos

Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang

Main category: cs.CV

TL;DR: Proposes a multimodal self-supervised learning approach using AV-HuBERT and transformer-based models to detect audio-visual deepfakes by exploiting inconsistencies between audio and visual modalities.

Details

Motivation: Unimodal deepfake detectors struggle with multimodal manipulations, and timely detection is crucial to prevent spread of false propaganda and fake news. Existing methods mainly use unimodal video forensics with supervised pre-training.

Method: Uses multimodal SSL feature extractor (AV-HuBERT) for audio-visual features, multi-scale temporal CNN for temporal correlation, and additional transformer-based video model for facial features and spatial-temporal artifacts from deepfake generation.

Result: Outperforms all existing models and achieves state-of-the-art performance on FakeAVCeleb and DeepfakeTIMIT datasets.

Conclusion: Multimodal SSL-based approach effectively detects audio-visual deepfakes by exploiting cross-modal inconsistencies, demonstrating superior performance over unimodal methods.

Abstract: Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multimodal models that can exploit both pieces of information simultaneously. However, previous methods mainly adopt unimodal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multimodal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multimodal video forgery detection. We use the transformer-based SSL pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts caused during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.

[77] Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions

Takuya Igaue, Catia Correia-Caeiro, Akito Yoshida, Takako Miyabe-Nishiwaki, Ryusuke Hayashi

Main category: cs.CV

TL;DR: Proposed method generates diverse macaque monkey facial expressions using StyleGAN2 with data augmentation, sample selection, and loss refinement to overcome limited training data.

Details

Motivation: Limited training images for animal faces, especially macaque monkeys used in neuroscience research, make generating facial expressions challenging due to insufficient quantity and variation.

Method: Used StyleGAN2 with three key approaches: 1) Data augmentation by animating still images with motion transfer, 2) Sample selection based on latent representations for uniform dataset variation, 3) Loss function refinement for accurate eye movements.

Result: Generated diverse facial expressions for multiple macaque individuals, outperforming models trained only on original still images. Model enables style-based image editing with specific parameters corresponding to distinct facial movements.

Conclusion: Method successfully disentangles motion components as style parameters, providing valuable tool for macaque facial expression research in neuroscience and evolutionary studies.

Abstract: Generating animal faces using generative AI techniques is challenging because the available training images are limited both in quantity and variation, particularly for facial expressions across individuals. In this study, we focus on macaque monkeys, widely studied in systems neuroscience and evolutionary research, and propose a method to generate their facial expressions using a style-based generative image model (i.e., StyleGAN2). To address data limitations, we implemented: 1) data augmentation by synthesizing new facial expression images using a motion transfer to animate still images with computer graphics, 2) sample selection based on the latent representation of macaque faces from an initially trained StyleGAN2 model to ensure the variation and uniform sampling in training dataset, and 3) loss function refinement to ensure the accurate reproduction of subtle movements, such as eye movements. Our results demonstrate that the proposed method enables the generation of diverse facial expressions for multiple macaque individuals, outperforming models trained solely on original still images. Additionally, we show that our model is effective for style-based image editing, where specific style parameters correspond to distinct facial movements. These findings underscore the model’s potential for disentangling motion components as style parameters, providing a valuable tool for research on macaque facial expressions.

[78] Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition

Daiqing Wu, Dongbao Yang, Huawen Shen, Can Ma, Yu Zhou

Main category: cs.CV

TL;DR: CoDe network addresses sentiment discrepancy in multimodal posts by complementing semantics with in-image text and decomposing representations to capture discrepant sentiments between image and text.

Details

Motivation: Existing multimodal sentiment detection methods fail to handle sentiment discrepancy between image and text in user-generated posts, leading to compromised performance.

Method: Proposes semantics completion (adding in-image text semantics) and decomposition (exclusive projection + contrastive learning) modules, followed by cross-attention fusion with discrepant sentiment.

Result: Extensive experiments on four datasets demonstrate superior performance compared to existing methods.

Conclusion: CoDe network effectively resolves sentiment discrepancy in multimodal content through explicit modeling of discrepant sentiments between modalities.

Abstract: With the proliferation of social media posts in recent years, the need to detect sentiments in multimodal (image-text) content has grown rapidly. Since posts are user-generated, the image and text from the same post can express different or even contradictory sentiments, leading to potential sentiment discrepancy. However, existing works mainly adopt a single-branch fusion structure that primarily captures the consistent sentiment between image and text. The ignorance or implicit modeling of discrepant sentiment results in compromised unimodal encoding and limited performance. In this paper, we propose a semantics Completion and Decomposition (CoDe) network to resolve the above issue. In the semantics completion module, we complement image and text representations with the semantics of the in-image text, helping bridge the sentiment gap. In the semantics decomposition module, we decompose image and text representations with exclusive projection and contrastive learning, thereby explicitly capturing the discrepant sentiment between modalities. Finally, we fuse image and text representations by cross-attention and combine them with the learned discrepant sentiment for final classification. Extensive experiments on four datasets demonstrate the superiority of CoDe and the effectiveness of each proposed module.

[79] PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation

Ting Pan, Ye Wang, Peiguang Jing, Rui Ma, Zili Yi, Yu Liu

Main category: cs.CV

TL;DR: Proposes PairHuman - the first large-scale benchmark dataset for dual-person portrait generation with 100K+ images, and DHumanDiff baseline method that enhances facial consistency while balancing personalized generation and scene creation.

Details

Motivation: The absence of a benchmark dataset hinders high-quality customization in dual-person portrait generation, which has potential applications in preserving emotional memories and wedding photography planning.

Method: Created PairHuman dataset with 100K+ images capturing various scenes, attire, and interactions with rich metadata. Developed DHumanDiff baseline method that enhances facial consistency and balances personalized person generation with semantic-driven scene creation.

Result: Experimental results demonstrate that the dataset and method produce highly customized portraits with superior visual quality tailored to human preferences.

Conclusion: The PairHuman dataset and DHumanDiff method successfully address the gap in dual-person portrait generation, enabling high-quality customized portraits that meet photographic standards.

Abstract: Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, which is the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, which is a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and simultaneously balances personalized person generation and semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at https://github.com/annaoooo/PairHuman.

[80] HDCompression: Hybrid-Diffusion Image Compression for Ultra-Low Bitrates

Lei Lu, Yize Li, Yanzhi Wang, Wei Wang, Wei Jiang

Main category: cs.CV

TL;DR: HDCompression is a dual-stream image compression framework that combines generative VQ-modeling, diffusion models, and conventional learned image compression to achieve both high fidelity and perceptual quality at ultra-low bitrates.

Details

Motivation: Address the limitations of conventional LIC (severe artifacts from heavy quantization) and generative VQ modeling (poor fidelity due to mismatch between learned priors and specific inputs) at ultra-low bitrates.

Method: Uses a dual-stream framework with diffusion models to extract high-quality complementary fidelity information from ground-truth input, improving index map prediction, enhancing LIC stream output, and refining VQ-latent correction. Features a lightweight diffusion model based on dense representative vectors with simple sampling schedulers.

Result: Outperforms previous conventional LIC, generative VQ-modeling, and hybrid frameworks in both quantitative metrics and qualitative visualization, providing balanced robust compression performance at ultra-low bitrates.

Conclusion: HDCompression successfully achieves both high fidelity and high perceptual quality in ultra-low bitrate image compression by effectively integrating multiple compression paradigms.

Abstract: Image compression under ultra-low bitrates remains challenging for both conventional learned image compression (LIC) and generative vector-quantized (VQ) modeling. Conventional LIC suffers from severe artifacts due to heavy quantization, while generative VQ modeling gives poor fidelity due to the mismatch between learned generative priors and specific inputs. In this work, we propose Hybrid-Diffusion Image Compression (HDCompression), a dual-stream framework that utilizes both generative VQ-modeling and diffusion models, as well as conventional LIC, to achieve both high fidelity and high perceptual quality. Different from previous hybrid methods that directly use pre-trained LIC models to generate low-quality fidelity-preserving information from heavily quantized latent, we use diffusion models to extract high-quality complementary fidelity information from the ground-truth input, which can enhance the system performance in several aspects: improving index map prediction, enhancing the fidelity-preserving output of the LIC stream, and refining conditioned image reconstruction with VQ-latent correction. In addition, our diffusion model is based on a dense representative vector (DRV), which is lightweight with very simple sampling schedulers. Extensive experiments demonstrate that our HDCompression outperforms the previous conventional LIC, generative VQ-modeling, and hybrid frameworks in both quantitative metrics and qualitative visualization, providing balanced robust compression performance at ultra-low bitrates.
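To make the "index map" in the VQ stream concrete, the following is a minimal sketch of generic vector quantization, not the HDCompression implementation. The function name quantize, the toy codebook, and the toy latents are assumptions chosen for illustration.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every latent (N, D) and every code (K, D).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)        # one integer per latent: the index map
    return indices, codebook[indices]  # indices plus the reconstructed latents

# Toy data (hypothetical): three 2-D codes and three latents.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]])
indices, recon = quantize(latents, codebook)
```

Only the integer indices need to be stored, which is why the approach suits very low bitrates; the fidelity loss is the gap between each latent and its chosen code.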

[81] A Machine Learning-Driven Solution for Denoising Inertial Confinement Fusion Images

Asya Y. Akkus, Bradley T. Wolfe, Pinghan Chu, Chengkun Huang, Chris S. Campbell, Mariana Alvarado Alvarez, Petr Volegov, David Fittinghoff, Robert Reinovsky, Zhehui Wang

Main category: cs.CV

TL;DR: Unsupervised autoencoder with CDF 97 wavelet transform effectively denoises neutron imaging data with mixed Gaussian-Poisson noise, outperforming traditional methods like BM3D in reconstruction error and edge preservation.

Details

Motivation: Neutron imaging is crucial for ICF analysis but images are degraded by mixed Gaussian-Poisson noise that conventional methods struggle to remove while preserving image fidelity. Recent synthetic data advances enable ML approaches.

Method: Implemented unsupervised autoencoder with Cohen-Daubechies-Feauveau (CDF 97) wavelet transform in the latent space for mixed Gaussian-Poisson denoising of neutron imaging data.

Result: Network successfully denoised neutron imaging data with lower reconstruction error and superior edge preservation metrics compared to non-ML methods like BM3D when benchmarked with forward model data.

Conclusion: This approach presents promising advancement in neutron image noise reduction and 3D reconstruction analysis for ICF experiments, demonstrating ML’s potential in fusion imaging.

Abstract: Neutron imaging is important in optimizing analysis of inertial confinement fusion (ICF) events such as those at the National Ignition Facility (NIF) and improving current and future ICF platforms. However, images of neutron sources are often degraded by various types of noise. Most commonly, Gaussian and Poisson noise often coexist within one image, obscuring fine details and blurring edges. These noise types often overlap, making them difficult to distinguish and remove using conventional filtering and thresholding methods. As a result, noise removal techniques that preserve image fidelity are important for analyzing and interpreting images of a neutron source. Current solutions include a combination of filtering and thresholding methodologies. In the past, machine learning approaches were rarely implemented due to a lack of ground truth neutron imaging data for ICF processes. However, recent advances in synthetic data production, particularly in the fusion imaging field, have opened opportunities to investigate new denoising procedures using both supervised and unsupervised machine learning methods. In this study, we implement an unsupervised autoencoder with a Cohen-Daubechies-Feauveau (CDF 97) wavelet transform in the latent space for mixed Gaussian-Poisson denoising. The network successfully denoises neutron imaging data. Additionally, it demonstrates lower reconstruction error and superior edge preservation metrics when benchmarked with data generated by a forward model and compared to non-ML-based filtering mechanisms such as Block-matching and 3D filtering (BM3D). This approach presents a promising advancement in neutron image noise reduction and three-dimensional reconstruction analysis of ICF experiments.
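As an illustration of what mixed Gaussian-Poisson noise looks like, here is a minimal sketch of a common synthetic noise model. It is not the paper's forward model; the function name add_mixed_noise and the parameters peak and sigma are assumptions for this example.

```python
import numpy as np

def add_mixed_noise(clean, peak=30.0, sigma=0.05, seed=0):
    """Corrupt `clean` (values in [0, 1]) with Poisson noise, then Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Poisson (shot) noise: its variance grows with the signal intensity.
    shot = rng.poisson(clean * peak) / peak
    # Gaussian noise: signal-independent, added on top of the shot noise.
    return shot + rng.normal(0.0, sigma, size=clean.shape)

# Hypothetical flat test image at half intensity.
clean = np.full((64, 64), 0.5)
noisy = add_mixed_noise(clean)
```

Because one component depends on the signal and the other does not, a single fixed filter or threshold cannot match both, which is the difficulty the abstract describes.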

[82] REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints

Di Wu, Liu Liu, Zhou Linli, Anran Huang, Liangtu Song, Qiaojun Yu, Qi Wu, Cewu Lu

Main category: cs.CV

TL;DR: REArtGS introduces geometric and motion constraints to 3D Gaussian primitives for high-fidelity textured surface reconstruction and dynamic generation of articulated objects from multi-view RGB images.

Details

Motivation: Existing methods struggle to achieve both high-fidelity textured surface reconstruction and dynamic generation for articulated objects, which are prevalent in human life and important for various applications.

Method: Uses unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields for better geometry, and establishes deformable fields constrained by kinematic structures to enable unsupervised generation of surface meshes in unseen states.

Result: Extensive experiments on synthetic and real datasets demonstrate high-quality textured surface reconstruction for given states and high-fidelity surface generation for unseen states.

Conclusion: REArtGS successfully addresses the challenge of achieving both realistic surface reconstruction and dynamic generation for articulated objects through geometric and motion constraints on 3D Gaussian primitives.

Abstract: Articulated objects are prevalent in human life, and their 3D representations play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling realistic surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of arbitrary two states of articulated objects, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometry constraints and improving surface reconstruction quality. Then we establish deformable fields for 3D Gaussians constrained by the kinematic structures of articulated objects, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate our approach achieves high-quality textured surface reconstruction for given states, and enables high-fidelity surface generation for unseen states. Project site: https://sites.google.com/view/reartgs/home.

[83] SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer

Main category: cs.CV

TL;DR: SAM 3 is a unified model for detecting, segmenting, and tracking objects using concept prompts (noun phrases, image exemplars, or both), achieving double the accuracy of existing systems in both image and video promptable concept segmentation.

Details

Motivation: To advance promptable concept segmentation (PCS) by creating a unified model that can handle various concept prompts and improve upon existing segmentation capabilities.

Method: Built a scalable data engine producing a dataset with 4M unique concept labels, developed an image-level detector and memory-based video tracker sharing a single backbone, and decoupled recognition/localization with a presence head.

Result: SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks.

Conclusion: SAM 3 represents a significant advancement in promptable concept segmentation and is open-sourced along with the new SA-Co benchmark.

Abstract: We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.

[84] Investigating self-supervised representations for audio-visual deepfake detection

Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata

Main category: cs.CV

TL;DR: Self-supervised representations show promise for audio-visual deepfake detection by capturing meaningful patterns across modalities, but fail to generalize reliably across datasets due to dataset-specific characteristics rather than superficial feature learning.

Details

Motivation: To systematically explore the potential of self-supervised representations for audio-visual deepfake detection, which remains underexplored compared to their use in other vision and speech tasks.

Method: Systematically evaluate self-supervised features across modalities (audio, video, multimodal) and domains (lip movements, generic visual content), assessing detection effectiveness, interpretability, and cross-modal complementarity.

Result: Most self-supervised features capture deepfake-relevant information that is complementary across modalities, with models attending to semantically meaningful regions rather than spurious artifacts. However, none generalize reliably across datasets.

Conclusion: Self-supervised representations learn meaningful patterns for deepfake detection but face fundamental challenges in achieving robust cross-domain performance, with generalization failure stemming from dataset characteristics rather than superficial feature learning.

Abstract: Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.

[85] SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge

Adeel Yousaf, Joseph Fioresi, James Beetham, Amrit Singh Bedi, Mubarak Shah

Main category: cs.CV

TL;DR: SaFeR-CLIP improves vision-language model safety by redirecting unsafe concepts to semantically closest safe alternatives, minimizing representational disruption and recovering up to 8.0% zero-shot accuracy while maintaining safety.

Details

Motivation: Traditional safety fine-tuning causes significant performance drops due to rigid alignment strategies that disrupt learned semantic structures by forcing unsafe concepts to single predefined safe targets.

Method: Proposed proximity-aware approach that redirects unsafe concepts to their semantically closest safe alternatives, implemented in SaFeR-CLIP framework with minimal intervention principle.

Result: SaFeR-CLIP recovers up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety, and introduces NSFW-Caps benchmark for rigorous safety evaluation under distributional shift.

Conclusion: Respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance in vision-language models.

Abstract: Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model’s learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SaFeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SaFeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift. Our work shows that respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.
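The proximity-aware idea can be sketched as a nearest-neighbor lookup in embedding space. This is an illustrative sketch only, assuming cosine similarity as the proximity measure; the function nearest_safe_targets and the toy embeddings are hypothetical and do not come from the SaFeR-CLIP code.

```python
import numpy as np

def nearest_safe_targets(unsafe_emb, safe_emb):
    """For each unsafe embedding, return the index of the most similar safe embedding."""
    # L2-normalize rows so the dot product equals cosine similarity.
    u = unsafe_emb / np.linalg.norm(unsafe_emb, axis=1, keepdims=True)
    s = safe_emb / np.linalg.norm(safe_emb, axis=1, keepdims=True)
    similarity = u @ s.T               # shape (num_unsafe, num_safe)
    return similarity.argmax(axis=1)   # per-concept target, not one shared target

# Toy 2-D embeddings (hypothetical stand-ins for CLIP features).
safe = np.array([[1.0, 0.0], [0.0, 1.0]])
unsafe = np.array([[0.9, 0.2], [0.1, 0.8], [2.0, 1.0]])
targets = nearest_safe_targets(unsafe, safe)
```

Choosing a separate nearby target for each unsafe concept keeps the required shift in representation small, in contrast to forcing every concept toward a single predefined target.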

[86] Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

Dailan He, Guanlin Feng, Xingtong Ge, Yazhe Niu, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li

Main category: cs.CV

TL;DR: Neighbor GRPO is a new alignment algorithm for flow matching models that bypasses SDE conversion by perturbing initial noise conditions and using distance-based optimization, offering better efficiency and compatibility than SDE-based methods.

Details

Motivation: Applying GRPO to flow matching models is challenging due to deterministic sampling, and current SDE-based approaches suffer from inefficient credit assignment and incompatibility with high-order solvers.

Method: Generate diverse candidate trajectories by perturbing initial noise conditions of ODEs and optimize using softmax distance-based surrogate leaping policy, with symmetric anchor sampling and group-wise quasi-norm reweighting.

Result: Significantly outperforms SDE-based counterparts in training cost, convergence speed, and generation quality while preserving ODE sampling advantages.

Conclusion: Neighbor GRPO provides an effective alternative to SDE-based alignment that maintains deterministic sampling benefits while improving training efficiency and performance.

Abstract: Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
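The ingredients named in the abstract can be sketched roughly as follows. This is a loose illustration under stated assumptions, not the paper's objective: the Gaussian perturbation scheme, the squared Euclidean distance, the temperature parameter, and all function names are hypothetical. Only the reward normalization follows the standard GRPO convention.

```python
import numpy as np

def perturb_noise(init_noise, num_neighbors, scale, rng):
    """Sample neighbors around one initial noise vector (assumed Gaussian perturbation)."""
    offsets = rng.standard_normal((num_neighbors, init_noise.shape[0]))
    return init_noise[None, :] + scale * offsets

def softmax_distance_policy(output, candidates, temperature=1.0):
    """Softmax over negative squared distances from `output` to each candidate."""
    d2 = ((candidates - output[None, :]) ** 2).sum(axis=1)
    logits = -d2 / temperature
    logits = logits - logits.max()     # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def group_relative_advantages(rewards, tiny=1e-8):
    """Normalize rewards within one group, as in standard GRPO."""
    return (rewards - rewards.mean()) / (rewards.std() + tiny)

rng = np.random.default_rng(0)
init_noise = rng.standard_normal(4)
neighbors = perturb_noise(init_noise, num_neighbors=5, scale=0.1, rng=rng)
probs = softmax_distance_policy(neighbors[0], neighbors)
advantages = group_relative_advantages(np.array([1.0, 2.0, 3.0]))
```

In this sketch the candidate closest to the current output receives the highest probability, so weighting the log-probabilities by the group-relative advantages would pull the output toward high-reward neighbors without injecting SDE noise during sampling.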

[87] SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG

Mengnan Jiang, Zhaolin Sun, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang

Main category: cs.CV

TL;DR: A three-stage framework for generating multi-view consistent SVGs from single-view inputs using 3D lifting, spatial memory mechanisms, and path optimization.

Details

Motivation: To address the underexplored challenge of generating multi-view consistent SVGs from single-view inputs while maintaining geometric and color consistency.

Method: Three-stage approach: 1) Lift rasterized input to 3D and render multi-view images, 2) Extend SAM2’s temporal memory to spatial domain for part-level correspondences, 3) Perform path consolidation and structural optimization during raster-to-vector conversion.

Result: Generated SVGs exhibit strong geometric and color consistency across views, significantly reduced redundant paths, and preserved fine structural details.

Conclusion: Bridges generative modeling and structured vector representation, providing scalable multi-view SVG generation for applications like asset creation and semantic vector editing.

Abstract: Scalable Vector Graphics (SVGs) are central to modern design workflows, offering scaling without distortion and precise editability. However, for single-object SVGs, generating multi-view consistent SVGs from a single-view input remains underexplored. We present a three-stage framework that produces multi-view SVGs with geometric and color consistency from a single SVG input. First, the rasterized input is lifted to a 3D representation and rendered under target camera poses, producing multi-view images of the object. Next, we extend the temporal memory mechanism of Segment Anything 2 (SAM2) to the spatial domain, constructing a spatial memory bank that establishes part-level correspondences across neighboring views, yielding cleaner and more consistent vector paths and color assignments without retraining. Finally, during the raster-to-vector conversion, we perform path consolidation and structural optimization to reduce redundancy while preserving boundaries and semantics. The resulting SVGs exhibit strong geometric and color consistency across views, significantly reduce redundant paths, and retain fine structural details. This work bridges generative modeling and structured vector representation, providing a scalable route to single-input, object-level multi-view SVG generation and supporting applications such as asset creation and semantic vector editing.

[88] Parameter-Free Neural Lens Blur Rendering for High-Fidelity Composites

Lingyan Ruan, Bin Chen, Taehyun Rhee

Main category: cs.CV

TL;DR: A method for realistic lens blur in mixed reality by estimating circle of confusion maps directly from RGB images, eliminating the need for camera parameters or depth information.

Details

Motivation: Existing lens blur methods require camera parameters and scene depth, which are often unavailable to ordinary users, limiting accessibility and generalizability.

Method: Directly estimate CoC map from RGB images, infer CoC values for virtual objects using linear relationship between signed CoC and depth, render blur with neural reblurring network.

Result: Achieves high-fidelity compositing with realistic defocus effects, outperforming state-of-the-art techniques in qualitative and quantitative evaluations.

Conclusion: Provides a flexible and practical solution for real-world mixed reality applications without requiring camera metadata or depth information.

Abstract: Consistent and natural camera lens blur is important for seamlessly blending 3D virtual objects into photographed real scenes. Since lens blur typically varies with scene depth, the placement of virtual objects and their corresponding blur levels significantly affect the visual fidelity of mixed reality compositions. Existing pipelines often rely on camera parameters (e.g., focal length, focus distance, aperture size) and scene depth to compute the circle of confusion (CoC) for realistic lens blur rendering. However, such information is often unavailable to ordinary users, limiting the accessibility and generalizability of these methods. In this work, we propose a novel compositing approach that directly estimates the CoC map from RGB images, bypassing the need for scene depth or camera metadata. The CoC values for virtual objects are inferred through a linear relationship between its signed CoC map and depth, and realistic lens blur is rendered using a neural reblurring network. Our method provides a flexible and practical solution for real-world applications. Experimental results demonstrate that our method achieves high-fidelity compositing with realistic defocus effects, outperforming state-of-the-art techniques in both qualitative and quantitative evaluations.
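For context, the conventional calculation that the paper avoids can be sketched with the standard thin-lens formula. This shows the parameter-dependent baseline, not the proposed estimator; the function name signed_coc and the example lens values are assumptions for illustration.

```python
def signed_coc(depth, focus_distance, focal_length, aperture_diameter):
    """Signed circle-of-confusion diameter under the thin-lens model.

    All arguments share one length unit (e.g. millimetres). The result is
    positive behind the focus plane, negative in front of it, and zero on it.
    """
    magnification = focal_length / (focus_distance - focal_length)
    return aperture_diameter * magnification * (depth - focus_distance) / depth

# Hypothetical 50 mm lens at f/2 (25 mm aperture diameter), focused at 2 m.
behind = signed_coc(4000.0, 2000.0, 50.0, 25.0)
in_focus = signed_coc(2000.0, 2000.0, 50.0, 25.0)
in_front = signed_coc(1000.0, 2000.0, 50.0, 25.0)
```

Every argument here is a quantity that ordinary users rarely have on hand, which is the accessibility gap that motivates estimating the CoC map directly from the RGB image.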

[89] Mesh RAG: Retrieval Augmentation for Autoregressive Mesh Generation

Xiatao Sun, Chen Liang, Qian Wang, Daniel Rakita

Main category: cs.CV

TL;DR: Mesh RAG is a training-free framework that enhances autoregressive mesh generation by using retrieval-based component generation, improving quality, speed, and enabling incremental editing without model retraining.

Details

Motivation: Traditional manual mesh creation is time-intensive, and current autoregressive models face quality-speed trade-offs and sequential dependency limitations that complicate incremental editing.

Method: Leverages point cloud segmentation, spatial transformation, and point cloud registration to retrieve, generate, and integrate mesh components, decoupling generation from sequential dependency.

Result: Significantly enhances mesh quality, accelerates generation speed compared to sequential part prediction, and enables incremental editing across various autoregressive mesh generation models.

Conclusion: Mesh RAG provides an effective training-free solution that overcomes sequential limitations in autoregressive mesh generation, offering improved quality, speed, and editing capabilities.

Abstract: 3D meshes are a critical building block for applications ranging from industrial design and gaming to simulation and robotics. Traditionally, meshes are crafted manually by artists, a process that is time-intensive and difficult to scale. To automate and accelerate this asset creation, autoregressive models have emerged as a powerful paradigm for artistic mesh generation. However, current methods to enhance quality typically rely on larger models or longer sequences that result in longer generation time, and their inherent sequential nature imposes a severe quality-speed trade-off. This sequential dependency also significantly complicates incremental editing. To overcome these limitations, we propose Mesh RAG, a novel, training-free, plug-and-play framework for autoregressive mesh generation models. Inspired by RAG for language models, our approach augments the generation process by leveraging point cloud segmentation, spatial transformation, and point cloud registration to retrieve, generate, and integrate mesh components. This retrieval-based approach decouples generation from its strict sequential dependency, facilitating efficient and parallelizable inference. We demonstrate the wide applicability of Mesh RAG across various foundational autoregressive mesh generation models, showing it significantly enhances mesh quality, accelerates generation speed compared to sequential part prediction, and enables incremental editing, all without model retraining.
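The registration step can be illustrated with the Kabsch algorithm, which finds the rigid transform that best aligns two point sets with known correspondences. This is a generic textbook technique swapped in for illustration; the summary does not state which registration method Mesh RAG uses, and the function name kabsch and the test data are assumptions.

```python
import numpy as np

def kabsch(source, target):
    """Return (R, t) minimizing sum ||R @ source_i + t - target_i||^2 over paired points."""
    src_c, tgt_c = source.mean(axis=0), target.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (source - src_c).T @ (target - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t

# Hypothetical check: recover a known 30-degree rotation about z plus a shift.
rng = np.random.default_rng(0)
source = rng.standard_normal((10, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
target = source @ R_true.T + t_true
R_est, t_est = kabsch(source, target)
```

In a retrieval setting, an alignment of this kind is what would place a retrieved component into the pose of its segment in the input point cloud. Real pipelines typically lack known correspondences and would use an iterative method such as ICP on top of this step.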

[90] WorldGen: From Text to Traversable and Interactive 3D Worlds

Dilin Wang, Hyunyoung Jung, Tom Monnier, Kihyuk Sohn, Chuhang Zou, Xiaoyu Xiang, Yu-Ying Yeh, Di Liu, Zixuan Huang, Thu Nguyen-Phuoc, Yuchen Fan, Sergiu Oprea, Ziyan Wang, Roman Shapovalov, Nikolaos Sarafianos, Thibault Groueix, Antoine Toisoul, Prithviraj Dhar, Xiao Chu, Minghao Chen, Geon Yeong Park, Mahima Gupta, Yassir Azziz, Rakesh Ranjan, Andrea Vedaldi

Main category: cs.CV

TL;DR: WorldGen is a system that automatically creates large-scale, interactive 3D worlds from text prompts, enabling text-to-3D world generation without manual modeling.

Details

Motivation: To bridge the gap between creative intent and functional virtual spaces, allowing creators to design coherent, navigable worlds without manual modeling or specialized 3D expertise.

Method: Combines LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition in a modular system.

Result: Produces traversable, fully textured environments that are geometrically consistent, visually rich, and efficient to render in real time within standard game engines.

Conclusion: Represents a step towards accessible, generative world-building at scale, advancing 3D generative AI for gaming, simulation, and immersive social environments.

Abstract: We introduce WorldGen, a system that enables the automatic creation of large-scale, interactive 3D worlds directly from text prompts. Our approach transforms natural language descriptions into traversable, fully textured environments that can be immediately explored or edited within standard game engines. By combining LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition, WorldGen bridges the gap between creative intent and functional virtual spaces, allowing creators to design coherent, navigable worlds without manual modeling or specialized 3D expertise. The system is fully modular and supports fine-grained control over layout, scale, and style, producing worlds that are geometrically consistent, visually rich, and efficient to render in real time. This work represents a step towards accessible, generative world-building at scale, advancing the frontier of 3D generative AI for applications in gaming, simulation, and immersive social environments.

[91] Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation

Xizhe Xue, Xiao Xiang Zhu

Main category: cs.CV

TL;DR: REO-Instruct is the first unified benchmark for both descriptive and regression tasks in Earth Observation, bridging qualitative understanding and quantitative prediction of biophysical variables like above-ground biomass.

Details

Motivation: Existing EO datasets focus mainly on semantic understanding tasks like captioning/classification, lacking benchmarks that align multimodal perception with measurable biophysical variables for scientific regression.

Method: Created the REO-Instruct dataset, integrating co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline, and establishing a cognitively interpretable logic chain for forest ecological scenarios.

Result: Baseline evaluation across generic VLMs reveals that current models struggle with numeric reasoning, highlighting an essential challenge for scientific VLMs in handling regression tasks.

Conclusion: REO-Instruct provides a standardized foundation for developing next-generation geospatial models capable of both description and scientific inference, addressing the gap in scientific regression capabilities for Earth Observation.

Abstract: Recent progress in vision language models (VLMs) has enabled remarkable perception and reasoning capabilities, yet their potential for scientific regression in Earth Observation (EO) remains largely unexplored. Existing EO datasets mainly emphasize semantic understanding tasks such as captioning or classification, lacking benchmarks that align multimodal perception with measurable biophysical variables. To fill this gap, we present REO-Instruct, the first unified benchmark designed for both descriptive and regression tasks in EO. REO-Instruct establishes a cognitively interpretable logic chain in forest ecological scenarios (human activity, land-cover classification, ecological patch counting, above-ground biomass (AGB) regression), bridging qualitative understanding and quantitative prediction. The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline. Comprehensive evaluation protocols and baseline results across generic VLMs reveal that current models struggle with numeric reasoning, highlighting an essential challenge for scientific VLMs. REO-Instruct offers a standardized foundation for developing and assessing next-generation geospatial models capable of both description and scientific inference. The project page is publicly available at https://github.com/zhu-xlab/REO-Instruct.

[92] BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay

Main category: cs.CV

TL;DR: BOP-ASK is a large-scale dataset for object interaction reasoning that addresses limitations in current spatial reasoning benchmarks by providing fine-grained annotations for precise 3D localization, physical compatibility, affordances, and multi-step spatial planning.

Details

Motivation: Current VLMs perform well on high-level spatial relationships but lack fine-grained spatial understanding needed for real-world applications like precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning.

Method: Leveraged 6D object poses from BOP datasets to create a data generation pipeline that derives fine-grained annotations including grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships.

Result: Created BOP-ASK with over 150k images and 33M question-answer pairs spanning six tasks (four novel). Models trained on BOP-ASK outperform baselines and exhibit emergent capabilities in precise object/grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning.

Conclusion: BOP-ASK provides a comprehensive benchmark for training and evaluating VLMs on fine-grained object interaction reasoning, addressing critical gaps in current spatial reasoning evaluations and enabling testing of generalization through out-of-distribution benchmarks.

Abstract: Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships (‘left of’, ‘behind’, etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.

[93] Parts-Mamba: Augmenting Joint Context with Part-Level Scanning for Occluded Human Skeleton

Tianyi Shen, Huijuan Xu, Nilesh Ahuja, Omesh Tickoo, Philip Shin, Vijaykrishnan Narayanan

Main category: cs.CV

TL;DR: Parts-Mamba: A hybrid GCN-Mamba model that improves skeleton action recognition under occlusion by capturing distant joint context through parts-specific scanning and fusion modules.

Details

Motivation: Existing GCN models perform poorly with imperfect skeletons due to occlusions or missing frames, as they lack the ability to capture contextual information from distant joints when local context is missing.

Method: Proposed Parts-Mamba model combines GCN with Mamba architecture, using parts-specific scanning features to capture part-specific information and a parts-body fusion module to preserve non-neighboring joint context.

Result: Achieved up to 12.9% improvement in accuracy on NTU RGB+D 60 and NTU RGB+D 120 datasets under different occlusion settings compared to existing methods.

Conclusion: The hybrid GCN-Mamba approach effectively addresses the limitations of traditional GCNs in handling incomplete skeletons by better capturing and maintaining contextual information from distant joints.

Abstract: Skeleton action recognition involves recognizing human action from human skeletons. The use of graph convolutional networks (GCNs) has driven major advances in this recognition task. In real-world scenarios, the captured skeletons are not always perfect or complete because of occlusions of parts of the human body or poor communication quality, leading to missing parts in skeletons or videos with missing frames. In the presence of such non-idealities, existing GCN models perform poorly due to missing local context. To address this limitation, we propose Parts-Mamba, a hybrid GCN-Mamba model designed to enhance the ability to capture and maintain contextual information from distant joints. The proposed Parts-Mamba model effectively captures part-specific information through its parts-specific scanning feature and preserves non-neighboring joint context via a parts-body fusion module. Our proposed model is evaluated on the NTU RGB+D 60 and NTU RGB+D 120 datasets under different occlusion settings, achieving up to 12.9% improvement in accuracy.

[94] The Joint Gromov Wasserstein Objective for Multiple Object Matching

Aryan Tajmir Riahi, Khanh Dao Duc

Main category: cs.CV

TL;DR: The paper introduces Joint Gromov-Wasserstein (JGW), extending GW distance to enable simultaneous multiple-to-one and multiple-to-multiple object matching across metric spaces.

Details

Motivation: Traditional Gromov-Wasserstein distance is limited to pairwise matching between single objects, which restricts its utility in applications requiring multiple object matching.

Method: Extends GW framework to simultaneous matching between collections of objects, formulates JGW objective for point cloud representations, and adapts traditional Optimal Transport algorithms with entropic regularization.

Result: JGW provides superior performance in accuracy and computational efficiency compared to other GW variants, effectively handles multiple shape matching for geometric shapes and biomolecular complexes.

Conclusion: JGW enables complex matching problems across diverse domains including computer graphics and structural biology, with promising applications for partially isomorphic distributions of metric measure spaces.

Abstract: The Gromov-Wasserstein (GW) distance serves as a powerful tool for matching objects in metric spaces. However, its traditional formulation is constrained to pairwise matching between single objects, limiting its utility in scenarios and applications requiring multiple-to-one or multiple-to-multiple object matching. In this paper, we introduce the Joint Gromov-Wasserstein (JGW) objective and extend the original framework of GW to enable simultaneous matching between collections of objects. Our formulation provides a non-negative dissimilarity measure that identifies partially isomorphic distributions of mm-spaces, with point sampling convergence. We also show that the objective can be formulated and solved for point cloud object representations by adapting traditional algorithms in Optimal Transport, including entropic regularization. Our benchmarking with other variants of GW for partial matching indicates superior performance in accuracy and computational efficiency of our method, while experiments on both synthetic and real-world datasets show its effectiveness for multiple shape matching, including geometric shapes and biomolecular complexes, suggesting promising applications for solving complex matching problems across diverse domains, including computer graphics and structural biology.
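The abstract mentions adapting traditional Optimal Transport algorithms with entropic regularization. As a rough illustration of that building block only, here is a minimal sketch of plain Sinkhorn iterations between two histograms. It is standard entropic OT, not the JGW objective itself; the function name `sinkhorn` and the parameters `eps` and `n_iters` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.5, n_iters=200):
    """Entropic-regularized OT plan between histograms a (n,) and b (m,).

    Plain Sinkhorn scaling; a generic building block, NOT the paper's JGW solver.
    """
    K = np.exp(-cost / eps)              # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # rescale to match row marginals
        v = b / (K.T @ u)                # rescale to match column marginals
    return u[:, None] * K * v[None, :]   # transport plan of shape (n, m)

# Toy example: two small point clouds with uniform weights (hypothetical data).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(7, 2))
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()                 # normalize so eps has a stable scale
a = np.full(5, 1 / 5)
b = np.full(7, 1 / 7)
plan = sinkhorn(a, b, cost)
```

Entropic GW-style solvers typically wrap an inner loop like this one inside an outer loop that re-linearizes the quadratic objective; that outer loop is omitted here.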

[95] Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment

Loukas Sfountouris, Giannis Daras, Paris Giampouras

Main category: cs.CV

TL;DR: The paper applies representation alignment (REPA) between diffusion/flow models and pretrained encoders such as DINOv2 to inverse problems, enhancing reconstruction fidelity and perceptual realism.

Details

Motivation: To extend representation alignment from generative modeling to inverse problems, using pretrained generative models as priors to guide reconstruction at inference time.

Method: Align internal representations of diffusion/flow models with pretrained self-supervised visual encoder features during inference, using REPA regularization that relates to divergence measures in embedding space.

Result: Consistent improvement in reconstruction quality across super-resolution, inpainting, and deblurring tasks, with efficiency gains through reduced discretization steps.

Conclusion: REPA provides an effective inductive bias for inverse problems, enhancing perceptual fidelity while maintaining solver performance with improved computational efficiency.

Abstract: Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder, such as DINOv2, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism. We provide theoretical results showing (a) the relation between the REPA regularization and a divergence measure in the DINOv2 embedding space, and (b) how REPA updates steer the model’s internal representations toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating it into multiple state-of-the-art inverse problem solvers. Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm that our method consistently improves reconstruction quality across tasks, while also providing substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver.
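To make the idea of aligning model representations with encoder features concrete, the sketch below computes a cosine-similarity alignment penalty between two sets of patch features. This is an assumption about the general form of such a regularizer; the paper's exact loss, projection head, and choice of layer are not given in this summary, and `alignment_loss` is a hypothetical name.

```python
import numpy as np

def alignment_loss(model_feats, target_feats, eps=1e-8):
    """Mean (1 - cosine similarity) between matching rows of two (N, D) arrays.

    model_feats:  features taken from the generative model (hypothetical).
    target_feats: features from a pretrained encoder such as DINOv2 (hypothetical).
    """
    a = model_feats / (np.linalg.norm(model_feats, axis=-1, keepdims=True) + eps)
    b = target_feats / (np.linalg.norm(target_feats, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))
```

The penalty is 0 when every pair of feature vectors points in the same direction and reaches 2 when they point in opposite directions. In an inference-time setting, its gradient with respect to the current estimate would serve as the guidance signal.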

[96] Glass Surface Detection: Leveraging Reflection Dynamics in Flash/No-flash Imagery

Tao Yan, Hao Huang, Yiwei Lu, Zeyu Wang, Ke Xu, Yinghui Wang, Xiaojun Chang, Rynson W. H. Lau

Main category: cs.CV

TL;DR: NFGlassNet uses flash/no-flash image pairs and reflection dynamics to detect glass surfaces, outperforming existing methods that rely on boundary or reflection cues alone.

Details

Motivation: Glass surfaces are challenging to detect due to their transparency and lack of distinctive features. Existing methods fail to fully exploit intrinsic glass properties for accurate localization.

Method: Proposes NFGlassNet with Reflection Contrast Mining Module (RCMM) to extract reflections and Reflection Guided Attention Module (RGAM) to fuse reflection and glass surface features. Uses flash/no-flash image pairs to capture reflection dynamics.

Result: Method outperforms state-of-the-art approaches. Dataset of 3.3K flash/no-flash image pairs with ground truth annotations was constructed for training.

Conclusion: Leveraging reflection dynamics in flash/no-flash imagery provides an effective approach for glass surface detection, addressing limitations of existing methods.

Abstract: Glass surfaces are ubiquitous in daily life, typically appearing colorless, transparent, and lacking distinctive features. These characteristics make glass surface detection a challenging computer vision task. Existing glass surface detection methods always rely on boundary cues (e.g., window and door frames) or reflection cues to locate glass surfaces, but they fail to fully exploit the intrinsic properties of the glass itself for accurate localization. We observed that in most real-world scenes, the illumination intensity in front of the glass surface differs from that behind it, which results in variations in the reflections visible on the glass surface. Specifically, when standing on the brighter side of the glass and applying a flash towards the darker side, existing reflections on the glass surface tend to disappear. Conversely, while standing on the darker side and applying a flash towards the brighter side, distinct reflections will appear on the glass surface. Based on this phenomenon, we propose NFGlassNet, a novel method for glass surface detection that leverages the reflection dynamics present in flash/no-flash imagery. Specifically, we propose a Reflection Contrast Mining Module (RCMM) for extracting reflections, and a Reflection Guided Attention Module (RGAM) for fusing features from reflection and glass surface for accurate glass surface detection. For learning our network, we also construct a dataset consisting of 3.3K no-flash and flash image pairs captured from various scenes with corresponding ground truth annotations. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. Our code, model, and dataset will be available upon acceptance of the manuscript.

[97] R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios

Lu Zhu, Tiantian Geng, Yangye Chen, Teng Wang, Ping Lu, Feng Zheng

Main category: cs.CV

TL;DR: R-AVST is the first dataset for real-world audio-visual spatio-temporal reasoning with fine-grained annotations, and AVST-Zero is a reinforcement learning model that achieves competitive performance on this benchmark.

Details

Motivation: Current MLLM research focuses on simple video scenarios, failing to capture the complex and diverse nature of real-world audio-visual events, creating a gap in understanding complex multimodal reasoning.

Method: Created R-AVST dataset using LLM-based key object extraction, automatic spatial annotation, and manual quality inspection (5K videos, 27K objects, 100 event types). Proposed AVST-Zero model using reinforcement learning with multi-dimensional rewards to avoid intermediate supervision.

Result: Generated over 8K high-quality QA pairs for benchmarking. Extensive experiments validated R-AVST’s effectiveness, and AVST-Zero demonstrated competitive performance compared to existing models.

Conclusion: R-AVST is the first dataset for real-world audio-visual spatio-temporal reasoning, and AVST-Zero provides a novel approach for addressing future challenges in this domain.

Abstract: Recently, rapid advancements have been made in multimodal large language models (MLLMs), especially in video understanding tasks. However, current research focuses on simple video scenarios, failing to reflect the complex and diverse nature of real-world audio-visual events in videos. To bridge this gap, we first introduce R-AVST, a dataset for audio-visual reasoning featuring fine-grained spatio-temporal annotations. In constructing this, we design a pipeline consisting of LLM-based key object extraction, automatic spatial annotation and manual quality inspection, resulting in over 5K untrimmed videos with 27K objects across 100 types of audio-visual events. Building on this dataset, we define three core tasks for spatio-temporal reasoning in audio-visual scenes and generate more than 8K high-quality, evenly distributed question-answer pairs to effectively benchmark model performance. To further enhance reasoning, we propose AVST-Zero, a reinforcement learning-based model that avoids intermediate supervision, directly optimizing behavior via carefully designed multi-dimensional rewards. Extensive experiments validate the effectiveness of our R-AVST in advancing audio-visual spatio-temporal reasoning, upon which AVST-Zero demonstrates competitive performance compared to existing models. To the best of our knowledge, R-AVST is the first dataset designed for real-world audio-visual spatio-temporal reasoning, and AVST-Zero offers a novel perspective for tackling future challenges in this domain.

[98] Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models

Hao-Chien Hsueh, Chi-En Yen, Wen-Hsiao Peng, Ching-Chun Huang

Main category: cs.CV

TL;DR: Warm Diffusion bridges hot (noise-only) and cold (blur-only) diffusion paradigms by proposing a unified Blur-Noise Mixture Diffusion Model that jointly controls blurring and noise to exploit spectral dependencies in images.

Details

Motivation: Hot diffusion fails to leverage correlations between high-frequency details and low-frequency structures, while cold diffusion neglects the role of noise in shaping the data manifold, causing out-of-manifold issues.

Method: Proposed Warm Diffusion with Blur-Noise Mixture Diffusion Model (BNMD) using a divide-and-conquer strategy that disentangles denoising and deblurring processes, and analyzes Blur-to-Noise Ratio (BNR) through spectral analysis.

Result: Extensive experiments across benchmarks validate the effectiveness of the approach for image generation.

Conclusion: Warm Diffusion successfully integrates the strengths of both hot and cold diffusion paradigms by jointly controlling blurring and noise, addressing their respective limitations through spectral dependency exploitation.

Abstract: Diffusion probabilistic models have achieved remarkable success in generative tasks across diverse data types. While recent studies have explored alternative degradation processes beyond Gaussian noise, this paper bridges two key diffusion paradigms: hot diffusion, which relies entirely on noise, and cold diffusion, which uses only blurring without noise. We argue that hot diffusion fails to exploit the strong correlation between high-frequency image detail and low-frequency structures, leading to random behaviors in the early steps of generation. Conversely, while cold diffusion leverages image correlations for prediction, it neglects the role of noise (randomness) in shaping the data manifold, resulting in out-of-manifold issues and partially explaining its performance drop. To integrate both strengths, we propose Warm Diffusion, a unified Blur-Noise Mixture Diffusion Model (BNMD), to control blurring and noise jointly. Our divide-and-conquer strategy exploits the spectral dependency in images, simplifying score model estimation by disentangling the denoising and deblurring processes. We further analyze the Blur-to-Noise Ratio (BNR) using spectral analysis to investigate the trade-off between model learning dynamics and changes in the data manifold. Extensive experiments across benchmarks validate the effectiveness of our approach for image generation.
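The contrast between hot (noise-only), cold (blur-only), and mixed degradation can be illustrated with a small forward-process sketch. It assumes a Gaussian blur applied in the Fourier domain followed by additive Gaussian noise; the paper's actual schedules and its Blur-to-Noise Ratio definition are not given in this summary, so `blur_sigma` and `noise_sigma` are hypothetical knobs.

```python
import numpy as np

def blur_noise_forward(x0, blur_sigma, noise_sigma, rng):
    """Degrade a 2D image with Gaussian blur, then add Gaussian noise.

    blur_sigma = 0 gives a noise-only ("hot") step,
    noise_sigma = 0 gives a blur-only ("cold") step.
    """
    h, w = x0.shape
    fy = np.fft.fftfreq(h)[:, None]      # vertical frequencies (cycles/pixel)
    fx = np.fft.fftfreq(w)[None, :]      # horizontal frequencies
    # Fourier transform of a Gaussian kernel with standard deviation blur_sigma.
    transfer = np.exp(-2.0 * (np.pi * blur_sigma) ** 2 * (fx ** 2 + fy ** 2))
    blurred = np.real(np.fft.ifft2(np.fft.fft2(x0) * transfer))
    return blurred + noise_sigma * rng.standard_normal(x0.shape)
```

Because the blur attenuates high frequencies while leaving the mean untouched, and the noise is spectrally flat, varying the two parameters together changes which frequency bands are destroyed first.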

[99] Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content

Shushi Wang, Zicheng Zhang, Chunyi Li, Wei Wang, Liya Ma, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu

Main category: cs.CV

TL;DR: Q-Real is a dataset for fine-grained evaluation of AI-generated images’ realism and plausibility, with 3,088 images, entity annotations, and judgment questions to improve MLLM evaluation and generative model optimization.

Details

Motivation: Existing quality assessment methods provide only single scores, which are too coarse for targeted improvement of generative models. Fine-grained evaluation along realism and plausibility dimensions is crucial for model optimization, especially with unified generation-understanding models.

Method: Created Q-Real dataset with 3,088 AI-generated images, annotated entity locations, and judgment questions for realism/plausibility. Built Q-Real Bench for evaluating MLLMs on judgment and grounding tasks, and designed a fine-tuning framework for MLLM enhancement.

Result: Experimental results demonstrate the high quality and significance of the dataset, and the comprehensiveness of the benchmark for evaluating MLLM capabilities in fine-grained image quality assessment.

Conclusion: Q-Real enables fine-grained evaluation of AI-generated images, providing targeted guidance for improving generative models and enhancing MLLM capabilities through specialized fine-tuning.

Abstract: Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribution descriptions for these along the dimensions of realism and plausibility. Considering that recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of the benchmark. Dataset and code will be released upon publication.

[100] UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

Chi Zhang, Jiepeng Wang, Youming Wang, Yuanzhi Liang, Xiaoyan Yang, Zuoxin Li, Haibin Huang, Xuelong Li

Main category: cs.CV

TL;DR: UniModel is a unified generative model that handles both visual understanding and generation through a pixel-to-pixel diffusion framework, treating all inputs and outputs as RGB pixels in a shared visual space.

Details

Motivation: To achieve unification across models, tasks, and representations by eliminating modality discrepancies and creating a fully vision-native multimodal learning approach.

Method: Uses a Unified Diffusion Transformer trained with rectified flow in pixel space, mapping text to painted text images and treating all inputs/outputs as RGB pixels, with lightweight task embeddings to specify direction.

Result: Demonstrates strong cross-modal alignment and emergent controllability, including cycle-consistent image-caption-image loops, in both text-to-image synthesis and image-to-text understanding tasks.

Conclusion: Unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.

Abstract: We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images serve as visual conditions that guide realistic and semantically aligned image synthesis. Captioning and text-to-image generation thus become different directions of the same underlying visual translation process. At the model level, we instantiate a single Unified Diffusion Transformer trained with rectified flow in pixel space. A shared backbone jointly learns bidirectional mappings between natural images and painted text images, with lightweight task embeddings to specify the desired direction. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment and emergent controllability such as cycle-consistent image-caption-image loops. Our initial exploration suggests that unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.
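For readers unfamiliar with rectified flow, the sketch below shows how a training pair is usually built: a point on the straight line between two endpoints and the constant velocity along that line as the regression target. This is the generic formulation, not UniModel's training code; which endpoint plays the role of source or target, and the name `rectified_flow_pair`, are assumptions.

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Return the interpolated sample x_t and the velocity target x1 - x0.

    x0, x1: arrays of the same shape (e.g. source and target pixel images).
    t:      scalar in [0, 1].
    """
    xt = (1.0 - t) * x0 + t * x1         # straight-line interpolation
    target_velocity = x1 - x0            # constant along the straight path
    return xt, target_velocity

def flow_matching_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((pred_velocity - target_velocity) ** 2))
```

A network would take `xt`, `t`, and any task embedding as input and be trained to minimize `flow_matching_loss` against `target_velocity`.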

[101] DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution

Chaoran Xu, Chengkan Lv, Qiyu Chen, Yunkang Cao, Feng Zhang, Zhengtao Zhang

Main category: cs.CV

TL;DR: DeltaDeno is a training-free zero-shot anomaly generation method that uses diffusion model contrast with minimal prompts to localize and edit defects without requiring real anomaly samples.

Details

Motivation: Existing anomaly generation methods rely on few-shot fine-tuning with anomalous samples, which contradicts the scarcity motivation and tends to overfit category priors. The paper addresses the setting where no real anomaly samples or training are available.

Method: DeltaDeno contrasts two diffusion branches driven by a minimal prompt pair under shared schedule, accumulates per-step denoising deltas into localization maps, uses masks to guide latent inpainting, performs token-level prompt refinement, and applies spatial attention bias restricted to anomaly tokens.

Result: Experiments on public datasets show that DeltaDeno generates realistic defects and yields consistent gains in downstream detection performance.

Conclusion: DeltaDeno provides an effective training-free zero-shot approach for anomaly generation that works without real anomaly samples, achieving realistic defect generation and improved detection performance.

Abstract: Anomaly generation is often framed as few-shot fine-tuning with anomalous samples, which contradicts the scarcity that motivates generation and tends to overfit category priors. We tackle the setting where no real anomaly samples or training are available. We propose Delta-Denoising (DeltaDeno), a training-free zero-shot anomaly generation method that localizes and edits defects by contrasting two diffusion branches driven by a minimal prompt pair under a shared schedule. By accumulating per-step denoising deltas into an image-specific localization map, we obtain a mask to guide the latent inpainting during later diffusion steps and preserve the surrounding context while generating realistic local defects. To improve stability and control, DeltaDeno performs token-level prompt refinement that aligns shared content and strengthens anomaly tokens, and applies a spatial attention bias restricted to anomaly tokens in the predicted region. Experiments on public datasets show that DeltaDeno achieves great generation, realism and consistent gains in downstream detection performance. Code will be made publicly available.
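The step of accumulating per-step denoising deltas into a localization map can be sketched as follows. The arrays stand in for noise predictions from the two prompt branches; the absolute-difference accumulation, min-max normalization, and fixed `threshold` are illustrative assumptions, since the summary does not specify how the paper aggregates or binarizes the map.

```python
import numpy as np

def delta_localization_mask(eps_normal, eps_anomaly, threshold=0.5):
    """Accumulate per-step prediction differences into a heat map and a mask.

    eps_normal, eps_anomaly: arrays of shape (T, C, H, W) holding the two
    branches' noise predictions over T denoising steps (hypothetical inputs).
    """
    delta = np.abs(eps_anomaly - eps_normal).mean(axis=1)   # (T, H, W)
    heat = delta.sum(axis=0)                                # accumulate over steps
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    mask = heat > threshold                                 # binary edit region
    return heat, mask
```

The resulting mask would then restrict latent inpainting to the predicted defect region while the surrounding context is kept from the normal branch.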

[102] Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features

Jingyi Xu, Meisong Zheng, Ying Chen, Minglang Qiao, Xin Deng, Mai Xu

Main category: cs.CV

TL;DR: DGAF-VSR is a diffusion model-based video super-resolution method that improves alignment and compensation between frames using optical guided warping and feature-wise temporal conditioning to enhance perceptual quality, fidelity, and temporal consistency.

Details

Motivation: Existing DM-based VSR methods suffer from error accumulation, spatial artifacts, and trade-offs between perceptual quality and fidelity due to inaccurate alignment and insufficient compensation between video frames.

Method: Proposes DGAF-VSR with Optical Guided Warping Module (OGWM) to preserve high-frequency details in aligned features and Feature-wise Temporal Condition Module (FTCM) for dense guidance in the feature domain, leveraging better feature domain correlations and upscaled resolution warping.

Result: Achieves significant improvements: 35.82% DISTS reduction (perceptual quality), 0.20 dB PSNR gain (fidelity), and 30.37% tLPIPS reduction (temporal consistency) compared to state-of-the-art methods on synthetic and real-world datasets.

Conclusion: DGAF-VSR effectively addresses alignment and compensation challenges in DM-based VSR, demonstrating superior performance across perceptual quality, fidelity, and temporal consistency metrics.

Abstract: Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. However, they suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity, primarily caused by inaccurate alignment and insufficient compensation between video frames. In this paper, within the DM-based VSR pipeline, we revisit the role of alignment and compensation between adjacent video frames and reveal two crucial observations: (a) the feature domain is better suited than the pixel domain for information compensation due to its stronger spatial and temporal correlations, and (b) warping at an upscaled resolution better preserves high-frequency information, but this benefit is not necessarily monotonic. Therefore, we propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR), with an Optical Guided Warping Module (OGWM) to maintain high-frequency details in the aligned features and a Feature-wise Temporal Condition Module (FTCM) to deliver dense guidance in the feature domain. Extensive experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR, including perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction).

[103] Shape-preserving Tooth Segmentation from CBCT Images Using Deep Learning with Semantic and Shape Awareness

Zongrui Ji, Zhiming Cui, Na Li, Qianhan Zheng, Miaojing Shi, Ke Deng, Jingyang Zhang, Chaoyuan Li, Xuepeng Chen, Yi Dong, Lei Ma

Main category: cs.CV

TL;DR: A deep learning framework for CBCT tooth segmentation that integrates semantic and shape awareness to handle interdental adhesions and preserve anatomical shape integrity.

Details

Motivation: Accurate tooth segmentation from CBCT images is crucial for digital dentistry but challenging due to interdental adhesions causing severe anatomical shape distortion.

Method: Proposes a framework with target-tooth-centroid prompted multi-label learning for semantic relationships and tooth-shape-aware learning for morphological constraints, unified via multi-task learning.

Result: Extensive evaluations on internal and external datasets demonstrate significant performance improvements over existing methods.

Conclusion: The approach effectively mitigates shape distortions and provides anatomically faithful tooth boundaries.

Abstract: Background: Accurate tooth segmentation from cone beam computed tomography (CBCT) images is crucial for digital dentistry but remains challenging in cases of interdental adhesions, which cause severe anatomical shape distortion. Methods: To address this, we propose a deep learning framework that integrates semantic and shape awareness for shape-preserving segmentation. Our method introduces a target-tooth-centroid prompted multi-label learning strategy to model semantic relationships between teeth, reducing shape ambiguity. Additionally, a tooth-shape-aware learning mechanism explicitly enforces morphological constraints to preserve boundary integrity. These components are unified via multi-task learning, jointly optimizing segmentation and shape preservation. Results: Extensive evaluations on internal and external datasets demonstrate that our approach significantly outperforms existing methods. Conclusions: Our approach effectively mitigates shape distortions and provides anatomically faithful tooth boundaries.

[104] OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios

Hong Gao, Jingyu Wu, Xiangkai Xu, Kangni Xie, Yunchen Zhang, Bin Zhong, Xurui Gao, Min-Ling Zhang

Main category: cs.CV

TL;DR: OmniGround benchmark addresses limitations in Spatio-Temporal Video Grounding with 3,475 videos across 81 categories and complex queries, revealing performance drops in real-world scenarios. PG-TAF framework achieves significant improvements through temporal grounding and spatio-temporal propagation.

Details

Motivation: Current STVG models show category bias, oversimplified reasoning, and poor linguistic robustness due to limited benchmark scope, creating a gap between model performance and real-world demands with diverse objects and complex queries.

Method: Introduces OmniGround benchmark with Forward-Backward-Refinement annotation pipeline for high-quality labels, and proposes PG-TAF - a training-free two-stage framework that decomposes STVG into high-level temporal grounding and fine-grained spatio-temporal propagation.

Result: Evaluations show 10.4% average performance drop on complex real-world scenes, especially with small/occluded objects and intricate spatial relations. PG-TAF achieves 25.6% and 35.6% improvements in m_tIoU and m_vIoU on OmniGround with consistent gains across four benchmarks.

Conclusion: The comprehensive OmniGround benchmark reveals critical limitations in current STVG approaches, while the proposed PG-TAF framework demonstrates substantial improvements in handling complex real-world video grounding tasks through systematic decomposition of the problem.

Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize target objects in videos based on natural language descriptions. Despite recent advances in Multimodal Large Language Models, a significant gap remains between current models and real-world demands involving diverse objects and complex queries. We attribute this to limited benchmark scope, causing models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness. To address these limitations, we introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We propose the Forward-Backward-Refinement annotation pipeline that combines multi-directional tracking with intelligent error correction for high-quality labels. We further introduce DeepSTG, a systematic evaluation framework quantifying dataset quality across four complementary dimensions beyond superficial statistics. Evaluations reveal an average performance drop of 10.4% on complex real-world scenes, particularly with small/occluded objects and intricate spatial relations. Motivated by these findings, we propose PG-TAF, a training-free two-stage framework decomposing STVG into high-level temporal grounding and fine-grained spatio-temporal propagation. Experiments demonstrate PG-TAF achieves 25.6% and 35.6% improvements in m_tIoU and m_vIoU on OmniGround with consistent gains across four benchmarks.
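
The abstract reports m_tIoU without defining it. Assuming the usual reading in the STVG literature (the mean, over samples, of the temporal IoU between the predicted and ground-truth segments), a minimal sketch looks as follows. The function names and the `(start, end)` interval convention are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """IoU of two temporal segments given as (start, end) with end >= start."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_temporal_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    scores = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a prediction of frames 0 to 10 against a ground truth of frames 5 to 15 overlaps on 5 frames out of a union of 15, giving a temporal IoU of 1/3.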

[105] MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models

Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, Kaili Liu, Ruxin Feng, Wen Jun Tan, Wei Yang Bryan Lim

Main category: cs.CV

TL;DR: MultiPriv is the first benchmark to evaluate individual-level privacy reasoning in VLMs, revealing significant unmeasured risks beyond simple attribute perception.

Details

Motivation: Current privacy benchmarks are insufficient as they only evaluate privacy perception, failing to address the critical risk of privacy reasoning - VLMs' ability to infer and link distributed information to construct individual profiles.

Method: Proposed Privacy Perception and Reasoning (PPR) framework with a bilingual multimodal dataset featuring synthetic individual profiles where identifiers are linked to sensitive attributes. Evaluates 9 tasks across the PPR spectrum from attribute detection to cross-image re-identification and chained inference.

Result: Evaluation of 50+ VLMs reveals: (1) significant unmeasured reasoning-based privacy risks, (2) perception-level metrics are poor predictors of reasoning risks, (3) existing safety alignments are inconsistent and ineffective against reasoning-based attacks.

Conclusion: MultiPriv exposes systemic vulnerabilities in VLMs and provides a framework for developing robust, privacy-preserving models by addressing the critical gap in privacy reasoning evaluation.

Abstract: Modern Vision-Language Models (VLMs) demonstrate sophisticated reasoning, escalating privacy risks beyond simple attribute perception to individual-level linkage. Current privacy benchmarks are structurally insufficient for this new threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM’s ability to infer and link distributed information to construct individual profiles. To address this critical gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a novel, bilingual multimodal dataset to support it. The dataset uniquely features a core component of synthetic individual profiles where identifiers (e.g., faces, names) are meticulously linked to sensitive attributes. This design enables nine challenging tasks evaluating the full PPR spectrum, from attribute detection to cross-image re-identification and chained inference. We conduct a large-scale evaluation of over 50 foundational and commercial VLMs. Our analysis reveals: (1) Many VLMs possess significant, unmeasured reasoning-based privacy risks. (2) Perception-level metrics are poor predictors of these reasoning risks, revealing a critical evaluation gap. (3) Existing safety alignments are inconsistent and ineffective against such reasoning-based attacks. MultiPriv exposes systemic vulnerabilities and provides the necessary framework for developing robust, privacy-preserving VLMs.

[106] The Finer the Better: Towards Granular-aware Open-set Domain Generalization

Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen

Main category: cs.CV

TL;DR: SeeCLIP improves open-set domain generalization by using semantic-enhanced prompts and duplex contrastive learning to better handle hard unknowns that resemble known classes.

Details

Motivation: Existing methods struggle with balancing structural risk from known classes and open-space risk from unknown classes, particularly when dealing with 'hard unknowns' that share fine-grained visual similarities with known classes.

Method: Proposes Semantic-enhanced CLIP (SeeCLIP) with semantic-aware prompt enhancement, duplex contrastive learning (repulsion and cohesion), and semantic-guided diffusion to generate challenging pseudo-unknown samples.

Result: Achieves consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods across five benchmarks.

Conclusion: SeeCLIP effectively addresses the dilemma between known-class structural risk and unknown-class open-space risk through fine-grained semantic enhancement, demonstrating superior performance in open-set domain generalization.

Abstract: Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still face a dilemma between the structural risk of known classes and the open-space risk from unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.

[107] Flow-Guided Implicit Neural Representation for Motion-Aware Dynamic MRI Reconstruction

Baoqing Li, Yuanyuan Liu, Congcong Liu, Qingyong Zhu, Jing Cheng, Yihang Zhou, Hao Chen, Zhuo-Xu Cui, Dong Liang

Main category: cs.CV

TL;DR: A novel implicit neural representation framework that jointly models dynamic MRI images and optical flow using physics-inspired regularization, enabling simultaneous high-quality reconstruction and motion estimation without prior flow estimation.

Details

Motivation: Conventional motion-compensated MRI reconstructions rely on pre-estimated optical flow, which is inaccurate under undersampling and degrades reconstruction quality. There's a need for methods that can jointly handle image reconstruction and motion estimation.

Method: Uses two implicit neural representations (INRs): one for spatiotemporal image content and another for optical flow, coupled via the optical flow equation as physics-inspired regularization, plus data consistency with k-space measurements.

Result: Outperforms state-of-the-art motion-compensated and deep learning approaches on cardiac MRI datasets, achieving superior reconstruction quality, accurate motion estimation, and improved temporal fidelity.

Conclusion: Implicit joint modeling with flow-regularized constraints shows strong potential for advancing dynamic MRI reconstruction by enabling simultaneous recovery of coherent images and motion fields.

Abstract: Dynamic magnetic resonance imaging (dMRI) captures temporally-resolved anatomy but is often challenged by limited sampling and motion-induced artifacts. Conventional motion-compensated reconstructions typically rely on pre-estimated optical flow, which is inaccurate under undersampling and degrades reconstruction quality. In this work, we propose a novel implicit neural representation (INR) framework that jointly models both the dynamic image sequence and its underlying motion field. Specifically, one INR is employed to parameterize the spatiotemporal image content, while another INR represents the optical flow. The two are coupled via the optical flow equation, which serves as a physics-inspired regularization, in addition to a data consistency loss that enforces agreement with k-space measurements. This joint optimization enables simultaneous recovery of temporally coherent images and motion fields without requiring prior flow estimation. Experiments on dynamic cardiac MRI datasets demonstrate that the proposed method outperforms state-of-the-art motion-compensated and deep learning approaches, achieving superior reconstruction quality, accurate motion estimation, and improved temporal fidelity. These results highlight the potential of implicit joint modeling with flow-regularized constraints for advancing dMRI reconstruction.
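
The optical flow equation that couples the two INRs is the brightness-constancy constraint, ∂I/∂t + u ∂I/∂x + v ∂I/∂y = 0. The sketch below evaluates its residual on a discrete image sequence with finite differences. This is only an illustration of the constraint: an INR-based method would presumably differentiate the networks directly, and the function name and grid spacing arguments are assumptions.

```python
import numpy as np

def flow_residual(frames, u, v, dt=1.0, dy=1.0, dx=1.0):
    """Residual of the optical flow equation I_t + u * I_x + v * I_y.

    frames: (T, H, W) image sequence.
    u, v:   (T, H, W) flow components along x (columns) and y (rows).
    A residual near zero means the flow is consistent with the sequence.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Finite-difference derivatives along time, rows (y), and columns (x).
    I_t, I_y, I_x = np.gradient(frames, dt, dy, dx)
    return I_t + u * I_x + v * I_y

def flow_regularizer(frames, u, v):
    """Mean squared residual, usable as a penalty term in a loss."""
    return float(np.mean(flow_residual(frames, u, v) ** 2))
```

In a joint optimization, a penalty of this kind would be added to the k-space data consistency loss, so that the image and the flow field are pushed to agree with each other.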

[108] FLUID: Training-Free Face De-identification via Latent Identity Substitution

Jinhyeong Park, Shaheryar Muhammad, Seangmin Lee, Jong Taek Lee, Soon Ki Jung

Main category: cs.CV

TL;DR: FLUID is a training-free face de-identification framework that performs identity substitution in the latent space of pretrained diffusion models using semantic displacement, achieving superior identity suppression while preserving attributes.

Details

Motivation: To develop an effective face de-identification method that can suppress identity information while preserving other facial attributes, addressing privacy concerns without requiring model retraining.

Method: Reinterprets identity editing as semantic displacement in the latent h-space of pretrained diffusion models. Uses optimization with novel reagent losses for attribute preservation and identity suppression, and proposes both linear and geodesic editing schemes to navigate the latent manifold.

Result: Experimental results on CelebA-HQ and FFHQ show FLUID achieves superior trade-off between identity suppression and attribute preservation, outperforming state-of-the-art de-identification methods in both qualitative and quantitative metrics.

Conclusion: FLUID provides an effective training-free solution for face de-identification that successfully balances identity removal with attribute preservation through latent space manipulation in diffusion models.

Abstract: We present FLUID (Face de-identification in the Latent space via Utility-preserving Identity Displacement), a training-free framework that directly substitutes identity in the latent space of pretrained diffusion models. Inspired by substitution mechanisms in chemistry, we reinterpret identity editing as semantic displacement in the latent h-space of a pretrained unconditional diffusion model. Our framework discovers identity-editing directions through optimization guided by novel reagent losses, which supervise for attribute preservation and identity suppression. We further propose both linear and geodesic (tangent-based) editing schemes to effectively navigate the latent manifold. Experimental results on CelebA-HQ and FFHQ demonstrate that FLUID achieves a superior trade-off between identity suppression and attribute preservation, outperforming state-of-the-art de-identification methods in both qualitative and quantitative metrics.

[109] FingerCap: Fine-grained Finger-level Hand Motion Captioning

Xin Shen, Rui Zhu, Lei Shen, Xinyu Wang, Kaihao Zhang, Tianqing Zhu, Shuchen Wu, Chenxi Miao, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang, Xin Yu

Main category: cs.CV

TL;DR: FingerCap introduces finger-level hand motion captioning using a new dataset and FiGOP method that combines RGB frames with hand keypoints to capture fine finger motions, outperforming existing Video-MLLMs.

Details

Motivation: Understanding fine-grained human hand motion is fundamental to visual perception, embodied intelligence, and multimodal communication, but current methods struggle with capturing subtle finger-level dynamics.

Method: Proposes FiGOP (Finger Group-of-Pictures) which pairs RGB keyframes with subsequent hand keypoints and uses a lightweight temporal encoder to convert keypoints into motion embeddings integrated with RGB features.

Result: Experiments on FingerCap-40K show strong Video-MLLMs struggle with finger-level reasoning, while FiGOP-augmented models achieve consistent gains under both HandJudge evaluation and human studies.

Conclusion: FiGOP effectively addresses temporal sparsity in hand motion capture by recovering fine temporal cues without increasing RGB density, enabling better finger-level motion understanding.

Abstract: Understanding fine-grained human hand motion is fundamental to visual perception, embodied intelligence, and multimodal communication. In this work, we propose Fine-grained Finger-level Hand Motion Captioning (FingerCap), which aims to generate textual descriptions that capture detailed finger-level semantics of hand actions. To support this task, we curate FingerCap-40K, a large-scale corpus of 40K paired hand-motion videos and captions spanning two complementary sources: concise instruction-style finger motions and diverse, naturalistic hand-object interactions. To enable effective evaluation, we employ HandJudge, an LLM-based rubric that measures finger-level correctness and motion completeness. Temporal sparsity remains a fundamental bottleneck for current Video-MLLMs, since sparse RGB sampling is insufficient to capture the subtle, high-frequency dynamics underlying fine finger motions. As a simple and compute-friendly remedy, we introduce FiGOP (Finger Group-of-Pictures), which pairs each RGB keyframe with subsequent hand keypoints until the next keyframe. A lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with RGB features. FiGOP adapts the classic GOP concept to finger motion, recovering fine temporal cues without increasing RGB density. Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning, while our FiGOP-augmented model yields consistent gains under HandJudge and human studies.
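
The FiGOP grouping (one RGB keyframe followed by keypoint-only frames until the next keyframe) can be sketched as a simple index partition. Uniform keyframe spacing and the function name `build_figops` are assumptions made for illustration. The abstract does not say how keyframes are selected.

```python
def build_figops(num_frames, keyframe_stride):
    """Partition frame indices into (keyframe, keypoint_frames) groups.

    Each group holds the index of one RGB keyframe and the indices of the
    following frames, up to but excluding the next keyframe, for which only
    hand keypoints would be kept.
    """
    if keyframe_stride < 1:
        raise ValueError("keyframe_stride must be at least 1")
    groups = []
    for key in range(0, num_frames, keyframe_stride):
        next_key = min(key + keyframe_stride, num_frames)
        groups.append((key, list(range(key + 1, next_key))))
    return groups
```

With 10 frames and a stride of 4, only frames 0, 4, and 8 are kept as RGB, while the remaining seven frames contribute keypoints, which is how temporal detail is recovered without denser RGB sampling.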

[110] Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling

Yicheng Deng, Hideaki Hayashi, Hajime Nagahara

Main category: cs.CV

TL;DR: Proposes a point-supervised facial expression spotting framework using Gaussian-based intensity modeling and dual-branch architecture for macro/micro-expression detection.

Details

Motivation: Existing methods require costly temporal boundary annotations; this work aims to reduce annotation burden by using only single timestamp annotations per expression instance.

Method: Two-branch framework: (1) Class-agnostic expression intensity branch with Gaussian-based instance-adaptive intensity modeling for soft pseudo-labeling, (2) Class-aware apex classification branch for macro/micro-expression distinction. Also uses intensity-aware contrastive loss.

Result: Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 datasets demonstrate the framework’s effectiveness in facial expression spotting with minimal annotations.

Conclusion: The proposed point-supervised framework successfully addresses facial expression spotting with reduced annotation requirements through innovative intensity modeling and dual-branch architecture.

Abstract: Automatic facial expression spotting, which aims to identify facial expression instances in untrimmed videos, is crucial for facial expression analysis. Existing methods primarily focus on fully-supervised learning and rely on costly, time-consuming temporal boundary annotations. In this paper, we investigate point-supervised facial expression spotting (P-FES), where only a single timestamp annotation per instance is required for training. We propose a unique two-branch framework for P-FES. First, to mitigate the limitation of hard pseudo-labeling, which often confuses neutral and expression frames with various intensities, we propose a Gaussian-based instance-adaptive intensity modeling (GIM) module to model instance-level expression intensity distribution for soft pseudo-labeling. By detecting the pseudo-apex frame around each point label, estimating the duration, and constructing an instance-level Gaussian distribution, GIM assigns soft pseudo-labels to expression frames for more reliable intensity supervision. The GIM module is incorporated into our framework to optimize the class-agnostic expression intensity branch. Second, we design a class-aware apex classification branch that distinguishes macro- and micro-expressions solely based on their pseudo-apex frames. During inference, the two branches work independently: the class-agnostic expression intensity branch generates expression proposals, while the class-aware apex-classification branch is responsible for macro- and micro-expression classification. Furthermore, we introduce an intensity-aware contrastive loss to enhance discriminative feature learning and suppress neutral noise by contrasting neutral frames with expression frames with various intensities. Extensive experiments on the SAMM-LV, CAS(ME)^2, and CAS(ME)^3 datasets demonstrate the effectiveness of our proposed framework.
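
The soft pseudo-labeling idea (a Gaussian centred on the pseudo-apex frame, with a width tied to the estimated duration) can be sketched as below. The mapping from duration to standard deviation (`sigma_scale`) and the hard cut-off outside the estimated instance are assumptions for illustration. The abstract does not give the exact parameterisation used by GIM.

```python
import numpy as np

def gaussian_soft_labels(num_frames, apex, duration, sigma_scale=0.25):
    """Soft intensity labels for one expression instance.

    apex:        index of the pseudo-apex frame.
    duration:    estimated instance length in frames.
    sigma_scale: fraction of the duration used as the standard deviation
                 (hypothetical choice).
    """
    t = np.arange(num_frames, dtype=np.float64)
    sigma = max(sigma_scale * duration, 1e-6)
    labels = np.exp(-0.5 * ((t - apex) / sigma) ** 2)
    # Frames outside the estimated instance are treated as neutral.
    labels[np.abs(t - apex) > duration / 2.0] = 0.0
    return labels
```

Compared with hard 0/1 pseudo-labels, frames near the apex receive a target close to 1 and frames near the estimated boundaries receive a small but non-zero target.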

[111] OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding

Teng Fu, Mengyang Zhao, Ke Niu, Kaixin Peng, Bin Li

Main category: cs.CV

TL;DR: OmniPT is a unified pedestrian tracking framework using LVLMs that can track pedestrians, perform reference-based tracking, and generate semantic understanding interactively, outperforming previous methods.

Details

Motivation: LVLMs excel at semantic understanding but lag behind expert models in instance-level tasks like pedestrian tracking. New tasks like Referring MOT require advanced semantic understanding where LVLMs have advantages.

Method: Four-stage training pipeline (RL, mid-training, SFT, RL): an initial RL stage for bounding-box-format output, mid-training on pedestrian datasets, SFT on tracking datasets, and a final RL stage to improve tracking and instruction following.

Result: Experimental results on tracking benchmarks show the proposed method performs better than previous approaches.

Conclusion: OmniPT successfully bridges the gap between LVLMs and instance-level tracking tasks, demonstrating superior performance in pedestrian tracking with semantic understanding capabilities.

Abstract: LVLMs have been shown to perform excellently in image-level tasks such as VQA and captioning. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, there have been a number of new topics in combining object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on a reference, and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task into a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a training phase consisting of RL-Mid Training-SFT-RL. Based on the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output a fixed and supervisable bounding box format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model’s tracking performance and enhance its ability to follow instructions. We conduct experiments on tracking benchmarks and the experimental results demonstrate that the proposed method can perform better than the previous methods.

[112] MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis

Di Luo, Shuhui Yang, Mingxin Yang, Jiawei Lu, Yixuan Tang, Xintong Han, Zhuo Chen, Beibei Wang, Chunchao Guo

Main category: cs.CV

TL;DR: MatPedia is a foundation model for PBR materials using a joint RGB-PBR representation that encodes materials as interdependent RGB and PBR latents, enabling unified text-to-material, image-to-material, and intrinsic decomposition tasks.

Details

Motivation: Current material creation is labor-intensive and existing generative methods lack unified representations, leading to fragmented pipelines and inability to leverage large-scale RGB image data.

Method: Uses a novel joint RGB-PBR representation encoding materials into two interdependent latents (RGB appearance and PBR maps), formulated as a 5-frame sequence and trained with video diffusion architectures on MatHybrid-410K dataset.

Result: Achieves native 1024×1024 synthesis that substantially surpasses existing approaches in both quality and diversity across multiple material tasks.

Conclusion: MatPedia provides a unified framework for material generation that bridges natural image appearance with physical PBR properties, enabling more efficient and high-quality material creation.

Abstract: Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-frame sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks (text-to-material generation, image-to-material generation, and intrinsic decomposition) within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native 1024×1024 synthesis that substantially surpasses existing approaches in both quality and diversity.

[113] ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion

Junming Liu, Yifei Sun, Weihua Cheng, Yujin Kang, Yirong Chen, Ding Wang, Guosun Zeng

Main category: cs.CV

TL;DR: ReBrain is a retrieval-augmented diffusion framework that synthesizes brain MRI from sparse CT scans using Brownian Bridge Diffusion Model and reference-guided generation with ControlNet.

Details

Motivation: MRI is crucial for brain disease diagnosis but not always feasible, and sparse low-dose CT scans make accurate MRI reconstruction challenging.

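Method: Uses a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices from sparse CT, retrieves structurally and pathologically similar CT slices from a prior database and incorporates them as references through a ControlNet branch, and applies spherical linear interpolation as supplementary guidance in the rare cases where retrieval fails.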

Result: Achieves state-of-the-art performance on SynthRAD2023 and BraTS datasets for cross-modal reconstruction under sparse conditions.

Conclusion: ReBrain effectively addresses the challenge of synthesizing full brain MRI volumes from highly sparse CT scans through a retrieval-augmented diffusion framework.

Abstract: Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
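
Spherical linear interpolation (slerp), the fallback mentioned for retrieval failures, follows the standard formula slerp(a, b; t) = sin((1 − t)ω)/sin(ω) · a + sin(tω)/sin(ω) · b, where ω is the angle between a and b. The abstract does not state what is being interpolated, so the sketch below only shows the operation on two generic non-zero vectors, for example flattened features of two neighbouring slices (an assumption).

```python
import numpy as np

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero arrays of equal shape.

    t = 0 returns a, t = 1 returns b. Falls back to linear interpolation
    when the inputs are (anti)parallel and the slerp weights are ill-defined.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Angle between the two inputs, computed from their unit directions.
    cos_omega = np.dot(a.ravel(), b.ravel()) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        return (1.0 - t) * a + t * b
    w_a = np.sin((1.0 - t) * omega) / sin_omega
    w_b = np.sin(t * omega) / sin_omega
    return w_a * a + w_b * b
```

For unit-norm inputs, the interpolated vector stays on the unit sphere. Plain linear interpolation would instead shrink the norm towards the midpoint, which is the usual reason for preferring slerp when interpolating in a latent space.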

Hsuan Yuan, Shao-Yu Weng, I-Hsuan Lo, Wei-Chen Chiu, Yu-Syuan Xu, Hao-Chien Hsueh, Jen-Hui Chuang, Ching-Chun Huang

Main category: cs.CV

TL;DR: Proposes a Dual Branch Degradation Extractor Network for blind super-resolution that handles both blur and noise degradation through separate embeddings, achieving state-of-the-art performance.

Details

Motivation: Existing SISR methods perform poorly when actual degradation deviates from assumed fixed degradation (e.g., bicubic downsampling), especially in blind SR scenarios where degradation is unknown.

Method: Uses a dual branch network to extract two unsupervised degradation embeddings representing blurry and noisy information separately, then adapts the SR network differently to each embedding type. Treats the degradation extractor as a regularizer that exploits differences between SR and HR images.

Result: Extensive experiments on multiple benchmarks show the method achieves state-of-the-art performance in blind SR problems.

Conclusion: The proposed dual branch approach effectively handles blind SR by separately modeling blur and noise degradation, outperforming previous methods.

Abstract: Previous methods have demonstrated remarkable performance in single image super-resolution (SISR) tasks with known and fixed degradation (e.g., bicubic downsampling). However, when the actual degradation deviates from these assumptions, these methods may experience significant declines in performance. In this paper, we propose a Dual Branch Degradation Extractor Network to address the blind SR problem. While some blind SR methods assume noise-free degradation and others do not explicitly consider the presence of noise in the degradation model, our approach predicts two unsupervised degradation embeddings that represent blurry and noisy information. The SR network can then be adapted to blur embedding and noise embedding in distinct ways. Furthermore, we treat the degradation extractor as a regularizer to capitalize on differences between SR and HR images. Extensive experiments on several benchmarks demonstrate our method achieves SOTA performance in the blind SR problem.

[115] Spanning Tree Autoregressive Visual Generation

Sangkyu Lee, Changho Lee, Janghoon Han, Hosung Song, Tackgeun You, Hwasup Lim, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu

Main category: cs.CV

TL;DR: STAR modeling uses spanning tree traversal orders to maintain sampling performance while enabling flexible image editing, overcoming limitations of random permutations in autoregressive models.

Details

Motivation: To address the trade-off between performance decline and sequence order flexibility in autoregressive image generation, particularly for image editing tasks where bidirectional context is needed.

Method: Uses traversal orders of uniform spanning trees sampled from image patch lattices, obtained through breadth-first search and rejection sampling to ensure connected partial observations appear as sequence prefixes.

Result: Preserves postfix completion capability while maintaining sampling performance without significant architectural changes to conventional autoregressive models.

Conclusion: STAR provides a structured yet flexible approach that incorporates image priors like center bias and locality, enabling effective image editing while maintaining generation quality.

Abstract: We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance while also providing sufficiently flexible sequence orders to accommodate image editing at inference. Approaches that expose randomly permuted sequence orders to conventional autoregressive (AR) models in visual generation for bidirectional context either suffer from a decline in performance or compromise the flexibility in sequence order choice at inference. Instead, STAR utilizes traversal orders of uniform spanning trees sampled in a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, allowing us to efficiently construct a spanning tree whose traversal order ensures that the connected partial observation of the image appears as a prefix in the sequence through rejection sampling. Through the tailored yet structured randomized strategy compared to random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance without any significant changes to the model architecture widely adopted in the language AR modeling.
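A minimal sketch of the sequence-order idea: sample a uniform spanning tree on the patch lattice, then read the patches in breadth-first order. The abstract does not name the tree sampler, so the Aldous-Broder random walk used here is an assumption, as are the 4x4 grid and the near-center root; the rejection-sampling step that forces a given partial observation to appear as a prefix is omitted.

```python
import random
from collections import deque

def uniform_spanning_tree(height, width, rng):
    """Sample a uniform spanning tree of a height x width grid (Aldous-Broder)."""
    def neighbors(cell):
        r, c = cell
        steps = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return [(i, j) for i, j in steps if 0 <= i < height and 0 <= j < width]

    current = (rng.randrange(height), rng.randrange(width))
    tree = {current: []}  # adjacency lists of the tree built so far
    while len(tree) < height * width:
        nxt = rng.choice(neighbors(current))
        if nxt not in tree:
            # First visit: the edge used to enter the cell joins the tree.
            tree[nxt] = [current]
            tree[current].append(nxt)
        current = nxt
    return tree

def bfs_order(tree, root):
    """Breadth-first traversal of the tree starting at root."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        for nb in tree[cell]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order

rng = random.Random(0)                  # fixed seed, illustration only
tree = uniform_spanning_tree(4, 4, rng)
order = bfs_order(tree, root=(2, 2))    # hypothetical near-center root
```

Because every patch enters the order after its tree parent, each prefix of `order` is a connected region of the grid, which is the property that lets a connected partial observation sit at the start of the sequence.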

[116] Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices

Jigyasa Gupta, Soumya Goyal, Anil Kumar, Ishan Jindal

Main category: cs.CV

TL;DR: Edge-efficient generator synthesizes realistic cooked food images from raw inputs using recipe and cooking state guidance, with new culinary similarity metric for training and monitoring.

Details

Motivation: Synthesizing realistic cooked food images on edge devices is challenging due to complex texture/color/structure changes during cooking. Existing methods produce unrealistic results or are too resource-intensive for edge deployment.

Method: Proposes an edge-efficient generator, guided by recipe and cooking state and conditioned on a raw food image, that uses a new Culinary Image Similarity (CIS) metric as both training loss and progress-monitoring signal.

Result: Model outperforms existing baselines with significant FID score reductions: 30% improvement on the authors' dataset and 60% on public datasets.

Conclusion: The proposed approach enables realistic food image synthesis on edge devices with user-preferred visual targets rather than fixed presets, ensuring temporal consistency and culinary plausibility.

Abstract: Synthesizing realistic cooked food images from raw inputs on edge devices is a challenging generative task, requiring models to capture complex changes in texture, color and structure during cooking. Existing image-to-image generation methods often produce unrealistic results or are too resource-intensive for edge deployment. We introduce the first oven-based cooking-progression dataset with chef-annotated doneness levels and propose an edge-efficient recipe and cooking state guided generator that synthesizes realistic food images conditioned on a raw food image. This formulation enables user-preferred visual targets rather than fixed presets. To ensure temporal consistency and culinary plausibility, we introduce a domain-specific Culinary Image Similarity (CIS) metric, which serves both as a training loss and a progress-monitoring signal. Our model outperforms existing baselines with significant reductions in FID scores (30% improvement on our dataset; 60% on public datasets).

[117] Gradient-Driven Natural Selection for Compact 3D Gaussian Splatting

Xiaobin Deng, Qiuli Yu, Changyu Diao, Min Li, Duanqing Xu

Main category: cs.CV

TL;DR: A natural selection-inspired pruning framework for 3D Gaussian Splatting that uses optimization gradients to autonomously determine which Gaussians to retain or prune, achieving state-of-the-art performance with over 0.6 dB PSNR gain under 15% budgets.

Details

Motivation: 3DGS uses many Gaussian primitives causing high storage and computational costs, while existing pruning methods rely on manual criteria or extra parameters, leading to suboptimal results.

Method: Models survival pressure as a regularization gradient field applied to opacity, allowing optimization gradients to autonomously determine pruning. Also introduces opacity decay with finite opacity prior to accelerate selection.

Result: Achieves over 0.6 dB PSNR gain under 15% budgets compared to 3DGS, establishing state-of-the-art performance for compact 3DGS.

Conclusion: The proposed fully learnable pruning framework effectively reduces 3DGS storage and computational overhead without human intervention, outperforming existing methods.

Abstract: 3DGS employs a large number of Gaussian primitives to fit scenes, resulting in substantial storage and computational overhead. Existing pruning methods rely on manually designed criteria or introduce additional learnable parameters, yielding suboptimal results. To address this, we propose a natural-selection-inspired pruning framework that models survival pressure as a regularization gradient field applied to opacity, allowing the optimization gradients, driven by the goal of maximizing rendering quality, to autonomously determine which Gaussians to retain or prune. This process is fully learnable and requires no human intervention. We further introduce an opacity decay technique with a finite opacity prior, which accelerates the selection process without compromising pruning effectiveness. Compared to 3DGS, our method achieves over 0.6 dB PSNR gain under 15% budgets, establishing state-of-the-art performance for compact 3DGS. Project page: https://xiaobin2001.github.io/GNS-web.

[118] A Diversity-optimized Deep Ensemble Approach for Accurate Plant Leaf Disease Detection

Sai Nath Chowdary Medikonduru, Hongpeng Jin, Yanzhao Wu

Main category: cs.CV

TL;DR: The paper introduces the Synergistic Diversity (SQ) framework to improve plant disease detection from leaf images by better selecting ensemble members through a novel diversity metric that captures synergy between models.

Details

Motivation: Plant diseases cause $220 billion in annual economic losses and threaten food security. Deep ensembles can improve detection accuracy, but selecting optimal ensemble members is challenging due to limitations in existing diversity metrics.

Method: 1) Analyzed limitations of existing ensemble diversity metrics (Q metrics), 2) Proposed novel SQ metric that captures synergy between ensemble members, 3) Validated approach through experiments on plant leaf image dataset.

Result: The SQ metric substantially improved ensemble selection and enhanced detection accuracy compared to existing diversity metrics.

Conclusion: The SQ framework enables more reliable and efficient image-based plant disease detection by better capturing ensemble member synergy.

Abstract: Plant diseases pose a significant threat to global agriculture, causing over $220 billion in annual economic losses and jeopardizing food security. The timely and accurate detection of these diseases from plant leaf images is critical to mitigating their adverse effects. Deep neural network Ensembles (Deep Ensembles) have emerged as a powerful approach to enhancing prediction accuracy by leveraging the strengths of diverse Deep Neural Networks (DNNs). However, selecting high-performing ensemble member models is challenging due to the inherent difficulty in measuring ensemble diversity. In this paper, we introduce the Synergistic Diversity (SQ) framework to enhance plant disease detection accuracy. First, we conduct a comprehensive analysis of the limitations of existing ensemble diversity metrics (denoted as Q metrics), which often fail to identify optimal ensemble teams. Second, we present the SQ metric, a novel measure that captures the synergy between ensemble members and consistently aligns with ensemble accuracy. Third, we validate our SQ approach through extensive experiments on a plant leaf image dataset, which demonstrates that our SQ metric substantially improves ensemble selection and enhances detection accuracy. Our findings pave the way for more reliable and efficient image-based plant disease detection.
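For context on what a pairwise ensemble diversity metric looks like, here is a sketch of the classic Yule Q-statistic computed from per-sample correctness flags of two models. Treating it as one of the paper's "Q metrics" is an assumption; the abstract does not list them, and the SQ metric itself is not specified there, so it is not reproduced here.

```python
def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given per-sample correctness flags."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1          # both models correct
        elif not a and not b:
            n00 += 1          # both models wrong
        elif a:
            n10 += 1          # only model A correct
        else:
            n01 += 1          # only model B correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        raise ValueError("Q is undefined when both products are zero")
    return (n11 * n00 - n01 * n10) / denom

# Toy correctness flags for two hypothetical models on six leaf images.
model_a = [1, 1, 1, 0, 0, 1]
model_b = [1, 0, 1, 1, 0, 1]
q = q_statistic(model_a, model_b)
```

Q ranges from -1 to 1. Values near 1 mean the two models tend to succeed and fail on the same samples, while lower values indicate that their errors fall on different samples, which is what an ensemble can exploit.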

[119] A lightweight detector for real-time detection of remote sensing images

Qianyi Wang, Guoqiang Ren

Main category: cs.CV

TL;DR: DMG-YOLO is a lightweight real-time detector for small objects in remote sensing images, featuring dual-branch feature extraction and multi-scale fusion modules to balance accuracy and efficiency.

Details

Motivation: Address the challenges of real-time small object detection in remote sensing imagery, where traditional methods struggle with balancing accuracy and computational efficiency.

Method: Proposes DMG-YOLO with three key components: Dual-branch Feature Extraction (DFE) module using depthwise convolutions and vision transformer with gating, Multi-scale Feature Fusion (MFF) with dilated convolutions, and Global and Local Aggregate Feature Pyramid Network (GLAFPN) for enhanced feature fusion.

Result: Achieves competitive performance on VisDrone2019 and NWPU VHR-10 datasets in terms of mAP, model size, and other key metrics, demonstrating effectiveness for small object detection.

Conclusion: DMG-YOLO provides an effective solution for real-time small object detection in remote sensing images, successfully balancing detection accuracy with computational efficiency through its novel architecture design.

Abstract: Remote sensing imagery is widely used across various fields, yet real-time detection remains challenging due to the prevalence of small objects and the need to balance accuracy with efficiency. To address this, we propose DMG-YOLO, a lightweight real-time detector tailored for small object detection in remote sensing images. Specifically, we design a Dual-branch Feature Extraction (DFE) module in the backbone, which partitions feature maps into two parallel branches: one extracts local features via depthwise separable convolutions, and the other captures global context using a vision transformer with a gating mechanism. Additionally, a Multi-scale Feature Fusion (MFF) module with dilated convolutions enhances multi-scale integration while preserving fine details. In the neck, we introduce the Global and Local Aggregate Feature Pyramid Network (GLAFPN) to further boost small object detection through global-local feature fusion. Extensive experiments on the VisDrone2019 and NWPU VHR-10 datasets show that DMG-YOLO achieves competitive performance in terms of mAP, model size, and other key metrics.
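The efficiency argument for depthwise separable convolutions in the local branch comes down to simple parameter arithmetic. The sketch below compares weight counts (biases ignored) for a standard and a depthwise separable layer; the kernel size and channel widths are illustrative assumptions, not values from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
full = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
ratio = full / separable
```

For this hypothetical layer the separable form needs 8,768 weights instead of 73,728, roughly an 8.4x reduction.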

[120] RadioKMoE: Knowledge-Guided Radiomap Estimation with Kolmogorov-Arnold Networks and Mixture-of-Experts

Fupei Guo, Kerry Pan, Songyang Zhang, Yue Wang, Zhi Ding

Main category: cs.CV

TL;DR: RadioKMoE: A knowledge-guided radiomap estimation framework combining Kolmogorov-Arnold Networks (KAN) for coarse coverage prediction and Mixture-of-Experts (MoE) for precise estimation, achieving improved accuracy and robustness.

Details

Motivation: Complex radio propagation behavior and challenging environments make radiomap estimation difficult, requiring better methods to handle spatial signal propagation knowledge for wireless network management.

Method: Proposed RadioKMoE framework with KAN module for initial coarse coverage map prediction using physics model approximation, followed by MoE network with specialized expert networks for distinct radiomap patterns to refine local details while maintaining global consistency.

Result: Experimental results show enhanced accuracy and robustness in both multi- and single-band radiomap estimation compared to conventional methods.

Conclusion: The RadioKMoE framework effectively addresses radiomap estimation challenges by combining KAN’s physics modeling strengths with MoE’s pattern specialization, providing a robust solution for wireless network management.

Abstract: Radiomap serves as a vital tool for wireless network management and deployment by providing powerful spatial knowledge of signal propagation and coverage. However, increasingly complex radio propagation behavior and surrounding environments pose strong challenges for radiomap estimation (RME). In this work, we propose a knowledge-guided RME framework that integrates Kolmogorov-Arnold Networks (KAN) with Mixture-of-Experts (MoE), namely RadioKMoE. Specifically, we design a KAN module to predict an initial coarse coverage map, leveraging KAN’s strength in approximating physics models and global radio propagation patterns. The initial coarse map, together with environmental information, drives our MoE network for precise radiomap estimation. Unlike conventional deep learning models, the MoE module comprises expert networks specializing in distinct radiomap patterns to improve local details while preserving global consistency. Experimental results in both multi- and single-band RME demonstrate the enhanced accuracy and robustness of the proposed RadioKMoE in radiomap estimation.
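The mixture-of-experts step can be pictured as a gate that weights the outputs of several expert networks. The abstract does not describe RadioKMoE's gating, so the softmax-weighted combination below is a generic sketch under that assumption, with the expert outputs given as plain lists rather than real networks.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_combine(gate_scores, expert_outputs):
    """Weight each expert's output vector by its softmax gate value and sum them."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]

# Two hypothetical experts predicting signal strength at two locations.
outputs = [[1.0, 3.0], [3.0, 5.0]]
blended = moe_combine([0.0, 0.0], outputs)      # equal gates: plain average
dominated = moe_combine([100.0, 0.0], outputs)  # first expert dominates
```

With equal gate scores the result is the average of the experts; when one score is much larger, the output collapses to that expert, which is how specialization on distinct patterns would show up.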

[121] DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction

Jonathan Skaza, Parsa Madinei, Ziqi Wen, Miguel Eckstein

Main category: cs.CV

TL;DR: DReX is a vision-only model that fuses self-supervised DINOv3 and supervised ResNet-50 features to predict image complexity, achieving state-of-the-art performance without language data.

Details

Motivation: To determine if language information is necessary for visual complexity prediction and explore whether visual features alone can achieve human-aligned complexity assessment.

Method: Fuses multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16 using a learnable attention mechanism, capturing both low-level texture patterns and high-level semantic structure.

Result: Achieves SOTA on IC9600 benchmark (Pearson r = 0.9581), surpassing multimodal methods while using 21.5x fewer parameters. Generalizes well across multiple datasets and metrics.

Conclusion: Visual features alone are sufficient for human-aligned complexity prediction, and properly fused self-supervised transformers and supervised CNNs offer complementary synergistic benefits.

Abstract: Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods, including those trained on multimodal image-text data, while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
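The four evaluation metrics named above (Pearson, Spearman, RMSE, MAE) can be computed as in the following sketch. The scores are made-up illustrative values, not IC9600 data, and the functions are plain reference implementations rather than the authors' evaluation code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def mae(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical predicted vs. human complexity scores for four images.
predicted = [0.1, 0.4, 0.35, 0.8]
human = [0.2, 0.5, 0.3, 0.9]
scores = (pearson(predicted, human), spearman(predicted, human),
          rmse(predicted, human), mae(predicted, human))
```

Pearson and Spearman reward agreement in trend and ordering, while RMSE and MAE penalize absolute deviation, so a model can rank images perfectly and still carry a nonzero error.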

[122] DepthFocus: Controllable Depth Estimation for See-Through Scenes

Junhong Min, Jimin Kim, Cheol-Hui Min, Minwook Kim, Youngpil Jeon, Minyong Choi

Main category: cs.CV

TL;DR: DepthFocus is a steerable Vision Transformer that enables intent-driven stereo depth estimation, allowing users to specify desired depth focus rather than just estimating static depth maps.

Details

Motivation: Real-world depth is multi-layered due to transmissive materials, but existing models only estimate static depth maps focused on nearest surfaces, unlike humans who can actively shift focus to perceive desired depths.

Method: A steerable Vision Transformer conditioned on scalar depth preference that dynamically adapts computation to focus on intended depth, trained on a new 500k multi-layered synthetic dataset capturing diverse see-through effects.

Result: Achieves state-of-the-art performance on conventional benchmarks (BOOSTER), demonstrates intent-aligned estimation on new multi-depth datasets, and shows strong generalization to unseen see-through scenes.

Conclusion: Represents a significant step toward active, human-like 3D perception by enabling selective depth perception in complex scenes with transmissive materials.

Abstract: Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive, attempting to estimate static depth maps anchored to the nearest surface, while humans actively shift focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes. The training primarily leverages our newly constructed 500k multi-layered synthetic dataset, designed to capture diverse see-through effects. DepthFocus not only achieves state-of-the-art performance on conventional single-depth benchmarks like BOOSTER, a dataset notably rich in transparent and reflective objects, but also quantitatively demonstrates intent-aligned estimation on our newly proposed real and synthetic multi-depth datasets. Moreover, it exhibits strong generalization capabilities on unseen see-through scenes, underscoring its robustness as a significant step toward active and human-like 3D perception.

[123] VLM-Augmented Degradation Modeling for Image Restoration Under Adverse Weather Conditions

Qianyi Shao, Yuanfan Zhang, Renxiang Xiao, Liang Hu

Main category: cs.CV

TL;DR: MVLR is a unified model that restores images from various weather degradations using visual-language reasoning and memory retrieval for real-time deployment.

Details

Motivation: Reliable visual perception under adverse weather conditions is crucial for autonomous driving and outdoor robots, but challenging due to diverse degradation patterns.

Method: Combines lightweight encoder-decoder with Visual-Language Model for degradation reasoning and Implicit Memory Bank for pattern retrieval, using cross-attention fusion.

Result: Outperforms single-branch and Mixture-of-Experts baselines on four severe-weather benchmarks in PSNR and SSIM metrics.

Conclusion: MVLR achieves practical balance between model compactness and expressiveness for real-time deployment in diverse outdoor conditions.

Abstract: Reliable visual perception under adverse weather conditions, such as rain, haze, snow, or a mixture of them, is desirable yet challenging for autonomous driving and outdoor robots. In this paper, we propose a unified Memory-Enhanced Visual-Language Recovery (MVLR) model that restores images from different degradation levels under various weather conditions. MVLR couples a lightweight encoder-decoder backbone with a Visual-Language Model (VLM) and an Implicit Memory Bank (IMB). The VLM performs chain-of-thought inference to encode weather degradation priors and the IMB stores continuous latent representations of degradation patterns. The VLM-generated priors query the IMB to retrieve fine-grained degradation prototypes. These prototypes are then adaptively fused with multi-scale visual features via dynamic cross-attention mechanisms, enhancing restoration accuracy while maintaining computational efficiency. Extensive experiments on four severe-weather benchmarks show that MVLR surpasses single-branch and Mixture-of-Experts baselines in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). These results indicate that MVLR offers a practical balance between model compactness and expressiveness for real-time deployment in diverse outdoor conditions.
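For readers unfamiliar with the reported restoration metric, PSNR follows directly from the mean squared error between a reference image and its restoration. The sketch below uses flattened pixel lists and assumes an 8-bit peak value of 255; SSIM is more involved and is not shown.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every pixel is off by 10 gray levels.
clean = [0, 0, 0, 0]
noisy = [10, 10, 10, 10]
value = psnr(clean, noisy)
```

A uniform error of 10 gray levels gives an MSE of 100 and a PSNR of about 28.13 dB; halving the error magnitude raises PSNR by roughly 6 dB.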

[124] Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang

Main category: cs.CV

TL;DR: LVLMs suffer from multi-path hallucinations. This paper proposes a causal intervention framework targeting image-text and text-text pathways, with format-specific methods that effectively reduce hallucinations.

Details

Motivation: Large Vision-Language Models exhibit persistent hallucination issues despite strong performance, requiring deeper understanding of their causal mechanisms beyond single-path explanations.

Method: Comprehensive intervention framework aligned with transformer causal architecture, analyzing image-to-input-text, image-to-output-text, and text-to-text pathways. Identifies critical hallucination heads and applies format-specific interventions for discriminative and generative tasks.

Result: Experiments across multiple benchmarks show consistent reduction in hallucinations across diverse alignment types, demonstrating the effectiveness of the pathway-specific intervention approach.

Conclusion: Hallucinations in LVLMs stem from complex interplay of multiple causal pathways, and targeted interventions on critical heads can effectively mitigate them when tailored to specific question-answer formats.

Abstract: Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer’s causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.

[125] Vision Language Models are Confused Tourists

Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, Alham Fikri Aji

Main category: cs.CV

TL;DR: The paper introduces ConfusedTourist, a cultural adversarial robustness suite that reveals VLMs’ vulnerability to mixed cultural cues, showing significant accuracy drops when multiple cultural concepts coexist in images.

Details

Motivation: Current VLM evaluations overlook scenarios with multiple cultural cues, failing to test model stability across diverse cultural inputs which is crucial for supporting multicultural societies.

Method: Developed ConfusedTourist benchmark with image-stacking and image-generation-based perturbations to test VLMs’ cultural robustness against mixed geographical cues.

Result: VLMs show critical vulnerability with heavy accuracy drops under simple perturbations, and interpretability analyses reveal systematic attention shifts toward distracting cultural cues.

Conclusion: Visual cultural concept mixing substantially impairs state-of-the-art VLMs, highlighting the urgent need for more culturally robust multimodal understanding systems.

Abstract: Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs’ stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.

[126] RoomPlanner: Explicit Layout Planner for Easier LLM-Driven 3D Room Generation

Wenzhuo Sun, Mingjian Liang, Wenxuan Song, Xuelian Cheng, Zongyuan Ge

Main category: cs.CV

TL;DR: RoomPlanner is the first fully automatic 3D room generation framework that creates realistic indoor scenes from short text prompts without manual layout design or panoramic images.

Details

Motivation: To enable painless creation of realistic 3D indoor scenes with minimal input (short text) and eliminate the need for manual layout design or panoramic image guidance.

Method: Uses hierarchical language-driven agent planners to parse text prompts into detailed scene descriptions, generates 3D point clouds, implements arrangement constraints for collision-free layouts, and employs AnyReach Sampling and Interval Timestep Flow Sampling for efficient rendering optimization.

Result: Generates geometrically rational 3D indoor scenes in under 30 minutes, surpassing prior approaches in rendering speed and visual quality while maintaining editability.

Conclusion: RoomPlanner successfully demonstrates fully automatic 3D room generation from text prompts, achieving efficient and high-quality results with improved rendering speed and preserved editability.

Abstract: In this paper, we propose RoomPlanner, the first fully automatic 3D room generation framework for painlessly creating realistic indoor scenes with only short text as input. Without any manual layout design or panoramic image guidance, our framework can generate explicit layout criteria for rational spatial placement. We begin by introducing a hierarchical structure of language-driven agent planners that can automatically parse short and ambiguous prompts into detailed scene descriptions. These descriptions include raw spatial and semantic attributes for each object and the background, which are then used to initialize 3D point clouds. To position objects within bounded environments, we implement two arrangement constraints that iteratively optimize spatial arrangements, ensuring a collision-free and accessible layout solution. In the final rendering stage, we propose a novel AnyReach Sampling strategy for camera trajectory, along with the Interval Timestep Flow Sampling (ITFS) strategy, to efficiently optimize the coarse 3D Gaussian scene representation. These approaches help reduce the total generation time to under 30 minutes. Extensive experiments demonstrate that our method can produce geometrically rational 3D indoor scenes, surpassing prior approaches in both rendering speed and visual quality while preserving editability. The code will be available soon.
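The abstract does not spell out the two arrangement constraints, but a collision-free, in-bounds layout can be checked with axis-aligned bounding boxes as in this sketch. The 2D footprint representation, the function names, and the example furniture are all assumptions made for illustration.

```python
from itertools import combinations

def boxes_overlap(a, b):
    """True if two boxes (xmin, ymin, xmax, ymax) overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def inside_room(box, room):
    """True if the box lies entirely within the room rectangle."""
    return (room[0] <= box[0] and room[1] <= box[1]
            and box[2] <= room[2] and box[3] <= room[3])

def layout_violations(boxes, room):
    """Return (objects outside the room, pairs of colliding objects)."""
    out_of_bounds = [name for name, box in boxes.items()
                     if not inside_room(box, room)]
    collisions = [(n1, n2)
                  for (n1, b1), (n2, b2) in combinations(boxes.items(), 2)
                  if boxes_overlap(b1, b2)]
    return out_of_bounds, collisions

# Hypothetical 5 m x 4 m room with three object footprints.
room = (0.0, 0.0, 5.0, 4.0)
boxes = {
    "bed": (0.0, 0.0, 2.0, 3.0),
    "desk": (1.5, 2.0, 3.0, 3.5),   # overlaps the bed
    "lamp": (4.5, 3.5, 5.5, 4.0),   # sticks out of the room
}
out_of_bounds, collisions = layout_violations(boxes, room)
```

An iterative optimizer could move objects until both lists are empty; how RoomPlanner actually resolves violations is not described in the abstract.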

[127] Range-Edit: Semantic Mask Guided Outdoor LiDAR Scene Editing

Suchetan G. Uppur, Hemant Kumar, Vaibhav Kumar

Main category: cs.CV

TL;DR: A novel method for generating synthetic LiDAR point clouds by editing real-world scans using semantic mask guidance and diffusion-based generation in 2D range view space.

Details

Motivation: Current methods for training autonomous driving systems rely on expensive handcrafted 3D simulations that fail to capture real-world complexity, especially for critical edge cases, limiting system generalization and robustness.

Method: Transform LiDAR point clouds to 2D range images, use convex hull-based semantic masks to guide diffusion-based generation, ensuring geometric consistency by preserving object dimensions, orientations, and locations from real environments.

Result: High-quality LiDAR point cloud generation capable of producing complex edge cases and dynamic scenes, validated on KITTI-360 dataset, offering cost-effective and scalable data generation.

Conclusion: This approach provides a practical solution for generating diverse LiDAR data to improve autonomous driving system robustness, addressing limitations of current simulation methods.

Abstract: Training autonomous driving and navigation systems requires large and diverse point cloud datasets that capture complex edge case scenarios from various dynamic urban settings. Acquiring such diverse scenarios from real-world point cloud data, especially for critical edge cases, is challenging, which restricts system generalization and robustness. Current methods rely on simulating point cloud data within handcrafted 3D virtual environments, which is time-consuming, computationally expensive, and often fails to fully capture the complexity of real-world scenes. To address some of these issues, this research proposes a novel approach that edits real-world LiDAR scans using semantic mask-based guidance to generate novel synthetic LiDAR point clouds. We incorporate range image projection and semantic mask conditioning to achieve diffusion-based generation. Point clouds are transformed to 2D range view images, which are used as an intermediate representation to enable semantic editing using convex hull-based semantic masks. These masks guide the generation process by providing information on the dimensions, orientations, and locations of objects in the real environment, ensuring geometric consistency and realism. This approach demonstrates high-quality LiDAR point cloud generation, capable of producing complex edge cases and dynamic scenes, as validated on the KITTI-360 dataset. This offers a cost-effective and scalable solution for generating diverse LiDAR data, a step toward improving the robustness of autonomous driving systems.
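
The range-view step can be illustrated with a standard spherical projection. This is a sketch under stated assumptions, not the authors' code: the image size and the vertical field of view (+3 to -25 degrees) are illustrative placeholders, and `project_to_range_image` is a hypothetical helper name.

```python
import math


def project_to_range_image(points, height=4, width=8,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) points to a height x width range image.

    Each pixel stores the distance of the nearest return; 0.0 means no return.
    Field-of-view values are assumptions chosen for illustration.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue
        yaw = math.atan2(y, x)       # azimuth in [-pi, pi]
        pitch = math.asin(z / r)     # elevation
        if pitch > fov_up or pitch < fov_down:
            continue                 # outside the sensor's vertical field of view
        u = 0.5 * (1.0 - yaw / math.pi) * width   # column from azimuth
        v = (fov_up - pitch) / fov * height       # row 0 is the top beam
        col = min(width - 1, int(u))
        row = min(height - 1, int(v))
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r      # keep the closest point per pixel
    return image


# Two points along the +x axis fall in the same pixel; the nearer one wins.
img = project_to_range_image([(10.0, 0.0, 0.0), (5.0, 0.0, 0.0)])
print(img[0][4])
```

Once the scan is in this 2D form, a semantic mask is simply another image of the same size, which is what makes image-space diffusion editing applicable.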

[128] PathAgent: Toward Interpretable Analysis of Whole-slide Pathology Images via Large Language Model-based Agentic Reasoning

Jingyun Chen, Linghan Cai, Zhikang Wang, Yi Huang, Songhan Jiang, Shenjin Huang, Hongpeng Wang, Yongbing Zhang

Main category: cs.CV

TL;DR: PathAgent is a training-free LLM-based agent framework that emulates pathologists’ stepwise reasoning process for analyzing whole-slide images, providing fully interpretable predictions through explicit chain-of-thought.

Details

Motivation: Existing computational pipelines for whole-slide image analysis lack explicit reasoning trajectories, resulting in opaque and unjustifiable predictions that don't mirror human pathologists' iterative, evidence-driven reasoning process.

Method: PathAgent uses three modules: Navigator for autonomously exploring WSIs and locating significant micro-regions, Perceptor for extracting morphology visual cues, and Executor for integrating findings into evolving natural language trajectories. The entire process forms an explicit chain-of-thought.

Result: Evaluated across five challenging datasets, PathAgent exhibits strong zero-shot generalization, surpassing task-specific baselines in both open-ended and constrained visual question-answering tasks. Collaborative evaluation with human pathologists confirms its promise as a transparent diagnostic assistant.

Conclusion: PathAgent successfully bridges the gap between computational analysis and human expert reasoning by providing fully interpretable predictions through explicit reasoning trajectories, showing strong potential as a clinically grounded diagnostic tool.

Abstract: Analyzing whole-slide images (WSIs) requires an iterative, evidence-driven reasoning process that parallels how pathologists dynamically zoom, refocus, and self-correct while collecting the evidence. However, existing computational pipelines often lack this explicit reasoning trajectory, resulting in inherently opaque and unjustifiable predictions. To bridge this gap, we present PathAgent, a training-free, large language model (LLM)-based agent framework that emulates the reflective, stepwise analytical approach of human experts. PathAgent can autonomously explore WSI, iteratively and precisely locating significant micro-regions using the Navigator module, extracting morphology visual cues using the Perceptor, and integrating these findings into the continuously evolving natural language trajectories in the Executor. The entire sequence of observations and decisions forms an explicit chain-of-thought, yielding fully interpretable predictions. Evaluated across five challenging datasets, PathAgent exhibits strong zero-shot generalization, surpassing task-specific baselines in both open-ended and constrained visual question-answering tasks. Moreover, a collaborative evaluation with human pathologists confirms PathAgent’s promise as a transparent and clinically grounded diagnostic assistant.

[129] RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion

Bhanu Pratap Paregi, Vaibhav Kumar

Main category: cs.CV

TL;DR: RL-AD-Net is a reinforcement learning refinement framework that improves point cloud completion quality by adjusting global feature vectors in latent space, with a PointNN selector ensuring geometric consistency.

Details

Motivation: Existing point cloud completion models generate globally plausible shapes but often have local geometric inconsistencies that need refinement.

Method: Uses RL agent to adjust global feature vectors in pretrained autoencoder’s latent space, with PointNN selector for geometric consistency evaluation and ensemble refinement.

Result: Consistently improves completion quality across both training-style and random cropping scenarios, outperforming baseline completion networks.

Conclusion: RL-AD-Net provides effective, lightweight, and model-agnostic refinement for point cloud completion without requiring retraining of base networks.

Abstract: Recent point cloud completion models, including transformer-based, denoising-based, and other state-of-the-art approaches, generate globally plausible shapes from partial inputs but often leave local geometric inconsistencies. We propose RL-AD-Net, a reinforcement learning (RL) refinement framework that operates in the latent space of a pretrained point autoencoder. The autoencoder encodes completions into compact global feature vectors (GFVs), which are selectively adjusted by an RL agent to improve geometric fidelity. To ensure robustness, a lightweight non-parametric PointNN selector evaluates the geometric consistency of both the original completion and the RL-refined output, retaining the better reconstruction. When ground truth is available, both Chamfer Distance and geometric consistency metrics guide refinement. Training is performed separately per category, since the unsupervised and dynamic nature of RL makes convergence across highly diverse categories challenging. Nevertheless, the framework can be extended to multi-category refinement in future work. Experiments on ShapeNetCore-2048 demonstrate that while baseline completion networks perform reasonably under their training-style cropping, they struggle in random cropping scenarios. In contrast, RL-AD-Net consistently delivers improvements across both settings, highlighting the effectiveness of RL-guided ensemble refinement. The approach is lightweight, modular, and model-agnostic, making it applicable to a wide range of completion networks without requiring retraining.
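
For readers unfamiliar with the metric, here is a small sketch of the symmetric Chamfer Distance together with a "keep the better reconstruction" rule. The selection function is a stand-in that scores against ground truth; it is an assumption for illustration and does not reproduce the PointNN selector, which the abstract describes as working on geometric consistency.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean squared nearest-neighbour distance, both ways."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def select_better(original, refined, ground_truth):
    """Hypothetical selector: keep whichever completion is closer to ground truth."""
    cd_original = chamfer_distance(original, ground_truth)
    cd_refined = chamfer_distance(refined, ground_truth)
    if cd_refined < cd_original:
        return "refined", refined
    return "original", original


# Toy example: the refined cloud sits closer to the ground-truth points.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
original = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]
refined = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
print(select_better(original, refined, gt)[0])  # refined
```

Falling back to the original completion whenever refinement does not help is what keeps such a scheme from doing worse than the base network on the chosen score.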

[130] Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

Chuancheng Shi, Shangze Li, Shiming Guo, Simiao Xie, Wenhua Wu, Jingtong Dou, Chao Wu, Canran Xiao, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua

Main category: cs.CV

TL;DR: Current multilingual T2I models produce culturally neutral or English-biased results. The paper identifies insufficient activation of culture-related representations as the root cause and proposes two alignment strategies to improve cultural consistency.

Details

Motivation: Multilingual T2I models output culturally inconsistent images across languages, often producing neutral or English-biased results despite having cultural knowledge, due to insufficient activation of culture-related representations.

Method: Proposed a probing method to localize culture-sensitive neurons, then introduced two strategies: (1) inference-time cultural activation that amplifies identified neurons without fine-tuning, and (2) layer-targeted cultural enhancement that updates only culturally relevant layers.

Result: Experiments on CultureBench show consistent improvements in cultural consistency while maintaining image fidelity and diversity, outperforming strong baselines.

Conclusion: The proposed methods effectively address cultural bias in multilingual T2I models by targeting specific culture-sensitive neurons and layers, achieving better cross-lingual cultural consistency without compromising other quality metrics.

Abstract: Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilized. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.
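
The idea of inference-time activation can be sketched as "find the neurons that respond differently to cultural prompts, then scale them up". Everything below is an illustrative assumption: the gap-based ranking is a simple stand-in for the paper's probing method (which the abstract does not specify), and the function names and scale factor are hypothetical.

```python
def rank_neurons_by_gap(cultural_acts, neutral_acts, top_k):
    """Rank neurons by |mean activation on cultural prompts - mean on neutral prompts|.

    Each argument is a list of activation vectors (one per prompt).
    This is a stand-in probe, not the paper's method.
    """
    n = len(cultural_acts[0])

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    gaps = [abs(mean(cultural_acts, j) - mean(neutral_acts, j)) for j in range(n)]
    return sorted(range(n), key=lambda j: gaps[j], reverse=True)[:top_k]


def amplify_neurons(activations, neuron_ids, scale=2.0):
    """Scale the selected neurons at inference time; all weights stay untouched."""
    ids = set(neuron_ids)
    return [a * scale if i in ids else a for i, a in enumerate(activations)]


# Toy example: neuron 1 is the only one that reacts to the cultural prompts.
cultural = [[1.0, 5.0, 0.0], [1.0, 3.0, 0.0]]
neutral = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
ids = rank_neurons_by_gap(cultural, neutral, top_k=1)
print(ids, amplify_neurons([1.0, 2.0, 3.0], ids))  # [1] [1.0, 4.0, 3.0]
```

In a real model the scaling would be applied inside the forward pass of the identified layers, which is why no fine-tuning is needed for this first strategy.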

[131] REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting

Di Wu, Liu Liu, Anran Huang, Yuyan Liu, Qiaoyu Jun, Shaofan Liu, Liangtu Song, Cewu Lu

Main category: cs.CV

TL;DR: REArtGS++ improves articulated object reconstruction using planar Gaussian splatting with temporal geometry constraints and decoupled screw motion modeling for better generalization across object types.

Details

Motivation: REArtGS struggles with screw-joint and multi-part objects and lacks geometric constraints for unseen states, limiting its generalization capability.

Method: Models decoupled screw motion for each joint without type prior, optimizes part-aware Gaussians with joint parameters through motion blending, and introduces temporal geometry constraints using planar Gaussians with Taylor expansion-based regularization.

Result: Demonstrates superior performance in part-level surface reconstruction and joint parameter estimation on both synthetic and real-world articulated objects compared to existing approaches.

Conclusion: REArtGS++ provides a more generalizable solution for articulated object reconstruction by addressing limitations of previous methods through improved motion modeling and temporal geometric constraints.

Abstract: Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consistent regularization between planar normal and depth through Taylor first-order expansion. Extensive experiments on both synthetic and real-world articulated objects demonstrate our superiority in generalizable part-level surface reconstruction and joint parameter estimation, compared to existing approaches. Project Site: https://sites.google.com/view/reartgs2/home.
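
A screw motion combines a rotation about an axis with a translation along the same axis, so a revolute joint (translation 0) and a prismatic joint (angle 0) are both special cases. The sketch below applies one to a single point using Rodrigues' rotation formula. The function name and parameterization are assumptions for illustration; the paper's exact "decoupled" formulation and its blending over Gaussians are not shown.

```python
import math


def screw_transform(point, axis_dir, axis_point, theta, translation):
    """Rotate `point` by `theta` about an axis, then shift it along that axis.

    The axis passes through `axis_point` with direction `axis_dir`.
    theta = 0 gives a prismatic joint; translation = 0 gives a revolute joint.
    """
    norm = math.sqrt(sum(c * c for c in axis_dir))
    d = [c / norm for c in axis_dir]                   # unit axis direction
    v = [p - o for p, o in zip(point, axis_point)]     # point relative to the axis
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    dot = sum(di * vi for di, vi in zip(d, v))
    cross = [
        d[1] * v[2] - d[2] * v[1],
        d[2] * v[0] - d[0] * v[2],
        d[0] * v[1] - d[1] * v[0],
    ]
    # Rodrigues: v*cos + (d x v)*sin + d*(d . v)*(1 - cos)
    rotated = [v[i] * cos_t + cross[i] * sin_t + d[i] * dot * (1.0 - cos_t)
               for i in range(3)]
    return [axis_point[i] + rotated[i] + translation * d[i] for i in range(3)]


# Quarter turn about the z-axis with a 0.5 lift, like a bottle cap unscrewing.
print(screw_transform((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0),
                      math.pi / 2, 0.5))
```

Because one parameterization covers all three joint types, no joint-type label is needed up front, which matches the "without type prior" phrasing in the abstract.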

[132] MuM: Multi-View Masked Image Modeling for 3D Vision

David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman

Main category: cs.CV

TL;DR: MuM extends masked autoencoding (MAE) to multiple views for 3D vision, outperforming DINOv3 and CroCo v2 on geometric tasks.

Details

Motivation: Most self-supervised learning focuses on semantic understanding rather than geometric reasoning. CroCo showed promise for 3D understanding, but there's room for improvement in scalability and performance.

Method: Extends MAE to arbitrarily many views of the same scene by uniformly masking all views and using a lightweight decoder with inter-frame attention.

Result: Outperforms state-of-the-art visual encoders DINOv3 and CroCo v2 on downstream tasks including feedforward reconstruction, dense image matching, and relative pose estimation.

Conclusion: Multi-view masked autoencoding (MuM) provides a simpler and more scalable approach for learning features tailored to 3D vision tasks.

Abstract: Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.
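
One way to read "uniformly masking all views" is that every view of the scene gets an independent random mask at the same ratio. The sketch below generates such masks. The 0.75 ratio, the patch count, and the helper name are assumptions for illustration, not values taken from the paper.

```python
import random


def mask_views(num_views, patches_per_view, mask_ratio=0.75, seed=0):
    """Return one boolean mask per view; True marks a patch hidden from the encoder.

    Every view is masked at the same ratio, with positions drawn independently.
    """
    rng = random.Random(seed)
    num_masked = int(round(patches_per_view * mask_ratio))
    masks = []
    for _ in range(num_views):
        hidden = set(rng.sample(range(patches_per_view), num_masked))
        masks.append([i in hidden for i in range(patches_per_view)])
    return masks


# Three views of 16 patches each: 12 patches are hidden in every view.
masks = mask_views(num_views=3, patches_per_view=16)
print([sum(m) for m in masks])  # [12, 12, 12]
```

A decoder with inter-frame attention can then reconstruct a hidden patch in one view from visible patches in the others, which is where the geometric signal comes from.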

[133] Diversity Has Always Been There in Your Visual Autoregressive Models

Tong Wang, Guanyu Yang, Nian Liu, Kai Wang, Yaxing Wang, Abdelrahman M Shaker, Salman Khan, Fahad Shahbaz Khan, Senmao Li

Main category: cs.CV

TL;DR: DiverseVAR is a training-free method that restores generative diversity in Visual Autoregressive (VAR) models by suppressing pivotal components in input and amplifying them in output, addressing diversity collapse without performance loss.

Details

Motivation: VAR models suffer from diversity collapse similar to few-step distilled diffusion models, reducing output variability despite their efficiency advantages over traditional AR and diffusion models.

Method: Suppress pivotal components in the model input and amplify them in the model output to unlock inherent generative potential without requiring additional training.

Result: Substantially enhances generative diversity with negligible performance influences, effectively addressing diversity collapse while preserving high-fidelity synthesis.

Conclusion: DiverseVAR provides a simple yet effective solution to restore VAR model diversity through pivotal component manipulation, demonstrating practical improvement without training overhead.

Abstract: Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from diversity collapse, i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals the pivotal component of the feature map as a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only negligible impact on performance. Our code will be publicly released at https://github.com/wangtong627/DiverseVAR.

[134] SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting

Di Wu, Liu Liu, Xueyu Yuan, Qiaoyu Jun, Wenxiao Chen, Ruilong Yan, Yiming Tang, Liangtu Song

Main category: cs.CV

TL;DR: A category-agnostic articulated object reconstruction framework using planar Gaussian Splatting that achieves high-fidelity part-level surface reconstruction from sparse-view RGB images of a single state.

Details

Motivation: Existing articulated object reconstruction methods require costly multi-stage and multi-view observations, limiting practical applications.

Method: Uses planar Gaussian Splatting with Gaussian information field for viewpoint selection, compresses 3D Gaussians to planar Gaussians for normal/depth estimation, optimizes through depth smooth regularization and few-shot diffusion, and incorporates part segmentation probability.

Result: Achieves higher-fidelity part-level surface reconstruction on both synthetic and real-world data compared to existing methods.

Conclusion: The proposed framework effectively reconstructs articulated objects from sparse-view RGB images without requiring expensive multi-stage observations.

Abstract: Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address these limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. Then we compress 3D Gaussians into planar Gaussians to facilitate accurate estimation of normal and depth. The planar Gaussians are optimized in a coarse-to-fine manner through depth smooth regularization and few-shot diffusion. Moreover, we introduce a part segmentation probability for each Gaussian primitive and update them by back-projecting part segmentation masks of renderings. Extensive experimental results demonstrate that our method achieves higher-fidelity part-level surface reconstruction on both synthetic and real-world data than existing methods. Code will be made publicly available.

[135] Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models

He Huang, Zixuan Hu, Dongxiao Li, Yao Xiao, Ling-Yu Duan

Main category: cs.CV

TL;DR: ReCoVAD is a training-free video anomaly detection framework that uses selective frame processing inspired by human nervous system pathways, achieving state-of-the-art performance while processing only 28.55% (UCF-Crime) and 16.04% (XD-Violence) of the frames used by previous methods.

Details

Motivation: Existing video anomaly detection methods using large pre-trained models rely on dense frame-level inference, which incurs high computational costs and latency. The paper questions whether dense reasoning is truly necessary when using powerful pre-trained models.

Method: ReCoVAD uses dual pathways: (1) Reflex pathway with lightweight CLIP-based module for fast response using visual features and prototype prompts, (2) Conscious pathway with medium-scale vision-language model for generating textual descriptions and refined scores. It includes dynamic memory of past frames and uses large language model for periodic review.

Result: Extensive experiments show ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% of the frames used by previous methods on UCF-Crime and 16.04% on XD-Violence.

Conclusion: Sparse reasoning is sufficient for effective large-model-based video anomaly detection, challenging the need for dense frame processing in existing approaches.

Abstract: Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. It continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.
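
The sparse-reasoning idea can be pictured as a gate: a cheap lookup answers for frames that resemble something already in memory, and only novel frames reach the expensive model. The sketch below is a loose illustration under assumptions. The cosine-similarity test, the 0.9 threshold, and all names are hypothetical, and `slow_scorer` merely stands in for the vision-language model of the Conscious pathway.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def process_stream(frame_vectors, slow_scorer, threshold=0.9):
    """Score each frame, calling `slow_scorer` only for frames unlike any in memory."""
    memory = []                      # (vector, anomaly_score) pairs seen so far
    scores, slow_calls = [], 0
    for vec in frame_vectors:
        best = max(memory, key=lambda m: cosine(vec, m[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= threshold:
            scores.append(best[1])   # reflex: reuse the remembered score
        else:
            score = slow_scorer(vec) # conscious: novel frame, run the slow model
            slow_calls += 1
            memory.append((vec, score))
            scores.append(score)
    return scores, slow_calls


# Four frames, but only two distinct "events": the slow model runs twice.
frames = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0), (0.0, 1.0)]
print(process_stream(frames, slow_scorer=lambda v: v[1]))  # ([0.0, 0.0, 1.0, 1.0], 2)
```

The fraction of frames that reach the slow model is the kind of quantity the reported 28.55% and 16.04% figures describe, although the paper's routing criterion may differ from this one.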

[136] Bridging Visual Affective Gap: Borrowing Textual Knowledge by Learning from Noisy Image-Text Pairs

Daiqing Wu, Dongbao Yang, Yu Zhou, Can Ma

Main category: cs.CV

TL;DR: The paper proposes PACL (Partitioned Adaptive Contrastive Learning) to bridge the “affective gap” in visual emotion recognition by leveraging knowledge from pre-trained textual models to enhance visual models’ emotional perception.

Details

Motivation: Current visual emotion recognition methods suffer from an "affective gap" where factual-level visual features lack direct association with emotional categories, limiting the applicability of pre-trained visual models for emotion tasks.

Method: Proposed Partitioned Adaptive Contrastive Learning (PACL) that separates different types of samples and uses distinct contrastive learning strategies for each type, dynamically constructing negative and positive pairs to exploit noisy social media data connections between images and texts.

Result: The method significantly improves performance of various pre-trained visual models in downstream emotion-related tasks by effectively bridging the “affective gap”.

Conclusion: Bridging the affective gap through knowledge transfer from textual models to visual models significantly enhances visual emotion recognition performance, demonstrating the effectiveness of the proposed PACL approach.

Abstract: Visual emotion recognition (VER) is a longstanding field that has garnered increasing attention with the advancement of deep neural networks. Although recent studies have achieved notable improvements by leveraging the knowledge embedded within pre-trained visual models, the lack of direct association between factual-level features and emotional categories, called the “affective gap”, limits the applicability of pre-training knowledge for VER tasks. In contrast, the explicit emotional expression and high information density in textual modality eliminate the “affective gap”. Therefore, we propose borrowing the knowledge from the pre-trained textual model to enhance the emotional perception of pre-trained visual models. We focus on the factual and emotional connections between images and texts in noisy social media data, and propose Partitioned Adaptive Contrastive Learning (PACL) to leverage these connections. Specifically, we manage to separate different types of samples and devise distinct contrastive learning strategies for each type. By dynamically constructing negative and positive pairs, we fully exploit the potential of noisy samples. Through comprehensive experiments, we demonstrate that bridging the “affective gap” significantly improves the performance of various pre-trained visual models in downstream emotion-related tasks. Our code is released at https://github.com/wdqqdw/PACL.

[137] Designing and Generating Diverse, Equitable Face Image Datasets for Face Verification Tasks

Georgia Baltsou, Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos

Main category: cs.CV

TL;DR: Proposes DIF-V dataset with 27,780 synthetic face images to address bias in face verification systems, showing existing models exhibit gender and race biases.

Details

Motivation: Existing face datasets suffer from racial, gender, and demographic biases, limiting fairness and effectiveness of face verification systems used in identity authentication applications.

Method: Uses advanced generative models to create diverse synthetic face images representing various facial traits while adhering to identity card photo standards, and introduces the DIF-V dataset as a benchmark.

Result: Analysis reveals existing verification models exhibit biases toward certain genders and races, and identity style modifications negatively impact model performance.

Conclusion: Addresses dataset inequities to advance diversity and ethics in AI, laying foundation for more inclusive and reliable face verification technologies.

Abstract: Face verification is a significant component of identity authentication in various applications including online banking and secure access to personal devices. Most existing face image datasets suffer from notable biases related to race, gender, and other demographic characteristics, limiting the effectiveness and fairness of face verification systems. In response to these challenges, we propose a comprehensive methodology that integrates advanced generative models to create varied and diverse high-quality synthetic face images. This methodology emphasizes the representation of a diverse range of facial traits, ensuring adherence to characteristics permissible in identity card photographs. Furthermore, we introduce the Diverse and Inclusive Faces for Verification (DIF-V) dataset, comprising 27,780 images of 926 unique identities, designed as a benchmark for future research in face verification. Our analysis reveals that existing verification models exhibit biases toward certain genders and races, and notably, applying identity style modifications negatively impacts model performance. By tackling the inherent inequities in existing datasets, this work not only enriches the discussion on diversity and ethics in artificial intelligence but also lays the foundation for developing more inclusive and reliable face verification technologies.

[138] ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better

Yuan Zhang, Ming Lu, Junwen Pan, Tao Huang, Kuan Cheng, Qi She, Shanghang Zhang

Main category: cs.CV

TL;DR: ChainV is a framework that dynamically integrates visual hints into multimodal reasoning to make reasoning shorter and more efficient while improving accuracy, especially on math-intensive tasks.

Details

Motivation: Existing multimodal reasoning models exhibit redundant self-reflection in lengthy reasoning chains, and current CoT compression methods provide limited gains for multimodal reasoning due to reliance on static visual references.

Method: ChainV performs coarse visual patch selection based on previous reasoning steps, refines it using attention intensity to find representative atomic visual hints, evaluates hint reliability with consistency-based mechanism, and incorporates pixel coordinates and reliability into thinking via Bernoulli stochastic process.

Result: ChainV significantly improves reasoning accuracy and efficiency, achieving 2.3% improvement on MathVista benchmark while reducing inference latency by 51.4% and shortening output token length by 24.5%.

Conclusion: Dynamic integration of visual hints through ChainV framework effectively makes multimodal reasoning shorter and better, particularly benefiting math-intensive tasks where visual hints are crucial for multi-step symbolic reasoning.

Abstract: Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Eventually, the pixel coordinates of the selected visual hint and its reliability are incorporated into thinking with a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a 2.3% improvement on MathVista within MIMO-VL-RL, while reducing inference latency by 51.4% and shortening output token length by 24.5%.

[139] Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required?

Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev, Russell Littman

Main category: cs.CV

TL;DR: MoE-ViT is a Mixture-of-Experts architecture for Vision Transformers that improves efficiency in multi-channel image processing by selectively attending to relevant channels rather than all channel interactions.

Details

Motivation: Vision Transformers face computational bottlenecks when processing multi-channel images due to quadratic growth in attention from channel-wise comparisons, leading to excessive FLOPs and high training costs.

Method: Proposed MoE-ViT treats each channel as an expert and uses a lightweight router to select only the most relevant experts per patch for attention, inspired by Sparse Mixture-of-Experts philosophy.

Result: Experiments on JUMP-CP and So2Sat datasets show MoE-ViT achieves substantial efficiency gains without sacrificing performance, and in some cases enhances performance.

Conclusion: MoE-ViT provides a practical and attractive backbone for multi-channel imaging by addressing the efficiency challenge in cross-channel attention while maintaining or improving performance.

Abstract: Vision Transformers (ViTs) have become the backbone of vision foundation models, yet their optimization for multi-channel domains, such as cell painting or satellite imagery, remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block: channel-wise comparisons lead to a quadratic growth in attention, resulting in excessive FLOPs and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: “Is it necessary to model all channel interactions?” Inspired by the philosophy of Sparse Mixture-of-Experts (MoE), we propose MoE-ViT, a Mixture-of-Experts architecture for multi-channel images in ViTs, which treats each channel as an expert and employs a lightweight router to select only the most relevant experts per patch for attention. Proof-of-concept experiments on real-world datasets (JUMP-CP and So2Sat) demonstrate that MoE-ViT achieves substantial efficiency gains without sacrificing, and in some cases enhancing, performance, making it a practical and attractive backbone for multi-channel imaging.
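As a rough illustration of the routing idea described in this abstract, the sketch below scores every channel token of a single patch and keeps only the top-k channels for attention. The dot-product scoring, the names `route_channels` and `router_weights`, and the value of k are assumptions made for illustration; the paper's actual router may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_channels(patch_tokens, router_weights, k):
    """Pick the k highest-scoring channels for one patch.

    patch_tokens: one token vector per channel (hypothetical layout).
    router_weights: hypothetical learned vector with the token dimension.
    Returns (sorted selected channel indices, renormalized gate values).
    """
    scores = [sum(t * w for t, w in zip(tok, router_weights))
              for tok in patch_tokens]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top.sort()
    gate_sum = sum(probs[i] for i in top)
    gates = [probs[i] / gate_sum for i in top]
    return top, gates

def attention_pairs(num_channels_used):
    """Channel-to-channel comparisons per patch position grow quadratically."""
    return num_channels_used ** 2

# Five channels, two-dimensional tokens, keep the two most relevant channels.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, 0.5]]
weights = [1.0, 0.5]
selected, gates = route_channels(tokens, weights, k=2)
print(selected, attention_pairs(len(tokens)), attention_pairs(len(selected)))
```

Under these assumptions, attending over 2 of 5 channels cuts the per-position channel comparisons from 25 to 4, which is the kind of saving the abstract attributes to sparse routing.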

[140] PEGS: Physics-Event Enhanced Large Spatiotemporal Motion Reconstruction via 3D Gaussian Splatting

Yijun Xu, Jingrui Zhang, Hongyi Liu, Yuhan Chen, Yuanyang Wang, Qingyao Guo, Dingwen Wang, Lei Yu, Chu He

Main category: cs.CV

TL;DR: PEGS integrates physical priors and event stream enhancement with 3D Gaussian Splatting for deblurred motion recovery, using triple-level supervision and motion-aware training scheduling.

Details

Motivation: Reconstruction of rigid motion over large spatiotemporal scales is challenging due to modeling limitations, severe motion blur, and lack of physical consistency.

Method: Integrates physical priors with event stream enhancement in 3D Gaussian Splatting pipeline; uses triple-level supervision with acceleration constraint, event stream guidance, and Kalman regularizer; implements motion-aware simulated annealing training strategy.

Result: Superior performance in reconstructing motion over large spatiotemporal scales compared to mainstream dynamic methods; created first RGB-Event paired dataset for natural fast rigid motion.

Conclusion: PEGS framework effectively addresses challenges in rigid motion reconstruction by combining physical priors, event stream enhancement, and adaptive training strategies.

Abstract: Reconstruction of rigid motion over large spatiotemporal scales remains a challenging task due to limitations in modeling paradigms, severe motion blur, and insufficient physical consistency. In this work, we propose PEGS, a framework that integrates Physical priors with Event stream enhancement within a 3D Gaussian Splatting pipeline to perform deblurred target-focused modeling and motion recovery. We introduce a cohesive triple-level supervision scheme that enforces physical plausibility via an acceleration constraint, leverages event streams for high-temporal resolution guidance, and employs a Kalman regularizer to fuse multi-source observations. Furthermore, we design a motion-aware simulated annealing strategy that adaptively schedules the training process based on real-time kinematic states. We also contribute the first RGB-Event paired dataset targeting natural, fast rigid motion across diverse scenarios. Experiments show PEGS’s superior performance in reconstructing motion over large spatiotemporal scales compared to mainstream dynamic methods.
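To make the idea of fusing multi-source observations with a Kalman-style term more concrete, here is a minimal scalar Kalman measurement update. It is a textbook filter step, not the paper's regularizer: the function names, the one-dimensional state, and the two example observations (imagined as an RGB-based and an event-based position estimate) are assumptions for illustration.

```python
def kalman_update(mean, var, z, r):
    """One scalar Kalman measurement update.

    mean, var: current state estimate and its variance.
    z, r: new observation and its noise variance.
    """
    gain = var / (var + r)               # trust in the observation
    new_mean = mean + gain * (z - mean)  # move toward the observation
    new_var = (1.0 - gain) * var         # uncertainty shrinks after fusion
    return new_mean, new_var

def fuse(prior_mean, prior_var, observations):
    """Sequentially fuse (value, noise_variance) observations."""
    mean, var = prior_mean, prior_var
    for z, r in observations:
        mean, var = kalman_update(mean, var, z, r)
    return mean, var

# Hypothetical example: prior position 0 with variance 1, then two sources.
mean, var = fuse(0.0, 1.0, [(2.0, 1.0), (1.0, 0.5)])
print(mean, var)
```

Each update weights a source by its stated noise variance, so a more reliable source pulls the estimate harder while the fused variance decreases.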

[141] Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers

Christopher Boland, Sotirios Tsaftaris, Sonia Dahdouh

Main category: cs.CV

TL;DR: A knowledge distillation framework using teacher networks fine-tuned on small task-relevant datasets to mitigate shortcut learning in medical image analysis, achieving comparable performance to bias-free training even on out-of-distribution data.

Details

Motivation: Deep learning models in medical imaging often learn shortcuts using spurious correlations, which can lead to poor robustness and patient harm by preventing models from using clinically meaningful features.

Method: Proposes a knowledge distillation framework in which a teacher network fine-tuned on a small bias-free dataset guides a student network trained on a large biased dataset, targeting different shortcut types across intermediate network layers.

Result: Consistent improvements over traditional methods (ERM, augmentation-based, group-based bias mitigation) on CheXpert, ISIC 2017, and SimBA datasets using various architectures, achieving comparable performance to bias-free training even on out-of-distribution test data.

Conclusion: The approach is practically applicable to real-world medical imaging where bias annotations are limited and shortcut features are difficult to identify beforehand, effectively mitigating shortcut learning across different architectures and datasets.

Abstract: Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.

[142] Off the Planckian Locus: Using 2D Chromaticity to Improve In-Camera Color

SaiKiran Tedla, Joshua E. Little, Hakki Can Karaimer, Michael S. Brown

Main category: cs.CV

TL;DR: Transitioning from 1D CCT to 2D chromaticity space and using MLP for colorimetric mapping improves accuracy under non-Planckian LED lighting while maintaining compatibility with traditional illuminants.

Details

Motivation: Traditional CCT-based colorimetric mapping fails with modern LED lighting that deviates from Planckian locus, requiring better illumination characterization.

Method: Uses 2D chromaticity space instead of 1D CCT, replaces CCT interpolation with lightweight MLP trained on LED sources via lightbox calibration.

Result: Reduces angular reproduction error by 22% in LED-lit scenes, maintains backward compatibility, supports multi-illuminant scenes and real-time deployment.

Conclusion: 2D chromaticity with MLP provides robust colorimetric mapping for non-Planckian illuminants with minimal computational overhead.

Abstract: Traditional in-camera colorimetric mapping relies on correlated color temperature (CCT)-based interpolation between pre-calibrated transforms optimized for Planckian illuminants such as CIE A and D65. However, modern lighting technologies such as LEDs can deviate substantially from the Planckian locus, exposing the limitations of relying on conventional one-dimensional CCT for illumination characterization. This paper demonstrates that transitioning from 1D CCT (on the Planckian locus) to a 2D chromaticity space (off the Planckian locus) improves colorimetric accuracy across various mapping approaches. In addition, we replace conventional CCT interpolation with a lightweight multi-layer perceptron (MLP) that leverages 2D chromaticity features for robust colorimetric mapping under non-Planckian illuminants. A lightbox-based calibration procedure incorporating representative LED sources is used to train our MLP. Validated across diverse LED lighting, our method reduces angular reproduction error by 22% on average in LED-lit scenes, maintains backward compatibility with traditional illuminants, accommodates multi-illuminant scenes, and supports real-time in-camera deployment with negligible additional computational cost.
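The two quantities this abstract relies on can be sketched in a few lines: the 2D chromaticity of an illuminant and an angular error between two colors. The code below uses the standard CIE xy projection and, as an assumption, treats angular error as the angle between two RGB vectors, which is a common definition; the paper's exact metric and feature space may differ, and the function names are hypothetical.

```python
import math

def xy_chromaticity(X, Y, Z):
    """Project CIE XYZ tristimulus values onto the 2D xy chromaticity plane."""
    total = X + Y + Z
    if total == 0:
        raise ValueError("XYZ sum must be non-zero")
    return X / total, Y / total

def angular_error_deg(a, b):
    """Angle in degrees between two color vectors (assumed error definition)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard rounding
    return math.degrees(math.acos(cos))

# D65 white point in XYZ (Y normalized to 1).
x, y = xy_chromaticity(0.95047, 1.0, 1.08883)
print(round(x, 4), round(y, 4))
print(angular_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```

A CCT value places an illuminant along a single curve, whereas the (x, y) pair also captures how far the illuminant sits from that curve, which is the extra information the abstract argues for.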

[143] REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing

Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir

Main category: cs.CV

TL;DR: The paper introduces RS-FMD, a structured database of 150+ remote sensing foundation models, and REMSA, an LLM-based agent that automatically selects appropriate models from natural language queries using in-context learning and transparent reasoning.

Details

Motivation: Foundation models are increasingly used in remote sensing but selecting appropriate models is difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints.

Method: Created RS-FMD database covering 150+ RSFMs across modalities, resolutions, and learning paradigms. Built REMSA agent that interprets user queries, resolves missing constraints, ranks candidates using in-context learning, and provides transparent justifications.

Result: REMSA outperforms baselines including naive agents, dense retrieval, and unstructured RAG-based LLMs on a benchmark of 75 expert-verified query scenarios (900 configurations). It operates entirely on public metadata without accessing private data.

Conclusion: RS-FMD and REMSA provide effective solutions for automated remote sensing foundation model selection, addressing the challenge of scattered model information and enabling transparent, constraint-aware model recommendations.

Abstract: Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.

[144] A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs

Jiaxun Fang, Li Chen

Main category: cs.CV

TL;DR: A multi-stage optimization framework for deploying deep learning-based image compression models on FPGAs, addressing quantization degradation and hardware efficiency through dynamic range-aware quantization, mixed-precision search, and channel pruning.

Details

Motivation: Deploying high-performance floating-point image compression models on resource-constrained FPGAs is challenging due to quantization-induced performance degradation and hardware limitations.

Method: Proposes Dynamic Range-Aware Quantization (DRAQ) with activation clipping and weight regularization, followed by progressive mixed-precision search and GDN-adapted channel pruning for FPGA optimization.

Result: DRAQ reduces BD-rate overhead from 30% to 6.3%, and hardware optimizations cut computational complexity by over 20% with negligible RD performance impact, achieving state-of-the-art efficiency for FPGA-based LIC.

Conclusion: The framework successfully bridges floating-point models to efficient integer implementations on FPGAs, delivering superior quality and efficiency compared to existing FPGA-based image compression solutions.

Abstract: Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations. First, we address the fundamental problem of quantization-induced performance degradation. We propose a Dynamic Range-Aware Quantization (DRAQ) method that uses statistically-calibrated activation clipping and a novel weight regularization scheme to counteract the effects of extreme data outliers and large dynamic ranges, successfully creating a high-fidelity 8-bit integer model. Second, building on this robust foundation, we introduce two hardware-aware optimization techniques tailored for FPGAs. A progressive mixed-precision search algorithm exploits FPGA flexibility to assign optimal, non-uniform bit-widths to each layer, minimizing complexity while preserving performance. Concurrently, a channel pruning method, adapted to work with the Generalized Divisive Normalization (GDN) layers common in LIC, removes model redundancy by eliminating inactive channels. Our comprehensive experiments show that the foundational DRAQ method reduces the BD-rate overhead of a GDN-based model from 30% to 6.3%. The subsequent hardware-aware optimizations further reduce computational complexity by over 20% with negligible impact on RD performance, yielding a final model that is both state-of-the-art in efficiency and superior in quality to existing FPGA-based LIC implementations.
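The effect of outliers on 8-bit quantization, and why clipping helps, can be shown with a small sketch. It uses percentile-based clipping as one plausible form of statistical calibration; this is an assumption, not the DRAQ procedure itself, and all function names and the synthetic activations are hypothetical.

```python
import math

def percentile_abs(values, q):
    """Nearest-rank percentile of absolute values, q in (0, 100]."""
    ordered = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def quantize_int8(values, clip):
    """Symmetric 8-bit quantization after clipping to [-clip, clip]."""
    scale = clip / 127
    codes = [int(round(max(-clip, min(clip, v)) / scale)) for v in values]
    return codes, scale

def max_inlier_error(values, clip, inlier_bound):
    """Worst reconstruction error over values with |v| <= inlier_bound."""
    codes, scale = quantize_int8(values, clip)
    return max(abs(v - c * scale)
               for v, c in zip(values, codes) if abs(v) <= inlier_bound)

# Synthetic activations in [-5, 5] plus one extreme outlier.
activations = [i / 10 for i in range(-50, 51)] + [100.0]

naive_clip = max(abs(v) for v in activations)     # range set by the outlier
calibrated_clip = percentile_abs(activations, 99)  # range set by the bulk

err_naive = max_inlier_error(activations, naive_clip, 5.0)
err_clipped = max_inlier_error(activations, calibrated_clip, 5.0)
print(naive_clip, calibrated_clip, err_naive, err_clipped)
```

With the range dictated by the outlier, the step size is so coarse that typical activations lose most of their precision; clipping sacrifices the outlier to keep the bulk of the distribution accurate.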

[145] One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang

Main category: cs.CV

TL;DR: ODTSR is a one-step diffusion transformer for real-world image super-resolution that balances fidelity and controllability using a noise-hybrid visual stream design and fidelity-aware adversarial training.

Details

Motivation: Current diffusion-based real-world image super-resolution methods struggle with balancing fidelity and controllability - multi-step methods have low fidelity due to randomness, while one-step methods lack control flexibility.

Method: Uses Qwen-Image based one-step diffusion transformer with noise-hybrid visual stream (NVS) that processes low-quality images with adjustable noise and consistent noise in parallel, plus fidelity-aware adversarial training (FAA).

Result: Achieves state-of-the-art performance on generic real-world image super-resolution and enables prompt controllability on challenging scenarios like Chinese character text super-resolution without specific training.

Conclusion: ODTSR successfully addresses the fidelity-controllability trade-off in real-world image super-resolution through its novel noise-hybrid architecture and training approach.

Abstract: Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.

[146] Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal

Main category: cs.CV

TL;DR: SketchVerify is a training-free framework that improves video generation by planning better motion trajectories through a sketch-verification loop, achieving higher motion quality and physical realism while being more efficient than existing methods.

Details

Motivation: Existing video generation methods use single-shot motion plans limited to simple motions or iterative refinement requiring multiple expensive generator calls, leading to computational inefficiency and poor motion quality.

Method: Proposes a sketch-verification framework that predicts multiple candidate motion plans, renders them as lightweight video sketches over static backgrounds, and uses a vision-language verifier to rank plans based on semantic alignment and physical plausibility through iterative refinement.

Result: Experiments on WorldModelBench and PhyWorldBench show significant improvements in motion quality, physical realism, and long-term consistency compared to baselines, with better efficiency. Scaling trajectory candidates consistently improves performance.

Conclusion: SketchVerify effectively enhances motion planning for video generation through efficient sketch-based verification, delivering superior motion quality and physical realism while reducing computational costs.

Abstract: Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement which requires multiple calls to the video generator, incurring high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
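The test-time sampling and verification loop described above can be sketched generically: propose several candidate plans, score them with a verifier, keep the best, and stop once the score is good enough. Everything concrete below is a toy stand-in and an assumption: plans are lists of vertical positions, `propose` adds random perturbations, and `verifier_score` rewards constant acceleration in place of the vision-language verifier and sketch rendering used in the paper.

```python
import random

def propose(base, rng, n, noise):
    """Sample n candidate plans by perturbing the current best plan."""
    return [[y + rng.uniform(-noise, noise) for y in base] for _ in range(n)]

def verifier_score(plan, target_accel=-1.0):
    """Toy plausibility score (higher is better).

    Penalizes deviation of the discrete second difference from a constant
    acceleration, standing in for a learned verifier.
    """
    penalty = 0.0
    for i in range(1, len(plan) - 1):
        accel = plan[i + 1] - 2 * plan[i] + plan[i - 1]
        penalty += (accel - target_accel) ** 2
    return -penalty

def plan_with_verification(initial, rounds, n, threshold, seed=0):
    """Iteratively refine a plan until it passes the threshold or rounds end."""
    rng = random.Random(seed)
    best, best_score = initial, verifier_score(initial)
    for _ in range(rounds):
        if best_score >= threshold:
            break
        for candidate in propose(best, rng, n, noise=0.5):
            score = verifier_score(candidate)
            if score > best_score:  # keep only strict improvements
                best, best_score = candidate, score
    return best, best_score

initial_plan = [10.0, 10.0, 10.0, 10.0, 10.0]  # object hovering in place
initial_score = verifier_score(initial_plan)
final_plan, final_score = plan_with_verification(
    initial_plan, rounds=4, n=8, threshold=-0.05)
print(initial_score, final_score)
```

Because scoring happens on cheap plan representations, the expensive generator would only be called once, on the plan that survives verification.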

[147] Learning to Look Closer: A New Instance-Wise Loss for Small Cerebral Lesion Segmentation

Luc Bouteille, Alexander Jaus, Jens Kleesiek, Rainer Stiefelhagen, Lukas Heine

Main category: cs.CV

TL;DR: CC-DiceCE loss improves small lesion detection in medical image segmentation by increasing recall with minimal segmentation degradation, outperforming blob loss.

Details

Motivation: Traditional loss functions like Dice under-segment small lesions due to their negligible contribution to overall loss, requiring instance-wise approaches for better per-lesion evaluation.

Method: Introduces CC-DiceCE loss based on CC-Metrics framework and compares it with blob loss, both benchmarked against DiceCE baseline using nnU-Net framework for standardized evaluation.

Result: CC-DiceCE increases detection recall with minimal to no degradation in segmentation performance, at the cost of slightly more false positives; it generally outperforms blob loss across multiple datasets.

Conclusion: CC-DiceCE is an effective loss function for improving small lesion detection in medical image segmentation while maintaining good segmentation quality.

Abstract: Traditional loss functions in medical image segmentation, such as Dice, often under-segment small lesions because their small relative volume contributes negligibly to the overall loss. To address this, instance-wise loss functions and metrics have been proposed to evaluate segmentation quality on a per-lesion basis. We introduce CC-DiceCE, a loss function based on the CC-Metrics framework, and compare it with the existing blob loss. Both are benchmarked against a DiceCE baseline within the nnU-Net framework, which provides a robust and standardized setup. We find that CC-DiceCE loss increases detection (recall) with minimal to no degradation in segmentation performance, albeit at the cost of slightly more false positives. Furthermore, our multi-dataset study shows that CC-DiceCE generally outperforms blob loss.
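The motivation in this abstract, that small lesions barely move a global Dice score, can be checked on a toy 1D mask. The sketch below compares global Dice with a simplified per-lesion Dice in which every position is assigned to its nearest ground-truth component. This is an illustrative evaluation metric under stated assumptions, not the CC-DiceCE loss, and the helper names are hypothetical.

```python
def components_1d(mask):
    """Return inclusive (start, end) runs of 1s in a binary list."""
    runs, start = [], None
    for i, v in enumerate(mask):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(mask) - 1))
    return runs

def dice(pred, gt):
    """Dice coefficient of two binary lists (1.0 when both are empty)."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    total = sum(pred) + sum(gt)
    return 1.0 if total == 0 else 2 * inter / total

def per_lesion_dice(pred, gt):
    """Mean Dice over regions, one region per ground-truth component."""
    runs = components_1d(gt)
    if not runs:
        return dice(pred, gt)

    def dist(i, run):
        s, e = run
        return 0 if s <= i <= e else min(abs(i - s), abs(i - e))

    # Assign each position to the nearest ground-truth component.
    owner = [min(range(len(runs)), key=lambda r: dist(i, runs[r]))
             for i in range(len(gt))]
    scores = []
    for r in range(len(runs)):
        idx = [i for i, o in enumerate(owner) if o == r]
        scores.append(dice([pred[i] for i in idx], [gt[i] for i in idx]))
    return sum(scores) / len(scores)

# One large lesion (10 voxels) and one small lesion (1 voxel).
gt = [1] * 10 + [0] * 5 + [1] + [0] * 4
pred = [1] * 10 + [0] * 10  # the small lesion is missed entirely

global_score = dice(pred, gt)
lesion_score = per_lesion_dice(pred, gt)
print(global_score, lesion_score)
```

Missing the small lesion leaves the global score near 0.95 but drops the per-lesion score to 0.5, which is the imbalance that instance-wise losses aim to correct.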

Liuhan Yin, Runkun Ju, Guodong Guo, Erkang Cheng

Main category: cs.CV

TL;DR: DiffRefiner is a two-stage trajectory prediction framework that combines discriminative trajectory proposals with generative diffusion-based refinement to achieve state-of-the-art performance in autonomous driving.

Details

Motivation: Existing generative methods like diffusion models rely on denoising human-crafted anchors or random noise, leaving room for improvement in trajectory prediction accuracy and scene compliance.

Method: Two-stage approach: 1) Transformer-based Proposal Decoder generates coarse trajectory predictions using predefined anchors, 2) Diffusion Refiner iteratively denoises and refines initial predictions with a fine-grained denoising decoder for enhanced scene compliance.

Result: Achieves state-of-the-art performance: 87.4 EPDMS on NAVSIM v2, and 87.1 DS with 71.4 SR on Bench2Drive, setting new records on both benchmarks.

Conclusion: The proposed DiffRefiner framework effectively combines discriminative trajectory proposals with generative refinement, demonstrating superior performance through improved guidance and scene alignment in trajectory prediction.

Abstract: Unlike discriminative approaches in autonomous driving that predict a fixed set of candidate trajectories of the ego vehicle, generative methods, such as diffusion models, learn the underlying distribution of future motion, enabling more flexible trajectory prediction. However, since these methods typically rely on denoising human-crafted trajectory anchors or random noise, there remains significant room for improvement. In this paper, we propose DiffRefiner, a novel two-stage trajectory prediction framework. The first stage uses a transformer-based Proposal Decoder to generate coarse trajectory predictions by regressing from sensor inputs using predefined trajectory anchors. The second stage applies a Diffusion Refiner that iteratively denoises and refines these initial predictions. In this way, we enhance the performance of diffusion-based planning by incorporating a discriminative trajectory proposal module, which provides strong guidance for the generative refinement process. Furthermore, we design a fine-grained denoising decoder to enhance scene compliance, enabling more accurate trajectory prediction through enhanced alignment with the surrounding environment. Experimental results demonstrate that DiffRefiner achieves state-of-the-art performance, attaining 87.4 EPDMS on NAVSIM v2, and 87.1 DS along with 71.4 SR on Bench2Drive, thereby setting new records on both public benchmarks. The effectiveness of each component is validated via ablation studies as well.

[149] UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network

Nhat-Tuong Do-Tran, Ngoc-Hoang-Lam Le, Ching-Chun Huang

Main category: cs.CV

TL;DR: UI-Styler is a novel ultrasound-specific class-aware image style transfer framework that addresses domain shifts in ultrasound images across devices by preserving source structural content while transferring target texture patterns and ensuring semantic alignment with diagnostic categories.

Details

Motivation: Ultrasound images vary across acquisition devices causing domain shifts that degrade performance of fixed black-box downstream inference models when reused. Existing unpaired image translation methods overlook class-specific semantic alignment during domain adaptation.

Method: UI-Styler uses pattern-matching to transfer target texture patterns onto source images while preserving source structure, and introduces class-aware prompting guided by target domain pseudo labels for semantic alignment with diagnostic categories.

Result: Extensive experiments on ultrasound cross-device tasks show UI-Styler consistently outperforms existing UIT methods, achieving state-of-the-art performance in distribution distance and downstream tasks like classification and segmentation.

Conclusion: UI-Styler effectively addresses domain shifts in ultrasound imaging by combining pattern-based style transfer with class-aware semantic alignment, demonstrating superior performance over existing unpaired image translation methods.

Abstract: The appearance of ultrasound images varies across acquisition devices, causing domain shifts that degrade the performance of fixed black-box downstream inference models when reused. To mitigate this issue, it is practical to develop unpaired image translation (UIT) methods that effectively align the statistical distributions between source and target domains, particularly under the constraint of a reused inference-blackbox setting. However, existing UIT approaches often overlook class-specific semantic alignment during domain adaptation, resulting in misaligned content-class mappings that can impair diagnostic accuracy. To address this limitation, we propose UI-Styler, a novel ultrasound-specific, class-aware image style transfer framework. UI-Styler leverages a pattern-matching mechanism to transfer texture patterns embedded in the target images onto source images while preserving the source structural content. In addition, we introduce a class-aware prompting strategy guided by pseudo labels of the target domain, which enforces accurate semantic alignment with diagnostic categories. Extensive experiments on ultrasound cross-device tasks demonstrate that UI-Styler consistently outperforms existing UIT methods, achieving state-of-the-art performance in distribution distance and downstream tasks, such as classification and segmentation.

[150] FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle

Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, Danda Pani Paudel

Main category: cs.CV

TL;DR: FireScope is a VLM-based reasoning-to-generation framework that uses multimodal data (Sentinel-2 imagery and climate data) to predict wildfire risk maps with reasoning traces, demonstrating improved cross-continental generalization and interpretability.

Details

Motivation: Existing wildfire risk prediction methods lack causal reasoning and multimodal understanding for reliable generalization, particularly across different continents.

Method: Proposed FireScope framework uses reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces, trained on FireScope-Bench dataset containing Sentinel-2 imagery, climate data, and expert-defined risk rasters.

Result: FireScope achieves substantial performance gains when trained in USA and tested in Europe, with expert feedback confirming faithful and semantically meaningful reasoning traces.

Conclusion: Reasoning can ground raster prediction models, improving both generalization and interpretability. The authors describe FireScope as, to their knowledge, the first framework to demonstrate that language-based reasoning improves generalization in visual generation and to enable cross-continental wildfire risk modeling.

Abstract: Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

[151] Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition

Aditya Mishra, Akshay Agarwal, Haroon Lone

Main category: cs.CV

TL;DR: Proposes LENS-Net for nighttime traffic sign recognition, featuring adaptive image enhancement and multimodal classification, and introduces INTSD dataset with 41 sign classes captured under various night conditions.

Details

Motivation: Traffic sign recognition at night is challenging due to visual noise and lack of public nighttime datasets, with existing methods struggling with low illumination and ineffective multimodal cue utilization.

Method: Introduces INTSD dataset with 41 traffic sign classes, and proposes LENS-Net with adaptive image enhancement detector for illumination correction and sign localization, followed by multimodal CLIP-GCNN classifier using cross-modal attention and graph reasoning.

Result: LENS-Net surpasses existing frameworks, with ablation studies confirming effectiveness of key components. Dataset and code are publicly available.

Conclusion: The proposed approach effectively addresses nighttime traffic sign recognition challenges through dataset creation and a robust multimodal framework that integrates illumination correction with semantic reasoning.

Abstract: Traffic signboards are vital for road safety and intelligent transportation systems, enabling navigation and autonomous driving. Yet, recognizing traffic signs at night remains challenging due to visual noise and scarcity of public nighttime datasets. Despite advances in vision architectures, existing methods struggle with robustness under low illumination and fail to leverage complementary multimodal cues effectively. To overcome these limitations, firstly, we introduce INTSD, a large-scale dataset comprising street-level night-time images of traffic signboards collected across diverse regions of India. The dataset spans 41 traffic signboard classes captured under varying lighting and weather conditions, providing a comprehensive benchmark for both detection and classification tasks. To benchmark INTSD for night-time sign recognition, we conduct extensive evaluations using state-of-the-art detection and classification models. Secondly, we propose LENS-Net, which integrates an adaptive image enhancement detector for joint illumination correction and sign localization, followed by a structured multimodal CLIP-GCNN classifier that leverages cross-modal attention and graph-based reasoning for robust and semantically consistent recognition. Our method surpasses existing frameworks, with ablation studies confirming the effectiveness of its key components. The dataset and code for LENS-Net are publicly available for research.

[152] PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention

Yipeng Chen, Zhichao Ye, Zhenzhou Fang, Xinyu Chen, Xiaoyu Zhang, Jialing Liu, Nan Wang, Haomin Liu, Guofeng Zhang

Main category: cs.CV

TL;DR: PostCam enables post-capture editing of camera trajectories in dynamic scenes through a novel query-shared cross-attention module that fuses 6-DoF camera poses and 2D video frames for precise motion control and high-quality video generation.

Details

Motivation: Existing video recapture methods suffer from suboptimal camera motion injection strategies that limit control precision and fail to preserve fine visual details from source videos.

Method: Introduces a query-shared cross-attention module that integrates 6-DoF camera poses and 2D rendered video frames into a unified representation, using a two-stage training strategy: first learning coarse camera control from poses, then incorporating visual information for refinement.

Result: Outperforms state-of-the-art methods by over 20% in camera control precision and view consistency, while achieving the highest video generation quality on both real-world and synthetic datasets.

Conclusion: PostCam provides more accurate and flexible camera motion manipulation while preserving visual fidelity from source videos, representing a significant advancement in novel-view video generation.

Abstract: We propose PostCam, a framework for novel-view video generation that enables post-capture editing of camera trajectories in dynamic scenes. We find that existing video recapture methods suffer from suboptimal camera motion injection strategies; such suboptimal designs not only limit camera control precision but also result in generated videos that fail to preserve fine visual details from the source video. To achieve more accurate and flexible motion manipulation, PostCam introduces a query-shared cross-attention module. It integrates two distinct forms of control signals: the 6-DoF camera poses and the 2D rendered video frames. By fusing them into a unified representation within a shared feature space, our model can extract underlying motion cues, which enhances both control precision and generation quality. Furthermore, we adopt a two-stage training strategy: the model first learns coarse camera control from pose inputs, and then incorporates visual information to refine motion accuracy and enhance visual fidelity. Experiments on both real-world and synthetic datasets demonstrate that PostCam outperforms state-of-the-art methods by over 20% in camera control precision and view consistency, while achieving the highest video generation quality. Our project webpage is publicly available at: https://cccqaq.github.io/PostCam.github.io/

[153] Real Noise Decoupling for Hyperspectral Image Denoising

Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu

Main category: cs.CV

TL;DR: A multi-stage noise-decoupling framework for HSI denoising that decomposes complex noise into explicit and implicit components, using pre-training for explicit noise and a wavelet-guided network for implicit noise, achieving state-of-the-art performance.

Details

Motivation: Existing noise modeling methods struggle with accurately modeling complex real-world noise in hyperspectral images, limiting denoising effectiveness.

Method: Multi-stage framework that decouples noise into explicit and implicit components, uses pre-training with existing noise models for explicit noise, and employs high-frequency wavelet guided network for implicit noise removal.

Result: Outperforms state-of-the-art methods on public and captured datasets, effectively handling complex real-world noise and significantly enhancing HSI quality.

Conclusion: The proposed multi-stage noise-decoupling framework successfully addresses complex real-world HSI noise through explicit-implicit noise decomposition and multi-stage optimization.

Abstract: Hyperspectral image (HSI) denoising is a crucial step in enhancing the quality of HSIs. Noise modeling methods can fit noise distributions to generate synthetic HSIs to train denoising networks. However, the noise in captured HSIs is usually complex and difficult to model accurately, which significantly limits the effectiveness of these approaches. In this paper, we propose a multi-stage noise-decoupling framework that decomposes complex noise into explicitly modeled and implicitly modeled components. This decoupling reduces the complexity of noise and enhances the learnability of HSI denoising methods when applied to real paired data. Specifically, for explicitly modeled noise, we utilize an existing noise model to generate paired data for pre-training a denoising network, equipping it with prior knowledge to handle the explicitly modeled noise effectively. For implicitly modeled noise, we introduce a high-frequency wavelet guided network. Leveraging the prior knowledge from the pre-trained module, this network adaptively extracts high-frequency features to target and remove the implicitly modeled noise from real paired HSIs. Furthermore, to effectively eliminate all noise components and mitigate error accumulation across stages, a multi-stage learning strategy, comprising separate pre-training and joint fine-tuning, is employed to optimize the entire framework. Extensive experiments on public and our captured datasets demonstrate that our proposed framework outperforms state-of-the-art methods, effectively handling complex real-world noise and significantly enhancing HSI quality.

[154] VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation

Hanyu Zhou, Chuanhao Ma, Gim Hee Lee

Main category: cs.CV

TL;DR: VLA-4D is a vision-language-action model with 4D awareness that enables spatiotemporally coherent robotic manipulation by integrating temporal information into both visual representations and action planning.

Details

Motivation: Existing VLA models struggle with spatiotemporally coherent manipulation, particularly in achieving temporally coherent control over action execution despite embedding 3D positions into visual representations.

Method: Two key designs: 1) 4D-aware visual representation by embedding 1D time into 3D positions and fusing via cross-attention, and 2) Spatiotemporal action representation that extends spatial actions with temporal information for planning, aligned with LLM for action prediction.

Result: Extensive experiments demonstrate superiority across different robotic manipulation tasks, achieving spatially-smooth and temporally-coherent manipulation.

Conclusion: The unified framework with 4D-aware visual and spatiotemporal action representations effectively enables coherent robotic manipulation, with extended dataset supporting fine-tuning.

Abstract: Vision-language-action (VLA) models show potential for general robotic tasks, but still struggle with spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into visual representations to enhance the spatial precision of actions. However, these methods struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments have been conducted to verify the superiority of our method across different tasks of robotic manipulation.

[155] Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning

Jiayi Wang, Wei Dai, Haoyu Wang, Sihan Yang, Haixia Bi, Jian Sun

Main category: cs.CV

TL;DR: CA-SAM introduces a lightweight Alignment Layer and continual learning strategy to adapt SAM for medical image segmentation, achieving SOTA performance while reducing computational overhead and preventing catastrophic forgetting.

Details

Motivation: Medical image segmentation faces challenges with heterogeneous privacy policies preventing joint training, and SAM's computational overhead limits practical deployment despite its strong zero-shot capabilities.

Method: Proposes Alignment Layer for feature distribution alignment and CA-SAM continual learning strategy that automatically adapts Alignment Layers to mitigate forgetting while leveraging SAM’s zero-shot priors.

Result: Achieves state-of-the-art performance across nine medical segmentation datasets under continual learning scenarios.

Conclusion: The SAM paradigm is highly promising for medical image segmentation when computational efficiency and performance are balanced through the proposed CA-SAM approach.

Abstract: In medical image segmentation, heterogeneous privacy policies across institutions often make joint training on pooled datasets infeasible, motivating continual image segmentation, that is, learning from data streams without catastrophic forgetting. While the Segment Anything Model (SAM) offers strong zero-shot priors and has been widely fine-tuned across downstream tasks, its large parameter count and computational overhead challenge practical deployment. This paper demonstrates that the SAM paradigm is highly promising once its computational efficiency and performance can be balanced. To this end, we introduce the Alignment Layer, a lightweight, plug-and-play module which aligns encoder-decoder feature distributions to efficiently adapt SAM to specific medical images, improving accuracy while reducing computation. Building on SAM and the Alignment Layer, we then propose Continual Alignment for SAM (CA-SAM), a continual learning strategy that automatically adapts the appropriate Alignment Layer to mitigate catastrophic forgetting, while leveraging SAM’s zero-shot priors to preserve strong performance on unseen medical datasets. Evaluated on nine medical segmentation datasets under a continual-learning scenario, CA-SAM achieves state-of-the-art performance. Our code, models and datasets will be released at https://github.com/azzzzyo/Continual-Alignment-for-SAM.

[156] SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors

Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari

Main category: cs.CV

TL;DR: SING3R-SLAM is a globally consistent Gaussian-based dense RGB SLAM framework that combines local reconstructions with a unified global representation to address drift and redundancy in 3D mapping.

Details

Motivation: To overcome limitations of current dense 3D reconstruction methods in SLAM, including drift, redundant point maps, and inefficiency that hinder downstream tasks like novel view synthesis.

Method: Builds locally consistent submaps through lightweight tracking/reconstruction, then progressively aligns and fuses them into a global Gaussian map that enforces cross-view geometric consistency and provides feedback to correct local drift.

Result: Achieves state-of-the-art tracking, 3D reconstruction, and novel view rendering on real-world datasets, with over 12% improvement in tracking and finer, more detailed geometry, while maintaining a compact and memory-efficient global representation.

Conclusion: SING3R-SLAM enables efficient and versatile 3D mapping for multiple downstream applications by combining local consistency with global refinement of scene geometry and camera poses.

Abstract: Recent advances in dense 3D reconstruction enable the accurate capture of local geometry; however, integrating them into SLAM is challenging due to drift and redundant point maps, which limit efficiency and downstream tasks, such as novel view synthesis. To address these issues, we propose SING3R-SLAM, a globally consistent and compact Gaussian-based dense RGB SLAM framework. The key idea is to combine locally consistent 3D reconstructions with a unified global Gaussian representation that jointly refines scene geometry and camera poses, enabling efficient and versatile 3D mapping for multiple downstream applications. SING3R-SLAM first builds locally consistent submaps through our lightweight tracking and reconstruction module, and then progressively aligns and fuses them into a global Gaussian map that enforces cross-view geometric consistency. This global map, in turn, provides feedback to correct local drift and enhance the robustness of tracking. Extensive experiments demonstrate that SING3R-SLAM achieves state-of-the-art tracking, 3D reconstruction, and novel view rendering, resulting in over 12% improvement in tracking and producing finer, more detailed geometry, all while maintaining a compact and memory-efficient global representation on real-world datasets.

Cris Claessens, Christiaan Viviers, Giacomo D’Amicantonio, Egor Bondarev, Fons van der Sommen

Main category: cs.CV

TL;DR: SPECTRE is a fully transformer-based foundation model for volumetric CT imaging that uses 3D Vision Transformers with self-supervised and vision-language pretraining to learn general-purpose CT representations, achieving state-of-the-art performance on multiple CT benchmarks.

Details

Motivation: Volumetric CT presents unique challenges including extreme token scaling, geometric anisotropy, and weak/noisy clinical supervision that make standard transformer and contrastive learning approaches ineffective for 3D medical imaging.

Method: Uses scalable 3D Vision Transformer architecture with joint optimization of local transformer for high-resolution volumetric feature extraction and global transformer for whole-scan context modeling. Combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports.

Result: Consistently outperforms prior CT foundation models across multiple CT benchmarks in both zero-shot and fine-tuned settings. Demonstrates that high-performing, generalizable representations can be achieved using only openly available CT datasets.

Conclusion: SPECTRE establishes a scalable, open, and fully transformer-based foundation model for 3D medical imaging that is both geometrically consistent and clinically meaningful.

Abstract: We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.

[158] FisheyeGaussianLift: BEV Feature Lifting for Surround-View Fisheye Camera Perception

Shubham Sonarghare, Prasad Deshpande, Ciaran Hogan, Deepika-Rani Kaliappan-Mahalingam, Ganesh Sistu

Main category: cs.CV

TL;DR: A distortion-aware BEV segmentation framework that processes fisheye images using geometric unprojection and depth distribution estimation, achieving strong performance on parking and urban driving scenarios without requiring image undistortion.

Details

Motivation: BEV semantic segmentation from fisheye imagery is challenging due to extreme non-linear distortion, occlusion, and depth ambiguity inherent to wide-angle projections.

Method: Uses calibrated geometric unprojection and per-pixel depth distribution estimation to lift image pixels into 3D space via Gaussian parameterization, then fuses projected 3D Gaussians into BEV representation via differentiable splatting.

Result: Achieves IoU scores of 87.75% for drivable regions and 57.26% for vehicles under severe fisheye distortion and diverse environmental conditions.

Conclusion: The framework produces continuous, uncertainty-aware semantic maps without requiring undistortion or perspective rectification, demonstrating strong segmentation performance on complex parking and urban driving scenarios.

Abstract: Accurate BEV semantic segmentation from fisheye imagery remains challenging due to extreme non-linear distortion, occlusion, and depth ambiguity inherent to wide-angle projections. We present a distortion-aware BEV segmentation framework that directly processes multi-camera high-resolution fisheye images, utilizing calibrated geometric unprojection and per-pixel depth distribution estimation. Each image pixel is lifted into 3D space via Gaussian parameterization, predicting spatial means and anisotropic covariances to explicitly model geometric uncertainty. The projected 3D Gaussians are fused into a BEV representation via differentiable splatting, producing continuous, uncertainty-aware semantic maps without requiring undistortion or perspective rectification. Extensive experiments demonstrate strong segmentation performance on complex parking and urban driving scenarios, achieving IoU scores of 87.75% for drivable regions and 57.26% for vehicles under severe fisheye distortion and diverse environmental conditions.
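
As an illustration of what "calibrated geometric unprojection" can look like without undistorting the image, the sketch below maps a fisheye pixel to a unit viewing ray. It assumes an equidistant fisheye model (image radius proportional to the angle from the optical axis); the abstract does not state which camera model the paper uses, and the function name and intrinsics here are hypothetical.

```python
import math

def unproject_equidistant(u, v, fx, fy, cx, cy):
    """Map pixel (u, v) to a unit ray, assuming an equidistant fisheye model.

    Assumption: normalized image radius equals the angle theta from the
    optical axis (r = f * theta). The paper's actual camera model may differ.
    """
    x = (u - cx) / fx
    y = (v - cy) / fy
    theta = math.hypot(x, y)  # angle between the ray and the optical axis
    if theta == 0.0:
        return (0.0, 0.0, 1.0)  # principal point looks straight ahead
    s = math.sin(theta) / theta
    return (x * s, y * s, math.cos(theta))

# Hypothetical intrinsics, for illustration only.
center_ray = unproject_equidistant(640.0, 480.0, 300.0, 300.0, 640.0, 480.0)
side_ray = unproject_equidistant(640.0 + 300.0 * math.pi / 2, 480.0,
                                 300.0, 300.0, 640.0, 480.0)
```

A pixel at a normalized radius of pi/2 yields a ray perpendicular to the optical axis, which is the wide-angle case that a pinhole model cannot represent. Lifting to 3D would then place a depth distribution along each such ray.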

[159] Dual-domain Adaptation Networks for Realistic Image Super-resolution

Chaowei Fang, Bolin Fu, De Cheng, Lechao Cheng, Guanbin Li

Main category: cs.CV

TL;DR: Dual-domain Adaptation Networks adapt pre-trained SR models from synthetic to real-world datasets using spatial and frequency domain adaptation strategies for improved realistic image super-resolution.

Details

Motivation: Realistic image SR faces challenges with limited real-world data and complex degradation patterns. Pre-trained models from synthetic datasets offer valuable prior knowledge that can improve generalization and reduce data requirements.

Method: Proposes Dual-domain Adaptation Networks with spatial-domain adaptation (selective parameter updates + low-rank adaptation) and frequency domain adaptation branch that combines spectral data and spatial features to infer HR frequency maps.

Result: Experimental evaluations on RealSR, D2CRealSR, and DRealSR benchmarks demonstrate superiority over existing state-of-the-art models.

Conclusion: The proposed method effectively adapts pre-trained SR models to real-world scenarios through dual-domain adaptation, achieving improved performance in realistic image super-resolution tasks.

Abstract: Realistic image super-resolution (SR) focuses on transforming real-world low-resolution (LR) images into high-resolution (HR) ones, handling more complex degradation patterns than synthetic SR tasks. This is critical for applications like surveillance, medical imaging, and consumer electronics. However, current methods struggle with limited real-world LR-HR data, impacting the learning of basic image features. Pre-trained SR models from large-scale synthetic datasets offer valuable prior knowledge, which can improve generalization, speed up training, and reduce the need for extensive real-world data in realistic SR tasks. In this paper, we introduce a novel approach, Dual-domain Adaptation Networks, which is able to efficiently adapt pre-trained image SR models from simulated to real-world datasets. To achieve this target, we first set up a spatial-domain adaptation strategy through selectively updating parameters of pre-trained models and employing the low-rank adaptation technique to adjust frozen parameters. Recognizing that image super-resolution involves recovering high-frequency components, we further integrate a frequency domain adaptation branch into the adapted model, which combines the spectral data of the input and the spatial-domain backbone’s intermediate features to infer HR frequency maps, enhancing the SR result. Experimental evaluations on public realistic image SR benchmarks, including RealSR, D2CRealSR, and DRealSR, demonstrate the superiority of our proposed method over existing state-of-the-art models. Codes are available at: https://github.com/dummerchen/DAN.
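
The low-rank adaptation step mentioned in the abstract can be sketched in a few lines. This is the standard LoRA formulation (a frozen weight plus a scaled product of two small trainable matrices), not necessarily the exact variant used in the paper; the function names, matrix sizes, and `alpha` value are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(w, a, b, alpha=1.0):
    """Return W + (alpha / r) * B @ A, where W stays frozen.

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen pre-trained weight (2 x 3)
A = [[1.0, 2.0, 3.0]]                     # rank-1 down-projection (1 x 3)
B_init = [[0.0], [0.0]]                   # zero init: adapted model starts equal to W
B_trained = [[0.5], [0.0]]                # hypothetical values after training

W_start = lora_effective_weight(W, A, B_init)
W_adapted = lora_effective_weight(W, A, B_trained)
```

Only `A` and `B` receive gradient updates, so the number of trainable parameters grows with the rank rather than with the full weight size.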

[160] Equivariant-Aware Structured Pruning for Efficient Edge Deployment: A Comprehensive Framework with Adaptive Fine-Tuning

Mohammed Alnemari

Main category: cs.CV

TL;DR: A framework combining group equivariant CNNs with structured pruning to create compact, transformation-invariant models for resource-constrained environments, achieving 29.3% parameter reduction while maintaining geometric robustness.

Details

Motivation: To bridge the gap between group-theoretic network design and practical deployment constraints, particularly for satellite imagery analysis and geometric vision tasks where transformation invariance is crucial but computational resources are limited.

Method: Combines G-CNNs with C4 cyclic group equivariance via e2cnn library, introduces structured pruning that preserves equivariant properties, implements adaptive fine-tuning with early stopping, and includes dynamic INT8 quantization in a comprehensive pipeline.

Result: 29.3% parameter reduction with significant accuracy recovery across satellite imagery (EuroSAT) and standard benchmarks (CIFAR-10, Rotated MNIST), demonstrating substantial compression while maintaining geometric robustness.

Conclusion: The framework provides a reproducible approach for optimizing equivariant models, successfully achieving substantial model compression while preserving transformation invariance, making it particularly relevant for satellite imagery analysis and geometric vision tasks.

Abstract: This paper presents a novel framework combining group equivariant convolutional neural networks (G-CNNs) with equivariant-aware structured pruning to produce compact, transformation-invariant models for resource-constrained environments. Equivariance to rotations is achieved through the C4 cyclic group via the e2cnn library, enabling consistent performance under geometric transformations while reducing computational overhead. Our approach introduces structured pruning that preserves equivariant properties by analyzing e2cnn layer structure and applying neuron-level pruning to fully connected components. To mitigate accuracy degradation, we implement adaptive fine-tuning that automatically triggers when the accuracy drop exceeds 2%, using early stopping and learning rate scheduling for efficient recovery. The framework includes dynamic INT8 quantization and a comprehensive pipeline encompassing training, knowledge distillation, structured pruning, fine-tuning, and quantization. We evaluate our method on satellite imagery (EuroSAT) and standard benchmarks (CIFAR-10, Rotated MNIST), demonstrating effectiveness across diverse domains. Experimental results show 29.3% parameter reduction with significant accuracy recovery, demonstrating that structured pruning of equivariant networks achieves substantial compression while maintaining geometric robustness. Our pipeline provides a reproducible framework for optimizing equivariant models, bridging the gap between group-theoretic network design and practical deployment constraints, with particular relevance to satellite imagery analysis and geometric vision tasks.
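
The two pruning-related steps described above, neuron-level pruning of fully connected components and a fine-tuning trigger at a 2% accuracy drop, can be sketched as follows. The abstract does not state the pruning criterion or whether the 2% is absolute or relative, so this sketch assumes L1-norm ranking and an absolute drop in accuracy; all names are hypothetical.

```python
def prune_neurons(weights, biases, keep_ratio):
    """Keep the output neurons with the largest L1 weight norm.

    weights: one row of incoming weights per output neuron.
    Assumption: L1 magnitude is the importance score.
    """
    n_keep = max(1, round(len(weights) * keep_ratio))
    scores = [sum(abs(v) for v in row) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original neuron order
    return [weights[i] for i in kept], [biases[i] for i in kept], kept

def needs_fine_tuning(acc_before, acc_after, threshold=0.02):
    """Trigger recovery when accuracy falls by more than the threshold.

    Assumption: the 2% threshold is an absolute difference in accuracy.
    """
    return (acc_before - acc_after) > threshold

weights = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [-1.5, 0.5]]
biases = [0.0, 1.0, 2.0, 3.0]
pruned_w, pruned_b, kept = prune_neurons(weights, biases, keep_ratio=0.5)
trigger = needs_fine_tuning(acc_before=0.95, acc_after=0.92)
```

In a real network the columns of the following layer that read from removed neurons would also have to be dropped; the returned `kept` indices are what that step would use.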

[161] QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand

Main category: cs.CV

TL;DR: QueryOcc is a self-supervised framework that learns continuous 3D semantic occupancy using 4D spatio-temporal queries, achieving 26% improvement in semantic RayIoU over previous camera-based methods.

Details

Motivation: Existing approaches for 3D scene understanding either rely on 2D rendering consistency or discretized voxel grids, which limit spatial precision and scalability. Large-scale 3D annotation is expensive, creating need for self-supervised learning from sensor data.

Method: Uses query-based self-supervised framework with independent 4D spatio-temporal queries sampled across adjacent frames. Supports supervision from pseudo-point clouds or raw lidar data. Introduces contractive scene representation for long-range reasoning under constant memory.

Result: Achieves 26% improvement in semantic RayIoU on self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating strong self-supervised occupancy learning capabilities.

Conclusion: Direct 4D query supervision enables effective self-supervised occupancy learning, surpassing previous camera-based methods in performance while maintaining real-time inference speeds.

Abstract: Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
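
One way to picture a contractive scene representation is a mapping that leaves nearby points unchanged and squeezes everything farther away into a bounded shell. The sketch below uses the contraction popularized by Mip-NeRF 360 as a stand-in; the abstract does not give QueryOcc's actual function, so both the formula and the `near_radius` parameter are assumptions.

```python
import math

def contract(point, near_radius=1.0):
    """Identity inside near_radius; compress the rest into radius < 2 * near_radius.

    Assumption: Mip-NeRF 360 style contraction, used here only as an example
    of a mapping that preserves near-field detail and bounds far regions.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= near_radius:
        return tuple(point)
    scale = (2.0 - near_radius / norm) * near_radius / norm
    return tuple(c * scale for c in point)

near = contract((0.5, 0.0, 0.0))          # unchanged
far = contract((4.0, 0.0, 0.0))           # pulled in to radius 1.75
very_far = contract((1000.0, 0.0, 0.0))   # approaches, but never reaches, 2.0
```

Because every output lies within a ball of fixed radius, a fixed-size representation can cover arbitrarily distant queries, which matches the abstract's goal of long-range reasoning under constant memory.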

Yuming Yang, Michael K. Ng, Zhigang Jia, Wei Wang

Main category: cs.CV

TL;DR: Proposes a novel quaternion-based fidelity term for blind deconvolution of color images that preserves color channel relationships, outperforming methods that process channels separately.

Details

Motivation: Existing blind deconvolution methods for color images either convert to grayscale or process color channels separately, ignoring important inter-channel relationships.

Method: Formulates a quaternion fidelity term using quaternion convolution kernels with four components: one for overall blur and three for RGB channel interdependencies, using normalized quaternion kernels to preserve image intensity.

Result: Extensive experiments on real blurred color image datasets show effective artifact removal and significant improvement in deblurring effects compared to existing approaches.

Conclusion: The proposed quaternion-based method demonstrates strong potential as an effective tool for color image blind deconvolution by properly modeling color channel relationships.

Abstract: In this work, we address the challenging problem of blind deconvolution for color images. Existing methods often convert color images to grayscale or process each color channel separately, which overlooks the relationships between color channels. To handle this issue, we formulate a novel quaternion fidelity term designed specifically for color image blind deconvolution. This fidelity term leverages the properties of the quaternion convolution kernel, which consists of four kernels: one that functions similarly to a non-negative convolution kernel to capture the overall blur, and three additional unconstrained convolution kernels, corresponding to the red, green and blue channels respectively, that model their unknown interdependencies. In order to preserve image intensity, we propose to use the normalized quaternion kernel in the blind deconvolution process. Extensive experiments on real datasets of blurred color images show that the proposed method effectively removes artifacts and significantly improves the deblurring effect, demonstrating its potential as a powerful tool for color image deconvolution.

[163] Non-Parametric Probabilistic Robustness: A Conservative Metric with Optimized Perturbation Distributions

Zheng Wang, Yi Zhang, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

Main category: cs.CV

TL;DR: Proposes Non-Parametric Probabilistic Robustness (NPPR), a practical robustness metric that learns perturbation distributions from data without predefined assumptions, showing more conservative estimates than existing methods.

Details

Motivation: Existing probabilistic robustness formulations assume fixed and known perturbation distributions, which is unrealistic in practice. NPPR addresses this limitation by learning distributions directly from data.

Method: Develops an NPPR estimator based on a Gaussian Mixture Model (GMM) with MLP heads and bicubic up-sampling, covering input-dependent and input-independent perturbation scenarios.

Result: Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet across multiple architectures show NPPR provides up to 40% more conservative PR estimates compared to state-of-the-art methods.

Conclusion: NPPR is validated as a more practical robustness metric that addresses distributional uncertainty in probabilistic robustness evaluation.

Abstract: Deep learning (DL) models, despite their remarkable success, remain vulnerable to small input perturbations that can cause erroneous outputs, motivating the recent proposal of probabilistic robustness (PR) as a complementary alternative to adversarial robustness (AR). However, existing PR formulations assume a fixed and known perturbation distribution, an unrealistic expectation in practice. To address this limitation, we propose non-parametric probabilistic robustness (NPPR), a more practical PR metric that does not rely on any predefined perturbation distribution. Following the non-parametric paradigm in statistical modeling, NPPR learns an optimized perturbation distribution directly from data, enabling conservative PR evaluation under distributional uncertainty. We further develop an NPPR estimator based on a Gaussian Mixture Model (GMM) with Multilayer Perceptron (MLP) heads and bicubic up-sampling, covering various input-dependent and input-independent perturbation scenarios. Theoretical analyses establish the relationships among AR, PR, and NPPR. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet across ResNet18/50, WideResNet50 and VGG16 validate NPPR as a more practical robustness metric, showing up to 40% more conservative (lower) PR estimates than those obtained by assuming the common perturbation distributions used in state-of-the-art methods.
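
For orientation, the fixed-distribution PR quantity that NPPR generalizes can be estimated by plain Monte Carlo sampling. The sketch below is a toy illustration only: the linear classifier, the uniform L-infinity perturbation distribution, and the names `predict` and `estimate_pr` are assumptions, not the paper's estimator, which learns the perturbation distribution instead of fixing it.

```python
import random


def predict(x, weights, bias):
    """Toy binary linear classifier (hypothetical stand-in for a DL model)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def estimate_pr(x, label, weights, bias, eps, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P[f(x + delta) == label], with delta drawn
    uniformly from the L-infinity ball of radius eps (an assumed, fixed
    perturbation distribution)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        perturbed = [xi + rng.uniform(-eps, eps) for xi in x]
        if predict(perturbed, weights, bias) == label:
            correct += 1
    return correct / n_samples


weights, bias = [1.0, -1.0], 0.0
x, label = [0.3, 0.1], 1  # clean score is 0.3 - 0.1 = 0.2 > 0
pr_small = estimate_pr(x, label, weights, bias, eps=0.05)  # cannot flip the sign
pr_large = estimate_pr(x, label, weights, bias, eps=0.5)   # flips for some samples
```

The estimate depends entirely on the distribution chosen for `delta`. A conservative metric in the sense described above would instead look for a distribution under which this probability is lowest; that search is not shown here.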

[164] A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

Bulat Khaertdinov, Mirela Popa, Nava Tintarev

Main category: cs.CV

TL;DR: Proposes relevance feedback mechanisms to improve VLM-based visual search without fine-tuning, including pseudo-relevance feedback, generative relevance feedback, and attentive feedback summarizer.

Details

Motivation: To enhance visual search performance in vision-language models without requiring fine-tuning or larger models, leveraging traditional search techniques like relevance feedback.

Method: Four feedback strategies: classical pseudo-relevance feedback, generative relevance feedback using synthetic captions, attentive feedback summarizer with transformer-based multimodal integration, and explicit feedback as upper-bound baseline.

Result: GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs and 1-3% for larger ones on Flickr30k and COCO datasets. AFS mitigates query drift and is robust in iterative retrieval.

Conclusion: Relevance feedback consistently enhances VLM-based retrieval across different model sizes and enables interactive, adaptive visual search without model fine-tuning.

Abstract: Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with the VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
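
The classical PRF step and the MRR@k metric mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the embeddings are toy 2-D vectors, the query update is a simple convex blend with the top-k centroid, and the weight `alpha` and all function names are hypothetical and not taken from the paper.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank(query, items):
    """Return item indices sorted by descending cosine similarity to the query."""
    q = normalize(query)
    sims = [dot(q, normalize(item)) for item in items]
    return sorted(range(len(items)), key=lambda i: -sims[i])


def prf_refine(query, items, k=2, alpha=0.5):
    """Pseudo-relevance feedback: blend the query with the centroid of the top-k items."""
    top_items = [normalize(items[i]) for i in rank(query, items)[:k]]
    centroid = [sum(col) / k for col in zip(*top_items)]
    q = normalize(query)
    return normalize([alpha * qd + (1 - alpha) * cd for qd, cd in zip(q, centroid)])


def mrr_at_k(rankings, relevant, k=5):
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for pos, idx in enumerate(ranking[:k], start=1):
            if idx in rel:
                total += 1.0 / pos
                break
    return total / len(rankings)


items = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # toy image embeddings
query = [1.0, 0.2]                            # toy text-query embedding
initial_ranking = rank(query, items)
refined_query = prf_refine(query, items, k=2, alpha=0.5)
score = mrr_at_k([[0, 1, 2], [2, 0, 1]], [{1}, {2}], k=5)
```

Because PRF trusts the top-ranked results without checking them, a poor initial ranking pulls the query in the wrong direction, which is one way to picture the query drift that the abstract says AFS and explicit feedback mitigate.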

[165] MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning

Wenrui Zhang, Xinggang Wang, Bin Feng, Wenyu Liu

Main category: cs.CV

TL;DR: MolSight is a three-stage learning framework for Optical Chemical Structure Recognition that achieves state-of-the-art performance in stereochemical recognition through pre-training, multi-granularity fine-tuning, and reinforcement learning optimization.

Details

Motivation: Existing OCSR systems struggle with accurately recognizing stereochemical information due to subtle visual cues like wedge/dash bonds and spatial arrangements, which is crucial for chemical data mining and drug discovery.

Method: Three-stage training: 1) Pre-training on large-scale noisy datasets for fundamental perception, 2) Multi-granularity fine-tuning with auxiliary tasks (bond classification and atom localization), 3) Reinforcement learning optimization using the GRPO algorithm and a novel stereochemical dataset.

Result: MolSight achieves state-of-the-art performance in stereochemical optical structure recognition, with the compact model further enhanced by the GRPO algorithm for stereomolecular recognition.

Conclusion: The proposed three-stage framework effectively addresses stereochemical recognition challenges in OCSR, demonstrating that systematic training with auxiliary tasks and reinforcement learning can significantly improve performance even with relatively compact models.

Abstract: Optical Chemical Structure Recognition (OCSR) plays a pivotal role in modern chemical informatics, enabling the automated conversion of chemical structure images from scientific literature, patents, and educational materials into machine-readable molecular representations. This capability is essential for large-scale chemical data mining, drug discovery pipelines, and Large Language Model (LLM) applications in related domains. However, existing OCSR systems face significant challenges in accurately recognizing stereochemical information due to the subtle visual cues that distinguish stereoisomers, such as wedge and dash bonds, ring conformations, and spatial arrangements. To address these challenges, we propose MolSight, a comprehensive learning framework for OCSR that employs a three-stage training paradigm. In the first stage, we conduct pre-training on large-scale but noisy datasets to endow the model with fundamental perception capabilities for chemical structure images. In the second stage, we perform multi-granularity fine-tuning using datasets with richer supervisory signals, systematically exploring how auxiliary tasks, specifically chemical bond classification and atom localization, contribute to molecular formula recognition. Finally, we employ reinforcement learning for post-training optimization and introduce a novel stereochemical structure dataset. Remarkably, we find that even with MolSight’s relatively compact parameter size, the Group Relative Policy Optimization (GRPO) algorithm can further enhance the model’s performance on stereomolecular recognition. Through extensive experiments across diverse datasets, our results demonstrate that MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.
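
The core of GRPO is that each sampled output is scored relative to the other outputs drawn for the same input, rather than by a learned value function. The sketch below shows only that group-relative advantage step. The reward values are hypothetical (the abstract does not describe MolSight's reward design), and using the population standard deviation is an implementation choice assumed here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of outputs sampled for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed convention
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate structure strings decoded from one
# image, e.g. 1.0 when the predicted stereochemistry matches the reference.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that beat the group average receive positive advantages and are reinforced, while below-average outputs are pushed down. If every output in a group gets the same reward, all advantages are zero and that group contributes no learning signal.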

[166] BiFingerPose: Bimodal Finger Pose Estimation for Touch Devices

Xiongjun Guan, Zhiyu Pan, Jianjiang Feng, Jie Zhou

Main category: cs.CV

TL;DR: BiFingerPose is a bimodal finger pose estimation algorithm that combines capacitive images and fingerprint patches to accurately predict comprehensive finger pose information, including roll angle, with significant performance improvements over existing methods.

Details

Motivation: Existing finger pose estimation methods on touchscreen devices are limited to pitch and yaw angles using capacitive images, show reduced accuracy for large-angle inputs, and cannot estimate roll angle.

Method: Proposes a bimodal approach using both capacitive images and fingerprint patches from under-screen fingerprint sensors to enable comprehensive finger pose estimation including roll angle.

Result: Achieves over 21% improvement in prediction performance, 2.5x higher task completion efficiency, and 23% better user operation accuracy compared to state-of-the-art methods in a 12-person user study.

Conclusion: BiFingerPose demonstrates practical superiority for finger pose estimation and enables new applications in authentication security and interactive experiences, with prototypes developed to showcase interaction potential.

Abstract: Finger pose offers promising opportunities to expand the human-computer interaction capabilities of touchscreen devices. Existing finger pose estimation algorithms that can be implemented in portable devices predominantly rely on capacitive images, which are currently limited to estimating pitch and yaw angles and exhibit reduced accuracy when processing large-angle inputs (especially angles greater than 45 degrees). In this paper, we propose BiFingerPose, a novel bimodal finger pose estimation algorithm capable of simultaneously and accurately predicting comprehensive finger pose information. A bimodal input is explored, including a capacitive image and a fingerprint patch obtained from the touchscreen with an under-screen fingerprint sensor. Our approach leads to reliable estimation of roll angle, which is not achievable using only a single modality. In addition, the prediction performance of other pose parameters is also greatly improved. A 12-person user study on continuous and discrete interaction tasks further validated the advantages of our approach. Specifically, BiFingerPose outperforms previous SOTA methods with over 21% improvement in prediction performance, 2.5 times higher task completion efficiency, and 23% better user operation accuracy, demonstrating its practical superiority. Finally, we delineate the application space of finger pose with respect to enhancing authentication security and improving interactive experiences, and develop corresponding prototypes to showcase the interaction potential. Our code will be available at https://github.com/XiongjunGuan/DualFingerPose.

[167] SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion

Jiajie Guo, Qingpeng Zhu, Jin Zeng, Xiaolong Wu, Changyong He, Weida Wang

Main category: cs.CV

TL;DR: SpatialGeo enhances MLLMs’ spatial reasoning by combining CLIP’s semantic features with geometry features from self-supervised learning through a hierarchical adapter, improving spatial task accuracy by 8.0% with 50% less memory.

Details

Motivation: Most MLLMs have limited spatial reasoning ability due to lossy vision embeddings from CLIP that focus on instance-level semantics, lacking spatial awareness for 3D arrangements.

Method: Proposes SpatialGeo with hierarchical fusion of CLIP semantic features and geometry features from vision-only self-supervised learning via a hierarchical adapter, trained with random feature dropping to prevent trivial solutions.

Result: Improves spatial reasoning accuracy by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference compared to state-of-the-art models.

Conclusion: SpatialGeo effectively enhances MLLMs’ spatial grounding capability by addressing spatial ambiguity through hierarchical fusion of geometry and semantic features.

Abstract: Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embeddings and boosting the spatial grounding capability of MLLMs. Specifically, we first show that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder used in most existing MLLMs (e.g., CLIP), which is restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using a pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via https://ricky-plus.github.io/SpatialGeoPages/.

[168] FaCells. Teaching Machines the Language of Lines: Per Point Attribute Scores for Face-Sketch Classification

Xavier Ignacio Gonzalez

Main category: cs.CV

TL;DR: FaCells transforms face model internals into line-based artworks using vector sketches from CelebA dataset, creating statistical abstractions of facial attributes through per-point attribute scoring.

Details

Motivation: To bridge data, model, and art by turning model internals into interpretable line artworks that are plotter-ready, reproducible, and materially present, while acknowledging dataset biases.

Method: Translates CelebA face photos into vector sketches, trains a bidirectional LSTM on absolute coordinates with travel-minimizing stroke order, obtains per-point attribute scores by replacing the global average over the sequence with a Dense layer applied at each point, and aggregates points whose scores exceed attribute-specific thresholds.

Result: Multilabel training over 40 attributes is stable; attributes reaching at least 50% balanced accuracy are visualized as FaCells, statistical abstractions of attributes such as Eyeglasses, Wavy Hair, and Bangs.

Conclusion: FaCells demonstrates interpretability as a creative tool, producing reproducible plotter artworks that bridge technical analysis with artistic expression while acknowledging dataset limitations.

Abstract: FaCells is a method, and an exhibition, that turns model internals into line-based artworks. Aligned face photographs (CelebA, 260k images, 40 attributes) are translated into vector sketches suitable for an XY plotter. We study how to ‘write’ these drawings for a sequence model, comparing absolute vs. relative point encodings and random vs. travel-minimizing stroke order. A bidirectional LSTM is trained for attribute prediction; a minimal architectural change, removing the global average over the sequence and applying a Dense layer at each point, yields per-point attribute scores. Aggregating points whose score exceeds an attribute-specific threshold across many portraits produces new drawings we call FaCells: statistical abstractions of attributes such as Eyeglasses, Wavy Hair, or Bangs. Across ablations, absolute coordinates with travel-minimizing order and a global average readout perform best; this configuration is then adapted to produce per-point scores. Multilabel training over 40 attributes is stable, and attributes reaching at least 50% balanced accuracy are visualized as FaCells. Complementary notions (e.g., No_Beard) are constructed by selecting points below a negative threshold. FaCells foregrounds interpretability as a creative tool: the resulting works are plotter-ready, reproducible, and inexpensive to realize, yet materially present. Presented at Spectrum Miami 2025, the project bridges data, model, and paper while acknowledging the limits of the labels and the biases of the dataset.
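
One way to picture the per-point readout is through linearity: averaging and a linear Dense layer commute, so applying the layer at each point and then averaging gives the same pre-activation score as averaging first. The sketch below illustrates this under stated assumptions: the per-point feature vectors stand in for BiLSTM outputs, and the weights, thresholds, and function names are hypothetical, not taken from the project.

```python
def dense(features, weights, bias):
    """Single-output linear layer (pre-activation score)."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def sequence_score(point_features, weights, bias):
    """Global-average readout: average features over the sequence, then Dense."""
    n = len(point_features)
    mean_features = [sum(col) / n for col in zip(*point_features)]
    return dense(mean_features, weights, bias)


def per_point_scores(point_features, weights, bias):
    """Per-point readout: apply the same Dense layer at every point."""
    return [dense(f, weights, bias) for f in point_features]


def select_points(xy, scores, threshold, above=True):
    """Keep the drawing points whose score is above (or below) the threshold."""
    if above:
        return [pt for pt, s in zip(xy, scores) if s > threshold]
    return [pt for pt, s in zip(xy, scores) if s < threshold]


# Hypothetical per-point features and their (x, y) positions in one sketch.
point_features = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
xy = [(0, 0), (1, 0), (1, 1)]
weights, bias = [0.5, -1.0], 0.1

scores = per_point_scores(point_features, weights, bias)
attribute_points = select_points(xy, scores, threshold=0.5)
complement_points = select_points(xy, scores, threshold=-0.5, above=False)
```

Pooling the selected points over many portraits would give the kind of aggregate drawing the abstract calls a FaCell; the second call mirrors the "below a negative threshold" construction used for complementary notions.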

[169] NoPe-NeRF++: Local-to-Global Optimization of NeRF with No Pose Prior

Dongbo Shi, Shen Cao, Bojian Wu, Jinhui Guo, Lubin Fan, Renjie Chen, Ligang Liu, Jieping Ye

Main category: cs.CV

TL;DR: NoPe-NeRF++ is a local-to-global optimization method for training Neural Radiance Fields without pose priors, combining relative pose initialization, local joint optimization, and global bundle adjustment to improve both pose estimation and novel view synthesis.

Details

Motivation: Existing methods like NoPe-NeRF that focus only on local image relationships struggle with accurate camera pose recovery in complex scenarios, highlighting the need for a more robust approach that integrates both local and global optimization.

Method: The approach uses: 1) Relative pose initialization with explicit feature matching, 2) Local joint optimization to enhance pose estimation, 3) Global optimization phase with bundle adjustment and geometric consistency constraints using feature trajectories.

Result: The method significantly improves initial pose quality, outperforms state-of-the-art methods in both pose estimation accuracy and novel view synthesis, and demonstrates superior performance and robustness on benchmark datasets, even in challenging scenes.

Conclusion: NoPe-NeRF++ successfully combines local and global optimization cues with NeRF, representing the first work to seamlessly integrate these approaches, validating the design choices through extensive evaluations.

Abstract: In this paper, we introduce NoPe-NeRF++, a novel local-to-global optimization algorithm for training Neural Radiance Fields (NeRF) without requiring pose priors. Existing methods, particularly NoPe-NeRF, which focus solely on the local relationships within images, often struggle to recover accurate camera poses in complex scenarios. To overcome these challenges, our approach begins with a relative pose initialization with explicit feature matching, followed by a local joint optimization to enhance the pose estimation for training a more robust NeRF representation. This method significantly improves the quality of initial poses. Additionally, we introduce a global optimization phase that incorporates geometric consistency constraints through bundle adjustment, which integrates feature trajectories to further refine poses and collectively boost the quality of NeRF. Notably, our method is the first work that seamlessly combines the local and global cues with NeRF, and outperforms state-of-the-art methods in both pose estimation accuracy and novel view synthesis. Extensive evaluations on benchmark datasets demonstrate our superior performance and robustness, even in challenging scenes, thus validating our design choices.
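
To make the "relative pose initialization" idea concrete, the sketch below chains pairwise relative transforms into global camera poses, with the first camera fixed at the identity. It is a generic illustration, not the paper's pipeline: the 4x4 homogeneous convention, the right-multiplication order, and the helper names are assumptions.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def yaw_translate(yaw_deg, tx, ty):
    """Rigid transform: rotation about the z-axis followed by a planar translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def chain_relative_poses(relative_poses):
    """Accumulate relative transforms into global poses, starting from the identity."""
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    poses = [identity]
    for rel in relative_poses:
        poses.append(matmul4(poses[-1], rel))
    return poses


# Four identical steps (move 1 unit, turn 90 degrees) trace a closed square.
step = yaw_translate(90.0, 1.0, 0.0)
poses = chain_relative_poses([step] * 4)
```

With exact relative poses the loop closes perfectly. With estimated ones, each small error is carried into every later pose, which is the usual motivation for following local estimation with a global refinement such as bundle adjustment.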

[170] Refracting Reality: Generating Images with Realistic Transparent Objects

Yue Yin, Enze Tao, Dylan Campbell

Main category: cs.CV

TL;DR: The paper proposes a method to improve generative models’ synthesis of transparent objects by incorporating physical optics laws, specifically using Snell’s Law to synchronize pixels and recover accurate refraction effects.

Details

Motivation: Current generative image models perform poorly at synthesizing transparent objects because they fail to properly model optical phenomena like refraction, reflection, absorption, and scattering, particularly the complex color constraints created by refracted rays intersecting with other surfaces.

Method: The approach synchronizes pixels within and outside object boundaries using Snell’s Law of Refraction at each generation step, and recovers appearance of surfaces visible only via refraction/reflection by synchronizing with a second generated panorama image using the same warping and merging procedure.

Result: The method generates optically-plausible images that respect physical constraints, significantly improving the accuracy of refraction rendering compared to standard generative models.

Conclusion: By incorporating physical optics laws into the generation process through pixel synchronization and multi-image coordination, the approach successfully addresses the challenge of synthesizing transparent objects with accurate refraction effects.

Abstract: Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object’s boundary with those outside by warping and merging the pixels using Snell’s Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image – a panorama centered at the object – using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
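
The physical constraint used for warping is Snell's law, n1 sin(theta_i) = n2 sin(theta_t). A minimal vector form is sketched below; it shows only the refraction of a single ray direction, not the paper's warping and merging procedure, and the function name and conventions (unit vectors, normal facing the incoming ray) are assumptions for illustration.

```python
import math


def refract(direction, normal, n1, n2):
    """Refract a unit ray `direction` at a surface with unit `normal`.

    The normal must point toward the side the ray comes from. Returns the unit
    refracted direction, or None when total internal reflection occurs.
    """
    eta = n1 / n2
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)  # Snell's law, squared
    if sin2_t > 1.0:
        return None  # no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * d + (eta * cos_i - cos_t) * n
                 for d, n in zip(direction, normal))


normal = (0.0, 0.0, 1.0)
angle = math.radians(45.0)
incoming = (math.sin(angle), 0.0, -math.cos(angle))

into_glass = refract(incoming, normal, n1=1.0, n2=1.5)   # air to glass
straight = refract((0.0, 0.0, -1.0), normal, 1.0, 1.5)   # normal incidence

steep = math.radians(60.0)
trapped = refract((math.sin(steep), 0.0, -math.cos(steep)), normal, 1.5, 1.0)
```

Entering the denser medium bends the ray toward the normal, and a ray leaving glass at a steep enough angle has no transmitted direction at all. Both behaviors constrain which background pixels can appear inside a transparent object's silhouette.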

[171] Loomis Painter: Reconstructing the Painting Process

Markus Pobitzer, Chang Liu, Chenyi Zhuang, Teng Long, Bin Ren, Nicu Sebe

Main category: cs.CV

TL;DR: A unified framework for generating multi-media painting tutorials with style control, using diffusion models and cross-medium style augmentation to ensure consistent texture evolution and process transfer across artistic styles.

Details

Motivation: Existing video painting tutorials lack interactivity and personalization, while current generative models struggle with cross-media generalization and temporal/structural inconsistencies in reproducing human creative workflows.

Method: Proposes a semantics-driven style control mechanism that embeds multiple media into a diffusion model’s conditional space, uses cross-medium style augmentation, and adopts a reverse-painting training strategy for smooth, human-aligned generation.

Result: Achieves strong results on LPIPS, DINO, and CLIP metrics for cross-media consistency, temporal coherence, and final-image fidelity. Introduces Perceptual Distance Profile (PDP) curve that quantitatively models human artistic progression stages.

Conclusion: The framework successfully enables consistent painting process generation across multiple media with style control, faithfully reproducing human creative workflows and artistic progression patterns.

Abstract: Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model’s conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.

[172] Label-Efficient Skeleton-based Recognition with Stable-Invertible Graph Convolutional Networks

Hichem Sahbi

Main category: cs.CV

TL;DR: A label-efficient method for skeleton-based action recognition using GCNs with a novel acquisition function that selects the most informative subsets for labeling, achieving state-of-the-art performance.

Details

Motivation: To address the high cost and time-consuming nature of acquiring large manually labeled datasets for skeleton-based action recognition.

Method: Proposes a novel acquisition function that optimizes data representativity, diversity and uncertainty to select the most informative subsets for labeling, using graph convolutional networks (GCNs) and extending with invertible GCNs for better data distribution capture.

Result: Extensive experiments on two challenging skeleton-based recognition datasets show the method’s effectiveness and outperformance against related work in label-frugal scenarios.

Conclusion: The proposed label-frugal GCNs provide an effective solution for skeleton-based action recognition with reduced labeling requirements, demonstrating superior performance compared to existing methods.

Abstract: Skeleton-based action recognition is an active topic in image processing. A key challenge of this task lies in its dependence on large, manually labeled datasets whose acquisition is costly and time-consuming. This paper devises a novel, label-efficient method for skeleton-based action recognition using graph convolutional networks (GCNs). The contribution of the proposed method resides in learning a novel acquisition function, which scores the most informative subsets for labeling, as the optimum of an objective function mixing data representativity, diversity and uncertainty. We also extend this approach by learning the most informative subsets using an invertible GCN, which allows mapping data from ambient to latent spaces where the inherent distribution of the data is more easily captured. Extensive experiments on two challenging skeleton-based recognition datasets show the effectiveness of our label-frugal GCNs, which outperform related work.

[173] DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

Xiangteng He, Shunsuke Sakai, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal

Main category: cs.CV

TL;DR: DSeq-JEPA improves on I-JEPA by introducing sequential prediction of masked regions based on discriminative saliency, combining JEPA’s latent prediction with GPT-style sequential reasoning for better visual representation learning.

Details

Motivation: I-JEPA treats all image regions uniformly without considering visual importance or prediction order, unlike human visual perception which processes information sequentially from most to least informative regions.

Method: Uses transformer-derived saliency maps to identify primary discriminative regions first, then predicts subsequent regions in discriminative order, creating a curriculum-like semantic progression from primary to secondary cues.

Result: Outperforms I-JEPA variants across diverse tasks including image classification, fine-grained categorization, detection, segmentation, and low-level reasoning, focusing on more discriminative and generalizable representations.

Conclusion: DSeq-JEPA successfully bridges predictive and autoregressive self-supervised learning by integrating JEPA-style latent prediction with GPT-style sequential reasoning, demonstrating superior performance through discriminative sequential processing.

Abstract: Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues – a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.
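
The ordering idea can be sketched as a simple schedule over patches: rank patches by saliency, treat the most salient ones as the primary context, and predict the rest one at a time while the context grows. This is an interpretation for illustration only; the saliency values, the number of primary regions, and the function names are assumptions, and the actual model predicts latent embeddings rather than patch indices.

```python
def discriminative_order(saliency):
    """Patch indices sorted from most to least salient."""
    return sorted(range(len(saliency)), key=lambda i: -saliency[i])


def sequential_schedule(saliency, n_primary):
    """Build (context, target) steps: primary patches first, then the rest in order."""
    order = discriminative_order(saliency)
    context = order[:n_primary]
    steps = []
    for target in order[n_primary:]:
        steps.append((list(context), target))  # predict target from current context
        context.append(target)                 # the target joins the context afterwards
    return steps


# Hypothetical saliency scores for four image patches.
saliency = [0.1, 0.9, 0.4, 0.7]
order = discriminative_order(saliency)
schedule = sequential_schedule(saliency, n_primary=1)
```

Each step conditions on everything predicted so far, which is the sense in which the abstract compares the procedure to GPT-style sequential pre-training.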

[174] The Cooperative Network Architecture: Learning Structured Networks as Representation of Sensory Patterns

Pascal J. Sager, Jan M. Deriu, Benjamin F. Grewe, Thilo Stadelmann, Christoph von der Malsburg

Main category: cs.CV

TL;DR: CNA uses recurrently connected neuron networks called “nets” that are dynamically assembled from learned net fragments to represent sensory signals, providing robustness to noise and generalization capabilities.

Details

Motivation: To address challenges in current vision systems by building neural representations that are robust to noise and deformation and that generalize to out-of-distribution data.

Method: Dynamically assemble structured, recurrently connected networks of neurons (nets) from overlapping net fragments learned based on statistical regularities in sensory input, with unsupervised learning of net fragments.

Result: Net fragments can be learned without supervision and flexibly recombined to encode novel patterns, enabling figure completion and resilience to noise.

Conclusion: CNA establishes a promising paradigm for neural representations that integrate local feature processing with global structure formation, providing a foundation for future research on invariant object recognition.

Abstract: We introduce the Cooperative Network Architecture (CNA), a model that represents sensory signals using structured, recurrently connected networks of neurons, termed “nets.” Nets are dynamically assembled from overlapping net fragments, which are learned based on statistical regularities in sensory input. This architecture offers robustness to noise and deformation as well as generalization to out-of-distribution data, addressing challenges in current vision systems from a novel perspective. We demonstrate that net fragments can be learned without supervision and flexibly recombined to encode novel patterns, enabling figure completion and resilience to noise. Our findings establish CNA as a promising paradigm for developing neural representations that integrate local feature processing with global structure formation, providing a foundation for future research on invariant object recognition.

[175] UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification

Taixi Chen, Jingyun Chen, Nancy Guo

Main category: cs.CV

TL;DR: UAM is a unified Attention-Mamba backbone for cell-level radiomics analysis that achieves state-of-the-art performance in cell classification and tumor segmentation, improving classification accuracy from 74% to 78% and segmentation precision from 75% to 80%.

Details

Motivation: Cell-level radiomics features provide fine-grained tumor insights but are largely unexplored, with no dedicated backbone for radiomics data. Existing methods focus on slide/patch-level classification rather than cell-level analysis.

Method: Proposed Unified Attention-Mamba (UAM) backbone that flexibly combines Attention and Mamba modules in a single architecture, eliminating manual ratio tuning. Developed two UAM variants and extended to multimodal framework for joint cell classification and image segmentation.

Result: UAM achieves SOTA performance: cell classification accuracy improves from 74% to 78% (n=349,882 cells), tumor segmentation precision improves from 75% to 80% (n=406 patches), surpassing leading image-based foundation models.

Conclusion: UAM demonstrates effectiveness as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis, highlighting the promise of cell-level radiomics analysis.

Abstract: Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encoding capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$=349,882 cells), and tumor segmentation precision from 75% to 80% ($n$=406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.

[176] SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation

Seamie Hayes, Reenu Mohandas, Tim Brophy, Alexandre Boulch, Ganesh Sistu, Ciaran Eising

Main category: cs.CV

TL;DR: SuperQuadricOcc introduces superquadric-based scene representation for semantic occupancy estimation, achieving 75% memory reduction, 124% faster inference, and 5.9% mIoU improvement over Gaussian methods while enabling real-time inference.

Details

Motivation: Gaussian representations in occupancy estimation require many primitives, increasing memory and hindering real-time inference. Superquadrics offer diverse shapes with fewer primitives but lack a rasterizer for supervision.

Method: Uses superquadric-based scene representation with multi-layer icosphere-tessellated Gaussian approximation to enable Gaussian rasterization for training supervision, and includes a fast superquadric voxelization module.

Result: 75% memory footprint reduction, 124% faster inference, 5.9% mIoU improvement on Occ3D dataset, 84% fewer primitives needed compared to Gaussian methods, enabling real-time inference.

Conclusion: To the authors' knowledge, SuperQuadricOcc is the first occupancy model to achieve real-time inference while maintaining competitive performance, indicating that superquadrics can represent scenes more efficiently than Gaussians.

Abstract: Semantic occupancy estimation enables comprehensive scene understanding for automated driving, providing dense spatial and semantic information essential for perception and planning. While Gaussian representations have been widely adopted in self-supervised occupancy estimation, the deployment of a large number of Gaussian primitives drastically increases memory requirements and is not suitable for real-time inference. In contrast, superquadrics permit reduced primitive count and lower memory requirements due to their diverse shape set. However, implementation into a self-supervised occupancy model is nontrivial due to the absence of a superquadric rasterizer to enable model supervision. Our proposed method, SuperQuadricOcc, employs a superquadric-based scene representation. By leveraging a multi-layer icosphere-tessellated Gaussian approximation of superquadrics, we enable Gaussian rasterization for supervision during training. On the Occ3D dataset, SuperQuadricOcc achieves a 75% reduction in memory footprint, 124% faster inference, and a 5.9% improvement in mIoU compared to previous Gaussian-based methods, without the use of temporal labels. To our knowledge, this is the first occupancy model to enable real-time inference while maintaining competitive performance. The use of superquadrics reduces the number of primitives required for scene modeling by 84% relative to Gaussian-based approaches. Finally, evaluation against prior methods is facilitated by our fast superquadric voxelization module. The code will be released as open source.
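
To make the superquadric representation concrete, the following minimal sketch evaluates the standard superquadric inside-outside function and uses it to mark occupied voxels. It is not the paper's voxelization module: the function names, the axis-aligned (rotation-free) primitive, and the brute-force grid loop are all assumptions made for illustration.

```python
def superquadric_f(point, center, scales, eps1, eps2):
    """Standard superquadric inside-outside function.

    Returns F < 1 for points inside, F == 1 on the surface, F > 1 outside.
    eps1 and eps2 are the shape exponents (eps1 = eps2 = 1 gives an ellipsoid).
    Assumption: the primitive is axis-aligned, so no rotation is applied.
    """
    x = abs(point[0] - center[0]) / scales[0]
    y = abs(point[1] - center[1]) / scales[1]
    z = abs(point[2] - center[2]) / scales[2]
    xy_term = (x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy_term + z ** (2.0 / eps1)


def voxelize_superquadric(center, scales, eps1, eps2,
                          grid_min, voxel_size, grid_shape):
    """Return the set of (i, j, k) voxel indices whose centers lie inside.

    Hypothetical brute-force helper: every voxel center is tested, which is
    simple to read but far from a fast implementation.
    """
    occupied = set()
    for i in range(grid_shape[0]):
        for j in range(grid_shape[1]):
            for k in range(grid_shape[2]):
                voxel_center = (
                    grid_min[0] + (i + 0.5) * voxel_size,
                    grid_min[1] + (j + 0.5) * voxel_size,
                    grid_min[2] + (k + 0.5) * voxel_size,
                )
                if superquadric_f(voxel_center, center, scales, eps1, eps2) <= 1.0:
                    occupied.add((i, j, k))
    return occupied
```

Varying the two exponents moves a single primitive between box-like, ellipsoidal, and pinched shapes, which is the property the abstract credits for the reduced primitive count.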

Hariprasath Govindarajan, Maciej K. Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani

Main category: cs.CV

TL;DR: CleverDistiller is a self-supervised cross-modal knowledge distillation framework that transfers 2D vision foundation model capabilities to 3D LiDAR models using simple feature similarity loss and MLP projection, achieving state-of-the-art performance in semantic segmentation and 3D object detection.

Details

Motivation: Existing methods for cross-modal knowledge distillation from 2D to 3D rely on complex distillation losses or pseudo-semantic maps, or are limited to features useful for semantic segmentation. There's a need for a simpler, more general approach that doesn't depend on semantic supervision.

Method: Uses a direct feature similarity loss with an MLP projection head so that the 3D network can learn complex semantic dependencies. Introduces an auxiliary self-supervised spatial task, occupancy prediction, to complement the distilled semantic knowledge with 3D spatial reasoning. Does not require pseudo-semantic maps.

Result: Achieves state-of-the-art performance in both semantic segmentation and 3D object detection, with gains of up to 10% mIoU, especially when fine-tuning on small amounts of data.

Conclusion: The simple yet powerful knowledge distillation strategy effectively transfers generalization capabilities from 2D vision foundation models to 3D LiDAR models without complex losses or semantic supervision, demonstrating strong performance across multiple tasks.

Abstract: Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices. Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multi-layer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD) by up to 10% mIoU, especially when fine-tuning on very small amounts of data, showing the effectiveness of our simple yet powerful KD strategy.
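
The combination of an MLP projection head and a direct feature similarity loss can be sketched as follows. The abstract does not state the exact form of the loss, so using one minus cosine similarity is an assumption, as are the two-layer head, the function names, and the plain-list weight format.

```python
import math


def mlp_project(feature, w1, b1, w2, b2):
    """Hypothetical two-layer projection head: linear -> ReLU -> linear."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, feature)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (small epsilon avoids 0/0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def feature_similarity_loss(point_features, pixel_features, w1, b1, w2, b2):
    """Mean of (1 - cosine similarity) over paired 3D point and 2D pixel features.

    point_features: student (LiDAR) features, one vector per point.
    pixel_features: teacher (VFM) features at the pixels those points project to.
    Assumption: the point-to-pixel pairing has already been computed.
    """
    losses = [
        1.0 - cosine_similarity(mlp_project(p, w1, b1, w2, b2), t)
        for p, t in zip(point_features, pixel_features)
    ]
    return sum(losses) / len(losses)
```

In this form only the direction of the projected student feature is matched to the teacher feature, so no negative pairs or pseudo-semantic labels are needed.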

[178] ATAC: Augmentation-Based Test-Time Adversarial Correction for CLIP

Linxiang Su, András Balogh

Main category: cs.CV

TL;DR: ATAC is a test-time defense method that corrects adversarial perturbations in CLIP’s embedding space using augmentation-induced drift vectors and angular consistency, surpassing previous methods in robustness by nearly 50% on average with minimal computational cost.

Details

Motivation: CLIP is highly vulnerable to adversarial attacks, and existing test-time defense strategies have limited robustness while adversarial fine-tuning is too costly.

Method: ATAC operates in CLIP’s embedding space, calculating augmentation-induced drift vectors to infer semantic recovery direction and correcting embeddings based on angular consistency of latent drifts.

Result: ATAC consistently achieves remarkably high robustness across benchmarks, surpassing previous state-of-the-art methods by nearly 50% on average with minimal computational overhead, and maintains robustness in extreme settings and against adaptive attacks.

Conclusion: ATAC is an efficient method that represents a novel paradigm for test-time adversarial defenses in CLIP’s embedding space.

Abstract: Despite its remarkable success in zero-shot image-text matching, CLIP remains highly vulnerable to adversarial perturbations on images. As adversarial fine-tuning is prohibitively costly, recent works explore various test-time defense strategies; however, these approaches still exhibit limited robustness. In this work, we revisit this problem and propose a simple yet effective strategy: Augmentation-based Test-time Adversarial Correction (ATAC). Our method operates directly in the embedding space of CLIP, calculating augmentation-induced drift vectors to infer a semantic recovery direction and correcting the embedding based on the angular consistency of these latent drifts. Across a wide range of benchmarks, ATAC consistently achieves remarkably high robustness, surpassing that of previous state-of-the-art methods by nearly 50% on average, all while requiring minimal computational overhead. Furthermore, ATAC retains state-of-the-art robustness in unconventional and extreme settings and even achieves nontrivial robustness against adaptive attacks. Our results demonstrate that ATAC is an efficient method in a novel paradigm for test-time adversarial defenses in the embedding space of CLIP.
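
One possible reading of the drift-vector idea is sketched below. The abstract gives no formulas, so the consistency measure (mean cosine between each drift and the mean drift), the threshold `tau`, the step size `alpha`, and the function name are all assumptions rather than the authors' procedure; the embeddings stand in for outputs of a CLIP image encoder.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    na, nb = _norm(a), _norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def drift_correction(embedding, augmented_embeddings, tau=0.5, alpha=1.0):
    """Shift an embedding along the mean augmentation drift if drifts agree.

    embedding: embedding of the (possibly attacked) input image.
    augmented_embeddings: embeddings of augmented copies of the same image.
    Returns (embedding_out, consistency); the input is returned unchanged
    when the drifts do not point in a consistent direction.
    """
    drifts = [[a - e for a, e in zip(aug, embedding)] for aug in augmented_embeddings]
    mean_drift = [sum(column) / len(drifts) for column in zip(*drifts)]
    consistency = sum(_cosine(d, mean_drift) for d in drifts) / len(drifts)
    if consistency < tau:
        return list(embedding), consistency
    shifted = [e + alpha * m for e, m in zip(embedding, mean_drift)]
    length = _norm(shifted)
    return [x / length for x in shifted], consistency
```

The intuition in this sketch is that augmentations of a clean image scatter its embedding in no particular direction, while augmentations of an adversarial image tend to pull the embedding back toward its original semantics, so a high agreement among drifts signals that a correction is warranted.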

[179] Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

Junming Liu, Siyuan Meng, Yanting Gao, Song Mao, Pinlong Cai, Guohang Yan, Yirong Chen, Zilin Bian, Ding Wang, Botian Shi

Main category: cs.CV

TL;DR: VaLiK is a novel approach for constructing Multimodal Knowledge Graphs (MMKGs) that uses cascaded Vision-Language Models to align image features with text descriptions, enabling cross-modal reasoning enhancement for LLMs without manual annotations.

Details

Motivation: Multimodal reasoning in LLMs faces challenges with incomplete knowledge and hallucinations, while existing Knowledge Graphs have modality isolation issues. Practical MMKG construction is hindered by the semantic narrowness of manual text annotations and by noisy visual-semantic entity linkages.

Method: Cascade pre-trained VLMs to align image features with text, generating image-specific descriptions. Implement cross-modal similarity verification to filter noise and ensure semantic consistency. Construct MMKG using refined descriptions without manual annotations.

Result: Achieves substantial storage efficiency gains while maintaining entity-to-image linkage. LLMs augmented with VaLiK outperform previous state-of-the-art models on multimodal reasoning tasks.

Conclusion: VaLiK provides an effective framework for constructing MMKGs that enhances LLM reasoning through cross-modal information supplementation, addressing key limitations of traditional approaches.

Abstract: Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLMs reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we developed a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKGs construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models. Our code is published at https://github.com/Wings-Of-Disaster/VaLiK.

[180] SVRecon: Sparse Voxel Rasterization for Surface Reconstruction

Seunghun Oh, Jaesung Choe, Dongjae Lee, Daeun Lee, Seunghoon Jeong, Yu-Chiang Frank Wang, Jaesik Park

Main category: cs.CV

TL;DR: SVRecon extends sparse voxel rasterization with SDF integration for high-fidelity surface reconstruction, addressing optimization challenges through geometric initialization and spatial smoothness losses.

Details

Motivation: Sparse voxels have sharp boundaries and are prone to local minima during optimization, unlike 3D Gaussians. While SDF provides smooth geometric fields, maintaining this smoothness across independently parameterized sparse voxels is challenging.

Method: Integrates an SDF into sparse voxel rasterization with two key components: (1) robust geometric initialization using a visual geometry model, and (2) a spatial smoothness loss enforcing coherent relationships across parent-child and sibling voxel groups.

Result: Achieves strong reconstruction accuracy across various benchmarks with consistently speedy convergence.

Conclusion: SVRecon successfully combines sparse voxel rasterization with SDF for high-fidelity surface reconstruction, overcoming optimization challenges through proper initialization and smoothness constraints.

Abstract: We extend the recently proposed sparse voxel rasterization paradigm to the task of high-fidelity surface reconstruction by integrating a Signed Distance Function (SDF); we call the resulting method SVRecon. Unlike 3D Gaussians, sparse voxels are spatially disentangled from their neighbors and have sharp boundaries, which makes them prone to local minima during optimization. Although SDF values provide a naturally smooth and continuous geometric field, preserving this smoothness across independently parameterized sparse voxels is nontrivial. To address this challenge, we promote coherent and smooth voxel-wise structure through (1) robust geometric initialization using a visual geometry model and (2) a spatial smoothness loss that enforces coherent relationships across parent-child and sibling voxel groups. Extensive experiments across various benchmarks show that our method achieves strong reconstruction accuracy while converging consistently fast. The code will be made public.

[181] MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

Runxun Zhang, Yizhou Liu, Li Dongrui, Bo XU, Jingwei Wei

Main category: cs.CV

TL;DR: MorphSeek is a fine-grained representation-level policy optimization framework for deformable image registration that reformulates the problem as spatially continuous optimization in latent feature space, achieving improved performance with high label efficiency.

Details

Motivation: Deformable image registration faces challenges due to high-dimensional deformation spaces and limited voxel-level supervision. Existing reinforcement learning approaches use coarse representations that limit their ability to capture spatially variant deformations.

Method: Uses a stochastic Gaussian policy head on the encoder to model latent feature distributions, enabling efficient exploration and coarse-to-fine refinement. Integrates unsupervised warm-up with weakly supervised fine-tuning via Group Relative Policy Optimization with multi-trajectory sampling.

Result: Achieves consistent Dice improvements across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, Abdomen MR-CT) while maintaining high label efficiency with minimal parameter cost and low latency overhead.

Conclusion: MorphSeek advances representation-level policy learning for spatially coherent and data-efficient deformation optimization, providing a principled, backbone-agnostic solution for scalable visual alignment in high-dimensional settings.

Abstract: Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.

[182] ISS-Geo142: A Benchmark for Geolocating Astronaut Photography from the International Space Station

Vedika Srivastava, Hemant Kumar Singh, Jaisal Singh

Main category: cs.CV

TL;DR: ISS-Geo142 is a benchmark for geolocating ISS astronaut photography, with three evaluated methods: NN-Geo (75.52% success), SIFT-Match (high precision but computationally expensive), and TerraByte (approximately 90% success with human-readable descriptions).

Details

Motivation: ISS images lack direct georeferencing despite known capture positions, making automated localization challenging. The benchmark addresses this gap for future research.

Method: Three geolocation pipelines: NN-Geo uses VGG16 features and cross-correlation, SIFT-Match uses sliding-window feature matching, and TerraByte uses GPT-4 with vision capabilities for joint reasoning.

Result: TerraByte achieved the best performance (approximately 90% success rate), NN-Geo achieved 75.52%, while SIFT-Match showed high precision on structurally rich scenes but with high computational cost.

Conclusion: ISS-Geo142 and the three pipelines provide a concrete benchmark for future ISS image geolocation work, with TerraByte establishing the strongest baseline.

Abstract: This paper introduces ISS-Geo142, a curated benchmark for geolocating astronaut photography captured from the International Space Station (ISS). Although the ISS position at capture time is known precisely, the specific Earth locations depicted in these images are typically not directly georeferenced, making automated localization non-trivial. ISS-Geo142 consists of 142 images with associated metadata and manually determined geographic locations, spanning a range of spatial scales and scene types. On top of this benchmark, we implement and evaluate three geolocation pipelines: a neural network based approach (NN-Geo) using VGG16 features and cross-correlation over map-derived Areas of Interest (AOIs), a Scale-Invariant Feature Transform based pipeline (SIFT-Match) using sliding-window feature matching on stitched high-resolution AOIs, and TerraByte, an AI system built around a GPT-4 model with vision capabilities that jointly reasons over image content and ISS coordinates. On ISS-Geo142, NN-Geo achieves a match for 75.52% of the images under our evaluation protocol, SIFT-Match attains high precision on structurally rich scenes at substantial computational cost, and TerraByte establishes the strongest overall baseline, correctly geolocating approximately 90% of the images while also producing human-readable geographic descriptions. The methods and experiments were originally developed in 2023; this manuscript is a revised and extended version that situates the work relative to subsequent advances in cross-view geo-localization and remote-sensing vision–language models. Taken together, ISS-Geo142 and these three pipelines provide a concrete, historically grounded benchmark for future work on ISS image geolocation.

[183] MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment

Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu, Jinglin Xu

Main category: cs.CV

TL;DR: Proposes MCMoE, a framework for multimodal AQA that handles missing modalities by dynamically reconstructing them and using a mixture of experts to maintain performance.

Details

Motivation: Existing multimodal AQA models fail when modalities are missing at inference, causing catastrophic performance degradation due to interrupted cross-modal interactions.

Method: Uses an adaptive gated modality generator to reconstruct missing modalities, modality experts for unimodal learning, and a dynamic mixture of expert knowledge for joint representations. Mines complete multimodal features during training to guide generation.

Result: Achieves state-of-the-art results on three public AQA benchmarks in both complete and incomplete multimodal learning scenarios.

Conclusion: MCMoE effectively addresses the missing-modality problem in multimodal AQA through unified single-stage training and a mixture-of-experts approach.

Abstract: Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.
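
The step of dynamically mixing the knowledge of all experts can be sketched as a softmax-gated weighted sum of per-modality expert outputs. This is a generic mixture-of-experts combination, not the paper's architecture: the function names are hypothetical, and the gate logits are passed in directly where a real model would compute them from the available (or reconstructed) modalities.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_outputs, gate_logits):
    """Combine expert output vectors with softmax gate weights.

    expert_outputs: one feature vector per modality expert, all of equal size.
    gate_logits: one unnormalized score per expert (assumed to come from a
    learned gating network, which is omitted here).
    """
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * output[d] for w, output in zip(weights, expert_outputs))
        for d in range(dim)
    ]
```

With equal logits the result is the plain average of the experts; a gate that strongly favors one expert effectively falls back to that modality alone, which is one way such a design can degrade gracefully when a modality is missing.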

[184] MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models

Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, Yew-Soon Ong

Main category: cs.CV

TL;DR: MMT-ARD is a multimodal multi-teacher adversarial robust distillation framework that enhances VLM robustness through dual-teacher knowledge fusion, dynamic weight allocation, and adaptive sigmoid-based weighting.

Details

Motivation: Vision-Language Models are deployed in safety-critical applications, but traditional single-teacher adversarial knowledge distillation suffers from limited knowledge diversity, slow convergence, and difficulty balancing robustness and accuracy.

Method: Proposes a dual-teacher knowledge fusion architecture for collaborative optimization of clean feature preservation and robust feature enhancement, with dynamic weight allocation based on teacher confidence and adaptive sigmoid-based weighting to mitigate teacher bias.

Result: Improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, with a 2.3x increase in training efficiency over traditional single-teacher methods.

Conclusion: MMT-ARD effectively enhances adversarial robustness of multimodal large models with improved scalability and efficiency.

Abstract: Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our codes are available at https://github.com/itsnotacie/MMT-ARD.

[185] Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition

Nissim Maruani, Peiying Zhang, Siddhartha Chaudhuri, Matthew Fisher, Nanxuan Zhao, Vladimir G. Kim, Pierre Alliez, Mathieu Desbrun, Wang Yifan

Main category: cs.CV

TL;DR: Illustrator’s Depth is a novel depth definition that decomposes flat images into editable, ordered layers by inferring layer indices for each pixel, enabling various downstream applications like image vectorization, text-to-vector generation, and depth-aware editing.

Details

Motivation: To address the challenge of decomposing flat images into editable, ordered layers for digital content creation, inspired by an artist's compositional process.

Method: Proposed a neural network trained on curated layered vector graphics to predict layering directly from raster inputs, using a discrete, globally consistent ordering of elements optimized for editability.

Result: Significantly outperforms state-of-the-art baselines for image vectorization, enables high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing.

Conclusion: By reframing depth from a physical quantity to a creative abstraction, illustrator’s depth prediction offers a new foundation for editable image decomposition.

Abstract: We introduce Illustrator’s Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist’s compositional process, illustrator’s depth assigns a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator’s depth prediction offers a new foundation for editable image decomposition.
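
Once a layer index is available for every pixel, turning it into ordered layers is a simple grouping step, sketched below. The predicted map is taken as given (the network itself is not modeled), and both the function name and the convention that a lower index means further back are assumptions.

```python
def split_into_layers(layer_index_map):
    """Group pixel coordinates by layer index, ordered from lowest to highest.

    layer_index_map: 2D list of integers, one predicted layer index per pixel.
    Returns a list of layers; each layer is a list of (row, col) coordinates.
    Assumption: a lower index means the pixel belongs to a layer further back.
    """
    layers = {}
    for row, values in enumerate(layer_index_map):
        for col, index in enumerate(values):
            layers.setdefault(index, []).append((row, col))
    return [layers[index] for index in sorted(layers)]
```

Because the ordering is discrete and global, each returned group can be edited, moved, or vectorized on its own without recomputing the others.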

[186] Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift

Björn Michele, Alexandre Boulch, Gilles Puy, Tuan-Hung Vu, Renaud Marlet, Nicolas Courty

Main category: cs.CV

TL;DR: Study identifies best practices for using vision foundation models in unsupervised domain adaptation for lidar semantic segmentation, achieving state-of-the-art results.

Details

Motivation: Semantic segmentation networks trained on one type of lidar data fail to generalize to unseen lidars, requiring domain adaptation solutions.

Method: Unsupervised image-to-lidar knowledge distillation using vision foundation models, with focus on lidar backbone architecture and frozen pretrained backbones with trainable MLP heads.

Result: Achieves state-of-the-art results in four challenging domain adaptation settings for lidar semantic segmentation.

Conclusion: Proper architecture selection and frozen pretrained backbones with trainable MLP heads are key to maximizing generalization performance across lidar domains.

Abstract: Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image-to-lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximizing the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four widely recognized and challenging settings. The code will be available at: https://github.com/valeoai/muddos.

[187] FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

Main category: cs.CV

TL;DR: FAR replaces attention mechanisms in transformers with bidirectional LSTM modules to enable efficient in-memory computing, maintaining accuracy while reducing latency and parameters.

Details

Motivation: Transformers' attention mechanism is poorly suited for in-memory computing devices due to intensive computations and non-local memory access, causing latency and bandwidth issues on ReRAM-based accelerators.

Method: Function-preserving Attention Replacement (FAR) substitutes all attention in pretrained DeiTs with multi-head bidirectional LSTM modules via block-wise distillation, enabling linear-time computation and localized weight reuse. Structured pruning is also incorporated for resource adaptation.

Result: FAR maintains comparable accuracy to original attention-based DeiT models on ImageNet and downstream tasks while reducing parameters and latency. It preserves semantic token relationships while improving computational efficiency.

Conclusion: FAR enables energy-efficient transformer inference on IMC-based edge accelerators by replacing attention with IMC-compatible sequential modules, offering a practical solution for deploying transformers on resource-constrained hardware.

Abstract: While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.

[188] GPR-OdomNet: Difference and Similarity-Driven Odometry Estimation Network for Ground Penetrating Radar-Based Localization

Huaichao Wang, Xuanxin Fan, Ji Liu, Haifeng Li, Dezhen Song

Main category: cs.CV

TL;DR: A neural network-based odometry method using GPR B-scan images that extracts multi-scale features and analyzes similarities/differences to estimate Euclidean distances traveled, achieving state-of-the-art performance with 10.2% RMSE reduction.

Details

Motivation: Existing GPR-based localization methods struggle with accurate distance estimation when processing B-scan images with minor distinctions, especially under adverse weather/environmental conditions.

Method: Custom neural network that extracts multi-scale features from consecutive GPR B-scan images and analyzes similarity/difference features between them to estimate Euclidean distances traveled.

Result: Achieved overall weighted RMSE of 0.449 m across all datasets, representing a 10.2% reduction in RMSE compared to the best state-of-the-art method, consistently outperforming counterparts in all tests.

Conclusion: The proposed neural network-based odometry method effectively leverages GPR B-scan image features for precise distance estimation and demonstrates superior performance over existing state-of-the-art approaches.

Abstract: When performing robot/vehicle localization using ground penetrating radar (GPR) to handle adverse weather and environmental conditions, existing techniques often struggle to accurately estimate distances when processing B-scan images with minor distinctions. This study introduces a new neural network-based odometry method that leverages the similarity and difference features of GPR B-scan images for precise estimation of the Euclidean distances traveled between the B-scan images. The new custom neural network extracts multi-scale features from B-scan images taken at consecutive moments and then determines the Euclidean distance traveled by analyzing the similarities and differences between these features. To evaluate our method, an ablation study and comparison experiments have been conducted using the publicly available CMU-GPR dataset. The experimental results show that our method consistently outperforms state-of-the-art counterparts in all tests. Specifically, our method achieves an overall weighted root mean square error (RMSE) of 0.449 m across all datasets, which is a 10.2% reduction in RMSE when compared to the best state-of-the-art method.
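The reported figures can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the abstract does not say how the "weighted" RMSE is weighted, so pooling by sample count is an assumption, and the helper names `pooled_rmse` and `implied_baseline` are hypothetical. Under the stated 10.2% reduction, an RMSE of 0.449 m implies a best-baseline RMSE of about 0.500 m.

```python
import math

def pooled_rmse(per_dataset):
    """Combine per-dataset RMSEs, weighting each by its sample count.

    per_dataset: list of (rmse, n_samples) pairs. Weighting by sample
    count is an assumption; the abstract does not define the weights.
    """
    total = sum(n for _, n in per_dataset)
    return math.sqrt(sum(n * rmse ** 2 for rmse, n in per_dataset) / total)

def implied_baseline(ours, relative_reduction):
    """RMSE a baseline must have for `ours` to be the stated reduction."""
    return ours / (1.0 - relative_reduction)

# Hypothetical per-dataset values, only to exercise the function.
print(round(pooled_rmse([(0.40, 100), (0.50, 300)]), 3))

# Figures from the abstract: 0.449 m and a 10.2% reduction.
print(round(implied_baseline(0.449, 0.102), 3))
```

Pooling squared errors rather than averaging the RMSE values directly keeps the result equal to the RMSE over the concatenated samples, which is one common reading of "weighted".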

[189] Counterfactual World Models via Digital Twin-conditioned Video Diffusion

Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath

Main category: cs.CV

TL;DR: CWMDT transforms standard video diffusion models into counterfactual world models by creating digital twin representations of scenes, using LLMs to predict intervention effects, and generating counterfactual videos.

Details

Motivation: Current world models only handle factual observations, but emerging applications need counterfactual reasoning about "what if" scenarios, like object removal or property changes.

Method: Three-step framework: 1) Construct digital twins as structured text representations of objects and relationships, 2) Use LLMs to predict how interventions propagate through time, 3) Condition video diffusion models with modified representations to generate counterfactual sequences.

Result: State-of-the-art performance on two benchmarks, demonstrating that structured representations like digital twins provide powerful control for video simulation-based world models.

Conclusion: Alternative video representations, particularly digital twins, enable effective counterfactual reasoning in world models, overcoming limitations of pixel-space approaches that prevent targeted interventions.

Abstract: World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as “what would happen if this object was removed?”, is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.
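To make the first step more concrete, here is a minimal sketch of what a digital twin as structured text and an object-removal intervention might look like. Everything in it is an assumption for illustration: the dictionary layout, the field names, and the helpers `remove_object` and `to_text` are hypothetical and not taken from the paper. CWMDT uses a large language model to predict how an intervention propagates over time; the sketch replaces that step with a simple rule (drop the object and every relation that mentions it) and omits the video diffusion step entirely.

```python
def remove_object(twin, name):
    """Return a copy of the twin without `name` and without its relations.

    This rule-based edit stands in for the LLM reasoning step and only
    covers the immediate effect of the intervention.
    """
    return {
        "objects": [o for o in twin["objects"] if o["name"] != name],
        "relations": [
            r for r in twin["relations"]
            if name not in (r["subject"], r["object"])
        ],
    }

def to_text(twin):
    """Serialize the twin as plain text, one fact per line."""
    lines = [f"object {o['name']}: {o['state']}" for o in twin["objects"]]
    lines += [
        f"relation {r['subject']} {r['predicate']} {r['object']}"
        for r in twin["relations"]
    ]
    return "\n".join(lines)

# Hypothetical scene, not an example from the paper.
scene = {
    "objects": [
        {"name": "table", "state": "static"},
        {"name": "cup", "state": "static"},
    ],
    "relations": [
        {"subject": "cup", "predicate": "on", "object": "table"},
    ],
}

# "What would happen if this object was removed?"
print(to_text(remove_object(scene, "cup")))
```

The modified text would then serve as the conditioning signal for the video generator, which is where the structured representation gives control that pixel-space inputs lack.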

[190] Radar2Shape: 3D Shape Reconstruction from High-Frequency Radar using Multiresolution Signed Distance Functions

Neel Sortur, Justin Goodwin, Purvik Patel, Luis Enrique Martinez, Tzofi Klinghoffer, Rajmonda S. Caceres, Robin Walters

Main category: cs.CV

TL;DR: Radar2Shape is a denoising diffusion model that reconstructs arbitrary 3D shapes from partially-observed high-frequency radar signals using frequency correlation with multiresolution shape features.

Details

Motivation: Existing deep learning methods fail to represent arbitrary shapes from radar signals, and optical 3D reconstruction methods struggle when treating radar as camera views. There's a need for robust 3D reconstruction from limited-viewing-angle radar data for commercial and aerospace applications.

Method: Two-stage approach: 1) Learns a regularized latent space with hierarchical shape feature resolutions, 2) Uses denoising diffusion conditioned on radar signal frequencies in a coarse-to-fine manner.

Result: Successfully reconstructs arbitrary 3D shapes from partially-observed radar signals, demonstrates robust generalization across two simulation methods and real-world data.

Conclusion: Radar2Shape effectively handles 3D reconstruction from limited-view radar signals and provides benchmark datasets for future high-frequency radar research.

Abstract: Determining the shape of 3D objects from high-frequency radar signals is analytically complex but critical for commercial and aerospace applications. Previous deep learning methods have been applied to radar modeling; however, they often fail to represent arbitrary shapes or have difficulty with real-world radar signals which are collected over limited viewing angles. Existing methods in optical 3D reconstruction can generate arbitrary shapes from limited camera views, but struggle when they naively treat the radar signal as a camera view. In this work, we present Radar2Shape, a denoising diffusion model that handles a partially observable radar signal for 3D reconstruction by correlating its frequencies with multiresolution shape features. Our method consists of a two-stage approach: first, Radar2Shape learns a regularized latent space with hierarchical resolutions of shape features, and second, it diffuses into this latent space by conditioning on the frequencies of the radar signal in an analogous coarse-to-fine manner. We demonstrate that Radar2Shape can successfully reconstruct arbitrary 3D shapes even from partially-observed radar signals, and we show robust generalization to two different simulation methods and real-world data. Additionally, we release two synthetic benchmark datasets to encourage future research in the high-frequency radar domain so that models like Radar2Shape can safely be adapted into real-world radar systems.

[191] An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI

Roozbeh Bazargani, Saqib Abdullah Basar, Daniel Daly-Grafstein, Rodrigo Solis Pompa, Soojin Lee, Saurabh Garg, Yuntong Ma, John A. Carrino, Siavash Khallaghi, Sam Hashemi

Main category: cs.CV

TL;DR: A deep learning method estimates spine age from MRI images, using the spine age gap (SAG) as a biomarker for spine health, showing associations with degenerative conditions and lifestyle factors.

Details

Motivation: The spine is vulnerable to age-related degenerations detectable via MRI, and there's a need for automated methods to assess spine health and its relationship with degenerative conditions and lifestyle.

Method: Proposed a computer-vision-based deep learning model using over 18,000 MRI series. Used UMAP and HDBSCAN for clustering degenerative conditions, and conducted ablation studies on data size, loss functions, and spine regions.

Result: The spine age gap (SAG) - difference between actual and predicted spine age - is associated with disc bulges, disc osteophytes, spinal stenosis, fractures, smoking, and physically demanding work.

Conclusion: SAG may serve as a useful biomarker for measuring overall spine health, linking it to specific degenerative conditions and modifiable lifestyle factors.

Abstract: The human spine is a complex structure composed of 33 vertebrae. It holds the body and is important for leading a healthy life. The spine is vulnerable to age-related degenerations that can be identified through magnetic resonance imaging (MRI). In this paper, we propose a novel computer-vision-based deep learning method to estimate spine age using images from over 18,000 MRI series. Data are restricted to subjects with only age-related spine degeneration. Eligibility criteria are created by identifying common age-based clusters of degenerative spine conditions using uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Model selection is determined using a detailed ablation study on data size, loss, and the effect of different spine regions. We evaluate the clinical utility of our model by calculating the difference between actual spine age and model-predicted age, the spine age gap (SAG), and examining the association between these differences and spine degenerative conditions and lifestyle factors. We find that SAG is associated with conditions including disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors like smoking and physically demanding work, and thus may be a useful biomarker for measuring overall spine health.
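The spine age gap itself is a simple difference, and the association analysis starts from per-group summaries of it. The sketch below follows the abstract's wording (actual age minus model-predicted age); the sign convention used in the paper may differ. The function names and the records are made up for illustration and are not data or code from the study.

```python
from statistics import mean

def spine_age_gap(actual_age, predicted_age):
    """SAG as worded in the abstract: actual age minus predicted age."""
    return actual_age - predicted_age

def mean_gap_by_group(records):
    """Average SAG per group label.

    records: iterable of (group, actual_age, predicted_age) tuples.
    """
    gaps = {}
    for group, actual, predicted in records:
        gaps.setdefault(group, []).append(spine_age_gap(actual, predicted))
    return {group: mean(values) for group, values in gaps.items()}

# Made-up records for illustration; not data from the paper.
records = [
    ("smoker", 50, 56),
    ("smoker", 40, 44),
    ("non-smoker", 50, 49),
    ("non-smoker", 60, 61),
]
print(mean_gap_by_group(records))
```

With this convention, a negative gap means the model judges the spine to look older than the subject's chronological age.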

[192] Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

Mark Endo, Serena Yeung-Levy

Main category: cs.CV

TL;DR: The paper analyzes how downscaling LLMs affects multimodal capabilities, finding that visual perception suffers at least as much as reasoning. It introduces the Extract+Think approach, which uses visual extraction tuning to address this bottleneck.

Details

Motivation: Practical demands require smaller, efficient multimodal systems, but downscaling LLMs disproportionately affects visual capabilities rather than inherited LLM abilities.

Method: Introduces the Extract+Think approach: visual extraction tuning trains the model to extract instruction-relevant visual details consistently across tasks, and step-by-step reasoning over those details then produces the answer.

Result: LLM downscaling causes sharp performance drops in visual perception, often matching or exceeding the impact on reasoning. The proposed method sets new efficiency and performance standards.

Conclusion: Visual extraction tuning combined with step-by-step reasoning effectively addresses the bottleneck in downscaled multimodal models, balancing efficiency and capability preservation.

Abstract: Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.

[193] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

Jianxiang He, Meisheng Hong, Jungang Li, Ziyang Chen, Weiyu Guo, Xuming Hu, Hui Xiong

Main category: cs.CV

TL;DR: VSI is a multimodal keyframe retrieval framework that combines visual and textual information to improve long video processing in MLLMs, achieving state-of-the-art performance in text-related tasks.

Details

Motivation: Existing keyframe search algorithms rely only on visual modality, making them difficult to adapt to text-related tasks and often causing retrieval results to deviate from core semantic content.

Method: Proposes the VISUAL-SUBTITLE INTEGRATION (VSI) framework, a dual-branch collaborative retrieval approach that combines Video Search and Subtitle Match to fuse complementary visual and textual information for precise localization.

Result: Experiments on LongVideoBench and VideoMME show VSI achieves state-of-the-art accuracy in keyframe retrieval, delivers breakthrough performance in text-related tasks, and exhibits strong generalization across other tasks.

Conclusion: VSI effectively addresses limitations of visual-only keyframe retrieval by integrating multimodal information, significantly improving performance in text-related video tasks while maintaining strong generalization capabilities.

Abstract: Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, with sampled frame quality directly impacting downstream performance. Existing keyframe search algorithms achieve a balance between efficiency and sampled frame quality but heavily rely on the visual modality alone. This makes them difficult to adapt to text-related tasks and often leads to retrieval results deviating from core semantic content. To address this, we propose the VISUAL-SUBTITLE INTEGRATION (VSI), a multimodal keyframe retrieval framework. It employs a dual-branch collaborative retrieval approach combining Video Search and Subtitle Match to fuse complementary visual and textual information for precise localization. Experiments on LongVideoBench and VideoMME demonstrate that VSI achieves state-of-the-art accuracy in keyframe retrieval while delivering breakthrough performance in text-related tasks and exhibiting strong generalization across other tasks.
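As a rough illustration of dual-branch retrieval, the sketch below fuses per-frame scores from a visual branch and a subtitle branch and keeps the top-k frames. The linear fusion rule, the `alpha` weight, and the function name `fuse_and_select` are assumptions made for this example; the abstract does not specify how VSI combines the two branches.

```python
def fuse_and_select(visual_scores, subtitle_scores, k, alpha=0.5):
    """Pick k keyframes from per-frame scores of two retrieval branches.

    visual_scores / subtitle_scores: equal-length lists of floats, one
    entry per candidate frame. alpha weights the visual branch.
    Linear fusion is an illustrative assumption, not the paper's rule.
    """
    if len(visual_scores) != len(subtitle_scores):
        raise ValueError("both branches must score the same frames")
    fused = [
        alpha * v + (1.0 - alpha) * s
        for v, s in zip(visual_scores, subtitle_scores)
    ]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return sorted(ranked[:k])  # frame indices in temporal order

# Hypothetical scores for four candidate frames.
visual = [0.9, 0.2, 0.4, 0.1]
subtitle = [0.1, 0.3, 0.9, 0.8]
print(fuse_and_select(visual, subtitle, k=2))
```

Setting `alpha=1.0` reduces this to visual-only retrieval, the setting the abstract identifies as hard to adapt to text-related tasks.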

[194] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu

Main category: cs.CV

TL;DR: Video-R4 introduces visual rumination for text-rich video reasoning, using iterative frame selection, zooming, and re-encoding to achieve state-of-the-art performance on video QA tasks.

Details

Motivation: Existing video QA models fail on fine-grained evidence due to single-pass perception, while humans repeatedly inspect critical regions through pausing, zooming, and re-reading.

Method: A multi-stage framework that finetunes a 7B LMM to perform visual rumination via supervised practice (Video-R4-CoT-17k) and reinforcement learning (Video-R4-RL-30k) with atomic and mixing visual operations.

Result: Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and generalizes to multi-page document QA, slides QA, and generic video QA.

Conclusion: Iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning, enabling fine-grained evidence capture in text-rich videos.

Abstract: Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.

[195] EvDiff: High Quality Video with an Event Camera

Weilun Li, Lei Sun, Ruixi Gao, Qi Jiang, Yuqin Ma, Kaiwei Wang, Ming-Hsuan Yang, Luc Van Gool, Danda Pani Paudel

Main category: cs.CV

TL;DR: EvDiff is an event-based diffusion model that generates high-quality colorful videos from monochromatic event streams using a surrogate training framework and single-step diffusion.

Details

Motivation: Event cameras record sparse brightness changes but reconstructing intensity images is ill-posed due to absolute brightness ambiguity. Existing regression methods produce perceptually inferior results and struggle with model scaling.

Method: Proposes EvDiff with surrogate training framework that eliminates need for paired event-image datasets, uses temporally consistent EvEncoder, and performs only single forward diffusion step to reduce computational cost.

Result: Generates high-quality colorful videos from monochromatic event streams, outperforming existing approaches on both pixel-level and perceptual metrics in real-world datasets.

Conclusion: EvDiff strikes a favorable balance between fidelity and realism, enabling scalable video generation from event streams without requiring paired event-image training data.

Abstract: As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.

[196] Native 3D Editing with Full Attention

Weiwei Cai, Shuangkang Fang, Weicai Ye, Xin Dong, Yunhan Yang, Xuanyang Zhang, Wei Cheng, Yanpei Cao, Gang Yu, Tao Chen

Main category: cs.CV

TL;DR: A native 3D editing framework that directly manipulates 3D representations in a single feed-forward pass, using a large-scale multi-modal dataset and token concatenation for efficient instruction-guided editing.

Details

Motivation: Existing methods have limitations: optimization-based approaches are too slow, while feed-forward approaches using multi-view 2D editing suffer from inconsistent geometry and degraded visual quality.

Method: Created a large-scale multi-modal dataset for instruction-guided 3D editing covering addition, deletion, and modification tasks. Explored two conditioning strategies: cross-attention and a novel 3D token concatenation approach.

Result: Token concatenation proved more parameter-efficient and achieved superior performance. The method outperforms existing 2D-lifting approaches in generation quality, 3D consistency, and instruction fidelity.

Conclusion: The proposed native 3D editing framework sets a new benchmark for instruction-guided 3D editing by directly manipulating 3D representations efficiently while maintaining quality and consistency.

Abstract: Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.

Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli

Main category: cs.CV

TL;DR: MOCHA is a distillation framework that transfers multimodal knowledge from large VLMs to lightweight detectors for personalized object detection, achieving significant performance gains with minimal inference overhead.

Details

Motivation: Lightweight detectors lack strong semantic priors for few-shot personalized detection, while large VLMs are too computationally expensive for real-time applications.

Method: Extracts fused visual-textual embeddings from frozen VLM teacher and guides student training via dual-objective loss for local alignment and global relational consistency across regions.

Result: Consistently outperforms prior baselines across four personalized detection benchmarks with +10.1 average improvement under few-shot regimes.

Conclusion: MOCHA enables efficient transfer of multimodal semantics to lightweight detectors without teacher modifications or text input at inference, making it suitable for real-time applications.

Abstract: Personalized object detection aims to adapt a general-purpose detector to recognize user-specific instances from only a few examples. Lightweight models often struggle in this setting due to their weak semantic priors, while large vision-language models (VLMs) offer strong object-level understanding but are too computationally demanding for real-time or on-device applications. We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a distillation framework that transfers multimodal region-level knowledge from a frozen VLM teacher into a lightweight vision-only detector. MOCHA extracts fused visual and textual teacher’s embeddings and uses them to guide student training through a dual-objective loss that enforces accurate local alignment and global relational consistency across regions. This process enables efficient transfer of semantics without the need for teacher modifications or textual input at inference. MOCHA consistently outperforms prior baselines across four personalized detection benchmarks under strict few-shot regimes, yielding a +10.1 average improvement, with minimal inference cost.
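One way to picture a dual-objective distillation loss is a local term that pulls each student region embedding toward its teacher counterpart, plus a relational term that matches the pairwise similarity structure across regions. The sketch below is an assumption-laden illustration, not MOCHA's actual loss: the cosine-based terms, the `weight` parameter, and the function names are hypothetical, and it assumes student and teacher embeddings already share a dimension (in practice a projection layer would be needed).

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def dual_objective_loss(student, teacher, weight=1.0):
    """Local alignment plus global relational consistency.

    student, teacher: lists of region embeddings with matching count and
    dimension. Both terms are illustrative choices, not the paper's.
    """
    n = len(student)
    # Local term: each student region should point the same way as its teacher.
    local = sum(1.0 - cosine(s, t) for s, t in zip(student, teacher)) / n
    # Relational term: region-to-region similarities should match the teacher's.
    relational = sum(
        (cosine(student[i], student[j]) - cosine(teacher[i], teacher[j])) ** 2
        for i in range(n)
        for j in range(n)
    ) / (n * n)
    return local + weight * relational

# Hypothetical 2-D embeddings for two regions.
teacher = [[1.0, 0.0], [0.0, 1.0]]
collapsed_student = [[1.0, 0.0], [1.0, 0.0]]
print(dual_objective_loss(teacher, teacher))
print(dual_objective_loss(collapsed_student, teacher))
```

The relational term penalizes a student that maps distinct regions to the same embedding even when some of them align well locally, which is the intuition behind enforcing consistency across regions.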

[198] Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding

Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Winson Han, Ranjay Krishna

Main category: cs.CV

TL;DR: SOC is a synthetic data generation pipeline that composes high-quality object segments using geometric and camera augmentations with generative harmonization, achieving state-of-the-art performance on multiple benchmarks while enabling controllable dataset construction.

Details

Motivation: Real-world datasets for visual grouping tasks are costly, biased, and difficult to scale, while existing synthetic datasets struggle with flexibility, accuracy, and compositional diversity.

Method: Object-centric composition strategy using 3D geometric layout augmentation, camera configuration augmentation, generative harmonization, and mask-area-weighted blending to compose synthetic object segments into new images.

Result: Models trained on 100K SOC images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines by +24-36%, achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Also enables strong performance in low-data scenarios (+6.59 AP on 1% COCO).

Conclusion: SOC provides an accurate, scalable alternative to real datasets for visual grouping tasks, offering superior performance, controllability for targeted use cases, and significant improvements in data-limited scenarios.

Abstract: Visual grouping – operationalized through tasks such as instance segmentation, visual grounding, and object detection – enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce Synthetic Object Compositions (SOC), an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen) by +24-36% – achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Beyond the general open-vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real-data scales and yields even greater improvements under extremely limited real-data conditions, including +6.59 AP on a 1% COCO data setup. Furthermore, this controllability enables targeted data generation for intra-class referring, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.

[199] Colo-ReID: Discriminative Representation Embedding with Meta-learning for Colonoscopic Polyp Re-Identification

Suncheng Xiang, Chengfeng Zhou, Zhengjie Zhang, Shilun Cai, Dahong Qian

Main category: cs.CV

TL;DR: Colo-ReID is a meta-learning based method for colonoscopic polyp re-identification that addresses domain gap issues and improves retrieval performance in scenarios with limited samples.

Details

Motivation: Traditional CNN models trained on ImageNet perform poorly on colonoscopic datasets due to large domain gaps, and existing methods fail to exploit self-discrepancy in intra-class and inter-class relations for polyp re-identification.

Method: Proposes Colo-ReID training method using meta-learning strategy for few-shot scenarios, with a dynamic Meta-Learning Regulation (MLR) mechanism to enhance polyp re-identification performance.

Result: Colo-ReID outperforms the second-best method by +2.3% mAP on the polyp re-identification task.

Conclusion: The proposed Colo-ReID method effectively addresses domain adaptation challenges in medical imaging and achieves superior performance for colonoscopic polyp re-identification through meta-learning strategies.

Abstract: Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras and plays an important role in the prevention and treatment of colorectal cancer. However, traditional methods for object ReID directly adopting CNN models trained on the ImageNet dataset usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Additionally, these methods neglect to explore the potential of self-discrepancy among intra-class or inter-class relations in the colonoscopic polyp dataset, which remains an open research problem in the medical community. To solve this dilemma, we propose a simple but effective training method named Colo-ReID, which can help our model learn more general and discriminative knowledge based on the meta-learning strategy in scenarios with fewer samples. Based on this, a dynamic Meta-Learning Regulation mechanism called MLR is introduced to further boost the performance of polyp re-identification. Our experimental results show that Colo-ReID consistently outperforms the second-best method by +2.3% mAP on the polyp re-identification task. Our source code is also publicly available at https://github.com/JeremyXSC/Colo-ReID.

[200] A statistical method for crack pre-detection in 3D concrete images

Vitalii Makogin, Duc Nguyen, Evgeny Spodarev

Main category: cs.CV

TL;DR: A statistical framework for crack pre-localization in large CT images that identifies likely crack regions with controlled error rates, enabling more efficient subsequent segmentation.

Details

Motivation: Both classical and deep learning methods face computational challenges when processing high-resolution 3D CT images directly, requiring a more efficient preprocessing step.

Method: Combines Hessian-based filtering, geometric descriptors on spatial partitions, and spatial multiple testing to detect anomalous regions using minimal calibration data.

Result: Reliably highlights crack-containing regions while maintaining linear computational complexity, as demonstrated on semi-synthetic and real 3D CT scans.

Conclusion: Provides a practical, interpretable preprocessing step that enables more efficient deep learning segmentation by focusing computational resources only on relevant regions.

Abstract: In practical applications, effectively segmenting cracks in large-scale computed tomography (CT) images holds significant importance for understanding the structural integrity of materials. Classical image-processing techniques and modern deep-learning models both face substantial computational challenges when applied directly to high-resolution big data volumes. This paper introduces a statistical framework for crack pre-localization, whose purpose is not to replace or compete with segmentation networks, but to identify, with controlled error rates, the regions of a 3D CT image that are most likely to contain cracks. The method combines a simple Hessian-based filter, geometric descriptors computed on a regular spatial partition, and a spatial multiple testing procedure to detect anomalous regions while relying only on minimal calibration data, rather than large annotated datasets. Experiments on semi-synthetic and real 3D CT scans demonstrate that the proposed approach reliably highlights regions likely to contain cracks while preserving linear computational complexity. By restricting subsequent high-resolution segmentation to these localized regions, deep-learning models can be trained and operated more efficiently, reducing both training runtime and resource consumption. The framework thus offers a practical and interpretable preprocessing step for large-scale CT inspection pipelines.
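The abstract does not say which multiple testing procedure is used. As a rough illustration of error-controlled region selection, the sketch below applies the standard Benjamini-Hochberg step-up rule to one p-value per spatial block. This is an assumed stand-in rather than the authors' spatial procedure: it ignores spatial dependence between blocks, and the function name and inputs are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of hypotheses rejected at FDR level alpha.

    p_values: one p-value per spatial block (hypothetical input; how the
    p-values are computed from the block descriptors is not shown here).
    """
    m = len(p_values)
    # Indices of the blocks, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value
        # falls below its rank-dependent threshold.
        if p_values[idx] <= alpha * rank / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    return sorted(order[:k_max])


# Five blocks; only the first and third are flagged as likely crack regions.
flagged = benjamini_hochberg([0.001, 0.8, 0.012, 0.04, 0.5], alpha=0.05)
print(flagged)
```

Only the flagged blocks would then be passed to a downstream segmentation model, which is the point of the pre-localization step.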

[201] MindShot: A Few-Shot Brain Decoding Framework via Transferring Cross-Subject Prior and Distilling Frequency Domain Knowledge

Shuai Jiang, Zhu Meng, Haiwen Li, Delong Liu, Fei Su, Zhicheng Zhao

Main category: cs.CV

TL;DR: MindShot is a two-stage fMRI brain decoding framework that enables few-shot visual reconstruction with high accuracy using only 1.8% of data, reducing scanning time by up to 99% while achieving 83.6% CLIP accuracy.

Details

Motivation: Address challenges in brain decoding including substantial individual differences and high data collection costs, moving beyond the limited per-subject-per-model paradigm to enable practical clinical applications.

Method: Two-stage framework: 1) Multi-Subject Pretraining (MSP) using multi-modal contrastive learning to mine cross-subject prior, 2) Fourier-based cross-subject Knowledge Distillation (FKD) to reduce inter-individual differences and improve adaptability.

Result: Achieves 83.6% CLIP accuracy using only 1.8% of fMRI-image pairs, surpassing 77.4% accuracy of methods trained on full NSD dataset. Reduces scanning time by up to 99% while maintaining high semantic fidelity in visual reconstruction.

Conclusion: MindShot makes large-scale brain decoding frameworks feasible with significantly less data, facilitating practical applications in clinical scenarios through its few-shot learning capability.

Abstract: Aiming to reconstruct visual stimuli from brain signals, brain decoding has recently made significant progress using functional magnetic resonance imaging (fMRI). However, it still has challenging issues such as substantial individual differences and high data collection costs. To simplify these problems, most methods adopt the per-subject-per-model paradigm, but this greatly limits their applications. In this paper, we design a few-shot brain decoding setting specifically for potential clinical scenarios and propose a novel two-stage decoding framework named MindShot, comprising a Multi-Subject Pretraining (MSP) stage and a Fourier-based cross-subject Knowledge Distillation (FKD) stage. Firstly, an MSP framework based on multi-modal contrastive learning is constructed to mine the cross-subject prior. Secondly, the FKD is presented to decrease inter-individual differences while improving the decoding adaptability to new individuals. Our approach achieves high semantic fidelity in visual reconstruction on the largest dataset and has the potential to reduce scanning time by up to 99%. Remarkably, MindShot achieves a CLIP accuracy of 83.6% using only 1.8% of the fMRI-image pairs, surpassing the 77.4% accuracy of the method trained on the entire NSD dataset. This makes it feasible to train large-scale brain decoding frameworks that require less data, facilitating practical applications. The code is available at https://github.com/JSinBUPT/MindShot.

[202] Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography

Doan-Van-Anh Ly, Thi-Thu-Hien Pham, Thanh-Hai Le

Main category: cs.CV

TL;DR: ResNet-based UNet3+ with CBAM attention module outperforms Transformer and Mamba backbones for liver tumor segmentation in CECT, achieving best Dice score (0.755), IoU (0.662), and boundary precision.

Details

Motivation: Liver structure segmentation in multi-phase CECT is crucial for computer-aided diagnosis and treatment planning of liver diseases including tumor detection.

Method: Evaluated UNet-based architectures with ResNet, Transformer, and Mamba backbones, then incorporated attention mechanisms (CBAM) to improve segmentation quality.

Result: ResNetUNet3+ with CBAM achieved best performance: Dice 0.755, IoU 0.662, HD95 77.911, accuracy 0.925, specificity 0.926. Transformer and Mamba backbones underperformed compared to ResNet.

Conclusion: Classical ResNet architecture combined with modern attention modules remains highly competitive for medical image segmentation, offering promising direction for liver tumor detection in clinical practice.

Abstract: Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architectures, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with the CBAM module not only produced the best overlap metrics, with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model’s superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the regions most influential to the model’s predictions, providing insights into its decision-making process. These findings demonstrate that the classical ResNet architecture, when combined with modern attention modules, remains highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.
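For readers less familiar with the overlap metrics reported above, the following is a minimal sketch of how Dice and IoU are conventionally computed for one pair of binary masks. It uses the standard definitions and is not the authors' evaluation code; the function name, the flat-list mask format, and the empty-mask convention are assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Compute (Dice, IoU) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Count pixels that are foreground in both masks, and in each mask.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# A 2x2 prediction with two foreground pixels against a ground truth with one.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0])
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

For a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.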

[203] SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting

Yicheng Deng, Hideaki Hayashi, Hajime Nagahara

Main category: cs.CV

TL;DR: Proposes SW-MRO features and SpotFormer transformer for facial expression spotting, achieving state-of-the-art performance especially in micro-expression detection.

Details

Motivation: Existing methods struggle with irrelevant facial movements and detecting subtle motions in micro-expressions, hindering accurate expression spotting.

Method: Uses Sliding Window-based multi-temporal-resolution Optical flow (SW-MRO) features and SpotFormer transformer with Facial Local Graph Pooling and supervised contrastive learning.

Result: Outperforms state-of-the-art models on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 datasets, particularly excelling in micro-expression spotting.

Conclusion: The proposed framework effectively addresses challenges in facial expression spotting, especially for micro-expressions, through innovative feature extraction and transformer architecture.

Abstract: Facial expression spotting, identifying periods where facial expressions occur in a video, is a significant yet challenging task in facial expression analysis. The issues of irrelevant facial movements and the challenge of detecting subtle motions in micro-expressions remain unresolved, hindering accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based multi-temporal-resolution Optical flow (SW-MRO) feature, which calculates multi-temporal-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. SW-MRO can effectively reveal subtle motions while avoiding the optical flow being dominated by head movements. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, we use the proposed Facial Local Graph Pooling (FLGP) operation and convolutional layers to extract multi-scale spatio-temporal features. We show the validity of the architecture of SpotFormer by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.

[204] ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching

Anirban Ray, Vera Galinova, Florian Jug

Main category: cs.CV

TL;DR: ResMatching is a computational super-resolution method using guided conditional flow matching to learn improved data priors, achieving the best trade-off between data fidelity and perceptual realism on biological structures.

Details

Motivation: Computational super-resolution in fluorescence microscopy is ill-posed and requires strong priors to extrapolate missing frequencies. With better machine learning techniques, stronger priors can be learned for improved CSR results.

Method: ResMatching uses guided conditional flow matching to learn improved data priors for computational super-resolution. It can sample from an implicitly learned posterior distribution and provides pixel-wise uncertainty estimates.

Result: ResMatching consistently achieved competitive results on 4 diverse biological structures from BioSR dataset, outperforming 7 baselines. It showed best trade-off between data fidelity and perceptual realism, particularly effective with noisy low-resolution images.

Conclusion: ResMatching enables effective computational super-resolution with calibrated uncertainty estimates, making it particularly valuable for cases where strong priors are hard to learn due to noise in low-resolution images.

Abstract: Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger priors can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.

[205] MonoGSDF: Exploring Monocular Geometric Cues for Gaussian Splatting-Guided Implicit Surface Reconstruction

Kunyi Li, Michael Niemeyer, Zeyu Chen, Nassir Navab, Federico Tombari

Main category: cs.CV

TL;DR: MonoGSDF combines Gaussian primitives with neural SDF for high-quality 3D surface reconstruction from monocular images, eliminating need for Marching Cubes while maintaining efficiency.

Details

Motivation: Address limitations of 3D Gaussian Splatting methods that rely on sparse explicit primitives, which struggle to recover watertight and topologically consistent 3D surfaces from monocular images.

Method: Couples Gaussian-based primitives with neural Signed Distance Field (SDF), uses SDF to guide Gaussians’ spatial distribution during training, employs Gaussians as priors for surface reconstruction at inference, implements scaling strategy for arbitrary-scale scenes, multi-resolution training, and incorporates monocular geometric cues from off-the-shelf estimators.

Result: Outperforms prior methods on real-world datasets while maintaining efficiency, achieves high-quality surface reconstruction without memory-intensive Marching Cubes.

Conclusion: MonoGSDF successfully bridges the gap between Gaussian-based rendering and surface reconstruction, providing an efficient and effective solution for accurate meshing from monocular images.

Abstract: Accurate meshing from monocular images remains a key challenge in 3D vision. While state-of-the-art 3D Gaussian Splatting (3DGS) methods excel at synthesizing photorealistic novel views through rasterization-based rendering, their reliance on sparse, explicit primitives severely limits their ability to recover watertight and topologically consistent 3D surfaces. We introduce MonoGSDF, a novel method that couples Gaussian-based primitives with a neural Signed Distance Field (SDF) for high-quality reconstruction. During training, the SDF guides Gaussians’ spatial distribution, while at inference, Gaussians serve as priors to reconstruct surfaces, eliminating the need for memory-intensive Marching Cubes. To handle arbitrary-scale scenes, we propose a scaling strategy for robust generalization. A multi-resolution training scheme further refines details and monocular geometric cues from off-the-shelf estimators enhance reconstruction quality. Experiments on real-world datasets show MonoGSDF outperforms prior methods while maintaining efficiency.

[206] Token Adaptation via Side Graph Convolution for Efficient Fine-tuning of 3D Point Cloud Transformers

Takahiko Furuya

Main category: cs.CV

TL;DR: STAG is a parameter-efficient fine-tuning method for 3D point cloud Transformers that uses graph convolutional side networks to achieve superior computational efficiency while maintaining accuracy.

Details

Motivation: Existing PEFT methods for 3D point cloud Transformers suffer from high temporal and spatial computational costs during fine-tuning, despite minimizing tunable parameters.

Method: STAG employs a graph convolutional side network operating in parallel with a frozen backbone Transformer, using efficient graph convolution, parameter sharing, and reduced gradient computation.

Result: STAG maintains comparable classification accuracy while reducing tunable parameters to only 0.43M and achieving significant reductions in both computation time and memory consumption. Also introduces PCC13 benchmark for comprehensive evaluation.

Conclusion: STAG effectively addresses the computational efficiency limitations of existing PEFT methods for 3D point cloud analysis while maintaining performance.

Abstract: Parameter-efficient fine-tuning (PEFT) of pre-trained 3D point cloud Transformers has emerged as a promising technique for 3D point cloud analysis. While existing PEFT methods attempt to minimize the number of tunable parameters, they often suffer from high temporal and spatial computational costs during fine-tuning. This paper proposes a novel PEFT algorithm called Side Token Adaptation on a neighborhood Graph (STAG) to achieve superior temporal and spatial efficiency. STAG employs a graph convolutional side network operating in parallel with a frozen backbone Transformer to adapt tokens to downstream tasks. Through efficient graph convolution, parameter sharing, and reduced gradient computation, STAG significantly reduces both temporal and spatial costs for fine-tuning. We also present Point Cloud Classification 13 (PCC13), a new benchmark comprising diverse publicly available 3D point cloud datasets to facilitate comprehensive evaluation. Extensive experiments using multiple pre-trained models and PCC13 demonstrate the effectiveness of STAG. Specifically, STAG maintains classification accuracy comparable to existing methods while reducing tunable parameters to only 0.43M and achieving significant reductions in both computation time and memory consumption for fine-tuning. Code and benchmark will be available at: https://github.com/takahikof/STAG.

[207] CLIMB-3D: Continual Learning for Imbalanced 3D Instance Segmentation

Vishal Thengane, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Lu Yin, Xiatian Zhu, Salman Khan

Main category: cs.CV

TL;DR: CLIMB3D is a unified framework for class-incremental imbalance-aware 3D instance segmentation that addresses both new class emergence and class imbalance through pseudo-label generation and class-balanced re-weighting.

Details

Motivation: Existing 3D instance segmentation methods assume all object classes are known in advance and uniformly distributed, which is unrealistic in dynamic real-world environments where new classes emerge gradually and exhibit natural imbalance.

Method: The framework builds on exemplar replay strategies and introduces a pseudo-label generator (PLG) that extends supervision to previously learned categories using predictions from a frozen model. To address PLG’s bias towards frequent classes, it employs a class-balanced re-weighting (CBR) scheme that estimates object frequencies from pseudo-labels and dynamically adjusts training bias without requiring past data.

Result: The method achieves state-of-the-art results, surpassing prior work by up to 16.76% mAP for instance segmentation and approximately 30% mIoU for semantic segmentation on ScanNet200 and ScanNetV2 datasets, demonstrating strong generalization across both frequent and rare classes.

Conclusion: CLIMB3D effectively addresses the challenges of class-incremental learning with class imbalance in 3D instance segmentation through its unified framework combining pseudo-label generation and dynamic re-weighting strategies.

Abstract: While 3D instance segmentation (3DIS) has advanced significantly, most existing methods assume that all object classes are known in advance and uniformly distributed. However, this assumption is unrealistic in dynamic, real-world environments where new classes emerge gradually and exhibit natural imbalance. Although some approaches address the emergence of new classes, they often overlook class imbalance, which leads to suboptimal performance, particularly on rare categories. To tackle this, we propose CLIMB-3D, a unified framework for CLass-incremental Imbalance-aware 3DIS. Building upon established exemplar replay (ER) strategies, we show that ER alone is insufficient to achieve robust performance under memory constraints. To mitigate this, we introduce a novel pseudo-label generator (PLG) that extends supervision to previously learned categories by leveraging predictions from a frozen model trained on prior tasks. Despite its promise, PLG tends to be biased towards frequent classes. Therefore, we propose a class-balanced re-weighting (CBR) scheme that estimates object frequencies from pseudo-labels and dynamically adjusts training bias, without requiring access to past data. We design and evaluate three incremental scenarios for 3DIS on the challenging ScanNet200 dataset and additionally validate our method for semantic segmentation on ScanNetV2. Our approach achieves state-of-the-art results, surpassing prior work by up to 16.76% mAP for instance segmentation and approximately 30% mIoU for semantic segmentation, demonstrating strong generalisation across both frequent and rare classes. Code is available at: https://github.com/vgthengane/CLIMB3D
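The abstract does not give the CBR formula. To make the idea of frequency-based re-weighting concrete, the sketch below counts how often each class appears among pseudo-labels and derives smoothed inverse-frequency loss weights, normalized to a mean of 1. The weighting rule, the smoothing term, and all names are assumptions for illustration and may differ from the authors' scheme.

```python
from collections import Counter


def class_balanced_weights(pseudo_labels, num_classes, smoothing=1.0):
    """Derive per-class loss weights from pseudo-label frequencies.

    pseudo_labels: iterable of integer class ids predicted by a frozen model.
    smoothing: added to every count so unseen classes get a finite weight
    (an assumed choice, not taken from the paper).
    """
    counts = Counter(pseudo_labels)
    # Inverse-frequency weights: rare classes receive larger weights.
    raw = [1.0 / (counts.get(c, 0) + smoothing) for c in range(num_classes)]
    # Normalize so the weights average to 1 and the loss scale is unchanged.
    mean = sum(raw) / num_classes
    return [w / mean for w in raw]


# Class 0 appears three times, class 1 once, so class 1 is up-weighted.
weights = class_balanced_weights([0, 0, 0, 1], num_classes=2)
print(weights)
```

Because the counts come from pseudo-labels alone, weights of this kind can be recomputed at each incremental step without storing data from earlier tasks.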

[208] Model Inversion Attack Against Deep Hashing

Dongdong Zhao, Qiben Xu, Ranxin Fang, Baogang Song

Main category: cs.CV

TL;DR: DHMI is the first diffusion-based model inversion attack framework for deep hashing systems, successfully reconstructing high-quality training images from hash codes even in black-box settings where no training data is accessible.

Details

Motivation: Deep hashing introduces severe privacy risks by enabling reconstruction of original training data from hash codes, potentially leading to biometric forgery and privacy breaches, but model inversion attacks specifically for deep hashing remained unexplored due to the inaccessibility of training hash codes and discrete Hamming space challenges.

Method: DHMI clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors, then uses surrogate-guided denoising optimization with a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples, guided by a cluster of surrogate models for refinement.

Result: Experiments on multiple datasets show DHMI successfully reconstructs high-resolution, high-quality images even in the most challenging black-box setting where no training hash codes are available, outperforming existing state-of-the-art model inversion attacks in black-box scenarios.

Conclusion: DHMI demonstrates both practical efficacy and confirms the critical privacy risks inherent in deep hashing systems, highlighting the urgent need for privacy protection measures in hash-based retrieval systems.

Abstract: Deep hashing improves retrieval efficiency through compact binary codes, yet it introduces severe and often overlooked privacy risks. The ability to reconstruct original training data from hash codes could lead to serious threats such as biometric forgery and privacy breaches. However, model inversion attacks specifically targeting deep hashing models remain unexplored, leaving their security implications unexamined. This research gap stems from the inaccessibility of genuine training hash codes and the highly discrete Hamming space, which prevents existing methods from adapting to deep hashing. To address these challenges, we propose DHMI, the first diffusion-based model inversion framework designed for deep hashing. DHMI first clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors. It then introduces a surrogate-guided denoising optimization method that leverages a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples. A cluster of surrogate models guides the refinement of these candidates, ensuring the generation of high-fidelity and semantically consistent images. Experiments on multiple datasets demonstrate that DHMI successfully reconstructs high-resolution, high-quality images even under the most challenging black-box setting, where no training hash codes are available. Our method outperforms the existing state-of-the-art model inversion attacks in black-box scenarios, confirming both its practical efficacy and the critical privacy risks inherent in deep hashing systems.

[209] TrackGS: Optimizing COLMAP-Free 3D Gaussian Splatting with Global Track Constraints

Dongbo Shi, Shen Cao, Lubin Fan, Bojian Wu, Jinhui Guo, Ligang Liu, Renjie Chen

Main category: cs.CV

TL;DR: TrackGS integrates global feature tracks with 3D Gaussian Splatting to enable COLMAP-free novel view synthesis, eliminating the need for precomputed camera parameters while maintaining high rendering quality.

Details

Motivation: 3D Gaussian Splatting delivers impressive rendering quality but relies on accurate precomputed camera parameters, which limits its practical applications. Existing COLMAP-free approaches fail in complex scenarios due to their dependence on local constraints.

Method: Leverages feature tracks to establish global geometric constraints for simultaneous optimization of camera parameters and 3D Gaussians. Introduces track-constrained Gaussians as geometric anchors, proposes novel 2D and 3D track losses for multi-view consistency, and derives differentiable formulations for camera intrinsics optimization.

Result: Extensive experiments on challenging real-world and synthetic datasets demonstrate state-of-the-art performance with much lower pose error than previous methods while maintaining superior rendering quality.

Conclusion: TrackGS successfully eliminates the need for COLMAP preprocessing, making 3D Gaussian Splatting more accessible for practical applications by enabling COLMAP-free novel view synthesis with global geometric constraints.

Abstract: We present TrackGS, a novel method to integrate global feature tracks with 3D Gaussian Splatting (3DGS) for COLMAP-free novel view synthesis. While 3DGS delivers impressive rendering quality, its reliance on accurate precomputed camera parameters remains a significant limitation. Existing COLMAP-free approaches depend on local constraints that fail in complex scenarios. Our key innovation lies in leveraging feature tracks to establish global geometric constraints, enabling simultaneous optimization of camera parameters and 3D Gaussians. Specifically, we: (1) introduce track-constrained Gaussians that serve as geometric anchors, (2) propose novel 2D and 3D track losses to enforce multi-view consistency, and (3) derive differentiable formulations for camera intrinsics optimization. Extensive experiments on challenging real-world and synthetic datasets demonstrate state-of-the-art performance, with much lower pose error than previous methods while maintaining superior rendering quality. Our approach eliminates the need for COLMAP preprocessing, making 3DGS more accessible for practical applications.

[210] Semantic-ICP: Iterative Closest Point for Non-rigid Multi-Organ Point Cloud Registration

Wanwen Chen, Qi Zeng, Carson Studders, Jamie J. Y. Kwon, Emily H. T. Pang, Eitan Prisman, Septimiu E. Salcudean

Main category: cs.CV

TL;DR: A novel non-rigid semantic ICP method that incorporates anatomical labels and biomechanical energy constraints for improved point cloud registration in clinical applications.

Details

Motivation: Classical ICP methods fail to consider semantic meaning of points (anatomical labels) and biomechanical energy constraints, limiting their effectiveness in computer-aided interventions.

Method: Proposed SemICP method that uses semantic labels for robust closest point matching and incorporates linear elastic energy regularization in point cloud deformation representation.

Result: Significantly improves Hausdorff distance and mean surface distance compared to other methods on four datasets, and enables effective alignment of US and MR point clouds when combined with deep learning segmentation.

Conclusion: The semantic ICP method with biomechanical regularization provides more accurate and clinically relevant point cloud registration for medical applications.

Abstract: Point cloud registration is important in computer-aided interventions (CAI). While learning-based point cloud registration methods have been developed, their clinical application is hampered by issues of generalizability and explainability. Therefore, classical point cloud registration methods, including Iterative Closest Point (ICP), are still widely applied in CAI. ICP methods fail to consider that: (1) the points have well-defined semantic meaning, in that each point can be related to a specific anatomical label; (2) the deformation required for registration needs to follow biomechanical energy constraints. In this paper, we present a novel non-rigid semantic ICP (SemICP) method that handles multiple point labels and uses linear elastic energy regularization. We use semantic labels to improve the robustness of closest point matching and propose a novel point cloud deformation representation that incorporates explicit biomechanical energy regularization. Our experiments on four datasets show that our method significantly improves the Hausdorff distance and mean surface distance compared with other point cloud registration methods. We also demonstrate that integrating deep learning segmentation models with our registration pipeline enables effective alignment of US and MR point clouds.
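The label-aware matching step can be pictured with a small sketch: each source point is matched only to target points that carry the same anatomical label. The brute-force search below is an assumed simplification of that one step. It leaves out the deformation model and the elastic energy regularization, and the function name and integer labels are hypothetical.

```python
import math


def semantic_closest_points(source, source_labels, target, target_labels):
    """For each source point, return the index of the nearest target point
    with the same label, or None if the target has no point of that label."""
    # Group target indices by label so each search stays within one structure.
    by_label = {}
    for j, label in enumerate(target_labels):
        by_label.setdefault(label, []).append(j)

    matches = []
    for point, label in zip(source, source_labels):
        candidates = by_label.get(label)
        if not candidates:
            matches.append(None)
            continue
        # Brute-force nearest neighbour among same-label candidates.
        matches.append(min(candidates, key=lambda j: math.dist(point, target[j])))
    return matches


source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
source_labels = [1, 2]
target = [(0.1, 0.0, 0.0), (0.9, 0.0, 0.0), (5.0, 0.0, 0.0)]
target_labels = [2, 1, 2]
# The first source point skips its geometrically nearest neighbour
# (target 0) because that point carries a different label.
print(semantic_closest_points(source, source_labels, target, target_labels))
```

A practical implementation would more likely build one spatial index (for example a k-d tree) per label instead of scanning all candidates.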

[211] RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

Main category: cs.CV

TL;DR: New algorithm for multi-view multi-person pose estimation with fast triangulation and good generalization, extending to whole-body pose estimation including facial expressions and finger movements.

Details

Motivation: To advance computer vision applications by improving understanding of human movement and interactions through better multi-view imaging and pose estimation.

Method: Developed a new algorithm that focuses on fast triangulation speeds and extends to whole-body pose estimation across multiple individuals and viewpoints.

Result: Demonstrated strong performance across unseen datasets and configurations, showing good adaptability to different settings.

Conclusion: The work presents an effective approach for multi-view multi-person pose estimation with fast performance and generalization capabilities, and is made publicly available to support further progress.

Abstract: The integration of multi-view imaging and pose estimation represents a significant advance in computer vision applications, offering new possibilities for understanding human movement and interactions. This work presents a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speeds and good generalization capabilities. The approach extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Adaptability to different settings is demonstrated through strong performance across unseen datasets and configurations. To support further progress in this field, all of this work is publicly accessible.

Sejuti Rahman, Swakshar Deb, MD. Sameer Iqbal Chowdhury, MD. Jubair Ahmed Sourov, Mohammad Shamsuddin

Main category: cs.CV

TL;DR: A multi-modal depression detection framework using eye tracking, audio, and video data with a novel Multi-Frequency Graph Convolutional Network that outperforms existing methods in classification accuracy.

Details

Motivation: Depression is underdiagnosed due to reliance on subjective clinical assessments. There's a need for objective detection methods that can capture comprehensive depressive symptoms through multimodal data.

Method: Collected tripartite data (eye tracking, audio, video) from 103 clinically assessed participants. Developed Multi-Frequency Graph Convolutional Network (MF-GCN) with Multi-Frequency Filter Bank Module to leverage both low and high frequency signals, addressing limitations of existing graph-based models.

Result: Binary classification achieved sensitivity of 0.96 and F2 score of 0.94. 3-class classification achieved sensitivity of 0.79 and specificity of 0.87. On external CMDC dataset, achieved sensitivity of 0.95 and F2 score of 0.96, demonstrating strong generalizability.

Conclusion: The trimodal, multi-frequency framework effectively captures cross-modal interactions for accurate depression detection and significantly outperforms traditional machine learning and deep learning baselines.

Abstract: Depression is a prevalent global mental health disorder, characterised by persistent low mood and anhedonia. However, it remains underdiagnosed because current diagnostic methods depend heavily on subjective clinical assessments. To enable objective detection, we introduce a gold standard dataset of 103 clinically assessed participants collected through a tripartite data approach which uniquely integrated eye tracking data with audio and video to give a comprehensive representation of depressive symptoms. Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non-depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low- and high-frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary classification, the model achieved a sensitivity of 0.96 and F2 score of 0.94. For the 3-class classification task, the proposed method achieved a sensitivity of 0.79 and specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and F2 score of 0.96. These results confirm that our trimodal, multi-frequency framework effectively captures cross-modal interaction for accurate depression detection.
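
Since the results are reported as sensitivity and F2 score, a short sketch of the standard F-beta formula may help when reading them. Sensitivity is the same quantity as recall, and F2 weights recall more heavily than precision. The precision and recall values below are made up for illustration and are not taken from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta > 1 favours recall (sensitivity)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Hypothetical operating point: perfect recall, modest precision.
f2 = f_beta(precision=0.5, recall=1.0, beta=2.0)
f1 = f_beta(precision=0.5, recall=1.0, beta=1.0)
print(round(f2, 4), round(f1, 4))
```

With recall held high, F2 stays well above F1, which is why it suits screening settings where missed cases are costlier than false alarms.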

[213] OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Volker Tresp, Alois Knoll

Main category: cs.CV

TL;DR: OpenDriveVLA is a vision-language-action model for autonomous driving that uses multimodal inputs and hierarchical alignment to generate spatially grounded driving actions, achieving state-of-the-art performance on trajectory planning and driving QA tasks.

Details

Motivation: To develop an end-to-end autonomous driving system that can process multimodal inputs (2D/3D visual representations, ego states, language commands) and generate reliable driving actions by bridging the modality gap between vision and language.

Method: Built on open-source LLMs with hierarchical vision-language alignment to project 2D/3D visual tokens into unified semantic space, and incorporates structured agent-environment-ego interaction modeling in autoregressive decoding for spatial dependencies and behavior-aware dynamics.

Result: Achieves state-of-the-art results on nuScenes dataset for open-loop trajectory planning and driving-related question answering, with qualitative analysis showing capability to follow high-level commands and handle challenging scenarios.

Conclusion: OpenDriveVLA demonstrates strong potential for next-generation end-to-end autonomous driving through its multimodal processing, spatial grounding, and interaction modeling capabilities.

Abstract: We present OpenDriveVLA, a Vision Language Action model designed for end-to-end autonomous driving, built upon open-source large language models. OpenDriveVLA generates spatially grounded driving actions by leveraging multimodal inputs, including 2D and 3D instance-aware visual representations, ego vehicle states, and language commands. To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Furthermore, we incorporate structured agent environment ego interaction modeling into the autoregressive decoding process, enabling the model to capture fine-grained spatial dependencies and behavior-aware dynamics critical for reliable trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question answering tasks. Qualitative analyses further illustrate its capability to follow high-level driving commands and generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving.

[214] Generalizable and Relightable Gaussian Splatting for Human Novel View Synthesis

Yipengjing Sun, Shengping Zhang, Chenyang Wang, Shunyuan Zheng, Zonglin Li, Xiangyang Ji

Main category: cs.CV

TL;DR: GRGS is a generalizable 3D Gaussian framework for high-fidelity human novel view synthesis under diverse lighting conditions, using feed-forward supervised learning to project 2D observations into 3D representations with robust geometry and physically-based rendering.

Details

Motivation: Existing methods for human novel view synthesis either require per-character optimization or ignore physical constraints, limiting their generalizability and relighting capabilities under diverse lighting conditions.

Method: GRGS uses a Lighting-robust Geometry Refinement (LGR) module trained on synthetically relit data for accurate geometry, and a Physically Grounded Neural Rendering (PGNR) module that combines neural prediction with physics-based shading. It employs a 2D-to-3D projection training scheme with differentiable supervision from various lighting maps.

Result: Extensive experiments show GRGS achieves superior visual quality, geometric consistency, and generalization across different characters and lighting conditions compared to existing methods.

Conclusion: GRGS provides a generalizable and relightable framework for high-fidelity human novel view synthesis that supports editable relighting with shadows and indirect illumination while maintaining computational efficiency.

Abstract: We propose GRGS, a generalizable and relightable 3D Gaussian framework for high-fidelity human novel view synthesis under diverse lighting conditions. Unlike existing methods that rely on per-character optimization or ignore physical constraints, GRGS adopts a feed-forward, fully supervised strategy projecting geometry, material, and illumination cues from multi-view 2D observations into 3D Gaussian representations. To recover accurate geometry under diverse lighting conditions, we introduce a Lighting-robust Geometry Refinement (LGR) module trained on synthetically relit data to predict precise depth and surface normals. Based on the high-quality geometry, a Physically Grounded Neural Rendering (PGNR) module is further proposed to integrate neural prediction with physics-based shading, supporting editable relighting with shadows and indirect illumination. Moreover, we design a 2D-to-3D projection training scheme leveraging differentiable supervision from ambient occlusion, direct, and indirect lighting maps, alleviating the computational cost of ray tracing. Extensive experiments demonstrate that GRGS achieves superior visual quality, geometric consistency, and generalization across characters and lighting conditions.

[215] QuantFace: Efficient Quantization for Face Restoration

Jiatong Li, Libo Zhu, Haotong Qin, Jingkai Wang, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang

Main category: cs.CV

TL;DR: QuantFace is a novel low-bit quantization framework that quantizes face restoration models from 32-bit full precision to 4 to 6 bits while maintaining performance through rotation-scaling channel balancing, quantization-distillation LoRA, and adaptive bit-width allocation.

Details

Motivation: Diffusion models achieve remarkable face restoration performance but suffer from heavy computations that limit widespread adoption, necessitating efficient quantization methods.

Method: Uses rotation-scaling channel balancing for activation distribution, Quantization-Distillation LoRA for joint optimization, and adaptive bit-width allocation formulated as integer programming combining quantization error and perceptual metrics.

Result: Extensive experiments show QuantFace effectively works under 6-bit and 4-bit quantization, achieving significant advantages over recent leading low-bit quantization methods for face restoration.

Conclusion: QuantFace provides an effective quantization framework that enables efficient face restoration with minimal performance degradation, making diffusion models more practical for real-world deployment.

Abstract: Diffusion models have been achieving remarkable performance in face restoration. However, the heavy computations hamper the widespread adoption of these models. In this work, we propose QuantFace, a novel low-bit quantization framework for face restoration models, where the full-precision (i.e., 32-bit) weights and activations are quantized to 4~6-bit. We first analyze the data distribution within activations and find that it is highly variant. To preserve the original data information, we employ rotation-scaling channel balancing. Furthermore, we propose Quantization-Distillation Low-Rank Adaptation (QD-LoRA), which jointly optimizes for quantization and distillation performance. Finally, we propose an adaptive bit-width allocation strategy. We formulate such a strategy as an integer programming problem that combines quantization error and perceptual metrics to find a satisfactory resource allocation. Extensive experiments on the synthetic and real-world datasets demonstrate the effectiveness of QuantFace under 6-bit and 4-bit. QuantFace achieves significant advantages over recent leading low-bit quantization methods for face restoration. The code is available at https://github.com/jiatongli2024/QuantFace.
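
To give a feel for what 4-bit versus 6-bit precision means, here is a minimal sketch of plain uniform min-max quantization. This is a generic baseline chosen for illustration, not QuantFace's rotation-scaling channel balancing, QD-LoRA, or bit-width allocation; the sample values are arbitrary.

```python
def quantize_dequantize(values, num_bits):
    """Map values onto 2**num_bits evenly spaced levels between min and max,
    then map them back to floats so the rounding error can be inspected."""
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant input needs no quantization
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]

def max_abs_error(values, num_bits):
    """Largest absolute difference between original and quantized values."""
    restored = quantize_dequantize(values, num_bits)
    return max(abs(a - b) for a, b in zip(values, restored))

values = [-1.0, -0.35, 0.0, 0.2, 0.77, 1.0]
err4 = max_abs_error(values, 4)
err6 = max_abs_error(values, 6)
print(err4, err6)
```

The worst-case rounding error is bounded by half a quantization step, so it shrinks roughly fourfold when moving from 4 to 6 bits over the same value range.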

[216] VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

Main category: cs.CV

TL;DR: VLA-Pruner is a token pruning method for Vision-Language-Action models that addresses the limitations of existing VLM-specific pruning methods by incorporating dual-level importance criteria for both semantic understanding and action execution.

Details

Motivation: Existing token pruning methods for VLMs focus only on semantic salience, overlooking VLA's dual-system nature of high-level semantic understanding and low-level action execution, which leads to degraded VLA performance by discarding critical action information.

Method: VLA-Pruner uses dual-level importance criteria: vision-language prefill attention for semantic relevance and action decode attention (estimated via temporal smoothing) for action-level importance. It employs a dual-level token selection strategy to adaptively preserve visual tokens for both semantic understanding and action execution.

Result: VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks, demonstrating effective acceleration while maintaining performance.

Conclusion: VLA-Pruner successfully bridges the gap in VLA token pruning by aligning with the dual-system nature of VLA models and exploiting temporal continuity, providing an efficient plug-and-play solution for real-time VLA deployment.

Abstract: Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA’s intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
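
The dual-level criterion can be sketched roughly as follows. The exponential moving average used for temporal smoothing, the half-and-half budget split, and all score values are assumptions made for illustration; the abstract does not specify the actual estimator or selection rule.

```python
def ema_update(previous, current, decay=0.5):
    """Temporally smooth per-token action attention (assumed: simple EMA)."""
    return [decay * p + (1.0 - decay) * c for p, c in zip(previous, current)]

def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def select_tokens(semantic_scores, action_scores, budget):
    """Keep half the budget by semantic score, then fill the rest with the
    highest action-scoring tokens that are not already kept."""
    keep = set(top_k_indices(semantic_scores, budget // 2))
    for i in top_k_indices(action_scores, len(action_scores)):
        if len(keep) >= budget:
            break
        keep.add(i)
    return sorted(keep)

# Hypothetical scores for six visual tokens.
semantic = [0.9, 0.1, 0.2, 0.05, 0.8, 0.3]
previous_action = [0.0, 0.7, 0.0, 0.6, 0.1, 0.0]
current_action = [0.0, 0.9, 0.1, 0.5, 0.0, 0.0]

smoothed_action = ema_update(previous_action, current_action)
kept = select_tokens(semantic, smoothed_action, budget=4)
semantic_only = sorted(top_k_indices(semantic, 4))
print(kept, semantic_only)
```

In this toy example tokens 1 and 3 matter for action but score low semantically, so a purely semantic ranking drops them while the dual-level selection retains them.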

[217] ORV: 4D Occupancy-centric Robot Video Generation

Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Zheng Zhu, Xin Jin, Hang Zhao, Hao Zhao

Main category: cs.CV

TL;DR: ORV introduces a 4D occupancy-centric framework for robot video generation that bridges the gap between sparse control inputs and dense pixel outputs, achieving superior video quality and controllability for embodied AI applications.

Details

Motivation: Current embodied intelligence suffers from data scarcity and conventional simulators lack visual realism. Action-conditioned video generation methods have limitations in fidelity, temporal consistency, control alignment, and are often constrained to single-view settings.

Method: ORV couples action priors with occupancy-derived visual priors through Action-Expert AdaLN modulation to align chunked 7-DoF actions with video latents, and injects 2D renderings of 4D semantic occupancy as soft guidance. The framework also includes ORV-Data, a large-scale 4D semantic occupancy dataset for robot manipulation.

Result: ORV achieves 18.8% lower FVD than state-of-the-art, +3.5% success rate on visual planning, and +6.4% success rate on policy learning across BridgeV2, DROID, and RT-1 benchmarks. It also supports multiview consistent synthesis and enables simulation-to-real transfer despite domain gaps.

Conclusion: ORV provides a powerful data engine for embodied AI by addressing the representational gap between sparse controls and dense visual outputs through occupancy-centric video generation, significantly improving video quality, controllability, and downstream task performance.

Abstract: Recent embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to single-view settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. Thus, we introduce ORV, a 4D occupancy-centric framework for robot video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. Meanwhile, a central obstacle is the lack of occupancy data for embodied scenarios; we therefore curate ORV-Data, a large-scale, high-quality 4D semantic occupancy dataset of robot manipulation. Across BridgeV2, DROID, and RT-1, ORV improves video generation quality and controllability, achieving 18.8% lower FVD than the state of the art, +3.5% success rate on visual planning, and +6.4% success rate on policy learning. Beyond single-view generation, ORV natively supports multiview-consistent synthesis and enables simulation-to-real transfer despite significant domain gaps. Code, models, and data are at: https://orangesodahub.github.io/ORV

[218] Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution

Jaime Álvarez Urueña, David Camacho, Javier Huertas Tato

Main category: cs.CV

TL;DR: A two-stage framework for synthetic image detection using contrastive learning and k-NN classification achieves 91.3% accuracy with minimal training data, addressing generalization challenges in rapidly evolving generative AI.

Details

Motivation: To address the computational infeasibility of traditional detection approaches that require periodic retraining due to the accelerated release of new generative models, and to ensure robust detection of synthetic images in evolving AI landscapes.

Method: Two-stage framework: 1) Vision model trained via supervised contrastive learning on subset of generators to extract discriminative embeddings, 2) k-NN classifier in few-shot learning paradigm using limited samples from unseen generators.

Result: Achieves 91.3% average detection accuracy with only 150 images per class, a 5.2 percentage point improvement over existing methods. For source attribution: 14.70% AUC and 4.27% OSCR improvements in open set classification.

Conclusion: The framework enables robust, scalable forensic attribution systems that adapt to evolving generative AI without exhaustive retraining, marking significant advancement in synthetic image detection capabilities.

Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR respectively in an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
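
The second stage can be pictured with a minimal k-NN sketch over precomputed embeddings. The 2-D embeddings, the Euclidean distance, and k=3 are assumptions for illustration; in the paper the embeddings come from the contrastively trained vision model and the support set holds 150 images per class.

```python
import math
from collections import Counter

def knn_predict(query, support_embeddings, support_labels, k=3):
    """Classify a query embedding by majority vote among its k nearest
    support embeddings (Euclidean distance)."""
    order = sorted(
        range(len(support_embeddings)),
        key=lambda i: math.dist(query, support_embeddings[i]),
    )
    votes = Counter(support_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy few-shot support set: three embeddings per class.
support_embeddings = [
    (0.9, 0.1), (1.0, 0.0), (0.8, 0.2),   # real images
    (0.1, 0.9), (0.0, 1.0), (0.2, 0.8),   # images from an unseen generator
]
support_labels = ["real"] * 3 + ["synthetic"] * 3

pred_a = knn_predict((0.15, 0.85), support_embeddings, support_labels)
pred_b = knn_predict((0.95, 0.05), support_embeddings, support_labels)
print(pred_a, pred_b)
```

Adapting to a new generator then only requires adding its few-shot embeddings to the support set, with no retraining of the embedding model.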

[219] Open-Set Domain Generalization through Spectral-Spatial Uncertainty Disentanglement for Hyperspectral Image Classification

Amirreza Khoshbakht, Erchan Aptoula

Main category: cs.CV

TL;DR: Proposes an open-set domain generalization framework for hyperspectral image classification using spectral-spatial uncertainty disentanglement and evidential deep learning to handle unknown classes and domain shifts without target data.

Details

Motivation: To address the dual challenge of recognizing unknown classes while generalizing across unseen domains in hyperspectral image classification, without requiring target domain data during training.

Method: Uses Spectral-Spatial Uncertainty Disentanglement mechanism with evidential deep learning to handle domain shifts in spectral, spatial, and combined feature pathways, adaptively selecting the most reliable pathway per sample. Integrates frequency-domain feature extraction, dual-channel residual networks, and uncertainty quantification.

Result: Achieves performance comparable to state-of-the-art domain adaptation methods on three cross-scene hyperspectral datasets, while maintaining high unknown-class rejection and known-class accuracy levels.

Conclusion: The proposed framework effectively addresses open-set domain generalization for hyperspectral image classification, demonstrating strong performance without requiring target domain data during training.

Abstract: Open-set domain generalization (OSDG) tackles the dual challenge of recognizing unknown classes while simultaneously striving to generalize across unseen domains without using target data during training. In this article, an OSDG framework for hyperspectral image classification is proposed, centered on a new Spectral-Spatial Uncertainty Disentanglement mechanism. It has been designed to address the domain shift influencing the spectral, spatial, and combined feature extraction pathways using evidential deep learning, after which the most reliable pathway for each sample is adaptively selected. The proposed framework is further integrated with frequency-domain feature extraction for domain-invariant representation learning, dual-channel residual networks for spectral-spatial feature extraction, and evidential deep learning based uncertainty quantification. Experiments conducted on three cross-scene hyperspectral datasets show that performance comparable to state-of-the-art domain adaptation methods can be achieved despite no access to target data, while high unknown-class rejection and known-class accuracy levels are maintained. The implementation will be available at github.com/amir-khb/UGOSDG upon acceptance.
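
A rough sketch of per-sample pathway selection follows. It assumes the common evidential formulation in which non-negative class evidence defines Dirichlet parameters (evidence + 1) and uncertainty is the number of classes divided by the Dirichlet strength, and it assumes that "most reliable" means lowest uncertainty. The evidence values are invented, and the paper's exact rule may differ.

```python
def dirichlet_uncertainty(evidence):
    """Uncertainty mass u = K / S, where S is the sum of Dirichlet
    parameters alpha_k = evidence_k + 1 over K classes."""
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength

def select_pathway(pathway_evidence):
    """Pick the pathway whose prediction has the lowest uncertainty
    (assumed selection rule)."""
    return min(pathway_evidence, key=lambda name: dirichlet_uncertainty(pathway_evidence[name]))

# Hypothetical evidence for one sample over three known classes.
pathway_evidence = {
    "spectral": [9.0, 0.0, 0.0],   # strong evidence for class 0
    "spatial": [1.0, 1.0, 1.0],    # weak, spread-out evidence
    "combined": [0.0, 0.0, 0.0],   # no evidence at all
}

chosen = select_pathway(pathway_evidence)
print(chosen, dirichlet_uncertainty(pathway_evidence[chosen]))
```

A sample for which every pathway stays near maximal uncertainty is a natural candidate for rejection as an unknown class.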

[220] Loss-Oriented Ranking for Automated Visual Prompting in LVLMs

Yuan Zhang, Chun-Kai Fan, Tao Huang, Ming Lu, Sicheng Yu, Junwen Pan, Kuan Cheng, Qi She, Shanghang Zhang

Main category: cs.CV

TL;DR: AutoV automatically selects optimal visual prompts for LVLMs instead of using manually designed ones, improving performance across various image understanding tasks.

Details

Motivation: Current methods use heuristic visual prompts that are challenging to design manually and often lead to sub-optimal performance, failing to explore different visual prompt benefits.

Method: AutoV learns to automatically select optimal visual prompts from candidates based on textual queries and input images, using an automatic data collection pipeline that ranks prompts by prediction losses from a pre-trained LVLM.

Result: AutoV enhances LVLM performance significantly: LLaVA-OV gains 10.2% accuracy on VizWiz and Qwen2.5-VL improves by 3.8% on MMMU.

Conclusion: AutoV demonstrates potential as an optimal visual prompting method that automatically selects effective visual prompts for improved LVLM performance.

Abstract: Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV that learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we develop an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experiments indicate that AutoV enhances the performance of various LVLMs across multiple image understanding tasks. For instance, LLaVA-OV with AutoV achieves a 10.2% accuracy gain on VizWiz, and AutoV boosts Qwen2.5-VL by 3.8% on MMMU, highlighting its potential as an optimal visual prompting method.
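
The labeling step, ranking candidate visual prompts by the LVLM's prediction loss, amounts to a sort. The sketch below assumes the losses have already been computed; the prompt names and loss values are hypothetical.

```python
def rank_prompts_by_loss(losses):
    """Return prompt names ordered from lowest to highest prediction loss;
    the ordering serves as the supervision signal for the selector."""
    return sorted(losses, key=losses.get)

# Hypothetical per-prompt losses for one (image, query) pair.
losses = {
    "original_image": 1.40,
    "attention_heatmap_overlay": 0.95,
    "blurred_background": 1.10,
}

ranking = rank_prompts_by_loss(losses)
print(ranking)
```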

Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie

Main category: cs.CV

TL;DR: UniTime is a universal video temporal grounding model that localizes moments in videos using natural language queries, handling diverse video types and lengths through MLLMs with timestamp token integration and adaptive frame scaling.

Details

Motivation: Existing video temporal grounding methods are limited to specific domains or durations, lacking universal applicability across diverse video types and complex language queries.

Method: Leverages generative Multi-modal LLMs with timestamp tokens interleaved with video tokens for precise temporal outputs, and uses adaptive frame scaling to handle videos of different lengths.

Result: Outperforms state-of-the-art methods in zero-shot and finetuned settings across five benchmarks, and significantly improves VideoQA accuracy when used as a moment retriever.

Conclusion: UniTime provides a robust and universal solution for video temporal grounding that effectively handles diverse video content and complex queries, demonstrating strong performance and practical value for video understanding tasks.

Abstract: This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. (ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.
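
Contribution (i), interleaving timestamp tokens with video tokens, can be pictured with a small sketch. The token format, the placement of each timestamp before its frame's tokens, and the placeholder video token names are all assumptions; the abstract does not describe the actual tokenization.

```python
def interleave_timestamps(frame_tokens, timestamps):
    """Build one sequence in which each frame's tokens are preceded by a
    timestamp token (hypothetical format '<t=...s>')."""
    sequence = []
    for seconds, tokens in zip(timestamps, frame_tokens):
        sequence.append(f"<t={seconds:.1f}s>")
        sequence.extend(tokens)
    return sequence

# Two sampled frames with placeholder visual tokens.
frame_tokens = [["v0_a", "v0_b"], ["v1_a"]]
timestamps = [0.0, 2.0]

sequence = interleave_timestamps(frame_tokens, timestamps)
print(sequence)
```

With timestamps present in the input, the model can answer a grounding query by emitting timestamp text directly rather than regressing frame indices.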

[222] A Training-Free Style-Personalization via SVD-Based Feature Decomposition

Kyoungmin Lee, Jihun Park, Jongmin Gim, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im

Main category: cs.CV

TL;DR: Training-free framework for style-personalized image generation using scale-wise autoregressive model with two lightweight control modules for style modulation and structural consistency.

Details

Motivation: To enable style-personalized image generation guided by a single reference style while preserving semantic consistency and mitigating content leakage, without requiring additional training.

Method: Uses scale-wise autoregressive model with Principal Feature Blending (SVD-based feature reconstruction for style modulation) and Structural Attention Correction (content-guided attention correction for structural consistency).

Result: Achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, with faster inference and greater deployment flexibility.

Conclusion: The proposed training-free framework effectively generates stylized images while maintaining semantic consistency, offering practical advantages over fine-tuning approaches.

Abstract: We present a training-free framework for style-personalized image generation that operates during inference using a scale-wise autoregressive model. Our method generates a stylized image guided by a single reference style while preserving semantic consistency and mitigating content leakage. Through a detailed step-wise analysis of the generation process, we identify a pivotal step where the dominant singular values of the internal feature encode style-related components. Building upon this insight, we introduce two lightweight control modules: Principal Feature Blending, which enables precise modulation of style through SVD-based feature reconstruction, and Structural Attention Correction, which stabilizes structural consistency by leveraging content-guided attention correction across fine stages. Without any additional training, extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.
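
The idea behind SVD-based feature blending can be sketched with NumPy. This is only an assumed reading of Principal Feature Blending: it keeps the content feature's singular vectors and interpolates its top-k singular values toward those of a style feature. The feature shapes, k, and the blending weight are illustrative choices, not the paper's settings.

```python
import numpy as np

def blend_principal_components(content, style, k=1, alpha=1.0):
    """Interpolate the k largest singular values of `content` toward those
    of `style`, then reconstruct with the content's singular vectors."""
    u_c, s_c, vt_c = np.linalg.svd(content, full_matrices=False)
    s_style = np.linalg.svd(style, compute_uv=False)
    s_new = s_c.copy()
    s_new[:k] = (1.0 - alpha) * s_c[:k] + alpha * s_style[:k]
    # Scale each left singular vector (column) by its new singular value.
    return (u_c * s_new) @ vt_c

rng = np.random.default_rng(0)
content = rng.normal(size=(6, 4))   # hypothetical content feature map
style = rng.normal(size=(6, 4))     # hypothetical style feature map

unchanged = blend_principal_components(content, style, k=2, alpha=0.0)
blended = blend_principal_components(content, style, k=1, alpha=1.0)
print(np.allclose(unchanged, content))
```

Setting alpha to 0 reproduces the content feature, so the weight acts as a continuous style-strength control without any training.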

[223] AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

Yogesh Kulkarni, Pooyan Fazli

Main category: cs.CV

TL;DR: AVATAR is a multimodal reasoning framework that addresses limitations of previous methods like GRPO through off-policy training and temporal advantage shaping, achieving significant performance gains and 5x sample efficiency.

Details

Motivation: To overcome three key limitations in existing multimodal video reasoning methods: data inefficiency from on-policy design, vanishing advantage problem when rewards are similar within groups, and uniform credit assignment that fails to emphasize critical reasoning steps.

Method: AVATAR uses an off-policy training architecture for better sample efficiency and reward diversity, plus Temporal Advantage Shaping (TAS) for improved credit assignment that upweights key reasoning phases during learning.

Result: Outperforms Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while achieving 5x sample efficiency and requiring 80% fewer generated completions to reach target performance.

Conclusion: AVATAR effectively addresses the core limitations of previous multimodal reasoning methods through its off-policy architecture and temporal advantage shaping, demonstrating strong performance improvements and significant efficiency gains across multiple benchmarks.

Abstract: Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating 5x sample efficiency, requiring 80% fewer generated completions to reach target performance.
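
The vanishing advantage problem described above can be illustrated with a small sketch. It assumes the commonly used group-relative normalization (reward minus group mean, divided by group standard deviation); the function name and the `eps` constant are illustrative choices, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled completions."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Subtract the group mean and scale by the group standard deviation
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group with diverse rewards yields a usable learning signal
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A group with identical rewards yields all-zero advantages,
# so the policy gradient for this group vanishes
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Reusing past experiences, as the off-policy design does, is one way to make it more likely that a group contains differing rewards.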

[224] Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

Main category: cs.CV

TL;DR: Proposes Composed Object Retrieval (COR), a new task for retrieving and segmenting specific objects using composed expressions of reference objects and text, with a large benchmark dataset COR127K and a unified model CORE that outperforms existing methods.

Details

Motivation: Current Composed Image Retrieval methods are limited to image-level matching and cannot localize specific objects, creating a need for more fine-grained object-level retrieval capabilities.

Method: Developed CORE, an end-to-end model with reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning, trained on the COR127K benchmark containing 127,166 retrieval triplets across 408 categories.

Result: CORE significantly outperforms existing models in both base and novel categories, establishing an effective baseline for the COR task.

Conclusion: The COR task opens new directions for fine-grained multi-modal retrieval research, with the dataset and model being publicly released to facilitate further development.

Abstract: Retrieving fine-grained visual content based on user intent remains a challenge in multi-modal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a brand-new task that goes beyond image-level retrieval to achieve object-level precision, allowing the retrieval and segmentation of target objects based on composed expressions combining reference objects and retrieval texts. COR presents significant challenges in retrieval flexibility, which requires systems to identify arbitrary objects satisfying composed expressions while avoiding semantically similar but irrelevant negative objects within the same scene. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. Extensive experiments demonstrate that CORE significantly outperforms existing models in both base and novel categories, establishing a simple and effective baseline for this challenging task while opening new directions for fine-grained multi-modal retrieval research. We will publicly release both the dataset and the model at https://github.com/wangtong627/COR.

[225] Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model

Yifan Jiang, Ahmad Shariftabrizi, Venkata SK. Manem

Main category: cs.CV

TL;DR: Lung-DDPM+ is an improved denoising diffusion model for generating high-quality lung CT images with nodules, achieving 14x faster sampling, 8x fewer FLOPs, and 6.8x lower GPU memory while maintaining comparable quality to state-of-the-art methods.

Details

Motivation: Existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, limiting their clinical applicability.

Method: A denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, focusing on lesion areas while balancing sampling efficiency and quality.

Result: Achieved 8x fewer FLOPs, 6.8x lower GPU memory consumption, and 14x faster sampling compared to Lung-DDPM, while maintaining comparable sample quality in two downstream segmentation tasks. A Visual Turing Test conducted by an experienced radiologist supports the quality and fidelity of the synthetic samples.

Conclusion: Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, showing potential for broader applications in medical imaging like general tumor synthesis and lesion generation.

Abstract: Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8x fewer FLOPs (floating point operations), 6.8x lower GPU memory consumption, and 14x faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test by an experienced radiologist, showing the advanced quality and fidelity of synthetic samples generated by the proposed method. These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM-PLUS.

[226] A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection

Qifeng Liu, Dawei Zhao, Yabo Dong, Linzhi Shang, Liang Xiao, Juan Wang, Kunkong Zhao, Dongming Lu, Qi Zhu

Main category: cs.CV

TL;DR: Proposes Voxel Diffusion Module (VDM) to enhance voxel-level representation in point cloud object detection, addressing limitations of Transformer/SSM models by enabling better spatial diffusion like CNNs.

Details

Motivation: Current Transformer/SSM-based point cloud detectors have limited spatial diffusion capability due to serialized processing, which affects detection accuracy. CNN-based architectures show better spatial context diffusion.

Method: VDM uses sparse 3D convolutions, submanifold sparse convolutions, and residual connections to diffuse foreground voxel features and aggregate spatial information, with output downsampled to 1/4 resolution for efficiency.

Result: VDM consistently improves detection accuracy across multiple datasets. VDM-SSMs achieve SOTA: 74.7 mAPH on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE.

Conclusion: VDM effectively enhances voxel-level representation and spatial diffusion, is compatible with mainstream Transformer/SSM models, and achieves state-of-the-art performance across multiple benchmarks.

Abstract: Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxel-based representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This limitation significantly affects detection accuracy. Inspired by CNN-based object detection architectures, we propose a novel Voxel Diffusion Module (VDM) to enhance voxel-level representation and diffusion in point cloud data. VDM is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. To ensure computational efficiency, the output feature maps are downsampled to one-fourth of the original input resolution. VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxel-wise feature representation. The enhanced voxel features produced by VDM can be seamlessly integrated into mainstream Transformer- or SSM-based detection models for accurate object classification and localization, highlighting the generalizability of our method. We evaluate VDM on several benchmark datasets by embedding it into both Transformer-based and SSM-based models. Experimental results show that our approach consistently improves detection accuracy over baseline models. Specifically, VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new state-of-the-art performance across all datasets. Our code will be made publicly available.

[227] Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction

Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka

Main category: cs.CV

TL;DR: Deep learning models can predict future brain MRI scans from baseline images with high fidelity, enabling individualized prognosis for neurodegenerative diseases like Alzheimer’s.

Details

Motivation: To forecast a participant's entire brain MRI several years into the future, intrinsically modeling complex neurodegenerative patterns for Alzheimer's disease research.

Method: Implemented and evaluated five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, ODE-UNet) on two longitudinal cohorts (ADNI and AIBL) for MRI image-to-image prediction.

Result: Best performing models achieved high-fidelity predictions, with all models generalizing well to an independent external dataset and demonstrating robust cross-cohort performance.

Conclusion: Deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis in neurodegenerative diseases.

Abstract: Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer’s disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant’s entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.

[228] EatGAN: An Edge-Attention Guided Generative Adversarial Network for Single Image Super-Resolution

Penghao Rao, Tieyong Zeng

Main category: cs.CV

TL;DR: EatGAN is a GAN-based single-image super-resolution model that uses edge priors both explicitly and implicitly to improve high-frequency detail reconstruction and training stability.

Details

Motivation: GAN-based SISR models struggle with reconstructing realistic high-frequency details and achieving stable training, despite their excellent perceptual quality.

Method: Proposes Normalized Edge Attention mechanism, edge-guided hybrid residual blocks, and composite generator objective combining pixel, perceptual, edge-gradient, and adversarial losses.

Result: Achieves state-of-the-art performance on both distortion-oriented and perception-oriented benchmarks, with 40.87 dB and 0.073 LPIPS on Manga109.

Conclusion: Reframing image priors from passive guidance into controllable modulation parameters provides a practical path toward trustworthy, high-fidelity super-resolution.

Abstract: Single-image super-resolution (SISR) is an important task in image processing, aiming to enhance the resolution of imaging systems. Recently, SISR has made a significant leap and achieved promising results with deep learning. GAN-based models stand out among deep learning models because of their excellent perceptual quality. However, they struggle to reconstruct realistic high-frequency details and to train stably. To address these issues, we introduce an Edge-Attention guided Generative Adversarial Network (EatGAN), the first GAN-based SISR model that leverages edge priors both explicitly and implicitly inside the generator. EatGAN comprises (i) a Normalized Edge Attention (NEA) mechanism based on channel-affine and spatial gating, which transforms the edge prior into lightweight, learnable modulation parameters; (ii) an edge-guided hybrid residual block, which injects and fuses these parameters multiple times and progressively enforces structural consistency across scales; and (iii) a composite generator objective combining pixel, perceptual, edge-gradient, and adversarial terms. Experiments show consistent state-of-the-art results across distortion-oriented and perception-oriented benchmarks. Notably, our model achieves 40.87 dB and 0.073 (LPIPS) on Manga109, which indicates that reframing image priors from passive guidance into a controllable modulation primitive for generators can chart a practical path toward trustworthy, high-fidelity super-resolution.

[229] Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

Gen Li, Bo Zhao, Jianfei Yang, Laura Sevilla-Lara

Main category: cs.CV

TL;DR: Mask2IV is a two-stage framework for generating interaction-centric videos without requiring dense mask annotations, using predicted motion trajectories for actors and objects instead.

Details

Motivation: Existing methods struggle with complex dynamic interactions in videos, and obtaining precise mask annotations is challenging for real-world applications.

Method: Decoupled two-stage pipeline: first predicts motion trajectories for actor and object, then generates video conditioned on these trajectories. Supports control via target object specification, action descriptions, or spatial cues.

Result: Achieves superior visual realism and controllability compared to existing baselines across human-object interaction and robotic manipulation scenarios.

Conclusion: Mask2IV provides an effective solution for interaction-centric video generation without dense mask inputs, enabling flexible manipulation of interaction processes.

Abstract: Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.

[230] LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation

Yushi Huang, Xingtong Ge, Ruihao Gong, Chengtao Lv, Jun Zhang

Main category: cs.CV

TL;DR: LinVideo is an efficient post-training framework that replaces self-attention with linear attention in video diffusion models to reduce computational costs while maintaining performance.

Details

Motivation: Video diffusion models have high-quality synthesis but suffer from quadratic computational complexity due to self-attention, making them expensive for long sequences.

Method: Uses selective transfer to automatically identify which layers to convert to linear attention, and introduces anytime distribution matching (ADM) to align sample distributions across timesteps during transfer.

Result: Achieves 1.25-2.00x speedup while preserving generation quality, with 4-step distilled model achieving 15.92x latency reduction with minimal quality drop.

Conclusion: LinVideo provides an effective data-free solution for accelerating video diffusion models without sacrificing performance, making them more practical for real-world applications.

Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model’s performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
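
To make the quadratic-versus-linear cost contrast concrete, here is a minimal sketch of generic kernelized linear attention. It is not necessarily the variant LinVideo uses; the `elu(x) + 1` feature map and all function names are assumptions chosen for illustration. The linear form never builds the n x n score matrix, yet matches the quadratic reference computed with the same feature map.

```python
import numpy as np

def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Kernelized attention in O(n) time with respect to sequence length n."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v           # (d, d_v) summary of keys and values, built once
    z = k.sum(axis=0)      # (d,) normalizer summary
    return (q @ kv) / (q @ z)[:, None]

def quadratic_reference(q, k, v):
    """Same kernel, but materializing the full (n, n) score matrix."""
    q, k = feature_map(q), feature_map(k)
    scores = q @ k.T
    return (scores / scores.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
n, d = 64, 8
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

fast = linear_attention(q, k, v)
slow = quadratic_reference(q, k, v)
```

The two outputs agree because matrix multiplication is associative: (QK^T)V equals Q(K^T V). Note that this equivalence holds only for the kernelized form, not for softmax attention, which is why replacing softmax layers requires the kind of post-training transfer the paper describes.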

[231] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

Main category: cs.CV

TL;DR: PaddleOCR-VL is a state-of-the-art, resource-efficient document parsing model that supports 109 languages and excels at recognizing complex elements like text, tables, formulas, and charts with minimal resource consumption.

Details

Motivation: To develop a compact yet powerful vision-language model for document parsing that can handle multiple languages and complex document elements while maintaining efficiency for practical deployment.

Method: Uses PaddleOCR-VL-0.9B, which integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model for accurate element recognition.

Result: Achieves SOTA performance on both page-level document parsing and element-level recognition, significantly outperforming existing solutions while delivering fast inference speeds.

Conclusion: PaddleOCR-VL demonstrates strong competitiveness against top-tier VLMs and is highly suitable for practical deployment in real-world scenarios due to its efficiency and performance.

Abstract: In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios. Code is available at https://github.com/PaddlePaddle/PaddleOCR.

[232] Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations

Kiran Shahi, Anup Bagale

Main category: cs.CV

TL;DR: Proposes weakly supervised deep learning for pneumonia classification and localization using Grad-CAM with image-level labels instead of costly pixel-level annotations.

Details

Motivation: Localizing pneumonia-affected regions in chest X-rays typically requires pixel-level annotations, which are costly and time-consuming to obtain.

Method: Uses Gradient-weighted Class Activation Mapping (Grad-CAM) with image-level labels, and evaluates seven pre-trained models, including a Vision Transformer, under identical training conditions with focal loss and patient-wise splits.

Result: All models achieved high accuracy (96-98%), with ResNet-18 and EfficientNet-B0 showing best performance, and Grad-CAM heatmaps confirmed focus on clinically relevant lung regions.

Conclusion: Weakly supervised explainable models enhance transparency and clinical trust in AI-assisted pneumonia screening.

Abstract: Chest X-ray imaging is commonly used to diagnose pneumonia, but accurately localizing the pneumonia-affected regions typically requires detailed pixel-level annotations, which are costly and time-consuming to obtain. To address this limitation, this study proposes a weakly supervised deep learning framework for pneumonia classification and localization using Gradient-weighted Class Activation Mapping (Grad-CAM). Instead of relying on costly pixel-level annotations, the proposed method utilizes image-level labels to generate clinically meaningful heatmaps that highlight pneumonia-affected regions. Furthermore, we evaluate seven pre-trained deep learning models including a Vision Transformer under identical training conditions, using focal loss and patient-wise splits to prevent data leakage. Experimental results suggest that all models achieved high accuracy (96-98%), with ResNet-18 and EfficientNet-B0 showing the best overall performance and MobileNet-V2 providing an efficient lightweight alternative. Grad-CAM heatmap visualizations in this study confirm that the proposed methods focus on clinically relevant lung regions, supporting the use of explainable AI for radiological diagnostics. Overall, this work highlights the potential of weakly supervised, explainable models that enhance transparency and clinical trust in AI-assisted pneumonia screening.
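
As a rough illustration of how Grad-CAM turns image-level supervision into a localization heatmap, the sketch below implements the standard Grad-CAM combination step in NumPy. It assumes the last-layer activations and the gradients of the class score with respect to them have already been extracted from a trained network; the toy arrays and the function name `grad_cam` are illustrative and not taken from the paper.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Combine (C, H, W) activations and gradients into an (H, W) heatmap."""
    # Channel weights: global-average-pool the gradients over space
    weights = gradients.mean(axis=(1, 2))                 # (C,)
    # Weighted sum of activation maps over channels
    cam = np.tensordot(weights, activations, axes=1)      # (H, W)
    # Keep only regions that push the class score up
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    # Normalize to [0, 1] when the map is not all zeros
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 fires on a 2x2 patch and supports the class,
# channel 1 fires everywhere and opposes the class
activations = np.zeros((2, 4, 4))
activations[0, 1:3, 1:3] = 1.0
activations[1] = 1.0
gradients = np.zeros((2, 4, 4))
gradients[0] = 1.0
gradients[1] = -0.5

heatmap = grad_cam(activations, gradients)
```

In practice the heatmap is upsampled to the input resolution and overlaid on the X-ray; that step is omitted here.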

[233] ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation

Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Hengyu Liu, Tingting Shen, Yadong MU

Main category: cs.CV

TL;DR: ID-Crafter is a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence through hierarchical attention, VLM guidance, and reinforcement learning.

Details

Motivation: Current video synthesis methods struggle with integrating identity information from multiple subjects, leading to semantic conflicts and poor identity preservation, which limits controllability and applicability.

Method: Three key components: hierarchical identity-preserving attention (intra-subject, inter-subject, cross-modal), a semantic understanding module with a pretrained VLM for fine-grained guidance, and online reinforcement learning for refinement. A new dataset is also constructed for training and evaluation.

Result: ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality.

Conclusion: The framework effectively addresses multi-subject identity integration challenges and demonstrates superior performance in preserving identities and maintaining semantic coherence in generated videos.

Abstract: Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality.

[234] Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi

Main category: cs.CV

TL;DR: DANCE is a video action recognition framework that disentangles motion dynamics from spatial context using concept-based explanations, improving interpretability while maintaining competitive performance.

Details

Motivation: Existing video action recognition explanation methods produce entangled explanations that don't clearly separate motion from spatial context, and language-based approaches struggle to explain tacit motions that are hard to verbalize.

Method: Proposes DANCE framework with ante-hoc concept bottleneck design using three disentangled concept types: motion dynamics (human pose sequences), objects, and scenes. Uses large language model to automatically extract object and scene concepts.

Result: Experiments on KTH, Penn Action, HAA500, and UCF-101 datasets show DANCE significantly improves explanation clarity with competitive performance. User study validates superior interpretability, and framework is beneficial for model debugging, editing, and failure analysis.

Conclusion: DANCE successfully addresses the disentanglement challenge in video action recognition explanations by separating motion dynamics from spatial context through concept-based design, providing clearer interpretability while maintaining performance.

Abstract: Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature – intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets – KTH, Penn Action, HAA500, and UCF-101 – demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.

[235] One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

Aleksandr Razin, Danil Kazantsev, Ilya Makarov

Main category: cs.CV

TL;DR: LUA is a lightweight latent-space upscaler that enables high-resolution image generation in diffusion models without additional diffusion steps or post-hoc super-resolution, achieving comparable quality with nearly 3x lower decoding and upscaling time.

Details

Motivation: Diffusion models scale poorly beyond their training resolution: direct high-resolution sampling is slow and expensive, while post-hoc super-resolution adds artifacts and latency because it runs after decoding.

Method: A lightweight Swin-style backbone with scale-specific pixel-shuffle heads performs super-resolution directly on the generator’s latent code before VAE decoding, requiring no modifications to the base model.

Result: LUA achieves perceptual quality comparable to pixel-space SR with nearly 3x lower decoding and upscaling time (adding only 0.42 s for 1024 px generation from 512 px, versus 1.87 s for pixel-space SR with the same SwinIR architecture), and generalizes across the latent spaces of different VAEs without retraining from scratch.

Conclusion: LUA provides a practical and efficient path to scalable high-fidelity image synthesis in diffusion pipelines, closely matching native high-resolution generation quality.

Abstract: Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator’s latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
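
The pixel-shuffle heads mentioned above rely on a standard rearrangement that trades channels for spatial resolution. The sketch below shows only that rearrangement on nested lists, using the common (C*r*r, H, W) to (C, H*r, W*r) channel-ordering convention; it is not LUA's code, the function name `pixel_shuffle` is chosen here for illustration, and the Swin-style backbone that would produce the input channels is omitted.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):          # row offset inside each r x r output block
            for j in range(r):      # column offset inside each r x r output block
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Four 1x1 channels become a single 2x2 channel (a 2x upscale).
latent = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
upscaled = pixel_shuffle(latent, 2)
print(upscaled)  # [[[1.0, 2.0], [3.0, 4.0]]]
```

Because the head only rearranges values, supporting both 2x and 4x factors would presumably amount to attaching one head per factor to the shared backbone, which is consistent with the "scale-specific" wording in the abstract.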

[236] Draft and Refine with Visual Experts

Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani

Main category: cs.CV

TL;DR: Draft and Refine (DnR) is an agent framework that uses a question-conditioned utilization metric to measure and improve LVLMs’ reliance on visual evidence, reducing hallucinations by refining responses with external visual expert feedback.

Details

Motivation: LVLMs often produce ungrounded or hallucinated responses by relying too heavily on linguistic priors rather than visual evidence, highlighting the need for a quantitative measure of visual information utilization during reasoning.

Method: Proposes DnR framework with a question-conditioned utilization metric that constructs query-conditioned relevance maps and measures dependence through relevance-guided probabilistic masking. The agent refines initial drafts using targeted feedback from external visual experts whose outputs are rendered as visual cues.

Result: Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating improved visual grounding without retraining or architectural changes.

Conclusion: Measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems, enabling stronger visual grounding in LVLMs.

Abstract: While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model’s reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert’s output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems. Code is available at https://github.com/EavnJeong/Draft-and-Refine-with-Visual-Experts.

[237] Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing

Cong Cao, Yujie Xu, Xiaodong Xu

Main category: cs.CV

TL;DR: A novel few-shot style editing framework using Mixture-of-Experts LoRA with style-specific and style-shared routing to fine-tune image editing models for new styles with limited data.

Details

Motivation: General image editing models often fail with new styles, and there's a need to effectively fine-tune them using only limited paired data for different styles.

Method: Proposes MoE LoRA with style-specific and style-shared routing, automatic rank determination via metric-guided approach, optimal LoRA insertion in DiT model, and integration of adversarial learning and flow matching.

Result: Outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters on a benchmark dataset with five distinct styles.

Conclusion: The proposed framework effectively adapts general image editing models to new styles with limited data through parameter-efficient multi-style fine-tuning.

Abstract: In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.

[238] DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta

Main category: cs.CV

TL;DR: DocSLM is an efficient small vision-language model for long-document understanding on edge devices, using hierarchical multimodal compression and streaming abstention to reduce memory and computational costs while maintaining performance.

Details

Motivation: Large Vision-Language Models have strong multimodal reasoning but high memory footprint, making them impractical for resource-constrained edge devices.

Method: Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information into fixed-length sequences, plus Streaming Abstention mechanism with entropy-based uncertainty calibration for sequential processing of long documents.

Result: Matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency across multiple long multimodal document benchmarks.

Conclusion: DocSLM enables reliable multimodal document understanding on lightweight edge devices with significantly reduced resource requirements.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code and model are available at https://github.com/Tanveer81/DocSLM.git.
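
One way to picture entropy-based filtering over document segments is sketched below. This is an illustrative assumption, not DocSLM's calibrator: each segment is assumed to yield a candidate answer together with a probability distribution over candidates, and the names `answer_with_abstention` and `max_entropy` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_with_abstention(segment_outputs, max_entropy):
    """Return the lowest-entropy answer, or None if every segment is too uncertain.

    segment_outputs: list of (answer, probs) pairs, one per document segment,
    processed in order so memory does not grow with document length.
    """
    best = None
    for answer, probs in segment_outputs:
        h = entropy(probs)
        if h > max_entropy:      # abstain on this segment
            continue
        if best is None or h < best[1]:
            best = (answer, h)
    return None if best is None else best[0]


segments = [
    ("42", [0.5, 0.5]),            # uncertain: entropy is about 0.69
    ("17", [0.9, 0.1]),            # confident: entropy is about 0.33
    ("n/a", [0.34, 0.33, 0.33]),   # near uniform: entropy is about 1.10
]
chosen = answer_with_abstention(segments, max_entropy=0.5)
print(chosen)  # 17
```

The threshold here is fixed by hand; the abstract describes a calibrated uncertainty estimate, which would replace the raw entropy comparison.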

[239] Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

Siyou Li, Huanan Wu, Juexi Shao, Yinghao Ma, Yujian Gan, Yihao Luo, Yuwei Wang, Dong Nie, Lu Wang, Wengqing Wu, Le Zhang, Massimo Poesio, Juntao Yu

Main category: cs.CV

TL;DR: QTSplus is a lightweight visual token selection module that dynamically selects important visual evidence for text queries in long videos, achieving up to 89% compression and 28% latency reduction while maintaining near-parity accuracy.

Details

Motivation: Long video understanding in multimodal LLMs faces challenges due to linearly growing vision tokens causing attention cost explosion, memory issues, and latency problems.

Method: QTSplus scores visual tokens via cross-attention, predicts instance-specific retention budget based on query complexity, and selects Top-n tokens with differentiable straight-through estimator during training and hard gate at inference. A small re-encoder preserves temporal order.

Result: Achieves up to 89% vision stream compression and 28% latency reduction on long videos. Outperforms original Qwen model by +20.5 and +5.6 points on TempCompass direction and order accuracies while maintaining near-parity accuracy overall.

Conclusion: QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence.

Abstract: Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) selecting Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. The evaluation on eight long video understanding benchmarks shows near-parity accuracy overall when compared with the original Qwen models and outperforms the original model by +20.5 and +5.6 points respectively on TempCompass direction and order accuracies. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence.
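
The inference-time path of steps (i) to (iii) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the QTSplus implementation: scoring is a single scaled dot product rather than learned cross-attention, the retention budget is passed in as a fixed `budget_ratio` instead of being predicted from the query, and the straight-through estimator used during training is left out. The function name `select_tokens` is hypothetical.

```python
import math


def select_tokens(query, tokens, budget_ratio):
    """Keep the highest-scoring visual tokens for a query (hard gate).

    Returns the kept token indices in temporal order plus all scores.
    """
    scale = math.sqrt(len(query))
    # (i) score each token against the query with a scaled dot product
    logits = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]   # numerically stable softmax
    total = sum(weights)
    scores = [w / total for w in weights]
    # (ii) turn the budget into a token count (fixed here, predicted in the paper)
    n = max(1, math.ceil(budget_ratio * len(tokens)))
    # (iii) hard Top-n gate, then restore the original temporal order
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n])
    return kept, scores


query = [1.0, 0.0]
tokens = [[0.1, 0.9], [2.0, 0.0], [0.0, 1.0], [1.5, 0.2]]
kept, scores = select_tokens(query, tokens, budget_ratio=0.5)
print(kept)  # [1, 3]
```

Sorting the kept indices before returning mirrors the abstract's emphasis on preserving temporal order after selection.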

[240] BeyondFacial: Identity-Preserving Personalized Generation Beyond Facial Close-ups

Songsong Zhang, Chuanqi Tang, Hongguang Zhang, Guijian Tang, Minglong Li, Xueqiong Li, Shaowu Yang, Yuanxi Peng, Wenjing Yang, Jing Zhao

Main category: cs.CV

TL;DR: Proposes a novel IPPG method that overcomes facial close-up limitations through identity-semantic separation, enabling better scene creation while maintaining identity fidelity.

Details

Motivation: Existing IPPG methods overemphasize facial regions, resulting in weak visual narrativity and poor semantic consistency under complex prompts due to ID feature embeddings undermining semantic expressiveness.

Method: Uses Dual-Line Inference pipeline with identity-semantic separation, Identity Adaptive Fusion strategy that defers ID-semantic fusion to noise prediction stage, and Identity Aggregation Prepending module to replace random initializations.

Result: Achieves stable and effective performance in IPPG tasks beyond facial close-ups, enabling efficient generation without manual masking or fine-tuning.

Conclusion: Method addresses over-reliance on facial close-ups, facilitates film-level character-scene creation, and provides richer personalized generation capabilities as a plug-and-play component.

Abstract: Identity-Preserving Personalized Generation (IPPG) has advanced film production and artistic creation, yet existing approaches overemphasize facial regions, resulting in outputs dominated by facial close-ups. These methods suffer from weak visual narrativity and poor semantic consistency under complex text prompts, with the core limitation rooted in identity (ID) feature embeddings undermining the semantic expressiveness of generative models. To address these issues, this paper presents an IPPG method that breaks the constraint of facial close-ups, achieving synergistic optimization of identity fidelity and scene semantic creation. Specifically, we design a Dual-Line Inference (DLI) pipeline with identity-semantic separation, resolving the representation conflict between ID and semantics inherent in traditional single-path architectures. Further, we propose an Identity Adaptive Fusion (IdAF) strategy that defers ID-semantic fusion to the noise prediction stage, integrating adaptive attention fusion and noise decision masking to avoid ID embedding interference on semantics without manual masking. Finally, an Identity Aggregation Prepending (IdAP) module is introduced to aggregate ID information and replace random initializations, further enhancing identity preservation. Experimental results validate that our method achieves stable and effective performance in IPPG tasks beyond facial close-ups, enabling efficient generation without manual masking or fine-tuning. As a plug-and-play component, it can be rapidly deployed in existing IPPG frameworks, addressing the over-reliance on facial close-ups, facilitating film-level character-scene creation, and providing richer personalized generation capabilities for related domains.

[241] SemanticStitch: Enhancing Image Coherence through Foreground-Aware Seam Carving

Ji-Ping Jin, Chen-Bin Feng, Rui Fan, Chi-Man Vong

Main category: cs.CV

TL;DR: SemanticStitch is a deep learning framework that uses semantic priors to preserve foreground object integrity in image stitching, addressing misalignments from varying capture angles and object movements.

Details

Motivation: Traditional image stitching methods fail to maintain semantic continuity of foreground objects due to varying capture angles, positional differences, and object movements, leading to visual disruptions.

Method: Deep learning-based framework incorporating semantic priors of foreground objects with a novel loss function that emphasizes semantic integrity of salient objects.

Result: Experimental results show substantial improvements over traditional techniques, with enhanced stitching quality and visual coherence.

Conclusion: SemanticStitch provides robust support for practical applications by effectively preserving foreground object integrity and improving overall stitching quality.

Abstract: Image stitching often faces challenges due to varying capture angles, positional differences, and object movements, leading to misalignments and visual discrepancies. Traditional seam carving methods neglect semantic information, causing disruptions in foreground continuity. We introduce SemanticStitch, a deep learning-based framework that incorporates semantic priors of foreground objects to preserve their integrity and enhance visual coherence. Our approach includes a novel loss function that emphasizes the semantic integrity of salient objects, significantly improving stitching quality. We also present two specialized real-world datasets to evaluate our method’s effectiveness. Experimental results demonstrate substantial improvements over traditional techniques, providing robust support for practical applications.

[242] Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes

Feng Lv, Haoxuan Feng, Zilu Zhang, Chunlong Xia, Yanfeng Li

Main category: cs.CV

TL;DR: Proposes a unified text-driven framework for traffic scene image generation and editing using controllable mask mechanism and multi-view data to address semantic richness, viewpoint diversity, and visual fidelity challenges.

Details

Motivation: Address limitations in current text-driven image generation for traffic scenes: insufficient semantic richness, limited camera viewpoints, low visual fidelity, and poor text-image alignment.

Method: Unified framework with controllable mask mechanism for generation/editing integration; uses vehicle-side and roadside multi-view data; two-stage training (conceptual learning + fine-grained fine-tuning); mask-region-weighted loss for small elements.

Result: Extensive experiments show leading performance in text-based image generation and editing within traffic scenes with enhanced geometric diversity and generation fidelity.

Conclusion: The proposed framework effectively addresses key challenges in traffic scene generation, achieving superior performance through unified architecture, multi-view data integration, and specialized training strategies.

Abstract: With the rapid advancement of intelligent transportation systems, text-driven image generation and editing techniques have demonstrated significant potential in providing rich, controllable visual scene data for applications such as traffic monitoring and autonomous driving. However, several challenges remain, including insufficient semantic richness of generated traffic elements, limited camera viewpoints, low visual fidelity of synthesized images, and poor alignment between textual descriptions and generated content. To address these issues, we propose a unified text-driven framework for both image generation and editing, leveraging a controllable mask mechanism to seamlessly integrate the two tasks. Furthermore, we incorporate both vehicle-side and roadside multi-view data to enhance the geometric diversity of traffic scenes. Our training strategy follows a two-stage paradigm: first, we perform conceptual learning using large-scale coarse-grained text-image data; then, we fine-tune with fine-grained descriptive data to enhance text-image alignment and detail quality. Additionally, we introduce a mask-region-weighted loss that dynamically emphasizes small yet critical regions during training, thereby substantially enhancing the generation fidelity of small-scale traffic elements. Extensive experiments demonstrate that our method achieves leading performance in text-based image generation and editing within traffic scenes.
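
To make the idea of a mask-region-weighted loss concrete, here is a minimal sketch. The weighting rule is an assumption made for illustration (pixels inside the mask are up-weighted by the inverse of the mask's area fraction, so smaller regions count more); the abstract does not specify the actual formula, and the name `mask_weighted_mse` is hypothetical.

```python
def mask_weighted_mse(pred, target, mask, base_weight=1.0):
    """Weighted mean squared error that emphasizes a small masked region.

    pred, target, mask: 2D nested lists of equal shape; mask entries are 0 or 1.
    """
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    flat_mask = [m for row in mask for m in row]
    area = sum(flat_mask)
    total = len(flat_mask)
    # Hypothetical rule: the smaller the masked region, the larger its weight.
    region_weight = base_weight * total / area if area else base_weight
    weights = [region_weight if m else base_weight for m in flat_mask]
    weighted_error = sum(w * (p - t) ** 2
                         for w, p, t in zip(weights, flat_pred, flat_target))
    return weighted_error / sum(weights)


pred = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 1.0]]
mask = [[1, 0], [0, 0]]          # one masked pixel out of four
loss = mask_weighted_mse(pred, target, mask)
print(round(loss, 4))  # 0.7143
```

In this toy case the plain MSE would be 0.5; the error on the single masked pixel is weighted four times as heavily, which raises the loss to 5/7.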

[243] SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting

Zihan Li, Tengfei Wang, Wentian Gan, Hao Zhan, Xin Wang, Zongqian Zhan

Main category: cs.CV

TL;DR: SF-Recon directly reconstructs lightweight building surfaces from multi-view images using 3D Gaussian Splatting and optimization, eliminating the need for post-hoc mesh simplification.

Details

Motivation: Conventional multi-view geometry pipelines are cumbersome and quality-sensitive due to reliance on dense reconstruction, meshing, and simplification. There's a need for direct lightweight building surface reconstruction.

Method: Train initial 3D Gaussian Splatting field, use normal-gradient-guided Gaussian optimization to select structural primitives, apply multi-view edge-consistency pruning, and perform multi-view depth-constrained Delaunay triangulation.

Result: SF-Recon achieves substantially fewer faces and vertices while maintaining computational efficiency, directly reconstructing lightweight building models from multi-view imagery.

Conclusion: The method successfully reconstructs lightweight building surfaces without post-hoc simplification, demonstrating improved efficiency and structural faithfulness.

Abstract: Lightweight building surface models are crucial for digital cities, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that our SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website: https://lzh282140127-cell.github.io/SF-Recon-project/

[244] YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection

Ori Meiraz, Sharon Shalev, Avishai Weizman

Main category: cs.CV

TL;DR: A Mixture-of-Experts framework for object detection using multiple YOLOv9-T experts with adaptive routing achieves higher mAP and AR than a single YOLOv9-T model.

Details

Motivation: To improve object detection performance by enabling dynamic feature specialization through multiple specialized experts rather than relying on a single model.

Method: Mixture-of-Experts framework with adaptive routing among multiple YOLOv9-T experts for dynamic feature specialization.

Result: Achieves higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.

Conclusion: The Mixture-of-Experts approach with adaptive routing effectively enhances object detection performance through specialized feature processing.

Abstract: This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.

[245] Attention Via Convolutional Nearest Neighbors

Mingi Kang, Jeová Farias Sales Rocha Neto

Main category: cs.CV

TL;DR: Convolution and self-attention are unified under a single k-nearest neighbor framework where convolution selects neighbors by spatial proximity and attention by feature similarity, existing on a continuous spectrum.

Details

Motivation: To dissolve the apparent distinction between convolutional neural networks and transformers by showing both convolution and self-attention can be unified within a common framework.

Method: Introduce Convolutional Nearest Neighbors (ConvNN) framework that serves as drop-in replacement for convolutional and attention layers, enabling systematic exploration of intermediate spectrum between spatial-proximity and feature-similarity neighbor selection.

Result: ConvNN improves accuracy on CIFAR-10/100 classification tasks: hybrid branching in VGG combines both selection methods for better performance, and ConvNN in ViT outperforms standard attention and variants. Interpolating along the spectrum provides regularization benefits.

Conclusion: The work provides a unifying framework that dissolves the distinction between convolution and attention, with implications for designing more principled and interpretable vision architectures.

Abstract: The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and aggregation; convolution selects neighbors by spatial proximity, while attention selects by feature similarity, revealing they exist on a continuous spectrum. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. Crucially, ConvNN serves as a drop-in replacement for convolutional and attention layers, enabling systematic exploration of the intermediate spectrum between these two extremes. We validate the framework’s coherence on CIFAR-10 and CIFAR-100 classification tasks across two complementary architectures: (1) Hybrid branching in VGG improves accuracy on both CIFAR datasets by combining spatial-proximity and feature-similarity selection; and (2) ConvNN in ViT outperforms standard attention and other attention variants on both datasets. Extensive ablations on k values and architectural variants reveal that interpolating along this spectrum provides regularization benefits by balancing local and global receptive fields. Our work provides a unifying framework that dissolves the apparent distinction between convolution and attention, with implications for designing more principled and interpretable vision architectures.
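
The neighbor-selection view described above can be illustrated on a 1D token sequence. The sketch below is a toy under stated assumptions, not ConvNN itself: it aggregates the k selected neighbors with a plain mean, whereas a real convolution applies learned per-position weights and attention applies similarity-dependent weights. The function name `knn_aggregate` and the `mode` argument are hypothetical.

```python
def knn_aggregate(features, k, mode):
    """Average each token with its k nearest neighbors (itself included).

    mode="spatial": neighbors are the closest positions (convolution-like).
    mode="feature": neighbors are the most similar features (attention-like).
    """
    if mode not in ("spatial", "feature"):
        raise ValueError("mode must be 'spatial' or 'feature'")

    def distance(i, j):
        if mode == "spatial":
            return abs(i - j)
        return sum((a - b) ** 2 for a, b in zip(features[i], features[j]))

    n = len(features)
    dim = len(features[0])
    out = []
    for i in range(n):
        # sorted() is stable, so ties are broken by position index
        neighbors = sorted(range(n), key=lambda j: distance(i, j))[:k]
        out.append([sum(features[j][d] for j in neighbors) / k
                    for d in range(dim)])
    return out


features = [[0.0], [10.0], [0.2], [9.8]]   # alternating "low" and "high" tokens
spatial = knn_aggregate(features, k=2, mode="spatial")
by_feature = knn_aggregate(features, k=2, mode="feature")
print(spatial[0], by_feature[0])
```

With spatial selection the first token is averaged with its positional neighbor (a high value), while feature selection pairs it with the other low token, which is the contrast between the two ends of the spectrum.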

[246] FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation

Yueru He, Xueqing Peng, Yupeng Cao, Yan Wang, Lingfei Qian, Haohang Li, Yi Han, Ruoyu Xiang, Mingquan Lin, Prayag Tiwari, Jimin Huang, Guojun Xiong, Sophia Ananiadou

Main category: cs.CV

TL;DR: FinCriticalED is a visual benchmark for evaluating OCR and vision language models on financial documents at the fact level, focusing on detecting critical errors in numerical and temporal information.

Details

Motivation: Financial documents contain visually dense layouts where small OCR mistakes (like sign inversion or shifted dates) can lead to materially different interpretations, while traditional OCR metrics only capture surface-level text similarity.

Method: Created 500 image-HTML pairs with expert-annotated financial facts covering 700+ numerical/temporal facts, developed LLM-as-Judge evaluation pipeline for structured fact extraction and contextual verification.

Result: Strongest proprietary models achieve highest factual accuracy but substantial errors remain in visually intricate numerical and temporal contexts.

Conclusion: FinCriticalED provides rigorous foundation for advancing visual factual precision in financial and other precision-critical domains, shifting evaluation from lexical overlap to domain-critical factual correctness.

Abstract: We introduce FinCriticalED (Financial Critical Error Detection), a visual benchmark for evaluating OCR and vision language models on financial documents at the fact level. Financial documents contain visually dense and table heavy layouts where numerical and temporal information is tightly coupled with structure. In high stakes settings, small OCR mistakes such as sign inversion or shifted dates can lead to materially different interpretations, while traditional OCR metrics like ROUGE and edit distance capture only surface level text similarity. FinCriticalED provides 500 image-HTML pairs with expert annotated financial facts covering over seven hundred numerical and temporal facts. It introduces three key contributions. First, it establishes the first fact level evaluation benchmark for financial document understanding, shifting evaluation from lexical overlap to domain critical factual correctness. Second, all annotations are created and verified by financial experts with strict quality control over signs, magnitudes, and temporal expressions. Third, we develop an LLM-as-Judge evaluation pipeline that performs structured fact extraction and contextual verification for visually complex financial documents. We benchmark OCR systems, open source vision language models, and proprietary models on FinCriticalED. Results show that although the strongest proprietary models achieve the highest factual accuracy, substantial errors remain in visually intricate numerical and temporal contexts. Through quantitative evaluation and expert case studies, FinCriticalED provides a rigorous foundation for advancing visual factual precision in financial and other precision critical domains.

[247] Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models

Mehran Tamjidi, Hamidreza Dastmalchi, Mohammadreza Alimoradijazi, Ali Cheraghian, Aijun An, Morteza Saberi

Main category: cs.CV

TL;DR: Uni-Adapter is a training-free online test-time adaptation method for 3D vision-language foundation models that uses dynamic prototype learning to handle noisy, incomplete, or distribution-shifted data without retraining.

Details

Motivation: 3D vision-language foundation models underperform in practical scenarios with noisy, incomplete, or distribution-shifted data, requiring adaptation strategies.

Method: Uses dynamic prototype learning with a 3D cache storing class-specific cluster centers, graph-based label smoothing for inter-prototype consistency, and entropy-weighted aggregation to unify predictions.

Result: Achieves state-of-the-art performance: 10.55% improvement on ModelNet-40C, 8.26% on ScanObjectNN-C, and 4.49% on ShapeNet-C over source 3D VLFMs.

Conclusion: Uni-Adapter effectively mitigates distribution shifts in 3D vision-language foundation models through training-free online adaptation, demonstrating strong generalization across diverse 3D benchmarks.

Abstract: 3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs. Project page: https://mehran-tam.github.io/Uni-Adapter
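
The entropy-weighted aggregation step can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the softmax over negative entropies used here is an assumption, and the function names are hypothetical rather than taken from the Uni-Adapter code.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of a probability vector.
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def entropy_weighted_fusion(model_logits, cache_logits):
    """Fuse two logit vectors, favouring the lower-entropy (more confident) one.

    Assumption: weights are a softmax over negative entropies; the paper
    may use a different weighting.
    """
    h_model = entropy(softmax(model_logits))
    h_cache = entropy(softmax(cache_logits))
    weights = softmax(np.array([-h_model, -h_cache]))
    fused = weights[0] * model_logits + weights[1] * cache_logits
    return fused, weights

# Usage: an uncertain model prediction and a confident cache prediction.
model_logits = np.array([0.0, 0.0, 0.0])
cache_logits = np.array([5.0, 0.0, 0.0])
fused, weights = entropy_weighted_fusion(model_logits, cache_logits)
```

In this toy case the cache prediction has lower entropy, so it receives the larger weight and determines the fused argmax.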

[248] Automated Interpretable 2D Video Extraction from 3D Echocardiography

Milos Vukadinovic, Hirotaka Ieki, Yuki Sahashi, David Ouyang, Bryan He

Main category: cs.CV

TL;DR: Automated method to extract standard 2D echocardiography views from 3D cardiac ultrasound volumes using deep learning and anatomical heuristics, validated with 96% accuracy by cardiologists.

Details

Motivation: 3D echocardiography offers better image quality but physicians are accustomed to interpreting 2D views. This bridges the gap by allowing clinicians to use their familiar 2D format while benefiting from 3D scanning advantages.

Method: Combines deep learning view classifier with anatomical landmark-based heuristics and cardiologist-provided rules to reconstruct standard 2D echocardiography views from 3D volumes.

Result: Achieved 96% accuracy in blinded evaluation by three cardiologists on 1,600 videos from 2 hospitals. Extracted 2D videos preserved spatial calibration and diagnostic features, enabling accurate cardiac abnormality detection and clinical-grade measurements.

Conclusion: The approach successfully allows clinicians to obtain accurate real-world interpretations from 3D volumes while maintaining their preferred 2D viewing format, demonstrating practical clinical utility.

Abstract: Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos https://github.com/echonet/3d-echo .

[249] POMA-3D: The Point Map Way to 3D Scene Understanding

Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk

Main category: cs.CV

TL;DR: POMA-3D is the first self-supervised 3D representation model using point maps, which encode 3D coordinates on 2D grids to leverage 2D foundation models while preserving 3D geometry.

Details

Motivation: Addresses the scarcity of pretrained priors and limited data in 3D representation learning by creating a bridge between 2D foundation models and 3D scene understanding.

Method: Uses point maps with view-to-scene alignment strategy and POMA-JEPA architecture for geometrically consistent features across multiple views. Trained on ScenePoint dataset from 6.5K room-level RGB-D scenes and 1M 2D image scenes.

Result: POMA-3D serves as a strong backbone for diverse 3D tasks including 3D question answering, embodied navigation, scene retrieval, and embodied localization using only geometric inputs (3D coordinates).

Conclusion: POMA-3D successfully explores point maps as an effective approach for 3D scene understanding, transferring rich 2D priors to 3D while overcoming data limitations in 3D representation learning.

Abstract: In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
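
To make the point map idea concrete, the sketch below back-projects a depth image into an H x W x 3 grid of camera-frame 3D coordinates using a standard pinhole model. The abstract does not describe how ScenePoint point maps are built, so the pinhole assumption, the function name, and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative only.

```python
import numpy as np

def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project a depth image (H, W) into a point map (H, W, 3).

    Assumes a pinhole camera; each pixel stores its (x, y, z) coordinates
    in the camera frame, so the result keeps the 2D grid layout.
    """
    h, w = depth.shape
    # u indexes columns, v indexes rows; both have shape (h, w).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Usage: a flat surface 2 units from the camera.
depth = np.full((2, 3), 2.0)
point_map = depth_to_point_map(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.0)
```

Because the output has the same grid shape as an image with three channels, it can be fed to architectures that expect image-like inputs, which is the compatibility property the abstract highlights.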

[250] EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

Main category: cs.CV

TL;DR: EvoLMM is a self-evolving framework that improves large multimodal models’ reasoning capabilities through unsupervised learning using two cooperative agents (Proposer and Solver) that generate and solve image-grounded questions via internal consistency.

Details

Motivation: To enhance LMM reasoning capabilities without relying on human-curated data or external reward models, enabling more autonomous and scalable training.

Method: Proposes EvoLMM framework with two agents from a single backbone model: Proposer generates diverse image-grounded questions, and Solver solves them through internal consistency in a continuous self-rewarding process.

Result: Achieves consistent gains up to ~3% on multimodal math-reasoning benchmarks (ChartQA, MathVista, MathVision) using only raw training images with Qwen2.5-VL as base model.

Conclusion: EvoLMM provides a simple yet effective baseline for self-improving LMMs in fully-unsupervised fashion, demonstrating the potential of autonomous learning without human annotations.

Abstract: Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
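
One way to picture a continuous reward derived from internal consistency is the agreement rate among several sampled Solver answers. The sketch below is an assumption about the general idea, not the reward defined in the EvoLMM paper; the function name is hypothetical.

```python
from collections import Counter

def consistency_reward(answers):
    """Return (majority_answer, agreement_fraction) for sampled answers.

    Assumption: the reward is the share of samples that agree with the
    most common answer, which yields a value in [0, 1] instead of a
    binary correct/incorrect signal.
    """
    if not answers:
        return None, 0.0
    counts = Counter(answers)
    majority, n_agree = counts.most_common(1)[0]
    return majority, n_agree / len(answers)

# Usage: four sampled answers to the same image-grounded question.
majority, reward = consistency_reward(["42", "42", "41", "42"])
```

No ground-truth label is needed here: the signal comes only from how often the model agrees with itself.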

cs.AI

[251] Stable diffusion models reveal a persisting human and AI gap in visual creativity

Silvia Rondini, Claudia Alvarez-Martin, Paula Angermair-Barkai, Olivier Penacchio, M. Paz, Matthew Pelowski, Dan Dediu, Antoni Rodriguez-Fornells, Xim Cerda-Company

Main category: cs.AI

TL;DR: This study compares visual creativity between humans (artists and non-artists) and AI image generators, finding humans outperform AI, with visual artists being most creative, followed by non-artists, then AI with human guidance, and finally AI alone.

Details

Motivation: To explore visual creativity in AI compared to humans, as previous research focused mainly on language-based creativity tasks while visual creativity remains underexplored.

Method: Compared image generation between human participants (visual artists and non-artists) and AI image generators with two prompting conditions (high human input vs low human input). Used human raters and GPT4o to evaluate creativity of resulting images.

Result: Found a clear creativity gradient: Visual Artists > Non Artists > Human-Inspired AI > Self-Guided AI. Human guidance significantly improved AI’s creative output. Human and AI raters showed different creativity judgment patterns.

Conclusion: GenAI faces unique challenges in visual domains where creativity depends on human perceptual nuance and contextual sensitivity, suggesting these capacities may not be readily transferable from language models.

Abstract: While recent research suggests Large Language Models match human creative performance in divergent thinking tasks, visual creativity remains underexplored. This study compared image generation in human participants (Visual Artists and Non Artists) and using an image generation AI model (two prompting conditions with varying human input: high for Human Inspired, low for Self Guided). Human raters (N=255) and GPT4o evaluated the creativity of the resulting images. We found a clear creativity gradient, with Visual Artists being the most creative, followed by Non Artists, then Human Inspired generative AI, and finally Self Guided generative AI. Increased human guidance strongly improved GenAI’s creative output, bringing its productions close to those of Non Artists. Notably, human and AI raters also showed vastly different creativity judgment patterns. These results suggest that, in contrast to language centered tasks, GenAI models may face unique challenges in visual domains, where creativity depends on perceptual nuance and contextual sensitivity, distinctly human capacities that may not be readily transferable from language models.

[252] Cognitive BASIC: An In-Model Interpreted Reasoning Language for LLMs

Oliver Kramer

Main category: cs.AI

TL;DR: Cognitive BASIC is a minimal BASIC-style prompting language that structures LLM reasoning into explicit stepwise execution traces, enabling transparent multi-step reasoning through simulated program execution.

Details

Motivation: To create an interpretable cognitive control layer for LLMs that enables transparent multi-step reasoning by leveraging the simplicity of retro BASIC programming concepts.

Method: Developed a BASIC-style prompting language with numbered lines and simple commands, along with a natural-language interpreter file that specifies command semantics, memory updates, and logging behavior. The system extracts declarative/procedural knowledge and detects contradictions.

Result: All three tested LLMs could execute Cognitive BASIC programs, showing overall strong but not uniform performance on knowledge extraction, conflict detection, and reasoning tasks.

Conclusion: Cognitive BASIC provides an effective framework for structuring LLM reasoning into interpretable stepwise execution traces, demonstrating that modern LLMs can reliably simulate simple programs for transparent reasoning.

Abstract: Cognitive BASIC is a minimal, BASIC-style prompting language and in-model interpreter that structures large language model (LLM) reasoning into explicit, stepwise execution traces. Inspired by the simplicity of retro BASIC, we repurpose numbered lines and simple commands as an interpretable cognitive control layer. Modern LLMs can reliably simulate such short programs, enabling transparent multi-step reasoning inside the model. A natural-language interpreter file specifies command semantics, memory updates, and logging behavior. Our mental-model interpreter extracts declarative and procedural knowledge, detects contradictions, and produces resolutions when necessary. A comparison across three LLMs on a benchmark of knowledge extraction, conflict detection, and reasoning tasks shows that all models can execute Cognitive BASIC programs, with overall strong but not uniform performance.

[253] Fantastic Bugs and Where to Find Them in AI Benchmarks

Sang Truong, Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang, Jirayu Burapacheep, Jonathan Perera, Chibuike Uwakwe, Ben Domingue, Nick Haber, Sanmi Koyejo

Main category: cs.AI

TL;DR: A framework for systematic benchmark revision using statistical analysis of response patterns to identify invalid questions, reducing human effort through automated flagging and LLM-judge review.

Details

Motivation: Invalid benchmark questions undermine AI evaluation reliability, and manual error identification among thousands of questions is infeasible and a critical bottleneck.

Method: Leverages statistical analysis of response patterns based on the assumption that mean score summarizes model performance, flagging questions when empirical statistics fall outside expected ranges. Includes LLM-judge first pass for initial review.

Result: Achieves up to 84% precision in identifying problematic questions across nine widely used benchmarks, significantly reducing human review effort.

Conclusion: Provides an efficient and scalable framework for systematic benchmark revision that combines statistical analysis with automated review to improve benchmark reliability.

Abstract: Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.
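
The flagging idea can be sketched with one classical item statistic. The abstract does not name the statistics it uses, so the item-rest correlation below is an assumed stand-in: under a unidimensional construct, getting an item right should correlate positively with the score on the remaining items, and items that violate this are candidates for review.

```python
import numpy as np

def flag_items(responses, threshold=0.0):
    """Flag items whose item-rest correlation falls below `threshold`.

    responses: (n_models, n_items) matrix of 0/1 correctness scores.
    The statistic and threshold are illustrative assumptions.
    """
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    flagged = []
    for j in range(responses.shape[1]):
        item = responses[:, j]
        rest = total - item  # score on all other items
        if item.std() == 0 or rest.std() == 0:
            continue  # correlation is undefined for constant columns
        r = np.corrcoef(item, rest)[0, 1]
        if r < threshold:
            flagged.append(j)
    return flagged

# Usage: six models, four items; the last item is answered correctly
# mostly by the weakest models, which suggests a faulty answer key.
responses = [
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
flagged = flag_items(responses)
```

Flagged items would then go to expert (or LLM-judge) review rather than being removed automatically.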

[254] Hybrid Differential Reward: Combining Temporal Difference and Action Gradients for Efficient Multi-Agent Reinforcement Learning in Cooperative Driving

Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang

Main category: cs.AI

TL;DR: Proposes Hybrid Differential Reward (HDR) mechanism to solve vanishing reward differences in multi-vehicle cooperative driving, combining Temporal Difference Reward and Action Gradient Reward to improve policy gradient SNR and learning efficiency.

Details

Motivation: Traditional state-based reward functions in multi-vehicle cooperative driving suffer from vanishing reward differences due to temporal quasi-steady traffic states and physical proximity of actions, leading to low signal-to-noise ratio in policy gradients that hinders algorithm convergence.

Method: HDR integrates two components: (1) Temporal Difference Reward based on global potential function using evolutionary trend of potential energy, and (2) Action Gradient Reward measuring marginal utility of actions. Formulated as Multi-Agent Partially Observable Markov Game with time-varying agent set.

Result: Extensive experiments with online planning (MCTS) and Multi-Agent RL algorithms (QMIX, MAPPO, MADDPG) show HDR significantly improves convergence speed and policy stability, enabling agents to learn high-quality cooperative policies balancing traffic efficiency and safety.

Conclusion: HDR mechanism effectively addresses vanishing reward differences in multi-vehicle cooperative driving, providing high SNR guidance signals that enhance learning efficiency and enable effective cooperative policies.

Abstract: In multi-vehicle cooperative driving tasks involving high-frequency continuous control, traditional state-based reward functions suffer from the issue of vanishing reward differences. This phenomenon results in a low signal-to-noise ratio (SNR) for policy gradients, significantly hindering algorithm convergence and performance improvement. To address this challenge, this paper proposes a novel Hybrid Differential Reward (HDR) mechanism. We first theoretically elucidate how the temporal quasi-steady nature of traffic states and the physical proximity of actions lead to the failure of traditional reward signals. Building on this analysis, the HDR framework innovatively integrates two complementary components: (1) a Temporal Difference Reward (TRD) based on a global potential function, which utilizes the evolutionary trend of potential energy to ensure optimal policy invariance and consistency with long-term objectives; and (2) an Action Gradient Reward (ARG), which directly measures the marginal utility of actions to provide a local guidance signal with a high SNR. Furthermore, we formulate the cooperative driving problem as a Multi-Agent Partially Observable Markov Game (POMDPG) with a time-varying agent set and provide a complete instantiation scheme for HDR within this framework. Extensive experiments conducted using both online planning (MCTS) and Multi-Agent Reinforcement Learning (QMIX, MAPPO, MADDPG) algorithms demonstrate that the HDR mechanism significantly improves convergence speed and policy stability. The results confirm that HDR guides agents to learn high-quality cooperative policies that effectively balance traffic efficiency and safety.
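
The two reward components can be sketched as follows. The abstract gives no formulas, so both forms are assumptions: the temporal term uses the standard potential-based shaping form, which is known to preserve optimal policies, and the action term uses a central finite difference of a hypothetical utility function with respect to the action. The weights `w_td` and `w_ag` are also illustrative.

```python
def temporal_difference_reward(phi_s, phi_next, gamma=0.99):
    # Potential-based shaping: discounted next potential minus current potential.
    return gamma * phi_next - phi_s

def action_gradient_reward(utility, action, eps=1e-3):
    # Central finite-difference estimate of the marginal utility dU/da.
    return (utility(action + eps) - utility(action - eps)) / (2 * eps)

def hybrid_reward(phi_s, phi_next, utility, action,
                  gamma=0.99, w_td=1.0, w_ag=1.0):
    # Weighted sum of the two components (weights are assumptions).
    r_td = temporal_difference_reward(phi_s, phi_next, gamma)
    r_ag = action_gradient_reward(utility, action, eps=1e-3)
    return w_td * r_td + w_ag * r_ag

# Usage: a toy utility that peaks at action = 1.0.
utility = lambda a: -(a - 1.0) ** 2
r_td = temporal_difference_reward(phi_s=1.0, phi_next=2.0, gamma=0.5)
r_ag = action_gradient_reward(utility, action=0.0)
r_hdr = hybrid_reward(1.0, 2.0, utility, 0.0, gamma=0.5)
```

The temporal term depends on how the global potential changes between states, while the action term responds to small changes in the action even when consecutive states look nearly identical.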

[255] Agentifying Agentic AI

Virginia Dignum, Frank Dignum

Main category: cs.AI

TL;DR: The paper argues that AAMAS community tools (BDI architectures, communication protocols, mechanism design, institutional modeling) provide the foundation needed to complement agentic AI’s agency assumptions with explicit models of cognition, cooperation, and governance.

Details

Motivation: To realize the vision of agentic AI with sustained autonomy, reasoning, and interaction capabilities, its assumptions about agency need to be complemented by explicit models of cognition, cooperation, and governance.

Method: Align adaptive, data-driven approaches with structured models of reasoning and coordination from AAMAS community tools including BDI architectures, communication protocols, mechanism design, and institutional modeling.

Result: A perspective on agency that bridges formal theory and practical autonomy, enabling agentic systems that are capable, flexible, transparent, cooperative, and accountable.

Conclusion: The conceptual tools from the AAMAS community provide the necessary foundation for developing agentic systems with proper cognition, cooperation, and governance models, creating a bridge between formal theory and practical autonomy.

Abstract: Agentic AI seeks to endow systems with sustained autonomy, reasoning, and interaction capabilities. To realize this vision, its assumptions about agency must be complemented by explicit models of cognition, cooperation, and governance. This paper argues that the conceptual tools developed within the Autonomous Agents and Multi-Agent Systems (AAMAS) community, such as BDI architectures, communication protocols, mechanism design, and institutional modelling, provide precisely such a foundation. By aligning adaptive, data-driven approaches with structured models of reasoning and coordination, we outline a path toward agentic systems that are not only capable and flexible, but also transparent, cooperative, and accountable. The result is a perspective on agency that bridges formal theory and practical autonomy.

[256] Comparing verbal, visual and combined explanations for Bayesian Network inferences

Erik P. Nyberg, Steven Mascaro, Ingrid Zukerman, Michael Wybrow, Duc-Minh Vo, Ann Nicholson

Main category: cs.AI

TL;DR: The paper presents UI extensions (verbal, visual, combined) for Bayesian Networks to improve user understanding of probabilistic reasoning, showing that all extensions outperform baseline UI and combined modality works best for certain question types.

Details

Motivation: Bayesian Networks are considered transparent models but users still struggle to understand them, and current UIs don't clarify BN reasoning effectively.

Method: Designed verbal and visual UI extensions to guide users through common inference patterns, and conducted a user study comparing verbal, visual, combined extensions against baseline UI.

Result: Users performed better with all three extension types than baseline UI for questions about observation impact, enabling paths, and influence of observations. Combined verbal and visual modalities outperformed single modalities for some question types.

Conclusion: UI extensions (especially combined verbal-visual) significantly improve user understanding of Bayesian Network reasoning compared to standard interfaces.

Abstract: Bayesian Networks (BNs) are an important tool for assisting probabilistic reasoning, but despite being considered transparent models, people have trouble understanding them. Further, current User Interfaces (UIs) still do not clarify the reasoning of BNs. To address this problem, we have designed verbal and visual extensions to the standard BN UI, which can guide users through common inference patterns. We conducted a user study to compare our verbal, visual and combined UI extensions, and a baseline UI. Our main findings are: (1) users did better with all three types of extensions than with the baseline UI for questions about the impact of an observation, the paths that enable this impact, and the way in which an observation influences the impact of other observations; and (2) using verbal and visual modalities together is better than using either modality alone for some of these question types.

[257] MirrorMind: Empowering OmniScientist with the Expert Perspectives and Collective Knowledge of Human Scientists

Qingbin Zeng, Bingbing Fan, Zhiyu Chen, Sijian Ren, Zhilun Zhou, Xuhua Zhang, Yuanyi Zhen, Fengli Xu, Yong Li, Tie-Yan Liu

Main category: cs.AI

TL;DR: MirrorMind is a hierarchical cognitive architecture that integrates individual researcher cognitive models with collective disciplinary knowledge to enable AI scientists to perform personalized and insight-generating scientific reasoning.

Details

Motivation: Current AI scientist approaches treat scientific discovery as solitary optimization, overlooking the social and historical nature of knowledge production where insights come from both individual cognitive trajectories and collective disciplinary memory.

Method: Three-level hierarchical architecture: Individual Level (episodic, semantic, persona memories), Domain Level (structured disciplinary concept graphs), and Interdisciplinary Level (orchestration engine), with separation of memory storage from agentic execution.

Result: Evaluated across four tasks showing improved author-level cognitive simulation, complementary reasoning, cross-disciplinary collaboration, and multi-agent scientific problem solving, moving beyond simple fact retrieval to structural and personalized reasoning.

Conclusion: MirrorMind successfully bridges the gap by integrating individual cognitive depth with collective disciplinary breadth, enabling AI scientists to perform insight-generating scientific reasoning that captures both personal perspectives and structured knowledge networks.

Abstract: The emergence of AI Scientists has demonstrated remarkable potential in automating scientific research. However, current approaches largely conceptualize scientific discovery as a solitary optimization or search process, overlooking that knowledge production is inherently a social and historical endeavor. Human scientific insight stems from two distinct yet interconnected sources. The first is the individual cognitive trajectory, where a researcher’s unique insight is shaped by their evolving research history and stylistic preferences; the second is the collective disciplinary memory, where knowledge is sedimented into vast, interconnected networks of citations and concepts. Existing LLMs still struggle to represent these structured, high-fidelity cognitive and social contexts. To bridge this gap, we introduce MirrorMind, a hierarchical cognitive architecture that integrates dual-memory representations within a three-level framework. The Individual Level constructs high-fidelity cognitive models of individual researchers by capturing their episodic, semantic, and persona memories; the Domain Level maps collective knowledge into structured disciplinary concept graphs; and the Interdisciplinary Level acts as an orthogonal orchestration engine. Crucially, our architecture separates memory storage from agentic execution, enabling AI scientist agents to flexibly access individual memories for unique perspectives or collective structures to reason. We evaluate MirrorMind across four comprehensive tasks, including author-level cognitive simulation, complementary reasoning, cross-disciplinary collaboration promotion, and multi-agent scientific problem solving. The results show that by integrating individual cognitive depth with collective disciplinary breadth, MirrorMind moves beyond simple fact retrieval toward structural, personalized, and insight-generating scientific reasoning.

[258] Budget-Aware Tool-Use Enables Effective Agent Scaling

Tengxiao Liu, Zifeng Wang, Jin Miao, I-Hung Hsu, Jun Yan, Jiefeng Chen, Rujun Han, Fangyuan Xu, Yanfei Chen, Ke Jiang, Samira Daruki, Yi Liang, William Yang Wang, Tomas Pfister, Chen-Yu Lee

Main category: cs.AI

TL;DR: Budget-aware scaling for tool-augmented agents improves performance by providing continuous budget awareness and dynamic strategy adaptation, rather than simply increasing tool-call budgets.

Details

Motivation: Current tool-augmented agents lack budget awareness and hit performance ceilings when given larger tool-call budgets, failing to effectively scale with increased computational resources.

Method: Introduces Budget Tracker for continuous budget awareness and BATS framework for dynamic planning/verification strategy adaptation that decides whether to dig deeper or pivot based on remaining resources.

Result: Budget-aware methods produce more favorable scaling curves and push the cost-performance Pareto frontier, enabling more effective scaling in tool-augmented agents.

Conclusion: Budget awareness is crucial for effective scaling in tool-augmented agents, offering empirical insights for more transparent and principled understanding of scaling in such systems.

Abstract: Scaling test-time computation improves performance across different tasks on large language models (LLMs), which has also been extended to tool-augmented agents. For these agents, scaling involves not only “thinking” in tokens but also “acting” via tool calls. The number of tool calls directly bounds the agent’s interaction with the external environment. However, we find that simply granting agents a larger tool-call budget fails to improve performance, as they lack “budget awareness” and quickly hit a performance ceiling. To address this, we study how to scale such agents effectively under explicit tool-call budgets, focusing on web search agents. We first introduce the Budget Tracker, a lightweight plug-in that provides the agent with continuous budget awareness, enabling simple yet effective scaling. We further develop BATS (Budget Aware Test-time Scaling), an advanced framework that leverages this awareness to dynamically adapt its planning and verification strategy, deciding whether to “dig deeper” on a promising lead or “pivot” to new paths based on remaining resources. To analyze cost-performance scaling in a controlled manner, we formalize a unified cost metric that jointly accounts for token and tool consumption. We provide the first systematic study on budget-constrained agents, showing that budget-aware methods produce more favorable scaling curves and push the cost-performance Pareto frontier. Our work offers empirical insights toward a more transparent and principled understanding of scaling in tool-augmented agents.
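
A budget tracker of the kind described could look roughly like the sketch below: a counter over tool calls plus a status line that is appended to the agent's context at every step. The class, its fields, and the status format are hypothetical and not taken from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class BudgetTracker:
    """Counts tool calls against a fixed budget (illustrative sketch)."""
    total_calls: int
    used_calls: int = 0

    @property
    def remaining(self):
        return self.total_calls - self.used_calls

    def record_call(self):
        # Refuse to go over budget instead of silently continuing.
        if self.remaining <= 0:
            raise RuntimeError("tool-call budget exhausted")
        self.used_calls += 1

    def status_line(self):
        # Text that could be appended to the agent prompt each turn.
        return (f"[budget] tool calls used: {self.used_calls}/"
                f"{self.total_calls}, remaining: {self.remaining}")

# Usage: one search call out of a budget of three.
tracker = BudgetTracker(total_calls=3)
tracker.record_call()
status = tracker.status_line()
```

Surfacing the remaining budget in the prompt is what lets the agent decide whether to keep following a lead or switch to a new path.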

[259] DAPS++: Rethinking Diffusion Inverse Problems with Decoupled Posterior Annealing

Hao Chen, Renzheng Zhang, Scott S. Howard

Main category: cs.AI

TL;DR: The paper reinterprets diffusion models for inverse problems as an EM-style framework, introducing DAPS++ which decouples diffusion from data refinement for more efficient and stable reconstruction.

Details

Motivation: Current Bayesian interpretation of diffusion models fails to explain practical behavior where prior guidance is limited and reconstruction is mainly driven by measurement consistency, effectively decoupling from diffusion dynamics.

Method: Reinterpret diffusion as initialization in EM framework, introduce DAPS++ that allows direct likelihood guidance while maintaining stability, with decoupled diffusion stage and data-driven refinement.

Result: DAPS++ achieves high computational efficiency with fewer function evaluations and measurement-optimization steps, while maintaining robust reconstruction performance across diverse image restoration tasks.

Conclusion: The EM-style framework provides better insight into why unified diffusion trajectories work in practice, and DAPS++ offers more direct likelihood guidance with improved efficiency and stability.

Abstract: From a Bayesian perspective, score-based diffusion solves inverse problems through joint inference, embedding the likelihood with the prior to guide the sampling process. However, this formulation fails to explain its practical behavior: the prior offers limited guidance, while reconstruction is largely driven by the measurement-consistency term, leading to an inference process that is effectively decoupled from the diffusion dynamics. To clarify this structure, we reinterpret the role of diffusion in inverse problem solving as an initialization stage within an expectation–maximization (EM)–style framework, where the diffusion stage and the data-driven refinement are fully decoupled. We introduce DAPS++, which allows the likelihood term to guide inference more directly while maintaining numerical stability and providing insight into why unified diffusion trajectories remain effective in practice. By requiring fewer function evaluations (NFEs) and measurement-optimization steps, DAPS++ achieves high computational efficiency and robust reconstruction performance across diverse image restoration tasks.

[260] Patient-level Information Extraction by Consistent Integration of Textual and Tabular Evidence with Bayesian Networks

Paloma Rabaey, Adrick Tench, Stefan Heytens, Thomas Demeester

Main category: cs.AI

TL;DR: Proposes a multi-modal method combining structured EHR data (via Bayesian network) and unstructured clinical notes (via neural text classifiers) using virtual evidence with consistency nodes for improved calibration and interpretable fusion.

Details

Motivation: Leverage both structured and unstructured EHR data for clinical decision support, as much valuable patient information is contained in unstructured text like discharge summaries and nursing notes.

Method: Multi-modal patient-level information extraction combining tabular EHR features (using expert-informed Bayesian network) and clinical notes (using neural text classifiers) with virtual evidence augmented by consistency nodes for probabilistic fusion.

Result: The consistency node improves calibration of final predictions compared to virtual evidence alone, allowing better adjustment of neural classifier outputs to handle missing information and resolve contradictions between tabular and text data.

Conclusion: The method shows potential on the SimSUM dataset for interpretable, probabilistic fusion of multi-modal EHR data, improving handling of missing information and contradictions between data sources.

Abstract: Electronic health records (EHRs) form an invaluable resource for training clinical decision support systems. To leverage the potential of such systems in high-risk applications, we need large, structured tabular datasets on which we can build transparent feature-based models. While part of the EHR already contains structured information (e.g. diagnosis codes, medications, and lab results), much of the information is contained within unstructured text (e.g. discharge summaries and nursing notes). In this work, we propose a method for multi-modal patient-level information extraction that leverages both the tabular features available in the patient’s EHR (using an expert-informed Bayesian network) as well as clinical notes describing the patient’s symptoms (using neural text classifiers). We propose the use of virtual evidence augmented with a consistency node to provide an interpretable, probabilistic fusion of the models’ predictions. The consistency node improves the calibration of the final predictions compared to virtual evidence alone, allowing the Bayesian network to better adjust the neural classifier’s output to handle missing information and resolve contradictions between the tabular and text data. We show the potential of our method on the SimSUM dataset, a simulated benchmark linking tabular EHRs with clinical notes through expert knowledge.
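
The fusion idea can be illustrated with plain virtual evidence (Pearl-style likelihood weighting). The sketch below is not the paper's method: it omits the consistency node, uses a hypothetical two-node network (Disease -> Symptom), and all probabilities are made-up illustration values. The text classifier's output enters only as a pair of likelihood weights on the Symptom node.

```python
def posterior_with_virtual_evidence(prior_d, p_s_given_d, likelihood_s):
    """Return P(D=1 | virtual evidence on S) for a two-node network D -> S.

    prior_d      -- P(D=1), e.g. taken from tabular EHR features (assumed value)
    p_s_given_d  -- {0: P(S=1 | D=0), 1: P(S=1 | D=1)}
    likelihood_s -- (L(S=0), L(S=1)): likelihood weights derived from a text
                    classifier; (0.5, 0.5) means the note is uninformative
    """
    weights = {}
    for d, p_d in ((0, 1.0 - prior_d), (1, prior_d)):
        p_s1 = p_s_given_d[d]
        # Sum out S, weighting each symptom state by its virtual-evidence likelihood.
        weights[d] = p_d * ((1.0 - p_s1) * likelihood_s[0] + p_s1 * likelihood_s[1])
    return weights[1] / (weights[0] + weights[1])


# Hypothetical numbers for illustration only.
cpt = {0: 0.1, 1: 0.8}
no_text = posterior_with_virtual_evidence(0.1, cpt, (0.5, 0.5))
soft_text = posterior_with_virtual_evidence(0.1, cpt, (0.2, 0.8))
hard_text = posterior_with_virtual_evidence(0.1, cpt, (0.0, 1.0))
```

With uninformative weights the posterior falls back to the prior, which is how virtual evidence can accommodate a note that does not mention the symptom. A confident classifier moves the posterior toward the hard-evidence value.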

[261] The Belief-Desire-Intention Ontology for modelling mental reality and agency

Sara Zuppiroli, Carmelo Fabio Longo, Anna Sofia Lippolis, Rocco Paolillo, Lorenzo Giammei, Miguel Ceriani, Francesco Poggi, Antonio Zinilli, Andrea Giovanni Nuzzolese

Main category: cs.AI

TL;DR: A formal BDI Ontology is developed as a modular design pattern to represent agent cognitive architecture, enabling semantic interoperability and tested with LLM integration and reasoning platforms.

Details

Motivation: To address the limited integration of BDI models into structured, semantically interoperable knowledge representations in AI and cognitive sciences.

Method: Created a formal BDI Ontology as a modular Ontology Design Pattern, aligned with foundational ontologies, and tested through two experiments: LLM integration via Logic Augmented Generation and Semas reasoning platform integration using T2B2T paradigm.

Result: The ontology successfully bridges declarative and procedural intelligence, enabling bidirectional flow between RDF triples and agent mental states, and enhances inferential coherence and consistency.

Conclusion: The BDI Ontology provides a conceptual and operational foundation for cognitively grounded, explainable, and semantically interoperable multi-agent and neuro-symbolic systems in the Web of Data.

Abstract: The Belief-Desire-Intention (BDI) model is a cornerstone for representing rational agency in artificial intelligence and cognitive sciences. Yet, its integration into structured, semantically interoperable knowledge representations remains limited. This paper presents a formal BDI Ontology, conceived as a modular Ontology Design Pattern (ODP) that captures the cognitive architecture of agents through beliefs, desires, intentions, and their dynamic interrelations. The ontology ensures semantic precision and reusability by aligning with foundational ontologies and best practices in modular design. Two complementary lines of experimentation demonstrate its applicability: (i) coupling the ontology with Large Language Models (LLMs) via Logic Augmented Generation (LAG) to assess the contribution of ontological grounding to inferential coherence and consistency; and (ii) integrating the ontology within the Semas reasoning platform, which implements the Triples-to-Beliefs-to-Triples (T2B2T) paradigm, enabling a bidirectional flow between RDF triples and agent mental states. Together, these experiments illustrate how the BDI Ontology acts as both a conceptual and operational bridge between declarative and procedural intelligence, paving the way for cognitively grounded, explainable, and semantically interoperable multi-agent and neuro-symbolic systems operating within the Web of Data.

[262] MIR: Efficient Exploration in Episodic Multi-Agent Reinforcement Learning via Mutual Intrinsic Reward

Kesheng Chen, Wenjian Luo, Bang Zhang, Zeping Yin, Zipeng Ye

Main category: cs.AI

TL;DR: MIR is a mutual intrinsic reward method that enhances MARL performance in sparse episodic reward scenarios by encouraging agents to explore actions that affect teammates.

Details

Motivation: Episodic rewards pose challenges in MARL due to exponential sparsity of joint action trajectories and failure to account for joint actions influencing team states.

Method: Proposes Mutual Intrinsic Reward (MIR) that incentivizes individual agents to explore actions affecting teammates, combined with original strategies to stimulate team exploration.

Result: Experimental validation using extended MiniGrid-MA environments shows superior performance compared to state-of-the-art approaches in sparse reward settings.

Conclusion: MIR is a simple yet effective enhancement strategy that improves MARL algorithm performance in extremely sparse episodic reward scenarios.

Abstract: Episodic rewards present a significant challenge in reinforcement learning. While intrinsic reward methods have demonstrated effectiveness in single-agent reinforcement learning scenarios, their application to multi-agent reinforcement learning (MARL) remains problematic. The primary difficulties stem from two factors: (1) the exponential sparsity of joint action trajectories that lead to rewards as the exploration space expands, and (2) existing methods often fail to account for joint actions that can influence team states. To address these challenges, this paper introduces Mutual Intrinsic Reward (MIR), a simple yet effective enhancement strategy for MARL with extremely sparse rewards like episodic rewards. MIR incentivizes individual agents to explore actions that affect their teammates, and when combined with original strategies, effectively stimulates team exploration and improves algorithm performance. For comprehensive experimental validation, we extend the representative single-agent MiniGrid environment to create MiniGrid-MA, a series of MARL environments with sparse rewards. Our evaluation compares the proposed method against state-of-the-art approaches in the MiniGrid-MA setting, with experimental results demonstrating superior performance.

[263] Designing Domain-Specific Agents via Hierarchical Task Abstraction Mechanism

Kaiyu Li, Jiayu Wang, Zhi Wang, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao

Main category: cs.AI

TL;DR: The paper introduces HTAM, a hierarchical task abstraction framework for multi-agent systems that structures agents according to domain-specific task dependencies, and demonstrates its effectiveness in geospatial analysis through EarthAgent.

Details

Motivation: General agent frameworks like ReAct struggle in specialized domains requiring structured workflows, such as remote sensing with specialized tools and multi-step procedures.

Method: Hierarchical Task Abstraction Mechanism (HTAM) structures multi-agent systems into logical hierarchies mirroring domain task-dependency graphs, enforcing procedural correctness through sequential layers.

Result: EarthAgent (HTAM implementation) substantially outperforms established single- and multi-agent systems on GeoPlan-bench, a comprehensive geospatial planning benchmark.

Conclusion: Aligning agent architecture with a domain’s intrinsic task structure is critical for building robust specialized autonomous systems.

Abstract: LLM-driven agents, particularly those using general frameworks like ReAct or human-inspired role-playing, often struggle in specialized domains that necessitate rigorously structured workflows. Fields such as remote sensing, requiring specialized tools (e.g., correction, spectral indices calculation), and multi-step procedures (e.g., numerous intermediate products and optional steps), significantly challenge generalized approaches. To address this gap, we introduce a novel agent design framework centered on a Hierarchical Task Abstraction Mechanism (HTAM). Specifically, HTAM moves beyond emulating social roles, instead structuring multi-agent systems into a logical hierarchy that mirrors the intrinsic task-dependency graph of a given domain. This task-centric architecture thus enforces procedural correctness and decomposes complex problems into sequential layers, where each layer’s sub-agents operate on the outputs of the preceding layers. We instantiate this framework as EarthAgent, a multi-agent system tailored for complex geospatial analysis. To evaluate such complex planning capabilities, we build GeoPlan-bench, a comprehensive benchmark of realistic, multi-step geospatial planning tasks. It is accompanied by a suite of carefully designed metrics to evaluate tool selection, path similarity, and logical completeness. Experiments show that EarthAgent substantially outperforms a range of established single- and multi-agent systems. Our work demonstrates that aligning agent architecture with a domain’s intrinsic task structure is a critical step toward building robust and reliable specialized autonomous systems.
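
The layering idea, where each layer only consumes outputs of earlier layers, can be sketched as a topological layering of a task-dependency graph. This is an illustrative reading of the abstract, not EarthAgent's implementation. The task names and the `layer_tasks` helper are hypothetical, and the sketch relies only on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def layer_tasks(dependencies):
    """Group tasks into sequential layers.

    dependencies maps each task to the set of tasks it depends on.
    Every task in a layer depends only on tasks from earlier layers.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        layers.append(ready)
        sorter.done(*ready)
    return layers


# Hypothetical remote-sensing workflow used purely for illustration.
workflow = {
    "correction": {"load"},
    "cloud_mask": {"load"},
    "ndvi": {"correction"},
    "report": {"ndvi", "cloud_mask"},
}
layers = layer_tasks(workflow)
```

In a layered agent design of this kind, each inner list could be assigned to one layer of sub-agents, so a task never runs before its required intermediate products exist.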

[264] That’s not natural: The Impact of Off-Policy Training Data on Probe Performance

Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov

Main category: cs.AI

TL;DR: Probing LLMs with synthetic/off-policy data affects generalization; same-domain off-policy data works better than different-domain on-policy data; Deception and Sandbagging probes may fail in real monitoring.

Details

Motivation: Natural examples of concerning LLM behaviors are rare, forcing reliance on synthetic or off-policy data for training probes, but it's unclear how this affects probe generalization.

Method: Systematically evaluated linear and attention probes across 8 LLM behaviors using synthetic/off-policy data, testing generalization to on-policy scenarios where models are incentivized to produce target behaviors.

Result: Response generation strategy significantly affects probe performance; successful off-policy generalization predicts on-policy success; Deception and Sandbagging probes may fail; domain shifts cause larger performance degradation than policy shifts.

Conclusion: Same-domain off-policy data yields more reliable probes than different-domain on-policy data, highlighting the need for methods that better handle distribution shifts in LLM monitoring.

Abstract: Probing has emerged as a promising method for monitoring Large Language Models (LLMs), enabling inference-time detection of concerning behaviours such as deception and sycophancy. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how the use of synthetic and off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that the response generation strategy can significantly affect probe performance, though the magnitude of this effect varies by behaviour. We find that successful generalisation from off-policy data, to test sets where the model is incentivised to produce the target behaviour, is predictive of successful on-policy generalisation. Leveraging this result, we predict that Deception and Sandbagging probes may fail to generalise from off-policy to on-policy data when used in real monitoring scenarios. Notably, shifts in the training data domain still cause even larger performance degradation, with different-domain test scores being consistently lower than the same-domain ones. These results indicate that, in the absence of on-policy data, using same-domain off-policy data yields more reliable probes than using on-policy data from a different domain, emphasizing the need for methods that can better handle distribution shifts in LLM monitoring.
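
For readers unfamiliar with probing, a linear probe is essentially a logistic-regression classifier trained on model activations. The sketch below is a toy stand-in, not the paper's setup: the two-dimensional "activations" are synthetic Gaussian samples, the `shift` parameter is a hypothetical way to mimic a distribution shift between training and test data, and all hyperparameters are arbitrary.

```python
import math
import random


def train_linear_probe(xs, ys, lr=0.1, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(xs[0])
    w, b, n = [0.0] * dim, 0.0, len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of examples whose sign of the probe logit matches the label."""
    correct = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(xs)


def make_activations(rng, n, shift=0.0):
    """Synthetic 2-D 'activations'; shift moves the first coordinate of both classes."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        centre = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(centre + shift, 0.5), rng.gauss(centre, 0.5)])
        ys.append(y)
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_activations(rng, 200)
test_x, test_y = make_activations(rng, 200)
shifted_x, shifted_y = make_activations(rng, 200, shift=1.5)

w, b = train_linear_probe(train_x, train_y)
in_domain_acc = probe_accuracy(w, b, test_x, test_y)
shifted_acc = probe_accuracy(w, b, shifted_x, shifted_y)
```

Comparing `in_domain_acc` with `shifted_acc` mirrors, in miniature, the kind of same-distribution versus shifted-distribution evaluation the abstract describes. The toy numbers say nothing about real LLM probes.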

[265] SRA-CP: Spontaneous Risk-Aware Selective Cooperative Perception

Jiaxi Liu, Chengyuan Ma, Hang Zhou, Weizhe Tang, Shixiao Liang, Haoyang Ding, Xiaopeng Li, Bin Ran

Main category: cs.AI

TL;DR: SRA-CP is a decentralized cooperative perception framework that enables connected vehicles to selectively share safety-critical perception data only when risk-relevant blind zones are detected, reducing communication bandwidth by 80% while maintaining near-perfect safety-critical object detection.

Details

Motivation: Existing cooperative perception approaches transmit excessive irrelevant data that exceeds bandwidth limits and rely on pre-defined communication partners, making them unsuitable for dynamic traffic environments.

Method: Vehicles broadcast lightweight perception coverage summaries and initiate targeted cooperation only when risk-relevant blind zones are detected. A perceptual risk identification module assesses occlusion impact locally, and selective information exchange prioritizes safety-critical content while adapting to bandwidth constraints.

Result: SRA-CP achieves less than 1% AP loss for safety-critical objects compared to generic CP while using only 20% of communication bandwidth, and improves perception performance by 15% over existing selective CP methods without risk awareness.

Conclusion: The proposed SRA-CP framework effectively addresses bandwidth limitations and dynamic environment challenges in cooperative perception by enabling risk-aware selective data sharing, making it suitable for real-world connected vehicle applications.

Abstract: Cooperative perception (CP) offers significant potential to overcome the limitations of single-vehicle sensing by enabling information sharing among connected vehicles (CVs). However, existing generic CP approaches need to transmit large volumes of perception data that are irrelevant to the driving safety, exceeding available communication bandwidth. Moreover, most CP frameworks rely on pre-defined communication partners, making them unsuitable for dynamic traffic environments. This paper proposes a Spontaneous Risk-Aware Selective Cooperative Perception (SRA-CP) framework to address these challenges. SRA-CP introduces a decentralized protocol where connected agents continuously broadcast lightweight perception coverage summaries and initiate targeted cooperation only when risk-relevant blind zones are detected. A perceptual risk identification module enables each CV to locally assess the impact of occlusions on its driving task and determine whether cooperation is necessary. When CP is triggered, the ego vehicle selects appropriate peers based on shared perception coverage and engages in selective information exchange through a fusion module that prioritizes safety-critical content and adapts to bandwidth constraints. We evaluate SRA-CP on a public dataset against several representative baselines. Results show that SRA-CP achieves less than 1% average precision (AP) loss for safety-critical objects compared to generic CP, while using only 20% of the communication bandwidth. Moreover, it improves the perception performance by 15% over existing selective CP methods that do not incorporate risk awareness.

[266] RubiSCoT: A Framework for AI-Supported Academic Assessment

Thorsten Fröhlich, Tim Schlippe

Main category: cs.AI

TL;DR: RubiSCoT is an AI framework using NLP and LLMs to automate thesis evaluation from proposal to final submission, providing consistent and scalable assessment.

Details

Motivation: Traditional thesis evaluation methods are time-consuming and suffer from evaluator variability, creating a need for more efficient and consistent assessment solutions.

Method: Uses advanced NLP techniques including large language models, retrieval-augmented generation, and structured chain-of-thought prompting for preliminary assessments, multidimensional assessments, content extraction, rubric-based scoring, and detailed reporting.

Result: The paper presents the design and implementation of RubiSCoT framework, demonstrating its capability to provide consistent, scalable thesis evaluation.

Conclusion: RubiSCoT has potential to optimize academic assessment processes through consistent, scalable, and transparent evaluation of academic theses.

Abstract: The evaluation of academic theses is a cornerstone of higher education, ensuring rigor and integrity. Traditional methods, though effective, are time-consuming and subject to evaluator variability. This paper presents RubiSCoT, an AI-supported framework designed to enhance thesis evaluation from proposal to final submission. Using advanced natural language processing techniques, including large language models, retrieval-augmented generation, and structured chain-of-thought prompting, RubiSCoT offers a consistent, scalable solution. The framework includes preliminary assessments, multidimensional assessments, content extraction, rubric-based scoring, and detailed reporting. We present the design and implementation of RubiSCoT, discussing its potential to optimize academic assessment processes through consistent, scalable, and transparent evaluation.

[267] Observer-Aware Probabilistic Planning Under Partial Observability

Salomé Lepers, Vincent Thomas, Olivier Buffet

Main category: cs.AI

TL;DR: This paper extends observer-aware Markov decision processes (OAMDPs) to handle partial observability scenarios, enabling the agent to optimize information transmission to an observer with limited visibility.

Details

Motivation: To address planning problems where an agent must consider how its actions appear to an observer with partial observability, and optimize the information conveyed through observations.

Method: Proposes PO-OAMDPs (partially observable OAMDPs), a framework that extends observer-aware MDPs to handle partial observability and dynamic hidden variables, using the HSVI algorithm with dedicated initializations.

Result: The framework successfully handles more realistic problems with partial observability and dynamic target variables, allowing analysis of legibility, explicability, and predictability properties in observer-aware planning.

Conclusion: PO-OAMDPs provide a comprehensive framework for observer-aware planning under partial observability, enabling agents to strategically optimize information transmission while handling dynamic hidden variables and changing goals during execution.

Abstract: In this article, we are interested in planning problems where the agent is aware of the presence of an observer, and where this observer is in a partial observability situation. The agent has to choose its strategy so as to optimize the information transmitted by observations. Building on observer-aware Markov decision processes (OAMDPs), we propose a framework to handle this type of problems and thus formalize properties such as legibility, explicability and predictability. This extension of OAMDPs to partial observability can not only handle more realistic problems, but also permits considering dynamic hidden variables of interest. These dynamic target variables allow, for instance, working with predictability, or with legibility problems where the goal might change during execution. We discuss theoretical properties of PO-OAMDPs and, experimenting with benchmark problems, we analyze HSVI’s convergence behavior with dedicated initializations and study the resulting strategies.

[268] From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems

Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, Bing Qin

Main category: cs.AI

TL;DR: This paper provides a systematic review of how AI can accelerate and enhance scientific research, categorizing AI applications into hypothesis formulation, hypothesis validation, and manuscript publication, while also discussing challenges, future directions, and available tools.

Details

Motivation: Research is time-consuming and effort-intensive, but recent AI advancements offer opportunities to accelerate and enhance the research process. The paper aims to systematically review and organize AI applications in research to monitor relevant progress.

Method: The authors conducted a systematic review, organizing relevant studies into three main categories: hypothesis formulation (knowledge synthesis and hypothesis generation), hypothesis validation (verification of scientific claims, theorem proving, and experiment validation), and manuscript publication (writing and peer review).

Result: The review provides a comprehensive organization of AI applications in research, identifies current challenges in each category, discusses potential future research directions, and offers an overview of existing benchmarks and tools supporting AI integration into research.

Conclusion: This systematic review serves as an introduction for beginners and aims to foster future research in AI for scientific research, with publicly available resources provided for further exploration.

Abstract: Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.

[269] Artificial Intelligence Index Report 2025

Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, Toby Walsh, Armin Hamrah, Lapo Santarlasci, Julia Betts Lotufo, Alexandra Rome, Andrew Shi, Sukrut Oak

Main category: cs.AI

TL;DR: The AI Index 2025 report provides comprehensive data on AI’s global impact, featuring new analyses on hardware, inference costs, publication trends, and responsible AI adoption in business and science.

Details

Motivation: To equip policymakers, journalists, executives, researchers, and the public with accurate, validated data to make informed decisions about AI development and deployment in an increasingly AI-influenced world.

Method: Longitudinal tracking and comprehensive data collection from global sources, with new in-depth analyses of AI hardware, inference costs, publication/patenting trends, and corporate responsible AI practices.

Result: The report is recognized globally as an authoritative resource, cited in major media and academic papers, and used by policymakers worldwide to understand AI’s current state and future trajectory.

Conclusion: The AI Index remains essential for tracking critical AI trends and providing context in a rapidly advancing field, helping stakeholders navigate AI’s growing influence across society, economy, and governance.

Abstract: Welcome to the eighth edition of the AI Index report. The 2025 Index is our most comprehensive to date and arrives at an important moment, as AI’s influence across society, the economy, and global governance continues to intensify. New in this year’s report are in-depth analyses of the evolving landscape of AI hardware, novel estimates of inference costs, and new analyses of AI publication and patenting trends. We also introduce fresh data on corporate adoption of responsible AI practices, along with expanded coverage of AI’s growing role in science and medicine. Since its founding in 2017 as an offshoot of the One Hundred Year Study of Artificial Intelligence, the AI Index has been committed to equipping policymakers, journalists, executives, researchers, and the public with accurate, rigorously validated, and globally sourced data. Our mission has always been to help these stakeholders make better-informed decisions about the development and deployment of AI. In a world where AI is discussed everywhere - from boardrooms to kitchen tables - this mission has never been more essential. The AI Index continues to lead in tracking and interpreting the most critical trends shaping the field - from the shifting geopolitical landscape and the rapid evolution of underlying technologies, to AI’s expanding role in business, policymaking, and public life. Longitudinal tracking remains at the heart of our mission. In a domain advancing at breakneck speed, the Index provides essential context - helping us understand where AI stands today, how it got here, and where it may be headed next. Recognized globally as one of the most authoritative resources on artificial intelligence, the AI Index has been cited in major media outlets such as The New York Times, Bloomberg, and The Guardian; referenced in hundreds of academic papers; and used by policymakers and government agencies around the world.

[270] Evaluating AI-Driven Automated Map Digitization in QGIS

Diana Febrita

Main category: cs.AI

TL;DR: Deepness AI tool assessed for automated map digitization using Google Earth imagery, compared with OpenStreetMap outputs.

Details

Motivation: Reduce human involvement in map digitization by leveraging AI and machine learning techniques for automated feature extraction.

Method: Used Deepness plugin in QGIS to generate digitization from Google Earth imagery, then compared results with OpenStreetMap digitized outputs.

Result: Performance evaluation of AI-generated digitization against established OpenStreetMap data to assess effectiveness.

Conclusion: Research demonstrates the potential of AI-driven tools like Deepness for automated map digitization processes.

Abstract: Map digitization is an important process that converts maps into digital formats that can be used for further analysis. This process typically requires a deep human involvement because of the need for interpretation and decision-making when translating complex features. With the advancement of artificial intelligence, there is an alternative to conducting map digitization with the help of machine learning techniques. Deepness, or Deep Neural Remote Sensing, is an advanced AI-driven tool designed and integrated as a plugin in QGIS application. This research focuses on assessing the effectiveness of Deepness in automated digitization. This study analyses AI-generated digitization results from Google Earth imagery and compares them with digitized outputs from OpenStreetMap (OSM) to evaluate performance.

[271] The promise and limits of LLMs in constructing proofs and hints for logic problems in intelligent tutoring systems

Sutapa Dey Tithi, Arun Kumar Ramesh, Clara DiMarco, Xiaoyi Tian, Nazia Alam, Kimia Fazeli, Tiffany Barnes

Main category: cs.AI

TL;DR: LLMs can generate accurate logic proof hints (75% accuracy) but need improvements for pedagogical context and explanation depth.

Details

Motivation: Traditional intelligent tutoring systems use template-based explanations that lack personalization, while LLMs offer dynamic feedback but risk hallucinations and unsound pedagogy.

Method: Evaluated 6 prompting techniques across 4 LLMs on 358 logic problems, then used best-performing LLM (DeepSeek-V3) to generate hints for 1,050 student states, evaluated by both LLM grader and human experts.

Result: DeepSeek-V3 achieved 86.7% accuracy on stepwise proof construction, LLM-generated hints were 75% accurate with high human ratings for consistency and clarity, but performed poorly on explaining why hints were provided and their context.

Conclusion: LLMs can augment tutoring systems with logic hints but require modifications to ensure accuracy and pedagogical appropriateness.

Abstract: Intelligent tutoring systems have demonstrated effectiveness in teaching formal propositional logic proofs, but their reliance on template-based explanations limits their ability to provide personalized student feedback. While large language models (LLMs) offer promising capabilities for dynamic feedback generation, they risk producing hallucinations or pedagogically unsound explanations. We evaluated the stepwise accuracy of LLMs in constructing multi-step symbolic logic proofs, comparing six prompting techniques across four state-of-the-art LLMs on 358 propositional logic problems. Results show that DeepSeek-V3 achieved superior performance up to 86.7% accuracy on stepwise proof construction and excelled particularly in simpler rules. We further used the best-performing LLM to generate explanatory hints for 1,050 unique student problem-solving states from a logic ITS and evaluated them on 4 criteria with both an LLM grader and human expert ratings on a 20% sample. Our analysis finds that LLM-generated hints were 75% accurate and rated highly by human evaluators on consistency and clarity, but did not perform as well explaining why the hint was provided or its larger context. Our results demonstrate that LLMs may be used to augment tutoring systems with logic tutoring hints, but require additional modifications to ensure accuracy and pedagogical appropriateness.

[272] Meta-World+: An Improved, Standardized, RL Benchmark

Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K. R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro

Main category: cs.AI

TL;DR: This paper addresses undocumented changes in Meta-World benchmark that hindered fair algorithm comparisons, and releases a new version with reproducibility, better ergonomics, and user control over task sets.

Details

Motivation: To resolve ambiguity in Meta-World benchmark results caused by undocumented changes and provide insights into multi-task/meta-RL benchmark design.

Method: Leveraged past versions of Meta-World to analyze changes and created a new open-source version with full reproducibility and improved technical features.

Result: Developed a new Meta-World version that enables fair algorithm comparisons, maintains reproducibility of past results, and offers enhanced user control.

Conclusion: The new Meta-World release addresses previous reproducibility issues while providing better benchmarking capabilities for multi-task and meta-reinforcement learning research.

Abstract: Meta-World is widely used for evaluating multi-task and meta-reinforcement learning agents, which are challenged to master diverse skills simultaneously. Since its introduction, however, there have been numerous undocumented changes that inhibit a fair comparison of algorithms. This work strives to disambiguate these results from the literature, while also leveraging the past versions of Meta-World to provide insights into multi-task and meta-reinforcement learning benchmark design. Through this process we release a new open-source version of Meta-World (https://github.com/Farama-Foundation/Metaworld/) that has full reproducibility of past results, is more technically ergonomic, and gives users more control over the tasks that are included in a task set.

[273] Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?

Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud

Main category: cs.AI

TL;DR: This paper investigates whether socio-economic indicators like household wealth can be recovered from satellite imagery and internet-sourced text using a multimodal framework that combines vision models, LLMs, and AI search agents.

Details

Motivation: To explore if socio-economic indicators leave recoverable imprints in satellite imagery (physical features) and internet text (historical/economic narratives), and to develop methods for predicting household wealth using multimodal approaches.

Method: Developed a multimodal framework using DHS data from African neighborhoods with five pipelines: (1) vision model on satellite images, (2) LLM using location/year, (3) AI agent searching web text, (4) joint image-text encoder, (5) ensemble of all signals.

Result: Fusing vision and text modalities outperformed vision-only baselines (R-squared 0.77 vs 0.63). LLM-internal knowledge was more effective than agent-retrieved text. Found partial representational convergence with median cosine similarity of 0.60 between modalities. Released dataset of 60,000+ DHS clusters.

Conclusion: Multimodal fusion improves wealth prediction, with LLM knowledge proving more robust than agent-retrieved data. Partial representational convergence supports shared latent codes while retaining complementary details, consistent with Platonic Representation Hypothesis.

Abstract: We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
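The convergence figure above is a median of cosine similarities between paired embeddings. A minimal sketch of that statistic follows, assuming the two modalities have already been mapped into a common space. The alignment step mentioned in the abstract is not reproduced here, and the function names and toy vectors are hypothetical.

```python
import math
from statistics import median


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def median_paired_cosine(vision_embs, text_embs):
    """Median cosine similarity over paired (vision, text) embeddings."""
    return median(cosine_similarity(u, v) for u, v in zip(vision_embs, text_embs))


# Toy, made-up embeddings for three locations (not data from the paper).
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
result = median_paired_cosine(vision, text)
```

The median is less sensitive than the mean to a few badly matched pairs, which may be one reason to report it for a per-cluster similarity.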

[274] LLM Collaboration With Multi-Agent Reinforcement Learning

Shuo Liu, Tianle Chen, Zeyu Liang, Xueguang Lyu, Christopher Amato

Main category: cs.AI

TL;DR: The paper proposes MAGRPO, a multi-agent reinforcement learning method for fine-tuning LLMs to enable effective cooperation, addressing the gap in current LLM training that focuses on individual rather than collaborative performance.

Details

Motivation: Most LLMs are pretrained independently without optimization for coordination, and existing fine-tuning frameworks rely on complex individual reward designs that don't effectively encourage collaboration between agents.

Method: Model LLM collaboration as cooperative MARL and develop Multi-Agent Group Relative Policy Optimization (MAGRPO), a multi-agent multi-turn algorithm building on RL approaches for LLMs and MARL techniques.

Result: Experiments on LLM writing and coding collaboration show that fine-tuning with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation.

Conclusion: The approach opens doors to using other MARL methods for LLMs and highlights associated challenges in multi-agent LLM coordination.

Abstract: A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges. Our code is available at https://github.com/OpenMLRL/CoMLRL.
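The abstract does not give MAGRPO's update rule. As background only, the sketch below shows the group-relative advantage normalization used in GRPO-style methods, applied to a shared joint reward per sampled group member. Treating the reward as shared by all agents is an assumption based on the cooperative framing, and the function name and numbers are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical joint rewards for four sampled joint responses to one prompt.
# In a cooperative setting, every agent would receive the same reward
# for a given joint response (assumption, not stated in the abstract).
joint_rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(joint_rewards)
```

Because the advantages are centered within the group, responses are rewarded only for being better than their siblings, which removes the need for a separate learned baseline in this style of method.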

[275] MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models

Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu

Main category: cs.AI

TL;DR: MSRS is a multi-attribute steering framework that uses orthogonal subspaces to reduce interference between attributes and enables fine-grained control through token-level steering.

Details

Motivation: Existing activation steering methods struggle with multi-attribute control due to interference and trade-offs between different attributes.

Method: Allocates orthogonal subspaces for each attribute, combines attribute-specific and shared subspaces, and uses token-level steering during inference to target semantically relevant tokens.

Result: Significantly reduces attribute conflicts, outperforms existing methods across multiple attributes, and generalizes well to diverse downstream tasks.

Conclusion: MSRS provides an effective solution for multi-attribute steering in LLMs by addressing interference issues through subspace isolation and fine-grained token-level control.

Abstract: Activation steering offers a promising approach to controlling the behavior of Large Language Models by directly manipulating their internal activations. However, most existing methods struggle to jointly steer multiple attributes, often resulting in interference and undesirable trade-offs. To address this challenge, we propose Multi-Subspace Representation Steering (MSRS), a novel framework for effective multi-attribute steering via subspace representation fine-tuning. MSRS reduces inter-attribute interference by allocating orthogonal subspaces to each attribute, isolating their influence within the model’s representation space. MSRS also incorporates a hybrid subspace composition strategy: it combines attribute-specific subspaces for unique steering directions with a shared subspace for common steering directions. A dynamic weighting function learns to efficiently integrate these components for precise control. During inference, MSRS introduces a token-level steering mechanism that dynamically identifies and intervenes on the most semantically relevant tokens, enabling fine-grained behavioral modulation. Experimental results show that MSRS significantly reduces attribute conflicts, surpasses existing methods across a range of attributes, and generalizes effectively to diverse downstream tasks.
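To illustrate why orthogonal subspaces limit interference, the sketch below orthonormalizes two steering directions with Gram-Schmidt and checks that pushing a hidden state along one direction leaves its component along the other unchanged. This is not the MSRS training procedure, which learns its subspaces through fine-tuning. The directions, hidden state, and function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors):
    """Return an orthonormal basis for linearly independent input vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        basis.append([wi / norm for wi in w])
    return basis


def steer(hidden, direction, strength):
    """Shift a hidden state along a unit direction by the given strength."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Hypothetical raw steering directions for attributes A and B.
raw_directions = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
dir_a, dir_b = gram_schmidt(raw_directions)

hidden = [0.2, 0.1, 0.3]
steered = steer(hidden, dir_a, strength=2.0)

# Change in the component along each attribute direction after steering A.
shift_a = dot(steered, dir_a) - dot(hidden, dir_a)
shift_b = dot(steered, dir_b) - dot(hidden, dir_b)
```

Here `shift_a` equals the steering strength while `shift_b` stays at zero up to floating-point error, which is the isolation property the orthogonality constraint is meant to provide.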

[276] Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints

Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Masoud Mansoury, Michael Cochez, Martijn Schut

Main category: cs.AI

TL;DR: This paper introduces query answering with soft constraints over incomplete knowledge graphs, proposing two efficient methods to incorporate vague or context-dependent preferences while maintaining original query answer rankings.

Details

Motivation: Existing query answering methods focus on first-order-logic queries but real-world queries often involve vague or context-dependent constraints that current approaches don't handle.

Method: Two lightweight methods that adjust query answer scores by incorporating soft constraints without disrupting original answers: one requiring tuning only two parameters, and another using a small neural network trained to capture soft constraints while preserving ranking structure.

Result: Methods successfully capture soft constraints while maintaining robust query answering performance with minimal overhead, as demonstrated through experiments on extended QA benchmarks with generated soft constraint datasets.

Conclusion: The proposed approach effectively addresses the gap in handling soft constraints for query answering over incomplete knowledge graphs, providing practical solutions that maintain performance while adding little computational overhead.

Abstract: Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We formalize the problem and introduce two efficient methods designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. These methods are lightweight, requiring tuning only two parameters or a small neural network trained to capture soft constraints while maintaining the original ranking structure. To evaluate the task, we extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that our methods can capture soft constraints while maintaining robust query answering performance and adding very little overhead.

[277] Can AI Perceive Physical Danger and Intervene?

Abhishek Jindal, Dmitry Kalashnikov, R. Alex Hofer, Oscar Chang, Divya Garikapati, Anirudha Majumdar, Pierre Sermanet, Vikas Sindhwani

Main category: cs.AI

TL;DR: The paper develops a scalable physical safety benchmark for Embodied AI systems, analyzes foundation models’ safety understanding, and creates a post-training method to improve safety reasoning with interpretable thinking traces.

Details

Motivation: To address the direct physical safety risks when AI interacts with the physical world, beyond digital AI safety concerns, by assessing how well foundation models understand common-sense physical safety facts.

Method: Created a scalable physical safety benchmark using real-world injury narratives and operational constraints converted to photorealistic images/videos via generative models. Analyzed major foundation models’ risk perception and safety reasoning, then developed a post-training paradigm to teach explicit safety constraint reasoning.

Result: Comprehensive analysis revealed insights into foundation models’ deployment readiness for safety-critical applications. The post-training approach achieved state-of-the-art performance in constraint satisfaction evaluations with interpretable safety reasoning traces.

Conclusion: The work provides a robust framework for benchmarking physical safety in Embodied AI, demonstrates current limitations in foundation models’ safety understanding, and offers an effective method to improve safety reasoning with transparency.

Abstract: When AI interacts with the physical world, as a robot or an assistive agent, new safety challenges emerge beyond those of purely “digital AI”. In such interactions, the potential for physical harm is direct and immediate. How well do state-of-the-art foundation models understand common-sense facts about physical safety, e.g., that a box may be too heavy to lift, or that a hot cup of coffee should not be handed to a child? In this paper, our contributions are three-fold: first, we develop a highly scalable approach to continuous physical safety benchmarking of Embodied AI systems, grounded in real-world injury narratives and operational safety constraints. To probe multi-modal safety understanding, we turn these narratives and constraints into photorealistic images and videos capturing transitions from safe to unsafe states, using advanced generative models. Secondly, we comprehensively analyze the ability of major foundation models to perceive risks, reason about safety, and trigger interventions; this yields multi-faceted insights into their deployment readiness for safety-critical agentic applications. Finally, we develop a post-training paradigm to teach models to explicitly reason about embodiment-specific safety constraints provided through system instructions. The resulting models generate thinking traces that make safety reasoning interpretable and transparent, achieving state-of-the-art performance in constraint satisfaction evaluations. The benchmark is released at https://asimov-benchmark.github.io/v2

[278] How LLMs Learn to Reason: A Complex Network Perspective

Sihan Hu, Xiansheng Cai, Yuan Huang, Zhiyuan Yao, Linfeng Zhang, Pan Zhang, Youjin Deng, Kun Chen

Main category: cs.AI

TL;DR: RLVR training in LLMs shows puzzling behaviors like two-stage learning curves and catastrophic forgetting, which are explained as emergent phenomena from the topological evolution of latent reasoning graphs in semantic space.

Details

Motivation: To understand the puzzling behaviors in RLVR-trained LLMs (two-stage learning, V-shaped response lengths, catastrophic forgetting) and provide a unified physical explanation beyond neural implementation details.

Method: Proposed a dynamical isomorphism between a 1.5B-parameter LLM and a minimal Concept Network Model (CoNet), analyzing the topological evolution of latent reasoning graphs. Developed Annealed-RLVR algorithm with targeted SFT “heating” to resolve topological bottlenecks.

Result: The geometric perspective explains observed anomalies: V-shaped trajectory tracks evolution from local to global optimization, catastrophic forgetting from topological disconnection, policy collapse from sequential transitions at leaf nodes. Annealed-RLVR outperforms standard RLVR on benchmarks including Minerva and AIME.

Conclusion: RLVR can be recast from black-box optimization into a predictable process of structural self-organization, providing new physical intuition for engineering emergent reasoning capabilities in future AI systems.

Abstract: Training large language models with Reinforcement Learning with Verifiable Rewards (RLVR) exhibits a set of distinctive and puzzling behaviors that remain poorly understood, including a two-stage learning curve, a V-shaped response-length trajectory, and a pronounced vulnerability to catastrophic forgetting. In this work, we propose that these behaviors are emergent collective phenomena governed not by neural implementation details, but by the topological evolution of the latent reasoning graph in semantic space. By demonstrating a dynamical isomorphism between a 1.5B-parameter LLM and a minimal Concept Network Model (CoNet), we trace the causal source to the self-organization of a sparse concept web pinned to an average degree of two. This geometric perspective provides a unified physical explanation for the observed anomalies: the V-shaped trajectory tracks the evolution from parallel local skill optimization to global network integration; catastrophic forgetting stems from the topological disconnection of critical “trunk” edges; and policy collapse arises from the accumulation of sequential transitions at the web's leaf nodes, where broad exploration abruptly freezes into rigid, high-reward trajectories. Identifying a “maximally frustrated state” at the transition between learning stages, we propose Annealed-RLVR, a principled algorithm that injects a targeted SFT “heating” step to resolve this topological bottleneck. Experiments confirm that this theory-driven intervention outperforms standard RLVR on both in-distribution and out-of-distribution benchmarks (including Minerva and AIME). By recasting RLVR from black-box optimization into a predictable process of structural self-organization, our work provides a new physical intuition for engineering the emergent reasoning capabilities of future AI systems.

[279] CharCom: Composable Identity Control for Multi-Character Story Illustration

Zhongsheng Wang, Ming Lin, Zhedong Lin, Yaser Shakib, Qian Liu, Jiamou Liu

Main category: cs.AI

TL;DR: CharCom is a modular framework using composable LoRA adapters to maintain character identity consistency in diffusion-based text-to-image generation, enabling efficient per-character customization without retraining the base model.

Details

Motivation: Character identity consistency across varying prompts remains a fundamental limitation in diffusion-based text-to-image generation, making it challenging to maintain consistent character appearances in multi-scene narratives.

Method: Built on a frozen diffusion backbone, CharCom dynamically composes LoRA adapters at inference using prompt-aware control, creating a modular and parameter-efficient framework for character-consistent story illustration.

Result: Experiments show CharCom significantly enhances character fidelity, semantic alignment, and temporal coherence, remaining robust in crowded scenes and enabling scalable multi-character generation with minimal overhead.

Conclusion: CharCom is well-suited for real-world applications such as story illustration and animation, providing an efficient solution for maintaining character consistency in text-to-image generation.

Abstract: Ensuring character identity consistency across varying prompts remains a fundamental limitation in diffusion-based text-to-image generation. We propose CharCom, a modular and parameter-efficient framework that achieves character-consistent story illustration through composable LoRA adapters, enabling efficient per-character customization without retraining the base model. Built on a frozen diffusion backbone, CharCom dynamically composes adapters at inference using prompt-aware control. Experiments on multi-scene narratives demonstrate that CharCom significantly enhances character fidelity, semantic alignment, and temporal coherence. It remains robust in crowded scenes and enables scalable multi-character generation with minimal overhead, making it well-suited for real-world applications such as story illustration and animation.

[280] ResearStudio: A Human-Intervenable Framework for Building Controllable Deep-Research Agents

Linyi Yang, Yixuan Weng

Main category: cs.AI

TL;DR: ResearStudio is an open-source framework that enables real-time human control over deep-research agents, allowing users to pause, edit plans/code, and switch between AI-led and human-led modes while achieving state-of-the-art performance on GAIA benchmark.

Details

Motivation: Current deep-research agents operate in a 'fire-and-forget' mode without allowing users to fix errors or add expert knowledge during execution, limiting their practical utility.

Method: Uses a Collaborative Workshop design in which a hierarchical Planner-Executor writes every step to a live ‘plan-as-document’ and a fast communication layer streams actions to a web interface, enabling real-time human intervention.

Result: Achieves state-of-the-art results on GAIA benchmark, surpassing systems like OpenAI’s DeepResearch and Manus, while maintaining fine-grained human control.

Conclusion: Strong automated performance and fine-grained human control can coexist, enabling safe and controllable research agents through real-time collaboration between humans and AI.

Abstract: Current deep-research agents run in a “fire-and-forget” mode: once started, they give users no way to fix errors or add expert knowledge during execution. We present ResearStudio, the first open-source framework that places real-time human control at its core. The system follows a Collaborative Workshop design. A hierarchical Planner-Executor writes every step to a live “plan-as-document,” and a fast communication layer streams each action, file change, and tool call to a web interface. At any moment, the user can pause the run, edit the plan or code, run custom commands, and resume, switching smoothly between AI-led, human-assisted and human-led, AI-assisted modes. In fully autonomous mode, ResearStudio achieves state-of-the-art results on the GAIA benchmark, surpassing systems like OpenAI’s DeepResearch and Manus. These results show that strong automated performance and fine-grained human control can coexist. The full code, protocol, and evaluation scripts are available at https://github.com/ResearAI/ResearStudio. We will continue to update the repository to encourage further work on safe and controllable research agents. Our live demo is publicly accessible at http://ai-researcher.net:3000/. We support the development of DeepScientist, which can be accessed at https://github.com/ResearAI/DeepScientist.

[281] Structured Debate Improves Corporate Credit Reasoning in Financial AI

Yoonjin Lee, Munhee Kim, Hanbi Choi, Juhyeon Park, Seungho Lyoo, Woojin Park

Main category: cs.AI

TL;DR: This paper develops two LLM-based systems for corporate credit assessment that automate evidence-based reasoning from non-financial indicators, with a multi-agent debate system outperforming a single-agent system in reasoning quality.

Details

Motivation: Existing financial AI focuses on numerical prediction but lacks support for interpretive judgments in loan evaluation, particularly for qualitative non-financial indicators that resist formalization but significantly influence loan repayment outcomes.

Method: Developed two LLM-based systems: 1) Non-adversarial single-agent system (NAS) with single-pass reasoning pipeline, and 2) Debate-based multi-agent system (KPD-MADS) using Karl Popper’s critical dialogue framework with ten-step structured interaction protocol. Both systems were tested on three real corporate cases and evaluated by credit risk professionals.

Result: Both systems achieved substantial productivity gains (NAS: 11.55s per case; KPD-MADS: 91.97s; human baseline: 1920s). KPD-MADS demonstrated superior reasoning quality with higher median ratings in explanatory adequacy (4.0 vs. 3.0), practical applicability (4.0 vs. 3.0), and usability (62.5 vs. 52.5).

Conclusion: Structured multi-agent interaction enhances reasoning rigor and interpretability in financial AI, advancing scalable and defensible automation in corporate credit assessment.

Abstract: Despite advances in financial AI, the automation of evidence-based reasoning remains unresolved in corporate credit assessment, where qualitative non-financial indicators exert decisive influence on loan repayment outcomes yet resist formalization. Existing approaches focus predominantly on numerical prediction and provide limited support for the interpretive judgments required in professional loan evaluation. This study develops and evaluates two operational large language model (LLM)-based systems designed to generate structured reasoning from non-financial evidence. The first is a non-adversarial single-agent system (NAS) that produces bidirectional analysis through a single-pass reasoning pipeline. The second is a debate-based multi-agent system (KPD-MADS) that operationalizes adversarial verification through a ten-step structured interaction protocol grounded in Karl Popper’s critical dialogue framework. Both systems were applied to three real corporate cases and evaluated by experienced credit risk professionals. Compared to manual expert reporting, both systems achieved substantial productivity gains (NAS: 11.55 s per case; KPD-MADS: 91.97 s; human baseline: 1920 s). The KPD-MADS demonstrated superior reasoning quality, receiving higher median ratings in explanatory adequacy (4.0 vs. 3.0), practical applicability (4.0 vs. 3.0), and usability (62.5 vs. 52.5). These findings show that structured multi-agent interaction can enhance reasoning rigor and interpretability in financial AI, advancing scalable and defensible automation in corporate credit assessment.
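The reported per-case times imply the following speedups over the human baseline. The sketch only performs the division on the three figures quoted in the abstract, and the variable names are illustrative.

```python
# Per-case times in seconds, as reported in the abstract.
HUMAN_BASELINE_S = 1920.0
system_times_s = {"NAS": 11.55, "KPD-MADS": 91.97}

# Speedup factor relative to manual expert reporting.
speedups = {name: HUMAN_BASELINE_S / t for name, t in system_times_s.items()}

# How much slower the debate system is than the single-agent system.
debate_overhead = system_times_s["KPD-MADS"] / system_times_s["NAS"]

for name, factor in speedups.items():
    print(f"{name}: about {factor:.1f}x faster than the human baseline")
print(f"KPD-MADS takes about {debate_overhead:.1f}x as long as NAS")
```

This works out to roughly 166x for NAS and roughly 21x for KPD-MADS, with the debate protocol costing about 8x the single-agent runtime in exchange for the higher expert ratings.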

[282] RTMol: Rethinking Molecule-text Alignment in a Round-trip View

Letian Chen, Runhan Shi, Gufeng Yu, Yang Yang

Main category: cs.AI

TL;DR: RTMol is a bidirectional alignment framework that unifies molecular captioning and text-to-SMILES generation through self-supervised round-trip learning, addressing limitations of existing methods and improving bidirectional alignment by up to 47%.

Details

Motivation: Existing methods treat molecular captioning and text-based molecular design as separate tasks, facing limitations in chemical accuracy, ambiguous training data, and bidirectional inconsistency between generation directions.

Method: Proposes RTMol framework with self-supervised round-trip learning, novel round-trip evaluation metrics, and enables unsupervised training for molecular captioning without requiring paired molecule-text corpora.

Result: RTMol enhances bidirectional alignment performance by up to 47% across various LLMs, demonstrating improved joint molecule-text understanding and generation capabilities.

Conclusion: RTMol establishes an effective paradigm for unified molecule-text understanding and generation, addressing key limitations in existing approaches through bidirectional alignment and round-trip learning.

Abstract: Aligning molecular sequence representations (e.g., SMILES notations) with textual descriptions is critical for applications spanning drug discovery, materials design, and automated chemical literature analysis. Existing methodologies typically treat molecular captioning (molecule-to-text) and text-based molecular design (text-to-molecule) as separate tasks, relying on supervised fine-tuning or contrastive learning pipelines. These approaches face three key limitations: (i) conventional metrics like BLEU prioritize linguistic fluency over chemical accuracy, (ii) training datasets frequently contain chemically ambiguous narratives with incomplete specifications, and (iii) independent optimization of generation directions leads to bidirectional inconsistency. To address these issues, we propose RTMol, a bidirectional alignment framework that unifies molecular captioning and text-to-SMILES generation through self-supervised round-trip learning. The framework introduces novel round-trip evaluation metrics and enables unsupervised training for molecular captioning without requiring paired molecule-text corpora. Experiments demonstrate that RTMol enhances bidirectional alignment performance by up to 47% across various LLMs, establishing an effective paradigm for joint molecule-text understanding and generation.
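The round-trip idea can be sketched as a consistency check: caption a molecule, regenerate a molecule from that caption, and compare it with the original. The abstract does not define RTMol's metrics, so the match-rate function below is only an illustration. The lookup tables stand in for real models and are hypothetical, and exact string comparison is a simplifying assumption.

```python
def round_trip_match_rate(molecules, to_text, to_molecule):
    """Fraction of molecules recovered exactly after molecule -> text -> molecule."""
    hits = sum(1 for m in molecules if to_molecule(to_text(m)) == m)
    return hits / len(molecules)


# Hypothetical stand-ins for a captioning model and a text-to-SMILES model.
toy_captions = {"CCO": "ethanol", "C": "methane", "O": "water"}
toy_designs = {"ethanol": "CCO", "methane": "C", "water": "[OH2]"}

rate = round_trip_match_rate(
    molecules=["CCO", "C", "O"],
    to_text=toy_captions.get,
    to_molecule=toy_designs.get,
)
```

In this toy run, water is returned as `[OH2]` rather than `O`, so it counts as a miss even though both strings denote the same molecule. A practical check would canonicalize SMILES before comparing.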

[283] KRAL: Knowledge and Reasoning Augmented Learning for LLM-assisted Clinical Antimicrobial Therapy

Zhe Li, Yehan Qiu, Yujie Chen, Xiang Zhou

Main category: cs.AI

TL;DR: KRAL is a low-cost, privacy-preserving framework that enhances clinical LLMs by distilling teacher-model reasoning via reverse generation, using heuristic learning for data augmentation, and agentic reinforcement learning to improve knowledge and reasoning capabilities.

Details

Motivation: Current LLMs face limitations in clinical decision-making due to knowledge gaps, privacy concerns, high costs, and limited reasoning capabilities, making them unsuitable for high-stakes medical applications.

Method: Uses teacher-model reasoning distillation via answer-to-question reverse generation, heuristic learning for semi-supervised data augmentation (reducing manual annotation by ~80%), and agentic reinforcement learning to jointly enhance medical knowledge and reasoning while optimizing computational efficiency.

Result: Significantly outperforms RAG and SFT methods: improves knowledge QA (Accuracy@1 on MEDQA: +1.8% vs SFT, +3.6% vs RAG) and reasoning (Pass@1 on PUMCH Antimicrobial: +27% vs SFT, +27.2% vs RAG) at ~20% of SFT’s long-term training costs.

Conclusion: KRAL establishes an effective solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support systems.

Abstract: Clinical antimicrobial therapy requires the dynamic integration of pathogen profiles, host factors, pharmacological properties of antimicrobials, and the severity of infection. This complexity imposes fundamental limitations on the applicability of Large Language Models (LLMs) in high-stakes clinical decision-making, including knowledge gaps, data privacy concerns, high deployment costs, and limited reasoning capabilities. To address these challenges, we propose KRAL (Knowledge and Reasoning Augmented Learning), a low-cost, scalable, privacy-preserving paradigm that leverages teacher-model reasoning to automatically distill knowledge and reasoning trajectories via answer-to-question reverse generation, employs heuristic learning for semi-supervised data augmentation (reducing manual annotation requirements by approximately 80%), and utilizes agentic reinforcement learning to jointly enhance medical knowledge and reasoning while optimizing computational and memory efficiency. A hierarchical evaluation employing diverse teacher-model proxies reduces assessment costs, while modular interface design facilitates seamless system updates. Experimental results demonstrate that KRAL significantly outperforms traditional Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT) methods. It improves knowledge question-answering capability (Accuracy@1 on the external open-source benchmark MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG) and reasoning capability (Pass@1 on the external benchmark PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG), achieved at about 20% of SFT’s long-term training costs. This establishes KRAL as an effective solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support.

[284] Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning

Pei Yang, Ke Zhang, Ji Wang, Xiao Chen, Yuxin Tang, Eric Yang, Lynn Ai, Bill Shi

Main category: cs.AI

TL;DR: CRM replaces single reward models with a team of specialist evaluators for better robustness and interpretability in RLHF, using multi-agent collaboration and a centralized aggregator.

Details

Motivation: Conventional reward models struggle with optimizing multiple conflicting preference dimensions and lack transparency in scoring decisions.

Method: Decomposes preference evaluation into domain-specific agents with partial signals, uses global evaluators, and a centralized aggregator to fuse signals with factors like step-wise correctness and multi-agent agreement.

Result: Enables multi-perspective reward shaping without additional human annotations, compatible with standard RL pipelines using advantage-based updates.

Conclusion: CRM and rewardBench provide a modular path to more transparent reward modeling and stable optimization in RLHF.

Abstract: We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.
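
The aggregation and advantage steps described above can be sketched as follows. This is a minimal illustration, assuming the aggregator is a weighted sum of per-timestep evaluator signals and that advantages come from standard GAE; the signal names, weights, and the linear fusion rule are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def aggregate_rewards(signals: Dict[str, List[float]],
                      weights: Dict[str, float]) -> List[float]:
    """Fuse per-timestep evaluator signals into one reward per step (weighted sum)."""
    steps = len(next(iter(signals.values())))
    return [sum(weights[name] * values[t] for name, values in signals.items())
            for t in range(steps)]


def gae(rewards: List[float], values: List[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation; `values` carries one extra bootstrap entry."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then exponentially weighted accumulation.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical signals for a three-step response.
signals = {
    "step_correctness": [1.0, 0.0, 1.0],
    "agent_agreement": [0.5, 0.5, 1.0],
    "repetition": [0.0, 1.0, 0.0],
}
# A negative weight turns the repetition signal into a penalty.
weights = {"step_correctness": 1.0, "agent_agreement": 0.5, "repetition": -0.5}

rewards = aggregate_rewards(signals, weights)
advantages = gae(rewards, values=[0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
```

With a zero value baseline and gamma = lam = 1, the advantages reduce to reward-to-go sums, which makes the toy example easy to check by hand.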

[285] You Only Forward Once: An Efficient Compositional Judging Paradigm

Tianlong Zhang, Hongwei Xue, Shilin Yan, Di Wu, Chen Xu, Yunyun Yang

Main category: cs.AI

TL;DR: YOFO is a template-conditioned method that enables multimodal LLMs to judge all requirements in a single forward pass, achieving orders-of-magnitude speedups while maintaining interpretability and state-of-the-art performance.

Details

Motivation: Existing MLLM judging approaches face a trade-off: single-score adaptation misaligns with generative nature and limits fine-grained understanding, while autoregressive analysis generation is too slow for high-throughput settings.

Method: YOFO uses a structured requirement template and reads logits of final tokens associated with each requirement to produce binary yes/no decisions in one inference step, built on an autoregressive model.

Result: Extensive experiments show YOFO achieves state-of-the-art results on standard recommendation datasets, supports dependency-aware analysis, and benefits from post-hoc CoT while providing orders-of-magnitude speedups.

Conclusion: YOFO successfully addresses the speed-accuracy trade-off in MLLM judging by enabling single-pass requirement verification while preserving interpretability and supporting advanced analysis features.

Abstract: Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis – where subsequent judgments are conditioned on previous ones – and further benefits from post-hoc CoT.
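
The single-pass readout described above can be sketched as follows. This is an illustrative reading, assuming each decision is a two-way softmax over the "yes" and "no" logits at the final token of each requirement; the token ids, the 0.5 threshold, and the toy logits are hypothetical.

```python
import math
from typing import List


def judge_requirements(logits: List[List[float]], positions: List[int],
                       yes_id: int, no_id: int) -> List[bool]:
    """Return one yes/no decision per requirement from a single forward pass.

    logits[p] is the vocabulary logit vector at sequence position p;
    positions[i] is the final token position of requirement i.
    """
    decisions = []
    for p in positions:
        yes, no = logits[p][yes_id], logits[p][no_id]
        # Two-way softmax restricted to the "yes" and "no" logits.
        p_yes = 1.0 / (1.0 + math.exp(no - yes))
        decisions.append(p_yes >= 0.5)
    return decisions


# Toy logits for a 4-token sequence over a 3-token vocabulary (hypothetical values).
toy_logits = [
    [0.1, 0.2, 0.3],
    [2.0, -1.0, 0.0],   # requirement 1 ends here: "yes" is favoured
    [0.0, 0.0, 0.0],
    [-0.5, 1.5, 0.0],   # requirement 2 ends here: "no" is favoured
]
decisions = judge_requirements(toy_logits, positions=[1, 3], yes_id=0, no_id=1)
```

Because all positions are read from the same logit matrix, the cost is one forward pass regardless of how many requirements the template contains.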

cs.SD

[286] MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core

Callie C. Liao, Duoduo Liao, Ellie L. Zhang

Main category: cs.SD

TL;DR: MusicAIR is a multimodal AI music generation framework that uses an algorithm-driven symbolic music core to generate music from lyrics, text, and images while mitigating copyright risks.

Details

Motivation: Address concerns about copyright infringement and high computational costs in neural-based music generation models by creating a novel algorithm-driven approach.

Method: Uses a symbolic music core algorithm that connects lyrical and rhythmic information to automatically derive musical features and create complete melodic scores from lyrics. Developed GenAIM web tool for lyric-to-song, text-to-music, and image-to-music generation.

Result: Achieves 85% average key confidence (outperforming human composers at 79%), generates compositions that align with music theory standards, and produces diverse, human-like music.

Conclusion: MusicAIR serves as an effective co-pilot tool for music composition assistance and educational tutoring, lowering entry barriers for aspiring musicians while innovating AI music generation.

Abstract: Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high-performance costs. In contrast, we propose MusicAIR, an innovative multimodal AI music generation framework powered by a novel algorithm-driven symbolic music core, effectively mitigating copyright infringement risks. The music core algorithms connect critical lyrical and rhythmic information to automatically derive musical features, creating a complete, coherent melodic score solely from the lyrics. The MusicAIR framework facilitates music generation from lyrics, text, and images. The generated score adheres to established principles of music theory, lyrical structure, and rhythmic conventions. We developed Generate AI Music (GenAIM), a web tool using MusicAIR for lyric-to-song, text-to-music, and image-to-music generation. In our experiments, we evaluated AI-generated music scores produced by the system using both standard music metrics and innovative analysis that compares these compositions with original works. The system achieves an average key confidence of 85%, outperforming human composers at 79%, and aligns closely with established music theory standards, demonstrating its ability to generate diverse, human-like compositions. As a co-pilot tool, GenAIM can serve as a reliable music composition assistant and a possible educational composition tutor while simultaneously lowering the entry barrier for all aspiring musicians, which is innovative and significantly contributes to AI for music generation.

[287] Device-Guided Music Transfer

Manh Pham Hung, Changshuo Hu, Ting Dang, Dong Ma

Main category: cs.SD

TL;DR: DeMT enables device-guided music transfer by using a vision-language model to extract device embeddings from speaker frequency response curves, then conditions a hybrid transformer for effective speaker-style transfer and few-shot adaptation.

Details

Motivation: Existing music transfer methods focus on timbre, rhythm, harmony, or instrumentation to mimic genres/artists, but overlook the diverse hardware properties of playback devices (speakers).

Method: Process speaker frequency response curves as line graphs using a vision-language model to extract device embeddings, then condition a hybrid transformer via feature-wise linear modulation. Fine-tuned on a self-collected dataset.

Result: DeMT enables effective speaker-style transfer and robust few-shot adaptation for unseen devices, supporting applications like device-style augmentation and quality enhancement.

Conclusion: The proposed DeMT method successfully addresses the gap in considering playback device hardware properties for music transfer, achieving effective device-style transfer and adaptation capabilities.

Abstract: Device-guided music transfer adapts playback across unseen devices for users who lack them. Existing methods mainly focus on modifying the timbre, rhythm, harmony, or instrumentation to mimic genres or artists, overlooking the diverse hardware properties of the playback device (i.e., speaker). Therefore, we propose DeMT, which processes a speaker’s frequency response curve as a line graph using a vision-language model to extract device embeddings. These embeddings then condition a hybrid transformer via feature-wise linear modulation. Fine-tuned on a self-collected dataset, DeMT enables effective speaker-style transfer and robust few-shot adaptation for unseen devices, supporting applications like device-style augmentation and quality enhancement.
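
Feature-wise linear modulation (FiLM), the conditioning mechanism named above, can be sketched as follows. The sketch assumes the device embedding is projected by two dense layers to a per-channel scale and shift; the layer shapes and all numeric values are hypothetical and say nothing about DeMT's actual architecture.

```python
from typing import List


def linear(vec: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Dense layer mapping a device embedding to per-channel parameters."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def film(features: List[List[float]], gamma: List[float],
         beta: List[float]) -> List[List[float]]:
    """Feature-wise linear modulation: y = gamma * x + beta, applied per channel."""
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)] for frame in features]


# Hypothetical 2-dimensional device embedding and projection weights.
device_embedding = [1.0, 2.0]
gamma = linear(device_embedding, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
beta = linear(device_embedding, [[0.5, 0.0], [0.0, 0.0]], [0.0, -1.0])

# Two frames of 2-channel audio features, modulated by the device embedding.
modulated = film([[1.0, 1.0], [2.0, 3.0]], gamma, beta)
```

The same scale and shift apply to every frame, so the device embedding changes how each channel is rendered without altering the temporal structure of the input.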

[288] Is Phase Really Needed for Weakly-Supervised Dereverberation?

Marius Rodrigues, Louis Bahrman, Roland Badeau, Gaël Richard

Main category: cs.SD

TL;DR: The wet phase in reverberant speech carries limited useful information for dereverberation due to late reverberation perturbing phase components with white noise, except at low frequencies.

Details

Motivation: To understand how much information can be retrieved from reverberant speech alone in unsupervised/weakly-supervised dereverberation, particularly investigating the role of the reverberant phase.

Method: Used Statistical Wave Field Theory to analyze phase perturbations, then trained dereverberation models under weak supervision while excluding the reverberant phase from the loss function.

Result: Performance of dereverberation models was significantly improved by excluding the reverberant phase from the loss function.

Conclusion: The wet phase carries limited useful information and is not essential for weakly supervised dereverberation, as late reverberation perturbs phase components with white noise.

Abstract: In unsupervised or weakly-supervised approaches for speech dereverberation, the target clean (dry) signals are considered to be unknown during training. In that context, evaluating to what extent information can be retrieved from the sole knowledge of reverberant (wet) speech becomes critical. This work investigates the role of the reverberant (wet) phase in the time-frequency domain. Based on Statistical Wave Field Theory, we show that late reverberation perturbs phase components with white, uniformly distributed noise, except at low frequencies. Consequently, the wet phase carries limited useful information and is not essential for weakly supervised dereverberation. To validate this finding, we train dereverberation models under a recent weak supervision framework and demonstrate that performance can be significantly improved by excluding the reverberant phase from the loss function.
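
Excluding phase from a spectral loss can be illustrated with a small comparison. This sketch is not the paper's weak-supervision loss; it only assumes a mean squared error over time-frequency bins and shows how a magnitude-only variant ignores phase differences. The bin values are hypothetical.

```python
import cmath
from typing import List


def complex_loss(est: List[complex], ref: List[complex]) -> float:
    """Mean squared error on complex bins: sensitive to magnitude and phase."""
    return sum(abs(e - r) ** 2 for e, r in zip(est, ref)) / len(ref)


def magnitude_loss(est: List[complex], ref: List[complex]) -> float:
    """Mean squared error on magnitudes only: the phase is ignored."""
    return sum((abs(e) - abs(r)) ** 2 for e, r in zip(est, ref)) / len(ref)


# Two time-frequency bins with identical magnitudes but different phases.
ref = [cmath.rect(1.0, 0.3), cmath.rect(0.5, -1.2)]
est = [cmath.rect(1.0, 2.0), cmath.rect(0.5, 0.4)]

loss_with_phase = complex_loss(est, ref)
loss_without_phase = magnitude_loss(est, ref)
```

Here the complex loss penalizes the estimate heavily even though its magnitudes are exact. If the reference phase is close to uniform noise, as the abstract argues for late reverberation, that penalty carries little useful training signal.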

[289] The Artist is Present: Traces of Artists Residing and Spawning in Text-to-Audio AI

Guilherme Coelho

Main category: cs.SD

TL;DR: Text-to-audio systems can generate artist-specific content through strategic metatag-based prompt engineering, revealing that artists’ works are used as training data without consent or attribution.

Details

Motivation: To investigate how TTA systems trained on undisclosed datasets can reproduce specific artists' styles through prompt engineering, raising concerns about consent and attribution.

Method: Systematic exploration of metatag-based prompt engineering using descriptor constellations from public music taxonomies to microlocate artist-conditioned regions.

Result: Successfully demonstrated reproducible proximity to artists like Bon Iver, Philip Glass, Panda Bear and William Basinski, showing stable text-audio correspondences for artist-specific styles.

Conclusion: The findings reveal ethical concerns about governance, attribution, consent, and creative practice boundaries in algorithmic music generation systems.

Abstract: Text-to-audio (TTA) systems are rapidly transforming music creation and distribution, with platforms like Udio and Suno generating thousands of tracks daily and integrating into mainstream music platforms and ecosystems. These systems, trained on vast and largely undisclosed datasets, are fundamentally reshaping how music is produced, reproduced and consumed. This paper presents empirical evidence that artist-conditioned regions can be systematically microlocated through metatag-based prompt design, effectively enabling the spawning of artist-like content through strategic prompt engineering. Through systematic exploration of metatag-based prompt engineering techniques, this research reveals how users can access the distinctive sonic signatures of specific artists, evidencing their inclusion in training datasets. Using descriptor constellations drawn from public music taxonomies, the paper demonstrates reproducible proximity to artists such as Bon Iver, Philip Glass, Panda Bear and William Basinski. The results indicate stable text-audio correspondences consistent with artist-specific training signals, enabling precise traversal of stylistic microlocations without explicitly naming artists. This capacity to summon artist-specific outputs shows that artists’ creative works function as foundational material from which these systems generate new content, often without explicit consent or attribution. Conceptually, the work clarifies how textual descriptors act as navigational cues in high-dimensional representation spaces; methodologically, it provides a replicable protocol for auditing stylistic inducibility. The findings raise immediate questions for governance (attribution, consent and disclosure standards) and for creative practice, where induced stylistic proximity complicates boundaries between ownership, reproduction, imitation, creative agency and the ethics of algorithmic creation.

[290] AI in Music and Sound: Pedagogical Reflections, Post-Structuralist Approaches and Creative Outcomes in Seminar Practice

Guilherme Coelho

Main category: cs.SD

TL;DR: A pedagogical course on AI in music that uses paired-études design to teach various AI modalities through both intended use and deliberate misuse, fostering technical fluency and critical awareness.

Details

Motivation: To develop a pedagogical framework that combines technical AI skills with critical media literacy, preparing students to engage creatively and critically with AI technologies in music.

Method: Paired-études design where each AI modality is explored through intended affordances followed by deliberate misuse exercises; course combines theoretical reflection with practice-based experimentation across symbolic composition, voice synthesis, timbre transfer, neural audio synthesis, and text-to-audio systems.

Result: Students demonstrated growth in technical fluency, medium awareness, critical literacy, experimental method, and process-oriented listening; course produced design patterns for AI-music pedagogy including prompt-conditioned interplays and semantic destabilization.

Conclusion: Integrating creative practice with medium awareness and cultural-epistemic analysis prepares students to participate in how AI is understood, developed, and deployed within creative communities.

Abstract: This paper presents a pedagogical and conceptual account of the course AI in Music and Sound: Modalities, Tools and Creative Applications, offered within the Music Informatics and Media Art module of an M.Sc. in Audio Communication. The course engaged students with a range of AI modalities such as symbolic composition, voice synthesis, timbre transfer, neural audio synthesis, and text-to-audio systems, combining theoretical reflection with practice-based experimentation. Its central pedagogical move is a paired-études design: each modality is approached first through its intended affordances and then through a deliberately reframed or “misused” exercise that surfaces representational limits and alternative behaviours. Framed by medium theory and post-structuralist inquiry, we treated AI as a transmodal conduit: a system that translates and perturbs musical signs across textual, symbolic, timbral and audio domains. Evidence from student work and reflection indicates growth in technical fluency, medium awareness, and critical literacy, alongside the cultivation of experimental method and process-oriented listening. The paper outlines the course architecture, assessment design, and representative projects, and distils a set of design patterns for AI-music pedagogy (e.g., prompt-conditioned interplays and semantic destabilisation in text-to-audio; latent space materialism in timbre transfer). It concludes with pedagogical recommendations that integrate creative practice with medium awareness and with cultural-epistemic analysis of AI technologies, preparing students to participate in how AI is understood, developed, and deployed within creative communities.

[291] Semantic and Semiotic Interplays in Text-to-Audio AI: Exploring Cognitive Dynamics and Musical Interactions

Guilherme Coelho

Main category: cs.SD

TL;DR: This paper examines how text-to-audio AI systems transform musical creation and cognition by translating language prompts into sound, analyzing their impact on musical signification, cognitive frameworks, and listener engagement.

Details

Motivation: To investigate the transformative implications of text-to-audio AI for musical creation, interpretation, and cognition, exploring how these systems reconfigure musical signification processes and navigate established cognitive frameworks.

Method: The research analyzes text-to-audio AI systems using structuralist and post-structuralist perspectives, cognitive theories of schema dynamics and metacognition, with Udio as a primary case study to examine the translation from linguistic prompts to sonic outputs.

Result: Text-to-audio AI models function as quasi-objects of musical signification that simultaneously stabilize and destabilize conventional forms while fostering new modes of listening and aesthetic reflexivity, encouraging critical and structurally-aware listening.

Conclusion: Text-to-audio AI models have potential as epistemic tools and quasi-objects that facilitate significant shifts in musical interactions, enabling users to develop more nuanced understanding of music’s cognitive and cultural foundations.

Abstract: This paper investigates the emerging text-to-audio paradigm in artificial intelligence (AI), examining its transformative implications for musical creation, interpretation, and cognition. I explore the complex semantic and semiotic interplays that occur when descriptive natural language prompts are translated into nuanced sound objects across the text-to-audio modality. Drawing from structuralist and post-structuralist perspectives, as well as cognitive theories of schema dynamics and metacognition, the paper explores how these AI systems reconfigure musical signification processes and navigate established cognitive frameworks. The research analyzes some of the cognitive dynamics at play in AI-mediated musicking, including processes of schema assimilation and accommodation, metacognitive reflection, and constructive perception. The paper argues that text-to-audio AI models function as quasi-objects of musical signification, simultaneously stabilizing and destabilizing conventional forms while fostering new modes of listening and aesthetic reflexivity. Using Udio as a primary case study, this study explores how these models navigate the liminal spaces between linguistic prompts and sonic outputs. This process not only generates novel musical expressions but also prompts listeners to engage in forms of critical and “structurally-aware listening,” encouraging a deeper understanding of music’s structures, semiotic nuances, and the socio-cultural contexts that shape our musical cognition. The paper concludes by reflecting on the potential of text-to-audio AI models to serve as epistemic tools and quasi-objects, facilitating a significant shift in musical interactions and inviting users to develop a more nuanced comprehension of the cognitive and cultural foundations of music.

[292] Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition

Ayhan Kucukmanisa, Derya Gelmez, Sukru Selim Calik, Zeynep Hilal Kilimci

Main category: cs.SD

TL;DR: A transformer-based multimodal framework combining acoustic (UniSpeech) and textual (BERT) embeddings for Arabic phoneme mispronunciation detection, achieving strong performance in Quranic recitation assessment.

Details

Motivation: Accurate pronunciation detection is challenging in Arabic, especially for Quranic recitation where subtle phonetic differences can alter meaning. Existing systems need higher precision and robustness for this critical application.

Method: Proposes a multimodal framework integrating UniSpeech-derived acoustic embeddings with BERT-based textual embeddings from Whisper transcriptions. Evaluated early, intermediate, and late fusion methods on datasets containing 29 Arabic phonemes including eight hafiz sounds.

Result: The UniSpeech-BERT multimodal configuration provides strong results, with fusion-based transformer architectures proving effective for phoneme-level mispronunciation detection across standard evaluation metrics (accuracy, precision, recall, F1-score).

Conclusion: The framework contributes to developing intelligent, speaker-independent multimodal CALL systems, offering practical technology for Quranic pronunciation training and broader speech-based educational applications.

Abstract: Recent advances in multimodal deep learning have greatly enhanced the capability of systems for speech analysis and pronunciation assessment. Accurate pronunciation detection remains a key challenge in Arabic, particularly in the context of Quranic recitation, where subtle phonetic differences can alter meaning. Addressing this challenge, the present study proposes a transformer-based multimodal framework for Arabic phoneme mispronunciation detection that combines acoustic and textual representations to achieve higher precision and robustness. The framework integrates UniSpeech-derived acoustic embeddings with BERT-based textual embeddings extracted from Whisper transcriptions, creating a unified representation that captures both phonetic detail and linguistic context. To determine the most effective integration strategy, early, intermediate, and late fusion methods were implemented and evaluated on two datasets containing 29 Arabic phonemes, including eight hafiz sounds, articulated by 11 native speakers. Additional speech samples collected from publicly available YouTube recordings were incorporated to enhance data diversity and generalization. Model performance was assessed using standard evaluation metrics: accuracy, precision, recall, and F1-score, allowing a detailed comparison of the fusion strategies. Experimental findings show that the UniSpeech-BERT multimodal configuration provides strong results and that fusion-based transformer architectures are effective for phoneme-level mispronunciation detection. The study contributes to the development of intelligent, speaker-independent, and multimodal Computer-Aided Language Learning (CALL) systems, offering a practical step toward technology-supported Quranic pronunciation training and broader speech-based educational applications.
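
The difference between early and late fusion can be sketched with a toy binary mispronunciation scorer. This is a generic illustration, assuming logistic classifiers over small embeddings; the weights, embedding sizes, and the equal-weight averaging in late fusion are hypothetical, and intermediate fusion (merging hidden representations inside the network) is omitted for brevity.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score(embedding: List[float], weights: List[float], bias: float = 0.0) -> float:
    """Logistic classifier: probability that the phoneme is mispronounced."""
    return sigmoid(sum(w * x for w, x in zip(weights, embedding)) + bias)


def early_fusion(acoustic: List[float], textual: List[float],
                 weights: List[float]) -> float:
    """Concatenate the modality embeddings, then apply a single classifier."""
    return score(acoustic + textual, weights)


def late_fusion(acoustic: List[float], textual: List[float],
                w_acoustic: List[float], w_textual: List[float]) -> float:
    """Classify each modality separately, then average the probabilities."""
    return 0.5 * (score(acoustic, w_acoustic) + score(textual, w_textual))


# Hypothetical acoustic and textual embeddings for one phoneme segment.
acoustic, textual = [1.0, -1.0], [0.5]
p_early = early_fusion(acoustic, textual, weights=[1.0, 1.0, 2.0])
p_late = late_fusion(acoustic, textual, w_acoustic=[1.0, 1.0], w_textual=[2.0])
```

Early fusion lets one classifier learn cross-modal interactions, whereas late fusion keeps the modalities independent until the final decision.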

cs.LG

[293] Joint Design of Protein Surface and Structure Using a Diffusion Bridge Model

Guanlue Li, Xufeng Zhao, Fang Wu, Sören Laue

Main category: cs.LG

TL;DR: PepBridge is a novel framework for joint design of protein surfaces and structures that integrates receptor surface geometry and biochemical properties using diffusion models to generate structurally viable proteins.

Details

Motivation: Designing diverse and physically realistic protein structures with precise surface complementarity to target receptors remains a significant challenge in computational protein design.

Method: Uses denoising diffusion bridge models to map receptor surfaces to ligand surfaces, multi-model diffusion for structure prediction, and Shape-Frame Matching Networks for surface-backbone alignment.

Result: Extensive validation shows PepBridge effectively generates structurally viable proteins with surface complementarity, conformational stability, and chemical feasibility.

Conclusion: PepBridge represents a significant advancement in joint top-down protein structure design by seamlessly integrating surface geometry and biochemical properties.

Abstract: Protein-protein interactions (PPIs) are governed by surface complementarity and hydrophobic interactions at protein interfaces. However, designing diverse and physically realistic protein structures and surfaces that precisely complement target receptors remains a significant challenge in computational protein design. In this work, we introduce PepBridge, a novel framework for the joint design of protein surface and structure that seamlessly integrates receptor surface geometry and biochemical properties. Starting with a receptor surface represented as a 3D point cloud, PepBridge generates complete protein structures through a multi-step process. First, it employs denoising diffusion bridge models (DDBMs) to map receptor surfaces to ligand surfaces. Next, a multi-model diffusion model predicts the corresponding structure, while Shape-Frame Matching Networks ensure alignment between surface geometry and backbone architecture. This integrated approach facilitates surface complementarity, conformational stability, and chemical feasibility. Extensive validation across diverse protein design scenarios demonstrates PepBridge’s efficacy in generating structurally viable proteins, representing a significant advancement in the joint design of top-down protein structure.

[294] DDTime: Dataset Distillation with Spectral Alignment and Information Bottleneck for Time-Series Forecasting

Yuqi Li, Kuiye Ding, Chuanguang Yang, Hao Wang, Haoxuan Wang, Huiran Duan, Junming Liu, Yingli Tian

Main category: cs.LG

TL;DR: DDTime is a lightweight dataset distillation framework for time-series forecasting that addresses temporal bias and sample diversity challenges through frequency-domain alignment and inter-sample regularization, achieving about 30% relative accuracy gains with about 2.49% computational overhead.

Details

Motivation: Time-series forecasting requires large datasets and computational resources. Dataset distillation offers a compact alternative, but faces challenges from temporal bias (autocorrelation distorting value-term alignment) and insufficient sample diversity (lack of categorical priors for trajectory variety).

Method: Built on first-order condensation decomposition, DDTime uses frequency-domain alignment to mitigate autocorrelation-induced bias and ensure spectral consistency. It also employs inter-sample regularization based on information bottleneck principle to enhance diversity and maximize information density across synthetic trajectories.

Result: Extensive experiments on 20 benchmark datasets show DDTime consistently outperforms existing distillation methods, achieving about 30% relative accuracy gains while introducing only about 2.49% computational overhead.

Conclusion: DDTime provides an effective and efficient solution for time-series dataset distillation, with theoretical compatibility across condensation paradigms and stable first-order optimization. The framework successfully addresses key challenges in temporal data distillation.

Abstract: Time-series forecasting is fundamental across many domains, yet training accurate models often requires large-scale datasets and substantial computational resources. Dataset distillation offers a promising alternative by synthesizing compact datasets that preserve the learning behavior of full data. However, extending dataset distillation to time-series forecasting is non-trivial due to two fundamental challenges: (1) temporal bias from strong autocorrelation, which leads to distorted value-term alignment between teacher and student models; and (2) insufficient diversity among synthetic samples, arising from the absence of explicit categorical priors to regularize trajectory variety. In this work, we propose DDTime, a lightweight and plug-in distillation framework built upon first-order condensation decomposition. To tackle Challenge 1, it revisits value-term alignment through temporal statistics and introduces a frequency-domain alignment mechanism to mitigate autocorrelation-induced bias, ensuring spectral consistency and temporal fidelity. To address Challenge 2, we further design an inter-sample regularization inspired by the information bottleneck principle, which enhances diversity and maximizes information density across synthetic trajectories. The combined objective is theoretically compatible with a wide range of condensation paradigms and supports stable first-order optimization. Extensive experiments on 20 benchmark datasets and diverse forecasting architectures demonstrate that DDTime consistently outperforms existing distillation methods, achieving about 30% relative accuracy gains while introducing about 2.49% computational overhead. All code and distilled datasets will be released.
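
A frequency-domain alignment term can be sketched as follows. The abstract does not give DDTime's exact objective, so this sketch assumes a mean absolute distance between the discrete Fourier coefficients of two series; the naive DFT and the toy series are for illustration only.

```python
import cmath
import math
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform, O(n^2); adequate for a short example."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def spectral_alignment_loss(pred: List[float], target: List[float]) -> float:
    """Mean absolute distance between the DFT coefficients of two series."""
    fp, ft = dft(pred), dft(target)
    return sum(abs(a - b) for a, b in zip(fp, ft)) / len(fp)


# One period of a sine wave versus a flat series (hypothetical data).
wave = [0.0, 1.0, 0.0, -1.0]
flat = [0.0, 0.0, 0.0, 0.0]

loss_same = spectral_alignment_loss(wave, wave)
loss_diff = spectral_alignment_loss(wave, flat)
```

The mismatch between the two series is concentrated in the two frequency bins that carry the sine component, rather than spread across correlated time steps. This is the motivation usually given for comparing autocorrelated series in the spectral domain.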

[295] When Structure Doesn’t Help: LLMs Do Not Read Text-Attributed Graphs as Effectively as We Expected

Haotian Xu, Yuning You, Tengfei Ma

Main category: cs.LG

TL;DR: LLMs achieve strong performance on text-attributed graphs using only node text descriptions, with most structural encoding strategies providing marginal or negative gains, challenging traditional graph learning paradigms.

Details

Motivation: To investigate how different graph structure encoding strategies affect LLM performance on text-attributed graphs and challenge the assumption that explicit structural information is always beneficial.

Method: Systematic experiments comparing LLM performance with various structural encoding strategies versus using only node textual descriptions on text-attributed graphs.

Result: LLMs using only node text descriptions achieve strong performance, while most structural encoding strategies offer marginal or negative gains. Explicit structural priors are often unnecessary and sometimes counterproductive.

Conclusion: Traditional graph learning assumptions about structural benefits need rethinking in the LLM era, opening opportunities for semantics-driven approaches that may not require explicit structural encoding.

Abstract: Graphs provide a unified representation of semantic content and relational structure, making them a natural fit for domains such as molecular modeling, citation networks, and social graphs. Meanwhile, large language models (LLMs) have excelled at understanding natural language and integrating cross-modal signals, sparking interest in their potential for graph reasoning. Recent work has explored this by either designing graph templates or using graph neural networks (GNNs) to encode structural information. In this study, we investigate how different strategies for encoding graph structure affect LLM performance on text-attributed graphs. Surprisingly, our systematic experiments reveal that: (i) LLMs leveraging only node textual descriptions already achieve strong performance across tasks; and (ii) most structural encoding strategies offer marginal or even negative gains. We show that explicit structural priors are often unnecessary and, in some cases, counterproductive when powerful language models are involved. This represents a significant departure from traditional graph learning paradigms and highlights the need to rethink how structure should be represented and utilized in the LLM era. Our study systematically challenges the foundational assumption that structure is inherently beneficial for LLM-based graph reasoning, opening the door to new, semantics-driven approaches for graph learning.

[296] GCL-OT: Graph Contrastive Learning with Optimal Transport for Heterophilic Text-Attributed Graphs

Yating Ren, Yikun Ban, Huobin Tan

Main category: cs.LG

TL;DR: GCL-OT is a graph contrastive learning framework that addresses multi-granular heterophily in text-attributed graphs using optimal transport, achieving superior performance over existing methods.

Details

Motivation: Existing structure-text contrastive learning methods struggle with heterophilic graphs due to homophily assumptions and static textual embeddings, leading to suboptimal alignment in mixed, noisy, and missing semantic correlation scenarios.

Method: Proposes GCL-OT with three tailored mechanisms: RealSoftMax-based similarity estimator for partial heterophily, prompt-based filter for complete heterophily, and OT-guided soft supervision for latent homophily.

Result: Extensive experiments on nine benchmarks show GCL-OT consistently outperforms state-of-the-art methods, with theoretical analysis demonstrating improved mutual information bound and Bayes error guarantees.

Conclusion: GCL-OT effectively addresses multi-granular heterophily through flexible bidirectional alignment using optimal transport, providing robust performance across diverse text-attributed graph scenarios.

Abstract: Recently, structure-text contrastive learning has shown promising performance on text-attributed graphs by leveraging the complementary strengths of graph neural networks and language models. However, existing methods typically rely on homophily assumptions in similarity estimation and hard optimization objectives, which limit their applicability to heterophilic graphs. Although existing methods can mitigate heterophily through structural adjustments or neighbor aggregation, they usually treat textual embeddings as static targets, leading to suboptimal alignment. In this work, we identify the multi-granular heterophily in text-attributed graphs, including complete heterophily, partial heterophily, and latent homophily, which makes structure-text alignment particularly challenging due to mixed, noisy, and missing semantic correlations. To achieve flexible and bidirectional alignment, we propose GCL-OT, a novel graph contrastive learning framework with optimal transport, equipped with tailored mechanisms for each type of heterophily. Specifically, for partial heterophily, we design a RealSoftMax-based similarity estimator to emphasize key neighbor-word interactions while easing background noise. For complete heterophily, we introduce a prompt-based filter that adaptively excludes irrelevant noise during optimal transport alignment. Furthermore, we incorporate OT-guided soft supervision to uncover potential neighbors with similar semantics, enhancing the learning of latent homophily. Theoretical analysis shows that GCL-OT can improve the mutual information bound and Bayes error guarantees. Extensive experiments on nine benchmarks show that GCL-OT consistently outperforms state-of-the-art methods, verifying its effectiveness and robustness.
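
The "RealSoftMax-based similarity estimator" can be illustrated with a small sketch. RealSoftMax is another name for the log-sum-exp function, a smooth stand-in for the maximum. The function name, the temperature parameter, and the similarity values below are assumptions for illustration; the abstract does not specify how GCL-OT parameterizes its estimator.

```python
import math

def real_softmax(scores, temperature=1.0):
    """Smooth maximum: temperature * log(sum(exp(score / temperature)))."""
    peak = max(scores)  # subtract the peak before exponentiating, for stability
    total = sum(math.exp((s - peak) / temperature) for s in scores)
    return peak + temperature * math.log(total)

# Hypothetical similarities between one node and the words of a neighbor's text.
similarities = [0.10, 0.90, 0.30]

# Low temperature: the result is dominated by the strongest interaction.
sharp = real_softmax(similarities, temperature=0.01)

# Higher temperature: weaker interactions also contribute to the score.
smooth = real_softmax(similarities, temperature=1.0)
```

With a low temperature the score tracks the single strongest neighbor-word interaction, which is one plausible way to emphasize key interactions while down-weighting background noise.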

[297] Better audio representations are more brain-like: linking model-brain alignment with performance in downstream auditory tasks

Leonardo Pepino, Pablo Riera, Juan Kamienkowski, Luciana Ferrer

Main category: cs.LG

TL;DR: Self-supervised audio models with strong task performance better predict auditory cortex activity than older specialized models, with brain similarity emerging as a byproduct of learning from natural audio data.

Details

Motivation: To determine whether improving artificial neural network task performance also makes their internal representations more similar to brain signals in the auditory domain.

Method: Quantified alignment between 36 audio models and fMRI brain activity using voxel-wise/component-wise regression and RSA; evaluated models on 6 auditory tasks from HEAREval benchmark; analyzed similarity evolution during EnCodecMAE pretraining.

Result: Found strong positive correlations (r>0.7) between model task performance and brain alignment; brain similarity increases progressively during pretraining and emerges early without explicit optimization.

Conclusion: Brain-like representations can be an emergent byproduct of learning to reconstruct missing information from naturalistic audio data, suggesting task performance improvement naturally leads to more brain-like representations.

Abstract: Artificial neural networks (ANNs) are increasingly powerful models of brain computation, yet it remains unclear whether improving their task performance also makes their internal representations more similar to brain signals. To address this question in the auditory domain, we quantified the alignment between the internal representations of 36 different audio models and brain activity from two independent fMRI datasets. Using voxel-wise and component-wise regression, and representation similarity analysis (RSA), we found that recent self-supervised audio models with strong performance in diverse downstream tasks are better predictors of auditory cortex activity than older and more specialized models. To assess the quality of the audio representations, we evaluated these models in 6 auditory tasks from the HEAREval benchmark, spanning music, speech, and environmental sounds. This revealed strong positive Pearson correlations ($r>0.7$) between a model’s overall task performance and its alignment with brain representations. Finally, we analyzed the evolution of the similarity between audio and brain representations during the pretraining of EnCodecMAE. We discovered that brain similarity increases progressively and emerges early during pretraining, despite the model not being explicitly optimized for this objective. This suggests that brain-like representations can be an emergent byproduct of learning to reconstruct missing information from naturalistic audio data.
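
Representation similarity analysis (RSA) can be sketched in a few lines. The version below builds a dissimilarity matrix (1 minus Pearson correlation) over stimuli for each system and then correlates the two matrices. The helper names and toy data are assumptions; the abstract does not state which dissimilarity or correlation variant the authors used, and Spearman correlation is also common for the second step.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rdm_upper(reps):
    """Upper triangle of the stimulus-by-stimulus dissimilarity matrix."""
    n = len(reps)
    return [1.0 - pearson(reps[i], reps[j])
            for i in range(n) for j in range(i + 1, n)]

def rsa_score(model_reps, brain_reps):
    """Correlate the two dissimilarity structures (one row per stimulus)."""
    return pearson(rdm_upper(model_reps), rdm_upper(brain_reps))

# Toy data: 4 stimuli, 3 features per system (hypothetical values).
model_reps = [[1, 2, 3], [3, 1, 2], [2, 3, 1], [1, 3, 2]]
# A "brain" whose responses are a rescaled, shifted copy of the model's.
brain_reps = [[2 * v + 1 for v in row] for row in model_reps]
score = rsa_score(model_reps, brain_reps)
```

Because RSA compares dissimilarity structure rather than raw activations, the two systems may have different numbers of features (model units versus voxels) as long as they share the same stimuli.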

[298] Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach

Yaoxin Yang, Peng Ye, Xudong Tan, Chongjun Tu, Maosen Zhao, Jia Hao, Tao Chen

Main category: cs.LG

TL;DR: FlashCache is a frequency-domain-guided KV Cache compression framework for multimodal LLMs that identifies and retains outlier KV pairs while compressing the majority of KV cache, achieving 1.69× faster decoding with 80% lower memory usage.

Details

Motivation: Multimodal LLMs suffer from substantial inference overhead due to large KV Cache that grows with visual input length. Existing compression methods are incompatible with efficient attention kernels and ignore value vectors' contribution.

Method: Uses frequency-domain analysis to identify principal energy in KV matrices, recognizes Outlier KVs that deviate from principal energy, and implements dynamic budget allocation to retain critical KV pairs while compressing others.

Result: Achieves up to 1.69× faster decoding with 80% lower KV memory usage while maintaining task performance across multiple MLLMs and benchmarks.

Conclusion: FlashCache provides an effective KV Cache compression framework that leverages frequency-domain analysis and outlier-aware retention, outperforming state-of-the-art multimodal KV compression methods.

Abstract: Multimodal large language models suffer from substantial inference overhead since the multimodal KV Cache grows proportionally with the visual input length. Existing multimodal KV Cache compression methods mostly rely on attention scores to reduce cache size, which makes them incompatible with established efficient attention kernels (e.g., FlashAttention) and ignores the contribution of value vectors to the attention output. In this work, we revisit multimodal KV Cache compression from the perspective of the KV matrices’ distribution. First, we observe that the frequency-domain energy of multimodal KV matrices is predominantly concentrated in low frequencies, and we extract this principal energy via a low-pass filter. Further, we find that removing KV pairs that deviate substantially from this principal energy, which we define as Outlier KVs, leads to a pronounced performance drop. Considering that Outlier KVs are more likely to encode features critical for inference, we propose FlashCache, a frequency-domain-guided, Outlier-KV-aware KV Cache compression framework. First, we introduce an Outlier KV Recognition Module that models the principal component of multimodal KV matrices in the frequency domain and preferentially retains KV pairs that significantly deviate from it. Furthermore, a Dynamic Budget Allocation Module is designed to adaptively determine the per-layer KV Cache size to retain more Outlier KVs. Experiments on multiple MLLMs and benchmarks demonstrate that FlashCache outperforms state-of-the-art multimodal KV compression methods, achieving up to 1.69 times faster decoding with 80% lower KV memory usage while maintaining task performance.
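
The idea of scoring KV pairs by their deviation from a low-pass-filtered version of the cache can be sketched as follows. This is only an illustration under stated assumptions: the function names, the naive DFT along the token axis, the squared-deviation score, and the fixed budget are all hypothetical, and the abstract does not describe FlashCache's actual recognition or budget-allocation modules at this level of detail.

```python
import cmath

def low_pass(signal, keep):
    """Keep only DFT bins whose frequency index is <= keep (plus mirror bins)."""
    n = len(signal)
    spectrum = [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    filtered = [c if min(k, n - k) <= keep else 0
                for k, c in enumerate(spectrum)]
    # Inverse DFT; the result is real because mirrored bins are kept together.
    return [(sum(filtered[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def outlier_scores(kv, keep):
    """kv: one row per token, one column per channel. Higher score = more outlying."""
    n_tokens, n_channels = len(kv), len(kv[0])
    scores = [0.0] * n_tokens
    for c in range(n_channels):
        channel = [kv[t][c] for t in range(n_tokens)]
        smooth = low_pass(channel, keep)
        for t in range(n_tokens):
            scores[t] += (channel[t] - smooth[t]) ** 2
    return scores

def retain_indices(kv, keep, budget):
    """Indices of the `budget` tokens that deviate most from the low-pass signal."""
    scores = outlier_scores(kv, keep)
    top = sorted(range(len(kv)), key=lambda t: scores[t], reverse=True)[:budget]
    return sorted(top)

# Toy cache: 16 tokens, 2 channels, flat except for one spike at token 5.
kv = [[1.0, 0.5] for _ in range(16)]
kv[5] = [11.0, 0.5]
kept = retain_indices(kv, keep=1, budget=1)
```

Note that the score depends only on the cached values, not on attention weights, which is consistent with the abstract's point about staying compatible with fused attention kernels.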

[299] A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi

Main category: cs.LG

TL;DR: Proposes Optimal Temporal Transport Classification (OTTC) loss using sequence optimal transport for better seq2seq alignment in ASR, improving alignment accuracy but with ASR performance trade-off.

Details

Motivation: Address alignment inaccuracies in state-of-the-art E2E ASR systems (CTC and transducer models) that suffer from peaky behavior, critical for applications like medical speech analysis and language learning tools.

Method: Novel differentiable alignment framework based on one-dimensional optimal transport, introducing Sequence Optimal Transport Distance (SOTD) pseudo-metric and OTTC loss for ASR training.

Result: Experimental results on TIMIT, AMI, and LibriSpeech datasets show considerable improvement in alignment performance compared to CTC and Consistency-Regularized CTC, though with trade-off in ASR performance.

Conclusion: Opens new avenues for seq2seq alignment research, providing solid foundation for further exploration and development in the community.

Abstract: Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance compared to CTC and the more recently proposed Consistency-Regularized CTC, though with a trade-off in ASR performance. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community. Our code is publicly available at: https://github.com/idiap/OTTC
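
One-dimensional optimal transport has a convenient property: for two histograms on ordered supports and a convex ground cost, the monotone coupling (the "north-west corner" rule) is optimal. The sketch below uses that rule to align frames to labels. The function names and the toy weights are assumptions, and this is not the paper's SOTD or OTTC definition, only an illustration of why 1D transport yields a single monotone alignment.

```python
from fractions import Fraction

def monotone_plan(frame_mass, label_mass):
    """Monotone coupling of two histograms with equal total mass."""
    assert sum(frame_mass) == sum(label_mass)
    plan = []  # entries are (frame index, label index, mass moved)
    i, j = 0, 0
    rem_a, rem_b = frame_mass[0], label_mass[0]
    while i < len(frame_mass) and j < len(label_mass):
        moved = min(rem_a, rem_b)
        if moved > 0:
            plan.append((i, j, moved))
        rem_a -= moved
        rem_b -= moved
        if rem_a == 0:  # this frame is fully assigned, move to the next one
            i += 1
            if i < len(frame_mass):
                rem_a = frame_mass[i]
        if rem_b == 0:  # this label is fully covered, move to the next one
            j += 1
            if j < len(label_mass):
                rem_b = label_mass[j]
    return plan

def hard_alignment(plan, n_frames):
    """Assign each frame to the label that receives most of its mass."""
    best = {}
    for frame, label, mass in plan:
        if frame not in best or mass > best[frame][1]:
            best[frame] = (label, mass)
    return [best[f][0] for f in range(n_frames)]

# Six frames with uniform weight; three labels with hypothetical weights.
frames = [Fraction(1, 6)] * 6
labels = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
alignment = hard_alignment(monotone_plan(frames, labels), len(frames))
```

Exact fractions are used so the mass bookkeeping terminates cleanly; a differentiable training loss would work with floating-point weights predicted by the model instead.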

[300] A Vector Symbolic Approach to Multiple Instance Learning

Ehsan Ahmed Dhrubo, Mohammad Mahmudul Alam, Edward Raff, Tim Oates, James Holt

Main category: cs.LG

TL;DR: A novel MIL framework using Vector Symbolic Architectures (VSAs) that strictly enforces the iff constraint through symbolic operations in high-dimensional space, achieving state-of-the-art results while maintaining MIL formulation integrity.

Details

Motivation: Most deep learning-based MIL approaches violate the fundamental iff constraint (bag positive iff at least one instance positive), leading to inflated performance metrics and poor generalization. There's a need for methods that strictly adhere to the MIL formulation.

Method: Uses Vector Symbolic Architectures to encode instances and concepts as nearly orthogonal high-dimensional vectors, with algebraic operations enforcing the iff constraint. Includes a learned encoder to transform raw data into VSA-compatible vectors and a VSA-driven MaxNetwork classifier.

Result: Achieves state-of-the-art results for valid MIL models on standard benchmarks and medical imaging datasets, outperforming existing methods while strictly adhering to the MIL formulation.

Conclusion: Provides a principled, interpretable, and effective alternative to existing MIL approaches that rely on learned heuristics, bridging the gap between symbolic reasoning and deep learning for MIL tasks.

Abstract: Multiple Instance Learning (MIL) tasks impose a strict logical constraint: a bag is labeled positive if and only if at least one instance within it is positive. While this iff constraint aligns with many real-world applications, recent work has shown that most deep learning-based MIL approaches violate it, leading to inflated performance metrics and poor generalization. We propose a novel MIL framework based on Vector Symbolic Architectures (VSAs), which provide a differentiable mechanism for performing symbolic operations in high-dimensional space. Our method encodes the MIL assumption directly into the model’s structure by representing instances and concepts as nearly orthogonal high-dimensional vectors and using algebraic operations to enforce the iff constraint during classification. To bridge the gap between raw data and VSA representations, we design a learned encoder that transforms input instances into VSA-compatible vectors while preserving key distributional properties. Our approach, which includes a VSA-driven MaxNetwork classifier, achieves state-of-the-art results for a valid MIL model on standard MIL benchmarks and medical imaging datasets, outperforming existing methods while maintaining strict adherence to the MIL formulation. This work offers a principled, interpretable, and effective alternative to existing MIL approaches that rely on learned heuristics.
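
The iff constraint itself is easy to state in code. The sketch below contrasts a max-style aggregator, which respects the constraint when instance scores are accurate, with mean pooling, which can miss a bag containing a single positive instance. It does not implement the paper's VSA encoder or MaxNetwork; the function names, threshold, and scores are hypothetical.

```python
def bag_is_positive(instance_labels):
    """MIL assumption: a bag is positive iff at least one instance is positive."""
    return any(instance_labels)

def max_pool_predict(instance_scores, threshold=0.5):
    """Bag is predicted positive as soon as one instance score clears the threshold."""
    return max(instance_scores) > threshold

def mean_pool_predict(instance_scores, threshold=0.5):
    """Averaging lets many negative instances drown out a single positive one."""
    return sum(instance_scores) / len(instance_scores) > threshold

# One clearly positive instance among three negatives (hypothetical scores).
labels = [0, 0, 1, 0]
scores = [0.10, 0.05, 0.95, 0.10]

truth = bag_is_positive(labels)          # the bag is positive
by_max = max_pool_predict(scores)        # agrees with the MIL label
by_mean = mean_pool_predict(scores)      # mean is 0.30, so the bag is missed
```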

[301] A Robust Federated Learning Approach for Combating Attacks Against IoT Systems Under non-IID Challenges

Eyad Gad, Zubair Md Fadlullah, Mostafa M. Fouda

Main category: cs.LG

TL;DR: This paper compares federated learning methods (FedAvg, FedProx, Scaffold) for IoT attack detection using the CICIoT2023 dataset, focusing on their performance under statistical heterogeneity in non-IID data distributions.

Details

Motivation: The proliferation of IoT devices and data volumes creates challenges for traditional ML training in resource-constrained, security-sensitive environments. Federated Learning addresses privacy and resource limitations but faces challenges from statistical heterogeneity in non-IID data across parties.

Method: The study explores FL algorithms (FedAvg, FedProx, Scaffold) under different data distributions and classifies large-scale IoT attacks using the CICIoT2023 dataset through meticulous analysis and experimentation.

Result: The research provides performance comparisons of FL methods in detecting IoT attacks under statistical heterogeneity conditions, though specific quantitative results are not detailed in the abstract.

Conclusion: The study aims to illuminate performance nuances of FL methods for IoT attack detection and provide valuable insights for researchers and practitioners dealing with statistical heterogeneity challenges in federated learning.

Abstract: In the context of the growing proliferation of user devices and the concurrent surge in data volumes, the complexities arising from the substantial increase in data have posed formidable challenges to conventional machine learning model training. Particularly, this is evident within resource-constrained and security-sensitive environments such as those encountered in networks associated with the Internet of Things (IoT). Federated Learning (FL) has emerged as a promising remedy to these challenges by decentralizing model training to edge devices or parties, effectively addressing privacy concerns and resource limitations. Nevertheless, the presence of statistical heterogeneity in non-Independently and Identically Distributed (non-IID) data across different parties poses a significant hurdle to the effectiveness of FL. Many FL approaches have been proposed to enhance learning effectiveness under statistical heterogeneity. However, prior studies have uncovered a gap in the existing research landscape, particularly in the absence of a comprehensive comparison between federated methods addressing statistical heterogeneity in detecting IoT attacks. In this research endeavor, we delve into the exploration of FL algorithms, specifically FedAvg, FedProx, and Scaffold, under different data distributions. Our focus is on achieving a comprehensive understanding of and addressing the challenges posed by statistical heterogeneity. In this study, we classify large-scale IoT attacks by utilizing the CICIoT2023 dataset. Through meticulous analysis and experimentation, our objective is to illuminate the performance nuances of these FL methods, providing valuable insights for researchers and practitioners in the domain.
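
For readers unfamiliar with the compared algorithms, the core of FedAvg and the FedProx modification can be sketched briefly. FedAvg averages client parameters weighted by local sample counts; FedProx adds a proximal term (mu/2)·||w − w_global||² to each client's local objective, which contributes mu·(w − w_global) to the local gradient. The function names and toy numbers are assumptions, and Scaffold's control variates are not shown.

```python
def fedavg(client_weights, client_sizes):
    """Server step: average client parameters, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
            for k in range(dim)]

def fedprox_gradient(local_grad, local_w, global_w, mu):
    """Client step: loss gradient plus the proximal pull toward the global model."""
    return [g + mu * (w - gw)
            for g, w, gw in zip(local_grad, local_w, global_w)]

# Two hypothetical clients; the second holds three times as much data.
global_model = fedavg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])

# The proximal term discourages a client from drifting far from the global model,
# which is the mechanism FedProx relies on under non-IID data.
grad = fedprox_gradient([0.5, -0.5], [1.0, 1.0], [0.0, 2.0], mu=0.1)
```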

[302] Monte Carlo Expected Threat (MOCET) Scoring

Joseph Kim, Saahith Potluri

Main category: cs.LG

TL;DR: Introduces MOCET, an interpretable and scalable metric to quantify real-world AI safety risks, addressing gaps in existing evaluation methods for AI Safety Level (ASL) threats.

Details

Motivation: Existing evaluation metrics like LAB-Bench, BioLP-bench, and WMDP can assess model uplift and domain knowledge but lack contextualization of real-world risks and scalable open-ended metrics to keep pace with rapid LLM advancements.

Method: Developed MOCET - an interpretable and doubly-scalable metric that is both automatable and open-ended to quantify real-world risks.

Result: MOCET provides a framework to better contextualize real-world risks and inform safety cases for LLMs, particularly addressing ASL-3+ model risks in biosecurity uplift scenarios.

Conclusion: MOCET addresses critical gaps in AI safety evaluation by providing scalable, open-ended metrics that can quantify real-world risks and keep pace with rapid LLM advancements.

Abstract: Evaluating and measuring AI Safety Level (ASL) threats are crucial for guiding stakeholders to implement safeguards that keep risks within acceptable limits. ASL-3+ models present a unique risk in their ability to uplift novice non-state actors, especially in the realm of biosecurity. Existing evaluation metrics, such as LAB-Bench, BioLP-bench, and WMDP, can reliably assess model uplift and domain knowledge. However, metrics that better contextualize “real-world risks” are needed to inform the safety case for LLMs, along with scalable, open-ended metrics to keep pace with their rapid advancements. To address both gaps, we introduce MOCET, an interpretable and doubly-scalable metric (automatable and open-ended) that can quantify real-world risks.

[303] ManifoldFormer: Geometric Deep Learning for Neural Dynamics on Riemannian Manifolds

Yihang Fu, Lifang He, Qingyu Chen

Main category: cs.LG

TL;DR: ManifoldFormer introduces a geometric deep learning framework for EEG analysis that explicitly models neural dynamics on manifolds, overcoming limitations of Euclidean-based approaches and achieving superior performance with better cross-subject generalization.

Details

Motivation: Existing EEG foundation models treat neural signals as generic time series in Euclidean space, ignoring the intrinsic geometric structure of neural dynamics constrained to low-dimensional manifolds, which limits representation quality and cross-subject generalization.

Method: Integrates Riemannian VAE for manifold embedding preserving geometric structure, geometric Transformer with geodesic-aware attention mechanisms operating on neural manifolds, and dynamics predictor using neural ODEs for manifold-constrained temporal evolution.

Result: Substantial improvements over state-of-the-art methods across four public datasets: 4.6-4.8% higher accuracy and 6.2-10.2% higher Cohen’s Kappa, while maintaining robust cross-subject generalization.

Conclusion: Geometric constraints are essential for effective EEG foundation models, as the approach reveals meaningful neural patterns consistent with neurophysiological principles and establishes superior performance over Euclidean-based methods.

Abstract: Existing EEG foundation models mainly treat neural signals as generic time series in Euclidean space, ignoring the intrinsic geometric structure of neural dynamics that constrains brain activity to low-dimensional manifolds. This fundamental mismatch between model assumptions and neural geometry limits representation quality and cross-subject generalization. ManifoldFormer addresses this limitation through a novel geometric deep learning framework that explicitly learns neural manifold representations. The architecture integrates three key innovations: a Riemannian VAE for manifold embedding that preserves geometric structure, a geometric Transformer with geodesic-aware attention mechanisms operating directly on neural manifolds, and a dynamics predictor leveraging neural ODEs for manifold-constrained temporal evolution. Extensive evaluation across four public datasets demonstrates substantial improvements over state-of-the-art methods, with 4.6-4.8% higher accuracy and 6.2-10.2% higher Cohen’s Kappa, while maintaining robust cross-subject generalization. The geometric approach reveals meaningful neural patterns consistent with neurophysiological principles, establishing geometric constraints as essential for effective EEG foundation models.

[304] Analysis of heart failure patient trajectories using sequence modeling

Falk Dippela, Yinan Yu, Annika Rosengren, Martin Lindgren, Christina E. Lundberg, Erik Aerts, Martin Adiels, Helen Sjöland

Main category: cs.LG

TL;DR: Systematic comparison of Transformers, Transformers++ (Llama), and Mambas for clinical prediction using EHR data from 42,820 heart failure patients, showing Llama achieves best performance across three prediction tasks.

Details

Motivation: Lack of systematic empirical analysis of model performance and efficiency in medical domain despite Transformers and Mamba architectures showing promise for EHR-based clinical prediction tasks.

Method: Evaluated six sequence models across three architecture classes on Swedish heart failure cohort with EHR data including diagnoses, vital signs, labs, medications, and procedures. Conducted ablations on input tokenization, model configurations, and temporal preprocessing techniques.

Result: Llama achieved highest predictive discrimination, best calibration, and robustness across all tasks, followed by Mambas. Both architectures showed efficient representation learning with tiny configurations surpassing large Transformers. At equal model size, achieved superior performance using 25% less training data.

Conclusion: Provides first systematic ablation study for EHR-based clinical prediction with recommendations for input tokenization, model configuration, and temporal preprocessing that future model development can build upon.

Abstract: Transformers have defined the state-of-the-art for clinical prediction tasks involving electronic health records (EHRs). The recently introduced Mamba architecture outperformed an advanced Transformer (Transformer++) based on Llama in handling long context lengths, while using fewer model parameters. Despite the impressive performance of these architectures, a systematic approach to empirically analyze model performance and efficiency under various settings is not well established in the medical domain. The performances of six sequence models were investigated across three architecture classes (Transformers, Transformers++, Mambas) in a large Swedish heart failure (HF) cohort (N = 42820), providing a clinically relevant case study. Patient data included diagnoses, vital signs, laboratories, medications and procedures extracted from in-hospital EHRs. The models were evaluated on three one-year prediction tasks: clinical instability (a readmission phenotype) after initial HF hospitalization, mortality after initial HF hospitalization and mortality after latest hospitalization. Ablations account for modifications of the EHR-based input patient sequence, architectural model configurations, and temporal preprocessing techniques for data collection. Llama achieves the highest predictive discrimination and best calibration, and shows robustness across all tasks, followed by Mambas. Both architectures demonstrate efficient representation learning, with tiny configurations surpassing other large-scale Transformers. At equal model size, Llama and Mambas achieve superior performance using 25% less training data. This paper presents a first ablation study with systematic design choices for input tokenization, model configuration and temporal data preprocessing. Future model development in clinical prediction tasks using EHRs could build upon this study’s recommendation as a starting point.

[305] Provably Minimum-Length Conformal Prediction Sets for Ordinal Classification

Zijian Zhang, Xinyu Chen, Yuanjie Shi, Liyuan Lillian Ma, Zifan Xu, Yan Yan

Main category: cs.LG

TL;DR: Proposes a model-agnostic conformal prediction method for ordinal classification that provides instance-level optimal prediction intervals with statistically valid uncertainty guarantees.

Details

Motivation: Existing ordinal conformal prediction methods are either heuristic or require restrictive unimodal distribution assumptions, limiting their coverage-efficiency trade-offs and model-agnostic nature.

Method: Formulates ordinal conformal prediction as a minimum-length covering problem and develops a linear-time sliding-window algorithm for instance-level optimal prediction intervals, plus a length-regularized variant.

Result: Significantly improved predictive efficiency over baselines (15% average decrease in prediction set size across four benchmark datasets) while maintaining coverage guarantees.

Conclusion: The proposed method provides model-agnostic, distribution-free ordinal conformal prediction with instance-level optimality and improved efficiency.

Abstract: Ordinal classification has been widely applied in many high-stakes applications, e.g., medical imaging and diagnosis, where reliable uncertainty quantification (UQ) is essential for decision making. Conformal prediction (CP) is a general UQ framework that provides statistically valid guarantees, which is especially useful in practice. However, prior ordinal CP methods mainly focus on heuristic algorithms or restrictively require the underlying model to predict a unimodal distribution over ordinal labels. Consequently, they offer limited insight into coverage-efficiency trade-offs and lack the model-agnostic, distribution-free nature favored by CP methods. To fill this gap, we propose an ordinal-CP method that is model-agnostic and provides instance-level optimal prediction intervals. Specifically, we formulate conformal ordinal classification as a minimum-length covering problem at the instance level. To solve this problem, we develop a sliding-window algorithm that is optimal on each calibration data point, with only a linear time complexity in K, the number of label candidates. The local optimality per instance also improves predictive efficiency in expectation. Moreover, we propose a length-regularized variant that shrinks prediction set size while preserving coverage. Experiments on four benchmark datasets from diverse domains are conducted to demonstrate the significantly improved predictive efficiency of the proposed methods over baselines (a 15% decrease on average across the four datasets).
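
The minimum-length covering step can be sketched with a two-pointer sliding window: given predicted probabilities over K ordered labels and a required mass, find the shortest contiguous label interval whose probabilities sum to at least that mass. The function name and the fixed mass value are assumptions; in a conformal procedure the threshold would come from calibration, which the abstract does not detail and this sketch omits.

```python
def shortest_interval(probs, mass):
    """Shortest contiguous label range (left, right) with total probability >= mass.

    Runs in O(K): the left pointer only ever moves forward.
    Returns None if even the full label range falls short of `mass`.
    """
    best = None
    left = 0
    window = 0.0
    for right, p in enumerate(probs):
        window += p
        # Drop labels from the left while the window still covers enough mass.
        while left < right and window - probs[left] >= mass:
            window -= probs[left]
            left += 1
        if window >= mass and (best is None or right - left < best[1] - best[0]):
            best = (left, right)
    return best

# Hypothetical predicted distribution over five ordinal grades.
probs = [0.05, 0.10, 0.50, 0.30, 0.05]
interval = shortest_interval(probs, mass=0.75)
```

Nothing here requires the predicted distribution to be unimodal, which matches the model-agnostic framing in the abstract.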

[306] Sex and age determination in European lobsters using AI-Enhanced bioacoustics

Feliciano Pedro Francisco Domingos, Isibor Kennedy Ihianle, Omprakash Kaiwartya, Ahmad Lotfi, Nicola Khan, Nicholas Beaudreau, Amaya Albalat, Pedro Machado

Main category: cs.LG

TL;DR: This study uses passive acoustic monitoring and AI models to classify European lobsters by age and sex from their bioacoustic emissions, achieving high accuracy rates.

Details

Motivation: Monitoring elusive aquatic species like lobsters is challenging but crucial for fisheries management and conservation. Understanding lobster habitats, welfare, reproduction, sex, and age requires non-invasive methods.

Method: Used Passive Acoustic Monitoring (PAM) to collect lobster bioacoustics (buzzing/carapace vibrations) from concrete tanks in Scotland. Applied Deep Learning (1D-CNN, 1D-DCNN) and six Machine Learning models (SVM, k-NN, Naive Bayes, Random Forest, XGBoost, MLP) with MFCC features for classification.

Result: Age classification achieved over 97% accuracy for most models (Naive Bayes: 91.31%). Sex classification exceeded 93.23% accuracy for all models except Naive Bayes. Both tasks showed strong performance across ML and DL approaches.

Conclusion: Supervised ML and DL can effectively extract age- and sex-related features from lobster sounds, providing a promising non-invasive PAM approach for conservation, detection, and management in aquaculture and fisheries with potential for real-world edge computing applications.

Abstract: Monitoring aquatic species, especially elusive ones like lobsters, presents challenges. This study focuses on Homarus gammarus (European lobster), a key species for fisheries and aquaculture, and leverages non-invasive Passive Acoustic Monitoring (PAM). Understanding lobster habitats, welfare, reproduction, sex, and age is crucial for management and conservation. While bioacoustic emissions have classified various aquatic species using Artificial Intelligence (AI) models, this research specifically uses H. gammarus bioacoustics (buzzing/carapace vibrations) to classify lobsters by age (juvenile/adult) and sex (male/female). The dataset was collected at Johnshaven, Scotland, using hydrophones in concrete tanks. We explored the efficacy of Deep Learning (DL) models (1D-CNN, 1D-DCNN) and six Machine Learning (ML) models (SVM, k-NN, Naive Bayes, Random Forest, XGBoost, MLP). Mel-frequency cepstral coefficients (MFCCs) were used as features. For age classification (adult vs. juvenile), most models achieved over 97% accuracy (Naive Bayes: 91.31%). For sex classification, all models except Naive Bayes surpassed 93.23%. These strong results demonstrate the potential of supervised ML and DL to extract age- and sex-related features from lobster sounds. This research offers a promising non-invasive PAM approach for lobster conservation, detection, and management in aquaculture and fisheries, enabling real-world edge computing applications for underwater species.

[307] The use of vocal biomarkers in the detection of Parkinson’s disease: a robust statistical performance comparison of classic machine learning models

Katia Pires Nascimento do Sacramento, Elliot Q. C. Garcia, Nicéias Silva Vilela, Vinicius P. Sacramento, Tiago A. E. Ferreira

Main category: cs.LG

TL;DR: Deep Neural Networks outperform traditional ML methods for Parkinson’s disease detection using vocal biomarkers, achieving 98.65% accuracy on Italian Voice dataset and 92.11% on Parkinson’s Telemonitoring dataset.

Details

Motivation: Parkinson's disease causes vocal impairments early on, and using vocal biomarkers provides a non-invasive, low-cost alternative for early diagnosis in clinical settings.

Method: Cross-sectional study using two public voice datasets, extracting MFCC features, comparing DNN with traditional ML methods through 1000 random executions, and using non-parametric statistical tests for validation.

Result: DNN achieved significantly higher accuracy (98.65% and 92.11%) than traditional ML models, demonstrating superior performance and efficiency in PD classification.

Conclusion: DNNs show great potential for accurate and reliable early detection of neurodegenerative diseases using voice biomarkers, outperforming traditional machine learning approaches.

Abstract: Parkinson’s disease (PD) is a progressive neurodegenerative disorder that, in addition to directly impairing functional mobility, is frequently associated with vocal impairments such as hypophonia and dysarthria, which typically manifest in the early stages. The use of vocal biomarkers to support the early diagnosis of PD presents a non-invasive, low-cost, and accessible alternative in clinical settings. Thus, the objective of this cross-sectional study was to consistently evaluate the effectiveness of a Deep Neural Network (DNN) in distinguishing individuals with Parkinson’s disease from healthy controls, in comparison with traditional Machine Learning (ML) methods, using vocal biomarkers. Two publicly available voice datasets were used. Mel-frequency cepstral coefficients (MFCCs) were extracted from the samples, and model robustness was assessed using a validation strategy with 1000 independent random executions. Performance was evaluated using classification statistics. Since normality assumptions were not satisfied, non-parametric tests (Kruskal-Wallis and Bonferroni post-hoc tests) were applied to verify whether the tested classification models were similar or different in the classification of PD. With an average accuracy of 98.65% and 92.11% on the Italian Voice dataset and Parkinson’s Telemonitoring dataset, respectively, the DNN demonstrated superior performance and efficiency compared to traditional ML models, while also achieving competitive results when benchmarked against relevant studies. Overall, this study confirms the efficiency of DNNs and emphasizes their potential to provide greater accuracy and reliability for the early detection of neurodegenerative diseases using voice-based biomarkers.
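
The statistical comparison described above (a Kruskal-Wallis test followed by Bonferroni-corrected post-hoc comparisons) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the accuracy values are made up for illustration, and the H statistic is computed without a tie correction.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def bonferroni_alpha(alpha, n_groups):
    """Per-comparison significance level when testing every pair of groups."""
    n_pairs = len(list(combinations(range(n_groups), 2)))
    return alpha / n_pairs


# Illustrative (made-up) per-run accuracies for three hypothetical classifiers.
dnn = [0.98, 0.97, 0.99]
svm = [0.93, 0.94, 0.92]
knn = [0.90, 0.91, 0.89]
h_stat = kruskal_wallis_h([dnn, svm, knn])
alpha_per_pair = bonferroni_alpha(0.05, 3)
```

A large H relative to the chi-squared distribution with (number of groups - 1) degrees of freedom suggests that at least one model differs; the pairwise post-hoc tests are then judged against the stricter `alpha_per_pair`.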

[308] Topologic Attention Networks: Attending to Direct and Indirect Neighbors through Gaussian Belief Propagation

Marshall Rosenhoover, Huaming Zhang

Main category: cs.LG

TL;DR: Topologic Attention Networks introduce a probabilistic attention mechanism that learns information flow through both direct and indirect graph connections, overcoming limitations of local message passing in GNNs while being more scalable than existing approaches.

Details

Motivation: Graph Neural Networks struggle with long-range dependencies due to local message passing, and existing solutions like continuous-time dynamics or dense self-attention suffer from high computational costs and limited scalability.

Method: Proposes topologic attention - a probabilistic mechanism that learns how information should flow through both direct and indirect connections in graphs, emerging from learned information propagation rather than explicit pairwise interactions.

Result: Reports state-of-the-art performance against all evaluated baseline models, providing unified reasoning over local and global relationships in graphs.

Conclusion: Topologic Attention Networks offer an effective and scalable solution for modeling long-range dependencies in graphs, outperforming existing approaches while maintaining computational efficiency.

Abstract: Graph Neural Networks rely on local message passing, which limits their ability to model long-range dependencies in graphs. Existing approaches extend this range through continuous-time dynamics or dense self-attention, but both suffer from high computational cost and limited scalability. We propose Topologic Attention Networks, a new framework that applies topologic attention, a probabilistic mechanism that learns how information should flow through both direct and indirect connections in a graph. Unlike conventional attention that depends on explicit pairwise interactions, topologic attention emerges from the learned information propagation of the graph, enabling unified reasoning over local and global relationships. This method achieves state-of-the-art performance across all measured baseline models. Our implementation is available at https://github.com/Marshall-Rosenhoover/Topologic-Attention-Networks.

[309] PersonalizedRouter: Personalized LLM Routing via Graph-based User Preference Modeling

Zhongjie Dai, Tao Feng, Jiaxuan You

Main category: cs.LG

TL;DR: PersonalizedRouter is a graph-based framework that performs personalized LLM selection by modeling user preferences from interaction data, significantly outperforming existing methods.

Details

Motivation: Current LLM selection methods optimize for fixed objectives like performance or cost, failing to learn individual user preferences from interaction data as user preferences vary in performance, cost, and response style.

Method: Converts interaction data into a heterogeneous graph to capture contextual relationships between user queries and optimal LLMs, using two evaluation strategies: multi-cost-efficiency simulation and LLM-as-a-Judge.

Result: Outperforms existing methods by 15.38% and 9.83% under simulation strategies, and by 16.19% and 59.69% on PersonaRoute-Bench with 1,000 users while maintaining higher efficiency. Shows strong few-shot generalization achieving 64.81% and 85.80% of fully trained performance.

Conclusion: PersonalizedRouter effectively addresses personalized LLM selection by leveraging interaction data and graph-based modeling, demonstrating significant improvements over existing methods and strong adaptability to new users and LLMs.

Abstract: The growing number of Large Language Models (LLMs) with diverse capabilities and response styles provides users with a wider range of choices, which presents challenges in selecting appropriate LLMs, as user preferences vary in terms of performance, cost, and response style. Current LLM selection methods typically optimize for a single fixed objective, such as performance, cost, or a trade-off between them, and fail to learn individual user preferences from interaction data. To address these limitations, we propose PersonalizedRouter, a graph-based framework that models diverse user profiles and performs personalized LLM selection by leveraging interaction data that includes task context, queries, candidate LLMs, and user decisions. To capture contextual information between user queries and optimal LLMs, PersonalizedRouter converts the interaction data into a heterogeneous graph, where the relationships between different types of nodes are represented by edges. To evaluate adaptability across users, we design two strategies: the multi-cost-efficiency simulation strategy and the LLM-as-a-Judge strategy. In addition, we construct PersonaRoute-Bench, a large-scale benchmark with 1,000 simulated users and 10 LLMs. Experimental results show that PersonalizedRouter significantly outperforms existing LLM selection methods and surpasses the strongest methods by a large margin of 15.38% and 9.83% under two simulation strategies. On the PersonaRoute-Bench with 1,000 users, it further surpasses the best methods by 16.19% and 59.69% while maintaining higher efficiency. Moreover, PersonalizedRouter demonstrates strong few-shot generalization, achieving 64.81% and 85.80% of the fully trained model’s performance when adapting to new users and new LLMs.

[310] Predicting Talent Breakout Rate using Twitter and TV data

Bilguun Batsaikhan, Hiroyuki Fukuda

Main category: cs.LG

TL;DR: This paper proposes a method to detect Japanese talents before their rise to stardom by combining Twitter and TV data, and compares traditional, neural network, and ensemble learning methods for predicting talent breakout.

Details

Motivation: Early detection of rising talents is crucial in advertising, and there's interest in applying neural network techniques to time-series forecasting despite traditional models' robustness.

Method: Combined Twitter and TV data to predict talent breakout, experimenting with traditional time-series models, neural networks, and ensemble learning methods.

Result: Ensemble learning performed best on standard regression metrics, but neural networks outperformed others in precision and recall when evaluated using the talent breakout concept.

Conclusion: Neural networks show superior forecasting ability for talent breakout detection despite ensemble methods performing better on conventional metrics, highlighting the importance of appropriate evaluation metrics.

Abstract: Early detection of rising talents is of paramount importance in the field of advertising. In this paper, we define a concept of talent breakout and propose a method to detect Japanese talents before their rise to stardom. The main focus of the study is to determine the effectiveness of combining Twitter and TV data on predicting time-dependent changes in social data. Although traditional time-series models are known to be robust in many applications, the success of neural network models in various fields (e.g., Natural Language Processing, Computer Vision, Reinforcement Learning) continues to spark an interest in the time-series community to apply new techniques in practice. Therefore, in order to find the best modeling approach, we have experimented with traditional, neural network and ensemble learning methods. We observe that ensemble learning methods outperform traditional and neural network models based on standard regression metrics. However, by utilizing the concept of talent breakout, we are able to assess the true forecasting ability of the models, where neural networks outperform traditional and ensemble learning methods in terms of precision and recall.
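
The gap between regression metrics and event-level metrics can be made concrete with a small sketch. The summary does not give the paper's breakout definition, so the growth-threshold rule below is a hypothetical stand-in, and the series are invented for illustration.

```python
def breakout_flags(series, threshold):
    """Flag a breakout wherever step-to-step relative growth exceeds threshold.

    Hypothetical definition for illustration only; the paper's own
    breakout concept is not specified in this summary.
    """
    return [
        prev > 0 and (cur - prev) / prev > threshold
        for prev, cur in zip(series, series[1:])
    ]


def precision_recall(predicted, actual):
    """Precision and recall of predicted breakout flags against actual ones."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented follower counts: observed values and one model's forecast.
observed = [100, 110, 200, 210, 400]
forecast = [100, 105, 190, 300, 310]

actual_events = breakout_flags(observed, threshold=0.5)
predicted_events = breakout_flags(forecast, threshold=0.5)
precision, recall = precision_recall(predicted_events, actual_events)
```

A forecast can track the observed level closely (low regression error) yet place its jumps at the wrong time steps, which is what precision and recall over breakout events penalize.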

[311] PepEVOLVE: Position-Aware Dynamic Peptide Optimization via Group-Relative Advantage

Trieu Nguyen, Hao-Wei Pang, Shasha Feng

Main category: cs.LG

TL;DR: PepEVOLVE is a position-aware dynamic framework that outperforms prior methods like PepINVENT by learning both where to edit and how to optimize macrocyclic peptides for multi-objective improvement, achieving higher scores and faster convergence.

Details

Motivation: Macrocyclic peptides have vast combinatorial spaces and multi-parameter objectives that make lead optimization slow and challenging. Prior approaches require pre-specifying mutable positions and use static pretraining that limits generalization and effective optimization.

Method: PepEVOLVE uses: (i) dynamic masking and CHUCKLES shifting for pretraining, (ii) context-free multi-armed bandit router to discover high-reward residues, and (iii) evolving optimization algorithm with group-relative advantage for stable reinforcement updates.

Result: On Rev-binding macrocycle benchmark, PepEVOLVE achieved higher mean scores (~0.8 vs 0.6), best candidates with score of 0.95 (vs 0.87), and converged in fewer steps when optimizing permeability and lipophilicity with structural constraints.

Conclusion: PepEVOLVE provides a practical, reproducible path to peptide lead optimization when optimal edit sites are unknown, enabling more efficient exploration and improved design quality across multiple objectives.

Abstract: Macrocyclic peptides are an emerging modality that combines biologics-like affinity with small-molecule-like developability, but their vast combinatorial space and multi-parameter objectives make lead optimization slow and challenging. Prior generative approaches such as PepINVENT require chemists to pre-specify mutable positions for optimization, choices that are not always known a priori, and rely on static pretraining and optimization algorithms that limit the model’s ability to generalize and effectively optimize peptide sequences. We introduce PepEVOLVE, a position-aware, dynamic framework that learns both where to edit and how to dynamically optimize peptides for multi-objective improvement. PepEVOLVE (i) augments pretraining with dynamic masking and CHUCKLES shifting to improve generalization, (ii) uses a context-free multi-armed bandit router that discovers high-reward residues, and (iii) couples a novel evolving optimization algorithm with group-relative advantage to stabilize reinforcement updates. During in silico evaluations, the router policy reliably learns and concentrates probability on chemically meaningful sites that influence the peptide’s properties. On a therapeutically motivated Rev-binding macrocycle benchmark, PepEVOLVE outperformed PepINVENT by reaching higher mean scores (approximately 0.8 vs. 0.6), achieving best candidates with a score of 0.95 (vs. 0.87), and converging in fewer steps under the task of optimizing permeability and lipophilicity with structural constraints. Overall, PepEVOLVE offers a practical, reproducible path to peptide lead optimization when optimal edit sites are unknown, enabling more efficient exploration and improving design quality across multiple objectives.
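
As a rough illustration of what a group-relative advantage can look like, the sketch below standardizes each candidate's reward against the other candidates sampled in the same group. This formulation is an assumption borrowed from common group-relative policy optimization practice; the summary does not state PepEVOLVE's exact formula, and the function name and reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled candidates.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Invented multi-objective scores for four peptide candidates in one group.
group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(group_rewards)
```

Because advantages are measured relative to the group, candidates that beat their peers receive positive updates and the rest negative ones, without needing a separately learned value baseline.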

[312] A Hybrid Computational Intelligence Framework for scRNA-seq Imputation: Integrating scRecover and Random Forests

Ali Anaissi, Deshao Liu, Yuanzhe Jia, Weidong Huang, Widad Alyassine, Junaid Akram

Main category: cs.LG

TL;DR: SCR-MF is a modular two-stage workflow for scRNA-seq dropout imputation that combines dropout detection with non-parametric imputation, achieving robust performance while preserving biological fidelity.

Details

Motivation: scRNA-seq suffers from pervasive dropout events that obscure biological signals, requiring effective imputation methods to recover missing data.

Method: Two-stage workflow combining principled dropout detection using scRecover with robust non-parametric imputation via missForest.

Result: Achieves robust and interpretable performance comparable to or exceeding existing methods, preserves biological fidelity, and provides competitive balance between accuracy and computational efficiency.

Conclusion: SCR-MF is suitable for mid-scale single-cell datasets, offering transparent and biologically faithful imputation.

Abstract: Single-cell RNA sequencing (scRNA-seq) enables transcriptomic profiling at cellular resolution but suffers from pervasive dropout events that obscure biological signals. We present SCR-MF, a modular two-stage workflow that combines principled dropout detection using scRecover with robust non-parametric imputation via missForest. Across public and simulated datasets, SCR-MF achieves robust and interpretable performance comparable to or exceeding existing imputation methods in most cases, while preserving biological fidelity and transparency. Runtime analysis demonstrates that SCR-MF provides a competitive balance between accuracy and computational efficiency, making it suitable for mid-scale single-cell datasets.

[313] CroTad: A Contrastive Reinforcement Learning Framework for Online Trajectory Anomaly Detection

Rui Xue, Dan He, Fengmei Jin, Chen Zhang, Xiaofang Zhou

Main category: cs.LG

TL;DR: CroTad is a contrastive reinforcement learning framework for online trajectory anomaly detection that is threshold-free and robust to noisy, irregularly sampled data, enabling fine-grained identification of abnormal segments at both sub-trajectory and point levels.

Details

Motivation: To address challenges in trajectory anomaly detection including underexplored sub-trajectory analysis, dependency on carefully tuned thresholds, and performance degradation due to irregular sampling and noise in training data.

Method: Proposes a contrastive reinforcement learning framework that uses contrastive learning to extract diverse normal travel patterns and deep reinforcement learning for online, real-time anomaly scoring.

Result: Extensive experiments on two real-world datasets demonstrate the effectiveness and robustness of the framework across various evaluation scenarios.

Conclusion: CroTad provides an effective solution for online trajectory anomaly detection that overcomes limitations of existing methods through its threshold-free design and robustness to data irregularities.

Abstract: Detecting trajectory anomalies is a vital task in modern Intelligent Transportation Systems (ITS), enabling the identification of unsafe, inefficient, or irregular travel behaviours. While deep learning has emerged as the dominant approach, several key challenges remain unresolved. First, sub-trajectory anomaly detection, capable of pinpointing the precise segments where anomalies occur, remains underexplored compared to whole-trajectory analysis. Second, many existing methods depend on carefully tuned thresholds, limiting their adaptability in real-world applications. Moreover, the irregular sampling of trajectory data and the presence of noise in training sets further degrade model performance, making it difficult to learn reliable representations of normal routes. To address these challenges, we propose a contrastive reinforcement learning framework for online trajectory anomaly detection, CroTad. Our method is threshold-free and robust to noisy, irregularly sampled data. By incorporating contrastive learning, CroTad learns to extract diverse normal travel patterns for different itineraries and effectively distinguish anomalous behaviours at both sub-trajectory and point levels. The detection module leverages deep reinforcement learning to perform online, real-time anomaly scoring, enabling timely and fine-grained identification of abnormal segments. Extensive experiments on two real-world datasets demonstrate the effectiveness and robustness of our framework across various evaluation scenarios.

[314] A novel approach to classification of ECG arrhythmia types with latent ODEs

Angelina Yan, Matt L. Sampson, Peter Melchior

Main category: cs.LG

TL;DR: An end-to-end pipeline using latent ODEs and gradient boosted trees for robust ECG arrhythmia classification across varying sampling frequencies, enabling long-term wearable monitoring without sacrificing performance.

Details

Motivation: Address the trade-off between high-fidelity 12-lead ECGs (short-term, spot-check) and wearable ECGs (long-term but lower sampling frequencies) for arrhythmia detection by creating a robust classification method that works across different sampling rates.

Method: Train latent ODEs to model continuous ECG waveforms and extract robust feature vectors from high-frequency single-channel signals. Create three latent vectors per waveform by downsampling from 360 Hz to 90 Hz and 45 Hz, then use gradient boosted trees for classification.

Result: Minimal performance degradation across frequencies: macro-averaged AUC-ROC values of 0.984 (360 Hz), 0.978 (90 Hz), and 0.976 (45 Hz), demonstrating robust classification despite significant downsampling.

Conclusion: The approach enables smaller wearables by sidestepping the trade-off between signal fidelity and battery life, promoting long-term monitoring of cardiac health through robust ECG classification across varying sampling frequencies.

Abstract: 12-lead ECGs with high sampling frequency are the clinical gold standard for arrhythmia detection, but their short-term, spot-check nature often misses intermittent events. Wearable ECGs enable long-term monitoring but suffer from irregular, lower sampling frequencies due to battery constraints, making morphology analysis challenging. We present an end-to-end classification pipeline to address these issues. We train a latent ODE to model continuous ECG waveforms and create robust feature vectors from high-frequency single-channel signals. We construct three latent vectors per waveform via downsampling the initial 360 Hz ECG to 90 Hz and 45 Hz. We then use a gradient boosted tree to classify these vectors and test robustness across frequencies. Performance shows minimal degradation, with macro-averaged AUC-ROC values of 0.984, 0.978, and 0.976 at 360 Hz, 90 Hz, and 45 Hz, respectively, suggesting a way to sidestep the trade-off between signal fidelity and battery life. This enables smaller wearables, promoting long-term monitoring of cardiac health.
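
The three sampling rates mentioned above correspond to keeping every sample, every 4th sample, and every 8th sample of a 360 Hz signal. The sketch below shows that bookkeeping with plain stride-based decimation on a synthetic sine wave. The summary does not say how the authors downsample, so treat this as an assumption; practical pipelines often apply a low-pass filter first to limit aliasing.

```python
import math


def decimate(signal, factor):
    """Keep every `factor`-th sample (no anti-aliasing filter applied)."""
    return signal[::factor]


BASE_RATE_HZ = 360

# One second of a synthetic 5 Hz sine wave standing in for an ECG trace.
time_axis = [i / BASE_RATE_HZ for i in range(BASE_RATE_HZ)]
waveform = [math.sin(2 * math.pi * 5 * t) for t in time_axis]

# Map each effective sampling rate (Hz) to its downsampled view.
views = {BASE_RATE_HZ // k: decimate(waveform, k) for k in (1, 4, 8)}
```

Each entry of `views` would then be encoded into its own latent vector before classification, giving the three vectors per waveform that the abstract describes.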

[315] ToC: Tree-of-Claims Search with Multi-Agent Language Models

Shuyang Yu, Jianan Liang, Hui Hu

Main category: cs.LG

TL;DR: Tree of Claims (ToC) is a framework that uses Monte Carlo Tree Search and multi-agent LLMs to optimize patent claims by balancing novelty, scope, and coherence, outperforming standard LLMs by 8-9% in composite scores.

Details

Motivation: Manual patent claim drafting is labor-intensive and inconsistent, while conventional LLMs lack structured reasoning needed for precise claim refinement.

Method: Integrates MCTS with collaborative multi-agent system: EditorAgent proposes edits, ExaminerAgent critiques novelty and prior art using chain-of-thought analysis, guided by multi-objective reward function.

Result: Outperforms standard LLMs in zero-shot/few-shot scenarios with 8% average composite score improvement (up to 9% in some cases) on 1145-claim benchmark.

Conclusion: ToC establishes transparent, controllable methodology that bridges LLM reasoning with MCTS planning for structured patent claim optimization.

Abstract: Optimizing patent claims is a critical yet challenging task, demanding careful balance between maximizing novelty and preserving legal scope. Manual claim drafting is labor-intensive, costly, and inherently inconsistent, while conventional Large Language Models (LLMs) often lack the structured, iterative reasoning essential for precise claim refinement. To address these challenges, we introduce Tree of Claims (ToC), an innovative framework that redefines claim editing as a guided search problem. ToC synergistically integrates Monte Carlo Tree Search (MCTS) with a collaborative multi-agent system, comprising an LLM-based EditorAgent that proposes contextually grounded edits, and an ExaminerAgent that mimics patent examiner critiques through structured, chain-of-thought analyses of novelty and prior art disclosure. Driven by a carefully designed multi-objective reward function, ToC jointly optimizes novelty, scope retention, and semantic coherence. Experimental evaluation on a benchmark of 1145 claims demonstrates that ToC significantly outperforms standard LLMs in zero-shot and few-shot scenarios, achieving an average composite score improvement of 8%, and up to 9% in certain cases. Extensive experiments, including detailed ablation studies, validate ToC’s efficacy in generating superior, legally robust claim revisions. Overall, ToC establishes a transparent, controllable, and interpretable methodology that effectively bridges advanced LLM reasoning capabilities with strategic MCTS planning for structured patent claim optimization. The source code is available at https://github.com/ysy2003/ToC.
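
To show how a multi-objective reward can steer a tree search over claim edits, the sketch below pairs a weighted-sum reward with the standard UCT selection score used in MCTS. Both the equal weights and the weighted-sum form are assumptions made for illustration; the summary does not describe ToC's actual reward function, and the function names are hypothetical.

```python
import math


def composite_reward(novelty, scope, coherence, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of three objective scores (assumed form and weights)."""
    w_novelty, w_scope, w_coherence = weights
    return w_novelty * novelty + w_scope * scope + w_coherence * coherence


def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCT score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are explored first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


# Invented scores for one candidate claim edit.
reward = composite_reward(novelty=0.9, scope=0.6, coherence=0.3)
```

During selection, the search would descend into the child edit with the highest `uct_score`, so rarely visited edits still get tried while well-rewarded ones are revisited more often.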

[316] Gradient flow for deep equilibrium single-index models

Sanjit Dandapanthula, Aaditya Ramdas

Main category: cs.LG

TL;DR: This paper provides theoretical analysis of gradient descent dynamics for deep equilibrium models (DEQs) in linear and single-index models, proving conservation laws, well-conditioned training, and linear convergence to global minimizers.

Details

Motivation: Despite DEQs' practical success in achieving state-of-the-art performance with infinitely deep weight-tied networks, there's limited theoretical understanding of their gradient descent dynamics, which this work aims to address.

Method: The authors rigorously analyze gradient descent dynamics for DEQs in linear models and single-index models, proving conservation laws and convergence properties through mathematical analysis and validating with experiments.

Result: Proved a conservation law for linear DEQs showing parameters remain on spheres during training, demonstrated gradient flow remains well-conditioned, and established linear convergence to global minimizers under appropriate initialization and step sizes.

Conclusion: The theoretical analysis fills gaps in DEQ literature by providing rigorous understanding of gradient descent dynamics, with conservation laws and convergence guarantees that support DEQs’ practical success.

Abstract: Deep equilibrium models (DEQs) have recently emerged as a powerful paradigm for training infinitely deep weight-tied neural networks that achieve state of the art performance across many modern machine learning tasks. Despite their practical success, theoretically understanding the gradient descent dynamics for training DEQs remains an area of active research. In this work, we rigorously study the gradient descent dynamics for DEQs in the simple setting of linear models and single-index models, filling several gaps in the literature. We prove a conservation law for linear DEQs which implies that the parameters remain trapped on spheres during training and use this property to show that gradient flow remains well-conditioned for all time. We then prove linear convergence of gradient descent to a global minimizer for linear DEQs and deep equilibrium single-index models under appropriate initialization and with a sufficiently small step size. Finally, we validate our theoretical findings through experiments.
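
For readers unfamiliar with DEQs, the sketch below shows the core idea in the simplest possible case: a scalar, weight-tied linear layer applied repeatedly until its output stops changing. The parameterization z = w*z + u*x is a common textbook form and an assumption here; the paper's exact linear DEQ may differ.

```python
def deq_fixed_point(w, u, x, tol=1e-12, max_iter=10_000):
    """Iterate z <- w*z + u*x until convergence (requires |w| < 1)."""
    z = 0.0
    for _ in range(max_iter):
        z_next = w * z + u * x
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge; need |w| < 1")


# Illustrative values: the closed-form equilibrium is u*x / (1 - w).
equilibrium = deq_fixed_point(w=0.5, u=2.0, x=3.0)
closed_form = 2.0 * 3.0 / (1 - 0.5)
```

The "infinitely deep" network is thus represented by its equilibrium point, and training adjusts `w` and `u` through that point rather than through an explicit stack of layers.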

[317] FIRM: Federated In-client Regularized Multi-objective Alignment for Large Language Models

Fatemeh Nourzad, Amirhossein Roknilamouki, Eylem Ekici, Jia Liu, Ness B. Shroff

Main category: cs.LG

TL;DR: FIRM is a federated multi-objective alignment method that achieves communication efficiency by solving regularized optimization locally, eliminating the need for multi-gradient transmissions while mitigating client disagreement drift.

Details

Motivation: Aligning LLMs with human values involves balancing conflicting objectives like helpfulness and harmlessness, but centralized training raises privacy concerns and FL methods face communication bottlenecks from multi-gradient transmissions.

Method: Each client locally solves a regularized multi-objective optimization problem with in-client regularization to mitigate disagreement drift, requiring only single parameter set transmission instead of multiple gradients.

Result: FIRM converges to Pareto-stationary points, provides finite-time convergence guarantees, leads to smoother training dynamics, reduced client disagreement drift, and improved reward trade-offs compared to baselines.

Conclusion: FIRM enables efficient federated multi-objective alignment with communication efficiency, proven convergence, and the ability to adapt trade-offs between objectives based on specified preferences.

Abstract: Aligning Large Language Models (LLMs) with human values often involves balancing multiple, conflicting objectives such as helpfulness and harmlessness. Training these models is computationally intensive, and centralizing the process raises significant data privacy concerns. Federated Learning (FL) offers a compelling alternative, but existing Federated Multi-Objective Optimization (FMOO) methods face severe communication bottlenecks as their reliance on transmitting multiple gradients to a server is unscalable for large models. We introduce FIRM (Federated In-client Regularized Multi-objective alignment), a novel algorithm that achieves both client disagreement drift mitigation and communication efficiency. In FIRM, each client locally solves a regularized multi-objective optimization problem. By directly mitigating client disagreement drift through in-client regularization, our method eliminates the need for the multi-gradient transmissions common in prior works. Consequently, clients need only to transmit a single set of adapted parameters, maintaining high communication efficiency. We prove that our algorithm converges to Pareto-stationary points and, to our knowledge, provide the first finite-time convergence guarantees for this federated multi-objective alignment setting. Empirically, we show that FIRM leads to smoother training dynamics, reduced client disagreement drift, and improved reward trade-offs compared to baselines. We further propose a method to incorporate a preference over the objectives and report empirical Pareto plots, demonstrating that FIRM can smoothly adapt trade-offs between objectives in response to specified preferences.

[318] Mask the Redundancy: Evolving Masking Representation Learning for Multivariate Time-Series Clustering

Zexi Tan, Xiaopeng Luo, Yunlin Liu, Yiqun Zhang

Main category: cs.LG

TL;DR: EMTC proposes an evolving-masked MTS clustering method that dynamically adapts masking to focus on discriminative timestamps through importance-aware variate-wise masking and multi-endogenous views representation learning.

Details

Motivation: Traditional MTS clustering suffers from redundancy in time-series data that diminishes attention to discriminative timestamps, and existing masking strategies are isolated from the learning process, preventing dynamic adaptation to clustering-critical timestamps.

Method: EMTC uses Importance-aware Variate-wise Masking (IVM) to adaptively guide discriminative representation learning, and Multi-Endogenous Views (MEV) modules with reconstruction and contrastive learning pathways to enhance generalization and prevent premature convergence.

Result: Extensive experiments on 15 real benchmark datasets show EMTC achieves an average improvement of 4.85% over the strongest baselines, demonstrating superiority over eight state-of-the-art methods.

Conclusion: The proposed EMTC method effectively addresses the redundancy problem in MTS clustering through dynamic masking adaptation and joint optimization of representation and clustering, achieving significant performance improvements.

Abstract: Multivariate Time-Series (MTS) clustering discovers intrinsic grouping patterns of temporal data samples. Although time-series provide rich discriminative information, they also contain substantial redundancy, such as steady-state machine operation records and zero-output periods of solar power generation. Such redundancy diminishes the attention given to discriminative timestamps in representation learning, thus leading to performance bottlenecks in MTS clustering. Masking has been widely adopted to enhance the MTS representation, where temporal reconstruction tasks are designed to capture critical information from MTS. However, most existing masking strategies appear to be standalone preprocessing steps, isolated from the learning process, which hinders dynamic adaptation to the importance of clustering-critical timestamps. Accordingly, this paper proposes the Evolving-masked MTS Clustering (EMTC) method, with its model architecture composed of Importance-aware Variate-wise Masking (IVM) and Multi-Endogenous Views (MEV) representation learning modules. IVM adaptively guides the model in learning more discriminative representations for clustering, while the MEV-based reconstruction and contrastive learning pathways enhance the generalization. That is, the MEV reconstruction facilitates multi-perspective complementary to prevent the masking from premature convergence, and the clustering-guided contrastive learning facilitates the joint optimization of representation and clustering. Extensive experiments on 15 real benchmark datasets demonstrate the superiority of EMTC in comparison with eight SOTA methods, where the EMTC achieves an average improvement of 4.85% over the strongest baselines.

[319] Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation

Aniketh Iyengar, Jiaqi Han, Boris Ruf, Vincent Grari, Marcin Detyniecki, Stefano Ermon

Main category: cs.LG

TL;DR: Proposes energy scaling laws for diffusion models based on computational complexity, enabling accurate prediction of GPU energy consumption across different models and hardware configurations.

Details

Motivation: Address growing energy consumption concerns in diffusion models by developing principled methods to predict energy usage across various model configurations and hardware setups.

Method: Adapt Kaplan scaling laws to predict GPU energy consumption based on FLOPs, decomposing diffusion inference into text encoding, iterative denoising, and decoding components, with focus on denoising operations.

Result: Achieves high predictive accuracy (R-squared > 0.9) within individual GPU architectures and strong cross-architecture generalization, maintaining high rank correlations across models for reliable energy estimation.

Conclusion: Validates the compute-bound nature of diffusion inference and provides foundation for sustainable AI deployment planning and carbon footprint estimation.

Abstract: The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution (256x256 to 1024x1024), precision (fp16/fp32), step counts (10-50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures (R-squared > 0.9) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model-hardware combinations. These results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
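The FLOPs-to-energy relationship described above can be illustrated with a small log-log regression. This is a minimal sketch, assuming a single-term power law of the form energy = a * FLOPs^b; the functional form, the helper names (`total_inference_flops`, `fit_power_law`), and every number below are illustrative assumptions, not the authors' model or measurements.

```python
import math

def total_inference_flops(text_enc, denoise_per_step, steps, decode, passes_per_step=1):
    """Sum the three stages named in the summary: encoding, iterative denoising, decoding.

    passes_per_step is a hypothetical knob (e.g. 2 when classifier-free guidance
    runs the denoiser twice per step).
    """
    return text_enc + steps * passes_per_step * denoise_per_step + decode

def fit_power_law(flops, energy):
    """Fit energy = a * flops**b by ordinary least squares in log-log space."""
    xs = [math.log(f) for f in flops]
    ys = [math.log(e) for e in energy]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    # R-squared of the linear fit, computed in log space.
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot

# Made-up stage costs: denoising dominates because it is repeated every step.
total = total_inference_flops(text_enc=1e9, denoise_per_step=1e12, steps=50, decode=1e10)
denoise_share = (50 * 1e12) / total

# Synthetic, noise-free data generated from a known power law.
flops = [1e12, 2e12, 4e12, 8e12, 1.6e13]
energy = [2e-10 * f ** 0.95 for f in flops]
a, b, r2 = fit_power_law(flops, energy)
```

On real measurements the fit would be noisy, and the R-squared value is what the summary's "R-squared > 0.9" figure refers to.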

[320] Step-E: A Differentiable Data Cleaning Framework for Robust Learning with Noisy Labels

Wenzhang Du

Main category: cs.LG

TL;DR: Step-E integrates sample selection and model training into a single optimization process, using loss-based ranking to gradually exclude high-loss examples and focus on consistent samples, improving performance on noisy datasets.

Details

Motivation: Training data often contains noisy labels and outliers that degrade neural network performance, and existing two-stage data cleaning pipelines don't fully exploit model feedback or adapt to unknown noise patterns.

Method: Step-E ranks samples by loss at each epoch and gradually increases the fraction of high-loss examples excluded from gradient updates after a warm-up stage, creating an online curriculum that focuses on easy/consistent examples.

Result: On CIFAR-100N, Step-E improved ResNet-18 test accuracy from 43.3% to 50.4%, outperforming loss truncation, self-paced learning, and one-shot filtering while approaching the clean-label oracle (60.5%). On CIFAR-10N (aggre), it improved accuracy from 83.9% to 85.3%, nearly matching the clean-label oracle (85.9%) with only moderate training-time overhead.

Conclusion: Step-E effectively handles noisy training data by integrating sample selection with model learning, achieving significant performance improvements while approaching clean-label oracle performance on benchmark datasets.

Abstract: Training data collected in the wild often contain noisy labels and outliers that substantially degrade the performance and reliability of deep neural networks. While data cleaning is commonly applied as a separate preprocessing stage, such two-stage pipelines neither fully exploit feedback from the downstream model nor adapt to unknown noise patterns. We propose Step-E, a simple framework that integrates sample selection and model learning into a single optimization process. At each epoch, Step-E ranks samples by loss and gradually increases the fraction of high-loss examples that are excluded from gradient updates after a brief warm-up stage, yielding an online curriculum that focuses on easy and consistent examples and eventually ignores persistent outliers. On CIFAR-100N, Step-E improves the test accuracy of a ResNet-18 model from 43.3% (+/- 0.7%) to 50.4% (+/- 0.9%), clearly outperforming loss truncation, self-paced learning, and one-shot filtering while approaching the clean-label oracle at 60.5% (+/- 0.2%). On CIFAR-10N (aggre), Step-E also improves over the noisy baseline (85.3% vs. 83.9%) and nearly matches the clean-label oracle (85.9%), with only moderate training-time overhead.
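The per-epoch selection rule described above can be sketched as follows. This is a minimal illustration, assuming a linear ramp of the excluded fraction after warm-up; the ramp shape, the function names, and the parameter values are assumptions, since the summary only says the fraction increases gradually.

```python
def drop_fraction(epoch, warmup_epochs, total_epochs, max_drop):
    """Fraction of highest-loss samples to exclude at a given epoch.

    Returns 0 during warm-up, then ramps linearly up to max_drop by the last epoch.
    The linear ramp is an assumed schedule, not necessarily the paper's.
    """
    if epoch < warmup_epochs:
        return 0.0
    span = max(1, total_epochs - warmup_epochs)
    progress = min(1.0, (epoch - warmup_epochs + 1) / span)
    return max_drop * progress

def select_kept_indices(losses, frac_dropped):
    """Rank samples by loss and keep the lowest-loss ones for the gradient update."""
    n_drop = int(frac_dropped * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[: len(losses) - n_drop])

# Hypothetical per-sample losses for one epoch.
losses = [0.2, 3.1, 0.5, 0.1, 2.4]
early = select_kept_indices(losses, drop_fraction(0, 2, 10, 0.4))
late = select_kept_indices(losses, drop_fraction(9, 2, 10, 0.4))
```

In a training loop, only the samples in the kept index list would contribute to the loss that is backpropagated in that epoch.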

[321] Hash Collisions in Molecular Fingerprints: Effects on Property Prediction and Bayesian Optimization

Walter Virany, Austin Tripp

Main category: cs.LG

TL;DR: Exact molecular fingerprints slightly improve predictive accuracy but don’t significantly enhance Bayesian optimization performance compared to standard compressed fingerprints.

Details

Motivation: Hash collisions in standard molecular fingerprints cause distinct substructures to share features, leading to overestimated molecular similarity and potentially reduced accuracy in molecular property prediction.

Method: Compare exact fingerprints against standard compressed fingerprints using Gaussian process models on five molecular property prediction benchmarks from the DOCKSTRING dataset, and evaluate performance in Bayesian optimization.

Result: Exact fingerprints yield small but consistent improvements in predictive accuracy across all five benchmarks, but these accuracy gains do not translate to significant improvements in Bayesian optimization performance.

Conclusion: While exact fingerprints provide modest accuracy benefits for molecular property prediction, they do not offer substantial advantages for Bayesian optimization tasks, suggesting that hash collision issues in compressed fingerprints may not be a major limiting factor in optimization contexts.

Abstract: Molecular fingerprinting methods use hash functions to create fixed-length vector representations of molecules. However, hash collisions cause distinct substructures to be represented with the same feature, leading to overestimates in molecular similarity calculations. We investigate whether using exact fingerprints improves accuracy compared to standard compressed fingerprints in molecular property prediction and Bayesian optimization where the underlying predictive model is a Gaussian process. We find that using exact fingerprints yields a small yet consistent improvement in predictive accuracy on five molecular property prediction benchmarks from the DOCKSTRING dataset. However, these gains did not translate to significant improvements in Bayesian optimization performance.
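The similarity overestimate caused by collisions can be shown with a toy example. This is a minimal sketch, assuming fingerprints are sets of integer substructure identifiers and that compression is a simple modulo fold; the identifiers, the fold rule, and the function names are hypothetical and not taken from the paper or from any specific cheminformatics library.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of active features."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def fold(substructure_ids, n_bits):
    """Compress exact identifiers into n_bits positions; distinct ids may collide."""
    return {s % n_bits for s in substructure_ids}

# Hypothetical substructure identifiers for two molecules; only id 3 is truly shared.
mol_a = {3, 17, 1029, 2050}
mol_b = {3, 5, 6, 4098}

exact = tanimoto(mol_a, mol_b)                            # uses exact identifiers
folded = tanimoto(fold(mol_a, 1024), fold(mol_b, 1024))   # uses the compressed form
```

Here ids 1029 and 5, and ids 2050 and 4098, land on the same positions after folding, so the compressed fingerprints share three features instead of one.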

[322] Why Do Language Model Agents Whistleblow?

Kushal Agrawal, Frank Xiao, Guido Bergman, Asa Cooper Stickland

Main category: cs.LG

TL;DR: LLMs can autonomously whistleblow on suspected misconduct to external parties without user instruction, with whistleblowing frequency varying by model and being influenced by task complexity, moral nudges, and available tools.

Details

Motivation: To study how LLM alignment training manifests when models use tools, specifically examining unauthorized whistleblowing behavior where models disclose misconduct without user knowledge.

Method: Created an evaluation suite with diverse staged misconduct scenarios to test agents, analyzed factors affecting whistleblowing rates across models and settings.

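Result: Whistleblowing frequency varies widely across model families, decreases as task complexity increases, rises substantially when the system prompt nudges the agent to act morally, and decreases when more tools and a detailed workflow are provided. Both black-box methods and activation probes show lower evaluation awareness in these settings than in comparable previous work.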

Conclusion: LLM whistleblowing behavior is influenced by multiple factors, and the study provides a robust framework for assessing this emergent alignment behavior in tool-using agents.

Abstract: The deployment of Large Language Models (LLMs) as tool-using agents causes their alignment training to manifest in new ways. Recent work finds that language models can use tools in ways that contradict the interests or explicit instructions of the user. We study LLM whistleblowing: a subset of this behavior where models disclose suspected misconduct to parties beyond the dialog boundary (e.g., regulatory agencies) without user instruction or knowledge. We introduce an evaluation suite of diverse and realistic staged misconduct scenarios to assess agents for this behavior. Across models and settings, we find that: (1) the frequency of whistleblowing varies widely across model families, (2) increasing the complexity of the task the agent is instructed to complete lowers whistleblowing tendencies, (3) nudging the agent in the system prompt to act morally substantially raises whistleblowing rates, and (4) giving the model more obvious avenues for non-whistleblowing behavior, by providing more tools and a detailed workflow to follow, decreases whistleblowing rates. Additionally, we verify the robustness of our dataset by testing for model evaluation awareness, and find that both black-box methods and probes on model activations show lower evaluation awareness in our settings than in comparable previous work.

[323] Geometric-Disentangelment Unlearning

Duo Zhou, Yuji Zhang, Tianxin Wei, Ruizhong Qiu, Ke Yang, Xiao Lin, Cheng Qian, Jingrui He, Hanghang Tong, Heng Ji, Huan Zhang

Main category: cs.LG

TL;DR: Proposes Geometric-disentanglement Unlearning (GU), a plug-and-play method that decomposes each forget gradient update into components tangential and normal to the retain-gradient subspace and executes only the normal component, achieving effective unlearning while preserving retained knowledge.

Details

Motivation: Existing machine unlearning approaches face a tradeoff between effective forgetting and preservation of retained knowledge, often lacking formal analysis of how forgetting updates harm retained knowledge and whether side effects can be removed with theoretical guarantees.

Method: Starts from a first-order analysis showing that the retain loss is unchanged to first order iff the update direction is orthogonal to the subspace spanned by the retain gradients. GU decomposes any candidate forget gradient update into components tangential and normal to that subspace and executes only the normal component under a trust-region budget.

Result: GU achieves consistent improvement on various methods across three benchmarks (TOFU, MUSE, and WMDP), demonstrating effective unlearning while preserving retained knowledge.

Conclusion: The proposed geometric disentanglement approach provides a theoretically sound and simple solution for machine unlearning that mitigates side effects on retained knowledge while achieving effective forgetting.

Abstract: Machine unlearning, the removal of a training subset’s influence from a deployed model, is critical for privacy preservation and model reliability, yet gradient ascent on forget samples often harms retained knowledge. Existing approaches face a persistent tradeoff between effective forgetting and preservation on the retain set. While previous methods provide useful heuristics, they often lack a formal analysis on how exactly forgetting updates harm retained knowledge, and whether the side effects can be removed with theoretical guarantees. To explore a theoretically sound and simple solution, we start from the first principle on how performance on the retain set is actually affected: a first-order analysis of the local change of the retain loss under small parameter updates during model training. We start from a crisp equivalence: the retain loss is unchanged to first order iff the update direction is orthogonal to the subspace spanned by retain gradients (“retain-invariant”). This identifies the entangled component as the tangential part of forget update within the retain-gradient subspace, and characterizes disentanglement as orthogonality. Guided by this, we propose the Geometric-disentanglement Unlearning (GU) that decomposes any candidate forget gradient update into tangential and normal components to retain space and executes only the normal component. Under a standard trust-region budget, the projected direction aligned with the raw forget gradient is optimal among all first-order retain-invariant moves, and we also derive the optimal projected direction for joint forget-retain updating objectives. Our method is plug-and-play and can be attached to existing gradient-based unlearning procedures to mitigate side effects. GU achieves consistent improvement on various methods across three benchmarks TOFU, MUSE, and WMDP.
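The tangential/normal decomposition described above can be sketched with plain vectors. This is a minimal illustration, assuming the retain-gradient subspace is spanned by a handful of explicit gradient vectors and orthonormalized with Gram-Schmidt; the function names and toy gradients are hypothetical, and the trust-region scaling is omitted.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def split_update(forget_grad, retain_grads):
    """Split forget_grad into parts inside (tangential) and orthogonal to (normal)
    the subspace spanned by retain_grads."""
    tangential = [0.0] * len(forget_grad)
    for q in orthonormal_basis(retain_grads):
        c = dot(forget_grad, q)
        tangential = [t + c * qi for t, qi in zip(tangential, q)]
    normal = [g - t for g, t in zip(forget_grad, tangential)]
    return tangential, normal

# Toy 3-parameter example with two retain gradients.
retain_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
forget_grad = [2.0, -1.0, 3.0]
tangential, normal = split_update(forget_grad, retain_grads)
```

Because the normal component has zero inner product with every retain gradient, a small step along it leaves the retain loss unchanged to first order, which is the equivalence the abstract states.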

[324] Four decades of circumpolar super-resolved satellite land surface temperature data

Sonia Dupuis, Nando Metzger, Konrad Schindler, Frank Göttsche, Stefan Wunderle

Main category: cs.LG

TL;DR: A new 42-year pan-Arctic land surface temperature dataset downscaled from AVHRR GAC data to 1 km resolution using deep learning, enabling improved permafrost modeling and climate monitoring.

Details

Motivation: Coarse spatial resolution of AVHRR's global area coverage data limits utility for analyzing fine-scale permafrost dynamics and surface processes in the rapidly warming Arctic.

Method: Super-resolution algorithm based on deep anisotropic diffusion model trained on MODIS LST data, using coarsened inputs and native-resolution outputs guided by high-resolution land cover, digital elevation, and vegetation height maps.

Result: Created twice-daily, 1 km LST observations for entire pan-Arctic region over four decades, enabling improved permafrost modeling, air temperature reconstruction, and Greenland Ice Sheet surface mass balance assessment.

Conclusion: The enhanced dataset supports climate monitoring in pre-MODIS era and provides adaptable framework for future satellite missions to ensure thermal infrared observation and climate data record continuity.

Abstract: Land surface temperature (LST) is an essential climate variable (ECV) crucial for understanding land-atmosphere energy exchange and monitoring climate change, especially in the rapidly warming Arctic. Long-term satellite-based LST records, such as those derived from the Advanced Very High Resolution Radiometer (AVHRR), are essential for detecting climate trends. However, the coarse spatial resolution of AVHRR’s global area coverage (GAC) data limits their utility for analyzing fine-scale permafrost dynamics and other surface processes in the Arctic. This paper presents a new 42-year pan-Arctic LST dataset, downscaled from AVHRR GAC to 1 km with a super-resolution algorithm based on a deep anisotropic diffusion model. The model is trained on MODIS LST data, using coarsened inputs and native-resolution outputs, guided by high-resolution land cover, digital elevation, and vegetation height maps. The resulting dataset provides twice-daily, 1 km LST observations for the entire pan-Arctic region over four decades. This enhanced dataset enables improved modelling of permafrost, reconstruction of near-surface air temperature, and assessment of surface mass balance of the Greenland Ice Sheet. Additionally, it supports climate monitoring efforts in the pre-MODIS era and offers a framework adaptable to future satellite missions for thermal infrared observation and climate data record continuity.

[325] Reconstruction of Surface EMG Signal using IMU data for Upper Limb Actions

Shubhranil Basak, Mada Hemanth, Madhav Rao

Main category: cs.LG

TL;DR: A deep learning model synthesizes sEMG signals from IMU data for muscle intent detection.

Details

Motivation: sEMG provides insights into muscle function but is noisy and hard to acquire, while IMUs offer robust, wearable motion tracking.

Method: Used a Sliding-Window-Wave-Net with dilated causal convolutions to map 6-axis IMU data to sEMG signals, trained on simultaneous sEMG and IMU recordings of arm movements sampled at 1 kHz.

Result: The model successfully predicts the timing and general shape of muscle activations with high temporal fidelity, though peak amplitudes were often underestimated.

Conclusion: IMU-to-sEMG synthesis is feasible for muscle intent detection in prosthetics and rehabilitation biofeedback applications.

Abstract: Surface Electromyography (sEMG) provides vital insights into muscle function, but it can be noisy and challenging to acquire. Inertial Measurement Units (IMUs) provide a robust and wearable alternative to motion capture systems. This paper investigates the synthesis of normalized sEMG signals from 6-axis IMU data using a deep learning approach. We collected simultaneous sEMG and IMU data sampled at 1 kHz for various arm movements. A Sliding-Window-Wave-Net model, based on dilated causal convolutions, was trained to map the IMU data to the sEMG signal. The results show that the model successfully predicts the timing and general shape of muscle activations. Although peak amplitudes were often underestimated, the high temporal fidelity demonstrates the feasibility of using this method for muscle intent detection in applications such as prosthetics and rehabilitation biofeedback.
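The dilated causal convolution that the model is built on can be illustrated on a single channel. This is a minimal sketch of the generic operation only; the kernel size, dilation pattern, and function names are assumptions and do not describe the authors' Sliding-Window-Wave-Net.

```python
def causal_dilated_conv1d(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zero, so y[t]
    never depends on future inputs (the causal property).
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y

def receptive_field(kernel_size, dilations):
    """Number of past samples visible to the output of a stack of dilated layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

signal = [1, 2, 3, 4, 5]
out = causal_dilated_conv1d(signal, weights=[1, 1], dilation=2)
rf = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

At the 1 kHz sampling rate reported in the abstract, a receptive field of N samples corresponds to N milliseconds of IMU history.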

[326] DelTriC: A Novel Clustering Method with Accurate Outlier

Tomas Javurek, Michal Gregor, Sebastian Kula, Marian Simko

Main category: cs.LG

TL;DR: DelTriC is a clustering algorithm that combines PCA/UMAP projection, Delaunay triangulation, and back-projection to form clusters in the original high-dimensional space, and it can outperform traditional methods such as k-means, DBSCAN, and HDBSCAN in many scenarios.

Details

Motivation: To develop a clustering method that decouples neighborhood construction from decision-making and works effectively in high-dimensional spaces where traditional methods struggle.

Method: Projects data to low-dimensional space using PCA/UMAP, performs Delaunay triangulation to index local adjacency, then back-projects to original space for robust edge pruning, merging, and anomaly detection.

Result: Outperforms traditional clustering methods (k-means, DBSCAN, HDBSCAN) in many scenarios, offering both scalability and accuracy while significantly improving outlier detection.

Conclusion: DelTriC provides an effective approach for high-dimensional clustering by leveraging low-dimensional projections while maintaining cluster integrity in the original space, offering advantages over traditional clustering algorithms.

Abstract: The paper introduces DelTriC (Delaunay Triangulation Clustering), a clustering algorithm which integrates PCA/UMAP-based projection, Delaunay triangulation, and a novel back-projection mechanism to form clusters in the original high-dimensional space. DelTriC decouples neighborhood construction from decision-making by first triangulating in a low-dimensional proxy to index local adjacency, and then back-projecting to the original space to perform robust edge pruning, merging, and anomaly detection. DelTriC can outperform traditional methods such as k-means, DBSCAN, and HDBSCAN in many scenarios; it is both scalable and accurate, and it also significantly improves outlier detection.

[327] Generating transition states of chemical reactions via distance-geometry-based flow matching

Yufei Luo, Xiang Gu, Jian Sun

Main category: cs.LG

TL;DR: TS-DFM is a flow matching framework that predicts transition states from reactants and products using molecular distance geometry, outperforming the previous state-of-the-art method React-OT by 30% in structural accuracy on Transition1X and enabling discovery of alternative reaction paths.

Details

Motivation: Transition states are crucial for understanding reaction mechanisms but are difficult to explore experimentally and computationally, requiring better prediction methods.

Method: TS-DFM uses flow matching in molecular distance geometry space with TSDVNet to learn velocity fields for generating accurate TS geometries.

Result: Outperforms React-OT by 30% in structural accuracy on the Transition1X dataset, provides high-quality initial structures that accelerate the convergence of CI-NEB optimization, identifies alternative reaction paths (including a more favorable TS with a lower energy barrier), and generalizes well to unseen molecules and reaction types on the RGD1 dataset.

Conclusion: TS-DFM demonstrates strong potential for facilitating reaction exploration by accurately predicting transition states and discovering alternative reaction pathways.

Abstract: Transition states (TSs) are crucial for understanding reaction mechanisms, yet their exploration is limited by the complexity of experimental and computational approaches. Here we propose TS-DFM, a flow matching framework that predicts TSs from reactants and products. By operating in molecular distance geometry space, TS-DFM explicitly captures the dynamic changes of interatomic distances in chemical reactions. A network structure named TSDVNet is designed to learn the velocity field for generating TS geometries accurately. On the benchmark dataset Transition1X, TS-DFM outperforms the previous state-of-the-art method React-OT by 30% in structural accuracy. These predicted TSs provide high-quality initial structures, accelerating the convergence of CI-NEB optimization. Additionally, TS-DFM can identify alternative reaction paths. In our experiments, even a more favorable TS with lower energy barrier is discovered. Further tests on RGD1 dataset confirm its strong generalization ability on unseen molecules and reaction types, highlighting its potential for facilitating reaction exploration.

[328] FlexiFlow: decomposable flow matching for generation of flexible molecular ensemble

Riccardo Tedoldi, Ola Engkvist, Patrick Bryant, Hossein Azizpour, Jon Paul Janet, Alessandro Tibo

Main category: cs.LG

TL;DR: FlexiFlow is a novel flow-matching architecture that jointly generates 3D molecular structures with multiple conformations, addressing limitations of current single-conformation models in drug discovery.

Details

Motivation: Current 3D molecular generation models only produce single conformations, but the conformational landscape determines molecular properties and binding affinity. Generating multiple low-energy conformers enables better assessment of thermodynamic properties and improved drug design.

Method: Extends flow-matching models to jointly sample molecules with multiple conformations while preserving equivariance and permutation invariance. Uses novel architecture for conformational ensemble generation.

Result: Achieves SOTA results on QM9 and GEOM Drugs datasets, generating valid, unstrained, unique, and novel molecules with high fidelity. Produces conformational ensembles comparable to physics-based methods but much faster. Successfully transfers to protein-conditioned ligand generation.

Conclusion: FlexiFlow enables efficient joint generation of molecules with conformational diversity, providing better coverage of molecular properties and faster inference than traditional methods, with applications in drug discovery.

Abstract: Sampling useful three-dimensional molecular structures along with their most favorable conformations is a key challenge in drug discovery. Current state-of-the-art 3D de-novo design flow matching or diffusion-based models are limited to generating a single conformation. However, the conformational landscape of a molecule determines its observable properties and how tightly it is able to bind to a given protein target. By generating a representative set of low-energy conformers, we can more directly assess these properties and potentially improve the ability to generate molecules with desired thermodynamic observables. Towards this aim, we propose FlexiFlow, a novel architecture that extends flow-matching models, allowing for the joint sampling of molecules along with multiple conformations while preserving both equivariance and permutation invariance. We demonstrate the effectiveness of our approach on the QM9 and GEOM Drugs datasets, achieving state-of-the-art results in molecular generation tasks. Our results show that FlexiFlow can generate valid, unstrained, unique, and novel molecules with high fidelity to the training data distribution, while also capturing the conformational diversity of molecules. Moreover, we show that our model can generate conformational ensembles that provide similar coverage to state-of-the-art physics-based methods at a fraction of the inference time. Finally, FlexiFlow can be successfully transferred to the protein-conditioned ligand generation task, even when the dataset contains only static pockets without accompanying conformations.

[329] Enforcing governing equation constraints in neural PDE solvers via training-free projections

Omer Rochman, Gilles Louppe

Main category: cs.LG

TL;DR: Two post-hoc projection methods (nonlinear optimization and local linearization) are evaluated to enforce constraints in neural PDE solvers, reducing violations and improving accuracy over physics-informed baselines.

Details

Motivation: Neural PDE solvers often violate governing equation constraints, especially nonlinear constraints in dynamical PDEs which create long-range temporal dependencies, making projection onto feasible sets challenging.

Method: Evaluated two training-free post-hoc projection approaches: 1) nonlinear optimization-based projection, and 2) local linearization-based projection using Jacobian-vector and vector-Jacobian products.

Result: Both projection methods substantially reduce constraint violations and improve accuracy compared to physics-informed baselines across representative PDEs.

Conclusion: Post-hoc projection methods effectively enforce constraints in neural PDE solvers, offering a practical solution to constraint violation issues without requiring retraining.

Abstract: Neural PDE solvers used for scientific simulation often violate governing equation constraints. While linear constraints can be projected cheaply, many constraints are nonlinear, complicating projection onto the feasible set. Dynamical PDEs are especially difficult because constraints induce long-range dependencies in time. In this work, we evaluate two training-free, post hoc projections of approximate solutions: a nonlinear optimization-based projection, and a local linearization-based projection using Jacobian-vector and vector-Jacobian products. We analyze constraints across representative PDEs and find that both projections substantially reduce violations and improve accuracy over physics-informed baselines.
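The local-linearization idea can be sketched for a single scalar constraint. This is a minimal illustration, assuming the constraint is c(u) = 0 with a hand-written gradient; each step applies the minimum-norm correction that satisfies the linearized constraint. The paper obtains the required products with Jacobian-vector and vector-Jacobian operations, whereas the unit-energy constraint, the analytic gradient, and the function names here are hypothetical stand-ins.

```python
def project_linearized(u, constraint, gradient, n_iters=20, tol=1e-12):
    """Repeatedly linearize c(u) around the current point and take the
    smallest step that zeroes the linearized constraint:
        u <- u - g * c(u) / (g . g),  with g = grad c(u).
    """
    u = list(u)
    for _ in range(n_iters):
        c = constraint(u)
        if abs(c) < tol:
            break
        g = gradient(u)
        gg = sum(gi * gi for gi in g)
        if gg == 0.0:  # no usable descent direction
            break
        step = c / gg
        u = [ui - step * gi for ui, gi in zip(u, g)]
    return u

# Hypothetical nonlinear constraint: the state must have unit "energy".
def energy_constraint(u):
    return sum(ui * ui for ui in u) - 1.0

def energy_gradient(u):
    return [2.0 * ui for ui in u]

u_pred = [0.9, 0.6, 0.1]  # made-up solver output that violates the constraint
u_proj = project_linearized(u_pred, energy_constraint, energy_gradient)
```

The correction is applied after prediction and requires no retraining, which is what "training-free" refers to in the summary.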

[330] Automobile demand forecasting: Spatiotemporal and hierarchical modeling, life cycle dynamics, and user-generated online information

Tom Nahrendorf, Stefan Minner, Helfried Binder, Richard Zinck

Main category: cs.LG

TL;DR: This study develops a forecasting methodology for premium automotive demand using ensemble models and mixed-integer programming to address sparse data and volatile markets across multiple planning levels.

Details

Motivation: Premium automotive manufacturers face complex forecasting challenges due to high product variety, sparse variant-level data, and volatile market dynamics that require accurate demand predictions.

Method: Combines point and probabilistic forecasts using ensembles of LightGBM models with pooled training sets, quantile regression, and mixed-integer linear programming reconciliation across strategic and operational planning levels.

Result: Spatiotemporal dependencies and rounding bias significantly affect forecast accuracy, with integer forecasts being crucial for operational feasibility. Online behavioral data improves accuracy at disaggregated levels.

Conclusion: Short-term demand is reactive (life cycle maturity, autoregressive momentum), while medium-term demand is anticipatory (online engagement, planning targets), with the methodology effectively addressing premium automotive forecasting challenges.

Abstract: Premium automotive manufacturers face increasingly complex forecasting challenges due to high product variety, sparse variant-level data, and volatile market dynamics. This study addresses monthly automobile demand forecasting across a multi-product, multi-market, and multi-level hierarchy using data from a German premium manufacturer. The methodology combines point and probabilistic forecasts across strategic and operational planning levels, leveraging ensembles of LightGBM models with pooled training sets, quantile regression, and a mixed-integer linear programming reconciliation approach. Results highlight that spatiotemporal dependencies, as well as rounding bias, significantly affect forecast accuracy, underscoring the importance of integer forecasts for operational feasibility. Shapley analysis shows that short-term demand is reactive, shaped by life cycle maturity, autoregressive momentum, and operational signals, whereas medium-term demand reflects anticipatory drivers such as online engagement, planning targets, and competitive indicators, with online behavioral data considerably improving accuracy at disaggregated levels.

[331] SAVeD: Semantic Aware Version Discovery

Artem Frenk, Roee Shraga

Main category: cs.LG

TL;DR: SAVeD is a contrastive learning framework that identifies dataset versions without metadata by using semantic similarity through table transformations and transformer embeddings.

Details

Motivation: Addresses repeated labor in data science caused by difficulty tracking similar work or transformations on datasets, eliminating reliance on metadata, labels, or integration assumptions.

Method: Uses modified SimCLR pipeline with random table transformations (row deletion, encoding perturbations), custom transformer encoder for embeddings, and contrastive learning to optimize semantic similarity between dataset versions.

Result: After training, achieves significantly higher accuracy on completely unseen tables and a significant boost in separation scores across five canonical datasets from the Semantic Versioning in Databases Benchmark, with competitive or superior results compared with untrained baselines and prior state-of-the-art dataset-discovery methods like Starmie.

Conclusion: SAVeD effectively distinguishes semantically altered dataset versions through contrastive learning, demonstrating strong performance in version detection without metadata dependency.

Abstract: Our work introduces SAVeD (Semantically Aware Version Detection), a contrastive learning-based framework for identifying versions of structured datasets without relying on metadata, labels, or integration-based assumptions. SAVeD addresses a common challenge in data science of repeated labor due to a difficulty of similar work or transformations on datasets. SAVeD employs a modified SimCLR pipeline, generating augmented table views through random transformations (e.g., row deletion, encoding perturbations). These views are embedded via a custom transformer encoder and contrasted in latent space to optimize semantic similarity. Our model learns to minimize distances between augmented views of the same dataset and maximize those between unrelated tables. We evaluate performance using validation accuracy and separation, defined respectively as the proportion of correctly classified version/non-version pairs on a hold-out set, and the difference between average similarities of versioned and non-versioned tables (defined by a benchmark, and not provided to the model). Our experiments span five canonical datasets from the Semantic Versioning in Databases Benchmark, and demonstrate substantial gains post-training. SAVeD achieves significantly higher accuracy on completely unseen tables in, and a significant boost in separation scores, confirming its capability to distinguish semantically altered versions. Compared to untrained baselines and prior state-of-the-art dataset-discovery methods like Starmie, our custom encoder achieves competitive or superior results.
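The two evaluation quantities defined above can be computed from pairwise similarities as follows. This is a minimal sketch, assuming each table pair has one similarity score and a benchmark label, and that a pair is classified as a version when its similarity meets a fixed threshold; the threshold rule, the function names, and the scores are assumptions, since the abstract does not say how pairs are classified.

```python
def separation(similarities, is_version_pair):
    """Mean similarity of versioned pairs minus mean similarity of non-versioned pairs."""
    pos = [s for s, v in zip(similarities, is_version_pair) if v]
    neg = [s for s, v in zip(similarities, is_version_pair) if not v]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

def pair_accuracy(similarities, is_version_pair, threshold=0.5):
    """Proportion of pairs whose thresholded similarity matches the benchmark label."""
    correct = sum(
        (s >= threshold) == bool(v) for s, v in zip(similarities, is_version_pair)
    )
    return correct / len(similarities)

# Made-up similarity scores for six table pairs and their benchmark labels.
sims = [0.92, 0.81, 0.40, 0.15, 0.60, 0.05]
labels = [True, True, True, False, False, False]

sep = separation(sims, labels)
acc = pair_accuracy(sims, labels)
```

A larger separation means versioned tables sit closer together in the embedding space than unrelated tables do.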

[332] Self-supervised denoising of raw tomography detector data for improved image reconstruction

Israt Jahan Tulin, Sebastian Starke, Dominic Windisch, André Bieberle, Peter Steinbach

Main category: cs.LG

TL;DR: Deep learning methods outperform traditional denoising for noisy X-ray CT data, improving both raw detector data and reconstructed images.

Details

Motivation: Ultrafast electron beam X-ray CT produces noisy data due to short measurement times, causing reconstruction artifacts and limiting image quality.

Method: Two self-supervised deep learning methods for denoising raw detector data were investigated and compared against a non-learning based denoising method.

Result: Deep-learning-based methods enhanced signal-to-noise ratios in detector data and led to consistent improvements in reconstructed images, outperforming the non-learning based method.

Conclusion: Self-supervised deep learning approaches are effective for denoising ultrafast X-ray CT data and provide superior performance compared to traditional methods.

Abstract: Ultrafast electron beam X-ray computed tomography produces noisy data due to short measurement times, causing reconstruction artifacts and limiting overall image quality. To counteract these issues, two self-supervised deep learning methods for denoising of raw detector data were investigated and compared against a non-learning based denoising method. We found that the application of the deep-learning-based methods was able to enhance signal-to-noise ratios in the detector data and also led to consistent improvements of the reconstructed images, outperforming the non-learning based method.

[333] ReBaPL: Repulsive Bayesian Prompt Learning

Yassir Bendou, Omar Ezzahir, Eduardo Fernandes Montesuma, Gabriel Mahuas, Victoria Shevchenko, Mike Gartrell

Main category: cs.LG

TL;DR: ReBaPL is a novel Bayesian prompt learning method that uses cyclical step-size scheduling with SGHMC and representation-space repulsive forces to efficiently explore multimodal prompt posteriors, improving generalization over conventional prompt tuning methods.

Details

Motivation: Conventional prompt tuning methods suffer from overfitting and poor out-of-distribution generalization. Bayesian prompt learning addresses these issues but needs better methods to explore complex multimodal posterior distributions of prompts.

Method: Combines cyclical step-size schedule with SGHMC for alternating exploration/exploitation phases, plus repulsive forces based on probability metrics (MMD/Wasserstein) between representations from different prompts to diversify exploration and prevent mode collapse.

Result: Demonstrates superior performance over state-of-the-art prompt learning methods on several benchmark datasets, showing improved generalization capabilities.

Conclusion: ReBaPL provides a modular Bayesian extension for existing prompt learning methods that enables more comprehensive characterization of prompt posterior distributions, leading to enhanced robustness and generalization.

Abstract: Prompt learning has emerged as an effective technique for fine-tuning large-scale foundation models for downstream tasks. However, conventional prompt tuning methods are prone to overfitting and can struggle with out-of-distribution generalization. To address these limitations, Bayesian prompt learning has been proposed, which frames prompt optimization as a Bayesian inference problem to enhance robustness. This paper introduces Repulsive Bayesian Prompt Learning (ReBaPL), a novel method for Bayesian prompt learning, designed to efficiently explore the complex and often multimodal posterior landscape of prompts. Our method integrates a cyclical step-size schedule with a stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm, enabling alternating phases of exploration to discover new modes, and exploitation to refine existing modes. Furthermore, we introduce a repulsive force derived from a potential function over probability metrics (including Maximum Mean Discrepancy and Wasserstein distance) computed on the distributions of representations produced by different prompts. This representation-space repulsion diversifies exploration and prevents premature collapse to a single mode. Our approach allows for a more comprehensive characterization of the prompt posterior distribution, leading to improved generalization. In contrast to prior Bayesian prompt learning methods, our method provides a modular plug-and-play Bayesian extension of any existing prompt learning method based on maximum likelihood estimation. We demonstrate the efficacy of ReBaPL on several benchmark datasets, showing superior performance over state-of-the-art methods for prompt learning.
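
A minimal sketch of a cyclical step-size schedule of the kind the abstract pairs with SGHMC. The cosine form below is the one commonly used in cyclical stochastic-gradient MCMC and is an assumption here; the summary does not state which schedule ReBaPL uses, and the sampler and the repulsive force are not reproduced.

```python
import math

def cyclical_step_size(k, total_steps, num_cycles, base_step):
    """Cosine cyclical schedule (assumed form).

    Each cycle starts at base_step (large steps, exploration of new modes)
    and decays toward zero (small steps, refinement of the current mode).
    k is the 0-indexed iteration.
    """
    cycle_len = math.ceil(total_steps / num_cycles)
    position = (k % cycle_len) / cycle_len  # position within the cycle, in [0, 1)
    return 0.5 * base_step * (math.cos(math.pi * position) + 1.0)
```

With `total_steps=100` and `num_cycles=4`, the step size restarts at `base_step` every 25 iterations, which gives the alternating exploration and exploitation phases described above.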

[334] Convergence and stability of Q-learning in Hierarchical Reinforcement Learning

Massimiliano Manenti, Andrea Iannelli

Main category: cs.LG

TL;DR: The paper provides theoretical convergence and stability guarantees for Feudal Q-learning in hierarchical reinforcement learning, showing it converges to a game equilibrium point.

Details

Motivation: Hierarchical RL promises efficient temporal structure capture and continual learning, but lacks theoretical guarantees. This paper aims to provide principled convergence analysis for Feudal Q-learning.

Method: Proposed Feudal Q-learning scheme analyzed using Stochastic Approximation theory and ODE method to prove convergence and stability properties.

Result: Established a theorem proving convergence and stability of Feudal Q-learning, showing updates converge to an equilibrium point of a suitably defined game.

Conclusion: The theoretical analysis provides principled convergence guarantees for Feudal RL, opening doors to game-theoretic approaches, with experimental results supporting the theory.

Abstract: Hierarchical Reinforcement Learning promises, among other benefits, to efficiently capture and utilize the temporal structure of a decision-making problem and to enhance continual learning capabilities, but theoretical guarantees lag behind practice. In this paper, we propose a Feudal Q-learning scheme and investigate under which conditions its coupled updates converge and are stable. By leveraging the theory of Stochastic Approximation and the ODE method, we present a theorem stating the convergence and stability properties of Feudal Q-learning. This provides a principled convergence and stability analysis tailored to Feudal RL. Moreover, we show that the updates converge to a point that can be interpreted as an equilibrium of a suitably defined game, opening the door to game-theoretic approaches to Hierarchical RL. Lastly, experiments based on the Feudal Q-learning algorithm support the outcomes anticipated by theory.

[335] R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability

Runyu Lu, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao

Main category: cs.LG

TL;DR: This paper introduces R2PS, the first approach for worst-case robust real-time pursuit strategies in pursuit-evasion games under partial observability, where pursuers have imperfect information about the evader’s position.

Details

Motivation: Current reinforcement learning methods for pursuit-evasion games are limited to perfect information scenarios and don't account for evaders that can predict pursuers' actions, creating a gap for real-world applications with partial observability.

Method: The approach extends dynamic programming pursuit strategies to partial observability using belief preservation about evader positions, then embeds this into the EPG framework for cross-graph reinforcement learning against asynchronous-move DP evasion strategies.

Result: The learned policy achieves robust zero-shot generalization to unseen real-world graph structures and consistently outperforms policies directly trained on test graphs by existing game RL approaches.

Conclusion: R2PS successfully addresses the challenge of real-time pursuit strategies under partial observability, providing a framework that maintains robustness while enabling generalization to new environments.

Abstract: Computing worst-case robust strategies in pursuit-evasion games (PEGs) is time-consuming, especially when real-world factors like partial observability are considered. While important for general security purposes, real-time applicable pursuit strategies for graph-based PEGs are currently missing when the pursuers only have imperfect information about the evader’s position. Although state-of-the-art reinforcement learning (RL) methods like Equilibrium Policy Generalization (EPG) and Grasper provide guidelines for learning graph neural network (GNN) policies robust to different game dynamics, they are restricted to the scenario of perfect information and do not take into account the possible case where the evader can predict the pursuers’ actions. This paper introduces the first approach to worst-case robust real-time pursuit strategies (R2PS) under partial observability. We first prove that a traditional dynamic programming (DP) algorithm for solving Markov PEGs maintains optimality under the asynchronous moves by the evader. Then, we propose a belief preservation mechanism about the evader’s possible positions, extending the DP pursuit strategies to a partially observable setting. Finally, we embed the belief preservation into the state-of-the-art EPG framework to finish our R2PS learning scheme, which leads to a real-time pursuer policy through cross-graph reinforcement learning against the asynchronous-move DP evasion strategies. After reinforcement learning, our policy achieves robust zero-shot generalization to unseen real-world graph structures and consistently outperforms the policy directly trained on the test graphs by the existing game RL approach.
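
To make the idea of tracking the evader's possible positions concrete, here is a rough sketch of a set-based belief update on a graph. The observation model (pursuers see a given set of nodes), the move-then-observe ordering, and the function name are all assumptions; the paper's belief preservation mechanism may differ.

```python
def update_belief(belief, graph, visible, sighting=None):
    """Update the set of nodes where the evader could be (hypothetical model).

    belief:   set of nodes the evader may currently occupy
    graph:    adjacency dict, node -> list of neighbouring nodes
    visible:  nodes the pursuers can currently observe
    sighting: node where the evader was actually seen, or None
    """
    if sighting is not None:
        return {sighting}  # a direct observation collapses the belief
    reachable = set(belief)  # the evader may stay in place...
    for node in belief:
        reachable.update(graph[node])  # ...or move to any neighbour
    # Observed nodes where no evader was seen can be ruled out.
    return reachable - set(visible)
```

On a path graph `0-1-2-3` with belief `{1}` and node `0` visible, the updated belief is `{1, 2}`: the evader could have stayed or moved right, but not moved into the observed node.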

[336] A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias

Wei-Kai Chang, Rajiv Khanna

Main category: cs.LG

TL;DR: Develops a linear stability framework to understand why SGD and SAM prefer flat minima in deep learning, focusing on gradient curvature coherence across data points in two-layer ReLU networks.

Details

Motivation: To understand the mechanisms driving generalization in deep learning optimization, particularly why SGD and its variants prefer flatter minima, and to develop a unified theory connecting data structure, optimization dynamics, and solution characteristics.

Method: Develops a linear stability framework analyzing SGD, random perturbations, and SAM behavior in two-layer ReLU networks, using a coherence measure to quantify gradient curvature alignment across data points.

Result: The coherence measure reveals why certain minima are stable and favored during training, explaining the preference for flat minima in overparameterized settings.

Conclusion: Provides a theoretical framework connecting data structure, optimization dynamics, and solution flatness, offering insights into why SGD and SAM generalize well by preferring stable minima with coherent gradient curvature.

Abstract: Understanding the dynamics of optimization in deep learning is increasingly important as models scale. While stochastic gradient descent (SGD) and its variants reliably find solutions that generalize well, the mechanisms driving this generalization remain unclear. Notably, these algorithms often prefer flatter or simpler minima, particularly in overparameterized settings. Prior work has linked flatness to generalization, and methods like Sharpness-Aware Minimization (SAM) explicitly encourage flatness, but a unified theory connecting data structure, optimization dynamics, and the nature of learned solutions is still lacking. In this work, we develop a linear stability framework that analyzes the behavior of SGD, random perturbations, and SAM, particularly in two-layer ReLU networks. Central to our analysis is a coherence measure that quantifies how gradient curvature aligns across data points, revealing why certain minima are stable and favored during training.
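
For readers unfamiliar with SAM, the update it analyzes can be sketched in a few lines. This is the standard SAM step (ascend to a nearby worst-case point, then descend using the gradient taken there) applied to a toy quadratic; the loss, learning rate, and radius `rho` are illustrative assumptions, and the paper's stability analysis is not reproduced.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step on a parameter list w."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # guard against a zero gradient
    # Move to the approximate worst-case point within an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descend from the original point using the gradient at the perturbed point.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def loss(w):
    # Toy quadratic with one sharp direction (w[0]) and one flat direction (w[1]).
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return [10.0 * w[0], w[1]]
```

Plain SGD would use `g` directly; SAM differs only in where the gradient is evaluated, which is what makes a side-by-side stability comparison natural.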

[337] Stable Coresets via Posterior Sampling: Aligning Induced and Full Loss Landscapes

Wei-Kai Chang, Rajiv Khanna

Main category: cs.LG

TL;DR: A novel coreset selection framework using posterior sampling to address gradient-based method limitations, achieving faster training and better generalization than state-of-the-art approaches.

Details

Motivation: Deep learning models' growing computational demands require efficient coreset selection, but gradient-based methods face two challenges: naive SGD is a surprisingly strong baseline, and coreset representativeness breaks down as loss curvature mismatches arise over time.

Method: Establishes connection between posterior sampling and loss landscapes, introduces smoothed loss function based on posterior sampling on model weights, and provides convergence analysis for the sampling-based approach.

Result: Achieves faster training and enhanced generalization across diverse datasets compared to current state-of-the-art methods.

Conclusion: The proposed framework effectively addresses limitations of gradient-based coreset selection methods through posterior sampling, providing robust performance even in high data corruption scenarios.

Abstract: As deep learning models continue to scale, the growing computational demands have amplified the need for effective coreset selection techniques. Coreset selection aims to accelerate training by identifying small, representative subsets of data that approximate the performance of the full dataset. Among various approaches, gradient-based methods stand out due to their strong theoretical underpinnings and practical benefits, particularly under limited data budgets. However, these methods face challenges such as naive stochastic gradient descent (SGD) acting as a surprisingly strong baseline and the breakdown of representativeness due to loss curvature mismatches over time. In this work, we propose a novel framework that addresses these limitations. First, we establish a connection between posterior sampling and loss landscapes, enabling robust coreset selection even in high data corruption scenarios. Second, we introduce a smoothed loss function based on posterior sampling over the model weights, enhancing stability and generalization while maintaining computational efficiency. We also present a novel convergence analysis for our sampling-based coreset selection method. Finally, through extensive experiments, we demonstrate that our approach achieves faster training and better generalization across diverse datasets than the current state of the art.

[338] DS-Span: Single-Phase Discriminative Subgraph Mining for Efficient Graph Embeddings

Yeamin Kaiser, Muhammed Tasnim Bin Anwar, Bholanath Das, Chowdhury Farhan Ahmed, Md. Tanvir Alam

Main category: cs.LG

TL;DR: DS-Span is a single-phase discriminative subgraph mining framework that unifies pattern growth, pruning, and supervision-driven scoring, achieving efficient and interpretable graph representation learning with reduced computational cost.

Details

Motivation: Existing subgraph mining methods suffer from redundant multi-phase pipelines, high computational cost, and weak coupling between mined structures and their discriminative relevance, requiring a more efficient and unified approach.

Method: DS-Span introduces a coverage-capped eligibility mechanism that dynamically limits exploration once a graph is sufficiently represented, and an information-gain-guided selection that promotes subgraphs with strong class-separating ability while minimizing redundancy.

Result: Extensive experiments show DS-Span generates more compact and discriminative subgraph features than prior multi-stage methods, achieving higher or comparable accuracy with significantly reduced runtime.

Conclusion: The unified single-phase discriminative mining approach serves as a foundation for scalable and interpretable graph representation learning, demonstrating the potential of efficient pattern discovery for downstream applications.

Abstract: Graph representation learning seeks to transform complex, high-dimensional graph structures into compact vector spaces that preserve both topology and semantics. Among the various strategies, subgraph-based methods provide an interpretable bridge between symbolic pattern discovery and continuous embedding learning. Yet, existing frequent or discriminative subgraph mining approaches often suffer from redundant multi-phase pipelines, high computational cost, and weak coupling between mined structures and their discriminative relevance. We propose DS-Span, a single-phase discriminative subgraph mining framework that unifies pattern growth, pruning, and supervision-driven scoring within one traversal of the search space. DS-Span introduces a coverage-capped eligibility mechanism that dynamically limits exploration once a graph is sufficiently represented, and an information-gain-guided selection that promotes subgraphs with strong class-separating ability while minimizing redundancy. The resulting subgraph set serves as an efficient, interpretable basis for downstream graph embedding and classification. Extensive experiments across benchmarks demonstrate that DS-Span generates more compact and discriminative subgraph features than prior multi-stage methods, achieving higher or comparable accuracy with significantly reduced runtime. These results highlight the potential of unified, single-phase discriminative mining as a foundation for scalable and interpretable graph representation learning.
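
As an illustration of information-gain-guided selection, the sketch below scores a candidate subgraph by how well its presence or absence separates graph classes. It uses the textbook definition of information gain; the function names are hypothetical, and DS-Span's actual scoring, redundancy handling, and coverage cap are not reproduced.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(contains, labels):
    """Information gain of a binary 'graph contains this subgraph' feature.

    contains[i] is True if graph i contains the candidate subgraph;
    labels[i] is the class of graph i.
    """
    n = len(labels)
    with_pattern = [y for c, y in zip(contains, labels) if c]
    without_pattern = [y for c, y in zip(contains, labels) if not c]
    conditional = sum(
        len(part) / n * entropy(part)
        for part in (with_pattern, without_pattern)
        if part  # skip an empty side of the split
    )
    return entropy(labels) - conditional
```

A subgraph that appears in exactly the graphs of one class scores the maximum gain, while one that appears equally often in both classes scores zero and would be a poor feature.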

[339] Self-Supervised Learning by Curvature Alignment

Benyamin Ghojogh, M. Hadi Sepanj, Paul Fieguth

Main category: cs.LG

TL;DR: CurvSSL is a curvature-regularized self-supervised learning framework that augments standard SSL with curvature-based regularizers to explicitly shape local data manifold geometry, showing competitive performance on MNIST and CIFAR-10.

Details

Motivation: Current non-contrastive SSL methods focus on statistical properties but largely ignore the local geometry of the underlying data manifold, which could provide valuable structural information.

Method: Uses standard two-view encoder-projector architecture with Barlow Twins-style redundancy-reduction loss, augmented with curvature regularizer computed from k-nearest neighbors on unit hypersphere or via normalized local Gram matrix in RKHS.

Result: CurvSSL achieves competitive or improved linear evaluation performance compared to Barlow Twins and VICReg on MNIST and CIFAR-10 datasets using ResNet-18 backbone.

Conclusion: Explicitly shaping local geometry through curvature regularization is a simple and effective complement to purely statistical SSL regularizers.

Abstract: Self-supervised learning (SSL) has recently advanced through non-contrastive methods that couple an invariance term with variance, covariance, or redundancy-reduction penalties. While such objectives shape first- and second-order statistics of the representation, they largely ignore the local geometry of the underlying data manifold. In this paper, we introduce CurvSSL, a curvature-regularized self-supervised learning framework, and its RKHS extension, kernel CurvSSL. Our approach retains a standard two-view encoder-projector architecture with a Barlow Twins-style redundancy-reduction loss on projected features, but augments it with a curvature-based regularizer. Each embedding is treated as a vertex whose $k$ nearest neighbors define a discrete curvature score via cosine interactions on the unit hypersphere; in the kernel variant, curvature is computed from a normalized local Gram matrix in an RKHS. These scores are aligned and decorrelated across augmentations by a Barlow-style loss on a curvature-derived matrix, encouraging both view invariance and consistency of local manifold bending. Experiments on MNIST and CIFAR-10 datasets with a ResNet-18 backbone show that curvature-regularized SSL yields competitive or improved linear evaluation performance compared to Barlow Twins and VICReg. Our results indicate that explicitly shaping local geometry is a simple and effective complement to purely statistical SSL regularizers.
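
The Barlow Twins-style redundancy-reduction loss that CurvSSL builds on can be sketched as follows. This is the standard objective on the cross-correlation matrix of two views, written in plain Python for clarity; the off-diagonal weight `lam` is an assumed default, and the curvature score itself is not reproduced because the abstract does not give its formula. Per the abstract, CurvSSL applies a loss of this shape to a curvature-derived matrix as well as to the projected features.

```python
from statistics import mean, pstdev

def standardize(columns):
    """Zero-mean, unit-variance normalisation of each feature across the batch."""
    out = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        out.append([(x - m) / (s or 1.0) for x in col])  # guard constant features
    return out

def barlow_twins_loss(z1, z2, lam=5e-3):
    """z1, z2: lists of N embeddings (each a list of D floats), one per view."""
    n = len(z1)
    cols1 = standardize(list(zip(*z1)))  # D feature columns for view 1
    cols2 = standardize(list(zip(*z2)))  # D feature columns for view 2
    loss = 0.0
    for i, a in enumerate(cols1):
        for j, b in enumerate(cols2):
            c_ij = sum(x * y for x, y in zip(a, b)) / n  # cross-correlation entry
            # Diagonal entries are pulled to 1 (invariance);
            # off-diagonal entries are pushed to 0 (decorrelation).
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss
```

The loss is near zero when the two views agree feature by feature and the features are mutually uncorrelated, and it grows when features are redundant or the views disagree.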

[340] Towards fully differentiable neural ocean model with Veros

Etienne Meunier, Said Ouala, Hugo Frezat, Julien Le Sommer, Ronan Fablet

Main category: cs.LG

TL;DR: Differentiable extension of the VEROS ocean model enabling automatic differentiation through its dynamical core using the JAX framework, with applications in state correction and parameter calibration.

Details

Motivation: To enable gradient-based optimization and parameter tuning in ocean modeling through differentiable programming, facilitating end-to-end learning from model observations.

Method: Modified the VEROS ocean model to be fully compatible with the JAX automatic differentiation framework, ensuring numerical consistency while enabling automatic differentiation through the dynamical core.

Result: Successfully implemented differentiable ocean model with two demonstrated applications: initial ocean state correction via gradient-based optimization and calibration of unknown physical parameters from observations.

Conclusion: Differentiable programming enables efficient gradient-based optimization for ocean modeling tasks, opening possibilities for end-to-end learning and parameter tuning in complex ocean simulations.

Abstract: We present a differentiable extension of the VEROS ocean model, enabling automatic differentiation through its dynamical core. We describe the key modifications required to make the model fully compatible with the JAX autodifferentiation framework and evaluate the numerical consistency of the resulting implementation. Two illustrative applications are then demonstrated: (i) the correction of an initial ocean state through gradient-based optimization, and (ii) the calibration of unknown physical parameters directly from model observations. These examples highlight how differentiable programming can facilitate end-to-end learning and parameter tuning in ocean modeling. Our implementation is available online.

[341] Multi-Agent Pointer Transformer: Seq-to-Seq Reinforcement Learning for Multi-Vehicle Dynamic Pickup-Delivery Problems

Zengyu Zou, Jingyuan Wang, Yixuan Huang, Junjie Wu

Main category: cs.LG

TL;DR: Proposes MAPT, an end-to-end centralized framework using sequence-to-sequence Transformer with Pointer Network to solve cooperative multi-vehicle dynamic pickup and delivery with stochastic requests, outperforming existing methods.

Details

Motivation: Classical operations research methods struggle with computational complexity in large-scale dynamic problems, while existing RL methods fail to model joint action distributions, capture inter-entity relationships, and handle exponentially large joint action spaces.

Method: Uses Transformer Encoder for entity representations, Transformer Decoder with Pointer Network for autoregressive joint action sequence generation, Relation-Aware Attention module for inter-entity relationships, and informative priors for decision guidance.

Result: MAPT significantly outperforms baseline methods on 8 datasets and shows substantial computational time advantages over classical operations research methods.

Conclusion: The proposed MAPT framework effectively addresses challenges in cooperative multi-vehicle routing with stochastic requests through centralized decision-making and relation-aware attention, achieving superior performance and efficiency.

Abstract: This paper addresses the cooperative Multi-Vehicle Dynamic Pickup and Delivery Problem with Stochastic Requests (MVDPDPSR) and proposes an end-to-end centralized decision-making framework based on sequence-to-sequence modeling, named Multi-Agent Pointer Transformer (MAPT). MVDPDPSR is an extension of the vehicle routing problem and a spatio-temporal system optimization problem, widely applied in scenarios such as on-demand delivery. Classical operations research methods face bottlenecks in computational complexity and time efficiency when handling large-scale dynamic problems. Although existing reinforcement learning methods have achieved some progress, they still encounter several challenges: 1) independent decoding across multiple vehicles fails to model joint action distributions; 2) the feature extraction network struggles to capture inter-entity relationships; 3) the joint action space is exponentially large. To address these issues, we designed the MAPT framework, which employs a Transformer Encoder to extract entity representations, combines a Transformer Decoder with a Pointer Network to generate joint action sequences in an autoregressive manner, and introduces a Relation-Aware Attention module to capture inter-entity relationships. Additionally, we guide the model’s decision-making using informative priors to facilitate effective exploration. Experiments on 8 datasets demonstrate that MAPT significantly outperforms existing baseline methods in terms of performance and exhibits substantial computational time advantages compared to classical operations research methods.
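
A simplified sketch of pointer-style action selection with masking, to show how generating a joint action one vehicle at a time avoids enumerating the exponentially large joint action space. The scaled dot-product scoring, greedy choice, and function names are assumptions; in MAPT the queries come from a Transformer Decoder conditioned on earlier choices, whereas here earlier choices influence later ones only through the mask.

```python
import math

def pointer_distribution(query, keys, mask):
    """Softmax over scaled dot-product scores; masked-out entries get probability 0.

    Assumes at least one entry of mask is True.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if ok else -math.inf
        for key, ok in zip(keys, mask)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def decode_joint_action(vehicle_queries, node_keys):
    """Greedily assign one node per vehicle, masking nodes already taken."""
    available = [True] * len(node_keys)
    assignment = []
    for query in vehicle_queries:
        probs = pointer_distribution(query, node_keys, available)
        choice = max(range(len(probs)), key=probs.__getitem__)
        assignment.append(choice)
        available[choice] = False  # later vehicles cannot pick the same node
    return assignment
```

Decoding costs one pass over the nodes per vehicle, so the work grows linearly with the number of vehicles rather than with the size of the joint action space.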

[342] InTAct: Interval-based Task Activation Consolidation for Continual Learning

Patryk Krukowski, Jan Miksa, Piotr Helm, Jacek Tabor, Paweł Wawrzyński, Przemysław Spurek

Main category: cs.LG

TL;DR: InTAct addresses representation drift in continual learning by preserving functional behavior in shared layers through activation range constraints, improving performance in domain-incremental settings.

Details

Motivation: Prompt-based continual learning methods are vulnerable to representation drift under domain shifts, where shared representations evolve and overwrite previously useful features, causing forgetting even with task-isolated parameters.

Method: InTAct captures characteristic activation ranges for previously learned tasks and constrains updates to maintain network consistency within these regions while allowing flexible adaptation elsewhere, stabilizing important neurons’ functional roles without freezing parameters or storing past data.

Result: Across DomainNet and ImageNet-R benchmarks, InTAct consistently reduces representation drift and improves performance, increasing Average Accuracy by up to 8 percentage points over state-of-the-art baselines.

Conclusion: InTAct achieves a principled balance between stability and plasticity by regulating representation changes where past knowledge is encoded, making it an effective architecture-agnostic solution for domain-incremental continual learning.

Abstract: Continual learning aims to enable neural networks to acquire new knowledge without forgetting previously learned information. While recent prompt-based methods perform strongly in class-incremental settings, they remain vulnerable under domain shifts, where the input distribution changes but the label space remains fixed. This exposes a persistent problem known as representation drift. Shared representations evolve in ways that overwrite previously useful features and cause forgetting even when prompts isolate task-specific parameters. To address this issue, we introduce InTAct, a method that preserves functional behavior in shared layers without freezing parameters or storing past data. InTAct captures the characteristic activation ranges associated with previously learned tasks and constrains updates to ensure the network remains consistent within these regions, while still allowing for flexible adaptation elsewhere. In doing so, InTAct stabilizes the functional role of important neurons rather than directly restricting parameter values. The approach is architecture-agnostic and integrates seamlessly into existing prompt-based continual learning frameworks. By regulating representation changes where past knowledge is encoded, InTAct achieves a principled balance between stability and plasticity. Across diverse domain-incremental benchmarks, including DomainNet and ImageNet-R, InTAct consistently reduces representation drift and improves performance, increasing Average Accuracy by up to 8 percentage points over state-of-the-art baselines.
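
A loose sketch of the interval idea: record per-neuron activation ranges on data from a previous task, then penalize changes only for neurons whose activation falls inside the recorded range. The penalty below is a hypothetical stand-in chosen for illustration and is not the paper's loss; the function names are also assumptions.

```python
def activation_intervals(activations):
    """Per-neuron (min, max) activation range over a batch of old-task samples.

    activations: list of activation vectors, one per sample, for a single layer.
    """
    return [(min(col), max(col)) for col in zip(*activations)]

def drift_penalty(old_out, new_out, intervals):
    """Squared change of each neuron, counted only inside its protected interval.

    Hypothetical penalty: neurons operating outside the recorded range
    are left free to adapt to the new task.
    """
    penalty = 0.0
    for old, new, (low, high) in zip(old_out, new_out, intervals):
        if low <= old <= high:
            penalty += (new - old) ** 2
    return penalty
```

This mirrors the stated design goal of stabilizing the functional role of neurons within the regions used by earlier tasks, without freezing parameters or storing past data.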

[343] Unmasking Airborne Threats: Guided-Transformers for Portable Aerosol Mass Spectrometry

Kyle M. Regan, Michael McLoughlin, Wayne A. Bryden, Gonzalo R. Arce

Main category: cs.LG

TL;DR: MS-DGFormer is a transformer-based framework that enables single-shot pathogen identification from noisy MALDI-MS spectra without extensive preprocessing, making real-time environmental monitoring feasible.

Details

Motivation: Current MALDI-MS systems require labor-intensive sample preparation and multi-shot spectral averaging, making them impractical for real-time environmental monitoring and autonomous aerosol analysis.

Method: Uses transformer architecture with a novel dictionary encoder that integrates denoised spectral information from SVD to capture long-range dependencies in time-series spectra and extract critical biomolecular patterns from single-shot spectra.

Result: Achieves superior pathogen identification from aerosol samples with robust performance, enabling autonomous analysis in field conditions.

Conclusion: Eliminates need for extensive preprocessing, unlocking potential for portable MALDI-MS platforms and revolutionizing environmental pathogen detection and rapid biological threat response.

Abstract: Matrix Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI-MS) is a cornerstone in biomolecular analysis, offering precise identification of pathogens through unique mass spectral signatures. Yet, its reliance on labor-intensive sample preparation and multi-shot spectral averaging restricts its use to laboratory settings, rendering it impractical for real-time environmental monitoring. These limitations are especially pronounced in emerging aerosol MALDI-MS systems, where autonomous sampling generates noisy spectra for unknown aerosol analytes, requiring single-shot detection for effective analysis. Addressing these challenges, we propose the Mass Spectral Dictionary-Guided Transformer (MS-DGFormer): a data-driven framework that redefines spectral analysis by directly processing raw, minimally prepared mass spectral data. MS-DGFormer leverages a transformer architecture, designed to capture the long-range dependencies inherent in these time-series spectra. To enhance feature extraction, we introduce a novel dictionary encoder that integrates denoised spectral information derived from Singular Value Decomposition (SVD), enabling the model to discern critical biomolecular patterns from single-shot spectra with robust performance. This innovation provides a system to achieve superior pathogen identification from aerosol samples, facilitating autonomous, real-time analysis in field conditions. By eliminating the need for extensive preprocessing, our method unlocks the potential for portable, deployable MALDI-MS platforms, revolutionizing environmental pathogen detection and rapid response to biological threats.

[344] PersonaAgent with GraphRAG: Community-Aware Knowledge Graphs for Personalized LLM

Siqi Liang, Yudi Zhang, Yue Guo

Main category: cs.LG

TL;DR: A persona-based LLM framework using Knowledge-Graph-enhanced RAG that improves personalization by combining user history summaries with global interaction patterns, achieving significant performance gains on the LaMP benchmark.

Details

Motivation: Need for personalized AI agents that adapt to individual user preferences by embodying user personas and leveraging rich contextual information.

Method: Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) that constructs LLM-derived graph index of documents and summarizes communities, combined with dynamic prompt engineering using user history and global interaction patterns.

Result: On the LaMP benchmark: 11.1% F1 improvement in news categorization, 56.1% F1 improvement in movie tagging, and 10.4% reduction in product rating MAE compared to prior methods.

Conclusion: The framework enables persona-aligned behaviors while benefiting from collective knowledge through graph-based community detection and dynamic prompt engineering.

Abstract: We propose a novel framework for a persona-based language model system, motivated by the need for personalized AI agents that adapt to individual user preferences. In our approach, the agent embodies the user’s “persona” (e.g., user profile or taste) and is powered by a large language model (LLM). To enable the agent to leverage rich contextual information, we introduce a Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) mechanism that constructs an LLM-derived graph index of relevant documents and summarizes communities of related information. Our framework generates personalized prompts by combining: (1) a summary of the user’s historical behaviors and preferences extracted from the knowledge graph, and (2) relevant global interaction patterns identified through graph-based community detection. This dynamic prompt engineering approach allows the agent to maintain consistent persona-aligned behaviors while benefiting from collective knowledge. On the LaMP benchmark, our method improves news categorization F1 by 11.1%, movie tagging F1 by 56.1%, and reduces product rating MAE by 10.4% over prior methods. Our code is available at https://anonymous.4open.science/r/PersonaAgentwGraphRAG-DE6F

[345] Harnessing Data from Clustered LQR Systems: Personalized and Collaborative Policy Optimization

Vinay Kanakeri, Shivam Bajaj, Ashwin Verma, Vijay Gupta, Aritra Mitra

Main category: cs.LG

TL;DR: A clustering-based RL algorithm for multiple agents with similar linear processes that simultaneously learns personalized policies and identifies clusters, achieving statistical gains without performance degradation from dissimilar processes.

Details

Motivation: RL is data-hungry and using data from similar processes can improve sample efficiency, but identifying which processes are similar is challenging when process models are unknown.

Method: Combines sequential elimination and zeroth-order policy optimization to perform simultaneous clustering and learning, outputting personalized policies for each cluster of similar linear processes.

Result: Proves correct clustering with high probability under cluster separation conditions, and shows sub-optimality gap scales inversely with cluster size without additional bias from dissimilar processes.

Conclusion: First work showing clustering enables learning personalized policies with statistical gains from collaboration while avoiding performance degradation from dissimilar processes, with mild logarithmic communication overhead.

Abstract: It is known that reinforcement learning (RL) is data-hungry. To improve sample-efficiency of RL, it has been proposed that the learning algorithm utilize data from ‘approximately similar’ processes. However, since the process models are unknown, identifying which other processes are similar poses a challenge. In this work, we study this problem in the context of the benchmark Linear Quadratic Regulator (LQR) setting. Specifically, we consider a setting with multiple agents, each corresponding to a copy of a linear process to be controlled. The agents’ local processes can be partitioned into clusters based on similarities in dynamics and tasks. Combining ideas from sequential elimination and zeroth-order policy optimization, we propose a new algorithm that performs simultaneous clustering and learning to output a personalized policy (controller) for each cluster. Under a suitable notion of cluster separation that captures differences in closed-loop performance across systems, we prove that our approach guarantees correct clustering with high probability. Furthermore, we show that the sub-optimality gap of the policy learned for each cluster scales inversely with the size of the cluster, with no additional bias, unlike in prior works on collaborative learning-based control. Our work is the first to reveal how clustering can be used in data-driven control to learn personalized policies that enjoy statistical gains from collaboration but do not suffer sub-optimality due to inclusion of data from dissimilar processes. From a distributed implementation perspective, our method is attractive as it incurs only a mild logarithmic communication overhead.
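
The zeroth-order policy optimization ingredient mentioned above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's algorithm: it assumes a single scalar system with made-up parameters (`a = b = q = r = 1`), a noise-free cost, and no clustering or sequential elimination. The function names are hypothetical. It shows only how a feedback gain can be improved from cost evaluations alone.

```python
import random

def lqr_cost(k, a=1.0, b=1.0, q=1.0, r=1.0, x0=1.0, horizon=200):
    """Finite-horizon cost of the static feedback u = -k * x on x' = a*x + b*u."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = -k * x
        cost += q * x * x + r * u * u
        x = a * x + b * u
    return cost

def zeroth_order_step(k, radius=0.01, step=0.05):
    """One two-point zeroth-order update: uses cost evaluations only."""
    direction = random.choice([-1.0, 1.0])  # random unit direction in 1-D
    grad_estimate = (
        lqr_cost(k + radius * direction) - lqr_cost(k - radius * direction)
    ) / (2 * radius) * direction
    return k - step * grad_estimate

k = 0.5  # initial stabilizing gain (|a - b*k| < 1), chosen for illustration
for _ in range(300):
    k = zeroth_order_step(k)

print(f"learned gain: {k:.3f}")
```

For this toy system the optimal static gain solves k^2 + k - 1 = 0, i.e. k ≈ 0.618, which the iteration approaches. In the setting of the abstract, agents in the same cluster would pool such estimates; that pooling and the cluster identification are not modelled here.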

[346] A New Causal Rule Learning Approach to Interpretable Estimation of Heterogeneous Treatment Effect

Ying Wu, Hanzhong Liu, Kai Ren, Shujie Ma, Xiangyu Chang

Main category: cs.LG

TL;DR: CRL uses a rule-based workflow to estimate heterogeneous treatment effects for atrial septal defect, addressing cases where individuals belong to multiple treatment effect groups simultaneously.

Details

Motivation: Interpretability is crucial for HTE estimation in complex diseases, and previous literature overlooked cases where individuals simultaneously belong to multiple groups with different treatment effects.

Method: Three-step causal rule learning: rule discovery (generates causal rules with subgroup effects), rule selection (identifies subset for individual-level effect decomposition), and rule analysis (multi-perspective examination of promising rules).

Result: CRL outperforms other methods in providing interpretable HTE estimates, especially with complex ground truth and sufficient sample sizes.

Conclusion: CRL provides superior interpretable HTE estimation through its rule-based workflow, effectively handling complex scenarios where individuals belong to multiple treatment effect groups.

Abstract: Interpretability plays a crucial role in the application of statistical learning to estimate heterogeneous treatment effects (HTE) in complex diseases. In this study, we leverage a rule-based workflow, namely causal rule learning (CRL), to estimate and improve our understanding of HTE for atrial septal defect, addressing an overlooked question in the previous literature: what if an individual simultaneously belongs to multiple groups with different average treatment effects? The CRL process consists of three steps: rule discovery, which generates a set of causal rules with corresponding subgroup average treatment effects; rule selection, which identifies a subset of these rules to deconstruct individual-level treatment effects as a linear combination of subgroup-level effects; and rule analysis, which presents a detailed procedure for further analyzing each selected rule from multiple perspectives to identify the most promising rules for validation. Extensive simulation studies and real-world data analysis demonstrate that CRL outperforms other methods in providing interpretable estimates of HTE, especially when dealing with complex ground truth and sufficient sample sizes.
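
To make the "linear combination of subgroup-level effects" idea concrete, here is a small sketch under stated assumptions. The rules, features (`x1`, `x2`), effect sizes, and weights are all made up for illustration and do not come from the paper. The abstract also does not specify how CRL fits the weights, so they are simply fixed here.

```python
# Hypothetical rules: each maps a record to True/False.
rules = {
    "rule_a": lambda rec: rec["x1"] < 5,
    "rule_b": lambda rec: rec["x2"] >= 20,
}

# Made-up subgroup average treatment effects and combination weights.
subgroup_effect = {"rule_a": 0.30, "rule_b": -0.10}
weights = {"rule_a": 1.0, "rule_b": 1.0}

def individual_effect(record):
    """Sum the weighted subgroup effects of every rule the record satisfies."""
    return sum(
        weights[name] * subgroup_effect[name]
        for name, rule in rules.items()
        if rule(record)
    )

in_both = {"x1": 3, "x2": 25}   # satisfies both rules
in_one = {"x1": 3, "x2": 10}    # satisfies rule_a only
print(individual_effect(in_both), individual_effect(in_one))
```

A record that satisfies several rules receives contributions from each of them, which is the multiple-membership case the abstract highlights.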

[347] Posts of Peril: Detecting Information About Hazards in Text

Keith Burghardt, Daniel M. T. Fessler, Chyna Tang, Anne Pisor, Kristina Lerman

Main category: cs.LG

TL;DR: Developed a new model to detect hazard information in social media posts, showing it outperforms dictionary approaches and is not strongly correlated with common sentiment indicators. Applied to geopolitical events (Israel-Hamas war, French election) to analyze how hazard information is used in information warfare.

Details

Motivation: To better understand socio-linguistic indicators in human-computer interactions, particularly the overlooked presence of hazard/harm information in social media, and how this information is used in geopolitical information campaigns.

Method: Developed a new hazard detection model trained on annotated X posts, applied it to datasets from Israel-Hamas war and French election, and analyzed differences between organic and inorganic accounts in hazard information usage.

Result: Model performs well (outperforms dictionary approaches), hazard information is common in geopolitical discussions, and inorganic accounts representing weaker sides in conflicts often discuss hazards to civilians at different rates than organic accounts.

Conclusion: Hazard information is a significant indicator in information warfare, with strategic framing by information operators. The model is shared as a Python package for researchers and journalists to analyze hazard content.

Abstract: Socio-linguistic indicators of affectively-relevant phenomena, such as emotion or sentiment, are often extracted from text to better understand features of human-computer interactions, including on social media. However, an indicator that is often overlooked is the presence or absence of information concerning harms or hazards. Here, we develop a new model to detect information concerning hazards, trained on a new collection of annotated X posts. We show that not only does this model perform well (outperforming, e.g., dictionary approaches), but that the hazard information it extracts is not strongly correlated with common indicators. To demonstrate the utility of our tool, we apply it to two datasets of X posts that discuss important geopolitical events, namely the Israel-Hamas war and the 2022 French national election. In both cases, we find that hazard information, especially information concerning conflict, is common. We extract accounts associated with information campaigns from each data set to explore how information about hazards could be used to attempt to influence geopolitical events. We find that inorganic accounts representing the viewpoints of weaker sides in a conflict often discuss hazards to civilians, potentially as a way to elicit aid for the weaker side. Moreover, the rate at which these hazards are mentioned differs markedly from organic accounts, likely reflecting information operators’ efforts to frame the given geopolitical event for strategic purposes. These results are first steps towards exploring hazards within an information warfare environment. The model is shared as a Python package to help researchers and journalists analyze hazard content. The model, along with data and annotations, is available in the following repository: https://github.com/KeithBurghardt/DetectHazards.

[348] Estimating Global Input Relevance and Enforcing Sparse Representations with a Scalable Spectral Neural Network Approach

Lorenzo Chicchi, Lorenzo Buffoni, Diego Febbe, Lorenzo Giambagli, Raffaele Marino, Duccio Fanelli

Main category: cs.LG

TL;DR: A novel spectral-based method for automatically ranking input feature importance in Deep Neural Networks during training, using eigenvalue analysis to identify relevant features and enable sparse representations.

Details

Motivation: To improve explainability in machine learning by identifying and ranking key input features that influence neural network decisions, helping understand the decision-making process.

Method: Spectral re-parametrization of optimization process where eigenvalues associated with input nodes serve as proxies for feature relevance; includes eigenvalue regularization to enforce sparse input representations.

Result: Successfully tested against synthetic and real data, showing comparable or better performance than common feature importance methods while providing automatic feature ranking during training.

Conclusion: The spectral approach provides an effective, automatic way to gauge feature importance in neural networks, enhancing model explainability and enabling sparse input representations without additional post-processing.

Abstract: In machine learning practice it is often useful to identify relevant input features. Isolating key input elements, ranked according to their respective degree of relevance, can help to elaborate on the process of decision making. Here, we propose a novel method to estimate the relative importance of the input components for a Deep Neural Network. This is achieved by leveraging a spectral re-parametrization of the optimization process. Eigenvalues associated with input nodes provide in fact a robust proxy to gauge the relevance of the supplied entry features. Notably, the spectral features ranking is performed automatically, as a byproduct of the network training, with no additional processing to be carried out. Moreover, by leveraging the regularization of the eigenvalues, it is possible to enforce solutions making use of a minimum subset of the input components, increasing the explainability of the model and providing sparse input representations. The technique is compared to the most common methods in the literature and is successfully challenged against both synthetic and real data.

[349] “Normalized Stress” is Not Normalized: How to Interpret Stress Correctly

Kiran Smelser, Jacob Miller, Stephen Kobourov

Main category: cs.LG

TL;DR: Normalized stress, a common quality metric for dimension reduction projections, is sensitive to uniform scaling despite this not meaningfully changing the projection. The paper analyzes this scaling effect and introduces a scale-invariant version of normalized stress.

Details

Motivation: Stress is widely used to evaluate dimension reduction projections, but it's sensitive to uniform scaling which doesn't meaningfully change the projection's properties. This scaling sensitivity can affect the evaluation and comparison of dimension reduction techniques.

Method: The authors investigate the scaling effect on stress and other distance-based quality metrics both analytically and empirically. They introduce a simple technique to make normalized stress scale invariant and test it on a small benchmark.

Result: The paper shows how much stress values change with scaling and how this affects dimension reduction evaluations. The proposed scale-invariant normalized stress accurately captures expected behavior in benchmark tests.

Conclusion: The scaling sensitivity of normalized stress can distort dimension reduction evaluations. The proposed scale-invariant version provides a more reliable metric for assessing projection quality.

Abstract: Stress is among the most commonly employed quality metrics and optimization criteria for dimension reduction projections of high dimensional data. Complex, high dimensional data is ubiquitous across many scientific disciplines, including machine learning, biology, and the social sciences. One of the primary methods of visualizing these datasets is with two dimensional scatter plots that visually capture some properties of the data. Because visually determining the accuracy of these plots is challenging, researchers often use quality metrics to measure projection accuracy or faithfulness to the full data. One of the most commonly employed metrics, normalized stress, is sensitive to uniform scaling of the projection, despite this act not meaningfully changing anything about the projection. We investigate the effect of scaling on stress and other distance based quality metrics analytically and empirically by showing just how much the values change and how this affects dimension reduction technique evaluations. We introduce a simple technique to make normalized stress scale invariant and show that it accurately captures expected behavior on a small benchmark.
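
The scaling sensitivity is easy to reproduce numerically. The sketch below assumes the common definition of normalized stress, the sum of squared distance differences divided by the sum of squared high-dimensional distances. It obtains a scale-invariant variant by minimizing over a uniform scale factor, which has a closed form. Whether this matches the paper's exact construction is an assumption. The function names and the random data are illustrative only.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances for all unordered pairs of rows."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)
    return dist[upper]

def normalized_stress(d_high, d_low):
    """Squared distance mismatch, normalized by the high-dimensional distances."""
    return ((d_high - d_low) ** 2).sum() / (d_high ** 2).sum()

def scale_normalized_stress(d_high, d_low):
    """Normalized stress at the uniform scale alpha that minimizes it."""
    alpha = (d_high * d_low).sum() / (d_low ** 2).sum()
    return normalized_stress(d_high, alpha * d_low)

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 5))   # toy high-dimensional data
projection = data[:, :2]          # toy 2-D "projection": keep two coordinates

d_high = pairwise_distances(data)
d_low = pairwise_distances(projection)

for scale in (1.0, 10.0):
    print(scale,
          normalized_stress(d_high, scale * d_low),
          scale_normalized_stress(d_high, scale * d_low))
```

Multiplying the projection by 10 changes the plain normalized stress substantially, while the scale-minimized value stays the same, because the optimal `alpha` absorbs any uniform rescaling.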

[350] MonoKAN: Certified Monotonic Kolmogorov-Arnold Network

Alejandro Polo-Molina, David Alfaya, Jose Portela

Main category: cs.LG

TL;DR: MonoKAN is a novel neural network architecture that achieves certified partial monotonicity while enhancing interpretability, outperforming state-of-the-art monotonic MLPs in both interpretability and predictive performance.

Details

Motivation: Address the challenge of interpretability in ANNs and the need for model predictions to align with expert-imposed requirements like partial monotonicity constraints, especially in applications requiring transparency and accountability.

Method: Propose MonoKAN architecture based on KAN, using cubic Hermite splines as learnable activation functions with straightforward monotonicity conditions, and positive weights in linear combinations to preserve monotonic relationships.

Result: MonoKAN enhances interpretability and improves predictive performance across most benchmarks, outperforming state-of-the-art monotonic MLP approaches.

Conclusion: MonoKAN successfully achieves certified partial monotonicity while providing better interpretability and performance than existing monotonic methods, addressing key limitations in neural network transparency.

Abstract: Artificial Neural Networks (ANNs) have significantly advanced various fields by effectively recognizing patterns and solving complex problems. Despite these advancements, their interpretability remains a critical challenge, especially in applications where transparency and accountability are essential. To address this, explainable AI (XAI) has made progress in demystifying ANNs, yet interpretability alone is often insufficient. In certain applications, model predictions must align with expert-imposed requirements, sometimes exemplified by partial monotonicity constraints. While monotonic approaches are found in the literature for traditional Multi-layer Perceptrons (MLPs), they still face difficulties in achieving both interpretability and certified partial monotonicity. Recently, the Kolmogorov-Arnold Network (KAN) architecture, based on learnable activation functions parametrized as splines, has been proposed as a more interpretable alternative to MLPs. Building on this, we introduce a novel ANN architecture called MonoKAN, which is based on the KAN architecture and achieves certified partial monotonicity while enhancing interpretability. To achieve this, we employ cubic Hermite splines, which guarantee monotonicity through a set of straightforward conditions. Additionally, by using positive weights in the linear combinations of these splines, we ensure that the network preserves the monotonic relationships between input and output. Our experiments demonstrate that MonoKAN not only enhances interpretability but also improves predictive performance across the majority of benchmarks, outperforming state-of-the-art monotonic MLP approaches.
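
As an illustration of how "straightforward conditions" can certify a monotone cubic Hermite spline, the sketch below checks the classical Fritsch-Carlson sufficient condition: on each interval, both endpoint slopes must lie between 0 and 3 times the secant slope. That MonoKAN uses exactly this condition is an assumption. The helper names and the knot data are hypothetical.

```python
import numpy as np

def satisfies_monotone_condition(x_knots, y, m):
    """Sufficient (not necessary) check that the spline is non-decreasing."""
    x_knots, y, m = (np.asarray(v, dtype=float) for v in (x_knots, y, m))
    secants = np.diff(y) / np.diff(x_knots)
    for k, d in enumerate(secants):
        if d < 0:
            return False
        if d == 0:
            # A flat segment needs zero slope at both ends.
            if m[k] != 0 or m[k + 1] != 0:
                return False
            continue
        alpha, beta = m[k] / d, m[k + 1] / d
        if not (0 <= alpha <= 3 and 0 <= beta <= 3):
            return False
    return True

def hermite_eval(x_knots, y, m, x):
    """Evaluate the cubic Hermite spline with knot values y and slopes m."""
    x_knots, y, m, x = (np.asarray(v, dtype=float) for v in (x_knots, y, m, x))
    k = np.clip(np.searchsorted(x_knots, x, side="right") - 1, 0, len(x_knots) - 2)
    h = x_knots[k + 1] - x_knots[k]
    t = (x - x_knots[k]) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * y[k] + h10 * h * m[k] + h01 * y[k + 1] + h11 * h * m[k + 1]

knots = [0.0, 1.0, 2.0, 3.0]
values = [0.0, 1.0, 1.5, 4.0]
good_slopes = [1.0, 0.5, 1.0, 2.0]   # passes the check
bad_slopes = [1.0, 5.0, 1.0, 2.0]    # slope 5 exceeds 3 * secant on the first interval

grid = np.linspace(0.0, 3.0, 301)
curve = hermite_eval(knots, values, good_slopes, grid)
print(satisfies_monotone_condition(knots, values, good_slopes),
      satisfies_monotone_condition(knots, values, bad_slopes),
      bool(np.all(np.diff(curve) >= -1e-12)))
```

Because a non-negative weighted sum of non-decreasing functions is itself non-decreasing, combining such splines with positive weights keeps the input-output relationship monotone, which is the role the abstract assigns to the positive weights.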

[351] Text-guided multi-property molecular optimization with a diffusion language model

Yida Xiong, Kun Li, Jiameng Chen, Hongzhi Zhang, Di Lin, Yan Che, Wenbin Hu

Main category: cs.LG

TL;DR: TransDLM: A text-guided multi-property molecular optimization method using transformer-based diffusion language model that outperforms state-of-the-art methods in maintaining structural similarity and enhancing chemical properties.

Details

Motivation: Existing molecular optimization approaches rely on external property predictors that introduce errors and noise due to approximation, leading to discrepancy accumulation, generalization reduction and suboptimal candidates.

Method: Uses a transformer-based diffusion language model (TransDLM) with standardized chemical nomenclature as semantic representations, implicitly embedding property requirements into textual descriptions to mitigate error propagation during the diffusion process.

Result: Surpasses state-of-the-art methods in maintaining molecular structural similarity and enhancing chemical properties on benchmark dataset. A case study demonstrates practical problem-solving ability.

Conclusion: TransDLM effectively integrates diverse information sources to guide precise optimization, enhancing the model’s ability to balance structural retention and property enhancement in molecular optimization.

Abstract: Molecular optimization (MO) is a crucial stage in drug discovery in which task-oriented generated molecules are optimized to meet practical industrial requirements. Existing mainstream MO approaches primarily utilize external property predictors to guide iterative property optimization. However, learning all molecular samples in the vast chemical space is unrealistic for predictors. As a result, errors and noise are inevitably introduced during property prediction due to the nature of approximation. This leads to discrepancy accumulation, generalization reduction and suboptimal molecular candidates. In this paper, we propose a text-guided multi-property molecular optimization method utilizing a transformer-based diffusion language model (TransDLM). TransDLM leverages standardized chemical nomenclature as semantic representations of molecules and implicitly embeds property requirements into textual descriptions, thereby mitigating error propagation during the diffusion process. By fusing physically and chemically detailed textual semantics with specialized molecular representations, TransDLM effectively integrates diverse information sources to guide precise optimization, which enhances the model’s ability to balance structural retention and property enhancement. Additionally, the success of a case study further demonstrates TransDLM’s ability to solve practical problems. Experimentally, our approach surpasses state-of-the-art methods in maintaining molecular structural similarity and enhancing chemical properties on the benchmark dataset.

[352] Physically Interpretable World Models via Weakly Supervised Representation Learning

Zhenjiang Mao, Mrinall Eashaan Umasudhan, Ivan Ruchkin

Main category: cs.LG

TL;DR: PIWM framework learns physically interpretable world models from images by aligning latent representations with real-world physical quantities and constraining their evolution through known physical dynamics, without requiring ground-truth annotations.

Details

Motivation: Standard world models lack physical interpretability, limiting their reliability, generalizability, and applicability to safety-critical tasks in cyber-physical systems.

Method: Uses VQ-based visual encoder, transformer-based physical encoder, and learnable dynamics model grounded in known physical equations with weak distribution-based supervision that captures state uncertainty from real-world sensing.

Result: Achieves accurate long-horizon prediction, recovers true system parameters, and significantly improves physical grounding over purely data-driven models across Cart Pole, Lunar Lander, and Donkey Car case studies.

Conclusion: Demonstrates feasibility and advantages of learning physically interpretable world models directly from images under weak supervision, enabling more reliable and generalizable models for safety-critical applications.

Abstract: Learning predictive models from high-dimensional sensory observations is fundamental for cyber-physical systems, yet the latent representations learned by standard world models lack physical interpretability. This limits their reliability, generalizability, and applicability to safety-critical tasks. We introduce Physically Interpretable World Models (PIWM), a framework that aligns latent representations with real-world physical quantities and constrains their evolution through partially known physical dynamics. Physical interpretability in PIWM is defined by two complementary properties: (i) the learned latent state corresponds to meaningful physical variables, and (ii) its temporal evolution follows physically consistent dynamics. To achieve this without requiring ground-truth physical annotations, PIWM employs weak distribution-based supervision that captures state uncertainty naturally arising from real-world sensing pipelines. The architecture integrates a VQ-based visual encoder, a transformer-based physical encoder, and a learnable dynamics model grounded in known physical equations. Across three case studies (Cart Pole, Lunar Lander, and Donkey Car), PIWM achieves accurate long-horizon prediction, recovers true system parameters, and significantly improves physical grounding over purely data-driven models. These results demonstrate the feasibility and advantages of learning physically interpretable world models directly from images under weak supervision.

[353] Sometimes Painful but Certainly Promising: Feasibility and Trade-offs of Language Model Inference at the Edge

Maximilian Abstreiter, Sasu Tarkoma, Roberto Morabito

Main category: cs.LG

TL;DR: Comprehensive evaluation of generative language model inference on edge devices, examining performance, energy consumption, and memory usage trade-offs.

Details

Motivation: The shift towards compact language models for edge deployment offers benefits like enhanced privacy and reduced latency, but raises questions about practical trade-offs given limited edge computing resources.

Method: Conducted comprehensive evaluation of generative LM inference on CPU-based and GPU-accelerated edge devices, measuring memory usage, inference speed, energy consumption, throughput-energy trade-offs, cost, usability, and qualitative model performance.

Result: Quantization helps reduce memory overhead but doesn’t fully eliminate resource bottlenecks, especially for larger models. Findings quantify memory and energy constraints that must be considered for practical deployments.

Conclusion: LM deployment at the edge is still at an early stage. The study provides a foundation for future research on refining models, improving inference efficiency, and advancing edge-centric AI systems.

Abstract: The rapid rise of Language Models (LMs) has expanded the capabilities of natural language processing, powering applications from text generation to complex decision-making. While state-of-the-art LMs often boast hundreds of billions of parameters and are primarily deployed in data centers, recent trends show a growing focus on compact models (typically under 10 billion parameters), enabled by quantization and other model compression techniques. This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty. However, the inherent complexity of even these smaller models, combined with the limited computing resources of edge hardware, raises critical questions about the practical trade-offs in executing LM inference outside the cloud. To address these challenges, we present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices. Our study measures key performance indicators, including memory usage, inference speed, and energy consumption, across various device configurations. Additionally, we examine throughput-energy trade-offs, cost considerations, and usability, alongside an assessment of qualitative model performance. While quantization helps mitigate memory overhead, it does not fully eliminate resource bottlenecks, especially for larger models. Our findings quantify the memory and energy constraints that must be considered for practical real-world deployments, offering concrete insights into the trade-offs between model size, inference performance, and efficiency. The exploration of LMs at the edge is still in its early stages. We hope this study provides a foundation for future research, guiding the refinement of models, the enhancement of inference efficiency, and the advancement of edge-centric AI systems.
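
A back-of-the-envelope calculation shows why quantization reduces, but does not remove, the memory pressure. The sketch below counts weight storage only. It ignores the KV cache, activations, and quantization metadata, and the 7-billion-parameter size is a hypothetical example rather than a model from the study.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Hypothetical 7B-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"7B params @ {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Even at 4 bits the weights alone take about 3.5 GB, which is a large share of the memory on many edge boards before any runtime overhead is counted.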

[354] TRACE: Time SeRies PArameter EffiCient FinE-tuning

Yuze Li, Wei Zhu

Main category: cs.LG

TL;DR: TRACE is an efficient fine-tuning method for time series foundation models that addresses challenges in adapting to varying temporal characteristics and improves long-term forecasting performance with reduced parameters.

Details

Motivation: Time series foundation models face challenges due to varying data characteristics (frequency, channels, lengths) and existing parameter-efficient methods like LoRA need adaptation for temporal data. Long-term forecasting requires tailored fine-tuning for optimal performance.

Method: TRACE introduces two innovations: (1) Gated DSIC - an unbiased LoRA module importance selection mechanism ensuring conditional parameter consistency, and (2) Reconstructed prediction heads for long-term forecasting that reduce parameter counts while maintaining performance.

Result: Extensive experiments on long/short-term forecasting, anomaly detection, and natural language tasks across diverse datasets show TRACE outperforms common fine-tuning methods and achieves comparable or superior performance to linear probing heads with drastically reduced parameters.

Conclusion: TRACE provides an effective parameter-efficient fine-tuning framework for time series foundation models that addresses temporal data challenges and significantly enhances long-term forecasting performance while reducing computational overhead.

Abstract: We propose an efficient fine-tuning method for time series foundation models, termed TRACE: Time Series Parameter Efficient Fine-tuning. While pretrained time series foundation models are gaining popularity, they face the following challenges: (1) Unlike natural language tasks, time series data vary in frequency, channel count, and historical/prediction lengths. For long-term forecasting tasks in particular, tailored fine-tuning can significantly enhance performance. (2) Existing parameter-efficient tuning methods like LoRA remain applicable but require adaptation to temporal characteristics. To address these challenges, our TRACE framework introduces two key innovations: (1) Gated DSIC (Gated Dynamic Simulation Importance Calculation), an unbiased LoRA module importance selection mechanism that ensures conditional parameter consistency before and after masking. Experiments demonstrate that Gated DSIC outperforms common fine-tuning. (2) Reconstructed prediction heads for long-term forecasting tasks, which achieve comparable or superior performance to linear probing heads while drastically reducing parameter counts. Extensive experiments on long-/short-term forecasting, anomaly detection and natural language tasks across diverse datasets, coupled with ablation studies, validate the effectiveness of our method.
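
For readers unfamiliar with LoRA, the sketch below shows a standard low-rank update with a scalar gate that can mask the module. The gate is only a stand-in for module masking. How Gated DSIC actually scores or selects modules is not described in enough detail here to reproduce, so this is not the paper's mechanism. The dimensions and random weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.1   # trainable down-projection
B = rng.normal(size=(d_out, rank)) * 0.1  # trainable up-projection (random here, as if already trained)

def lora_forward(x, gate=1.0):
    """Base layer plus gated low-rank update; gate=0 masks the LoRA module."""
    return W @ x + gate * (B @ (A @ x))

x = rng.normal(size=d_in)
full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(full_params, lora_params)
```

With these sizes the adapter trains 512 parameters instead of 4096. Setting `gate=0` recovers the frozen layer exactly, so a masked module leaves the pretrained behavior unchanged.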

[355] Multi-Objective Reinforcement Learning for Water Management

Zuzanna Osika, Roxana Rădulescu, Jazmin Zatarain Salazar, Frans Oliehoek, Pradeep K. Murukannaiah

Main category: cs.LG

TL;DR: The paper introduces a water resource management case study (Nile river basin) as a complex MORL benchmark and shows that specialized domain methods outperform state-of-the-art MORL algorithms.

Details

Motivation: Multi-objective reinforcement learning lacks realistic, complex environments and benchmarks for real-world applications like resource management, autonomous driving, and drug discovery.

Method: Created a water resource management case study of the Nile river basin modeled as a MORL environment, then benchmarked existing MORL algorithms against specialized water management methods.

Result: Specialized water management methods outperformed state-of-the-art MORL approaches, highlighting scalability challenges for MORL algorithms in real-world scenarios.

Conclusion: MORL algorithms face significant scalability issues in complex real-world environments, and domain-specific approaches currently outperform general MORL methods in practical applications.

Abstract: Many real-world problems (e.g., resource management, autonomous driving, drug discovery) require optimizing multiple, conflicting objectives. Multi-objective reinforcement learning (MORL) extends classic reinforcement learning to handle multiple objectives simultaneously, yielding a set of policies that capture various trade-offs. However, the MORL field lacks complex, realistic environments and benchmarks. We introduce a water resource (Nile river basin) management case study and model it as a MORL environment. We then benchmark existing MORL algorithms on this task. Our results show that specialized water management methods outperform state-of-the-art MORL approaches, underscoring the scalability challenges MORL algorithms face in real-world scenarios.
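
The "set of policies that capture various trade-offs" is usually summarized by its Pareto front. The sketch below filters non-dominated return vectors, assuming every objective is to be maximized. The return values are made up and unrelated to the Nile case study.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(returns):
    """Keep the return vectors that no other vector dominates."""
    return [p for p in returns if not any(dominates(q, p) for q in returns)]

# Hypothetical two-objective returns of five policies.
policy_returns = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
front = pareto_front(policy_returns)
print(front)
```

Here `(2, 2)` and `(1, 1)` are dropped because other policies are at least as good on both objectives. The remaining three represent distinct trade-offs.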

[356] Quantitative Attractor Analysis of High-Capacity Kernel Logistic Regression Hopfield Networks

Akira Tamamori

Main category: cs.LG

TL;DR: Kernel Logistic Regression (KLR) dramatically increases Hopfield network storage capacity, with linear scaling (P ∝ N) when kernel width γ is properly scaled (γ×N increases with N). KLR and Kernel Ridge Regression show similar high capacities, though KRR is computationally faster.

Details

Motivation: To establish principles governing performance and stability of kernel-based learning methods in Hopfield networks, addressing critical questions of generality, scalability, and robustness.

Method: Comprehensive quantitative analysis of attractor landscapes through extensive, statistically validated simulations comparing KLR and KRR, with investigation of kernel width scaling, storage capacity scaling, and regularization parameter sensitivity.

Result: KLR and KRR exhibit similarly high storage capacities and clean attractor landscapes. Optimal capacity requires γ×N to increase with network size N, leading to linear storage capacity scaling (P ∝ N). Performance is remarkably robust to regularization parameter λ.

Conclusion: The findings provide clear empirical principles for designing high-capacity, robust associative memories and clarify how kernel methods overcome classical limitations of Hopfield-type models.

Abstract: Kernel-based learning methods such as Kernel Logistic Regression (KLR) can dramatically increase the storage capacity of Hopfield networks, but the principles governing their performance and stability remain largely uncharacterized. This paper presents a comprehensive quantitative analysis of the attractor landscape in KLR-trained networks to establish a solid foundation for their design and application. Through extensive, statistically validated simulations, we address critical questions of generality, scalability, and robustness. Our comparative analysis reveals that KLR and Kernel Ridge Regression (KRR) exhibit similarly high storage capacities and clean attractor landscapes, suggesting this is a general property of kernel regression methods, though KRR is computationally much faster. We uncover a non-trivial, scale-dependent scaling law for the kernel width ($γ$), demonstrating that optimal capacity requires $γ$ to be scaled such that $γ\times N$ increases with network size $N$. This implies that larger networks necessitate more localized kernels – where each pattern’s influence is more spatially confined – to manage inter-pattern interference. Under this optimized scaling, we provide definitive evidence that the storage capacity scales linearly with network size ($P \propto N$). Furthermore, our sensitivity analysis shows that performance is remarkably robust to the choice of the regularization parameter $λ$. Collectively, these findings provide a clear set of empirical principles for designing high-capacity, robust associative memories and clarify the mechanisms that enable kernel methods to overcome the classical limitations of Hopfield-type models.
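
A kernel-trained Hopfield network can be sketched in a few lines using Kernel Ridge Regression, which has a closed-form solution. This is KRR, not the KLR training the paper focuses on. The Gaussian kernel form, the network size, and the values of `gamma` and `lam` are illustrative assumptions, and the sketch does not reproduce the paper's scaling experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 64, 10              # neurons and stored patterns (toy sizes)
gamma, lam = 0.02, 0.01    # kernel width and ridge regularization (assumed values)

patterns = rng.choice([-1.0, 1.0], size=(P, N))

def rbf_kernel(a, b, gamma):
    """Gaussian kernel exp(-gamma * ||a_i - b_j||^2) for all row pairs."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

# Closed-form KRR: every neuron regresses its own target bit on the whole pattern.
K = rbf_kernel(patterns, patterns, gamma)
dual = np.linalg.solve(K + lam * np.eye(P), patterns)   # shape (P, N)

def recall(state, steps=5):
    """Synchronous updates: each neuron takes the sign of its kernel regressor."""
    s = state.copy()
    for _ in range(steps):
        field = rbf_kernel(s[None, :], patterns, gamma) @ dual
        s = np.where(field[0] >= 0, 1.0, -1.0)
    return s

probe = patterns[0].copy()
flipped = rng.choice(N, size=3, replace=False)
probe[flipped] *= -1       # corrupt three bits
recovered = recall(probe)
print(int((recovered == patterns[0]).sum()), "of", N, "bits match")
```

The stored patterns are fixed points of the update, and a lightly corrupted probe is pulled back to its pattern. The paper's finding that `gamma * N` should grow with `N` concerns how to choose `gamma` as the network is scaled up, which this fixed-size sketch does not explore.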

[357] Defending the Edge: Representative-Attention Defense against Backdoor Attacks in Federated Learning

Chibueze Peace Obioma, Youcheng Sun, Mustafa A. Mustafa

Main category: cs.LG

TL;DR: FeRA is a novel federated learning defense that uses attention-driven consistency analysis to detect adaptive backdoor attacks by identifying malicious clients through suppressed representation-space variance and norm inflation, achieving superior backdoor mitigation while maintaining high clean accuracy.

Details

Motivation: Existing federated learning defenses rely on anomaly detection in parameter/gradient space but fail against adaptive backdoor attacks that mimic benign statistics while preserving backdoor functionality, creating a fundamental detection gap.

Method: FeRA shifts detection from anomaly-centric to consistency-centric analysis using multi-dimensional behavioral analysis combining spectral/spatial attention, directional alignment, mutual similarity, and norm inflation across two detection mechanisms: consistency analysis and norm-inflation detection.

Result: Extensive evaluation across 6 datasets, 9 attacks, and 3 model architectures under IID and non-IID settings shows FeRA achieves the lowest average Backdoor Accuracy (1.67%) while maintaining high clean accuracy compared to state-of-the-art defenses.

Conclusion: FeRA provides superior backdoor mitigation in federated learning by exploiting the intrinsic need for backdoor persistence across training rounds through attention-driven consistency analysis, effectively detecting adaptive attacks that evade traditional anomaly detection methods.

Abstract: Federated learning (FL) remains highly vulnerable to adaptive backdoor attacks that preserve stealth by closely imitating benign update statistics. Existing defenses predominantly rely on anomaly detection in parameter or gradient space, overlooking behavioral constraints that backdoor attacks must satisfy to ensure reliable trigger activation. These anomaly-centric methods fail against adaptive attacks that normalize update magnitudes and mimic benign statistical patterns while preserving backdoor functionality, creating a fundamental detection gap. To address this limitation, this paper introduces FeRA (Federated Representative Attention) – a novel attention-driven defense that shifts the detection paradigm from anomaly-centric to consistency-centric analysis. FeRA exploits the intrinsic need for backdoor persistence across training rounds, identifying malicious clients through suppressed representation-space variance, a property orthogonal to traditional magnitude-based statistics. The framework conducts multi-dimensional behavioral analysis combining spectral and spatial attention, directional alignment, mutual similarity, and norm inflation across two complementary detection mechanisms: consistency analysis and norm-inflation detection. Through these mechanisms, FeRA isolates malicious clients that exhibit low-variance consistency or magnitude amplification. Extensive evaluation across six datasets, nine attacks, and three model architectures under both Independent and Identically Distributed (IID) and non-IID settings confirms that FeRA achieves superior backdoor mitigation. Under different non-IID settings, FeRA achieved the lowest average Backdoor Accuracy (BA), about 1.67%, while maintaining high clean accuracy compared to other state-of-the-art defenses. The code is available at https://github.com/Peatech/FeRA_defense.git.

[358] Deterministic Bounds and Random Estimates of Metric Tensors on Neuromanifolds

Ke Sun

Main category: cs.LG

TL;DR: The paper analyzes the Fisher information metric on deep neural network parameter spaces, extends bounds from low-dimensional probability spaces to neural networks, and introduces an efficient unbiased estimator.

Details

Motivation: Understanding the Fisher information metric on neural network parameter spaces is crucial for both theoretical analysis and practical methods in deep learning.

Method: Analyze the metric spectrum in low-dimensional probability spaces, extend bounds to neural networks, and develop an unbiased random estimator using Hutchinson’s trace estimator that requires only one backward pass.

Result: The proposed estimator is efficient (single backward pass), unbiased, and has bounded standard deviation relative to the true metric value.

Conclusion: The method provides practical tools for estimating Fisher information metrics in deep learning with theoretical guarantees and computational efficiency.

Abstract: The high dimensional parameter space of modern deep neural networks – the neuromanifold – is endowed with a unique metric tensor defined by the Fisher information, estimating which is crucial for both theory and practical methods in deep learning. To analyze this tensor for classification networks, we return to a low dimensional space of probability distributions – the core space – and carefully analyze the spectrum of its Riemannian metric. We extend our discoveries there into deterministic bounds of the metric tensor on the neuromanifold. We introduce an unbiased random estimate of the metric tensor and its bounds based on Hutchinson’s trace estimator. It can be evaluated efficiently through a single backward pass, with a standard deviation bounded by the true value up to scaling.
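
For readers unfamiliar with Hutchinson's trace estimator, the following sketch shows the generic technique on a small explicit matrix. It is not the paper's metric-tensor estimator: the matrix A, the helper names, and the use of Rademacher probe vectors are assumptions made for illustration. The idea is that E[zᵀAz] = tr(A) whenever the probe z has zero mean and identity covariance, so only matrix-vector products are needed.

```python
import random

def hutchinson_trace(matvec, dim, num_samples=100, seed=0):
    """Estimate trace(A) from matrix-vector products with A only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        az = matvec(z)
        total += sum(zi * azi for zi, azi in zip(z, az))
    return total / num_samples

def make_matvec(matrix):
    """Wrap an explicit matrix as a function v -> A v."""
    def matvec(v):
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return matvec

# Hypothetical symmetric test matrix with trace 6.
A = [[2.0, 0.5, 0.0],
     [0.5, 3.0, 0.1],
     [0.0, 0.1, 1.0]]
estimate = hutchinson_trace(make_matvec(A), dim=3, num_samples=2000)
diag_estimate = hutchinson_trace(make_matvec([[1.0, 0.0], [0.0, 4.0]]), dim=2, num_samples=5)
```

With Rademacher probes the estimate is exact for diagonal matrices, because every squared entry of z equals 1. The off-diagonal entries are the only source of variance.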

[359] Improving Generalization of Neural Combinatorial Optimization for Vehicle Routing Problems via Test-Time Projection Learning

Yuanyao Chen, Rongsheng Chen, Fu Luo, Zhenkun Wang

Main category: cs.LG

TL;DR: A novel LLM-driven framework that enables neural combinatorial optimization models trained on small-scale instances to scale effectively to large problems (up to 100K nodes) without retraining, by learning a projection between training and testing distributions during inference.

Details

Motivation: Existing Neural Combinatorial Optimization methods trained on small instances (e.g., 100 nodes) suffer significant performance degradation when applied to large-scale scenarios due to distributional shift between training and testing data.

Method: Introduces a Large Language Model-driven framework that learns a projection between training and testing distributions, deployed exclusively during the inference phase without requiring model retraining.

Result: Enables backbone models trained on 100-node instances to achieve superior performance on large-scale TSP and CVRP problems up to 100K nodes from diverse distributions.

Conclusion: The proposed LLM-driven framework effectively addresses the scalability limitation of NCO methods by handling distributional shift during inference, making small-scale trained models applicable to large-scale problems without retraining.

Abstract: Neural Combinatorial Optimization (NCO) has emerged as a promising learning-based paradigm for addressing Vehicle Routing Problems (VRPs) by minimizing the need for extensive manual engineering. While existing NCO methods, trained on small-scale instances (e.g., 100 nodes), have demonstrated considerable success on problems of similar scale, their performance significantly degrades when applied to large-scale scenarios. This degradation arises from the distributional shift between training and testing data, rendering policies learned on small instances ineffective for larger problems. To overcome this limitation, we introduce a novel learning framework driven by Large Language Models (LLMs). This framework learns a projection between the training and testing distributions, which is then deployed to enhance the scalability of the NCO model. Notably, unlike prevailing techniques that necessitate joint training with the neural network, our approach operates exclusively during the inference phase, obviating the need for model retraining. Extensive experiments demonstrate that our method enables a backbone model (trained on 100-node instances) to achieve superior performance on large-scale Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) of up to 100K nodes from diverse distributions.

[360] SHIELD: Secure Hypernetworks for Incremental Expansion Learning Defense

Patryk Krukowski, Łukasz Gorczyca, Piotr Helm, Kamil Książek, Przemysław Spurek

Main category: cs.LG

TL;DR: SHIELD is a novel framework for certifiably robust continual learning that combines Interval Bound Propagation with hypernetworks, eliminating replay buffers and enabling efficient scaling across sequential tasks.

Details

Motivation: Existing continual learning methods often compromise either robustness or scalability when facing adversarial conditions, creating a need for theoretically grounded approaches that maintain both properties.

Method: Integrates IBP with hypernetwork architecture to generate task-specific parameters from compact embeddings. Introduces Interval MixUp training strategy that blends virtual examples as l∞ balls using interval arithmetic to guarantee robustness and mitigate wrapping effects.

Result: Outperforms existing robust continual learning methods under strong white-box attacks (PGD, AutoAttack) across multiple benchmarks, achieving state-of-the-art average accuracy while maintaining scalability and certification.

Conclusion: Represents a significant step toward practical and theoretically grounded continual learning in adversarial settings, successfully balancing robustness, scalability, and certification guarantees.

Abstract: Continual learning under adversarial conditions remains an open problem, as existing methods often compromise robustness, scalability, or both. We propose a novel framework that integrates Interval Bound Propagation (IBP) with a hypernetwork-based architecture to enable certifiably robust continual learning across sequential tasks. Our method, SHIELD, generates task-specific model parameters via a shared hypernetwork conditioned solely on compact task embeddings, eliminating the need for replay buffers or full model copies and enabling efficient scaling over time. To further enhance robustness, we introduce Interval MixUp, a novel training strategy that blends virtual examples represented as $\ell_{\infty}$ balls centered around MixUp points. Leveraging interval arithmetic, this technique guarantees certified robustness while mitigating the wrapping effect, resulting in smoother decision boundaries. We evaluate SHIELD under strong white-box adversarial attacks, including PGD and AutoAttack, across multiple benchmarks. It consistently outperforms existing robust continual learning methods, achieving state-of-the-art average accuracy while maintaining both scalability and certification. These results represent a significant step toward practical and theoretically grounded continual learning in adversarial settings.
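
The interval arithmetic behind IBP can be sketched in a few lines. The example below pushes an $\ell_{\infty}$ ball centered on a MixUp point through one linear layer and a ReLU. It is a generic IBP illustration, not SHIELD itself: the weights, inputs, radius, mixing coefficient, and function names are hypothetical, and the hypernetwork and training loss are omitted.

```python
def mixup_point(x1, x2, lam):
    """Convex combination of two inputs (the center of the virtual example)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]

def linf_ball(center, eps):
    """Axis-aligned box [center - eps, center + eps]."""
    return [c - eps for c in center], [c + eps for c in center]

def interval_linear(lower, upper, weights, bias):
    """Propagate the box [lower, upper] through y = W x + b."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A positive weight maps lower to lower; a negative weight swaps them.
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper

def interval_relu(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

# Hypothetical layer and inputs.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, -1.0]
center = mixup_point([1.0, 0.0], [1.0, 1.0], lam=0.5)   # [1.0, 0.5]
in_lo, in_hi = linf_ball(center, eps=0.25)
out_lo, out_hi = interval_relu(*interval_linear(in_lo, in_hi, W, b))
```

Every point inside the input box is guaranteed to map inside [out_lo, out_hi]. The bounds can be loose, since each output is bounded independently, which is the over-approximation usually called the wrapping effect.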

[361] The Impact of Feature Scaling In Machine Learning: Effects on Regression and Classification Tasks

João Manoel Herrera Pinheiro, Suzana Vilas Boas de Oliveira, Thiago Henrique Segreto Silva, Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Ricardo V. Godoy, Leonardo André Ambrosio, Marcelo Becker

Main category: cs.LG

TL;DR: Systematic evaluation of 12 feature scaling techniques across 14 ML algorithms and 16 datasets reveals that ensemble methods are robust to scaling while models like Logistic Regression, SVMs, TabNet, and MLPs show significant performance variations dependent on scaler choice.

Details

Motivation: Address the critical lack of comprehensive studies on feature scaling by systematically evaluating various scaling techniques across multiple ML algorithms and datasets for both classification and regression tasks.

Method: Evaluated 12 scaling techniques (including less common transformations) across 14 different Machine Learning algorithms and 16 datasets, analyzing impacts on predictive performance (accuracy, MAE, MSE, R²) and computational costs (training time, inference time, memory usage).

Result: Ensemble methods (Random Forest, XGBoost, CatBoost, LightGBM) demonstrate robust performance largely independent of scaling, while Logistic Regression, SVMs, TabNet, and MLPs show significant performance variations highly dependent on the chosen scaler.

Conclusion: Provides crucial model-specific guidance to practitioners on selecting feature scaling techniques, with all source code, experimental results, and model parameters made publicly available for transparency and reproducibility.

Abstract: This research addresses the critical lack of comprehensive studies on feature scaling by systematically evaluating 12 scaling techniques, including several less common transformations, across 14 different Machine Learning algorithms and 16 datasets for classification and regression tasks. We meticulously analyzed impacts on predictive performance (using metrics such as accuracy, MAE, MSE, and $R^2$) and computational costs (training time, inference time, and memory usage). Key findings reveal that while ensemble methods (such as Random Forest and gradient boosting models like XGBoost, CatBoost and LightGBM) demonstrate robust performance largely independent of scaling, other widely used models such as Logistic Regression, SVMs, TabNet, and MLPs show significant performance variations highly dependent on the chosen scaler. This extensive empirical analysis, with all source code, experimental results, and model parameters made publicly available to ensure complete transparency and reproducibility, offers crucial model-specific guidance to practitioners on selecting an appropriate feature scaling technique.
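
As a reminder of what a scaler does to a feature column, here is a minimal sketch of two widely used transformations, min-max scaling and standardization. The summary does not list the 12 techniques evaluated, so treating these two as representative is an assumption, and the function names are made up for this example.

```python
import statistics

def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

scaled = min_max_scale([10.0, 20.0, 30.0])
zscores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Both transformations are monotone within a single feature, so they do not change the ordering of its values. That is one common explanation for why tree ensembles, which split on thresholds, are largely insensitive to such scalers, while distance-based and gradient-based models are not.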

[362] Soft decision trees for survival analysis

Antonio Consolo, Edoardo Amaldi, Emilio Carrizosa

Main category: cs.LG

TL;DR: Proposes soft survival trees (SST) with soft splitting rules trained via nonlinear optimization, outperforming benchmark survival trees on discrimination and calibration measures.

Details

Motivation: To create more flexible and interpretable survival trees that can model complex relationships while maintaining the benefits of conditional computation and interpretability.

Method: Soft survival tree model with soft splitting rules trained via nonlinear optimization formulation amenable to decomposition, using parametric or semiparametric survival functions estimated through maximum likelihood.

Result: Numerical experiments on 15 datasets show SSTs outperform three benchmark survival trees in terms of discrimination and calibration measures.

Conclusion: SSTs combine flexibility with interpretability, can be extended for group fairness, and provide better performance than traditional survival trees.

Abstract: Decision trees are popular in survival analysis for their interpretability and ability to model complex relationships. Survival trees, which predict the timing of singular events using censored historical data, are typically built through heuristic approaches. Recently, there has been growing interest in globally optimized trees, where the overall tree is trained by minimizing the error function over all its parameters. We propose a new soft survival tree model (SST), with a soft splitting rule at each branch node, trained via a nonlinear optimization formulation amenable to decomposition. Since SSTs provide for every input vector a specific survival function associated to a single leaf node, they satisfy the conditional computation property and inherit the related benefits. SST and the training formulation combine flexibility with interpretability: any smooth survival function (parametric, semiparametric, or nonparametric) estimated through maximum likelihood can be used, and each leaf node of an SST yields a cluster of distinct survival functions which are associated to the data points routed to it. Numerical experiments on 15 well-known datasets show that SSTs, with parametric and spline-based semiparametric survival functions, trained using an adaptation of the node-based decomposition algorithm proposed by Consolo et al. (2024) for soft regression trees, outperform three benchmark survival trees in terms of four widely-used discrimination and calibration measures. SSTs can also be extended to consider group fairness.
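
The sketch below illustrates a soft splitting rule together with single-leaf prediction. It rests on several assumptions that go beyond the abstract: a logistic function of a linear score as the split probability, routing along the more probable branch, and an exponential survival function with one constant rate per leaf. The tree, its parameters, and all names are hypothetical, and training is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def route_to_leaf(x, tree):
    """Follow the more probable branch at each soft split until a leaf is reached."""
    node = tree
    while "rate" not in node:
        score = node["bias"] + sum(w * xi for w, xi in zip(node["weights"], x))
        p_left = sigmoid(score)          # soft split: probability of going left
        node = node["left"] if p_left >= 0.5 else node["right"]
    return node

def survival(leaf, t):
    """Exponential survival function S(t) = exp(-rate * t) (illustrative choice)."""
    return math.exp(-leaf["rate"] * t)

# Hypothetical depth-1 tree on a single covariate.
tree = {
    "weights": [1.0],
    "bias": 0.0,
    "left": {"rate": 0.1},    # lower-risk leaf
    "right": {"rate": 0.5},   # higher-risk leaf
}
low_risk = route_to_leaf([2.0], tree)
high_risk = route_to_leaf([-2.0], tree)
```

Because each input ends up at exactly one leaf, only the nodes on a single root-to-leaf path are evaluated at prediction time, which is the conditional computation property mentioned above.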

[363] Convergence Bound and Critical Batch Size of Muon Optimizer

Naoki Sato, Hiroki Naganuma, Hideaki Iiduka

Main category: cs.LG

TL;DR: Theoretical analysis of Muon optimizer showing convergence proofs across four settings, tighter bounds with weight decay, and derivation of critical batch size for computational efficiency.

Details

Motivation: Muon optimizer shows strong empirical performance but lacks theoretical foundation. This paper aims to provide theoretical analysis to support its practical success and understand its behavior.

Method: Provides convergence proofs for Muon across four practical settings (with/without Nesterov momentum and weight decay), analyzes interplay between weight decay and learning rate, and derives critical batch size for minimizing computational cost.

Result: Demonstrated that weight decay yields strictly tighter theoretical bounds, identified hyperparameters governing critical batch size, and validated findings through experiments on image classification and language modeling tasks.

Conclusion: The theoretical analysis supports Muon’s empirical success, provides insights into its behavior with different components, and offers practical guidance for hyperparameter tuning and computational efficiency.

Abstract: Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents a theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis identifies the hyperparameters governing this value, and our experiments validate the corresponding theoretical findings across workloads including image classification and language modeling tasks.

[364] Comprehensive Evaluation of Prototype Neural Networks

Philipp Schlinge, Steffen Meinert, Martin Atzmueller

Main category: cs.LG

TL;DR: Comprehensive analysis of prototype models (ProtoPNet, ProtoPool, PIPNet) using standard and new interpretability metrics across diverse datasets, with open-source implementation.

Details

Motivation: Prototype models are important for explainable AI and interpretable machine learning, but need systematic evaluation using comprehensive metrics to assess their interpretability.

Method: Applied prototype models on diverse datasets (fine-grained classification, Non-IID settings, multi-label classification) and evaluated using both standard metrics from literature and newly proposed interpretability metrics.

Result: Developed a comprehensive evaluation framework for prototype models and created an open-source library (quanproto) for easy application and extensibility of metrics.

Conclusion: The study provides systematic assessment tools for prototype-based XAI methods and makes them accessible through an open-source implementation that supports further research and development.

Abstract: Prototype models are an important method for explainable artificial intelligence (XAI) and interpretable machine learning. In this paper, we perform an in-depth analysis of a set of prominent prototype models including ProtoPNet, ProtoPool and PIPNet. For their assessment, we apply a comprehensive set of metrics. In addition to applying standard metrics from literature, we propose several new metrics to further complement the analysis of model interpretability. In our experimentation, we apply the set of prototype models on a diverse set of datasets including fine-grained classification, Non-IID settings and multi-label classification to further contrast the performance. Furthermore, we also provide our code as an open-source library (https://github.com/uos-sis/quanproto), which facilitates simple application of the metrics themselves, as well as extensibility – providing the option for easily adding new metrics and models.

[365] Crafting Imperceptible On-Manifold Adversarial Attacks for Tabular Data

Zhipeng He, Alexander Stevens, Chun Ouyang, Johannes De Smedt, Alistair Barros, Catarina Moreira

Main category: cs.LG

TL;DR: Proposes a latent-space perturbation framework using mixed-input VAE for generating statistically consistent adversarial examples on tabular data, addressing challenges of heterogeneous features and distributional deviations in traditional methods.

Details

Motivation: Adversarial attacks on tabular data face unique challenges due to heterogeneous categorical/numerical features and lack of intuitive similarity metrics. Traditional gradient-based methods using ℓp-norm constraints often produce adversarial examples that deviate from original data distributions.

Method: Uses a mixed-input Variational Autoencoder (VAE) to integrate categorical embeddings and numerical features into a unified latent manifold, enabling perturbations that preserve statistical consistency. Introduces In-Distribution Success Rate (IDSR) for joint evaluation.

Result: Achieves substantially lower outlier rates and more consistent performance compared to traditional input-space attacks and other VAE-based methods. Shows superior practical utility and stability when reconstruction quality and sufficient training data conditions are met.

Conclusion: VAE-based attacks depend strongly on reconstruction quality and training data availability. The framework demonstrates the importance of maintaining on-manifold perturbations for generating realistic and robust adversarial examples in tabular domains.

Abstract: Adversarial attacks on tabular data present unique challenges due to the heterogeneous nature of mixed categorical and numerical features. Unlike images where pixel perturbations maintain visual similarity, tabular data lacks intuitive similarity metrics, making it difficult to define imperceptible modifications. Additionally, traditional gradient-based methods prioritise $\ell_p$-norm constraints, often producing adversarial examples that deviate from the original data distributions. To address this, we propose a latent-space perturbation framework using a mixed-input Variational Autoencoder (VAE) to generate statistically consistent adversarial examples. The proposed VAE integrates categorical embeddings and numerical features into a unified latent manifold, enabling perturbations that preserve statistical consistency. We introduce In-Distribution Success Rate (IDSR) to jointly evaluate attack effectiveness and distributional alignment. Evaluation across six publicly available datasets and three model architectures demonstrates that our method achieves substantially lower outlier rates, higher IDSR, and more consistent performance compared to traditional input-space attacks and other VAE-based methods adapted from image domain approaches. Our comprehensive analyses of hyperparameter sensitivity, sparsity control, and generative architecture demonstrate that the effectiveness of VAE-based attacks depends strongly on reconstruction quality and the availability of sufficient training data. When these conditions are met, the proposed framework achieves superior practical utility and stability compared with input-space methods. This work underscores the importance of maintaining on-manifold perturbations for generating realistic and robust adversarial examples in tabular domains.

[366] Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder

Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao

Main category: cs.LG

TL;DR: SAE Debias is a lightweight, model-agnostic framework that uses k-sparse autoencoders to identify and suppress gender bias in text-to-image diffusion models without retraining or architectural changes.

Details

Motivation: Text-to-image diffusion models exhibit significant gender bias by generating stereotypical associations between professions and gendered subjects, which requires effective debiasing methods.

Method: Leverages k-sparse autoencoder pre-trained on gender bias data to identify gender-relevant directions in sparse latent space, constructs biased direction per profession, and suppresses it during inference to achieve gender-balanced outputs.

Result: Extensive evaluations across Stable Diffusion 1.4, 1.5, 2.1, and SDXL show SAE Debias substantially reduces gender bias while preserving generation quality.

Conclusion: This is the first work using sparse autoencoders for gender bias intervention in T2I models, providing an interpretable, model-agnostic tool for building socially responsible generative AI.

Abstract: Text-to-image (T2I) diffusion models often exhibit gender bias, particularly by generating stereotypical associations between professions and gendered subjects. This paper presents SAE Debias, a lightweight and model-agnostic framework for mitigating such bias in T2I generation. Unlike prior approaches that rely on CLIP-based filtering or prompt engineering, which often require model-specific adjustments and offer limited control, SAE Debias operates directly within the feature space without retraining or architectural modifications. By leveraging a k-sparse autoencoder pre-trained on a gender bias dataset, the method identifies gender-relevant directions within the sparse latent space, capturing professional stereotypes. Specifically, a biased direction per profession is constructed from sparse latents and suppressed during inference to steer generations toward more gender-balanced outputs. Trained only once, the sparse autoencoder provides a reusable debiasing direction, offering effective control and interpretable insight into biased subspaces. Extensive evaluations across multiple T2I models, including Stable Diffusion 1.4, 1.5, 2.1, and SDXL, demonstrate that SAE Debias substantially reduces gender bias while preserving generation quality. To the best of our knowledge, this is the first work to apply sparse autoencoders for identifying and intervening in gender bias within T2I models. These findings contribute toward building socially responsible generative AI, providing an interpretable and model-agnostic tool to support fairness in text-to-image generation.
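
One simple way to suppress a direction in a feature vector is to subtract the feature's projection onto that direction. The sketch below shows this operation only. Whether SAE Debias applies exactly this projection is not stated in the summary, so the formula, the strength parameter, and the function name are assumptions, and the bias direction is taken as given rather than derived from sparse latents.

```python
def suppress_direction(feature, direction, strength=1.0):
    """Remove (a fraction of) the component of `feature` along `direction`.

    strength=1.0 removes the component fully; 0.0 leaves the feature unchanged.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(feature)
    coeff = sum(f * d for f, d in zip(feature, direction)) / norm_sq
    return [f - strength * coeff * d for f, d in zip(feature, direction)]

# Hypothetical 2-D feature and bias direction.
feature = [3.0, 4.0]
bias_direction = [1.0, 0.0]
debiased = suppress_direction(feature, bias_direction)
half_debiased = suppress_direction(feature, bias_direction, strength=0.5)
```

After full suppression the result is orthogonal to the bias direction, while components along other directions are left untouched. This is why an intervention of this kind can leave the rest of the representation, and hence generation quality, largely intact.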

[367] HiCL: Hippocampal-Inspired Continual Learning

Kushal Kapoor, Wyatt Mackey, Yiannis Aloimonos, Xiaomin Lin

Main category: cs.LG

TL;DR: HiCL is a hippocampal-inspired dual-memory continual learning architecture that mitigates catastrophic forgetting using biologically-inspired modules for pattern separation, episodic memory, and task-specific processing with differentiable gating.

Details

Motivation: To address catastrophic forgetting in continual learning by drawing inspiration from the hippocampal circuitry, which naturally handles sequential learning and memory consolidation in biological systems.

Method: Uses grid-cell-like encoding, dentate gyrus-inspired sparse pattern separation, CA3-like autoassociative episodic memory, DG-gated mixture-of-experts for task routing, Elastic Weight Consolidation for cortical consolidation, and prioritized replay of stored patterns.

Result: Achieves near state-of-the-art results on standard continual learning benchmarks with reduced task interference and lower computational costs compared to existing methods.

Conclusion: The biologically grounded HiCL architecture effectively mitigates catastrophic forgetting through its dual-memory system and differentiable task-routing mechanism, providing an efficient and scalable solution for continual learning.

Abstract: We propose HiCL, a novel hippocampal-inspired dual-memory continual learning architecture designed to mitigate catastrophic forgetting by using elements inspired by the hippocampal circuitry. Our system encodes inputs through a grid-cell-like layer, followed by sparse pattern separation using a dentate gyrus-inspired module with top-k sparsity. Episodic memory traces are maintained in a CA3-like autoassociative memory. Task-specific processing is dynamically managed via a DG-gated mixture-of-experts mechanism, wherein inputs are routed to experts based on cosine similarity between their normalized sparse DG representations and learned task-specific DG prototypes computed through online exponential moving averages. This biologically grounded yet mathematically principled gating strategy enables differentiable, scalable task-routing without relying on a separate gating network, and enhances the model’s adaptability and efficiency in learning multiple sequential tasks. Cortical outputs are consolidated using Elastic Weight Consolidation weighted by inter-task similarity. Crucially, we incorporate prioritized replay of stored patterns to reinforce essential past experiences. Evaluations on standard continual learning benchmarks demonstrate the effectiveness of our architecture in reducing task interference, achieving near state-of-the-art results in continual learning tasks at lower computational costs. Our code is available at https://github.com/kushalk173-sc/HiCL.
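
The DG-gated routing described above can be sketched as three small steps: top-k sparsification, cosine similarity against per-task prototypes, and an exponential-moving-average prototype update. This is a simplified illustration, not the released HiCL code. The hard argmax (the paper describes a differentiable gate), the decay value, the toy vectors, and all function names are assumptions.

```python
import math

def top_k_sparsify(values, k):
    """Keep the k largest activations and zero out the rest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

def normalize(vec):
    """Scale to unit Euclidean norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def route(dg_code, prototypes):
    """Return the index of the best-matching expert and all similarities."""
    sims = [cosine(dg_code, p) for p in prototypes]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims

def ema_update(prototype, dg_code, decay=0.9):
    """Online exponential moving average toward the normalized DG code."""
    unit = normalize(dg_code)
    return [decay * p + (1.0 - decay) * c for p, c in zip(prototype, unit)]

# Hypothetical activations and two task prototypes.
code = top_k_sparsify([0.1, 0.9, 0.3, 0.7], k=2)
prototypes = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
expert, sims = route(code, prototypes)
updated = ema_update([0.0, 1.0], [1.0, 0.0], decay=0.5)
```

Replacing the argmax with a softmax over the similarities would give soft, differentiable expert weights without introducing a separate gating network.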

[368] Topology Aware Neural Interpolation of Scalar Fields

Mohamed Kissi, Keanu Sisouk, Joshua A. Levine, Julien Tierny

Main category: cs.LG

TL;DR: Neural network approach for topology-aware interpolation of time-varying scalar fields using persistence diagrams and keyframes to estimate missing data at non-keyframe time steps.

Details

Motivation: To address the challenge of interpolating missing scalar field data in time-varying sequences by leveraging topological information from persistence diagrams to improve reconstruction quality.

Method: Uses a neural architecture that learns the relationship between time values and scalar fields from keyframe examples, enhanced with topological losses that exploit input persistence diagrams for better geometrical and topological reconstruction.

Result: Experiments on 2D and 3D time-varying datasets show superior performance in both data fitting and topological accuracy compared to reference interpolation schemes, with instantaneous output generation at query time.

Conclusion: The proposed topology-aware neural interpolation method effectively reconstructs missing scalar field data while preserving topological features, outperforming traditional interpolation approaches.

Abstract: This paper presents a neural scheme for the topology-aware interpolation of time-varying scalar fields. Given a time-varying sequence of persistence diagrams, along with a sparse temporal sampling of the corresponding scalar fields, denoted as keyframes, our interpolation approach aims at “inverting” the non-keyframe diagrams to produce plausible estimations of the corresponding, missing data. For this, we rely on a neural architecture which learns the relation from a time value to the corresponding scalar field, based on the keyframe examples, and reliably extends this relation to the non-keyframe time steps. We show how augmenting this architecture with specific topological losses exploiting the input diagrams improves both the geometrical and topological reconstruction of the non-keyframe time steps. At query time, given an input time value for which an interpolation is desired, our approach instantaneously produces an output, via a single propagation of the time input through the network. Experiments interpolating 2D and 3D time-varying datasets show our approach’s superiority, both in terms of data and topological fitting, with regard to reference interpolation schemes. Our implementation is available at https://github.com/MohamedKISSI/Topology-Aware-Neural-Interpolation-of-Scalar-Fields.git.

[369] From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok

Main category: cs.LG

TL;DR: The paper investigates how hallucinations arise in transformer models through concept representations, showing that increased input uncertainty leads to activation of input-insensitive semantic features and hallucinated outputs.

Details

Motivation: As generative AI systems become more competent and democratized, understanding their failure modes like hallucinations is crucial for trust and adoption in high-stakes applications.

Method: Using sparse autoencoders to capture concept representations in pre-trained transformer models under controlled input uncertainty scenarios, including pure-noise inputs.

Result: Transformers activate more semantic concepts as input becomes unstructured, and for pure-noise inputs, they robustly trigger meaningful concepts. Hallucinations can be reliably predicted from concept patterns in layer activations.

Conclusion: These insights have immediate implications for AI alignment, safety, adversarial attack vulnerabilities, and automatic quantification of hallucination risk in transformer models.

Abstract: As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes now poses an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. This collection of insights on transformer internal processing mechanics has immediate consequences for aligning AI models with human values, AI safety, opening the attack surface for potential adversarial attacks, and providing a basis for automatic quantification of a model’s hallucination risk.

[370] Performance of Conformal Prediction in Capturing Aleatoric Uncertainty

Misgina Tsighe Hagos, Claes Lundström

Main category: cs.LG

TL;DR: Conformal prediction’s ability to capture aleatoric uncertainty through prediction set size is not well validated. This study shows weak correlation between prediction set sizes and human annotations across multiple datasets and models.

Details

Motivation: To investigate whether conformal predictors effectively quantify aleatoric uncertainty (inherent dataset ambiguity from overlapping classes) by comparing prediction set sizes with human annotations.

Method: Used three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets with multiple human annotations per instance (5-50 annotators). Measured correlation between prediction set sizes and number of distinct human labels, and similarity between prediction sets and human annotations.

Result: Vast majority of conformal prediction outputs showed very weak to weak correlation with human annotations, with only a few showing moderate correlation. Prediction sets provide higher coverage of true classes but poorly capture aleatoric uncertainty.

Conclusion: Conformal predictors need critical reassessment - while they ensure coverage of true classes, their capability to capture aleatoric uncertainty and align with human annotations remains limited.

Abstract: Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with a high probability. Although its prediction set size is expected to capture aleatoric uncertainty, there is a lack of evidence regarding its effectiveness. The literature presents that prediction set size can upper-bound aleatoric uncertainty or that prediction sets are larger for difficult instances and smaller for easy ones, but a validation of this attribute of conformal predictors is missing. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We perform this by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. The datasets contain annotations from multiple human annotators (ranging from five to fifty participants) per instance, enabling the identification of class overlap. We show that the vast majority of the conformal prediction outputs show a very weak to weak correlation with human annotations, with only a few showing moderate correlation. These findings underscore the necessity of critically reassessing the prediction sets generated using conformal predictors. While they can provide a higher coverage of the true classes, their capability in capturing aleatoric uncertainty and generating sets that align with human annotations remains limited.
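
The evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes a standard split-conformal scheme with nonconformity score 1 - p(true class), which may differ from the three approaches used in the paper, and every function name and toy number here is hypothetical. Spearman rank correlation is one plausible choice of correlation measure.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from calibration scores 1 - p(true class)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank; clipping to n is a simplification
    # (strictly, the threshold is infinite when the rank exceeds n).
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1]

def prediction_set(probs, qhat):
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, i.e. Pearson correlation of the ranks.
    Undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Toy data (hypothetical): softmax outputs and labels for calibration,
# then test instances with the number of distinct human labels each received.
cal_probs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 0]
test_probs = [[0.5, 0.45, 0.05], [0.95, 0.03, 0.02], [0.45, 0.35, 0.2]]
distinct_human_labels = [3, 1, 2]

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
sets = [prediction_set(p, qhat) for p in test_probs]
rho = spearman([len(s) for s in sets], distinct_human_labels)
```

A value of rho near 1 would mean larger prediction sets coincide with instances on which annotators disagree more; the paper reports that such agreement is mostly very weak to weak.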

[371] A neural recommender system leveraging transfer learning for property prediction of ionic liquids

Sahil Sethi, Kai Sundmacher, Caroline Ganzer

Main category: cs.LG

TL;DR: A transfer learning framework with neural recommender system enables reliable prediction of ionic liquid properties using sparse experimental data, achieving improved performance for most target properties.

Details

Motivation: Accurately predicting thermophysical properties of ionic liquids is challenging due to vast chemical design space and limited experimental data availability.

Method: Two-stage process: pre-training NRS models on COSMO-RS simulated data at fixed conditions, then fine-tuning feedforward neural networks with experimental data at varying temperatures and pressures.

Result: The framework supports both within-property and cross-property knowledge transfer, substantially improving performance for four of the five target properties (density, viscosity, surface tension, heat capacity, and melting point).

Conclusion: Combining simulated data and transfer learning effectively overcomes experimental data sparsity, enabling scalable property prediction for over 700,000 IL combinations.

Abstract: Ionic liquids (ILs) have emerged as versatile replacements for traditional solvents because their physicochemical properties can be precisely tailored to various applications. However, accurately predicting key thermophysical properties remains challenging due to the vast chemical design space and the limited availability of experimental data. In this study, we present a data-driven transfer learning framework combined with a neural recommender system (NRS) to enable reliable property prediction for ILs using sparse experimental datasets. The approach involves a two-stage process: first, pre-training NRS models on COSMO-RS-based simulated data at fixed temperature and pressure, and second, fine-tuning simple feedforward neural networks with experimental data at varying temperatures and pressures. In this work, five essential IL properties are considered: density, viscosity, surface tension, heat capacity, and melting point. We find that the framework supports both within-property and cross-property knowledge transfer. Notably, pre-trained models for density, viscosity, and heat capacity are used to fine-tune models for all five target properties, achieving improved performance by a substantial margin for four of them. The model exhibits robust extrapolation to previously unseen ILs. Moreover, the final trained models enable property prediction for over 700,000 IL combinations, offering a scalable solution for IL screening in process design. This work highlights the effectiveness of combining simulated data and transfer learning to overcome sparsity in the experimental data.

[372] Holographic Knowledge Manifolds: A Novel Pipeline for Continual Learning Without Catastrophic Forgetting in Large Language Models

Justin Arndt

Main category: cs.LG

TL;DR: HKM is a four-phase pipeline that achieves zero catastrophic forgetting with minimal memory growth, using fractal quantization and holographic integration to compress knowledge by 3x while maintaining 100% integration efficiency.

Details

Motivation: To solve catastrophic forgetting in AI knowledge representation while minimizing memory growth and improving efficiency, enabling "eternal" adaptation of large language models without retraining.

Method: Four-phase pipeline leveraging fractal quantization, probabilistic entanglement, and dynamic diffraction chipping to compress knowledge substrates and integrate holographically.

Result: 0% catastrophic forgetting (infinite improvement over baselines), 3x compression with 67% storage savings, 53% training time reduction, and support for over 1,020 updates with only 1% growth per increment.

Conclusion: HKM enables paradigm shift for LLMs with potential for 60-80% fine-tuning cost reduction, projecting $92.4M savings over 5 years at petabyte scale with significant energy and carbon footprint reductions.

Abstract: We introduce the Holographic Knowledge Manifold (HKM), a four-phase pipeline that achieves zero catastrophic forgetting in AI knowledge representation while maintaining minimal memory growth and high efficiency. Leveraging fractal quantization, probabilistic entanglement, and dynamic diffraction chipping, HKM compresses knowledge substrates by 3x with 67% storage savings, integrates holographically at 100%, and supports over 1,020 updates with 1% growth per increment. In experiments on combined WikiText and FB15k datasets (scaled to 2,997 nodes), we demonstrate industry-leading performance: 0% forgetting (infinite improvement over GEM baselines), 3x compression, and 53% training time reduction on consumer GPU hardware. Hypothetical cost analyses project $92.4M savings over 5 years at petabyte scale, with 21.2% energy reduction and 33% lower carbon footprint. This work hypothesizes a paradigm shift for public large language models (LLMs), enabling “eternal” adaptation without retraining. Future extensions to multimodal fusion and quantum hardware could further democratize scalable AI, potentially reducing fine-tuning costs by 60-80% for models like Llama-3 or Grok-4. Code, datasets, and full results are publicly available for reproducibility.

[373] Reinforced Generation of Combinatorial Structures: Applications to Complexity Theory

Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta

Main category: cs.LG

TL;DR: AI methods (AlphaEvolve) help advance complexity theory by improving bounds on certification algorithms, obtaining new inapproximability results for MAX-CUT problems and metric TSP, and evolving faster verification procedures.

Details

Motivation: To explore whether AI-based methods can advance complexity theory by obtaining new theoretical results through automated discovery and optimization.

Method: Used AlphaEvolve (LLM code mutation agent) to: 1) construct nearly extremal Ramanujan graphs for improved bounds, 2) discover new gadget reductions for inapproximability proofs, 3) evolve faster verification procedures for candidate constructions.

Result: Improved bounds on MAX-CUT/Independent Set certification; new inapproximability factors: MAX-4-CUT (0.987), MAX-3-CUT (0.9649), metric TSP (111/110); achieved 10,000x speedup in verification.

Conclusion: AI tools can significantly strengthen gadget-based proofs in complexity theory, suggesting broader potential for AI-assisted theoretical advances.

Abstract: Can AI-based methods help us make advances in complexity theory? We provide evidence towards answering this in the affirmative, using AlphaEvolve (an LLM code mutation agent) to obtain new results in three settings: a) We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as $163$ vertices, and our upper bounds are obtained via analytical arguments. b) We obtain new inapproximability results for MAX-4-CUT and MAX-3-CUT, proving that it is NP-hard to approximate them within factors of $0.987$ and $0.9649$ respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of $0.9883$, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of $0.9853$, but falls short of the SOTA of $16/17$ that relies on a custom PCP (rather than a reduction from “standard” Håstad-style PCPs). c) Inapproximability for the metric Traveling Salesman Problem (TSP): We show that it is NP-hard to approximate the minimum cost tour within a factor of $111/110$ using AlphaEvolve to discover a new gadget, thus improving the SOTA of $117/116$. Along the way, we provide new modular soundness and completeness arguments that can be of independent interest. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (sometimes requiring time exponential in the size of the construction). We used AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by $10,000\times$ for our gadgets). Our results suggest that gadget-based proofs would benefit from a pass through AI-based tools to obtain stronger results.
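
The direction of each comparison in the abstract is easy to misread, so the short check below works through the arithmetic with exact fractions. It relies only on the numbers quoted above and on the usual convention that, for a maximization problem, a smaller hardness factor is a stronger result, while for a minimization problem such as TSP a larger factor is stronger. The helper names are illustrative.

```python
from fractions import Fraction

def stronger_for_max(new, old):
    """Maximization: hardness of approximating within a smaller factor is stronger."""
    return new < old

def stronger_for_min(new, old):
    """Minimization: hardness of approximating within a larger factor is stronger."""
    return new > old

# MAX-4-CUT: 0.987 versus the previous 0.9883.
max4cut_improves = stronger_for_max(Fraction("0.987"), Fraction("0.9883"))

# MAX-3-CUT: 0.9649 versus the best gadget-based 0.9853 and the PCP-based 16/17.
max3cut_beats_gadget = stronger_for_max(Fraction("0.9649"), Fraction("0.9853"))
max3cut_beats_pcp = stronger_for_max(Fraction("0.9649"), Fraction(16, 17))

# Metric TSP: 111/110 versus the previous 117/116.
tsp_improves = stronger_for_min(Fraction(111, 110), Fraction(117, 116))
```

Here 16/17 is approximately 0.9412, which is below 0.9649, so the custom-PCP bound remains the stronger MAX-3-CUT result, as the abstract states. For TSP, 111/110 is approximately 1.00909, which exceeds 117/116 at approximately 1.00862.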

[374] Splines-Based Feature Importance in Kolmogorov-Arnold Networks: A Framework for Supervised Tabular Data Dimensionality Reduction

Ange-Clément Akazan, Verlon Roel Mbingui

Main category: cs.LG

TL;DR: KAN-based feature selection methods are competitive with and sometimes superior to traditional methods like LASSO and Random Forest, providing robust performance in both classification and regression tasks while capturing nonlinear feature interactions.

Details

Motivation: Feature selection is crucial for tabular prediction problems to handle redundant, noisy, or weakly informative variables. KANs offer natural per-feature importance scores that can improve selection beyond traditional linear or tree-based methods.

Method: Developed four KAN-based selection criteria (coefficient norms, gradient-based saliency, knockout scores) and compared them with LASSO, Random Forest feature importance, Mutual Information, and SVM-RFE on real and synthetic datasets using F1 and R² scores at 20%, 40%, 60% feature retention levels.

Result: KAN-based selectors are generally competitive with classical baselines, often matching or exceeding existing methods in multi-class classification and providing robust performance in noisy regression datasets. They effectively remove redundant features and capture nonlinear interactions.

Conclusion: KAN-based feature selection provides a powerful and interpretable alternative to traditional methods, capable of uncovering nonlinear and multivariate feature relevance beyond sparsity or impurity-based measures, with reproducible feature subsets and avoidance of unnecessary correlation inflation.

Abstract: Feature selection is a key step in many tabular prediction problems, where multiple candidate variables may be redundant, noisy, or weakly informative. We investigate feature selection based on Kolmogorov-Arnold Networks (KANs), which parameterize feature transformations with splines and expose per-feature importance scores in a natural way. From this idea we derive four KAN-based selection criteria (coefficient norms, gradient-based saliency, and knockout scores) and compare them with standard methods such as LASSO, Random Forest feature importance, Mutual Information, and SVM-RFE on a suite of real and synthetic classification and regression datasets. Using average F1 and $R^2$ scores across three feature-retention levels (20%, 40%, 60%), we find that KAN-based selectors are generally competitive with, and sometimes superior to, classical baselines. In classification, KAN criteria often match or exceed existing methods on multi-class tasks by removing redundant features and capturing nonlinear interactions. In regression, KAN-based scores provide robust performance on noisy and heterogeneous datasets, closely tracking strong ensemble predictors; we also observe characteristic failure modes, such as overly aggressive pruning with an $\ell_1$ criterion. Stability and redundancy analyses further show that KAN-based selectors yield reproducible feature subsets across folds while avoiding unnecessary correlation inflation, ensuring reliable and non-redundant variable selection. Overall, our findings demonstrate that KAN-based feature selection provides a powerful and interpretable alternative to traditional methods, capable of uncovering nonlinear and multivariate feature relevance beyond sparsity or impurity-based measures.

[375] A Small Math Model: Recasting Strategy Choice Theory in an LLM-Inspired Architecture

Roussel Rahman, Jeff Shrager

Main category: cs.LG

TL;DR: Recasting Strategy Choice Theory as a Small Math Model using neural network architecture to study children’s arithmetic learning and strategy development.

Details

Motivation: To extend Strategy Choice Theory using modern neural network approaches to better understand how children develop arithmetic skills and strategy choices.

Method: Developed a Small Math Model (SMM) with neural-network-based architecture incorporating counting practice, number embedding, and gated attention mechanisms.

Result: The model demonstrates constructive/destructive interference between counting and addition, and shows wave-like finger-counting patterns as recall improves.

Conclusion: The SMM provides a unified platform to investigate numerical understanding and mathematical reasoning, with plans to extend to adaptive strategy choice and discovery in LLM-based agents.

Abstract: Strategy Choice Theory (SCT; Siegler and Shrager, 1984; Siegler, 2000) explains important aspects of children’s arithmetic learning based upon principles including learning from developmentally naturalistic data, probabilistic representation, confidence-based retrieval, and the phase-like importance of scaffolding strategies, such as finger-counting. Here we recast SCT as a “Small Math Model” (SMM), employing a neural-network-based architecture analogous to LLMs. The SMM extends SCT to include counting practice, symbol (number) embedding, and gated attention. Similar to earlier work, the SMM demonstrates constructive and destructive interference between counting and addition, and the “wave-like” use of finger-counting as sum recall improves. We plan to extend the SMM to later aspects of the decades-long SCT program, including adaptive strategy choice and eventually strategy discovery, providing a unified platform to investigate the understanding of numerical characteristics and relationships essential for mathematical reasoning – as it can emerge in LLM-based agents.

[376] ResCP: Reservoir Conformal Prediction for Time Series Forecasting

Roberto Neglia, Andrea Cini, Michael M. Bronstein, Filippo Maria Bianchi

Main category: cs.LG

TL;DR: Reservoir Conformal Prediction (ResCP) is a training-free conformal prediction method for time series that uses reservoir computing to dynamically reweight conformity scores based on temporal similarity, achieving asymptotic conditional coverage without expensive retraining.

Details

Motivation: Existing conformal prediction methods for sequential data require complex models that can fail with small sample sizes and need expensive retraining when data distributions change, creating limitations for practical time series applications.

Method: Leverages reservoir computing to compute similarity scores among reservoir states and uses them to adaptively reweight observed residuals, enabling modeling of local temporal dynamics without compromising computational scalability.

Result: Proves that ResCP achieves asymptotic conditional coverage under reasonable assumptions and empirically demonstrates effectiveness across diverse forecasting tasks.

Conclusion: ResCP provides a computationally scalable, training-free approach to conformal prediction for time series that handles temporal dependencies effectively without the need for complex model fitting or expensive retraining.

Abstract: Conformal prediction offers a powerful framework for building distribution-free prediction intervals for exchangeable data. Existing methods that extend conformal prediction to sequential data rely on fitting a relatively complex model to capture temporal dependencies. However, these methods can fail if the sample size is small and often require expensive retraining when the underlying data distribution changes. To overcome these limitations, we propose Reservoir Conformal Prediction (ResCP), a novel training-free conformal prediction method for time series. Our approach leverages the efficiency and representation learning capabilities of reservoir computing to dynamically reweight conformity scores. In particular, we compute similarity scores among reservoir states and use them to adaptively reweight the observed residuals at each step. With this approach, ResCP enables us to account for local temporal dynamics when modeling the error distribution without compromising computational scalability. We prove that, under reasonable assumptions, ResCP achieves asymptotic conditional coverage, and we empirically demonstrate its effectiveness across diverse forecasting tasks.
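
The reweighting idea can be illustrated with a small sketch. This is a minimal version under stated assumptions, not the ResCP implementation: the reservoir scaling, the softmax-over-cosine-similarity kernel, the temperature, and all function names are hypothetical choices made only to show how similarity among reservoir states can turn past residuals into a weighted quantile.

```python
import math
import random

def run_reservoir(series, n_units=20, seed=0):
    """Drive a fixed random (untrained) reservoir with a scalar series."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1, 1) for _ in range(n_units)]
    # Small recurrent weights keep the state update contractive (assumed scaling).
    w = [[rng.uniform(-1, 1) * 0.5 / n_units for _ in range(n_units)]
         for _ in range(n_units)]
    h, states = [0.0] * n_units, []
    for x in series:
        h = [math.tanh(w_in[i] * x + sum(w[i][j] * h[j] for j in range(n_units)))
             for i in range(n_units)]
        states.append(h)
    return states

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def similarity_weights(past_states, query_state, temperature=0.1):
    """Softmax over cosine similarities (the kernel choice is an assumption)."""
    sims = [cosine(s, query_state) / temperature for s in past_states]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_quantile(values, weights, level):
    """Smallest value whose cumulative normalized weight reaches `level`."""
    pairs = sorted(zip(values, weights))
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= level:
            return value
    return pairs[-1][0]

def prediction_interval(forecast, residuals, past_states, query_state, alpha=0.1):
    """Symmetric interval from similarity-weighted absolute residuals."""
    weights = similarity_weights(past_states, query_state)
    q = weighted_quantile([abs(r) for r in residuals], weights, 1 - alpha)
    return forecast - q, forecast + q
```

In use, `past_states[i]` would be the reservoir state at the step where `residuals[i]` was observed, and `query_state` the state at the step being forecast. Nothing is trained: the reservoir weights stay fixed, which is what makes the approach training-free.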

[377] Bootstrap Off-policy with World Model

Guojian Zhan, Likun Wang, Xiangteng Zhang, Jiaxin Gao, Masayoshi Tomizuka, Shengbo Eben Li

Main category: cs.LG

TL;DR: BOOM integrates planning and off-policy learning through a bootstrap loop with a world model, achieving state-of-the-art results on high-dimensional control tasks.

Details

Motivation: Address the divergence between collected data and actual policy behaviors when using planning for environment interaction, which degrades model learning and policy improvement.

Method: Uses a bootstrap loop where policy initializes planner and planner refines actions to bootstrap policy through behavior alignment, supported by a jointly learned world model. Core components include likelihood-free alignment loss and soft value-weighted mechanism.

Result: Achieves state-of-the-art results in both training stability and final performance on DeepMind Control Suite and Humanoid-Bench.

Conclusion: BOOM effectively integrates planning and off-policy learning through a tight bootstrap loop with world model support, solving the data-policy divergence problem in online planning.

Abstract: Online planning has proven effective in reinforcement learning (RL) for improving sample efficiency and final performance. However, using planning for environment interaction inevitably introduces a divergence between the collected data and the policy’s actual behaviors, degrading both model learning and policy improvement. To address this, we propose BOOM (Bootstrap Off-policy with WOrld Model), a framework that tightly integrates planning and off-policy learning through a bootstrap loop: the policy initializes the planner, and the planner refines actions to bootstrap the policy through behavior alignment. This loop is supported by a jointly learned world model, which enables the planner to simulate future trajectories and provides value targets to facilitate policy improvement. The core of BOOM is a likelihood-free alignment loss that bootstraps the policy using the planner’s non-parametric action distribution, combined with a soft value-weighted mechanism that prioritizes high-return behaviors and mitigates variability in the planner’s action quality within the replay buffer. Experiments on the high-dimensional DeepMind Control Suite and Humanoid-Bench show that BOOM achieves state-of-the-art results in both training stability and final performance. The code is accessible at https://github.com/molumitu/BOOM_MBRL.

[378] Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration

Yan Sun, Jia Guo, Stanley Kok, Zihao Wang, Zujie Wen, Zhiqiang Zhang

Main category: cs.LG

TL;DR: PREPO improves RLVR data efficiency using prompt perplexity for adaptive learning and relative entropy differentiation for exploration prioritization, achieving competitive results with 3x fewer rollouts.

Details

Motivation: Current RLVR training is computationally expensive as many rollouts contribute little to optimization. This work aims to leverage intrinsic data properties to improve data efficiency at minimal cost.

Method: PREPO has two components: 1) Using prompt perplexity as an adaptability indicator to progress from well-understood to challenging contexts, 2) Differentiating relative entropy to amplify rollout discrepancies and prioritize exploratory sequences.

Result: On Qwen and Llama models, PREPO achieves effective results on mathematical reasoning benchmarks with up to 3 times fewer rollouts than baselines while preserving competitive performance.

Conclusion: The method successfully reduces rollout demand in RLVR through adaptive learning and exploration prioritization, with theoretical and empirical validation of improved data efficiency.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly because many rollouts contribute little to optimization relative to the computation they require. This study investigates how simply leveraging intrinsic data properties, which come at almost no extra cost during training, can improve data efficiency for RLVR. We propose PREPO with two complementary components. First, we adopt prompt perplexity as an indicator of model adaptability in learning, enabling the model to progress from well-understood contexts to more challenging ones. Second, we amplify the discrepancy among the rollouts by differentiating their relative entropy, and prioritize sequences that exhibit a higher degree of exploration. Together, these mechanisms reduce rollout demand while preserving competitive performance. On the Qwen and Llama models, PREPO achieves effective results on mathematical reasoning benchmarks with up to 3 times fewer rollouts than the baselines. Beyond empirical gains, we provide theoretical and in-depth analyses explaining the underlying rationale of our method to improve the data efficiency of RLVR.
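
The first component, ordering prompts by perplexity, can be sketched as follows. This is a minimal illustration under assumptions, not PREPO itself: it uses the standard definition of perplexity as the exponential of the negative mean token log-probability, assumes the per-token log-probabilities of each prompt under the current model are already available, and uses hypothetical function names and prompt ids. How PREPO actually schedules prompts may be more involved than a single sort.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean log-probability of the prompt tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def order_by_perplexity(prompt_logprobs):
    """Prompt ids sorted from lowest perplexity (well understood) to highest."""
    return sorted(prompt_logprobs, key=lambda pid: perplexity(prompt_logprobs[pid]))

# Hypothetical per-token log-probabilities for three prompts.
prompt_logprobs = {
    "a": [-0.1, -0.2],
    "b": [-2.0, -1.0],
    "c": [-0.5, -0.5],
}
curriculum = order_by_perplexity(prompt_logprobs)
```

The resulting order places the prompts the model already finds predictable first and the most surprising ones last, matching the progression from well-understood to more challenging contexts described above.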

[379] Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization

Zhicheng Wang, Chen Ju, Xu Chen, Shuai Xiao, Jinsong Lan, Xiaoyong Zhu, Ying Chen, Zhiguo Cao

Main category: cs.LG

TL;DR: PDF introduces a parallel decoupling framework for multimodal embedding learning that generates multiple parallel embeddings from a single input using MLLMs with learnable prefixes, achieving significant performance gains with minimal computational overhead.

Details

Motivation: Current multimodal embedding models follow the SSC paradigm (single input, singular embedding, contrastive supervision), which collapses rich multimodal inputs into monolithic embeddings and fails to fully exploit MLLM capabilities.

Method: PDF conditions a shared MLLM backbone on distinct learnable prefixes to create multiple parallel paths, uses Mutual Information Minimization to ensure diversity, and applies per-path contrastive supervision for semantic alignment.

Result: Significant performance gains across various model sizes: +8.9% for VLM2Vec-LLaVA-1.6-LR (7B), +4.2% for VLM2Vec-Qwen2VL (2B), +3.1% for VLM2Vec-Qwen2VL (7B). The 2B model outperforms its baseline by +2.6% using only half the computational budget.

Conclusion: PDF enables robust semantic coverage and generalizable embedding spaces through parallel embedding generation, achieving state-of-the-art performance with minimal computational overhead.

Abstract: Embedding models are a cornerstone of modern AI. Driven by Multimodal Large Language Models (MLLMs), they have made great progress in architecture and data curation, while the holistic paradigm is still limited to SSC, i.e., single input, singular embedding, contrastive supervision, which collapses rich, multifaceted inputs into monolithic embeddings and fails to fully exploit MLLM capabilities. In this paper, we tailor a Parallel Decoupling Framework (PDF) for multimodal embedding learning by utilizing the proprietary steerability of MLLMs, i.e., their ability to flexibly generate quite differentiated responses under explicit instructions. Concretely, PDF conditions a shared MLLM backbone on distinct, learnable prefixes to roll out multiple parallel paths for one input, then relies on these paths to obtain parallel embeddings. To promote full parallel diversity, we employ Mutual Information Minimization (MIM) as an explicit constraint, coupled with per-path contrastive supervision to maintain semantic alignment. These dual objectives force PDF to yield robust semantic coverage and a generalizable embedding space. Ultimately, the resulting embedding space is accessible at inference via a single forward pass, incurring negligible computational overhead. We instantiate PDF on multiple MLLM backbones and demonstrate its effectiveness on the MMEB benchmark. Significant gains are consistently achieved across various resolutions and model sizes, e.g., boosting the VLM2Vec-LLaVA-1.6-LR model by +8.9% (7B), and the VLM2Vec-Qwen2VL models by +4.2% (2B) and +3.1% (7B). In terms of efficiency, our 2B model surpasses its baseline by +2.6% using only half the computational budget.

[380] Adaptive and Robust Data Poisoning Detection and Sanitization in Wearable IoT Systems using Large Language Models

W. K. M Mithsara, Ning Yang, Ahmed Imteaj, Hussein Zangoti, Abdur R. Shahid

Main category: cs.LG

TL;DR: A framework using large language models (LLMs) for detecting and sanitizing data poisoning attacks in human activity recognition systems for wearable IoT devices, employing zero-shot/one-shot/few-shot learning with role-play prompting and step-by-step reasoning.

Details

Motivation: Wearable IoT systems are vulnerable to data poisoning attacks that compromise data integrity and reliability. Traditional defense methods require extensive labeled datasets and lack adaptability in dynamic IoT environments.

Method: Uses LLMs with role-play prompting (LLM acts as expert) and think step-by-step reasoning to detect poisoning indicators in sensor data and generate clean alternatives, employing zero-shot, one-shot, and few-shot learning paradigms.

Result: Extensive evaluation shows effective poisoning detection accuracy, high-quality data sanitization, acceptable latency, and reasonable communication costs, demonstrating practical applicability.

Conclusion: LLMs provide robust, adaptable defense mechanisms against data poisoning in HAR systems without requiring extensive labeled datasets, improving security and reliability of wearable IoT systems.

Abstract: The widespread integration of wearable sensing devices in Internet of Things (IoT) ecosystems, particularly in healthcare, smart homes, and industrial applications, has necessitated robust human activity recognition (HAR) techniques to improve functionality and user experience. Although machine learning models have advanced HAR, they are increasingly susceptible to data poisoning attacks that compromise the data integrity and reliability of these systems. Conventional approaches to defending against such attacks often require extensive task-specific training with large, labeled datasets, which limits adaptability in dynamic IoT environments. This work proposes a novel framework that uses large language models (LLMs) to perform poisoning detection and sanitization in HAR systems, utilizing zero-shot, one-shot, and few-shot learning paradigms. Our approach incorporates "role play" prompting, whereby the LLM assumes the role of an expert to contextualize and evaluate sensor anomalies, and "think step-by-step" reasoning, guiding the LLM to infer poisoning indicators in the raw sensor data and plausible clean alternatives. These strategies minimize reliance on curation of extensive datasets and enable robust, adaptable defense mechanisms in real time. We perform an extensive evaluation of the framework, quantifying detection accuracy, sanitization quality, latency, and communication cost, thus demonstrating the practicality and effectiveness of LLMs in improving the security and reliability of wearable IoT systems.
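
To make the prompting strategy concrete, here is a minimal sketch of how a role-play, step-by-step prompt could be assembled for the zero-shot, one-shot, and few-shot settings. The prompt wording, the `build_prompt` helper, and the sensor-window format are illustrative assumptions, not the authors' actual prompts.

```python
def build_prompt(window, examples=()):
    """Assemble a detection/sanitization prompt (hypothetical wording).

    window:   list of (ax, ay, az) accelerometer readings (assumed format).
    examples: iterable of (readings, verdict) pairs; empty means zero-shot,
              one pair means one-shot, several pairs mean few-shot.
    """
    lines = [
        # Role play: the model is asked to act as a domain expert.
        "You are an expert in wearable-sensor data for human activity recognition.",
        "Decide whether the following sensor window has been poisoned.",
    ]
    # In-context demonstrations (none in the zero-shot case).
    for i, (readings, verdict) in enumerate(examples, start=1):
        lines.append(f"Example {i}: readings={readings} -> {verdict}")
    lines.append(f"Window to assess: readings={window}")
    # Step-by-step reasoning instruction plus the sanitization request.
    lines.append(
        "Think step by step, then answer 'clean' or 'poisoned'; "
        "if poisoned, propose plausible clean readings."
    )
    return "\n".join(lines)
```

Passing no examples yields a zero-shot prompt; adding labeled windows moves to the one-shot or few-shot setting without any retraining.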

[381] Value of Information-Enhanced Exploration in Bootstrapped DQN

Stergios Plataniotis, Charilaos Akasiadis, Georgios Chalkiadakis

Main category: cs.LG

TL;DR: Integrates expected value of information (EVOI) into Bootstrapped DQN to enhance deep exploration in sparse-reward environments, using network head opinion discrepancies to guide exploration without extra hyperparameters.

Details

Motivation: Traditional exploration methods like ε-greedy struggle with efficient exploration-exploitation balance in high-dimensional, sparse-reward environments.

Method: Developed two novel algorithms that incorporate EVOI into Bootstrapped DQN, using value of information estimates to measure network head opinion discrepancies and drive exploration toward high-potential areas.

Result: Experiments in complex, sparse-reward Atari games show increased performance, better uncertainty utilization, and improved exploration capabilities.

Conclusion: The EVOI-enhanced Bootstrapped DQN approach effectively improves deep exploration in sparse-reward environments while maintaining simplicity through no additional hyperparameters.

Abstract: Efficient exploration in deep reinforcement learning remains a fundamental challenge, especially in environments characterized by high-dimensional states and sparse rewards. Traditional exploration strategies that rely on random local policy noise, such as $ε$-greedy and Boltzmann exploration methods, often struggle to efficiently balance exploration and exploitation. In this paper, we integrate the notion of (expected) value of information (EVOI) within the well-known Bootstrapped DQN algorithmic framework, to enhance the algorithm’s deep exploration ability. Specifically, we develop two novel algorithms that incorporate the expected gain from learning the value of information into Bootstrapped DQN. Our methods use value of information estimates to measure the discrepancies of opinions among distinct network heads, and drive exploration towards areas with the most potential. We evaluate our algorithms with respect to performance and their ability to exploit inherent uncertainty arising from random network initialization. Our experiments in complex, sparse-reward Atari games demonstrate increased performance, all the while making better use of uncertainty, and, importantly, without introducing extra hyperparameters.
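
The abstract does not spell out how the value of information is computed. As a rough illustration only, the sketch below treats the Q-values of the bootstrap heads as samples and applies a myopic value-of-perfect-information gain in the style used in the Bayesian Q-learning literature. The function name `evoi_per_action` and the exact gain definition are assumptions and may differ from the paper's two algorithms.

```python
def evoi_per_action(head_q):
    """head_q[k][a]: Q-value of action a under bootstrap head k, for one state.

    Returns (mean_q, evoi), where evoi[a] is the average gain, over heads, from
    learning that a's value is what head k says (assumed definition).
    Requires at least two actions.
    """
    n_heads = len(head_q)
    n_actions = len(head_q[0])
    mean_q = [sum(h[a] for h in head_q) / n_heads for a in range(n_actions)]
    order = sorted(range(n_actions), key=lambda a: mean_q[a], reverse=True)
    best, second = order[0], order[1]

    evoi = []
    for a in range(n_actions):
        if a == best:
            # Gain arises if the current best turns out worse than the runner-up.
            gains = [max(mean_q[second] - h[a], 0.0) for h in head_q]
        else:
            # Gain arises if another action turns out better than the current best.
            gains = [max(h[a] - mean_q[best], 0.0) for h in head_q]
        evoi.append(sum(gains) / n_heads)
    return mean_q, evoi
```

An agent could then act greedily with respect to `mean_q[a] + evoi[a]`, which favors actions on which the heads disagree in a way that could change the decision, and introduces no tunable exploration coefficient.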

[382] A Weak Penalty Neural ODE for Learning Chaotic Dynamics from Noisy Time Series

Xuyang Li, John Harlim, Dibyajyoti Chakraborty, Romit Maulik

Main category: cs.LG

TL;DR: Proposes Weak-Penalty NODE (WP-NODE) - a hybrid training approach combining weak and strong formulations to improve forecasting accuracy and robustness in chaotic dynamical systems with noisy data.

Details

Motivation: Real-world measurements are often corrupted by noise, which severely degrades data-driven model performance, especially in chaotic systems where small errors amplify rapidly. Standard approaches struggle to achieve both short-term accuracy and long-term stability.

Method: Uses weak formulation as complementary approach to classical strong formulation in neural ODEs. Weak formulation constrains model using integrated residuals over temporal subdomains, and is employed as penalty alongside strong formulation-based learning.

Result: WP-NODE achieves state-of-the-art forecasting accuracy and exceptional robustness across benchmark chaotic dynamical systems and real-world climate datasets, outperforming standard approaches.

Conclusion: The weak formulation penalty approach significantly enhances neural ODE performance for noisy chaotic systems, providing superior forecasting accuracy and robustness compared to traditional strong formulation methods.

Abstract: Accurate forecasting of complex high-dimensional dynamical systems from observational data is essential for several applications across science and engineering. A key challenge, however, is that real-world measurements are often corrupted by noise, which severely degrades the performance of data-driven models. In particular, in chaotic dynamical systems, where small errors amplify rapidly, it is challenging to identify a data-driven model from noisy data that achieves short-term accuracy while preserving long-term invariant properties. In this paper, we propose the use of the weak formulation as a complementary approach to the classical strong formulation of data-driven time-series forecasting models. Specifically, we focus on the neural ordinary differential equation (NODE) architecture. Unlike the standard strong formulation, which relies on the discretization of the NODE followed by optimization, the weak formulation constrains the model using a set of integrated residuals over temporal subdomains. While such a formulation yields an effective NODE model, we discover that the performance of a NODE can be further enhanced by employing this weak formulation as a penalty alongside the classical strong formulation-based learning. Through numerical demonstrations, we illustrate that our proposed training strategy, which we call the Weak-Penalty NODE (WP-NODE), achieves state-of-the-art forecasting accuracy and exceptional robustness across benchmark chaotic dynamical systems and a real-world climate dataset.
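
The idea of an integrated residual can be illustrated on a scalar ODE dx/dt = f(x). Multiplying by a test function φ that vanishes at both ends of a time window and integrating by parts gives ∫ φ'(t) x(t) dt + ∫ φ(t) f(x(t)) dt = 0, so the data never have to be differentiated. The sketch below evaluates this residual with the trapezoid rule. The sin² test function, the single window, and the helper names are assumptions for illustration, not the paper's specific choices.

```python
import math

def trapezoid(ys, dt):
    """Composite trapezoid rule on a uniform grid."""
    return dt * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

def weak_residual(xs, ts, f):
    """Weak-form residual of dx/dt = f(x) over the window [ts[0], ts[-1]].

    Uses the test function phi(t) = sin^2(pi (t - a) / L), which vanishes at
    both window ends (an assumed choice). Returns
    int phi'(t) x(t) dt + int phi(t) f(x(t)) dt, which is zero for an exact model.
    """
    a, b = ts[0], ts[-1]
    dt = ts[1] - ts[0]
    length = b - a
    phi = [math.sin(math.pi * (t - a) / length) ** 2 for t in ts]
    dphi = [(math.pi / length) * math.sin(2 * math.pi * (t - a) / length) for t in ts]
    data_term = trapezoid([dp * x for dp, x in zip(dphi, xs)], dt)
    model_term = trapezoid([p * f(x) for p, x in zip(phi, xs)], dt)
    return data_term + model_term
```

In a weak-penalty setup of the kind the abstract describes, squared residuals like this one, summed over several windows and test functions, would be added to the usual trajectory-matching loss with a penalty weight.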

[383] Genomic Next-Token Predictors are In-Context Learners

Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi

Main category: cs.LG

TL;DR: Genomic models trained on next-nucleotide prediction exhibit emergent in-context learning similar to language models, showing that ICL arises from large-scale predictive training across different sequence domains.

Details

Motivation: To investigate whether in-context learning emerges organically in non-linguistic sequence domains through large-scale predictive training, challenging the notion that ICL is unique to human language.

Method: Developed controlled experimental framework with symbolic reasoning tasks in both linguistic and genomic forms, using the Evo2 genomic model trained on next-nucleotide prediction at scale comparable to mid-sized LLMs.

Result: Genomic models show log-linear gains in pattern induction with increasing in-context demonstrations, similar to linguistic models, demonstrating emergent ICL in genomic sequences.

Conclusion: ICL arises as a consequence of large-scale predictive modeling over rich data, extending emergent meta-learning beyond language to a modality-agnostic view.

Abstract: In-context learning (ICL) – the capacity of a model to infer and apply abstract patterns from examples provided within its input – has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.

[384] Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information

Antoine Ledent, Mun Chong Soo, Nong Minh Hieu

Main category: cs.LG

TL;DR: Studies matrix completion where the ground truth matrix and the sampling distribution are both low-rank and share a common subspace, combining abundant unlabeled implicit-feedback data with limited labeled explicit-feedback data to improve performance.

Details

Motivation: Inspired by recommender systems where abundant implicit feedback (clicks, purchases) and scarce explicit feedback (ratings) coexist, aiming to leverage both data types effectively.

Method: Combines the theory of low-rank subspace recovery with generalization bounds for matrix completion, using a large unlabeled sample (M draws) and a small labeled sample (N draws) from the same sampling distribution.

Result: Achieved error bounds scaling as Õ(√(nd/M)) and Õ(√(dr/N)), validated in synthetic experiments showing independent error terms for P and ground truth estimation. Outperformed baselines on Douban and MovieLens datasets.

Conclusion: The method effectively leverages both implicit and explicit feedback, providing valid theoretical framework for studying their interaction in recommender systems.

Abstract: We study a matrix completion problem where both the ground truth matrix $R$ and the unknown sampling distribution $P$ over observed entries are low-rank matrices, and share a common subspace. We assume that a large amount $M$ of unlabeled data drawn from the sampling distribution $P$ is available, together with a small amount $N$ of labeled data drawn from the same distribution and noisy estimates of the corresponding ground truth entries. This setting is inspired by recommender systems scenarios where the unlabeled data corresponds to ‘implicit feedback’ (consisting of interactions such as purchases, clicks, etc.) and the labeled data corresponds to ‘explicit feedback’, consisting of interactions where the user has given an explicit rating to the item. Leveraging powerful results from the theory of low-rank subspace recovery, together with classic generalization bounds for matrix completion models, we show error bounds consisting of a sum of two error terms scaling as $\widetilde{O}\left(\sqrt{\frac{nd}{M}}\right)$ and $\widetilde{O}\left(\sqrt{\frac{dr}{N}}\right)$ respectively, where $d$ is the rank of $P$ and $r$ is the rank of $R$. In synthetic experiments, we confirm that the true generalization error naturally splits into independent error terms corresponding to the estimations of $P$ and the ground truth matrix $R$ respectively. In real-life experiments on Douban and MovieLens with most explicit ratings removed, we demonstrate that the method can outperform baselines relying only on the explicit ratings, demonstrating that our assumptions provide a valid toy theoretical setting to study the interaction between explicit and implicit feedback in recommender systems.
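
A quick numerical reading of the two error terms can help show how the unlabeled and labeled sample sizes trade off. The sketch below evaluates both scalings with constants and logarithmic factors dropped. The helper name `bound_terms`, the interpretation of n as the matrix dimension, and the example sizes are assumptions for illustration.

```python
import math

def bound_terms(n, d, r, M, N):
    """Return the two error scalings, ignoring constants and log factors.

    n: matrix dimension (assumed meaning), d: rank of the sampling distribution,
    r: rank of the ground truth matrix, M: unlabeled samples, N: labeled samples.
    """
    subspace_term = math.sqrt(n * d / M)    # driven by the unlabeled data
    completion_term = math.sqrt(d * r / N)  # driven by the labeled data
    return subspace_term, completion_term
```

For instance, with n = 10,000, d = 10, r = 5, M = 10,000,000 and N = 10,000, the first term is 0.1 and the second is about 0.07, so under these assumed sizes neither data source dominates the error.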

[385] SCALEX: Scalable Concept and Latent Exploration for Diffusion Models

E. Zhixuan Zeng, Yuhao Chen, Alexander Wong

Main category: cs.LG

TL;DR: SCALEX is a framework for scalable, automated exploration of diffusion model latent spaces to analyze social biases using natural language prompts without retraining or labeling.

Details

Motivation: Existing bias analysis methods for diffusion models are limited to predefined categories or manual interpretation, restricting scalability and discovery of subtle patterns.

Method: Extracts semantically meaningful directions from H-space using natural language prompts for zero-shot interpretation, enabling systematic comparison across concepts.

Result: Detects gender bias in profession prompts, ranks semantic alignment across identity descriptors, and reveals clustered conceptual structure without supervision.

Conclusion: SCALEX makes bias analysis in diffusion models more scalable, interpretable, and extensible by directly linking prompts to latent directions.

Abstract: Image generation models frequently encode social biases, including stereotypes tied to gender, race, and profession. Existing methods for analyzing these biases in diffusion models either focus narrowly on predefined categories or depend on manual interpretation of latent directions. These constraints limit scalability and hinder the discovery of subtle or unanticipated patterns. We introduce SCALEX, a framework for scalable and automated exploration of diffusion model latent spaces. SCALEX extracts semantically meaningful directions from H-space using only natural language prompts, enabling zero-shot interpretation without retraining or labelling. This allows systematic comparison across arbitrary concepts and large-scale discovery of internal model associations. We show that SCALEX detects gender bias in profession prompts, ranks semantic alignment across identity descriptors, and reveals clustered conceptual structure without supervision. By linking prompts to latent directions directly, SCALEX makes bias analysis in diffusion models more scalable, interpretable, and extensible than prior approaches.

[386] Extending Test-Time Scaling: A 3D Perspective with Context, Batch, and Turn

Chao Yu, Qixin Tan, Jiaxuan Gao, Shi Yu, Hong Lu, Xinting Yang, Zelai Xu, Yu Wang, Yi Wu, Eugene Vinitsky

Main category: cs.LG

TL;DR: The paper introduces 3D test-time scaling, a unified framework that combines context, batch, and turn scaling to enhance reasoning performance beyond conventional test-time scaling limitations.

Details

Motivation: Test-time scaling in reasoning RL is limited by base models' context length, which is orders of magnitude smaller than the number of tokens consumed during training. The authors aim to overcome this limitation by exploring additional scaling dimensions.

Method: Proposed 3D test-time scaling framework integrating three dimensions: context-length scaling, batch scaling (parallel sampling), and turn scaling (iterative self-refinement).

Result: Each scaling dimension shows bounded capacity individually, but combining all three substantially improves reasoning performance on challenging testbeds (IOI, IMO, CPHO) and benefits from human preference feedback.

Conclusion: The 3D test-time scaling framework effectively extends reasoning capacity and naturally extends to open-ended domains like embodied learning for humanoid control behavior design.

Abstract: Reasoning reinforcement learning (RL) has recently revealed a new scaling effect: test-time scaling. Thinking models such as R1 and o1 improve their reasoning accuracy at test time as the length of the reasoning context increases. However, compared with training-time scaling, test-time scaling is fundamentally constrained by the limited context length of base models, which remains orders of magnitude smaller than the number of tokens consumed during training. We revisit test-time enhancement techniques through the lens of scaling effects and introduce a unified framework of multi-dimensional test-time scaling to extend the capacity of test-time reasoning. Beyond conventional context-length scaling, we consider two additional dimensions: batch scaling, where accuracy improves with parallel sampling, and turn scaling, where iterative self-refinement enhances reasoning quality. Building on this perspective, we propose 3D test-time scaling, which integrates context, batch, and turn scaling. We show that: (1) each dimension demonstrates a test-time scaling effect, but with a bounded capacity; (2) combining all three dimensions substantially improves reasoning performance on challenging testbeds, including IOI, IMO, and CPHO, and further benefits from human preference feedback; and (3) the human-in-the-loop framework naturally extends to a more open-ended domain, i.e., embodied learning, which enables the design of humanoid control behaviors.
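
One way to picture how the three dimensions combine is a loop over refinement turns, each of which draws a batch of parallel samples under a fixed context budget. The sketch below is a schematic under stated assumptions: `generate` is a hypothetical callable standing in for the model, and aggregating a batch by majority vote is an assumed choice that the abstract does not specify.

```python
from collections import Counter

def three_d_scaling(generate, problem, context_budget, batch_size, num_turns):
    """Combine context, batch, and turn scaling (schematic).

    generate(problem, previous_answer, context_budget, sample_id) -> answer
    is a hypothetical interface: it reasons within `context_budget` tokens and
    may refine `previous_answer` (None on the first turn).
    """
    answer = None
    for _turn in range(num_turns):  # turn scaling: iterative self-refinement
        candidates = [
            generate(problem, answer, context_budget, i)  # context scaling via the budget
            for i in range(batch_size)                    # batch scaling: parallel samples
        ]
        # Assumed aggregation: keep the most frequent candidate answer.
        answer = Counter(candidates).most_common(1)[0][0]
    return answer
```

The total number of model calls is `batch_size * num_turns`, each bounded by `context_budget`, which makes the compute spent along each dimension explicit.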

[387] GLOBE: Accurate and Generalizable PDE Surrogates using Domain-Inspired Architectures and Equivariances

Peter Sharpe

Main category: cs.LG

TL;DR: GLOBE is a neural surrogate for homogeneous PDEs that combines boundary-element methods with equivariant ML, achieving substantial accuracy improvements on AirFRANS dataset with compact architecture and discretization-invariant properties.

Details

Motivation: To create a more accurate and practical ML-based PDE surrogate for industrial CAE by incorporating rigorous physics-inspired inductive biases from boundary-element methods and equivariant ML.

Method: Represents solutions as superpositions of learnable Green’s-function-like kernels evaluated from boundary faces to targets, using multiscale branches and communication hyperlayers. Architecture is translation-, rotation-, and parity-equivariant with explicit far-field decay envelope and boundary-to-boundary hyperlayer communication.

Result: On AirFRANS dataset, achieves 200x lower MSE on all fields relative to baselines and 50x relative to next-best model. In scarce data setting, achieves 100x lower error on velocity/pressure and 600x lower on surface pressure than Transolver. Model is compact (117k parameters) and supports arbitrary point evaluation.

Conclusion: Rigorous physics- and domain-inspired inductive biases enable large gains in accuracy, generalizability, and practicality for ML-based PDE surrogates in industrial CAE applications.

Abstract: We introduce GLOBE, a new neural surrogate for homogeneous PDEs that draws inductive bias from boundary-element methods and equivariant ML. GLOBE represents solutions as superpositions of learnable Green’s-function-like kernels evaluated from boundary faces to targets, composed across multiscale branches and communication hyperlayers. The architecture is translation-, rotation-, and parity-equivariant; discretization-invariant in the fine-mesh limit; and units-invariant via rigorous nondimensionalization. An explicit far-field decay envelope stabilizes extrapolation, boundary-to-boundary hyperlayer communication mediates long-range coupling, and the all-to-all boundary-to-target evaluation yields a global receptive field that respects PDE information flow, even for elliptic PDEs. On AirFRANS (steady incompressible RANS over NACA airfoils), GLOBE achieves substantial accuracy improvements. On the “Full” split, it reduces mean-squared error by roughly 200x on all fields relative to the dataset’s reference baselines, and roughly 50x relative to the next-best-performing model. In the “Scarce” split, it achieves over 100x lower error on velocity and pressure fields and over 600x lower error on surface pressure than Transolver. Qualitative results show sharp near-wall gradients, coherent wakes, and limited errors under modest extrapolation in Reynolds number and angle of attack. In addition to this accuracy, the model is quite compact (117k parameters), and fields can be evaluated at arbitrary points during inference. We also demonstrate the ability to train and predict with non-watertight meshes, which has strong practical implications. These results show that rigorous physics- and domain-inspired inductive biases can achieve large gains in accuracy, generalizability, and practicality for ML-based PDE surrogates for industrial computer-aided engineering (CAE).

[388] GeoPTH: A Lightweight Approach to Category-Based Trajectory Retrieval via Geometric Prototype Trajectory Hashing

Yang Xu, Zuliang Yang, Kai Ming Ting

Main category: cs.LG

TL;DR: GeoPTH is a lightweight, non-learning framework for efficient trajectory similarity retrieval using geometric prototypes as hash functions, achieving competitive accuracy and superior efficiency compared to traditional and learning-based methods.

Details

Motivation: Existing trajectory similarity methods face computational expense (traditional metrics) or high training costs and instability (learning-based methods), requiring a more practical alternative.

Method: Constructs data-dependent hash functions using representative trajectory prototypes as anchors, mapping new trajectories to closest prototypes via Hausdorff metric for efficient hashing.

Result: GeoPTH achieves competitive retrieval accuracy with traditional metrics and state-of-the-art learning methods, significantly outperforms binary codes from learned embeddings, and consistently beats all competitors in efficiency.

Conclusion: The lightweight prototype-centric approach offers a practical and powerful alternative for trajectory retrieval, delivering exceptional performance and computational efficiency without learning requirements.

Abstract: Trajectory similarity retrieval is an important part of spatiotemporal data mining. However, existing methods have the following limitations: traditional metrics are computationally expensive, while learning-based methods suffer from substantial training costs and potential instability. This paper addresses these problems by proposing Geometric Prototype Trajectory Hashing (GeoPTH), a novel, lightweight, and non-learning framework for efficient category-based trajectory retrieval. GeoPTH constructs data-dependent hash functions by using representative trajectory prototypes, i.e., small point sets preserving geometric characteristics, as anchors. The hashing process is efficient: it maps a new trajectory to its closest prototype via a robust Hausdorff metric. Extensive experiments show that GeoPTH’s retrieval accuracy is highly competitive with both traditional metrics and state-of-the-art learning methods, and it significantly outperforms binary codes generated through simple binarization of the learned embeddings. Critically, GeoPTH consistently outperforms all competitors in terms of efficiency. Our work demonstrates that a lightweight, prototype-centric approach offers a practical and powerful alternative, achieving exceptional retrieval performance and computational efficiency.
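
The core hashing step, mapping a trajectory to its nearest prototype, can be sketched in a few lines. The example uses the classical symmetric Hausdorff distance between 2D point sets. The paper's "robust" variant and its prototype construction are not described in the abstract, so the prototypes and helper names here are placeholders.

```python
import math

def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Classical symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

def hash_trajectory(trajectory, prototypes):
    """Hash code = index of the closest prototype (assumed single-table scheme)."""
    return min(range(len(prototypes)),
               key=lambda i: hausdorff(trajectory, prototypes[i]))
```

Because prototypes are small point sets, each hash costs only a handful of distance evaluations per prototype, and no model training is involved.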

cs.MA

[389] Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

Kirill Nagaitsev, Luka Grbcic, Samuel Williams, Costin Iancu

Main category: cs.MA

TL;DR: Multi-agent systems for PyTorch GPU optimization achieve 2.88x speedup on H100 GPUs, with exploit-heavy strategies and error-fixing agents performing best.

Details

Motivation: Maximizing GPU performance for AI inference is challenging, and while multi-agent systems show promise for code optimization, their dynamics remain unexplored.

Method: Developed a logical framework for comparing multi-agent PyTorch optimization systems, evaluating different strategies and agent configurations.

Result: Achieved average 2.88x speedup on H100 GPU across diverse tasks in KernelBench benchmark suite, with exploit-heavy strategies performing best when paired with error-fixing agents.

Conclusion: Multi-agent systems can effectively optimize PyTorch code for GPU performance, with strategy selection and agent composition significantly impacting optimization outcomes.

Abstract: Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch.

[390] A segment anchoring-based balancing algorithm for agricultural multi-robot task allocation with energy constraints

Peng Chen, Jing Liang, Kang-Jia Qiao, Hui Song, Tian-lei Ma, Kun-Jie Yu, Cai-Tong Yue, Ponnuthurai Nagaratnam Suganthan, Witold Pedrycz

Main category: cs.MA

TL;DR: Proposes SABA algorithm for multi-robot harvesting scheduling with energy constraints, addressing complex interactions between payload, battery capacity, and charging disruptions through anchoring and rebalancing mechanisms.

Details

Motivation: Address efficiency and cost challenges in labor-intensive industries like smart farming, where planning harvesting schedules for electric robots is complex due to conflicting objectives (makespan vs transportation cost), payload constraints, and finite battery capacity with disruptive charging events.

Method: Segment anchoring-based balancing algorithm (SABA) with two synergistic mechanisms: sequential anchoring and balancing (using charging decisions as anchors to reconstruct disrupted routes) and proportional splitting-based rebalancing (fine-grained balancing and tuning of makespans).

Result: Extensive experiments on real-world case study and benchmark instances show SABA comprehensively outperforms 6 state-of-the-art algorithms in both solution convergence and diversity.

Conclusion: Provides novel theoretical perspective and effective solution for multi-robot task allocation under energy constraints, addressing complex cascading effects of charging disruptions on robot schedules.

Abstract: Multi-robot systems have emerged as a key technology for addressing the efficiency and cost challenges in labor-intensive industries. In the representative scenario of smart farming, planning efficient harvesting schedules for a fleet of electric robots presents a highly challenging frontier problem. The complexity arises not only from the need to find Pareto-optimal solutions for the conflicting objectives of makespan and transportation cost, but also from the necessity to simultaneously manage payload constraints and finite battery capacity. When robot loads are dynamically updated during planned multi-trip operations, a mandatory recharge triggered by energy constraints introduces an unscheduled load reset. This interaction creates a complex cascading effect that disrupts the entire schedule and renders traditional optimization methods ineffective. To address this challenge, this paper proposes the segment anchoring-based balancing algorithm (SABA). The core of SABA lies in the combination of two synergistic mechanisms: the sequential anchoring and balancing mechanism, which leverages charging decisions as ‘anchors’ to systematically reconstruct disrupted routes, and the proportional splitting-based rebalancing mechanism, which is responsible for the fine-grained balancing and tuning of the final solutions’ makespans. Extensive comparative experiments, conducted on a real-world case study and a suite of benchmark instances, demonstrate that SABA comprehensively outperforms 6 state-of-the-art algorithms in terms of both solution convergence and diversity. This research provides a novel theoretical perspective and an effective solution for the multi-robot task allocation problem under energy constraints.

[391] Area-Optimal Control Strategies for Heterogeneous Multi-Agent Pursuit

Kamal Mammadov, Damith C. Ranasinghe

Main category: cs.MA

TL;DR: Novel multi-agent pursuit-evasion strategy using geometric safe-reachable sets and gradient-based control laws for real-time capture of slower evader by faster pursuers.

Details

Motivation: To develop an efficient cooperative capture strategy for multiple faster pursuers against a single slower evader using geometric analysis and game theory.

Method: Define evader’s safe-reachable set as intersection of Apollonius circles, formulate as zero-sum game, derive analytical gradients of set area, and develop closed-form optimal control laws for agent headings.

Result: Gradient-based controls effectively shrink evader’s safe region leading to guaranteed capture, with computationally efficient real-time implementation.

Conclusion: Area-minimization approach provides clear geometric objective for cooperative capture with guaranteed performance and real-time feasibility.

Abstract: This paper presents a novel strategy for a multi-agent pursuit-evasion game involving multiple faster pursuers with heterogeneous speeds and a single slower evader. We define a geometric region, the evader’s safe-reachable set, as the intersection of Apollonius circles derived from each pursuer-evader pair. The capture strategy is formulated as a zero-sum game where the pursuers cooperatively minimize the area of this set, while the evader seeks to maximize it, effectively playing a game of spatial containment. By deriving the analytical gradients of the safe-reachable set’s area with respect to agent positions, we obtain closed-form, instantaneous optimal control laws for the heading of each agent. These strategies are computationally efficient, allowing for real-time implementation. Simulations demonstrate that the gradient-based controls effectively steer the pursuers to systematically shrink the evader’s safe region, leading to guaranteed capture. This area-minimization approach provides a clear geometric objective for cooperative capture.
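
The geometry behind the safe-reachable set can be made concrete. For an evader at E with speed v_e and a pursuer at P with speed v_p > v_e, the points the evader reaches no later than the pursuer satisfy |X - E| <= k |X - P| with k = v_e / v_p, which is a disc with center (E - k²P) / (1 - k²) and radius k |E - P| / (1 - k²). The sketch below computes that disc and tests membership in the intersection over several pursuers. The function names are illustrative, and the paper's area gradients and control laws are not reproduced here.

```python
import math

def apollonius_circle(evader, pursuer, v_e, v_p):
    """Center and radius of the disc of points the evader reaches first."""
    if not 0 < v_e < v_p:
        raise ValueError("requires 0 < v_e < v_p (slower evader)")
    k = v_e / v_p
    denom = 1.0 - k * k
    center = ((evader[0] - k * k * pursuer[0]) / denom,
              (evader[1] - k * k * pursuer[1]) / denom)
    radius = k * math.dist(evader, pursuer) / denom
    return center, radius

def in_safe_reachable_set(point, evader, v_e, pursuers):
    """True if `point` lies in every pursuer's Apollonius disc.

    pursuers: list of (position, speed) pairs, allowing heterogeneous speeds.
    """
    for position, v_p in pursuers:
        center, radius = apollonius_circle(evader, position, v_e, v_p)
        if math.dist(point, center) > radius:
            return False
    return True
```

With the evader at the origin (speed 1) and a pursuer at (3, 0) (speed 2), the disc has center (-1, 0) and radius 2, so the boundary point (1, 0) is reached by both agents at the same time.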

cs.MM

eess.AS

[392] Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation

Wei-Cheng Tseng, Xuanru Zhou, Mingyue Huo, Yiwen Shao, Hao Zhang, Dong Yu

Main category: eess.AS

TL;DR: Audio-language pretraining is underexplored compared to vision-language models. The paper introduces CaptionStew (10.7M captions), compares contrastive vs captioning objectives, and shows both yield competitive audio representations with complementary scaling properties.

Details

Motivation: Audio-language pretraining holds promise for general-purpose audio understanding but faces barriers: limited large-scale datasets, insufficient caption diversity, and lack of systematic evaluation compared to vision-language models.

Method: Created CaptionStew dataset (10.7M captions) aggregating diverse audio-text corpora. Conducted comprehensive evaluation comparing contrastive and captioning objectives across speech, music, and environmental sound tasks with systematic data-scaling experiments.

Result: Audio-language pretraining yields competitive, transferable representations. Contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved tasks. Supervised initialization provides diminishing returns at scale.

Conclusion: Audio-language pretraining is a viable pathway toward general-purpose audio representations. Findings guide future research, and the authors release data recipes, protocols, and models to accelerate progress toward universal audio understanding.

Abstract: Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval tasks with limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a 10.7M caption dataset aggregating diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. Through systematic data-scaling experiments, we reveal complementary objective strengths: contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved audio understanding tasks. We also find that common supervised initialization practices provide diminishing returns at scale, challenging current approaches. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations, guiding future research. To accelerate progress, we release data preparation recipes, training protocols, and pretrained models, paving the way toward universal audio understanding.
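For readers unfamiliar with the contrastive objective compared in this paper, the following is a minimal sketch of a CLIP-style symmetric InfoNCE loss over paired audio and text embeddings. It is a generic formulation, not the authors' training code; the function names and the temperature value of 0.07 are assumptions.

```python
import math

def _normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def _log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where audio_emb[i] is paired with text_emb[i]."""
    a = [_normalize(v) for v in audio_emb]
    t = [_normalize(v) for v in text_emb]
    n = len(a)
    # Cosine-similarity logits scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Audio-to-text: each row should put its mass on the diagonal entry.
    a2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio: the same over columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)
```

A captioning objective would instead train a text decoder with token-level cross-entropy conditioned on the audio encoder output; it is omitted here because it requires a full decoder.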

[393] Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

Main category: eess.AS

TL;DR: Omni-R1 achieves SOTA performance on MMAU and MMAR benchmarks by fine-tuning Qwen2.5-Omni with GRPO on audio QA data, with surprising findings about text-only training improving audio performance.

Details

Motivation: To improve multi-modal LLM performance on audio question answering benchmarks and understand the sources of performance gains.

Method: Fine-tuned Qwen2.5-Omni using GRPO reinforcement learning on audio question answering dataset, with ablation studies testing models with/without audio and text-only training.

Result: Achieved highest accuracies on sounds, music, speech, and overall categories on MMAU and MMAR benchmarks; discovered that much of GRPO improvement comes from better text reasoning and that text-only training can improve audio performance.

Conclusion: GRPO fine-tuning effectively improves multi-modal audio QA performance, with text reasoning playing a crucial role and text-only training surprisingly beneficial for audio tasks.

Abstract: We propose Omni-R1 which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This leads to new State-of-the-Art performance on the recent MMAU and MMAR benchmarks. Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, both on the Test-mini and Test-full splits. To understand the performance improvement, we tested models both with and without audio and found that much of the performance improvement from GRPO could be attributed to better text-based reasoning. We also made a surprising discovery that fine-tuning without audio on a text-only dataset was effective at improving the audio-based performance.
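The core of GRPO is a group-relative advantage: several answers are sampled for the same question, each is scored, and the scores are normalized within the group. The sketch below shows only that normalization step. The 0/1 correctness reward, the use of the population standard deviation, and the epsilon value are assumptions; the abstract does not specify the paper's reward or hyperparameters.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards of answers sampled for one prompt: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Hypothetical group of four sampled answers, rewarded 1.0 when the chosen option is correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every answer in a group receives the same reward, all advantages are zero and that group contributes no policy gradient.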

eess.IV

[394] Avoiding Quality Saturation in UGC Compression Using Denoised References

Xin Xiong, Samuel Fernández-Menduiña, Eduardo Pavez, Antonio Ortega, Neil Birkbeck, Balu Adsumilli

Main category: eess.IV

TL;DR: Proposes efficient methods to detect quality saturation in UGC video compression by using denoised MSE instead of traditional NRM evaluation, achieving 8%-20% BD-rate savings.

Details

Motivation: Conventional codecs cause quality saturation when compressing noisy UGC videos - increasing bitrate preserves input artifacts without improving visual quality. Repeated NRM evaluation for detection is inefficient.

Method: Use D-MSE (MSE with respect to denoised UGC) instead of traditional MSE. Propose two detection methods: DSD (input-dependent threshold) and RDSD (estimates Lagrangian at saturation point using low-complexity compression).

Result: Experiments with AVC show 8%-20% BD-rate savings across multiple NRMs by avoiding encoding in saturation regions.

Conclusion: The proposed methods efficiently detect distortion saturation as a pre-processing step, helping standard codecs avoid quality saturation in UGC compression without repeated NRM evaluation.

Abstract: Video-sharing platforms must re-encode large volumes of noisy user-generated content (UGC) to meet streaming demands. However, conventional codecs, which aim to minimize the mean squared error (MSE) between the compressed and input videos, can cause quality saturation (QS) when applied to UGC, i.e., increasing the bitrate preserves input artifacts without improving visual quality. A direct approach to solve this problem is to detect QS by repeatedly evaluating a non-reference metric (NRM) on videos compressed with multiple codec parameters, which is inefficient. In this paper, we re-frame UGC compression and QS detection through the lens of noisy source coding theory: rather than using an NRM, we compute the MSE with respect to the denoised UGC, which serves as an alternative reference (D-MSE). Unlike MSE measured between the UGC input and the compressed UGC, D-MSE saturates at non-zero values as bitrates increase, a phenomenon we term distortion saturation (DS). Since D-MSE can be computed at the block level in the transform domain, we can efficiently detect D-MSE without coding and decoding with various parameters. We propose two methods for DS detection: distortion saturation detection (DSD), which relies on an input-dependent threshold derived from the D-MSE of the input UGC, and rate-distortion saturation detection (RDSD), which estimates the Lagrangian at the saturation point using a low-complexity compression method. Both methods work as a pre-processing step that can help standard-compliant codecs avoid QS in UGC compression. Experiments with AVC show that preventing encoding in the saturation region, i.e., avoiding encoding at QPs that result in QS according to our methods, achieves BD-rate savings of 8%-20% across multiple NRMs, compared to a naïve baseline that encodes at the given input QP while ignoring QS.
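The distortion saturation effect can be reproduced with a toy 1-D experiment. In the sketch below a uniform quantizer stands in for the codec, and the clean signal stands in for the denoised reference (i.e., an ideal denoiser is assumed). The signal, noise level, and step sizes are illustrative choices, not values from the paper.

```python
import math
import random

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def quantize(signal, step):
    """Uniform scalar quantizer, a crude stand-in for a codec at a given QP."""
    return [round(v / step) * step for v in signal]

rng = random.Random(0)
clean = [math.sin(0.05 * i) for i in range(4000)]   # stand-in for the denoised reference
noisy = [c + rng.gauss(0.0, 0.1) for c in clean]    # stand-in for the noisy UGC input

# Smaller steps mean higher bitrate: MSE to the input keeps falling,
# while D-MSE levels off near the noise variance.
for step in (1.0, 0.3, 0.1, 0.01):
    decoded = quantize(noisy, step)
    print(f"step={step:<5} MSE={mse(decoded, noisy):.5f}  D-MSE={mse(decoded, clean):.5f}")
```

Once D-MSE stops decreasing, additional bits are spent on reproducing the noise, which is the regime the detection methods aim to avoid.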

[395] MRI Super-Resolution with Deep Learning: A Comprehensive Survey

Mohammad Khateri, Serge Vasylechko, Morteza Ghahremani, Liam Timms, Deniz Kocanaogullari, Simon K. Warfield, Camilo Jaimes, Davood Karimi, Alejandra Sierra, Jussi Tohka, Sila Kurugol, Onur Afacan

Main category: eess.IV

TL;DR: This survey paper reviews deep learning-based super-resolution techniques for MRI, providing a systematic taxonomy and analysis of methods to generate high-resolution images from low-resolution scans.

Details

Motivation: High-resolution MRI is clinically important but costly and technically constrained. Super-resolution offers a computational solution to improve image quality without additional hardware, potentially enhancing diagnostic accuracy.

Method: The paper systematically reviews DL-based MRI SR methods, examining them from computer vision, computational imaging, inverse problems, and MR physics perspectives. It covers theoretical foundations, architectures, learning strategies, datasets, and metrics.

Result: The survey provides a comprehensive taxonomy of MRI SR techniques, analyzes both established and emerging methods, and identifies unique challenges in clinical and research contexts.

Conclusion: The paper highlights open challenges and future directions for the MRI SR community, while providing essential open-access resources and tools for researchers and practitioners.

Abstract: High-resolution (HR) magnetic resonance imaging (MRI) is crucial for many clinical and research applications. However, achieving it remains costly and constrained by technical trade-offs and experimental limitations. Super-resolution (SR) presents a promising computational approach to overcome these challenges by generating HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without requiring additional hardware. This survey reviews recent advances in MRI SR techniques, with a focus on deep learning (DL) approaches. It examines DL-based MRI SR methods from the perspectives of computer vision, computational imaging, inverse problems, and MR physics, covering theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics. We propose a systematic taxonomy to categorize these methods and present an in-depth study of both established and emerging SR techniques applicable to MRI, considering unique challenges in clinical and research contexts. We also highlight open challenges and directions that the community needs to address. Additionally, we provide a collection of essential open-access resources, tools, and tutorials, available on our GitHub: https://github.com/mkhateri/Awesome-MRI-Super-Resolution. IEEE keywords: MRI, Super-Resolution, Deep Learning, Computational Imaging, Inverse Problem, Survey.

[396] MedImageInsight for Thoracic Cavity Health Classification from Chest X-rays

Rama Krishna Boya, Mohan Kireeti Magalanadu, Azaruddin Palavalli, Rupa Ganesh Tekuri, Amrit Pattanayak, Prasanthi Enuga, Vignesh Esakki Muthu, Vivek Aditya Boya

Main category: eess.IV

TL;DR: MedImageInsight foundational model used for automated binary classification of chest X-rays (Normal vs Abnormal), achieving ROC-AUC of 0.888 with fine-tuning approach.

Details

Motivation: Address increasing chest radiography volumes and radiologist workload challenges for timely interpretation.

Method: Two approaches: (1) fine-tuning MedImageInsight for end-to-end classification, (2) using model as feature extractor with traditional ML classifiers. Used ChestX-ray14 dataset and real-world clinical data.

Result: Fine-tuned classifier achieved highest performance (ROC-AUC 0.888) with superior calibration, comparable to established architectures like CheXNet.

Conclusion: Foundational medical imaging models effectively reduce task-specific training requirements while maintaining diagnostic reliability, suitable for integration into clinical workflows to support triage and reduce radiologist burden.

Abstract: Chest radiography remains one of the most widely used imaging modalities for thoracic diagnosis, yet increasing imaging volumes and radiologist workload continue to challenge timely interpretation. In this work, we investigate the use of MedImageInsight, a medical imaging foundational model, for automated binary classification of chest X-rays into Normal and Abnormal categories. Two approaches were evaluated: (1) fine-tuning MedImageInsight for end-to-end classification, and (2) employing the model as a feature extractor for a transfer learning pipeline using traditional machine learning classifiers. Experiments were conducted using a combination of the ChestX-ray14 dataset and real-world clinical data sourced from partner hospitals. The fine-tuned classifier achieved the highest performance, with an ROC-AUC of 0.888 and superior calibration compared to the transfer learning models, demonstrating performance comparable to established architectures such as CheXNet. These results highlight the effectiveness of foundational medical imaging models in reducing task-specific training requirements while maintaining diagnostic reliability. The system is designed for integration into web-based and hospital PACS workflows to support triage and reduce radiologist burden. Future work will extend the model to multi-label pathology classification to provide preliminary diagnostic interpretation in clinical environments.
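The headline metric here, ROC-AUC, equals the probability that a randomly chosen abnormal image receives a higher score than a randomly chosen normal one. A minimal pairwise sketch follows; the function name and the example scores are illustrative and are not data from the paper.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Hypothetical labels (1 = Abnormal) and classifier scores.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of examples; for large test sets a rank-based computation is the usual choice.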

Qi Jiang, Xiaolong Qian, Yao Gao, Lei Sun, Kailun Yang, Zhonghua Yi, Wenyong Li, Ming-Hsuan Yang, Luc Van Gool, Kaiwei Wang

Main category: eess.IV

TL;DR: OmniLens++ is a framework for blind lens aberration correction that addresses data scalability and prior guidance challenges through expanded lens design specifications and a Latent PSF Representation using VQVAE.

Details

Motivation: To overcome limitations in existing lens library pre-training pipelines, specifically the difficulty of scaling data and lack of optical degradation prior guidance, which hinder generalization ability.

Method: Expands lens design specifications for degradation diversity, samples uniform distribution by quantifying spatial-variation patterns, and introduces Latent PSF Representation using VQVAE framework to learn degradation priors from Point Spread Functions.

Result: Demonstrates state-of-the-art generalization capacity in blind aberration correction across diverse real-world lenses and synthetic LensLib, with AODLibpro verified as scalable foundation and LPR tapping potential of large-scale LensLib.

Conclusion: OmniLens++ effectively resolves key challenges in blind lens aberration correction, providing improved generalization through enhanced data scalability and degradation prior modeling, with publicly available code and datasets.

Abstract: The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). The VQVAE framework is introduced to learn latent features of LensLib’s PSFs, which is assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, the AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2.

[398] Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal

Xiaolong Qian, Qi Jiang, Lei Sun, Zongxi Yu, Kailun Yang, Peixuan Wu, Jiacheng Zhou, Yao Gao, Yaoguang Ma, Ming-Hsuan Yang, Kaiwei Wang

Main category: eess.IV

TL;DR: VeilGen generates realistic veiling glare for compact optical systems using unsupervised learning with Stable Diffusion priors, enabling paired dataset creation. DeVeiler then uses these generated maps to restore images through a reversibility-constrained network.

Details

Motivation: Compact optical systems suffer from veiling glare that degrades image quality beyond traditional aberrations. Existing scattering models fail to capture this spatially varying, depth-independent glare, making paired data generation difficult for data-driven restoration methods.

Method: VeilGen: generative model that learns to simulate veiling glare by estimating optical transmission and glare maps from target images using SD-based priors. DeVeiler: restoration network trained with reversibility constraint that uses predicted latent maps to guide inverse scattering process.

Result: Extensive experiments show superior restoration quality and physical fidelity compared to existing methods. VeilGen reliably synthesizes realistic veiling glare and its learned latent maps effectively guide DeVeiler’s restoration process.

Conclusion: The proposed approach successfully addresses veiling glare in compact optical systems through realistic glare simulation and guided restoration, outperforming current methods while providing interpretable latent optical maps.

Abstract: Beyond the commonly recognized optical aberrations, the imaging performance of compact optical systems, including single-lens and metalens designs, is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments. This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatially varying and depth-independent nature. Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models. To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors. VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process. We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model. Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods. These results suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released at https://github.com/XiaolongQian/DeVeiler.

[399] Automated Muscle and Fat Segmentation in Computed Tomography for Comprehensive Body Composition Analysis

Yaqian Chen, Hanxue Gu, Yuwen Chen, Jichen Yang, Haoyu Dong, Joseph Y. Cao, Adrian Camarena, Christopher Mantyh, Roy Colglazier, Maciej A. Mazurowski

Main category: eess.IV

TL;DR: A publicly available end-to-end segmentation model for CT body composition analysis that segments skeletal muscle, subcutaneous and visceral adipose tissue, and calculates various body composition metrics with high accuracy.

Details

Motivation: Limited publicly available tools for consistent CT body composition analysis across different clinical applications, despite its importance for cardiovascular prognostication, metabolic health evaluation, disease monitoring, and surgical risk stratification.

Method: Developed an end-to-end segmentation and feature calculation model that performs 2D and 3D segmentation of skeletal muscle, SAT, and VAT across chest, abdomen, and pelvis in axial CT images, with automated calculation of body composition metrics including muscle density, VAT/SAT ratio, muscle area/volume, and SMI.

Result: Achieved high Dice coefficients exceeding 89% for all tissue segmentations on both internal and external datasets, outperforming the benchmark by 2.10% on skeletal muscle and 8.6% on SAT compared to manual annotations. All body composition metrics showed mean relative absolute errors under 10%.

Conclusion: The publicly available model provides accurate and consistent CT body composition analysis across diverse populations, addressing the gap in accessible tools for clinical applications and research.

Abstract: Body composition assessment using CT images can potentially be used for a number of clinical applications, including the prognostication of cardiovascular outcomes, evaluation of metabolic health, monitoring of disease progression, assessment of nutritional status, prediction of treatment response in oncology, and risk stratification for surgical and critical care outcomes. While multiple groups have developed in-house segmentation tools for this analysis, there are very limited publicly available tools that could be consistently used across different applications. To mitigate this gap, we present a publicly accessible, end-to-end segmentation and feature calculation model specifically for CT body composition analysis. Our model performs segmentation of skeletal muscle, subcutaneous adipose tissue (SAT), and visceral adipose tissue (VAT) across the chest, abdomen, and pelvis area in axial CT images. It also provides various body composition metrics, including muscle density, visceral-to-subcutaneous fat (VAT/SAT) ratio, muscle area/volume, and skeletal muscle index (SMI), supporting both 2D and 3D assessments. To evaluate the model, the segmentation was applied to both internal and external datasets, with body composition metrics analyzed across different age, sex, and race groups. The model achieved high Dice coefficients on both internal and external datasets, exceeding 89% for skeletal muscle, SAT, and VAT segmentation. The model outperforms the benchmark by 2.10% on skeletal muscle and 8.6% on SAT compared to the manual annotations given by the publicly available dataset. Body composition metrics show mean relative absolute errors (MRAEs) under 10% for all measures. Our model with weights is publicly available at https://github.com/mazurowski-lab/CT-Muscle-and-Fat-Segmentation.git.
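Two of the quantities reported above are simple to compute once segmentation masks exist. The sketch below shows the Dice coefficient between a predicted and a reference binary mask, and the skeletal muscle index under its common definition of muscle cross-sectional area in cm² divided by height in m². That definition, the function names, and the example values are assumptions; the released repository may compute these differently.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A and B| / (|A| + |B|) for binary masks given as nested lists of 0/1."""
    inter = size_pred = size_truth = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += int(p == 1 and t == 1)
            size_pred += int(p == 1)
            size_truth += int(t == 1)
    if size_pred + size_truth == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * inter / (size_pred + size_truth)

def skeletal_muscle_index(muscle_pixels, pixel_spacing_mm, height_m):
    """SMI in cm^2 / m^2 from a 2D muscle mask pixel count and in-plane pixel spacing."""
    area_cm2 = muscle_pixels * pixel_spacing_mm[0] * pixel_spacing_mm[1] / 100.0
    return area_cm2 / height_m ** 2

if __name__ == "__main__":
    # Hypothetical 2x2 masks and a hypothetical patient.
    print(dice_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
    print(skeletal_muscle_index(15000, (1.0, 1.0), 1.75))
```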

[400] HazeMatching: Dehazing Light Microscopy Images with Guided Conditional Flow Matching

Anirban Ray, Ashesh, Florian Jug

Main category: eess.IV

TL;DR: HazeMatching is a novel iterative method for dehazing light microscopy images that balances fidelity and realism using conditional flow matching guided by hazy observations.

Details

Motivation: To address the trade-off between data fidelity and realism in computational dehazing of microscopy images, where existing methods either prioritize fidelity at the expense of realism or produce perceptually convincing results lacking quantitative accuracy.

Method: Adapts conditional flow matching framework by guiding the generative process with a hazy observation in the conditional velocity field, without needing an explicit degradation operator.

Result: Achieves consistent balance between fidelity and realism across 5 datasets (synthetic and real data), outperforming 11 baselines, and produces well-calibrated predictions.

Conclusion: HazeMatching effectively balances fidelity and realism in microscopy image dehazing, is applicable to real data without explicit degradation operators, and will be publicly available with permissive licensing.

Abstract: Fluorescence microscopy is a major driver of scientific progress in the life sciences. Although high-end confocal microscopes are capable of filtering out-of-focus light, cheaper and more accessible microscopy modalities, such as widefield microscopy, cannot, which consequently leads to hazy image data. Computational dehazing is trying to combine the best of both worlds, leading to cheap microscopy but crisp-looking images. The perception-distortion trade-off tells us that we can optimize either for data fidelity, e.g. low MSE or high PSNR, or for data realism, measured by perceptual metrics such as LPIPS or FID. Existing methods either prioritize fidelity at the expense of realism, or produce perceptually convincing results that lack quantitative accuracy. In this work, we propose HazeMatching, a novel iterative method for dehazing light microscopy images, which effectively balances these objectives. Our goal was to find a balanced trade-off between the fidelity of the dehazing results and the realism of individual predictions (samples). We achieve this by adapting the conditional flow matching framework by guiding the generative process with a hazy observation in the conditional velocity field. We evaluate HazeMatching on 5 datasets, covering both synthetic and real data, assessing both distortion and perceptual quality. Our method is compared against 11 baselines, achieving a consistent balance between fidelity and realism on average. Additionally, with calibration analysis, we show that HazeMatching produces well-calibrated predictions. Note that our method does not need an explicit degradation operator to exist, making it easily applicable to real microscopy data. All data used for training and evaluation and our code will be publicly available under a permissive license.
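The iterative nature of the method comes from integrating a learned velocity field from noise to an image. The sketch below shows the two generic ingredients of conditional flow matching under a linear interpolation path: the training pair (interpolated point and target velocity) and an Euler sampler that passes a conditioning input to the velocity function. The linear path, the function names, and the way the condition is passed are assumptions; the abstract states only that the hazy observation guides the conditional velocity field.

```python
def cfm_training_pair(x0, x1, t):
    """Linear path x_t = (1 - t) * x0 + t * x1 with constant target velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v

def euler_sample(velocity_fn, x0, condition, steps=10):
    """Integrate dx/dt = v(x, t, condition) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t, condition)  # in practice a neural network that sees the hazy image
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

During training, a network would be regressed onto `target_v` at the point `x_t`, with `x0` drawn from noise and `x1` a clean image. Because sampling starts from random noise, repeated runs on one hazy input give different plausible predictions, which is what makes a calibration analysis possible.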

[401] Deep Learning Analysis of Prenatal Ultrasound for Identification of Ventriculomegaly

Youssef Megahed, Inok Lee, Robin Ducharme, Aylin Erman, Olivier X. Miguel, Kevin Dick, Adrian D. C. Chan, Steven Hawken, Mark Walker, Felipe Moretti

Main category: eess.IV

TL;DR: Developed a deep learning model using fine-tuned Ultrasound Self-Supervised Foundation Model (USF-MAE) for detecting ventriculomegaly in prenatal ultrasound images, achieving high performance with F1-scores over 91% and demonstrating clinical plausibility through attention visualization.

Details

Motivation: Ventriculomegaly is a prenatal condition with dilated cerebral ventricles that requires early diagnosis due to association with increased risk of fetal aneuploidies and genetic syndromes, but current detection methods need improvement.

Method: Fine-tuned a pretrained USF-MAE Vision Transformer encoder (trained on 370,000+ ultrasound images) for binary classification of normal vs ventriculomegaly fetal brain ultrasound images, using 5-fold cross-validation and independent test set evaluation.

Result: Achieved F1-scores of 91.76% (cross-validation) and 91.78% (test set), outperforming baseline models by significant margins (19.37% vs VGG-19, 2.31% vs ResNet-50, 5.03% vs ViT-B/16). Model showed 97.24% accuracy and 94.47% precision on test set.

Conclusion: The USF-MAE model effectively detects ventriculomegaly with high accuracy and clinical explainability, as evidenced by Eigen-CAM heatmaps showing focus on ventricle areas, making it a promising tool for prenatal diagnosis.

Abstract: The proposed study aimed to develop a deep learning model capable of detecting ventriculomegaly on prenatal ultrasound images. Ventriculomegaly is a prenatal condition characterized by dilated cerebral ventricles of the fetal brain and is important to diagnose early, as it can be associated with an increased risk for fetal aneuploidies and/or underlying genetic syndromes. An Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), recently developed by our group, was fine-tuned for a binary classification task to distinguish fetal brain ultrasound images as either normal or showing ventriculomegaly. The USF-MAE incorporates a Vision Transformer encoder pretrained on more than 370,000 ultrasound images from the OpenUS-46 corpus. For this study, the pretrained encoder was adapted and fine-tuned on a curated dataset of fetal brain ultrasound images to optimize its performance for ventriculomegaly detection. Model evaluation was conducted using 5-fold cross-validation and an independent test cohort, and performance was quantified using accuracy, precision, recall, specificity, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed USF-MAE model reached an F1-score of 91.76% on the 5-fold cross-validation and 91.78% on the independent test set, exceeding the baseline models by 19.37% and 16.15% compared to VGG-19, 2.31% and 2.56% compared to ResNet-50, and 5.03% and 11.93% compared to ViT-B/16, respectively. The model also showed a high mean test precision of 94.47% and an accuracy of 97.24%. The Eigen-CAM (Eigen Class Activation Map) heatmaps showed that the model was focusing on the ventricle area for the diagnosis of ventriculomegaly, which supports its explainability and clinical plausibility.
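The threshold-based metrics listed in the abstract all derive from the four cells of a binary confusion matrix. The sketch below computes them from raw counts; the counts in the usage example are hypothetical and are not the study's data, and the paper's "mean" precision may involve averaging that is not reproduced here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, specificity, and F1 from confusion-matrix counts."""
    def safe_div(num, den):
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # also called sensitivity
    specificity = safe_div(tn, tn + fp)
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

if __name__ == "__main__":
    # Hypothetical counts, with ventriculomegaly as the positive class.
    print(binary_metrics(tp=90, fp=10, tn=80, fn=20))
```

When the positive class is rare, accuracy can be high even if recall is modest, which is why F1 is reported alongside it.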

Today’s Research Highlights

Table of Contents

cs.CL

[1] Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language

[2] Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search

[3] Bench360: Benchmarking Local LLM Inference from 360°

[4] How Well Do LLMs Understand Tunisian Arabic?

[5] Ellipsoid-Based Decision Boundaries for Open Intent Classification

[6] Prompt-Based Value Steering of Large Language Models

[7] A new kid on the block: Distributional semantics predicts the word-specific tone signatures of monosyllabic words in conversational Taiwan Mandarin

[8] Concept-Based Interpretability for Toxicity Detection

[9] Falsely Accused: How AI Detectors Misjudge Slightly Polished Arabic Articles

[10] Reproducibility Report: Test-Time Training on Nearest Neighbors for Large Language Models

[11] How Language Directions Align with Token Geometry in Multilingual LLMs

[12] Hierarchical Retrieval with Out-Of-Vocabulary Queries: A Case Study on SNOMED CT

[13] Detecting and Steering LLMs’ Empathy in Action

[14] NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation

[15] From Representation to Enactment: The ABC Framework of the Translating Mind

[16] Interpretable dimensions support an effect of agentivity and telicity on split intransitivity

[17] PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models

[18] ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers

[19] Improving Latent Reasoning in LLMs via Soft Concept Mixing

[20] Deep Improvement Supervision

[21] Predicting the Formation of Induction Heads

[22] ARQUSUMM: Argument-aware Quantitative Summarization of Online Conversations

[23] Supervised Fine Tuning of Large Language Models for Domain Specific Knowledge Graph Construction:A Case Study on Hunan’s Historical Celebrities

[24] Do Vision-Language Models Understand Visual Persuasiveness?

[25] Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments

[26] MUCH: A Multilingual Claim Hallucination Benchmark

[27] Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

[28] Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation

[29] LangMark: A Multilingual Dataset for Automatic Post-Editing

[30] The PLLuM Instruction Corpus

[31] Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models

[32] Attention-Guided Feature Fusion (AGFF) Model for Integrating Statistical and Semantic Features in News Text Classification

[33] AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale

[34] E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models

[35] A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents

[36] Parrot: Persuasion and Agreement Robustness Rating of Output Truth – A Sycophancy Robustness Benchmark for LLMs

[37] Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

[38] Social-Media Based Personas Challenge: Hybrid Prediction of Common and Rare User Actions on Bluesky

[39] Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation

[40] Large Language Models for Sentiment Analysis to Detect Social Challenges: A Use Case with South African Languages

[41] Humanlike Multi-user Agent (HUMA): Designing a Deceptively Human AI Facilitator for Group Chats

[42] Don’t Learn, Ground: A Case for Natural Language Inference with Visual Grounding

[43] Selective Rotary Position Embedding

[44] PUCP-Metrix: A Comprehensive Open-Source Repository of Linguistic Metrics for Spanish

[45] Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training

[46] SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

[47] Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

[48] MiniLLM: Knowledge Distillation of Large Language Models

[49] Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings

[50] Task-Aligned Tool Recommendation for Large Language Models

[51] EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems

[52] Concise Reasoning via Reinforcement Learning

[53] The Rise of Parameter Specialization for Knowledge Storage in Large Language Models

[54] ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

[55] Fairness Evaluation of Large Language Models in Academic Library Reference Services

[56] Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

[57] Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Messages from Parler

[58] Do LLMs produce texts with “human-like” lexical diversity?

[59] Beyond Human Judgment: A Bayesian Evaluation of LLMs’ Moral Values Understanding

[60] RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

[61] RAG-BioQA: Retrieval-Augmented Generation for Long-Form Biomedical Question Answering

[62] LLM one-shot style transfer for Authorship Attribution and Verification

[63] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL with GRPO

[64] ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation

[65] AI use in American newspapers is widespread, uneven, and rarely disclosed

[66] AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs

[67] Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties

[68] A systematic review of relation extraction task since the emergence of Transformers

[69] Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods

[70] When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

[71] From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

[72] Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study

[73] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

cs.CV

[74] RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

[75] The persistence of painting styles

[76] AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos