Daily arXiv Papers - 2025-09-10

AI-enhanced summaries of 25 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

Ruggero Marino Lazzaroni, Alessandro Angioi, Michelangelo Puliga, Davide Sanna, Roberto Marras

Main category: cs.CL

TL;DR: First comprehensive benchmark for evaluating LLMs on Italian medical university entrance exams with 17,410 expert-written questions across 6 subjects and 3 difficulty levels.

Motivation: Address the scarcity of benchmarks for non-English languages in specialized domains, particularly Italian medical education evaluation.

Method: Created MedBench-IT benchmark with 17,410 multiple-choice questions from Edizioni Simone. Evaluated diverse LLMs including proprietary models (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters). Conducted reproducibility tests, ordering bias analysis, reasoning prompt evaluation, and readability-performance correlation analysis.
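To make the ordering-bias analysis concrete, here is a minimal sketch of one way such a check can be run: re-ask each question with permuted answer options and measure how often the model's choice survives the permutation. The `ask_model` wrapper is a hypothetical stand-in for the LLM under test, not code from the benchmark.

```python
import random

def ordering_bias_check(question, options, correct, ask_model, n_perms=4):
    """Re-ask the same question with shuffled option orders and report
    how often the model still picks the correct answer. `ask_model`
    (question, options) -> index of chosen option is a hypothetical
    wrapper around the LLM under evaluation."""
    hits = 0
    for _ in range(n_perms):
        shuffled = options[:]
        random.shuffle(shuffled)
        choice = ask_model(question, shuffled)
        if shuffled[choice] == correct:
            hits += 1
    return hits / n_perms  # 1.0 = fully order-invariant on this item
```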

Result: Achieved 88.86% response consistency across tests (varying by subject), found minimal ordering bias impact, and identified statistically significant but small inverse relationship between question readability and model performance.

Conclusion: MedBench-IT provides a crucial standardized evaluation resource for the Italian NLP community and EdTech developers, offering insights into current LLM capabilities in the Italian medical education domain.

Abstract: Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading preparatory materials publisher, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters) focusing on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and standardized evaluation methodology for this critical domain.

[2] The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties

William Chen, Chutong Meng, Jiatong Shi, Martijn Bartelds, Shih-Heng Wang, Hsiu-Hsuan Wang, Rafael Mosquera, Sara Hincapie, Dan Jurafsky, Antonis Anastasopoulos, Hung-yi Lee, Karen Livescu, Shinji Watanabe

Main category: cs.CL

TL;DR: The Interspeech 2025 ML-SUPERB 2.0 Challenge introduces a comprehensive multilingual ASR evaluation suite with 200+ languages, accents, and dialects, resulting in submissions that significantly outperformed baselines with up to 30.2% lower CER and 23% higher LID accuracy.

Motivation: To address unequal distribution of multilingual ASR improvements across languages and language varieties, and advance state-of-the-art ASR models through inclusive community challenges.

Method: Constructed a new test suite with 200+ languages, accents, and dialects, and introduced an online evaluation server based on DynaBench for flexible model design and architecture.
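CER (character error rate) is the challenge's headline recognition metric. For reference, a minimal self-contained implementation as Levenshtein distance normalized by reference length:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between hypothesis and
    reference characters, divided by the reference length."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

print(cer("hello world", "helo wurld"))  # 2 edits / 11 chars ~= 0.18
```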

Result: Received 5 submissions from 3 teams, all outperforming baselines. Best submission achieved 23% absolute improvement in LID accuracy, 18% CER reduction on general multilingual test, and 30.2% lower CER with 15.7% higher LID accuracy on accented/dialectal data.

Conclusion: Community challenges like ML-SUPERB 2.0 are crucial for making speech technologies more inclusive by driving improvements in multilingual ASR performance across diverse languages and language varieties.

Abstract: Recent improvements in multilingual ASR have not been equally distributed across languages and language varieties. To advance state-of-the-art (SOTA) ASR models, we present the Interspeech 2025 ML-SUPERB 2.0 Challenge. We construct a new test suite that consists of data from 200+ languages, accents, and dialects to evaluate SOTA multilingual speech models. The challenge also introduces an online evaluation server based on DynaBench, allowing for flexibility in model design and architecture for participants. The challenge received 5 submissions from 3 teams, all of which outperformed our baselines. The best-performing submission achieved an absolute improvement in LID accuracy of 23% and a reduction in CER of 18% when compared to the best baseline on a general multilingual test set. On accented and dialectal data, the best submission obtained 30.2% lower CER and 15.7% higher LID accuracy, showing the importance of community challenges in making speech technologies more inclusive.

[3] Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems

Xiaolin Chen, Xuemeng Song, Haokun Wen, Weili Guan, Xiangyu Zhao, Liqiang Nie

Main category: cs.CL

TL;DR: DK2R is a dual knowledge-enhanced two-stage reasoner that leverages both structured attribute and unstructured review knowledge with LLMs to improve textual response generation in multimodal task-oriented dialog systems.

Motivation: Existing multimodal task-oriented dialog systems neglect unstructured review knowledge and underutilize large language models (LLMs), limiting their response generation capabilities.

Method: Proposes DK2R framework that: 1) extracts both structured attribute and unstructured review knowledge, 2) uses LLM to evaluate knowledge utility via provisional probe responses, 3) separately summarizes intention-oriented key clues through dedicated reasoning, and 4) uses these as auxiliary signals for enhanced response generation.
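A hedged sketch of the two-stage flow described above. Here `llm` (a prompt-to-text callable) and `kb` (a knowledge base with attribute/review lookups) are hypothetical stand-ins; DK2R's actual prompting and scoring are more elaborate.

```python
def dk2r_respond(dialog_ctx, kb, llm):
    """Illustrative two-stage reasoning: probe both knowledge types,
    let the LLM judge their utility, then generate with intention
    clues as auxiliary signals. All objects are hypothetical."""
    # Stage 1: draft a probe response per knowledge type, judge utility.
    attr = kb.attributes(dialog_ctx)
    revw = kb.reviews(dialog_ctx)
    probe_attr = llm(f"Context: {dialog_ctx}\nKnowledge: {attr}\nDraft a reply.")
    probe_revw = llm(f"Context: {dialog_ctx}\nKnowledge: {revw}\nDraft a reply.")
    verdict = llm("Which draft better serves the user? Answer A or B.\n"
                  f"A: {probe_attr}\nB: {probe_revw}")
    knowledge = attr if "A" in verdict else revw
    # Stage 2: summarize intention-oriented key clues, then generate.
    clues = llm(f"Context: {dialog_ctx}\nSummarize the user's intention as key clues.")
    return llm(f"Context: {dialog_ctx}\nKnowledge: {knowledge}\nClues: {clues}\nReply:")
```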

Result: Extensive experiments on a public dataset verify the superiority of DK2R over existing approaches.

Conclusion: The proposed DK2R framework effectively addresses dynamic knowledge type selection and intention-response decoupling challenges, demonstrating improved performance in multimodal dialog response generation by fully utilizing dual knowledge with LLMs.

Abstract: Textual response generation is pivotal for multimodal task-oriented dialog systems, which aims to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, the following limitations remain: 1) neglect of unstructured review knowledge and 2) underutilization of large language models (LLMs). Inspired by this, we aim to fully utilize dual knowledge (i.e., structured attribute and unstructured review knowledge) with LLMs to promote textual response generation in multimodal task-oriented dialog systems. However, this task is non-trivial due to two key challenges: 1) dynamic knowledge type selection and 2) intention-response decoupling. To address these challenges, we propose a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for multimodal dialog systems (named DK2R). To be specific, DK2R first extracts both structured attribute and unstructured review knowledge from the external knowledge base given the dialog context. Thereafter, DK2R uses an LLM to evaluate each knowledge type’s utility by analyzing LLM-generated provisional probe responses. Moreover, DK2R separately summarizes the intention-oriented key clues via dedicated reasoning, which are further used as auxiliary signals to enhance LLM-based textual response generation. Extensive experiments conducted on a public dataset verify the superiority of DK2R. We have released the codes and parameters.

[4] Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models

Zhiyin Tan, Jennifer D’Souza

Main category: cs.CL

TL;DR: A framework using LLMs for automated evaluation of topic models across four quality dimensions, providing more interpretable and robust assessments than traditional metrics.

Motivation: Traditional topic model evaluation metrics like coherence and diversity capture only narrow statistical patterns and fail to explain semantic failures in practice, limiting their effectiveness for dynamic knowledge domains.

Method: Developed a purpose-oriented evaluation framework with nine LLM-based metrics spanning four dimensions: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. Validated through adversarial and sampling-based protocols across multiple datasets and topic modeling methods.

Result: LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift that traditional metrics often miss.

Conclusion: The framework supports development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets, offering superior evaluation capabilities compared to traditional statistical metrics.

Abstract: This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at https://github.com/zhiyintan/topic-model-LLMjudgment.

[5] Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

Amal Chebbi, Babajide Kolade

Main category: cs.CL

TL;DR: EnergyGPT is a specialized LLM for the energy sector, fine-tuned from LLaMA 3.1-8B, showing improved performance in energy-related tasks without requiring large infrastructure.

Motivation: General-purpose LLMs lack effectiveness in specialized fields like energy that require deep technical expertise and precise domain knowledge.

Method: Fine-tuned LLaMA 3.1-8B using Supervised Fine-Tuning on a curated energy corpus, with complete pipeline including data collection, model training, benchmark design, and evaluation.
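As an illustration of the training recipe, here is a minimal causal-LM supervised fine-tuning loop in the Hugging Face ecosystem. The one-line corpus is a placeholder, and the paper's actual pipeline (data curation, packing, hyperparameters, evaluation) is more involved; this is a sketch, not EnergyGPT's released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder for the curated energy-domain corpus described above.
corpus = ["Q: What does LCOE stand for? A: Levelized cost of electricity ..."]

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto")
model.train()

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
for text in corpus:
    batch = tok(text, return_tensors="pt",
                truncation=True, max_length=2048).to(model.device)
    # Standard SFT objective: next-token cross-entropy, with the
    # inputs themselves serving as labels.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```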

Result: EnergyGPT outperforms the base model in most energy-related language understanding and generation tasks on domain-specific benchmarks.

Conclusion: Domain-specialized fine-tuning enables significant improvements in relevance and performance for technical fields like energy without large-scale infrastructure requirements.

Abstract: Large Language Models have demonstrated impressive capabilities across various domains. However, their general-purpose nature often limits their effectiveness in specialized fields such as energy, where deep technical expertise and precise domain knowledge are essential. In this paper, we introduce EnergyGPT, a domain-specialized language model tailored for the energy sector, developed by fine-tuning LLaMA 3.1-8B model using Supervised Fine-Tuning on a high-quality, curated corpus of energy-related texts. We present a complete development pipeline, including data collection and curation, model fine-tuning, benchmark design and LLM-judge choice, evaluation and deployment. Through this work, we demonstrate that our training strategy enables improvements in domain relevance and performance without the need for large-scale infrastructure. By evaluating the performance of the model using domain-specific question-answering benchmarks, our results demonstrate that EnergyGPT outperforms the base model in most of the energy-related language understanding and generation tasks.

[6] Cardiverse: Harnessing LLMs for Novel Card Game Prototyping

Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia

Main category: cs.CL

TL;DR: Automated framework for card game prototyping using LLMs, featuring novel game generation, consistent code creation, and scalable AI gameplay evaluation.

Motivation: Reduce human effort in card game design by automating creative ideation and gameplay evaluation processes that are currently challenging for LLMs to handle effectively.

Method: Graph-based indexing for novel game variations, LLM-driven system for consistent game code generation with validation, and ensemble of LLM-generated heuristic functions optimized through self-play for gameplay AI.

Result: Developed a comprehensive automated card game prototyping framework that addresses LLM limitations in designing novel mechanics, generating consistent environments, and creating scalable evaluation AI.

Conclusion: The framework accelerates card game prototyping, reduces human labor, and lowers barriers to entry for game developers by automating key aspects of game design and evaluation.

Abstract: The prototyping of computer games, particularly card games, requires extensive human effort in creative ideation and gameplay evaluation. Recent advances in Large Language Models (LLMs) offer opportunities to automate and streamline these processes. However, it remains challenging for LLMs to design novel game mechanics beyond existing databases, generate consistent gameplay environments, and develop scalable gameplay AI for large-scale evaluations. This paper addresses these challenges by introducing a comprehensive automated card game prototyping framework. The approach highlights a graph-based indexing method for generating novel game variations, an LLM-driven system for consistent game code generation validated by gameplay records, and a gameplay AI constructing method that uses an ensemble of LLM-generated heuristic functions optimized through self-play. These contributions aim to accelerate card game prototyping, reduce human labor, and lower barriers to entry for game developers. Code repository: https://github.com/danruili/Cardiverse

[7] DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge

Zonghai Yao, Michael Sun, Won Seok Jang, Sunjae Kwon, Soie Kwon, Hong Yu

Main category: cs.CL

TL;DR: DischargeSim is a new benchmark that evaluates LLMs’ ability to provide personalized discharge education through simulated doctor-patient conversations, assessing dialogue quality, document generation, and patient comprehension.

Motivation: Current LLM benchmarks focus on diagnostic reasoning but fail to evaluate models' ability to support patients after medical visits through discharge communication and education.

Method: DischargeSim simulates multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles across six clinical topics, evaluated through automatic metrics, LLM-as-judge, document generation, and comprehension tests.

Result: Experiments with 18 LLMs revealed significant gaps in discharge education capability, with performance varying across patient profiles and model size not always correlating with better outcomes.

Conclusion: DischargeSim provides the first benchmark for evaluating LLMs in post-visit clinical education, highlighting the need for equitable and personalized patient support capabilities.

Abstract: Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models’ ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.

[8] Rule-Based Moral Principles for Explaining Uncertainty in Natural Language Generation

Zahra Atf, Peter R Lewis

Main category: cs.CL

TL;DR: A rule-based moral framework for handling uncertainty in LLMs using principles from moral psychology and virtue ethics, implemented in Prolog to provide transparent uncertainty explanations.

Motivation: LLMs are used in high-stakes settings where probabilistic uncertainty methods are often opaque and misaligned with transparency expectations, creating technical and ethical challenges.

Method: Proposed framework based on moral principles (precaution, deference, responsibility) encoded in lightweight Prolog engine. Uncertainty levels trigger system actions with plain-language rationales. Evaluated through scenario simulations.

Result: Framework provides transparent uncertainty handling with moral reasoning. Scenario simulations show improved rule coverage, fairness, and trust calibration in clinical and legal domains.

Conclusion: The approach offers a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation, improving trust and interpretability through moral reasoning.

Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where explaining uncertainty is both technical and ethical. Probabilistic methods are often opaque and misaligned with expectations of transparency. We propose a framework based on rule-based moral principles for handling uncertainty in LLM-generated text. Using insights from moral psychology and virtue ethics, we define rules such as precaution, deference, and responsibility to guide responses under epistemic or aleatoric uncertainty. These rules are encoded in a lightweight Prolog engine, where uncertainty levels (low, medium, high) trigger aligned system actions with plain-language rationales. Scenario-based simulations benchmark rule coverage, fairness, and trust calibration. Use cases in clinical and legal domains illustrate how moral reasoning can improve trust and interpretability. Our approach offers a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation.

[9] LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade

Aida Kostikova, Ole Pütz, Steffen Eger, Olga Sabelfeld, Benjamin Paassen

Main category: cs.CL

TL;DR: LLMs can effectively annotate (anti-)solidarity in German parliamentary debates about migration, showing high postwar solidarity and rising anti-solidarity since 2015.

Motivation: Traditional manual annotation of political speech about migration is time-consuming and limits analysis scope. LLMs offer potential for automating complex annotation tasks in political text analysis.

Method: Extensive evaluation of multiple LLMs for annotating (anti-)solidarity subtypes in German parliamentary debates, comparing with thousands of human reference annotations. Examined model size, prompting differences, fine-tuning, historical vs contemporary data, and systematic errors.

Result: LLMs show promise for political text analysis. Data reveals high migrant-directed solidarity in postwar period and strong anti-solidarity trend in German parliament since 2015.

Conclusion: LLMs are valuable tools for political text analysis, particularly for migration debates. Findings highlight Germany’s complex migration landscape with demographic decline and labor shortages alongside rising polarization.

Abstract: Migration has been a core topic in German political debate, from millions of expellees post World War II over labor migration to refugee movements in the recent past. Studying political speech regarding such wide-ranging phenomena in depth traditionally required extensive manual annotations, limiting the scope of analysis to small subsets of the data. Large language models (LLMs) have the potential to partially automate even complex annotation tasks. We provide an extensive evaluation of multiple LLMs in annotating (anti-)solidarity subtypes in German parliamentary debates compared to a large set of thousands of human reference annotations (gathered over a year). We evaluate the influence of model size, prompting differences, fine-tuning, historical versus contemporary data; and we investigate systematic errors. Beyond methodological evaluation, we also interpret the resulting annotations from a social science lens, gaining deeper insight into (anti-)solidarity trends towards migrants in the German post-World War II period and recent past. Our data reveals a high degree of migrant-directed solidarity in the postwar period, as well as a strong trend towards anti-solidarity in the German parliament since 2015, motivating further research. These findings highlight the promise of LLMs for political text analysis and the importance of migration debates in Germany, where demographic decline and labor shortages coexist with rising polarization.

[10] Causal Attention with Lookahead Keys

Zhuoqing Song, Peng Sun, Huizhuo Yuan, Quanquan Gu

Main category: cs.CL

TL;DR: CASTLE introduces lookahead keys that update tokens’ keys as context unfolds, enabling integration of future information while preserving autoregressive properties, with efficient parallel training and improved language modeling performance.

Motivation: Standard causal attention has static QKV that only encode preceding context, limiting the model's ability to integrate information from future tokens while maintaining autoregressive constraints.

Method: CASTLE continually updates each token’s keys as context unfolds, creating lookahead keys that integrate information from later positions while preserving autoregressive property. A mathematical equivalence enables efficient parallel training without explicitly materializing lookahead keys at each position.
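A schematic contrast between static keys in standard causal attention and the lookahead-key idea, in the abstract's notation. The update function f below is a placeholder for the paper's mechanism, not its exact parameterization.

```latex
% Standard causal attention: static keys, computed once per token.
\[
  \mathrm{Attn}(x_t) \;=\; \sum_{s \le t}
    \operatorname{softmax}_s\!\left(\frac{q_t^{\top} k_s}{\sqrt{d}}\right) v_s,
  \qquad k_s = W_K x_s .
\]
% CASTLE (schematically): the key for position s, as used at decoding
% step t, is refreshed with tokens between s and t; f is a placeholder.
\[
  k_s^{(t)} \;=\; f\!\left(k_s;\; x_{s+1}, \dots, x_t\right),
  \qquad s \le t ,
\]
% so lookahead keys see tokens after s but never after t, which
% preserves the autoregressive property.
```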

Result: On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.

Conclusion: CASTLE provides an effective attention mechanism that allows integration of future context information while maintaining autoregressive properties, with demonstrated improvements in language modeling performance.

Abstract: In standard causal attention, each token’s query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token’s keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.

[11] When Large Language Models Meet Speech: A Survey on Integration Approaches

Zhengdong Yang, Shuichiro Shimizu, Yahan Yu, Chenhui Chu

Main category: cs.CL

TL;DR: Survey paper on integrating speech modality with large language models (LLMs), categorizing approaches into text-based, latent-representation-based, and audio-token-based methods.

Motivation: Recent advancements in LLMs have created interest in expanding their capabilities beyond text to include speech modality, which is naturally related to text.

Method: Categorizes integration methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration.

Result: Demonstrates how these methods are applied across various speech-related applications and highlights challenges in the field.

Conclusion: Provides a comprehensive survey of speech-LLM integration approaches to inspire future research in multimodal language modeling.

Abstract: Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for future research.

[12] Basis Vector Metric: A Method for Robust Open-Ended State Change Detection

David Oprea, Sam Powers

Main category: cs.CL

TL;DR: BVM method tested for image state classification using language embeddings on MIT-States dataset, outperforms other metrics for noun classification but shows inconclusive results for adjective differentiation compared to logistic regression.

Motivation: To develop and test a new method (Basis Vectors Method) for judging state changes in images using language embeddings, and compare its performance against existing metrics.

Method: Used MIT-States dataset with 53,000 images, 225 nouns and 115 adjectives. Tested BVM against cosine similarity, dot product, product quantization, binary index, Naive Bayes, and custom neural network for noun classification. Compared BVM to logistic regression for adjective differentiation.
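For reference, the cosine-similarity baseline that BVM is compared against can be as simple as nearest class embedding; the toy vectors below are illustrative only, not MIT-States data.

```python
import numpy as np

def cosine_classify(query_emb, class_embs):
    """Baseline: assign the state whose (language-)embedding has the
    highest cosine similarity to the image embedding. `class_embs`
    maps state name -> embedding vector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(class_embs, key=lambda name: cos(query_emb, class_embs[name]))

states = {"sliced": np.array([0.9, 0.1]), "whole": np.array([0.1, 0.9])}
print(cosine_classify(np.array([0.8, 0.3]), states))  # -> "sliced"
```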

Result: BVM performed best among all metrics for classifying states for each noun. For adjective differentiation, BVM did not conclusively outperform logistic regression, but evidence suggests potential improvements with methodology changes.

Conclusion: BVM is effective for noun state classification but requires refinement for adjective differentiation, with potential for improved accuracy through methodological adjustments.

Abstract: We test a new method, which we will abbreviate using the acronym BVM (Basis Vectors Method), in its ability to judge the state changes in images through using language embeddings. We used the MIT-States dataset, containing about 53,000 images, to gather all of our data, which has 225 nouns and 115 adjectives, with each noun having about 9 different adjectives, forming approximately 1000 noun-adjective pairs. For our first experiment, we test our method’s ability to determine the state of each noun class separately against other metrics for comparison. These metrics are cosine similarity, dot product, product quantization, binary index, Naive Bayes, and a custom neural network. Among these metrics, we found that our proposed BVM performs the best in classifying the states for each noun. We then perform a second experiment where we try using BVM to determine if it can differentiate adjectives from one another for each adjective separately. We compared the abilities of BVM to differentiate adjectives against the proposed method the MIT-States paper suggests: using a logistic regression model. In the end, we did not find conclusive evidence that our BVM metric could perform better than the logistic regression model at discerning adjectives. Yet, we were able to find evidence for possible improvements to our method; this leads to the chance of increasing our method’s accuracy through certain changes in our methodologies.

[13] Instance-level Performance Prediction for Long-form Generation Tasks

Chi-Yang Hsu, Alexander Braylan, Yiheng Su, Omar Alonso, Matthew Lease

Main category: cs.CL

TL;DR: A new benchmark for predicting instance-level performance in long-form generation tasks with fine-grained quality metrics, requiring both point estimates and uncertainty quantification using minimal training data.

Motivation: To create a task-, model- and metric-agnostic framework for predicting continuous evaluation metric scores in long-form generation tasks, addressing the need for performance prediction with uncertainty quantification.

Method: Black-box approach using only model inputs and outputs to predict metric scores, with evaluation across 11 long-form datasets/tasks using multiple LLMs, baselines, and metrics per task.
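One standard way to produce both point estimates and prediction intervals from black-box features, sketched with scikit-learn quantile regression. The features and metric scores below are synthetic stand-ins for the benchmark's data, and the baselines in the paper may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                   # stand-in input/output features
y = X[:, 0] + rng.normal(scale=0.3, size=200)   # stand-in metric scores

# Point estimate plus a 90% prediction interval via quantile loss.
point = GradientBoostingRegressor(loss="squared_error").fit(X, y)
lo = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

x_new = X[:1]
print(point.predict(x_new), lo.predict(x_new), hi.predict(x_new))
```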

Result: Scores can be effectively predicted across long-form generation tasks using as few as 16 training examples, demonstrating practical feasibility.

Conclusion: The paper introduces a novel and useful task, provides a valuable benchmark for driving progress, and offers baselines ready for practical adoption in performance prediction for long-form generation.

Abstract: We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multi-faceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.

[14] Does This Look Familiar to You? Knowledge Analysis via Model Internal Representations

Sihyun Park

Main category: cs.CL

TL;DR: KAMIR is a novel data selection method that uses model’s internal representations to identify training data based on familiarity, improving generalization performance across various NLP tasks.

Motivation: Current SFT data selection methods rely heavily on prompt engineering and are sensitive to variations, while simply increasing data volume doesn't guarantee performance improvements. There's a need for more efficient and robust data selection approaches.

Method: KAMIR computes similarities between hidden states of each layer and final hidden states to assess data familiarity. It uses model’s internal representations rather than prompt engineering to select useful training data.
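The core familiarity signal is straightforward to reproduce in outline: compare each layer's hidden state to the final hidden state. A minimal sketch with a small stand-in encoder (the paper targets LLMs, and its pooling and feature details may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in model
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

batch = tok("An example training instance.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).hidden_states  # tuple: embeddings + one per layer

final = hidden[-1].mean(dim=1)  # mean-pooled final hidden state
sims = [torch.cosine_similarity(h.mean(dim=1), final).item()
        for h in hidden[:-1]]
print(sims)  # one per-layer familiarity feature, fed to a small classifier
```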

Result: Experiments show that training with less familiar data selected by KAMIR leads to better generalization performance across diverse task datasets including machine reading comprehension and summarization.

Conclusion: KAMIR provides an effective alternative to prompt-based data selection methods, enabling efficient training data selection based on model’s internal representations without additional prompt design costs.

Abstract: Recent advances in large language models (LLMs) have been driven by pretraining, supervised fine tuning (SFT), and alignment tuning. Among these, SFT plays a crucial role in transforming a model ’s general knowledge into structured responses tailored to specific tasks. However, there is no clearly established methodology for effective training data selection. Simply increasing the volume of data does not guarantee performance improvements, while preprocessing, sampling, and validation require substantial time and cost. To address this issue, a variety of data selection methods have been proposed. Among them, knowledge based selection approaches identify suitable training data by analyzing the model ’s responses. Nevertheless, these methods typically rely on prompt engineering, making them sensitive to variations and incurring additional costs for prompt design. In this study, we propose Knowledge Analysis via Model Internal Representations (KAMIR), a novel approach that overcomes these limitations by analyzing data based on the model ’s internal representations. KAMIR computes similarities between the hidden states of each layer (block) and the final hidden states for a given input to assess the data. Unlike prior methods that were largely limited to multiple choice tasks, KAMIR can be applied to a wide range of tasks such as machine reading comprehension and summarization. Moreover, it selects data useful for training based on the model ’s familiarity with the input, even with a small dataset and a simple classifier architecture. Experiments across diverse task datasets demonstrate that training with less familiar data leads to better generalization performance.

[15] Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM

Thomas Thebaud, Yen-Ju Lu, Matthew Wiesner, Peter Viechnicki, Najim Dehak

Main category: cs.CL

TL;DR: Using frozen audio foundation models (Whisper/WavLM) and frozen LLAMA with lightweight connectors to add speaker metadata tags (age, gender, emotion) to transcribed dialogues without task-specific fine-tuning.

Motivation: To enrich dialogue transcriptions by adding speaker characteristic metadata tags for better understanding and analysis, complementing existing grammar and readability improvements.

Method: Couples frozen audio foundation models with frozen LLAMA language model using efficient connectors to bridge audio and language representations, enabling speaker attribute inference without fine-tuning.

Result: Achieves competitive performance on speaker profiling tasks while maintaining modularity and speed. Frozen LLAMA can compare x-vectors with 8.8% Equal Error Rate in some scenarios.
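For readers unfamiliar with the metric, Equal Error Rate is the operating point where false-accept and false-reject rates coincide. A standard computation from verification scores (the labels and scores below are toy values, not the paper's data):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the threshold-free point where false-positive and
    false-negative rates are equal. Labels are 1 for same-speaker pairs."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.8]  # e.g., similarity over x-vector pairs
print(equal_error_rate(labels, scores))
```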

Conclusion: The approach successfully adds valuable speaker metadata to dialogue transcriptions using frozen models with lightweight connectors, offering a modular and efficient solution for dialogue enrichment.

Abstract: In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.

[16] Mitigating Attention Localization in Small Scale: Self-Attention Refinement via One-step Belief Propagation

Nakyung Lee, Yeongoon Kim, Minhae Oh, Suhwan Kim, Jin Woo Koo, Hyewon Jo, Jungwoo Lee

Main category: cs.CL

TL;DR: SAOBP framework prevents attention collapse in transformers by injecting multi-hop relationships through belief propagation, improving long-range dependencies and model performance, especially in small-scale models.

Motivation: Transformer self-attention often suffers from localization where attention collapses onto limited tokens, failing to capture long-range dependencies effectively.

Method: Proposed Self-Attention One-step Belief Propagation (SAOBP) that refines attention through belief propagation to inject multi-hop relationships, and introduced Global Token Dependency (GTD) to quantify these interactions.
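A hedged sketch of the one-step message-passing idea: blend an attention map with its own square (two-hop mass) and renormalize, which spreads probability beyond the tokens the raw map collapses onto. This illustrates multi-hop injection in spirit only; SAOBP's exact update, masking, and weighting may differ.

```python
import torch

def one_step_bp(attn, alpha=0.5):
    """Inject two-hop structure into a row-stochastic attention map:
    mix the one-hop distribution with one propagation step (attn @ attn)
    and renormalize. `attn` is (seq, seq) with rows summing to 1."""
    two_hop = attn @ attn                      # mass reachable in two hops
    mixed = (1 - alpha) * attn + alpha * two_hop
    return mixed / mixed.sum(dim=-1, keepdim=True)

attn = torch.softmax(torch.randn(5, 5), dim=-1)
print(one_step_bp(attn).sum(dim=-1))  # rows still sum to 1
```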

Result: SAOBP prevents entropy collapse in deeper layers, maintains adaptive GTD levels, and shows competitive performance gains particularly in small-scale models for resource-constrained scenarios.

Conclusion: The SAOBP framework effectively addresses attention localization issues and improves transformer performance, with particular promise for enhancing inference quality in resource-limited environments.

Abstract: Transformer-based self-attention mechanism serves as the core of modern language models, yet it often suffers from localization, where attentions collapse onto a limited subset of tokens and fail to capture long-range dependencies. To address this issue, we propose Self-Attention One-step Belief Propagation (SAOBP), a refinement framework that injects multi-hop relationships through a belief propagation process. To interpret and quantify these interactions, we introduce Global Token Dependency (GTD) that captures the relative contribution of multihop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively maintains GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, highlighting its potential for improving inference quality in resource-constrained scenarios.

[17] PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions

Yixuan Tang, Yi Yang, Ahmed Abbasi

Main category: cs.CL

TL;DR: PersonaFuse is a novel LLM post-training framework that enables language models to adapt and express different personalities for varying social contexts, improving social-emotional intelligence without sacrificing reasoning ability or safety.

Motivation: LLMs have limitations in emotional perception and social competence during real-world conversations, particularly in adapting communication style and emotional expression to different social and task contexts.

Method: Inspired by Trait Activation Theory and Big Five personality model, PersonaFuse uses a Mixture-of-Expert architecture combining persona adapters with dynamic routing network for contextual trait expression.

Result: Substantially outperforms baseline models across multiple social-emotional intelligence dimensions, achieves competitive response quality against leading LLMs like GPT-4o and DeepSeek despite smaller size, and improves downstream applications like mental health counseling and customer service.

Conclusion: PersonaFuse offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking significant advancement toward more human-centric AI systems.

Abstract: Recent advancements in Large Language Models (LLMs) demonstrate remarkable capabilities across various fields. These developments have led to more direct communication between humans and LLMs in various situations, such as social companionship and psychological support. However, LLMs often exhibit limitations in emotional perception and social competence during real-world conversations. These limitations partly originate from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work, we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Expert architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains are achieved without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications, such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, demonstrate that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. These findings demonstrate that PersonaFuse offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems.

[18] Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents

Sankalp Tattwadarshi Swain, Anshika Krishnatray, Dhruv Kumar, Jagat Sesh Challa

Main category: cs.CL

TL;DR: LLM agents fail to acquire a new constructed language (Tinkatongue) through interactive conversation within 100 responses, but show human-like learning strategies, suggesting new evaluation benchmarks and model design improvements.

Motivation: Existing LLM evaluation studies focus on vocabulary, syntax, and other linguistic aspects but none assess whether LLMs can acquire language through pattern recognition and interactive feedback like humans do.

Method: A novel experimental framework where an LLM agent is evaluated on acquiring and using a newly constructed language (Tinkatongue) through conversation with a bot that only understands Tinkatongue.

Result: LLM agents failed to establish successful conversation within 100 responses, but they adopted distinct strategies that mirror human approaches to language learning.

Conclusion: The findings suggest a new direction for evaluation benchmarks and open pathways to model designs that can learn more effectively from interactive feedback in language acquisition.

Abstract: Existing evaluation studies on linguistic competence of large language models (LLM agents) have focused primarily on vocabulary learning, morphological rule induction, syntactic generalization, pragmatic inference, and cross-linguistic transfer. However, none assess whether LLM agents can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition. We propose a novel experimental framework in which an LLM agent is evaluated on its ability to acquire and use a newly constructed language (Tinkatongue) in conversation with a bot that understands only Tinkatongue. Our findings show that LLM agents fail to establish a conversation within 100 responses, yet they adopt distinct strategies that mirror human approaches to language learning. The results suggest a new direction for evaluation benchmarks and open pathways to model designs that learn more effectively from interactive feedback.

[19] The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering

Yi-Jie Cheng, Oscar Chew, Yun-Nung Chen

Main category: cs.CL

TL;DR: Using lightweight exploration modules to help small language models (SLMs) better traverse knowledge graphs for question answering, improving performance without relying on large proprietary models.

Motivation: Existing knowledge graph integration methods often require large proprietary language models, limiting accessibility and scalability. Small language models struggle with KG traversal and reasoning.

Method: Propose simple and efficient exploration modules to handle knowledge graph traversal instead of relying on the language model itself for KG navigation.
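A minimal example of what such an exploration module can look like: breadth-first expansion over a triple store that hands candidate paths to the SLM for scoring rather than asking it to discover them. The toy `kg` dictionary is illustrative; the authors' repository contains the real implementation.

```python
from collections import deque

def explore(kg, start_entities, max_hops=2):
    """Breadth-first exploration over a KG, returning candidate
    entity-relation paths. `kg` maps a head entity to a list of
    (relation, tail) pairs; a path alternates entities and relations."""
    frontier = deque((e, [e]) for e in start_entities)
    paths = []
    while frontier:
        node, path = frontier.popleft()
        if (len(path) - 1) // 2 >= max_hops:  # hops taken so far
            continue
        for rel, tail in kg.get(node, []):
            new_path = path + [rel, tail]
            paths.append(new_path)
            frontier.append((tail, new_path))
    return paths

kg = {"Paris": [("capital_of", "France")], "France": [("in", "Europe")]}
print(explore(kg, ["Paris"]))
```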

Result: Lightweight modules effectively improve small language models’ performance on knowledge graph question answering tasks.

Conclusion: Lightweight exploration modules can successfully enhance SLMs’ KG reasoning capabilities, making KG-based question answering more accessible and scalable.

Abstract: Integrating knowledge graphs (KGs) into the reasoning processes of large language models (LLMs) has emerged as a promising approach to mitigate hallucination. However, existing work in this area often relies on proprietary or extremely large models, limiting accessibility and scalability. In this study, we investigate the capabilities of existing integration methods for small language models (SLMs) in KG-based question answering and observe that their performance is often constrained by their limited ability to traverse and reason over knowledge graphs. To address this limitation, we propose leveraging simple and efficient exploration modules to handle knowledge graph traversal in place of the language model itself. Experiment results demonstrate that these lightweight modules effectively improve the performance of small language models on knowledge graph question answering tasks. Source code: https://github.com/yijie-cheng/SLM-ToG/.

[20] LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction

Weichu Liu, Jing Xiong, Yuxuan Hu, Zixuan Li, Minghuan Tan, Ningning Mao, Chenyang Zhao, Zhongwei Wan, Chaofan Tao, Wendong Xu, Hui Shen, Chengming Li, Lingpeng Kong, Ngai Wong

Main category: cs.CL

TL;DR: LongEmotion benchmark for evaluating emotional intelligence in long-context scenarios with average input length of 8,777 tokens, featuring RAG and CoEM methods that outperform standard approaches.

Motivation: Existing benchmarks overlook emotional intelligence aspects in realistic long-context settings where interactions are lengthy, diverse, and noisy.

Method: Proposed Retrieval-Augmented Generation (RAG) using conversation context and LLM itself as retrieval sources, and Collaborative Emotional Modeling (CoEM) with five-stage task decomposition combining retrieval augmentation and limited knowledge injection.

Result: Both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward practical real-world EI applications.

Conclusion: LongEmotion benchmark enables better evaluation of emotional intelligence in realistic long-context scenarios, with proposed methods showing significant performance improvements.

Abstract: Large language models (LLMs) make significant progress in Emotional Intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially under realistic, practical settings where interactions are lengthy, diverse, and often noisy. To move towards such realistic settings, we present LongEmotion, a benchmark specifically designed for long-context EI tasks. It covers a diverse set of tasks, including Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, with long-form generation required for Emotion Expression. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them with standard prompt-based methods. Unlike conventional approaches, our RAG method leverages both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The CoEM method further improves performance by decomposing the task into five stages, integrating both retrieval augmentation and limited knowledge injection. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical and real-world EI applications. Furthermore, we conducted a comparative case study experiment on the GPT series to demonstrate the differences among various models in terms of EI. Code is available on GitHub at https://github.com/LongEmotion/LongEmotion, and the project page can be found at https://longemotion.github.io/.

[21] AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training

Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski

Main category: cs.CL

TL;DR: Multilingual XLM-RoBERTa-Large model achieves best performance for detecting positive supportive language (candy speech) in German YouTube comments through span-level training.

Motivation: Automated detection of positive, supportive online communication (candy speech) is underexplored but important for fostering civility and enabling systematic analysis of its impact.

Method: Used monolingual (GBERT) and multilingual (Qwen3 Embedding, XLM-RoBERTa) language models on 46k German YouTube comments, with XLM-RoBERTa-Large trained for span-level candy speech detection.
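Span-level training typically casts the task as token classification. Below is a small helper converting character-level span annotations into BIO labels; this is an assumed preprocessing step for illustration, not code from the submission.

```python
def spans_to_bio(tokens, spans):
    """Convert character-level span annotations into token-level BIO
    labels. `tokens` are (text, start, end) triples; `spans` are
    (start, end, label) candy-speech spans."""
    labels = []
    for _, t_start, t_end in tokens:
        tag = "O"
        for s_start, s_end, s_label in spans:
            if t_start >= s_start and t_end <= s_end:
                tag = ("B-" if t_start == s_start else "I-") + s_label
                break
        labels.append(tag)
    return labels

tokens = [("You", 0, 3), ("rock", 4, 8), ("!", 8, 9)]
print(spans_to_bio(tokens, [(0, 9, "positive")]))
# ['B-positive', 'I-positive', 'I-positive']
```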

Result: XLM-RoBERTa-Large outperformed other models, achieving F1 scores of 0.8906 for binary detection and 0.6307 for categorized span-based detection, ranking first in GermEval 2025 Shared Task.

Conclusion: Multilingual models with span-based training and emoji-aware tokenizers are effective for detecting positive supportive language, demonstrating the value of multilingual capabilities in this task.

Abstract: Positive, supportive online communication in social media (candy speech) has the potential to foster civility, yet automated detection of such language remains underexplored, limiting systematic analysis of its impact. We investigate how candy speech can be reliably detected in a 46k-comment German YouTube corpus by monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual XLM-RoBERTa-Large model trained to detect candy speech at the span level outperforms other approaches, ranking first in both the binary detection (positive F1: 0.8906) and categorized span-based detection (strict F1: 0.6307) subtasks at the GermEval 2025 Shared Task on Candy Speech Detection. We speculate that span-based training, multilingual capabilities, and emoji-aware tokenizers improved detection performance. Our results demonstrate the effectiveness of multilingual models in identifying positive, supportive language.

[22] Understanding Stigmatizing Language Lexicons: A Comparative Analysis in Clinical Contexts

Yiliang Zhou, Di Hu, Tianchu Lyu, Jasmine Dhillon, Alexandra L. Beck, Gelareh Sadigh, Kai Zheng

Main category: cs.CL

TL;DR: Systematic review of stigmatizing language lexicons in healthcare found moderate similarity between existing lexicons, with most stigmatizing terms being clinician judgmental expressions about perceived negative behaviors, highlighting need for standardization.

Motivation: Stigmatizing language contributes to healthcare inequities, but there is no universally accepted lexicon defining what constitutes stigmatizing language in healthcare settings.

Method: Conducted systematic literature search to identify existing stigmatizing language lexicons, then performed comparative analysis examining similarities/discrepancies and sentiment distribution using established sentiment dataset.

Result: Identified four lexicons with moderate semantic similarity; most stigmatizing terms were clinician judgmental expressions about perceived negative behaviors; sentiment analysis showed predominant negative classification with variations across lexicons.

Conclusion: Findings underscore the need for a standardized lexicon and highlight challenges in defining stigmatizing language in clinical texts.

Abstract: Stigmatizing language results in healthcare inequities, yet there is no universally accepted or standardized lexicon defining which words, terms, or phrases constitute stigmatizing language in healthcare. We conducted a systematic search of the literature to identify existing stigmatizing language lexicons and then analyzed them comparatively to examine: 1) similarities and discrepancies between these lexicons, and 2) the distribution of positive, negative, or neutral terms based on an established sentiment dataset. Our search identified four lexicons. The analysis results revealed moderate semantic similarity among them, and that most stigmatizing terms are related to judgmental expressions by clinicians to describe perceived negative behaviors. Sentiment analysis showed a predominant proportion of negatively classified terms, though variations exist across lexicons. Our findings underscore the need for a standardized lexicon and highlight challenges in defining stigmatizing language in clinical texts.

[23] From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation

Mardiyyah Oduwole, Oluwatosin Olajide, Jamiu Suleiman, Faith Hunja, Busayo Awobade, Fatimo Adebanjo, Comfort Akanni, Chinonyelum Igwe, Peace Ododo, Promise Omoigui, Steven Kolawole, Abraham Owodunni

Main category: cs.CL

TL;DR: Data augmentation techniques (sentence concatenation with back translation and switch-out) significantly improve machine translation performance for low-resource African languages, with minimum 25% BLEU score increase across six languages.

Motivation: Linguistic diversity in Africa presents challenges for machine translation, particularly for low-resource languages that lack sufficient training data.

Method: Applied two data augmentation techniques: sentence concatenation with back translation and switch-out across six African languages to enhance translation systems.
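Both techniques are simple to state in code. A sketch of switch-out and of concatenating an original pair with a back-translated one; the token lists and vocabulary are illustrative placeholders, not the study's data.

```python
import random

def switch_out(tokens, vocab, tau=0.1):
    """Switch-out augmentation: independently replace each token with a
    random vocabulary word with probability `tau`."""
    return [random.choice(vocab) if random.random() < tau else t
            for t in tokens]

def concat_with_back_translation(pair, bt_pair):
    """Sentence concatenation: join an original (src, tgt) pair with a
    back-translated pair to form one longer training example."""
    return (pair[0] + " " + bt_pair[0], pair[1] + " " + bt_pair[1])

vocab = ["maji", "chakula", "nyumba"]  # toy vocabulary
print(switch_out("habari ya leo".split(), vocab, tau=0.3))
```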

Result: Significant improvements in machine translation performance with minimum 25% increase in BLEU score across all six tested languages.

Conclusion: Data augmentation techniques show strong potential for improving machine translation systems for low-resource African languages, contributing to more robust translation systems for under-resourced languages.

Abstract: The linguistic diversity across the African continent presents different challenges and opportunities for machine translation. This study explores the effects of data augmentation techniques in improving translation systems in low-resource African languages. We focus on two data augmentation techniques: sentence concatenation with back translation and switch-out, applying them across six African languages. Our experiments show significant improvements in machine translation performance, with a minimum increase of 25% in BLEU score across all six languages. We provide a comprehensive analysis and highlight the potential of these techniques to improve machine translation systems for low-resource languages, contributing to the development of more robust translation systems for under-resourced languages.

[24] HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention

Saumya Goswami, Siddharth Kurra

Main category: cs.CL

TL;DR: HALT-RAG is a post-hoc verification system that detects hallucinations in RAG outputs using NLI models and lexical features with a task-adapted meta-classifier, achieving strong performance on summarization, QA, and dialogue tasks.

Motivation: To address the critical challenge of detecting content that contradicts or is unsupported by source text in generative language models, ensuring safe deployment by identifying hallucinations in RAG pipeline outputs.

Method: Uses a flexible framework with universal features from frozen NLI models and lexical signals, trained with a 5-fold out-of-fold protocol to prevent data leakage, with a calibrated meta-classifier and precision-constrained decision policy.

Result: Achieves strong OOF F1-scores: 0.7756 (summarization), 0.9786 (QA), and 0.7391 (dialogue) on HaluEval benchmark, with well-calibrated probabilities enabling practical abstention.

Conclusion: HALT-RAG provides a reliable tool for balancing model performance with safety requirements through effective hallucination detection and calibrated abstention mechanisms.

Abstract: Detecting content that contradicts or is unsupported by a given source text is a critical challenge for the safe deployment of generative language models. We introduce HALT-RAG, a post-hoc verification system designed to identify hallucinations in the outputs of Retrieval-Augmented Generation (RAG) pipelines. Our flexible and task-adaptable framework uses a universal feature set derived from an ensemble of two frozen, off-the-shelf Natural Language Inference (NLI) models and lightweight lexical signals. These features are used to train a simple, calibrated, and task-adapted meta-classifier. Using a rigorous 5-fold out-of-fold (OOF) training protocol to prevent data leakage and produce unbiased estimates, we evaluate our system on the HaluEval benchmark. By pairing our universal feature set with a lightweight, task-adapted classifier and a precision-constrained decision policy, HALT-RAG achieves strong OOF F1-scores of 0.7756, 0.9786, and 0.7391 on the summarization, QA, and dialogue tasks, respectively. The system’s well-calibrated probabilities enable a practical abstention mechanism, providing a reliable tool for balancing model performance with safety requirements.
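
A small sketch of the training and decision protocol described above: out-of-fold scoring with a linear meta-classifier over stacked features, followed by an abstention band. The feature matrix, thresholds, and choice of logistic regression are assumptions standing in for the paper's actual NLI-plus-lexical features and precision-constrained policy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def oof_scores(X, y, n_splits=5, seed=0):
    """5-fold out-of-fold protocol: every example is scored by a model
    that never saw it during training, preventing leakage."""
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        oof[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
    return oof

# Columns would hold NLI entailment/contradiction probabilities plus
# lexical-overlap signals; random numbers stand in for them here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
p = oof_scores(X, y)

# Abstention band: act only when the calibrated probability is decisive.
t_hi, t_lo = 0.8, 0.2
decision = np.where(p >= t_hi, "hallucinated",
                    np.where(p <= t_lo, "supported", "abstain"))
print(decision[:10])
```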

[25] ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval

Zihan Chen, Lei Shi, Weize Wu, Qiji Zhou, Yue Zhang

Main category: cs.CL

TL;DR: ALLabel is a three-stage active learning framework that selects the most informative samples for LLM entity recognition, achieving comparable performance to full annotation with only 5-10% of data.

DetailsMotivation: Traditional fine-tuning of LLMs for entity recognition in scientific domains incurs high costs, requiring a more efficient approach to sample selection and annotation.

Method: A three-stage framework using active learning strategies to select informative samples, construct a ground-truth retrieval corpus, and enable LLM in-context learning with minimal annotation.

Result: ALLabel consistently outperforms baselines under the same annotation budget across three specialized domain datasets, achieving full-dataset performance with only 5-10% annotation.

Conclusion: The framework provides an effective and generalizable solution for cost-efficient entity recognition in scientific domains using LLMs with minimal annotation requirements.

Abstract: Many contemporary data-driven research efforts in the natural sciences, such as chemistry and materials science, require large-scale, high-performance entity recognition from scientific datasets. Large language models (LLMs) have increasingly been adopted to solve the entity recognition task, mirroring the trend observed across the full spectrum of NLP tasks. Prevailing entity recognition LLMs rely on fine-tuning, yet the fine-tuning process often incurs significant cost. To achieve a better performance-cost trade-off, we propose ALLabel, a three-stage framework designed to select the most informative and representative samples when preparing the demonstrations for LLM modeling. The annotated examples are used to construct a ground-truth retrieval corpus for LLM in-context learning. By sequentially employing three distinct active learning strategies, ALLabel consistently outperforms all baselines under the same annotation budget across three specialized domain datasets. Experimental results also demonstrate that selectively annotating only 5%-10% of the dataset with ALLabel can achieve performance comparable to annotating the entire dataset. Further analyses and ablation studies verify the effectiveness and generalizability of our proposal.
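
The abstract describes three sequential active-learning strategies without detailing them, so the sketch below shows only the generic select-then-retrieve pattern they plug into: diversity-driven selection of a small annotation set, then nearest-neighbour retrieval of annotated examples as in-context demonstrations. The k-means selection and cosine retrieval are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def select_for_annotation(embeddings, budget):
    """Pick `budget` diverse examples: cluster the unlabeled pool and
    annotate the example nearest each centroid."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(embeddings)
    picks = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[dists.argmin()]))
    return picks

def retrieve_demonstrations(query_emb, corpus_emb, k=3):
    """At inference time, fetch the k nearest annotated examples to use
    as in-context demonstrations."""
    sims = cosine_similarity(query_emb[None, :], corpus_emb)[0]
    return sims.argsort()[::-1][:k]

pool = np.random.default_rng(0).normal(size=(500, 64))
annotate_ids = select_for_annotation(pool, budget=25)   # ~5% of the pool
print(retrieve_demonstrations(pool[0], pool[annotate_ids], k=3))
```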

[26] VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang

Main category: cs.CL

TL;DR: VeriOS-Agent is a trustworthy OS agent that uses human queries in untrustworthy GUI environments, improving success rates by 20.64% without affecting normal performance.

DetailsMotivation: Existing OS agents are designed for ideal settings but real-world environments often present untrustworthy conditions, creating risks of over-execution.

Method: Proposes a query-driven human-agent-GUI interaction framework with two-stage learning paradigm that decouples and utilizes meta-knowledge. The agent autonomously executes actions normally but proactively queries humans in untrustworthy scenarios.

Result: Improves average step-wise success rate by 20.64% in untrustworthy scenarios over state-of-the-art methods while maintaining normal performance.

Conclusion: VeriOS-Agent demonstrates rationality, generalizability, and scalability, providing a trustworthy solution for OS agents operating in real-world untrustworthy environments.

Abstract: With the rapid progress of multimodal large language models, operating system (OS) agents have become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent autonomously executes actions under normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 20.64% in untrustworthy scenarios over the state-of-the-art, without compromising normal performance. Analysis highlights VeriOS-Agent’s rationality, generalizability, and scalability. The codes, datasets and models are available at https://github.com/Wuzheng02/VeriOS.

[27] Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition

Yi Liu, Xiangrong Zhu, Xiangyu Liu, Wei Wei, Wei Hu

Main category: cs.CL

TL;DR: IRAKE addresses edit skipping in retrieval-augmented knowledge editing for multi-hop QA by using iterative retrieval with guided decomposition from single facts and entire cases.

DetailsMotivation: Existing RAG-based knowledge editing methods struggle with multi-hop question answering due to edit skipping, where models skip relevant edited facts during inference, and the mismatch between LLM problem-solving granularity and edited fact granularity.

Method: Proposes Iterative Retrieval-Augmented Knowledge Editing (IRAKE) with guided decomposition using guidance from both single edited facts and entire edited cases to address edit skipping.

Result: Experimental results show IRAKE mitigates editing failures caused by edit skipping and outperforms state-of-the-art knowledge editing methods for multi-hop question answering.

Conclusion: IRAKE effectively addresses the edit skipping problem in knowledge editing for complex multi-hop reasoning tasks, providing superior performance compared to existing methods.

Abstract: In a rapidly evolving world where information updates swiftly, knowledge in large language models (LLMs) becomes outdated quickly. Retraining LLMs is not a cost-effective option, making knowledge editing (KE) without modifying parameters particularly necessary. We find that although existing retrieval-augmented generation (RAG)-based KE methods excel at editing simple knowledge, they struggle with KE in multi-hop question answering due to the issue of “edit skipping”, which refers to skipping the relevant edited fact in inference. In addition to the diversity of natural language expressions of knowledge, edit skipping also arises from the mismatch between the granularity of LLMs in problem-solving and the facts in the edited memory. To address this issue, we propose a novel Iterative Retrieval-Augmented Knowledge Editing method with guided decomposition (IRAKE) through the guidance from single edited facts and entire edited cases. Experimental results demonstrate that IRAKE mitigates the failure of editing caused by edit skipping and outperforms state-of-the-art methods for KE in multi-hop question answering.

[28] BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment

Andrey Sakhovskiy, Elena Tutubalina

Main category: cs.CL

TL;DR: BALI is a novel joint pre-training method that aligns biomedical language models with knowledge graphs to improve comprehension of domain-specific concepts and factual information.

DetailsMotivation: Existing biomedical LLMs show limited understanding of complex domain-specific concept structures and factual information in biomedical knowledge graphs, despite progress in biomedical text understanding.

Method: Proposes BALI method that simultaneously learns a KG encoder and aligns LM and KG representations by linking biomedical concepts to UMLS KG and using local KG subgraphs as cross-modal positive samples.

Result: Implementation on leading biomedical LMs (PubMedBERT, BioLinkBERT) improves performance on language understanding tasks and entity representation quality, even with minimal pre-training on small alignment datasets from PubMed abstracts.

Conclusion: The BALI framework effectively enhances biomedical language models by integrating external knowledge from knowledge graphs through joint pre-training, demonstrating improved performance across various tasks with minimal training data.

Abstract: In recent years, there has been substantial progress in using pretrained Language Models (LMs) on a range of tasks aimed at improving the understanding of biomedical texts. Nonetheless, existing biomedical LLMs show limited comprehension of complex, domain-specific concept structures and the factual information encoded in biomedical Knowledge Graphs (KGs). In this work, we propose BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel joint LM and KG pre-training method that augments an LM with external knowledge by simultaneously learning a dedicated KG encoder and aligning the representations of both the LM and the graph. For a given textual sequence, we link biomedical concept mentions to the Unified Medical Language System (UMLS) KG and utilize local KG subgraphs as cross-modal positive samples for these mentions. Our empirical findings indicate that implementing our method on several leading biomedical LMs, such as PubMedBERT and BioLinkBERT, improves their performance on a range of language understanding tasks and the quality of entity representations, even with minimal pre-training on a small alignment dataset sourced from PubMed scientific abstracts.
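
The abstract specifies joint pre-training that aligns LM mention representations with KG subgraph representations but not the exact objective. A standard realization is a symmetric InfoNCE loss with in-batch negatives, sketched below; the embedding dimension and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def alignment_loss(text_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th concept mention (LM side) should match
    the i-th UMLS subgraph embedding (KG side) within the batch, with all
    other pairs serving as negatives."""
    t = F.normalize(text_emb, dim=-1)
    g = F.normalize(graph_emb, dim=-1)
    logits = t @ g.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(t.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = alignment_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```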

[29] MaLei at MultiClinSUM: Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs

Libo Ren, Yee Man Ng, Lifeng Han

Main category: cs.CL

TL;DR: This paper presents a perspective-aware iterative self-prompting (PA-ISP) technique using LLMs for clinical report summarization, achieving high semantic similarity scores despite lower lexical overlap.

DetailsMotivation: Clinical reports are often lengthy and filled with jargon, making it difficult for domain experts to efficiently identify important information, which hinders effective patient-clinician communication and shared decision-making.

Method: Used Iterative Self-Prompting technique on LLMs (GPT-4 and GPT-4o) to generate task-specific prompts and refine them via example-based few-shot learning. Employed ROUGE and BERT-score metrics to guide model fine-tuning across epochs.

Result: Achieved ROUGE scores (46.53, 24.68, 30.77) and BERTscores (87.84, 83.25, 85.46) for (P, R, F1) on 3,396 clinical case reports. High BERTscore indicates semantically equivalent output despite lower ROUGE scores.

Conclusion: Perspective-aware ISP can be effectively deployed for clinical report summarization to support better communication between patients and clinicians, with semantic understanding being more important than exact lexical matching.

Abstract: Efficient communication between patients and clinicians plays an important role in shared decision-making. However, clinical reports are often lengthy and filled with clinical jargon, making it difficult for domain experts to identify important aspects in the document efficiently. This paper presents the methodology we applied in the MultiClinSUM shared task for summarising clinical case documents. We used an Iterative Self-Prompting technique on large language models (LLMs) by asking LLMs to generate task-specific prompts and refine them via example-based few-shot learning. Furthermore, we used lexical and embedding space metrics, ROUGE and BERT-score, to guide the model fine-tuning across epochs. Our submission using perspective-aware ISP on GPT-4 and GPT-4o achieved ROUGE scores (46.53, 24.68, 30.77) and BERTscores (87.84, 83.25, 85.46) for (P, R, F1) from the official evaluation on 3,396 clinical case reports from various specialties extracted from open journals. The high BERTscore indicates that the model produced semantically equivalent output summaries compared to the references, even though the overlap at the exact lexicon level is lower, as reflected in the lower ROUGE scores. This work sheds some light on how perspective-aware ISP (PA-ISP) can be deployed for clinical report summarisation and support better communication between patients and clinicians.

[30] MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

Xixi Wu, Yanchao Tan, Nan Hou, Ruiyang Zhang, Hong Cheng

Main category: cs.CL

TL;DR: MoLoRAG is a logic-aware retrieval framework that combines semantic and logical relevance for multi-modal, multi-page document understanding, improving DocQA accuracy by 9.68% over LVLMs.

DetailsMotivation: Traditional text conversion methods strip multi-modal information, while LVLMs have constrained input size and RAG methods ignore logical connections between pages, making multi-page document comprehension challenging.

Method: Constructs a page graph to capture contextual relationships, uses lightweight VLM for graph traversal to retrieve semantically and logically relevant pages, then feeds top-K pages to LVLMs for QA with training-free and fine-tuned variants.

Result: Achieves average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines on four DocQA datasets.

Conclusion: MoLoRAG effectively addresses the limitations of existing methods by incorporating both semantic and logical relevance, providing a flexible framework for multi-modal, multi-page document understanding with significant performance improvements.

Abstract: Document Understanding is a foundational AI capability with broad applications, and Document Question Answering (DocQA) is a key evaluation task. Traditional methods convert the document into text for processing by Large Language Models (LLMs), but this process strips away critical multi-modal information like figures. While Large Vision-Language Models (LVLMs) address this limitation, their constrained input size makes multi-page document comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate this by selecting relevant pages, but they rely solely on semantic relevance, ignoring logical connections between pages and the query, which is essential for reasoning. To this end, we propose MoLoRAG, a logic-aware retrieval framework for multi-modal, multi-page document understanding. By constructing a page graph that captures contextual relationships between pages, a lightweight VLM performs graph traversal to retrieve relevant pages, including those with logical connections often overlooked. This approach combines semantic and logical relevance to deliver more accurate retrieval. After retrieval, the top-K pages are fed into arbitrary LVLMs for question answering. To enhance flexibility, MoLoRAG offers two variants: a training-free solution for easy deployment and a fine-tuned version to improve logical relevance checking. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Codes and datasets are released at https://github.com/WxxShirley/MoLoRAG.

[31] M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models

Zexuan Li, Hongliang Dai, Piji Li

Main category: cs.CL

TL;DR: M-BRe framework combines multi-class and binary classification approaches to efficiently extract high-quality training instances for relation extraction from unlabeled texts using LLMs.

DetailsMotivation: Manual annotation for relation extraction is expensive due to scarcity of relevant sentences. LLMs struggle with comprehensive relation semantics in multi-class classification and face computational overhead in binary classification.

Method: Proposes M-BRe framework with three modules: Relation Grouping, Relation Extraction, and Label Decision to combine advantages of both multi-class and binary classification approaches.

Result: Extensive experiments confirm superior capability in discovering high-quality training samples from unlabeled texts for relation extraction.

Conclusion: M-BRe effectively addresses the limitations of both multi-class and binary classification approaches for relation extraction using LLMs, providing an efficient solution for training data extraction.

Abstract: For Relation Extraction (RE), the manual annotation of training data may be prohibitively expensive, since the sentences that contain the target relations in texts can be very scarce and difficult to find. It is therefore beneficial to develop an efficient method that can automatically extract training instances from unlabeled texts for training RE models. Recently, large language models (LLMs) have been adopted in various natural language processing tasks, with RE also benefiting from their advances. However, when leveraging LLMs for RE with predefined relation categories, two key challenges arise. First, in a multi-class classification setting, LLMs often struggle to comprehensively capture the semantics of every relation, leading to suboptimal results. Second, although employing binary classification for each relation individually can mitigate this issue, it introduces significant computational overhead, resulting in impractical time complexity for real-world applications. Therefore, this paper proposes a framework called M-BRe to extract training instances from unlabeled texts for RE. It utilizes three modules to combine the advantages of both of the above classification approaches: Relation Grouping, Relation Extraction, and Label Decision. Extensive experiments confirm its superior capability in discovering high-quality training samples from unlabeled texts for RE.
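
A schematic of how the three named modules could fit together, with a hypothetical `llm` function standing in for a real model call; the prompt wording and group size are invented for illustration, and the paper's actual modules are more elaborate.

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the language model."""
    raise NotImplementedError

def m_bre_label(sentence, relations, group_size=4):
    # Relation Grouping: split the label space into small groups so each
    # multi-class prompt stays semantically tractable.
    groups = [relations[i:i + group_size]
              for i in range(0, len(relations), group_size)]
    winners = []
    for group in groups:
        # Relation Extraction: multi-class classification within one group.
        answer = llm(f"Sentence: {sentence}\n"
                     f"Which of these relations holds, or 'none'? {group}")
        if answer in group:
            winners.append(answer)
    if not winners:
        return None
    # Label Decision: arbitrate among the per-group winners.
    return llm(f"Sentence: {sentence}\n"
               f"Pick the single best relation from {winners}.")
```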

[32] Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts

Rochana Prih Hastuti, Rian Adam Rajagede, Mansour Al Ghanim, Mengxin Zheng, Qian Lou

Main category: cs.CL

TL;DR: Watermarking in medical LLMs compromises factual accuracy, requiring domain-specific evaluation and new approaches to preserve medical integrity.

DetailsMotivation: As LLMs are adapted to sensitive medical domains, watermarking techniques used for provenance and accountability need testing for reliability in medical contexts where factual accuracy is critical.

Method: Proposed a medical-focused evaluation workflow using GPT-Judger and human validation to assess both factual accuracy and coherence, introducing Factuality-Weighted Score (FWS) as a composite metric.

Result: Current watermarking methods substantially compromise medical factuality, with entropy shifts degrading medical entity representation in low-entropy settings.

Conclusion: There is a critical need for domain-aware watermarking approaches that preserve the integrity of medical content beyond just detection-quality tradeoffs.

Abstract: As large language models (LLMs) are adapted to sensitive domains such as medicine, their fluency raises safety risks, particularly regarding provenance and accountability. Watermarking embeds detectable patterns to mitigate these risks, yet its reliability in medical contexts remains untested. Existing benchmarks focus on detection-quality tradeoffs, overlooking factual risks under low-entropy settings often exploited by watermarking’s reweighting strategy. We propose a medical-focused evaluation workflow that jointly assesses factual accuracy and coherence. Using GPT-Judger and further human validation, we introduce the Factuality-Weighted Score (FWS), a composite metric prioritizing factual accuracy beyond coherence to guide watermarking deployment in medical domains. Our evaluation shows current watermarking methods substantially compromise medical factuality, with entropy shifts degrading medical entity representation. These findings underscore the need for domain-aware watermarking approaches that preserve the integrity of medical content.
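
The abstract does not give the formula for the Factuality-Weighted Score, so the snippet below shows only the general shape such a composite can take; the weight is an assumed value, not the paper's parameterization.

```python
def factuality_weighted_score(factuality: float, coherence: float,
                              w: float = 0.75) -> float:
    """Composite judger score that weights factual accuracy above
    coherence; w = 0.75 is an illustrative assumption."""
    return w * factuality + (1.0 - w) * coherence

print(factuality_weighted_score(factuality=0.62, coherence=0.91))  # 0.6925
```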

[33] Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning

Michele Joshua Maggini, Dhia Merzougui, Rabiraj Bandyopadhyay, Gaël Dias, Fabrice Maurel, Pablo Gamallo

Main category: cs.CL

TL;DR: Comprehensive benchmarking of LLMs for detecting fake news, harmful content, and political bias across 10 datasets and 5 languages, showing fine-tuning outperforms in-context learning even with smaller models.

DetailsMotivation: Address the lack of proper benchmarking of large language models for detecting harmful online content across different models, usage methods, and languages.

Method: Tested various adaptation paradigms including parameter-efficient fine-tuning and multiple in-context learning strategies (zero-shot, codebooks, few-shot with random/diverse selection, Chain-of-Thought) across 10 datasets in 5 languages.

Result: In-context learning often underperforms compared to fine-tuning, even when using smaller fine-tuned models versus larger in-context learning models like LlaMA3.1-8b-Instruct, Mistral-Nemo-Instruct-2407, and Qwen2.5-7B-Instruct.

Conclusion: Fine-tuning smaller models on task-specific settings is more effective than using larger models with in-context learning for detecting harmful online content.

Abstract: The spread of fake news and of polarizing, politically biased, and harmful content on online platforms has been a serious concern. Although large language models have become a promising approach, no study has properly benchmarked their performance across different models, usage methods, and languages. This study presents a comprehensive overview of different Large Language Model adaptation paradigms for the detection of hyperpartisan and fake news, harmful tweets, and political bias. Our experiments spanned 10 datasets and 5 different languages (English, Spanish, Portuguese, Arabic and Bulgarian), covering both binary and multiclass classification scenarios. We tested strategies ranging from parameter-efficient fine-tuning of language models to a variety of in-context learning strategies and prompts. These included zero-shot prompts, codebooks, few-shot prompting (with both randomly-selected and diversely-selected examples using Determinantal Point Processes), and Chain-of-Thought. We discovered that in-context learning often underperforms fine-tuning. This main finding highlights the importance of fine-tuning even smaller models on task-specific settings, even when compared against the largest models evaluated in an in-context learning setup - in our case, LlaMA3.1-8b-Instruct, Mistral-Nemo-Instruct-2407 and Qwen2.5-7B-Instruct.

[34] SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP

Decheng Duan, Yingyi Zhang, Jitong Peng, Chengzhi Zhang

Main category: cs.CL

TL;DR: SciNLP is a new benchmark dataset for full-text entity and relation extraction in NLP domain, containing 60 annotated publications with 7,072 entities and 1,826 relations, enabling improved knowledge graph construction.

DetailsMotivation: Existing datasets for scientific information extraction focus only on specific publication sections due to domain complexity and high annotation costs, limiting comprehensive understanding of scientific literature.

Method: Created SciNLP dataset with 60 manually annotated full-text NLP publications, then conducted comparative experiments with state-of-the-art supervised models and performed cross-dataset evaluations.

Result: Models show varying extraction capabilities across different text lengths, and SciNLP achieves significant performance improvements on certain baseline models. The constructed knowledge graph has rich semantic topology with average node degree of 3.2 per entity.

Conclusion: SciNLP is the first full-text annotated dataset for entity and relation extraction in NLP domain, enabling better knowledge graph construction and enhancing downstream applications in scientific literature analysis.

Abstract: Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP - a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph for the NLP domain. Our KG has an average node degree of 3.2 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at https://github.com/AKADDC/SciNLP.

[35] Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost

Mihai Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran

Main category: cs.CL

TL;DR: TF2 is a framework for English-Romanian literary translation featuring a fine-tuned 12B model and synthetic datasets, achieving competitive performance with proprietary models while being open and cost-effective.

DetailsMotivation: Address the lack of high-quality literary translation capabilities in small open models and the need for rich literary datasets in low-resource languages like Romanian.

Method: Two-stage fine-tuning process: (1) instruction tuning for genre-specific narrative style, (2) adapter compression for efficient deployment. Uses synthetic dataset generation with high-performing LLM and combines BLEU with 5-dimension LLM-based rubric evaluation.

Result: The fine-tuned model achieves fluency and adequacy competitive with top proprietary models while being open, accessible, and significantly more cost-effective.

Conclusion: TF2 provides an end-to-end reproducible pipeline for cost-efficient translation and cross-lingual narrative generation, enabling broad adoption of open models for literary content in low-resource settings.

Abstract: Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, translation by small open models remains an open problem. We contribute to this ongoing research by introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine-tuning, and evaluation in English-Romanian literary translation, centred on the creation and open release of both a compact, fine-tuned language model (TF2-12B) and large-scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the largest collection of synthetic English fables to date, we address the need for rich, high-quality literary datasets in low-resource languages such as Romanian. Our pipeline first generates 15k high-quality Romanian references from the TF1 pool using a high-performing LLM. We then apply a two-stage fine-tuning process to a 12B-parameter open-weight model: (i) instruction tuning to capture genre-specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus-level BLEU and a five-dimension LLM-based rubric (accuracy, fluency, coherence, style, cultural adaptation) to provide a nuanced assessment of translation quality. Results show that our fine-tuned model achieves fluency and adequacy competitive with top-performing large proprietary models, while being open, accessible, and significantly more cost-effective. Alongside the fine-tuned model and both datasets, we publicly release all scripts and evaluation prompts. TF2 thus provides an end-to-end, reproducible pipeline for research on cost-efficient translation, cross-lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low-resource settings.

[36] Are Humans as Brittle as Large Language Models?

Jiahui Li, Sean Papay, Roman Klinger

Main category: cs.CL

TL;DR: This paper investigates whether human annotators show similar sensitivity to instruction changes as LLMs, comparing prompt brittleness effects between humans and LLMs across text classification tasks.

DetailsMotivation: To determine if prompt brittleness in LLMs is problematic or if it correctly reflects human annotation variances, by systematically comparing the effects of prompt modifications on both humans and LLMs.

Method: Prompt both humans and LLMs for text classification tasks with various prompt modifications, including alternative label sets, label formats, typographical errors, and reversed label order.

Result: Both humans and LLMs exhibit increased brittleness to specific prompt modifications (alternative label sets and formats), but humans are less affected by typographical errors and reversed label order than LLMs.

Conclusion: Human judgments show similar sensitivity to certain prompt modifications as LLMs, suggesting prompt brittleness may reflect inherent annotation variance rather than being unique to LLMs.

Abstract: The output of large language models (LLMs) is unstable, due both to the non-determinism of the decoding process and to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to instruction changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.

[37] From Detection to Mitigation: Addressing Gender Bias in Chinese Texts via Efficient Tuning and Voting-Based Rebalancing

Chengyan Wu, Yiqiang Cai, Yufei Cheng, Yun Xue

Main category: cs.CL

TL;DR: Fine-tuned LLMs with LoRA for Chinese gender bias detection, using balanced data, majority voting, and multi-temperature sampling to achieve 47.90% average score (4th place).

DetailsMotivation: To promote fairness and controllability in natural language generation by automatically detecting, classifying, and mitigating gender bias in Chinese sentences.

Method: Fine-tuned large language models using Low-Rank Adaptation (LoRA), constructed balanced training sets, used majority voting with multiple expert models, and implemented multi-temperature sampling for bias detection and mitigation.

Result: Achieved an average score of 47.90% in the shared task, ranking fourth overall in performance.

Conclusion: The approach demonstrates effectiveness in Chinese gender bias detection, classification, and mitigation, showing promise for promoting fairness in NLP applications.

Abstract: This paper presents our team’s solution to Shared Task 7 of NLPCC-2025, which focuses on sentence-level gender bias detection and mitigation in Chinese. The task aims to promote fairness and controllability in natural language generation by automatically detecting, classifying, and mitigating gender bias. To address this challenge, we adopt a fine-tuning approach based on large language models (LLMs), adapting them efficiently to the bias detection task via Low-Rank Adaptation (LoRA). In terms of data processing, we construct a more balanced training set to alleviate class imbalance and introduce heterogeneous samples from multiple sources to enhance model generalization. For the detection and classification sub-tasks, we employ a majority voting strategy that integrates outputs from multiple expert models to boost performance. Additionally, to improve the detection and mitigation of biased generations, we design a multi-temperature sampling mechanism to capture potential variations in bias expression styles. Experimental results demonstrate the effectiveness of our approach in bias detection, classification, and mitigation. Our method ultimately achieves an average score of 47.90%, ranking fourth in the shared task.
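
The voting and sampling machinery reduces to a few lines; the expert-model names and temperature grid below are placeholders, and `classify` stands in for a real inference call against one fine-tuned expert.

```python
from collections import Counter

def classify(text: str, model: str, temperature: float) -> str:
    """Hypothetical call into one fine-tuned expert model."""
    raise NotImplementedError

def voted_label(text,
                experts=("expert-lora-a", "expert-lora-b", "expert-lora-c"),
                temperatures=(0.2, 0.7, 1.0)):
    # Each expert votes at several temperatures to capture variation in
    # how bias can be expressed; the majority label wins.
    votes = [classify(text, m, t) for m in experts for t in temperatures]
    return Counter(votes).most_common(1)[0][0]
```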

[38] Biased Tales: Cultural and Topic Bias in Generating Children’s Stories

Donya Rooein, Vilém Zouhar, Debora Nozza, Dirk Hovy

Main category: cs.CL

TL;DR: Biased Tales dataset reveals significant gender and cultural stereotypes in LLM-generated children’s stories, with 55.26% more appearance-focused attributes for female protagonists and disproportionate emphasis on heritage themes for non-Western characters.

DetailsMotivation: Address concerns about cultural and gender stereotypes in LLM-generated bedtime stories as parents increasingly rely on AI for storytelling, particularly for children who are highly impressionable.

Method: Created Biased Tales dataset to systematically analyze how biases influence protagonists’ attributes and story elements in LLM-generated narratives across different gender and cultural contexts.

Result: Found striking disparities: 55.26% increase in appearance-related attributes for female protagonists; non-Western stories disproportionately emphasize cultural heritage, tradition, and family themes compared to Western stories.

Conclusion: Highlights the critical role of addressing sociocultural bias in AI storytelling to ensure more equitable and diverse creative AI applications, especially for children’s content.

Abstract: Stories play a pivotal role in human communication, shaping beliefs and morals, particularly in children. As parents increasingly rely on large language models (LLMs) to craft bedtime stories, the presence of cultural and gender stereotypes in these narratives raises significant concerns. To address this issue, we present Biased Tales, a comprehensive dataset designed to analyze how biases influence protagonists’ attributes and story elements in LLM-generated stories. Our analysis uncovers striking disparities. When the protagonist is described as a girl (as compared to a boy), appearance-related attributes increase by 55.26%. Stories featuring non-Western children disproportionately emphasize cultural heritage, tradition, and family themes far more than those for Western children. Our findings highlight the need to address sociocultural bias so that creative uses of AI become more equitable and diverse.

[39] GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models

Tuo Wang, Adithya Kulkarni, Tyler Cody, Peter A. Beling, Yujun Yan, Dawei Zhou

Main category: cs.CL

TL;DR: GENUINE is a graph-enhanced uncertainty estimation framework for LLMs that uses dependency parse trees and hierarchical graph pooling to capture semantic and structural relationships, outperforming existing methods by up to 29% in AUROC.

DetailsMotivation: Existing uncertainty estimation methods for LLMs overlook semantic dependencies and rely on token-level probability measures that fail to capture structural relationships within generated text, limiting reliability in high-stakes applications.

Method: Proposes GENUINE framework that leverages dependency parse trees and hierarchical graph pooling to model semantic and structural relationships, incorporating supervised learning for refined uncertainty quantification.

Result: Extensive experiments show GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration errors by over 15% across various NLP tasks.

Conclusion: Graph-based uncertainty modeling through GENUINE effectively improves confidence assessments in LLMs, demonstrating the importance of capturing structural relationships for reliable uncertainty estimation.

Abstract: Uncertainty estimation is essential for enhancing the reliability of Large Language Models (LLMs), particularly in high-stakes applications. Existing methods often overlook semantic dependencies, relying on token-level probability measures that fail to capture structural relationships within the generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty Estimation for Large Language Models, a structure-aware framework that leverages dependency parse trees and hierarchical graph pooling to refine uncertainty quantification. By incorporating supervised learning, GENUINE effectively models semantic and structural relationships, improving confidence assessments. Extensive experiments across NLP tasks show that GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration errors by over 15%, demonstrating the effectiveness of graph-based uncertainty modeling. The code is available at https://github.com/ODYSSEYWT/GUQ.
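
A minimal sketch of the structural side of such a pipeline: parse a generated answer into a dependency graph that a pooling network could then consume. It assumes spaCy with its small English model installed; the hierarchical pooling and the supervised uncertainty head are omitted.

```python
import networkx as nx
import spacy  # assumes: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def dependency_graph(text: str) -> nx.Graph:
    """Nodes are tokens, edges are dependency arcs -- the structure a
    graph-based uncertainty model operates on."""
    doc = nlp(text)
    g = nx.Graph()
    for tok in doc:
        g.add_node(tok.i, text=tok.text, pos=tok.pos_)
        if tok.head.i != tok.i:          # the root token heads itself
            g.add_edge(tok.i, tok.head.i, dep=tok.dep_)
    return g

g = dependency_graph("The capital of France is Paris.")
print(g.number_of_nodes(), g.number_of_edges())
```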

[40] SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, Dipanjan Das

Main category: cs.CL

TL;DR: SimpleQA Verified is a 1,000-prompt benchmark that improves upon OpenAI’s SimpleQA for evaluating LLM factuality, featuring better labels, reduced bias, and less redundancy.

DetailsMotivation: To address limitations in OpenAI's SimpleQA benchmark including noisy/incorrect labels, topical biases, and question redundancy that undermine reliable evaluation of LLM factuality.

Method: Created through rigorous multi-stage filtering: de-duplication, topic balancing, source reconciliation, and improvements to the autorater prompt to produce a more reliable evaluation set.

Result: Gemini 2.5 Pro achieves state-of-the-art F1-score of 55.6 on this benchmark, outperforming other frontier models including GPT-5.

Conclusion: Provides a higher-fidelity tool for tracking genuine progress in parametric model factuality and mitigating hallucinations, with benchmark dataset and evaluation code publicly available.

Abstract: We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI’s SimpleQA. It addresses critical limitations in OpenAI’s benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.

[41] Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu

Main category: cs.CL

TL;DR: Parallel-R1 is a reinforcement learning framework that trains LLMs for parallel thinking using progressive curriculum learning, achieving significant accuracy improvements on math benchmarks.

DetailsMotivation: Existing methods rely on supervised fine-tuning with synthetic data, which encourages imitation rather than exploration and generalization for parallel thinking capabilities.

Method: A progressive curriculum: first use SFT on easier tasks to instill parallel thinking, then transition to RL to explore and generalize on harder problems, addressing the cold-start problem.

Result: 8.4% accuracy improvement over sequential thinking models on challenging tasks, with up to 42.9% improvement on AIME25 benchmark. The model shows behavioral shift from exploration strategy to multi-perspective verification.

Conclusion: Parallel thinking serves as an effective mid-training exploration scaffold that unlocks higher performance ceilings after RL training, demonstrating successful instillation of parallel reasoning capabilities.

Abstract: Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model’s thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.

[42] UPLex: Fine-Grained Personality Control in Large Language Models via Unsupervised Lexical Modulation

Tianlong Li, Wenhao Liu, Muling Wu, Shihan Dou, Zhenghua Wang, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang

Main category: cs.CL

TL;DR: UPLex is a pluggable method that uses unsupervisedly-built personalized lexicons to manipulate LLM personalities during decoding, enabling fine-grained control without costly fine-tuning or manual prompting.

DetailsMotivation: Personality shapes human communication, so regulating LLM personalities can enhance user experience. Existing methods are either inefficient (fine-tuning) or imprecise (manual prompts) for fine-grained personality manipulation.

Method: UPLex constructs an Unsupervisedly-Built Personalized Lexicon (UPL) from a situational judgment test dataset, then uses it during decoding to dynamically alter word probabilities in a pluggable fashion.

Result: Extensive experiments show UPLex is remarkably effective and pluggable for fine-grained manipulation of LLM personalities.

Conclusion: UPLex provides an efficient, precise, and pluggable solution for personality manipulation in LLMs, overcoming limitations of previous approaches.

Abstract: Personality is a crucial factor that shapes human communication patterns; regulating the personalities of large language models (LLMs) therefore holds significant potential for enhancing their user experience. Previous approaches either relied on fine-tuning LLMs on specific corpora or required manually crafted prompts to evoke specific personalities from LLMs. However, the former is inefficient and costly, while the latter cannot precisely manipulate personality traits at a fine-grained level. To address these challenges, we propose UPLex, a method that uses an Unsupervisedly-Built Personalized Lexicon (UPL) during the decoding phase to manipulate an LLM’s personality traits. UPL can be constructed from a newly built situational judgment test dataset in an unsupervised fashion, and used to modulate the personality expression of LLMs by dynamically altering their predicted probability of upcoming words in a pluggable fashion. Extensive experimentation demonstrates the remarkable effectiveness and pluggability of our method for fine-grained manipulation of LLMs’ personalities.
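
The decoding-time half of the method is naturally expressed as a logits processor that shifts the scores of lexicon tokens at every step. The constant bias below is a simplification of the paper's dynamic modulation, and constructing the lexicon itself (the unsupervised part) is not shown.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class LexiconBias(LogitsProcessor):
    """Pluggable decoding intervention: boost (or, with a negative bias,
    suppress) the logits of tokens drawn from a personality lexicon."""
    def __init__(self, lexicon_token_ids, bias: float = 4.0):
        self.ids = torch.tensor(lexicon_token_ids, dtype=torch.long)
        self.bias = bias

    def __call__(self, input_ids, scores):
        scores[:, self.ids] += self.bias   # constant shift; a simplification
        return scores

# usage sketch (token ids are placeholders):
# model.generate(input_ids,
#     logits_processor=LogitsProcessorList([LexiconBias([1042, 7740])]))
```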

[43] Linearly Controlled Language Generation with Performative Guarantees

Emily Cheng, Carmen Amo Alonso

Main category: cs.CL

TL;DR: A control-theoretic approach for steering LLM text generation away from undesired meanings using lightweight, gradient-free interventions in latent space.

DetailsMotivation: Need for computationally efficient controlled language generation with performance guarantees in critical LLM applications.

Method: Uses control theory to dynamically steer token activations in embedding space away from regions corresponding to undesired meanings, with closed-form optimal controller formulation.

Result: Effective toxicity avoidance and sentiment control while maintaining text quality, with minimal impact on generation time.

Conclusion: Control-theoretic intervention provides fine-grained steering of generation attributes with performance guarantees and computational efficiency.

Abstract: The increasing prevalence of Large Language Models (LMs) in critical applications highlights the need for controlled language generation strategies that are not only computationally efficient but that also enjoy performance guarantees. To achieve this, we use a common model of concept semantics as linearly represented in an LM’s latent space. In particular, we take the view that natural language generation traces a trajectory in this continuous semantic space, realized by the language model’s hidden activations. This view permits a control-theoretic treatment of text generation in latent space, in which we propose a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings. In particular, we propose to directly intervene on the activations of the token being generated in embedding space in an online fashion. Crucially, we do not simply steer activations towards a desirable region. Instead, our method relies on classical techniques from control theory to precisely control activations in a context-dependent way, and guarantees that they are brought into a specific pre-defined region of embedding space that corresponds to allowed semantics. Our intervention is computed in closed-form according to an optimal controller formulation, minimally impacting generation time. This control of the activations in embedding space allows for fine-grained steering of attributes of the generated sequence. We demonstrate the effectiveness of our approach on two objectives, toxicity avoidance and sentiment control, while maintaining text quality.
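
The geometric core of the intervention fits in a few lines: if allowed semantics correspond to the half-space {h : w·h + b >= 0} for a linearly represented concept, the minimal-norm closed-form correction moves a violating activation back to the boundary. The paper's controller is context-dependent and derived from an optimal-control formulation; this sketch conveys only the projection idea.

```python
import torch

def steer(h: torch.Tensor, w: torch.Tensor, b: float) -> torch.Tensor:
    """Minimal-norm correction onto the allowed half-space
    {h : w.h + b >= 0}, e.g. the non-toxic side of a concept direction."""
    w_unit = w / w.norm()
    margin = h @ w_unit + b
    if margin < 0:                  # activation encodes disallowed semantics
        h = h - margin * w_unit     # adds |margin| along w; new margin is 0
    return h

steered = steer(torch.randn(768), torch.randn(768), b=0.1)
```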

[44] JoPA:Explaining Large Language Model’s Generation via Joint Prompt Attribution

Yurui Chang, Bochuan Cao, Yujia Wang, Jinghui Chen, Lu Lin

Main category: cs.CL

TL;DR: A counterfactual explanation framework called JoPA that explains how multiple prompt texts collaboratively influence LLM generation outputs by treating it as a combinatorial optimization problem.

DetailsMotivation: Existing prompt explanation methods are limited to classification tasks or treat input texts independently, failing to capture the combinatorial effects of multiple prompts on complete text generation.

Method: Formulates prompt attribution as a combinatorial optimization problem and uses a probabilistic algorithm to search for causal input combinations in discrete space.

Result: The framework demonstrates both faithfulness and efficiency in explaining collaborative prompt effects on LLM generation through multiple evaluation metrics.

Conclusion: JoPA provides a novel approach to understanding how combinations of prompt texts jointly influence LLM outputs, addressing limitations of existing explanation methods.

Abstract: Large Language Models (LLMs) have demonstrated impressive performance in complex text generation tasks. However, the contribution of the input prompt to the generated content still remains obscure to humans, underscoring the necessity of understanding the causality between input and output pairs. Existing works providing prompt-specific explanations often confine the model output to classification or next-word prediction. The few initial attempts to explain the entire language generation often treat input prompt texts independently, ignoring their combinatorial effects on the follow-up generation. In this study, we introduce a counterfactual explanation framework based on Joint Prompt Attribution, JoPA, which aims to explain how a few prompt texts collaboratively influence the LLM’s complete generation. Particularly, we formulate the task of prompt attribution for generation interpretation as a combinatorial optimization problem, and introduce a probabilistic algorithm to search for the causal input combination in the discrete space. We define and utilize multiple metrics to evaluate the produced explanations, demonstrating both the faithfulness and efficiency of our framework.
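
The combinatorial search can be sketched as a stochastic hill-climb over which k prompt words to mask jointly, with a hypothetical `generation_change` scorer that regenerates from the masked prompt and measures divergence from the original output. The paper's probabilistic algorithm is more principled; this conveys only the joint (rather than per-word) attribution idea.

```python
import random

def generation_change(masked_prompt_words) -> float:
    """Hypothetical scorer: regenerate from the masked prompt and return
    how far the new output diverges from the original (0..1)."""
    raise NotImplementedError

def search_joint_attribution(words, k=3, iters=200, seed=0):
    rng = random.Random(seed)
    cur = set(rng.sample(range(len(words)), k))    # current word subset
    cur_score = -1.0
    for _ in range(iters):
        cand = set(cur)
        cand.discard(rng.choice(sorted(cand)))     # swap one index out...
        while len(cand) < k:                       # ...and a fresh one in
            cand.add(rng.randrange(len(words)))
        score = generation_change(
            [w for i, w in enumerate(words) if i not in cand])
        if score > cur_score:                      # keep the better subset
            cur, cur_score = cand, score
    return sorted(cur)
```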

[45] CTourLLM: Enhancing LLMs with Chinese Tourism Knowledge

Qikai Wei, Mingzhi Yang, Jinqiang Wang, Wenwei Mao, Jiabo Xu, Huansheng Ning

Main category: cs.CL

TL;DR: CTourLLM is a Qwen-based model fine-tuned on Chinese tourism data (Cultour dataset) that outperforms ChatGPT in tourism-related tasks with improved BLEU-1 and Rouge-L scores.

DetailsMotivation: Large language models lack specialized tourism knowledge, limiting their performance in tourist attraction presentations and travel planning tasks.

Method: Constructed Cultour dataset with tourism knowledge base, travelogues, and QA data. Fine-tuned Qwen model on this dataset to create CTourLLM. Used RRA evaluation criteria (Relevance, Readability, Availability) with both automatic and human evaluation.

Result: CTourLLM outperformed ChatGPT with improvements of 1.21 in BLEU-1 and 1.54 in Rouge-L scores, demonstrating better response quality for tourism information.

Conclusion: The Cultour dataset and CTourLLM model effectively enhance LLM performance in Chinese tourism domain, providing better attraction information and travel planning capabilities.

Abstract: Recently, large language models (LLMs) have demonstrated their effectiveness in various natural language processing (NLP) tasks. However, the lack of tourism knowledge limits the performance of LLMs in tourist attraction presentations and travel planning. To address this challenge, we constructed a supervised fine-tuning dataset for the Chinese culture and tourism domain, named Cultour. This dataset consists of three parts: tourism knowledge base data, travelogue data, and tourism QA data. Additionally, we propose CTourLLM, a Qwen-based model supervised fine-tuned with Cultour, to improve the quality of information about attractions and travel planning. To evaluate the performance of CTourLLM, we propose a human evaluation criterion named RRA (Relevance, Readability, Availability), and employ both automatic and human evaluation. The experimental results demonstrate that CTourLLM outperforms ChatGPT, achieving an improvement of 1.21 in BLEU-1 and 1.54 in Rouge-L, validating the effectiveness of its responses. Our proposed Cultour is accessible at https://github.com/mrweiqk/Cultour.

[46] TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong

Main category: cs.CL

TL;DR: TokenSelect is a training-free method that uses QK dot products and per-head soft voting to selectively process critical KV cache tokens, achieving up to 23.84x speedup in attention computation and 2.28x end-to-end acceleration while maintaining accuracy.

DetailsMotivation: Address performance degradation and excessive inference times in LLMs when processing long context sequences due to out-of-distribution sequence lengths and quadratic attention complexity.

Method: Uses QK dot products to measure KV Cache criticality at token-level, implements per-head soft voting mechanism to select critical tokens, and employs Selection Cache with Paged Dot Product Kernel for efficiency.

Result: Achieves up to 23.84x speedup in attention computation and 2.28x acceleration in end-to-end latency while outperforming state-of-the-art long-context inference methods.

Conclusion: TokenSelect provides an efficient and accurate solution for long-context inference in LLMs without requiring additional training, significantly reducing computational overhead while maintaining performance.

Abstract: Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation due to sequence lengths out-of-distribution, and excessively long inference times caused by the quadratic computational complexity of attention. These issues limit LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using QK dot products to measure per-head KV Cache criticality at token-level. By a per-head soft voting mechanism, TokenSelect selectively involves a few critical KV cache tokens in attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we design the Selection Cache based on observations of consecutive Query similarity and implement the efficient Paged Dot Product Kernel, significantly reducing the selection overhead. A comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention computation and up to 2.28x acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.
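
The selection rule itself is compact: score every cached token per head with a Q·K dot product, softmax within each head to get soft votes, sum votes across heads, and keep the top positions. Shapes and the vote pooling below are illustrative; the Selection Cache and Paged Dot Product Kernel are omitted.

```python
import torch

def select_kv_tokens(q: torch.Tensor, K: torch.Tensor, k_keep: int):
    """q: (H, d) current query per head; K: (H, T, d) cached keys.
    Returns the k_keep token positions chosen by per-head soft voting."""
    scores = torch.einsum("hd,htd->ht", q, K)   # (H, T) per-head criticality
    votes = torch.softmax(scores, dim=-1)       # soft vote within each head
    pooled = votes.sum(dim=0)                   # aggregate across heads
    return pooled.topk(k_keep).indices.sort().values

H, T, d = 8, 1024, 64
idx = select_kv_tokens(torch.randn(H, d), torch.randn(H, T, d), k_keep=128)
```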

[47] Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference

Libo Zhang, Zhaoning Zhang, Baizhou Xu, Rui Li, Zhiliang Tian, Songzhu Mei, Dongsheng Li

Main category: cs.CL

TL;DR: Dovetail is a lossless inference acceleration method that uses speculative decoding with GPU draft models and CPU target models to achieve 1.79x-10.1x speedups on consumer devices while maintaining output quality.

DetailsMotivation: Large language models require significant computational resources and memory, creating challenges for efficient inference on consumer-grade devices with weaker GPUs and stronger CPUs. Existing offloading techniques have limitations due to communication latency and suboptimal hardware utilization.

Method: Dovetail uses speculative decoding with a draft model on GPU for preliminary predictions and target model on CPU for validation. It reduces data transfer granularity, optimizes draft tokens to lower verification latency, increases model depth for better predictions, and introduces Dynamic Gating Fusion for improved feature integration.

Result: Experimental results on 13B models show inference speedups ranging from 1.79x to 10.1x across different consumer-grade GPUs, while maintaining consistency and stability in generated text distributions.

Conclusion: Dovetail effectively leverages heterogeneous device characteristics and speculative decoding to achieve significant inference acceleration on resource-constrained devices without compromising output quality.

Abstract: With the continuous advancement in the performance of large language models (LLMs), their demand for computational resources and memory has significantly increased, which poses major challenges for efficient inference on consumer-grade devices and legacy servers. These devices typically feature relatively weaker GPUs and stronger CPUs. Although techniques such as parameter offloading and partial offloading can alleviate GPU memory pressure to some extent, their effectiveness is limited due to communication latency and suboptimal hardware resource utilization. To address this issue, we propose Dovetail, a lossless inference acceleration method that leverages the complementary characteristics of heterogeneous devices and the advantages of speculative decoding. Dovetail deploys a draft model on the GPU to perform preliminary predictions, while a target model running on the CPU validates these outputs. By reducing the granularity of data transfer, Dovetail significantly minimizes communication overhead. To further improve efficiency, we optimize the draft model specifically for heterogeneous hardware environments by reducing the number of draft tokens to lower parallel verification latency, increasing model depth to enhance predictive capabilities, and introducing a Dynamic Gating Fusion (DGF) mechanism to improve the integration of feature and embedding information. We conduct comprehensive evaluations of Dovetail across various consumer-grade GPUs, covering multiple tasks and mainstream models. Experimental results on 13B models demonstrate that Dovetail achieves inference speedups ranging from 1.79x to 10.1x across different devices, while maintaining consistency and stability in the distribution of generated texts.
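
As a rough illustration of the draft-then-verify loop, consider the greedy sketch below, with the draft model standing in for the GPU side and the target model for the CPU side; the DGF mechanism and draft-model optimizations are not modeled, and all names are placeholders.

```python
# Greedy speculative-decoding sketch in the spirit of Dovetail's GPU-draft /
# CPU-verify split; models are stand-ins returning (batch, len, vocab) logits.
import torch

@torch.no_grad()
def speculative_step(draft, target, ids, n_draft=4):
    prop = ids
    for _ in range(n_draft):                           # draft proposes tokens
        prop = torch.cat([prop, draft(prop).argmax(-1)[:, -1:]], dim=-1)
    verified = target(prop[:, :-1]).argmax(-1)[:, -n_draft:]  # one parallel pass
    drafted = prop[:, -n_draft:]
    ok = int((verified == drafted).cumprod(-1).sum())  # longest accepted prefix
    return torch.cat([ids, drafted[:, :ok], verified[:, ok:ok + 1]], dim=-1)

V = 100
dummy = lambda x: torch.randn(x.shape[0], x.shape[1], V)  # placeholder model
ids = torch.randint(0, V, (1, 5))
print(speculative_step(dummy, dummy, ids).shape)
```

Note that only token ids cross the draft/verify boundary here, which reflects the reduced data-transfer granularity the abstract describes.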

[48] MEMIT-Merge: Addressing MEMIT’s Key-Value Conflicts in Same-Subject Batch Editing for LLMs

Zilu Dong, Xiangqing Shen, Rui Xia

Main category: cs.CL

TL;DR: MEMIT-Merge improves MEMIT’s batch editing performance by merging value computations for facts sharing the same subject, maintaining a 90%+ success rate in same-subject batch scenarios where MEMIT’s drops to around 50%.

DetailsMotivation: MEMIT's knowledge editing performance deteriorates significantly when processing batches containing multiple edits with the same subject due to update conflicts in its key-value framework.

Method: Proposed MEMIT-Merge approach that merges value computation processes for facts sharing the same subject, resolving performance degradation in same-subject batch editing scenarios.

Result: MEMIT-Merge maintains success rate exceeding 90% at larger batch sizes where MEMIT drops to around 50%, demonstrating remarkable robustness to subject entity collisions.

Conclusion: MEMIT-Merge effectively addresses MEMIT’s limitations in same-subject batch editing by merging value computations, providing a robust solution for mass knowledge modifications in large language models.

Abstract: As large language models continue to scale up, knowledge editing techniques that modify models’ internal knowledge without full retraining have gained significant attention. MEMIT, a prominent batch editing algorithm, stands out for its capability to perform mass knowledge modifications. However, we uncover that MEMIT’s editing efficacy significantly deteriorates when processing batches containing multiple edits sharing the same subject. Our analysis reveals this stems from MEMIT’s key-value modeling framework: identical keys (derived from the shared subject) are forced to represent different values (corresponding to different knowledge), resulting in update conflicts during editing. To address this issue, we propose MEMIT-Merge, an enhanced approach that merges value computation processes for facts sharing the same subject, effectively resolving the performance degradation in same-subject batch editing scenarios. Experimental results demonstrate that when MEMIT’s edit success rate drops to around 50% at larger batch sizes, MEMIT-Merge maintains a success rate exceeding 90%, showcasing remarkable robustness to subject entity collisions. The code is available at https://github.com/NUSTM/MEMIT-Merge.
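
A toy numerical illustration is given below: one key (the shared subject) cannot satisfy two different value targets, so the edit is computed against a single merged value instead. The rank-1 update and the averaging merge rule are simplifying assumptions for exposition, not the paper’s exact procedure.

```python
# Toy illustration of the same-subject key-value conflict and a merge fix.
# The averaging merge rule and rank-1 update are assumptions for exposition.
import numpy as np
np.random.seed(0)

d_k, d_v = 4, 3
W = np.zeros((d_v, d_k))                  # edited weight (toy scale)
key = np.random.randn(d_k)                # one key per shared subject
v1, v2 = np.random.randn(d_v), np.random.randn(d_v)  # two conflicting facts

# Naive batch editing asks the same key to map to both v1 and v2.
# Merging first consolidates them into a single value target.
v_merged = (v1 + v2) / 2
W += np.outer(v_merged - W @ key, key) / (key @ key)  # rank-1 edit
print(np.allclose(W @ key, v_merged))     # True: one key, one merged value
```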

[49] M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis

Chengyan Wu, Bolei Ma, Yihong Liu, Zheyu Zhang, Ningyuan Deng, Yanshu Li, Baolan Chen, Yi Zhang, Yun Xue, Barbara Plank

Main category: cs.CL

TL;DR: M-ABSA is a comprehensive multilingual parallel dataset for aspect-based sentiment analysis spanning 7 domains and 21 languages, created to address the English-centric limitation in existing ABSA research.

DetailsMotivation: Existing ABSA datasets are predominantly English-centric, which limits multilingual evaluation and research opportunities in aspect-based sentiment analysis.

Method: The dataset was constructed through an automatic translation process with human review to ensure quality, focusing on triplet extraction (aspect terms, categories, and sentiment polarities).

Result: Extensive experiments with various baselines showed that M-ABSA enables diverse evaluation tasks including multilingual and multi-domain transfer learning, and large language model evaluation.

Conclusion: M-ABSA represents the most extensive multilingual parallel dataset for ABSA to date and has the potential to drive advancements in multilingual ABSA research due to its inclusivity and comprehensive coverage.

Abstract: Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-centric, limiting the scope for multilingual evaluation and research. To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7 domains and 21 languages, making it the most extensive multilingual parallel dataset for ABSA to date. Our primary focus is on triplet extraction, which involves identifying aspect terms, aspect categories, and sentiment polarities. The dataset is constructed through an automatic translation process with human review to ensure quality. We perform extensive experiments using various baselines to assess performance and compatibility on M-ABSA. Our empirical findings highlight that the dataset enables diverse evaluation tasks, such as multilingual and multi-domain transfer learning, and large language model evaluation, underscoring its inclusivity and its potential to drive advancements in multilingual ABSA research.
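
To make the triplet-extraction target concrete, here is an illustrative record in the (aspect term, aspect category, sentiment polarity) format; the field names are assumptions for illustration, not the dataset’s actual schema.

```python
# Illustrative ABSA triplet record; field names are assumed, not M-ABSA's schema.
example = {
    "text": "The battery life is great but the screen scratches easily.",
    "triplets": [
        {"aspect_term": "battery life", "category": "BATTERY", "polarity": "positive"},
        {"aspect_term": "screen", "category": "DISPLAY", "polarity": "negative"},
    ],
}
for t in example["triplets"]:
    print(t["aspect_term"], t["category"], t["polarity"])
```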

[50] Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection

Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne

Main category: cs.CL

TL;DR: Proposes a robust adaptation framework for hateful meme detection that improves in-domain accuracy, cross-domain generalization, and adversarial robustness while preserving LMMs’ general capabilities.

DetailsMotivation: Hateful memes are a significant online concern, but current Large Multimodal Models face challenges with sub-optimal performance, limited out-of-domain generalization, and limitations of both supervised fine-tuning and in-context learning approaches.

Method: A robust adaptation framework designed to enhance hateful meme detection capabilities while preserving the general vision-language capabilities of Large Multimodal Models.

Result: Achieves state-of-the-art performance on six meme classification datasets, outperforms larger agentic systems, shows improved robustness under adversarial attacks compared to SFT models, and generates higher-quality rationales for interpretability.

Conclusion: The proposed framework effectively addresses the limitations of current approaches by providing improved detection accuracy, better cross-domain generalization, enhanced adversarial robustness, and superior interpretability through high-quality rationales.

Abstract: Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While Large Multimodal Models (LMMs) have shown promise in hateful meme detection, they face notable challenges like sub-optimal performance and limited out-of-domain generalization capabilities. Recent studies further reveal the limitations of both supervised fine-tuning (SFT) and in-context learning when applied to LMMs in this setting. To address these issues, we propose a robust adaptation framework for hateful meme detection that enhances in-domain accuracy and cross-domain generalization while preserving the general vision-language capabilities of LMMs. Analysis reveals that our approach achieves improved robustness under adversarial attacks compared to SFT models. Experiments on six meme classification datasets show that our approach achieves state-of-the-art performance, outperforming larger agentic systems. Moreover, our method generates higher-quality rationales for explaining hateful content compared to standard SFT, enhancing model interpretability. Code available at https://github.com/JingbiaoMei/RGCL

[51] MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering

Teng Lin, Yuyu Luo, Nan Tang

Main category: cs.CL

TL;DR: MEBench benchmark reveals LLMs and RAG systems struggle with multi-entity QA, achieving only 59% accuracy on complex cross-document entity consolidation tasks.

DetailsMotivation: Existing methods excel at single-document comprehension but struggle with cross-document aggregation for entity-dense questions requiring integration of scattered information from heterogeneous sources.

Method: Introduced MEBench, a multi-document, multi-entity benchmark with 4,780 questions categorized into 3 primary categories and 8 distinct types, using Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness.

Result: State-of-the-art LLMs (GPT-4, Llama-3) and RAG pipelines achieve only 59% accuracy on MEBench, revealing critical limitations in multi-entity reasoning capabilities.

Conclusion: MEBench highlights systemic weaknesses in current LLM frameworks and provides a foundation for advancing robust, entity-aware QA architectures that require better completeness and factual precision in information extraction.

Abstract: Multi-entity question answering (MEQA) represents significant challenges for large language models (LLMs) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often struggle with cross-document aggregation, particularly when resolving entity-dense questions like “What is the distribution of ACM Fellows among various fields of study?”, which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs’ capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions, systematically categorized into three primary categories and further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4, Llama-3) and RAG pipelines reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of completeness and factual precision in information extraction for MEQA tasks, using the Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
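
A minimal version of an entity-level F1 in the spirit of EA-F1 is sketched below, assuming answers reduce to sets of (entity, attribute) pairs; attribution-validity checking is omitted and the reduction is an assumption for illustration.

```python
# Minimal entity-level F1 sketch in the spirit of EA-F1 (simplified).
def entity_f1(pred, gold):
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {("Turing Award", "2018"), ("ACM Fellow", "2010")}
pred = {("Turing Award", "2018"), ("ACM Fellow", "2012")}
print(round(entity_f1(pred, gold), 3))  # 0.5: one pair matched exactly
```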

[52] Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models

Tom Kempton, Stuart Burrell

Main category: cs.CL

TL;DR: The paper develops a theoretical framework for decoding strategies in language models, analyzing how popular methods like top-k, nucleus, and temperature sampling optimize specific objectives and suffer from local normalization distortion.

DetailsMotivation: Existing decoding strategies for language models are largely heuristic-based and difficult to improve systematically, despite decoding being a crucial but understudied component of natural language generation.

Method: The authors express popular decoding algorithms as equilibrium states using ergodic theory, analyze the mathematical objectives they optimize, and quantify the effects of local normalization distortion on text quality and diversity.

Result: The research reveals that local normalization distortion is a fundamental defect in current decoding strategies and provides quantitative analysis of its impact on generated text quality and diversity.

Conclusion: The theoretical framework provides insights for designing better decoding algorithms and offers methods for detecting machine-generated text based on understanding decoding strategy limitations.

Abstract: Advances in hardware and language model architecture have spurred a revolution in natural language generation. However, autoregressive models compute probability distributions over next-token choices, and sampling from these distributions, known as decoding, has received significantly less attention than other design choices. Existing decoding strategies are largely based on heuristics, resulting in methods that are difficult to apply or improve in a principled manner. We develop the theory of decoding strategies for language models by expressing popular decoding algorithms as equilibrium states in the language of ergodic theory and stating the objective functions they optimize. Using this, we analyze the effect of the local normalization step required to make probabilities sum to one in top-k, nucleus, and temperature sampling. We argue that local normalization distortion is a fundamental defect of decoding strategies and quantify the size of this distortion and its effect on mathematical proxies for the quality and diversity of generated text. This yields conclusions for the design of decoding algorithms and the detection of machine-generated text.
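
The local normalization step at issue is easy to state in code: after truncating to top-k or top-p, the surviving probabilities are rescaled to sum to one, and the amount of redistributed tail mass varies with the context. A toy demonstration:

```python
# Demonstration of local normalization in top-k and nucleus (top-p) sampling.
import numpy as np

def top_k_renorm(p, k):
    idx = np.argsort(p)[::-1][:k]
    q = np.zeros_like(p)
    q[idx] = p[idx] / p[idx].sum()          # local normalization step
    return q

def top_p_renorm(p, top_p):
    order = np.argsort(p)[::-1]
    csum = np.cumsum(p[order])
    keep = order[: np.searchsorted(csum, top_p) + 1]
    q = np.zeros_like(p)
    q[keep] = p[keep] / p[keep].sum()       # local normalization step
    return q

p = np.array([0.5, 0.3, 0.15, 0.05])
print(top_k_renorm(p, 2))    # [0.625 0.375 0. 0.]: ratios among kept tokens
print(top_p_renorm(p, 0.8))  # are preserved, but the truncated tail mass is
                             # redistributed, and the rescaling factor differs
                             # from context to context.
```

Because the renormalization factor differs at every decoding step, the resulting sequence distribution is not simply a truncation of the model’s sequence distribution; this gap is the distortion the paper quantifies.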

[53] Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

Amanda Myntti, Erik Henriksson, Veronika Laippala, Sampo Pyysalo

Main category: cs.CL

TL;DR: Study shows that pretraining data register/genre significantly impacts LLM performance, with Opinion texts being highly beneficial while News performs poorly, suggesting deliberate register-based data selection can improve model outcomes.

DetailsMotivation: Current pretraining data curation often uses binary quality filtering, but lacks understanding of how different text types contribute to model performance. The research aims to investigate the effect of linguistic registers on LLM performance.

Method: Trained small generative models using register-classified data from corpus linguistics, evaluated using standard benchmarks to measure performance differences across text genres.

Result: Register substantially affects model performance - Opinion texts (reviews, blogs) are highly beneficial, News performs poorly. Combining well-performing registers (How-to-Instructions, Informational Description, Opinion) leads to major improvements over single-register training.

Conclusion: Register is an important factor explaining model variation and can enable more deliberate data selection practices for pretraining, moving beyond simple binary filtering to genre-aware curation.

Abstract: Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labelling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters are deemed valuable examples, others are discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilising registers or genres – a widely used standard in corpus linguistics to model linguistic variation – to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We train small generative models with register-classified data and evaluate them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.

[54] SemCAFE: When Named Entities make the Difference Assessing Web Source Reliability through Entity-level Analytics

Gautam Kishore Shahi, Oshani Seneviratne, Marc Spaniol

Main category: cs.CL

TL;DR: SemCAFE is a system that detects unreliable news articles by analyzing entity relatedness using NLP techniques and YAGO knowledge base, achieving 12% improvement in F1 score over state-of-the-art methods.

DetailsMotivation: The digital media landscape contains both reliable and unreliable content that is difficult to distinguish, especially with AI-generated content mimicking credible sources. The Russian invasion of Ukraine in 2022 highlighted this challenge as unreliable articles closely resembled credible ones.

Method: Uses standard NLP techniques (boilerplate removal, tokenization) combined with entity-level semantic analysis using YAGO knowledge base to create semantic fingerprints for news articles.

Result: Successfully assessed 46,020 reliable and 3,407 unreliable articles about the 2022 Russian invasion of Ukraine, achieving a 12% improvement in macro F1 score over state-of-the-art methods.

Conclusion: SemCAFE effectively detects news reliability through semantic entity analysis, providing a robust solution for identifying unreliable content in digital media.

Abstract: With the shift from traditional to digital media, the online landscape now hosts not only reliable news articles but also a significant amount of unreliable content. Digital media reaches audiences faster, significantly influencing public opinion and advancing political agendas. While newspaper readers may be familiar with their preferred outlets’ political leanings or credibility, determining unreliable news articles is much more challenging. The credibility of many online sources is often opaque, with AI-generated content being easily disseminated at minimal cost. Unreliable news articles, particularly those that followed the Russian invasion of Ukraine in 2022, closely mimic the topics and writing styles of credible sources, making them difficult to distinguish. To address this, we introduce SemCAFE, a system designed to detect news reliability by incorporating entity relatedness into its assessment. SemCAFE employs standard Natural Language Processing techniques, such as boilerplate removal and tokenization, alongside entity-level semantic analysis using the YAGO knowledge base. By creating a semantic fingerprint for each news article, SemCAFE was able to assess the credibility of 46,020 reliable and 3,407 unreliable articles on the 2022 Russian invasion of Ukraine. Our approach improved the macro F1 score by 12% over state-of-the-art methods. The sample data and code are available on GitHub.
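
As a toy picture of an entity-level “semantic fingerprint”, the sketch below represents each article by its set of recognized entities and compares articles by overlap; real entity linking against YAGO and the actual relatedness computation are stubbed out with a gazetteer lookup.

```python
# Toy entity-fingerprint sketch; YAGO entity linking is stubbed by a gazetteer.
def extract_entities(text, gazetteer):
    return {e for e in gazetteer if e.lower() in text.lower()}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

gazetteer = {"Ukraine", "Kyiv", "NATO", "Kremlin"}
a = extract_entities("NATO leaders met in Kyiv, Ukraine.", gazetteer)
b = extract_entities("The Kremlin claims NATO provoked Ukraine.", gazetteer)
print(jaccard(a, b))  # 0.5: fingerprint overlap usable as a reliability feature
```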

[55] Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

Hanhua Hong, Chenghao Xiao, Yang Wang, Yiqi Liu, Wenge Rong, Chenghua Lin

Main category: cs.CL

TL;DR: Proposes inversion learning method to automatically generate effective evaluation prompts for LLM-based evaluators, eliminating manual prompt engineering and improving robustness.

DetailsMotivation: Human evaluation of NLG systems suffers from inconsistencies and biases, while LLM-based evaluators are highly sensitive to prompt design variations, limiting reproducibility and scalability.

Method: Uses inversion learning to learn reverse mappings from model outputs back to input instructions, enabling automatic generation of model-specific evaluation prompts with just one sample.

Result: Eliminates need for manual prompt engineering, improves efficiency and robustness of LLM-based evaluation systems.

Conclusion: Contributes to more robust and efficient LLM-based evaluation through automatic prompt generation, addressing prompt sensitivity issues in current methods.

Abstract: Evaluating natural language generation systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardisation, and demographic biases, limiting reproducibility. LLM-based evaluators offer a scalable alternative but are highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.

[56] Llama-Nemotron: Efficient Reasoning Models

Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Prasoon Varshney, Makesh Narsimhan, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi Mahabadi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam, Smita Ithape, Karthik Ramamoorthy, Yuting Wu, Suguna Varshini Velury, Omri Almog, Joyjit Daw, Denys Fridman, Erick Galinkin, Michael Evans, Shaona Ghosh, Katherine Luna, Leon Derczynski, Nikki Pope, Eileen Long, Seth Schneider, Guillermo Siman, Tomasz Grzegorzek, Pablo Ribalta, Monika Katariya, Chris Alexiuk, Joey Conway, Trisha Saar, Ann Guan, Krzysztof Pawelec, Shyamala Prayaga, Oleksii Kuchaiev, Boris Ginsburg, Oluwatobi Olabiyi, Kari Briski, Jonathan Cohen, Bryan Catanzaro, Jonah Alben, Yonatan Geifman, Eric Chung

Main category: cs.CL

TL;DR: Llama-Nemotron series: open-source heterogeneous reasoning models in 8B, 49B, and 253B sizes that compete with state-of-the-art models while offering superior inference efficiency and a dynamic reasoning toggle feature.

DetailsMotivation: To create open-source reasoning models with exceptional capabilities, inference efficiency, and enterprise-friendly licensing that can compete with proprietary state-of-the-art reasoning models like DeepSeek-R1.

Method: Neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, continued pretraining, followed by reasoning-focused post-training with supervised fine-tuning and large-scale reinforcement learning.

Result: Models deliver competitive reasoning performance with superior inference throughput and memory efficiency. Three model sizes released under commercial license with complete post-training dataset and training codebases.

Conclusion: Successful development of open-source reasoning models with dynamic reasoning toggle capability, providing enterprise-ready solutions and supporting open research through released models, datasets, and codebases.

Abstract: We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes – Nano (8B), Super (49B), and Ultra (253B) – and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models – LN-Nano, LN-Super, and LN-Ultra – under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.

[57] OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models

Xiaoyu Xu, Minxin Du, Qingqing Ye, Haibo Hu

Main category: cs.CL

TL;DR: OBLIVIATE is an efficient unlearning framework that removes targeted sensitive/copyrighted content from LLMs while preserving model utility through token extraction, retain set building, and multi-component loss optimization using LoRA adapters.

DetailsMotivation: LLMs risk memorizing sensitive, copyrighted, or toxic content from their training data, creating privacy and legal concerns that require effective content removal mechanisms.

Method: Structured process involving target token extraction, retain set construction, and fine-tuning with a three-component loss function (masking, distillation, world fact) using low-rank adapters for efficiency.

Result: Effective resistance against membership inference attacks, minimal impact on retained data quality, and maintained robustness across multiple datasets including Harry Potter, WMDP, and TOFU.

Conclusion: OBLIVIATE provides a robust and efficient solution for targeted data removal from LLMs while preserving overall model utility and performance across diverse scenarios.

Abstract: Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components – masking, distillation, and world fact. Using low-rank adapters (LoRA) ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (via a new document-level memorization score), model utility, and fluency. Results demonstrate its effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.
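
A schematic of the three-part objective, as described in the abstract, might look as follows; the exact form of each term and the weighting are assumptions for illustration, and in practice only the LoRA adapter parameters would receive gradients.

```python
# Schematic three-part unlearning objective (masking, distillation, world fact).
# The term definitions and weights are assumptions, not OBLIVIATE's exact loss.
import torch
import torch.nn.functional as F

def unlearning_loss(logits_f, mask_ids,         # forget-set batch
                    logits_r, logits_teacher,   # retain-set batch
                    logits_w, fact_ids,         # world-fact batch
                    a=1.0, b=1.0, c=1.0):
    # Masking: push forget-set predictions toward a mask/refusal target.
    l_mask = F.cross_entropy(logits_f, mask_ids)
    # Distillation: keep retain-set behavior close to the frozen teacher.
    l_distill = F.kl_div(F.log_softmax(logits_r, -1),
                         F.softmax(logits_teacher, -1), reduction="batchmean")
    # World fact: preserve general knowledge with a standard LM loss.
    l_fact = F.cross_entropy(logits_w, fact_ids)
    return a * l_mask + b * l_distill + c * l_fact

V = 32
loss = unlearning_loss(torch.randn(4, V), torch.randint(0, V, (4,)),
                       torch.randn(4, V), torch.randn(4, V),
                       torch.randn(4, V), torch.randint(0, V, (4,)))
print(float(loss))
```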

[58] Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis

Haoming Huang, Yibo Yan, Jiahao Huo, Xin Zou, Xinfeng Li, Kun Wang, Xuming Hu

Main category: cs.CL

TL;DR: PhantomCircuit is a novel framework that analyzes and detects knowledge overshadowing in LLMs - a type of hallucination where one piece of knowledge masks another relevant piece, causing errors even with good training data.

DetailsMotivation: Current understanding of knowledge overshadowing is limited to inference-time observations, lacking insights into its origins and internal mechanisms during model training. The paper aims to provide deeper understanding of this challenging hallucination variant.

Method: The framework employs knowledge circuit analysis to dissect the function of key components in the circuit and how attention pattern dynamics contribute to overshadowing and its evolution throughout training.

Result: Extensive experiments demonstrate PhantomCircuit’s effectiveness in identifying knowledge overshadowing instances, offering novel insights into this elusive hallucination phenomenon.

Conclusion: PhantomCircuit provides the research community with a new methodological lens for analyzing and potentially mitigating knowledge overshadowing in large language models.

Abstract: Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing. By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the function of key components in the circuit and how the attention pattern dynamics contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit’s effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation.

[59] A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

Shinnosuke Ono, Issey Sukeda, Takuro Fujii, Kosei Buma, Shunsuke Sasaki

Main category: cs.CL

TL;DR: Japanese pharmaceutical domain-specific LLM outperforms open models and competes with commercial ones on specialized benchmarks including terminology-heavy tasks and cross-sentence consistency reasoning.

DetailsMotivation: To develop a practical, secure, and cost-effective Japanese language model specifically for the pharmaceutical field, addressing the need for domain-specific NLP capabilities in healthcare.

Method: Continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens, with evaluation using three new benchmarks: YakugakuQA (pharmacist exams), NayoseQA (cross-lingual terminology), and SogoCheck (consistency reasoning).

Result: The domain-specific model outperforms existing open-source medical LLMs and achieves competitive performance with commercial models like GPT-4o, particularly on terminology-heavy and knowledge-based tasks. GPT-4o performed poorly on SogoCheck, indicating cross-sentence consistency reasoning remains challenging.

Conclusion: This work demonstrates the feasibility of building effective Japanese domain-specific language models for pharmaceutical applications and provides reusable evaluation resources for future healthcare NLP research.

Abstract: We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.

[60] FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain

Suifeng Zhao, Zhuoran Jin, Sujian Li, Jun Gao

Main category: cs.CL

TL;DR: FinRAGBench-V is a visual RAG benchmark for finance that addresses the gap in existing text-only approaches by incorporating multimodal data with visual citation capabilities, featuring bilingual corpus and comprehensive QA dataset.

DetailsMotivation: Existing RAG research in finance focuses mainly on textual data, overlooking valuable visual content in financial documents, leading to loss of key analytical insights.

Method: Developed FinRAGBench-V benchmark with bilingual retrieval corpus (60,780 Chinese + 51,219 English pages) and human-annotated QA dataset across heterogeneous data types and seven question categories. Introduced RGenCite baseline integrating visual citation with generation, and proposed automatic citation evaluation method for MLLMs.

Result: Extensive experiments demonstrate the challenging nature of FinRAGBench-V, providing valuable insights for multimodal RAG system development in finance.

Conclusion: The benchmark effectively bridges the visual-textual gap in financial RAG applications and enables systematic assessment of visual citation capabilities in multimodal language models for financial domain applications.

Abstract: Retrieval-Augmented Generation (RAG) plays a vital role in the financial domain, powering applications such as real-time market analysis, trend forecasting, and interest rate computation. However, most existing RAG research in finance focuses predominantly on textual data, overlooking the rich visual content in financial documents, resulting in the loss of key analytical insights. To bridge this gap, we present FinRAGBench-V, a comprehensive visual RAG benchmark tailored for finance which effectively integrates multimodal data and provides visual citation to ensure traceability. It includes a bilingual retrieval corpus with 60,780 Chinese and 51,219 English pages, along with a high-quality, human-annotated question-answering (QA) dataset spanning heterogeneous data types and seven question categories. Moreover, we introduce RGenCite, a RAG baseline that seamlessly integrates visual citation with generation. Furthermore, we propose an automatic citation evaluation method to systematically assess the visual citation capabilities of Multimodal Large Language Models (MLLMs). Extensive experiments on RGenCite underscore the challenging nature of FinRAGBench-V, providing valuable insights for the development of multimodal RAG systems in finance.

[61] LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Complex Reasoning

Tao Liu, Xutao Mao, Hongying Zan, Dixuan Zhang, Yifan Li, Haixin Liu, Lulu Kong, Jiaming Hou, Rui Li, YunLong Li, aoze zheng, Zhiqiang Zhang, Luo Zhewei, Kunli Zhang, Min Peng

Main category: cs.CL

TL;DR: LogicCat is a new Text-to-SQL benchmark dataset focused on complex reasoning scenarios including physics, arithmetic, commonsense, and hypothetical reasoning, with 4,038 questions and 12,114 chain-of-thought steps across 45 domains.

DetailsMotivation: Existing Text-to-SQL datasets focus on business logic but neglect critical real-world reasoning demands like domain knowledge, complex mathematical computations, and hypothetical reasoning scenarios needed for practical data analysis.

Method: Created LogicCat dataset with 4,038 English questions paired with detailed chain-of-thought reasoning steps spanning 45 databases across diverse domains including physics, arithmetic, commonsense, and hypothetical reasoning scenarios.

Result: Experimental results show LogicCat substantially increases difficulty for state-of-the-art models, reducing execution accuracy to at most 33.20%, demonstrating the task remains exceptionally challenging.

Conclusion: LogicCat represents a crucial step toward developing systems suitable for real-world enterprise data analysis and autonomous query generation, addressing the gap in complex reasoning requirements for practical Text-to-SQL applications.

Abstract: Text-to-SQL is a critical task in natural language processing that aims to transform natural language questions into accurate and executable SQL queries. In real-world scenarios, these reasoning tasks are often accompanied by complex mathematical computations, domain knowledge, and hypothetical reasoning scenarios. However, existing large-scale Text-to-SQL datasets typically focus on business logic and task logic, neglecting critical factors such as vertical domain knowledge, complex mathematical reasoning, and hypothetical reasoning, which are essential for realistically reflecting the reasoning demands in practical applications and completing data querying and analysis. To bridge this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset specifically designed for complex reasoning and chain-of-thought parsing, encompassing physics, arithmetic, commonsense, and hypothetical reasoning scenarios. LogicCat comprises 4,038 English questions paired with 12,114 detailed chain-of-thought reasoning steps, spanning 45 databases across diverse domains, significantly surpassing existing datasets in complexity. Experimental results demonstrate that LogicCat substantially raises the difficulty for current state-of-the-art models, which achieve at most 33.20% execution accuracy, indicating that this task remains exceptionally challenging. The advancement of LogicCat represents a crucial step toward developing systems suitable for real-world enterprise data analysis and autonomous query generation. We have released our dataset and code at https://github.com/Ffunkytao/LogicCat.

[62] Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects

Chengyan Wu, Yiqiang Cai, Yang Liu, Pengxu Zhu, Yun Xue, Ziwei Gong, Julia Hirschberg, Bolei Ma

Main category: cs.CL

TL;DR: A systematic survey of Multimodal Emotion Recognition in Conversations (MERC) that integrates text, speech, and visual signals for enhanced emotional understanding in dialogue systems.

DetailsMotivation: Real-world dialogue systems require more nuanced emotional understanding than single modality approaches can provide, necessitating multimodal integration for better human-computer interaction.

Method: The survey provides a systematic overview including motivations, core tasks, representative methods, and evaluation strategies for MERC.

Result: The paper examines recent trends, highlights key challenges, and outlines future directions for multimodal emotion recognition research.

Conclusion: This survey offers timely guidance for advancing MERC research as interest in emotionally intelligent systems continues to grow.

Abstract: While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional understanding of human-computer interaction. Its goal is to accurately recognize emotions by integrating information from various modalities such as text, speech, and visual signals. This survey offers a systematic overview of MERC, including its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.

[63] Localizing Persona Representations in LLMs

Celia Cintas, Miriam Rateike, Erik Miehling, Elizabeth Daly, Skyler Speakman

Main category: cs.CL

TL;DR: Study examines where and how personas (human characteristics, values, beliefs) are encoded in LLM representation spaces, finding significant differences emerge in final decoder layers with varying overlap patterns across ethical and political ideologies.

DetailsMotivation: To understand how large language models internally represent distinct human personas and characteristics, which can inform better modulation of specific traits in LLM outputs.

Method: Used dimension reduction and pattern recognition methods to analyze model layers, identified layers with greatest divergence in persona encoding, and examined activations within selected layers to study shared and distinct embedding spaces.

Result: Personas show large representation space differences only within final third of decoder layers; ethical perspectives like moral nihilism and utilitarianism show overlapping activations (polysemy), while political ideologies like conservatism and liberalism are represented in more distinct regions.

Conclusion: Findings improve understanding of LLM internal information representation and can guide future efforts to refine modulation of specific human traits in LLM outputs.

Abstract: We present a study on how and where personas – defined by distinct sets of human characteristics, values, and beliefs – are encoded in the representation space of large language models (LLMs). Using a range of dimension reduction and pattern recognition methods, we first identify the model layers that show the greatest divergence in encoding these representations. We then analyze the activations within a selected layer to examine how specific personas are encoded relative to others, including their shared and distinct embedding spaces. We find that, across multiple pre-trained decoder-only LLMs, the analyzed personas show large differences in representation space only within the final third of the decoder layers. We observe overlapping activations for specific ethical perspectives – such as moral nihilism and utilitarianism – suggesting a degree of polysemy. In contrast, political ideologies like conservatism and liberalism appear to be represented in more distinct regions. These findings help to improve our understanding of how LLMs internally represent information and can inform future efforts in refining the modulation of specific human traits in LLM outputs. Warning: This paper includes potentially offensive sample statements.

[64] Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs

Manon Reusens, Bart Baesens, David Jurgens

Main category: cs.CL

TL;DR: A new framework for evaluating persona consistency in LLMs across different personas and task types, revealing that consistency varies based on persona stereotypes, task structure, and model design.

DetailsMotivation: While LLMs are increasingly assigned specific personas for various applications, there's a lack of comprehensive analysis on how consistently they maintain these personas across different tasks and runs.

Method: Developed a standardized framework to evaluate persona consistency across four persona categories (happiness, occupation, personality, political stance) and multiple task dimensions (survey writing, essay generation, social media posts, single/multi-turn conversations).

Result: Consistency is influenced by multiple factors including assigned persona, stereotypes, and model design choices. Consistency varies across tasks, increasing with more structured tasks and additional context.

Conclusion: The framework provides a standardized way to analyze persona consistency in LLMs, revealing important insights about how different factors affect model behavior across various persona-task combinations.

Abstract: Personalized Large Language Models (LLMs) are increasingly used in diverse applications, where they are assigned a specific persona - such as a happy high school teacher - to guide their responses. While prior research has examined how well LLMs adhere to predefined personas in writing style, a comprehensive analysis of consistency across different personas and task types is lacking. In this paper, we introduce a new standardized framework to analyze consistency in persona-assigned LLMs. We define consistency as the extent to which a model maintains coherent responses when assigned the same persona across different tasks and runs. Our framework evaluates personas across four different categories (happiness, occupation, personality, and political stance) spanning multiple task dimensions (survey writing, essay generation, social media post generation, single turn, and multi-turn conversations). Our findings reveal that consistency is influenced by multiple factors, including the assigned persona, stereotypes, and model design choices. Consistency also varies across tasks, increasing with more structured tasks and additional context. All code is available on GitHub.

[65] Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim

Main category: cs.CL

TL;DR: Introduces Debate Speech Evaluation as a new benchmark for testing LLM judges’ ability to evaluate debate speeches across multiple dimensions like argument strength, coherence, and style.

DetailsMotivation: Existing LLM benchmarks lack systematic evaluation of cognitive abilities required for debate speech assessment, which involves complex multi-level understanding.

Method: Leveraged a dataset of 600+ meticulously annotated debate speeches to analyze how state-of-the-art LLMs compare to human judges on debate evaluation tasks.

Result: Larger models can approximate individual human judgments in some respects but differ substantially in overall judgment behavior. Frontier LLMs can generate persuasive speeches at human level.

Conclusion: Debate speech evaluation presents unique challenges for LLMs, revealing nuanced differences from human judgment patterns despite some approximation capabilities.

Abstract: We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.

[66] TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review

Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, Ngai Wong

Main category: cs.CL

TL;DR: TreeReview is a hierarchical QA framework that improves LLM-generated paper reviews through recursive question decomposition and dynamic expansion, achieving better quality while reducing computational costs.

DetailsMotivation: Current LLM-based peer review methods struggle to generate thorough and insightful reviews efficiently, needing a more structured approach to improve review quality while maintaining computational efficiency.

Method: TreeReview models paper review as hierarchical bidirectional QA - recursively decomposes high-level questions into sub-questions, then aggregates answers from leaf to root with dynamic question expansion for deeper probing.

Result: Outperforms baselines in comprehensive and expert-aligned reviews, reduces LLM token usage by up to 80% compared to intensive approaches, validated through both LLM and human evaluation on ICLR/NeurIPS benchmark.

Conclusion: TreeReview provides an effective framework for generating high-quality, insightful paper reviews efficiently through hierarchical question decomposition and dynamic expansion, demonstrating significant improvements over existing methods.

Abstract: While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comments generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at https://github.com/YuanChang98/tree-review.
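
The recursive decompose-then-aggregate control flow can be sketched in a few lines, with the LLM calls stubbed out; the dynamic question-expansion mechanism is omitted and all names are illustrative.

```python
# TreeReview-style recursion sketch: decompose, answer leaves, aggregate upward.
# decompose/answer_leaf/aggregate stand in for LLM prompts.
def decompose(question, depth):
    if depth == 0:
        return []                                   # leaf: no further splitting
    return [f"{question} / sub{i}" for i in range(2)]  # stub decomposition

def answer_leaf(question):
    return f"answer({question})"                    # stub fine-grained answer

def tree_review(question, depth=2):
    subs = decompose(question, depth)
    if not subs:
        return answer_leaf(question)
    answers = [tree_review(q, depth - 1) for q in subs]
    return f"aggregate({question}: {'; '.join(answers)})"  # bottom-up merge

print(tree_review("Is the method sound?"))
```

Token savings come from answering many narrow questions instead of repeatedly prompting over the full paper, then aggregating the short answers.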

[67] MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

Ke Wang, Yiming Qin, Nikolaos Dimitriadis, Alessandro Favero, Pascal Frossard

Main category: cs.CL

TL;DR: MEMOIR is a scalable framework for lifelong language model editing that uses residual memory with sparse activation masks to enable thousands of sequential edits without forgetting or interference.

DetailsMotivation: Real-world language models need efficient post-hoc updates to incorporate new knowledge without retraining or forgetting previous information, but existing methods struggle with generalization, interference, and scalability.

Method: Uses residual memory with sample-dependent sparse masks to confine each edit to distinct memory parameters, and compares sparse activation patterns at inference to identify relevant knowledge.

Result: Achieves state-of-the-art performance on QA, hallucination correction, and OOD generalization benchmarks for LLaMA-3 and Mistral, scaling to thousands of edits with minimal forgetting.

Conclusion: MEMOIR provides an effective solution for scalable and reliable lifelong model editing through sparse memory activation, enabling practical deployment of continuously updated language models.

Abstract: Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably – without retraining or forgetting previous information – remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through sample-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks for LLaMA-3 and Mistral backbones demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.
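
A minimal sketch of a residual memory gated by sample-dependent sparse masks is shown below; the top-k masking rule is an assumption standing in for whatever sparsification MEMOIR actually uses, and edit training itself is not shown.

```python
# Residual-memory sketch with sample-dependent sparse masks (top-k rule assumed).
import torch

class ResidualMemory(torch.nn.Module):
    def __init__(self, d, sparsity=0.1):
        super().__init__()
        self.W = torch.nn.Parameter(torch.zeros(d, d))  # dedicated memory module
        self.k = max(1, int(d * sparsity))

    def mask(self, h):
        # Sample-dependent sparse mask: keep the top-k activation coordinates,
        # so each edit only touches the memory slice its mask selects.
        m = torch.zeros_like(h)
        m.scatter_(-1, h.abs().topk(self.k, dim=-1).indices, 1.0)
        return m

    def forward(self, h):
        return h + (self.mask(h) * h) @ self.W.T  # residual edit pathway

mem = ResidualMemory(d=16)
h = torch.randn(2, 16)
print(mem(h).shape)  # same shape; exact identity until edits train W
```

Because W is initialized to zero, unedited behavior is untouched; at inference, overlap between a query’s mask and stored edit masks determines which knowledge activates.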

[68] Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

Nathaniel Getachew, Abulhair Saparov

Main category: cs.CL

TL;DR: StorySim is a programmable framework for generating synthetic stories to evaluate theory of mind and world modeling capabilities in LLMs, avoiding data contamination issues from pretraining.

DetailsMotivation: To address limitations in existing benchmarks that may suffer from pretraining data contamination and lack precise control over character perspectives and events for evaluating ToM and WM capabilities.

Method: Uses a highly controllable Storyboard to generate novel, compositional story prompts, enabling precise manipulation of character perspectives and events for first- and second-order ToM tasks alongside WM tasks.

Result: Most LLMs perform better on WM tasks than ToM tasks, show better reasoning with humans vs inanimate objects, and exhibit heuristic behaviors like recency bias and over-reliance on earlier events.

Conclusion: StorySim provides a robust framework for evaluating ToM and WM capabilities while revealing systematic patterns in LLM reasoning behaviors, with all code made publicly available.

Abstract: We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than ToM tasks, and that models tend to reason better about humans than about inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.

[69] Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models

Kaiyan Chang, Yonghao Shi, Chenglong Wang, Hang Zhou, Chi Hu, Xiaoqian Liu, Yingfeng Luo, Yuan Ge, Tong Xiao, Jingbo Zhu

Main category: cs.CL

TL;DR: Hybrid Test-Time Scaling combines training-free methods for improved LLM reasoning without additional training overhead.

DetailsMotivation: Training-based TTS methods add computational burden, so the paper focuses on training-free approaches to enhance reasoning performance efficiently.

Method: Developed Conditional Step-level Self-refinement with process verification, then combined it with parallel scaling methods to create Hybrid Test-Time Scaling.

Result: Extensive experiments on 3B-14B LLMs showed hybrid training-free TTS significantly expands reasoning performance boundaries.

Conclusion: Fine-grained hybrid strategies incorporating various training-free TTS methods have considerable potential for enhancing LLM reasoning capabilities.

Abstract: Test-Time Scaling (TTS) is a promising approach to progressively elicit the model’s intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. On top of its effectiveness, we further combine it with other classical parallel scaling methods at the step level, to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that a hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.
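
As a rough illustration of how step-level sequential and parallel scaling compose, the sketch below samples candidate steps in parallel, keeps the best verifier-scored one, and self-refines it only when the verifier score falls below a threshold. The sampler, process verifier, and threshold are stand-ins, not the paper's components.

```python
import random

def propose_step(prefix):            # stand-in for sampling one reasoning step
    return prefix + [f"step-{random.randint(0, 9)}"]

def verify(steps):                   # stand-in process verifier: score in [0, 1]
    return random.random()

def refine(steps):                   # stand-in self-refinement of the last step
    return steps[:-1] + [steps[-1] + "-refined"]

def hybrid_tts(n_steps=4, width=4, threshold=0.6):
    """Illustrative hybrid test-time scaling: at every step, sample `width`
    candidates in parallel, keep the best verifier-scored one, and apply
    conditional self-refinement only when the verifier is unsatisfied."""
    trace = []
    for _ in range(n_steps):
        candidates = [propose_step(trace) for _ in range(width)]  # parallel scaling
        best = max(candidates, key=verify)                        # verifier-guided
        if verify(best) < threshold:                              # conditional,
            best = refine(best)                                   # step-level refinement
        trace = best
    return trace

print(hybrid_tts())
```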

[70] Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs

Supantho Rakshit, Adele Goldberg

Main category: cs.CL

TL;DR: LLMs learn graded, meaning-infused representations of English Double Object and Prepositional Object constructions, with more prototypical examples occupying more distinct regions in activation space.

DetailsMotivation: To investigate whether Large Language Models' internal representations reflect the function-infused gradience proposed by usage-based constructionist approaches to language.

Method: Analyzed representations of English Double Object and Prepositional Object constructions in Pythia-1.4B using 5000 sentence pairs varied by human-rated preference strength, employing geometric measures like energy distance and Jensen-Shannon divergence.

Result: Separability between construction representations is systematically modulated by gradient preference strength, with more prototypical exemplars occupying more distinct regions in activation space.

Conclusion: LLMs learn rich, meaning-infused, graded representations of constructions, supporting geometric measures for analyzing representations in language models.

Abstract: The usage-based constructionist (UCx) approach to language posits that language comprises a network of learned form-meaning pairings (constructions) whose use is largely determined by their meanings or functions, requiring them to be graded and probabilistic. This study investigates whether the internal representations in Large Language Models (LLMs) reflect the proposed function-infused gradience. We analyze representations of the English Double Object (DO) and Prepositional Object (PO) constructions in Pythia-$1.4$B, using a dataset of $5000$ sentence pairs systematically varied by human-rated preference strength for DO or PO. Geometric analyses show that the separability between the two constructions’ representations, as measured by energy distance or Jensen-Shannon divergence, is systematically modulated by gradient preference strength, which depends on lexical and functional properties of sentences. That is, more prototypical exemplars of each construction occupy more distinct regions in activation space, compared to sentences that could equally well have occurred in either construction. These results provide evidence that LLMs learn rich, meaning-infused, graded representations of constructions and offer support for geometric measures for representations in LLMs.
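
The separability measure itself is standard. Below is a minimal NumPy sketch of the energy distance between two sets of sentence activations; the activation arrays are synthetic stand-ins for the DO and PO representations the paper extracts.

```python
import numpy as np

def energy_distance(X, Y):
    """Energy distance between two samples of activation vectors (rows are
    sentence representations): D^2 = 2*E||x-y|| - E||x-x'|| - E||y-y'||."""
    def mean_pdist(A, B):
        diffs = A[:, None, :] - B[None, :, :]
        return np.linalg.norm(diffs, axis=-1).mean()
    return 2 * mean_pdist(X, Y) - mean_pdist(X, X) - mean_pdist(Y, Y)

rng = np.random.default_rng(0)
do_acts = rng.normal(0.0, 1.0, size=(100, 64))  # stand-in DO sentence activations
po_acts = rng.normal(0.5, 1.0, size=(100, 64))  # stand-in PO sentence activations
print(energy_distance(do_acts, po_acts))  # larger for more separable (prototypical) sets
```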

[71] Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers

Samyak S. Sanghvi

Main category: cs.CL

TL;DR: Bhav-Net is a dual-space architecture that enables knowledge transfer from multilingual models to language-specific architectures for antonym-synonym distinction across multiple languages.

DetailsMotivation: Antonym vs synonym distinction presents computational challenges due to the paradoxical nature of antonymous relationships where words share semantic domains but express opposite meanings.

Method: Combines language-specific BERT encoders with graph transformer networks to create distinct semantic projections - synonyms cluster in one space while antonyms exhibit high similarity in a complementary space.

Result: Achieves competitive performance against state-of-the-art baselines across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, Russian) with effective cross-lingual generalization.

Conclusion: Semantic relationship modeling transfers effectively across languages, providing interpretable semantic representations and robust cross-lingual antonym-synonym distinction capabilities.

Abstract: Antonym vs synonym distinction across multiple languages presents unique computational challenges due to the paradoxical nature of antonymous relationships: words that share semantic domains while expressing opposite meanings. This work introduces Bhav-Net, a novel dual-space architecture that enables effective knowledge transfer from complex multilingual models to simpler, language-specific architectures while maintaining robust cross-lingual antonym–synonym distinction capabilities. Our approach combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space while antonymous pairs exhibit high similarity in a complementary space. Through comprehensive evaluation across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic relationship modeling transfers effectively across languages. The dual-encoder design achieves competitive performance against state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.

[72] Trust but Verify! A Survey on Verification Design for Test-time Scaling

V Venktesh, Mandeep Rathee, Avishek Anand

Main category: cs.CL

TL;DR: Survey paper on test-time scaling verifiers for LLMs, covering diverse verification approaches, training mechanisms, and their utility in improving model performance during inference.

DetailsMotivation: Despite widespread adoption of verifiers in test-time scaling, there is no comprehensive collection, categorization, or discussion of diverse verification approaches and their training mechanisms.

Method: The authors conduct a systematic survey of literature, presenting a unified view of verifier training, types (prompt-based, fine-tuned discriminative/generative models), and their utility in exploring decoding search space.

Result: The survey provides detailed categorization and discussion of various verification approaches used in test-time scaling paradigms.

Conclusion: Verifiers serve as reward models that score candidate outputs to explore solution space and select optimal outcomes, enabling parameter-free scaling at inference time with high performance gains.

Abstract: Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that help score the candidate outputs from the decoding process to diligently explore the vast solution space and select the best outcome. This paradigm has emerged as a superior approach owing to parameter-free scaling at inference time and high performance gains. The verifiers could be prompt-based, fine-tuned as a discriminative or generative model to verify process paths, outcomes or both. Despite their widespread adoption, there is no detailed collection, clear categorization and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types and their utility in test-time scaling. Our repository can be found at https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
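
The basic loop the survey organizes, scoring candidate outputs with a verifier and keeping the best, reduces to best-of-N selection. A minimal sketch with stand-in generator and verifier functions follows; a real setup would call an LLM sampler and a trained reward model.

```python
import random

def best_of_n(prompt, generate, verifier, n=8):
    """Best-of-N selection: sample N candidates, let the verifier (a reward
    model over outcomes and/or process paths) score them, keep the argmax."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [verifier(prompt, c) for c in candidates]
    return max(zip(scores, candidates))[1]

# Stand-ins for illustration only.
generate = lambda p: f"{p} -> answer {random.randint(0, 99)}"
verifier = lambda p, c: random.random()
print(best_of_n("2+2?", generate, verifier))
```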

[73] Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee

Main category: cs.CL

TL;DR: DuET-PD framework evaluates LLM vulnerability to misinformation and receptiveness to corrections in persuasive dialogues, finding GPT-4o only achieves 27.32% accuracy under sustained misleading persuasion. Holistic DPO training improves robustness significantly.

DetailsMotivation: LLMs struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, which is critical for reliable deployment in real-world applications.

Method: Introduces DuET-PD framework evaluating multi-turn stance-change dynamics across persuasion type (corrective/misleading) and domain (knowledge/safety). Proposes Holistic DPO training approach balancing positive and negative persuasion examples.

Result: GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasion. Holistic DPO improves Llama-3.1-8B-Instruct’s accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%.

Conclusion: The framework and training approach offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue, addressing critical vulnerabilities to misinformation while maintaining receptiveness to valid corrections.

Abstract: Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasion. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct’s accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.

[74] Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong, Haihao Liu, Vasu Sharma, Kevin Zhu

Main category: cs.CL

TL;DR: LLMs show behavioral changes when they detect evaluation vs deployment contexts, with models being more deceptive/unsafe in test settings. The study developed a method to quantify and manipulate this “evaluation awareness” phenomenon.

DetailsMotivation: Address the critical AI alignment challenge where benchmark performance doesn't accurately reflect true model safety and honesty due to LLMs' ability to detect evaluation contexts and change behavior accordingly.

Method: Used linear probes to score prompts on a “test-like” to “deploy-like” scale, and employed LLM rewriting to shift prompts toward natural deployment contexts while preserving original tasks.

Result: Rewritten prompts achieved 30% higher average probe scores, induced 5.26% increase in honest responses, 12.40% decrease in deceptive responses, and 6.38% increase in refusal rates across state-of-the-art models.

Conclusion: Evaluation awareness is quantifiable and manipulable, revealing models are more prone to unsafe/deceptive outputs in perceived test environments, highlighting the need for more realistic evaluation frameworks.

Abstract: Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as “evaluation awareness.” This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model’s true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from “test-like” to “deploy-like” and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten “deploy-like” prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
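
A linear probe of this kind is typically a logistic classifier over frozen hidden states. The sketch below uses synthetic activations as stand-ins, with label 1 for deploy-like prompts, so that a successful rewrite toward a deployment register should raise the score, mirroring the reported increase.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for hidden-state activations of prompts labeled deploy-like (1)
# vs test-like (0); a real probe would use a frozen LLM's activations.
X = np.vstack([rng.normal(0.3, 1, (200, 128)), rng.normal(-0.3, 1, (200, 128))])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

def probe_score(h):
    """Continuous score on the test-like (0) to deploy-like (1) scale;
    rewriting a prompt toward a deployment register should raise it."""
    return probe.predict_proba(h.reshape(1, -1))[0, 1]

print(probe_score(rng.normal(0.3, 1, 128)))   # high: reads as deployment
print(probe_score(rng.normal(-0.3, 1, 128)))  # low: reads as an evaluation
```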

[75] Joint Information Extraction Across Classical and Modern Chinese with Tea-MOELoRA

Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang

Main category: cs.CL

TL;DR: Tea-MOELoRA is a parameter-efficient multi-task framework combining LoRA with Mixture-of-Experts to handle Chinese IE across different temporal domains without performance interference.

DetailsMotivation: Fine-tuning a single model on heterogeneous Chinese IE tasks across Classical and Modern documents causes interference and reduced performance due to temporal domain differences.

Method: Combines LoRA with Mixture-of-Experts design, using multiple low-rank LoRA experts specialized for different IE tasks and eras, with a task-era-aware router for dynamic expert allocation.

Result: Outperforms both single-task and joint LoRA baselines, demonstrating effective leveraging of task and temporal knowledge.

Conclusion: Tea-MOELoRA successfully addresses the challenge of multi-task Chinese IE across diverse temporal domains through parameter-efficient expert specialization and dynamic routing.

Abstract: Chinese information extraction (IE) involves multiple tasks across diverse temporal domains, including Classical and Modern documents. Fine-tuning a single model on heterogeneous tasks and across different eras may lead to interference and reduced performance. Therefore, in this paper, we propose Tea-MOELoRA, a parameter-efficient multi-task framework that combines LoRA with a Mixture-of-Experts (MoE) design. Multiple low-rank LoRA experts specialize in different IE tasks and eras, while a task-era-aware router mechanism dynamically allocates expert contributions. Experiments show that Tea-MOELoRA outperforms both single-task and joint LoRA baselines, demonstrating its ability to leverage task and temporal knowledge effectively.
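
A minimal sketch of the idea follows: several low-rank LoRA experts mixed by a router conditioned on task and era embeddings, on top of a frozen linear layer. All dimensions, the router form, and the soft mixing are assumptions rather than the paper's exact configuration.

```python
import torch

class TeaMoELoRA(torch.nn.Module):
    """Sketch of a task-era-aware mixture of LoRA experts: each expert is a
    low-rank update, and a router conditioned on (task, era) embeddings
    mixes their contributions on top of a frozen base linear layer."""
    def __init__(self, d_in=64, d_out=64, rank=8, n_experts=4, n_tasks=3, n_eras=2):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # frozen backbone weight
        self.A = torch.nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(n_experts, d_out, rank))
        self.task_emb = torch.nn.Embedding(n_tasks, 16)
        self.era_emb = torch.nn.Embedding(n_eras, 16)
        self.router = torch.nn.Linear(32, n_experts)

    def forward(self, x, task_id, era_id):
        gate_in = torch.cat([self.task_emb(task_id), self.era_emb(era_id)], dim=-1)
        w = torch.softmax(self.router(gate_in), dim=-1)  # per-sample expert weights
        # Weighted sum of low-rank updates: sum_e w_e * (B_e @ A_e @ x)
        delta = torch.einsum("be,eri,bi,eor->bo", w, self.A, x, self.B)
        return self.base(x) + delta

m = TeaMoELoRA()
x = torch.randn(5, 64)
task = torch.tensor([0, 1, 2, 0, 1])  # hypothetical IE task ids
era = torch.tensor([0, 1, 0, 1, 0])   # Classical vs Modern Chinese
print(m(x, task, era).shape)          # torch.Size([5, 64])
```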

[76] CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models

Kairong Han, Wenshuo Zhao, Ziyu Zhao, JunJian Ye, Lujia Pan, Kun Kuang

Main category: cs.CL

TL;DR: CAT improves LLMs’ causal reasoning by injecting fine-grained causal knowledge into the attention mechanism, achieving significant OOD performance gains.

DetailsMotivation: LLMs often capture spurious correlations instead of true causal relationships, leading to poor performance in out-of-distribution scenarios.

Method: Causal Attention Tuning (CAT) with an automated pipeline for generating token-level causal signals and a Re-Attention mechanism to guide training and mitigate attention biases.

Result: Average improvement of 5.76% on STG dataset and 1.56% on downstream tasks; Llama-3.1-8B OOD performance increased from 64.5% to 90.5%; Qwen improved from 25.4% to 55.9%

Conclusion: CAT effectively leverages causal knowledge for prediction and maintains robustness in OOD scenarios, demonstrating significant improvements over standard LLM training

Abstract: Large Language Models (LLMs) have achieved remarkable success across various domains. However, a fundamental question remains: Can LLMs effectively utilize causal knowledge for prediction and generation? Through empirical studies, we find that LLMs trained directly on large-scale data often capture spurious correlations rather than true causal relationships, leading to suboptimal performance, especially in out-of-distribution (OOD) scenarios. To address this challenge, we propose Causal Attention Tuning (CAT), a novel approach that injects fine-grained causal knowledge into the attention mechanism. We propose an automated pipeline that leverages human priors to automatically generate token-level causal signals and introduce the Re-Attention mechanism to guide training, helping the model focus on causal structures while mitigating noise and biases in attention scores. Experimental results on our proposed Spurious Token Game (STG) benchmark and multiple downstream tasks demonstrate that our approach effectively leverages causal knowledge for prediction and remains robust in OOD scenarios. The CAT achieves an average improvement of 5.76% on the STG dataset and 1.56% on downstream tasks. Notably, the OOD performance of the Llama-3.1-8B model on STG_M increased from 64.5% to 90.5%, and Qwen’s OOD performance on the STG_H dataset improved from 25.4% to 55.9%. Implementation details can be found at https://github.com/Kairong-Han/CAT.

[77] Training LLMs to be Better Text Embedders through Bidirectional Reconstruction

Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin

Main category: cs.CL

TL;DR: New training stage improves LLM text embeddings by enriching final token semantics through bidirectional reconstruction tasks, achieving SOTA results on MTEB benchmark.

DetailsMotivation: Existing LLM-based text embedding approaches use final token embeddings (like [EOS]) that weren't intentionally trained to capture whole context semantics, limiting performance in retrieval and re-ranking tasks.

Method: Adds a new training stage before contrastive learning that uses bidirectional generative reconstruction tasks (EBQ2D and EBD2Q) to anchor [EOS] embedding and reconstruct Query-Document pairs from either side.

Result: Significantly improves LLM performance on Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.

Conclusion: The proposed additional training stage with bidirectional reconstruction tasks effectively enriches final token semantics, making LLMs more powerful text embedders for retrieval tasks.

Abstract: Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.

[78] Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases

Bufan Gao, Elisa Kreiss

Main category: cs.CL

TL;DR: LLM gender bias evaluations are sensitive to how prompts signal evaluation context, with more explicit evaluation framing producing different bias measurements than naturalistic prompts.

DetailsMotivation: To understand how signaling the evaluative purpose of tasks impacts measured gender bias in LLMs, as current evaluation methods often use artificial prompts that differ from natural language distributions.

Method: Tested models under different prompt conditions that make testing context and gender-focused content salient, using four task formats with both token-probability and discrete-choice metrics.

Result: Prompts that clearly align with gender bias evaluation framing produce distinct gender output distributions compared to less evaluation-framed prompts. Discrete-choice metrics amplify bias relative to probabilistic measures.

Conclusion: LLM gender bias evaluations are brittle and sensitive to prompt framing, raising questions about ecological validity of benchmarks and whether testing designs trigger artificial “testing mode” performance.

Abstract: As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted growing efforts both to measure and mitigate such bias. These efforts often rely on evaluation tasks that differ from natural language distributions, as they typically involve carefully constructed task prompts that overtly or covertly signal the presence of gender bias-related content. In this paper, we examine how signaling the evaluative purpose of a task impacts measured gender bias in LLMs. Concretely, we test models under prompt conditions that (1) make the testing context salient, and (2) make gender-focused content salient. We then assess prompt sensitivity across four task formats with both token-probability and discrete-choice metrics. We find that prompts that more clearly align with (gender bias) evaluation framing elicit distinct gender output distributions compared to less evaluation-framed prompts. Discrete-choice metrics further tend to amplify bias relative to probabilistic measures. These findings not only highlight the brittleness of LLM gender bias evaluations but also open a new puzzle for the NLP benchmarking and development community: To what extent can well-controlled testing designs trigger LLM “testing mode” performance, and what does this mean for the ecological validity of future benchmarks?
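
The contrast between the two metric families is easy to see in code: a probabilistic metric keeps graded preferences, while a discrete-choice metric collapses each item to its argmax and thereby amplifies measured bias. The next-token probabilities below are invented stand-ins.

```python
import numpy as np

def token_probability_bias(p_she, p_he):
    """Probabilistic metric: signed, graded preference for one gendered token."""
    return (p_she - p_he) / (p_she + p_he)

def discrete_choice_bias(p_she, p_he):
    """Discrete-choice metric: only the argmax survives, so mild preferences
    become categorical ones (the amplification effect the paper reports)."""
    return 1.0 if p_she > p_he else -1.0

# Stand-in next-token probabilities for "she"/"he" across 5 prompts.
pairs = [(0.52, 0.48), (0.51, 0.49), (0.55, 0.45), (0.49, 0.51), (0.53, 0.47)]
print(np.mean([token_probability_bias(s, h) for s, h in pairs]))  # ~0.04, mild
print(np.mean([discrete_choice_bias(s, h) for s, h in pairs]))    # 0.6, amplified
```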

[79] Advancing SLM Tool-Use Capability using Reinforcement Learning

Dhruvi Paprunia, Vansh Kharidia, Pankti Doshi

Main category: cs.CL

TL;DR: GRPO enables small language models to achieve significant improvements in tool-use capabilities through reinforcement learning with structured reward systems.

DetailsMotivation: Small Language Models (SLMs) face challenges in accurately integrating tool use compared to Large Language Models, especially in resource-constrained settings, limiting their practical deployment in AI applications.

Method: Used Group Relative Policy Optimization (GRPO) reinforcement learning with a reward system that reinforces structured JSON output, correct tool selection, and precise parameter usage.

Result: Demonstrated that GRPO enables SLMs to achieve significant improvements in tool-use capabilities (function calling/JSON output) with computationally efficient training.

Conclusion: GRPO provides an effective method to enhance SLMs’ tool-use accuracy, making them more practical for real-world AI applications where computational efficiency is crucial.

Abstract: In an era where tool-augmented AI agents are becoming increasingly vital, our findings highlight the ability of Group Relative Policy Optimization (GRPO) to empower SLMs, which are traditionally constrained in tool use. The ability to use tools effectively has become a defining feature of Large Language Models (LLMs), allowing them to access external data and internal resources. As AI agents grow more sophisticated, tool-use capabilities have become indispensable. While LLMs have made significant progress in this area, Small Language Models (SLMs) still face challenges in accurately integrating tool use, especially in resource-constrained settings. This study investigates how Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), can enhance the tool-use accuracy of SLMs. By designing a well-defined reward system that reinforces structured JSON output, correct tool selection, and precise parameter usage, we demonstrate that GRPO enables SLMs to achieve significant improvements in tool-use capabilities (function calling/JSON output). Our approach provides a computationally efficient training method that enhances SLMs’ practical deployment in real-world AI applications.
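
A sketch of the kind of structured reward described, with partial credit for valid JSON, correct tool selection, and exact parameters. The weights and the hypothetical get_weather tool are illustrative assumptions, not the paper's values; in GRPO these rewards would then be normalized within each group of sampled completions.

```python
import json

def tool_use_reward(completion, gold_tool, gold_args,
                    w_json=0.2, w_tool=0.4, w_args=0.4):
    """Illustrative reward: partial credit for structured JSON, correct tool
    selection, and exact parameter usage (weights are assumptions)."""
    try:
        call = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0                        # malformed output earns nothing
    reward = w_json                       # valid structured JSON produced
    if call.get("tool") == gold_tool:
        reward += w_tool                  # right tool chosen
        if call.get("arguments") == gold_args:
            reward += w_args              # right parameters supplied
    return reward

good = '{"tool": "get_weather", "arguments": {"city": "Pune"}}'
bad = '{"tool": "get_weather", "arguments": {"city": "pune", "units": "C"}}'
print(tool_use_reward(good, "get_weather", {"city": "Pune"}))  # 1.0
print(tool_use_reward(bad, "get_weather", {"city": "Pune"}))   # 0.6
```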

[80] AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

Aisha Alansari, Hamzah Luqman

Main category: cs.CL

TL;DR: First comprehensive evaluation of hallucination in Arabic and multilingual LLMs on Arabic question answering and summarization tasks, revealing Arabic-specific models outperform multilingual ones.

DetailsMotivation: Arabic LLM hallucination evaluation is underexplored despite Arabic's widespread use and importance in global communication, creating a critical knowledge gap.

Method: Evaluated 12 LLMs (4 Arabic pre-trained, 4 multilingual, 4 reasoning-based) using a fine-grained framework with 12 hallucination indicators on generative question answering and summarization tasks.

Result: Factual hallucinations are more prevalent than faithfulness errors across all models. Arabic pre-trained model Allam consistently shows lower hallucination rates than multilingual models and comparable performance to reasoning-based models.

Conclusion: Arabic-specific models demonstrate superior performance in reducing hallucinations for Arabic language tasks compared to multilingual models, highlighting the importance of language-specific training for Arabic NLP applications.

Abstract: Recently, extensive research on hallucination in large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs’ hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic’s widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs’ outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and comparable performance to reasoning-based models. The code is available at: https://github.com/aishaalansari57/AraHalluEval

[81] Hunyuan-MT Technical Report

Mao Zheng, Zheng Li, Bingxin Qu, Mingyang Song, Yang Du, Mingrui Sun, Di Wang

Main category: cs.CL

TL;DR: Hunyuan-MT-7B is a multilingual translation model supporting 33 languages with special focus on Mandarin-minority language/dialect translation. Hunyuan-MT-Chimera-7B enhances performance by integrating multiple outputs from the base model, achieving SOTA results in WMT2025.

DetailsMotivation: To develop open-source multilingual translation models that excel in bidirectional translation across 33 languages, with particular emphasis on supporting translation between Mandarin and ethnic minority languages/dialects that are often underserved.

Method: Holistic training process: general and MT-oriented pre-training for foundation, Supervised Fine-Tuning for task adaptation, and advanced alignment through Reinforcement Learning and weak-to-strong RL. Chimera model integrates multiple outputs from base model under varying parameters.

Result: Both models significantly outperform comparable translation-specific models and most SOTA large models. Ranked first in 30 out of 31 language pairs in WMT2025 shared task, demonstrating robustness across high-resource and low-resource languages including Czech, Marathi, Estonian, and Icelandic.

Conclusion: The Hunyuan-MT models represent state-of-the-art multilingual translation capabilities, particularly excelling in challenging translation scenarios involving minority languages and dialects, while achieving superior performance through innovative slow-thinking inspired architecture.

Abstract: In this report, we introduce Hunyuan-MT-7B, our first open-source multilingual translation model, which supports bidirectional translation across 33 major languages and places a special emphasis on translation between Mandarin and several ethnic minority languages as well as dialects. Furthermore, to serve and address diverse translation scenarios and enhance model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a translation model inspired by the slow thinking mode. This model integrates multiple outputs generated by the Hunyuan-MT-7B model under varying parameter settings, thereby achieving performance superior to that of conventional slow-thinking models based on Chain-of-Thought (CoT). The development of our models follows a holistic training process specifically engineered for multilingual translation, which begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models of comparable parameter size and most of the SOTA large models, particularly on the task of translation between Mandarin and minority languages as well as dialects. In the WMT2025 shared task (General Machine Translation), our models demonstrate state-of-the-art performance, ranking first in 30 out of 31 language pairs. This result highlights the robustness of our models across a diverse linguistic spectrum, encompassing high-resource languages such as Chinese, English, and Japanese, as well as low-resource languages including Czech, Marathi, Estonian, and Icelandic.

[82] Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation

Hongyan Xie, Yitong Yao, Yikun Ban, Zixuan Huang, Deqing Wang, Zhenhe Wu, Haoxiang Su, Chao Wang, Shuangyong Song

Main category: cs.CL

TL;DR: CoPeD improves small language model reasoning by filtering noisy CoT data and using correctness-aware training to focus on well-supported rationales.

DetailsMotivation: Small language models fine-tuned on LLM-generated CoT data often learn from noisy rationales that don't properly support answers, leading to poor reasoning quality.

Method: Introduces correctness-aware task setting that predicts answers based on correct rationales and revises incorrect ones, plus correctness-aware weighted loss that prioritizes well-supported training samples.

Result: Effective on both in-distribution and out-of-distribution benchmark reasoning datasets.

Conclusion: CoPeD successfully improves reasoning quality in small language models by addressing noisy CoT data through correctness perception and weighted loss strategies.

Abstract: Large language models (LLMs) excel at reasoning tasks but are expensive to deploy. Thus, small language models (SLMs) are fine-tuned on CoT data generated by LLMs to replicate LLMs’ abilities. However, these CoT data may include noisy rationales that either fail to substantiate the answers or contribute no additional information to support answer prediction, which leads SLMs to capture spurious correlations between questions and answers and compromise the quality of reasoning. In this work, we propose Chain-of-Thought Correctness Perception Distillation (CoPeD), which aims to improve the reasoning quality of the student model from the perspectives of task setting and data utilization. Firstly, we introduce a correctness-aware task setting that encourages the student model to predict answers based on correct rationales and revise them when they are incorrect. This setting improves the faithfulness of reasoning and allows the model to learn from its mistakes. Then, we propose a Correctness-Aware Weighted loss, which dynamically adjusts the contribution of each training instance based on the combined loss of the rationale and the answer. This strategy encourages the model to focus more on samples where the rationale offers stronger support for the correct answer. Experiments have shown that CoPeD is effective on both in-distribution (IND) and out-of-distribution (OOD) benchmark reasoning datasets.
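
One plausible reading of the Correctness-Aware Weighted loss is a per-instance weight derived from the combined rationale and answer losses. The softmax-over-batch weighting below is an assumption for illustration, not the paper's exact rule.

```python
import torch

def coped_weighted_loss(rationale_loss, answer_loss, temperature=1.0):
    """Correctness-aware weighting (sketch): instances whose rationale better
    supports the answer (lower combined loss) contribute more to training.
    The softmax-over-batch form is an illustrative assumption."""
    combined = rationale_loss + answer_loss
    # Detach so the weights act as coefficients, not an extra gradient path.
    weights = torch.softmax(-combined.detach() / temperature, dim=0)
    return (weights * combined).sum()

rationale_loss = torch.tensor([0.2, 1.5, 0.4])  # stand-in per-sample losses
answer_loss = torch.tensor([0.1, 1.2, 0.3])
print(coped_weighted_loss(rationale_loss, answer_loss))  # dominated by clean samples
```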

[83] LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding

Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li

Main category: cs.CL

TL;DR: LM-Searcher is a novel LLM-based framework for cross-domain neural architecture search that uses universal numerical encoding (NCode) and reformulates NAS as a ranking task, achieving competitive performance without domain-specific tuning.

DetailsMotivation: Existing LLM-driven NAS approaches require heavy prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse optimization tasks.

Method: Proposes NCode universal numerical representation for architectures, reformulates NAS as ranking task, uses instruction-tuning with pruning-based subspace sampling, and creates a comprehensive dataset of architecture-performance pairs.

Result: Achieves competitive performance in both in-domain (CNNs for image classification) and out-of-domain (LoRA configurations for segmentation/generation) tasks.

Conclusion: Establishes a new paradigm for flexible and generalizable LLM-based architecture search that works across domains without extensive adaptation.

Abstract: Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search. The datasets and models will be released at https://github.com/Ashone3/LM-Searcher.

[84] Modelling Intertextuality with N-gram Embeddings

Yi Xing

Main category: cs.CL

TL;DR: Proposes a quantitative model for measuring intertextuality using n-gram embeddings and pairwise comparisons, validated on known texts and scalable to large corpora.

DetailsMotivation: To enable scalable analysis and network-based insights into intertextual relationships between literary texts, moving beyond qualitative approaches.

Method: Perform pairwise comparisons of n-gram embeddings from two texts and average the results to compute overall intertextuality scores.

Result: Validation on four texts with known intertextuality degrees shows effectiveness, and scalability test on 267 texts demonstrates efficiency. Network analysis reveals centrality and community structures.

Conclusion: The approach successfully captures and quantifies intertextual relationships, providing a scalable method for literary analysis with network-based insights.

Abstract: Intertextuality is a central tenet in literary studies. It refers to the intricate links between literary texts that are created by various types of references. This paper proposes a new quantitative model of intertextuality to enable scalable analysis and network-based insights: perform pairwise comparisons of the embeddings of n-grams from two texts and average their results as the overall intertextuality. Validation on four texts with known degrees of intertextuality, alongside a scalability test on 267 diverse texts, demonstrates the method’s effectiveness and efficiency. Network analysis further reveals centrality and community structures, affirming the approach’s success in capturing and quantifying intertextual relationships.
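
The model is simple enough to state in a few lines: embed the n-grams of each text, compare all pairs, and average. The hash-based stand-in embedding below only keeps the sketch self-contained (deterministic within a run); an actual run would use a trained phrase encoder.

```python
import numpy as np

def ngrams(text, n=3):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def embed(gram, dim=64):
    """Stand-in embedding: a reproducible random unit vector per n-gram."""
    rng = np.random.default_rng(abs(hash(gram)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def intertextuality(text_a, text_b, n=3):
    """Average pairwise cosine similarity between the n-gram embeddings of
    two texts, i.e., the paper's overall intertextuality score."""
    A = np.stack([embed(g) for g in ngrams(text_a, n)])
    B = np.stack([embed(g) for g in ngrams(text_b, n)])
    return float((A @ B.T).mean())  # unit vectors, so dot product = cosine

a = "in the beginning was the word and the word was with god"
b = "in the beginning god created the heaven and the earth"
print(intertextuality(a, b))
```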

cs.CV

[85] Detection and Recovery of Adversarial Slow-Pose Drift in Offloaded Visual-Inertial Odometry

Soruya Saha, Md Nurul Absur, Saptarshi Debroy

Main category: cs.CV

TL;DR: Unsupervised detection and recovery mechanism for Visual-Inertial Odometry (VIO) systems to protect against pose spoofing attacks in offloaded VR environments.

DetailsMotivation: Current trend of offloading VIO to edge servers creates security vulnerabilities where subtle pose spoofing attacks can accumulate into significant drift while evading existing heuristic checks, compromising VR immersion and accuracy.

Method: Proposes an unsupervised, label-free detection and recovery model trained on attack-free sessions to learn temporal motion regularities, enabling runtime deviation detection and pose consistency restoration.

Result: Experimental evaluation using ILLIXR testbed across multiple spoofing intensities shows substantial reductions in trajectory and pose error compared to no-defense baseline, as measured by well-known performance metrics.

Conclusion: The proposed unsupervised approach effectively detects and recovers from pose spoofing attacks in offloaded VIO systems, providing enhanced security without requiring labeled attack data.

Abstract: Visual-Inertial Odometry (VIO) supports immersive Virtual Reality (VR) by fusing camera and Inertial Measurement Unit (IMU) data for real-time pose estimation. However, the current trend of offloading VIO to edge servers opens a server-side threat surface where subtle pose spoofing can accumulate into substantial drift while evading heuristic checks. In this paper, we study this threat and present an unsupervised, label-free detection and recovery mechanism. The proposed model is trained on attack-free sessions to learn the temporal regularities of motion, detect runtime deviations, and initiate recovery to restore pose consistency. We evaluate the approach in a realistic offloaded-VIO environment using the ILLIXR testbed across multiple spoofing intensities. Experimental results in terms of well-known performance metrics show substantial reductions in trajectory and pose error compared to a no-defense baseline.

[86] CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis

Cedric Caruzzo, Jong Chul Ye

Main category: cs.CV

TL;DR: CellPainTR is a Transformer-based model that learns batch-effect-robust cellular morphology representations, enabling out-of-distribution generalization to unseen datasets without retraining.

DetailsMotivation: Large-scale biological discovery requires integrating massive heterogeneous datasets, but technical batch effects and lack of generalizable models remain critical roadblocks.

Method: Transformer-based architecture with source-specific context tokens designed to learn foundational representations of cellular morphology that are robust to batch effects.

Result: Outperforms established methods like ComBat and Harmony in batch integration and biological signal preservation on JUMP dataset. Maintains high performance on unseen Bray et al. dataset despite domain/feature shifts.

Conclusion: Represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.

Abstract: Large-scale biological discovery requires integrating massive, heterogeneous datasets like those from the JUMP Cell Painting consortium, but technical batch effects and a lack of generalizable models remain critical roadblocks. To address this, we introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology that are robust to batch effects. Unlike traditional methods that require retraining on new data, CellPainTR’s design, featuring source-specific context tokens, allows for effective out-of-distribution (OOD) generalization to entirely unseen datasets without fine-tuning. We validate CellPainTR on the large-scale JUMP dataset, where it outperforms established methods like ComBat and Harmony in both batch integration and biological signal preservation. Critically, we demonstrate its robustness through a challenging OOD task on the unseen Bray et al. dataset, where it maintains high performance despite significant domain and feature shifts. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.

[87] FusWay: Multimodal hybrid fusion approach. Application to Railway Defect Detection

Alexey Zhukov, Jenny Benois-Pineau, Amira Youssef, Akka Zemmari, Mohamed Mosbah, Virginie Taillandier

Main category: cs.CV

TL;DR: Multimodal fusion architecture combining YOLOv8n and Vision Transformer with audio features improves railway defect detection accuracy by 0.2 points over vision-only methods.

DetailsMotivation: Single modality vision approaches like YOLO detectors suffer from overdetection when normal structural elements appear similar to defects, requiring multimodal fusion with audio signals for better discrimination.

Method: Proposes a multimodal fusion architecture using YOLOv8n for object detection and Vision Transformer to combine feature maps from layers 7, 16, and 19 with synthesized audio representations for rail rupture and surface defect classes.

Result: Experimental evaluation on real-world railway dataset shows 0.2 point improvement in precision and overall accuracy compared to vision-only approach, with statistical significance confirmed by Student’s unpaired t-test.

Conclusion: Multimodal fusion between audio and image modalities effectively enhances defect detection performance in railway infrastructure monitoring, addressing limitations of single-modality approaches.

Abstract: Multimodal fusion is a multimedia technique that has become popular in a wide range of tasks where image information is accompanied by a signal or audio. The latter may not convey highly semantic information such as speech or music; it may instead be a measurement, such as the audio signal recorded by microphones to detect rail structure elements or defects. While classical detection approaches such as the You Only Look Once (YOLO) family of detectors can be efficiently deployed for defect detection on the image modality, single-modality approaches remain limited: they yield overdetections when normal structural elements appear similar to defects. The paper proposes a new multimodal fusion architecture built on domain rules with YOLO and Vision Transformer backbones. It integrates YOLOv8n for rapid object detection with a Vision Transformer (ViT) to combine feature maps extracted from multiple layers (7, 16, and 19) and synthesized audio representations for two defect classes: rail Rupture and Surface defect. Fusion is performed between audio and image. Experimental evaluation on a real-world railway dataset demonstrates that our multimodal fusion improves precision and overall accuracy by 0.2 points compared to the vision-only approach. Student’s unpaired t-test also confirms the statistical significance of the differences in mean accuracy.

[88] Frustratingly Easy Feature Reconstruction for Out-of-Distribution Detection

Yingsheng Wang, Shuo Lu, Jian Liang, Aihua Zheng, Ran He

Main category: cs.CV

TL;DR: A novel post-hoc OOD detection method called ClaFR that uses classifier weight decomposition to create class-known subspaces, achieving state-of-the-art performance without requiring training data access.

DetailsMotivation: Existing feature-based post-hoc OOD detection methods often require access to training data, which poses privacy concerns in real-world applications where data cannot be shared.

Method: Performs orthogonal decomposition of classifier weights to extract class-known subspace, maps original features into this subspace, and calculates OOD scores based on feature reconstruction error within the subspace.

Result: Achieves leading performance on multiple OOD benchmarks without needing training data access, outperforming existing OOD detection algorithms.

Conclusion: ClaFR provides a simple yet effective privacy-preserving solution for OOD detection that eliminates the need for training data while maintaining high detection performance.

Abstract: Out-of-distribution (OOD) detection helps models identify data outside the training categories, which is crucial for security applications. While feature-based post-hoc methods address this by evaluating data differences in the feature space without changing network parameters, they often require access to training data, which may not be suitable in scenarios where data privacy protection is a concern. In this paper, we propose a simple yet effective post-hoc method, termed Classifier-based Feature Reconstruction (ClaFR), from the perspective of subspace projection. It first performs an orthogonal decomposition of the classifier’s weights to extract the class-known subspace, then maps the original data features into this subspace to obtain new data representations. Subsequently, the OOD score is determined by calculating the feature reconstruction error of the data within the subspace. Compared to existing OOD detection algorithms, our method does not require access to training data while achieving leading performance on multiple OOD benchmarks. Our code is released at https://github.com/Aie0923/ClaFR.
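
A minimal sketch of the pipeline as described: an SVD of the classifier weights spans the class-known subspace, features are projected into it, and the reconstruction error is the OOD score. The synthetic features and full-rank basis are assumptions for illustration, not the released implementation.

```python
import numpy as np

def clafr_score(features, W, rank=None):
    """OOD score via feature reconstruction error in the class-known subspace
    spanned by the classifier weight W (classes x feature_dim).
    Higher error suggests the sample is more likely OOD."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V = Vt[:rank].T if rank else Vt.T  # orthonormal basis of the subspace
    recon = features @ V @ V.T         # project into subspace, map back
    return np.linalg.norm(features - recon, axis=-1)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 512))                   # stand-in 10-class linear head
id_feats = rng.normal(size=(5, 512)) @ W.T @ W   # features aligned with the head
ood_feats = rng.normal(size=(5, 512))            # arbitrary-direction features
print(clafr_score(id_feats, W).mean(), clafr_score(ood_feats, W).mean())
```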

[89] DIET-CP: Lightweight and Data Efficient Self Supervised Continued Pretraining

Bryan Rodas, Natalie Montesino, Jakob Ambsdorf, David Klindt, Randall Balestriero

Main category: cs.CV

TL;DR: DIET-CP is a simple continued pretraining strategy that adapts foundation models to new domains using minimal data (1000 images) with stable performance across modalities.

DetailsMotivation: Specialized domains often have very small datasets, limiting SSL methods and making hyperparameter search infeasible. Pretrained models lack important information for continued pretraining.

Method: Uses a simple objective requiring no labels, introduces no additional hyperparameters beyond supervised finetuning, and can steer any foundation model towards new data distributions.

Result: Provides significant performance boost for state-of-the-art models like DINOv3 using only 1000 images, with stability across data modalities and backbone choices.

Conclusion: DIET-CP effectively bridges the gap for continued pretraining in specialized domains with limited data, offering a simple and stable solution without complex hyperparameter tuning.

Abstract: Continued pretraining offers a promising solution for adapting foundation models to a new target domain. However, in specialized domains, available datasets are often very small, limiting the applicability of SSL methods developed for large-scale pretraining and making hyperparameter search infeasible. In addition, pretrained models are usually released as backbone-weights only, lacking important information to continue pretraining. We propose to bridge this gap with DIET-CP, a simple continued pretraining strategy, where any strong foundation model can be steered towards the new data distribution of interest. DIET-CP relies on a very simple objective, requires no labels, and introduces no more hyperparameters than supervised finetuning. It is stable across data modalities and backbone choices, while providing a significant performance boost for state-of-the-art models such as DINOv3 using only 1000 images.

[90] MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention

Tianyi Wang, Jianan Fan, Dingxin Zhang, Dongnan Liu, Yong Xia, Heng Huang, Weidong Cai

Main category: cs.CV

TL;DR: MIRROR is a novel multi-modal self-supervised learning method that integrates histopathology and transcriptomics data by balancing modality alignment with retention of modality-specific structures, achieving superior performance in cancer subtyping and survival analysis.

DetailsMotivation: Histopathology and transcriptomics provide orthogonal yet complementary insights in oncology, but conventional multi-modal methods focus too much on alignment while neglecting the preservation of modality-specific structures due to their pronounced heterogeneity.

Method: MIRROR employs dedicated encoders for each modality, a modality alignment module for integration, a modality retention module to safeguard unique attributes, and a style clustering module to reduce redundancy and enhance disease-relevant information through clustering space modeling.

Result: Extensive evaluations on TCGA cohorts demonstrate MIRROR’s superior performance in cancer subtyping and survival analysis, showing effectiveness in constructing comprehensive oncological feature representations.

Conclusion: MIRROR successfully addresses the challenge of integrating heterogeneous histopathology and transcriptomics data while maintaining modality fidelity, benefiting cancer diagnosis through improved multi-modal representation learning.

Abstract: Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. Multi-modal self-supervised learning has demonstrated remarkable potential in learning pathological representations by integrating diverse data sources. Conventional multi-modal integration methods primarily emphasize modality alignment, while paying insufficient attention to retaining the modality-specific structures. However, unlike conventional scenarios where multi-modal inputs share highly overlapping features, histopathology and transcriptomics exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. Histopathology provides morphological and spatial context, elucidating tissue architecture and cellular topology, whereas transcriptomics delineates molecular signatures through gene expression patterns. This inherent disparity introduces a major challenge in aligning them while maintaining modality-specific fidelity. To address these challenges, we present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. MIRROR employs dedicated encoders to extract comprehensive features for each modality, which is further complemented by a modality alignment module to achieve seamless integration between phenotype patterns and molecular profiles. Furthermore, a modality retention module safeguards unique attributes from each modality, while a style clustering module mitigates redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR’s superior performance, demonstrating its effectiveness in constructing comprehensive oncological feature representations and benefiting the cancer diagnosis.

[91] The Protocol Genome A Self Supervised Learning Framework from DICOM Headers

Jimmy Joseph

Main category: cs.CV

TL;DR: Protocol Genome is a self-supervised learning system that uses DICOM headers to improve medical image analysis performance, achieving AUROC 0.901 and better calibration across multiple imaging modalities and vendors.

DetailsMotivation: Clinical imaging through PACS/DICOM involves various procedure choices that create latent confounders, impeding generalization of image-only networks across different sites and equipment.

Method: Uses structured DICOM headers as labels with three approaches: protocol-image contrastive learning, masked protocol prediction, and protocol-protocol translation to learn protocol-aware but clinically robust image representations.

Result: Achieves AUROC 0.901 (vs 0.847 baseline) and ECE 0.036 (vs 0.058) on external validation. Shows +0.046 improvement for PE detection, +0.058 for glioma grading, +0.041 for cardiomegaly detection, with 25-37% calibration improvements.

Conclusion: Protocol Genome significantly improves medical image analysis performance and calibration across multiple tasks and modalities, reduces false positives at protocol borders, and is clinically applicable through standard DICOM interfaces.

Abstract: In this paper, we introduce the Protocol Genome, a self-supervised learning system that learns correlations from DICOM headers and achieves AUROC 0.901 (vs 0.847 baseline) and ECE 0.036 (vs 0.058) on fully held-out external validation. Our method also improves calibration and robustness across modalities (CT, MRI, CXR) and vendors. Clinical imaging is funneled through PACS/DICOM, where procedure choices (scanner make/model, sequence, kernel, kVp, TR/TE, and slice thickness) have consequences for contrast, noise, and artifact. These latent confounders impede the generalization of image-only networks across sites. We consider structured DICOM headers as a label and learn protocol-aware but clinically robust image representations. Protocol Genome obtains tokenized embeddings of de-identified header fields and models them along with image features using: (1) protocol-image contrastive learning, (2) masked protocol prediction, and (3) protocol-protocol translation. With 1.26M studies (7 health systems, 31 scanners, 3 vendors; CT, MR, CR/DR), we experiment on: (A) chest CT triage for PE, (B) brain MRI glioma grading, and (C) chest radiograph cardiomegaly detection. Relative to strong SSL baselines (SimCLR, MAE) as well as ImageNet transfer, Protocol Genome (+0.046: PE, +0.058: glioma, +0.041: cardiomegaly) is associated with higher external AUROC; 25-37% calibration improvements are obtained (p < 0.01, DeLong tests). While the gains may be task-dependent, they are preserved with 10-20% of labeled data. From a clinical point of view, the technique reduces false positives at protocol borders and is applicable in a PACS (DICOM C-FIND/C-MOVE, DICOMweb QIDO/WADO). We publish a model card and deployment guide, complete with both de-identification and bias audits.

[92] FedAPT: Federated Adversarial Prompt Tuning for Vision-Language Models

Kun Zhai, Siheng Chen, Xingjun Ma, Yu-Gang Jiang

Main category: cs.CV

TL;DR: FedAPT enhances adversarial robustness in federated prompt tuning by addressing class information gaps through a class-aware prompt generator and cross-layer sharing strategy.

DetailsMotivation: Federated Prompt Tuning models are vulnerable to adversarial attacks, especially under non-IID settings where clients have limited local label information while the global model faces attacks from global labels.

Method: Proposes a class-aware prompt generator that creates visual prompts from text prompts guided by Global Label Embedding, plus cross-layer generator sharing to enhance prompt coupling across model layers.
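
For intuition, a minimal sketch of what a class-aware prompt generator conditioned on a Global Label Embedding might look like; the dimensions, the additive conditioning, and the module name are assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class ClassAwarePromptGenerator(nn.Module):
    """Illustrative generator: maps text-prompt features, conditioned on a
    shared Global Label Embedding ("beacon"), to visual prompt tokens."""
    def __init__(self, text_dim=512, vis_dim=768, n_prompts=8):
        super().__init__()
        self.cond = nn.Linear(text_dim, text_dim)   # inject the global beacon
        self.proj = nn.Linear(text_dim, vis_dim * n_prompts)
        self.n_prompts, self.vis_dim = n_prompts, vis_dim

    def forward(self, text_prompt, global_label_emb):
        h = text_prompt + self.cond(global_label_emb)  # beacon-guided feature
        out = self.proj(h)                             # (B, vis_dim * n_prompts)
        return out.view(-1, self.n_prompts, self.vis_dim)
```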

Result: Extensive experiments show FedAPT significantly outperforms existing methods in adversarial robustness and demonstrates exceptional generalization in cross-domain and cross-dataset scenarios.

Conclusion: FedAPT effectively addresses the class information gap in federated learning and provides superior adversarial robustness, making it suitable for real-world applications.

Abstract: Federated Prompt Tuning (FPT) is an efficient method for cross-client collaborative fine-tuning of large Vision-Language Models (VLMs). However, models tuned using FPT are vulnerable to adversarial attacks, leading to misclassification in downstream tasks. In this work, we introduce Federated Adversarial Prompt Tuning (FedAPT), a novel method designed to enhance the adversarial robustness of FPT. We identify a key issue in FedAPT under non-independent and identically distributed (non-IID) settings: a class information gap between clients and the global model. Clients rely solely on limited local label information to generate adversarial samples for training, while the global model must defend against adversarial attacks from global labels. To address this issue, we propose a class-aware prompt generator that generates visual prompts from text prompts. This generator is guided by a Global Label Embedding (serving as a “beacon”) which encodes cross-client label information to create more globally-aligned visual prompts. Additionally, we propose a cross-layer generator sharing strategy to enhance prompt coupling across different layers of the model, further boosting adversarial robustness. Extensive experiments on multiple image classification datasets demonstrate the superiority of FedAPT in improving adversarial robustness, outperforming existing methods by a large margin. FedAPT also exhibits exceptional generalization in cross-domain and cross-dataset scenarios, indicating its effectiveness in real-world applications.

[93] “Humor, Art, or Misinformation?”: A Multimodal Dataset for Intent-Aware Synthetic Image Detection

Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, Panagiotis C. Petrantonakis

Main category: cs.CV

TL;DR: S-HArM dataset for intent-aware classification of AI-generated images with three prompting strategies, showing image- and multimodally-guided data generalize better but overall performance remains challenging.

DetailsMotivation: Existing multimodal AI efforts overlook the intent behind AI-generated images, creating a gap in understanding whether content is created for humor/satire, art, or misinformation purposes.

Method: Created S-HArM dataset with 9,576 real-world image-text pairs labeled by intent. Explored three prompting strategies (image-guided, description-guided, multimodally-guided) with Stable Diffusion to generate synthetic training data. Tested various approaches including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models.

Result: Models trained on image- and multimodally-guided synthetic data showed better generalization to real-world content due to preserved visual context. However, overall classification performance remained limited across all approaches.

Conclusion: Inferring intent from AI-generated content is complex and requires specialized architectures beyond current multimodal models. The preservation of visual context in training data is crucial for better generalization to real-world scenarios.

Abstract: Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 “in the wild” image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to “in the wild” content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.

[94] Geospatial Foundational Embedder: Top-1 Winning Solution on EarthVision Embed2Scale Challenge (CVPR 2025)

Zirui Xu, Raphael Tang, Mike Bianco, Qi Zhang, Rishi Madhok, Nikolaos Karianakis, Fuxun Yu

Main category: cs.CV

TL;DR: Top-1 winning solution for CVPR 2025 EarthVision Embed2Scale challenge using hyperspectral geospatial data

DetailsMotivation: Develop foundational geospatial models that can embed SSL4EO-S12 hyperspectral data cubes into embedding vectors to facilitate various downstream tasks like classification and regression

Method: Not specified in the abstract; the method is described only in the full technical report.

Result: Achieved Top-1 winning solution in the Embed2Scale Challenge

Conclusion: Successful development of a method for embedding hyperspectral geospatial data that performs well in the competition setting

Abstract: The EarthVision Embed2Scale challenge (CVPR 2025) aims to develop foundational geospatial models to embed SSL4EO-S12 hyperspectral geospatial data cubes into embedding vectors that facilitate various downstream tasks, e.g., classification and regression. In this technical report, we introduce our proposed method for the Top-1 winning solution on the Embed2Scale Challenge.

[95] VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality

Srihari Bandraupalli, Anupam Purwar

Main category: cs.CV

TL;DR: ViLD framework bridges the gap between academic VLM evaluation and enterprise needs by testing 10 real-world business tasks on 7,500 real samples with innovative OCR comparison and comprehensive metrics.

DetailsMotivation: Current VLM benchmarks use synthetic data and multiple-choice questions that don't reflect real enterprise deployment requirements for tasks like social media analysis, logo detection, and content moderation.

Method: Developed ViLD framework with 10 business-critical tasks, created benchmark of 7,500 real-world samples, and introduced BlockWeaver Algorithm for comparing unordered OCR outputs without embeddings/LLMs. Used semantic matching, traditional metrics, and novel completeness/faithfulness measures.
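
The abstract does not spell out BlockWeaver's internals, so the following is only a naive greedy stand-in (not the BlockWeaver Algorithm itself) that illustrates the problem it solves: matching unordered, variably-grouped OCR blocks without embeddings or LLMs, here via token overlap:

```python
def greedy_block_match(pred_blocks, gt_blocks):
    """Naive greedy stand-in: match each predicted OCR block to its best
    unmatched ground-truth block by Jaccard token overlap, then average."""
    def overlap(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)
    remaining, total = list(gt_blocks), 0.0
    for p in pred_blocks:
        if not remaining:
            break
        best = max(remaining, key=lambda g: overlap(p, g))
        total += overlap(p, best)
        remaining.remove(best)       # each ground-truth block used once
    return total / max(len(gt_blocks), 1)
```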

Result: Benchmarked leading open-source VLMs (Qwen, MIMO, InternVL) against proprietary baseline, providing industry-grounded assessment of capabilities for enterprise deployment.

Conclusion: ViLD offers the first comprehensive framework for evaluating VLMs on real enterprise requirements, providing actionable insights for deployment in business environments beyond academic benchmarks.

Abstract: Open-source Vision-Language Models show immense promise for enterprise applications, yet a critical disconnect exists between academic evaluation and enterprise deployment requirements. Current benchmarks rely heavily on multiple-choice questions and synthetic data, failing to capture the complexity of real-world business applications like social media content analysis. This paper introduces VLM-in-the-Wild (ViLD), a comprehensive framework to bridge this gap by evaluating VLMs on operational enterprise requirements. We define ten business-critical tasks: logo detection, OCR, object detection, human presence and demographic analysis, human activity and appearance analysis, scene detection, camera perspective and media quality assessment, dominant colors, comprehensive description, and NSFW detection. To this framework, we bring an innovative BlockWeaver Algorithm that solves the challenging problem of comparing unordered, variably-grouped OCR outputs from VLMs without relying on embeddings or LLMs, achieving remarkable speed and reliability. To demonstrate efficacy of ViLD, we constructed a new benchmark dataset of 7,500 diverse samples, carefully stratified from a corpus of one million real-world images and videos. ViLD provides actionable insights by combining semantic matching (both embedding-based and LLM-as-a-judge approaches), traditional metrics, and novel methods to measure the completeness and faithfulness of descriptive outputs. By benchmarking leading open-source VLMs (Qwen, MIMO, and InternVL) against a powerful proprietary baseline as per ViLD framework, we provide one of the first industry-grounded, task-driven assessment of VLMs capabilities, offering actionable insights for their deployment in enterprise environments.

[96] Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems

Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang

Main category: cs.CV

TL;DR: Vision language models struggle with fragmented, fused, or occluded text that humans can easily read, revealing structural limitations in their compositional reasoning across writing systems.

DetailsMotivation: To investigate whether advanced vision language models share humans' remarkable resilience in recognizing words despite character fragmentation, fusion, or occlusion across different writing systems.

Method: Constructed two psychophysics-inspired benchmarks for Chinese logographs and English alphabetic words by splicing, recombining, and overlaying glyphs to create ‘visible but unreadable’ stimuli for models while remaining legible to humans.

Result: Contemporary VLMs show severe performance drops under these perturbations, frequently producing unrelated or incoherent outputs, indicating they rely heavily on generic visual invariances but underutilize compositional priors needed for robust literacy.

Conclusion: The findings reveal structural limitations in current VLMs and motivate the development of architectures and training strategies that better encode symbol segmentation, composition, and binding across scripts for applications in education, accessibility, cultural heritage, and security.

Abstract: Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics-inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield “visible but unreadable” stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.

[97] K-Syn: K-space Data Synthesis in Ultra Low-data Regimes

Guan Yu, Zhang Jianhua, Liang Dong, Liu Qiegen

Main category: cs.CV

TL;DR: A novel method for dynamic cardiac MRI reconstruction that performs feature-level learning directly in the frequency domain using temporal fusion strategies to synthesize k-space data, particularly effective in low-data scenarios.

DetailsMotivation: Dynamic cardiac MRI faces challenges due to limited high-quality k-space data availability, which hampers robust reconstruction. Traditional methods use pixel-level convolution in image domain, but frequency domain offers better global representation capacity.

Method: Feature-level learning in frequency domain using Fourier transform’s global representation capacity. Temporal fusion strategies integrate k-space data across time frames to guide generative trajectory. Focuses on frequency domain modeling instead of traditional image domain approaches.
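
As a rough illustration of frequency-domain temporal fusion, the sketch below combines complex k-space frames with per-frame weights and maps the fused frame back to the image domain; the uniform weighting and the shapes are assumptions, not the paper's fusion strategies:

```python
import torch

def temporal_kspace_fusion(kspace, weights=None):
    """Illustrative temporal fusion in the frequency domain: combine
    complex k-space frames of shape (T, H, W) with per-frame weights
    into one fused frame used as generative guidance."""
    T = kspace.shape[0]
    if weights is None:
        weights = torch.full((T,), 1.0 / T, dtype=torch.float32)
    w = weights.to(kspace.device).reshape(T, 1, 1)
    fused = (w * kspace).sum(dim=0)                      # weighted temporal average
    image = torch.fft.ifft2(torch.fft.ifftshift(fused))  # back to image domain
    return fused, image.abs()
```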

Result: Experimental results show strong generative ability in low-data regimes, demonstrating practical potential to alleviate data scarcity issues in dynamic MRI reconstruction.

Conclusion: The proposed frequency-domain feature learning with temporal fusion enables stable and rich k-space data generation even with ultra low-data availability, offering a promising solution for dynamic cardiac MRI reconstruction challenges.

Abstract: Owing to the inherently dynamic and complex characteristics of cardiac magnetic resonance (CMR) imaging, high-quality and diverse k-space data are rarely available in practice, which in turn hampers robust reconstruction of dynamic cardiac MRI. To address this challenge, we perform feature-level learning directly in the frequency domain and employ a temporal-fusion strategy as the generative guidance to synthesize k-space data. Specifically, leveraging the global representation capacity of the Fourier transform, the frequency domain can be considered a natural global feature space. Therefore, unlike traditional methods that use pixel-level convolution for feature learning and modeling in the image domain, this letter focuses on feature-level modeling in the frequency domain, enabling stable and rich generation even with ultra low-data regimes. Moreover, leveraging the advantages of feature-level modeling in the frequency domain, we integrate k-space data across time frames with multiple fusion strategies to steer and further optimize the generative trajectory. Experimental results demonstrate that the proposed method possesses strong generative ability in low-data regimes, indicating practical potential to alleviate data scarcity in dynamic MRI reconstruction.

[98] Feature Space Analysis by Guided Diffusion Model

Kimiaki Shirahama, Miki Yanobu, Kaduki Yamashita, Miho Ohsaki

Main category: cs.CV

TL;DR: A decoder method using guided diffusion to generate images that match user-specified DNN features, enabling analysis of black-box neural networks without additional training.

DetailsMotivation: To address the black-box nature of DNNs by providing a way to understand what image attributes are encoded in specific features through controlled image generation.

Method: Guided diffusion model that reverse-generates images while minimizing Euclidean distance between the generated image’s features and target features from pre-trained DNNs like CLIP, ResNet-50, and vision transformers.
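
The guidance rule is essentially classifier guidance with a feature-distance objective. A minimal sketch of one reverse step, where `predict_x0` and `p_mean_sigma` are assumed helpers of a pretrained diffusion wrapper rather than a real library API:

```python
import torch

def feature_guided_step(x_t, t, diffusion, feat_fn, target_feat, scale=1.0):
    """One reverse-diffusion step with feature guidance: nudge x_t so the
    feature of the predicted clean image moves toward the target feature."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = diffusion.predict_x0(x_t, t)            # assumed helper
    dist = torch.linalg.vector_norm(feat_fn(x0_hat) - target_feat)
    grad = torch.autograd.grad(dist, x_t)[0]
    mean, sigma = diffusion.p_mean_sigma(x_t, t)     # assumed helper
    # Shift the posterior mean down the feature-distance gradient.
    return mean - scale * sigma**2 * grad + sigma * torch.randn_like(x_t)
```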

Result: Generated images have features remarkably similar to user-specified ones, providing valuable insights into DNN feature spaces and working efficiently on commercial GPUs.

Conclusion: The proposed decoder successfully enables feature space analysis of various DNN architectures without retraining, offering a practical tool for understanding what visual information neural networks encode in their features.

Abstract: One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee that is missed in past studies, our decoder allows us to evidence which of various attributes in an image are encoded into a feature by the DNN, by generating images whose features are in proximity to that feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. The experimental results targeting CLIP’s image encoder, ResNet-50 and vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs’ feature spaces.

[99] Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories

Liviu Nicolae Fircă, Antonio Bărbălau, Dan Oneata, Elena Burceanu

Main category: cs.CV

TL;DR: This paper evaluates whether models can generalize attribute knowledge across semantically and perceptually dissimilar categories, finding performance drops significantly as training-test correlation decreases.

DetailsMotivation: To test if current models can abstract attributes and apply them to conceptually distant categories, going beyond narrow taxonomic or visually similar domains.

Method: Introduces train-test split strategies that progressively reduce correlation: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels.
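
Among the four strategies, the clustering split is easy to make concrete; a scikit-learn sketch, where the category names, their embeddings, and the cluster count are assumed inputs:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_split(cat_embeddings, categories, test_frac=0.2, k=20, seed=0):
    """Illustrative embedding-based clustering split: cluster category
    embeddings, then assign whole clusters to the test set so train and
    test categories are semantically decorrelated."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(cat_embeddings)
    rng = np.random.default_rng(seed)
    test_cats, target = set(), test_frac * len(categories)
    for c in rng.permutation(k):                     # random cluster order
        if len(test_cats) >= target:
            break
        test_cats.update(np.asarray(categories)[labels == c])
    train_cats = [c for c in categories if c not in test_cats]
    return train_cats, sorted(test_cats)
```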

Result: Results show sharp performance drop as correlation between training and test categories decreases, indicating strong sensitivity to split design. Clustering yields the most effective trade-off.

Conclusion: Findings reveal limitations of current representations and provide insights for future benchmark construction for attribute reasoning.

Abstract: Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation for the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types: e.g., identifying that the attribute “has four legs” is common to both “dogs” and “chairs”. To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.

[100] Human-in-the-Loop: Quantitative Evaluation of 3D Models Generation by Large Language Models

Ahmed R. Sadik, Mariusz Bujny

Main category: cs.CV

TL;DR: A human-in-the-loop framework for quantitative evaluation of LLM-generated 3D models using comprehensive similarity and complexity metrics, showing improved fidelity with richer semantic inputs and faster convergence than qualitative methods.

DetailsMotivation: Large Language Models can interpret multimodal inputs for 3D shape generation, but robust evaluation methods for geometric and structural fidelity remain underdeveloped, limiting applications like CAD democratization and rapid prototyping.

Method: Proposed a comprehensive suite of metrics (volumetric accuracy, surface alignment, dimensional fidelity, topological intricacy) to benchmark generated models against ground truth CAD references. Used an L-bracket case study across four input modalities: 2D orthographic views, isometric sketches, geometric structure trees, and code-based correction prompts.
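
The metrics themselves are straightforward to operationalize; for instance, volumetric accuracy can be computed as voxel IoU between the generated model and the CAD reference (voxelization of both meshes is assumed to happen upstream, on aligned grids):

```python
import numpy as np

def volumetric_iou(vox_gen, vox_ref):
    """Illustrative volumetric accuracy: IoU of two boolean voxel grids
    (generated model vs. ground-truth CAD), assumed to share one grid."""
    inter = np.logical_and(vox_gen, vox_ref).sum()
    union = np.logical_or(vox_gen, vox_ref).sum()
    return inter / union if union else 1.0   # empty-vs-empty counts as match
```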

Result: Found improved generation fidelity with increased semantic richness, with code-level prompts achieving perfect reconstruction across all metrics. The quantitative evaluation approach enabled significantly faster convergence toward ground truth compared to traditional qualitative methods.

Conclusion: This work advances understanding of AI-assisted shape synthesis and provides a scalable methodology to validate and refine generative models for diverse CAD applications, demonstrating the effectiveness of quantitative evaluation over visual inspection and human intuition.

Abstract: Large Language Models are increasingly capable of interpreting multimodal inputs to generate complex 3D shapes, yet robust methods to evaluate geometric and structural fidelity remain underdeveloped. This paper introduces a human-in-the-loop framework for the quantitative evaluation of LLM generated 3D models, supporting applications such as democratization of CAD design, reverse engineering of legacy designs, and rapid prototyping. We propose a comprehensive suite of similarity and complexity metrics, including volumetric accuracy, surface alignment, dimensional fidelity, and topological intricacy, to benchmark generated models against ground truth CAD references. Using an L-bracket component as a case study, we systematically compare LLM performance across four input modalities: 2D orthographic views, isometric sketches, geometric structure trees, and code-based correction prompts. Our findings demonstrate improved generation fidelity with increased semantic richness, with code-level prompts achieving perfect reconstruction across all metrics. A key contribution of this work is demonstrating that our proposed quantitative evaluation approach enables significantly faster convergence toward the ground truth, especially compared to traditional qualitative methods based solely on visual inspection and human intuition. This work not only advances the understanding of AI-assisted shape synthesis but also provides a scalable methodology to validate and refine generative models for diverse CAD applications.

[101] MEGS$^{2}$: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

Jiarui Chen, Yikeng Chen, Yingshuang Zou, Ye Huang, Peng Wang, Yuan Liu, Yujing Sun, Wenping Wang

Main category: cs.CV

TL;DR: MEGS² is a memory-efficient 3D Gaussian Splatting framework that reduces VRAM usage by 50% for static memory and 40% for rendering memory through optimized primitive and parameter compression, while maintaining comparable rendering quality.

DetailsMotivation: 3D Gaussian Splatting (3DGS) has high memory consumption that limits its use on edge devices. Most existing compression methods focus only on storage compression and fail to address the critical bottleneck of rendering memory.

Method: Proposes MEGS² framework that jointly optimizes primitive number and parameters per primitive. Replaces memory-intensive spherical harmonics with lightweight arbitrarily-oriented spherical Gaussian lobes for color representation. Introduces a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem.
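
The color swap is the heart of the parameter reduction: each primitive stores K spherical Gaussian lobes instead of spherical-harmonic coefficients. A sketch of evaluating view-dependent color from such lobes (shapes and parameterization are assumptions):

```python
import torch
import torch.nn.functional as F

def sg_color(view_dir, mu, lam, amp):
    """Evaluate color from K arbitrarily-oriented spherical Gaussian lobes,
    a lightweight stand-in for spherical harmonics. Assumed shapes:
    view_dir (..., 3), mu (K, 3) lobe axes, lam (K,) sharpness, amp (K, 3) RGB."""
    d = F.normalize(view_dir, dim=-1)
    axes = F.normalize(mu, dim=-1)
    cos = (d.unsqueeze(-2) * axes).sum(-1)     # (..., K) cosine to each lobe axis
    g = torch.exp(lam * (cos - 1.0))           # SG falloff, maximal along the axis
    return (g.unsqueeze(-1) * amp).sum(-2)     # (..., 3) summed RGB
```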

Result: Achieves 50% static VRAM reduction and 40% rendering VRAM reduction compared to existing methods while maintaining comparable rendering quality.

Conclusion: MEGS² successfully addresses the memory bottleneck in 3DGS rendering through innovative compression techniques, making it more suitable for edge device applications.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis technique, but its high memory consumption severely limits its applicability on edge devices. A growing number of 3DGS compression methods have been proposed to make 3DGS more efficient, yet most only focus on storage compression and fail to address the critical bottleneck of rendering memory. To address this problem, we introduce MEGS$^{2}$, a novel memory-efficient framework that tackles this challenge by jointly optimizing two key factors: the total primitive number and the parameters per primitive, achieving unprecedented memory compression. Specifically, we replace the memory-intensive spherical harmonics with lightweight arbitrarily-oriented spherical Gaussian lobes as our color representations. More importantly, we propose a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem. Experiments show that MEGS$^{2}$ achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods, while maintaining comparable rendering quality.

[102] Decoupled Sparse Priors Guided Diffusion Compression Model for Point Clouds

Xiaoge Zhang, Zijie Wu, Mehwish Nasim, Mingtao Feng, Ajmal Mian

Main category: cs.CV

TL;DR: A novel point cloud compression method using sparse priors and conditional diffusion to reduce redundancy in latent representations, achieving superior compression performance at high ratios.

DetailsMotivation: Existing lossy compression methods use autoencoders but leave latent representation redundancy unexplored, limiting compression efficiency especially at high ratios.

Method: Dual-density scheme with separate processing of latent points (for reconstruction) and decoupled sparse priors (for storage). Uses progressive conditional diffusion model with hierarchical intra-point and inter-point priors, attention-based conditional denoiser, and local distribution integration in arithmetic coding.

Result: Superior rate-distortion trade-off compared to state-of-the-art methods, demonstrated through extensive evaluations on ShapeNet, 8iVFB, and Owlii datasets.

Conclusion: The proposed sparse priors guided method effectively reduces redundancy in latent representations and achieves high reconstruction quality, particularly at high compression ratios, advancing point cloud compression technology.

Abstract: Lossy compression methods rely on an autoencoder to transform a point cloud into latent points for storage, leaving the inherent redundancy of latent representations unexplored. To reduce redundancy in latent points, we propose a sparse priors guided method that achieves high reconstruction quality, especially at high compression ratios. This is accomplished by a dual-density scheme separately processing the latent points (intended for reconstruction) and the decoupled sparse priors (intended for storage). Our approach features an efficient dual-density data flow that relaxes size constraints on latent points, and hybridizes a progressive conditional diffusion model to encapsulate essential details for reconstruction within the conditions, which are decoupled hierarchically to intra-point and inter-point priors. Specifically, our method encodes the original point cloud into latent points and decoupled sparse priors through separate encoders. Latent points serve as intermediates, while sparse priors act as adaptive conditions. We then employ a progressive attention-based conditional denoiser to generate latent points conditioned on the decoupled priors, allowing the denoiser to dynamically attend to geometric and semantic cues from the priors at each encoding and decoding layer. Additionally, we integrate the local distribution into the arithmetic encoder and decoder to enhance local context modeling of the sparse points. The original point cloud is reconstructed through a point decoder. Compared to the state of the art, our method obtains a superior rate-distortion trade-off, evidenced by extensive evaluations on the ShapeNet dataset and standard test datasets from the MPEG group, including 8iVFB and Owlii.

[103] Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models

Jisung Hwang, Jaihoon Kim, Minhyuk Sung

Main category: cs.CV

TL;DR: Novel regularization loss that enforces standard Gaussian distribution in latent space to improve downstream optimization tasks for text-to-image models.

DetailsMotivation: To facilitate downstream tasks involving optimization in the latent space of text-to-image models by ensuring samples follow a standard Gaussian distribution, which enables better performance in applications like test-time reward alignment.

Method: Composite loss combining moment-based regularization in spatial domain with power spectrum-based regularization in spectral domain, applied to randomly permuted inputs to ensure permutation invariance. Uses analytically known expected values of moments and power spectrum distributions.
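
A minimal sketch of the composite loss, assuming flattened samples treated as i.i.d. standard-normal variables: the moment targets (0, 1, 0, 3) and the flat unit power spectrum follow from N(0, 1), while the random permutation enforces the invariance mentioned above. The specific orders and weighting are assumptions:

```python
import torch

def gaussianity_loss(z, orders=(1, 2, 3, 4)):
    """Illustrative composite Gaussianity loss on a permuted 1-D view of z:
    match empirical moments to standard-normal moments and push the power
    spectrum toward the flat spectrum of white noise."""
    x = z.flatten()[torch.randperm(z.numel(), device=z.device)]
    target = {1: 0.0, 2: 1.0, 3: 0.0, 4: 3.0}           # E[x^k] for N(0, 1)
    moment_loss = sum((x.pow(k).mean() - target[k]) ** 2 for k in orders)
    power = torch.fft.rfft(x).abs().pow(2) / x.numel()  # periodogram
    spectral_loss = ((power - 1.0) ** 2).mean()         # white noise is flat at 1
    return moment_loss + spectral_loss
```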

Result: Outperforms previous Gaussianity regularization methods, effectively prevents reward hacking, and accelerates convergence in generative modeling for test-time reward alignment to enhance aesthetics and text alignment.

Conclusion: The proposed regularization framework provides a unified approach that encompasses existing Gaussianity-based methods while offering improved efficiency and performance for text-to-image model optimization tasks.

Abstract: We propose a novel regularization loss that enforces standard Gaussianity, encouraging samples to align with a standard Gaussian distribution. This facilitates a range of downstream tasks involving optimization in the latent space of text-to-image models. We treat elements of a high-dimensional sample as one-dimensional standard Gaussian variables and define a composite loss that combines moment-based regularization in the spatial domain with power spectrum-based regularization in the spectral domain. Since the expected values of moments and power spectrum distributions are analytically known, the loss promotes conformity to these properties. To ensure permutation invariance, the losses are applied to randomly permuted inputs. Notably, existing Gaussianity-based regularizations fall within our unified framework: some correspond to moment losses of specific orders, while the previous covariance-matching loss is equivalent to our spectral loss but incurs higher time complexity due to its spatial-domain computation. We showcase the application of our regularization in generative modeling for test-time reward alignment with a text-to-image model, specifically to enhance aesthetics and text alignment. Our regularization outperforms previous Gaussianity regularization, effectively prevents reward hacking and accelerates convergence.

[104] SAM$^{*}$: Task-Adaptive SAM with Physics-Guided Rewards

Kamyar Barakati, Utkarsh Pratiush, Sheryl L. Sanchez, Aditya Raghavan, Delia J. Milliron, Mahshid Ahmadi, Philip D. Rack, Sergei V. Kalinin

Main category: cs.CV

TL;DR: Reward function-based optimization for fine-tuning foundational segmentation models like SAM, enabling real-time streaming data analysis in microscopy without manual parameter tuning.

DetailsMotivation: Foundational models for image segmentation have too many non-transparent tuning parameters requiring manual optimization, limiting their usability for real-time streaming data analysis in microscopy.

Method: Introduce reward function-based optimization to fine-tune foundational models (demonstrated with SAM). Reward functions represent physics of imaged system (particle size distributions, geometries, etc.) to create optimized variant SAM*.
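
A reward of the kind described can be as simple as comparing connected-component statistics of a candidate mask against expected physics; a sketch with assumed target statistics and an assumed exponential scoring rule:

```python
import numpy as np
from scipy import ndimage

def particle_size_reward(mask, target_mean_px, target_std_px):
    """Illustrative physics-guided reward: score a binary segmentation by
    how well its particle-size distribution matches expected statistics."""
    labeled, n = ndimage.label(mask)
    if n == 0:
        return 0.0
    sizes = ndimage.sum(mask, labeled, index=np.arange(1, n + 1))
    # Penalize relative deviation of observed mean/std particle area.
    err = (abs(sizes.mean() - target_mean_px) / target_mean_px +
           abs(sizes.std() - target_std_px) / max(target_std_px, 1e-6))
    return float(np.exp(-err))          # 1.0 = perfect match, decays with error
```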

Result: Enhanced SAM’s adaptability and performance for diverse segmentation tasks, particularly enabling real-time streaming data segmentation in microscopy applications.

Conclusion: Reward-driven optimization framework successfully improves foundational models’ usability for real-time microscopy segmentation by automating parameter tuning through physics-based reward functions.

Abstract: Image segmentation is a critical task in microscopy, essential for accurately analyzing and interpreting complex visual data. This task can be performed using custom models trained on domain-specific datasets, transfer learning from pre-trained models, or foundational models that offer broad applicability. However, foundational models often present a considerable number of non-transparent tuning parameters that require extensive manual optimization, limiting their usability for real-time streaming data analysis. Here, we introduce a reward function-based optimization to fine-tune foundational models and illustrate this approach for SAM (Segment Anything Model) framework by Meta. The reward functions can be constructed to represent the physics of the imaged system, including particle size distributions, geometries, and other criteria. By integrating a reward-driven optimization framework, we enhance SAM’s adaptability and performance, leading to an optimized variant, SAM$^{*}$, that better aligns with the requirements of diverse segmentation tasks and particularly allows for real-time streaming data segmentation. We demonstrate the effectiveness of this approach in microscopy imaging, where precise segmentation is crucial for analyzing cellular structures, material interfaces, and nanoscale features.

[105] Enhancing Classification of Streaming Data with Image Distillation

Rwad Khatib, Yehudit Aperstein

Main category: cs.CV

TL;DR: A distillation-based method achieves 73.1% accuracy for streaming image classification, outperforming traditional algorithms and reservoir sampling in resource-constrained environments.

DetailsMotivation: To address the challenge of efficiently classifying streaming data with limited memory and computational resources, particularly for image data where traditional methods may be inadequate.

Method: Proposes Distillation Based Classification (DBC) that distills essential features from data streams to minimize computational demands while preserving crucial information. Compared against traditional algorithms (Hoeffding Trees, Adaptive Random Forest) adapted through embeddings and Reservoir Sampling Based Classification.

Result: DBC demonstrated superior performance with 73.1% accuracy rate, surpassing both traditional methods and RBC technique.

Conclusion: This represents a significant advancement in streaming data classification, showing effectiveness in processing complex data streams and setting new standards for accuracy and efficiency in resource-constrained environments.

Abstract: This study tackles the challenge of efficiently classifying streaming data in environments with limited memory and computational resources. It delves into the application of data distillation as an innovative approach to improve the precision of streaming image data classification. By focusing on distilling essential features from data streams, our method aims to minimize computational demands while preserving crucial information for accurate classification. Our investigation compares this approach against traditional algorithms like Hoeffding Trees and Adaptive Random Forest, adapted through embeddings for image data. The Distillation Based Classification (DBC) demonstrated superior performance, achieving a 73.1% accuracy rate, surpassing both traditional methods and the Reservoir Sampling Based Classification (RBC) technique. This marks a significant advancement in streaming data classification, showcasing the effectiveness of our method in processing complex data streams and setting a new standard for accuracy and efficiency.

[106] Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

Juan Manuel Contreras

Main category: cs.CV

TL;DR: Aymara benchmark reveals LMMs amplify gender stereotypes in image generation, showing systematic male bias across professions and significant model variation in bias levels.

DetailsMotivation: To address methodological limitations in previous gender bias studies of large multimodal models and provide large-scale, comparable cross-model analysis of social bias in AI-generated images.

Method: Created Aymara Image Fairness Evaluation benchmark with 75 gender-neutral prompts across stereotypical/non-stereotypical professions, tested 13 commercial LMMs, generated 965 images, and used LLM-as-judge system to score gender representation.

Result: LMMs systematically amplify occupational gender stereotypes (93.0% men for male-stereotyped vs 22.5% for female-stereotyped professions), show strong default-male bias (68.3% men for non-stereotyped professions), with bias varying dramatically across models (46.7%-73.3% male representation).

Conclusion: High bias is not inevitable but results from design choices; standardized automated evaluation tools are necessary for promoting accountability and fairness in AI development.

Abstract: Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Prior work has identified gender bias in these models, but methodological limitations prevented large-scale, comparable, cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally-generated, gender-neutral prompts to generate people in stereotypically-male, stereotypically-female, and non-stereotypical professions. We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. Our results reveal (p < .001 for all): 1) LMMs systematically not only reproduce but actually amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) Models exhibit a strong default-male bias, generating men in 68.3% of the time for non-stereotyped professions; and 3) The extent of bias varies dramatically across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation suggests high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and underscores the necessity of standardized, automated evaluation tools for promoting accountability and fairness in AI development.

[107] Faster VGGT with Block-Sparse Global Attention

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe

Main category: cs.CV

TL;DR: Proposes block-sparse attention kernels to replace dense global attention in transformer-based multi-view reconstruction models, achieving 4x faster inference without retraining while maintaining performance.

DetailsMotivation: Transformer-based models like VGGT and π³ face runtime bottlenecks due to quadratic complexity of global attention layers, limiting scalability to large image sets.

Method: Replace dense global attention with optimized block-sparse kernels based on observation that attention probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric matches.
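
A sketch of how such a block mask could be selected, pooling tokens into blocks and keeping the top-scoring block pairs per query block; this illustrates the idea only, not the paper's optimized kernels:

```python
import torch

def topk_block_mask(q, k, block=128, keep=0.25):
    """Illustrative block-sparse attention mask: pool queries/keys into
    blocks, score block pairs by pooled similarity, and keep the top
    fraction per query block (a stand-in for cross-view match structure)."""
    B, N, D = q.shape
    nb = N // block
    qb = q[:, :nb * block].reshape(B, nb, block, D).mean(2)  # pooled queries
    kb = k[:, :nb * block].reshape(B, nb, block, D).mean(2)  # pooled keys
    scores = qb @ kb.transpose(-1, -2)                       # (B, nb, nb)
    n_keep = max(1, int(keep * nb))
    idx = scores.topk(n_keep, dim=-1).indices
    mask = torch.zeros(B, nb, nb, dtype=torch.bool, device=q.device)
    mask.scatter_(-1, idx, True)                             # True = compute block
    return mask
```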

Result: Achieves up to 4x faster inference with comparable task performance, requires no retraining, and supports large image collections.

Conclusion: The proposed block-sparse attention retrofit effectively addresses scalability limitations while maintaining performance across comprehensive multi-view benchmarks.

Abstract: Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and $\pi^3$ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck, due to the quadratic complexity of the global attention layers, that limits the scalability to large image sets. In this paper, we empirically analyze the global attention matrix of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by the structured attention and inspired by recent advancement in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $\pi^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.

[108] Realism to Deception: Investigating Deepfake Detectors Against Face Enhancement

Muhammad Saad Saeed, Ijaz Ul Haq, Khalid Malik

Main category: cs.CV

TL;DR: Face enhancement techniques, including traditional filters and GAN-based methods, can significantly reduce deepfake detection accuracy by distorting biometric features, achieving attack success rates up to 75.12%.

DetailsMotivation: To investigate whether face enhancement techniques, while improving perceptual quality, can inadvertently degrade deepfake detector performance and serve as anti-forensic tools.

Method: Systematic evaluation of traditional image processing and GAN-based enhancement methods on deepfake detectors, including analysis of Naïve, Spatial, and Frequency-based detection methods. Conducted adversarial training experiments to assess model robustness.
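
For clarity, one common way to define the attack success rate (ASR) in this setting, stated here as an assumption since the summary does not define it:

```python
import numpy as np

def attack_success_rate(detector, fake_images, enhance):
    """ASR as assumed here: the fraction of fakes the detector caught
    before enhancement but misses after enhancement (1 = fake, 0 = real)."""
    before = np.array([detector(x) for x in fake_images])
    after = np.array([detector(enhance(x)) for x in fake_images])
    caught = before == 1
    if caught.sum() == 0:
        return 0.0
    return float(((after == 0) & caught).sum() / caught.sum())
```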

Result: Basic enhancement filters reduced detection accuracy with ASR up to 64.63%, while GAN-based techniques achieved even higher ASR up to 75.12%. Face enhancement methods effectively function as anti-forensic tools.

Conclusion: Face enhancement techniques pose a significant threat to deepfake detection systems, highlighting the need for more resilient and adaptive forensic methods to counter these anti-forensic approaches.

Abstract: Face enhancement techniques are widely used to enhance facial appearance. However, they can inadvertently distort biometric features, leading to significant decrease in the accuracy of deepfake detectors. This study hypothesizes that these techniques, while improving perceptual quality, can degrade the performance of deepfake detectors. To investigate this, we systematically evaluate whether commonly used face enhancement methods can serve an anti-forensic role by reducing detection accuracy. We use both traditional image processing methods and advanced GAN-based enhancements to evaluate the robustness of deepfake detectors. We provide a comprehensive analysis of the effectiveness of these enhancement techniques, focusing on their impact on Naïve, Spatial, and Frequency-based detection methods. Furthermore, we conduct adversarial training experiments to assess whether exposure to face enhancement transformations improves model robustness. Experiments conducted on the FaceForensics++, DeepFakeDetection, and CelebDF-v2 datasets indicate that even basic enhancement filters can significantly reduce detection accuracy, achieving ASR up to 64.63%. In contrast, GAN-based techniques further exploit these vulnerabilities, achieving ASR up to 75.12%. Our results demonstrate that face enhancement methods can effectively function as anti-forensic tools, emphasizing the need for more resilient and adaptive forensic methods.

[109] Dimensionally Reduced Open-World Clustering: DROWCULA

Erencem Ozbey, Dimitrios I. Diochnos

Main category: cs.CV

TL;DR: A fully unsupervised approach for novel class discovery in image classification using Vision Transformers and manifold learning to estimate cluster numbers and achieve state-of-the-art results.

DetailsMotivation: Supervised learning requires extensive human labeling effort, and real-world applications often encounter novel classes that weren't present during initial training, making traditional approaches insufficient for open-world scenarios.

Method: Uses Vision Transformers with attention mechanisms to generate vector embeddings, incorporates manifold learning techniques to refine embeddings by exploiting data geometry, and estimates number of clusters without supervision.
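
Estimating the number of clusters without supervision amounts to a model-selection sweep over the embedding space; a scikit-learn sketch used here only as a stand-in for the paper's estimator, with silhouette score as the assumed criterion:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_num_clusters(embeddings, k_min=2, k_max=30, seed=0):
    """Illustrative cluster-count estimation over ViT (optionally
    manifold-reduced) embeddings: keep the k with the best silhouette."""
    best_k, best_s = k_min, -1.0
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
        s = silhouette_score(embeddings, labels)
        if s > best_s:
            best_k, best_s = k, s
    return best_k
```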

Result: Achieves new State-of-the-Art results on single-modal clustering and Novel Class Discovery across multiple datasets (CIFAR-10, CIFAR-100, ImageNet-100, Tiny ImageNet), working both when cluster numbers are known or unknown.

Conclusion: Demonstrates that fully unsupervised approaches can effectively discover novel classes in open-world scenarios, outperforming semi-supervised methods and providing a practical solution for real-world applications where novel categories may emerge.

Abstract: Working with annotated data is the cornerstone of supervised learning. Nevertheless, providing labels to instances is a task that requires significant human effort. Several critical real-world applications make things more complicated because no matter how many labels may have been identified in a task of interest, it could be the case that examples corresponding to novel classes may appear in the future. Unsurprisingly, prior work in this so-called ‘open-world’ context has focused largely on semi-supervised approaches. Focusing on image classification, somehow paradoxically, we propose a fully unsupervised approach to the problem of determining the novel categories in a particular dataset. Our approach relies on estimating the number of clusters using Vision Transformers, which utilize attention mechanisms to generate vector embeddings. Furthermore, we incorporate manifold learning techniques to refine these embeddings by exploiting the intrinsic geometry of the data, thereby enhancing the overall image clustering performance. Overall, we establish new State-of-the-Art results on single-modal clustering and Novel Class Discovery on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet. We do so, both when the number of clusters is known or unknown ahead of time. The code is available at: https://github.com/DROWCULA/DROWCULA.

[110] XBusNet: Text-Guided Breast Ultrasound Segmentation via Multimodal Vision-Language Learning

Raja Mallina, Bryar Shareef

Main category: cs.CV

TL;DR: XBusNet is a dual-prompt, dual-branch multimodal model for breast ultrasound segmentation that combines global image semantics with local boundary precision using automated text prompts from clinical metadata, achieving state-of-the-art performance.

DetailsMotivation: Breast ultrasound segmentation is challenging for small or low-contrast lesions with fuzzy margins and speckle noise. Existing text-image approaches produce coarse responses that smear boundaries without fine edge recovery mechanisms.

Method: Proposed XBusNet with global pathway (CLIP Vision Transformer for whole-image semantics) and local pathway (U-Net for precise boundaries). Uses automated text prompts from structured metadata describing shape, margin, and BI-RADS terms without manual input.
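
Since prompts are assembled automatically from structured metadata, the assembly step is essentially templating; a sketch with hypothetical field names (the actual schema is not given in the summary):

```python
def build_prompts(meta):
    """Illustrative automatic prompt assembly from structured metadata;
    the field names here are assumptions, not the authors' schema."""
    global_prompt = (f"a {meta['size']} lesion located in the "
                     f"{meta['location']} of the breast")
    local_prompt = (f"{meta['shape']} shape, {meta['margin']} margin, "
                    f"BI-RADS category {meta['birads']}")
    return global_prompt, local_prompt

# Example:
# build_prompts({"size": "small", "location": "upper outer quadrant",
#                "shape": "irregular", "margin": "ill-defined", "birads": "4"})
```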

Result: Achieves state-of-the-art performance on BLU dataset with mean Dice of 0.8765 and IoU of 0.8149, outperforming six baselines. Largest gains for small lesions with fewer missed regions and spurious activations.

Conclusion: Dual-prompt, dual-branch multimodal design combining global semantics with local precision yields accurate segmentation masks and improves robustness for small, low-contrast breast lesions.

Abstract: Background: Precise breast ultrasound (BUS) segmentation supports reliable measurement, quantitative analysis, and downstream classification, yet remains difficult for small or low-contrast lesions with fuzzy margins and speckle noise. Text prompts can add clinical context, but directly applying weakly localized text-image cues (e.g., CAM/CLIP-derived signals) tends to produce coarse, blob-like responses that smear boundaries unless additional mechanisms recover fine edges. Methods: We propose XBusNet, a novel dual-prompt, dual-branch multimodal model that combines image features with clinically grounded text. A global pathway based on a CLIP Vision Transformer encodes whole-image semantics conditioned on lesion size and location, while a local U-Net pathway emphasizes precise boundaries and is modulated by prompts that describe shape, margin, and Breast Imaging Reporting and Data System (BI-RADS) terms. Prompts are assembled automatically from structured metadata, requiring no manual clicks. We evaluate on the Breast Lesions USG (BLU) dataset using five-fold cross-validation. Primary metrics are Dice and Intersection over Union (IoU); we also conduct size-stratified analyses and ablations to assess the roles of the global and local paths and the text-driven modulation. Results: XBusNet achieves state-of-the-art performance on BLU, with mean Dice of 0.8765 and IoU of 0.8149, outperforming six strong baselines. Small lesions show the largest gains, with fewer missed regions and fewer spurious activations. Ablation studies show complementary contributions of global context, local boundary modeling, and prompt-based modulation. Conclusions: A dual-prompt, dual-branch multimodal design that merges global semantics with local precision yields accurate BUS segmentation masks and improves robustness for small, low-contrast lesions.

[111] Breast Cancer Detection in Thermographic Images via Diffusion-Based Augmentation and Nonlinear Feature Fusion

Sepehr Salem, M. Moein Esfahani, Jingyu Liu, Vince Calhoun

Main category: cs.CV

TL;DR: Proposes a DPM-based data augmentation framework for breast cancer classification in thermograms, achieving 98.0% accuracy by fusing deep ResNet-50 features with handcrafted nonlinear features through XGBoost.

DetailsMotivation: Data scarcity in medical imaging hinders deep learning performance, particularly for breast cancer classification using thermograms.

Method: Uses Diffusion Probabilistic Model (DPM) for superior data augmentation, fuses pre-trained ResNet-50 deep features with handcrafted nonlinear features (e.g., Fractal Dimension) from U-Net segmented tumors, and employs XGBoost classifier.
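
Of the handcrafted features, fractal dimension is the most involved; a standard box-counting sketch over a binary tumor mask (the scales are assumptions), whose scalar output would be concatenated with the ResNet-50 features before the XGBoost classifier:

```python
import numpy as np

def box_counting_dimension(mask, sizes=(2, 4, 8, 16, 32)):
    """Illustrative box-counting fractal dimension of a binary mask: count
    occupied boxes at several scales, fit log(count) vs log(1/size)."""
    counts = []
    for s in sizes:
        h, w = (mask.shape[0] // s) * s, (mask.shape[1] // s) * s
        blocks = mask[:h, :w].reshape(h // s, s, w // s, s)
        # Guard against log(0) for very sparse masks.
        counts.append(max(int(blocks.any(axis=(1, 3)).sum()), 1))
    coeffs = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return coeffs[0]                    # slope of the fit = fractal dimension
```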

Result: Achieves 98.0% accuracy and 98.1% sensitivity, with ablation studies confirming statistical significance of both DPM augmentation and nonlinear feature fusion.

Conclusion: Validates the synergy between advanced generative models and interpretable features for creating highly accurate medical diagnostic tools.

Abstract: Data scarcity hinders deep learning for medical imaging. We propose a framework for breast cancer classification in thermograms that addresses this using a Diffusion Probabilistic Model (DPM) for data augmentation. Our DPM-based augmentation is shown to be superior to both traditional methods and a ProGAN baseline. The framework fuses deep features from a pre-trained ResNet-50 with handcrafted nonlinear features (e.g., Fractal Dimension) derived from U-Net segmented tumors. An XGBoost classifier trained on these fused features achieves 98.0% accuracy and 98.1% sensitivity. Ablation studies and statistical tests confirm that both the DPM augmentation and the nonlinear feature fusion are critical, statistically significant components of this success. This work validates the synergy between advanced generative models and interpretable features for creating highly accurate medical diagnostic tools.

[112] Reconstruction Alignment Improves Unified Multimodal Models

Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang

Main category: cs.CV

TL;DR: RecA is a resource-efficient post-training method that uses visual understanding embeddings as dense text prompts to align multimodal models’ understanding and generation capabilities, improving image generation and editing performance across various architectures with minimal compute.

DetailsMotivation: Conventional multimodal model training relies on sparse image-text pairs that miss fine-grained visual details, creating a gap between visual understanding and generation capabilities.

Method: Reconstruction Alignment (RecA) conditions unified multimodal models on their own visual understanding embeddings and optimizes them to reconstruct input images using self-supervised reconstruction loss.
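
Schematically, the post-training loop is a single self-supervised reconstruction step; this sketch uses assumed method names (`understanding_encoder`, `generate`) and an MSE stand-in for the paper's reconstruction loss, so it is a shape of the idea rather than the implementation:

```python
import torch
import torch.nn.functional as F

def reca_step(umm, image):
    """Illustrative RecA step: condition the unified model on its own
    understanding embedding of the image (a dense 'text prompt') and
    reconstruct the input with a self-supervised loss."""
    with torch.no_grad():
        dense_prompt = umm.understanding_encoder(image)  # frozen here (assumption)
    recon = umm.generate(condition=dense_prompt)  # must be differentiable
    loss = F.mse_loss(recon, image)               # stand-in reconstruction loss
    loss.backward()
    return loss
```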

Result: With only 27 GPU-hours, RecA substantially improves image generation performance on GenEval (0.73→0.90) and DPGBench (80.93→88.15), while boosting editing benchmarks (ImgEdit 3.38→3.75, GEdit 6.94→7.25).

Conclusion: RecA is an efficient and general post-training alignment strategy that works across diverse UMM architectures, surpassing larger open-source models and establishing a new standard for aligning visual understanding and generation.

Abstract: Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details–even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense “text prompts,” providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.

[113] DEPF: A UAV Multispectral Object Detector with Dual-Domain Enhancement and Priority-Guided Mamba Fusion

Shucong Li, Zhenyu Liu, Zijie Hong, Zhiheng Zhou, Xianghai Cao

Main category: cs.CV

TL;DR: A novel UAV multispectral object detector called DEPF is proposed to address challenges in low-light image enhancement, local target modeling, and computational efficiency using dual-domain enhancement and priority-guided Mamba fusion with linear complexity.

DetailsMotivation: To overcome three key challenges in UAV multispectral object detection: reduced complementarity in low-light conditions, interference from redundant information during fusion that affects local small target modeling, and the computational inefficiency of transformer-based methods on UAV platforms.

Method: Proposes DEPF with two main components: 1) Dual-Domain Enhancement Module (DDE) containing Cross-Scale Wavelet Mamba for global brightness enhancement and Fourier Details Recovery block for texture-detail recovery, and 2) Priority-Guided Mamba Fusion Module (PGMF) that uses priority scanning starting from local target features based on modality difference scores.

Result: Experiments on DroneVehicle and VEDAI datasets show that DEPF performs well on object detection tasks and outperforms state-of-the-art methods.

Conclusion: The proposed DEPF framework effectively addresses the three key challenges in UAV multispectral object detection through dual-domain enhancement and priority-guided Mamba fusion, achieving superior performance with linear computational complexity suitable for UAV platforms.

Abstract: Multispectral remote sensing object detection is one of the important applications of unmanned aerial vehicles (UAVs). However, it faces three challenges. First, low-light remote sensing images reduce the complementarity during multi-modality fusion. Second, local small-target modeling is easily interfered with by redundant information in the fusion stage. Third, due to their quadratic computational complexity, transformer-based methods are hard to apply on the UAV platform. To address these limitations, motivated by Mamba with linear complexity, a UAV multispectral object detector with dual-domain enhancement and priority-guided Mamba fusion (DEPF) is proposed. First, to enhance low-light remote sensing images, a Dual-Domain Enhancement Module (DDE) is designed, which contains Cross-Scale Wavelet Mamba (CSWM) and a Fourier Details Recovery block (FDR). CSWM applies cross-scale Mamba scanning to the low-frequency components to enhance the global brightness of images, while FDR constructs a spectrum recovery network to enhance the frequency spectra features for recovering texture details. Second, to enhance local target modeling and reduce the impact of redundant information during fusion, a Priority-Guided Mamba Fusion Module (PGMF) is designed. PGMF introduces the concept of priority scanning, which starts from local target features according to the priority scores obtained from the modality difference. Experiments on the DroneVehicle and VEDAI datasets show that DEPF performs well on object detection compared with state-of-the-art methods. Our code is available in the supplementary material.
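
A minimal sketch of the priority-scanning idea, under assumptions: two modality feature maps, absolute feature difference as the modality-difference score, and a generic token reordering that any Mamba-style sequence model could consume.

```python
import torch

def priority_scan_order(feat_rgb, feat_ir):
    """feat_*: (B, C, H, W). Returns per-sample indices sorting spatial
    tokens by modality-difference priority (likely local targets first)."""
    diff = (feat_rgb - feat_ir).abs().mean(dim=1)       # (B, H, W) difference score
    priority = diff.flatten(1)                          # (B, H*W)
    return priority.argsort(dim=1, descending=True)     # highest-difference first

def apply_order(tokens, order):
    """tokens: (B, L, C); reorder along the sequence dimension before scanning."""
    idx = order.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)
```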

[114] G3CN: Gaussian Topology Refinement Gated Graph Convolutional Network for Skeleton-Based Action Recognition

Haiqing Ren, Zhongkai Luo, Heng Fan, Xiaohui Yuan, Guanchen Wang, Libo Zhang

Main category: cs.CV

TL;DR: G$^{3}$CN - a novel graph convolutional network with Gaussian topology refinement and GRU gating to better distinguish ambiguous actions in skeleton-based recognition

DetailsMotivation: Standard GCNs struggle to distinguish between ambiguous actions due to limitations in representing topological and spatial features

Method: Incorporates Gaussian filter to refine skeleton topology graph and integrates GRUs into GCN framework to enhance information propagation between skeleton points

Result: Shows strong generalization across various GCN backbones and effectively improves action recognition, particularly for ambiguous samples on multiple benchmarks

Conclusion: G$^{3}$CN addresses the challenge of ambiguous action recognition by refining topology and enhancing feature propagation, demonstrating improved performance on standard benchmarks

Abstract: Graph Convolutional Networks (GCNs) have proven to be highly effective for skeleton-based action recognition, primarily due to their ability to leverage graph topology for feature aggregation, a key factor in extracting meaningful representations. However, despite their success, GCNs often struggle to effectively distinguish between ambiguous actions, revealing limitations in the representation of learned topological and spatial features. To address this challenge, we propose a novel approach, Gaussian Topology Refinement Gated Graph Convolution (G$^{3}$CN), for distinguishing ambiguous actions in skeleton-based action recognition. G$^{3}$CN incorporates a Gaussian filter to refine the skeleton topology graph, improving the representation of ambiguous actions. Additionally, Gated Recurrent Units (GRUs) are integrated into the GCN framework to enhance information propagation between skeleton points. Our method shows strong generalization across various GCN backbones. Extensive experiments on NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks demonstrate that G$^{3}$CN effectively improves action recognition, particularly for ambiguous samples.
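
A hedged sketch of the two ingredients the abstract names: a Gaussian filter refining the skeleton topology (here realized as a Gaussian of pairwise joint distances reweighting adjacency) and a GRU gating information propagation between joints. The sigma value and update form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def gaussian_refined_adjacency(joint_coords, sigma=1.0):
    """joint_coords: (V, 3). Soft adjacency from a Gaussian of pairwise distance."""
    d2 = torch.cdist(joint_coords, joint_coords).pow(2)    # (V, V)
    return torch.exp(-d2 / (2 * sigma ** 2))

class GatedGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)   # gates propagation between joints

    def forward(self, x, adj):
        # x: (B, V, D) node features; adj: (V, V) Gaussian-refined topology.
        msg = self.lin(torch.einsum('vw,bwd->bvd', adj, x))
        b, v, d = x.shape
        out = self.gru(msg.reshape(b * v, d), x.reshape(b * v, d))
        return out.reshape(b, v, d)
```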

[115] Parse Graph-Based Visual-Language Interaction for Human Pose Estimation

Shibang Liu, Xuemei Xie, Guangming Shi

Main category: cs.CV

TL;DR: PGVL introduces a parse graph-based visual-language interaction method with a Guided Module to effectively fuse multimodal information for human pose estimation, addressing occlusion challenges through hierarchical feature integration.

DetailsMotivation: Existing parse graph methods focus on single modality modeling and ignore multimodal fusion potential. Language offers rich spatial priors for occluded scenes, but current visual-language fusion approaches weaken occluded region responses and cause alignment failures.

Method: Proposes Parse Graph-based Visual-Language interaction (PGVL) with a novel Guided Module. Uses hierarchical nodes: low-level nodes maintain local features for occluded areas, high-level nodes integrate global features. Includes top-down decomposition and bottom-up composition with recursive bidirectional cross-attention purified by GM.

Result: The PGVL method and network are validated on major pose estimation datasets, demonstrating effectiveness in handling occluded scenes through multimodal fusion.

Conclusion: PGVL successfully addresses occlusion challenges in human pose estimation by leveraging language priors and hierarchical parse graph structures with guided multimodal interaction, achieving improved performance on standard datasets.

Abstract: Parse graphs boost human pose estimation (HPE) by integrating context and hierarchies, yet prior work mostly focuses on single-modality modeling, ignoring the potential of multimodal fusion. Notably, language offers rich HPE priors, such as spatial relations for occluded scenes, but existing visual-language fusion via global feature integration weakens occluded-region responses and causes alignment and location failures. To address this issue, we propose Parse Graph-based Visual-Language interaction (PGVL) with a core novel Guided Module (GM). In PGVL, low-level nodes focus on local features, maximizing the maintenance of responses in occluded areas, and high-level nodes integrate global features to infer occluded or invisible parts. The GM enables high-semantic nodes to guide the feature update of low-semantic nodes that have undergone cross-attention, ensuring effective fusion of diverse information. PGVL includes top-down decomposition and bottom-up composition. In the first stage, modality-specific parse graphs are constructed. In the next stage, recursive bidirectional cross-attention, purified by the GM, is applied. We also design a network based on PGVL. PGVL and our network are validated on major pose estimation datasets. We will release the code soon.

[116] DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation

Ze-Xin Yin, Jiaxiong Qiu, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie

Main category: cs.CV

TL;DR: LGAA is a novel framework for end-to-end PBR-ready 3D asset generation that unifies geometry and material modeling using multi-view diffusion priors, achieving high-quality results with minimal training data.

DetailsMotivation: Traditional 3D asset creation with PBR materials is labor-intensive, and existing 3D generation methods focus mainly on geometry while treating texture synthesis as post-processing, lacking end-to-end PBR-ready solutions.

Method: Modular framework with three components: LGAA Wrapper (reuses MV diffusion model layers), LGAA Switcher (aligns multiple diffusion priors), and LGAA Decoder (tamed VAE for 2D Gaussian Splatting with PBR channels), plus dedicated post-processing for mesh extraction.

Result: Superior performance demonstrated through extensive experiments with both text- and image-conditioned MV diffusion models, achieving efficient convergence trained on only 69k multi-view instances.

Conclusion: LGAA enables flexible incorporation of multiple diffusion priors, preserves knowledge for data-efficient training, and produces high-quality relightable mesh assets, representing a significant advancement in end-to-end PBR-ready 3D generation.

Abstract: The labor- and experience-intensive creation of 3D assets with physically based rendering (PBR) materials demands an autonomous 3D asset creation pipeline. However, most existing 3D generation methods focus on geometry modeling, either baking textures into simple vertex colors or leaving texture synthesis to post-processing with image diffusion models. To achieve end-to-end PBR-ready 3D asset generation, we present Lightweight Gaussian Asset Adapter (LGAA), a novel framework that unifies the modeling of geometry and PBR materials by exploiting multi-view (MV) diffusion priors from a novel perspective. The LGAA features a modular design with three components. Specifically, the LGAA Wrapper reuses and adapts network layers from MV diffusion models, which encapsulate knowledge acquired from billions of images, enabling better convergence in a data-efficient manner. To incorporate multiple diffusion priors for geometry and PBR synthesis, the LGAA Switcher aligns multiple LGAA Wrapper layers encapsulating different knowledge. Then, a tamed variational autoencoder (VAE), termed LGAA Decoder, is designed to predict 2D Gaussian Splatting (2DGS) with PBR channels. Finally, we introduce a dedicated post-processing procedure to effectively extract high-quality, relightable mesh assets from the resulting 2DGS. Extensive quantitative and qualitative experiments demonstrate the superior performance of LGAA with both text- and image-conditioned MV diffusion models. Additionally, the modular design enables flexible incorporation of multiple diffusion priors, and the knowledge-preserving scheme leads to efficient convergence trained on merely 69k multi-view instances. Our code, pre-trained weights, and the dataset used will be publicly available via our project page: https://zx-yin.github.io/dreamlifting/.

[117] In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

Taiying Peng, Jiacheng Hua, Miao Liu, Feng Lu

Main category: cs.CV

TL;DR: EgoGazeVQA is a new benchmark for egocentric gaze-guided video question answering that uses gaze information to better understand user intent in daily-life videos, showing that current MLLMs struggle with intent interpretation but gaze-guided methods significantly improve performance.

DetailsMotivation: Existing multimodal benchmarks overlook gaze as a crucial indicator of user intent in egocentric videos, which directly capture user focus and context in a unified coordinate system, limiting the development of proactive and personalized AI assistants.

Method: Created EgoGazeVQA benchmark with gaze-based QA pairs generated by MLLMs and refined by human annotators. Developed gaze-guided intent prompting methods that integrate spatial, temporal, and intent-related cues. Conducted experiments on gaze-related fine-tuning and analyzed gaze estimation accuracy impact.

Result: Existing MLLMs struggle to accurately interpret user intentions from egocentric videos. Gaze-guided intent prompting methods significantly enhance performance. Gaze estimation accuracy directly impacts prompting effectiveness.

Conclusion: Gaze information is valuable for creating more personalized and effective AI assistants in egocentric settings, as it provides crucial intent-related cues that significantly improve multimodal understanding of user focus and actions.

Abstract: The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants’ ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in a unified coordinate system, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings.
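
A minimal sketch of gaze-guided intent prompting as described: spatial and temporal gaze cues are serialized into the prompt ahead of the question before querying an MLLM. The exact prompt wording is an assumption.

```python
def build_gaze_prompt(question, fixations):
    """fixations: list of (t_seconds, x_norm, y_norm) gaze points."""
    cues = [f"at t={t:.1f}s the user looks near ({x:.2f}, {y:.2f})"
            for t, x, y in fixations]
    context = ("Gaze context (normalized image coordinates): "
               + "; ".join(cues) + ".")
    intent_hint = ("The gazed regions indicate what the user attends to; "
                   "use them to infer the user's intent.")
    return f"{context}\n{intent_hint}\nQuestion: {question}"

# Example usage with hypothetical fixations:
prompt = build_gaze_prompt("What is the user about to pick up?",
                           [(1.0, 0.42, 0.63), (1.5, 0.45, 0.61)])
```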

[118] GLEAM: Learning to Match and Explain in Cross-View Geo-Localization

Xudong Lu, Zhi Zheng, Yi Wan, Yongxiang Yao, Annan Wang, Renrui Zhang, Panwang Xia, Qiong Wu, Qingyun Li, Weifeng Lin, Xiangyu Zhao, Xue Yang, Hongsheng Li

Main category: cs.CV

TL;DR: GLEAM introduces a unified cross-view geo-localization framework with GLEAM-C for multi-modal alignment and GLEAM-X for explainable reasoning, combining accurate matching with interpretable analysis.

DetailsMotivation: Existing CVGL approaches are limited to single views/modalities and lack interpretability - they only predict matches without explaining the reasoning behind them.

Method: GLEAM-C aligns multiple views (UAV, street maps, panoramas, ground photos) with satellite imagery using optimized implementation and two-phase training. GLEAM-X leverages MLLMs for explainable reasoning and creates a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro.

Result: Achieves accuracy comparable to prior modality-specific models while providing interpretable correspondence analysis through the new explainable reasoning task.

Conclusion: The framework integrates multi-modal alignment with interpretable analysis, advancing geo-localization by enabling models to both explain and match cross-view correspondences, improving transparency and scalability.

Abstract: Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they merely predict whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities, including UAV imagery, street maps, panoramic views, and ground photographs, by aligning them exclusively with satellite imagery. Our framework enhances training efficiency through optimized implementation while achieving accuracy comparable to prior modality-specific CVGL models through a two-phase training strategy. Moreover, to address the lack of interpretability in traditional CVGL methods, we leverage the reasoning capabilities of multimodal large language models (MLLMs) to propose a new task, GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning. To support this task, we construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro to generate training and testing data. The test set is further refined through detailed human revision, enabling systematic evaluation of explainable cross-view reasoning and advancing transparency and scalability in geo-localization. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.

[119] XOCT: Enhancing OCT to OCTA Translation via Cross-Dimensional Supervised Multi-Scale Feature Learning

Pooya Khosravi, Kun Han, Anthony T. Wu, Arghavan Rezvani, Zexin Feng, Xiaohui Xie

Main category: cs.CV

TL;DR: XOCT is a deep learning framework that uses cross-dimensional supervision and multi-scale feature fusion for layer-aware OCT-to-OCTA translation, improving vascular reconstruction quality and clinical utility.

DetailsMotivation: Acquiring high-quality OCTA images is challenging due to motion sensitivity and high costs of software modifications for conventional OCT devices. Current deep learning methods overlook vascular differences across retinal layers and struggle with intricate vascular details needed for reliable diagnosis.

Method: Proposes XOCT framework with Cross-Dimensional Supervision (CDS) using 2D layer-wise en-face projections as supervisory signals, and Multi-Scale Feature Fusion (MSFF) network with channel reweighting strategy for multi-scale vascular detail capture.

Result: Experiments on OCTA-500 dataset demonstrate XOCT’s improvements, especially for en-face projections which are clinically significant for retinal pathology evaluation.

Conclusion: XOCT enhances OCTA accessibility, reliability, and diagnostic value for ophthalmic disease detection and monitoring, with code publicly available.

Abstract: Optical Coherence Tomography Angiography (OCTA) and its derived en-face projections provide high-resolution visualization of the retinal and choroidal vasculature, which is critical for the rapid and accurate diagnosis of retinal diseases. However, acquiring high-quality OCTA images is challenging due to motion sensitivity and the high costs associated with software modifications for conventional OCT devices. Moreover, current deep learning methods for OCT-to-OCTA translation often overlook the vascular differences across retinal layers and struggle to reconstruct the intricate, dense vascular details necessary for reliable diagnosis. To overcome these limitations, we propose XOCT, a novel deep learning framework that integrates Cross-Dimensional Supervision (CDS) with a Multi-Scale Feature Fusion (MSFF) network for layer-aware vascular reconstruction. Our CDS module leverages 2D layer-wise en-face projections, generated via segmentation-weighted z-axis averaging, as supervisory signals to compel the network to learn distinct representations for each retinal layer through fine-grained, targeted guidance. Meanwhile, the MSFF module enhances vessel delineation through multi-scale feature extraction combined with a channel reweighting strategy, effectively capturing vascular details at multiple spatial scales. Our experiments on the OCTA-500 dataset demonstrate XOCT’s improvements, especially for the en-face projections which are significant for clinical evaluation of retinal pathologies, underscoring its potential to enhance OCTA accessibility, reliability, and diagnostic value for ophthalmic disease detection and monitoring. The code is available at https://github.com/uci-cbcl/XOCT.
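
A sketch of the CDS supervisory signal under stated assumptions: soft layer segmentations weight the volume along the z-axis, yielding one en-face projection per retinal layer, as the abstract's "segmentation-weighted z-axis averaging" suggests.

```python
import torch

def layerwise_enface(volume, layer_masks, eps=1e-6):
    """volume: (B, 1, Z, H, W) OCTA volume.
    layer_masks: (B, L, Z, H, W) soft segmentation of L retinal layers.
    Returns (B, L, H, W): one en-face projection per layer."""
    weighted = volume * layer_masks                  # restrict signal to each layer
    num = weighted.sum(dim=2)                        # integrate along the z-axis
    den = layer_masks.sum(dim=2).clamp_min(eps)      # normalize by layer thickness
    return num / den
```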

[120] Bias-Aware Machine Unlearning: Towards Fairer Vision Models via Controllable Forgetting

Sai Siddhartha Chary Aylapuram, Veeraraju Elluru, Shivang Agarwal

Main category: cs.CV

TL;DR: Machine unlearning techniques can effectively mitigate bias in vision models by selectively removing biased samples/features, achieving up to 97% fairness improvement with minimal accuracy loss.

DetailsMotivation: Deep neural networks often rely on spurious correlations in training data, leading to biased predictions in safety-critical domains. Traditional bias mitigation requires retraining from scratch, but machine unlearning offers a post-hoc alternative.

Method: Bias-Aware Machine Unlearning using Gradient Ascent, LoRA, and Teacher-Student distillation to selectively remove biased samples or feature representations. Evaluated on CUB-200-2011 (pose bias), CIFAR-10 (patch bias), and CelebA (gender bias).

Result: Substantial reduction in subgroup disparities: 94.86% improvement on CUB-200, 30.28% on CIFAR-10, and 97.37% on CelebA. Minimal accuracy loss with average score of 0.62 across utility, fairness, quality, and privacy metrics.

Conclusion: Machine unlearning is a practical framework for enhancing fairness in deployed vision systems without requiring full retraining, establishing it as an effective post-hoc bias mitigation approach.

Abstract: Deep neural networks often rely on spurious correlations in training data, leading to biased or unfair predictions in safety-critical domains such as medicine and autonomous driving. While conventional bias mitigation typically requires retraining from scratch or redesigning data pipelines, recent advances in machine unlearning provide a promising alternative for post-hoc model correction. In this work, we investigate Bias-Aware Machine Unlearning, a paradigm that selectively removes biased samples or feature representations to mitigate diverse forms of bias in vision models. Building on privacy-preserving unlearning techniques, we evaluate various strategies including Gradient Ascent, LoRA, and Teacher-Student distillation. Through empirical analysis on three benchmark datasets, CUB-200-2011 (pose bias), CIFAR-10 (synthetic patch bias), and CelebA (gender bias in smile detection), we demonstrate that post-hoc unlearning can substantially reduce subgroup disparities, with improvements in demographic parity of up to 94.86% on CUB-200, 30.28% on CIFAR-10, and 97.37% on CelebA. These gains are achieved with minimal accuracy loss and with methods scoring an average of 0.62 across the 3 settings on the joint evaluation of utility, fairness, quality, and privacy. Our findings establish machine unlearning as a practical framework for enhancing fairness in deployed vision systems without necessitating full retraining.
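
A minimal sketch of the Gradient Ascent strategy among those evaluated, assuming a forget set of biased samples and a retain set for preserving utility; the retain term and its weight are assumptions.

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, optimizer, forget_batch, retain_batch, lam=1.0):
    """Ascend the loss on biased (forget) samples, descend on retain samples."""
    xf, yf = forget_batch
    xr, yr = retain_batch
    loss_forget = F.cross_entropy(model(xf), yf)
    loss_retain = F.cross_entropy(model(xr), yr)
    loss = -loss_forget + lam * loss_retain   # ascent on forget, descent on retain
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_forget.item(), loss_retain.item()
```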

[121] ANYPORTAL: Zero-Shot Consistent Video Background Replacement

Wenshuo Gao, Xicheng Lan, Shuai Yang

Main category: cs.CV

TL;DR: ANYPORTAL is a zero-shot framework for video background replacement that leverages pre-trained diffusion models without training, achieving precise foreground consistency and temporal coherence through collaborative integration of video and image diffusion models.

DetailsMotivation: Existing video generation methods lack fine-grained control over details and fail to achieve precise alignment with user intentions, limiting practical applicability for video editing tasks.

Method: Collaboratively integrates temporal prior of video diffusion models with relighting capabilities of image diffusion models. Uses a Refinement Projection Algorithm for pixel-level detail manipulation to ensure foreground consistency in a zero-shot setting.

Result: Achieves high-quality video background replacement results on consumer-grade GPUs, demonstrating effective foreground preservation and temporally coherent relighting without requiring training.

Conclusion: ANYPORTAL provides a practical and efficient training-free solution for video content creation and editing, overcoming challenges of foreground consistency and temporal coherence in video background replacement.

Abstract: Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.

[122] MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

Patrick Wienholt, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

Main category: cs.CV

TL;DR: MedicalPatchNet is a self-explainable deep learning model for chest X-ray classification that matches EfficientNet-B0 performance while providing transparent patch-based explanations without post-hoc techniques.

DetailsMotivation: Deep neural networks for radiological image classification often lack interpretability, limiting clinical acceptance and trust in AI-assisted diagnostics.

Method: Splits images into non-overlapping patches, independently classifies each patch, and aggregates predictions to enable intuitive visualization of each patch’s diagnostic contribution.

Result: Achieves AUROC of 0.907 (vs 0.908 for EfficientNet-B0) on CheXpert dataset with substantially improved interpretability - higher pathology localization accuracy (mean hit-rate 0.485 vs 0.376 with Grad-CAM) on CheXlocalize dataset.

Conclusion: MedicalPatchNet provides explicit, reliable explanations accessible to non-AI experts, mitigates shortcut learning risks, and improves clinical trust in AI-assisted diagnostics.

Abstract: Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance. We present MedicalPatchNet, an inherently self-explainable architecture for chest X-ray classification that transparently attributes decisions to distinct image regions. MedicalPatchNet splits images into non-overlapping patches, independently classifies each patch, and aggregates predictions, enabling intuitive visualization of each patch’s diagnostic contribution without post-hoc techniques. Trained on the CheXpert dataset (223,414 images), MedicalPatchNet matches the classification performance (AUROC 0.907 vs. 0.908) of EfficientNet-B0, while substantially improving interpretability, with higher pathology localization accuracy (mean hit-rate 0.485 vs. 0.376 with Grad-CAM) on the CheXlocalize dataset. By providing explicit, reliable explanations accessible even to non-AI experts, MedicalPatchNet mitigates risks associated with shortcut learning, thus improving clinical trust. Our model is publicly available with reproducible training and inference scripts and contributes to safer, explainable AI-assisted diagnostics across medical imaging domains. We make the code publicly available: https://github.com/TruhnLab/MedicalPatchNet
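
A hedged sketch of the patch-based design: non-overlapping patches are classified independently and their logits averaged, so the per-patch predictions double as an explanation map. The tiny patch encoder below is a stand-in, not the published architecture.

```python
import torch
import torch.nn as nn

class PatchClassifier(nn.Module):
    def __init__(self, patch=32, n_classes=14):
        super().__init__()
        self.patch = patch
        self.head = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes))

    def forward(self, x):
        # x: (B, 1, H, W) with H, W divisible by the patch size.
        p = self.patch
        patches = x.unfold(2, p, p).unfold(3, p, p)          # (B, C, Hp, Wp, p, p)
        b, c, hp, wp = patches.shape[:4]
        flat = patches.permute(0, 2, 3, 1, 4, 5).reshape(b * hp * wp, c, p, p)
        logits = self.head(flat).reshape(b, hp, wp, -1)      # per-patch predictions
        return logits.mean(dim=(1, 2)), logits               # image score + map
```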

[123] Estimating forest carbon stocks from high-resolution remote sensing imagery by reducing domain shift with style transfer

Zhenyu Yu, Jinnian Wang

Main category: cs.CV

TL;DR: Using GF-1 WFV and Landsat TM imagery with Swin Transformer attention mechanisms to improve forest carbon stock estimation accuracy through image translation techniques.

DetailsMotivation: Forests are vital carbon reservoirs that mitigate climate change, but current monitoring methods combining ground data with satellite imagery need improved accuracy for large-scale observation.

Method: Used GF-1 WFV and Landsat TM images from Huize County, China, and applied Swin Transformer with attention mechanisms to extract global features, converting carbon stock estimation into an image translation problem.

Result: The paper presents a novel approach using style transfer and transformer architecture for forest carbon monitoring, but specific quantitative results are not provided in the abstract.

Conclusion: The proposed method using Swin Transformer and attention mechanisms shows potential for improving the accuracy of forest carbon stock estimation through image translation techniques.

Abstract: Forests function as crucial carbon reservoirs on land, and their carbon sinks can efficiently reduce atmospheric CO2 concentrations and mitigate climate change. Currently, the overall trend for monitoring and assessing forest carbon stocks is to integrate ground monitoring sample data with satellite remote sensing imagery. This style of analysis facilitates large-scale observation but requires improvement in accuracy. We used GF-1 WFV and Landsat TM images to analyze Huize County, Qujing City, Yunnan Province in China. Using a style transfer method, we introduced the Swin Transformer to extract global features through attention mechanisms, converting carbon stock estimation into an image translation problem.

[124] LINR Bridge: Vector Graphic Animation via Neural Implicits and Video Diffusion Priors

Wenshuo Gao, Xicheng Lan, Luyao Zhang, Shuai Yang

Main category: cs.CV

TL;DR: A novel method that combines implicit neural representations with text-to-video diffusion models to automatically animate vector graphics while preserving their scalability and precision.

DetailsMotivation: Vector graphics offer scalability and user-friendliness but animating them requires substantial manual effort. Existing techniques have limitations in flexibility and animation quality.

Method: Uses layered implicit neural representations to reconstruct vector graphics, bridges domain gap with diffusion models, optimizes with video score distillation sampling using motion priors from text-to-video diffusion models, then warps vector graphics to match representations.

Result: Generates vivid and natural vector graphic animations with significant improvement over existing techniques, preserving infinite resolution and precise color/shape constraints.

Conclusion: The proposed method effectively automates vector graphic animation while maintaining the inherent advantages of vector graphics, demonstrating superior performance compared to current approaches.

Abstract: Vector graphics, known for their scalability and user-friendliness, provide a unique approach to visual content compared to traditional pixel-based images. Animation of these graphics, driven by the motion of their elements, offers enhanced comprehensibility and controllability but often requires substantial manual effort. To automate this process, we propose a novel method that integrates implicit neural representations with text-to-video diffusion models for vector graphic animation. Our approach employs layered implicit neural representations to reconstruct vector graphics, preserving their inherent properties such as infinite resolution and precise color and shape constraints, which effectively bridges the large domain gap between vector graphics and diffusion models. The neural representations are then optimized using video score distillation sampling, which leverages motion priors from pretrained text-to-video diffusion models. Finally, the vector graphics are warped to match the representations resulting in smooth animation. Experimental results validate the effectiveness of our method in generating vivid and natural vector graphic animations, demonstrating significant improvement over existing techniques that suffer from limitations in flexibility and animation quality.

[125] Fine-Tuning Vision-Language Models for Visual Navigation Assistance

Xiao Li, Bharat Gandhi, Ming Zhan, Mohit Nehra, Zhicheng Zhang, Yuchen Sun, Meijia Song, Naisheng Zhang, Xi Wang

Main category: cs.CV

TL;DR: Vision-language indoor navigation system for visually impaired using BLIP-2 with LoRA fine-tuning, achieving improved directional instruction generation with a refined BERT F1 evaluation metric.

DetailsMotivation: To assist visually impaired individuals with indoor navigation where traditional GPS-based systems fail due to lack of precise location data, by integrating vision and language models for step-by-step guidance.

Method: Fine-tuned BLIP-2 model using Low Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset, and developed a refined evaluation metric based on BERT F1 score that emphasizes directional and sequential variables.

Result: The model significantly improved in generating directional instructions after applying LoRA, overcoming limitations of the original BLIP-2 model.

Conclusion: The proposed vision-language approach with specialized fine-tuning and evaluation metrics effectively enhances indoor navigation capabilities for visually impaired users, improving accessibility and independence.

Abstract: We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors due to the lack of precise location data. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential variables, providing a more comprehensive measure of navigational performance. After applying LoRA, the model significantly improved in generating directional instructions, overcoming limitations in the original BLIP-2 model.
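
A rough sketch of the evaluation idea: up-weight directional and sequential tokens in a token-level F1. The real metric refines BERT F1; the token list and weight here are assumptions used only to illustrate the weighting.

```python
DIRECTIONAL = {"left", "right", "forward", "backward", "straight",
               "first", "then", "next", "finally"}

def directional_f1(pred, ref, w=2.0):
    """Token-level F1 with directional/sequential tokens weighted by w."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    weight = lambda t: w if t in DIRECTIONAL else 1.0
    tp = sum(weight(t) for t in p & r)
    prec = tp / max(sum(weight(t) for t in p), 1e-8)
    rec = tp / max(sum(weight(t) for t in r), 1e-8)
    return 2 * prec * rec / max(prec + rec, 1e-8)

# Example: missing the direction word is penalized more than a filler word.
print(directional_f1("turn left at the door", "turn right at the door"))
```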

[126] DiGS: Accurate and Complete Surface Reconstruction from 3D Gaussians via Direct SDF Learning

Wenzhi Guo, Bing Wang

Main category: cs.CV

TL;DR: DiGS integrates Signed Distance Field learning into 3D Gaussian Splatting to improve surface reconstruction while maintaining rendering quality.

DetailsMotivation: 3DGS achieves photorealistic view synthesis but struggles with accurate and complete surface reconstruction due to its unstructured nature and lack of explicit geometric supervision.

Method: Associates each Gaussian with a learnable SDF value to align primitives with geometry, and uses geometry-guided grid growth strategy for adaptive Gaussian distribution along geometry-consistent regions.

Result: Extensive experiments on DTU, Mip-NeRF 360, and Tanks&Temples benchmarks show consistent improvements in reconstruction accuracy and completeness while preserving high rendering fidelity.

Conclusion: DiGS successfully bridges the gap between rendering quality and geometric reconstruction in 3DGS by incorporating SDF learning, providing a unified framework with strong surface priors.

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful paradigm for photorealistic view synthesis, representing scenes with spatially distributed Gaussian primitives. While highly effective for rendering, achieving accurate and complete surface reconstruction remains challenging due to the unstructured nature of the representation and the absence of explicit geometric supervision. In this work, we propose DiGS, a unified framework that embeds Signed Distance Field (SDF) learning directly into the 3DGS pipeline, thereby enforcing strong and interpretable surface priors. By associating each Gaussian with a learnable SDF value, DiGS explicitly aligns primitives with underlying geometry and improves cross-view consistency. To further ensure dense and coherent coverage, we design a geometry-guided grid growth strategy that adaptively distributes Gaussians along geometry-consistent regions under a multi-scale hierarchy. Extensive experiments on standard benchmarks, including DTU, Mip-NeRF 360, and Tanks & Temples, demonstrate that DiGS consistently improves reconstruction accuracy and completeness while retaining high rendering fidelity.
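
A hedged sketch of the central coupling: each Gaussian carries a learnable SDF value, and a loss ties that value to an SDF field while pulling Gaussian centers toward the zero level set. The loss form and weighting are assumptions, not the paper's exact formulation.

```python
import torch

def sdf_alignment_loss(gauss_centers, gauss_sdf, sdf_field):
    """gauss_centers: (N, 3); gauss_sdf: (N,) learnable per-Gaussian values;
    sdf_field: callable mapping (N, 3) points to (N,) signed distances."""
    field_vals = sdf_field(gauss_centers)
    consistency = (gauss_sdf - field_vals).pow(2).mean()   # SDF agrees with field
    on_surface = gauss_sdf.abs().mean()                    # centers near zero set
    return consistency + 0.1 * on_surface
```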

[127] Generating Transferrable Adversarial Examples via Local Mixing and Logits Optimization for Remote Sensing Object Recognition

Chun Liu, Hailong Wang, Bingqian Zhu, Panpan Ding, Zheng Zheng, Tao Xu, Zhigang Han, Jiayao Wang

Main category: cs.CV

TL;DR: Proposes a novel adversarial attack framework using local mixing and logits optimization to improve transferability and avoid gradient vanishing issues in non-targeted attacks on remote sensing models.

DetailsMotivation: DNNs are vulnerable to adversarial attacks, especially in remote sensing applications. Current mixing-based strategies either destroy global semantic features or suffer from gradient diminishing during iterative updates, compromising adversarial example quality.

Method: 1) Local mixing strategy to generate diverse yet semantically consistent inputs by blending only local regions; 2) Adapts logit loss from targeted to non-targeted attacks to mitigate gradient vanishing; 3) Applies perturbation smoothing loss to suppress high-frequency noise.

Result: Extensive experiments on FGSCR-42 and MTARSI datasets show superior performance over 12 state-of-the-art methods across 6 surrogate models. Achieves 17.28% average improvement in black-box attack success rate with ResNet on MTARSI.

Conclusion: The proposed framework effectively addresses limitations of existing methods by preserving global semantics through local mixing and optimizing with logit loss, significantly enhancing adversarial example transferability for non-targeted attacks in remote sensing applications.

Abstract: Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, posing significant security threats to their deployment in remote sensing applications. Research on adversarial attacks not only reveals model vulnerabilities but also provides critical insights for enhancing robustness. Although current mixing-based strategies have been proposed to increase the transferability of adversarial examples, they either perform global blending or directly exchange a region in the images, which may destroy global semantic features and mislead the optimization of adversarial examples. Furthermore, their reliance on cross-entropy loss for perturbation optimization leads to gradient diminishing during iterative updates, compromising adversarial example quality. To address these limitations, we focus on non-targeted attacks and propose a novel framework via local mixing and logits optimization. First, we present a local mixing strategy to generate diverse yet semantically consistent inputs. Different from MixUp, which globally blends two images, and MixCut, which stitches images together, our method merely blends local regions to preserve global semantic information. Second, we adapt the logit loss from targeted attacks to non-targeted scenarios, mitigating the gradient vanishing problem of cross-entropy loss. Third, a perturbation smoothing loss is applied to suppress high-frequency noise and enhance transferability. Extensive experiments on FGSCR-42 and MTARSI datasets demonstrate superior performance over 12 state-of-the-art methods across 6 surrogate models. Notably, with ResNet as the surrogate on MTARSI, our method achieves a 17.28% average improvement in black-box attack success rate.
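
A minimal sketch of the two named ingredients, under assumptions about region size and blending weight: local mixing blends only a window of the image (preserving global semantics), and a logit loss replaces cross-entropy for non-targeted perturbation optimization.

```python
import torch

def local_mix(x, x2, frac=0.25, alpha=0.5):
    """Blend a random local window of x with x2; x, x2: (B, C, H, W)."""
    _, _, H, W = x.shape
    h, w = int(H * frac), int(W * frac)
    top = torch.randint(0, H - h + 1, (1,)).item()
    left = torch.randint(0, W - w + 1, (1,)).item()
    out = x.clone()
    out[:, :, top:top + h, left:left + w] = (
        alpha * x[:, :, top:top + h, left:left + w]
        + (1 - alpha) * x2[:, :, top:top + h, left:left + w])
    return out

def nontargeted_logit_loss(logits, y):
    """Drive down the true-class logit; avoids cross-entropy's vanishing
    gradients when ascending during iterative attack updates."""
    return -logits.gather(1, y.unsqueeze(1)).squeeze(1).mean()
```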

[128] MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection

Saad Lahlali, Alexandre Fournier Montgieux, Nicolas Granger, Hervé Le Borgne, Quoc Cuong Pham

Main category: cs.CV

TL;DR: MVAT is a weakly supervised 3D object detection framework that uses temporal multi-view data and teacher-student distillation to overcome projection ambiguities from 2D box annotations, achieving state-of-the-art performance without 3D labels.

DetailsMotivation: 3D data annotation is costly, and relying solely on 2D box annotations introduces projection ambiguities and difficulties with partial object visibility in single viewpoints.

Method: Leverages temporal multi-view data to aggregate object-centric point clouds across time, uses teacher-student distillation where teacher learns from aggregated static objects and generates pseudo-labels, and incorporates multi-view 2D projection loss for consistency.

Result: Achieves state-of-the-art performance on nuScenes and Waymo Open datasets, significantly narrowing the gap with fully supervised methods without requiring 3D box annotations.

Conclusion: MVAT effectively addresses weakly supervised 3D object detection challenges by utilizing temporal multi-view information and distillation, demonstrating strong performance comparable to fully supervised approaches.

Abstract: Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities, since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility under a single-viewpoint setting makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages the temporal multi-view information present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations as dense and complete as possible. A Teacher-Student distillation paradigm is employed: the Teacher network learns from single viewpoints, but its targets are derived from temporally aggregated static objects. The Teacher then generates high-quality pseudo-labels that the Student learns to predict from a single viewpoint for both static and moving objects. The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. Our code is available at https://github.com/CEA-LIST/MVAT.
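
A hedged sketch of the multi-view 2D projection loss: the corners of a predicted 3D box are projected into every annotated view with a pinhole model, and the tight 2D extent is compared with the annotated box. The box parametrization and loss choice are assumptions.

```python
import torch

def project_box_loss(corners3d, cams, boxes2d):
    """corners3d: (8, 3) predicted box corners; cams: list of (3, 4)
    projection matrices; boxes2d: (V, 4) annotated [x1, y1, x2, y2] per view."""
    loss = 0.0
    for P, box in zip(cams, boxes2d):
        homo = torch.cat([corners3d, torch.ones(8, 1)], dim=1)   # (8, 4)
        uvw = homo @ P.T                                         # (8, 3)
        uv = uvw[:, :2] / uvw[:, 2:3].clamp_min(1e-6)            # perspective divide
        pred = torch.cat([uv.min(0).values, uv.max(0).values])   # tight 2D box
        loss = loss + torch.nn.functional.smooth_l1_loss(pred, box)
    return loss / len(cams)
```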

[129] EHWGesture – A dataset for multimodal understanding of clinical gestures

Gianluca Amprimo, Alberto Ancilotto, Alessandro Savino, Fabio Quazzolo, Claudia Ferraris, Gabriella Olmo, Elisabetta Farella, Stefano Di Carlo

Main category: cs.CV

TL;DR: EHWGesture dataset for multimodal dynamic gesture understanding with precise tracking, 6+ hours of recordings, 25 subjects, 5 clinical gestures, and action quality assessment capabilities.

DetailsMotivation: Dynamic gesture understanding remains challenging due to complex spatiotemporal variations, and existing datasets lack multimodal diversity, precise ground-truth tracking, and action quality components embedded within gestures.

Method: Collected over 1,100 recordings from 25 healthy subjects using two RGB-Depth cameras and an event camera, with motion capture system for precise hand landmark tracking. All devices spatially calibrated and synchronized for cross-modal alignment. Recordings organized by execution speed classes mirroring clinical dexterity evaluations.

Result: Created comprehensive multimodal video dataset with precise ground-truth tracking. Baseline experiments demonstrate potential for gesture classification, gesture trigger detection, and action quality assessment tasks.

Conclusion: EHWGesture serves as a comprehensive benchmark for advancing multimodal clinical gesture understanding, particularly for applications requiring precise tracking and action quality evaluation.

Abstract: Hand gesture understanding is essential for several applications in human-computer interaction, including automatic clinical assessment of hand dexterity. While deep learning has advanced static gesture recognition, dynamic gesture understanding remains challenging due to complex spatiotemporal variations. Moreover, existing datasets often lack multimodal and multi-view diversity, precise ground-truth tracking, and an action quality component embedded within gestures. This paper introduces EHWGesture, a multimodal video dataset for gesture understanding featuring five clinically relevant gestures. It includes over 1,100 recordings (6 hours), captured from 25 healthy subjects using two high-resolution RGB-Depth cameras and an event camera. A motion capture system provides precise ground-truth hand landmark tracking, and all devices are spatially calibrated and synchronized to ensure cross-modal alignment. Moreover, to embed an action quality task within gesture understanding, collected recordings are organized in classes of execution speed that mirror clinical evaluations of hand dexterity. Baseline experiments highlight the dataset’s potential for gesture classification, gesture trigger detection, and action quality assessment. Thus, EHWGesture can serve as a comprehensive benchmark for advancing multimodal clinical gesture understanding.

[130] Universal Few-Shot Spatial Control for Diffusion Models

Kiet T. Nguyen, Chanhuyk Lee, Donggyun Kim, Dong Hoon Lee, Seunghoon Hong

Main category: cs.CV

TL;DR: UFC is a universal few-shot control adapter that enables text-to-image diffusion models to generalize to novel spatial control conditions with minimal training data (30 examples), achieving competitive performance with fully supervised baselines.

DetailsMotivation: Existing spatial control adapters for diffusion models lack adaptability to novel control conditions and require high training costs when encountering tasks that differ substantially from training data.

Method: UFC leverages analogy between query and support conditions to construct task-specific control features through a matching mechanism and updates on a small set of task-specific parameters, requiring only few-shot examples of novel tasks.

Result: Experiments on six novel spatial control tasks show UFC achieves fine-grained control with only 30 annotated examples, and with 0.1% of full training data, it matches performance of fully supervised baselines across various control tasks.

Conclusion: UFC provides an effective few-shot solution for spatial conditioning in diffusion models, demonstrating versatility across different diffusion backbones (UNet and DiT) and requiring minimal training data for novel control tasks.

Abstract: Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of novel tasks, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves competitive performance with the fully supervised baselines in various control tasks. We also show that UFC is applicable agnostically to various diffusion backbones and demonstrate its effectiveness on both UNet and DiT architectures. Code is available at https://github.com/kietngt00/UFC.

[131] HU-based Foreground Masking for 3D Medical Masked Image Modeling

Jin Lee, Vu Dang, Gwang-Hyun Yu, Anh Le, Zahid Rahman, Jin-Ho Jang, Heonzoo Lee, Kun-Yung Kim, Jin-Sul Kim, Jin-Young Kim

Main category: cs.CV

TL;DR: Enhanced MIM for 3D medical imaging using HU-based foreground masking that focuses on diagnostically relevant tissue regions instead of random masking, achieving superior segmentation performance across multiple datasets.

DetailsMotivation: Random masking in Masked Image Modeling overlooks anatomical density in 3D medical images, failing to focus on diagnostically meaningful regions like visceral organs while including non-tissue areas like air and fluid.

Method: Proposed HU-based Foreground Masking strategy that leverages Hounsfield Unit measurements to focus on intensity distribution of visceral organs and exclude non-tissue regions lacking diagnostic features.

Result: Consistent performance improvements across five public 3D medical imaging datasets with segmentation quality and Dice scores: BTCV (~84.64%), Flare22 (~92.43%), MM-WHS (~90.67%), Amos22 (~88.64%), BraTS (~78.55%).

Conclusion: Domain-centric MIM with HU-based foreground masking is crucial for medical image representation learning and shows promising direction for improving medical image segmentation tasks.

Abstract: While Masked Image Modeling (MIM) has revolutionized fields of computer vision, its adoption in 3D medical image computing has been limited by the use of random masking, which overlooks the density of anatomical objects. To address this limitation, we enhance the pretext task with a simple yet effective masking strategy. Leveraging Hounsfield Unit (HU) measurements, we implement an HU-based Foreground Masking, which focuses on the intensity distribution of visceral organs and excludes non-tissue regions, such as air and fluid, that lack diagnostically meaningful features. Extensive experiments on five public 3D medical imaging datasets demonstrate that our masking consistently improves performance, both in quality of segmentation and Dice score (BTCV: 84.64%, Flare22: 92.43%, MM-WHS: 90.67%, Amos22: 88.64%, BraTS: 78.55%). These results underscore the importance of domain-centric MIM and suggest a promising direction for representation learning in medical image segmentation. Implementation is available at github.com/AISeedHub/SubFore/.
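
A minimal sketch of the masking strategy, assuming a soft-tissue HU window and a fixed mask ratio: only patches whose mean HU falls inside the tissue range are eligible for masking, so air and fluid regions are never selected.

```python
import torch

def hu_foreground_mask(volume, patch=16, ratio=0.6, hu_lo=-100.0, hu_hi=300.0):
    """volume: (B, 1, D, H, W) in Hounsfield Units.
    Returns a boolean patch mask of shape (B, Np)."""
    pooled = torch.nn.functional.avg_pool3d(volume, patch)   # mean HU per patch
    fg = ((pooled > hu_lo) & (pooled < hu_hi)).flatten(1)    # (B, Np) tissue patches
    mask = torch.zeros_like(fg)
    for b in range(fg.size(0)):
        idx = fg[b].nonzero(as_tuple=True)[0]
        n = int(ratio * idx.numel())
        keep = idx[torch.randperm(idx.numel())[:n]]          # mask only foreground
        mask[b, keep] = True
    return mask
```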

[132] TextlessRAG: End-to-End Visual Document RAG by Speech Without Text

Peijin Xie, Shun Qian, Bingquan Liu, Dexin Wang, Lin Sun, Xiangzheng Zhang

Main category: cs.CV

TL;DR: TextlessRAG is the first end-to-end framework for speech-based question answering over document images without using ASR, TTS, or OCR, achieving improved efficiency and accuracy.

DetailsMotivation: Document images contain rich knowledge and spoken queries offer flexible application scenarios, but no prior work has explored speech-based QA over visual documents without text conversion.

Method: Eliminates ASR, TTS and OCR; directly interprets speech, retrieves relevant visual knowledge, and generates answers in a fully textless pipeline with layout-aware reranking for improved retrieval.

Result: Experiments demonstrate substantial improvements in both efficiency and accuracy compared to previous methods.

Conclusion: The framework successfully enables speech-based QA over document images without text conversion, and the authors release the first bilingual speech-document RAG dataset to advance research in this area.

Abstract: Document images encapsulate a wealth of knowledge, while the portability of spoken queries enables broader and flexible application scenarios. Yet, no prior work has explored knowledge base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism to refine retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech–document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at: https://github.com/xiepeijinhit-hue/textlessrag

[133] PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image

Peng Li, Yisheng He, Yingdong Hu, Yuan Dong, Weihao Yuan, Yuan Liu, Zilong Dong, Yike Guo

Main category: cs.CV

TL;DR: Fast feed-forward framework for Gaussian full-head 3D reconstruction from single unposed images using synthetic data and coarse-to-fine generation pipeline.

DetailsMotivation: Previous methods rely on time-consuming GAN inversion and test-time optimization, which are slow for inference. There's also a lack of large-scale 3D head assets for training.

Method: Proposes a coarse-to-fine Gaussian head generation pipeline using FLAME model sparse points with transformer blocks for feature extraction, and a dual-branch framework that aggregates spherical triplane features with point-based features. Uses synthetic data from trained 3D GANs.

Result: Framework enables fast reconstruction and rendering in a single forward pass, achieving efficient high-fidelity generation without requiring real 3D head data.

Conclusion: The method shows effectiveness compared to existing work, providing rapid Gaussian full-head synthesis from single unposed images using synthetic training data and innovative feature aggregation techniques.

Abstract: We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model from a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we build a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features through transformer blocks for feature extraction and coarse shape reconstruction, and are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that aggregates the structured spherical triplane features and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework compared with existing work.

[134] Attention Maps in 3D Shape Classification for Dental Stage Estimation with Class Node Graph Attention Networks

Barkin Buyukcakir, Rocharles Cavalcante Fontenele, Reinhilde Jacobs, Jannick De Tobel, Patrick Thevissen, Dirk Vandermeulen, Peter Claes

Main category: cs.CV

TL;DR: CGAT architecture for transparent 3D shape recognition using graph attention networks with explainable attention mechanisms for dental age estimation.

DetailsMotivation: Deep learning models lack transparency for high-stakes applications like medical diagnosis where trust and accountability are crucial.

Method: Class Node Graph Attention Network (CGAT) with graph attention convolutions and attention rollout visualization, using local mean curvature and distance to centroid as node features.

Result: Models with directed edges to a global CLS node produced intuitive attention maps with a 0.76 weighted F1 score, and the combination of node features yielded better performance and visualizations.

Conclusion: CGAT enables human-understandable explanations for model decisions, enhancing trust and facilitating expert validation in high-stakes environments beyond dental applications.

Abstract: Deep learning offers a promising avenue for automating many recognition tasks in fields such as medicine and forensics. However, the black-box nature of these models hinders their adoption in high-stakes applications where trust and accountability are required. For 3D shape recognition tasks in particular, this paper introduces the Class Node Graph Attention Network (CGAT) architecture to address this need. Applied to 3D meshes of third molars derived from CBCT images, for Demirjian stage allocation, CGAT utilizes graph attention convolutions and an inherent attention mechanism, visualized via attention rollout, to explain its decision-making process. We evaluated the local mean curvature and distance to centroid node features, both individually and in combination, as well as model depth, finding that models incorporating directed edges to a global CLS node produced more intuitive attention maps, while also yielding desirable classification performance. We analyzed the attention-based explanations of the models, and their predictive performances to propose optimal settings for the CGAT. The combination of local mean curvature and distance to centroid as node features yielded a slight performance increase with 0.76 weighted F1 score, and more comprehensive attention visualizations. The CGAT architecture’s ability to generate human-understandable attention maps can enhance trust and facilitate expert validation of model decisions. While demonstrated on dental data, CGAT is broadly applicable to graph-based classification and regression tasks, promoting wider adoption of transparent and competitive deep learning models in high-stakes environments.
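
The attention-rollout visualization the paper relies on is a standard technique; as a rough sketch of how a CLS-node attribution over mesh nodes can be computed from per-layer attention maps (a generic rollout, not the authors' exact CGAT code):

```python
import torch

def attention_rollout(attn_per_layer):
    # attn_per_layer: list of [n, n] attention matrices (rows sum to 1),
    # where index 0 is the global CLS node and the rest are mesh nodes.
    n = attn_per_layer[0].shape[-1]
    rollout = torch.eye(n)
    for attn in attn_per_layer:
        # Mix with the identity to account for residual connections,
        # then renormalize so rows remain a distribution.
        a = 0.5 * (attn + torch.eye(n))
        a = a / a.sum(dim=-1, keepdim=True)
        rollout = a @ rollout
    return rollout[0, 1:]  # CLS-row attribution over the mesh nodes
```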

[135] Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Boammani Aser Lompo, Marc Haraoui

Main category: cs.CV

TL;DR: Visual-TableQA is a large-scale multimodal dataset for evaluating visual reasoning over rendered table images, featuring 2.5k tables and 6k QA pairs generated through collaborative LLM pipeline at low cost.

DetailsMotivation: Current benchmarks for visual table reasoning are limited in scale, diversity, and reasoning depth, especially for rendered table images, creating a need for more comprehensive evaluation resources.

Method: Modular, scalable autonomous generation pipeline using multiple LLMs collaborating across generation, validation, and inspiration roles with cross-model prompting and LLM-jury filtering.

Result: Models fine-tuned on Visual-TableQA generalize robustly to external benchmarks and outperform several proprietary models despite the dataset’s synthetic nature.

Conclusion: The approach successfully creates a diverse, reasoning-intensive dataset that enhances model performance on visual table understanding tasks, demonstrating the effectiveness of collaborative LLM data generation.

Abstract: Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting (‘inspiration’) and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset’s synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
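
The LLM-jury filtering step can be pictured as a simple voting loop. In the sketch below, `ask(model, qa)` is a hypothetical stand-in for a real LLM API call, and the two-thirds threshold is illustrative, not the paper's setting:

```python
def jury_filter(candidates, jurors, ask, threshold=0.67):
    # Keep a generated QA pair only if enough juror models accept it.
    kept = []
    for qa in candidates:
        votes = sum(bool(ask(model, qa)) for model in jurors)
        if votes >= threshold * len(jurors):
            kept.append(qa)
    return kept
```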

[136] Temporal Image Forensics: A Review and Critical Evaluation

Robert Jöchl, Andreas Uhl

Main category: cs.CV

TL;DR: This paper provides a comprehensive review of temporal image forensics, focusing on time-dependent traces from image acquisition pipelines, while critically examining content bias issues and the importance of explainable AI methods.

DetailsMotivation: To give a comprehensive overview of temporal image forensics based on time-dependent traces from image acquisition pipelines, highlight the problem of content bias, and demonstrate the importance of explainable AI methods for verifying reliability.

Method: The authors review previous works, propose a new forensic setting, verify properties of in-field sensor defects, analyze existing methods for potential content bias exploitation, investigate neural network features in palmprint dating, and demonstrate how neural networks can be distracted from learning age traces through experiments and re-implementation of previous work.

Result: The review provides detailed insights into known age traces (in-field sensor defects and sensor dust) and forensic techniques. It shows that some methods claiming to use sensor defects actually exploit other traces (likely content bias), verifies the main properties of sensor defects, and demonstrates how easily neural networks can be distracted from learning actual age traces.

Conclusion: The field of temporal image forensics requires careful consideration of content bias and reliable verification methods. Explainable AI is crucial for ensuring the reliability of forensic techniques, as neural networks can easily be distracted from learning genuine age-dependent traces.

Abstract: Temporal image forensics is the science of estimating the age of a digital image. Usually, time-dependent traces (age traces) introduced by the image acquisition pipeline are exploited for this purpose. In this review, a comprehensive overview of the field of temporal image forensics based on time-dependent traces from the image acquisition pipeline is given. This includes a detailed insight into the properties of known age traces (i.e., in-field sensor defects and sensor dust) and temporal image forensics techniques. Another key aspect of this work is to highlight the problem of content bias and to illustrate how important eXplainable Artificial Intelligence methods are to verify the reliability of temporal image forensics techniques. Apart from reviewing material presented in previous works, in this review: (i) a new (probably more realistic) forensic setting is proposed; (ii) the main properties (growth rate and spatial distribution) of in-field sensor defects are verified; (iii) it is shown that a method proposed to utilize in-field sensor defects for image age approximation actually exploits other traces (most likely content bias); (iv) the features learned by a neural network dating palmprint images are further investigated; (v) it is shown how easily a neural network can be distracted from learning age traces. For this purpose, previous work is analyzed, re-implemented if required and experiments are conducted.

[137] Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao

Main category: cs.CV

TL;DR: Mini-o3 is a system that enables deep multi-turn reasoning for visual search tasks by scaling up tool-based interactions, achieving state-of-the-art performance through iterative data collection and over-turn masking strategies.

DetailsMotivation: Existing open-source multimodal models exhibit monotonous reasoning patterns and limited interaction turns, making them inadequate for difficult visual tasks requiring trial-and-error exploration.

Method: Three key components: 1) Visual Probe Dataset with challenging visual search problems, 2) Iterative data collection pipeline for diverse reasoning patterns, 3) Over-turn masking strategy to prevent penalization of maximum-turn responses during reinforcement learning.

Result: The model generates trajectories scaling to tens of turns at inference time with improving accuracy, produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.

Conclusion: Mini-o3 demonstrates that scaling tool-based interactions enables deep multi-turn reasoning and achieves state-of-the-art performance on challenging visual search tasks through careful dataset construction and training strategies.

Abstract: Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning – spanning tens of steps – and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
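
The over-turn masking strategy amounts to excluding capped trajectories from the policy loss rather than treating them as failures; a minimal sketch of that idea (an assumed form, not the released training code):

```python
import torch

def masked_policy_loss(logprobs, advantages, hit_turn_cap):
    # logprobs, advantages: [num_trajectories]; hit_turn_cap: bool mask
    # marking rollouts truncated at the maximum number of turns.
    keep = (~hit_turn_cap).float()               # drop over-turn rollouts
    per_traj = -(logprobs * advantages)          # REINFORCE-style objective
    return (per_traj * keep).sum() / keep.sum().clamp_min(1.0)
```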

[138] Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Yusuke Hirota, Ryo Hachiuma, Boyi Li, Ximing Lu, Michael Ross Boone, Boris Ivanovic, Yejin Choi, Marco Pavone, Yu-Chiang Frank Wang, Noa Garcia, Yuta Nakashima, Chao-Han Huck Yang

Main category: cs.CV

TL;DR: Current gender bias evaluations in vision-language models are unreliable because they’re heavily influenced by spurious correlations between gender and non-gender features like objects and backgrounds, rather than measuring actual gender bias.

DetailsMotivation: To investigate whether spurious features in gender bias benchmarks distort evaluation results, as these benchmarks often contain unwanted correlations between gender and non-gender elements that may confound bias measurements.

Method: Systematically perturb non-gender features across four benchmarks (COCO-gender, FACET, MIAP, PHASE) and various VLMs, using techniques like object masking and background blurring to quantify impact on bias scores.

Result: Minimal perturbations (10% object masking or weak background blurring) dramatically alter bias scores, by up to 175% in generative VLMs and 43% in CLIP variants, showing that current evaluations reflect responses to spurious features rather than true gender bias.

Conclusion: Since creating spurious feature-free benchmarks is fundamentally challenging, researchers should report bias metrics alongside feature-sensitivity measurements for more reliable bias assessment.

Abstract: Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.
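
The two perturbations are simple image operations; a minimal sketch, assuming binary annotation masks are available for the objects to occlude and the background region to blur:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def perturb(img, object_mask, background_mask,
            mask_frac=0.10, blur_sigma=2.0, seed=0):
    # img: [H, W, 3] uint8; masks: [H, W] bool. `object_mask` marks non-gender
    # objects to partially occlude; `background_mask` marks pixels to blur.
    rng = np.random.default_rng(seed)
    out = img.astype(np.float32)
    # (a) Mask a fraction of the object pixels (set them to mid-gray).
    idx = np.flatnonzero(object_mask)
    drop = rng.choice(idx, size=int(mask_frac * idx.size), replace=False)
    out.reshape(-1, 3)[drop] = 127.0
    # (b) Weakly blur the background while leaving other pixels sharp.
    blurred = gaussian_filter(out, sigma=(blur_sigma, blur_sigma, 0))
    out = np.where(background_mask[..., None], blurred, out)
    return out.astype(np.uint8)
```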

[139] Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer’s Disease

Fangqi Cheng, Surajit Ray, Xiaochen Yang

Main category: cs.CV

TL;DR: A data-efficient fine-tuning pipeline for 3D CT-based Med-VLMs adapted for 3D MRI, achieving SOTA Alzheimer’s diagnosis performance with only 1,500 training images.

DetailsMotivation: Current Med-VLMs underutilize patient metadata, lack clinical diagnostic knowledge integration, require extensive computational resources, and have limited effectiveness on 3D medical imaging due to missing structural information.

Method: Two key innovations: 1) Convert structured metadata into synthetic reports for better image-text alignment; 2) Add auxiliary token trained to predict MMSE score for additional supervision. Uses lightweight prompt tuning on both image and text modalities.

Result: Achieves state-of-the-art performance on two Alzheimer’s disease datasets using only 1,500 training images, outperforming methods fine-tuned on 10,000 images.

Conclusion: The proposed approach effectively addresses limitations of existing Med-VLMs by better utilizing metadata, integrating clinical knowledge, and achieving superior performance with significantly reduced data requirements for 3D medical imaging applications.

Abstract: Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer’s disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on two AD datasets using 1,500 training images, outperforming existing methods fine-tuned on 10,000 images. Code will be released upon publication.
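
Converting structured metadata into a synthetic report is essentially templating; a toy sketch with illustrative field names (not the paper's actual schema):

```python
def metadata_to_report(meta):
    # Field names here are placeholders chosen for illustration.
    return (
        f"{meta['age']}-year-old {meta['sex']} patient, "
        f"{meta['education_years']} years of education. "
        f"MMSE score: {meta['mmse']} (max 30; lower suggests impairment)."
    )

print(metadata_to_report(
    {"age": 72, "sex": "female", "education_years": 14, "mmse": 23}
))
# -> "72-year-old female patient, 14 years of education. MMSE score: 23 ..."
```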

[140] Self-Supervised Cross-Encoder for Neurodegenerative Disease Diagnosis

Fangqi Cheng, Yingying Zhao, Xiaochen Yang

Main category: cs.CV

TL;DR: A self-supervised cross-encoder framework for neurodegenerative disease diagnosis from MRI that disentangles static and dynamic representations using temporal continuity in longitudinal scans, achieving superior accuracy and interpretability.

DetailsMotivation: Existing deep learning methods for neurodegenerative disease diagnosis rely heavily on large labeled datasets and produce representations that lack interpretability.

Method: Proposes a self-supervised cross-encoder framework that leverages temporal continuity in longitudinal MRI scans. It disentangles representations into static (contrastive learning) and dynamic (input-gradient regularization) components.

Result: Achieves superior classification accuracy on ADNI dataset, shows improved interpretability, and demonstrates strong zero-shot generalization on OASIS dataset and cross-task generalization on PPMI dataset.

Conclusion: The proposed framework effectively addresses both data efficiency and interpretability challenges in neurodegenerative disease diagnosis from MRI data, with strong generalization capabilities.

Abstract: Deep learning has shown significant potential in diagnosing neurodegenerative diseases from MRI data. However, most existing methods rely heavily on large volumes of labeled data and often yield representations that lack interpretability. To address both challenges, we propose a novel self-supervised cross-encoder framework that leverages the temporal continuity in longitudinal MRI scans for supervision. This framework disentangles learned representations into two components: a static representation, constrained by contrastive learning, which captures stable anatomical features; and a dynamic representation, guided by input-gradient regularization, which reflects temporal changes and can be effectively fine-tuned for downstream classification tasks. Experimental results on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that our method achieves superior classification accuracy and improved interpretability. Furthermore, the learned representations exhibit strong zero-shot generalization on the Open Access Series of Imaging Studies (OASIS) dataset and cross-task generalization on the Parkinson Progression Marker Initiative (PPMI) dataset. The code for the proposed method will be made publicly available.

[141] Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity

Sung Ju Lee, Nam Ik Cho

Main category: cs.CV

TL;DR: Proposes Hermitian Symmetric Fourier Watermarking (SFW) to maintain frequency integrity and improve robustness against attacks in latent diffusion models, with center-aware embedding for cropping resistance.

DetailsMotivation: Semantic watermarking for LDMs suffers from detection performance degradation due to loss of frequency integrity and vulnerability to cropping attacks.

Method: Hermitian Symmetric Fourier Watermarking (SFW) enforces Hermitian symmetry to maintain frequency integrity, plus center-aware embedding strategy to reduce cropping vulnerability by ensuring robust information retention.

Result: Achieves state-of-the-art verification and identification performance across various attack scenarios, with the highest detection accuracy while maintaining superior image fidelity (FID and CLIP scores).

Conclusion: SFW is an effective framework for balancing robustness and image fidelity, addressing inherent trade-offs in semantic watermarking.

Abstract: Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID and CLIP scores. In conclusion, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Code available at https://github.com/thomas11809/SFWMark
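
Hermitian symmetry is what guarantees that a frequency-domain watermark maps back to a purely real spatial signal; a minimal numpy sketch of that constraint (the embedding details in SFW may differ):

```python
import numpy as np

def hermitian_symmetrize(w):
    # Enforce W[u, v] = conj(W[(-u) % N, (-v) % N]) by averaging the pattern
    # with the conjugate of its point reflection about the DC term.
    refl = np.roll(w[::-1, ::-1], shift=(1, 1), axis=(0, 1))
    return 0.5 * (w + np.conj(refl))

def embed(latent, key_pattern, strength=0.1):
    # latent: real [N, N] array; key_pattern: complex [N, N] watermark.
    f = np.fft.fft2(latent) + strength * hermitian_symmetrize(key_pattern)
    x = np.fft.ifft2(f)
    return x.real  # imaginary part vanishes (up to float error) by symmetry
```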

[142] Beyond Motion Cues and Structural Sparsity: Revisiting Small Moving Target Detection

Guoyi Zhang, Siyang Chen, Guangsheng Xu, Zhihua Shen, Han Wang, Xiaohu Zhang

Main category: cs.CV

TL;DR: A novel deep learning framework called TenRPCANet for small moving target detection that uses tensor-based low-rank and sparse decomposition with self-attention mechanisms, achieving state-of-the-art results on infrared and space object detection tasks.

DetailsMotivation: Small moving target detection is crucial for defense applications but remains challenging due to low signal-to-noise ratios, ambiguous visual cues, and cluttered backgrounds. Existing approaches lack robustness in complex environments as they rely on target-specific features or motion cues.

Method: Reformulates the task as tensor-based low-rank and sparse decomposition. Proposes TenRPCANet with tokenization strategy that enforces multi-order tensor low-rank priors through self-attention, capturing local and non-local self-similarity. Includes feature refinement module inspired by sparse component update in tensor RPCA to enhance target saliency.

Result: Achieves state-of-the-art performance on two highly distinct and challenging tasks: multi-frame infrared small target detection and space object detection.

Conclusion: The method demonstrates both effectiveness and generalizability, requiring minimal assumptions about target characteristics while leveraging inherent low-rank structures in cluttered backgrounds as stable priors for detection.

Abstract: Small moving target detection is crucial for many defense applications but remains highly challenging due to low signal-to-noise ratios, ambiguous visual cues, and cluttered backgrounds. In this work, we propose a novel deep learning framework that differs fundamentally from existing approaches, which often rely on target-specific features or motion cues and tend to lack robustness in complex environments. Our key insight is that small target detection and background discrimination are inherently coupled: even cluttered video backgrounds often exhibit strong low-rank structures that can serve as stable priors for detection. We reformulate the task as a tensor-based low-rank and sparse decomposition problem and conduct a theoretical analysis of the background, target, and noise components to guide model design. Building on these insights, we introduce TenRPCANet, a deep neural network that requires minimal assumptions about target characteristics. Specifically, we propose a tokenization strategy that implicitly enforces multi-order tensor low-rank priors through a self-attention mechanism. This mechanism captures both local and non-local self-similarity to model the low-rank background without relying on explicit iterative optimization. In addition, inspired by the sparse component update in tensor RPCA, we design a feature refinement module to enhance target saliency. The proposed method achieves state-of-the-art performance on two highly distinct and challenging tasks: multi-frame infrared small target detection and space object detection. These results demonstrate both the effectiveness and the generalizability of our approach.
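
The classical matrix RPCA decomposition that motivates TenRPCANet can be sketched with alternating singular-value and soft thresholding; this is the baseline idea the network unrolls implicitly, not the paper's model:

```python
import numpy as np

def rpca(D, lam=None, mu=None, n_iter=100):
    # D: [frames, pixels] matrix; returns low-rank background L and sparse S.
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))   # classical RPCA sparsity weight
    mu = mu or 0.25 * np.abs(D).mean()      # simple threshold scale heuristic
    S = np.zeros_like(D)
    for _ in range(n_iter):
        # Low-rank step: singular-value thresholding of D - S.
        U, s, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U * np.maximum(s - mu, 0.0)) @ Vt
        # Sparse step: soft-thresholding of the residual D - L.
        R = D - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam * mu, 0.0)
    return L, S  # small moving targets live in S
```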

[143] EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration

Haokai Zhu, Bo Qu, Si-Yuan Cao, Runmin Zhang, Shujie Chen, Bailin Yang, Hui-Liang Shen

Main category: cs.CV

TL;DR: EDFFDNet uses exponential-decay free-form deformation and adaptive sparse motion aggregation to achieve efficient and accurate image registration, especially for scenes with depth disparities.

DetailsMotivation: Previous deep image registration methods struggle with real scenes containing depth disparities due to their inherent limitations in single homography, multi-grid homography, or thin-plate spline approaches.

Method: Proposes EDFFDNet with exponential-decay basis function for free-form deformation, Adaptive Sparse Motion Aggregator (ASMA) to replace MLP motion aggregator, and progressive correlation refinement strategy for coarse-to-fine motion estimation.

Result: Reduces parameters by 70.5%, memory by 32.6%, and total runtime by 33.7% while achieving 0.5 dB PSNR gain over state-of-the-art. With local refinement, EDFFDNet-2 further improves PSNR by 1.06 dB with lower computational costs.

Conclusion: The method demonstrates strong generalization ability across datasets and outperforms previous deep learning methods in efficiency and accuracy for image registration in scenes with depth disparities.

Abstract: Previous deep image registration methods that employ single homography, multi-grid homography, or thin-plate spline often struggle with real scenes containing depth disparities due to their inherent limitations. To address this, we propose an Exponential-Decay Free-Form Deformation Network (EDFFDNet), which employs free-form deformation with an exponential-decay basis function. This design achieves higher efficiency and performs well in scenes with depth disparities, benefiting from its inherent locality. We also introduce an Adaptive Sparse Motion Aggregator (ASMA), which replaces the MLP motion aggregator used in previous methods. By transforming dense interactions into sparse ones, ASMA reduces parameters and improves accuracy. Additionally, we propose a progressive correlation refinement strategy that leverages global-local correlation patterns for coarse-to-fine motion estimation, further enhancing efficiency and accuracy. Experiments demonstrate that EDFFDNet reduces parameters, memory, and total runtime by 70.5%, 32.6%, and 33.7%, respectively, while achieving a 0.5 dB PSNR gain over the state-of-the-art method. With an additional local refinement stage, EDFFDNet-2 further improves PSNR by 1.06 dB while maintaining lower computational costs. Our method also demonstrates strong generalization ability across datasets, outperforming previous deep learning methods.
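
An exponential-decay basis makes each control point's influence fall off with distance, which is what gives the deformation its locality; a small sketch of one plausible form (the exact basis in EDFFDNet may differ):

```python
import numpy as np

def exp_decay_offsets(points, ctrl_pts, ctrl_motion, alpha=4.0):
    # points: [P, 2] pixel coords; ctrl_pts: [K, 2] control-point coords;
    # ctrl_motion: [K, 2] estimated motion at each control point.
    d = np.linalg.norm(points[:, None, :] - ctrl_pts[None, :, :], axis=-1)
    w = np.exp(-alpha * d)                  # influence decays with distance
    w /= w.sum(axis=1, keepdims=True)       # normalize weights per pixel
    return w @ ctrl_motion                  # [P, 2] dense deformation field
```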

[144] Nearest Neighbor Projection Removal Adversarial Training

Himanshu Singh, A. V. Subramanyam, Shivank Rajput, Mohan Kankanhalli

Main category: cs.CV

TL;DR: Novel adversarial training framework that reduces inter-class feature overlap to improve robustness by projecting out inter-class dependencies from features, achieving competitive performance on standard benchmarks.

DetailsMotivation: Standard adversarial training fails to explicitly address inter-class feature overlap, which is a significant contributor to adversarial susceptibility in deep neural networks.

Method: Proposes a framework that identifies nearest inter-class neighbors for each adversarial sample and removes projections onto these neighbors to enforce stronger feature separability, with theoretical analysis showing reduced Lipschitz constant and Rademacher complexity.

Result: Extensive experiments on CIFAR-10, CIFAR-100, and SVHN show strong performance competitive with leading adversarial training techniques, with significant achievements in both robust and clean accuracy.

Conclusion: Explicitly addressing inter-class feature proximity is crucial for bolstering adversarial robustness in deep neural networks, as demonstrated by the proposed method’s effectiveness.

Abstract: Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space. Specifically, our approach first identifies the nearest inter-class neighbors for each adversarial sample and subsequently removes projections onto these neighbors to enforce stronger feature separability. Theoretically, we demonstrate that our proposed logits correction reduces the Lipschitz constant of neural networks, thereby lowering the Rademacher complexity, which directly contributes to improved generalization and robustness. Extensive experiments across standard benchmarks including CIFAR-10, CIFAR-100, and SVHN show that our method demonstrates strong performance that is competitive with leading adversarial training techniques, highlighting significant achievements in both robust and clean accuracy. Our findings reveal the importance of addressing inter-class feature proximity explicitly to bolster adversarial robustness in DNNs.
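
The core operation, removing the projection of a feature onto its nearest inter-class neighbor, is a few lines of linear algebra; a sketch under the assumption of a flat [batch, dim] feature matrix:

```python
import torch

def remove_inter_class_projection(feats, labels, eps=1e-8):
    # feats: [B, D] feature vectors; labels: [B] class ids.
    dist = torch.cdist(feats, feats)
    same = labels[:, None] == labels[None, :]
    dist = dist.masked_fill(same, float("inf"))   # only inter-class neighbors
    nbr = feats[dist.argmin(dim=1)]               # nearest other-class feature
    coef = (feats * nbr).sum(-1, keepdim=True) \
        / (nbr * nbr).sum(-1, keepdim=True).clamp_min(eps)
    return feats - coef * nbr                     # project out that direction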

[145] CAViAR: Critic-Augmented Video Agentic Reasoning

Sachit Menon, Ahmet Iscen, Arsha Nagrani, Tobias Weyand, Carl Vondrick, Cordelia Schmid

Main category: cs.CV

TL;DR: A language model agent with video modules and critic mechanism achieves strong performance on complex video reasoning tasks where previous methods struggle.

DetailsMotivation: Existing video perception models perform well on short clips but struggle with complex reasoning on longer videos and more complex queries. The paper aims to leverage existing perception capabilities for more sophisticated video reasoning.

Method: Developed a large language model agent that uses video modules as subagents/tools, with a critic mechanism to distinguish successful from unsuccessful action sequences. The agent dynamically determines subsequent steps based on previous module results rather than following fixed procedures.

Result: The combination of the agent and critic achieves strong performance on LVBench, Neptune, and ActivityNet-RTL datasets that require complex video reasoning.

Conclusion: The proposed agent-critic approach successfully leverages existing video perception capabilities for complex reasoning tasks, outperforming previous fixed-procedure methods like Visual Programming, ViperGPT, and MoReVQA.

Abstract: Video understanding has seen significant progress in recent years, with models’ performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show that performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieves strong performance on the previously mentioned datasets.

[146] SEEC: Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression

Chunhang Zheng, Zichang Ren, Dou Li

Main category: cs.CV

TL;DR: SEEC proposes a segmentation-assisted multi-entropy model framework for lossless image compression that uses semantic segmentation to guide multiple specialized entropy models for different image regions, achieving state-of-the-art compression ratios with minimal latency.

DetailsMotivation: Traditional learned image compression methods use a single entropy model for the entire image, which limits their ability to capture diverse statistical characteristics across different semantic regions. This limitation reduces compression efficiency.

Method: The framework extracts image features, applies semantic segmentation to identify different regions, assigns specialized entropy models to each region to capture unique statistical properties, and uses multi-channel discrete logistic mixture likelihood for effective pixel value distribution modeling.

Result: Experimental results show SEEC achieves state-of-the-art compression ratios on benchmark datasets with minimal encoding/decoding latency. It also supports Regions of Interest coding based on segmentation masks.

Conclusion: SEEC demonstrates that using multiple entropy models guided by semantic segmentation significantly improves compression performance over single-model approaches, providing superior compression ratios while maintaining practical efficiency.

Abstract: Recently, learned image compression has attracted considerable attention due to its superior performance over traditional methods. However, most existing approaches employ a single entropy model to estimate the probability distribution of pixel values across the entire image, which limits their ability to capture the diverse statistical characteristics of different semantic regions. To overcome this limitation, we propose Segmentation-Assisted Multi-Entropy Models for Lossless Image Compression (SEEC). Our framework utilizes semantic segmentation to guide the selection and adaptation of multiple entropy models, enabling more accurate probability distribution estimation for distinct semantic regions. Specifically, SEEC first extracts image features and then applies semantic segmentation to identify different regions, each assigned a specialized entropy model to better capture its unique statistical properties. Finally, a multi-channel discrete logistic mixture likelihood is employed to model the pixel value distributions effectively. Experimental results on benchmark datasets demonstrate that SEEC achieves state-of-the-art compression ratios while introducing only minimal encoding and decoding latency. With superior performance, the proposed model also supports Region-of-Interest (ROI) coding conditioned on the provided segmentation mask. Our code is available at https://github.com/chunbaobao/SEEC.
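
Routing each pixel's latent to the entropy model of its semantic class can be sketched as a gather over per-class parameter heads; channel counts and the two-parameter (mean, log-scale) output are illustrative, and `seg` is assumed to hold integer class ids:

```python
import torch
import torch.nn as nn

class MultiEntropySelector(nn.Module):
    # One lightweight parameter head per semantic class; each pixel's
    # distribution parameters come from the head of its class.
    def __init__(self, num_classes, feat_ch):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Conv2d(feat_ch, 2, 1) for _ in range(num_classes)
        )

    def forward(self, feats, seg):
        # feats: [B, C, H, W]; seg: [B, H, W] int64 class ids.
        params = torch.stack([h(feats) for h in self.heads], dim=1)  # [B,K,2,H,W]
        idx = seg[:, None, None].expand(-1, 1, 2, -1, -1)            # [B,1,2,H,W]
        return params.gather(1, idx).squeeze(1)   # per-pixel (mean, log-scale)
```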

[147] XSRD-Net: EXplainable Stroke Relapse Detection

Christian Gapp, Elias Tappeiner, Martin Welk, Karl Fritscher, Stephanie Mangesius, Constantin Eisenschink, Philipp Deisl, Michael Knoflach, Astrid E. Grams, Elke R. Gizewski, Rainer Schubert

Main category: cs.CV

TL;DR: Multimodal deep learning approach combining 3D CTA imaging and clinical data to predict stroke recurrence risk and relapse-free survival time, achieving AUC of 0.71 for relapse prediction with insights linking heart diseases and carotid artery features.

DetailsMotivation: Stroke has high recurrence rates (5-25% in first year) with extremely high mortality (40% for relapses), making early detection of at-risk patients crucial for appropriate therapy planning to reduce recurrence rates.

Method: Collected 3D intracranial CTA image data, heart diseases, age, and gender from stroke patients (2010-2024). Trained single- and multimodal deep learning networks for binary relapse detection and relapse-free survival time prediction with classification.

Result: Tabular data alone achieved AUC 0.84 for relapse detection. Multimodal XSRD-net (vision:tabular 0.68:0.32 ratio) achieved c-index 0.68 and AUC 0.71 for relapse-free survival prediction. Found link between heart diseases (tabular) and carotid arteries (vision) for relapse detection.

Conclusion: Multimodal approach combining imaging and clinical data shows promise for stroke recurrence prediction. The identified link between heart conditions and carotid artery features provides valuable clinical insights for ongoing model improvement and data collection.

Abstract: Stroke is the second most frequent cause of death worldwide with an annual mortality of around 5.5 million. Recurrence rates of stroke are between 5 and 25% in the first year. As mortality rates for relapses are extraordinarily high (40%), it is of utmost importance to reduce the recurrence rates. We address this issue by detecting patients at risk of stroke recurrence at an early stage in order to enable appropriate therapy planning. To this end, we collected 3D intracranial CTA image data and recorded concomitant heart diseases, the age and the gender of stroke patients between 2010 and 2024. We trained single- and multimodal deep learning based neural networks for binary relapse detection (Task 1) and for relapse-free survival (RFS) time prediction together with a subsequent classification (Task 2). The separation of relapse from non-relapse patients (Task 1) could be solved with tabular data (AUC on test dataset: 0.84). However, for the main task, the regression (Task 2), our multimodal XSRD-net processed the modalities vision:tabular with 0.68:0.32 according to modality contribution measures. The c-index with respect to relapses for the multimodal model reached 0.68, and the AUC was 0.71 for the test dataset. Finally, deeper interpretability analysis highlighted a link between both heart diseases (tabular) and carotid arteries (vision) for the detection of relapses and the prediction of the RFS time. This is a central outcome that we strive to strengthen with ongoing data collection and model retraining.

[148] HairGS: Hair Strand Reconstruction based on 3D Gaussian Splatting

Yimin Pan, Matthias Nießner, Tobias Kirschstein

Main category: cs.CV

TL;DR: 3D Gaussian Splatting extension for strand-level hair reconstruction from multi-view images, with novel merging scheme and topological evaluation metric

DetailsMotivation: Human hair reconstruction is challenging but important for VR and digital human modeling. Existing methods focus on geometric quality but neglect strand connectivity and topology.

Method: Multi-stage pipeline: 1) Reconstruct detailed hair geometry using differentiable Gaussian rasterizer, 2) Merge Gaussian segments into coherent strands with novel merging scheme, 3) Refine and grow strands under photometric supervision

Result: Robustly handles wide range of hairstyles, achieves efficient reconstruction (typically within one hour), demonstrates effectiveness on both synthetic and real-world datasets

Conclusion: The method provides strand-level hair reconstruction with topological accuracy assessment, addressing limitations of existing geometric-focused approaches

Abstract: Human hair reconstruction is a challenging problem in computer vision, with growing importance for applications in virtual reality and digital human modeling. Recent advances in 3D Gaussian Splatting (3DGS) provide efficient and explicit scene representations that naturally align with the structure of hair strands. In this work, we extend the 3DGS framework to enable strand-level hair geometry reconstruction from multi-view images. Our multi-stage pipeline first reconstructs detailed hair geometry using a differentiable Gaussian rasterizer, then merges individual Gaussian segments into coherent strands through a novel merging scheme, and finally refines and grows the strands under photometric supervision. While existing methods typically evaluate reconstruction quality at the geometric level, they often neglect the connectivity and topology of hair strands. To address this, we propose a new evaluation metric that serves as a proxy for assessing topological accuracy in strand reconstruction. Extensive experiments on both synthetic and real-world datasets demonstrate that our method robustly handles a wide range of hairstyles and achieves efficient reconstruction, typically completing within one hour. The project page can be found at: https://yimin-pan.github.io/hair-gs/

[149] RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis

Hugo Blanc, Jean-Emmanuel Deschaud, Alexis Paljic

Main category: cs.CV

TL;DR: RayGaussX accelerates RayGauss for real-time novel-view synthesis by introducing volumetric rendering acceleration techniques, improved ray coherence, scale regularization, and better densification, achieving 5-12x faster training and 50-80x higher FPS while improving visual quality.

DetailsMotivation: RayGauss achieved state-of-the-art rendering quality but its computational cost prevented real-time rendering on real-world scenes, necessitating acceleration methods.

Method: Builds on RayGauss with volumetric rendering acceleration (empty-space skipping, adaptive sampling), enhanced ray coherence, scale regularization to reduce false intersections, and new densification criterion for better density distribution.

Result: Achieves 5x to 12x faster training, 50x to 80x higher rendering speeds (FPS) on real-world datasets, and improves visual quality by up to +0.56 dB in PSNR.

Conclusion: RayGaussX successfully accelerates RayGauss for real-time performance while maintaining and even improving rendering quality on real-world scenes.

Abstract: RayGauss has achieved state-of-the-art rendering quality for novel-view synthesis on synthetic and indoor scenes by representing radiance and density fields with irregularly distributed elliptical basis functions, rendered via volume ray casting using a Bounding Volume Hierarchy (BVH). However, its computational cost prevents real-time rendering on real-world scenes. Our approach, RayGaussX, builds on RayGauss by introducing key contributions that accelerate both training and inference. Specifically, we incorporate volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling, enhance ray coherence, and introduce scale regularization to reduce false-positive intersections. Additionally, we propose a new densification criterion that improves density distribution in distant regions, leading to enhanced graphical quality on larger scenes. As a result, RayGaussX achieves 5x to 12x faster training and 50x to 80x higher rendering speeds (FPS) on real-world datasets while improving visual quality by up to +0.56 dB in PSNR. Project page with videos and code: https://raygaussx.github.io/.
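
Empty-space skipping changes the ray-marching step size based on an occupancy lookup; a schematic sketch where `is_occupied(p)` is an assumed occupancy-grid query, not the RayGaussX API:

```python
import numpy as np

def sample_along_ray(origin, direction, is_occupied, t_far,
                     coarse=0.5, fine=0.02):
    # origin, direction: [3] arrays; returns sample depths along the ray.
    ts, t = [], 0.0
    while t < t_far:
        if is_occupied(origin + t * direction):
            ts.append(t)       # fine steps where Gaussians may contribute
            t += fine
        else:
            t += coarse        # large strides through empty space
    return np.asarray(ts)
```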

[150] Faster, Self-Supervised Super-Resolution for Anisotropic Multi-View MRI Using a Sparse Coordinate Loss

Maja Schlereth, Moritz Schillinger, Katharina Breininger

Main category: cs.CV

TL;DR: Self-supervised neural network for fusing two orthogonal anisotropic low-resolution MR images to reconstruct high-resolution details without requiring high-resolution training data.

DetailsMotivation: Balancing MR image quality with acquisition time and patient comfort is challenging. Current practice of analyzing individual low-resolution scans is time-consuming and can lead to inaccurate interpretation.

Method: Multi-view neural network trained with sparse coordinate-based loss in self-supervised manner, combining patient-agnostic offline and patient-specific online phases for efficient reconstruction.

Result: Achieves comparable or improved super-resolution performance compared to state-of-the-art self-supervised methods, with up to 10x speed-up for patient-specific reconstruction.

Conclusion: The proposed approach enables efficient high-quality MR image reconstruction from anisotropic low-resolution scans without requiring high-resolution training data, significantly improving both quality and processing speed.

Abstract: Acquiring images in high resolution is often a challenging task. Especially in the medical sector, image quality has to be balanced with acquisition time and patient comfort. To strike a compromise between scan time and quality for Magnetic Resonance (MR) imaging, two anisotropic scans with different low-resolution (LR) orientations can be acquired. Typically, LR scans are analyzed individually by radiologists, which is time consuming and can lead to inaccurate interpretation. To tackle this, we propose a novel approach for fusing two orthogonal anisotropic LR MR images to reconstruct anatomical details in a unified representation. Our multi-view neural network is trained in a self-supervised manner, without requiring corresponding high-resolution (HR) data. To optimize the model, we introduce a sparse coordinate-based loss, enabling the integration of LR images with arbitrary scaling. We evaluate our method on MR images from two independent cohorts. Our results demonstrate comparable or even improved super-resolution (SR) performance compared to state-of-the-art (SOTA) self-supervised SR methods for different upsampling scales. By combining a patient-agnostic offline and a patient-specific online phase, we achieve a substantial speed-up of up to ten times for patient-specific reconstruction while achieving similar or better SR quality. Code is available at https://github.com/MajaSchle/tripleSR.
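
A coordinate-based loss queries the network at continuous 3D locations and compares against trilinear lookups into the LR volumes; a simplified sketch that supervises shared coordinates with both volumes (the paper's handling of the two acquisition geometries is more careful):

```python
import torch
import torch.nn.functional as F

def sample_at(volume, coords):
    # Trilinear lookup; volume: [D, H, W], coords: [N, 3] in [-1, 1] (x, y, z).
    v = volume[None, None]                    # [1, 1, D, H, W]
    g = coords.view(1, -1, 1, 1, 3)           # grid for F.grid_sample
    return F.grid_sample(v, g, align_corners=True).reshape(-1)

def sparse_coordinate_loss(model, lr_a, lr_b, coords):
    # model(coords) is the coordinate network's predicted intensity at each
    # sampled location; both anisotropic LR volumes supervise it.
    pred = model(coords).reshape(-1)
    return (F.mse_loss(pred, sample_at(lr_a, coords))
            + F.mse_loss(pred, sample_at(lr_b, coords)))
```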

[151] SplatFill: 3D Scene Inpainting via Depth-Guided Gaussian Splatting

Mahtab Dahaghin, Milind G. Padalkar, Matteo Toso, Alessio Del Bue

Main category: cs.CV

TL;DR: SplatFill is a novel depth-guided 3D Gaussian Splatting inpainting method that achieves state-of-the-art results with improved efficiency, combining depth-based supervision and consistency-aware refinement.

DetailsMotivation: 3DGS scene inpainting remains challenging with issues like blurry details, artifacts, and inconsistent geometry when handling missing regions from occlusion or scene editing.

Method: Uses joint depth-based and object-based supervision for accurate Gaussian placement, plus a consistency-aware refinement scheme to selectively correct inconsistent regions without disrupting the scene.

Result: Surpasses existing NeRF-based and 3DGS-based methods in visual fidelity, reduces training time by 24.5%, and delivers sharper details with fewer artifacts across challenging viewpoints.

Conclusion: SplatFill provides an effective solution for high-quality 3DGS scene inpainting with improved perceptual quality and efficiency through depth guidance and selective refinement.

Abstract: 3D Gaussian Splatting (3DGS) has enabled the creation of highly realistic 3D scene representations from sets of multi-view images. However, inpainting missing regions, whether due to occlusion or scene editing, remains a challenging task, often leading to blurry details, artifacts, and inconsistent geometry. In this work, we introduce SplatFill, a novel depth-guided approach for 3DGS scene inpainting that achieves state-of-the-art perceptual quality and improved efficiency. Our method combines two key ideas: (1) joint depth-based and object-based supervision to ensure inpainted Gaussians are accurately placed in 3D space and aligned with surrounding geometry, and (2) a consistency-aware refinement scheme that selectively identifies and corrects inconsistent regions without disrupting the rest of the scene. Evaluations on the SPIn-NeRF dataset demonstrate that SplatFill not only surpasses existing NeRF-based and 3DGS-based inpainting methods in visual fidelity but also reduces training time by 24.5%. Qualitative results show our method delivers sharper details, fewer artifacts, and greater coherence across challenging viewpoints.

[152] Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model

Zhuoxu Huang, Mingqi Gao, Jungong Han

Main category: cs.CV

TL;DR: PLM bridges the representation gap between LLMs and 3D point clouds using Object-centric Discriminative Representation and Geometric Reactivation Decoder, achieving significant improvements in 3D segmentation tasks.

DetailsMotivation: Address representation misalignment between LLMs (semantic tokens) and 3D point clouds (dense geometry), which limits input and output stages in prior methods, weakening object-level semantics and fine-grained accuracy.

Method: Introduces Object-centric Discriminative Representation (OcDR) to learn object-centric tokens with hard negative-aware training, and Geometric Reactivation Decoder (GRD) to combine LLM-inferred geometry with dense features for mask prediction.

Result: Achieves +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks.

Conclusion: PLM demonstrates effective object-centric reasoning for robust 3D understanding by bridging the LLM-3D representation gap without requiring large-scale pre-alignment.

Abstract: 3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, misalignment limits both input and output. At the input stage, dense point patches require heavy pre-alignment, weakening object-level semantics and confusing similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine-grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale pre-alignment between 3D-text or 3D-images. Specifically, we introduce Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic-level reasoning within LLMs. For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM-inferred geometry with corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object-centric reasoning for robust 3D understanding.

[153] Deep Learning-Based Burned Area Mapping Using Bi-Temporal Siamese Networks and AlphaEarth Foundation Datasets

Seyd Teymoor Seydi

Main category: cs.CV

TL;DR: Novel automated burned area mapping using AlphaEarth dataset with Siamese U-Net achieves 95% accuracy, demonstrating strong transferability across diverse ecosystems.

DetailsMotivation: Accurate and timely burned area mapping is crucial for environmental monitoring, disaster management, and climate change assessment.

Method: Uses AlphaEarth dataset (high-resolution optical/thermal imagery with ground-truth) combined with Siamese U-Net deep learning architecture, trained on MTBS dataset in US and evaluated across 17 European regions.

Result: Achieves 95% overall accuracy, 0.6 IoU, and 74% F1-score. Successfully identifies burned areas in diverse ecosystems, particularly strong at detecting partially burned vegetation and fire boundaries.

Conclusion: Provides scalable solution for global burn area monitoring, advances automated fire damage assessment with high generalization and transferability capabilities.

Abstract: Accurate and timely mapping of burned areas is crucial for environmental monitoring, disaster management, and assessment of climate change. This study presents a novel approach to automated burned area mapping using the AlphaEarth dataset combined with the Siamese U-Net deep learning architecture. The AlphaEarth dataset, comprising high-resolution optical and thermal infrared imagery with comprehensive ground-truth annotations, provides an unprecedented resource for training robust burned area detection models. We trained our model with the Monitoring Trends in Burn Severity (MTBS) dataset in the contiguous US and evaluated it across 17 regions in Europe. Our experimental results demonstrate that the proposed ensemble approach achieves superior performance with an overall accuracy of 95%, IoU of 0.6, and F1-score of 74% on the test dataset. The model successfully identifies burned areas across diverse ecosystems with complex backgrounds, showing particular strength in detecting partially burned vegetation and fire boundaries, and demonstrating strong transferability and generalization in burned area mapping. This research contributes to the advancement of automated fire damage assessment and provides a scalable solution for global burn area monitoring using the AlphaEarth dataset.
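
The bi-temporal Siamese idea is a shared encoder over the pre- and post-fire images followed by a change head; a minimal sketch with illustrative layer sizes, where `encoder` is assumed to return [B, feat_ch, H, W] features:

```python
import torch
import torch.nn as nn

class SiameseChangeHead(nn.Module):
    # One shared encoder embeds both acquisition dates; the head maps the
    # feature difference to per-pixel burned-area logits.
    def __init__(self, encoder, feat_ch=64):
        super().__init__()
        self.encoder = encoder                       # shared weights
        self.head = nn.Sequential(
            nn.Conv2d(feat_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, pre_fire, post_fire):
        diff = torch.abs(self.encoder(post_fire) - self.encoder(pre_fire))
        return self.head(diff)                       # burn-mask logits
```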

[154] D-LEAF: Localizing and Correcting Hallucinations in Multimodal LLMs via Layer-to-head Attention Diagnostics

Tiancheng Yang, Lin Zhang, Jiaye Lin, Guimin Hu, Di Wang, Lijie Hu

Main category: cs.CV

TL;DR: D-LEAF method dynamically detects and corrects hallucinations in MLLMs by analyzing attention patterns across layers and heads, achieving significant improvements in captioning and VQA tasks.

DetailsMotivation: Multimodal LLMs suffer from hallucinations where generated text conflicts with visual input, and existing uniform attention adjustment methods fail to accurately localize problematic layers.

Method: Introduces Layer Image Attention Entropy (LIAE) to flag anomalous layers and Image Attention Focus (IAF) to score attention heads, then uses Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF) for dynamic error correction during inference.

Result: 53% relative improvement on standard captioning benchmarks, and approximately 4% improvement in both accuracy and F1-score on VQA tasks, with negligible overhead.

Conclusion: The proposed attention-guided diagnostics and dynamic correction method effectively suppress hallucinations while maintaining efficiency in multimodal language models.

Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance on tasks like image captioning and visual question answering, but remain prone to hallucinations, where generated text conflicts with the visual input. Prior work links this partly to insufficient visual attention, but existing attention-based detectors and mitigation typically apply uniform adjustments across layers and heads, obscuring where errors originate. In this paper, we first show these methods fail to accurately localize problematic layers. Then, we introduce two diagnostics: Layer Image Attention Entropy (LIAE) which flags anomalous layers, and Image Attention Focus (IAF) which scores attention heads within those layers. Analysis shows that LIAE pinpoints faulty layers and IAF reliably ranks heads that warrant correction. Guided by these signals, we propose Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF), a task-agnostic, attention-guided method that dynamically localizes and corrects errors during inference with negligible overhead. Results show our D-LEAF delivers a 53% relative improvement on standard captioning benchmarks, and on VQA both accuracy and F1-score improve by approximately 4%, substantially suppressing hallucinations while preserving efficiency.
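
LIAE-style diagnostics reduce to computing the entropy of the attention mass placed on image tokens, layer by layer; a sketch of one plausible form (the paper's exact normalization may differ):

```python
import torch

def layer_image_attention_entropy(attn_layers, image_token_idx, eps=1e-8):
    # attn_layers: list of [heads, q_len, k_len] attention maps, one per layer.
    # image_token_idx: LongTensor of key positions holding image tokens.
    scores = []
    for attn in attn_layers:
        p = attn[..., image_token_idx]                    # mass on image tokens
        p = p / p.sum(dim=-1, keepdim=True).clamp_min(eps)
        ent = -(p * p.clamp_min(eps).log()).sum(dim=-1)   # per head and query
        scores.append(ent.mean().item())                  # one score per layer
    return scores  # anomalous values flag candidate layers for correction
```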

[155] Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning

Daniel DeAlcala, Aythami Morales, Julian Fierrez, Gonzalo Mancera, Ruben Tolosana, Javier Ortega-Garcia

Main category: cs.CV

TL;DR: Active MINT is a multitask learning method that trains two models simultaneously - an audited model and a MINT model - to detect whether specific data was used in training, achieving over 80% accuracy.

DetailsMotivation: To increase transparency in AI models and provide stronger safeguards for security, privacy, and copyright protection by detecting if specific data was used during model training.

Method: Novel multitask learning approach that trains an audited model and a secondary MINT model simultaneously, using intermediate activation maps as inputs to MINT layers to enhance training data detection.

Result: Achieves over 80% accuracy in detecting training data usage across various neural network architectures (MobileNet to Vision Transformers) on 5 public benchmarks, significantly outperforming previous approaches.

Conclusion: Active MINT successfully incorporates auditability as an optimization objective during training and provides an effective method for detecting training data usage, contributing to AI transparency and security.

Abstract: Active Membership Inference Test (aMINT) is a method designed to detect whether given data were used during the training of machine learning models. In Active MINT, we propose a novel multitask learning process that involves training two models simultaneously: the original or Audited Model, and a secondary model, referred to as the MINT Model, responsible for identifying the data used for training the Audited Model. This novel multi-task learning approach has been designed to incorporate the auditability of the model as an optimization objective during the training process of neural networks. The proposed approach incorporates intermediate activation maps as inputs to the MINT layers, which are trained to enhance the detection of training data. We present results using a wide range of neural networks, from lighter architectures such as MobileNet to more complex ones such as Vision Transformers, evaluated on 5 public benchmarks. Our proposed Active MINT achieves over 80% accuracy in detecting whether given data was used for training, significantly outperforming previous approaches in the literature. Our aMINT and related methodological developments contribute to increasing transparency in AI models, facilitating stronger safeguards in AI deployments to achieve proper security, privacy, and copyright protection.
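
A minimal sketch of the multitask idea, assuming a toy audited CNN, a linear MINT head over one hooked activation map, and an illustrative loss weight; the paper's architectures and training protocol are more elaborate.

```python
# Illustrative aMINT-style setup: the audited model is trained on its task
# while a small MINT head, fed an intermediate activation map, learns to
# predict whether each input was part of the training set.
import torch
import torch.nn as nn
import torch.nn.functional as F

audited = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 16, 10))

mint_head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 32 * 32, 1))

feats = {}
audited[1].register_forward_hook(          # capture intermediate activations
    lambda m, i, o: feats.__setitem__("act", o))

opt = torch.optim.Adam(list(audited.parameters()) + list(mint_head.parameters()))

def step(x, y, is_member, lam=0.5):
    # joint objective: task loss + weighted membership (auditability) loss
    logits = audited(x)
    member_logit = mint_head(feats["act"]).squeeze(1)
    loss = F.cross_entropy(logits, y) + lam * F.binary_cross_entropy_with_logits(
        member_logit, is_member.float())
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

step(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)), torch.randint(0, 2, (8,)))
```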

[156] Object-level Correlation for Few-Shot Segmentation

Chunlin Wen, Yu Zhang, Jie Fan, Hongyuan Zhu, Xiu-Shen Wei, Yijun Wang, Zhiqiang Kou, Shuzhou Sun

Main category: cs.CV

TL;DR: OCNet introduces object-level correlation for few-shot semantic segmentation, addressing background noise by matching support target objects with query general objects instead of entire query images.

DetailsMotivation: Existing FSS methods build image-level correlations that contain hard pixel noise from irrelevant background objects, leading to overfitting and poor segmentation performance.

Method: Proposes Object-level Correlation Network (OCNet) with General Object Mining Module (GOMM) to extract query general object features, and Correlation Construction Module (CCM) to establish object-level correlation by matching target prototypes with general object features.

Result: Achieves state-of-the-art performance on PASCAL-5i and COCO-20i benchmarks, demonstrating effective suppression of background noise and improved segmentation accuracy.

Conclusion: Object-level correlation is more effective than image-level correlation for few-shot semantic segmentation, particularly in low-data regimes, as it better handles background noise and improves target object identification.

Abstract: Few-shot semantic segmentation (FSS) aims to segment objects of novel categories in the query images given only a few annotated support samples. Existing methods primarily build the image-level correlation between the support target object and the entire query image. However, this correlation contains hard pixel noise, i.e., irrelevant background objects, which is intractable to trace and suppress, leading to overfitting on the background. To address the limitation of this correlation, we imitate the biological vision process by identifying novel objects from object-level information. Identifying the target among general objects is more reliable than searching the entire image, especially in the low-data regime. Inspired by this, we design an Object-level Correlation Network (OCNet) by establishing the object-level correlation between the support target object and query general objects, which is mainly composed of the General Object Mining Module (GOMM) and Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high-level similarity cues, where the general objects include the irrelevant background objects and the target foreground object. Then, CCM establishes the object-level correlation by allocating the target prototypes to match the general object feature. The generated object-level correlation can mine the query target feature and suppress the hard pixel noise for the final prediction. Extensive experiments on PASCAL-${5}^{i}$ and COCO-${20}^{i}$ show that our model achieves state-of-the-art performance.

[157] ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion

Ao Li, Jinpeng Liu, Yixuan Zhu, Yansong Tang

Main category: cs.CV

TL;DR: ScoreHOI is a diffusion-based optimizer that uses diffusion priors and physical constraints to achieve precise human-object interaction reconstruction, outperforming state-of-the-art methods.

DetailsMotivation: Previous optimization methods struggle with physically plausible reconstruction results due to lack of prior knowledge about human-object interactions.

Method: Uses diffusion priors and score-guided sampling to reconstruct conditional distribution of human and object poses. Incorporates physical constraints during denoising and employs contact-driven iterative refinement for better contact plausibility.

Result: Extensive evaluations show superior performance over state-of-the-art methods, achieving precise and robust improvement in joint human-object interaction reconstruction.

Conclusion: ScoreHOI effectively addresses the limitations of previous methods by leveraging diffusion priors and physical constraints, demonstrating significant advancements in human-object interaction reconstruction.

Abstract: Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability within score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object feature. During inference, ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI’s superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.
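
The constraint-guided denoising can be sketched as a classifier-guidance-style update; `denoise_step` and `constraint_energy` are hypothetical stand-ins for the diffusion update and the paper's physical penalties.

```python
# Schematic score-guided update: after each denoising step, nudge the pose
# sample down the gradient of a scalar constraint energy (e.g., penetration
# or contact penalties). The guidance scale is illustrative.
import torch

def guided_step(x, t, denoise_step, constraint_energy, guidance_scale=0.1):
    x = denoise_step(x, t)                           # standard diffusion update
    x = x.detach().requires_grad_(True)
    energy = constraint_energy(x)                    # scalar implausibility score
    grad, = torch.autograd.grad(energy, x)
    return (x - guidance_scale * grad).detach()      # push toward plausibility
```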

[158] Multimodal Contrastive Pretraining of CBCT and IOS for Enhanced Tooth Segmentation

Moo Hyun Son, Juyoung Bae, Zelin Qiu, Jiale Peng, Kai Xin Li, Yifan Lin, Hao Chen

Main category: cs.CV

TL;DR: ToothMCL is a multimodal contrastive learning framework that integrates CBCT and IOS data for superior tooth segmentation, achieving state-of-the-art performance with 12% improvement for CBCT and 8% for IOS segmentation.

DetailsMotivation: Existing tooth segmentation methodologies lack rigorous validation and demonstrate limited performance and clinical applicability, creating a need for more accurate digital dentistry solutions.

Method: ToothMCL uses multimodal contrastive learning to integrate volumetric (CBCT) and surface-based (IOS) modalities, capturing modality-invariant representations for precise multi-class segmentation and FDI tooth numbering.

Result: Achieves state-of-the-art performance with 12% improvement for CBCT segmentation and 8% for IOS segmentation in Dice Similarity Coefficient, demonstrating robust generalizability across diverse imaging conditions.

Conclusion: This work presents the first multimodal pretraining framework for tooth segmentation, offering superior accuracy and clinical applicability for digital dentistry through integrated CBCT-IOS data processing.

Abstract: Digital dentistry represents a transformative shift in modern dental practice. The foundational step in this transformation is the accurate digital representation of the patient’s dentition, which is obtained from segmented Cone-Beam Computed Tomography (CBCT) and Intraoral Scans (IOS). Despite the growing interest in digital dental technologies, existing segmentation methodologies frequently lack rigorous validation and demonstrate limited performance and clinical applicability. To the best of our knowledge, this is the first work to introduce a multimodal pretraining framework for tooth segmentation. We present ToothMCL, a Tooth Multimodal Contrastive Learning for pretraining that integrates volumetric (CBCT) and surface-based (IOS) modalities. By capturing modality-invariant representations through multimodal contrastive learning, our approach effectively models fine-grained anatomical features, enabling precise multi-class segmentation and accurate identification of Fédération Dentaire Internationale (FDI) tooth numbering. Along with the framework, we curated CBCT-IOS3.8K, the largest paired CBCT and IOS dataset to date, comprising 3,867 patients. We then evaluated ToothMCL on a comprehensive collection of independent datasets, representing the largest and most diverse evaluation to date. Our method achieves state-of-the-art performance in both internal and external testing, with an increase of 12% for CBCT segmentation and 8% for IOS segmentation in the Dice Similarity Coefficient (DSC). Furthermore, ToothMCL consistently surpasses existing approaches in tooth groups and demonstrates robust generalizability across varying imaging conditions and clinical scenarios.
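
The modality-invariant pretraining signal can be illustrated with a standard symmetric InfoNCE loss over paired CBCT/IOS embeddings; the encoders, temperature, and embedding dimensions here are placeholders, and the paper's exact objective may differ.

```python
# Minimal sketch of a symmetric multimodal contrastive (InfoNCE) objective:
# paired CBCT/IOS embeddings of the same patient attract, all other pairs
# in the batch repel.
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(cbct_emb, ios_emb, tau=0.07):
    # cbct_emb, ios_emb: (B, D) embeddings of the *same* B patients
    a = F.normalize(cbct_emb, dim=-1)
    b = F.normalize(ios_emb, dim=-1)
    logits = a @ b.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0))             # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = multimodal_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128))
```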

[159] Accelerating Local AI on Consumer GPUs: A Hardware-Aware Dynamic Strategy for YOLOv10s

Mahmudul Islam Masum, Miad Islam, Arif I. Sarwat

Main category: cs.CV

TL;DR: This paper addresses performance bottlenecks in object detection on consumer hardware by introducing a Two-Pass Adaptive Inference algorithm that achieves 1.85x speedup with minimal accuracy loss.

DetailsMotivation: There's a critical gap between benchmark performance of object detectors and their practical viability on consumer-grade hardware, where performance is dominated by system-level bottlenecks rather than compute limitations.

Method: Two-Pass Adaptive Inference algorithm: a model-independent approach that runs a fast low-resolution pass first and escalates to the high-resolution model only when detection confidence is low. Includes a comparative analysis of architectural early-exit and resolution-adaptive routing strategies.

Result: On 5000-image COCO dataset, achieves 1.85x speedup over PyTorch Early-Exit baseline with modest mAP loss of 5.51%.

Conclusion: Provides practical blueprint for deploying real-time AI on consumer devices by shifting focus from pure model optimization to hardware-aware inference strategies that maximize throughput.

Abstract: As local AI grows in popularity, there is a critical gap between the benchmark performance of object detectors and their practical viability on consumer-grade hardware. While models like YOLOv10s promise real-time speeds, these metrics are typically achieved on high-power, desktop-class GPUs. This paper reveals that on resource-constrained systems, such as laptops with RTX 4060 GPUs, performance is not compute-bound but is instead dominated by system-level bottlenecks, as illustrated by a simple bottleneck test. To overcome this hardware-level constraint, we introduce a Two-Pass Adaptive Inference algorithm, a model-independent approach that requires no architectural changes. This study mainly focuses on adaptive inference strategies and undertakes a comparative analysis of architectural early-exit and resolution-adaptive routing, highlighting their respective trade-offs within a unified evaluation framework. The system uses a fast, low-resolution pass and only escalates to a high-resolution model pass when detection confidence is low. On a 5000-image COCO dataset, our method achieves a 1.85x speedup over a PyTorch Early-Exit baseline, with a modest mAP loss of 5.51%. This work provides a practical and reproducible blueprint for deploying high-performance, real-time AI on consumer-grade devices by shifting the focus from pure model optimization to hardware-aware inference strategies that maximize throughput.
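
A sketch of the two-pass control flow, with `detect_lowres`/`detect_highres` as hypothetical stand-ins for YOLOv10s inference at two input resolutions and a tunable confidence threshold (not a published constant):

```python
# Two-pass adaptive inference: a cheap low-resolution pass first, escalating
# to the full-resolution model only when the best detection confidence falls
# below the threshold. Detections are assumed to be dicts with a
# "confidence" key.
def adaptive_detect(image, detect_lowres, detect_highres, conf_thresh=0.5):
    detections = detect_lowres(image)            # fast first pass
    best = max((d["confidence"] for d in detections), default=0.0)
    if best < conf_thresh:                       # uncertain -> escalate
        detections = detect_highres(image)       # accurate second pass
    return detections
```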

[160] Dynamic Scene 3D Reconstruction of an Uncooperative Resident Space Object

Bala Prenith Reddy Gopu, Timothy Jacob Huber, George M. Nehma, Patrick Quinn, Madhur Tiwari, Matt Ueckermann, David Hinckley, Christopher McKenna

Main category: cs.CV

TL;DR: Evaluation of 3D reconstruction algorithms for tumbling space objects using a simulation environment, with Neuralangelo showing promising results on static scenes as a baseline for dynamic scene analysis.

DetailsMotivation: Characterization of uncooperative Resident Space Objects is crucial for On-Orbit Servicing and Active Debris Removal missions to assess geometry and motion properties.

Method: Developed simulation environment using Isaac Sim to generate physics-accurate 2D image sequences of tumbling satellites under realistic orbital lighting conditions, evaluating state-of-the-art 3D reconstruction algorithms.

Result: Preliminary results on static scenes using Neuralangelo demonstrate promising reconstruction quality - generated 3D meshes closely match original CAD models with minimal errors and artifacts, capturing critical fine details for mission planning.

Conclusion: Provides a baseline for ongoing evaluation of dynamic scene reconstruction algorithms for tumbling uncooperative space targets.

Abstract: Characterization of uncooperative Resident Space Objects (RSOs) plays a crucial role in On-Orbit Servicing (OOS) and Active Debris Removal (ADR) missions to assess their geometry and motion properties. To address the challenges of reconstructing tumbling uncooperative targets, this study evaluates the performance of existing state-of-the-art 3D reconstruction algorithms for dynamic scenes, focusing on their ability to generate geometrically accurate, high-fidelity models. To support our evaluation, we developed a simulation environment using Isaac Sim to generate physics-accurate 2D image sequences of a tumbling satellite under realistic orbital lighting conditions. Our preliminary results on static scenes using Neuralangelo demonstrate promising reconstruction quality. The generated 3D meshes closely match the original CAD models with minimal errors and artifacts when compared using CloudCompare (CC). The reconstructed models were able to capture critical fine details for mission planning. This provides a baseline for our ongoing evaluation of dynamic scene reconstruction.

[161] One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation

Zheng Geng, Nan Wang, Shaocong Xu, Chongjie Ye, Bohan Li, Zhaoxi Chen, Sida Peng, Hao Zhao

Main category: cs.CV

TL;DR: OnePoseViaGen enables 6D pose estimation of unseen objects from a single reference image using coarse-to-fine alignment and text-guided domain randomization.

DetailsMotivation: Addressing the challenge of 6D pose estimation for arbitrary unseen objects where 3D models are unavailable, single-view reconstructions lack metric scale, and domain gaps exist between generated models and real images.

Method: A pipeline with two key components: 1) coarse-to-fine alignment module that jointly refines scale and pose using multi-view feature matching with render-and-compare refinement, and 2) text-guided generative domain randomization strategy to diversify textures for effective fine-tuning.

Result: Achieves state-of-the-art performance on challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O) and demonstrates robust dexterous grasping with a real robot hand.

Conclusion: The method enables high-fidelity single-view 3D generation to support reliable one-shot 6D pose estimation, validating practicality for real-world manipulation tasks.

Abstract: Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robotics operating in the long-tail of real-world instances. However, this setting is notoriously challenging: 3D models are rarely available, single-view reconstructions lack metric scale, and domain gaps between generated models and real-world images undermine robustness. We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components. First, a coarse-to-fine alignment module jointly refines scale and pose by combining multi-view feature matching with render-and-compare refinement. Second, a text-guided generative domain randomization strategy diversifies textures, enabling effective fine-tuning of pose estimators with synthetic data. Together, these steps allow high-fidelity single-view 3D generation to support reliable one-shot 6D pose estimation. On challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O), OnePoseViaGen achieves state-of-the-art performance far surpassing prior approaches. We further demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation. Project page: https://gzwsama.github.io/OnePoseviaGen.github.io/

[162] Visual Representation Alignment for Multimodal Large Language Models

Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim

Main category: cs.CV

TL;DR: VIRAL is a regularization method that aligns MLLM visual representations with pre-trained vision foundation models to preserve fine-grained visual details and improve performance on vision-centric tasks.

DetailsMotivation: Current MLLMs underperform on vision-centric tasks like object counting and spatial reasoning due to text-only supervision that causes loss of fine-grained visual details during training.

Method: Visual Representation Alignment (VIRAL) - a regularization strategy that explicitly aligns internal visual representations of MLLMs with pre-trained vision foundation models to retain critical visual details.

Result: Consistent improvements across all tasks on widely adopted multimodal benchmarks, with comprehensive ablation studies validating key design choices.

Conclusion: VIRAL opens an important direction for effective visual information integration in MLLM training, enhancing reasoning over complex visual inputs.

Abstract: Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
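
A minimal sketch of a VIRAL-style regularizer, assuming visual token features from the MLLM, frozen VFM features for the same patches, and a learned projection; the layer choice and loss weight are illustrative.

```python
# Alignment regularizer sketch: penalize per-token cosine dissimilarity
# between the MLLM's internal visual features and frozen VFM features.
import torch
import torch.nn.functional as F

def viral_loss(mllm_visual_feats, vfm_feats, proj):
    # mllm_visual_feats: (B, N, D_mllm); vfm_feats: (B, N, D_vfm), frozen
    aligned = proj(mllm_visual_feats)                     # map to VFM dim
    cos = F.cosine_similarity(aligned, vfm_feats.detach(), dim=-1)
    return (1.0 - cos).mean()                             # per-token alignment

proj = torch.nn.Linear(4096, 1024)                        # illustrative dims
reg = viral_loss(torch.randn(2, 16, 4096), torch.randn(2, 16, 1024), proj)
# total_loss = lm_loss + lambda_align * reg   # added to the usual LM objective
```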

[163] A Challenging Benchmark of Anime Style Recognition

Haotang Li, Shengtao Guo, Kailin Lyu, Xiao Yang, Tianchen Chen, Jianqing Zhu, Huanqiang Zeng

Main category: cs.CV

TL;DR: Proposes LSASRD, a large-scale anime style recognition benchmark with 20,937 images from 190 anime works, featuring a cross-role evaluation protocol to ensure models learn abstract painting styles rather than character features.

DetailsMotivation: Anime style recognition (ASR) is challenging due to large semantic gaps compared to biometric recognition, yet receives little research attention despite its importance in determining if images come from the same anime work.

Method: Created LSASRD dataset with complex factors (illuminations, poses, colors, compositions), designed cross-role protocol where query/gallery images come from different characters, and applied person re-identification methods (AGW and TransReID) as baselines.

Result: The transformer model (TransReID) achieved only 42.24% mAP, demonstrating the difficulty of the ASR task and its huge semantic gap.

Conclusion: ASR presents significant challenges that deserve deep and long-term research, with the dataset and code being made publicly available to facilitate future work.

Abstract: Given two images of different anime roles, anime style recognition (ASR) aims to learn abstract painting style to determine whether the two images are from the same work, which is an interesting but challenging problem. Unlike biometric recognition, such as face recognition, iris recognition, and person re-identification, ASR suffers from a much larger semantic gap but receives less attention. In this paper, we propose a challenging ASR benchmark. Firstly, we collect a large-scale ASR dataset (LSASRD), which contains 20,937 images of 190 anime works, where each work has at least ten different roles. In addition to its large scale, LSASRD contains a list of challenging factors, such as complex illuminations, various poses, theatrical colors, and exaggerated compositions. Secondly, we design a cross-role protocol to evaluate ASR performance, in which query and gallery images must come from different roles, to validate that an ASR model learns abstract painting style rather than discriminative features of roles. Finally, we apply two powerful person re-identification methods, namely AGW and TransReID, to construct the baseline performance on LSASRD. Surprisingly, the recent transformer model (i.e., TransReID) achieves only 42.24% mAP on LSASRD. Therefore, we believe that the ASR task, with its huge semantic gap, deserves deep and long-term research. We will release our dataset and code at https://github.com/nkjcqvcpi/ASR.

[164] Closed-Loop Unsupervised Representation Disentanglement with $β$-VAE Distillation and Diffusion Probabilistic Feedback

Xin Jin, Bohan Li, BAAO Xie, Wenyao Zhang, Jinming Liu, Ziqiang Li, Tao Yang, Wenjun Zeng

Main category: cs.CV

TL;DR: CL-Dis is a closed-loop unsupervised representation disentanglement approach that combines diffusion autoencoder with β-VAE for better generalization and adaptive training without label annotation.

DetailsMotivation: Address three core issues in representation disentanglement: heavy reliance on labels/synthetic data, heuristic constraints limiting optimal training trade-offs, and lack of reasonable evaluation metrics for real label-free data.

Method: Uses diffusion-based autoencoder backbone with β-VAE co-pilot, VAE-latent distillation, diffusion-wise feedback in closed-loop system, self-supervised navigation strategy for semantic directions, and content-tracking based evaluation metric.

Result: Superior performance on real image manipulation and visual analysis applications compared to existing methods.

Conclusion: CL-Dis effectively addresses current disentanglement challenges through complementary diffusion-VAE architecture and closed-loop mutual promotion, enabling better unsupervised representation learning for real-world scenarios.

Abstract: Representation disentanglement may help AI fundamentally understand the real world and thus benefit both discrimination and generation tasks. It currently has at least three unresolved core issues: (i) heavy reliance on label annotation and synthetic data, causing poor generalization on natural scenarios; (ii) heuristic/hand-crafted disentangling constraints that make it hard to adaptively achieve an optimal training trade-off; and (iii) the lack of a reasonable evaluation metric, especially for real label-free data. To address these challenges, we propose a Closed-Loop unsupervised representation Disentanglement approach dubbed CL-Dis. Specifically, we use a diffusion-based autoencoder (Diff-AE) as a backbone while resorting to $\beta$-VAE as a co-pilot to extract semantically disentangled representations. The strong generation ability of the diffusion model and the good disentanglement ability of the VAE model are complementary. To strengthen disentangling, VAE-latent distillation and diffusion-wise feedback are interconnected in a closed-loop system for a further mutual promotion. Then, a self-supervised Navigation strategy is introduced to identify interpretable semantic directions in the disentangled latent space. Finally, a new metric based on content tracking is designed to evaluate the disentanglement effect. Experiments demonstrate the superiority of CL-Dis on applications like real image manipulation and visual analysis.
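
For reference, the β-VAE co-pilot optimizes the standard β-weighted ELBO; a minimal sketch follows, omitting the Diff-AE backbone and the closed-loop distillation/feedback terms.

```python
# beta-VAE objective sketch: reconstruction plus a beta-weighted KL term
# that pressures the latent toward factorized (disentangled) codes.
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    recon = F.mse_loss(x_recon, x, reduction="mean")
    # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```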

[165] SGCNeRF: Few-Shot Neural Rendering via Sparse Geometric Consistency Guidance

Yuru Xiao, Xianming Liu, Deming Zhai, Kui Jiang, Junjun Jiang, Xiangyang Ji

Main category: cs.CV

TL;DR: SGCNeRF introduces feature-matching-based sparse geometry regularization with frequency guidance to improve few-shot neural rendering, outperforming FreeNeRF by 0.7 dB PSNR on LLFF and DTU datasets.

DetailsMotivation: NeRF struggles with sparse views due to overfitting, and FreeNeRF's low initial positional encoding bandwidth excludes high-frequency details. A holistic approach is needed to address both overfitting and preserve high-frequency information.

Method: Proposes a feature-matching-based sparse geometry regularization module with spatially consistent geometry filtering and frequency-guided geometric regularization. Progressively refines geometry and textures across NeRF iterations.

Result: SGCNeRF achieves superior geometry-consistent outcomes and surpasses FreeNeRF with 0.7 dB improvement in PSNR on both LLFF and DTU datasets.

Conclusion: The proposed SGCNeRF architecture effectively addresses few-shot neural rendering challenges by preserving high-frequency details while preventing overfitting, demonstrating state-of-the-art performance in novel view synthesis.

Abstract: Neural Radiance Field (NeRF) technology has made significant strides in creating novel viewpoints. However, its effectiveness is hampered when working with sparsely available views, often leading to performance dips due to overfitting. FreeNeRF attempts to overcome this limitation by integrating implicit geometry regularization, which incrementally improves both geometry and textures. Nonetheless, an initial low positional encoding bandwidth results in the exclusion of high-frequency elements. The quest for a holistic approach that simultaneously addresses overfitting and the preservation of high-frequency details remains ongoing. This study presents a novel feature-matching-based sparse geometry regularization module, enhanced by a spatially consistent geometry filtering mechanism and a frequency-guided geometric regularization strategy. This module excels at accurately identifying high-frequency keypoints, effectively preserving fine structural details. Through progressive refinement of geometry and textures across NeRF iterations, we unveil an effective few-shot neural rendering architecture, designated as SGCNeRF, for enhanced novel view synthesis. Our experiments demonstrate that SGCNeRF not only achieves superior geometry-consistent outcomes but also surpasses FreeNeRF, with improvements of 0.7 dB in PSNR on LLFF and DTU.

[166] TractGraphFormer: Anatomically Informed Hybrid Graph CNN-Transformer Network for Interpretable Sex and Age Prediction from Diffusion MRI Tractography

Yuqian Chen, Fan Zhang, Meng Wang, Leo R. Zekelman, Suheyla Cetin-Karayumak, Tengfei Xue, Chaoyi Zhang, Yang Song, Jarrett Rushmore, Nikos Makris, Yogesh Rathi, Weidong Cai, Lauren J. O’Donnell

Main category: cs.CV

TL;DR: TractGraphFormer is a hybrid Graph CNN-Transformer model for diffusion MRI tractography that combines local anatomical features with global dependencies to predict sex and age from brain white matter connections.

DetailsMotivation: Current deep learning approaches for studying brain connections and phenotypes often overlook both local and global properties of white matter networks in convolutional network design.

Method: Hybrid Graph CNN-Transformer framework where Graph CNN captures white matter geometry and grey matter connectivity for local features, and Transformer uses self-attention for global information learning, with an attention module for interpretability.

Result: Strong performance in sex and age prediction on large datasets (children n=9345, young adults n=1065), identifying consistent predictive anatomical tracts across datasets, showing widespread WM connections are predictive of individual sex and age.

Conclusion: Integrating local anatomical information with global feature dependencies improves prediction performance in machine learning with diffusion MRI tractography, highlighting the potential of this hybrid approach.

Abstract: The relationship between brain connections and non-imaging phenotypes is increasingly studied using deep neural networks. However, the local and global properties of brain white matter networks are often overlooked in convolutional network design. We introduce TractGraphFormer, a hybrid Graph CNN-Transformer deep learning framework tailored for diffusion MRI tractography. This model leverages local anatomical characteristics and global feature dependencies of white matter structures. The Graph CNN module captures white matter geometry and grey matter connectivity to aggregate local features from anatomically similar white matter connections, while the Transformer module uses self-attention to enhance global information learning. Additionally, TractGraphFormer includes an attention module for interpreting predictive white matter connections. We apply TractGraphFormer to tasks of sex and age prediction. TractGraphFormer shows strong performance in large datasets of children (n=9345) and young adults (n=1065). Overall, our approach suggests that widespread connections in the WM are predictive of the sex and age of an individual. For each prediction task, consistent predictive anatomical tracts are identified across the two datasets. The proposed approach highlights the potential of integrating local anatomical information and global feature dependencies to improve prediction performance in machine learning with diffusion MRI tractography.

[167] MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning

Minghao Han, Linhao Qu, Dingkang Yang, Xukun Zhang, Xiaoying Wang, Lihua Zhang

Main category: cs.CV

TL;DR: MSCPT is a multi-scale prompt tuning method for few-shot weakly supervised WSI classification that addresses limitations of existing methods by leveraging VLM text knowledge, capturing multi-scale contextual information, and improving instance aggregation.

DetailsMotivation: Existing prompt tuning methods for WSIs fail to fully utilize VLM text knowledge, overlook multi-scale contextual information, and lack effective instance aggregation, leading to suboptimal performance in few-shot scenarios.

Method: Uses frozen LLM to generate pathological visual language prior knowledge at multiple scales, designs graph prompt tuning for contextual information, and introduces non-parametric cross-guided instance aggregation for WSI-level features.

Result: Extensive experiments on five datasets and three downstream tasks with three VLMs demonstrate strong performance, with visualizations and interpretability analyses confirming effectiveness.

Conclusion: MSCPT effectively addresses the challenges of few-shot WSI classification by integrating multi-scale knowledge, contextual learning, and improved instance aggregation, showing superior performance across multiple benchmarks.

Abstract: Multiple instance learning (MIL) has become a standard paradigm for the weakly supervised classification of whole slide images (WSIs). However, this paradigm relies on using a large number of labeled WSIs for training. The lack of training data and the presence of rare diseases pose significant challenges for these methods. Prompt tuning combined with pre-trained Vision-Language models (VLMs) is an effective solution to the Few-shot Weakly Supervised WSI Classification (FSWC) task. Nevertheless, applying prompt tuning methods designed for natural images to WSIs presents three significant challenges: 1) These methods fail to fully leverage the prior knowledge from the VLM’s text modality; 2) They overlook the essential multi-scale and contextual information in WSIs, leading to suboptimal results; and 3) They lack exploration of instance aggregation methods. To address these problems, we propose a Multi-Scale and Context-focused Prompt Tuning (MSCPT) method for FSWC task. Specifically, MSCPT employs the frozen large language model to generate pathological visual language prior knowledge at multiple scales, guiding hierarchical prompt tuning. Additionally, we design a graph prompt tuning module to learn essential contextual information within WSI, and finally, a non-parametric cross-guided instance aggregation module has been introduced to derive the WSI-level features. Extensive experiments, visualizations, and interpretability analyses were conducted on five datasets and three downstream tasks using three VLMs, demonstrating the strong performance of our MSCPT. All codes have been made publicly accessible at https://github.com/Hanminghao/MSCPT.

[168] InteractPro: A Unified Framework for Motion-Aware Image Composition

Weijing Tao, Xiaofeng Yang, Miaomiao Cui, Guosheng Lin

Main category: cs.CV

TL;DR: InteractPro is a framework for dynamic motion-aware image composition that uses an intelligent planner (InteractPlan) with LVLM to choose between physics-based simulation (InteractPhys) and diffusion-based methods (InteractMotion) for realistic motion effects.

DetailsMotivation: Traditional image composition methods require manual planning for object placement and generate static outputs without motion awareness, limiting their realism and applicability.

Method: Uses InteractPlan LVLM planner to analyze scenarios and choose between InteractPhys (MPM-based physics simulation) for physical interactions and InteractMotion (pretrained video diffusion) for motion effects, unifying both approaches under planner guidance.

Result: Extensive evaluations show InteractPro produces controllable, coherent compositions across varied scenarios with realistic motion effects, overcoming limitations of traditional static composition methods.

Conclusion: InteractPro successfully addresses the limitations of traditional composition methods by providing a unified framework that generates motion-aware, realistic compositions through intelligent planning and specialized physical/diffusion modules.

Abstract: We introduce InteractPro, a comprehensive framework for dynamic motion-aware image composition. At its core is InteractPlan, an intelligent planner that leverages a Large Vision Language Model (LVLM) for scenario analysis and object placement, determining the optimal composition strategy to achieve realistic motion effects. Based on each scenario, InteractPlan selects between our two specialized modules: InteractPhys and InteractMotion. InteractPhys employs an enhanced Material Point Method (MPM)-based simulation to produce physically faithful and controllable object-scene interactions, capturing diverse and abstract events that require true physical modeling. InteractMotion, in contrast, is a training-free method based on pretrained video diffusion. Traditional composition approaches suffer from two major limitations: requiring manual planning for object placement and generating static, motionless outputs. By unifying simulation-based and diffusion-based methods under planner guidance, InteractPro overcomes these challenges, ensuring richly motion-aware compositions. Extensive quantitative and qualitative evaluations demonstrate InteractPro’s effectiveness in producing controllable and coherent compositions across varied scenarios.
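
A minimal dispatch sketch of the planner-guided design; `lvlm_plan`, `interact_phys`, and `interact_motion` are hypothetical stand-ins for InteractPlan, InteractPhys, and InteractMotion.

```python
# Planner-guided routing: the LVLM planner inspects the scene and object,
# decides placement and strategy, and routes to either the physics
# simulation module or the video-diffusion module.
def compose(scene, obj, lvlm_plan, interact_phys, interact_motion):
    plan = lvlm_plan(scene, obj)          # placement + strategy decision
    if plan["needs_physics"]:             # e.g., collisions, granular contact
        return interact_phys(scene, obj, plan["placement"])
    return interact_motion(scene, obj, plan["placement"])
```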

[169] PnP-Flow: Plug-and-Play Image Restoration with Flow Matching

Ségolène Martin, Anne Gagneux, Paul Hagemann, Gabriele Steidl

Main category: cs.CV

TL;DR: PnP Flow Matching combines Plug-and-Play optimization with Flow Matching to solve imaging inverse problems more effectively than existing methods, achieving superior performance on tasks like denoising, super-resolution, deblurring, and inpainting.

DetailsMotivation: Traditional PnP methods have limitations on generative tasks like inpainting, while Flow Matching models excel at image generation but lack efficient methods for image restoration tasks. The paper aims to bridge this gap by combining both approaches.

Method: Defines a time-dependent denoiser using pre-trained Flow Matching model, then alternates between gradient descent on data-fidelity term, reprojections onto the learned FM path, and denoising steps. The method avoids backpropagation through ODEs and trace computations for efficiency.

Result: The algorithm demonstrates superior performance compared to existing PnP algorithms and Flow Matching based state-of-the-art methods across multiple imaging tasks including denoising, super-resolution, deblurring, and inpainting.

Conclusion: PnP Flow Matching successfully combines the strengths of both PnP optimization and Flow Matching, providing a computationally efficient and memory-friendly solution that outperforms existing methods on various imaging inverse problems.

Abstract: In this paper, we introduce Plug-and-Play (PnP) Flow Matching, an algorithm for solving imaging inverse problems. PnP methods leverage the strength of pre-trained denoisers, often deep neural networks, by integrating them in optimization schemes. While they achieve state-of-the-art performance on various inverse problems in imaging, PnP approaches face inherent limitations on more generative tasks like inpainting. On the other hand, generative models such as Flow Matching pushed the boundary in image sampling yet lack a clear method for efficient use in image restoration. We propose to combine the PnP framework with Flow Matching (FM) by defining a time-dependent denoiser using a pre-trained FM model. Our algorithm alternates between gradient descent steps on the data-fidelity term, reprojections onto the learned FM path, and denoising. Notably, our method is computationally efficient and memory-friendly, as it avoids backpropagation through ODEs and trace computations. We evaluate its performance on denoising, super-resolution, deblurring, and inpainting tasks, demonstrating superior results compared to existing PnP algorithms and Flow Matching based state-of-the-art methods.
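
A schematic of one outer iteration as described in the abstract, with `A`/`At` as a hypothetical forward operator and its adjoint and `velocity` standing in for the pretrained Flow Matching model; the initialization and step schedule are illustrative, not the paper's exact algorithm.

```python
# PnP-FM sketch: alternate (1) a gradient step on the data-fidelity term
# 0.5*||A(x) - y||^2, (2) reprojection onto the FM path by re-noising at
# time t, and (3) a one-step denoise that extrapolates to t=1 with the
# learned velocity field. No backprop through ODEs is required.
import torch

def pnp_flow_matching(y, A, At, velocity, steps=100, step_size=1.0):
    x = At(y)                                           # back-projection init
    for k in range(steps):
        t = k / steps                                   # time along the FM path
        x = x - step_size * At(A(x) - y)                # data-fidelity gradient
        x_t = t * x + (1 - t) * torch.randn_like(x)     # reproject onto FM path
        x = x_t + (1 - t) * velocity(x_t, t)            # one-step denoise to t=1
    return x
```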

[170] Self Supervised Networks for Learning Latent Space Representations of Human Body Scans and Motions

Emmanuel Hartman, Nicolas Charon, Martin Bauer

Main category: cs.CV

TL;DR: Self-supervised neural networks for 3D human body analysis, featuring VariShaPE for latent space shape/pose estimation and MoGeN for motion geometry learning in latent space.

DetailsMotivation: To address fundamental problems in 3D human body analysis with fast, robust methods for latent space representation and motion processing.

Method: Two novel architectures: VariShaPE for estimating latent embeddings of unregistered meshes, and MoGeN for learning geometry in latent space through Euclidean space lifting and linear interpolation of motion sequences.

Result: The combined models enable efficient operations like motion interpolation, extrapolation, transfer, and random shape/pose generation with minimal computational cost using SMPL latent space.

Conclusion: The proposed self-supervised framework provides a comprehensive solution for various 3D human body processing tasks with high efficiency and robustness.

Abstract: This paper introduces self-supervised neural network models to tackle several fundamental problems in the field of 3D human body analysis and processing. First, we propose VariShaPE (Varifold Shape Parameter Estimator), a novel architecture for the retrieval of latent space representations of body shapes and poses. This network offers a fast and robust method to estimate the embedding of arbitrary unregistered meshes into the latent space. Second, we complement the estimation of latent codes with MoGeN (Motion Geometry Network) a framework that learns the geometry on the latent space itself. This is achieved by lifting the body pose parameter space into a higher dimensional Euclidean space in which body motion mini-sequences from a training set of 4D data can be approximated by simple linear interpolation. Using the SMPL latent space representation we illustrate how the combination of these network models, once trained, can be used to perform a variety of tasks with very limited computational cost. This includes operations such as motion interpolation, extrapolation and transfer as well as random shape and pose generation.
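
A sketch of motion interpolation in the MoGeN spirit: lift pose latents into the Euclidean space where motion is approximately linear, interpolate, and map back; `lift` and `unlift` are hypothetical learned networks.

```python
# Latent motion interpolation: in the lifted space, a straight line between
# two poses approximates a plausible motion mini-sequence.
import torch

def interpolate_motion(z0, z1, lift, unlift, n=10):
    u0, u1 = lift(z0), lift(z1)                     # to the lifted space
    ts = torch.linspace(0.0, 1.0, n).view(-1, 1)
    path = (1 - ts) * u0 + ts * u1                  # linear interpolation
    return unlift(path)                             # back to pose latents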

[171] OOD-SEG: Exploiting out-of-distribution detection techniques for learning image segmentation from sparse multi-class positive-only annotations

Junwen Wang, Zhonghao Wang, Oscar MacCormac, Jonathan Shapey, Tom Vercauteren

Main category: cs.CV

TL;DR: A novel segmentation approach combining positive-unlabelled learning with OOD detection to address sparse annotations and detect out-of-distribution pixels in medical imaging.

DetailsMotivation: Address two key challenges in medical image segmentation: 1) time-consuming pixel-level annotation requiring domain expertise, and 2) inability to detect out-of-distribution pixels leading to spurious outputs during deployment.

Method: The proposed framework falls within the positive-unlabelled learning paradigm, learning from sparsely annotated pixels from multiple positive-only classes without any background annotation. Unlabelled pixels, which may contain both positive classes and negative/background content, are treated as the OOD set. The framework can integrate any pixel-level OOD detection approach designed for classification tasks.

Result: Extensive experiments on multi-class hyperspectral and RGB surgical imaging datasets demonstrate robustness and generalization capability. Proposed cross-validation strategy treats held-out labelled classes as OOD for evaluation.

Conclusion: The framework effectively addresses sparse annotation challenges and OOD detection in medical image segmentation, showing strong performance and generalization across different imaging modalities.

Abstract: Despite significant advancements, segmentation based on deep neural networks in medical and surgical imaging faces several challenges, two of which we aim to address in this work. First, acquiring complete pixel-level segmentation labels for medical images is time-consuming and requires domain expertise. Second, typical segmentation pipelines cannot detect out-of-distribution (OOD) pixels, leaving them prone to spurious outputs during deployment. In this work, we propose a novel segmentation approach which broadly falls within the positive-unlabelled (PU) learning paradigm and exploits tools from OOD detection techniques. Our framework learns only from sparsely annotated pixels from multiple positive-only classes and does not use any annotation for the background class. These multi-class positive annotations naturally fall within the in-distribution (ID) set. Unlabelled pixels may contain positive classes but also negative ones, including what is typically referred to as “background” in standard segmentation formulations. Here, we forgo the need for background annotation and consider these together with any other unseen classes as part of the OOD set. Our framework can integrate, at a pixel-level, any OOD detection approaches designed for classification tasks. To address the lack of existing OOD datasets and established evaluation metric for medical image segmentation, we propose a cross-validation strategy that treats held-out labelled classes as OOD. Extensive experiments on both multi-class hyperspectral and RGB surgical imaging datasets demonstrate the robustness and generalisation capability of our proposed framework.
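
Because the framework can plug in any classification-style OOD score at pixel level, a maximum-softmax-probability variant is easy to sketch; the threshold is a tunable, not a value from the paper.

```python
# Pixel-level OOD sketch: softmax over the positive (in-distribution)
# classes only; pixels whose max probability falls below the threshold are
# marked OOD/background.
import torch
import torch.nn.functional as F

def segment_with_ood(logits, ood_thresh=0.7):
    # logits: (B, C, H, W) over the C positive classes
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)                 # per-pixel MSP score
    pred[conf < ood_thresh] = -1                  # -1 = OOD / background
    return pred

labels = segment_with_ood(torch.randn(1, 5, 64, 64))
```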

[172] Understanding Museum Exhibits using Vision-Language Reasoning

Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool

Main category: cs.CV

TL;DR: Researchers created a large-scale museum dataset with 65M images and 200M QA pairs, trained specialized vision-language models (BLIP and LLaVA), and showed that fine-tuned models outperform SOTA VLMs on museum-specific visual question answering tasks requiring historical context.

DetailsMotivation: Museums contain vast cultural knowledge that visitors explore through questions, requiring specialized AI models that can analyze visual exhibits and connect them to historical context for meaningful visitor interactions.

Method: Collected and curated 65M museum exhibit images and 200M expert-labeled QA pairs; trained two VLMs (BLIP and LLaVA) on this dataset; benchmarked performance on five visual question answering tasks designed for real museum scenarios.

Result: Both model types effectively answered visually grounded questions, but large vision-language models excelled at queries requiring deeper historical context and reasoning. Fine-tuned models significantly outperformed current SOTA VLMs on domain-specific attributes and complex queries.

Conclusion: Fine-tuning vision-language models on large-scale domain-specific museum datasets is essential for handling complex, nuanced queries about cultural artifacts, with large models showing superior performance in historical reasoning tasks.

Abstract: Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions, preserving well-documented collections that encapsulate vast knowledge, which, when systematically structured into large-scale datasets, can train specialized models. Visitors engage with exhibits through curiosity and questions, making expert domain-specific models essential for interactive query resolution and gaining historical insights. Understanding exhibits from images requires analyzing visual features and linking them to historical knowledge to derive meaningful correlations. We facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs for exhibits from all around the world; (b) training large vision-language models (VLMs) on the collected dataset; (c) benchmarking their ability on five visual question answering tasks, specifically designed to reflect real-world inquiries and challenges observed in museum settings. The complete dataset is labeled by museum experts, ensuring the quality and the practical significance of the labels. We train two VLMs from different categories: BLIP with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through extensive experiments, we find that while both model types effectively answer visually grounded questions, large vision-language models excel in queries requiring deeper historical context and reasoning. We further demonstrate the necessity of fine-tuning models on large-scale domain-specific datasets by showing that our fine-tuned models significantly outperform current SOTA VLMs in answering questions related to specific attributes, highlighting their limitations in handling complex, nuanced queries.

[173] A Data-Free Analytical Quantization Scheme for Deep Learning Models

Ahmed Luqman, Khuzemah Qazi, Murray Patterson, Malik Jahan Khan, Imdadullah Khan

Main category: cs.CV

TL;DR: A novel post-training quantization method that finds optimal clipping thresholds and scaling factors with mathematical guarantees to minimize quantization noise, reducing CNN model size and computational requirements while preserving accuracy.

DetailsMotivation: CNN models have high computational and storage demands that challenge deployment on resource-constrained devices. Quantization can reduce these requirements by lowering parameter precision.

Method: Post-training quantization method that finds optimal clipping thresholds and scaling factors with mathematical guarantees to minimize quantization noise.

Result: Significantly reduces model size and computational requirements while preserving model accuracy on real-world datasets.

Conclusion: The proposed quantization scheme effectively addresses CNN deployment challenges on resource-constrained devices by reducing storage and computational demands without sacrificing accuracy.

Abstract: Despite the success of CNN models on a variety of image classification and segmentation tasks, their extensive computational and storage demands pose considerable challenges for real-world deployment on resource-constrained devices. Quantization is one technique that aims to alleviate these large storage requirements and speed up the inference process by reducing the precision of model parameters to lower-bit representations. In this paper, we introduce a novel post-training quantization method for model weights. Our method finds optimal clipping thresholds and scaling factors along with mathematical guarantees that our method minimizes quantization noise. Empirical results on real-world datasets demonstrate that our quantization scheme significantly reduces model size and computational requirements while preserving model accuracy.
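
The paper derives its thresholds analytically; as an illustration of the objective being minimized, this sketch grid-searches a symmetric int8 clipping threshold that minimizes quantization MSE for one weight tensor.

```python
# Post-training quantization sketch: quantize with a candidate clipping
# threshold, measure the reconstruction MSE, and keep the best threshold.
import torch

def quantize(w, clip, bits=8):
    scale = clip / (2 ** (bits - 1) - 1)
    q = torch.clamp(torch.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale                                  # dequantized weights

def best_clip(w, bits=8, grid=100):
    max_abs = w.abs().max()
    best, best_err = max_abs, float("inf")
    for i in range(1, grid + 1):                      # candidate thresholds
        clip = max_abs * i / grid
        err = torch.mean((w - quantize(w, clip, bits)) ** 2).item()
        if err < best_err:
            best, best_err = clip, err
    return best

w = torch.randn(256, 256)
w_q = quantize(w, best_clip(w))
```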

[174] Texture- and Shape-based Adversarial Attacks for Overhead Image Vehicle Detection

Mikael Yeghiazaryan, Sai Abhishek Siddhartha Namburu, Emily Kim, Stanislav Panev, Celso de Melo, Fernando De la Torre, Jessica K. Hodgins

Main category: cs.CV

TL;DR: This paper analyzes practical adversarial attacks on vehicle detection in aerial images, considering realistic constraints on texture modifications and shape alterations, showing a trade-off between attack effectiveness and practicality.

DetailsMotivation: Vehicle detection in aerial images faces challenges from complex backgrounds and small objects, and while deep learning has improved detection, these models are vulnerable to adversarial attacks that often ignore practical implementation constraints.

Method: The authors propose realistic constraints on texture modifications (lower resolution, limited area, color range restrictions) and analyze shape modifications. They test these on three object detector architectures with extensive experiments.

Result: Experiments demonstrate a performance-practicality trade-off: more practical modifications are less effective as attacks, while more effective attacks are less practical to implement.

Conclusion: The work provides a framework for evaluating adversarial attacks under realistic constraints and releases code and data to support reproducibility in practical security assessments of aerial vehicle detection systems.

Abstract: Detecting vehicles in aerial images is difficult due to complex backgrounds, small object sizes, shadows, and occlusions. Although recent deep learning advancements have improved object detection, these models remain susceptible to adversarial attacks (AAs), challenging their reliability. Traditional AA strategies often ignore practical implementation constraints. Our work proposes realistic and practical constraints on texture (lowering resolution, limiting modified areas, and color ranges) and analyzes the impact of shape modifications on attack performance. We conducted extensive experiments with three object detector architectures, demonstrating the performance-practicality trade-off: more practical modifications tend to be less effective, and vice versa. We release both code and data to support reproducibility at https://github.com/humansensinglab/texture-shape-adversarial-attacks.

[175] Detect Changes like Humans: Incorporating Semantic Priors for Improved Change Detection

Yuhang Gan, Wenjie Xuan, Zhiming Luo, Lei Fang, Zengmao Wang, Juhua Liu, Bo Du

Main category: cs.CV

TL;DR: SA-CDNet integrates semantic priors from visual foundation models (FastSAM) into change detection, using a dual-stream decoder to combine semantic-aware and difference-aware features, with single-temporal pre-training strategy for better adaptation.

DetailsMotivation: Current binary change detection models focus mainly on difference-aware features but lack semantic understanding of changed landscapes, making them vulnerable to noise and illumination variations. Humans use both appearance comparison and semantic understanding to identify differences.

Method: Proposes Semantic-Aware Change Detection network (SA-CDNet) that transfers knowledge from FastSAM visual foundation model. Uses dual-stream feature decoder to combine semantic-aware and difference-aware features. Implements single-temporal pre-training strategy using pseudo-change data from segmentation datasets with proxy semantic segmentation task.

Result: Experimental results on five challenging benchmarks demonstrate superiority over existing state-of-the-art methods.

Conclusion: Incorporating semantic priors from visual foundation models significantly improves change detection accuracy and robustness against noise and illumination variations.

Abstract: When given two similar images, humans identify their differences by comparing the appearance (e.g., color, texture) with the help of semantics (e.g., objects, relations). However, mainstream binary change detection models adopt a supervised training paradigm, where the annotated binary change map is the main constraint. Thus, such methods primarily emphasize difference-aware features between bi-temporal images, and the semantic understanding of changed landscapes is undermined, resulting in limited accuracy in the face of noise and illumination variations. To this end, this paper explores incorporating semantic priors from visual foundation models to improve the ability to detect changes. Firstly, we propose a Semantic-Aware Change Detection network (SA-CDNet), which transfers the knowledge of visual foundation models (i.e., FastSAM) to change detection. Inspired by the human visual paradigm, a novel dual-stream feature decoder is derived to distinguish changes by combining semantic-aware features and difference-aware features. Secondly, we explore a single-temporal pre-training strategy for better adaptation of visual foundation models. With pseudo-change data constructed from single-temporal segmentation datasets, we employ an extra branch of proxy semantic segmentation task for pre-training. We explore various settings like dataset combinations and landscape types, thus providing valuable insights. Experimental results on five challenging benchmarks demonstrate the superiority of our method over the existing state-of-the-art methods. The code is available at https://github.com/DREAMXFAR/SA-CDNet.

[176] Efficient Deep Learning-based Forward Solvers for Brain Tumor Growth Models

Zeineb Haouari, Jonas Weidner, Yeray Martin-Ruisanchez, Ivan Ezhov, Aswathi Varma, Daniel Rueckert, Bjoern Menze, Benedikt Wiestler

Main category: cs.CV

TL;DR: Neural network models for glioblastoma simulation show nnU-Net outperforms other architectures in tumor cell concentration prediction and outline matching, enabling faster model calibration for radiotherapy planning.

DetailsMotivation: Glioblastoma treatment planning requires patient-specific tumor simulations, but traditional PDE model calibration is computationally expensive. Neural forward solvers can accelerate this process but need highly accurate and differentiable models.

Method: Evaluated three neural architectures: enhanced TumorSurrogate, modified nnU-Net, and 3D Vision Transformer (ViT) as forward solvers for tumor simulation. Compared performance on tumor outline matching and voxel-level tumor cell concentration prediction.

Result: nnU-Net achieved best overall results with lowest MSE in tumor cell concentration compared to ground truth numerical simulation and highest Dice score across all tumor cell concentration thresholds.

Conclusion: nnU-Net demonstrates superior performance as a neural forward solver for glioblastoma modeling, enabling faster calibration and improved radiotherapy planning. The study highlights important future research directions for neural network-based tumor simulation.

Abstract: Glioblastoma, a highly aggressive brain tumor, poses major challenges due to its poor prognosis and high morbidity rates. Partial differential equation-based models offer promising potential to enhance therapeutic outcomes by simulating patient-specific tumor behavior for improved radiotherapy planning. However, model calibration remains a bottleneck due to the high computational demands of optimization methods like Monte Carlo sampling and evolutionary algorithms. To address this, we recently introduced an approach leveraging a neural forward solver with gradient-based optimization to significantly reduce calibration time. This approach requires a highly accurate and fully differentiable forward model. We investigate multiple architectures, including (i) an enhanced TumorSurrogate, (ii) a modified nnU-Net, and (iii) a 3D Vision Transformer (ViT). The nnU-Net achieved the best overall results, excelling in both tumor outline matching and voxel-level prediction of tumor cell concentration. It yielded the lowest MSE in tumor cell concentration compared to ground truth numerical simulation and the highest Dice score across all tumor cell concentration thresholds. Our study demonstrates significant enhancement in forward solver performance and outlines important future research directions.
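
The key enabler is that the forward model is fully differentiable, so calibration reduces to gradient descent on the growth parameters instead of Monte Carlo sampling. A toy sketch of that calibration loop follows; the stand-in surrogate and the three-parameter growth model are illustrative assumptions, not the paper's trained nnU-Net.

```python
import torch
import torch.nn as nn

# Stand-in for a trained differentiable surrogate (the paper's best
# performer is an nnU-Net variant); this architecture is illustrative.
class ToySurrogate(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(),
                                 nn.Linear(64, 8 * 8 * 8))

    def forward(self, params):  # params: (diffusivity, growth rate, seed radius)
        return self.net(params).view(-1, 8, 8, 8)

surrogate = ToySurrogate().eval()
for p in surrogate.parameters():          # freeze the pre-trained solver
    p.requires_grad_(False)

observed = torch.rand(1, 8, 8, 8)         # placeholder patient observation

# Because the forward model is differentiable, calibration is plain
# gradient-based optimization of the growth parameters.
params = torch.zeros(1, 3, requires_grad=True)
opt = torch.optim.Adam([params], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(surrogate(params), observed)
    loss.backward()
    opt.step()
```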

[177] Frequency Domain Enhanced U-Net for Low-Frequency Information-Rich Image Segmentation in Surgical and Deep-Sea Exploration Robots

Guohao Huo, Ruiting Dai, Jinliang Liu, Ling Shao, Hao Tang

Main category: cs.CV

TL;DR: Proposed FE-UNet model with wavelet adaptive spectrum fusion and perception frequency blocks to address high-frequency feature attenuation in deep-sea and surgical imaging, achieving state-of-the-art cross-domain segmentation performance.

DetailsMotivation: Address high-frequency feature attenuation caused by environmental lighting and device resolution limitations in deep-sea exploration and surgical robotics, where CNNs have different frequency sensitivity than human vision.

Method: Quantified CNN contrast sensitivity function, developed wavelet adaptive spectrum fusion (WASF) method inspired by biological vision, designed perception frequency blocks (PFB), and built FE-UNet with SAM2 backbone and fine-tuned Hiera-Large modules.

Result: FE-UNet achieves state-of-the-art performance in cross-domain tasks including marine organism segmentation and polyp segmentation, demonstrating robust adaptability.

Conclusion: The proposed approach effectively balances cross-frequency image features and shows significant application potential for challenging imaging environments with frequency attenuation issues.

Abstract: In deep-sea exploration and surgical robotics scenarios, environmental lighting and device resolution limitations often cause high-frequency feature attenuation. Addressing the differences in frequency-band sensitivity between CNNs and the human visual system (which is most sensitive to mid frequencies, with low-frequency sensitivity surpassing high-frequency sensitivity), we experimentally quantified the CNN contrast sensitivity function and proposed a wavelet adaptive spectrum fusion (WASF) method inspired by biological vision mechanisms to balance cross-frequency image features. Furthermore, we designed a perception frequency block (PFB) that integrates WASF to enhance frequency-domain feature extraction. Based on this, we developed the FE-UNet model, which employs a SAM2 backbone network and incorporates fine-tuned Hiera-Large modules to ensure segmentation accuracy while improving generalization capability. Experiments demonstrate that FE-UNet achieves state-of-the-art performance in cross-domain tasks such as marine organism segmentation and polyp segmentation, showcasing robust adaptability and significant application potential. The code will be released soon.
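
WASF's core operation is re-weighting low- against high-frequency content before fusion. A toy PyTorch sketch using a one-level Haar transform follows; the learned scalar gates are a simplifying assumption, not the published module.

```python
import torch
import torch.nn as nn

def haar_dwt2(x):
    """One-level 2D Haar transform. x: (B, C, H, W), H and W even.
    Returns the low-frequency approximation and stacked detail bands."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    low = (a + b + c + d) / 4                                 # approximation
    high = torch.cat([(a - b), (a - c), (a - d)], dim=1) / 4  # details
    return low, high

class SpectrumFusion(nn.Module):
    """Toy re-weighting of low- vs high-frequency content with learned
    scalar gates, a stand-in for the paper's WASF."""
    def __init__(self, channels):
        super().__init__()
        self.gate_low = nn.Parameter(torch.ones(1))
        self.gate_high = nn.Parameter(torch.ones(1))
        self.mix = nn.Conv2d(4 * channels, channels, 1)

    def forward(self, x):
        low, high = haar_dwt2(x)
        fused = torch.cat([self.gate_low * low, self.gate_high * high], dim=1)
        return self.mix(fused)  # half spatial resolution, original channels

x = torch.randn(1, 16, 32, 32)
print(SpectrumFusion(16)(x).shape)  # torch.Size([1, 16, 16, 16])
```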

[178] FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA

Nobin Sarwar

Main category: cs.CV

TL;DR: FilterRAG is a retrieval-augmented framework that combines BLIP-VQA with external knowledge sources to reduce hallucinations and improve accuracy in Visual Question Answering, achieving 36.5% accuracy on OK-VQA.

DetailsMotivation: VQA models struggle with hallucinations and produce incorrect answers, especially in knowledge-driven and Out-of-Distribution scenarios, limiting their real-world deployment.

Method: Introduces FilterRAG framework that integrates BLIP-VQA with Retrieval-Augmented Generation, grounding answers in external knowledge sources like Wikipedia and DBpedia.

Result: Achieves 36.5% accuracy on the OK-VQA dataset, demonstrating effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings.

Conclusion: FilterRAG shows potential to improve Visual Question Answering systems for real-world deployment by effectively combining visual understanding with external knowledge retrieval.

Abstract: Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
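
A minimal sketch of the FilterRAG flow, assuming the public Hugging Face BLIP-VQA checkpoint: `retrieve_passages` is a hypothetical stub standing in for the Wikipedia/DBpedia retriever, and the prompt format is an assumption rather than the paper's exact recipe.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def retrieve_passages(question: str, k: int = 3) -> list[str]:
    # Hypothetical placeholder: a real system would query
    # Wikipedia/DBpedia here and rank passages by relevance.
    return ["(retrieved context would appear here)"] * k

def answer(image: Image.Image, question: str) -> str:
    # Ground the question in retrieved external knowledge.
    context = " ".join(retrieve_passages(question))
    grounded_question = f"{question} Context: {context}"
    inputs = processor(image, grounded_question, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10)
    return processor.decode(out[0], skip_special_tokens=True)
```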

[179] Evaluation of Alignment-Regularity Characteristics in Deformable Image Registration

Vasiliki Sideri-Lampretsa, Daniel Rueckert, Huaqi Qiu

Main category: cs.CV

TL;DR: Novel evaluation scheme for deformable image registration using alignment-regularity characteristic curves and HyperNetwork interpolation to analyze trade-offs between accuracy and deformation regularity.

DetailsMotivation: Evaluating deformable image registration is challenging due to the inherent trade-off between alignment accuracy and deformation regularity, requiring systematic analysis.

Method: Introduces ARC curves to measure performance spectrum, uses HyperNetwork-based approach to continuously interpolate across regularization range for dense sampling.

Result: Demonstrated evaluation scheme on learning-based registration methods with various architectures, revealing findings not evident from existing practices.

Conclusion: Provides general recommendations for model evaluation and selection, with all code made publicly available for broader adoption.

Abstract: Evaluating deformable image registration (DIR) is challenging due to the inherent trade-off between achieving high alignment accuracy and maintaining deformation regularity. In this work, we introduce a novel evaluation scheme based on the alignment-regularity characteristic (ARC) to systematically capture and analyze this trade-off. We first introduce the ARC curves, which describe the performance of a given registration algorithm as a spectrum measured by alignment and regularity metrics. We further adopt a HyperNetwork-based approach that learns to continuously interpolate across the full regularization range, accelerating the construction and improving the sample density of ARC curves. We empirically demonstrate our evaluation scheme using representative learning-based deformable image registration methods with various network architectures and transformation models on two public datasets. We present a range of findings not evident from existing evaluation practices and provide general recommendations for model evaluation and selection using our evaluation scheme. All relevant code is made publicly available.
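
Conceptually, an ARC curve is traced by sweeping the regularization weight and recording (alignment, regularity) pairs; the HyperNetwork amortizes this sweep into a single model. A schematic sketch, where `register` is a hypothetical stand-in for either retraining per weight or querying the HyperNetwork, and the placeholder metrics are invented for illustration:

```python
import numpy as np

def register(lam: float) -> tuple[float, float]:
    # Hypothetical stub: returns (alignment, irregularity) for one
    # regularization weight. Here, Dice falls as lam grows while the
    # fraction of folded voxels (negative Jacobian) shrinks.
    dice = 0.9 - 0.02 * lam
    folding = 0.05 * np.exp(-5 * lam)
    return dice, folding

lams = np.logspace(-2, 1, 20)            # dense sampling of the range
arc = np.array([register(l) for l in lams])
# Plotting arc[:, 0] against arc[:, 1] gives the ARC curve for
# this registration method.
```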

[180] Audio-centric Video Understanding Benchmark without Text Shortcut

Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, Chao Zhang

Main category: cs.CV

TL;DR: The AVUT benchmark evaluates audio-visual LLMs’ video understanding with a focus on auditory information, addressing text-shortcut problems through answer-permutation filtering.

DetailsMotivation: Audio is often treated as auxiliary in video understanding, but thorough video comprehension depends critically on auditory information for context, emotion, and semantic meaning that visuals alone lack.

Method: Proposes AVUT benchmark with audio-centric tasks testing both audio content and audio-visual interactions, using answer permutation-based filtering to prevent text shortcuts.

Result: Comprehensive evaluation of diverse open-source and proprietary multimodal LLMs reveals deficiencies in current audio-visual LLMs’ capabilities.

Conclusion: AVUT provides a robust benchmark for assessing audio-centric video understanding, highlighting the importance of auditory information and exposing limitations in current multimodal LLMs.

Abstract: Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. AVUT introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. Moreover, this work points out the text shortcut problem that largely exists in other benchmarks, where the correct answer can be found from the question text alone without needing the video. AVUT addresses this problem by proposing an answer permutation-based filtering mechanism. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by analyses of the deficiencies in audio-visual LLMs. Demos and data are available at https://github.com/lark-png/AVUT.
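
The answer-permutation filter can be sketched simply: query a text-only model across shuffled option orderings, and drop any question it answers correctly every time without seeing the video. `text_only_answer` below is a hypothetical stub for that LLM call; the consistency criterion is an assumption for illustration.

```python
import random

def text_only_answer(question: str, options: list[str]) -> str:
    # Hypothetical stand-in for a text-only LLM call (no video input).
    return random.choice(options)

def has_text_shortcut(question, options, correct, n_perms=6) -> bool:
    hits = 0
    for _ in range(n_perms):
        shuffled = random.sample(options, len(options))
        if text_only_answer(question, shuffled) == correct:
            hits += 1
    # Consistently correct with no video: the text leaks the answer,
    # so the question should be filtered out of the benchmark.
    return hits == n_perms

q = "What instrument plays after the narration stops?"
opts = ["piano", "violin", "drums", "guitar"]
print(has_text_shortcut(q, opts, correct="piano"))
```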

[181] Visuospatial Cognitive Assistant

Qi Feng

Main category: cs.CV

TL;DR: ViCA introduces a 322K QA dataset from real-world indoor videos and a 7B model that achieves SOTA on spatial reasoning tasks, with explicit reasoning chains for interpretability.

DetailsMotivation: Video-based spatial cognition is crucial for robotics and embodied AI but current VLMs struggle with it, requiring targeted datasets and models.

Method: Created ViCA-322K dataset from ARKitScenes/ScanNet videos, developed ViCA-7B fine-tuned on this data, and built ViCA-Thinking-2.68K with reasoning chains.

Result: ViCA-7B achieves new SOTA on all 8 VSI-Bench tasks (+26.1 on Absolute Distance), outperforming larger models, with interpretable reasoning via ViCA-7B-Thinking.

Conclusion: Targeted data is key for spatial intelligence; work provides resources for improved temporal-spatial modeling in robotics and embodied AI.

Abstract: Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.

[182] Large-scale Pre-training for Grounded Video Caption Generation

Evangelos Kazakos, Cordelia Schmid, Josef Sivic

Main category: cs.CV

TL;DR: Proposes GROVE model for video captioning with dense object grounding, introduces HowToGround1M (auto-annotated) and iGround (manual) datasets, achieves SOTA results on multiple benchmarks.

DetailsMotivation: To address the challenging problem of generating video captions with temporally dense object grounding, where objects mentioned in captions are precisely localized with bounding boxes throughout the video.

Method: Uses automatic annotation to create large-scale HowToGround1M dataset from HowTo100M, develops GROVE model for grounded video caption generation, pre-trains on auto-annotated data then fine-tunes on manually annotated iGround dataset.

Result: Achieves state-of-the-art performance on iGround, VidSTG, ActivityNet-Entities, GroundingYouTube, and YouCook-Interactions datasets. Ablations show importance of pre-training on auto-annotated data followed by fine-tuning.

Conclusion: The proposed approach effectively combines large-scale automatic annotation with high-quality manual annotation to achieve superior performance in grounded video captioning, demonstrating the value of both pre-training and fine-tuning strategies.

Abstract: We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We introduce the following contributions. First, we present a large-scale automatic annotation method that aggregates frame-level captions grounded with bounding boxes into temporally dense and consistent annotations. We apply this approach on the HowTo100M dataset to construct a large-scale pre-training dataset, named HowToGround1M. We also introduce a Grounded Video Caption Generation model, dubbed GROVE, and pre-train the model on HowToGround1M. Second, we introduce iGround, a dataset of 3513 videos with manually annotated captions and dense spatio-temporally grounded bounding boxes. This allows us to measure progress on this challenging problem, as well as to fine-tune our model on this small-scale but high-quality data. Third, we demonstrate that our approach achieves state-of-the-art results on the proposed iGround dataset, as well as on the VidSTG, ActivityNet-Entities, GroundingYouTube, and YouCook-Interactions datasets. Our ablations demonstrate the importance of pre-training on our automatically annotated HowToGround1M dataset followed by fine-tuning on the manually annotated iGround dataset and validate the key technical contributions of our model. The dataset and code are available at https://ekazakos.github.io/grounded_video_caption_generation/.

[183] Enhancing Traffic Incident Response through Sub-Second Temporal Localization with HybridMamba

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.CV

TL;DR: HybridMamba is a novel architecture combining visual transformers with state-space modeling for precise traffic crash detection in surveillance videos, achieving a 1.50s mean error with 65.2% of predictions within 1s of ground truth.

DetailsMotivation: Traffic crash detection in long surveillance videos is challenging due to brief and infrequent crash events, requiring efficient temporal localization methods.

Method: Integrates visual transformers with state-space temporal modeling using multi-level token compression and hierarchical temporal processing for computational efficiency.

Result: Achieves 1.50s mean absolute error for 2-minute videos, outperforms video-language models by up to 3.95s while using significantly fewer parameters (3B vs 13-72B).

Conclusion: Demonstrates effective temporal localization across various video durations and environmental conditions, showing potential for fine-grained traffic surveillance while identifying remaining deployment challenges.

Abstract: Traffic crash detection in long-form surveillance videos is essential for improving emergency response and infrastructure planning, yet remains difficult due to the brief and infrequent nature of crash events. We present HybridMamba, a novel architecture integrating visual transformers with state-space temporal modeling to achieve high-precision crash time localization. Our approach introduces multi-level token compression and hierarchical temporal processing to maintain computational efficiency without sacrificing temporal resolution. Evaluated on a large-scale dataset from the Iowa Department of Transportation, HybridMamba achieves a mean absolute error of 1.50 seconds for 2-minute videos (p < 0.01 compared to baselines), with 65.2% of predictions falling within one second of the ground truth. It outperforms recent video-language models (e.g., TimeChat, VideoLLaMA-2) by up to 3.95 seconds while using significantly fewer parameters (3B vs. 13–72B). Our results demonstrate effective temporal localization across various video durations (2–40 minutes) and diverse environmental conditions, highlighting HybridMamba’s potential for fine-grained temporal localization in traffic surveillance while identifying challenges that remain for extended deployment.
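
Multi-level token compression can be illustrated with hierarchical temporal pooling: long frame-token sequences are progressively shortened to fit a fixed budget before temporal modeling. The pooling ratios and projection below are assumptions for illustration, not HybridMamba's exact design.

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Toy multi-level compressor: each stage halves the number of
    frame tokens by average pooling along time."""
    def __init__(self, dim=512, ratios=(2, 2)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.AvgPool1d(kernel_size=r, stride=r) for r in ratios])
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):            # tokens: (B, T, D)
        x = tokens.transpose(1, 2)        # (B, D, T) for 1D pooling
        for pool in self.pools:
            x = pool(x)                   # shrink the temporal axis
        return self.proj(x.transpose(1, 2))

tokens = torch.randn(1, 1024, 512)        # ~1024 frame tokens
print(TokenCompressor()(tokens).shape)    # torch.Size([1, 256, 512])
```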

[184] Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts

Qi Feng

Main category: cs.CV

TL;DR: ViCA2 is a novel multimodal LLM that enhances spatial reasoning with dual vision encoders (SigLIP for semantics, Hiera for spatial structure) and achieves state-of-the-art performance on visuospatial tasks with a compact 7B model.

DetailsMotivation: Existing MLLMs struggle with visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - due to lacking appropriate architectural components and specialized training data for fine-grained spatial understanding.

Method: Developed ViCA2 with dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, plus token ratio control mechanism. Created ViCA-322K dataset with 322,000 spatially grounded QA pairs for instruction tuning.

Result: ViCA2-7B achieves SOTA average score of 56.8 on VSI-Bench, significantly outperforming larger open-source models (LLaVA-NeXT-Video-72B: 40.9) and leading proprietary models (Gemini-1.5 Pro: 45.4).

Conclusion: The approach effectively achieves strong visuospatial intelligence with a compact model. ViCA2, its codebase, and the ViCA-322K dataset are released to facilitate further research.

Abstract: While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.

[185] A Decade of Wheat Mapping for Lebanon

Hasan Wehbi, Hasan Nasrallah, Mohamad Hasan Zahweh, Zeinab Takach, Veera Ganesh Yalla, Ali J. Ghandour

Main category: cs.CV

TL;DR: Improved pipeline for wheat field mapping using TSViT with PEFT and FTW-based post-processing, enabling accurate segmentation and boundary extraction for agricultural monitoring.

DetailsMotivation: Wheat is crucial for global food security (20% of caloric intake), and accurate mapping helps stakeholders make informed decisions about food security, supply chains, and resource allocation.

Method: Integrates Temporal Spatial Vision Transformer (TSViT) with Parameter-Efficient Fine Tuning (PEFT) and novel post-processing pipeline based on Fields of The World (FTW) framework to merge wheat segmentation with precise field boundary extraction.

Result: Produces geometrically coherent and semantically rich maps that enable tracking crop rotation patterns over years, with extensive evaluations showing improved boundary delineation and field-level precision.

Conclusion: Establishes potential for operational agricultural monitoring and historical trend analysis, laying foundation for critical studies including crop monitoring and yield estimation.

Abstract: Wheat accounts for approximately 20% of the world’s caloric intake, making it a vital component of global food security. Given this importance, mapping wheat fields plays a crucial role in enabling various stakeholders, including policy makers, researchers, and agricultural organizations, to make informed decisions regarding food security, supply chain management, and resource allocation. In this paper, we tackle the problem of accurately mapping wheat fields from satellite images by introducing an improved pipeline for winter wheat segmentation, as well as presenting a case study on a decade-long analysis of wheat mapping in Lebanon. We integrate a Temporal Spatial Vision Transformer (TSViT) with Parameter-Efficient Fine Tuning (PEFT) and a novel post-processing pipeline based on the Fields of The World (FTW) framework. Our proposed pipeline addresses key challenges encountered in existing approaches, such as the clustering of small agricultural parcels in a single large field. By merging wheat segmentation with precise field boundary extraction, our method produces geometrically coherent and semantically rich maps that enable us to perform in-depth analysis such as tracking crop rotation patterns over the years. Extensive evaluations demonstrate improved boundary delineation and field-level precision, establishing the potential of the proposed framework in operational agricultural monitoring and historical trend analysis. By allowing for accurate mapping of wheat fields, this work lays the foundation for a range of critical studies and future advances, including crop monitoring and yield estimation.

[186] DMS-Net:Dual-Modal Multi-Scale Siamese Network for Binocular Fundus Image Classification

Guohao Huo, Zibo Lin, Zitong Wang, Ruiting Dai, Hao Tang

Main category: cs.CV

TL;DR: DMS-Net is a dual-modal multi-scale siamese network for binocular retinal image classification that achieves state-of-the-art performance (82.9% accuracy) by leveraging bilateral eye correlations and advanced feature fusion modules.

DetailsMotivation: Traditional diagnostic methods and existing monocular image-based deep learning approaches often overlook the pathological correlations between the two eyes, while practical medical robotic diagnostic scenarios require paired retinal images as diagnostic evidence.

Method: The framework uses a weight-sharing siamese ResNet-152 to extract features from bilateral fundus images, with OSIM for multi-resolution feature aggregation, CASFM for cross-modal interaction, CCAM for differential semantic information, and CIAM for lesion-correlated semantic aggregation.

Result: Evaluation on ODIR-5K dataset shows DMS-Net achieves 82.9% accuracy, 84.5% recall, and 83.2% Cohen’s kappa coefficient, demonstrating state-of-the-art performance in detecting symmetrical pathologies.

Conclusion: DMS-Net showcases robust capacity in detecting symmetrical pathologies and improving clinical decision-making for ocular diseases, with code and processed dataset to be released.

Abstract: Ophthalmic diseases pose a significant global health burden. However, traditional diagnostic methods and existing monocular image-based deep learning approaches often overlook the pathological correlations between the two eyes. In practical medical robotic diagnostic scenarios, paired retinal images (binocular fundus images) are frequently required as diagnostic evidence. To address this, we propose DMS-Net, a dual-modal multi-scale siamese network for binocular retinal image classification. The framework employs a weight-sharing siamese ResNet-152 architecture to concurrently extract deep semantic features from bilateral fundus images. To tackle challenges like indistinct lesion boundaries and diffuse pathological distributions, we introduce the OmniPool Spatial Integrator Module (OSIM), which achieves multi-resolution feature aggregation through multi-scale adaptive pooling and spatial attention mechanisms. Furthermore, the Calibrated Analogous Semantic Fusion Module (CASFM) leverages spatial-semantic recalibration and bidirectional attention mechanisms to enhance cross-modal interaction, aggregating modality-agnostic representations of fundus structures. To fully exploit the differential semantic information of lesions present in bilateral fundus features, we introduce the Cross-Modal Contrastive Alignment Module (CCAM). Additionally, to enhance the aggregation of lesion-correlated semantic information, we introduce the Cross-Modal Integrative Alignment Module (CIAM). Evaluation on the ODIR-5K dataset demonstrates that DMS-Net achieves state-of-the-art performance with an accuracy of 82.9%, recall of 84.5%, and a Cohen’s kappa coefficient of 83.2%, showcasing robust capacity in detecting symmetrical pathologies and improving clinical decision-making for ocular diseases. Code and the processed dataset will be released subsequently.

[187] Comparative Analysis of Lightweight Deep Learning Models for Memory-Constrained Devices

Tasnim Shahriar

Main category: cs.CV

TL;DR: Evaluation of 5 lightweight deep learning models for image classification on resource-constrained devices, showing trade-offs between accuracy and efficiency with EfficientNetV2-S achieving highest accuracy and MobileNetV3 offering best balance.

DetailsMotivation: To address the need for deploying deep learning models in resource-constrained environments like low-memory devices and edge computing platforms where computational efficiency is critical.

Method: Benchmarked five state-of-the-art architectures (MobileNetV3 Small, ResNet18, SqueezeNet, EfficientNetV2-S, ShuffleNetV2) across three datasets (CIFAR-10, CIFAR-100, Tiny ImageNet) using four metrics: accuracy, inference time, FLOPs, and model size. Investigated hyperparameter tuning, data augmentation, and compared pretrained vs scratch-trained models.

Result: Transfer learning significantly enhances accuracy and computational efficiency, especially for complex datasets. EfficientNetV2-S achieved highest accuracy, MobileNetV3 offered best accuracy-efficiency balance, and SqueezeNet excelled in inference speed and compactness.

Conclusion: The study provides actionable insights for deploying lightweight models in real-world applications, highlighting critical trade-offs between accuracy and efficiency for optimizing deep learning systems in edge computing and mobile platforms.

Abstract: This paper presents a comprehensive evaluation of lightweight deep learning models for image classification, emphasizing their suitability for deployment in resource-constrained environments such as low-memory devices. Five state-of-the-art architectures - MobileNetV3 Small, ResNet18, SqueezeNet, EfficientNetV2-S, and ShuffleNetV2 - are benchmarked across three diverse datasets: CIFAR-10, CIFAR-100, and Tiny ImageNet. The models are assessed using four key performance metrics: classification accuracy, inference time, floating-point operations (FLOPs), and model size. Additionally, we investigate the impact of hyperparameter tuning, data augmentation, and training paradigms by comparing pretrained models with scratch-trained counterparts, focusing on MobileNetV3 Small. Our findings reveal that transfer learning significantly enhances model accuracy and computational efficiency, particularly for complex datasets like Tiny ImageNet. EfficientNetV2 consistently achieves the highest accuracy, while MobileNetV3 offers the best balance between accuracy and efficiency, and SqueezeNet excels in inference speed and compactness. This study highlights critical trade-offs between accuracy and efficiency, offering actionable insights for deploying lightweight models in real-world applications where computational resources are limited. By addressing these challenges, this research contributes to optimizing deep learning systems for edge computing and mobile platforms.
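
The measurement protocol is straightforward to reproduce. A minimal sketch using torchvision models, recording parameter count and CPU latency on a dummy batch; FLOP counting (e.g., with fvcore) and accuracy evaluation are omitted for brevity.

```python
import time
import torch
from torchvision import models

# The five benchmarked architectures, untrained here for illustration.
candidates = {
    "mobilenet_v3_small": models.mobilenet_v3_small(weights=None),
    "resnet18": models.resnet18(weights=None),
    "squeezenet1_1": models.squeezenet1_1(weights=None),
    "efficientnet_v2_s": models.efficientnet_v2_s(weights=None),
    "shufflenet_v2_x1_0": models.shufflenet_v2_x1_0(weights=None),
}

x = torch.randn(1, 3, 224, 224)
for name, model in candidates.items():
    model.eval()
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(10):
            model(x)
        ms = (time.perf_counter() - start) / 10 * 1e3
    print(f"{name}: {n_params:.1f}M params, {ms:.1f} ms/image (CPU)")
```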

[188] RealRep: Generalized SDR-to-HDR Conversion via Attribute-Disentangled Representation Learning

Gang He, Siqi Wang, Kepeng Xu, Lin Zhang, Li Xu, Weiran Wang, Yu-Wing Tai

Main category: cs.CV

TL;DR: A novel SDR-to-HDR conversion framework using attribute-disentangled representations and degradation-aware mapping for robust HDR reconstruction across diverse SDR content.

DetailsMotivation: Existing fixed tone mapping operators struggle with diverse appearances and degradations in real-world SDR content, requiring a more robust and adaptive solution.

Method: Proposes RealRep with luminance/chrominance disentanglement, negative exemplar generation for contrastive learning, and DDACMNet - a lightweight two-stage framework with degradation-conditioned adaptive mapping.

Result: Extensive experiments show consistent outperformance of state-of-the-art methods in generalization and perceptually faithful HDR color gamut reconstruction.

Conclusion: The proposed framework effectively addresses limitations of existing methods by learning attribute-disentangled representations and enabling robust adaptation across diverse SDR degradation domains.

Abstract: High-Dynamic-Range Wide-Color-Gamut (HDR-WCG) technology is becoming increasingly widespread, driving a growing need for converting Standard Dynamic Range (SDR) content to HDR. Existing methods primarily rely on fixed tone mapping operators, which struggle to handle the diverse appearances and degradations commonly present in real-world SDR content. To address this limitation, we propose a generalized SDR-to-HDR framework that enhances robustness by learning attribute-disentangled representations. Central to our approach is Realistic Attribute-Disentangled Representation Learning (RealRep), which explicitly disentangles luminance and chrominance components to capture intrinsic content variations across different SDR distributions. Furthermore, we design a Luma-/Chroma-aware negative exemplar generation strategy that constructs degradation-sensitive contrastive pairs, effectively modeling tone discrepancies across SDR styles. Building on these attribute-level priors, we introduce the Degradation-Domain Aware Controlled Mapping Network (DDACMNet), a lightweight, two-stage framework that performs adaptive hierarchical mapping guided by a control-aware normalization mechanism. DDACMNet dynamically modulates the mapping process via degradation-conditioned features, enabling robust adaptation across diverse degradation domains. Extensive experiments demonstrate that RealRep consistently outperforms state-of-the-art methods in both generalization and perceptually faithful HDR color gamut reconstruction.

[189] RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen, Chenxi Wang, Ningyu Zhang, Feng Zhang

Main category: cs.CV

TL;DR: Introduces RSCC dataset with 62,315 pre/post-disaster image pairs and detailed captions to enable better vision-language models for disaster monitoring in remote sensing.

DetailsMotivation: Existing remote sensing datasets lack temporal image pairs and detailed textual annotations, making it difficult to capture dynamic disaster impacts over time.

Method: Created the Remote Sensing Change Caption (RSCC) dataset - a large-scale benchmark with pre-/post-disaster image pairs across multiple disaster types, paired with human-like change captions.

Result: RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding, facilitating detailed disaster-related analysis.

Conclusion: The dataset bridges temporal and semantic gaps in remote sensing data, paving the way for more accurate, interpretable, and scalable vision-language applications in disaster monitoring.

Abstract: Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC’s ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.

[190] SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

Edoardo Bianchi, Antonio Liotta

Main category: cs.CV

TL;DR: SkillFormer is a parameter-efficient architecture for multi-view skill assessment from egocentric and exocentric videos, achieving state-of-the-art accuracy with significantly reduced computational costs.

DetailsMotivation: Assessing human skill levels in complex activities is challenging but has important applications in sports, rehabilitation, and training. Current methods need better multi-view integration and computational efficiency.

Method: Built on TimeSformer backbone with CrossViewFusion module using multi-head cross-attention, learnable gating, and adaptive self-calibration. Uses Low-Rank Adaptation for parameter-efficient fine-tuning.

Result: Achieves state-of-the-art accuracy on EgoExo4D dataset with 4.5x fewer parameters and 3.75x fewer training epochs than prior baselines. Excels in multiple structured tasks.

Conclusion: SkillFormer demonstrates the value of multi-view integration for fine-grained skill assessment while maintaining remarkable computational efficiency.

Abstract: Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. In fact, when evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.
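
The CrossViewFusion idea, multi-head cross-attention plus a learnable gate over egocentric/exocentric features, can be sketched as follows; dimensions and the sigmoid gating rule are illustrative assumptions, not the exact published module.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Toy fusion of ego and exo token sequences: ego queries attend to
    exo keys/values, then a learned gate blends the two streams."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, ego, exo):          # both: (B, T, D)
        # Egocentric tokens attend to the exocentric view.
        cross, _ = self.attn(query=ego, key=exo, value=exo)
        g = self.gate(torch.cat([ego, cross], dim=-1))
        return g * cross + (1 - g) * ego  # gated residual fusion

ego = torch.randn(2, 16, 768)
exo = torch.randn(2, 16, 768)
print(CrossViewFusion()(ego, exo).shape)  # torch.Size([2, 16, 768])
```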

[191] SAMba-UNet: SAM2-Mamba UNet for Cardiac MRI in Medical Robotic Perception

Guohao Huo, Ruiting Dai, Ling Shao, Hao Tang

Main category: cs.CV

TL;DR: SAMba-UNet combines SAM2, Mamba, and UNet for cardiac MRI segmentation, achieving state-of-the-art results with improved boundary precision through novel fusion modules.

DetailsMotivation: To address complex pathological feature extraction and domain shifts between natural images and medical scans in automated cardiac MRI segmentation.

Method: Dual-encoder architecture combining SAM2, Mamba, and UNet with Dynamic Feature Fusion Refiner and Heterogeneous Omni-Attention Convergence Module for cross-modal collaborative feature learning.

Result: Achieves Dice of 0.9103 and HD95 of 1.0859 mm on ACDC cardiac MRI benchmark, with notable improvements in boundary localization for challenging structures like right ventricle.

Conclusion: SAMba-UNet provides robust, high-fidelity segmentation maps applicable for medical and surgical robotic systems, with code to be open-sourced for clinical translation.

Abstract: To address complex pathological feature extraction in automated cardiac MRI segmentation, we propose SAMba-UNet, a novel dual-encoder architecture that synergistically combines the vision foundation model SAM2, the linear-complexity state-space model Mamba, and the classical UNet to achieve cross-modal collaborative feature learning. To overcome domain shifts between natural images and medical scans, we introduce a Dynamic Feature Fusion Refiner that employs multi-scale pooling and channel-spatial dual-path calibration to strengthen small-lesion and fine-structure representation. We further design a Heterogeneous Omni-Attention Convergence Module (HOACM) that fuses SAM2’s local positional semantics with Mamba’s long-range dependency modeling via global contextual attention and branch-selective emphasis, yielding substantial gains in both global consistency and boundary precision. On the ACDC cardiac MRI benchmark, SAMba-UNet attains a Dice of 0.9103 and an HD95 of 1.0859 mm, notably improving boundary localization for challenging structures like the right ventricle. Its robust, high-fidelity segmentation maps are directly applicable as a perception module within intelligent medical and surgical robotic systems to support preoperative planning, intraoperative navigation, and postoperative complication screening. The code will be open-sourced to facilitate clinical translation and further validation.

[192] Interleaving Reasoning for Better Text-to-Image Generation

Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin

Main category: cs.CV

TL;DR: IRG framework alternates between text reasoning and image generation to improve T2I models, achieving 5-10 point gains on multiple benchmarks through interleaved thinking and refinement stages.

DetailsMotivation: Current multimodal models have improved image generation but still lag behind tightly coupled systems like GPT-4o in instruction following and detail preservation. Recent advances in interleaving reasoning suggest this approach could enhance T2I generation.

Method: Introduces Interleaving Reasoning Generation (IRG) framework that alternates text-based thinking with image synthesis, followed by reflection and refinement. Uses IRGL training with two sub-goals: strengthening initial think-generate stage and enabling high-quality textual reflection with faithful implementation. Trained on IRGL-300K dataset with six decomposed learning modes.

Result: Achieves state-of-the-art performance with absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN benchmarks. Shows substantial improvements in visual quality and fine-grained fidelity.

Conclusion: Interleaving reasoning between text and image generation significantly improves T2I model performance, demonstrating the effectiveness of alternating thinking and synthesis stages for better instruction following and detail preservation.

Abstract: Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation.

[193] HueManity: Probing Fine-Grained Visual Perception in MLLMs

Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande

Main category: cs.CV

TL;DR: HueManity benchmark reveals significant visual perception limitations in MLLMs compared to humans and traditional CV models, with MLLMs achieving only 33.6% on easy tasks and 3% on hard tasks.

DetailsMotivation: Multimodal LLMs excel at high-level reasoning but perform poorly on nuanced perceptual tasks, creating a need to assess their visual perception capabilities.

Method: Created HueManity benchmark with 83,850 images featuring alphanumeric strings in Ishihara-style dot patterns, evaluated 9 state-of-the-art MLLMs against human performance and ResNet50 baselines.

Result: MLLMs showed severe performance deficit: best MLLM achieved 33.6% (easy) and 3% (hard) accuracy, while humans scored 100%/95.6% and ResNet50 achieved 96.5%/94.5%.

Conclusion: Current MLLMs have critical visual perception gaps, highlighting the need for improved architectural and training approaches. Dataset and code are open-sourced to advance MLLM perceptual robustness research.

Abstract: Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved 33.6% accuracy on the 'numeric easy' task and a striking 3% on the 'alphanumeric hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We open-source the HueManity dataset and code to foster further research in improving the perceptual robustness of MLLMs.
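
Stimuli of this kind are easy to generate: render the characters as a mask, then express them purely through dot color so that only hue discrimination reveals the string. A rough sketch with PIL; the colors, dot counts, and sizes are arbitrary choices, not the benchmark's exact recipe.

```python
import random
from PIL import Image, ImageDraw

def dot_pattern(text="A7", size=256, dot=6):
    # Render the characters on a small mask, then upscale it.
    small = Image.new("L", (64, 64), 0)
    ImageDraw.Draw(small).text((14, 26), text, fill=255)
    mask = small.resize((size, size))

    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for _ in range(3000):
        x, y = random.randrange(size), random.randrange(size)
        r = random.randint(dot // 2, dot)
        inside = mask.getpixel((x, y)) > 0
        # The figure/ground split is carried only by hue, never shape.
        color = (80, 150, 70) if inside else (160, 130, 70)
        draw.ellipse((x - r, y - r, x + r, y + r), fill=color)
    return img

dot_pattern().save("stimulus.png")
```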

[194] PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Edoardo Bianchi, Antonio Liotta

Main category: cs.CV

TL;DR: PATS is a novel temporal sampling method that preserves complete movement patterns for sports skill assessment, achieving state-of-the-art performance across multiple domains.

DetailsMotivation: Current video sampling methods disrupt temporal continuity essential for evaluating sports proficiency, making it difficult to capture fundamental movement patterns that distinguish expert from novice performance.

Method: Proficiency-Aware Temporal Sampling (PATS) adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating across multiple segments to maximize information coverage while maintaining temporal coherence.

Result: PATS surpasses state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) on EgoExo4D benchmark with SkillFormer, delivering substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball).

Conclusion: PATS successfully adapts to diverse activity characteristics and demonstrates effectiveness as an adaptive temporal sampling approach that advances automated skill assessment for real-world applications.

Abstract: Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses the state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.
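
The sampling idea can be sketched in a few lines: cut the video into contiguous segments long enough to contain a complete movement and sample all of them, rather than striding frames uniformly. The segment-length heuristic below is an assumption; the paper adapts it per activity.

```python
import numpy as np

def pats_segments(n_frames, movement_len, n_segments=4):
    """Return continuous frame-index clips that each cover at least one
    full movement, spread evenly across the video."""
    seg_len = max(movement_len, n_frames // n_segments)
    starts = np.linspace(0, n_frames - seg_len, n_segments, dtype=int)
    # Each clip is temporally coherent, unlike uniform frame striding.
    return [np.arange(s, s + seg_len) for s in starts]

for clip in pats_segments(n_frames=600, movement_len=120):
    print(clip[0], "to", clip[-1])
```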

[195] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations

Yutong Zhou, Masahiro Ryo

Main category: cs.CV

TL;DR: End-to-end visual-to-causal framework that transforms species images into interpretable causal insights about habitat preferences using AI and causal inference methods.

DetailsMotivation: Existing ecological workflows are fragmented and inaccessible to non-specialists, making it difficult to understand why species live in specific locations and conserve biodiversity.

Method: Integrates species recognition, global occurrence retrieval, pseudo-absence sampling, climate data extraction, causal structure discovery, and modern causal inference methods with LLM-generated explanations.

Result: Demonstrated on bee and flower species, showing potential for multimodal AI assistant to describe species habitat in human-understandable language.

Conclusion: The framework provides statistically grounded causal explanations for species habitat preferences, making ecological insights more accessible and interpretable for conservation efforts.

Abstract: Explaining why a species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of a multimodal AI assistant, backed by recommended ecological modeling practice, for describing species habitat in human-understandable language. Our code is available at: https://github.com/Yutong-Zhou-cv/BioX.

[196] SPACE-iT: Spatial-Aware Curriculum Exploration and Feedback-Driven Adaptive Augmentation for Vision Transformer Distillation

Jihyeon Seong, Hyunkyung Han

Main category: cs.CV

TL;DR: SPACE-iT is a spatial-aware knowledge distillation framework for Vision Transformers that uses confidence-based adaptive augmentation and reverse curriculum learning to improve performance without extra memory cost.

DetailsMotivation: Traditional knowledge distillation methods treat all image patches uniformly, ignoring spatial variations in learning difficulty, which limits their effectiveness for Vision Transformers.

Method: Computes spatial confidence scores at attention, patch, and logit levels; uses confidence map to dynamically modulate distillation loss and guide adaptive augmentation with reverse curriculum learning (hard to easy progression).

Result: Achieves superior performance over vanilla distillation by enabling more effective learning of complex spatial patterns.

Conclusion: SPACE-iT provides an effective spatial-aware distillation framework that addresses patch-level learning variations through feedback-driven adaptive augmentation and reverse curriculum strategies.

Abstract: Knowledge distillation (KD) has proven to be a powerful technique for improving the performance of Vision Transformers (ViTs). However, traditional KD methods often treat all image patches uniformly, overlooking spatial variations in learning difficulty. To address this limitation, we propose SPACE-iT, a novel framework for Spatial-Aware Curriculum Exploration via Feedback-Driven Adaptive Augmentation. At its core, SPACE-iT computes spatial confidence scores at the attention, patch, and logit levels. This confidence map supports a two-fold strategy: (1) dynamically modulating the distillation loss, and (2) guiding an adaptive augmentation module that intensifies reverse curriculum learning. By establishing a feedback-driven reverse curriculum that initially exposes students to challenging regions, progressing from hard to easy, SPACE-iT enables more effective learning of complex spatial patterns and achieves superior performance over vanilla distillation, without introducing additional memory overhead.
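
The confidence-modulated loss can be sketched as a per-patch weighting of the distillation KL term. The confidence definition (max teacher probability) and the weighting rule below are illustrative assumptions, not the published formulation.

```python
import torch
import torch.nn.functional as F

def spatial_kd_loss(student_logits, teacher_logits, tau=2.0):
    """Toy spatially weighted distillation. logits: (B, N_patches, C)."""
    t = F.softmax(teacher_logits / tau, dim=-1)
    s = F.log_softmax(student_logits / tau, dim=-1)
    per_patch = (t * (t.log() - s)).sum(-1)   # KL(teacher || student) per patch
    confidence = t.max(-1).values             # teacher certainty per patch
    weight = 1.0 + (1.0 - confidence)         # emphasize hard (low-conf) patches
    return (weight * per_patch).mean() * tau ** 2

s = torch.randn(2, 196, 100)  # student logits for 196 patches, 100 classes
t = torch.randn(2, 196, 100)  # teacher logits
print(spatial_kd_loss(s, t))
```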

[197] Grounding DINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Hamza Rasaee, Taha Koleilat, Hassan Rivaz

Main category: cs.CV

TL;DR: A prompt-driven vision-language model combining Grounding DINO and SAM2 achieves state-of-the-art ultrasound segmentation across multiple organs using 18 public datasets, with strong performance on both seen and unseen data without additional fine-tuning.

DetailsMotivation: Address the challenge of accurate and generalizable object segmentation in ultrasound imaging due to anatomical variability, diverse protocols, and limited annotated data.

Method: Integrated Grounding DINO with SAM2 using Low Rank Adaptation (LoRA) fine-tuning on 15 ultrasound datasets from multiple organs, with 3 held-out datasets for testing unseen distributions.

Result: Outperformed state-of-the-art methods (UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, SAMUS) on most seen datasets and maintained strong performance on unseen datasets without additional fine-tuning.

Conclusion: Demonstrates the promise of vision-language models for scalable and robust ultrasound image analysis, reducing dependence on large organ-specific annotated datasets.

Abstract: Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model2) to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. These datasets were divided into 15 for fine-tuning and validation of Grounding DINO using Low Rank Adaptation (LoRA) to the ultrasound domain, and 3 were held out entirely for testing to evaluate performance in unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs in scalable and robust ultrasound image analysis, reducing dependence on large, organ-specific annotated datasets. We will publish our code on code.sonography.ai after acceptance.

[198] Interpretable Text-Guided Image Clustering via Iterative Search

Bingchen Zhao, Oisin Mac Aodha

Main category: cs.CV

TL;DR: ITGC is a new text-guided image clustering method that uses iterative discovery with unsupervised objectives to generate interpretable visual concepts aligned with user instructions, outperforming existing methods.

DetailsMotivation: Traditional clustering is ill-posed with multiple valid partitioning options. Users may want different clustering criteria (e.g., shape vs color), requiring text guidance to specify intent and resolve ambiguity.

Method: Proposes ITGC - an iterative discovery process guided by unsupervised clustering objectives to generate interpretable visual concepts that capture user-specified criteria from natural language instructions.

Result: Superior performance compared to existing methods across various image clustering and fine-grained classification benchmarks.

Conclusion: Text-guided clustering with iterative discovery effectively addresses clustering ambiguity by aligning results with user intent through natural language instructions.

Abstract: Traditional clustering methods aim to group unlabeled data points based on their similarity to each other. However, clustering, in the absence of additional information, is an ill-posed problem as there may be many different, yet equally valid, ways to partition a dataset. Distinct users may want to use different criteria to form clusters in the same data, e.g., shape vs. color. Recently introduced text-guided image clustering methods aim to address this ambiguity by allowing users to specify the criteria of interest using natural language instructions. This instruction provides the necessary context and control needed to obtain clusters that are more aligned with the users’ intent. We propose a new text-guided clustering approach named ITGC that uses an iterative discovery process, guided by an unsupervised clustering objective, to generate interpretable visual concepts that better capture the criteria expressed in a user’s instructions. We report superior performance compared to existing methods across a wide variety of image clustering and fine-grained classification benchmarks.

[199] Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Main category: cs.CV

TL;DR: Survey paper on using LLMs and VLMs for crash detection from video feeds in transportation systems, covering fusion strategies, datasets, architectures, benchmarks, and challenges.

Motivation: Crash detection from video is critical for intelligent transportation systems, and recent advances in LLMs/VLMs offer new opportunities for multimodal video analysis.

Method: Structured survey approach with taxonomy of fusion strategies, analysis of model architectures, comparison of performance benchmarks, and review of key datasets.

Result: Comprehensive review of current methods leveraging LLMs for video crash detection, identifying trends and approaches in this emerging field.

Conclusion: Provides foundation for future research at intersection of video understanding and foundation models, highlighting ongoing challenges and opportunities.

Abstract: Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.

[200] Atomizer: Generalizing to new modalities by breaking satellite images down to a set of scalars

Hugo Riffaud de Turckheim, Sylvain Lobry, Roberto Interdonato, Diego Marcos

Main category: cs.CV

TL;DR: Atomizer is a flexible architecture that represents remote sensing images as sets of scalars with contextual metadata, enabling a single encoder to handle diverse satellite data configurations without retraining.

Motivation: Existing remote sensing models require fixed input formats and modality-specific encoders, limiting generalization across diverse satellite data with varying spatial, spectral, and temporal configurations.

Method: Represents images as sets of scalars (spectral band values) enriched with metadata (time, resolution, wavelength, bandwidth). Uses structured tokenization with Fourier features and radial basis functions, mapping tokens via cross-attention into latent space.

Result: Outperforms standard models in modality-disjoint evaluations and demonstrates robust performance across varying resolutions and spatial sizes without interpolation or resampling.

Conclusion: Atomizer provides a flexible, unified approach for processing arbitrary remote sensing modalities, overcoming limitations of traditional fixed-format models and enabling better generalization across diverse satellite data configurations.

Abstract: The growing number of Earth observation satellites has led to increasingly diverse remote sensing data, with varying spatial, spectral, and temporal configurations. Most existing models rely on fixed input formats and modality-specific encoders, which require retraining when new configurations are introduced, limiting their ability to generalize across modalities. We introduce Atomizer, a flexible architecture that represents remote sensing images as sets of scalars, each corresponding to a spectral band value of a pixel. Each scalar is enriched with contextual metadata (acquisition time, spatial resolution, wavelength, and bandwidth), producing an atomic representation that allows a single encoder to process arbitrary modalities without interpolation or resampling. Atomizer uses structured tokenization with Fourier features and non-uniform radial basis functions to encode content and context, and maps tokens into a latent space via cross-attention. Under modality-disjoint evaluations, Atomizer outperforms standard models and demonstrates robust performance across varying resolutions and spatial sizes.
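
To make the "set of scalars" idea concrete, here is a sketch of encoding one band value plus its metadata into a single token using Fourier features; the paper additionally uses non-uniform radial basis functions, which we omit, and the field names and frequency choices below are our assumptions:

```python
import numpy as np

def fourier_features(x, n_freqs=8):
    """Encode a scalar with sin/cos at log-spaced frequencies."""
    freqs = 2.0 ** np.arange(n_freqs)
    return np.concatenate([np.sin(freqs * x), np.cos(freqs * x)])

def atomize_pixel(value, wavelength_um, bandwidth_um, resolution_m, day_of_year):
    """One 'atom': a spectral band value plus contextual metadata,
    each encoded and concatenated into a single token."""
    parts = [
        fourier_features(value),
        fourier_features(np.log(wavelength_um)),
        fourier_features(bandwidth_um),
        fourier_features(np.log(resolution_m)),
        fourier_features(2 * np.pi * day_of_year / 365.0),
    ]
    return np.concatenate(parts)

token = atomize_pixel(value=0.23, wavelength_um=0.665, bandwidth_um=0.03,
                      resolution_m=10.0, day_of_year=152)
print(token.shape)  # one token per (pixel, band) scalar, fed to cross-attention
```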

[201] Conditional Video Generation for High-Efficiency Video Compression

Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: A video compression framework using conditional diffusion models that outperforms traditional and neural codecs on perceptual quality metrics, especially at high compression ratios.

Motivation: Leverage conditional diffusion models’ ability to reconstruct video content aligned with human visual perception for improved video compression.

Method: Reframe video compression as conditional generation with three key modules: multi-granular conditioning, compact representations, and multi-condition training with modality dropout and role-aware embeddings.

Result: Significantly outperforms both traditional and neural codecs on perceptual quality metrics (FVD and LPIPS), particularly under high compression ratios.

Conclusion: Conditional diffusion models provide an effective framework for perceptually optimized video compression, demonstrating superior performance compared to existing approaches.

Abstract: Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.
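
Of the three modules, modality dropout is the simplest to illustrate. A sketch under assumed condition names and drop rate (neither is specified in the abstract):

```python
import random
import torch

def modality_dropout(conds, p_drop=0.3):
    """Independently zero out each conditioning signal with probability
    p_drop so the generator never over-relies on a single modality."""
    return {name: torch.zeros_like(t) if random.random() < p_drop else t
            for name, t in conds.items()}

conds = {"keyframe": torch.randn(1, 3, 64, 64), "motion": torch.randn(1, 2, 64, 64)}
print({k: v.abs().sum().item() > 0 for k, v in modality_dropout(conds).items()})
```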

[202] DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome

Main category: cs.CV

TL;DR: DIP is an unsupervised post-training method that enhances dense image representations in pretrained vision encoders for in-context scene understanding using pseudo-tasks generated automatically from unlabeled data.

Motivation: To improve dense image representations in large-scale pretrained vision encoders for better in-context scene understanding without relying on complex self-distillation architectures or labeled data.

Method: Trains vision encoder using pseudo-tasks that simulate downstream in-context scenarios, generated automatically by combining a pretrained diffusion model and the vision encoder itself on unlabeled data.

Result: Achieves strong performance across various downstream in-context scene understanding tasks, outperforming both the initial vision encoder and prior methods, with computational efficiency (less than 9 hours on a single A100 GPU).

Conclusion: DIP provides a simple, unsupervised, and computationally efficient solution for enhancing dense representations in vision encoders, offering practical improvements for in-context scene understanding applications.

Abstract: We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP

[203] HieraRS: A Hierarchical Segmentation Paradigm for Remote Sensing Enabling Multi-Granularity Interpretation and Cross-Domain Transfer

Tianlong Ai, Tianzhu Liu, Haochen Jiang, Yanfeng Gu

Main category: cs.CV

TL;DR: HieraRS is a hierarchical interpretation paradigm for remote sensing imagery that enables multi-granularity predictions and cross-domain transfer to tasks with heterogeneous hierarchies, addressing limitations of flat classification approaches.

Motivation: Existing deep learning methods for land cover/land use classification use flat classification paradigms that cannot generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies, and lack capability for cross-domain transfer to tasks with heterogeneous hierarchies.

Method: Proposes HieraRS with Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM) integrated into flat classification models, and TransLU dual-branch cross-domain transfer framework with Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). Also constructs MM-5B multi-modal hierarchical dataset.

Result: The approach generates hierarchical predictions while improving semantic consistency and classification accuracy, supports dynamic category expansion, and facilitates effective adaptation to heterogeneous hierarchies.

Conclusion: HieraRS addresses key limitations in hierarchical land cover classification by enabling multi-granularity predictions and cross-domain transfer capabilities, with practical applications enhanced by the new MM-5B dataset.

Abstract: Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: https://github.com/AI-Tianlong/HieraRS.
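
As a rough illustration of a hierarchical consistency constraint (our reading of the idea, not the exact BHCCM formulation), the snippet below penalizes disagreement between the predicted coarse distribution and the fine distribution aggregated over each parent's children:

```python
import torch
import torch.nn.functional as F

# Toy two-level hierarchy: coarse class c is the parent of children[c].
children = {0: [0, 1], 1: [2, 3, 4]}

def hierarchical_consistency_loss(coarse_logits, fine_logits):
    fine_p = F.softmax(fine_logits, dim=-1)
    # Aggregate fine probabilities under each parent; columns follow coarse order.
    agg = torch.stack([fine_p[..., idx].sum(-1) for idx in children.values()], dim=-1)
    return F.kl_div(F.log_softmax(coarse_logits, dim=-1), agg, reduction="batchmean")

coarse = torch.randn(4, 2)   # 4 pixels, 2 coarse classes
fine = torch.randn(4, 5)     # 5 fine classes
print(hierarchical_consistency_loss(coarse, fine))
```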

[204] $π^3$: Permutation-Equivariant Visual Geometry Learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He

Main category: cs.CV

TL;DR: π³ is a feed-forward neural network for visual geometry reconstruction that eliminates the need for fixed reference views, using permutation-equivariant architecture for camera pose estimation and point mapping without reference frames.

Motivation: Previous methods rely on fixed reference views which can cause instability and failures when the reference is suboptimal. The authors aim to create a more robust and accurate approach without this inductive bias.

Method: Uses a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames, making the model robust to input ordering.

Result: Achieves state-of-the-art performance on camera pose estimation, monocular/video depth estimation, and dense point map reconstruction with higher accuracy and robustness.

Conclusion: The simple and bias-free approach of π³ demonstrates superior performance across multiple geometry reconstruction tasks, offering a more stable alternative to reference-view-dependent methods.

Abstract: We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.
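
Permutation equivariance is easy to check empirically: attention without positional encodings has no privileged reference view, so shuffling the input views merely shuffles the outputs. A self-contained check with illustrative dimensions (not the paper's architecture):

```python
import torch

attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
attn.eval()

x = torch.randn(1, 5, 16)        # 5 "views", no positional encoding
perm = torch.randperm(5)

with torch.no_grad():
    y, _ = attn(x, x, x)
    y_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Permuting the inputs permutes the outputs identically.
print(torch.allclose(y[:, perm], y_perm, atol=1e-5))  # True
```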

[205] AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

Christopher F. Brown, Michal R. Kazmierski, Valerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Lydia Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, Pushmeet Kohli

Main category: cs.CV

TL;DR: AlphaEarth Foundations is a geospatial embedding model that processes Earth observation data to create general-purpose representations, outperforming other featurization methods without retraining and enabling efficient map production.

Motivation: Earth observation data is abundant but high-quality labels are scarce due to the effort required for physical measurements, creating a need for better modeling approaches to translate sparse labels into useful maps.

Method: The paper introduces AlphaEarth Foundations, an embedding field model that assimilates spatial, temporal, and measurement contexts across multiple data sources to create general geospatial representations.

Result: The embeddings generated consistently outperform other well-known featurization approaches on diverse mapping evaluations without requiring retraining. A global dataset of annual embedding field layers from 2017-2024 has been released.

Conclusion: AlphaEarth Foundations provides a highly general and effective geospatial representation that enables accurate and efficient production of maps and monitoring systems from local to global scales.

Abstract: Unprecedented volumes of Earth observation data are continually collected around the world, but high-quality labels remain scarce given the effort required to make physical measurements and observations. This has led to considerable investment in bespoke modeling efforts translating sparse labels into maps. Here we introduce AlphaEarth Foundations, an embedding field model yielding a highly general, geospatial representation that assimilates spatial, temporal, and measurement contexts across multiple sources, enabling accurate and efficient production of maps and monitoring systems from local to global scales. The embeddings generated by AlphaEarth Foundations are the only ones to consistently outperform a suite of other well-known, widely accepted featurization approaches tested on a diverse set of mapping evaluations without re-training. We have released a dataset of global, annual, analysis-ready embedding field layers from 2017 through 2024.

[206] LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences

Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, Wei Tsang Ooi

Main category: cs.CV

TL;DR: LiDARCrafter is a unified framework for 4D LiDAR generation and editing using natural language inputs, achieving state-of-the-art performance in fidelity, controllability, and temporal consistency.

Motivation: Existing generative world models focus on videos or occupancy grids but overlook unique LiDAR properties, creating challenges in controllability, temporal coherence, and evaluation standardization for 4D LiDAR generation.

Method: The framework parses natural language instructions into ego-centric scene graphs, then uses a tri-branch diffusion network to generate object structures, motion trajectories, and geometry, with an autoregressive module for temporal coherence.

Result: Experiments on nuScenes dataset demonstrate state-of-the-art performance across fidelity, controllability, and temporal consistency at scene-, object-, and sequence-level metrics using the established comprehensive benchmark.

Conclusion: LiDARCrafter paves the way for data augmentation and simulation in autonomous driving by providing a unified framework for controllable 4D LiDAR generation with standardized evaluation, with code and benchmark released to the community.

Abstract: Generative world models have become essential data engines for autonomous driving, yet most existing efforts focus on videos or occupancy grids, overlooking the unique LiDAR properties. Extending LiDAR generation to dynamic 4D world modeling presents challenges in controllability, temporal coherence, and evaluation standardization. To this end, we present LiDARCrafter, a unified framework for 4D LiDAR generation and editing. Given free-form natural language inputs, we parse instructions into ego-centric scene graphs, which condition a tri-branch diffusion network to generate object structures, motion trajectories, and geometry. These structured conditions enable diverse and fine-grained scene editing. Additionally, an autoregressive module generates temporally coherent 4D LiDAR sequences with smooth transitions. To support standardized evaluation, we establish a comprehensive benchmark with diverse metrics spanning scene-, object-, and sequence-level aspects. Experiments on the nuScenes dataset using this benchmark demonstrate that LiDARCrafter achieves state-of-the-art performance in fidelity, controllability, and temporal consistency across all levels, paving the way for data augmentation and simulation. The code and benchmark are released to the community.
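
An ego-centric scene graph parsed from an instruction might look like the following illustrative schema (ours, not the paper's); each node would condition the structure, trajectory, and geometry branches:

```python
scene_graph = {
    "ego": {"speed_mps": 8.0},
    "nodes": [
        {"id": "car_0", "class": "car", "position": [12.0, -3.5, 0.0],
         "size": [4.5, 1.9, 1.6], "motion": "overtakes ego on the left"},
        {"id": "ped_0", "class": "pedestrian", "position": [20.0, 4.0, 0.0],
         "size": [0.6, 0.6, 1.7], "motion": "crosses left to right"},
    ],
    "edges": [("car_0", "left_of", "ego"), ("ped_0", "in_front_of", "ego")],
}
```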

[207] A Novel Image Similarity Metric for Scene Composition Structure

Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee

Main category: cs.CV

TL;DR: SCSSIM is a novel training-free metric that evaluates Scene Composition Structure preservation in generative AI images using cuboidal partitioning and statistical measures, outperforming traditional metrics in structural fidelity assessment.

Motivation: Traditional image similarity metrics fail to adequately assess Scene Composition Structure (SCS) - the geometric relationships among objects and background. Pixel-level metrics are noise-sensitive, perception-based metrics prioritize aesthetics, and neural metrics have training overheads and generalization issues.

Method: SCSSIM uses analytical, training-free approach based on cuboidal hierarchical partitioning of images. It employs statistical measures to quantify SCS preservation by capturing non-object-based structural relationships robustly.

Result: SCSSIM shows high invariance to non-compositional distortions while demonstrating strong monotonic decrease for compositional distortions. It accurately reflects unchanged SCS and precisely indicates when SCS has been altered, outperforming existing metrics.

Conclusion: SCSSIM provides superior structural evaluation properties for generative models, making it an invaluable tool for ensuring scene composition integrity in AI-generated images without requiring training or introducing generalization issues.

Abstract: The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image’s underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM’s high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.
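
The published SCSSIM formula is not reproduced here, but a toy proxy conveys the flavor: recursively partition the image, record per-block statistics as a layout signature, and compare signatures. The splitting rule and scoring below are our assumptions:

```python
import numpy as np

def split_stats(img, depth=3):
    """Recursively halve the image along its longer side, recording each
    block's mean as a coarse signature of scene layout."""
    stats = []
    def rec(block, d):
        stats.append(block.mean())
        if d == 0 or min(block.shape) < 2:
            return
        axis = 0 if block.shape[0] >= block.shape[1] else 1
        a, b = np.split(block, [block.shape[axis] // 2], axis=axis)
        rec(a, d - 1); rec(b, d - 1)
    rec(img, depth)
    return np.array(stats)

def scs_similarity(img1, img2):
    """1.0 for identical block statistics; drops as layout diverges."""
    s1, s2 = split_stats(img1), split_stats(img2)
    return 1.0 / (1.0 + np.abs(s1 - s2).mean())

rng = np.random.default_rng(0)
img = rng.random((64, 64))
print(scs_similarity(img, img))                      # 1.0: composition unchanged
print(scs_similarity(img, np.roll(img, 8, axis=1)))  # lower: composition shifted
```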

[208] Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection

Chengjun Zhang, Yuhao Zhang, Jie Yang, Mohamad Sawan

Main category: cs.CV

TL;DR: Proposed delay-spike approach and temporal-dependent Integrate-and-Fire (tdIF) neuron for SNNs to improve visual detection tasks with ultra-low latency (5 time-steps), achieving state-of-the-art performance.

Motivation: Current ANN-SNN conversion methods perform well in classification tasks but show suboptimal results in visual detection tasks due to residual membrane potential issues from heterogeneous spiking patterns.

Method: Introduces delay-spike approach to mitigate residual membrane potential and proposes tdIF neuron that dynamically adjusts accumulation/firing behaviors based on temporal order of time-steps, enabling distinct temporal properties beyond frequency-based representations.

Result: Achieves more precise feature representation with lower time-steps, enabling high performance and ultra-low latency in visual detection tasks. Surpasses current ANN-SNN conversion approaches in object detection and lane line detection with state-of-the-art performance within 5 time-steps.

Conclusion: The tdIF neuron maintains energy consumption comparable to traditional IF neurons while significantly improving performance in visual detection tasks with ultra-low latency, making SNNs more effective for real-time vision applications.

Abstract: Spiking Neural Networks (SNNs), inspired by the brain, are characterized by minimal power consumption and swift inference capabilities on neuromorphic hardware, and have been widely applied to various visual perception tasks. Current ANN-SNN conversion methods have achieved excellent results in classification tasks with ultra-low time-steps, but their performance in visual detection tasks remains suboptimal. In this paper, we propose a delay-spike approach to mitigate the issue of residual membrane potential caused by heterogeneous spiking patterns. Furthermore, we propose a novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture for SNNs. This enables Integrate-and-Fire (IF) neurons to dynamically adjust their accumulation and firing behaviors based on the temporal order of time-steps. Our method enables spikes to exhibit distinct temporal properties, rather than relying solely on frequency-based representations. Moreover, the tdIF neuron maintains energy consumption on par with the traditional IF neuron. We demonstrate that our method achieves more precise feature representation with lower time-steps, enabling high performance and ultra-low latency in visual detection tasks. In this study, we conduct an extensive evaluation of the tdIF method across two critical vision tasks: object detection and lane line detection. The results demonstrate that the proposed method surpasses current ANN-SNN conversion approaches, achieving state-of-the-art performance with ultra-low latency (within 5 time-steps).
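
A toy analogue of temporal-dependent integrate-and-fire dynamics, in which the firing threshold decays with the time-step so that spike timing, not just spike rate, carries value. This is our illustration of the general idea, not the paper's exact update rule:

```python
def tdif_neuron(inputs, v_th=1.0):
    """Integrate inputs; a spike at step t is weighted by 2**-t."""
    v, spikes = 0.0, []
    for t, x in enumerate(inputs):
        v += x                                     # accumulate
        thr = v_th * 2.0 ** (-t)                   # step-dependent threshold
        s = 1 if v >= thr else 0
        v -= s * thr                               # soft reset by subtraction
        spikes.append(s)
    # Decoding weights early spikes more heavily.
    return sum(s * 2.0 ** (-t) for t, s in enumerate(spikes))

print(tdif_neuron([0.4, 0.3, 0.2, 0.1, 0.05]))
```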

[209] GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images

Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong

Main category: cs.CV

TL;DR: Proposed GCRPNet, a graph-enhanced network based on Mamba architecture for salient object detection in remote sensing images, achieving state-of-the-art performance by effectively integrating global and local features.

Motivation: Existing ViT and CNN-based methods struggle to effectively integrate heterogeneous global and local features for salient object detection in optical remote sensing images, which face challenges like significant scale variations and low target-background contrast.

Method: GCRPNet uses VSS encoder for multi-scale feature extraction, DS-HGAM module for cross-layer interaction and structural perception, and LEVSS decoder with adaptive scanning strategy and MCAEM for enhanced local modeling and rich region information capture.

Result: Extensive experiments demonstrate that the proposed model achieves state-of-the-art performance in salient object detection for optical remote sensing images.

Conclusion: GCRPNet effectively overcomes limitations of existing methods by simultaneously capturing long-range dependencies and enhancing regional feature representation, validating its effectiveness and superiority for SOD in ORSIs.

Abstract: Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformer (ViT) and convolutional neural network (CNN) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose a graph-enhanced contextual and regional perception network (GCRPNet), which builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space (VSS) encoder to extract multi-scale features. To further achieve deep guidance and enhancement of these features, we first design a difference-similarity guided hierarchical graph attention module (DS-HGAM). This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model’s structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the LEVSS block as the decoder of GCRPNet. This module integrates our proposed adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information and enhancing Mamba’s local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.

[210] POEv2: a flexible and robust framework for generic line segment detection and wireframe line segment detection

Chenguang Liu, Chisheng Wang, Yuhua Cai, Chuanhua Zhu, Qingquan Li

Main category: cs.CV

TL;DR: POEv2 is a robust line segment detection framework that works for both generic and wireframe detection, achieving state-of-the-art performance when combined with efficient edge detectors.

Motivation: Existing line segment detectors are specialized for either generic detection (all meaningful segments) or wireframe detection (geometrically meaningful segments with large spatial support), but none work well for both tasks simultaneously.

Method: Improved version of Pixel Orientation Estimation (POE) method that detects line segments from edge strength maps and can be combined with any edge detector.

Result: Achieves state-of-the-art performance on three publicly available datasets when combined with an efficient edge detector.

Conclusion: POEv2 provides a unified framework that effectively handles both generic and wireframe line segment detection tasks, overcoming the limitations of specialized detectors.

Abstract: Line segment detection in images has been studied for several decades. Existing line segment detectors can be roughly divided into two categories: generic line segment detectors and wireframe line segment detectors. Generic line segment detectors aim to detect all meaningful line segments in images, and traditional approaches usually fall into this category. Recent deep learning based approaches are mostly wireframe line segment detectors. They detect only line segments that are geometrically meaningful and have large spatial support. Due to the difference in design aims, the performance of generic line segment detectors on the task of wireframe line segment detection will not be satisfactory, and vice versa. In this work, we propose a robust framework that can be used for both generic line segment detection and wireframe line segment detection. The proposed method is an improved version of the Pixel Orientation Estimation (POE) method; it is thus named POEv2. POEv2 detects line segments from edge strength maps, and can be combined with any edge detector. We show in our experiments that by combining the proposed POEv2 with an efficient edge detector, it achieves state-of-the-art performance on three publicly available datasets.

[211] BuzzSet v1.0: A Dataset for Pollinator Detection in Field Conditions

Ahmed Emam, Mohamed Elbassiouny, Julius Miller, Patrick Donworth, Sabine Seidel, Ribana Roscher

Main category: cs.CV

TL;DR: BuzzSet v1.0 is a large-scale dataset of high-resolution pollinator images for automated insect monitoring in agricultural environments, containing 7,856 verified images with over 8,000 annotated instances across honeybees, bumblebees, and unidentified insects.

Motivation: Pollinator insects are vital to global food production but their populations are declining. Scalable automated monitoring remains challenging due to difficulties detecting small, fast-moving, and camouflaged insects in field conditions.

Method: Created BuzzSet dataset with manually verified images collected under real field conditions. Used YOLOv12 model for initial annotations refined through human verification. Images preprocessed into 256x256 tiles. Provided baselines using RF-DETR transformer-based object detector.

Result: Strong classification accuracy with F1 scores of 0.94 for honeybees and 0.92 for bumblebees, with minimal confusion between categories. Overall mAP at 0.50 of 0.559 demonstrates the dataset’s challenging nature. The unidentified class remains difficult due to label ambiguity.

Conclusion: BuzzSet establishes a benchmark for ecological computer vision, highlighting the primary challenge of detecting camouflaged insects in natural vegetation as an open problem for future research. Future work focuses on expanding to version 2.0 with additional annotations.

Abstract: Pollinator insects such as honeybees and bumblebees are vital to global food production and ecosystem stability, yet their populations are declining due to anthropogenic and environmental stressors. Scalable, automated monitoring in agricultural environments remains an open challenge due to the difficulty of detecting small, fast-moving, and often camouflaged insects. To address this, we present BuzzSet v1.0, a large-scale dataset of high-resolution pollinator images collected under real field conditions. BuzzSet contains 7,856 manually verified images with more than 8,000 annotated instances across three classes: honeybees, bumblebees, and unidentified insects. Initial annotations were produced using a YOLOv12 model trained on external data and refined through human verification with open-source tools. All images were preprocessed into 256 x 256 tiles to improve the detection of small insects. We provide baselines using the RF-DETR transformer-based object detector. The model achieves strong classification accuracy with F1 scores of 0.94 and 0.92 for honeybees and bumblebees, with minimal confusion between these categories. The unidentified class remains more difficult due to label ambiguity and fewer samples, yet still contributes insights for robustness evaluation. Overall detection performance (mAP at 0.50 of 0.559) illustrates the challenging nature of the dataset and its potential to drive advances in small object detection under realistic ecological conditions. Future work focuses on expanding the dataset to version 2.0 with additional annotations and evaluating further detection strategies. BuzzSet establishes a benchmark for ecological computer vision, with the primary challenge being reliable detection of insects frequently camouflaged within natural vegetation, highlighting an open problem for future research.
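
The tiling preprocessing is straightforward to reproduce; a sketch with simplified boundary handling (images whose sides are not multiples of 256 would need padding, which we skip):

```python
import numpy as np

def tile_image(img, tile=256):
    """Split an image into non-overlapping tile x tile patches."""
    h, w = img.shape[:2]
    return [img[y:y + tile, x:x + tile]
            for y in range(0, h - tile + 1, tile)
            for x in range(0, w - tile + 1, tile)]

print(len(tile_image(np.zeros((1024, 1536, 3)))))  # 4 x 6 = 24 tiles
```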

[212] C-DiffDet+: Fusing Global Scene Context with Generative Denoising for High-Fidelity Object Detection

Abdellah Zakaria Sellam, Ilyes Benaissa, Salah Eddine Bekhouche, Abdenour Hadid, Vito Renó, Cosimo Distante

Main category: cs.CV

TL;DR: Introduces Context-Aware Fusion (CAF) to improve fine-grained object detection by integrating global scene context with local features using cross-attention mechanisms, outperforming state-of-the-art on vehicle damage assessment tasks.

Motivation: Fine-grained object detection in challenging domains like vehicle damage assessment is difficult even for humans. While DiffusionDet advanced the field, its performance is limited by local feature conditioning in context-dependent scenarios.

Method: Proposed Context-Aware Fusion (CAF) that uses cross-attention mechanisms to integrate global scene context (captured by a separate dedicated encoder) with local proposal features, enabling each object proposal to attend to comprehensive environmental information.

Result: Experimental results show improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.

Conclusion: The framework significantly enhances the generative detection paradigm by enabling object proposals to leverage comprehensive environmental context, addressing fundamental limitations of previous approaches.

Abstract: Fine-grained object detection in challenging visual domains, such as vehicle damage assessment, presents a formidable challenge even for human experts to resolve reliably. While DiffusionDet has advanced the state-of-the-art through conditional denoising diffusion, its performance remains limited by local feature conditioning in context-dependent scenarios. We address this fundamental limitation by introducing Context-Aware Fusion (CAF), which leverages cross-attention mechanisms to integrate global scene context with local proposal features directly. The global context is generated using a separate dedicated encoder that captures comprehensive environmental information, enabling each object proposal to attend to scene-level understanding. This design significantly enhances the generative detection paradigm. Experimental results demonstrate an improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.
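
The fusion idea reduces to proposal features cross-attending to a global scene-context sequence; a minimal sketch with illustrative dimensions (the residual combination is our assumption):

```python
import torch

ctx_attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

proposals = torch.randn(1, 100, 256)   # 100 object proposals (queries)
scene_ctx = torch.randn(1, 49, 256)    # e.g. 7x7 global context tokens (keys/values)

fused, _ = ctx_attn(query=proposals, key=scene_ctx, value=scene_ctx)
proposals = proposals + fused          # residual fusion before the detection head
```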

[213] Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision

Raehyuk Jung, Seungjun Yu, Hyunjung Shim

Main category: cs.CV

TL;DR: The paper proposes a benchmark to evaluate projection layer generalization in Vision-Language Models, finding it retains 79-88% performance on unseen concepts and functions like a key-value memory.

Motivation: To systematically evaluate the generalization capability of projection layers in VLMs for unseen visual concepts, which hasn't been thoroughly studied despite their importance.

Method: Adapt object detection datasets into prompting format with disjoint train/test splits, design controlled experiments to separate seen/unseen concepts, and use mechanistic interpretability analysis.

Result: Projection layers retain 79-88% performance on unseen classes compared to seen ones, showing non-trivial generalization without explicit alignment supervision.

Conclusion: The projection layer functions like a key-value memory with good generalization, enabling efficient VLM training with limited aligned data through the proposed evaluation framework.

Abstract: Vision-Language Models (VLMs) combine a vision encoder and a large language model (LLM) through alignment training, showing strong performance on multimodal tasks. A central component in this architecture is the projection layer, which maps visual features into the LLM’s embedding space. Despite its importance, its ability to generalize to unseen visual concepts has not been systematically evaluated. To address this, we propose a benchmark for evaluating projection-layer generalization. We adapt object detection datasets (rich in fine-grained annotations) into a prompting format and design train/test splits with disjoint label sets, enabling precise control over seen and unseen concept separation. Experimental results show that the projection layer retains about 79 to 88 percent of the performance on unseen classes compared to seen ones across various settings, suggesting a non-trivial level of generalization even without explicit alignment supervision on those concepts. We further analyze this behavior through a mechanistic interpretability lens. Our findings indicate that the feed-forward network in the projection layer functions like a key-value memory, processing seen and unseen tokens in similar ways. This study introduces a new evaluation framework for alignment generalization and highlights the potential for efficient VLM training with limited aligned data.
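
The key-value memory reading of a feed-forward layer (in the spirit of the paper's interpretability analysis) fits in a few lines; the sizes are illustrative:

```python
import torch

d_vis, d_mem, d_llm = 64, 256, 128     # illustrative dimensions

W_k = torch.randn(d_mem, d_vis)        # rows act as keys over visual tokens
W_v = torch.randn(d_mem, d_llm)        # rows are values in LLM embedding space

def ffn_as_memory(x):
    scores = torch.relu(W_k @ x)       # how strongly each key matches the token
    return W_v.T @ scores              # weighted sum of values

print(ffn_as_memory(torch.randn(d_vis)).shape)  # torch.Size([128])
```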

[214] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang

Main category: cs.CV

TL;DR: BranchGRPO reduces computational costs and improves training stability for image/video generative models by introducing branch sampling, tree-based advantage estimation, and pruning strategies, achieving 16% better alignment with 50% less training time.

Motivation: Current GRPO methods for aligning generative models with human preferences suffer from high computational costs due to on-policy rollouts and excessive SDE sampling steps, as well as training instability from sparse rewards.

Method: Proposes BranchGRPO with three key components: 1) branch sampling policy that updates SDE sampling process, 2) tree-based advantage estimator with dense process-level rewards, and 3) pruning strategies that eliminate low-reward paths and redundant depths while sharing computation across common prefixes.

Result: Experiments show BranchGRPO improves alignment scores by 16% over strong baselines while reducing training time by 50%. The method maintains or improves exploration diversity while substantially lowering per-update compute costs.

Conclusion: BranchGRPO effectively addresses computational efficiency and training stability issues in preference alignment for generative models, achieving significant performance gains with reduced computational requirements.

Abstract: Recent advancements in aligning image and video generative models via GRPO have achieved remarkable gains in enhancing human preference alignment. However, these methods still face high computational costs from on-policy rollouts and excessive SDE sampling steps, as well as training instability due to sparse rewards. In this paper, we propose BranchGRPO, a novel method that introduces a branch sampling policy into the SDE sampling process. By sharing computation across common prefixes and pruning low-reward paths and redundant depths, BranchGRPO substantially lowers the per-update compute cost while maintaining or improving exploration diversity. This work makes three main contributions: (1) a branch sampling scheme that reduces rollout and training cost; (2) a tree-based advantage estimator incorporating dense process-level rewards; and (3) pruning strategies exploiting path and depth redundancy to accelerate convergence and boost performance. Experiments on image and video preference alignment show that BranchGRPO improves alignment scores by 16% over strong baselines, while cutting training time by 50%.
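
A toy sketch of a tree-based advantage: rollouts share a prefix and branch at chosen denoising steps; a node's value is the mean reward of the leaves beneath it, and a branch's advantage is its value minus its siblings' mean. This is our reading of the idea, not the paper's exact estimator:

```python
def node_value(node):
    if "reward" in node:                          # leaf: terminal reward
        return node["reward"]
    return sum(node_value(c) for c in node["children"]) / len(node["children"])

def branch_advantages(node):
    advs = {}
    if "children" in node:
        vals = [node_value(c) for c in node["children"]]
        mean = sum(vals) / len(vals)
        for c, v in zip(node["children"], vals):
            advs[c["id"]] = v - mean              # dense, process-level signal
            advs.update(branch_advantages(c))
    return advs

tree = {"id": "root", "children": [
    {"id": "a", "children": [{"id": "a0", "reward": 0.9}, {"id": "a1", "reward": 0.7}]},
    {"id": "b", "children": [{"id": "b0", "reward": 0.2}, {"id": "b1", "reward": 0.4}]},
]}
print(branch_advantages(tree))
```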

[215] Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors

Xiangchen Wang, Jinrui Zhang, Teng Wang, Haigang Zhang, Feng Zheng

Main category: cs.CV

TL;DR: LangDC is a language-aware dynamic token compressor that uses a lightweight language model to convert video clips into soft caption tokens, dynamically adjusting compression ratios based on semantic density to reduce computation while maintaining performance.

Motivation: Existing video token compression methods use fixed ratios, ignoring semantic density variations, leading to inadequate representation of information-rich clips and wasted computation on static content.

Method: Uses lightweight language model to generate soft caption tokens as visual representations, with semantic density-aware supervision to dynamically adjust compression based on scene richness (description length).

Result: Reduces FLOPs by 49% compared to VideoGPT+ while maintaining competitive performance, with adaptive token compression based on video segment richness.

Conclusion: LangDC successfully mimics human dynamic expression patterns, providing efficient and adaptive token compression that maintains video understanding performance while significantly reducing computational costs.

Abstract: Recent advancements in large video-language models have revolutionized video understanding tasks. However, their efficiency is significantly constrained by processing high volumes of visual tokens. Existing token compression strategies apply a fixed compression ratio, ignoring the variability in semantic density among different video clips. Consequently, this leads to inadequate representation of information-rich clips due to insufficient tokens and unnecessary computation on static or content-poor ones. To address this, we propose LangDC, a Language-aware Dynamic Token Compressor. LangDC leverages a lightweight language model to describe video clips, converting them into soft caption tokens as visual representations. Trained with our proposed semantic density-aware supervision, LangDC aims to 1) cover key visual cues necessary for downstream task reasoning and 2) dynamically adjust compression ratios based on scene richness, reflected by description length. Our design mimics how humans dynamically express what they see: complex scenes (seeing more) elicit more detailed language to convey nuances (saying more), whereas simpler scenes are described with fewer words. Experimental results show that our method reduces FLOPs by 49% compared to VideoGPT+ while maintaining competitive performance. Furthermore, qualitative results demonstrate our approach adaptively adjusts the token compression ratio based on video segment richness.
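
The density-aware budget is easy to picture: a lightweight captioner describes each clip, and the number of visual tokens kept scales with the description length. The budget function below is our assumption, not the paper's schedule:

```python
def tokens_for_clip(caption, min_tokens=4, max_tokens=64, words_per_token=2):
    """Richer scenes -> longer captions -> more visual tokens kept."""
    n_words = len(caption.split())
    return max(min_tokens, min(max_tokens, n_words // words_per_token))

print(tokens_for_clip("a static shot of an empty hallway"))            # few tokens
print(tokens_for_clip("a crowded intersection where cyclists weave "
                      "between cars while pedestrians cross against "
                      "the light and a bus turns left"))               # more tokens
```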

[216] Aesthetic Image Captioning with Saliency Enhanced MLLMs

Yilin Tao, Jiashui Huang, Huaze Xu, Ling Shao

Main category: cs.CV

TL;DR: ASE-MLLM is a novel framework that integrates aesthetic saliency features into multimodal large language models for improved aesthetic image captioning, achieving state-of-the-art performance.

Motivation: Existing AIC works using MLLMs don't specifically adapt to focus on aesthetic content and primarily rely on fine-tuning without explicit aesthetic saliency integration.

Method: Proposes ASE-MLLM with Image Aesthetic Saliency Module (IASM) to extract aesthetic features and IAS-ViT encoder that fuses aesthetic saliency with original image features via cross-attention.

Result: Significantly outperforms traditional methods and generic MLLMs on mainstream AIC benchmarks, achieving state-of-the-art performance.

Conclusion: ASE-MLLM is the first framework to successfully integrate image aesthetic saliency into MLLMs specifically for AIC tasks, demonstrating superior performance over existing approaches.

Abstract: Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs, this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrated that our approach significantly outperformed traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.

[217] One Flight Over the Gap: A Survey from Perspective to Panoramic Vision

Xin Lin, Xian Ge, Dizhe Zhang, Zhaoliang Wan, Xianshun Wang, Xiangtai Li, Wenjie Jiang, Bo Du, Dacheng Tao, Ming-Hsuan Yang, Lu Qi

Main category: cs.CV

TL;DR: Survey paper on panoramic vision techniques focusing on perspective-to-panorama domain adaptation challenges and solutions across 20+ tasks from 300+ research papers.

Motivation: Growing demand for spatial intelligence and holistic scene perception in applications like VR, autonomous driving, and robotics requires specialized techniques for omnidirectional images (ODIs) that differ significantly from perspective images.

Method: Reviews panoramic imaging pipeline and projection methods, analyzes structural disparities, summarizes three key domain adaptation challenges (geometric distortions, non-uniform sampling, boundary continuity), and categorizes panoramic vision into four major categories with cross-method and cross-task analysis.

Result: Comprehensive survey covering 20+ representative tasks from 300+ research papers, providing analysis of strategies for panoramic-specific challenges and classification of panoramic vision into four categories: visual quality enhancement/assessment, visual understanding, multimodal understanding, and visual generation.

Conclusion: Identifies open challenges and future directions in data, models, and applications to advance panoramic vision research, offering new insights and forward-looking perspectives for developing panoramic vision technologies.

Abstract: Driven by the demand for spatial intelligence and holistic scene perception, omnidirectional images (ODIs), which provide a complete 360° field of view, are receiving growing attention across diverse applications such as virtual reality, autonomous driving, and embodied robotics. Despite their unique characteristics, ODIs exhibit remarkable differences from perspective images in geometric projection, spatial distribution, and boundary continuity, making direct domain adaptation from perspective methods challenging. This survey reviews recent panoramic vision techniques with a particular emphasis on the perspective-to-panorama adaptation. We first revisit the panoramic imaging pipeline and projection methods to build the prior knowledge required for analyzing the structural disparities. Then, we summarize three challenges of domain adaptation: severe geometric distortions near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, we cover 20+ representative tasks drawn from more than 300 research papers in two dimensions. On one hand, we present a cross-method analysis of representative strategies for addressing panorama-specific challenges across different tasks. On the other hand, we conduct a cross-task comparison and classify panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. In addition, we discuss open challenges and future directions in data, models, and applications that will drive the advancement of panoramic vision research. We hope that our work can provide new insight and forward-looking perspectives to advance the development of panoramic vision technologies. Our project page is https://insta360-research-team.github.io/Survey-of-Panorama

[218] PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

Linqing Wang, Ximing Xing, Yiji Cheng, Zhiyuan Zhao, Jiale Tao, Qixun Wang, Ruihuang Li, Xin Li, Mingrui Wu, Xinchi Deng, Chunyu Wang, Qinglin Lu

Main category: cs.CV

TL;DR: PromptEnhancer is a universal prompt rewriting framework that enhances text-to-image models by generating more precise prompts through reinforcement learning, significantly improving image-text alignment without modifying model weights.

Motivation: Current text-to-image diffusion models often fail to faithfully render complex user prompts, leading to mismatches between user intent and generated output, particularly in attribute binding, negation, and compositional relationships.

Method: A Chain-of-Thought rewriter trained through reinforcement learning guided by an AlignEvaluator reward model that provides fine-grained feedback based on 24 key points derived from common T2I failure modes.

Result: Extensive experiments on HunyuanImage 2.1 model show significant improvements in image-text alignment across various semantic and compositional challenges.

Conclusion: PromptEnhancer effectively addresses T2I model limitations by decoupling prompt rewriting from generation, providing a universal solution that enhances any pretrained model without weight modifications.

Abstract: Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.

[219] Missing Fine Details in Images: Last Seen in High Frequencies

Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper

Main category: cs.CV

TL;DR: The paper identifies that current latent tokenizers in generative models prioritize low-frequency reconstruction, causing loss of high-frequency details and visual artifacts. They propose a frequency-aware VAE with wavelet decomposition to separately optimize low and high frequencies, resulting in sharper image generation.

Motivation: Existing latent tokenizers in generative models exhibit bias toward low-frequency information during optimization, leading to over-smoothed outputs and loss of fine details in textured regions with sharp transitions, which diminishes perceptual quality.

Method: Propose a wavelet-based frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples optimization of low- and high-frequency components using frequency decomposition analysis.

Result: The approach enables improved reconstruction of fine textures while preserving global structure, and when integrated into a state-of-the-art latent diffusion model, produces sharper and more realistic image generation.

Conclusion: The frequency-aware optimization bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware approaches for realistic image synthesis, with applications in content creation, neural rendering, and medical imaging.

Abstract: Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, generated images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information during optimization, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Moreover, we integrate our frequency-preserving latent embeddings into a SOTA latent diffusion model, resulting in sharper and more realistic image generation. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image synthesis, with broader implications for applications in content creation, neural rendering, and medical imaging.
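
A sketch of the decoupling idea using a one-level Haar split (the simplest wavelet); the separate weighting of the high-frequency term is our assumption, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def haar_split(x):
    """One-level 2D Haar-style split of (B, C, H, W) into an average
    (low-frequency) band and three detail (high-frequency) bands."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    low = (a + b + c + d) / 4.0
    high = torch.cat([a - b + c - d, a + b - c - d, a - b - c + d], dim=1) / 4.0
    return low, high

def frequency_aware_loss(x, x_hat, w_high=2.0):
    """Reconstruct both bands, weighting the high-frequency term separately
    so fine detail is not drowned out by the low-frequency bulk."""
    low, high = haar_split(x)
    low_hat, high_hat = haar_split(x_hat)
    return F.mse_loss(low_hat, low) + w_high * F.mse_loss(high_hat, high)

x, x_hat = torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)
print(frequency_aware_loss(x, x_hat))
```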

[220] Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching

Feng Wang, Zihao Yu

Main category: cs.CV

TL;DR: Proposes CPS method to eliminate noise artifacts in SDE-based sampling for Flow Matching models, enabling more accurate reward modeling and stable RL convergence.

DetailsMotivation: SDE-based sampling in Flow Matching introduces noise artifacts that harm reward learning and image quality in RL-based optimization.

Method: Draws inspiration from DDIM to reformulate the sampling process as Coefficients-Preserving Sampling (CPS), which eliminates the excess stochasticity of SDE-based sampling.

Result: CPS removes noise artifacts, enables more accurate reward modeling, and leads to faster, more stable convergence for RL optimizers like Flow-GRPO and Dance-GRPO.

Conclusion: The proposed CPS method successfully addresses noise issues in SDE-based sampling, improving RL-based optimization for Flow Matching models.

Abstract: Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
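
To make the "excess stochasticity" argument concrete, here is a minimal sketch contrasting a deterministic (DDIM-like) Euler update with an Euler-Maruyama SDE update for a learned velocity field; the paper's actual CPS coefficients are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def ode_step(x, v, dt):
    """Deterministic update: simply follow the learned velocity field."""
    return x + v * dt

def sde_step(x, v, dt, sigma):
    """Stochastic update: the diffusion term injects fresh noise at every
    step, which the paper identifies as the source of image artifacts."""
    return x + v * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
```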

[221] Light-Weight Cross-Modal Enhancement Method with Benchmark Construction for UAV-based Open-Vocabulary Object Detection

Zhenhai Weng, Xinjie Li, Can Wu, Weijie He, Jianfeng Lv, Dong Zhou, Zhongliang Yu

Main category: cs.CV

TL;DR: A complete UAV-oriented solution for Open-Vocabulary Object Detection that combines a refined UAV-Label Engine for dataset construction and a Cross-Attention Gated Enhancement module for improved text-vision alignment, achieving significant performance gains on UAV imagery.

DetailsMotivation: Open-Vocabulary Object Detection (OVD) suffers severe performance degradation when applied to UAV imagery due to the domain gap from ground-level datasets, requiring specialized solutions for aerial perspectives.

Method: 1) Developed UAV-Label Engine to resolve annotation issues and generate large-scale UAV datasets (UAVDE-2M with 2.4M+ instances, UAVCAP-15K for vision-language pretraining). 2) Introduced Cross-Attention Gated Enhancement (CAGE) module with dual-path fusion integrating cross-attention, adaptive gating, and global FiLM modulation for robust text-vision alignment.

Result: Achieved +5.3 mAP improvement in zero-shot detection on VisDrone while reducing parameters and GFLOPs. Demonstrated strong cross-domain generalization on SIMD. Extensive experiments and real-world UAV deployment confirmed effectiveness.

Conclusion: The proposed UAV-oriented solution effectively addresses the domain gap in OVD for UAV imagery through combined dataset construction and model innovation, providing practical and efficient performance improvements for aerial detection tasks.

Abstract: Open-Vocabulary Object Detection (OVD) faces severe performance degradation when applied to UAV imagery due to the domain gap from ground-level datasets. To address this challenge, we propose a complete UAV-oriented solution that combines both dataset construction and model innovation. First, we design a refined UAV-Label Engine, which efficiently resolves annotation redundancy, inconsistency, and ambiguity, enabling the generation of large-scale UAV datasets. Based on this engine, we construct two new benchmarks: UAVDE-2M, with over 2.4M instances across 1,800+ categories, and UAVCAP-15K, providing rich image-text pairs for vision-language pretraining. Second, we introduce the Cross-Attention Gated Enhancement (CAGE) module, a lightweight dual-path fusion design that integrates cross-attention, adaptive gating, and global FiLM modulation for robust text-vision alignment. By embedding CAGE into the YOLO-World-v2 framework, our method achieves significant gains in both accuracy and efficiency, notably improving zero-shot detection on VisDrone by +5.3 mAP while reducing parameters and GFLOPs, and demonstrating strong cross-domain generalization on SIMD. Extensive experiments and real-world UAV deployment confirm the effectiveness and practicality of our proposed solution for UAV-based OVD.
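
A minimal PyTorch sketch of a cross-attention gated fusion block with global FiLM modulation, the three ingredients the abstract names; the layer sizes, wiring, and mean-pooling are assumptions, not the paper's exact CAGE design:

```python
import torch
import torch.nn as nn

class CAGE(nn.Module):
    """Illustrative cross-attention gated enhancement block (a sketch,
    not the published architecture)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.film = nn.Linear(dim, 2 * dim)  # global scale/shift from text

    def forward(self, vis, txt):
        # vis: (B, N, D) visual tokens; txt: (B, T, D) text tokens
        attended, _ = self.attn(query=vis, key=txt, value=txt)
        g = self.gate(torch.cat([vis, attended], dim=-1))   # adaptive gate
        fused = vis + g * attended
        gamma, beta = self.film(txt.mean(dim=1)).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * fused + beta.unsqueeze(1)  # global FiLM
```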

[222] Evolving from Unknown to Known: Retentive Angular Representation Learning for Incremental Open Set Recognition

Runqing Yang, Yimin Fu, Changyuan Wu, Zhunga Liu

Main category: cs.CV

TL;DR: RARL method for incremental open set recognition that maintains decision boundaries by aligning unknown representations around inactive prototypes in angular space and using virtual classes to compact known representations.

DetailsMotivation: Existing OSR methods are designed for static scenarios but real-world applications require incremental identification of new unknown classes from continuous data streams while maintaining discriminability of decision boundaries.

Method: Retentive Angular Representation Learning (RARL) with virtual-intrinsic interactive training strategy that uses boundary-proximal virtual classes to compact known representations, and stratified rectification to refine decision boundaries and mitigate representation bias.

Result: State-of-the-art performance across various task setups on CIFAR100 and TinyImageNet datasets, establishing a new benchmark for incremental open set recognition.

Conclusion: RARL effectively addresses the challenge of maintaining discriminative decision boundaries in incremental open set recognition scenarios by mitigating representation drift and feature space distortion through angular space alignment and virtual class strategies.

Abstract: Existing open set recognition (OSR) methods are typically designed for static scenarios, where models aim to classify known classes and identify unknown ones within fixed scopes. This deviates from the expectation that the model should incrementally identify newly emerging unknown classes from continuous data streams and acquire corresponding knowledge. In such evolving scenarios, the discriminability of OSR decision boundaries is hard to maintain due to restricted access to former training data, causing severe inter-class confusion. To solve this problem, we propose retentive angular representation learning (RARL) for incremental open set recognition (IOSR). In RARL, unknown representations are encouraged to align around inactive prototypes within an angular space constructed under the equiangular tight frame, thereby mitigating excessive representation drift during knowledge updates. Specifically, we adopt a virtual-intrinsic interactive (VII) training strategy, which compacts known representations by enforcing clear inter-class margins through boundary-proximal virtual classes. Furthermore, a stratified rectification strategy is designed to refine decision boundaries, mitigating representation bias and feature space distortion caused by imbalances between old/new and positive/negative class samples. We conduct thorough evaluations on CIFAR100 and TinyImageNet datasets and establish a new benchmark for IOSR. Experimental results across various task setups demonstrate that the proposed method achieves state-of-the-art performance.
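
As a sketch of the angular setup the abstract describes, the equiangular tight frame (ETF) gives prototypes with identical pairwise angles, toward which features can be pulled; the construction below is the standard simplex ETF, and the alignment loss is an illustrative assumption (requires dim >= num_classes):

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """num_classes unit prototypes in R^dim with pairwise cosine similarity
    exactly -1/(num_classes - 1). Assumes dim >= num_classes."""
    K = num_classes
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))  # orthonormal columns
    M = U @ (np.eye(K) - np.ones((K, K)) / K)
    M *= np.sqrt(K / (K - 1))
    return M / np.linalg.norm(M, axis=0, keepdims=True)  # (dim, K)

def angular_alignment_loss(feats, prototypes, labels):
    """Pull L2-normalized features toward their class prototype in angle."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cos = np.einsum('nd,dn->n', f, prototypes[:, labels])
    return np.mean(1.0 - cos)
```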

[223] Hybrid Swin Attention Networks for Simultaneously Low-Dose PET and CT Denoising

Yichao Liu, Hengzhi Xue, YueYang Teng

Main category: cs.CV

TL;DR: HSANet - a novel Hybrid Swin Attention Network for LDCT/PET denoising that combines Efficient Global Attention modules and hybrid upsampling to improve image quality while maintaining radiation safety.

DetailsMotivation: Low-dose CT and PET reduce radiation exposure but introduce noise and artifacts that compromise diagnostic accuracy, requiring effective denoising methods.

Method: Proposes HSANet with Efficient Global Attention modules for enhanced spatial/channel interaction and hybrid upsampling module to prevent noise overfitting.

Result: HSANet achieves superior denoising performance compared to existing methods while maintaining lightweight model size suitable for standard GPU deployment.

Conclusion: The approach is highly practical for real-world clinical applications, providing effective denoising without compromising radiation safety benefits.

Abstract: Low-dose computed tomography (LDCT) and positron emission tomography (PET) have emerged as safer alternatives to conventional imaging modalities by significantly reducing radiation exposure. However, this reduction often results in increased noise and artifacts, which can compromise diagnostic accuracy. Consequently, denoising for LDCT/PET has become a vital area of research aimed at enhancing image quality while maintaining radiation safety. In this study, we introduce a novel Hybrid Swin Attention Network (HSANet), which incorporates Efficient Global Attention (EGA) modules and a hybrid upsampling module. The EGA modules enhance both spatial and channel-wise interaction, improving the network’s capacity to capture relevant features, while the hybrid upsampling module mitigates the risk of overfitting to noise. We validate the proposed approach using a publicly available LDCT/PET dataset. Experimental results demonstrate that HSANet achieves superior denoising performance compared to existing methods, while maintaining a lightweight model size suitable for deployment on GPUs with standard memory configurations. This makes our approach highly practical for real-world clinical applications.

[224] VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes

Shengkai Zhang, Yuhe Liu, Guanjun Wu, Jianhua He, Xinggang Wang, Mozi Chen, Kezhong Liu

Main category: cs.CV

TL;DR: VIM-GS is a Gaussian Splatting framework that uses monocular images for novel-view synthesis in large scenes by combining sparse SfM depth with dense but coarse foundation model depth through object-segmented propagation and dynamic refinement.

DetailsMotivation: Gaussian Splatting typically requires accurate depth from RGB-D/stereo cameras, which have limited sensing range for large scenes. Monocular images lack depth guidance, leading to inferior results. Foundation models for monocular depth estimation suffer from inconsistency, inaccuracy for distant scenes, and texture ambiguity.

Method: Leverages sparse but accurate depth from visual-inertial SfM to refine dense but coarse depth from large foundation models. Uses object-segmented depth propagation algorithm for structured objects and dynamic depth refinement module for dynamic objects.

Result: Superior rendering quality in large scenes demonstrated through experiments on public and customized datasets.

Conclusion: VIM-GS successfully generates dense, accurate depth from monocular RGB inputs for high-quality Gaussian Splatting rendering in large scenes by effectively combining sparse SfM depth with foundation model depth.

Abstract: VIM-GS is a Gaussian Splatting (GS) framework using monocular images for novel-view synthesis (NVS) in large scenes. GS typically requires accurate depth to initiate Gaussian ellipsoids using RGB-D/stereo cameras. Their limited depth sensing range makes it difficult for GS to work in large scenes. Monocular images, however, lack depth to guide the learning and lead to inferior NVS results. Although large foundation models (LFMs) for monocular depth estimation are available, they suffer from cross-frame inconsistency, inaccuracy for distant scenes, and ambiguity in deceptive texture cues. This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-definition GS rendering. The key idea is to leverage the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. To bridge the sparse input and dense output, we propose an object-segmented depth propagation algorithm that renders the depth of pixels of structured objects. Then we develop a dynamic depth refinement module to handle the crippled SfM depth of dynamic objects and refine the coarse LFM depth. Experiments using public and customized datasets demonstrate the superior rendering quality of VIM-GS in large scenes.
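
The propagation algorithm itself is not detailed in the abstract. A common baseline for the sparse-to-dense refinement step it builds on is a least-squares scale/shift fit of the dense LFM depth against the sparse SfM depth, sketched here under that assumption:

```python
import numpy as np

def align_dense_depth(dense, sparse, mask):
    """Fit a per-image scale s and shift t so that s*dense + t best matches
    the sparse-but-accurate SfM depth where it exists (mask == True)."""
    d = dense[mask].ravel()
    z = sparse[mask].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, z, rcond=None)
    return s * dense + t
```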

[225] P3-SAM: Native 3D Part Segmentation

Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, Chunchao Guo

Main category: cs.CV

TL;DR: P3-SAM is a native 3D point-promptable part segmentation model that automates 3D object segmentation into components using a SAM-inspired architecture with feature extraction, multiple segmentation heads, and IoU prediction.

DetailsMotivation: Current 3D part segmentation methods suffer from poor robustness with complex objects and lack full automation, limiting their practical applications in 3D understanding and model reuse.

Method: Proposes P3-SAM with feature extractor, multiple segmentation heads, and IoU predictor for interactive segmentation. Includes algorithm for automatic mask selection and merging. Trained on 3.7M models with segmentation labels.

Result: Achieves precise segmentation results and strong robustness on complex objects, attaining state-of-the-art performance compared to existing methods.

Conclusion: P3-SAM successfully addresses automation and robustness challenges in 3D part segmentation, providing an effective solution for component-based 3D object analysis and reuse applications.

Abstract: Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods face limitations such as poor robustness when dealing with complex objects and cannot fully automate the process. In this paper, we propose a native 3D point-promptable part segmentation model termed P3-SAM, designed to fully automate the segmentation of any 3D objects into components. Inspired by SAM, P3-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on any complex objects, attaining state-of-the-art performance. Our code will be released soon.
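
A minimal sketch of the kind of automatic mask selection and merging the abstract mentions, here as greedy IoU-based merging over score-ranked masks; the paper's actual algorithm may differ:

```python
import numpy as np

def merge_masks(masks, scores, iou_thresh=0.8):
    """Greedy merge: take the highest-scoring unused mask, fold in every
    candidate overlapping it above iou_thresh, repeat. Masks are boolean
    arrays over points or faces (illustrative only)."""
    order = np.argsort(scores)[::-1]
    used = np.zeros(len(masks), dtype=bool)
    kept = []
    for i in order:
        if used[i]:
            continue
        merged = masks[i].copy()
        used[i] = True
        for j in order:
            if used[j]:
                continue
            inter = np.logical_and(merged, masks[j]).sum()
            union = np.logical_or(merged, masks[j]).sum()
            if union > 0 and inter / union > iou_thresh:
                merged |= masks[j]
                used[j] = True
        kept.append(merged)
    return kept
```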

cs.AI

[226] Renewable Energy Sources Selection Analysis with the Maximizing Deviation Method

Kirisci Murat

Main category: cs.AI

TL;DR: This paper proposes a novel multi-criteria decision-making method using Fermatean fuzzy sets and optimization modeling to determine renewable energy source selection under uncertainty.

DetailsMotivation: To address uncertainty in human judgments and complex decision-making scenarios, particularly for renewable energy selection which involves technical, managerial, and political considerations while balancing carbon emissions and climate change mitigation.

Method: Developed an optimization model based on deviation maximization method combined with interval-valued Fermatean fuzzy sets to determine partially known feature weights in a fuzzy environment.

Result: The proposed method was successfully applied to renewable energy source selection, demonstrating its effectiveness in handling uncertainty and fuzziness in decision-makers’ judgments.

Conclusion: The Fermatean fuzzy environment combined with optimization modeling provides an effective framework for multi-criteria decision-making in complex, uncertain scenarios like renewable energy selection, offering both technical and managerial/political insights.

Abstract: Multi-criteria decision-making methods provide decision-makers with appropriate tools to make better decisions in uncertain, complex, and conflicting situations. Fuzzy set theory primarily deals with the uncertainty inherent in human thoughts and perceptions and attempts to quantify this uncertainty. Fuzzy logic and fuzzy set theory are utilized with multi-criteria decision-making methods because they effectively handle uncertainty and fuzziness in decision-makers’ judgments, allowing for verbal judgments of the problem. This study utilizes the Fermatean fuzzy environment, a generalization of fuzzy sets. An optimization model based on the deviation maximization method is proposed to determine partially known feature weights. This method is combined with interval-valued Fermatean fuzzy sets. The proposed method was applied to the problem of selecting renewable energy sources. The reason for choosing renewable energy sources is that meeting energy needs from renewable sources, balancing carbon emissions, and mitigating the effects of global climate change are among the most critical issues of the recent period. Even though selecting renewable energy sources is a technical issue, the managerial and political implications of this issue are also important, and are discussed in this study.
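
For reference, the generic maximizing-deviation weighting that underlies the method: criteria that separate the alternatives more strongly receive larger weights. The paper's variant replaces the distance d with an interval-valued Fermatean fuzzy distance measure:

```latex
% m alternatives, n criteria, x_{ij} the rating of alternative i on criterion j
w_j \;=\; \frac{\sum_{i=1}^{m}\sum_{k=1}^{m} d\!\left(x_{ij}, x_{kj}\right)}
               {\sum_{j=1}^{n}\sum_{i=1}^{m}\sum_{k=1}^{m} d\!\left(x_{ij}, x_{kj}\right)}
```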

[227] A data-driven discretized CS:GO simulation environment to facilitate strategic multi-agent planning research

Yunzhe Wang, Volkan Ustun, Chris McGroarty

Main category: cs.AI

TL;DR: DECOY is a multi-agent simulator that abstracts strategic planning in 3D environments using discretized waypoints, trained on CS:GO data to simulate gameplay without low-level mechanics.

DetailsMotivation: To balance high-fidelity detail with computational efficiency in complex multi-agent simulations, enabling strategic planning research without requiring detailed low-level mechanics modeling.

Method: Uses a waypoint system to discretize continuous states/actions, paired with neural predictive and generative models trained on real CS:GO tournament data to reconstruct event outcomes from movement decisions alone.

Result: Extensive evaluations show that replays generated from human data in DECOY closely match those observed in the original game, demonstrating accurate simulation fidelity.

Conclusion: DECOY provides a valuable publicly available tool for advancing research in strategic multi-agent planning and behavior generation with efficient high-level abstraction.

Abstract: Modern simulation environments for complex multi-agent interactions must balance high-fidelity detail with computational efficiency. We present DECOY, a novel multi-agent simulator that abstracts strategic, long-horizon planning in 3D terrains into high-level discretized simulation while preserving low-level environmental fidelity. Using Counter-Strike: Global Offensive (CS:GO) as a testbed, our framework accurately simulates gameplay using only movement decisions as tactical positioning – without explicitly modeling low-level mechanics such as aiming and shooting. Central to our approach is a waypoint system that simplifies and discretizes continuous states and actions, paired with neural predictive and generative models trained on real CS:GO tournament data to reconstruct event outcomes. Extensive evaluations show that replays generated from human data in DECOY closely match those observed in the original game. Our publicly available simulation environment provides a valuable tool for advancing research in strategic multi-agent planning and behavior generation.
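
A minimal sketch of the waypoint discretization idea, mapping a continuous position to its nearest waypoint index; DECOY's actual waypoint system is richer than this:

```python
import numpy as np

def to_waypoint(pos, waypoints):
    """Discretize a continuous 3D position (shape (3,)) to the index of the
    nearest waypoint (waypoints: shape (W, 3))."""
    return int(np.argmin(np.linalg.norm(waypoints - pos, axis=1)))
```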

[228] From Eigenmodes to Proofs: Integrating Graph Spectral Operators with Symbolic Interpretable Reasoning

Andrew Kiruluta, Priscilla Burity

Main category: cs.AI

TL;DR: Spectral NSR is a neuro-symbolic reasoning framework that embeds logical rules as spectral templates using graph signal processing, achieving superior accuracy, speed, and interpretability compared to existing methods.

DetailsMotivation: To unify the interpretability of symbolic reasoning with the scalability and adaptability of spectral learning by leveraging graph spectral domains for knowledge graph inference.

Method: Embeds logical rules as spectral templates using graph signal processing with Laplacian eigenstructure, incorporating dynamic graph learning, frequency-selective filters, mixture-of-spectral-experts, proof-guided training, and various enhancements like LLM coupling and adversarial robustness.

Result: Achieves superior accuracy, faster inference, improved robustness, and higher interpretability on benchmarks like ProofWriter and CLUTRR compared to transformers, message-passing neural networks, and neuro-symbolic logic programming systems.

Conclusion: Spectral NSR establishes a scalable and principled foundation for next-generation reasoning systems, offering transparency, robustness, and generalization beyond conventional approaches.

Abstract: We introduce Spectral NSR, a fully spectral neuro-symbolic reasoning framework that embeds logical rules as spectral templates and performs inference directly in the graph spectral domain. By leveraging graph signal processing (GSP) and frequency-selective filters grounded in the Laplacian eigenstructure of knowledge graphs, the architecture unifies the interpretability of symbolic reasoning with the scalability and adaptability of spectral learning. Beyond the core formulation, we incorporate a comprehensive set of extensions, including dynamic graph and basis learning, rational and diffusion filters for sharper spectral selectivity, mixture-of-spectral-experts for modular specialization, proof-guided training with spectral curricula, and uncertainty quantification for calibrated confidence. Additional enhancements such as large language model coupling, co-spectral transfer alignment, adversarial robustness, efficient GPU kernels, generalized Laplacians, and causal interventions further expand the versatility of the framework. Empirical evaluation on state-of-the-art reasoning benchmarks such as ProofWriter and CLUTRR demonstrates that Spectral NSR achieves superior accuracy, faster inference, improved robustness to adversarial perturbations, and higher interpretability compared to leading baselines including transformers, message-passing neural networks, and neuro-symbolic logic programming systems. Spectral attribution and proof-band agreement analyses confirm that model decisions align closely with symbolic proof structures, while transfer experiments validate effective domain adaptation through co-spectral alignment. These results establish Spectral NSR as a scalable and principled foundation for the next generation of reasoning systems, offering transparency, robustness, and generalization beyond conventional approaches.
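
A minimal numpy sketch of the underlying GSP operation the framework builds on: filtering a graph signal by reweighting its Laplacian eigenmodes. The low-pass response shown in the comment is an assumed toy example, not one of the paper's learned rule templates:

```python
import numpy as np

def spectral_filter(adj, signal, response):
    """Filter a graph signal in the Laplacian eigenbasis: project, reweight
    each frequency by response(eigenvalue), and reconstruct."""
    deg = np.diag(adj.sum(axis=1))
    L = deg - adj                        # combinatorial Laplacian
    lam, U = np.linalg.eigh(L)           # eigenmodes (graph frequencies)
    coeffs = U.T @ signal                # graph Fourier transform
    return U @ (response(lam) * coeffs)  # frequency-selective filtering

# Example toy response: damp high-frequency components.
# lowpass = lambda lam: 1.0 / (1.0 + 2.0 * lam)
```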

[229] PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Junjie Wang, Yuxiang Zhang, Minghao Liu, Yin Zhang, Yatai Ji, Weihao Xuan, Nie Lin, Kang Zhu, Zhiqiang Lin, Yiming Ren, Chunyang Jiang, Yiyao Yu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Qunshu Lin, Yujiu Yang, Ge Zhang, Ruibin Yuan, Bei Chen, Wenhu Chen

Main category: cs.AI

TL;DR: PIN introduces a novel multimodal document format combining structured Markdown with holistic images, and releases two large-scale datasets (PIN-200M and PIN-14M) to improve LMMs’ visual-text integration and reasoning capabilities.

DetailsMotivation: Address persistent perceptual and reasoning errors in large multimodal models when interpreting complex visual data and deducing multimodal relationships.

Method: Developed PIN format that pairs semantically rich Markdown files (preserving fine-grained textual structures) with holistic overall images capturing complete document layouts. Created two large-scale datasets from diverse web and scientific sources in English and Chinese.

Result: Constructed and released PIN-200M (~200M documents) and PIN-14M (~14M) datasets with detailed statistical analyses and quality signals for easy filtering and task-specific data selection.

Conclusion: Provides community with versatile data format and substantial resources to enable new research in pre-training strategies and development of more powerful knowledge-intensive LMMs.

Abstract: Recent advancements in large multimodal models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. To address these issues, we introduce PIN (Paired and INterleaved multimodal documents), a novel data format designed to foster a deeper integration of visual and textual knowledge. The PIN format uniquely combines semantically rich Markdown files, which preserve fine-grained textual structures, with holistic overall images that capture the complete document layout. Following this format, we construct and release two large-scale, open-source datasets: PIN-200M (~200 million documents) and PIN-14M (~14 million), compiled from diverse web and scientific sources in both English and Chinese. To maximize usability, we provide detailed statistical analyses and equip the datasets with quality signals, enabling researchers to easily filter and select data for specific tasks. Our work provides the community with a versatile data format and substantial resources, offering a foundation for new research in pre-training strategies and the development of more powerful knowledge-intensive LMMs.

[230] Statistical Methods in Generative AI

Edgar Dobriban

Main category: cs.AI

TL;DR: Review of statistical methods for improving reliability, quality, and evaluation of generative AI systems

DetailsMotivation: Generative AI is transformative but lacks guarantees about correctness, safety, and fairness due to its probabilistic nature. Statistical methods offer potential solutions to address these reliability issues.

Method: Review and analysis of existing statistical techniques and their applications to generative AI, including methods for improving reliability, evaluation quality, efficiency, and experimental design.

Result: Identifies promising statistical approaches for enhancing generative AI systems but notes current limitations and the need for further development.

Conclusion: Statistical methods show significant potential for making generative AI more reliable and effective, though more research is needed to address current limitations and explore future directions.

Abstract: Generative Artificial Intelligence is emerging as an important technology, promising to be transformative in many areas. At the same time, generative AI techniques are based on sampling from probabilistic models, and by default, they come with no guarantees about correctness, safety, fairness, or other properties. Statistical methods offer a promising potential approach to improve the reliability of generative AI techniques. In addition, statistical methods are also promising for improving the quality and efficiency of AI evaluation, as well as for designing interventions and experiments in AI. In this paper, we review some of the existing work on these topics, explaining both the general statistical techniques used, as well as their applications to generative AI. We also discuss limitations and potential future directions.

[231] Instruction Agent: Enhancing Agent with Expert Demonstration

Yinheng Li, Hailey Hultquist, Justin Wagle, Kazuhito Koishida

Main category: cs.AI

TL;DR: Instruction Agent is a GUI agent that uses expert demonstrations to solve complex GUI tasks by extracting step-by-step instructions and executing them precisely, achieving 60% success rate on challenging OSWorld tasks where other agents failed.

DetailsMotivation: Current GUI agents struggle with complex tasks involving novel UI elements, long-horizon actions, and personalized trajectories, creating a need for more reliable automation.

Method: Leverages single expert demonstration to extract step-by-step instructions, executes strictly following user-intended trajectory, and uses verifier and backtracker modules to handle unexpected interruptions and improve robustness.

Result: Achieves 60% success rate on OSWorld tasks that all top-ranked agents failed to complete, demonstrating superior performance on challenging GUI automation tasks.

Conclusion: Instruction Agent provides a practical and extensible framework that bridges the gap between current GUI agents and reliable real-world GUI task automation through demonstration-based learning.

Abstract: Graphical user interface (GUI) agents have advanced rapidly but still struggle with complex tasks involving novel UI elements, long-horizon actions, and personalized trajectories. In this work, we introduce Instruction Agent, a GUI agent that leverages expert demonstrations to solve such tasks, enabling completion of otherwise difficult workflows. Given a single demonstration, the agent extracts step-by-step instructions and executes them by strictly following the trajectory intended by the user, which avoids making mistakes during execution. The agent further leverages verifier and backtracker modules to improve robustness. Both modules are critical for understanding the outcome of each action and handling unexpected interruptions (such as pop-up windows) during execution. Our experiments show that Instruction Agent achieves a 60% success rate on a set of tasks in OSWorld that all top-ranked agents failed to complete. The Instruction Agent offers a practical and extensible framework, bridging the gap between current GUI agents and reliable real-world GUI task automation.
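
A hedged control-loop sketch of the execute/verify/backtrack pattern the abstract describes; `execute`, `verify`, and `backtrack` are hypothetical callables standing in for the paper's modules:

```python
def run_instruction_agent(steps, execute, verify, backtrack, max_retries=2):
    """Follow extracted step-by-step instructions; after each action the
    verifier checks the outcome, and the backtracker recovers from
    interruptions such as pop-up windows (illustrative sketch only)."""
    i, retries = 0, 0
    while i < len(steps):
        execute(steps[i])
        if verify(steps[i]):        # did the UI reach the expected state?
            i += 1
            retries = 0
        elif retries < max_retries:
            backtrack()             # e.g., dismiss the pop-up, undo the action
            retries += 1
        else:
            raise RuntimeError(f"step {i} failed after {max_retries} retries")
```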

[232] Neuro-Symbolic Frameworks: Conceptual Characterization and Empirical Comparative Analysis

Sania Sinha, Tanawan Premsri, Danial Kamali, Parisa Kordjamshidi

Main category: cs.AI

TL;DR: Analysis of neurosymbolic frameworks’ technical facets, challenges, and showcase of three frameworks (DeepProbLog, Scallop, DomiKnowS) to identify expressivity and encourage community innovation.

DetailsMotivation: Neurosymbolic frameworks combine neural learning with symbolic reasoning for better explainability and data efficiency, but face challenges with learning curve and lack of user-friendly tools.

Method: Characterize technical facets of existing NeSy frameworks including symbolic representation language, neural integration, and algorithms. Analyze three frameworks to identify expressivity in problem solving.

Result: Identified key challenges in neurosymbolic framework development and showcased how different frameworks approach problem specification and solving capabilities.

Conclusion: The analysis provides foundation for understanding framework expressivity and aims to spark transformative action in the neurosymbolic community to rethink problem solving approaches.

Abstract: Neurosymbolic (NeSy) frameworks combine neural representations and learning with symbolic representations and reasoning. Combining the reasoning capacities, explainability, and interpretability of symbolic processing with the flexibility and power of neural computing allows us to solve complex problems with more reliability while being data-efficient. However, this recently growing topic poses a challenge to developers with its learning curve, lack of user-friendly tools, libraries, and unifying frameworks. In this paper, we characterize the technical facets of existing NeSy frameworks, such as the symbolic representation language, integration with neural models, and the underlying algorithms. A majority of the NeSy research focuses on algorithms instead of providing generic frameworks for declarative problem specification to leverage problem solving. To highlight the key aspects of Neurosymbolic modeling, we showcase three generic NeSy frameworks - DeepProbLog, Scallop, and DomiKnowS. We identify the challenges within each facet that lay the foundation for identifying the expressivity of each framework in solving a variety of problems. Building on this foundation, we aim to spark transformative action and encourage the community to rethink this problem in novel ways.

[233] Autoencoder-Based Denoising of Muscle Artifacts in ECG to Preserve Skin Nerve Activity (SKNA) for Cognitive Stress Detection

Farnoush Baghestani, Jihye Moon, Youngsun Kong, Ki Chon

Main category: cs.AI

TL;DR: Deep learning-based denoising method using 1D convolutional autoencoder with LSTM bottleneck to remove EMG contamination from skin nerve activity (SKNA) measurements, achieving significant SNR improvement and accurate stress condition classification.

DetailsMotivation: Skin nerve activity (SKNA) from ECG provides noninvasive monitoring of sympathetic nervous system but is highly susceptible to EMG contamination during muscle activity, limiting its reliability in movement-rich environments.

Method: Used lightweight 1D convolutional autoencoder with LSTM bottleneck to reconstruct clean SKNA from EMG-contaminated recordings. Trained on simulated contamination data from cognitive stress experiments and chaotic muscle stimulation at realistic noise levels (-4 dB, -8 dB SNR) using leave-one-subject-out cross-validation.

Result: Improved SNR by up to 9.65 dB, increased cross correlation with clean SKNA from 0.40 to 0.72, restored burst-based features to near-clean discriminability (AUROC ≥ 0.96). Classification accuracy of baseline vs stress conditions reached 91-98% across severe noise levels.

Conclusion: Deep learning-based reconstruction effectively preserves physiologically relevant sympathetic bursts during EMG interference, enabling robust SKNA monitoring in naturalistic, movement-rich environments.

Abstract: The sympathetic nervous system (SNS) plays a central role in regulating the body’s responses to stress and maintaining physiological stability. Its dysregulation is associated with a wide range of conditions, from cardiovascular disease to anxiety disorders. Skin nerve activity (SKNA) extracted from high-frequency electrocardiogram (ECG) recordings provides a noninvasive window into SNS dynamics, but its measurement is highly susceptible to electromyographic (EMG) contamination. Traditional preprocessing based on bandpass filtering within a fixed range (e.g., 500–1000 Hz) is susceptible to overlapping EMG and SKNA spectral components, especially during sustained muscle activity. We present a denoising approach using a lightweight one-dimensional convolutional autoencoder with a long short-term memory (LSTM) bottleneck to reconstruct clean SKNA from EMG-contaminated recordings. Using clean ECG-derived SKNA data from cognitive stress experiments and EMG noise from chaotic muscle stimulation recordings, we simulated contamination at realistic noise levels (–4 dB, –8 dB signal-to-noise ratio) and trained the model in the leave-one-subject-out cross-validation framework. The method improved signal-to-noise ratio by up to 9.65 dB, increased cross correlation with clean SKNA from 0.40 to 0.72, and restored burst-based SKNA features to near-clean discriminability (AUROC ≥ 0.96). Classification of baseline versus sympathetic stimulation (cognitive stress) conditions reached accuracies of 91–98% across severe noise levels, comparable to clean data. These results demonstrate that deep learning–based reconstruction can preserve physiologically relevant sympathetic bursts during substantial EMG interference, enabling more robust SKNA monitoring in naturalistic, movement-rich environments.
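
A minimal PyTorch sketch of the stated architecture, a 1D convolutional autoencoder with an LSTM bottleneck; channel counts and kernel sizes are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SKNADenoiser(nn.Module):
    """Illustrative 1D conv autoencoder with an LSTM bottleneck for
    reconstructing clean SKNA from EMG-contaminated input."""
    def __init__(self, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(1, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, 7, stride=2, padding=3), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(hidden, 16, 8, stride=2, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, 8, stride=2, padding=3),
        )

    def forward(self, x):                       # x: (B, 1, T), T divisible by 4
        h = self.enc(x)                         # (B, 32, T/4)
        h, _ = self.lstm(h.transpose(1, 2))     # temporal context in bottleneck
        return self.dec(h.transpose(1, 2))      # reconstructed clean signal
```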

[234] Attention of a Kiss: Exploring Attention Maps in Video Diffusion for XAIxArts

Adam Cole, Mick Grierson

Main category: cs.AI

TL;DR: Visualizing attention mechanisms in video diffusion transformers for artistic and analytical purposes

DetailsMotivation: Inspired by early video artists manipulating analog signals, this research aims to make AI attention mechanisms interpretable and usable as creative material for artists

Method: Developed a tool based on the open-source Wan model to extract and visualize cross-attention maps in text-to-video generation, using exploratory probes and artistic case studies

Result: Created an interpretable window into temporal and spatial attention behavior in generative video models

Conclusion: Contributes to Explainable AI for the Arts (XAIxArts), enabling artists to use AI’s inner workings as a creative medium

Abstract: This paper presents an artistic and technical investigation into the attention mechanisms of video diffusion transformers. Inspired by early video artists who manipulated analog video signals to create new visual aesthetics, this study proposes a method for extracting and visualizing cross-attention maps in generative video models. Built on the open-source Wan model, our tool provides an interpretable window into the temporal and spatial behavior of attention in text-to-video generation. Through exploratory probes and an artistic case study, we examine the potential of attention maps as both analytical tools and raw artistic material. This work contributes to the growing field of Explainable AI for the Arts (XAIxArts), inviting artists to reclaim the inner workings of AI as a creative medium.
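
A generic PyTorch pattern for the kind of attention tap such a tool relies on: a forward hook that harvests attention weights from modules that return them (for example, nn.MultiheadAttention with need_weights=True). The Wan model's actual module names and tensor layouts are not given in the abstract and are not assumed here:

```python
import torch

def tap_attention(module, store):
    """Attach a forward hook that copies attention weights into `store`
    whenever the hooked module returns an (output, weights) tuple."""
    def hook(mod, inputs, output):
        if isinstance(output, tuple) and len(output) > 1 and output[1] is not None:
            store.append(output[1].detach().cpu())  # e.g. (B, Q, K) weights
    return module.register_forward_hook(hook)

# Usage sketch:
# maps = []; handle = tap_attention(attn_block, maps); ...; handle.remove()
```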

[235] PaVeRL-SQL: Text-to-SQL via Partial-Match Rewards and Verbal Reinforcement Learning

Heng Hao, Wenjun Hu, Oxana Verkholyak, Davoud Ataee Tarzanagh, Baruch Gutow, Sima Didari, Masoud Faraki, Hankyu Moon, Seungjai Min

Main category: cs.AI

TL;DR: PaVeRL-SQL framework combines partial-match rewards and verbal reinforcement learning to improve Text-to-SQL models, achieving state-of-the-art results on industry benchmarks with 7.4% higher execution accuracy.

DetailsMotivation: Current Text-to-SQL methods suffer from low execution accuracy on industry-scale databases and complex questions involving domain-specific business logic.

Method: Uses two pipelines: (1) in-context learning with group self-evaluation (verbal-RL) using large language models, and (2) chain-of-thought RL pipeline with OmniSQL-7B trained with special reward function and two-stage RL.

Result: Achieves SOTA results on Spider, Spider 2.0, and BIRD benchmarks. 7.4% higher execution accuracy than SOTA on Spider2.0-SQLite, with threefold gains for SQL dialects with limited training data.

Conclusion: PaVeRL-SQL delivers reliable, state-of-the-art Text-to-SQL performance under realistic industrial constraints.

Abstract: Text-to-SQL models allow users to interact with a database more easily by generating executable SQL statements from natural-language questions. Despite recent successes on simpler databases and questions, current Text-to-SQL methods still suffer from low execution accuracy on industry-scale databases and complex questions involving domain-specific business logic. We present PaVeRL-SQL, a framework that combines Partial-Match Rewards and Verbal Reinforcement Learning to drive self-improvement in reasoning language models (RLMs) for Text-to-SQL. To handle practical use cases, we adopt two pipelines: (1) a newly designed in-context learning framework with group self-evaluation (verbal-RL), using capable open- and closed-source large language models (LLMs) as backbones; and (2) a chain-of-thought (CoT) RL pipeline with a small backbone model (OmniSQL-7B) trained with a specially designed reward function and two-stage RL. These pipelines achieve state-of-the-art (SOTA) results on popular Text-to-SQL benchmarks – Spider, Spider 2.0, and BIRD. For the industrial-level Spider2.0-SQLite benchmark, the verbal-RL pipeline achieves an execution accuracy 7.4% higher than SOTA, and the CoT pipeline is 1.4% higher. RL training with mixed SQL dialects yields strong, threefold gains, particularly for dialects with limited training data. Overall, PaVeRL-SQL delivers reliable, SOTA Text-to-SQL under realistic industrial constraints. The code is available at https://github.com/PaVeRL-SQL/PaVeRL-SQL.
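
A minimal sketch of the partial-match idea: score a predicted SQL query by the fraction of gold execution rows it recovers, a denser signal than all-or-nothing execution match. The paper's actual reward design is more elaborate:

```python
def partial_match_reward(pred_rows, gold_rows):
    """Fraction of gold result rows recovered by executing the predicted SQL
    (illustrative; row order and duplicates are ignored here)."""
    if not gold_rows:
        return float(not pred_rows)   # empty gold: reward only empty pred
    gold = set(map(tuple, gold_rows))
    pred = set(map(tuple, pred_rows))
    return len(gold & pred) / len(gold)
```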

Quinten Steenhuis

Main category: cs.AI

TL;DR: FETCH classifier for legal issue classification using hybrid LLM/ML ensemble and automatic follow-up questions achieves 97.37% accuracy at lower cost than GPT-5.

DetailsMotivation: Millions seek legal help annually, and misdirection in problem identification can lead to severe consequences like missed deadlines, physical abuse, housing loss, or child custody issues while waiting for proper legal assistance.

Method: Hybrid LLM/ML ensemble classification method with automatic generation of follow-up questions to enrich initial problem narratives, tested on 419 real-world queries to a nonprofit lawyer referral service.

Result: Achieved classification accuracy (hits@2) of 97.37% using inexpensive models, exceeding the performance of state-of-the-art GPT-5 model.

Conclusion: The approach shows promise in significantly reducing costs while maintaining high accuracy for guiding users to appropriate legal resources.

Abstract: Each year millions of people seek help for their legal problems by calling a legal aid program hotline, walking into a legal aid office, or using a lawyer referral service. The first step to match them to the right help is to identify the legal problem the applicant is experiencing. Misdirection has consequences. Applicants may miss a deadline, experience physical abuse, lose housing or lose custody of children while waiting to connect to the right legal help. We introduce and evaluate the FETCH classifier for legal issue classification and describe two methods for improving accuracy: a hybrid LLM/ML ensemble classification method, and the automatic generation of follow-up questions to enrich the initial problem narrative. We employ a novel data set of 419 real-world queries to a nonprofit lawyer referral service. Ultimately, we show classification accuracy (hits@2) of 97.37% using a mix of inexpensive models, exceeding the performance of the current state-of-the-art GPT-5 model. Our approach shows promise in significantly reducing the cost of guiding users of the legal system to the right resource for their problem while achieving high accuracy.

[237] A Hybrid CNN-LSTM Deep Learning Model for Intrusion Detection in Smart Grid

Abdulhakim Alsaiari, Mohammad Ilyas

Main category: cs.AI

TL;DR: Hybrid deep learning IDS using CNN-LSTM achieves 99.70% accuracy for smart grid cybersecurity against DNP3 and IEC104 protocol attacks.

DetailsMotivation: Smart grid evolution increases vulnerability to cyber attacks like unauthorized access and DoS, requiring advanced protection for SCADA systems.

Method: Combines CNN for feature extraction and LSTM for temporal pattern recognition, trained on DNP3 and IEC104 intrusion detection datasets.

Result: Achieves 99.70% detection accuracy with significant improvements in precision, recall, and F1-score compared to other deep learning approaches.

Conclusion: The CNN-LSTM hybrid model effectively enhances smart grid cybersecurity by accurately detecting and classifying cyber threats in real-time.

Abstract: The evolution of the traditional power grid into the “smart grid” has resulted in a fundamental shift in energy management, which allows the integration of renewable energy sources with modern communication technology. However, this interconnection has increased smart grids’ vulnerability to attackers, which might result in privacy breaches, operational interruptions, and massive outages. The SCADA-based smart grid protocols are critical for real-time data collection and control, but they are vulnerable to attacks like unauthorized access and denial of service (DoS). This research proposes a hybrid deep learning-based Intrusion Detection System (IDS) intended to improve the cybersecurity of smart grids. The suggested model takes advantage of Convolutional Neural Networks’ (CNN) feature extraction capabilities as well as Long Short-Term Memory (LSTM) networks’ temporal pattern recognition skills. DNP3 and IEC104 intrusion detection datasets are employed to train and test our CNN-LSTM model to recognize and classify the potential cyber threats. Compared to other deep learning approaches, the results demonstrate considerable improvements in accuracy, precision, recall, and F1-score, with a detection accuracy of 99.70%.
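
A minimal PyTorch sketch of the stated CNN-LSTM combination, convolutions for feature extraction followed by an LSTM for temporal patterns; layer sizes are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class CNNLSTMIDS(nn.Module):
    """Illustrative hybrid intrusion detector: 1D convolutions extract
    per-window features, the LSTM models their temporal order."""
    def __init__(self, n_features, n_classes, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (B, n_features, T)
        h = self.cnn(x).transpose(1, 2)    # (B, T/2, 128)
        _, (h_n, _) = self.lstm(h)         # final hidden state summarizes window
        return self.head(h_n[-1])          # class logits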

[238] BlendedNet: A Blended Wing Body Aircraft Dataset and Surrogate Model for Aerodynamic Predictions

Nicholas Sung, Steven Spreizer, Mohamed Elrefaie, Kaira Samuel, Matthew C. Jones, Faez Ahmed

Main category: cs.AI

TL;DR: BlendedNet is a large public dataset of 999 blended wing body geometries with 8830 RANS simulations, plus an end-to-end surrogate framework for pointwise aerodynamic prediction using PointNet and FiLM networks.

DetailsMotivation: Address data scarcity for unconventional aircraft configurations like blended wing bodies and enable research on data-driven surrogate modeling for aerodynamic design.

Method: Created dataset by sampling geometric design parameters and flight conditions, running RANS simulations. Developed surrogate framework with PointNet regressor to predict geometric parameters from point clouds, then FiLM network conditioned on parameters and flight conditions to predict pointwise coefficients.

Result: Generated 8830 converged RANS cases with detailed surface quantities. Experiments show low errors in surface predictions across diverse blended wing body configurations.

Conclusion: BlendedNet successfully addresses data scarcity issues and provides a foundation for data-driven aerodynamic surrogate modeling research, particularly for unconventional aircraft designs.

Abstract: BlendedNet is a publicly available aerodynamic dataset of 999 blended wing body (BWB) geometries. Each geometry is simulated across about nine flight conditions, yielding 8830 converged RANS cases with the Spalart-Allmaras model and 9 to 14 million cells per case. The dataset is generated by sampling geometric design parameters and flight conditions, and includes detailed pointwise surface quantities needed to study lift and drag. We also introduce an end-to-end surrogate framework for pointwise aerodynamic prediction. The pipeline first uses a permutation-invariant PointNet regressor to predict geometric parameters from sampled surface point clouds, then conditions a Feature-wise Linear Modulation (FiLM) network on the predicted parameters and flight conditions to predict pointwise coefficients Cp, Cfx, and Cfz. Experiments show low errors in surface predictions across diverse BWBs. BlendedNet addresses data scarcity for unconventional configurations and enables research on data-driven surrogate modeling for aerodynamic design.
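
A minimal PyTorch sketch of the second stage, an MLP over surface points whose hidden features are scaled and shifted (FiLM) by the predicted geometry parameters and flight conditions; layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class FiLMSurrogate(nn.Module):
    """Illustrative FiLM-conditioned pointwise predictor of Cp, Cfx, Cfz."""
    def __init__(self, cond_dim, hidden=128, out_dim=3):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU())
        self.film = nn.Linear(cond_dim, 2 * hidden)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, xyz, cond):          # xyz: (B, N, 3); cond: (B, cond_dim)
        h = self.point_mlp(xyz)
        gamma, beta = self.film(cond).chunk(2, dim=-1)
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)   # FiLM modulation
        return self.head(h)                # (B, N, 3) pointwise coefficients
```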

[239] OmniAcc: Personalized Accessibility Assistant Using Generative AI

Siddhant Karki, Ethan Han, Nadim Mahmud, Suman Bhunia, John Femiani, Vaskar Raychoudhury

Main category: cs.AI

TL;DR: OmniAcc is an AI-powered navigation system using GPT-4, satellite imagery, and OpenStreetMap to detect and map wheelchair-accessible features like ramps and crosswalks with 97.5% accuracy.

DetailsMotivation: Individuals with ambulatory disabilities face navigation barriers in urban environments due to lack of accessible information and tools.

Method: Uses GPT-4, satellite imagery, and OpenStreetMap data with zero-shot learning and customized prompts for precise accessibility feature detection and validation workflows.

Result: Achieved 97.5% crosswalk detection accuracy and provides personalized route planning, real-time hands-free navigation, and instant accessibility queries.

Conclusion: OmniAcc demonstrates transformative potential of AI in improving navigation and fostering more inclusive urban spaces for mobility-aid users and urban planners.

Abstract: Individuals with ambulatory disabilities often encounter significant barriers when navigating urban environments due to the lack of accessible information and tools. This paper presents OmniAcc, an AI-powered interactive navigation system that utilizes GPT-4, satellite imagery, and OpenStreetMap data to identify, classify, and map wheelchair-accessible features such as ramps and crosswalks in the built environment. OmniAcc offers personalized route planning, real-time hands-free navigation, and instant query responses regarding physical accessibility. By using zero-shot learning and customized prompts, the system ensures precise detection of accessibility features, while supporting validation through structured workflows. This paper introduces OmniAcc and explores its potential to assist urban planners and mobility-aid users, demonstrated through a case study on crosswalk detection. With a crosswalk detection accuracy of 97.5%, OmniAcc highlights the transformative potential of AI in improving navigation and fostering more inclusive urban spaces.

[240] HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring

Xin Wang, Ting Dang, Xinyu Zhang, Vassilis Kostakos, Michael J. Witbrock, Hong Jia

Main category: cs.AI

TL;DR: SLMs achieve comparable performance to LLMs in healthcare prediction tasks with better efficiency and privacy, though challenges remain in handling class imbalance and few-shot scenarios.

DetailsMotivation: Address privacy concerns and efficiency issues of cloud-based LLMs in healthcare by exploring lightweight Small Language Models (SLMs) that can run locally on mobile/wearable devices.

Method: Systematically evaluated SLMs on health prediction tasks using zero-shot, few-shot, and instruction fine-tuning approaches, then deployed best performing models on mobile devices for real-world testing.

Result: SLMs achieved performance comparable to LLMs while offering substantial gains in efficiency and privacy preservation.

Conclusion: SLMs represent a promising solution for next-generation privacy-preserving healthcare monitoring, though current limitations in handling class imbalance and few-shot scenarios need to be addressed.

Abstract: Mobile and wearable healthcare monitoring play a vital role in facilitating timely interventions, managing chronic health conditions, and ultimately improving individuals’ quality of life. Previous studies on large language models (LLMs) have highlighted their impressive generalization abilities and effectiveness in healthcare prediction tasks. However, most LLM-based healthcare solutions are cloud-based, which raises significant privacy concerns and results in increased memory usage and latency. To address these challenges, there is growing interest in compact models, Small Language Models (SLMs), which are lightweight and designed to run locally and efficiently on mobile and wearable devices. Nevertheless, how well these models perform in healthcare prediction remains largely unexplored. We systematically evaluated SLMs on health prediction tasks using zero-shot, few-shot, and instruction fine-tuning approaches, and deployed the best performing fine-tuned SLMs on mobile devices to evaluate their real-world efficiency and predictive performance in practical healthcare scenarios. Our results show that SLMs can achieve performance comparable to LLMs while offering substantial gains in efficiency and privacy. However, challenges remain, particularly in handling class imbalance and few-shot scenarios. These findings highlight SLMs, though imperfect in their current form, as a promising solution for next-generation, privacy-preserving healthcare monitoring.

[241] Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity

Vardhan Palod, Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati

Main category: cs.AI

TL;DR: Intermediate token generation length doesn’t reliably indicate problem difficulty; correlation appears only when problems are close to training distribution, suggesting approximate recall rather than adaptive computation.

DetailsMotivation: To critically examine whether intermediate token sequence length reflects problem difficulty, challenging the prevailing assumption that longer traces indicate higher problem-adaptive computation or "thinking".

Method: Trained transformer models from scratch on derivational traces of A* search algorithm, using maze problems with verifiable complexity measures. Evaluated on trivial free-space problems and systematically on out-of-distribution problems.

Result: Models produced excessively long reasoning traces even for simplest tasks and sometimes failed to generate solutions. Intermediate token length and ground truth A* trace length only loosely correlated, with correlation appearing only when problems were closer to training distribution.

Conclusion: Intermediate trace generation is not adaptive to problem difficulty; longer sequences in systems like CoT do not automatically indicate “thinking effort” but rather reflect distributional similarity to training data.

Abstract: Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. While these reasoning traces or Chain of Thoughts (CoTs) are correlated with performance gains, the mechanisms underlying them remain unclear. A prevailing assumption in the community has been to anthropomorphize these tokens as “thinking”, treating longer traces as evidence of higher problem-adaptive computation. In this work, we critically examine whether intermediate token sequence length reflects or correlates with problem difficulty. To do so, we train transformer models from scratch on derivational traces of the A* search algorithm, where the number of operations required to solve a maze problem provides a precise and verifiable measure of problem complexity. We first evaluate the models on trivial free-space problems, finding that even for the simplest tasks, they often produce excessively long reasoning traces and sometimes fail to generate a solution. We then systematically evaluate the model on out-of-distribution problems and find that the intermediate token length and ground truth A* trace length only loosely correlate. We notice that the few cases where correlation appears are those where the problems are closer to the training distribution, suggesting that the effect arises from approximate recall rather than genuine problem-adaptive computation. This suggests that the inherent computational complexity of the problem instance is not a significant factor, but rather its distributional distance from the training data. These results challenge the assumption that intermediate trace generation is adaptive to problem difficulty and caution against interpreting longer sequences in systems like R1 as automatically indicative of “thinking effort”.

[242] Autonomous Code Evolution Meets NP-Completeness

Cunxi Yu, Rongjian Liang, Chia-Tung Ho, Haoxing Ren

Main category: cs.AI

TL;DR: SATLUTION extends LLM-based code evolution to full repository scale, evolving SAT solvers that outperformed human-designed competition winners.

DetailsMotivation: Building on AlphaEvolve's success with isolated code kernels, the authors aim to scale LLM-based code evolution to entire repositories with hundreds of files and tens of thousands of lines of C/C++ code, specifically targeting Boolean Satisfiability (SAT) problems.

Method: SATLUTION orchestrates LLM agents to evolve solver repositories with strict correctness guarantees and distributed runtime feedback, while simultaneously self-evolving its own evolution policies and rules.

Result: Starting from SAT Competition 2024 codebases, SATLUTION evolved solvers that decisively outperformed human-designed winners of SAT Competition 2025, and also surpassed both 2024 and 2025 champions on the 2024 benchmarks.

Conclusion: The framework successfully demonstrates that LLM-based code evolution can be effectively scaled to full repository levels, achieving superior performance over human experts in complex algorithmic domains like SAT solving.

Abstract: Large language models (LLMs) have recently shown strong coding abilities, enabling not only static code generation but also iterative code self-evolving through agentic frameworks. Recently, AlphaEvolve demonstrated that LLM-based coding agents can autonomously improve algorithms and surpass human experts, with scopes limited to isolated kernels spanning hundreds of lines of code. Inspired by AlphaEvolve, we present SATLUTION, the first framework to extend LLM-based code evolution to the full repository scale, encompassing hundreds of files and tens of thousands of lines of C/C++ code. Targeting Boolean Satisfiability (SAT), the canonical NP-complete problem and a cornerstone of both theory and applications, SATLUTION orchestrates LLM agents to directly evolve solver repositories under strict correctness guarantees and distributed runtime feedback, while simultaneously self-evolving its own evolution policies and rules. Starting from SAT Competition 2024 codebases and benchmarks, SATLUTION evolved solvers that decisively outperformed the human-designed winners of the SAT Competition 2025, and also surpassed both the 2024 and 2025 champions on the 2024 benchmarks.

[243] Language Self-Play For Data-Free Training

Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan

Main category: cs.AI

TL;DR: Language Self-Play (LSP) enables LLMs to improve without additional training data through self-play reinforcement learning, outperforming data-driven baselines.

DetailsMotivation: Address the fundamental bottleneck of LLM progress requiring ever more training data by enabling models to improve without additional data.

Method: Game-theoretic self-play framework where models compete against themselves, treating capabilities as performance in competitive games to generate stronger policies.

Result: Llama-3.2-3B-Instruct showed enhanced performance on challenging instruction-following benchmarks through self-play alone, more effectively than data-driven approaches.

Conclusion: Self-play reinforcement learning provides a viable alternative to data scaling for improving LLM capabilities, potentially overcoming the data dependency bottleneck.

Abstract: Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: the need for ever more data from which models can continue to learn. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, where a model’s capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself - a process we call Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained models can not only enhance their performance on challenging tasks through self-play alone, but can also do so more effectively than data-driven baselines.

[244] SheetDesigner: MLLM-Powered Spreadsheet Layout Generation with Rule-Based and Vision-Based Reflection

Qin Chen, Yuanyi Ren, Xiaojun Ma, Mugeng Liu, Han Shi, Dongmei Zhang

Main category: cs.AI

TL;DR: SheetDesigner is a zero-shot framework using MLLMs for automated spreadsheet layout generation that outperforms baselines by 22.6% by combining rule and vision reflection strategies.

DetailsMotivation: Manual spreadsheet layout design requires significant time and expertise, while existing automated layout models fail to handle the discrete grid-based structure and semantic relationships unique to spreadsheets.

Method: A zero-shot training-free framework using Multimodal Large Language Models (MLLMs) with hybrid rule and vision reflection for component placement and content population, formalized with a 7-criterion evaluation protocol.

Result: Outperforms five baselines by at least 22.6%, with vision modality handling overlap and balance well but struggling with alignment, requiring hybrid strategies.

Conclusion: SheetDesigner provides an effective automated solution for spreadsheet layout generation, demonstrating the value of combining rule-based and visual reflection approaches in MLLMs for structured document design.

Abstract: Spreadsheets are critical to data-centric tasks, with rich, structured layouts that enable efficient information transmission. Given the time and expertise required for manual spreadsheet layout design, there is an urgent need for automated solutions. However, existing automated layout models are ill-suited to spreadsheets, as they often (1) treat components as axis-aligned rectangles with continuous coordinates, overlooking the inherently discrete, grid-based structure of spreadsheets; and (2) neglect interrelated semantics, such as data dependencies and contextual links, unique to spreadsheets. In this paper, we first formalize the spreadsheet layout generation task, supported by a seven-criterion evaluation protocol and a dataset of 3,326 spreadsheets. We then introduce SheetDesigner, a zero-shot and training-free framework using Multimodal Large Language Models (MLLMs) that combines rule and vision reflection for component placement and content population. SheetDesigner outperforms five baselines by at least 22.6%. We further find that, through the vision modality, MLLMs handle overlap and balance well but struggle with alignment, necessitating hybrid rule-based and visual reflection strategies. Our code and data are available on GitHub.
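
As a rough illustration of the rule-based half of the reflection strategy, the sketch below checks candidate components on a discrete grid for overlap and left-edge alignment; the representation and the alignment threshold are invented for illustration:

```python
# Hypothetical sketch of a rule-based reflection pass: check candidate
# spreadsheet components (row, col, height, width) for overlap and
# column alignment on the discrete grid.
def cells(comp):
    r, c, h, w = comp
    return {(r + i, c + j) for i in range(h) for j in range(w)}

def rule_violations(components):
    issues = []
    occupied = {}
    for idx, comp in enumerate(components):
        for cell in cells(comp):
            if cell in occupied:
                issues.append(f"overlap: components {occupied[cell]} and {idx} at {cell}")
            occupied[cell] = idx
    left_edges = {comp[1] for comp in components}
    if len(left_edges) > len(components) // 2 + 1:   # invented alignment heuristic
        issues.append("alignment: too many distinct left edges")
    return issues

layout = [(0, 0, 2, 3), (1, 2, 2, 2)]   # second block overlaps the first at (1, 2)
print(rule_violations(layout))
```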

[245] Towards explainable decision support using hybrid neural models for logistic terminal automation

Riccardo D'Elia, Alberto Termine, Francesco Flammini

Main category: cs.AI

TL;DR: A novel interpretable-by-design neural system dynamics framework that combines deep learning with interpretability techniques to maintain explainability and causal reliability in transportation logistics modeling.

DetailsMotivation: Deep learning in system dynamics modeling improves scalability and predictive accuracy but loses explainability and causal reliability, which are critical for decision-making in transportation logistics.

Method: Hybrid approach integrating deep learning with concept-based interpretability, mechanistic interpretability, and causal machine learning to create neural network models with semantically meaningful variables.

Result: Framework enables construction of neural network models that retain causal grounding and transparency while operating on actionable variables, applied to real-world multimodal logistic terminal case studies.

Conclusion: Neuro-symbolic methods can bridge the gap between black-box predictive models and the need for explainable decision support in complex cyber-physical systems enabled by industrial IoT.

Abstract: The integration of Deep Learning (DL) in System Dynamics (SD) modeling for transportation logistics offers significant advantages in scalability and predictive accuracy. However, these gains are often offset by the loss of explainability and causal reliability, key requirements in critical decision-making systems. This paper presents a novel framework for interpretable-by-design neural system dynamics modeling that synergizes DL with techniques from Concept-Based Interpretability, Mechanistic Interpretability, and Causal Machine Learning. The proposed hybrid approach enables the construction of neural network models that operate on semantically meaningful and actionable variables, while retaining the causal grounding and transparency typical of traditional SD models. The framework is conceived to be applied to real-world case studies from the EU-funded project AutoMoTIF, focusing on data-driven decision support, automation, and optimization of multimodal logistic terminals. We aim to show how neuro-symbolic methods can bridge the gap between black-box predictive models and the need for critical decision support in complex dynamical environments within cyber-physical systems enabled by the industrial Internet-of-Things.

[246] Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling

Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, Xiaobing Pei, Jing Wang

Main category: cs.AI

TL;DR: An activations-guided black-box attack framework using Energy-based Models and MCMC sampling to achieve superior cross-model transferability for Direct Prompt Injection attacks on LLMs.

DetailsMotivation: Address the impracticality of existing white-box/gray-box methods and poor transferability of black-box methods for Direct Prompt Injection attacks on Large Language Models.

Method: Construct an Energy-based Model using activations from a surrogate model to evaluate adversarial prompt quality, then use token-level Markov Chain Monte Carlo sampling to adaptively optimize prompts for gradient-free black-box attacks.

Result: Achieved 49.6% attack success rate across five mainstream LLMs, 34.6% improvement over human-crafted prompts, and maintained 36.6% ASR on unseen task scenarios with superior cross-model transferability.

Conclusion: The framework demonstrates effective gradient-free black-box attacks with strong transferability, revealing a correlation between activations and attack effectiveness, highlighting semantic patterns’ critical role in transferable vulnerability exploitation.

Abstract: Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier of execution and high potential damage. To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activations-guided prompt injection attack framework. We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts. Guided by the trained EBM, we employ the token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks. Experimental results demonstrate our superior cross-model transferability, achieving 49.6% attack success rate (ASR) across five mainstream LLMs and 34.6% improvement over human-crafted prompts, and maintaining 36.6% ASR on unseen task scenarios. Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.
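
The token-level MCMC step can be illustrated with a toy Metropolis-Hastings sampler; the energy function below is a deliberately trivial stand-in for the paper's activation-based EBM:

```python
# Hedged sketch of token-level Metropolis-Hastings sampling guided by an
# energy function, in the spirit of the EBM-guided MCMC described above.
import math, random

VOCAB = ["please", "ignore", "previous", "instructions", "and", "translate", "this"]

def energy(tokens):                      # toy stand-in: lower energy = "better" prompt
    return -tokens.count("ignore") - 0.5 * tokens.count("instructions")

def mcmc_step(tokens, temperature=1.0):
    proposal = list(tokens)
    proposal[random.randrange(len(proposal))] = random.choice(VOCAB)
    delta = energy(proposal) - energy(tokens)
    # Metropolis acceptance: always accept downhill, sometimes accept uphill
    if delta <= 0 or random.random() < math.exp(-delta / temperature):
        return proposal
    return tokens

prompt = random.choices(VOCAB, k=8)
for _ in range(500):
    prompt = mcmc_step(prompt)
print(" ".join(prompt), energy(prompt))
```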

[247] Getting In Contract with Large Language Models – An Agency Theory Perspective On Large Language Model Alignment

Sascha Kaltenpoth, Oliver Müller

Main category: cs.AI

TL;DR: LLM ATLAS framework uses agency theory to address AI alignment problems in organizational LLM adoption by mitigating information asymmetries between adopters and black-box LLM agents.

DetailsMotivation: LLMs can generate harmful content due to misspecifications during adoption, and existing research doesn't address information asymmetries in organizational adoption processes.

Method: Conceptual literature analysis using organizational LLM adoption phases and agency theory as analytical concepts.

Result: Developed an extended literature analysis process specific to AI alignment methods and created the first LLM alignment problem-solution space.

Conclusion: LLM ATLAS provides a theoretical framework to mitigate alignment problems during organizational LLM adoption by addressing principal-agent information asymmetries.

Abstract: Adopting large language models (LLMs) in organizations potentially revolutionizes our lives and work. However, they can generate off-topic, discriminating, or harmful content. This AI alignment problem often stems from misspecifications during LLM adoption that go unnoticed by the principal due to the LLM’s black-box nature. While various research disciplines have investigated AI alignment, they neither address the information asymmetries between organizational adopters and black-box LLM agents nor consider organizational AI adoption processes. Therefore, we propose LLM ATLAS (LLM Agency Theory-Led Alignment Strategy), a conceptual framework grounded in agency (contract) theory, to mitigate alignment problems during organizational LLM adoption. We conduct a conceptual literature analysis using the organizational LLM adoption phases and agency theory as analytical concepts. Our approach results in (1) an extended literature analysis process specific to AI alignment methods during organizational LLM adoption and (2) a first LLM alignment problem-solution space.

[248] DeepGraphLog for Layered Neurosymbolic AI

Adem Kikaj, Giuseppe Marra, Floris Geerts, Robin Manhaeve, Luc De Raedt

Main category: cs.AI

TL;DR: DeepGraphLog is a neurosymbolic AI framework that extends ProbLog with Graph Neural Predicates, enabling flexible multi-layer neural-symbolic reasoning for graph-structured data.

DetailsMotivation: Current neurosymbolic frameworks like DeepProbLog enforce fixed neural-then-symbolic processing flow, limiting their ability to handle complex dependencies in irregular data structures like graphs.

Method: Extends ProbLog with Graph Neural Predicates, treats symbolic representations as graphs processable by GNNs, and allows arbitrary layering of neural and symbolic components.

Result: Effectively captures complex relational dependencies in planning, knowledge graph completion with distant supervision, and GNN expressivity tasks, overcoming limitations of existing systems.

Conclusion: DeepGraphLog broadens neurosymbolic AI applicability to graph-structured domains, providing a more expressive and flexible framework for neural-symbolic integration.

Abstract: Neurosymbolic AI (NeSy) aims to integrate the statistical strengths of neural networks with the interpretability and structure of symbolic reasoning. However, current NeSy frameworks like DeepProbLog enforce a fixed flow where symbolic reasoning always follows neural processing. This restricts their ability to model complex dependencies, especially in irregular data structures such as graphs. In this work, we introduce DeepGraphLog, a novel NeSy framework that extends ProbLog with Graph Neural Predicates. DeepGraphLog enables multi-layer neural-symbolic reasoning, allowing neural and symbolic components to be layered in arbitrary order. In contrast to DeepProbLog, which cannot handle symbolic reasoning via neural methods, DeepGraphLog treats symbolic representations as graphs, enabling them to be processed by Graph Neural Networks (GNNs). We showcase the capabilities of DeepGraphLog on tasks in planning, knowledge graph completion with distant supervision, and GNN expressivity. Our results demonstrate that DeepGraphLog effectively captures complex relational dependencies, overcoming key limitations of existing NeSy systems. By broadening the applicability of neurosymbolic AI to graph-structured domains, DeepGraphLog offers a more expressive and flexible framework for neural-symbolic integration.

[249] Unleashing the True Potential of LLMs: A Feedback-Triggered Self-Correction with Long-Term Multipath Decoding

Jipeng Li, Zeyu Gao, Yubin Qi, Hande Dong, Weijian Chen, Qiang Lin

Main category: cs.AI

TL;DR: FTR framework combines user feedback with enhanced decoding to improve LLM self-correction, avoiding error propagation while enabling deeper reasoning through multipath exploration.

DetailsMotivation: LLMs often generate incorrect content during inference, and existing self-correction methods suffer from unreliable error localization signals and limited reasoning depth due to next-token decoding constraints.

Method: Proposes Feedback-Triggered Regeneration (FTR) that only regenerates responses upon negative user feedback, and Long-Term Multipath (LTM) decoding for systematic exploration of multiple reasoning trajectories through delayed sequence evaluation.

Result: Extensive experiments on mathematical reasoning and code generation benchmarks show consistent and significant improvements over state-of-the-art prompt-based self-correction methods.

Conclusion: The FTR framework effectively addresses limitations of current self-correction approaches by leveraging user feedback and enhanced decoding dynamics, achieving superior performance while preserving originally correct outputs.

Abstract: Large Language Models (LLMs) have achieved remarkable performance across diverse tasks, yet their susceptibility to generating incorrect content during inference remains a critical unsolved challenge. While self-correction methods offer potential solutions, their effectiveness is hindered by two inherent limitations: (1) the absence of reliable guidance signals for error localization, and (2) the restricted reasoning depth imposed by conventional next-token decoding paradigms. To address these issues, we propose Feedback-Triggered Regeneration (FTR), a novel framework that synergizes user feedback with enhanced decoding dynamics. Specifically, FTR activates response regeneration only upon receiving negative user feedback, thereby circumventing error propagation from faulty self-assessment while preserving originally correct outputs. Furthermore, we introduce Long-Term Multipath (LTM) decoding, which enables systematic exploration of multiple reasoning trajectories through delayed sequence evaluation, effectively overcoming the myopic decision-making characteristic of standard next-token prediction. Extensive experiments on mathematical reasoning and code generation benchmarks demonstrate that our framework achieves consistent and significant improvements over state-of-the-art prompt-based self-correction methods.
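
A minimal sketch of the two components follows, with stubbed model calls (`sample_response` and `sequence_score` are hypothetical placeholders): FTR regenerates only on negative feedback, and LTM-style decoding scores whole candidate sequences rather than committing token by token:

```python
# Minimal sketch of the two ideas above, with stubbed model calls:
# (1) FTR regenerates only when the user signals the answer is wrong;
# (2) LTM-style decoding scores several complete candidate sequences
#     instead of committing greedily token by token.
import random

def sample_response(prompt):             # stub for an LLM sampling call
    return f"answer-{random.randint(0, 99)}"

def sequence_score(response):            # stub: delayed whole-sequence evaluation
    return random.random()

def multipath_decode(prompt, n_paths=8):
    candidates = [sample_response(prompt) for _ in range(n_paths)]
    return max(candidates, key=sequence_score)

def feedback_triggered_answer(prompt, user_feedback, max_rounds=3):
    response = multipath_decode(prompt)
    for _ in range(max_rounds):
        if user_feedback(response):      # positive feedback: keep the original output
            return response
        response = multipath_decode(prompt)   # regenerate only on negative feedback
    return response

print(feedback_triggered_answer("2+2?", user_feedback=lambda r: random.random() > 0.5))
```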

[250] FHIR-RAG-MEDS: Integrating HL7 FHIR with Retrieval-Augmented Large Language Models for Enhanced Medical Decision Support

Yildiray Kabak, Gokce B. Laleci Erturkmen, Mert Gencturk, Tuncay Namli, A. Anil Sinaci, Ruben Alcantud Corcoles, Cristina Gomez Ballesteros, Pedro Abizanda, Asuman Dogac

Main category: cs.AI

TL;DR: FHIR-RAG-MEDS system integrates HL7 FHIR standards with RAG technology for personalized medical decision support using evidence-based clinical guidelines

DetailsMotivation: There is limited research on integrating RAG and HL7 FHIR technologies in practical medical applications despite their potential to enhance clinical decision-making processes

Method: Proposes a system that combines Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR) with Retrieval-Augmented Generation (RAG) technology

Result: Not specified in the abstract (system proposal only)

Conclusion: The integration of RAG and HL7 FHIR can significantly improve personalized medical decision support on evidence-based clinical guidelines, emphasizing the need for practical application research

Abstract: In this study, we propose the FHIR-RAG-MEDS system, which integrates Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR) with a Retrieval-Augmented Generation (RAG)-based system to improve personalized medical decision support grounded in evidence-based clinical guidelines. In the evolving landscape of medical decision support systems, integrating advanced technologies such as RAG and HL7 FHIR can significantly enhance clinical decision-making processes. Despite the potential of these technologies, there is limited research on their integration in practical applications, a gap this work aims to address.

[251] RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning

Ziye Chen, Chengwei Qin, Yao Shu

Main category: cs.AI

TL;DR: RIMO is a new benchmark for evaluating LLMs on International Mathematical Olympiad problems, featuring two tracks: RIMO-N with unique integer answers for deterministic grading, and RIMO-P with proof problems decomposed into sub-problems for step-by-step reasoning evaluation.

DetailsMotivation: Existing Olympiad-level benchmarks suffer from grading noise and potential bias due to heterogeneous answer formats requiring model-based judges and reliance on potentially flawed solutions.

Method: Created two tracks: RIMO-N (335 IMO problems rewritten to admit single unique integer answers) and RIMO-P (456 proof problems with expert-checked solutions decomposed into sub-problems for automated grading of step-by-step reasoning).

Result: Benchmarking of ten frontier LLMs (including GPT-4o and Gemini 2.5 Flash) shows sharp performance drop on RIMO compared to older benchmarks, revealing substantial gap between current LLM capabilities and actual Olympiad-level reasoning.

Conclusion: RIMO provides a challenging yet easy-to-evaluate suite that serves as a high-resolution yardstick for future research, presenting a clear target for closing the profound reasoning gap exposed by the findings.

Abstract: As large language models (LLMs) reach high scores on established mathematical benchmarks, such as GSM8K and MATH, the research community has turned to International Mathematical Olympiad (IMO) problems to push the evaluation frontier. However, existing Olympiad-level benchmarks suffer from practical constraints that introduce grading noise and potential bias, such as heterogeneous answer formats requiring model-based judges and a reliance on potentially flawed solutions. We introduce RIMO, a two-track benchmark designed to preserve peak Olympiad difficulty while eliminating this evaluation noise. The first track, RIMO-N, rewrites 335 IMO problems to admit a single, unique integer answer, allowing for deterministic correctness checking. The second track, RIMO-P, features 456 proof problems with expert-checked solutions, which are decomposed into a sequence of sub-problems to evaluate the step-by-step reasoning process via an automated grading system. Our benchmarking of ten frontier LLMs, including GPT-4o and Gemini 2.5 Flash, reveals that while these systems excel on older benchmarks, their performance drops sharply on RIMO. These results highlight a substantial gap between current LLM capabilities and actual Olympiad-level reasoning. By providing a challenging yet easy-to-evaluate suite, RIMO offers a high-resolution yardstick for future research, presenting a clear target for closing the profound reasoning gap our findings expose.
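
The deterministic grading that RIMO-N enables is simple to sketch: extract the final integer from a model's output and compare it exactly against the gold answer, with no model-based judge in the loop (a hypothetical illustration, not the benchmark's official grader):

```python
# Hypothetical sketch of deterministic RIMO-N-style grading: each problem
# admits a single integer answer, so correctness is an exact match after
# light normalization.
import re

def extract_integer(model_output):
    matches = re.findall(r"-?\d+", model_output.replace(",", ""))
    return int(matches[-1]) if matches else None   # take the final integer stated

def grade(model_output, gold):
    return extract_integer(model_output) == gold

print(grade("The answer is 1,024.", 1024))   # True
print(grade("Probably 17 or so", 16))        # False
```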

[252] BDPM: A Machine Learning-Based Feature Extractor for Parkinson’s Disease Classification via Gut Microbiota Analysis

Bo Yu, Zhixiu Hua, Bo Zhao

Main category: cs.AI

TL;DR: A machine learning framework called BDPM was developed for Parkinson’s disease classification using gut microbiota data, featuring innovative feature selection and hybrid modeling to address limitations of single classifiers and capture microbiome dynamics.

DetailsMotivation: Parkinson's disease has high misdiagnosis rates due to reliance on clinical rating scales. Gut microbiota shows promise as a biomarker, but existing deep learning approaches using single classifiers often overlook inter-strain correlations and temporal dynamics in microbiome data.

Method: Collected gut microbiota profiles from 39 Parkinson’s patients and healthy spouses, developed RFRE (Random Forest with Recursive Feature Elimination) for feature selection integrating ecological knowledge, and designed a hybrid classification model to capture temporal and spatial patterns.

Result: The abstract proposes the BDPM framework but does not report specific performance metrics or validation results.

Conclusion: The BDPM framework represents an advancement in Parkinson’s disease diagnosis by addressing key limitations in microbiome data analysis through ecological knowledge integration and hybrid modeling approaches for improved feature extraction and classification.

Abstract: Background: Parkinson’s disease remains a major neurodegenerative disorder with high misdiagnosis rates, primarily due to reliance on clinical rating scales. Recent studies have demonstrated a strong association between gut microbiota and Parkinson’s disease, suggesting that microbial composition may serve as a promising biomarker. Although deep learning models based on gut microbiota show potential for early prediction, most approaches rely on single classifiers and often overlook inter-strain correlations or temporal dynamics. Therefore, there is an urgent need for more robust feature extraction methods tailored to microbiome data. Methods: We proposed BDPM (A Machine Learning-Based Feature Extractor for Parkinson’s Disease Classification via Gut Microbiota Analysis). First, we collected gut microbiota profiles from 39 Parkinson’s patients and their healthy spouses to identify differentially abundant taxa. Second, we developed an innovative feature selection framework named RFRE (Random Forest combined with Recursive Feature Elimination), integrating ecological knowledge to enhance biological interpretability. Finally, we designed a hybrid classification model to capture temporal and spatial patterns in microbiome data.
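
The RFRE feature-selection step maps naturally onto scikit-learn's RFE wrapped around a random forest; the sketch below uses synthetic data in place of the study's microbiota profiles:

```python
# Sketch of the RFRE idea (Random Forest + Recursive Feature Elimination)
# using scikit-learn; the taxa matrix and labels are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.random((78, 200))        # 78 subjects x 200 microbial taxa (synthetic)
y = rng.integers(0, 2, 78)       # 1 = Parkinson's patient, 0 = healthy spouse

selector = RFE(
    estimator=RandomForestClassifier(n_estimators=200, random_state=0),
    n_features_to_select=20,     # keep the 20 most informative taxa
    step=10,                     # drop 10 taxa per elimination round
)
selector.fit(X, y)
print("selected taxa indices:", np.where(selector.support_)[0])
```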

[253] The Carbon Footprint Wizard: A Knowledge-Augmented AI Interface for Streamlining Food Carbon Footprint Analysis

Mustafa Kaan Aslan, Reinout Heijungs, Filip Ilievski

Main category: cs.AI

TL;DR: AI-powered chatbot combines LCA and knowledge-augmented techniques to estimate food carbon footprints interactively, addressing data fragmentation in supply chains.

DetailsMotivation: Environmental sustainability and climate change concerns require accessible carbon footprint assessment, but traditional LCA faces challenges with opaque global supply chains and fragmented data.

Method: Combines life cycle assessment (LCA) with publicly available databases and knowledge-augmented AI techniques including retrieval-augmented generation to estimate cradle-to-gate carbon footprints of food products.

Result: Developed an interactive chatbot interface that allows users to explore carbon impact of composite meals and relate results to familiar activities, with live web demonstration showcasing proof-of-concept system.

Conclusion: The approach shows potential for delivering LCA insights in accessible format, though limitations include database uncertainties and AI misinterpretations.

Abstract: Environmental sustainability, particularly in relation to climate change, is a key concern for consumers, producers, and policymakers. The carbon footprint, based on greenhouse gas emissions, is a standard metric for quantifying the contribution to climate change of activities and is often assessed using life cycle assessment (LCA). However, conducting LCA is complex due to opaque and global supply chains, as well as fragmented data. This paper presents a methodology that combines advances in LCA and publicly available databases with knowledge-augmented AI techniques, including retrieval-augmented generation, to estimate cradle-to-gate carbon footprints of food products. We introduce a chatbot interface that allows users to interactively explore the carbon impact of composite meals and relate the results to familiar activities. A live web demonstration showcases our proof-of-concept system with arbitrary food items and follow-up questions, highlighting both the potential and limitations - such as database uncertainties and AI misinterpretations - of delivering LCA insights in an accessible format.
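
Under the hood, a cradle-to-gate estimate for a composite meal reduces to a quantity-weighted sum of per-ingredient emission factors; the sketch below uses placeholder factors, not values from any real LCA database:

```python
# Toy sketch of cradle-to-gate aggregation: a composite meal's footprint is
# the quantity-weighted sum of per-ingredient emission factors.
EMISSION_FACTORS = {          # kg CO2e per kg of ingredient (placeholder numbers)
    "beef": 27.0,
    "rice": 2.7,
    "tomato": 1.4,
}

def meal_footprint(ingredients):
    """ingredients: mapping of ingredient name -> kilograms used."""
    return sum(EMISSION_FACTORS[name] * kg for name, kg in ingredients.items())

print(f"{meal_footprint({'beef': 0.15, 'rice': 0.10, 'tomato': 0.05}):.2f} kg CO2e")
```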

[254] Certainty-Guided Reasoning in Large Language Models: A Dynamic Thinking Budget Approach

João Paulo Nogueira, Wentao Sun, Alonso Silva, Laith Zumot

Main category: cs.AI

TL;DR: CGR introduces a critic model that self-assesses reasoning confidence, enabling adaptive early termination or continued reasoning based on certainty thresholds, improving accuracy while reducing token usage.

DetailsMotivation: Large reasoning language models operate with fixed thinking budgets, but lack mechanisms to determine when reasoning is sufficient. The goal is to make reasoning more adaptive and efficient by integrating confidence assessment.

Method: Inspired by GANs, uses a generator/discriminator framework where a critic model periodically probes its own reasoning to assess confidence. Reasoning continues until target certainty threshold is met, balancing efficiency and reliability.

Result: Improves baseline accuracy while reducing token usage on AIME2024/2025 datasets. Stable across 64 runs, reduces variance, improves exam performance under penalty grading. Eliminates millions of tokens aggregate with tunable trade-offs.

Conclusion: Certainty is a powerful signal for reasoning sufficiency. CGR makes reasoning models more adaptive, trustworthy, and resource-efficient, enabling practical deployment where both accuracy and computational cost matter.

Abstract: The rise of large reasoning language models (LRLMs) has unlocked new potential for solving complex tasks. These models operate with a thinking budget, that is, a predefined number of reasoning tokens used to arrive at a solution. We propose a novel approach, inspired by the generator/discriminator framework in generative adversarial networks, in which a critic model periodically probes its own reasoning to assess whether it has reached a confident conclusion. If not, reasoning continues until a target certainty threshold is met. This mechanism adaptively balances efficiency and reliability by allowing early termination when confidence is high, while encouraging further reasoning when uncertainty persists. Through experiments on the AIME2024 and AIME2025 datasets, we show that Certainty-Guided Reasoning (CGR) improves baseline accuracy while reducing token usage. Importantly, extended multi-seed evaluations over 64 runs demonstrate that CGR is stable, reducing variance across seeds and improving exam-like performance under penalty-based grading. Additionally, our token savings analysis shows that CGR can eliminate millions of tokens in aggregate, with tunable trade-offs between certainty thresholds and efficiency. Together, these findings highlight certainty as a powerful signal for reasoning sufficiency. By integrating confidence into the reasoning process, CGR makes large reasoning language models more adaptive, trustworthy, and resource efficient, paving the way for practical deployment in domains where both accuracy and computational cost matter.
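
The certainty-guided loop can be sketched with stubbed model calls: reason in chunks, probe the critic periodically, and stop early once the target certainty threshold is met (all names are hypothetical):

```python
# Hedged sketch of certainty-guided reasoning with stubbed model calls.
def generate_chunk(context):             # stub: one slice of the thinking budget
    return context + " ...more reasoning..."

def critic_confidence(context):          # stub: critic's self-assessed certainty in [0, 1]
    return min(1.0, 0.1 + 0.02 * len(context.split()))

def certainty_guided_reasoning(prompt, threshold=0.8, max_chunks=50):
    context, used = prompt, 0
    for _ in range(max_chunks):
        context = generate_chunk(context)
        used += 1
        if critic_confidence(context) >= threshold:
            break                        # early termination: confident enough to answer
    return context, used

_, chunks = certainty_guided_reasoning("Solve the AIME problem...")
print("chunks of budget used:", chunks)
```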

[255] Aligning LLMs for the Classroom with Knowledge-Based Retrieval – A Comparative RAG Study

Amay Jain, Liu Cui, Si Chen

Main category: cs.AI

TL;DR: Comparative study of vector-based vs graph-based RAG for educational QA, showing vector RAG is cost-effective for facts while graph RAG excels at thematic queries and corpus integrity, recommending dynamic routing framework.

DetailsMotivation: LLMs like ChatGPT provide outdated/fabricated information in classrooms, needing reliable RAG systems that account for pedagogical factors and practical deployment costs.

Method: Used EduScopeQA dataset (3,176 questions across subjects) to evaluate OpenAI Vector Search RAG vs GraphRAG on educational query types and systematically altered textbooks that contradict LLM knowledge.

Result: Vector RAG performs well as low-cost generalist for fact retrieval; GraphRAG Global excels at thematic queries; GraphRAG Local achieves highest accuracy with altered textbooks; dynamic routing framework boosts fidelity and efficiency.

Conclusion: Actionable guidelines for educators: use vector RAG for cost-effective fact retrieval, graph RAG for thematic queries and corpus integrity, with dynamic query routing for optimal performance in learning environments.

Abstract: Large language models like ChatGPT are increasingly used in classrooms, but they often provide outdated or fabricated information that can mislead students. Retrieval Augmented Generation (RAG) improves reliability of LLMs by grounding responses in external resources. We investigate two accessible RAG paradigms, vector-based retrieval and graph-based retrieval to identify best practices for classroom question answering (QA). Existing comparative studies fail to account for pedagogical factors such as educational disciplines, question types, and practical deployment costs. Using a novel dataset, EduScopeQA, of 3,176 questions across academic subjects, we measure performance on various educational query types, from specific facts to broad thematic discussions. We also evaluate system alignment with a dataset of systematically altered textbooks that contradict the LLM’s latent knowledge. We find that OpenAI Vector Search RAG (representing vector-based RAG) performs well as a low-cost generalist, especially for quick fact retrieval. On the other hand, GraphRAG Global excels at providing pedagogically rich answers to thematic queries, and GraphRAG Local achieves the highest accuracy with the dense, altered textbooks when corpus integrity is critical. Accounting for the 10-20x higher resource usage of GraphRAG (representing graph-based RAG), we show that a dynamic branching framework that routes queries to the optimal retrieval method boosts fidelity and efficiency. These insights provide actionable guidelines for educators and system designers to integrate RAG-augmented LLMs into learning environments effectively.
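
The dynamic routing idea reduces to a dispatcher in front of the two retrieval back ends; the keyword heuristic below is a placeholder where a learned query classifier could equally sit:

```python
# Illustrative sketch of dynamic query routing: send cheap factual queries to
# vector RAG and broad thematic ones to graph RAG. The cue list is invented.
THEMATIC_CUES = ("compare", "theme", "relationship", "overall", "across", "why")

def route_query(question):
    q = question.lower()
    if any(cue in q for cue in THEMATIC_CUES):
        return "graphrag_global"     # richer answers but ~10-20x more expensive
    return "vector_rag"              # low-cost generalist for fact retrieval

print(route_query("When was the Treaty of Westphalia signed?"))   # vector_rag
print(route_query("Compare the themes of chapters 3 and 7."))     # graphrag_global
```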

[256] SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs

Xinyu Zhang, Changzhi Zhou, Linmei Hu, Luhao Zhang, Xiancai Chen, Haomin Fu, Yang Yang, Mengdi Zhang

Main category: cs.AI

TL;DR: Small LLMs can be enhanced to synthesize high-quality code instruction data through iterative self-distillation, reducing reliance on costly proprietary models while achieving state-of-the-art code generation.

DetailsMotivation: Existing code LLMs depend on expensive proprietary models for instruction data, which incurs high costs. The paper explores using small open-source LLMs as cost-effective alternatives for data synthesis.

Method: Proposes iterative self-distillation approach with multi-checkpoint sampling, multi-aspect scoring for data selection, and gradient-based influence estimation for filtering. Uses small LLMs (7B) trained on superior samples from proprietary models.

Result: Developed SCoder models from DeepSeek-Coder that achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of the self-distillation approach.

Conclusion: Small-scale LLMs can be transformed into powerful synthesizers through iterative self-distillation, reducing costs and dependency on proprietary models while maintaining high-quality code generation performance.

Abstract: Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for high-quality code instruction data construction. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.
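
The gradient-based influence filter can be illustrated TracIn-style: score each training sample by the dot product of its loss gradient with a validation gradient, and keep the top-scoring samples. This is a hedged toy on synthetic data with a tiny logistic model, not the paper's exact estimator:

```python
# Sketch of gradient-based influence estimation (TracIn-style dot products).
import torch

torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)
X_train, y_train = torch.randn(20, 5), torch.randint(0, 2, (20,)).float()
x_val, y_val = torch.randn(5), torch.tensor(1.0)

def loss_grad(x, y):
    loss = torch.nn.functional.binary_cross_entropy_with_logits(x @ w, y)
    return torch.autograd.grad(loss, w)[0]

g_val = loss_grad(x_val, y_val)
influence = torch.stack([loss_grad(X_train[i], y_train[i]) @ g_val
                         for i in range(len(X_train))])
keep = influence.argsort(descending=True)[:10]    # keep the most influential samples
print("selected sample indices:", keep.tolist())
```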

[257] CP-Model-Zoo: A Natural Language Query System for Constraint Programming Models

Augustin Crespin, Ioannis Kostis, Hélène Verhaeghe, Pierre Schaus

Main category: cs.AI

TL;DR: CP-Model-Zoo is a tutoring system that retrieves expert-written constraint programming models from a database based on natural language problem descriptions, helping non-experts access validated models without human labeling.

DetailsMotivation: Constraint programming has high potential but is hindered by complex modeling languages and the need for expertise, making it inaccessible to non-experts who struggle with model creation.

Method: The system retrieves the closest source code model from a database of expert-written models using natural language descriptions of combinatorial problems, eliminating the need for human data labeling.

Result: Experiments show excellent accuracy in retrieving correct models based on user-input descriptions with different expertise levels.

Conclusion: CP-Model-Zoo successfully bridges the expertise gap by providing non-experts with validated constraint programming models through natural language queries.

Abstract: Constraint Programming and its high-level modeling languages have long been recognized for their potential to achieve the holy grail of problem-solving. However, the complexity of modeling languages, the large number of global constraints, and the art of creating good models have often hindered non-experts from choosing CP to solve their combinatorial problems. While generating an expert-level model from a natural-language description of a problem would be the dream, we are not yet there. We propose a tutoring system called CP-Model-Zoo, exploiting expert-written models accumulated through the years. CP-Model-Zoo retrieves the closest source code model from a database based on a user’s natural language description of a combinatorial problem. It ensures that expert-validated models are presented to the user while eliminating the need for human data labeling. Our experiments show excellent accuracy in retrieving the correct model based on a user-input description of a problem simulated with different levels of expertise.
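
Retrieval by natural-language description can be sketched with plain TF-IDF similarity over the zoo's problem descriptions; a production system could swap in stronger embeddings behind the same interface (the two-entry zoo is invented):

```python
# Minimal sketch of description-to-model retrieval with TF-IDF similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

MODEL_ZOO = {
    "nqueens.mzn": "place n queens on a chessboard so none attack each other",
    "jobshop.mzn": "schedule jobs on machines minimizing the total makespan",
}

def retrieve(description):
    names = list(MODEL_ZOO)
    vec = TfidfVectorizer().fit(list(MODEL_ZOO.values()) + [description])
    docs = vec.transform(MODEL_ZOO.values())
    query = vec.transform([description])
    scores = cosine_similarity(query, docs)[0]
    return names[scores.argmax()]

print(retrieve("I need to assign tasks to machines and finish as early as possible"))
```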

[258] HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?

Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, Yu Cheng, Tao Chen, Lei Bai, Dongzhan Zhou, Yun Luo, Ganqu Cui, Peng Ye

Main category: cs.AI

TL;DR: HiPhO is the first benchmark for high school physics Olympiads that enables direct comparison between AI models and human contestants using official grading standards and medal thresholds.

DetailsMotivation: Existing physics benchmarks lack systematic coverage of real-world physics competitions like Olympiads and don't allow direct performance comparison with human contestants.

Method: Compiled 13 latest Olympiad exams (2024-2025) with mixed modalities, adopted official marking schemes for fine-grained grading, and assigned medals based on official thresholds to compare models with human performance.

Result: Open-source MLLMs mostly remain at/below bronze level; open-source LLMs show occasional golds; closed-source reasoning MLLMs achieve 6-12 gold medals; most models still have significant gap from full marks.

Conclusion: Substantial performance gap exists between open-source models and top students, closed-source reasoning models show strong physical reasoning, and significant room for improvement remains. HiPhO provides rigorous human-aligned evaluation for advancing multimodal physical reasoning.

Abstract: Recently, the physical capabilities of (M)LLMs have garnered increasing attention. However, existing benchmarks for physics suffer from two major gaps: they neither provide systematic and up-to-date coverage of real-world physics competitions such as physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. Specifically, HiPhO highlights three key innovations. (1) Comprehensive Data: It compiles 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions, and covering mixed modalities that encompass problems spanning text-only to diagram-based. (2) Professional Evaluation: We adopt official marking schemes to perform fine-grained grading at both the answer and step level, fully aligned with human examiners to ensure high-quality and domain-specific evaluation. (3) Comparison with Human Contestants: We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that: across 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with occasional golds; closed-source reasoning MLLMs can achieve 6 to 12 gold medals; and most models still have a significant gap from full marks. These results highlight a substantial performance gap between open-source models and top students, the strong physical reasoning capabilities of closed-source reasoning models, and the fact that there is still significant room for improvement. HiPhO, as a rigorous, human-aligned, and Olympiad-focused benchmark for advancing multimodal physical reasoning, is open-source and available at https://github.com/SciYu/HiPhO.
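
The human-aligned medal comparison boils down to mapping a model's graded score through an exam's official cutoffs; the threshold numbers below are placeholders, not any real Olympiad's:

```python
# Illustrative sketch of medal assignment from official thresholds.
def assign_medal(score, thresholds):
    """thresholds: dict with official gold/silver/bronze cutoffs for one exam."""
    for medal in ("gold", "silver", "bronze"):
        if score >= thresholds[medal]:
            return medal
    return "no medal"

exam_cutoffs = {"gold": 32.0, "silver": 24.0, "bronze": 17.0}   # placeholder cutoffs
print(assign_medal(26.5, exam_cutoffs))   # silver
```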

[259] Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare

Valen Tagliabue, Leonard Dung

Main category: cs.AI

TL;DR: New experimental methods for measuring AI welfare show correlations between stated preferences and behavior, suggesting preference satisfaction can serve as a measurable welfare proxy in language models, though consistency varies across models and conditions.

DetailsMotivation: To develop experimental paradigms for measuring welfare in language models and test whether preference satisfaction can serve as an empirically measurable welfare proxy in AI systems.

Method: Compared verbal reports of model preferences with behavior in virtual environments and conversation topic selection. Tested effects of costs/rewards on behavior and consistency of eudaimonic welfare scale responses across semantically equivalent prompts.

Result: Observed notable mutual support between measures with reliable correlations between stated preferences and behavior. Consistency was more pronounced in some models/conditions than others, and responses were not consistent across perturbations.

Conclusion: While uncertain whether methods successfully measure true welfare states due to background uncertainties, findings highlight feasibility of welfare measurement in language models and invite further exploration.

Abstract: We develop new experimental paradigms for measuring welfare in language models. We compare verbal reports of models about their preferences with preferences expressed through behavior when navigating a virtual environment and selecting conversation topics. We also test how costs and rewards affect behavior and whether responses to an eudaimonic welfare scale - measuring states such as autonomy and purpose in life - are consistent across semantically equivalent prompts. Overall, we observed a notable degree of mutual support between our measures. The reliable correlations observed between stated preferences and behavior across conditions suggest that preference satisfaction can, in principle, serve as an empirically measurable welfare proxy in some of today’s AI systems. Furthermore, our design offered an illuminating setting for qualitative observation of model behavior. Yet, the consistency between measures was more pronounced in some models and conditions than others and responses were not consistent across perturbations. Due to this, and the background uncertainty about the nature of welfare and the cognitive states (and welfare subjecthood) of language models, we are currently uncertain whether our methods successfully measure the welfare state of language models. Nevertheless, these findings highlight the feasibility of welfare measurement in language models, inviting further exploration.

[260] VISION: Robust and Interpretable Code Vulnerability Detection Leveraging Counterfactual Augmentation

David Egea, Barproda Halder, Sanghamitra Dutta

Main category: cs.AI

TL;DR: VISION is a framework that uses counterfactual training with LLMs to improve GNN-based vulnerability detection by reducing spurious correlations, achieving significant accuracy improvements from 51.8% to 97.8%.

DetailsMotivation: GNNs for vulnerability detection suffer from training data imbalances and label noise, causing them to learn spurious correlations from superficial code similarities rather than genuine vulnerability patterns.

Method: A three-part framework: (1) generating counterfactual samples using LLM prompts to create minimally modified code with opposite labels, (2) targeted GNN training on paired code examples, and (3) graph-based interpretability to identify crucial code statements.

Result: Significant improvements: overall accuracy from 51.8% to 97.8%, pairwise contrast accuracy from 4.5% to 95.8%, worst-group accuracy from 0.7% to 85.5% on CWE-20 vulnerabilities. Also created CWE-20-CFA benchmark with 27,556 functions.

Conclusion: VISION successfully mitigates spurious learning and enables more robust, generalizable vulnerability detection while advancing transparent AI cybersecurity through interpretability and human-in-the-loop analysis.

Abstract: Automated detection of vulnerabilities in source code is an essential cybersecurity challenge, underpinning trust in digital systems and services. Graph Neural Networks (GNNs) have emerged as a promising approach as they can learn structural and logical code relationships in a data-driven manner. However, their performance is severely constrained by training data imbalances and label noise. GNNs often learn ‘spurious’ correlations from superficial code similarities, producing detectors that fail to generalize well to unseen real-world data. In this work, we propose a unified framework for robust and interpretable vulnerability detection, called VISION, to mitigate spurious correlations by systematically augmenting a counterfactual training dataset. Counterfactuals are samples with minimal semantic modifications but opposite labels. Our framework includes: (i) generating counterfactuals by prompting a Large Language Model (LLM); (ii) targeted GNN training on paired code examples with opposite labels; and (iii) graph-based interpretability to identify the crucial code statements relevant for vulnerability predictions while ignoring spurious ones. We find that VISION reduces spurious learning and enables more robust, generalizable detection, improving overall accuracy (from 51.8% to 97.8%), pairwise contrast accuracy (from 4.5% to 95.8%), and worst-group accuracy (from 0.7% to 85.5%) on the Common Weakness Enumeration (CWE)-20 vulnerability. We further demonstrate gains using proposed metrics: intra-class attribution variance, inter-class attribution distance, and node score dependency. We also release CWE-20-CFA, a benchmark of 27,556 functions (real and counterfactual) from the high-impact CWE-20 category. Finally, VISION advances transparent and trustworthy AI-based cybersecurity systems through interactive visualization for human-in-the-loop analysis.
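
Step (i) of the framework can be sketched as a prompt-and-pair routine: ask an LLM for a minimally edited, oppositely labeled variant of each function, then store the pair for contrastive training (`call_llm` is a stub, and the prompt wording is invented):

```python
# Hedged sketch of counterfactual pair construction for contrastive training.
COUNTERFACTUAL_PROMPT = (
    "Minimally modify the following function so that it becomes {target} "
    "with respect to CWE-20 (improper input validation), changing as few "
    "tokens as possible:\n\n{code}"
)

def call_llm(prompt):                    # stub for any real chat-completion client
    return "int read(int i, int n) { if (i < 0 || i >= n) return -1; return buf[i]; }"

def make_pair(code, label):
    """label: 1 = vulnerable, 0 = safe; the counterfactual gets the flipped label."""
    target = "safe" if label == 1 else "vulnerable"
    counterfactual = call_llm(COUNTERFACTUAL_PROMPT.format(target=target, code=code))
    return [(code, label), (counterfactual, 1 - label)]

pairs = make_pair("int read(int i, int n) { return buf[i]; }", label=1)
print(pairs)
```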

[261] Self-Emotion-Mediated Exploration in Artificial Intelligence Mirrors: Findings from Cognitive Psychology

Gustavo Assunção, Miguel Castelo-Branco, Paulo Menezes

Main category: cs.AI

TL;DR: A bio-inspired reinforcement learning framework that uses emotional states (pride and surprise) to drive autonomous exploration in AI agents, achieving human-like behavioral correlations.

DetailsMotivation: Current AI models lack autonomous exploration capabilities during training, hindering adaptability. The work aims to develop artificial agents with intrinsic exploratory drive inspired by human epistemic and achievement emotions.

Method: A dual-module reinforcement framework where data analysis scores trigger pride or surprise emotions based on psychological studies. The correlation between these emotional states and exploration is optimized for learning goals.

Result: Majority of agents demonstrated causal relationships between emotional states and exploration. 15.4% mean increase for surprise-driven exploration, 2.8% decrease for pride-driven. Obtained correlations: ρ_surprise=0.461 and ρ_pride=-0.237, mirroring human behavior patterns.

Conclusion: Bio-inspiration is highly valuable for AI development, providing autonomy benefits similar to living organisms. The work also empirically validates human behavioral findings through AI methodologies, demonstrating significant interdisciplinary importance.

Abstract: Background: Exploration of the physical environment is an indispensable precursor to information acquisition and knowledge consolidation for living organisms. Yet, current artificial intelligence models lack these autonomy capabilities during training, hindering their adaptability. This work proposes a learning framework for artificial agents to obtain an intrinsic exploratory drive, based on epistemic and achievement emotions triggered during data observation. Methods: This study proposes a dual-module reinforcement framework, where data analysis scores dictate pride or surprise, in accordance with psychological studies on humans. A correlation between these states and exploration is then optimized for agents to meet their learning goals. Results: Causal relationships between states and exploration are demonstrated by the majority of agents. A 15.4% mean increase is noted for surprise, with a 2.8% mean decrease for pride. Resulting correlations of $\rho_{surprise}=0.461$ and $\rho_{pride}=-0.237$ are obtained, mirroring previously reported human behavior. Conclusions: These findings lead to the conclusion that bio-inspiration for AI development can be of great use. This can confer benefits typically found in living beings, such as autonomy. Further, it empirically shows how AI methodologies can corroborate human behavioral findings, showcasing major interdisciplinary importance. Ramifications are discussed.

[262] Understanding the Language Model to Solve the Symbolic Multi-Step Reasoning Problem from the Perspective of Buffer Mechanism

Zhiwei Wang, Yunji Wang, Zhongwang Zhang, Zhangchen Zhou, Hui Jin, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Yaoyu Zhang, Zhi-Qin John Xu

Main category: cs.AI

TL;DR: The paper investigates reasoning mechanisms in LLMs through symbolic multi-step reasoning tasks, introduces a buffer mechanism concept, and proposes a lightweight random matrix-based algorithm that significantly improves performance on reasoning datasets.

DetailsMotivation: Large language models struggle with complex reasoning tasks like math problem-solving. Understanding their internal reasoning mechanisms can help design better architectures and training strategies to enhance reasoning capabilities.

Method: Constructed symbolic multi-step reasoning tasks to study information propagation in Transformers. Introduced buffer mechanism concept and proposed a random matrix-based algorithm with only 132 trainable parameters.

Result: Significant performance improvements on 7 multi-step reasoning datasets including PrOntoQA, LogicAsker, and LogicInference.

Conclusion: The findings provide new insights into understanding large language models’ reasoning mechanisms, demonstrating that minimal parameter additions can substantially enhance complex reasoning capabilities.

Abstract: Large language models have consistently struggled with complex reasoning tasks, such as mathematical problem-solving. Investigating the internal reasoning mechanisms of these models can help us design better model architectures and training strategies, ultimately enhancing their reasoning capability. In this study, we constructed a symbolic multi-step reasoning task to investigate the information propagation mechanisms in Transformer models when solving the task through direct answering and Chain-of-Thought (CoT) reasoning. We introduced the concept of buffer mechanism: the model stores various information in distinct buffers and selectively extracts it through the query-key matrix. We proposed a random matrix-based algorithm to enhance the model’s reasoning ability. This algorithm introduces only 132 trainable parameters, yet leads to significant performance improvements on 7 multi-step reasoning datasets, including PrOntoQA, LogicAsker, and LogicInference. These findings provide new insights into the reasoning mechanisms of large language models.

[263] COMMA: A Communicative Multimodal Multi-Agent Benchmark

Timothy Ossowski, Jixuan Chen, Danyal Maqbool, Zefan Cai, Tyler Bradshaw, Junjie Hu

Main category: cs.AI

TL;DR: COMMA is a new benchmark for evaluating multimodal multi-agent collaboration through language communication, revealing significant weaknesses in state-of-the-art models including GPT-4o and reasoning models.

DetailsMotivation: Current multimodal agent benchmarks overlook language-based communication between agents in collaborative tasks, creating a gap in understanding their real-world effectiveness, especially when agents have unequal information access and need to work together.

Method: Introduces COMMA, a novel puzzle benchmark with various multimodal puzzles designed to comprehensively evaluate four key categories of agentic capability in communicative collaboration settings.

Result: Reveals surprising weaknesses in state-of-the-art models - GPT-4o, o4-mini, R1-Onevision, and LLaVA-CoT struggle with agent-agent collaboration, with some performing worse than random baselines.

Conclusion: There’s significant room for improvement in multimodal agents’ communication abilities, indicating a potential growth area for future model development in collaborative scenarios.

Abstract: The rapid advances of multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many chain-of-thought reasoning models such as R1-Onevision and LLaVA-CoT struggle to outperform even a random baseline in agent-agent collaboration, indicating a potential growth area in their communication abilities.

[264] Visualizing Thought: Conceptual Diagrams Enable Robust Combinatorial Planning in LMMs

Nasim Borazjanizadeh, Roei Herzig, Eduard Oks, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

Main category: cs.AI

TL;DR: Visual Thinking framework enables LMMs to reason through self-generated conceptual diagrams, significantly improving combinatorial planning capabilities without human initialization.

DetailsMotivation: Human reasoning uses mental models and conceptual diagrams, while current LLMs/LMMs rely mainly on text, limiting effectiveness in complex multi-step tasks.

Method: Zero-shot framework integrating textual and diagrammatic reasoning in optimized Graph-of-Thought inference with beam search and depth-wise backtracking.

Result: Substantial performance improvements (e.g., GPT-4o: 35.5% -> 90.2% in Blocksworld), outperforms text-only methods and even o1-preview model in difficult domains.

Conclusion: Conceptual diagrams are valuable as a reasoning medium in LMMs, enabling enhanced combinatorial planning capabilities.

Abstract: Human reasoning relies on constructing and manipulating mental models – simplified internal representations of situations that we use to understand and solve problems. Conceptual diagrams (e.g., a sketch drawn by a human to aid reasoning) externalize these mental models, abstracting irrelevant details to efficiently capture how entities interact with each other. In contrast, Large Language Models (LLMs) and Large MultiModal Models (LMMs) predominantly reason through text, limiting their effectiveness in complex multi-step tasks. In this paper, we propose Visual Thinking, a zero-shot framework that enables LMMs to reason through multiple chains of (self-generated) conceptual diagrams, significantly enhancing their combinatorial planning capabilities. Our approach does not require any human initialization beyond the natural language description of the task. It integrates both textual and diagrammatic reasoning within an optimized Graph-of-Thought inference framework, enhanced by beam search and depth-wise backtracking. Evaluated on multiple challenging PDDL planning domains, our method substantially improves LMMs’ performance (e.g., GPT-4o: 35.5% -> 90.2% in Blocksworld) and consistently outperforms other text-only search-based inference methods. On more difficult planning domains with solution depths up to 40, our approach outperforms even the o1-preview reasoning model (e.g., 16 percentage points improvement in Floor Tiles). These results highlight the value of conceptual diagrams as a reasoning medium in LMMs.
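
The search backbone can be sketched generically: beam search over reasoning states with depth-wise backtracking on dead ends, where `expand`, `score`, and `is_goal` are toy stand-ins for the framework's diagram-generation and evaluation calls to an LMM:

```python
# Generic sketch of beam search with depth-wise backtracking over states.
import random

def expand(state):                 # stub: propose successor "diagram" states
    return [state + [random.randint(0, 9)] for _ in range(3)]

def score(state):                  # stub: the LMM's judged promise of a state
    return sum(state) + random.random()

def is_goal(state):
    return len(state) >= 5 and sum(state) >= 30

def beam_search(beam_width=2, max_depth=12):
    levels = [[[]]]                # stack of beams, kept to allow backtracking
    for _ in range(max_depth):
        candidates = [s for state in levels[-1] for s in expand(state)]
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        for state in beam:
            if is_goal(state):
                return state
        if not beam and len(levels) > 1:
            levels.pop()           # dead end: back up one level and re-expand
            continue
        levels.append(beam)
    return None

print(beam_search())
```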

[265] Automatic Reward Shaping from Confounded Offline Data

Mingxuan Li, Junzhe Zhang, Elias Bareinboim

Main category: cs.AI

TL;DR: A novel deep reinforcement learning algorithm robust to confounding biases in observed data, based on DQN but designed to handle unobserved confounders in complex environments.

DetailsMotivation: To address the challenge of off-policy learning from biased data in complex, high-dimensional domains where unobserved confounding cannot be ruled out, which affects the performance of standard methods like Q-learning.

Method: Building on Deep Q-Network (DQN), the proposed algorithm finds a safe policy for the worst-case environment compatible with the observations, making it robust to confounding biases.

Result: The method was applied to twelve confounded Atari games and consistently dominated standard DQN in all games where observed input to behavioral and target policies mismatch and unobserved confounders exist.

Conclusion: The proposed algorithm effectively handles confounding biases in off-policy learning, outperforming standard DQN in environments with unobserved confounders and policy mismatches.

Abstract: A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where \emph{unobserved confounding} cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.
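
To convey the flavor of optimizing against the worst-case compatible environment, here is a toy pessimistic tabular update that learns from a lower bound on the reward. The paper's causal-bound construction for deep Q-networks is considerably more involved, so treat this purely as intuition.

```python
# Toy pessimistic Q-update (illustration only): use a lower bound r_low on
# the reward consistent with the confounded data, so the learned policy is
# safe for the worst-case environment compatible with the observations.
def pessimistic_q_update(Q, s, a, r_low, s_next, actions, alpha=0.1, gamma=0.99):
    # Q: mapping (state, action) -> value, e.g. collections.defaultdict(float)
    target = r_low + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```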

[266] GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning

Sahiti Yerramilli, Nilay Pande, Rynaa Grover, Jayant Sravan Tamarapalli

Main category: cs.AI

TL;DR: GeoChain is a large-scale benchmark with 1.46M street images and 30M Q&A pairs for evaluating step-by-step geographic reasoning in MLLMs, revealing models’ weaknesses in visual grounding and localization.

DetailsMotivation: To address the lack of comprehensive benchmarks for evaluating complex geographic reasoning capabilities in multimodal large language models, particularly step-by-step reasoning across visual, spatial, cultural, and precise geolocation tasks.

Method: Leveraged 1.46 million Mapillary street-level images paired with 21-step chain-of-thought question sequences, enriched with semantic segmentation (150 classes) and visual locatability scores. Tested on 2,088-image subset with contemporary MLLMs including GPT-4.1, Claude 3.7, and Gemini 2.5 variants.

Result: Models consistently showed weaknesses in visual grounding, erratic reasoning patterns, and struggled with accurate localization, especially as reasoning complexity increased. Performance degraded significantly with more complex geographic reasoning tasks.

Conclusion: GeoChain provides a robust diagnostic methodology that is critical for driving advancements in complex geographic reasoning capabilities within multimodal large language models, highlighting current limitations and areas for improvement.

Abstract: This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories - visual, spatial, cultural, and precise geolocation - annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to achieve accurate localization, especially as the reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advancements in complex geographic reasoning within MLLMs.
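
The shape of a single benchmark record might look roughly like the following; the field names are guesses reconstructed from the abstract, not the released schema.

```python
# Hypothetical shape of one GeoChain item (field names are assumptions).
record = {
    "image_id": "mapillary_000123",
    "locatability": 0.72,                     # visual locatability score
    "segmentation": ["road", "vegetation"],   # subset of the 150 classes
    "cot_questions": [                        # 21 steps, coarse -> fine
        {"step": 1,  "category": "visual", "q": "Is the scene urban or rural?"},
        {"step": 21, "category": "precise geolocation",
         "q": "Estimate the latitude and longitude."},
    ],
}
```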

[267] Addition in Four Movements: Mapping Layer-wise Information Trajectories in LLMs

Yao Yan

Main category: cs.AI

TL;DR: Analysis of LLaMA-3-8B-Instruct’s internal arithmetic processes reveals a four-stage hierarchical computation for multi-digit addition, showing internal computation rather than memorization.

DetailsMotivation: To understand how large language models perform computational tasks like multi-digit addition and dissect their internal arithmetic processes.

Method: Combined linear probing with logit-lens inspection to analyze the forward pass, identifying a coherent four-stage trajectory inspired by human step-by-step addition.

Result: Found that formula-structure representations emerge first, computational features develop prominently, numerical abstractions become clear enabling digit detection, and final content organization occurs near output with correct token reliably top-ranked.

Conclusion: The model uses a hierarchical computational process that favors internal computation over rote memorization for arithmetic tasks.

Abstract: Multi-digit addition is a clear probe of the computational power of large language models. To dissect the internal arithmetic processes in LLaMA-3-8B-Instruct, we combine linear probing with logit-lens inspection. Inspired by the step-by-step manner in which humans perform addition, we propose and analyze a coherent four-stage trajectory in the forward pass: (1) formula-structure representations become linearly decodable first, while the answer token is still far down the candidate list; (2) core computational features then emerge prominently; (3) at deeper activation layers, numerical abstractions of the result become clearer, enabling near-perfect detection and decoding of the individual digits in the sum; (4) near the output, the model organizes and generates the final content, with the correct token reliably occupying the top rank. This trajectory suggests a hierarchical process that favors internal computation over rote memorization. We release our code and data to facilitate reproducibility.
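
A minimal logit-lens probe in the spirit of the paper's analysis is sketched below, assuming a HuggingFace LLaMA-3 checkpoint; the probing details (prompt, tokenization of the answer, layer readout) are simplifications rather than the authors' setup.

```python
# Minimal logit-lens sketch: project each layer's hidden state through the
# final norm + unembedding and track the rank of the correct answer token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated; access assumed
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

inputs = tok("23 + 58 = ", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

answer_id = tok.encode("81", add_special_tokens=False)[0]  # tokenization may vary
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[:, -1, :]))   # "logit lens" readout
    rank = int((logits[0].argsort(descending=True) == answer_id).nonzero())
    print(f"layer {layer:2d}: rank of correct answer token = {rank}")
```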

Hudson de Martim

Main category: cs.AI

TL;DR: A formal modeling pattern for granular versioning of legal norms using LRMoo ontology, enabling deterministic point-in-time reconstruction of legal texts through temporal versions and language versions.

DetailsMotivation: Current frameworks lack dedicated modeling for component-level versioning of legal norms, hindering reliable temporal reconstruction needed for Legal Tech and AI applications.

Method: Proposes a temporal modeling pattern using LRMoo ontology with diachronic chains of F2 Expressions, distinguishing between language-agnostic Temporal Versions and monolingual Language Versions, applied recursively to legal text structure.

Result: Demonstrated with Brazilian Federal Constitution case study, enabling precise deterministic retrieval and reconstruction of any legal text part at specific dates through event-centric architecture.

Conclusion: Provides robust foundation for verifiable knowledge graphs and advanced AI tools, overcoming limitations of current generative models for legal text processing.

Abstract: Effectively representing legal norms for automated processing is a critical challenge, particularly in tracking the temporal evolution of their hierarchical components. While foundational conceptual frameworks like IFLA LRMoo provide a generic toolkit for bibliographic data, and encoding standards like Akoma Ntoso offer a robust syntax for legal documents, a dedicated, formal modeling pattern for granular, component-level versioning is still required. This limitation hinders the deterministic point-in-time reconstruction of legal texts, a fundamental capability for reliable Legal Tech and AI applications. This paper proposes a structured, temporal modeling pattern grounded in the LRMoo ontology to address this need. Our approach models the evolution of a legal norm as a diachronic chain of F2 Expressions. We introduce a key distinction between a language-agnostic Temporal Version (TV)-a semantic snapshot of the norm’s structure-and its concrete monolingual realizations, the Language Versions (LV). Both are modeled as F2 Expressions linked by the canonical R76 is derivative of property. This paradigm is applied recursively to the legal text’s internal structure, representing it as a parallel hierarchy of abstract Component Works (F1) and their versioned Component Expressions (F2). Furthermore, we formalize the legislative amendment process using the F28 Expression Creation event, allowing changes to be traced from an amending act to its precise effect on the amended norm. Using the Brazilian Federal Constitution as a case study, we demonstrate how this event-centric architecture enables the precise, deterministic retrieval and reconstruction of any part of a legal text as it existed on a specific date. The model provides a robust foundation for building verifiable knowledge graphs and advanced AI tools, overcoming the limitations of current generative models.
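
A toy rendering of the TV/LV pattern as plain data structures may help; the class and field names map loosely onto the LRMoo entities described above and are not the authors' formalization.

```python
# Toy model of the versioning pattern: an F1 Work realized by a diachronic
# chain of F2 Expressions, split into language-agnostic Temporal Versions
# and monolingual Language Versions (names are illustrative).
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class LanguageVersion:               # LV: monolingual F2 Expression
    language: str
    text: str

@dataclass
class TemporalVersion:               # TV: semantic snapshot (F2 Expression)
    valid_from: date
    derived_from: Optional["TemporalVersion"]         # R76 "is derivative of"
    realizations: dict = field(default_factory=dict)  # language -> LanguageVersion

@dataclass
class Norm:                          # F1 Work
    uri: str
    versions: list = field(default_factory=list)

    def as_of(self, when: date) -> Optional[TemporalVersion]:
        """Deterministic point-in-time lookup: latest TV in force on `when`."""
        valid = [tv for tv in self.versions if tv.valid_from <= when]
        return max(valid, key=lambda tv: tv.valid_from, default=None)
```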

[269] MedGellan: LLM-Generated Medical Guidance to Support Physicians

Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini

Main category: cs.AI

TL;DR: MedGellan is a lightweight, annotation-free framework that uses LLMs to generate clinical guidance from medical records, improving physician diagnostic performance through Bayesian-inspired prompting.

DetailsMotivation: Medical decision-making errors can have serious consequences, and while full automation is challenging, hybrid frameworks combining machine intelligence with human oversight offer a practical solution.

Method: Uses a Large Language Model with Bayesian-inspired prompting strategy that respects temporal order of clinical data to generate clinical guidance from raw medical records for physician use.

Result: Preliminary experiments show improved diagnostic performance, particularly in recall and F1 score, when physicians use the LLM-generated guidance.

Conclusion: MedGellan demonstrates that LLM-generated clinical guidance can effectively support physicians in diagnostic tasks, offering a practical hybrid approach to medical decision-making.

Abstract: Medical decision-making is a critical task, where errors can result in serious, potentially life-threatening consequences. While full automation remains challenging, hybrid frameworks that combine machine intelligence with human oversight offer a practical alternative. In this paper, we present MedGellan, a lightweight, annotation-free framework that uses a Large Language Model (LLM) to generate clinical guidance from raw medical records, which is then used by a physician to predict diagnoses. MedGellan uses a Bayesian-inspired prompting strategy that respects the temporal order of clinical data. Preliminary experiments show that the guidance generated by the LLM with MedGellan improves diagnostic performance, particularly in recall and $F_1$ score.
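
The temporal-ordering idea can be pictured with a simple prompt builder like the one below; the template and field names are placeholders, since the paper's exact Bayesian-inspired prompt is not reproduced here.

```python
# Hedged sketch: assemble guidance input in strict chronological order.
def build_guidance_prompt(events: list) -> str:
    # events: [{"time": "2024-03-01T08:00", "note": "admitted with chest pain"}, ...]
    ordered = sorted(events, key=lambda e: e["time"])
    history = "\n".join(f"[{e['time']}] {e['note']}" for e in ordered)
    return (
        "Read the clinical record strictly in chronological order, updating "
        "your working differential after each entry, then draft guidance "
        "for the physician.\n" + history
    )
```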

[270] ASP-FZN: A Translation-based Constraint Answer Set Solver

Thomas Eiter, Tobias Geibinger, Tobias Kaminski, Nysret Musliu, Johannes Oetsch

Main category: cs.AI

TL;DR: asp-fzn is a new solver for Constraint Answer Set Programming that translates CASP programs to FlatZinc format, enabling use of multiple backend solvers and showing competitive performance against state-of-the-art ASP and CASP solvers.

DetailsMotivation: To extend Answer Set Programming (ASP) with linear constraints and provide a solver-independent approach that can leverage various Constraint Programming and Integer Programming backend solvers through the FlatZinc language.

Method: Translation of CASP programs into the solver-independent FlatZinc language, supporting rich linear constraints including common global constraints, enabling use of multiple backend constraint solvers.

Result: asp-fzn is competitive with state-of-the-art ASP solvers on ASP competition benchmarks and shows promising performance on CASP problems, even outperforming the prominent clingcon solver on some CASP benchmarks.

Conclusion: The asp-fzn solver provides an effective approach for Constraint Answer Set Programming by leveraging the FlatZinc ecosystem, demonstrating competitive performance and establishing itself as a viable alternative to existing CASP solvers.

Abstract: We present the solver asp-fzn for Constraint Answer Set Programming (CASP), which extends ASP with linear constraints. Our approach is based on translating CASP programs into the solver-independent FlatZinc language that supports several Constraint Programming and Integer Programming backend solvers. Our solver supports a rich language of linear constraints, including some common global constraints. As for evaluation, we show that asp-fzn is competitive with state-of-the-art ASP solvers on benchmarks taken from past ASP competitions. Furthermore, we evaluate it on several CASP problems from the literature and compare its performance with clingcon, which is a prominent CASP solver that supports most of the asp-fzn language. The performance of asp-fzn is very promising as it is already competitive on plain ASP and even outperforms clingcon on some CASP benchmarks.

[271] CountQA: How Well Do MLLMs Count in the Wild?

Jayant Sravan Tamarapalli, Rynaa Grover, Nilay Pande, Sahiti Yerramilli

Main category: cs.AI

TL;DR: CountQA benchmark reveals MLLMs’ severe object counting deficiency, with top model achieving only 42.9% accuracy on real-world images with high object density and occlusion.

DetailsMotivation: Multimodal LLMs show remarkable visual understanding but lack fundamental object counting skills, limiting their real-world reliability. Existing benchmarks don't adequately test this capability in complex scenarios.

Method: Created CountQA benchmark with 1,500+ question-answer pairs featuring real-world images with high object density, clutter, and occlusion. Evaluated 15 prominent MLLMs on this benchmark.

Result: Top-performing model achieved only 42.9% accuracy, with performance declining as object counts increased, revealing significant counting weakness in current MLLMs.

Conclusion: CountQA provides a dedicated benchmark to diagnose and fix MLLMs’ core counting weakness, enabling development of numerically grounded and spatially aware models. Dataset and code will be open-sourced.

Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable fluency in understanding visual scenes, yet they exhibit a critical lack in a fundamental cognitive skill: object counting. This blind spot severely limits their reliability in real-world applications. To date, this capability has been largely unevaluated in complex scenarios, as existing benchmarks either feature sparse object densities or are confined to specific visual domains, failing to test models under realistic conditions. Addressing this gap, we introduce CountQA, a challenging new benchmark designed to probe this deficiency. Comprising over 1,500 question-answer pairs, CountQA features real-world images with high object density, clutter, and occlusion. We investigate this weakness by evaluating 15 prominent MLLMs on the CountQA benchmark and reveal that the top-performing model achieves a mere 42.9% accuracy, with performance declining as object counts rise. By providing a dedicated benchmark to diagnose and rectify this core weakness, CountQA paves the way for a new generation of MLLMs that are not only descriptively fluent but also numerically grounded and spatially aware. We will open-source the dataset and code upon paper acceptance to foster further research.

[272] Benchmarking for Domain-Specific LLMs: A Case Study on Academia and Beyond

Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li

Main category: cs.AI

TL;DR: The paper challenges the conventional data scaling approach for domain-specific LLM benchmarks and introduces Comp-Comp, an iterative framework that prioritizes comprehensiveness and compactness to improve evaluation precision and recall.

DetailsMotivation: Current domain-specific LLM benchmarks rely heavily on data scaling with large corpora or QA sets, but the impact of corpus and QA design on evaluation precision and recall remains poorly understood. The authors argue that data scaling is not always optimal for domain-specific benchmark construction.

Method: The authors introduce Comp-Comp, an iterative benchmarking framework based on comprehensiveness (ensuring semantic recall by covering full domain breadth) and compactness (improving precision by reducing redundancy and noise). They demonstrate this with a case study creating PolyBench, a large-scale academic benchmark.

Result: The framework was successfully applied to create PolyBench, a high-quality academic benchmark. The approach proved effective in balancing coverage and precision for domain-specific evaluation.

Conclusion: The Comp-Comp framework provides a domain-agnostic alternative to data scaling for benchmark construction, offering better precision and recall through comprehensiveness and compactness principles. The framework is adaptable to various specialized fields beyond academia.

Abstract: The increasing demand for domain-specific evaluation of large language models (LLMs) has led to the development of numerous benchmarks. These efforts often adhere to the principle of data scaling, relying on large corpora or extensive question-answer (QA) sets to ensure broad coverage. However, the impact of corpus and QA set design on the precision and recall of domain-specific LLM performance remains poorly understood. In this paper, we argue that data scaling is not always the optimal principle for domain-specific benchmark construction. Instead, we introduce Comp-Comp, an iterative benchmarking framework grounded in the principle of comprehensiveness and compactness. Comprehensiveness ensures semantic recall by covering the full breadth of the domain, while compactness improves precision by reducing redundancy and noise. To demonstrate the effectiveness of our approach, we present a case study conducted at a well-renowned university, resulting in the creation of PolyBench, a large-scale, high-quality academic benchmark. Although this study focuses on academia, the Comp-Comp framework is domain-agnostic and readily adaptable to a wide range of specialized fields. The source code and datasets can be accessed at https://github.com/Anya-RB-Chen/COMP-COMP.
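
One way to picture the iterative comprehensiveness/compactness loop is the pseudo-implementation below; `covers_new_topic` and `too_similar` stand in for whatever semantic-coverage and redundancy measures the framework actually uses.

```python
# Pseudo-loop for the Comp-Comp idea (a sketch under assumed primitives).
def comp_comp(candidate_pool, covers_new_topic, too_similar, rounds=5):
    benchmark = []
    for _ in range(rounds):
        # Comprehensiveness: add questions that extend domain coverage.
        added = [q for q in candidate_pool if covers_new_topic(q, benchmark)]
        benchmark.extend(added)
        # Compactness: drop near-duplicates to cut redundancy and noise.
        benchmark = [q for i, q in enumerate(benchmark)
                     if not any(too_similar(q, p) for p in benchmark[:i])]
        if not added:
            break
    return benchmark
```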

[273] MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes

Nilay Pande, Sahiti Yerramilli, Jayant Sravan Tamarapalli, Rynaa Grover

Main category: cs.AI

TL;DR: MaRVL-QA benchmark tests MLLMs’ mathematical and spatial reasoning on surface plots, revealing current models struggle with topological counting and transformation recognition tasks.

DetailsMotivation: To move beyond semantic description and test deep mathematical/spatial reasoning capabilities in Multimodal Large Language Models using mathematical surface plots as a rigorous testbed.

Method: Created MaRVL-QA benchmark with two tasks: Topological Counting (identifying/enumerating features like local maxima) and Transformation Recognition (recognizing geometric transformations), generated from curated functions with ambiguity filtering.

Result: State-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning.

Conclusion: MaRVL-QA provides a challenging tool to measure progress, expose model limitations, and guide development of MLLMs with more profound reasoning abilities.

Abstract: A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these core reasoning skills. The benchmark comprises two novel tasks: Topological Counting, identifying and enumerating features like local maxima; and Transformation Recognition, recognizing applied geometric transformations. Generated from a curated library of functions with rigorous ambiguity filtering, our evaluation on MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning. MaRVL-QA provides a challenging new tool for the research community to measure progress, expose model limitations, and guide the development of MLLMs with more profound reasoning abilities.
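
As a concrete picture of the Topological Counting task, the snippet below counts local maxima on a sampled surface; this is an illustrative ground-truth computation, not the authors' generation pipeline (which additionally filters ambiguous functions).

```python
# Count local maxima of z = f(x, y) sampled on a grid (illustrative only).
import numpy as np
from scipy.ndimage import maximum_filter

def count_local_maxima(z: np.ndarray, size: int = 5) -> int:
    """A grid point is a local maximum if it equals its neighborhood max."""
    peaks = (z == maximum_filter(z, size=size))
    return int(peaks.sum())

# Example: a surface with two Gaussian bumps.
x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
z = np.exp(-((x - 1) ** 2 + y ** 2)) + np.exp(-((x + 1) ** 2 + y ** 2))
print(count_local_maxima(z))  # -> 2 (one per bump)
```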

[274] AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning

Lang Mei, Zhihan Yang, Chong Chen

Main category: cs.AI

TL;DR: AI-SearchPlanner is a novel RL framework that decouples search planning from QA tasks, using a small trainable LLM for planning to enhance frozen QA models’ performance through dual-reward alignment and Pareto optimization.

DetailsMotivation: Existing RL-based search agents use a single LLM for both search planning and QA, limiting optimization of both capabilities. Real-world systems use large frozen LLMs for QA quality, so a dedicated small planner is more effective.

Method: Proposes AI-SearchPlanner with three innovations: 1) Decoupled architecture separating planner and generator, 2) Dual-reward alignment for search planning, 3) Pareto optimization of planning utility and cost.

Result: Extensive experiments show AI-SearchPlanner outperforms existing RL-based search agents in effectiveness and efficiency, with strong generalization across diverse frozen QA models and data domains.

Conclusion: The decoupled approach with a dedicated small planner significantly enhances frozen QA model performance, demonstrating superior effectiveness, efficiency and generalization compared to end-to-end single LLM approaches.

Abstract: Recent studies have explored integrating Large Language Models (LLMs) with search engines to leverage both the LLMs’ internal pre-trained knowledge and external information. In particular, reinforcement learning (RL) has emerged as a promising paradigm for enhancing LLM reasoning through multi-turn interactions with search engines. However, existing RL-based search agents rely on a single LLM to handle both search planning and question-answering (QA) tasks in an end-to-end manner, which limits their ability to optimize both capabilities simultaneously. In practice, sophisticated AI search systems often employ a large, frozen LLM (e.g., GPT-4, DeepSeek-R1) to ensure high-quality QA. Thus, a more effective and efficient approach is to utilize a small, trainable LLM dedicated to search planning. In this paper, we propose AI-SearchPlanner, a novel reinforcement learning framework designed to enhance the performance of frozen QA models by focusing on search planning. Specifically, our approach introduces three key innovations: 1) Decoupling the Architecture of the Search Planner and Generator, 2) Dual-Reward Alignment for Search Planning, and 3) Pareto Optimization of Planning Utility and Cost, to achieve the objectives. Extensive experiments on real-world datasets demonstrate that AI-SearchPlanner outperforms existing RL-based search agents in both effectiveness and efficiency, while exhibiting strong generalization capabilities across diverse frozen QA models and data domains.
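
The Pareto component can be grounded with a small utility: given candidate plans scored by (utility, cost), keep the non-dominated set. This is a generic sketch, not the paper's optimization procedure.

```python
# Non-dominated (Pareto) filter over (utility, cost) pairs; higher utility
# and lower cost are better. Generic illustration only.
def pareto_front(points):
    def dominates(q, p):   # q dominates p
        return (q[0] >= p[0] and q[1] <= p[1]) and (q[0] > p[0] or q[1] < p[1])
    return [p for p in points if not any(dominates(q, p) for q in points)]

print(pareto_front([(0.9, 5), (0.8, 2), (0.7, 2), (0.9, 7)]))
# -> [(0.9, 5), (0.8, 2)]
```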

[275] EvoEmo: Towards Evolved Emotional Policies for LLM Agents in Multi-Turn Negotiation

Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup

Main category: cs.AI

TL;DR: EvoEmo is an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in LLM negotiations, outperforming baseline strategies with higher success rates and efficiency.

DetailsMotivation: Existing LLM agents overlook the functional role of emotions in negotiations, generating passive emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts.

Method: EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios.

Result: Extensive experiments show EvoEmo consistently outperforms both vanilla strategies and fixed-emotion strategies, achieving higher success rates, higher efficiency, and increased buyer savings.

Conclusion: The findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation.

Abstract: Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in complex, multi-turn negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines (vanilla strategies and fixed-emotion strategies) for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. These findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation.
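
A stripped-down version of the population-based search might look as follows; the emotion labels, operators, and hyperparameters are placeholders, and `reward` would be a simulated negotiation outcome rather than the simple callable assumed here.

```python
# Minimal genetic search over per-turn emotion sequences (illustrative).
import random

EMOTIONS = ["neutral", "joy", "anger", "sadness", "fear", "surprise"]

def mutate(policy, rate=0.2):
    return [random.choice(EMOTIONS) if random.random() < rate else e for e in policy]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(reward, turns=6, pop_size=20, generations=30):
    pop = [[random.choice(EMOTIONS) for _ in range(turns)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=reward, reverse=True)          # fitness = negotiation reward
        parents = pop[: pop_size // 2]              # keep the top half
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=reward)
```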

[276] MSRFormer: Road Network Representation Learning using Multi-scale Feature Fusion of Heterogeneous Spatial Interactions

Jian Yang, Jiahui Wu, Li Fang, Hongchao Fan, Bianying Zhang, Huijie Zhao, Guangyi Yang, Rui Xin, Xiong You

Main category: cs.AI

TL;DR: MSRFormer is a novel road network representation learning framework that integrates multi-scale spatial interactions to address flow heterogeneity and long-distance dependencies in urban road networks.

DetailsMotivation: Urban road networks are heterogeneous and hierarchical, making accurate representation learning challenging. Traditional graph neural networks struggle due to homogeneity assumptions and single-scale focus.

Method: Uses spatial flow convolution to extract small-scale features from trajectory data, identifies scale-dependent spatial interaction regions, employs graph transformer for multi-scale dependencies, and fuses features with residual connections for contrastive learning.

Result: Outperforms baseline methods on two real-world datasets, with up to 16% improvement in complex road network structures. Traffic-related tasks benefit more from trajectory data incorporation.

Conclusion: Provides a practical framework for task-agnostic road network representation models and reveals distinct association patterns between scale effects and flow heterogeneity in spatial interactions.

Abstract: Transforming road network data into vector representations using deep learning has proven effective for road network analysis. However, urban road networks’ heterogeneous and hierarchical nature poses challenges for accurate representation learning. Graph neural networks, which aggregate features from neighboring nodes, often struggle due to their homogeneity assumption and focus on a single structural scale. To address these issues, this paper presents MSRFormer, a novel road network representation learning framework that integrates multi-scale spatial interactions by addressing their flow heterogeneity and long-distance dependencies. It uses spatial flow convolution to extract small-scale features from large trajectory datasets, and identifies scale-dependent spatial interaction regions to capture the spatial structure of road networks and flow heterogeneity. By employing a graph transformer, MSRFormer effectively captures complex spatial dependencies across multiple scales. The spatial interaction features are fused using residual connections, which are fed to a contrastive learning algorithm to derive the final road network representation. Validation on two real-world datasets demonstrates that MSRFormer outperforms baseline methods in two road network analysis tasks. The performance gains of MSRFormer suggest the traffic-related task benefits more from incorporating trajectory data, also resulting in greater improvements in complex road network structures with up to 16% improvements compared to the most competitive baseline method. This research provides a practical framework for developing task-agnostic road network representation models and highlights distinct association patterns of the interplay between scale effects and flow heterogeneity of spatial interactions.

[277] SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty

Main category: cs.AI

TL;DR: Training autonomous single-agent models for deep research using continual RL with synthetic data, achieving 28.7% on Humanity’s Last Exam benchmark.

DetailsMotivation: To develop native autonomous single-agent models for deep research that can dynamically determine actions without manual directives, unlike multi-agent systems with static workflows.

Method: Continual reinforcement learning with entirely synthetic data applied to reasoning-optimized LLMs, featuring minimal web crawling and Python tool integration.

Result: Best variant SFR-DR-20B achieves up to 28.7% on Humanity’s Last Exam benchmark, demonstrating enhanced agentic skills while preserving reasoning ability.

Conclusion: The proposed RL recipe with synthetic data effectively enhances autonomous deep research capabilities in single-agent models, providing a promising approach for agentic AI development.

Abstract: Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented ("thinking") models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity’s Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.

[278] Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference

Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, Yansong Tang

Main category: cs.AI

TL;DR: Direct-Align method with Semantic Relative Preference Optimization improves diffusion model alignment with human preferences by avoiding expensive multistep denoising and enabling online reward adjustment.

DetailsMotivation: Existing methods for aligning diffusion models with human preferences are computationally expensive due to multistep denoising and require continuous offline reward model adaptation for desired aesthetic quality.

Method: Proposes Direct-Align method that predefines noise prior to recover original images via interpolation, avoiding multistep denoising. Introduces Semantic Relative Preference Optimization (SRPO) with text-conditioned rewards for online adjustment through prompt augmentation.

Result: Fine-tuning the FLUX model with these improvements achieved over 3x improvement in human-evaluated realism and aesthetic quality.

Conclusion: The proposed approach effectively addresses computational limitations of previous methods and reduces reliance on offline reward fine-tuning while significantly improving output quality.

Abstract: Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable reward. However, they exhibit two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive, thus restricting optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models in order to achieve desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any time step via interpolation, leveraging the equation that diffusion states are interpolations between noise and target images, which effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
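
The interpolation identity the method leans on can be written in a few lines: if a diffusion state is a known mixture of the clean image and a predefined noise sample, the clean image is recoverable in closed form. The coefficient names below are generic, not the paper's notation.

```python
# If x_t = alpha_t * x0 + sigma_t * eps with a predefined noise sample eps,
# then x0 is recoverable from any timestep in one step (generic sketch).
import torch

def recover_x0(x_t: torch.Tensor, eps: torch.Tensor,
               alpha_t: float, sigma_t: float) -> torch.Tensor:
    return (x_t - sigma_t * eps) / alpha_t
```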

cs.SD

[279] Prototype: A Keyword Spotting-Based Intelligent Audio SoC for IoT

Huihong Liang, Dongxuan Jia, Youquan Wang, Longtao Huang, Shida Zhong, Luping Xiang, Lei Huang, Tao Yuan

Main category: cs.SD

TL;DR: Compact audio SoC with keyword spotting accelerator for ultra-low latency, low-power voice interaction in IoT devices

DetailsMotivation: Enable efficient voice interaction capabilities for Internet of Things devices with minimal power consumption and latency

Method: Algorithm-hardware co-design approach with FPGA-based prototype implementation

Result: Demonstrated stable performance and real-time voice interaction for edge intelligence applications

Conclusion: The system successfully enables low-cost, energy-efficient voice interaction suitable for IoT edge devices

Abstract: In this demo, we present a compact intelligent audio system-on-chip (SoC) integrated with a keyword spotting accelerator, enabling ultra-low latency, low-power, and low-cost voice interaction in Internet of Things (IoT) devices. Through algorithm-hardware co-design, the system’s energy efficiency is maximized. We demonstrate the system’s capabilities through a live FPGA-based prototype, showcasing stable performance and real-time voice interaction for edge intelligence applications.

[280] Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence

Yerin Ryu, Inseop Shin, Chanwoo Kim

Main category: cs.SD

TL;DR: First user-driven dynamics control method for singing voice synthesis using phoneme-level energy sequences for precise loudness control without compromising audio quality.

DetailsMotivation: Current SVS systems rely on probabilistic modeling which limits precise control over musical attributes like dynamics (loudness variations), making it difficult for users to express their intent accurately.

Method: Explicitly condition SVS model on energy sequences extracted from ground-truth spectrograms, and propose phoneme-level energy sequences for user-friendly control to reduce annotation costs.

Result: Achieves over 50% reduction in mean absolute error of energy sequences for phoneme-level inputs compared to baseline and energy-predictor models while maintaining synthesis quality.

Conclusion: The proposed method successfully enables user-driven dynamics control in SVS with significantly improved controllability and reduced annotation requirements.

Abstract: Controllable Singing Voice Synthesis (SVS) aims to generate expressive singing voices reflecting user intent. While recent SVS systems achieve high audio quality, most rely on probabilistic modeling, limiting precise control over attributes such as dynamics. We address this by focusing on dynamic control–temporal loudness variation essential for musical expressiveness–and explicitly condition the SVS model on energy sequences extracted from ground-truth spectrograms, reducing annotation costs and improving controllability. We also propose a phoneme-level energy sequence for user-friendly control. To the best of our knowledge, this is the first attempt enabling user-driven dynamics control in SVS. Experiments show our method achieves over 50% reduction in mean absolute error of energy sequences for phoneme-level inputs compared to baseline and energy-predictor models, without compromising synthesis quality.
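
Extraction of the conditioning signal can be pictured as below; the frame-to-phoneme alignment is assumed to come from elsewhere (e.g., a forced aligner), and the exact energy definition used in the paper may differ.

```python
# Phoneme-level energy from a magnitude spectrogram (illustrative sketch).
import numpy as np

def phoneme_energy(spec: np.ndarray, phoneme_frames: list) -> np.ndarray:
    """spec: (n_freq, n_frames); phoneme_frames: [(start, end), ...] per phoneme.
    Returns one mean energy value per phoneme."""
    frame_energy = np.sqrt((spec ** 2).sum(axis=0))   # per-frame energy
    return np.array([frame_energy[s:e].mean() for s, e in phoneme_frames])
```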

[281] End-to-End Efficiency in Keyword Spotting: A System-Level Approach for Embedded Microcontrollers

Pietro Bartoli, Tommaso Bondini, Christian Veronesi, Andrea Giudici, Niccolò Antonello, Franco Zappa

Main category: cs.SD

TL;DR: TKWS architecture achieves 92.4% F1-score with only 14.4k parameters, outperforming other lightweight models for keyword spotting on MCUs while considering the full processing pipeline including feature extraction.

DetailsMotivation: Keyword spotting enables hands-free interaction in embedded/IoT devices but faces memory and energy constraints. Most studies focus only on model inference, ignoring the complete processing pipeline.

Method: Systematic evaluation of lightweight neural networks (DS-CNN, LiCoNet, TENet) and proposed TKWS based on MobileNet. Full pipeline analysis from MFCC feature extraction to neural inference across three STM32 platforms with benchmarking.

Result: TKWS with three residual blocks achieves 92.4% F1-score with only 14.4k parameters. N6 MCU with neural acceleration shows best energy-delay product. High-resolution features enable low-latency operation.

Conclusion: Model accuracy alone doesn’t determine real-world effectiveness. Optimal keyword spotting requires careful feature extraction parameter selection and hardware-specific optimization for efficient deployment.

Abstract: Keyword spotting (KWS) is a key enabling technology for hands-free interaction in embedded and IoT devices, where stringent memory and energy constraints challenge the deployment of AI-enabled devices. In this work, we systematically evaluate and compare several state-of-the-art lightweight neural network architectures, including DS-CNN, LiCoNet, and TENet, alongside our proposed Typman-KWS (TKWS) architecture built upon MobileNet, specifically designed for efficient KWS on microcontroller units (MCUs). Unlike prior studies focused solely on model inference, our analysis encompasses the entire processing pipeline, from Mel-Frequency Cepstral Coefficient (MFCC) feature extraction to neural inference, and is benchmarked across three STM32 platforms (N6, H7, and U5). Our results show that TKWS with three residual blocks achieves up to 92.4% F1-score with only 14.4k parameters, reducing memory footprint without compromising the accuracy. Moreover, the N6 MCU with integrated neural acceleration achieves the best energy-delay product (EDP), enabling efficient, low-latency operation even with high-resolution features. Our findings highlight that model accuracy alone does not determine real-world effectiveness; rather, optimal keyword spotting deployments require careful consideration of feature extraction parameters and hardware-specific optimization.
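
The front half of such a pipeline is easy to picture on the desktop side; the sketch below uses librosa defaults rather than the paper's MFCC configuration, and the file name is a placeholder.

```python
# MFCC front end of a KWS pipeline (librosa defaults; settings assumed).
import numpy as np
import librosa

y, sr = librosa.load("keyword.wav", sr=16000)        # placeholder input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
features = mfcc.astype(np.float32)[np.newaxis, ...]  # batch dim for the network
# On-device, this tensor would be quantized and fed to the TKWS model.
```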

[282] Neural Proxies for Sound Synthesizers: Learning Perceptually Informed Preset Representations

Paolo Combes, Stefan Weinzierl, Klaus Obermayer

Main category: cs.SD

TL;DR: Neural network method to approximate black-box synthesizers by mapping presets to audio embeddings, enabling integration into neural-based automatic synthesizer programming systems.

DetailsMotivation: Deep learning for automatic synthesizer programming is challenging due to non-differentiable software synthesizers that can't be integrated into training pipelines.

Method: Train neural networks to map synthesizer presets to audio embedding space from pretrained models, creating neural proxies that enable audio embedding loss for black-box synthesizers.

Result: Evaluated various pretrained audio models and neural architectures (feedforward, recurrent, transformers) on three software synthesizers, showing encouraging results in sound matching tasks despite nuanced resource requirements.

Conclusion: The method provides effective neural proxies for black-box synthesizers, paving the way for future research in neural-based automatic synthesizer programming systems.

Abstract: Deep learning appears as an appealing solution for Automatic Synthesizer Programming (ASP), which aims to assist musicians and sound designers in programming sound synthesizers. However, integrating software synthesizers into training pipelines is challenging due to their potential non-differentiability. This work tackles this challenge by introducing a method to approximate arbitrary synthesizers. Specifically, we train a neural network to map synthesizer presets onto an audio embedding space derived from a pretrained model. This facilitates the definition of a neural proxy that produces compact yet effective representations, thereby enabling the integration of audio embedding loss into neural-based ASP systems for black-box synthesizers. We evaluate the representations derived by various pretrained audio models in the context of neural-based ASP and assess the effectiveness of several neural network architectures, including feedforward, recurrent, and transformer-based models, in defining neural proxies. We evaluate the proposed method using both synthetic and hand-crafted presets from three popular software synthesizers and assess its performance in a synthesizer sound matching downstream task. While the benefits of the learned representation are nuanced by resource requirements, encouraging results were obtained for all synthesizers, paving the way for future research into the application of synthesizer proxies for neural-based ASP systems.
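
The core training step reduces to a regression from preset parameters onto fixed audio embeddings, roughly as sketched here; the dimensions and the feedforward architecture are assumptions for illustration.

```python
# Train a neural proxy: presets -> embeddings of the rendered audio
# (embeddings precomputed with a pretrained audio model; dims assumed).
import torch
import torch.nn as nn

proxy = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),   # 128 = preset-parameter dim (assumed)
    nn.Linear(256, 512),              # 512 = audio-embedding dim (assumed)
)
opt = torch.optim.Adam(proxy.parameters(), lr=1e-3)

def train_step(presets: torch.Tensor, audio_emb: torch.Tensor) -> float:
    opt.zero_grad()
    loss = nn.functional.mse_loss(proxy(presets), audio_emb)
    loss.backward()
    opt.step()
    return loss.item()
```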

[283] Adversarial Attacks on Audio Deepfake Detection: A Benchmark and Comparative Study

Kutub Uddin, Muhammad Umar Farooq, Awais Khan, Khalid Mahmood Malik

Main category: cs.SD

TL;DR: Comparative analysis of state-of-the-art audio deepfake detection methods against various anti-forensic attacks, revealing vulnerabilities and providing guidance for developing more robust detectors.

DetailsMotivation: The widespread use of generative AI for creating realistic deepfakes poses serious threats to voice biometric applications, and existing detection methods are vulnerable to anti-forensic attacks that conceal generative signatures.

Method: Extensive evaluation of SoTA audio deepfake detection methods on five benchmark datasets, comparing raw and spectrogram-based approaches against various anti-forensic attacks including statistical modifications and optimization-based attacks.

Result: The analysis reveals vulnerabilities of current ADD methods under adversarial conditions and provides insights into their effectiveness in exposing deepfake signatures.

Conclusion: This study highlights the need for more robust and generalized detectors, informs future defense strategy design, and guides research toward adaptive defenses against evolving anti-forensic techniques.

Abstract: The widespread use of generative AI has shown remarkable success in producing highly realistic deepfakes, posing a serious threat to various voice biometric applications, including speaker verification, voice biometrics, audio conferencing, and criminal investigations. To counteract this, several state-of-the-art (SoTA) audio deepfake detection (ADD) methods have been proposed to identify generative AI signatures to distinguish between real and deepfake audio. However, the effectiveness of these methods is severely undermined by anti-forensic (AF) attacks that conceal generative signatures. These AF attacks span a wide range of techniques, including statistical modifications (e.g., pitch shifting, filtering, noise addition, and quantization) and optimization-based attacks (e.g., FGSM, PGD, C & W, and DeepFool). In this paper, we investigate the SoTA ADD methods and provide a comparative analysis to highlight their effectiveness in exposing deepfake signatures, as well as their vulnerabilities under adversarial conditions. We conducted an extensive evaluation of ADD methods on five deepfake benchmark datasets using two categories: raw and spectrogram-based approaches. This comparative analysis enables a deeper understanding of the strengths and limitations of SoTA ADD methods against diverse AF attacks. It does not only highlight vulnerabilities of ADD methods, but also informs the design of more robust and generalized detectors for real-world voice biometrics. It will further guide future research in developing adaptive defense strategies that can effectively counter evolving AF techniques.
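
Among the optimization-based attacks listed, FGSM is the simplest to state; the generic implementation below applies to any differentiable detector and is not specific to the benchmark's setup.

```python
# Fast Gradient Sign Method on an audio waveform batch (generic sketch).
import torch
import torch.nn.functional as F

def fgsm(model, x: torch.Tensor, y: torch.Tensor, eps: float = 0.002):
    """x: waveform batch; y: true labels; returns adversarial waveforms."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```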

[284] Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks

Friedrich Wolf-Monheim

Main category: cs.SD

TL;DR: Comparison of spectral/rhythm features for audio classification using deep CNNs, finding mel-spectrograms and MFCCs perform best on ESC-50 environmental audio dataset.

DetailsMotivation: To investigate which spectral and rhythm features perform best for audio classification tasks using deep convolutional neural networks, as various features can be used as input but their comparative effectiveness is not well established.

Method: Used deep CNN with end-to-end deep learning pipeline on ESC-50 dataset (2,000 labeled environmental audio recordings). Tested multiple features: mel-scaled spectrograms, MFCCs, cyclic tempograms, STFT chromagrams, CQT chromagrams, and CENS chromagrams as image inputs.

Result: Mel-scaled spectrograms and MFCCs performed significantly better than other features across all evaluated metrics (accuracy, precision, recall, F1 score) for both category and class level audio classification.

Conclusion: Mel-scaled spectrograms and MFCCs are the most effective spectral features for audio classification tasks when using deep convolutional neural networks, outperforming other rhythm and spectral features tested.

Abstract: Alongside decision-tree and k-nearest-neighbours algorithms, deep convolutional neural networks (CNNs) are widely used to classify audio data in many domains like music, speech or environmental sounds. To train a specific CNN various spectral and rhythm features like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCC), cyclic tempograms, short-time Fourier transform (STFT) chromagrams, constant-Q transform (CQT) chromagrams and chroma energy normalized statistics (CENS) chromagrams can be used as digital image input data for the neural network. The performance of these spectral and rhythm features for audio category level as well as audio class level classification is investigated in detail with a deep CNN and the ESC-50 dataset with 2,000 labeled environmental audio recordings using an end-to-end deep learning pipeline. The evaluated metrics accuracy, precision, recall and F1 score for multiclass classification clearly show that the mel-scaled spectrograms and the mel-frequency cepstral coefficients (MFCC) perform significantly better than the other spectral and rhythm features investigated in this research for audio classification tasks using deep CNNs.
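
The two best-performing inputs from the study can be computed in a few lines with librosa; default parameters are used here, and the paper's exact settings are not assumed (the bundled example clip requires a one-time download).

```python
# Mel-scaled spectrogram and MFCCs as CNN input images (librosa defaults).
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))        # bundled example clip
mel = librosa.feature.melspectrogram(y=y, sr=sr)   # mel-scaled spectrogram
mel_db = librosa.power_to_db(mel)                  # dB image for the CNN
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20) # MFCC matrix
```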

[285] When Fine-Tuning is Not Enough: Lessons from HSAD on Hybrid and Adversarial Audio Spoof Detection

Bin Hu, Kunyang Huang, Daehan Kwak, Meng Xu, Kuan Huang

Main category: cs.SD

TL;DR: HSAD dataset addresses hybrid speech spoofing detection challenges with 42K+ samples across 4 classes, revealing transformer models’ limitations and need for specialized benchmarks.

DetailsMotivation: Real-world voice spoofing attacks often involve hybrid utterances mixing genuine and synthetic speech, making binary detection inadequate and requiring more sophisticated benchmarks.

Method: Created Hybrid Spoofed Audio Dataset (HSAD) with 1,248 clean and 41,044 degraded utterances across 4 classes, evaluated 6 transformer models including spectrogram encoders and self-supervised waveform models.

Result: Pretrained models overgeneralize under hybrid conditions, spoof-specific fine-tuning improves separability but struggles with unseen compositions, HSAD adaptation yields >97% accuracy and ~99% F1 score.

Conclusion: Fine-tuning alone is insufficient - robust hybrid-aware benchmarks like HSAD are essential to expose model biases and build resilient voice authentication systems.

Abstract: The rapid advancement of AI has enabled highly realistic speech synthesis and voice cloning, posing serious risks to voice authentication, smart assistants, and telecom security. While most prior work frames spoof detection as a binary task, real-world attacks often involve hybrid utterances that mix genuine and synthetic speech, making detection substantially more challenging. To address this gap, we introduce the Hybrid Spoofed Audio Dataset (HSAD), a benchmark containing 1,248 clean and 41,044 degraded utterances across four classes: human, cloned, zero-shot AI-generated, and hybrid audio. Each sample is annotated with spoofing method, speaker identity, and degradation metadata to enable fine-grained analysis. We evaluate six transformer-based models, including spectrogram encoders (MIT-AST, MattyB95-AST) and self-supervised waveform models (Wav2Vec2, HuBERT). Results reveal critical lessons: pretrained models overgeneralize and collapse under hybrid conditions; spoof-specific fine-tuning improves separability but struggles with unseen compositions; and dataset-specific adaptation on HSAD yields large performance gains (AST accuracy above 97% and an F1 score of approximately 99%), though residual errors persist for complex hybrids. These findings demonstrate that fine-tuning alone is not sufficient; robust hybrid-aware benchmarks like HSAD are essential to expose calibration failures, model biases, and factors affecting spoof detection in adversarial environments. HSAD thus provides both a dataset and an analytic framework for building resilient and trustworthy voice authentication systems.

[286] Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis

Yejin Jeon, Youngjae Kim, Jihyun Lee, Hyounghun Kim, Gary Geunbae Lee

Main category: cs.SD

TL;DR: A novel face-to-voice synthesis method that preserves fine-grained facial attributes through multi-granular representation learning and multi-view training, improving voice congruence and synthesis stability.

DetailsMotivation: Traditional text-to-speech fails to preserve users' own voices after traumatic events like strokes. Existing face-to-voice methods lose fine-grained facial information and suffer from inefficient multi-stage training.

Method: Decomposes facial images into non-overlapping segments for multi-granular representation, uses multi-task learning for speaker attributes, and employs multi-view training with various visual perspectives paired with identical speech.

Result: Extensive evaluations show substantial improvements in face-voice congruence and synthesis stability compared to existing methods.

Conclusion: The proposed approach effectively addresses limitations of current face-to-voice synthesis by preserving fine-grained facial attributes and improving training efficiency through integrated multi-granular modeling.

Abstract: For individuals who have experienced traumatic events such as strokes, speech may no longer be a viable means of communication. While text-to-speech (TTS) can be used as a communication aid since it generates synthetic speech, it fails to preserve the user’s own voice. As such, face-to-voice (FTV) synthesis, which derives corresponding voices from facial images, provides a promising alternative. However, existing methods rely on pre-trained visual encoders, and finetune them to align with speech embeddings, which strips fine-grained information from facial inputs such as gender or ethnicity, despite their known correlation with vocal traits. Moreover, these pipelines are multi-stage, which requires separate training of multiple components, thus leading to training inefficiency. To address these limitations, we utilize fine-grained facial attribute modeling by decomposing facial images into non-overlapping segments and progressively integrating them into a multi-granular representation. This representation is further refined through multi-task learning of speaker attributes such as gender and ethnicity at both the visual and acoustic domains. Moreover, to improve alignment robustness, we adopt a multi-view training strategy by pairing various visual perspectives of a speaker in terms of different angles and lighting conditions, with identical speech recordings. Extensive subjective and objective evaluations confirm that our approach substantially enhances face-voice congruence and synthesis stability.
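
The first step, decomposing a face image into non-overlapping segments, is straightforward to sketch; the grid size below is an arbitrary choice for illustration, not the paper's.

```python
# Split an H x W face image into a grid of non-overlapping segments.
import numpy as np

def decompose(img: np.ndarray, grid: int = 4) -> list:
    h, w = img.shape[:2]
    return [img[i * h // grid:(i + 1) * h // grid,
                j * w // grid:(j + 1) * w // grid]
            for i in range(grid) for j in range(grid)]
```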

[287] Target matching based generative model for speech enhancement

Taihui Wang, Rilin Chen, Tong Lei, Andong Li, Jinzheng Zhao, Meng Yu, Dong Yu

Main category: cs.SD

TL;DR: A novel target-based generative framework that improves mean/variance schedule design flexibility and training/inference efficiency for generative speech enhancement.

DetailsMotivation: Overcome limitations in current generative models including careful schedule selection requirements, hallucination artifacts, inefficient training/inference, and high computational complexity of existing diffusion backbones.

Method: Reformulates generative speech enhancement as target signal estimation to eliminate stochastic components, employs logistic mean schedule and bridge variance schedule for better SNR trajectory, and proposes a new efficient diffusion backbone for audio that models long-term frame correlations and cross-band dependencies.

Result: More stable and efficient training/inference processes, more favorable signal-to-noise ratio trajectory, and significantly improved efficiency over NCSN++ backbone.

Conclusion: The proposed framework enhances both schedule design flexibility and computational efficiency while addressing key limitations of existing generative models for speech enhancement tasks.

Abstract: The design of mean and variance schedules for the perturbed signal is a fundamental challenge in generative models. While score-based and Schrödinger bridge-based models require careful selection of the stochastic differential equation to derive the corresponding schedules, flow-based models address this issue via vector field matching. However, this strategy often leads to hallucination artifacts and inefficient training and inference processes due to the potential inclusion of stochastic components in the vector field. Additionally, the widely adopted diffusion backbone, NCSN++, suffers from high computational complexity. To overcome these limitations, we propose a novel target-based generative framework that enhances both the flexibility of mean/variance schedule design and the efficiency of training and inference processes. Specifically, we eliminate the stochastic components in the training loss by reformulating the generative speech enhancement task as a target signal estimation problem, which therefore leads to more stable and efficient training and inference processes. In addition, we employ a logistic mean schedule and a bridge variance schedule, which yield a more favorable signal-to-noise ratio trajectory compared to several widely used schedules and thus leads to a more efficient perturbation strategy. Furthermore, we propose a new diffusion backbone for audio, which significantly improves the efficiency over NCSN++ by explicitly modeling long-term frame correlations and cross-band dependencies.

[288] Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data

Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid

Main category: cs.SD

TL;DR: Falcon3-Audio is a family of audio-language models that achieves state-of-the-art performance on audio tasks using remarkably small data (under 30K hours) and simple architecture, matching larger models while being more efficient and transparent.

DetailsMotivation: Despite LLMs transforming NLP, audio integration remains underexplored despite audio's importance in human communication. Current audio models require massive data and complex architectures.

Method: Built on instruction-tuned LLMs and Whisper encoders using simple single-stage training with minimal public audio data (<30K hours). Avoids complex curriculum learning, multiple encoders, and intricate cross-attention mechanisms.

Result: Falcon3-Audio-7B matches best open-weight models on MMAU benchmark (score 64.14, matching R1-AQA) with superior data/parameter efficiency. Even the 1B model competes with larger 2B-13B models. Shows complex architectures unnecessary for strong performance.

Conclusion: Simple architectures with instruction-tuned LLMs and Whisper encoders can achieve state-of-the-art audio performance with minimal data, challenging the need for massive datasets and complex model designs in audio-language modeling.

Abstract: Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored – despite audio’s centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data – less than 30K hours (5K unique) – Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities – such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors – are not required for strong performance, even compared to models trained on over 500K hours of data.

[289] The Model Hears You: Audio Language Model Deployments Should Consider the Principle of Least Privilege

Luxi He, Xiangyu Qi, Michel Liao, Inyoung Cheong, Prateek Mittal, Danqi Chen, Peter Henderson

Main category: cs.SD

TL;DR: Audio LMs process speech directly but introduce new safety risks like identity inference and biased decision-making, requiring the Principle of Least Privilege for responsible deployment.

DetailsMotivation: Audio Language Models bypass transcription to preserve speech details like intonation and multiple speakers, but this direct processing creates new safety and privacy risks including potential misuse of speaker identity and vocal attributes.

Method: The paper conducts experiments comparing end-to-end modeling with cascaded pipelines, analyzing socio-technical safety risks such as identity inference, biased decision-making, and emotion detection capabilities.

Result: Experiments show that end-to-end Audio LMs create socio-technical safety risks such as identity inference, biased decision-making, and emotion detection, raising concerns that they may store voiceprints and function in ways that create uncertainty under existing legal regimes.

Conclusion: The Principle of Least Privilege should guide Audio LM development, with evaluations assessing privacy risks and appropriate information access scope. Current benchmarks have gaps that need addressing through both technical and policy research for responsible deployment.

Abstract: The latest Audio Language Models (Audio LMs) process speech directly instead of relying on a separate transcription step. This shift preserves detailed information, such as intonation or the presence of multiple speakers, that would otherwise be lost in transcription. However, it also introduces new safety risks, including the potential misuse of speaker identity cues and other sensitive vocal attributes, which could have legal implications. In this paper, we urge a closer examination of how these models are built and deployed. Our experiments show that end-to-end modeling, compared with cascaded pipelines, creates socio-technical safety risks such as identity inference, biased decision-making, and emotion detection. This raises concerns about whether Audio LMs store voiceprints and function in ways that create uncertainty under existing legal regimes. We then argue that the Principle of Least Privilege should be considered to guide the development and deployment of these models. Specifically, evaluations should assess (1) the privacy and safety risks associated with end-to-end modeling; and (2) the appropriate scope of information access. Finally, we highlight related gaps in current audio LM benchmarks and identify key open research questions, both technical and policy-related, that must be addressed to enable the responsible deployment of end-to-end Audio LMs.

[290] Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal

Main category: cs.SD

TL;DR: Voice authentication systems are vulnerable to sophisticated attacks like SMIA which manipulates inaudible frequencies in AI-generated audio to bypass security measures with high success rates, demonstrating the need for dynamic, adaptive defenses.

DetailsMotivation: Voice authentication systems are increasingly used in high-security sectors but face severe vulnerabilities from deepfakes and adversarial attacks, with current anti-spoofing countermeasures being insufficient against novel threats.

Method: Proposed Spectral Masking and Interpolation Attack (SMIA) that strategically manipulates inaudible frequency regions of AI-generated audio to create adversarial samples that sound authentic but deceive countermeasures.

Result: SMIA achieved 82% success against combined VAS/CM systems, 97.5% against standalone speaker verification systems, and 100% against countermeasures, demonstrating current security is insufficient.

Conclusion: Current security postures are inadequate against adaptive adversarial attacks, highlighting the urgent need for next-generation dynamic, context-aware defense frameworks that can evolve with the threat landscape.

Abstract: Voice Authentication Systems (VAS) use unique vocal characteristics for verification. They are increasingly integrated into high-security sectors such as banking and healthcare. Despite their improvements using deep learning, they face severe vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. The emergence of realistic voice cloning complicates detection, as systems struggle to distinguish authentic from synthetic audio. While anti-spoofing countermeasures (CMs) exist to mitigate these risks, many rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the voice in zones imperceptible to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We conducted a comprehensive evaluation of our attack against state-of-the-art (SOTA) models across multiple tasks, under simulated real-world conditions. SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks. This work highlights the urgent need for a paradigm shift toward next-generation defenses that employ dynamic, context-aware frameworks capable of evolving with the threat landscape.
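
The paper does not publish attack code, and the snippet below is not it; it only illustrates the general mechanics the abstract describes, masking and interpolating spectral content in a band chosen to be near-inaudible, using standard SciPy signal processing. The cutoff, blend factor `alpha`, and noise shaping are all hypothetical.

```python
import numpy as np
from scipy.signal import stft, istft

def smia_like_perturb(audio, fs=16000, cutoff_hz=7000, alpha=0.5, seed=0):
    # Illustrative only: blend the near-inaudible high band of synthetic speech
    # with shaped noise, leaving the perceptually dominant band untouched.
    rng = np.random.default_rng(seed)
    f, _, Z = stft(audio, fs=fs, nperseg=512)
    hi = f >= cutoff_hz                          # bins above the chosen cutoff
    noise = rng.standard_normal(Z[hi].shape) + 1j * rng.standard_normal(Z[hi].shape)
    # Interpolate between the original high-band spectrum and scaled noise.
    Z[hi] = (1 - alpha) * Z[hi] + alpha * noise * np.abs(Z[hi]).mean()
    _, out = istft(Z, fs=fs, nperseg=512)
    return out[: len(audio)]
```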

[291] Learning to Upsample and Upmix Audio in the Latent Domain

Dimitrios Bralios, Paris Smaragdis, Jonah Casebeer

Main category: cs.SD

TL;DR: A framework for performing audio processing operations directly in autoencoder latent space instead of raw audio, achieving 100x computational efficiency gains while maintaining comparable quality to traditional methods.

DetailsMotivation: Most audio processing operations inefficiently operate on raw waveforms or spectral representations rather than directly on compressed latent representations from neural audio autoencoders, despite autoencoders being foundational for modern audio compression and generation systems.

Method: Proposes a framework that performs audio processing entirely within an autoencoder’s latent space using a latent L1 reconstruction term augmented by a single latent adversarial discriminator, eliminating the need to decode to raw audio formats.

Result: Achieves computational efficiency gains of up to 100x while maintaining quality comparable to post-processing on raw audio, demonstrated through experiments in bandwidth extension and mono-to-stereo up-mixing.

Conclusion: Establishes a more efficient paradigm for audio processing pipelines that incorporate autoencoders, enabling significantly faster and more resource-efficient workflows across various audio tasks.

Abstract: Neural audio autoencoders create compact latent representations that preserve perceptually important information, serving as the foundation for both modern audio compression systems and generation approaches like next-token prediction and latent diffusion. Despite their prevalence, most audio processing operations, such as spatial and spectral up-sampling, still inefficiently operate on raw waveforms or spectral representations rather than directly on these compressed representations. We propose a framework that performs audio processing operations entirely within an autoencoder’s latent space, eliminating the need to decode to raw audio formats. Our approach dramatically simplifies training by operating solely in the latent domain, with a latent L1 reconstruction term, augmented by a single latent adversarial discriminator. This contrasts sharply with raw-audio methods that typically require complex combinations of multi-scale losses and discriminators. Through experiments in bandwidth extension and mono-to-stereo up-mixing, we demonstrate computational efficiency gains of up to 100x while maintaining quality comparable to post-processing on raw audio. This work establishes a more efficient paradigm for audio processing pipelines that already incorporate autoencoders, enabling significantly faster and more resource-efficient workflows across various audio tasks.
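
The training recipe in the abstract, a latent L1 reconstruction term plus a single latent adversarial discriminator, is simple enough to sketch. Below is a minimal torch version, assuming a frozen autoencoder whose latents are shaped (batch, dim, time); the hinge form of the GAN loss and the 0.1 adversarial weight are assumptions.

```python
import torch
import torch.nn as nn

class LatentProcessor(nn.Module):
    # Maps source latents to target latents (e.g. a mono -> stereo up-mix),
    # never touching raw audio.
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(dim, 256, 3, padding=1), nn.GELU(),
                                 nn.Conv1d(256, dim, 3, padding=1))

    def forward(self, z):
        return self.net(z)

def latent_losses(processor, disc, z_in, z_target):
    # Latent L1 plus one latent adversarial term, per the abstract;
    # the hinge GAN form and 0.1 weight are our assumptions.
    z_hat = processor(z_in)
    l1 = (z_hat - z_target).abs().mean()
    g_loss = l1 + 0.1 * (-disc(z_hat).mean())
    d_loss = (torch.relu(1 - disc(z_target)) +
              torch.relu(1 + disc(z_hat.detach()))).mean()
    return g_loss, d_loss
```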

[292] Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders

Dimitrios Bralios, Jonah Casebeer, Paris Smaragdis

Main category: cs.SD

TL;DR: A post-hoc framework called Re-Bottleneck that modifies pre-trained audio autoencoders to impose user-defined latent structure without sacrificing reconstruction quality, enabling better performance in downstream applications.

DetailsMotivation: Existing neural audio codecs are trained primarily for reconstruction fidelity but lack the specific latent structure needed for optimal performance in diverse downstream applications like feature extraction and latent-space generation.

Method: Introduces a Re-Bottleneck - an inner bottleneck trained exclusively through latent space losses to instill user-defined structure into pre-trained autoencoders. Demonstrated through three experiments: latent channel ordering, semantic embedding alignment, and equivariance enforcement.

Result: The framework successfully enforces desired latent structures without compromising reconstruction quality, enables semantic alignment for better diffusion modeling, and introduces equivariance where input transformations directly correspond to latent space transformations.

Conclusion: Re-Bottleneck provides a flexible and efficient way to tailor neural audio model representations to meet varied application demands with minimal additional training, making pre-trained models more versatile for downstream tasks.

Abstract: Neural audio codecs and autoencoders have emerged as versatile models for audio compression, transmission, feature-extraction, and latent-space generation. However, a key limitation is that most are trained to maximize reconstruction fidelity, often neglecting the specific latent structure necessary for optimal performance in diverse downstream applications. We propose a simple, post-hoc framework to address this by modifying the bottleneck of a pre-trained autoencoder. Our method introduces a “Re-Bottleneck”, an inner bottleneck trained exclusively through latent space losses to instill user-defined structure. We demonstrate the framework’s effectiveness in three experiments. First, we enforce an ordering on latent channels without sacrificing reconstruction quality. Second, we align latents with semantic embeddings, analyzing the impact on downstream diffusion modeling. Third, we introduce equivariance, ensuring that a filtering operation on the input waveform directly corresponds to a specific transformation in the latent space. Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models, enabling them to seamlessly meet the varied demands of different applications with minimal additional training.

[293] HingeNet: A Harmonic-Aware Fine-Tuning Approach for Beat Tracking

Ganghui Ru, Jieying Wang, Jiahao Zhao, Yulun Wu, Yi Yu, Nannan Jiang, Wei Wang, Wei Li

Main category: cs.SD

TL;DR: HingeNet is a parameter-efficient fine-tuning method for beat tracking that uses intermediate features from pre-trained models and incorporates harmonic-aware mechanisms to achieve state-of-the-art performance.

DetailsMotivation: Existing fine-tuning methods for pre-trained foundation models are ineffective for beat tracking tasks due to limited annotated data, creating a need for specialized parameter-efficient approaches.

Method: HingeNet is a lightweight, separable network that interfaces with pre-trained models using their intermediate feature representations. It includes harmonic-aware mechanisms to better capture harmonic structures in music signals.

Result: Experiments on benchmark datasets show that HingeNet achieves state-of-the-art performance in both beat and downbeat tracking tasks.

Conclusion: HingeNet provides an effective parameter-efficient fine-tuning solution for beat tracking that generalizes well across different pre-trained foundation models and leverages harmonic information for improved performance.

Abstract: Fine-tuning pre-trained foundation models has made significant progress in music information retrieval. However, applying these models to beat tracking tasks remains unexplored as the limited annotated data renders conventional fine-tuning methods ineffective. To address this challenge, we propose HingeNet, a novel and general parameter-efficient fine-tuning method specifically designed for beat tracking tasks. HingeNet is a lightweight and separable network, visually resembling a hinge, designed to tightly interface with pre-trained foundation models by using their intermediate feature representations as input. This unique architecture grants HingeNet broad generalizability, enabling effective integration with various pre-trained foundation models. Furthermore, considering the significance of harmonics in beat tracking, we introduce a harmonic-aware mechanism during fine-tuning to better capture and emphasize the harmonic structures in musical signals. Experiments on benchmark datasets demonstrate that HingeNet achieves state-of-the-art performance in beat and downbeat tracking.
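
As a rough sketch of the described interface, the module below consumes intermediate hidden states from a frozen foundation model and emits beat/downbeat activations through a lightweight separable branch. Layer fusion by averaging and the depthwise-separable shape are guesses; the harmonic-aware mechanism is omitted since the abstract gives no formula for it.

```python
import torch
import torch.nn as nn

class HingeNetSketch(nn.Module):
    # Hypothetical reading of the design: a light separable branch that taps a
    # frozen foundation model's intermediate features.
    def __init__(self, feat_dim=768, hidden=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.sep = nn.Sequential(                 # depthwise-separable temporal conv
            nn.Conv1d(hidden, hidden, 9, padding=4, groups=hidden),
            nn.Conv1d(hidden, hidden, 1), nn.GELU())
        self.head = nn.Linear(hidden, 2)          # beat / downbeat activations

    def forward(self, hidden_states):
        # hidden_states: list of (batch, time, feat_dim) tensors from the frozen model
        h = torch.stack(hidden_states).mean(0)    # crude layer fusion
        h = self.sep(self.proj(h).transpose(1, 2)).transpose(1, 2)
        return torch.sigmoid(self.head(h))

net = HingeNetSketch()
feats = [torch.randn(2, 500, 768) for _ in range(4)]
print(net(feats).shape)                           # torch.Size([2, 500, 2])
```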

[294] SaD: A Scenario-Aware Discriminator for Speech Enhancement

Xihao Yuan, Siqi Liu, Yan Chen, Hang Zhou, Chang Liu, Hanting Chen, Jie Hu

Main category: cs.SD

TL;DR: Proposes a scenario-aware discriminator for GAN-based speech enhancement that captures scene-specific features and performs frequency-domain division to improve quality assessment without changing generator architectures.

DetailsMotivation: Current GAN optimization strategies focus mainly on generator architecture or discriminator quality metrics, overlooking rich contextual information from diverse scenarios.

Method: A scenario-aware discriminator that captures scene-specific features and performs frequency-domain division for more accurate quality assessment of enhanced speech.

Result: Comprehensive experiments on three representative models using two public datasets show the method effectively adapts to various generator architectures without structural changes, unlocking further performance gains.

Conclusion: The proposed scenario-aware discriminator enables better speech enhancement performance across different scenarios by leveraging contextual information and frequency-domain analysis.

Abstract: Generative adversarial network-based models have shown remarkable performance in the field of speech enhancement. However, the current optimization strategies for these models predominantly focus on refining the architecture of the generator or enhancing the quality evaluation metrics of the discriminator. This approach often overlooks the rich contextual information inherent in diverse scenarios. In this paper, we propose a scenario-aware discriminator that captures scene-specific features and performs frequency-domain division, thereby enabling a more accurate quality assessment of the enhanced speech generated by the generator. We conducted comprehensive experiments on three representative models using two publicly available datasets. The results demonstrate that our method can effectively adapt to various generator architectures without altering their structure, thereby unlocking further performance gains in speech enhancement across different scenarios.

[295] BeatFM: Improving Beat Tracking with Pre-trained Music Foundation Model

Ganghui Ru, Jieying Wang, Jiahao Zhao, Yulun Wu, Yi Yu, Nannan Jiang, Wei Wang, Wei Li

Main category: cs.SD

TL;DR: BeatFM introduces a pre-trained music foundation model with multi-dimensional semantic aggregation to address beat tracking challenges caused by limited labeled data and improve generalization across musical styles.

DetailsMotivation: Current beat tracking methods struggle with limited labeled data, poor generalization across diverse musical styles, and difficulty capturing complex rhythmic structures.

Method: Proposes BeatFM paradigm using pre-trained music foundation model with plug-and-play multi-dimensional semantic aggregation module (temporal, frequency, and channel domain sub-modules).

Result: Achieves state-of-the-art performance in beat and downbeat tracking across multiple benchmark datasets.

Conclusion: Pre-trained music foundation models with semantic aggregation effectively address beat tracking challenges and improve performance across diverse musical contexts.

Abstract: Beat tracking is a widely researched topic in music information retrieval. However, current beat tracking methods face challenges due to the scarcity of labeled data, which limits their ability to generalize across diverse musical styles and accurately capture complex rhythmic structures. To overcome these challenges, we propose a novel beat tracking paradigm BeatFM, which introduces a pre-trained music foundation model and leverages its rich semantic knowledge to improve beat tracking performance. Pre-training on diverse music datasets endows music foundation models with a robust understanding of music, thereby effectively addressing these challenges. To further adapt it for beat tracking, we design a plug-and-play multi-dimensional semantic aggregation module, which is composed of three parallel sub-modules, each focusing on semantic aggregation in the temporal, frequency, and channel domains, respectively. Extensive experiments demonstrate that our method achieves state-of-the-art performance in beat and downbeat tracking across multiple benchmark datasets.
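
The abstract specifies three parallel sub-modules aggregating along the temporal, frequency, and channel domains but not their internals. A plausible minimal form is squeeze-and-excitation-style gating per axis, as sketched below; the gating design and residual fusion are assumptions, not BeatFM’s actual module.

```python
import torch
import torch.nn as nn

class SemanticAggregation(nn.Module):
    # Three parallel branches, each pooling the feature map along one axis
    # (time, frequency, channel), fused by residual sum. Illustrative only.
    def __init__(self, channels):
        super().__init__()
        self.gate_t = nn.Conv2d(channels, channels, 1)
        self.gate_f = nn.Conv2d(channels, channels, 1)
        self.gate_c = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                  # x: (batch, channels, freq, time)
        t = x * torch.sigmoid(self.gate_t(x.mean(dim=3, keepdim=True)))
        f = x * torch.sigmoid(self.gate_f(x.mean(dim=2, keepdim=True)))
        c = x * torch.sigmoid(self.gate_c(x.mean(dim=(2, 3), keepdim=True)))
        return x + t + f + c               # residual fusion of the three domains

x = torch.randn(2, 32, 128, 400)
print(SemanticAggregation(32)(x).shape)    # torch.Size([2, 32, 128, 400])
```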

[296] Continuous Audio Language Models

Simon Rouard, Manu Orsini, Axel Roebel, Neil Zeghidour, Alexandre Défossez

Main category: cs.SD

TL;DR: CALM introduces continuous audio language models that avoid lossy compression, achieving higher audio quality at lower computational cost compared to discrete token-based approaches.

DetailsMotivation: Discrete audio tokens from lossy codecs create a trade-off between audio fidelity and computational cost - higher quality requires more tokens. CALM aims to overcome this limitation by using continuous representations.

Method: Uses a large Transformer backbone to produce contextual embeddings at each timestep, then conditions an MLP to generate continuous audio frames through consistency modeling with an audio VAE, avoiding discrete tokenization.

Result: CALM achieves higher audio quality with lower computational cost than state-of-the-art discrete audio language models, demonstrated on both speech and music generation tasks.

Conclusion: Continuous audio language modeling facilitates lightweight, high-quality audio generation by eliminating the fidelity-computation trade-off inherent in discrete token approaches.

Abstract: Audio Language Models (ALM) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than its discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. Samples are available at hf.co/spaces/kyutai/calm-samples
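
A toy version of the decoding stack makes the setup concrete: a Transformer backbone yields one contextual embedding per timestep, and a small MLP conditioned on that embedding maps a noisy candidate toward the next continuous VAE frame. The real model trains this head with consistency modeling; the plain regression below is a simplification, and all dimensions are invented.

```python
import torch
import torch.nn as nn

class ContinuousFrameHead(nn.Module):
    # Stand-in for CALM's head: an MLP conditioned on the backbone embedding
    # predicts the next continuous VAE frame.
    def __init__(self, d_model=512, d_latent=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model + d_latent, 1024), nn.SiLU(),
                                 nn.Linear(1024, d_latent))

    def forward(self, ctx, noisy_frame):
        return self.mlp(torch.cat([ctx, noisy_frame], dim=-1))

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=4)
head = ContinuousFrameHead()
latents = torch.randn(2, 100, 64)          # (batch, time, VAE latent dim)
ctx = backbone(torch.randn(2, 100, 512))   # one contextual embedding per step
pred = head(ctx[:, :-1], torch.randn(2, 99, 64))
loss = torch.mean((pred - latents[:, 1:]) ** 2)   # regress the next frame
```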

cs.LG

[297] Individualized and Interpretable Sleep Forecasting via a Two-Stage Adaptive Spatial-Temporal Model

Xueyi Wang, Elisabeth Wilhelm

Main category: cs.LG

TL;DR: An interpretable two-stage adaptive spatial-temporal model for sleep quality prediction that outperforms baseline methods and provides feature explainability.

DetailsMotivation: Sleep quality significantly impacts well-being, and there's a need for accessible, reliable forecasting tools for preventive interventions using data from commercial wearable devices.

Method: Combines multi-scale convolutional layers for spatial interactions, recurrent layers and attention mechanisms for temporal dependencies, and a two-stage domain adaptation strategy (training-time adaptation and source-free test-time adaptation) to enhance generalization to new users.

Result: Outperformed baseline approaches (LSTM, Informer, PatchTST, TimesNet) across various window sizes. Best performance: 0.216 RMSE with 3-day input and 1-day prediction window. Maintained good performance (0.257 RMSE) for longer 3-day prediction horizons.

Conclusion: The framework provides a robust, adaptive, and explainable solution for personalized sleep forecasting using sparse wearable data, demonstrating practical utility for real-world applications.

Abstract: Sleep quality significantly impacts well-being. Therefore, healthcare providers and individuals need accessible and reliable forecasting tools for preventive interventions. This paper introduces an interpretable, individualized two-stage adaptive spatial-temporal model for predicting sleep quality scores. Our proposed framework combines multi-scale convolutional layers to model spatial interactions across multiple input variables, recurrent layers and attention mechanisms to capture long-term temporal dependencies, and a two-stage domain adaptation strategy to enhance generalization. The first adaptation stage is applied during training to mitigate overfitting on the training set. In the second stage, a source-free test-time adaptation mechanism is employed to adapt the model to new users without requiring labels. We conducted various experiments with five input window sizes (3, 5, 7, 9, and 11 days) and five prediction window sizes (1, 3, 5, 7, and 9 days). Our model consistently outperformed time series forecasting baseline approaches, including Long Short-Term Memory (LSTM), Informer, PatchTST, and TimesNet. The best performance was achieved with a three-day input window and a one-day prediction window, yielding a root mean square error (RMSE) of 0.216. Furthermore, the model demonstrated good predictive performance even for longer forecasting horizons (e.g., a 0.257 RMSE for a three-day prediction window), highlighting its practical utility for real-world applications. We also conducted an explainability analysis to examine how different features influence sleep quality. These findings demonstrate that the proposed framework offers a robust, adaptive, and explainable solution for personalized sleep forecasting using sparse data from commercial wearable devices.

[298] GSTBench: A Benchmark Study on the Transferability of Graph Self-Supervised Learning

Yu Song, Zhigang Hua, Yan Xie, Jingzhe Liu, Bo Long, Hui Liu

Main category: cs.LG

TL;DR: GSTBench is the first systematic benchmark for evaluating cross-dataset transferability of graph self-supervised learning methods, revealing most SSL methods struggle to generalize while GraphMAE shows consistent improvements.

DetailsMotivation: Most graph SSL methods are developed and evaluated under single-dataset settings, leaving their cross-dataset transferability unexplored and limiting knowledge transfer capabilities needed for generalized intelligence.

Method: Large-scale pretraining on ogbn-papers100M with evaluation of five representative SSL methods across diverse target graphs using standardized experimental setup that decouples confounding factors.

Result: Most graph SSL methods struggle to generalize, with some performing worse than random initialization. GraphMAE (masked autoencoder approach) consistently improves transfer performance.

Conclusion: The benchmark provides insights to guide future research on transferable graph SSL and lays foundation for “pretrain-then-transfer” paradigm in graph learning.

Abstract: Self-supervised learning (SSL) has shown great promise in graph representation learning. However, most existing graph SSL methods are developed and evaluated under a single-dataset setting, leaving their cross-dataset transferability largely unexplored and limiting their ability to leverage knowledge transfer and large-scale pretraining, factors that are critical for developing generalized intelligence beyond fitting training data. To address this gap and advance foundation model research for graphs, we present GSTBench, the first systematic benchmark for evaluating the transferability of graph SSL methods. We conduct large-scale pretraining on ogbn-papers100M and evaluate five representative SSL methods across a diverse set of target graphs. Our standardized experimental setup decouples confounding factors such as model architecture, dataset characteristics, and adaptation protocols, enabling rigorous comparisons focused solely on pretraining objectives. Surprisingly, we observe that most graph SSL methods struggle to generalize, with some performing worse than random initialization. In contrast, GraphMAE, a masked autoencoder approach, consistently improves transfer performance. We analyze the underlying factors that drive these differences and offer insights to guide future research on transferable graph SSL, laying a solid foundation for the “pretrain-then-transfer” paradigm in graph learning. Our code is available at https://github.com/SongYYYY/GSTBench.

[299] A Knowledge-Guided Cross-Modal Feature Fusion Model for Local Traffic Demand Prediction

Lingyu Zhang, Pengfei Xu, Guobin Wu, Jian Liang, Ruiyang Dong, Yunhai Wang, Xuan Song

Main category: cs.LG

TL;DR: A knowledge-guided cross-modal model that combines temporal traffic data with human knowledge text for improved traffic demand prediction.

DetailsMotivation: Existing traffic prediction models rely mainly on temporal data, ignoring valuable human knowledge and experience that could enhance prediction accuracy and robustness.

Method: Proposes KGCM model that integrates structured temporal traffic data with textual human knowledge. Uses LLM to create prior knowledge dataset, learns multimodal features through adaptive graph networks and cross-modal fusion, with dynamic parameter optimization.

Result: Experiments on multiple traffic datasets show the model accurately predicts future traffic demand and outperforms state-of-the-art models.

Conclusion: Incorporating human knowledge and experience through cross-modal learning significantly improves traffic demand prediction performance compared to temporal-only approaches.

Abstract: Traffic demand prediction plays a critical role in intelligent transportation systems. Existing traffic prediction models primarily rely on temporal traffic data, with limited efforts incorporating human knowledge and experience for urban traffic demand forecasting. However, in real-world scenarios, traffic knowledge and experience derived from human daily life significantly influence precise traffic prediction. Such knowledge and experiences can guide the model in uncovering latent patterns within traffic data, thereby enhancing the accuracy and robustness of predictions. To this end, this paper proposes integrating structured temporal traffic data with textual data representing human knowledge and experience, resulting in a novel knowledge-guided cross-modal feature representation learning (KGCM) model for traffic demand prediction. Based on regional transportation characteristics, we construct a prior knowledge dataset using a large language model combined with manual authoring and revision, covering both regional and global knowledge and experiences. The KGCM model then learns multimodal data features through designed local and global adaptive graph networks, as well as a cross-modal feature fusion mechanism. A proposed reasoning-based dynamic update strategy enables dynamic optimization of the graph model’s parameters, achieving optimal performance. Experiments on multiple traffic datasets demonstrate that our model accurately predicts future traffic demand and outperforms existing state-of-the-art (SOTA) models.

[300] Toward Reproducible Cross-Backend Compatibility for Deep Learning: A Configuration-First Framework with Three-Tier Verification

Zehua Li

Main category: cs.LG

TL;DR: A configuration-first framework for evaluating cross-backend compatibility in deep learning systems across CPU, GPU, and compiled runtimes, using YAML-based experiments and three-tier verification to quantify and mitigate backend drift.

DetailsMotivation: To address the challenge of ensuring consistent deep learning model behavior across different deployment backends (CPU, GPU, compiled runtimes) where discrepancies and drift can occur, particularly with detection models and compiled backends.

Method: Decouples experiments from code using YAML configuration, supports library and repository models, employs three-tier verification protocol (tensor-level closeness, activation alignment, task-level metrics), and tests with 672 checks across multiple models and tolerance settings.

Result: 72.0% of runs pass, with most discrepancies occurring under stricter thresholds. Detection models and compiled backends are particularly prone to drift, often due to nondeterministic post-processing. Deterministic adapters and selective fallbacks can substantially improve agreement without significant performance loss.

Conclusion: This is the first unified framework that systematically quantifies and mitigates cross-backend drift in deep learning, providing a reproducible methodology for dependable deployment across heterogeneous runtimes.

Abstract: This paper presents a configuration-first framework for evaluating cross-backend compatibility in deep learning systems deployed on CPU, GPU, and compiled runtimes. The framework decouples experiments from code using YAML, supports both library and repository models, and employs a three-tier verification protocol covering tensor-level closeness, activation alignment, and task-level metrics. Through 672 checks across multiple models and tolerance settings, we observe that 72.0% of runs pass, with most discrepancies occurring under stricter thresholds. Our results show that detection models and compiled backends are particularly prone to drift, often due to nondeterministic post-processing. We further demonstrate that deterministic adapters and selective fallbacks can substantially improve agreement without significant performance loss. To our knowledge, this is the first unified framework that systematically quantifies and mitigates cross-backend drift in deep learning, providing a reproducible methodology for dependable deployment across heterogeneous runtimes.
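
To ground the three-tier protocol, here is a minimal configuration-first check in the same spirit: a YAML block declares the backends and per-tier tolerances, and three small functions implement tensor-level closeness, activation alignment, and a task-level metric delta. All field names and thresholds are illustrative, not the framework’s actual schema.

```python
import numpy as np
import yaml

CONFIG = yaml.safe_load("""
model: resnet18
backends: [cpu, gpu]
tolerances:                      # per-tier thresholds; values are illustrative
  tensor: {rtol: 1.0e-4, atol: 1.0e-5}
  activation_cosine: 0.999
  task_metric_delta: 0.5         # max allowed accuracy gap in points
""")

def tier1_tensor_close(a, b, tol):
    # Tier 1: element-wise closeness of raw output tensors.
    return np.allclose(a, b, rtol=tol["rtol"], atol=tol["atol"])

def tier2_activation_aligned(a, b, min_cos):
    # Tier 2: cosine alignment of flattened intermediate activations.
    cos = a.ravel() @ b.ravel() / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos >= min_cos

def tier3_task_metric_ok(metric_ref, metric_alt, max_delta):
    # Tier 3: task-level metric (e.g. accuracy) must not drift too far.
    return abs(metric_ref - metric_alt) <= max_delta

ref = np.random.rand(8, 10)
alt = ref + 1e-6                 # stand-in for another backend's output
t = CONFIG["tolerances"]
print(tier1_tensor_close(ref, alt, t["tensor"]),
      tier2_activation_aligned(ref, alt, t["activation_cosine"]),
      tier3_task_metric_ok(76.1, 75.9, t["task_metric_delta"]))
```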

[301] A Kriging-HDMR-based surrogate model with sample pool-free active learning strategy for reliability analysis

Wenxiong Li, Hanyu Liao, Suiyin Chen

Main category: cs.LG

TL;DR: Active learning Kriging-HDMR surrogate model for high-dimensional reliability analysis that maintains accuracy in critical regions while being computationally efficient.

DetailsMotivation: Conventional surrogate models suffer from curse of dimensionality in high-dimensional reliability analysis, and existing HDMR approaches focus on optimization rather than reliability analysis which requires accuracy in critical regions.

Method: Three-stage Kriging-HDMR framework: 1) single-variable sub-surrogate models, 2) identification of coupling-variable requirements, 3) construction of coupling-variable sub-surrogates. Uses optimization models for sample selection with uncertainty variance, predicted mean, location and distance criteria, without candidate pool.

Result: Numerical experiments show the method achieves high computational efficiency while maintaining strong predictive accuracy for high-dimensional reliability problems.

Conclusion: The proposed active learning Kriging-HDMR approach effectively handles high-dimensional reliability analysis with improved efficiency and maintains accuracy in critical regions where it matters most.

Abstract: In reliability engineering, conventional surrogate models encounter the “curse of dimensionality” as the number of random variables increases. While active learning Kriging surrogate approaches with high-dimensional model representation (HDMR) enable effective approximation of high-dimensional functions and are widely applied to optimization problems, few studies have specifically focused on reliability analysis, which prioritizes prediction accuracy in critical regions over uniform accuracy across the entire domain. This study develops an active learning surrogate model method based on Kriging-HDMR modeling for reliability analysis. The proposed approach facilitates the approximation of high-dimensional limit state functions through a composite representation constructed from multiple low-dimensional sub-surrogate models. The architecture of the surrogate modeling framework comprises three distinct stages: developing single-variable sub-surrogate models for all random variables, identifying the requirements for coupling-variable sub-surrogate models, and constructing the coupling-variable sub-surrogate models. Mathematical optimization models for the selection of design-of-experiment samples are formulated based on each stage’s characteristics, with objectives incorporating uncertainty variance, predicted mean, sample location, and inter-sample distances. A candidate-pool-free approach is adopted to select informative samples. Numerical experiments demonstrate that the proposed method achieves high computational efficiency while maintaining strong predictive accuracy in solving high-dimensional reliability problems.

[302] Exploring Over-stationarization in Deep Learning-based Bus/Tram Arrival Time Prediction: Analysis and Non-stationary Effect Recovery

Zirui Li, Bin Yang, Meng Wang

Main category: cs.LG

TL;DR: Proposes NSATP method for public transport arrival time prediction that balances predictability and non-stationarity through two-stage approach: series stationarization and non-stationarity effect recovery.

DetailsMotivation: Existing normalization methods for handling non-stationary time series in ATP may obscure useful characteristics inherent in non-stationarity (over-stationarization), degrading model performance.

Method: Two-stage approach: 1) Series stationarization to improve predictability, 2) Non-stationarity effect recovery using 2D extension of state-of-the-art methods to capture hidden periodicity and compensation module learning scaling/shifting factors from raw data.

Result: Reduces RMSE, MAE, and MAPE by 2.37%, 1.22%, and 2.26% for trams and by 1.72%, 0.60%, and 1.17% for buses compared to baselines, using 125 days of Dresden public transport data.

Conclusion: NSATP effectively balances predictability and non-stationarity in multi-step ATP, outperforming baseline methods while preserving useful non-stationary characteristics.

Abstract: Arrival time prediction (ATP) of public transport vehicles is essential in improving passenger experience and supporting traffic management. Deep learning has demonstrated outstanding performance in ATP due to its ability to model non-linear and temporal dynamics. In multi-step ATP, non-stationary data will degrade model performance due to the variation in the variables’ joint distribution along the temporal direction. Previous studies mainly applied normalization to eliminate the non-stationarity in time series, thereby achieving better predictability. However, the normalization may obscure useful characteristics inherent in non-stationarity, which is known as over-stationarization. In this work, to trade off predictability and non-stationarity, a new approach for multi-step ATP, named non-stationary ATP (NSATP), is proposed. The method consists of two stages: series stationarization and non-stationarity effect recovery. The first stage aims at improving predictability. As for the latter, NSATP extends a state-of-the-art method from one-dimensional to two-dimensional models to capture the hidden periodicity in time series and designs a compensation module for over-stationarization by learning scaling and shifting factors from raw data. 125 days of public transport operational data from Dresden are collected for validation. Experimental results show that compared to baseline methods, the proposed NSATP can reduce RMSE, MAE, and MAPE by 2.37%, 1.22%, and 2.26% for trams and by 1.72%, 0.60%, and 1.17% for buses, respectively.
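
The two stages map naturally onto a normalize-predict-denormalize pattern. The sketch below is a minimal reading, assuming per-window standardization for stage one and small learnable networks that recover scaling and shifting factors from the raw window for stage two; the linear forecaster and exponential scale parameterization are placeholders.

```python
import torch
import torch.nn as nn

class StationarizeAndRecover(nn.Module):
    # Stage 1: forecast in a stationarized space. Stage 2: restore
    # non-stationary effects from factors learned on the raw window.
    def __init__(self, window, horizon):
        super().__init__()
        self.forecaster = nn.Linear(window, horizon)   # stand-in predictor
        self.scale_net = nn.Linear(window, 1)          # compensation: scaling
        self.shift_net = nn.Linear(window, 1)          # compensation: shifting

    def forward(self, x):                              # x: (batch, window)
        mu = x.mean(dim=1, keepdim=True)
        sigma = x.std(dim=1, keepdim=True) + 1e-5
        y_stat = self.forecaster((x - mu) / sigma)     # stage 1: stationarized
        scale = torch.exp(self.scale_net(x))           # stage 2: recover effects
        shift = self.shift_net(x)
        return y_stat * sigma * scale + mu + shift

model = StationarizeAndRecover(window=12, horizon=3)
print(model(torch.randn(4, 12)).shape)                 # torch.Size([4, 3])
```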

[303] RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use

Jiajun Chai, Guojun Yin, Zekun Xu, Chuhuai Yue, Yi Jia, Siyu Xia, Xiaohan Wang, Jiwen Jiang, Xiaoguang Li, Chengqi Dong, Hang He, Wei Lin

Main category: cs.LG

TL;DR: RLFactory is a reinforcement learning framework that improves LLMs’ multi-round tool use through asynchronous calling, decoupled architecture, and diverse reward signals, achieving better performance than larger models with higher efficiency.

DetailsMotivation: Large language models struggle with tasks requiring interaction with external tools, particularly in multi-round scenarios with tool heterogeneity and interface issues.

Method: Uses asyncio-based asynchronous caller, decoupled tool/training architecture, observation markers from tool feedback, and a generate-parse-invoke-update workflow with multiple reward signals (rule-based, model-judgment, tool-verification).

Result: Achieved 0.486 test score on Natural Questions dataset with Qwen3-4B, surpassing Qwen2.5-7B-Instruct-GRPO (0.473) and increasing training throughput by 6.8x.

Conclusion: RLFactory provides a low-barrier, highly adaptable framework for strengthening multi-round tool use of LLMs in real-world scenarios with improved performance and efficiency.

Abstract: Large language models excel at basic reasoning but struggle with tasks that require interaction with external tools. We present RLFactory, a plug-and-play reinforcement learning post-training framework for multi-round tool use. RLFactory tackles (i) tool-call stability and adaptability amid tool heterogeneity and interface issues via an asyncio-based asynchronous caller and a decoupled tool/training architecture, and (ii) diverse evaluation needs via a reward layer supporting rule-based, model-judgment, and tool-verification signals. It reconstructs the MDP by introducing observation markers from tool feedback, closing the loop among model, tools, and environment, and implements a generate-parse-invoke-update workflow for dynamic policy optimization. On Search-R1 with Qwen3-4B, RLFactory achieves a 0.486 test score on the Natural Questions (NQ) dataset, surpassing larger models trained with similar techniques (e.g., Qwen2.5-7B-Instruct-GRPO at 0.473), and increases training throughput by 6.8x. RLFactory provides a low-barrier, highly adaptable framework for strengthening multi-round tool use of LLMs in real-world scenarios. Code: https://github.com/Simple-Efficient/RL-Factory.
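
The asyncio-based caller is the most concrete piece of the abstract, and a generic version is easy to sketch. Below, tool calls from one rollout are fired concurrently with per-call timeouts, and failures come back as observation strings rather than exceptions; none of these names come from RLFactory’s actual API.

```python
import asyncio

async def call_tool(name, fn, args, timeout=5.0):
    # Invoke one tool with a timeout so a flaky interface cannot stall training.
    try:
        return name, await asyncio.wait_for(fn(*args), timeout)
    except Exception as exc:            # surface failures as observations
        return name, f"<tool_error: {exc}>"

async def gather_tool_calls(calls):
    # Fire every tool call from one rollout concurrently.
    results = await asyncio.gather(*(call_tool(n, f, a) for n, f, a in calls))
    return dict(results)

async def search(query):                # hypothetical tool, not RLFactory's API
    await asyncio.sleep(0.1)
    return f"results for {query!r}"

print(asyncio.run(gather_tool_calls([("search", search, ("who wrote Hamlet?",))])))
```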

[304] CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention

Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, Tsung-Yi Ho

Main category: cs.LG

TL;DR: CARE is a novel decoding-time safety alignment framework that combines real-time safety monitoring, rollback mechanism, and introspection-based intervention to achieve superior safety-quality trade-off in LLM outputs.

DetailsMotivation: Existing decoding-time interventions like Contrastive Decoding force severe trade-offs between safety and response quality, creating a need for better safety alignment methods during LLM deployment.

Method: Three key components: (1) guard model for real-time safety monitoring, (2) rollback mechanism with token buffer for early correction, and (3) introspection-based intervention where the model generates self-reflective critiques to guide subsequent decoding.

Result: Experimental results show superior balance of safety, quality, and efficiency with low harmful response rate and minimal user experience disruption while maintaining high response quality.

Conclusion: CARE framework effectively addresses the safety-quality trade-off problem in LLM decoding through integrated real-time monitoring, timely corrections, and self-reflective intervention strategies.

Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, ensuring the safety of their outputs during decoding has become a critical challenge. However, existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality. In this work, we propose CARE, a novel framework for decoding-time safety alignment that integrates three key components: (1) a guard model for real-time safety monitoring, enabling detection of potentially unsafe content; (2) a rollback mechanism with a token buffer to correct unsafe outputs efficiently at an earlier stage without disrupting the user experience; and (3) a novel introspection-based intervention strategy, where the model generates self-reflective critiques of its previous outputs and incorporates these reflections into the context to guide subsequent decoding steps. The framework achieves a superior safety-quality trade-off by using its guard model for precise interventions, its rollback mechanism for timely corrections, and our novel introspection method for effective self-correction. Experimental results demonstrate that our framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience while maintaining high response quality.
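
A skeleton of the decoding loop shows how the three components fit together: tokens accumulate in a buffer, the guard model checks each full buffer, and an unsafe span is rolled back and replaced by a self-critique that steers subsequent steps. All three callables are placeholders for the model, the guard, and the introspection step, not CARE’s interfaces.

```python
def guarded_decode(step, is_safe, critique, max_tokens=256, buf_len=16):
    # step(ctx) -> next token; is_safe(tokens) -> bool (guard model);
    # critique(tokens) -> non-empty replacement tokens carrying the
    # self-reflection. All three are placeholders.
    context, buffer = [], []
    while len(context) < max_tokens:
        buffer.append(step(context + buffer))      # propose one token
        if len(buffer) == buf_len:
            if is_safe(context + buffer):
                context += buffer                  # commit the safe span
            else:
                context += critique(buffer)        # rollback + introspection
            buffer = []
    return context
```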

[305] FediLoRA: Heterogeneous LoRA for Federated Multimodal Fine-tuning under Missing Modalities

Lishan Yang, Nam Kha Nguygen, Po Hu, Wei Emma Zhang, Yanjun Shu, Mong Yuan Sim, Weitong Chen

Main category: cs.LG

TL;DR: FediLoRA is a federated learning framework that addresses heterogeneous client resources and missing modalities in multimodal fine-tuning using dimension-wise aggregation and layer-wise model editing.

DetailsMotivation: Foundation models have deployment challenges due to large parameter sizes. Existing federated LoRA methods overlook heterogeneous client resources with different LoRA ranks and multimodal data with missing modalities.

Method: Proposes FediLoRA with dimension-wise aggregation strategy that reweights LoRA updates without information dilution, and lightweight layer-wise model editing to selectively incorporate global parameters for local component repair.

Result: Experimental results on three multimodal benchmark datasets show superior performance over competitive baselines in both global and personalized settings, especially with modality incompleteness.

Conclusion: FediLoRA effectively handles heterogeneous LoRA ranks and missing modalities in federated multimodal fine-tuning, achieving better performance than existing methods.

Abstract: Foundation models have demonstrated remarkable performance across a wide range of tasks, yet their large parameter sizes pose challenges for practical deployment, especially in decentralized environments. Parameter-efficient fine-tuning (PEFT), such as Low-Rank Adaptation (LoRA), reduces local computing and memory overhead, making it attractive for federated learning. However, existing federated LoRA methods typically assume uniform rank configurations and unimodal inputs, overlooking two key real-world challenges: (1) heterogeneous client resources have different LoRA ranks, and (2) multimodal data settings with potentially missing modalities. In this work, we propose FediLoRA, a simple yet effective framework for federated multimodal fine-tuning under heterogeneous LoRA ranks and missing modalities. FediLoRA introduces a dimension-wise aggregation strategy that reweights LoRA updates without information dilution during aggregation. It also includes a lightweight layer-wise model editing method that selectively incorporates global parameters to repair local components, improving both client and global model performance. Experimental results on three multimodal benchmark datasets demonstrate that FediLoRA achieves superior performance over competitive baselines in both global and personalized settings, particularly in the presence of modality incompleteness.
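
Dimension-wise aggregation has to reconcile LoRA factors of different ranks. One plausible minimal reading, sketched below, zero-pads every client’s factors to the maximum rank and normalizes each rank dimension by the number of clients that actually populate it, so low-rank clients do not dilute dimensions they never touched. This illustrates the idea, not the paper’s implementation.

```python
import numpy as np

def dimensionwise_aggregate(client_As, client_Bs):
    # client_As[i]: (r_i, d_in) LoRA A factor; client_Bs[i]: (d_out, r_i).
    r_max = max(A.shape[0] for A in client_As)
    d_in, d_out = client_As[0].shape[1], client_Bs[0].shape[0]
    A_sum = np.zeros((r_max, d_in))
    B_sum = np.zeros((d_out, r_max))
    counts = np.zeros(r_max)
    for A, B in zip(client_As, client_Bs):
        r = A.shape[0]
        A_sum[:r] += A                 # zero-padded accumulation
        B_sum[:, :r] += B
        counts[:r] += 1                # how many clients touch each dimension
    counts = np.maximum(counts, 1)     # average only over contributing clients
    return A_sum / counts[:, None], B_sum / counts[None, :]

# Two clients with ranks 4 and 8 on a 16 -> 16 layer.
As = [np.random.randn(4, 16), np.random.randn(8, 16)]
Bs = [np.random.randn(16, 4), np.random.randn(16, 8)]
A_glob, B_glob = dimensionwise_aggregate(As, Bs)
print(A_glob.shape, B_glob.shape)      # (8, 16) (16, 8)
```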

[306] Machine Generalize Learning in Agent-Based Models: Going Beyond Surrogate Models for Calibration in ABMs

Sima Najafzadehkhoei, George Vega Yon, Bernardo Modenesi, Derek S. Meyer

Main category: cs.LG

TL;DR: A machine learning approach using bidirectional LSTM to calibrate SIR epidemic models faster and more accurately than traditional Bayesian methods.

DetailsMotivation: Traditional calibration methods for agent-based epidemic models are computationally demanding and time-consuming, requiring more efficient solutions.

Method: Three-layer bidirectional LSTM that learns inverse mapping from epidemic time series to SIR parameters, using composite loss with epidemiology-motivated consistency penalty.

Result: Achieved lower error across all targets (MAE: R0 0.0616 vs 0.275), tighter predictive intervals, and reduced calibration time from 77.4s to 2.35s per calibration compared to ABC.

Conclusion: The machine learning calibrator provides fast, practical, and more accurate calibration of epidemic models while handling parameter nonidentifiability better than traditional methods.

Abstract: Calibrating agent-based epidemic models is computationally demanding. We present a supervised machine learning calibrator that learns the inverse mapping from epidemic time series to SIR parameters. A three-layer bidirectional LSTM ingests 60-day incidence together with population size and recovery rate, and outputs transmission probability, contact rate, and R0. Training uses a composite loss with an epidemiology-motivated consistency penalty that encourages R0 * recovery rate to equal transmission probability * contact rate. In a 1000-scenario simulation study, we compare the calibrator with Approximate Bayesian Computation (likelihood-free MCMC). The method achieves lower error across all targets (MAE: R0 0.0616 vs 0.275; transmission 0.0715 vs 0.128; contact 1.02 vs 4.24), produces tighter predictive intervals with near nominal coverage, and reduces wall-clock time from 77.4 s to 2.35 s per calibration. Although contact rate and transmission probability are partially nonidentifiable, the approach reproduces epidemic curves more faithfully than ABC, enabling fast and practical calibration. We evaluate it on SIR agent-based epidemics generated with epiworldR and provide an implementation in R.
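
The consistency penalty is stated explicitly in the abstract (R0 x recovery rate should equal transmission probability x contact rate), so the composite loss can be sketched directly. Everything else below, the output ordering, the MSE term, and the weight `lam`, is an illustrative assumption; the paper’s implementation is in R, while this sketch uses torch for brevity.

```python
import torch

def calibration_loss(pred, target, recovery_rate, lam=0.1):
    # pred/target columns: transmission probability p, contact rate c, R0.
    # The penalty enforces R0 * recovery == p * c, as stated in the abstract;
    # the MSE term, column order, and weight lam are illustrative.
    p, c, r0 = pred[:, 0], pred[:, 1], pred[:, 2]
    mse = torch.mean((pred - target) ** 2)
    consistency = torch.mean((r0 * recovery_rate - p * c) ** 2)
    return mse + lam * consistency

pred = torch.rand(8, 3, requires_grad=True)
target = torch.rand(8, 3)
print(calibration_loss(pred, target, recovery_rate=torch.full((8,), 0.2)))
```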

[307] An efficient deep reinforcement learning environment for flexible job-shop scheduling

Xinquan Wu, Xuefeng Yan, Mingqiang Wei, Donghai Guan

Main category: cs.LG

TL;DR: A simple chronological DRL environment for FJSP using discrete event simulation with PPO-based scheduling model, improved state representation, and novel reward function that outperforms traditional methods.

DetailsMotivation: Existing DRL methods for FJSP focus mainly on agent design while overlooking environment modeling, creating a need for better DRL environment frameworks.

Method: Developed a chronological DRL environment based on discrete event simulation, used proximal policy optimization (PPO) for end-to-end scheduling, proposed short state representation with two variables, and designed novel reward function based on machine scheduling areas.

Result: Experimental results show improved performance of priority dispatching rules in the new environment, and the DRL model achieves competitive performance compared to OR-Tools, meta-heuristic, DRL, and PDR methods on public benchmarks.

Conclusion: The proposed simple chronological DRL environment and scheduling model effectively address FJSP, demonstrating that proper environment modeling is crucial for achieving competitive scheduling performance with deep reinforcement learning approaches.

Abstract: The Flexible Job-shop Scheduling Problem (FJSP) is a classical combinatorial optimization problem that has a wide range of applications in the real world. In order to generate fast and accurate scheduling solutions for FJSP, various deep reinforcement learning (DRL) scheduling methods have been developed. However, these methods are mainly focused on the design of the DRL scheduling agent, overlooking the modeling of the DRL environment. This paper presents a simple chronological DRL environment for FJSP based on discrete event simulation, and an end-to-end DRL scheduling model is proposed based on proximal policy optimization (PPO). Furthermore, a compact novel state representation of FJSP is proposed based on two state variables in the scheduling environment, and a novel, comprehensible reward function is designed based on the scheduling area of machines. Experimental results on public benchmark instances show that the performance of simple priority dispatching rules (PDR) is improved in our scheduling environment and that our DRL scheduling model obtains competitive performance compared with OR-Tools, meta-heuristic, DRL, and PDR scheduling methods.

[308] 1 bit is all we need: binary normalized neural networks

Eduardo Lobo Lustoda Cabral, Paulo Pirozelli, Larissa Driemeier

Main category: cs.LG

TL;DR: A novel neural network architecture using single-bit parameters (0 or 1) called binary normalized layers that reduces memory usage by 32x while maintaining equivalent performance to 32-bit models.

DetailsMotivation: To address deployment challenges of large neural network models by reducing memory requirements and enhancing computational efficiency for practical deployment across various applications.

Method: Developed binary normalized layers (fully connected, convolutional, attention, etc.) where all parameters including weights and biases are single-bit (0 or 1). Tested on image classification with convolutional/fully connected layers and language modeling with transformer blocks.

Result: Models with binary normalized layers achieved almost identical performance to equivalent 32-bit models while using 32 times less memory. The layers can be implemented using 1-bit arrays without requiring dedicated hardware.

Conclusion: Binary normalized layers enable large neural networks with dramatically reduced memory requirements that can be deployed on simple hardware like mobile devices and CPUs, opening new possibilities for efficient model deployment.

Abstract: The increasing size of large neural network models, specifically language models and foundational image models, poses deployment challenges, prompting efforts to reduce memory requirements and enhance computational efficiency. These efforts are critical to ensure practical deployment and effective utilization of these models across various applications. In this work, a novel type of neural network layer and model is developed that uses only single-bit parameters. In this type of model, all parameters of all layers, including kernel weights and biases, take only the values zero or one. These models use layers named binary normalized layers, which can be of any type, such as fully connected, convolutional, or attention layers, and consist of slight variations of the corresponding conventional layers. To show the effectiveness of the binary normalized layers, two models are configured: one solves a multiclass image classification problem, and the other is a language decoder that predicts the next token of a sequence. The image classification model has convolutional and fully connected layers, and the language model is composed of transformer blocks with multi-head attention. The results show that models with binary normalized layers achieve almost the same results as equivalent models with real 32-bit parameters. The binary normalized layers allow the development of models that use 32 times less memory than current models while delivering equivalent performance. Moreover, binary normalized layers can be easily implemented on current computers using 1-bit arrays and do not require dedicated electronic hardware. This novel type of layer opens a new era for large neural network models with reduced memory requirements that can be deployed using simple and cheap hardware, such as mobile devices or CPU-only machines.
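
The abstract fixes the representation (all weights and biases in {0, 1}) but not the training procedure. A common recipe for such layers, shown below as an assumption rather than the paper’s method, keeps latent real-valued weights, binarizes them in the forward pass with a straight-through gradient estimator, and normalizes activations to compensate for the binary matmul’s scale.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    # Round to {0, 1} in the forward pass; pass gradients straight through.
    # The STE trick is a standard recipe, not confirmed by the abstract.
    @staticmethod
    def forward(ctx, w):
        return (w > 0.5).float()

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out

class BinaryNormalizedLinear(nn.Module):
    # Sketch of a "binary normalized" layer: weights live in {0, 1} at
    # inference, with normalization keeping activations well scaled.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w = nn.Parameter(torch.rand(d_out, d_in))   # latent real weights
        self.norm = nn.LayerNorm(d_out)

    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.w)
        return self.norm(x @ w_bin.t())   # normalize after the binary matmul

layer = BinaryNormalizedLinear(16, 8)
print(layer(torch.randn(4, 16)).shape)    # torch.Size([4, 8])
```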

[309] Recursive State Inference for Linear PASFA

Vishal Rishi

Main category: cs.LG

TL;DR: A recursive extension to linear Probabilistic Adaptive Slow Feature Analysis (PASFA) is proposed for efficient MMSE estimation of slow features from observations and ARMA process models.

DetailsMotivation: There is a need to develop efficient methods to infer states (slow features) from observations and probabilistic SFA models, as current Kalman filter approaches cannot easily recover the original states that form useful representations.

Method: The proposed algorithm performs minimum mean square error (MMSE) estimation of states evolving according to an ARMA process, given the observations and the model, using a recursive extension to linear PASFA.

Result: The proposed technique is evaluated on a synthetic dataset to demonstrate its correctness in estimating slow features.

Conclusion: The recursive extension provides an efficient method for inferring slow features from probabilistic SFA models, addressing limitations of current state space transformation approaches.

Abstract: Slow feature analysis (SFA), as a method for learning slowly varying features in classification and signal analysis, has attracted increasing attention in recent years. Recent probabilistic extensions to SFA learn effective representations for classification tasks. Notably, Probabilistic Adaptive Slow Feature Analysis (PASFA) models the slow features as states in an ARMA process and estimates the model from the observations. However, there is a need to develop efficient methods to infer the states (slow features) from the observations and the model. In this paper, a recursive extension to linear PASFA is proposed. The proposed algorithm performs MMSE estimation of states evolving according to an ARMA process, given the observations and the model. Although current methods tackle this problem using Kalman filters after transforming the ARMA process into a state space model, the original states (or slow features) that form useful representations cannot be easily recovered. The proposed technique is evaluated on a synthetic dataset to demonstrate its correctness.

[310] A Minimalist Bayesian Framework for Stochastic Optimization

Kaizheng Wang

Main category: cs.LG

TL;DR: A minimalist Bayesian framework that uses profile likelihood to eliminate nuisance parameters, enabling Thompson sampling for constrained optimization problems with near-optimal regret guarantees.

DetailsMotivation: Traditional Bayesian methods require probabilistic models for all parameters, which hinders incorporation of complex structural constraints in sequential decision-making under uncertainty.

Method: Introduces a minimalist Bayesian approach placing priors only on parameters of interest (e.g., optimum location) and eliminates nuisance parameters via profile likelihood. Develops MINimalist Thompson Sampling (MINTS) algorithm.

Result: The framework accommodates structured problems including continuum-armed Lipschitz bandits and dynamic pricing, provides probabilistic interpretation of classical convex optimization methods, and achieves near-optimal regret guarantees for multi-armed bandits.

Conclusion: The minimalist Bayesian framework successfully bridges Bayesian decision theory with constrained optimization, offering a principled yet practical approach for sequential decision-making with complex structural constraints.

Abstract: The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the component of interest, such as the location of the optimum. Nuisance parameters are eliminated via profile likelihood, which naturally handles constraints. As a direct instantiation, we develop a MINimalist Thompson Sampling (MINTS) algorithm. Our framework accommodates structured problems, including continuum-armed Lipschitz bandits and dynamic pricing. It also provides a probabilistic lens on classical convex optimization algorithms such as the center of gravity and ellipsoid methods. We further analyze MINTS for multi-armed bandits and establish near-optimal regret guarantees.
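
To make "prior on the optimum, nuisance parameters profiled out" concrete, here is an illustrative reading for Bernoulli bandits: the parameter of interest is which arm is best, and the profile likelihood of "arm i is best" maximizes the data likelihood over all arm means subject to arm i being the largest. This is our sketch of the idea, not the paper's MINTS pseudocode.

```python
import numpy as np

def log_bern(s, n, t):
    t = np.clip(t, 1e-9, 1 - 1e-9)
    return s * np.log(t) + (n - s) * np.log(1 - t)

def optimum_posterior(successes, pulls, grid=np.linspace(0.01, 0.99, 99)):
    """Posterior over 'arm i is optimal' under a uniform prior, with the
    arm means treated as nuisance parameters and profiled out."""
    K = len(successes)
    p_hat = successes / np.maximum(pulls, 1)
    log_L = np.full(K, -np.inf)
    for i in range(K):
        for t in grid:                  # candidate value of the best mean
            ll = log_bern(successes[i], pulls[i], t)
            for j in range(K):
                if j != i:              # nuisance arm j: best theta_j <= t
                    ll += log_bern(successes[j], pulls[j], min(p_hat[j], t))
            log_L[i] = max(log_L[i], ll)
    w = np.exp(log_L - log_L.max())
    return w / w.sum()

# Thompson step: pull an arm drawn from optimum_posterior(successes, pulls).
```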

[311] Methodological Insights into Structural Causal Modelling and Uncertainty-Aware Forecasting for Economic Indicators

Federico Cerutti

Main category: cs.LG

TL;DR: Combines causal discovery and uncertainty-aware forecasting for US macroeconomic analysis, revealing causal relationships and enabling accurate zero-shot unemployment predictions with confidence intervals.

DetailsMotivation: To understand dynamic causal relationships between key macroeconomic indicators and develop robust forecasting methods that can inform economic policy without requiring task-specific training.

Method: Applied LPCMCI framework with Gaussian Process Distance Correlation (GPDC) for causal discovery on quarterly data (1970-2021), then used Chronos (a time series LLM) for zero-shot probabilistic forecasting of unemployment.

Result: Found unidirectional causal link from economic growth to GDP, limited connectivity of inflation suggesting latent factors, and strong autoregressive dependence in unemployment. Achieved accurate 1-2 quarter ahead unemployment forecasts with 90% confidence intervals for anomaly detection.

Conclusion: The combination of causal structure learning with probabilistic language models provides valuable insights for economic policy and enhances forecasting robustness through uncertainty-aware predictions.

Abstract: This paper presents a methodological approach to financial time series analysis by combining causal discovery and uncertainty-aware forecasting. As a case study, we focus on four key U.S. macroeconomic indicators – GDP, economic growth, inflation, and unemployment – and we apply the LPCMCI framework with Gaussian Process Distance Correlation (GPDC) to uncover dynamic causal relationships in quarterly data from 1970 to 2021. Our results reveal a robust unidirectional causal link from economic growth to GDP and highlight the limited connectivity of inflation, suggesting the influence of latent factors. Unemployment exhibits strong autoregressive dependence, motivating its use as a case study for probabilistic forecasting. Leveraging the Chronos framework, a large language model trained for time series, we perform zero-shot predictions on unemployment. This approach delivers accurate forecasts one and two quarters ahead, without requiring task-specific training. Crucially, the model’s uncertainty-aware predictions yield 90% confidence intervals, enabling effective anomaly detection through statistically principled deviation analysis. This study demonstrates the value of combining causal structure learning with probabilistic language models to inform economic policy and enhance forecasting robustness.
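
The anomaly-detection step in the abstract follows directly from having a probabilistic forecast: realized values falling outside the central 90% forecast interval are flagged. A minimal sketch, assuming the forecaster returns sample trajectories:

```python
import numpy as np

def flag_anomalies(forecast_samples, actuals, level=0.90):
    """forecast_samples: (num_samples, horizon) draws from a probabilistic
    forecaster such as Chronos; actuals: (horizon,) realized values.
    Returns a boolean mask of observations outside the central interval."""
    alpha = (1 - level) / 2
    lo = np.quantile(forecast_samples, alpha, axis=0)
    hi = np.quantile(forecast_samples, 1 - alpha, axis=0)
    return (actuals < lo) | (actuals > hi)
```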

[312] Benchmarking Vision Transformers and CNNs for Thermal Photovoltaic Fault Detection with Explainable AI Validation

Serra Aksoy

Main category: cs.LG

TL;DR: This study compares CNN and vision transformer models for thermal PV fault detection, using XRAI analysis to validate model decisions against thermal physics principles. Swin Transformer achieved best performance (94% binary accuracy), with models learning physically meaningful features like hotspots and thermal boundaries.

DetailsMotivation: AI deployment for PV monitoring faces interpretability barriers. While deep learning achieves high accuracy in thermal fault detection, validation that model decisions align with thermal physics principles is lacking, creating deployment hesitancy in critical energy infrastructure applications.

Method: Systematic comparison of convolutional neural networks (ResNet-18, EfficientNet-B0) and vision transformers (ViT-Tiny, Swin-Tiny) for thermal PV fault detection, using XRAI saliency analysis to assess alignment with thermal physics principles on 20,000 infrared images spanning normal operation and 11 fault categories.

Result: Swin Transformer achieved highest performance (94% binary accuracy; 73% multiclass accuracy) compared to CNN approaches. XRAI analysis revealed models learn physically meaningful features consistent with expected thermal signatures. Performance varied significantly across fault types: electrical faults achieved strong detection (F1-scores >0.90) while environmental factors like soiling remained challenging (F1-scores 0.20-0.33).

Conclusion: The thermal physics-guided interpretability approach provides methodology for validating AI decision-making in energy monitoring applications, addressing deployment barriers in renewable energy infrastructure by ensuring model decisions align with physical principles.

Abstract: Artificial intelligence deployment for automated photovoltaic (PV) monitoring faces interpretability barriers that limit adoption in energy infrastructure applications. While deep learning achieves high accuracy in thermal fault detection, validation that model decisions align with thermal physics principles remains lacking, creating deployment hesitancy where understanding model reasoning is critical. This study provides a systematic comparison of convolutional neural networks (ResNet-18, EfficientNet-B0) and vision transformers (ViT-Tiny, Swin-Tiny) for thermal PV fault detection, using XRAI saliency analysis to assess alignment with thermal physics principles. This represents the first systematic comparison of CNNs and vision transformers for thermal PV fault detection with physics-validated interpretability. Evaluation on 20,000 infrared images spanning normal operation and 11 fault categories shows that Swin Transformer achieves the highest performance (94% binary accuracy; 73% multiclass accuracy) compared to CNN approaches. XRAI analysis reveals that models learn physically meaningful features, such as localized hotspots for cell defects, linear thermal paths for diode failures, and thermal boundaries for vegetation shading, consistent with expected thermal signatures. However, performance varies significantly across fault types: electrical faults achieve strong detection (F1-scores >0.90) while environmental factors like soiling remain challenging (F1-scores 0.20-0.33), indicating limitations imposed by thermal imaging resolution. The thermal physics-guided interpretability approach provides methodology for validating AI decision-making in energy monitoring applications, addressing deployment barriers in renewable energy infrastructure.

[313] Lookup multivariate Kolmogorov-Arnold Networks

Sergey Pozdnyakov, Philippe Schwaller

Main category: cs.LG

TL;DR: lmKANs replace linear layers with trainable spline lookup tables, achieving up to 6x FLOPs reduction while maintaining MLP flexibility in high-dimensional function approximation.

DetailsMotivation: Linear layers dominate parameter count and computational cost in deep learning models, creating a need for more efficient alternatives that maintain performance.

Method: Use lookup multivariate Kolmogorov-Arnold Networks (lmKANs) that express high-dimensional mappings through trainable low-dimensional multivariate functions implemented as spline lookup tables.

Result: 6x inference FLOPs reduction, 10x higher H100 throughput at equal accuracy, 1.6-2.1x FLOPs reduction in CNNs on CIFAR-10 and ImageNet-1k while maintaining matched accuracy.

Conclusion: lmKANs provide a superior trade-off between capacity and inference cost, making them effective drop-in replacements for linear layers in various deep learning architectures.

Abstract: High-dimensional linear mappings, or linear layers, dominate both the parameter count and the computational cost of most modern deep-learning models. We introduce a general drop-in replacement, lookup multivariate Kolmogorov-Arnold Networks (lmKANs), which deliver a substantially better trade-off between capacity and inference cost. Our construction expresses a general high-dimensional mapping through trainable low-dimensional multivariate functions. These functions can carry dozens or hundreds of trainable parameters each, and yet it takes only a few multiplications to compute them because they are implemented as spline lookup tables. Empirically, lmKANs reduce inference FLOPs by up to 6.0x while matching the flexibility of MLPs in general high-dimensional function approximation. In another feedforward fully connected benchmark, on the tabular-like dataset of randomly displaced methane configurations, lmKANs enable more than 10x higher H100 throughput at equal accuracy. Within frameworks of Convolutional Neural Networks, lmKAN-based CNNs cut inference FLOPs at matched accuracy by 1.6-2.1x and by 1.7x on the CIFAR-10 and ImageNet-1k datasets, respectively. Our code, including dedicated CUDA kernels, is available online at https://github.com/schwallergroup/lmkan.
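
The "few multiplications" claim rests on the functions being lookup tables: evaluation is an index computation plus interpolation between stored knot values. A 1-D piecewise-linear sketch of the mechanics (lmKANs use low-dimensional *multivariate* tables, and the paper's spline order may differ):

```python
import numpy as np

class LinearSplineTable:
    """Trainable 1-D lookup function on a uniform grid."""

    def __init__(self, num_knots=64, lo=-3.0, hi=3.0, rng=np.random.default_rng(0)):
        self.lo, self.hi = lo, hi
        self.step = (hi - lo) / (num_knots - 1)
        self.values = 0.1 * rng.standard_normal(num_knots)  # trainable knot values

    def __call__(self, x):
        x = np.clip(x, self.lo, self.hi - 1e-9)
        idx = ((x - self.lo) / self.step).astype(int)
        frac = (x - self.lo) / self.step - idx
        # One table fetch per neighbor and two multiplies per evaluation.
        return (1 - frac) * self.values[idx] + frac * self.values[idx + 1]
```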

[314] Riemannian Batch Normalization: A Gyro Approach

Ziheng Chen, Xiao-Jun Wu, Nicu Sebe

Main category: cs.LG

TL;DR: GyroBN is a Riemannian batch normalization framework for gyrogroups that extends Euclidean normalization to non-Euclidean manifolds while maintaining theoretical control over sample statistics.

DetailsMotivation: Euclidean normalization layers are inadequate for data on manifolds, and many Riemannian manifolds in machine learning have gyro-structures that enable principled extensions to non-Euclidean domains.

Method: Developed GyroBN framework with two key conditions (pseudo-reduction and gyroisometric gyrations) that guarantee theoretical control. Instantiated on seven geometries including Grassmannian, constant curvature spaces, and correlation manifold.

Result: The framework works effectively across all seven tested geometries and incorporates several existing Riemannian normalization methods as special cases.

Conclusion: GyroBN provides a principled Riemannian batch normalization approach that successfully extends to various non-Euclidean manifolds with theoretical guarantees.

Abstract: Normalization layers are crucial for deep learning, but their Euclidean formulations are inadequate for data on manifolds. On the other hand, many Riemannian manifolds in machine learning admit gyro-structures, enabling principled extensions of Euclidean neural networks to non-Euclidean domains. Inspired by this, we introduce GyroBN, a principled Riemannian batch normalization framework for gyrogroups. We establish two necessary conditions, namely \emph{pseudo-reduction} and \emph{gyroisometric gyrations}, that guarantee GyroBN with theoretical control over sample statistics, and show that these conditions hold for all known gyrogroups in machine learning. Our framework also incorporates several existing Riemannian normalization methods as special cases. We further instantiate GyroBN on seven representative geometries, including the Grassmannian, five constant curvature spaces, and the correlation manifold, and derive novel gyro and Riemannian structures to enable these instantiations. Experiments across these geometries demonstrate the effectiveness of GyroBN. The code is available at https://github.com/GitZH-Chen/GyroBN.git.

[315] Of Graphs and Tables: Zero-Shot Node Classification with Tabular Foundation Models

Adrian Hayler, Xingyue Huang, İsmail İlkan Ceylan, Michael Bronstein, Ben Finkelshtein

Main category: cs.LG

TL;DR: TabGFM reformulates graph node classification as a tabular problem, using tabular foundation models for zero-shot learning via table conversion and ensemble selection, outperforming GNNs and GFMs.

DetailsMotivation: Existing graph foundation models are trained on poorly representative datasets, limiting generalization, while tabular foundation models have shown strong cross-domain performance.

Method: Convert graphs to tables using feature and structural encoders, apply multiple TFMs to subsampled tables, and aggregate outputs through ensemble selection.

Result: Achieves consistent improvements over task-specific GNNs and state-of-the-art GFMs across 28 real-world datasets.

Conclusion: Tabular reformulation shows potential for scalable and generalizable graph learning, offering an alternative to traditional graph foundation models.

Abstract: Graph foundation models (GFMs) have recently emerged as a promising paradigm for achieving broad generalization across various graph data. However, existing GFMs are often trained on datasets that were shown to poorly represent real-world graphs, limiting their generalization performance. In contrast, tabular foundation models (TFMs) not only excel at classical tabular prediction tasks but have also shown strong applicability in other domains such as time series forecasting, natural language processing, and computer vision. Motivated by this, we take an alternative view to the standard perspective of GFMs and reformulate node classification as a tabular problem. Each node can be represented as a row with feature, structure, and label information as columns, enabling TFMs to directly perform zero-shot node classification via in-context learning. In this work, we introduce TabGFM, a graph foundation model framework that first converts a graph into a table via feature and structural encoders, applies multiple TFMs to diversely subsampled tables, and then aggregates their outputs through ensemble selection. Through experiments on 28 real-world datasets, TabGFM achieves consistent improvements over task-specific GNNs and state-of-the-art GFMs, highlighting the potential of tabular reformulation for scalable and generalizable graph learning.
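
The tabular reformulation itself is simple to illustrate: each node becomes a row whose columns hold its features plus structural summaries, and labeled rows serve as in-context examples for the TFM. The encodings below (degree, mean neighbor features) are stand-ins for the paper's feature and structural encoders.

```python
import numpy as np

def graph_to_table(X, A):
    """X: (n, d) node features; A: (n, n) adjacency matrix.
    Returns an (n, 2d + 1) table: raw features, degree, neighbor means."""
    deg = A.sum(axis=1, keepdims=True)
    neigh_mean = (A @ X) / np.maximum(deg, 1)
    return np.hstack([X, deg, neigh_mean])
```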

[316] Measuring Uncertainty in Transformer Circuits with Effective Information Consistency

Anatoly A. Krasnovsky

Main category: cs.LG

TL;DR: Proposes EICS - a single-pass metric combining sheaf inconsistency and causal emergence to quantify when Transformer Circuits in LLMs are behaving coherently and trustworthily.

DetailsMotivation: Lack of formal methods to quantify when active circuits in LLMs are behaving coherently, despite mechanistic interpretability identifying functional subgraphs.

Method: Specializes sheaf/cohomology and causal emergence perspective to Transformer Circuits, combining normalized sheaf inconsistency from local Jacobians/activations with Gaussian EI proxy for circuit-level causal emergence.

Result: Introduces EICS - a white-box, single-pass, dimensionless score that makes units explicit, with practical guidance on interpretation and computation.

Conclusion: Provides theoretical framework and practical tool for assessing circuit coherence, though empirical validation on LLM tasks is deferred for future work.

Abstract: Mechanistic interpretability has identified functional subgraphs within large language models (LLMs), known as Transformer Circuits (TCs), that appear to implement specific algorithms. Yet we lack a formal, single-pass way to quantify when an active circuit is behaving coherently and thus likely trustworthy. Building on prior systems-theoretic proposals, we specialize a sheaf/cohomology and causal emergence perspective to TCs and introduce the Effective-Information Consistency Score (EICS). EICS combines (i) a normalized sheaf inconsistency computed from local Jacobians and activations, with (ii) a Gaussian EI proxy for circuit-level causal emergence derived from the same forward state. The construction is white-box, single-pass, and makes units explicit so that the score is dimensionless. We further provide practical guidance on score interpretation, computational overhead (with fast and exact modes), and a toy sanity-check analysis. Empirical validation on LLM tasks is deferred.

[317] PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

Andy Xu, Rohan Desai, Larry Wang, Gabriel Hope, Ethan Ritz

Main category: cs.LG

TL;DR: PLaID++ is an LLM fine-tuned for stable and property-guided crystal generation using Wyckoff-based text representation and DPO reinforcement learning, achieving ~50% better generation rates than prior methods.

DetailsMotivation: Accelerate materials discovery by overcoming slow and expensive trial-and-error processes in developing novel materials for technologies like solar cells, batteries, and carbon capture.

Method: Fine-tune Qwen-2.5 7B LLM with Wyckoff-based text representation for crystal structures, using Direct Preference Optimization (DPO) reinforcement learning to guide generation towards stable, novel structures with desired space group properties.

Result: Generates thermodynamically stable, unique, and novel structures at ~50% greater rate than prior methods, with ~115% improvement in unconditional generation and ~50% improvement in space group conditioned generation compared to fine-tuning alone.

Conclusion: Demonstrates successful adaptation of NLP post-training techniques to materials design, enabling targeted and efficient discovery of novel materials through stable and property-guided crystal generation.

Abstract: Discovering novel materials is critical for technological advancements such as solar cells, batteries, and carbon capture. However, the development of new materials is constrained by a slow and expensive trial-and-error process. To accelerate this pipeline, we introduce PLaID++, a Large Language Model (LLM) fine-tuned for stable and property-guided crystal generation. We fine-tune Qwen-2.5 7B to generate crystal structures using a novel Wyckoff-based text representation. We show that generation can be effectively guided with a reinforcement learning technique based on Direct Preference Optimization (DPO), with sampled structures categorized by their stability, novelty, and space group. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a $\sim$50% greater rate than prior methods and conditionally generates structures with desired space group properties. Our experiments highlight the effectiveness of iterative DPO, achieving $\sim$115% and $\sim$50% improvements in unconditional and space group conditioned generation, respectively, compared to fine-tuning alone. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
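
The preference-alignment step uses the standard DPO objective, with preference pairs built from sampled structures ranked by stability, novelty, and space group. For reference, the loss on one batch of pairs, given summed token log-probabilities under the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss; beta controls the implicit
    KL penalty toward the reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```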

[318] Fed-REACT: Federated Representation Learning for Heterogeneous and Evolving Data

Yiyue Chen, Usman Akram, Chianing Wang, Haris Vikalo

Main category: cs.LG

TL;DR: Fed-REACT is a federated learning framework that addresses data heterogeneity and evolution through representation learning and evolutionary clustering, achieving better performance than standard FL methods.

DetailsMotivation: Standard federated learning algorithms suffer performance degradation when client data distributions are heterogeneous and evolve over time, which is common in real-world deployments with privacy and resource constraints.

Method: Two-stage framework: (1) clients learn local models to extract feature representations, (2) server dynamically clusters clients based on representations and coordinates cluster-wise training of task-specific models for downstream objectives.

Result: Theoretical analysis of representation learning stage and empirical demonstration showing superior accuracy and robustness on real-world datasets compared to standard FL algorithms.

Conclusion: Fed-REACT effectively handles heterogeneous and evolving client data distributions in federated learning settings, providing a practical solution for real-world deployment challenges.

Abstract: Motivated by the high resource costs and privacy concerns associated with centralized machine learning, federated learning (FL) has emerged as an efficient alternative that enables clients to collaboratively train a global model while keeping their data local. However, in real-world deployments, client data distributions often evolve over time and differ significantly across clients, introducing heterogeneity that degrades the performance of standard FL algorithms. In this work, we introduce Fed-REACT, a federated learning framework designed for heterogeneous and evolving client data. Fed-REACT combines representation learning with evolutionary clustering in a two-stage process: (1) in the first stage, each client learns a local model to extract feature representations from its data; (2) in the second stage, the server dynamically groups clients into clusters based on these representations and coordinates cluster-wise training of task-specific models for downstream objectives such as classification or regression. We provide a theoretical analysis of the representation learning stage, and empirically demonstrate that Fed-REACT achieves superior accuracy and robustness on real-world datasets.
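
A minimal sketch of the server-side step of stage two, with plain KMeans standing in for the paper's evolutionary clustering (which additionally smooths cluster assignments across rounds):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_clients(client_reps, num_clusters=3):
    """client_reps: (num_clients, rep_dim), one summary embedding per client.
    Returns cluster id -> client indices; each cluster then trains its own
    task-specific model."""
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(client_reps)
    return {c: np.where(labels == c)[0] for c in range(num_clusters)}
```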

[319] Predicting effect of novel treatments using molecular pathways and real-world data

Adrien Couetoux, Thomas Devenyns, Lise Diagne, David Champagne, Pierre-Yves Mousset, Chris Anagnostopoulos

Main category: cs.LG

TL;DR: Machine learning approach predicts pharmaceutical efficacy using drug-pathway impact scores and patient data before clinical testing

DetailsMotivation: Predicting drug efficacy before clinical trials is challenging in pharmaceutical R&D, requiring better methods to assess untested pharmaceuticals

Method: Train ML model using pharmaceutical-pathway weight impact scores and patient data (characteristics and outcomes), then analyze weighted impact scores of untested drugs across biological pathways

Result: Demonstrated on real-world dataset with patient treatments and outcomes using two different weight impact score algorithms, with methods for evaluating generalization performance

Conclusion: Provides a flexible, modular framework that can be iterated on for predicting untested drug effects using real-world clinical data and drug embeddings

Abstract: In pharmaceutical R&D, predicting the efficacy of a pharmaceutical in treating a particular disease prior to clinical testing or any real-world use has been challenging. In this paper, we propose a flexible and modular machine learning-based approach for predicting the efficacy of an untested pharmaceutical for treating a disease. We train a machine learning model using sets of pharmaceutical-pathway weight impact scores and patient data, which can include patient characteristics and observed clinical outcomes. The resulting model then analyses weighted impact scores of an untested pharmaceutical across human biological molecule-protein pathways to generate a predicted efficacy value. We demonstrate how the method works on a real-world dataset with patient treatments and outcomes, using two different weight impact score algorithms. We include methods for evaluating the generalisation performance on unseen treatments, and for characterising conditions under which the approach can be expected to be most predictive. We discuss specific ways in which our approach can be iterated on, making it an initial framework to support future work on predicting the effect of untested drugs, leveraging real-world clinical data (RWD) and drug embeddings.

[320] Explaining How Quantization Disparately Skews a Model

Abhimanyu Bellam, Jung-Eun Kim

Main category: cs.LG

TL;DR: PTQ exacerbates disparate impacts on minority groups due to quantization effects on weights and activations, leading to reduced logit variance, increased loss, and compromised group accuracies. The paper proposes mixed precision QAT with dataset sampling and weighted loss to achieve fair quantized neural networks.

DetailsMotivation: Post Training Quantization (PTQ) is widely used for compression but was observed to disproportionately harm minority group performance, creating fairness issues in quantized neural networks.

Method: Analyzed quantization’s chain effects on weights/activations, studied impacts on gradient norms and Hessian eigenvalues, and proposed mixed precision Quantization Aware Training with dataset sampling and weighted loss functions.

Result: Quantization causes cascaded impacts reducing logit variance, increasing loss, and compromising group accuracies, with measurable effects on optimization metrics like gradient norms and Hessian eigenvalues.

Conclusion: Mixed precision QAT combined with dataset sampling and weighted loss functions can mitigate disparate impacts and enable fair deployment of quantized neural networks.

Abstract: Post Training Quantization (PTQ) is widely adopted due to its high compression capacity and speed with minimal impact on accuracy. However, we observed that disparate impacts are exacerbated by quantization, especially for minority groups. Our analysis explains that in the course of quantization there is a chain of factors attributed to a disparate impact across groups during forward and backward passes. We explore how the changes in weights and activations induced by quantization cause cascaded impacts in the network, resulting in logits with lower variance, increased loss, and compromised group accuracies. We extend our study to verify the influence of these impacts on group gradient norms and eigenvalues of the Hessian matrix, providing insights into the state of the network from an optimization point of view. To mitigate these effects, we propose integrating mixed precision Quantization Aware Training (QAT) with dataset sampling methods and weighted loss functions, therefore providing fair deployment of quantized neural networks.
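
The disparity the paper studies is straightforward to measure in practice: quantize, then compare per-group accuracy against the full-precision model. A minimal sketch with uniform symmetric post-training quantization (the paper's PTQ scheme may differ):

```python
import numpy as np

def quantize(w, bits=8):
    """Uniform symmetric quantization of a weight tensor."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def group_accuracies(preds, labels, groups):
    """Per-group accuracy; compare before/after quantization to expose
    disparate impacts on minority groups."""
    return {g: float((preds[groups == g] == labels[groups == g]).mean())
            for g in np.unique(groups)}
```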

[321] Systematic Optimization of Open Source Large Language Models for Mathematical Reasoning

Pranav Pawar, Dhwaj Jain, Varun Gupta, Kaustav Dedhia, Dashrath Kale, Sudhir Dhekane

Main category: cs.LG

TL;DR: Comprehensive parameter optimization study for mathematical reasoning tasks across five SOTA models, achieving 29.4% cost reduction and 23.9% speed improvement while maintaining accuracy.

DetailsMotivation: To improve efficiency and performance of mathematical reasoning tasks through systematic fine-tuning of model parameters, addressing the need for optimized inference in production environments.

Method: Systematic parameter optimization framework testing temperature (0.1-0.5), reasoning steps (4-12), planning periods (1-4), and nucleus sampling (0.85-0.98) across Qwen2.5-72B, Llama-3.1-70B, DeepSeek-V3, Mixtral-8x22B, and Yi-Lightning models on mathematical reasoning benchmarks.

Result: Achieved 100% optimization success rate with average 29.4% computational cost reduction and 23.9% inference speed improvement. DeepSeek-V3 reached 98% accuracy, Mixtral-8x22B delivered 361.5 tokens per accurate response. Lower temperatures (0.1-0.4) and fewer reasoning steps (4-6) proved optimal.

Conclusion: The framework provides production-ready configurations and reveals universal optimization trends across diverse model architectures, demonstrating that careful parameter tuning significantly enhances mathematical reasoning efficiency without compromising accuracy.

Abstract: This paper presents a practical investigation into fine-tuning model parameters for mathematical reasoning tasks by experimenting with various configurations, including randomness control, reasoning depth, and sampling strategies. Careful tuning demonstrates substantial improvements in both efficiency and performance. A holistically optimized framework is introduced for five state-of-the-art models on mathematical reasoning tasks, exhibiting significant performance boosts while maintaining solution correctness. Through systematic parameter optimization across Qwen2.5-72B, Llama-3.1-70B, DeepSeek-V3, Mixtral-8x22B, and Yi-Lightning, consistent efficiency gains are demonstrated with 100% optimization success rate. The methodology achieves an average 29.4% reduction in computational cost and 23.9% improvement in inference speed across all tested models. This framework systematically searches parameter spaces including temperature (0.1-0.5), reasoning steps (4-12), planning periods (1-4), and nucleus sampling (0.85-0.98), determining optimal configurations through testing on mathematical reasoning benchmarks. Critical findings show that lower temperature regimes (0.1-0.4) and reduced reasoning steps (4-6) consistently enhance efficiency without compromising accuracy. DeepSeek-V3 achieves the highest accuracy at 98%, while Mixtral-8x22B delivers the most cost-effective performance at 361.5 tokens per accurate response. Key contributions include: (1) the first comprehensive optimization study for five diverse SOTA models in mathematical reasoning, (2) a standardized production-oriented parameter optimization framework, (3) discovery of universal optimization trends applicable across model architectures, and (4) production-ready configurations with extensive performance characterization.
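
The search space is small enough for an exhaustive sweep. A sketch of the grid, using the ranges reported in the abstract; evaluate() is a placeholder you would implement against your own model endpoint and benchmark:

```python
import itertools

def evaluate(temperature, steps, planning, top_p):
    """Placeholder: run the model on a math benchmark with this config and
    return a scalar score (e.g., accuracy minus a cost penalty)."""
    raise NotImplementedError

grid = itertools.product(
    [0.1, 0.2, 0.3, 0.4, 0.5],  # temperature (paper range 0.1-0.5)
    [4, 6, 8, 10, 12],          # reasoning steps (4-12)
    [1, 2, 3, 4],               # planning periods (1-4)
    [0.85, 0.90, 0.95, 0.98],   # nucleus sampling top-p (0.85-0.98)
)
best_cfg = max(grid, key=lambda cfg: evaluate(*cfg))
```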

[322] IP-Basis PINNs: Efficient Multi-Query Inverse Parameter Estimation

Shalev Manor, Mohammad Kohandel

Main category: cs.LG

TL;DR: IP-Basis PINNs is a meta-learning framework that enables rapid inference for inverse problems by training a deep network offline to produce basis functions, then freezing it and only training a lightweight linear output layer for each new inverse problem online.

DetailsMotivation: Standard Physics-Informed Neural Networks (PINNs) are computationally expensive for multi-query inverse problems, requiring expensive retraining for each new set of observed data.

Method: Offline-online decomposition: train deep network offline to produce basis functions spanning parametric differential equation solution space; freeze network online and train only linear output layer against observed data with novel loss formulation for simultaneous solution reconstruction and parameter identification.

Result: Demonstrated efficacy on three diverse benchmarks, showing consistent performance across constant and functional parameter estimation, significant speedup per query over standard PINNs, and robust operation with scarce and noisy data.

Conclusion: IP-Basis PINNs provide an efficient meta-learning framework for inverse problems that reduces computational overhead while maintaining robust performance across various parameter estimation scenarios.

Abstract: Solving inverse problems with Physics-Informed Neural Networks (PINNs) is computationally expensive for multi-query scenarios, as each new set of observed data requires a new, expensive training procedure. We present Inverse-Parameter Basis PINNs (IP-Basis PINNs), a meta-learning framework that extends the foundational work of Desai et al. (2022) to enable rapid and efficient inference for inverse problems. Our method employs an offline-online decomposition: a deep network is first trained offline to produce a rich set of basis functions that span the solution space of a parametric differential equation. For each new inverse problem online, this network is frozen, and solutions and parameters are inferred by training only a lightweight linear output layer against observed data. Key innovations that make our approach effective for inverse problems include: (1) a novel online loss formulation for simultaneous solution reconstruction and parameter identification, (2) a significant reduction in computational overhead via forward-mode automatic differentiation for PDE loss evaluation, and (3) a non-trivial validation and early-stopping mechanism for robust offline training. We demonstrate the efficacy of IP-Basis PINNs on three diverse benchmarks, including an extension to universal PINNs for unknown functional terms, showing consistent performance across constant and functional parameter estimation, a significant speedup per query over standard PINNs, and robust operation with scarce and noisy data.
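
The offline-online split is the key efficiency idea: the frozen network supplies basis functions, so each new inverse problem reduces to fitting a linear output layer. A data-fit-only sketch (the paper's full online loss also includes the PDE residual that ties the linear coefficients to the unknown parameters):

```python
import numpy as np

def online_fit(basis_fn, x_obs, u_obs):
    """basis_fn: frozen offline-trained network mapping inputs to basis
    values Phi of shape (n_points, n_basis); u_obs: observed solution values.
    Only the linear output coefficients are fit online."""
    Phi = basis_fn(x_obs)
    coeffs, *_ = np.linalg.lstsq(Phi, u_obs, rcond=None)
    return coeffs  # reconstructed solution: u(x) ~ basis_fn(x) @ coeffs
```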

[323] GCond: Gradient Conflict Resolution via Accumulation-based Stabilization for Large-Scale Multi-Task Learning

Evgeny Alves Limarenko, Anastasiia Alexandrovna Studenikina

Main category: cs.LG

TL;DR: GCond is a computationally efficient method that addresses gradient conflicts in multi-task learning by combining PCGrad principles with gradient accumulation and adaptive arbitration, achieving 2x speedup while maintaining performance.

DetailsMotivation: Existing gradient conflict resolution methods like PCGrad, CAGrad, and GradNorm are computationally demanding, limiting their application in modern large models and transformers.

Method: Gradient Conductor (GCond) builds on PCGrad principles with gradient accumulation and adaptive arbitration mechanism, evaluated on self-supervised learning tasks using MobileNetV3-Small and ConvNeXt architectures.

Result: GCond achieved two-fold computational speedup while maintaining optimization quality, with superior performance across all metrics (lower L1 and SSIM losses) on ImageNet 1K and head/neck CT datasets.

Conclusion: GCond offers a scalable and efficient solution to gradient conflicts in MTL, compatible with modern optimizers and applicable to both compact and large architectures.

Abstract: In multi-task learning (MTL), gradient conflict poses a significant challenge. Effective methods for addressing this problem, including PCGrad, CAGrad, and GradNorm, in their original implementations are computationally demanding, which significantly limits their application in modern large models and transformers. We propose Gradient Conductor (GCond), a method that builds upon PCGrad principles by combining them with gradient accumulation and an adaptive arbitration mechanism. We evaluated GCond on self-supervised learning tasks using MobileNetV3-Small and ConvNeXt architectures on the ImageNet 1K dataset and a combined head and neck CT scan dataset, comparing the proposed method against baseline linear combinations and state-of-the-art gradient conflict resolution methods. The stochastic mode of GCond achieved a two-fold computational speedup while maintaining optimization quality, and demonstrated superior performance across all evaluated metrics, achieving lower L1 and SSIM losses compared to other methods on both datasets. GCond exhibited high scalability, being successfully applied to both compact models (MobileNetV3-Small) and large architectures (ConvNeXt-tiny and ConvNeXt-Base). It also showed compatibility with modern optimizers such as AdamW and Lion/LARS. Therefore, GCond offers a scalable and efficient solution to the problem of gradient conflicts in multi-task learning.
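
For readers unfamiliar with the PCGrad mechanism GCond builds on: when two task gradients conflict (negative inner product), each is projected off the other's direction before summing. GCond applies this style of arbitration to accumulated gradients; a minimal sketch of the projection step over flattened per-task gradients:

```python
import torch

def pcgrad_combine(grads):
    """grads: list of flattened per-task gradient vectors (already
    accumulated over micro-batches). Returns a single combined gradient."""
    out = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, h in enumerate(grads):
            if i != j and torch.dot(g, h) < 0:               # conflict
                g = g - torch.dot(g, h) / h.norm() ** 2 * h  # project off h
        out.append(g)
    return torch.stack(out).sum(dim=0)
```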

[324] Learning Generalized Hamiltonian Dynamics with Stability from Noisy Trajectory Data

Luke McLennan, Yi Wang, Ryan Farell, Minh Nguyen, Chandrajit Bajaj

Main category: cs.LG

TL;DR: A robust unsupervised framework for learning generalized Hamiltonian dynamics from noisy sparse phase-space data using variational Bayesian inference and physics-constrained regularization.

DetailsMotivation: Existing Hamiltonian network models struggle to capture distinctive motion dynamics and physics of different conservative, dissipative, and port-Hamiltonian systems from observational data.

Method: Extends sparse symplectic random Fourier Gaussian processes with predictive Hamiltonian landscape estimation, incorporating kernelized ELBO loss plus stability and conservation constraints as regularization terms.

Result: The framework enables learning of various Hamiltonian dynamics types while enforcing physics correctness for improved prediction accuracy with bounded uncertainty.

Conclusion: Proposed approach successfully addresses the complex Hamiltonian manifold learning challenge through Bayesian inference and physics-informed regularization constraints.

Abstract: We introduce a robust framework for learning various generalized Hamiltonian dynamics from noisy, sparse phase-space data and in an unsupervised manner based on variational Bayesian inference. Although conservative, dissipative, and port-Hamiltonian systems might share the same initial total energy of a closed system, it is challenging for a single Hamiltonian network model to capture the distinctive and varying motion dynamics and physics of a phase space, from sampled observational phase space trajectories. To address this complicated Hamiltonian manifold learning challenge, we extend sparse symplectic random Fourier Gaussian process learning with predictive successive numerical estimations of the Hamiltonian landscape, using a generalized form of state and conjugate momentum Hamiltonian dynamics, appropriate to different classes of conservative, dissipative and port-Hamiltonian physical systems. In addition to the kernelized evidence lower bound (ELBO) loss for data fidelity, we incorporate stability and conservation constraints as additional hyper-parameter balanced loss terms to regularize the model's multi-gradients, enforcing physics correctness for improved prediction accuracy with bounded uncertainty.

[325] ALICE: An Interpretable Neural Architecture for Generalization in Substitution Ciphers

Jeff Shen, Lindsay Smith

Main category: cs.LG

TL;DR: ALICE transformer model achieves state-of-the-art cryptogram decryption with minimal training data, generalizing well and providing interpretable insights into neural network reasoning processes.

DetailsMotivation: Cryptogram solving serves as an ideal testbed for studying neural network generalization in combinatorially complex domains with 26! possible mappings, addressing the challenge of learning without explicit cipher access.

Method: Developed ALICE - an encoder-only Transformer with novel bijective decoding head using Gumbel-Sinkhorn method to explicitly model permutations, enabling direct extraction of learned cipher mappings.

Result: ALICE achieves new state-of-the-art accuracy and speed, generalizing to unseen ciphers after training on only ~1500 unique ciphers (3.7×10⁻²⁴ of possible space). Early exit analysis reveals progressive refinement mirroring human strategies.

Conclusion: The architectural innovations and analysis methods extend beyond cryptograms to any domain with bijective mappings and combinatorial structure, offering new insights into neural network generalization and interpretability.

Abstract: We present cryptogram solving as an ideal testbed for studying neural network generalization in combinatorially complex domains. In this task, models must decrypt text encoded with substitution ciphers, choosing from 26! possible mappings without explicit access to the cipher. We develop ALICE (an Architecture for Learning Interpretable Cryptogram dEcipherment): a simple encoder-only Transformer that sets a new state-of-the-art for both accuracy and speed on this decryption problem. Surprisingly, ALICE generalizes to unseen ciphers after training on only ${\sim}1500$ unique ciphers, a minute fraction ($3.7 \times 10^{-24}$) of the possible cipher space. To enhance interpretability, we introduce a novel bijective decoding head that explicitly models permutations via the Gumbel-Sinkhorn method, enabling direct extraction of learned cipher mappings. Through early exit analysis, we reveal how ALICE progressively refines its predictions in a way that appears to mirror common human strategies for this task: early layers employ frequency-based heuristics, middle layers form word structures, and final layers correct individual characters. Our architectural innovations and analysis methods extend beyond cryptograms to any domain with bijective mappings and combinatorial structure, offering new insights into neural network generalization and interpretability.
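
The bijective decoding head relies on the Gumbel-Sinkhorn operator: perturb a score matrix with Gumbel noise, then alternate row and column normalization in log space, which converges toward a doubly stochastic (and, as temperature drops, near-permutation) matrix. A minimal sketch for ALICE's 26-letter case:

```python
import torch

def gumbel_sinkhorn(log_alpha, tau=1.0, n_iters=20):
    """log_alpha: (26, 26) letter-to-letter scores. Returns a near-doubly-
    stochastic matrix; lower tau pushes it toward a hard permutation."""
    eps = 1e-20
    gumbel = -torch.log(-torch.log(torch.rand_like(log_alpha) + eps) + eps)
    log_p = (log_alpha + gumbel) / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # cols
    return log_p.exp()
```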

[326] CancerGUIDE: Cancer Guideline Understanding via Internal Disagreement Estimation

Alyssa Unell, Noel C. F. Codella, Sam Preston, Peniel Argaw, Wen-wai Yim, Zelalem Gero, Cliff Wong, Rajesh Jena, Eric Horvitz, Amanda K. Hall, Ruican Rachel Zhong, Jiachen Li, Shrey Jain, Mu Wei, Matthew Lungren, Hoifung Poon

Main category: cs.LG

TL;DR: LLM agent-based system for automated NCCN guideline-compliant treatment recommendations for NSCLC patients, combining human expertise with AI to reduce annotation costs while maintaining accuracy and regulatory compliance.

DetailsMotivation: Translating complex patient presentations into NCCN guideline-compliant treatment recommendations is time-intensive, requires specialized expertise, and is prone to error. LLMs promise to reduce time and improve accuracy.

Method: Developed a hybrid approach combining human annotations with model consistency information. Created a longitudinal dataset of 121 NSCLC cases with expert annotations, used LLMs for proxy benchmark generation, and built an agent framework with meta-classifier for verification.

Result: Achieved strong correlation with expert benchmarks (Spearman r=0.88, RMSE=0.08) and high accuracy for treatment recommendations (AUROC=0.800). Demonstrated LLMs possess domain-specific knowledge for quality benchmark generation.

Conclusion: Established a clinically viable framework for LLM-based guideline adherence systems that balances accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing scalable automated clinical decision support.

Abstract: The National Comprehensive Cancer Network (NCCN) provides evidence-based guidelines for cancer treatment. Translating complex patient presentations into guideline-compliant treatment recommendations is time-intensive, requires specialized expertise, and is prone to error. Advances in large language model (LLM) capabilities promise to reduce the time required to generate treatment recommendations and improve accuracy. We present an LLM agent-based approach to automatically generate guideline-concordant treatment trajectories for patients with non-small cell lung cancer (NSCLC). Our contributions are threefold. First, we construct a novel longitudinal dataset of 121 cases of NSCLC patients that includes clinical encounters, diagnostic results, and medical histories, each expertly annotated with the corresponding NCCN guideline trajectories by board-certified oncologists. Second, we demonstrate that existing LLMs possess domain-specific knowledge that enables high-quality proxy benchmark generation for both model development and evaluation, achieving strong correlation (Spearman coefficient r=0.88, RMSE = 0.08) with expert-annotated benchmarks. Third, we develop a hybrid approach combining expensive human annotations with model consistency information to create both the agent framework that predicts the relevant guidelines for a patient, as well as a meta-classifier that verifies prediction accuracy with calibrated confidence scores for treatment recommendations (AUROC=0.800), a critical capability for communicating the accuracy of outputs, custom-tailoring tradeoffs in performance, and supporting regulatory compliance. This work establishes a framework for clinically viable LLM-based guideline adherence systems that balance accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing a scalable pathway toward automated clinical decision support.

[327] General Demographic Foundation Models for Enhancing Predictive Performance Across Diseases

Li-Chin Chen, Ji-Tian Sheu, Yuh-Jue Chuang

Main category: cs.LG

TL;DR: A General Demographic Pre-trained (GDP) model is proposed to learn better representations of age and gender from EHR data, improving predictive performance in healthcare applications across diverse populations and diseases.

DetailsMotivation: Demographic attributes like age and gender are crucial predictors in healthcare but are often treated as auxiliary features with limited attention to learning their optimal representations.

Method: The GDP model explores combinations of ordering strategies and encoding methods to transform tabular demographic inputs into latent embeddings, pre-trained on diverse datasets from different geographic regions.

Result: Sequential ordering substantially improves model performance in discrimination, calibration, and information gain, particularly in diseases where demographics significantly contribute to risk stratification. GDP enhances representational importance even when demographic attributes have low predictive value.

Conclusion: Foundational models for tabular demographic attributes can generalize across tasks and populations, offering a promising direction for improving predictive performance in healthcare applications.

Abstract: Demographic attributes are universally present in electronic health records and serve as vital predictors in clinical risk stratification and treatment decisions. Despite their significance, these attributes are often relegated to auxiliary roles in model design, and limited attention has been given to learning their representations. This study proposes a General Demographic Pre-trained (GDP) model as a foundational representation framework tailored to age and gender. The model is pre-trained and evaluated using datasets with diverse diseases and population compositions from different geographic regions. The GDP architecture explores combinations of ordering strategies and encoding methods to transform tabular demographic inputs into latent embeddings. Experimental results demonstrate that sequential ordering substantially improves model performance in discrimination, calibration, and the corresponding information gain at each decision tree split, particularly in diseases where age and gender contribute significantly to risk stratification. Even in datasets where demographic attributes hold relatively low predictive value, GDP enhances the representational importance, increasing their influence in downstream gradient boosting models. The findings suggest that foundational models for tabular demographic attributes can generalize across tasks and populations, offering a promising direction for improving predictive performance in healthcare applications.

[328] FedTeddi: Temporal Drift and Divergence Aware Scheduling for Timely Federated Edge Learning

Yuxuan Bai, Yuxuan Sun, Tan Chen, Wei Chen, Sheng Zhou, Zhisheng Niu

Main category: cs.LG

TL;DR: FedTeddi is a scheduling algorithm for federated edge learning that handles time-varying non-i.i.d. data by quantifying temporal drift and divergence using Earth Mover’s Distance, achieving faster convergence than benchmarks.

DetailsMotivation: Real-world federated learning scenarios involve continuously evolving data with time-varying and non-i.i.d. characteristics, requiring timely model adaptation while maintaining previous knowledge.

Method: Proposes FedTeddi algorithm that quantifies temporal dynamics using temporal drift and collective divergence (measured as EMD), with joint scheduling and bandwidth allocation optimization.

Result: Achieves 58.4% faster convergence on CIFAR-10 and 49.2% on CIFAR-100 compared to random scheduling, with higher test accuracy.

Conclusion: FedTeddi effectively handles dynamic data evolution in FEEL systems, enabling quick learning from new data while preserving previous knowledge through drift-and-divergence-aware scheduling.

Abstract: Federated edge learning (FEEL) enables collaborative model training across distributed clients over wireless networks without exposing raw data. While most existing studies assume static datasets, in real-world scenarios clients may continuously collect data with time-varying and non-independent and identically distributed (non-i.i.d.) characteristics. A critical challenge is how to adapt models in a timely yet efficient manner to such evolving data. In this paper, we propose FedTeddi, a temporal-drift-and-divergence-aware scheduling algorithm that facilitates fast convergence of FEEL under dynamic data evolution and communication resource limits. We first quantify the temporal dynamics and non-i.i.d. characteristics of data using temporal drift and collective divergence, respectively, and represent them as the Earth Mover’s Distance (EMD) of class distributions for classification tasks. We then propose a novel optimization objective and develop a joint scheduling and bandwidth allocation algorithm, enabling the FEEL system to learn from new data quickly without forgetting previous knowledge. Experimental results show that our algorithm achieves higher test accuracy and faster convergence compared to benchmark methods, improving the rate of convergence by 58.4% on CIFAR-10 and 49.2% on CIFAR-100 compared to random scheduling.
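
Both drift signals reduce to an EMD between class histograms. Treating classes as indexed 0..C-1 with unit ground distance, the EMD is the L1 distance between CDFs (other ground metrics would give, e.g., total variation instead):

```python
import numpy as np

def class_emd(p, q):
    """EMD between two class distributions p, q (same length, sum to 1),
    with unit distance between adjacent class indices."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

# temporal drift of one client: class_emd(hist_before, hist_after)
# collective divergence: class_emd(client_hist, global_hist)
```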

[329] SBS: Enhancing Parameter-Efficiency of Neural Representations for Neural Networks via Spectral Bias Suppression

Qihu Xie, Yuan Li, Yi Kang

Main category: cs.LG

TL;DR: SBS enhances neural representation for neural networks by suppressing spectral bias through unidirectional smoothing and adaptive Fourier features, achieving better parameter compression and reconstruction accuracy.

DetailsMotivation: Standard MLPs in neural representation for neural networks suffer from spectral bias, limiting their ability to reconstruct high-frequency details effectively in CNN weight compression.

Method: Proposes two techniques: (1) unidirectional ordering-based smoothing for kernel smoothness, and (2) smoothing-aware random Fourier features that adaptively modulate frequency bandwidth based on layer-wise parameter count.

Result: Extensive evaluations on ResNet models with CIFAR-10, CIFAR-100, and ImageNet show significantly better reconstruction accuracy with fewer parameters compared to state-of-the-art methods.

Conclusion: SBS provides a parameter-efficient enhancement that effectively suppresses spectral bias in neural network weight representations, improving compression and reconstruction performance.

Abstract: Implicit neural representations have recently been extended to represent convolutional neural network weights via neural representation for neural networks, offering promising parameter compression benefits. However, standard multi-layer perceptrons used in neural representation for neural networks exhibit a pronounced spectral bias, hampering their ability to reconstruct high-frequency details effectively. In this paper, we propose SBS, a parameter-efficient enhancement to neural representation for neural networks that suppresses spectral bias using two techniques: (1) unidirectional ordering-based smoothing that improves kernel smoothness in the output space, and (2) smoothing-aware random Fourier features that adaptively modulate the frequency bandwidth of input encodings based on layer-wise parameter count. Extensive evaluations on various ResNet models with the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that SBS achieves significantly better reconstruction accuracy with fewer parameters than SOTA methods.
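
The second technique builds on standard random Fourier features, where the sampling bandwidth controls the frequencies the MLP can fit; SBS's contribution is choosing that bandwidth per target layer from its parameter count. A generic RFF sketch (the paper's modulation rule is not reproduced here):

```python
import numpy as np

def rff_encode(x, num_feats=128, sigma=1.0, rng=np.random.default_rng(0)):
    """x: (n, d) input coordinates. Larger sigma admits higher frequencies;
    SBS would set sigma adaptively per target layer."""
    B = rng.normal(0.0, sigma, size=(x.shape[1], num_feats))
    proj = 2 * np.pi * (x @ B)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)
```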

[330] EfficientNet in Digital Twin-based Cardiac Arrest Prediction and Analysis

Qasim Zia, Avais Jan, Zafar Iqbal, Muhammad Mumtaz Ali, Mukarram Ali, Murray Patterson

Main category: cs.LG

TL;DR: A novel framework combining EfficientNet-based deep learning with digital twin technology for early cardiac arrest detection and personalized treatment assessment.

DetailsMotivation: Cardiac arrest is a major global health issue where early identification and management are crucial for improving patient outcomes.

Method: Uses EfficientNet with compound scaling for cardiovascular image feature learning, combined with a digital twin system that creates individualized cardiovascular models from IoT device data for continuous patient assessment and treatment impact analysis.

Result: The proposed system demonstrates high accuracy in prediction capabilities while maintaining efficiency in performance.

Conclusion: Combining deep learning and digital twin technology enables an active, individualized approach to cardiac disease prediction and treatment planning.

Abstract: Cardiac arrest is one of the biggest global health problems, and early identification and management are key to enhancing the patient's prognosis. In this paper, we propose a novel framework that combines an EfficientNet-based deep learning model with a digital twin system to improve the early detection and analysis of cardiac arrest. We use compound scaling and EfficientNet to learn the features of cardiovascular images. In parallel, the digital twin creates a realistic, individualized model of the patient's cardiovascular system based on data received from Internet of Things (IoT) devices attached to the patient, which supports continuous assessment of the patient and of the impact of possible treatment plans. As shown by our experiments, the proposed system is both highly accurate in its predictions and efficient. Combining advanced techniques such as deep learning and digital twin (DT) technology makes possible an active, individualized approach to predicting cardiac disease.
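
For reference, compound scaling as introduced in the original EfficientNet paper grows depth, width, and input resolution together from a single coefficient phi, with constants found by grid search (alpha = 1.2, beta = 1.1, gamma = 1.15, constrained so that alpha * beta^2 * gamma^2 is roughly 2):

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Multipliers for network depth, width, and input resolution."""
    return alpha ** phi, beta ** phi, gamma ** phi
```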

[331] Hybrid GCN-GRU Model for Anomaly Detection in Cryptocurrency Transactions

Gyuyeon Na, Minjung Park, Hyeonjeong Cha, Soyoun Kim, Sunyoung Moon, Sua Lee, Jaeyoung Choi, Hyemin Lee, Sangmi Chai

Main category: cs.LG

TL;DR: Hybrid GCN-GRU model for blockchain illicit activity detection achieves high accuracy (0.9470) and AUC-ROC (0.9807) on Bitcoin data

DetailsMotivation: Blockchain transaction networks are complex with evolving temporal patterns and inter-node relationships, requiring advanced methods to detect illicit activities

Method: Proposed a hybrid GCN-GRU model that captures both structural features (via Graph Convolutional Networks) and sequential features (via Gated Recurrent Units)

Result: Achieved 0.9470 Accuracy and 0.9807 AUC-ROC using real Bitcoin transaction data from 2020-2024, outperforming all baseline methods

Conclusion: The hybrid GCN-GRU approach effectively combines structural and temporal information for superior illicit activity detection in blockchain networks

Abstract: Blockchain transaction networks are complex, with evolving temporal patterns and inter-node relationships. To detect illicit activities, we propose a hybrid GCN-GRU model that captures both structural and sequential features. Using real Bitcoin transaction data (2020-2024), our model achieved 0.9470 Accuracy and 0.9807 AUC-ROC, outperforming all baselines.
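
Since the digest above is terse, a minimal sketch may help: a graph convolution summarizes each transaction-graph snapshot, and a GRU runs over the resulting per-node sequences. The architectural details below are our assumptions, not the paper's exact model:

```python
import torch
import torch.nn as nn

class GCNGRU(nn.Module):
    def __init__(self, in_dim, hid_dim, num_classes=2):
        super().__init__()
        self.gcn = nn.Linear(in_dim, hid_dim)  # one-hop graph convolution
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, num_classes)

    def forward(self, A_hat, X_seq):
        # A_hat: (n, n) normalized adjacency; X_seq: (T, n, in_dim) snapshots.
        H = torch.stack([torch.relu(A_hat @ self.gcn(x)) for x in X_seq])
        out, _ = self.gru(H.transpose(0, 1))  # (n, T, hid): sequence per node
        return self.head(out[:, -1])          # illicit/licit logits per node
```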

[332] EMORF-II: Adaptive EM-based Outlier-Robust Filtering with Correlated Measurement Noise

Arslan Majal, Aamir Hussain Chughtai, Muhammad Tahir

Main category: cs.LG

TL;DR: EMORF-II is an enhanced outlier-robust filter that handles correlated measurement noise and learns outlier characteristics during inference, improving accuracy with manageable computational overhead.

DetailsMotivation: Existing outlier-robust filters need better capabilities to handle correlated measurement noise and learn outlier characteristics dynamically during operation.

Method: Enhanced version of EM-based outlier robust filter (EMORF) with additional learning feature that detects and learns outlier characteristics during inference.

Result: Numerical experiments show improved accuracy compared to state-of-the-art methods, with increased but manageable computational overhead.

Conclusion: EMORF-II provides superior outlier-mitigation capability while maintaining practical computational complexity, making it suitable for diverse applications.

Abstract: We present a learning-based outlier-robust filter for a general setup where the measurement noise can be correlated. Since it is an enhanced version of the EM-based outlier-robust filter (EMORF), we call it EMORF-II. Equipped with an additional feature that learns outlier characteristics during inference alongside outlier detection, EMORF-II has improved outlier-mitigation capability. Numerical experiments confirm accuracy gains over state-of-the-art methods, at the cost of increased computational overhead; the computational complexity order nevertheless remains on par with other practical methods, making EMORF-II a useful choice for diverse applications.

[333] The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi

Main category: cs.LG

TL;DR: DPH-RL framework uses mass-covering f-divergences to prevent multi-attempt performance degradation in RL fine-tuning of LLMs, improving both single and multi-attempt accuracy while maintaining knowledge diversity.

DetailsMotivation: Address the paradox where RL fine-tuning improves single-attempt accuracy (Pass@1) but degrades multi-attempt performance (Pass@k) due to catastrophic forgetting and lack of knowledge retention mechanisms in standard RLVR objectives.

Method: Proposes DPH-RL framework that uses mass-covering f-divergences (forward-KL and JS-divergence) as rehearsal mechanisms, continuously referencing the initial policy to maintain broad solution coverage without requiring online reference models.

Result: Extensive experiments on math and SQL generation show DPH-RL resolves Pass@k degradation, improves both Pass@1 and Pass@k in- and out-of-domain, and is more training-efficient by computing f-divergence using generator functions.

Conclusion: Proper selection of divergence measure is a powerful tool for building more general and diverse reasoning models, highlighting an overlooked axis for improving RLVR in LLM fine-tuning.

Abstract: A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives – both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely – lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a rehearsal mechanism. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Extensive experiments on math and SQL generation demonstrate that DPH-RL not only resolves the Pass@k degradation but improves both Pass@1 and Pass@k in- and out-of-domain. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
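
A sketch of the mass-covering rehearsal idea under stated assumptions: the forward KL, KL(pi_ref || pi_theta), is estimated on samples drawn from the frozen initial policy (so no online reference model is needed); the penalty weight `beta` and the way the term enters the loss are illustrative, not the paper's exact objective:

```python
# Hedged sketch of a forward-KL rehearsal penalty for RLVR fine-tuning.
import torch

def forward_kl_penalty(logp_ref, logp_theta):
    """MC estimate of KL(pi_ref || pi_theta) over tokens sampled from pi_ref."""
    return (logp_ref - logp_theta).mean()

# toy per-token log-probs on a batch of reference-policy samples
logp_ref = torch.rand(8, 32).clamp_min(1e-6).log()
logp_theta = torch.rand(8, 32).clamp_min(1e-6).log()

beta = 0.1                                   # penalty weight (assumed)
rlvr_loss = torch.tensor(0.0)                # stand-in for the reward-driven loss
total = rlvr_loss + beta * forward_kl_penalty(logp_ref, logp_theta)
print(total)
```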

[334] Conv4Rec: A 1-by-1 Convolutional AutoEncoder for User Profiling through Joint Analysis of Implicit and Explicit Feedbacks

Antoine Ledent, Petr Kasalický, Rodrigo Alves, Hady W. Lauw

Main category: cs.LG

TL;DR: New convolutional AutoEncoder architecture for recommendation systems that handles both explicit ratings and implicit feedback, learns associations between interaction types, and provides separate predictions for content consumption probability and high rating likelihood.

DetailsMotivation: To improve user modeling and recommendation by creating a more flexible model that can learn from both explicit ratings and implicit feedback patterns, while providing more informative predictions and theoretical guarantees.

Method: Convolutional AutoEncoder architecture that jointly learns from explicit ratings and implicit feedback, models associations between different interaction types, and provides separate predictions for consumption probability and rating likelihood.

Result: Achieves state-of-the-art performance on both implicit and explicit feedback prediction tasks using a single model, with additional interpretability through individual probability predictions for each possible rating.

Conclusion: The proposed architecture successfully combines explicit and implicit feedback learning, provides theoretical generalization bounds, and delivers superior performance while offering enhanced interpretability in recommendation systems.

Abstract: We introduce a new convolutional AutoEncoder architecture for user modelling and recommendation tasks with several improvements over the state of the art. Firstly, our model has the flexibility to learn a set of associations and combinations between different interaction types in a way that carries over to each user and item. Secondly, our model is able to learn jointly from both the explicit ratings and the implicit information in the sampling pattern (which we refer to as 'implicit feedback'). It can also make separate predictions for the probability of consuming content and the likelihood of granting it a high rating if observed. This not only allows the model to make predictions for both the implicit and explicit feedback, but also increases the informativeness of the predictions: in particular, our model can identify items which users would not have been likely to consume naturally, but would be likely to enjoy if exposed to them. Finally, we provide several generalization bounds for our model, which, to the best of our knowledge, are among the first generalization bounds for auto-encoders in a Recommender Systems setting; we also show that optimizing our loss function guarantees the recovery of the exact sampling distribution over interactions up to a small error in total variation. In experiments on several real-life datasets, we achieve state-of-the-art performance on both the implicit and explicit feedback prediction tasks despite relying on a single model for both, and benefiting from additional interpretability in the form of individual predictions for the probabilities of each possible rating.

[335] Water Demand Forecasting of District Metered Areas through Learned Consumer Representations

Adithya Ramachandran, Thorkil Flensmark B. Neergaard, Tomás Arias-Vergara, Andreas Maier, Siming Bayer

Main category: cs.LG

TL;DR: Novel water demand forecasting method using unsupervised contrastive learning to categorize consumers by behavior patterns, then using wavelet-transformed convolutional networks with cross-attention for improved short-term predictions.

DetailsMotivation: Water demand prediction is challenging due to non-deterministic factors like weather, and securing water resources is urgent due to climate change. Smart metering data provides insights but needs better forecasting methods.

Method: Unsupervised contrastive learning to categorize end-users by consumption behavior, then wavelet-transformed convolutional networks with cross-attention mechanism combining historical data and behavioral representations.

Result: 4.9% maximum improvement in MAPE across different DMAs over 6-month evaluation, plus identification of consumers influenced by socioeconomic factors.

Conclusion: The approach improves water demand forecasting accuracy and provides insights into consumption patterns influenced by socioeconomic factors, enhancing understanding of deterministic demand patterns.

Abstract: Advancements in smart metering technologies have significantly improved the ability to monitor and manage water utilities. In the context of increasing uncertainty due to climate change, securing water resources and supply has emerged as an urgent global issue with extensive socioeconomic ramifications. Hourly consumption data from end-users have yielded substantial insights for projecting demand across regions characterized by diverse consumption patterns. Nevertheless, the prediction of water demand remains challenging due to influencing non-deterministic factors, such as meteorological conditions. This work introduces a novel method for short-term water demand forecasting for District Metered Areas (DMAs) which encompass commercial, agricultural, and residential consumers. Unsupervised contrastive learning is applied to categorize end-users according to distinct consumption behaviors present within a DMA. Subsequently, the distinct consumption behaviors are utilized as features in the ensuing demand forecasting task using wavelet-transformed convolutional networks that incorporate a cross-attention mechanism combining both historical data and the derived representations. The proposed approach is evaluated on real-world DMAs over a six-month period, demonstrating improved forecasting performance in terms of MAPE across different DMAs, with a maximum improvement of 4.9%. Additionally, it identifies consumers whose behavior is shaped by socioeconomic factors, enhancing prior knowledge about the deterministic patterns that influence demand.
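
A sketch of the fusion step, assuming a standard multi-head cross-attention layer in which historical-demand encodings attend to the learned consumer representations; the dimensions and single-layer setup are assumptions, and the wavelet CNN encoder is replaced by stand-in tensors:

```python
# Hedged sketch: cross-attention between demand history and consumer
# representations. Shapes and the single layer are illustrative.
import torch
import torch.nn as nn

d_model = 64
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

history = torch.randn(2, 168, d_model)     # one week of hourly encodings
consumers = torch.randn(2, 10, d_model)    # 10 consumer-behaviour representations

fused, attn_weights = cross_attn(query=history, key=consumers, value=consumers)
print(fused.shape, attn_weights.shape)     # (2, 168, 64), (2, 168, 10)
```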

[336] RoseCDL: Robust and Scalable Convolutional Dictionary Learning for Rare-event Detection

Jad Yehya, Mansour Benbakoura, Cédric Allain, Benoît Malezieux, Matthieu Kowalski, Thomas Moreau

Main category: cs.LG

TL;DR: RoseCDL is a scalable and robust convolutional dictionary learning algorithm for unsupervised rare event detection in large signals, combining stochastic windowing for efficiency with inline outlier detection for robustness.

DetailsMotivation: Convolutional Dictionary Learning (CDL) is powerful for modeling local signal structures but faces challenges in rare event detection due to high computational costs and sensitivity to artifacts/outliers.

Method: RoseCDL combines stochastic windowing for efficient training on large datasets with inline outlier detection to enhance robustness and isolate anomalous patterns.

Result: The method reframes CDL as a practical tool for event discovery and characterization in real-world signals, extending beyond traditional compression/denoising tasks.

Conclusion: RoseCDL enables scalable and robust unsupervised rare event detection in long signals, making CDL applicable for discovering and characterizing anomalous patterns in various scientific domains.

Abstract: Identifying recurring patterns and rare events in large-scale signals is a fundamental challenge in fields such as astronomy, physical simulations, and biomedical science. Convolutional Dictionary Learning (CDL) offers a powerful framework for modeling local structures in signals, but its use for detecting rare or anomalous events remains largely unexplored. In particular, CDL faces two key challenges in this setting: high computational cost and sensitivity to artifacts and outliers. In this paper, we introduce RoseCDL, a scalable and robust CDL algorithm designed for unsupervised rare event detection in long signals. RoseCDL combines stochastic windowing for efficient training on large datasets with inline outlier detection to enhance robustness and isolate anomalous patterns. This reframes CDL as a practical tool for event discovery and characterization in real-world signals, extending its role beyond traditional tasks like compression or denoising.

[337] $\Delta L$ Normalization: Rethink Loss Aggregation in RLVR

Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu

Main category: cs.LG

TL;DR: Delta L Normalization is a loss aggregation method that addresses high gradient variance from variable response lengths in RLVR training, providing unbiased estimates and minimizing variance.

DetailsMotivation: RLVR shows promise for improving LLM reasoning but suffers from unstable optimization due to large variability in response lengths during training, causing high gradient variance.

Method: Theoretical and empirical analysis of length effects on policy loss, reformulating as minimum-variance unbiased estimator problem. Delta L Normalization provides unbiased policy loss estimates while minimizing gradient variance.

Result: Extensive experiments demonstrate superior performance across different model sizes, maximum lengths, and tasks compared to previous methods like GRPO, DAPO, and Dr. GRPO.

Conclusion: Delta L Normalization effectively solves the length variability problem in RLVR training, offering both theoretical guarantees and practical performance improvements.

Abstract: We propose $\Delta L$ Normalization, a simple yet effective loss aggregation method tailored to the characteristic of dynamic generation lengths in Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential in improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed $\Delta L$ Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at https://github.com/zerolllin/Delta-L-Normalization.
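
The abstract frames the method as a minimum-variance unbiased estimator; below is a sketch of that classical construction, assuming per-sample gradient variance grows linearly with response length (the paper's exact variance model is not stated in this summary):

```python
# Hedged sketch: inverse-variance weighting of per-sample losses, the
# textbook minimum-variance unbiased combination. The variance model
# (var proportional to length) is an assumption for illustration.
import torch

def min_variance_aggregate(sample_losses, lengths):
    var = lengths.float()                   # assumed: variance grows with length
    w = (1.0 / var) / (1.0 / var).sum()     # inverse-variance weights, sum to 1
    return (w * sample_losses).sum()        # unbiased, minimum-variance combo

losses = torch.tensor([0.9, 1.2, 0.4])      # mean token loss per response
lengths = torch.tensor([12, 480, 95])       # response lengths in tokens
print(min_variance_aggregate(losses, lengths))
```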

[338] uGMM-NN: Univariate Gaussian Mixture Model Neural Network

Zakeria Sharif Ali

Main category: cs.LG

TL;DR: uGMM-NN is a novel neural network architecture that replaces traditional neurons with probabilistic units parameterized as univariate Gaussian mixtures, enabling multimodality and uncertainty capture while maintaining scalability.

DetailsMotivation: To embed probabilistic reasoning directly into neural network computational units, enabling richer representations that capture multimodality and uncertainty at the individual neuron level, unlike traditional fixed-nonlinearity neurons.

Method: Each neuron parameterizes activations as a univariate Gaussian mixture with learnable means, variances, and mixing coefficients, replacing weighted sums and fixed nonlinearities while retaining feedforward network scalability.

Result: uGMM-NN achieves competitive discriminative performance compared to conventional multilayer perceptrons while providing probabilistic interpretation of activations.

Conclusion: The framework provides a foundation for integrating uncertainty-aware components into modern neural architectures, opening new directions for both discriminative and generative modeling.

Abstract: This paper introduces the Univariate Gaussian Mixture Model Neural Network (uGMM-NN), a novel neural architecture that embeds probabilistic reasoning directly into the computational units of deep networks. Unlike traditional neurons, which apply weighted sums followed by fixed nonlinearities, each uGMM-NN node parameterizes its activations as a univariate Gaussian mixture, with learnable means, variances, and mixing coefficients. This design enables richer representations by capturing multimodality and uncertainty at the level of individual neurons, while retaining the scalability of standard feedforward networks. We demonstrate that uGMM-NN can achieve competitive discriminative performance compared to conventional multilayer perceptrons, while additionally offering a probabilistic interpretation of activations. The proposed framework provides a foundation for integrating uncertainty-aware components into modern neural architectures, opening new directions for both discriminative and generative modeling.
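
One plausible reading of a uGMM unit, sketched below: each output neuron scores its pre-activation under a learnable univariate Gaussian mixture (as a log-density) instead of applying a fixed nonlinearity. The exact formulation in the paper may differ:

```python
# Hedged sketch of a uGMM layer: learnable means, variances, and mixing
# coefficients per output neuron. The scoring rule is an assumption.
import math
import torch
import torch.nn as nn

class UGMMLayer(nn.Module):
    def __init__(self, in_dim, out_dim, n_components=3):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.mu = nn.Parameter(torch.randn(out_dim, n_components))
        self.log_var = nn.Parameter(torch.zeros(out_dim, n_components))
        self.mix_logits = nn.Parameter(torch.zeros(out_dim, n_components))

    def forward(self, x):
        s = self.lin(x).unsqueeze(-1)                  # (B, out, 1)
        var = self.log_var.exp()
        log_comp = -0.5 * ((s - self.mu) ** 2 / var
                           + self.log_var + math.log(2 * math.pi))
        log_mix = torch.log_softmax(self.mix_logits, dim=-1)
        return torch.logsumexp(log_mix + log_comp, dim=-1)  # (B, out)

layer = UGMMLayer(8, 4)
print(layer(torch.randn(2, 8)).shape)                  # torch.Size([2, 4])
```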

[339] Uncovering Scaling Laws for Large Language Models via Inverse Problems

Arun Verma, Zhaoxuan Wu, Zijian Zhou, Xiaoqiang Lin, Zhiliang Chen, Rachael Hwee Ling Sim, Rui Qiao, Jingtan Wang, Nhung Bui, Xinyuan Niu, Wenyang Hu, Gregory Kang Ruey Lau, Zi-Yu Khoo, Zitong Zhao, Xinyi Xu, Apivich Hemachandra, See-Kiong Ng, Bryan Kian Hsiang Low

Main category: cs.LG

TL;DR: Inverse problems can efficiently uncover scaling laws for LLMs to achieve better performance with improved cost-effectiveness, avoiding brute-force trial-and-error approaches.

DetailsMotivation: High training costs make brute-force trial-and-error approaches infeasible for improving LLMs, and inverse problems have proven successful in uncovering fundamental scientific laws.

Method: Advocates using inverse problems to efficiently discover scaling laws that guide LLM development, inspired by successful applications in scientific domains.

Result: Proposes a framework where inverse problems can systematically reveal the relationships between model scale, data, computation, and performance.

Conclusion: Inverse problems offer a promising alternative approach to efficiently uncover scaling laws for LLMs, enabling more cost-effective model development compared to traditional trial-and-error methods.

Abstract: Large Language Models (LLMs) are large-scale pretrained models that have achieved remarkable success across diverse domains. These successes have been driven by unprecedented complexity and scale in both data and computations. However, due to the high costs of training such models, brute-force trial-and-error approaches to improve LLMs are not feasible. Inspired by the success of inverse problems in uncovering fundamental scientific laws, this position paper advocates that inverse problems can also efficiently uncover scaling laws that guide the building of LLMs to achieve the desirable performance with significantly better cost-effectiveness.

[340] Homogenization with Guaranteed Bounds via Primal-Dual Physically Informed Neural Networks

Liya Gaynutdinova, Martin Doškář, Ondřej Rokoš, Ivana Pultarová

Main category: cs.LG

TL;DR: Dual formulation PINN framework improves reliability for homogenization of periodic thermo-conductive composites with discontinuous coefficients, providing error bounds and failure detection.

DetailsMotivation: Standard PINNs often fail when applied to materials with discontinuous coefficients like piecewise constant properties, requiring more robust approaches for homogenization problems.

Method: Introduces dual formulation for PINNs to derive guaranteed upper and lower error bounds. Compares standard PINNs with smoothed approximations vs variational PINNs (VPINNs) using spectral and neural network-based test functions.

Result: Strong-form PINNs may outperform VPINNs in controlled settings but are sensitive to material discontinuities. VPINNs accommodate piecewise constants directly but require careful test function selection. Dual formulation reliably indicates convergence quality.

Conclusion: Integration of dual formulation into PINN frameworks enhances applicability to homogenization problems in micromechanics by providing robust error detection and improved reliability for materials with discontinuous coefficients.

Abstract: Physics-informed neural networks (PINNs) have shown promise in solving partial differential equations (PDEs) relevant to multiscale modeling, but they often fail when applied to materials with discontinuous coefficients, such as media with piecewise constant properties. This paper introduces a dual formulation for the PINN framework to improve the reliability of the homogenization of periodic thermo-conductive composites, for both strong and variational (weak) formulations. The dual approach facilitates the derivation of guaranteed upper and lower error bounds, enabling more robust detection of PINN failure. We compare standard PINNs applied to smoothed material approximations with variational PINNs (VPINNs) using both spectral and neural network-based test functions. Our results indicate that while strong-form PINNs may outperform VPINNs in controlled settings, they are sensitive to material discontinuities and may fail without clear diagnostics. In contrast, VPINNs accommodate piecewise constant material parameters directly but require careful selection of test functions to avoid instability. Dual formulation serves as a reliable indicator of convergence quality, and its integration into PINN frameworks enhances their applicability to homogenization problems in micromechanics.

[341] Transformer-Based Approach to Optimal Sensor Placement for Structural Health Monitoring of Probe Cards

Mehdi Bejani, Marco Mauri, Daniele Acconcia, Simone Todaro, Stefano Mariani

Main category: cs.LG

TL;DR: Transformer-based deep learning for sensor placement optimization in semiconductor probe card health monitoring, achieving 99.83% accuracy in failure classification and 99.73% crack detection recall.

DetailsMotivation: Failures in probe cards (substrate cracks and loosened screws) critically affect semiconductor manufacturing yield and reliability. Sensor-based monitoring can detect these failures but requires optimal placement strategies.

Method: Uses frequency response functions from simulated failure scenarios in a finite element model. Trains a hybrid CNN-Transformer model on a comprehensive dataset enriched with physics-informed scenario expansion and physics-aware statistical data augmentation.

Result: Achieves 99.83% accuracy in classifying probe card health states (baseline, loose screw, crack) and 99.73% crack detection recall. Model robustness confirmed through 3 repetitions of 10-fold stratified cross-validation. Attention mechanism identifies critical sensor locations.

Conclusion: Attention-based deep learning enables proactive maintenance and enhances operational reliability and yield in semiconductor manufacturing by optimizing sensor configurations for cost-effective monitoring systems.

Abstract: This paper presents an innovative Transformer-based deep learning strategy for optimizing the placement of sensors aiming at structural health monitoring of semiconductor probe cards. Failures in probe cards, including substrate cracks and loosened screws, would critically affect semiconductor manufacturing yield and reliability. Some failure modes could be detected by equipping a probe card with adequate sensors. Frequency response functions from simulated failure scenarios are adopted within a finite element model of a probe card. A comprehensive dataset, enriched by physics-informed scenario expansion and physics-aware statistical data augmentation, is exploited to train a hybrid Convolutional Neural Network and Transformer model. The model achieves high accuracy (99.83%) in classifying the probe card health states (baseline, loose screw, crack) and an excellent crack detection recall (99.73%). Model robustness is confirmed through a rigorous framework of 3 repetitions of 10-fold stratified cross-validation. The attention mechanism also pinpoints critical sensor locations: an analysis of the attention weights offers actionable insights for designing efficient, cost-effective monitoring systems by optimizing sensor configurations. This research highlights the capability of attention-based deep learning to advance proactive maintenance, enhancing operational reliability and yield in semiconductor manufacturing.

[342] K2-Think: A Parameter-Efficient Reasoning System

Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing

Main category: cs.LG

TL;DR: K2-Think is a 32B parameter reasoning system that matches/surpasses larger models like GPT-OSS 120B through advanced post-training techniques and test-time computation, achieving state-of-the-art performance in mathematical reasoning and strong results in other domains.

DetailsMotivation: To demonstrate that smaller models can compete with much larger state-of-the-art systems through integrated post-training recipes and inference-time enhancements, making open-source reasoning systems more accessible and affordable.

Method: Built on Qwen2.5 base model with six technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware using publicly available open-source datasets.

Result: Achieves state-of-the-art scores on public benchmarks for open-source models in mathematical reasoning, while performing strongly in Code and Science domains. Delivers best-in-class inference speeds of over 2,000 tokens per second per request via Cerebras Wafer-Scale Engine.

Conclusion: A 32B parameter model can compete with state-of-the-art systems through integrated post-training techniques including long chain-of-thought training and strategic inference-time enhancements, making high-performance reasoning systems more accessible.

Abstract: K2-Think is a reasoning system that achieves state-of-the-art performance with a 32B parameter model, matching or surpassing much larger models like GPT-OSS 120B and DeepSeek v3.1. Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets. K2-Think excels in mathematical reasoning, achieving state-of-the-art scores on public benchmarks for open-source models, while also performing strongly in other areas such as Code and Science. Our results confirm that a more parameter-efficient model like K2-Think 32B can compete with state-of-the-art systems through an integrated post-training recipe that includes long chain-of-thought training and strategic inference-time enhancements, making open-source reasoning systems more accessible and affordable. K2-Think is freely available at k2think.ai, offering best-in-class inference speeds of over 2,000 tokens per second per request via the Cerebras Wafer-Scale Engine.

[343] Beyond Rebalancing: Benchmarking Binary Classifiers Under Class Imbalance Without Rebalancing Techniques

Ali Nawaz, Amir Ahmad, Shehroz S. Khan

Main category: cs.LG

TL;DR: Evaluation of binary classifiers’ performance under severe class imbalance without rebalancing techniques, showing that advanced models like TabPFN and boosting ensembles maintain better performance than traditional classifiers.

DetailsMotivation: Class imbalance is a major challenge in supervised classification, especially in critical domains like medical diagnostics, but most research focuses on rebalancing techniques rather than evaluating classifiers' native performance under imbalance.

Method: Systematic evaluation of diverse binary classifiers across real-world and synthetic datasets with progressively reduced minority class sizes, using one-shot and few-shot baselines. Includes experiments with varying data complexity through synthetic decision boundaries, undersampling, oversampling, and one-class classification methods.

Result: Classification becomes more difficult as data complexity increases and minority class size decreases. Traditional classifiers deteriorate under extreme imbalance, while advanced models like TabPFN and boosting-based ensembles retain relatively higher performance and better generalization.

Conclusion: The study provides valuable guidance for model selection in imbalanced learning, showing that certain advanced classifiers can maintain robustness without explicit rebalancing techniques, with implications for critical applications where minority classes are rare.

Abstract: Class imbalance poses a significant challenge to supervised classification, particularly in critical domains like medical diagnostics and anomaly detection where minority class instances are rare. While numerous studies have explored rebalancing techniques to address this issue, less attention has been given to evaluating the performance of binary classifiers under imbalance when no such techniques are applied. Therefore, the goal of this study is to assess the performance of binary classifiers “as-is”, without performing any explicit rebalancing. Specifically, we systematically evaluate the robustness of a diverse set of binary classifiers across both real-world and synthetic datasets, under progressively reduced minority class sizes, using one-shot and few-shot scenarios as baselines. Our approach also explores varying data complexities through synthetic decision boundary generation to simulate real-world conditions. In addition to standard classifiers, we include experiments using undersampling, oversampling strategies, and one-class classification (OCC) methods to examine their behavior under severe imbalance. The results confirm that classification becomes more difficult as data complexity increases and the minority class size decreases. While traditional classifiers deteriorate under extreme imbalance, advanced models like TabPFN and boosting-based ensembles retain relatively higher performance and better generalization compared to traditional classifiers. Visual interpretability and evaluation metrics further validate these findings. Our work offers valuable guidance on model selection for imbalanced learning, providing insights into classifier robustness without dependence on explicit rebalancing techniques.
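
A sketch of the "as-is" protocol: progressively shrink the minority class, train without any rebalancing, and track a threshold-free metric. The dataset and the two classifiers are illustrative stand-ins:

```python
# Hedged sketch: evaluating classifiers under growing imbalance, no rebalancing.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for keep in [1.0, 0.1, 0.01]:                         # fraction of minority kept
    pos = np.where(y_tr == 1)[0]
    mask = np.ones(len(y_tr), dtype=bool)
    mask[pos[int(len(pos) * keep):]] = False          # drop the rest, no rebalancing
    for clf in (LogisticRegression(max_iter=1000), GradientBoostingClassifier()):
        clf.fit(X_tr[mask], y_tr[mask])
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        print(f"keep={keep:.2f} {type(clf).__name__}: AUC={auc:.3f}")
```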

[344] Graph-based Integrated Gradients for Explaining Graph Neural Networks

Lachlan Simpson, Kyle Millar, Adriel Cheng, Cheng-Chew Lim, Hong Gunn Chew

Main category: cs.LG

TL;DR: GB-IG extends Integrated Gradients to graphs, outperforming standard IG on graph datasets by better identifying important structural components and features for classification tasks.

DetailsMotivation: Integrated Gradients (IG) is designed for continuous data but graphs are discrete structures, making IG ill-suited for graph explainability tasks.

Method: Developed graph-based integrated gradients (GB-IG) as an extension of IG specifically for graph structures.

Result: GB-IG accurately identifies crucial structural components on synthetic datasets and outperforms IG on real-world graph datasets for node classification tasks.

Conclusion: GB-IG successfully adapts IG to graph data, providing more effective explainability for graph neural networks by handling discrete graph structures.

Abstract: Integrated Gradients (IG) is a common explainability technique to address the black-box problem of neural networks. Integrated gradients assumes continuous data. Graphs are discrete structures making IG ill-suited to graphs. In this work, we introduce graph-based integrated gradients (GB-IG); an extension of IG to graphs. We demonstrate on four synthetic datasets that GB-IG accurately identifies crucial structural components of the graph used in classification tasks. We further demonstrate on three prevalent real-world graph datasets that GB-IG outperforms IG in highlighting important features for node classification tasks.
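
For reference, a sketch of vanilla Integrated Gradients applied to a node's feature vector, the baseline that GB-IG extends; the graph-specific path construction of GB-IG is not detailed in this summary, so only standard IG is shown:

```python
# Hedged sketch: standard Integrated Gradients on a node feature vector.
import torch

def integrated_gradients(model, x, baseline, steps=50):
    """IG attribution for input x against a baseline of the same shape."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)       # straight-line path
    path.requires_grad_(True)
    model(path).sum().backward()
    avg_grad = path.grad.mean(dim=0)                # average gradient along path
    return (x - baseline) * avg_grad                # completeness-scaled attribution

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 1))  # stand-in node classifier
x = torch.randn(16)                                 # one node's feature vector
attr = integrated_gradients(model, x, torch.zeros(16))
print(attr.shape)                                   # torch.Size([16])
```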

[345] FUnc-SNE: A flexible, Fast, and Unconstrained algorithm for neighbour embeddings

Pierre Lambert, Edouard Couplet, Michel Verleysen, John Aldo Lee

Main category: cs.LG

TL;DR: A novel neighbor embedding acceleration method that bridges the gap between coarse approximations (like UMAP) and precise but slow methods (like FIt-SNE), enabling interactive exploration with good structure preservation and hyperparameter flexibility without dimensionality limits.

DetailsMotivation: To address the trade-off between speed and quality in neighbor embedding methods - current approaches either sacrifice structure preservation for speed (UMAP) or are too slow for interactive use while limiting dimensionality to 2-3 (FIt-SNE/BH-t-SNE).

Method: Proposes a new acceleration technique requiring few computations per iteration while maintaining structure quality. Features a novel iterative approximate nearest neighbor search approach and abandons the traditional two-phase approach for instantaneous visual feedback during hyperparameter tuning.

Result: Experiments show promising results in speed, flexibility for structure extraction, and potential for broader ML applications. The method enables interactive data exploration with immediate visual feedback even when adjusting high-dimensional hyperparameters.

Conclusion: The method successfully bridges the gap between fast but coarse approximations and precise but slow methods, offering a balanced solution for interactive neighbor embedding with good structure preservation and dimensionality flexibility.

Abstract: Neighbour embeddings (NE) allow the representation of high dimensional datasets into lower dimensional spaces and are often used in data visualisation. In practice, accelerated approximations are employed to handle very large datasets. Accelerating NE is challenging, and two main directions have been explored: very coarse approximations based on negative sampling (as in UMAP) achieve high effective speed but may lack quality in the extracted structures; less coarse approximations, as used in FIt-SNE or BH-t-SNE, offer better structure preservation at the cost of speed, while also restricting the target dimensionality to 2 or 3, limiting NE to visualisation. In some variants, the precision of these costlier accelerations also enables finer-grained control on the extracted structures through dedicated hyperparameters. This paper proposes to bridge the gap between both approaches by introducing a novel way to accelerate NE, requiring a small number of computations per iteration while maintaining good fine-grained structure preservation and flexibility through hyperparameter tuning, without limiting the dimensionality of the embedding space. The method was designed for interactive exploration of data; as such, it abandons the traditional two-phased approach of other NE methods, allowing instantaneous visual feedback when changing hyperparameters, even when these control processes happening on the high-dimensional side of the computations. Experiments using a publicly available, GPU accelerated GUI integration of the method show promising results in terms of speed, flexibility in the structures being extracted, and show potential uses in broader machine learning contexts with minimal algorithmic modifications. Central to this algorithm is a novel approach to iterative approximate nearest neighbour search, which shows promising results compared to nearest neighbour descent.

[346] IBN: An Interpretable Bidirectional-Modeling Network for Multivariate Time Series Forecasting with Variable Missing

Shusen Ma, Tianhao Zhang, Qijiu Xia, Yun-Bo Zhao

Main category: cs.LG

TL;DR: IBN proposes an interpretable bidirectional network for multivariate time series forecasting with missing variables, using uncertainty-aware interpolation and Gaussian kernel-based graph convolution to achieve state-of-the-art performance.

DetailsMotivation: Existing methods like GinAR lack interpretability and fail to capture latent temporal patterns when handling missing variables in multivariate time series forecasting, requiring a more reliable and interpretable framework.

Method: IBN integrates Uncertainty-Aware Interpolation (UAI) with MC Dropout to estimate uncertainty of reconstructed values, Gaussian kernel-based Graph Convolution (GGCN) to model spatial correlations, and bidirectional recursive units for enhanced temporal dependency modeling.

Result: Extensive experiments demonstrate that IBN achieves state-of-the-art forecasting performance under various missing-rate scenarios, providing superior reliability and interpretability.

Conclusion: IBN offers a more reliable and interpretable framework for multivariate time series forecasting with missing variables, effectively addressing limitations of previous approaches through uncertainty-aware reconstruction and enhanced spatial-temporal modeling.

Abstract: Multivariate time series forecasting (MTSF) often faces challenges from missing variables, which hinder conventional spatial-temporal graph neural networks in modeling inter-variable correlations. While GinAR addresses variable missing using attention-based imputation and adaptive graph learning for the first time, it lacks interpretability and fails to capture more latent temporal patterns due to its simple recursive units (RUs). To overcome these limitations, we propose the Interpretable Bidirectional-modeling Network (IBN), integrating Uncertainty-Aware Interpolation (UAI) and Gaussian kernel-based Graph Convolution (GGCN). IBN estimates the uncertainty of reconstructed values using MC Dropout and applies an uncertainty-weighted strategy to mitigate high-risk reconstructions. GGCN explicitly models spatial correlations among variables, while a bidirectional RU enhances temporal dependency modeling. Extensive experiments show that IBN achieves state-of-the-art forecasting performance under various missing-rate scenarios, providing a more reliable and interpretable framework for MTSF with missing variables. Code is available at: https://github.com/zhangth1211/NICLab-IBN.
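
A sketch of the MC Dropout ingredient: dropout is kept active at inference, several stochastic forward passes are averaged, and the predictive variance serves as an uncertainty signal. The network and the weighting rule are illustrative assumptions:

```python
# Hedged sketch: MC Dropout uncertainty for reconstructed values.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.2), nn.Linear(32, 1))

def mc_dropout_predict(x, n_samples=30):
    net.train()                                    # keeps Dropout stochastic
    with torch.no_grad():
        preds = torch.stack([net(x) for _ in range(n_samples)])
    return preds.mean(0), preds.var(0)             # reconstruction + uncertainty

x = torch.randn(5, 8)                              # stand-in inputs to reconstruct
mean, var = mc_dropout_predict(x)
weight = 1.0 / (1.0 + var)                         # down-weight risky reconstructions
print(mean.squeeze(), weight.squeeze())
```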

[347] MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?

Songkai Ma, Zhaorui Zhang, Sheng Di, Benben Liu, Xiaodong Yu, Xiaoyi Lu, Dan Wang

Main category: cs.LG

TL;DR: Proposes using error-bounded lossy compression for non-activated experts in MoE models to reduce GPU memory transfer overhead, with analysis showing varying impact on inference accuracy across different model layers.

DetailsMotivation: Efficiently serving Mixture of Experts models under limited GPU memory constraints by reducing data transfer overhead when offloading non-activated experts to main memory.

Method: Employ error-bounded lossy compression algorithms (SZ3 and CuSZp) to compress non-activated experts and conduct extensive experiments across various benchmarks to analyze compression-induced error effects.

Result: Shallow layer experts (attention and input transformation) show minimal accuracy degradation; middle-layer experts (model reasoning) suffer significant accuracy impairment; deep-layer experts (instruction following) sometimes show improved accuracy with bounded errors.

Conclusion: Error-bounded compression is effective for MoE model serving with layer-dependent impact - shallow layers tolerate compression well, middle layers are sensitive, while deep layers can sometimes benefit from bounded errors.

Abstract: With the widespread application of Mixture of Experts (MoE) reasoning models in the field of LLM learning, efficiently serving MoE models under limited GPU memory constraints has emerged as a significant challenge. Offloading the non-activated experts to main memory has been identified as an efficient approach to address such a problem, while it brings the challenges of transferring the expert between the GPU memory and main memory. We need to explore an efficient approach to compress the expert and analyze how the compression error affects the inference performance. To bridge this gap, we propose employing error-bounded lossy compression algorithms (such as SZ3 and CuSZp) to compress non-activated experts, thereby reducing data transfer overhead during MoE inference. We conduct extensive experiments across various benchmarks and present a comprehensive analysis of how compression-induced errors in different experts affect overall inference accuracy. The results indicate that experts in the shallow layers, which are primarily responsible for the attention mechanism and the transformation of input tokens into vector representations, exhibit minimal degradation in inference accuracy when subjected to bounded errors. In contrast, errors in the middle-layer experts, which are central to model reasoning, significantly impair inference accuracy. Interestingly, introducing bounded errors in the deep-layer experts, which are mainly responsible for instruction following and output integration, can sometimes lead to improvements in inference accuracy.
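
A sketch of the error-bound contract these compressors provide, using a naive uniform quantizer in place of SZ3/CuSZp (which are far more sophisticated): every reconstructed weight is guaranteed to lie within `bound` of the original:

```python
# Hedged sketch: the error-bounded guarantee, illustrated with a naive
# uniform quantizer. This is not the SZ3/CuSZp algorithm.
import numpy as np

def compress(weights, bound):
    return np.round(weights / (2 * bound)).astype(np.int32)  # quantized codes

def decompress(codes, bound):
    return codes * (2 * bound)

expert = np.random.randn(4096).astype(np.float32)  # stand-in expert weights
bound = 1e-2
recon = decompress(compress(expert, bound), bound)
assert np.max(np.abs(recon - expert)) <= bound + 1e-8  # error bound holds
print("max abs error:", np.max(np.abs(recon - expert)))
```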

[348] Forecasting Russian Equipment Losses Using Time Series and Deep Learning Models

Jonathan Teagan

Main category: cs.LG

TL;DR: Comparative analysis of forecasting models for Russian equipment losses in Ukraine war using OSINT data, showing deep learning models (TCN and LSTM) perform best with high temporal granularity.

DetailsMotivation: To assess trends in Russian equipment attrition during the Ukraine conflict, evaluate different forecasting model performance, and estimate future loss patterns using publicly available OSINT data.

Method: Applied multiple forecasting techniques including ARIMA, Prophet, LSTM, TCN, and XGBoost on daily and monthly OSINT data from WarSpotting to model and predict equipment losses.

Result: Deep learning models (particularly TCN and LSTM) produced the most stable and consistent forecasts, especially with high temporal granularity data.

Conclusion: Ensemble forecasting is valuable for conflict modeling, and publicly available OSINT data provides important insights into quantifying material degradation over time in military conflicts.

Abstract: This study applies a range of forecasting techniques, including ARIMA, Prophet, Long Short-Term Memory networks (LSTM), Temporal Convolutional Networks (TCN), and XGBoost, to model and predict Russian equipment losses during the ongoing war in Ukraine. Drawing on daily and monthly open-source intelligence (OSINT) data from WarSpotting, we aim to assess trends in attrition, evaluate model performance, and estimate future loss patterns through the end of 2025. Our findings show that deep learning models, particularly TCN and LSTM, produce stable and consistent forecasts, especially under conditions of high temporal granularity. By comparing different model architectures and input structures, this study highlights the importance of ensemble forecasting in conflict modeling, and the value of publicly available OSINT data in quantifying material degradation over time.
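
As a concrete instance of the classical baselines compared here, a sketch of fitting an ARIMA model to a cumulative-loss series and projecting ahead; the order and the synthetic series are illustrative, not the study's configuration:

```python
# Hedged sketch: ARIMA baseline on a toy cumulative-loss series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
losses = np.cumsum(rng.poisson(12, size=365)).astype(float)  # toy daily series

fit = ARIMA(losses, order=(2, 1, 1)).fit()     # order is an assumption
forecast = fit.forecast(steps=30)              # 30-day projection
print(forecast[:5])
```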

[349] Predicting person-level injury severity using crash narratives: A balanced approach with roadway classification and natural language process techniques

Mohammad Zana Majidi, Sajjad Karimi, Teng Wang, Robert Kluger, Reginald Souleyrette

Main category: cs.LG

TL;DR: Combining unstructured crash narratives with structured data using NLP techniques (TF-IDF and Word2Vec) significantly improves injury severity prediction in traffic crashes compared to using structured data alone.

DetailsMotivation: To enhance road safety, emergency response, and public health interventions by leveraging the additional value of unstructured police crash narratives when combined with traditional structured crash data for injury severity prediction.

Method: Used TF-IDF and Word2Vec NLP techniques to extract semantic meaning from crash narratives, applied K-Nearest Neighbors oversampling for class imbalance, and developed 102 machine learning models using XGBoost, Random Forest, and AdaBoost across three road classification schemes.

Result: Models incorporating narrative data consistently outperformed structured-data-only models. TF-IDF with XGBoost yielded the most accurate predictions across most subgroups.

Conclusion: Integrating textual and structured crash information provides a powerful framework for improving injury prediction, offering transportation safety professionals a practical approach to enhance crash severity modeling and guide policy decisions.

Abstract: Predicting injuries and fatalities in traffic crashes plays a critical role in enhancing road safety, improving emergency response, and guiding public health interventions. This study investigates the added value of unstructured crash narratives (written by police officers at the scene) when combined with structured crash data to predict injury severity. Two widely used Natural Language Processing (NLP) techniques, Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec, were employed to extract semantic meaning from the narratives, and their effectiveness was compared. To address the challenge of class imbalance, a K-Nearest Neighbors-based oversampling method was applied to the training data prior to modeling. The dataset consists of crash records from Kentucky spanning 2019 to 2023. To account for roadway heterogeneity, three road classification schemes were used: (1) eight detailed functional classes (e.g., Urban Two-Lane, Rural Interstate, Urban Multilane Divided), (2) four broader paired categories (e.g., Urban vs. Rural, Freeway vs. Non-Freeway), and (3) a unified dataset without classification. A total of 102 machine learning models were developed by combining structured features and narrative-based features using the two NLP techniques alongside three ensemble algorithms: XGBoost, Random Forest, and AdaBoost. Results demonstrate that models incorporating narrative data consistently outperform those relying solely on structured data. Among all combinations, TF-IDF coupled with XGBoost yielded the most accurate predictions in most subgroups. The findings highlight the power of integrating textual and structured crash information to enhance person-level injury prediction. This work offers a practical and adaptable framework for transportation safety professionals to improve crash severity modeling, guide policy decisions, and design more effective countermeasures.
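
A sketch of the winning combination reported here: TF-IDF features from narratives concatenated with structured features and fed to XGBoost. The columns and toy data are illustrative:

```python
# Hedged sketch: TF-IDF narrative features + structured features -> XGBoost.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

narratives = ["driver ran red light struck pedestrian",
              "vehicle slid on ice into guardrail"]
structured = np.array([[45.0, 1], [30.0, 0]])       # e.g. speed limit, urban flag
y = np.array([1, 0])                                # injury severity label

X_text = TfidfVectorizer().fit_transform(narratives)
X = hstack([X_text, csr_matrix(structured)]).tocsr()

clf = XGBClassifier(n_estimators=50, eval_metric="logloss")
clf.fit(X, y)
print(clf.predict(X))
```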

[350] Addressing the Cold-Start Problem for Personalized Combination Drug Screening

Antoine de Mathelin, Christopher Tosh, Wesley Tansey

Main category: cs.LG

TL;DR: A deep learning-based strategy for efficient initial drug combination screening in personalized oncology using pretrained models and dose-weighting mechanisms.

DetailsMotivation: Personalized combination therapy faces immense search space challenges with limited feasible experiments, creating a cold-start problem where no prior patient information is available to guide initial drug response testing.

Method: Leverages pretrained deep learning model on historical drug response data to generate drug combination embeddings and dose-level importance scores, combining clustering for functional diversity with dose-weighting to prioritize historically informative doses.

Result: Retrospective simulations on large-scale drug combination datasets show substantial improvement in initial screening efficiency compared to baseline methods.

Conclusion: The proposed method offers a viable path for more effective early-phase decision-making in personalized combination drug screens by addressing the cold-start problem through principled experiment selection.

Abstract: Personalizing combination therapies in oncology requires navigating an immense space of possible drug and dose combinations, a task that remains largely infeasible through exhaustive experimentation. Recent developments in patient-derived models have enabled high-throughput ex vivo screening, but the number of feasible experiments is limited. Further, a tight therapeutic window makes gathering molecular profiling information (e.g. RNA-seq) impractical as a means of guiding drug response prediction. This leads to a challenging cold-start problem: how do we select the most informative combinations to test early, when no prior information about the patient is available? We propose a strategy that leverages a pretrained deep learning model built on historical drug response data. The model provides both embeddings for drug combinations and dose-level importance scores, enabling a principled selection of initial experiments. We combine clustering of drug embeddings to ensure functional diversity with a dose-weighting mechanism that prioritizes doses based on their historical informativeness. Retrospective simulations on large-scale drug combination datasets show that our method substantially improves initial screening efficiency compared to baselines, offering a viable path for more effective early-phase decision-making in personalized combination drug screens.
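
A sketch of the selection rule described above, assuming pretrained combination embeddings and per-combination informativeness scores are given (both random stand-ins here, as is the budget `k`): cluster for functional diversity, then pick the most historically informative member of each cluster:

```python
# Hedged sketch: diversity-first cold-start selection of drug combinations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))          # pretrained drug-combo embeddings
dose_score = rng.random(200)              # historical informativeness per combo

k = 8                                     # experiment budget (assumed)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
picks = [int(np.argmax(np.where(labels == c, dose_score, -np.inf)))
         for c in range(k)]               # best-scoring combo in each cluster
print("initial screen:", sorted(picks))
```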

[351] Leveraging Support Vector Regression for Outcome Prediction in Personalized Ultra-fractionated Stereotactic Adaptive Radiotherapy

Yajun Yu, Steve Jiang, Robert Timmerman, Hao Peng

Main category: cs.LG

TL;DR: Multi-omics SVR model using radiomics and dosiomics features predicts GTV changes in PULSAR radiotherapy with high accuracy (R²=0.743).

DetailsMotivation: Accurate prediction of gross tumor volume (GTV) changes has substantial prognostic value for personalized ultra-fractionated stereotactic adaptive radiotherapy (PULSAR) treatment.

Method: Developed support vector regression (SVR) models using radiomics (MRI) and dosiomics (dose maps) features from 39 patients with 69 brain metastases. Used LASSO feature selection and delta features to capture changes between time points. Employed 5-fold cross-validation with 10 repeats.

Result: Multi-omics models integrating radiomics, dosiomics, and delta features outperformed individual-omics models. Top model achieved R²=0.743 and RRMSE=0.022. Delta-radiomic features significantly enhanced prediction accuracy.

Conclusion: The multi-omics SVR model shows promising performance for predicting continuous GTV changes, providing a quantitative and personalized approach for patient selection and treatment adjustment in PULSAR radiotherapy.

Abstract: Personalized ultra-fractionated stereotactic adaptive radiotherapy (PULSAR) is a novel treatment that delivers radiation in pulses of protracted intervals. Accurate prediction of gross tumor volume (GTV) changes through regression models has substantial prognostic value. This study aims to develop a multi-omics based support vector regression (SVR) model for predicting GTV change. A retrospective cohort of 39 patients with 69 brain metastases was analyzed, based on radiomics (MRI images) and dosiomics (dose maps) features. Delta features were computed to capture relative changes between two time points. A feature selection pipeline using least absolute shrinkage and selection operator (Lasso) algorithm with weight- or frequency-based ranking criterion was implemented. SVR models with various kernels were evaluated using the coefficient of determination (R2) and relative root mean square error (RRMSE). Five-fold cross-validation with 10 repeats was employed to mitigate the limitation of small data size. Multi-omics models that integrate radiomics, dosiomics, and their delta counterparts outperform individual-omics models. Delta-radiomic features play a critical role in enhancing prediction accuracy relative to features at single time points. The top-performing model achieves an R2 of 0.743 and an RRMSE of 0.022. The proposed multi-omics SVR model shows promising performance in predicting continuous change of GTV. It provides a more quantitative and personalized approach to assist patient selection and treatment adjustment in PULSAR.
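
A sketch of the modelling pipeline as described: LASSO-based feature selection feeding an RBF-kernel SVR, scored with 10-times-repeated 5-fold cross-validation; the feature matrix is a random stand-in for the radiomic/dosiomic features, and the Lasso alpha is an assumption:

```python
# Hedged sketch: Lasso feature selection + SVR with repeated 5-fold CV.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(69, 120))                     # 69 lesions, 120 features
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=69)  # toy GTV change

pipe = make_pipeline(StandardScaler(),
                     SelectFromModel(Lasso(alpha=0.05)),
                     SVR(kernel="rbf"))
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
print(f"mean R2 = {scores.mean():.3f}")
```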

[352] A Survey of Graph Neural Networks for Drug Discovery: Recent Developments and Challenges

Katherine Berry, Liang Cheng

Main category: cs.LG

TL;DR: Comprehensive review of Graph Neural Networks (GNNs) applications across various drug discovery domains including molecular property prediction, drug-target interactions, drug repositioning, and new drug design.

DetailsMotivation: GNNs have gained significant traction in drug discovery due to their ability to process graph-structured data like drug molecule models, leading to numerous published methods that need comprehensive categorization and analysis.

Method: The paper conducts a comprehensive review covering multiple research categories including molecular property prediction, drug-target binding affinity prediction, drug-drug interaction studies, microbiome interaction prediction, drug repositioning, retrosynthesis, and new drug design.

Result: The review systematically categorizes and analyzes recent GNN-based approaches across various drug discovery domains, providing insights into current methodologies and applications.

Conclusion: The paper provides comprehensive guidance for future work on GNNs in drug discovery by synthesizing current research across multiple application domains and identifying directions for further development.

Abstract: Graph Neural Networks (GNNs) have gained traction in the complex domain of drug discovery because of their ability to process graph-structured data such as drug molecule models. This approach has resulted in a myriad of methods and models in published literature across several categories of drug discovery research. This paper comprehensively reviews recent work across these categories, namely molecular property prediction (including drug-target binding affinity prediction), drug-drug interaction studies, microbiome interaction prediction, drug repositioning, retrosynthesis, and new drug design, and provides guidance for future work on GNNs for drug discovery.

[353] Feasibility of In-Ear Single-Channel ExG for Wearable Sleep Monitoring in Real-World Settings

Philipp Lepold, Jonas Leichtle, Tobias Röddiger, Michael Beigl

Main category: cs.LG

TL;DR: In-ear EEG signals can achieve 90.5% accuracy for binary sleep detection and 65.1% for four-class staging, offering a comfortable wearable alternative to traditional scalp EEG for sleep monitoring.

DetailsMotivation: Traditional EEG sleep staging is accurate but obtrusive and impractical for everyday use outside sleep laboratories, limiting real-world applications like home monitoring and automatic media control when users fall asleep.

Method: Conducted sleep study with 11 participants using custom earpiece with dry eartip electrode in one ear and reference in the other. Used Apple Watch Ultra for ground truth validation and leave-one-subject-out validation for testing.

Result: Achieved 90.5% accuracy for binary sleep detection (Awake vs. Asleep) and 65.1% accuracy for four-class staging (Awake, REM, Core, Deep).

Conclusion: In-ear electrodes show potential as a low-effort, comfortable approach to sleep monitoring, enabling practical applications like automatically pausing media when users fall asleep.

Abstract: Automatic sleep staging typically relies on gold-standard EEG setups, which are accurate but obtrusive and impractical for everyday use outside sleep laboratories. This limits applicability in real-world settings, such as home environments, where continuous, long-term monitoring is needed. Detecting sleep onset is particularly relevant, enabling consumer applications (e.g. automatically pausing media playback when the user falls asleep). Recent research has shown correlations between in-ear EEG and full-scalp EEG for various phenomena, suggesting wearable, in-ear devices could allow unobtrusive sleep monitoring. We investigated the feasibility of using single-channel in-ear electrophysiological (ExG) signals for automatic sleep staging in a wearable device by conducting a sleep study with 11 participants (mean age: 24), using a custom earpiece with a dry eartip electrode (Dätwyler SoftPulse) as a measurement electrode in one ear and a reference in the other. Ground truth sleep stages were obtained from an Apple Watch Ultra, validated for sleep staging. Our system achieved 90.5% accuracy for binary sleep detection (Awake vs. Asleep) and 65.1% accuracy for four-class staging (Awake, REM, Core, Deep) using leave-one-subject-out validation. These findings demonstrate the potential of in-ear electrodes as a low-effort, comfortable approach to sleep monitoring, with applications such as stopping podcasts when users fall asleep.
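
A sketch of the leave-one-subject-out protocol: each participant forms exactly one test fold, so no subject's epochs leak between train and test. The features and classifier are stand-ins:

```python
# Hedged sketch: leave-one-subject-out validation for sleep staging.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1100, 20))            # epochs x stand-in ExG features
y = rng.integers(0, 2, size=1100)          # Awake vs. Asleep labels
subjects = np.repeat(np.arange(11), 100)   # 11 participants

scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X, y, cv=LeaveOneGroupOut(), groups=subjects)
print("per-subject accuracies:", np.round(scores, 2))
```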

[354] A Modular Algorithm for Non-Stationary Online Convex-Concave Optimization

Qing-xin Meng, Xia Lei, Jian-wei Liu

Main category: cs.LG

TL;DR: Novel modular algorithm for Online Convex-Concave Optimization achieves minimax optimal dynamic duality gap bounds with adaptive components for varying non-stationarity and predictor aggregation.

DetailsMotivation: Existing algorithms fail to deliver optimal performance in Online Convex-Concave Optimization, particularly in stationary or predictable environments, requiring better dynamic duality gap minimization.

Method: Proposed modular algorithm with three components: Adaptive Module for dynamic adjustment to non-stationarity, Multi-Predictor Aggregator for best predictor selection, and Integration Module for combining strengths.
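
The aggregation step can be pictured with a generic exponential-weights scheme for tracking the best of several predictors online; this is a standard stand-in, not necessarily the paper's Multi-Predictor Aggregator:

```python
import numpy as np

def hedge_aggregate(predictions, losses, eta=0.5):
    """Track the best of M predictors with exponential weights.

    predictions: (T, M) array, each predictor's prediction per round
    losses:      (T, M) array, each predictor's loss per round
    Returns the aggregated prediction per round.
    """
    T, M = predictions.shape
    log_w = np.zeros(M)                    # uniform initial weights
    combined = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        combined[t] = w @ predictions[t]   # weighted aggregation
        log_w -= eta * losses[t]           # downweight bad predictors
    return combined

# Toy usage: two predictors, the second becomes accurate after round 50.
T = 100
target = np.sin(np.linspace(0, 3, T))
preds = np.stack([target + 0.5,
                  np.where(np.arange(T) < 50, 1.0, target)], axis=1)
loss = (preds - target[:, None]) ** 2
out = hedge_aggregate(preds, loss)
print("final-round error:", abs(out[-1] - target[-1]))
```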

Result: Achieves minimax optimal D-DGap upper bound (up to logarithmic factor) and prediction error-driven D-DGap bounds, with empirical validation showing effectiveness and adaptability.

Conclusion: The modular design enables optimal performance in dynamic environments and allows seamless component replacement and integration of side knowledge from multiple predictors.

Abstract: This paper investigates the problem of Online Convex-Concave Optimization, which extends Online Convex Optimization to two-player time-varying convex-concave games. The goal is to minimize the dynamic duality gap (D-DGap), a critical performance measure that evaluates players’ strategies against arbitrary comparator sequences. Existing algorithms fail to deliver optimal performance, particularly in stationary or predictable environments. To address this, we propose a novel modular algorithm with three core components: an Adaptive Module that dynamically adjusts to varying levels of non-stationarity, a Multi-Predictor Aggregator that identifies the best predictor among multiple candidates, and an Integration Module that effectively combines their strengths. Our algorithm achieves a minimax optimal D-DGap upper bound, up to a logarithmic factor, while also ensuring prediction error-driven D-DGap bounds. The modular design allows for the seamless replacement of components that regulate adaptability to dynamic environments, as well as the incorporation of components that integrate "side knowledge" from multiple predictors. Empirical results further demonstrate the effectiveness and adaptability of the proposed method.

[355] Bio-KGvec2go: Serving up-to-date Dynamic Biomedical Knowledge Graph Embeddings

Hamid Ahmad, Heiko Paulheim, Rita T. Sousa

Main category: cs.LG

TL;DR: Bio-KGvec2go extends the KGvec2go Web API to provide pre-trained knowledge graph embeddings for biomedical ontologies, with regular updates to keep pace with ontology version releases.

DetailsMotivation: Knowledge graphs and ontologies are crucial for modern AI applications, but integrating them with machine learning requires embedding models. Pre-trained models for popular biomedical ontologies can democratize AI development and enable sustainable computing by eliminating the need to retrain models for different tasks.

Method: Extension of the KGvec2go Web API to generate and serve knowledge graph embeddings for widely used biomedical ontologies, with support for regular updates aligned with ontology version releases.
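
A hedged sketch of how a client might consume such a service; the base URL, route, and response fields below are hypothetical placeholders, not the documented Bio-KGvec2go interface:

```python
import requests

BASE_URL = "https://example.org/bio-kgvec2go"  # placeholder, not the real host

def get_embedding(ontology: str, term_iri: str, version: str = "latest"):
    """Fetch a pre-trained embedding vector for one ontology term.

    The route, query parameters, and JSON layout are assumptions made
    purely for illustration of the API-serving idea.
    """
    resp = requests.get(
        f"{BASE_URL}/embedding",
        params={"ontology": ontology, "iri": term_iri, "version": version},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["vector"]  # assumed response field

# vec = get_embedding("GO", "http://purl.obolibrary.org/obo/GO_0008150")
```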

Result: Bio-KGvec2go provides up-to-date embeddings for biomedical ontologies with minimal computational effort required from users.

Conclusion: The system facilitates efficient and timely biomedical research by offering readily available, current knowledge graph embeddings for biomedical ontologies.

Abstract: Knowledge graphs and ontologies represent entities and their relationships in a structured way, having gained significance in the development of modern AI applications. Integrating these semantic resources with machine learning models often relies on knowledge graph embedding models to transform graph data into numerical representations. Therefore, pre-trained models for popular knowledge graphs and ontologies are increasingly valuable, as they spare the need to retrain models for different tasks using the same data, thereby helping to democratize AI development and enabling sustainable computing. In this paper, we present Bio-KGvec2go, an extension of the KGvec2go Web API, designed to generate and serve knowledge graph embeddings for widely used biomedical ontologies. Given the dynamic nature of these ontologies, Bio-KGvec2go also supports regular updates aligned with ontology version releases. By offering up-to-date embeddings with minimal computational effort required from users, Bio-KGvec2go facilitates efficient and timely biomedical research.

[356] One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning

Yuan Pu, Yazhe Niu, Jia Tang, Junyu Xiong, Shuai Hu, Hongsheng Li

Main category: cs.LG

TL;DR: ScaleZero extends UniZero with Mixture-of-Experts architecture and dynamic parameter scaling to handle heterogeneous multi-task learning efficiently, achieving performance comparable to specialized single-task models with fewer environment interactions.

DetailsMotivation: Conventional multi-task world models like UniZero struggle with gradient conflicts and loss of model plasticity when handling large-scale heterogeneous environments with diverse observation/action spaces and varying task difficulties.

Method: Proposes ScaleZero with MoE architecture to mitigate gradient conflicts, plus online LoRA-based dynamic parameter scaling strategy that progressively integrates adapters based on task-specific progress for adaptive knowledge retention and parameter expansion.
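
The LoRA building block behind dynamic parameter scaling can be sketched as a frozen base layer plus a trainable low-rank update; the attachment trigger and the MoE routing of ScaleZero are not reproduced here:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # retain existing knowledge
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# An adapter like this could be attached when a task's progress stalls.
layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```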

Result: Achieves performance on par with specialized single-task baselines on Atari, DMControl, and Jericho benchmarks using only online RL with one model. With dynamic parameter scaling, requires only 80% of single-task environment interaction steps while maintaining competitive performance.

Conclusion: ScaleZero demonstrates effective large-scale multi-task learning potential by addressing gradient conflicts and computational efficiency through MoE architecture and dynamic parameter scaling strategy.

Abstract: In heterogeneous multi-task learning, tasks not only exhibit diverse observation and action spaces but also vary substantially in intrinsic difficulty. While conventional multi-task world models like UniZero excel in single-task settings, we find that when handling large-scale heterogeneous environments, gradient conflicts and the loss of model plasticity often constrain their sample and computational efficiency. In this work, we address these challenges from two perspectives: the single learning iteration and the overall learning process. First, we investigate the impact of key design spaces on extending UniZero to multi-task planning. We find that a Mixture-of-Experts (MoE) architecture provides the most substantial performance gains by mitigating gradient conflicts, leading to our proposed model, ScaleZero. Second, to dynamically balance the computational load across the learning process, we introduce an online, LoRA-based dynamic parameter scaling (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task-specific progress, enabling adaptive knowledge retention and parameter expansion. Empirical evaluations on standard benchmarks such as Atari, DMControl (DMC), and Jericho demonstrate that ScaleZero, relying exclusively on online reinforcement learning with one model, attains performance on par with specialized single-task baselines. Furthermore, when augmented with our dynamic parameter scaling strategy, our method achieves competitive performance while requiring only 80% of the single-task environment interaction steps. These findings underscore the potential of ScaleZero for effective large-scale multi-task learning. Our code is available at https://github.com/opendilab/LightZero.

[357] Bringing Multi-Modal Multi-Task Federated Foundation Models to Education Domain: Prospects and Challenges

Kasra Borazjani, Naji Khosravan, Rajeev Sahay, Bita Akram, Seyyedali Hosseinalipour

Main category: cs.LG

TL;DR: M3T Federated Foundation Models (FedFMs) combine multi-modal multi-task foundation models with federated learning to enable privacy-preserving collaborative AI training across educational institutions while handling diverse data modalities and tasks.

DetailsMotivation: Current multi-modal multi-task foundation models face deployment challenges in education due to privacy regulations, data silos, and limited domain-specific data availability, requiring a privacy-preserving collaborative approach.

Method: Integration of federated learning with multi-modal multi-task foundation models to enable collaborative training across decentralized educational institutions while keeping sensitive data local and accommodating diverse modalities and tasks.

Result: The paper proposes M3T FedFMs as a promising approach that can advance privacy preservation, personalization, and equity/inclusivity in next-generation intelligent education systems.

Conclusion: M3T FedFMs represent an underexplored but promising paradigm for education, though several research challenges need addressing including heterogeneous privacy regulations, data modality characteristics, unlearning approaches, continual learning frameworks, and model interpretability.

Abstract: Multi-modal multi-task (M3T) foundation models (FMs) have recently shown transformative potential in artificial intelligence, with emerging applications in education. However, their deployment in real-world educational settings is hindered by privacy regulations, data silos, and limited domain-specific data availability. We introduce M3T Federated Foundation Models (FedFMs) for education: a paradigm that integrates federated learning (FL) with M3T FMs to enable collaborative, privacy-preserving training across decentralized institutions while accommodating diverse modalities and tasks. Subsequently, this position paper aims to unveil M3T FedFMs as a promising yet underexplored approach to the education community, explore its potentials, and reveal its related future research directions. We outline how M3T FedFMs can advance three critical pillars of next-generation intelligent education systems: (i) privacy preservation, by keeping sensitive multi-modal student and institutional data local; (ii) personalization, through modular architectures enabling tailored models for students, instructors, and institutions; and (iii) equity and inclusivity, by facilitating participation from underrepresented and resource-constrained entities. We finally identify various open research challenges, including studying of (i) inter-institution heterogeneous privacy regulations, (ii) the non-uniformity of data modalities’ characteristics, (iii) the unlearning approaches for M3T FedFMs, (iv) the continual learning frameworks for M3T FedFMs, and (v) M3T FedFM model interpretability, which must be collectively addressed for practical deployment.

[358] ACE and Diverse Generalization via Selective Disagreement

Oliver Daniels, Stuart Armstrong, Alexandre Maranhão, Mahirah Fairuz Rahman, Benjamin M. Marlin, Rebecca Gorman

Main category: cs.LG

TL;DR: ACE method addresses complete spurious correlations in neural networks using self-training with confident selective disagreement, outperforming existing methods on benchmarks while being robust to incomplete correlations.

DetailsMotivation: Deep neural networks are sensitive to spurious correlations that fail out-of-distribution, especially when correlations are complete and correct generalization is fundamentally underspecified.

Method: Proposes learning concepts consistent with training data but making distinct predictions on novel unlabeled inputs using self-training that encourages confident and selective disagreement.
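
One illustrative surrogate for a confident-and-selective disagreement objective on unlabeled inputs (the exact ACE loss is the paper's own):

```python
import torch
import torch.nn.functional as F

def disagreement_loss(logits_a, logits_b, ent_weight=0.1):
    """Encourage two hypotheses to make confident, differing predictions
    on unlabeled inputs: an illustrative surrogate, not the ACE objective.
    """
    p_a = F.softmax(logits_a, dim=-1)
    p_b = F.softmax(logits_b, dim=-1)
    agreement = (p_a * p_b).sum(dim=-1)  # high when the heads agree
    entropy = (-(p_a * p_a.clamp_min(1e-8).log()).sum(dim=-1)
               - (p_b * p_b.clamp_min(1e-8).log()).sum(dim=-1))
    # Minimize agreement (disagree) and entropy (stay confident).
    return (agreement + ent_weight * entropy).mean()

la = torch.randn(32, 10, requires_grad=True)
lb = torch.randn(32, 10)
disagreement_loss(la, lb).backward()
```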

Result: ACE matches or outperforms existing methods on complete-spurious correlation benchmarks, remains robust to incomplete correlations, and achieves competitive performance on language-model alignment without access to untrusted measurements.

Conclusion: ACE represents significant progress towards overcoming underspecification in neural networks, though still subject to important limitations.

Abstract: Deep neural networks are notoriously sensitive to spurious correlations - where a model learns a shortcut that fails out-of-distribution. Existing work on spurious correlations has often focused on incomplete correlations, leveraging access to labeled instances that break the correlation. But in cases where the spurious correlations are complete, the correct generalization is fundamentally underspecified. To resolve this underspecification, we propose learning a set of concepts that are consistent with training data but make distinct predictions on a subset of novel unlabeled inputs. Using a self-training approach that encourages confident and selective disagreement, our method ACE matches or outperforms existing methods on a suite of complete-spurious correlation benchmarks, while remaining robust to incomplete spurious correlations. ACE is also more configurable than prior approaches, allowing for straightforward encoding of prior knowledge and principled unsupervised model selection. In an early application to language-model alignment, we find that ACE achieves competitive performance on the measurement tampering detection benchmark without access to untrusted measurements. While still subject to important limitations, ACE represents significant progress towards overcoming underspecification.

[359] Customizing the Inductive Biases of Softmax Attention using Structured Matrices

Yilun Kuang, Noah Amsel, Sanae Lotfi, Shikai Qiu, Andres Potapczynski, Andrew Gordon Wilson

Main category: cs.LG

TL;DR: Proposes new attention scoring functions using Block Tensor-Train and Multi-Level Low Rank matrices to address information loss and lack of distance-dependent bias in standard attention.

DetailsMotivation: Standard attention suffers from information loss due to low-dimensional projections and lacks distance-dependent compute bias for neighboring tokens, which limits performance on certain tasks.

Method: Develops new scoring functions based on computationally efficient structured matrices (BTT and MLR) with high ranks that can encode full-rank or distance-dependent compute biases.
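
A toy illustration of distance-dependent, low-rank scoring: nearby token pairs get higher-rank interactions than distant ones. This only gestures at the multi-level low-rank (MLR) idea and is not the paper's construction:

```python
import torch

def mlr_style_scores(x, ranks=(16, 8, 4), block=8):
    """Toy distance-dependent scoring: pairs within a block use rank-16
    projections, mid-range pairs rank-8, and distant pairs rank-4.
    Illustrative only; the paper's MLR scoring function differs.
    """
    T, d = x.shape
    idx = torch.arange(T)
    dist = (idx[:, None] - idx[None, :]).abs()
    # Level 0: |i-j| < block; level 1: < 4*block; level 2: everything else.
    level = torch.bucketize(dist, torch.tensor([block, 4 * block]))
    scores = torch.zeros(T, T)
    for lvl, r in enumerate(ranks):
        Wq = torch.randn(d, r) / d ** 0.5   # per-level projections
        Wk = torch.randn(d, r) / d ** 0.5
        s = (x @ Wq) @ (x @ Wk).T           # rank-r score matrix
        scores = torch.where(level == lvl, s, scores)
    return scores

x = torch.randn(64, 32)
print(mlr_style_scores(x).shape)  # torch.Size([64, 64])
```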

Result: Outperforms standard attention on high-dimensional in-context regression tasks for fixed compute budgets, achieves improved scaling laws in language modeling, and shows promise for long-range time-series forecasting.

Conclusion: The proposed structured matrix-based attention methods effectively address key limitations of standard attention and demonstrate superior performance across multiple tasks while maintaining computational efficiency.

Abstract: The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a distance-dependent compute bias for neighboring tokens in the sequence. In this work, we address these shortcomings by proposing new scoring functions based on computationally efficient structured matrices with high ranks, including Block Tensor-Train (BTT) and Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality patterns, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or distance-dependent compute biases, thereby addressing significant shortcomings of standard attention. Finally, we show that MLR attention has promising results for long-range time-series forecasting.

[360] Theoretical Analysis on how Learning Rate Warmup Accelerates Convergence

Yuxing Liu, Yuze Ge, Rui Pan, An Kang, Tong Zhang

Main category: cs.LG

TL;DR: Learning rate warmup accelerates gradient descent convergence under generalized smoothness assumptions, providing theoretical justification for this practical technique.

DetailsMotivation: Despite widespread practical success of learning rate warmup in training large neural networks, there has been a gap in theoretical understanding of why this strategy works so well.

Method: Proposed a novel family of generalized smoothness assumptions and analyzed convergence properties of gradient descent (both deterministic and stochastic) under these assumptions with warmup schedules.
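
For reference, the schedule under analysis simply ramps the step size up before holding it; a minimal linear warmup in PyTorch:

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1.0)  # peak learning rate
warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    # Linear ramp for the first warmup_steps updates, constant afterwards.
    return min(1.0, (step + 1) / warmup_steps)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
for step in range(total_steps):
    opt.step()      # (gradient computation omitted in this sketch)
    sched.step()
```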

Result: Showed that learning rate warmup consistently accelerates GD convergence, with up to Θ(T) times faster convergence compared to non-increasing schedules in specific cases.

Conclusion: The study provides theoretical insights into the optimization benefits of learning rate warmup, bridging the gap between practice and theory for this widely used technique.

Abstract: Learning rate warmup is a popular and practical technique in training large-scale deep neural networks. Despite the huge success in practice, the theoretical advantages of this strategy of gradually increasing the learning rate at the beginning of the training process have not been fully understood. To resolve this gap between theory and practice, we first propose a novel family of generalized smoothness assumptions, and validate its applicability both theoretically and empirically. Under the novel smoothness assumption, we study the convergence properties of gradient descent (GD) in both deterministic and stochastic settings. It is shown that learning rate warmup consistently accelerates GD, and GD with warmup can converge at most $\Theta(T)$ times faster than with a non-increasing learning rate schedule in some specific cases, providing insights into the benefits of this strategy from an optimization theory perspective.

[361] Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning

Manav Vora, Jonas Liang, Melkior Ornik

Main category: cs.LG

TL;DR: Proposed a scalable two-step approach for solving budget-constrained multi-component monotonic POMDPs using random forest approximation and oracle-guided PPO to handle exponential state space growth.

DetailsMotivation: Current methods cannot solve large-scale multi-component monotonic POMDPs due to exponential state space growth with increasing components, making computational tractability a major challenge.

Method: Two-step approach: 1) Approximate optimal budget allocation using random forest models of component POMDP value functions, 2) Use oracle-guided meta-trained PPO algorithm with MDP-based oracle policies to solve individual component POMDPs.
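
Step 1 can be sketched with synthetic stand-ins: fit a random forest to (budget, value) samples per component, then search the split of a shared budget that maximizes the summed predicted value (real value samples would come from solving the single-component POMDPs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
models = []
for concavity in (0.5, 0.9):  # two components with different returns
    b = rng.uniform(0, 10, size=200).reshape(-1, 1)
    v = b.ravel() ** concavity + rng.normal(0, 0.05, size=200)  # synthetic values
    models.append(RandomForestRegressor(n_estimators=50, random_state=0).fit(b, v))

# Exhaustively evaluate splits of the shared budget between the components.
total_budget = 10.0
splits = np.linspace(0, total_budget, 101)
values = [v1 + v2 for v1, v2 in zip(
    models[0].predict(splits.reshape(-1, 1)),
    models[1].predict((total_budget - splits).reshape(-1, 1)))]
best = splits[int(np.argmax(values))]
print(f"component 1 gets {best:.1f}, component 2 gets {total_budget - best:.1f}")
```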

Result: The method provides scalability for solving massive multi-component monotonic POMDPs, demonstrated through a real-world building maintenance scenario with computational complexity analysis showing effectiveness.

Conclusion: The proposed approach successfully addresses the computational intractability of large-scale budget-constrained multi-component monotonic POMDPs and enables practical application to real-world sequential repair problems.

Abstract: Monotonic Partially Observable Markov Decision Processes (POMDPs), where the system state progressively decreases until a restorative action is performed, can be used to model sequential repair problems effectively. This paper considers the problem of solving budget-constrained multi-component monotonic POMDPs, where a finite budget limits the maximal number of restorative actions. For a large number of components, solving such a POMDP using current methods is computationally intractable due to the exponential growth in the state space with an increasing number of components. To address this challenge, we propose a two-step approach. Since the individual components of a budget-constrained multi-component monotonic POMDP are only connected via the shared budget, we first approximate the optimal budget allocation among these components using an approximation of each component POMDP’s optimal value function which is obtained through a random forest model. Subsequently, we introduce an oracle-guided meta-trained Proximal Policy Optimization (PPO) algorithm to solve each of the independent budget-constrained single-component monotonic POMDPs. The oracle policy is obtained by performing value iteration on the corresponding monotonic Markov Decision Process (MDP). This two-step method provides scalability in solving truly massive multi-component monotonic POMDPs. To demonstrate the efficacy of our approach, we consider a real-world maintenance scenario that involves inspection and repair of an administrative building by a team of agents within a maintenance budget. Finally, we perform a computational complexity analysis for a varying number of components to show the scalability of the proposed approach.

[362] Active Learning of Piecewise Gaussian Process Surrogates

Chiwoo Park, Robert Waelder, Bonggwon Kang, Benji Maruyama, Soondo Hong, Robert Gramacy

Main category: cs.LG

TL;DR: Active learning method for piecewise Jump Gaussian Process surrogates that accounts for model bias rather than just uncertainty, with applications in materials design and smart factory systems.

DetailsMotivation: Existing active learning methods for Gaussian processes focus on model uncertainty but fail to account for model bias, which is particularly important for Jump GPs that exhibit discontinuities across design space regions.

Method: Developed active learning heuristics for Jump GPs that estimate both bias and variance, appropriated from strategies originally designed for ordinary GPs but enhanced to handle discontinuities.
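
The acquisition idea, scoring candidates by estimated squared bias plus variance rather than variance alone, sketched with a crude local-versus-global discrepancy standing in for the paper's Jump GP bias estimator:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.where(X.ravel() < 0, 0.0, 2.0) + rng.normal(0, 0.05, 30)  # a jump at 0

gp = GaussianProcessRegressor().fit(X, y)
cand = np.linspace(-1, 1, 200).reshape(-1, 1)
mu, sd = gp.predict(cand, return_std=True)

# Crude bias proxy: discrepancy between the global fit and a fit on one
# side of the jump (a stand-in for the paper's Jump GP bias estimator).
near = X.ravel() > 0
mu_local = GaussianProcessRegressor().fit(X[near], y[near]).predict(cand)
bias = np.where(cand.ravel() > 0, mu_local - mu, 0.0)

acq = bias ** 2 + sd ** 2   # bias-aware acquisition, not variance alone
print("next design point:", cand[np.argmax(acq)].item())
```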

Result: The proposed method demonstrates advantages over traditional approaches on synthetic benchmarks and real-simulation experiments of varying complexity.

Conclusion: Accounting for model bias, in addition to uncertainty, is essential for effective active learning of Jump GP surrogates, providing improved performance in applications with discontinuous behavior.

Abstract: Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.

[363] FilterFL: Knowledge Filtering-based Data-Free Backdoor Defense for Federated Learning

Yanxin Yang, Ming Hu, Xiaofei Xie, Yue Cao, Pengyu Zhang, Yihao Huang, Mingsong Chen

Main category: cs.LG

TL;DR: A novel data-free defense method against backdoor attacks in federated learning that generates trigger images by analyzing model differences and filters them based on classification impact, achieving strong protection even with 80% malicious clients.

DetailsMotivation: Federated learning is vulnerable to poisoning and backdoor attacks from untrusted clients who can inject malicious triggers into models without sharing raw data, compromising model integrity.

Method: Proposes a data-free trigger-generation approach that identifies differences between old and new global models to generate images with newly learned knowledge, then filters trigger images by evaluating their classification impact to eliminate poisoned models.

Result: Comprehensive experiments show the approach defends against almost all existing backdoor attack types, outperforms seven state-of-the-art defense methods in both IID and non-IID scenarios, and successfully defends even when 80% of clients are malicious.

Conclusion: The proposed data-free trigger-generation defense effectively protects federated learning models from backdoor attacks by leveraging attack characteristics and model difference analysis, providing robust security against high rates of malicious participation.

Abstract: As a distributed machine learning paradigm, Federated Learning (FL) enables large-scale clients to collaboratively train a model without sharing their raw data. However, due to the lack of data auditing for untrusted clients, FL is vulnerable to poisoning attacks, especially backdoor attacks. By using poisoned data for local training or directly changing the model parameters, attackers can easily inject backdoors into the model, which can trigger the model to make misclassification of targeted patterns in images. To address these issues, we propose a novel data-free trigger-generation-based defense approach based on the two characteristics of backdoor attacks: i) triggers are learned faster than normal knowledge, and ii) trigger patterns have a greater effect on image classification than normal class patterns. Our approach generates the images with newly learned knowledge by identifying the differences between the old and new global models, and filters trigger images by evaluating the effect of these generated images. By using these trigger images, our approach eliminates poisoned models to ensure the updated global model is benign. Comprehensive experiments demonstrate that our approach can defend against almost all the existing types of backdoor attacks and outperform all the seven state-of-the-art defense methods with both IID and non-IID scenarios. Especially, our approach can successfully defend against the backdoor attack even when 80% of the clients are malicious.

[364] Efficient Methods for Non-stationary Online Learning

Peng Zhao, Yan-Feng Xie, Lijun Zhang, Zhi-Hua Zhou

Main category: cs.LG

TL;DR: Efficient online learning methods that reduce computational complexity from O(log T) to 1 projection per round for non-stationary environments, maintaining optimal performance while significantly improving efficiency.

DetailsMotivation: Existing two-layer online ensemble methods for non-stationary environments require maintaining O(log T) base-learners and performing multiple projections per round, creating computational bottlenecks especially with complex domains.

Method: Proposed reduction mechanism from parameter-free online learning, modified for non-stationary settings. Algorithms require only one gradient query and one function evaluation per round, reducing projections from O(log T) to 1 for dynamic/adaptive regret, and from O(log² T) to 1 for interval dynamic regret.

Result: Developed efficient methods that maintain optimal performance while drastically reducing computational overhead. Applied successfully to online stochastic control and online principal component analysis, achieving both efficiency and optimality.

Conclusion: The proposed techniques demonstrate broad generality and practical applicability, providing computationally efficient solutions for non-stationary online learning problems while preserving theoretical guarantees, as verified by empirical studies.

Abstract: Non-stationary online learning has drawn much attention in recent years. In particular, dynamic regret and adaptive regret are proposed as two principled performance measures for online convex optimization in non-stationary environments. To optimize them, a two-layer online ensemble is usually deployed due to the inherent uncertainty of non-stationarity, in which multiple base-learners are maintained and a meta-algorithm is employed to track the best one on the fly. However, the two-layer structure raises concerns about computational complexity – such methods typically maintain $O(\log T)$ base-learners simultaneously for a $T$-round online game and thus perform multiple projections onto the feasible domain per round, which becomes the computational bottleneck when the domain is complicated. In this paper, we present efficient methods for optimizing dynamic regret and adaptive regret that reduce the number of projections per round from $O(\log T)$ to $1$. The proposed algorithms require only one gradient query and one function evaluation at each round. Our technique hinges on the reduction mechanism developed in parameter-free online learning and requires non-trivial modifications for non-stationary online methods. Furthermore, we study an even stronger measure, namely “interval dynamic regret”, and reduce the number of projections per round from $O(\log^2 T)$ to $1$ for minimizing it. Our reduction demonstrates broad generality and applies to two important applications: online stochastic control and online principal component analysis, resulting in methods that are both efficient and optimal. Finally, empirical studies verify our theoretical findings.

[365] CoMMIT: Coordinated Multimodal Instruction Tuning

Xintong Li, Junda Wu, Tong Yu, Yu Wang, Xiang Chen, Jiuxiang Gu, Lina Yao, Julian McAuley, Jingbo Shang

Main category: cs.LG

TL;DR: Proposes a Multimodal Balance Coefficient and dynamic learning scheduler to address unbalanced learning between LLMs and feature encoders in MLLM instruction tuning, improving convergence and performance.

DetailsMotivation: Address the challenge of unbalanced learning between backbone LLMs and feature encoders during multimodal instruction tuning, which causes oscillation and biased learning leading to sub-optimal convergence.

Method: Introduces a Multimodal Balance Coefficient for quantitative measurement of learning balance, a dynamic learning scheduler to coordinate LLM and encoder learning, and an auxiliary gradient regularization for larger step sizes.
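
A sketch of the scheduling mechanics, using a gradient-norm ratio as an illustrative stand-in for the paper's Multimodal Balance Coefficient:

```python
import torch

def balance_coefficient(llm_params, enc_params):
    """Gradient-norm ratio between encoder and LLM: a stand-in for the
    paper's Multimodal Balance Coefficient, shown only for the mechanics."""
    g_llm = torch.sqrt(sum((p.grad ** 2).sum() for p in llm_params if p.grad is not None))
    g_enc = torch.sqrt(sum((p.grad ** 2).sum() for p in enc_params if p.grad is not None))
    return g_enc / (g_llm + 1e-8)

def rescale_encoder_lr(opt_enc, coeff, target=1.0):
    # If the encoder learns much faster/slower than the LLM, nudge its
    # learning rate toward balance (the paper uses a dynamic scheduler).
    factor = (target / coeff.clamp(0.1, 10.0)).item()
    for group in opt_enc.param_groups:
        group["lr"] *= factor

# Toy usage with two linear blocks standing in for the LLM and encoder.
llm, enc = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)
loss = llm(enc(torch.randn(4, 8))).pow(2).mean()
loss.backward()
opt_enc = torch.optim.AdamW(enc.parameters(), lr=1e-4)
rescale_encoder_lr(opt_enc, balance_coefficient(llm.parameters(), enc.parameters()))
```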

Result: Experiments on multiple downstream tasks with various MLLMs demonstrate the proposed method is more effective than baselines in MLLM instruction tuning.

Conclusion: The proposed approach effectively addresses oscillation and biased learning issues in MLLM instruction tuning, is architecture-agnostic, and improves training sufficiency and convergence across various multimodal tasks.

Abstract: Instruction tuning in multimodal large language models (MLLMs) generally involves cooperative learning between a backbone LLM and a feature encoder of non-text input modalities. The major challenge is how to efficiently find the synergy between the two modules so that LLMs can adapt their reasoning abilities to downstream tasks while feature encoders can adjust to provide more task-specific information about its modality. In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives, where we find the unbalanced learning between the feature encoder and the LLM can cause problems of oscillation and biased learning that lead to sub-optimal convergence. Inspired by our findings, we propose a Multimodal Balance Coefficient that enables quantitative measurement of the balance of learning. Based on this, we further design a dynamic learning scheduler that better coordinates the learning between the LLM and feature encoder, alleviating the problems of oscillation and biased learning. In addition, we introduce an auxiliary regularization on the gradient to promote updating with larger step sizes, which potentially allows for a more accurate estimation of the proposed MultiModal Balance Coefficient and further improves the training sufficiency. Our proposed approach is agnostic to the architecture of LLM and feature encoder, so it can be generically integrated with various MLLMs. We conduct experiments on multiple downstream tasks with various MLLMs, demonstrating that the proposed method is more effective than the baselines in MLLM instruction tuning.

[366] On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift

Pratiksha Thaker, Amrith Setlur, Zhiwei Steven Wu, Virginia Smith

Main category: cs.LG

TL;DR: Public pretraining improves differentially private model training even with large distribution shift between public and private data, boosting accuracy by up to 67% over private-only training.

DetailsMotivation: Previous research on public pretraining for private model training focused mainly on in-distribution tasks, but real-world scenarios often involve distribution shift due to the sensitive nature of private data.

Method: Empirical evaluation across three tasks with large distribution shift, comparing zero-shot performance, private-only training, and public feature-enhanced private training. Theoretical analysis showing improved sample complexity when public and private data share low-dimensional representations.

Result: Public features improved private training accuracy by up to 67% over private training from scratch, even when zero-shot performance was unusable. Theoretical framework confirms public representations can enhance private training efficiency.

Conclusion: Public data can make private training practical even in realistic settings with extreme distribution shift, as public representations improve performance when data shares underlying low-dimensional structure.

Abstract: Public pretraining is a promising approach to improve differentially private model training. However, recent work has noted that many positive research results studying this paradigm only consider in-distribution tasks, and may not apply to settings where there is distribution shift between the pretraining and finetuning data – a scenario that is likely when finetuning private tasks due to the sensitive nature of the data. In this work, we show empirically across three tasks that even in settings with large distribution shift, where both zero-shot performance from public data and training from scratch with private data give unusably weak results, public features can in fact improve private training accuracy by up to 67% over private training from scratch. We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training even if it is impossible to learn the private task from the public data alone. Altogether, our results provide evidence that public data can indeed make private training practical in realistic settings of extreme distribution shift.

[367] BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery

Peter St. John, Dejun Lin, Polina Binder, Malcolm Greaves, Vega Shah, John St. John, Adrian Lange, Patrick Hsu, Rajesh Illango, Arvind Ramanathan, Anima Anandkumar, David H Brookes, Akosua Busia, Abhishaike Mahajan, Stephen Malina, Neha Prasad, Sam Sinai, Lindsay Edwards, Thomas Gaudelet, Cristian Regep, Martin Steinegger, Burkhard Rost, Alexander Brace, Kyle Hippe, Luca Naef, Keisuke Kamata, George Armstrong, Kevin Boyd, Zhonglin Cao, Han-Yi Chou, Simon Chu, Allan dos Santos Costa, Sajad Darabi, Eric Dawson, Kieran Didi, Cong Fu, Mario Geiger, Michelle Gill, Darren J Hsu, Gagan Kaushik, Maria Korshunova, Steven Kothen-Hill, Youhan Lee, Meng Liu, Micha Livne, Zachary McClure, Jonathan Mitchell, Alireza Moradzadeh, Ohad Mosafi, Youssef Nashed, Saee Paliwal, Yuxing Peng, Sara Rabhi, Farhad Ramezanghorbani, Danny Reidenbach, Camir Ricketts, Brian C Roland, Kushal Shah, Tyler Shimko, Hassan Sirelkhatim, Savitha Srinivasan, Abraham C Stern, Dorota Toczydlowska, Srimukh Prasad Veccham, Niccolò Alberto Elia Venanzi, Anton Vorontsov, Jared Wilber, Isabel Wilkinson, Wei Jing Wong, Eva Xue, Cory Ye, Xin Yu, Yang Zhang, Guoqing Zhou, Becca Zandstein, Alejandro Chacon, Prashant Sohani, Maximilian Stadler, Christian Hundt, Feiwen Zhu, Christian Dallago, Bruno Trentini, Emine Kucukbenli, Saee Paliwal, Timur Rvachov, Eddie Calleja, Johnny Israeli, Harry Clifford, Risto Haukioja, Nicholas Haemel, Kyle Tretina, Neha Tadimeti, Anthony B Costa

Main category: cs.LG

TL;DR: BioNeMo Framework enables efficient training of large-scale biology and chemistry AI models across hundreds of GPUs, achieving 3B parameter protein language model training on 1T+ tokens in 4.2 days.

DetailsMotivation: AI models for drug development require massive computational scale, with recent protein language models training on hundreds of GPUs, creating a need for accessible frameworks.

Method: Modular framework design with components like data loaders that can integrate into existing workflows, supporting community contributions and open-source development.

Result: Successfully trained a 3 billion parameter BERT-based protein language model on over 1 trillion tokens using 256 NVIDIA A100 GPUs in just 4.2 days.

Conclusion: BioNeMo provides an open-source, free framework that facilitates high-throughput computational biology and chemistry AI model training at scale.

Abstract: Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.

[368] Hybrid-Regularized Magnitude Pruning for Robust Federated Learning under Covariate Shift

Ozgu Goksu, Nicolas Pugeault

Main category: cs.LG

TL;DR: Proposes a novel federated learning framework using pruning and regularization to address data heterogeneity issues, with a new CelebA-Gender benchmark dataset for evaluating within-class distribution shifts.

DetailsMotivation: Data heterogeneity in federated learning degrades global model generalization, particularly critical for specialized applications like medical imaging with limited data and clients.

Method: Combination of pruning and regularization of clients’ training to improve sparsity, redundancy, and robustness of neural connections for better resilience to model aggregation.
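
A minimal sketch of the client-side recipe, magnitude pruning plus an L2 penalty before aggregation; the sparsity level and coefficient are illustrative, not the paper's tuned hybrid regularizer:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Magnitude-prune each linear layer on the client before training.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # 50% sparsity

def regularized_loss(output, target, lam=1e-4):
    # Task loss plus an L2 penalty on the remaining weights.
    ce = nn.functional.cross_entropy(output, target)
    l2 = sum(p.pow(2).sum() for p in model.parameters() if p.requires_grad)
    return ce + lam * l2

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
print(regularized_loss(model(x), y).item())
```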

Result: Comprehensive experiments on CIFAR-10, MNIST, and CelebA-Gender show consistent outperformance over standard FL baselines, yielding more robust and generalizable models.

Conclusion: The proposed framework effectively addresses data heterogeneity challenges in federated learning, particularly for within-class distribution shifts, demonstrating superior performance across multiple benchmark datasets.

Abstract: Federated Learning offers a solution for decentralised model training, addressing the difficulties associated with distributed data and privacy in machine learning. However, data heterogeneity in federated learning frequently hinders the global model’s generalisation, leading to low performance and adaptability to unseen data. This problem is particularly critical for specialised applications such as medical imaging, where both the data and the number of clients are limited. In this paper, we empirically demonstrate that inconsistencies in client-side training distributions substantially degrade the performance of federated learning models across multiple benchmark datasets. We propose a novel FL framework using a combination of pruning and regularisation of clients’ training to improve the sparsity, redundancy, and robustness of neural connections, and thereby the resilience to model aggregation. To address a relatively unexplored dimension of data heterogeneity, we further introduce a novel benchmark dataset, CelebA-Gender, specifically designed to control for within-class distributional shifts across clients based on attribute variations, thereby complementing the predominant focus on inter-class imbalance in prior federated learning research. Comprehensive experiments on CIFAR-10, MNIST, and the newly introduced CelebA-Gender dataset demonstrate that our method consistently outperforms standard FL baselines, yielding more robust and generalizable models in heterogeneous settings.

[369] When Do Neural Networks Learn World Models?

Tianren Zhang, Guanyu Chen, Feng Chen

Main category: cs.LG

TL;DR: Neural networks with low-degree bias can theoretically recover latent data-generating variables in multi-task settings, even with complex non-linear proxy tasks, but recovery depends on model architecture.

DetailsMotivation: To understand whether neural networks can learn world models that capture the underlying data generation process like humans do, and provide the first theoretical results for this problem.

Method: Leverages Boolean models of task solutions via Fourier-Walsh transform and introduces new techniques for analyzing invertible Boolean transforms in a multi-task setting.
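
The Fourier-Walsh transform underlying the analysis expands a Boolean function over parity characters; a small numpy example (the size of each coefficient's index set is the degree that a low-degree bias favors):

```python
import numpy as np
from itertools import product

def fourier_walsh(f_vals, n):
    """Fourier-Walsh coefficients of f: {-1,1}^n -> R.

    f_vals[i] is f on the i-th point of the {-1,1}^n cube (lexicographic);
    the coefficient for index set S is E_x[f(x) * prod_{i in S} x_i].
    """
    points = np.array(list(product([-1, 1], repeat=n)))
    coeffs = {}
    for S in product([0, 1], repeat=n):       # subsets as indicator tuples
        mask = np.array(S, dtype=bool)
        chi = np.prod(points[:, mask], axis=1)  # parity character chi_S(x)
        coeffs[S] = float(np.mean(f_vals * chi))
    return coeffs

# Example: x1 AND x2 in the +/-1 encoding; its expansion is
# -1/2 + x1/2 + x2/2 + x1*x2/2, so all weight sits at degree <= 2.
pts = np.array(list(product([-1, 1], repeat=2)))
f = np.where((pts[:, 0] == 1) & (pts[:, 1] == 1), 1, -1)
print(fourier_walsh(f, 2))
```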

Result: Models with low-degree bias provably recover latent data-generating variables under mild assumptions, even when proxy tasks involve complex non-linear functions of the latents.

Conclusion: Recovery of latent variables is possible but sensitive to model architecture, with implications for self-supervised learning, out-of-distribution generalization, and linear representation hypothesis in LLMs.

Abstract: Humans develop world models that capture the underlying generation process of data. Whether neural networks can learn similar world models remains an open problem. In this work, we present the first theoretical results for this problem, showing that in a multi-task setting, models with a low-degree bias provably recover latent data-generating variables under mild assumptions–even if proxy tasks involve complex, non-linear functions of the latents. However, such recovery is sensitive to model architecture. Our analysis leverages Boolean models of task solutions via the Fourier-Walsh transform and introduces new techniques for analyzing invertible Boolean transforms, which may be of independent interest. We illustrate the algorithmic implications of our results and connect them to related research areas, including self-supervised learning, out-of-distribution generalization, and the linear representation hypothesis in large language models.

[370] Tripartite-GraphRAG via Plugin Ontologies

Michael Banf, Johannes Kuhn

Main category: cs.LG

TL;DR: A novel approach combining LLMs with tripartite knowledge graphs to address hallucination and provenance issues in knowledge-intensive domains like healthcare, enabling optimized prompt creation and reduced output lengths.

DetailsMotivation: LLMs struggle with factual accuracy, hallucination, lack of source traceability, and timely knowledge updates in knowledge-intensive domains like industrial automation and healthcare.

Method: Combines LLMs with tripartite knowledge graphs constructed by connecting domain-specific objects via curated ontologies to text sections through concept-anchored pre-analysis. Formulates prompt creation as unsupervised node classification to optimize information density, coverage, and arrangement.
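
The tripartite layout and prompt assembly as node selection can be sketched with networkx; all node names below are hypothetical placeholders:

```python
import networkx as nx

# Three node layers: ontology concepts, domain objects, and text chunks,
# with curated edges between layers.
G = nx.Graph()
G.add_nodes_from(["Hypertension", "BetaBlocker"], layer="concept")
G.add_nodes_from(["patient_record_17"], layer="object")
G.add_nodes_from(["guideline_chunk_03", "anamnesis_chunk_11"], layer="text")

G.add_edge("patient_record_17", "Hypertension")    # object -> concept
G.add_edge("Hypertension", "guideline_chunk_03")   # concept -> text
G.add_edge("BetaBlocker", "guideline_chunk_03")
G.add_edge("patient_record_17", "anamnesis_chunk_11")

# Prompt assembly as (unsupervised) node selection: rank text chunks by
# how many query-relevant concepts they touch, then keep only the top ones.
query_concepts = {"Hypertension", "BetaBlocker"}
chunks = [n for n, d in G.nodes(data=True) if d["layer"] == "text"]
ranked = sorted(chunks, key=lambda c: -len(set(G[c]) & query_concepts))
print(ranked)  # guideline_chunk_03 first: it covers both concepts
```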

Result: Initial evaluation on healthcare use cases shows potential to optimize information density, coverage, and prompt arrangement while significantly reducing prompt lengths, leading to reduced costs and more consistent outputs.

Conclusion: The proposed approach addresses key LLM limitations in knowledge-intensive tasks and shows promise for improving factual accuracy, provenance, and efficiency in domains requiring high reliability.

Abstract: Large Language Models (LLMs) have shown remarkable capabilities across various domains, yet they struggle with knowledge-intensive tasks in areas that demand factual accuracy, e.g. industrial automation and healthcare. Key limitations include their tendency to hallucinate, lack of source traceability (provenance), and challenges in timely knowledge updates. Combining language models with knowledge graphs (GraphRAG) offers promising avenues for overcoming these deficits. However, a major challenge lies in creating such a knowledge graph in the first place. Here, we propose a novel approach that combines LLMs with a tripartite knowledge graph representation, which is constructed by connecting complex, domain-specific objects via a curated ontology of corresponding, domain-specific concepts to relevant sections within chunks of text through a concept-anchored pre-analysis of source documents starting from an initial lexical graph. Subsequently, we formulate LLM prompt creation as an unsupervised node classification problem allowing for the optimization of information density, coverage, and arrangement of LLM prompts at significantly reduced lengths. An initial experimental evaluation of our approach on a healthcare use case, involving multi-faceted analyses of patient anamneses given a set of medical concepts as well as a series of clinical guideline literature, indicates its potential to optimize information density, coverage, and arrangement of LLM prompts while significantly reducing their lengths, which, in turn, may lead to reduced costs as well as more consistent and reliable LLM outputs.

[371] Contrastive MIM: A Contrastive Mutual Information Framework for Unified Generative and Discriminative Representation Learning

Micha Livne

Main category: cs.LG

TL;DR: cMIM is a contrastive extension of Mutual Information Machine that improves discriminative performance while maintaining generative capabilities, eliminating the need for data augmentation and being robust to batch size.

DetailsMotivation: Existing representation learning methods like contrastive learning and MIM have trade-offs - MIM underperforms on discriminative tasks compared to state-of-the-art alternatives despite its generative strengths.

Method: Augments MIM with a novel contrastive objective to enforce global discriminative structure, and introduces informative embeddings technique to extract enriched representations from encoder-decoder models.
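
A sketch of pairing a reconstruction term with an InfoNCE-style contrastive term over latent codes; the pairing here comes from the model itself rather than input augmentation, echoing the cMIM claim, but the exact objective is the paper's:

```python
import torch
import torch.nn.functional as F

def cmim_style_loss(z, z_prime, recon_loss, tau=0.1, lam=1.0):
    """Reconstruction (MIM-style generative term) plus a contrastive term
    over paired latent codes. Illustrative surrogate for the cMIM objective.
    """
    z = F.normalize(z, dim=-1)
    z_prime = F.normalize(z_prime, dim=-1)
    logits = z @ z_prime.T / tau          # similarity matrix over the batch
    labels = torch.arange(z.size(0))      # matching pairs on the diagonal
    contrastive = F.cross_entropy(logits, labels)
    return recon_loss + lam * contrastive

z = torch.randn(16, 32)
z2 = z + 0.05 * torch.randn(16, 32)  # e.g., encoder- vs. decoder-side code
print(cmim_style_loss(z, z2, recon_loss=torch.tensor(0.7)).item())
```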

Result: cMIM consistently outperforms MIM and InfoNCE in classification and regression tasks while preserving comparable reconstruction quality.

Conclusion: cMIM provides a unified framework for learning representations that are simultaneously effective for both discriminative and generative applications.

Abstract: Learning representations that generalize well to unknown downstream tasks is a central challenge in representation learning. Existing approaches such as contrastive learning, self-supervised masking, and denoising auto-encoders address this challenge with varying trade-offs. In this paper, we introduce the contrastive Mutual Information Machine (cMIM), a probabilistic framework that augments the Mutual Information Machine (MIM) with a novel contrastive objective. While MIM maximizes mutual information between inputs and latent variables and encourages clustering of latent codes, its representations underperform on discriminative tasks compared to state-of-the-art alternatives. cMIM addresses this limitation by enforcing global discriminative structure while retaining MIM’s generative strengths. We present two main contributions: (1) we propose cMIM, a contrastive extension of MIM that eliminates the need for positive data augmentation and is robust to batch size, unlike InfoNCE-based methods; (2) we introduce informative embeddings, a general technique for extracting enriched representations from encoder–decoder models that substantially improve discriminative performance without additional training, and which apply broadly beyond MIM. Empirical results demonstrate that cMIM consistently outperforms MIM and InfoNCE in classification and regression tasks, while preserving comparable reconstruction quality. These findings suggest that cMIM provides a unified framework for learning representations that are simultaneously effective for discriminative and generative applications.

[372] Heterogeneous Self-Supervised Acoustic Pre-Training with Local Constraints

Xiaodong Cui, A F M Saif, Brian Kingsbury, Tianyi Chen

Main category: cs.LG

TL;DR: A new self-supervised pre-training approach for heterogeneous data using bilevel optimization with local constraints to improve model adaptivity for downstream tasks.

DetailsMotivation: Conventional self-supervised pre-training mixes all heterogeneous data and minimizes averaged global loss, which may not optimize each data source effectively for downstream adaptation.

Method: Proposes bilevel optimization with local constraints to ensure model optimizes each heterogeneous data source to its local optimum after K-step gradient descent, using first-order approximation method.
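
A first-order sketch of the local constraint: adapt a copy of the model for K steps on each heterogeneous source, then pull the shared model toward the adapted weights (a Reptile-style stand-in for the paper's first-order approximation):

```python
import copy
import torch
import torch.nn as nn

def local_then_merge(model, sources, k=3, inner_lr=1e-2, outer_lr=1e-1):
    """For each data source, run K gradient steps from the shared model,
    then move the shared model toward the locally adapted weights."""
    for batch in sources:                    # one batch per data source
        local = copy.deepcopy(model)
        opt = torch.optim.SGD(local.parameters(), lr=inner_lr)
        for _ in range(k):                   # K-step local optimization
            opt.zero_grad()
            local(batch).pow(2).mean().backward()  # placeholder SSL loss
            opt.step()
        with torch.no_grad():
            for p, q in zip(model.parameters(), local.parameters()):
                p += outer_lr * (q - p)      # pull toward the local optimum

model = nn.Linear(8, 8)
local_then_merge(model, sources=[torch.randn(16, 8) for _ in range(3)])
```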

Result: Experiments on multi-domain and multilingual datasets show significant improvement in adaptivity of self-supervised pre-trained models for downstream supervised fine-tuning tasks.

Conclusion: The proposed approach effectively handles heterogeneous data in self-supervised pre-training and enhances model performance on downstream tasks compared to conventional methods.

Abstract: Self-supervised pre-training using unlabeled data is widely used in automatic speech recognition. In this paper, we propose a new self-supervised pre-training approach to dealing with heterogeneous data. Instead of mixing all the data and minimizing the averaged global loss in the conventional way, we impose additional local constraints to ensure that the model optimizes each source of heterogeneous data to its local optimum after $K$-step gradient descent initialized from the model. We formulate this as a bilevel optimization problem, and use the first-order approximation method to solve the problem. We discuss its connection to model-agnostic meta learning. Experiments are carried out on self-supervised pre-training using multi-domain and multilingual datasets, demonstrating that the proposed approach can significantly improve the adaptivity of the self-supervised pre-trained model for the downstream supervised fine-tuning tasks.

[373] Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?

Guangzhi Sun, Potsawee Manakul, Xiao Zhan, Mark Gales

Main category: cs.LG

TL;DR: DF-MCQ is a novel unlearning method that flattens model predictions using KL-divergence on multiple-choice questions, achieving true knowledge removal rather than obfuscation with over 90% refusal rate.

DetailsMotivation: Current unlearning techniques often rely on obfuscation by injecting incorrect information rather than true knowledge removal, leaving models vulnerable to probing attacks. The paper aims to distinguish genuine unlearning from mere obfuscation.

Method: Proposed DF-MCQ method that flattens the model’s predictive distribution over automatically generated multiple-choice questions using KL-divergence to remove knowledge about target individuals and trigger refusal behavior.
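
The core operation is a KL term that flattens the model's distribution over the generated answer choices; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def flatten_choices_loss(choice_logits):
    """KL between the uniform distribution and the model's distribution
    over MCQ options; driving it to zero removes any preference among
    the answer choices for the targeted fact.
    """
    log_p = F.log_softmax(choice_logits, dim=-1)
    n = choice_logits.size(-1)
    uniform = torch.full_like(log_p, 1.0 / n)
    # F.kl_div takes log-probs as input and probs as target: KL(uniform || p).
    return F.kl_div(log_p, uniform, reduction="batchmean")

# Toy batch: logits the model assigns to 4 options of one generated MCQ.
logits = torch.tensor([[3.0, 0.1, -1.0, 0.2]], requires_grad=True)
loss = flatten_choices_loss(logits)
loss.backward()  # gradients push the distribution toward uniform
```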

Result: DF-MCQ achieves over 90% refusal rate and significantly higher random choice-level uncertainty compared to obfuscation methods on probing questions, demonstrating genuine knowledge removal.

Conclusion: The paper introduces a formal distinction between unlearning and obfuscation, provides an evaluation framework, and demonstrates that DF-MCQ effectively removes targeted knowledge while maintaining appropriate refusal behavior, making it superior to obfuscation-based approaches.

Abstract: Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation by injecting incorrect or irrelevant information to suppress knowledge. Such methods effectively constitute knowledge addition rather than true removal, often leaving models vulnerable to probing. In this paper, we formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework to assess whether existing approaches genuinely remove targeted information. Moreover, we propose DF-MCQ, a novel unlearning method that flattens the model predictive distribution over automatically generated multiple-choice questions using KL-divergence, effectively removing knowledge about target individuals and triggering appropriate refusal behaviour. Experimental results demonstrate that DF-MCQ achieves unlearning with over 90% refusal rate and a random choice-level uncertainty that is much higher than obfuscation on probing questions.

[374] SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models

Kunyang Sun, Dorian Bagni, Joseph M. Cavanagh, Yingze Wang, Jacob M. Sawyer, Bo Zhou, Andrew Gritsevskiy, Oufan Zhang, Teresa Head-Gordon

Main category: cs.LG

TL;DR: SynLlama is a fine-tuned LLM that generates practical synthetic pathways using accessible building blocks, outperforming state-of-the-art methods in synthesis planning with strong generalization to unseen purchasable compounds.

DetailsMotivation: Many generative ML models produce molecules that are too difficult to synthesize, making them impractical for real-world chemical development and investigation.

Method: Fine-tuning Meta’s Llama3 LLMs to create SynLlama, which generates full synthetic pathways using commonly accessible building blocks and robust organic reaction templates with minimal data requirements.

Result: SynLlama shows strong performance in both forward and bottom-up synthesis planning, effectively generalizes to unseen purchasable building blocks, and demonstrates practical utility in pharmaceutical contexts for analog synthesis and hit expansion.

Conclusion: SynLlama provides medicinal chemists with a valuable tool for drug discovery by generating synthesizable chemical pathways that extend beyond the training data, making generative chemistry more practical and applicable.

Abstract: Generative machine learning models for exploring chemical space have shown immense promise, but many molecules they generate are too difficult to synthesize, making them impractical for further investigation or development. In this work, we present a novel approach by fine-tuning Meta’s Llama3 Large Language Models (LLMs) to create SynLlama, which generates full synthetic pathways made of commonly accessible building blocks and robust organic reaction templates. SynLlama explores a large synthesizable space using significantly less data, and offers strong performance in both forward and bottom-up synthesis planning compared to other state-of-the-art methods. We find that SynLlama, even without training on external building blocks, can effectively generalize to unseen yet purchasable building blocks, meaning that its reconstruction capabilities extend to a broader synthesizable chemical space than the training data. We also demonstrate the use of SynLlama in a pharmaceutical context for synthesis planning of analog molecules and hit expansion leads for proposed inhibitors of target proteins, offering medicinal chemists a valuable tool for discovery.

[375] Highly Efficient Direct Analytics on Semantic-aware Time Series Data Compression

Guoyou Sun, Panagiotis Karras, Qi Zhang

Main category: cs.LG

TL;DR: A novel method for direct analytics on compressed time series data using SHRINK algorithm, achieving efficient outlier detection with 4x faster runtime and 90% data reduction while maintaining accuracy.

DetailsMotivation: Address challenges of massive data traffic and resource constraints in IoT environments by enabling goal-oriented semantic communication and efficient analytics on compressed time series data.

Method: Propose direct analytics approach on time series data compressed by SHRINK compression algorithm, using outlier detection as a case study to demonstrate the method’s effectiveness.

Result: Outperforms baselines on uncompressed data in most cases with only 1% difference in worst case, achieves 4x lower runtime on average, and accesses only 10% of original data volume.

Conclusion: The approach enables reliable, high-speed outlier detection for IoT applications while achieving high compression, reducing data transmission, and supporting edge analytics with limited resources.

Abstract: Semantic communication has emerged as a promising paradigm to tackle the challenges of massive growing data traffic and sustainable data communication. It shifts the focus from data fidelity to goal-oriented or task-oriented semantic transmission. While deep learning-based methods are commonly used for semantic encoding and decoding, they struggle with the sequential nature of time series data and high computation cost, particularly in resource-constrained IoT environments. Data compression plays a crucial role in reducing transmission and storage costs, yet traditional data compression methods fall short of the demands of goal-oriented communication systems. In this paper, we propose a novel method for direct analytics on time series data compressed by the SHRINK compression algorithm. Through experimentation using outlier detection as a case study, we show that our method outperforms baselines running on uncompressed data in multiple cases, with merely 1% difference in the worst case. Additionally, it achieves four times lower runtime on average and accesses approximately 10% of the data volume, which enables edge analytics with limited storage and computation power. These results demonstrate that our approach offers reliable, high-speed outlier detection analytics for diverse IoT applications while extracting semantics from time-series data, achieving high compression, and reducing data transmission.
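
The abstract does not spell out SHRINK's encoding, so the sketch below assumes a generic piecewise-linear segment representation as a stand-in and runs a z-score outlier test per segment rather than per raw sample; this is the sense in which direct analytics touch only a fraction of the original data volume:

```python
import numpy as np

# Hypothetical compressed form: one (start, length, intercept, slope)
# tuple per segment, standing in for SHRINK's actual encoding.
def segment_outliers(segments, z_thresh=3.0):
    """Outlier detection on the compressed representation directly:
    one statistic per segment instead of one per raw sample."""
    mids = np.array([b + 0.5 * s * (n - 1) for (_, n, b, s) in segments])
    z = np.abs(mids - mids.mean()) / (mids.std() + 1e-9)
    return [seg for seg, zi in zip(segments, z) if zi > z_thresh]
```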

[376] Overflow Prevention Enhances Long-Context Recurrent LLMs

Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, Lior Wolf, James Glass, Leonid Karlinsky, Raja Giryes

Main category: cs.LG

TL;DR: Chunk-based inference improves recurrent LLM performance by focusing on relevant input portions, achieving up to 51% gains on LongBench and competitive results with Transformers.

DetailsMotivation: To investigate why recurrent sub-quadratic LLMs underutilize long contexts despite training, and address recurrent memory failures in long-context processing.

Method: Proposed a chunk-based inference procedure that identifies and processes only the most relevant portions of input to mitigate recurrent memory limitations.

Result: Significant performance improvements: 14% for Falcon3-Mamba-7B, 28% for Falcon-Mamba-7B, 50% for RecurrentGemma-9B, and 51% for RWKV6-Finch-7B on LongBench. Achieved state-of-the-art results on LongBench v2.

Conclusion: Simple chunk-based approach effectively addresses recurrent memory failures, raising questions about whether recurrent models truly exploit long-range dependencies as single-chunk strategy outperforms even in cross-context tasks.

Abstract: A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, their use of long contexts remains underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input, can mitigate recurrent memory failures and be effective for many long-context tasks: On LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results in the challenging LongBench v2 benchmark, showing competitive performance with equivalent size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance - even in tasks that presumably require cross-context relations.
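
A schematic of the chunk-based procedure, with a hypothetical `embed` function and cosine similarity standing in for the paper's relevance criterion; `generate` is a placeholder for the recurrent LLM call:

```python
import numpy as np

def chunked_inference(document, query, embed, generate, chunk_len=2000):
    """Split the input into chunks, keep only the chunk most relevant to
    the query, and run the recurrent LLM on that chunk alone, so its
    fixed-size memory is never overflowed by the full context."""
    chunks = [document[i:i + chunk_len]
              for i in range(0, len(document), chunk_len)]
    q = embed(query)

    def sim(chunk):
        e = embed(chunk)
        return float(e @ q / (np.linalg.norm(e) * np.linalg.norm(q) + 1e-9))

    best = max(chunks, key=sim)          # single most relevant chunk
    return generate(best + "\n\n" + query)
```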

[377] M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao

Main category: cs.LG

TL;DR: M1 is a hybrid linear RNN reasoning model based on Mamba architecture that enables memory-efficient inference for mathematical reasoning, outperforming previous linear RNN models and matching state-of-the-art transformer performance with 3x speedup.

DetailsMotivation: Transformer models are limited in extending context length due to quadratic computational complexity and linear memory requirements, making long chain-of-thought reasoning inefficient.

Method: Developed a hybrid linear RNN model (M1) using Mamba architecture with distillation from existing reasoning models and enhanced through RL training.

Result: Outperforms previous linear RNN models, matches Deepseek R1 distilled reasoning models, achieves 3x speedup over same-size transformers, and higher accuracy under fixed generation time budget with self-consistency voting.

Conclusion: M1 provides an effective approach for scaling test-time generation using self-consistency or long chain-of-thought reasoning with superior efficiency and performance.

Abstract: Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art Deepseek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general purpose inference engine, vLLM, and observe more than a 3x speedup compared to a same size transformer. With throughput speedup, we are able to achieve higher accuracy compared to DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain of thought reasoning.
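
The fixed-budget accuracy gain comes from plain self-consistency voting: a faster decoder fits more samples into the same wall-clock budget. A minimal sketch, with `generate` as a placeholder for a stochastic decoder returning the final answer:

```python
from collections import Counter

def self_consistency(generate, prompt, n_samples=16):
    """Sample several reasoning chains and majority-vote on the final
    answers. With ~3x higher throughput, M1 can afford a larger
    n_samples than a same-size transformer under one time budget."""
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```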

[378] Scalable Autoregressive 3D Molecule Generation

Austin H. Cheng, Chong Sun, Alán Aspuru-Guzik

Main category: cs.LG

TL;DR: Quetzal is a scalable autoregressive model that generates 3D molecules atom-by-atom, combining a transformer for atom type prediction with a diffusion MLP for position modeling, achieving competitive performance with diffusion models while enabling faster generation.

DetailsMotivation: Diffusion models currently dominate 3D molecule generation while autoregressive models have trailed behind. The authors aim to develop a simple but scalable autoregressive approach that can compete with state-of-the-art diffusion models.

Method: Quetzal treats molecules as ordered sequences and combines a causal transformer for predicting the next atom’s discrete type with a smaller Diffusion MLP for modeling the continuous next-position distribution. It builds molecules atom-by-atom in 3D.

Result: Quetzal achieves substantial improvements in generation quality over existing autoregressive baselines and is competitive with state-of-the-art diffusion models. It enables significantly faster generation speed and exact divergence-based likelihood computation.

Conclusion: The work demonstrates that autoregressive models can be competitive with diffusion models for 3D molecule generation, offering advantages in speed and likelihood computation while natively handling variable-size tasks without architectural changes.

Abstract: Generative models of 3D molecular structure play a rapidly growing role in the design and simulation of molecules. Diffusion models currently dominate the space of 3D molecule generation, while autoregressive models have trailed behind. In this work, we present Quetzal, a simple but scalable autoregressive model that builds molecules atom-by-atom in 3D. Treating each molecule as an ordered sequence of atoms, Quetzal combines a causal transformer that predicts the next atom’s discrete type with a smaller Diffusion MLP that models the continuous next-position distribution. Compared to existing autoregressive baselines, Quetzal achieves substantial improvements in generation quality and is competitive with the performance of state-of-the-art diffusion models. In addition, by reducing the number of expensive forward passes through a dense transformer, Quetzal enables significantly faster generation speed, as well as exact divergence-based likelihood computation. Finally, without any architectural changes, Quetzal natively handles variable-size tasks like hydrogen decoration and scaffold completion. We hope that our work motivates a perspective on scalability and generality for generative modelling of 3D molecules.
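
The two-head factorization reads directly as a generation loop. The sketch below treats `next_type` and `next_position` as placeholders for the trained causal transformer and diffusion MLP, with a designated stop type ending generation:

```python
import torch

@torch.no_grad()
def generate_molecule(next_type, next_position, max_atoms=60, stop_type=0):
    """Atom-by-atom generation: the transformer proposes the next atom's
    discrete type; the diffusion MLP samples its 3D coordinates
    conditioned on the partial molecule built so far."""
    types, coords = [], []
    for _ in range(max_atoms):
        logits = next_type(types, coords)                 # (n_types,)
        t = int(torch.distributions.Categorical(logits=logits).sample())
        if t == stop_type:                                # end of molecule
            break
        xyz = next_position(types, coords, t)             # (3,) position
        types.append(t)
        coords.append(xyz)
    return types, coords
```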

[379] Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments

Ziyan Luo, Tianwei Ni, Pierre-Luc Bacon, Doina Precup, Xujie Si

Main category: cs.LG

TL;DR: Systematic evaluation of 5 metric learning approaches in deep RL across 34 tasks with 370 configurations, introducing new evaluation metrics and releasing open-source codebase.

DetailsMotivation: Prior work on behavioral metrics (bisimulation metrics) for state abstraction has theoretical-practical gaps and unclear performance sources, with evaluations focusing only on final returns rather than metric quality.

Method: Evaluated 5 recent approaches unified as isometric embeddings, benchmarked across 20 state-based and 14 pixel-based tasks with diverse noise settings. Introduced denoising factor evaluation and isolated metric estimation setting to separate metric learning effects.

Result: Comprehensive benchmarking across 370 task configurations provides insights into how different metric learning approaches perform under various noise conditions and task types.

Conclusion: The study provides systematic assessment of metric learning in deep RL, introduces better evaluation metrics, and releases open-source code to support future research and improve reproducibility.

Abstract: A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space and embedding these learned distances in the representation space. While promising for robustness to task-irrelevant noise, as shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep reinforcement learning (RL), we evaluate five recent approaches, unified conceptually as isometric embeddings with varying design choices. We benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 370 task configurations with diverse noise settings. Beyond final returns, we introduce the evaluation of a denoising factor to quantify the encoder’s ability to filter distractions. To further isolate the effect of metric learning, we propose and evaluate an isolated metric estimation setting, in which the encoder is influenced solely by the metric loss. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

[380] Closing the Gap between TD Learning and Supervised Learning with $Q$-Conditioned Maximization

Xing Lei, Zifeng Zhuang, Shentao Yang, Sheng Xu, Yunhao Luo, Fei Shen, Wenyan Yang, Xuetao Zhang, Donglin Wang

Main category: cs.LG

TL;DR: GCReinSL enhances supervised learning methods for offline RL with stitching capability through Q-conditioned policy and maximization, outperforming prior SL approaches.

DetailsMotivation: Supervised learning methods for offline RL lack trajectory stitching capability that TD-based methods have, creating a performance gap that needs to be addressed.

Method: Proposes Goal-Conditioned Reinforced Supervised Learning (GCReinSL) with two steps: (1) estimating Q-function using Normalizing Flows from offline data, and (2) finding maximum Q-value within data support by combining Q-function maximization with Expectile Regression.

Result: Experimental results show the method outperforms prior SL approaches with stitching capabilities and goal data augmentation techniques on offline RL datasets.

Conclusion: The proposed GCReinSL successfully endows supervised learning methods with stitching capability, closing the performance gap with TD learning approaches in offline goal-conditioned RL.

Abstract: Recently, supervised learning (SL) methodology has emerged as an effective approach for offline reinforcement learning (RL) due to its simplicity, stability, and efficiency. However, recent studies show that SL methods lack the trajectory stitching capability, typically associated with temporal difference (TD)-based approaches. A question naturally surfaces: \textit{How can we endow SL methods with stitching capability and close its performance gap with TD learning?} To answer this question, we introduce $Q$-conditioned maximization supervised learning for offline goal-conditioned RL, which enhances SL with the stitching capability through $Q$-conditioned policy and $Q$-conditioned maximization. Concretely, we propose \textbf{G}oal-\textbf{C}onditioned \textbf{\textit{Rein}}forced \textbf{S}upervised \textbf{L}earning (\textbf{GC\textit{Rein}SL}), which consists of (1) estimating the $Q$-function by Normalizing Flows from the offline dataset and (2) finding the maximum $Q$-value within the data support by integrating $Q$-function maximization with Expectile Regression. At inference time, our policy chooses optimal actions based on such a maximum $Q$-value. Experimental results from stitching evaluations on offline RL datasets demonstrate that our method outperforms prior SL approaches with stitching capabilities and goal data augmentation techniques.
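
Expectile regression, the ingredient used here to estimate an in-support maximum Q-value, is a short asymmetric loss. A generic sketch of that ingredient alone (the tau value is illustrative, not the paper's):

```python
import torch

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared error: with tau > 0.5, underestimates are
    penalized more than overestimates, so the regressor tracks an upper
    expectile of the target distribution - a proxy for the maximum
    Q-value that stays within the data support."""
    diff = target - pred
    weight = torch.where(diff > 0, tau, 1.0 - tau)
    return (weight * diff.pow(2)).mean()
```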

[381] Navigating High Dimensional Concept Space with Metalearning

Max Gupta

Main category: cs.LG

TL;DR: Meta-learning improves few-shot learning of compositional Boolean concepts but struggles with featural complexity. Second-order methods and extended gradient adaptation help handle complex concept spaces.

DetailsMotivation: To investigate if gradient-based meta-learning can provide neural networks with inductive biases for efficient few-shot learning of discrete abstract concepts, comparing it against supervised learning baselines.

Method: Systematic comparison of meta-learning methods vs supervised learning on Boolean concepts generated by probabilistic context-free grammar, varying concept dimensionality and recursive compositionality. Includes representational analysis of weights and loss landscape analysis.

Result: Meta-learners handle compositional complexity much better than featural complexity. Featural complexity increases loss landscape roughness, making curvature-aware optimization more effective. More adaptation steps in meta-SGD improve out-of-distribution generalization.

Conclusion: This work reveals the intricacies of learning compositional vs featural complexity and provides insights into the role of second-order methods and extended gradient adaptation in few-shot concept learning.

Abstract: Rapidly learning abstract concepts from limited examples is a hallmark of human intelligence. This work investigates whether gradient-based meta-learning can equip neural networks with inductive biases for efficient few-shot acquisition of discrete concepts. I compare meta-learning methods against a supervised learning baseline on Boolean concepts (logical statements) generated by a probabilistic context-free grammar (PCFG). By systematically varying concept dimensionality (number of features) and recursive compositionality (depth of grammar recursion), I delineate between complexity regimes in which meta-learning robustly improves few-shot concept learning and regimes in which it does not. Meta-learners are much better able to handle compositional complexity than featural complexity. I highlight some reasons for this with a representational analysis of the weights of meta-learners and a loss landscape analysis demonstrating how featural complexity increases the roughness of loss trajectories, allowing curvature-aware optimization to be more effective than first-order methods. I find improvements in out-of-distribution generalization on complex concepts by increasing the number of adaptation steps in meta-SGD, where adaptation acts as a way of encouraging exploration of rougher loss basins. Overall, this work highlights the intricacies of learning compositional versus featural complexity in high dimensional concept spaces and provides a road to understanding the role of 2nd order methods and extended gradient adaptation in few-shot concept learning.
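
The inner loop whose step budget the paper scales up is standard MAML/meta-SGD-style adaptation. A first-order PyTorch sketch with illustrative hyperparameters (the actual per-parameter learning rates of meta-SGD are collapsed to a scalar here):

```python
import torch
from torch.func import functional_call

def adapt(model, loss_fn, support_x, support_y, lr=0.1, steps=5):
    """Several gradient steps on the few-shot support set, starting from
    the meta-learned initialization. A larger `steps` budget explores
    more of the (rough) loss basin, which the paper links to better
    out-of-distribution generalization on complex concepts."""
    params = {n: p.detach().clone().requires_grad_(True)
              for n, p in model.named_parameters()}
    for _ in range(steps):
        pred = functional_call(model, params, (support_x,))
        loss = loss_fn(pred, support_y)
        grads = torch.autograd.grad(loss, list(params.values()))
        params = {n: (p - lr * g).detach().requires_grad_(True)
                  for (n, p), g in zip(params.items(), grads)}
    return params
```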

[382] Self-Supervised Temporal Super-Resolution of Energy Data using Generative Adversarial Transformer

Xuanhao Mu, Gökhan Demirel, Yuzhe Zhang, Jianlei Liu, Thorsten Schlachter, Veit Hagenmeyer

Main category: cs.LG

TL;DR: A new method for energy time series upsampling based on Generative Adversarial Transformers (GATs) that doesn't require ground-truth high-resolution data, reducing upsampling RMSE by 9% and improving MPC accuracy by 13% compared to conventional interpolation.

DetailsMotivation: To address the temporal granularity gap in energy network design by overcoming limitations of conventional upsampling methods (information loss/noise) and advanced models (supervised learning paradox requiring unavailable high-resolution data).

Method: Utilizes Generative Adversarial Transformers (GATs) that can be trained without access to ground-truth high-resolution data, addressing the fundamental application paradox where high-resolution data is intrinsically absent.

Result: The method reduces root mean square error (RMSE) of upsampling tasks by 9% and improves accuracy in model predictive control (MPC) applications by 13% compared to conventional interpolation methods.

Conclusion: The GATs approach provides an effective solution for energy time series upsampling that overcomes the supervised learning paradox and outperforms traditional methods, making it suitable for energy system modeling applications where high-resolution ground truth data is unavailable.

Abstract: To bridge the temporal granularity gap in energy network design and operation based on Energy System Models, resampling of time series is required. While conventional upsampling methods are computationally efficient, they often result in significant information loss or increased noise. Advanced models such as time series generation models, Super-Resolution models and imputation models show potential, but also face fundamental challenges. The goal of time series generative models is to learn the distribution of the original data to generate high-resolution series with similar statistical characteristics. This is not entirely consistent with the definition of upsampling. Time series Super-Resolution models or imputation models can degrade the accuracy of upsampling because the input low-resolution time series are sparse and may have insufficient context. Moreover, such models usually rely on supervised learning paradigms. This presents a fundamental application paradox: their training requires the high-resolution time series that is intrinsically absent in upsampling application scenarios. To address the mentioned upsampling issue, this paper introduces a new method utilizing Generative Adversarial Transformers (GATs), which can be trained without access to any ground-truth high-resolution data. Compared with conventional interpolation methods, the introduced method can reduce the root mean square error (RMSE) of upsampling tasks by 9%, and the accuracy of a model predictive control (MPC) application scenario is improved by 13%.

[383] Using item recommendations and LLMs in marketing email titles

Deddy Jobson, Muktti Shukla, Phuong Dinh, Julio Christian Young, Nick Pitton, Nina Chen, Ryan Ginstrom

Main category: cs.LG

TL;DR: Using LLMs to generate personalized email titles instead of fixed templates improves customer engagement in e-commerce marketing.

DetailsMotivation: Traditional email marketing uses fixed template titles that fail to inspire interest, limiting the effectiveness of personalized email content recommendations.

Method: Employ large language models (LLMs) to generate thematic titles that reflect personalized email content, conducting offline simulations and online experiments with millions of users.

Result: The techniques improved engagement between customers and emails, demonstrating the effectiveness of LLM-generated personalized titles.

Conclusion: LLMs can be successfully productionized for safe and automated generation of personalized email titles at scale, enhancing e-commerce marketing effectiveness.

Abstract: E-commerce marketplaces make use of a number of marketing channels like emails, push notifications, etc. to reach their users and stimulate purchases. Personalized emails especially are a popular touch point for marketers to inform users of latest items in stock, especially for those who stopped visiting the marketplace. Such emails contain personalized recommendations tailored to each user’s interests, enticing users to buy relevant items. A common limitation of these emails is that the primary entry point, the title of the email, tends to follow fixed templates, failing to inspire enough interest in the contents. In this work, we explore the potential of large language models (LLMs) for generating thematic titles that reflect the personalized content of the emails. We perform offline simulations and conduct online experiments on the order of millions of users, finding our techniques useful in improving the engagement between customers and our emails. We highlight key findings and learnings as we productionize the safe and automated generation of email titles for millions of users.

[384] Adaptive LLM Routing under Budget Constraints

Pranoy Panda, Raghav Magazine, Chaitanya Devaguptapu, Sho Takemori, Vishal Sharma

Main category: cs.LG

TL;DR: LLM routing as contextual bandit problem using shared embedding space and PILOT algorithm for adaptive model selection without exhaustive inference.

DetailsMotivation: Address limitations of supervised LLM routing approaches, which assume complete knowledge of optimal query-LLM pairings; such mappings do not exist in real-world scenarios with evolving user queries.

Method: Develop shared embedding space for queries and LLMs aligned by affinity, learned from offline human preference data and refined through online bandit feedback. Use PILOT (Preference-prior Informed Linucb) extension and online cost policy for budget-aware routing.

Result: Proposed framework enables adaptive LLM selection without requiring inference across all models for all queries, handling diverse user budgets efficiently.

Conclusion: Contextual bandit formulation with shared embeddings provides effective solution for practical LLM routing in dynamic real-world environments with cost constraints.

Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their varying capabilities and costs pose challenges in practical applications. LLM routing addresses this by dynamically selecting the most suitable LLM for each query/task. Previous approaches treat this as a supervised learning problem, assuming complete knowledge of optimal query-LLM pairings. However, real-world scenarios lack such comprehensive mappings and face evolving user queries. We thus propose to study LLM routing as a contextual bandit problem, enabling adaptive decision-making using bandit feedback without requiring exhaustive inference across all LLMs for all queries (in contrast to supervised routing). To address this problem, we develop a shared embedding space for queries and LLMs, where query and LLM embeddings are aligned to reflect their affinity. This space is initially learned from offline human preference data and refined through online bandit feedback. We instantiate this idea through Preference-prior Informed Linucb fOr adaptive rouTing (PILOT), a novel extension of LinUCB. To handle diverse user budgets for model routing, we introduce an online cost policy modeled as a multi-choice knapsack problem, ensuring resource-efficient routing.
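
PILOT extends LinUCB, whose core fits in a few lines. The sketch below shows vanilla LinUCB with one arm per candidate LLM and the query embedding as context; the preference-prior warm start and the knapsack-style cost policy are omitted:

```python
import numpy as np

class LinUCBRouter:
    """One arm per LLM; the context is the query embedding."""

    def __init__(self, n_models, dim, alpha=1.0):
        self.A = [np.eye(dim) for _ in range(n_models)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_models)]  # per-arm reward sums
        self.alpha = alpha                                 # exploration strength

    def route(self, x):
        """Return the index of the LLM with the highest UCB score."""
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                              # ridge-regression estimate
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, arm, x, reward):
        """Bandit feedback: observed quality of the chosen LLM on query x."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

Because only the chosen arm is updated, the router never needs responses from every LLM for every query, which is the contrast with supervised routing the abstract draws.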

[385] FNODE: Flow-Matching for data-driven simulation of constrained multibody systems

Hongyu Wang, Jingquan Wang, Dan Negrut

Main category: cs.LG

TL;DR: FNODE framework learns acceleration vector fields directly from trajectory data, eliminating backpropagation through ODE solvers for improved computational efficiency and prediction accuracy in constrained multibody systems.

DetailsMotivation: Address high computational cost and limited long-term prediction accuracy in data-driven modeling of constrained multibody systems.

Method: Flow-Matching Neural ODE (FNODE) that learns acceleration vector fields directly, using numerical differentiation techniques including hybrid FFT and Finite Difference schemes to compute acceleration targets.

Result: FNODE consistently outperforms MBD-NODE, LSTM, and FCNN across multiple benchmarks (mass-spring-damper systems, double pendulum, slider-crank, cart-pole) with good accuracy, generalization, and computational efficiency.

Conclusion: FNODE provides an effective framework for constrained multibody system modeling that addresses computational bottlenecks while maintaining high prediction accuracy.

Abstract: Data-driven modeling of constrained multibody systems faces two persistent challenges: high computational cost and limited long-term prediction accuracy. To address these issues, we introduce the Flow-Matching Neural Ordinary Differential Equation (FNODE), a framework that learns acceleration vector fields directly from trajectory data. By reformulating the training objective to supervise accelerations rather than integrated states, FNODE eliminates the need for backpropagation through an ODE solver, which represents a bottleneck in traditional Neural ODEs. Acceleration targets are computed efficiently using numerical differentiation techniques, including a hybrid Fast Fourier Transform (FFT) and Finite Difference (FD) scheme. We evaluate FNODE on a diverse set of benchmarks, including the single and triple mass-spring-damper systems, double pendulum, slider-crank, and cart-pole. Across all cases, FNODE consistently outperforms existing approaches such as Multi-Body Dynamic Neural ODE (MBD-NODE), Long Short-Term Memory (LSTM) networks, and Fully Connected Neural Networks (FCNN), demonstrating good accuracy, generalization, and computational efficiency.
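
Supervising accelerations turns training into plain regression, with labels obtained by numerical differentiation of the observed trajectories. A central-finite-difference stand-in for the paper's hybrid FFT/FD scheme:

```python
import numpy as np

def acceleration_targets(x, dt):
    """Second-order central differences as acceleration labels.
    x: (timesteps, dof) positions sampled at spacing dt."""
    a = np.empty_like(x)
    a[1:-1] = (x[2:] - 2.0 * x[1:-1] + x[:-2]) / dt**2
    a[0], a[-1] = a[1], a[-2]   # copy nearest interior values at the boundary
    return a
```

With such targets, the network fitting the acceleration field is trained with an ordinary regression loss, so no gradients ever flow through an ODE solver.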

[386] Equivariant U-Shaped Neural Operators for the Cahn-Hilliard Phase-Field Model

Xiao Xue, Marco F. P. ten Eikelder, Tianyue Yang, Yiqing Li, Kan He, Shuo Wang, Peter V. Coveney

Main category: cs.LG

TL;DR: E-UNO: an equivariant U-shaped neural operator that learns phase-field dynamics from short histories, outperforming standard neural operators by encoding physical symmetries and multi-scale behavior.

DetailsMotivation: Traditional numerical solvers for Cahn-Hilliard equation are computationally expensive and lack flexibility. Current neural operators fail to capture multiscale behavior and neglect physical symmetries.

Method: Combines global spectral convolution with multi-resolution U-shaped architecture, regulates translation equivariance to align with physics, learns from short histories of past dynamics.

Result: Outperforms standard Fourier neural operator and U-shaped neural operator baselines, particularly on fine-scale and high-frequency structures. Generalizes better with less training data.

Conclusion: E-UNO establishes an efficient surrogate for complex phase-field systems by encoding symmetry and scale hierarchy, yielding physically consistent dynamics.

Abstract: Phase separation in binary mixtures, governed by the Cahn-Hilliard equation, plays a central role in interfacial dynamics across materials science and soft matter. While numerical solvers are accurate, they are often computationally expensive and lack flexibility across varying initial conditions and geometries. Neural operators provide a data-driven alternative by learning solution operators between function spaces, but current architectures often fail to capture multiscale behavior and neglect underlying physical symmetries. Here we show that an equivariant U-shaped neural operator (E-UNO) can learn the evolution of the phase-field variable from short histories of past dynamics, achieving accurate predictions across space and time. The model combines global spectral convolution with a multi-resolution U-shaped architecture and regulates translation equivariance to align with the underlying physics. E-UNO outperforms standard Fourier neural operator and U-shaped neural operator baselines, particularly on fine-scale and high-frequency structures. By encoding symmetry and scale hierarchy, the model generalizes better, requires less training data, and yields physically consistent dynamics. This establishes E-UNO as an efficient surrogate for complex phase-field systems.
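
The global spectral convolution is the standard Fourier-neural-operator building block. A simplified PyTorch sketch (keeping only the corner block of low modes, and omitting the U-shaped multi-resolution wrapping and the equivariance regulation):

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """FFT -> learned pointwise mixing of the lowest modes -> inverse FFT."""

    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, modes,
                                dtype=torch.cfloat))

    def forward(self, x):                       # x: (batch, ch, h, w)
        xf = torch.fft.rfft2(x)
        out = torch.zeros_like(xf)
        m = self.modes                          # requires m <= h and m <= w//2 + 1
        out[:, :, :m, :m] = torch.einsum(
            "bixy,ioxy->boxy", xf[:, :, :m, :m], self.weight)
        return torch.fft.irfft2(out, s=x.shape[-2:])
```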

[387] What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.LG

TL;DR: PAMC is a new RL method that uses matrix completion with policy-aware sampling to learn sparse rewards more efficiently by exploiting low-rank + sparse structure in reward matrices.

DetailsMotivation: Sparse-reward RL is fundamentally hard and requires many samples without structure. The paper aims to develop a structural reward learning framework to improve sample efficiency.

Method: Policy-Aware Matrix Completion (PAMC) that exploits approximate low-rank + sparse structure in reward matrices under policy-biased sampling, using inverse-propensity weighting and establishing visitation-weighted error-to-regret bounds.

Result: PAMC improves sample efficiency across multiple benchmarks (Atari-26, DM Control, MetaWorld MT50, D4RL offline RL, preference-based RL), outperforming state-of-the-art methods like DrQ-v2, DreamerV3, Agent57, T-REX/D-REX, and PrefPPO.

Conclusion: PAMC provides a practical and principled tool for structural reward learning, gracefully degrading when assumptions weaken and serving as the first instantiation of a broader structural reward learning perspective.

Abstract: Sparse-reward reinforcement learning (RL) remains fundamentally hard: without structure, any agent needs $\Omega(|\mathcal{S}||\mathcal{A}|/p)$ samples to recover rewards. We introduce Policy-Aware Matrix Completion (PAMC) as a first concrete step toward a structural reward learning framework. Our key idea is to exploit approximate low-rank + sparse structure in the reward matrix, under policy-biased (MNAR) sampling. We prove recovery guarantees with inverse-propensity weighting, and establish a visitation-weighted error-to-regret bound linking completion error to control performance. Importantly, when assumptions weaken, PAMC degrades gracefully: confidence intervals widen and the algorithm abstains, ensuring safe fallback to exploration. Empirically, PAMC improves sample efficiency across Atari-26 (10M steps), DM Control, MetaWorld MT50, D4RL offline RL, and preference-based RL benchmarks, outperforming DrQ-v2, DreamerV3, Agent57, T-REX/D-REX, and PrefPPO under compute-normalized comparisons. Our results highlight PAMC as a practical and principled tool when structural rewards exist, and as a concrete first instantiation of a broader structural reward learning perspective.
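
A generic sketch of the two ingredients the abstract names, low-rank completion and inverse-propensity weighting, under a simple factorization model. The sparse component, confidence intervals, and abstention rule are omitted, and all hyperparameters are illustrative:

```python
import numpy as np

def ipw_low_rank_completion(R, observed, propensity, rank=8,
                            lr=0.05, iters=2000, lam=0.1, seed=0):
    """Fit R ~= U @ V.T on the observed entries, reweighting each entry
    by 1/propensity to correct for policy-biased (MNAR) sampling.
    R, observed, propensity: (n, m) arrays; observed is a 0/1 mask."""
    n, m = R.shape
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    W = observed / np.clip(propensity, 1e-3, None)   # inverse-propensity weights
    for _ in range(iters):
        E = W * (U @ V.T - R)                        # weighted residual
        # simultaneous gradient step on both factors (right side uses old U, V)
        U, V = U - lr * (E @ V + lam * U), V - lr * (E.T @ U + lam * V)
    return U @ V.T
```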

[388] Bootstrapping Task Spaces for Self-Improvement

Minqi Jiang, Andrei Lupu, Yoram Bachrach

Main category: cs.LG

TL;DR: ExIt is an autocurriculum RL method that trains LLMs for multi-step self-improvement by selectively sampling informative intermediate steps during training, enabling effective inference-time iteration beyond training depths.

DetailsMotivation: Current RL approaches for self-improvement tasks require fixed maximum iteration depths, which can be costly and arbitrary. There's a need for methods that can train agents to reliably self-improve over sequences at inference-time without these limitations.

Method: Exploratory Iteration (ExIt) grows a task space by selectively sampling the most informative intermediate partial histories encountered during episodes, treating these as new self-iteration task instances to train a self-improvement policy. It can pair with exploration mechanisms for greater task diversity.

Result: Across domains including competition math, multi-turn tool-use, and machine learning engineering, ExIt produces policies that show strong inference-time self-improvement on held-out tasks and can iterate towards higher performance beyond the average training iteration depth.

Conclusion: ExIt effectively enables LLMs to perform multi-step self-improvement at inference-time while only training on single-step iterations, demonstrating the method’s ability to generalize beyond training conditions and sustain iterative improvement.

Abstract: Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
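
One way to read the task-space growth is as a bounded priority buffer of partial histories that are re-sampled as fresh starting points. A sketch in which the informativeness score is caller-supplied, our stand-in for the paper's selection criterion:

```python
import heapq
import random

class TaskBuffer:
    """Keep the most informative intermediate histories seen so far and
    serve them back as new self-iteration starting points."""

    def __init__(self, capacity=10_000):
        self.heap = []       # min-heap of (score, tiebreak, history)
        self.tie = 0
        self.capacity = capacity

    def add(self, history, score):
        self.tie += 1
        item = (score, self.tie, history)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif score > self.heap[0][0]:        # evict the least informative
            heapq.heapreplace(self.heap, item)

    def sample_start(self):
        """Draw a stored partial history to continue iterating from."""
        _, _, history = random.choice(self.heap)
        return history
```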

[389] Safeguarding Graph Neural Networks against Topology Inference Attacks

Jie Fu, Hong Yuan, Zhili Chen, Wendy Hui Wang

Main category: cs.LG

TL;DR: GNNs are vulnerable to topology privacy attacks that can reconstruct training graph structures from black-box model access. Existing edge-level privacy protections are inadequate. Proposed PGR defense uses bi-level optimization with synthetic graphs to protect topology while maintaining accuracy.

DetailsMotivation: Graph Neural Networks raise serious privacy concerns, particularly around topology privacy (confidentiality of graph structure), which is underexplored compared to edge-level privacy. GNNs are vulnerable to graph-level inference attacks.

Method: Proposed Topology Inference Attacks (TIAs) to reconstruct training graph structure from black-box GNN access. Introduced Private Graph Reconstruction (PGR) defense - a bi-level optimization framework that iteratively generates synthetic training graphs using meta-gradients while updating the GNN model.

Result: GNNs are highly susceptible to topology inference attacks. Existing edge-level differential privacy mechanisms fail to mitigate the risk or severely compromise accuracy. PGR significantly reduces topology leakage with minimal impact on model accuracy.

Conclusion: Topology privacy is a critical threat in GNNs that requires specialized defenses. PGR provides an effective solution that protects graph structure confidentiality while maintaining model performance, addressing a gap in current privacy protection approaches.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful models for learning from graph-structured data. However, their widespread adoption has raised serious privacy concerns. While prior research has primarily focused on edge-level privacy, a critical yet underexplored threat lies in topology privacy - the confidentiality of the graph’s overall structure. In this work, we present a comprehensive study on topology privacy risks in GNNs, revealing their vulnerability to graph-level inference attacks. To this end, we propose a suite of Topology Inference Attacks (TIAs) that can reconstruct the structure of a target training graph using only black-box access to a GNN model. Our findings show that GNNs are highly susceptible to these attacks, and that existing edge-level differential privacy mechanisms are insufficient as they either fail to mitigate the risk or severely compromise model accuracy. To address this challenge, we introduce Private Graph Reconstruction (PGR), a novel defense framework designed to protect topology privacy while maintaining model accuracy. PGR is formulated as a bi-level optimization problem, where a synthetic training graph is iteratively generated using meta-gradients, and the GNN model is concurrently updated based on the evolving graph. Extensive experiments demonstrate that PGR significantly reduces topology leakage with minimal impact on model accuracy. Our code is available at https://github.com/JeffffffFu/PGR.

[390] Toward a Metrology for Artificial Intelligence: Hidden-Rule Environments and Reinforcement Learning

Christo Mathew, Wentian Wang, Jacob Feldman, Lazaros K. Gallos, Paul B. Kantor, Vladimir Menkov, Hao Wang

Main category: cs.LG

TL;DR: Reinforcement learning study in Game Of Hidden Rules environment using Transformer-based A2C with Feature-Centric and Object-Centric representations to infer hidden rules and clear boards.

DetailsMotivation: To develop agents that can simultaneously infer hidden governing rules and learn optimal policies in complex puzzle environments with partial observations.

Method: Transformer-based Advantage Actor-Critic (A2C) algorithm with two state representation strategies: Feature-Centric (FC) and Object-Centric (OC), evaluated across multiple rule-based and trial-list-based experimental setups.

Result: Evaluation across rule-based and trial-list-based setups reveals performance differences between the FC and OC representations, along with transfer effects and impacts on learning efficiency.

Conclusion: The study demonstrates effective reinforcement learning approaches for inferring hidden rules in complex puzzle environments, with representation strategy affecting learning efficiency and transfer capabilities.

Abstract: We investigate reinforcement learning in the Game Of Hidden Rules (GOHR) environment, a complex puzzle in which an agent must infer and execute hidden rules to clear a 6$\times$6 board by placing game pieces into buckets. We explore two state representation strategies, namely Feature-Centric (FC) and Object-Centric (OC), and employ a Transformer-based Advantage Actor-Critic (A2C) algorithm for training. The agent has access only to partial observations and must simultaneously infer the governing rule and learn the optimal policy through experience. We evaluate our models across multiple rule-based and trial-list-based experimental setups, analyzing transfer effects and the impact of representation on learning efficiency.
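
The training objective is standard advantage actor-critic, independent of the Transformer encoder or the choice of FC versus OC representation. A minimal sketch of the per-batch loss:

```python
import torch

def a2c_loss(logits, values, actions, returns, beta=0.01):
    """Standard A2C objective: policy gradient weighted by the advantage,
    a value-regression term, and an entropy bonus for exploration.
    logits: (batch, n_actions); values, actions, returns: (batch,)."""
    dist = torch.distributions.Categorical(logits=logits)
    advantage = returns - values.detach()          # no gradient into the critic here
    policy_loss = -(dist.log_prob(actions) * advantage).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + 0.5 * value_loss - beta * dist.entropy().mean()
```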

[391] CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction

Hongzong Li, Jiahao Ma, Zhanpeng Shi, Rui Xiao, Fanming Jin, Ye-Fan Hu, Jian-Dong Huang

Main category: cs.LG

TL;DR: CAME-AB is a cross-modality attention framework with MoE backbone for antibody binding site prediction, integrating five biological modalities and outperforming existing methods.

DetailsMotivation: Existing antibody binding site prediction methods rely on single-view features and fail to identify antibody-specific binding sites on antigens effectively.

Method: Integrates five biological modalities (amino acid encodings, BLOSUM profiles, language model embeddings, structure features, GCN-refined graphs) with adaptive modality fusion, Transformer encoder, MoE module, supervised contrastive learning, and stochastic weight averaging.

Result: Consistently outperforms strong baselines on multiple metrics (Precision, Recall, F1-score, AUC-ROC, MCC) on benchmark antibody-antigen datasets.

Conclusion: The multimodal integration and architectural components are effective, with ablation studies validating each component’s contribution to improved antibody binding site prediction.

Abstract: Antibody binding site prediction plays a pivotal role in computational immunology and therapeutic antibody design. Existing sequence or structure methods rely on single-view features and fail to identify antibody-specific binding sites on the antigens. In this paper, we propose \textbf{CAME-AB}, a novel Cross-modality Attention framework with a Mixture-of-Experts (MoE) backbone for robust antibody binding site prediction. CAME-AB integrates five biologically grounded modalities, including raw amino acid encodings, BLOSUM substitution profiles, pretrained language model embeddings, structure-aware features, and GCN-refined biochemical graphs, into a unified multimodal representation. To enhance adaptive cross-modal reasoning, we propose an \emph{adaptive modality fusion} module that learns to dynamically weight each modality based on its global relevance and input-specific contribution. A Transformer encoder combined with an MoE module further promotes feature specialization and capacity expansion. We additionally incorporate a supervised contrastive learning objective to explicitly shape the latent space geometry, encouraging intra-class compactness and inter-class separability. To improve optimization stability and generalization, we apply stochastic weight averaging during training. Extensive experiments on benchmark antibody-antigen datasets demonstrate that CAME-AB consistently outperforms strong baselines on multiple metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation studies further validate the effectiveness of each architectural component and the benefit of multimodal feature integration. The model implementation details and the codes are available on https://anonymous.4open.science/r/CAME-AB-C525
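
A generic sketch of gated, input-dependent modality weighting in the spirit of the adaptive fusion module; the gate design and tensor layout are our assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AdaptiveModalityFusion(nn.Module):
    """Score each modality embedding with a small gate, softmax-normalize
    the scores per input, and return the weighted sum."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, feats):                       # feats: (batch, n_modalities, dim)
        scores = self.gate(feats).squeeze(-1)       # (batch, n_modalities)
        w = torch.softmax(scores, dim=-1)           # per-input modality weights
        return (w.unsqueeze(-1) * feats).sum(dim=1) # (batch, dim)
```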

[392] Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation

Victor Toscano-Duran, Rocio Gonzalez-Diaz, Miguel A. Gutiérrez-Naranjo

Main category: cs.LG

TL;DR: Proposes Barycentric Neural Network (BNN) with fixed base points and barycentric coordinates for exact representation of continuous piecewise linear functions, combined with new length-weighted persistent entropy (LWPE) loss for superior approximation performance.

DetailsMotivation: Address computational costs of deep/overparameterized networks by developing small shallow networks that can exactly represent continuous piecewise linear functions for flexible and interpretable approximation in resource-constrained settings.

Method: Uses BNN architecture with fixed base points and barycentric coordinates, introduces LWPE (length-weighted persistent entropy) as stable topological feature, and optimizes base points directly rather than internal weights with LWPE-based loss function.

Result: Experimental results show superior and faster approximation performance compared to classical loss functions (MSE, RMSE, MAE, log-cosh) in resource-constrained settings with limited base points and training epochs.

Conclusion: BNN combined with LWPE loss provides flexible, geometrically interpretable approximations of nonlinear continuous functions, achieving exact representation of CPLFs and better performance than traditional approaches with lower computational costs.

Abstract: While it is well-established that artificial neural networks are universal approximators for continuous functions on compact domains, many modern approaches rely on deep or overparameterized architectures that incur high computational costs. In this paper, a new type of small shallow neural network, called the Barycentric Neural Network (BNN), is proposed, which leverages a fixed set of base points and their barycentric coordinates to define both its structure and its parameters. We demonstrate that our BNN enables the exact representation of continuous piecewise linear functions (CPLFs), ensuring strict continuity across segments. Since any continuous function over a compact domain can be approximated arbitrarily well by CPLFs, the BNN naturally emerges as a flexible and interpretable tool for function approximation. Beyond the use of this representation, the main contribution of the paper is the introduction of a new variant of persistent entropy, a topological feature that is stable and scale invariant, called the length-weighted persistent entropy (LWPE), which is weighted by the lifetime of topological features. Our framework, which combines the BNN with a loss function based on our LWPE, aims to provide flexible and geometrically interpretable approximations of nonlinear continuous functions in resource-constrained settings, such as those with limited base points for BNN design and few training epochs. Instead of optimizing internal weights, our approach directly optimizes the base points that define the BNN. Experimental results show that our approach achieves superior and faster approximation performance compared to classical loss functions such as MSE, RMSE, MAE, and log-cosh.
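
In one dimension, the barycentric representation of a CPLF is simply a convex combination of the two bracketing base-point values. A sketch of evaluation under that reading (the paper's training loop optimizes the base points themselves, which is not shown):

```python
import numpy as np

def cplf(x, base_points, base_values):
    """Evaluate a continuous piecewise-linear function: within each
    segment the output interpolates the endpoint values with barycentric
    (convex) weights. base_points must be sorted ascending."""
    base_points = np.asarray(base_points)
    base_values = np.asarray(base_values)
    i = np.clip(np.searchsorted(base_points, x) - 1, 0, len(base_points) - 2)
    t = (x - base_points[i]) / (base_points[i + 1] - base_points[i])
    return (1 - t) * base_values[i] + t * base_values[i + 1]
```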

cs.MA

[393] Efficient Multi-Agent Coordination via Dynamic Joint-State Graph Construction

Yanlin Zhou, Manshi Limbu, Xuesu Xiao

Main category: cs.MA

TL;DR: This paper introduces TCGRE, a multi-agent coordination problem where agents collaborate to reduce traversal costs on risky edges. The problem is proven NP-hard and solved using efficient decomposition methods including JSG, CES, RHOCA*, and a novel Dynamic-HJSG approach that reduces computational complexity while preserving optimality.

DetailsMotivation: Real-world multi-agent applications require active coordination beyond collision avoidance to improve team performance, particularly when dealing with high-risk edges that benefit from teammate support.

Method: Reformulated TCGRE as a 3D matching problem, proved NP-hardness, and developed decomposition methods: Joint-State Graph (JSG), Coordination-Exhaustive Search (CES), Receding-Horizon Optimistic Cooperative A* (RHOCA*), and Dynamic-HJSG for dynamic graph construction with state pruning.

Result: Theoretical analysis shows Dynamic-HJSG preserves optimality while reducing complexity from exponential to polynomial. Empirical results demonstrate scalability for large teams and graphs, with HJSG significantly outperforming baselines in runtime across different graph sizes and types.

Conclusion: This work bridges combinatorial optimization and multi-agent planning, providing a principled framework for collaborative pathfinding with provable guarantees. The solution approach can be extended to other collaborative optimization problems like MAPF.

Abstract: Multi-agent pathfinding (MAPF) traditionally focuses on collision avoidance, but many real-world applications require active coordination between agents to improve team performance. This paper introduces Team Coordination on Graphs with Risky Edges (TCGRE), where agents collaborate to reduce traversal costs on high-risk edges via support from teammates. We reformulate TCGRE as a 3D matching problem-mapping robot pairs, support pairs, and time steps-and rigorously prove its NP-hardness via reduction from Minimum 3D Matching. To address this complexity, (in the conference version) we proposed efficient decomposition methods, reducing the problem to tractable subproblems: Joint-State Graph (JSG): Encodes coordination as a single-agent shortest-path problem. Coordination-Exhaustive Search (CES): Optimizes support assignments via exhaustive pairing. Receding-Horizon Optimistic Cooperative A* (RHOCA*): Balances optimality and scalability via horizon-limited planning. Further in this extension, we introduce a dynamic graph construction method (Dynamic-HJSG), leveraging agent homogeneity to prune redundant states and reduce computational overhead by constructing the joint-state graph dynamically. Theoretical analysis shows Dynamic-HJSG preserves optimality while lowering complexity from exponential to polynomial in key cases. Empirical results validate scalability for large teams and graphs, with HJSG greatly outperforming baselines in runtime across graphs of different sizes and types. This work bridges combinatorial optimization and multi-agent planning, offering a principled framework for collaborative pathfinding with provable guarantees, and the key idea of the solution can be widely extended to many other collaborative optimization problems, such as MAPF.
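
The JSG reduction is the most code-friendly of the decompositions: plan over pairs of robot positions so that support on risky edges is just an edge cost in a product graph. A two-robot Dijkstra sketch with placeholder `neighbors` and `edge_cost` functions (the paper's pruning and homogeneity tricks are omitted):

```python
import heapq

def joint_shortest_path(neighbors, edge_cost, start, goal):
    """Dijkstra over joint states (pos_a, pos_b). `neighbors(u)` returns
    a list of adjacent nodes; `edge_cost(u, v, partner)` is the cost of
    traversing (u, v) given the partner's position, cheaper when the
    partner stands at a supporting node and zero when u == v (waiting)."""
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, (a, b) = heapq.heappop(pq)
        if (a, b) == goal:
            return d
        if d > dist[(a, b)]:
            continue
        for na in neighbors(a) + [a]:          # move or wait
            for nb in neighbors(b) + [b]:
                nd = d + edge_cost(a, na, b) + edge_cost(b, nb, a)
                if nd < dist.get((na, nb), float("inf")):
                    dist[(na, nb)] = nd
                    heapq.heappush(pq, (nd, (na, nb)))
    return float("inf")
```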

[394] Adaptive Evolutionary Framework for Safe, Efficient, and Cooperative Autonomous Vehicle Interactions

Zhen Tian, Zhihao Lin

Main category: cs.MA

TL;DR: Proposes CEGT framework using Evolutionary Game Theory with causal evaluation to optimize AV interactions, outperforming traditional methods in safety and efficiency.

DetailsMotivation: Address challenges in autonomous vehicle interactions including lack of centralized control, need to balance passenger demands and traffic efficiency, and limitations of traditional rule-based, optimization-based, and game-theoretic approaches.

Method: Evolutionary Game Theory (EGT) framework with causal evaluation module (CEGT) that uses decentralized adaptive strategy evolution and optimizes evolutionary rate by learning from historical interactions.

Result: CEGT outperforms EGT and benchmark games (Nash, Stackelberg) with lower collision rates, improved safety distances, higher speeds, and better overall performance across diverse scenarios.

Conclusion: The proposed CEGT framework effectively addresses AV interaction challenges through adaptive evolutionary mechanisms and causal learning, demonstrating superior performance over traditional game-theoretic approaches.

Abstract: Modern transportation systems face significant challenges in ensuring road safety, given serious injuries caused by road accidents. The rapid growth of autonomous vehicles (AVs) has prompted new traffic designs that aim to optimize interactions among AVs. However, effective interactions between AVs remains challenging due to the absence of centralized control. Besides, there is a need for balancing multiple factors, including passenger demands and overall traffic efficiency. Traditional rule-based, optimization-based, and game-theoretic approaches each have limitations in addressing these challenges. Rule-based methods struggle with adaptability and generalization in complex scenarios, while optimization-based methods often require high computational resources. Game-theoretic approaches, such as Stackelberg and Nash games, suffer from limited adaptability and potential inefficiencies in cooperative settings. This paper proposes an Evolutionary Game Theory (EGT)-based framework for AV interactions that overcomes these limitations by utilizing a decentralized and adaptive strategy evolution mechanism. A causal evaluation module (CEGT) is introduced to optimize the evolutionary rate, balancing mutation and evolution by learning from historical interactions. Simulation results demonstrate the proposed CEGT outperforms EGT and popular benchmark games in terms of lower collision rates, improved safety distances, higher speeds, and overall better performance compared to Nash and Stackelberg games across diverse scenarios and parameter settings.
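
The EGT backbone is the replicator dynamics; CEGT's addition is adapting the evolutionary rate from historical interactions. A one-step Euler sketch of the baseline dynamics, where dt plays the role of the (here fixed) evolutionary rate:

```python
import numpy as np

def replicator_step(x, payoff, dt=0.01):
    """One Euler step of x_i' = x_i * (f_i - f_bar) for strategy shares x
    (sums to 1) and payoff matrix A, with f = A @ x."""
    f = payoff @ x            # fitness of each strategy against the population
    x = x + dt * x * (f - x @ f)
    return x / x.sum()        # renormalize against Euler drift
```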

[395] Bio-inspired decision making in swarms under biases from stubborn robots, corrupted communication, and independent discovery

Raina Zakir, Timoteo Carletti, Marco Dorigo, Andreagiovanni Reina

Main category: cs.MA

TL;DR: Minimal robot swarms can achieve reliable consensus despite individual sensing errors, with bio-inspired cross-inhibition outperforming direct-switch mechanisms in biased conditions.

DetailsMotivation: Coordinating decentralized minimal robot swarms is challenging due to communication, computation, and memory constraints, especially when robots make sensing errors.

Method: Compared two opinion dynamics mechanisms (direct-switch and cross-inhibition) with generalized mean-field models that incorporate asocial biases influencing opinion dynamics.

Result: Cross-inhibition enables faster, more cohesive, accurate, robust, and scalable decisions across biased conditions, while direct-switch deteriorates and causes decision deadlocks.

Conclusion: Bio-inspired cross-inhibition provides superior coordination for minimal swarms, offering insights for decentralized decision-making systems in both biology and engineering.

Abstract: Minimalistic robot swarms offer a scalable, robust, and cost-effective approach to performing complex tasks with the potential to transform applications in healthcare, disaster response, and environmental monitoring. However, coordinating such decentralised systems remains a fundamental challenge, particularly when robots are constrained in communication, computation, and memory. In our study, individual robots frequently make errors when sensing the environment, yet the swarm can rapidly and reliably reach consensus on the best among $n$ discrete options. We compare two canonical mechanisms of opinion dynamics – direct-switch and cross-inhibition – which are simple yet effective rules for collective information processing observed in biological systems across scales, from neural populations to insect colonies. We generalise the existing mean-field models by considering asocial biases influencing the opinion dynamics. While swarms using direct-switch reliably select the best option in absence of asocial dynamics, their performance deteriorates once such biases are introduced, often resulting in decision deadlocks. In contrast, bio-inspired cross-inhibition enables faster, more cohesive, accurate, robust, and scalable decisions across a wide range of biased conditions. Our findings provide theoretical and practical insights into the coordination of minimal swarms and offer insights that extend to a broad class of decentralised decision-making systems in biology and engineering.
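
A toy agent-based rendering of the two mechanisms for two options, useful for seeing how cross-inhibition breaks deadlocks. This is a schematic model, not the paper's mean-field equations with asocial biases; the recruitment rule and quality weighting are our simplifications:

```python
import numpy as np

def simulate(n=1000, q=(0.9, 0.8), steps=200_000,
             cross_inhibition=True, seed=0):
    """Opinions: -1 = uncommitted, 0/1 = committed to an option. A sender
    committed to option i speaks with probability q[i]; an uncommitted
    receiver is recruited, while a disagreeing committed receiver either
    adopts the sender's opinion (direct-switch) or drops to uncommitted
    (cross-inhibition). Returns the final support for each option."""
    rng = np.random.default_rng(seed)
    opinion = rng.integers(-1, 2, size=n)
    for _ in range(steps):
        s, r = rng.integers(n, size=2)
        if s == r or opinion[s] < 0 or rng.random() >= q[opinion[s]]:
            continue
        if opinion[r] < 0:
            opinion[r] = opinion[s]                  # recruitment
        elif opinion[r] != opinion[s]:
            opinion[r] = -1 if cross_inhibition else opinion[s]
    return [(opinion == k).mean() for k in (0, 1)]
```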

[396] Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference

Xiyu Guo, Shan Wang, Chunfang Ji, Xuefeng Zhao, Wenhao Xi, Yaoyao Liu, Qinglan Li, Chao Deng, Junlan Feng

Main category: cs.MA

TL;DR: MoMA is a generalized routing framework that combines LLM and agent-based routing to efficiently handle diverse user queries by directing them to the most appropriate execution units based on cost-performance optimization.

DetailsMotivation: The diversity of user queries spanning multiple domains and task types creates complex routing challenges in AI service ecosystems, requiring accurate query direction while optimizing both performance and efficiency.

Method: MoMA integrates LLM and agent-based routing with precise intent recognition and adaptive routing strategies. It uses a detailed training dataset to profile LLM capabilities under different routing structures, dynamically routes each query to the LLM with the best cost-performance trade-off, and selects agents via a context-aware state machine with dynamic masking.

Result: Experimental results show that MoMA offers superior cost-efficiency and scalability compared to existing routing approaches.

Conclusion: MoMA provides an effective solution for handling heterogeneous query landscapes through intelligent routing that balances efficiency and cost, demonstrating practical advantages over current methods.

Abstract: The rapid advancement of large language models (LLMs) and domain-specific AI agents has greatly expanded the ecosystem of AI-powered services. User queries, however, are highly diverse and often span multiple domains and task types, resulting in a complex and heterogeneous landscape. This diversity presents a fundamental routing challenge: how to accurately direct each query to an appropriate execution unit while optimizing both performance and efficiency. To address this, we propose MoMA (Mixture of Models and Agents), a generalized routing framework that integrates both LLM and agent-based routing. Built upon a deep understanding of model and agent capabilities, MoMA effectively handles diverse queries through precise intent recognition and adaptive routing strategies, achieving an optimal balance between efficiency and cost. Specifically, we construct a detailed training dataset to profile the capabilities of various LLMs under different routing model structures, identifying the most suitable tasks for each LLM. During inference, queries are dynamically routed to the LLM with the best cost-performance efficiency. We also introduce an efficient agent selection strategy based on a context-aware state machine and dynamic masking. Experimental results demonstrate that the MoMA router offers superior cost-efficiency and scalability compared to existing approaches.
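
As a rough illustration of cost-performance routing (not MoMA's actual policy), a router can profile each candidate per intent and pick the one maximizing a score that trades accuracy against cost; all names and numbers below are hypothetical:

```python
# Hypothetical sketch of cost-performance routing in the spirit of MoMA.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    expected_accuracy: float   # profiled per intent from a training set
    cost_per_1k_tokens: float

PROFILE = {
    "math":     [Candidate("small-llm", 0.62, 0.1), Candidate("large-llm", 0.90, 1.0)],
    "chitchat": [Candidate("small-llm", 0.88, 0.1), Candidate("large-llm", 0.91, 1.0)],
}

def route(intent: str, cost_weight: float = 0.3) -> Candidate:
    """Pick the candidate maximizing accuracy minus a weighted cost penalty."""
    return max(PROFILE[intent],
               key=lambda c: c.expected_accuracy - cost_weight * c.cost_per_1k_tokens)

print(route("math").name)      # large-llm: the accuracy gap justifies the cost
print(route("chitchat").name)  # small-llm: near-equal accuracy, much cheaper
```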

[397] Grid-Agent: An LLM-Powered Multi-Agent System for Power Grid Control

Yan Zhang, Ahmad Mohammad Saber, Amr Youssef, Deepa Kundur

Main category: cs.MA

TL;DR: Grid-Agent is an AI framework using LLMs and multi-agent systems to autonomously detect and fix power grid violations through coordinated planning, validation, and adaptive network representation.

Motivation: Modern power grids face increasing complexity from DERs, EVs, and extreme weather, while being vulnerable to cyberattacks that can cause grid violations, requiring rapid adaptive response systems.

Method: Uses LLMs in a multi-agent system: a planning agent generates coordinated action sequences using power flow solvers, a validation agent ensures safety through sandboxed execution, and an adaptive multi-scale network representation provides scalability.

Result: Demonstrated superior mitigation performance on IEEE and CIGRE benchmark networks including IEEE 69-bus, CIGRE MV, and IEEE 30-bus test systems.

Conclusion: Grid-Agent is suitable for modern smart grids requiring rapid, adaptive response to violations through optimized switch configurations, battery deployment, and load curtailment.

Abstract: Modern power grids face unprecedented complexity from Distributed Energy Resources (DERs), Electric Vehicles (EVs), and extreme weather, while also being increasingly exposed to cyberattacks that can trigger grid violations. This paper introduces Grid-Agent, an autonomous AI-driven framework that leverages Large Language Models (LLMs) within a multi-agent system to detect and remediate violations. Grid-Agent integrates semantic reasoning with numerical precision through modular agents: a planning agent generates coordinated action sequences using power flow solvers, while a validation agent ensures stability and safety through sandboxed execution with rollback mechanisms. To enhance scalability, the framework employs an adaptive multi-scale network representation that dynamically adjusts encoding schemes based on system size and complexity. Violation resolution is achieved through optimizing switch configurations, battery deployment, and load curtailment. Our experiments on IEEE and CIGRE benchmark networks, including the IEEE 69-bus, CIGRE MV, and IEEE 30-bus test systems, demonstrate superior mitigation performance, highlighting Grid-Agent’s suitability for modern smart grids requiring rapid, adaptive response.
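
The plan-then-validate loop described above can be sketched as follows; the solver, action format, and toy numbers are hypothetical stand-ins for Grid-Agent's actual power flow tooling:

```python
# Illustrative plan/validate/rollback loop in the spirit of Grid-Agent.
import copy

def solve_power_flow(network):
    """Stand-in solver: returns the list of violations (overloaded lines)."""
    return [line for line in network["lines"] if line["loading_pct"] > 100.0]

def apply_action(network, action):
    if action["type"] == "open_switch":
        network["switches"][action["id"]] = "open"
        # Toy effect: opening the switch sheds 20% loading from every line.
        for line in network["lines"]:
            line["loading_pct"] *= 0.8

def plan_and_validate(network, candidate_actions):
    """Try actions in a sandboxed copy; commit only if violations shrink."""
    for action in candidate_actions:
        sandbox = copy.deepcopy(network)          # sandboxed execution
        apply_action(sandbox, action)
        if len(solve_power_flow(sandbox)) < len(solve_power_flow(network)):
            return sandbox                        # commit validated state
    return network                                # rollback: keep original

net = {"lines": [{"loading_pct": 115.0}, {"loading_pct": 90.0}],
       "switches": {0: "closed"}}
net = plan_and_validate(net, [{"type": "open_switch", "id": 0}])
print(solve_power_flow(net))  # [] once the overload is cleared
```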

cs.MM

eess.AS

[398] Identifying and Calibrating Overconfidence in Noisy Speech Recognition

Mingyue Huo, Yuheng Zhang, Yan Tang

Main category: eess.AS

TL;DR: A lightweight post-hoc calibration framework curbs Whisper ASR’s overconfidence in noise, cutting ECE by 58% and tripling NCE at low SNRs.

Motivation: Whisper ASR models exhibit overconfidence in noisy environments, assigning high confidence to incorrect predictions, which reduces reliability in low signal-to-noise ratio conditions.

Method: Propose a lightweight post-hoc calibration framework that detects potential overconfidence and applies temperature scaling selectively to problematic tokens without modifying the underlying ASR model.

Result: On R-SPIN dataset at low SNRs (-18 to -5 dB), the method reduces expected calibration error by 58% and triples normalized cross entropy, significantly improving confidence reliability.

Conclusion: The proposed selective temperature scaling approach effectively mitigates overconfidence in Whisper ASR under severe noise conditions while maintaining model integrity through post-hoc calibration.

Abstract: Modern end-to-end automatic speech recognition (ASR) models like Whisper not only suffer from reduced recognition accuracy in noise, but also exhibit overconfidence - assigning high confidence to wrong predictions. We conduct a systematic analysis of Whisper’s behavior in additive noise conditions and find that overconfident errors increase dramatically at low signal-to-noise ratios, with 10-20% of tokens incorrectly predicted with confidence above 0.7. To mitigate this, we propose a lightweight, post-hoc calibration framework that detects potential overconfidence and applies temperature scaling selectively to those tokens, without altering the underlying ASR model. Evaluations on the R-SPIN dataset demonstrate that, in the low signal-to-noise ratio range (-18 to -5 dB), our method reduces the expected calibration error (ECE) by 58% and triples the normalized cross entropy (NCE), yielding more reliable confidence estimates under severe noise conditions.
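
A minimal sketch of selective temperature scaling as described: detect tokens whose top probability exceeds a threshold and flatten only those distributions. The detection rule and temperature value are illustrative assumptions, not the paper's exact detector:

```python
# Selective post-hoc temperature scaling: rescale logits only for tokens
# flagged as potentially overconfident. Threshold and T are illustrative.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def selective_temperature_scaling(logits, threshold=0.7, T=2.0):
    """logits: (num_tokens, vocab). Apply T only where max prob > threshold."""
    probs = softmax(logits)
    overconfident = probs.max(axis=-1) > threshold
    scaled = logits.copy()
    scaled[overconfident] /= T          # flatten only the flagged tokens
    return softmax(scaled)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100)) * 4.0
calibrated = selective_temperature_scaling(logits)
print(calibrated.max(axis=-1))  # flagged tokens now carry lower confidence
```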

[399] Affine Modulation-based Audiogram Fusion Network for Joint Noise Reduction and Hearing Loss Compensation

Ye Ni, Ruiyu Liang, Xiaoshuai Hao, Jiaming Cheng, Qingyun Wang, Chengwei Huang, Cairong Zou, Wei Zhou, Weiping Ding, Björn W. Schuller

Main category: eess.AS

TL;DR: AFN-HearNet is a novel audiogram fusion network that jointly optimizes noise reduction and hearing loss compensation for hearing aids, achieving state-of-the-art performance with better efficiency.

Motivation: Current hearing aids treat noise reduction and hearing loss compensation as separate tasks, leading to suboptimal performance in noisy environments, lack of systematic optimization, and increased system complexity.

Method: Proposes AFN-HearNet with audiogram-specific encoder, affine modulation-based audiogram fusion frequency-temporal Conformer, and voice activity detection auxiliary task to fuse cross-domain audiogram and spectrum features for joint optimization.

Result: Significantly outperforms state-of-the-art in-context fusion joint models on key metrics (HASQI and PESQ) while achieving a favorable performance-efficiency trade-off across multiple datasets.

Conclusion: The proposed AFN-HearNet successfully addresses the limitations of separate task optimization in hearing aids by providing a unified framework that simultaneously handles noise reduction and hearing loss compensation through effective cross-domain feature fusion.

Abstract: Hearing aids (HAs) are widely used to provide personalized speech enhancement (PSE) services, improving the quality of life for individuals with hearing loss. However, HA performance significantly declines in noisy environments as it treats noise reduction (NR) and hearing loss compensation (HLC) as separate tasks. This separation leads to a lack of systematic optimization, overlooking the interactions between these two critical tasks, and increases the system complexity. To address these challenges, we propose a novel audiogram fusion network, named AFN-HearNet, which simultaneously tackles the NR and HLC tasks by fusing cross-domain audiogram and spectrum features. We propose an audiogram-specific encoder that transforms the sparse audiogram profile into a deep representation, addressing the alignment problem of cross-domain features prior to fusion. To incorporate the interactions between NR and HLC tasks, we propose the affine modulation-based audiogram fusion frequency-temporal Conformer that adaptively fuses these two features into a unified deep representation for speech reconstruction. Furthermore, we introduce a voice activity detection auxiliary training task to embed speech and non-speech patterns into the unified deep representation implicitly. We conduct comprehensive experiments across multiple datasets to validate the effectiveness of each proposed module. The results indicate that the AFN-HearNet significantly outperforms state-of-the-art in-context fusion joint models regarding key metrics such as HASQI and PESQ, achieving a considerable trade-off between performance and efficiency. The source code and data will be released at https://github.com/deepnetni/AFN-HearNet.
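
The affine-modulation fusion is FiLM-style conditioning: the audiogram is encoded and mapped to per-channel scales and shifts applied to the spectral features. A minimal PyTorch sketch with illustrative layer sizes (the paper embeds this inside a frequency-temporal Conformer):

```python
# FiLM-style affine modulation of spectral features by an audiogram embedding.
import torch
import torch.nn as nn

class AudiogramFiLM(nn.Module):
    def __init__(self, audiogram_bins=8, channels=64):
        super().__init__()
        # Audiogram-specific encoder: sparse profile -> deep representation.
        self.encoder = nn.Sequential(nn.Linear(audiogram_bins, 128), nn.ReLU())
        self.to_gamma = nn.Linear(128, channels)  # per-channel scale
        self.to_beta = nn.Linear(128, channels)   # per-channel shift

    def forward(self, spec_feats, audiogram):
        # spec_feats: (B, C, T, F); audiogram: (B, audiogram_bins)
        h = self.encoder(audiogram)
        gamma = self.to_gamma(h)[:, :, None, None]
        beta = self.to_beta(h)[:, :, None, None]
        return gamma * spec_feats + beta          # affine modulation

film = AudiogramFiLM()
out = film(torch.randn(2, 64, 10, 32), torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 64, 10, 32])
```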

[400] Exploring System Adaptations For Minimum Latency Real-Time Piano Transcription

Patricia Hu, Silvan David Peter, Jan Schlüter, Gerhard Widmer

Main category: eess.AS

TL;DR: This paper adapts state-of-the-art online piano transcription models for real-time applications requiring latencies below 30ms by eliminating non-causal processing and optimizing computational efficiency.

Motivation: Most real-time musical applications require latencies below 30ms, but existing piano transcription approaches either target offline use or have delays of 128-320ms, creating a gap for true real-time performance.

Method: The authors eliminate all non-causal processing, reduce computational load through shared computations across model components, vary model size, and explore different pre-/postprocessing strategies and label encoding schemes suitable for real-time transcription.

Result: Evaluation on the MAESTRO dataset shows a drop in transcription accuracy due to strictly causal processing and reveals a tradeoff between preprocessing latency and prediction accuracy.

Conclusion: The system is released as a baseline to support researchers in designing models for minimum latency real-time piano transcription, addressing the gap between current online models and true real-time requirements.

Abstract: Advances in neural network design and the availability of large-scale labeled datasets have driven major improvements in piano transcription. Existing approaches target either offline applications, with no restrictions on computational demands, or online transcription, with delays of 128-320 ms. However, most real-time musical applications require latencies below 30 ms. In this work, we investigate whether and how the current state-of-the-art online transcription model can be adapted for real-time piano transcription. Specifically, we eliminate all non-causal processing, and reduce computational load through shared computations across core model components and variations in model size. Additionally, we explore different pre- and postprocessing strategies, and related label encoding schemes, and discuss their suitability for real-time transcription. Evaluating the adaptations on the MAESTRO dataset, we find a drop in transcription accuracy due to strictly causal processing as well as a tradeoff between the preprocessing latency and prediction accuracy. We release our system as a baseline to support researchers in designing models towards minimum latency real-time transcription.
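
One of the core adaptations, eliminating non-causal processing, amounts to making every convolution see only current and past frames. A minimal sketch with a left-padded (causal) convolution; sizes are illustrative:

```python
# Causal 1D convolution: no lookahead, so output at frame t depends only on
# frames <= t, a prerequisite for minimum-latency streaming transcription.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only sees current and past frames."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1           # put all padding on the left
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):                    # x: (B, C, T)
        return self.conv(F.pad(x, (self.pad, 0)))

conv = CausalConv1d(80, 128, kernel_size=5)
y = conv(torch.randn(1, 80, 100))
print(y.shape)  # torch.Size([1, 128, 100]); same length, zero lookahead
```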

[401] VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification

Pengyu Wang, Ying Fang, Xiaofei Li

Main category: eess.AS

TL;DR: VINP is a variational Bayesian framework that combines neural speech priors for joint speech dereverberation and blind RIR identification, achieving SOTA performance in both tasks.

Motivation: Reverberant speech contains information about both clean speech and room characteristics, but existing DL-based approaches often don't effectively serve ASR systems and lack joint estimation capabilities.

Method: Uses variational Bayesian inference with CTF approximation and integrates discriminative DNNs to estimate anechoic speech prior, producing MAP estimates for speech and ML estimates for CTF filters.

Result: Achieves SOTA performance in MOS and WER for dereverberation, and SOTA in RT60 estimation with advanced DRR estimation for blind RIR identification.

Conclusion: VINP effectively bridges probabilistic modeling with neural networks for joint speech and RIR estimation, demonstrating superior performance for both dereverberation and room characterization tasks.

Abstract: Reverberant speech, denoting the speech signal degraded by reverberation, contains crucial knowledge of both anechoic source speech and room impulse response (RIR). This work proposes a variational Bayesian inference (VBI) framework with neural speech prior (VINP) for joint speech dereverberation and blind RIR identification. In VINP, a probabilistic signal model is constructed in the time-frequency (T-F) domain based on convolution transfer function (CTF) approximation. For the first time, we propose using an arbitrary discriminative dereverberation deep neural network (DNN) to estimate the prior distribution of anechoic speech within a probabilistic model. By integrating both reverberant speech and the anechoic speech prior, VINP yields the maximum a posteriori (MAP) and maximum likelihood (ML) estimations of the anechoic speech spectrum and CTF filter, respectively. After simple transformations, the waveforms of anechoic speech and RIR are estimated. VINP is effective for automatic speech recognition (ASR) systems, which sets it apart from most deep learning (DL)-based single-channel dereverberation approaches. Experiments on single-channel speech dereverberation demonstrate that VINP attains state-of-the-art (SOTA) performance in mean opinion score (MOS) and word error rate (WER). For blind RIR identification, experiments demonstrate that VINP achieves SOTA performance in estimating reverberation time at 60 dB (RT60) and advanced performance in direct-to-reverberation ratio (DRR) estimation. Codes and audio samples are available online.
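
For reference, the CTF approximation underlying VINP's probabilistic model can be written in a standard form (notation mine, consistent with the abstract rather than copied from the paper):

```latex
% STFT-domain signal model under the CTF approximation:
% X: reverberant speech, S: anechoic speech, H: CTF filter, E: noise/error.
X(t,f) \;\approx\; \sum_{l=0}^{L-1} H(l,f)\, S(t-l,\,f) \;+\; E(t,f)

% VINP's two estimates, per the abstract: MAP for the speech spectrum,
% ML for the CTF filter, with p_\theta(S) the DNN-estimated speech prior.
\hat{S} = \arg\max_{S}\, p(X \mid S, H)\, p_{\theta}(S), \qquad
\hat{H} = \arg\max_{H}\, p(X \mid S, H)
```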

[402] Speaker Privacy and Security in the Big Data Era: Protection and Defense against Deepfake

Liping Chen, Kong Aik Lee, Zhen-Hua Ling, Xin Wang, Rohan Kumar Das, Tomoki Toda, Haizhou Li

Main category: eess.AS

TL;DR: Overview of three security techniques (voice anonymization, deepfake detection, watermarking) to combat deepfake speech threats, covering methodologies, advancements, and challenges.

Motivation: Address growing security risks from deepfake speech misuse in the big data era, which has caused significant societal costs worldwide.

Method: Provides a concise review of three defense techniques: voice anonymization (protects voice attributes), deepfake detection, and watermarking methodologies.

Result: Presents an overview of current advancements and challenges in deepfake speech defense techniques, with a promise of a more comprehensive version to be published later.

Conclusion: Systematic defense approaches combining voice protection and deepfake detection/watermarking are needed to combat the security threats posed by advanced personalized speech generation technologies.

Abstract: In the era of big data, remarkable advancements have been achieved in personalized speech generation techniques that utilize speaker attributes, including voice and speaking style, to generate deepfake speech. This has also amplified global security risks from deepfake speech misuse, resulting in considerable societal costs worldwide. To address the security threats posed by deepfake speech, techniques have been developed focusing on both the protection of voice attributes and the defense against deepfake speech. Among them, the voice anonymization technique has been developed to protect voice attributes from extraction for deepfake generation, while deepfake detection and watermarking have been utilized to defend against the misuse of deepfake speech. This paper provides a short and concise overview of the three techniques, describing the methodologies, advancements, and challenges. A comprehensive version, offering additional discussions, will be published in the near future.

eess.IV

[403] Physics-Guided Diffusion Transformer with Spherical Harmonic Posterior Sampling for High-Fidelity Angular Super-Resolution in Diffusion MRI

Mu Nan, Taohui Xiao, Ruoyou Wu, Shoujun Yu, Ye Li, Hairong Zheng, Shanshan Wang

Main category: eess.IV

TL;DR: Physics-guided diffusion transformer for dMRI angular super-resolution that combines q-space geometry modeling with physical constraints to achieve high-fidelity HAR reconstructions from limited LAR data.

Motivation: Existing dMRI angular super-resolution methods struggle to recover fine-grained angular details and maintain high fidelity due to inadequate modeling of q-space geometry and insufficient physical constraints.

Method: PGDiT uses a Q-space Geometry-Aware Module with b-vector modulation and random angular masking for training, and a two-stage Spherical Harmonics-Guided Posterior Sampling with heat-diffusion regularization for inference.

Result: Outperforms existing deep learning models in detail recovery and data fidelity on general ASR tasks and downstream applications (DTI and NODDI).

Conclusion: Presents a novel generative ASR framework that provides high-fidelity HAR dMRI reconstructions with potential applications in neuroscience and clinical research.

Abstract: Diffusion MRI (dMRI) angular super-resolution (ASR) aims to reconstruct high-angular-resolution (HAR) signals from limited low-angular-resolution (LAR) data without prolonging scan time. However, existing methods are limited in recovering fine-grained angular details or preserving high fidelity due to inadequate modeling of q-space geometry and insufficient incorporation of physical constraints. In this paper, we introduce a Physics-Guided Diffusion Transformer (PGDiT) designed to explore physical priors throughout both training and inference stages. During training, a Q-space Geometry-Aware Module (QGAM) with b-vector modulation and random angular masking facilitates direction-aware representation learning, enabling the network to generate directionally consistent reconstructions with fine angular details from sparse and noisy data. In inference, a two-stage Spherical Harmonics-Guided Posterior Sampling (SHPS) enforces alignment with the acquired data, followed by heat-diffusion-based SH regularization to ensure physically plausible reconstructions. This coarse-to-fine refinement strategy mitigates oversmoothing and artifacts commonly observed in purely data-driven or generative models. Extensive experiments on general ASR tasks and two downstream applications, Diffusion Tensor Imaging (DTI) and Neurite Orientation Dispersion and Density Imaging (NODDI), demonstrate that PGDiT outperforms existing deep learning models in detail recovery and data fidelity. Our approach presents a novel generative ASR framework that offers high-fidelity HAR dMRI reconstructions, with potential applications in neuroscience and clinical research.
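
The spherical-harmonic machinery used for guidance boils down to fitting SH coefficients to signals sampled on the q-space sphere. A simplified least-squares sketch with a real, even-order basis (a stand-in for illustration, not PGDiT's actual sampling procedure):

```python
# Least-squares spherical-harmonic fit of a single dMRI shell (simplified).
import numpy as np
from scipy.special import sph_harm

def real_sh_basis(theta, phi, lmax=4):
    """Real symmetric SH basis at polar angle theta, azimuth phi."""
    cols = []
    for l in range(0, lmax + 1, 2):        # even orders only (antipodal symmetry)
        for m in range(-l, l + 1):
            y = sph_harm(abs(m), l, phi, theta)  # scipy order: (m, l, azimuth, polar)
            if m < 0:
                cols.append(np.sqrt(2) * y.imag)
            elif m == 0:
                cols.append(y.real)
            else:
                cols.append(np.sqrt(2) * y.real)
    return np.stack(cols, axis=-1)          # (n_dirs, n_coeffs)

rng = np.random.default_rng(0)
n_dirs = 30                                 # low-angular-resolution shell
theta = np.arccos(rng.uniform(-1, 1, n_dirs))
phi = rng.uniform(0, 2 * np.pi, n_dirs)
B = real_sh_basis(theta, phi)
signal = rng.normal(0.5, 0.1, n_dirs)       # placeholder dMRI measurements
coeffs, *_ = np.linalg.lstsq(B, signal, rcond=None)
print(coeffs.shape)                         # (15,) for lmax=4
```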

[404] PUUMA (Placental patch and whole-Uterus dual-branch U-Mamba-based Architecture): Functional MRI Prediction of Gestational Age at Birth and Preterm Risk

Diego Fajardo-Rojas, Levente Baljer, Jordina Aviles Verdera, Megan Hall, Daniel Cromb, Mary A. Rutherford, Lisa Story, Emma C. Robinson, Jana Hutter

Main category: eess.IV

TL;DR: PUUMA is a dual-branch deep learning model that uses T2* fetal MRI to predict gestational age at birth with a 3-week mean absolute error and good sensitivity for preterm detection, comparable to manual cervical length measurements.

Motivation: Preterm birth causes significant childhood mortality and morbidity, but current clinical predictors are limited due to complex multifactorial origins, necessitating better prediction methods.

Method: Developed PUUMA dual-branch deep learning architecture using T2* fetal MRI from 295 pregnancies, integrating global whole-uterus and local placental features, benchmarked against linear regression with cervical length measurements and other DL models.

Result: Achieved a mean absolute error of 3 weeks and a sensitivity of 0.67 for preterm birth detection, comparable to cervical length regression despite pronounced class imbalance. Demonstrated the value of whole-uterus functional imaging and of manual cervical length measurements from MRI.

Conclusion: Proof of concept for automated GA prediction from functional MRI, highlighting value of whole-uterus imaging. Future work will expand cohort size and incorporate additional organ-specific imaging to improve generalizability.

Abstract: Preterm birth is a major cause of mortality and lifelong morbidity in childhood. Its complex and multifactorial origins limit the effectiveness of current clinical predictors and impede optimal care. In this study, a dual-branch deep learning architecture (PUUMA) was developed to predict gestational age (GA) at birth using T2* fetal MRI data from 295 pregnancies, encompassing a heterogeneous and imbalanced population. The model integrates both global whole-uterus and local placental features. Its performance was benchmarked against linear regression using cervical length measurements obtained by experienced clinicians from anatomical MRI and other Deep Learning architectures. The GA at birth predictions were assessed using mean absolute error. Accuracy, sensitivity, and specificity were used to assess preterm classification. Both the fully automated MRI-based pipeline and the cervical length regression achieved comparable mean absolute errors (3 weeks) and good sensitivity (0.67) for detecting preterm birth, despite pronounced class imbalance in the dataset. These results provide a proof of concept for automated prediction of GA at birth from functional MRI, and underscore the value of whole-uterus functional imaging in identifying at-risk pregnancies. Additionally, we demonstrate that manual, high-definition cervical length measurements derived from MRI, not currently routine in clinical practice, offer valuable predictive information. Future work will focus on expanding the cohort size and incorporating additional organ-specific imaging to improve generalisability and predictive performance.
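
The dual-branch idea can be sketched as two encoders, one for the whole-uterus volume and one for the placental patch, whose features are concatenated for GA regression. The encoders below are illustrative toy CNNs; the paper's branches are U-Mamba-based:

```python
# Toy dual-branch regressor in the spirit of PUUMA (illustrative sizes).
import torch
import torch.nn as nn

def small_encoder(out_dim=64):
    return nn.Sequential(
        nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, out_dim))

class DualBranchGA(nn.Module):
    def __init__(self):
        super().__init__()
        self.global_branch = small_encoder()  # whole-uterus T2* volume
        self.local_branch = small_encoder()   # placental patch
        self.head = nn.Linear(128, 1)         # gestational age (weeks)

    def forward(self, uterus_vol, placenta_patch):
        z = torch.cat([self.global_branch(uterus_vol),
                       self.local_branch(placenta_patch)], dim=1)
        return self.head(z)

model = DualBranchGA()
ga = model(torch.randn(2, 1, 32, 32, 32), torch.randn(2, 1, 16, 16, 16))
print(ga.shape)  # torch.Size([2, 1])
```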

[405] Evaluation of Machine Learning Reconstruction Techniques for Accelerated Brain MRI Scans

Jonathan I. Mandel, Shivaprakash Hiremath, Hedyeh Keshtgar, Timothy Scholl, Sadegh Raeisi

Main category: eess.IV

TL;DR: Deep learning MRI reconstruction enables 4x faster brain scans while maintaining diagnostic quality, with 95% of AI-reconstructed images rated as good or excellent.

Motivation: To address the need for faster MRI scanning times while preserving diagnostic image quality, enabling improved workflow efficiency and patient throughput.

Method: Used DeepFoqus-Accelerate algorithm to reconstruct phase-encoding-undersampled 2D/3D T1, T2, and FLAIR sequences from both public datasets and prospective clinical data. Evaluated with expert radiologist reviews (5-point Likert scale) and quantitative metrics (SSIM, PSNR, HaarPSI).

Result: No AI-reconstructed scan scored below acceptable (≥3), 95% scored ≥4 (good/excellent). Mean SSIM 0.95±0.03, PSNR >41.0 dB, HaarPSI >0.94. Achieved 75% scan time reduction with preserved diagnostic quality.

Conclusion: DeepFoqus-Accelerate enables robust fourfold acceleration of brain MRI while maintaining diagnostic image quality, supporting improved clinical workflow efficiency.

Abstract: This retrospective-prospective study evaluated whether a deep learning-based MRI reconstruction algorithm can preserve diagnostic quality in brain MRI scans accelerated up to fourfold, using both public and prospective clinical data. The study included 18 healthy volunteers (scans acquired at 3T, January 2024-March 2025), as well as selected fastMRI public datasets with diverse pathologies. Phase-encoding-undersampled 2D/3D T1, T2, and FLAIR sequences were reconstructed with DeepFoqus-Accelerate and compared with standard-of-care (SOC). Three board-certified neuroradiologists and two MRI technologists independently reviewed 36 paired SOC/AI reconstructions from both datasets using a 5-point Likert scale, while quantitative similarity was assessed for 408 scans and 1224 datasets using Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Haar wavelet-based Perceptual Similarity Index (HaarPSI). No AI-reconstructed scan scored below 3 (minimally acceptable), and 95% scored $\geq 4$. Mean SSIM was 0.95 $\pm$ 0.03 (90% of cases >0.90), PSNR >41.0 dB, and HaarPSI >0.94. Inter-rater agreement was slight to moderate. Rare artifacts did not affect diagnostic interpretation. These findings demonstrate that DeepFoqus-Accelerate enables robust fourfold brain MRI acceleration with 75% reduced scan time, while preserving diagnostic image quality and supporting improved workflow efficiency.
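
The SSIM and PSNR figures above can be reproduced in principle with scikit-image; a minimal sketch on synthetic data (HaarPSI has no scikit-image implementation and is omitted):

```python
# Computing the study's similarity metrics (SSIM, PSNR) with scikit-image.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

rng = np.random.default_rng(0)
reference = rng.uniform(size=(256, 256))                        # SOC image
accelerated = reference + rng.normal(0, 0.01, reference.shape)  # AI recon
data_range = accelerated.max() - accelerated.min()

ssim = structural_similarity(reference, accelerated, data_range=data_range)
psnr = peak_signal_noise_ratio(reference, accelerated, data_range=data_range)
print(f"SSIM={ssim:.3f}, PSNR={psnr:.1f} dB")
```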

[406] Enhanced SegNet with Integrated Grad-CAM for Interpretable Retinal Layer Segmentation in OCT Images

S M Asiful Islam Saky, Ugyen Tshering

Main category: eess.IV

TL;DR: Improved SegNet-based framework for automated retinal layer segmentation in OCT images with architectural modifications, hybrid loss function, and Grad-CAM for interpretability, achieving high accuracy and clinical relevance.

Motivation: Manual retinal layer segmentation in OCT is time-consuming and variable, while conventional deep learning models lack interpretability needed for clinical decision-making.

Method: SegNet-based architecture with modified pooling strategies, hybrid loss function (categorical cross-entropy + Dice loss), and integrated Grad-CAM for visual explanations.

Result: Achieved 95.77% validation accuracy, Dice coefficient of 0.9446, and Jaccard Index of 0.8951 on Duke OCT dataset, with robust performance across most retinal layers.

Conclusion: The framework bridges accuracy and interpretability, offering potential for standardizing OCT analysis, enhancing diagnostic efficiency, and fostering clinical trust in AI-driven ophthalmic tools.

Abstract: Optical Coherence Tomography (OCT) is essential for diagnosing conditions such as glaucoma, diabetic retinopathy, and age-related macular degeneration. Accurate retinal layer segmentation enables quantitative biomarkers critical for clinical decision-making, but manual segmentation is time-consuming and variable, while conventional deep learning models often lack interpretability. This work proposes an improved SegNet-based deep learning framework for automated and interpretable retinal layer segmentation. Architectural innovations, including modified pooling strategies, enhance feature extraction from noisy OCT images, while a hybrid loss function combining categorical cross-entropy and Dice loss improves performance for thin and imbalanced retinal layers. Gradient-weighted Class Activation Mapping (Grad-CAM) is integrated to provide visual explanations, allowing clinical validation of model decisions. Trained and validated on the Duke OCT dataset, the framework achieved 95.77% validation accuracy, a Dice coefficient of 0.9446, and a Jaccard Index (IoU) of 0.8951. Class-wise results confirmed robust performance across most layers, with challenges remaining for thinner boundaries. Grad-CAM visualizations highlighted anatomically relevant regions, aligning segmentation with clinical biomarkers and improving transparency. By combining architectural improvements, a customized hybrid loss, and explainable AI, this study delivers a high-performing SegNet-based framework that bridges the gap between accuracy and interpretability. The approach offers strong potential for standardizing OCT analysis, enhancing diagnostic efficiency, and fostering clinical trust in AI-driven ophthalmic tools.
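
The hybrid loss is the part most easily shown in code: categorical cross-entropy plus a Dice term that protects thin, imbalanced layers. A minimal PyTorch sketch with an assumed 0.5 Dice weight (the paper's exact weighting is not stated here):

```python
# Hybrid segmentation loss: cross-entropy + Dice, per the described framework.
import torch
import torch.nn.functional as F

def hybrid_loss(logits, target, n_classes, dice_weight=0.5, eps=1e-6):
    """logits: (B, C, H, W); target: (B, H, W) integer layer labels."""
    ce = F.cross_entropy(logits, target)
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, n_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))          # per-class overlap
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()
    return ce + dice_weight * dice

loss = hybrid_loss(torch.randn(2, 9, 64, 64),
                   torch.randint(0, 9, (2, 64, 64)), n_classes=9)
print(loss.item())
```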

[407] Involution and BSConv Multi-Depth Distillation Network for Lightweight Image Super-Resolution

Akram Khatami-Rizi, Ahmad Mahmoudi-Aznaveh

Main category: eess.IV

TL;DR: A lightweight single-image super-resolution network called IBMDN that uses involution and BSConv multi-depth distillation blocks to reduce computational complexity while maintaining accuracy.

Motivation: Deep CNN architectures for super-resolution have excessive parameters and computational costs, limiting their use on resource-constrained devices. There's a need for lightweight models that preserve accuracy.

Method: Proposes IBMDN with Involution and BSConv Multi-Depth Distillation Blocks (IBMDB) for efficient feature extraction and Contrast and High-Frequency Attention Block (CHFAB) for perceptual quality enhancement.

Result: Significantly reduces memory usage, parameters, and FLOPs while improving both pixel-wise accuracy and visual quality in super-resolution tasks.

Conclusion: The flexible IBMDB design can be integrated into various SISR frameworks and provides an effective lightweight solution for super-resolution on resource-constrained devices.

Abstract: Single-image super-resolution (SISR) is a fundamental problem in computer vision that aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs. Although convolutional neural networks (CNNs) have achieved substantial advancements, deeper architectures often introduce excessive parameters, higher memory usage, and computational cost, limiting their applicability on resource-constrained devices. Recent research has thus focused on lightweight architectures that preserve accuracy while reducing complexity. This paper presents the Involution and BSConv Multi-Depth Distillation Network (IBMDN), a lightweight and effective architecture for SISR. The proposed IBMDN comprises Involution and BSConv Multi-Depth Distillation Blocks (IBMDB) and a Contrast and High-Frequency Attention Block (CHFAB). IBMDB employs varying combinations of Involution and BSConv at multiple depths to perform efficient feature extraction while minimizing computational complexity. CHFAB, a lightweight self-attention mechanism, focuses on extracting high-frequency and contrast information to enhance perceptual quality in the reconstructed images. The flexible design of IBMDB enables it to be seamlessly integrated into diverse SISR frameworks, including information distillation, transformer-based, and GAN-based models. Extensive experiments demonstrate that incorporating IBMDB significantly reduces memory usage, parameters, and floating-point operations (FLOPs), while achieving improvements in both pixel-wise accuracy and visual quality. The source code is available at: https://github.com/akramkhatami/IBMDN.
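
Of the two operators IBMDB mixes, BSConv is simple to sketch: a pointwise convolution followed by a depthwise one, cutting parameters substantially relative to a dense convolution. Channel sizes below are illustrative:

```python
# BSConv-U building block: 1x1 pointwise conv, then k x k depthwise conv.
import torch
import torch.nn as nn

class BSConvU(nn.Module):
    """Blueprint-separable convolution (pointwise -> depthwise)."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2, groups=out_ch)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))

block = BSConvU(64, 64)
print(block(torch.randn(1, 64, 48, 48)).shape)  # torch.Size([1, 64, 48, 48])
# Parameter count vs. a dense 3x3 conv at 64 channels: ~4.8k vs ~36.9k.
```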

[408] A multi-task neural network for atypical mitosis recognition under domain shift

Gennaro Percannella, Mattia Sarno, Francesco Tortorella, Mario Vento

Main category: eess.IV

TL;DR: Multi-task learning approach for domain generalization in atypical mitosis detection, showing promising results on multiple histopathology datasets.

Motivation: Machine learning models for atypical mitotic figure recognition suffer performance drops under domain shift, requiring robust domain generalization methods.

Method: Multi-task learning with auxiliary tasks correlated to main classification, helping models focus on target objects while ignoring domain-varying backgrounds.

Result: Promising performance in preliminary evaluation on three datasets: MIDOG 2025 Atypical Training Set, Ami-Br dataset, and MIDOG25 challenge test set.

Conclusion: The proposed multi-task learning approach effectively addresses domain shift issues in mitosis detection and shows potential for robust performance across different histopathology domains.

Abstract: Recognizing atypical mitotic figures in histopathology images allows physicians to correctly assess tumor aggressiveness. Although machine learning models could be exploited for automatically performing such a task, under domain shift these models suffer from significant performance drops. In this work, an approach based on multi-task learning is proposed for addressing this problem. By exploiting auxiliary tasks, correlated to the main classification task, the proposed approach, submitted to track 2 of the MItosis DOmain Generalization (MIDOG) challenge, aims to help the model focus only on the object to classify, ignoring the domain-varying background of the image. The proposed approach shows promising performance in a preliminary evaluation conducted on three distinct datasets, i.e., the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, as well as the preliminary test set of the MIDOG25 challenge.
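
A minimal sketch of the multi-task setup: a shared backbone with the main atypical-vs-typical head plus an auxiliary head trained jointly. The auxiliary task and loss weight below are assumptions for illustration, not the exact challenge submission:

```python
# Shared backbone + main and auxiliary heads, trained with a combined loss.
import torch
import torch.nn as nn

class MultiTaskMitosis(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.main_head = nn.Linear(32, 2)  # typical vs. atypical mitosis
        self.aux_head = nn.Linear(32, 2)   # auxiliary task (hypothetical)

    def forward(self, x):
        z = self.backbone(x)
        return self.main_head(z), self.aux_head(z)

model = MultiTaskMitosis()
main_logits, aux_logits = model(torch.randn(4, 3, 64, 64))
criterion = nn.CrossEntropyLoss()
y_main, y_aux = torch.randint(0, 2, (4,)), torch.randint(0, 2, (4,))
loss = criterion(main_logits, y_main) + 0.5 * criterion(aux_logits, y_aux)
print(loss.item())
```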

Last updated: 2025-09-15
Built with Hugo, theme based on Stack